Notes on the dendritic unit
An aside from the rest of the lab. I wanted to see what happens when you take the MLP's nonlinearity off the soma and put it inside the unit, where dendrites compete and coactivate. Open notes on what I tried and where it ends up surprising me.
This one is not part of any larger plan. It started as a daydream - what if you took the activation function out of the MLP and let the unit itself be nonlinear, the way a biological neuron is nonlinear, by virtue of how its dendrites compete and combine. I had no idea where it would go. I still mostly don't. But it has produced enough surprises that I keep coming back to it.
The thing I am playing with is structural plasticity inside a single unit. Specifically: a learned competition between dendrite-style branches, plus a soft coactivation rule that lets several branches contribute at once when their evidence agrees. There is no separate activation function anywhere in the block. The nonlinearity emerges from the selection rule.
The thing it replaces
A standard MLP block in a transformer is, schematically:
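A minimal NumPy sketch of such a block, assuming a GELU activation, a 4x hidden expansion, and no residual connection or normalization. The names `mlp_block`, `W_in`, and `W_out` are illustrative and not taken from the repository.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; any pointwise activation would play the same role
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, W_in, b_in, W_out, b_out):
    """x: (batch, d_model) -> (batch, d_model)."""
    h = x @ W_in + b_in          # up-projection to the hidden width
    h = gelu(h)                  # the single pointwise nonlinearity
    return h @ W_out + b_out     # down-projection back to d_model

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32        # assumed sizes, chosen only for the demo
x = rng.normal(size=(4, d_model))
W_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
b_in = np.zeros(d_hidden)
W_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_out = np.zeros(d_model)

y = mlp_block(x, W_in, b_in, W_out, b_out)
print(y.shape)
```

Everything nonlinear here happens in the one `gelu` call, applied elementwise and identically to every hidden unit.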
A dendritic unit instead defines several branches, each its own affine map of the input. It then selects among them with a competition rule that lets either a single winner dominate or a few branches coactivate, depending on how clearly the input separates them.
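One way to make that concrete, as a sketch only: I am assuming here that the competition is a softmax over the branch outputs, scaled by a sharpness parameter `beta`, and that the unit returns the softmax-weighted mix. The actual rule in the repository may differ; `dendritic_unit`, `W`, `b`, and the shapes are hypothetical names for illustration.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def dendritic_unit(x, W, b, beta):
    """x: (batch, d_in); W: (K, d_in, d_out); b: (K, d_out); beta: scalar.

    ASSUMPTION: selection is a softmax over the K branch outputs, per output unit.
    """
    z = np.einsum("bi,kio->bko", x, W) + b    # (batch, K, d_out): one affine map per branch
    w = softmax(beta * z, axis=1)             # competition across branches
    return (w * z).sum(axis=1), w             # weighted mix of branches, plus the weights

rng = np.random.default_rng(0)
K, d_in, d_out = 4, 8, 3                      # assumed sizes for the demo
x = rng.normal(size=(5, d_in))
W = rng.normal(size=(K, d_in, d_out))
b = rng.normal(size=(K, d_out))

z = np.einsum("bi,kio->bko", x, W) + b
y_soft, w_soft = dendritic_unit(x, W, b, beta=0.0)   # uniform weights: average of the branches
y_hard, w_hard = dendritic_unit(x, W, b, beta=1e3)   # near one-hot: close to the max over branches
print(np.abs(y_soft - z.mean(axis=1)).max(), np.abs(y_hard - z.max(axis=1)).max())
```

Under this assumed rule the two extremes are easy to read. At `beta = 0` every branch gets equal weight and the unit is just an average of affine maps, so it is itself affine. As `beta` grows the unit approaches a max over branches, which is the same form as a maxout unit. The interesting regime is in between, where a few branches with similar outputs share the weight.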
Where the structural plasticity comes in
The selection rule is differentiable. Beta, the parameter that sets how sharp the competition is, can itself be learned per layer, per unit, or even per channel. What I have been seeing in small runs is that during training, branches do not all get used equally. Some specialize - they end up the only ones with non-trivial weight on a particular subset of inputs. Others fade. A handful end up doing the bulk of the work for clean clusters of inputs, and the remaining ones pick up edge cases.
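To illustrate what a learnable beta sees during training, here is a small check under an assumed rule (softmax over branch outputs, output is the weighted mix; this is my assumption, not necessarily the repository's rule). Under that assumption the derivative of the output with respect to beta works out to the selection-weighted variance of the branch outputs, so beta only receives gradient where branches disagree. The values in `z` are made up.

```python
import numpy as np

def select(z, beta):
    """Assumed rule: y = sum_k softmax(beta * z)_k * z_k for one unit with K branches."""
    a = beta * z
    w = np.exp(a - a.max())
    w /= w.sum()
    return float(w @ z), w

z = np.array([0.3, -1.2, 0.9, 0.8])    # hypothetical branch outputs for one input
beta = 1.5

y, w = select(z, beta)
analytic = float(w @ z**2 - (w @ z) ** 2)     # weighted variance of z under w

eps = 1e-5                                     # central finite difference in beta
numeric = (select(z, beta + eps)[0] - select(z, beta - eps)[0]) / (2 * eps)
print(analytic, numeric)
```

The two numbers should agree to within finite-difference error. Since a variance is never negative, raising beta can only raise this unit's output for a fixed input, and the push is strongest where the branches are most spread out.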
That is what I mean by structural plasticity. The unit's effective shape - how many branches it really uses, how sharp the selection is - is something it learns, not something I picked. Compare that to a standard MLP block, which always has one nonlinearity, applied identically everywhere, with no internal structure to specialize.
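A simple way to put a number on "how many branches it really uses" is the exponential of the entropy of the selection weights. This is a generic diagnostic I am suggesting, not something the repository is known to compute; `effective_branches` is a hypothetical helper and it assumes the weights are non-negative and sum to one.

```python
import numpy as np

def effective_branches(w, axis=-1, eps=1e-12):
    """exp(entropy) of selection weights: 1 for a single winner, K for a uniform mix."""
    h = -(w * np.log(w + eps)).sum(axis=axis)   # eps guards log(0) for silent branches
    return np.exp(h)

uniform = np.full(4, 0.25)                  # all four branches share the weight
one_hot = np.array([1.0, 0.0, 0.0, 0.0])    # hard winner-take-all
mixed = np.array([0.5, 0.5, 0.0, 0.0])      # two branches coactivate, two are silent

for name, w in [("uniform", uniform), ("one_hot", one_hot), ("mixed", mixed)]:
    print(name, float(effective_branches(w)))
```

Averaged over a batch, this gives one scalar per unit that can be tracked during training to see whether selection is sharpening or whether branches are dropping out.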
ReLU is one way of being nonlinear. Selection between competing affine branches is another. The interesting question is whether the second one carries more useful inductive bias than the first.
What surprised me
- Coactivation matters more than I expected. When I forced hard winner-take-all (large beta from the start), training was wobbly. With a learnable beta that starts small and sharpens slowly, training is stable and the unit actually learns to specialize gradually.
- Branches die unevenly. Without any explicit balance loss, you tend to end up with some branches doing all the work and others permanently silent. This is not always bad - it looks like learned capacity allocation - but it can lock in suboptimal configurations early.
- Drop-in compatibility is real. You can swap an MLP block for a dendritic block in a small transformer without changing anything else, and training proceeds. Whether it actually wins on a real benchmark is a different and much harder question.
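On the uneven branch death noted in the list above: the runs described here use no balance loss, so the following is purely a hypothetical sketch of what one could look like, similar in spirit to the load-balancing terms used for mixture-of-experts routing. It penalizes the KL divergence between average branch usage over a batch and the uniform distribution. The function name and the `(batch, K)` layout are assumptions.

```python
import numpy as np

def balance_loss(w, eps=1e-12):
    """KL(mean branch usage || uniform). w: (batch, K) selection weights, rows sum to 1."""
    usage = w.mean(axis=0)                       # average share each branch receives
    K = usage.shape[0]
    return float((usage * (np.log(usage + eps) + np.log(K))).sum())

balanced = np.eye(4)                                 # each input picks a different branch
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))    # every input picks branch 0

print(balance_loss(balanced))    # close to 0
print(balance_loss(collapsed))   # close to log(4)
```

Because it looks only at batch-averaged usage, a term like this leaves sharp per-input selection alone and only pushes against branches that are silent across the whole batch. Whether that helps or just undoes useful capacity allocation is an open question.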
What I have not figured out
- How to read the geometry of what the branches end up representing. There is something appealing about the idea that branches naturally factor the input space, but I do not have a clean way to look inside and confirm or refute it. That is the next thing I want to chase.
- Whether any of this matters at scale. Small experiments are encouraging, but I have not run anything where I would bet money on the result. ML history is full of cute primitives that look great on toy tasks and then fail to compound into anything real. I am keeping my expectations honest.
Code is at github.com/mflRevan/dendritic_unit. It is research-grade and will stay that way for a while. If any of this resonates with something you are working on, I would love to talk.