/aiman.tech
Research log · structural plasticity

Notes on the dendritic unit

Published · 9 min
Dendritic Unit · Structural plasticity · Competition · MLP

An aside from the rest of the lab. I wanted to see what happens when you take the MLP's nonlinearity off the soma and put it inside the unit, where dendrites compete and coactivate. Open notes on what I tried and where it ends up surprising me.

This one is not part of any larger plan. It started as a daydream: what if you took the activation function out of the MLP and let the unit itself be nonlinear, the way a biological neuron is, by virtue of how its dendrites compete and combine? I had no idea where it would go. I still mostly don't. But it has produced enough surprises that I keep coming back to it.

The thing I am playing with is structural plasticity inside a single unit. Specifically: a learned competition between dendrite-style branches, plus a soft coactivation rule that lets several branches contribute at once when their evidence agrees. There is no ReLU, SiLU or GeLU in the block. The nonlinearity emerges from the selection rule.

The thing it replaces

A standard MLP block in a transformer is, schematically:

$$y = W_2 \, \sigma(W_1 x + b_1) + b_2$$
Standard MLP. The single nonlinearity sigma is bolted between two linear maps. ReLU/SiLU/GeLU are particular choices of sigma; the unit itself doesn't know about it.

A dendritic unit instead defines K branches, each its own affine map of the input. Then it picks among them, with a competition rule that lets either a single winner dominate, or a few branches coactivate, depending on how clearly the input separates them.

$$\text{branch}_i(x) = W_i x + b_i, \quad i = 1, \dots, K$$
Each branch is a small affine. K is small, typically 4-8 in the experiments I have run.
$$\alpha_i(x) = \frac{\exp\!\left(\beta \cdot s_i(x)\right)}{\sum_{j=1}^{K} \exp\!\left(\beta \cdot s_j(x)\right)}$$
Selection. Branch i's share alpha_i is a softmax over an evidence score s_i(x). Beta acts as an inverse temperature: it moves the rule between hard winner-take-all (large beta) and full coactivation (small beta), where every branch gets a near-equal share.
$$y = \sum_{i=1}^{K} \alpha_i(x) \cdot \text{branch}_i(x)$$
Output. A weighted sum where the weights are themselves a function of the input - so the output is genuinely nonlinear in x even though every branch is linear.
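The three equations above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code from the repository: the class name `DendriticUnit`, the helper `affine_gap`, and in particular the score function are hypothetical. The notes do not define s_i(x), so the sketch assumes a separate learned linear score s_i(x) = u_i · x + c_i per branch.

```python
import math
import random


def softmax(scores, beta=1.0):
    """Softmax over beta-scaled scores, shifted by the max for numerical stability."""
    scaled = [beta * s for s in scores]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))


class DendriticUnit:
    """K affine branches mixed by a softmax over per-branch evidence scores."""

    def __init__(self, d_in, d_out, k=4, beta=1.0, seed=0):
        rng = random.Random(seed)

        def rand(n):
            return [rng.gauss(0.0, 1.0) for _ in range(n)]

        # One (d_out x d_in) matrix and one bias vector per branch.
        self.W = [[rand(d_in) for _ in range(d_out)] for _ in range(k)]
        self.b = [rand(d_out) for _ in range(k)]
        # ASSUMPTION: s_i(x) = u_i . x + c_i. The notes leave s_i unspecified.
        self.u = [rand(d_in) for _ in range(k)]
        self.c = rand(k)
        self.beta = beta

    def branches(self, x):
        """branch_i(x) = W_i x + b_i for every i."""
        return [[dot(row, x) + bj for row, bj in zip(W, b)]
                for W, b in zip(self.W, self.b)]

    def shares(self, x):
        """alpha(x) = softmax(beta * s(x))."""
        scores = [dot(u, x) + c for u, c in zip(self.u, self.c)]
        return softmax(scores, self.beta)

    def __call__(self, x):
        """y = sum_i alpha_i(x) * branch_i(x)."""
        alpha = self.shares(x)
        outs = self.branches(x)
        return [sum(a * out[j] for a, out in zip(alpha, outs))
                for j in range(len(outs[0]))]


def affine_gap(f, a, b):
    """Max |f(a+b) - f(a) - f(b) + f(0)|; this is zero for any affine map f."""
    s = [p + q for p, q in zip(a, b)]
    zero = [0.0] * len(a)
    return max(abs(fs - fa - fb + f0)
               for fs, fa, fb, f0 in zip(f(s), f(a), f(b), f(zero)))


unit = DendriticUnit(d_in=3, d_out=2, k=4, beta=2.0)
x = [0.5, -1.0, 2.0]
alpha, y = unit.shares(x), unit(x)
print("shares:", [round(a, 3) for a in alpha])
print("output:", [round(v, 3) for v in y])

a, b = [1.0, 2.0, -1.0], [0.5, 0.0, 3.0]
flat = DendriticUnit(d_in=3, d_out=2, k=4, beta=0.0)  # uniform shares
print("affine gap, beta=2:", affine_gap(unit, a, b))
print("affine gap, beta=0:", affine_gap(flat, a, b))
```

With beta = 0 every branch gets share 1/K regardless of the input, so the unit collapses to the average of its branches and the affine gap is zero up to rounding. Any gap at beta > 0 comes entirely from the input-dependent shares, which is the sense in which the nonlinearity lives in the selection rule.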
[ Figure: Dendritic vs MLP. Standard MLP block: x → W₁x + b₁ → σ(·) (ReLU/SiLU/GeLU) → W₂(·) + b₂ → y; nonlinearity bolted on, one function for all inputs. Dendritic unit: x → branches 1-4, each Wᵢx + bᵢ → α = softmax(β · s(x)), learned competition → Σ αᵢ · branchᵢ → y; nonlinearity in the selection rule, different branch for different inputs. ]
Fig 01 - Where the nonlinearity lives. Standard MLP applies one fixed activation between two linears; the dendritic unit makes the selection rule itself the source of nonlinearity.
[ Figure: Dendritic Unit, structural, activation-free. Legend: soma (output), dendrite branch, synapse (input), active branch. ]
Fig 02 - One unit, mid-step. Input arrives at the synapse tips, branches compute their affine, compete via the selection rule, and the active branch (gold) is the one whose pulse reaches the soma at a given step.

Where the structural plasticity comes in

The selection rule is differentiable. Beta itself can be learned per layer, per unit, even per channel. In small runs, what I see is that branches do not all get used equally during training. Some specialize: they end up the only ones with non-trivial weight on a particular subset of inputs. Others fade. A handful end up doing the bulk of the work for clean clusters of inputs, and the remaining ones pick up edge cases.

That is what I mean by structural plasticity. The unit's effective shape - how many branches it really uses, how sharp the selection is - is something it learns, not something I picked. Compare that to a ReLU block, which always has one nonlinearity, applied identically everywhere, with no internal structure to specialize.
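One way to put numbers on "how many branches it really uses" and "how sharp the selection is" is to look at the shares alpha over a batch of inputs. The sketch below is an assumption about how one might measure this, not something taken from the experiments: it uses exp(entropy) as an effective count, once on the batch-averaged shares (how many branches are in use overall) and once per input (how many branches coactivate on a typical input). The function names are made up for illustration.

```python
import math


def perplexity(p):
    """exp(entropy) of a distribution: 1 for one-hot, K for uniform over K."""
    return math.exp(-sum(q * math.log(q) for q in p if q > 0.0))


def branch_usage_stats(alpha_batch):
    """Return (effective branches in use overall, mean coactivation per input).

    alpha_batch is a list of share vectors, one per input, each summing to 1.
    """
    n, k = len(alpha_batch), len(alpha_batch[0])
    mean_usage = [sum(a[i] for a in alpha_batch) / n for i in range(k)]
    used_overall = perplexity(mean_usage)
    coactive_per_input = sum(perplexity(a) for a in alpha_batch) / n
    return used_overall, coactive_per_input


# Hard selection, but two different winners across the batch:
# roughly 2 branches in use overall, roughly 1 active per input.
hard = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(branch_usage_stats(hard))

# Full coactivation: all 4 branches in use and all 4 active on each input.
soft = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
print(branch_usage_stats(soft))
```

The two numbers separate cases that look the same from the outside: a unit can use many branches overall while still selecting sharply on each input, which is the specialization pattern described above.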

ReLU is one way of being nonlinear. Selection between competing affine branches is another. The interesting question is whether the second one carries more useful inductive bias than the first.
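A small worked case makes the relationship between the two concrete. It is an illustration under assumptions, not a description of the experiments: take K = 2 and a scalar pre-activation z = w·x + b, let branch 1 output z and branch 2 output 0, and assume the evidence scores are the branch outputs themselves, s_1 = z and s_2 = 0 (the notes do not fix s_i).

$$
\alpha_1 = \frac{e^{\beta z}}{e^{\beta z} + e^{0}} = \frac{1}{1 + e^{-\beta z}}, \qquad
y = \alpha_1 \cdot z + \alpha_2 \cdot 0 = \frac{z}{1 + e^{-\beta z}}
$$

At beta = 1 this is exactly SiLU, and as beta grows it approaches max(z, 0), which is ReLU. Under these assumptions the familiar activations are one corner of the selection family, reached by tying the score to the branch output and fixing one branch at zero.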

What surprised me

  • Coactivation matters more than I expected. When I forced hard winner-take-all (large beta from the start), training was wobbly. With a learnable beta that starts small and sharpens slowly, training is stable and the unit actually learns to specialize gradually.
  • Branches die unevenly. Without any explicit balance loss, you tend to end up with some branches doing all the work and others permanently silent. This is not always bad - it looks like learned capacity allocation - but it can lock in suboptimal configurations early.
  • Drop-in compatibility is real. You can swap an MLP block for a dendritic block in a small transformer without changing anything else, and training proceeds. Whether it actually wins on a real benchmark is a different and much harder question.
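The contrast in the first bullet, between a small beta and a large one, can be seen directly on a fixed score vector. This is only a sketch: the scores are made-up numbers for K = 4 branches, and beta is swept by hand rather than learned.

```python
import math


def softmax(scores, beta):
    """Softmax over beta-scaled scores, shifted by the max for numerical stability."""
    scaled = [beta * s for s in scores]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up evidence scores: two branches roughly agree, two clearly lose.
scores = [1.0, 0.8, 0.1, -0.5]

shares_by_beta = {beta: softmax(scores, beta) for beta in (0.1, 1.0, 10.0, 100.0)}
for beta, alpha in shares_by_beta.items():
    print(f"beta={beta:>5}: " + " ".join(f"{a:.3f}" for a in alpha))
```

At beta = 0.1 the shares are close to uniform, at beta = 10 the two agreeing branches still split the output between them, and at beta = 100 the top branch takes essentially everything. The sensitivity of the shares to the scores is d alpha_i / d s_j = beta · alpha_i · (delta_ij - alpha_j), which is close to zero once the shares are near one-hot and large near ties. That may be part of why starting with a large beta trains badly, though the notes do not establish the cause.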

What I have not figured out

How to read the geometry of what the branches end up representing. There is something appealing about the idea that branches naturally factor the input space, but I do not have a clean way to look inside and confirm or refute it. That is the next thing I want to chase.

Whether any of this matters at scale. Small experiments are encouraging, but I have not run anything where I would bet money on the result. ML history is full of cute primitives that look great on toy tasks and then fail to compound into anything real. I am keeping my expectations honest.

Code is at github.com/mflRevan/dendritic_unit. It is research-grade and will stay that way for a while. If any of this resonates with something you are working on, I would love to talk.