A Sparse Recursive Routing Architecture for Adaptive Neural Computation

A Sparse Recursive Routing Architecture for Adaptive Neural Computation

Abstract

This work presents a neural architecture based on recursively expanding latent structure with hierarchical conditional routing rather than dense uniform computation. Unlike conventional transformer systems in which all parameters participate in every forward pass, this architecture organizes computation as a branching hierarchical graph in which nodes selectively route information toward specialized substructures.

The model combines:

- sparse connectivity
- recursive state accumulation
- conditional execution
- learned routing
- dynamic specialization
- structural locality

The resulting system behaves as a hybrid between:

- neural networks
- recursive computational trees
- mixture-of-experts systems
- sparse attention mechanisms
- and hierarchical memory structures

The architecture differs fundamentally from dense transformer computation in that only relevant branches participate in inference. Computation therefore scales according to activated paths rather than total parameter count.


1. Introduction

Most modern neural architectures operate through dense global computation. Transformers evaluate all layers and most parameters for every token regardless of contextual relevance.

This creates several limitations:

- quadratic attention cost
- dense parameter activation
- inefficient memory utilization
- limited structural specialization
- lack of persistent computational locality

The architecture presented here explores an alternative principle:

computation as sparse recursive traversal through specialized computational subgraphs.

Instead of treating the network as a uniform differentiable tensor stack, the model is organized as a routing hierarchy in which:

- nodes specialize over time
- branches encode computational subdomains
- routing determines active execution paths
- inactive branches remain dormant

The resulting structure resembles:

- biological cortical specialization
- hierarchical memory systems
- conditional computation
- and adaptive execution graphs


2. Core Architectural Principles

2.1 Sparse Connectivity

Each node receives only a sparse subset of the global input state rather than consuming all available features.

This creates:

- feature locality
- computational specialization
- reduced parameter interaction
- and implicit modularity


struct Connection
{
    int inputIndex = 0;
    float weight = 0.0f;
};

Rather than fully connected dense matrices, nodes maintain selective projections into the evolving state space.


2.2 Recursive State Expansion

Nodes not only process inputs but also contribute new latent representations back into the global computational state.


float h = m_nodes[i]->Forward(state);

hidden.push_back(h);

state.push_back(h);

This transforms computation into a recursive representational process in which:

- earlier nodes shape future computation
- latent abstractions accumulate dynamically
- and deeper nodes operate on progressively richer representations

The mechanism resembles recurrent computation despite operating within a feedforward traversal.


2.3 Learned Routing

Each node contains routing weights used to determine which downstream branches should receive further computation.


int Route() const
{
    int bestIndex = 0;
    float bestScore = -1e9f;

    for (int i = 0; i < (int)routerWeights.size(); i++)
    {
        float score = activation * routerWeights[i];

        if (score > bestScore)
        {
            bestScore = score;
            bestIndex = i;
        }
    }

    return bestIndex;
}

Routing transforms the network from a static computation graph into an input-conditioned execution graph.

Different inputs activate different computational paths.


3. Node Structure

Each node functions simultaneously as:

- a feature detector
- a latent transformer
- a routing controller
- and a recursive computational unit


struct Node
{
    vector<Connection> connections;

    vector<float> routerWeights;

    vector<unique_ptr<Node>> children;

    float bias = 0.0f;

    float activation = 0.0f;

    float relevance = 0.0f;

    vector<float> lastInput;
};

This unifies:

- representation learning
- conditional execution
- and structural specialization

within the same primitive computational element.


4. Sparse Hierarchical Computation

The architecture naturally forms a tree-like execution structure.

Example:


                 Root
              /    |    \
             A     B     C
           / | \       / | \
          D  E  F     G  H  I

Inference activates only relevant branches.

This creates:

- sparse execution
- conditional parameter activation
- hierarchical specialization
- logarithmic traversal potential

Unlike transformers, the full parameter space is not evaluated for every token.


5. Relationship to Attention

The architecture shares important similarities with attention mechanisms.

Each node effectively performs:

- feature selection
- contextual routing
- dynamic weighting
- and latent refinement

However, routing is structural rather than purely algebraic.

Instead of attending across a dense token-token matrix, the system traverses computational branches.

Attention becomes:

recursive computational routing.


6. Positional Representation

Unlike transformers, the architecture is not permutation invariant.

Sequence order already exists implicitly through:

- ordered embedding concatenation
- sparse indexed connectivity
- and recursive state accumulation


for (int token : context)
{
    for (float v : m_embeddings[token])
    {
        input.push_back(v);
    }
}

Nodes can therefore specialize to:

- recent tokens
- distant tokens
- or specific positional regions

without explicit positional embeddings.

Nevertheless, learned positional representations may still improve:

- generalization
- long-range structure learning
- and relative sequence abstraction


7. Conditional Computation and Scalability

One of the most important properties of the architecture is conditional activation.

Only relevant branches execute for a given input.

This creates the potential for sublinear active parameter utilization relative to total model capacity.

Rather than executing all parameters uniformly, computation may scale according to activated computational paths and routing depth.

This resembles modern sparse mixture-of-experts systems but introduces persistent structural specialization through recursive branching.


8. Distributed Computational Memory

The architecture naturally supports externalized computational memory.

Branches can theoretically be:

- serialized
- indexed
- dynamically loaded
- or distributed across storage systems

Inference becomes:

input → routing → branch retrieval → computation

rather than execution over a monolithic tensor model.

This transforms the network into a form of:

hierarchical computational memory.


9. Parallelization Properties

The model is sequential along individual computational paths but parallel across branches.

Example:


root
├── branch A
├── branch B
└── branch C

Subtrees may execute simultaneously.

This creates:

- sparse parallel execution
- conditional workload distribution
- and scalable branch specialization

The architecture therefore differs from transformer parallelism while still supporting large-scale distributed execution.


10. Learning Dynamics

Learning occurs through backpropagation over:

- sparse node connections
- routing weights
- embeddings
- and output projections

Unlike fixed dense architectures, however, specialization emerges structurally through routing statistics and relevance accumulation.


relevance += std::abs(activation);

Inactive branches naturally become candidates for pruning.


11. Pruning and Structural Compression

Nodes maintain relevance scores accumulated during training.


void Prune(float threshold)
{
    for (auto& node : m_nodes)
    {
        if (node->relevance < threshold)
        {
            for (auto& c : node->connections)
            {
                c.weight = 0.0f;
            }
        }
    }
}

This enables:

- automatic sparsification
- structural compression
- dormant branch elimination
- and adaptive capacity allocation


12. Comparison with Transformers

Transformers:

- dense global attention
- uniform layer execution
- full parameter activation
- quadratic token interactions

This architecture:

- sparse recursive routing
- conditional branch execution
- hierarchical specialization
- dynamic computational locality

The architecture is therefore not merely a modified transformer but a fundamentally different computational paradigm.


13. Emergent Specialization

As training progresses, branches may evolve into domain-specific computational regions.

Examples:

- arithmetic branches
- syntactic branches
- semantic abstraction branches
- long-range memory branches
- rhythmic or stylistic branches

Specialization emerges through repeated routing pressure and gradient optimization.


14. Future Directions

Several extensions appear promising:

- top-k routing instead of single-path routing
- differentiable routing softmax
- dynamic branch growth
- external branch storage
- hierarchical retrieval systems
- multimodal routing specialization
- recursive memory compression

The architecture may eventually support:

- distributed computational graphs
- modular persistent memory
- and adaptive large-scale sparse cognition


15. Conclusion

This work presents a sparse recursive routing architecture in which computation emerges through conditional traversal over specialized computational branches.

Unlike dense transformer systems, the model activates only relevant substructures for a given input, enabling:

- sparse computation
- hierarchical specialization
- recursive latent refinement
- and adaptive computational locality

The architecture combines principles from:

- neural networks
- recursive systems
- sparse mixture-of-experts
- and hierarchical memory structures

while introducing a distinct computational framework based on:

recursive sparse routing over evolving computational subgraphs.

This suggests an alternative scaling direction for machine intelligence:

not larger dense tensors, but increasingly specialized sparse computational ecosystems.


Appendix

Step-by-Step Execution (How the Model Actually Runs)

To understand the architecture concretely, it is useful to describe exactly how a single training or inference pass executes from start to finish.


16.1 Inference Pass (Forward Computation)

Given an input token sequence, the model performs the following steps:

Step 1 — Token Embedding
Each token is converted into a continuous vector representation. These vectors are concatenated into a single initial state vector.


state = concat(embedding(token₁), embedding(token₂), ...)

At this stage, the model has no predictions yet — only a feature state.


Step 2 — Initialize Execution at Root Node
Computation begins at a single entry point:


active[0] = true;

Only node 0 is initially evaluated.


Step 3 — Node Evaluation
Each active node performs:

- reads the full current state vector
- computes a scalar activation using sparse connections
- stores its output into the hidden representation


h = tanh(bias + Σ(input[i] * weight[i]))
hidden[i] = h;
state.push_back(h);

At this point, the model is expanding its own feature space dynamically.


Step 4 — Routing Decision
Each node selects a continuation path:


route = argmax(activation * routerWeights)

This determines which downstream node becomes active next.

Execution is therefore not fully parallel — it is a controlled expansion of active paths.


Step 5 — Sparse Branch Expansion
Based on routing, one or more child indices are activated:


active[childIndex] = true;

This creates a small, input-dependent execution subtree rather than a full network traversal.


Step 6 — Hidden State Assembly
After all active nodes have executed, their outputs form the hidden vector:


hidden = [h₀, h₁, h₂, ... hₙ]

This vector represents the model’s final internal interpretation of the input.


Step 7 — Logit Computation (Output Layer)
The hidden state is projected into vocabulary space:


logits[token] = bias[token] + Σ(hidden[i] * weight[token][i])

These logits are not probabilities — they are raw scores for each possible next token.


Step 8 — Sampling or Argmax
The model converts logits into probabilities and selects the next token either deterministically or stochastically.


16.2 Training Pass (Learning Dynamics)

Training follows the same forward path, followed by gradient propagation.


Step 1 — Forward Pass
The model computes logits exactly as in inference.


Step 2 — Loss Computation
A cross-entropy loss is computed against the target token:


loss = -log(prob[target])

Step 3 — Output Layer Update
Gradients flow into:

- output weights
- output bias
- hidden state contribution


Step 4 — Hidden Layer Backpropagation
Each node receives a gradient signal and updates its parameters locally:


delta = grad * tanh'(activation)
weight -= learningRate * delta * input
bias   -= learningRate * delta

Step 5 — Structural Credit Assignment
Nodes that were active accumulate relevance:


relevance += abs(activation)

This creates long-term structural specialization over training time.


16.3 Key Insight

The entire model can be understood as:

1. A dynamic feature expansion process (state growth)
2. A sparse routing system (execution path selection)
3. A final linear classifier (logit projection)

What makes the architecture unusual is that computation is not uniform — it is conditionally constructed at runtime through recursive activation and routing.

Source Code

An early prototype implementing sparse recursive latent expansion with preliminary routing structures.


#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <numeric>
#include <random>
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using namespace std;

static std::mt19937 rng((std::random_device())());

float RandFloat(float minValue, float maxValue)
{
    std::uniform_real_distribution<float> dist(minValue, maxValue);
    return dist(rng);
}

int RandInt(int minValue, int maxValue)
{
    std::uniform_int_distribution<int> dist(minValue, maxValue);
    return dist(rng);
}

float Tanh(float x)
{
    return std::tanh(x);
}

float TanhDerivative(float y)
{
    return 1.0f - y * y;
}

float SoftmaxSample(const vector<float>& logits)
{
    float maxLogit = *max_element(logits.begin(), logits.end());

    vector<float> probs(logits.size());

    float sum = 0.0f;

    for (size_t i = 0; i < logits.size(); i++)
    {
        probs[i] = std::exp(logits[i] - maxLogit);
        sum += probs[i];
    }

    if (sum <= 0.0f || std::isnan(sum))
    {
        return 0.0f;
    }

    for (float& p : probs)
    {
        p /= sum;
    }

    float r = RandFloat(0.0f, 1.0f);
    float c = 0.0f;

    for (size_t i = 0; i < probs.size(); i++)
    {
        c += probs[i];

        if (r <= c)
        {
            return (float)i;
        }
    }

    return (float)(probs.size() - 1);
}

vector<string> Tokenize(const string& text)
{
    vector<string> tokens;
    string token;

    std::stringstream ss(text);

    while (ss >> token)
    {
        tokens.push_back(token);
    }

    return tokens;
}

struct Vocabulary
{
    unordered_map<string, int> tokenToIndex;

    vector<string> indexToToken;

    int Add(const string& token)
    {
        auto it = tokenToIndex.find(token);

        if (it != tokenToIndex.end())
        {
            return it->second;
        }

        int idx = (int)indexToToken.size();

        tokenToIndex[token] = idx;

        indexToToken.push_back(token);

        return idx;
    }

    int Get(const string& token) const
    {
        auto it = tokenToIndex.find(token);

        if (it == tokenToIndex.end())
        {
            return -1;
        }

        return it->second;
    }
};

struct Sample
{
    vector<int> context;

    int target;
};

struct Connection
{
    int inputIndex = 0;

    float weight = 0.0f;
};

struct Node
{
    vector<Connection> connections;

    vector<float> routerWeights;

    vector<unique_ptr<Node>> children;

    float bias = 0.0f;

    float activation = 0.0f;

    float relevance = 0.0f;

    vector<float> lastInput;

    bool wasActive = false;

    Node(int inputCount, int childCount)
    {
        bias = RandFloat(-0.1f, 0.1f);

        int connectionCount =
            RandInt(2, std::max(2, inputCount / 2));

        set<int> used;

        while ((int)connections.size() < connectionCount)
        {
            int idx = RandInt(0, inputCount - 1);

            if (used.count(idx))
            {
                continue;
            }

            used.insert(idx);

            Connection c;

            c.inputIndex = idx;

            c.weight = RandFloat(-0.2f, 0.2f);

            connections.push_back(c);
        }

        routerWeights.resize(childCount);

        for (float& w : routerWeights)
        {
            w = RandFloat(-0.2f, 0.2f);
        }
    }

    float Forward(const vector<float>& input)
    {
        wasActive = true;

        lastInput = input;

        float sum = bias;

        for (const auto& c : connections)
        {
            if (c.inputIndex < (int)input.size())
            {
                sum += input[c.inputIndex] * c.weight;
            }
        }

        activation = Tanh(sum);

        relevance += std::abs(activation);

        return activation;
    }

    int Route() const
    {
        if (children.empty())
        {
            return -1;
        }

        int bestIndex = 0;

        float bestScore = -1e9f;

        for (int i = 0; i < (int)routerWeights.size(); i++)
        {
            float score =
                activation * routerWeights[i] +
                relevance * 0.0001f;

            if (score > bestScore)
            {
                bestScore = score;

                bestIndex = i;
            }
        }

        return bestIndex;
    }

    void Backward(float grad, float learningRate)
    {
        if (!wasActive)
        {
            return;
        }

        float delta = grad * TanhDerivative(activation);

        if (std::isnan(delta))
        {
            return;
        }

        for (auto& c : connections)
        {
            if (c.inputIndex >= (int)lastInput.size())
            {
                continue;
            }

            float inputValue = lastInput[c.inputIndex];

            c.weight -= learningRate * delta * inputValue;
        }

        bias -= learningRate * delta;
    }
};

struct Network
{
    int m_contextSize = 4;

    int m_embeddingSize = 16;

    int m_hiddenSize = 32;

    int m_vocabSize = 0;

    vector<vector<float>> m_embeddings;

    vector<unique_ptr<Node>> m_nodes;

    vector<vector<float>> m_outputWeights;

    vector<float> m_outputBias;

    Network(int vocabSize)
    {
        m_vocabSize = vocabSize;

        m_embeddings.resize(vocabSize);

        for (int i = 0; i < vocabSize; i++)
        {
            m_embeddings[i].resize(m_embeddingSize);

            for (float& v : m_embeddings[i])
            {
                v = RandFloat(-0.1f, 0.1f);
            }
        }

        int inputSize =
            m_contextSize * m_embeddingSize;

        for (int i = 0; i < m_hiddenSize; i++)
        {
            m_nodes.push_back(
                std::make_unique<Node>(
                    inputSize + i,
                    2
                )
            );
        }

        for (int i = 0; i < m_hiddenSize - 1; i++)
        {
            if (i + 1 < m_hiddenSize)
            {
                m_nodes[i]->children.push_back(
                    std::make_unique<Node>(
                        inputSize + i + 1,
                        0
                    )
                );
            }

            if (i + 2 < m_hiddenSize)
            {
                m_nodes[i]->children.push_back(
                    std::make_unique<Node>(
                        inputSize + i + 2,
                        0
                    )
                );
            }
        }

        m_outputWeights.resize(m_vocabSize);

        for (int i = 0; i < m_vocabSize; i++)
        {
            m_outputWeights[i].resize(m_hiddenSize);

            for (float& w : m_outputWeights[i])
            {
                w = RandFloat(-0.1f, 0.1f);
            }
        }

        m_outputBias.resize(m_vocabSize, 0.0f);
    }

    vector<float> BuildInput(const vector<int>& context)
    {
        vector<float> input;

        for (int token : context)
        {
            for (float v : m_embeddings[token])
            {
                input.push_back(v);
            }
        }

        return input;
    }

    vector<float> ForwardHidden(const vector<int>& context)
    {
        vector<float> state = BuildInput(context);

        vector<float> hidden(
            m_hiddenSize,
            0.0f
        );

        vector<bool> active(
            m_hiddenSize,
            false
        );

        for (auto& node : m_nodes)
        {
            node->wasActive = false;
        }

        active[0] = true;

        for (int i = 0; i < m_hiddenSize; i++)
        {
            if (!active[i])
            {
                continue;
            }

            float h = m_nodes[i]->Forward(state);

            hidden[i] = h;

            state.push_back(h);

            int route = m_nodes[i]->Route();

            if (
                route >= 0 &&
                route < (int)m_nodes[i]->children.size()
            )
            {
                int childIndex = -1;

                if (route == 0)
                {
                    childIndex = i + 1;
                }
                else if (route == 1)
                {
                    childIndex = i + 2;
                }

                if (
                    childIndex >= 0 &&
                    childIndex < m_hiddenSize
                )
                {
                    active[childIndex] = true;
                }
            }
        }

        return hidden;
    }

    vector<float> ForwardLogits(const vector<int>& context)
    {
        vector<float> hidden =
            ForwardHidden(context);

        vector<float> logits(
            m_vocabSize,
            0.0f
        );

        for (int token = 0; token < m_vocabSize; token++)
        {
            float sum = m_outputBias[token];

            for (int i = 0; i < m_hiddenSize; i++)
            {
                sum +=
                    hidden[i] *
                    m_outputWeights[token][i];
            }

            logits[token] = sum;
        }

        return logits;
    }

    vector<float> Softmax(const vector<float>& logits)
    {
        vector<float> probs(logits.size());

        float maxLogit =
            *max_element(
                logits.begin(),
                logits.end()
            );

        float sum = 0.0f;

        for (size_t i = 0; i < logits.size(); i++)
        {
            probs[i] =
                std::exp(logits[i] - maxLogit);

            sum += probs[i];
        }

        if (sum <= 0.0f || std::isnan(sum))
        {
            float uniform =
                1.0f / (float)probs.size();

            for (float& p : probs)
            {
                p = uniform;
            }

            return probs;
        }

        for (float& p : probs)
        {
            p /= sum;
        }

        return probs;
    }

    float TrainSample(
        const Sample& sample,
        float learningRate
    )
    {
        vector<float> hidden =
            ForwardHidden(sample.context);

        vector<float> logits(
            m_vocabSize,
            0.0f
        );

        for (int token = 0; token < m_vocabSize; token++)
        {
            float sum = m_outputBias[token];

            for (int i = 0; i < m_hiddenSize; i++)
            {
                sum +=
                    hidden[i] *
                    m_outputWeights[token][i];
            }

            logits[token] = sum;
        }

        vector<float> probs = Softmax(logits);

        float loss =
            -std::log(
                std::max(
                    1e-6f,
                    probs[sample.target]
                )
            );

        vector<float> outputGradients = probs;

        outputGradients[sample.target] -= 1.0f;

        vector<float> hiddenGradients(
            m_hiddenSize,
            0.0f
        );

        for (int token = 0; token < m_vocabSize; token++)
        {
            for (int i = 0; i < m_hiddenSize; i++)
            {
                hiddenGradients[i] +=
                    outputGradients[token] *
                    m_outputWeights[token][i];

                m_outputWeights[token][i] -=
                    learningRate *
                    outputGradients[token] *
                    hidden[i];
            }

            m_outputBias[token] -=
                learningRate *
                outputGradients[token];
        }

        for (int i = 0; i < m_hiddenSize; i++)
        {
            m_nodes[i]->Backward(
                hiddenGradients[i],
                learningRate
            );
        }

        return loss;
    }

    void Train(
        const vector<Sample>& samples,
        int epochs,
        float learningRate
    )
    {
        for (int epoch = 0; epoch < epochs; epoch++)
        {
            float loss = 0.0f;

            for (const auto& sample : samples)
            {
                loss +=
                    TrainSample(
                        sample,
                        learningRate
                    );
            }

            loss /= (float)samples.size();

            if (epoch % 50 == 0)
            {
                std::cout
                    << "Epoch "
                    << std::setw(4)
                    << epoch
                    << " Loss="
                    << std::fixed
                    << std::setprecision(5)
                    << loss
                    << "\n";
            }
        }
    }

    int Predict(const vector<int>& context)
    {
        vector<float> logits =
            ForwardLogits(context);

        return (int)(
            std::max_element(
                logits.begin(),
                logits.end()
            ) - logits.begin()
        );
    }

    void Prune(float threshold)
    {
        for (auto& node : m_nodes)
        {
            if (node->relevance < threshold)
            {
                for (auto& c : node->connections)
                {
                    c.weight = 0.0f;
                }
            }
        }
    }
};

int main()
{
    string poem =
        "the moonlight dances on the sea "
        "the silent wind remembers me "
        "the stars awaken one by one "
        "until the night and dawn are done "
        "the river dreams beneath the sky "
        "while distant birds and shadows fly";

    Vocabulary vocab;

    vector<string> tokens =
        Tokenize(poem);

    for (const auto& token : tokens)
    {
        vocab.Add(token);
    }

    vector<int> encoded;

    for (const auto& token : tokens)
    {
        encoded.push_back(
            vocab.Get(token)
        );
    }

    const int contextSize = 4;

    vector<Sample> samples;

    for (
        int i = 0;
        i < (int)encoded.size() - contextSize;
        i++
    )
    {
        Sample s;

        for (int j = 0; j < contextSize; j++)
        {
            s.context.push_back(
                encoded[i + j]
            );
        }

        s.target =
            encoded[i + contextSize];

        samples.push_back(s);
    }

    Network net(
        (int)vocab.indexToToken.size()
    );

    net.Train(
        samples,
        10000,
        0.01f
    );

    net.Prune(0.1f);

    vector<int> seed;

    seed.push_back(vocab.Get("the"));
    seed.push_back(vocab.Get("moonlight"));
    seed.push_back(vocab.Get("dances"));
    seed.push_back(vocab.Get("on"));

    std::cout
        << "\n=== GENERATION ===\n\n";

    std::cout << "Seed: ";

    for (int token : seed)
    {
        std::cout
            << vocab.indexToToken[token]
            << " ";
    }

    std::cout
        << "\n\nGenerated: ";

    for (int i = 0; i < 30; i++)
    {
        vector<float> logits =
            net.ForwardLogits(seed);

        vector<float> probs =
            net.Softmax(logits);

        int next =
            (int)SoftmaxSample(logits);

        std::cout
            << vocab.indexToToken[next]
            << " ";

        std::cout
            << "(p="
            << std::fixed
            << std::setprecision(2)
            << probs[next]
            << ") ";

        seed.erase(seed.begin());

        seed.push_back(next);
    }

    std::cout
        << "\n\n=== NODE RELEVANCE ===\n\n";

    for (size_t i = 0; i < net.m_nodes.size(); i++)
    {
        std::cout
            << "Node "
            << std::setw(2)
            << i
            << " Relevance="
            << std::fixed
            << std::setprecision(5)
            << net.m_nodes[i]->relevance
            << " Connections="
            << net.m_nodes[i]->connections.size()
            << " Active="
            << net.m_nodes[i]->wasActive
            << "\n";
    }

    std::cin.get();

    return 0;
}