{"slug": "neural-algorithmic-reasoning", "title": "Neural algorithmic reasoning", "summary": "Researchers are exploring how to replicate classical algorithms—such as shortest path-finding and sorting—within deep neural networks to address key weaknesses in modern AI, including lack of provable correctness, poor generalization, and black-box behavior. By teaching neural networks to execute these provably correct, interpretable, and compositional algorithms, the work aims to create AI systems that are more reliable and instructive for humans, potentially advancing toward general intelligence.", "body_md": "In this article, we will talk about classical computation: the kind of computation typically found in an undergraduate Computer Science course on Algorithms and Data Structures [1]. Think shortest path-finding, sorting, clever ways to break problems down into simpler problems, incredible ways to organise data for efficient retrieval and updates. Of course, given The Gradient’s focus on Artificial Intelligence, we will not stop there; we will also investigate how to capture such computation with deep neural networks.\nWhy capture classical computation?\nMaybe it’s worth starting by clarifying where my vested interest in classical computation comes from. Competitive programming—the art of solving problems by rapidly writing programs that need to terminate in a given amount of time, and within certain memory constraints—was a highly popular activity in my secondary school. For me, it was truly the gateway into Computer Science, and I trust the story is similar for many machine learning practitioners and researchers today. I have been able to win several medals at international programming competitions, such as the Northwestern Europe Regionals of the ACM-ICPC, the top-tier Computer Science competition for university students. Hopefully, my successes in competitive programming also give me some credentials to write about this topic.\nWhile this should make clear why I care about classical computation, why should we all care? To arrive at this answer, let us ponder some of the key properties that classical algorithms have:\n- They are provably correct, and we can often have strong guarantees about the resources (time or memory) required for the computation to terminate.\n- They offer strong generalisation: while algorithms are often devised by observing several small-scale example inputs, once implemented, they will work without fault on inputs that are significantly larger, or distributionally different than such examples.\n- By design, they are interpretable and compositional: their (pseudo)code representation makes it much easier to reason about what the computation is actually doing, and one can easily recompose various computations together through subroutines to achieve different capabilities.\nLooking at all of these properties taken together, they seem to be exactly the issues that plague modern deep neural networks the most: you can rarely guarantee their accuracy, they often collapse on out-of-distribution inputs, and they are very notorious as black boxes, with compounding errors that can hinder compositionality.\nHowever, it is exactly those skills that are important for making AI instructive and useful to humans! For example, to have an AI system that reliably and instructively teaches a concept to a human, the quality of its output should not depend on minor details of the input, and it should be able to generalise that concept to novel situations. Arguably, these skills are also a missing key step on the road to building generally-intelligent agents. Therefore, if we are able to make any strides towards capturing traits of classical computation in deep neural networks, this is likely to be a very fruitful pursuit.\nFirst impressions: Algorithmic alignment\nMy journey into the field started with an interesting, but seemingly inconsequential question:\nCan neural networks learn to execute classical algorithms?\nThis can be seen as a good way to benchmark to what extent are certain neural networks capable of behaving algorithmically; arguably, if a system can produce the outputs of a certain computation, it has “captured” it. Further, learning to execute provides a uniquely well-built environment for evaluating machine learning models:\n- An infinite data source—as we can generate arbitrary amounts of inputs;\n- They require complex data manipulation—making it a challenging task for deep learning;\n- They have a clearly specified target function—simplifying interpretability analyses.\nWhen we started studying this area in 2019, we really did not think more of it than a very neat benchmark—but it was certainly becoming a very lively area. Concurrently with our efforts, a team from MIT tried tackling a more ambitious, but strongly related problem:\nWhat makes a neural network better (or worse) at fitting certain (algorithmic) tasks?\nThe landmark paper, “What Can Neural Networks Reason About?” [2] established a mathematical foundation for why an architecture might be better for a task (in terms of sample complexity: the number of training examples needed to reduce validation loss below epsilon). The authors’ main theorem states that better algorithmic alignment leads to better generalisation. Rigorously defining algorithmic alignment is out of scope of this text, but it can be very intuitively visualised:\nHere, we can see a visual decomposition of how a graph neural network (GNN) [3] aligns with the classical Bellman-Ford [4] algorithm for shortest path-finding. Specifically, Bellman-Ford maintains its current estimate of how far away each node is from the source node: a distance variable (du) for each node u in a graph. At every step, for every neighbour v of node u, an update to du is proposed: a combination of (optimally) reaching v, and then traversing the edge connecting v and u (du + wvu). The distance variable is then updated as the optimal out of all the proposals. Operations of a graph neural network can naturally decompose to follow the data flow of Bellman-Ford:\n- The distance variables correspond to the node features maintained by the GNN;\n- Adding the edge distance to a distance variable corresponds to computing the GNN’s message function;\n- Choosing the optimal neighbour based on this measure corresponds to the GNN’s permutation-invariant aggregation function, ⨁.\nGenerally, it can be proven that, the more closely we can decompose our neural network to follow the algorithm’s structure, the more favourable sample complexity we can expect when learning to execute such algorithms. Bellman-Ford is a typical instance of a dynamic programming (DP) algorithm [5], a general-purpose problem-solving strategy that breaks the problem down into subproblems, and then recombines their solutions to find the final solution.\nThe MIT team made the important observation that GNNs appear to algorithmically align with DP, and since DP can itself be used to express many useful forms of classical computation, GNNs should be a very potent general-purpose model for learning to execute. This was validated by several carefully constructed DP execution benchmarks, where relational models like GNNs clearly outperformed architectures with weaker inductive biases. GNNs have been a long-standing interest of mine [6], so the time was right to release our own contribution to the area:\nNeural Execution of Graph Algorithms\nIn this paper [7], concurrently released with Xu et al. [2], we conducted a thorough empirical analysis of learning to execute with GNNs. We found that while algorithmic alignment is indeed a powerful tool for model class selection, it unfortunately does not allow us to be reckless. Namely, we cannot just apply any expressive GNN to an algorithmic execution task and expect great results, especially out-of-distribution—which we previously identified as a key setting in which “true” reasoning systems should perform well. Much like other neural networks, GNNs can very easily overfit to the characteristics of the distribution of training inputs, learning “clever hacks” and sidestepping the actual procedure that they are attempting to execute.\nWe hence identify three key observations on careful inductive biases to use, to improve the algorithmic alignment to certain path-finding problems even further and allow for generalising to 5x larger inputs at test time:\n- Most traditional deep learning setups involve a stack of layers with unshared parameters. This fundamentally limits the amount of computation the neural network can perform: if, at test time, an input much larger than the ones in the training data arrives, it would be expected that more computational steps are needed—yet, the unshared GNN has no way to support that. To address this, we adopt the encode-process-decode paradigm [8]: a single shared processor GNN is iterated for many steps, and this number of steps can be variable at both training and inference time. Such an architecture also allows a neat way to algorithmically align to iterative computation, as most algorithms involve repeatedly applying a certain computation until convergence.\n- Since most path-finding algorithms (including Bellman-Ford) require “local” optimisation (i.e. choosing exactly one optimal neighbour at every step), we opted to use max aggregation to combine the messages sent in GNNs. While this may seem like a very intuitive idea, it went strongly against the folklore of the times, as max-GNNs were known to be theoretically inferior to sum-GNNs at distinguishing non-isomorphic graphs [9]. (We now have solid theoretical evidence [10] for why this is a good idea.)\n- Lastly, while most deep learning setups only require producing an output given an input, we found that this misses out on a wealth of ways to instruct the model to align to the algorithm. For example, there are many interesting invariants that algorithms have that can be explicitly taught to a GNN. In the case of Bellman-Ford, after k iterations are executed, we should be expected to be able to recover all shortest paths that are no more than k hops away from the source node. Accordingly, we use this insight to provide step-wise supervision to the GNN at every step. This idea appears to be gaining traction in Large Language Model design in recent months [11, 12].\nAll three of the above changes make for stronger algorithmically-aligned GNNs.\nPlaying the alignment game\nIt must be stressed that the intuitive idea of algorithmic alignment—taking inspiration from Computer Science concepts in architecture design—is by no means novel! The text-book example of this are the neural Turing machine (NTM) and differentiable neural computer (DNC) [13, 14]. These works are decisively influential; in fact, in its attempt to make random-access memory compatible with gradient-based optimisation, the NTM gave us one of the earliest forms of content-based attention, three years before Transformers [15]!\nHowever, despite their influence, these architectures are nowadays virtually never used in practice. There are many possible causes for this, but in my opinion it is because their design was too brittle: trying to introduce many differentiable components into the same architecture at once, without a clear guidance on how to compose them, or a way to easily debug them once they failed to show useful signal on a new task. Our line of work still wants to build something like an NTM—but make it more successfully deployed, by using algorithmic alignment to more carefully prototype each building block in isolation, and have a more granular view of which blocks benefit execution of which target algorithms.\nOur approach of “playing the algorithmic alignment game” appears to have yielded a successful line of specialised (G)NNs, and we now have worthy ‘fine-grained’ solutions for learning to execute linearithmic sequence algorithms [16], iterative algorithms [17], pointer-based data structures [18], as well as persistent auxiliary memory [19]. Eventually, these insights also carried over to more fine-grained theory as well. In light of our NEGA paper [7], the inventors of algorithmic alignment refined their theory into what is now known as linear algorithmic alignment [10], providing justification for, among other things, our use of the max aggregation. More recent insights show that understanding algorithmic alignment may require causal reasoning [20,21], properly formalising it may require category theory [22], and properly describing it may require analysing asynchronous computation [23]. Algorithmic alignment is therefore t", "url": "https://wpnews.pro/news/neural-algorithmic-reasoning", "canonical_source": "https://thegradient.pub/neural-algorithmic-reasoning/", "published_at": "2023-10-14 15:30:15+00:00", "updated_at": "2026-05-18 05:17:38.049632+00:00", "lang": "en", "topics": [], "entities": [], "alternates": {"html": "https://wpnews.pro/news/neural-algorithmic-reasoning", "markdown": "https://wpnews.pro/news/neural-algorithmic-reasoning.md", "text": "https://wpnews.pro/news/neural-algorithmic-reasoning.txt", "jsonld": "https://wpnews.pro/news/neural-algorithmic-reasoning.jsonld"}}