An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

wpnews.pro

[ARTICLE · art-30537] src=arxiv.org pub=2026-06-17T04:00Z topic=large-language-models verified=true sentiment=neutral

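Researchers have theoretically analyzed how deep transformers can represent hierarchical structure in language. For bounded-depth, non-recursive context-free grammars, they explicitly construct transformers whose depth grows linearly with grammar depth and whose neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. The authors present the result as support for the linear representation hypothesis: the constructed architectures have the capacity to encode abstract grammatical states in low-dimensional, linearly separable subspaces of the residual stream.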

1 min read · published Jun 17, 2026

arXiv:2606.17522v1 (announce type: new)

Abstract: Deep neural networks are widely believed to derive their expressive power from their ability to form hierarchical representations, capturing progressively more abstract and compositional features across layers. In language modeling, transformers have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating how deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.
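To make the quantities in the abstract concrete, here is a minimal Python sketch of a bounded-depth, non-recursive context-free grammar and the three numbers the stated bounds depend on. It is illustrative only: the toy grammar and all function names are hypothetical and do not come from the paper. It also counts distinct derivation trees as a stand-in for "derivation-tree shapes", since the abstract does not give the precise definition, and it does not attempt the transformer construction itself.

```python
from functools import lru_cache
from math import prod

# Hypothetical toy grammar (not from the paper). Each nonterminal maps to a
# list of productions; symbols that are not keys of the dict are terminals.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("det", "noun"), ("noun",)],
    "VP": [("verb", "NP"), ("verb",)],
}


def is_non_recursive(grammar):
    """Return True if no nonterminal can (transitively) derive itself."""
    state = {}  # symbol -> "visiting" while on the DFS stack, then "done"

    def visit(sym):
        if sym not in grammar or state.get(sym) == "done":
            return True
        if state.get(sym) == "visiting":
            return False  # back edge: the grammar is recursive
        state[sym] = "visiting"
        ok = all(visit(child) for rhs in grammar[sym] for child in rhs)
        state[sym] = "done"
        return ok

    return all(visit(nt) for nt in grammar)


def grammar_depth(grammar, start):
    """Height of the tallest derivation tree rooted at `start` (terminals = 0)."""

    @lru_cache(maxsize=None)
    def depth(sym):
        if sym not in grammar:
            return 0
        children = (depth(c) for rhs in grammar[sym] for c in rhs)
        return 1 + max(children, default=0)

    return depth(start)


def count_derivation_trees(grammar, start):
    """Number of distinct derivation trees rooted at `start`.

    Assumption: used here as a proxy for the paper's "derivation-tree shapes".
    """

    @lru_cache(maxsize=None)
    def trees(sym):
        if sym not in grammar:
            return 1
        return sum(prod(trees(c) for c in rhs) for rhs in grammar[sym])

    return trees(start)


def count_productions(grammar):
    """Total number of production rules in the grammar."""
    return sum(len(rules) for rules in grammar.values())
```

For this toy grammar, `grammar_depth(GRAMMAR, "S")` evaluates to 3, `count_derivation_trees(GRAMMAR, "S")` to 6, and `count_productions(GRAMMAR)` to 5. Both recursive helpers terminate only because the grammar is non-recursive, which is why that check comes first; the same property is what keeps the grammar depth, and hence the depth of the constructed transformer, bounded.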

source & further reading

arxiv.org — original article
