This seems to be a big deal (to recreate in Julia), and the code is open source (Apache 2):
FYI @mantzaris
The more I read of their excellent paper, the more I want to read and quote. I think it should be obvious to practitioners that this is a game-changer:
Hierarchical Reasoning Model
https://arxiv.org/pdf/2506.21734
… These results underscore HRM’s potential as a transformative advancement toward universal computation and general-purpose reasoning systems.
The fixed depth of standard Transformers places them in computational complexity classes such as AC0 or TC0, preventing them from solving problems that require polynomial time. LLMs are not Turing-complete and thus they cannot, at least in a purely end-to-end manner, execute complex algorithmic reasoning that is necessary for deliberate planning or symbolic manipulation tasks
The LLM literature has relied largely on Chain-of-Thought (CoT) prompting for reasoning.
CoT externalizes reasoning into token-level language by breaking down complex tasks into simpler intermediate steps, sequentially generating text using a shallow model. However, CoT for reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions …
A more efficient approach is needed to minimize these data requirements. Towards this goal, we explore “latent reasoning”, where the model conducts computations within its internal hidden state space. This aligns with the understanding that language is a tool for human communication, not the substrate of thought itself; the brain sustains lengthy, coherent chains of reasoning with remarkable efficiency in a latent space, without constant translation back to language. However, the power of latent reasoning is still fundamentally constrained by a model’s effective computational depth. Naively stacking layers is notoriously difficult due to vanishing gradients, which plague training stability and effectiveness. Recurrent architectures, a natural alternative for sequential tasks, often suffer from early convergence, rendering subsequent computational steps inert, and rely on the biologically implausible, computationally expensive and memory-intensive Backpropagation Through Time (BPTT) for training.
The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack. It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning …
Inspired by this hierarchical and multi-timescale biological architecture, we propose the Hierarchical Reasoning Model (HRM). HRM is designed to significantly increase the effective computational depth. It features two coupled recurrent modules: a high-level (H) module for abstract, deliberate reasoning, and a low-level (L) module for fast, detailed computations
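To make the two-timescale idea concrete before quoting further: the L-module takes several fast steps while the H-state is held fixed as context, then the H-module updates once from the final L-state. A toy Julia sketch of just that control flow (all names and the tanh cells are mine; the paper uses recurrent Transformer blocks, not simple RNN cells):

```julia
# Two-timescale recurrence sketch (my own toy version, not the paper's code).
# f_L and f_H stand in for HRM's recurrent Transformer modules; simple
# tanh cells keep the sketch self-contained.
struct HRMSketch
    W_L::Matrix{Float64}  # L-module recurrence
    U_L::Matrix{Float64}  # L-module input weights, applied to [zH; x]
    W_H::Matrix{Float64}  # H-module recurrence
    U_H::Matrix{Float64}  # H-module input weights, applied to zL
end

f_L(m, zL, zH, x) = tanh.(m.W_L * zL .+ m.U_L * vcat(zH, x))
f_H(m, zH, zL)    = tanh.(m.W_H * zH .+ m.U_H * zL)

# One forward pass: N slow (H) cycles, each spanning T fast (L) steps.
function hrm_forward(m::HRMSketch, x; N=2, T=4)
    d = size(m.W_H, 1)
    zL, zH = zeros(d), zeros(d)
    for _ in 1:N
        for _ in 1:T
            zL = f_L(m, zL, zH, x)   # fast, detailed computation
        end
        zH = f_H(m, zH, zL)          # slow, abstract update: once per T L-steps
    end
    return zH                        # a readout head would decode this
end
```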
Figure 2: The necessity of depth for complex reasoning
Furthermore, we propose a one-step gradient approximation for training HRM, which offers improved efficiency and eliminates the requirement for BPTT. This design maintains a constant memory footprint (O(1) compared to BPTT’s O(T) for T timesteps) throughout the backpropagation process, making it scalable and more biologically plausible.
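If I read the Figure 4 pseudocode right, the trick translates to Julia/Zygote roughly as below: run every update except the final (L, H) pair outside the AD graph, then backpropagate through that one step only, so activation memory stays constant regardless of how many timesteps ran. A hedged sketch reusing the toy HRMSketch above (hrm_loss_onestep and the stand-in loss are mine):

```julia
using Zygote  # reverse-mode AD

# One-step gradient approximation (my reading of Fig. 4's pseudocode):
# every update except the last (L, H) pair runs outside the AD graph,
# so activation memory stays O(1) instead of BPTT's O(T).
function hrm_loss_onestep(m::HRMSketch, x, target; N=2, T=4)
    zL, zH = Zygote.ignore() do
        d = size(m.W_H, 1)
        zL, zH = zeros(d), zeros(d)
        for n in 1:N, t in 1:T
            n == N && t == T && break    # save the final step for the tracked pass
            zL = f_L(m, zL, zH, x)
            t == T && (zH = f_H(m, zH, zL))
        end
        zL, zH
    end
    # The only tracked computation: one L-step and one H-step.
    zL = f_L(m, zL, zH, x)
    zH = f_H(m, zH, zL)
    return sum(abs2, zH .- target)       # stand-in readout loss
end

# grads = Zygote.gradient(m -> hrm_loss_onestep(m, x, y), model)[1]
```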
Leveraging its enhanced effective depth, HRM excels at tasks that demand extensive search and backtracking. Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs
Figure 4: Top: Diagram of HRM with approximate gradient. Bottom: Pseudocode of HRM with deep supervision training in PyTorch.
Adaptive computational time (ACT) The brain dynamically alternates between automatic thinking (“System 1”) and deliberate reasoning (“System 2”) …
Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that enables “thinking, fast and slow”. This integration leverages deep supervision and uses the Q-learning algorithm
Inference-time scaling … As illustrated in Figure 5-(c), HRM seamlessly achieves inference-time scaling
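My reading of the mechanism: after each reasoning segment, a small Q-head scores “halt” vs “continue” from the H-state, and the segment cap becomes the inference-time scaling knob, since you can simply allow more segments at test time. A toy sketch (q_head, run_with_act, and max_segments are my names, not the paper's):

```julia
# ACT halting sketch: a Q-head reads the H-state and scores (halt, continue);
# we halt when "halt" wins or when the segment cap is hit. Raising
# max_segments at inference is the "inference-time scaling" knob.
q_head(Wq, zH) = Wq * zH                      # Wq is a 2×d matrix

function run_with_act(m::HRMSketch, Wq, x; max_segments=8)
    for seg in 1:max_segments
        zH = hrm_forward(m, x)                # one reasoning segment
        # (In the paper the hidden state carries over between segments,
        #  detached; the toy hrm_forward above restarts it for brevity.)
        q_halt, q_cont = q_head(Wq, zH)
        (q_halt > q_cont || seg == max_segments) && return zH, seg
    end
end
```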
Stability of Q-learning in ACT The deep Q-learning that underpins our ACT mechanism is known to be prone to instability, often requiring stabilization techniques such as replay buffers and target networks, which are absent in our design. Our approach, however, achieves stability through the intrinsic properties of our model and training procedure. Recent theoretical work by Gallici et al. … Our model satisfies these conditions through its Post-Norm architecture that employs RMSNorm (a layer normalization variant) and the AdamW optimizer.
Architectural details We employ a sequence-to-sequence architecture for HRM …
For all Transformer blocks in this work—including those in the baseline models—we incorporate the enhancements found in modern LLMs (based on Llama architectures). These improvements include Rotary Positional Encoding, Gated Linear Units, RMSNorm, and the removal of bias terms from linear layers.
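A few of these pieces are tiny in Julia; behavior sketches (my code and names, not theirs) of RMSNorm, a Llama-style gated MLP (SwiGLU), and the Post-Norm wiring mentioned above:

```julia
# RMSNorm over the feature dimension (rows = features, columns = tokens);
# no learnable scale/bias, which the paper says are excluded in HRM.
rmsnorm(x; eps=1f-6) = x ./ sqrt.(sum(abs2, x; dims=1) ./ size(x, 1) .+ eps)

silu(x) = x .* (1 ./ (1 .+ exp.(-x)))      # a.k.a. swish

# Llama-style gated MLP (SwiGLU): silu(W1*x) gates a parallel branch W3*x;
# note there are no bias terms, matching the quoted architecture.
gated_mlp(W1, W2, W3, x) = W2 * (silu(W1 * x) .* (W3 * x))

# Post-Norm residual wiring: normalize AFTER the residual addition,
# i.e. rmsnorm(z + f(z)), vs Pre-Norm's z + f(rmsnorm(z)).
postnorm_block(f, z) = rmsnorm(z .+ f(z))
```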
…
Furthermore, both HRM and recurrent Transformer models implement a Post-Norm architecture with weights initialized via truncated LeCun Normal initialization, while the scale and bias parameters are excluded from RMSNorm. All parameters are optimized using the Adam-atan2 optimizer, a scale-invariant variant of Adam.
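Truncated LeCun Normal is easy to reproduce: standard deviation sqrt(1/fan_in), redrawing samples beyond about two standard deviations (the usual truncation convention; I'm assuming HRM follows it). And Adam-atan2, as I understand it, replaces Adam's m̂/(√v̂ + ε) update with an atan2-based form, eliminating the ε hyperparameter. A sketch of the init:

```julia
using Random

# Truncated LeCun Normal: std = sqrt(1/fan_in), redrawing samples that
# land outside ±2 std (the common truncation convention; assumed here).
function trunc_lecun_normal(rng::AbstractRNG, fan_out::Int, fan_in::Int)
    s = sqrt(1 / fan_in)
    W = s .* randn(rng, fan_out, fan_in)
    for i in eachindex(W)
        while abs(W[i]) > 2s
            W[i] = s * randn(rng)
        end
    end
    return W
end

W = trunc_lecun_normal(Random.default_rng(), 128, 64)  # e.g. a 128×64 layer
```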
…
We introduce Sudoku-Extreme, a more challenging dataset that is compiled from the aforementioned easy datasets as well as puzzles recognized by the Sudoku community as exceptionally difficult for human players:
…
We use Sudoku-Extreme in our main experiments (Figure 1).
Remarkably, HRM attains these results with just ~1000 training examples per task—and without pretraining or CoT labels.
3.3 Visualization of intermediate timesteps
Although HRM demonstrates strong performance on complex reasoning tasks, it raises an intriguing question: what underlying reasoning algorithms does the HRM neural network actually implement? Addressing this question is important for enhancing model interpretability and developing a deeper understanding of the HRM solution space. While a definitive answer lies beyond our current scope, we begin our investigation by analyzing state trajectories and their corresponding solution evolution.
4 Brain Correspondence
…
Figure 8: Hierarchical Dimensionality Organization in the HRM and Mouse Cortex.
…
The high-to-low PR ratio in HRM (zH/zL ≈ 2.98) closely matches that measured in the mouse cortex (≈ 2.25). In contrast, conventional deep networks often exhibit neural collapse, where last-layer features converge to a low-dimensional subspace. HRM therefore departs from the collapse pattern and instead fosters a high-dimensional representation in its higher module.
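PR here is the participation ratio, the standard effective-dimensionality measure (Σλ)²/Σλ² over the eigenvalues of the covariance of a module's states; assuming the paper uses the standard definition, it is easy to check on any Julia recreation:

```julia
using LinearAlgebra, Statistics

# Participation ratio PR = (Σλ)^2 / Σλ^2 over eigenvalues of the state
# covariance. Z is features × samples (my layout convention).
function participation_ratio(Z::AbstractMatrix)
    λ = eigvals(Symmetric(cov(Z')))   # cov over samples, feature × feature
    return sum(λ)^2 / sum(abs2, λ)
end

# E.g. the ratio between collected H- and L-module states:
# participation_ratio(ZH) / participation_ratio(ZL)   # paper reports ≈ 2.98
```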
5 Related Work
…
Another notable work in this area is Recurrent Relational Networks (RRN).
6 Discussions
Turing-completeness of HRM Like earlier neural reasoning algorithms including the Universal Transformer, HRM is computationally universal when given sufficient memory and time constraints. In other words, it falls into the category of models that can simulate any Turing machine, overcoming the computational limitations of standard Transformers discussed previously in the introduction. Given that earlier neural algorithm reasoners were trained as recurrent neural networks, they suffer from premature convergence and memory-intensive BPTT. …
It references many intriguing papers, e.g.:
Meta’s (paper updated in July 2025):
Reinforcement Learning for Reasoning in Large Language Models with One Training Example, 2025. arXiv:2504.20571 (https://arxiv.org/abs/2504.20571)
And this work/paper:
Log-Linear Attention
https://arxiv.org/pdf/2506.04761 (arXiv:2506.04761v2 [cs.LG], 25 Jun 2025)
Looking up Post-Norm, I find:
[There are also 4 open-source game AI generators out, seemingly as good as the latest from DeepMind, which is still unreleased.]
I had this in drafts from months ago (I do not recall the exact context; I believe it was DeepSeek R1, back when “reasoning” was new, and the Julia code it generated was for phase 1 of some neural-compression proof-of-concept I was doing):
“Thought for 732 seconds”
[It output the thinking process, 48 screenfuls! Something you can hide or view, unlike with OpenAI, and it at least started along the same lines as I was thinking; but yes, I did NOT read all of it, just confirmed the final code works.]
So the code for compression seems to handle all cases. [..]