Community Interest Check: LLMs from Scratch in Pure Julia

Those are HuggingFace's terminology, and they are mostly regular MLP/attention/GELU blocks. They prefer to define a new class for each component of a model instead of defining/using compositional classes. This is also the reason why we need to register a loader for each model in Julia: the class layout determines the structure of the state_dict, so we have to manually align the layouts. The layers defined in Transformers.jl are designed to be as composable as possible, to reduce the need to define new structs when registering a new loader.
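
To make the layout point concrete, here is a purely illustrative sketch (the struct and the key names below are invented for the example; they are not the actual Transformers.jl or HuggingFace layouts). The field layout of a composable attention struct fixes which state_dict keys a loader has to map onto it, so registering a new model mostly reduces to renaming keys rather than defining new structs:

# Illustrative only: a hypothetical composable layer and key renaming, not real package code.
struct SelfAttention{Q,K,V,O}
    q_proj::Q
    k_proj::K
    v_proj::V
    o_proj::O
end

# A loader for a new model family then mostly renames HuggingFace state_dict keys onto
# this fixed layout instead of defining a bespoke struct per model:
rename_key(k::AbstractString) = replace(k,
    "attention.self.query"   => "q_proj",
    "attention.self.key"     => "k_proj",
    "attention.self.value"   => "v_proj",
    "attention.output.dense" => "o_proj")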

There are a few things on my list of priorities. Two major parts I'm currently slowly working on are splitting out the wordpiece/sentencepiece tokenizer into a separate package and GPU abstraction for attention. Besides that, I would also like to enhance HuggingFaceApi.jl to use HuggingFace's dataset viewer API and use DuckDB.jl to load those processed datasets. Unfortunately, my current bandwidth is mostly allocated to surviving and job hunting, so you probably won't see these in the near future.
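
For the dataset side, a minimal sketch of the kind of thing I mean, assuming the dataset viewer exposes parquet shards over HTTPS (the URL below is a placeholder, and DuckDB's httpfs extension is assumed to be installable):

using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")
DBInterface.execute(con, "INSTALL httpfs")
DBInterface.execute(con, "LOAD httpfs")

# Placeholder shard URL; in practice it would come from the dataset-viewer/parquet API.
url = "https://huggingface.co/datasets/<org>/<name>/resolve/main/train-00000-of-00001.parquet"
df = DataFrame(DBInterface.execute(con, "SELECT * FROM read_parquet('$url') LIMIT 5"))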

One package I would love to see, and which is surely beyond my scope, is a better data loader design/interface with distributed support. I have only roughly scanned through them, so this might not be precise, but it seems the data loaders we have are relatively naive compared to the distributed data loaders in PyTorch or HuggingFace.
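
To illustrate the kind of thing I mean, here is a strided rank-based shard in the spirit of PyTorch's DistributedSampler (the function and its rank/nranks arguments are made up for the example); a real loader would of course also need shuffling, prefetching, and collation across processes:

# Each worker sees a disjoint, strided subset of the dataset indices.
shard(indices, rank, nranks) = indices[(rank + 1):nranks:end]

shard(1:10, 0, 4)  # worker 0 of 4 => 1, 5, 9
shard(1:10, 1, 4)  # worker 1 of 4 => 2, 6, 10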

6 Likes

I will be back. I want to do the fastest Llama2.jl GPU support I can; I just couldn't finish it yet.

I am curious how fast things would get if we fused most kernel calls (I keep hearing it is already done, but I want to double-check it).
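
A quick way to sanity-check part of this in Julia: broadcast fusion already merges whole dotted expressions into a single GPU kernel with CUDA.jl, so the open question is mainly fusion across non-broadcast calls (GEMMs, attention, etc.). A tiny example of what is and isn't fused (sizes arbitrary):

using CUDA

x = CUDA.rand(Float32, 1 << 20)
w = CUDA.rand(Float32, 1 << 20)

# Two separate broadcasts => two kernel launches:
y1 = x .* w
y2 = max.(y1, 0f0)

# One fused broadcast => a single kernel for the whole expression:
y = @. max(x * w, 0f0)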

2 Likes

Since December before my Christmas holiday, my thinking on what we need to do has completely transformed.

Just generating tokens one by one has known limitations.
https://arxiv.org/pdf/2404.19737

Better & Faster Large Language Models via Multi-token Prediction

Gains are especially pronounced on generative benchmarks like coding […]
Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3× faster at inference, even with large batch sizes.

But what caught my attention the most wasn't the multi-token prediction as in DeepSeek (which is why I looked up this research from April, by Meta FAIR/“Facebook”), but rather the multi-byte prediction (and Table 1), not mentioned in the abstract. With multi-byte prediction it finally becomes very practical to get rid of tokens (which should be a goal, since they have lots of problems):

Section 3.3 demonstrates how multi-token prediction promotes learning longer-term patterns, a fact most apparent in the extreme case of byte-level tokenization.

We believe this usefulness only at scale to be a likely reason why multi-token prediction has so far been largely overlooked

3.3. Learning global patterns with multi-byte prediction

To show that the next-token prediction task latches to local patterns, we went to the extreme case of byte-level tokenization by training a 7B parameter byte-level transformer on 314B bytes, which is equivalent to around 116B tokens. The 8-byte prediction model achieves astounding improvements compared to next-byte prediction, solving 67% more problems on MBPP pass@1 and 20% more problems on HumanEval pass@1.

Multi-byte prediction is therefore a very promising avenue to unlock efficient training of byte-level models. Self-speculative decoding can achieve speedups of 6 times for the 8-byte prediction model, which would allow to fully compensate the cost of longer byte-level sequences at inference time and even be faster than a next-token prediction model by nearly two times. The 8-byte prediction model is a strong byte-based model, approaching the performance of token-based models despite having been trained on 1.7× less data.

[I added bold. Note how 8-byte is always best in Table 1 for byte-level, while 4-token is usually best for tokens, though sometimes 6-token, and you likely need to choose one fixed n; is 8, or maybe 7 or 9, better for bytes…? Also, I thought you had 1 or more epochs, so what does 0.5 epochs really mean there?! It seems they stopped the byte experiment prematurely; a full epoch or more, training on as much data rather than just a “116B token”-equivalent amount, would have been fairer and would likely show it even better.]

7. Conclusions

[…] We posit that our method reduces distribution mismatch between teacher-forced training and autoregressive generation. When used with speculative decoding, exact inference gets 3 times faster.[…]

Also, optimal vocabulary sizes for multi-token prediction are likely different from those for next-token prediction, and tuning them could lead to better results, as well as improved trade-offs between compressed sequence length and compute-per-byte expenses.
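
For intuition, the architectural change is small: one shared trunk with n independent output heads instead of a single next-token head. A hedged Flux sketch (the sizes and the trunk below are placeholders, not the paper's model):

using Flux

d, vocab, n = 512, 32_000, 4
trunk = Chain(Dense(d => d, gelu))            # stand-in for the shared transformer trunk
heads = [Dense(d => vocab) for _ in 1:n]      # one head per future token (t+1 ... t+n)

h = trunk(rand(Float32, d, 8))                # hidden states for a batch of 8 positions
logits = [head(h) for head in heads]          # n sets of logits; train with n losses,
                                              # keep only the first head for plain decoding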

Interesting, but I'm not yet sure if it's a step forward or backwards (since it thinks in thought vectors, not words, going even further from interpretability, and maybe further away from how the brain thinks, unless the brain uses multi-dimensional vectors too, which is highly plausible):

Training Large Language Models to Reason in a Continuous Latent Space

we introduce a new paradigm Coconut (Chain of Continuous Thought). […]
Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference.

I'm not sure if "thinking tokens" are specialized tokens; I rather think they are simply tokens generated while "thinking", which OpenAI doesn't show you but DeepSeek does, I believe (I guess you can enable/disable showing the thinking process). There are likely some markers like <thought> ... </thought> so that the user interface knows what NOT to show in between; I see now they are called <bot> for beginning-of-thought and <eot> for end-of-thought. This might conflict with working on bytes only rather than tokens (also for visual models): then it's unclear what to show, since every byte sequence is potentially legal UTF-8 (use some illegal byte sequences?).
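
As a trivial illustration of the UI side (assuming the model really does emit explicit markers; <bot>/<eot> are spelled out literally here just for the example):

# Drop everything between begin-of-thought and end-of-thought markers before display.
strip_thought(s) = replace(s, r"<bot>.*?<eot>"s => "")

strip_thought("<bot>check the edge cases first…<eot>The answer is 42.")
# => "The answer is 42."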

I will not summarize; best to read the full blog:

We expect to see ModernBERT become the new standard in the numerous applications where encoder-only models are now deployed, such as in RAG pipelines (Retrieval Augmented Generation) and recommendation systems.

In addition to being faster and more accurate, ModernBERT also increases context length to 8k tokens (compared to just 512 for most encoders), and is the first encoder-only model that includes a large amount of code in its training data.
[…]
For code retrieval, ModernBERT is unique. There’s nothing to really compare it to, since there’s never been an encoder model like this trained on a large amount of code data before. For instance, on the StackOverflow-QA dataset (SQA), which is a hybrid dataset mixing both code and natural language, ModernBERT’s specialized code understanding and long-context capabilities make it the only backbone to score over 80 on this task.

This means whole new applications are likely to be built on this capability. For instance, imagine an AI-connected IDE which had an entire enterprise codebase indexed with ModernBERT embeddings, providing fast long context retrieval of the relevant code across all repositories. Or a code chat service which described how an application feature worked that integrated dozens of separate projects.
[…]
ModernBERT is 2-3x faster than the next fastest model. And, on the “practicality” dimension again: ModernBERT doesn’t require the additional heavy “xformers” dependency, but instead only requires the now commonplace Flash Attention as a dependency.
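
The retrieval part of such an application is simple once the embeddings exist; a toy sketch with random placeholder vectors standing in for ModernBERT embeddings:

using LinearAlgebra

E = randn(Float32, 768, 10_000)                 # one column per code chunk (placeholder)
E ./= sqrt.(sum(abs2, E; dims = 1))             # normalize columns to unit length
q = normalize(randn(Float32, 768))              # query embedding (placeholder)

scores = E' * q                                 # cosine similarities (unit-norm vectors)
top5 = partialsortperm(scores, 1:5; rev = true) # indices of the 5 closest chunks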

Paper updated in December:

PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s. Notably, these performance improvements preserve model quality with negligible accuracy degradation.

ReLU seems to be back, when done right with dReLU, and I believe sparsification is very important:

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

https://arxiv.org/pdf/2406.05955

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation.
To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. […]
Evaluation results demonstrate that this sparsity achieves a 2-5× decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second.
Our models are available at PowerInfer.

To address these challenges, we first conduct a comprehensive analysis of the existing ReLUfication approach and identify that its shortcomings stem from the negative activations in the GLU component. Therefore, we propose an efficient activation function named dReLU. We apply dReLU in the pretraining of small-scale LLMs, alongside SwiGLU, and our findings indicate that LLMs using dReLU match the performance of those using SwiGLU, while also achieving close to 90% sparsity.
Additionally, we collect a diverse range of pretraining corpora from the open-source community, including web, code, and mathematical datasets, to enhance the effectiveness of ReLUfication.
Meanwhile, we also conduct a sparsity analysis on MoE-based LLMs. Interestingly, we observe that the feed-forward networks (FFNs) within the experts remain sparsely activated, similar to the behavior exhibited by dense LLMs. This phenomenon suggests an opportunity to further accelerate inference speed by combining MoE techniques with ReLU-based sparse activation.

The key contributions of this paper include:
• Efficient dReLU activation function: Our method utilizes fewer than 150B tokens, representing less than 1% of the typical pretraining tokens (commonly 15T tokens [11]).
• Sparse activated models: We will release our sparsely-activated TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B models. Both models demonstrate better performance compared to the original versions.
• Practical inference speedup: Evaluation shows that with our models, we can achieve a 2-5× speedup. Notably, we can achieve up to 10 tokens/s even without a GPU on TurboSparse-Mixtral-47B.
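
The way this turns into wall-clock speedup is simple: with roughly 90% of FFN activations exactly zero after dReLU, the down-projection only has to touch the columns belonging to active neurons. A toy dense-Julia illustration of the idea (a real system such as PowerInfer does this inside fused kernels with activation predictors):

# a: FFN activations after dReLU (mostly zeros); W_down: d × h down-projection.
function sparse_down_proj(W_down::AbstractMatrix, a::AbstractVector)
    nz = findall(!iszero, a)                        # indices of active neurons
    isempty(nz) && return zeros(eltype(W_down), size(W_down, 1))
    return W_down[:, nz] * a[nz]                    # only the active columns are used
end

W_down = randn(Float32, 512, 2048)
a = max.(randn(Float32, 2048), 0f0) .* (rand(Float32, 2048) .< 0.1f0)  # ~95% zeros
y = sparse_down_proj(W_down, a)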

Figure 3. The figure shows that after ReLUfication, the combined activation becomes more concentrated around 0, with the sparsity increasing to 67%.
This can be attributed to the ReLU activation function applied after the gate weight, which masks all negative activations to zero.

3.2 dReLU

We introduce a new activation function, named dReLU (Equation 2), where ReLU is applied after both the up- and gate-projections:

Combined_dReLU(x) := max(0, x W_gate) ∗ max(0, x W_up)    (2)

Inspired by our discoveries in MoE models, we are convinced that ReLUfication can be extended to MoE models and is not restricted to dense models. As the proportion of FFN weights in MoE models increases, the FLOP reduction achieved through ReLUfication will be even more pronounced.
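
A minimal Flux sketch of such an FFN block, following Equation 2 (sizes arbitrary; the SwiGLU line in the comment is just the usual formulation, not taken from the paper's code):

using Flux

struct DReLUFFN{G,U,D}
    W_gate::G
    W_up::U
    W_down::D
end
Flux.@functor DReLUFFN

DReLUFFN(d::Int, h::Int) = DReLUFFN(Dense(d => h; bias = false),
                                    Dense(d => h; bias = false),
                                    Dense(h => d; bias = false))

# dReLU: ReLU applied after both the gate- and up-projections (Equation 2).
(m::DReLUFFN)(x) = m.W_down(relu.(m.W_gate(x)) .* relu.(m.W_up(x)))
# For comparison, SwiGLU would be: m.W_down(swish.(m.W_gate(x)) .* m.W_up(x))

ffn = DReLUFFN(512, 2048)
h = ffn(rand(Float32, 512, 4))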

@mantzaris, I may add even more links here when I find them again…
But at least one more paper I just discovered that seems like a big deal:

Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks

https://arxiv.org/pdf/2405.05097

Biological neural networks seem qualitatively superior (e.g. in learning, flexibility, robustness) to current artificial ones like Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). Simultaneously, in contrast to them: biological have fundamentally multidirectional signal propagation [1], also of probability distributions e.g. for uncertainty estimation, and are believed not being able to use standard backpropagation training [2]. There are proposed novel artificial neurons based on HCR (Hierarchical Correlation Reconstruction) allowing to remove the above low level differences: with neurons containing local joint distribution model (of its connections),
[…]
we get simple formulas for e.g. conditional expected values for propagation in any direction, like E[x|y, z], E[y|x], which degenerate to KAN-like parametrization if restricting to pairwise dependencies. Such HCR network can also propagate probability distributions (also joint) like ρ(y, z|x). It also allows for additional training approaches, like direct (a_j) estimation, through tensor decomposition, or more biologically plausible information bottleneck training: layers directly influencing only neighbors, optimizing content to maximize information about the next layer, and minimizing about the previous to remove noise, extract crucial information.

Keywords: machine learning, neural networks, Kolmogorov-Arnold Network, joint distribution, conditional distribution, Bayesian Neural Networks, tensor decomposition, mutual information, information bottleneck approach, HSIC

Biological neurons use complex propagation of action potentials, travelling in both directions of e.g. axons: "it is not uncommon for axonal propagation of action potentials to happen in both directions" [1].

[My computer froze while editing this so I post in case it does again and web browser crashes… intend to add more]

I was looking into Jarek Duda again; he's famous for other things, namely inventing the new ANS compression. I wasn't expecting this, nor what he references as collaborative works:

[13] J. Duda and G. Bhatta, "Gamma-ray blazar variability: new statistical methods of time-flux distributions," Monthly Notices of the Royal Astronomical Society, vol. 508, no. 1, pp. 1446–1458, 2021.
[14] J. Duda and S. Podlewska, "Prediction of probability distributions of molecular properties: towards more efficient virtual screening and better understanding of compound representations," Molecular Diversity, pp. 1–12, 2022.
[15] J. Duda and G. Bhatta, "Predicting conditional probability distributions of redshifts of active galactic nuclei using hierarchical correlation reconstruction," Monthly Notices of the Royal Astronomical Society, p. stae963, 2024.

4 Likes

Great references again! Lots of new things to read up on.

1 Like

This seems to be a big deal (to recreate in Julia), and the code is open source (Apache 2):

FYI @mantzaris

The more I read of their excellent paper, the more I want to read and quote. I think it should be obvious to practitioners that this is a game-changer:

Hierarchical Reasoning Model
https://arxiv.org/pdf/2506.21734

… These results underscore HRM’s potential as a transformative advancement toward universal computation and general-purpose reasoning systems.

The fixed depth of standard Transformers places them in computational complexity classes such as AC0 or TC0, preventing them from solving problems that require polynomial time. LLMs are not Turing-complete and thus they cannot, at least in a purely end-to-end manner, execute complex algorithmic reasoning that is necessary for deliberate planning or symbolic manipulation tasks

The LLMs literature has relied largely on Chain-of-Thought (CoT) prompting for reasoning.
CoT externalizes reasoning into token-level language by breaking down complex tasks into simpler intermediate steps, sequentially generating text using a shallow model. However, CoT for reasoning is a crutch, not a satisfactory solution. It relies on brittle, human-defined decompositions …
A more efficient approach is needed to minimize these data requirements.

Towards this goal, we explore “latent reasoning”, where the model conducts computations within its internal hidden state space. This aligns with the understanding that language is a tool for human communication, not the substrate of thought itself; the brain sustains lengthy, coherent chains of reasoning with remarkable efficiency in a latent space, without constant translation back to language. However, the power of latent reasoning is still fundamentally constrained by a model’s effective computational depth. Naively stacking layers is notoriously difficult due to vanishing gradients, which plague training stability and effectiveness. Recurrent architectures, a natural alternative for sequential tasks, often suffer from early convergence, rendering subsequent computational steps inert, and rely on the biologically implausible, computationally expensive and memory intensive Backpropagation Through Time (BPTT) for training.

The human brain provides a compelling blueprint for achieving the effective computational depth that contemporary artificial models lack. It organizes computation hierarchically across cortical regions operating at different timescales, enabling deep, multi-stage reasoning …

Inspired by this hierarchical and multi-timescale biological architecture, we propose the Hierarchical Reasoning Model (HRM). HRM is designed to significantly increase the effective computational depth. It features two coupled recurrent modules: a high-level (H) module for abstract, deliberate reasoning, and a low-level (L) module for fast, detailed computations

Figure 2: The necessity of depth for complex reasoning

Furthermore, we propose a one-step gradient approximation for training HRM, which offers improved efficiency and eliminates the requirement for BPTT. This design maintains a constant memory footprint (O(1) compared to BPTT’s O(T) for T timesteps) throughout the backpropagation process, making it scalable and more biologically plausible.
Leveraging its enhanced effective depth, HRM excels at tasks that demand extensive search and backtracking. Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs

Figure 4: Top: Diagram of HRM with approximate gradient. Bottom: Pseudocode of HRM with deep supervision training in PyTorch.

Adaptive computational time (ACT) The brain dynamically alternates between automatic thinking (“System 1”) and deliberate reasoning (“System 2”) …
Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that enables “thinking, fast and slow”. This integration leverages deep supervision and uses the Q-learning algorithm

Inference-time scaling … As illustrated in Figure 5-(c), HRM seamlessly achieves inference-time scaling

Stability of Q-learning in ACT The deep Q-learning that underpins our ACT mechanism is known to be prone to instability, often requiring stabilization techniques such as replay buffers and target networks, which are absent in our design. Our approach, however, achieves stability through the intrinsic properties of our model and training procedure. Recent theoretical work by Gallici et al. … Our model satisfies these conditions through its Post-Norm architecture that employs RMSNorm (a layer normalization variant) and the AdamW optimizer

Architectural details We employ a sequence-to-sequence architecture for HRM …
For all Transformer blocks in this work—including those in the baseline models—we incorporate the enhancements found in modern LLMs (based on Llama architectures). These improvements include Rotary Positional Encoding, Gated Linear Units, RMSNorm, and the removal of bias terms from linear layers.

Furthermore, both HRM and recurrent Transformer models implement a Post-Norm architecture with weights initialized via truncated LeCun Normal initialization, while the scale and bias parameters are excluded from RMSNorm. All parameters are optimized using the Adam-atan2 optimizer, a scale-invariant variant of Adam.

We introduce Sudoku-Extreme, a more challenging dataset that is compiled from the aforementioned easy datasets as well as puzzles recognized by the Sudoku community as exceptionally difficult for human players:

We use Sudoku-Extreme in our main experiments (Figure 1)

Remarkably, HRM attains these results with just ~1000 training examples per task—and
without pretraining or CoT labels

3.3 Visualization of intermediate timesteps
Although HRM demonstrates strong performance on complex reasoning tasks, it raises an intriguing question: what underlying reasoning algorithms does the HRM neural network actually implement? Addressing this question is important for enhancing model interpretability and developing a deeper understanding of the HRM solution space.

While a definitive answer lies beyond our current scope, we begin our investigation by analyzing state trajectories and their corresponding solution evolution.

4 Brain Correspondence

Figure 8: Hierarchical Dimensionality Organization in the HRM and Mouse Cortex.

The high-to-low PR ratio in HRM (z_H/z_L ≈ 2.98) closely matches that measured in the mouse cortex (≈ 2.25). In contrast, conventional deep networks often exhibit neural collapse, where last-layer features converge to a low-dimensional subspace. HRM therefore departs from the collapse pattern and instead fosters a high-dimensional representation in its higher module.
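
For reference, the participation ratio they compare is, assuming the usual definition (which the paper appears to use), (Σᵢλᵢ)²/Σᵢλᵢ² over the eigenvalues of the covariance of the hidden states; easy to compute in Julia:

using LinearAlgebra, Statistics

# Z: hidden states, features × samples. PR = (Σλ)² / Σλ² of the covariance spectrum.
function participation_ratio(Z::AbstractMatrix)
    λ = eigvals(Symmetric(cov(Z; dims = 2)))
    return sum(λ)^2 / sum(abs2, λ)
end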

5 Related Work

Another notable work in this area is Recurrent Relational Networks (RRN)

6 Discussions
Turing-completeness of HRM Like earlier neural reasoning algorithms including the Universal Transformer, HRM is computationally universal when given sufficient memory and time constraints. In other words, it falls into the category of models that can simulate any Turing machine, overcoming the computational limitations of standard Transformers discussed previously in the introduction. Given that earlier neural algorithm reasoners were trained as recurrent neural networks, they suffer from premature convergence and memory intensive BPTT. …

It references many intriguing papers e.g.:

Meta’s (paper updated in July 2025)

Reinforcement learning for reasoning in large language models with one training example, 2025. URL: [2504.20571] Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Log-Linear Attention (arXiv:2506.04761v2 [cs.LG], 25 Jun 2025)
https://arxiv.org/pdf/2506.04761

And this work/paper:

Looking up post-norm I find:

[There are also 4 open source game AI generators out, seemingly as good as latest from Deepmind, which is still not released.]

I had this in drafts from months ago (I do not recall the exact context; I believe this was DeepSeek R1, back when “reasoning” was new, and the Julia code was generated for phase 1 of some neural compression proof-of-concept code I was doing):

“Thought for 732 seconds”
[It output the thinking process, 48 screenfuls! Something you can hide or show, unlike with OpenAI, and it at least started along the same lines as I was thinking; but yes, I did NOT read all of it, just confirmed the final code works.]

So the code for compression seems to handle all cases. [..]

4 Likes

@Palli That approach has some great results for a relatively small model: 27M parameters. I really like the coupling of a fast and a slow thinker internally. I agree with you, it does seem to be a big deal.

Initially I did not think it would be straightforward with the current packages, but all the key components do seem to be in Flux already. It looks like a good endeavor.

Are you interested in working on it? Is anyone else as well? For some time I have been focused on rudimentary packages mostly.

1 Like

Initially I did not think it would be straightforward with the current packages, but there is a repo for Flash Attention. … It is not the FA-2/3 but should work for a basic recreation.

I'm now most interested in this HRM model, and I think it makes Flash Attention redundant (the paper doesn't mention it, though it does say "Since HRM focuses on reasoning, full attention is applied for simplicity."). Since it doesn't use backpropagation through time, or rather the paper mentions BPTT and avoids that variant, optimizing its memory from O(T) to O(1), I think Flash Attention is out too.

From the paper:

Furthermore, we propose a one-step gradient approximation for training HRM, which offers improved efficiency and eliminates the requirement for BPTT. This design maintains a constant memory footprint (O(1) compared to BPTT’s O(T ) for T timesteps) throughout the backpropagation process, making it scalable and more biologically plausible.

This heavy memory burden forces smaller batch sizes and leads to poor GPU utilization, especially for large-scale networks. Additionally, because retaining the full history trace through time is biologically implausible, it is unlikely that the brain implements BPTT.
Fortunately, if a recurrent neural network converges to a fixed point, we can avoid unrolling its state sequence by applying backpropagation in a single step at that equilibrium point. Moreover, such a mechanism could plausibly be implemented in the brain using only local learning rules

The above method needs O(1) memory, does not require unrolling through time, and can be easily implemented with an autograd framework such as PyTorch, as shown in Figure 4. Given that each module only needs to back-propagate errors through its most recent local synaptic activity, this approach aligns well with the perspective that cortical credit assignment relies on short-range, temporally local mechanisms rather than on a global replay of activity patterns.

The one-step gradient approximation is theoretically grounded in the mathematics of Deep Equilibrium Models (DEQ) which employs the Implicit Function Theorem (IFT) to bypass BPTT, as detailed next. … The Implicit Function Theorem then allows us to calculate the exact gradient of the fixed point z⋆_H [that copies badly] with respect to the parameters θ without explicit backpropagation:

I.e., here is what I mentioned from Fig. 4 (it copied badly, so reformatted):

def hrm(z, x, N=2, T=2):
    x = input_embedding(x)
    zH, zL = z
    with torch.no_grad():
        for _i in range(N * T - 1):
            zL = L_net(zL, zH, x)
            if (_i + 1) % T == 0:
                zH = H_net(zH, zL)
    # 1-step grad
    zL = L_net(zL, zH, x)
    zH = H_net(zH, zL)
    return (zH, zL), output_head(zH)

# Deep Supervision
for x, y_true in train_dataloader:
    z = z_init
    for step in range(N_supervision):
        z, y_hat = hrm(z, x)
        loss = softmax_cross_entropy(y_hat, y_true)
        z = z.detach()
        loss.backward()
        opt.step()
        opt.zero_grad()

Figure 4: Top: Diagram of HRM with approximate gradient. Bottom: Pseudocode of HRM with deep supervision training in PyTorch.
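
For anyone wanting to try this in Julia, here is a hedged sketch of the same loop with Flux (the L_net/H_net/output_head below are placeholder MLPs standing in for the paper's Transformer blocks, and all dimensions are arbitrary). The torch.no_grad() block maps naturally onto ChainRulesCore.ignore_derivatives, and z.detach() needs no analogue because plain Julia arrays carry no autodiff graph between gradient calls:

using Flux
using ChainRulesCore: ignore_derivatives

const D = 64
L_net = Chain(Dense(3 * D => D, relu), Dense(D => D))   # low-level module (placeholder MLP)
H_net = Chain(Dense(2 * D => D, relu), Dense(D => D))   # high-level module (placeholder MLP)
output_head = Dense(D => 10)

# Run the first N*T - 1 coupled updates (these will not be differentiated).
function warmup(zH, zL, x, N, T)
    for i in 1:(N * T - 1)
        zL = L_net(vcat(zL, zH, x))
        i % T == 0 && (zH = H_net(vcat(zH, zL)))
    end
    return zH, zL
end

function hrm(z, x; N = 2, T = 2)
    zH, zL = z
    # All but the final L/H updates run without gradient tracking (the paper's no_grad block).
    zH, zL = ignore_derivatives(() -> warmup(zH, zL, x, N, T))
    # 1-step gradient: only these last two updates are differentiated.
    zL = L_net(vcat(zL, zH, x))
    zH = H_net(vcat(zH, zL))
    return (zH, zL), output_head(zH)
end

x  = rand(Float32, D, 8)                           # batch of 8 placeholder inputs
z0 = (zeros(Float32, D, 8), zeros(Float32, D, 8))
(z, ŷ) = hrm(z0, x)   # for deep supervision, call hrm repeatedly and take Flux.gradient per segment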

The paper references these:
Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, and Geoffrey Hinton. Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020.
https://www.nature.com/articles/s41583-020-0277-3

Although feedback connections are ubiquitous in the cortex, it is difficult to see how they could deliver the error signals required by strict formulations of backpropagation. Here we build on past and recent developments to argue that feedback connections may instead induce neural activities whose differences can be used to locally approximate these signals and hence drive effective learning in deep networks in the brain.

and:
Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11, 2016.

But if you want to implement it anyway see:

But see also FP4, MXFP4, and NVFP4, for it or other things:

2 Likes

@Palli, that is very encouraging. I will read those references in more detail. But based on what you have understood, do you think this is a good opportunity to pursue? Do you see any gaps in the Julia ecosystem that would hinder progress? I have skimmed over the GitHub code and have found only certain aspects which I think would require workarounds. If you think this is a good route, maybe it should become a joint effort?

Regarding the precision: it may not be as efficient, but for a prototype FP16 or FP8 should be OK, right?

1 Like

@Palli, I spent some more time studying the paper, and looked at the recent API surface of Flux and even Lux. I think Lux is slightly more suitable for the HRM, with its direct, explicit state management. I have made a very minimal exploration in a notebook, using an MLP instead of a transformer block, just to test the L and H modules in operation with each other. I used a basic regression task and it seems to work. You can check out the code.

I will work on a next revision which will bring it closer to the paper’s implementation. Feel free to join in, or share any ideas.
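
For anyone curious what the explicit state handling looks like, here is a minimal hedged Lux sketch of coupled L/H updates with placeholder MLPs (dimensions arbitrary; this is not the notebook's code):

using Lux, Random

rng = Random.default_rng()
d = 16
L_net = Chain(Dense(3 * d => d, relu), Dense(d => d))
H_net = Chain(Dense(2 * d => d, relu), Dense(d => d))

psL, stL = Lux.setup(rng, L_net)
psH, stH = Lux.setup(rng, H_net)

# Parameters and state are explicit arguments, so one coupled update is simply:
x, zL, zH = randn(Float32, d, 1), zeros(Float32, d, 1), zeros(Float32, d, 1)
zL, stL = L_net(vcat(zL, zH, x), psL, stL)
zH, stH = H_net(vcat(zH, zL), psH, stH)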

6 Likes

The approach seems interesting, but I have heard that in the HRM paper the authors actually trained on the test data, and that the number of examples used was much larger than advertised because they use augmentation methods on the initial sets of data. I'm not sure if these critiques are right or not; I just wanted to let you know about them. Do you know whether they hold?

@Tortar, they augmented it by adjusting the actual samples; I am aware, thanks. I see it as similar to what people do when training CNNs for images: they rotate the data and apply various transforms to the original data points, producing a lot more samples from a single scene. But I do not see where the underlying ‘reasoning’ behind the solutions was directly perturbed to increase the sample size. (I did not see that they trained directly on the test split.)

3 Likes

The tests of HRM in the original paper are legit, as verified by "The Hidden Drivers of HRM's Performance on ARC-AGI". However, the driver of its reasoning performance seems unrelated to its hierarchical reasoning architecture (its low- and high-frequency modes); the performance was mostly generated by the outer loop and the data augmentation technique.

5 Likes

I have a framework I have been working on, pure Julia and from scratch. It's on my GitHub, check it out.

2 Likes

I checked out the repo I think you were referring to, but could not make out the scope. Is it for LLMs or for more generic ML? There was quite a bit of model-selection code there.

Yeah, if you read the longroad.md it explains a lot. It's a domain-based AI that I pushed as far as my expertise can go; I made the data server for it and the security module too. I would be open to suggestions or help with it.

2 Likes

I briefly went through your repo. It looks interesting. May I ask: do you think your proposal may lead to gaining a unique competitive advantage, particularly in relation to (or over) Transformer, Mamba, or Hierarchical Reasoning … architectures?

The BehavioralTrainingRegiment and TrainingCommandCenter require an event-driven architecture to handle real-time adaptation efficiently. […]

[…] Comprehensive monitoring stacks using Prometheus + Grafana provide metrics collection and visualization with automated alerting integration.

I have set up the following architecture for the other project I am working on (a minimal sketch of the Julia client side follows the list):

  • BeeGFS in the cloud

  • Julia container (using Libwebsockets.jl, YYJSON.jl, RDKafka.jl for an asynchronous, event-driven client-side architecture; I also have a version using HTTP.jl and JSON3.jl)

  • Julia transfers JSON data to:

    • Redpanda containers (brokers, console, connect, Kafka)
      • Redpanda transfers JSON data to:
        • QuestDB container
        • Arroyo container (which can also write back to Redpanda if needed)
  • Grafana local and Grafana Private Data Source containers, both reading from QuestDB.
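
As a minimal sketch of the Julia-to-Redpanda hop in the pipeline above, assuming Redpanda's HTTP proxy (pandaproxy, Kafka-REST-compatible) is exposed on its default port; the address, topic name, and payload are placeholders:

using HTTP, JSON3

const PROXY = "http://localhost:8082"   # assumed pandaproxy address
const TOPIC = "telemetry"               # placeholder topic

function publish(record)
    body = JSON3.write((; records = [(; value = record)]))
    return HTTP.post("$PROXY/topics/$TOPIC",
                     ["Content-Type" => "application/vnd.kafka.json.v2+json"],
                     body)
end

publish((; sensor = "a1", value = 42.0, ts = time()))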

I guess you would like to see Prometheus, and probably not only QuestDB, but also an additional database capable of vector data types and similarity search on vectors? In addition, I guess you would like to see Dagger.jl instead of asynchronous processing? And Libwebsockets.jl acting as a server instead of a client? And Protobuf instead of JSON? And all running on a Kubernetes cluster with high availability and GPU/TPU support instead of with podman play kube? And some lighter middleware like Rembus.jl alongside Julia? At least at the start? Am I right?

Hi, are you offering equity or equity equivalents, and what is/are your objective(s), if I may ask?

GitHub - obsidianjulua/Julia-Net: julia build internet for real. I tried doing it the hard way.

1 Like

Quality code. I like Redpanda; please take a look if you have some spare time. It's like a Swiss Army knife. It is not the fastest option for data streaming in absolute terms, and it is not fully open source. However, there is a Community Edition under the Business Source License (BSL), and it is very reliable.

1 Like

I must have misunderstood something here, sorry.