AI tools to write (Julia) code (best/worst experience), e.g. ChatGPT, GPT 3.5

I feel obligated to state that state-space models (SSMs) and Transformers have provable limitations on sequential tasks, which is relevant for code and e.g. chess (I can dig up the paper; Mamba is affected too as I recall, and then likely all Mamba/Transformer hybrids such as Jamba, which I posted about recently, but the paper has a solution, a small fix to SSMs).

Llama 3 is brand-new but already outdated, even though it was the best open-source model. The more recent Phi-3 and Arctic are very intriguing (see also the "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" Google paper/video below; I suppose the 1-million-token context length of Gemini 1.5 is based on it). The original Phi showed that the training data is very important, and there are similar ideas for code in developments from the last day (its paper doesn't mention Julia, only other languages, though the results might apply to it, or at least should in theory). WaveCoder-Ultra-6.7B seems to be better, but only the other models were updated recently:

[…] Hence, we introduce CodeOcean, a dataset comprising 20,000 instruction instances across 4 universal code-related tasks, which is aimed at augmenting the effectiveness of instruction tuning and improving the generalization ability of fine-tuned model. Subsequently, we present WaveCoder, a fine-tuned Code LLM with Widespread And Versatile Enhanced instruction tuning. Our experiments demonstrate that WaveCoder outperforms other open-source models in terms of generalization ability across different code-related tasks at the same level of fine-tuning scale. Moreover, WaveCoder exhibits high efficiency in previous code generation tasks. This paper thus offers a significant contribution to the field of instruction data generation and fine-tuning models, providing new insights and tools for enhancing performance in code-related tasks

Apparently WaveCoder-DS-6.7B beats GPT-4 on Python and on average across languages (and WaveCoder-CL-13B beats it on Go, and almost on Rust) in some cases:

Table 4: Results of pass@1 on the HumanEvalFix benchmark. [Note: not to be confused with HumanEval, where GPT-4 still wins on that metric; see Table 3.]

You can try Arctic here (please do, for something more complex; I got a good short answer and Fibonacci code) for the default (Julia) prompt I always try first:

What is the Julia language and can you show me example code?
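For reference, a typical answer to that prompt includes something like the following Fibonacci snippet (illustrative of what these models usually produce; not the verbatim output of Arctic or any other model):

```julia
# Iterative Fibonacci, the kind of example code LLMs tend to return
# for the prompt above (illustrative, not any model's actual output).
function fib(n::Integer)
    n <= 1 && return n
    a, b = 0, 1
    for _ in 2:n
        a, b = b, a + b
    end
    return b
end

fib(10)  # 55
```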

Enterprises want to use LLMs to build conversational SQL data copilots, code copilots and RAG chatbots. From a metrics perspective, this translates to LLMs that excel at SQL, code, complex instruction following and the ability to produce grounded answers. We capture these abilities into a single metric we call enterprise intelligence by taking an average of Coding (HumanEval+ and MBPP+), SQL Generation (Spider) and Instruction following (IFEval).

Arctic offers top-tier enterprise intelligence among open source LLMs, and it does so using a training compute budget of roughly under $2 million (less than 3K GPU weeks). This means Arctic is more capable than other open source models trained with a similar compute budget. More importantly, it excels at enterprise intelligence, even when compared to those trained with a significantly higher compute budget. The high training efficiency of Arctic also means that Snowflake customers and the AI community at large can train custom models in a much more affordable way.

As seen in Figure 1, Arctic is on par or better than both LLAMA 3 8B and LLAMA 2 70B on enterprise metrics, while using less than ½ of the training compute budget. Similarly, despite using 17x less compute budget, Arctic is on par with Llama3 70B in enterprise metrics like Coding (HumanEval+ & MBPP+), SQL (Spider) and Instruction Following (IFEval). It does so while remaining competitive on overall performance. For example, despite using 7x less compute than DBRX it remains competitive on Language Understanding and Reasoning (a collection of 11 metrics) while being better in Math (GSM8K). For a detailed breakdown of results by individual benchmark, see the Metrics section.

To achieve this level of training efficiency, Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating. It was designed and trained using the following three key insights and innovations:

Based on this insight, Arctic is designed to have 480B parameters spread across 128 fine-grained experts and uses top-2 gating to choose 17B active parameters. In contrast, recent MoE models are built with significantly fewer experts as shown in Table 2.
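To make the top-2 gating idea concrete, here is a toy Julia sketch (my own simplification, not Arctic's actual architecture or code): a router scores every expert, but only the two best-scoring experts run for a given token, which is why only 17B of the 480B total parameters are active.

```julia
# Toy sketch of top-2 expert gating (illustrative, not Arctic's code).
# The router scores all experts; only the two highest-scoring ones run,
# weighted by a softmax over their scores.
function top2_gate(scores::Vector{Float64})
    order = sortperm(scores; rev=true)
    top2 = order[1:2]
    w = exp.(scores[top2] .- maximum(scores[top2]))
    w ./= sum(w)
    return top2, w
end

function moe_forward(x::Vector{Float64}, experts, router_scores)
    idx, w = top2_gate(router_scores)
    # Only the two selected experts execute: active params << total params.
    return w[1] .* experts[idx[1]](x) .+ w[2] .* experts[idx[2]](x)
end

experts = [x -> k .* x for k in 1:4]          # 4 toy experts (Arctic has 128)
scores  = [0.1, 2.0, 0.5, 1.5]                # router logits for this token
y = moe_forward([1.0, 2.0], experts, scores)  # combines experts 2 and 4
```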

  1. Enterprise-Focused Data Curriculum: Excelling at enterprise metrics like Code Generation and SQL requires a vastly different data curriculum than training models for generic metrics.

Arctic has a throughput of over 70+ tokens/second for effective interactive serving.

At this point, Arctic incurs 4x less compute than CodeLlama 70B and Llama 3 70B.

Table 3. Full Metrics Table. Comparing Snowflake Arctic with DBRX, LLAMA-3 8B, LLAMA-3 70B, Mixtral 8x7B, Mixtral 8x22B (instruction-tuned or chat variants if available).

It scores better than Google's Gecko, though Gecko also seems interesting:

On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.
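For context, an embedding model like Gecko maps each text to a fixed-length vector (256 or 768 dimensions in the quote above), and retrieval then compares those vectors, typically by cosine similarity. A toy Julia sketch with made-up 3-dimensional vectors (a real Gecko embedding would come from the model API):

```julia
# What an embedding model gives you: one fixed-length vector per text,
# compared by cosine similarity (toy 3-dim vectors; Gecko uses 256/768).
using LinearAlgebra

cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

query = [0.2, 0.9, 0.1]
doc_a = [0.1, 0.8, 0.2]   # semantically close to the query
doc_b = [0.9, 0.1, 0.0]   # unrelated

cosine(query, doc_a) > cosine(query, doc_b)  # true: doc_a ranks higher
```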

Beating GPT 3.5 with a phone-sized LLM (and the larger Phi-3 beats e.g. Llama-3-Instruct 8B on most metrics):

The Phi-3-Mini-4K-Instruct is a 3.8B parameters, lightweight, state-of-the-art open model trained with the Phi-3 datasets that includes both synthetic data and the filtered publicly available websites data with a focus on high-quality and reasoning dense properties. The model belongs to the Phi-3 family with the Mini version in two variants 4K and 128K which is the context length (in tokens) that it can support.

Hilarious quote from the Phi-3 model: “I’m sorry, as an AI developed by OpenAI” (yes, Microsoft is its partner, and is allowed, unlike others, to use OpenAI’s tech to build smaller models):

Great idea (does it also apply to code? 10% better there too?): much better SOTA accuracy, and 5x to 10x faster, matching “gpt4-early (pal)” on the MATH metric, though behind gpt-4-turbo-2024-04-09 (cot). It gets 94.5 on MAWPS, an “online repository of Math Word Problems” (beating all small models and almost all others; gpt4-early gets 97.7):

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that “Not all tokens in a corpus are equally important for language model training”. […]
Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves 6.8% average enhancement across 15 diverse tasks, increasing both efficiency and performance of the language model pre-training.

Training Tinyllama-1B on 80B tokens with SLM improves 6.8% on average across 15 benchmarks, with gains over 10% in code and math tasks.
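As I understand the selective language modeling (SLM) idea, the model computes a per-token loss, compares it against a reference model, and only trains on the most useful fraction of tokens. A toy Julia sketch of that selection step (the function name, threshold scheme, and numbers are mine, not the paper's):

```julia
# Toy illustration of selective token training (SLM): keep only the tokens
# whose excess loss (current model minus reference model) is in the top
# fraction, and compute the training loss over just those tokens.
function slm_token_mask(cur_loss::Vector{Float64}, ref_loss::Vector{Float64};
                        keep_fraction::Float64 = 0.6)
    excess = cur_loss .- ref_loss          # small excess = "easy" token
    k = max(1, round(Int, keep_fraction * length(excess)))
    threshold = sort(excess; rev=true)[k]
    return excess .>= threshold            # true = token contributes to loss
end

cur  = [2.1, 0.3, 1.8, 0.2, 2.5]           # current model's per-token loss
ref  = [0.5, 0.4, 0.6, 0.3, 0.5]           # reference model's per-token loss
mask = slm_token_mask(cur, ref; keep_fraction = 0.6)
selected_loss = sum(cur[mask]) / count(mask)   # loss over kept tokens only
```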

Timewarp is a neural network that predicts the future 3D positions of a small peptide (2–4 amino acids) based on its current state. It is a research project investigating the use of deep learning to accelerate molecular dynamics simulations.