AI tools to write (Julia) code (best/worst experience), e.g. ChatGPT, GPT 3.5

Also new:

I don’t know if they have a competitive advantage, for Julia or e.g. Python, but they have a video on it helping with translating COBOL to object-oriented Java (I don’t know if that’s actually better; I see lots of getters that just return something…, I suppose they should rather translate to Julia :slight_smile: ). Potentially it can help with Julia directly, or with converting from e.g. Python. Yes, I found this via a sponsored link: How to scale a business-ready AI platform with watsonx: Q&A with IBM - Stack Overflow

This seems very intriguing (see e.g. the two demo videos there; could it be asked to use Julia?):

Microsoft also has the small Orca 2 model/paper; it’s been called a “GIANT Breakthrough For AI Logic/Reasoning”:

Orca 2: Teaching Small Language Models How to Reason

A.
We’ve had a breakthrough in AI with Gemini from Google’s DeepMind, beating GPT-4 on almost all benchmarks, and GPT-4V on all it was tested on.

And AlphaCode 2, based on Gemini, solves e.g. competitive programming problems, including one that only 0.2% of humans can solve (maybe because it involved dynamic programming, which people might not be familiar with?). More generally it is about 75% accurate, or up to 90% with human help.
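For anyone who hasn’t seen the term: dynamic programming just means breaking a problem into overlapping subproblems and caching their solutions. A minimal Julia sketch (the classic coin-change count, not the actual contest problem):

```julia
# Minimal dynamic-programming sketch (not the actual contest problem):
# count the ways to make `amount` from `coins`, reusing subproblem results.
function count_ways(coins::Vector{Int}, amount::Int)
    ways = zeros(Int, amount + 1)
    ways[1] = 1                          # one way to make 0 (index offset by 1)
    for c in coins
        for a in c:amount
            ways[a + 1] += ways[a - c + 1]
        end
    end
    return ways[amount + 1]
end

count_ways([1, 2, 5], 11)  # => 11
```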

From natural image, audio and video understanding to mathematical reasoning, Gemini Ultra’s performance exceeds current state-of-the-art results on 30 of the 32 widely-used academic benchmarks used in large language model (LLM) research and development.

With a score of 90.0%, Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding), which uses a combination of 57 subjects such as math, physics, history, law, medicine and ethics for testing both world knowledge and problem-solving abilities.

Our new benchmark approach to MMLU enables Gemini to use its reasoning capabilities to think more carefully before answering difficult questions, leading to significant improvements over just using its first impression.
[table]
Gemini surpasses state-of-the-art performance on a range of benchmarks including text and coding.

Gemini Ultra also achieves a state-of-the-art score of 59.4% on the new MMMU benchmark, which consists of multimodal tasks spanning different domains requiring deliberate reasoning.

For speech recognition (FLEURS; a metric where lower is better, I think it’s an error rate), Gemini gets 7.6% vs. 17.6% for Whisper v3, which was state-of-the-art until very recently. It also massively improves on Whisper’s BLEU score for speech translation, though for some strange reason there the comparison is to Whisper v2.
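For reference, a word error rate of that kind is basically a word-level edit distance divided by the reference length; a rough Julia sketch of the computation (my own illustration, not the actual evaluation code):

```julia
# Word error rate: word-level Levenshtein distance / number of reference words.
function word_error_rate(reference::AbstractString, hypothesis::AbstractString)
    r, h = split(reference), split(hypothesis)
    d = zeros(Int, length(r) + 1, length(h) + 1)
    d[:, 1] = 0:length(r)
    d[1, :] = 0:length(h)
    for i in 1:length(r), j in 1:length(h)
        cost = r[i] == h[j] ? 0 : 1
        d[i+1, j+1] = min(d[i, j+1] + 1,    # deletion
                          d[i+1, j] + 1,    # insertion
                          d[i, j] + cost)   # substitution
    end
    return d[end, end] / length(r)
end

word_error_rate("the cat sat on the mat", "the cat sat on a mat")  # ≈ 0.167 (1 error / 6 words)
```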

They do not mention AGI (artificial general intelligence), but it seems close. What I see missing, most importantly, is continual learning (there’s research on that possibility; “catastrophic forgetting” was a problem. I doubt Grok is best in class there, but it’s at least claiming to learn over time). Even with an AI learning over time, interaction is currently still turn-based (as with chess and most board games): you prompt, it answers, and so on back and forth, without the AI ever interrupting. That’s often considered a good thing, but non-turn-based operation might be required for AGI, i.e. for it to set its own goals and work autonomously to execute and prioritize them. DeepMind has some experience with non-turn-based settings, e.g. their StarCraft AI, I think. It’s unclear if we even want AGI, i.e. something autonomous that is no longer a tool and likely can’t be controlled; but could a turn-based tool still be AGI?

They’ve slightly exceeded the human-expert level (on MMLU) at 89.9%. We of course want 100%, but that’s not how AGI is defined, and 90%+ is already pretty good. I’m not sure what the average human would score; I suspect Gemini already exceeds average humans on all the benchmarks it was tested on (we can make more challenging tests, including for programming). It’s not yet beating all humans, or all experts (just the average on some tests), and not on real-time tasks, or on “non-intelligence”, physical tasks, i.e. movement and acting in the real world. But AGI doesn’t depend on that; at least historically it hasn’t been defined that way (otherwise physically disabled people wouldn’t have general intelligence).

Gemini was apparently not trained on Nvidia GPUs (though the models might be run on them), and the hardware side is also interesting:

We trained Gemini models using TPUv5e and TPUv4 (Jouppi et al., 2023), depending on their sizes and configuration. Training Gemini Ultra used a large fleet of TPUv4 accelerators across multiple datacenters […]

TPUv4 accelerators are deployed in “SuperPods” of 4096 chips, each connected to a dedicated optical switch, which can dynamically reconfigure 4x4x4 chip cubes into arbitrary 3D torus topologies in around 10 seconds […]

The GSPMD partitioner (Xu et al., 2021) in the XLA compiler partitions the training step computation, and the MegaScale XLA compiler (XLA, 2019) pass statically schedules appropriate collectives so that they maximally overlap with the computation with very little variation in step time.

Maintaining a high goodput at this scale would have been impossible using the conventional approach of periodic checkpointing of weights to persistent cluster storage. For Gemini, we instead made use of redundant in-memory copies of the model state, and on any unplanned hardware failures, we rapidly recover directly from an intact model replica. Compared to both PaLM and PaLM-2 (Anil et al., 2023), this provided a substantial speedup in recovery time, despite the significantly larger training resources being used. As a result, the overall goodput for the largest-scale training job increased from 85% to 97%

Training at unprecedented scale invariably surfaces new and interesting systems failure modes - and in this instance one of the problems that we needed to address was that of “Silent Data Corruption (SDC)” (Dixit et al., 2021; Hochschild et al., 2021; Vishwanathan et al., 2015). Although these are extremely rare, the scale of Gemini means that we can expect SDC events to impact training every week or two.

B.

Simultaneous translation (also known as real-time or streaming translation) is the task of generating translations incrementally given partial input only.

So on Nov. 30:

[…] In addition, […] The experimental results demonstrate that the proposed model attains state-of-the-art performance in simultaneous speech-to-text translation on the Spanish and English translation task.

3 Likes

Well, that illustrates the problem with AI language models: they produce grammatically valid sentences, but content correctness is hit or miss: about 50% of the listed solutions are invalid. 21, 27, 33, and 39 are not prime numbers.
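Indeed, easy to check in the REPL with Primes.jl:

```julia
using Primes  # ] add Primes

filter(isprime, [21, 27, 33, 39])   # => Int64[]  (none of them are prime)
# 21 = 3 * 7, 27 = 3^3, 33 = 3 * 11, 39 = 3 * 13
```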

2 Likes

https://siml.earth/Julia-LLM-Leaderboard/stable/

2 Likes

[It’s an updated graph, the same as on GitHub I think, but the ones in the docs are outdated; maybe update them, show a disclaimer, or just drop them there, to not confuse people… or rather link to the README file.]

Looks great; it would be very interesting if you could try out e.g.:

The only model tagged with Julia on Hugging Face. Also Jamba, which it’s based on, is very intriguing since it’s not a Transformer (or not purely; it’s a hybrid. I believe you only show Transformers; are any of them “Universal Transformers”, which are said to be better?):

I probably wrote way too much there, and quoted too much; I’m just excited about the future. See at the bottom there for more models, likely good ones.

Also, for (many) more models, e.g. from this week, see:

such as:

And, like Jamba, another hybrid: Zamba — Zyphra

If I want to try any of these out, how do I do that? Currently, I use Codeium in VSCode, and that is really awesome.

You can run open-source models locally; depending on the model you may need up to 80 GB of VRAM (and one older model 400 GB of RAM…), while others work on most GPUs.
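To give one concrete (hedged) sketch: if you run a model locally with Ollama, you can query it from Julia over its local REST API, roughly like this (the model name and prompt are just placeholders):

```julia
using HTTP, JSON3  # ] add HTTP JSON3

# Assumes an Ollama server is running locally (default port 11434)
# and that you have pulled a model, e.g. `ollama pull llama3`.
function ask_local_llm(prompt::AbstractString; model::AbstractString = "llama3")
    body = JSON3.write((; model, prompt, stream = false))
    resp = HTTP.post("http://localhost:11434/api/generate",
                     ["Content-Type" => "application/json"], body)
    return JSON3.read(resp.body).response
end

println(ask_local_llm("Write a Julia function that reverses a string."))
```

PromptingTools.jl (which I believe the leaderboard linked above builds on) wraps this kind of call more conveniently, and hosted OpenAI-compatible endpoints work the same way over HTTP.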

Many models are available on poe.com, and I suppose other places; I’m not sure which has the most, and the most recent. For code specifically, and integration e.g. in VS Code, you may have to wait, or maybe someone already has a way to connect to an arbitrary (local) model? They need not be local, and companies are certainly willing to sell you access to their models.

I don’t know if Zamba is accessible yet, but it surely will be; it was trained in only a month, a new record for training (not just fine-tuning; you can do that in a day, e.g. at home for Julia, please do…).

1 Like

Llama3 just came out. It’s trained on 15T(!) tokens.

Llama 3 models take data and scale to new heights. It’s been trained on our two recently announced custom-built 24K GPU clusters on over 15T tokens of data – a training dataset 7x larger than that used for Llama 2, including 4x more code. This results in the most capable Llama model yet, which supports an 8K context length that doubles the capacity of Llama 2.

4 Likes

Yes, I saw that; it was released this hour, and the code repo was also updated within the hour (seemingly continuously, something a minute ago). But I do not see similar code capabilities claimed, or a similar codellama [EDIT: since code generation is part of Llama3 we may not be expecting a codellama fine-tune of it, unless unofficially, e.g. if you want to do it for Julia; and the forthcoming 405B variant of it is intriguing…]; that was a fine-tune for version 2, I believe, and maybe one is forthcoming officially or unofficially. It was made with “7.7M GPU hours of computation on hardware of type H100-80GB”.

However, there’s also this, 3 hours old, I believe also for code:

Yeah, I would like to use that Llama3 model :smile:

Mark

“… for Llama 3, we focused a lot on coding”

1 Like

Unless Llama3 changes things (and I do now see code generation mentioned, though not the best metric, HumanEval; it’s also beaten on that by the model below, but there could be “contamination” for either, making the score too high), I think this may be the best open model:

This model is based on deepseek-coder-33b-base.

Its paper from late February: https://arxiv.org/pdf/2402.14658.pdf

Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average (and plus versions) of HumanEval and MBPP, closely rivaling GPT-4’s 84.2 (76.2) and further elevates to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreter bridges the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.
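The “interpreter” part, i.e. integrating generation with execution and iterative refinement, is conceptually simple. A purely hypothetical Julia sketch of such a loop (`ask_llm` is a placeholder for whatever model backend you use, e.g. the local call sketched earlier in the thread; this is not OpenCodeInterpreter’s actual pipeline):

```julia
# Hypothetical sketch of an execution-feedback loop; `ask_llm(prompt)::String`
# is a placeholder for any model backend.
function generate_with_feedback(ask_llm, task::AbstractString; max_rounds::Int = 3)
    code = ask_llm("Write Julia code for the following task:\n$task")
    for _ in 1:max_rounds
        try
            Base.include_string(Module(), code)   # run the candidate in a throwaway module
            return code                           # it ran without throwing
        catch err
            code = ask_llm("This Julia code failed with `$(sprint(showerror, err))`; please fix it:\n$code")
        end
    end
    return code
end
```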

Maybe AgentCoder can be applied not just to GPT-4 (then widening its lead to 91.8 on MBPP), but e.g. to OpenCodeInterpreter.

Tricks like using AgentCoder with it, or with Claude 3 Opus, most likely help (likely what Devin, Devika, and OpenDevin do), as do “self-debugging”, “Turbulence”-style testing and, if it applies, LATS (likely NOT outdated) too, i.e. all the tools concurrently; e.g.:

From a late-March paper:

This paper introduces LCG, a code generation framework inspired by established software engineering practices. LCG leverages multiple Large Language Model (LLM) agents to emulate various software process models, namely LCGWaterfall, LCGTDD, and LCGScrum. Each model assigns LLM agents specific roles such as requirement engineer, architect, developer, tester, and scrum master, mirroring typical development activities and communication patterns. Through collaborative efforts utilizing chain-of-thought and prompt composition techniques, the agents continuously refine themselves to enhance code quality. Utilizing GPT3.5 as the underlying LLM and baseline (GPT), we evaluate LCG across four code generation benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Results indicate LCGScrum outperforms other models, achieving Pass@1 scores of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively—an average 15% improvement over GPT. Analysis reveals distinct impacts of development activities on generated code, with design and code reviews contributing to enhanced exception handling, while design, testing, and code reviews mitigate code smells. […] This stability underscores the importance of adopting software process models to bolster the quality and consistency of LLM-generated code.

Interesting new tool:

Turbulence consists of a large set of natural language question templates, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated test oracle that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a neighborhood of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM’s code generation abilities to be identified, including where the LLM correctly solves almost all questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta

[Italics in the original, e.g. on question templates, which it seems Llama3 is using. Note that the italics were dropped when copy-pasting, so please be careful when doing that…]
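To make the template-plus-oracle idea concrete, here is my own toy Julia illustration (not code from the paper): one parameterised template, many instantiations, and an oracle that judges each returned solution.

```julia
# Toy illustration of the "question template + test oracle" idea (not the paper's code).
question(n) = "Write a Julia function `f(x)` that returns the sum of the first $n powers of x."

# Oracle: run the candidate definition and compare against a trusted reference.
function oracle(candidate_code::AbstractString, n::Int)
    m = Module()                                       # evaluate in a throwaway module
    f = Base.include_string(m, candidate_code * "\nf") # last expression returns `f`
    reference(x) = sum(x^k for k in 1:n)
    return all(f(x) == reference(x) for x in 1:5)
end

oracle("f(x) = sum(x^k for k in 1:3)", 3)  # => true
```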

[…] Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

Since then Claude 3 Opus jumped way ahead of GPT4 (and Gemini Ultra).

In particular, LATS achieves 94.4% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.

Self-Debugging improves the baseline accuracy by up to 12%.

[It’s applied to GPT-4, but possibly, at that time a year ago, to an older GPT-4, so its #9 ranking may be misleading.]

Still the most used(?), though maybe not the best, benchmarks: MBPP Benchmark (Code Generation) | Papers With Code and HumanEval Benchmark (Code Generation) | Papers With Code

Regarding Google Gemini 1.5 (and data contamination of code evaluation benchmarks):

Gemini 1.5 Pro is our best performing model in code to date, surpassing Gemini 1.0 Ultra on Natural2Code, our internal held-out code generation test set made to prevent web-leakage.

HumanEval leakage HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conservative filtering heuristics. An analysis of the test data leakage of Gemini 1.0 Ultra showed that continued pretraining on a dataset containing even a single epoch of the test split for HumanEval boosted scores from 74.4% to 89.0%, highlighting the danger of data contamination. […]

So claims like this may not always be meaningful (I forget where this one came from):

Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.

Llama-3 still claims that slices in Julia are views by default. Still, only Claude 3 and ChatGPT 4 can answer that question correctly.
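(For the record, slices copy by default in Julia; views are opt-in:)

```julia
A = [1, 2, 3, 4]

s = A[1:2]        # slice: makes a copy
s[1] = 99
A                 # => [1, 2, 3, 4]   (unchanged)

v = @view A[1:2]  # explicit view
v[1] = 99
A                 # => [99, 2, 3, 4]  (modified through the view)
```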

2 Likes

Maybe you can point it out, i.e. put it in a (system) prompt. I find it hilarious that prompts like “Take a deep breath and think this through [and keep in mind views in Julia…]” actually improve LLM results (I suppose for code too; I heard about it in another context)!
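For example, with the local Ollama call sketched earlier in the thread, a system prompt is just one extra field (again only a sketch; the wording of the hint is up to you):

```julia
using HTTP, JSON3

# Same local Ollama endpoint as before, now with a system prompt
# reminding the model about Julia slice semantics.
body = JSON3.write((;
    model  = "llama3",
    system = "You write Julia code. Remember: slices like A[1:2] copy by default; use @view for views.",
    prompt = "Do slices in Julia return views or copies?",
    stream = false))
resp = HTTP.post("http://localhost:11434/api/generate",
                 ["Content-Type" => "application/json"], body)
println(JSON3.read(resp.body).response)
```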

1 Like

It was worth a try though, haha, but in any case it’d be unreliable if it’s prompt-specific.

1 Like

I wonder where they get the wrong info from.

Doesn’t that seem like they are asking the wrong source, as opposed to the model not being sophisticated enough?

1 Like

Literally every single model answers that way (except for ChatGPT 4 and Claude 3 Opus) or gives a wrong answer, even some of the top models.

For example, the following is Claude-3 Sonnet

and Mistral Large

2 Likes

Another thing: in all the models (including ChatGPT 4 and Claude 3 Opus), if you ask them to optimize an operation (e.g., a simple for-loop), they never wrap the code in a function, even if you tell them “I’m going to copy and paste whatever you give me, so don’t assume anything”.
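That omission matters in Julia, since a loop over an untyped global is exactly what the performance tips warn about; wrapping it in a function is usually the first optimization, e.g.:

```julia
using BenchmarkTools  # ] add BenchmarkTools

x = rand(10^6)

# What the models tend to produce: a bare loop in global scope (slow, untyped global `s`)
s = 0.0
for v in x
    global s += v
end

# What you actually want: the same loop wrapped in a function
function mysum(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end

@btime mysum($x)   # typically far faster than the global-scope loop above
```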

3 Likes

Let’s see how Llama3 405B behaves, I guess :slight_smile:

1 Like

I feel obligated to state that state-space models (SSMs) AND Transformers have provable limitations for code-related and e.g. chess tasks, i.e. sequential ones (I can dig up the paper; Mamba is also affected as I recall, and then likely all Mamba/Transformer hybrids such as Jamba, which I posted about recently, but the paper has a solution, a small fix to SSMs).

Llama3 is brand-new but already outdated, even though it was the best open-source model. The more recent Phi-3 and Arctic are very intriguing (also the “Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention” Google paper/video, see below; I suppose the 1 million token context length of Gemini 1.5 is based on it). The original Phi showed that the training data is very important, and we have similar ideas for code, with developments from the last day (its paper doesn’t mention Julia, only other languages, though it might apply to it, or at least should theoretically). WaveCoder-Ultra-6.7B seems to be better, but only the others were updated recently:

[…] Hence, we introduce CodeOcean, a dataset comprising 20,000 instruction instances across 4 universal code-related tasks, which is aimed at augmenting the effectiveness of instruction tuning and improving the generalization ability of fine-tuned model. Subsequently, we present WaveCoder, a fine-tuned Code LLM with Widespread And Versatile Enhanced instruction tuning. Our experiments demonstrate that WaveCoder outperforms other open-source models in terms of generalization ability across different code-related tasks at the same level of fine-tuning scale. Moreover, WaveCoder exhibits high efficiency in previous code generation tasks. This paper thus offers a significant contribution to the field of instruction data generation and fine-tuning models, providing new insights and tools for enhancing performance in code-related tasks

Apparently WaveCoder-DS-6.7B beats GPT 4 on Python and on the average across languages (and WaveCoder-CL-13B on Go, and almost on Rust) in some cases:

Table 4: Results of pass@1 on the HumanEvalFix benchmark. [Note: not to be confused with HumanEval; GPT 4 is still winning on that metric, see Table 3.]

You can try Arctic here (please do, for something more complex; I got a good short answer and Fibonacci code) with the default (Julia) prompt I always try first:

What is the Julia language and can you show me example code?

Enterprises want to use LLMs to build conversational SQL data copilots, code copilots and RAG chatbots. From a metrics perspective, this translates to LLMs that excel at SQL, code, complex instruction following and the ability to produce grounded answers. We capture these abilities into a single metric we call enterprise intelligence by taking an average of Coding (HumanEval+ and MBPP+), SQL Generation (Spider) and Instruction following (IFEval).

Arctic offers top-tier enterprise intelligence among open source LLMs, and it does so using a training compute budget of roughly under $2 million (less than 3K GPU weeks). This means Arctic is more capable than other open source models trained with a similar compute budget. More importantly, it excels at enterprise intelligence, even when compared to those trained with a significantly higher compute budget. The high training efficiency of Arctic also means that Snowflake customers and the AI community at large can train custom models in a much more affordable way.

As seen in Figure 1, Arctic is on par or better than both LLAMA 3 8B and LLAMA 2 70B on enterprise metrics, while using less than ½ of the training compute budget. Similarly, despite using 17x less compute budget, Arctic is on par with Llama3 70B in enterprise metrics like Coding (HumanEval+ & MBPP+), SQL (Spider) and Instruction Following (IFEval). It does so while remaining competitive on overall performance. For example, despite using 7x less compute than DBRX it remains competitive on Language Understanding and Reasoning (a collection of 11 metrics) while being better in Math (GSM8K). For a detailed breakdown of results by individual benchmark, see the Metrics section.

To achieve this level of training efficiency, Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating. It was designed and trained using the following three key insights and innovations:

Based on this insight, Arctic is designed to have 480B parameters spread across 128 fine-grained experts and uses top-2 gating to choose 17B active parameters. In contrast, recent MoE models are built with significantly fewer experts as shown in Table 2.
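Those parameter counts check out with quick back-of-the-envelope arithmetic:

```julia
dense_params  = 10e9     # 10B dense transformer
expert_params = 3.66e9   # each MoE MLP expert
n_experts     = 128      # fine-grained experts
top_k         = 2        # top-2 gating

total  = dense_params + n_experts * expert_params  # ≈ 4.78e11, i.e. the ~480B total parameters
active = dense_params + top_k * expert_params      # ≈ 1.73e10, i.e. the ~17B active parameters
```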

  1. Enterprise-Focused Data Curriculum: Excelling at enterprise metrics like Code Generation and SQL requires a vastly different data curriculum than training models for generic metrics.

Arctic has a throughput of over 70+ tokens/second for effective interactive serving.

At this point, Arctic incurs 4x less compute than CodeLlama 70B and Llama 3 70B.

Table 3. Full Metrics Table. Comparing Snowflake Arctic with DBRX, LLAMA-3 8B, LLAMA-3 70B, Mixtral 8x7B, Mixtral 8x22B (instruction-tuned or chat variants if available).

It scores better than Google Gecko, though that one also seems interesting:

On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.

Beating GPT 3.5 with a mobile-phone LLM (and the larger Phi-3 models beat e.g. Llama-3-Instruct-8B on most metrics):

The Phi-3-Mini-4K-Instruct is a 3.8B parameters, lightweight, state-of-the-art open model trained with the Phi-3 datasets that includes both synthetic data and the filtered publicly available websites data with a focus on high-quality and reasoning dense properties. The model belongs to the Phi-3 family with the Mini version in two variants 4K and 128K which is the context length (in tokens) that it can support.

Hilarious quote from the Phi-3 model: “I’m sorry, as an AI developed by OpenAI” (yes, Microsoft is its partner, and is allowed, unlike others, to use its tech to build smaller models):

Great idea (also applicable to code? 10% better then?): much better SOTA accuracy, and 5x to 10x faster, matching “gpt4-early (pal)” on the MATH metric, though behind gpt-4-turbo-2024-04-09 (cot); it gets 94.5 on MAWPS, an “online repository of Math Word Problems” (beating all small models, and almost all models, as gpt4-early gets 97.7):

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that “Not all tokens in a corpus are equally important for language model training”. […]
Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves 6.8% average enhancement across 15 diverse tasks, increasing both efficiency and performance of the language model pre-training.

Training Tinyllama-1B on 80B tokens with SLM improves 6.8% on average across 15 benchmarks, with gains over 10% in code and math tasks.
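The Selective Language Modeling (SLM) idea behind those numbers is easy to sketch: rank tokens by how much worse the training model does than a reference model, and keep the loss only for the top fraction. A schematic, framework-free Julia sketch (my illustration, not the paper’s implementation):

```julia
# Schematic sketch of Selective Language Modeling (not the paper's code):
# keep the loss only for the tokens with the largest "excess loss" relative
# to a reference model, i.e. the tokens judged most useful to train on.
function selective_loss(model_losses::Vector{Float64},
                        reference_losses::Vector{Float64};
                        keep_fraction::Float64 = 0.6)
    excess = model_losses .- reference_losses
    k = max(1, round(Int, keep_fraction * length(excess)))
    keep = partialsortperm(excess, 1:k; rev = true)   # indices of largest excess loss
    return sum(model_losses[keep]) / k                # average over selected tokens only
end

selective_loss([2.3, 0.1, 1.8, 0.05], [1.0, 0.1, 0.2, 0.05]; keep_fraction = 0.5)
# => 2.05 (averages only the two high-excess-loss tokens)
```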

Timewarp is a neural network that predicts the future 3D positions of a small peptide (2-4 amino acids) based on its current state. It is a research project that investigates using deep learning to accelerate molecular dynamics simulations.

2 Likes

One of the core contributors is from our team. We discussed applying the same approach to Julia, but I shifted to something more urgent later. If anyone else here is still interested in trying out this idea, please let me know.

1 Like