[ANN] Julia LLM Leaderboard - Help us make it more relevant for everyday problems!

Hello Julia Community!

We’re excited to share with you the “Julia LLM Leaderboard” - a new project aimed at benchmarking various GenAI models for Julia code generation.

While our approach is super simple (perhaps naive?) – generate code, run it, and see if it works – our goal is quite ambitious: to determine which GenAI models and prompting strategies excel in producing syntactically correct Julia code to help you with choosing your “default” approach.

We’re calling on all of you to contribute and enrich our test cases. By adding your cases to the definition.toml file, you’ll be helping us and the wider Julia community in identifying the most efficient AI models for our specific needs. See examples/create_definition.jl for the workflow or ping us on Slack in the chatgpt channel.

Currently, our evaluation method involves parsing, execution, unit tests, and example runs, with each test case scoring up to 100 points. We have some preliminary results, but they need refinement. Your contributions will not only improve the robustness of our methodology but also help iron out bugs and inconsistencies.
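To make the 100-point idea concrete, here is a minimal sketch of how the four checks could be combined. The equal 25-point weights, the proportional credit, and the name `score_case` are assumptions for illustration, not taken from the leaderboard code.

```julia
# Hypothetical scoring sketch: four checks, 25 points each (assumed weights).
function score_case(; parsed::Bool, executed::Bool,
                    unit_passed::Int, unit_total::Int,
                    examples_passed::Int, examples_total::Int)
    # Proportional credit; a case with no tests of a kind earns nothing for it.
    frac(p, t) = t == 0 ? 0.0 : p / t
    return 25.0 * parsed + 25.0 * executed +
           25.0 * frac(unit_passed, unit_total) +
           25.0 * frac(examples_passed, examples_total)
end

# Parses and runs, passes 3 of 4 unit tests and 1 of 2 examples.
s = score_case(parsed = true, executed = true,
               unit_passed = 3, unit_total = 4,
               examples_passed = 1, examples_total = 2)
```

Under these assumed weights, a solution that parses and runs but fails every test would still get 50 points, which is one reason the exact weighting matters for the ranking.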

Let’s collaborate and make the Julia LLM Leaderboard a reliable resource for everyone. Bring your test cases and let’s see how these AI models really perform!

Toy result on 1 test case (treat as a preview):

16 Likes

Sounds awesome!
Interesting results; GPT-4 not scoring the highest everywhere bothers me a little.
What is the temperature for the tests?
I'll look into the tests!

1 Like

There are a few reasons for that - I’ll need to re-run all of it when I fix a few bugs in the eval logic.

E.g., NamedTuples are killing most OSS models, and many models include a ‘> ’ prompt before the code or don’t code-fence properly (including both code and outputs in code blocks).
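A minimal sketch of the kind of cleanup this calls for, assuming the reply is a plain string. `extract_code` is a made-up helper, not the parser used in PromptingTools, and it deliberately ignores multi-line REPL inputs.

```julia
# Hypothetical helper: pull runnable code out of a model reply.
function extract_code(reply::AbstractString)
    fence = "`"^3   # three backticks, built here to avoid nesting fences
    re = Regex(fence * "(?:julia)?[ \\t]*\\n(.*?)" * fence, "s")
    m = match(re, reply)
    # Use the first fenced block if there is one, otherwise the whole reply.
    body = m === nothing ? reply : m.captures[1]
    lines = split(body, '\n')
    if any(startswith(l, "julia> ") for l in lines)
        # REPL transcript: keep only the prompted lines, which drops the outputs.
        lines = [l[8:end] for l in lines if startswith(l, "julia> ")]
    else
        # Strip a bare "> " prompt where present.
        lines = [startswith(l, "> ") ? l[3:end] : l for l in lines]
    end
    return String(strip(join(lines, '\n')))
end
```

Doing this in Julia before evaluation is cheap, and it avoids penalising a model for formatting when the code itself is fine.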

So don’t over-index on the results yet, it’s pre-0.1.0!

Side note: I’m really struggling with good prompt templates for smaller OSS models, they really don’t do well with “bigger” instructions. Any experience with that?

New from today: I would really like to see the new Mixtral (a derivative of Mistral) tested. It claims to be the best open-source model, beating GPT 3.5, including for coding. See below.

For state-of-the-art, GPT-4 should be tried with this addon (and other models if possible): GitHub - andyz245/LanguageAgentTreeSearch

It can be tried out here: CodeLATS - a Hugging Face Space by AIatUIUC

Its paper was updated Dec. 5. Using it improves GPT-4's Pass@1 accuracy on HumanEval (also better than Gemini Ultra), beating even GPT-4 with Reflexion, and massively beating GPT 3.5, even with Reflexion, chain-of-thought (CoT), tree-of-thought or RAP.

[…] We introduce LATS (Language Agent Tree Search), a general framework that synergizes the capabilities of LLMs in planning, acting, and reasoning […]
LATS achieves 94.4% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.

Thanks, I did not know about: GitHub - ise-uiuc/magicoder: Magicoder: Source Code Is All You Need

  • Magicoder-S-DS-6.7B outperforms gpt-3.5-turbo-1106 and Gemini Ultra on HumanEval (76.8 vs. [72.6 and 74.4])!
  • Find more detailed comparisons with other SOTA models on the :trophy: EvalPlus Leaderboard :trophy:!

It seems to be an interesting model, claiming to be the best (i.e. as a base model without addons, but it seems LATS could be used with it), so it’s unclear why, and disappointing that, it doesn’t do better on Julia. Maybe it can simply be fine-tuned for Julia.

However, GitHub - deepseek-ai/DeepSeek-Coder: DeepSeek Coder: Let the Code Write Itself lists Julia on GitHub as its first language for some reason, while not explicitly mentioning it on its main web page. It would be nice to have it tested.

I test all models with “What is the Julia language and can you show me example code?” Yi was disappointing, repeating trivial stuff over and over (maybe a similar question in Chinese gets better results), but this Capybara derivative of it was ok and might need testing:

https://app.fireworks.ai/models/fireworks/yi-34b-200k-capybara

the team is proud to release Mixtral 8x7B, a high-quality sparse mixture of experts models (SMoE) with open weights. Licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT3.5 on most standard benchmarks.

Mixtral has the following capabilities.

  • It gracefully handles a context of 32k tokens.
  • It handles English, French, Italian, German and Spanish.
  • It has strong performance in code generation.
  • It can be finetuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.

Pushing the frontier of open models with sparse architectures

Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively.

This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token. Concretely, Mixtral has 45B total parameters but only uses 12B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12B model.
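The routing described above (a router picks two of eight experts per token and their outputs are added) can be sketched as follows. The softmax weighting over the two selected scores and the names `moe_forward`, `experts` and `router_logits` are assumptions for illustration; the quoted text only says the outputs are combined additively.

```julia
# Numerically stable softmax over a vector of scores.
function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Hypothetical sparse MoE layer: `experts` is a vector of functions and
# `router_logits` holds one router score per expert for the current token.
function moe_forward(experts, router_logits, x; k = 2)
    top = partialsortperm(router_logits, 1:k; rev = true)  # indices of the k highest scores
    w = softmax(router_logits[top])                        # weights over the chosen experts only
    return sum(w[i] * experts[top[i]](x) for i in 1:k)     # weighted additive combination
end

# Toy experts that just scale their input by 1, 2, ..., 8.
experts = [x -> c .* x for c in 1:8]
logits = [0.0, 5.0, 1.0, 5.0, 0.0, 0.0, 0.0, 0.0]
out = moe_forward(experts, logits, [1.0, 2.0])
```

Only the two selected experts are ever called, which is where the per-token saving comes from.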

We compare Mixtral to the Llama 2 family and the GPT3.5 base model. Mixtral matches or outperforms Llama 2 70B, as well as GPT3.5, on most benchmarks.

I’ve seen 7Bx8 = 50B claimed, i.e. a 7B model that is in practice effectively larger because of the 8 sparse “experts” (shouldn’t it be 56B?). I’m also not sure what “Mixtral has 45B total parameters but only uses 12B parameters per token” means exactly: 45/12 = 3.75 is a non-integer, so is it 4 or 8 experts?
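One possible reading, sketched as arithmetic: if only the feed-forward experts are replicated while attention, embeddings and the router are shared, then total = shared + 8 × expert and active = shared + 2 × expert. The split into `shared` and `expert` is an assumption; the 45B and 12B figures are the rounded ones from the quote.

```julia
total, active = 45.0, 12.0        # billions of parameters, rounded figures from the quote
n_experts, n_active = 8, 2        # 8 experts stored, 2 used per token

# Subtracting the two equations removes the shared part.
expert = (total - active) / (n_experts - n_active)   # 5.5 per expert
shared = active - n_active * expert                   # 1.0 shared by all experts

one_model = shared + expert       # 6.5, roughly the "7B" in the name
naive = n_experts * one_model     # 52.0, overcounts the shared part 8 times
```

Under that assumption the ratio 45/12 has no reason to be an integer, and 8 × 7B overshoots because the shared parameters are counted once, not eight times.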

Compared to: GPT-4 architecture, datasets, costs and more leaked

The key points:

  • GPT-4’s Scale: GPT-4 has ~1.8 trillion parameters across 120 layers, which is over 10 times larger than GPT-3.
  • Mixture Of Experts (MoE): OpenAI utilizes 16 experts within their model, each with ~111B parameters for MLP. Two of these experts are routed per forward pass, which contributes to keeping costs manageable.
  • Dataset: GPT-4 is trained on ~13T tokens, including both text-based and code-based data

See also: TheBloke/mixtral-7B-8expert-GPTQ · Hugging Face and

slightly better than the original on e.g. WinoGrande and MMLU but not better on all metrics:

Thanks to @dzhulgakov for his early implementation (GitHub - dzhulgakov/llama-mistral: Inference code for Mistral and Mixtral hacked up into original Llama implementation) that helped me find a working setup.

Also many thanks to our friends at LAION and HessianAI for the compute used for these projects!

Benchmark scores:

hella swag: 0.8661
winogrande: 0.824
truthfulqa_mc2: 0.4855
arc_challenge:  0.6638
gsm8k: 0.5709
MMLU: 0.7173

Older, if you’d like to have a smaller model too:
Microsoft makes new 1.3B coding LLM that outperforms all models on MBPP except GPT-4, reaches third place on HumanEval above GPT-3.5, and shows emergent properties
https://www.reddit.com/r/LocalLLaMA/comments/14ez6qf/microsoft_makes_new_13b_coding_llm_that/

Textbooks Are All You Need

Paper: [2306.11644] Textbooks Are All You Need

Despite being several orders of magnitude smaller than competing models, both in terms of dataset and model size, we attain 50.6% pass@1 accuracy on HumanEval and 55.5% pass@1 accuracy on MBPP (Mostly Basic Python Programs), which are one of the best self-reported numbers using only one LLM generation. Moreover, despite being trained on much fewer tokens compared to existing models, phi-1 still displays emergent properties.
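Since pass@1 and pass@k numbers come up throughout this thread, here is a sketch of the unbiased estimator commonly used with HumanEval: generate n samples per problem, count the c that pass, and estimate the chance that at least one of k drawn samples passes. The function name is made up, and the `init` keyword of `prod` assumes Julia 1.6 or later.

```julia
# Unbiased pass@k estimate for one problem: n samples generated, c of them pass.
# Equals 1 - binomial(n - c, k) / binomial(n, k), computed as a stable product.
function pass_at_k(n::Integer, c::Integer, k::Integer)
    n - c < k && return 1.0   # fewer than k failures: any k samples include a pass
    return 1.0 - prod((1.0 - k / i for i in (n - c + 1):n); init = 1.0)
end

p1 = pass_at_k(10, 3, 1)   # 3 of 10 samples pass, one attempt allowed
p5 = pass_at_k(10, 3, 5)   # same samples, five attempts allowed
```

For k = 1 this reduces to the plain fraction c/n; the benchmark score is the mean over all problems.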

GPT-4 is still the best on (non-coding) Hellaswag:

while TheBloke/llama-2-70b-Guanaco-QLoRA-fp16 is next there (10-shot), ahead of PaLM 2-L (87.4, PaLM 2 Technical Report), with PaLM 2-M ranked next, which seems unfair since both are listed as one-shot.
MUPPET Roberta Large comes next, then LLaMA-65B+CFG (zero-shot), ahead of PaLM 2-S (one-shot), so this paper seems intriguing:

PanGu-Coder-FT-I, SotA here, has a large leap over the others in this paper:

Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets

We study the ability of pretrained large language models (LLM) to answer questions from online question answering fora such as Stack Overflow.

MarianCG is best on Django (may be outdated):

3 Likes

Thanks, @Palli ! A lot of good resources – I’ll definitely have a look at LATS.

In general, I’d like to add some advanced “fixing” into the PromptingTools.jl (be it Agents, Chains, or some clever heuristics) to improve the first result performance.

Re. Mixtral, I’m well aware - I’ve been following it quite closely in the past week.
I’ve also added the option to call MistralAI to PromptingTools, because I was keen to try “mistral-medium” (the bigger beast). I wasn’t too impressed by the results – I have to debug the evaluation traces to see why it scores so low.


I’ve refreshed the benchmark - there are 14 test cases and 16 models now.

The initial results are quite interesting!
Check out the updated README. I’ve separated out Paid models vs OSS for fairness.

You can easily deep dive on any evaluation/conversation trace like this:

using JuliaLLMLeaderboard
using DataFramesMeta

# Load everything
df = load_evals("code_generation"; max_history=1)

# Inspect some result
row = @chain df begin
  @rsubset :prompt_label == "InJulia" :name == "event_scheduler" :model == "mistral-small"
end

# take the evaluation filename and use it to load up the conversation
conv = row.filename[end] |> load_conversation_from_eval
edit(conv) # Opens a preview in VSCode, alternatively use `preview(conv)`

# Check out the code parsing results and errors
cb = AICode(last(conv); skip_unsafe=true)
cb.error

2 Likes

So I did a quick debug of the “mistral-medium” solutions and they are so good!

They just often make some silly single character error (like r""pattern"" with doubled quotes, or add an extra space somewhere). Otherwise, the code is so nice and probably one of my favourites - basically, you need to make 1-2 silly tweaks and you’ll have a great solution.

I tried to ping people on MistralAI discord to see if anyone has a similar experience or any solutions. It sounds similar to the bugs people saw in the Llama.cpp PR, which makes me wonder if it’s related to the MoE architecture :thinking:

2 Likes

For coding in particular, I was pleasantly surprised by DeepSeek Coder. It’s open source. The website requires you to register.

3 Likes

Thanks! I just tried it - it’s pretty good for the outline version!

Not sure I believe their claims here: https://deepseekcoder.github.io/

Also, it doesn’t seem like the “coder” finetune has been open sourced, right? I can see only “chat” and “base”

@Palli Thanks for the LATS paper link.

It sounds great, but I have some concerns about its practicality, eg, they often refer to running 50 trajectories to reach a solution. That’s nearly unusable even with GPT3.5Turbo, let alone any open-source model.

But I’ll explore some halfway implementation, basically adding a few trajectories and minimizing the amount of prompt+gen as much as possible.

Last week I implemented some agentic functionality in my package that basically runs the code, grabs the errors and throws them back at the LLM. I had high hopes, but it performed pretty poorly (besides GPT4, which nailed it). I can share some stats from the leaderboard, but the gist is that on average it doesn’t help, because:

  • most open-source models feel over-fine-tuned and will break down after 3-4th reply
  • the reasoning of the smaller models tends to be “poor-er”, especially when it comes to identifying an error from Julia stack trace (even when distilled). It keeps going in circles
  • if it starts wrong, it rarely recovers (especially when the tests it generates have some weird syntax error)

So all in all, I think the way forward is much more targeted feedback, doing as much work in Julia as we can vs in LLM, not sending whole conversations but only single turns, having some parallel trajectories to explore different paths, etc. A lot of good ideas from LATS.
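For readers who want to picture the run-and-feed-back loop, here is a minimal sketch that sends only the latest code and error (a single turn) instead of the whole conversation. `generate` is a hypothetical stand-in for an LLM call (prompt string in, code string out), and `try_run` and `fix_loop` are made-up names, not the package’s API.

```julia
# Run a code string in a fresh module; return the error text, or nothing on success.
function try_run(code::AbstractString)
    try
        include_string(Module(), code)
        return nothing
    catch err
        return sprint(showerror, err)
    end
end

# `generate` is a hypothetical stand-in for an LLM call: prompt in, code out.
function fix_loop(generate, task::AbstractString; max_rounds::Int = 3)
    code = generate(task)
    for r in 1:max_rounds
        err = try_run(code)
        err === nothing && return (code = code, ok = true, rounds = r - 1)
        # Single-turn feedback: only the task, the latest code and the latest error.
        prompt = "Task: $task\n\nCode:\n$code\n\nError:\n$err\n\nReturn fixed code only."
        code = generate(prompt)
    end
    return (code = code, ok = try_run(code) === nothing, rounds = max_rounds)
end
```

Note that this runs untrusted code in the current process, so a real setup would want some sandboxing and a timeout.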

Have you explored the LATS since?

1 Like

Hi everyone!

If we were to refresh this LLM leaderboard, what models would you like to see on it? (Only a few, so please prioritize)

I’m thinking:

  • new OpenAI: GPT4T, GPT35T
  • new CodeLlama series (7,34,70b?)
  • ? Stable code (merely a curiosity)
    (Mention the quant version if you care)

WDYT?

Please pick from models provided by Ollama to make my life easier: stable-code

@Palli @alfaromartino

1 Like

Hi, good job thank you.

In case it helps, there’s a prompting guide for codellama here: https://www.promptingguide.ai/models/code-llama

1 Like

Updated Jan 22:
Unifying the Perspectives of NLP and Software Engineering:
A Survey on Language Models for Code

It seems we want DeepSeek Coder-Instruct 33B with 79.3 on HumanEval vs GPT4 67.0/82, and/or something even better. CodeFuse 33B might be it (for Julia). It is already better than Gemini Ultra (I suppose that means AlphaCode 2, which is based on it?)

See also:
Figure 8: Major implementation changes in LLM over the past few years.

[already outdated]

  • Code infilling is another recently proposed task, after fill-in-the-middle pretraining (Bavarian et al., 2022) became popular. It is a generalization of code completion, where not only the left context, but also the right context is given. However, it differs from cloze test in that the target of cloze test is only one token, while the target of code infilling can be an entire line or even multiple lines, which requires a decoder to generate autoregressively
    […]
    2.1.3 Code-to-Text
    Code-to-text tasks take code as input, and output text. Related methods are listed in Figure 6.
  • Code summarization, also referred to as docstring generation, […]
  • Code review
    […]
  • Commit message generation
    […]
    2.1.4 Code-to-Pattern
    Code-to-pattern tasks conduct classification on code. Related methods are listed in Figure 7.
  • Type prediction […]
  • Code reasoning is a recently introduced task for evaluating LLMs, and often comes as a subset of general evaluation benchmarks such as MMLU (Hendrycks et al., 2021b). This task requires the model to reason about the code or algorithms, and answer related questions which are written in multiple-choice format or free-form QA format and may range from conceptual understanding to numerical calculation and complexity analysis.
  • Code classification aims to predict the functionality of a piece of code within a predefined set of labels. […]
  • Machine code detection is another recently introduced task and aims to predict whether the input code is written by human or generated by machines.

for example, GPT-4 accuracy (pass@5) increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow.

[Already outdated, or not?] The Code Llama paper was updated 3 days ago (and the recently updated model is now claimed by some to be better than GPT 4 for code generation, but wrongly, or the paper would say so?):

Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E.

[Those numbers were updated from “53% and 55%”.]

Intriguingly, Rust is worse there (at best 26.3%) than C++ (a language known to get bad output because it’s complex, so Rust even more so?), and that is the CodeLlama number (but outdated, so much better now?):

includes Julia as one of the 18 [Julia is 19th] languages (graph there outdated). MultiPL-E/docs/examples.md at b94ff9202bca71741eb0cb0f711c660a3a612489 · nuprl/MultiPL-E · GitHub

A generated solution in Julia follows this pattern, but this time with comments:

Note however:

1 Like

In Nov.:

Leveraging the CodeLLama foundation, our MFTcoder fine-tuned model, CodeFuse-CodeLLama-34B, achieves an impressive pass@1 score of 74.4% on the HumanEval benchmark, surpassing GPT-4 performance (67%, zero-shot).

[Since then there is a new CodeLlama; is the updated MFT based on it, or on the older one? Maybe it could be even better?]

:fire::fire::fire: [2024/01/17] We released MFTCoder v0.3.0, mainly for MFTCoder-accelerate. It now supports new models like Mixtral(MoE), DeepSeek-coder, chatglm3. It supports FSDP as an option. It also supports Self-paced Loss as a solution for convergence balance in Multitask Fine-tuning.

:fire::fire::fire: [2024/01/17] CodeFuse-DeepSeek-33B has been released, achieving a pass@1 (greedy decoding) score of 78.7% on HumanEval. It lists as top-1 LLM on Bigcode Leaderboard in terms of win-rate, the official result is going to be published later.

:fire::fire::fire: [2024/01/17] CodeFuse-Mixtral-8x7B has been released, achieving a pass@1 (greedy decoding) score of 56.1% on HumanEval.

E.g. the leaderboard ranking IS a bit outdated (is any other, better one known?), using e.g. the older CodeLlama; also, strangely, DeepSeek-Coder-7b-instruct is a bit better than the much larger DeepSeek-Coder-33b-instruct (on Python):

I can use that model; it knows of Julia and gave me demo code for a web server that started, but it failed on “HTTP.parse_params” (mixing it up with Python?):

I thought these links might give me demos to run, they have info, but not corresponding “studio” links (yet, that I can locate):

After undergoing 4-bit quantization, the CodeFuse-DeepSeek-33B-4bits model can be loaded on either a single A10 (24GB VRAM) or a RTX 4090 (24GB VRAM). Moreover, the quantized model still achieves an impressive accuracy of 78.05% on the Humaneval pass@1 metric.
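A quick back-of-the-envelope check of why 4-bit quantization makes a 33B model fit on a 24GB card. This counts the weights only; activations, the KV cache and the quantization scales are ignored, so treat it as a lower bound.

```julia
# Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).
weight_gb(n_params, bits) = n_params * bits / 8 / 1e9

fp16_gb = weight_gb(33e9, 16)   # 66.0 GB: far too large for a 24 GB card
int4_gb = weight_gb(33e9, 4)    # 16.5 GB: fits, with some headroom left
```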

News and Updates
:fire::fire::fire: 2024-01-12 CodeFuse-DeepSeek-33B-4bits has been released. Despite the quantization process, the model still achieves a remarkable 78.05% accuracy (greedy decoding) on the HumanEval pass@1 metric.

:fire::fire::fire: 2024-01-12 CodeFuse-DeepSeek-33B has been released, achieving a pass@1 (greedy decoding) score of 78.65% on HumanEval.

:fire::fire: 2023-11-10 CodeFuse-CodeGeeX2-6B has been released, achieving
[…]
:fire::fire: 2023-09-27 CodeFuse-StarCoder-15B has been released, achieving a pass@1 (greedy decoding) score of 54.9% on HumanEval, which is a 21% increase compared to StarCoder’s 33.6%.

:fire::fire::fire: 2023-09-26 We are pleased to announce the release of the 4-bit quantized version of CodeFuse-CodeLlama-34B. Despite the quantization process, the model still achieves a remarkable 73.8% accuracy (greedy decoding) on the HumanEval pass@1 metric.

:fire::fire::fire: 2023-09-11 CodeFuse-CodeLlama34B has achieved 74.4% of pass@1 (greedy decoding) on HumanEval, which is SOTA results for open-sourced LLMs at present.

[At least those old(er) entries are outdated.]

CodeFuse is apparently made by an Apple Pay competitor, Alipay/Ant Group, in China:

https://codefuse.alipay.com/welcome/product

https://codefuse-alipay-com.translate.goog/welcome/product?_x_tr_sl=zh-CN&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp

Finetune Mistral, Llama 2-5x faster with 50% less memory!

a product by:

Moonshot’s algorithms are proven to be the fastest in the world as confirmed by […] They run 1,000,000x faster, use 50% less resources, and work on all devices.

???!

ML-BENCH: LARGE LANGUAGE MODELS LEVERAGE OPEN-SOURCE LIBRARIES FOR MACHINE LEARNING TASKS

  1. We propose a novel task that requires LLMs to comprehend long-context documents, navigate codebases, understand instructions, and generate executable code.
    […]
    We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, results in further improvements.

Intriguing CoC (see video/paper there):

Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought.

and PoT:

By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets.

3. When Coding Meets Reasoning

3.1 Coding for Reasoning

  1. PAL: “PAL: Program-aided Language Models”, 2022-11, ICML 2023, [paper] [repo]
  2. PoT: “Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks”, 2022-11, TMLR 2023, [paper] [repo]
  3. CoC: “Chain of Code: Reasoning with a Language Model-Augmented Code Emulator”, 2023-12, arXiv, [paper]

3.2 Coding via Planning

  1. YAYI2: “YAYI 2: Multilingual Open-Source Large Language Models”, 2023-12, arXiv, [paper] [repo]
  2. DeepSeek: “DeepSeek LLM: Scaling Open-Source Language Models with Longtermism”, 2024-01, arXiv, [paper] [repo]
  3. Mixtral: “Mixtral of Experts”, 2024-01, arXiv, [paper] [blog]
  4. DeepSeekMoE: “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models”, 2024-01, arXiv, [paper]

[…] Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default
[…]
Figure 1: FIM can be learned for free. We pretrain language models with 50% and 0% FIM rates on two domains, natural language and code

We provide more evidence for the FIM-for-free property by comparing FIM and AR models on non-loss based benchmarks in Section 4. Moreover, we see in Section 4.2 that there is a stronger form of the FIM-for-free property. Not only there is no hit in autoregressive capabilities from FIM training on the final checkpoints, the same also holds throughout training. This is evidenced by the matching learning curves between AR and FIM models in left-to-right loss and HumanEval evaluations in Figures 4 and 5.

Beside studying the effect of FIM training on the left-to-right capability, it is also important to show that the models are in fact learning to infill from FIM training. Figure 2 provides evidence for this in the context of FIM test losses.
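To illustrate what FIM training data looks like, here is a sketch of the usual transformation: cut a document into prefix, middle and suffix, then move the middle to the end so that an ordinary left-to-right model learns to infill. The sentinel strings and the name `fim_transform` are placeholders; real models use their own special tokens.

```julia
# Cut `doc` after characters i and j (i <= j) and emit prefix, suffix, middle,
# so a left-to-right model sees both sides before generating the middle.
# The sentinel strings are hypothetical placeholders for special tokens.
function fim_transform(doc::AbstractString, i::Int, j::Int;
                       pre = "<PRE>", suf = "<SUF>", mid = "<MID>")
    chars = collect(doc)   # index by character so non-ASCII text is safe
    prefix = String(chars[1:i])
    middle = String(chars[i+1:j])
    suffix = String(chars[j+1:end])
    return string(pre, prefix, suf, suffix, mid, middle)
end

example = fim_transform("f(x) = x + 1", 7, 10)
```

With a 50% FIM rate, roughly half of the training documents would go through such a transformation and the rest would stay in normal order.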

related seemingly:

The YAYI 2 multilingual tokenizer, maybe, more so than the LLM model itself, might help Julia (plus many natural languages), because of Unicode:

Toxicity Filtering To alleviate this problem, we propose a dual-filtering mechanism, which uses a Yayi 2 Checker model based on sensitive words for screening at the first stage and employs a classification model based on quantum heuristic language to complete secondary filtering.

2.2 Tokenization
In the international landscape, most LLMs are centered around English, limiting their generalization ability in other languages. Similarly, LLMs released in China tend to focus on bilingual scenarios (Chinese and English), lacking a multilingual training corpus.
[…]
Training Data The tokenizer of YAYI 2 is trained on a 500GB high-quality multilingual corpus, which covers over ten commonly used languages including Chinese, English, French, Russian, etc.
[…]
Normalization The YAYI 2 tokenizer adopts a unique approach by directly utilizing raw text for training without undergoing normalization. This strategy ensures the model’s adeptness in handling general scenarios.
Algorithm By training using the Byte-Pair Encoding (BPE) algorithm (Shibata et al., 1999) from the Sentence-Piece library (Kudo and Richardson, 2018), the YAYI 2 tokenizer exhibits a robust approach. During training, each digit of a number is intelligently split to facilitate mathematical reasoning. The manually curated vocabulary includes an array of HTML identifiers, common punctuation to enhance segmentation accuracy, and 200 reserved slots for potential applications like adding identifiers during SFT. As a byte-level segmentation algorithm, the YAYI 2 tokenizer excels in handling unknown characters.
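Two of the ideas in that passage are easy to show in a few lines: splitting every digit into its own piece before BPE, and falling back to raw UTF-8 bytes so that no character is ever “unknown”. This is only a sketch of the concepts, not the YAYI 2 tokenizer; `split_digits` is a made-up name.

```julia
# Pre-tokenisation sketch: every digit becomes its own piece, other runs stay together.
split_digits(s::AbstractString) = [String(m.match) for m in eachmatch(r"\d|\D+", s)]

pieces = split_digits("n = 2024")

# Byte-level fallback: any character can be represented by its UTF-8 bytes.
bytes = collect(codeunits("λ"))
```

Byte-level handling is what would matter for Julia code that uses Unicode identifiers and operators.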

FYI: Apparently جوليا is Julia in Arabic. I tried asking an Arabic LLM about Julia (I asked in English; I don’t know Arabic…) and got the following, with Python in Roman/Latin script, unlike Julia. I added the newline to confirm that the first word was Julia; the next line then got upper-cased (Arabic doesn’t have casing). It was understandable and clear after I ran it through Google Translate:

جوليا
هي لغة برمجة ديناميكية عالية المستوى وعالية الأداء للحوسبة التقنية. تم تصميمه مع أخذ الأداء وسهولة الاستخدام في الاعتبار، بهدف توفير سهولة الاستخدام والقدرات التفاعلية للغات البرمجة النصية مثل Python وMATLAB، مع تحقيق السرعة والكفاءة للغات منخفضة المستوى مثل C وFortran.

بناء جملة جوليا نظيف ومألوف، مما يجعل من السهل تعلمه لأولئك الذين يشعرون بالارتياح بالفعل مع لغات البرمجة الشائعة الأخرى. إحدى ميزاته الرئيسية هي برنامج التحويل البرمجي في الوقت المناسب (JIT)، والذي يسمح له بإنشاء رمز آلة محسّن للغاية تلقائيًا في وقت التشغيل، مما يمكّنه من مطابقة أداء اللغات المترجمة بشكل ثابت أو حتى تجاوزه في العديد من الحسابات الرقمية.

[English translation: Julia is a high-level, high-performance dynamic programming language for technical computing. It was designed with performance and ease of use in mind, aiming to provide the ease of use and interactive capabilities of scripting languages such as Python and MATLAB, while achieving the speed and efficiency of low-level languages such as C and Fortran.

Julia’s syntax is clean and familiar, making it easy to learn for those already comfortable with other popular programming languages. One of its key features is its just-in-time (JIT) compiler, which allows it to automatically generate highly optimized machine code at runtime, enabling it to match or even exceed the performance of statically compiled languages in many numerical computations.]

Thanks for the tips! I’ll read it properly later.

From the list:

  • CodeFuse + MFT Coder sounds intriguing, but I didn’t see them on Ollama, which I think is the main backend for Julia users? I’ll look around
  • CodeLlama should be in the benchmark (did several quants for comparison), but I was surprised by how poorly it performed
  • StableCode - I think Jun mentioned on Slack that Julia is not even in the training data?

In general, despite all the hype with new models, I rarely see them perform much better when I try.

Is there a model that you’ve used a lot that seems better than everything else?

2 Likes

In other news:

:tada: New Julia LLM Leaderboard Update! :rocket:

Quick highlights:

  • New Models Added :globe_with_meridians:: Now featuring OpenAI’s “0125” versions, Codellama (full sweep up to 70b), and more for even richer evaluations.
  • AgentCodeFixer Loop :arrows_counterclockwise:: Introducing the ability to evaluate code with codefixing_num_rounds>0 for iterative improvements. (No, it didn’t do amazing for OSS models :frowning: )
  • Smart Seeding :seedling:: We’ve tweaked seed settings for MistralAI & OpenAI to sidestep caching mechanisms, ensuring fresh results.
  • Re-scoring Submissions :bar_chart:: All past entries have been re-evaluated with our updated methodology for fairness and accuracy (there were many changes in PromptingTools)
  • Quantization & Temperature Effects :microscope:: Conducted experiments on Yi34b and Magicoder 7b models to explore their impact. Details to follow (but you can see the reports in the repo already)!

Fixes & Improvements :wrench::

  • Enhanced code loading/debugging with Julia’s include_string, pinpointing error sources more accurately. Use evaluate(...; verbose=true) for insights.
  • Better error handling and scoring updates, including improved parsing error detection and a fix for the mkdir bug in run_benchmark.

Streamlining :scissors::

  • Said goodbye to the @timeout macro, now part of PromptingTools.

Excited to see how these updates enhance your leaderboard experience! :star2:

My favourite change:
The latest GPT 3.5 Turbo (0125) scores pretty high and it’s cheaper than ever before:


(But, yes, Magicoder 7b with lower temperature comes close to that performance for 1-shot tasks! It falls apart for anything more complicated :smiley: )

1 Like

What would people want to see here, longer contexts, for input, or output? I think you only get one number, and it applies to both.

You mean on Julia? Models could be getting better for e.g. Python but not Julia. Benchmarks are only that, and I see something about “contamination” making them unreliable, i.e. the numbers are inflated and only show the models working well on the benchmarks (but not out-of-distribution).

I would like to see Monarch-Mixer models (and/or RWKV), they are the new thing, that will take over, I believe, also for code, most likely.

Text embeddings are a critical piece of many pipelines, from search, to RAG, to vector databases and more. Most embedding models are BERT/Transformer-based and typically have short context lengths (e.g., 512). That’s only about two pages of text, but documents can be very long –
[…]
code repositories, etc can be tens of thousands of tokens long (or more). Here, we’re taking a first step towards developing long-context retrieval models.

[I added bold above and below]

We build on Monarch Mixer (M2), a recent model family developing attention- and MLP-free BERT models, which are enabling long-context BERT models. Today, we’re releasing a preview of a few models: long-context versions of M2-BERT up to 32K context length, as well as

These models achieve state-of-the-art performance in MTEB showing comparable or even better accuracies than closed models. Additionally, M2-BERT retrieval models significantly outperform other closed models in long context retrieval tasks. This means you can now generate embeddings for long documents without splitting them into many short chunks while containing more meaningful contexts in the embeddings. You can also access these powerful models at very competitive prices (up to 4x cheaper) as seen in the pricing graph below.

Check out code here, and models up on HuggingFace here:

These models are also available on Together AI’s new embedding service – check it out here!

Monarch matrices are a sub-quadratic primitive (you can compute them in O(N^(3/2))) that are also hardware-efficient and expressive. The block-diagonal matrices map onto tensor cores, and the permutations generalize the Fast Fourier Transform. As a result, Monarch matrices can efficiently capture all sorts of structured linear transforms:
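To see where the N^(3/2) comes from, here is a small sketch of a Monarch-style product with N = b^2: two block-diagonal multiplies (b blocks of size b×b each) interleaved with a reshape-and-transpose permutation. The exact factor ordering and all names are assumptions for illustration, not the M2 implementation.

```julia
# Apply a block-diagonal matrix, given as b blocks of size b×b, to a vector of length b^2.
function blockdiag_apply(blocks, x)
    b = length(blocks)
    X = reshape(x, b, b)   # column j is the j-th chunk of x
    return vec(reduce(hcat, [blocks[j] * X[:, j] for j in 1:b]))
end

# Reshape-and-transpose permutation.
transpose_perm(x, b) = vec(permutedims(reshape(x, b, b)))

# Monarch-style product: block-diagonal, permute, block-diagonal, permute.
function monarch_apply(L, R, x)
    b = length(L)
    return transpose_perm(blockdiag_apply(L, transpose_perm(blockdiag_apply(R, x), b)), b)
end

b = 4
N = b^2
L = [rand(b, b) for _ in 1:b]
R = [rand(b, b) for _ in 1:b]
y = monarch_apply(L, R, rand(N))

structured_cost = 2 * b^3   # multiply-adds: 2 * N^(3/2)
dense_cost = N^2            # multiply-adds for a dense N×N matrix
```

Each block-diagonal step does b small b×b products, so b^3 = N^(3/2) multiply-adds; the gap to N^2 widens as N grows (for b = 32 it is 65,536 versus 1,048,576).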

Monarch matrices can capture many structured linear transforms, including Toeplitz, Fourier, Hadamard, and more.

In M2, we use Monarch matrices to replace both attention and MLPs in Transformers. We replace attention by using Monarch matrices to construct a gated long convolution layer, similar to work like H3, Hyena, GSS, and BiGS. Specifically, Monarch matrices can implement the FFT, which can be used to compute a long convolution efficiently:

I couldn’t try out those models at HF, nor the endpoint (I suppose I could if I paid; I’m looking for a web interface to test with). I suppose they are not trained on code, though I’m unsure, but they could be. These models are still small, since they have not been scaled up, and that is costly for a new type of model, even though the time complexity is better: Transformers are O(n^2), these are O(n^1.5). Linear transformers exist (even before RWKV, which claims “infinite” context length); they are known to be fast(er), which is their point, while quality suffers (I believe also for RWKV, but I’ve not kept up with their updates). I hope, and it seems, that this doesn’t happen here and it just becomes more efficient.

Better for text (and images), thus code too:

and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1× higher throughput at sequence length 4K. […] Causal GPT-style models introduce a technical challenge: enforcing causality via masking introduces a quadratic bottleneck. To alleviate this bottleneck, we develop a novel theoretical view of Monarch matrices based on multivariate polynomial evaluation and interpolation, which lets us parameterize M2 to be causal while remaining sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE–showing for the first time that it may be possible to match Transformer quality without attention or MLPs.

[Table: intriguingly, M2’s throughput increases with larger context windows in absolute terms, not just relative to the others, reaching up to 9x. Throughput also drops for the others, at least up to 8192 where they don’t OOM.]

That OOM in HF BERT-base is particularly important (and FlashAttention BERT-base eventually OOMs as well). That means that any retriever with a Transformers-based BERT backbone will have trouble with long-context – that’s everything from sentence-BERT to ColBERT to BGE and more!

[Their bold. That is the time complexity, but I think the same applies to space, which is why they OOM; it will eventually happen for M2 too, just not as quickly.]

GitHub is only 7.59% (95.16 GiB) of the Pile, and Julia is a fraction of that… (see GitHub - EleutherAI/the-pile), but at least the model was then trained on some code. I’m not sure what fraction the best code models use, because you need e.g. English too.

Over the past six years, we’ve seen Transformers take the world by storm. [E.g. ChatGPT]

Are Transformers the only way to get this amazing performance?

Now, the first reason we’ve been poking around at this is because it’s really interesting! […] – hence the line of work in our lab looking into replacing attention with a sub-quadratic operator (S4, H3, Hyena, HyenaDNA). And we’re encouraged by the groundswell of work into new architectures for long sequences, from RetNet to RWKV, and positional interpolation – just to name a few!

The RWKV paper was updated in December with:

Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.

https://wiki.rwkv.com/

From 9 hours ago (v6 in training):

  • Ranks as the world’s greenest 7B model (per token)
  • Trained on 1.1 Trillion Tokens across 100+ languages (70% English, 15% multi lang, 15% code)
  • Outperforms all 7B class models in multi-lingual benchmarks
  • Approaches Falcon (1.5T), LLaMA2 (2T), Mistral (>2T?) level of performance in English evals

Smaller, slightly older models (and a larger “14B model / 7B 2T model”) are also available, and an 8x7B MoE model is scheduled: RWKV/rwkv-4-world-1b5 · Hugging Face

I’m guessing the MTEB benchmark might be relevant, at least once it is extended to code:

  1. Multilinguality MTEB contains multilingual classification, STS and bitext mining datasets. […] Further, MTEB does not contain any code datasets that could be used to benchmark code models (Neelakantan et al., 2022; Allal et al., 2023). It should be easy to extend MTEB with datasets, such as CodeSearchNet

Yeah, same. I’d be keen to see different architectures but it’s not easy with the lack of support by the popular backends right now. If you know of an easy route, let me know!

Re embedding benchmarks, that’s also on my radar, but that requires building the test set first. PRs welcome :slight_smile:

Added Qwen-1.5 benchmarks - thanks to 01.ai!

It performs surprisingly badly :confused:
Check the traces and the updated leaderboard!

PS: I tested only the GGUF quants, so perhaps there is an issue?

EDIT: There is a bug in Ollama/GGUF files: The `qwen:72b-chat-v1.5` model (and likely all the other v1.5 models too) is missing the `rope_frequency_base` value in the GGUF file. · Issue #2379 · ollama/ollama · GitHub

1 Like

It would be interesting to get Google’s Gemma models added. Until then, here is a demo link for the smaller one:

What is the Julia language and can you show me example code?

[I tried that prompt, as I do for all models, and it worked: I got a small, simple piece of Julia code as part of a longer, relevant text.]

The Gemma models are claimed to be state-of-the-art and useful for code generation (for Julia too?), better than all similarly sized models:

Gemini was already claimed to be SOTA, and Gemma is built on its tech, though Gemini is likely still much larger and better. Since then Gemini 1.5 has been announced and looks awesome; I’m waiting for the public release. OpenAI’s Sora is also very new, an awesome SOTA text-to-video model and a large leap.

Google seems to release the models on Kaggle, or there first, but they are also available elsewhere, including to try out:

They come in two sizes, with models for each, plus unofficial ones that are likely worth a try:

In case it helps, here’s code:

  • GitHub - google/gemma.cpp: lightweight, standalone C++ inference engine for Google's Gemma models.
  • GitHub - google/gemma_pytorch: The official PyTorch implementation of Google's Gemma models

Other, larger models to test from the leaderboard, in order of increasing size (sorted by MMLU before choosing):

We recently released Smaug-72B-v0.1 which has taken first place on the Open LLM Leaderboard by HuggingFace. It is the first open-source model to surpass an average score of 80%.

Also:

Also seemingly of interest (for medical use, apparently not code generation, but who knows):

1 Like

Agreed. I did it overnight, only to find out that there is a bug in the Ollama implementation.

It performed terribly. A lot of users are seeing the same issue as well, so hopefully it will get resolved.

Are you on Julia Slack? We discussed it there in the “generative-ai” channel.

1 Like