GPT-4 is the best model and got a big jump in capability this month, to 82% on HumanEval: 13% better in relative terms, or 9.5 percentage points (with Reflexion it reaches 91%, probably an outdated number already, and 86.6% with OctoPack; Julia at least appears in that paper, i.e. in the dataset list “COMMITPACK AND COMMITPACKFT LANGUAGES”, with a 0.02% share):
WizardCoder is the best freely available model, and it too can seemingly be improved with Reflexion (and/or OctoPack). 4-bit quantized models are now the norm, I think, but we are also down to under 3 bits per weight.
[2023/08/26] We released WizardCoder-Python-34B-V1.0, which achieves the 73.2 pass@1 and surpasses GPT4 (2023/03/15), ChatGPT-3.5, and Claude2 on the HumanEval Benchmarks. For more details, please refer to WizardCoder.
[2023/06/16] We released WizardCoder-15B-V1.0 , which surpasses Claude-Plus (+6.8), Bard (+15.3) and InstructCodeT5+ (+22.3)
We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. […] For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%.
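The Reflexion loop quoted above can be sketched as a simple retry loop. A minimal sketch, assuming `generate`, `run_tests`, and `reflect` are hypothetical stand-ins for the LLM and test-harness calls; this is my paraphrase of the idea, not the paper's actual API:

```python
def reflexion_loop(task, generate, run_tests, reflect, max_trials=4):
    """Sketch of Reflexion: no weight updates, only verbal feedback.

    `generate`, `run_tests`, and `reflect` are placeholder callables
    (hypothetical, not from the Reflexion paper's code).
    """
    memory = []  # episodic memory buffer of verbal self-reflections
    attempt = None
    for _ in range(max_trials):
        attempt = generate(task, memory)   # condition on past reflections
        ok, feedback = run_tests(attempt)  # task feedback signal (e.g. unit tests)
        if ok:
            return attempt
        # store a verbal reflection on what went wrong for the next trial
        memory.append(reflect(task, attempt, feedback))
    return attempt
```

The point is that improvement comes purely from accumulating reflective text between trials, which is why it can lift pass@1 on HumanEval without any fine-tuning.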
We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack’s benefits in generalizing to a wider set of languages and natural coding tasks.
Of course, the metric matters, as does which programming language is being evaluated; here is possibly a metric and test to watch going forward:
Supports: [long list of languages, Julia likely needs to be added to it.]
Leaderboard for Leetcode Hard (Python): Pass@1
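Pass@1 and pass@k scores like these are usually computed with the unbiased estimator from the Codex paper (generate n samples, count the c that pass all tests). A minimal sketch of that estimator, my own paraphrase rather than code from any of the quoted repos:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021, "Evaluating Large
    Language Models Trained on Code"): 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: samples passing all tests, k: budget.
    """
    assert n >= k, "need at least k samples"
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    # stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, `pass_at_k(n=10, c=5, k=1)` gives 0.5: with half the samples passing, one draw succeeds half the time.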
The best AI LLMs for code (at least among no-cost models OK for commercial use, and for at least Python, if not Julia) are the updated WizardCoder (WizardMath also claims to beat any other previous SOTA open-source model) and the days-old Code Llama (based on Llama 2; claimed SOTA on some benchmarks at release, and better than the older WizardCoder, but its paper compared to neither the newer nor the older version).
WizardCoder beats the freely available ChatGPT AND the older GPT-4 (2023/03/15); the current (free?) ChatGPT is better than that older GPT-4, and WizardCoder beats everything but the latest GPT-4 (2023/08/26), being only 11% worse in relative terms, or 8.8 percentage points.
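Spelling out the relative vs. percentage-point arithmetic used in these comparisons:

```python
# Absolute ("percentage point") vs. relative difference between
# GPT-4's and WizardCoder's HumanEval pass@1 scores quoted above.
gpt4, wizard = 82.0, 73.2

abs_diff = gpt4 - wizard           # 8.8 percentage points
rel_diff = abs_diff / gpt4 * 100   # ~10.7%, rounded to "11% worse"
```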
The jump to 73.2 on HumanEval (from 59.8, or from the 18.3 or lower of other fairly recent code models) is very impressive. The demo links load their user interface for me, but getting an answer takes a long time.
The phi-1 model, with its approach of a reduced, higher-quality dataset, is also interesting; it got 50.6%. [2306.11644] Textbooks Are All You Need
The largest WizardCoder model is 35 GB, “so quite large”.
Older version:
Repositories available
- 4-bit GPTQ models for GPU inference
- 2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference
See the newer standard: https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md
GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. […]
It is a successor file format to GGML, GGMF and GGJT,
Ultimately, it is likely that GGUF will remain necessary for the foreseeable future, and it is better to have a single format that is well-documented and supported by all executors than to contort an existing format to fit the needs of GGML.
This model can seemingly be run there, at no cost, but very slowly:
This model is an Open-Assistant fine-tuning of Meta’s CodeLlama 13B LLM.
2- and 3-bit models:
This repo contains GGML format model files for OpenAssistant’s CodeLlama 13B OASST SFT v10.
Important note regarding GGML files.
The GGML format has now been superseded by GGUF. As of August 21st 2023, llama.cpp no longer supports GGML models.
[…]
GGML_TYPE_Q2_K - “type-1” 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
| File | Quant method | Bits | Size | Max RAM required | Notes |
|---|---|---|---|---|---|
| codellama-13b-oasst-sft-v10.ggmlv3.Q2_K.bin | Q2_K | 2 | 5.74 GB | 8.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| airoboros-l2-70b-2.1.ggmlv3.Q2_K.bin | Q2_K | 2 | 28.59 GB | 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| Branch | Bits | GS | Act Order | Damp % | GPTQ dataset | Seq len | Size | ExLlama compat | Description |
|---|---|---|---|---|---|---|---|---|---|
| gptq-3bit–1g-actorder_True | 3 | None | Yes | 0.1 | wikitext | 4096 | 26.77 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
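A back-of-the-envelope size estimate from bits per weight shows why these files land where they do. A rough sketch only: it ignores metadata and the fact that some tensors (as noted above) use higher-bit quants, so real files are larger:

```python
def model_size_gb(n_params, bits_per_weight):
    """Rough model size in GB if every weight used `bits_per_weight` bits.

    Ignores non-quantized tensors, mixed-precision layers, and metadata,
    so it is a lower bound on the actual file size.
    """
    return n_params * bits_per_weight / 8 / 1e9

# A 13B model at Q2_K's ~2.5625 bpw: about 4.2 GB as a lower bound;
# the codellama-13b Q2_K file above is 5.74 GB in practice.
size = model_size_gb(13e9, 2.5625)
```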
In this paper, we present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. WizardMath surpasses all other open-source LLMs by a substantial margin. Furthermore, our model even outperforms ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, simultaneously surpasses Text-davinci-002, PaLM-1 and GPT-3 on MATH. More details and model weights are public
Dataset used for OctoPack:
| Name | Megabytes | % of total | Samples | % of total |
|---|---|---|---|---|
| julia | 752.068 | 0.0203% | 22695 | 0.0393% |
Widely adopted programming languages like C and Javascript are overrepresented compared to niche programming languages like Julia and Scala.
You can check whether your code is in the dataset and opt out (I see a lot of my repos were included, but many are forks, so not really mine; does that mean duplicate Julia codebases in the dataset?):
Google’s PaLM 2, smaller than the original PaLM and now tested on Julia, was state-of-the-art. An open-source implementation: GitHub - conceptofmind/PaLM: An open-source implementation of Google's PaLM models
WizardMath claims to beat any other previous SOTA open-source model, but is still a bit disappointing (it may be good on word problems, like other LLMs, but arithmetic, at least with irrationals, is a challenge for many):
The square root of 2 is a mathematical concept that refers to the number that, when multiplied by itself, gives the original number. In this case, the original number is 2.
[..]
Step 3: Apply the Concept
To find the square root of 2, we need to find a number that, when multiplied by itself, equals 2. We can start by listing numbers and checking if they satisfy this condition:
- 1 * 1 = 1
- 2 * 1 = 2
- 3 * 1 = 3
[..]
Step 4: Identify the Solution
From the list above, we can see that 2 is the only number that, when multiplied by itself, equals 2. Therefore, the square root of 2 is 2.
Final Answer: The square root of 2 is 2.
If I checked chain-of-thought, I got:
[…]
Step 5: Conclusion
Since none of the numbers we calculated have a square equal to 2, we can conclude that the square root of 2 does not exist.
The answer is: Since the square root of 2 does not exist, the question is not valid.
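For contrast, the correct value (about 1.41421356, an irrational number) is trivial to compute numerically. A minimal Newton's-method sketch, not anything from the model's output:

```python
def newton_sqrt(a, x=1.0, iters=8):
    """Approximate sqrt(a) by Newton's method: x_{n+1} = (x_n + a/x_n) / 2.

    Converges quadratically; 8 iterations from x=1.0 reaches machine
    precision for a=2.
    """
    for _ in range(iters):
        x = 0.5 * (x + a / x)
    return x

root2 = newton_sqrt(2.0)  # ~1.41421356..., which the model above failed to find
```

This is exactly the kind of iterative refinement an LLM struggles to do in plain text, since each step needs exact arithmetic rather than pattern-matching.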