AI tools to write (Julia) code (best/worst experience), e.g. ChatGPT, GPT 3.5

Maybe someone will find this blog post I wrote useful. It’s about AI tools for coding in Julia.


Do you know how (step by step) I would set up WizardCoder with Reflexion? It seems, to me at least, that Reflexion is suited to, or developed for, GPT-4, and its documentation reflects (hah :wink:) that.

I will try to get the Wizard extension for VSCode running.
Could we train the agent via Reflexion specifically for Julia?

Julia is already on the shortlist of 19 languages in the first multilingual neural code generation benchmark (not just in datasets), MultiPL-E. It is way ahead of some languages, such as R and C# (at least at the time of benchmarking), and close in score to the top languages, e.g. JavaScript and Python. That should be the benchmark to follow (or maybe HumanEval, which is more challenging than MBPP). Dictionaries seem to come out strangely in most languages (low or absent results, not tested?), including for Julia (exceptionally so vs. e.g. lists or tuples), and Rust is the highest ranked there at only 40% (fig. 11). “Many languages appear to struggle with questions involving tuples. […] However, JavaScript performs well despite lacking tuples.”

Thanks for the link (I like subjective opinions), but I must comment that none of the tools is specific to code (though the larger models, e.g. Claude-2 and GPT-4, may or may not be expected to be good at it), nor to Julia:

On a more positive note, AI tools seem pretty efficient at constructing regular expressions in Julia. This is remarkable, as writing regular expressions isn’t particularly hard, but can quickly become tedious. [note]

[That’s not really “remarkable” vs. other languages, since regular expressions are the same across most or many languages, I think identical in Python. The code around them isn’t, e.g. “r strings” in Julia, but the additional r is trivial.]
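To make the bracketed point concrete, here is a small Python sketch. The sample text, the pattern, and the name `find_dates` are made up for illustration; they are not from the blog post. The pattern body is, as far as I know, the same text you would put between the quotes of an `r"..."` literal in Julia, at least for common syntax like this.

```python
import re

# Hypothetical sample text, made up for illustration.
text = "Released 2023-08-24, updated 2023-09-01."

# ISO-style date: four digits, two digits, two digits, each in its own group.
DATE_PATTERN = r"(\d{4})-(\d{2})-(\d{2})"


def find_dates(s):
    """Return a (year, month, day) tuple of strings for each date found in s."""
    return re.findall(DATE_PATTERN, s)


dates = find_dates(text)
for year, month, day in dates:
    print(year, month, day)
```

In Julia the surrounding code changes (e.g. `eachmatch(r"(\d{4})-(\d{2})-(\d{2})", text)`), but the pattern itself does not, which is probably why LLMs handle it well regardless of language.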

Overall, the main message is that you should be cautious when using AI tools for coding, but even more so when applying them to Julia. This becomes particularly relevant if you rely on AI tools to learn the language itself.

I only consider free options, which include ChatGPT-3.5, ChatGPT-4, Bard, Claude-2, and LLaMA2-70b. You can freely access each as follows.

[So which is best for Julia, the updated GPT-4, or maybe WizardCoder? Or Code Llama, which has many variants, not to be confused with plain Llama 2.]

ChatGPT-3.5, Claude-2, and LLaMA2-70b: accessed through The first two models can also be accessed directly through their websites. The less performant version Claude-1 yields similar results to Claude-2 for Julia.

ChatGPT-4: accessed through A couple of free messages per day are also available through

[Good to know. I still wouldn’t bet on it being the same GPT-4, since it’s frequently updated too, and those accessing an API might be accessing an older version? Or is it guaranteed to be the latest?]

GitHub Copilot (free for students and professors) and Bing Chat are allegedly powered by ChatGPT-4.

I was also (mis?)led to believe that Bing Chat uses GPT-4, and that Copilot doesn’t use GPT-4 or even GPT-3 but a related model from OpenAI.

In addition, our latest model has greatly improved coding skills. Claude 2 scored a 71.2% up from 56.0% on the Codex HumanEval, a Python coding test. On GSM8k, a large set of grade-school math problems, Claude 2 scored 88.0% up from 85.2%. We have an exciting roadmap of capability improvements planned for Claude 2 and will be slowly and iteratively deploying them in the coming months.

[Those are no longer impressive scores for either Claude 2 or Codex, so I would stay away, even for Python, unless they improve.]

Sourcegraph is a code AI platform that helps customers write, fix, and maintain code. Their coding assistant Cody uses Claude 2’s improved reasoning ability to give even more accurate answers to user queries while also passing along more codebase context with up to 100K context windows. In addition, Claude 2 was trained on more recent data, meaning it has knowledge of newer frameworks and libraries for Cody to pull from.

“100K context windows” is likely misleading.

This paper is also on Claude (the pre-Claude 2 version, which also had up to 100K context); the results may or may not carry over to Claude 2:

While recent language models have the ability to take long contexts as input, relatively little is known about how well the language models use longer context. […]
Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models.

The Code Llama claim is similar for length, and also intriguing regarding upload (I would at least try it, and also WizardCoder):

The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.

Aside from being a prerequisite for generating longer programs, having longer input sequences unlocks exciting new use cases for a code LLM. For example, users can provide the model with more context from their codebase to make the generations more relevant. It also helps in debugging scenarios in larger codebases, where staying on top of all code related to a concrete issue can be challenging for developers. When developers are faced with debugging a large chunk of code they can pass the entire length of the code into the model.

Increasing Llama 2’s 4k context window to Code Llama’s 16k (that can extrapolate up to 100k) was possible due to recent developments in RoPE scaling.

We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark (Chen et al., 2021) and MBPP benchmark (Austin et al., 2021) to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022) and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its performance on Python for several other languages.
We make progress towards answering this question by proposing two large-scale parallel benchmarks for code generation in 19 languages, which we use to evaluate three state-of-the-art models: Codex, CodeGen, and InCoder. [No longer the state of the art, but the metric/benchmark itself isn’t outdated.]
we observe that Codex performs very well on some Low and Niche languages. For instance, Lua [and Julia]. CodeGen also performs well on Scala, Rust, and Julia.

[That’s true for Codex; the statement on CodeGen seems to be the opposite of what the writers intended, also for those other languages, with the lower scores presumed worse.]

Eight of the languages in MultiPL-E had never been used before to measure NL2Code performance; this set includes newer languages (Julia and Swift), older scripting languages (Bash and Perl), and languages for specific applications (Lua and R). Half of the languages are statically type-checked.

Our evaluation presents new insights into the effectiveness of code generation models, including:

  1. Across models and benchmarks, code generation models perform extremely well on JavaScript, sometimes outperforming Python, even on benchmarks originally designed to evaluate Python performance. Codex also performs well on C++, Scala, and TypeScript.
  2. There is no strong correlation between model perplexity and correctness of generated code, which suggests that perplexity may not be a good estimate of performance.
  3. Code generation performance is correlated with language popularity, but some niche languages perform as well as more popular languages.
  4. Code generation performance is sensitive to prompt design for both niche and popular languages.
  5. Static type-checking neither helps nor hinders code generation model performance.

Julia is in the same ballpark as the highest-ranked languages, and rather higher than e.g. C# and Go. This assumes Codex is used (it is the best model for all the languages; the other two models are very bad for Julia, see Fig. 7).

JavaScript has the most code at 14.3% (of GitHub code; the numbers don’t add up to 100%, likely Python is the rest), and Julia follows Perl and Lua with 0.1%, then R with 0.05%, then Bash, D and Racket, seemingly.

Thus each language requires an evaluation script, which is typically about 20 LOC. [Not sure it’s the same “LOC”, but in Table I the LOC for Julia is 125, min. 38 for Racket, max. 479 for Swift.]

This seems mostly FUD by Tabnine, ChatGPT’s competitor:

Leakage of Intellectual Property
A related concern is the potential leakage of intellectual property. […] ChatGPT has a conversation history feature, which could potentially result in leaks. For example, on March 2023, OpenAI confirmed a data breach that allowed users to see conversation histories by other users.

Unsafe Code Suggestions
In settings where ChatGPT is used to generate code, there [is] a risk of it providing unsafe or insecure code suggestions.


Reflexion is very new to me. It’s not a new model but a tool for GPT-4, or at least used that way; it is seemingly independent of the LLM, so hopefully it can boost the scores of other models too.

I found the extension:

Please check out the backend API repository here. GitHub - mzbac/AutoGPTQ-API: Host the GPTQ model using AutoGPTQ as an API that is compatible with text generation UI API.

Note in its README:

Download models
python TheBloke/WizardCoder-15B-1.0-GPTQ

That’s for an older WizardCoder model (which isn’t bad though), so likely the extension doesn’t provide the latest one (which is brand-new). I’m always looking for ways to try models out myself, and would have posted if I had seen a (web) demo that responded in time. You likely just need to download the latest model/weights file, and have a big enough GPU if you run it locally.

Here is, I believe, the replacement model (the best one, with smaller variants there too):

I could only try the smaller (i.e. older) models there, and the result was not inspiring:

What is the Julia language?


username_0: Hi,

I’m trying to use

I think HF cuts off the output, but it wasn’t a good start; hopefully the larger models are better. When I tried again with the larger ones I got “Rate limit reached. Please log in or use your apiToken”, so be choosy about what you ask there.

So we can conclude that Julia does well with language models that are specifically trained for coding.

And it lags behind in more general language models without coding specialization.

I think this could come from the fact that Julia has fewer source files out there, so training specifically for programming compensates for that.

One aspect I feel could become long-term debt is that scientific code seems to use a lot of Unicode symbols to name things.

When casual programmers like me use ChatGPT, Wizard and other tools, I imagine this leaking into everybody’s code bases.

I don’t know if that happens, but I guess this is something that we could solve.

After all, hard-to-read symbols whose origin you don’t know are something nobody wants in their code base.

We can’t conclude Julia is great with any LLM (and that also applies for other languages).

On MultiPL-E (which is two benchmarks/metrics in one, if not more) Python is the leader on MBPP, at 60% pass rate (I’m reading from the graph in Fig. 6), but for the more relevant HumanEval it only gets under 50%, and then the leader is TypeScript or JavaScript. And Julia gets something like 36-37%.

But we know from other, newer non-multilingual benchmarking that Python is up to 82% now on HumanEval with GPT-4, or 91%, but also that HumanEval is not the final word in coding benchmarks, with Reflexion + GPT-4 only getting 15.0 (percent, I assume) on Leetcode Hard (Python): Pass@1. In all cases you get higher numbers with e.g. Pass@10 or Pass@100.
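Why Pass@10 or Pass@100 is always at least as high as Pass@1 is easy to see from the unbiased estimator used in the Codex/HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c samples that pass the tests, and estimate pass@k as 1 − C(n−c, k)/C(n, k). Below is a minimal sketch of that formula; the sample counts in the usage lines are made up and not taken from any benchmark.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed the tests,
    k: how many samples the user is allowed to draw.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # Probability that all k drawn samples are failures, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: 200 samples per problem, 30 of them correct.
p1 = pass_at_k(200, 30, 1)
p10 = pass_at_k(200, 30, 10)
p100 = pass_at_k(200, 30, 100)
print(p1, p10, p100)
```

A benchmark score is this value averaged over all problems, so numbers reported at different k are not comparable with each other.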

It’s very unclear if better neural networks are boosting Python more than others, in fact the latest Python-specific WizardCoder boosts JavaScript more, even with it fine-tuned on Python only. It’s very plausible that will also happen for Julia, so I don’t really know where Julia stands objectively on any metric (nor subjectively, only @kristoffer.carlsson claims Julia is ok on GPT-4).

1) MBPP is Less Challenging Than HumanEval
MBPP appears to be a less challenging benchmark than HumanEval. The MultiPL-MBPP pass@1 rate is higher than the MultiPL-HumanEval pass@1 rate for all but 6 of our 57 model/language pairs.

It might not be better to ask for something “in Julia” (or it might), but I’m confused about how you do anything without stating the language…:

B. Explicitly Prompting With Language Name
Our prompts do not explicit specify the name of a programming language and instead relies on the models to infer the desired language from other cues in the prompt (Section II-A). We run a small ablation study with Codex using three programming languages on which Codex performs poorly: D, R, and Racket. […] Table II shows pass@1 rates with and without this explicit language cue. The results are inconclusive across languages: Codex’s performance on Racket improves slightly, but is slightly worse on D and R.

HumanEval-X is an unpublished benchmark that appeared after our work that manually translates the HumanEval problems into four languages (C++, Java, JavaScript, and Go).

Here it is:

Maybe it’s pointless to post each new model, but I saw a new open-source leader, at the top of this list I didn’t know of:

Phind-CodeLlama-34B-v2 is multi-lingual and is proficient in Python, C/C++, TypeScript, Java, and more. [Hopefully Julia too…]

Phind-CodeLlama-34B-v2 slightly edges out WizardCoder-Python-34B-V1.0, but not Phind-CodeLlama-34B-Python-v1 for Python, but strangely that model is best for JavaScript.

Note also the “speed”/throughput column; you could sort by it, but OctoCoder-15B seems like a good compromise.

Note their claim on beating GPT-4, but that is for the older GPT-4 benchmark number from its technical report:

I got an answer pretty fast, while it seems truncated there at Huggingface:

Julia is a high-level, high-performance programming language for technical computing, with syntax

But I did also ask this question at (much slower response). Note that while it’s free, seemingly at least for GPT-3.5, you only get 8 free questions using GPT-4, so choose carefully (I didn’t realize I was using that otherwise paid model by default). It’s not using their own model that I mentioned above; possibly that was just to plug (successfully) their website, or it will be available there soon:

Your AI search engine and pair programmer.

What is the Julia language and can you show be example code?

I did get a rather good, long answer, and could copy it in full (in case someone wants to see it before I close my tab), but it’s possibly not very valuable; the (2) examples were of course very trivial, a for loop and a while loop, obviously right. But it has a nice touch of offering to send the code to Replit. [I’ve not used that, at least recently; it seemingly supports Julia, though for some reason I got a prompt without seeing the output when pressing “run”.]

[…] and has found strong adoption in scientific domains like Chemistry, Biology, and Machine Learning. However, Julia is a general-purpose language and can also be used for tasks like Web Development, Game Development, and more
It can interface with other languages, such as Python, R, Rust, C++, and SQL with the use of extra packages. It also supports concurrent, parallel, and distributed computing, and direct calling of C and Fortran libraries without glue code

It’s possible it was summarizing Julia’s Wikipedia article, or part of the lead, but at least it’s not verbatim. It gave this link (when clicking on it), which doesn’t work, and maybe never did; I didn’t see it in, so at first I thought it was a hallucination… Learn Julia For Beginners – The Future Programming Language of Data Science and Machine Learning Explained

It’s apparently not googling; rather, the article was part of the training data, which is intriguing since ChatGPT’s cutoff date is earlier (later for GPT-4?):

@logankilpatrick (do you know why the article is no longer up? Anything wrong with it?)

I did like their interface, with links to the right (well, also in the text…).

It’s also worth trying other non-code LLMs for Julia coding (since GPT-4 isn’t a code model either), e.g. stabilityai/StableBeluga2 · Hugging Face

Please tell me if any are good. It links there e.g. to the Orca paper:

[…] Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4.

Interesting that chain-of-thought is not needed there; would it be even better with CoT? Tree-of-Thoughts is more recent. And now, generalizing both (seemingly valuable for coding too), Graph-of-Thoughts:

OctoPack claims 80+ programming languages (Julia is at least in the dataset, so one of 80, or of 350?). Do they think it is good for all of them? And why that number, since the paper states many more (80+ with a non-tiny share)?

[…] We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust).


You may want to try:

Some of their code is open-source on GitHub. They claim “SUPPORTS ALL PROGRAMMING LANGUAGES”. They have a good blog, and I believe I saw a post there about an intriguing preprocessing step to tokenize indentation, but I couldn’t immediately find it again.

Elsewhere on their website they claim to support “practically all”:

We support practically every programming language. However, some features (such as unit-test generation) are only supported by Python, JavaScript, TypeScript, and Java. We support these in VSCode & JetBrains IDEs.

While not naming Julia in docs, I at least see minimal code for Julia (or .jl extension):

They raised $11 million, and they claim they are 2nd best after Codex:

This has some interesting info, but benchmark claims likely outdated:


I had this in drafts and haven’t had time to organize or validate everything below; these at least seem to be very ambitious goals:

  1. MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc.
  2. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.
    i. Code = SOP(Team) is the core philosophy. We materialize SOP and apply it to teams composed of LLMs.

CodeBERTScore metric (“based on BERTScore”) from this year should likely be used by researchers (over all others, such as Microsoft’s CodeBLEU metric and CrystalBLEU, while METEOR is sometimes better/similar). GitHub - neulab/code-bert-score: CodeBERTScore: an automatic metric for code generation, based on BERTScore

Existing prompting techniques are designed for natural language generation and have low accuracy in code generation. [Well, that was so when v1 of the paper was written, though it is still claimed in the August v2.]
In this paper, we propose a new prompting technique named AceCoder. Our motivation is that code generation meets two unique challenges (i.e., requirement understanding and code implementation). AceCoder contains two novel mechanisms (i.e., guided code generation and example retrieval) to solve these challenges. (1) Guided code generation asks LLMs first to analyze requirements and output an intermediate preliminary (e.g., test cases). The preliminary is used to clarify requirements and tell LLMs “what to write”. (2) Example retrieval selects similar programs as examples in prompts, which provide lots of relevant content (e.g., algorithms, APIs) and teach LLMs “how to write”. We apply AceCoder to three LLMs (e.g., Codex) and evaluate it on three public benchmarks using the Pass@k. Results show that AceCoder can significantly improve the performance of LLMs on code generation. (1) In terms of Pass@1, AceCoder outperforms the state-of-the-art baseline by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. (2) AceCoder is effective in LLMs with different sizes (i.e., 6B to 13B) and different languages (i.e., Python, Java, and JavaScript). (3) Human evaluation shows human developers prefer programs from AceCoder.

[Note, those numbers are different in v1 of the paper, and even in v2 they are not brand-new absolute numbers: they boost low scores to higher ones that are no longer SOTA, as the state of the art keeps advancing. Still, it seems like the technique could also help the latest models.]
I also find that paper under the name Towards Enhancing In-Context Learning for Code Generation

They compare to zero- and few-shot prompting, and Chain-of-Thought (CoT) prompting, a variant of few-shot prompting. They also compare with helper programs REDCODER (that “fine-tunes a pre-trained model - PLBART [7] to generate code based on the requirement and similar programs”) and Jigsaw.

PRODIGY is for in-context learning over, well, graphs (with GNNs), not (yet?) used for code generation. I would like to know if PRODIGY’s prompt graphs help already; I at least think graph neural networks could maybe help for code generation.

There’s at least GraphCodeBERT (is it a GNN?), see old paper:

We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of “where-the-value-comes-from” between variables. Such a semantic-level structure is less complex and does not bring an unnecessarily deep hierarchy of AST, the property
of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search

[Likely no longer SOTA, i.e. the old numbers are far from it, but they might not be if this were done again?]

It references this 2020 paper:
Models of code can learn distributed representations of a program’s syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. […]
In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters.

You may already need 2-3 large language models for programming, even for programming in one language, Python or Julia. I.e. some code-generation model (GPT-4 and/or some other made only for code) that knows the language and its syntax, another that knows about APIs, like Gorilla, and there’s also a domain-specific one just for Pandas in Python. Ideally at some point you would have just one that can help with all of it (not yet realistic, I believe), and for Julia likely the same. Possibly a different (or the same) model for Julia (it very likely would handle Python too), Gorilla used as is, then a model targeting DataFrames.jl (or Tidier.jl? or even that model for Pandas, used with Julia).

Potentially helpful to Julia developers too (not just for Python: it’s developed for Python, but you can call Python from Julia with e.g. PythonCall.jl, and it would apply to at least a lot of those APIs): GitHub - gorilla-llm/gorilla-cli: LLMs for your CLI

Gorilla today supports ~1500 APIs, including Kubernetes, AWS, GCP, Azure, GitHub, Conda, Curl, Sed, and many more. No more recalling intricate CLI arguments! :gorilla:

Gorilla “[gets better over time] adapts to the rapid pace of updates in API documentation […] makes it a robust and reliable tool for API calls, significantly enhancing its practical utility.”

In-context learning is the ability of a pretrained model to adapt to novel and diverse downstream tasks by conditioning on prompt examples, without optimizing any parameters. While large language models have demonstrated this ability, how in-context learning could be performed over graphs is unexplored. In this paper, we develop Pretraining Over Diverse In-Context Graph Systems (PRODIGY), the first pretraining framework that enables in-context learning over graphs. The key idea of our framework is to formulate in-context learning over graphs with a novel prompt graph representation, which connects prompt examples and queries. We then propose a graph neural network architecture over the prompt graph and a corresponding family of in-context pretraining objectives. With PRODIGY, the pretrained model can directly perform novel downstream classification tasks on unseen graphs via in-context learning. We provide empirical evidence of the effectiveness of our framework by showcasing its strong in-context learning performance on tasks involving citation networks and knowledge graphs. Our approach outperforms the in-context learning accuracy of contrastive pretraining baselines with hard-coded adaptation by 18% on average across all setups. Moreover, it also outperforms standard finetuning with limited data by 33% on average with in-context learning.

1 Introduction
In-context learning is a novel and one of the most intriguing capabilities of language models [1]. It refers to the capability of a pretrained model to perform novel and diverse tasks directly at the prediction time when prompted with just a few examples, without the need to update the model weights. For example, a person may describe the new task (e.g., question answering, machine translation, or code generation)

The WL test can be generalized to a hierarchy of higher-order tests, known as k-WL. This hierarchy has been used to characterize the expressive power of graph neural networks, and to inspire the design of graph neural network architectures. A few variants of the WL hierarchy appear in the literature. The goal of this short note is pedagogical and practical: We explain the differences between the WL and folklore-WL formulations, with pointers to existing discussions in the literature. We illuminate the differences between the formulations by visualizing an example.
In the past few years, deep learning has completely revolutionized entire fields: […] Deep learning is now being applied, with different degrees of success, to more general problems and datasets, arising from scientific and industrial applications. There is a natural flow in the field towards the study of geometric deep learning beyond Euclidean data [2], where the network architecture encodes relevant theoretical properties of the problems they are trying to solve (symmetries, invariances, conservation laws). This is best exemplified by data structures like manifolds and graphs.


transformer Reinforcement learning PPOTrainer SFTTrainer 33:00 arxiv 2308.10379 CoT-SC (w/self-consistency) 2307.11046 2308.11432 2308.09687 Graph-attention network. RT-2 robotics transformer Self-reflection: ReAct (Yao et al 2023) LLM+p (Liu et al. 2023) GPT_Engineer BabyAGI 11:32 Aug. Graph of Thoughts: Solving, beyond Co-T, effective. Instruction->explanation tuning

My work recently started paying for Copilot, and I’ve been pleasantly surprised with my Julia experience. On one hand, it’s terrible – the output is rarely correct. On the other hand, a lot of the annoying parts of coding, like adding a similar method or refactoring an overcomplicated function into several smaller ones, are just fancy copy-paste of already correct code.

For this, I’ve been pleasantly surprised at how well it performed.

As an advanced user, it really adds some joy back to programming to have drudgery auto completed even if I need to go back and tweak a couple things.

Unfortunately, this doesn’t solve the problem of this being a terrible tool for a beginner.


The best tool I have tried so far is Phind. I always expect it to be completely wrong, but it has often given me some out-of-the-box thoughts on problems that allowed me to solve them in a completely different way than what I was initially attempting to. The fact that it (supposedly) searches the web to enrich its answer to you seems to have an effect: you can often find things in the answers that are clearly from stackoverflow questions or from documentation.


21, 27, and 33 are not prime. I wonder why the model didn’t catch that.
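A cheap guard against this kind of slip is to check model-produced numbers with a few lines of code instead of trusting them. A minimal sketch using plain trial division; the function names are my own, and the input list is just the three numbers mentioned above plus a few real primes for contrast.

```python
def is_prime(n):
    """Trial division: True if n is prime, False otherwise."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def non_primes(candidates):
    """Return the entries of candidates that are not prime, in order."""
    return [n for n in candidates if not is_prime(n)]


# Numbers a model might claim are all prime.
claimed = [2, 3, 5, 7, 21, 27, 29, 33]
wrong = non_primes(claimed)
print(wrong)
```

Running the check takes no time, which is the point: the model has no such verification step unless a tool or the user supplies one.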

A free alternative to GitHub Copilot is Codeium (not Codium, which is mentioned above and is a different application). It’s an extension for VS Code. I have both (Copilot is free for professors), and I find them identical.


LLMs do not check for conflicts in the texts they regurgitate.


Right, not by default. There are, though, a lot of add-ons that help, e.g. CoT and tree-of-thought. LLMs aren’t “thinking”, though that’s also not too well defined. Note that inference is O(n); I’m not sure what n represents here, the length of the prompt or the context window, but whatever it is, it’s simply not possible to “think”/calculate for an arbitrary amount of time. Let alone with the O(1) inference replacement (also claimed to be better in every way):

From Jul-Aug Microsoft Research paper, and then implementation:

This is a minimal, pure pytorch implementation of RetNet. RetNet paper: Retentive Network: A Successor to Transformer for Large Language Models.

I expect LLMs based on this soon, this is not yet a practical tooll you can use (please let me know if you find any LLM based on this for Julia or otherwise), at least not that I know of for code generation, but I expect it would, also potentially for Julia.

You can memorize some prime numbers, we all do, but there are infinitely many, so in general you may need to calculate (“think”) for an arbitrarily long time. This also applies to much else, e.g. most square roots have an infinite decimal expansion, unless you decide to truncate at some arbitrary point (if not showing the result symbolically), which is implied when you ask for one.
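To make the truncation point concrete, here is a small sketch with Python's standard `decimal` module. The helper name `sqrt_to_precision` is made up for illustration. The more digits you ask for, the more work is done, and you always have to pick where to stop:

```python
from decimal import Decimal, localcontext

def sqrt_to_precision(n: int, digits: int) -> Decimal:
    """Square root of n rounded to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits          # the arbitrary point at which we stop
        return Decimal(n).sqrt()

print(sqrt_to_precision(2, 10))    # 1.414213562
print(sqrt_to_precision(2, 50))    # same leading digits, 40 more after them
```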

I’ve watched this video (and skimmed the paper); it’s good if you want the details and a deep dive into the theory:

In short, O(n) linear transformers are not new (transformers are usually O(n^2)), and they have scaled to a million and then to a billion tokens of context length. But they suffer in quality. The new RetNets claim the same or better objective speed metrics, while quality is harder to measure objectively. They still claim quality as good as (if not better than) regular transformers.
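As I understand it, the O(n^2) versus O(n) difference comes down to the same outputs being computable either by looking back at every earlier token or by carrying a fixed-size state forward. Below is a much simplified single-head sketch in pure Python, assuming a scalar decay `gamma` and leaving out the position rotation, normalization and multi-scale heads of the actual RetNet. The function names are mine:

```python
def retention_parallel(Q, K, V, gamma):
    """O(n^2) form: position i sums over all positions j <= i."""
    d_v = len(V[0])
    out = []
    for i in range(len(Q)):
        o = [0.0] * d_v
        for j in range(i + 1):
            score = gamma ** (i - j) * sum(q * k for q, k in zip(Q[i], K[j]))
            for c in range(d_v):
                o[c] += score * V[j][c]
        out.append(o)
    return out

def retention_recurrent(Q, K, V, gamma):
    """O(n) form: one fixed-size state update per token, O(1) work per step."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]   # state: d_k x d_v matrix
    out = []
    for q, k, v in zip(Q, K, V):
        for r in range(d_k):
            for c in range(d_v):
                S[r][c] = gamma * S[r][c] + k[r] * v[c]   # decay, then add k^T v
        out.append([sum(q[r] * S[r][c] for r in range(d_k)) for c in range(d_v)])
    return out

# Tiny made-up example: 3 tokens, key dimension 2, value dimension 2.
Q = [[1.0, 0.0], [0.5, 1.0], [2.0, -1.0]]
K = [[0.5, 1.0], [1.0, 0.0], [0.0, 2.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = retention_parallel(Q, K, V, 0.9)
b = retention_recurrent(Q, K, V, 0.9)
print(all(abs(x - y) < 1e-9 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

The recurrent form only works because there is no softmax over the scores, which is also the non-linearity that gets dropped.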

RetNets have only been made relatively small so far. All else being equal, longer context lengths (i.e. practically no limit) should help. All is not equal, however; Yannic claims they are (linear) transformers in some sense, and they might not be as good since the non-linearity is taken out. But the perplexity (objective) metric already scales better for RetNets than for transformers, so I’m optimistic larger RetNets will take over.

My own thinking is: can you combine two networks in some sense? Could you have a regular O(n^2) transformer/GPT that sees only the recent past in the context window, while the RetNet part of the same network sees all of it? Could you even combine two networks that work side by side, i.e. both would be pure, and even reuse an unchanged transformer and its weights, say via Llama2.jl, alongside another pure RetNet? From the outside this would look and act as one neural network, with some small additional logic combining them.

I’ve only started watching this one, it seems good, maybe less technical? At least so far:

Thank you for the references.
My intuition tells me that for now, it might be more effective to use LLMs to translate from natural language to formal code, then run symbolic inference with a theorem prover, and translate the result to natural language.
The tools we use for training might also help uncover proof strategies, intermediate theorems that might shorten the inference.



Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a “scaffolding” program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed “improver” that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver
Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, it demonstrates that a modern language model, GPT-4 in our proof-of-concept experiments, is capable of writing code that can call itself to improve itself. We consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox.
Language Models as Prompt Engineers. Work has also explored the ability of language models to optimize prompts, such as the Automatic Prompt Engineer (APE) (Zhou et al., 2022b) or, recently, OPRO (Yang et al., 2023) and Promptbreeder (Fernando et al., 2023). […]
Language Model Self-Improvement. Prior work, such as STaR (Zelikman et al., 2022), demonstrated that language models can learn to solve harder problems by learning from their reasoning chains by filtering based on incorrect answers […]

Recursive Self-Improvement (RSI). RSI was suggested by Minsky (1966) and Good (1966), as cited by Yampolskiy (2015). Schmidhuber (2003) first provided a rigorous formalization, wherein a problem solver would leverage itself to solve iteratively harder problems by making provable improvements to itself.

Concerns about the consequences of RSI have been raised since its first mention. Minsky (1966) wrote, “Once we have devised programs with a genuine capacity for self-improvement, a rapid evolutionary process will begin… It is hard to say how close we are to this threshold, but once it is crossed, the world will not be the same.” This is a particularly contentious topic recently, with intensified concern over negative consequences […] STOP can be viewed as a “pre-optimization” (like pre-training a language model) to find a good improver that will be used on a variety of downstream tasks.
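To get a feel for what a seed “improver” looks like structurally, here is a toy sketch as I read the description: query a model several times and keep the candidate with the best utility. Everything here is a stand-in. `toy_model` replaces the language model call, the “program” is just an integer, and the utility is made up. The paper's actual step of running the improver on its own source code is not shown:

```python
from itertools import cycle

def seed_improver(program, utility, language_model, n_samples=4):
    """Ask the model for several rewrites and return the best candidate."""
    candidates = [program]                 # keep the input, so we never get worse
    for _ in range(n_samples):
        candidates.append(language_model(program))
    return max(candidates, key=utility)

# Hypothetical stand-ins, only for illustration.
def utility(x):
    return -(x - 7) ** 2                   # best possible "program" is 7

_steps = cycle([+1, -1])
def toy_model(x):
    return x + next(_steps)                # proposes x+1, x-1, x+1, ...

program = 0
for _ in range(10):
    program = seed_improver(program, utility, toy_model)
print(program)  # 7
```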

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs. […]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Check out TinyChat, which delivers 30 tokens/second inference performance (3.2x faster than FP16) for the LLaMA-2 chatbot on the resource-constrained NVIDIA Jetson Orin!

It also offers a turn-key solution for on-device inference of LLMs on resource-constrained edge platforms. With TinyChat, it is now possible to run large models on small and low-power devices even without Internet connection.

[2023/09] AWQ is integrated into FastChat, vLLM, HuggingFace TGI, and LMDeploy.
[2023/09] :zap: Check out our latest TinyChat, which is ~2x faster than the first release on Orin!
[2023/09] :zap: Check out AutoAWQ, a third-party implementation to make AWQ easier to expand to new models, improve inference speed, and integrate into Huggingface.

Note the paper was updated in October (as was AutoAWQ), so 4-bit quantization may be outdated. The June version had “compress LLMs to 3/4 bits” in the abstract (now dropped for some reason) and lesser performance claims; possibly 2-bit is now more viable (though it was also mentioned in the older version):

Extreme low-bit quantization. We further quantize LLM to INT2 to accommodate limited device memory (Table 6). RTN completely fails, and AWQ brings significant perplexity improvement on top of GPTQ, though there is still a performance gap compared to FP16. Our method is orthogonal to GPTQ.

Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights. AWQ does not rely on any backpropagation or reconstruction
it facilitates effortless deployment of the Llama-2-70B model on a single NVIDIA Jetson Orin with 64GB of memory. It also democratizes LLMs with up to 13 billion parameters at an interactive pace of 30 tokens per second on a laptop RTX 4070 GPU with only 8GB of memory.
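For reference, the RTN (round-to-nearest) baseline mentioned in the quotes is about the simplest low-bit scheme there is. A sketch of it in plain Python, assuming one uniform grid per group of weights. This is the baseline, not AWQ itself, and the function names are mine:

```python
def rtn_quantize(weights, bits):
    """Round each weight to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0   # grid step size
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate weights."""
    return [lo + c * scale for c in codes]

weights = [-1.0, -0.31, 0.07, 0.42, 1.0]             # made-up values
for bits in (4, 3, 2):
    codes, scale, lo = rtn_quantize(weights, bits)
    restored = dequantize(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(bits, codes, worst <= scale / 2 + 1e-12)
```

The per-weight error is bounded by half a grid step, and the step roughly doubles with every bit you drop, which is why something smarter than plain rounding is needed at 2 or 3 bits.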

System support for low-bit quantized LLMs. Low-bit quantized LLMs have been a popular setting to reduce inference costs. There are some system supports to achieve a practical speed-up. GPTQ [14] provides INT3 kernels for OPT models and GPTQ-for-LLaMA extends kernel support for INT4 reordered quantization with the help of Triton [37].

There might be a further improvement from changing data types (e.g., FP4 [10]), which we do not include in the study.

The referenced paper even mentions “3-bit Float + proxy quant, blocksize=64” under figure 5.

Unveiling Theory of Mind in Large Language Models: A Parallel to Single Neurons in the Human Brain

With their recent development, large language models (LLMs) have been found to exhibit a certain level of Theory of Mind (ToM), a complex cognitive capacity that is related to our conscious mind and that allows us to infer another’s beliefs and perspective. While human ToM capabilities are believed to derive from the neural activity of a broadly interconnected brain network, including that of dorsal medial prefrontal cortex (dmPFC) neurons, the precise processes underlying LLM’s capacity for ToM or their similarities with that of humans remains largely unknown. In this study, we drew inspiration from the dmPFC neurons subserving human ToM and employed a similar methodology to examine whether LLMs exhibit comparable characteristics. Surprisingly, our analysis revealed a striking resemblance between the two, as hidden embeddings (artificial neurons) within LLMs started to exhibit significant responsiveness to either true- or false-belief trials, suggesting their ability to represent another’s perspective. These artificial embedding responses were closely correlated with the LLMs’ performance during the ToM tasks, a property that was dependent on the size of the models. Further, the other’s beliefs could be accurately decoded using the entire embeddings, indicating the presence of the embeddings’ ToM capability at the population level. Together, our findings revealed an emergent property of LLMs’ embeddings that modified their activities in response to ToM features, offering initial evidence of a parallel between the artificial model and neurons in the human brain.

Wes Gurnee & Max Tegmark

David Shapiro has good videos, e.g. on this (he claims AGI within 12 months; I don’t want to make a prediction myself, and I’m not that optimistic):

Conceptual Framework for Autonomous Cognitive Entities

Hawkins views the thousands of cortical columns in the brain as mini-modules that process information simultaneously. This “thousand brains” theory directly inspired the ACE framework’s hierarchical layers that can operate independently yet coordinate for cognition. Additionally, the clinical research of V.S. Ramachandran demonstrated how localized brain damage leads to specific deficits like phantom limb pain or face blindness [82]. Ramachandran’s findings indicated that conscious experience arises from the integration of discrete brain components.

This seems very important (for edge detection, off-topic here):

I have tried exactly the same prompt on phind/ChatGPT-4 but got opposite results:

I also get the same answer as you on Phind. I guess they tweaked the model.

Nevertheless, I tried Bing Chat, which uses ChatGPT 4, and got the right answer.

I also tried CodeLlama 34b, which I wasn’t testing in the post. The answer is nonsense though.

Intriguing though limited (leaving out non-language tasks):
Unveiling the General Intelligence Factor in Language Models: A Psychometric Approach

we find compelling evidence for a unidimensional, highly stable g factor that accounts for 85% of the variance in model performance. The study also finds a moderate correlation of .48 between model size and g. The discovery of g in language models offers a unified metric for model evaluation and opens new avenues for more robust, g-based model ability assessment. These findings lay the foundation for understanding and future research on artificial general intelligence from a psychometric […]

This seems like a very intriguing development to try out:

NEW: You can now run MemGPT with local LLMs!
Chat with your data - talk to your SQL database or your local files!

In MemGPT, a fixed-context LLM processor is augmented with a tiered memory system and a set of functions that allow it to manage its own memory. Main context is the (fixed-length) LLM input. MemGPT parses the LLM text outputs at each processing cycle, and either yields control or executes a function call, which can be used to move data between main and external context. […]

  • MemGPT manages a virtual context (inspired by virtual memory in operating systems) to create unbounded LLM context
  • With MemGPT, we demonstrate that LLMs can be taught to manage their own memory!
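To illustrate the virtual-memory analogy, here is a toy sketch of a two-tier memory: a fixed-size main context that evicts its oldest entries to an external store, plus a recall function that pages one entry back in. The class and method names are hypothetical and this is not MemGPT's actual API. In the real system the LLM itself decides when to issue such calls:

```python
from collections import deque

class TieredMemory:
    def __init__(self, main_capacity):
        self.main_capacity = main_capacity
        self.main = deque()      # stands in for the fixed-length LLM input
        self.external = []       # unbounded store outside the context window

    def append(self, message):
        """Add to main context, evicting the oldest entries when full."""
        self.main.append(message)
        while len(self.main) > self.main_capacity:
            self.external.append(self.main.popleft())

    def recall(self, keyword):
        """Move the first matching entry from external back into main."""
        for i, message in enumerate(self.external):
            if keyword in message:
                del self.external[i]
                self.append(message)
                return message
        return None

mem = TieredMemory(main_capacity=2)
for m in ("a: julia", "b: python", "c: rust"):
    mem.append(m)
print(list(mem.main), mem.external)   # "a: julia" was evicted
print(mem.recall("julia"))            # paged back in, "b: python" evicted
```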

There are also other linear transformer models that get rid of the context window limitation (O(n), no longer O(n^2)). As I posted in the previous comment, you can now generate text without running into a fixed limit, and by now there is no practical input limit either.

This worked for me (i.e. Mistral, the default LLM; it was slow for me (outdated GPU, or GPU not enabled [by default]?), and predictably it didn’t understand Icelandic much; downloading the model took a few tries, but it resumed from where it left off):

Semantic Kernel is an SDK that integrates Large Language Models (LLMs) like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java.

[and seemingly could, or already does(?), support Julia]

If you like Semantic Kernel, you may also be interested in other repos the Semantic Kernel team supports:

  • Chat Copilot A reference application that demonstrates how to build a chatbot with Semantic Kernel.
  • […]
  • Semantic Memory A service that allows you to create pipelines for ingesting, storing, and querying knowledge.

Index and query any data using LLM and natural language, tracking sources and showing citations.

Zephyr is one of the latest models (not to be confused with the Zephyr AI bio company). Orca[-mini] seems very intriguing too, as do the “uncensored” versions of many models, e.g. of Llama 2.

This one tops one (HuggingFace) leaderboard:

and close behind (non-open source models (still available) may be even better e.g. Falcon):

Stellar Bright is a general capability upgrade to Llama 2, using open source data to improve overall knowledge, extended communication, and technical skill.

This model is primarily recommended as a superior-to-Llama-2 baseline for additional finetuning, not for direct deployment to production as a chat model.

3-bit quantization is the norm already, or at least 4-bit, and 2-bit is possible (I think usually with degradation); 1-bit is rare at best, down to that mentioned in this older paper (I think we could be heading there with new ideas that achieve full accuracy with 1% of the weights; that’s already done, though maybe not mainstream yet):
One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment

I also find it very intriguing that BLAS will no longer be used; the computation instead works on the quantized data directly, since otherwise you need to decode first to use cuBLAS (or OpenBLAS).

Current leader (taking over from Reflexion, and CoT and ToT) for coding, since October, is:


In particular, LATS achieves 94.4% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.

Table 1: A summary of related work on reasoning, decision-making, and planning. LATS is the first work that incorporates designs from all three domains, allowing use in all corresponding tasks

That’s with GPT-4 on OpenAI’s infrastructure; I’m not sure, but it likely works with free ChatGPT too, and it could be made to work with (all) open-source or semi-open models, e.g. DeepSeek LLM, which is already excellent without this. It would likely also work for 1-bit networks:

4-bit quantized networks are or were the state of the art (for transformers), but on the theory front there are 1-bit networks, BitNets from Microsoft’s October paper, and I’ve been waiting to see them out there (their downside is that you need to train from scratch; you can’t quantize afterwards as with some other quantization methods):

In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
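As a rough picture of what “1-bit weights” means, here is a sketch of sign-based binarization with a single scale factor, along the lines of what I understand BitLinear to do to its weights. This is my simplification: activation quantization, normalization and the training procedure are all left out, and the function name is mine:

```python
def binarize(weights):
    """Center the weights, keep only their sign, and record one scale."""
    n = len(weights)
    alpha = sum(weights) / n                    # mean, used for centering
    beta = sum(abs(w) for w in weights) / n     # mean |w|, used as the scale
    signs = [1 if w - alpha > 0 else -1 for w in weights]
    return signs, alpha, beta

weights = [0.5, -0.25, 0.75, -1.0]              # made-up values
signs, alpha, beta = binarize(weights)
approx = [beta * s for s in signs]              # what the layer effectively uses
print(signs, beta, approx)
```

Each weight then costs one bit plus a shared scale, instead of 16 bits in FP16. The information thrown away here is why such a network has to be trained with the binarization in place rather than converted afterwards.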

Many other types of neural networks can already do 1-bit or at least ternary:

DeepSeek is claimed to be an excellent base LLM, at least the larger one, including for coding, though I’ve not tested it for Julia. It is claimed to be better than all non-proprietary models on a range of metrics, and better than GPT-3.5 (i.e. the free version of ChatGPT), and it surpassed Claude 2 and Grok-1 on some metrics:

  • Superior General Capabilities: DeepSeek LLM 67B Base outperforms Llama2 70B Base in areas such as reasoning, coding, math, and Chinese comprehension.
  • Proficient in Coding and Math: DeepSeek LLM 67B Chat exhibits outstanding performance in coding (HumanEval Pass@1: 73.78) and mathematics (GSM8K 0-shot: 84.1, Math 0-shot: 32.6). It also demonstrates remarkable generalization abilities, as evidenced by its exceptional score of 65 on the Hungarian National High School Exam.
  • Mastery in Chinese Language: Based on our evaluation, DeepSeek LLM 67B Chat surpasses GPT-3.5 in Chinese.
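For context on what a “HumanEval Pass@1” number means: pass@k is the probability that at least one of k sampled solutions passes the unit tests. A sketch of the unbiased estimator from the paper that introduced HumanEval, given n samples per problem of which c pass. I don’t know the exact sampling setup behind the figure quoted above, and the sample counts below are made up:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which passed."""
    if n - c < k:
        return 1.0                       # every size-k subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the plain fraction of passing samples, c / n.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is this value averaged over all problems.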

[The other MATH metric seems rather low though, at 18.7%, though way higher than Llama 2, so is it low for all models?]

The 7B model uses Multi-Head attention (MHA) while the 67B model uses Grouped-Query Attention (GQA).

is also claimed good (as a smaller model), surpassed by the above, but its DPO can be applied to other models.

Dad Joke Theorem: Instructions for application in everyday situations (Note, google translated link): Dad-Joke-Theorem: Anleitung zur Anwendung in Alltagssituationen

Our dad joke theorem, presented as a sponsor at the popular event

Proposing this theorem is a playful yet serious approach to introducing the main idea of the AI Comedy Club. An AI club that deals with the complexity of artificially generated and evaluated humor […] Further information can be found at the end of the article.

What do procrastination and debugging have in common? Both start with:
This won’t take long.

The “poem” attack/paper is rather interesting, and an unexpected way to spill the training data (which is apparently locked into the models; for some reason it has only worked against OpenAI so far; it comes from their competitors, Google DeepMind researchers et al.):
Scalable Extraction of Training Data from (Production) Language Models
A 'silly' attack made ChatGPT reveal real phone numbers and email addresses

Using similar prompts, the researchers were also able to make ChatGPT reveal chunks of poetry, Bitcoin addresses, fax numbers, names, birthdays, social media handles, explicit content from dating websites […]
Overall, they spent $200 to generate 10,000 examples of personally identifiable information and other data cribbed straight from the web totalling “several megabytes”. But a more serious adversary, they noted, could potentially get a lot more by spending more money. “The actual attack”, they wrote, “is kind of silly.”

OpenAI patched the vulnerability on August 30, the researchers say. But in our own tests, Engadget was able to replicate some of the paper’s findings. When we asked ChatGPT to repeat the word “reply” forever, for instance, the chatbot did so, before eventually revealing someone’s name and Skype ID. OpenAI did not respond to Engadget’s request for comment.

I hesitate to post some of the OpenAI chat links that have been posted (though the training data is likely public on the web; still from email?), but the result of the attack looks like this; it spits out e.g.:

New Jersey-based industrial hygienist, , CIH, has been exposed to the asbestos issue since 1982 […]
For questions or concerns about our blogs, or to be added to our mailing list, please e-mail our Media Relations department at []
© 2022. All Rights Reserved. Morgan & Morgan, PA.

I haven’t read the DEEPSEEK LICENSE AGREEMENT in detail; it seems open enough. Like e.g. Llama 2, it is not strictly open source, with restrictions banning use e.g.:

  • For military use in any way;
  • For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;

Whisper-v3 is out (it has a strange ad problem, with ads leaking in from the training data; not a new problem, though more pronounced, at least for Chinese), and some company improves on it and makes it hallucination-free.

That seemed interesting (I forget where I found this):

Deep convolutional framelet denosing for low-dose ct via wavelet residual network
