At a minimum you need to choose which model to fine-tune, and Llama 3 is already outdated. (I've only scanned the Slack thread up to that point, and this one looked good:)
Maybe this is useful? bigcode/starcoderbase · Hugging Face
Arctic LLM seems best now (its Base model was updated 2 hours ago), and/or Phi-3 (also new, for a small one); both would now be on my short-list. Also worth a look: WaveCoder and its paper, for its "LLM-based Generator-Discriminator data process framework to generate diverse, high-quality instruction data from open source code", and the hybrid Mamba/Transformer ajibawa-2023/Code-Jamba-v0.1 · Hugging Face, which is still the only model tagged "Julia" on HF.
One open question is whether it's better to start with a larger model (presumably better, but not always), or whether larger models learn more slowly during fine-tuning simply because of their size. There's also the risk of "catastrophic forgetting" when fine-tuning on out-of-distribution data (so better to start from something that isn't awful at Julia?), though forgetting the other languages, Python etc., may not be a bad thing…
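For what it's worth, a common way to limit forgetting (not something specific to any model above) is parameter-efficient fine-tuning such as LoRA, where most of the base weights stay frozen and only small adapter matrices are trained. A rough sketch with the Hugging Face `transformers` and `peft` libraries; the checkpoint name, target module names, and hyperparameters are placeholders, not a recommendation:

```python
# Minimal LoRA fine-tuning setup sketch: freezes the base model and trains
# only low-rank adapters, which in practice limits how much of the original
# (Python etc.) capability is lost.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint -- substitute whatever model ends up on the short-list.
base = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase")

lora_cfg = LoraConfig(
    r=16,                        # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["c_attn"],   # attention projection; module names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # typically well under 1% of parameters are trainable
```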
One other thing that might rule out some models: the tokenizer is fixed, and some recent ones have only about 30,000 possible tokens. I'm wondering whether they cover the Unicode that Julia allows; we want all the math operators supported, or at least supportable…
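This is easy to check empirically for any candidate. A quick sketch with the Hugging Face `transformers` tokenizer API; the checkpoint and the list of operators are just examples I picked (some checkpoints, e.g. starcoderbase, may require accepting a license first):

```python
# Check how a candidate model's tokenizer handles Julia's Unicode operators.
from transformers import AutoTokenizer

# Example checkpoint -- swap in any model under consideration.
tok = AutoTokenizer.from_pretrained("bigcode/starcoderbase")

# A few Unicode operators/identifiers Julia programmers actually type.
julia_snippets = ["∈", "≤", "≥", "≈", "√", "∘", "⊻", "x ⊆ S", "α₁ = 2π"]

for s in julia_snippets:
    ids = tok(s, add_special_tokens=False)["input_ids"]
    pieces = tok.convert_ids_to_tokens(ids)
    # Many tokenizers fall back to byte-level pieces for rare Unicode,
    # which inflates sequence length but still makes the operator representable.
    print(f"{s!r:12} -> {len(ids)} token(s): {pieces}")
```

If an operator only round-trips as a long run of byte-level fallback pieces, it is "supportable" but expensive; if it has a dedicated token (or a short merge), the tokenizer is a better fit for Julia code.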