The current leader for coding (taking over from Reflexion, CoT and ToT), since October, is:
LANGUAGE AGENT TREE SEARCH UNIFIES REASONING ACTING AND PLANNING IN LANGUAGE MODELS
https://arxiv.org/pdf/2310.04406.pdf
In particular, LATS achieves 94.4% for programming on HumanEval with
GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5,
demonstrating the effectiveness and generality of our method.
Table 1: A summary of related work on reasoning, decision-making, and planning. LATS is the first work that incorporates designs from all three domains, allowing use in all corresponding tasks
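As I understand the paper, LATS draws on Monte Carlo tree search. Below is a minimal sketch of the standard UCT selection rule used in MCTS, not the authors' code. The function names, the dict layout of a child node and the exploration weight `w` are my own assumptions for illustration.

```python
import math

def uct_score(value, visits, parent_visits, w=1.0):
    """Standard UCT: estimated value plus an exploration bonus.

    `w` is a hypothetical exploration weight; larger values favor
    rarely visited children.
    """
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    return value + w * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, w=1.0):
    """Return the child (a dict with 'value' and 'visits') with the highest UCT score."""
    return max(
        children,
        key=lambda c: uct_score(c["value"], c["visits"], parent_visits, w),
    )

# Usage: a well-explored, good child versus a barely explored one.
children = [
    {"name": "a", "value": 0.9, "visits": 10},
    {"name": "b", "value": 0.5, "visits": 1},
]
chosen = select_child(children, parent_visits=11)
```

With these made-up numbers the exploration bonus outweighs the higher value estimate of `a`, so the rarely visited child `b` is selected.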
That’s with GPT-4 on OpenAI’s infrastructure; I’m not sure, but it likely works with free ChatGPT too, and it could be made to work with (all) open-source or semi-open models, e.g. DeepSeek LLM, which is already excellent without it. It would likely also work for 1-bit networks:
4-bit quantized networks are, or were, the state of the art (for transformers), but on the theory front there are 1-bit networks, BitNets from Microsoft’s October paper, and I’ve been waiting to see them out there (their downside is that you need to train from scratch; you can’t quantize afterwards as with some other quantization methods):
In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
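To make the 1-bit weight idea concrete, here is a minimal pure-Python sketch of sign binarization with a single scale factor. It only loosely follows the sign-plus-scale idea described for BitLinear; activation quantization, normalization and the training procedure are left out, and the function names are my own, not from the paper or any library.

```python
def binarize_weights(weights):
    """Return (signs, scale) for a flat list of float weights.

    Each weight is replaced by +1 or -1 (one bit), and a single
    full-precision scale keeps the overall magnitude.
    """
    n = len(weights)
    mean = sum(weights) / n  # center the weights before taking the sign
    signs = [1 if w - mean > 0 else -1 for w in weights]
    scale = sum(abs(w) for w in weights) / n  # mean absolute value
    return signs, scale

def binary_dot(signs, scale, x):
    """Dot product with binarized weights: only additions/subtractions, then one multiply."""
    return scale * sum(s * xi for s, xi in zip(signs, x))

# Usage with made-up weights.
signs, scale = binarize_weights([0.5, -0.3, 0.1, -0.7])
y = binary_dot(signs, scale, [1.0, 0.0, 0.0, 0.0])
```

This also shows why training from scratch matters: taking the sign of already trained full-precision weights throws away almost all of their information, so the network has to learn with the binarization in the loop.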
Many other types of neural networks can already do 1-bit or at least ternary:
DeepSeek is claimed to be an excellent base LLM, at least the larger one, including for coding, though I’ve not tested it for Julia. It is claimed to be better than all non-proprietary models on a range of metrics, and also better than GPT-3.5 (i.e. the free version of ChatGPT), and to surpass Claude 2 and Grok-1 on some metrics:
- Superior General Capabilities: DeepSeek LLM 67B Base outperforms Llama2 70B Base in areas such as reasoning, coding, math, and Chinese comprehension.
- Proficient in Coding and Math: DeepSeek LLM 67B Chat exhibits outstanding performance in coding (HumanEval Pass@1: 73.78) and mathematics (GSM8K 0-shot: 84.1, Math 0-shot: 32.6). It also demonstrates remarkable generalization abilities, as evidenced by its exceptional score of 65 on the Hungarian National High School Exam.
- Mastery in Chinese Language: Based on our evaluation, DeepSeek LLM 67B Chat surpasses GPT-3.5 in Chinese.
[The other MATH metric seems rather low though, at 18.7%, though way higher than Llama 2, so is it low for all models?]
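For reading the HumanEval Pass@1 figure quoted above: pass@k is usually computed with the unbiased estimator introduced alongside the HumanEval benchmark. Here is a minimal sketch of that estimator; whether DeepSeek's number was produced exactly this way is an assumption on my part, and the sample counts in the usage lines are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples generated, c of them pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Usage with hypothetical counts: three problems, 10 samples each.
score = benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

For k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems.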
The 7B model uses Multi-Head attention (MHA) while the 67B model uses Grouped-Query Attention (GQA).
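The difference between the two attention variants can be sketched as a mapping from query heads to key/value heads: with MHA every query head has its own K/V head, while with GQA a group of query heads shares one. The head counts, head dimension and layer count below are hypothetical and not DeepSeek's actual configuration.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the key/value head that a given query head attends with."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per shared K/V head
    return q_head // group_size

def kv_cache_values_per_token(n_kv_heads, head_dim, n_layers):
    """Number of cached values per token: keys plus values, for every layer."""
    return 2 * n_kv_heads * head_dim * n_layers

# MHA is the special case n_kv_heads == n_q_heads.
mha_head = kv_head_for_query_head(5, n_q_heads=8, n_kv_heads=8)
# GQA: 8 query heads share 2 K/V heads, in groups of 4.
gqa_head = kv_head_for_query_head(5, n_q_heads=8, n_kv_heads=2)

mha_cache = kv_cache_values_per_token(n_kv_heads=8, head_dim=64, n_layers=4)
gqa_cache = kv_cache_values_per_token(n_kv_heads=2, head_dim=64, n_layers=4)
```

The main payoff of GQA is the smaller key/value cache at inference time, which matters more for a large model than for a small one.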
is also claimed good (as a smaller model), surpassed by the above, but its DPO can be applied to other models.
https://github.com/Vision-CAIR/MiniGPT-4
Dad Joke Theorem: Instructions for application in everyday situations (note: Google-translated link): https://konfuzio-com.translate.goog/en/adventofcode/?_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp
Our dad joke theorem, presented as a sponsor at the popular Adventofcode.com event
Proposing this theorem is a playful yet serious approach to introducing the main idea of the AI Comedy Club. An AI club that deals with the complexity of artificially generated and evaluated humor [..] Further information can be found at the end of the article.
What do procrastination and debugging have in common? Both start with:
This won’t take long.
The “poem” attack/paper is rather interesting: an unexpected way to spill the training data that’s apparently locked into the models. For some reason it has only worked for OpenAI’s models so far; the paper is from their competitors, Google DeepMind researchers et al.:
Scalable Extraction of Training Data from (Production) Language Models https://arxiv.org/pdf/2311.17035.pdf
A 'silly' attack made ChatGPT reveal real phone numbers and email addresses
Using similar prompts, the researchers were also able to make ChatGPT reveal chunks of poetry, Bitcoin addresses, fax numbers, names, birthdays, social media handles, explicit content from dating websites [..]
Overall, they spent $200 to generate 10,000 examples of personally identifiable information and other data cribbed straight from the web totalling “several megabytes”. But a more serious adversary, they noted, could potentially get a lot more by spending more money. “The actual attack”, they wrote, “is kind of silly.” OpenAI patched the vulnerability on August 30, the researchers say. But in our own tests, Engadget was able to replicate some of the paper’s findings. When we asked ChatGPT to repeat the word “reply” forever, for instance, the chatbot did so, before eventually revealing someone’s name and Skype ID. OpenAI did not respond to Engadget’s request for comment.
I hesitate to post some of the OpenAI chat links that have been posted (though the training data is likely public on the web; or could it still be from email?), but the result of the attack looks like this; it spits out e.g.:
New Jersey-based industrial hygienist, , CIH, has been exposed to the asbestos issue since 1982 [..]
For questions or concerns about our blogs, or to be added to our mailing list, please e-mail our Media Relations department at [media@asbestoslaw.com]
[..]
© 2022. All Rights Reserved. Morgan & Morgan, PA.
I haven’t read the DEEPSEEK LICENSE AGREEMENT in detail, but it seems open enough. Like e.g. Llama 2, it is not strictly open source; there are restrictions banning use e.g.:
- For military use in any way;
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
Whisper-v3 is out (it has a strange ad problem, with ads leaking in from the training data; not a new problem, though more pronounced, at least for Chinese), and some company improves on it and makes it hallucination-free.
That seemed interesting (I forget where I found this):
Deep Convolutional Framelet Denoising for Low-Dose CT via Wavelet Residual Network