AI tools to write (Julia) code (best/worst experience), e.g. ChatGPT, GPT 3.5

A. I don’t know if it’s an inherent limitation. Another limitation is that you can’t (yet, it seems) follow up with more questions. If that’s possible, or becomes possible, then the trick with ChatGPT is to say “continue” when you don’t get the full code, and it proceeds with more; that might also work for RWKV eventually.

B. I just discovered there’s also codealpaca, which could work for Julia, and if it doesn’t already, then we might help by translating that file to Julia code:

https://raw.githubusercontent.com/sahil280114/codealpaca/master/data/code_alpaca_20k.json
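If someone wants to try that, a first pass could look roughly like this (a sketch only; I’m assuming the file uses the usual Alpaca instruction/input/output JSON schema, and the filter heuristic here is made up, so check both against the actual data):

```python
import json

def load_examples(path="code_alpaca_20k.json"):
    """Load the downloaded dataset (assumed to be a JSON array of records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def julia_candidates(examples):
    """Made-up heuristic: entries whose instruction doesn't pin another
    language are the easiest ones to translate (or regenerate) as Julia."""
    pinned = ("python", "javascript", "java ", "c++", "c#", "sql", "html", "css")
    return [e for e in examples
            if not any(lang in e["instruction"].lower() for lang in pinned)]

# Example with inline records in the assumed schema:
sample = [
    {"instruction": "Write a Python function to sort a list.",
     "input": "", "output": "def sort_list(l): return sorted(l)"},
    {"instruction": "Explain what a hash table is.",
     "input": "", "output": "A hash table maps keys to values..."},
]
print(len(julia_candidates(sample)))  # only the language-agnostic entry remains
```

The language-pinned entries would then go to a human or an LLM for actual translation into Julia.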

poe.com seems good (it’s from Quora) and has access to e.g. ChatGPT and Claude.

RWKV has better time (and space) complexity, which is why I think it’s important:

Transformers are usually O(n^2) in the sequence length, though not all are, e.g. this one:

With RWKV-4-Pile-14B-20230313-ctx8192-test1050 I got this answer…:

theoh
http://julialang.org/
======
empath75
It’s not a very good language for the use case that people are interested in
it for, which is basically scientific computing and graphics, so it doesn’t
have the maturity that a language like go has or c++ or java.


e19293001
Julia is a really interesting language and it is very interesting to see that
Julia community has already created a package manager[1]. I would like to see
more projects using Julia as their programming language. I would also like to
see more projects using Julia as their language of choice when developing
scientific computing software and graphic software.

[1

It seemed like it was quoting verbatim, at least the first part (an answer from the same query…), but I googled for it and couldn’t confirm that…
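To make the complexity point above concrete, here’s a tiny sketch (my own illustration, not from any of the linked material): self-attention touches every pair of tokens, so one layer costs O(n^2) in sequence length, while an RNN-style model like RWKV updates a fixed-size state once per token, so it costs O(n).

```python
def attention_ops(n):
    """Pairwise token interactions in one attention layer (constants dropped)."""
    return n * n

def rnn_ops(n):
    """Sequential state updates in one recurrent layer (constants dropped)."""
    return n

for n in (1024, 8192):  # 8192 matches the ctx8192 context length above
    print(f"n={n}: attention ~{attention_ops(n):,} ops, RNN ~{rnn_ops(n):,} ops")
```

At the 8192-token context mentioned above, the quadratic term is 8192× larger per layer, which is the gap that grows with longer contexts.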

In case people are interested in theory (first link updated in March)

Generative Pre-trained Transformer models, known as GPT or OPT, […] Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. […] Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline.
[…]
Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at this https URL.

So only (down to) 3852 MB of RAM is needed, and as of two weeks ago (in the code):

  • two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). Those fix GPTQ’s strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.
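Figures like that are easy to sanity-check: quantized weight storage is roughly parameter count × bits / 8, plus a small overhead for the per-group scales and zero-points that GPTQ-style formats store. A sketch with assumed numbers (my own, not from the repo; group size and metadata layout vary between formats):

```python
def quantized_weight_mb(n_params, bits, group_size=128, scale_bits=16):
    """Rough weight-storage estimate in MiB for a k-bit quantized model.

    Assumes one 16-bit scale and one 16-bit zero-point per `group_size`
    weights, which is roughly how GPTQ-style formats store metadata;
    real on-disk layouts differ, so treat this as an order-of-magnitude check.
    """
    weight_bits = n_params * bits
    overhead_bits = (n_params / group_size) * 2 * scale_bits  # scales + zeros
    return (weight_bits + overhead_bits) / 8 / 2**20

# Illustrative numbers only: a 7B-parameter model at 4-bit vs 3-bit.
for bits in (4, 3):
    print(f"{bits}-bit: ~{quantized_weight_mb(7e9, bits):.0f} MiB")
```

This only covers the weights; activations, KV-cache (for Transformers), and framework overhead come on top.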

My own Julia code/idea that I posted on Discourse a while ago, for quantizing down to 1 bit, now seems increasingly plausible to work.

Older paper:
The case for 4-bit precision: k-bit Inference Scaling Laws https://arxiv.org/pdf/2212.09720.pdf

For a given zero-shot performance, 4-bit precision yields optimal scaling for almost all model families and model scales. The only exception is BLOOM 176B where 3-bit is slightly but not significantly better.

3-bit Float + proxy quant, blocksize=64
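The blocksize=64 setting means each block of 64 consecutive weights is quantized with its own scale, so a single outlier only degrades its own block. A minimal sketch of that blockwise idea (my own simplification using symmetric integer rounding; the paper’s actual 3-bit float data type and proxy quantization are more involved):

```python
def quantize_blockwise(weights, bits=3, block=64):
    """Symmetric k-bit blockwise quantization: each block of `block` values
    shares one absmax-derived scale. Returns (codes, scales)."""
    levels = 2 ** (bits - 1) - 1            # e.g. 3 representable magnitudes at 3-bit
    codes, scales = [], []
    for i in range(0, len(weights), block):
        chunk = weights[i:i + block]
        scale = max(abs(w) for w in chunk) / levels or 1.0  # guard all-zero blocks
        scales.append(scale)
        codes.append([round(w / scale) for w in chunk])
    return codes, scales

def dequantize_blockwise(codes, scales):
    """Reconstruct approximate weights from per-block codes and scales."""
    return [c * s for cs, s in zip(codes, scales) for c in cs]

# Round-trip a small example: values come back close to the originals,
# and every code fits in the k-bit signed range.
ws = [0.01 * i for i in range(-64, 64)]
codes, scales = quantize_blockwise(ws, bits=3, block=64)
approx = dequantize_blockwise(codes, scales)
print(max(abs(a - w) for a, w in zip(approx, ws)))  # small reconstruction error
```

The per-block scales are the overhead that the “proxy quant” metadata pays for; smaller blocks mean better outlier isolation but more scales to store.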

Below 1 bit is also possible (indirectly, by pruning models and/or e.g. Huffman compression), and while I’m not sure the above works for RNNs, if they are making a comeback, then the idea, older code, and paper here are relevant:
