I have tried the Llama2 large language model in Julia, following https://github.com/chengchingwen/Transformers.jl/blob/master/example/Llama2_example.ipynb. This works really nicely and smoothly, but the example uses Float32. To save memory, I wanted to use Float16, since the model card of Llama2 says it supports Float16. When I try that, the model starts to hallucinate, so I guess that something overflows or underflows.
I wanted to give BFloat16 a try, since it can better handle large differences in magnitude. Does anyone have experience with BFloats and CUDA? Is there some bf16 equivalent? I have tried this repository, but I am not sure how relevant it is.
Thanks in advance for any answers.
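For reference, the overflow suspicion is easy to reproduce in plain Julia. The snippet below is just an illustration of the two formats' ranges; `bf16_trunc` is a hypothetical helper that emulates BFloat16 by truncating a Float32 to its top 16 bits (proper round-to-nearest-even omitted for brevity):

```julia
# Float16 (IEEE half) has a 5-bit exponent; the largest finite value is
# 65504, so moderately large activations already overflow to Inf:
x = Float32(70000)
Float16(x)                  # Inf16

# BFloat16 keeps Float32's 8-bit exponent, trading mantissa bits for range.
# Quick emulation: keep only the top 16 bits of the Float32 representation.
bf16_trunc(x::Float32) = reinterpret(Float32, reinterpret(UInt32, x) & 0xffff0000)
bf16_trunc(x)               # 69632.0f0 -- finite, but only ~2-3 decimal digits
```

So Float16 loses range (hence the Inf/NaN hallucinations), while BFloat16 loses precision instead, which LLM inference usually tolerates.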
That repo is a software implementation, so I suspect it will be slow. If it works as advertised, you should be able to get going on small problems to see if it solves your issue. There is BFloat16 support in hardware out there (Apple M* seems to have it somewhere, maybe in the neural engine), but software support could be hard to come by.
I would expect the software support to suck. But CUDA has HW support, therefore I was hoping it would be possible to use it with CUDA.jl.
CUDA.jl already supports BFloat16s for some common API functions, like gemv, etc. Native kernel support for BFloat16 depends on Julia properly supporting the type, i.e., not through BFloat16s.jl's emulation. Keep an eye on Add support for BFloat16 · Issue #41075 · JuliaLang/julia · GitHub for the status of that.
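A minimal sketch of what using the library path might look like, assuming CUDA.jl with a BFloat16-capable GPU (Ampere or newer) and the BFloat16s.jl package; the matrix sizes and the claim that `*` dispatches to a CUBLAS mixed-precision path are assumptions, not something I have benchmarked:

```julia
using CUDA, BFloat16s

# Convert Float32 data to the emulated BFloat16 type on the host,
# then upload to the GPU.
A = CuArray(BFloat16.(rand(Float32, 128, 128)))
B = CuArray(BFloat16.(rand(Float32, 128, 128)))

# Matrix multiply should go through the CUBLAS library wrappers
# (a hardware bf16 path), not a native Julia kernel -- broadcasts and
# custom kernels over BFloat16 are where native Julia support matters.
C = A * B
```

So library-backed operations (gemm and friends) can already benefit from hardware bf16, while anything that compiles a Julia kernel still hinges on julia#41075.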