Hello,
I have tried the Llama2 large language model in Julia following https://github.com/chengchingwen/Transformers.jl/blob/master/example/Llama2_example.ipynb. This works really nicely and smoothly, but the example uses `Float32`. To save memory, I wanted to use it with `Float16`, since the model card of Llama2 says that `torch_dtype` is `Float16`. When I try that, the model starts to hallucinate, so I guess that something overflows or underflows. I wanted to give `BFloat16` a try, since it can better handle large differences in magnitude. Does anyone have experience with BFloats and CUDA? Is there some `bf16` equivalent of `f16`?
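What I have in mind is roughly the sketch below, mirroring what Flux's `f16` does but with the `BFloat16` type from BFloat16s.jl (untested, and whether the converted model then actually runs on the GPU is exactly my question):

```julia
using Functors: fmap        # Flux/Transformers.jl models are Functors, so fmap visits all leaves
using BFloat16s: BFloat16   # BFloat16 number type (software emulation on the CPU)

# Convert every floating-point array in a model to BFloat16,
# analogous to what Flux's `f16` does for Float16.
bf16(m) = fmap(x -> x isa AbstractArray{<:AbstractFloat} ? BFloat16.(x) : x, m)

# model_bf16 = bf16(model)  # `model` being the Llama2 model from the notebook
```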
I have tried this repository, but I am not sure how relevant it is.
Thanks in advance for any answers.
Tomas
That repo is a software implementation, so I suspect it will be slow. If it works as advertised, you should at least be able to try it on small problems to see whether it fixes the overflow issue. There is BFloat16 support in hardware out there (Apple M* seems to have it somewhere, maybe in the neural engine), but software support could be hard to come by.
I would expect the software support to be slow. But CUDA hardware supports BFloat16 natively, so I was hoping it would be possible to use it with CUDA.jl.
CUDA.jl already supports BFloat16 for some common API functions, like `gemm`, `gemv`, etc. Native kernel support for BFloat16 depends on Julia properly supporting the type, i.e. not through BFloat16s.jl's emulation. Keep an eye on [Add support for BFloat16 · Issue #41075 · JuliaLang/julia · GitHub](https://github.com/JuliaLang/julia/issues/41075) for the status of that.
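For example, something along these lines should already go through the CUBLAS path (an untested sketch; whether the `CuArray{BFloat16}` multiply dispatches to `gemmEx` may depend on your CUDA.jl version):

```julia
using CUDA, BFloat16s

# BFloat16 inputs on the GPU; the matrix multiply below is assumed to be
# routed through CUBLAS (gemmEx) rather than a native Julia kernel.
a = CuArray(BFloat16.(randn(Float32, 256, 256)))
b = CuArray(BFloat16.(randn(Float32, 256, 256)))
c = a * b      # CuArray{BFloat16} result
```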