Those are huggingface’s terminology, and they are mostly regular MLP/Attention/Gelu. They prefer to define a new class for each component of a model instead of defining/using a compositional class. This is also the reason why we need to register a loader for each model in Julia: the class layout determines the structure of the state_dict, so we have to manually align the layouts when loading a checkpoint. The layers defined in Transformers.jl are designed to be as composable as possible to reduce the need to define new structs when registering a new loader.
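To make the alignment problem concrete, here is a minimal sketch of the idea (the key map and function names are purely illustrative, not the actual Transformers.jl loader API): the HuggingFace state_dict keys mirror the Python class nesting, so loading a checkpoint into a differently shaped, composable layer amounts to remapping key prefixes per model.

```julia
# Illustrative mapping from HF BERT-style key prefixes to a generic composable layout.
# Every HuggingFace model class nests its submodules differently, so a table like this
# has to be written per model, but composable layers keep the right-hand side uniform,
# so no new struct is needed.
const KEY_MAP = [
    "attention.self.query" => "attention.q_proj",
    "attention.self.key"   => "attention.k_proj",
    "attention.self.value" => "attention.v_proj",
    "intermediate.dense"   => "mlp.up_proj",
    "output.dense"         => "mlp.down_proj",
]

# rewrite one state_dict key, e.g.
# "encoder.layer.0.attention.self.query.weight" -> "encoder.layer.0.attention.q_proj.weight"
remap_key(key) = foldl((k, p) -> replace(k, p), KEY_MAP; init = key)

# rewrite every key of a state_dict-like Dict{String, Array}
remap_state_dict(sd) = Dict(remap_key(k) => v for (k, v) in sd)
```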
There are a few things on my list of priorities. Two major parts I’m currently slowly working on are splitting out the wordpiece/sentencepiece tokenizers into a separate package and a GPU abstraction for attention. Besides that, I would also like to enhance HuggingFaceApi.jl to use huggingface’s dataset viewer API and load those preprocessed datasets with DuckDB.jl. Unfortunately, my current bandwidth is mostly allocated to surviving and job hunting, so you probably won’t see these in the near future.
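Roughly, the dataset viewer + DuckDB.jl idea would look something like the sketch below (the endpoint shape, response fields, and the "imdb" dataset are assumptions on my part; none of this is an implemented HuggingFaceApi.jl feature):

```julia
using HTTP, JSON3, DuckDB, DBInterface

# the dataset viewer backend exposes auto-converted parquet files for a dataset
resp  = HTTP.get("https://datasets-server.huggingface.co/parquet?dataset=imdb")
files = JSON3.read(resp.body).parquet_files
train_urls = [f.url for f in files if f.split == "train"]

# DuckDB can scan those parquet files directly over HTTP,
# so there is no manual download/processing step on our side
con = DBInterface.connect(DuckDB.DB, ":memory:")
DBInterface.execute(con, "INSTALL httpfs;")
DBInterface.execute(con, "LOAD httpfs;")

urls_sql = join(["'$u'" for u in train_urls], ", ")
rows = DBInterface.execute(con,
    "SELECT text, label FROM read_parquet([$urls_sql]) LIMIT 5")
```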
One package I would love to see, though it is surely beyond my scope, is a better data loader design/interface with distributed support. I have only roughly scanned through them, so this might not be precise, but the data loaders we currently have seem relatively naive compared to the distributed data loaders in pytorch or huggingface; see the toy sketch below for the kind of behavior I mean.
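By "distributed support" I mainly mean rank-aware sharding, where each worker iterates only its own disjoint slice of the data, similar in spirit to pytorch's DistributedSampler. A toy illustration (not an API proposal, and all names are made up):

```julia
using Random

# Deterministically shuffle the sample indices, then give each rank a disjoint,
# strided shard of them, yielded in batches. Every worker uses the same seed so
# the shards never overlap.
function sharded_batches(n_samples, batch_size; rank, world_size, seed = 0)
    order = randperm(MersenneTwister(seed), n_samples)
    shard = order[(rank + 1):world_size:end]
    return (shard[i:min(i + batch_size - 1, end)] for i in 1:batch_size:length(shard))
end

# worker with rank 1 of 4 sees only a quarter of the indices
for idx in sharded_batches(10_000, 32; rank = 1, world_size = 4)
    # fetch and collate the samples for `idx` here
end
```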