It appears that the entire Julia data science ecosystem is built around Julia's Array. However, from what I've observed, Array's flexibility makes it too non-uniform to serve as an in-memory data structure for sharing information between data science applications:
- Any element within an array can be of any type (Array{Any}); at that point it is effectively a list of lists, with only passive metadata about shape and element type.
- Its dimensionality is hard to pin down: operations often return a vector of matrices or a vector of vectors instead of Array{T, 3} → "I just want an array" (see the snippet after this list).
- It's a challenge to combine arrays into multi-dimensional structures (e.g. via LazyArrays.jl or RecursiveArrayTools.jl).
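A quick REPL sketch of both complaints, using only plain Julia:

    # A heterogeneous literal silently widens to element type Any:
    mixed = [1, 1.1, "a"]                # Vector{Any}

    # Literal nesting gives a vector of vectors, not a Matrix:
    nested = [[1, 2], [3, 4]]            # Vector{Vector{Int64}}

    # map over an axis likewise returns a vector of vectors:
    rows = map(i -> [i, 2i], 1:3)        # Vector{Vector{Int64}}

    # A true N-dimensional Array needs explicit construction:
    A = reshape(collect(1:12), 2, 3, 2)  # Array{Int64, 3}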
Although there are alternatives to Array with stronger columnar typing, like StructArrays.jl and DataFrames.jl, they are geared toward 2D data (rows, columns). This falls apart quickly in the deep learning space, where a single image is 3D (row, column, color) and a batch of time series is 3D (batch, timestep, feature).
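For concreteness, a minimal sketch of the columnar 2D story today with StructArrays.jl (with plain vector columns, the result is a one-dimensional sequence of rows, i.e. a table):

    using StructArrays

    # Each column keeps its own concrete element type:
    sa = StructArray((id = [1, 2, 3], value = [1.1, 2.2, 3.3], label = ["a", "b", "c"]))

    sa[1]      # (id = 1, value = 1.1, label = "a")
    sa.value   # Vector{Float64}: typed column access
    ndims(sa)  # 1: a sequence of rows, no (batch, timestep, feature) axes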
In order to mass-convert the Python data science community, who will be coming from daily NumPy/TF/Torch/Parquet usage, Julia needs to provide an n-dimensional, columnar-typed, in-memory data structure. This could of course be one of the existing 2D structures that evolves to handle 3D+ data.
- Ship this type in the core language.
- Promote and encourage its adoption throughout the ecosystem.
Update: an example of the proposed structure:
tensor = Tensor(
    [
        # 3D (batch, sample, site, etc.)
        [
            # 2D (channel, timestep, sample, etc.)
            # Types: Int, Float, String
            [1, 1.1, "a"],
            [2, 2.2, "b"],
            [3, 3.3, "c"]
        ],
        [
            # Types: Int, Float, String
            [4, 4.4, "d"],
            [5, 5.5, "e"],
            [6, 6.6, "f"]
        ],
        [
            # Types: Int, Float, String
            [7, 7.7, "g"],
            [8, 8.8, "h"],
            [9, 9.9, "i"]
        ]
    ]
)
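For comparison, here is a minimal sketch of how the same data can be laid out column-first with stock Julia today, as a plain NamedTuple of matrices (the Tensor constructor above is hypothetical):

    # One homogeneous (batch, row) matrix per column, so every column keeps a
    # concrete element type while the leading dimensions stay shared:
    columnar = (
        ints    = [1 2 3; 4 5 6; 7 8 9],                    # Matrix{Int64}
        floats  = [1.1 2.2 3.3; 4.4 5.5 6.6; 7.7 8.8 9.9],  # Matrix{Float64}
        strings = ["a" "b" "c"; "d" "e" "f"; "g" "h" "i"],  # Matrix{String}
    )

    columnar.floats[2, 3]  # 6.6: batch 2, row 3 of the Float column

Each column here is already a concrete Array{T, N}; what the proposal adds on top is a shared wrapper that keeps the axes aligned and presents row-wise and slice-wise views.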