Call for a data structure that is: multi-dimensional, columnar-typed, and in-memory in the core Julia language

It appears that the entire Julia data science ecosystem is built around Julia’s Array. However, from what I’ve observed, Array’s flexibility makes it too non-uniform to serve as an in-memory data structure for sharing information between data science applications:

  • Any element within an array can be of any type. It’s a list of lists with somewhat passive metadata about shape and type.
  • Its dimensionality is hard to pin down (often returns a vector of matrices or vectors instead of Array{T, 3}) → “I just want an array”
  • It’s a challenge to combine arrays into multi-dimensional structures (see LazyArrays.jl, RecursiveArrayTools.jl).
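To make the first two bullets concrete, here is a minimal sketch using only Base Julia. It shows that the element type is a parameter of the array (a mixed literal falls back to `Any`, a homogeneous one stays concrete), and that a vector of matrices is a different type from a 3-D array until it is concatenated. The variable names are illustrative only.

```julia
# A mixed literal falls back to the catch-all element type Any.
row = [1, 1.1, "a"]

# A homogeneous literal keeps a concrete element type.
nums = [1.0, 2.0, 3.0]

# A vector of matrices is not a 3-D array...
images = [rand(2, 2) for _ in 1:3]

# ...but it can be concatenated into one along a new third dimension.
cube = cat(images...; dims = 3)

println(typeof(row))     # element type Any
println(typeof(nums))    # element type Float64
println(typeof(images))  # a Vector whose elements are matrices
println(size(cube))      # (2, 2, 3)
```

Whether this behaviour counts as too flexible is the question the rest of the thread debates; the sketch only shows what the types are.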

Although there are alternatives to Array with stronger columnar typing, like StructArrays.jl and DataFrames.jl, they are only meant for handling 2D data (rows, columns). This falls apart pretty quickly in the deep learning space, where a single image has 3D data (row, column, color) and a single time series has 3D data (batch, timestep, feature).

In order to mass-convert the Python data science community, who will be coming from daily NumPy/TF/Torch/Parquet usage, Julia needs to provide an n-dimensional, columnar-typed, and in-memory data structure. This could of course be one of the existing 2D structures that evolves to handle 3D+ data.

  • Ship this class in the core language.
  • Promote/encourage its adoption throughout the ecosystem.

Update example:

tensor = Tensor(
	[#3D (batch, sample, site, etc.)
		[#2D (channel, timestep, sample, etc.)
			#Types: Int, Float, String
			[1, 1.1, "a"],
			[2, 2.2, "b"],
			[3, 3.3, "c"]
		],
		[
			#Types: Int, Float, String
			[4, 4.4, "d"],
			[5, 5.5, "e"],
			[6, 6.6, "f"]
		],
		[
			#Types: Int, Float, String
			[7, 7.7, "g"],
			[8, 8.8, "h"],
			[9, 9.9, "i"]
		]
	]
)
1 Like

I think this discussion will be more productive if you edit to remove the “in the core language” part of it. That will likely draw a lot of the attention but it’s bound to be far off if we don’t even have a package implementation yet.

What does “columnar-typed” mean when the number of dimensions is greater than 2?

2 Likes

@CameronBieganek Just updated the original post with an example. Beyond 2D, everything is essentially just grouping brackets. E.g. many color channels across many images.

OP, can you take the tone down a little bit on this post? I worry about a lengthy flamewar which I will inevitably mute. A few notes:

  • There are many posts that start with “In order to convert X community, we need to do Y”. Julia’s adoption is steadily increasing, so claims to the effect of “if only we had this feature, we would increase the userbase this much” stem from a flawed assumption that the userbase is stagnant.
  • Adding to the core language. Why does this need to be added to the core language? Plenty of people use AbstractArrays all the time, and this is the first we’ve heard of this particular problem. I would suggest writing a package first that will show people concretely the flaws in the current design and how an alternative will improve it.

You have added code, but not a full MWE for why your proposed structure will be worth it. Many people on Slack (including myself) mentioned that it’s difficult to move the conversation forward without a clear picture of the problem you are trying to solve.

3 Likes

What are the advantages of this over just using DataFrames.jl? You say that it is only for representing 2D data but that is clearly not the case. Your example can be made into a DataFrame and then you can groupby whatever columns you want (some of which will define which dimension you are slicing).
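The long-format idea described above can be sketched without any dependencies. The extra dimension becomes an ordinary index column, and each column keeps its own element type. The table below is a plain NamedTuple of vectors, and `select_batch` is a hypothetical helper written for this illustration; with DataFrames.jl the corresponding step would be `groupby(df, :batch)`.

```julia
# Long ("tidy") layout: one row per observation, with the third dimension
# stored as an ordinary index column. Every column has its own element type.
table = (
    batch = [1, 1, 1, 2, 2, 2, 3, 3, 3],
    id    = [1, 2, 3, 4, 5, 6, 7, 8, 9],
    value = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9],
    label = ["a", "b", "c", "d", "e", "f", "g", "h", "i"],
)

# Hypothetical helper: keep only the rows that belong to batch `b`.
function select_batch(tbl, b)
    keep = tbl.batch .== b
    return map(col -> col[keep], tbl)  # map over a NamedTuple keeps the names
end

second = select_batch(table, 2)
println(second.label)
```

The trade-off is that the dimension structure now lives in the data (the `batch` column) rather than in the container's shape.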

5 Likes

@tbeason Good point. I guess it’s really a matter of how you access the data, where it originates, and how you want to distribute the execution. If you’re iteratively reading data from sources like individual arrays or files, then assembling separate arrays and accessing/batching them by index is more intuitive (you don’t have to group by pixels) than managing the dimensions yourself in queries. E.g. “give me the 5th sample from this 4D array” is just [:, :, :, 5]. On the other hand, if your data can more naturally be persisted in tabular form, then the grouping you describe makes sense. That’s how Dask goes about fetching data from Parquet columns, in that partition/grouping/chunk_size style. But at the end of the day, you feed an n-d array, not dataframes, into a neural network.
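For reference, a minimal sketch of the index-based access mentioned above, using only Base Julia. The (row, column, channel, sample) layout and the sizes are assumptions made for the example.

```julia
# Hypothetical batch laid out as (row, column, channel, sample).
batch = reshape(collect(1.0:120.0), 2, 3, 4, 5)

# Indexing with colons copies the 5th sample into a new 3-D array.
fifth = batch[:, :, :, 5]

# A view refers to the same memory without copying.
fifth_view = @view batch[:, :, :, 5]

# selectdim builds the same view without spelling out the colons.
also_fifth = selectdim(batch, 4, 5)

# eachslice iterates over the array one sample at a time.
samples = eachslice(batch; dims = 4)

println(size(fifth))
println(length(samples))
```

`selectdim` is handy when the number of dimensions is not known in advance, since the sample axis is passed as an argument instead of being implied by the position of the colons.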

An AbstractArray type where different axes have different types, but everything still inferred correctly, is probably doable, and would be a very interesting package. All you would have to do would be to encode the types of the different axes in your struct declaration. I would encourage you to write that package and see if it gains traction in the community.
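As a rough sketch of what such a package might start from, the type below stores one homogeneous array per column and records the column types in a type parameter. The names `ColumnTypedArray` and `column` are hypothetical and do not come from any existing package, and the (row, column, batch...) layout is an assumption. Scalar indexing across columns returns `Any` here; type-stable access goes through `column`.

```julia
# Hypothetical type: an N-dimensional array whose second axis ("column") has a
# fixed element type per column. Each column is stored as its own
# (N-1)-dimensional array, so the storage of every column stays homogeneous.
struct ColumnTypedArray{N, C<:Tuple} <: AbstractArray{Any, N}
    columns::C  # one array per column, all of the same size
end

# Build from one array per column; the column types end up in parameter C.
function ColumnTypedArray(columns::AbstractArray...)
    isempty(columns) && throw(ArgumentError("at least one column is required"))
    sz = size(first(columns))
    all(c -> size(c) == sz, columns) ||
        throw(DimensionMismatch("all columns must have the same size"))
    N = ndims(first(columns)) + 1
    return ColumnTypedArray{N, typeof(columns)}(columns)
end

# Assumed layout: (row, column, remaining dimensions...).
function Base.size(A::ColumnTypedArray)
    sz = size(first(A.columns))
    return (sz[1], length(A.columns), sz[2:end]...)
end

# Scalar indexing picks the column first, then indexes into its storage.
function Base.getindex(A::ColumnTypedArray{N}, I::Vararg{Int, N}) where {N}
    @boundscheck checkbounds(A, I...)
    return A.columns[I[2]][I[1], I[3:end]...]
end

# Hypothetical accessor: returns the concretely typed storage of column j.
column(A::ColumnTypedArray, j::Integer) = A.columns[j]

# The original post's example, split into 3 rows x 2 batches per column.
ints   = [1 4; 2 5; 3 6]
floats = [1.1 4.4; 2.2 5.5; 3.3 6.6]
strs   = ["a" "d"; "b" "e"; "c" "f"]
t = ColumnTypedArray(ints, floats, strs)

println(size(t))
println(t[2, 3, 1])
```

Because the type subtypes `AbstractArray` and defines `size` and `getindex`, the generic fallbacks for slicing, iteration, and printing work without extra code. A real package would also need `setindex!`, a tighter element type than `Any`, and a decision about which axis carries the types.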

2 Likes

Perhaps https://github.com/SciML/RecursiveArrayTools.jl would be of interest as well?

As I mentioned on Slack (though it evidently got buried), the statically-typed case here is handled by StructArrays. For more dynamic manipulation, something like https://github.com/JuliaGeo/NetCDF.jl, https://github.com/meggart/YAXArrays.jl or https://github.com/JuliaHEP/UpROOT.jl is worth a look. If you feel like nothing already out there cuts it, then the xarray documentation is a good source of inspiration for writing a package. In general though, the Physics/Astronomy/Earth + Climate science people (many of whom have probably seen or commented on your posts) are already way ahead of us folks who work with “AI” on neatly representing complex higher-dimensional data.

4 Likes

@ToucheSir You’re right. The goal of xarray is to combine the multidimensionality of NumPy with the labeling metadata of pandas. Neither is sufficient alone.

Also (inactive?): nbren12/XArray.jl, labeled ndarrays in Julia.

There are some packages like xarray; see “Status of AxisArrays.jl” (post #20 by Raf):

https://github.com/rafaqz/DimensionalData.jl

DimArray gives you named dimensions, plus a Tables.jl interface via DimTable.

1 Like


@aiqc, could you point out what Zarr.jl does/will do that DimensionalData.jl doesn’t?

With so many arrays around, some heads are in disarray:

NB: by the way, maybe a package logo that is more Julian?

Zarr is a cross-language standard and is focused on persistence (kind of like HDF5 with less legacy baggage). Since Zarr.jl exposes an AbstractArray implementation, I imagine you could use it to back fancy array types like those in DimensionalData.

1 Like