Most popular tabular/multidimensional data types in Julia

I’m writing a Julia connector to a weather data API, and want some insight into the types of data preferred by the Julia community. It seems to me that the built-in DataFrames are very limited; furthermore, users might request data which would be better stored in multi-dimensional structures. I come from a Python background, so am looking to emulate things like Pandas and Xarray, but a) would like to avoid using PyCall explicitly and b) feel that leaning too heavily on frontends like Pandas.jl may not be the most Julian way of proceeding.

Any comments on your favourite packages and types for time-series or grids of one or more variables much appreciated! Additional info: the API responds with .csvs, which can be converted to built-in DataFrames as an intermediate step: I may well implement some kind of method/type switching options for the user, so ideally whatever packages you suggest would be able to take a native DataFrame as input.

In what way?

Can you expand on this? It’s not clear from this post exactly what functionality you need and where DataFrames.jl falls short.

It’s true that DataFrames requires columns to be vectors, but so does pandas, correct? The split-apply-combine strategy in DataFrames is very powerful and can emulate multi-dimensionality very well.
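To make the split-apply-combine idea concrete, here is a minimal sketch in plain Julia with no packages. The rows, the column names (`time`, `lat`, `lon`, `t2m`) and the values are all made up for illustration. Each observation carries its own coordinates as columns (the "long" format), which is how a flat table can stand in for a multi-dimensional grid. In DataFrames.jl the same steps would roughly correspond to `groupby` followed by `combine`.

```julia
# Long-format rows: each observation carries its own coordinates (hypothetical values).
rows = [
    (time = 1, lat = 50.0, lon = 0.0, t2m = 280.0),
    (time = 1, lat = 51.0, lon = 0.0, t2m = 282.0),
    (time = 2, lat = 50.0, lon = 0.0, t2m = 281.0),
    (time = 2, lat = 51.0, lon = 0.0, t2m = 285.0),
]

# Split: collect the values belonging to each time step.
groups = Dict{Int,Vector{Float64}}()
for r in rows
    push!(get!(groups, r.time, Float64[]), r.t2m)
end

# Apply and combine: one mean per group, returned as a small table of rows.
means = [(time = t, t2m_mean = sum(v) / length(v))
         for (t, v) in sort(collect(groups); by = first)]
```

Grouping on `(time, lat)` or any other combination of coordinate columns works the same way, so extra dimensions become extra key columns rather than extra array axes.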

Maybe you want NamedArrays.jl or AxisArrays.jl?

As others already mentioned above, it would be nice to understand your expectations from a data frame. DataFrames.jl is super powerful and does things very well when you have 2 dimensions (rows and columns). If you need more than 2 dimensions, consider other established data structures and formats such as HDF5, NetCDF, etc. We have all that in Julia as well.

If you want to integrate with the data science ecosystem in Julia, I would even try to create a custom data type that implements the Tables.jl interface. It would automatically work with statistical packages for example.

Notice that I have done a similar job connecting meteorological data here:

DataFrames.jl was the perfect fit.

Another package that we wrote connecting meteorological data APIs:

https://github.com/JuliaClimate/CDSAPI.jl

I think the OP is familiar with the multidimensional functionality of Pandas, which I personally find overly complicated; see “MultiIndex / advanced indexing” in the pandas 2.1.3 documentation.

They extend the concept of tables to hierarchical tables like we see in spreadsheets, where main columns are subdivided into other columns, …

I am working quite a lot with Pandas, but MultiIndex is a complexity nightmare without a clear unique use case.
Glad that DataFrames.jl does not have it (and probably nobody misses it).

I second that. Maybe there are use cases out there, but I often find other data structures that are more appropriate and more intuitive when it comes to more than 2 dimensions. Particularly in Julia where Arrays are multidimensional by default and where so many other data structures are available implementing the AbstractArray interface.

I actually agree with @juliohm that the multi-index functionality of Python’s Pandas is unnecessarily complicated. The principal thing I dislike about Julia DataFrames is that, as far as I can tell, I can’t set an index for my data manually, and that the index resets each time a new view is created. Finding that people disagree with me is valuable in its own right: if the community at large uses DataFrames heavily then I ought to implement a DataFrames solution!

@pdeffebach by multi-dimensional I mean something like netCDF data, where a 3D grid might be required to either store 3D data or to store 2D data for a range of variables. A 4th dimension is often also applicable to 3D grid time-series. Xarray is a nice package for this in Python, so I’m wondering what the Julia community uses.

The question is what is the use case for an index in Pandas? It is essentially an additional column with a special behavior in some functions.
Some Pandas functions need the index to be set to a specific column (like df.plot()) or produce a specifically set index (like df.groupby()), but in the end it is just an additional column and there is no fundamental need for an index in a dataframe-structure.

That’s all true, but I guess one reason I was asking this question is to assess how people feel about this behaviour: if lots of people are coming to Julia from Python and are used to Pandas, then they might switch to a different package for data-wrangling, which would be valuable information for me.

Coming from Pandas, DataFrames.jl works great for me. I like the minimalistic approach and you get essentially all relevant Pandas features with DataFrames.jl plus additional packages like CSV.jl, ShiftedArrays, etc.
One big strength of DataFrames.jl is that you can use any custom array element type with good performance, not just the built-in standard ones like in Pandas.

I spent a month in Pandas building an application for wrangling. “Wow, this is cool.”

Then I hit memory stress (not Pandas’ fault), so I thought “hmm, perhaps Julia would be better”.

So I rebuilt it in Julia, my first time stressing DataFrames in any way. “Wow, this is so much better than Pandas.”

But really, I don’t think it matters what data format you produce. Unless you are outputting 100 GB of data, consumers can just transform it.

Maybe I don’t understand the use case. But wouldn’t a regular 3D Array work? You can also fill arrays with arbitrary elements, like tuples, or even custom structs, with full performance. What do DataFrames add to the mix?

We would like it to be as user-friendly as possible: a query made with one of several functions (depending on the type of request) should return a well-presented data type that is ready to use, without the user having to write their own converter. Otherwise, frankly, the user might as well write the whole connector, since all it does is parse your arguments into a URL and call HTTP.

That’s reminded me of another thing I felt was a limitation of DataFrames (or possibly of Julia?): that I couldn’t split out the metadata and associate it with the DataFrame (like an object attribute in Python). Of course, I could create a struct which contained the DataFrame and the metadata, but that’s another aspect that you have to explain to the user. Returning everything in a well established type which achieves all of this would make our documentation less meaty (although I appreciate that’s basically passing the buck to someone else’s careful documentation)
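For what it’s worth, the "struct which contains the table and the metadata" option can be kept fairly unobtrusive. The following is only a sketch in plain Julia: `WeatherResult`, `gettable` and `getmeta` are hypothetical names, and a NamedTuple of columns stands in for the DataFrame so the example has no dependencies. Property access is forwarded to the wrapped table, so users mostly interact with it as if it were the table itself.

```julia
# Hypothetical container: `table` stands in for whatever tabular type the connector
# returns (a DataFrame in practice; a NamedTuple of columns here to avoid dependencies).
struct WeatherResult{T}
    table::T
    metadata::Dict{String,Any}
end

# Accessors use getfield because getproperty is overridden below.
gettable(r::WeatherResult) = getfield(r, :table)
getmeta(r::WeatherResult) = getfield(r, :metadata)

# Forward `result.colname` to the wrapped table so columns stay easy to reach.
Base.getproperty(r::WeatherResult, name::Symbol) = getproperty(gettable(r), name)
Base.propertynames(r::WeatherResult) = propertynames(gettable(r))

# Made-up data and metadata, purely for illustration.
result = WeatherResult(
    (time = [1, 2, 3], t2m = [280.1, 281.4, 279.9]),
    Dict{String,Any}("units" => Dict("t2m" => "K"), "source" => "hypothetical endpoint"),
)

temperatures = result.t2m                 # forwarded to the table
unit = getmeta(result)["units"]["t2m"]    # metadata lives alongside the data
```

If I remember correctly, recent DataFrames.jl releases also provide table-level and column-level metadata functions (`metadata!`, `colmetadata!`), so it is worth checking the DataFrames.jl documentation before rolling your own wrapper.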

Generally, labelling. It’s not clear from a 3D array of data alone that a given axis is latitude and another is longitude. IIRC Julia arrays don’t allow type mixing, so a String couldn’t be used to describe the remaining values in a row/column; even if you forced this by overloading Array, I don’t think you can then easily select based on column names. I know from experience of using numpy to manipulate netCDF data before Pandas and Xarray were known to me that this way of life is a nightmare (primarily from a debugging standpoint)
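To illustrate what labelling buys over a bare 3D array, here is a rough sketch in plain Julia of an array with one named coordinate vector per axis. `LabelledGrid` and `select_by_label` are hypothetical names invented for this example, and the coordinates and values are made up. Packages such as AxisArrays.jl provide a more complete version of this idea; the sketch only shows the principle.

```julia
# Hypothetical wrapper: a dense array plus one named coordinate vector per axis.
struct LabelledGrid{T,N,A<:NamedTuple}
    data::Array{T,N}
    coords::A   # e.g. (lat = [...], lon = [...], alt = [...])
    function LabelledGrid(data::Array{T,N}, coords::A) where {T,N,A<:NamedTuple}
        length(coords) == N ||
            throw(ArgumentError("need one coordinate vector per dimension"))
        all(length(coords[i]) == size(data, i) for i in 1:N) ||
            throw(ArgumentError("coordinate lengths must match the array size"))
        new{T,N,A}(data, coords)
    end
end

# Select by coordinate value; axes that are not mentioned are kept whole.
function select_by_label(g::LabelledGrid; kwargs...)
    idx = map(keys(g.coords)) do name
        haskey(kwargs, name) || return Colon()
        i = findfirst(==(kwargs[name]), g.coords[name])
        i === nothing && throw(KeyError(kwargs[name]))
        i
    end
    return g.data[idx...]
end

# Made-up 2 x 3 x 4 grid of values on (lat, lon, alt).
values = reshape(collect(1.0:24.0), 2, 3, 4)
grid = LabelledGrid(values, (lat = [50.0, 51.0],
                             lon = [-1.0, 0.0, 1.0],
                             alt = [0, 100, 200, 300]))

point = select_by_label(grid; lat = 51.0, lon = 0.0, alt = 100)  # a single value
slice = select_by_label(grid; lat = 50.0)                        # a lon x alt matrix
```

The axis names live in the type rather than in the array elements, so the data itself can stay a homogeneous `Float64` array while lookups still read as latitude, longitude and altitude.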

@Eagertom what is the topology of your data set? Is it like a raster file or like a point set or like a collection of polygons? Depending on this information we can provide more directions of data structures available in the language.

I’m not quite sure I understand the question but I think raster. The data we provide is all gridded by lat/lon/altitude i.e. spherical polar.

I can provide more info if you can clarify. Also our GitHub page and API documentation provide examples of features I’d ultimately like to implement. I already have an implementation which produces DataFrames, but want to ensure that this is really the best way of serving up the data for Julia users.

Thanks for the links to your own repositories provided above by the way - they look well developed and promise to shed some light on this topic for me!

It sounds like you want AxisArrays I think. It allows you to name dimensions and have primary keys. But people more familiar with spatial / geological data may have better advice.