Is it possible to write Parquet files from Julia? I found Parquet.jl, but it looks like it can read but not write…
Not the most elegant way, but Spark.jl contains methods for reading and writing Parquet files. The downside, of course, is the dependency on Java.
Good call… but it requires a Spark dataset to write a Parquet file. I’m trying to find a more efficient way to push data to Spark, so it’s a chicken-and-egg problem…
Even in Python it seems that people always deal with Parquet files using Spark. This does seem like ridiculous overkill depending on what you’re doing, but my impression is that Parquet was designed for enormous datasets.
You might check out Feather, though it still seems like a bad solution, since there appears to be no direct way of loading Feather files into Spark.
My data set isn’t very big - just 60 GB. It seems that the most reliable way to load data into Spark is CSV. However, CSV is lossy for Float64, so I’m looking for a binary format. Compression is best, since I need to upload the file to Spark over the network because the environment is in the cloud.
I’m aware of Feather.jl and I love it!
Yeah, the binary data format situation is bizarrely terrible.
From where I sit right now in the private sector, I can tell you that a big part of the reason this is the case is a staggering amount of horribly formatted legacy data, exacerbated by people who don’t know any better constantly outputting everything in sight to (often corrupted) CSVs, or, even worse, Excel sheets (yes, there are people out there who even seem to think it’s a good idea to have a 500 MB Excel sheet). Explaining the concept of double-precision floats to those people is downright impossible.
Anyway, the feather standard will ultimately work for large datasets by linking together 4GB files. Unfortunately it seems that the feather authors have been dragging their feet and don’t have any kind of standard yet (that I know of). If this doesn’t happen before long, we will come up with our own temporary solution for Feather.jl.
You might look to see if there is some sort of protobuf-related solution. I know that in many cases data is stored en masse in protobufs, but in the examples I know of this was somehow done inside of a Parquet file.
I’ve been thinking more about this recently; I think we might be better off just committing more to Arrow.jl and its format. There really isn’t any long-term need, IMO, for Feather, since Arrow is the actual format layout definition, including binary transfer. There’s no reason you couldn’t just write the Arrow format to disk and use that instead of Feather.
The one advantage that Parquet has over something like Arrow, at least as far as I understand the current implementations, is native support for compression.
The thing we are really missing right now is an expanded metadata for Feather. Right now it only supports single files up to 4GB, we need the extension to the metadata that will support groups of Feather files.
The important thing missing from Arrow.jl right now is a “concatenation wrapper”: i.e. an AbstractArray that joins together multiple underlying Arrow arrays. As far as I can tell this is the only additional Arrow component needed to support large Feather files.
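To make that concrete, here is a minimal sketch of what such a wrapper could look like; the ConcatVector name and implementation are purely illustrative and not part of Arrow.jl:

```julia
# Illustrative sketch only: a lazy "concatenation wrapper" over several
# underlying vectors (e.g. Arrow columns from separate Feather files).
# ConcatVector is a hypothetical name, not an Arrow.jl type.
struct ConcatVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    chunks::Vector{A}
    offsets::Vector{Int}    # offsets[k] = total length of chunks 1..k
end

ConcatVector(chunks::Vector{A}) where {T, A<:AbstractVector{T}} =
    ConcatVector{T, A}(chunks, cumsum(length.(chunks)))

Base.size(v::ConcatVector) = (isempty(v.offsets) ? 0 : last(v.offsets),)

function Base.getindex(v::ConcatVector, i::Int)
    @boundscheck checkbounds(v, i)
    k = searchsortedfirst(v.offsets, i)          # chunk containing index i
    prev = k == 1 ? 0 : v.offsets[k - 1]
    @inbounds v.chunks[k][i - prev]
end

# Two chunks behave as one logical column:
col = ConcatVector([[1.0, 2.0], [3.0, 4.0, 5.0]])
col[4]    # 4.0
```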
Parquet already complies with the Arrow standard somehow. I don’t know how it handles compression, but I assume that there is some standard for compressing Arrow data that is not well documented. I’ve thought about writing a package that uses the Arrow.jl back-end to read and write Parquet files, but given that Parquet.jl exists, is well maintained, and already uses memory mapping for lazy loading, my motivation to do that is approximately 0 (it would be a lot of work).
@quinnj, I think that the metadata aspect of Arrow is sufficiently vague that there actually is a very significant difference between different Arrow-compliant formats. After all, Feather and Parquet are quite different. As you can see, the current Feather.jl code is not quite trivial, though I like to think I’ve gone to great lengths to simplify it as much as possible and to make future metadata layers on top of Arrow.jl as simple as possible (e.g. the “Locate” interface).
One limitation I found working with Feather is that it could not save/load non-bits types (meaning, if a column of the dataset had a non-bits type, it would error). Would Parquet also have the same limitation? If not, that may be a reason for investing effort in the Parquet format to read/write data.
Yes. These formats are not intended for serializing arbitrary Julia data. In principle, once we implement Arrow structs, there will be some way of doing this, but it still probably wouldn’t be your best option. I think what you are looking for is JLD2.jl.
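For example, a rough sketch of the JLD2.jl route for arbitrary (non-bits) Julia objects; the Experiment type and file name are just illustrative:

```julia
using JLD2

struct Experiment
    name::String
    params::Dict{Symbol, Float64}
end

runs = [Experiment("run1", Dict(:lr => 0.01)), Experiment("run2", Dict(:lr => 0.1))]

@save "experiments.jld2" runs    # serialize the whole vector, custom types and all
@load "experiments.jld2" runs    # restore the variable `runs` from the file
```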
I wonder if I could just use PyCall to hook up with fastparquet? Itching to try…
Looks like you’ll basically have 3-way serialization overhead (Julia → Pandas → Parquet) but might as well give it a try.
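Something roughly like this should work, assuming pandas plus fastparquet (or pyarrow) are installed in the Python environment that PyCall uses; the column names are just placeholders:

```julia
# Sketch of the PyCall route: build a pandas DataFrame from Julia data and let
# pandas write the Parquet file.
using PyCall

pd = pyimport("pandas")

df = pd.DataFrame(Dict("id" => collect(1:10^6), "x" => rand(10^6)))

# Keyword arguments follow the pandas `to_parquet` API
df.to_parquet("data.parquet", engine="fastparquet", compression="gzip")
```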
So it does work. In my test, for a 17 GiB data set, it only takes a few seconds for Julia → Pandas, but Pandas → Parquet with GZIP took over 1 hr 15 mins. I guess Python isn’t very fast there…
Amen to the discussion on CSV files. At my last employer I discovered that one group of engineers was working with a directory which had thousands of CSV files. These turned out to be images of semiconductor wafers which were being processed, along a time axis.
In my experience engineers just want to get on and do their work, and their managers are generally breathing down their necks to get results. So if one person implements a workflow which uses CSV files, the next person and then the next will come along and repeat it.
To be fair, the Julia → Pandas step probably isn’t very much actual work, depending on your data. In fact, if all you had were Vectors of integers and floats, the Julia and numpy memory layouts would be identical, so it would be instantaneous in that case. The GZIP compression, on the other hand, may actually have to do quite a lot of work, depending on which compression algorithm it’s using, though I don’t understand why it took quite that long.
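For what it’s worth, a quick way to see the shared-layout point (assuming PyCall with a working numpy; the no-copy wrapping applies to bits-type arrays like Vector{Float64}):

```julia
using PyCall

x = rand(10^6)
np_x = PyObject(x)    # no-copy numpy wrapper around the Julia array
np_x.dtype            # dtype('float64')
```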
That’s a great (but scary) story!
A major part of the problem with CSVs is that there aren’t really that many sensible alternatives. HDF5 is fine, but there never seems to be any consensus on how to lay it out. As far as I know there aren’t even any truly “standard” HDF5 tabular formats. Parquet is increasingly popular, but it does seem very much geared toward huge datasets, and I know that with its many separate files it can sometimes be a burden on the file system.
When I was in high energy physics we used ROOT trees for storing data. It was a standard practice that nobody ever questioned and it worked just fine. When I came to doing data science in the private sector, I was shocked that there seemed to be no standard non-HEP alternative to this. When I realized it was common practice to store a dataset in a 10GB csv I thought “what have I gotten myself into?”.
This is why I’ve done so much work on Feather.jl, which now has lots of really nice features and which we will announce some time after the official 0.7 release, though the maintainers of the Feather standard have yet to decide on metadata formats for large datasets (we will come up with our own if they don’t).
There’s still no support for writing parquets in Parquet.jl though. Nor anywhere else (except through Spark) currently afaik.
I got a functional parquet writer here
Please help test out Diban.jl. I will start work on a PR to Parquet.jl; I really don’t want the community to have two separately maintained Parquet packages, but I also don’t see why people should wait for something that can be done right away. So I will take a two-pronged approach: I will try to PR to Parquet.jl, and if that takes too long, I will publish Diban.
Sorry for necro-posting, but there is a Parquet writer in Parquet.jl now: Release v0.6.0 · JuliaIO/Parquet.jl · GitHub
It’s based on the Diban work
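For anyone landing here later, usage looks roughly like this (as I understand the v0.6 API; the exact signatures may have changed, so check the Parquet.jl README):

```julia
using Parquet, DataFrames

df = DataFrame(a = 1:10, b = rand(10), c = string.('a':'j'))

write_parquet("example.parquet", df)    # the writer added in v0.6; takes any Tables.jl-compatible table

tbl = read_parquet("example.parquet")   # memory-mapped, lazy reader
DataFrame(tbl)
```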
I just want to thank you for this work, which apparently instigated the addition of this feature.
It’s becoming somewhat of an industry standard to store datasets on S3 and run data pipelines via Spark jobs, before writing back to S3 (or the Google/Azure equivalent).
The format is commonly Parquet, which is convenient because:
- column store gives faster read access for transformations
- partitioning is inherent to the file structure, again giving more direct access to the relevant data
- the schema is stored with the objects, including types, so there is no laborious schema-definition phase when reading
- files are written in chunks for better parallel loading
- array-type columns, which are commonly used to store image or text embedding data along with other records
Thanks @merlin. I got some funding from NumFOCUS too.