I am working on a project where I want to do some machine learning over large data (about 50 GB) stored in AWS S3. I would like to ask people about their opinions on, and experiences with, the data format they use and would recommend. My current approach uses JLD, but since other folks on the team I am a member of use Spark, they obviously do not fancy JLD. And because a big part of the preprocessing is already in Spark, they do not fancy HDF5 either, because there is no good support for it there.
Their preferred format is Avro, but I have found that the Julia library is quite poorly written and reading a 3 GB file is 10 times slower than in Scala. I would like to ask about experiences with Feather.jl and Parquet.jl: are they in a good state (supported)? Are there other alternatives I am not aware of?
What’s your use case for these data? E.g., Parquet is a columnar storage format very well optimized for analytical queries (you don’t have to read all columns, just the ones you need), while Avro is row-based and essentially schema-less, and so is good for data transfer.
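To make the row vs. column distinction concrete, here is a toy sketch in plain Julia (no storage library involved): a row-based layout keeps whole records together, while a columnar layout keeps each field contiguous, so an analytical query only has to touch the columns it needs.
# Row-based layout (Avro-like): one record per element
struct Row
    id::Int
    price::Float64
    label::String
end
rows = [Row(1, 9.99, "a"), Row(2, 19.99, "b")]
# Columnar layout (Parquet-like): one vector per field
ids    = [1, 2]
prices = [9.99, 19.99]
labels = ["a", "b"]
# A query over one field only has to scan that column
sum(prices)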
columns might hold different types (Float, Int, String).
Do you mean that 2 different columns may be of different types (e.g. like a table in a relational database), or that there may be data with 2 different types within one column (i.e. heterogeneous columns)?
If it’s about 2 different columns, Parquet is a good choice. However, using it from Julia is a different story. I’ve just tried to build Parquet.jl on macOS and it failed because of Thrift.jl (although I normally don’t use Mac for Julia development, so it might be just my broken setup).
Another option is Spark.jl, especially given that your team (assuming you and the topic starter are on the same team) already uses Spark. Spark.jl currently doesn’t support Parquet, but maybe it’s just about time to add it. So if you ping me in the evening (~5h from now), I’ll try to do it.
What I mean is the first case: 2 different columns can have two different data types, but inside one column all values are of the same type. From what I have read, Parquet seems to be a decent option, yet as David has written, the library is not in a perfect state.
Currently, it seems that our engineers will go with protocol buffers. I cannot judge whether that is a good option or not; I will check with them.
Since my interest is in large scale, I feel like this is an important question.
It took me longer than I expected, but I’ve finally migrated Spark.jl to Julia 0.6 and added SQL interface. Now you can load Parquet files like this (requires Spark.jl master):
using Spark
# Spark's Dataset
ds = read_parquet("path/to/file.parquet")
# collect to a list of Julia tuples
collect(ds)
Note that this uses Java libraries to read Parquet, so having Parquet.jl working on Julia 0.6 is still a very desirable feature.
Hi Andrei,
thanks for the answer. We have tried this solution, but it failed to load a 3 GB file, so we ended up using Protobuf. After fixing some type instability (others still remain), reading is 4x slower than in Scala.
Tomas
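For readers who have not used it, reading a single serialized message with ProtoBuf.jl looks roughly like the sketch below; `Record` stands in for a message type generated from a .proto schema with protoc and the Julia plugin, so the name is hypothetical.
using ProtoBuf
# `Record` is a hypothetical message type generated from a .proto
# schema; substitute your own generated type.
rec = open("data.pb", "r") do io
    readproto(io, Record())
end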
Parquet seems like a reasonably good format, but I have found it extremely irritating that the only decent interface for it is Spark. Loading up Spark just to view the contents of a (non-distributed) parquet is like lighting a cigarette with a cruise missile.
One suggestion that I don’t see here is Feather.jl. I have a PR that allows you to pull one field of data from it at a time (without loading the whole thing into memory), but it’s in purgatory right now while @quinnj works on overhauling the DataStreams infrastructure.
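For anyone who wants to try it, basic Feather.jl usage is just a read/write pair; note that this reads the whole file into memory, which is exactly what the PR above is meant to avoid.
using Feather, DataFrames
# read an entire Feather file into a DataFrame
df = Feather.read("data.feather")
# write a DataFrame back out
Feather.write("copy.feather", df)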
@dfdx, thanks for updating Spark. It seems to work fine. If I had to guess, I’d say that getting it working in the first place was probably a gigantic pain in the ass.
We have tried Feather, but I do not know if there is an exporter from Spark (Java / Scala). I also do not know if Feather is the right solution, since we assume the data to be stored in S3, although it could be compressed beforehand. So the exporter from Scala is probably the biggest bottleneck.
Tomas
Do you remember the error you got? If it’s an OutOfMemory error, you can increase the JVM memory by passing something larger than 1024m.
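Since Spark.jl runs on top of JavaCall.jl, one way to raise the heap limit is to pass a JVM option before the JVM starts; the sketch below assumes JavaCall’s init hook is used directly, and the exact mechanism may differ depending on how Spark.jl initializes the JVM.
using JavaCall
# Must run before the JVM is started; "-Xmx4g" is just an example
# heap size, pick whatever fits your data.
JavaCall.init(["-Xmx4g"])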
After fixing some type instability (others still remain), reading is 4x slower than in Scala.
I remember some earlier versions of Julia had file IO much slower than, say, Python’s. If that is still the case for the current version, it’s worth reporting/reminding.
Yes,
we have considered increasing the memory limit of the JVM, but then expandingman’s argument won: it just seems super weird to me to launch a JVM only to load a file.
Regarding the reading of Protobufs being slower than in Scala, we believe it is due to type instability. We have already removed one;
see this PR: https://github.com/JuliaIO/ProtoBuf.jl/pull/95
but there appear to be some others. We have observed a similar issue with Avro.
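For anyone who wants to hunt such instabilities down, `@code_warntype` is the usual tool; here is a toy illustration of the kind of instability in question (not the actual ProtoBuf.jl code).
# Toy example of type instability: the return type depends on a
# runtime value, so the compiler infers a Union and downstream
# code pays for dynamic dispatch.
function read_field(tag::Int)
    if tag == 1
        return 1.0       # Float64
    else
        return "text"    # String
    end
end
# The Union{Float64, String} in the output is the red flag;
# a type-stable reader returns a single concrete type.
@code_warntype read_field(1)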
(This is exactly what I like about the Julia community: when something doesn’t work, people don’t just complain but instead create a PR that fixes the issue.)
Well,
the problem here is that the fix is not complete. As my colleague told me, there are still some type instabilities remaining, and it seems that fixing them is non-trivial.
Gee, this is confusing… I tried to find some information on whether the folks pushing Parquet are on board with CarbonData, but didn’t really find anything, so the situation seems a bit unclear to me.
AFAIK CarbonData comes from Huawei, and it’s a different team than Parquet’s. For the time being, Parquet is older and more entrenched and will still be used in the near future. So the two (together with a bunch of older file formats) will coexist, but given that CarbonData is superior to Parquet in some cases, it might become the default file format for Spark/Spark SQL. We’re not there yet, though.
That’s very interesting. I do wonder how well custom types will be handled; that seems to be an afterthought, which might be an issue for Julia programmers, for whom very powerful custom types are so central.
Also, CarbonData is designed strictly for OLAP work, so I’m curious if most technical/scientific HPC programmers ever need high performance OLTP oriented storage?
What I’ve gathered over the past 18 months is that the situation in private industry (at least in media, but that seems to be the sector that drives most of this stuff) is simply a giant mess. It’s sort of as if they woke up one day some time in early 2015 and all of a sudden realized that they might actually have to do something with the data they have sitting around in ridiculous formats in SQL databases and (the horror, the horror) CSV files. What has followed has been a mad scramble to get in place basic functionality that should have existed since some time in the early 2000s. You have to remember that before Parquet and Avro and what-have-you there were basically no alternatives. People would keep stuff sitting around in HDFS clusters in CSV format and have giant, comically inefficient distributed systems for doing even basic parsing of that data.
Anyway, my point is, the existence of something like CarbonData is hardly surprising. I doubt that either Parquet or CarbonData will emerge as any sort of long-term standard. In the meantime, it’s probably best to do whatever works best for you.
Like you said, it’s designed for OLAP. I think for OLTP in the big data space, things like NoSQL stores (e.g. Cassandra) or in-memory solutions like Ignite (which has good integration with Spark) or SnappyData (even tighter integration with Spark: in fact it fuses GemFire with Spark and changes the Spark code, making Spark a “real” OLTP SQL DB) are used.