What fileformat to use to load data for high performance computing

ScottPJones · July 20, 2017, 8:54pm

In a previous life, I was one of the principal system architects of an OLTP oriented database system (all proprietary, used heavily in the healthcare / insurance industries), which is also used to store the data from ESA’s Gaia project, so it’s interesting to me to see the rather different design choices made for something like CarbonData.

Thanks for the link!

Steven_Sagaert · July 20, 2017, 9:03pm

At the end of the presentation they talk about future plans, and one of those is to also have a row-based format for fast data ingestion (as in streaming) and then convert that to a column-based format for OLAP/analytics, so you get a kind of mixed OLTP/OLAP story.

Tomas_Pevny · July 21, 2017, 3:39am

Thank you for the discussion. My feeling is that the current state is not great, as there is no easy solution. I personally like Feather files, as the columnar structure makes a lot of sense to me.

I would like to ask Anthony whether he would object to supporting reading Feather files from streaming sources. I have tried to “hack” a solution by reading the entire stream into memory first and then letting the existing parser parse the data. Although the change is tiny and conceptually does not make much sense for a stream, it helps a lot when you are reading gzipped files (they compress nicely).
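
A minimal sketch of that workaround, written in Python with only the standard library rather than against the actual Julia Feather package. It assumes the parser needs a seekable input because it reads a footer at the end of the file first. The toy layout (payload, 4-byte length, magic bytes) and the names `write_toy_file`, `parse_toy_file` and `parse_from_stream` are hypothetical stand-ins for illustration, not the Feather format or any real API.

```python
import gzip
import io
import struct

MAGIC = b"TOY1"  # hypothetical magic bytes for the toy format


def write_toy_file(payload: bytes) -> bytes:
    """Toy layout: payload, then 4-byte little-endian payload length, then magic."""
    return payload + struct.pack("<I", len(payload)) + MAGIC


def parse_toy_file(f) -> bytes:
    """Parser that needs random access: it reads the footer, then seeks back."""
    f.seek(-8, io.SEEK_END)
    (length,) = struct.unpack("<I", f.read(4))
    if f.read(4) != MAGIC:
        raise ValueError("bad magic bytes")
    f.seek(-8 - length, io.SEEK_END)
    return f.read(length)


def parse_from_stream(stream) -> bytes:
    """Read the whole stream into memory, then parse the seekable buffer."""
    buffer = io.BytesIO(stream.read())
    return parse_toy_file(buffer)


# Usage: a gzipped file is decompressed fully into memory before parsing.
compressed = gzip.compress(write_toy_file(b"column data"))
with gzip.open(io.BytesIO(compressed), "rb") as stream:
    result = parse_from_stream(stream)
print(result)
```

The cost of this approach is that the whole decompressed file must fit in memory, which is the trade-off the hack accepts in exchange for leaving the parser untouched.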

A related question: does anyone know how to produce Feather files from Spark? I have not found any solution.

Thanks for the answers.

Steven_Sagaert · July 24, 2017, 11:15am

You might also want to check out https://kudu.apache.org/

It would be interesting to see a benchmark of Spark (SQL or DataFrame/Dataset API) on Parquet vs Spark on Carbondata vs Spark on Kudu. I’ve seen some benchmarks that include some of these, but not all, and never Carbondata vs Kudu (I guess they are too new).

I think Kudu and Carbondata are quite close feature wise except that Kudu isn’t a public file format (on object storage cloud filesystem) but an object store.

Pevnak · July 25, 2017, 9:43am

Can you conveniently load the data into Julia?

Steven_Sagaert · July 25, 2017, 12:10pm

Not yet. Spark.jl would need to expand its Spark API coverage and start supporting the Spark SQL/Dataframe/Dataset API of Spark 2.x instead of the low-level RDD API of Spark 1.x. Once that is in place you could read from all sorts of extra datasources supported by Spark, including Carbondata & Kudu.
Or, for a Kudu-specific low-level RDD approach, Spark.jl would need to wrap KuduRDD.

dfdx · July 25, 2017, 12:58pm

Basic support is already there; I just don’t have time to cover the rest. As far as I understand, supporting Kudu and Carbondata should be pretty easy. If you show me how to read these formats from Java (or Scala without implicit conversions, since they aren’t part of JNI and can’t be used from JavaCall), I’ll try to add the corresponding functions to Spark.jl.

Steven_Sagaert · July 25, 2017, 1:13pm

For Kudu see Apache Kudu - Developing Applications With Apache Kudu and/or Cloudera Blog

For CarbonData, see the slides from this presentation: Home - Data + AI Summit 2022 | Databricks

dfdx · July 25, 2017, 2:55pm

For Kudu I really need either Java code, or Scala without implicit conversions. For example this line:

customersAppendDF.write.options(kuduOptions).mode("append").kudu

implicitly converts Spark’s Dataset into something Kudu-specific that has a .kudu() method. Since I’ve never worked with Kudu, I don’t know what is converted and when, and I don’t really have time to investigate it.

For Carbondata, the generic read_df and write_df should work (I just added them). Do you have an example Carbondata file that I can test reading on?

Steven_Sagaert · July 25, 2017, 7:23pm

You can probably do the insert/read/write/update operations through KuduContext instead of via a Dataframe.

For the implicit method “kudu”, I think I found it in the Git logs:

+++ b/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/package.scala
@@ -32,7 +32,7 @@ package object kudu {
   * Adds a method, kudu, to DataFrameWriter that allows writes to Kudu using
   * the DataFileWriter
   */
  implicit class KuduDataFrameWriter(writer: DataFrameWriter) {
    def kudu = writer.format("org.apache.kudu.spark.kudu").save
  }
}

So it’s an implicit class defined in the package object kudu in java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/package.scala, and calling .kudu is just shorthand for writer.format("org.apache.kudu.spark.kudu").save.

I don’t have a Carbondata file, but you can read in data from another format such as CSV, save it as Carbondata, and then read it back.

dfdx · July 25, 2017, 11:23pm

This line is essential, thanks.

However, it seems that Kudu isn’t just a file format but rather a distributed storage system: all the examples I’ve seen require specifying the “kudu.master” and “kudu.table” options instead of a file path. If this is correct, Kudu looks out of scope for this discussion (although it may be in scope for Spark.jl). Has anybody used Kudu in practice who can confirm or deny my assumption?

As for Carbondata, their integration with Spark breaks JSON support. I think I will wait until it gets more stable before including it into Spark.jl.

Steven_Sagaert · July 25, 2017, 11:30pm

That’s correct: Kudu isn’t a file format but a distributed store that can serve as a datasource for Spark.

The reason I brought it up in this thread is that it seems to lie in between pure OLAP and pure OLTP, and Carbondata also seems to have some OLTP features, so I wanted to contrast the two. Anyway, from Spark’s point of view it’s pretty much all the same: a datasource, and querying it works the same way (SQL/Dataframe/Dataset).

Pevnak · July 26, 2017, 4:44am

It seems that a good file format for HPC is really missing. Feather seems promising, but it breaks when you try to save and load Unicode characters (we have filed a bug report). Moreover, I have not found any Feather library for Spark.
Loading Protobufs into Julia is surprisingly slow, apparently because the code is not type-stable.
HDF5 (JLD) has poor support in Spark, and it produces larger files than gzipped Feather, even when compression is turned on.
I am a little bit desperate and do not see a good solution for now.

Steven_Sagaert · July 26, 2017, 7:56am

Have you tried https://github.com/valiantljk/h5spark?

Pevnak · July 26, 2017, 9:45am

No, but it only supports reading HDF5, not writing. We need to go in the other direction: prepare the data in Spark and do the processing in Julia (TensorFlow).
Tomas

Steven_Sagaert · July 26, 2017, 10:52am

That seems backwards to me but ok: just use Python in Spark and use http://www.h5py.org/ to write the data to HDF5.

dfdx · July 26, 2017, 1:29pm

I haven’t tried it myself since I don’t have Kudu installed, but something like this should work:

1. Check out the latest master of Spark.jl.
2. In ~/.julia/v0.6/Spark/jvm/sparkjl/pom.xml, uncomment the dependencies related to Kudu.
3. Run mvn clean package from the same directory to rebuild the main JAR.
4. Run from Julia:

spark = SparkSession()
df = read_json(spark, "path/to/sample.json")
options = Dict("mode" => "append", "kudu.master" => "kudu.master:7051", "kudu.table" => "test_table")
write_df(df; format="org.apache.kudu.spark.kudu", options=options)

Essentially, you can try out any Spark-compatible file format or datasource using this approach: just add the required dependencies and specify the correct format and options.

bchi · December 1, 2018, 3:41pm

Apache Arrow might be another option. Arrow and Parquet are merging their code bases.
