What file format to use to load data for high-performance computing

In a previous life, I was one of the principal system architects of an OLTP-oriented database system (all proprietary, used heavily in the healthcare/insurance industries) that is also used to store the data from ESA’s Gaia project, so it’s interesting to me to see the rather different design choices made for something like CarbonData.

Thanks for the link!

At the end of the presentation they talk about future plans, and one of those is to also have a row-based format for fast data ingestion (e.g. for streaming) and then convert it to the column-based format for OLAP/analytics, so you get a kind of mixed OLTP/OLAP story.

Thank you for the discussion. My feeling is that the current state is not great, as there is no easy solution. I personally like Feather files, as the columnar structure makes a lot of sense to me.

I would like to ask Anthony whether he would object to supporting reading Feather files from streaming sources. I have tried to “hack” a solution by reading the entire stream into memory first and then letting the existing parser parse the data. Although the change is tiny and conceptually does not make much sense, it helps a lot when you are reading gzipped files (they compress nicely).

The related question is whether someone knows how to produce Feather files from Spark? I have not found any solution.

Thanks for the answers.

You might also want to check out https://kudu.apache.org/

It would be interesting to see a benchmark of Spark (SQL or DataFrame/Dataset API) on Parquet vs. Spark on CarbonData vs. Spark on Kudu. I’ve seen some benchmarks with some of these in them, but never all of them, and never CarbonData vs. Kudu (I guess they are too new).

I think Kudu and CarbonData are quite close feature-wise, except that Kudu isn’t a public file format (sitting on an object store or cloud filesystem) but a storage system of its own.

Can you conveniently load the data into Julia?

Not yet. Spark.jl would need to expand its Spark API coverage and start supporting the Spark SQL/DataFrame/Dataset API of Spark 2.x instead of the low-level RDD API of Spark 1.x. Once that is in place, you could read in all sorts of extra data sources supported by Spark, including CarbonData and Kudu.
Or, for a Kudu-specific low-level RDD approach, Spark.jl would need to wrap KuduRDD.
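For reference, on the Scala side that low-level path looks roughly like the sketch below. This is only an illustration: it assumes the kudu-spark artifact is on the classpath, the master address and table/column names are placeholders, and the KuduContext constructor arguments may differ slightly between kudu-spark versions.

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-rdd-sketch").getOrCreate()

// KuduContext talks to the Kudu master directly (placeholder address below).
val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

// kuduRDD returns an RDD[Row] over the selected columns of a Kudu table.
val rdd = kuduContext.kuduRDD(spark.sparkContext, "test_table", Seq("key", "value"))
rdd.take(10).foreach(println)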

The basic support is already there, I just don’t have time to cover the rest. As far as I understand, supporting Kudu and CarbonData should be pretty easy: if you show me how to read these formats from Java (or Scala without implicit conversions, since implicits aren’t visible through JNI and can’t be used from JavaCall), I’ll try to add corresponding functions to Spark.jl.

For Kudu, see Apache Kudu – Developing Applications With Apache Kudu and/or the Cloudera blog.

For CarbonData, see the slides from this presentation on the Databricks / Spark Summit site.

For Kudu, I really need either Java code or Scala without implicit conversions. For example, this line:

customersAppendDF.write.options(kuduOptions).mode("append").kudu

implicitly converts Spark’s Dataset into something Kudu-specific that has a .kudu method. Since I’ve never worked with Kudu, I don’t know what is converted and when, and I don’t really have time to investigate it.

For CarbonData, the generic read_df and write_df should work (I just added them). Do you have an example CarbonData file that I can test reading on?

You can probably do the insert/read/write/update operations from a KuduContext instead of going via the DataFrame API.
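A rough sketch of what that could look like in Scala, assuming the kudu-spark artifact is on the classpath; the master address, table name, and input path are placeholders, and the exact KuduContext constructor differs between kudu-spark versions:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-context-sketch").getOrCreate()

// Some DataFrame to write; the path is just a placeholder.
val df = spark.read.json("path/to/sample.json")

// KuduContext addresses the Kudu master directly, so no implicit conversions are needed.
val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

// Row-level operations are ordinary methods on KuduContext:
kuduContext.insertRows(df, "test_table")   // insert the rows of the DataFrame
kuduContext.upsertRows(df, "test_table")   // insert or update by primary key
kuduContext.updateRows(df, "test_table")   // update existing rows
// kuduContext.deleteRows(df.select("key"), "test_table")  // delete by key columns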

As for the implicit method kudu, I think I found it in the Git logs:

+++ b/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/package.scala
@@ -32,7 +32,7 @@ package object kudu {
 * Adds a method, kudu, to DataFrameWriter that allows writes to Kudu using
 * the DataFileWriter
 */
implicit class KuduDataFrameWriter(writer: DataFrameWriter) {
  def kudu = writer.format("org.apache.kudu.spark.kudu").save
}
}

so it’s a method on the package object kudu in java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/package.scala.
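Given that definition, the implicit just forwards to the generic DataFrameWriter API, so the earlier write can be expressed without any Kudu-specific implicits. Reusing customersAppendDF and kuduOptions from the snippet above (kuduOptions is assumed to be a Map containing the kudu.master and kudu.table entries):

customersAppendDF.write
  .options(kuduOptions)                    // Map with "kudu.master" and "kudu.table"
  .mode("append")
  .format("org.apache.kudu.spark.kudu")
  .save()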

I don’t have a CarbonData file, but you can read in data from another format like CSV, save it as CarbonData, and then read it back.
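For example, a rough round-trip sketch in Scala; this assumes CarbonData’s Spark integration is on the classpath and registers a “carbondata” data source, and the format name, options, and paths here are only placeholders that may differ between CarbonData versions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("carbondata-roundtrip").getOrCreate()

// Read some existing data, e.g. a CSV file with a header row (placeholder path).
val csvDf = spark.read.option("header", "true").csv("path/to/sample.csv")

// Write it out through the CarbonData data source, then read it back.
csvDf.write
  .format("carbondata")
  .option("tableName", "sample_table")
  .mode("overwrite")
  .save()

val carbonDf = spark.read
  .format("carbondata")
  .option("tableName", "sample_table")
  .load()
carbonDf.show()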

This line is essential, thanks.

However, it seems that Kudu isn’t just a file format, but rather distributed storage: all examples I’ve seen require specifying “kudu.master” and “kudu.table” options instead of a file path. If this is correct, Kudu looks out of scope for this discussion (although it may be in scope for Spark.jl). Has anybody used Kudu in practice to confirm or deny my assumption?

As for CarbonData, their integration with Spark breaks JSON support. I think I will wait until it gets more stable before including it in Spark.jl.

That’s correct: Kudu isn’t a file format but a distributed store that can serve as a data source for Spark.

The reason I brought it up in this thread is that it seems to lie in between pure OLAP and pure OLTP, and CarbonData also seems to have some OLTP features, so I wanted to contrast the two. Anyway, from Spark’s point of view it’s pretty much all the same: a data source, and querying it works the same way (SQL/DataFrame/Dataset).

It seems that a good file format for HPC is really missing. Feather seems promising, but it breaks when you try to save and load Unicode characters (we have filed a bug report). Moreover, I have not found any Feather library for Spark.
Loading Protobufs into Julia is surprisingly slow, because the code does not seem to be type-stable.
HDF5 (JLD) has poor support in Spark and produces larger files than gzipped Feather, even with compression turned on.
I am a little bit desperate and do not see a good solution right now.

Have you tried this: https://github.com/valiantljk/h5spark?

No, but it only supports reading HDF5, not writing. We need to go in the direction of preparing data in Spark and doing the processing in Julia (TensorFlow).
Tomas

That seems backwards to me, but OK: just use Python in Spark and use http://www.h5py.org/ to write the data to HDF5.

I haven’t tried it myself since I don’t have Kudu installed, but something like this should work:

  1. Check out the latest master of Spark.jl.
  2. In ~/.julia/v0.6/Spark/jvm/sparkjl/pom.xml, uncomment the dependencies related to Kudu.
  3. Run mvn clean package from the same directory to rebuild the main JAR.
  4. Run from Julia:
spark = SparkSession()
df = read_json(spark, "path/to/sample.json")
options = Dict("mode" => "append", "kudu.master" => "kudu.master:7051", "kudu.table" => "test_table")
write_df(df; format="org.apache.kudu.spark.kudu", options=options)

Essentially, you can try out any Spark-compatible file format using this approach - just add the required dependencies and specify the correct format and options.

Apache Arrow might be another option. Arrow and Parquet are merging their code bases.