I’d like to use the C++ library for Apache Arrow from Julia, but I’m having a tough time with it.
For some context, I’m aware that there is an Arrow.jl package; however, it only supports the memory format. The C++ library is very mature: it supports IO, has a compute API, and abstracts over filesystems to access HDFS/S3.
I’ve looked at Cxx, but it looks like Julia 1.5 is not supported, and I’ve looked at CxxWrap, but it looks like a lot more work on the C++ side.
Are there any other options that are worth exploring?
I’m not familiar with interop with C++ at all, but I’m curious: in what ways do you feel the Arrow.jl library doesn’t currently provide enough functionality? It supports file and arbitrary IO, and the array types you get when deserializing can be used in a wide variety of ways for compute (for just one example, you can do DataFrame(Arrow.Table(file)) and use the entire DataFrames.jl library of functionality). There’s still some work to do for more seamless Parquet.jl interoperability, but it’s at least doable right now.
Anyway, just looking for feedback on what would make it better and be usable for your use-case.
If your interactions between Julia and C++ are not too complicated, then you might consider creating a very simple C wrapper around the C++ code of interest (i.e. using an extern "C" directive). Interacting with C is very easy in any Julia version: you can use @ccall to call the functions from your C wrapper library, and those wrapper functions can then call whatever C++ code you need. This style of interface is pretty common in other languages (e.g. Python), since calling C code is universally easier than interacting with C++.
If your use-case involves passing basic data types (arrays, strings, etc.) back and forth between C++ and Julia, then this could work quite well. On the other hand, if you want to wrap custom C++ data structures or manage lifetimes of objects between both languages, then creating the thin C wrapper could be more effort than it is worth.
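To make the wrapper idea concrete, here is a minimal sketch of what such a C layer might look like. Everything in it is hypothetical: mylib::Column is a stand-in for whatever C++ class you want to expose (it is not Arrow's API), and the mylib_column_* function names are made up for illustration. The C++ object is hidden behind an opaque pointer, and only plain C types cross the boundary.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the C++ library being wrapped (NOT Arrow's API).
namespace mylib {
class Column {
public:
    // Copies n doubles out of the caller's buffer.
    Column(const double* data, std::size_t n) : values_(data, data + n) {}
    double sum() const {
        return std::accumulate(values_.begin(), values_.end(), 0.0);
    }
    std::size_t size() const { return values_.size(); }
private:
    std::vector<double> values_;
};
}  // namespace mylib

// C-linkage layer: unmangled names, plain C types, no exceptions escaping.
extern "C" {

// Opaque handle; callers only ever see a pointer to it.
typedef struct mylib_column mylib_column;

// Creates a column from a raw buffer. Returns nullptr on failure.
mylib_column* mylib_column_new(const double* data, std::size_t n) {
    try {
        return reinterpret_cast<mylib_column*>(new mylib::Column(data, n));
    } catch (...) {
        return nullptr;  // never let a C++ exception cross the C boundary
    }
}

// Number of elements, or 0 for a null handle.
std::size_t mylib_column_size(const mylib_column* col) {
    if (col == nullptr) return 0;
    return reinterpret_cast<const mylib::Column*>(col)->size();
}

// Sum of the elements, or 0.0 for a null handle.
double mylib_column_sum(const mylib_column* col) {
    if (col == nullptr) return 0.0;
    return reinterpret_cast<const mylib::Column*>(col)->sum();
}

// The caller owns the handle and must release it exactly once.
void mylib_column_free(mylib_column* col) {
    delete reinterpret_cast<mylib::Column*>(col);
}

}  // extern "C"
```

Compiled into a shared library, functions like these could then be called from Julia with @ccall, passing a Ptr{Float64} and a Csize_t and keeping the returned handle as a Ptr{Cvoid}. The new/free pair is where the lifetime-management cost mentioned above shows up: every C++ object you expose needs this kind of explicit ownership protocol on the Julia side.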
I wanted to write a distributed DataFrame library, kind of like Spark, and noticed that the Arrow C++ library has a lot of features already implemented. Reading and writing Parquet files and newline-delimited JSON files, for example, are one function call away. It would also be nice to use the compute API of the Arrow C++ library. I think the Arrow.jl implementation allows for storing objects in the Arrow format, which is great, but I’d rather not reimplement all the other functionality that comes with the library.
@rdeits Thanks! This is how the library interacts with Python, but it’s a big library, so I don’t know if it will be easy to write my own wrapper. I will learn more about this, though.
Sounds great; yeah, I think a lot of the compute functionality should be possible via DataFrames/regular Julia array functions, but the IO story is still progressing.
That’s a good point. Julia should be fast enough for array functions, but that does mean I’d have to put in some elbow grease reimplementing those.
I usually prefer reusing things. It’s easy to write code but maintaining code can often cost several times more than the initial cost to write. Reusing things means there’s less maintenance for me to do in the future.
But seeing as interop with C++ seems prohibitively difficult, I might go for the Julia implementation. Not sure how far I’ll get, but it’s something I’m doing as a hobby anyway.
Can you clarify what you mean here? I was referring to the fact that you can do sum(A), mean(A), etc. and those will work already on the array types provided by Arrow.jl. So no reimplementing needed. Another example would be the functions in SplitApplyCombine.jl, which provides generic split-apply-combine functionality on any AbstractArray (which includes the Arrow array types).
Anyway, if there’s functionality you feel is still missing, it’d be great to discuss that and where the best place for it to live would be: Arrow.jl, SplitApplyCombine.jl, or somewhere else.