As for Arrow itself, the conference post you shared was interesting, and I certainly share some of their apparent consternation about it. As for C++ idioms, I did not spend much time looking at their code, and though I have a good deal of experience with C++, I was working in the high energy physics community, somewhat in a bubble, so I'm not sure I'd recognize common C++ practices from the broader software community.
I also had some misgivings about the code being in a monorepo. I felt that, were we to have a Julia arrow package there, it would suddenly become much more difficult to work on because of the friction involved in working on such a large project. The authors seemed very insistent that a monorepo was a good idea, to the extent that they expressed interest in moving Julia code there long before it was in a state appropriate for that repo. The main reason they gave for wanting a monorepo was their claim that it made testing easier, which, to be honest, I find rather dubious.
I suspect that one of the reasons they want the monorepo is that they did not do an incredibly thorough job of documenting the standard, instead relying on all the code living in that monorepo, where the Arrow authors can keep it compliant. I suspect this is why people expressed concerns about there not being a clear distinction between the specification and the implementation. A clear specification would require a much more detailed white paper, which they lack and which, as far as I have seen, the authors have shown no interest in writing.
As for the format itself, after my relatively recent work implementing it, I have several concerns about it:
- The format is extremely general and supports quite a few composite data types, so I don't really understand why it was decided to keep the metadata so focused on tabular data. The format itself in principle supports all sorts of data structures, but the metadata is very strongly oriented toward tables. A good place to see the dubious consequences of this is the tensor metadata, which seems entirely tacked on. I suspect that if they had realized where this was all heading earlier, they would have changed the metadata and wound up with a more generalized IPC data format, which I think still would have been extremely useful because, perhaps surprisingly, the alternatives are lacking.
- The metadata has what seem like unnecessary inconsistencies, and this apparently forces significantly more code to be written and maintained. For example, strings are physically just a structure that they have called `List<UInt8>`, but rather than representing them that way, there are several places where the metadata is special-cased for strings. This makes it much harder than it should be to write generic code that works across many different data types (see the first sketch after this list).
- It is very hard to do much of anything with individual Arrow messages or batches on their own, which is a little unfortunate as it makes data harder to construct. For example, if a batch simply carried metadata saying that it holds a particular binary data type (like int or float), the format would be far less dependent on the header, and rather than having to orchestrate the construction of an object across a bunch of separate batches, one could build it up as if by bricks (the second sketch after this list illustrates the idea). I realize they wanted to keep the metadata small because this is an IPC format, but I felt there were a few places where they could have made the metadata in the batches slightly bigger in exchange for a lot more convenience (along with smaller metadata in the header and possibly even less reliance on referencing the specification).
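To make the string complaint concrete, here is a minimal sketch in Julia (not Arrow.jl's actual API; the buffer names are mine) of the shared physical layout: a string column is stored exactly like a `List<UInt8>`, as an offsets buffer plus a flat values buffer, so one generic accessor could in principle serve both.

```julia
# Physical layout shared by List<T> and strings: offsets + flat values.
# Hypothetical decoded buffers for the column ["ab", "c", "def"]:
offsets = Int32[0, 2, 3, 6]                    # n+1 zero-based offsets
values  = UInt8['a', 'b', 'c', 'd', 'e', 'f']  # all bytes, concatenated

# One generic accessor covers any list-like column:
getvalue(offsets, values, i) = values[(offsets[i]+1):offsets[i+1]]

String(getvalue(offsets, values, 3))  # "def"
```

Because the metadata nevertheless declares strings as their own logical type rather than as `List<UInt8>`, a reader ends up with a separate code path for them even though the buffers are identical.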
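And here is a purely hypothetical sketch of the "bricks" idea above; nothing like this `TaggedBatch` or its tag scheme exists in the actual format, where a record batch is uninterpretable without the schema message from the stream header.

```julia
# Hypothetical: a batch that carries its own (tiny) type tag, so it can
# be decoded, or constructed, in isolation.
struct TaggedBatch
    tag::Symbol            # e.g. :Int64 or :Float64 (made-up tag scheme)
    nrows::Int
    buffer::Vector{UInt8}  # raw little-endian column data
end

# Decode without consulting any external schema:
function decode(b::TaggedBatch)
    T = b.tag === :Int64   ? Int64 :
        b.tag === :Float64 ? Float64 :
        error("unsupported tag $(b.tag)")
    collect(reinterpret(T, b.buffer))[1:b.nrows]
end

b = TaggedBatch(:Float64, 3, collect(reinterpret(UInt8, [1.0, 2.0, 3.0])))
decode(b)  # [1.0, 2.0, 3.0]
```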
I suspect a lot of the things that seem odd about the format have something to do with pandas or Python, but I don't really know. Also bear in mind that I know very little about what it takes to make a really good IPC format, so take what I say with a grain of salt. The Arrow authors may have perfectly good reasons for much of what I'm complaining about here, in which case I'd be happy to retract my criticism.
Having said all this, quite a lot of software already implements Arrow, and it could be useful for Julia in all sorts of ways. For people to be willing to use Julia at all in the "big data" world, it's going to have to do things like interact efficiently with Spark, which will require reading and writing Arrow.
I have not given up on my overhaul of Arrow.jl, and I have made significant progress (I can serialize and deserialize many messages), but I've repeatedly gotten derailed by work and personal stuff, so I can't say exactly when I'll return to it in earnest.