[ANN] RowTables.jl

RowTables.jl is intended to be (and often is) faster than DataFrames at row-wise operations.

This package is not registered. See README.md for a few benchmarks.


Very cool! Alas, I almost never have data where I do only row-wise stuff or only column-wise stuff. It’s always a mix…

IndexedTables should be quite fast at iterating rows, esp. if the entries are isbits. Have you benchmarked your package relative to it?

The idea is that if you have some row-intensive operations, e.g., pushing rows or permuting them (repeatedly), it is worth converting to a RowTable, doing the operations, and converting back. The conversion is just RowTable(df, tuples=false) and DataFrame(rt). I made the indexing almost the same as in DataFrames (the exceptions are shown in the examples in the README), and it would not be a problem to make it exactly the same. Because of this, it should be easy to try the optimization by adding a couple of lines, and just as easy to revert.
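To illustrate why pushing and permuting rows favor row-wise storage, here is a minimal sketch in plain Julia (1.x assumed). It does not use RowTables or DataFrames; the names `cols`, `rows`, and `push_row!` are made up for the illustration and say nothing about how either package is implemented.

```julia
# Column-oriented layout: one vector per column.
cols = (a = Int[], b = Float64[])

# Row-oriented layout: one vector holding whole rows.
rows = Tuple{Int,Float64}[]

# Appending a row to the column layout touches every column vector.
# (push_row! is a hypothetical helper, not part of any package.)
function push_row!(cols::NamedTuple, row::Tuple)
    for (col, val) in zip(cols, row)
        push!(col, val)
    end
    return cols
end

for i in 1:3
    push_row!(cols, (i, i / 2))   # one push! per column
    push!(rows, (i, i / 2))       # a single push! for the whole row
end

# Permuting rows: one reindexing for the row layout, one per column otherwise.
perm = [3, 1, 2]
rows_permuted = rows[perm]
cols_permuted = map(c -> c[perm], cols)
```

With only two columns the difference is small; the per-column work grows with the number of columns, which is where a row layout would be expected to help.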

I also worked on it because I was curious.

I wrote this several months ago. Benchmarking showed that it was not worth the conversion in the use case at hand, so I left it. A couple of days ago, I thought, why not post the code? So I wrote a README, etc., and put it on GitHub.

I did new benchmarks for the README. I don’t have a use for it now, but the benchmark times make me think such a case might arise.

I just tried this a little bit. Doing it more carefully would take a bit of time.
I had trouble testing IndexedTables when I wrote this several months ago; I don’t remember what the problem was. I just now installed IndexedTables. AFAICT, it requires v0.6, but RowTables only works with v0.7, so the Julia version is a major factor that I can’t hold constant.

I just spent some time trying to understand IndexedTables. I only looked at the non-sparse storage option. I have not worked much with named tuples. Trying to construct them programmatically was a PITA… I’m sure there are better ways. To try to make the comparison fairer, I store the RowTable rows as Tuples. Making them named tuples would take a bit of time for me, probably very little for someone who knows them well.

In any case, once you have the row, quick benchmarks show that an operation such as summing the elements takes the same time with plain-vanilla Tuples on v0.7 as with named tuples on v0.6 [No, see below]. It looks like IndexedTables stores the data as columns and requires a fixed named tuple type. To get a row, it creates a named tuple by iterating over the columns. As expected, accessing a row is faster with RowTables, with the advantage increasing with the number of columns. But in some code I found that IndexedTables got rows faster; I couldn’t chase this down quickly. In any case, if you have a 50x100 table and sum all rows by summing each row and accumulating, then RowTables is faster. I won’t quote numbers now because there are too many variables to nail down.
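For anyone who wants to see the shape of that last test, here is a rough sketch in plain Julia (1.x assumed). It uses vectors of vectors as stand-ins for the two layouts rather than RowTables or IndexedTables, so `columns`, `rowstore`, `getrow`, and `total_by_rows` are all invented names, and it makes no claim about timings.

```julia
nrows, ncols = 50, 100

# Column layout: a vector of column vectors.
columns = [rand(nrows) for _ in 1:ncols]

# Row layout: a vector of row vectors holding the same values.
rowstore = [[columns[j][i] for j in 1:ncols] for i in 1:nrows]

# Getting row i from the column layout means visiting every column
# and building a new row object each time.
getrow(columns, i) = [col[i] for col in columns]

# Sum each row and accumulate the results into a grand total.
function total_by_rows(getter, table, n)
    acc = 0.0
    for i in 1:n
        acc += sum(getter(table, i))
    end
    return acc
end

total_cols = total_by_rows(getrow, columns, nrows)
total_rows = total_by_rows((t, i) -> t[i], rowstore, nrows)
```

Both totals agree; the only difference is that the column layout has to assemble each row before summing it, while the row layout hands back a row that already exists.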

If someone is interested, I’d be more motivated to do benchmarks and post them. (Well, I’m motivated; I just have other stuff to do.)

EDIT: Named tuples are not only named, but typed, and the compiler takes advantage of this. Summing a named tuple of floats is faster than summing an ordinary tuple with the same data.
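To make "named and typed" concrete, here is a small sketch on current Julia (1.x assumed; the thread used v0.6 and v0.7, where things may have behaved differently). It only inspects types and does not attempt to reproduce the timing difference mentioned in the EDIT.

```julia
nt = (a = 1.0, b = 2.0, c = 3.0)   # named tuple
t  = (1.0, 2.0, 3.0)               # plain tuple with the same data

# The field names and the element types are both part of the named tuple's type,
# so the compiler knows every field is a Float64 without inspecting the values.
is_typed = typeof(nt) <: NamedTuple{(:a, :b, :c),NTuple{3,Float64}}

# values(nt) drops the names and returns the underlying plain tuple.
same_data = values(nt) === t

total_named = sum(values(nt))
total_plain = sum(t)
```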

The README shows benchmarks compared to DataFrames. I benchmarked IndexedTables and the numbers are similar. That’s not surprising: one package stores data by rows, the other two by columns. I added support for named tuples to RowTables, but this is not worth much without associated functionality, like type-stable column access. There are several table packages, and they require a lot of effort to maintain. Likewise, developing RowTables would require a lot of effort. RowTables is an experiment to see how performant row-wise storage is.

The benchmarks look very impressive! Most of my work consists of reading data and doing some calculations on it. Very useful package!
