Help with Arrow.jl and size of files

I was recently exploring the Arrow.jl package and arrow package in R. I had a project where I needed to write a function to convert a large file in Julia, then port it to R for another package we need to use there.

I wrote this large array out as a DataFrame. It’s a matrix of only space delimited 1’s and 0’s (haplotypes). ~10,000 rows with ~45,000 columns.

My question: Arrow.jl produces a binary file, yet that file is 3.4 GB, whereas the plain-text ASCII file is 854 MB.

What am I doing wrong here? I expected the binary to be considerably smaller and faster to read into R than the ASCII file.

# Julia
using Arrow, DataFrames
Arrow.write("file.arrow", DataFrame(my_array))

# R
library(arrow)
read_feather("file.arrow")

Let me know if you can help. Thanks.

Does it help to specify compression when writing the file?

First, DataFrame(my_array) should normally error; DataFrame(my_array, :auto) is needed instead. What is the type of my_array?
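
As a small illustration of the constructor call mentioned above, here is a sketch on a toy matrix. It assumes a recent DataFrames.jl, and the variable name `toy` is made up for the example:

```julia
using DataFrames

# Toy stand-in for the real haplotype matrix (hypothetical data).
toy = [1 0 1; 0 0 1]

# With :auto, column names are generated automatically as x1, x2, ...
df = DataFrame(toy, :auto)

println(names(df))       # column names as strings
println(eltype(df.x1))   # element type is inherited from the matrix
```

The columns keep the element type of the matrix, so an Int64 matrix yields Int64 columns.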

A crucial point is that your data should have an Int8 element type. Using Int64 wastes a lot of memory. With Int8 (I tried this on data of your size):

  • the CSV.jl file is almost 870 MB and is slow to write;
  • the Arrow.jl file is 450 MB and is faster to write.

(this is without compression)
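
The sizes in this thread can be checked with back-of-the-envelope arithmetic. This is a rough sketch that assumes exactly 10,000 × 45,000 entries (the question only gives approximate dimensions) and ignores file metadata:

```julia
# Approximate dimensions quoted in the question.
n = Int64(10_000) * 45_000

bytes_int64 = n * sizeof(Int64)   # 8 bytes per entry
bytes_int8  = n * sizeof(Int8)    # 1 byte per entry
bytes_text  = n * 2               # one digit plus one delimiter (or newline) per entry

gib(b) = round(b / 2^30, digits = 2)
mib(b) = round(b / 2^20, digits = 1)

println("Int64: ", gib(bytes_int64), " GiB")
println("Int8:  ", mib(bytes_int8), " MiB")
println("text:  ", mib(bytes_text), " MiB")
```

Under these assumptions the Int64 layout comes to about 3.35 GiB, the text file to about 858 MiB, and the Int8 layout to 450 million bytes, which is in line with the 3.4G, 854M, and 450MB figures reported above.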

Also, using Int8 will put much less burden on RAM. If instead you wanted to represent your data as Bool, then using a BitArray would give you another 8x savings in RAM.
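
A quick way to see the in-memory difference is to measure a toy matrix. This is a sketch on made-up random data, not on the real haplotypes:

```julia
# Toy 0/1 matrix (hypothetical stand-in for the haplotype data).
A = rand(0:1, 1_000, 800)

A8   = Int8.(A)   # Matrix{Int8}: one byte per entry
bits = A .== 1    # BitMatrix: one bit per entry

# Base.summarysize reports the memory reachable from an object, in bytes.
println(Base.summarysize(A))
println(Base.summarysize(A8))
println(Base.summarysize(bits))
```

Each step should come out roughly 8x smaller than the previous one, apart from a small fixed overhead per array.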

I suppose it’s a worst-case scenario, for the default type. In CSV, they are already 1 byte plus another for comma (or space), so not too bad. Maybe you should stick to just CSV as near-optimal (for short strings)? Though, ideally you would want a BitArray (but only if you have no Missing) compressing 8x.

[Some specialized bioinformatics formats exist, as you may know, and at least the most common ones are supported by Julia packages. Also, if you have far more zeros than ones, a sparse matrix would be better, but I’m not sure whether DataFrames.jl supports it. I’d guess it supports all Julia types, but from a quick look it may not be implemented yet and may still convert to dense. I’m also not sure whether Arrow supports such data.]

I’m not sure if this is the sparse that may be wanted in Julia (but it’s merged): https://github.com/JuliaData/DataFrames.jl/pull/1040

See also: SparseDataFrame.to_parquet fails · Issue #20692 · pandas-dev/pandas · GitHub

@bkamins Even if you start with Int64 (or was it Float64? It wouldn’t matter size-wise), which seems a logical default when you import into a DataFrame (e.g. from CSV, where you might not know the value ranges ahead of time), would it be wise for DataFrames to export each column as the smallest type it can get away with? You can ask for the extreme values of columns, and maybe those values are even cached, though I’m not sure.
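
To make the idea concrete, here is a minimal sketch of choosing the narrowest type from a column's extreme values. The helper `narrowest_int_type` is hypothetical and is not part of DataFrames.jl or Arrow.jl:

```julia
# Hypothetical helper: smallest signed integer type that can hold every value in v.
function narrowest_int_type(v::AbstractArray{<:Integer})
    lo, hi = extrema(v)
    for T in (Int8, Int16, Int32, Int64)
        if typemin(T) <= lo && hi <= typemax(T)
            return T
        end
    end
    return eltype(v)  # fall back to the original type (e.g. Int128 or BigInt)
end

col = [0, 1, 1, 0]            # Vector{Int64} on a 64-bit machine
T = narrowest_int_type(col)   # Int8 for this column
narrow = T.(col)              # converted copy with the narrower element type
```

This needs one pass over each column to find the extrema, plus a second pass to convert.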

The drawback (or a plus?) is that when you import the file back you would get columns of Int8 (or BitArrays), which would be bad for mutability if you later want to store more extreme values. I guess Arrow/Feather keeps track of only one type per column; I’m not sure whether it could keep the original type in some metadata in case you want it on import.

You also want compatibility with other languages (not sure would be a problem for at least Python and R):

  • The primary reason for the existence of Feather is to have a data format using which data frames can be exchanged between Python and R. […]
  • The example Python program creates a pandas dataframe object from a Python dictionary. The dataframe is persisted into a disk file in Feather format by calling the to_feather() method on the dataframe instance.

Thanks a lot! This was helpful. I never considered Int64 vs anything else (still moving from R). Matrix{Int64} is the type according to varinfo().

No errors but I’ll try :auto and Int8 or Bool. Yes, CSV is slow, just used to write out to double check the file against my input matrix (think my function is working as I thought it should).

You say “without compression”, so :auto gives me the compression? I thought Arrow did this naturally, but didn’t have time to read their full documentation yet.

I’ll see what I can do here. I know others in our field have moved to bit storage for genotype data. Thank you very much.

Arrow.jl has complete support for the basic datatypes of Arrow. From memory, C++ had better support for some extras, which may not be relevant here (and it is unclear whether R and Python support those by now).

I see that at least C++ supports sparse tensors, in case you need such support:
https://arrow.apache.org/docs/cpp/api/tensor.html#classarrow_1_1_sparse_tensor

EXPERIMENTAL: The index format type of SparseTensor.

[including, I had not heard of that variant] Compressed sparse fiber (CSF) format.

[I assume you would lose the “O(1) (constant-time) random access”, but get something close enough.]

Maybe you don’t need tables at all. If it’s just 1s and 0s, then a 2-dimensional BitArray would be the most compact type. It might be even smaller on disk if compressed in HDF5 format.

My understanding is Feather was a kind of preliminary format, with Arrow being the “final” and full featured format. So they’re related but Arrow is intended to supersede Feather?

Adding to that Parquet is kind of the “archival” format from the same suite, aimed at making smaller files with compression etc.

Thanks a lot for your reply. Still new to Julia finding my way around.

BitArray could work, as I don’t have missing (well technically I should if there is ambiguity in the phasing algorithm that was used as input), but I removed this as there are few and just took the highest probability haplotype. Not familiar with specialized bioinformatics formats but I’ll do some homework.

I had used Feather before, but looking at the docs, it seems it was deprecated in favor of Arrow, so I switched to that. I just ran it with Feather.jl and in R and found the same problem… Thank you for the links. I’ll keep trying some of the things that were suggested here.

If you have no errors without :auto it means that you are on very old DataFrames.jl and I recommend you update it.

:auto does not give you compression. You can turn on compression when writing the file to disk (though it will most likely lead to slower write times, in exchange for a smaller file).
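
For reference, a minimal sketch of writing with and without compression, using the `compress` keyword of `Arrow.write`. The data and file names are made up, and the actual size reduction depends on the data:

```julia
using Arrow, DataFrames

# Toy 0/1 table with Int8 columns (hypothetical stand-in for the real data).
df = DataFrame(rand(Int8[0, 1], 2_000, 50), :auto)

dir = mktempdir()
plain      = joinpath(dir, "plain.arrow")
compressed = joinpath(dir, "compressed.arrow")

Arrow.write(plain, df)                         # default: no compression
Arrow.write(compressed, df; compress = :zstd)  # :lz4 is the other built-in option

println(filesize(plain), " bytes uncompressed")
println(filesize(compressed), " bytes compressed")

# Reading works the same way for both files.
df_back = DataFrame(Arrow.Table(compressed))
```

Whether the R arrow package can read the compressed file depends on which codecs that installation was built with, so that is worth checking on the R side.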

I guess technically, Feather v1 was the preliminary format, and Feather v2 is the Arrow format… but basically I think you’re best off to use Arrow.jl to read/write.

https://arrow.apache.org/docs/python/feather.html

Thanks, I saw Parquet in a YT video, I can try that too I suppose while I’m testing other implementations.

Yeah, I think the tradeoff is that Parquet is good for archiving, but Arrow is going to read in super-fast. If you plan to do something like run your code hourly and want reading the file to take the blink of an eye, use Arrow. If you want the file to be 10MB and don’t mind it taking 5 seconds to read, look to Parquet.

hummm… I’m on Julia 1.7.3 on an HPC, just downloaded the packages last week… I’ll check version numbers here. Maybe I am behind.

Okay, I guess I’ll see the tradeoff of compression, I guess storage doesn’t matter that much… Just wanted binaries that were faster to read into R mostly.

Really dumb question, as I’m just coming back to Julia and am not a programmer: can you point me to the docs for converting my Matrix{Int64} to Int8? I tried convert(Int8, my_array) but no go… I found the page in the Julia docs, but it doesn’t seem to work. Also for converting to a BitArray, if you know.

Nevermind, I figured out the conversions here.

Parquet has compression even if you have compression turned off (it uses run-length encoding etc.).

And most importantly, NEITHER of Julia’s two Parquet implementations supports a vector of vectors (just an example; my point is that they support fewer types than the Parquet format itself does).

My suggestion is to go with Arrow.jl, although it’s not perfect either. For example, reading a compressed Arrow file means no mmap (i.e. if the data is much larger than RAM, you can’t read it). But overall it’s much better.

You can download Julia 1.8.2, put it in your home directory, and run it from there. There’s no need to wait for an administrator to change the system-wide version. You might have reasons not to do that, but just letting you know it’s trivial to update.

Just do Int8.(my_array).

In general you should not need to use convert at all in your code. It is a low-level function that is typically not needed directly (it is used internally by several parts of Base Julia). I have been writing Julia code every day for the last several years and have never (literally) needed to call convert explicitly (though I often call it implicitly through Base Julia functions). I recommend you use type constructors instead.
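
Putting the conversions from this thread together on a toy matrix (a sketch; the variable names are made up):

```julia
A = [1 0; 0 1]      # Matrix{Int64} on a 64-bit machine

A8 = Int8.(A)       # broadcast the Int8 constructor over every element
B  = A .== 1        # BitMatrix of true/false
B2 = BitArray(A)    # also a BitMatrix; errors if any entry is not 0 or 1

# convert expects the target type of the whole object, not of one element,
# which is why convert(Int8, A) fails while this works:
A8c = convert(Matrix{Int8}, A)
```

The last line is shown only to explain the earlier error; the broadcast constructor form is the one recommended above.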

To comment on parquet: unlike arrow the format requires a lot of copying thanks to a fairly crazy sparse-encoding scheme that it uses. Therefore, while there’s probably a lot that can be improved in Parquet2.jl, it will always be slower than arrow, and it will especially always require a hell of a lot more memory.

Thanks, probably good advice. I’ve not done my research on any of these, I just previously liked the feather idea and used it some.