Storing a big data file for fast access?

I have a somewhat big .txt data file (~70 MB, ~1M lines, ~30 columns) that I plan to study over the next few months. The data are mostly floats, but there are also strings and integers. I’ll probably work with more files like this in the future, which means I’ll most likely have to read it in more than a couple of times.

What is the recommended way of handling repeated reading of big files like this? I would guess it is reasonable to save it as something other than .txt that is easier for the computer to read, but I don’t know whether the implementation of reading .csv files is fast enough, or whether the format should be something more binary.

I have very briefly looked at JuliaDB, HDF5, and FortranFiles. At the moment, JuliaDB seems the most appropriate.

I am open to recommendations and tips. 🙂

That’s not very big—it easily fits in memory. Are you launching Julia over and over? Why not just leave a Jupyter/IJulia notebook open for your analyses with the data loaded in memory all of the time? If you only relaunch it once every few days or every few weeks, the load time ceases to matter.

(You can use Revise.jl to re-load modules you are editing without restarting Julia.)

I’m using Arrow.jl for larger files (20 GB) and it works very well. I especially like that there is compression built into the package. I compress with LZ4, which is reasonably fast.
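Something like this, as a rough sketch (the file name and toy table are just placeholders; compress = :lz4 is the keyword argument Arrow.write takes for this):

julia> using Arrow, DataFrames

julia> df = DataFrame(a = rand(10_000), b = rand(1:100, 10_000));  # placeholder table

julia> Arrow.write("data.arrow", df; compress = :lz4);  # write with LZ4 compression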

FWIW:

julia> using CSV, DataFrames  # DataFrames is needed for the DataFrame constructor below

julia> df = DataFrame(rand(1_000_000, 30), :auto);

julia> CSV.write("out.csv", df);

julia> using BenchmarkTools

julia> @btime CSV.read("out.csv", DataFrame);
  1.824 s (930 allocations: 254.16 MiB)

The CSV file created above is ~550 MB, so I would expect your file to be parsed more quickly. Also, this benchmark uses just one thread, so parsing might benefit from starting Julia with more threads if available. I’d think your file should be parsed in less than a second using CSV.jl on any reasonably modern machine.
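As a rough sketch of the multi-threaded route (the thread count is just an example; in recent CSV.jl versions the ntasks keyword controls how many concurrent parsing tasks are used, and as far as I know it defaults to the number of available threads):

$ julia --threads=4    # start Julia with 4 threads

julia> using CSV, DataFrames

julia> df = CSV.read("out.csv", DataFrame; ntasks = 4);  # parse with 4 concurrent tasks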

Also, CSV.jl should handle .txt files just fine (if they are formatted correctly), so there is no need to save the data as a .csv separately.
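For example, a minimal sketch assuming a tab-separated file called data.txt (adjust delim to whatever separator your file actually uses):

julia> using CSV, DataFrames

julia> df = CSV.read("data.txt", DataFrame; delim = '\t');  # column types (floats, ints, strings) are inferred automatically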

That said, for really large files I second the Arrow recommendation.

Arrow format doesn’t do compression, right?

Relation to other projects

What is the difference between Apache Arrow and Apache Parquet?

Parquet is not a “runtime in-memory format”; in general, file formats almost always have to be deserialized into some in-memory data structure for processing. We intend for Arrow to be that in-memory data structure.

Parquet is a storage format designed for maximum space efficiency, using advanced compression and encoding techniques. It is ideal when wanting to minimize disk usage while storing gigabytes of data, or perhaps more. This efficiency comes at the cost of relatively expensive reading into memory, as Parquet data cannot be directly operated on but must be decoded in large chunks.

Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed (or only lightly so, when using dictionary encoding) but laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed.

Therefore, Arrow and Parquet complement each other and are commonly used together in applications. Storing your data on disk using Parquet and reading it into memory in the Arrow format will allow you to make the most of your computing hardware.

As far as I understand, the compression in the Arrow.jl package happens separately from the Arrow format itself, using CodecLz4.jl (see here and here), but it is convenient since it’s just an argument to the Arrow.write function.
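Reading it back is symmetric; as far as I understand, decompression is handled for you (the file name follows the write sketch above):

julia> using Arrow, DataFrames

julia> table = Arrow.Table("data.arrow");  # open the Arrow file; compressed buffers are decompressed as needed

julia> df = DataFrame(table);  # wrap as a DataFrame (columns may be read-only Arrow vectors)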

You are certainly right; I learned something.
A couple of months ago I was juggling a dataset on my Mac that had trouble fitting in memory. Perhaps this would have helped. I thought I’d want compression, but I assumed it didn’t exist. I was using DataFrames.jl.

To clarify, I was thinking of keeping compressed column data in memory so that a larger amount of it can be held at once. Is the compression only for files on disk?