Storing a big data file for fast access?

I have a somewhat big .txt data file (~70 MB, ~1M lines, ~30 columns) that I plan to study over the next few months. The data are mostly floats, but there are also strings and integers. I’ll probably work with more files like this in the future, which means I’ll most likely have to read it in more than a couple of times.

What is the recommended way of handling repeated reading of big files like this? I would guess it is reasonable to save it as something other than .txt that is easier for the computer to read, but I don’t know whether the implementation of reading .csv files is fast enough, or whether the format should be something more binary.

I have very briefly looked at JuliaDB, HDF5, and FortranFiles. At the moment, JuliaDB seems the most appropriate.

I am open to recommendations and tips. 🙂

That’s not very big—it easily fits in memory. Are you launching Julia over and over? Why not just leave a Jupyter/IJulia notebook open for your analyses with the data loaded in memory all of the time? If you only relaunch it once every few days or every few weeks, the load time ceases to matter.

(You can use Revise.jl to re-load modules you are editing without restarting Julia.)

I’m using Arrow.jl for larger files (20 GB) and it works very well. I especially like that there is compression built into the package. I compress with LZ4, which is reasonably fast.
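Something like this, as a rough sketch (the file name and toy table are just placeholders; compress = :lz4 is the keyword argument Arrow.write takes for this):

julia> using Arrow, DataFrames

julia> df = DataFrame(a = rand(10_000), b = rand(1:100, 10_000));  # placeholder table

julia> Arrow.write("data.arrow", df; compress = :lz4);  # write with LZ4 compression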

FWIW:

julia> using CSV, DataFrames  # DataFrames is needed for the DataFrame constructor below

julia> df = DataFrame(rand(1_000_000, 30), :auto);

julia> CSV.write("out.csv", df);

julia> using BenchmarkTools

julia> @btime CSV.read("out.csv", DataFrame);
  1.824 s (930 allocations: 254.16 MiB)

The CSV file created above is ~550 MB, so I would expect your file to be parsed more quickly. Also, this benchmark uses just one thread, so parsing might benefit from starting Julia with more threads if available. I’d think your file should be parsed in less than a second using CSV.jl on any reasonably modern machine.
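As a rough sketch of the multi-threaded route (the thread count is just an example; in recent CSV.jl versions the ntasks keyword controls how many concurrent parsing tasks are used, and as far as I know it defaults to the number of available threads):

$ julia --threads=4    # start Julia with 4 threads

julia> using CSV, DataFrames

julia> df = CSV.read("out.csv", DataFrame; ntasks = 4);  # parse with 4 concurrent tasks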

Also, CSV.jl should handle .txt files just fine (if they are formatted correctly), so there is no need to save the data as a .csv separately.
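For example, a minimal sketch assuming a tab-separated file called data.txt (adjust delim to whatever separator your file actually uses):

julia> using CSV, DataFrames

julia> df = CSV.read("data.txt", DataFrame; delim = '\t');  # column types (floats, ints, strings) are inferred automatically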

That said, for really large files I second the Arrow recommendation.

Arrow format doesn’t do compression, right?

Relation to other projects

What is the difference between Apache Arrow and Apache Parquet?

Parquet is not a “runtime in-memory format”; in general, file formats almost always have to be deserialized into some in-memory data structure for processing. We intend for Arrow to be that in-memory data structure.

Parquet is a storage format designed for maximum space efficiency, using advanced compression and encoding techniques. It is ideal when wanting to minimize disk usage while storing gigabytes of data, or perhaps more. This efficiency comes at the cost of relatively expensive reading into memory, as Parquet data cannot be directly operated on but must be decoded in large chunks.

Conversely, Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is not compressed (or only lightly so, when using dictionary encoding) but laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed.

Therefore, Arrow and Parquet complement each other and are commonly used together in applications. Storing your data on disk using Parquet and reading it into memory in the Arrow format will allow you to make the most of your computing hardware.

As far as I understand, the compression in the Arrow.jl package happens separately from the Arrow format itself, using CodecLz4.jl (see here and here), but it is convenient since it’s just an argument to the Arrow.write function.
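Reading it back is symmetric; as far as I understand, decompression is handled for you (the file name follows the write sketch above):

julia> using Arrow, DataFrames

julia> table = Arrow.Table("data.arrow");  # open the Arrow file; compressed buffers are decompressed as needed

julia> df = DataFrame(table);  # wrap as a DataFrame (columns may be read-only Arrow vectors)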

You are certainly right; I learned something.
A couple of months ago I was juggling a dataset on my Mac that had trouble fitting in memory. Perhaps this would have helped. I thought I’d want compression, but I assumed it didn’t exist. I was using DataFrames.jl.

To clarify, I was thinking of keeping compressed column data in memory so that a larger amount of it can be held at once. Is the compression only for files on disk?