Failing to import (relatively) large CSV file with Julia and VSC

Hello everyone,
I am completely new to Julia, so my apologies if my question is trivial. I looked quickly for an answer and did not find one.

I need to import a large matrix (40 000 × 40 000, approx. 6 GB) into Julia, all rows and columns, without changing the order. The file is a CSV containing only numerical values, so nothing particularly complex. I have 32 GB of RAM, and I have already done this with Python, so I assume my hardware is not the issue, even though the import takes a couple of minutes in Python and we are at the limit: it fails if several other RAM-hungry applications are open.

I have installed VSC to run Julia (with Python, I mainly use Anaconda) and imported a small test CSV (11 MB) with Julia. No particular issue.

But when I try with my big matrix,

CSV.read(csv_path, DataFrame),

it runs for several hours (!!), and then VSC freezes.

Then I tried to import it directly as a matrix (not as a data frame) and in Float32 to see if this would help. I wrote the following function:

using CSV

function load_csv_to_matrix(file_path, num_rows, num_columns)
    data_matrix = Array{Float32}(undef, num_rows, num_columns)

    row_index = 1
    for row in CSV.Rows(file_path)
        col_index = 1
        for value in row
            # Check to ensure we don't exceed the column bounds of the matrix
            if col_index > num_columns
                break
            end
            data_matrix[row_index, col_index] = parse(Float32, value)
            col_index += 1
        end
        row_index += 1

        if row_index > num_rows
            break
        end
    end

    return data_matrix
end
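
For reference, I call it with the known dimensions (csv_path as above):

m = load_csv_to_matrix(csv_path, 40_000, 40_000)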

Here again, it works with my small CSV test, but not with my big matrix.

Last, maybe the issue comes from my use of VSC. So far, I have written my code in the editor area and then run it with "Julia: Execute code in REPL" (but I do not think this is the issue, as it works for my small CSV).

Thank you in advance for any help.

What is the output of:

versioninfo()

?

You should use CSV.File("file.csv") instead, which loads the file lazily.

The CSV.read function loads the entire file into RAM and converts it to a DataFrame.
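
For illustration, a minimal sketch of the three access patterns (the file name is a placeholder):

using CSV, DataFrames

df = CSV.read("file.csv", DataFrame)   # parses everything and builds a DataFrame in RAM

f = CSV.File("file.csv")               # parses into a CSV.File; rows can be iterated directly
for row in f
    # access fields via row.colname
end

for row in CSV.Rows("file.csv")        # CSV.Rows streams one row at a time, with minimal memory
    # process row
end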

Thank you for your reply.

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core™ i7-10610U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =

You could also try:

- updating to the latest Julia release (1.10),
- starting Julia with multiple threads,
- the package InMemoryDatasets.

1 Like

Thank you for your help. I have tried the first two points: I am now using Julia 1.10 with multiple threads.

Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core™ i7-10610U CPU @ 1.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =

I have also installed InMemoryDatasets:

using InMemoryDatasets

The import of a small CSV test file is clearly faster, but the import of my large CSV file still does not work (or at least I stopped it after 40 minutes while it was still running).

As a workaround, since my CSV files are only numerical matrices, maybe I could convert them into another format (.mat, .npz, etc.) that Julia would struggle with less?
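
For instance, a sketch with the NPZ.jl package (the package choice and file name are my assumption; m is the matrix once it is in memory):

using NPZ

npzwrite("matrix.npy", m)    # one-time conversion to a binary .npy file
m2 = npzread("matrix.npy")   # later loads are fast binary reads, no CSV parsing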

Well, this is definitely possible with Julia. Can you provide a script that creates such a 6 GB example file, or the file itself?

Of course, the file is publicly available. It comes from the Industrial Ecology Department at Sydney University:

20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv

In case the SharePoint link from my uni above creates an issue, the original file is:

Dropbox - GLORIA_MRIO_database
Folder latest_release → 059 → GLORIA_MRIO_Loop059_part_I_MRIOdatabase
→ GLORIA_MRIOs_59_2022.zip → 20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv

A binary file format like arrow is much more memory efficient…

The download started… Let's see how long it takes…

OK, downloaded and unpacked. The zip file contains:

-rw-rw----  1 ufechner ufechner 5,9G jan 10 12:21 '20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv'
-rw-rw----  1 ufechner ufechner 5,8G jan 10 12:32 '20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup002(full).csv'
-rw-rw----  1 ufechner ufechner 5,9G jan 10 12:44 '20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup003(full).csv'
-rw-rw----  1 ufechner ufechner 5,6G jan 10 12:55 '20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup004(full).csv'
-rw-rw----  1 ufechner ufechner 5,8G jan 10 13:06 '20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup005(full).csv'
-rw-rw----  1 ufechner ufechner  75M jan 10 13:07 '20240110_120secMother_AllCountries_002_V-Results_2022_059_Markup001(full).csv'
-rw-rw----  1 ufechner ufechner 211M jan 10 12:21 '20240110_120secMother_AllCountries_002_Y-Results_2022_059_Markup001(full).csv'
-rw-rw----  1 ufechner ufechner 204M jan 10 12:33 '20240110_120secMother_AllCountries_002_Y-Results_2022_059_Markup002(full).csv'
-rw-rw----  1 ufechner ufechner 220M jan 10 12:44 '20240110_120secMother_AllCountries_002_Y-Results_2022_059_Markup003(full).csv'
-rw-rw----  1 ufechner ufechner 129M jan 10 12:55 '20240110_120secMother_AllCountries_002_Y-Results_2022_059_Markup004(full).csv'
-rw-rw----  1 ufechner ufechner  92M jan 10 13:06 '20240110_120secMother_AllCountries_002_Y-Results_2022_059_Markup005(full).csv'

Shall I try reading the first file?

OK, as a first step I split the file into smaller files; on Linux you can use the command split to do this:

split 20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001\(full\).csv 

I then get 40 smaller files:

ufechner@ufryzen:~/repos/CSV_test$ ls x*
xaa  xac  xae  xag  xai  xak  xam  xao  xaq  xas  xau  xaw  xay  xba  xbc  xbe  xbg  xbi  xbk  xbm
xab  xad  xaf  xah  xaj  xal  xan  xap  xar  xat  xav  xax  xaz  xbb  xbd  xbf  xbh  xbj  xbl  xbn

For reading I use:

using CSV

FILENAME_FULL = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv"
FILENAME="xaa"

csv=CSV.File(FILENAME)
nothing

which is fast:

julia> @time include("read_csv.jl")
  1.230072 seconds (34.37 M allocations: 2.214 GiB, 16.53% gc time)

I tried, and it doesn't work. My code:

using CSV

FILENAME_FULL = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv"
FILENAME="xaa"

function read_file()
    csv = CSV.File(FILENAME; header=false)
end
csv = read_file()
csv[1] # return first row
# alternatively return nothing
nothing

I can load a 1.5 GB csv file, and loading it (returning nothing) gives:

julia> @time include("read_csv2.jl")
  9.206542 seconds (45.75 M allocations: 13.171 GiB, 5.18% gc time, 301.85% compilation time)

In other words, with 32 GB RAM the max file size that can be loaded in this way is about 3 GB; loading a 6 GB file fails.

I used the command:

split -l 10000 20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001\(full\).csv 

to split the original file into pieces of 10000 lines each.
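
For reading the pieces back from Julia, a sketch (assuming the xaa, xab, ... chunk files produced by split sit in the working directory and nothing else there starts with "x"):

using CSV, Tables

# read every chunk and stack them into one matrix; this still needs RAM
# for the full matrix, but each chunk is parsed separately
chunk_files = sort(filter(startswith("x"), readdir()))
parts = [Tables.matrix(CSV.File(f; header=false, types=Float32)) for f in chunk_files]
m = reduce(vcat, parts)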

I have this memory and …

julia> Sys.total_memory()/2^20
7948.6171875

julia> Sys.free_memory()/2^20
1322.1484375

… in a few minutes I created, saved and read this:

using CSV, DataFrames, Tables, Dates, Random


# m=rand(Float32,4*10^4,4*10^3);
# tm=Tables.table(m)  
# CSV.write("m10k_X_1k.csv", tm);
filesize("m10k_X_1k.csv")/1e6  # 1700.485444
t=CSV.File("m10k_X_1k.csv")

julia> DataFrame(t)
40000×4000 DataFrame
   Row │ Column1    Column2    Column3    Column4    Column5    Column6   ⋯
       │ Float64    Float64    Float64    Float64    Float64    Float64   ⋯
───────┼───────────────────────────────────────────────────────────────────
     1 │ 0.45914    0.323786   0.287918   0.973984   0.373564   0.94596   ⋯
     2 │ 0.148653   0.228002   0.775495   0.181961   0.205018   0.974561
     3 │ 0.545785   0.464947   0.786147   0.203469   0.0108073  0.23109
     4 │ 0.976365   0.633956   0.177804   0.567126   0.726814   0.539547
     5 │ 0.295424   0.140896   0.799063   0.0587873  0.92122    0.838071  ⋯
     6 │ 0.698935   0.925796   0.653495   0.885776   0.892522   0.710776
     7 │ 0.788658   0.389966   0.392589   0.106059   0.15693    0.0068168
     8 │ 0.469697   0.0133684  0.350079   0.161829   0.224559   0.571217
     9 │ 0.69308    0.0798342  0.683515   0.327015   0.0991643  0.926583  ⋯
    10 │ 0.686669   0.0751054  0.915328   0.532172   0.871903   0.572102
    11 │ 0.104822   0.150674   0.199238   0.251358   0.498402   0.975431
    12 │ 0.138714   0.911603   0.364109   0.187194   0.745571   0.678922
    13 │ 0.192071   0.892678   0.85928    0.513568   0.601975   0.140359  ⋯
    14 │ 0.268991   0.667317   0.0689645  0.364786   0.504987   0.757816
    15 │ 0.361199   0.568255   0.381775   0.305797   0.708287   0.658362
    16 │ 0.295467   0.547508   0.92817    0.408412   0.411363   0.707712
   ⋮   │     ⋮          ⋮          ⋮          ⋮          ⋮          ⋮     ⋱
 39986 │ 0.63584    0.637439   0.0449622  0.194153   0.159854   0.435054  ⋯
 39987 │ 0.0470939  0.123351   0.745204   0.0304491  0.426111   0.397158
 39988 │ 0.589865   0.793649   0.19325    0.221891   0.610391   0.810318
 39989 │ 0.795853   0.300518   0.243793   0.307483   0.34223    0.83377
 39990 │ 0.820287   0.440265   0.412261   0.309526   0.939981   0.50417   ⋯
 39991 │ 0.215501   0.526097   0.603948   0.155538   0.721428   0.346939
 39992 │ 0.473193   0.701665   0.963082   0.720776   0.641577   0.614307
 39993 │ 0.802456   0.277233   0.736046   0.535992   0.54705    0.173352
 39994 │ 0.274464   0.680145   0.580526   0.244665   0.0791177  0.780809  ⋯
 39995 │ 0.285998   0.0708128  0.447914   0.676254   0.423098   0.530598
 39996 │ 0.190717   0.0346309  0.775401   0.0556166  0.70802    0.670697
 39997 │ 0.796287   0.509074   0.748359   0.855386   0.354784   0.647198
 39998 │ 0.146545   0.507778   0.981678   0.0832134  0.373617   0.625176  ⋯
 39999 │ 0.158659   0.278268   0.444578   0.998274   0.897327   0.357139
 40000 │ 0.816053   0.587334   0.0366446  0.148506   0.65696    0.47753
                                        3995 columns and 39969 rows omitted
julia> begin
           println(now())
           t=CSV.File("m10k_X_1k.csv");
           println(now())
           DataFrame(t);
           println(now())
       end
2024-03-31T19:06:28.566
2024-03-31T19:06:41.880
2024-03-31T19:06:52.558

That is strange. We use CSV.File all the time with very large files.

The following code seems to work, but only with single-threaded Julia:

using CSV

FILENAME_FULL = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv"
CHUNK_SIZE = 1000

function read_all()
    lines = CHUNK_SIZE
    total = 0
    while lines >= CHUNK_SIZE
        chunk=CSV.File(FILENAME_FULL; header=false, skipto=total, limit=CHUNK_SIZE)
        lines = length(chunk.Column1)
        total += lines
        println("Was reading: $total  lines, $(Sys.free_memory()/1e9) GB free")
    end
end
read_all()
nothing

This also works, including multithreaded:

using CSV

FILENAME_FULL = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv"

function read_file()
    csv = CSV.File(FILENAME_FULL; header=false, types=Float32)
end
csv = read_file()
csv[1] # return first row
nothing

Running it:

julia> @time include("read_csv.jl")
 20.251769 seconds (47.93 M allocations: 17.421 GiB, 2.23% gc time, 94.94% compilation time)

But this is still not an efficient use of the available memory: needing 17 GB of RAM to read a 6 GB csv file is wasteful…
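
A more memory-frugal pattern (a sketch I have not timed on the 6 GB file; the dimensions are assumed from the data) would stream the rows into a preallocated matrix with CSV.Rows:

using CSV

# stream the csv row by row into a preallocated Float32 matrix,
# avoiding the intermediate CSV.File object
function stream_to_matrix(path, nrows, ncols)
    m = Matrix{Float32}(undef, nrows, ncols)
    for (i, row) in enumerate(CSV.Rows(path; header=false, types=Float32))
        for j in 1:ncols
            m[i, j] = row[j]
        end
    end
    return m
end

m = stream_to_matrix(FILENAME_FULL, 39360, 39360)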

EDIT:
This also works:

using CSV

FILENAME_FULL = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv"

csv = nothing
GC.gc(true)
function read_file()
    global csv
    csv = nothing
    CSV.File(FILENAME_FULL; header=false, types=Float64)
end
csv = read_file()
csv[1]      # return first row
csv.Column1 # return first column
println("Size of csv variable: $(Base.summarysize(csv)/1e9) GB")
nothing

But the memory usage reached 32 GB, which was only possible because I have 32 GB RAM plus an 8 GB swap file… Really at the edge.

1 Like

My final suggestion:

First step: Convert the .csv file to .arrow format using the convert.jl script:

using CSV, Arrow

FILENAME_FULL = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv"
OUT_FILE = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).arrow"

Arrow.write(OUT_FILE, CSV.File(FILENAME_FULL; header=false, types=Float32))
nothing

Second step: Read the .arrow file and convert it to an array (if that is what you need):

using Arrow, Tables

IN_FILE = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).arrow"

m = nothing
GC.gc(true)
m = Tables.matrix(Arrow.Table(IN_FILE))
println("Size of matrix variable: $(Base.summarysize(m)/1e9) GB")
nothing

The first script needs 45 seconds on my PC (Ryzen 7950X), the second script 4.5 s.

@rocco_sprmnt21 used a matrix of 40000x4000 elements, we have:

julia> m
39360×39360 Matrix{Float32}

which is ten times as large…

3 Likes

Can anyone explain why, in the accepted solution, Arrow.jl parses the large CSV file much faster than CSV.jl, writes the data to disk and then reads it, with the whole process being much faster than using CSV.jl alone?

Arrow.jl does not parse a .csv file, it reads an .arrow file. In the first step the .csv file was converted into an .arrow file.

Arrow files are binary files, .csv files are text files, therefore reading an arrow file is much faster.

The whole process is not faster, but if you need to read the file more than once, then converting it to arrow format first saves time.
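
To make the difference concrete (to my understanding, Arrow.jl memory-maps the file, so "reading" mostly sets up views instead of parsing text; the file name is a placeholder):

using Arrow

tbl = Arrow.Table("data.arrow")  # memory-maps the file; returns almost immediately
col = tbl.Column1                # columns are views into the mapped data, no text parsing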

1 Like

And that does not involve parsing?

On my Windows laptop, CSV.jl took 4 minutes to read the 30 GB csv file into a dataframe.

If you want to read a large .csv file that contains only floating point values just once, and convert it into an array, you can also do:

using CSV, Tables

FILENAME_FULL = "20240110_120secMother_AllCountries_002_T-Results_2022_059_Markup001(full).csv"

m = Tables.matrix(CSV.File(FILENAME_FULL; header=false, types=Float32))
println("Size of matrix variable: $(Base.summarysize(m)/1e9) GB")
nothing

which takes 26 s on my PC (16 fast cores, 32 GB RAM, fast SSD, Linux) for a .csv file of 5.9 GB size.

The key point that reduces the memory usage and increases the speed is to pass the parameter types=Float32 to the CSV.File constructor. Float64 would also work, but if you do not pass this parameter, then CSV.jl has to figure out the correct type(s) itself, which seems to be costly.

Finally, I am not reading the file into a DataFrame, but converting it into an array, which is cheaper.

UPDATE:
Converting to .arrow format also works for me with Float64; reading the .csv file and converting it to an Array directly with Float64 fails for me with an out-of-memory error…

1 Like