General Arrow questions

I am considering switching to Arrow to improve my data loading times. Here is my setup: just over 1,000 files, each representing a single date, currently stored in SAS .sas7bdat format, with sizes ranging from about 100 MB to 3 GB. I think my ideal end result would be a single Arrow file that is quickly indexable by date. The Arrow.jl docs and the official Arrow docs are pretty heavy on implementation specifics and not so forthcoming with practical suggestions, so I'm left with some questions.

  1. Is this a case where I should use a “partitioned” file? I have done a small test case with 10 days of cut-down data, with and without partitioning (a sketch of the two write paths follows this list). The written file is identical in size either way (about 260MB), and the read times of the Arrow files also seemed identical. As a further test, I appended all 10 DataFrames together to create a single large one, then wrote it out. Again, nothing seemed any different, but all methods were much faster than reading the SAS files!
  2. Is it better for me to write using the streaming IPC format or the file format? Again, I haven’t noticed a difference in my test cases, and I’m still not certain I fully understand the implications of this choice.
  3. Does Arrow.write require the full file to be in memory before it is written? My tests seemed to indicate this was the case. This is obviously a problem for me, as the entire dataset is around 1TB.
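For reference, here is roughly what the two write paths looked like (daily_tables stands for the 10 cut-down daily DataFrames; file names are made up):

using Arrow, DataFrames, Tables

# Partitioned write: each element of the iterable becomes its own record batch.
Arrow.write("partitioned.arrow", Tables.partitioner(daily_tables))

# Non-partitioned write: append everything into one DataFrame first, then write once.
Arrow.write("single.arrow", reduce(vcat, daily_tables))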

This doesn’t answer your question, but why have it all in one file? Maybe the benefits you want can be found another way.

I mean, I haven't had it all in one file so far. It has worked, but honestly it is a bit cumbersome. I'm more often pulling subsets of the data from the whole time series while conditioning on other columns.

Hi,

From the description it sounds like you want to query your data as if it were one single table (Arrow file), indexed by date across the time series? You could achieve that with a partitioner, as you mention:

csv_parts = Tables.partitioner(CSV.File, csv_files)
Arrow.write(io, csv_parts)

but I’d imagine the filtering to be quite painful after the fact.

Is there no way that you can do something like:


using Arrow, Dates

struct arrowFile
    dateStart::Date
    dateEnd::Date
    fileName::String
    arrowFileLocation::String
end

arrowFileDateRanges = Vector{arrowFile}()

for file in Files    # Files = the vector of .sas7bdat paths

    dateStart, dateEnd = extractDateRange(file)               # your existing date-range helper
    arrowLocation = replace(file, ".sas7bdat" => ".arrow")
    aFile = arrowFile(dateStart, dateEnd, file, arrowLocation)
    push!(arrowFileDateRanges, aFile)
    tbl = loadSASTable(file)    # placeholder: read the .sas7bdat with whatever reader you already use
    Arrow.write(arrowLocation, tbl)

end

function QueryArrowFilesByDate(date)

    arrowFileLocations = Vector{String}()
    for aFile in arrowFileDateRanges

        if aFile.dateStart <= date <= aFile.dateEnd
            push!(arrowFileLocations, aFile.arrowFileLocation)
        end

    end

    return arrowFileLocations
end


function applyQueriesToFilesInDateRange(Query, date)

    for fileLocation in QueryArrowFilesByDate(date)
        Query(fileLocation)
    end

end

since the original need is to minimise the data loading time?

Note that you can also enable compression by passing the compress keyword argument to the Arrow.write call.
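For example, with tbl being any Tables.jl-compatible table:

using Arrow

Arrow.write("day.arrow", tbl; compress = :zstd)   # :lz4 is also supported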

Regards,

I’m not sure; it seems like the overall cost might favor the single huge file approach. The file can be read from disk and filtered before being materialized, so if the filters discard enough data (and I select only the relevant columns), then I thought this might actually be the faster approach.

For additional context, there are roughly 400 columns, although I tend to ignore (i.e. exclude) upwards of 250 of them.
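To illustrate the kind of pruning I mean, roughly (the file and column names are made up):

using Arrow, DataFrames, Dates, TableOperations

# Memory-maps the file; nothing is fully materialized yet.
tbl = Arrow.Table("alldata.arrow")

# Keep only the needed columns, filter the rows, then materialize the survivors.
cols = TableOperations.select(tbl, :date, :price)
rows = TableOperations.filter(r -> r.date == Date(2020, 1, 15), cols)
df   = DataFrame(rows)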

Hi,

Note that you can treat a vector of filepaths as a Table itself as per the docs:

Arrow.Table(inputs::Vector; convert::Bool=true)

A Vector of any of the above, in which each input should be an IPC or arrow file and must match schema

which would allow you to create and act on the filtered subset (I didn't actually spot that you originally said each file represents a single date).

The alternative, turning the entire dataset into a single file, would be via append. But, to your original question, you would be writing directly into an IO stream, so I would assume it will fill your RAM, then spill to disk, and be really quite slow to create. Applying an index after the fact would then be quite straightforward, since you could just establish the start and end rows for each date and query it in a similar way to my original post.
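As a rough sketch of that kind of index (assuming the big file has a date column, hypothetically named :date, sorted by date and without missings; builddateindex is a made-up helper):

using Arrow, Dates

# Build a date => row-range index in one pass over the (sorted) date column.
function builddateindex(dates)
    index = Dict{eltype(dates),UnitRange{Int}}()
    isempty(dates) && return index
    start = 1
    for i in 2:length(dates)
        if dates[i] != dates[start]
            index[dates[start]] = start:(i - 1)
            start = i
        end
    end
    index[dates[start]] = start:length(dates)
    return index
end

tbl = Arrow.Table("alldata.arrow")     # the hypothetical single big file
dateindex = builddateindex(tbl.date)
# dateindex[Date(2020, 1, 15)] then gives the row range for that day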

Regards,

Alright that was actually a nice find. I will convert each daily file to an Arrow file, and then I can pass the list of all of them to Arrow.Table, which I can then filter/subset using DataFrames or TableOperations…
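Roughly along these lines (the directory and column names are made up):

using Arrow, DataFrames, Dates

# All the per-day Arrow files (matching schemas), converted from the .sas7bdat files.
arrow_files = filter(endswith(".arrow"), readdir("arrow_data"; join = true))

# One logical table across every file, wrapped without copying, then subset as usual.
tbl = Arrow.Table(arrow_files)
df  = DataFrame(tbl; copycols = false)
day = df[df.date .== Date(2020, 1, 15), [:date, :price]]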

I’m going to leave this open in hopes of generating some answers to the other questions or an even better approach.

Hi,

No longer at a computer (so difficult to provide tested snippets), but applying a DataFrames + TableOperations approach to all of the files every time has to bring some redundancy, which can be mitigated by pre-filtering the vector of file locations that you supply to Arrow.Table. I do know that DataFrames reads lazily, but you'll still be iterating through rows that definitely won't meet your criteria.
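Roughly, untested, reusing QueryArrowFilesByDate from my earlier post (column names still hypothetical):

using Arrow, DataFrames, Dates

# Narrow the file list down first, then hand only those paths to Arrow.Table.
wanted = QueryArrowFilesByDate(Date(2020, 1, 15))
tbl = Arrow.Table(wanted)
df  = DataFrame(tbl; copycols = false)[:, [:date, :price]]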

Also, as an aside: given how well partitioned your data is, it would appear to really lend itself to Spark (I can't recall whether Arrow is a viable file format there, but Parquet is), which you can combine with Spark.jl to fully distribute your queries and collect after the fact.

Regards,