Guidance on profiling / benchmarking I/O operations on a sparse dataset

I have a specific data reading problem that I’m trying to improve. I’m not sure if the details are important or not, but here’s the gist:

  • I have some number (high hundreds to low thousands) of individual CSV files. Each file has 2 columns, call them feature::String and value::Float, and represents one observation
  • In the end, I need a single table, probably sparse, containing all of the values for each feature/observation pair. The orientation doesn’t matter all that much; I routinely need to access it both feature-wise and observation-wise.
  • There are on the order of 5–50 million different features across the dataset, but each observation tends to have values for at most hundreds of thousands of them. All other features can be said to have a value of 0, but they are not represented in the rows of a CSV if they aren’t in that observation.
  • I am far more often loading the table in a new Julia session than I am adding additional data, and I’m not really limited by disk space, so anything that massages the data into a form that is easier to load would be fine, even if it takes a while.

Here’s a way to generate some toy data that has the form I’m concerned with:

using Random
using DataFrames
using CSV
using SparseArrays

"""
    nfeatures: total number of possible features
    nobservations: total number of observations
    nfeatperobs: Number of features in each observation
"""
function generate_testdata(nfeatures, nobservations, nfeatperobs, datadir=normpath(joinpath(@__DIR__, "..", "data")); force=false)
    if isdir(datadir)
        force ? rm(datadir, recursive=true) : error("$datadir exists, use `force=true` to overwrite")
    end
    mkdir(datadir)

    allfeatures = [randstring() for _ in 1:nfeatures]

    for o in 1:nobservations
        fpick = unique(rand(1:nfeatures, nfeatperobs))
        df = DataFrame(feature=allfeatures[fpick], value=rand(length(fpick)))
        CSV.write(joinpath(datadir, "obs$o.csv"), df)
    end
end

generate_testdata(10_000, 100, 500)

My first naive implementation was to basically read all of the CSVs into dictionaries, then fill a sparse array:

function join_from_csv1(datadir=normpath(joinpath(@__DIR__, "..", "data")))
    obs = map(f-> replace(f, ".csv"=>""), readdir(datadir))

    obsdict = Dict{String, Dict{String,Float64}}()
    features = Set()
    for ob in obs
        df = CSV.File(joinpath(datadir, "$ob.csv")) |> DataFrame
        obsdict[ob] = Dict(row.feature=> row.value for row in eachrow(df))
        union!(features, df.feature)
    end
    features=sort(collect(features))
    sa = spzeros(length(features),length(obs))
    featuredict = Dict(f=>i for (i,f) in enumerate(features))
    for (i, ob) in enumerate(obs)
        for f in keys(obsdict[ob])
            sa[featuredict[f], i] = obsdict[ob][f]
        end
    end
    return sa, features, obs
end
julia> @benchmark join_from_csv1()
BenchmarkTools.Trial:
  memory estimate:  20.98 MiB
  allocs estimate:  370977
  --------------
  minimum time:     134.594 ms (0.00% GC)
  median time:      150.562 ms (0.00% GC)
  mean time:        153.927 ms (4.30% GC)
  maximum time:     169.961 ms (9.90% GC)
  --------------
  samples:          33
  evals/sample:     1

This implementation takes ~35 min with my actual dataset. I’m sure there are a bunch of ways to improve it, but I have no experience with profiling and optimization, so this seemed like a good opportunity to learn those skills. I’m of course happy to take any suggestions for fixing the obvious problems, but I also have a couple of questions:

  1. Is the smaller sample data sufficient for profiling / benchmarking? I’m assuming that steps I take to optimize on this small dataset will scale up, but is there any reason to suspect that’s not true?
  2. Is using the built-in profiler / ProfileView the right way to go? I had to run the function like 1000 times to get any substantial reading, but I’m probably doing something wrong there.
  3. Anything else I should be considering, or good resources to look at?
  4. Assuming no observation/feature pairs overlap, are there any additional considerations for making this parallelizable? I’m doing this primarily on a system with 8 cores, though I’m not sure to what extent I’m bound by reading off the HDDs.

Several things:
Focus on a single core as much as possible before going multi-core. Everything is harder in multi-core, including optimization.
Use small examples (0.5 sec to 1 minute). That will make testing iterative changes non-painful.

For a first pass you want to figure out where the bulk of your time is spent, then you can focus on that part of the code. At this point you just need rough numbers, so I would run something like this:

using Dates

function join_from_csv1(datadir=normpath(joinpath(@__DIR__, "..", "data")))
    loadStart = now()
    obs = map(f-> replace(f, ".csv"=>""), readdir(datadir))
    @info "Load Time: $(now() - loadStart)"

    obsStart = now()
    obsdict = Dict{String, Dict{String,Float64}}()
    features = Set()
    for ob in obs
        df = CSV.File(joinpath(datadir, "$ob.csv")) |> DataFrame
        obsdict[ob] = Dict(row.feature=> row.value for row in eachrow(df))
        union!(features, df.feature)
    end
    @info "Obs Time: $(now() - obsStart)"

    sortStart = now()
    features=sort(collect(features))
    @info "Sort Time: $(now() - sortStart)"

    featureStart = now()
    sa = spzeros(length(features),length(obs))
    featuredict = Dict(f=>i for (i,f) in enumerate(features))
    @info "Feature Time: $(now() - featureStart)"

    saStart = now()
    for (i, ob) in enumerate(obs)
        for f in keys(obsdict[ob])
            sa[featuredict[f], i] = obsdict[ob][f]
        end
    end
    @info "SA Time: $(now() - saStart)"

    return sa, features, obs
end

That will give you general times for each chunk of processing. If 90% of your time is spent in 1 or 2 of these areas, those are the areas to improve. The times will all be in milliseconds so you might want to do Second(now() - start) or even Minute(now() - start).
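One caveat: Second(now() - start) only converts cleanly when the elapsed time is a whole number of seconds (otherwise it throws an InexactError). Something like Dates.canonicalize gives a readable breakdown either way; a minimal sketch:

using Dates

elapsed = Millisecond(90_567)                      # stand-in for now() - loadStart
Dates.canonicalize(Dates.CompoundPeriod(elapsed))  # 1 minute, 30 seconds, 567 milliseconds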

Once you have an idea of what area to start with, you/we can look at getting better information about what is happening there.

1 Like

I’m a fan of the Julia profiler. I haven’t used ProfileView for a while (I think it’s gotten some big upgrades recently). I really like Juno’s support for exploring profiling data. I agree with the general comments in the last two posts.

One specific comment: It looks like you’re initialising an empty sparse matrix with spzeros then filling it one entry at a time with setindex!. Building sparse matrices this way is very inefficient in general. Your code is constantly reallocating to expand the internal storage arrays used to store the sparse matrix entries and non-zero locations. A simple alternative is to use the sparse function to generate a sparse matrix from three vectors containing the non-zero values themselves and the indices of their positions in the array. From the sparse docstring:

sparse(I, J, V,[ m, n, combine])

  Create a sparse matrix S of dimensions m x n such that S[I[k], J[k]] = V[k]. The combine function is used to
  combine duplicates. If m and n are not specified, they are set to maximum(I) and maximum(J) respectively. If the
  combine function is not supplied, combine defaults to + unless the elements of V are Booleans in which case
  combine defaults to |. All elements of I must satisfy 1 <= I[k] <= m, and all elements of J must satisfy 1 <=
  J[k] <= n. Numerical zeros in (I, J, V) are retained as structural nonzeros; to drop numerical zeros, use
  dropzeros!.

If you don’t know the lengths of I, J, and V in advance you can at least guess and preallocate arrays you think will be big enough. Having your nested loop fill up these arrays and then calling sparse at the end should be a lot faster than your current code.
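For concreteness, a rough sketch of how that could look applied to your function (the name join_from_csv_coo and the name-to-index pass at the end are just illustrative, not tested against your data):

using CSV, DataFrames, SparseArrays

function join_from_csv_coo(datadir=normpath(joinpath(@__DIR__, "..", "data")))
    obs = map(f -> replace(f, ".csv" => ""), readdir(datadir))

    featnames = String[]   # feature name of each nonzero (row identity, as a string)
    J = Int[]              # column (observation) index of each nonzero
    V = Float64[]          # the value itself
    for (j, ob) in enumerate(obs)
        df = CSV.File(joinpath(datadir, "$ob.csv")) |> DataFrame
        append!(featnames, df.feature)
        append!(J, fill(j, nrow(df)))
        append!(V, df.value)
    end

    features = sort(unique(featnames))
    featuredict = Dict(f => i for (i, f) in enumerate(features))
    I = [featuredict[f] for f in featnames]   # translate names to row indices

    sa = sparse(I, J, V, length(features), length(obs))
    return sa, features, obs
end

The row indices can’t be assigned until all feature names have been seen, which is why the translation to I happens after the read loop rather than inside it.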

2 Likes

I would do just that, maybe as a ragged array:

  1. save the values in a flat vector contiguous by (feature,observation), which you can then mmap,
  2. save the UnitRange of indexes for each (feature,observation) in a Dict as either CSV (my preference), JSON, or HDF5.

Process the data once, and after that it will be very fast (if I understood the specs correctly).

I did something similar for a few TB of data.

1 Like

Good idea. I sort of assumed the bulk of the time was in the “obs” section. Turns out I was right, but that’s definitely good to know:

julia> TableLoadTest.join_from_csv1();
[ Info: Load Time: 0 milliseconds
[ Info: Obs Time: 122 milliseconds
[ Info: Sort Time: 10 milliseconds
[ Info: Feature Time: 4 milliseconds
[ Info: SA Time: 36 milliseconds

julia> TableLoadTest.join_from_csv1();
[ Info: Load Time: 1 millisecond
[ Info: Obs Time: 129 milliseconds
[ Info: Sort Time: 8 milliseconds
[ Info: Feature Time: 9 milliseconds
[ Info: SA Time: 49 milliseconds

I’m surprised how much time is spent on the SparseArray part, probably because of the issue you describe with filling the sparse matrix one entry at a time.

It makes sense now that you say it, but I had in my mind the equivalent dense matrix, assuming assigning values would be fast since the memory was already allocated (I see now why this makes no sense in the sparse case). I’m not quite understanding how to use the sparse function the way you describe, but even allocating a dense array, filling it, and then converting to sparse seems to take less time:

julia> TableLoadTest.join_from_csv2();
[ Info: Load Time: 0 milliseconds
[ Info: Obs Time: 134 milliseconds
[ Info: Sort Time: 8 milliseconds
[ Info: Feature Time: 48 milliseconds
[ Info: Array Time: 19 milliseconds
[ Info: Sparse Time: 4 milliseconds

julia> TableLoadTest.join_from_csv2();
[ Info: Load Time: 1 millisecond
[ Info: Obs Time: 102 milliseconds
[ Info: Sort Time: 11 milliseconds
[ Info: Feature Time: 37 milliseconds
[ Info: Array Time: 22 milliseconds
[ Info: Sparse Time: 2 milliseconds

Is the sparse() method you mention similar to what @Tamas_Papp is describing? I take it V is a vector of values, and I and J are vectors of the same length, where I holds the row indices and J the column indices?

I have some clarifying questions, because this vocabulary is new to me.

I’m reading this to mean I should have an Array{Float64,1} where all the values for one observation are next to each other, followed by the values for the next observation, etc. I know mmap means memory map, but that’s the limit of my knowledge. Is there a built-in way to do this?

This seems similar to the things needed for the sparse constructor. Can I just save all three things (I,J,V) in the same CSV? Would I still want to mmap that? Can I use something like a feather file or SQLite database to get some of this for free?

I find it’s always better to be sure. I don’t want to think about how much time I’ve wasted because where I thought the problem was, wasn’t.

Anyway, what I would suggest is getting timings for the 3 statements inside the obs loop. You might want to sum the times so you get the total time loading the CSV files into DataFrames, the total time creating the dictionaries, and the total time performing the unions.

For the union you might also want to print out the individual times, just to see if they are increasing exponentially as the loop progresses…
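For illustration, a rough sketch of that accumulation (the time_obs_loop name is made up; the loop body mirrors join_from_csv1):

using Dates, CSV, DataFrames

function time_obs_loop(datadir=normpath(joinpath(@__DIR__, "..", "data")))
    obs = map(f -> replace(f, ".csv" => ""), readdir(datadir))
    obsdict = Dict{String,Dict{String,Float64}}()
    features = Set{String}()
    csvtime = dicttime = uniontime = Millisecond(0)
    for ob in obs
        t = now()
        df = CSV.File(joinpath(datadir, "$ob.csv")) |> DataFrame
        csvtime += now() - t

        t = now()
        obsdict[ob] = Dict(row.feature => row.value for row in eachrow(df))
        dicttime += now() - t

        t = now()
        union!(features, df.feature)
        uniontime += now() - t
    end
    @info "CSV read: $csvtime  Dict build: $dicttime  union!: $uniontime"
end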

1 Like

You may also consider the TimerOutputs package. It has a rather convenient way of gathering and showing timing data. And the base Profile module is a very efficient tool for quickly finding bottlenecks; I usually prefer to run it with Profile.print(format=:flat, sortedby=:count).
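A minimal sketch of both approaches, assuming the same toy data layout as above (the timed_pass helper and the labels are just illustrative):

using TimerOutputs, Profile, CSV, DataFrames

const to = TimerOutput()

# Wrap whichever expressions you care about in @timeit; show(to) then prints a
# table of total time and allocations per label.
function timed_pass(datadir)
    for f in readdir(datadir)
        df = @timeit to "CSV read" DataFrame(CSV.File(joinpath(datadir, f)))
        @timeit to "Dict build" Dict(row.feature => row.value for row in eachrow(df))
    end
end

timed_pass(normpath(joinpath(@__DIR__, "..", "data")))
show(to)

# The built-in profiler, printed flat and sorted by sample count:
@profile timed_pass(normpath(joinpath(@__DIR__, "..", "data")))
Profile.print(format=:flat, sortedby=:count)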

1 Like

I think @Tamas_Papp’s suggestion and mine are closely related but aimed at accomplishing different things. My comment was just about how to construct a sparse matrix. It assumes that you have a reason to have your data loaded in memory as a sparse matrix for some further analysis. If I understood correctly, @Tamas_Papp’s suggestion was about reformatting your data and saving it to disk in a format that would be efficient to load from disk in the future.

His suggested storage format sounds to me like compressed sparse column (or row) format. See: Sparse matrix - Wikipedia. I was suggesting forming your matrix in what’s called coordinate list format, storing the entries as a set of I,J,V triples. The sparse function then takes that format as input and converts it to the more efficient compressed sparse column format, used internally by the SparseMatrixCSC type. The coordinate list format is less efficient than CSC but it’s also simpler.

If that’s all a bit abstract, here’s an example. Let’s say you want to represent your table as a matrix, where the rows represent features and the columns observations. I’m assuming here that there can only be one value for each (feature, observation) pair. Let’s say we have the following data, represented as (feature, observation, value) triples, and that there are a total of 3 features and 3 observations:

(1,1,2.5)
(3,1,0.8)
(1,2,1.1)
(3,3,2.2)

As a matrix that looks like:

 2.5  1.1  0.0
 0.0  0.0  0.0
 0.8  0.0  2.2

In coordinate list format (I,J,V) that looks like:

julia> using SparseArrays

julia> I = [1,3,1,3]

julia> J = [1,1,2,3]

julia> V = [2.5,0.8,1.1,2.2]

Then you can create a sparse matrix like so:

julia> A = sparse(I,J,V,3,3)
3×3 SparseMatrixCSC{Float64,Int64} with 4 stored entries:
  [1, 1]  =  2.5
  [3, 1]  =  0.8
  [1, 2]  =  1.1
  [3, 3]  =  2.2

julia> Matrix(A)
3×3 Array{Float64,2}:
 2.5  1.1  0.0
 0.0  0.0  0.0
 0.8  0.0  2.2

If you want to explore the compressed sparse column representation of that matrix you can look at the fields of the struct storing the sparse matrix. Note that for this tiny example the CSC format isn’t actually any more efficient than the coordinate list format but for most real world sparse matrices it will be.
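For instance, with the A from above (the comments show the values these fields hold for that example):

using SparseArrays

A = sparse([1,3,1,3], [1,1,2,3], [2.5,0.8,1.1,2.2], 3, 3)

A.colptr   # [1, 3, 4, 5]: column j's entries sit at positions colptr[j]:colptr[j+1]-1
A.rowval   # [1, 3, 1, 3]: the row index of each stored entry
A.nzval    # [2.5, 0.8, 1.1, 2.2]: the stored values, in column-major order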

As for saving to disk for future use, you could certainly just save the I,J,V vectors to disk in a three-column CSV file, no mmapping involved. It’s not optimal but it’ll be a lot faster to load than what you’ve got stored on disk now. You could also save the sparse array in compressed sparse column format. I think what @Tamas_Papp is suggesting is effectively saving the Vector{Float64} of values to disk in compressed sparse column order in a memory-mapped file, then saving the index information in a separate text file, e.g. in CSV or JSON format. The Mmap stdlib has tools for doing the memory mapping if you want to go that route.
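For example, a minimal sketch of the three-column CSV round trip, using the toy values from above (the file name is made up):

using CSV, DataFrames, SparseArrays

# write the coordinate list once...
CSV.write("table_coo.csv", DataFrame(i=[1,3,1,3], j=[1,1,2,3], v=[2.5,0.8,1.1,2.2]))

# ...then in later sessions rebuild the sparse matrix directly from it
coo = CSV.File("table_coo.csv") |> DataFrame
A = sparse(coo.i, coo.j, coo.v, 3, 3)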

1 Like

Thanks for the thorough explanation! That makes a lot of sense. I think storing as a 3-column CSV (or Feather or something) will make the most sense for me; then I’ll also want to store the feature_name=>feature_index and observation_name=>observation_index mappings as separate files, but that shouldn’t be too hard.

I’ll get coding and report back!

(note: I also changed the title of the post to better reflect the content)

No, I wasn’t suggesting CSC, since it is my understanding that there is no indexing within each (feature, observation), just a bunch of values.

Yes. I wrote a package for a project where I needed something similar, RaggedData; feel free to dust it off (it is pre-1.0) and/or use it as a starting point.

Yes, see Memory-mapped I/O · The Julia Language
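A minimal sketch with the Mmap stdlib (the file name and length are made up):

using Mmap

# write a flat Vector{Float64} to a plain binary file once
values = rand(1000)
open(io -> write(io, values), "values.bin", "w")

# in a later session, map it into memory instead of reading it eagerly
io = open("values.bin")
mapped = Mmap.mmap(io, Vector{Float64}, 1000)
close(io)
mapped[1:5]   # behaves like an ordinary Vector{Float64}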

You probably don’t need the equivalent of I, and I would store what corresponds to J as ranges. CSC is optimized for sparse linear algebra, which I am not sure you need. Just invent your own structure for this.

I would not bother mmapping this unless it is really huge. Just save something like

feature,observation,start_index,end_index
"A",1,1,5
"A",2,6,12
"B",1,13,20

You can probably economize on the end_index using contiguity, but I would not bother with that in the first pass of problem solving. Then read this into a

Dict([("A",1) => 1:5, ("A",2) => 6:12, ...])

Just keep it simple. The whole thing is about 50–100 LOC.
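If it helps, a rough sketch of the load side under those assumptions (the file names, column names, and the load_ragged helper are all made up):

using CSV, DataFrames, Mmap

# assumed layout:
#   values.bin  flat Float64 values, contiguous per (feature, observation)
#   index.csv   feature,observation,start_index,end_index (as sketched above)
function load_ragged(valuepath, indexpath)
    idx = CSV.File(indexpath) |> DataFrame
    nvals = maximum(idx.end_index)
    values = open(io -> Mmap.mmap(io, Vector{Float64}, nvals), valuepath)
    ranges = Dict((r.feature, r.observation) => (r.start_index:r.end_index)
                  for r in eachrow(idx))
    return values, ranges
end

# values, ranges = load_ragged("values.bin", "index.csv")
# values[ranges[("A", 1)]]    # all values for feature "A" in observation 1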

1 Like

Each feature:observation pair has a single value. The way I’ve been thinking about it is as a table, and it’s intuitive to think of it as indexable by observations (columns) and features (rows). I’m also using @mkborregaard’s SpatialEcology package, which uses CSC (and has a bunch of functionality I use). I’m definitely open to other data structures, and after a quick glance your RaggedData package seems appropriate, though I need to spend some more time wrapping my head around it.

Thanks, I may just do that :-D. First I’m going to see how much I gain from coordinate-list storage converted to CSC format, since I understand that now and CSC is how I need it for SpatialEcology anyway. If that gets me down to a few minutes to load, I might not bother with additional optimization at this point (since I need to do actual science!), but I’ll definitely keep it in mind for the future.

My first attempt at optimizing was actually storing something like this using SQLite, but I was just saving the values rather than indices, and then ended up spending all my time building the table. It makes a lot more sense to store the table structure in the database as well… not sure why that didn’t occur to me :thinking:

1 Like