Request for feedback on potential CSV.jl feature

There’s an open CSV.jl issue which is generally about the idea of somehow “ingesting” multiple same-schema delimited data files in a single call. I wanted to open this post as a request for feedback on what people think would be most helpful/useful here. Here are a few things I think about with regard to this (a rough code sketch of all three options follows the list):

  • You can obviously use existing broadcast syntax to do something like: files = CSV.File.(list_of_files).
    • Pros: super simple, nothing else needed in CSV.jl
    • Cons: each file is technically parsed separately; i.e. any types inferred from the first file aren’t passed to the subsequent ones (not critical since they’re all supposed to have the same schema anyway). And the resulting structure isn’t especially helpful: you can’t really do anything convenient with a Vector{CSV.File}. You could convert to a DataFrame with something like reduce(vcat, DataFrame.(files)), but again, that’s not super convenient.
  • You could use Tables.partitioner like parts = Tables.partitioner(CSV.File, list_of_files)
    • Pros: also very simple, nothing else required in CSV.jl
    • Cons: the result is a valid “partitions” object, which means you need to use it somewhere that expects partitions, like Arrow.write(file, parts); or, to get a DataFrame, you’d do something like DataFrame(TableOperations.joinpartitions(parts)), where TableOperations.joinpartitions takes any valid “partitions” object and lazily joins the partitions vertically into a single “table”. As with the first option, no inference results are shared between the CSV.File calls; they are each independent.
  • We make some kind of new CSV.Files(list_of_files) object, giving us complete internal control over how the files are processed
    • Pros: we could infer column types from the first file and assert that each subsequent file matches. We could return a CSV.Files object that acts like CSV.File in that it’s a “table”, i.e. you could access a column like files = CSV.Files(list_of_files); files.col1. We could also implement Tables.partitions on CSV.Files so partition-aware sinks could “split” the files up if desired.
    • Cons: …I guess a decent amount of work in CSV.jl. Actually probably not too bad, but it is maintenance going forward.
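
To make the trade-offs concrete, here’s a minimal sketch of all three options, assuming list_of_files is a Vector{String} of paths to same-schema CSV files (the CSV.Files call in option 3 is the proposed API and does not exist yet):

    using CSV, DataFrames, Tables, TableOperations

    list_of_files = ["data1.csv", "data2.csv"]  # hypothetical paths

    # Option 1: broadcast CSV.File, then vcat the per-file DataFrames.
    files = CSV.File.(list_of_files)
    df1 = reduce(vcat, DataFrame.(files))

    # Option 2: build a "partitions" object and hand it to a partition-aware
    # sink, or lazily join the partitions into a single table.
    parts = Tables.partitioner(CSV.File, list_of_files)
    df2 = DataFrame(TableOperations.joinpartitions(parts))

    # Option 3 (proposed, not yet implemented): a single CSV.Files "table".
    # files = CSV.Files(list_of_files)
    # files.col1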

Anyway, I just wondered if anyone had ideas on other features/issues they’d like, or that we should at least consider, with this kind of new functionality.

7 Likes

Would there be any performance benefits? This seems like the biggest advantage if it is the case. Without it, this kinda just seems like dressing up Vector{CSV.File} to be more usable like a single table. Nothing wrong with that, though.

5 Likes

What will happen if they don’t actually have the same schema? e.g. different column names, types, number of columns.

1 Like

I thought that partitioner was partly (heh heh) about that. If checking schemas and column types is worth shipping, maybe just a keyword setting for Tables.partitioner would be enough (although the implementation might be hard), something like identical_schema = true.

I often use the first solution (dfs = CSV.read.(list_of_files, DataFrame)), but I think its intended use is more to keep the files separated (although you can easily do df = vcat(dfs...)), while the open issue seems more to mean treating several files as if they were one.

If so, couldn’t this be handled as a pure IO problem, independently of CSV.jl? Something like:

file = virtual_grouped_files(file1, file2); CSV.File(file)

Maybe not so easy because of headers… but I’m thinking that 1) it’s a problem that is more general than reading CSV, and 2) it clarifies the intent (“please treat this bunch of files as if they were a single one”). That said, it’s more or less the Tables.partitioner solution?
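
As a rough illustration, here’s a minimal sketch of what such a virtual_grouped_files helper could look like (hypothetical, not an existing API): it concatenates the raw bytes of several files into one in-memory stream, dropping the header line of every file after the first.

    # Hypothetical helper; not part of CSV.jl or any package.
    function virtual_grouped_files(paths::AbstractString...)
        io = IOBuffer()
        for (i, path) in enumerate(paths)
            open(path) do f
                i > 1 && readline(f)  # skip the duplicate header line
                write(io, read(f))
            end
        end
        seekstart(io)
        return io
    end

    # file = virtual_grouped_files("data1.csv", "data2.csv")
    # CSV.File(file)

A real implementation would presumably stream lazily instead of buffering everything, and would need to handle files that don’t end with a newline.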

I think my initial reaction would be to keep things simple and the surface API as small as possible, which to me seems like a bit of a design principle of the Julia data ecosystem.

On the cons of the current approach, I think writing reduce(vcat, CSV.read.(myfiles, DataFrame)) isn’t all that inconvenient, and it has the benefit of being explicit. I second Tyler’s question on the performance benefit; it seems like this would mostly matter for very large numbers of small files, where the type-inference overhead is actually detectable. In that case, I assume one could currently parse the types from the first file and then pass them to CSV.File.(myfiles) to recover the performance?
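
Something along these lines, as a sketch (assuming all files genuinely share a schema, and using CSV.File’s types keyword, which accepts a Dict of column name => type):

    using CSV, DataFrames, Tables

    # Infer the schema once from the first file...
    first_file = CSV.File(myfiles[1])
    sch = Tables.schema(first_file)
    coltypes = Dict(zip(sch.names, sch.types))

    # ...then pass it to the remaining reads to skip re-inference.
    rest = CSV.File.(myfiles[2:end]; types = coltypes)
    df = reduce(vcat, DataFrame.([first_file; rest]))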

3 Likes

This is a good question; the main performance concern would be what to do about automatic threading. If we know there are a number of files, we could automatically adjust settings so each file is read with a single thread, with multiple files being read concurrently (the current default is to use multiple threads to read big enough files in chunks concurrently).

Currently, if you just do CSV.File.(list_of_files), it will most likely use multiple threads within each file, but the files themselves are read sequentially, since broadcasting isn’t concurrent. That isn’t necessarily better or worse than reading multiple files concurrently, one per thread, but I would lean towards the latter as being a little better overall.
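
For reference, a minimal sketch of the “one task per file” strategy that should work today (assuming CSV.jl’s ntasks keyword, which controls how many chunks a single file is split into):

    using CSV

    # Read each file in its own task, disabling CSV's per-file chunking
    # (ntasks = 1) so the parallelism comes from the per-file tasks instead.
    function read_files_concurrently(paths)
        tasks = [Threads.@spawn CSV.File(p; ntasks = 1) for p in paths]
        return fetch.(tasks)
    end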

1 Like

Two important use cases, in which an application pulls in a succession of CSV files (each given once in a sequence of CSV file pathnames), are distinguished by whether or not the files’ individual schemas (loosely, the named+typed columns, in tabulation order) are strongly consistent.

(quasi-plausible CSV examples, one of each)

  • A country-wide econometric study of _ gathering data from each geopolitical subregion as CSV files (all files adhering to pre-agreed table organizing principles, and sharing schema as much as possible).

  • A longitudinal study of _ gathering age-appropriately distinct factors, each in its own CSV file, many with schemas that are inconsistent across the disparate sorts of factors.

As has been mentioned, CSV.jl could assist the first example with fast reapplication of column subschemas and multi-table acceleration where schema intersections allow (CSV acceleration by schema association).

How may CSV.jl assist the second example, in addition to any transferability of first example assistance?

1 Like

@quinnj Just want to say I appreciate you seeking community input on this proposal. There’s only so much a library developer can know about their users’ needs without asking!

1 Like

@quinnj, FYI there was related inverse functionality sought here that you may want to consider.

I think that doing this in CSV.jl is the wrong level of abstraction (unless the performance gains are really compelling, but I doubt that). It would be better to have support for (lazily) concatenating tables in a mini-package that depends on Tables.jl.

Ideally the API should silently allow matching T and Union{T,Missing} for column types, because whether a column contains missing values is data-dependent: some files could have missing values, some not.
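
For instance, a minimal sketch of such a promotion rule (promote_column_type is hypothetical, not part of any package):

    # Widen T and Union{T,Missing} to the Union when merging per-file
    # schemas; reject genuinely different types.
    function promote_column_type(T1::Type, T2::Type)
        T1 === T2 && return T1
        if nonmissingtype(T1) === nonmissingtype(T2)
            return Union{nonmissingtype(T1), Missing}
        end
        error("incompatible column types: $T1 vs $T2")
    end

    promote_column_type(Int64, Union{Int64, Missing})  # Union{Missing, Int64}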

Mainly because this has not been mentioned here: I was wondering how much of what is proposed is already possible, with a reasonably simple API, using CSV + FileTrees.jl (github.com/shashi/FileTrees.jl, “Parallel file processing made easy”); see also the package announcement.

This probably does not address the parsing / promotion issues, but I thought it could be a relevant reference (it should definitely handle reading multiple files concurrently).

Another useful reference for the API could be JuliaDB.loadtable, which also allows ingesting multiple files at once. It also supports adding a separate column populated with the name of each file (or a function thereof). This can actually be pretty useful when some relevant information is encoded in the file name rather than in the csv itself.
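
Something like this, if I remember the API correctly (the filenamecol keyword name is an assumption here, so check the JuliaDB docs):

    using JuliaDB

    # Read several CSVs as one table, adding a :filename column populated
    # from each file's path.
    t = loadtable(["2021-01.csv", "2021-02.csv"]; filenamecol = :filename)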

3 Likes