Read CSVs. Build DataFrame with column identifying the original CSV

I’ve been trying to learn Julia (1.4) by doing something I commonly do with R or Python:

  1. Have a bunch of CSVs, each with a year (YYYY) in the filename. Some datasets have columns that aren’t in the other datasets.
  2. Read each CSV to build a list of DataFrames
  3. Combine all DataFrames into a single DataFrame
    a. Make a Year column that contains the year associated with each original CSV.

I can’t seem to get 3a right in Julia. Here’s how I’d do it in R (Tidyverse):

library(dplyr)
library(purrr)
library(readr)
library(stringr)

# Assume there are only CSVs in the current working directory
files = list.files(path = ".", full.names = TRUE)

years = str_extract(files, "\\d{4}")
# Basically creates a dictionary, where each key is the year and
# the value is the filename
names(files) = years

# .id="Year" will create Year column from the keys. So rows from
# each DataFrame will have a Year value equal to the year from
# that CSV's filename.
#
# purrr::map_dfr allows column names to differ across DataFrames.
df = map_dfr(files, read_csv, .id = "Year")

Below is what I’ve tried in Julia. I’m missing a way to create a Year column in the final DataFrame, containing the year associated with each original CSV (task 3a above).

using CSV, DataFrames

# Assume there are only CSVs in the current working directory
files = readdir()

years = map(
    m -> String(m.match),
    match.(r"\d{4}", files),
)

df = mapreduce(
    x -> CSV.File(x) |> DataFrame,
    # Need cols=:union since columns aren't exactly the same in all
    # DataFrames
    (x, y) -> vcat(x, y, cols = :union),
    files,
)

Any tips?

The easiest way I can think to do this is in multiple steps.

For each file, read in the DataFrame, and create the “Year” column on that DataFrame. Should be easy because it is in the filename. Store these DataFrames in a vector. Could do that all with a call to map or a simple for loop. Then Do a vcat at the end.

Maybe you can do this within mapreduce, I don’t know for sure.

This does not directly answer your question but I recommend using the package DataFramesMeta.jl that has methods similar to dplyr and a comparison table of method names. I have a recent Jupyter Notebook that I am using to do exploratory analysis I did in R before using Julia, maybe you can take something from there too (at least it brings a relation of packages to do in Julia what I did in R).

This is what I came up with.

using CSV, DataFrames

# Assume there are only CSVs in the current working directory
files = readdir()

# returns a DataFrame with a year column from just a filename
function dfwithyear(filename)
    year = String(match(r"\d{4}", filename).match)
    df = DataFrame(CSV.File(filename))
    df.year = repeat([year], size(df, 1))
    return df
end

df = mapreduce(
    dfwithyear,
    (x, y) -> vcat(x, y, cols = :union),
    files
)
1 Like

You simplify this using broadcasting:
df[!, :year] .= year

2 Likes

I would use an anonymous function for map here, replace DataFrame(CSV.File with CSV.read, and also convert the year to a number - not sure if the R example does this, but an integer year is likely more convenient than a string. The resulting code:

using CSV, DataFrames

files = readdir()

df = mapreduce(
    (x, y) -> vcat(x, y, cols = :union),
    files
) do filename
    df = CSV.read(filename)
    year = parse(Int, match(r"\d{4}", filename).match)
    df[!, :year] .= year
    return df
end
2 Likes

I think there are plans to deprecate CSV.read so CSV can drop its dependency on DataFrames, at which point DataFrame(CSV.File()) will be the way to get a DataFrame from a csv file.

4 Likes

is that documented somewhere?

It is not deprecated yet, so it is not documented, but I confirm @quinnj has such plans.

2 Likes

Yes, I know of the coming deprecation so I am moving away from CSV.read.