Get JuliaDB.loadtable to parse all columns in CSVs as String

I’m using Julia 1.4.

I want to use JuliaDB.jl specifically to read a bunch of CSVs and combine them into one big DataFrame. Here’s the issue: when reading the CSVs, I want every column to be parsed as String, and the number of columns differs from one CSV to the next.

Here’s what I’ve tried:

using CSV
using DataFrames  # just for creating the example DataFrames
using JuliaDB

df1 = DataFrame(
    [['a', 'b', 'c'], [1, 2, 3]],
    ["name", "id"]
)
df2 = DataFrame(
    [['d', 'e', 'f'], [4, 5, 6], [11, 22, 33]],
    ["name", "id", "other"]
)

# For simplicity, I will read just two CSVs, but imagine 20+.
#
# Assume these CSVs are the only files returned by `readdir()`
# below.
CSV.write("df1.csv", df1)
CSV.write("df2.csv", df2)

# This works only if every CSV has the same number of columns with
# exactly the same names. But I need it to work for CSVs with
# differing numbers of columns and differing column names. Also,
# this gets unwieldy if there are many columns.
df = loadtable(readdir(); colparsers=Dict(:name=>String, :id=>String))

# This doesn't work
df = loadtable(readdir(); colparsers=String)
# MethodError: no method matching iterate(::Type{String})

Here’s how I’d do it in R:

library(purrr)  # Need dplyr installed for `map_dfr()` to work

# Assume list.files() returns just the two above-specified CSVs
df = map_dfr(list.files(), read.csv, colClasses = "character")

I believe you can also use the column index with colparsers, e.g.

loadtable(path, colparsers=Dict(i => String for i in 1:n_cols))

This gives me:

UndefVarError: n_cols not defined

Where is n_cols supposed to be defined here?

path and n_cols are for you to define.


But the number of columns varies by CSV, so manually working out n_cols for each file and typing it in would be tedious and error-prone if there are, say, 20 CSVs.

You can do readline(my_csv_file) to read the header, then count how many times the delimiter occurs in it; the number of columns is that count plus one. Generally this should be quite simple (except if the delimiter occurs inside the header names, or if the first row of the file is not the header).
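A minimal sketch of that idea, assuming a comma delimiter and a plain header on the first line. The helper names `header_ncols` and `string_colparsers` are made up for illustration and are not part of JuliaDB; the `loadtable` call at the end is left commented out and only follows the pattern suggested earlier in this thread, so treat it as untested.

```julia
# Hypothetical helper: count the columns of a CSV by counting delimiters
# in its first line. Assumes the first line is the header and that the
# delimiter never appears inside a (quoted) header name.
function header_ncols(path::AbstractString; delim::Char=',')
    header = readline(path)               # reads only the first line
    return count(==(delim), header) + 1   # n delimiters => n + 1 columns
end

# Hypothetical helper: build an index-keyed Dict that maps every column
# of this particular file to String.
string_colparsers(path::AbstractString; delim::Char=',') =
    Dict(i => String for i in 1:header_ncols(path; delim=delim))

# Small demo files with differing numbers of columns, written to a
# temporary directory so nothing else on disk is touched.
dir = mktempdir()
path1 = joinpath(dir, "df1.csv")
path2 = joinpath(dir, "df2.csv")
write(path1, "name,id\na,1\nb,2\nc,3\n")
write(path2, "name,id,other\nd,4,11\ne,5,22\nf,6,33\n")

parsers1 = string_colparsers(path1)   # keys 1:2
parsers2 = string_colparsers(path2)   # keys 1:3

# Assumed usage, one call per file since the Dict differs per file:
# using JuliaDB
# tables = [loadtable(p; colparsers=string_colparsers(p)) for p in (path1, path2)]
```

Building the Dict per file sidesteps having to type n_cols by hand. How the resulting tables with different columns should then be combined is a separate question that this sketch does not address.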