Get JuliaDB.loadtable to parse all columns in CSVs as String

anon64288406 · June 2, 2020, 3:40pm

I’m using Julia 1.4.

I want to use JuliaDB.jl, specifically, to read a bunch of CSVs and combine them into one big DataFrame. Here’s the issue: When reading the CSVs, I want all columns to be parsed as String. The number of columns in each CSV differs.

Here’s what I’ve tried:

using CSV
using DataFrames  # just for creating the example DataFrames
using JuliaDB

df1 = DataFrame(
    [['a', 'b', 'c'], [1, 2, 3]],
    ["name", "id"]
)
df2 = DataFrame(
    [['d', 'e', 'f'], [4, 5, 6], [11, 22, 33]],
    ["name", "id", "other"]
)

# For simplicity, I will read just two CSVs, but imagine 20+.
#
# Assume these CSVs are the only files returned by `readdir()`
# below.
CSV.write("df1.csv", df1)
CSV.write("df2.csv", df2)

# This works only if every CSV has the same number of columns with
# exactly the same names. But I need it to work for CSVs with
# differing numbers of columns and differing column names. It also
# gets unwieldy if there are many columns.
df = loadtable(readdir(); colparsers=Dict(:name=>String, :id=>String))

# This doesn't work
df = loadtable(readdir(); colparsers=String)
# MethodError: no method matching iterate(::Type{String})

Here’s how I’d do it in R:

library(purrr)  # Need dplyr installed for `map_dfr()` to work

# Assume list.files() returns just the two above-specified CSVs
df = map_dfr(list.files(), read.csv, colClasses = "character")

joshday · June 2, 2020, 6:52pm

I believe you can also use the column index with colparsers, e.g.

loadtable(path, colparsers=Dict(i => String for i in 1:n_cols))

anon64288406 · June 3, 2020, 1:10am

This gives me:

UndefVarError: n_cols not defined

Where is n_cols supposed to be defined here?

joshday · June 3, 2020, 11:17am

path and n_cols are placeholders for you to define: path is the CSV file to load and n_cols is the number of columns in it.

anon64288406 · July 6, 2020, 12:49pm

But the number of columns varies by CSV, so manually working out and typing in n_cols for each file would be tedious and error-prone if there are, say, 20 CSVs.

bernhard · July 6, 2020, 2:51pm

You can do readline(my_csv_file) to read the header. Then you can count how many times the delimiter occurs in the header; the number of columns is that count plus one. Generally this should be quite simple (except if the delimiter occurs inside a header name, or if the first row of the file is not the header).
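
The header-counting idea above can be sketched roughly as follows. This is only a minimal sketch: the helper names count_columns and string_colparsers are hypothetical (they are not part of JuliaDB), and it assumes the first line of each file is the header and that no column name contains the delimiter.

```julia
# Hypothetical helper: count the columns of a CSV from its header line.
# Assumes the first line is the header and that the delimiter never
# appears inside a column name (e.g. within quotes).
function count_columns(path::AbstractString; delim::Char=',')
    header = readline(path)                  # reads only the first line
    return count(==(delim), header) + 1      # n delimiters => n + 1 columns
end

# Hypothetical helper: build a column-index => String mapping, in the
# shape suggested earlier in the thread for the colparsers keyword.
function string_colparsers(path::AbstractString; delim::Char=',')
    n_cols = count_columns(path; delim=delim)
    return Dict(i => String for i in 1:n_cols)
end
```

With JuliaDB loaded, each file could then presumably be read on its own with something like loadtable(path; colparsers=string_colparsers(path)), so n_cols never has to be typed by hand. Whether the resulting tables, which have different columns, can then be combined into one is not something this thread settles.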
