Specifying column type efficiently in CSV.read for large datasets

bray2016 · June 22, 2020, 2:39pm

Hello all,

Complete Julia noob here. I am trying to figure out how to work with large datasets in Julia. I am trying to import a dataset that has missing values. It seems that Julia defaults to importing all these columns as strings. I only want to import some columns as strings. Currently I have:

temp=CSV.read(file,types=Dict(1=> String, 2=> String, 3=> Int64, 4=> Int64, 5=> Int64, 6=> Int64, 7=> Int64, 8=> Int64, 9=> Int64, 10=> Int64, 11=> Int64, 12=> Int64, 13=> Int64, 14=> Int64, 15=> Int64, 16=> Int64, 17=> Int64, 18=> Int64, 19=> Int64, 20=> Int64, 21=> Int64, 22=> Int64, 23=> Int64, 24=> Int64, 25=> Int64, 26=> Int64, 27=> Int64, 28=> Int64, 29=> Int64, 30=> Int64, 31=> Int64, 32=> Int64, 33=> Int64, 34=> Int64, 35=> Int64, 36=> Int64, 37=> Int64, 38=> Int64, 39=> Int64, 40=> Int64, 41=> Int64, 42=> Int64, 43=> Int64, 44=> Int64, 45=> Int64, 46=> Int64), silencewarnings=true);

It seems like there should be a much more efficient way to specify that columns 3 through 46 should be a mix of integer and missing? I couldn’t get any sort of looping to work (but I am new to Julia so it could be user error).

More generally, is there a good resource for working with large datasets in Julia? The problem I have encountered so far in my Julia experience is that all the examples I can find are only micro level examples (i.e. write a dictionary for a small dataset with 5 variables, which doesn’t translate to doing this with 46 variables).

Any guidance would be greatly appreciated!

pdeffebach · June 22, 2020, 2:46pm

I think you want the missingstring keyword argument. It’s probably reading the columns in as strings because it’s encountering something like "NA" or "." and doesn’t know that means missing.
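A minimal sketch of what that might look like, assuming the file marks missing values with "NA" (the inline sample data and the column names a and b are made up for illustration; substitute whatever marker the real file uses):

```julia
using CSV

# Hypothetical sample data: "NA" stands in for a missing value in column b.
data = """
a,b
x,1
y,NA
z,3
"""

# Assumption: "NA" is the missing-value marker used in this file.
# With missingstring set, the parser can treat "NA" as missing
# instead of falling back to a string column.
f = CSV.File(IOBuffer(data); missingstring="NA")

# Inspect the detected element type of column b.
println(eltype(f.b))
```

If the marker is declared this way, the integer columns may not need explicit types at all, since the parser can detect them on its own.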

bray2016 · June 22, 2020, 2:50pm

Thanks, I knew I had to be missing something. Somehow I missed that after hours of searching. I will try to keep my noob questions to a minimum as I work my way through the growing pains of learning a new language.

pdeffebach · June 22, 2020, 2:52pm

Feel free to ask questions!

stillyslalom · June 22, 2020, 6:07pm

For bulk dict assignment, you can use broadcasting and splatting:

types = Dict((1:2 .=> String)..., (3:46 .=> Union{Int, Missing})...)

The key insight here is that => is an operator, just like +, -, ^, etc., so you can broadcast it like so:

julia> 1:2 .=> String
2-element Array{Pair{Int64,DataType},1}:
 1 => String
 2 => String

…and then splat the resultant array into a tuple:

julia> tuple((1:2 .=> String)...)
(1 => String, 2 => String)
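Putting the pieces together, here is a small sketch in plain Julia (no packages needed) that builds the full dictionary for the 46 columns from the original post, alongside a generator-based alternative for comparison. The variable names string_cols, int_cols, types_a and types_b are only illustrative:

```julia
# Column layout taken from the original post:
# columns 1 and 2 are strings, columns 3 to 46 are integers that may be missing.
string_cols = 1:2
int_cols = 3:46

# Broadcasting + splatting: each `.=>` builds a vector of pairs,
# and `...` splats those pairs into the Dict constructor.
types_a = Dict((string_cols .=> String)..., (int_cols .=> Union{Int, Missing})...)

# Alternative: a generator that yields one pair per column index.
types_b = Dict(i => (i in string_cols ? String : Union{Int, Missing}) for i in 1:46)

# Both approaches produce the same mapping.
println(types_a == types_b)
println(length(types_a))
```

Either dictionary can then be passed as the types keyword in the CSV.read call from the first post in place of the hand-written one.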
