Specifying column type efficiently in CSV.read for large datasets

Hello all,

Complete Julia noob here. I am trying to figure out how to work with large datasets in Julia. I am trying to import a dataset that has missing values. It seems that Julia defaults to importing all these columns as strings. I only want to import some columns as strings. Currently I have:

temp=CSV.read(file,types=Dict(1=> String, 2=> String, 3=> Int64, 4=> Int64, 5=> Int64, 6=> Int64, 7=> Int64, 8=> Int64, 9=> Int64, 10=> Int64, 11=> Int64, 12=> Int64, 13=> Int64, 14=> Int64, 15=> Int64, 16=> Int64, 17=> Int64, 18=> Int64, 19=> Int64, 20=> Int64, 21=> Int64, 22=> Int64, 23=> Int64, 24=> Int64, 25=> Int64, 26=> Int64, 27=> Int64, 28=> Int64, 29=> Int64, 30=> Int64, 31=> Int64, 32=> Int64, 33=> Int64, 34=> Int64, 35=> Int64, 36=> Int64, 37=> Int64, 38=> Int64, 39=> Int64, 40=> Int64, 41=> Int64, 42=> Int64, 43=> Int64, 44=> Int64, 45=> Int64, 46=> Int64), silencewarnings=true);

It seems like there should be a much more efficient way to specify that columns 3 through 46 should be a mix of integer and missing. I couldn’t get any sort of looping to work (but I am new to Julia so it could be user error).

More generally, is there a good resource for working with large datasets in Julia? The problem I have encountered so far in my Julia experience is that all the examples I can find are only micro-level examples (e.g. writing a dictionary for a small dataset with 5 variables, which doesn’t translate to doing this with 46 variables).

Any guidance would be greatly appreciated!

I think you want the missingstring keyword argument. CSV.read is probably reading those columns in as strings because it’s encountering something like "NA" or "." and doesn’t know that it stands for a missing value.
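As a rough sketch of that suggestion: the sample data and the "NA" marker below are made up for illustration, and the call assumes a recent CSV.jl where CSV.read takes a sink such as DataFrame as its second argument (older versions, like the one in the question, did not need it).

```julia
using CSV, DataFrames

# Hypothetical sample data: "NA" marks a missing value in columns b and c.
data = """
a,b,c
x,1,NA
y,NA,3
z,5,6
"""

# Without missingstring, "NA" is ordinary text, so b and c are read as strings.
df_default = CSV.read(IOBuffer(data), DataFrame)

# With missingstring, "NA" is parsed as `missing` and b and c stay integer columns.
df = CSV.read(IOBuffer(data), DataFrame; missingstring="NA")

println(eltype(df_default.b))
println(eltype(df.b))
```

If this works on the real file, the long types dictionary may not be needed at all, since the integer columns should be detected on their own once the missing marker is known.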


Thanks, I knew I had to be missing something. Somehow I missed that after hours of searching. I will try to keep my noob questions to a minimum as I work my way through the growing pains of learning a new language.

Feel free to ask questions!


For bulk dict assignment, you can use broadcasting and splatting:

types = Dict((1:2 .=> String)..., (3:46 .=> Union{Int, Missing})...)

The key insight here is that => is an operator, just like +, -, ^, etc., so you can broadcast it like so:

julia> 1:2 .=> String
2-element Array{Pair{Int64,DataType},1}:
 1 => String
 2 => String

…and then splat the resultant array into a tuple:

julia> tuple((1:2 .=> String)...)
(1 => String, 2 => String)
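For comparison, here is a small sketch that builds the same dictionary two ways using only Base Julia. The column layout (1 to 2 as strings, 3 to 46 as integers that may be missing) is taken from the question; the variable names are just for illustration.

```julia
# Broadcasting and splatting, as in the reply above.
types_splat = Dict((1:2 .=> String)..., (3:46 .=> Union{Int, Missing})...)

# The same mapping written as a generator, which may read more naturally
# to someone who was trying to write a loop.
types_comp = Dict(i => (i <= 2 ? String : Union{Int, Missing}) for i in 1:46)

# Both dictionaries hold the same 46 column => type pairs.
println(length(types_comp))
println(types_splat == types_comp)
```

Either dictionary can then be passed as the types keyword, in place of the hand-written Dict in the original CSV.read call.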