CSV.jl: how to specify column types when the total number of columns is not known?


I have to read large tables. I don’t know the total number of columns in them, but I do know that the first column contains strings and all the others are Float64. I can easily specify the type of the first column like this:
x = CSV.read("file.tsv"; delim ="\t", types=Dict(1=>String)) |> DataFrame!
But how can I set all the other columns to Float64 to increase the parsing speed?

Thanks in advance :wink:

CSV.jl will recognize those column types without help.
try x = CSV.read("file.tsv"; delim='\t', header=false)
or just x = CSV.read("file.tsv"; header=false) assuming no header line
or just x = CSV.read("file.tsv") assuming a header line (column names)

@JeffreySarnoff thank you for your answer! In fact, I want to specify the column types to increase the parsing speed :wink:

If you know an upper bound on the number of columns (some number, call it most, that is certain to be at least as large as the actual number of columns), you can set types=Dict(1=>String, 2=>Float64, ... most=>Float64) with no harm and it will work. You could also read in only the first row with limit=1 and find out how many columns there are. There may well be other ways.
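To make the "find out how many columns there are" idea concrete, here is a minimal sketch that counts the fields on the first line with plain Base Julia and builds the types Dict from that count. The helper name column_types is hypothetical (it is not part of CSV.jl), and the sketch assumes a tab-delimited file whose first line contains no quoted field with an embedded tab.

```julia
# Hypothetical helper (not part of CSV.jl): build a `types` Dict where
# column 1 is String and every other column is Float64.
function column_types(path::AbstractString; delim::Char='\t')
    # Read only the first line of the file.
    # Assumption: no quoted field on that line contains the delimiter.
    first_line = open(readline, path)
    ncols = length(split(first_line, delim))

    types = Dict{Int,DataType}(1 => String)
    for i in 2:ncols
        types[i] = Float64
    end
    return types
end

# Small demonstration on a temporary tab-separated file.
path, io = mktemp()
write(io, "id\ta\tb\nx\t1.0\t2.0\n")
close(io)

types = column_types(path)

# The Dict can then be passed to the call from the question, for example:
# CSV.read(path; delim='\t', types=types)
```

Only one line is read before the real parse, so the extra cost should be small compared with parsing a large table, though that is worth measuring on your own files.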

@JeffreySarnoff thank you! Unfortunately, the number of columns is highly variable. If what I am looking for is not implemented in the package, then my only option is indeed to read the first line, but I think that could be slower than leaving the column types unspecified.

I don’t think you need to worry much about not giving the column types explicitly. Type detection gets done quite quickly.

There is a keyword argument, type=Float64, that allows specifying the type for all columns, but then all columns have to be homogeneous. One thing we should probably allow is doing something like type=Float64, types=(:col1=String,), so you could specify the type of “all” columns, but override a single column with types. If you wouldn’t mind opening an issue about it, I’m planning on doing a bunch of CSV.jl work in the next week or two.

But as @JeffreySarnoff mentioned, this also shouldn’t make a drastic difference in performance, as the cost of the detection code is usually negligible.
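For readers landing here later, the "one default type, with an override for a single column" idea can be sketched without knowing the column count at all. This assumes a CSV.jl release in which the types keyword accepts a function of the form (i, name) -> type; that is not the version discussed in this thread, so please check the documentation of your installed version. The function name coltype is made up for illustration.

```julia
# Hypothetical column-type rule: the first column is String, all others Float64.
# `i` is the column index and `name` the column name (unused here).
coltype(i, name) = i == 1 ? String : Float64

# Assumption: a CSV.jl version whose `types` keyword accepts such a function.
# The call would then look like:
# using CSV, DataFrames
# x = CSV.read("file.tsv", DataFrame; delim='\t', types=coltype)

# The rule itself can be checked without CSV.jl:
first_type = coltype(1, :id)
other_type = coltype(250, :col250)
```

Since the rule is evaluated per column, the same call should work whatever the number of columns turns out to be.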

Issue raised: https://github.com/JuliaData/CSV.jl/issues/575

Thank you @quinnj, @JeffreySarnoff !