I have to read large tables. I don’t know the total number of columns in each table, but I do know that the first column contains strings and all the others are Float64. I can easily specify the type of the first one like this: x = CSV.read("file.tsv"; delim="\t", types=Dict(1=>String)) |> DataFrame!
But how can I set all the other columns to Float64 to increase the parsing speed?
CSV.jl will recognize those column types without help.
Try x = CSV.read("file.tsv"; delim='\t', header=false)
or just x = CSV.read("file.tsv"; header=false), assuming there is no header line,
or just x = CSV.read("file.tsv"), assuming there is a header line (column names).
If you know the largest number of columns you may have (or any number, call it most, that is certain to be at least as large as the actual column count), you can set types=Dict(1=>String, 2=>Float64, ... most=>Float64) with no harm and it will work. You could also read in only the first row with limit=1 to find out how many columns there are. There may well be other ways.
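To make the "look at the first row" idea concrete, here is a minimal sketch that uses only Base Julia to count the columns and build one type per column. The helper name column_types is made up for illustration, and it assumes a plain tab-separated file in which no field contains a quoted tab.

```julia
# Hypothetical helper (not part of CSV.jl): read only the first line,
# count its fields, and return String for column 1 and Float64 for the rest.
# Assumes no quoted field contains the delimiter.
function column_types(path::AbstractString; delim::Char='\t')
    first_line = open(readline, path)          # reads just the first line
    ncols = length(split(first_line, delim))   # number of delimited fields
    return [i == 1 ? String : Float64 for i in 1:ncols]
end
```

The result could then be passed along as types=column_types("file.tsv") in the CSV.read call from the question. Depending on the CSV.jl version, CSV.read may also need DataFrame as a second positional argument, so treat the exact call as an assumption to check against your installed version.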
@JeffreySarnoff thank you! Unfortunately, the number of columns is highly variable. If the option I am looking for is not implemented in the package, then my only choice is indeed to read the first line, but I think that could be slower than leaving the column types unspecified.
There is a keyword argument, type=Float64, that allows specifying the type for all columns, but then they have to be homogeneous. One thing we should probably allow is something like type=Float64, types=(:col1=String,), so you could specify the type of “all” columns but override a single column with types. If you wouldn’t mind opening an issue about it, I’m planning on doing a bunch of CSV.jl work in the next week or two.
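As a rough illustration of the "default type plus per-column override" behaviour proposed here, the sketch below models it in plain Julia. The function resolve_types is hypothetical and is not CSV.jl API; it only shows how a default and a set of overrides would combine.

```julia
# Hypothetical model of the proposed semantics (not CSV.jl API):
# every column gets `default` unless its name appears in `overrides`.
function resolve_types(names::Vector{Symbol}, default::Type, overrides::AbstractDict)
    return [get(overrides, name, default) for name in names]
end

# Example: three columns, all Float64 except :col1.
resolved = resolve_types([:col1, :col2, :col3], Float64, Dict(:col1 => String))
```

Here resolved holds String for :col1 and Float64 for the remaining columns, which is the outcome the type/types combination above is meant to produce.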
But as @JeffreySarnoff mentioned, this also shouldn’t make a drastic difference in performance, as the cost of type detection is usually negligible.