CSV.jl: how to specify the column types when the total number of columns is not known?

Fred · February 17, 2020, 10:42am

Hi,

I have to read large tables. I don’t know the total number of columns in advance, but I do know that the first column contains strings and all the others are Float64. I can easily specify the type of the first column like this:
x = CSV.read("file.tsv"; delim ="\t", types=Dict(1=>String)) |> DataFrame!
But how can I set all the other columns to Float64 to increase the parsing speed?

Thanks in advance

JeffreySarnoff · February 17, 2020, 11:19am

CSV.jl will recognize those column types without help.
Try x = CSV.read("file.tsv"; delim='\t', header=false)
or just x = CSV.read("file.tsv"; header=false), assuming no header line,
or just x = CSV.read("file.tsv"), assuming a header line (column names).

Fred · February 17, 2020, 12:35pm

@JeffreySarnoff thank you for your answer! In fact, I want to specify the column types to increase the parsing speed.

JeffreySarnoff · February 17, 2020, 2:21pm

If you know the maximum number of columns you may have (call it most), or any number that is certain to be at least that large, you can set types=Dict(1=>String, 2=>Float64, ... most=>Float64) with no harm and it will work. Alternatively, you could read in only the first row with limit=1 and find out how many columns there are. There may well be other ways.
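The "find out how many columns there are" idea can be sketched without parsing the whole file. The snippet below is only a minimal illustration, not CSV.jl's own method: `column_types` is a hypothetical helper name, and it assumes the first line has the same number of fields as every other row and that no field contains a quoted delimiter. The `CSV.read(path, DataFrame; ...)` call in the usage comment is the form used by more recent CSV.jl releases and may need adapting to the installed version.

```julia
# Hypothetical helper: build a types Dict from the first line of the file.
# Assumes a single-character delimiter and no quoted fields containing it.
function column_types(path::AbstractString; delim::Char='\t')
    # Read only the first line, not the whole file.
    first_line = open(readline, path)
    ncols = length(split(first_line, delim))

    # Column 1 is String, every other column is Float64.
    types = Dict{Int,DataType}(1 => String)
    for i in 2:ncols
        types[i] = Float64
    end
    return types
end

# Usage (assumes CSV.jl and DataFrames.jl are installed):
# using CSV, DataFrames
# types = column_types("file.tsv")
# x = CSV.read("file.tsv", DataFrame; delim='\t', types=types)
```

Reading one line costs very little next to parsing a large table, so the extra pass should not noticeably slow things down.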

Fred · February 17, 2020, 2:42pm

@JeffreySarnoff thank you! Unfortunately, the number of columns is highly variable. If the option I am looking for is not implemented in the package, then reading the first line is indeed my only option, but I think it could be slower than leaving the column types unspecified.

JeffreySarnoff · February 17, 2020, 3:01pm

I don’t think there is much you need to worry about with not explicitly giving the column types. That stuff gets done quite quickly.

quinnj · February 17, 2020, 4:05pm

There is a keyword argument type=Float64 that allows specifying the type for all columns, but they have to be homogeneous. One thing we should probably allow is doing something like type=Float64, types=(:col1=String,), so you could specify the types of “all” columns, but override a single column with types. If you wouldn’t mind opening an issue about it, I’m planning on doing a bunch of CSV.jl work in the next week or two.

But as @JeffreySarnoff mentioned, this also shouldn’t make a drastic change in performance as the detection code is usually negligible.
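For readers on a newer CSV.jl: as far as I know, later releases let the types keyword take a function of the form (i, name) -> Type, which covers the "default type plus overrides" case without knowing the column count. This is an assumption about versions released after this thread, so check the documentation for the installed version. `first_string_rest_float` below is a hypothetical name chosen for illustration.

```julia
# Hypothetical column-type rule: column 1 is String, all others Float64.
# `i` is the column index and `name` is the column name (unused here).
first_string_rest_float(i, name) = i == 1 ? String : Float64

# Usage (assumes a recent CSV.jl that accepts a function for `types`,
# plus DataFrames.jl):
# using CSV, DataFrames
# x = CSV.read("file.tsv", DataFrame; delim='\t', types=first_string_rest_float)
```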

JeffreySarnoff · February 17, 2020, 7:21pm

Issue raised: "Allow a default col type and some cols to have specific types" (Issue #575, JuliaData/CSV.jl on GitHub)

Fred · February 18, 2020, 6:27am

Thank you @quinnj, @JeffreySarnoff !
