CSV.jl : mixing symbol and column number in select option

Fred · February 17, 2021, 10:01am

Hi,
I am trying to read the rownames and some specific columns of a CSV file (to save RAM), so I use the “select” option. The problem with this option is that the rownames are dropped, so I want to include the first column using a column index, because this column does not always have the same name.

So my question is: how can I read the rownames and other columns without reading the whole dataframe?

One possibility is to do it in two steps: read one dataframe with the rownames and a second dataframe with the columns selected by name, then join them. But maybe there is a better solution?

Thanks for your advice!

julia> DataFrame(CSV.File("test.csv"))
2×4 DataFrame
 Row │ variable_name  col2   col3   col4  
     │ String         Int64  Int64  Int64 
─────┼────────────────────────────────────
   1 │ A                  1      2      3
   2 │ B                  4      5      6

# this is the result I want to obtain,
# but I want to select col3 by name because this column is not always at position 3...
julia> DataFrame(CSV.File("test.csv", select=[1,3]))
2×2 DataFrame
 Row │ variable_name  col3  
     │ String         Int64 
─────┼──────────────────────
   1 │ A                  2
   2 │ B                  5

# now I want to read col3 and the rownames.
# The problem is that the first column does not always have the same name, so I use column position 1 to obtain the rownames
julia> DataFrame(CSV.File("test.csv", select=[1,:col3]))
ERROR: `select` keyword argument must be an `AbstractVector` of `Int`, `Symbol`, `String`, or `Bool`, or a selector function of the form `(i, name) -> keep::Bool`

# my two-step solution to the problem, but maybe there is a better one...
julia> col1 = DataFrame(CSV.File("test.csv", select=[1]))
2×1 DataFrame
 Row │ variable_name 
     │ String        
─────┼───────────────
   1 │ A
   2 │ B

julia> df2 = DataFrame(CSV.File("test.csv", select=[:col3]))
2×1 DataFrame
 Row │ col3  
     │ Int64 
─────┼───────
   1 │     2
   2 │     5

julia> hcat(col1,df2)
2×2 DataFrame
 Row │ variable_name  col3  
     │ String         Int64 
─────┼──────────────────────
   1 │ A                  2
   2 │ B                  5
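
The error message above lists a selector function of the form `(i, name) -> keep::Bool` as an accepted value for `select`. A possible single-pass alternative is therefore to test the position and the name inside one function. The following is a minimal sketch, not a confirmed recommendation from the thread: the helper name `keep` and the temporary file path are hypothetical, and `Symbol(name)` is used so the comparison works whether CSV.jl passes the column name as a `Symbol` or a `String`.

```julia
using CSV, DataFrames

# Recreate the sample file from the post at a temporary path (hypothetical location)
path = tempname() * ".csv"
write(path, "variable_name,col2,col3,col4\nA,1,2,3\nB,4,5,6\n")

# Hypothetical helper: keep column 1 by position and col3 by name
keep(i, name) = i == 1 || Symbol(name) == :col3

# Only the columns for which keep returns true are selected
df = DataFrame(CSV.File(path; select=keep))
```

With this approach the file is read once, and neither the name of the first column nor the position of `col3` needs to be known in advance.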

quinnj · February 22, 2021, 11:01pm

Sorry for the slow response; I’ve opened an issue to look into supporting this. I’m planning on doing some CSV.jl work in the near future, so I’ll look into it then.

Fred · February 23, 2021, 9:04am

@quinnj Thank you for supporting this. In some specific cases like mine, it avoids reading the dataframe twice. Thank you for your work, I appreciate using CSV.jl.
