CSV.jl: mixing symbols and column numbers in the select option

Hi,
I am trying to read the row names and a few specific columns of a CSV file (to save RAM), so I use the “select” option. The problem is that the row names are dropped unless their column is selected too, so I want to include the first column by its index, because that column does not always have the same name.

So my question is: how can I read the row names and the other columns without reading the whole dataframe?

One possibility is to do it in two steps: read one dataframe containing the row names and a second one containing the columns selected by name, then join them. But maybe there is a better solution?

Thanks for your advice!

julia> DataFrame(CSV.File("test.csv"))
2×4 DataFrame
 Row │ variable_name  col2   col3   col4  
     │ String         Int64  Int64  Int64 
─────┼────────────────────────────────────
   1 │ A                  1      2      3
   2 │ B                  4      5      6

# This is the result I want to obtain,
# but I want to select col3 by name, because this column is not always at position 3.
julia> DataFrame(CSV.File("test.csv", select=[1,3]))
2×2 DataFrame
 Row │ variable_name  col3  
     │ String         Int64 
─────┼──────────────────────
   1 │ A                  2
   2 │ B                  5

# Now I want to read col3 and the row names.
# The first column does not always have the same name, so I use column position 1 to get the row names.
julia> DataFrame(CSV.File("test.csv", select=[1,:col3]))
ERROR: `select` keyword argument must be an `AbstractVector` of `Int`, `Symbol`, `String`, or `Bool`, or a selector function of the form `(i, name) -> keep::Bool`

# My two-step workaround, but maybe there is a better one.
julia> col1 = DataFrame(CSV.File("test.csv", select=[1]))
2×1 DataFrame
 Row │ variable_name 
     │ String        
─────┼───────────────
   1 │ A
   2 │ B

julia> df2 = DataFrame(CSV.File("test.csv", select=[:col3]))
2×1 DataFrame
 Row │ col3  
     │ Int64 
─────┼───────
   1 │     2
   2 │     5

julia> hcat(col1,df2)
2×2 DataFrame
 Row │ variable_name  col3  
     │ String         Int64 
─────┼──────────────────────
   1 │ A                  2
   2 │ B                  5
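
The error message above also mentions a selector function of the form `(i, name) -> keep::Bool`. A minimal sketch of a single-pass read along those lines follows, assuming the selector receives the column index and the column name; the helper name `keep` and the temporary file are illustrative only, and `Symbol(name)` is used so the comparison does not depend on whether the name arrives as a `Symbol` or a `String`.

```julia
using CSV, DataFrames

# Recreate the example file from this thread in a temporary directory.
path = joinpath(mktempdir(), "test.csv")
write(path, "variable_name,col2,col3,col4\nA,1,2,3\nB,4,5,6\n")

# Hypothetical helper: keep column 1 by position and `col3` by name.
keep(i, name) = i == 1 || Symbol(name) == :col3

# Only the columns for which `keep` returns true are parsed.
df = DataFrame(CSV.File(path; select=keep))
```

This mixes a positional and a name-based criterion in one call, so the file does not have to be read twice or joined afterwards.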

Sorry for the slow response; I’ve opened an issue to look into supporting this. I’m planning on doing some CSV.jl work in the near future, so I’ll look into it then.

@quinnj Thank you for supporting this. In some specific cases like mine, it avoids reading the dataframe twice. Thank you for your work, I appreciate using CSV.jl :wink: