Is there a way to read a DataFrame from file specifying the type of each column?

Here is some code which reads a DataFrame from a (csv) file:

using DataFrames
using CSV

df = DataFrame(CSV.File(filename))

By default, I think the columns are String type. (Maybe it is inferred from the data values?)

… either way -

  • Is there a way to specify the data types of each column when reading the data from file?

For this to work, the column names obviously need to be known in advance. In this case, that’s OK, because the DataFrame stores data with a constrained schema: there is a fixed set of column names, and the data type of each one is known before the file is read from disk.

If this isn’t possible - and I suspect it may not be - is there a way to convert the columns using something equivalent to this Python code:

df = (
    df.astype(
        {
            'col_a': 'int',
            'col_b': 'string',
            'col_c': 'float',
        }
    )
)

Sure, this is possible, e.g. via CSV.read’s types argument.

Example

test.csv:

Col1,Col2,Col3
1,2.3,"sdf"
12,-0.213,"ds"

julia> using CSV, DataFrames

julia> CSV.read("test.csv", DataFrame)
2×3 DataFrame
 Row │ Col1   Col2     Col3
     │ Int64  Float64  String3
─────┼─────────────────────────
   1 │     1    2.3    sdf
   2 │    12   -0.213  ds

julia> CSV.read("test.csv", DataFrame; types=Dict(:Col1 => Float64))
2×3 DataFrame
 Row │ Col1     Col2     Col3
     │ Float64  Float64  String3
─────┼───────────────────────────
   1 │     1.0    2.3    sdf
   2 │    12.0   -0.213  ds
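
Since the question is about a schema that is fully known in advance, the same types argument can also list every column. A minimal sketch, assuming the same three columns as test.csv above; the temporary path and the schema variable are made up for illustration:

```julia
using CSV, DataFrames

# Hypothetical sample file with the same contents as test.csv above
path = joinpath(mktempdir(), "test.csv")
write(path, "Col1,Col2,Col3\n1,2.3,\"sdf\"\n12,-0.213,\"ds\"\n")

# Assumed schema: every column name mapped to the type it should be read as
schema = Dict(:Col1 => Int64, :Col2 => Float64, :Col3 => String)

df = CSV.read(path, DataFrame; types=schema)

# Inspect the element type of each column
eltype.(eachcol(df))
```

Asking for String explicitly should give a plain String column instead of the inferred String3.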

Just read the documentation: Reading · CSV.jl

No way to do it using the DataFrames library? I guess it makes sense that the type conversion is done at the time of reading the file though… Thanks for the tips… guess I was looking in the wrong place…

I think I have spoken too soon.

The problem with CSV.read is that it requires a filename as its argument, whereas what I need is a CSV.File.

Tried this in the REPL… This seems to work…

CSV.File("example.txt", types=Dict(:col1=>Date,:col2=>String,:col3=>String,:col4=>Float64))

Why exactly do you want a CSV.File? If you want to end up with a DataFrame you can use DataFrame as sink in CSV.read.
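
For what it's worth, here is a rough sketch of how the two relate, as far as I understand it: CSV.read takes the same kinds of source as CSV.File (a filename, but also an IO object such as an IOBuffer) and forwards keyword arguments like types. The sample data and column names below are invented for illustration:

```julia
using CSV, DataFrames, Dates

# Hypothetical in-memory data standing in for example.txt
data = "col1,col2,col3,col4\n2021-03-01,a,b,1.5\n2021-03-02,c,d,2\n"
schema = Dict(:col1 => Date, :col2 => String, :col3 => String, :col4 => Float64)

# Route 1: build a CSV.File first, then materialize it as a DataFrame
f = CSV.File(IOBuffer(data); types=schema)
df_from_file = DataFrame(f)

# Route 2: let CSV.read parse the source and hand it to the DataFrame sink
df_direct = CSV.read(IOBuffer(data), DataFrame; types=schema)
```

Both routes should give the same table, so which one to use mostly depends on whether you need the intermediate CSV.File object.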

You could just manually convert after reading, using something like

julia> df = CSV.read("test.csv", DataFrame)  # initial DataFrame
2×3 DataFrame
 Row │ Col1   Col2     Col3
     │ Int64  Float64  String3
─────┼─────────────────────────
   1 │     1    2.3    sdf
   2 │    12   -0.213  ds

julia> for (col_symb, col_type) in zip((:Col1, :Col3), (Float32, String))
           df[!, col_symb] .= convert.(col_type, df[!, col_symb])
       end

julia> df
2×3 DataFrame
 Row │ Col1     Col2     Col3
     │ Float32  Float64  String
─────┼──────────────────────────
   1 │     1.0    2.3    sdf
   2 │    12.0   -0.213  ds

as in this topic
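
The loop above could also be driven by a Dict, which gets closer to the astype call in the question. This is only a sketch: the column names are taken from the Python snippet, and the sample values and the target variable are assumptions for illustration:

```julia
using DataFrames

# Hypothetical DataFrame whose columns start out with the "wrong" types
df = DataFrame(col_a = [1.0, 2.0], col_b = ["x", "y"], col_c = [3, 4])

# Assumed target schema, analogous to the dict passed to astype
target = Dict(:col_a => Int, :col_b => String, :col_c => Float64)

for (name, T) in target
    # Replace each column with a converted copy
    df[!, name] = convert.(T, df[!, name])
end

eltype.(eachcol(df))
```

Note that convert only handles conversions Julia already defines (e.g. between number types); turning strings into numbers would need parse instead.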

I have an interface which needs to take a file-like object, rather than a DataFrame directly.

That would work too, thanks for the suggestion.