DataFrame: eltypes with variable length

data

#1

Hi,

The function
readtable(filename, [keyword options])

has an optional keyword

eltypes::Vector – Specify the types of all columns. Defaults to [].

When the exact size of the table is known, it is possible to specify the types of all the columns:

 x = readtable("data/data.csv", separator = '\t' , eltypes = [String, Float64, Float64, Float64, Float64])
4×5 DataFrames.DataFrame
│ Row │ id   │ s1       │ s2       │ s3       │ s4       │
├─────┼──────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ "g1" │ 0.134978 │ 0.231912 │ 0.479582 │ 0.134978 │
│ 2   │ "g2" │ 0.972158 │ 0.437821 │ NA       │ 0.848548 │
│ 3   │ "g3" │ 0.152925 │ NA       │ 0.848548 │ 0.152925 │
│ 4   │ "g4" │ 0.813864 │ 0.972158 │ 0.917429 │ 0.813864 │

But if the size of the table is not known, and I only know that the first column is of type String, how can I set eltypes? Something like:

eltypes = [String, Float64...]

Thanks !


#2

With the CSV.jl package, you can just do

CSV.read("data/data.csv", delim='\t', types=Dict(1=>String))

This will return a DataFrame by default.
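
If you would rather stay with readtable, one possible approach is to count the columns from the header line first and build the eltypes vector from that count. This is only a minimal sketch, assuming the file has a header row and that every column after the first should be Float64; the helper name build_eltypes is hypothetical and not part of DataFrames.

```julia
# Hypothetical helper (not part of DataFrames): build an eltypes vector for a
# file whose first column is String and whose remaining columns are Float64.
function build_eltypes(file::AbstractString, sep::Char)
    header = open(readline, file)        # read only the first (header) line
    ncols = length(split(header, sep))   # one header field per column
    return [String; fill(Float64, ncols - 1)]
end

# Self-contained demo on a temporary tab-separated file.
path = tempname()
write(path, "id\ts1\ts2\ts3\ts4\ng1\t0.1\t0.2\t0.3\t0.4\n")
types = build_eltypes(path, '\t')
println(types)
```

The result could then be passed along as readtable(file, separator = '\t', eltypes = build_eltypes(file, '\t')).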


#3

Thank you very much, Quinnj!

The reason I stay with DataFrames and readtable is that I get better speed with readtable, even when I don’t specify the eltypes. But it is possible that I have done something wrong.

$ julia DataCSV.jl 

WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/fred/.julia/v0.6/NullableArrays/src/operators.jl:99.
Reading...	data.csv
Reading...	data2.csv
elapsed time: 3.95377051 seconds

$ julia DataFrames.jl 
Reading...	data.csv
Reading...	data2.csv
elapsed time: 1.566749483 seconds

DataCSV.jl

using CSV

##########################################
# read dataframe
function readTable(file, sep, h)
    println("Reading...\t", file)
    x = CSV.read(file ; delim = sep, types=Dict(1=>String), header = h, null="NA") # read data file
    return x
end

function main()
    sep = '\t'       # table separator
    h = true         # table header
  
    # process data
    f = ["data.csv", "data2.csv"]
    
    for file in f
        tab = readTable(file, sep, h)
    end
end

##########################################

tic()
main()
toc()

DataFrames.jl

using DataFrames

# read dataframe
function readTable(file, sep, h)
    println("Reading...\t", file)
    x = readtable(file , separator = sep, header = h) # read data file
    return x
end

function main()
    sep = '\t'       # table separator
    h = true         # table header
  
    # process data
    f = ["data.csv", "data2.csv"]
    
    for file in f
        tab = readTable(file, sep, h)
    end
end

##########################################

tic()
main()
toc()

data.csv (tab separator)
id	s1	s2	s3	s4
g1	0.1349779443	0.2319120248	0.4795815343	0.1349779443
g2	0.9721584522	0.4378209082		0.8485481786
g3	0.1529253099		0.8485481786	0.1529253099
g4	0.8138636984	0.9721584522	0.9174289651	0.8138636984

data2.csv
id	s1	s2	s3	s4
g1	0.2235082715	0.726445808	0.3964289063	0.2169791684
g2	0.6151192371	0.7863019568	0.6236194363	
g3	0.9810212048	0.2967554158	0.5556356032
g4	0.0347811024	0.5602313542	0.1317892775	0.4228049423
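
One thing that may affect the timings reported above: wrapping a single run in tic()/toc() also measures Julia's compilation of the reader on its first call, which can dominate for files this small. A rough sketch of timing a second call separately is below. The parse_table function is a hypothetical stand-in that uses only Base, so that the example runs without any packages; its body would be replaced with readtable(...) or CSV.read(...) for a real comparison.

```julia
# Hypothetical stand-in reader so the sketch runs with Base only;
# replace its body with readtable(...) or CSV.read(...) to compare them.
function parse_table(file::AbstractString, sep::Char)
    rows = [split(line, sep) for line in eachline(file)]
    return rows
end

# Small temporary tab-separated file to read.
path = tempname()
write(path, "id\ts1\ts2\ng1\t0.1\t0.2\ng2\t0.3\t0.4\n")

t_first  = @elapsed parse_table(path, '\t')   # first call: includes compilation
t_second = @elapsed parse_table(path, '\t')   # second call: mostly the read itself
println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

If the gap between the two packages shrinks on the second call, the difference seen above is probably mostly start-up cost rather than parsing speed.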