Help with filling dataframe

I’m having trouble with the code to fill a dataframe with some strings.

say the strings are defined as

data = ["a\tb","c\td","e\tf"]

a dataframe is preallocated as

df = DataFrame([String, String],[:X,:Y],3)

I’m trying to fill the dataframe like this, but it doesn’t seem to be working. Is there something basic I’m doing wrong?

df = split.(data, "\t")

I’ve also tried this, but this doesn’t work either.

for i in 1:length(data)
     df[i,:] = split(data[i], "\t")
end

I can’t find any examples in the documentation that cover this use case.

Please note that the DataFrame([String, String],[:X,:Y],3) constructor call is a deprecated syntax (use --depwarn=yes at startup to see warnings).

Now the way to get what you want is e.g.:

julia> rename(DataFrame(Tuple.(split.(data))), [:X, :Y])
3×2 DataFrame
 Row │ X          Y
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d
   3 │ e          f

or

julia> select(DataFrame(data=data), :data => ByRow(split) => [:X, :Y])
3×2 DataFrame
 Row │ X          Y
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d
   3 │ e          f

or

julia> s = split.(data)
3-element Vector{Vector{SubString{String}}}:
 ["a", "b"]
 ["c", "d"]
 ["e", "f"]

julia> DataFrame(X=getindex.(s, 1), Y=getindex.(s, 2))
3×2 DataFrame
 Row │ X          Y
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d
   3 │ e          f

whichever seems easier for you to use.


@bkamins Thanks for your help, I was actually right in the middle of your course on DataFrames on JuliaAcademy so it’s nice to get an answer from you.

I didn’t realize that method of preallocating DataFrames was deprecated.

From your suggestions, how would you define the data types using that syntax? In this case, it is easy because they are all strings, but what about if I wanted to define those types up front?

I assume you are asking how to create a data frame with uninitialized columns of certain types with a certain number of rows. Then you can do e.g.:

DataFrame([n => Vector{T}(undef, 3) for (n, T) in [:X => String, :Y => Int]])

or

DataFrame([:X, :Y] .=> [Vector{T}(undef, 3) for T in [String, Int]])

or

DataFrame([Vector{T}(undef, 3) for T in [String, Int]], [:X, :Y])

I know this is not super friendly to type, but there is a reason for that. It is not efficient to create a data frame this way and later fill it with data, which is why it is also discouraged on a syntax level. It is usually better to construct a data frame dynamically as data is added to it.
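As a rough sketch of the dynamic approach, assuming the data vector from the first post and assuming you want both columns stored as String (the empty typed columns and the explicit String conversion are illustrative choices, not the only way to do it):

```julia
using DataFrames

data = ["a\tb", "c\td", "e\tf"]

# Start from empty columns with the desired element types
# instead of preallocated #undef entries.
df = DataFrame(X = String[], Y = String[])

for line in data
    # Each string is assumed to hold exactly two tab-separated fields.
    x, y = split(line, '\t')
    # split returns SubString values, so convert them to String explicitly.
    push!(df, (String(x), String(y)))
end
```

The column types are fixed up front by the empty vectors, and each push! appends one row, so no entry is ever left uninitialized.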


@bkamins OK, I’ll try it the way you previously suggested. From the documentation, it says that row by row construction of a dataframe is not performant, so I was going with this preallocation, but it seems to be fast enough for what I need.

Thanks for your help!

Could you please pass me a link to this part of documentation? Thank you!

@bkamins From this link: Getting Started · DataFrames.jl

“Note that constructing a DataFrame row by row is significantly less performant than constructing it all at once, or column by column. For many use-cases this will not matter, but for very large DataFrames this may be a consideration.”

Ah, this is correct, as the comparison there is against “creating it all at once” or “column by column” (which are both faster).

The option to leave entries of a data frame as #undef and then fill them is usually the slowest.

OK, I see. Thanks for the clarification.

One final question: would it be possible to have an array of tab-delimited strings like in my example and then use that to create a dataframe column by column?

But you mean that each string holds one row? If yes - then all three answers I have given in my initial reply work this way.
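For example, a minimal column-by-column sketch, assuming each string holds exactly two tab-separated fields and that you want plain String columns rather than SubString (the variable name parts is just for illustration):

```julia
using DataFrames

data = ["a\tb", "c\td", "e\tf"]

# Split every row string on the tab character.
parts = split.(data, '\t')

# Build each column in one go by taking the j-th field of every row,
# converting the SubString values to String.
df = DataFrame(X = String.(getindex.(parts, 1)),
               Y = String.(getindex.(parts, 2)))
```

Here each column vector is created whole before the data frame is constructed, which is the "column by column" case the documentation refers to.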