Please note that this syntax is deprecated (start Julia with --depwarn=yes to see the warnings).
The current way to get what you want is, for example:
julia> rename(DataFrame(Tuple.(split.(data))), [:X, :Y])
3×2 DataFrame
 Row │ X          Y
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d
   3 │ e          f
or
julia> select(DataFrame(data=data), :data => ByRow(split) => [:X, :Y])
3×2 DataFrame
 Row │ X          Y
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d
   3 │ e          f
or
julia> s = split.(data)
3-element Vector{Vector{SubString{String}}}:
["a", "b"]
["c", "d"]
["e", "f"]
julia> DataFrame(X=getindex.(s, 1), Y=getindex.(s, 2))
3×2 DataFrame
 Row │ X          Y
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d
   3 │ e          f
@bkamins Thanks for your help. I was actually right in the middle of your course on DataFrames on JuliaAcademy, so it’s nice to get an answer from you.
I didn’t realize that method of preallocating DataFrames was deprecated.
From your suggestions, how would you define the data types using that syntax? In this case, it is easy because they are all strings, but what about if I wanted to define those types up front?
I assume you are asking how to create a data frame with uninitialized columns of certain types with a certain number of rows. Then you can do e.g.:
DataFrame([n => Vector{T}(undef, 3) for (n, T) in [:X => String, :Y => Int]])
or
DataFrame([:X, :Y] .=> [Vector{T}(undef, 3) for T in [String, Int]])
or
DataFrame([Vector{T}(undef, 3) for T in [String, Int]], [:X, :Y])
I know this is not super friendly to type, but there is a reason for that. It is not efficient to create a data frame this way and fill it with data later, which is why it is also discouraged at the syntax level. It is usually better to construct a data frame dynamically as data is added to it.
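As a minimal sketch of what dynamic construction with fixed column types could look like: start from empty typed columns and push! rows as they are parsed. The `raw` vector and the assumption that the second field is an integer are made up for illustration and are not part of the original example.

```julia
using DataFrames

# Hypothetical input: tab-delimited strings whose second field is an integer.
raw = ["a\t1", "c\t2", "e\t3"]

# Empty data frame with the column element types fixed up front.
df = DataFrame(X=String[], Y=Int[])

# Append one row per input line, converting each field to its column type.
for line in raw
    x, y = split(line, '\t')
    push!(df, (String(x), parse(Int, y)))
end
```

Because the columns are created as String[] and Int[], pushing a value that cannot be converted to the column type raises an error instead of silently widening the column.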
@bkamins OK, I’ll try it the way you previously suggested. The documentation says that row-by-row construction of a data frame is not performant, so I was going with preallocation, but it seems to be fast enough for what I need.
“Note that constructing a DataFrame row by row is significantly less performant than constructing it all at once, or column by column. For many use-cases this will not matter, but for very large DataFrames this may be a consideration.”
One final question: would it be possible to take an array of tab-delimited strings like in my example and use it to create a data frame column by column?
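For reference, a column-by-column version could look roughly like the sketch below. It reuses the split and getindex pattern from earlier in the thread; the `raw` vector and the assumption that the second field is an integer are hypothetical and only serve as an illustration.

```julia
using DataFrames

# Hypothetical input: tab-delimited strings whose second field is an integer.
raw = ["a\t1", "c\t2", "e\t3"]

# Split every line on the tab character; each element is a vector of fields.
parts = split.(raw, '\t')

# Build each column in one go, converting to the desired element type.
df = DataFrame(X=String.(getindex.(parts, 1)),
               Y=parse.(Int, getindex.(parts, 2)))
```

Each column here is created as a whole vector, so the data frame is built all at once rather than row by row.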