DataFrame array subset

I have a DataFrame

dfy = DataFrame(a = [[1,2,3],[4,5,6],[7,8,9]], b = ["M","F","F"])

3×2 DataFrame
│ Row │ a         │ b      │
│     │ Array…    │ String │
├─────┼───────────┼────────┤
│ 1   │ [1, 2, 3] │ M      │
│ 2   │ [4, 5, 6] │ F      │
│ 3   │ [7, 8, 9] │ F      │

I would like to append a new variable x1 based on the first element of each array in dfy.a, x2 based on the second element, and x3 based on the third element.

3×5 DataFrame
│ Row │ a         │ b      │ x1    │ x2    │ x3    │
│     │ Array…    │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1   │ [1, 2, 3] │ M      │ 1     │ 2     │ 3     │
│ 2   │ [4, 5, 6] │ F      │ 4     │ 5     │ 6     │
│ 3   │ [7, 8, 9] │ F      │ 7     │ 8     │ 9     │

Short of doing the below, is there a way to use a for loop or some other approach to create the above dataset?

dfy[!,:x1]= getindex.(dfy.a,1)
dfy[!,:x2]= getindex.(dfy.a,2)
dfy[!,:x3]= getindex.(dfy.a,3)

Would this work?

using DataFramesMeta
@transform(dfy, x1 = getindex.(:a, 1), x2 = getindex.(:a, 2), x3 = getindex.(:a, 3))

How about:

julia> using DataFrames

julia> dfy = DataFrame(a = [[1,2,3],[4,5,6],[7,8,9]], b = ['M','F','F'])
3×2 DataFrame
│ Row │ a         │ b    │
│     │ Array…    │ Char │ 
├─────┼───────────┼──────┤ 
│ 1   │ [1, 2, 3] │ 'M'  │ 
│ 2   │ [4, 5, 6] │ 'F'  │ 
│ 3   │ [7, 8, 9] │ 'F'  │

julia> hcat(dfy, DataFrame(["x$(i)" => getindex.(dfy.a, i) for i in 1:3]))
3×5 DataFrame
│ Row │ a         │ b    │ x1    │ x2    │ x3    │
│     │ Array…    │ Char │ Int64 │ Int64 │ Int64 │   
├─────┼───────────┼──────┼───────┼───────┼───────┤  
│ 1   │ [1, 2, 3] │ 'M'  │ 1     │ 2     │ 3     │ 
│ 2   │ [4, 5, 6] │ 'F'  │ 4     │ 5     │ 6     │
│ 3   │ [7, 8, 9] │ 'F'  │ 7     │ 8     │ 9     │ 

Thank you. When creating the new variables, how can we convert them to Float64 instead of Int?

Just use Float64.(getindex.(...)) in the comprehension


Just do Float64.(getindex.(dfy.a, i))

But are you sure you need them to all be Floats? Things should “just work” without the explicit conversion.
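To make the two suggestions above concrete, here is a small sketch using only Base Julia. The plain vector `a` is my stand-in for the column dfy.a (an assumption for illustration, so no package is needed); the broadcasting calls are the same ones you would put in the comprehension.

```julia
# Plain vector of vectors standing in for the column dfy.a
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Broadcast getindex over the rows: first element of each inner array
x1_int = getindex.(a, 1)

# Wrap the result in a broadcast Float64 call to convert the element type
x1_float = Float64.(getindex.(a, 1))

# Integer columns already promote when used in floating-point arithmetic,
# which is why the explicit conversion is often unnecessary
halved = x1_int ./ 2
```

Here `x1_int` keeps an integer element type, `x1_float` holds the same values as Float64, and `halved` comes out as Float64 without any explicit conversion.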


Strange: when I execute the code, I get the following result:

3×4 DataFrame
│ Row │ a         │ b      │ first  │ second    │
│     │ Array…    │ String │ String │ Array…    │
├─────┼───────────┼────────┼────────┼───────────┤
│ 1   │ [1, 2, 3] │ M      │ x1     │ [1, 4, 7] │
│ 2   │ [4, 5, 6] │ F      │ x2     │ [2, 5, 8] │
│ 3   │ [7, 8, 9] │ F      │ x3     │ [3, 6, 9] │

I can’t replicate that result.

julia> dfy = DataFrame(a = [[1,2,3],[4,5,6],[7,8,9]], b = ["M","F","F"])
3×2 DataFrame
│ Row │ a         │ b      │
│     │ Array…    │ String │
├─────┼───────────┼────────┤
│ 1   │ [1, 2, 3] │ M      │
│ 2   │ [4, 5, 6] │ F      │
│ 3   │ [7, 8, 9] │ F      │

julia> hcat(dfy, DataFrame(["x$(i)" => getindex.(dfy.a, i) for i in 1:3]))
3×5 DataFrame
│ Row │ a         │ b      │ x1    │ x2    │ x3    │
│     │ Array…    │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1   │ [1, 2, 3] │ M      │ 1     │ 2     │ 3     │
│ 2   │ [4, 5, 6] │ F      │ 4     │ 5     │ 6     │
│ 3   │ [7, 8, 9] │ F      │ 7     │ 8     │ 9     │

Additionally, can you please quote code and output using triple backticks?

```
like this
```

However, I think your best solution is a simple for loop:

julia> for i in 1:3
       dfy[:, "x$i"] = getindex.(dfy.a, i)
       end

julia> dfy
3×5 DataFrame
│ Row │ a         │ b      │ x1    │ x2    │ x3    │
│     │ Array…    │ String │ Int64 │ Int64 │ Int64 │
├─────┼───────────┼────────┼───────┼───────┼───────┤
│ 1   │ [1, 2, 3] │ M      │ 1     │ 2     │ 3     │
│ 2   │ [4, 5, 6] │ F      │ 4     │ 5     │ 6     │
│ 3   │ [7, 8, 9] │ F      │ 7     │ 8     │ 9     │

That’s odd, as what I posted above was a minimal working example in a fresh Julia session. What versions of Julia and DataFrames are you using? (Although off the top of my head, I’m not using anything fancy in this example that has changed recently.)

I also agree with Peter that a for loop is probably the cleanest solution here. I’d only use what I posted if for some reason I wanted a one-liner at all costs.


Thank you for the suggestions. Below is what I get:

julia> VERSION
v"1.4.2"
pkg> status "DataFrames"
DataFrames v0.20.2

If I use the for loop, I get the following error:

MethodError: no method matching setindex!(::DataFrame, ::Array{Int64,1}, ::String)
Closest candidates are:
  setindex!(::DataFrame, ::Any, ::Any, !Matched::Colon) at C:\Users\.julia\packages\DataFrames\S3ZFo\src\deprecated.jl:1516
  setindex!(::DataFrame, ::AbstractArray{T,1} where T, !Matched::typeof(!), !Matched::Union{Signed, Symbol, Unsigned}) at C:\Users\.julia\packages\DataFrames\S3ZFo\src\dataframe\dataframe.jl:482
  setindex!(::DataFrame, ::AbstractArray{T,1} where T, !Matched::Union{Signed, Symbol, Unsigned}) at deprecated.jl:65
  ...
setindex!(::DataFrame, ::Array{Int64,1}, ::Colon, ::String) at deprecated.jl:1524
top-level scope at trust_region_new_array.jl:292

Hi @nilshg, see below. Can you please let me know what version you are using?

julia> VERSION
v"1.4.2"
pkg> status "DataFrames"
DataFrames v0.20.2

Can you please post the full MWE using triple backticks?

Here it is:

dfy = DataFrame(a = [[1,2,3],[4,5,6],[7,8,9]], b = ["M","F","F"])

for i in 1:3
       dfy[:, "x$i"] = getindex.(dfy.a, i)
end

hcat(dfy, DataFrame(["x$(i)" => getindex.(dfy.a, i) for i in 1:3]))

Okay, this is a weird error. I apologize for the confusion.

First, the problem with my for loop is that this version of DataFrames doesn’t allow String indexing. You should replace "x$i" with Symbol(:x, i) and the code will work.
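For reference, a sketch of the loop with Symbol column names. I use `!` as the row selector, as in the original question, on the assumption that it behaves the same on 0.20.2 and later releases (the error message above does list a `!` method that accepts a Symbol).

```julia
using DataFrames

dfy = DataFrame(a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]], b = ["M", "F", "F"])

for i in 1:3
    # Symbol(:x, i) concatenates its arguments, giving :x1, :x2, :x3
    dfy[!, Symbol(:x, i)] = getindex.(dfy.a, i)
end
```

After the loop, dfy has the two original columns plus x1, x2 and x3.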

The second, more confusing error is related to the code

 DataFrame(["x$(i)" => getindex.(dfy.a, i) for i in 1:3])

It seems there was a breaking change in the constructor of DataFrames between 0.20.0 and 0.21.0 that I did not know about.

In 0.20.2 we have

julia> DataFrame(["x$(i)" => getindex.(dfy.a, i) for i in 1:3])
3×2 DataFrame
│ Row │ first  │ second    │
│     │ String │ Array…    │
├─────┼────────┼───────────┤
│ 1   │ x1     │ [1, 4, 7] │
│ 2   │ x2     │ [2, 5, 8] │
│ 3   │ x3     │ [3, 6, 9] │

On 0.21.0 we have

julia> DataFrame(["x$(i)" => getindex.(dfy.a, i) for i in 1:3])
3×3 DataFrame
│ Row │ x1    │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │
│ 2   │ 4     │ 5     │ 6     │
│ 3   │ 7     │ 8     │ 9     │

@bkamins do you know why this constructor changed?

OP, my advice is to update DataFrames and use the examples put forward in this thread.


Thank you. Symbol(:x, i) did the magic. It appears that for Julia v"1.4.2" only DataFrames v0.20.2 is available, so I guess I have to update Julia to 1.5 first and then get the latest DataFrames.

You can get the latest DataFrames without updating your Julia installation. There is no need to upgrade to 1.5 if it’s too much work.

Thank you. Can you please let me know how to update DataFrames package without upgrading Julia?

Here is how I do it: pkg> add DataFrames and I get DataFrames v0.20.2

Use:

DataFrame([Symbol("x", i) => getindex.(dfy.a, i) for i in 1:3]...)

to be backward compatible.
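Put together with the hcat step from earlier in the thread, that would look roughly like the sketch below. The names `newcols` and `wide` are mine, chosen only for illustration.

```julia
using DataFrames

dfy = DataFrame(a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]], b = ["M", "F", "F"])

# Splat the vector of pairs so each pair is passed as its own
# column_name => column_values argument to the constructor
newcols = DataFrame([Symbol("x", i) => getindex.(dfy.a, i) for i in 1:3]...)

# Glue the new columns onto the original data frame
wide = hcat(dfy, newcols)
```

Because each pair is passed as a separate argument, the constructor never sees a single vector of pairs, which is the case whose meaning changed between 0.20 and 0.21.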

In general this constructor is useful for DataFrame(pairs(eachcol(df))), but we can drop it if we want and replace it with DataFrame(pairs(eachcol(df))...).

Also I find:

julia> DataFrame(["x$(i)" => getindex.(dfy.a, i) for i in 1:3])
3×2 DataFrame
│ Row │ first  │ second    │
│     │ String │ Array…    │
├─────┼────────┼───────────┤
│ 1   │ x1     │ [1, 4, 7] │
│ 2   │ x2     │ [2, 5, 8] │
│ 3   │ x3     │ [3, 6, 9] │

quite non-intuitive TBH (but it is consistent with Tables.jl).
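One way to see where the first and second column names come from, using only Base Julia: a Pair is a struct with two fields, so a vector of pairs can be read as a list of rows with those two fields. This is my reading of the behaviour, not a description of the DataFrames internals.

```julia
# Same construction as in the thread, on a plain vector of vectors
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
pairs_vec = ["x$(i)" => getindex.(a, i) for i in 1:3]

# A Pair has exactly two fields
pair_fields = fieldnames(typeof(pairs_vec[1]))

# Reading each pair as a row yields one "first" and one "second" column
firsts = [p.first for p in pairs_vec]
seconds = [p.second for p in pairs_vec]
```

`firsts` and `seconds` hold the same values as the first and second columns in the 0.20.2 output above.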

Do ] add DataFrames@0.21; you’ll then see what’s holding you back.
