Reshaping dataframe

I show a way to transform the starting table to have only one “X” per row.
I ask for comments on how to do the various steps differently and / or better.
For example how to insert an “index” column or how to filter the rows that contain “X”, or other like the general setting of the solution.

df
7×7 DataFrame
 Row │ Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Criteria 4    
     │ String    String     Int64  String?     String?     String?     String?       
─────┼────────────────────────────────────────────────────────────────────────────   
   1 │ Sander    1/1/2022       8  X           missing     missing     missing       
   2 │ Sander    1/2/2022       5  missing     X           X           missing       
   3 │ Sander    1/3/2022       7  missing     missing     X           missing       
   4 │ Sander    1/4/2022       8  missing     missing     missing     X
   5 │ Somebody  2/3/2022       4  X           X           missing     X
   6 │ Somebody  4/3/2022       9  X           missing     X           X
   7 │ Somebody  5/14/2022      2  missing     missing     X           missing   
sdf=stack(df, r"Criteria")
insertcols!(sdf, 1, :Row => 1:nrow(sdf))
udf=unstack(sdf)
filter(r-> any("X" .=== values(r))  , udf)

PS
Is it possible, in some way, to upload csv files?

Instead of uploading a csv you can just copy/paste a string representation like here:

https://stackoverflow.com/questions/61723347/how-to-provide-reproducible-sample-data-in-julia/61724167

tanks!
this way?

string_representation = String(take!(CSV.write(IOBuffer(), df)))
"Name,Date,Hours,Criteria 1,Criteria 2,Criteria 3,Criteria 4\nSander,1/1/2022,8,X,,,\nSander,1/2/2022,5,,X,X,\nSander,1/3/2022,7,,,X,\nSander,1/4/2022,8,,,,X\nSomebody,2/3/2022,4,X,X,,X\nSomebody,4/3/2022,9,X,,X,X\nSomebody,5/14/2022,2,,,X,\n"

Yep:

julia> CSV.read(IOBuffer(string_representation), DataFrame)
7×7 DataFrame
 Row │ Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Criteria 4
     │ String    String     Int64  String?     String?     String?     String?
─────┼────────────────────────────────────────────────────────────────────────────
   1 │ Sander    1/1/2022       8  X           missing     missing     missing
   2 │ Sander    1/2/2022       5  missing     X           X           missing
   3 │ Sander    1/3/2022       7  missing     missing     X           missing
   4 │ Sander    1/4/2022       8  missing     missing     missing     X
   5 │ Somebody  2/3/2022       4  X           X           missing     X
   6 │ Somebody  4/3/2022       9  X           missing     X           X
   7 │ Somebody  5/14/2022      2  missing     missing     X           missing

Tanks again.
I add one more question.
How can you do to correctly interpret the strings of the “Date” column?

Same as with any other csv:

julia> CSV.read(IOBuffer(string_representation), DataFrame; dateformat = "m/d/y")

Also sorry this was a complete distraction from your real question, which I admit I don’t understand.

Could you show the expected result? When I run your code I end up with:

julia> filter(r-> any("X" .=== values(r))  , udf)
0×8 DataFrame

which I assume isn’t what you’re after?

1 Like

the result should be, and to me it actually is, this:

julia> filter(r-> any("X" .=== values(r))  , udf)
12×8 DataFrame
 Row │ Row    Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Crite ⋯
     │ Int64  String    String     Int64  String?     String?     String?     Strin ⋯
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │     1  Sander    1/1/2022       8  X           missing     missing     missi ⋯
   2 │     5  Somebody  2/3/2022       4  X           missing     missing     missi  
   3 │     6  Somebody  4/3/2022       9  X           missing     missing     missi  
   4 │     9  Sander    1/2/2022       5  missing     X           missing     missi  
   5 │    12  Somebody  2/3/2022       4  missing     X           missing     missi ⋯
   6 │    16  Sander    1/2/2022       5  missing     missing     X           missi  
   7 │    17  Sander    1/3/2022       7  missing     missing     X           missi  
   8 │    20  Somebody  4/3/2022       9  missing     missing     X           missi  
   9 │    21  Somebody  5/14/2022      2  missing     missing     X           missi ⋯
  10 │    25  Sander    1/4/2022       8  missing     missing     missing     X      
  11 │    26  Somebody  2/3/2022       4  missing     missing     missing     X      
  12 │    27  Somebody  4/3/2022       9  missing     missing     missing     X      
                   

julia> versioninfo()
Julia Version 1.7.0-beta3.0
DataFrames v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main

I tried to use the aprameter dateformat = “d / m / yyyy” but it didn’t give me the expected result.
but even so it is no different

julia> df=CSV.read("unpivot pivot.CSV", DataFrame, dateformat="d/m/y")
7×7 DataFrame
 Row │ Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Criteria 4    
     │ String    String     Int64  String?     String?     String?     String?       
─────┼────────────────────────────────────────────────────────────────────────────   
   1 │ Sander    1/1/2022       8  X           missing     missing     missing       
   2 │ Sander    1/2/2022       5  missing     X           X           missing       
   3 │ Sander    1/3/2022       7  missing     missing     X           missing       
   4 │ Sander    1/4/2022       8  missing     missing     missing     X
   5 │ Somebody  2/3/2022       4  X           X           missing     X
   6 │ Somebody  4/3/2022       9  X           missing     X           X
   7 │ Somebody  5/14/2022      2  missing     missing     X           missing       

That’s because the dateformat is m/d/y, no? One of the date strings is 5/14/2022, so when you pass d/m/y as dateformat, it will try to parse this string and conclude that 14 is not a valid month so this isn’t a date column.

Thank you.
Also for the patience you put into the explanations.
I completely missed the reversal of the month with the day in the input format …

1 Like

No worries - as for your actual question I now understand a bit better, but don’t think I’ve got a better solution.

Your final filter call doesn’t work for me on DataFrames 1.3.2, Julia 1.7.2, but I can replicate your result using the (imo slightly clearer)

filter(r -> any(isequal("X"), r), udf)
1 Like

Could you find an “easy” way to use the subset () function in this case?
Although, it comes to mind now, it is preferable to filter the datafreme sdf (stacked df) on the single column :value, rather than filtering at the end udf.

@rocco_sprmnt21 Regarding your last question do you want a way to rewrite:

filter(r -> any(isequal("X"), r), udf)

using subset?

If this is what you are after then you can write:

subset(udf, AsTable(:) => ByRow(r -> any(isequal("X"), r))
1 Like