Reshaping dataframe

rocco_sprmnt21 · March 17, 2022, 10:54am

I show a way to transform the starting table to have only one “X” per row.
I ask for comments on how to do the various steps differently and / or better.
For example how to insert an “index” column or how to filter the rows that contain “X”, or other like the general setting of the solution.

df
7×7 DataFrame
 Row │ Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Criteria 4    
     │ String    String     Int64  String?     String?     String?     String?       
─────┼────────────────────────────────────────────────────────────────────────────   
   1 │ Sander    1/1/2022       8  X           missing     missing     missing       
   2 │ Sander    1/2/2022       5  missing     X           X           missing       
   3 │ Sander    1/3/2022       7  missing     missing     X           missing       
   4 │ Sander    1/4/2022       8  missing     missing     missing     X
   5 │ Somebody  2/3/2022       4  X           X           missing     X
   6 │ Somebody  4/3/2022       9  X           missing     X           X
   7 │ Somebody  5/14/2022      2  missing     missing     X           missing

sdf=stack(df, r"Criteria")
insertcols!(sdf, 1, :Row => 1:nrow(sdf))
udf=unstack(sdf)
filter(r-> any("X" .=== values(r))  , udf)

PS
Is it possible, in some way, to upload csv files?

nilshg · March 17, 2022, 10:55am

Instead of uploading a csv you can just copy/paste a string representation like here:

https://stackoverflow.com/questions/61723347/how-to-provide-reproducible-sample-data-in-julia/61724167

rocco_sprmnt21 · March 17, 2022, 10:58am

tanks!
this way?

string_representation = String(take!(CSV.write(IOBuffer(), df)))
"Name,Date,Hours,Criteria 1,Criteria 2,Criteria 3,Criteria 4\nSander,1/1/2022,8,X,,,\nSander,1/2/2022,5,,X,X,\nSander,1/3/2022,7,,,X,\nSander,1/4/2022,8,,,,X\nSomebody,2/3/2022,4,X,X,,X\nSomebody,4/3/2022,9,X,,X,X\nSomebody,5/14/2022,2,,,X,\n"

nilshg · March 17, 2022, 10:59am

Yep:

julia> CSV.read(IOBuffer(string_representation), DataFrame)
7×7 DataFrame
 Row │ Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Criteria 4
     │ String    String     Int64  String?     String?     String?     String?
─────┼────────────────────────────────────────────────────────────────────────────
   1 │ Sander    1/1/2022       8  X           missing     missing     missing
   2 │ Sander    1/2/2022       5  missing     X           X           missing
   3 │ Sander    1/3/2022       7  missing     missing     X           missing
   4 │ Sander    1/4/2022       8  missing     missing     missing     X
   5 │ Somebody  2/3/2022       4  X           X           missing     X
   6 │ Somebody  4/3/2022       9  X           missing     X           X
   7 │ Somebody  5/14/2022      2  missing     missing     X           missing

rocco_sprmnt21 · March 17, 2022, 11:02am

Tanks again.
I add one more question.
How can you do to correctly interpret the strings of the “Date” column?

nilshg · March 17, 2022, 12:13pm

Same as with any other csv:

julia> CSV.read(IOBuffer(string_representation), DataFrame; dateformat = "m/d/y")

nilshg · March 17, 2022, 12:17pm

Also sorry this was a complete distraction from your real question, which I admit I don’t understand.

Could you show the expected result? When I run your code I end up with:

julia> filter(r-> any("X" .=== values(r))  , udf)
0×8 DataFrame

which I assume isn’t what you’re after?

rocco_sprmnt21 · March 17, 2022, 2:04pm

the result should be, and to me it actually is, this:

julia> filter(r-> any("X" .=== values(r))  , udf)
12×8 DataFrame
 Row │ Row    Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Crite ⋯
     │ Int64  String    String     Int64  String?     String?     String?     Strin ⋯
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │     1  Sander    1/1/2022       8  X           missing     missing     missi ⋯
   2 │     5  Somebody  2/3/2022       4  X           missing     missing     missi  
   3 │     6  Somebody  4/3/2022       9  X           missing     missing     missi  
   4 │     9  Sander    1/2/2022       5  missing     X           missing     missi  
   5 │    12  Somebody  2/3/2022       4  missing     X           missing     missi ⋯
   6 │    16  Sander    1/2/2022       5  missing     missing     X           missi  
   7 │    17  Sander    1/3/2022       7  missing     missing     X           missi  
   8 │    20  Somebody  4/3/2022       9  missing     missing     X           missi  
   9 │    21  Somebody  5/14/2022      2  missing     missing     X           missi ⋯
  10 │    25  Sander    1/4/2022       8  missing     missing     missing     X      
  11 │    26  Somebody  2/3/2022       4  missing     missing     missing     X      
  12 │    27  Somebody  4/3/2022       9  missing     missing     missing     X

julia> versioninfo()
Julia Version 1.7.0-beta3.0
DataFrames v1.4.0 `https://github.com/JuliaData/DataFrames.jl.git#main

rocco_sprmnt21 · March 17, 2022, 2:07pm

I tried to use the aprameter dateformat = “d / m / yyyy” but it didn’t give me the expected result.
but even so it is no different

julia> df=CSV.read("unpivot pivot.CSV", DataFrame, dateformat="d/m/y")
7×7 DataFrame
 Row │ Name      Date       Hours  Criteria 1  Criteria 2  Criteria 3  Criteria 4    
     │ String    String     Int64  String?     String?     String?     String?       
─────┼────────────────────────────────────────────────────────────────────────────   
   1 │ Sander    1/1/2022       8  X           missing     missing     missing       
   2 │ Sander    1/2/2022       5  missing     X           X           missing       
   3 │ Sander    1/3/2022       7  missing     missing     X           missing       
   4 │ Sander    1/4/2022       8  missing     missing     missing     X
   5 │ Somebody  2/3/2022       4  X           X           missing     X
   6 │ Somebody  4/3/2022       9  X           missing     X           X
   7 │ Somebody  5/14/2022      2  missing     missing     X           missing

nilshg · March 17, 2022, 2:09pm

That’s because the dateformat is m/d/y, no? One of the date strings is 5/14/2022, so when you pass d/m/y as dateformat, it will try to parse this string and conclude that 14 is not a valid month so this isn’t a date column.

rocco_sprmnt21 · March 17, 2022, 2:19pm

Thank you.
Also for the patience you put into the explanations.
I completely missed the reversal of the month with the day in the input format …

nilshg · March 17, 2022, 2:33pm

No worries - as for your actual question I now understand a bit better, but don’t think I’ve got a better solution.

Your final filter call doesn’t work for me on DataFrames 1.3.2, Julia 1.7.2, but I can replicate your result using the (imo slightly clearer)

filter(r -> any(isequal("X"), r), udf)

rocco_sprmnt21 · March 17, 2022, 2:44pm

Could you find an “easy” way to use the subset () function in this case?
Although, it comes to mind now, it is preferable to filter the datafreme sdf (stacked df) on the single column :value, rather than filtering at the end udf.

bkamins · March 17, 2022, 2:52pm

@rocco_sprmnt21 Regarding your last question do you want a way to rewrite:

filter(r -> any(isequal("X"), r), udf)

using subset?

If this is what you are after then you can write:

subset(udf, AsTable(:) => ByRow(r -> any(isequal("X"), r))

Topic		Replies	Views
Reshaping ID and STatus by CATegory Data	2	310	May 24, 2021
UnStack dataframe with aggregated string column General Usage dataframes	3	306	May 14, 2022
Transform DataFrame into a single row by adding row contents as new columns New to Julia dataframes	3	321	May 1, 2021
[DataFrames Question]: How to convert single column with row of dictionary to multiple columns Specific Domains question , dataframes	4	520	May 14, 2022
How do I not skip missing values when using stack / unstack for pivot tables General Usage	2	453	July 17, 2020

Reshaping dataframe

Related topics