Tidying data: DataFrame

Hello, I have been trying to do some basic Data Analysis with Julia DataFrames and cannot find a way to tidy my data.

I would simply like to use the values of one column to group the df and later plot it. My impression from reading the documentation is that Julia’s groupby works differently from R’s group_by, and the grouped df does not have the shape I would expect after calling group_by in R. Could you please point me to a Julia function/macro that could help (ideally still in DataFrames.jl, since I expect it has to be there)?

code example:

>println(first(df,1))

1×4 DataFrame
│ Row │ slice             │ count_ │ img_name                      │ circ        │
│     │ String            │ Int64  │ String                        │ String      │
├─────┼───────────────────┼────────┼───────────────────────────────┼─────────────┤
│ 1   │ Image[name=2,dim] │ 1      │ Labeling[\nname=ImageA7.tif;] │ 0,733056435 │
 
groupby(df, :img_name)

# the general look of the df output remains the same, while the type is now, I believe, SubDataFrame

Glad you are working with DataFrames.

I think there might be some confusion about what groupby does in both dplyr and DataFrames. In both cases, they create a new object, not a modified one. I think you want

gdf = groupby(df, :img_name)

R should work the same way, where grouping gives you a new object. However R defines more operations on a grouped data frame. For instance, in R you can rename columns in a grouped data frame. You can’t do this in Julia, you can only rename DataFrames.
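To illustrate the point that grouping returns a new object, here is a minimal sketch. The data values are made up, and it assumes a recent DataFrames.jl release in which `combine` accepts the `source => function => target` pair syntax:

```julia
using DataFrames, Statistics

# Toy data; column names mirror the thread, values are invented for illustration.
df = DataFrame(img_name = ["ImageA7", "ImageA6", "ImageA7"],
               circ = [0.73, 0.79, 0.75])

# groupby returns a new GroupedDataFrame; df itself is left unchanged.
gdf = groupby(df, :img_name)

# One summary row per group, roughly like group_by() followed by summarise() in dplyr.
summary_df = combine(gdf, :circ => mean => :mean_circ)
```

Printing `df` after this still shows the original three rows; the grouping lives only in `gdf`.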


Can you please describe what you want to achieve, using some concrete columns? Then we can probably help you find the proper command for it.

Oh, I see, thank you!

Thank you for taking the time. I wanted to make a new column based on the img_name column, so that the grouped df had a new column, e.g. img_name_Labelling_name, containing only values like ImageA7 with no additional variable information. Later I want to plot the circ values grouped by the values of that new column (here ImageA7).

If I understand you correctly, you could do the following (you have not provided code to reproduce your request, nor said which plotting package you use, so I am using placeholder functions).

df.img_name_Labelling_name = extract_name(df.img_name)
for sdf in groupby(df, :img_name_Labelling_name)
    plot(sdf.circ) # or whatever plotting you need
end
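One possible way to fill in the `extract_name` placeholder is sketched below. The helper and its regular expression are hypothetical (they are not part of DataFrames.jl) and assume labels of the form shown later in the thread. Because this version works on one string at a time, it is broadcast over the column with a dot:

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function): pull "ImageA7" out of a
# label such as "Labeling[\nname=ImageA7.tif;\nsource=;...]".
function extract_name(label::AbstractString)
    m = match(r"name=([^.;]+)", label)
    # Fall back to the unchanged label if the pattern is not found.
    return m === nothing ? String(label) : String(m.captures[1])
end

# Two made-up rows in the label format from the thread.
df = DataFrame(img_name = [
    "Labeling[\nname=ImageA7.tif;\nsource=;\ndimensions=512,512 (X,Y)]",
    "Labeling[\nname=ImageA6.tif;\nsource=;\ndimensions=512,512 (X,Y)]",
])

# Broadcast the helper over the column to create the short-name column.
df.img_name_Labelling_name = extract_name.(df.img_name)
```

This avoids maintaining a hand-written lookup table of long labels, at the cost of depending on the label format staying the same.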

Oh, sorry, I thought those snippets would be enough. Thank you for the for loop; it looked like what I wanted, but after modifying it (basically bypassing extract_name) it hasn’t worked for me and gives blank output:

just modified for loop:

p = plot()
for sdf in groupby(df, :img_name)
    histogram!(sdf.f_circ)
end
p.show()

the whole code:

using CSV, DataFrames, Statistics, RCall, Plots; pyplot();
folder_path = ".../file.csv"

#read csv and rename the Cols of it
df = CSV.read(folder_path, copycols = true, header = 0)
hd = Dict("Col1" => "slice", "Col2" => "count_",
          "Col3" => "img_name", "Col4" => "circ")
rename!(df, [Symbol("Col$i") for i in 1:size(df,2)])
rename!(df, hd)

#TIDYING DATA:
#1. shorten redundant names with the help of manually composed Dict:
unique_names = Array(unique!(df.img_name))
sort(unique_names)#copied names from unique_names later
name_dict = Dict(
"Labeling[\nname=ImageA0.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA0",
"Labeling[\nname=ImageA1.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA1",
"Labeling[\nname=ImageA2.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA2",
"Labeling[\nname=ImageA3.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA3",
"Labeling[\nname=ImageA4.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA4",
"Labeling[\nname=ImageA5.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA5",
"Labeling[\nname=ImageA6.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA6",
"Labeling[\nname=ImageA7.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA7",
"Labeling[\nname=ImageA8.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA8",
"Labeling[\nname=ImageA9.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA9")


replace!(df[!, :img_name], name_dict...)
    df.slice = replace.(df[!, :slice], "\nsource=;" => "")
    df.slice = replace.(df[!, :slice], "\ndimensions" => "dim")
    df.slice = replace.(df[!, :slice], "\npixel" => "")
    println("\t **after** renaming and before parsing: ", first(df,5))

#2. Preparing for parsing: replacing decimal commas with dots to parse as Float later
df.f_circ = replace.(df[:, :circ], "0," => "0.")
for i in 1:size(df,1)
    if startswith.(df[i, :circ], "0,") || startswith.(df[i, :circ], "1,")
        df.f_circ[i] = replace.(df[i, :circ], "0," => "0.")
        df.f_circ[i] = replace.(df[i, :f_circ], "1," => "1.")
    end
end

#Investigation
df[!,:f_circ] = parse.(Float16,df[!, :f_circ])
    println("\t **after** renaming and parsing: ",first(df,5))

#groupby for loop **here**

output of the last println under "Investigation" (i.e. what the data in df looks like):

**after** renaming and parsing: 5×5 DataFrame
│ Row │ slice                                                        │ count_ │ img_name │ circ        │ f_circ  │
│     │ String                                                       │ Int64  │ String   │ String      │ Float16 │
├─────┼──────────────────────────────────────────────────────────────┼────────┼──────────┼─────────────┼─────────┤
│ 1   │ Image[\nname=1;dim=14,14 (X,Y);\nmin=325,99; type=BitType)]  │ 1      │ ImageA7  │ 0,733056435 │ 0.733   │
│ 2   │ Image[\nname=2;dim=19,20 (X,Y);\nmin=196,123; type=BitType)] │ 2      │ ImageA6  │ 0,787814224 │ 0.7876  │
│ 3   │ Image[\nname=3;dim=14,12 (X,Y);\nmin=50,138; type=BitType)]  │ 3      │ ImageA4  │ 0,749961791 │ 0.75    │
│ 4   │ Image[\nname=4;dim=13,14 (X,Y);\nmin=339,141; type=BitType)] │ 4      │ ImageA5  │ 0,715588033 │ 0.716   │
│ 5   │ Image[\nname=5;dim=1,1 (X,Y);\nmin=474,143; type=BitType)]   │ 5      │ ImageA1  │ 0           │ 0.0     │

Sorry for the generally messy thread, and thanks for the replies.
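As a side note on the comma-to-dot step in the listing above, the loop can probably be collapsed into a single broadcast. This is a minimal sketch with made-up values in the same decimal-comma format; it assumes every entry is a plain number with at most one comma and no thousands separators:

```julia
using DataFrames

# Made-up values in the decimal-comma format shown in the thread.
df = DataFrame(circ = ["0,733056435", "0,787814224", "1,0", "0"])

# Swap the decimal comma for a dot in every entry, then parse to Float64.
df.f_circ = parse.(Float64, replace.(df.circ, "," => "."))
```

Float64 is used here instead of Float16 because Float16 keeps only about three significant decimal digits, which is why the output above shows 0.733 for 0,733056435.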

Seeing the output, I am not sure what you are trying to achieve. You have one observation per group, so what do you expect the histogram to look like?


With one observation per group (here :img_name contains ImageA0 to ImageA9), I would expect to see just one histogram from the for loop (with bars on the x-axis labelled ImageA0 to ImageA9 and the y-axis showing f_circ values).

With more observations, I expected the number of histograms to equal the number of added observations (so if :img_name contained ImageA0_A and ImageA0_B through ImageA9_A and ImageA9_B, I would expect two histograms, _A and _B, with the same parameters as described above). For now, I am trying the first case to practice and learn on dummy data.

But with one observation (as in the data frame you have shared) it will be “one bar”, not “bars”.


Probably it would be simplest if you could share your data and the code you now use so that we can help you fix it (if sharing them is possible).

Yes, I would love that, thanks for the help! The full code is essentially what I have posted previously, and it is in an archive together with the .csv I used, on Google Drive here: https://drive.google.com/file/d/1yKMIZMVUFZ6rHognlFmtu9pm37YX4RLi/view?usp=sharing

Here are my comments on your code (I concentrate on major things):

Array(unique!(df.img_name))

is a serious bug - it resizes :img_name column in place and corrupts df. You will get errors when trying to work with df after this operation. Just write:

unique(df.img_name)

instead.
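A small sketch of the non-mutating version on made-up data: `unique` returns a new vector, so the column, and therefore df, keeps all of its rows.

```julia
using DataFrames

# Made-up data with a repeated image name.
df = DataFrame(img_name = ["ImageA7", "ImageA6", "ImageA7"], count_ = [1, 2, 3])

# unique (no !) allocates a new vector and leaves the column untouched.
unique_names = unique(df.img_name)

# sort also returns a new vector, so the data frame is still intact afterwards.
sorted_names = sort(unique_names)
```

In Julia, functions whose names end in `!` modify their argument by convention, so calling one directly on a column of a data frame deserves a second look.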

If you fix this, everything works correctly, except that you need to display your plot:

p = plot()
for sdf in groupby(df, :img_name)
    histogram!(sdf.f_circ)
end
display(p)

Ah yeah, it works now and I can move on. Thanks for helping me out!