Tidying data: DataFrame

Auhen_Shauchuk · June 3, 2020, 7:24pm

Hello, I have been trying to do some basic data analysis with Julia DataFrames and cannot find a way to tidy my data.

I would simply like to group df by the values of one of its columns and plot the result later. My impression from reading the documentation is that Julia’s groupby works differently from R’s group_by: the grouped df does not have the shape I would expect after calling group_by in R. Could you please point me to the Julia function or macro that would help (ideally still in DataFrames.jl, since I expect it has to be there)?

code example:

>println(first(df,1))

1×4 DataFrame
│ Row │ slice             │ count_ │ img_name                      │ circ        │
│     │ String            │ Int64  │ String                        │ String      │
├─────┼───────────────────┼────────┼───────────────────────────────┼─────────────┤
│ 1   │ Image[name=2,dim] │ 1      │ Labeling[\nname=ImageA7.tif;] │ 0,733056435 │
 
groupby(df, :img_name)

# the general look of the df output remains the same, while the type is now, I believe, SubDataFrame

pdeffebach · June 3, 2020, 7:36pm

Glad you are working with DataFrames.

I think there might be some confusion about what groupby does in both dplyr and DataFrames. In both cases, they create a new object, not a modified one. I think you want

gdf = groupby(df, :img_name)

R should work the same way, where grouping gives you a new object. However, R defines more operations on a grouped data frame. For instance, in R you can rename columns in a grouped data frame. You can’t do this in Julia; you can only rename DataFrames.
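A minimal sketch of the point above, using made-up toy data (only the column names come from the thread): groupby returns a new GroupedDataFrame and leaves df as it was, and each group you iterate over is a SubDataFrame. If a column needs renaming, one option is to rename the data frame first and group afterwards.

```julia
using DataFrames

# Toy data; the values are invented for illustration.
df = DataFrame(img_name = ["ImageA0", "ImageA1", "ImageA0", "ImageA1"],
               f_circ   = [0.73, 0.79, 0.75, 0.72])

# groupby returns a new GroupedDataFrame; df itself is not modified.
gdf = groupby(df, :img_name)

# Iterating over gdf yields one SubDataFrame (a view into df) per group.
group_sizes = [nrow(sdf) for sdf in gdf]

# Rename on the data frame, then group the renamed copy.
renamed = rename(df, :img_name => :image)
gdf2 = groupby(renamed, :image)
```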

bkamins · June 3, 2020, 7:42pm

Can you please describe what you want to achieve, using some concrete columns? Then we can probably help you find a proper command to achieve it.

Auhen_Shauchuk · June 3, 2020, 10:14pm

Oh, I see, thank you!

Auhen_Shauchuk · June 3, 2020, 10:26pm

Thank you for taking the time. I wanted to make a new column based on the img_name column, so that the grouped df had a new column, e.g. img_name_Labelling_name, containing only values like ImageA7 and no additional variable information. I would then plot circ values grouped by the values of the new column (here ImageA7).

bkamins · June 4, 2020, 6:36am

If I understand you correctly, you could do the following (you have not provided code to reproduce your request, nor said which plotting package you use, so I am using placeholder functions):

df.img_name_Labelling_name = extract_name(df.img_name)
for sdf in groupby(df, :img_name_Labelling_name)
    plot(sdf.circ) # or whatever plotting you need
end
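For illustration only, the extract_name placeholder might look something like the sketch below. It assumes the strings in :img_name contain a fragment of the form name=<something>.tif, as in the row shown earlier; the regex and the fallback to the original string are hypothetical choices, not something DataFrames.jl provides.

```julia
# Hypothetical implementation of the extract_name placeholder:
# pull out the text between "name=" and ".tif" from every string.
function extract_name(names::AbstractVector{<:AbstractString})
    map(names) do s
        m = match(r"name=(.+?)\.tif", s)
        # Keep the original string when the pattern is not found.
        m === nothing ? String(s) : String(m.captures[1])
    end
end

raw = ["Labeling[\nname=ImageA7.tif;]", "no label here"]
short = extract_name(raw)
```

Compared with a hand-written Dict of every long name, this keeps working when more images are added, as long as they follow the same naming pattern.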

Auhen_Shauchuk · June 5, 2020, 2:04am

Oh, sorry, I thought those snippets would be enough. Thank you for the for loop; it looked like what I wanted, but after modifying it (basically working around extract_name) it hasn’t worked for me and gives blank output:

just modified for loop:

p = plot()
for sdf in groupby(df, :img_name)
    histogram!(sdf.f_circ)
end
p.show()

the whole code:

using CSV, DataFrames, Statistics, RCall, Plots; pyplot();
folder_path = ".../file.csv"

#read csv and rename the Cols of it
df = CSV.read(folder_path, copycols = true, header = 0)
    hd = Dict("Col1" =>"slice", "Col2" =>"count_",
        "Col3" => "img_name", "Col4" => "circ")
        rename!(df, [Symbol("Col$i") for i in 1:size(df,2)])
        rename!(df, hd)

#TIDYING DATA:
#1. shorten redundant names with the help of manually composed Dict:
unique_names = Array(unique!(df.img_name))
sort(unique_names)#copied names from unique_names later
name_dict = Dict(
"Labeling[\nname=ImageA0.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA0",
"Labeling[\nname=ImageA1.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA1",
"Labeling[\nname=ImageA2.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA2",
"Labeling[\nname=ImageA3.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA3",
"Labeling[\nname=ImageA4.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA4",
"Labeling[\nname=ImageA5.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA5",
"Labeling[\nname=ImageA6.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA6",
"Labeling[\nname=ImageA7.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA7",
"Labeling[\nname=ImageA8.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA8",
"Labeling[\nname=ImageA9.tif;\nsource=;\ndimensions=512,512 (X,Y)]"=> "ImageA9")


replace!(df[!, :img_name], name_dict...)
    df.slice = replace.(df[!, :slice], "\nsource=;" => "")
    df.slice = replace.(df[!, :slice], "\ndimensions" => "dim")
    df.slice = replace.(df[!, :slice], "\npixel" => "")
    println("\t **after** renaming and before parsing: ", first(df,5))

#2. Preparing for parsing: replacing dots for comma to parse as Float later
df.f_circ = replace.(df[:, :circ], "0," => "0.")
for i in 1:size(df,1)
    if startswith.(df[i, :circ], "0,") || startswith.(df[i, :circ], "1,")
        df.f_circ[i] = replace.(df[i, :circ], "0," => "0.")
        df.f_circ[i] = replace.(df[i, :f_circ], "1," => "1.")
    end
end

#Investigation
df[!,:f_circ] = parse.(Float16,df[!, :f_circ])
    println("\t **after** renaming and parsing: ",first(df,5))

#groupby for loop **here**

output of the last println under Investigation (hence the look of the data in df):

**after** renaming and parsing: 5×5 DataFrame
│ Row │ slice                                                        │ count_ │ img_name │ circ        │ f_circ  │
│     │ String                                                       │ Int64  │ String   │ String      │ Float16 │
├─────┼──────────────────────────────────────────────────────────────┼────────┼──────────┼─────────────┼─────────┤
│ 1   │ Image[\nname=1;dim=14,14 (X,Y);\nmin=325,99; type=BitType)]  │ 1      │ ImageA7  │ 0,733056435 │ 0.733   │
│ 2   │ Image[\nname=2;dim=19,20 (X,Y);\nmin=196,123; type=BitType)] │ 2      │ ImageA6  │ 0,787814224 │ 0.7876  │
│ 3   │ Image[\nname=3;dim=14,12 (X,Y);\nmin=50,138; type=BitType)]  │ 3      │ ImageA4  │ 0,749961791 │ 0.75    │
│ 4   │ Image[\nname=4;dim=13,14 (X,Y);\nmin=339,141; type=BitType)] │ 4      │ ImageA5  │ 0,715588033 │ 0.716   │
│ 5   │ Image[\nname=5;dim=1,1 (X,Y);\nmin=474,143; type=BitType)]   │ 5      │ ImageA1  │ 0           │ 0.0     │

Sorry for the generally messy thread, and thanks for looking and replying.

bkamins · June 5, 2020, 5:58am

Seeing the output, I am not clear what you are trying to achieve. You have one observation per group, so what do you expect the histogram to look like?

Auhen_Shauchuk · June 5, 2020, 2:37pm

With one observation per group (here :img_name containing ImageA0 to ImageA9) I would expect to see just one histogram from the for loop (with bars on the x-axis for ImageA0 to ImageA9 and the y-axis showing f_circ values).

With more observations I expected the number of histograms to equal the number of added observations (so if :img_name contained ImageA0_A and ImageA0_B up to ImageA9_A and ImageA9_B, I would expect two histograms, _A and _B, with the same parameters as described above). But for now I am trying the first case, to practice and learn on dummy data.

bkamins · June 5, 2020, 2:45pm

But with one observation (as in the data frame you have shared) it will be “one bar”, not “bars”.

Probably it would be simplest if you could share your data and the code you now use so that we can help you fix it (if sharing them is possible).

Auhen_Shauchuk · June 5, 2020, 4:59pm

Yes, I would love that, thanks for the help! The full code is essentially what I posted previously, and it is in an archive together with the .csv I used, on Google Drive here: circularity._julia_Shauchuk.zip - Google Drive

bkamins · June 5, 2020, 8:45pm

Here are my comments on your code (I concentrate on major things):

Array(unique!(df.img_name))

is a serious bug: it resizes the :img_name column in place and corrupts df. You will get errors when trying to work with df after this operation. Just write:

unique(df.img_name)

instead.
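A minimal sketch of the difference on a plain vector (toy values, not the data from the thread). Because df.img_name returns the column itself rather than a copy, unique! shortens that one column while the other columns keep their original length.

```julia
v = ["b", "a", "b", "c", "a"]

# unique returns a new vector and leaves v untouched.
u = unique(v)

# unique! removes the duplicates from its argument in place,
# so the vector it is called on becomes shorter.
w = copy(v)
unique!(w)
```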

If you fix this, all works correctly, except that you need to display your plot:

p = plot()
for sdf in groupby(df, :img_name)
    histogram!(sdf.f_circ)
end
display(p)
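As a minor side note, the comma-to-dot loop could probably be reduced to one broadcast call. The sketch below uses a hypothetical helper name, parse_comma_decimal, and assumes every value has at most one comma, acting as the decimal separator. It also parses to Float64 rather than Float16, since Float16 keeps only about three significant decimal digits (which is why 0,787814224 shows up as 0.7876 in the output above).

```julia
# Hypothetical helper: parse a number written with a decimal comma.
parse_comma_decimal(s::AbstractString) = parse(Float64, replace(s, "," => "."))

# Sample values copied from the circ column shown in the thread.
circ = ["0,733056435", "0,787814224", "0"]

# Broadcasting applies the helper to every element.
f_circ = parse_comma_decimal.(circ)
```

With a data frame this would be df.f_circ = parse_comma_decimal.(df.circ).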

Auhen_Shauchuk · June 5, 2020, 9:38pm

Ah yeah, it works now and I can go on, thanks for helping me out!
