Is it bad practice to assign values to the String entries of a DataFrame?

phantom · October 19, 2022, 6:23am

I have to create a set of DataFrames based on the entries of an existing DataFrame. One approach would be to write a function that turns the entries in a DataFrame into variables and then assigns values to them.

artist = DataFrame("A" =>["Bob Marley", "Kiiara", "Sinead"])

function discography(artist)
    for i = artist
        @eval $(Symbol(i)) = DataFrame("songs"=>[],"concerts" =>[])
    end
end

discography(artist.A)

I actually have two questions about this.

1. I saw in another post that one should avoid using @eval until you thoroughly understand it, because it evaluates in global scope. Is there a better way to do this without using @eval or affecting the entire module?
2. In general, is it bad practice to name things after values in DataFrame entries? I read in another post that dynamically naming things is frowned upon because it can lead to bugs in the code. Is there a less clunky approach that is generally taken when a new DataFrame needs to be created from the entries of an existing one and stored?

sylvaticus · October 19, 2022, 9:09am

It isn’t very clear to me what you want to achieve… that code doesn’t do much…

I would organise the data into separate tables, one per aspect (i.e. treat it like a relational database), and then use a join function to retrieve the specific information you need. For example:

using DataFrames, Dates

artists     = DataFrame(name = ["Bob Marley", "Kiiara", "Sinead"], country = ["country1","country2","country3"])
songs       = DataFrame(artist = ["Bob Marley","Bob Marley","Sinead"], songs = ["a song of BM", "another song of BM", "a song of Sinead"])
concerts    = DataFrame(artist = ["Bob Marley","Kiiara","Kiiara"], location = ["New York","London","Paris"], date = [Date("1975-1-31"),Date("1980-1-31"),Date("1985-12-1")])

artist_songs = innerjoin(artists,songs,on=["name"=>"artist"] )
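Once the tables are joined, pulling out one artist's rows is a filter rather than a variable lookup. A minimal sketch reusing the artists and songs tables from above; the subset call and the left join are illustrative additions, not part of the original suggestion:

```julia
using DataFrames

artists = DataFrame(name = ["Bob Marley", "Kiiara", "Sinead"],
                    country = ["country1", "country2", "country3"])
songs = DataFrame(artist = ["Bob Marley", "Bob Marley", "Sinead"],
                  songs = ["a song of BM", "another song of BM", "a song of Sinead"])

# Join on the artist name, then keep only the rows for one artist.
artist_songs = innerjoin(artists, songs, on = :name => :artist)
marley_songs = subset(artist_songs, :name => ByRow(==("Bob Marley")))

# An inner join drops artists without songs; a left join keeps them,
# with `missing` in the songs column.
all_artists = leftjoin(artists, songs, on = :name => :artist)
```

Here marley_songs holds the two Bob Marley rows, and all_artists still contains Kiiara even though no song is recorded for her.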

nilshg · October 19, 2022, 9:25am

The things you’ve read are generally correct - usually when you are dynamically creating variables like this there’s a better approach.

If you want help with (potentially!) finding this better approach, you need to tell us a bit more about what your aim is though. Given your MWE one might consider writing the following:

julia> discographies = Dict(x => DataFrame(songs = [], concerts = []) for x ∈ artist.A)
Dict{String, DataFrame} with 3 entries:
  "Bob Marley" => 0×2 DataFrame
  "Sinead"     => 0×2 DataFrame
  "Kiiara"     => 0×2 DataFrame

julia> discographies["Bob Marley"]
0×2 DataFrame

but it’s hard to tell whether that’s a good idea without knowing more about where you want to go from here.
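If the Dict route fits, a possible refinement is to give the empty columns concrete element types so that later push! calls are type-checked. This is only a sketch: the String column types and the example row are assumptions, since the real data isn't known yet.

```julia
using DataFrames

artist = DataFrame("A" => ["Bob Marley", "Kiiara", "Sinead"])

# Typed empty columns instead of `[]`, which would give Vector{Any} columns.
discographies = Dict(name => DataFrame(songs = String[], concerts = String[])
                     for name in artist.A)

# Add a row to one artist's table; the values here are made up.
push!(discographies["Bob Marley"], ("a song of BM", "New York"))

# Iterate over all artists without creating one variable per artist.
for (name, df) in discographies
    println(name, ": ", nrow(df), " row(s)")
end
```

The artist name stays ordinary data (a Dict key), so nothing has to be defined in global scope and no @eval is needed.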

phantom · October 20, 2022, 12:53am

Thanks so much, guys, these responses are incredibly helpful! Sorry, the description was not a very good attempt at simplifying and generalizing the question.

Specifically, I have a GroupedDataFrame with a few thousand SubDataFrames, e.g.

source = DataFrame("artist"=>["Marley","Marley", "Marley", "Sinead", "Sinead"], "Plays"=>[100,200,25,50,60], "Dates" =>[1,2,3,1,2])
5×3 DataFrame
 Row │ artist  Plays  Dates 
     │ String  Int64  Int64 
─────┼──────────────────────
   1 │ Marley    100      1
   2 │ Marley    200      2
   3 │ Marley     25      3
   4 │ Sinead     50      1
   5 │ Sinead     60      2

artists = groupby(source, :artist)

I have a wrapper function that then iterates the following N functions over the rows of specific columns of the SubDataFrames, e.g.,

function concert_1(plays)
    # some function
end
.
.
function concert_N(plays)
    # some function
end

concert_1(artists[1].Plays)

The outputs of concert_n(artists[n].Plays) would then be ranked against each other for purposes of optimization.

I intended to store the output of each function in a new DataFrame, naming them as described above. Upon reading your answers I realize this is suboptimal. I think the dictionary method works and I assume to populate it I would use:

for a = keys(discographies)
    push!(discographies[a], [data1, data2])
end

Is there a better approach to store and compare the data in situations like this? Will using a dictionary have a negative impact on performance?

dlakelan · October 20, 2022, 2:41am

I think you just want @by(mydf,:artist,:mysummary=summarizer(:mycolumn))

phantom · October 20, 2022, 11:48pm

Sorry, would you mind expanding on this a little? I can’t seem to find anything in the docs on @by or summarizer.

dlakelan · October 21, 2022, 12:29am

summarizer was a stand-in for whatever function you’re using to summarize stuff… @by is a super handy macro from DataFramesMeta.jl, which I consider a core package that everyone who uses DataFrames should check out and definitely use a lot.

The @by macro splits a data frame “by” the value of the first argument after the data frame, and then for each group, it creates new columns by calling the summarization functions you supply on that sub-group, and then combines the results back into a new DataFrame. This is the “split-apply-combine” pattern.

For example:

@by(mydataframe,:county,:meanincome = mean(:income))

will give you a new data frame with columns “county” and “meanincome” (mean comes from the Statistics standard library).

In your case, it sounds like you want to run multiple summary functions on each group, and keep track of the results of each of those, so you’ll have multiple assignments…

@by(mydataframe, :artist, :summary1 = summarizer1(:column_for_1), :summary2 = summarizer2(:column_for_2,:other_column_for_2), ... )

where you add more column assignments in place of the ...
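To make that concrete, here is a small sketch using the source table from earlier in the thread. The two summarizers, total_plays and plays_per_date, are hypothetical stand-ins for concert_1 … concert_N and are not taken from anyone's real code:

```julia
using DataFrames, DataFramesMeta

source = DataFrame(artist = ["Marley", "Marley", "Marley", "Sinead", "Sinead"],
                   Plays = [100, 200, 25, 50, 60],
                   Dates = [1, 2, 3, 1, 2])

# Hypothetical summarizers; each receives the column vector(s) of one group.
total_plays(plays) = sum(plays)
plays_per_date(plays, dates) = sum(plays) / length(unique(dates))

# One row per artist, one column per summary.
result = @by(source, :artist,
             :total = total_plays(:Plays),
             :per_date = plays_per_date(:Plays, :Dates))
```

Because each summarizer returns a single number per group, result has one row per artist, which makes it straightforward to sort or rank the groups afterwards.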

phantom · October 21, 2022, 3:32am

Awesome, thanks so much for the explanation! Quick follow-up though: what happens if the summarizer function involves data from different columns? I ask because if I do something like

@by(source, :artist, :Dailes = :Plays .+ :Dates)

I get the expected result of

5×2 DataFrame
 Row │ artist  Dailes 
     │ String  Int64  
─────┼────────────────
   1 │ Marley     101
   2 │ Marley     202
   3 │ Marley      28
   4 │ Sinead      51
   5 │ Sinead      62

But if I have the summarizer incorporated into a function

function h(x)
       x.Plays .+ x.Dates
end

then

h(source)

yields

5-element Vector{Int64}:
 101
 202
  28
  51
  62

but

@by(source, :artist, :Dailes = h(source))

yields

10×2 DataFrame
 Row │ artist  Dailes 
     │ String  Int64  
─────┼────────────────
   1 │ Marley     101
   2 │ Marley     202
   3 │ Marley      28
   4 │ Marley      51
   5 │ Marley      62
   6 │ Sinead     101
   7 │ Sinead     202
   8 │ Sinead      28
   9 │ Sinead      51
  10 │ Sinead      62

Why isn’t :Dailes == h(source)? Am I using @by incorrectly, or is there something wrong with the function for this purpose?

dlakelan · October 21, 2022, 3:34am

Instead, have h take columns, like

function h(plays,dates)
    # ...do stuff here
end

then in your @by

@by(source, :artist, :summary1 = h(:Plays,:Dates))

EDIT: to explain further, the reason your code doesn’t work is that h receives the entire data frame, not the group…

@by(source, :artist, :Dailes = h(source))

h(source) is taking the entire data frame as a constant, the same for every group, so each of the 2 groups gets all 5 values and the result has 10 rows…
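A sketch of the column-taking version, with h filled in using the same Plays .+ Dates computation from the question. The combine call at the end is my reading of the equivalent plain DataFrames.jl form, included only for comparison:

```julia
using DataFrames, DataFramesMeta

source = DataFrame(artist = ["Marley", "Marley", "Marley", "Sinead", "Sinead"],
                   Plays = [100, 200, 25, 50, 60],
                   Dates = [1, 2, 3, 1, 2])

# h now receives the column vectors of the current group only.
h(plays, dates) = plays .+ dates

per_group = @by(source, :artist, :Dailes = h(:Plays, :Dates))

# The same computation written with plain DataFrames.jl.
plain = combine(groupby(source, :artist), [:Plays, :Dates] => h => :Dailes)
```

Both calls pass h only the rows belonging to one artist at a time, so the result has 5 rows again instead of 10.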

dlakelan · October 21, 2022, 2:17pm

By the way, once you get something working, try doing @macroexpand @by(...) to see what code the @by macro is generating. You’ll find that the macro is packing a fairly big calculation into one line for you. Welcome to understanding the magic of macros enabling domain-specific languages within your main language!

rocco_sprmnt21 · October 21, 2022, 9:20pm

Another way to keep all the info together:

using DataFrames, Dates

artist = DataFrame("A" =>["Bob Marley", "Kiiara", "Sinead"])

df1=DataFrame(songs = ["a song of BM", "another song of BM", "a song of Sinead"], concerts = ["New York","London","Paris"])

df2=DataFrame(songs = ["a song of K", "another song of k"], loc = ["New York","London"], dates=[Date("1975-1-31"),Date("1980-1-31")])

df3=vcat(df1,df2, cols=:union)

julia> cd=combine(artist, :A,:A=>(x->[df1,df2,df3])=>:discographie)
3×2 DataFrame
 Row │ A           discographie  
     │ String      DataFrame
─────┼───────────────────────────
   1 │ Bob Marley  3×2 DataFrame
   2 │ Kiiara      2×3 DataFrame
   3 │ Sinead      5×4 DataFrame
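With the tables nested in a column like this, one artist's DataFrame can be looked up by name. A minimal sketch with shortened, made-up tables; the nested and kiiara names are hypothetical, and the table is built with the DataFrame constructor here instead of combine:

```julia
using DataFrames

artist = DataFrame("A" => ["Bob Marley", "Kiiara", "Sinead"])
df1 = DataFrame(songs = ["a song of BM"], concerts = ["New York"])
df2 = DataFrame(songs = ["a song of K"], loc = ["London"])
df3 = vcat(df1, df2, cols = :union)

# Store the per-artist tables directly as a column, one DataFrame per row.
nested = DataFrame(A = artist.A, discographie = [df1, df2, df3])

# Look up one artist's table by name; `only` errors unless exactly one row matches.
kiiara = only(nested.discographie[nested.A .== "Kiiara"])
```

The lookup returns the stored DataFrame itself, not a copy, so changes made through kiiara are visible in the nested table.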
