Is it bad practice to assign values to the String entries of a DataFrame?

I have to create a set of DataFrames based off of the entries of an existing DataFrame. One way to approach this would be to create a function that turns the entries in a DataFrame into variables and then assigns values to them.

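using DataFrames

artist = DataFrame("A" => ["Bob Marley", "Kiiara", "Sinead"])

# Creates one global variable per artist name (the pattern being asked about)
function discography(artist)
    for i in artist
        @eval $(Symbol(i)) = DataFrame("songs" => [], "concerts" => [])
    end
end

discography(artist.A)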

I actually have two questions about this.

  1. I saw in another post that one should avoid using @eval until you thoroughly understand it because of its global scope, so I was wondering if there was a better way to do this without using @eval or affecting the entire module.

  2. In general, is it bad practice to name things after values in DataFrame entries? I read in another post that dynamically naming things is frowned upon because it can lead to bugs in the code. Is there a less clunky approach that is generally taken when new DataFrames need to be created from the entries of an existing one and stored?

It isn’t very clear to me what you want to achieve. That code doesn’t do much.

I would organise my data on individual aspects (i.e. make a relational database) and then use some join function to retrieve the specific information you need. For example:

using DataFrames, Dates

artists     = DataFrame(name = ["Bob Marley", "Kiiara", "Sinead"], country = ["country1","country2","country3"])
songs       = DataFrame(artist = ["Bob Marley","Bob Marley","Sinead"], songs = ["a song of BM", "another song of BM", "a song of Sinead"])
concerts    = DataFrame(artist = ["Bob Marley","Kiiara","Kiiara"], location = ["New York","London","Paris"], date = [Date("1975-1-31"),Date("1980-1-31"),Date("1985-12-1")])

# Match artists.name against songs.artist
artist_songs = innerjoin(artists, songs, on = ["name" => "artist"])
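
To make the join idea a bit more concrete, here is a minimal sketch using trimmed-down versions of the tables above. The column name `song` and the variable names `all_artists` and `marley` are only for illustration. It shows the difference between `innerjoin` and `leftjoin`, and one way to pull out a single artist's rows afterwards.

```julia
using DataFrames

artists = DataFrame(name = ["Bob Marley", "Kiiara", "Sinead"])
songs   = DataFrame(artist = ["Bob Marley", "Bob Marley", "Sinead"],
                    song   = ["a song of BM", "another song of BM", "a song of Sinead"])

# innerjoin keeps only artists that have at least one matching song
artist_songs = innerjoin(artists, songs, on = :name => :artist)

# leftjoin keeps every artist; artists without songs get `missing` in :song
all_artists = leftjoin(artists, songs, on = :name => :artist)

# Select the rows for a single artist from the joined table
marley = subset(artist_songs, :name => ByRow(==("Bob Marley")))
```

With this layout there is one table per kind of information, and per-artist views are computed on demand instead of being stored under separate variable names.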

The things you’ve read are generally correct - usually when you are dynamically creating variables like this there’s a better approach.

If you want help with (potentially!) finding this better approach, you need to tell us a bit more about what your aim is though. Given your MWE one might consider writing the following:

julia> discographies = Dict(x => DataFrame(songs = [], concerts = []) for x ∈ artist.A)
Dict{String, DataFrame} with 3 entries:
  "Bob Marley" => 0×2 DataFrame
  "Sinead"     => 0×2 DataFrame
  "Kiiara"     => 0×2 DataFrame

julia> discographies["Bob Marley"]
0×2 DataFrame

but it’s hard to tell whether that’s a good idea without knowing more about where you want to go from here.
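
If the Dict route does turn out to fit, a rough sketch of filling it could look like the following. The typed empty columns (`String[]`) are my assumption about the data; plain `[]` also works but gives `Any`-typed columns.

```julia
using DataFrames

artist = DataFrame(A = ["Bob Marley", "Kiiara", "Sinead"])

# One empty DataFrame per artist, keyed by the artist's name.
# String[] gives typed columns; `[]` would produce Vector{Any} columns.
discographies = Dict(name => DataFrame(songs = String[], concerts = String[])
                     for name in artist.A)

# Rows can be appended with push!, here as a NamedTuple matching the columns
push!(discographies["Bob Marley"], (songs = "a song of BM", concerts = "New York"))

# Look up a discography by name instead of by a generated variable
bm = discographies["Bob Marley"]
```

The lookup key is an ordinary string, so names with spaces or odd characters need no special handling, and nothing is created in global scope.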


Thanks so much guys, these responses are incredibly helpful! Sorry, the description was not a very good attempt at simplifying and generalizing the question.

Specifically, I have a GroupedDataFrame with a few thousand SubDataFrames, e.g.

source = DataFrame("artist"=>["Marley","Marley", "Marley", "Sinead", "Sinead"], "Plays"=>[100,200,25,50,60], "Dates" =>[1,2,3,1,2])
5×3 DataFrame
 Row │ artist  Plays  Dates 
     │ String  Int64  Int64 
─────┼──────────────────────
   1 │ Marley    100      1
   2 │ Marley    200      2
   3 │ Marley     25      3
   4 │ Sinead     50      1
   5 │ Sinead     60      2

artists = groupby(source, :artist) 

I have a wrapper function that then runs each of the following N functions over the rows of specific columns of the SubDataFrames, e.g.:

function concert_1(plays)   # `plays` is a placeholder argument name
    # some function
end
.
.
function concert_N(plays)
    # some function
end

concert_1(artists[1].Plays)

The outputs of concert_n(artists[n].Plays) would then be ranked against each other for purposes of optimization.

I intended to store the output of each function in a new DataFrame, naming them as described above. Upon reading your answers I realize this is suboptimal. I think the dictionary method works and I assume to populate it I would use:

for a in keys(discographies)
    push!(discographies[a], [data1, data2])
end

Is there a better approach to store and compare the data in situations like this? Will using a dictionary have a negative impact on performance?

I think you just want @by(mydf,:artist,:mysummary=summarizer(:mycolumn))


Sorry, would you mind expanding on this a little? I can’t seem to find anything in the docs on @by or summarizer.

summarizer was a stand-in for the function that you’re using to summarize each group. @by is a super handy macro from DataFramesMeta.jl, which I consider a core package that everyone who uses DataFrames should check out and definitely use a lot.

The @by macro splits a data frame “by” the value of the first argument after the data frame. Then, for each group, it creates new variables by calling your summarization functions on that sub-group, and finally it combines the results back into a new DataFrame. This is the “split-apply-combine” pattern.

For example

@by(mydataframe,:county,:meanincome = mean(:income))

will give you a new data frame with columns “county” and “meanincome”
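
For comparison, here is a minimal sketch of the same split-apply-combine steps written with plain DataFrames.jl functions. The county and income values are invented purely for illustration.

```julia
using DataFrames, Statistics

# Hypothetical data, only to have something to group
mydataframe = DataFrame(county = ["A", "A", "B"], income = [10.0, 20.0, 40.0])

# split:   groupby makes one sub-data-frame per county
# apply:   mean is called on each group's :income column
# combine: the per-group results are stacked into one DataFrame
result = combine(groupby(mydataframe, :county), :income => mean => :meanincome)
```

Here `result` has one row per county, with columns `county` and `meanincome`.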

In your case, it sounds like you want to run multiple summary functions on each group and keep track of the results of each, so you’ll have multiple assignments:

@by(mydataframe, :artist, :summary1 = summarizer1(:column_for_1), :summary2 = summarizer2(:column_for_2,:other_column_for_2), ... )

where you fill in more columns where I’ve put …
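
As a concrete sketch of that pattern on the `source` table from earlier in the thread: `sum` and `maximum` stand in for the real summarizers, and the output column names `total_plays` and `max_plays` are made up. The final sort is one possible way to rank groups against each other.

```julia
using DataFrames, DataFramesMeta

source = DataFrame(artist = ["Marley", "Marley", "Marley", "Sinead", "Sinead"],
                   Plays  = [100, 200, 25, 50, 60],
                   Dates  = [1, 2, 3, 1, 2])

# One row per artist, one column per summary function
stats = @by(source, :artist,
            :total_plays = sum(:Plays),
            :max_plays   = maximum(:Plays))

# Rank the groups by one of the summaries, largest first
ranked = sort(stats, :total_plays, rev = true)
```

All results end up in a single table, so comparing artists is an ordinary sort or filter on that table.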


Awesome, thanks so much for the explanation! Quick follow-up though: what happens if the summarizer function involves data from different columns? I ask because if I do something like

@by(source, :artist, :Dailes = :Plays .+ :Dates)

I get the expected result of

5×2 DataFrame
 Row │ artist  Dailes 
     │ String  Int64  
─────┼────────────────
   1 │ Marley     101
   2 │ Marley     202
   3 │ Marley      28
   4 │ Sinead      51
   5 │ Sinead      62

But if I have the summarizer incorporated into a function

function h(x)
       x.Plays .+ x.Dates
end

then

h(source)

yields

5-element Vector{Int64}:
 101
 202
  28
  51
  62

but

@by(source, :artist, :Dailes = h(source))

yields

10×2 DataFrame
 Row │ artist  Dailes 
     │ String  Int64  
─────┼────────────────
   1 │ Marley     101
   2 │ Marley     202
   3 │ Marley      28
   4 │ Marley      51
   5 │ Marley      62
   6 │ Sinead     101
   7 │ Sinead     202
   8 │ Sinead      28
   9 │ Sinead      51
  10 │ Sinead      62

Why isn’t :Dailies == h(source)? Am I implementing @by incorrectly or is there something wrong with the function for this purpose?

Instead, have h take columns, like:

function h(plays, dates)
    # ...do stuff here
end

then in your @by

@by(source, :artist, :summary1 = h(:Plays,:Dates))

EDIT: to explain further, the reason your code doesn’t work is that h receives the entire DataFrame, not the group:

@by(source, :artist, :Dailies = h(source))

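h(source) is evaluated on the entire data frame, so it is a constant that is the same for every group. Each of the two groups therefore gets all 5 values, which is why the result has 10 rows.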
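
Putting that together, a minimal sketch of the column-based version might look like this. The body of `h` is taken from the earlier post (`Plays .+ Dates`). The do-block variant at the end is an alternative I am adding for the case where the function really does want the whole sub-data-frame.

```julia
using DataFrames, DataFramesMeta

source = DataFrame(artist = ["Marley", "Marley", "Marley", "Sinead", "Sinead"],
                   Plays  = [100, 200, 25, 50, 60],
                   Dates  = [1, 2, 3, 1, 2])

# h works on column vectors, so @by can pass it each group's slice
h(plays, dates) = plays .+ dates

per_group = @by(source, :artist, :Dailies = h(:Plays, :Dates))

# Alternative: combine with a do-block receives each SubDataFrame as `sdf`
alt = combine(groupby(source, :artist)) do sdf
    DataFrame(Dailies = h(sdf.Plays, sdf.Dates))
end
```

Both versions return 5 rows, because `h` only ever sees the rows belonging to the current group.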


By the way, once you get something working, try running @macroexpand @by(...) to see what code the @by macro is generating. You’ll find that the macro is packing a fairly big calculation into one line for you. Welcome to understanding the magic of macros enabling domain-specific languages within your main language!


Another way to keep all the info together:


using DataFrames, Dates

artist = DataFrame("A" =>["Bob Marley", "Kiiara", "Sinead"])


df1=DataFrame(songs = ["a song of BM", "another song of BM", "a song of Sinead"], concerts = ["New York","London","Paris"])

df2=DataFrame(songs = ["a song of K", "another song of k", ], loc = ["New York","London"], dates=[Date("1975-1-31"),Date("1980-1-31")])

df3=vcat(df1,df2, cols=:union)
julia> cd=combine(artist, :A,:A=>(x->[df1,df2,df3])=>:discographie)
3×2 DataFrame
 Row │ A           discographie  
     │ String      DataFrame
─────┼───────────────────────────
   1 │ Bob Marley  3×2 DataFrame
   2 │ Kiiara      2×3 DataFrame
   3 │ Sinead      5×4 DataFrame
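
A small sketch of how such a nested column could be built and read back. The shortened tables and the lookup via `findfirst` are my own simplification, not part of the post above.

```julia
using DataFrames

artist = DataFrame(A = ["Bob Marley", "Kiiara"])
df1 = DataFrame(songs = ["a song of BM"], concerts = ["New York"])
df2 = DataFrame(songs = ["a song of K"], loc = ["London"])

# A column can hold whole DataFrames, one per row
artist.discographie = [df1, df2]

# Find the row for one artist and pull out its nested DataFrame
row = findfirst(==("Kiiara"), artist.A)
kiiara = artist.discographie[row]
```

Each nested DataFrame can have its own columns, which is the main difference from a single long table.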
