Combining dataframe and csv: undefined functions in DataFrames

I’m running into a number of obstacles trying to merge a DataFrame with a CSV file (I just started working with DataFrames today). I’m benchmarking runs of an inference task. The task has a number of parameters, and I’d like to aggregate runs in a CSV that I update after each set of runs.

For example, with two parameters A and B, if I did 10 runs with A=1, B=2 and 3 of them succeeded, I would put these results in a DataFrame and write them to a CSV:

using CSV, DataFrame

df = DataFrame(A=1, B=2, Nrun=10, Nsuccess=3)
CSV.write("test.csv", df)

Now if I do another set of runs with A=1, B=2 and get 5 successes in 15 runs, I’d like to update the entry in test.csv to show Nrun=25 and Nsuccess=8. I’d like to do this by scanning through the file rather than loading it all into memory, but I couldn’t find a nice way of doing that. The best approach I could find was to load the previous data, push a new row to it, group by the parameters, and then sum over the runs:

df = CSV.read("test.csv", DataFrame)

newruns = Dict(:A=>1, :B=>2, :Nrun=>15, :Nsuccess=>5)
push!(df, newruns)

gdf = groupby(df, [:A, :B])
fdf = combine(gdf, valuecols(gdf) .=> sum .=> valuecols(gdf))

CSV.write("test.csv", fdf)

If I run this in a Pluto notebook I get UndefVarError: groupby not defined, which seems odd because I thought Julia would look in the DataFrames namespace. If I change it to DataFrames.groupby I get UndefVarError: valuecols not defined, and changing that to DataFrames.valuecols does not help.

Am I missing something? Is there a cleaner way of doing this? I’d think this should be very simple but have been fighting with it for a few hours. (I also had similar problems when trying to use Underscores to pipe these group-combine operations.)

What version of DataFrames are you using? It seems like valuecols is exported in 1.0 at least.

groupby conflicts with Lazy.jl. Is that package loaded? It may be causing the issue.

These are strange errors, but at the top of your code you have using DataFrame rather than using DataFrames… maybe it’s something simple like this? It all works for me.
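
For reference, here is a minimal sketch of the snippet with the import corrected. It assumes DataFrames 1.0 or later (where valuecols is exported) and leaves out the CSV round trip so that it runs on its own; the starting row stands in for what would be read back from test.csv.

```julia
using DataFrames  # note the plural; assumes DataFrames 1.0 or later

# Existing totals, standing in for the contents of test.csv
df = DataFrame(A=1, B=2, Nrun=10, Nsuccess=3)

# New batch of runs for the same parameter assignment
push!(df, Dict(:A => 1, :B => 2, :Nrun => 15, :Nsuccess => 5))

# Group by the parameters and sum the remaining (value) columns,
# keeping the original column names
gdf = groupby(df, [:A, :B])
fdf = combine(gdf, valuecols(gdf) .=> sum .=> valuecols(gdf))
```

With these inputs, fdf should contain a single row for A=1, B=2 holding the summed run and success counts.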

Please try in the REPL to help us debug.

Ah, it’s a version problem. I installed everything on a fresh laptop 2 weeks ago, so I thought it would be current, but something is preventing DataFrames from updating:

(@v1.6) pkg> up
    Updating registry at `~/.julia/registries/General`
  No Changes to `~/.julia/environments/v1.6/Project.toml`
  No Changes to `~/.julia/environments/v1.6/Manifest.toml`

(@v1.6) pkg> status
      Status `~/.julia/environments/v1.6/Project.toml`
  [7f9c7709] BIGUQ v0.8.0
  [336ed68f] CSV v0.8.4
  [8f4d0f93] Conda v1.5.2
  [a93c6f00] DataFrames v0.21.8
  [31c24e10] Distributions v0.23.12
  [c91e804a] Gadfly v1.3.3
  [73787735] GraphicalModelLearning v0.2.1 `~/.julia/dev/GraphicalModelLearning`
  [7073ff75] IJulia v1.23.2
  [c8e1da08] IterTools v1.3.0
  [6f286f6a] MultivariateStats v0.8.0
  [91a5bcdd] Plots v1.15.0
  [c3e4b0f8] Pluto v0.14.5
  [438e738f] PyCall v1.92.3
  [d330b81b] PyPlot v2.9.0
  [2913bbd2] StatsBase v0.32.2
  [f3b207a7] StatsPlots v0.14.21
  [d9a01c3f] Underscores v2.0.0
  [9a3f8284] Random

Any idea of what could be stopping DataFrames from updating?

Try ]add DataFrames@1.1.1; it should tell you what’s holding it back.

For whatever reason I had to rm DataFrames first before adding it back. The culprit was BIGUQ, a Bayesian information gap and uncertainty quantification package that must have been a dependency of a nonnegative matrix factorization package I had installed. I removed it and DataFrames is now at the current version. Thanks so much!

Make sure to have all your projects in their own directories with their own Project.toml files.
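
As a rough sketch of what that looks like with the Pkg API (the directory name BenchmarkRuns is hypothetical, chosen only for illustration): activating a path makes Pkg resolve versions against that project's own Project.toml rather than the shared @v1.6 environment, so an unrelated package such as BIGUQ cannot hold DataFrames back there.

```julia
using Pkg

# Hypothetical project directory; any path works
Pkg.activate("BenchmarkRuns")

# Versions are resolved only against this project's dependencies;
# the first add writes BenchmarkRuns/Project.toml and Manifest.toml
Pkg.add(["CSV", "DataFrames"])

# List what this project (not the global environment) has installed
Pkg.status()
```

The same thing can be done from the REPL's package mode with ]activate BenchmarkRuns followed by ]add CSV DataFrames.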
