DataFrames combine over 100s of columns

I am trying to perform a combine over columns logged over multiple days and get the delta (max - min) for 300 columns in a dataset with 10 million rows.

  1. How can I do this efficiently?
  2. I also notice that if I combine over 10 columns at a time, the answers are returned almost instantly. Is there a way to auto-generate code for combine?
  3. Is there a way to use a for loop over the columns to combine them in the same way?
get_delta(x) = maximum(skipmissing(x)) - minimum(skipmissing(x))
get_delta_days(x) = (x -> x.value).(maximum(skipmissing(x)) - minimum(skipmissing(x)))

@time delta_df = @chain data_joined begin
    groupby(["unique_column"])
    combine(
        :column1 => first => :column1,
        :datestamp => get_delta_days => :days,
        :column2_hw => get_delta => :column2_diff,
        :column3_hw => get_delta => :column3_diff,
       # have 300 more columns
    )

    # @combine begin
    #         :days = get_delta_days(:datestamp)
    #         :n = length(:device_id)
    #         :column2_diff = get_delta(:column2)
    #         :column3_diff = get_delta(:column3)
    #         $AsTable = (column2_diff = get_delta(:column2), column3_diff = get_delta(:column3))
    # end
end

Pkg and machine info

Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, westmere)

DataFrames v1.3.5

It is not clear to me how the get_delta() function is supposed to work, but if you have to apply the same function to a group of columns you can do something like this:

using DataFrames

df = DataFrame(rand(1:1000, 10^6, 300), :auto)
g = groupby(df, :x1)

Diff(x) = -(-(extrema(x)...))
combine(g, 3:300 .=> Diff)
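
If you also want to control the names of the output columns, the same broadcasting can build source => function => destination pairs. A sketch, assuming the columns of interest are x3 to x300:

cols = names(df)[3:300]                         # the columns to aggregate
combine(g, cols .=> Diff .=> cols .* "_diff")   # e.g. x3 => Diff => x3_diff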

It’s helpful to provide dummy data for others to replicate the entirety of your example instead of guessing about the format of your data. Regardless, you’ll want to use extrema, which calculates minimum and maximum with one pass through an array rather than two. For a million rows, it takes about half as long:

get_delta0(x) = maximum(skipmissing(x)) - minimum(skipmissing(x))
get_delta1(x) = -(-(extrema(skipmissing(x))...))
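
(The test vector x isn't shown in the post; a hypothetical setup for the timings below could look like this, with the exact result and timings depending on the data:)

x = [rand() < 0.1 ? missing : rand() for _ in 1:10^6]   # hypothetical: 10^6 Float64 values, ~10% missing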
julia> using BenchmarkTools

julia> @btime get_delta0($x)
  8.716 ms (10 allocations: 160 bytes)
1.0

julia> @btime get_delta1($x)
  4.877 ms (5 allocations: 112 bytes)
1.0

Just for fun, instead of defining get_delta_days, you can add a method for get_delta that operates only on vectors of DateTime:

get_delta(x::Vector{<:DateTime}) = -(-(extrema(x -> x.value, skipmissing(x))...))
julia> @btime get_delta_days($datestamp)
  3.415 ms (10 allocations: 160 bytes)
2999602398

julia> @btime get_delta($datestamp)
  2.195 ms (5 allocations: 112 bytes)
2999602398 days
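
With that method in place, the pair from the original combine should dispatch to it directly (assuming :datestamp is a plain Vector{DateTime}; if the column allows missing values, the method signature would need to be widened accordingly):

:datestamp => get_delta => :days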

I wrote Julia code for my dataset and it took more than 15 hours :rofl:, so I cancelled the processing to post here. Your code snippets made it run in 15 seconds :exploding_head:

Thanks, folks. Every time I get frustrated with Julia, it's because of my ill-informed code. I love how the community always comes to the rescue.


I’m still not sure what was taking so long with @combine. Do you think it's because there are so many commands in a single call? Can you try profiling that a bit more with the improved version of get_delta?

I was wrong; I got the results in 15 seconds because I executed only 2 columns :slight_smile:
I think compute-intensive work running on Jupyter with the Julia kernel has always been a pain: sometimes the cells never complete, and I mistake that for Julia still computing.
I tried running the same code in the Julia console and it ran properly, completing in seconds. I did not time the runs as I am working on a deadline, and this has been a very bad experience :frowning:

Julia and Jupyter need to work better together. I do not use Pluto, since I prefer the Jupyter compute style.
Let me know how to profile these Jupyter-with-Julia runs.

There should not be any difference in timings between the Julia REPL and a Jupyter notebook - after all, Jupyter is just a front end that, in simple terms, takes the code and sends it to a Julia process, which is no different from the one executing code entered in the REPL.

As others have said it would be great if you could provide an MWE with dummy data that demonstrates the issue.


I completely switched to Jupyter with Julia for my everyday data analysis work.
I will try to put together an MWE soon that shows how Jupyter does not complete a cell run started with the Julia kernel.

If you have an MWE I will investigate it. Thank you!

In the terminal the join completed in 156 seconds, but on Jupyter the cell does not complete even after one hour. I have attached the JupyterLab notebook that reproduces the issue; it has some unwanted imports, which I kept to simulate the same amount of RAM usage.

How much RAM does the machine have on which you run this code? Most likely the high run time is due to disk swapping (at least this is what happens on my machine, which has 32 GB of RAM).


I have 64 GB of RAM. Is there a reason the same code could run properly in the Julia console but not on Jupyter?

What is the level of RAM usage when you run this code?

I reduced your code down to this:

import Pkg; Pkg.activate(; temp = true); Pkg.add(["UUIDs", "StatsBase", "InlineStrings", "DataFrames"])
using UUIDs, Random, StatsBase, DataFrames, InlineStrings

n = 10^7

uuids = string.([UUIDs.uuid4() for i in 1:10^6])
xs = [String63.([randstring(12) for i ∈ 1:x]) for x ∈ [16, 20, 100, 6, 5]]

df1 = DataFrame(rand(1:10^6, n, 300),:auto)
df1[!, "uuid"] = StatsBase.sample(uuids, 10^6, replace = true)
for y ∈ 1:5
    df1[!, "x$y"] = StatsBase.sample(xs[y], n, replace = true)
end
println(size(df1))

df2 = DataFrame(rand(1:10^6, n, 300),:auto)
df2[!, "uuid"] = StatsBase.sample(uuids, 10^6, replace = true)
for y ∈ 1:5
    df2[!, "x$y"] = StatsBase.sample(xs[y], n, replace = true)
end
println(size(df2))

on_columns = ["x1", "uuid"]
@time data_joined = leftjoin(
    select(df1, vcat(on_columns, ["x$i" for i in 52:300])),
    select(df2, vcat(on_columns, ["x$i" for i in 2:50])),
    matchmissing = :notequal, 
    on =  on_columns)

println(size(data_joined))

Running this in an IJulia notebook I got:

Create dummy source data
Create df1
(10000000, 301)
Create df2
(10000000, 301)
Join
251.119597 seconds (4.61 M allocations: 61.146 GiB, 3.01% gc time, 4.03% compilation time: 2% of which was recompilation)
(11603549, 300)

whereas in the REPL I got (just on the last step):

julia> @time data_joined = leftjoin(
           select(df1, vcat(on_columns, ["x$i" for i in 52:300])),
           select(df2, vcat(on_columns, ["x$i" for i in 2:50])),
           matchmissing = :notequal,
           on =  on_columns)
472.345997 seconds (4.57 M allocations: 61.131 GiB, 4.27% gc time, 3.38% compilation time: 12% of which was recompilation)

I killed the IJulia kernel before running things in the REPL but it looks like something else was running during my REPL session, as I had less free memory to start off with (resource monitor showed around 7 out of 48GB in use before I ran the code in IJulia, and 10GB in use when I started the REPL session).

As Bogumil says this code is very likely to hit swap - each of your DataFrames is more than 25GB in memory, so even before joining them or doing any operations you’re likely close to using your 64GB (taking into account that you also have an OS running). I think my results demonstrate this - on the second run in the REPL I just hit swap earlier and therefore more often, which massively increased overall runtime.

In summary, I don’t think this has anything to do with IJulia (maybe apart from the fact that to run IJulia you need to have a browser running, which adds to the RAM pressure compared to just having a terminal open with the REPL, but that should be negligible overall), nor with “combining over 100s of columns” as the title suggests, but just with the fact that exceeding your available RAM and going into swap is very costly in performance terms.
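
As a rough back-of-envelope check of those sizes (assuming 8 bytes per Int64 and 64 bytes per String63 entry; the uuid String column and general overhead come on top):

10^7 * 295 * 8 / 2^30   # ≈ 22 GiB for the remaining Int64 columns of one DataFrame
10^7 * 5 * 64 / 2^30    # ≈ 3 GiB for the five String63 columns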


I have run the code with 10^6 rows and got the following timing:

julia> @time data_joined = leftjoin(
           select(df1, vcat(on_columns, ["x$i" for i in 52:300]), copycols=false),
           select(df2, vcat(on_columns, ["x$i" for i in 2:50]), copycols=false),
           matchmissing = :notequal,
           on =  on_columns)
  1.812051 seconds (91.02 k allocations: 2.798 GiB, 18.60% gc time, 2.05% compilation time)

So with 10x bigger data it should be less than 30 seconds if no swapping happens.
Note that I added copycols=false in select because making a copy unnecessarily would use additional memory (which seems to be a bottleneck in this exercise).
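
A quick way to confirm that copycols=false avoids the copy (a sketch; x100 is just an arbitrary Int column from the dummy data) is to check whether the selected column is the very same vector as in the parent:

select(df1, ["x100"]).x100 === df1.x100                   # false: column was copied
select(df1, ["x100"], copycols=false).x100 === df1.x100   # true: column is aliased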


Thanks @nilshg for minimizing the example.

import Pkg; Pkg.activate(; temp = true); Pkg.add(["UUIDs", "StatsBase", "InlineStrings", "DataFrames"])
using UUIDs, Random, StatsBase, DataFrames, InlineStrings

n = 10^7

uuids = string.([UUIDs.uuid4() for i in 1:10^6])
xs = [String63.([randstring(12) for i ∈ 1:x]) for x ∈ [16, 20, 100, 6, 5]]

df1 = DataFrame(rand(1:10^6, n, 300),:auto)
df1[!, "uuid"] = StatsBase.sample(uuids, n, replace = true)
for y ∈ 1:5
    df1[!, "x$y"] = StatsBase.sample(xs[y], n, replace = true)
end
println(size(df1))

df2 = DataFrame(rand(1:10^6, n, 300),:auto)
df2[!, "uuid"] = StatsBase.sample(uuids, n, replace = true)
for y ∈ 1:5
    df2[!, "x$y"] = StatsBase.sample(xs[y], n, replace = true)
end
println(size(df2))

on_columns = ["x1", "uuid"]
@time data_joined = leftjoin(
    select(df1, vcat(on_columns, ["x$i" for i in 52:300]), copycols=false),
    select(df2, vcat(on_columns, ["x$i" for i in 2:50]), copycols=false),
    matchmissing = :notequal, 
    on =  on_columns)

The same code with @bkamins's update runs in my terminal in 85 seconds. I ran it multiple times to verify the terminal timing.

...
julia> @time data_joined = leftjoin(
           select(df1, vcat(on_columns, ["x$i" for i in 52:300]), copycols=false),
           select(df2, vcat(on_columns, ["x$i" for i in 2:50]), copycols=false),
           matchmissing = :notequal,
           on =  on_columns)
 85.477057 seconds (4.34 M allocations: 33.896 GiB, 53.28% gc time, 2.42% compilation time)

But on Jupyter, it does not complete even after 30 minutes.
I also watched Activity Monitor to see what is happening with the memory footprint, and the two runs look similar.

jupyter metainfo:

Selected Jupyter core packages...
IPython          : 8.4.0
ipykernel        : 6.9.1
ipywidgets       : not installed
jupyter_client   : 7.3.4
jupyter_core     : 4.10.0
jupyter_server   : 1.18.1
jupyterlab       : 3.4.4
nbclient         : 0.5.13
nbconvert        : 6.4.4
nbformat         : 5.3.0
notebook         : 6.4.12
qtconsole        : not installed
traitlets        : 5.1.1

Let me know if there is any other way we can monitor/trace the runs that can reveal issues.

This is strange. Maybe @stevengj, who maintains IJulia.jl, will know the reason?

Some quick questions (maybe these were already answered above, but just to have them in one place):

  • What OS do you use?
  • How does CPU and memory usage look when the process runs in the REPL vs Jupyter (and, for both, which are the key processes that use up CPU and RAM)?
  • Do you use the Jupyter Notebook in the browser or in VS Code?
  • How big is the Jupyter Notebook file when you save it before running the expensive operation? (One of the issues might be that the Jupyter Notebook caches some large output unnecessarily, thus using up RAM - this would be visible in the size of the saved file.)

I think equally you could do Base.summarysize(Out) to check whether there's an issue with intermediate outputs taking up space (as far as I understand the IJulia design, all the captured outputs are kept in the Out dictionary).

  1. I use macOS Monterey 12.4; refer to the first post, it has all the system details.
  2. The CPU and memory usage look the same as in the REPL: same number of threads, similar real, virtual, and shared memory usage. In the REPL the CPU usage comes down (from 100%) after 85 seconds, but from Jupyter it still holds the CPU at 100%.
  3. Jupyter Notebook in the browser (Firefox).
  4. The .ipynb file is 3 KB, with no large objects inside it.

Base.summarysize(Out), which I sprinkled all over the code snippet, looks the same. I also noticed that in the middle of all these tests the Jupyter Julia kernel started behaving properly like the REPL, meaning cells completed in a time similar to the REPL.

I have had similar issues in the past, and the way I circumvented them is with the following trick. Let's say I want to run this piece of code:

delta(x) = -(-(extrema(x)...))
get_delta_days(x) = (x -> x.value).(delta(skipmissing(x)))
@time delta_df = @chain data_joined begin
    groupby(["device_id"])
    combine(
        [:x1, :x2] .=> first .=> [:x1, :x2],
        :x3 => length => :n,
        :x4 => get_delta_days => :days,
        [:x5, :x6] .=> maximum .=> [:x5, :x6],
        [:x5, :x6, :x7, :x8] .=> delta,
        ["x$(i)" for i in 9:255] .=> delta
    )
    transform(:x10 => ByRow(x-> x < 0 ? 0.0 : x ) => :x10_delta)
end

The Jupyter cell will not complete at all (I have waited hours).
Then I kill/interrupt the kernel and run the following version with most of the arguments commented out; this executes in seconds, even for 10 million records.

delta(x) = -(-(extrema(x)...))
get_delta_days(x) = (x -> x.value).(delta(skipmissing(x)))
@time delta_df = @chain data_joined begin
    groupby(["device_id"])
    combine(
        [:x1, :x2] .=> first .=> [:x1, :x2],
        :x3 => length => :n,
        # :x4 => get_delta_days => :days,
        # [:x5, :x6] .=> maximum .=> [:x5, :x6],
        # [:x5, :x6, :x7, :x8] .=> delta,
        # ["x$(i)" for i in 9:255] .=> delta
    )
    transform(:x10 => ByRow(x-> x < 0 ? 0.0 : x ) => :x10_delta)
end

Then I remove the commented lines one by one in subsequent runs of the same cell.
The whole cell, which did not complete in hours before, now completes in seconds once the same code is fully uncommented.
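
An alternative to commenting and uncommenting (just a sketch, reusing delta, get_delta_days, and the column names from the snippets above) is to build the list of combine arguments as a plain vector and splat it; the vector can then be sliced to test subsets without editing the call:

specs = vcat(
    [:x1, :x2] .=> first .=> [:x1, :x2],
    :x3 => length => :n,
    :x4 => get_delta_days => :days,
    [:x5, :x6] .=> maximum .=> [:x5, :x6],
    [:x5, :x6, :x7, :x8] .=> delta,
    ["x$(i)" for i in 9:255] .=> delta,
)

gdf = groupby(data_joined, "device_id")
combine(gdf, specs[1:10]...)   # try a small subset first
combine(gdf, specs...)         # then the full list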

This is why I started this thread complaining about the Jupyter kernel, since the same piece of code runs properly in the REPL. I do not know how to profile this inconsistent behavior, but it has happened to me many times, typically around when I update JupyterLab / DataFrames / Julia (LTS).

This is the one issue I want resolved in order to adopt Julia notebooks for our ETL pipeline, replacing Python notebooks.