Simple Table Operation Has Very Large Compilation Time with MLJ

I am trying to use MLJ on a DataFrame (30,000 rows x 8,000 columns) but every table operation seems to take a huge amount of time to compile but is fast to run.

I have given an example with code below in which a 5 x 5000 DataFrame is generated and it gets stuck on the unpack line (line 3). When I run the same code for a 5 x 5 DataFrame, line 3 outputs “2.872309 seconds (9.09 M allocations: 565.673 MiB, 6.47% gc time, 99.84% compilation time)”.

This is a crazy amount of compilation time for a seemingly simple task and I would like to know how I can reduce this.

Thank you,

Jack


using MLJ
using DataFrames

[line 1] @time arr = [[rand(1:10) for i in 1:5] for i in 1:5000];
output: 0.053668 seconds (200.76 k allocations: 11.360 MiB, 22.16% gc time, 99.16% compilation time)
[line 2] @time df = DataFrames.DataFrame(arr, :auto)
output: 0.267325 seconds (733.43 k allocations: 40.071 MiB, 4.29% gc time, 98.67% compilation time)
[line 3] @time y, X = unpack(df, ==(:x1));
does not finish running

My guess is that it’s casting the DataFrame as a NamedTuple of vectors at some point. I took a glance but couldn’t find where that was happening.

1 Like

x-link to SO: julia - Simple Table Operation Has Very Large Compilation Time with MLJ - Stack Overflow

1 Like

@ablaom is the best person to answer this I think. The offending lines are most likely MLJBase.jl/data_utils.jl at 483263e45023f886563420cced1435e1ddf7b4a9 · JuliaAI/MLJBase.jl · GitHub and MLJBase.jl/data_utils.jl at 483263e45023f886563420cced1435e1ddf7b4a9 · JuliaAI/MLJBase.jl · GitHub. In general it should be possible to change the implementation there to something more generic (so that DataFrame stays DataFrame) - we would need to discuss this.

Let me just note that the whole point of DataFrames.jl is to make such operations fast (i.e. as long as you keep working with DataFrame and do not try converting it to a type-stable object like NamedTuple things will be fast as @pdeffebach noted).

The problem is this function.

function MMI.selectrows(::FI, ::Val{:table}, X, r)
    r = r isa Integer ? (r:r) : r
    # next uncommented line is a hack; see
    # https://github.com/alan-turing-institute/MLJBase.jl/issues/151
    isdataframe(X) && return X[r, :]
    cols = Tables.columntable(X)
    new_cols = NamedTuple{keys(cols)}(tuple((c[r] for c in values(cols))...))
    return Tables.materializer(X)(new_cols)
end

isdataframe(X) is not correctly implemented. See here

typename(X) = split(string(supertype(typeof(X)).name), '.')[end]
isdataframe(X) = typename(X) == "AbstractDataFrame"

on 1.7:

julia> using DataFrames;

julia> typename(X) = split(string(supertype(typeof(X)).name), '.')[end]
typename (generic function with 1 method)

julia> isdataframe(X) = typename(X) == "AbstractDataFrame";

julia> df = DataFrame(x = 1);

julia> isdataframe(df)
false

julia> typename(df)
"typename(AbstractDataFrame)
3 Likes

Thank you very much for identifying the issue!

What do you suggest I do next so that I can use my DataFrame with MLJ?

Thank you,

Jack

As I said on SO, you can just do

y, X = df.x1, select!(df, Not(:x1))

which will be fast irrespective of the number of colums in df. I’d assume that the MLJ maintainers will fix the issue in the near future though.

2 Likes

Thank you for your answer. That would definitely work for the unpack method, but will I run into issues using other MLJ methods with my DataFrame? Would it be better to edit my version of MLJ to fix the bug instead?

Thanks,

Jack

I can’t see how this would be a problem - all that unpack does is assign one column to y and a DataFrame to X, so it should be equivalent:

julia> df = DataFrame(rand(1:10, 5, 5), :auto);

julia> y, X = unpack(df, ==(:x1));

julia> y2, X2 = df.x1, select(df, Not(:x1));

julia> y == y && X == X2
true

Thank you and sorry if I was unclear. That seems like an excellent fix for unpack but as for other functions in MLJ, it seems like the issue pointed out by pdeffebach above could have an impact. I want to avoid issues with other functions further on in my program. Is it possible that I could run into more problems because of the issue pointed out above?

You can redefine the method for typename in MLJBase in the following way

MLJBase.typename(X) = split(string(supertype(typeof(X)).name.name), '.')[end]

It should fix the problem (at least the one with this function).

julia> using DataFrames;

julia> typename(X) = split(string(supertype(typeof(X)).name.name), '.')[end]
typename (generic function with 1 method)

julia> isdataframe(X) = typename(X) == "AbstractDataFrame";

julia> df = DataFrame(x = 1);

julia> isdataframe(df)
true

julia> typename(df)
"AbstractDataFrame"

Nice catch. @pdeffebach . I’ll open a PR to fix this soon. Going forward, I’ll add a test to specifically check for this.

3 Likes

@Jack_N. Thanks for reporting this issue. You shouldn’t run into any other related issues, once I open a PR to fix this. MLJ implements efficient fallbacks for Tables.MatrixTable and DataFrames table types, which support very wide data.

2 Likes

Worth noting that Start an idea of what an "in memory" requirement would look like by quinnj · Pull Request #278 · JuliaData/Tables.jl · GitHub should allow us to remove this hack altogether.

2 Likes

A potential temporary fix (before the more generic one pointed at by @ablaom above) is to do:

julia> typename(X) = split(string(supertype(typeof(X))), '.')[end]
julia> import DataFrames; df = DataFrames.DataFrame(x = 1); typename(df)
"AbstractDataFrame"

i.e. not use .name or .name.name.

I seem to recall working on this (looks like I’m partially to blame here) and I must have tested this somehow (even though clearly without writing formal tests :clown_face: ), so maybe this stems from using Julia internals that we shouldn’t have used and this breaking with newer versions of Julia. @samuel_okon if you end up opening a PR for this, you may have to check on older versions (<= 1.5) whether this definition of typename works, if not then I guess there’ll need to be a VERSION check in that function with the current definition for older versions of Julia and the one above for newer ones.

Edit: note for people passing by that here the interface does not load DataFrames (otherwise it would be too heavy and cause an unnecessary overhead), this is why we’re doing this kind of weird check whether an object X is of a type inaccessible to the interface by calling typeof, at runtime, and so can’t do something like multiple dispatch. Hopefully that’s something we can eventually remove as per Anthony’s message. This is also why there’s the split... as you would have supertype(typeof(X)) = DataFrames.AbstractDataFrame not just AbstractDataFrame.

2 Likes

I tried debugging the code for the the unpack method and I noticed that line 280 significantly slows down the code execution. More specifically line 280 at https://github.com/JuliaAI/MLJBase.jl/blob/dev/src/data/data.jl, push!(unpacked, selectcols(Xfixed, names)).

This line takes a very long time to run and it makes unpacking a 5 x 2,000 DataFrame of Ints take almost 2 minutes!

I am not sure how to go about fixing this. My assumption is that the selectcols method is the cause of the delay but I am not sure how to fix this. I believe the issue arises from line 100 at https://github.com/JuliaAI/MLJBase.jl/blob/dev/src/interface/data_utils.jl, cols = Tables.columntable(X) # named tuple of vectors.

It seems like in this case, the dataframe is converted into a named tuple of vectors which is what is causing the very slow code execution.

This may also be an area in need of improvement in the MLJ package. It seems to me that the selectcols implementation should detect isdataframe and respond accordingly similar to the selectrows method does a few lines above.

Please let me know what the best way to address this issue is.

Thanks,

Jack

@Jack_N I believe @pdeffebach has correctly identified the issue and am confident @samuel_okon can make a fix in due course.

I definitely agree that @pdeffebach pointed out a key issue. If I understand correctly, they pointed out that the MMI.selectrows function does not work correctly since isdataframe(X) had an issue.

However, in addition to that issue, it sees as though there is another issue, this time pertaining to the selectcols methods referred to by the unpack method. The selectcols methods defined in data_utils.jl seem to convert the data passed to it, X, to a Table.columntable (a named tuple of vectors). This is very innefficient and it contributes to the long runtime of the unpack method when used with the DataFrame type.

Since I am using MLJ with a dataframe I made a quick fix on my local version to the selectcols functions using the isdataframe method as is seen below. This avoids copying the DataFrame to a named tuple.

I hope you better understand my issue now and that a solution to this issue similar to mine could be added to the PR as well (addressing the selectcols issue for DataFrames).

function MMI.selectcols(::FI, ::Val{:table}, X, c::Union{Symbol, Integer})
    if !(isdataframe(X))
        cols = Tables.columntable(X) # named tuple of vectors
        return cols[c]
    else
        return X[!, c]
    end
end

function MMI.selectcols(::FI, ::Val{:table}, X, c::AbstractArray)
    if !(isdataframe(X))
        cols = Tables.columntable(X) # named tuple of vectors
        newcols = project(cols, c)
        return Tables.materializer(X)(newcols)
    else
        return X[!, c]
    end
end```

(removing my previous answer which was incorrect, I think it’s helpful to indicate that the current full stuff is

MMI.selectrows(::FI, ::Val{:table}, X, ::Colon) = X
MMI.selectcols(::FI, ::Val{:table}, X, ::Colon) = X

function MMI.selectrows(::FI, ::Val{:table}, X, r)
    r = r isa Integer ? (r:r) : r
    # next uncommented line is a hack; see
    # https://github.com/alan-turing-institute/MLJBase.jl/issues/151
    isdataframe(X) && return X[r, :]
    cols = Tables.columntable(X)
    new_cols = NamedTuple{keys(cols)}(tuple((c[r] for c in values(cols))...))
    return Tables.materializer(X)(new_cols)
end

function MMI.selectcols(::FI, ::Val{:table}, X, c::Union{Symbol, Integer})
    cols = Tables.columntable(X) # named tuple of vectors
    return cols[c]
end

function MMI.selectcols(::FI, ::Val{:table}, X, c::AbstractArray)
    cols = Tables.columntable(X) # named tuple of vectors
    newcols = project(cols, c)
    return Tables.materializer(X)(newcols)
end

@Jack_N probably open a PR with this, I think a condition with isdataframe in selectcols as per your code is a good idea here too; or if you wait a bit I’m working on a PR with the whole lot.

Edit: PR with the fixes incoming: closes #784 by fixing typename + tests by tlienart · Pull Request #786 · JuliaAI/MLJBase.jl · GitHub

Fwiw I think the real issue is for everyone to agree on a non-type-stable table type that implements only the Tables.jl interface. Then you can convert to the equivalent of a namedtuple of vectors, but without the compile cost. It should be an easy package to write.