Segfault on DataFrames `getindex`

Hi folks,

We’re getting a somewhat random segfault on a DataFrames getindex operation inside a for-loop. I haven’t been able to extract an MWE from the proprietary code where this is occurring, but here is part of the segfault and some notes on what we’ve tried. Maybe folks will have some recommendations on things to try?

Traceback:

2019-01-23T13:42:31Z    signal (11): Segmentation fault
2019-01-23T13:42:31Z    in expression starting at no file:0
2019-01-23T13:42:31Z    ht_keyindex at ./dict.jl:280
2019-01-23T13:42:31Z    getindex at ./dict.jl:477 [inlined]
2019-01-23T13:42:31Z    getindex at /root/.julia/packages/DataFrames/lyCjP/src/other/index.jl:148 [inlined]
2019-01-23T13:42:31Z    getindex at /root/.julia/packages/DataFrames/lyCjP/src/dataframe/dataframe.jl:243 [inlined]
2019-01-23T13:42:31Z    view at /root/.julia/packages/DataFrames/lyCjP/src/subdataframe/subdataframe.jl:83 [inlined]
2019-01-23T13:42:31Z    getindex at /root/.julia/packages/DataFrames/lyCjP/src/subdataframe/subdataframe.jl:102 [inlined]
2019-01-23T13:42:31Z    maybeview at ./views.jl:123 [inlined]

NOTES:

  1. Fails on Julia 1.0.3 and 1.1 w/ DataFrames 0.16.0 - 0.17.0
  2. It seems to be tied to precompilation because if the function works once on the same data it seems to continue working for subsequent calls. Also, it never seems to error if the code is extracted from the function and pasted into the REPL.
  3. We always get the segfault on the second pass through our for-loop
  4. So far adding an @show or GC.gc() call prior to the failing getindex operation seems to fix the problem.

I’ll keep working on an MWE that I can share, but any recommendations would be greatly appreciated.

Thank you


pretty sure you mean compilation, not pre-compilation.
No?

The most likely cause of the error is that you are doing getindex with @inbounds turned on and the index you are trying to get is out of bounds.

In order to help you better could you report three things:

  • from which version of DataFrames.jl this stack trace is (as the relevant line numbers have changed between releases)
  • could you copy-paste the relevant parts of the offending code (EDIT: I do not need to see the full code - only the general structure of the part that you are using, the most important thing is if you are trying to index a column using a Symbol or using an Int)
  • could you report what happens when you run Julia with --check-bounds=yes switch set

OK - I think I have tracked it. If I see it correctly, your trace is from DataFrames.jl v0.17 and you are trying to access a column using a Symbol that is not present in the data frame, with @inbounds in effect.

If this is not the case then the issue needs additional investigation. If this is the case then the segfault will randomly appear or not, as typically happens with @inbounds.
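To illustrate why an out-of-bounds read under @inbounds can crash only some of the time, here is a minimal sketch. The function names `checked_get` and `unchecked_get` are made up for this example and do not come from DataFrames.jl or Base.

```julia
# Bounds-checked access: an out-of-range index raises a BoundsError.
checked_get(v::Vector{Int}, i::Int) = v[i]

# Unchecked access: @inbounds lets the compiler elide the bounds check,
# so an out-of-range index is undefined behaviour (garbage, or a segfault).
unchecked_get(v::Vector{Int}, i::Int) = @inbounds v[i]

v = [1, 2, 3]

# The checked version fails loudly and predictably.
threw = try
    checked_get(v, 4)
    false
catch err
    err isa BoundsError
end
@show threw

# The unchecked version is only safe for indices known to be in range.
@show unchecked_get(v, 2)

# unchecked_get(v, 4) is deliberately not called here: whether it crashes
# depends on what happens to sit in memory past the end of the array.
```

Starting Julia with --check-bounds=yes forces the check back on even inside @inbounds blocks, which is why that flag is a useful first diagnostic.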


There is no @inbounds present in the code.
My theory on the cause of this was that it was related to some @inbounds that was added to DataFrames itself during the switch to returning views.
But I could not find any in the DataFrames.jl source.

Nothing out of bounds should be being read, since if GC.gc is called it works fine and gives correct results (I think they are correct at least, could be garbage.)

The general structure of the code is

for (i, df) in enumerate(groupby(mydataframe, :n))
    for foo in (1, 2)
        x = df[1, :n]
        # other operations on `x` and `foo` that do not touch any dataframes
        @show 1  # Adding this stops the segfault.
    end
end

(Someone came to me with this segfault, and when I found that adding @show fixed it, I was pretty WAT, and so Rory is looking at it.)
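As a possible starting point for an MWE, the structure above can be made self-contained with made-up data. This is only a sketch: the contents of `mydataframe`, the extra column `v`, and the `seen` vector are assumptions for illustration, and nothing here is known to reproduce the segfault.

```julia
using DataFrames

# Hypothetical stand-in for the proprietary data; only the column name :n
# comes from the code structure described in this thread.
mydataframe = DataFrame(n = [1, 1, 2, 2, 3], v = [10.0, 20.0, 30.0, 40.0, 50.0])

# Collect the values read in the inner loop so they can be checked afterwards.
seen = Int[]

for (i, df) in enumerate(groupby(mydataframe, :n))
    for foo in (1, 2)
        x = df[1, :n]   # the getindex call that appears in the stack trace
        push!(seen, x)
    end
end

@show seen
```

Swapping real columns into this scaffold one at a time may help narrow down which part of the data is needed to trigger the crash.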

This is very strange, as /src/subdataframe/subdataframe.jl:102 in your stack trace is a call like df[:n] not df[1, :n] (df[1, :n] should call line 107 not 102).

The other strange thing is that the error seems to show that the lookup dictionary is somehow corrupted. Can you check the lengths and contents of lookup.slots and lookup.keys of mydataframe in the loop (unfortunately --check-bounds=yes will not add anything because the segfault happens in a function from Base).
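A minimal sketch of such a check on a plain Dict follows. The helper `lookup_consistent` is hypothetical, and the fields `slots`, `keys` and `vals` are internal implementation details of Base.Dict that may change between Julia versions.

```julia
# Hypothetical helper (not part of Base or DataFrames): checks that the
# internal arrays of a Dict agree in length, which the lookup loop in
# Base relies on when it indexes into them.
function lookup_consistent(d::Dict)
    return length(d.slots) == length(d.keys) == length(d.vals)
end

# Stand-in for a column-name lookup like the one kept by a data frame index.
lookup = Dict(:n => 1, :v => 2)

@show length(lookup.slots) length(lookup.keys)
@show lookup_consistent(lookup)
@show haskey(lookup, :n)
```

If the lengths ever disagree for the real lookup dict, an unchecked read of slots at an index derived from the length of keys could land outside the array.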

Ooops, apparently folks had only been getting the segfault on 0.16.0 and 0.17.0; adding the GC.gc() didn’t actually fix the problem, it only looked that way because I’ve been testing on 0.17.1. FWIW, adding the print/debugging statements for mydataframe appears to prevent the segfault from happening (similar to @show) on 0.17.0:

println(DataFrames.index(mydataframe).lookup.slots)
println(DataFrames.index(mydataframe).lookup.keys)

Anyways, it sounds like updating to 0.17.1 fixes the issue.

Good to hear that 0.17.1 solves the issue. Thank you.
But I would like to make sure that something is not lurking unnoticed. From what you report a change that most likely affected how your code behaves is:
https://github.com/JuliaData/DataFrames.jl/commit/8d75d7464bdea76bb394081f66d870b2d9bf1c3b#diff-0aa1ff52d56badabf7d1511d3dc3f279R242
My problem is that actually it should not change anything in your case.

@Rory-Finnegan
may have switched from x = df[1, :n] to x = first(df[:n]) or possibly x = first(@view df[:n])
while debugging.
My recollection is that all 3 cause the segfault.

Thank you. This would mean that we are getting a segfault going through different paths in DataFrames.jl (which would suggest some problem with lookup dict corruption). Hopefully this is a bug in DataFrames.jl (as this will be easy to fix). What you say makes me think that it cannot be excluded that heavy inlining that happens here (see that almost all methods got inlined in your original stack trace) might lead to later code transformations by the Julia compiler infrastructure that lead to an error (hopefully not).

@Rory-Finnegan: did I understand you correctly that on DataFrames 0.17.1 the problem disappears, because earlier you have reported that the bug still persisted? Thank you.

did I understand you correctly that on DataFrames 0.17.1 the problem disappears, because earlier you have reported that the bug still persisted?

That’s correct, on 0.17.1 I can’t reproduce a segfault. I’ll update my previous comment to make that clearer. Thank you.

NOTE: On 0.17.0, setting --check-bounds=yes causes Julia to exit with 'julia-1.0 --check-bounds=yes' terminated by signal SIGSEGV (Address boundary error)


This is really weird. In DataFrames.jl we do not do any unsafe operations that directly operate on pointers or similar stuff. So it must be the loop in ht_keyindex, which is @inbounds but is compiled into the system image.

Can you try running the code in 0.17.0 with the following method redefinition:

@noinline function Base.ht_keyindex(h::Dict{K,V}, key) where V where K
    sz = length(h.keys)
    iter = 0
    maxprobe = h.maxprobe
    index = Base.hashindex(key, sz)
    keys = h.keys

    while true
        if Base.isslotempty(h,index)
            break
        end
        if !Base.isslotmissing(h,index) && (key === keys[index] || isequal(key,keys[index]))
            return index
        end

        index = (index & (sz-1)) + 1
        iter += 1
        iter > maxprobe && break
    end
    return -1
end

hopefully this should safely (i.e. without crashing) catch this bug - and show where the problem was.

CC @nalimilan

Have you tried with Julia 1.1?

Hmmm, it still segfaults with the overwritten Base.ht_keyindex. On 1.1 it gives me a slightly longer traceback, indicating that the call to isslotempty is erroring.

signal (11): Segmentation fault: 11
in expression starting at no file:0
getindex at ./array.jl:729 [inlined]
isslotempty at ./dict.jl:171 [inlined]
ht_keyindex at ./dict.jl:287
getindex at ./dict.jl:477 [inlined]
getindex at /root/.julia/packages/DataFrames/lyCjP/src/dataframe/dataframe.jl:243 [inlined]
view at /root/.julia/packages/DataFrames/lyCjP/src/subdataframe/subdataframe.jl:83 [inlined]
getindex at /root/.julia/packages/DataFrames/lyCjP/src/subdataframe/subdataframe.jl:102 [inlined]
maybeview at ./views.jl:123 [inlined]
...
Allocations: 148831146 (Pool: 148803391; Big: 27755); GC: 330
fish: 'julia-1.1 --check-bounds=yes' terminated by signal SIGSEGV (Address boundary error)

This is exactly what I have suspected, i.e. the lookup dict gets corrupted somehow (and its slots field is shorter than keys field).

Can you try this (sorry for this ping-pong, but I do not know a better way):

@noinline function Base.ht_keyindex(h::Dict{K,V}, key) where V where K
    sz = length(h.keys)
    iter = 0
    maxprobe = h.maxprobe
    index = Base.hashindex(key, sz)
    keys = h.keys

    while true
        if h.slots[index] == 0x0
            break
        end
        if !(h.slots[index] == 0x2) && (key === keys[index] || isequal(key,keys[index]))
            return index
        end

        index = (index & (sz-1)) + 1
        iter += 1
        iter > maxprobe && break
    end
    return -1
end

and if this still segfaults then:

@noinline function Base.ht_keyindex(h::Dict{K,V}, key) where V where K
    sz = length(h.keys)
    iter = 0
    maxprobe = h.maxprobe
    index = Base.hashindex(key, sz)
    keys = h.keys

    if sz != length(h.slots)
        dump(h)
        error()
    end

    while true
        if h.slots[index] == 0x0
            break
        end
        if !(h.slots[index] == 0x2) && (key === keys[index] || isequal(key,keys[index]))
            return index
        end

        index = (index & (sz-1)) + 1
        iter += 1
        iter > maxprobe && break
    end
    return -1
end

so that we know what the lookup dict holds when we get an error.

And it is best done with the bounds checking flag turned on at Julia startup.
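To confirm from inside the session that the flag actually took effect, one option is to inspect the startup options. This is a small sketch; `Base.JLOptions()` is an internal, undocumented interface, so treat the field name and its values as an assumption that may change between Julia versions.

```julia
# check_bounds is 0 for the default behaviour, 1 for --check-bounds=yes
# and 2 for --check-bounds=no.
flag = Base.JLOptions().check_bounds

if flag == 1
    println("bounds checking is forced on; @inbounds is ignored")
elseif flag == 2
    println("bounds checking is forced off everywhere")
else
    println("default behaviour; @inbounds blocks skip bounds checks")
end
```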

Thank you.

Okay, so the first version successfully threw a KeyError:

ERROR: KeyError: key :n not found

The second version resulted in a segfault though:

signal (11): Segmentation fault: 11
in expression starting at no file:0
getindex at /root/.julia//packages/DataFrames/lyCjP/src/dataframe/dataframe.jl:228 [inlined]
maybeview at ./views.jl:123 [inlined]

I assume that obviously :n is expected to be in the data frame - right?
In general this is very strange (especially the last segfault).
Also, at src/dataframe/dataframe.jl:228 an index function is defined, not getindex, so this is strange.

Probably having a shareable reproducing example at this point would be ideal, as what you report looks more and more like something related to Base. If that is not possible, then (of course only if you are willing to) can you try running git bisect on the DataFrames.jl git repo to identify the commit after which the problem started to appear and the commit after which it stops appearing?


You will note that in the example code from before,
we groupby :n at the start,
so it certainly should be in the resulting group subdataframes

I agree, but the situation is so strange that I want to rule out every option (e.g. that :n actually is not present in mydataframe and for some weird reason passes groupby).

Related to this would you want to check the following cases - side-stepping groupby (maybe this would help us locate the problem):

for (i, n_val) in enumerate(unique(mydataframe.n))
    row_sel = mydataframe.n .== n_val
    df = view(mydataframe, row_sel, :)
    for foo in (1, 2)
        x = df[1, :n]
        # other operations on `x` and `foo` that do not touch any dataframes
        @show 1  # Adding this stops the segfault.
    end
end

and

for (i, n_val) in enumerate(unique(mydataframe.n))
    row_sel = mydataframe.n .== n_val
    df = mydataframe[row_sel, :]
    for foo in (1, 2)
        x = df[1, :n]
        # other operations on `x` and `foo` that do not touch any dataframes
        @show 1  # Adding this stops the segfault.
    end
end

Both are a kind of inefficient groupby: the first uses a SubDataFrame and the second a DataFrame.

Where does mydataframe come from? Is it a DataFrame?
