ERROR: BoundsError when using function taking dataframe as input

Hi there,

I have been getting a BoundsError when using a function that takes a DataFrame as input and returns one as well. I have tried different versions of the code below:

function Getdf(df_in)
    cert = convert(Vector, df_in[:,"cert"]);
    certu = unique(cert);
    cert_keep = certu[1:10];
    idx_keep = cert .== cert_keep[1];
    jt = 2;
    maxjt = length(cert_keep);
    while  jt <= maxjt
        id = cert_keep[jt];
        idx_keep = idx_keep .| (cert .== id);
        jt = jt + 1;
    end
    df_out = df_in[vec(idx_keep),:];
    
    return df_out
end

df_in_new = Getdf(df_in_old);

However, I keep receiving the following message:

ERROR: BoundsError: attempt to access 0-element Vector{Base.StackTraces.StackFrame} at index [1]

There is something fundamental here that I do not seem to understand.

Most of the questions related to this topic suggest preallocating df_in_new. I am trying to define df_in_new by using the function Getdf. In the past, I was able to define variables in this way without preallocating them, so I cannot see what is different now.

I would greatly appreciate your help on this.

Hi Skander, welcome.

Below is how I would write your function, with some questions added as comments. I think the certu[1:10] indexing is what is causing your error:

using DataFrames

# function Getdf(df_in)
function Getdf(df_in::DataFrame) # Are you always expecting the input to be a dataframe?
    # cert = convert(Vector, df_in[:,"cert"]);
    cert = df_in[:,"cert"] # DataFrame columns are already `Vector`s

    # certu = unique(cert);
    certu = unique(cert)  # terminating with a semicolon in functions makes no difference

    # cert_keep = certu[1:10];
    # But how do you know there are at least 10 unique elements in cert?
    cert_keep = certu[1:10] # ? I think this is where your error is occurring ?

    # idx_keep = cert .== cert_keep[1];
    idx_keep = (cert .== cert_keep[1]) # bracket to help readability

    # jt = 2;
    # maxjt = length(cert_keep);
    # while  jt <= maxjt
    for jt in 2:length(cert_keep)
        # id = cert_keep[jt];
        id = cert_keep[jt]
        # idx_keep = idx_keep .| (cert .== id);
        idx_keep = (idx_keep .|| (cert .== id)) # use the "boolean or" || (double) instead of the "bitwise or" | (single); broadcast `.||` needs Julia 1.7 or later
        # jt = jt + 1;
    end
    # df_out = df_in[vec(idx_keep),:];
    df_out = df_in[idx_keep,:]  # I think `idx_keep` is always a Vector, so `vec` is redundant?
    
    return df_out
end
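
As an aside, the loop that ORs the comparisons together can also be written as a single membership test. The following is only a sketch on plain vectors, and `keep_mask` is a hypothetical helper name, not something from your code or from DataFrames:

```julia
# Hypothetical helper: mark every element of `cert` that appears in `cert_keep`.
function keep_mask(cert::AbstractVector, cert_keep::AbstractVector)
    keep = Set(cert_keep)          # a Set gives fast membership checks
    return in.(cert, Ref(keep))    # `Ref` stops broadcasting over the set itself
end

cert = [1, 1, 2, 2, 3, 3]
mask = keep_mask(cert, [1, 2])
# With a DataFrame, the rows would then be selected as `df_in[mask, :]`.
```

The result is a Boolean mask with the same length as `cert`, so it can be used for row indexing in the same way as `idx_keep`.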

Can you provide a minimal input DataFrame example to debug this? For example, something like:

julia> df = DataFrame(A = 1:3, B = [2.0, -1.1, 2.8], cert = ["p","q","r"])
julia> Getdf(df)
ERROR: BoundsError: attempt to access 3-element Vector{String} at index [1:10]
Stacktrace:
 [1] throw_boundserror(A::Vector{String}, I::Tuple{UnitRange{Int64}})
   @ Base ./abstractarray.jl:691
 [2] checkbounds
   @ ./abstractarray.jl:656 [inlined]
 [3] getindex(A::Vector{String}, I::UnitRange{Int64})
   @ Base ./array.jl:867
 [4] Getdf(df_in::DataFrame)
   @ Main ~/julia/Examples/discourse/dataframe_err.jl:14
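
If the number of unique values is not guaranteed, one way to avoid this particular BoundsError is to take at most 10 elements instead of exactly 10. This is a minimal sketch on a plain vector; it assumes Julia 1.6 or later for `first(v, n)`:

```julia
certu = unique(["p", "q", "r"])

# `certu[1:10]` would throw a BoundsError here because only 3 values exist.
# `first(certu, 10)` returns at most 10 elements without throwing.
cert_keep = first(certu, 10)

# Equivalent explicit guard that also works on older Julia versions:
cert_keep_guarded = certu[1:min(10, length(certu))]
```

Both forms return all three values here, and exactly the first 10 when more are available.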

Hi James,

Thanks for taking the time to write all those comments!

Here is an example that you can run and will give you the same error.

using DataFrames

df = DataFrame(A = 1:6, B = 1:6, cert = [1,1,2,2,3,3])

function Getdf(df_in::DataFrame) 
    cert = df_in[:,"cert"]
    certu = unique(cert)
    cert_keep = certu[1:2]
    idx_keep = (cert .== cert_keep[1])
    for jt in 2:length(cert_keep)
        id = cert_keep[jt]
        idx_keep = (idx_keep .| (cert .== id)) 
    end
    df_out = df_in[idx_keep,:]  
    return df_out
end

df_in_new = Getdf(df);

Notice that I have changed the line cert_keep = certu[1:10] to cert_keep = certu[1:2] to fit the example. My understanding from infiltrating the function is that the problem happens when I return the function’s output.

To answer your question, in this case, I am expecting a DataFrame, but I don’t know if that is generating the problem. I use the function to reduce the size of a large dataset. The goal is to debug a set of functions on a small dataset to make the process faster. So, I know from the data that certu has more than 10 elements.

I hope this helps.

Skander

That example runs fine for me and produces

julia> Getdf(df)
4×3 DataFrame
 Row │ A      B      cert
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      1
   3 │     3      3      2
   4 │     4      4      2

same for me

Really? This is what I get.

ERROR: BoundsError: attempt to access 0-element Vector{Base.StackTraces.StackFrame} at index [1]
Stacktrace:
 [1] getindex
   @ .\array.jl:801 [inlined]
 [2] start_prompt(mod::Module, locals::Dict{Symbol, Any}, file::String, fileline::Int64; terminal::Nothing, repl::Nothing, nostack::Bool)
   @ Infiltrator C:\Users\DELL\.julia\packages\Infiltrator\doHg1\src\Infiltrator.jl:212
 [3] start_prompt(mod::Module, locals::Dict{Symbol, Any}, file::String, fileline::Int64)
   @ Infiltrator C:\Users\DELL\.julia\packages\Infiltrator\doHg1\src\Infiltrator.jl:193
 [4] top-level scope
   @ C:\Users\DELL\.julia\packages\Infiltrator\doHg1\src\Infiltrator.jl:52

Are you running this in a new Julia session?

Yes, I am running it on my main script.

Sorry, I don’t know what that means. Are you saying that if you start a fresh Julia REPL in your terminal and paste in the example you posted above, you get the indexing error?

Bogumil and I are saying that we have tried this and did not see any errors, so if that’s still the case for you, we’d need to know the output of versioninfo() and ]st to understand how you’re running Julia.

From the error stacktrace, this is running out of an Infiltrator call.
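
For reference, here is a rough sketch of how to list the frames of an error programmatically and see where it originated. `take_first_ten` is a made-up function for illustration only; the rest uses `stacktrace` and `catch_backtrace` from Base:

```julia
# Hypothetical function that indexes past the end of a short vector.
take_first_ten(v) = v[1:10]

frames = try
    take_first_ten(["p", "q", "r"])
    Base.StackTraces.StackFrame[]      # not reached: the call above throws
catch err
    err isa BoundsError || rethrow()
    stacktrace(catch_backtrace())      # frames, innermost first
end

# Print the function, file and line for each frame. The frames near the top
# show which package or file the error actually came from.
for frame in frames
    println(frame.func, " @ ", frame.file, ":", frame.line)
end
```

In the trace posted above, the frames point at Infiltrator.jl and Getdf does not appear in any of them, which suggests the function itself is not where the error is raised.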

So, I have written the code in a script that I am running in VS Code. I am not using the Julia REPL. As James mentioned, I am also calling @infiltrate after the line

df_in_new = Getdf(df);

to inspect my variables. I should have probably mentioned that earlier.

Sorry for the confusion.

Ah, that clarifies things - I don’t know Infiltrator at all, but it appears that this error is entirely unrelated to your function which runs without error.

Thanks for the answers. Somehow I thought that the problem was coming from the function because of the BoundsError. I just tried the regular debugger from VS Code, and it seems to be working, but it’s much slower than Infiltrator. So, I will use the debugger until I figure out what is happening with @infiltrate.