ERROR: BoundsError when using function taking dataframe as input

Hi there,

I have been getting a BoundsError when calling a function that takes a DataFrame as input and returns one as well. I have tried different versions of the code below:

function Getdf(df_in)
    cert = convert(Vector, df_in[:,"cert"]);
    certu = unique(cert);
    cert_keep = certu[1:10];
    idx_keep = cert .== cert_keep[1];
    jt = 2;
    maxjt = length(cert_keep);
    while  jt <= maxjt
        id = cert_keep[jt];
        idx_keep = idx_keep .| (cert .== id);
        jt = jt + 1;
    end
    df_out = df_in[vec(idx_keep),:];
    
    return df_out
end

df_in_new = Getdf(df_in_old);

However, I keep receiving the following message:

ERROR: BoundsError: attempt to access 0-element Vector{Base.StackTraces.StackFrame} at index [1]

There is something fundamental here that I do not seem to understand.

Most of the questions related to this topic suggest preallocating df_in_new. I am trying to define df_in_new by using the function Getdf. In the past, I was able to define variables in this way without preallocating them, so I cannot see what is different now.

I would greatly appreciate your help on this.

Hi Skander, welcome.

Below is how I would write your function, with some questions added as comments. I have marked in the comments where I think your error occurs:

using DataFrames

# function Getdf(df_in)
function Getdf(df_in::DataFrame) # Are you always expecting the input to be a dataframe?
    # cert = convert(Vector, df_in[:,"cert"]);
    cert = df_in[:,"cert"] # dataframe columns are already `Vector`s

    # certu = unique(cert);
    certu = unique(cert)  # terminating with a semicolon in functions makes no difference

    # cert_keep = certu[1:10];
    # But how do you know there are at least 10 unique elements in cert?
    cert_keep = certu[1:10] # ? I think this is where your error is occurring ?

    # idx_keep = cert .== cert_keep[1];
    idx_keep = (cert .== cert_keep[1]) # bracket to help readability

    # jt = 2;
    # maxjt = length(cert_keep);
    # while  jt <= maxjt
    for jt in 2:length(cert_keep)
        # id = cert_keep[jt];
        id = cert_keep[jt]
        # idx_keep = idx_keep .| (cert .== id);
        idx_keep = (idx_keep .|| (cert .== id)) # use the "boolean or" || (double) instead of the "bitwise or" | (single); the broadcast form `.||` needs Julia 1.7 or later, and `.|` also works on Bool arrays
        # jt = jt + 1;
    end
    # df_out = df_in[vec(idx_keep),:];
    df_out = df_in[idx_keep,:]  # I think `idx_keep` is always a Vector, so `vec` is redundant?
    
    return df_out
end

Can you provide a minimal input dataframe example to debug this? For example, something like:

julia> df = DataFrame(A = 1:3, B = [2.0, -1.1, 2.8], cert = ["p","q","r"])
julia> Getdf(df)
ERROR: BoundsError: attempt to access 3-element Vector{String} at index [1:10]
Stacktrace:
 [1] throw_boundserror(A::Vector{String}, I::Tuple{UnitRange{Int64}})
   @ Base ./abstractarray.jl:691
 [2] checkbounds
   @ ./abstractarray.jl:656 [inlined]
 [3] getindex(A::Vector{String}, I::UnitRange{Int64})
   @ Base ./array.jl:867
 [4] Getdf(df_in::DataFrame)
   @ Main ~/julia/Examples/discourse/dataframe_err.jl:14
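If the concern about `certu[1:10]` applies, here is a minimal sketch of one way to avoid indexing past the end. The name `keep_first_certs` and its argument `n` are hypothetical, not from the original post; it assumes Julia 1.6 or later, where `first(v, n)` returns at most `n` elements, and uses the `filter(:col => predicate, df)` form from DataFrames.

```julia
using DataFrames

# Hypothetical helper (not from the original post): keep the rows whose `cert`
# value is among the first `n` unique values, without assuming `n` of them exist.
function keep_first_certs(df_in::DataFrame, n::Integer = 10)
    certu = unique(df_in.cert)
    # `first(v, n)` returns fewer than `n` elements when `v` is shorter than `n`
    cert_keep = Set(first(certu, n))
    # `in(cert_keep)` builds a predicate that tests membership in the set
    return filter(:cert => in(cert_keep), df_in)
end

df = DataFrame(A = 1:3, B = [2.0, -1.1, 2.8], cert = ["p", "q", "r"])
small = keep_first_certs(df)      # only 3 unique values, so no BoundsError
two = keep_first_certs(df, 2)     # rows whose cert is "p" or "q"
```

This also replaces the loop over `cert_keep` with a single membership test, which is a design preference rather than a fix for the error being discussed.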

Hi James,

Thanks for taking the time to write all those comments!

Here is an example that you can run and will give you the same error.

using DataFrames

df = DataFrame(A = 1:6, B = 1:6, cert = [1,1,2,2,3,3])

function Getdf(df_in::DataFrame) 
    cert = df_in[:,"cert"]
    certu = unique(cert)
    cert_keep = certu[1:2]
    idx_keep = (cert .== cert_keep[1])
    for jt in 2:length(cert_keep)
        id = cert_keep[jt]
        idx_keep = (idx_keep .| (cert .== id)) 
    end
    df_out = df_in[idx_keep,:]  
    return df_out
end

df_in_new = Getdf(df);

Notice that I have changed the line cert_keep = certu[1:10] to cert_keep = certu[1:2] to fit the example. My understanding from stepping through the function with Infiltrator is that the problem happens when the function returns its output.

To answer your question, in this case I am expecting a dataframe, but I don’t know if that is generating the problem. I use the function to reduce the size of a large dataset. The goal is to debug a set of functions on a small dataset to make the process faster. So, I know from the data that certu has more than 10 elements.

I hope this helps.

Skander

That example runs fine for me and produces

julia> Getdf(df)
4×3 DataFrame
 Row │ A      B      cert
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      1
   3 │     3      3      2
   4 │     4      4      2

same for me

Really? This is what I get.

ERROR: BoundsError: attempt to access 0-element Vector{Base.StackTraces.StackFrame} at index [1]
Stacktrace:
 [1] getindex
   @ .\array.jl:801 [inlined]
 [2] start_prompt(mod::Module, locals::Dict{Symbol, Any}, file::String, fileline::Int64; terminal::Nothing, repl::Nothing, nostack::Bool)
   @ Infiltrator C:\Users\DELL\.julia\packages\Infiltrator\doHg1\src\Infiltrator.jl:212
 [3] start_prompt(mod::Module, locals::Dict{Symbol, Any}, file::String, fileline::Int64)
   @ Infiltrator C:\Users\DELL\.julia\packages\Infiltrator\doHg1\src\Infiltrator.jl:193
 [4] top-level scope
   @ C:\Users\DELL\.julia\packages\Infiltrator\doHg1\src\Infiltrator.jl:52

Are you running this in a new Julia session?

Yes, I am running it on my main script.

Sorry I don’t know what that means. Are you saying that if you start a fresh Julia REPL in your terminal and paste in the example you posted above you are getting the indexing error?

Bogumil and I are saying that we have tried this and did not see any errors, so if that’s still the case for you we’d need to know the output of versioninfo() and ]st to understand how you’re running Julia.
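For anyone running from a script rather than the REPL, the same information can be printed with ordinary function calls. This is a small sketch; `Pkg.status()` is the function form of the `]st` command in the package REPL mode.

```julia
using InteractiveUtils  # provides versioninfo() outside the REPL
using Pkg

versioninfo()   # Julia version, OS, and CPU details
Pkg.status()    # packages and versions in the active environment (same as `]st`)
```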

From the error stacktrace, this is running out of an Infiltrator call.

So, I have written the code in a script that I am running in VS Code. I am not using the Julia REPL. As James mentioned, I am also using an @infiltrate call after the line

df_in_new = Getdf(df);

to inspect my variables. I should have probably mentioned that earlier.

Sorry for the confusion.

Ah, that clarifies things - I don’t know Infiltrator at all, but it appears that this error is entirely unrelated to your function which runs without error.

Thanks for the answers. Somehow I thought that the problem was coming from the function because of the BoundsError. I just tried the regular debugger from VS Code, and it seems to be working, but it’s much slower than Infiltrator. So, I will use the debugger until I figure out what is happening with @infiltrate.