I have been getting a BoundsError when calling a function that takes a dataframe as input and returns one as well. I have tried different versions of the code below:
function Getdf(df_in)
    cert = convert(Vector, df_in[:,"cert"]);
    certu = unique(cert);
    cert_keep = certu[1:10];
    idx_keep = cert .== cert_keep[1];
    jt = 2;
    maxjt = length(cert_keep);
    while jt <= maxjt
        id = cert_keep[jt];
        idx_keep = idx_keep .| (cert .== id);
        jt = jt + 1;
    end
    df_out = df_in[vec(idx_keep),:];
    return df_out
end
df_in_new = Getdf(df_in_old);
However, I keep receiving the following message:
ERROR: BoundsError: attempt to access 0-element Vector{Base.StackTraces.StackFrame} at index [1]
There is something fundamental here that I do not seem to understand.
Most of the questions related to this topic suggest preallocating df_in_new. I am trying to define df_in_new by using the function Getdf. In the past, I was able to define variables in this way without preallocating them, so I cannot see what is different now.
Below is how I would write your function, but I've added some questions as comments. I think this is causing your error:
using DataFrames
# function Getdf(df_in)
function Getdf(df_in::DataFrame) # Are you always expecting the input to be a dataframe?
    # cert = convert(Vector, df_in[:,"cert"]);
    cert = df_in[:,"cert"] # dataframe columns are already `Vector`s
    # certu = unique(cert);
    certu = unique(cert) # terminating with a semicolon in functions makes no difference
    # cert_keep = certu[1:10];
    # But how do you know there are at least 10 unique elements in cert?
    cert_keep = certu[1:10] # ? I think this is where your error is occurring ?
    # idx_keep = cert .== cert_keep[1];
    idx_keep = (cert .== cert_keep[1]) # bracket to help readability
    # jt = 2;
    # maxjt = length(cert_keep);
    # while jt <= maxjt
    for jt in 2:length(cert_keep)
        # id = cert_keep[jt];
        id = cert_keep[jt]
        # idx_keep = idx_keep .| (cert .== id);
        idx_keep = (idx_keep .|| (cert .== id)) # use the "boolean or" || (double) instead of the "bitwise or" | (single); note that `.||` needs Julia 1.7+, and `.|` also works on Bool arrays
        # jt = jt + 1;
    end
    # df_out = df_in[vec(idx_keep),:];
    df_out = df_in[idx_keep,:] # I think `idx_keep` is always a Vector, so `vec` is redundant?
    return df_out
end
Can you provide a minimal input dataframe example to debug this? e.g.: something like
julia> df = DataFrame(A = 1:3, B = [2.0, -1.1, 2.8], cert = ["p","q","r"])
julia> Getdf(df)
ERROR: BoundsError: attempt to access 3-element Vector{String} at index [1:10]
Stacktrace:
[1] throw_boundserror(A::Vector{String}, I::Tuple{UnitRange{Int64}})
@ Base ./abstractarray.jl:691
[2] checkbounds
@ ./abstractarray.jl:656 [inlined]
[3] getindex(A::Vector{String}, I::UnitRange{Int64})
@ Base ./array.jl:867
[4] Getdf(df_in::DataFrame)
@ Main ~/julia/Examples/discourse/dataframe_err.jl:14
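If the worry is that `cert` might have fewer unique values than you ask for, one way to guard against it is to clamp the range before indexing. This is only a rough sketch, not your original method: `keep_mask` is a name I made up for illustration, it uses plain vectors so it runs without DataFrames, and it swaps the loop for a `Set` membership test.

```julia
# Hypothetical helper (not from the thread): build a Bool mask that keeps the
# entries whose value is among the first `n` unique values of `cert`.
function keep_mask(cert::AbstractVector, n::Integer)
    certu = unique(cert)
    # Clamp the range so that having fewer than `n` unique values
    # cannot raise a BoundsError.
    n_keep = min(n, length(certu))
    cert_keep = Set(certu[1:n_keep])
    # Ref(...) makes broadcasting treat the set as a single value.
    return in.(cert, Ref(cert_keep))
end

cert = [1, 1, 2, 2, 3, 3]
mask = keep_mask(cert, 2)               # keeps the entries equal to 1 or 2
small = keep_mask(["p", "q", "r"], 10)  # only 3 unique values, no error
```

With a dataframe, the mask could then be used as `df_in[keep_mask(df_in.cert, 10), :]`.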
Thanks for taking the time to write all those comments!
Here is an example that you can run and will give you the same error.
using DataFrames
df = DataFrame(A = 1:6, B = 1:6, cert = [1,1,2,2,3,3])
function Getdf(df_in::DataFrame)
    cert = df_in[:,"cert"]
    certu = unique(cert)
    cert_keep = certu[1:2]
    idx_keep = (cert .== cert_keep[1])
    for jt in 2:length(cert_keep)
        id = cert_keep[jt]
        idx_keep = (idx_keep .| (cert .== id))
    end
    df_out = df_in[idx_keep,:]
    return df_out
end
df_in_new = Getdf(df);
Notice that I have changed the line cert_keep = certu[1:10] to cert_keep = certu[1:2] to fit the example. My understanding from infiltrating the function is that the problem happens when I return the function's output.
To answer your question, in this case, I am expecting a dataframe, but I don't know if that is generating the problem. I use the function to reduce the size of a large dataset. The goal is to debug a set of functions on a small dataset to make the process faster. So, I know from the data that certu has more than 10 elements.
Sorry, I don't know what that means. Are you saying that if you start a fresh Julia REPL in your terminal and paste in the example you posted above, you get the indexing error?
Bogumil and I are saying that we have tried this and did not see any errors, so if that's still the case for you, we'd need to know the output of versioninfo() and ]st to understand how you're running Julia.
So, I have written the code in a script that I am running in VS Code. I am not using the Julia REPL. As James mentioned, I am also using an infiltrator after the line
df_in_new = Getdf(df);
to inspect my variables. I should have probably mentioned that earlier.
Ah, that clarifies things - I don't know Infiltrator at all, but it appears that this error is entirely unrelated to your function, which runs without error.
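A hint that the function is not at fault is the container type named in the message: the original error mentions a `Vector{Base.StackTraces.StackFrame}`, whereas indexing past the end of `certu` names the element type of the column (a `Vector{String}` in my small example above). Here is a small sketch that shows the difference using plain Julia. `bounds_error_eltype` is a made-up helper, and the empty stack-frame vector only imitates the kind of access that fails; I am not claiming this is what Infiltrator does internally.

```julia
# Hypothetical helper: return the element type of the container named in a
# BoundsError, or `nothing` if `f()` does not throw at all.
function bounds_error_eltype(f)
    try
        f()
    catch err
        # `err.a` holds the array whose bounds were violated.
        err isa BoundsError && return eltype(err.a)
        rethrow()
    end
    return nothing
end

# Indexing past the end of a column-like vector, as `certu[1:10]` would.
string_case = bounds_error_eltype(() -> ["p", "q", "r"][1:10])

# Indexing into an empty vector of stack frames, as in the reported message.
frames = Base.StackTraces.StackFrame[]
frame_case = bounds_error_eltype(() -> frames[1])
```

Here `string_case` is `String` and `frame_case` is `Base.StackTraces.StackFrame`, so the two errors come from different containers even though both are a `BoundsError`.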
Thanks for the answers. Somehow I thought that the problem was coming from the function because of the BoundsError. I just tried the regular debugger in VS Code, and it seems to be working, but it's much slower than the infiltrator. So, I will use the debugger until I figure out what is happening with @infiltrate.