I would like to perform an operation similar to the SQL SELECT * FROM table WHERE col IN ("1","2","3") on a DataFrame. I have an array with a few strings that I wish to use in this selection. From what I have read, this could be done with the occursin function. However, this is what happens when I try to use it:
julia> occursin(selection, df)
ERROR: MethodError: no method matching occursin(::Array{Union{Missing, String},1}, ::Array{Union{Missing, String},1})
Stacktrace:
[1] top-level scope at none:0
Note that there are no NA or nothing values in the selection array. I have two questions:
Is occursin the right way to do this?
If yes, how can the MethodError message be addressed?
If you prefer SQL style statements check out Query.jl or DataFramesMeta.jl, but, as you see above, simply using Base and basic DataFrames functions is quite nice.
Re-reading your statement, I’m a little confused about whether I had the use case right… is your column String valued? In that case, you could do the above with ["1", "2", "3"].
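For reference, here is a minimal sketch of an IN-style selection using only Base. It assumes the column is a plain vector of strings; the names col and selection and their values are made up for illustration. Note that occursin checks whether one string is contained in another, which is why it has no method for two arrays.

```julia
# Hypothetical data standing in for the DataFrame column and the selection array
col = ["1", "5", "2", "7", "3"]
selection = ["1", "2", "3"]

# occursin(needle, haystack) is a substring test on strings, not a membership test
@assert occursin("1", "213")

# Element-wise membership: Ref(selection) makes broadcasting treat the
# selection array as a single value instead of iterating over it
mask = in.(col, Ref(selection))   # SQL: col IN ("1","2","3")
notmask = .!mask                  # SQL: col NOT IN ("1","2","3")

selected = col[mask]
rejected = col[notmask]
# With a DataFrame `df` holding this column, the same mask would select rows: df[mask, :]
```

The Ref wrapper is the important part: without it, broadcasting pairs the elements of the two arrays position by position instead of testing each value against the whole selection.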
You are correct, the column is of type String, as is the array with the selection values. I tried the methods you propose; both return empty DataFrames.
@Luis_de_Sousa: @where filters the rows of a DataFrame based on conditions on its columns. However, you can use either of the two commands below to extract a subset of columns.
flds=map(e->e∈[:x1,:x2,:x3,:x4],names(features))
1. features[flds] Or
2. @select(features,flds)
I hope this meets your requirements.
Edit:
You can also use column names directly in place of ‘flds’
This also returns an empty DataFrame. I suspect the type of the array with the selection values (Array{Union{Missing, String},1}) is causing some issue. I will try to put together a minimal working example.
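As a quick way to test that suspicion, here is a small sketch with made-up values. It suggests that the Union{Missing, String} element type by itself does not stop in from matching; trouble only starts when the column actually contains a missing, in which case the mask has to be cleaned up before indexing.

```julia
# Hypothetical values; both arrays carry the Union{Missing, String} element type
col = Union{Missing, String}["1", "5", "2"]
selection = Union{Missing, String}["1", "2", "3"]

# The element type alone does not prevent matches
mask = in.(col, Ref(selection))

# If the column does hold a missing, `in` yields missing for that entry
col_m = Union{Missing, String}["1", missing, "2"]
mask_m = in.(col_m, Ref(selection))

# Treat missing as "not selected" so the mask can be used for indexing
safe_mask = coalesce.(mask_m, false)
```

If a selection like this still comes back empty, the values themselves (for example stray whitespace or a different string type) would be the next thing to compare.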
Btw, in my experience filter() is several times slower than the @where macro from DataFramesMeta or plain DataFrames indexing on large datasets.
I’m adding a few "in" and "not in" queries. Notice the time and memory usage of the filter() function: