DataFrames: obtaining the subset of rows by a set of values

dataframes

#1

Hi everyone,

I would like to perform an operation similar to the SQL SELECT * FROM table WHERE col IN ("1","2","3") on a DataFrame. I have an array with a few strings that I wish to use in this selection. From what I have read this could be obtained with the occursin function, however, this is what happens when I try to use it:

julia> occursin(selection, df)
ERROR: MethodError: no method matching occursin(::Array{Union{Missing, String},1}, ::Array{Union{Missing, String},1})
Stacktrace:
 [1] top-level scope at none:0

Note that there are no NA or nothing values in the selection array. I have two questions:

  1. Is occursin the right way to this?
  2. If yes, how can the MethodError message be addressed?

Thank you.


#2

Normally I’d do

filter(row -> row.col ∈ [1,2,3], df)

You could also do

df[∈([1,2,3]).(df.col), :]

If you prefer SQL style statements check out Query.jl or DataFramesMeta.jl, but, as you see above, simply using Base and basic DataFrames functions is quite nice.

Re-reading your statement, I’m a little confused about whether I had the use case right… is your column String valued? In that case, you could do the above with ["1", "2", "3"].


#3

I think you can use dataframesmeta package and apply the following commands the way they suit your needs.

Select row subsets.

@where(df, :x .> 1)
@where(df, :x .> x)
@where(df, :x .> x, :y .== 3) # the two expressions are “and-ed”


#4

Hi ExpandingMan,

you are correct, the column is of type String, as is the array with the selection values. I tried the methods your propose, both return empty DataFrames.

Thank you.


#5

Dear Ajay,

Is there a way of applying the @where function with an type of operator?

Thank you.


#6

Ah, is this:

filter(r -> any(occursin.(["1", "2", "3"], r.col)), df)

what you’re looking for?


#7

Z@Luis_de_Sousa: @where is to filter a subset of dataframe columns based on certain conditions. However, you can use any of the two commnads to extract a subset of columns

flds=map(e->e∈[:x1,:x2,:x3,:x4],names(features))

1. features[flds]   Or
2. @select(features,flds)

I hope this meets your requirements.

Edit:
You can also use column names directly in place of ‘flds’


#8

You can also use columns nos in place of 'flds" e.g. [1,3,5]


#9

Here is the Query.jl way of doing this (if I understand the question correctly):

using Queryverse

df = # Get your DataFrame

df |> @filter(_.colA ∈ ["stringA", "stringB", "stringC"]) |> DataFrame

#10

Another (probably faster) way of doing this is df[findall(in(["stringA", "stringB", "stringC"]), df.col), :].

Note that occursin is meant to look for substrings inside a string. So that’s quite different.


#11

I hope this is what you are looking for.

@where(features,map(x->x ∈ [60,55],features.Age))[1:5]

The suffix [1:5] is for the columns nos to select for output.


#12

DataFramesMeta lets you do this:

@linq df |>
       where(:a .∈ Ref([1,2,5,7]))

Key thing here is the Ref for the broadcasted in. I suppose thats the most reasonable behavior but its a bit unexpected.


#13

You could do:

@where(df, :col .∈ [[1,2,3]])

The trick is putting array into array to apply dot operator on whole array.


#14

Hi again ExpandingMan,

This last formulation returns an error:

ERROR: MethodError: no method matching filter(::getfield(Main, Symbol("##19#20")))
Closest candidates are:
  filter(::Any, ::Array{T,1} where T) at array.jl:2352
  filter(::Any, ::BitArray) at bitarray.jl:1637
  filter(::Any, ::AbstractArray) at array.jl:2313
  ...
Stacktrace:
 [1] top-level scope at none:0

Thank you.


#15

Hello again,

This also returns an empty DataFrame. I suspect the type of the array with the selection values (Array{Union{Missing, String}) is causing some issue. I will try to get to a minimum workable example.

Thank you.


#16

Hi David,

Unfortunately, Queryverse is failing to install on the Julia version I use (1.0.0). So I can not try this suggestion.

Thank you.


#17

The following code should not return any error:

using DataFrames

df = DataFrame(col=rand(["something1", "something2", "something4"], 100), x=rand(100))

tdf = filter(r -> any(occursin.(["1", "2", "3"], r.col)), df)

and should return a non-empty dataframe (unless you are extremely unlucky! :laughing:) with only entries where col is "something1" or "something2".


#18

Could you open an issue about the Queryverse install issue? I’d like to fix whatever the problem is :slight_smile:


#19

This is now issue #11.


#20

Btw, my experience shows that filter() is several times slower than @where macro from DataFramesMeta or plain DataFrames filters - on significant datasets.

I’m adding few in and not in queries - notice the time and memory usage from filter() function:

(this is from Julia 1.0.1)

julia> @time df = DataFrame(col=rand(["something1", "something2", "something3", "something4"], 10000000), x=rand(10000000));
  0.238002 seconds (54 allocations: 152.591 MiB, 25.23% gc time)

julia> @time tdf = filter(r -> in(r.col, ["something1", "something2"]), df); size(tdf)
  3.934676 seconds (85.04 M allocations: 2.694 GiB, 20.75% gc time)
(4999142, 2)

julia> @time tdf = @where df (in.(:col, [["something1", "something2"]])); size(tdf)
  0.441025 seconds (43.80 k allocations: 79.399 MiB, 2.83% gc time)
(4999142, 2)

julia> @time tdf = df[in(["something1", "something2"]).(df.col), :]; size(tdf)
  0.396203 seconds (41 allocations: 77.479 MiB, 5.39% gc time)
(4999142, 2)

julia> @time tdf = @where df .!(:col .∈ [["something1", "something2"]]); size(tdf)
  0.490810 seconds (45.49 k allocations: 79.496 MiB, 10.09% gc time)
(5000858, 2)

julia> @time tdf = @where df .!(in.(:col, [["something1", "something2"]])); size(tdf)
  0.607302 seconds (45.49 k allocations: 79.496 MiB, 27.52% gc time)
(5000858, 2)