Filter DataFrame by an Array

Hi,
I want to sub-select based on specific values from a Dataframe, like in R would be done by
valuesX %in% listY

temp_df = DataFrame(IndexVal = 1:10, Names = [“A”,“B”,“C”,“A1”,“B1”,“C1”,“A2”,“B2”,“C2”,“A3”])

temp_df[[temp_df[:IndexVal] in [1,3,5]],:]

Here I want to select only the values where IndexVal is 1,3 or 5.

How do I do this?

Thank you in advance

Sounds like you need my favorite function that’s not in base.

function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end
temp_df[vectorin(temp_df[:IndexVal], [1,3,5]),:]
1 Like

I think a more idiomatic way of doing this in Julia would be to use filter:

filter(row -> row[:IndexVal] in [1,3,5], temp_df)

But in order to get in working as you want, you need to use . to vectorize it and use a little “trick”:

temp_df[in.(temp_df[:IndexVal], ([1,3,5],)), :]

The trick there is to have the second array as an (only) element of some other iterable (I used a 1-element tuple but it can also be an array of arrays - [[1,3,5]]).
You can read more about vectorizing/broadcasting from the Functions section of the manual.

Yet another way is to use findin to get the desired indices:

temp_df[findin(temp_df[:IndexVal], [1,3,5]), :]
2 Likes

Thank you both - very clear and concise

is findin still supported currently?

It is not. You can use findall(in(A), B) instead:

temp_df[findall(in([1,3,5]), temp_df[:IndexVal]), :]
2 Likes

@ValdarT could you also post the syntax needed to select values from temp_df where IndexVal is NOT 1, 3, or 5? My first impression is to surround in([1,3,5]) with Not() but that can’t be right.

Not is meant to be used for selection of columns, what you are looking for is just the classical predicate negation function: !

temp_df[findall(!in([1,3,5]), temp_df[!,:IndexVal]), :]
1 Like

You may want to check out the latest doc: http://juliadata.github.io/DataFrames.jl/v0.20/man/getting_started/

So for your case, the standard way should be:
temp_df[in.(temp_df.IndexVal, Ref([1,3,5]), :]
or
temp_df[in([1,3,5]).(temp_df.IndexVal), :]

Ref here is used to protect parameter that you don’t want broadcast operation to apply to ([1,3,5] in your example)

1 Like