Creating sub-arrays by comparing multiple arrays DataFrames

Hi

I am very new to Julia, and have some experience with Matlab. However, I have been struggling to perform the following:

I have two DataFrames X and Y that I have created (imported from txt file) .

Let’s say X is the following

│ X1 │ X2  │ X3  │ X4 │ X5  │
-----------------------------------
│ 1  │ "a" │ "a" │ 1  │ "a" │
│ 2  │ "b" │ "b" │ 2  │ "b" │
│ 3  │ "c" │ "c" │ 3  │ "c" │
│ 1  │ "a" │ "d" │ 4  │ "a" │

Where column X1 contains numbers and column X2 contains strings.

Let’s say Y is the following

│ Y1 │Y2 │ Y3  │ Y4 │Y5   │ Y6 │ Y7  │
-----------------------------------------------
│ 1  │ 1 │ "a" │ 1  │ "a" │ 1  │ "a" |
│ 2  │ 2 │ "b" │ 2  │ "b" │ 1  │ "b" |
│ 1  │ 1 │ "c" │ 3  │ "c" │ 1  │ "d" |
│ 3  │ 2 │ "d" │ 4  │ "a" │ 1  │ "e" |

Also with columns a mix of numbers and strings.

I would like to find the subset of X where only elements of the column X2 that match the values contained in Y7 i.e.

│ X1 │ X2  │ X3  │ X4 │ X5  │
-----------------------------------
│ 1  │ "a" │ "a" │ 1  │ "a" │
│ 2  │ "b" │ "b" │ 2  │ "b" │
│ 1  │ "a" │ "d" │ 4  │ "a" │

Also, I would like to apply a second filter on that subset, by filtering those elements of X1 that are equal to a certain value (value contained in Y2), in this case 1.

│ X1 │X2 │ X3  │ X4 │ X5  │
-----------------------------------
│ 1  │ 1 │ "a" │ 1  │ "a" │
│ 1  │ 2 │ "d" │ 4  │ "a" │

How could I perform this?

Thanks a lot for your help.

If all the X and Y are in the same DataFrame, you could do something like this with DataFramesMeta

julia> using DataFrames, DataFramesMeta
julia> df = DataFrame(x = 1:10, y = 1:10)
10×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
│ 7   │ 7     │ 7     │
│ 8   │ 8     │ 8     │
│ 9   │ 9     │ 9     │
│ 10  │ 10    │ 10    │

julia> @where(df, :x .== :y)
10×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
│ 7   │ 7     │ 7     │
│ 8   │ 8     │ 8     │
│ 9   │ 9     │ 9     │
│ 10  │ 10    │ 10    │

Are your X and Y Arrays or DataFrames? Your question isn’t clear to me.

This works:

using DataFrames

# create dummy data
x = DataFrame(:x1 => rand(1:10, 100), :x2 => rand(1:10, 100))
y = DataFrame(:y1 => rand(1:10, 100), :y2 => rand(1:10, 100))

# Select rows where the values in the x1 column of x are the same as the y1
# column of y, and the x2 value is 7 (arbitrary value for illustration)
x[(x.x1 .== y.y1) .& (x.x2 .== 7), :]

This however implicitly assumes that it is meaningful to compare the values of a column in x to the values in the same row of a column in y. In that case it would probably be more natural to simply have x and y as a single DataFrame with all columns in x and y:

df = hcat(x,y)

then:

df[(df.x1 .== df.y1) .& (df.x2 .== 7), [:x1, :x2]] == x[(x.x1 .== y.y1) .& (x.x2 .== 7), :]
# true

Thank you, I just edited the post. Unfortunately, X and Y are not in the same DataFrame.

My suggestion still works in principle, however upon reading your edited post it appears that your initial criterion is not that the value of X.X2 should not be the same as the value in the same row in Y.Y7, but rather you want all rows for which the value in X.X2 is contained anywhere in Y.Y7, correct?

In which case you want do to:

x[in.(x.x2, Ref(unique(y.y7))) .& (x.x2 .== 1), :]

EDIT a slightly more concise way to write the above:

x[in(unique(y.Y7)).(x.X2), :]
1 Like