Creating sub-arrays by comparing multiple arrays DataFrames

eed · April 25, 2019, 4:12pm

Hi

I am very new to Julia, and have some experience with Matlab. However, I have been struggling to perform the following:

I have two DataFrames X and Y that I have created (imported from txt file) .

Let’s say X is the following

│ X1 │ X2  │ X3  │ X4 │ X5  │
-----------------------------------
│ 1  │ "a" │ "a" │ 1  │ "a" │
│ 2  │ "b" │ "b" │ 2  │ "b" │
│ 3  │ "c" │ "c" │ 3  │ "c" │
│ 1  │ "a" │ "d" │ 4  │ "a" │

Where column X1 contains numbers and column X2 contains strings.

Let’s say Y is the following

│ Y1 │Y2 │ Y3  │ Y4 │Y5   │ Y6 │ Y7  │
-----------------------------------------------
│ 1  │ 1 │ "a" │ 1  │ "a" │ 1  │ "a" |
│ 2  │ 2 │ "b" │ 2  │ "b" │ 1  │ "b" |
│ 1  │ 1 │ "c" │ 3  │ "c" │ 1  │ "d" |
│ 3  │ 2 │ "d" │ 4  │ "a" │ 1  │ "e" |

Also with columns a mix of numbers and strings.

I would like to find the subset of X where only elements of the column X2 that match the values contained in Y7 i.e.

│ X1 │ X2  │ X3  │ X4 │ X5  │
-----------------------------------
│ 1  │ "a" │ "a" │ 1  │ "a" │
│ 2  │ "b" │ "b" │ 2  │ "b" │
│ 1  │ "a" │ "d" │ 4  │ "a" │

Also, I would like to apply a second filter on that subset, by filtering those elements of X1 that are equal to a certain value (value contained in Y2), in this case 1.

│ X1 │X2 │ X3  │ X4 │ X5  │
-----------------------------------
│ 1  │ 1 │ "a" │ 1  │ "a" │
│ 1  │ 2 │ "d" │ 4  │ "a" │

How could I perform this?

Thanks a lot for your help.

aaowens · April 25, 2019, 4:19pm

If all the X and Y are in the same DataFrame, you could do something like this with DataFramesMeta

julia> using DataFrames, DataFramesMeta
julia> df = DataFrame(x = 1:10, y = 1:10)
10×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
│ 7   │ 7     │ 7     │
│ 8   │ 8     │ 8     │
│ 9   │ 9     │ 9     │
│ 10  │ 10    │ 10    │

julia> @where(df, :x .== :y)
10×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
│ 7   │ 7     │ 7     │
│ 8   │ 8     │ 8     │
│ 9   │ 9     │ 9     │
│ 10  │ 10    │ 10    │

nilshg · April 25, 2019, 6:26pm

Are your X and Y Arrays or DataFrames? Your question isn’t clear to me.

This works:

using DataFrames

# create dummy data
x = DataFrame(:x1 => rand(1:10, 100), :x2 => rand(1:10, 100))
y = DataFrame(:y1 => rand(1:10, 100), :y2 => rand(1:10, 100))

# Select rows where the values in the x1 column of x are the same as the y1
# column of y, and the x2 value is 7 (arbitrary value for illustration)
x[(x.x1 .== y.y1) .& (x.x2 .== 7), :]

This however implicitly assumes that it is meaningful to compare the values of a column in x to the values in the same row of a column in y. In that case it would probably be more natural to simply have x and y as a single DataFrame with all columns in x and y:

df = hcat(x,y)

then:

df[(df.x1 .== df.y1) .& (df.x2 .== 7), [:x1, :x2]] == x[(x.x1 .== y.y1) .& (x.x2 .== 7), :]
# true

eed · April 26, 2019, 9:22am

Thank you, I just edited the post. Unfortunately, X and Y are not in the same DataFrame.

nilshg · April 26, 2019, 10:11am

My suggestion still works in principle, however upon reading your edited post it appears that your initial criterion is not that the value of X.X2 should not be the same as the value in the same row in Y.Y7, but rather you want all rows for which the value in X.X2 is contained anywhere in Y.Y7, correct?

In which case you want do to:

x[in.(x.x2, Ref(unique(y.y7))) .& (x.x2 .== 1), :]

EDIT a slightly more concise way to write the above:

x[in(unique(y.Y7)).(x.X2), :]

Topic		Replies	Views
SubArray -selecting same combination of rows from 2 arrays New to Julia	11	505	March 26, 2022
DataFrame array subset General Usage array	18	552	August 11, 2020
Creating dataframe with arrays of different length New to Julia dataframes	12	4632	November 17, 2021
Filter DataFrame by an Array New to Julia	8	5884	December 10, 2019
Generate a dataset New to Julia tuple , arrays	4	1061	March 10, 2020

Creating sub-arrays by comparing multiple arrays DataFrames

Related topics