Function like VLOOKUP

yasser_rajwani · May 20, 2020, 9:51am


I have the following data and require a function like Vlookup in excel.

Table 1								
Criteria 1	Criteria 2	    Factor 1
A	              1	           75
B	              2	           85
A	              2	           50
B	              1	           50

Table 2
Sample	     Criteria 1	     Criteria 2
Sample 1	        B	          1
Sample 2	        A	          2

Require a function which can pick the right factor based on the criteria’s in table 2 from table 1, considering the criteria’s are in string format.

I tried using the filter function in the dataframes package but have not got the required results

dmolina · May 20, 2020, 10:04am

Hi, @yasser_rajwani

You should see Joins · DataFrames.jl, last version includes many joins and I am sure one of them is useful in your case.

yasser_rajwani · May 20, 2020, 10:46am

Hi, @dmolina

Thank you for your reply.

I tried using the innerjoin function, however, am getting the following error

‘’’
UndefVarError: innerjoin not defined

Stacktrace:
[1] top-level scope at In[20]:2
‘’’

nilshg · May 20, 2020, 11:00am

That probably means you’re not on the latest version of DataFrames, innerjoin was only introduced in the recent 0.21 update. Best thing is to update, if that isn’t possible you can use join(..., kind = :inner) on the previous versions.

yasser_rajwani · May 20, 2020, 11:44am

Thank you for your reply @nilshg

I am still unable to get the desired result using the join function, I require a function which enables me to look up / filter for a specific value that matches the criteria

bkamins · May 20, 2020, 12:09pm

Can you please send a specification of the input and the desired operation and I will propose you the ways how to achieve a desired result. Also please confirm what version of DataFrames.jl you are using (I recommend you to use version 0.21.0 as @nilshg suggested).

yasser_rajwani · May 20, 2020, 12:21pm

Hi @bkamins,

I have an input in the form of the following table
‘’’

Criteria 1	Criteria 2	Factor 1
A	1	75
B	2	85
A	2	60
B	1	50
‘’’
Sample	Criteria 1	Criteria 2
—	—	—
Sample 1	B	1
Sample 2	A	2

‘’’
I require a function that would return the value of 50 for sample 1 and value of 60 for sample 2 respectively from table 1.

I am also updating all my packages using

Pkg.update()

which should update the DataFrame package to the latest version.

Apologies I am a beginer on Julia and a bit lost

bkamins · May 20, 2020, 12:50pm

Under DataFrames.jl version 0.21.0 you can do:

julia> df1 = DataFrame("Criteria 1" => ["A","B","A","B"],
                       "Criteria 2" => [1, 2, 2, 1],
                       "Factor 1" => [75, 85, 60, 50])
4×3 DataFrame
│ Row │ Criteria 1 │ Criteria 2 │ Factor 1 │
│     │ String     │ Int64      │ Int64    │
├─────┼────────────┼────────────┼──────────┤
│ 1   │ A          │ 1          │ 75       │
│ 2   │ B          │ 2          │ 85       │
│ 3   │ A          │ 2          │ 60       │
│ 4   │ B          │ 1          │ 50       │

julia> df2 = DataFrame("Criteria 1" => ["B","A"],
                       "Criteria 2" => [1, 2])
2×2 DataFrame
│ Row │ Criteria 1 │ Criteria 2 │
│     │ String     │ Int64      │
├─────┼────────────┼────────────┤
│ 1   │ B          │ 1          │
│ 2   │ A          │ 2          │

julia> rightjoin(df1, df2, on=["Criteria 1", "Criteria 2"])
2×3 DataFrame
│ Row │ Criteria 1 │ Criteria 2 │ Factor 1 │
│     │ String?    │ Int64?     │ Int64?   │
├─────┼────────────┼────────────┼──────────┤
│ 1   │ B          │ 1          │ 50       │
│ 2   │ A          │ 2          │ 60       │

a more advanced pattern is the following

julia> gdf = groupby(df1, 1:2)
GroupedDataFrame with 4 groups based on keys: Criteria 1, Criteria 2
First Group (1 row): Criteria 1 = "A", Criteria 2 = 1
│ Row │ Criteria 1 │ Criteria 2 │ Factor 1 │
│     │ String     │ Int64      │ Int64    │
├─────┼────────────┼────────────┼──────────┤
│ 1   │ A          │ 1          │ 75       │
⋮
Last Group (1 row): Criteria 1 = "B", Criteria 2 = 1
│ Row │ Criteria 1 │ Criteria 2 │ Factor 1 │
│     │ String     │ Int64      │ Int64    │
├─────┼────────────┼────────────┼──────────┤
│ 1   │ B          │ 1          │ 50       │

julia> gdf[NamedTuple(df2[1, :])]
1×3 SubDataFrame
│ Row │ Criteria 1 │ Criteria 2 │ Factor 1 │
│     │ String     │ Int64      │ Int64    │
├─────┼────────────┼────────────┼──────────┤
│ 1   │ B          │ 1          │ 50       │

julia> gdf[NamedTuple(df2[2, :])]
1×3 SubDataFrame
│ Row │ Criteria 1 │ Criteria 2 │ Factor 1 │
│     │ String     │ Int64      │ Int64    │
├─────┼────────────┼────────────┼──────────┤
│ 1   │ A          │ 2          │ 60       │

which allows you to do a lookup per row (if you wanted e.g. to do iteration).

Now related to package version. Pkg.update() does not have to give you the result you expect. I have recently written blog posts here and here trying to explain the potential problems.

However, if you want to keep working in default project environment it is easiest to run add DataFrames@v0.21 command in Package Manager mode that will make sure you have a right package version (in general I also recommend reading this part of the manual).

pdeffebach · May 20, 2020, 1:40pm

I think there is a bit of conceptual ambiguity here. In Julia and R, these kinds of lookup tables aren’t used very much compared to excel. Rather, we join everything into the same data frame.

If you are used to this kind of “lookup” workflow, which has advantages, you might want to consider using a Dict to organize information. However I also wrote a function which should work for your purposes.

function lookup(df, lookup_pairs, valcol)
       groupcols = first.(lookup_pairs)
       symbol_clean = [Symbol(first(p)) => last(p) for p in lookup_pairs] 
       nt = (;symbol_clean...)
       gdf = groupby(df, groupcols)
       out = gdf[nt][:, valcol]
       @assert length(out) == 1 # only want one observation
       return first(out)
 end

julia> lookup(df1, ["Criteria 1" => "A", "Criteria 2" => 1], "Factor 1")
75

bkamins · May 20, 2020, 1:49pm

I agree that vlookup is typically done in Julia using Dicts. However, what I wanted to highlight, is that GroupedDataFrame supports dictionary interface and provides a very fast lookup by grouping keys (with the same speed as if you used the Dict - actually internally we store Dict to do the lookup).

Juan · May 20, 2020, 3:47pm

What is the advantage of the second method?

mthelm85 · May 20, 2020, 3:59pm

Here’s another way to do it:

using DataFrames

table1 = DataFrame(
    Criteria_1 = ["A", "B", "A", "B"],
    Criteria_2 = [1,2,2,1],
    Factor_1 = [75,85,60,50]
)

table2 = DataFrame(
    Sample = ["Sample 1", "Sample 2"],
    Criteria_1 = ["B", "A"],
    Criteria_2 = [1,2]
)

function get_factor(sample)
    crit1 = table2.Criteria_1[findfirst(x -> x == sample, table2.Sample)]
    crit2 = table2.Criteria_2[findfirst(x -> x == sample, table2.Sample)]
    return table1[(table1.Criteria_1 .== crit1) .& (table1.Criteria_2 .== crit2), :Factor_1][1]
end

julia> get_factor("Sample 1")
50

I’m not sure, but you might also just be trying to do what the last line of the get_factor function does, which would look like this:

julia> table1[(table1.Criteria_1 .== "B") .& (table1.Criteria_2 .== 1), :Factor_1][1]
50

# or if you want to return that whole row:

julia> table1[(table1.Criteria_1 .== "B") .& (table1.Criteria_2 .== 1), :]
1×3 DataFrame
│ Row │ Criteria_1 │ Criteria_2 │ Factor_1 │
│     │ String     │ Int64      │ Int64    │
├─────┼────────────┼────────────┼──────────┤
│ 1   │ B          │ 1          │ 50       │

I’ve not checked the performance of this so that may be a consideration if the tables you have are really large.

pdeffebach · May 20, 2020, 4:03pm

The advantage is just that it looks more like the excel workflow you are used to. You actually input values into a function and get them back. If you are really comfortable using this workflow, then you can use one of the many methods outlined in this thread to do a lookup.

However a join is arguably more idiomatic in Julia for these kinds of operations. I would encourage you to explore joins more.

nilshg · May 20, 2020, 4:14pm

Arguably VLOOKUP isn’t even idiomatic in Excel though - see e.g. here

yasser_rajwani · May 21, 2020, 10:35am

Thank you @bkamins,
A sight iterarion of the suggested code worked well for me.
Appreciate the help

yasser_rajwani · May 21, 2020, 10:37am

Thank you @pdeffebach and @mthelm85,
appreicate the help

yasser_rajwani · May 27, 2020, 2:14pm

Hi @bkamins,

Thank you for your help on my previous question.

I have managed to complete my task using the code you suggested.

The next step that I want to learn is to run the same code for 100 samples instead of 2.

From my limited knowledge of julia, I understand that I need to use a loop, however, I have not been sucessful.

Would you be kind enough to help out ?

Thanks You.

pdeffebach · May 27, 2020, 2:17pm

Can you show us what you have tried? Try posting the error.

If you error is something along the lines of

ERROR: UndefVarError: s not defined

Then make sure you put your code in a let block.

For example, if you are running the code

julia> s = 0
julia> for i in 1:10
       s = s + i
       end

The solution is to do

julia> s = let t = 0
       for i in 1:10
           t = t + i
       end
       t
       end

yasser_rajwani · May 27, 2020, 3:01pm

Hi @pdeffebach,

The code that I have written is

‘’’
Y1 = (Data[1:15,10].* ((1 .- Data[1:15,4]) + (0.3 .* Data[1:15,4])))

‘’’

I now need the Y1 to go Y2,

‘’’
Y2 = (Data[1:15,11].* ((1 .- Data[1:15,5]) + (0.3 .* Data[1:15,5])))

‘’’
As you can observe, some columns need to move increase by 1 (which I am hoping to code using the loop function)

I would like this to go on till Y10, with the columns changing (increase by 1)

Apologies, if this is difficult to understand.

pdeffebach · May 27, 2020, 3:23pm

Thanks for the detailed response. One quick thing, the correct way to quote code is with back-ticks. You are using quotation marks.

``` 
like this
```

The correct way to do this would be with a loop. Let’s store the output of our computation in a new data frame

julia> Data = DataFrame(rand(15, 20)); # simulated data frame

julia> out = DataFrame(); # an empty data frame to add columns to

julia> for i in 1:10
           out[:, "Y$i"] = # "Y$i" makes a new column 
                           # with names "Y1" etc.
               Data[1:15, i + 9] .* ((1 .- Data[1:15, i + 4]) .+ (.3 .* Data[1:15, i + 4]))
       end

julia> out

Topic		Replies	Views
DataFrames: obtaining the subset of rows by a set of values New to Julia dataframes	45	24026	April 27, 2024
Findall/Findfirst across multiple columns in dataframe New to Julia	15	2348	March 26, 2021
Why is it so complicated to access a row in a DataFrame? General Usage dataframes	13	1974	August 25, 2023
Frustrated using DataFrames New to Julia dataframes , data_structures	97	10530	April 22, 2022
Is there a faster way to extract single rows of a dataframe than the filter function General Usage	3	652	February 18, 2020

Function like VLOOKUP

Related topics