Function like VLOOKUP


I have the following data and require a function like Vlookup in excel.

Table 1								
Criteria 1	Criteria 2	    Factor 1
A	              1	           75
B	              2	           85
A	              2	           50
B	              1	           50

Table 2
Sample	     Criteria 1	     Criteria 2
Sample 1	        B	          1
Sample 2	        A	          2

Require a function which can pick the right factor based on the criteriaโ€™s in table 2 from table 1, considering the criteriaโ€™s are in string format.

I tried using the filter function in the dataframes package but have not got the required results

Hi, @yasser_rajwani

You should see Joins ยท DataFrames.jl, last version includes many joins and I am sure one of them is useful in your case.

Hi, @dmolina

Thank you for your reply.

I tried using the innerjoin function, however, am getting the following error

โ€˜โ€™โ€™
UndefVarError: innerjoin not defined

Stacktrace:
[1] top-level scope at In[20]:2
โ€˜โ€™โ€™

That probably means youโ€™re not on the latest version of DataFrames, innerjoin was only introduced in the recent 0.21 update. Best thing is to update, if that isnโ€™t possible you can use join(..., kind = :inner) on the previous versions.

Thank you for your reply @nilshg

I am still unable to get the desired result using the join function, I require a function which enables me to look up / filter for a specific value that matches the criteria

Can you please send a specification of the input and the desired operation and I will propose you the ways how to achieve a desired result. Also please confirm what version of DataFrames.jl you are using (I recommend you to use version 0.21.0 as @nilshg suggested).

Hi @bkamins,

I have an input in the form of the following table
โ€˜โ€™โ€™

Criteria 1 Criteria 2 Factor 1
A 1 75
B 2 85
A 2 60
B 1 50
โ€˜โ€™โ€™
Sample Criteria 1 Criteria 2
โ€” โ€” โ€”
Sample 1 B 1
Sample 2 A 2

โ€˜โ€™โ€™
I require a function that would return the value of 50 for sample 1 and value of 60 for sample 2 respectively from table 1.

I am also updating all my packages using

Pkg.update()

which should update the DataFrame package to the latest version.

Apologies I am a beginer on Julia and a bit lost

Under DataFrames.jl version 0.21.0 you can do:

julia> df1 = DataFrame("Criteria 1" => ["A","B","A","B"],
                       "Criteria 2" => [1, 2, 2, 1],
                       "Factor 1" => [75, 85, 60, 50])
4ร—3 DataFrame
โ”‚ Row โ”‚ Criteria 1 โ”‚ Criteria 2 โ”‚ Factor 1 โ”‚
โ”‚     โ”‚ String     โ”‚ Int64      โ”‚ Int64    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ A          โ”‚ 1          โ”‚ 75       โ”‚
โ”‚ 2   โ”‚ B          โ”‚ 2          โ”‚ 85       โ”‚
โ”‚ 3   โ”‚ A          โ”‚ 2          โ”‚ 60       โ”‚
โ”‚ 4   โ”‚ B          โ”‚ 1          โ”‚ 50       โ”‚

julia> df2 = DataFrame("Criteria 1" => ["B","A"],
                       "Criteria 2" => [1, 2])
2ร—2 DataFrame
โ”‚ Row โ”‚ Criteria 1 โ”‚ Criteria 2 โ”‚
โ”‚     โ”‚ String     โ”‚ Int64      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ B          โ”‚ 1          โ”‚
โ”‚ 2   โ”‚ A          โ”‚ 2          โ”‚

julia> rightjoin(df1, df2, on=["Criteria 1", "Criteria 2"])
2ร—3 DataFrame
โ”‚ Row โ”‚ Criteria 1 โ”‚ Criteria 2 โ”‚ Factor 1 โ”‚
โ”‚     โ”‚ String?    โ”‚ Int64?     โ”‚ Int64?   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ B          โ”‚ 1          โ”‚ 50       โ”‚
โ”‚ 2   โ”‚ A          โ”‚ 2          โ”‚ 60       โ”‚

a more advanced pattern is the following

julia> gdf = groupby(df1, 1:2)
GroupedDataFrame with 4 groups based on keys: Criteria 1, Criteria 2
First Group (1 row): Criteria 1 = "A", Criteria 2 = 1
โ”‚ Row โ”‚ Criteria 1 โ”‚ Criteria 2 โ”‚ Factor 1 โ”‚
โ”‚     โ”‚ String     โ”‚ Int64      โ”‚ Int64    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ A          โ”‚ 1          โ”‚ 75       โ”‚
โ‹ฎ
Last Group (1 row): Criteria 1 = "B", Criteria 2 = 1
โ”‚ Row โ”‚ Criteria 1 โ”‚ Criteria 2 โ”‚ Factor 1 โ”‚
โ”‚     โ”‚ String     โ”‚ Int64      โ”‚ Int64    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ B          โ”‚ 1          โ”‚ 50       โ”‚

julia> gdf[NamedTuple(df2[1, :])]
1ร—3 SubDataFrame
โ”‚ Row โ”‚ Criteria 1 โ”‚ Criteria 2 โ”‚ Factor 1 โ”‚
โ”‚     โ”‚ String     โ”‚ Int64      โ”‚ Int64    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ B          โ”‚ 1          โ”‚ 50       โ”‚

julia> gdf[NamedTuple(df2[2, :])]
1ร—3 SubDataFrame
โ”‚ Row โ”‚ Criteria 1 โ”‚ Criteria 2 โ”‚ Factor 1 โ”‚
โ”‚     โ”‚ String     โ”‚ Int64      โ”‚ Int64    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ A          โ”‚ 2          โ”‚ 60       โ”‚

which allows you to do a lookup per row (if you wanted e.g. to do iteration).


Now related to package version. Pkg.update() does not have to give you the result you expect. I have recently written blog posts here and here trying to explain the potential problems.

However, if you want to keep working in default project environment it is easiest to run add DataFrames@v0.21 command in Package Manager mode that will make sure you have a right package version (in general I also recommend reading this part of the manual).

1 Like

I think there is a bit of conceptual ambiguity here. In Julia and R, these kinds of lookup tables arenโ€™t used very much compared to excel. Rather, we join everything into the same data frame.

If you are used to this kind of โ€œlookupโ€ workflow, which has advantages, you might want to consider using a Dict to organize information. However I also wrote a function which should work for your purposes.

function lookup(df, lookup_pairs, valcol)
       groupcols = first.(lookup_pairs)
       symbol_clean = [Symbol(first(p)) => last(p) for p in lookup_pairs] 
       nt = (;symbol_clean...)
       gdf = groupby(df, groupcols)
       out = gdf[nt][:, valcol]
       @assert length(out) == 1 # only want one observation
       return first(out)
 end

julia> lookup(df1, ["Criteria 1" => "A", "Criteria 2" => 1], "Factor 1")
75
1 Like

I agree that vlookup is typically done in Julia using Dicts. However, what I wanted to highlight, is that GroupedDataFrame supports dictionary interface and provides a very fast lookup by grouping keys (with the same speed as if you used the Dict - actually internally we store Dict to do the lookup).

1 Like

What is the advantage of the second method?

Hereโ€™s another way to do it:

using DataFrames

table1 = DataFrame(
    Criteria_1 = ["A", "B", "A", "B"],
    Criteria_2 = [1,2,2,1],
    Factor_1 = [75,85,60,50]
)

table2 = DataFrame(
    Sample = ["Sample 1", "Sample 2"],
    Criteria_1 = ["B", "A"],
    Criteria_2 = [1,2]
)

function get_factor(sample)
    crit1 = table2.Criteria_1[findfirst(x -> x == sample, table2.Sample)]
    crit2 = table2.Criteria_2[findfirst(x -> x == sample, table2.Sample)]
    return table1[(table1.Criteria_1 .== crit1) .& (table1.Criteria_2 .== crit2), :Factor_1][1]
end

julia> get_factor("Sample 1")
50

Iโ€™m not sure, but you might also just be trying to do what the last line of the get_factor function does, which would look like this:

julia> table1[(table1.Criteria_1 .== "B") .& (table1.Criteria_2 .== 1), :Factor_1][1]
50

# or if you want to return that whole row:

julia> table1[(table1.Criteria_1 .== "B") .& (table1.Criteria_2 .== 1), :]
1ร—3 DataFrame
โ”‚ Row โ”‚ Criteria_1 โ”‚ Criteria_2 โ”‚ Factor_1 โ”‚
โ”‚     โ”‚ String     โ”‚ Int64      โ”‚ Int64    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ B          โ”‚ 1          โ”‚ 50       โ”‚

Iโ€™ve not checked the performance of this so that may be a consideration if the tables you have are really large.

1 Like

The advantage is just that it looks more like the excel workflow you are used to. You actually input values into a function and get them back. If you are really comfortable using this workflow, then you can use one of the many methods outlined in this thread to do a lookup.

However a join is arguably more idiomatic in Julia for these kinds of operations. I would encourage you to explore joins more.

2 Likes

Arguably VLOOKUP isnโ€™t even idiomatic in Excel though :slight_smile: - see e.g. here

Thank you @bkamins,
A sight iterarion of the suggested code worked well for me.
Appreciate the help

1 Like

Thank you @pdeffebach and @mthelm85,
appreicate the help

Hi @bkamins,

Thank you for your help on my previous question.

I have managed to complete my task using the code you suggested.

The next step that I want to learn is to run the same code for 100 samples instead of 2.

From my limited knowledge of julia, I understand that I need to use a loop, however, I have not been sucessful.

Would you be kind enough to help out ?

Thanks You.

Can you show us what you have tried? Try posting the error.

If you error is something along the lines of

ERROR: UndefVarError: s not defined

Then make sure you put your code in a let block.

For example, if you are running the code

julia> s = 0
julia> for i in 1:10
       s = s + i
       end

The solution is to do

julia> s = let t = 0
       for i in 1:10
           t = t + i
       end
       t
       end

Hi @pdeffebach,

The code that I have written is

โ€˜โ€™โ€™
Y1 = (Data[1:15,10].* ((1 .- Data[1:15,4]) + (0.3 .* Data[1:15,4])))

โ€˜โ€™โ€™

I now need the Y1 to go Y2,

โ€˜โ€™โ€™
Y2 = (Data[1:15,11].* ((1 .- Data[1:15,5]) + (0.3 .* Data[1:15,5])))

โ€˜โ€™โ€™
As you can observe, some columns need to move increase by 1 (which I am hoping to code using the loop function)

I would like this to go on till Y10, with the columns changing (increase by 1)

Apologies, if this is difficult to understand.

Thanks for the detailed response. One quick thing, the correct way to quote code is with back-ticks. You are using quotation marks.

``` 
like this
```

The correct way to do this would be with a loop. Letโ€™s store the output of our computation in a new data frame

julia> Data = DataFrame(rand(15, 20)); # simulated data frame

julia> out = DataFrame(); # an empty data frame to add columns to

julia> for i in 1:10
           out[:, "Y$i"] = # "Y$i" makes a new column 
                           # with names "Y1" etc.
               Data[1:15, i + 9] .* ((1 .- Data[1:15, i + 4]) .+ (.3 .* Data[1:15, i + 4]))
       end

julia> out