Multi-threading with DataFrames

CC @pdeffebach

But I think the only thing you can do is to add @transform to the expression and then evaluate it

You can parse the expression :c = 2 * :a + :b into a src => fun => dest pair that you can pass around and use later. Use the (currently unexported) DataFramesMeta.@col feature:

julia> using DataFramesMeta

julia> df = DataFrame(a = 1, b = 2);

julia> t = DataFramesMeta.@col :c = 2 * :a + :b
[:a, :b] => (var"#3#4"() => :c)

julia> transform(df, t)
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

I’ve been meaning to export this feature, since it is useful.

The thing is, in this case I have the transformation as an expression stored in a variable, which I can’t pass as an argument to @col.

If that’s the case I think you should re-think your whole approach. Use functions to store transformations, not expressions. You should not be passing around expressions like that.

Perhaps if you provided more details about the problem you face, more suggestions might come.
Could the Symbolics.jl package be useful in your case?

Certainly, I’m not sure it will be of much help (compared to what I mentioned in above posts) given the generic nature of what I’m trying to achieve, but let me do it anyway.

I have DataFrames of time series data, over which the end user of the program (very often, me) performs a large number of transformations for further analysis. These transformations vary and are best stored as lists of Julia Expr (made from strings). I realize that this is essentially me using a subset of the Julia language as my own language exposed to the end user, hence there is no way around eval() unless I describe my transformations ahead of time in very generic terms and reduce flexibility.

So with the example DataFrame above, I want the user to be able to apply, say, the four basic operations, sine/cosine, exp, log, and more (some being defined in my module), in any arbitrary combination. I’m essentially recreating a small calculator that is applied column-wise to a DataFrame. Hope this makes sense. I appreciate you reading my convoluted problem.

This is absolutely feasible with functions. And you should definitely be using them instead of expressions.

julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6]);

julia> function operate_on_col(fun, col1, col2)
           DataFramesMeta.@col @byrow :newcol = begin
               $col1 * 2 + fun($col2)
           end
       end;

julia> t1 = operate_on_col(sin, :a, :b); t2 = operate_on_col(cos, :a, :b);

julia> transform(df, t1)
3×3 DataFrame
 Row │ a      b      newcol
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      4  1.2432
   2 │     2      5  3.04108
   3 │     3      6  5.72058

julia> transform(df, t2)
3×3 DataFrame
 Row │ a      b      newcol
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      4  1.34636
   2 │     2      5  4.28366
   3 │     3      6  6.96017

Thank you for the suggestion. I think my issue is that fun is not known ahead of time, but let me take a closer look at your code to see if I can leverage it. (And by “not known” I mean “not defined”.)

In your example, I would pretty much need to have the user pass the definition of fun as a string and eval() its definition, which takes me back to square one.

Could you not have the user pass fun? They can’t make it themselves?

The user lives outside of Julia, meaning they can only pass it as a string, which would have to be parsed and eval’d?

Yeah, that would most likely have to be parsed and eval’d, which isn’t ideal. Maybe someone else can chime in on the best way to do that particular task.
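If the formula really has to arrive as a string, one option that avoids eval is to parse it with Meta.parse and walk the resulting Expr against a whitelist, building a closure along the way. A minimal sketch, assuming formulas contain only numbers, variable names, and calls to whitelisted functions; ALLOWED, compile and formula are hypothetical names, not part of any package.

```julia
# Hypothetical whitelist of callable names; extend as needed.
const ALLOWED = Dict{Symbol,Function}(
    :+ => +, :- => -, :* => *, :/ => /, :^ => ^,
    :sin => sin, :cos => cos, :exp => exp, :log => log,
)

# Turn a parsed formula into a closure over `vars`,
# where `vars` is anything indexable by Symbol (NamedTuple, Dict, ...).
compile(x::Number) = vars -> x
compile(s::Symbol) = vars -> vars[s]
function compile(ex::Expr)
    ex.head === :call || error("unsupported syntax: $(ex.head)")
    haskey(ALLOWED, ex.args[1]) || error("function not allowed: $(ex.args[1])")
    f = ALLOWED[ex.args[1]]
    args = map(compile, ex.args[2:end])
    return vars -> f((a(vars) for a in args)...)
end

# Parse the string once, then reuse the closure; nothing is ever eval'd.
formula(str::AbstractString) = compile(Meta.parse(str))

f = formula("2 * a + sin(b)")
f((a = 1, b = 4))
```

Anything outside the whitelist is rejected when the formula is compiled, before any data is touched. Since the closure takes a NamedTuple, it should also be usable per row with something like AsTable([:a, :b]) => ByRow(f) => :c, though I have not benchmarked that.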

julia> using DataFrames, Symbolics

julia> df_flows = DataFrame(;
           from = ["p1", "p1", "p2", "p2", "p1", "p2"],
           to = ["d", "d", "d", "d", "d", "d"],
           rp = [1, 1, 1, 1, 2, 2],
           tb = [3,5,7,4,6,8],
           index = 1:6,
       )
6×5 DataFrame
 Row │ from    to      rp     tb     index
     │ String  String  Int64  Int64  Int64
─────┼─────────────────────────────────────
   1 │ p1      d           1      3      1
   2 │ p1      d           1      5      2
   3 │ p2      d           1      7      3
   4 │ p2      d           1      4      4
   5 │ p1      d           2      6      5
   6 │ p2      d           2      8      6

julia> @variables x y z
3-element Vector{Num}:
 x
 y
 z

julia> w=x^2+y*sqrt(z*x)
x^2 + y*sqrt(x*z)

julia> transform(df_flows,[3,4,5]=>ByRow((r,t,i)->substitute(w, Dict(x=>r,y=>t,z=>i))))
6×6 DataFrame
 Row │ from    to      rp     tb     index  rp_tb_index_function
     │ String  String  Int64  Int64  Int64  Num
─────┼───────────────────────────────────────────────────────────
   1 │ p1      d           1      3      1               4.0
   2 │ p1      d           1      5      2               8.07107
   3 │ p2      d           1      7      3              13.1244
   4 │ p2      d           1      4      4               9.0
   5 │ p1      d           2      6      5              22.9737
   6 │ p2      d           2      8      6              31.7128

julia> w=3x-2y
3x - 2y

julia> transform(df_flows,[3,4,5]=>ByRow((r,t,i)->substitute(w, Dict(x=>r,y=>t,z=>i))))
6×6 DataFrame
 Row │ from    to      rp     tb     index  rp_tb_index_function
     │ String  String  Int64  Int64  Int64  Num
─────┼───────────────────────────────────────────────────────────
   1 │ p1      d           1      3      1                    -3
   2 │ p1      d           1      5      2                    -7
   3 │ p2      d           1      7      3                   -11
   4 │ p2      d           1      4      4                    -5
   5 │ p1      d           2      6      5                    -6
   6 │ p2      d           2      8      6                   -10

PS
I saw several old posts that discussed the problem of deriving a function from a string. Apart from the solution with parse() and eval(), I have seen the getfield function applied to the current module to obtain the function named by the string.
But this only applies to functions already defined in the module.
An equivalent, in my opinion, would be a Dict with strings as keys and functions as values.
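To make the difference between the two lookups concrete, here is a small sketch using only Base. FUNCTIONS, lookup and from_module are hypothetical names chosen for illustration.

```julia
# Hypothetical whitelist of functions the user may refer to by name.
const FUNCTIONS = Dict("sin" => sin, "cos" => cos, "exp" => exp, "log" => log)

# Dict lookup: only the listed names resolve, everything else is rejected.
function lookup(name::AbstractString)
    haskey(FUNCTIONS, name) || throw(ArgumentError("unknown function: $name"))
    return FUNCTIONS[name]
end

# getfield lookup: resolves any name bound in the module, listed or not.
from_module(name::AbstractString) = getfield(Base, Symbol(name))

lookup("cos")(0.0)
from_module("sin") === sin
```

The Dict version is the more restrictive of the two, which is probably what you want when the string comes from outside Julia.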

To get suggestions for the current problem (about the use of an input string), it might be more useful to open a new topic with a more specific title.

In principle a “simple calculator” seems feasible.
There’s a lot of work to do to make it truly functional, but here is a seed to start from:


df=Dict("log"=>log, "sin"=>sin, "*"=>*,"+"=>+,"-"=>-,"^"=>^, "∘"=>∘)



function str2func(str)
    lff=findfirst('(', str)
    op=df[str[1:lff-1]]
    if !occursin('(',str[lff+1:end])
        par=split(str[lff+1:end-1],',')
        tp=tryparse.(Int,par)
        if all(isnothing,tp)
            return ((x,y)->(a->(b->(c->op(a,c))(b)))(x)(y))
        else
            n=only(filter(!isnothing,tp))
            return z->((x,y)->(a->(b->(c->op(a,c))(b)))(x)(y))(n,z)
        end
    else
        return (x...)->"not yet"
    end
end

julia> df_flows
6×5 DataFrame
 Row │ from    to      rp     tb     index
     │ String  String  Int64  Int64  Int64
─────┼─────────────────────────────────────
   1 │ p1      d           1      3      1
   2 │ p1      d           1      5      2
   3 │ p2      d           1      7      3
   4 │ p2      d           1      4      4
   5 │ p1      d           2      6      5
   6 │ p2      d           2      8      6

julia> transform(df_flows,[3,4]=>ByRow(str2func("*(x,y)")))    
6×6 DataFrame
 Row │ from    to      rp     tb     index  rp_tb_function
     │ String  String  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────
   1 │ p1      d           1      3      1               3
   2 │ p1      d           1      5      2               5
   3 │ p2      d           1      7      3               7
   4 │ p2      d           1      4      4               4
   5 │ p1      d           2      6      5              12
   6 │ p2      d           2      8      6              16

julia> transform(df_flows,[3,4]=>ByRow(str2func("+(x,y)")))    
6×6 DataFrame
 Row │ from    to      rp     tb     index  rp_tb_function
     │ String  String  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────
   1 │ p1      d           1      3      1               4
   2 │ p1      d           1      5      2               6
   3 │ p2      d           1      7      3               8
   4 │ p2      d           1      4      4               5
   5 │ p1      d           2      6      5               8
   6 │ p2      d           2      8      6              10

julia> transform(df_flows,[4,3]=>ByRow(str2func("^(x,y)")))    
6×6 DataFrame
 Row │ from    to      rp     tb     index  tb_rp_function
     │ String  String  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────
   1 │ p1      d           1      3      1               3
   2 │ p1      d           1      5      2               5
   3 │ p2      d           1      7      3               7
   4 │ p2      d           1      4      4               4
   5 │ p1      d           2      6      5              36
   6 │ p2      d           2      8      6              64

julia> transform(df_flows,[4,3]=>ByRow(str2func("-(x,y)")))    
6×6 DataFrame
 Row │ from    to      rp     tb     index  tb_rp_function
     │ String  String  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────
   1 │ p1      d           1      3      1               2
   2 │ p1      d           1      5      2               4
   3 │ p2      d           1      7      3               6
   4 │ p2      d           1      4      4               3
   5 │ p1      d           2      6      5               4
   6 │ p2      d           2      8      6               6

julia> transform(df_flows,[4]=>ByRow(str2func("*(3,y)")))      
6×6 DataFrame
 Row │ from    to      rp     tb     index  tb_function
     │ String  String  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────────────
   1 │ p1      d           1      3      1            9
   2 │ p1      d           1      5      2           15
   3 │ p2      d           1      7      3           21
   4 │ p2      d           1      4      4           12
   5 │ p1      d           2      6      5           18
   6 │ p2      d           2      8      6           24

julia> transform(df_flows,[4,3]=>ByRow(str2func("+(log(x),∘(sin, *(10,y)))")))
6×6 DataFrame
 Row │ from    to      rp     tb     index  tb_rp_function
     │ String  String  Int64  Int64  Int64  String
─────┼─────────────────────────────────────────────────────
   1 │ p1      d           1      3      1  not yet
   2 │ p1      d           1      5      2  not yet
   3 │ p2      d           1      7      3  not yet
   4 │ p2      d           1      4      4  not yet
   5 │ p1      d           2      6      5  not yet
   6 │ p2      d           2      8      6  not yet

The functions thus defined can, in appropriate cases (associative operators), be applied to more than two elements.
In particular, for operators that also have methods defined on vectors, you can do without wrapping everything in ByRow:

transform(df_flows,[3,4, 5]=>ByRow((x...)->reduce(str2func("+(x,y)"),x)))


transform(df_flows,[3,4, 5]=>(x...)->foldl(str2func("+(x,y)"),x))

I realize that the idea and, above all, the implementation are really naive and, I fear, not very efficient.
But let’s play with Julia’s expressions, and while waiting for ideas for improvement, here is a small step forward(?).

If you illustrate the typical cases of formulas used, perhaps something can be added.


julia> function str2func(str)
           lff=findfirst('(', str)
           op=df[str[1:lff-1]]
           if !occursin('(',str[lff+1:end])
               par=split(str[lff+1:end-1],',')
               tp=tryparse.(Int,par)
               opxy=(x,y)->op(x,y)  # keep the argument order as written, so that -(x,y) means x - y
               if all(isnothing,tp)
                   return opxy
               else
                   n=only(filter(!isnothing,tp))
                   return z->opxy(n,z)
               end
           else
               par=split(str[lff+1:end-1],"),")
               par[1:end-1] .*=')'
               return (x...)->op([str2func(p)(var) for (p,var) in zip(par,x)]...)
           end
       end
str2func (generic function with 1 method)

julia> str= "+(*(3,x),*(2,y))"
"+(*(3,x),*(2,y))"

julia> str2func(str)(3,3)
15

julia> str= "*(+(3,x),+(2,y))"
"*(+(3,x),+(2,y))"

julia> str2func(str)(2,3)
25

julia> str= "+(3,x)"
"+(3,x)"

julia> str2func(str)(-3)
0

I posted an updated version of the script here