Status of @transform altering columns in DataFramesMeta


#1

What is the status of altering a column that has already been altered earlier in the same @transform block with DataFramesMeta?

I have a DataFrame with some messy descriptions of years and months in Spanish, and I need to do some cleaning before I can use a regex to extract two numeric columns for years and months.


df = DataFrame(length_of_contact =  ["25 AÑOS", "5 años 9 meses",  "15 años "])

df = @transform(df, 
	length_of_contact = lowercase.(:length_of_contact),
	length_of_contact = replace.(:length_of_contact, "Ñ", "n"),
	length_of_contact = replace.(:length_of_contact, "ñ", "n"),
	length_of_contact = replace.(:length_of_contact, "7meses", "7 meses"))

The issue is that each time I refer to :length_of_contact, DataFramesMeta uses the column :length_of_contact as it was before the start of the @transform block. Instead, I would like to perform a series of small sequential changes on the variable.

Am I going about this the wrong way? Or is there a PR in the 0.7 build that might fix this (I am using 0.6.2)? I can imagine that this isn't trivial, given the metaprogramming involved.

Thanks


#2

My guess is that when you write it like that, the assignments are not applied sequentially as you might expect.

You could try

df = @linq df |>
   transform(length_of_contact = lowercase.(:length_of_contact)) |>
   transform(length_of_contact = replace.(:length_of_contact, "Ñ", "n")) |>
   transform(length_of_contact = replace.(:length_of_contact, "ñ", "n")) |>
   transform(length_of_contact = replace.(:length_of_contact, "7meses", "7 meses"))
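For illustration only, the "each step sees the previous result" behaviour can be sketched in plain Julia, independent of DataFramesMeta. The names `steps` and `apply_steps` are hypothetical, and the sketch assumes Julia 1.x, where `replace` takes a `old => new` pair (the 0.6 code in this thread uses `replace(s, old, new)`).

```julia
# Hypothetical helper names; not part of DataFramesMeta or Lazy.
# Each cleaning step is a function from String to String.
steps = [
    lowercase,
    s -> replace(s, "ñ" => "n"),            # Julia 1.x pair syntax (assumption)
    s -> replace(s, "7meses" => "7 meses"),
]

# Apply the steps in order, feeding each result into the next step.
apply_steps(s) = foldl((acc, f) -> f(acc), steps; init = s)

raw = ["25 AÑOS", "5 años 9 meses", "15 años "]
cleaned = apply_steps.(raw)
```

Because `lowercase` runs first, "Ñ" has already become "ñ" by the time the replacement runs, so a separate step for the uppercase letter is not needed in this sketch.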

#3

Why not apply all operations at the same time? I.e. lowercase.(replace.(replace.(...)))? That should also be more efficient.
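A minimal sketch of what that nesting could look like, assuming Julia 1.x `replace(s, old => new)` syntax rather than the 0.6 form used above. `clean_length` is a hypothetical name introduced here for illustration.

```julia
# Hypothetical helper: all cleaning steps nested into one scalar function.
# Innermost call runs first: lowercase, then the two replacements.
clean_length(s) = replace(replace(lowercase(s), "ñ" => "n"), "7meses" => "7 meses")

raw = ["25 AÑOS", "5 años 9 meses", "15 años "]

# A single broadcast over the column, so only one new vector is allocated.
cleaned = clean_length.(raw)
```

With the @transform syntax used in this thread, the column could then presumably be updated in one assignment, e.g. `length_of_contact = clean_length.(:length_of_contact)`.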


#4

Yes I am trying to use Lazy.jl to do this right now. Hopefully I will figure it out!

EDIT: I was able to figure it out, for the most part.

This works:

df = @transform(df, 
	length_of_contact = @> :length_of_contact lowercase.() replace.("ñ", "n")) 

But this throws an error:


df = @transform(df, 
	length_of_contact = @> :length_of_contact 
        lowercase.() 
        replace.("ñ", "n")) 

It tells me

ERROR: LoadError: syntax: missing comma or ) in argument list
Stacktrace:
 [1] include_from_node1(::String) at .\loading.jl:576
 [2] include(::String) at .\sysimg.jl:14
while loading C:\Users\Pdeffebach\Documents\cleaning\julia_cleaning.jl, in expression starting on line 49

I think the second way is much more readable. Does anyone know how hard it would be to get Lazy to work with new lines like that?


#5
@> begin
    :length_of_contact 
    lowercase.() 
    replace.("ñ", "n")
end

#6

This is it, thanks!

This syntax is incredibly readable and concise. Thanks to the contributors of Lazy.jl and DataFramesMeta.jl! I am working on something that compares a data cleaning task in Stata and Julia, and I was expecting Stata to be more readable, but maybe not!