Status of @transform altering columns in DataFramesMeta

pdeffebach · March 22, 2018, 8:13pm

What is the status of altering a column that has already been altered in a @transform block with DataFramesMeta?

I have a DataFrame with some messy descriptions of years and months in Spanish, and I need to do some cleaning before I can use a regex to extract two numeric columns for years and months.


df = DataFrame(length_of_contact =  ["25 AÑOS", "5 años 9 meses",  "15 años "])

df = @transform(df, 
	length_of_contact = lowercase.(:length_of_contact),
	length_of_contact = replace.(:length_of_contact, "Ñ", "n"),
	length_of_contact = replace.(:length_of_contact, "ñ", "n"),
	length_of_contact = replace.(:length_of_contact, "7meses", "7 meses"))

The issue is that each time I refer to :length_of_contact, DataFramesMeta uses the column as it was before the start of the @transform block. Instead, I would like to perform a sequence of small changes on the variable.

Am I going about this process the wrong way? Or is there a PR in the 0.7 build that might fix this (I am using 0.6.2)? I can imagine that this isn't trivial due to the metaprogramming involved.

Thanks

tbeason · March 22, 2018, 8:31pm

My guess is that when you write it like that, the assignments are not applied sequentially as you might expect.

You could try

df = @linq df |>
   transform(length_of_contact = lowercase.(:length_of_contact)) |>
   transform(length_of_contact = replace.(:length_of_contact, "Ñ", "n")) |>
   transform(length_of_contact = replace.(:length_of_contact, "ñ", "n")) |>
   transform(length_of_contact = replace.(:length_of_contact, "7meses", "7 meses"))

nalimilan · March 22, 2018, 8:41pm

Why not apply all operations at the same time? I.e. lowercase.(replace.(replace.(...)))? That should also be more efficient.
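A minimal sketch of that nested form, using a plain vector as a stand-in for the column rather than a DataFrame. It assumes Julia 1.x, where replace takes a Pair ("ñ" => "n"); on 0.6 the three-argument form used earlier in the thread would apply instead. The variable name cleaned is only illustrative.

```julia
# Plain vector standing in for the :length_of_contact column.
length_of_contact = ["25 AÑOS", "5 años 9 meses", "15 años "]

# Nested broadcast calls: lowercase first, then each replacement in turn.
# Lowercasing turns "Ñ" into "ñ", so a separate replacement for "Ñ" is not needed.
# Assumption: Julia 1.x Pair syntax for replace.
cleaned = replace.(
    replace.(lowercase.(length_of_contact), "ñ" => "n"),
    "7meses" => "7 meses",
)
```

Because the dotted calls are nested in one expression, each later step sees the output of the earlier one, which is the sequential behavior asked for above.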

pdeffebach · March 22, 2018, 8:42pm

Yes I am trying to use Lazy.jl to do this right now. Hopefully I will figure it out!

EDIT: I was able to figure it out, for the most part.

This works:

df = @transform(df, 
	length_of_contact = @> :length_of_contact lowercase.() replace.("ñ", "n"))

But this throws an error:


df = @transform(df, 
	length_of_contact = @> :length_of_contact 
        lowercase.() 
        replace.("ñ", "n"))

It tells me

ERROR: LoadError: syntax: missing comma or ) in argument list
Stacktrace:
 [1] include_from_node1(::String) at .\loading.jl:576
 [2] include(::String) at .\sysimg.jl:14
while loading C:\Users\Pdeffebach\Documents\cleaning\julia_cleaning.jl, in expression starting on line 49

I think the second way is much more readable. Does anyone know how hard it would be to get Lazy to work with new lines like that?

bramtayl · March 22, 2018, 8:46pm

@> begin
    :length_of_contact 
    lowercase.() 
    replace.("ñ", "n")
end

pdeffebach · March 22, 2018, 8:49pm

This is it, thanks!

This syntax is incredibly readable and concise. Thanks to the contributors of Lazy.jl and DataFramesMeta.jl! I am working on something that compares a data cleaning task in Stata and Julia, and I was expecting Stata to be more readable, but maybe not!
