What is the status of altering a column that has already been altered in a @transform block with DataFramesMeta?
I have a DataFrame with some messy descriptions of durations in years and months, written in Spanish, and I need to clean them up before I can use a regex to extract two numeric columns for years and months.
df = DataFrame(length_of_contact = ["25 AÑOS", "5 años 9 meses", "15 años "])
df = @transform(df,
length_of_contact = lowercase.(:length_of_contact),
length_of_contact = replace.(:length_of_contact, "Ñ", "n"),
length_of_contact = replace.(:length_of_contact, "ñ", "n"),
length_of_contact = replace.(:length_of_contact, "7meses", "7 meses"))
The issue is that each time I refer to :length_of_contact, DataFramesMeta refers to the column :length_of_contact as it was before the beginning of the @transform block. Rather, I would like to perform a set of small sequential changes on the variable.
Am I going about this process the wrong way? Or is there a PR in the 0.7 build that might fix this (I am using 0.6.2)? Obviously, I could imagine that this isn't trivial due to the metaprogramming involved.
Thanks
My guess is that when you write it like that, the assignments are not applied sequentially as you might expect.
You could try
df = @linq df |>
transform(length_of_contact = lowercase.(:length_of_contact)) |>
transform(length_of_contact = replace.(:length_of_contact, "Ñ", "n")) |>
transform(length_of_contact = replace.(:length_of_contact, "ñ", "n")) |>
transform(length_of_contact = replace.(:length_of_contact, "7meses", "7 meses"))
Why not apply all operations at the same time? I.e. lowercase.(replace.(replace.(...)))? That should also be more efficient.
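For illustration, here is a minimal sketch of that nested form on a plain vector, outside of DataFramesMeta. It assumes Julia 1.x, where replace takes a pair (old => new); on 0.6 the three-argument form used in the question applies instead. The names raw and cleaned are made up for the example, and lowercase is applied innermost so that a single "ñ" replacement covers both cases.

```julia
# Sample values copied from the question; `raw` and `cleaned` are illustrative names.
raw = ["25 AÑOS", "5 años 9 meses", "15 años "]

# Nested dot calls fuse into a single pass over the vector.
# lowercase runs first, so "Ñ" has already become "ñ" when replace sees it.
# Assumes Julia 1.x pair syntax: replace(s, old => new).
cleaned = replace.(replace.(lowercase.(raw), "ñ" => "n"), "7meses" => "7 meses")
```

Inside @transform the same expression could go on the right-hand side with :length_of_contact in place of raw, since the column is then referenced only once.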
Yes I am trying to use Lazy.jl to do this right now. Hopefully I will figure it out!
EDIT: I was able to figure it out, for the most part.
This works:
df = @transform(df,
length_of_contact = @> :length_of_contact lowercase.() replace.("ñ", "n"))
But this throws an error:
df = @transform(df,
length_of_contact = @> :length_of_contact
lowercase.()
replace.("ñ", "n"))
It tells me
ERROR: LoadError: syntax: missing comma or ) in argument list
Stacktrace:
[1] include_from_node1(::String) at .\loading.jl:576
[2] include(::String) at .\sysimg.jl:14
while loading C:\Users\Pdeffebach\Documents\cleaning\julia_cleaning.jl, in expression starting on line 49
I think the second way is much more readable. Does anyone know how hard it would be to get Lazy to work with new lines like that?
@> begin
:length_of_contact
lowercase.()
replace.("ñ", "n")
end
This is it, thanks!
This syntax is incredibly readable and concise. Thanks to the contributors of Lazy.jl and DataFramesMeta.jl! I am working on something that compares a data cleaning task in Stata and Julia, and I was expecting Stata to be more readable, but maybe not!