Broadcast SubString to subset of DataFrame rows?

Hi,
Say I have a large DataFrame, with thousands of rows. For some of those rows, I need to parse out a substring and put it in another column. In my case, these rows have “Register(nnnn)” in them, and I need to extract out “nnnn”.

Here is a MWE with just 4 rows:

dframe = DataFrame(Item = ["Register(1234)","Flow","Register(6789)","Temp"], b = 1:4)

insertcols!(dframe,2,:RegisterID=>"")

rrows = findall(x -> split(x,"(",)[1]=="Register", dframe[:,1])

for i in rrows
    dframe[i,2]=SubString(dframe[i,1],10:13)
end 

The output is:

4×3 DataFrame
 Row │ Item            RegisterID  b     
     │ String          String      Int64 
─────┼───────────────────────────────────
   1 │ Register(1234)  1234            1
   2 │ Flow                            2
   3 │ Register(6789)  6789            3
   4 │ Temp                            4

This works fine. However, I’m suspicious there should be some clever way to broadcast that SubString() function to the subset of rows and put the result in column 2 – but I can’t figure out the syntax.

Is it possible?

Using a regular expression would be more general than parsing out the substring manually.

Probably you were just missing Ref to avoid broadcasting the range of characters:

dframe[rrows,2] .= SubString.(dframe[rrows,1], Ref(10:13))

but here’s a one-liner that implements @Jeff_Emanuel’s advice of using a regular expression:

select(dframe, :Item, :Item => ByRow(x->only(something(match(r"Register\((.*)\)", x), [""]))) => :RegisterID, :)

And here’s a more readable version of the same:

function get_id(item)
    m = match(r"Register\((.*)\)", item)
    return isnothing(m) ? "" : m[1]
end

select(dframe, :Item, :Item => ByRow(get_id) => :RegisterID, :)
1 Like

Here is the equivalent code in DataFramesMeta.jl

julia> dframe
4×3 DataFrame
 Row │ Item            RegisterID  b     
     │ String          String      Int64 
─────┼───────────────────────────────────
   1 │ Register(1234)                  1
   2 │ Flow                            2
   3 │ Register(6789)                  3
   4 │ Temp                            4

julia> @rtransform dframe :RegisterID = begin
           m = match(r"Register\((.*)\)", :Item)
           isnothing(m) ? "" : m[1]
       end
4×3 DataFrame
 Row │ Item            RegisterID  b     
     │ String          AbstractS…  Int64 
─────┼───────────────────────────────────
   1 │ Register(1234)  1234            1
   2 │ Flow                            2
   3 │ Register(6789)  6789            3
   4 │ Temp                            4
1 Like

Very cool. I’m always impressed how Julia can solve the same problem multiple different ways.

select(dframe, :Item, :Item => ByRow(x->only(something(match(r"Register\((.*)\)", x), [""]))) => :RegisterID, :)

This one’s going to take some study. I’m sure I’ll learn a lot.

1 Like

Haven’t had time to dig into DataFramesMeta yet, but it looks interesting.

It will provide a more simple syntax than DataFrames.jl for many data cleaning operations. No need to learn all of DataFrames before beginning to learn DataFramesMeta.jl