Create string out of substring in dataframe

jonjilla · September 25, 2023, 1:15am

I’m trying to create a new string parsed out of a substring of a column that is delimited by a character.

For example, in this dataframe

df = DataFrame(
     id=["1", "1", "2"],
     string=["A.B.C.D.JUNK", "A.B.C.JUNK", "A.B.JUNK"]
)

I would like a create a new column that is created by splitting the cell called string by the “.” delimiter and removing the first and last sections of the field.

So the result should look like the following

df = DataFrame(
     id=["1", "1", "2"],
     string=["A.B.C.D.JUNK", "A.B.C.JUNK", "A.B.JUNK"],
     newstring=["B.C.D","B.C","B"]
)

I’ve tried this but it doesn’t seem to work

df.newstring .== join.(split.(df.string, ".")[2:end-1], ".")

Any suggestions

g-gundam · September 25, 2023, 2:01am

Here’s my attempt at turning "A.B.C.D.JUNK" into ["B.C.D","B.C","B"]. I split it up into multiple lines for clarity. You can collapse it back to a one-liner later.

s = "A.B.C.D.JUNK"
ps = split(s, ".")
middle = ps[2:end-1]
newstring = reduce(middle) do m, a
	if ismissing(m)
		return [a]
	else
		return ["$(m[1]).$a", m...]
	end
end

The variable newstring should be ["B.C.D","B.C","B"]. The main difference between your attempt and mine is join vs reduce.

jonjilla · September 25, 2023, 3:10am

@g-gundam I’m not sure how this works to update the data frame? working on one string at a time, the join(split) works fine, but I’m having trouble updating the entire column in the dataframe.

g-gundam · September 25, 2023, 3:34am

@jonjilla - I misunderstood your initial intent. I’m going to try something else. One moment.

Try this:

df2 = DataFrame(
    id=["1", "1", "2"],
    string=["A.B.C.D.JUNK", "A.B.C.JUNK", "A.B.JUNK"]
)
df2.newstring = map(df2.string) do s
    join(split(s, ".")[2:end-1], ".")
end

jonjilla · September 25, 2023, 5:05am

Thanks, that seems to get me what I need

rocco_sprmnt21 · September 25, 2023, 6:05am

A solution along the lines of your attempt could be this.

newstr=join.(getindex.(split.(df.string, "."),range.(2,lastindex.(split.(df.string, ".")).-1)),'.')

As you can easily see, it is not, in any respect, preferable to the solution already proposed.

g-gundam · September 25, 2023, 6:11am

Just because you can, doesn’t mean you should.
Sometimes, Julia reminds me of Perl, and I mean that affectionately. I liked Perl and how flexible and playful it was in its time.

rocco_sprmnt21 · September 25, 2023, 6:31am

a middle

vh=findfirst.('.',df.string)
vt=findfirst.('.',reverse.(df.string))

map(((s, h,t),)->chop(s,head=h,tail=t), zip(df.string, vh,vt))
#or
newstring = map(df.string, vh,vt) do s, h, t
	chop(s,head=h,tail=t)
end

rocco_sprmnt21 · September 25, 2023, 6:42am

replace.(df.string, r"^\w+\."=>"",r"\.\w+$"=>"")

replace.(df.string, r"^\w+\.(.+)\.\w+$"=>s"\1")

bkamins · September 25, 2023, 8:06am

This is a bit tricky, as you have:

julia> str = "ł.ą.ź"
"ł.ą.ź"

julia> vh=findfirst('.',str)
3

julia> vt=findlast('.',str)
6

julia> chop(str, head=vh, tail=vt) # wrong
""

julia> chop(str[vh+1:vt]) # correct, +1 because we know that `.` is ASCII
"ą"

And the issue is related to byte vs character indexing differences. See The String, or There and Back Again | Blog by Bogumił Kamiński

For ASCII strings things would work.

Dan · September 25, 2023, 8:37am

This wouldn’t work even with no non-ASCII as tail parameter is count from end of string, and vt is an index from beginning of string.

In any case, another mini-idea is to use SubString to avoid copying (ADDED: this code has a bug. see next 2 posts):

julia> map(df2.string) do s
       SubString(s,findfirst('.',s)+1:findlast('.',s)-1)
       end
3-element Vector{SubString{String}}:
 "B.C.D"
 "B.C"
 "B"

As Bogumil suggested, the +1 good for index math because of ASCII ‘.’ and the -1 is good because we want to index last byte of previous codepoint.

bkamins · September 25, 2023, 9:50am

ah - right.

It would not work, because the character just before the last . could be muliti-byte:

julia> s = ".ą."
".ą."

julia> SubString(s,findfirst('.',s)+1:findlast('.',s)-1)
ERROR: StringIndexError: invalid index [3], valid nearby indices [2]=>'ą', [4]=>'.'

(things are hard :(, I made such mistakes numerous times unfortunately, I work in the Polish language where such cases are super common so I am sensitive to them). You need to chop or take prevind

Dan · September 25, 2023, 9:58am

…things are indeed confusing. Just to keep working code in the thread, with prev_ind:

julia> map(df2.string) do s
       SubString(s,findfirst('.',s)+1:prevind(s,findlast('.',s)))
       end

Topic		Replies	Views
How to extract substring of a Julia Dataframe column General Usage	3	1783	June 17, 2020
Julia Row-wise operation of DataFrames New to Julia question , dataframes	2	1293	July 13, 2019
Split Column in Dataframe New to Julia dataframes	5	2190	February 4, 2022
String column dataframe: replace for another string built from a substring from each row Data	1	1292	January 20, 2021
Split string to multiple columns New to Julia	10	3564	August 1, 2019

Create string out of substring in dataframe

Related topics