Create string out of substring in dataframe

I’m trying to create a new string parsed out of a substring of a column that is delimited by a character.

For example, in this dataframe

df = DataFrame(
     id=["1", "1", "2"],
     string=["A.B.C.D.JUNK", "A.B.C.JUNK", "A.B.JUNK"]

I would like a create a new column that is created by splitting the cell called string by the “.” delimiter and removing the first and last sections of the field.

So the result should look like the following

df = DataFrame(
     id=["1", "1", "2"],
     string=["A.B.C.D.JUNK", "A.B.C.JUNK", "A.B.JUNK"],

I’ve tried this but it doesn’t seem to work

df.newstring .== join.(split.(df.string, ".")[2:end-1], ".")

Any suggestions

Here’s my attempt at turning "A.B.C.D.JUNK" into ["B.C.D","B.C","B"]. I split it up into multiple lines for clarity. You can collapse it back to a one-liner later.

s = "A.B.C.D.JUNK"
ps = split(s, ".")
middle = ps[2:end-1]
newstring = reduce(middle) do m, a
	if ismissing(m)
		return [a]
		return ["$(m[1]).$a", m...]

The variable newstring should be ["B.C.D","B.C","B"]. The main difference between your attempt and mine is join vs reduce.

@g-gundam I’m not sure how this works to update the data frame? working on one string at a time, the join(split) works fine, but I’m having trouble updating the entire column in the dataframe.

@jonjilla - I misunderstood your initial intent. I’m going to try something else. One moment.

Try this:

df2 = DataFrame(
    id=["1", "1", "2"],
    string=["A.B.C.D.JUNK", "A.B.C.JUNK", "A.B.JUNK"]
df2.newstring = map(df2.string) do s
    join(split(s, ".")[2:end-1], ".")

Thanks, that seems to get me what I need

1 Like

A solution along the lines of your attempt could be this.

newstr=join.(getindex.(split.(df.string, "."),range.(2,lastindex.(split.(df.string, ".")).-1)),'.')

As you can easily see, it is not, in any respect, preferable to the solution already proposed.

1 Like

Just because you can, doesn’t mean you should. :stuck_out_tongue:
Sometimes, Julia reminds me of Perl, and I mean that affectionately. I liked Perl and how flexible and playful it was in its time.

a middle


map(((s, h,t),)->chop(s,head=h,tail=t), zip(df.string, vh,vt))
newstring = map(df.string, vh,vt) do s, h, t
replace.(df.string, r"^\w+\."=>"",r"\.\w+$"=>"")
replace.(df.string, r"^\w+\.(.+)\.\w+$"=>s"\1")

This is a bit tricky, as you have:

julia> str = "ł.ą.ź"

julia> vh=findfirst('.',str)

julia> vt=findlast('.',str)

julia> chop(str, head=vh, tail=vt) # wrong

julia> chop(str[vh+1:vt]) # correct, +1 because we know that `.` is ASCII

And the issue is related to byte vs character indexing differences. See The String, or There and Back Again | Blog by Bogumił Kamiński

For ASCII strings things would work.

This wouldn’t work even with no non-ASCII as tail parameter is count from end of string, and vt is an index from beginning of string.

In any case, another mini-idea is to use SubString to avoid copying (ADDED: this code has a bug. see next 2 posts):

julia> map(df2.string) do s
3-element Vector{SubString{String}}:

As Bogumil suggested, the +1 good for index math because of ASCII ‘.’ and the -1 is good because we want to index last byte of previous codepoint.

ah - right.

It would not work, because the character just before the last . could be muliti-byte:

julia> s = ".ą."

julia> SubString(s,findfirst('.',s)+1:findlast('.',s)-1)
ERROR: StringIndexError: invalid index [3], valid nearby indices [2]=>'ą', [4]=>'.'

(things are hard :(, I made such mistakes numerous times unfortunately, I work in the Polish language where such cases are super common so I am sensitive to them). You need to chop or take prevind

1 Like

…things are indeed confusing. Just to keep working code in the thread, with prev_ind:

julia> map(df2.string) do s
1 Like