Is there a way to keep the delimiter in the split function, as a separate element?

Hello,

This is identical to this.

In my case, I want to split a string (for example a fairly complex anthroponym with dashes, apostrophes, etc.) and apply capitalization rules that depend on these punctuation delimiters, but I can’t find an easy way to keep all the elements in order, as I could in other languages.

MWE:

> split("123.456.789", r"\.")
["123", "456", "789"]  # Current and expected behaviour.

> split("123.456.789", r"(\.)")
["123", ".", "456", ".", "789"]  # Expected behaviour, since the pattern is inside a capture group (not what split currently returns).

> split("AAA’s BBB-CCC", r"([’'\s-])")  # More complex example.
["AAA", "s", "BBB", "CCC"]  # Current behaviour.
["AAA", "’", "s", " ", "BBB", "-", "CCC"]  # Expected behaviour.

> split("123.456.789", r"\."; keep=true)  # Kind of enhancement with retrocompatibility.
["123", ".", "456", ".", "789"]

This is quite similar to this ticket; the main difference is that the delimiter is kept as a separate element among the others (although the keep argument could instead take a value such as separate, previous or next, to be useful for all cases…).
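To make the idea concrete, here is a rough sketch of what such a keep option might do, written as a standalone helper rather than a change to split. The name split_keep and the symbols :separate, :previous and :next are hypothetical and do not exist in Base; the sketch only relies on eachmatch and the offset and match fields of RegexMatch.

```julia
# Hypothetical helper, not part of Base: split `str` on `delim` and keep each
# delimiter as its own element, or glued to the previous or next piece.
function split_keep(str::AbstractString, delim::Regex; keep::Symbol = :separate)
    keep in (:separate, :previous, :next) ||
        throw(ArgumentError("keep must be :separate, :previous or :next"))
    parts = String[]
    start = firstindex(str)   # index where the current piece begins
    pending = ""              # delimiter waiting to prefix the next piece (:next only)
    for m in eachmatch(delim, str)
        isempty(m.match) && continue   # zero-width matches are not delimiters here
        piece = pending * str[start:prevind(str, m.offset)]
        pending = ""
        if keep === :separate
            push!(parts, piece, m.match)
        elseif keep === :previous
            push!(parts, piece * m.match)
        else  # :next
            push!(parts, piece)
            pending = String(m.match)
        end
        start = m.offset + ncodeunits(m.match)   # first index after the delimiter
    end
    push!(parts, pending * str[start:end])       # trailing piece after the last delimiter
    return parts
end

split_keep("123.456.789", r"\.")                    # ["123", ".", "456", ".", "789"]
split_keep("123.456.789", r"\."; keep = :previous)  # ["123.", "456.", "789"]
split_keep("123.456.789", r"\."; keep = :next)      # ["123", ".456", ".789"]
```

Unlike split, this sketch always keeps empty pieces and has no limit argument; a real implementation would presumably need to decide how those options interact with keep.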

First, I planned to open a new ticket on GitHub, but maybe I missed something!

Sincerely.

I guess you’re looking for a Base Julia way to do this, but in case you just want to get your problem solved, you can do it with lookaheads and lookbehinds. I have a package ReadableRegex which makes this a bit nicer to read:

using ReadableRegex

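# Split `str` at the zero-width positions just before and just after `splitter`,
# so that each delimiter ends up as its own element.
function split_keeping_splitter(str, splitter)
    r = Regex(
        either(
            look_for("", before = splitter),  # empty match right before the splitter
            look_for("", after = splitter)    # empty match right after the splitter
        )
    )
    return split(str, r)
end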

split_keeping_splitter("123.456.789", ".")
5-element Vector{SubString{String}}:
 "123"
 "."
 "456"
 "."
 "789"

The regex being constructed for the “.” version would be r"(?:(?:(?<=\.)(?:))|(?:(?:)(?=\.)))" so I can’t recommend constructing that yourself :wink:


Thank you for your answer!

Yes, I was searching for a Base Julia way of doing it, but your proposal is faster than what I was planning to do: r"(?<=X)|(?=X)" is a pretty good way to solve the issue (just a regex to change, no more lines!), while waiting for a potential Base improvement (for ease of use, maybe also performance).
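For anyone reading later, the Base-only version of that regex trick could look roughly like the sketch below. The function name split_around is made up for illustration, and it assumes the delimiter is passed as a regex pattern string that is valid inside a lookbehind (PCRE lookbehinds generally need a pattern of bounded length, so something like `\s+` may be rejected).

```julia
# Hypothetical helper, not part of Base: split at the zero-width positions
# just after and just before every match of `pattern`, so that each
# delimiter comes out as a separate element.
function split_around(str::AbstractString, pattern::AbstractString)
    # (?<=X) matches the empty string right after X, (?=X) right before X.
    # The non-capturing group keeps alternations inside `pattern` intact.
    r = Regex("(?<=(?:$pattern))|(?=(?:$pattern))")
    return split(str, r)
end

split_around("123.456.789", raw"\.")        # ["123", ".", "456", ".", "789"]
split_around("AAA’s BBB-CCC", raw"[’'\s-]")
```

Writing the regex literal directly, as in split("123.456.789", r"(?<=\.)|(?=\.)"), does the same thing without the helper.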

If I don’t get other answers soon, I will mark your answer as the solution and open a ticket on GitHub…
