How to break down a string into chunks of n KB while maintaining meaning

I am sending a string to AWS Translate, but I need to respect its 10,000-byte limit for the string.

In case I have longer strings, I am using this approach:

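my_long_string = "some very long text...."
# AWS limit, in bytes
aws_limit = 10000
# minimum number of chunks needed; sizeof gives the UTF-8 byte count
# (Base.summarysize would also count the String object's own overhead)
number_of_sentences = cld(sizeof(my_long_string), aws_limit)
# break the string into an array of strings (segment_text is still to be written)
list_of_sentences = segment_text(my_long_string, number_of_sentences)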
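
A side note on measuring the size, assuming the limit is counted in UTF-8 bytes: for a String, sizeof returns the byte count, while length returns the number of characters, and the two differ as soon as the text contains non-ASCII characters. A minimal sketch (the sample string and the tiny limit are made up for illustration):

```julia
# Hypothetical sample text containing two 2-byte characters (ï and é)
sample = "na\u00efve caf\u00e9"

n_chars = length(sample)   # number of characters
n_bytes = sizeof(sample)   # number of UTF-8 bytes

println("characters: ", n_chars, ", bytes: ", n_bytes)

# Integer ceiling division gives the minimum number of chunks for a byte limit.
limit = 4                  # illustrative limit, not the AWS one
min_chunks = cld(n_bytes, limit)
println("minimum number of chunks: ", min_chunks)
```

The result of cld is only a lower bound: cutting at sentence boundaries usually leaves chunks shorter than the limit, so more chunks may be needed than this count suggests.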

Now, in my segment_text function I need to break the string into number_of_sentences pieces, and here I struggle a bit to find the best way to cut the string at punctuation. The idea is to take a chunk as big as aws_limit, cut it back to the nearest full stop (if any) before the end of the chunk, and progressively "consume" the whole of my_long_string.

Is this by any chance a known algorithm for which a solution already exists?

For example, the following works in-place without copying any substrings (only uses views):

function process_chunks(process::Function, s::String, chunk_size::Integer, delimiter::Char='.')
    # chunk size must be at least big enough for a single Unicode char
    chunk_size ≥ 4 || throw(ArgumentError("chunk_size = $chunk_size is not ≥ 4"))
    i = 1
    e = lastindex(s)
    @views while i <= e
        # find maximum character index such that sizeof(s[i:j]) <= chunk_size
        j = thisind(s, min(e, i + chunk_size-1))
        nextind(s, j) > i + chunk_size && (j = prevind(s, j))
        @assert j >= i
        
        # shrink to last delimiter
        j = i-1 + something(findlast(==(delimiter), s[i:j]), j-(i-1))

        # process the chunk and continue
        process(s[i:j])
        i = nextind(s, j)
    end
end

which gives:

julia> s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

julia> process_chunks(println, s, 60)
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed
 do eiusmod tempor incididunt ut labore et dolore magna aliq
ua.
 Ut enim ad minim veniam, quis nostrud exercitation ullamco 
laboris nisi ut aliquip ex ea commodo consequat.
 Duis aute irure dolor in reprehenderit in voluptate velit e
sse cillum dolore eu fugiat nulla pariatur.
 Excepteur sint occaecat cupidatat non proident, sunt in cul
pa qui officia deserunt mollit anim id est laborum.
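
If an array of chunks is wanted, as in the question, the same logic can collect the pieces instead of passing them to a callback. This is only a sketch under that assumption: split_chunks is a hypothetical name, and the sample text and the limit of 40 bytes are made up for illustration.

```julia
# Hypothetical helper: same splitting logic as process_chunks above,
# but returning the chunks (as views into s) instead of calling a function.
function split_chunks(s::String, chunk_size::Integer, delimiter::Char='.')
    chunk_size ≥ 4 || throw(ArgumentError("chunk_size = $chunk_size is not ≥ 4"))
    chunks = SubString{String}[]
    i = 1
    e = lastindex(s)
    while i <= e
        # last character index j such that sizeof(s[i:j]) <= chunk_size
        j = thisind(s, min(e, i + chunk_size - 1))
        nextind(s, j) > i + chunk_size && (j = prevind(s, j))
        # shrink to the last delimiter inside the window, if there is one
        k = findlast(==(delimiter), SubString(s, i, j))
        k === nothing || (j = i - 1 + k)
        push!(chunks, SubString(s, i, j))
        i = nextind(s, j)
    end
    return chunks
end

text = "First sentence. Second sentence. Third one is a bit longer than the others."
chunks = split_chunks(text, 40)
foreach(println, chunks)
```

For the AWS case this would be called as split_chunks(my_long_string, aws_limit). As in the output above, a window with no delimiter is cut mid-word, and chunks keep their leading space, so a call to strip may be needed before sending them.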