Read single quoted values inside strings

Hi all, I have read a text file using the readlines() function. The file has various sections containing different information. Some of the lines contain strings with information inside single quotes, and there can be multiple words within these quotes. How can I keep 'THIS TEXT' intact when splitting the following string "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 " at every space? Splitting at every space breaks the single-quoted text in two.

I believe that if you were using double quotes instead (as is usual), you could maybe use DelimitedFiles.readdlm.

Thank you Henrique, but that is exactly my problem. The text file is an export from an application which I cannot change.

Can't you read the whole file into a String, replace all single quotes with double quotes, and then write it to a string buffer (an IOBuffer) to pass to DelimitedFiles.readdlm?

That's worth a try, thanks.

Ugly … but seems to work on your test line:

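function tokenize(s)
    outbuffer = []        # finished tokens
    buffer = Char[]       # characters of the token being built
    quoted = false        # true while inside a single-quoted field

    for c in s
        if quoted
            if c == '\''
                # closing quote: emit the field, even if it is empty
                push!(outbuffer, join(buffer))
                empty!(buffer)
                quoted = false
            else
                push!(buffer, c)
            end
        elseif isspace(c)
            # whitespace outside quotes ends the current token, if there is one
            isempty(buffer) || push!(outbuffer, join(buffer))
            empty!(buffer)
        elseif c == '\''
            quoted = true
        else
            push!(buffer, c)
        end
    end

    # flush a last unquoted token when the line does not end in whitespace;
    # an unterminated quoted field is still discarded
    (quoted || isempty(buffer)) || push!(outbuffer, join(buffer))

    outbuffer
end

test = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
tokenize(test)
# returns Any["#1", "1", "0", "THIS TEXT", "", "88218681", "0.398", "1", "", "1", "1", "2", "0", "168384", "461330", "0"]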

Wonderful, this works! I've tested it on another string as well. Thanks so much.

Just to point out: neither the code above nor DelimitedFiles.readdlm supports escaped quotes inside a quoted field. I would suggest checking whether that can occur in your data.
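If escaped quotes do turn up, one possible way to handle them is sketched below. It assumes, purely for illustration, that the exporting application writes an embedded quote as \' (backslash followed by a quote); nothing in this thread confirms that convention, and the name tokenize_escaped is made up for this example.

```julia
# Hypothetical variant: assumes an embedded quote is written as \' by the exporter.
function tokenize_escaped(line::AbstractString)
    # First alternative: a quoted field whose body is made of non-quote,
    # non-backslash characters or backslash-escaped characters.
    # Second alternative: a run of non-whitespace characters.
    pattern = r"'((?:[^'\\]|\\.)*)'|(\S+)"
    tokens = String[]
    for m in eachmatch(pattern, line)
        if m.captures[1] !== nothing
            # quoted field: drop the outer quotes and unescape \' to '
            push!(tokens, replace(m.captures[1], "\\'" => "'"))
        else
            push!(tokens, m.captures[2])
        end
    end
    return tokens
end

tokens = tokenize_escaped("1 'it\\'s here' '' x")
```

Other escaping conventions would need a different pattern, so it is worth checking what the exporter actually produces first.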

line1="#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "

split(replace(line1, r"(\pL) (\pL)"=>s"\1_\2"))


16-element Vector{SubString{String}}:
 "#1"
 "1"
 "0"
 "'THIS_TEXT'"
 "''"
 "'88218681'"
 "0.398"
 "1"
 "''"
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"

split(replace(line1, r"(\pL) (\pL)"=>s"\1_\2","'"=>""))
14-element Vector{SubString{String}}:
 "#1"
 "1"
 "0"
 "THIS_TEXT"
 "88218681"
 "0.398"
 "1"
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"
split(replace(line1, r"(\pL) (\pL)"=>s"\1\u00a0\2","'"=>"")," ")
17-element Vector{SubString{String}}:
 "#1"
 "1"
 "0"
 "THIS TEXT"
 ""
 "88218681"
 "0.398"
 "1"
 ""
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"
 ""
# a more general string

line4="#1 1 0 'THIS2 3TEXT' '' '88g 218x y681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
split(replace(line4, r"(\w*\pL\w*) +"=>s"\1\u00a0","'"=>"")," ")
17-element Vector{SubString{String}}:
 "#1"
 "1"
 "0"
 "THIS2 3TEXT"
 ""
 "88g 218x y681"
 "0.398"
 "1"
 ""
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"
 ""



PS
Can anyone explain why split(str, " ") and split(str) produce different results if there are multiple consecutive spaces (\x20)?
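On the PS: as far as I can tell from the documentation of split, the difference comes from the default of the keepempty keyword. Without a delimiter argument, split breaks on whitespace and uses keepempty=false; with an explicit delimiter it uses keepempty=true, so every extra space produces an empty string. A small sketch (the string and variable names are only illustrative):

```julia
str = "a  b c "   # two spaces after "a", one trailing space

by_whitespace = split(str)                         # whitespace delimiter, empty pieces dropped
by_space      = split(str, " ")                    # explicit delimiter, empty pieces kept
by_space_trim = split(str, " "; keepempty=false)   # explicit delimiter, empty pieces dropped
```

Here by_space contains an empty string for the second of the two consecutive spaces and another for the trailing space, which is where the 17 elements in the outputs above come from.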

I think I am working with a comparable situation. In my case, I copy output from another application to the system clipboard, which Julia then sees as a string. I then (painfully) inspect each character of the clipboard contents to determine whether it is part of the string I want, e.g. isequal(clipboard()[i], c) for some character c. Too simple?

Regexes are an obvious first-choice solution for such problems. You specify what to find in the string, and voila:

julia> map(m -> strip(m.match, '\''), eachmatch(r"'[^']*'|\S+", str))
16-element Vector{SubString{String}}:
 "#1"
 "1"
 "0"
 "THIS TEXT"
 ""
 "88218681"
 "0.398"
 "1"
 ""
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"
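To spell out how the pattern works, here is the same one-liner wrapped in a small function. The function name is made up for this sketch; the regex and the strip call are the ones from the post above. The first alternative matches a complete single-quoted field (possibly empty), and only if that fails does the second alternative match a run of non-whitespace characters.

```julia
# Sketch: same regex as above, wrapped in a (hypothetically named) function.
function split_quoted(line::AbstractString)
    # '[^']*'  -> a quote, any number of non-quote characters, a closing quote
    # \S+      -> otherwise, a run of non-whitespace characters
    pattern = r"'[^']*'|\S+"
    # strip removes the surrounding quotes from quoted fields
    return [strip(m.match, '\'') for m in eachmatch(pattern, line)]
end

line1 = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
fields = split_quoted(line1)
```

Note that strip also removes a stray leading or trailing quote from an unquoted token, so this relies on quotes only appearing as field delimiters.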

Another option (nesting of ' will wreak havoc).

julia> line1 = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "

julia> Iterators.flatmap((i,x)->isodd(i) ? split(x) : [x],
  Iterators.countfrom(1), split(line1, "'")) |> collect
16-element Vector{SubString{String}}:
 "#1"
 "1"
 "0"
 "THIS TEXT"
 ""
 "88218681"
 "0.398"
 "1"
 ""
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"
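The idea behind this one: after splitting at every quote, the odd-numbered pieces lie outside the quotes and are split further on whitespace, while the even-numbered pieces are the quoted fields and are kept whole. If I remember correctly, Iterators.flatmap is only available from Julia 1.9, so here is a rough equivalent with a plain loop (the function name is only for illustration):

```julia
# Sketch of the same odd/even idea without Iterators.flatmap.
function split_on_quotes(line::String)
    tokens = SubString{String}[]
    for (i, piece) in enumerate(split(line, '\''))
        if isodd(i)
            # outside quotes: ordinary whitespace splitting
            append!(tokens, split(piece))
        else
            # inside quotes: keep the field intact, even if it is empty
            push!(tokens, piece)
        end
    end
    return tokens
end

line1 = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
fields = split_on_quotes(line1)
```

As with the flatmap version, an unbalanced quote shifts the odd/even bookkeeping for the rest of the line.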

Thank you all for your interest in my problem, learning a lot here.

Do you mean something like this?
Though perhaps it is not a situation that can occur.


line8="#1 1 'nest'THIS1  TEXT1'tsen'  '' '88g 218x   y681' '0.398 1 '' 1 1 2 0 168384 461330 0 "

Iterators.flatmap((i,x)->isodd(i) ? split(x) : [x],
  Iterators.countfrom(1), split(line8, "'")) |> collect

10-element Vector{SubString{String}}:
 "#1"
 "1"
 "nest"
 "THIS1"
 "TEXT1"
 "tsen"
 ""
 "88g 218x   y681"
 "0.398 1 "
 " 1 1 2 0 168384 461330 0 "

map(m -> strip(m.match, '\''), eachmatch(r"'[^']*'|\S+", line8))
16-element Vector{SubString{String}}:
 "#1"
 "1"
 "nest"
 "THIS1"
 "TEXT1'tsen"
 ""
 "88g 218x   y681"
 "0.398 1 "
 ""
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"

tokenize(line8)
8-element Vector{Any}:
 "#1"
 "1"
 "nest"
 "THIS1"
 "TEXT1tsen"
 ""
 "88g 218x   y681"
 "0.398 1 "



split(replace(line8, r"(\w*\pL\w*) +"=>s"\1\u00a0","'"=>"")," ")
17-element Vector{SubString{String}}:
 "#1"
 "1"
 "nestTHIS1 TEXT1tsen"
 ""
 ""
 "88g 218x y681"
 "0.398"
 "1"
 ""
 "1"
 "1"
 "2"
 "0"
 "168384"
 "461330"
 "0"
 ""