Hi all, I have read a text file using the readlines() function. The file has various sections containing different information. In some of the lines I am finding strings that contain information inside single quotes, with multiple words within the quotes. How can I keep 'THIS TEXT' intact when splitting the following string at every space (" ")? "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 " Splitting at every space breaks the single-quoted text in two.
I believe that if the file used double quotes instead (as is usual), you could use DelimitedFiles.readdlm.
Thank you Henrique, but that is exactly my problem. The text file is an export from an application that I cannot change.
Can't you read the whole file into a String, replace all single quotes with double quotes, and then write it to a string buffer to pass to DelimitedFiles.readdlm?
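A minimal sketch of that idea, assuming a shortened, made-up line rather than the real export. `IOBuffer` stands in for the string buffer, and the explicit `' '` delimiter is my assumption about the file layout. I have not checked how `readdlm` treats the empty `''` fields once they become `""`, so that part would need testing on the real data.

```julia
using DelimitedFiles

# Shortened, made-up line (not the real export); it has no empty '' fields.
line = "1 'THIS TEXT' 0.398"

# Turn single quotes into double quotes so readdlm treats them as field quotes.
converted = replace(line, '\'' => '"')

# Parse from an in-memory buffer, using a single space as the delimiter.
data = readdlm(IOBuffer(converted), ' ')

# The quoted field should come back as one cell, spaces included.
println(data[1, 2])
```

For a whole file, the same `replace` could be applied to `read(path, String)` before wrapping it in `IOBuffer`.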
That’s worth a try, Thanks
Ugly … but seems to work on your test line:
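function tokenize(s)
    outbuffer = []    # finished tokens
    buffer = []       # characters of the token currently being built
    quoted = false    # true while inside a single-quoted field
    for c in s
        if quoted
            if c == '\''
                # closing quote: emit the field, even if it is empty
                push!(outbuffer, join(buffer))
                quoted = false
                buffer = []
            else
                push!(buffer, c)
            end
        elseif isspace(c)
            # whitespace outside quotes ends the current token, if there is one
            !isempty(buffer) && push!(outbuffer, join(buffer))
            buffer = []
        else
            if c == '\''
                quoted = true
                continue
            end
            push!(buffer, c)
        end
    end
    # emit a final unquoted token when the line has no trailing whitespace
    !quoted && !isempty(buffer) && push!(outbuffer, join(buffer))
    outbuffer
end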
test = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
tokenize(test)  # returns Any["#1", "1", "0", "THIS TEXT", "", "88218681", "0.398", "1", "", "1", "1", "2", "0", "168384", "461330", "0"]
Wonderful, this works! I’ve tested it on another string as well. Thanks so much.
Just to point it out: neither the code above nor DelimitedFiles.readdlm supports escaping quotes inside a quoted field. I would suggest checking whether this can occur in your data.
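One cheap way to run such a check is to count the quote characters per line: an odd count means some quote has no partner, so that line deserves a closer look. This is only a rough sketch. `has_unbalanced_quotes` is a name made up for this post, both sample lines are invented, and an even count does not prove the quoting is clean.

```julia
# Hypothetical helper: true when a line contains an odd number of single quotes.
has_unbalanced_quotes(line) = isodd(count(==('\''), line))

balanced   = "#1 1 0 'THIS TEXT' '' 0.398 "   # four quotes
unbalanced = "#1 1 'THIS TEXT 0.398 "         # one quote, never closed

println(has_unbalanced_quotes(balanced))
println(has_unbalanced_quotes(unbalanced))
```

Running it over the output of readlines() with `filter` would list the suspicious lines before any splitting is attempted.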
line1="#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
split(replace(line1, r"(\pL) (\pL)"=>s"\1_\2"))
16-element Vector{SubString{String}}:
"#1"
"1"
"0"
"'THIS_TEXT'"
"''"
"'88218681'"
"0.398"
"1"
"''"
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
split(replace(line1, r"(\pL) (\pL)"=>s"\1_\2","'"=>""))
14-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS_TEXT"
"88218681"
"0.398"
"1"
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
split(replace(line1, r"(\pL) (\pL)"=>s"\1\u00a0\2","'"=>"")," ")
17-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS TEXT"
""
"88218681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
""
# a more general string
line4="#1 1 0 'THIS2 3TEXT' '' '88g 218x y681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
split(replace(line4, r"(\w*\pL\w*) +"=>s"\1\u00a0","'"=>"")," ")
17-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS2 3TEXT"
""
"88g 218x y681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
""
PS
Can anyone explain why split(str, " ") and split(str) produce different results if there are multiple consecutive spaces (\x20)?
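On the PS: as far as I know, the difference comes from the default of the `keepempty` keyword. Without a delimiter argument, `split` splits on runs of whitespace and drops empty fields (`keepempty=false`); with an explicit delimiter it keeps them (`keepempty=true`). A small sketch with a made-up string:

```julia
str = "a  b "   # two spaces between a and b, one trailing space

# No delimiter: whitespace splitting, empty fields dropped.
by_whitespace = split(str)                         # ["a", "b"]

# Explicit delimiter: every single space separates, empty fields kept.
by_space = split(str, " ")                         # ["a", "", "b", ""]

# Explicit delimiter with empty fields dropped again.
by_space_dropped = split(str, " "; keepempty=false)  # ["a", "b"]
```

That is also why the `split(..., " ")` results above contain `""` entries where the empty `''` fields were.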
I think I am working with a comparable situation. In my case, I copy output from another application to the system clipboard, so Julia sees the clipboard contents as a string. I then (painfully) inspect each character of that string to determine whether it is part of the text I want, e.g., “isequal(clipboard()[i], 'char')”. Too simple?
Regexes are an obvious first-choice solution for such problems. You specify what to find in the string, and voila:
julia> map(m -> strip(m.match, '\''), eachmatch(r"'[^']*'|\S+", str))
16-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS TEXT"
""
"88218681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
Another option (nesting of ' will wreak havoc).
julia> line1 = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
julia> Iterators.flatmap((i,x)->isodd(i) ? split(x) : [x],
Iterators.countfrom(1), split(line1, "'")) |> collect
16-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS TEXT"
""
"88218681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
Thank you all for your interest in my problem, learning a lot here.
Do you mean this?
But perhaps it is not a situation that can occur.
line8="#1 1 'nest'THIS1 TEXT1'tsen' '' '88g 218x y681' '0.398 1 '' 1 1 2 0 168384 461330 0 "
Iterators.flatmap((i,x)->isodd(i) ? split(x) : [x],
Iterators.countfrom(1), split(line8, "'")) |> collect
10-element Vector{SubString{String}}:
"#1"
"1"
"nest"
"THIS1"
"TEXT1"
"tsen"
""
"88g 218x y681"
"0.398 1 "
" 1 1 2 0 168384 461330 0 "
map(m -> strip(m.match, '\''), eachmatch(r"'[^']*'|\S+", line8))
16-element Vector{SubString{String}}:
"#1"
"1"
"nest"
"THIS1"
"TEXT1'tsen"
""
"88g 218x y681"
"0.398 1 "
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
tokenize(line8)
8-element Vector{Any}:
"#1"
"1"
"nest"
"THIS1"
"TEXT1tsen"
""
"88g 218x y681"
"0.398 1 "
split(replace(line8, r"(\w*\pL\w*) +"=>s"\1\u00a0","'"=>"")," ")
17-element Vector{SubString{String}}:
"#1"
"1"
"nestTHIS1 TEXT1tsen"
""
""
"88g 218x y681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
""