Hi all, I have opened a text file using the readlines() function. This text file has various sections containing different information. In some of the lines I am finding strings that contain information inside single quotes, with multiple words within the quotes. How can I keep 'THIS TEXT' intact when splitting the following string at every " "?
"#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
Splitting at every space breaks the single-quoted text in two.
I believe that if the file used double quotes instead (as is usual), you could use DelimitedFiles.readdlm.
Thank you Henrique, but that is exactly my problem. The text file is an export from an application which I cannot change.
Can't you read the whole file into a String, replace all single quotes with double quotes, and then write the result to a string buffer to pass to DelimitedFiles.readdlm?
That’s worth a try, Thanks
Ugly … but seems to work on your test line:
function tokenize(s)
    outbuffer = []   # finished tokens
    buffer = []      # characters of the token being built
    quoted = false   # true while inside a single-quoted field
    for c in s
        if quoted
            if c == '\''
                # closing quote: emit the field, even if it is empty
                push!(outbuffer, join(buffer))
                quoted = false
                buffer = []
            else
                push!(buffer, c)
            end
        elseif isspace(c)
            # whitespace outside quotes ends the current token, if any
            !iszero(length(buffer)) && push!(outbuffer, join(buffer))
            buffer = []
        else
            if c == '\''
                # opening quote: switch to quoted mode, do not keep the quote
                quoted = true
                continue
            end
            push!(buffer, c)
        end
    end
    outbuffer
end
test = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
tokenize(test)  # returns Any["#1", "1", "0", "THIS TEXT", "", "88218681", "0.398", "1", "", "1", "1", "2", "0", "168384", "461330", "0"]
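One thing to be aware of: the loop above only emits a token when it sees whitespace or a closing quote, so a final token is dropped if the line does not end with whitespace (the test line happens to end with a space). Below is a minimal sketch of a variant that flushes whatever is left at the end. The name `tokenize_flush` and the `pending` flag are made up for this illustration, and treating an unterminated quote as a field that runs to the end of the line is an assumption, not something the export format is known to require.

```julia
# Hypothetical variant of tokenize: same state machine, plus a final flush.
function tokenize_flush(s::AbstractString)
    tokens = String[]
    buf = IOBuffer()
    quoted = false    # true while inside a single-quoted field
    pending = false   # true while an unquoted token is being accumulated
    for c in s
        if quoted
            if c == '\''
                push!(tokens, String(take!(buf)))  # emit field, possibly empty
                quoted = false
                pending = false
            else
                write(buf, c)
            end
        elseif isspace(c)
            if pending
                push!(tokens, String(take!(buf)))
                pending = false
            end
        elseif c == '\''
            quoted = true
        else
            write(buf, c)
            pending = true
        end
    end
    # Assumption: keep whatever is left, including an unterminated quoted field.
    (pending || quoted) && push!(tokens, String(take!(buf)))
    return tokens
end

tokenize_flush("a 'b c' d")   # the trailing "d" is kept
```

It returns a `Vector{String}` rather than a `Vector{Any}`, which is usually more convenient downstream.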
Wonderful, this works! I've tested it on another string as well. Thanks so much.
Just to point out: neither the code above nor DelimitedFiles.readdlm supports escaped quotes inside a quoted field. I would suggest checking whether this can occur in your data.
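If escaped quotes do turn out to be possible, a regular expression can be taught about them. The following is only a sketch under a loud assumption: that the exporting application escapes a quote inside a field with a backslash (`\'`). Nothing in this thread says which convention, if any, the application uses, and `tokenize_escaped` is a made-up name.

```julia
# Hypothetical helper. ASSUMPTION: quotes inside a field are written as \'
function tokenize_escaped(line::AbstractString)
    # Either a quoted field (any escaped char, or any char that is neither a
    # quote nor a backslash), or a run of non-whitespace characters.
    pattern = r"'(?:\\.|[^'\\])*'|\S+"
    tokens = String[]
    for m in eachmatch(pattern, line)
        tok = m.match
        if length(tok) >= 2 && first(tok) == '\'' && last(tok) == '\''
            inner = chop(tok; head = 1, tail = 1)              # drop the outer quotes
            push!(tokens, replace(inner, r"\\(.)" => s"\1"))   # unescape \x to x
        else
            push!(tokens, tok)
        end
    end
    return tokens
end

tokenize_escaped(raw"#1 'it\'s here' '' 0.398")
```

If the application doubles the quote instead (`''`), that would be ambiguous with the empty fields in the sample line, so it is worth checking the real data before settling on a rule.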
line1="#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
split(replace(line1, r"(\pL) (\pL)"=>s"\1_\2"))
16-element Vector{SubString{String}}:
"#1"
"1"
"0"
"'THIS_TEXT'"
"''"
"'88218681'"
"0.398"
"1"
"''"
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
split(replace(line1, r"(\pL) (\pL)"=>s"\1_\2","'"=>""))
14-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS_TEXT"
"88218681"
"0.398"
"1"
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
split(replace(line1, r"(\pL) (\pL)"=>s"\1\u00a0\2","'"=>"")," ")
17-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS TEXT"
""
"88218681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
""
# a more general string
line4="#1 1 0 'THIS2 3TEXT' '' '88g 218x y681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
split(replace(line4, r"(\w*\pL\w*) +"=>s"\1\u00a0","'"=>"")," ")
17-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS2 3TEXT"
""
"88g 218x y681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
""
PS
Can anyone explain why split(str, " ") and split(str) produce different results if there are multiple consecutive spaces (\x20)?
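As far as I understand the Base documentation, the difference comes from the default of the `keepempty` keyword: `split(str)` splits on any whitespace and defaults to `keepempty=false`, while `split(str, dlm)` with an explicit delimiter defaults to `keepempty=true`, so every extra space yields an empty string. A small sketch (the string `s` is just an example):

```julia
s = "a  b "   # two spaces between a and b, one trailing space

no_delim   = split(s)                         # whitespace split, empty pieces dropped
with_delim = split(s, " ")                    # explicit delimiter, empty pieces kept
dropped    = split(s, " "; keepempty = false) # explicit delimiter, empty pieces dropped

println(no_delim)     # 2 elements: "a", "b"
println(with_delim)   # 4 elements: "a", "", "b", ""
println(dropped)      # 2 elements: "a", "b"
```

That is also why the `split(..., " ")` results above end with a trailing `""`: the test line ends with a space.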
I think I am working with a comparable situation. In my case, I copy output from another application to the system clipboard, so Julia sees the clipboard contents as a string. I then (painfully) interrogate each character of the clipboard contents to determine whether it is part of the string I want, e.g. isequal(clipboard()[i], 'char'). Too simple?
Regexes are an obvious first-choice solution for such problems. You specify what to find in the string, and voila:
julia> map(m -> strip(m.match, '\''), eachmatch(r"'[^']*'|\S+", str))
16-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS TEXT"
""
"88218681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
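To spell out what the pattern does: the first alternative `'[^']*'` matches a complete quoted field (possibly empty), and the second alternative `\S+` matches any other run of non-whitespace characters. The order matters, because the quoted alternative is tried first. Here is a minimal sketch that wraps the one-liner in a function and applies it to several lines. The name `split_quoted` and the second sample line are invented for illustration, and the `IOBuffer` merely stands in for the real file.

```julia
# Hypothetical wrapper around the regex one-liner above.
split_quoted(line) =
    [String(strip(m.match, '\'')) for m in eachmatch(r"'[^']*'|\S+", line)]

# Stand-in for the exported file; the second line is made-up sample data.
io = IOBuffer("""
#1 1 0 'THIS TEXT' '' '88218681' 0.398 1
#2 1 0 'OTHER TEXT' '' '123' 0.5 2
""")

rows = [split_quoted(line) for line in readlines(io)]
foreach(println, rows)
```

With a real file, `readlines("path/to/export.txt")` would take the place of `readlines(io)`.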
Another option (nesting of ' will wreak havoc).
julia> line1 = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
julia> Iterators.flatmap((i,x)->isodd(i) ? split(x) : [x],
Iterators.countfrom(1), split(line1, "'")) |> collect
16-element Vector{SubString{String}}:
"#1"
"1"
"0"
"THIS TEXT"
""
"88218681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
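The reason this works: after splitting on `'`, the pieces alternate between text outside quotes (odd positions) and text inside quotes (even positions). Only the odd pieces are split further on whitespace; the even pieces are kept whole. The same idea written as an explicit loop, as a sketch (`split_on_quotes` is a made-up name, and it assumes the quotes in the line are balanced and never nested):

```julia
# Hypothetical helper. ASSUMPTION: quotes are balanced and not nested.
function split_on_quotes(line::AbstractString)
    tokens = String[]
    for (i, piece) in enumerate(split(line, "'"))
        if isodd(i)
            append!(tokens, split(piece))   # outside quotes: split on whitespace
        else
            push!(tokens, piece)            # inside quotes: keep as one token
        end
    end
    return tokens
end

line1 = "#1 1 0 'THIS TEXT' '' '88218681' 0.398 1 '' 1 1 2 0 168384 461330 0 "
split_on_quotes(line1)
```

An unbalanced quote shifts the odd/even alternation for the rest of the line, which is the havoc mentioned above.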
Thank you all for your interest in my problem, learning a lot here.
Do you mean this?
But perhaps it is not a situation that can occur.
line8="#1 1 'nest'THIS1 TEXT1'tsen' '' '88g 218x y681' '0.398 1 '' 1 1 2 0 168384 461330 0 "
Iterators.flatmap((i,x)->isodd(i) ? split(x) : [x],
Iterators.countfrom(1), split(line8, "'")) |> collect
10-element Vector{SubString{String}}:
"#1"
"1"
"nest"
"THIS1"
"TEXT1"
"tsen"
""
"88g 218x y681"
"0.398 1 "
" 1 1 2 0 168384 461330 0 "
map(m -> strip(m.match, '\''), eachmatch(r"'[^']*'|\S+", line8))
16-element Vector{SubString{String}}:
"#1"
"1"
"nest"
"THIS1"
"TEXT1'tsen"
""
"88g 218x y681"
"0.398 1 "
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
tokenize(line8)
8-element Vector{Any}:
"#1"
"1"
"nest"
"THIS1"
"TEXT1tsen"
""
"88g 218x y681"
"0.398 1 "
split(replace(line8, r"(\w*\pL\w*) +"=>s"\1\u00a0","'"=>"")," ")
17-element Vector{SubString{String}}:
"#1"
"1"
"nestTHIS1 TEXT1tsen"
""
""
"88g 218x y681"
"0.398"
"1"
""
"1"
"1"
"2"
"0"
"168384"
"461330"
"0"
""