Hello,
I’ve been working on parsing a text file into a multidimensional array.
The text file has entries like this, with each line representing a word followed by all of its pronunciation information (these lines are a subset of sphinxsmall.txt, which is mentioned later):
!EXCLAMATION-POINT EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE K L OW1 Z K W OW1 T
"DOUBLE-QUOTE D AH1 B AH0 L K W OW1 T
I am able to get individual lines split using the split function, but I am currently unsuccessful at iterating over the entire collection of words.
Here is the code that I am trying to work with:
words = []
syllables = []
a = open("sphinxsmall.txt", "r")
for line in eachline(a)
    for h in eachindex(words)
        syllables = hcat(split(line), words[h])
        # syllables = split(line)
    end
end
close(a)
If you comment out the for loop with syllables and uncomment the syllables = split(line), the script runs but the last word overwrites all of the previous words.
From what I’ve read, “you can’t use push! on multidimensional arrays. You can push onto a 1d array, and then reshape. Another way is to collect all the row/column vectors and use hcat(c...) or vcat(c...) to create an array.”
As a result, I was trying to bypass push! altogether via nested for loops, which should theoretically work for multiple dimensions.
Ultimately, I want my resultant array to be m x n dimensions (m being decided by the number of word rows, and n being decided by the number of syllable representation columns).
Right now, only this code works, but the way it is designed, the result of split(line) is overwritten on each iteration, hence my rationale for creating an outer for loop to save each transient iteration.
with syllables = []
a = open("sphinxsmall.txt", "r")
for line in eachline(a)
    syllables = split(line)
end
close(a)
This is certainly an improvement as the values are not overridden. However, the entire array is still rendered as a 1xn dimensional array. For clear demarcation, I also need one dimension for each word (the actual corpus is 140,000 words * ≈10 phonemes each).
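One possible way to keep one row per word is sketched below. It assumes a recent Julia (1.x) and whitespace-separated tokens. The names `read_rows` and `to_matrix` are made up for illustration, and padding short rows with `""` is an assumption: because words have different numbers of phonemes, the rows are ragged, so a true m×n matrix needs some filler value. A vector of vectors avoids that problem and accepts `push!` directly.

```julia
# Sketch: collect one row (word + phonemes) per line with push!.
function read_rows(io::IO)
    rows = Vector{Vector{String}}()
    for line in eachline(io)
        tokens = split(line)                  # split on any whitespace
        isempty(tokens) && continue           # skip blank lines
        push!(rows, String[t for t in tokens])
    end
    return rows
end

# Optional: pad the ragged rows with "" to get an m×n Matrix{String}.
function to_matrix(rows::Vector{Vector{String}})
    m = length(rows)
    n = isempty(rows) ? 0 : maximum(length, rows)
    mat = fill("", m, n)
    for (i, row) in enumerate(rows)
        for (j, token) in enumerate(row)
            mat[i, j] = token
        end
    end
    return mat
end

# The three sample lines from the question, read from an in-memory buffer.
sample = """
!EXCLAMATION-POINT EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE K L OW1 Z K W OW1 T
"DOUBLE-QUOTE D AH1 B AH0 L K W OW1 T
"""

rows = read_rows(IOBuffer(sample))
mat = to_matrix(rows)
```

With a real file, `open(read_rows, "sphinxsmall.txt")` would pass the file handle to the function and close it afterwards.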
Unless the order of words matters to you, I would do something like
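function parse_wordinfo(io::IO)
    # Map each word to the vector of its pronunciation symbols
    table = Dict{String,Vector{String}}()
    for line in eachline(io)
        parts = split(line, ' ')
        word = first(parts)        # first token is the word itself
        syllables = parts[2:end]   # remaining tokens are the phonemes
        table[word] = syllables
    end
    table
end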
I did not test this, and I am not sure I fully understand the spec, so fix/modify accordingly. If you do need the order, push!(container, word => syllables) could be a solution.
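If the order does matter, the `push!` variant might look roughly like this. It is only a sketch: the name `parse_wordinfo_ordered` is hypothetical, and it assumes whitespace-separated tokens and that blank lines should be skipped.

```julia
# Sketch: keep the file order by pushing word => phonemes pairs onto a vector.
function parse_wordinfo_ordered(io::IO)
    entries = Vector{Pair{String,Vector{String}}}()
    for line in eachline(io)
        parts = split(line)
        isempty(parts) && continue                  # skip blank lines
        word = String(first(parts))
        syllables = String[p for p in parts[2:end]]
        push!(entries, word => syllables)
    end
    return entries
end

# Small in-memory example; the two entries are made up for illustration.
entries = parse_wordinfo_ordered(IOBuffer("B B IY1\nA EY1\n"))
for (word, syllables) in entries
    println(word, " has ", length(syllables), " phonemes")
end
```

A vector of pairs can still be turned into a dictionary later with `Dict(entries)` if lookups by word are needed.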
Thank you so much Tamas! This accomplishes what I need in a way I would not have thought of (dictionaries).
Because I have never used dictionaries, I wonder if, off the top of your head (without stressing about it), you have any advice on how to iterate through the values for each key in order to count them and to check whether the value strings contain certain substrings [0, 1, or 2 in my case, as these encode phonetic accents]. The goal is to append this gleaned information as additional values under the relevant keys of the dictionary. I found this thread to help me get started, but its goal is to iterate over keys and not values.
Once again, thank you so much!
UPDATE
sv = map(length,values(table)) gets the number of values per key. The link I provided is actually proving to be quite useful. I’m working at hacking away at this problem…
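For the iteration question, here is a minimal sketch, assuming Julia 1.x. Iterating over a `Dict` yields key/value pairs, so each word and its phonemes can be handled together. The two-entry `table` is a made-up illustrative sample, `has_stress` is a hypothetical helper, and storing the results in separate dictionaries (rather than appending to the original values) is just one possible design.

```julia
# Illustrative stand-in for the parsed table (word => phonemes).
table = Dict(
    "HELLO" => ["HH", "AH0", "L", "OW1"],
    "WORLD" => ["W", "ER1", "L", "D"],
)

# Number of phonemes per word, keyed by word.
counts = Dict(word => length(phonemes) for (word, phonemes) in table)

# A phoneme carries stress information if it ends in 0, 1 or 2.
has_stress(p) = occursin(r"[012]$", p)

# Only the stress-carrying phonemes of each word, again keyed by word.
stressed = Dict(word => filter(has_stress, phonemes) for (word, phonemes) in table)
```

Keeping the derived information in its own dictionaries leaves `table` unchanged, so the counts can be recomputed at any time.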
Thanks Tamas, I will work with this info! Once I get a more concrete idea of what I am doing I will post a basic working example (or ideally, the solution!).
@Tamas_Papp, I’ve been trying to make more headway on this project recently. It has been really putting my programming skills to the test and I am hitting several walls.
Specifically the goal of the program at this time is to generate the second dictionary from the first dictionary:
Essentially, I am trying to deconvolve syllabic accent information (encoded as digits). My strategy was to use a regular expression in concert with the findin function to do this but I am failing. Here is my broken code:
function parse_accent_info(io::IO)
    global dictsyllabicstress = SortedDict{String,Vector{String}}()
    for line in eachline(io)
        parts = split(line, ' ')
        word = first(parts)
        syllables = parts[2:end]
        #syllabicstress = findin(parts,Regex("\d")) Attempt does not work
        dictsyllabicstress[word] = syllabicstress
    end
    dictionary
end
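A possible working variant is sketched below, assuming Julia 1.x. Since the exact layout of the second dictionary is not shown, the word => vector-of-stress-digits shape used here is an assumption. A plain `Dict` stands in for `SortedDict` (which comes from the DataStructures package) so the example has no dependencies. Three things differ from the code above: the regex is written as the literal `r"\d"` because `"\d"` is an invalid escape in an ordinary string, `match` replaces `findin` (which no longer exists in Julia 1.x), and the function returns the dictionary it actually built instead of setting a global.

```julia
# Sketch: map each word to the stress digits found in its phonemes.
function parse_accent_info(io::IO)
    stress = Dict{String,Vector{Int}}()
    for line in eachline(io)
        parts = split(line)
        isempty(parts) && continue             # skip blank lines
        word = String(first(parts))
        digits = Int[]
        for p in parts[2:end]
            m = match(r"\d", p)                # first digit in the phoneme, if any
            m === nothing || push!(digits, parse(Int, m.match))
        end
        stress[word] = digits
    end
    return stress
end

# Two of the sample lines from the start of the thread.
sample = """
!EXCLAMATION-POINT EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE K L OW1 Z K W OW1 T
"""

accents = parse_accent_info(IOBuffer(sample))
```

If sorted keys are needed, the result can be iterated with `sort(collect(keys(accents)))`, or the `Dict` can be swapped for a `SortedDict` once DataStructures is loaded.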