Parsing text file into a multidimensional array

Hello,
I’ve been working on parsing a text file into a multidimensional array.

The text file has entries like this, with each line representing a word and all of its pronunciation information (which is a subset of sphinxsmall.txt, mentioned later):

!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE  K L OW1 Z K W OW1 T
"DOUBLE-QUOTE  D AH1 B AH0 L K W OW1 T

I am able to get individual lines split using the split function, but I am currently unsuccessful at iterating over the entire collection of words.

Here is the code that I am trying to work with:

words = []
syllables = []
a = open("sphinxsmall.txt", "r")
for line in eachline(a)
    for h in eachindex(words)
        syllables = hcat(split(line), words[h]) 
        # syllables = split(line)
    end
end
close(a)

If you comment out the inner for loop and uncomment syllables = split(line), the script runs, but each iteration overwrites syllables, so only the last word survives.

Any help is greatly appreciated.

Thanks,
Nakul

Something like

push!(syllables, split(line))

Disclaimer: I’m on my phone and cannot test that it actually works.


From what I’ve read, “you can’t use push! on multidimensional arrays. You can push onto a 1d array, and then reshape. Another way is to collect all the row/column vectors and use hcat(c…) or vcat(c…) to create an array.”

As a result, I was trying to bypass push! altogether via nested for loops, which should theoretically work for multiple dimensions.

Ultimately, I want the resulting array to be m × n (m determined by the number of word rows, and n by the number of syllable-representation columns).
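Since the rows have different lengths (words have varying numbers of phonemes), a ragged Vector{Vector{String}} built with push! may be the simplest container. A minimal sketch using the sample lines from the post (an IOBuffer stands in for the real sphinxsmall.txt):

```julia
# Collect one row of tokens per line; the sample lines from the
# post stand in for the file.
data = """
!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE  K L OW1 Z K W OW1 T
"DOUBLE-QUOTE  D AH1 B AH0 L K W OW1 T
"""
rows = Vector{Vector{String}}()
for line in eachline(IOBuffer(data))
    push!(rows, String.(split(line)))  # default split: any whitespace, empties dropped
end
# rows[1][1] is the first word; rows[1][2:end] its phonemes.
# Rows have different lengths, so an m x n matrix would only work after padding.
```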

Right now only this code works, but as written, the result of split(line) is overwritten on each iteration, hence my rationale for creating an outer for loop to save each transient iteration.

with syllables = []
a = open("sphinxsmall.txt", "r")
for line in eachline(a)
    syllables = split(line)
end
close(a)

what about

append!(syllables, split(line))

in the loop body instead?


This is certainly an improvement, as the values are no longer overwritten. However, the entire result is still a flat 1×n array. For clear demarcation, I also need one dimension per word (the actual corpus is 140,000 words × ≈10 phonemes each).
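The flatness comes down to append! splicing the elements in one by one, whereas push! adds the whole split result as a single entry. A quick illustration:

```julia
# append! flattens the tokens into one long vector...
flat = String[]
append!(flat, split("AH0 L AW1 Z"))           # 4 separate entries

# ...while push! keeps each line's tokens together as one row.
nested = Vector{Vector{String}}()
push!(nested, String.(split("AH0 L AW1 Z")))  # 1 entry holding 4 tokens
```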

Unless the order of words matters to you, I would do something like

function parse_wordinfo(io::IO)
    table = Dict{String,Vector{String}}()
    for line in eachline(io)
        parts = split(line, ' ')
        word = first(parts)
        syllables = parts[2:end]
        table[word] = syllables
    end
    table
end

I did not test this, and I am not sure I fully understand the spec, so fix/modify accordingly. If you do need the order, push!(container, word => syllables) could be a solution.
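For reference, here is how the function above behaves on one of the sample lines (a quick sketch; the function is repeated so the snippet is self-contained). One caveat: with split(line, ' '), the double space after the word produces an empty first “syllable”; the default split(line), which splits on any whitespace and drops empty tokens, would avoid that.

```julia
function parse_wordinfo(io::IO)
    table = Dict{String,Vector{String}}()
    for line in eachline(io)
        parts = split(line, ' ')
        table[first(parts)] = parts[2:end]
    end
    table
end

table = parse_wordinfo(IOBuffer("\"CLOSE-QUOTE  K L OW1 Z K W OW1 T"))
# table["\"CLOSE-QUOTE"] begins with "" because of the double space;
# using split(line) instead would drop the empty token.
```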


Thank you so much Tamas! This accomplishes what I need in a way I would not have thought of (dictionaries).

Because I have never used dictionaries, I wonder if, off the top of your head (without stressing about it), you have any advice on how to iterate through the values of each key so as to count them and discern whether the value strings contain certain substrings (0, 1, or 2 in my case, as these encode phonetic accents), with the goal of appending this gleaned information as additional values under the relevant keys of the dictionary. I found this thread to help me get started, but its goal is to iterate over keys, not values.

Once again, thank you so much!

UPDATE
sv = map(length, values(table)) gets the number of values per key. The link I provided is actually proving quite useful. I’m hacking away at this problem…

I would make a small function that does the thing I want (e.g. for matching, see findall and occursin), then map it over the dictionary, e.g.

Dict(key => my_small_function(value) for (key, value) in table)

I would collect different information in another Dict though instead of appending; if you want to keep the values use a named tuple like

Dict(key => (value = value, statistics = my_small_function(value)) for (key, value) in table)

Hope this helps. If you provide specific input/output examples (because it is not clear what you want to do), I am sure you will get more help.
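To make the comprehension above concrete, here is a small sketch with a stand-in for my_small_function (hypothetical; it just counts phonemes carrying a stress digit), applied to the “ALLOWS” entry from later in the thread:

```julia
# Stand-in for my_small_function: count phonemes with a stress digit.
count_stressed(syllables) = count(s -> occursin(r"\d", s), syllables)

table = Dict("ALLOWS" => ["AH0", "L", "AW1", "Z"])
result = Dict(key => (value = value, statistics = count_stressed(value))
              for (key, value) in table)
# result["ALLOWS"] keeps the phonemes and adds the count.
```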


Thanks Tamas, I will work with this info! Once I get a more concrete idea of what I am doing I will post a basic working example (or ideally, the solution!).

@Tamas_Papp, I’ve been trying to make more headway on this project recently. It has been really putting my programming skills to the test and I am hitting several walls.

Specifically the goal of the program at this time is to generate the second dictionary from the first dictionary:

1st dictionary: ‘Phoneme/syllabic accent dictionary’ entry
key: “ALLOWS”
value: “” “AH0” “L” “AW1” “Z”

2nd dictionary: Resultant ‘Syllabic accent dictionary’ entry
key: “ALLOWS”
value: 0 1

Essentially, I am trying to extract the syllabic accent information (encoded as digits). My strategy was to use a regular expression in concert with the findin function, but I am failing. Here is my broken code:

function parse_accent_info(io::IO)
    # SortedDict requires `using DataStructures`
    dictsyllabicstress = SortedDict{String,Vector{String}}()
    for line in eachline(io)
        parts = split(line, ' ')
        word = first(parts)
        syllables = parts[2:end]
        #syllabicstress = findin(parts, Regex("\d")) # Attempt does not work
        dictsyllabicstress[word] = syllabicstress
    end
    dictsyllabicstress
end

Help is greatly appreciated!
Thanks,
Nakul
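One way the digit extraction could work in current Julia (a hedged sketch, not from the thread: findin no longer exists in Julia 1.0, but match with a regular expression does the job on the “ALLOWS” entry shown above):

```julia
# Pull the stress digits (0/1/2) out of a phoneme list, in order.
function accent_pattern(phonemes)
    digits = Int[]
    for p in phonemes
        m = match(r"\d", p)        # first digit in the phoneme, if any
        m === nothing || push!(digits, parse(Int, m.match))
    end
    digits
end

accent_pattern(["", "AH0", "L", "AW1", "Z"])  # the "ALLOWS" entry

# Applied to the whole first dictionary, something like:
# dictsyllabicstress = Dict(word => accent_pattern(syls) for (word, syls) in table)
```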