I wanted to share something I am working on right now.
(Not yet registered)
https://github.com/oxinabox/MultiResolutionIterators.jl
I’m making it as part of my reengineering of CorpusLoaders.jl.
Basically, a lot of natural language data has a lot of nested structure.
For example, Wikipedia can be broken into these levels:
doc, section, paragraph, sentence, word, character.
Depending on what you are doing, you are probably not interested in considering it at all of those levels.
So you basically want to drop some dimensions, by merging the levels you don't care about.
Here is the headline example (full context for this is in the readme).
julia> animal_info = [
           [["Turtles", "are", "reptiles", "."],
            ["They", "have", "shells", "."],
            ["They", "live", "in", "the", "water"]],
           [["Cats", "are", "mammals", "."],
            ["They", "live", "on", "the", "internet"]]
       ]
2-element Array{Array{Array{String,1},1},1}:
Array{String,1}[String["Turtles", "are", "reptiles", "."], String["They", "have", "shells", "."], String["They", "live", "in", "the", "water"]]
Array{String,1}[String["Cats", "are", "mammals", "."], String["They", "live", "on", "the", "internet"]]
#...
# some code here
#...
julia> # Merge everything **except** words
merge_levels(animal_info, (!lvls)(indexer, :words)) |> full_collect
22-element Array{String,1}:
"Turtles"
"are"
"reptiles"
"."
"They"
"have"
"shells"
⋮
"."
"They"
"live"
"on"
"the"
"internet"
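For anyone who wants to see what "merging levels" means without the package, here is a rough sketch using only `Iterators.flatten` from Base. This is not how MultiResolutionIterators.jl is implemented, and it doesn't use named levels; it just assumes the same doc > sentence > word nesting as the example and flattens by position.

```julia
# Same nesting as the headline example: documents > sentences > words.
animal_info = [
    [["Turtles", "are", "reptiles", "."],
     ["They", "have", "shells", "."],
     ["They", "live", "in", "the", "water"]],
    [["Cats", "are", "mammals", "."],
     ["They", "live", "on", "the", "internet"]],
]

# Merge the document and sentence levels, so only a flat stream of words is left.
# Each call to Iterators.flatten removes one level of nesting.
words = collect(Iterators.flatten(Iterators.flatten(animal_info)))

# Merge only the document level, keeping each sentence as its own vector of words.
sentences = collect(Iterators.flatten(animal_info))

println(length(words))      # number of words across all documents
println(length(sentences))  # number of sentences across all documents
```

With plain `flatten` you can only peel levels off from the outside in, and you have to count how many calls you need. Being able to name the levels to keep or merge is the part the package is meant to add.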
I’d love to hear if this is useful in any other domains.
And any other thoughts.
(If anyone cares to do a code review and post an issue, that would be really awesome and I’ll owe you one. I’m good for it.)