MultiResolutionIterators.jl: Tools for working with data that has macro-scale hierarchical structure, where you don't always care about some of the levels


#1

I wanted to share something I am working on right now.
(Not yet registered)

I’m making it as part of my re-engineering of CorpusLoaders.jl.
Basically, a lot of natural language data has a lot of structure.

For example, Wikipedia can be broken into
doc, section, paragraph, sentence, word, character.

Depending on what you are doing, you are probably not interested in considering it at all of those levels.
So you basically want to drop some dimensions.

Here is the headline example (full context for this is in the readme).

julia> animal_info = [
           [["Turtles", "are", "reptiles", "."],
            ["They", "have", "shells", "."],
            ["They", "live", "in", "the", "water"]],
           [["Cats", "are", "mammals", "."],
            ["They", "live", "on", "the", "internet"]]
           ]
2-element Array{Array{Array{String,1},1},1}:
 Array{String,1}[String["Turtles", "are", "reptiles", "."], String["They", "have", "shells", "."], String["They", "live", "in", "the", "water"]]
 Array{String,1}[String["Cats", "are", "mammals", "."], String["They", "live", "on", "the", "internet"]]

#...
# some code here
#...

julia> # Merge everything **except** words
       merge_levels(animal_info, (!lvls)(indexer, :words)) |> full_collect
22-element Array{String,1}:
 "Turtles"
 "are"
 "reptiles"
 "."
 "They"
 "have"
 "shells"
 ⋮
 "."
 "They"
 "live"
 "on"
 "the"
 "internet"
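
To make "dropping a level" concrete without the package, here is a minimal sketch in plain Base Julia. It does not use the package's API at all; it only uses `Iterators.flatten`, and the variable names `words`, `sentences` and `doc_words` are just illustrative. It assumes the same three-level nesting (documents, sentences, words) as `animal_info` above.

```julia
# Plain Base Julia, no package required.
animal_info = [
    [["Turtles", "are", "reptiles", "."],
     ["They", "have", "shells", "."],
     ["They", "live", "in", "the", "water"]],
    [["Cats", "are", "mammals", "."],
     ["They", "live", "on", "the", "internet"]]
]

# Merge the document and sentence levels, keep words:
# one flat vector of words.
words = collect(Iterators.flatten(Iterators.flatten(animal_info)))

# Merge only the document level, keep sentences and words:
# one flat vector of sentences.
sentences = collect(Iterators.flatten(animal_info))

# Merge only the sentence level, keep documents and words:
# each document becomes a single vector of words.
doc_words = [collect(Iterators.flatten(doc)) for doc in animal_info]
```

Hand-writing the right number of `flatten` calls like this gets fiddly once there are more levels, which is the part that naming the levels is meant to take care of.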

I’d love to hear if this is useful in any other domains.
Any other thoughts are welcome too.
(If anyone cares to do a code review and post an issue, that would be really awesome and I’ll owe you one. I’m good for it :wink: )


#2

Hi,

I have created a small library around Flux for nested multiple-instance learning, which closely matches what you describe. Its whole purpose is to classify an entire document while reflecting its structure, without flattening the sample.

You can find it here


but there are no examples at the moment. They might come, if I am not super lazy or if someone is interested in the problem. It implements these two papers:
https://arxiv.org/abs/1609.07257
https://arxiv.org/abs/1703.02868
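
As a rough illustration of the nested idea only (plain Julia, not the library's API): each leaf is embedded, each bag is summarised by pooling its children, and this repeats up the hierarchy. The toy `embed_word` function and the choice of mean pooling are assumptions made for this sketch; in a real model the embedding and the per-level transformations would be learned networks.

```julia
using Statistics: mean

# Hypothetical toy embedding: a word becomes [length, number of uppercase letters].
embed_word(w::AbstractString) = Float64[length(w), count(isuppercase, w)]

# A leaf (word) is embedded directly;
# a bag (vector) is the mean of its pooled children, applied recursively.
pool(x::AbstractString) = embed_word(x)
pool(bag::AbstractVector) = mean(pool.(bag))

# One document: a bag of sentences, each a bag of words.
doc = [["Cats", "are", "mammals", "."],
       ["They", "live", "on", "the", "internet"]]

# Fixed-size vector for the whole document, computed sentence by sentence
# rather than from one flat list of words.
doc_vector = pool(doc)
```

Because pooling happens per sentence first, a short sentence contributes as much to `doc_vector` as a long one, which a flat average over all words would not do.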

Tomas