MultiResolutionIterators.jl: Tools for working with data that has macro-scale hierarchical structure, where you don't always care about some of the levels

oxinabox · April 13, 2018, 5:50am

I wanted to share something I am working on right now.
(Not yet registered.)

https://github.com/oxinabox/MultiResolutionIterators.jl

I’m building it as part of my re-engineering of CorpusLoaders.jl.
Basically, a lot of natural language data has a lot of structure.

For example Wikipedia can be broken into
doc, section, paragraph, sentence, word, character.

Depending on what you are doing, you are probably not interested in considering it at all of those levels.
So you basically want to drop some dimensions.

Here is the headline example (full context for this is in the README).

julia> animal_info = [
           [["Turtles", "are", "reptiles", "."],
            ["They", "have", "shells", "."],
            ["They", "live", "in", "the", "water"]],
           [["Cats", "are", "mammals", "."],
            ["They", "live", "on", "the", "internet"]]
           ]
2-element Array{Array{Array{String,1},1},1}:
 Array{String,1}[String["Turtles", "are", "reptiles", "."], String["They", "have", "shells", "."], String["They", "live", "in", "the", "water"]]
 Array{String,1}[String["Cats", "are", "mammals", "."], String["They", "live", "on", "the", "internet"]]

#...
# some code here
#...

julia> # Merge everything **except** words
       merge_levels(animal_info, (!lvls)(indexer, :words)) |> full_collect
22-element Array{String,1}:
 "Turtles"
 "are"
 "reptiles"
 "."
 "They"
 "have"
 "shells"
 ⋮
 "."
 "They"
 "live"
 "on"
 "the"
 "internet"

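For comparison, here is roughly what that result corresponds to when written with Base alone. This is only a sketch using `Iterators.flatten`; it is not how the package is implemented, and the names `all_words` and `words_per_doc` are made up for illustration.

```julia
# Same nested data as in the example above: documents > sentences > words.
animal_info = [
    [["Turtles", "are", "reptiles", "."],
     ["They", "have", "shells", "."],
     ["They", "live", "in", "the", "water"]],
    [["Cats", "are", "mammals", "."],
     ["They", "live", "on", "the", "internet"]]
]

# Merge the document and sentence levels, keeping only words.
# Two flattens are needed because there are two levels above the words;
# a third flatten would break the strings into characters.
all_words = collect(Iterators.flatten(Iterators.flatten(animal_info)))

# Merge only the sentence level, keeping the documents separate.
words_per_doc = [collect(Iterators.flatten(doc)) for doc in animal_info]
```

Here `all_words` holds the same 22 words as the output above, and `words_per_doc` holds 13 and 9 words for the two documents. The hand-written version has to hard-code how many levels to flatten, which is presumably the kind of bookkeeping that naming the levels helps avoid.
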
I’d love to hear if this is useful in any other domains.
Any other thoughts are welcome too.
(If anyone cares to do a code review and post an issue, that would be really awesome and I’ll owe you one. I’m good for it.)

Tomas_Pevny · April 13, 2018, 10:22am

Hi,

I have created a small library around Flux for nested multiple-instance learning, which closely resembles what you describe. The purpose of the library is to classify a whole document while respecting its structure, rather than flattening the sample.

You can find it here

but there are no examples at the moment. They might come, if I am not super lazy or if someone is interested in the problem. It implements these two papers:

Tomas
