Help subdividing Generators

danilo-bc · December 17, 2020, 3:25am

I got inspired by this post on how to generate words from a pool of characters (Generate all words of a given size from a fixed alphabet - #4 by bennedich) and wanted to try it for myself for fun and to learn Generators, beacuse I know they’re memory efficient.

I saw this mini “find the secret” competition among friends where they give a black box function which returns 0 if you guess the right word, but 1 if you guess it wrong.

I want to use my word generator to find out the secret word, but I have two problems:
1 - If I use this simple Generator expression, I’m stuck to single-core work, which will take literal decades to find slightly long words
2 - I learned a bit about async jobs and set up an environment where it takes the Generator, partitions into “n” parts, and then does that linear search, but in multiple tasks.

itr = Iterators.product(ntuple(_ -> charPool, wordSize)...)
wordGenerator = Base.Generator(join, itr)
wordGeneratorPartitions = IterTools.partition(passGenerator, n)

With this approach, the program quickly starts to ramp up memory (which is the opposite of what I want with Generators!)

Could someone point out what am I doing wrong?

Instead of using a function like IterTools.partition, is there way to get itr to a certain state? For example, even if I don’t know the number of elements the generator will make, I’d like to be able to separate in batches of 100 elements to process per task, etc.

Thank you so much in advance!

jules · December 17, 2020, 9:31pm

You’re not partitioning the whole generator in n parts, you’re slicing it into parts of length n. I don’t know what you’re doing with that one afterwards, but it might not be what you intended?

In general that’s probably not going to be a very fast solution, because you’re not making use of the fact that all your strings are of the same known length. So if your space is large you’ll spend a long time creating strings.

Also I don’t think generators lend themselves too well to being split up into pieces that threads can operate on separately. But in your case, also the search space is known beforehand, and it’s an n-dimensional array of words where each dimension has the size of the character pool. So I’d try to use a package that gives statically sized strings, then create a lazy array of these strings with some package that offers something like that and then map threads over that array.

danilo-bc · December 17, 2020, 10:02pm

Thanks, jules.

The idea of creating staticall sized, lazily evaluated is pretty good and was my first idea, but with such a big space (each character adding roughly a factor of 60 to the equation), I simply won’t have enough memory to iterate through it.

Your input gave me an idea though.

My current generator can produce strings “aaa” to “zzz”. If I instead I create two generators from “aaa” to “jjj” then from “kkk” to “zzz”, I’ll have two partitions of my whole space and can use those two Generators on two different tasks (or threads). Another option is to change the range of the very first character. If going from ‘a’ to ‘z’ has 26 possibilities, it means a 3-character word will have 26*26*26 possibilities of characters.

I can change the very first character to vary from ‘a’ to ‘m’ and then from ‘n’ to ‘z’ instead, giving me two generators with 2*(13)2626 possibilities, which ends up in the same “word pool” I’m interested.

I like those two solutions I mentioned and I’ll try to implement them, but I still have that question regarding generators.

If I have a numerical iterator like 1:1000, it’s obvious that I can separate it in two with 1:500 then 501:1000, where this middle number can be understood as FinalValue/2 (with floor or ceiling functions applied if necessary to round up).

Question: is there a way to obtain these types of landmarks from generic Generators, like my word generator? If I’m able to get a ‘FinalValue’ and a ‘MiddleValue’, I can possibly split that Generator in half and keep going until I’m satisfied.

jules · December 17, 2020, 10:15pm

Why wouldn’t you have memory for a huge lazy array? The point of the lazy array is that it’s not backed by memory, it’s just a fancy indexing mechanism (you index into it and that generates a word). I would just think that the known dimensions of an array are what generators are missing to split them nicely across threads

danilo-bc · December 17, 2020, 10:46pm

Then I think I “missed the memo” on lazy arrays.

Do you have any page/tutorial on the concept? I’d love to read more.

jules · December 17, 2020, 10:52pm

I was intrigued so I played around with this a bit. I’m hardcoding one case here because it was faster to set up.

Basically the idea was to create an “array” from CartesianIndices. Then use mappedarray to make one word out of each index. Then in that array I can just search for the word-tuple and everything is inferred. I mean this particular problem you can probably also just solve with loops, but I thought it was kind of cool to set it up this way, and I learned a couple new things as well

using MappedArrays
using BenchmarkTools


get_word(idx, chars) = map(i -> chars[i], Tuple(idx)) # word is just a tuple of chars

function wordsearch()
    chars = ('a', 'b', 'c', 'd', 'e', 'f')
    wordlength = 6
    indices = CartesianIndices(ntuple(x -> length(chars), wordlength))
    lazy_arr = mappedarray(x -> get_word(x, chars), indices)

    word = ('f', 'b', 'c', 'd', 'a', 'd')
    findfirst(==(word), lazy_arr)
end

@btime wordsearch()

 89.885 μs (0 allocations: 0 bytes)
CartesianIndex(6, 2, 3, 4, 1, 4)

danilo-bc · December 17, 2020, 11:00pm

Interesting! I’m glad you learned a new thing as well.

So, in this case, I could iterate through the dimensions of lazy_arr and find any number of words I want, right?

In that sense, lazy_arr can also be split in 2 or more parts so I can make use of multithreaded processing, is that correct?
p.s.: by split I mean using simple index logic in for loops to separate parts, like how we use “for i in 1:half” then “half:endIndex” or whatever

I’m very curious about what you just posted ^^ so thanks a lot!

jules · December 17, 2020, 11:08pm

yeah exactly, the concept of an array isn’t tied to contiguously populated memory, it just means that a couple of array interface operations are defined. CartesianIndices is also a “lazy” array-like object in that way, and the mappedarray just piggybacks off of it.

danilo-bc · December 17, 2020, 11:20pm

I’ll take a slower look on the subject of lazy arrays/lazy evaluations to understand things better, but can you name the pros and cons of using such things?

As all things with memory/cpu tradeoff, I imagine lazy stuff may exchange memory use for use of processing power, but I don’t want to jump into conclusions.

jules · December 18, 2020, 6:05am

You have to think about how often you visit each element, for example. In this case just once, so you don’t waste CPU time

Topic		Replies	Views
Creating Generators General Usage generator	32	14767	July 6, 2022
How to improve a Generator to be more memory efficient when it is collected? Performance question , regex , optimization , iterators	21	941	January 23, 2023
Generator for formal language (a\|b)* General Usage	1	549	March 21, 2018
Hey, does anybody know why a generator does allocate memory while an iterator does General Usage	1	437	March 19, 2021
Memory allocation in generator with zip General Usage	2	649	February 5, 2020

Help subdividing Generators

Related topics