Weird behaviour when flattening generator expression with vcat


#1

Hi

I am using Julia 0.6.2 and have written the following function. (Data generation added as requested.)

# Generate testdata
function fillpath(path, nbfiles)
    cd(path)
    mkdir("juliatest")
    cd("juliatest")
    
    for i in ["0","1"]
        mkdir(i)
        cd(i)
        for j in 1:nbfiles
            touch("$j.dat")
        end
        cd("..")
    end
end

fillpath("/tmp", 10000)

"""
files, subdirs = subdirlabeledfiles(path)

Return a Vector files containing the filenames in all subdirectories and a Vector
subdirs containing the name of the subdirectory for each file that can be used as a label.

"""
function subdirlabeledfiles(path)
    # Get all subdirs of path
    subdirs = filter(x -> isdir(joinpath(path,x)), readdir(path))
    # Get files in all subdirs
    subdirspaths = joinpath.(path, subdirs) # Get absolute paths of subdirs
    files = [filter(isfile, joinpath.(subdir, readdir(subdir))) for subdir in subdirspaths]
    # Get a list naming the subdir for each file
    subdirs = (fill(subdir, length(files[i])) for (i, subdir) in enumerate(subdirs)) # does not work
    # subdirs = [fill(subdir, length(files[i])) for (i, subdir) in enumerate(subdirs)] # does work
    # Flatten the results
    files = vcat(files...)
    subdirs = vcat(subdirs...)
        
    return files, subdirs
end

X, Y = subdirlabeledfiles("/tmp/juliatest")
length(Y)

The idea: I give a path and the function returns a vector containing all files within all subdirectories and an additional vector indicating in which subdirectory the given file was. It is intended to be used to load datasets where e.g. images for different labels are in different subdirectories. Probably not the most elegant way of doing this, but anyway.

Now to the problem: when I use a generator expression in the line marked with # does not work, the vector I get for subdirs is much too short: 51 entries instead of the expected 20000. When I change it to build an Array with [ ], it works correctly and I get all 20000 entries.

The vector from the ( ) version contains the correct entries (first “0”, then “1”); there are just not enough of them.

Is there something I am missing? Is this intended behaviour? Is this a bug?


#2

Please provide a minimal working example one can run (with possibly simulated data).


#3

Thanks. I have extended the code with a part to generate “testdata”


#4

Can you try to simplify it even more, by removing all filesystem operations? Finding the minimal code which reproduces the problem is always useful.

I wonder whether the fact that you reassign new values to variables which are used by generators could trigger the behavior you observe (either because that’s documented, or because of a bug).


#5

Yes, this looks suspiciously like a bug. If I comment out the

files = vcat(files...)

it works. Looks like collecting the generator does something to its value.

Isolating an MWE (without the filesystem stuff, which should be irrelevant) would be very useful. I can reproduce the bug on both v0.6.2 and current master v0.7-.


#6

The generator depends on files, so the solution is probably to swap these two lines:

    files = vcat(files...)
    subdirs = vcat(subdirs...)

or make files “local”:

subdirs = let files=files; (fill(subdir, length(copy(files)[i])) for (i, subdir) in enumerate(subdirs)) end
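
Both suggestions can be tried without touching the filesystem. The sketch below is only an illustration: labels_swapped, labels_let, groups and names are made-up names, and the nested vector stands in for the per-subdirectory file lists.

```julia
# Hypothetical stand-in data: one inner vector per "subdirectory".
groups = [["a.dat", "b.dat", "c.dat"], ["d.dat"]]
names = ["0", "1"]

# Variant 1: consume the generator before `groups` is rebound.
function labels_swapped(groups, names)
    labels = (fill(name, length(groups[i])) for (i, name) in enumerate(names))
    labels = vcat(labels...)   # the generator runs here, `groups` is still nested
    groups = vcat(groups...)   # rebinding afterwards is harmless
    return groups, labels
end

# Variant 2: `let` creates a fresh binding that is never reassigned,
# so the generator keeps seeing the nested vector.
function labels_let(groups, names)
    labels = let groups = groups
        (fill(name, length(groups[i])) for (i, name) in enumerate(names))
    end
    groups = vcat(groups...)
    labels = vcat(labels...)
    return groups, labels
end

labels_swapped(groups, names)[2]   # ["0", "0", "0", "1"]
labels_let(groups, names)[2]       # ["0", "0", "0", "1"]
```

In both variants the label vector has one entry per file, which is what the [ ] version of the original function produces.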

#7

@nalimilan & @Liso: Thanks, that’s it. :grin: For some reason I intuitively would have assumed that a generator is producing some form of closure.

So I guess it is not and this is expected behavior?


#8

I have no idea. If you can write a very small example to reproduce the problem, it can be worth filing an issue just to make sure that’s expected (unless somebody who knows comments here).


#9

My implicit expectation was that the generator form is functional. I could not find it explicitly documented, but how could it be otherwise?


#10

A very simple example showing this “problem”:

a = [1, 2, 3]
g = (a[i] for i in 1:3)
a = [4, 5, 6]
collect(g)
# Gives 4, 5, 6
# Intuitively I would have expected it to give 1, 2, 3

But this is so fundamental that I would be very surprised if it wasn’t supposed to be like that.


#11

Thanks for the MWE. I have been thinking of generators as “lazy maps”, so I expected semantics like map, somehow capturing the variables. But apparently they don’t.

I think an issue should be opened, to at least clarify this.


#12

Nope, this works exactly like map:

a = [1, 2, 3]
g = i -> a[i]
a = [4, 5, 6]
map(g, 1:3)
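
The same comparison can be sketched inside a function, where the rebinding happens in local scope. This is only an illustration: demo is a made-up name, and Base.Generator is used on the assumption that a generator expression behaves like a Generator wrapping an anonymous function.

```julia
function demo()
    a = [1, 2, 3]
    g = (a[i] for i in 1:3)                 # generator expression
    h = Base.Generator(i -> a[i], 1:3)      # explicit closure wrapped in a Generator
    frozen = let a = a                      # fresh binding that is never reassigned
        (a[i] for i in 1:3)
    end
    a = [4, 5, 6]                           # rebind the outer `a` before collecting
    return collect(g), collect(h), collect(frozen)
end

demo()   # ([4, 5, 6], [4, 5, 6], [1, 2, 3])
```

The first two results follow the variable a to its new value, while the let version keeps the array that a referred to when the generator was created.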

#13

Ok, then this is not an MWE for the problem above, but I am still under the impression that something fishy is going on with that. Unfortunately, I have no time to dissect it now.
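
A filesystem-free reduction along the following lines might help settle that. It is a sketch under assumptions: labels_rebound and the sample data are made up, and it only mirrors the order of operations in subdirlabeledfiles (create the generator, flatten the nested vector, then splat the generator).

```julia
function labels_rebound(groups, names)
    labels = (fill(name, length(groups[i])) for (i, name) in enumerate(names))
    groups = vcat(groups...)   # `groups` is now a flat Vector{String}
    # When the generator finally runs, groups[i] is a single String,
    # so length(groups[i]) counts characters instead of files.
    return vcat(labels...)
end

nested = [["alpha", "beta", "gamma"], ["delta"]]
result = labels_rebound(nested, ["0", "1"])
length(result)   # 9 == length("alpha") + length("beta"), not 3 + 1 == 4
```

If the same thing happens in the original function, the 51 entries would be the combined character count of the first two file paths. That is an inference from this sketch, not something checked against the original directory listing.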