To ~~not~~ shell out

Related to this topic, I’d like to read the modification time from the metadata of images. Right now I do this by “shelling out” to exiftool. While this works, it’s painfully slow (for instance when I need to process hundreds of images). So I was wondering which of these two alternatives would be easier for me to learn, and whether you have some pointers on how to get started on either:

  1. figure out how to not “shell out” and use exiftool libraries directly?
  2. rely on existing mechanisms in ImageMagick.jl and somehow write a function that extracts that specific field (modification time)?

I have to admit that both are beyond me, but I would love to be able to rapidly access that metadata. Any help is appreciated!

Other ideas:
If you are processing hundreds of thousands of images and are currently calling run serially, switch to tasks (e.g. asyncmap) and run them concurrently. Tasks are not normally truly parallel, but they should effectively be for run, I think, since each one mostly waits on an external process. If not, @async + spawn + wait certainly will be.
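For instance (a sketch, assuming the file names are in a vector called files and the exiftool call you already use): each task blocks on its own external exiftool process, so the waits overlap even though everything shares one Julia thread.

# concurrent rather than truly parallel, but almost all the time is spent waiting on exiftool
modify_dates = asyncmap(files) do f
    readstring(`exiftool -T -ModifyDate -n $f`)
end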

How slow are you talking here? Is it comparably slow when the shell-out is timed in a loop doing nothing else?

I use pmap for this kind of stuff. You get parallelization automatically.

If the problem is the fixed cost of the shell startup (don’t know, did not benchmark), then you could use a single shell run to extract data from multiple images using exiftool, which can write to CSV or text files in custom formats.
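A sketch of that idea, assuming the file names are collected in a vector called files: exiftool accepts the whole list in one invocation, and -T prints one tab-separated line per file with the requested tags.

# one exiftool call for the entire list; -T gives one tab-separated line per file
out = readstring(`exiftool -T -FileName -ModifyDate -n $files`)
rows = split.(split(out, '\n', keep=false), '\t')
modify_dates = Dict(r[1] => r[2] for r in rows)   # FileName => ModifyDate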

The mean time it takes to read the modification time of one image is ~0.09 seconds, so 100 images take a total of ~9 seconds. I need to process thousands of images…

asyncmap(fun, fill(file, 100))

Worked nicely! 3.6 seconds (instead of 9 seconds)!

@time begin
    a = Vector{String}(100)
    # @async returns immediately, so @time only measures the scheduling,
    # not the exiftool calls themselves
    @async foreach(1:100) do i
        @spawn a[i] = fun(file)
    end
end

Reports 0.000108 seconds but also takes about 3 seconds. Cool though.

If you mean this:

fun(file::String) = readstring(`exiftool -T -ModifyDate -n $file`)
file = "img.TIF"
@time a = pmap(1:100) do _
    fun(file) 
end

It didn’t help, it still takes the same amount of time.

Wow, ok, so now it takes just ~0.3 seconds for 100 images. To summarize up to now:

  1. @oxinabox’s asyncmap improves the time by a factor of 3.
  2. @Tamas_Papp’s passing exiftool the whole list of images improves it by a factor of 30.

I might need to change the title of this topic :slight_smile:

So now my question is: had I been smart enough to BinDeps the libraries of exiftool, figure out their API, and use them from within my package, how much better would it really be? Is it even worth it? I tried @Tamas_Papp’s trick with 1000 images and it only took ~2 seconds, so it seems like most of that time is overhead…

The function spawn and the macro @spawn are different and are basically unrelated.
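Roughly, as a minimal sketch (Julia 0.6 names):

# spawn, the function: start an external command without blocking; returns a Process
p = spawn(`exiftool -T -ModifyDate -n img.TIF`)
wait(p)

# @spawn, the macro: run a Julia expression on some worker (added with addprocs);
# returns a Future
f = @spawn 1 + 1
fetch(f)   # 2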

I mean the function spawn

Oh, I see.
I’m not sure, however, how I can replace readstring with spawn in

modify_date = readstring(`exiftool -T -ModifyDate -n $file`)

?

First addprocs(10), for example, and define fun with @everywhere.

Since passing multiple files to exiftool is much quicker this is a bit moot, but for completeness:
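Something along these lines (a sketch using Pipe, pipeline and spawn; fun_spawned is just a placeholder name): spawn returns immediately with a Process, and reading the pipe after wait collects whatever exiftool wrote.

function fun_spawned(file::String)
    out = Pipe()
    # redirect the command's stdout into the pipe; spawn does not block
    p = spawn(pipeline(`exiftool -T -ModifyDate -n $file`, stdout=out))
    close(out.in)    # close our copy of the write end so reading sees EOF when exiftool exits
    wait(p)
    readstring(out)
end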

But how can I retrieve the output from the spawn function, the actual modification time that the exiftool spits out?

Okay, here’s the code. It turns out that addprocs by itself is not that fast. A hybrid strategy is best: it reads the file sizes of 10000 files in about 4 seconds.


julia> addprocs(10)

julia> folder = "/some/path/to/10000/imgs"

julia> files = joinpath.([folder], readdir(folder))

julia> @everywhere fun(f) = readstring(`exiftool -T -FileSize -n $f`)

julia> @time pmap(fun, files[1:1])
  0.096884 seconds (207 allocations: 48.266 KiB)

julia> @time pmap(fun, files[1:10])
  0.206115 seconds (896 allocations: 151.203 KiB)

julia> @time pmap(fun, files[1:100])
  1.490343 seconds (7.95 k allocations: 390.016 KiB)

julia> @time pmap(fun, files[1:1000])
 13.658115 seconds (78.64 k allocations: 2.720 MiB)

julia> @time pmap(fun, files[1:10000])
136.960294 seconds (785.42 k allocations: 26.102 MiB, 0.01% gc time)

julia> @time readstring(`exiftool -T -FileSize -n $(files[1:1000])`);
  2.327490 seconds (108 allocations: 250.953 KiB)

julia> @time readstring(`exiftool -T -FileSize -n $(files[1:10000])`);
 21.924679 seconds (188 allocations: 1.112 MiB)

julia> idx = view.([reshape(files[1:10000], 1000, 10)], :, 1:10)

julia> @time t = pmap(fun, idx);
  3.975504 seconds (1.12 k allocations: 1.165 MiB)

julia> vcat(split.(t)...)
10000-element Array{SubString{String},1}:
 "33889"
 "122117"
 "26831"
 "124722"
 "98627"
 "211157"
 ⋮
 "133214"
 "97177"
 "92494"
 "176028"
 "31780"
 "158285"

Very cool @innerlee!

But all this pmapping and passing of multiple files is really just sidestepping what I assume is the ultimate way: using the libraries directly…

I would still take the lazy way out: e.g. when each image is created, write the output of exiftool into a sidecar file, either in JLD2 format or CSV or anything you can parse quickly. This assumes that you use the images more often than you create them.
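Something like this sketch, where the .meta.csv naming is just a placeholder: the exiftool cost is paid once, right after the image is created, and every later read is just a small text file.

# pay the exiftool cost once, when the image is created
function write_sidecar(img)
    open(img * ".meta.csv", "w") do io
        write(io, readstring(`exiftool -csv -ModifyDate -FileSize -n $img`))
    end
end

# later reads never touch exiftool
read_sidecar(img) = readstring(img * ".meta.csv")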

Yeah, I was talking to @oxinabox about using DataDeps as a means to create halfway files (like your sidecar files), eliminating the need to reprocess them every time.