To ~~not~~ shell out

Related to this topic, I’d like to read the modification time from the metadata of images. Right now I do this by “shelling out” to exiftool. While this works, it’s painfully slow (for instance when I need to process hundreds of images). So I was wondering which of these two alternatives would be easier for me to learn, and whether you have some pointers on how to get started on either:

  1. figure out how to not “shell out” and use exiftool libraries directly?
  2. rely on existing mechanisms in ImageMagick.jl and somehow write a function that extracts that specific field (modification time)?

I have to admit that both are beyond me, but I would love to be able to rapidly access that metadata. Any help is appreciated!

Other ideas:
If you are processing hundreds of thousands of images and are currently calling run serially, switch to tasks (e.g. asyncmap) and run them concurrently. Tasks are not normally truly parallel, but they should effectively be for run, I think, since each one mostly waits on an external process. If not, @async + spawn + wait certainly will be.
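For instance (a sketch, assuming the file names are in a vector called files and the exiftool call you already use): each task blocks on its own external exiftool process, so the waits overlap even though everything shares one Julia thread.

# concurrent rather than truly parallel, but almost all the time is spent waiting on exiftool
modify_dates = asyncmap(files) do f
    readstring(`exiftool -T -ModifyDate -n $f`)
end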

How slow are you talking here? Is it comparably slow when the shell-out is timed in a loop doing nothing else?

I use pmap for this kind of stuff. You get parallelization automatically.

If the problem is the fixed cost of the shell startup (don’t know, did not benchmark), then you could use a single shell run to extract data from multiple images using exiftool, which can write to CSV or text files in custom formats.
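A sketch of that idea, assuming the file names are collected in a vector called files: exiftool accepts the whole list in one invocation, and -T prints one tab-separated line per file with the requested tags.

# one exiftool call for the entire list; -T gives one tab-separated line per file
out = readstring(`exiftool -T -FileName -ModifyDate -n $files`)
rows = split.(split(out, '\n', keep=false), '\t')
modify_dates = Dict(r[1] => r[2] for r in rows)   # FileName => ModifyDate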

The mean time it takes to read the modification time of one image is ~0.09 seconds, so 100 images take a total of ~9 seconds. I need to process thousands of images…

asyncmap(fun, fill(file, 100))

Worked nicely! 3.6 seconds (instead of 9 seconds)!

@time begin
    a = Vector{String}(100)
    # @async returns immediately, so @time only measures the scheduling,
    # not the exiftool calls themselves
    @async foreach(1:100) do i
        @spawn a[i] = fun(file)
    end
end

Reports 0.000108 seconds but also takes about 3 seconds. Cool though.

If you mean this:

fun(file::String) = readstring(`exiftool -T -ModifyDate -n $file`)
file = "img.TIF"
@time a = pmap(1:100) do _
    fun(file) 
end

It didn’t help, it still takes the same amount of time.

Wow, ok, so now it takes just ~0.3 seconds for 100 images. To summarize up to now:

  1. @oxinabox’s asyncmap improves the time by a factor of 3.
  2. @Tamas_Papp’s passing exiftool the whole list of images improves it by a factor of 30.

I might need to change the title of this topic :slight_smile:

So now my question is: had I been smart enough to BinDeps the libraries of exiftool, figure out their API, and use them from within my package, how much better would it really be? Is it even worth it? I tried @Tamas_Papp’s trick with 1000 images and it only took ~2 seconds, so it seems like most of that time is overhead…

The function spawn and the macro @spawn are different and are basically unrelated.
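Roughly, as a minimal sketch (Julia 0.6 names):

# spawn, the function: start an external command without blocking; returns a Process
p = spawn(`exiftool -T -ModifyDate -n img.TIF`)
wait(p)

# @spawn, the macro: run a Julia expression on some worker (added with addprocs);
# returns a Future
f = @spawn 1 + 1
fetch(f)   # 2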

I mean the function spawn

Oh, I see.
I’m not sure, however, how I can replace readstring with spawn in

modify_date = readstring(`exiftool -T -ModifyDate -n $file`)

?

First addprocs(10), for example, and define fun with @everywhere.

Since passing multiple files to exiftool is much quicker this is a bit moot, but for completeness:
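Something along these lines (a sketch using Pipe, pipeline and spawn; fun_spawned is just a placeholder name): spawn returns immediately with a Process, and reading the pipe after wait collects whatever exiftool wrote.

function fun_spawned(file::String)
    out = Pipe()
    # redirect the command's stdout into the pipe; spawn does not block
    p = spawn(pipeline(`exiftool -T -ModifyDate -n $file`, stdout=out))
    close(out.in)    # close our copy of the write end so reading sees EOF when exiftool exits
    wait(p)
    readstring(out)
end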

But how can I retrieve the output from the spawn function, the actual modification time that the exiftool spits out?

Okay, here’s the code. It turns out that addprocs by itself is not that fast. A hybrid strategy is best: it reads the file sizes of 10000 files in about 4 seconds.


julia> addprocs(10)

julia> folder = "/some/path/to/10000/imgs"

julia> files = joinpath.([folder], readdir(folder))

julia> @everywhere fun(f) = readstring(`exiftool -T -FileSize -n $f`)

julia> @time pmap(fun, files[1:1])
  0.096884 seconds (207 allocations: 48.266 KiB)

julia> @time pmap(fun, files[1:10])
  0.206115 seconds (896 allocations: 151.203 KiB)

julia> @time pmap(fun, files[1:100])
  1.490343 seconds (7.95 k allocations: 390.016 KiB)

julia> @time pmap(fun, files[1:1000])
 13.658115 seconds (78.64 k allocations: 2.720 MiB)

julia> @time pmap(fun, files[1:10000])
136.960294 seconds (785.42 k allocations: 26.102 MiB, 0.01% gc time)

julia> @time readstring(`exiftool -T -FileSize -n $(files[1:1000])`);
  2.327490 seconds (108 allocations: 250.953 KiB)

julia> @time readstring(`exiftool -T -FileSize -n $(files[1:10000])`);
 21.924679 seconds (188 allocations: 1.112 MiB)

julia> idx = view.([reshape(files[1:10000], 1000, 10)], :, 1:10)

julia> @time t = pmap(fun, idx);
  3.975504 seconds (1.12 k allocations: 1.165 MiB)

julia> vcat(split.(t)...)
10000-element Array{SubString{String},1}:
 "33889"
 "122117"
 "26831"
 "124722"
 "98627"
 "211157"
 ⋮
 "133214"
 "97177"
 "92494"
 "176028"
 "31780"
 "158285"

Very cool @innerlee!

But all this pmapping and passing of multiple files is really just sidestepping what I assume is the ultimate way: using the libraries directly…

I would still take the lazy way out: e.g. when each image is created, write the output of exiftool into a sidecar file, either in JLD2 format or CSV or anything you can parse quickly. This assumes that you use the images more often than you create them.
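Something like this sketch, where the .meta.csv naming is just a placeholder: the exiftool cost is paid once, right after the image is created, and every later read is just a small text file.

# pay the exiftool cost once, when the image is created
function write_sidecar(img)
    open(img * ".meta.csv", "w") do io
        write(io, readstring(`exiftool -csv -ModifyDate -FileSize -n $img`))
    end
end

# later reads never touch exiftool
read_sidecar(img) = readstring(img * ".meta.csv")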

Yeah, I was talking to @oxinabox about using DataDeps as a means to create halfway files (like your sidecar files), eliminating the need to reprocess them every time.