Hello, I’m very new to Julia and as such am still familiarizing myself with its quirks. I’m working with a dataset from Kaggle containing 40,000 images that I would like to load in and process. The data can be found here: Surface Crack Detection | Kaggle
However, I’m unsure how to perform this task efficiently. The below code accomplishes two things:
1. Finds the paths to all image files. (Not very interesting, but included for reproducibility.)
using Images
using ImageIO
# Assumes the current directory contains both image folders.
base_path = pwd()
# Lists the full path of every file in each folder.
# joinpath picks the right separator, so this is not tied to Windows-style "\\" paths.
path_pos = readdir(joinpath(base_path, "Positive"); join=true)
path_neg = readdir(joinpath(base_path, "Negative"); join=true)
# Combines both lists into one vector of paths
img_paths = [path_pos; path_neg]
and 2. Loads in each image and performs some very basic operations on it.
# Loads an image from a given path and performs some basic transformations on it.
function process_image(path)
    img = load(path)
    img = Gray.(img)
    img = imresize(img, (80, 80))
    img = vec(img)
    img = convert(Array{Float64,1}, img)
    return img
end
# Processes all images
processed_imgs = process_image.(img_paths)
The below statistics were produced by a second run of the @time and @code_warntype commands.
394.428738 seconds (4.28 M allocations: 23.492 GiB, 5.85% gc time)
Variables
#self#::Core.Const(var"##dotfunction#257#7"())
x1::Vector{String}
Body::AbstractVector{var"#s831"} where var"#s831"
1 ─ %1 = Base.broadcasted(Main.process_image, x1)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(process_image), Tuple{Vector{String}}}
│ %2 = Base.materialize(%1)::AbstractVector{var"#s831"} where var"#s831"
I’m not quite sure how to interpret this, other than that there are some types somewhere in the function that Julia had trouble inferring, and that these are likely at least partially responsible for the slow run time.
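For what it’s worth, a Body type such as AbstractVector{...} where ... usually means the compiler could not infer a concrete return type for the broadcast function, most likely because the type returned by load depends on the contents of the file. Below is a minimal, Base-only sketch of that effect and of one common remedy, a return-type assertion. The names pick and pick_stable are made up for illustration and are not part of any package.

```julia
# Hypothetical stand-in for a call like `load`: the return type depends on a runtime value.
pick(flag::Bool) = flag ? 1 : 1.0

# Asserting the return type converts the result, so callers see one concrete type.
function pick_stable(flag::Bool)::Float64
    return pick(flag)
end

# Inferred return types: a Union for `pick`, a concrete Float64 for `pick_stable`.
println(Base.return_types(pick, (Bool,)))
println(Base.return_types(pick_stable, (Bool,)))
```

Applied to the code in question, this would amount to declaring process_image(path)::Vector{Float64}. That said, the instability is probably not what dominates the run time here; reading and decoding 40,000 files from disk likely is.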
The process_image function currently takes around 6 minutes to work through the 40,000 images. I’m certain the function could be completely rewritten, but I’m not sure what a “Julian” way to go about that is.
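One pattern that is often suggested for this kind of workload is to preallocate a concretely typed output matrix and fill one column per image, optionally across threads. The sketch below is Base-only and generic: process_all is a hypothetical helper name, and it assumes the per-item function returns a vector of known length and is safe to call from several threads at once (something I have not verified for load).

```julia
# Hypothetical helper: applies `f` to every item and stores each result as one column.
# `len` is the length of the vector that `f` returns (80 * 80 for the images above).
function process_all(f, items, len::Integer)
    out = Matrix{Float64}(undef, len, length(items))
    Threads.@threads for i in eachindex(items)
        # Each iteration writes to its own column, so threads do not overlap.
        out[:, i] = f(items[i])
    end
    return out
end

# Tiny stand-in for image processing: each "item" becomes a constant column.
demo = process_all(x -> fill(Float64(x), 3), 1:4, 3)
println(size(demo))
```

With the code from this post, the call would be process_all(process_image, img_paths, 80 * 80). Threads only help if Julia is started with more than one, for example with julia --threads auto.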
I tried chaining with |> in process_image but can’t seem to figure out how to chain when the current result is not the left-most argument of the next function. In R you would be able to write something mid-chain like convert(Array{Float64,1},.) (assuming there was an equivalent convert function), where the . tells the chain to use the value of the chained object as the second argument. Obviously . is much more important to Julia’s ecosystem than R’s, so I wouldn’t expect the syntax to be identical, but I’m wondering if there’s an alternative way to accomplish this.
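As far as I know, the built-in |> only passes its left-hand value as the single argument of the function on its right, so the usual workaround is to wrap each step in an anonymous function that puts the value wherever it is needed. A minimal sketch on plain numbers rather than images (the data here is made up for illustration):

```julia
# Each step is an anonymous function; `v` plays the role of R's `.` placeholder.
# Parentheses around the lambdas keep `->` from swallowing the rest of the chain.
result = [1, 2, 3] |>
    (v -> convert(Vector{Float64}, v)) |>   # piped value goes in as the second argument
    (v -> v .* 2) |>
    sum

println(result)  # 12.0
```

The same shape should carry over to process_image, for example load(path) |> (v -> Gray.(v)) |> (v -> imresize(v, (80, 80))) |> vec |> (v -> convert(Array{Float64,1}, v)), although plain nested calls or the step-by-step version above are arguably just as readable.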
I’m assuming that the CUDA library could also be used to speed things up? I tried (admittedly not very hard) to put the loaded images on my GPU with img = load(path) |> device(), but that threw an "objects of type CuDevice are not callable" error. Any advice on how this could be done?
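On the error itself: in CUDA.jl, device() returns a CuDevice handle describing the current GPU rather than a function, so it cannot sit on the right-hand side of |>. Moving data is done with cu or the CuArray constructor instead. The sketch below assumes CUDA.jl is installed and uses a random vector as a stand-in for one processed image; it is not tested against the dataset from this post.

```julia
using CUDA  # assumes CUDA.jl is installed

# Stand-in for one processed 80x80 image flattened to a vector.
x = rand(Float32, 80 * 80)

if CUDA.functional()            # true only when a usable GPU and driver are found
    x_gpu = cu(x)               # copy to the GPU; CuArray(x) also works
    y_gpu = x_gpu .* 2f0        # the broadcast runs on the GPU
    y = Array(y_gpu)            # copy the result back to host memory
    println(typeof(x_gpu))
else
    println("No functional CUDA device found; staying on the CPU.")
end
```

Whether this helps is a separate matter. Reading and decoding the files happens on the CPU, so the GPU is more likely to pay off in whatever model consumes the processed images than in the loading step.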