Speeding up a color transformation loop

First, a little background. I have an image [2000 x 2000 pixels] I am applying different illuminants to. Towards the end of this process I have to transform the CIE1931 X,Y, and Z coordinate to RGB. These are stored as an array of 3 different 2000x2000 matrices ( colorLayers[1:3][x, y]).

To speed things up, I used CartesianIndices instead of two nested loops.

for xy in CartesianIndices(colorLayers[1])

	colorLayers[1][ xy] = (3.2404542 * XYZlayers[1][ xy]) + (-1.5371385 * XYZlayers[2][ xy]) + (-0.4985314* XYZlayers[3][ xy])
	colorLayers[2][ xy] = (-0.969266 * XYZlayers[1][ xy]) + (1.8760108 * XYZlayers[2][ xy]) + (0.041556 * XYZlayers[3][ xy])
	colorLayers[3][ xy] = (0.0556434 * XYZlayers[1][ xy]) + (-0.2040259 * XYZlayers[2][ xy]) + (1.0572252 * XYZlayers[3][ xy])
end

This takes 3.22018 seconds to perform.

I’ve tried:

  1. Putting X,Y, and Z in their own separate loops, hoping it would give an memory access advantage (it didn’t).
  2. @threads (it slowed it down, Julia set to 8 threads)
  3. Using FLoops…

FLoops seems to have a cache issue. The runs with the first few illuminants are fairly fast, but get slower over time:

illuminant-1 0.98
illuminant-2 0.73
illuminant-3 5.54
illuminant-4 5.61
illuminant-5 5.77
illuminant-6 12.44 seconds (31.07 seconds total)

In contrast, the non-FLoops code is a fairly constant ~3.2 seconds per illuminant (19.3 seconds total).

Any ideas how to speed up this code?

FLoops snippet below:

@floop	for xy in CartesianIndices(colorLayers[1])

	colorLayers[1][ xy] = (3.2404542 * XYZlayers[1][ xy]) + (-1.5371385 * XYZlayers[2][ xy]) + (-0.4985314* XYZlayers[3][ xy])
	colorLayers[2][ xy] = (-0.969266 * XYZlayers[1][ xy]) + (1.8760108 * XYZlayers[2][ xy]) + (0.041556 * XYZlayers[3][ xy])
	colorLayers[3][ xy] = (0.0556434 * XYZlayers[1][ xy]) + (-0.2040259 * XYZlayers[2][ xy]) + (1.0572252 * XYZlayers[3][ xy])
end
  1. Have you tried eachindex(colorLayers[1]) instead of Cartesian indexing?

  2. Instead of accessing XYZlayers[1][xy] 3 times, have you tried setting h1 h2 h3 to XYZlayers[1 to 3][xy] respectively and then using the hs for each calculation.

1 Like

Apart from trying Dan’s suggestions, you should put your code inside of a function. This is mentioned in the performance tips, which you should read given that you’re interested in performance: Performance Tips · The Julia Language

1 Like

Any reason for this indirect storage format? Using an 3 x 2000 x 2000 array your whole operation can be written as a matrix multiplication.

2 Likes

First off, thanks for all of the suggestions. I’ll report back later after I’ve had time to try everyone’s suggestions. In the meantime, I’ll answer some questions, which will give everyone a better background.

Big picture: I’m trying to show what hyperspectral images look like under different illuminants (e.g. daylight vs LED light bulb). Instead of 3 channels corresponding to red, green, and blue, hyperspectral images might have 31 channels, corresponding to 400 through 700nm wavelengths, in steps of 10nm.

Why hyperspectral? Metamerism: two things whose color looks identical, despite reflecting completely different wavelengths. For example, object A might reflect 610nm (orange), whereas object B reflects a mixture of yellow (580nm) and red (640nm), yet both look orange. For example, the objects on the left are rendered under a full spectrum, whereas the objects on the right have a 610nm illuminant. In this case, metamerism fails, as the mixture of yellow and red looks dark gray instead of orange.

nsajko: That’s just a snippet that is part of larger function. Below is an outline of what the function does:

function processHyperSpectral( images, CMF, Illuminant)

   # images is an array of 31 hyperspectral images (e.g. 400 nm, 410nm…700nm)
   # CMF is a color matching function, such as CIE1931. 3 (XYZ) x 31 (nm)
   # illuminant is an array of the spectrum of a light source (daylight, incandecent bulb, etc.)

1) turn the CMF into matrix of floats (was an array of xyz)
2) make separate images layers representing X,Y,and Z
3) precalculate the illuminant * CMF (color matching function adapted to the light source).
4) go through each of the 31 hyperspectral images and multiply them by the precalculated CMF
         This is summed into separate X,Y, and Z images.
5) Normalize the X, Y, and Z images
6) Transform the X, Y, and Z images to separate R, G, and B images (CURRENT ISSUE)
7) Perform sVGA gamma correction
8) Fuse the 3 images into a single RGB image

Dan: 1) No, I did not try each index, because I already knew the horizontal and vertical dimensions. Earlier iterations used:

for x in 1:horiz
    for y in 1:vert

… which was slower than the Cartersian method.

  1. I’m not sure what you mean by h1, h2, and h3. Please elaborate.

bertschi: Yes, there is a historical reason, but it might work better if I take your suggestion and
go to the next step and set it up as a matrix rather than an array of smaller matrices.
The original code leaned heavily on colors.jl. The constant battle between types
(Gray, RGB, and XYZ) was not only a pain-in-the-ass, but made arrays/matrices
and looping for efficient coding all but impossible. For example, and earlier version
looked like this:

for x in 1:horiz

    for y in 1:vert

        redLayer[x,y] = (3.2404542 * Xlayer[x,y]) + (-1.5371385 * Ylayer[x,y]) + (-0.4985314* Zlayer[x,y])
        greenLayer[x,y] = (-0.969266 * Xlayer[x,y]) + (1.8760108 * Ylayer[x,y]) + (0.041556 * Zlayer[x,y])
        blueLayer[x,y] = (0.0556434 * Xlayer[x,y]) + (-0.2040259 * Ylayer[x,y]) + (1.0572252 * Zlayer[x,y])
    end
end

…because the style of the codebase encouraged (forced?) treating red, green, and blue (or x, y, and z) as separate entities, rather than looping through colors 1,2, and 3. So my first course of action was to try to rip that out, and have a separate array entries for each color (later, during gamma correction, I do loop through the colors). The next step ended up looking like this:

for x in 1:horiz
    for y in 1:vert
        colorLayers[1][x,y] = (3.2404542 * Xlayer[x,y]) + (-1.5371385 * Ylayer[x,y]) + (-0.4985314* Zlayer[x,y])
        colorLayers[2][x,y] = (-0.969266 * Xlayer[x,y]) + (1.8760108 * Ylayer[x,y]) + (0.041556 * Zlayer[x,y])
        colorLayers[3][x,y] = (0.0556434 * Xlayer[x,y]) + (-0.2040259 * Ylayer[x,y]) + (1.0572252 * Zlayer[x,y])
    end
end

In addition, all of the typecasting and stuff really slowed down my code when I just needed to manipulate arrays/matrices of floats.

So, by first translating them to floats, doing my several steps of manipulations, and then translating back to RGB types at the last minute, ended up yielding much faster code.

If you want people to be able to help you well, you’ve got to give self-contained examples of whatever’s the problem. This means including an entire function, along with the data necessary to run it. See Please read: make it easier to help you

But don’t give us the huge function as in your actual code, rather you should produce a self-contained example that’s representative of your issue.

Pretty sure they meant to deduplicate your code by assigning the common expressions to variables:

h1 = XYZlayers[1][xy]
h2 = XYZlayers[2][xy]
h3 = XYZlayers[3][xy]

If nothing else, that would make the code read better.

2 Likes

Using Muladd (tweaked from Colors.jl) brought it down from 3.2 to 2.91 seconds, and using Dan’s suggestion of deduplicating by assigning the common expressions to variables brought it down from 2.91 to 2.47 seconds.

This is the current state of my code:

for xy in CartesianIndices(colorLayers[1])
	A = XYZlayers[1][ xy]
	B = XYZlayers[2][ xy]
	C = XYZlayers[3][ xy]
	colorLayers[1][ xy] =  muladd( 3.2404542, 
							A, 
							muladd(-1.5371385, 
							B, 
							-0.4985314 * C)
							)
	colorLayers[2][ xy] =  muladd( -0.969266, 
							A, 
							muladd(1.8760108, 
							XYZlayers[2][ xy], 
							0.041556 * C)
							)
	colorLayers[3][ xy] =  muladd( 0.0556434, 
							A, 
							muladd(-0.2040259, 
							B, 
							1.0572252 * C)
							)
		

		
	end

Next, I’m going to try the matrix suggestion.

Couldn’t you just use @fastmath and let the compiler do it?

1 Like

Do you have a full runnable benchmark? I am getting much faster timings than you and I doubt my computer is faster. Here is a slightly tweaked version of your original:

using Images
using BenchmarkTools

function layer_test_data(; N = 2000)
    return [rand(Float32, N, N) for _ = 1:3]
end

function transform_layers(XYZlayers)
    N = size(layers[1], 1)
    colorLayers = [zeros(Float32, N, N) for _ = 1:3]
    for xy in CartesianIndices(XYZlayers[1])
        h1 = XYZlayers[1][xy]
        h2 = XYZlayers[2][xy]
        h3 = XYZlayers[3][xy]
        @fastmath colorLayers[1][xy] = (3.2404542 * h1) + (-1.5371385 * h2) + (-0.4985314 * h3)
        @fastmath colorLayers[2][xy] = (-0.969266 * h1) + (1.8760108 * h2) + (0.041556 * h3)
        @fastmath colorLayers[3][xy] = (0.0556434 * h1) + (-0.2040259 * h2) + (1.0572252 * h3)
    end
    colorLayers
end

Which benchmarks pretty well

julia> layers = layer_test_data();

julia> @btime transform_layers($layers);
  734.404 ms (24000010 allocations: 595.09 MiB)

Using a matrix multiply

function color_test_data(; N = 2000)
    return XYZ.(rand(RGB{Float32}, N, N))
end

function transform_colors(colors)
    M = Float32[
        3.2404542 -1.5371385 -0.4985314
        -0.969266 1.8760108 0.041556
        0.0556434 -0.2040259 1.0572252
    ]
    N = size(colors, 1)
    colorview(RGB, reshape(M * reshape(channelview(colors), 3, N*N), 3, N, N))
end

is a big speedup

julia> ctd = color_test_data();

julia> @btime transform_colors($ctd);
  15.197 ms (6 allocations: 45.78 MiB)

The transform from Colors.jl is in the middle

julia> @btime RGB.($ctd);
  423.400 ms (2 allocations: 45.78 MiB)
2 Likes

Contradict:

I made a toy benchmark based on your code.

My results (MacStudio with M1Max chipset):

 transform_layers    82.292 ms
 my_MulAdd          101.012 ms
 transform_colors  1640.000 ms
 colors.jl          799.766 ms

So, your first algorithm wins! But (and strangely), your matrix multiply version (transform_colors) tanks on my machine, and gets outdone by colors.jl.

In practice, when I substituted you fastmath code for mine, it ended up being a little slower (~10%).

In the meantime, I’ll clean up my code and post the entire function.

Source code:

function processHyperSpectral( images, CMF, Illuminant)

	numWL = length(CMF)				# numWL = number of wavelength bins
	cmfArray = zeros( (3, numWL))

	for i in eachindex(CMF)			# extract colors.jl CMF so that are hands are not tied by their data types.
		cmfArray[1, i] = CMF[i].x
		cmfArray[2, i] = CMF[i].y
		cmfArray[3,i] = CMF[i].z
	end
	dimensions = size(images[1])

	horiz = dimensions[1]
	vert = dimensions[2]


	XYZlayers = []									# Seperate X, Y, and Z layer for destination image
	for j in 1:3
		push!( XYZlayers, fill(0.0, horiz, vert) )
	end

	#-----------------------------
	# Precalculate Illuminant x CMF 
	IllumiCMF = zeros( (3,numWL))
	for i in eachindex(CMF)
		for j in 1:3
			IllumiCMF[j, i] = cmfArray[j, i] .* Illuminant[i]
		end
	end
	#-----------------------------
	# multiply each image by the CMF/Illuminant combo
	for i in eachindex( images)		# 31 10nm images starting at 400 nm (violet)
		for j in 1:3
			XYZlayers[j] = XYZlayers[j] .+ (images[i] .*IllumiCMF[j,i])		# multiply each image by the CMF/Illuminant combo
		end
	end
	#-----------------------------
	# We'll norm these by XYZ, which is the average area under the 3 curves
	XYZ =  sum(cmfArray) 
	XYZ /= 3

	for j in 1:3
		XYZlayers[j] = XYZlayers[j] ./ XYZ					# last was no sqrt, maxed by mxval
	end

	#-----------------------------
	# Next, convert XYZ to RGB
	colorLayers = []				# This will hold the 3 color layers before performing gamma and compositing to RGB
	for j in 1:3
		push!( colorLayers, fill(0.0, horiz, vert) )
	end

	#= XYZ to RGB matrix for D65:

		3.2404542	-1.5371385	-0.4985314
		-0.969266	1.8760108	0.041556
		0.0556434	-0.2040259	1.0572252

		from http://brucelindbloom.com
	=#

	for xy in CartesianIndices(colorLayers[1])
		A = XYZlayers[1][ xy]
		B = XYZlayers[2][ xy]
		C = XYZlayers[3][ xy]
		colorLayers[1][ xy] = muladd( 3.2404542, 
								A, 
								muladd(-1.5371385, 
								B, 
								-0.4985314 * C)
								)
		colorLayers[2][ xy] = muladd( -0.969266, 
								A, 
								muladd(1.8760108, 
								XYZlayers[2][ xy], 
								0.041556 * C)
								)
		colorLayers[3][ xy] = muladd( 0.0556434, 
								A, 
								muladd(-0.2040259, 
								B, 
								1.0572252 * C)
									)
#=		for xy in CartesianIndices(XYZlayers[1])
			h1 = XYZlayers[1][xy]
			h2 = XYZlayers[2][xy]
			h3 = XYZlayers[3][xy]
			@fastmath colorLayers[1][xy] = (3.2404542 * h1) + (-1.5371385 * h2) + (-0.4985314 * h3)
			@fastmath colorLayers[2][xy] = (-0.969266 * h1) + (1.8760108 * h2) + (0.041556 * h3)
			@fastmath colorLayers[3][xy] = (0.0556434 * h1) + (-0.2040259 * h2) + (1.0572252 * h3)

=#
	end

	#-----------------------------
	# Perform gamma correction and composit images
	# 	clamp the values
	for j in 1:3
		clamp!(colorLayers[j], 0, 1)
	end
	# 	make the newImage container
	newImage = fill(RGB(0.0, .0, 0.0), horiz, vert)
	# 	perform gamma correction
	for xy in CartesianIndices(colorLayers[1])
		redTemp, greenTemp, blueTemp = sRGBGamma(colorLayers[1][ xy], colorLayers[2][ xy], colorLayers[3][ xy]) 
		newImage[xy] = RGB( redTemp, greenTemp, blueTemp )	# collect changes tuple to an array
	end

	return newImage
end
#-=================================================
function sRGBGamma(red, green, blue)

	if red > 0.0031308
		red = 1.055 * (red^(1. / 2.4)) - 0.055
	else
		red *= 12.92
	end
	red = clamp(red, 0, 1)
	if green > 0.0031308
		green = 1.055 * (green^(1. / 2.4)) - 0.055
	else
		green *= 12.92
	end
	green = clamp(green, 0, 1)
	if blue > 0.0031308
		blue = 1.055 * (blue^(1. / 2.4)) - 0.055
	else
		blue *= 12.92
	end
	blue = clamp(blue, 0, 1)
	return red, green, blue
end

One thing that is odd is that the toy simulation, despite using the same amount of data, runs substantially faster than when it is embedded in the function processHyperSpectral(). For example, in the toy version, it took 101 ms to complete, whereas when that same section is in processHyperSpectral(), it takes 2508ms to complete for the same amount of data.

The only differences between the two are (other than the randomized data in the toy version) the computations that occurred before-hand. I suspect, perhaps naively, that there is some sort of type-checking that is slowing down the computations when it is in processHyperSpectral(). Having said that, I tried to ensure that all matrices and arrays were [implicitly] of type Float64.*

  • That’s the reason I stepped away from using the built-in types used by colors.jl, as they seemed to require too much overhead when typecasting from one to another when I could have treated everything as array/matrices of float64. Indeed, when I transformed the initial CMF from using the colors.jl xyz data type to just a float, everything sped up.

You repeatedly use CartesianIndices, for no apparent reason. The appropriate thing to use in your case eachindex or for idx = 1:length(A).

The difference is that eachindex uses linear indexing. That is, each access is just basepointer + shift * (index-1). Cartesian indices are (for a matrix) two indices i, j and access at basepointer + shift * ((i-1) + size(A,1) * (j-1)). You see the extra addition and multiplication!

Never just use CartesianIndices without thinking. Only use it if you have good reasons for using it; this is not a sane default.

1 Like

These are arrays of type Any, can you initialize them with the correct types? Like Int[] for an array of integers.

edit:

In this case, do:

	XYZlayers = Matrix{Float64}[]									# Seperate X, Y, and Z layer for destination image
	for j in 1:3
		push!( XYZlayers, fill(0.0, horiz, vert) )
	end

or, more directly:

XYZlayers = [ zeros(horiz, vert) for _ in 1:3 ]

in which case the element type of the array can be deduced directly upon construction (you can use fill instead of zeros there, but zeros seems clearer to me).

1 Like

This is a sign that you may be using some global variable in your actual code that is causing a type instability, that you solved in the toy example.

Your function processHyperSpectral is unwieldy, you should separate it into many smaller functions.

There’s a horrid degree of code duplication in sRGBGamma.

These snippets should obviously be the body of their own function. If you don’t want to define a function in the global namespace for some reason, you can use a local function, like:

function sRGBGamma(red, green, blue)
  gc = function(x)
    if x > 0.0031308
      x = 1.055 * (x^(1. / 2.4)) - 0.055
    else
      x *= 12.92
    end
    clamp(x, 0, 1)
  end
  map(gc, (red, green, blue))
end

Try IllumiCMF[j, i] .= cmfArray[j, i] .* Illuminant[i] instead to prevent unnecessary allocation.

Try XYZlayers[j] .+= images[i] .* IllumiCMF[j,i] instead.

The same here.

In general, see the Performance tips in the Manual, in particular this section.

1 Like

Another thing: as I understand here XYZlayers[j] is an array. Here you are creating a new array on the right side of the equation and replacing the array on that position with this new array. You probably want to update the current array, with XYXlayers[j] .= ... (note the dot in .=). Alternatively, you can use, for simplicity:

     @. XYZlayers[j] = XYZlayers[j] + (images[i] * IllumiCMF[j,i])	

See: Common allocation mistakes

edit: Indeed, the same @nsajko posted above.

1 Like

This part would perhaps better be implemented as a polynomial approximation. In any case, I’m guessing some gamma correction functionality should already be available in a package in the ecosystem.

Doesn’t it make more sense for XYZlayers to be a 3D Array rather than a vector of matrices? For the latter there is no structural enforcement of the layers having the same size.