HPC / Julia, MPI / big data

Hello. Many years ago I was in HPC (as in university/govt supercomputers). We used to write code in C/pthreads, MPI, PVM etc. Then I left and went to the private sector and then big data came with Spark, Scala and that whole tooling world. I then went back to HPC in a university environment and realized that not much has changed - people (professors and grad students) still write distributed code in C/C++/MPI and maybe with some Python for visualization. Marrying the world of “big data” (Spark/Scala/Java…) and HPC seems to be difficult from an infrastructure point of view (running things side by side on the same HPC cluster).

Being new to Julia, I am wondering how Julia relates to all this. Is it an alternative that marries the two worlds, even if in a way that is incompatible with each of them? Meaning, with its ML libraries and distributed capabilities, can it be used to abandon both Spark and MPI and come out on top as a unified approach/solution? I am just trying to assess the possibilities/offerings here. It seems silly to me for a university to run separate infrastructures for its Data Science programs (Spark…) and the rest of its research (HPC/OpenMPI…).

Thank you!


I think it depends on the purpose to which the HPC system is being applied.
The requirements of scientific computing and data science are quite different.

It also matters a lot what the capabilities of the computing infrastructure are.
Dedicated supercomputers and cloud computing tend to be used in different ways
because of their different features.

Thank you for the reply. I guess I am trying to understand whether Julia can be used for both. Let’s say I am a researcher who wants to write a large scale climate model to run on a supercomputer. Can I use Julia for that and forget about OpenMPI/C/C++/Fortran? Or, alternatively, let’s say I am a data scientist who wants to run some analytics on a lot of data on the same supercomputer. Can I use Julia to do that and maybe run some machine learning models as well? In other words, I would need nothing other than the supercomputer, Julia and its libraries. I would need no Spark cluster and I would need no MPI/C/C++ knowledge.

Thanks!


Welcome!

(a) It isn’t Julia, but I would flag that dask is a great tool and one I usually find a good substitute for Spark.
(b) As far as I can tell, Julia has phenomenal tools for parallelism and distributed computing (including over MPI). See for example all the available cluster managers. If you’re writing your own code, or working at the level of raw arrays, you’re set (see DistributedArrays). I don’t know of an out-of-the-box distributed heterogeneous tabular library like dask, though; JuliaDB is where it would be, as it’s parallelized locally, and it maybe works on distributed platforms?

EDIT: Add obligatory link to overview of parallelism in Julia in docs.


Julia comes with a Distributed standard library which provides many basic operations for parallel computing. There are packages built on top of this such as DistributedArrays.jl and JuliaDB.jl which provide Spark-like functionality. Many people are using these packages successfully.
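As a rough sketch of what that looks like (a minimal local example; on a real cluster the worker processes would typically come from a cluster manager such as ClusterManagers.jl rather than a plain `addprocs(4)`):

```julia
using Distributed
addprocs(4)                    # spawn 4 local worker processes

# code the workers run must be defined on every process
@everywhere f(chunk) = sum(abs2, chunk)

# Spark-like data parallelism: scatter the work, gather the results
results = pmap(f, [rand(1_000) for _ in 1:100])

# or a parallel reduction
total = @distributed (+) for i in 1:1_000_000
    1 / i^2                    # partial sums are combined with (+)
end
```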

For classic HPC problems (e.g. PDEs), MPI is still the dominant communication mechanism, since it lets you leverage low-latency networking hardware. For the CliMA project we’re using Julia with MPI for communication (via MPI.jl). You are correct though that this is fundamentally a different paradigm (e.g. it is written using a batch SPMD style, rather than having a main controller process).
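For contrast, a minimal MPI.jl program in that batch SPMD style might look like the following. Every rank runs the same file, launched with something like `mpiexec -n 4 julia sum.jl` (the exact `MPI.Reduce` argument spelling has varied a bit across MPI.jl versions):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)      # this process's id, 0-based

# SPMD: each rank computes on its own chunk of the problem
local_sum = sum(rand(1_000))

# collective reduction; the result lands on rank 0
global_sum = MPI.Reduce(local_sum, +, 0, comm)

rank == 0 && println("total = ", global_sum)
MPI.Finalize()
```

Note there is no main controller process here: the program's structure itself encodes the communication pattern.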


If you are trying to find a modern alternative to OpenMPI/C/C++/Fortran, take a look at Chapel.

I guess I am trying to find a replacement for OpenMPI/C/C++/Fortran AND Spark in one tool/solution :wink:

Well I think you have come to the right place!
This article might interest you

Also have a look at Intel OneAPI - the relevance is that there is an integration with Julia.
We could see a future where the same code runs on CPUs, GPUs, FPGA…

How is this connected to Julia? Looks interesting …

Chris Rackauckas pointed me toward this.
I find it exciting that Julia is at the leading edge of developments like this.
In the past I have been a bit cynical about initiatives like this, but currently I am enthused!


In your estimation, how much of the code written today in HPC/supercomputer contexts is Julia and how much is OpenMPI/C/C++ etc.? Thanks :slight_smile:

I think it’s less the communication mechanism and more the programming model. Julia’s Distributed lets you use the faster communication as well if you ask it to. But Distributed’s programming model is so “top-down” that it’s really tailored to mapreduce types of calls, which is great for simple data-parallel computing, while MPI is just easier for the “minimal communication + local compute” style that people want for big PDEs.

You could in theory get this same style of compute by making Julia’s Distributed turn on the use of the interconnects (something Valentin made work), and then spawning one process per node that encodes all of the communication into it (i.e. spawn the processes once, have them last the entire time, and do the inter-channel communication inside them), and you’d essentially have MPI. But… that would be going so far out of your way to write MPI-like code in Distributed that you might as well use MPI.
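That “spawn once, then communicate over channels” pattern can be sketched with stdlib primitives. Everything here is illustrative (a toy two-worker “halo exchange”), not a real PDE code:

```julia
using Distributed
addprocs(2)

# one mailbox per worker, hosted on that worker
ch1 = RemoteChannel(() -> Channel{Float64}(1), workers()[1])
ch2 = RemoteChannel(() -> Channel{Float64}(1), workers()[2])

@everywhere function run_steps(state, inbox, outbox, nsteps)
    for _ in 1:nsteps
        put!(outbox, state)         # send boundary value to the neighbor
        halo = take!(inbox)         # receive the neighbor's boundary value
        state = (state + halo) / 2  # "local compute" using the halo value
    end
    return state
end

# long-lived tasks: spawned once, exchanging data directly each step
f1 = remotecall(run_steps, workers()[1], 1.0, ch1, ch2, 10)
f2 = remotecall(run_steps, workers()[2], 2.0, ch2, ch1, 10)
fetch(f1), fetch(f2)
```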

Note that the Julia Lab is having weekly parallel compute programming model discussions to try and find interesting new answers in this space.

Ehh I don’t see OneAPI as it. I think the problem of heterogeneous compute is different and needs a different solution… long story though.

In the demarcations that I mention in this 18.337 lecture:

OneAPI is an instantiation of a declarative heterogeneous SPMD kernel-generation model. Those kinds of things are very powerful if you have very regular compute problems (e.g. an LU factorization or a PDE stencil), but they are limited in the codes they can express well. It’s a path that should be pushed forward and will be useful for writing kernels, but I don’t think it’s the future of democratized HPC.


LLVM determined using 2x doubles was more efficient than using 4x doubles

  • Around the 8-minute mark.

You’d need to use fast math flags to get reassociation, which would be needed for 4x doubles. Those 2x doubles are just the real-imaginary pairs.
Try defining

Base.:+(x::MyComplex, y::MyComplex) = MyComplex(Base.FastMath.add_fast(x.real, y.real), Base.FastMath.add_fast(x.imag, y.imag))

That may give you <4 x double>.
Test:

julia> struct MyComplex
           real::Float64
           imag::Float64
       end

julia> Base.:+(x::MyComplex, y::MyComplex) = MyComplex(Base.FastMath.add_fast(x.real, y.real), Base.FastMath.add_fast(x.imag, y.imag))

julia> Base.:/(x::MyComplex, N::Int) = MyComplex(Base.FastMath.div_fast(x.real, N), Base.FastMath.div_fast(x.imag, N))

julia> average(x::AbstractArray{MyComplex}) = sum(x) / length(x)
average (generic function with 1 method)

julia> x = [MyComplex(rand(),rand()) for _ in 1:100];

julia> average(x)
MyComplex(0.5322661296777239, 0.49026719877886044)

julia> @code_llvm debuginfo=:none average(x)

define void @julia_average_884([2 x double]* noalias nocapture sret, %jl_value_t* nonnull align 16 dereferenceable(40)) {
top:
  %2 = alloca %jl_value_t*, i32 4
  %3 = alloca [2 x double], align 8
  %4 = bitcast %jl_value_t* %1 to %jl_value_t**
  %5 = getelementptr inbounds %jl_value_t*, %jl_value_t** %4, i64 3
  %6 = bitcast %jl_value_t** %5 to i64*
  %7 = load i64, i64* %6, align 8
  %8 = icmp sgt i64 %7, 0
  %9 = select i1 %8, i64 %7, i64 0
  br i1 %8, label %L11, label %L9

L9:                                               ; preds = %top
  %10 = getelementptr %jl_value_t*, %jl_value_t** %2, i32 0
  store %jl_value_t* inttoptr (i64 139655503420368 to %jl_value_t*), %jl_value_t** %10
  %11 = getelementptr %jl_value_t*, %jl_value_t** %2, i32 1
  store %jl_value_t* inttoptr (i64 139655522948096 to %jl_value_t*), %jl_value_t** %11
  %12 = getelementptr %jl_value_t*, %jl_value_t** %2, i32 2
  store %jl_value_t* %1, %jl_value_t** %12
  %13 = getelementptr %jl_value_t*, %jl_value_t** %2, i32 3
  store %jl_value_t* inttoptr (i64 139655492231712 to %jl_value_t*), %jl_value_t** %13
  %14 = call nonnull %jl_value_t* @jl_invoke(%jl_value_t* inttoptr (i64 139655502796576 to %jl_value_t*), %jl_value_t** %2, i32 4, %jl_value_t* inttoptr (i64 139655417829904 to %jl_value_t*))
  call void @llvm.trap()
  unreachable

L11:                                              ; preds = %top
  %15 = icmp eq i64 %9, 1
  br i1 %15, label %L13, label %L15

L13:                                              ; preds = %L11
  %16 = bitcast %jl_value_t* %1 to [2 x double]**
  %17 = load [2 x double]*, [2 x double]** %16, align 8
  %.elt = getelementptr inbounds [2 x double], [2 x double]* %17, i64 0, i64 0
  %18 = bitcast double* %.elt to <2 x double>*
  %19 = load <2 x double>, <2 x double>* %18, align 8
  br label %L47

L15:                                              ; preds = %L11
  %20 = icmp sgt i64 %9, 15
  br i1 %20, label %L43, label %L17

L17:                                              ; preds = %L15
  %21 = bitcast %jl_value_t* %1 to [2 x double]**
  %22 = load [2 x double]*, [2 x double]** %21, align 8
  %.elt60 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 0, i64 0
  %23 = bitcast double* %.elt60 to <2 x double>*
  %24 = load <2 x double>, <2 x double>* %23, align 8
  %.elt64 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 1, i64 0
  %25 = bitcast double* %.elt64 to <2 x double>*
  %26 = load <2 x double>, <2 x double>* %25, align 8
  %27 = fadd fast <2 x double> %26, %24
  %28 = icmp sgt i64 %9, 2
  br i1 %28, label %L34.preheader, label %L47

L34.preheader:                                    ; preds = %L17
  %29 = icmp ugt i64 %9, 3
  %umax = select i1 %29, i64 %9, i64 3
  %30 = add i64 %umax, -2
  %min.iters.check = icmp ult i64 %30, 32
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L34.preheader
  %n.mod.vf = urem i64 %30, 32
  %n.vec = sub i64 %30, %n.mod.vf
  %ind.end = add i64 2, %n.vec
  %31 = extractelement <2 x double> %27, i32 1
  %32 = insertelement <8 x double> zeroinitializer, double %31, i32 0
  %33 = extractelement <2 x double> %27, i32 0
  %34 = insertelement <8 x double> zeroinitializer, double %33, i32 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x double> [ %32, %vector.ph ], [ %55, %vector.body ]
  %vec.phi82 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %56, %vector.body ]
  %vec.phi83 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %57, %vector.body ]
  %vec.phi84 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %58, %vector.body ]
  %vec.phi85 = phi <8 x double> [ %34, %vector.ph ], [ %51, %vector.body ]
  %vec.phi86 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %52, %vector.body ]
  %vec.phi87 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %53, %vector.body ]
  %vec.phi88 = phi <8 x double> [ zeroinitializer, %vector.ph ], [ %54, %vector.body ]
  %offset.idx = add i64 2, %index
  %35 = add i64 %offset.idx, 0
  %36 = add i64 %offset.idx, 8
  %37 = add i64 %offset.idx, 16
  %38 = add i64 %offset.idx, 24
  %39 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 %35, i64 0
  %40 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 %36, i64 0
  %41 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 %37, i64 0
  %42 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 %38, i64 0
  %43 = getelementptr inbounds double, double* %39, i32 0
  %44 = bitcast double* %43 to <16 x double>*
  %45 = getelementptr inbounds double, double* %40, i32 0
  %46 = bitcast double* %45 to <16 x double>*
  %47 = getelementptr inbounds double, double* %41, i32 0
  %48 = bitcast double* %47 to <16 x double>*
  %49 = getelementptr inbounds double, double* %42, i32 0
  %50 = bitcast double* %49 to <16 x double>*
  %wide.vec = load <16 x double>, <16 x double>* %44, align 8
  %wide.vec89 = load <16 x double>, <16 x double>* %46, align 8
  %wide.vec90 = load <16 x double>, <16 x double>* %48, align 8
  %wide.vec91 = load <16 x double>, <16 x double>* %50, align 8
  %strided.vec = shufflevector <16 x double> %wide.vec, <16 x double> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %strided.vec92 = shufflevector <16 x double> %wide.vec89, <16 x double> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %strided.vec93 = shufflevector <16 x double> %wide.vec90, <16 x double> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %strided.vec94 = shufflevector <16 x double> %wide.vec91, <16 x double> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %strided.vec95 = shufflevector <16 x double> %wide.vec, <16 x double> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %strided.vec96 = shufflevector <16 x double> %wide.vec89, <16 x double> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %strided.vec97 = shufflevector <16 x double> %wide.vec90, <16 x double> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %strided.vec98 = shufflevector <16 x double> %wide.vec91, <16 x double> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %51 = fadd fast <8 x double> %strided.vec, %vec.phi85
  %52 = fadd fast <8 x double> %strided.vec92, %vec.phi86
  %53 = fadd fast <8 x double> %strided.vec93, %vec.phi87
  %54 = fadd fast <8 x double> %strided.vec94, %vec.phi88
  %55 = fadd fast <8 x double> %strided.vec95, %vec.phi
  %56 = fadd fast <8 x double> %strided.vec96, %vec.phi82
  %57 = fadd fast <8 x double> %strided.vec97, %vec.phi83
  %58 = fadd fast <8 x double> %strided.vec98, %vec.phi84
  %index.next = add i64 %index, 32
  %59 = icmp eq i64 %index.next, %n.vec
  br i1 %59, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx106 = fadd fast <8 x double> %52, %51
  %bin.rdx107 = fadd fast <8 x double> %53, %bin.rdx106
  %bin.rdx108 = fadd fast <8 x double> %54, %bin.rdx107
  %rdx.shuf109 = shufflevector <8 x double> %bin.rdx108, <8 x double> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx110 = fadd fast <8 x double> %bin.rdx108, %rdx.shuf109
  %rdx.shuf111 = shufflevector <8 x double> %bin.rdx110, <8 x double> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx112 = fadd fast <8 x double> %bin.rdx110, %rdx.shuf111
  %rdx.shuf113 = shufflevector <8 x double> %bin.rdx112, <8 x double> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx114 = fadd fast <8 x double> %bin.rdx112, %rdx.shuf113
  %60 = extractelement <8 x double> %bin.rdx114, i32 0
  %bin.rdx = fadd fast <8 x double> %56, %55
  %bin.rdx99 = fadd fast <8 x double> %57, %bin.rdx
  %bin.rdx100 = fadd fast <8 x double> %58, %bin.rdx99
  %rdx.shuf = shufflevector <8 x double> %bin.rdx100, <8 x double> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx101 = fadd fast <8 x double> %bin.rdx100, %rdx.shuf
  %rdx.shuf102 = shufflevector <8 x double> %bin.rdx101, <8 x double> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx103 = fadd fast <8 x double> %bin.rdx101, %rdx.shuf102
  %rdx.shuf104 = shufflevector <8 x double> %bin.rdx103, <8 x double> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx105 = fadd fast <8 x double> %bin.rdx103, %rdx.shuf104
  %61 = extractelement <8 x double> %bin.rdx105, i32 0
  %cmp.n = icmp eq i64 %30, %n.vec
  %62 = insertelement <2 x double> undef, double %60, i32 0
  %63 = insertelement <2 x double> %62, double %61, i32 1
  br i1 %cmp.n, label %L47, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L34.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 2, %L34.preheader ]
  %64 = phi <2 x double> [ %27, %L34.preheader ], [ %63, %middle.block ]
  br label %L34

L34:                                              ; preds = %L34, %scalar.ph
  %value_phi74 = phi i64 [ %66, %L34 ], [ %bc.resume.val, %scalar.ph ]
  %65 = phi <2 x double> [ %69, %L34 ], [ %64, %scalar.ph ]
  %66 = add nuw nsw i64 %value_phi74, 1
  %.elt68 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 %value_phi74, i64 0
  %67 = bitcast double* %.elt68 to <2 x double>*
  %68 = load <2 x double>, <2 x double>* %67, align 8
  %69 = fadd fast <2 x double> %68, %65
  %70 = icmp ult i64 %66, %9
  br i1 %70, label %L34, label %L47

L43:                                              ; preds = %L15
  call void @j_mapreduce_impl_885([2 x double]* noalias nocapture nonnull sret %3, %jl_value_t* nonnull %1, i64 1, i64 %9, i64 1024)
  %.sroa.035.0..sroa_idx39 = getelementptr inbounds [2 x double], [2 x double]* %3, i64 0, i64 0
  %71 = bitcast double* %.sroa.035.0..sroa_idx39 to <2 x double>*
  %72 = load <2 x double>, <2 x double>* %71, align 8
  br label %L47

L47:                                              ; preds = %L43, %L34, %middle.block, %L17, %L13
  %73 = phi <2 x double> [ %72, %L43 ], [ %19, %L13 ], [ %27, %L17 ], [ %63, %middle.block ], [ %69, %L34 ]
  %74 = bitcast %jl_value_t* %1 to %jl_array_t*
  %75 = getelementptr inbounds %jl_array_t, %jl_array_t* %74, i64 0, i32 1
  %76 = load i64, i64* %75, align 8
  %77 = sitofp i64 %76 to double
  %78 = insertelement <2 x double> undef, double %77, i32 0
  %79 = insertelement <2 x double> %78, double %77, i32 1
  %80 = fdiv fast <2 x double> %73, %79
  %.sroa.029.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  %81 = bitcast double* %.sroa.029.0..sroa_idx to <2 x double>*
  store <2 x double> %80, <2 x double>* %81, align 8
  ret void
}

Ah interesting. Does it have to do with LLVM respecting the lack of associativity or something?

Yes.
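(Concretely: floating-point addition is not associative, so without fast-math flags LLVM must preserve the source order of the sums rather than reassociating them into wider accumulators.)

```julia
julia> (0.1 + 0.2) + 0.3   # left-to-right, as written
0.6000000000000001

julia> 0.1 + (0.2 + 0.3)   # the reassociated order gives a different result
0.6
```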

Also, odd that LLVM is shuffling to sum the real and imag vectors into separate accumulators here (I edited to add @code_llvm above). That’s definitely worse than not doing that.

While in this case that is unnecessary and a misoptimization by LLVM, it would be necessary if you were doing something like multiplication (maybe a norm instead of a sum?), which would make an example of where struct of arrays is better (not having to shuffle to make the loaded vectors contiguous).

EDIT:
To clarify, the way I’d vectorize it with AVX512 is to, on every iteration of the SIMD loop:

s_1 += <real_1, imag_1, real_2, imag_2, real_3, imag_3, real_4, imag_4>
s_2 += <real_5, imag_5, real_6, imag_6, real_7, imag_7, real_8, imag_8>
s_3 += <real_9, imag_9, real10, imag10, real11, imag11, real12, imag12>
s_4 += <real13, imag13, real14, imag14, real15, imag15, real16, imag16>

(With AVX2, halve their lengths)
and then, to finally reduce back to a scalar

s = s_1 + s_2 + s_3 + s_4
s_reduce_4 = s[1:4] + s[5:8]
s_reduce_2 = s_reduce_4[1:2] + s_reduce_4[3:4]
s_final = s_reduce_2 / N
MyComplex(s_final[1], s_final[2])

This all requires reassociation. That’s because originally we would have had real_1 + real_2 + real_3 + …, but now we compute real_1 + real_17 + real_33 + …, which is only added at the end to real_2 + real_18 + real_34 + …, etc.

What LLVM is doing instead is

real_s_1 += <real_1, real_2, real_3, real_4, real_5, real_6, real_7, real_8>
imag_s_1 += <imag_1, imag_2, imag_3, imag_4, imag_5, imag_6, imag_7, imag_8>
real_s_2 += ...

which requires those shuffle instructions that are unnecessary with struct of arrays.
The shuffling actually would be necessary if we had a norm / self-dot product, where we multiply each number by itself or its adjoint on each iteration, so that the multiplication correctly lines up the real and imaginary parts from the same complex number.

If you wanted to demo possible improvements from hand-SIMDing with SIMD.jl, that would be a good demo.
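A possible starting point for such a demo with SIMD.jl (the interleaved `Vector{Float64}` layout and the vector width here are my own illustrative choices, not anything from the talk):

```julia
using SIMD

# xs holds complex numbers interleaved as re1, im1, re2, im2, ...
function sum_interleaved(xs::Vector{Float64})
    V = Vec{4,Float64}
    s = V((0.0, 0.0, 0.0, 0.0))      # accumulates <re, im, re, im>
    i = 1
    @inbounds while i + 3 <= length(xs)
        s += vload(V, xs, i)         # contiguous load: no strided shuffles
        i += 4
    end
    re, im = s[1] + s[3], s[2] + s[4]
    @inbounds while i < length(xs)   # scalar tail for any leftover pair
        re += xs[i]; im += xs[i + 1]
        i += 2
    end
    return re, im
end
```

Whether this actually beats the auto-vectorized version would need checking with BenchmarkTools.jl.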


The other thing that may not be relevant to your particular interests, but is to mine, is that Julia makes “lazy computation” really easy. This often lets you run big-data analyses on relatively ordinary hardware. Our array infrastructure in particular has lots of transformations you can do without ever allocating memory:

| Operation | Eager method | Lazy method | Package or Base |
|---|---|---|---|
| Select a ROI | `img[5:50,100:130]` | `view(img, 5:50, 100:130)` | Base Julia |
| Reshaping an array (e.g., 1 to 2 dims) | none | `reshape(a, 3, 5)` | Base Julia |
| Reordering dimensions | `permutedims(A, (2, 1))` | `PermutedDimsArray(A, (2, 1))` | Base Julia |
| Reinterpreting raw bits | none | `reinterpret(RGB{Float32}, A)` | Base Julia |
| Apply a function to each value in an array | `map(f, a)` | `mappedarray(f, a)` | MappedArrays.jl |
| Shift the indices of an array | none | `OffsetArray(A, -3:3, 0:15)` | OffsetArrays.jl |
| Add padding to edges of array | `padarray(a, Pad(:circular,2))` | `PaddedView(NaN, a, 0:4)` | ImageFiltering.jl and PaddedViews.jl |
| Spatial deformations of an array | `warp(img, tfm)` | `warpedview(img, tfm)` | ImageTransformations.jl |
| Combining arrays into color channels | `cat(1, red, green, blue)` | `colorview(RGB, red, green, blue)` | ImageCore.jl |
| Spatiotemporal filtering | `imfilter(img, kernel)` | `StreamingContainer{T}(f, A, axs)` | ImageAxes.jl, ImageFiltering.jl |

Here’s a demo I once put together. I can run this on my laptop with a 2TB file and get 30fps rendering performance when visualizing the result:

using MappedArrays, CoordinateTransformations, ImageTransformations, ImageCore, ImageView, LinearAlgebra
# (assumes the raw movie mov, the reference image movref, and the rotation angle θ are already defined)
to_photons(x) = (x - 100)*3.1   # define a function to subtract the bias and scale by gain
to_gaussian(y) = sqrt(max(y, zero(y))) #  a function to square-root transform the intensity
# Create a "virtual" image in which these two operations are applied pixel-by-pixel
movσ = mappedarray(to_gaussian ∘ to_photons, mov)    # ∘ means function composition
# Lazily create the rotated movie
tfm = AffineMap(RotZ(θ), (I - RotZ(θ))*center(mov))  # create the spatial rotation
movr = warpedview(movσ, tfm, axes(mov))              # a rotated view occupying the same ROI
# Lazily replicate the reference image across time
movieref = view(movref, :, :, ones(Int, size(mov, 3)))
# Lazily overlay in separate color channels
overlay = colorview(RGB, movr, movieref, zeroarray)

# Inspect the result
imshow(overlay)