See
for an implementation of STOMP. It does not currently support GPUs, but it is quite a bit faster than the implementation from the original paper.
@time matrix_profile(randn(Float32, 2^17), 256)
# 55.262552 seconds (140 allocations: 5.009 MiB)
on a laptop from 2014, while the authors' implementation took 4.2 minutes on the same problem size
(Table III of their paper).
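For readers unfamiliar with what is being timed, here is a minimal brute-force sketch of a matrix profile. It is not the package's code and it is not STOMP: it costs O(n² · m), whereas STOMP obtains the same result in O(n²) by reusing dot products between neighbouring windows. The names `znorm` and `naive_matrix_profile` are hypothetical, and the exclusion zone of `m ÷ 4` is an assumption (a common choice, not necessarily what the package uses). Constant windows have zero standard deviation and are not handled.

```julia
using Statistics  # mean, std

# Hypothetical helper: z-normalize one window (assumes it is not constant).
znorm(x) = (x .- mean(x)) ./ std(x; corrected=false)

# Brute-force matrix profile: for every length-m window of T, the z-normalized
# Euclidean distance to its nearest neighbour outside a small exclusion zone.
function naive_matrix_profile(T::AbstractVector{<:Real}, m::Integer)
    n = length(T) - m + 1                      # number of windows
    windows = [znorm(T[i:i+m-1]) for i in 1:n]
    excl = m ÷ 4                               # assumed exclusion half-width
    P = fill(Inf, n)                           # nearest-neighbour distances
    idx = zeros(Int, n)                        # nearest-neighbour positions
    for i in 1:n, j in 1:n
        abs(i - j) <= excl && continue         # skip trivial self-matches
        d = sqrt(sum(abs2, windows[i] .- windows[j]))
        if d < P[i]
            P[i] = d
            idx[i] = j
        end
    end
    return P, idx
end

# Usage: a signal that repeats every 32 samples, so every window has an
# identical copy one period away and the profile should be close to zero.
T = repeat(sin.(range(0, 2π; length=33)[1:32]), 4)
P, idx = naive_matrix_profile(T, 16)
```

This is only practical for short series; at 2^17 samples the quadratic number of window pairs is what makes an efficient algorithm such as STOMP necessary.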