I’m announcing a new package, BiweightStats.jl (docs).
It implements a suite of robust statistics based on the biweight transform. This transform, as implemented, will skip any non-finite values, and will filter any values greater than a given cutoff value from the median of the data.
The motivation for this package was for use in image processing, specifically for astronomical imaging where extreme outliers are not uncommon (e.g., from cosmic rays on CCDs). The statistics implemented are based on the NIST definitions (e.g., scale), and match the functionality of the astropy.stats biweight statistics.
The code has been benchmarked against the astropy implementation, and while it readily outperforms python for most of benchmarks, I’m curious how the Julia code can be improved. As of right now, the package has 0 dependencies, but does include a hand-written iterator and loops. In theory there could be speedup by using Transducers.jl, Floops.jl, LoopVectorization.jl, Polyster.jl, etc. I believe the predominate bottlenecks are the allocations and medians required for implementing the biweight transform. I’m happy to entertain any attempts at improving the performance!
I think you can save a temporary array:
julia> using Statistics, BenchmarkTools
julia> function orig(X; c=9, M=nothing)
data = filter(isfinite, X)
med = median(data)
med = M
mad = median!(abs.(data .- med))
orig (generic function with 1 method)
julia> function test(X; c=9, M=nothing)
data = filter(isfinite, X)
med = isnothing(M) ? median(data) : M
data .= abs.(data .- med)
mad = median!(data)
test (generic function with 1 method)
julia> @benchmark orig(X) setup=(X=randn(1000, 1000))
BenchmarkTools.Trial: 164 samples with 1 evaluation.
Range (min … max): 21.039 ms … 35.319 ms ┊ GC (min … max): 7.46% … 0.00%
Time (median): 27.269 ms ┊ GC (median): 1.20%
Time (mean ± σ): 27.172 ms ± 2.476 ms ┊ GC (mean ± σ): 2.73% ± 3.26%
█▃▁▄▁ ▃▄▁▃ ▁▃▄▃▁▁▃▃▆ ▆ ▃ ▁
21 ms Histogram: frequency by time 32.5 ms <
Memory estimate: 22.89 MiB, allocs estimate: 6.
julia> @benchmark test(X) setup=(X=randn(1000, 1000))
BenchmarkTools.Trial: 183 samples with 1 evaluation.
Range (min … max): 18.818 ms … 31.551 ms ┊ GC (min … max): 0.00% … 0.95%
Time (median): 25.092 ms ┊ GC (median): 0.00%
Time (mean ± σ): 24.793 ms ± 2.497 ms ┊ GC (mean ± σ): 1.28% ± 2.01%
▃ ▂ ▅ ▅ ▂ ▂▂ ▅█▆▃▅▂▃ ▃▃ ▂ ▂
18.8 ms Histogram: frequency by time 30.6 ms <
Memory estimate: 15.26 MiB, allocs estimate: 4.
It’s a ~10% improvement in this run.
Thanks @giordano! I played around with this a little before realizing it caused a bug with the covariance!
We can’t reorder the elements when calculating the covariance between two vectors. Well, they can be reordered, just so long as both vectors are reordered the same way. This means I can’t use
median! on the intermediate data for that method.
What I can do is add a keyword argument whether its safe to do this, but it means the
midcor methods won’t see the speedup.