How to make norm(::AbstractVector) faster in my case?

import Base.size
import Base.IndexStyle
import Base.getindex
import Base.setindex!

using LinearAlgebra
using BenchmarkTools

struct Dummy{T} <: AbstractVector{T}
    coeffs::Vector{T}
end


############### Base ###############################
size(c::Dummy{T}) where T = size(c.coeffs)

IndexStyle(::Type{<:Dummy}) = IndexLinear()

getindex(c::Dummy{T}, I::Int) where T = c.coeffs[I]

setindex!(c::Dummy{T}, v, I::Int) where T = setindex!(c.coeffs, v, I)

### Experiments

coeff = randn(Float32,10^9);
x = Dummy{Float32}(coeff);

@benchmark norm(x)
@benchmark norm(x.coeffs)

my example is here. I created an abstractvector called Dummy with its attribute coeffs. And then I set getindex/setindex to act directly on the coeffs attributed of Dummy. Now I found huge performance loss on norm of this AbstractVector Dummy, as

julia> @benchmark norm(x)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.956 s …  2.959 s  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.958 s             β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.958 s Β± 1.809 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ                                                      β–ˆ  
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  2.96 s        Histogram: frequency by time        2.96 s <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark norm(x.coeffs)
BenchmarkTools.Trial: 17 samples with 1 evaluation.
 Range (min … max):  302.239 ms … 303.657 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     302.967 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   302.929 ms Β± 426.119 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁ ▁  β–ˆ              ▁  ▁  ▁▁   ▁      β–ˆβ– ▁▁      ▁ ▁        ▁  
  β–ˆβ–β–ˆβ–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–ˆβ–β–β–ˆβ–ˆβ–β–β–β–ˆβ–β–β–β–β–β–β–ˆβ–ˆβ–β–ˆβ–ˆβ–β–β–β–β–β–β–ˆβ–β–ˆβ–β–β–β–β–β–β–β–β–ˆ ▁
  302 ms           Histogram: frequency by time          304 ms <

 Memory estimate: 32 bytes, allocs estimate: 2.

May I get any suggestion on how to make the norm of this AbstractVector as fast as evaluating the norm of its coeffs? An overload of the norm would definitely work but it sounds hacky to me. Thanks much for your input!

I think you just want to define norm(x::Dummy, p) = norm(x.coeffs,p). The problem is that your version is falling back to the generic calculation rather than using BLAS.

Also Julia 1.8 might make this faster if faster and simpler generic_norm2 by oscardssmith Β· Pull Request #43256 Β· JuliaLang/julia Β· GitHub or RFC: Add `norm(A, p; dims)` by mcabbott Β· Pull Request #43459 Β· JuliaLang/julia Β· GitHub get merged.

Or, since your struct only holds a Vector, it could be declared <:DenseArray. (And will need to define pointer, strides, maybe.)

Thanks for your quick reply. Then following your suggestion, I should overload every operation, e.g. dot, norm which BLAS has its own ways to make it faster, right? That is still a lot of hacky work and I really wonder if there is any easier way to deal with these things for AbstractArrays in Julia.

Do you mean do this instead?

struct Dummy{T} <: AbstractVector{T}
    coeffs::DenseVector{T}
end

No the problem is still there

Why is the fallback so bad?

julia> mynorm(x) = sqrt(sum(val->val^2,x))
mynorm (generic function with 1 method)

julia> x = Dummy{Float32}(rand(Float32,10^8));

julia> @btime LinearAlgebra.norm($x)
  305.828 ms (0 allocations: 0 bytes)
5773.3687f0

julia> @btime mynorm($x)
  27.586 ms (0 allocations: 0 bytes)
5773.3687f0


2 reasons:

  1. it is dealing with rescaling to prevent spurious overflow/underflow
  2. it is doing so badly.
1 Like

No, I mean make the struct <:DenseArray{T,1}, not <: AbstractVector{T}. The field certainly wants to be concrete.

Oh this is just a minimal example. In my practice I still want to make it an AbstractVector so it can include metadata and more attributes

Sure, but what has that got to do with anything? Do you want norm to access these other fields? Then clearly you must write a method. But if all you want is for an existing norm method to be used, then one way to do that is by fitting into its signature.

Other fields have nothing to do with norm. So I did

import Base.size
import Base.IndexStyle
import Base.getindex
import Base.setindex!
import Base.pointer

using LinearAlgebra
using BenchmarkTools

struct Dummy{T} <: DenseVector{T}
    coeffs::Vector{T}
end

############### Base ###############################
size(c::Dummy{T}) where T = size(c.coeffs)

IndexStyle(::Type{<:Dummy}) = IndexLinear()

getindex(c::Dummy{T}, I::Int) where T = c.coeffs[I]

setindex!(c::Dummy{T}, v, I::Int) where T = setindex!(c.coeffs, v, I)

pointer(c::Dummy{T}) where T = pointer(c.coeffs) # removing this line doesn't work either

### Experiments

coeff = randn(Float32,10^4);
x = Dummy{Float32}(coeff);

@benchmark norm(x)
@benchmark norm(x.coeffs)

and got this error for norm(x)

julia> norm(x)
ERROR: conversion to pointer not defined for Dummy{Float32}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] unsafe_convert(#unused#::Type{Ptr{Float32}}, a::Dummy{Float32})
   @ Base ./pointer.jl:67
 [3] nrm2(n::Int64, X::Dummy{Float32}, incx::Int64)
   @ LinearAlgebra.BLAS /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/blas.jl:487
 [4] nrm2(x::Dummy{Float32})
   @ LinearAlgebra.BLAS /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/blas.jl:493
 [5] norm2
   @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/dense.jl:105 [inlined]
 [6] norm(itr::Dummy{Float32}, p::Int64)
   @ LinearAlgebra /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/generic.jl:627
 [7] norm(itr::Dummy{Float32})
   @ LinearAlgebra /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/generic.jl:625
 [8] top-level scope
   @ REPL[19]:1

What do I do? Sorry didn’t find any online resource about this

@Oscar_Smith

It’s a bit obscure but this is the crucial bit:

Defining that seems to be sufficient:

julia> Base.unsafe_convert(::Type{Ptr{T}}, A::Dummy{T}) where {T} =
           Base.unsafe_convert(Ptr{T}, A.coeffs)

julia> @btime norm($x)  # was 15.625 ΞΌs
  min 12.416 ΞΌs, mean 12.647 ΞΌs (0 allocations)
99.28318f0

julia> @btime norm($(x.coeffs))
  min 12.416 ΞΌs, mean 12.562 ΞΌs (0 allocations)
99.28318f0

julia> @which pointer(x)  # no method added here
pointer(x::AbstractArray{T}) where T in Base at abstractarray.jl:1170

julia> @btime $(rand(10^4, 10^4)) * $x;  # was 23.167 ms without BLAS
  min 13.536 ms, mean 14.144 ms (4 allocations, 156.34 KiB)

Thanks a lot! This works! One more question:

if I want to do this on GPU, what should I do?

struct CuDummy{T} <: DenseCuVector{T}
    coeffs::CuVector{T}
end

this throws an error as

ERROR: invalid subtyping in definition of CuDummy
Stacktrace:
 [1] top-level scope
   @ REPL[29]:1
 [2] top-level scope
   @ ~/.julia/packages/CUDA/DFAea/src/initialization.jl:52

When I set it up as

struct CuDummy{T} <: CUDA.AbstractGPUArray{T,1}
    coeffs::CuVector{T}
end

the run time for norm(x) and norm(x.coeffs) when x is a CuDummy with length of 10^6 are 12ms and 11ms, which is not too different tho