Base.Statistics -- I'm confused by the differences in arguments

Suppose I have a number of experiments, say 5, each of which consist of a 2x3 matrix of data. I can either organize these in a 2x3x5 array A, C, or in a vector of 5 elements of 2x3 matrices (B). For illustration:

# Organized in array
A = rand(2,3,5)
# Organized in vector
B = [A[:,:,i] for i in 1:size(A)[3]]
# Convert from vector to array
C = Array{Float64}(undef,size(B[1])...,length(B))
for i in 1:length(B)
       C[:,:,i] = B[i]
end

Some questions:

  1. In the Statistics package, I assume that arrays A,C (which are equal) qualify as AbstractArray while the vector B qualifies as itr or perhaps as AbstractVector??

The following pairs of operations give the same answer…

julia> mean(A,dims=3)[:,:] == mean(B)
true

julia> std(A,dims=3)[:,:] == std(B)
true

julia> var(A,dims=3)[:,:] == var(B)
true

However, if I want to compute median, quantiles, etc., things are different…

julia> median(A,dims=3)[:,:]
2×3 Array{Float64,2}:
 0.831352  0.615277  0.481671
 0.587861  0.531014  0.375483

julia> median(B)
ERROR: MethodError: no method matching isless(::Array{Float64,2}, ::Array{Float64,2})...

And for quantile, neither quantile(B, 0.5) nor quantile(A,0.5,dims=3) work.

  1. Why don’t these functions admit/similar same arguments as mean, std, var?

  2. Is there a more elegant way to convert from vector B to array C? [Something like hcat(B...,newdim=true)

1 Like

mean just isn’t looking at the contents of the iterator, but you can add matrices and divide by a scalar, as if they were numbers. But you can’t compare them with <, which is where median fails. To get the same result as median(A,dims=3) someone would have to specially write this case.

Yes:

A == cat(B..., dims=3) # slow, if B is large
A == reshape(reduce(hcat, B), 2,3,5) # fast
A == LazyStack.stack(B) # instant? (a view)
1 Like

What about quantile?

[I could certainly hack up some methods myself to do what I want, but I’m not a super good programmer, so someone would have to fix them with argument type specification, efficiency, etc.]

It looks from methods(quantile) like this only works on 1D things. But you can feed it slices:

Q1 = mapslices(v -> quantile(v, 0.25), A, dims=3)
Q1 == TensorCast.@cast Q[i,j,_] := quantile(A[i,j,:], 0.25)
1 Like

I tested the four methods of re-creating the multidimensional array (C) from my vector B.

So here are the results:

julia> @btime begin
       C = Array{Float64}(undef,size(B[1])...,length(B))
       for i in 1:length(B)
              C[:,:,i] = B[i]
       end
       end
  2.167 μs (24 allocations: 976 bytes)

julia> @btime C = cat(B..., dims=3);
  7.225 μs (81 allocations: 3.48 KiB)

julia> @btime A = reshape(reduce(hcat, B), 2,3,5);
  323.305 ns (4 allocations: 480 bytes)

julia> @btime stack(B);
  25.301 ns (1 allocation: 16 bytes)

Clear winner is the LazyStack.stack() (“view”) method. Disadvantage of this method is the need to install another package. Method reshape is also good.

Perhaps a silly question… If I do:

julia> using Statistics
#
julia> A = rand(2,3,5)
julia> B = [A[:,:,i] for i in 1:5];
#
julia> mean(A,dims=3)
2×3×1 Array{Float64,3}:
[:, :, 1] =
 0.722285  0.405353  0.461026
 0.704911  0.322218  0.8064

julia> mean(B)
2×3 Array{Float64,2}:
 0.722285  0.405353  0.461026
 0.704911  0.322218  0.8064

I get different types, depending on whether I use mean(A,dims=3) or mean(B). What I really want is the output of mean(B), which I alternatively can get with mean(A,dims=3)[:,:].

So for general, higher dimensional arrays, how can I produce the [:,:] at the end of the result of mean(A,dims=3), i.e., mean(A,dims=3)[:,:] – if the array A is, say, 4 dimensional, I need to use [:,:,:], etc. How can I produce the operation of [:,:,..., :] in code when I have to query to find the number of dimensions?

Note that this copies the result of mean. You can do dropdims(mean(A,dims=3), dims=3) or reshape(mean(A,dims=3), axes(A)[1:2]) which are a bit ugly… you can also write vec(mean(A, dims=(1,3))) when there is just one survivor.

1 Like

I tried with the reshape idea already, but – as you say – it is ugly. I’ll test the dropdims method, which seems fine.

I put this macro in my personal Utils package:

using Match

macro dropd(expr)
    func, args, dims = @match expr begin
        Expr(h, [func, args..., kws]), if h == :call end => @match kws begin
            Expr(h, [ds, dims]), if h == :kw && ds == :dims end => (func, args, dims)
            _ => error("Dims not where expected")
        end
        _ => error("Not a function call")
    end
    res = quote
        dims = $(esc(dims))
        dropdims($(esc(func))($(map(esc, args)...), dims=dims), dims=dims)
    end
    return res
end

and then use it in these ways:

@dropd(mean(array, dims=3))
@dropd(mean(array, dims=(1, 3)))
# works for any other function called like this, not only mean

I find it pretty convenient for such a common task.

Pieces of the macro originally taken from some discussions around here.

I think this is ‘lazy’, so it doesn’t actually create the array. So optimal performance depends on what you need the array for. Performance when operating on a lazy array may in some cases be poor. (I’m not familiar with LazyStack, though.) ‘Lazy’ often means ‘postpone the work until it’s needed’.

BTW, you should probably use variable interpolation with microbenchmarks like these.

Here are my attempts of functions – for median:

function median_my(A::AbstractArray; dims=1)
    if typeof(A) <: AbstractVector
        # Convert to array which `mean` can handle
        sA = size(A[1])
        nA = length(sA)
        AA = reshape(reduce(hcat,A),sA...,length(A))
        return dropdims(mean(AA,dims=nA+1),dims=nA+1)
    else
        return mean(A,dims=dims)
    end
end

and for quantiles:

function quantile_my(A::AbstractArray,qvec;dims=1)
    if !(typeof(qvec) <: AbstractVector)
        qvec = [qvec]
    end
    if typeof(A) <: AbstractVector
        # Convert to array which can be handled by mapslices
        sA = size(A[1])
        nA = length(sA)
        AA = reshape(reduce(hcat,A),sA...,length(A))
        dims = nA+1
        Q = [dropdims((mapslices(v -> quantile(v, q), AA, dims=dims)),dims=dims) for q in qvec]
    else
        Q = [mapslices(v -> quantile(v, q), A, dims=dims) for q in qvec]
    end
    return Q
end

I’m not a pro Julia programmer, so I’m sure someone can improve on these. But this is more or less methods I wish were built into Base.Statistics – to make the operation of median and quantile somewhat similar to mean, std, and var.

Indeed. If doing something fairly simple like broadcasting a function, then the cost of this lazy view seems pretty low (although perhaps there are edge cases). If doing something like matrix multiplication afterwards, then the cost of making a dense matrix will probably pay off.

But those definitions don’t make sense as median and quantile. It’s not about your programming skills.

The median of an array of matrices should be the ‘middle’ one when you sort them. But how do you sort matrices? How do you decide which matrix is ‘bigger’? You will have to make a decision, for example sort them according to norm.

1 Like

Here is a use-case of the methods which (in my view) would be useful in Base.Statistics

A: A simple Gillespie type simulation of a SIR model:
image

since I wrote the code myself, I need to compute the statistics using median_my and/or quantile_my.
B: median with quantiles.
image