# Base.Statistics -- I'm confused by the differences in arguments

Suppose I have a number of experiments, say 5, each of which consist of a 2x3 matrix of data. I can either organize these in a 2x3x5 array `A, C`, or in a vector of 5 elements of 2x3 matrices (`B`). For illustration:

``````# Organized in array
A = rand(2,3,5)
# Organized in vector
B = [A[:,:,i] for i in 1:size(A)]
# Convert from vector to array
C = Array{Float64}(undef,size(B)...,length(B))
for i in 1:length(B)
C[:,:,i] = B[i]
end
``````

Some questions:

1. In the `Statistics` package, I assume that arrays `A,C` (which are equal) qualify as `AbstractArray` while the vector `B` qualifies as `itr` or perhaps as `AbstractVector`??

The following pairs of operations give the same answer…

``````julia> mean(A,dims=3)[:,:] == mean(B)
true

julia> std(A,dims=3)[:,:] == std(B)
true

julia> var(A,dims=3)[:,:] == var(B)
true
``````

However, if I want to compute `median`, `quantiles`, etc., things are different…

``````julia> median(A,dims=3)[:,:]
2×3 Array{Float64,2}:
0.831352  0.615277  0.481671
0.587861  0.531014  0.375483

julia> median(B)
ERROR: MethodError: no method matching isless(::Array{Float64,2}, ::Array{Float64,2})...
``````

And for `quantile`, neither `quantile(B, 0.5)` nor `quantile(A,0.5,dims=3)` work.

1. Why don’t these functions admit/similar same arguments as `mean`, `std`, `var`?

2. Is there a more elegant way to convert from vector `B` to array `C`? [Something like `hcat(B...,newdim=true)`

1 Like

`mean` just isn’t looking at the contents of the iterator, but you can add matrices and divide by a scalar, as if they were numbers. But you can’t compare them with `<`, which is where `median` fails. To get the same result as `median(A,dims=3)` someone would have to specially write this case.

Yes:

``````A == cat(B..., dims=3) # slow, if B is large
A == reshape(reduce(hcat, B), 2,3,5) # fast
A == LazyStack.stack(B) # instant? (a view)
``````
1 Like

What about `quantile`?

[I could certainly hack up some methods myself to do what I want, but I’m not a super good programmer, so someone would have to fix them with argument type specification, efficiency, etc.]

It looks from `methods(quantile)` like this only works on 1D things. But you can feed it slices:

``````Q1 = mapslices(v -> quantile(v, 0.25), A, dims=3)
Q1 == TensorCast.@cast Q[i,j,_] := quantile(A[i,j,:], 0.25)
``````
1 Like

I tested the four methods of re-creating the multidimensional array (`C`) from my vector `B`.

So here are the results:

``````julia> @btime begin
C = Array{Float64}(undef,size(B)...,length(B))
for i in 1:length(B)
C[:,:,i] = B[i]
end
end
2.167 μs (24 allocations: 976 bytes)

julia> @btime C = cat(B..., dims=3);
7.225 μs (81 allocations: 3.48 KiB)

julia> @btime A = reshape(reduce(hcat, B), 2,3,5);
323.305 ns (4 allocations: 480 bytes)

julia> @btime stack(B);
25.301 ns (1 allocation: 16 bytes)
``````

Clear winner is the `LazyStack.stack()` (“view”) method. Disadvantage of this method is the need to install another package. Method `reshape` is also good.

Perhaps a silly question… If I do:

``````julia> using Statistics
#
julia> A = rand(2,3,5)
julia> B = [A[:,:,i] for i in 1:5];
#
julia> mean(A,dims=3)
2×3×1 Array{Float64,3}:
[:, :, 1] =
0.722285  0.405353  0.461026
0.704911  0.322218  0.8064

julia> mean(B)
2×3 Array{Float64,2}:
0.722285  0.405353  0.461026
0.704911  0.322218  0.8064
``````

I get different types, depending on whether I use `mean(A,dims=3)` or `mean(B)`. What I really want is the output of `mean(B)`, which I alternatively can get with `mean(A,dims=3)[:,:]`.

So for general, higher dimensional arrays, how can I produce the `[:,:]` at the end of the result of `mean(A,dims=3)`, i.e., `mean(A,dims=3)[:,:]` – if the array `A` is, say, 4 dimensional, I need to use `[:,:,:]`, etc. How can I produce the operation of `[:,:,..., :]` in code when I have to query to find the number of dimensions?

Note that this copies the result of `mean`. You can do `dropdims(mean(A,dims=3), dims=3)` or `reshape(mean(A,dims=3), axes(A)[1:2])` which are a bit ugly… you can also write `vec(mean(A, dims=(1,3)))` when there is just one survivor.

1 Like

I tried with the `reshape` idea already, but – as you say – it is ugly. I’ll test the `dropdims` method, which seems fine.

I put this macro in my personal `Utils` package:

``````using Match

macro dropd(expr)
func, args, dims = @match expr begin
Expr(h, [func, args..., kws]), if h == :call end => @match kws begin
Expr(h, [ds, dims]), if h == :kw && ds == :dims end => (func, args, dims)
_ => error("Dims not where expected")
end
_ => error("Not a function call")
end
res = quote
dims = \$(esc(dims))
dropdims(\$(esc(func))(\$(map(esc, args)...), dims=dims), dims=dims)
end
return res
end
``````

and then use it in these ways:

``````@dropd(mean(array, dims=3))
@dropd(mean(array, dims=(1, 3)))
# works for any other function called like this, not only mean
``````

I find it pretty convenient for such a common task.

Pieces of the macro originally taken from some discussions around here.

I think this is ‘lazy’, so it doesn’t actually create the array. So optimal performance depends on what you need the array for. Performance when operating on a lazy array may in some cases be poor. (I’m not familiar with `LazyStack`, though.) ‘Lazy’ often means ‘postpone the work until it’s needed’.

BTW, you should probably use variable interpolation with microbenchmarks like these.

Here are my attempts of functions – for median:

``````function median_my(A::AbstractArray; dims=1)
if typeof(A) <: AbstractVector
# Convert to array which `mean` can handle
sA = size(A)
nA = length(sA)
AA = reshape(reduce(hcat,A),sA...,length(A))
return dropdims(mean(AA,dims=nA+1),dims=nA+1)
else
return mean(A,dims=dims)
end
end
``````

and for quantiles:

``````function quantile_my(A::AbstractArray,qvec;dims=1)
if !(typeof(qvec) <: AbstractVector)
qvec = [qvec]
end
if typeof(A) <: AbstractVector
# Convert to array which can be handled by mapslices
sA = size(A)
nA = length(sA)
AA = reshape(reduce(hcat,A),sA...,length(A))
dims = nA+1
Q = [dropdims((mapslices(v -> quantile(v, q), AA, dims=dims)),dims=dims) for q in qvec]
else
Q = [mapslices(v -> quantile(v, q), A, dims=dims) for q in qvec]
end
return Q
end
``````

I’m not a pro Julia programmer, so I’m sure someone can improve on these. But this is more or less methods I wish were built into `Base.Statistics` – to make the operation of `median` and `quantile` somewhat similar to `mean`, `std`, and `var`.

Indeed. If doing something fairly simple like broadcasting a function, then the cost of this lazy view seems pretty low (although perhaps there are edge cases). If doing something like matrix multiplication afterwards, then the cost of making a dense matrix will probably pay off.

But those definitions don’t make sense as median and quantile. It’s not about your programming skills.

The median of an array of matrices should be the ‘middle’ one when you sort them. But how do you sort matrices? How do you decide which matrix is ‘bigger’? You will have to make a decision, for example sort them according to `norm`.

1 Like

Here is a use-case of the methods which (in my view) would be useful in `Base.Statistics`

A: A simple Gillespie type simulation of a SIR model: since I wrote the code myself, I need to compute the statistics using `median_my` and/or `quantile_my`.
B: median with quantiles. 