# Base.Statistics -- I'm confused by the differences in arguments

Suppose I have a number of experiments, say 5, each of which consist of a 2x3 matrix of data. I can either organize these in a 2x3x5 array `A, C`, or in a vector of 5 elements of 2x3 matrices (`B`). For illustration:

``````# Organized in array
A = rand(2,3,5)
# Organized in vector
B = [A[:,:,i] for i in 1:size(A)[3]]
# Convert from vector to array
C = Array{Float64}(undef,size(B[1])...,length(B))
for i in 1:length(B)
C[:,:,i] = B[i]
end
``````

Some questions:

1. In the `Statistics` package, I assume that arrays `A,C` (which are equal) qualify as `AbstractArray` while the vector `B` qualifies as `itr` or perhaps as `AbstractVector`??

The following pairs of operations give the same answerâŚ

``````julia> mean(A,dims=3)[:,:] == mean(B)
true

julia> std(A,dims=3)[:,:] == std(B)
true

julia> var(A,dims=3)[:,:] == var(B)
true
``````

However, if I want to compute `median`, `quantiles`, etc., things are differentâŚ

``````julia> median(A,dims=3)[:,:]
2Ă3 Array{Float64,2}:
0.831352  0.615277  0.481671
0.587861  0.531014  0.375483

julia> median(B)
ERROR: MethodError: no method matching isless(::Array{Float64,2}, ::Array{Float64,2})...
``````

And for `quantile`, neither `quantile(B, 0.5)` nor `quantile(A,0.5,dims=3)` work.

1. Why donât these functions admit/similar same arguments as `mean`, `std`, `var`?

2. Is there a more elegant way to convert from vector `B` to array `C`? [Something like `hcat(B...,newdim=true)`

1 Like

`mean` just isnât looking at the contents of the iterator, but you can add matrices and divide by a scalar, as if they were numbers. But you canât compare them with `<`, which is where `median` fails. To get the same result as `median(A,dims=3)` someone would have to specially write this case.

Yes:

``````A == cat(B..., dims=3) # slow, if B is large
A == reshape(reduce(hcat, B), 2,3,5) # fast
A == LazyStack.stack(B) # instant? (a view)
``````
1 Like

What about `quantile`?

[I could certainly hack up some methods myself to do what I want, but Iâm not a super good programmer, so someone would have to fix them with argument type specification, efficiency, etc.]

It looks from `methods(quantile)` like this only works on 1D things. But you can feed it slices:

``````Q1 = mapslices(v -> quantile(v, 0.25), A, dims=3)
Q1 == TensorCast.@cast Q[i,j,_] := quantile(A[i,j,:], 0.25)
``````
1 Like

I tested the four methods of re-creating the multidimensional array (`C`) from my vector `B`.

So here are the results:

``````julia> @btime begin
C = Array{Float64}(undef,size(B[1])...,length(B))
for i in 1:length(B)
C[:,:,i] = B[i]
end
end
2.167 Îźs (24 allocations: 976 bytes)

julia> @btime C = cat(B..., dims=3);
7.225 Îźs (81 allocations: 3.48 KiB)

julia> @btime A = reshape(reduce(hcat, B), 2,3,5);
323.305 ns (4 allocations: 480 bytes)

julia> @btime stack(B);
25.301 ns (1 allocation: 16 bytes)
``````

Clear winner is the `LazyStack.stack()` (âviewâ) method. Disadvantage of this method is the need to install another package. Method `reshape` is also good.

Perhaps a silly questionâŚ If I do:

``````julia> using Statistics
#
julia> A = rand(2,3,5)
julia> B = [A[:,:,i] for i in 1:5];
#
julia> mean(A,dims=3)
2Ă3Ă1 Array{Float64,3}:
[:, :, 1] =
0.722285  0.405353  0.461026
0.704911  0.322218  0.8064

julia> mean(B)
2Ă3 Array{Float64,2}:
0.722285  0.405353  0.461026
0.704911  0.322218  0.8064
``````

I get different types, depending on whether I use `mean(A,dims=3)` or `mean(B)`. What I really want is the output of `mean(B)`, which I alternatively can get with `mean(A,dims=3)[:,:]`.

So for general, higher dimensional arrays, how can I produce the `[:,:]` at the end of the result of `mean(A,dims=3)`, i.e., `mean(A,dims=3)[:,:]` â if the array `A` is, say, 4 dimensional, I need to use `[:,:,:]`, etc. How can I produce the operation of `[:,:,..., :]` in code when I have to query to find the number of dimensions?

Note that this copies the result of `mean`. You can do `dropdims(mean(A,dims=3), dims=3)` or `reshape(mean(A,dims=3), axes(A)[1:2])` which are a bit uglyâŚ you can also write `vec(mean(A, dims=(1,3)))` when there is just one survivor.

1 Like

I tried with the `reshape` idea already, but â as you say â it is ugly. Iâll test the `dropdims` method, which seems fine.

I put this macro in my personal `Utils` package:

``````using Match

macro dropd(expr)
func, args, dims = @match expr begin
Expr(h, [func, args..., kws]), if h == :call end => @match kws begin
Expr(h, [ds, dims]), if h == :kw && ds == :dims end => (func, args, dims)
_ => error("Dims not where expected")
end
_ => error("Not a function call")
end
res = quote
dims = \$(esc(dims))
dropdims(\$(esc(func))(\$(map(esc, args)...), dims=dims), dims=dims)
end
return res
end
``````

and then use it in these ways:

``````@dropd(mean(array, dims=3))
@dropd(mean(array, dims=(1, 3)))
# works for any other function called like this, not only mean
``````

I find it pretty convenient for such a common task.

Pieces of the macro originally taken from some discussions around here.

I think this is âlazyâ, so it doesnât actually create the array. So optimal performance depends on what you need the array for. Performance when operating on a lazy array may in some cases be poor. (Iâm not familiar with `LazyStack`, though.) âLazyâ often means âpostpone the work until itâs neededâ.

BTW, you should probably use variable interpolation with microbenchmarks like these.

Here are my attempts of functions â for median:

``````function median_my(A::AbstractArray; dims=1)
if typeof(A) <: AbstractVector
# Convert to array which `mean` can handle
sA = size(A[1])
nA = length(sA)
AA = reshape(reduce(hcat,A),sA...,length(A))
return dropdims(mean(AA,dims=nA+1),dims=nA+1)
else
return mean(A,dims=dims)
end
end
``````

and for quantiles:

``````function quantile_my(A::AbstractArray,qvec;dims=1)
if !(typeof(qvec) <: AbstractVector)
qvec = [qvec]
end
if typeof(A) <: AbstractVector
# Convert to array which can be handled by mapslices
sA = size(A[1])
nA = length(sA)
AA = reshape(reduce(hcat,A),sA...,length(A))
dims = nA+1
Q = [dropdims((mapslices(v -> quantile(v, q), AA, dims=dims)),dims=dims) for q in qvec]
else
Q = [mapslices(v -> quantile(v, q), A, dims=dims) for q in qvec]
end
return Q
end
``````

Iâm not a pro Julia programmer, so Iâm sure someone can improve on these. But this is more or less methods I wish were built into `Base.Statistics` â to make the operation of `median` and `quantile` somewhat similar to `mean`, `std`, and `var`.

Indeed. If doing something fairly simple like broadcasting a function, then the cost of this lazy view seems pretty low (although perhaps there are edge cases). If doing something like matrix multiplication afterwards, then the cost of making a dense matrix will probably pay off.

But those definitions donât make sense as median and quantile. Itâs not about your programming skills.

The median of an array of matrices should be the âmiddleâ one when you sort them. But how do you sort matrices? How do you decide which matrix is âbiggerâ? You will have to make a decision, for example sort them according to `norm`.

1 Like

Here is a use-case of the methods which (in my view) would be useful in `Base.Statistics`âŚ

A: A simple Gillespie type simulation of a SIR model:

since I wrote the code myself, I need to compute the statistics using `median_my` and/or `quantile_my`.
B: median with quantiles.