Base.Statistics -- I'm confused by the differences in arguments

BLI · August 13, 2020, 12:45pm

Suppose I have a number of experiments, say 5, each of which consist of a 2x3 matrix of data. I can either organize these in a 2x3x5 array A, C, or in a vector of 5 elements of 2x3 matrices (B). For illustration:

# Organized in array
A = rand(2,3,5)
# Organized in vector
B = [A[:,:,i] for i in 1:size(A)[3]]
# Convert from vector to array
C = Array{Float64}(undef,size(B[1])...,length(B))
for i in 1:length(B)
       C[:,:,i] = B[i]
end

Some questions:

In the Statistics package, I assume that arrays A,C (which are equal) qualify as AbstractArray while the vector B qualifies as itr or perhaps as AbstractVector??

The following pairs of operations give the same answer…

julia> mean(A,dims=3)[:,:] == mean(B)
true

julia> std(A,dims=3)[:,:] == std(B)
true

julia> var(A,dims=3)[:,:] == var(B)
true

However, if I want to compute median, quantiles, etc., things are different…

julia> median(A,dims=3)[:,:]
2×3 Array{Float64,2}:
 0.831352  0.615277  0.481671
 0.587861  0.531014  0.375483

julia> median(B)
ERROR: MethodError: no method matching isless(::Array{Float64,2}, ::Array{Float64,2})...

And for quantile, neither quantile(B, 0.5) nor quantile(A,0.5,dims=3) work.

Why don’t these functions admit/similar same arguments as mean, std, var?
Is there a more elegant way to convert from vector B to array C? [Something like hcat(B...,newdim=true)

mcabbott · August 13, 2020, 12:56pm

mean just isn’t looking at the contents of the iterator, but you can add matrices and divide by a scalar, as if they were numbers. But you can’t compare them with <, which is where median fails. To get the same result as median(A,dims=3) someone would have to specially write this case.

Yes:

A == cat(B..., dims=3) # slow, if B is large
A == reshape(reduce(hcat, B), 2,3,5) # fast
A == LazyStack.stack(B) # instant? (a view)

BLI · August 13, 2020, 1:02pm

What about quantile?

[I could certainly hack up some methods myself to do what I want, but I’m not a super good programmer, so someone would have to fix them with argument type specification, efficiency, etc.]

mcabbott · August 13, 2020, 1:10pm

It looks from methods(quantile) like this only works on 1D things. But you can feed it slices:

Q1 = mapslices(v -> quantile(v, 0.25), A, dims=3)
Q1 == TensorCast.@cast Q[i,j,_] := quantile(A[i,j,:], 0.25)

BLI · August 13, 2020, 6:47pm

I tested the four methods of re-creating the multidimensional array (C) from my vector B.

So here are the results:

julia> @btime begin
       C = Array{Float64}(undef,size(B[1])...,length(B))
       for i in 1:length(B)
              C[:,:,i] = B[i]
       end
       end
  2.167 μs (24 allocations: 976 bytes)

julia> @btime C = cat(B..., dims=3);
  7.225 μs (81 allocations: 3.48 KiB)

julia> @btime A = reshape(reduce(hcat, B), 2,3,5);
  323.305 ns (4 allocations: 480 bytes)

julia> @btime stack(B);
  25.301 ns (1 allocation: 16 bytes)

Clear winner is the LazyStack.stack() (“view”) method. Disadvantage of this method is the need to install another package. Method reshape is also good.

BLI · August 14, 2020, 7:36am

Perhaps a silly question… If I do:

julia> using Statistics
#
julia> A = rand(2,3,5)
julia> B = [A[:,:,i] for i in 1:5];
#
julia> mean(A,dims=3)
2×3×1 Array{Float64,3}:
[:, :, 1] =
 0.722285  0.405353  0.461026
 0.704911  0.322218  0.8064

julia> mean(B)
2×3 Array{Float64,2}:
 0.722285  0.405353  0.461026
 0.704911  0.322218  0.8064

I get different types, depending on whether I use mean(A,dims=3) or mean(B). What I really want is the output of mean(B), which I alternatively can get with mean(A,dims=3)[:,:].

So for general, higher dimensional arrays, how can I produce the [:,:] at the end of the result of mean(A,dims=3), i.e., mean(A,dims=3)[:,:] – if the array A is, say, 4 dimensional, I need to use [:,:,:], etc. How can I produce the operation of [:,:,..., :] in code when I have to query to find the number of dimensions?

mcabbott · August 14, 2020, 8:33am

Note that this copies the result of mean. You can do dropdims(mean(A,dims=3), dims=3) or reshape(mean(A,dims=3), axes(A)[1:2]) which are a bit ugly… you can also write vec(mean(A, dims=(1,3))) when there is just one survivor.

BLI · August 14, 2020, 8:44am

I tried with the reshape idea already, but – as you say – it is ugly. I’ll test the dropdims method, which seems fine.

aplavin · August 14, 2020, 9:03am

I put this macro in my personal Utils package:

using Match

macro dropd(expr)
    func, args, dims = @match expr begin
        Expr(h, [func, args..., kws]), if h == :call end => @match kws begin
            Expr(h, [ds, dims]), if h == :kw && ds == :dims end => (func, args, dims)
            _ => error("Dims not where expected")
        end
        _ => error("Not a function call")
    end
    res = quote
        dims = $(esc(dims))
        dropdims($(esc(func))($(map(esc, args)...), dims=dims), dims=dims)
    end
    return res
end

and then use it in these ways:

@dropd(mean(array, dims=3))
@dropd(mean(array, dims=(1, 3)))
# works for any other function called like this, not only mean

I find it pretty convenient for such a common task.

Pieces of the macro originally taken from some discussions around here.

DNF · August 14, 2020, 9:17am

I think this is ‘lazy’, so it doesn’t actually create the array. So optimal performance depends on what you need the array for. Performance when operating on a lazy array may in some cases be poor. (I’m not familiar with LazyStack, though.) ‘Lazy’ often means ‘postpone the work until it’s needed’.

BTW, you should probably use variable interpolation with microbenchmarks like these.

BLI · August 14, 2020, 9:24am

Here are my attempts of functions – for median:

function median_my(A::AbstractArray; dims=1)
    if typeof(A) <: AbstractVector
        # Convert to array which `mean` can handle
        sA = size(A[1])
        nA = length(sA)
        AA = reshape(reduce(hcat,A),sA...,length(A))
        return dropdims(mean(AA,dims=nA+1),dims=nA+1)
    else
        return mean(A,dims=dims)
    end
end

and for quantiles:

function quantile_my(A::AbstractArray,qvec;dims=1)
    if !(typeof(qvec) <: AbstractVector)
        qvec = [qvec]
    end
    if typeof(A) <: AbstractVector
        # Convert to array which can be handled by mapslices
        sA = size(A[1])
        nA = length(sA)
        AA = reshape(reduce(hcat,A),sA...,length(A))
        dims = nA+1
        Q = [dropdims((mapslices(v -> quantile(v, q), AA, dims=dims)),dims=dims) for q in qvec]
    else
        Q = [mapslices(v -> quantile(v, q), A, dims=dims) for q in qvec]
    end
    return Q
end

I’m not a pro Julia programmer, so I’m sure someone can improve on these. But this is more or less methods I wish were built into Base.Statistics – to make the operation of median and quantile somewhat similar to mean, std, and var.

mcabbott · August 14, 2020, 9:32am

Indeed. If doing something fairly simple like broadcasting a function, then the cost of this lazy view seems pretty low (although perhaps there are edge cases). If doing something like matrix multiplication afterwards, then the cost of making a dense matrix will probably pay off.

DNF · August 14, 2020, 9:34am

But those definitions don’t make sense as median and quantile. It’s not about your programming skills.

The median of an array of matrices should be the ‘middle’ one when you sort them. But how do you sort matrices? How do you decide which matrix is ‘bigger’? You will have to make a decision, for example sort them according to norm.

BLI · August 14, 2020, 9:35am

Here is a use-case of the methods which (in my view) would be useful in Base.Statistics…

A: A simple Gillespie type simulation of a SIR model:

since I wrote the code myself, I need to compute the statistics using median_my and/or quantile_my.
B: median with quantiles.

Topic		Replies	Views
How to speed up `Statistics.median` along a given dimension? Performance question , python	9	1518	January 11, 2020
Statistics on random matrices Statistics	6	646	June 8, 2020
Drop array dimension with Statistics.mean etc General Usage array	3	1837	September 14, 2019
Arrays vs Vectors vs matrix (permutations) problems,confusion and clarification General Usage	10	1481	June 11, 2018
Quick Longitudinal Access to arrays General Usage	34	2288	March 3, 2018

Base.Statistics -- I'm confused by the differences in arguments

Related topics