Operations *By* Computed CategoricalArrays (lapply-split-mean)

Dear Julia experts. I want to do a split-apply-combine operation, but instead of DataFrames, I want to learn how this should be programmed.

Simple example: I have one vector vx, which I want to cut into irregularly sized chunks, and then calculate, e.g., the means of vx (or vy) within these vx-cut categories. (In R, I could lapply( split(1:length(vx), cut(vx,cutpoints), mean ). R mclapply can even feed it to multicores.)

  1. I need a demonstration data set and some cut points:
julia> ( using Distributions; srand(0);
         vx= rand( Binomial( 10, 0.4 ), 10_000 );
         cutpoints=[ -12, 2, 5, 7, 12 ]; )
  1. I categorize vx. The default gives me nice text, but then I am stuck.
julia> ( using CategoricalArrays;
        vc= CategoricalArrays.cut( vx, cutpoints ) )
10000-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "[5, 7)"
 "[5, 7)"
 "[2, 5)"
     ⋮

julia> levels( vc )
4-element Array{String,1}:
 "[-12, 2)"
 "[2, 5)"
 "[5, 7)"
 "[7, 12)"

Hmmm…how do I make integers out of these string categories? Does CategoricalArray have such a function? I RTFM, but missed it.

  1. Alternative: the following seems over-complicated, but it does give integer categories:
julia> vi= parse.(Int, CategoricalArrays.cut( vx, cutpoints ,
             labels=string.([1:(length(cutpoints)-1);]) ));
  1. With integer categories, I can now use the indicatormat() function
julia> using Statsbase    ## indicatormat()

julia> for j=1:4; println( j, ": ", mean( vx[ indicatormat( vi )[j,:] ] ) ); end
1: 0.8519362186788155
2: 3.221335145235264
3: 5.363953120050681
4: 7.282398452611218

I tried to work with the vc’s instead of the vi’s in the indicatormat, but this did not work. It needed integers.

Is there a lot better? And does julia have R-equivalent split and mclapply functions?

Guidance appreciated.

/iaw

using StatsBase, Distributions, CategoricalArrays
srand(0)
vx = rand(Binomial(10, 0.4), Int(1e4))
cutpoints = [-12, 2, 5, 7, 12]
vc = cut(vx, cutpoints)
vi = get.(levelsmap(vc), vc, nothing)
function lapply(obj::AbstractVector, ind::AbstractVector, func::Function)
    map(elem -> (elem, func(obj[find(elem .== ind)])), sort(unique(ind)))
end
lapply(vx, vc, mean)
lapply(vx, vi, mean)

To convert from values to indicators in an vector use levelsmap with recode!, recode, or get. However, you don’t need to do the transformation at all. I left you a basic lapply implementation. You could use the BitVector instead of find, but it could crash due to missing so it is a recommended safeguard.

1 Like

There is a fastby function in FastGroupBy.jl. But it’s not fully developed yet. Watch this space as I will blog about it once I am done

1 Like

That works, but it can be simplified a bit:

  • vi = vc.refs is enough and preserves the order (we should probably provide a function for that)
  • sort(unique(ind)) can simply be levels(ind)
1 Like

One of the nicest features of R is that mclapply() works just like lapply() but does its work in parallel on multiple processors. I am wondering whether Nosferican’s lapply function just needs pmap instead of map, plus a fourth argument to pass further arguments into the func (which, in R, is often used for na.rm=TRUE, etc)…

The function is meant to work with any AbstractVector, but one could specialize it for CategoricalVector. The cv.refs give the default UInt8 code which is not very useful for knowing which is which, but if these preserve order that could be a potential workaround. For the method dispatching on CategoricalVector, levels would be optimal.

I must admit I have never or very few times used parallel processing with Julia. It might have been unfamiliarity, lack of need, or probably a combination of those. For my work, I used parallel processing heavily in R for some routines (for others, data.table in-place implementations are just orders of magnitude better). As for the keyword arguments for lapply,

function lapply(obj::AbstractVector, ind::AbstractVector, func::Function;
                na_rm = false)
    if na_rm
        valid = .!(ismissing.(obj) | ismissing(ind))
        obj = filter(valid, obj)
        ind = filter(valid, ind)
        output = map(elem -> (elem, func(obj[find(elem .== ind)])),
                     sort(unique(ind)))
    else
        output = map(elem -> (elem, func(obj[find(elem .== ind)])),
                     sort(unique(ind)))
    end
    return output
end

another alternative is to pass the na.rm in the function,

lapply(vx, vc, elem -> mean(skipmissing(elem)))
1 Like

hi N—sorry, I mistakenly assumed R usage was familiar. The idea is to have ‘…’ in the arguments to allow passing anything additional into the function. It is used for passing na.rm=TRUE, only because, say, the mean function understands this. it is not handled by lapply. julia’s mean function does not have a na.rm argument afaik, so useless here. in R, the opt parameters are also often used to pass exogenous constant parameters, etc., I can illustrate what I mean by using the quantile function’s percentile parameter.

julia> function plapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
           pmap(elem -> (elem, func(obj[find(elem .== ind)], x...)), levels(ind))
       end
plapply (generic function with 1 method)

and now parallel use is

julia>  c= rand('a':'d', 100);

julia> plapply( 1:100, c, quantile, [ 0.25, 0.5, 0.75 ] )
4-element Array{Tuple{Char,Array{Float64,1}},1}:
 ('c', [10.0, 47.0, 73.25])
 ('a', [27.0, 44.0, 76.5])
 ('b', [34.5, 49.0, 74.0])
 ('d', [33.0, 54.0, 72.0])

pretty nifty.

/iaw

cut always returns a CategoricalArray, so extracting the refs field will always work (and cut ensures the codes reflect the order). levels has a fallback for AbstractArrays which calls sort(unique(x)) and drops missing values.

I use tapply in various context which categorical variables are not used or needed, mostly Vector{Int64}. Didn’t know the levels fallback which is nice. Good to know how it handles missing values in case those need to be coded to something else before the routines.