Operations By Computed CategoricalArrays (lapply-split-mean)

ivo_welch · February 18, 2018, 12:57am

Dear Julia experts. I want to do a split-apply-combine operation, but instead of DataFrames, I want to learn how this should be programmed.

Simple example: I have one vector vx, which I want to cut into irregularly sized chunks, and then calculate, e.g., the means of vx (or vy) within these vx-cut categories. (In R, I could lapply( split(1:length(vx), cut(vx,cutpoints), mean ). R mclapply can even feed it to multicores.)

I need a demonstration data set and some cut points:

julia> ( using Distributions; srand(0);
         vx= rand( Binomial( 10, 0.4 ), 10_000 );
         cutpoints=[ -12, 2, 5, 7, 12 ]; )

I categorize vx. The default gives me nice text, but then I am stuck.

julia> ( using CategoricalArrays;
        vc= CategoricalArrays.cut( vx, cutpoints ) )
10000-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "[5, 7)"
 "[5, 7)"
 "[2, 5)"
     ⋮

julia> levels( vc )
4-element Array{String,1}:
 "[-12, 2)"
 "[2, 5)"
 "[5, 7)"
 "[7, 12)"

Hmmm…how do I make integers out of these string categories? Does CategoricalArray have such a function? I RTFM, but missed it.

Alternative: the following seems over-complicated, but it does give integer categories:

julia> vi= parse.(Int, CategoricalArrays.cut( vx, cutpoints ,
             labels=string.([1:(length(cutpoints)-1);]) ));

With integer categories, I can now use the indicatormat() function

julia> using Statsbase    ## indicatormat()

julia> for j=1:4; println( j, ": ", mean( vx[ indicatormat( vi )[j,:] ] ) ); end
1: 0.8519362186788155
2: 3.221335145235264
3: 5.363953120050681
4: 7.282398452611218

I tried to work with the vc’s instead of the vi’s in the indicatormat, but this did not work. It needed integers.

Is there a lot better? And does julia have R-equivalent split and mclapply functions?

Guidance appreciated.

/iaw

Nosferican · February 18, 2018, 2:11am

using StatsBase, Distributions, CategoricalArrays
srand(0)
vx = rand(Binomial(10, 0.4), Int(1e4))
cutpoints = [-12, 2, 5, 7, 12]
vc = cut(vx, cutpoints)
vi = get.(levelsmap(vc), vc, nothing)
function lapply(obj::AbstractVector, ind::AbstractVector, func::Function)
    map(elem -> (elem, func(obj[find(elem .== ind)])), sort(unique(ind)))
end
lapply(vx, vc, mean)
lapply(vx, vi, mean)

To convert from values to indicators in an vector use levelsmap with recode!, recode, or get. However, you don’t need to do the transformation at all. I left you a basic lapply implementation. You could use the BitVector instead of find, but it could crash due to missing so it is a recommended safeguard.

xiaodai · February 18, 2018, 3:10am

There is a fastby function in FastGroupBy.jl. But it’s not fully developed yet. Watch this space as I will blog about it once I am done

nalimilan · February 18, 2018, 12:52pm

That works, but it can be simplified a bit:

vi = vc.refs is enough and preserves the order (we should probably provide a function for that)
sort(unique(ind)) can simply be levels(ind)

ivo_welch · February 18, 2018, 11:14pm

One of the nicest features of R is that mclapply() works just like lapply() but does its work in parallel on multiple processors. I am wondering whether Nosferican’s lapply function just needs pmap instead of map, plus a fourth argument to pass further arguments into the func (which, in R, is often used for na.rm=TRUE, etc)…

Nosferican · February 19, 2018, 12:34am

The function is meant to work with any AbstractVector, but one could specialize it for CategoricalVector. The cv.refs give the default UInt8 code which is not very useful for knowing which is which, but if these preserve order that could be a potential workaround. For the method dispatching on CategoricalVector, levels would be optimal.

Nosferican · February 19, 2018, 12:46am

I must admit I have never or very few times used parallel processing with Julia. It might have been unfamiliarity, lack of need, or probably a combination of those. For my work, I used parallel processing heavily in R for some routines (for others, data.table in-place implementations are just orders of magnitude better). As for the keyword arguments for lapply,

function lapply(obj::AbstractVector, ind::AbstractVector, func::Function;
                na_rm = false)
    if na_rm
        valid = .!(ismissing.(obj) | ismissing(ind))
        obj = filter(valid, obj)
        ind = filter(valid, ind)
        output = map(elem -> (elem, func(obj[find(elem .== ind)])),
                     sort(unique(ind)))
    else
        output = map(elem -> (elem, func(obj[find(elem .== ind)])),
                     sort(unique(ind)))
    end
    return output
end

another alternative is to pass the na.rm in the function,

lapply(vx, vc, elem -> mean(skipmissing(elem)))

ivo_welch · February 19, 2018, 1:36am

hi N—sorry, I mistakenly assumed R usage was familiar. The idea is to have ‘…’ in the arguments to allow passing anything additional into the function. It is used for passing na.rm=TRUE, only because, say, the mean function understands this. it is not handled by lapply. julia’s mean function does not have a na.rm argument afaik, so useless here. in R, the opt parameters are also often used to pass exogenous constant parameters, etc., I can illustrate what I mean by using the quantile function’s percentile parameter.

julia> function plapply(obj::AbstractVector, ind::AbstractVector, func::Function, x...)
           pmap(elem -> (elem, func(obj[find(elem .== ind)], x...)), levels(ind))
       end
plapply (generic function with 1 method)

and now parallel use is

julia>  c= rand('a':'d', 100);

julia> plapply( 1:100, c, quantile, [ 0.25, 0.5, 0.75 ] )
4-element Array{Tuple{Char,Array{Float64,1}},1}:
 ('c', [10.0, 47.0, 73.25])
 ('a', [27.0, 44.0, 76.5])
 ('b', [34.5, 49.0, 74.0])
 ('d', [33.0, 54.0, 72.0])

pretty nifty.

/iaw

nalimilan · February 19, 2018, 8:38am

cut always returns a CategoricalArray, so extracting the refs field will always work (and cut ensures the codes reflect the order). levels has a fallback for AbstractArrays which calls sort(unique(x)) and drops missing values.

Nosferican · February 19, 2018, 9:10am

I use tapply in various context which categorical variables are not used or needed, mostly Vector{Int64}. Didn’t know the levels fallback which is nice. Good to know how it handles missing values in case those need to be coded to something else before the routines.

Topic		Replies	Views
CategoricalArrays.jl syntax question: vectorizing string operation Data question , categoricalarrays	7	125	July 18, 2024
CategoricalArrays Binning Data categoricalarrays	3	235	May 26, 2023
Julian methods for batched computations without loops General Usage question	5	601	February 11, 2022
Creating data bins with numeric labels with cut() General Usage categoricalarrays	12	1714	March 17, 2023
What's julia's solution to tapply or accumarray? General Usage question , dataframes , arrays , splitapplycombine	15	2074	January 3, 2022

Operations *By* Computed CategoricalArrays (lapply-split-mean)

Related topics

Operations By Computed CategoricalArrays (lapply-split-mean)