Dear Julia experts. I want to do a split-apply-combine operation, but rather than rely on DataFrames, I want to learn how to program it myself.
Simple example: I have one vector vx, which I want to cut into irregularly sized chunks, and then calculate, e.g., the means of vx (or of another vector vy) within these vx-cut categories. (In R, I could use lapply( split(vx, cut(vx, cutpoints)), mean ). R's mclapply can even spread the work across multiple cores.)
- I need a demonstration data set and some cut points:
julia> ( using Distributions; srand(0);
vx= rand( Binomial( 10, 0.4 ), 10_000 );
cutpoints=[ -12, 2, 5, 7, 12 ]; )
- I categorize vx. The default gives me nice text, but then I am stuck.
julia> ( using CategoricalArrays;
vc= CategoricalArrays.cut( vx, cutpoints ) )
10000-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"[5, 7)"
"[5, 7)"
"[2, 5)"
⋮
julia> levels( vc )
4-element Array{String,1}:
"[-12, 2)"
"[2, 5)"
"[5, 7)"
"[7, 12)"
Hmmm…how do I make integers out of these string categories? Does CategoricalArrays have a function for this? I read the manual but could not find one.
- Alternative: the following seems over-complicated, but it does give integer categories:
julia> vi= parse.(Int, CategoricalArrays.cut( vx, cutpoints ,
labels=string.([1:(length(cutpoints)-1);]) ));
- With integer categories, I can now use the indicatormat() function:
julia> using StatsBase ## indicatormat()
julia> for j=1:4; println( j, ": ", mean( vx[ indicatormat( vi )[j,:] ] ) ); end
1: 0.8519362186788155
2: 3.221335145235264
3: 5.363953120050681
4: 7.282398452611218
I tried passing vc instead of vi to indicatormat, but that did not work; it needs integers.
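For comparison, here is a minimal sketch of the same split-apply-combine steps using only Base and the Statistics standard library, assuming Julia 1.x. The names binindex, splitby, vx_small and vi_small are made up for illustration and do not come from any package. It treats the cut points as left-closed intervals, like the labels shown above, and uses a small toy vector rather than the random data.

```julia
using Statistics  # provides mean (standard library in Julia 1.x)

# Left-closed bins: index i means breaks[i] <= x < breaks[i+1].
# (Hypothetical helper; values outside the breaks get index 0 or length(breaks).)
binindex(x, breaks) = searchsortedlast(breaks, x)

# Hypothetical helper: collect the values of `v` under their integer label in `g`.
function splitby(v, g)
    groups = Dict{eltype(g), Vector{eltype(v)}}()
    for (x, k) in zip(v, g)
        # create an empty vector the first time a label is seen, then append
        push!(get!(() -> eltype(v)[], groups, k), x)
    end
    return groups
end

cutpoints = [-12, 2, 5, 7, 12]
vx_small  = [0, 1, 2, 4, 5, 6, 7, 10]            # toy data, not the random draw
vi_small  = binindex.(vx_small, Ref(cutpoints))  # integer categories (split)
groups    = splitby(vx_small, vi_small)
means     = Dict(k => mean(x) for (k, x) in groups)  # apply + combine

for k in sort(collect(keys(means)))
    println(k, ": ", means[k])
end
```

Unlike cut, this sketch does not raise an error for values outside the outermost cut points, so the range would need checking separately. To average a second vector vy within the vx categories, the same splitby call could take vy as its first argument.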
Is there a much better way? And does Julia have equivalents of R's split and mclapply functions?
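On the mclapply side, one candidate might be the Threads.@threads macro from Base. The following is only a sketch under the assumption of Julia 1.x started with more than one thread (for example via the JULIA_NUM_THREADS environment variable); the chunks are hard-coded toy groups, not the output of a real split.

```julia
using Statistics               # provides mean
using Base.Threads: @threads

# Toy groups standing in for the result of a split step (assumed data).
chunks  = [[0, 1], [2, 4], [5, 6], [7, 10]]
results = Vector{Float64}(undef, length(chunks))

# Each iteration writes to its own slot, so no locking is needed.
@threads for i in eachindex(chunks)
    results[i] = mean(chunks[i])
end

println(results)
```

With a single thread the loop still runs, just serially, so the result is the same either way.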
Guidance appreciated.
/iaw