Translate code from PooledDataArray to CategoricalArray

matthieu · January 27, 2018, 10:56pm

I have this code in FixedEffectModels where a function takes a DataFrame as an argument and returns a PooledDataArray. The goal of this function is to create a vector that indexes the groups defined by the distinct combinations of values across the columns. For observations where one of the values is missing, the corresponding entry of the result is missing. For instance, we have

group([1, 2, 3], [1, 2, 3]) = [1, 2, 3]
group([4, 5, 6], [7, 8, 9]) = [1, 2, 3]
group([4, 5, 6], [NA, 8, 9]) = [NA, 1, 2]
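
To make the expected behaviour concrete, here is a minimal sketch in plain Julia that reproduces the three examples above without any array package. It is not the FixedEffectModels implementation: the name `group_sketch` is hypothetical, `missing` stands in for `NA`, and groups are numbered in order of first appearance, which is an assumption (the code below numbers them by sorted combined code instead; the two agree on these examples).

```julia
# Hypothetical helper, for illustration only.
# Assigns one integer id per distinct combination of values across the columns.
# A row gets `missing` as soon as one of its values is missing.
function group_sketch(cols::AbstractVector...)
    n = length(first(cols))
    all(c -> length(c) == n, cols) ||
        throw(ArgumentError("all columns must have the same length"))
    ids = Dict{Any,Int}()                       # combination => group id
    out = Vector{Union{Int,Missing}}(undef, n)
    for i in 1:n
        key = map(c -> c[i], cols)              # tuple of the i-th value of each column
        if any(ismissing, key)
            out[i] = missing
        else
            # Reuse the id of a known combination, or create the next one.
            out[i] = get!(ids, key, length(ids) + 1)
        end
    end
    return out
end

group_sketch([1, 2, 3], [1, 2, 3])
group_sketch([4, 5, 6], [missing, 8, 9])
```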

I don’t know how to translate the code to CategoricalArray. Could someone help me to translate it?


function group(x::AbstractVector) 
	v = PooledDataArray(x)
	PooledDataArray(RefArray(v.refs), collect(1:length(v.pool)))
end

function pool_combine!(x::Array{UInt64, T}, dv::PooledDataVector, ngroups::Integer) where {T}
	@inbounds for i in 1:length(x)
	    # if previous one is NA or this one is NA, set to NA
	    x[i] = (dv.refs[i] == 0 || x[i] == zero(UInt64)) ? zero(UInt64) : x[i] + (dv.refs[i] - 1) * ngroups
	end
	return(x, ngroups * length(dv.pool))
end

function group(df::AbstractDataFrame) 
	isempty(df) && throw("df is empty")
	ncols = size(df, 2)
	v = df[1]
	ncols == 1 && return(group(v))
	if typeof(v) <: PooledDataVector
		x = convert(Array{UInt64}, v.refs)
	else
		v = PooledDataArray(v, v.na, UInt64)
		x = v.refs
	end
	ngroups = length(v.pool)
	for j = 2:ncols
		v = PooledDataArray(df[j])
		(x, ngroups) = pool_combine!(x, v, ngroups)
	end
	return(factorize!(x))
end

function reftype(sz) 
	sz <= typemax(UInt8)  ? UInt8 :
	sz <= typemax(UInt16) ? UInt16 :
	sz <= typemax(UInt32) ? UInt32 :
	UInt64
end

function factorize!(refs::Array)
	uu = unique(refs)
	sort!(uu)
	has_na = uu[1] == 0
	T = reftype(length(uu)-has_na)
	dict = Dict{eltype(refs), T}(zip(uu, (1-has_na):convert(T, length(uu)-has_na)))
	@inbounds @simd for i in 1:length(refs)
		 refs[i] = dict[refs[i]]
	end
	PooledDataArray(RefArray(refs), collect(1:(length(uu)-has_na)))
end
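
The arithmetic in `pool_combine!` is a mixed-radix encoding: each row's running code is extended by the new column's reference, and 0 is reserved for missing. The sketch below isolates that step on plain integer vectors so it can be run without DataArrays. The name `combine_codes!` and the explicit `nlevels` argument are assumptions made for illustration; in the code above the number of levels comes from `length(dv.pool)`.

```julia
# Hypothetical stand-alone version of the combination step.
# x       : running group codes, 0 meaning "missing so far"
# refs    : integer codes of the new column, 0 meaning missing
# ngroups : number of distinct codes that x can currently take
# nlevels : number of levels of the new column
function combine_codes!(x::Vector{UInt64}, refs::AbstractVector{<:Integer},
                        ngroups::Integer, nlevels::Integer)
    for i in eachindex(x, refs)
        # A missing value on either side makes the combined code missing (0).
        x[i] = (refs[i] == 0 || x[i] == 0) ? zero(UInt64) :
               x[i] + (refs[i] - 1) * ngroups
    end
    return x, ngroups * nlevels
end

x = UInt64[1, 2, 1, 0]
x, ngroups = combine_codes!(x, [1, 1, 2, 2], 2, 2)
# x is now [1, 2, 3, 0] and ngroups is 4
```

Because the combined codes can be sparse (not every combination occurs), a final pass such as `factorize!` is still needed to renumber them as 1, 2, 3, and so on.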

nalimilan · January 28, 2018, 9:58am

Have you faced any particular difficulties?

In group, you could replace the PooledDataArray constructor call with:

CategoricalArray{Int,1}(v.refs, CategoricalPool(collect(1:length(levels(v)))))

Then the main differences are that you should use length(levels(dv)) instead of length(dv.pool), and CategoricalArrays.index(dv.pool)[dv.refs[i]] instead of dv.refs[i].
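
As a rough sketch of the single-column case, the following assumes a recent CategoricalArrays release in which `levelcode` is exported and returns `missing` for missing entries. It sidesteps direct access to `refs` and the pool, so it is an alternative to the constructor call above rather than a line-by-line translation; the name `group_levelcodes` is hypothetical.

```julia
using CategoricalArrays

# Hypothetical helper: integer code of each element in level order,
# plus the number of levels (the analogue of length(v.pool)).
function group_levelcodes(x::AbstractVector)
    v = categorical(x)
    return levelcode.(v), length(levels(v))
end

codes, n = group_levelcodes(["b", "a", missing, "b"])
# With the default sorted levels ["a", "b"], "b" has code 2 and "a" has code 1.
```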

matthieu · January 29, 2018, 3:38pm

Thanks a lot. Could you succinctly give me the reasons for the last two changes? I have a hard time understanding the source code of CategoricalArray.

nalimilan · January 29, 2018, 4:21pm

That’s because the values in dv.refs do not follow the ordering of the levels; they follow the order in which each level was originally created. This allows the levels to be reordered without updating the references.
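
The distinction between a reference and a position in the level ordering can be illustrated with a toy model. This is not the CategoricalArrays implementation: the `ToyPool` type, its field names, and the helper functions are all hypothetical, chosen only to show why reordering levels leaves the references untouched.

```julia
# Toy model, for illustration only.
struct ToyPool
    index::Vector{String}   # levels in creation order; references point into this
    order::Vector{Int}      # order[r] = position of index[r] in the level ordering
end

# Levels as the user sees them, in their current ordering.
toy_levels(p::ToyPool) = p.index[sortperm(p.order)]

# Position of a reference in the level ordering (0 encodes missing).
toy_levelcode(p::ToyPool, ref::Integer) = ref == 0 ? missing : p.order[ref]

# Levels were created as "low", "high", "medium", then reordered to
# "low" < "medium" < "high" by changing `order` only.
pool = ToyPool(["low", "high", "medium"], [1, 3, 2])
refs = [1, 2, 3, 0, 2]      # unchanged by the reordering

codes = [toy_levelcode(pool, r) for r in refs]
# codes is [1, 3, 2, missing, 3], while refs is still [1, 2, 3, 0, 2]
```

In this model, any arithmetic that assumes codes run from 1 to the number of levels in level order has to go through the ordering rather than use the raw references.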
