[ANN] LatentClassAnalysis.jl - Latent Class Analysis in Julia

Hi all!

:wave:t3: I’m happy to announce LatentClassAnalysis.jl, a package I’ve developed for Latent Class Analysis (LCA).

Key Features

The package currently supports:

  • Model specification with dummies or categorical indicators
  • Maximum likelihood estimation via EM algorithm
  • Model diagnostics and fit statistics (e.g., AIC, BIC, entropy)
  • Class prediction and posterior probabilities
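For readers new to the method, the EM estimation step listed above can be sketched in plain Julia. This is only a minimal illustration and not the package's implementation: the function name `em_lca`, its keyword arguments, and the returned fields are made up for this example, and it assumes items are coded as integers 1..C_j with no missing values.

```julia
using Random

# Toy EM for a latent class model with categorical items coded 1..C_j.
# All names here (em_lca, class_probs, item_probs, post) are illustrative only.
function em_lca(data::AbstractMatrix{<:Integer}, n_classes::Int,
                n_categories::Vector{Int};
                max_iter::Int = 500, tol::Float64 = 1e-8,
                rng::AbstractRNG = Random.default_rng())
    n, J = size(data)
    class_probs = fill(1 / n_classes, n_classes)
    # item_probs[j][k, c] = P(item j = c | class k); random start, rows sum to 1
    item_probs = map(n_categories) do C
        p = rand(rng, n_classes, C) .+ 0.1
        p ./ sum(p, dims = 2)
    end
    post = zeros(n, n_classes)
    ll_old = -Inf
    ll = -Inf
    for _ in 1:max_iter
        # E-step: posterior class membership for every observation
        ll = 0.0
        for i in 1:n
            for k in 1:n_classes
                w = class_probs[k]
                for j in 1:J
                    w *= item_probs[j][k, data[i, j]]
                end
                post[i, k] = w
            end
            s = sum(@view post[i, :])
            ll += log(s)
            post[i, :] ./= s
        end
        # M-step: update class shares and item-response probabilities
        class_probs .= vec(sum(post, dims = 1)) ./ n
        for j in 1:J
            P = item_probs[j]
            fill!(P, 0.0)
            for i in 1:n, k in 1:n_classes
                P[k, data[i, j]] += post[i, k]
            end
            P ./= sum(P, dims = 2)
        end
        # Stop once the log-likelihood no longer improves noticeably
        abs(ll - ll_old) < tol && break
        ll_old = ll
    end
    return (; class_probs, item_probs, post, ll)
end

# Usage on random toy data: 300 respondents, 4 binary items, 2 classes
rng = MersenneTwister(42)
toy = rand(rng, 1:2, 300, 4)
fit = em_lca(toy, 2, [2, 2, 2, 2]; rng = rng)
println("Estimated class shares: ", round.(fit.class_probs, digits = 3))
```

The `post` matrix holds the posterior class probabilities, so a modal class assignment is simply the column with the largest value in each row. A real implementation would also want multiple random starts, since EM can stop at a local maximum.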

Examples

First, let’s load the packages and generate some data.

using LatentClassAnalysis
using DataFrames
using CategoricalArrays  # provides categorical(), used for item4 below
using Random

# Generate data
Random.seed!(123)
n_samples = 1000

# True class assignments (2 latent classes)
true_classes = rand(1:2, n_samples)

# Generate responses for 4 binary items with different patterns
function generate_response(class)
    if class == 1
        return rand() < 0.8 ? 1 : 2  # High probability of 1
    else
        return rand() < 0.3 ? 1 : 2  # Low probability of 1
    end
end

# Create DataFrame with responses
df = DataFrame(
    item1 = [generate_response(c) for c in true_classes],
    item2 = [generate_response(c) for c in true_classes],
    item3 = [generate_response(c) for c in true_classes],
    item4 = categorical([rand(["Yes", "No"]) for _ in 1:n_samples])
)

Next, we prepare the data and fit LCA models with different numbers of latent classes.

# Step 1: Prepare data
data, n_categories = prepare_data(df, :item1, :item2, :item3, :item4)

# Step 2: Fit models with different number of classes
results = []
for n_classes in 2:4
    # Initialize model
    model = LCAModel(n_classes, size(data, 2), n_categories)
    
    # Fit model and get log-likelihood
    ll = fit!(model, data, verbose=true)
    
    # Get diagnostics
    diag = diagnostics!(model, data, ll)
    
    # Store results
    push!(results, (
        n_classes = n_classes,
        model = model,
        diagnostics = diag
    ))
    
    println("Log-likelihood: $(diag.ll)")
    println("AIC: $(diag.aic)")
    println("BIC: $(diag.bic)")
    println("SBIC: $(diag.sbic)")
    println("Entropy: $(diag.entropy)")
end

We pick the model with the lowest BIC as the best-fitting one. (Relying on BIC alone is not recommended in practice; theoretical meaningfulness of the classes should also guide the choice.)

# Find best model based on BIC
# Look up the stored result with the smallest BIC instead of doing index arithmetic
best = results[argmin([r.diagnostics.bic for r in results])]
best_n_classes = best.n_classes
best_model = best.model
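For anyone curious how criteria like these are typically computed from the log-likelihood, here is a small sketch. It uses the textbook definitions and is not taken from the package source: the helper names are hypothetical, and I am assuming that SBIC refers to the sample-size-adjusted BIC.

```julia
# Free parameters in an unrestricted LCA: (K - 1) class shares plus,
# for each class and item, (C_j - 1) response probabilities.
# Both function names below are hypothetical helpers for this sketch.
n_free_params(n_classes, n_categories) =
    (n_classes - 1) + n_classes * sum(c - 1 for c in n_categories)

function information_criteria(ll, n_obs, n_classes, n_categories)
    p = n_free_params(n_classes, n_categories)
    aic  = -2 * ll + 2 * p
    bic  = -2 * ll + p * log(n_obs)
    sbic = -2 * ll + p * log((n_obs + 2) / 24)  # sample-size-adjusted BIC (assumed meaning of SBIC)
    return (; n_params = p, aic, bic, sbic)
end

# Example: a 2-class model on four binary items with a made-up log-likelihood
ic = information_criteria(-2500.0, 1000, 2, [2, 2, 2, 2])
println(ic)
```

Because the penalty grows with the number of parameters, adding a class only lowers BIC when the gain in log-likelihood outweighs the extra parameters.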

We now get predicted probabilities and class assignments.

# Step 3: Analyze best model
# Get predictions
assignments, probabilities = predict(best_model, data)

# Add predicted classes to the original DataFrame
df[!, :predicted_class] = assignments

# Calculate class sizes
class_sizes = [sum(assignments .== k) / length(assignments) for k in 1:best_n_classes]
println("\nClass sizes:")
for (k, share) in enumerate(class_sizes)  # `share` avoids shadowing Base.size
    println("Class $k: $(round(share * 100, digits=1))%")
end

# Show item response probabilities for each class
println("\nItem response probabilities:")
for j in 1:best_model.n_items
    println("\nItem $j:")
    for k in 1:best_model.n_classes
        probs = best_model.item_probs[j][k, :]
        println("Class $k: $probs")
    end
end
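The posterior probabilities can also be used to judge how cleanly the classes separate. Below is a minimal sketch of the usual relative entropy measure, assuming an n × K matrix of posterior probabilities whose rows sum to 1. The function name is hypothetical, and whether the package's entropy diagnostic uses exactly this normalisation is an assumption on my part.

```julia
# Relative entropy in [0, 1]; values near 1 indicate well-separated classes.
# `relative_entropy` is a hypothetical helper, not a package function.
function relative_entropy(post::AbstractMatrix{<:Real})
    n, K = size(post)
    K > 1 || return 1.0
    # Treat 0 * log(0) as 0
    h = -sum(p > 0 ? p * log(p) : 0.0 for p in post)
    return 1 - h / (n * log(K))
end

# Perfectly separated vs. completely uninformative posteriors
sharp = [1.0 0.0; 0.0 1.0]
flat  = fill(0.5, 4, 2)
println(relative_entropy(sharp), " ", relative_entropy(flat))
```

With the objects from the example, one would pass the `probabilities` matrix returned by `predict`, assuming it has one row per observation and one column per class.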

I hope that by implementing models I have used in my own research, I can help make Julia more useful for the social sciences.

I’m happy to contribute and would love to hear any feedback or feature requests from you! Cheers🍻️!


This is really neat! Congrats on the new package! Would you mind sharing more about how you use this in your work within social sciences?

The reason I ask is that I have a number of collaborators coming from R, and Julia still has a paucity of packages for social science research. Additionally, in JuliaHealth, we tend to look at social determinants of health and other interesting ways to weave multimodal data together to give a comprehensive look at a health question we have.

Thanks!

~ tcp :deciduous_tree:


Many thanks! Totally!

Think of LCA as a method that helps us find hidden groups within a population that shares something in common. Rather than trying to predict who might develop a condition, we look at folks who already have it to understand how different their situations and experiences can be.

In health research, we may study people living with dementia. Using LCA, we look at how their daily lives and circumstances differ on observed indicators (which we call manifest variables) such as:

  • Living arrangements
  • Social support
  • Educational background
  • Main occupation
  • Access to healthcare

LCA can reveal distinct latent classes. We use BIC and theoretical meaningfulness to determine how many groups make the most sense. We may find groups like:

  • Highly supported urban group
  • Isolated rural group

So, in this case, dementia care is not one-size-fits-all. It should be tailored to meet the needs of different groups.

I have also used LCA to study how people end up without children in their middle and later years, again finding that several different life paths lead there. The ways these paths shift across cohorts reveal meaningful but otherwise hidden demographic trends.

LCA belongs to a bigger family of person-centered approaches (e.g., latent profile analysis, sequence analysis, group-based trajectory modeling). I am excited to bring more of these tools to Julia in the future!

Would love to hear more about JuliaHealth :clap:t3:!


This is fantastic to hear. IMO, the stats side of Julia is generally under-developed (with a few exceptions).

Hope you have a good time with the work you seem to be planning. I’m sure you’ll continue to find a warm reception here :slightly_smiling_face:


Hi Yanwen,

"LCA belongs to a bigger family of person-centered approaches (e.g., latent profile analysis, sequence analysis, group-based trajectory modeling). I am excited to bring more of these tools to Julia in the future!"

This is excellent, and I would be strongly in favor of that! I was curious about how you use and developed the tool because of how it could generalize. It might also be nice, if you wanted to, to house the package within JuliaStats or another organization.

I am very familiar with these sorts of approaches and am thrilled to see Julia tools emerging for them. Until now, these tools have generally only been available in R, but I think Julia is well-positioned to generalize these approaches across several domains (the applications of which you are undoubtedly well-acquainted with).

"Would love to hear more about JuliaHealth :clap:t3:!"

Sure! JuliaHealth is a GitHub organization dedicated to bringing Julia to applications across the health research continuum. We have a blog, a strong presence within Google Summer of Code, monthly meetings, and a playlist on the Julia YouTube channel, and we are active in the #health-and-medicine channel of the Julia Slack (if you haven’t joined, we would love to see you there: The Julia Language Slack).

Largely, we have a lot of topics that we explore such as the following:

  • Observational Health
  • Geospatial Health Informatics
  • Medical Imaging
  • Health Standards and Interoperability
  • Innovation in Health Research

with a strong focus on health equity and fairness throughout. We also place a high premium on interoperability with other Julia ecosystems. Finally, we serve as a nice entry point for folks who are interested in research, Julia, and open source/science by providing mentorship, fun projects, and a platform to grow. :slight_smile:

Would love to see you there! If you end up joining, feel free to introduce yourself, the package, and what you are doing for research and beyond!

Cheers!

~ tcp :deciduous_tree:


Very nice that someone is working on this for Julia!

I have had to switch to R for latent class analysis and would love to do this in Julia. I have been testing the package a little and have some questions, as there is not much documentation yet (very understandable, of course; I may be misunderstanding some things).

  1. As far as I can tell, continuous response variables are not allowed? Is this on the agenda?

  2. Related to 1, are there plans to move in the direction of using mixed models to fit latent regression models using the EM algorithm (like lcmm in R)?

  3. As far as I can tell, there is currently no implementation that allows correlated data points, right? Is this on the agenda for the package? I would love to be able to apply these techniques to longitudinal data.

Thank you for your work!


Many thanks for your kind words and questions!

Currently, this package uses multinomial distributions for the response variables and estimates item-response probabilities for each latent class. So yes, only categorical variables are supported. This is also true (and a limitation) of the popular poLCA package in R.
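To make the limitation concrete, the standard LCA model (as usually written in the literature; the notation below is mine and not taken from the package) gives the probability of respondent i's response pattern as a mixture over K classes, with each of the J items following a class-specific multinomial distribution:

```latex
P(Y_i = y_i) \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \prod_{c=1}^{C_j}
  \rho_{jkc}^{\,\mathbb{1}[y_{ij} = c]}
```

Here \pi_k is the share of class k, \rho_{jkc} is the probability that a member of class k gives response c to item j, and C_j is the number of categories of item j. Every item needs a finite set of categories, which is why continuous indicators do not fit this setup.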

That being said, I have several future plans:

  1. Develop a separate package for Latent Profile Analysis (LPA) that handles continuous variables.
  2. Allow covariates to be added in LCA and LPA.
  3. Add functions for easier visualization.
  4. (documentation)…

And yes, extending this package to accommodate mixed effects and repeated measures is a wonderful idea. Let me explore this possibility and work on building the necessary knowledge base! :thinking:

Many thanks!


LatentClassAnalysis.jl 0.2.0 Release Notes

  • Added the show_profiles function for summarizing latent class profiles
  • Added an example replicating a research study using real-world data
  • Bug fixes for handling 0/1 coded binary variables

LatentClassAnalysis.jl 0.2.1 Release Notes :wave:t3:

  • Added warnings for small number of observations
  • Added functions to check model identifiability based on item and category counts
  • Improved type flexibility and speed
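As a rough illustration of what an identifiability check based on item and category counts can look like, here is the classic counting condition: the number of free parameters must not exceed the number of distinct response patterns minus one. This is a sketch with hypothetical function names, not the package's API, and the condition is necessary but not sufficient.

```julia
# Counting check for an unrestricted LCA (necessary, not sufficient).
# All three function names are hypothetical helpers for this sketch.
n_free_params(n_classes, n_categories) =
    (n_classes - 1) + n_classes * sum(c - 1 for c in n_categories)

# Number of distinct response patterns the items can produce
n_patterns(n_categories) = prod(n_categories)

passes_counting_check(n_classes, n_categories) =
    n_free_params(n_classes, n_categories) <= n_patterns(n_categories) - 1

# Four binary items, as in the example at the top of the thread
for k in 2:4
    println("classes = $k: ", passes_counting_check(k, [2, 2, 2, 2]))
end
```

With four binary items there are 16 response patterns, so at most 15 parameters can be estimated. A 4-class model would need 19, which means it already fails this check even though the fitting loop in the example above will happily run it.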