[ANN] ACS.jl a package to download US Census American Community Survey files

Based on Kyle Walkerโ€™s tidycensus package for the R programming language, this contains functions to get specified variables from the one-, three-, and five-year surveys. A free Census API key is required.

Repo and documentation.

7 Likes

Hey @technocrat,

Wanted to drop a line and say that this looks great! Iโ€™ll definitely give this a look for some of my public health research.

In JuliaHealth land (if you are on the Julia Slack, we live in #health-and-medicine), we are definitely a huge fan of census microdata and other socioeconomic/public health information used in the USA and internationally. Iโ€™ll share some things we are up to as well:

Last year, we started work on an OpenAPI.jl based IPUMS tool called IPUMS.jl. We havenโ€™t registered it yet as it is in heavy development still, but here is a working demo to download data from NHGIS for example:

using IPUMS

# This comes from NHGIS IPUMS Website
api = IPUMSAPI("https://api.ipums.org/", Dict("Authorization" => "Your_Key"))

# This extract is in the IPUMS.jl repo
test_extract_definition = "test/testdata/example_extract_request.json"

# Submits an extract definition request to IPUMS
res = extract_submit(api, "nhgis", test_extract_definition)

# Gets information on status of request
metadata, defn, msg = extract_info(api, res.number, "nhgis")

# Lists out all extracts submitted to IPUMS NHGIS
extract_list(api, "nhgis")

# Download submitted extract data to output path
extract_download(api, res.number, "nhgis"; output_path = "file_downloads/")

From here, using some sample data from the IPUMS project and located in the IPUMS.jl repo, you can load the data like so:

ddi = parse_ddi("test/testdata/cps_00157.xml")
df = load_ipums_extract(ddi, "test/testdata/cps_00157.dat")

Which looks something like this:

3ร—8 DataFrame
 Row โ”‚ YEAR   SERIAL  MONTH  ASECWTH  STATEFIP  PERNUM  ASECWT   INCTOT
     โ”‚ Int64  Int64   Int64  Float64  Int64     Int64   Float64  Int64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚  1962      80      3  1475.59        55       1  1475.59       4883
   2 โ”‚  1962      80      3  1475.59        55       2  1470.72       5800
   3 โ”‚  1962      80      3  1475.59        55       3  1578.75  999999998

Additionally, we have been working with Census data to get shapefile information. We created TigerLine.jl which is inspired by tigris (R package); hereโ€™s another small demo:

using TigerLine

# Run ?TIGER_DICT to see more layer options
download_tiger("DIRECTORY_OF_CHOICE", year=2024, layer="state")

Which you can then use like (adapted from my friend, Sam Loweโ€™s analysis):

using CairoMakie
using GeoMakie
using Polylabel
using DataFrames
using GeoDataFrames
using Shapefile

table = Shapefile.Table( "tl_2024_us_state.zip") |> DataFrame

one_unique = Vector{String}()
for col in names(table)
    if length(unique(table[!, col])) == 1
        push!(one_unique, col)
    end
end
select!(table, Not(one_unique))

f, a, p = poly(table.geometry; axis = (; type = GeoAxis))
tp = Makie.text!(a, Polylabel.polylabel.(table.geometry); text = string.(table.STUSPS), align = (:center, :center))
tp.fontsize = 5
f

We are actively working on building more and more support here for more social science/public health research tooling within JuliaHealth and have a nice start but a ways to go to make things more ergonomic and polished.

Might be interesting to work together on some things as I am working these days on some more IPUMS related work and research! Would love to see more of these tools coming together in Julia world. :smiley: Also, if you are interested, weโ€™d be happy to host and promote ACS.jl in JuliaHealth if youโ€™d like โ€“ I could share more details about that.

Cheers and excited to see what else comes together here!

~ tcp :deciduous_tree:

2 Likes

Hi @TheCedarPrince

Thanks for introducing me to the Slack channel.

Iโ€™ve got a parallel package in development with some more tools, RSetup.jl is a mini-package manager for R that will set the R environment for use by RCall and add packages. Although Julia has the statistics basics, thereโ€™s a lot of other stuff in R. For example classInt will divide a continuous vector into buckets based on a choice of algorithms. This prevents the mushy appearance in thematic maps where everything is a shade of blue.

It also facilitates access to Kyle Walkerโ€™s tidycensus package, which is an extensive series of Census-related tools. For example, hereโ€™s an r_get_acs

# SPDX-License-Identifier: MIT
using Census
using Dates
using RCall
using CSV
using DataFrames    

\"""
    r_get_acs_data(; geography::String, variables::Dict{String, String}, state::String,
                 year::Union{Integer, Nothing} = nothing, survey::Union{String, Nothing} = nothing)

Retrieve American Community Survey (ACS) data through R's tidycensus package.

# Arguments
- `geography::String`: Geographic level for data collection (e.g., "county", "tract", "block group")
- `variables::Dict{String, String}`: Dictionary mapping desired variable names to Census variable codes
- `state::String`: State abbreviation (e.g., "MA", "NY")
- `year::Union{Integer, Nothing} = nothing`: Survey year. Defaults based on current date and survey type
- `survey::Union{String, Nothing} = nothing`: Type of ACS survey. Use "acs1" for 1-year estimates,
   nothing for 5-year estimates

# Details
Year defaults are determined by current date and survey type:
- For 1-year ACS (`survey="acs1"`): after September, uses previous year
- For 5-year ACS: after December, uses previous year
- Otherwise uses two years prior

ACS availability notes:
- 1-year estimates (`survey="acs1"`) are only available for geographies with populations โ‰ฅ 65,000
- 2020 data is not available for ACS (use get_decennial() instead)
- 1-year and 5-year surveys are published in September and December respectively

# Returns
DataFrame containing requested ACS data

# Examples
```julia
# Get 5-year estimates for counties in Massachusetts
vars = Dict("median_income" => "B19013_001")
df = r_get_acs_data(geography="county", variables=vars, state="MA")

# Get 1-year estimates for 2022
df = r_get_acs_data(geography="county", variables=vars, state="MA",
                  survey="acs1", year=2022)
\```
\"""
function r_get_acs_data(; geography::String,
                      variables::Dict{String, String},
                      state::String,
                      year::Union{Integer, Nothing} = nothing,
                      survey::Union{String, Nothing} = nothing)

    # Ensure R environment is set up using Census module's function
    try
        Census.RSetup.setup_r_environment()
    catch e
        error("Failed to set up R environment: ", sprint(showerror, e))
    end

    # Determine current date for default year logic
    current_date = Dates.now()
    current_year = Dates.year(current_date)
    current_month = Dates.month(current_date)

    # Calculate default year based on current date
    if isnothing(year)
        # After September, 1-year ACS for previous year becomes available
        # After December, 5-year ACS for previous year becomes available
        if current_month >= 9 && !isnothing(survey) && survey == "acs1"
            year = current_year - 1
        elseif current_month >= 12
            year = current_year - 1
        else
            year = current_year - 2
        end
    end

    # Check for 2020 restriction
    if year == 2020
        throw(ArgumentError("ACS is not available for 2020. Use get_decennial() or pick a later year."))
    end

    # Convert Julia Dict to R named vector
    var_names = collect(keys(variables))
    var_codes = collect(values(variables))
    
    # Use proper R string interpolation
    R\"""
    r_vars <- setNames(c($(var_codes)), c($(var_names)))
    \"""

    # Build R function call with proper error handling
    try
        if isnothing(survey)
            R\"""
            # Ensure tidycensus is loaded
            library(tidycensus)
            
            # Call R get_acs() with the provided parameters
            data <- get_acs(geography = $(geography),
                          variables = r_vars,
                          state = $(state),
                          year = $(year))
            \"""
        else
            R\"""
            # Ensure tidycensus is loaded
            library(tidycensus)
            
            # Call R get_acs() with the provided parameters including survey
            data <- get_acs(geography = $(geography),
                          variables = r_vars,
                          state = $(state),
                          year = $(year),
                          survey = $(survey))
            """
        end
    catch e
        if isa(e, RCall.REvalError)
            error("R evaluation error: ", sprint(showerror, e))
        else
            rethrow(e)
        end
    end

    # Convert R data frame to Julia DataFrame with error handling
    try
        return rcopy(R"data")
    catch e
        error("Failed to convert R data frame to Julia DataFrame: ", sprint(showerror, e))
    end
end

Iโ€™ve also received a feature request for ACL.jl to add options for taking the Census API json file and rendering as a StructureArray or NamedTuple object. I have that working and am awaiting reaction from the requestor.

I considered implementing Tiger.jl but decided against it since Iโ€™m not working at the block level. What Iโ€™ve done is to get the county level shapefiles, which should be fairly stable year-to-year and put them into a Postgres database. The databaseโ€™s Postgis extension makes it much more convenient to do spatial work in the database than in R and, I assume, Julia.

For thematic mapping, I struggled quite a bit before settling on GeoMakie. I had a hard time converting the multipolygons of the shapefiles, but itโ€™s working now. Iโ€™ve also got CRS projection strings working.

Finally, folk may be interested in my age pyramid plot that shows differences between geographic units.

1 Like

Hey! I wrote GeoMakie.jl and work on many other geospatial packages - would love to hear what pain points you had there so we can look at how to improve things. If you were trying to plot geometries from R, then yeah thatโ€™s a bit harder - but anything from Julia should work out of the box.

As a side note, Iโ€™ve always wanted a port of classInt in Julia, for general visualization purposes - so will be working towards having that at some point. AlgebraOfGraphics.jl for example could use that as well.

Hi @asinghvi17

I sucked the Tiger shapefiles into postgis and hereโ€™s what I get from a query.

3235ร—5 DataFrame
  Row โ”‚ geoid    stusps   name           geom                               total_population 
      โ”‚ String?  String?  String?        String?                            Int64?           
โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    1 โ”‚ 48315    TX       Marion         MULTIPOLYGON(((-94.709446 32.871โ€ฆ              9631
    2 โ”‚ 01011    AL       Bullock        MULTIPOLYGON(((-85.999258 32.250โ€ฆ             10157
    3 โ”‚ 49045    UT       Tooele         MULTIPOLYGON(((-114.046658941853โ€ฆ             76648
    4 โ”‚ 49023    UT       Juab           MULTIPOLYGON(((-114.047783 39.79โ€ฆ             12273
    5 โ”‚ 51095    VA       James City     MULTIPOLYGON(((-76.914226 37.356โ€ฆ             80046
    6 โ”‚ 47173    TN       Union          MULTIPOLYGON(((-84.007298 36.270โ€ฆ             20141
    7 โ”‚ 17163    IL       St. Clair      MULTIPOLYGON(((-90.263814 38.520โ€ฆ            254777
  โ‹ฎ   โ”‚    โ‹ฎ        โ‹ฎ           โ‹ฎ                        โ‹ฎ                         โ‹ฎ

and to get it to work, Iโ€™ve resorted to

function parse_geoms(df::DataFrame = df)
    return [ArchGDAL.fromWKT(geom) for geom in df.geom
       if !ismissing(geom)]
 end

to get

3235ร—6 DataFrame
  Row โ”‚ geoid    stusps   name           geom                               total_population  parsed_geoms              
      โ”‚ String?  String?  String?        String?                            Int64?            IGeometry?                
โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    1 โ”‚ 48315    TX       Marion         MULTIPOLYGON(((-94.709446 32.871โ€ฆ              9631  Geometry: wkbMultiPolygon
    2 โ”‚ 01011    AL       Bullock        MULTIPOLYGON(((-85.999258 32.250โ€ฆ             10157  Geometry: wkbMultiPolygon
    3 โ”‚ 49045    UT       Tooele         MULTIPOLYGON(((-114.046658941853โ€ฆ             76648  Geometry: wkbMultiPolygon
    4 โ”‚ 49023    UT       Juab           MULTIPOLYGON(((-114.047783 39.79โ€ฆ             12273  Geometry: wkbMultiPolygon
    5 โ”‚ 51095    VA       James City     MULTIPOLYGON(((-76.914226 37.356โ€ฆ             80046  Geometry: wkbMultiPolygon
    6 โ”‚ 47173    TN       Union          MULTIPOLYGON(((-84.007298 36.270โ€ฆ             20141  Geometry: wkbMultiPolygon
    7 โ”‚ 17163    IL       St. Clair      MULTIPOLYGON(((-90.263814 38.520โ€ฆ            254777  Geometry: wkbMultiPolygon

to plot with

   # Plot the polygons
    for i in 1:nrow(df)
        geom = df.parsed_geoms[i]
        multi_poly = geom
        n_polys = ArchGDAL.ngeom(multi_poly)
        
        # Get the bin number and ensure it's valid
        bin_num = df[i, :pop_bins]
        
        for p_idx in 0:(n_polys-1)
            poly = ArchGDAL.getgeom(multi_poly, p_idx)
            ext_ring = ArchGDAL.getgeom(poly, 0)
            ring_text = ArchGDAL.toWKT(ext_ring)
            
            coords_text = replace(ring_text, "LINEARRING (" => "")
            coords_text = replace(coords_text, ")" => "")
            
            point_list = Point2f[]
            for pair in split(coords_text, ",")
                parts = split(strip(pair))
                if length(parts) >= 2
                    x = parse(Float32, parts[1])
                    y = parse(Float32, parts[2])
                    push!(point_list, Point2f(x, y))
                end
            end
            
            if !isempty(point_list)
                poly_obj = GeometryBasics.Polygon(point_list)
                # Ensure bin_num is within valid range
                color_idx = min(max(1, bin_num), length(custom_colors))
                poly!(
                    ga1,
                    poly_obj,
                    color=custom_colors[color_idx],
                    strokecolor=(:black, 0.5),
                    strokewidth=1
                )
            end
        end
    end

But if Iโ€™ve missed something in GeoMakie, Iโ€™ll be delighted to use it instead!

1 Like

You should be able to plot the ArchGDAL geometries directly, the objects you have in parsed_geoms. You could also have PostGIS return the geometries as WKB (well known binary) instead of well known text, which is faster and easier to parse.

Itโ€™s also way more efficient, due to the way Makie works, to plot all polygons at once:

df.parsed_geoms = (x -> ismissing(x) ? missing : ArchGDAL.fromWKT(x)).(df.geom)
dropmissing!(df, :parsed_geoms)
# ... other processing
poly(
    df.parsed_geoms; 
    color = df.pop_bins, 
    colormap = custom_colors, 
    colorrange = (1, length(custom_colors)),
    highclip = custom_colors[end],
    lowclip = custom_colors[begin]
)

If you want to get coordinates from geometry, you should also look into GeoInterface.jl which is a common interface across all Julia geospatial IO packages.

Your code to get the coordinates of one ring would then look like:

import GeoInterface as GI
multi_poly = ... # something
poly = GI.getgeom(multi_poly, 1)
ring = GI.getring(poly, 1) # or GI.getexterior(poly) or GI.gethole(poly, i #= in 1:GI.nhole(poly)=#)
coords = [(GI.x(p), GI.y(p) for p in GI.getpoint(ring)]
1 Like

Thank you. Very helpful