Based on Kyle Walkerโs tidycensus
package for the R programming language, this contains functions to get specified variables from the one-, three-, and five-year surveys. A free Census API key is required.
Repo and documentation.
Based on Kyle Walkerโs tidycensus
package for the R programming language, this contains functions to get specified variables from the one-, three-, and five-year surveys. A free Census API key is required.
Repo and documentation.
Hey @technocrat,
Wanted to drop a line and say that this looks great! Iโll definitely give this a look for some of my public health research.
In JuliaHealth land (if you are on the Julia Slack, we live in #health-and-medicine), we are definitely a huge fan of census microdata and other socioeconomic/public health information used in the USA and internationally. Iโll share some things we are up to as well:
Last year, we started work on an OpenAPI.jl based IPUMS tool called IPUMS.jl. We havenโt registered it yet as it is in heavy development still, but here is a working demo to download data from NHGIS for example:
using IPUMS
# This comes from NHGIS IPUMS Website
api = IPUMSAPI("https://api.ipums.org/", Dict("Authorization" => "Your_Key"))
# This extract is in the IPUMS.jl repo
test_extract_definition = "test/testdata/example_extract_request.json"
# Submits an extract definition request to IPUMS
res = extract_submit(api, "nhgis", test_extract_definition)
# Gets information on status of request
metadata, defn, msg = extract_info(api, res.number, "nhgis")
# Lists out all extracts submitted to IPUMS NHGIS
extract_list(api, "nhgis")
# Download submitted extract data to output path
extract_download(api, res.number, "nhgis"; output_path = "file_downloads/")
From here, using some sample data from the IPUMS project and located in the IPUMS.jl repo, you can load the data like so:
ddi = parse_ddi("test/testdata/cps_00157.xml")
df = load_ipums_extract(ddi, "test/testdata/cps_00157.dat")
Which looks something like this:
3ร8 DataFrame
Row โ YEAR SERIAL MONTH ASECWTH STATEFIP PERNUM ASECWT INCTOT
โ Int64 Int64 Int64 Float64 Int64 Int64 Float64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1962 80 3 1475.59 55 1 1475.59 4883
2 โ 1962 80 3 1475.59 55 2 1470.72 5800
3 โ 1962 80 3 1475.59 55 3 1578.75 999999998
Additionally, we have been working with Census data to get shapefile information. We created TigerLine.jl which is inspired by tigris
(R package); hereโs another small demo:
using TigerLine
# Run ?TIGER_DICT to see more layer options
download_tiger("DIRECTORY_OF_CHOICE", year=2024, layer="state")
Which you can then use like (adapted from my friend, Sam Loweโs analysis):
using CairoMakie
using GeoMakie
using Polylabel
using DataFrames
using GeoDataFrames
using Shapefile
table = Shapefile.Table( "tl_2024_us_state.zip") |> DataFrame
one_unique = Vector{String}()
for col in names(table)
if length(unique(table[!, col])) == 1
push!(one_unique, col)
end
end
select!(table, Not(one_unique))
f, a, p = poly(table.geometry; axis = (; type = GeoAxis))
tp = Makie.text!(a, Polylabel.polylabel.(table.geometry); text = string.(table.STUSPS), align = (:center, :center))
tp.fontsize = 5
f
We are actively working on building more and more support here for more social science/public health research tooling within JuliaHealth and have a nice start but a ways to go to make things more ergonomic and polished.
Might be interesting to work together on some things as I am working these days on some more IPUMS related work and research! Would love to see more of these tools coming together in Julia world. Also, if you are interested, weโd be happy to host and promote ACS.jl in JuliaHealth if youโd like โ I could share more details about that.
Cheers and excited to see what else comes together here!
~ tcp
Thanks for introducing me to the Slack channel.
Iโve got a parallel package in development with some more tools, RSetup.jl
is a mini-package manager for R
that will set the R
environment for use by RCall
and add packages. Although Julia
has the statistics basics, thereโs a lot of other stuff in R. For example classInt
will divide a continuous vector into buckets based on a choice of algorithms. This prevents the mushy appearance in thematic maps where everything is a shade of blue.
It also facilitates access to Kyle Walkerโs tidycensus
package, which is an extensive series of Census-related tools. For example, hereโs an r_get_acs
# SPDX-License-Identifier: MIT
using Census
using Dates
using RCall
using CSV
using DataFrames
\"""
r_get_acs_data(; geography::String, variables::Dict{String, String}, state::String,
year::Union{Integer, Nothing} = nothing, survey::Union{String, Nothing} = nothing)
Retrieve American Community Survey (ACS) data through R's tidycensus package.
# Arguments
- `geography::String`: Geographic level for data collection (e.g., "county", "tract", "block group")
- `variables::Dict{String, String}`: Dictionary mapping desired variable names to Census variable codes
- `state::String`: State abbreviation (e.g., "MA", "NY")
- `year::Union{Integer, Nothing} = nothing`: Survey year. Defaults based on current date and survey type
- `survey::Union{String, Nothing} = nothing`: Type of ACS survey. Use "acs1" for 1-year estimates,
nothing for 5-year estimates
# Details
Year defaults are determined by current date and survey type:
- For 1-year ACS (`survey="acs1"`): after September, uses previous year
- For 5-year ACS: after December, uses previous year
- Otherwise uses two years prior
ACS availability notes:
- 1-year estimates (`survey="acs1"`) are only available for geographies with populations โฅ 65,000
- 2020 data is not available for ACS (use get_decennial() instead)
- 1-year and 5-year surveys are published in September and December respectively
# Returns
DataFrame containing requested ACS data
# Examples
```julia
# Get 5-year estimates for counties in Massachusetts
vars = Dict("median_income" => "B19013_001")
df = r_get_acs_data(geography="county", variables=vars, state="MA")
# Get 1-year estimates for 2022
df = r_get_acs_data(geography="county", variables=vars, state="MA",
survey="acs1", year=2022)
\```
\"""
function r_get_acs_data(; geography::String,
variables::Dict{String, String},
state::String,
year::Union{Integer, Nothing} = nothing,
survey::Union{String, Nothing} = nothing)
# Ensure R environment is set up using Census module's function
try
Census.RSetup.setup_r_environment()
catch e
error("Failed to set up R environment: ", sprint(showerror, e))
end
# Determine current date for default year logic
current_date = Dates.now()
current_year = Dates.year(current_date)
current_month = Dates.month(current_date)
# Calculate default year based on current date
if isnothing(year)
# After September, 1-year ACS for previous year becomes available
# After December, 5-year ACS for previous year becomes available
if current_month >= 9 && !isnothing(survey) && survey == "acs1"
year = current_year - 1
elseif current_month >= 12
year = current_year - 1
else
year = current_year - 2
end
end
# Check for 2020 restriction
if year == 2020
throw(ArgumentError("ACS is not available for 2020. Use get_decennial() or pick a later year."))
end
# Convert Julia Dict to R named vector
var_names = collect(keys(variables))
var_codes = collect(values(variables))
# Use proper R string interpolation
R\"""
r_vars <- setNames(c($(var_codes)), c($(var_names)))
\"""
# Build R function call with proper error handling
try
if isnothing(survey)
R\"""
# Ensure tidycensus is loaded
library(tidycensus)
# Call R get_acs() with the provided parameters
data <- get_acs(geography = $(geography),
variables = r_vars,
state = $(state),
year = $(year))
\"""
else
R\"""
# Ensure tidycensus is loaded
library(tidycensus)
# Call R get_acs() with the provided parameters including survey
data <- get_acs(geography = $(geography),
variables = r_vars,
state = $(state),
year = $(year),
survey = $(survey))
"""
end
catch e
if isa(e, RCall.REvalError)
error("R evaluation error: ", sprint(showerror, e))
else
rethrow(e)
end
end
# Convert R data frame to Julia DataFrame with error handling
try
return rcopy(R"data")
catch e
error("Failed to convert R data frame to Julia DataFrame: ", sprint(showerror, e))
end
end
Iโve also received a feature request for ACL.jl
to add options for taking the Census API json
file and rendering as a StructureArray
or NamedTuple
object. I have that working and am awaiting reaction from the requestor.
I considered implementing Tiger.jl
but decided against it since Iโm not working at the block level. What Iโve done is to get the county level shapefiles, which should be fairly stable year-to-year and put them into a Postgres
database. The databaseโs Postgis
extension makes it much more convenient to do spatial work in the database than in R
and, I assume, Julia.
For thematic mapping, I struggled quite a bit before settling on GeoMakie
. I had a hard time converting the multipolygons of the shapefiles, but itโs working now. Iโve also got CRS projection strings working.
Finally, folk may be interested in my age pyramid plot that shows differences between geographic units.
Hey! I wrote GeoMakie.jl and work on many other geospatial packages - would love to hear what pain points you had there so we can look at how to improve things. If you were trying to plot geometries from R, then yeah thatโs a bit harder - but anything from Julia should work out of the box.
As a side note, Iโve always wanted a port of classInt in Julia, for general visualization purposes - so will be working towards having that at some point. AlgebraOfGraphics.jl for example could use that as well.
Hi @asinghvi17
I sucked the Tiger shapefiles into postgis and hereโs what I get from a query.
3235ร5 DataFrame
Row โ geoid stusps name geom total_population
โ String? String? String? String? Int64?
โโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 48315 TX Marion MULTIPOLYGON(((-94.709446 32.871โฆ 9631
2 โ 01011 AL Bullock MULTIPOLYGON(((-85.999258 32.250โฆ 10157
3 โ 49045 UT Tooele MULTIPOLYGON(((-114.046658941853โฆ 76648
4 โ 49023 UT Juab MULTIPOLYGON(((-114.047783 39.79โฆ 12273
5 โ 51095 VA James City MULTIPOLYGON(((-76.914226 37.356โฆ 80046
6 โ 47173 TN Union MULTIPOLYGON(((-84.007298 36.270โฆ 20141
7 โ 17163 IL St. Clair MULTIPOLYGON(((-90.263814 38.520โฆ 254777
โฎ โ โฎ โฎ โฎ โฎ โฎ
and to get it to work, Iโve resorted to
function parse_geoms(df::DataFrame = df)
return [ArchGDAL.fromWKT(geom) for geom in df.geom
if !ismissing(geom)]
end
to get
3235ร6 DataFrame
Row โ geoid stusps name geom total_population parsed_geoms
โ String? String? String? String? Int64? IGeometry?
โโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 48315 TX Marion MULTIPOLYGON(((-94.709446 32.871โฆ 9631 Geometry: wkbMultiPolygon
2 โ 01011 AL Bullock MULTIPOLYGON(((-85.999258 32.250โฆ 10157 Geometry: wkbMultiPolygon
3 โ 49045 UT Tooele MULTIPOLYGON(((-114.046658941853โฆ 76648 Geometry: wkbMultiPolygon
4 โ 49023 UT Juab MULTIPOLYGON(((-114.047783 39.79โฆ 12273 Geometry: wkbMultiPolygon
5 โ 51095 VA James City MULTIPOLYGON(((-76.914226 37.356โฆ 80046 Geometry: wkbMultiPolygon
6 โ 47173 TN Union MULTIPOLYGON(((-84.007298 36.270โฆ 20141 Geometry: wkbMultiPolygon
7 โ 17163 IL St. Clair MULTIPOLYGON(((-90.263814 38.520โฆ 254777 Geometry: wkbMultiPolygon
to plot with
# Plot the polygons
for i in 1:nrow(df)
geom = df.parsed_geoms[i]
multi_poly = geom
n_polys = ArchGDAL.ngeom(multi_poly)
# Get the bin number and ensure it's valid
bin_num = df[i, :pop_bins]
for p_idx in 0:(n_polys-1)
poly = ArchGDAL.getgeom(multi_poly, p_idx)
ext_ring = ArchGDAL.getgeom(poly, 0)
ring_text = ArchGDAL.toWKT(ext_ring)
coords_text = replace(ring_text, "LINEARRING (" => "")
coords_text = replace(coords_text, ")" => "")
point_list = Point2f[]
for pair in split(coords_text, ",")
parts = split(strip(pair))
if length(parts) >= 2
x = parse(Float32, parts[1])
y = parse(Float32, parts[2])
push!(point_list, Point2f(x, y))
end
end
if !isempty(point_list)
poly_obj = GeometryBasics.Polygon(point_list)
# Ensure bin_num is within valid range
color_idx = min(max(1, bin_num), length(custom_colors))
poly!(
ga1,
poly_obj,
color=custom_colors[color_idx],
strokecolor=(:black, 0.5),
strokewidth=1
)
end
end
end
But if Iโve missed something in GeoMakie
, Iโll be delighted to use it instead!
You should be able to plot the ArchGDAL geometries directly, the objects you have in parsed_geoms
. You could also have PostGIS return the geometries as WKB (well known binary) instead of well known text, which is faster and easier to parse.
Itโs also way more efficient, due to the way Makie works, to plot all polygons at once:
df.parsed_geoms = (x -> ismissing(x) ? missing : ArchGDAL.fromWKT(x)).(df.geom)
dropmissing!(df, :parsed_geoms)
# ... other processing
poly(
df.parsed_geoms;
color = df.pop_bins,
colormap = custom_colors,
colorrange = (1, length(custom_colors)),
highclip = custom_colors[end],
lowclip = custom_colors[begin]
)
If you want to get coordinates from geometry, you should also look into GeoInterface.jl which is a common interface across all Julia geospatial IO packages.
Your code to get the coordinates of one ring would then look like:
import GeoInterface as GI
multi_poly = ... # something
poly = GI.getgeom(multi_poly, 1)
ring = GI.getring(poly, 1) # or GI.getexterior(poly) or GI.gethole(poly, i #= in 1:GI.nhole(poly)=#)
coords = [(GI.x(p), GI.y(p) for p in GI.getpoint(ring)]
Thank you. Very helpful