Working with U.S. Census Bureau Microdata

Hello,

I’m just plugging a recent video I did on how to work with the American Community Survey (ACS) public-use microdata files from the U.S. Census Bureau. I show how to download the data from the Census Bureau, save it to tables that JuliaDB can consume, load the data and perform computations with the Distributed package, and then write a function to produce estimates as well as compute the standard error and the upper/lower bounds of a confidence interval for each estimate. Hopefully this is helpful to anyone interested in working with this dataset.
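
A rough sketch of the estimate/standard-error step discussed above (not the exact function from the video): the ACS PUMS files ship 80 replicate weights (PWGTP1–PWGTP80 for person records, WGTP1–WGTP80 for housing records), and the successive difference replication formula gives SE = sqrt(4/80 * Σ(estimate_r − estimate)²), with 1.645 · SE giving the Census Bureau’s usual 90% confidence bounds. In Julia, assuming you have already filtered down to the records of interest:

# w:  vector of full-sample weights (e.g. PWGTP) for the selected records
# rw: n × 80 matrix of the corresponding replicate weights (PWGTP1–PWGTP80)
function estimate_with_ci(w::AbstractVector, rw::AbstractMatrix; z = 1.645)
    est  = sum(w)                         # weighted total for the subgroup
    reps = vec(sum(rw, dims = 1))         # the 80 replicate estimates
    se   = sqrt(4 / length(reps) * sum((reps .- est) .^ 2))
    (estimate = est, se = se, lower = est - z * se, upper = est + z * se)
end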

P.S. I would really like to have some people join me for live coding/Q&A sessions. For example, if there are any package authors out there who would like to do some quick (30 min - 1 hour) videos with me, please reach out and let’s do it!


Nice job with this, thanks for sharing! Census data was next on my list to tackle as a public dataset, so this is very timely and appreciated. I’m especially trying to mash up and compare multiple sources such as the Centers for Medicare & Medicaid Services (CMS), the Bureau of Labor Statistics (BLS), Census data, etc.

  • I didn’t see them on GitHub - is there a place to view the App.jl and Tables.jl files you demo’d?
  • I downloaded the files, and it would be nice to run the code.
    BTW, it was great to see you compare your final JuliaDB-calculated results against the Census Bureau’s published values as a validation of the process and code - very effective.
    Also, the mic quality and sound levels were good.

Have you pulled and analyzed any of the BLS data yet? There is the BlsData.jl package by @micahjsmith that works nicely. I’d be happy to share the simple notebook I made. Here’s a time-series example plotting the national unemployment rate:
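
Something like the following is roughly what the notebook does - this is just a sketch, assuming BlsData.jl’s Bls()/get_data interface plus Plots.jl; LNS14000000 is the BLS series ID for the seasonally adjusted national unemployment rate, and the .data field and date/value column names are best-effort guesses, so check the package docs:

using BlsData, Plots

b = Bls()                              # optionally pass your BLS API registration key
series = get_data(b, "LNS14000000")    # civilian unemployment rate, seasonally adjusted
df = series.data                       # DataFrame of observation dates and values (field name may vary by version)
plot(df.date, df.value, label = "Unemployment rate (%)",
     title = "U.S. national unemployment rate (LNS14000000)")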

Oh, there’s IRS tax data to pull in and mash up as well, and ZIP and FIPS codes are good keys to tie the sources together (and to do geoplotting, another core piece of this that I’m following, along with the shapefile processing in the geo group - see the toy join sketch after the variable list below). IRS tax database variables:

-------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------------------
statefips       str2    %9s                 * The State Federal Information Processing System (FIPS) code
state           str2    %9s                 * The State associated with the ZIP code
zipcode         str5    %9s                 * 5-digit Zip code
...
n03220          long    %8.0g               * Number of returns with educator expenses 1040:23 / 1040A:16
a03220          long    %8.0g               * Educator expenses amount 1040:23 / 1040A:16
n03300          long    %8.0g               * Number of returns with self-employment retirement plans 1040:28
a03300          long    %12.0g              * Self-employment retirement plans amount 1040:28
n03270          long    %8.0g               * Number of returns with self-employment health insurance deduction 1040:29
a03270          long    %12.0g              * Self-employment health insurance deduction amount 1040:29
n03150          long    %8.0g               * Number of returns with IRA payments 1040:32 / 1040A:17
a03150          long    %12.0g              * IRA payments amount 1040:32 / 1040A:17
n03210          long    %12.0g              * Number of returns with student loan interest deduction 1040:33 / 1040A:18
a03210          long    %12.0g              * Student loan interest deduction amount 1040:33 / 1040A:18
n03230          long    %8.0g               * Number of returns with tuition and fees deduction 1040:34 / 1040A:19
a03230          long    %8.0g               * Tuition and fees deduction amount 1040:34 / 1040A:19
...
n85530          long    %12.0g              * Number of returns with additional Medicare tax 1040:62a
a85530          long    %12.0g              * Additional Medicare tax amount 1040:62a
n85300          long    %12.0g              * Number of returns with net investment income tax 1040:62b
a85300          long    %12.0g              * Net investment income tax amount 1040:62b
n11901          long    %12.0g              * Number of returns with tax due at time of filing 1040:78 / 1040A:50 / 1040EZ:14
a11901          long    %12.0g              * Tax due at time of filing amount [10] 1040:78 / 1040A:50 / 1040EZ:14
n11902          long    %12.0g              * Number of returns with overpayments refunded 1040:75 / 1040A:47 / 1040EZ:13a
a11902          long    %12.0g              * Overpayments refunded amount [11] 1040:75 / 1040A:47 / 1040EZ:13a
year            float   %9.0g               * Year
                                            * indicated variables have notes
-------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: zipcode  statefips  agi_stub  year
agi_stub:
           1 $1 under $25,000 
           2 $25,000 under $50,000 
           3 $50,000 under $75,000 
           4 $75,000 under $100,000 
           5 $100,000 under $200,000 
           6 $200,000 or more 
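
As a toy sketch of using ZIP codes as join keys across sources (hypothetical file names; assuming CSV.jl and DataFrames.jl, with column names taken from the codebook above):

using CSV, DataFrames

# Read the codes as strings so leading zeros survive; they then work
# directly as join keys across datasets.
irs   = CSV.read("irs_zipcode_data.csv", DataFrame;
                 types = Dict(:zipcode => String, :statefips => String))
other = CSV.read("other_zip_level_data.csv", DataFrame;
                 types = Dict(:zipcode => String))

joined = innerjoin(irs, other; on = :zipcode)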

Wow, I was just doing a similar thing (pulling the entire ACS 2017 5-year), except not with JuliaDB! You know the community is starting to grow when this happens :slight_smile:


Oh darn, @randyzwitch is replying - I should have waited before making my post; this will be a good one…


Ha! If I knew I was going to get such praise, I’d have written something better.

Internally at work, there’s a huge desire to have Census data, so I parsed it using Julia and OmniSci.jl. I was thinking of writing that up as a blog post, then didn’t… then @mthelm85 went and did a whole video!


Ha, I thought I’d post before you with a fun quip, but I was too slow on the trigger - you were 3 seconds ahead.

I’ve been pulling the full set of BLS data down this afternoon. wget is still running against the FTP site; I’m not sure how long it will go, but it’s over 12 GB so far. The CMS database with the Cost Report data loaded for all US hospitals is 20 GB, and I still have the Skilled Nursing Facility and Home Health Agency tables to load.
Also, the work that @fipelle is doing with TSAnalysis.jl and the FRED economic data is pretty awesome!

Please do!!


@mthelm85: Very interesting video!

@pontus: Thank you! I am very happy that someone is taking an interest in TSAnalysis.jl. I will try to finish up the next release pretty soon and show more examples. I am using FredData.jl to interface with ALFRED, and I am quite familiar with a range of economic databases, so if there is something that is not there yet, feel free to leave some comments on GitHub and I will try to work on it.
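
For anyone who hasn’t used it, pulling a series with FredData.jl looks roughly like this - a sketch only: GDPC1 is just an example series, you need a FRED API key set up first, and the vintage_dates keyword mirrors the FRED/ALFRED API, so check the package docs for the exact argument names:

using FredData

f = Fred()                      # reads your FRED API key from FRED_API_KEY or ~/.freddatarc
gdp = get_data(f, "GDPC1")      # real GDP as currently published
df  = gdp.data                  # DataFrame of dates and values

# ALFRED-style request: the series as it was published at a past vintage
gdp_2015 = get_data(f, "GDPC1"; vintage_dates = "2015-01-01")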


@pontus I created a repo for the files here: https://github.com/mthelm85/acs. I hadn’t seen the BlsData.jl package - thanks so much for sharing! Can you post a link to the IRS tax database? I just did a quick search and haven’t found it, but that’s a dataset I’d be super interested in.

@randyzwitch I too use Census data at work so I’d love to hear more about your process (a blog post would be great :wink:).

If I get some time in the next couple of weeks, I’ll try to record another video on working with the Current Population Survey. It’s a really amazing dataset, and since it’s released monthly it’s great when timeliness of the data is important. Unfortunately, I’ve found it to be a lot more difficult to work with, and I’m hoping to figure out how to compute standard errors for the estimates before doing the video - it’s not as straightforward with the CPS as it is with the ACS. For example, I’ve had to e-mail the BLS on several occasions (the CPS is a joint Census/BLS survey) to figure out how they filter the data to arrive at their published estimates. In my experience, the way they filter the data is not intuitive based on what’s available in the data dictionary, and results can vary dramatically depending on which variables you use (or don’t).

Anyway, thanks to all of you for the feedback, and thanks a ton for sharing!! I’m wondering if it would be worthwhile to get a public-data channel going on the Julia Slack…? I work with public (government) data all the time, I use Julia to do it, and I usually feel like I’m in the Wild West. It would be great to get a community going… I’m not sure what it takes to get invited to the Julia Slack, though.

Appreciate you dropping the repo!
I don’t have a link to the tax pieces, but I have sourced some other info from nber.org. Let me take a look later this week and see what I can send for links. What environment are you using, Windows or Linux?
Also, regarding the full BLS dataset:

FINISHED --2019-11-12 21:37:16--
Total wall clock time: 7h 14m 3s
Downloaded: 1784 files, 36G in 7h 1m 38s (1.46 MB/s)

@mthelm85 Thanks for your stream. I just watched all your videos and they are pretty cool! When are you doing the livestreams?


Randomly, right now… I thought it would be easier to get people to join me, but everyone is either too shy or just doesn’t want to do it for some other reason :stuck_out_tongue_closed_eyes:. I would eventually like to get to the point where I can schedule the streams once every two weeks and always have a guest on.

When I was building a lot of front-end stuff, I used to watch a JavaScript-based live stream that airs on Twitch.tv every Monday morning (check out FunFunFunction on Twitch/YouTube). The host tends to get pretty good guests from the JavaScript community to join him, and they code things live (it doesn’t always go so well, but that’s actually the point :wink:). It has really become a great place for web developers to hang out, chat with others in the community, network, interact with the host/guest, and learn. I think that kind of thing does a lot of good for a programming language community, and I’d love to see it happen with Julia. The beautiful thing about the live pair-programming format is that there isn’t a lot of prep work and there’s no editing afterward, so it’s not a big investment for the host or the guest.