Querying WRDS Data using Julia

I have written a short tutorial about querying WRDS data using Julia. A user only needs to input her username and password to work with WRDS remotely in Julia. It is essentially a translation of the existing WRDS tutorial on querying WRDS data with R:

https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/

WRDS is the standard data source used by many finance and accounting departments. If you have a WRDS account, please go ahead and try my tutorial. I have asked WRDS to add this tutorial to their website, and they will probably add more detailed instructions. Basically, a user needs to understand some basic PostgreSQL commands to work with WRDS remotely. On the Julia side, a user needs to know how to work with DataFrames. In addition, I personally use FixedEffectModels.jl for running regressions, and I highly recommend this package (I cannot say enough good things about it).
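For readers who have not used FixedEffectModels.jl, here is a minimal sketch of what a regression looks like once your query result is in a DataFrame; the data and column names (`ret`, `mktrf`, `permno`, `yr`) are made up for illustration, not from WRDS:

```julia
using DataFrames, FixedEffectModels

# Hypothetical panel: returns, a market factor, firm id, year
df = DataFrame(ret    = randn(1000),
               mktrf  = randn(1000),
               permno = rand(1:50, 1000),
               yr     = rand(2000:2010, 1000))

# Regress returns on the market factor with firm and year fixed effects,
# clustering standard errors by firm
m = reg(df, @formula(ret ~ mktrf + fe(permno) + fe(yr)),
        Vcov.cluster(:permno))
```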

#Querying WRDS Data using Julia

#load required packages
################################################################
using LibPQ, Tables, DataFrames 
################################################################


#1. connect to WRDS (replace 'username' and 'password' with your WRDS credentials)
wrds = LibPQ.Connection("""
                        host = wrds-pgdata.wharton.upenn.edu
                        port = 9737
                        user = 'username'
                        password = 'password'
                        sslmode = 'require'
                        dbname = 'wrds'
                        """)


#2. determine the data libraries available at WRDS:
res = execute(wrds, "select distinct table_schema
                     from information_schema.tables
                     where table_type ='VIEW'
                     or table_type ='FOREIGN TABLE'
                     order by table_schema")

data = DataFrame(columntable(res))
res = 0   # drop the reference to the result so it can be garbage collected
data

#3. determine the datasets within a given library:
res = execute(wrds, "select distinct table_name
                    from information_schema.columns
                    where table_schema='crsp'
                    order by table_name")

data = DataFrame(columntable(res))
res = 0
data


#4. determine the variables (column names) within a given dataset:
res = execute(wrds, "select column_name
                    from information_schema.columns
                    where table_schema='crsp'
                    and table_name='dsf'
                    order by column_name")

data = DataFrame(columntable(res))
res = 0
data

#5. To query the crsp.dsf dataset:
res = execute(wrds, "select * from crsp.dsf
                limit 10")
data = DataFrame(columntable(res))
res = 0
data

#6. To query specific variables from the crsp.dsf dataset:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi from crsp.dsf
                limit 10")
data = DataFrame(columntable(res))
res = 0
data
                
#7. To query variables that meet certain criteria:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where askhi > 2500
                   and bidlo < 2000
                   limit 10")
data = DataFrame(columntable(res))
res = 0
data
                    

#8. To query by date:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date = '2013-01-04'")
data = DataFrame(columntable(res))
res = 0
data
                   

#9. To query by date range:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date between '2013-01-07'
                   and '2013-01-08'")
data = DataFrame(columntable(res))
res = 0
data

#10. To query using multiple search criteria:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date between '1960-01-01'
                   and '1980-01-01'
                   and askhi > 2500
                   and bidlo < 2000")
data = DataFrame(columntable(res))
res = 0
data


#11. To join and query two Compustat datasets:
res = execute(wrds, "select a.gvkey, a.datadate, a.tic,
                   a.conm, a.at, a.lt, b.prccm, b.cshoq
                   from comp.funda a join comp.secm b
                   on a.gvkey = b.gvkey
                   and a.datadate = b.datadate
                   where a.tic = 'IBM'
                   and a.datafmt = 'STD'
                   and a.consol = 'C'
                   and a.indfmt = 'INDL'")
data = DataFrame(columntable(res))
res = 0
data
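One more step worth adding at the end: LibPQ connections should be closed once you are done querying.

```julia
#12. close the connection when finished
close(wrds)
```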

Very cool! I have not tried this but will give it a shot sometime. I’ve mostly just resorted to downloading the entire SAS tables that I need from WRDS to my local machine and then using SASLib.jl to load them into DataFrames. It works for basically everything except the big stuff like the TAQ data.
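For reference, the SASLib.jl workflow mentioned above looks roughly like this; the file name is hypothetical, and this assumes a SASLib version whose result supports the Tables.jl interface:

```julia
using SASLib, DataFrames

rs = readsas("dsf.sas7bdat")   # read a SAS table downloaded from WRDS
df = DataFrame(rs)             # convert the result set to a DataFrame
```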

The problem with downloading via the web interface is that if you need to update your data or code, you may have to download your data again.

Also, the data you download is essentially merged data, which means it may take a lot of space.

In addition, the data structure in WRDS makes more sense, while the web interface can be really confusing sometimes.

For high-resolution financial data like TAQ, there is currently no cheap option. In academia, people use the WRDS Cloud, but it is very inefficient: you have to submit your code and wait more than a day, sometimes more than a week, to see the results. If something goes wrong, you have to start over.

In the finance industry, kdb+ is used to handle large time-series data, but it is very expensive, and I haven’t seen anyone in academia use it (I am not sure whether they offer an academic license).


Pretty slick. It’d be cool if we could get WRDS to put this on their site.

I haven’t seen anyone use kdb+ in academia, either. I don’t think they offer an academic license, but they do have a fairly restrictive free personal license that probably would not be sufficient for TAQ.

Just heard back from WRDS:

Thank you so much for this. Right now, we do not officially support Julia as we do not have much internal experience with it. We will look into the tutorial you provided as well as the packages that are available and work on developing instructions for our end users.

Hopefully they can get this done soon.