Querying WRDS Data using Julia

I have written a short tutorial on querying WRDS data using Julia. A user only needs to input a username and password to work with WRDS remotely in Julia. Essentially, I translated the existing tutorial on querying WRDS data using R into Julia.

https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/

WRDS is the standard data source used by many finance and accounting departments. If you have a WRDS account, please go ahead and try my tutorial. I have asked WRDS to add this tutorial to their website, and they will probably add more detailed instructions. Basically, a user needs to understand some basic PostgreSQL commands to work with WRDS remotely. On the Julia side, a user needs to know how to work with DataFrames.jl. In addition, I personally use FixedEffectModels.jl for running regressions and highly recommend this package (I cannot say enough good things about it).
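For readers curious about that last recommendation, here is a minimal, hypothetical sketch of a fixed-effects regression with FixedEffectModels.jl; the panel and the variable names (`firm`, `year`, `size`, `ret`) are made up for illustration:

```julia
using DataFrames, FixedEffectModels

# Hypothetical panel: 50 firms observed over 10 years
df = DataFrame(firm = repeat(1:50, inner = 10),
               year = repeat(2001:2010, outer = 50),
               size = randn(500),
               ret  = randn(500))

# Returns on size with firm and year fixed effects,
# standard errors clustered by firm
m = reg(df, @formula(ret ~ size + fe(firm) + fe(year)), Vcov.cluster(:firm))
```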

#Querying WRDS Data using Julia

#load required packages
################################################################
using LibPQ, Tables, DataFrames 
################################################################


#1. connect to wrds
wrds = LibPQ.Connection("""
                        host = wrds-pgdata.wharton.upenn.edu
                        port = 9737
                        user = 'username'
                        password = 'password'
                        sslmode = 'require'
                        dbname = wrds
                        """)


#2. determine the data libraries available at WRDS:
res = execute(wrds, "select distinct table_schema
                     from information_schema.tables
                     where table_type ='VIEW'
                     or table_type ='FOREIGN TABLE'
                     order by table_schema")

data = DataFrame(columntable(res))
res = 0   # drop the reference so the result can be garbage collected
data

#3. determine the datasets within a given library:
res = execute(wrds, "select distinct table_name
                    from information_schema.columns
                    where table_schema='crsp'
                    order by table_name")

data = DataFrame(columntable(res))
res = 0
data


#4. determine the variables (column names) within a given dataset:
res = execute(wrds, "select column_name
                    from information_schema.columns
                    where table_schema='crsp'
                    and table_name='dsf'
                    order by column_name")

data = DataFrame(columntable(res))
res = 0
data

#5. To query the crsp.dsf dataset:
res = execute(wrds, "select * from crsp.dsf
                limit 10")
data = DataFrame(columntable(res))
res = 0
data

#6. To query specific variables from the crsp.dsf dataset:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi from crsp.dsf
                limit 10")
data = DataFrame(columntable(res))
res = 0
data
                
#7. To query variables that meet certain criteria:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where askhi > 2500
                   and bidlo < 2000
                   limit 10")
data = DataFrame(columntable(res))
res = 0
data
                    

#8. To query by date:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date = '2013-01-04'")
data = DataFrame(columntable(res))
res = 0
data
                   

#9. To query by date range:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date between '2013-01-07'
                   and '2013-01-08'")
data = DataFrame(columntable(res))
res = 0
data

#10. To query using multiple search criteria:
res = execute(wrds, "select cusip,permno,date,bidlo,askhi
                   from crsp.dsf
                   where date between '1960-01-01'
                   and '1980-01-01'
                   and askhi > 2500
                   and bidlo < 2000")
data = DataFrame(columntable(res))
res = 0
data


#11. To join and query two Compustat datasets:
res = execute(wrds, "select a.gvkey, a.datadate, a.tic,
                   a.conm, a.at, a.lt, b.prccm, b.cshoq
                   from comp.funda a join comp.secm b
                   on a.gvkey = b.gvkey
                   and a.datadate = b.datadate
                   where a.tic = 'IBM'
                   and a.datafmt = 'STD'
                   and a.consol = 'C'
                   and a.indfmt = 'INDL'")
data = DataFrame(columntable(res))
res = 0
data

Very cool! I have not tried this but will give it a shot sometime. I’ve mostly just resorted to downloading the entire SAS tables that I need from WRDS to my local machines and then using SASLib.jl to load them into DataFrames. It works for basically everything except the big stuff like the TAQ data.

The problem with downloading via the web interface is that if you need to update your data or code, you may have to download the data all over again.

Also, the data you download is basically merged data, which means it may take up a lot of space.

In addition, the data structure in WRDS makes more sense, while the web interface can be really confusing sometimes.

For high-resolution financial data like TAQ, there is currently no cheap way to work with it. In academia, people use the WRDS Cloud, but it is very inefficient: you have to submit your code and wait more than a day, sometimes more than a week, to see the results. If something goes wrong, you have to start over.

In the finance industry, kdb+ is used to handle large time-series data, but it is very expensive and I haven’t seen anyone in academia use it (I’m not sure whether they offer an academic license).


Pretty slick. It’d be cool if we could get WRDS to put this on their site.

I haven’t seen anyone use kdb+ in academia, either. I don’t think they offer an academic license, but they do have a fairly restrictive free personal license that probably would not be sufficient for TAQ.

Just heard back from WRDS:

Thank you so much for this. Right now, we do not officially support Julia as we do not have much internal experience with it. We will look into the tutorial you provided as well as the packages that are available and work on developing instructions for our end users.

Hopefully they can get this done soon.

Have you used this to do any processing with TAQ?

Not really, but it should be very similar. The library is taqmsec (Millisecond Trade and Quote), and within it there are cqm (Millisecond Consolidated Quotes), ctm (Millisecond Consolidated Trades), and nbbom (Millisecond National Best Bid and Offer) tables. For example, for 2003-09-10, you will have cqm_20030910, ctm_20030910, and nbbom_20030910.

Right, I’m very familiar with the data. What I guess I’m wondering is how effective this method would be for accessing the data. For example, does each execute call download its result to my local machine? At what point does the data transfer happen? For the TAQ files, even a fairly small subset can still leave you with multi-GB files per day.

I haven’t tried it myself, but I guess the workflow would look like this. First, you decide your sample range. Second, you write a for loop to download each day’s data as a CSV file on your disk. Third, you use JuliaDB to convert these CSV files into binary files. Fourth, you start working on the data.

Something like this:
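A hedged sketch of that daily download loop, assuming the `wrds` connection from the tutorial above and CSV.jl for writing files; the date range and the row cap are made up, and non-trading days have no table, so the query is wrapped in a try/catch:

```julia
using LibPQ, Tables, DataFrames, CSV, Dates

# Download each day's millisecond consolidated trades to a CSV on disk.
# taqmsec tables are named by date, e.g. ctm_20030910.
for d in Date(2003, 9, 8):Day(1):Date(2003, 9, 12)
    tag = Dates.format(d, "yyyymmdd")
    try
        res = execute(wrds, "select * from taqmsec.ctm_$tag limit 100000")
        CSV.write("ctm_$tag.csv", DataFrame(columntable(res)))
    catch e
        @warn "no table for $d (probably not a trading day)"
    end
end
```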

This is very useful, Yifan. Thanks.
However, I don’t see all the variables from crsp.dsf or crsp.msf. For example, the WRDS web interface shows 62 query variables for crsp.msf, while the code above produces only 21 variables. How do I see all the variables?

That is unrelated to this Julia code. There are actually two underlying files for the daily/monthly security files: DSF/MSF and DSE/MSE. The web interface shows you the union of their columns. There is a SAS macro called %CRSPMERGE (I think) that will join them for you.

tbeason is right. WRDS stores trading information in msf and dsf, and security information in mse and dse. You can merge them using date and permno.
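In the same style as the tutorial, a sketch of such a merge might look like the following; the dse column names (`shrcd`, `exchcd`) are assumptions, so check the actual columns with the step-4 query first:

```julia
# Join daily trading data (dsf) with security information (dse)
# on permno and date
res = execute(wrds, "select a.permno, a.date, a.ret, b.shrcd, b.exchcd
                     from crsp.dsf a join crsp.dse b
                     on a.permno = b.permno
                     and a.date = b.date
                     limit 10")
data = DataFrame(columntable(res))
```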

Aha! Thank you.