hi tk, I thought for a while about how best to communicate this, and about how to figure out in the future whether julia has become mature enough for our basic data needs. I decided that I may as well write some simple R code that demonstrates the need. the first program is not the test itself; it just writes a typical 1.8GB data set to disk:
library(data.table)
set.seed(0)
NF <- 1000000
permno <- 1:NF
startdt <- as.integer(runif( NF )*500)
enddt <- as.integer(startdt+1+runif( NF )*500)
all.nrows <- sum(enddt)-sum(startdt)+NF
all.p <- rep(NA, all.nrows)
all.t <- rep(NA, all.nrows)
cm <- 1
for (permno in 1:NF) {
  all.p[ cm:(cm+enddt[permno]-startdt[permno]) ] <- permno
  all.t[ cm:(cm+enddt[permno]-startdt[permno]) ] <- startdt[permno]:enddt[permno]
  cm <- cm + enddt[permno] - startdt[permno] + 1   ## advance to the next firm's block of rows
}
d <- data.frame( permno= all.p, t=all.t )
d <- within(d, prc <- rnorm( nrow(d), 100, 1 ))
d[["prc"]][ sample(1:nrow(d), nrow(d)/30 ) ] <- NA
fwrite(d, file="test.csv")
system("gzip test.csv")
cat("Test Data Set Created\n")
system("ls -lh test.csv.gz")
so, please use the above csv file as input to both the R and the julia code. this kind of csv layout is standard in my field.
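for reference, the file has the usual flat layout, one row per firm-date, with columns permno, t, and prc (the rows below are only illustrative, not the actual values generated by the seed above):

permno,t,prc
1,216,100.319
1,217,99.852
1,218,101.004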
Test Code
now, let's get to the benchmark. the following test program does what my students and I need to do most of the time: read an irregular (unbalanced) panel data set, create some time-series and cross-sectional variables (here, stock returns and market returns), run a separate regression for each of many firms, and finally save the results to a csv file.
library(data.table)
library(parallel)   ## for mclapply

print( system.time( {
    lagseries <- function(x) x[c(NA, 1:(length(x) - 1))]   ## lag a vector by one observation
    d <- fread("test.csv.gz")   ## fread reads the gzipped file directly (the R.utils package must be installed)
    ## calculate rates of return in a panel
    d <- within(d, ret <- prc / lagseries(prc) - 1)
    d <- within(d, ret <- ifelse( permno != lagseries(permno), NA, ret ))   ## no returns across firm boundaries
    ## create the market rate of return (equal-weighted mean return on each date)
    d <- within(d, mktret <- ave( ret, t, FUN=function(x) mean(x, na.rm=TRUE) ))
    ## a market-model regression yields an alpha and a beta for one firm
    marketmodel <- function(d) {
        if (nrow(d) < 5) return(NULL)   ## require a minimum of 5 observations
        coef(lm( ret ~ mktret, data=d ))   ## the actual regression
    }
    indexes <- split( 1:nrow(d), d$permno )
    betas <- mclapply( indexes, FUN=function(.index) marketmodel(d[.index, , drop=FALSE]) )
    ## transform it into a nicer matrix, one row per firm
    betas <- do.call("rbind", betas)
    colnames(betas) <- c("alpha", "beta")
    ## and write it
    write.csv(betas, file="betas.csv")
} ))
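one practical note on the parallel step: mclapply uses only getOption("mc.cores", 2L) workers unless told otherwise, so on a 6 to 8-core machine it is worth setting that option explicitly before the timed block, for example:

library(parallel)
options(mc.cores = detectCores())   ## let mclapply use every available core (adjust if you want fewer)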
The regression step can be sped up a bit further by using the R compiler package, wrapping cmpfun around the marketmodel function. But let's just leave it this way.
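for completeness, the byte-compilation I mean is just the following sketch (it does not change what the code computes, only how R executes marketmodel):

library(compiler)
marketmodel <- cmpfun(marketmodel)   ## byte-compile the per-firm regression function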
I have not managed to write a julia implementation that can compete with R. by this, I mean julia being no more than 30% slower than R on a 6 to 8-core machine. ideally, julia would be just as fast and have nicer, cleaner code.
if someone can demonstrate competitive julia code for this task, I would be thrilled and would reconsider using julia here at UCLA for teaching quant finance.