Julia cookbook available

I am pleased to announce the availability of my julia cookbook at

**http://julia.cookbook.tips**

most of the early chapters are in pretty good shape and updated to julia 1.0. the later and more complex subject chapters are in a mix of good and bad shape. this is partly because julia itself is still shifting.


PS: as to myself, I will come back to julia when it has acquired [a] superior data frame handling, ideally language-integrated; [b] superior fast (gzipped) csv IO; and [c] parallel processing. until then, for the kind of data analysis tasks that I am involved in, julia remains much slower than R. but julia has many other excellent use cases.

7 Likes

Is it vegetarian or non-vegetarian?

1 Like

Glad someone has a similar experience.

Do you mind elaborating on [a] and [c]? What are the specific issues that bother you the most?

Also, have you looked at TableReader.jl for [b]? It processes gzip files directly and, for many CSV applications, it seems to be quite fast and competitive with the R readers.
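For instance, a minimal sketch (the filename is just a placeholder, and it assumes TableReader's exported readcsv detects and decompresses .gz input on its own, as its README describes):

using TableReader

## readcsv returns a DataFrame; gzipped input should be decompressed transparently
df = readcsv("yourdata.csv.gz")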

Are you interested in corrections to this, and/or do you have a preferred mechanism for submitting them? I’ve found a few factual errors, particularly relating to the way the cookbook talks about the performance of “machine native” types.

4 Likes

hi tk—I thought for a while about how best to communicate this, and about how to figure out in the future whether julia has become mature enough for our basic data needs. I decided that I may as well write some simple R code that demonstrates the need. the first program is not the test itself; it just writes a typical 1.8GB data set to disk:

library(data.table)
set.seed(0)

NF <- 1000000

permno <- 1:NF
startdt <- as.integer(runif( NF )*500)
enddt <- as.integer(startdt+1+runif( NF )*500)

all.nrows <- sum(enddt)-sum(startdt)+NF
all.p <- rep(NA, all.nrows)
all.t <- rep(NA, all.nrows)

cm <- 1
for (permno in 1:NF) {
    all.p[ cm:(cm+enddt[permno]-startdt[permno]) ] <- permno
    all.t[ cm:(cm+enddt[permno]-startdt[permno]) ] <- startdt[permno]:enddt[permno]
    cm <- cm + enddt[permno] - startdt[permno] + 1   ## advance to the next firm's block
}


d <- data.frame( permno= all.p, t=all.t )
d <- within(d, prc <- rnorm( nrow(d), 100, 1 ))

d[["prc"]][ sample(1:nrow(d), nrow(d)/30 ) ] <- NA

fwrite(d, file="test.csv")
system("gzip test.csv")

cat("Test Data Set Created\n")
system("ls -lh test.csv.gz")

so, please use the above csv file as input for both the R and the julia code. this kind of csv processing is standard in my field.

Test Code

now, let’s get to the benchmark. the following test program does what my students and I need to do most of the time: read an irregular data set, create some time-series and cross-sectional variables (here, returns and market returns), run a regression for each of many firms, and finally save the results in a csv file.

library(data.table)   ## for fread
library(parallel)     ## for mclapply

print( system.time( {
    lagseries <- function(x) x[c(NA, 1:(length(x) - 1))]  ## lag a vector by one observation

    d <- fread("test.csv.gz")

    ## calculate rates of return in a panel
    d <- within(d, ret <- prc / lagseries(prc)-1)
    d <- within(d, ret <- ifelse( permno != lagseries(permno), NA, ret ))

    ## create the market rate of return
    d <- within(d, mktret <- ave( ret, t, FUN=function(x) mean(x,na.rm=TRUE) ))

    ## a market-model creates an alpha and a beta
    marketmodel <- function(d) {
        if (nrow(d) < 5) return(NULL)  ## minimum of 5 observations
        coef(lm( ret ~ mktret, data=d ))  ## the actual regression
    }

    indexes <- split( 1:nrow(d), d$permno )
    betas <- mclapply( indexes, FUN=function(.index) marketmodel(d[.index, , drop=FALSE]) )

    ## transform it into a nicer matrix
    betas <- do.call("rbind", betas)
    colnames(betas) <- c("alpha", "beta")

    ## and write it
    write.csv(betas, file="betas.csv")
} ))

This can be sped up further by using the R compiler package, wrapping cmpfun() around the marketmodel function. But let’s just leave it this way.

I have not managed to write a julia implementation that can compete with R. by this, I mean julia being no more than 30% slower than R on a 6- to 8-core machine. ideally, julia would be just as fast and have nicer and cleaner code.

if someone can demonstrate competitive julia code for this task, I would be thrilled and will reconsider using julia here at UCLA for teaching quant finance.
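for concreteness, here is roughly the shape I have in mind (an untested sketch, not a timing claim; it leans on the documented CSV.jl, CodecZlib.jl, DataFrames.jl, and GLM.jl interfaces, i.e. CSV.read(io, DataFrame), transform!/combine on grouped data frames, and lm with @formula):

using CSV, DataFrames, CodecZlib, GLM, Statistics

@time begin
    ## read the gzipped csv
    d = open("test.csv.gz") do io
        CSV.read(GzipDecompressorStream(io), DataFrame)
    end

    ## rate of return within each firm; the first observation per firm stays missing
    lagged(x) = [missing; x[1:end-1]]
    transform!(groupby(d, :permno), :prc => (p -> p ./ lagged(p) .- 1) => :ret)

    ## market rate of return: cross-sectional mean return on each date
    safemean(r) = (s = collect(skipmissing(r)); isempty(s) ? missing : mean(s))
    transform!(groupby(d, :t), :ret => safemean => :mktret)

    ## market-model alpha and beta, firm by firm (minimum of 5 observations)
    function marketmodel(sub)
        sub = dropmissing(sub[:, [:ret, :mktret]])
        nrow(sub) < 5 && return (alpha = missing, beta = missing)
        a, b = coef(lm(@formula(ret ~ mktret), sub))
        (alpha = a, beta = b)
    end
    betas = combine(marketmodel, groupby(d, :permno))

    ## and write it
    CSV.write("betas.csv", betas)
end

the by-firm regressions could presumably be parallelized with Threads or Distributed once that story settles, which is exactly what the mclapply call buys me in R today.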

4 Likes

hi robin—I would indeed be interested.

if you send me an email with a username and passcode, I can give you wiki editing privileges. (same holds for anyone else who wants to tinker with it.)

(I also wrote some code to test automatically that everything is up to date and still gives the very same output [when Julia or packages update], but this takes too much maintenance if I don’t end up using julia myself any longer.)

/iaw

It would be quite challenging for any Julia package to beat data.table in a short time. There has been so much work put into pandas, but it is still much slower than data.table on many tasks.

Since data.table is written in C, and Julia can call C code easily, would it be possible to use the data.table code to create a Julia package? Python already has one.
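For what it's worth, the call itself is trivial in Julia. Here is a hypothetical one-liner against libc, nothing to do with data.table's actual internals, which are not a documented public C API:

len = ccall(:strlen, Csize_t, (Cstring,), "hello, world")   ## returns 12

As far as I understand, the harder part is that data.table's C routines are written against R's own in-memory structures, so they could not simply be ccall'ed on Julia data; the FFI is not the bottleneck.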

yes, R is heavily optimized for data analysis. I don’t think julia will be able to beat it.

if julia is half the speed for applied data analysis, then few data analysts will want to switch from R to julia, whether we like it or not.

I think quite a few people, including myself, are waiting for the new multithreading feature so that we can speed up our code more easily. Reading CSV files can be parallelized, but the current story is not great because IO operations are not thread-safe at the moment.
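Just to make the model concrete, here is a toy sketch of what @threads gives you once it is solid (a hypothetical reduction, not CSV parsing; start Julia with e.g. JULIA_NUM_THREADS=8):

using Base.Threads

function threaded_sum(n)
    partial = zeros(nthreads())          ## one accumulator per thread to avoid a data race
    @threads for i in 1:n
        partial[threadid()] += 1.0 / i
    end
    return sum(partial)
end

threaded_sum(10^7)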

I’m glad that this PR has been merged though.

https://github.com/JuliaLang/julia/pull/22631

1 Like

I am happy with the current status of data wrangling tools in Julia. Importing data quickly is important, but being able to import out-of-memory data is more important to me.

Are there any working examples of JuliaDB working with larger-than-memory data?

Which packages do you use regularly?

I only use DataFramesMeta. If I can use JuliaDB to work on out-of-core large data, that would be great.