Julia cookbook available

I am pleased to announce the availability of my julia cookbook at

**http://julia.cookbook.tips**

most of the early chapters are in pretty good shape and updated to julia 1.0. the later and more complex subject chapters are in a mix of good and bad shape. this is partly because julia itself is still shifting.


PS: as to myself, I will come back to julia when it has acquired [a] superior data frame handling, ideally language-integrated; [b] superior fast (gzipped) csv IO; and [c] parallel processing. until then, for the kind of data analysis tasks that I am involved in, julia remains much slower than R. but julia has many other excellent use cases.

7 Likes

Is it vegetarian or non-vegetarian?

1 Like

Glad someone has a similar experience.

Do you mind elaborating on [a] and [c]? What are the specific issues that bother you the most?

Also, have you looked at TableReader.jl for [b]? It processes gzip files directly and, for many CSV applications, it seems to be quite fast and competitive with the R readers.
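For instance, a minimal sketch (the filename is just a placeholder, and it assumes TableReader's exported readcsv detects and decompresses .gz input on its own, as its README describes):

using TableReader

## readcsv returns a DataFrame; gzipped input should be decompressed transparently
df = readcsv("yourdata.csv.gz")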

Are you interested in corrections to this, and/or do you have a preferred mechanism for submitting them? I’ve found a few factual errors, particularly relating to the way the cookbook talks about the performance of “machine native” types.

4 Likes

hi tk—I thought for a while about how best to communicate this, and about how to figure out in the future whether julia has become mature enough for our basic data needs. I decided that I may as well write some simple R code that demonstrates the need. the first program is not the test itself; it just writes a typical 1.8GB data set to disk:

library(data.table)
set.seed(0)

NF <- 1000000

permno <- 1:NF
startdt <- as.integer(runif( NF )*500)
enddt <- as.integer(startdt+1+runif( NF )*500)

all.nrows <- sum(enddt)-sum(startdt)+NF
all.p <- rep(NA, all.nrows)
all.t <- rep(NA, all.nrows)

cm <- 1
for (permno in 1:NF) {
    all.p[ cm:(cm+enddt[permno]-startdt[permno]) ] <- permno
    all.t[ cm:(cm+enddt[permno]-startdt[permno]) ] <- startdt[permno]:enddt[permno]
    cm <- cm + enddt[permno] - startdt[permno] + 1   ## advance to the next firm's block
}


d <- data.frame( permno= all.p, t=all.t )
d <- within(d, prc <- rnorm( nrow(d), 100, 1 ))

d[["prc"]][ sample(1:nrow(d), nrow(d)/30 ) ] <- NA

fwrite(d, file="test.csv")
system("gzip test.csv")

cat("Test Data Set Created\n")
system("ls -lh test.csv.gz")

so, please use the above csv file as input for both the R and the julia code. this kind of csv processing is standard in my field.

Test Code

now, let’s get to the benchmark. the following test program does what my students and I need to do most of the time: read an irregular data set, create some time-series and cross-sectional variables (here, returns and market returns), run a regression for each of many firms, and finally save the results in a csv file.

library(data.table)   ## for fread
library(parallel)     ## for mclapply

print( system.time( {
    lagseries <- function(x) x[c(NA, 1:(length(x) - 1))]  ## lag a vector by one observation

    d <- fread("test.csv.gz")

    ## calculate rates of return in a panel
    d <- within(d, ret <- prc / lagseries(prc)-1)
    d <- within(d, ret <- ifelse( permno != lagseries(permno), NA, ret ))

    ## create the market rate of return
    d <- within(d, mktret <- ave( ret, t, FUN=function(x) mean(x,na.rm=TRUE) ))

    ## a market-model creates an alpha and a beta
    marketmodel <- function(d) {
        if (nrow(d) < 5) return(NULL)  ## minimum of 5 observations
        coef(lm( ret ~ mktret, data=d ))  ## the actual regression
    }

    indexes <- split( 1:nrow(d), d$permno )
    betas <- mclapply( indexes, FUN=function(.index) marketmodel(d[.index, , drop=FALSE]) )

    ## transform it into a nicer matrix
    betas <- do.call("rbind", betas)
    colnames(betas) <- c("alpha", "beta")

    ## and write it
    write.csv(betas, file="betas.csv")
} ))

This can be sped up further by using the R compiler package, wrapping cmpfun() around the marketmodel function. But let’s just leave it this way.

I have not managed to write a julia implementation that can compete with R. by this, I mean julia being no more than 30% slower than R on a 6- to 8-core machine. ideally, julia would be just as fast and have nicer and cleaner code.

if someone can demonstrate competitive julia code for this task, I would be thrilled and will reconsider using julia here at UCLA for teaching quant finance.
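for concreteness, here is roughly the shape I have in mind (an untested sketch, not a timing claim; it leans on the documented CSV.jl, CodecZlib.jl, DataFrames.jl, and GLM.jl interfaces, i.e. CSV.read(io, DataFrame), transform!/combine on grouped data frames, and lm with @formula):

using CSV, DataFrames, CodecZlib, GLM, Statistics

@time begin
    ## read the gzipped csv
    d = open("test.csv.gz") do io
        CSV.read(GzipDecompressorStream(io), DataFrame)
    end

    ## rate of return within each firm; the first observation per firm stays missing
    lagged(x) = [missing; x[1:end-1]]
    transform!(groupby(d, :permno), :prc => (p -> p ./ lagged(p) .- 1) => :ret)

    ## market rate of return: cross-sectional mean return on each date
    safemean(r) = (s = collect(skipmissing(r)); isempty(s) ? missing : mean(s))
    transform!(groupby(d, :t), :ret => safemean => :mktret)

    ## market-model alpha and beta, firm by firm (minimum of 5 observations)
    function marketmodel(sub)
        sub = dropmissing(sub[:, [:ret, :mktret]])
        nrow(sub) < 5 && return (alpha = missing, beta = missing)
        a, b = coef(lm(@formula(ret ~ mktret), sub))
        (alpha = a, beta = b)
    end
    betas = combine(marketmodel, groupby(d, :permno))

    ## and write it
    CSV.write("betas.csv", betas)
end

the by-firm regressions could presumably be parallelized with Threads or Distributed once that story settles, which is exactly what the mclapply call buys me in R today.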

4 Likes

hi robin—I would indeed be interested.

if you send me an email with a username and passcode, I can give you wiki editing privileges. (same holds for anyone else who wants to tinker with it.)

(I also wrote some code to test automatically that everything is up to date and still gives the very same output [when Julia or packages update], but this takes too much maintenance if I don’t end up using julia myself any longer.)

/iaw

It would be quite challenging for any Julia package to beat data.table in a short time. There has been so much work put into pandas, but it is still much slower than data.table on many tasks.

Since data.table is written in C, and Julia can call C code easily, would it be possible to use the data.table code to create a Julia package? Python already has one.
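For what it's worth, the call itself is trivial in Julia. Here is a hypothetical one-liner against libc, nothing to do with data.table's actual internals, which are not a documented public C API:

len = ccall(:strlen, Csize_t, (Cstring,), "hello, world")   ## returns 12

As far as I understand, the harder part is that data.table's C routines are written against R's own in-memory structures, so they could not simply be ccall'ed on Julia data; the FFI is not the bottleneck.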

yes, R is heavily optimized for data analysis. I don’t think julia will be able to beat it.

if julia is half the speed for applied data analysis, then few data analysts will want to switch from R to julia, whether we like it or not.

I think quite a few people, including myself, are waiting for the new multithreading feature so that we can speed up our code more easily. Reading CSV files can be parallelized, but the current story is not great because IO operations are not thread-safe at the moment.
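Just to make the model concrete, here is a toy sketch of what @threads gives you once it is solid (a hypothetical reduction, not CSV parsing; start Julia with e.g. JULIA_NUM_THREADS=8):

using Base.Threads

function threaded_sum(n)
    partial = zeros(nthreads())          ## one accumulator per thread to avoid a data race
    @threads for i in 1:n
        partial[threadid()] += 1.0 / i
    end
    return sum(partial)
end

threaded_sum(10^7)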

I’m glad that this PR has been merged though.

https://github.com/JuliaLang/julia/pull/22631

1 Like

I am happy with the current status of data wrangling tools in Julia. Importing data quickly is important, but being able to import out-of-memory data is more important to me.

Are there any working examples of JuliaDB working with larger-than-memory data?

Which packages do you use regularly?

I only use DataFramesMeta. If I can use JuliaDB to work on out-of-core large data, that would be great.