I’d almost guarantee there’s a perfectly easy way to do almost all of it already in Julia; people just don’t know what it is (possibly me included). I mostly don’t do regressions with regression formulae unless it’s really simple. The next step up for me from a very basic lm(@formula(a ~ b + c + d), ...) is to go straight to Turing and specify a full Bayesian model, then either sample or optimize it. I really think it’s a mistake to spend a lot of time in that intermediate space with simplified formula syntax, but I know this is a matter of opinion.
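To make that concrete, here is a minimal sketch of what that jump looks like, with made-up data standing in for the columns a, b, c, d (priors and sizes are just placeholders):

using Turing
using LinearAlgebra: I

# Made-up stand-ins for the columns in a ~ b + c + d
b, c, d = randn(100), randn(100), randn(100)
a = 1.0 .+ 2.0 .* b .- c .+ 0.5 .* d .+ 0.1 .* randn(100)

@model function linreg(a, b, c, d)
    σ ~ Exponential(1)                 # residual scale
    α ~ Normal(0, 10)                  # intercept
    β ~ filldist(Normal(0, 10), 3)     # slopes for b, c, d
    μ = α .+ β[1] .* b .+ β[2] .* c .+ β[3] .* d
    a ~ MvNormal(μ, σ^2 * I)
end

chain = sample(linreg(a, b, c, d), NUTS(), 1_000)

From here nothing stops you from making the model nonlinear or hierarchical, which is the whole point of skipping the formula layer.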
Sorry I wasn’t specific enough: I was wondering specifically whether you could give some data-cleaning examples that are easy in STATA but that you struggle with in Julia.
We could maybe split this out into its own topic, like “Data manipulation in Julia vs other standard tools. Examples” or something.
@Jorge_Vieyra, on Slack, also managed to optimize the likelihood calculation. @Jorge_Vieyra, I think it would be great if you posted your code here.
Panel models with aggregate fixed effects, i.e. where the fixed effect spans a number of values (in our case, all observations for each variable in 6-hour blocks). This is not possible in R’s plm() either, which is otherwise quite good. One needs STATA.
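For reference, a plain per-block fixed effect can be written in Julia with FixedEffectModels.jl by encoding the 6-hour blocks as a grouping variable. A minimal sketch with made-up data; whether this reproduces the aggregate behavior described above (the effect spanning all observations per variable within a block) I can’t say:

using DataFrames, FixedEffectModels

# Hypothetical panel at 1-minute frequency, bucketed into 6-hour blocks
df = DataFrame(t = 1:10_000, y = randn(10_000), x = randn(10_000))
df.block = (df.t .- 1) .÷ 360 .+ 1        # 360 minutes = 6 hours (assumed frequency)

m = reg(df, @formula(y ~ x + fe(block)))  # one fixed effect per 6-hour block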
There is a huge number of very useful, and even more importantly trusted, functions in STATA that exist nowhere else. It will take a long time, and a lot of determination and resources, for Julia’s statistical ecosystem to compete.
And last time I checked, there was nothing like R’s plm() in Julia (and Google just now did not suggest that had changed).
We agree, and we tried to be as careful as possible in our piece. That said, it was not directed at an expert user like yourself, who is quite able to evaluate these languages without outside advice.
We surmise that the vast majority of people who use one of these programming languages, once they have learned one, never switch to another. I think this is the reason for the enduring popularity of Matlab: students learn it in graduate school, continue using it as professors, and then make their PhD students learn Matlab too. We know of professors who produce world-leading research with statistical software that has not been updated in over 30 years.
Most users want a language to solve their problems; they don’t care about languages as such. In that case, all you need is one, and any of the four, Matlab, Python, Julia, or R, will usually get the job done.
Those we are writing for, and the reason we make such a recommendation, are people deciding between languages for important projects: research departments in central banks, PhD students starting out, and so on. Once they commit to a language, it is very costly to switch to another. And what they care about is very different from what people on the Julia Discourse tend to care about.
And in such cases, the best language is generally the one that can do most things, not a language that is really good at some things and bad at others. This is where R wins out over Julia. And if all you need is statistics, STATA is usually the best.
I strongly disagree, but I suspect you just have a really narrow idea of “most things”. For example, try coding an agent-based model in R; it will be slower than molasses. Or try solving PDEs at each time step of a time-series model, or writing a model with a delay differential equation, or running an optimization for each week in a 20-year time series… etc.
I worked on a model where, after each week of collecting new data, we sampled a Bayesian model for a few thousand samples and then ran an optimization to choose a portfolio of bets maximizing an expected outcome at the end of the week. It would have been ridiculously painful to do in R.
Sure, I was thinking of making a GitHub repo, polishing all the code (to be fair), and speeding it all up to the best of my ability. I also made a Project.toml instead of having all the libraries dangling.
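For anyone reproducing this, a Project.toml like that can be generated from the Pkg REPL roughly like so (on 1.8, DelimitedFiles still ships as a stdlib, so only the registered packages need adding):

julia> ]                     # enter Pkg mode
(@v1.8) pkg> activate .
(GARCH) pkg> add StatsBase BenchmarkTools LoopVectorization IterTools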
For the Julia code, this is a faster version:
using StatsBase
using DelimitedFiles
using BenchmarkTools
using LoopVectorization
using Base: Iterators
using IterTools

y = readdlm("data.dat")
y = y[:, 1]
N = length(y)
S = 1e3
o, a, b = [0.001, 0.85, 0.01]   # GARCH(1,1) parameters ω, α, β
y2 = y .^ 2;
v = var(y);                     # sample variance, used as the initial h

# Original version: the sequential GARCH(1,1) recursion and the
# log-likelihood accumulation share a single loop
function likelihood(o, a, b, h, y2, N)
    local lik = 0.0
    for i in 2:N
        h = o + a * y2[i-1] + b * h   # hᵢ = ω + α·y²ᵢ₋₁ + β·hᵢ₋₁
        lik += log(h) + y2[i] / h     # Σ log hᵢ + y²ᵢ/hᵢ, up to constants
    end
    return lik
end

# Faster version: the inherently sequential recursion is split from the
# reduction, so the second loop can be SIMD-vectorized with @turbo
function likelihood2(o, a, b, h₀, y2)
    N = length(y2)
    h = similar(y2)
    lik = 0.0
    h[begin] = h₀
    @inbounds for i in 2:N
        h[i] = o + a * y2[i-1] + b * h[i-1]
    end
    @turbo for i in 2:N
        lik += log(h[i]) + y2[i] / h[i]
    end
    return lik
end

lik = likelihood(o, a, b, v, y2, N)
lik2 = likelihood2(o, a, b, v, y2)
lik2 ≈ lik && @info "Original and modified version results match"

bench_o = @benchmark likelihood($o, $a, $b, $v, $y2, $N)
println("output,Julia-inbounds ", VERSION, ",", lik, ",", minimum(bench_o.times))
original_time = bench_o.times |> minimum

bench2 = @benchmark likelihood2($o, $a, $b, $v, $y2)
println("output,Julia-double-loopvec ", VERSION, ",", lik2, ",", minimum(bench2.times))
speedup = original_time / (bench2.times |> minimum)
@info "Speedup $(round(speedup, digits=1))X"
This is the output when running it on my system:
el_oso@galen:~/Downloads/GARCH$ julia --project garch_modified.jl
[ Info: Original and modified version results match
output,Julia-inbounds 1.8.0-rc4,206810.2364004015,91227.0
output,Julia-double-loopvec 1.8.0-rc4,206810.23640040183,34652.0
[ Info: Speedup 2.6X
This is something that has been criticized at my company. When users want to learn, they hit a “wall”: in most cases they cannot learn by example, and are instead confronted with a lot of reference manuals without many examples. Sometimes the reference is the code itself, which is not very useful for a starter.
Also, people want to google an already worked-out example on StackOverflow and copy/paste it. Finally, people don’t know where to look for documentation and/or help. It is generally available, but not in the usual places: on Discourse instead of StackOverflow, in Discourse threads instead of the package documentation, and on “chats” like Slack, Zulip, and Discord instead of being readily posted and digested for general consumption.
This is exactly the kind of thing where I think the emphasis on “special syntax” is totally misplaced. Turing lets you write general models in Julia itself, and I strongly recommend that people with complicated models jump directly to doing that. Even if you don’t want to MCMC sample from a model, you can have Turing optimize it and give you a point estimate.
Most of these comparisons discuss things from the perspective of a beginner. But I honestly think the more important perspective is that of an intermediate or advanced user. Of course you have to be able to get there, so the beginner material is important, but the walls you hit trying to shoehorn everything into R’s lmer syntax and the like are real. Having the full Julia language available to write a model in is what makes Julia so amazing, so if we never consider that particular perspective we get a very biased view of Julia’s advantages.
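As a concrete sketch of the point-estimate route (the mode-estimation API has moved around between Turing versions; in the versions I have used, loading Optim enables this):

using Turing, Optim

@model function coinflip(y)
    p ~ Beta(1, 1)        # uniform prior on the success probability
    y .~ Bernoulli(p)     # broadcasted observation statement
end

y = rand(Bool, 50)                        # made-up data
map_est = optimize(coinflip(y), MAP())    # posterior mode, no sampling
mle_est = optimize(coinflip(y), MLE())    # plain maximum likelihood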
An excellent initiative. Do you mind if we include this in our web appendix?
One observation, though: doesn’t @turbo use multithreading? We should have been clearer that all the other comparisons are single-threaded, so it is no surprise you get a 2.6× speedup.
@turbo is single-threaded, while @tturbo is multi-threaded. The extra t stands for threading.
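A minimal illustration on the reduction loop from the code above (function names made up); only the macro changes between the two versions:

using LoopVectorization

function sumloglik(h, y2)          # single-threaded SIMD
    s = 0.0
    @turbo for i in 2:length(h)
        s += log(h[i]) + y2[i] / h[i]
    end
    return s
end

function sumloglik_mt(h, y2)       # SIMD plus multithreading
    s = 0.0
    @tturbo for i in 2:length(h)
        s += log(h[i]) + y2[i] / h[i]
    end
    return s
end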
Inspired by @Sukera’s comment a bit up the chain about test documentation, and @johnmyleswhite’s comments on identifying and trying to tackle issues, I opened a PR to the Julia docs that gives an example workflow for pkg> test: how to create tests, run them, and add the test suite to one’s own package.
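For readers who want the short version of that workflow, the skeleton looks roughly like this (package and function names hypothetical):

# test/runtests.jl inside MyPackage
using Test
using MyPackage

@testset "MyPackage" begin
    @test MyPackage.double(2) == 4                    # hypothetical function
    @test_throws MethodError MyPackage.double("two")
end

With Test listed under [extras] and a test target in Project.toml, pkg> test MyPackage runs this suite.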
Feel free to add some review/revisions to it. I found myself on a long bus ride today and figured I would try to contribute another actionable item from this interesting discussion.
Yes, of course. Still, it uses SIMD, and we could have used that for (some of) the other languages, so it doesn’t quite make for a fair comparison. But it is certainly worth mentioning.
I disagree. Automatic SIMD is definitely fair, since it doesn’t use extra resources; it just exploits the allocated ones better. In fact, this is not at all limited to @turbo, but is something Julia uses automatically in many different places. Would you somehow deliberately block SIMD vectorization?
Anyway, NumPy and Matlab surely use SIMD internally too, and probably R as well.
Limiting multi-threading makes sense for comparisons, but preventing SIMD does not.
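To illustrate: for a floating-point reduction like the one below, the @simd hint allows the compiler to reassociate the sum and vectorize it, and many other loops vectorize with no annotation at all. You can check by looking for <4 x double>-style instructions in the emitted code (a sketch):

using InteractiveUtils   # for @code_llvm outside the REPL

function mysum(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

@code_llvm mysum(rand(1_000))   # look for vector instructions in the output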
A fantastic idea.
Some time ago my misunderstanding of Pkg led me to initiate a discussion here, in which I showed myself to be rather ignorant of Julia. Alas, I have no shame, so here it is, if of use: Best practice for packages on shared drives
Thanks, and no worries: it’s discussions like these that can lead us to solutions. I figure even if that isn’t a perfect PR, it is at least something we can all critique and make better. Feel free to comment on that PR with additional ideas/thoughts.
I would agree in most cases. However, I believe the authors specifically used an algorithm that can’t be vectorised, while the improved Julia code uses a modified algorithm which can be vectorised in part. It’s not really a fair comparison in the spirit of the article.
Changing the algorithm might not be fair game, I agree, but any automatic SIMD that might happen should be fine.
Agreed, and @turbo is no less fair than @jit.
In case you are interested, I also made a functional version of the code that could perhaps be parallelized. It is not faster, but I like it because it has zero allocations and the GC is never called.
using StatsBase
using DelimitedFiles
using BenchmarkTools
using Base: Iterators
using IterTools

y = readdlm("data.dat")
y = y[:, 1]
N = length(y)
S = 1e3
o, a, b = [0.001, 0.85, 0.01]
y2 = y .^ 2;
v = var(y);

# Functional version: the GARCH recursion becomes a lazy iterator
# (IterTools.iterated) and the log-likelihood a mapreduce, so no
# intermediate h vector is ever allocated
function likelihood_functional(o, a, b, h0, y2)
    N = length(y2)
    # One recursion step: (hᵢ, i) ↦ (hᵢ₊₁, i + 1)
    function updateh(t)
        newh, i = t
        @inbounds o + a * y2[i] + b * newh, i + 1
    end
    # Per-observation term; the (i > 1) factor zeroes out the initial
    # state so the sum effectively starts at i = 2
    function loginv(t)
        hh, i = t
        @inbounds (i > 1) * (log(hh) + y2[i] / hh)
    end
    return mapreduce(loginv, +, Iterators.take(iterated(updateh, (h0, 1)), N))
end

lik = likelihood_functional(o, a, b, v, y2)
bench = @benchmark likelihood_functional($o, $a, $b, $v, $y2)
println("output,Julia-functional ", VERSION, ",", lik, ",", minimum(bench.times))
It is not really faster, but what I like about it is that the timings are more consistent:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 92.477 μs … 126.451 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 94.428 μs ┊ GC (median): 0.00%
Time (mean ± σ): 94.960 μs ± 2.134 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅██▄▇▇▅▂▂▁▂▂ ▂▁▁▂▂▃▁▁▁ ▁ ▂
▆▄▃▇▅▄▇▆█████████████▇██████████▇▅▆▇▆▆▆▆▇▆▇▆▇▅▄▅▅▄▄▅▄▄▅▅▄▇██ █
92.5 μs Histogram: log(frequency) by time 102 μs <