Best way to learn Julia for non-programmer data analyst?

Hi everyone,

Here’s my brief question:

What do you all recommend as a good post-v1.0 “Learn Julia” book for a non software developer / data analyst? By “non software developer” I mean someone who lacks both patience and natural talent for low level, “low productivity” languages like C, C++, and Rust (Java and C# are only slightly better). By “data analyst” I mean someone who regularly deals with large amounts of data (10s of GB to 10s of TB, usually unstructured or semi-structured), but has little to no need for ML/DL/AI related algorithms. Huge bonus points if said book includes practice problems/exercises (I learn best by doing), and if it covers both using built-in functionality/modules and developing new functionality (including wrapping existing C/C++ code).

Does such a book exist? If not, how do people like me learn Julia?

Here’s the longer version:

My computing experience started with being primarily taught Matlab as an undergraduate engineering student. Though I took a couple of computer science courses, C and C++ always seemed too low-level for my liking; pointers and indirection required more thought than I wanted to give them, and it always took much longer than I expected to accomplish something useful. So Matlab was where it was at.

As a graduate student, I rather quickly ran headlong into some of Matlab’s issues/weaknesses, chief among them its cost. I also rather quickly found that there’s often only one way to make Matlab code perform well (i.e., figure out how to vectorize your code), which caused problems. In looking for an alternative, I discovered Python, which I fell in love with, built much of my dissertation on, and which I’ve done most of my subsequent “computing” work with. For the most part, it’s a great language: well designed, strongly biased towards the human side of computing (rather than the machine side), huge community, lots of libraries/modules, easily found threads with solutions to similar problems, etc. And I love the interactive nature of Jupyter Notebooks / IPython - I do a lot of my work in those two environments.

I’ve occasionally run into Python’s weaknesses over the years, chief among them the GIL and its implications for multi-threaded performance in CPU-intensive algorithms. I’ve always either stuck with single-threaded code and eaten the performance hit, or gone multi-process and taken less of a performance hit.
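For readers unfamiliar with the multi-process workaround, here is a minimal sketch of what it can look like for a regex line filter. The pattern, the chunk size, and the function names (`count_matches`, `chunked`, `parallel_count`) are all made up for illustration; only `re` and `concurrent.futures` from the standard library are used. Each worker process has its own interpreter and its own GIL, which is why this sidesteps the threading limitation, at the cost of pickling the chunks and shipping them between processes.

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Hypothetical pattern, for illustration only.
PATTERN = re.compile(r"ERROR\s+\d{3}")


def count_matches(lines):
    """Count the lines in one chunk that contain a match."""
    return sum(1 for line in lines if PATTERN.search(line))


def chunked(lines, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_count(lines, chunk_size=10_000, workers=None):
    """Count matching lines, one chunk per task.

    workers=1 runs serially in the current process; any other value
    hands the chunks to a pool of worker processes.
    """
    chunks = chunked(lines, chunk_size)
    if workers == 1:
        return sum(map(count_matches, chunks))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_matches, chunks))


if __name__ == "__main__":
    sample = ["ok", "ERROR 404 not found", "ERROR  500", "error 200"] * 1000
    print(parallel_count(sample, chunk_size=500))
```

Whether this beats single-threaded code depends on how expensive the matching is relative to the cost of moving the chunks between processes.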

Then, several months ago, I needed to run a fast multi-threaded regex search, and hit a very hard wall. So I’ve been looking for an alternative to Python that has most of its strengths but also good performance and strong multi-threading support out of the box. Of everything I’ve looked at (Rust, Nim, D, Go, and revisiting C/C++), Julia looks like it checks the most boxes. The community isn’t as large, but that makes sense; it’s still a relatively new language. Jupyter Notebook support is fantastic, there’s out-of-the-box support for modules, performance looks pretty darn good (compared to Python), etc.

That being said, I have yet to find a good resource for learning Julia. I’ve played around a little with Julia’s regex capability, but as I recall it’s based on libpcre, which is…quite slow. And the library/module that wraps RE2 (I think it’s RE2) hasn’t been maintained and doesn’t work with Julia 1.5. So I’m rather quickly finding that, to do the thing I want to do, I’m gonna have to leave the realm of “use what’s in the box” and move into the realm of “develop something new” or “fix an unmaintained thing.”

Thus my question about how best to learn Julia.

Thanks.

5 Likes

I don’t know what the best book is for what you describe, but I can relate my experience with regexes in Julia. I was maintaining a log parser written in Perl which we used as part of our analysis stream for very large scale performance load tests. After migrating and refactoring the Perl code to Julia, this script is something like 10 times as fast as it used to be. I’m pretty sure there is still scope for further performance improvement.

As for RE2.jl, it looks well written and not too long, so I think it should be relatively easy for someone experienced to update it (I could even give it a try), although we’d also need a binary builder for the library. But if there’s interest in it, I think that shouldn’t be an issue.

1 Like

Are you using Julia’s built-in regex capability, occursin?

https://docs.julialang.org/en/v1/base/strings/#Base.occursin

Yes, I pulled down RE2.jl. Figured I might have problems with it since the last commit was in early 2018…and I did. Found a thread on these forums related to the error Julia was throwing, and made a change to, I think, regex.jl. It then imported without error after that, and I was able to do some simple multi-threaded pattern matching. But enough time has passed that I don’t remember the change I made, and I’m confident that I never understood why it was breaking before or why it no longer broke. I also remember thinking “why is so much code needed for what should be a simple wrapper around a C++ library?!?”

So, yeah, that “relatively experienced” part is the thing I lack, but want to get. Am wondering how best to do it.

PS - opened up regex.jl in vim, and it maybe looks like I was doing something in function _write_capture().

To be honest, I don’t think doing things like this is Julia’s comparative advantage.

Essentially, my understanding is that you are looking for a book to teach you programming, using Julia. But the emphasis is on learning programming. While there are some books out there, I would recommend that you

  1. just read the manual,
  2. get started on a project that interests you,
  3. prepare for mistakes and frustration, but persevere,
  4. ask questions here if you get stuck.

I guess that’s pretty much how most of us learned programming. Also, reading code from well-written packages is instructive, but few people have the discipline to do that just for its own sake. So I recommend making PRs: you get a code review, which is like mentoring from an experienced developer. It’s a great way to learn.

10 Likes

Thanks, Tamas.

No? I’ve read Julia’s main use case / advantage is in scientific computing, and that it solves the 2-language problem. There’s lots of evidence that’s the case, including Graydon Hoare’s pair of really good blog posts about it. But I’ve also read there’s no reason it can’t serve as a general purpose programming language.

If that’s not the case, fine. What language would you turn to for fast multi-threaded data analytics (I view regex as a data analytic tool)?

Maybe? I describe myself as a non-programmer, which is pretty accurate. If you’re suggesting that, to build new Julia functionality (as a module, etc), or to fix someone else’s Julia code, one needs to be a programmer, then yes, I suppose that’s what I’m asking.

Though if that indeed is what I’m asking, then the almost immediate follow-up question becomes “If I’m gonna have to learn programming to do this thing, then why not suck it up and learn C++?” 2-language problem or not, I know Python, and it has lots of strengths, including its general purpose nature. And C++ has been around for a long time and lots of good learning resources exist.

It is a great general programming language, but if you just need to digest large files using regular expressions, then specialized tools may be faster. Of course, if you need to do other things with that data, Julia could be very useful.

That is of course up to you. Most Julia programmers who know C++ find that they can prototype quicker in Julia and achieve pretty much the same speed with much less code; but in some contexts C++ may be your best choice (eg if you already have a lot of legacy code in it you have to maintain, the rest of your team prefers to use it and they need to be able to fix your code too, etc). Only you have the information to make these choices.

Even though you have told us about your previous experience in detail, I am still unsure what you are looking for — “data analysis” is such a broad term. If you need a good general programming language for some kinds of data analysis, Julia will be among your top choices.

That said, if you mostly do text processing with regexes, Julia can do that too; but since Base currently just wraps PCRE, you can expect the same kind of performance as PCRE itself.

1 Like

Maybe you could try

https://github.com/BenLauwens/ThinkJulia.jl

3 Likes

What kinds of data analysis is Julia a top choice for?

As for what I’m looking for, I think it’s a couple of things, with different time scales:

  1. In the very near term, can it solve the immediate problem I’m trying to solve?
  2. In the more moderate to long term, how does Julia compare to Python for the types of things I typically do? How viable might it be as a Python replacement?

And yes, “data analysis” is a broad term. I typically do what people refer to as “exploratory analytics” - Python + NumPy + Pandas + Matplotlib (or Plotly) cover probably 80-85% of what I usually do**. But every once in a while, I need to do something off-the-wall, like a regex-based pre-filter of large files before doing additional analysis. Or this one, which happened a few years ago: “iterate through all 100M+ possible permutations of a thing and perform a CPU-intensive calculation on each one.” (Python’s itertools + multiprocessing did the job there, albeit quite slowly.)
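To make the itertools + multiprocessing approach concrete, here is a rough sketch of the general shape such a job can take. It is not the original code: the `score` function is a hypothetical stand-in for the CPU-intensive calculation, and `score_with_perm` and `best_permutation` are names invented for this example.

```python
import itertools
import math
from multiprocessing import Pool


def score(perm):
    """Hypothetical stand-in for the CPU-intensive calculation."""
    return sum(math.sqrt(i + 1) * value for i, value in enumerate(perm))


def score_with_perm(perm):
    """Pair each score with its permutation so the winner can be recovered."""
    return score(perm), perm


def best_permutation(items, processes=None, chunksize=1024):
    """Return (score, permutation) for the highest-scoring ordering of `items`.

    processes=1 evaluates serially in the current process; otherwise the
    permutations are handed to a pool of worker processes in batches of
    `chunksize`.
    """
    perms = itertools.permutations(items)
    if processes == 1:
        return max(map(score_with_perm, perms))
    with Pool(processes=processes) as pool:
        return max(pool.imap_unordered(score_with_perm, perms, chunksize))


if __name__ == "__main__":
    print(best_permutation(range(1, 9)))
```

The `chunksize` argument matters here: each individual permutation is cheap to generate, so sending them to workers one at a time would spend most of the time on inter-process communication instead of the calculation.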

**Edit: in addition to data analytics, I’ve also worn the hat of “performance engineering,” “test engineering,” and “performance characterization” in the past, which has involved gathering data and then analyzing said data.

Thanks, I’ll take a look.

Typically, the kind of analysis which involves

  1. writing code (as opposed to using some canned method),
  2. a nontrivial amount of computation.

2 Likes

I’d say julia is the perfect choice in this case - it’s similar to what I do. I can also recommend Think Julia as an introduction - it’s the first thing I suggest for my students - but also given your interests, the DataFrames tutorial will likely be of use.

As to your current off-the-wall issue, depending on the complexity of your regex, you might want to take a look at Automa.jl - it’s meant for building parsers, but it’s got its own regex engine that’s quite fast (though it has some limitations). You might be able to use it for your purpose.

More generally, I’d say that, once you begin to use julia, you may find yourself shedding that sense that you’re not a programmer. Speaking only for myself (I’m a biologist, not a programmer, at least I would have said so 5 years ago when I started down this path), julia and this community have a way of getting you to look at your code differently, and encouraging good coding habits. I will never be a computer scientist the way some in this community are, but I find that an inspiration rather than a barrier.

11 Likes

Ok, you asked for a book, but maybe these courses from the Julia Academy are useful for you?

(You can also watch the videos on YouTube here)

4 Likes

Learning resources:

These are two really solid resources for learning some of the fundamentals. Based on the kind of work you describe, I would also recommend that you have a look at the following packages:

  • CSV
  • DataFrames
  • DataFramesMeta
  • DataVoyager
  • JuliaDB
  • JuliaDBMeta
  • Plots
  • Queryverse (this is a whole suite of data analytics packages)
  • StatsPlots

I would also read the documentation for distributed/parallel computing in Julia. Aside from the official Julia manual itself, there are several blog post type resources online that show the basics of distributed/parallel computing and one of the courses on JuliaAcademy is on parallel computing.

These resources alone should get you off and running and beyond the point of needing a book/additional courses, at least for the kind of work you describe above.

4 Likes

I assume this is because there are very few useful canned methods …??? If that’s the case, then the natural follow-up question is: “is there any plan in Julia’s roadmap to become something more than ‘write your own code’? If so, over what time frame (assuming that can even be predicted)?”

Several of the comments in a different thread seem related to this question. For example:

I think that a core question I’m asking myself is the following: when I need to do something that’s non-canned, what’s the best long-term solution, assuming I only have time/energy to do one? Am I better off learning a completely new language that shares many of Python’s strengths but still lacks maturity and reach? Or am I better off learning a language that can act as a good companion/2nd language to Python (eg, C++ or Rust)?

Thanks; I’ll take a look. What do you teach?

Will also look at this.

Thanks and thanks. How useful will the 2nd course/book be to someone who knows next to nothing about economics?

As with other open source (and, to some degree, closed source) projects, that depends on the community. Of course one will always have to write code, but more and more methods might be implemented in Julia by the community. That being said, I think Julia makes it easier than other languages to produce a decent package (I have developed in C++ and R before, which is a pain compared to Julia). More importantly, great packages already exist that are, in my opinion, at least as good as their counterparts in R and Python (this strongly depends on preference, though), among them many of the ones named in this thread. In addition, if something does not exist, the community and infrastructure for creating new packages are amazing (no getting yelled at in some mailing list ;)). Interop with other languages is also great if you do need to share code with someone who only uses, say, R (even though CRAN checks fail because they do not have Julia installed on the testing systems, except for Debian). Personally, I get the most bang for my buck in Julia compared to C++ (yes, I forget the ; on line 349 and put the argument order wrong in the header) and R. Python was supposed to be the language I learned after R, but I never really got into it, and then I found Julia and, well, here I am.

2 Likes

I think a lot of it will still be very useful. Feel free to ping me on economics related questions in the off topic section (I did study that stuff😃).

2 Likes

Not really, a lot of the modern canned methods exist in some package now, even if the collection is not as extensive as, say, CRAN.

The point is that Julia makes it easy to write custom code in a performant way.

I think you misunderstand how this works: there is no centralized “roadmap” for package development (the same is true of other FOSS languages like R, Python, …).

Also, I think that you have arrived at the point where talking about these things in the abstract has diminishing returns, and just starting to learn and code would be more informative.

4 Likes

I think this is the most underestimated point, especially for Julia. I really think the manual teaches solid programming. Read the parts that you need for your work and see if the syntax and workflow are something you enjoy. I’ve had the experience of really disliking popular software twice (dplyr and Python) for no apparent reason. Just didn’t click in my brain. So maybe it’s for you, maybe not. In general, Julia is a solid choice for what you are trying to do (from what I can gather here).

5 Likes

I don’t disagree with this; the manual is well written; it’s actually how I learned enough to find and (sort of) fix issues with RE2.jl. The thing it’s “missing” that Think Julia has is exercises.

That’s probably fair.

Thanks, everyone, for the replies!