It's hard to compare against vroom, since vroom doesn't actually read all of your data up front; it simply indexes where each record is located so it can be read later.
Except it doesn't work half the time, and a lot of its limitations aren't mentioned. I tried it on the Fannie Mae data and it was slow.
I understand the idea, but from a user's perspective removing compilation from the benchmark is kind of cheating, since in practice I personally only load the data once per session, so this fixed cost is something I experience. It's pretty common as a data scientist to load a 500k x 40 CSV, do a bunch of data processing, try a few models, and stop the Julia session before rinsing and repeating the next day. The next time this process occurs I'm still experiencing this 10-15s load time. Granted, it's really not horrible, but CSV.read doesn't feel as fast in practice compared to pandas or data.table for small/medium sized datasets. I'm sure it blows both out of the water for huge CSVs, but that's not necessarily the majority of use cases.
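To make that fixed cost concrete, here is a minimal sketch (the file name is hypothetical): the first CSV.read call in a fresh session pays the JIT compilation price on top of parsing, while later calls in the same session only pay for parsing.

```julia
# Minimal sketch of the per-session fixed cost; "mydata.csv" is a placeholder.
using CSV, DataFrames

@time df = CSV.read("mydata.csv", DataFrame)  # first call: compilation + parsing
@time df = CSV.read("mydata.csv", DataFrame)  # subsequent calls: parsing only
```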
I disagree. If you aren’t running the script from start to finish consistently throughout your work session, you aren’t working in a way that guarantees reproducible results.
When I'm working, I run main() all the time! Fast CSV reading after precompilation makes my life a lot easier.
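For illustration, a minimal sketch of that pattern; the file name and the processing steps are placeholders.

```julia
# Sketch of the "rerun main() from the top" workflow; details are hypothetical.
using CSV, DataFrames

function main()
    df = CSV.read("mydata.csv", DataFrame)  # fast after the first (compiling) call
    # ... cleaning, feature engineering, model fitting go here ...
    return df
end

main()  # rerun the whole pipeline end-to-end whenever something changes
```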
If that is the case, what is the problem? I mean, I could understand that it would be annoying if you paid that compilation price every 5-10 minutes, but once a day? 0.05% of your workday?
For me, the first thing I do is load the data via CSV.jl and then save it into a more efficient format like JDF.jl. That's my 1_import_data.jl. From step 2 onwards I load the data using JDF.jl, which should be faster. I've also seen some good benchmarks for JLD2.jl, but I haven't tested it on a large dataset yet.
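A sketch of that two-step workflow; the file paths are hypothetical and the exact JDF.jl calls may differ between package versions.

```julia
# Step 1 (1_import_data.jl): pay the CSV parsing cost once, cache in a binary format.
using CSV, DataFrames, JDF

df = CSV.read("raw/mydata.csv", DataFrame)
JDF.save("cache/mydata.jdf", df)

# Step 2 and later: reload from the JDF cache instead of re-parsing the CSV.
df = DataFrame(JDF.load("cache/mydata.jdf"))
```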
@aschmu, this way of working is not unusual for a data scientist. I usually load the data as a data frame at the beginning of the session and, if I can afford it, make a working copy of the original data. I then perform the analyses on this working copy and can go back to the original data if necessary. So reading the CSV file is usually only necessary once, and it doesn't come down to a second here or there. It almost doesn't matter whether I import the CSV file with R, Python or Julia.
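A small sketch of that session pattern (the file name is hypothetical):

```julia
# Read the CSV once per session, then work only on a copy.
using CSV, DataFrames

df_raw  = CSV.read("mydata.csv", DataFrame)  # original data, kept untouched
df_work = copy(df_raw)                       # analyses modify the copy only
```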
I think that one way of looking at it is that if you only do interactive data analysis, it does not matter much which language you use. Your idle time (thinking between commands, etc.) is far greater than your computing time.
Where it does matter is when your computations and workflow are defined and you need to launch the whole process: that's where Julia shines… assuming your calculations are substantial and not just a one-minute computation.
Ideally, I like to do both in the same language; it's much more productive.