Importing big data

question

#1

I need to deal with large tick-by-tick stock quote data every day. I was wondering whether Julia has a package that can read CSV/txt files larger than 20 GB.

I used to use the sparklyr package for R, together with PostgreSQL, to deal with large data.


#2

I think https://github.com/JuliaComputing/JuliaDB.jl is exactly what you’re looking for.


#3

Can it handle 100+GB data?


#4

What do you want to do with these data?


#5

You might be interested in this blog post: https://medium.com/@sdanisch/drawing-2-7-billion-points-in-10s-ecc8c85ca8fa
The data ecosystem is still being built in Julia - I think it’s very promising but don’t expect R’s tidyverse at the moment.


#6

Extremely unworthy of me to make this reply.
But why… why… does everyone want to read in CSV formatted data these days?
I am thinking of an example where I work where people are reading and storing thousands of CSV files.
Yes, CSV has a place. I use CSV formatted output every day in system health check scripts.
And I can imagine that for stock tick data that is the format in which you receive the data.
But I am thinking more widely about science and engineering data - for large datasets formats such as HDF5 have been worked on over the years.
Why not read data in from the CSV format and store it in a more efficient format?

My apology to the original poster - my comments are in no way aimed at you.

Also, I do expect to be rapped over the knuckles here and corrected. That is all healthy debate and we can learn something.


#7

CSV is easily readable from any language and any system; HDF5 isn’t. Notably, neither PostgreSQL nor Spark, both mentioned by the topic starter, can work with HDF5 (at least out of the box), but both have excellent support for delimited formats like CSV.


#8

If I understood the comment of @John_Hearns correctly, the issue is not about using CSV as a storage format, but about parsing the same (large) CSV file repeatedly. CSV and other text-based delimited formats do have some advantages as a storage format (they are pretty future-proof), but for repeated analysis of a large dataset, it makes sense to ingest the data once and use some other format from then on. If that format changes, or you need to use a different environment which is not compatible (because of endianness, binary changes, etc.), you ingest again.
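To make the "ingest once" idea concrete, here is a minimal sketch in Python using only the standard library. The tick layout (timestamp, price, size), the `RECORD` format, and the function names are all assumptions made up for illustration; a real pipeline would more likely target HDF5 or a similar established format rather than hand-rolled records.

```python
import csv
import io
import struct

# Hypothetical tick layout: integer timestamp, price, trade size.
# "<" pins the byte order to little-endian, so the file does not depend
# on the endianness of the machine that wrote it.
RECORD = struct.Struct("<qdq")  # int64, float64, int64 -> 24 bytes per row

def ingest(csv_file, bin_file):
    """Parse the CSV text once and write fixed-width binary records."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    count = 0
    for ts, price, size in reader:
        bin_file.write(RECORD.pack(int(ts), float(price), int(size)))
        count += 1
    return count

def read_records(bin_file):
    """Yield (timestamp, price, size) tuples with no text parsing at all."""
    while True:
        chunk = bin_file.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        yield RECORD.unpack(chunk)

# In-memory stand-ins for the large CSV file and the binary output file.
csv_text = "ts,price,size\n1000,10.5,200\n1001,10.6,100\n"
binary = io.BytesIO()
n_ingested = ingest(io.StringIO(csv_text), binary)
binary.seek(0)
records = list(read_records(binary))
```

Every later analysis run reads the binary records directly, and the CSV only needs to be parsed again if the layout or the environment changes.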

That said, HDF5 is pretty universally supported these days. Even if you only have access to a C compiler, you should be OK.


#9

@Tamas_papp Thank you. That sums up a lot of what I meant.
I was not trying to specifically bring HDF5 into the discussion.
If I may be allowed to exaggerate, and certainly not to criticise anyone, scientists and engineers have been manipulating huge datasets, terabyte- and petabyte-sized datasets, for years.
For instance, take the Large Hadron Collider. There are petabytes of data being produced there. I worked on the LEP accelerator, with much lower volumes of data. However, we would have raw experimental data, which would be condensed down to a binary DST format (Data Summary Tapes - yes - tape!) for analysis. Maybe at the very end of an analysis chain you would read and write CSV files.

Now we see Big Data coming along, where everyone seems to have a hammer as a tool, and so every problem looks like a nail. By that I mean that Hadoop (etc.) is the weapon of choice.
So what I guess I am saying is: yes, I agree with @tamas_papp - whatever darned system you are reading data from probably will only send it to you in CSV form. But think of storing that data in some other format which might make it easier for you to analyze.


#10

Even for analysis, use cases may be so different and the corresponding tools so diverse that the only format in which you can store your own copy of the data is something as simple as CSV or JSON.

For example, if you want to have running statistics, CSV is good enough - it’s a row-based format which you can read line by line without worrying about memory requirements. I don’t know whether it’s possible to do the same with HDF5, but it also won’t give any real advantage in this scenario.
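As a rough illustration of the running-statistics case, here is a small Python sketch (standard library only) that streams a CSV one row at a time and keeps only a constant amount of state, using Welford's online algorithm for the mean and variance. The column name `price` and the function name `running_stats` are made up for the example.

```python
import csv
import io

def running_stats(lines, column):
    """Stream rows one at a time, keeping O(1) state (Welford's method)."""
    n, mean, m2 = 0, 0.0, 0.0
    for row in csv.DictReader(lines):
        x = float(row[column])
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates the sum of squared deviations
    # Sample variance; undefined for fewer than two observations.
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance

# In-memory stand-in for a large file; an open file handle works the same way.
sample = io.StringIO("ts,price\n1,10.0\n2,12.0\n3,14.0\n")
n, mean, variance = running_stats(sample, "price")
```

With a real file you would pass `open(path, newline="")` instead of the `StringIO` object; the file is then read lazily, so memory use should not grow with file size.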

When your data is really huge, Hadoop FS may be helpful (I’ll call it “Hadoop FS” instead of the short “HDFS” to avoid confusion with “HDF5”). Hadoop FS supports any files, but the choice of input and output formats - something you need in order to efficiently split data into chunks and process them in parallel - is very limited. HDF5 isn’t one of them. Hadoop / Spark have their own efficient formats like Parquet, but using them automatically makes your data a tough nut for other tools.

Finally, if you want to access separate records from your data in random order, you’ll have to use some kind of indexed database (like PostgreSQL or MongoDB). In this case you don’t care about the format at all; you just use the API of that database.
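A minimal sketch of that last point, with SQLite from the Python standard library swapped in as a stand-in for PostgreSQL or MongoDB so that the example is self-contained. The `ticks` table, its columns, and the `lookup` helper are hypothetical.

```python
import sqlite3

# SQLite stands in for a server database here; the idea is the same:
# the on-disk format is the database's business, you only use its API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, ts INTEGER, price REAL)")
# The index is what makes lookups in random order cheap.
conn.execute("CREATE INDEX idx_ticks_symbol_ts ON ticks (symbol, ts)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?, ?)",
    [("ABC", 1000, 10.5), ("ABC", 1001, 10.6), ("DEF", 1000, 99.0)],
)

def lookup(conn, symbol, ts):
    """Fetch one record by key; returns None if there is no such tick."""
    row = conn.execute(
        "SELECT price FROM ticks WHERE symbol = ? AND ts = ?",
        (symbol, ts),
    ).fetchone()
    return None if row is None else row[0]

price = lookup(conn, "ABC", 1001)
missing = lookup(conn, "XYZ", 1)
```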


#11

If you think CSV is bad (and it is), just be grateful you don’t have people sending you xlsx and asking you to use them. The horror… the horror…

By the way, is there a good way of dealing with large memory-mapped HDF5 files in Julia right now?


#12

What do you mean? I thought mmap was supported.


#13

Oh, I must be really out of the loop. For some reason I thought it wasn’t supported by the HDF5 package (although of course I knew it could be done in principle).


#14

@John_Hearns

I do not disagree with you. From the perspective of science and engineering, there is little reason to use the CSV format to store data, given the rapidly increasing data sizes in the financial industry.

As an educator in finance, I always want to push for better technical solutions, but it proves very difficult.

The difficulty starts from the education system.

Take SAS and Matlab, for example: they have very good marketing teams. Nowadays the funding for public universities is very limited, but most finance PhD programs in the U.S. still buy expensive SAS and Matlab to do research, and when the students become professors, they teach SAS and Matlab again.

The same applies to Microsoft Office: because most students are taught to use Excel in college, it is very difficult to break this loop.


#15

@dfdx

Thanks for your answer. I normally use my data to do time-series analysis, like running some GARCH models.
I normally get my data from a remote PostgreSQL database and load it into R with the help of sparklyr. After cleaning and aggregation, I keep my results as an R tibble, and then save the data in .RData format.


#16

Yes, it can. Note that it is an in-memory database, so you’ll need enough RAM to hold all the data you want to operate on, but that can be on multiple nodes. It is distributed out of the box.

If you are using it in a commercial environment, feel free to reach out to Julia Computing for support.


#17

For now – we’re actively working on supporting out-of-core computation in JuliaDB both locally and distributed. Once we can do distributed + out-of-core, then you’ll be able to manage truly huge data sets.


#18

On the contrary, this might be the one circumstance in which you have some power. When I was a TA, if there was an assignment that involved some output data, there’s 0 chance I would have accepted an xlsx file. The students would have no choice but to learn better ways of doing things. Everybody wins.

Now if only I could pull this off as a working data scientist…


#19

If you are happy with a Spark wrapper for cleaning and aggregation, take a look at Spark.jl. It should be pretty straightforward to use the JDBC format to read data directly from PostgreSQL (you will need to re-build Spark.jl with the PostgreSQL jar, though) and get a dataframe from it.


#20

@dfdx, it would be really great if you could point us to more examples of Spark.jl usage. Perhaps something like loading an RDD from a database, doing some sort of map-reduce or something and then spitting out in the form of a Julia dataframe (it’s this last step that usually trips me up)? By the way, is there any support for Spark dataframes?