Is there a way to stream a table with 1 billion rows from SQL Server to Parquet efficently in Julia


#1

Hello,

I’m very new to Python and Julia and the tools for Data Science. In work, we need to put terabytes of data in a discoverable and easily queryable format. The tools i want to test are Dremio vs Azure Data Lake as Sql Layer over Parquet files. I want to copy the data to parquet files because it can be read from many tools (including both mentioned and Pandas and Julia). We are searching the best way. In Python, i can now read the table filtering by a range of days, and save that to a Parquet file. The steps i use are:

  1. Extract the data with Turbodbc and save the rows to a numpy array.
  2. Convert the numpy array to Arrow with PyArrow.
  3. Save the Arrow table as a Parquet file.

The problem with these steps is the memory usage. If i want to save 1 millions rows to each Parquet file, all rows are converted to numpy array, then to Arrow and then to Parquet. I’m using Windows, in Linux, its possible to skip the conversion to numpy. So, when i’m using 4 threads to extract the rows in parallel, the memory consumption is of 5-6 GB . It’s too much for me. I must use Windows because i want to teach others to do the same, but Linux is not a option in owns laptops.

Sorry for the large introduction and my English. The question is:

Currently, can i use Julia to replace the three steps, and extract fast and efficiently billions of rows from Sql Server to Parquet, and continue my exploration of tools?


#2

A couple of days ago I described how to read data from PostgreSQL using Spark.jl. If you manage to do the same thing with SQL Server, saving the result to Parquet is a matter of a single call to write_parquet(). At the moment documentation of Spark.jl is very limited, so don’t hesitate to ask questions.

Alternatively, you can query SQL server with whatever tools you like (e.g. JDBC.jl or ODBC.jl) and write it using Parquet.jl, although I’m don’t their current state and suitability.