Is there a way to stream a table with 1 billion rows from SQL Server to Parquet efficently in Julia

gabomgp · November 16, 2017, 7:25pm

Hello,

I’m very new to Python and Julia and the tools for Data Science. In work, we need to put terabytes of data in a discoverable and easily queryable format. The tools i want to test are Dremio vs Azure Data Lake as Sql Layer over Parquet files. I want to copy the data to parquet files because it can be read from many tools (including both mentioned and Pandas and Julia). We are searching the best way. In Python, i can now read the table filtering by a range of days, and save that to a Parquet file. The steps i use are:

Extract the data with Turbodbc and save the rows to a numpy array.
Convert the numpy array to Arrow with PyArrow.
Save the Arrow table as a Parquet file.

The problem with these steps is the memory usage. If i want to save 1 millions rows to each Parquet file, all rows are converted to numpy array, then to Arrow and then to Parquet. I’m using Windows, in Linux, its possible to skip the conversion to numpy. So, when i’m using 4 threads to extract the rows in parallel, the memory consumption is of 5-6 GB . It’s too much for me. I must use Windows because i want to teach others to do the same, but Linux is not a option in owns laptops.

Sorry for the large introduction and my English. The question is:

Currently, can i use Julia to replace the three steps, and extract fast and efficiently billions of rows from Sql Server to Parquet, and continue my exploration of tools?

dfdx · November 16, 2017, 9:18pm

A couple of days ago I described how to read data from PostgreSQL using Spark.jl. If you manage to do the same thing with SQL Server, saving the result to Parquet is a matter of a single call to write_parquet(). At the moment documentation of Spark.jl is very limited, so don’t hesitate to ask questions.

Alternatively, you can query SQL server with whatever tools you like (e.g. JDBC.jl or ODBC.jl) and write it using Parquet.jl, although I’m don’t their current state and suitability.

Topic		Replies	Views
Repartitioning 2TB of csv into parquets Data big-data	21	4668	June 25, 2020
Write Large Parquet to S3 General Usage parquet	6	309	August 9, 2023
Writing Parquet files General Usage	28	5248	November 12, 2020
Struggling with Julia and large datasets General Usage question , big-data	67	11067	October 17, 2024
Converting CSV to Parquet in Julia New to Julia question , csv , parquet	22	1574	August 11, 2024

Is there a way to stream a table with 1 billion rows from SQL Server to Parquet efficently in Julia

Related topics