Repartitioning 2 TB of CSV into Parquet files

Good morning from Colombia.

I’m a beginner in big data/data science, and I’m trying to do the following task:

We have 2 TB of CSV from one table. We want to try using a SQL layer to query that data. 
Currently, the data is stored in Timescale, but Timescale doesn't compress data and the SSD 
space in use is growing at a fast pace. So a partner and I are trying Azure Data Lake, with a SQL 
layer on top, to test whether the query performance is acceptable and whether the price is better or worse. 
The two SQL layers we want to try are Dremio and Azure Data Flow Analytics. But the problem is the 
CSVs are sometimes very large (100 GB) and sometimes very tiny (10 KB). We want to 
repartition the data first and write it to Parquet second.

To do the task, we tried:

1) To use Pandas on a very large machine (64 cores, more than 450 GB of RAM). The problem was 
   that Pandas doesn't scale to large machines.
2) To use Azure Data Factory Data Flow; the Data Flow cluster (really a Spark cluster) crashed with a 
   system error (?). So, aborted.

Finally, the CSVs are internally sorted by the key we want to use as the partition key. So maybe we can use Julia to read the CSVs in a streaming fashion and write the partitioned data to Parquet. Is that a good idea? Do you see any problems with that approach?

P.S.: Sorry for my English. I hope you can understand the problem.

It depends on how you want to partition it: into multiple files by columns or by rows. Take a look at CSV.jl; there are ways of reading in parts of a file at a time.
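
For instance, a rough (untested) sketch using CSV.Chunks to read the file in batches — the file name is a placeholder and the `ntasks` count is just an example:

```julia
using CSV, DataFrames

# Iterate the file in batches so the whole 100 GB never has to fit
# in memory at once. Each chunk materializes into its own DataFrame.
for chunk in CSV.Chunks("big_table.csv"; ntasks = 8)
    df = DataFrame(chunk)
    # ... group/partition/write this batch here ...
    println(nrow(df), " rows in this chunk")
end
```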


I want to partition the CSV by rows. So, is it possible to write Parquet files in Julia?

Using the CSV.jl package, you can use the CSV.Rows(file) structure to efficiently iterate rows; it shouldn’t matter how big the file is, since CSV.Rows iterates rows lazily.
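
For example (untested sketch; the file name and column name are placeholders):

```julia
using CSV

# CSV.Rows streams one row at a time, so memory use stays flat even
# for a 100 GB file. Values come back as strings unless `types` is given.
for row in CSV.Rows("big_table.csv"; reusebuffer = true)
    key = row.partition_key   # access a column by name
    # ... route the row to the right output here ...
end
```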

Unfortunately, the Parquet.jl package doesn’t support writing parquet files yet. The Feather.jl package, however, is able to write parquet-like binary files that can be very efficient to re-read, so you might check it out and see whether it works for you.
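
For example, writing one (already small enough) partition back out might look like this sketch — file names are placeholders:

```julia
using CSV, DataFrames, Feather

# Load one partition that fits in memory, then write it as a Feather
# binary file, which is cheap to re-read later.
df = CSV.File("partition_001.csv") |> DataFrame
Feather.write("partition_001.feather", df)

# Re-reading is fast:
df2 = Feather.read("partition_001.feather")
```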

Hopefully that helps a little.


This seems like a reasonable[1] approach, though. Can you elaborate on how Pandas “doesn’t scale”?

[1] Yes, there’s probably a smarter/cheaper way, but if this is a one-off, your time is probably more expensive than the hardware.

In my use case:

  1. When I read a CSV, the reading doesn’t use multiple cores; the same goes for grouping (for partitioning) and sorting (I sort for better compression).
  2. I parallelize using a Pool, but:
    2.1) The CSVs are sometimes very large, and reading many CSVs at the same time consumes too much memory.
    2.2) I can read the CSVs in chunks, but then the process is very slow because we are merging the final partitions with existing Parquet files. I’m not an expert; maybe I’m doing something wrong here.

Is it possible to write the CSV in streaming, but without an iterator?

I’m thinking of something like:

  1. Read the CSV in streaming.
  2. With each row:
    2.1) Check if the partition key already has a CSV file created.
    2.2) If the CSV file exists: append the row to the file.
    2.3) If not: create the file, save it in a dictionary as the assigned file for that partition key, and append the row.

Reading and writing in streaming, without grouping the rows, only using a dictionary of partition keys => CSV files. Is that possible?
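
Roughly, I’m imagining something like this (untested sketch; I’m assuming the partition key column is called `key`, and I ignore headers and CSV quoting for brevity):

```julia
using CSV

# Stream the input once, routing each row to a per-partition CSV file.
function partition_csv(input_path::String, output_dir::String)
    handles = Dict{String, IOStream}()      # partition key => open output file
    for row in CSV.Rows(input_path)
        k = String(row.key)
        io = get!(handles, k) do            # open the file the first time this key appears
            open(joinpath(output_dir, "part_$k.csv"), "w")
        end
        # Write the raw values back out, comma-separated (no quoting/escaping).
        println(io, join((coalesce(getproperty(row, nm), "") for nm in propertynames(row)), ","))
    end
    foreach(close, values(handles))         # flush and close everything
end

partition_csv("big_table.csv", "partitions")
```

One thing to watch out for: with many distinct partition keys, the dictionary of open handles can hit the operating system’s limit on file descriptors, so I might need to close and reopen files (or buffer rows) once it grows large. Since the data is already sorted by the key, another option is to keep only one file open at a time and close it whenever the key changes.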