Good morning from Colombia.
I’m a beginner in big data/data science, and I’m trying to do the following task:
We have 2 TB of CSV data from one table, and we want to try a SQL layer to query it. Currently the data is stored in Timescale, but Timescale doesn't compress the data, and the SSD space used is growing at a fast pace. So a partner and I want to move the data to Azure Data Lake with a SQL layer on top, to test whether the query performance is acceptable and whether the price is better or worse. The two SQL layers we want to try are Dremio and Azure Data Lake Analytics. The problem is that the CSV files are sometimes very, very large (100 GB) and sometimes very tiny (10 KB). We want to repartition the data first, and then write it to Parquet.
To do the task, we tried:
1) Using Pandas on a very large machine (64 cores, more than 450 GB of RAM). The problem is that Pandas doesn't scale to a machine like that.
2) Using Azure Data Factory Data Flow. The Data Flow cluster (really a Spark cluster) crashed with a system error, so we aborted that attempt.
Finally, the CSVs are internally sorted by the key we want to use as the partition key. So maybe we can use Julia to read the CSVs as a stream and write the partitioned data to Parquet files. Is that a good idea? Do you see any problems with that approach? I put a rough sketch of the idea below.
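To make the idea concrete, this is an untested sketch of what I have in mind, assuming CSV.jl, DataFrames.jl, and Parquet.jl. The column name `device_id` and the file names are just placeholders for our real schema:

```julia
using CSV, DataFrames, Parquet

# Stream the CSV row by row and write one Parquet file per partition-key value,
# relying on the fact that the file is already sorted by that key.
# NOTE: :device_id and the paths are placeholders. You probably also want to
# pass `types=` to CSV.Rows so numeric columns are not parsed as strings.
function split_by_key(csv_path, out_dir; key::Symbol = :device_id)
    buffer  = DataFrame()   # rows of the partition currently being read
    current = nothing       # key value of that partition

    function flush_partition(val)
        # Hive-style layout (key=value/...), which the SQL layer can use to prune partitions.
        dir = joinpath(out_dir, "$(key)=$(val)")
        mkpath(dir)
        write_parquet(joinpath(dir, "part-0.parquet"), buffer)
        buffer = DataFrame()
    end

    # CSV.Rows parses one row at a time, so memory use stays roughly constant
    # even on a 100 GB file.
    for row in CSV.Rows(csv_path)
        k = getproperty(row, key)
        if current !== nothing && k != current
            flush_partition(current)   # key changed => previous partition is complete
        end
        current = k
        push!(buffer, row)             # push! accepts Tables.jl rows
    end
    nrow(buffer) > 0 && flush_partition(current)
end

split_by_key("measurements.csv", "out")
```

Because the files are sorted by the key, only one partition ever lives in memory at a time. Pushing row by row will probably be slow on the 100 GB files (maybe CSV.Chunks or some batching would be better), but first I want to know if the general approach makes sense.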
P.S.: Sorry for my English. I hope you can understand the problem.