Question about Dagger's DTable

I am reading through the DTable docs, but I can't quite seem to grasp it.

Can I adapt DTable so that it can read partitioned files from an S3 path, and then operate on the data?

The docs show how to create a DTable, but they don't mention where the data is stored, how it's stored, etc. So I am a bit lost.

Any clarification welcome!


I think with JuliaCloud/AWSS3.jl (AWS S3 Simple Storage Service interface for Julia, on GitHub) it should be possible to use DTable with an S3 path.

You would probably want to do one of two things:

  1. Use something like s3fs to allow regular packages to load files from S3.
  2. Implement an S3-backed table object that satisfies the Tables.jl interface (a rough sketch of a related approach follows this list).
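If the partitions happen to be CSV, a simpler variant of the second route is to skip a custom table type and hand DTable a loader function that fetches each file from S3 with AWSS3.jl and wraps the bytes in CSV.File, which already satisfies the Tables.jl interface. A rough, untested sketch with hypothetical bucket/key names:

```julia
using AWSS3, CSV
using Dagger: DTable

# Hypothetical partition files living under an S3 prefix.
files = ["s3://my-bucket/my-table/part-0.csv",
         "s3://my-bucket/my-table/part-1.csv"]

# Fetch each object's bytes from S3 and wrap them in a Tables.jl-compatible
# CSV.File; DTable then treats each file as one partition.
load_partition(f) = CSV.File(read(S3Path(f)))

dt = DTable(load_partition, files)
```

From there the usual DTable operations (map, filter, reduce, etc.) should apply as in the docs.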

The docs show how to create a DTable, but they don't mention where the data is stored, how it's stored, etc. So I am a bit lost.

Data is stored either in memory or on disk, depending on whether you've configured disk caching, and it can live on any worker in your Julia cluster, seamlessly. The specific format depends on how the data was ingested: chunks are kept either in the input table's own format or in the format given by table.tabletype when that is specified, with NamedTuple of Vectors as the fallback.
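To make that concrete, here is a minimal sketch (the table contents and chunk size are made up) of ingesting an in-memory table and materializing it back:

```julia
using DataFrames
using Dagger: DTable

# Ingest an in-memory table, splitting it into chunks of 25 rows each.
df = DataFrame(a = 1:100, b = rand(100))
dt = DTable(df, 25)

fetch(dt)             # materialize the whole table in the DTable's tabletype
fetch(dt, DataFrame)  # or materialize into an explicit sink of your choice
```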

Appreciation: this is super flexible (which is really great and goes beyond what alternative frameworks offer).

Gripe: too much freedom makes users unsure about best practices. It would be nice to have them documented somewhere.
@jpsamaroo + @pszufe: I think it is a great topic to discuss during the JuliaCon 2023 BoF "Future of JuliaData ecosystem" (JuliaCon 2023 :: pretalx).


I think that in order to maximize throughput some dedicated setup might be needed. The S3 performance guide advises opening several concurrent GET requests over 8-16 MB byte ranges, with around 15 connections per 10 Gb/s of network interface, to saturate the link.

s3fs allows mounting S3 as a file system, but looking at its man page I do not see options for configuring the parallelism. On the other hand, if we are aiming at a single file, we would specifically need to open enough connections to saturate the NIC's throughput capacity. So for maximum performance this could require a dedicated piece of code.
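Something along these lines could be a starting point for that dedicated piece of code. It is only a sketch: it assumes AWSS3.jl's s3_get accepts a byte_range keyword (check your version), that s3_get_meta exposes a Content-Length header, and it uses hypothetical bucket/key names:

```julia
using AWSS3

bucket, key = "my-bucket", "my-table/part-0.parquet"   # hypothetical object

# Object size from a HEAD request.
total = parse(Int, s3_get_meta(bucket, key)["Content-Length"])

# Concurrent ranged GETs at 16 MiB granularity, per the S3 guidance above.
chunk  = 16 * 2^20
ranges = [lo:min(lo + chunk - 1, total) for lo in 1:chunk:total]  # assumed 1-based

parts = Vector{Vector{UInt8}}(undef, length(ranges))
Threads.@threads for i in eachindex(ranges)
    parts[i] = s3_get(bucket, key; byte_range = ranges[i])
end

data = reduce(vcat, parts)   # reassembled object bytes
```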

This is a major thing that's held me back from using more Julia at work: a lot of our data is in very big Parquet tables. Figuring out how to get Parquet2.jl + AWS.jl + DataFrames.jl to do the right kind of laziness, concurrent requests, deleting unneeded data, etc. makes me go "Eh, I'll just write some more awful SQL for Redshift." I'm sure it can be done! But it has given me too much confusion so far.
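For what it's worth, the direction I would expect to work looks roughly like the untested sketch below; it assumes Parquet2.Dataset accepts an AWSS3 S3Path, and the path and column names are hypothetical:

```julia
using AWSS3, Parquet2, DataFrames, TableOperations

# Parquet2 Datasets are lazy table objects; columns are materialized on demand.
ds = Parquet2.Dataset(S3Path("s3://my-bucket/my-table/part-0.parquet"))

# Materialize only the columns you actually need into a DataFrame.
df = ds |> TableOperations.select(:a, :b) |> DataFrame
```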
