Question about Dagger's DTable

I am reading through the DTable docs, but I can't quite seem to grasp it.

Can I adapt DTable so that it can read partitioned files from an S3 path, and then operate on the data?

The docs show how to create a DTable, but they don't mention where the data is stored, how it's stored, etc. So I am a bit lost.

Any clarification welcome!


I think with JuliaCloud/AWSS3.jl (AWS S3 Simple Storage Service interface for Julia, on GitHub) it should be possible to use DTable with an S3 path.

You would probably want to do one of two things:

  1. Use something like s3fs to allow regular packages to load files from S3.
  2. Implement an S3-backed table object that satisfies the Tables.jl interface (a rough sketch of a related approach follows this list).
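If the partitions happen to be CSV, a simpler variant of the second route is to skip a custom table type and hand DTable a loader function that fetches each file from S3 with AWSS3.jl and wraps the bytes in CSV.File, which already satisfies the Tables.jl interface. A rough, untested sketch with hypothetical bucket/key names:

```julia
using AWSS3, CSV
using Dagger: DTable

# Hypothetical partition files living under an S3 prefix.
files = ["s3://my-bucket/my-table/part-0.csv",
         "s3://my-bucket/my-table/part-1.csv"]

# Fetch each object's bytes from S3 and wrap them in a Tables.jl-compatible
# CSV.File; DTable then treats each file as one partition.
load_partition(f) = CSV.File(read(S3Path(f)))

dt = DTable(load_partition, files)
```

From there the usual DTable operations (map, filter, reduce, etc.) should apply as in the docs.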

The docs show how to create a DTable, but they don't mention where the data is stored, how it's stored, etc. So I am a bit lost.

Data is stored either in memory or on disk, depending on whether you've configured disk caching, and it can live on any worker in your Julia cluster, seamlessly. The specific format depends on how the data was ingested: chunks are kept either in the input table's own format or in the format given by table.tabletype when that is specified, with NamedTuple of Vectors as the fallback.
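To make that concrete, here is a minimal sketch (the table contents and chunk size are made up) of ingesting an in-memory table and materializing it back:

```julia
using DataFrames
using Dagger: DTable

# Ingest an in-memory table, splitting it into chunks of 25 rows each.
df = DataFrame(a = 1:100, b = rand(100))
dt = DTable(df, 25)

fetch(dt)             # materialize the whole table in the DTable's tabletype
fetch(dt, DataFrame)  # or materialize into an explicit sink of your choice
```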

Appreciation: this is super flexible (which is really great and goes beyond what alternative frameworks offer).

Gripe: too much freedom makes users unsure about best practices. It would be nice to have them documented somewhere.
@jpsamaroo + @pszufe: I think it is a great topic to discuss during the JuliaCon 2023 BoF "Future of JuliaData ecosystem" (JuliaCon 2023 :: pretalx).


I think that in order to maximize throughput some dedicated setup might be needed. The S3 performance guide advises opening several concurrent GET requests over 8-16 MB byte ranges, with around 15 connections per 10 Gb/s of network interface, to saturate the link.

s3fs allows mounting S3 as a file system, but looking at its man page I do not see options for configuring the parallelism. On the other hand, if we are aiming at a single file, we would specifically need to open enough connections to saturate the NIC's throughput capacity. So for maximum performance this could require a dedicated piece of code.
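Something along these lines could be a starting point for that dedicated piece of code. It is only a sketch: it assumes AWSS3.jl's s3_get accepts a byte_range keyword (check your version), that s3_get_meta exposes a Content-Length header, and it uses hypothetical bucket/key names:

```julia
using AWSS3

bucket, key = "my-bucket", "my-table/part-0.parquet"   # hypothetical object

# Object size from a HEAD request.
total = parse(Int, s3_get_meta(bucket, key)["Content-Length"])

# Concurrent ranged GETs at 16 MiB granularity, per the S3 guidance above.
chunk  = 16 * 2^20
ranges = [lo:min(lo + chunk - 1, total) for lo in 1:chunk:total]  # assumed 1-based

parts = Vector{Vector{UInt8}}(undef, length(ranges))
Threads.@threads for i in eachindex(ranges)
    parts[i] = s3_get(bucket, key; byte_range = ranges[i])
end

data = reduce(vcat, parts)   # reassembled object bytes
```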

This is a major thing that's held me back from using more Julia at work: a lot of our data is in very big Parquet tables. Figuring out how to get Parquet2.jl + AWS.jl + DataFrames.jl to do the right kind of laziness, concurrent requests, deleting unneeded data, etc. makes me go "Eh, I'll just write some more awful SQL for Redshift." I'm sure it can be done! But it has given me too much confusion so far.
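For what it's worth, the direction I would expect to work looks roughly like the untested sketch below; it assumes Parquet2.Dataset accepts an AWSS3 S3Path, and the path and column names are hypothetical:

```julia
using AWSS3, Parquet2, DataFrames, TableOperations

# Parquet2 Datasets are lazy table objects; columns are materialized on demand.
ds = Parquet2.Dataset(S3Path("s3://my-bucket/my-table/part-0.parquet"))

# Materialize only the columns you actually need into a DataFrame.
df = ds |> TableOperations.select(:a, :b) |> DataFrame
```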
