Retrieve data from Amazon S3?

I have a .tif file hosted on S3 and it looks like it should be possible to read it into memory using AWSS3.jl. I’ve tried the following:

using AWSCore, AWSS3

aws = AWSCore.aws_config()

bucket = "projects"
path = "path/to/file.tif"
filename = "path/to/file.tif"

obj = s3_get(aws, bucket, path)

which returns an Array{UInt8, 1}. I think what I really want is to replace the last line with

s3_get_file(aws, bucket, path, filename)

which is supposed to stream the result directly to filename. However, this method returns nothing and I’m not sure where to go from here.

Does anyone have experience with this?

Cheers,
Josh

A quick look through the code shows that the method writes the file and returns nothing. Was a file created?

You’re right, a file was created on my machine… can’t believe I missed that! I was expecting the file to be read into memory, which is what I’d actually like to be able to do. Perhaps there is some way to do this with the Array{UInt8, 1} returned by s3_get?

You of course need to deserialize whatever you get from S3. What are you expecting to be in the file? Just take the appropriate deserialization method for that file type and call it on the Vector{UInt8}. (Most Julia IO packages support deserializing from a Vector{UInt8} or an IO stream.)

The file is a tif containing an Array{Float32, 3}. Could you share (or point me to) a basic example for deserializing a Vector{UInt8}?

I looked into *.tif and it looks like the preferred way to deserialize these in Julia is through FileIO. You should be able to do

FileIO.load(IOBuffer(v))

where v is the Vector{UInt8} you loaded from AWS S3. What this does is wrap the data in an IO object (since FileIO doesn’t seem to support reading directly from a Vector{UInt8}) and then read it with FileIO. Apparently in this case FileIO will attempt to use the “magic bytes” (the first few bytes in a file which are supposed to indicate its format, don’t ask me why it’s called that) to determine the format. If for whatever reason *.tif doesn’t have these, you can try

load(Stream(format"TIF", IOBuffer(v)))

which will explicitly tell FileIO the format. Since FileIO supports many formats, you can do something analogous for many other files for which you pull the buffers off S3 as well.

Just a little overview of what’s going on here: Normally when you load a file from the file system, you call a function associated with whatever deserialization method you need directly (for example h5read from HDF5.jl). When the file is loaded in those cases, the bytes are read in and whatever deserializer you are using converts them to a useful format for you. The alternative is something like FileIO, which is supposed to infer the format. This is typically done either through the file name extension or the “magic bytes”.
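To make the “magic bytes” idea concrete, here is a minimal sketch that checks whether a byte vector starts with one of the two classic TIFF signatures ("II" or "MM" followed by the number 42 in the matching byte order). The helper name looks_like_tiff is hypothetical and is not part of FileIO or AWSS3; FileIO's real detection logic is more involved, and BigTIFF files use a different version number that this sketch does not check.

```julia
# Hypothetical helper (not part of FileIO or AWSS3): report whether a byte
# vector starts with one of the two classic TIFF signatures.
function looks_like_tiff(bytes::Vector{UInt8})
    length(bytes) >= 4 || return false
    header = bytes[1:4]
    little_endian = header == UInt8[0x49, 0x49, 0x2a, 0x00]  # "II", then 42
    big_endian    = header == UInt8[0x4d, 0x4d, 0x00, 0x2a]  # "MM", then 42
    return little_endian || big_endian
end

looks_like_tiff(UInt8[0x49, 0x49, 0x2a, 0x00, 0x08, 0x00, 0x00, 0x00])  # true
looks_like_tiff(Vector{UInt8}("not a tif"))                              # false
```

Calling this on the Vector{UInt8} returned by s3_get would be a quick way to confirm that the downloaded bytes at least look like a TIFF before handing them to a loader.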

When you are reading from S3 with AWSS3, what’s happening is that the package makes an HTTP call to Amazon and Amazon sends back an HTTP response with your data embedded in it. At this point, there’s not really anything for AWSS3.jl to do about it: it has the data, but it has no way of knowing if or how you want to deserialize that data. I suppose some day it could be integrated with FileIO, but in the meantime, there’s really nothing for it to do other than give you the raw data as a Vector{UInt8}. At this point it’s up to you to decide how to deserialize it.
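As a rough illustration of what “deserializing a Vector{UInt8}” means, the sketch below uses only Base Julia. It assumes the bytes are nothing but raw Float32 values with known dimensions, which is NOT true of a real TIFF (a TIFF also carries headers, tags and possibly compression, so it needs a proper decoder). The bytes here are built locally as a stand-in for what s3_get would return.

```julia
# Stand-in for the Vector{UInt8} that s3_get returns: build the bytes
# ourselves from a known 2x2x2 Float32 array (raw values, NOT a TIFF file).
original = reshape(Float32.(1:8), 2, 2, 2)
bytes = Vector{UInt8}(reinterpret(UInt8, vec(original)))

# Wrap the bytes in an in-memory IO object and read typed values back out.
io = IOBuffer(bytes)
decoded = Array{Float32}(undef, 2, 2, 2)
read!(io, decoded)

decoded == original  # true
```

The same IOBuffer wrapping is what lets a format-aware loader treat in-memory bytes like an open file; only the decoding step differs from format to format.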

(By the way, none of this other than references to specific packages is Julia specific.)

I really appreciate the detailed response, that definitely helped me understand this process a bit better!

I tried

FileIO.load(IOBuffer(v))

but got the error

ErrorException("type GenericIOBuffer has no field name")
There was an error in magick function detect_ometiff

Then I tried

load(Stream(format"TIF", IOBuffer(obj)))

and got

No applicable_loaders found for TIF

For the record, I’m using Julia 1.2.0 and do have ImageMagick.jl installed.

Out of curiosity, what would be the result if FileIO.load() were successful? Would I have the Array{Float32, 3} read into memory, or would the *.tif file just be saved to my machine?

Looking at the FileIO README, I wonder if I need to add a new format and implement a new loader using ArchGDAL.jl? Something like

add_format(format"TIF", [], ".tif") 
function load(f::File{format"TIF"})
    open(f) do s
        ArchGDAL.registerdrivers() do
            ArchGDAL.read(s) do dataset
                # build Array{Float32, 3}
            end
        end
    end
end

though it looks like ArchGDAL.read() expects a filename.

Hm, I have to confess I’m not too familiar with FileIO. Can you try it with using ImageMagick?

Ah, you forgot to mention that you are dealing specifically with GeoTIFFs. That changes some things, especially if you want the geospatial metadata as well.

In principle you should be able to do this directly with GDAL.jl/ArchGDAL.jl, but we still need to figure out some build issues with GDAL to have this work out of the box. Until then you may still be able to use these packages with a separately installed GDAL instead of the provided one, by modifying the paths under GDAL.jl/deps/.

Then it should work like so, to get all the data into a Julia array:

s3url = "/vsis3/bucket/key"
# create a dataset with all geospatial metadata (this is on the ArchGDAL#idataset branch)
dataset = ArchGDAL.read(s3url)
# read out all the pixels to an Array
A = ArchGDAL.read(dataset)

For now you can also just download the entire file and then read it in using the same code.

I’d be very interested in getting this all to work as smoothly as possible! But unfortunately I don’t have time right now to dive into BinaryBuilder again and iron out the build issues. Help is very much welcome!

Some references:
https://kokoalberti.com/articles/hosting-and-accessing-cloud-optimized-geotiff-files-on-aws-s3/
https://gdal.org/user/virtual_file_systems.html#vsis3

That’s correct, it is a GeoTIFF (though I don’t need the geospatial metadata). I didn’t realize there was a structural difference, so thank you for catching that!

It sounds like this would be the solution I’m looking for once it is completely implemented, but for now I’ll stick with downloading the entire file and reading it in as outlined in ArchGDAL.jl.

Thank you all for the help!