Can Julia on SageMaker query an S3/Athena table with SQL?

Is it possible to run a SQL-esque script from within Julia to pull data from AWS S3/Athena? My Julia instance runs in an AWS SageMaker Jupyter notebook.

In Python on SageMaker this is easy enough:

import boto3
region = boto3.Session().region_name

from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir='s3://sagemaker-examplebucket/',
               region_name = region)

df = pd.read_sql("""SELECT 
 something1,
 something2
FROM "customer_data"."sagemaker_data"
WHERE 
    something = 0;""", conn)

In my SageMaker notebook using Julia, I can connect and put/get an existing file, but I can’t find anything that runs new queries the way Python’s pyathena library does:

using AWS, AWSS3, Serialization
struct SampleData
  a::Int
  b::String
end

d=SampleData(1,"sss")
aws = global_aws_config(; region="us-west-2")
b = IOBuffer()
serialize(b, d)

# take!(b) returns only the bytes actually written; b.data may include unused buffer capacity
s3_put(aws, "sagemaker-examplebucket", "myfile.bin", take!(b))

ddat = s3_get(aws, "sagemaker-examplebucket","myfile.bin")
d2 = deserialize(IOBuffer(ddat))

@assert d == d2

Hi Billpete002,

The SparkSQL.jl package enables Julia programs to work with Spark data using SQL. SparkSQL.jl returns results from Apache Spark queries as Julia DataFrames. You can move Julia data to your Spark query too. A common use case for SparkSQL.jl is machine learning. SparkSQL.jl makes it easy to get data from Spark using SQL, do machine learning in Julia, and return data back to Apache Spark. Example syntax:

JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)
createOrReplaceTempView(onSpark, "julia_data")
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)
describe(results)

To learn more, visit the tutorial and project pages:

Tutorial page:

Project page:

Thanks for the answer. Finding libraries for Julia has been a challenge! I don’t have a Spark instance set up currently, but this looks doable. It would be great to keep this all within Julia, the way R and Python can.

I know this is a few years late, but with TidierDB.jl and AWS.jl, you can run SQL queries against AWS Athena and collect the results as a DataFrame.
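For anyone who wants to see roughly what happens underneath, here is a minimal sketch that uses only AWS.jl's generated Athena bindings (`@service Athena`): submit the query, poll until it finishes, then read the result rows. It does not use TidierDB.jl's own interface, which is best taken from its documentation. The helper names `athena_query` and `rows_to_table` are made up for illustration. The bucket, database, and table names are copied from the question. I am assuming credentials are already available to the notebook and that only the first page of results is needed.

```julia
using AWS
@service Athena   # low-level bindings generated from the Athena API definition

# Hypothetical helper: split the "Rows" of a GetQueryResults response into a
# header and the data rows. Athena returns every cell as a string under
# "VarCharValue"; a NULL cell has no such key, so it becomes `missing`.
function rows_to_table(rows)
    cells(row) = [get(cell, "VarCharValue", missing) for cell in row["Data"]]
    header = String.(cells(first(rows)))   # for SELECT queries the first row holds column names
    body = [cells(r) for r in rows[2:end]]
    return header, body
end

# Hypothetical helper: run one query and return (header, rows) for the first
# page of results. Paging with "NextToken" is left out of this sketch.
function athena_query(sql; database, output_location, aws_config=global_aws_config())
    params = Dict(
        "QueryExecutionContext" => Dict("Database" => database),
        "ResultConfiguration" => Dict("OutputLocation" => output_location),
    )
    started = Athena.start_query_execution(sql, params; aws_config=aws_config)
    id = started["QueryExecutionId"]

    # Athena queries are asynchronous, so poll until a terminal state is reached.
    while true
        status = Athena.get_query_execution(id; aws_config=aws_config)["QueryExecution"]["Status"]
        state = status["State"]
        state == "SUCCEEDED" && break
        state in ("FAILED", "CANCELLED") &&
            error("Athena query $state: ", get(status, "StateChangeReason", "no reason given"))
        sleep(1)
    end

    result = Athena.get_query_results(id; aws_config=aws_config)
    return rows_to_table(result["ResultSet"]["Rows"])
end

# Example call (needs valid AWS credentials, so it is left commented out):
# aws = global_aws_config(; region="us-west-2")
# header, rows = athena_query(
#     """SELECT something1, something2 FROM "customer_data"."sagemaker_data" WHERE something = 0""";
#     database="customer_data",
#     output_location="s3://sagemaker-examplebucket/",
#     aws_config=aws,
# )
```

All values come back as strings, so numeric columns still need parsing before analysis. If DataFrames.jl is loaded, something like `DataFrame([getindex.(rows, i) for i in eachindex(header)], header)` should turn the result into a table.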

Here is the documentation for connecting and running queries. Hope this helps!