Can SageMaker Julia query an S3/Athena table with SQL?

Billpete002 · August 12, 2021, 10:38am

Is it possible to run a SQL-esque script from within Julia to pull data from AWS's S3/Athena? My Julia instance runs in an AWS SageMaker Jupyter notebook.

Using Python in SageMaker, this is easy enough:

import boto3
region = boto3.Session().region_name

from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir='s3://sagemaker-examplebucket/',
               region_name = region)

df = pd.read_sql("""SELECT 
 something1,
 something2
FROM "customer_data"."sagemaker_data"
WHERE 
    something = 0;""", conn)
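
For anyone porting this: the `connect` / `read_sql` pair above hides three Athena API operations (StartQueryExecution, GetQueryExecution, GetQueryResults), and a Julia client has to drive the same three. Below is a minimal sketch of that flow, assuming `client` is a boto3 Athena client. The helper name `run_athena_query` and its parameters are made up for illustration and are not part of pyathena or boto3.

```python
import time

# States after which Athena will not change the query status again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(client, sql, output_location, poll_seconds=1.0):
    """Run `sql` on Athena and return (column_names, rows).

    Hypothetical helper: `client` is assumed to be a boto3 Athena client,
    e.g. boto3.client("athena", region_name=region). Cells come back as
    strings, with None for NULL.
    """
    # 1. Submit the query; Athena writes result files under output_location.
    started = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # 2. Poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "")
        raise RuntimeError(f"query {query_id} ended as {status['State']}: {reason}")

    # 3. Page through the results, following NextToken while it is present.
    rows = []
    kwargs = {"QueryExecutionId": query_id}
    while True:
        page = client.get_query_results(**kwargs)
        for row in page["ResultSet"]["Rows"]:
            # A NULL cell is an empty dict, so .get() yields None for it.
            rows.append([cell.get("VarCharValue") for cell in row["Data"]])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    # For a SELECT, the first row of the first page holds the column names.
    return rows[0], rows[1:]
```

With real credentials this would be called roughly as `run_athena_query(boto3.client("athena", region_name=region), query, "s3://sagemaker-examplebucket/")`. The returned column names and rows can then be passed to `pd.DataFrame(rows, columns=column_names)`.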

From my SageMaker notebook running Julia, I can connect to S3 and put/get a file, but I can't find anything that runs new queries the way Python's pyathena library does:

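using AWS, AWSS3, Serialization

struct SampleData
  a::Int
  b::String
end

d = SampleData(1, "sss")
aws = global_aws_config(; region="us-west-2")

# Serialize the struct into an in-memory buffer
b = IOBuffer()
serialize(b, d)

# Upload only the bytes actually written (b.data can include spare capacity)
s3_put(aws, "sagemaker-examplebucket", "myfile.bin", take!(b))

# Download the object again and rebuild the struct
ddat = s3_get(aws, "sagemaker-examplebucket", "myfile.bin")
d2 = deserialize(IOBuffer(ddat))

# The round trip should reproduce the original value
@assert d == d2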

propelledanalytics · August 12, 2021, 9:14pm

Hi Billpete002,

The SparkSQL.jl package enables Julia programs to work with Spark data using SQL. SparkSQL.jl returns results from Apache Spark queries as Julia DataFrames. You can move Julia data to your Spark query too. A common use case for SparkSQL.jl is machine learning. SparkSQL.jl makes it easy to get data from Spark using SQL, do machine learning in Julia, and return data back to Apache Spark. Example syntax:

JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)
createOrReplaceTempView(onSpark, "julia_data")
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)
describe(results)

To learn more, see the SparkSQL.jl tutorial and project pages.

Billpete002 · August 13, 2021, 5:20am

Thanks for the answer. Finding libraries for Julia has been a challenge! I don't have a Spark instance set up currently, but this looks doable. It would be great to keep this all within Julia, the way R and Python can.

drizk1 · May 10, 2024, 3:00am

I know this is a few years late, but with TidierDB.jl and AWS.jl you can run SQL queries against AWS Athena and collect the results as a DataFrame.

Here is the documentation for connecting and running queries. Hope this helps!
