Is it possible to run a SQL-esque script within Julia to pull data from AWS's S3/Athena? My Julia kernel is running on an AWS SageMaker Jupyter notebook.
In Python on SageMaker this is easy enough:
import boto3
import pandas as pd
from pyathena import connect

region = boto3.Session().region_name
conn = connect(s3_staging_dir="s3://sagemaker-examplebucket/",
               region_name=region)
df = pd.read_sql("""
    SELECT
        something1,
        something2
    FROM "customer_data"."sagemaker_data"
    WHERE
        something = 0;
    """, conn)
From the Julia kernel in my SageMaker notebook I can connect and put/get an existing file, but I can't find anything that lets me run queries the way Python's pyathena library does:
using AWS, AWSS3, Serialization

struct SampleData
    a::Int
    b::String
end

d = SampleData(1, "sss")
aws = global_aws_config(; region="us-west-2")

# Serialize the struct and upload it to S3.
b = IOBuffer()
serialize(b, d)
s3_put(aws, "sagemaker-examplebucket", "myfile.bin", take!(b))  # take! returns exactly the bytes written

# Download and deserialize it again.
ddat = s3_get(aws, "sagemaker-examplebucket", "myfile.bin")
d2 = deserialize(IOBuffer(ddat))
@assert d == d2
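One route that looks possible is AWS.jl's @service macro, which generates low-level bindings for every AWS service, including Athena. Below is a rough, untested sketch of what a query round-trip might look like; the operation and parameter names follow the Athena API and would need to be checked against the generated module:

using AWS
@service Athena

aws = global_aws_config(; region="us-west-2")

# NOTE: untested sketch. Start the query; Athena writes its results to the staging bucket.
resp = Athena.start_query_execution(
    """SELECT something1, something2 FROM "customer_data"."sagemaker_data" WHERE something = 0""",
    Dict("ResultConfiguration" => Dict("OutputLocation" => "s3://sagemaker-examplebucket/"));
    aws_config=aws,
)
qid = resp["QueryExecutionId"]

# Athena runs queries asynchronously, so poll until a terminal state is reached.
status = nothing
while true
    global status = Athena.get_query_execution(qid; aws_config=aws)["QueryExecution"]["Status"]["State"]
    status in ("SUCCEEDED", "FAILED", "CANCELLED") && break
    sleep(1)
end

# Fetch the first page of result rows (use "NextToken" to page through the rest).
rows = Athena.get_query_results(qid; aws_config=aws)["ResultSet"]["Rows"]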
The SparkSQL.jl package enables Julia programs to work with Apache Spark data using SQL. It returns the results of Spark queries as Julia DataFrames, and you can move Julia data into your Spark queries as well. A common use case is machine learning: SparkSQL.jl makes it easy to pull data out of Spark with SQL, do the machine learning in Julia, and send the results back to Apache Spark. Example syntax:
# `sprk` is an existing SparkSQL.jl session (see the setup sketch below).
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)        # move the Julia data to Spark
createOrReplaceTempView(onSpark, "julia_data")   # expose it as a temp view
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)                       # pull the results back as a Julia DataFrame
describe(results)
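The example assumes a running SparkSQL.jl session bound to sprk. Per the SparkSQL.jl documentation, setup looks roughly like this (the master URL and application name below are placeholders, and Julia must be started with JULIA_COPY_STACKS=yes):

using SparkSQL, DataFrames

initJVM()                                                    # start the Java VM that hosts Spark
sprk = SparkSession("spark://master-url:7077", "JuliaApp")   # placeholder URL and app name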
To learn more, visit the SparkSQL.jl tutorial and project pages.
Thanks for the answer. Finding libraries for Julia has been a challenge! I don't have a Spark instance set up currently, but this looks doable. It would be great to keep this all within Julia, the way you can in R and Python.