KeyError when writing Parquet files to S3

I’m trying to write a Parquet file to S3 and I’m not sure exactly what the issue is. This code has worked in the past, and there doesn’t seem to be any significant change to the underlying data. It’s possible that all the values in a single column are “missing”, but I don’t believe that is causing the problem.

I am getting the following error:

ERROR: LoadError: KeyError: key Union{} not found
Stacktrace:
 [1] getindex at ./dict.jl:467 [inlined]
 [2] write_col(::FilePathsBase.FileBuffer, ::Array{Missing,1}, ::String, ::Int32, ::Int32; nchunks::Int64) at /home/ec2-user/.julia/packages/Parquet/O0PXc/src/writer.jl:369
 [3] _write_parquet(::FilePathsBase.FileBuffer, ::Tables.Columns{DataFrames.DataFrameColumns{DataFrame}}, ::Array{Symbol,1}, ::Int64; ncols::Int64, encoding::Dict{String,Int32}, codec::Dict{String,Int32}) at /home/ec2-user/.julia/packages/Parquet/O0PXc/src/writer.jl:563
 [4] write_parquet(::FilePathsBase.FileBuffer, ::DataFrame; compression_codec::String) at /home/ec2-user/.julia/packages/Parquet/O0PXc/src/writer.jl:506
 [5] #83 at /home/ec2-user/.julia/packages/Parquet/O0PXc/src/writer.jl:526 [inlined]
 [6] open(::Parquet.var"#83#84"{String,DataFrame}, ::S3Path{Nothing}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:325
 [7] open at ./io.jl:323 [inlined]
 [8] #write_parquet#82 at /home/ec2-user/.julia/packages/Parquet/O0PXc/src/writer.jl:525 [inlined]
 [9] write_parquet(::S3Path{Nothing}, ::DataFrame) at /home/ec2-user/.julia/packages/Parquet/O0PXc/src/writer.jl:525
 [10] top-level scope at /home/ec2-user/environment/Analytics Testing/julia_orch.jl:383
 [11] include(::Function, ::Module, ::String) at ./Base.jl:380
 [12] include(::Module, ::String) at ./Base.jl:368
 [13] exec_options(::Base.JLOptions) at ./client.jl:296
 [14] _start() at ./client.jl:506
in expression starting at /home/ec2-user/environment/Analytics Testing/julia_orch.jl:383
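The KeyError suggests that Parquet.jl’s writer is looking up a column’s element type in an internal type-mapping Dict: the stack trace shows write_col receiving an Array{Missing,1}, and for a column containing only missing values the non-missing element type is Union{}, which has no entry in that Dict. Assuming Parquet.jl and DataFrames are installed, I’d expect this minimal local sketch to reproduce the same error (the file name is just an example):

using DataFrames, Parquet

# Column b contains only `missing`, so it is an Array{Missing,1},
# the same column type shown in the stack trace above.
df = DataFrame(a = [1, 2, 3], b = [missing, missing, missing])

# Expected to throw: KeyError: key Union{} not found
write_parquet("repro.parquet", df)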

The code that throws the error is shown in the snippet below:

# Write the link prediction results to an S3 bucket.
path = S3Path("s3://$s3_bucket_output/$s3_filepath_output/link_prediction_df_" * unique_identifier * ".parquet")
@info("Outputting link prediction data to $path")
write_parquet(path, link_prediction_df)

Lastly, the contents of the dataframe are shown below:

5×18 DataFrame
 Row │ MONTHS_FROM_START  TOPIC_I   TOPIC_J   PIJ          PI_PJ        PI_GIVEN_J  RIJ        RIJ_HAT     DIJ          EIJ          DIJ_HAT      EIJ_HAT      RIJ_DELTA  RIJ_TREND  PI_PJ_NOVELTY  PI_PJ_LOG  YEAR   MONTH 
     │ Int64              String    String    Float64      Float64      Float64     Float64    Float64     Float64?     Float64?     Float64?     Float64?     Missing    Missing    Missing        Float64    Int64  Int64 
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │                 1  TOPIC_38  TOPIC_29  0.000288662  0.000337401   0.0157594  -0.225085  -0.0919115  -0.0100146   0.000345866  -0.00286437  0.000323948    missing    missing        missing   -7.99424   2018      2
   2 │                 1  TOPIC_38  TOPIC_18  0.00032709   0.000377291   0.0159694  -0.205989  -0.0822713  -0.00976723  0.000231847  -0.00268832  0.000228191    missing    missing        missing   -7.88249   2018      2
   3 │                 1  TOPIC_38  TOPIC_19  0.000414862  0.000508113   0.0150398  -0.292518  -0.0412777  -0.011874    0.000683656  -0.00253039  0.000313784    missing    missing        missing   -7.58481   2018      2
   4 │                 1  TOPIC_38  TOPIC_21  0.000422538  0.000489185   0.0159107  -0.2113    -0.0424972  -0.0103839   0.000352965  -0.00261318  0.000287849    missing    missing        missing   -7.62277   2018      2
   5 │                 1  TOPIC_26  TOPIC_38  0.000405344  0.000451404   0.0220053  -0.155272   0.0410278  -0.0110256   0.000526242  -0.00253196  0.000418073    missing    missing        missing   -7.70315   2018      2

It turns out that Parquet.jl’s writer cannot handle columns whose element type is Missing, i.e. columns in which every value is missing (here RIJ_DELTA, RIJ_TREND, and PI_PJ_NOVELTY). I used the following code to change the values in those columns to NaN:

using DataFramesMeta
@rtransform!(my_df, :MY_COLUMN = ismissing(:MY_COLUMN) ? NaN : :MY_COLUMN)

I applied the above row transform macro to all columns that only contained missing values.
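If several columns are affected, a loop saves repeating the macro per column. A minimal sketch assuming the DataFrame from the post, and assuming NaN (i.e. Float64) is an acceptable stand-in for every all-missing column:

using DataFrames

# Replace every all-missing column (element type Missing) with Float64 NaN,
# so the writer can map the element type to a Parquet physical type.
for col in names(link_prediction_df)
    if eltype(link_prediction_df[!, col]) === Missing
        link_prediction_df[!, col] = fill(NaN, nrow(link_prediction_df))
    end
end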
