Read and convert JSON file from a website

Hi I am trying to download the json file from https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json

And then to save it as a csv file. At the moment I have used

using CSV, DataFrames, HTTP, Dates, JSON, JSONTables, JSON3
r = HTTP.get("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json");
df_vaiants_raw = DataFrame(jsontable(r.body))

But getting the error:

ArgumentError: input `JSON3.Object` must only have `JSON3.Array` values to be considered a table

Stacktrace:
 [1] jsontable(x::JSON3.Object{Vector{UInt8}, Vector{UInt64}})
   @ JSONTables ~/.julia/packages/JSONTables/g5bSA/src/JSONTables.jl:26
 [2] jsontable(source::Vector{UInt8})
   @ JSONTables ~/.julia/packages/JSONTables/g5bSA/src/JSONTables.jl:15
 [3] top-level scope
   @ In[42]:3
 [4] eval
   @ ./boot.jl:360 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1116

I am not sure how to fix it, and would like to get help.
Thanks
Aminath

The provided JSON is nested, you will need to convert it to a flat representation before creating a DataFrame or writing a CSV file. Can you provide an example of the output you want to get?

Thanks @dfdx . My final goal is to know the dominant variant in each country for a given period of time. I was hoping that I could get a data frame with columns:
Country; Week; Total sequence; variant name.

I am not sure if the raw data provided can be structured to reach to my final goal. I can see the information I am looking for as a visual plot here: CoVariants: Per Country.

Its mentioned that the visual plots are generated using the given raw data. So any alternate suggestion of a better way to approach/reach to my goal, that would be grateful.
Thanks
Aminath

You need to flatten the structure manually, something like this:

import HTTP, JSON3, DataFrames.DataFrame

function load_data()
    r = HTTP.get("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json")
    json = JSON3.read(r.body)    # convert HTTP response to a nested dict
    variant_names = [   # extracted from the first record
        Symbol("20I (Alpha, V1)")
        Symbol("20H (Beta, V2)")
        Symbol("20J (Gamma, V3)")
        Symbol("21A (Delta)")
        Symbol("21I (Delta)")
        Symbol("21J (Delta)")
        Symbol("21K (Omicron)")
        Symbol("21L (Omicron)")
        Symbol("21B (Kappa)")
        Symbol("21D (Eta)")
        Symbol("21F (Iota)")
        Symbol("21G (Lambda)")
        Symbol("21H (Mu)")
        Symbol("20B/S:732A")
        Symbol("20A/S:126A")
        Symbol("20E (EU1)")
        Symbol("21C (Epsilon)")
        Symbol("20A/S:439K")
        Symbol("20A.EU2")
        Symbol("20A/S:98F")
        Symbol("20C/S:80Y")
        Symbol("20B/S:626S")
    ]
    df = DataFrame(
        :country => [],
        :week => [],
        :total_sequences => [],
        [name => [] for name in variant_names]...   # put each variant as a separate column
                                                                             # alternatively, create a single column for the dominant variant
    )
    for (country_name, country) in json["countries"]
        for i in 1:length(country[:week])
            row = (
                country_name,
                country[:week][i],
                country[:total_sequences][i],
                # if you want only the dominant variant, determine it here instead of recording all of them
                [haskey(country, variant) ? country[variant][i] : 0 for variant in variant_names]...
            )
            push!(df, row)
        end
    end
    return df
end

To summarize, DataFrames won’t magically understand the structure of your data, you have to tell it how to interpret it.

1 Like

Thanks @dfdx , I have DataFrames package installed. But when I tried to import DataFrames.DataFrame, I am getting the following error.

invalid using path: "DataFrame" does not name a module

Stacktrace:
 [1] top-level scope
   @ In[48]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::

I am not sure how to fix it. Any help please?

That way, Julia thinks you’re asking for a sub module of DataFrames.
Better to put it on a new line as

import DataFrames: DataFrame

a little crude, but it could be a good starting point

using DataFrames, HTTP, JSON3
r = HTTP.get("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json");
json = JSON3.read(r.body)

df=DataFrame(json.countries.Italy)

function dominant(;from ::String=json.plotting_dates.min_date , to::String=json.plotting_dates.max_date , in:: Symbol=:Italy)
    df=DataFrame(json.countries[in])
    dfs=subset(df, :week=> x-> from .<= x .<= to)
    r=select(dfs, 1,Not(1:2)=>ByRow((x...)->[names(df)[2+argmax(x)],maximum(x)])=>[:dominant,:max_of_week])
end


dI=dominant(from="2021-01-01",to="2022-04-30",in=:Italy)
dF=dominant(from="2021-01-01",to="2022-04-30",in=:France)
dG=dominant(from="2021-01-01",to="2022-04-30",in=:Germany)

dominant(in=:USA)



dfcountries=DataFrame(contry=collect(keys(json.countries)))
...


function propcol(r)
    R=collect(r)
    tot=R[1]
    R[1]-=sum(R[2:end])
    (;zip(keys(r),R./tot)...)
end

r=dfI[10,2:end]

pr=transform(dfI, AsTable(Not(1))=>ByRow(propcol)=>AsTable)
prg=transform(dfG, AsTable(Not(1))=>ByRow(propcol)=>AsTable)


prm=combine(pr, Not(1).=>maximum)
perm=sortperm(collect(prm[1,:])).+1


using GLMakie

f = Figure()
Axis(f[1, 1])
xs=1:51
ys_low =ys_high= repeat([0.],51)


for c in eachcol(pr[:,perm])
    band!(xs, ys_low, c+ys_low )
    ys_low +=c
end

Thanks @jd-foster that worked

@rocco_sprmnt21 I like the idea of the code as it generates the dominant variant.
I ran the code but getting an error at pr=transform(dfI, AsTable(Not(1))=>ByRow(propcol)=>AsTable) prg=transform(dfG, AsTable(Not(1))=>ByRow(propcol)=>AsTable)
Here is the error:

MethodError: no method matching -(::String, ::Int64)
Closest candidates are:
  -(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:86
  -(::T, ::Integer) where T<:AbstractChar at char.jl:222
  -(::DataValues.DataValue{T1}, ::T2) where {T1<:Number, T2<:Number} at /Users/uqashaus/.julia/packages/DataValues/N7oeL/src/scalar/core.jl:212
  ...

Stacktrace:
  [1] propcol(r::NamedTuple{(:dominant, :max_of_week), Tuple{String, Int64}})
    @ Main ./In[24]:30
  [2] iterate
    @ ./generator.jl:47 [inlined]
  [3] collect(itr::Base.Generator{Tables.NamedTupleIterator{Tables.Schema{(:dominant, :max_of_week), Tuple{String, Int64}}, Tables.RowIterator{NamedTuple{(:dominant, :max_of_week), Tuple{Vector{String}, Vector{Int64}}}}}, typeof(propcol)})
    @ Base ./array.jl:678
  [4] (::ByRow{typeof(propcol)})(table::NamedTuple{(:dominant, :max_of_week), Tuple{Vector{String}, Vector{Int64}}})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:174
  [5] _transformation_helper(df::DataFrame, col_idx::AsTable, ::Base.RefValue{Any})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:378
  [6] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:546
  [7] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1383
  [8] manipulate(::DataFrame, ::Any, ::Vararg{Any, N} where N; copycols::Bool, keeprows::Bool, renamecols::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1311
  [9] #select#460
    @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:940 [inlined]
 [10] transform(df::DataFrame, args::Any; copycols::Bool, renamecols::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1018
 [11] transform(df::DataFrame, args::Any)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1007
 [12] top-level scope
    @ In[24]:36
 [13] eval
    @ ./boot.jl:360 [inlined]
 [14] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1116

I am not sure how to get rid of the error. Any ideas?
Thanks

the propcol function is applied to the second to last columns (Not(1)==2: end) of the dataframe.
You have to make sure that the data of these columns are numbers (integers).
I think the problem is tha the column “total …” is not of integer type.