Read and convert JSON file from a website

Aminath_Shausan · April 15, 2022, 10:35pm

Hi I am trying to download the json file from https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json

And then to save it as a csv file. At the moment I have used

using CSV, DataFrames, HTTP, Dates, JSON, JSONTables, JSON3
r = HTTP.get("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json");
df_vaiants_raw = DataFrame(jsontable(r.body))

But getting the error:

ArgumentError: input `JSON3.Object` must only have `JSON3.Array` values to be considered a table

Stacktrace:
 [1] jsontable(x::JSON3.Object{Vector{UInt8}, Vector{UInt64}})
   @ JSONTables ~/.julia/packages/JSONTables/g5bSA/src/JSONTables.jl:26
 [2] jsontable(source::Vector{UInt8})
   @ JSONTables ~/.julia/packages/JSONTables/g5bSA/src/JSONTables.jl:15
 [3] top-level scope
   @ In[42]:3
 [4] eval
   @ ./boot.jl:360 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1116

I am not sure how to fix it, and would like to get help.
Thanks
Aminath

dfdx · April 16, 2022, 5:51am

The provided JSON is nested, you will need to convert it to a flat representation before creating a DataFrame or writing a CSV file. Can you provide an example of the output you want to get?

Aminath_Shausan · April 16, 2022, 11:28pm

Thanks @dfdx . My final goal is to know the dominant variant in each country for a given period of time. I was hoping that I could get a data frame with columns:
Country; Week; Total sequence; variant name.

I am not sure if the raw data provided can be structured to reach to my final goal. I can see the information I am looking for as a visual plot here: CoVariants: Per Country.

Its mentioned that the visual plots are generated using the given raw data. So any alternate suggestion of a better way to approach/reach to my goal, that would be grateful.
Thanks
Aminath

dfdx · April 17, 2022, 8:04am

You need to flatten the structure manually, something like this:

import HTTP, JSON3, DataFrames.DataFrame

function load_data()
    r = HTTP.get("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json")
    json = JSON3.read(r.body)    # convert HTTP response to a nested dict
    variant_names = [   # extracted from the first record
        Symbol("20I (Alpha, V1)")
        Symbol("20H (Beta, V2)")
        Symbol("20J (Gamma, V3)")
        Symbol("21A (Delta)")
        Symbol("21I (Delta)")
        Symbol("21J (Delta)")
        Symbol("21K (Omicron)")
        Symbol("21L (Omicron)")
        Symbol("21B (Kappa)")
        Symbol("21D (Eta)")
        Symbol("21F (Iota)")
        Symbol("21G (Lambda)")
        Symbol("21H (Mu)")
        Symbol("20B/S:732A")
        Symbol("20A/S:126A")
        Symbol("20E (EU1)")
        Symbol("21C (Epsilon)")
        Symbol("20A/S:439K")
        Symbol("20A.EU2")
        Symbol("20A/S:98F")
        Symbol("20C/S:80Y")
        Symbol("20B/S:626S")
    ]
    df = DataFrame(
        :country => [],
        :week => [],
        :total_sequences => [],
        [name => [] for name in variant_names]...   # put each variant as a separate column
                                                                             # alternatively, create a single column for the dominant variant
    )
    for (country_name, country) in json["countries"]
        for i in 1:length(country[:week])
            row = (
                country_name,
                country[:week][i],
                country[:total_sequences][i],
                # if you want only the dominant variant, determine it here instead of recording all of them
                [haskey(country, variant) ? country[variant][i] : 0 for variant in variant_names]...
            )
            push!(df, row)
        end
    end
    return df
end

To summarize, DataFrames won’t magically understand the structure of your data, you have to tell it how to interpret it.

Aminath_Shausan · April 18, 2022, 2:29am

Thanks @dfdx , I have DataFrames package installed. But when I tried to import DataFrames.DataFrame, I am getting the following error.

invalid using path: "DataFrame" does not name a module

Stacktrace:
 [1] top-level scope
   @ In[48]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::

I am not sure how to fix it. Any help please?

jd-foster · April 18, 2022, 8:02am

That way, Julia thinks you’re asking for a sub module of DataFrames.
Better to put it on a new line as

import DataFrames: DataFrame

rocco_sprmnt21 · April 19, 2022, 12:26pm

a little crude, but it could be a good starting point

using DataFrames, HTTP, JSON3
r = HTTP.get("https://raw.githubusercontent.com/hodcroftlab/covariants/master/cluster_tables/EUClusters_data.json");
json = JSON3.read(r.body)

df=DataFrame(json.countries.Italy)

function dominant(;from ::String=json.plotting_dates.min_date , to::String=json.plotting_dates.max_date , in:: Symbol=:Italy)
    df=DataFrame(json.countries[in])
    dfs=subset(df, :week=> x-> from .<= x .<= to)
    r=select(dfs, 1,Not(1:2)=>ByRow((x...)->[names(df)[2+argmax(x)],maximum(x)])=>[:dominant,:max_of_week])
end


dI=dominant(from="2021-01-01",to="2022-04-30",in=:Italy)
dF=dominant(from="2021-01-01",to="2022-04-30",in=:France)
dG=dominant(from="2021-01-01",to="2022-04-30",in=:Germany)

dominant(in=:USA)



dfcountries=DataFrame(contry=collect(keys(json.countries)))
...

function propcol(r)
    R=collect(r)
    tot=R[1]
    R[1]-=sum(R[2:end])
    (;zip(keys(r),R./tot)...)
end

r=dfI[10,2:end]

pr=transform(dfI, AsTable(Not(1))=>ByRow(propcol)=>AsTable)
prg=transform(dfG, AsTable(Not(1))=>ByRow(propcol)=>AsTable)


prm=combine(pr, Not(1).=>maximum)
perm=sortperm(collect(prm[1,:])).+1


using GLMakie

f = Figure()
Axis(f[1, 1])
xs=1:51
ys_low =ys_high= repeat([0.],51)


for c in eachcol(pr[:,perm])
    band!(xs, ys_low, c+ys_low )
    ys_low +=c
end

Aminath_Shausan · April 23, 2022, 10:05pm

Thanks @jd-foster that worked

Aminath_Shausan · April 23, 2022, 10:08pm

@rocco_sprmnt21 I like the idea of the code as it generates the dominant variant.
I ran the code but getting an error at pr=transform(dfI, AsTable(Not(1))=>ByRow(propcol)=>AsTable) prg=transform(dfG, AsTable(Not(1))=>ByRow(propcol)=>AsTable)
Here is the error:

MethodError: no method matching -(::String, ::Int64)
Closest candidates are:
  -(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:86
  -(::T, ::Integer) where T<:AbstractChar at char.jl:222
  -(::DataValues.DataValue{T1}, ::T2) where {T1<:Number, T2<:Number} at /Users/uqashaus/.julia/packages/DataValues/N7oeL/src/scalar/core.jl:212
  ...

Stacktrace:
  [1] propcol(r::NamedTuple{(:dominant, :max_of_week), Tuple{String, Int64}})
    @ Main ./In[24]:30
  [2] iterate
    @ ./generator.jl:47 [inlined]
  [3] collect(itr::Base.Generator{Tables.NamedTupleIterator{Tables.Schema{(:dominant, :max_of_week), Tuple{String, Int64}}, Tables.RowIterator{NamedTuple{(:dominant, :max_of_week), Tuple{Vector{String}, Vector{Int64}}}}}, typeof(propcol)})
    @ Base ./array.jl:678
  [4] (::ByRow{typeof(propcol)})(table::NamedTuple{(:dominant, :max_of_week), Tuple{Vector{String}, Vector{Int64}}})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:174
  [5] _transformation_helper(df::DataFrame, col_idx::AsTable, ::Base.RefValue{Any})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:378
  [6] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:546
  [7] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1383
  [8] manipulate(::DataFrame, ::Any, ::Vararg{Any, N} where N; copycols::Bool, keeprows::Bool, renamecols::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1311
  [9] #select#460
    @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:940 [inlined]
 [10] transform(df::DataFrame, args::Any; copycols::Bool, renamecols::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1018
 [11] transform(df::DataFrame, args::Any)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/selection.jl:1007
 [12] top-level scope
    @ In[24]:36
 [13] eval
    @ ./boot.jl:360 [inlined]
 [14] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1116

I am not sure how to get rid of the error. Any ideas?
Thanks

rocco_sprmnt21 · April 24, 2022, 8:38am

the propcol function is applied to the second to last columns (Not(1)==2: end) of the dataframe.
You have to make sure that the data of these columns are numbers (integers).
I think the problem is tha the column “total …” is not of integer type.

Topic		Replies	Views
Efficiently Read JSON and Create DataFrame Performance json , dataframes	23	7733	April 3, 2025
DataFrames, best way to import from JSON format file Data	6	8448	October 15, 2019
Processing JSON from a .txt file and converting to a DataFrame New to Julia dataframes , json3	7	2594	May 15, 2021
How to read Panda's DataFrames from json file? New to Julia dataframes	27	975	February 1, 2023
How to read .json file in dataframe without using Pandas.jl General Usage json , dataframes	1	271	August 4, 2022

Read and convert JSON file from a website

Related topics