I’ve downloaded a JSON file from here containing records of nearly 350,000 National Lottery grants to good causes. I’m pretty new to Julia and even newer to JSON.
The structure of the JSON is like this, as best I can tell (the letter after each key is the first letter of its Julia type: D = Dict, V = Vector, S = String, I = Int, F = Float, B = Bool, N = Nothing):
grants (V)
|-- additional_data (D)
| |-- TSGRecipientType (S)
| |-- recipientCountryName (S)
| |-- TSGFundingOrgType (S)
| |-- GNCanonicalRecipientOrgId (S)
| |-- GNCanonicalFundingOrgId (S)
| |-- codeListLookup (D)
| | |-- regrantType (S)
| | |-- toIndividualsDetails (D)
| | | |-- primaryGrantReason (S)
| | | |-- grantPurpose (V)
| | | |-- secondaryGrantReason (S)
| |-- GNCanonicalRecipientOrgName (S)
| |-- GNRecipientOrgInfo0 (D)
| | |-- ageWhenAwarded (S)
| | |-- latestIncome (S)
| |-- recipientRegionName (S)
| |-- recipientLocation (S)
| |-- recipientDistrictName (S)
| |-- recipientOrgInfos (V)
| | |-- orgIDs (V)
| | |-- active (S)
| | |-- name (S)
| | |-- postalCode (S)
| | |-- id (S)
| | |-- dateRegistered (S)
| | |-- dateRemoved (S)
| | |-- linked_orgs (V)
| | |-- latestIncome (S)
| | |-- url (S)
| | |-- charityNumber (S)
| | |-- companyNumber (S)
| | |-- organisationTypePrimary (S)
| | |-- source (S)
| | |-- organisationType (V)
| | |-- dateModified (S)
| | |-- latestIncomeDate (S)
| |-- locationLookup (V)
| | |-- ladcd_active (S)
| | |-- ladcd (S)
| | |-- areatype (S)
| | |-- ladnm_active (S)
| | |-- ukcd (S)
| | |-- lad20cd (S)
| | |-- sourceCode (S)
| | |-- sourcefile (S)
| | |-- ctrynm (S)
| | |-- ctrycd (S)
| | |-- uknm (S)
| | |-- lad20nm (S)
| | |-- gbnm (S)
| | |-- utlacd (S)
| | |-- utlanm (S)
| | |-- ladnm (S)
| | |-- gbcd (S)
| | |-- latitude (F)
| | |-- areacode (S)
| | |-- source (S)
| | |-- areaname (S)
| | |-- longitude (F)
| |-- GNCanonicalFundingOrgName (S)
| |-- recipientDistrictGeoCode (S)
|-- dataType (S)
|-- plannedDates (V)
| |-- duration (I)
| |-- startDate (S)
| |-- endDateDateOnly (N)
| |-- startDateDateOnly (S)
|-- currency (S)
|-- awardDateDateOnly (S)
|-- title_and_description (S)
|-- id (S)
|-- description (S)
|-- recipientOrganization (V)
| |-- name (S)
| |-- location (V)
| | |-- name (S)
| | |-- geoCode (S)
| |-- id_and_name (S)
| |-- id (S)
| |-- companyNumber (S)
| |-- charityNumber (S)
| |-- organisationType (S)
|-- amountAwarded (I)
|-- grantProgramme (V)
| |-- title_keyword (S)
| |-- title (S)
| |-- code (S)
|-- dataset (D)
| |-- publisher (D)
| | |-- name (S)
| | |-- last_published (S)
| | |-- org_id (S)
| | |-- logo (S)
| | |-- website (S)
| | |-- prefix (S)
| |-- issued (S)
| |-- funders (V)
| | |-- 1 (S)
| | |-- 2 (S)
| |-- distribution (V)
| | |-- accessURL (S)
| | |-- downloadURL (S)
| | |-- title (S)
| |-- datagetter_metadata (D)
| | |-- downloads (B)
| | |-- file_size (I)
| | |-- acceptable_license (B)
| | |-- json (S)
| | |-- datetime_downloaded (S)
| | |-- valid (B)
| | |-- file_type (S)
| |-- description (S)
| |-- modified (S)
| |-- license_name (S)
| |-- title (S)
| |-- license (S)
| |-- identifier (S)
|-- amountAppliedFor (I)
|-- filename (S)
|-- title (S)
|-- fundingOrganization (V)
| |-- name (S)
| |-- id_and_name (S)
| |-- id (S)
|-- awardDate (S)
|-- dateModified (S)
|-- simple_grant_type (S)
license (S)
The grants element holds the 344,601 grant records. Each grant may hold a different subset of the exported variables and a different number of repetitions (serialisations) of each item.
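For example, after parsing, I can get at individual values like this (field names taken from the tree above; a given grant may be missing some of these keys):
using JSON
d = JSON.parsefile(jsonfile)              # jsonfile as defined below; top level is a Dict with "grants" and "license"
grants = d["grants"]                      # Vector of grant Dicts
g = grants[1]                             # one grant
g["title"]                                # (S) entries are plain strings
g["recipientOrganization"][1]["name"]     # (V) entries are Vectors, here of Dicts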
My code walks recursively through the levels of the JSON and extracts the values for each grant into a one-row DataFrame. After each grant, I vcat this onto an accumulating DataFrame, using cols=:union to allow for the different subsets of values across grants.
I do this in chunks of 500 grants because I quickly run out of memory and either crash or slow down significantly. I write each chunk to a tmp CSV file. Once I’ve finished, I vcat all the tmp files into one overall DataFrame and also write this to a master CSV file.
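To show what the cols=:union option does in those vcat calls (with made-up columns): two one-row DataFrames with different columns stack into one frame, and the gaps are filled with missing:
using DataFrames
a = DataFrame(title = ["Grant A"], amountAwarded = [1000])
b = DataFrame(title = ["Grant B"], recipientRegionName = ["Wales"])
vcat(a, b, cols = :union)   # 2×3 DataFrame: title, amountAwarded, recipientRegionName,
                            # with missing in the cells a row doesn't have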
This process seems to work OK. However, it is very slow. It takes well over 2 hours to process all grants.
Here is my code, which I run in VS Code:
using DataFrames, JSON, CSV
# Recursively print the structure of the parsed JSON: each key plus the first
# letter of its Julia type (D = Dict, V = Vector, S = String, ...)
function dlist(d, l)
    for k in keys(d)
        t = first(string(typeof(d[k])), 1)
        a = l > 1 ? " |-- " : ""
        println(" |"^(max(l - 2, 0)), a, k, " ($t)")
        if t == "D"
            dlist(d[k], l + 1)
        elseif t == "V"
            # descend into the first element of a vector unless it holds plain strings
            if length(d[k]) > 0 && !(d[k][1] isa String)
                dlist(d[k][1], l + 1)
            end
        end
    end
end
# Walk one grant (a nested structure of Dicts and Vectors) and add one column
# per leaf value to the one-row DataFrame `vals`. Column names are built by
# concatenating the keys along the path, e.g. "dataset:publisher:name".
# `g` is the grant number; it is only carried through the recursion.
function extractjsonvalues(g, d, l, varname, vals)
    for k in keys(d)
        t = first(string(typeof(d[k])), 1)
        if l == 1
            varname = k
        else
            oldname = varname
            varname = varname * string(k)
        end
        if t == "D"
            vals = extractjsonvalues(g, d[k], l + 1, varname * ":", vals)
        elseif t == "V"
            if length(d[k]) > 0
                if !(d[k][1] isa String)
                    # vector of Dicts: recurse into each element, numbering them 0, 1, ...
                    for (q, dk) in enumerate(keys(d[k]))
                        z = q - 1
                        vals = extractjsonvalues(g, d[k][dk], l + 1, varname * ":$z:", vals)
                    end
                else
                    keylength = length(keys(d[k]))
                    if keylength > 1
                        # for strings like ["GB-COH-223341", "GB-CHC-027652"]
                        # remove quotes and return as GB-COH-223341, GB-CHC-027652
                        v = ""
                        for i in values(d[k])
                            v = v * i * ", "
                        end
                        vals[!, varname] = [first(v, length(v) - 2)]
                    else
                        for dk in keys(d[k])
                            vals[!, varname] = [get(d[k], dk, missing)]
                        end
                    end
                end
            end
        else
            # scalar leaf (string, number, bool or null)
            varval = get(d, k, missing)
            if varval === nothing
                varval = missing
            else
                # for strings like [["GB-COH-223341", "GB-CHC-027652"]]
                # remove quotes and square brackets and return as GB-COH-223341, GB-CHC-027652
                if !ismissing(varval)
                    if first(varval, 2) == "[\"" && last(varval, 2) == "\"]"
                        varval = replace(varval, "[" => "", "]" => "", "\"" => "")
                    end
                end
            end
            vals[!, varname] = [varval]
        end
        if l > 1
            varname = oldname
        end
    end
    return vals
end
function json2chunkedcsv(json, csvname)
    print(rpad("\nParsing JSON result", 40), " ... ")
    @time d = JSON.parsefile(json)["grants"]
    chunkdata = DataFrame()
    singlegrant = DataFrame()
    println("\nExtracting data:")
    @time begin
        let step = 500
            for chunk = 1:step:min(div(length(d), step) * step + 1, 500000)
                @time begin
                    for g = chunk:min(chunk + step - 1, length(d))
                        # build a one-row DataFrame for this grant and append it to the chunk
                        singlegrant = extractjsonvalues(g, d[g], 1, "", singlegrant)
                        chunkdata = vcat(chunkdata, singlegrant, cols=:union)
                        singlegrant = singlegrant[:, []]   # empty it again for the next grant
                    end
                    # write the chunk to a temporary CSV file and start afresh
                    CSV.write("TMP_" * string(chunk + step - 1) * "_" * csvname, chunkdata)
                    chunkdata = chunkdata[:, []]
                    print(rpad(" Finished extracting " * string(min(chunk + step - 1, length(d))) * " grants", 40), " ... ")
                end
            end
        end
    end
end
function writegrants2bigcsv(csvname)
    @time begin
        # collect the TMP_*_csvname files written by json2chunkedcsv
        TMPfiles = DataFrame(fname=readdir(pwd()))
        filter!(:fname => n -> occursin(csvname, n) && n != csvname, TMPfiles)
        print(rpad("\nMerging " * string(nrow(TMPfiles)) * " TMP files", 40), " ... ")
        GrantData = CSV.read(TMPfiles.fname[1], DataFrame)
        rm(TMPfiles.fname[1], force=false)
        for tmp = 2:nrow(TMPfiles)
            df = CSV.read(TMPfiles.fname[tmp], DataFrame)
            GrantData = vcat(GrantData, df, cols=:union)
            rm(TMPfiles.fname[tmp], force=false)
        end
        CSV.write(csvname, GrantData)
    end
    return GrantData
end
jsonfile = raw"360Giving Data\grantnav-20231104090448.json"
csvname = "new360JSON-20231104090448.csv"

dlist(JSON.parsefile(jsonfile), 1)        # prints the structure shown above
json2chunkedcsv(jsonfile, csvname)        # extract grants in chunks of 500 to TMP CSV files
GrantsData = writegrants2bigcsv(csvname)  # merge the TMP files into one DataFrame / master CSV
println()
So my question is: how can I make this go faster? I’ve seen examples on here where more expert users have suggested changes to naive users like me that speed things up by 10x, 100x, even 1000x. Is that possible here?
Thanks!