I’ve downloaded a JSON file from here containing records of nearly 350,000 National Lottery grants to good causes. I’m pretty new to Julia and even newer to JSON.
The structure of the JSON is like this, as best I can tell (the letter after each key is the first letter of its Julia type: D = Dict, V = Vector, S = String, I = Int, F = Float, B = Bool, N = Nothing):
grants (V)
|-- additional_data (D)
| |-- TSGRecipientType (S)
| |-- recipientCountryName (S)
| |-- TSGFundingOrgType (S)
| |-- GNCanonicalRecipientOrgId (S)
| |-- GNCanonicalFundingOrgId (S)
| |-- codeListLookup (D)
| | |-- regrantType (S)
| | |-- toIndividualsDetails (D)
| | | |-- primaryGrantReason (S)
| | | |-- grantPurpose (V)
| | | |-- secondaryGrantReason (S)
| |-- GNCanonicalRecipientOrgName (S)
| |-- GNRecipientOrgInfo0 (D)
| | |-- ageWhenAwarded (S)
| | |-- latestIncome (S)
| |-- recipientRegionName (S)
| |-- recipientLocation (S)
| |-- recipientDistrictName (S)
| |-- recipientOrgInfos (V)
| | |-- orgIDs (V)
| | |-- active (S)
| | |-- name (S)
| | |-- postalCode (S)
| | |-- id (S)
| | |-- dateRegistered (S)
| | |-- dateRemoved (S)
| | |-- linked_orgs (V)
| | |-- latestIncome (S)
| | |-- url (S)
| | |-- charityNumber (S)
| | |-- companyNumber (S)
| | |-- organisationTypePrimary (S)
| | |-- source (S)
| | |-- organisationType (V)
| | |-- dateModified (S)
| | |-- latestIncomeDate (S)
| |-- locationLookup (V)
| | |-- ladcd_active (S)
| | |-- ladcd (S)
| | |-- areatype (S)
| | |-- ladnm_active (S)
| | |-- ukcd (S)
| | |-- lad20cd (S)
| | |-- sourceCode (S)
| | |-- sourcefile (S)
| | |-- ctrynm (S)
| | |-- ctrycd (S)
| | |-- uknm (S)
| | |-- lad20nm (S)
| | |-- gbnm (S)
| | |-- utlacd (S)
| | |-- utlanm (S)
| | |-- ladnm (S)
| | |-- gbcd (S)
| | |-- latitude (F)
| | |-- areacode (S)
| | |-- source (S)
| | |-- areaname (S)
| | |-- longitude (F)
| |-- GNCanonicalFundingOrgName (S)
| |-- recipientDistrictGeoCode (S)
|-- dataType (S)
|-- plannedDates (V)
| |-- duration (I)
| |-- startDate (S)
| |-- endDateDateOnly (N)
| |-- startDateDateOnly (S)
|-- currency (S)
|-- awardDateDateOnly (S)
|-- title_and_description (S)
|-- id (S)
|-- description (S)
|-- recipientOrganization (V)
| |-- name (S)
| |-- location (V)
| | |-- name (S)
| | |-- geoCode (S)
| |-- id_and_name (S)
| |-- id (S)
| |-- companyNumber (S)
| |-- charityNumber (S)
| |-- organisationType (S)
|-- amountAwarded (I)
|-- grantProgramme (V)
| |-- title_keyword (S)
| |-- title (S)
| |-- code (S)
|-- dataset (D)
| |-- publisher (D)
| | |-- name (S)
| | |-- last_published (S)
| | |-- org_id (S)
| | |-- logo (S)
| | |-- website (S)
| | |-- prefix (S)
| |-- issued (S)
| |-- funders (V)
| | |-- 1 (S)
| | |-- 2 (S)
| |-- distribution (V)
| | |-- accessURL (S)
| | |-- downloadURL (S)
| | |-- title (S)
| |-- datagetter_metadata (D)
| | |-- downloads (B)
| | |-- file_size (I)
| | |-- acceptable_license (B)
| | |-- json (S)
| | |-- datetime_downloaded (S)
| | |-- valid (B)
| | |-- file_type (S)
| |-- description (S)
| |-- modified (S)
| |-- license_name (S)
| |-- title (S)
| |-- license (S)
| |-- identifier (S)
|-- amountAppliedFor (I)
|-- filename (S)
|-- title (S)
|-- fundingOrganization (V)
| |-- name (S)
| |-- id_and_name (S)
| |-- id (S)
|-- awardDate (S)
|-- dateModified (S)
|-- simple_grant_type (S)
license (S)
The grants element holds the 344,601 grant records. Each grant may hold a different subset of the exported variables and a different number of repetitions (serialisations) of each item.
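For example, after parsing, I can get at individual values like this (field names taken from the tree above; a given grant may be missing some of these keys):
using JSON
d = JSON.parsefile(jsonfile)              # jsonfile as defined below; top level is a Dict with "grants" and "license"
grants = d["grants"]                      # Vector of grant Dicts
g = grants[1]                             # one grant
g["title"]                                # (S) entries are plain strings
g["recipientOrganization"][1]["name"]     # (V) entries are Vectors, here of Dicts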
My code walks recursively through the levels of the JSON and extracts the values for each grant into a one-row DataFrame. After each grant, I vcat this onto an accumulating DataFrame, using cols=:union to allow for the different subsets of values across grants.
I do this in chunks of 500 grants because I quickly run out of memory and either crash or slow down significantly. I write each chunk to a tmp CSV file. Once I’ve finished, I vcat all the tmp files into one overall DataFrame and also write this to a master CSV file.
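To show what the cols=:union option does in those vcat calls (with made-up columns): two one-row DataFrames with different columns stack into one frame, and the gaps are filled with missing:
using DataFrames
a = DataFrame(title = ["Grant A"], amountAwarded = [1000])
b = DataFrame(title = ["Grant B"], recipientRegionName = ["Wales"])
vcat(a, b, cols = :union)   # 2×3 DataFrame: title, amountAwarded, recipientRegionName,
                            # with missing in the cells a row doesn't have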
This process seems to work OK. However, it is very slow. It takes well over 2 hours to process all grants.
Here is my code, which I run in VS Code:
using DataFrames, JSON, CSV
# Recursively print the structure of the parsed JSON: each key plus the first
# letter of its Julia type (D = Dict, V = Vector, S = String, ...)
function dlist(d, l)
    for k in keys(d)
        t = first(string(typeof(d[k])), 1)
        a = l > 1 ? " |-- " : ""
        println(" |"^(max(l - 2, 0)), a, k, " ($t)")
        if t == "D"
            dlist(d[k], l + 1)
        elseif t == "V"
            # descend into the first element of a vector unless it holds plain strings
            if length(d[k]) > 0 && !(d[k][1] isa String)
                dlist(d[k][1], l + 1)
            end
        end
    end
end
# Walk one grant (a nested structure of Dicts and Vectors) and add one column
# per leaf value to the one-row DataFrame `vals`. Column names are built by
# concatenating the keys along the path, e.g. "dataset:publisher:name".
# `g` is the grant number; it is only carried through the recursion.
function extractjsonvalues(g, d, l, varname, vals)
    for k in keys(d)
        t = first(string(typeof(d[k])), 1)
        if l == 1
            varname = k
        else
            oldname = varname
            varname = varname * string(k)
        end
        if t == "D"
            vals = extractjsonvalues(g, d[k], l + 1, varname * ":", vals)
        elseif t == "V"
            if length(d[k]) > 0
                if !(d[k][1] isa String)
                    # vector of Dicts: recurse into each element, numbering them 0, 1, ...
                    for (q, dk) in enumerate(keys(d[k]))
                        z = q - 1
                        vals = extractjsonvalues(g, d[k][dk], l + 1, varname * ":$z:", vals)
                    end
                else
                    keylength = length(keys(d[k]))
                    if keylength > 1
                        # for strings like ["GB-COH-223341", "GB-CHC-027652"]
                        # remove quotes and return as GB-COH-223341, GB-CHC-027652
                        v = ""
                        for i in values(d[k])
                            v = v * i * ", "
                        end
                        vals[!, varname] = [first(v, length(v) - 2)]
                    else
                        for dk in keys(d[k])
                            vals[!, varname] = [get(d[k], dk, missing)]
                        end
                    end
                end
            end
        else
            # scalar leaf (string, number, bool or null)
            varval = get(d, k, missing)
            if varval === nothing
                varval = missing
            else
                # for strings like [["GB-COH-223341", "GB-CHC-027652"]]
                # remove quotes and square brackets and return as GB-COH-223341, GB-CHC-027652
                if !ismissing(varval)
                    if first(varval, 2) == "[\"" && last(varval, 2) == "\"]"
                        varval = replace(varval, "[" => "", "]" => "", "\"" => "")
                    end
                end
            end
            vals[!, varname] = [varval]
        end
        if l > 1
            varname = oldname
        end
    end
    return vals
end
function json2chunkedcsv(json, csvname)
    print(rpad("\nParsing JSON result", 40), " ... ")
    @time d = JSON.parsefile(json)["grants"]
    chunkdata = DataFrame()
    singlegrant = DataFrame()
    println("\nExtracting data:")
    @time begin
        let step = 500
            for chunk = 1:step:min(div(length(d), step) * step + 1, 500000)
                @time begin
                    for g = chunk:min(chunk + step - 1, length(d))
                        # build a one-row DataFrame for this grant and append it to the chunk
                        singlegrant = extractjsonvalues(g, d[g], 1, "", singlegrant)
                        chunkdata = vcat(chunkdata, singlegrant, cols=:union)
                        singlegrant = singlegrant[:, []]   # empty it again for the next grant
                    end
                    # write the chunk to a temporary CSV file and start afresh
                    CSV.write("TMP_" * string(chunk + step - 1) * "_" * csvname, chunkdata)
                    chunkdata = chunkdata[:, []]
                    print(rpad(" Finished extracting " * string(min(chunk + step - 1, length(d))) * " grants", 40), " ... ")
                end
            end
        end
    end
end
function writegrants2bigcsv(csvname)
    @time begin
        # collect the TMP_*_csvname files written by json2chunkedcsv
        TMPfiles = DataFrame(fname=readdir(pwd()))
        filter!(:fname => n -> occursin(csvname, n) && n != csvname, TMPfiles)
        print(rpad("\nMerging " * string(nrow(TMPfiles)) * " TMP files", 40), " ... ")
        GrantData = CSV.read(TMPfiles.fname[1], DataFrame)
        rm(TMPfiles.fname[1], force=false)
        for tmp = 2:nrow(TMPfiles)
            df = CSV.read(TMPfiles.fname[tmp], DataFrame)
            GrantData = vcat(GrantData, df, cols=:union)
            rm(TMPfiles.fname[tmp], force=false)
        end
        CSV.write(csvname, GrantData)
    end
    return GrantData
end
jsonfile = raw"360Giving Data\grantnav-20231104090448.json"
csvname = "new360JSON-20231104090448.csv"

dlist(JSON.parsefile(jsonfile), 1)        # prints the structure shown above
json2chunkedcsv(jsonfile, csvname)        # extract grants in chunks of 500 to TMP CSV files
GrantsData = writegrants2bigcsv(csvname)  # merge the TMP files into one DataFrame / master CSV
println()
So my question is: how can I make this go faster? I’ve seen examples on here where more expert users have suggested changes to naive users like me that speed things up by 10x, 100x, even 1000x. Is that possible here?
Thanks!