Hello everyone,
I’m trying to read and select a subset of a large JSON file from here. The result is a DataFrame of a few thousand rows (1600+), depending on the game selected. However, I cannot rerun my process: after a first run, too much memory is still in use and Julia crashes.
I think there are two main problems here:
- I can’t find any way to parse only the subset of the JSON I need.
- I don’t know how to free the memory after each run; I tried calling GC.gc(), but it doesn’t seem to help.
More generally, is there a better way to do this?
Here is a minimal working example:
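For the first point, I don’t think JSON3 can skip parts of the document while parsing, but one thing I wondered about is parsing straight into a small typed struct holding only the fields I need, instead of the generic lazy tree. A toy sketch (the field names and data below are made up, not the real schema; only matchId comes from my actual file):

```julia
using JSON3, StructTypes

# Hypothetical event type with only the fields I care about.
struct Event
    matchId::Int
    eventName::String
end
StructTypes.StructType(::Type{Event}) = StructTypes.Struct()

# Toy stand-in for the real events file.
str = """[{"matchId": 1, "eventName": "Pass"}, {"matchId": 2, "eventName": "Shot"}]"""

# Parse directly into typed structs, then filter to the game I want.
events = JSON3.read(str, Vector{Event})
subset = filter(e -> e.matchId == 1, events)
```

I don’t know whether this actually helps with the memory problem on the full file, though.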
using DataFrames, JSON3, JSONTables, ZipFile
# workaround so DataFrame accepts a Vector{JSON3.Object} (type piracy, but works for the MWE)
Base.convert(::Type{JSON3.Array}, x::Vector{JSON3.Object}) = x
# utility function to get the requested event file out of the zip archive
function get_events(events::String)
    # assuming that the events.zip file is in the tmp directory
    zarchive = ZipFile.Reader("/tmp/events.zip")
    dictio = Dict(zarchive.files[i].name => i for i in eachindex(zarchive.files))
    file_num = dictio[events]
    str = read(zarchive.files[file_num])
    close(zarchive)  # don't leak the file handle
    return str
end
# function to parse the json file
function createDict(str)
    inDict = JSON3.read(str)
    return inDict
end
# here I take only the subset I need (i.e. the data for the specified game)
function create_subset_json(json_data, game_id::Int)
    subset_json_indexes = [i for i in eachindex(json_data) if json_data[i][:matchId] == game_id]
    subset_json = json_data[subset_json_indexes]
    json_data = nothing  # only clears the local binding; the caller still holds a reference
    GC.gc()
    subset_json = DataFrame(convert(JSON3.Array, subset_json))
    return subset_json
end
# run everything
function all_process(events::String, game_id::Int)
    json_data = createDict(get_events(events))
    data = create_subset_json(json_data, game_id)
    json_data = nothing
    GC.gc()
    return data
end
all_process("events_Italy.json", 2576335)
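For what it’s worth, one variant I tried on toy data: materialising only the matching rows as plain NamedTuples, so that nothing keeps a reference back to the big lazy JSON3 structure, and feeding those to DataFrame directly (which also avoids the Base.convert workaround). The field name x and the data are made up; only matchId is real:

```julia
using JSON3, DataFrames

# Toy stand-in for the real events file.
str = """[{"matchId":1,"x":10},{"matchId":2,"x":20},{"matchId":1,"x":30}]"""

data = JSON3.read(str)  # lazy view backed by the whole string
# Build plain NamedTuple rows for the matching events only, so the
# result no longer references the parsed buffer at all.
rows = [(matchId = row[:matchId], x = row[:x]) for row in data if row[:matchId] == 1]
df = DataFrame(rows)
```

On the real file the memory still isn’t released between runs, so I may be misunderstanding what keeps it alive.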
Best regards,
Thomas