@jpsamaroo @MaximilianJHuber Thanks to you both for the help, I really appreciate it! I tweaked the code a bit this morning and I’m now unable to reproduce the problem. The only change I made was to add the indexcols argument when calling loadtable:
using Glob  # glob() comes from Glob.jl
loadtable(glob("data/*.csv"), output = "bin", chunks = 4, type_detect_rows = 2000, indexcols = [:ST])
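In case it saves anyone a full reload: I believe reindex can apply the same sort to an already-loaded table without re-running loadtable, though I haven't verified it on a distributed table. A minimal sketch:
# Hedged sketch: reindex an existing table by :ST instead of reloading.
# reindex is part of the JuliaDB API, but I have not verified its
# behavior on a table that is distributed across worker processes.
tbl = load("bin")
tbl = reindex(tbl, :ST)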
After that, I did everything exactly as I did yesterday and the code is running just fine (albeit not much faster than without the additional processes):
using Distributed
addprocs(3)
@everywhere using JuliaDB, JuliaDBMeta, OnlineStats, Statistics
@everywhere using JuliaDBMeta: @filter
# loadtable(glob("data/*.csv"), output = "bin", chunks = 4, type_detect_rows = 2000, indexcols = [:ST])  # one-time conversion, already run above
tbl = load("bin")
function by_age(lower::Int64, upper::Int64, state::Int64)
    @applychunked tbl begin
        @where :ST == state &&
               !ismissing(:AGEP) &&
               upper >= :AGEP >= lower
        @groupby _ :PUMA { total = sum(:PWGTP) }
    end
end
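As a sanity check, here is the same query written against the plain JuliaDB API instead of the JuliaDBMeta macros. This is just a sketch under the same table layout, and by_age_plain is an illustrative name, not something from my actual code:
# Sketch: the same query via plain JuliaDB filter/groupby calls.
# `by_age_plain` is a hypothetical name for comparing against the macro version.
function by_age_plain(lower::Int64, upper::Int64, state::Int64)
    sub = filter(r -> r.ST == state &&
                      !ismissing(r.AGEP) &&
                      lower <= r.AGEP <= upper, tbl)
    groupby((total = sum,), sub, :PUMA; select = :PWGTP)
end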
Calling the function multiple times results in the following:
julia> @time by_age(24, 35, 12)
2.340554 seconds (3.67 k allocations: 164.938 KiB)
Distributed Table with 151 rows in 1 chunks:
PUMA total
───────────
101 30554
102 18177
500 46048
901 25682
902 12937
903 17459
904 18965
1101 21653
1102 18368
1103 16801
⋮
julia> @time by_age(24, 35, 12)
  2.391163 seconds (3.67 k allocations: 164.813 KiB)

julia> @time by_age(24, 35, 12)
  2.178894 seconds (3.67 k allocations: 164.813 KiB)

julia> @time by_age(24, 35, 12)
  2.320857 seconds (3.67 k allocations: 164.813 KiB)

(Each call returns the same 151-row table shown above, so I've omitted the repeated output.)
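For more stable numbers than repeated @time calls, BenchmarkTools should also work here (assuming BenchmarkTools.jl is installed):
using BenchmarkTools
@btime by_age(24, 35, 12)  # runs the call many times and reports the minimum time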
For your reference, the data I am working with is from the U.S. Census Bureau and can be downloaded here: https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pus.zip
The ZIP file contains four CSV files (psam_pusa, psam_pusb, psam_pusc, psam_pusd); these are the files I'm loading via loadtable in the code above, so you should be able to reproduce my results with them. The CSV files range from 3.1 GB to 4.3 GB each, and after conversion to binary IndexedTables they balloon to 6.2 GB - 8.5 GB.
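If it helps, something like this should fetch and unpack the data. It's only a sketch: it assumes an unzip binary is on your PATH and that you have roughly 15 GB of free disk for the CSVs alone:
url = "https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pus.zip"
download(url, "csv_pus.zip")      # Base.download; the archive is several GB
run(`unzip csv_pus.zip -d data`)  # assumes the unzip CLI is available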
My CPU has 8 cores, and I'm working with only 8 GB of RAM.
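Since the binary tables total far more than my 8 GB of RAM, I suspect the queries are I/O-bound, which would also explain why the extra processes don't help much. Loading with more, smaller chunks might let each worker keep its chunk in memory; the chunk count below is just a guess on my part:
# Guess: more, smaller chunks so each one fits comfortably in 8 GB of RAM.
loadtable(glob("data/*.csv"), output = "bin", chunks = 16,
          type_detect_rows = 2000, indexcols = [:ST])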