JuliaDB Parallel/Distributed Computing

@jpsamaroo @MaximilianJHuber Thanks to you both for the help, I really appreciate it! I tweaked the code a bit this morning and I’m now unable to reproduce the problem. The only change I made was to add the indexcols argument when calling loadtable:

loadtable(glob("data/*.csv"), output = "bin", chunks = 4, type_detect_rows = 2000, indexcols = [:ST])

After that, I did everything exactly as I did yesterday and the code runs just fine (albeit not much faster than it ran without the additional processes):

using Distributed
addprocs(3)  # three extra worker processes
@everywhere using JuliaDB, JuliaDBMeta, OnlineStats, Statistics
@everywhere using JuliaDBMeta: @filter

# Already run once; the binary chunks now live in the "bin" directory:
# loadtable(glob("data/*.csv"), output = "bin", chunks = 4, type_detect_rows = 2000, indexcols = [:ST])

tbl = load("bin")

# Total person weight (PWGTP) by PUMA for one state and an age range:
function by_age(lower::Int64, upper::Int64, state::Int64)
    @applychunked tbl begin
        @where :ST == state &&
               !ismissing(:AGEP) &&
               upper >= :AGEP >= lower
        @groupby _ :PUMA { total = sum(:PWGTP) }
    end
end
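
A quick way to sanity-check that the extra processes are actually available is plain Distributed (a minimal sketch, nothing JuliaDB-specific):

using Distributed

nworkers()  # should report 3 after addprocs(3)

# Ask each worker for its own id to confirm it responds:
for p in workers()
    @show remotecall_fetch(myid, p)
end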

Calling the function multiple times results in the following:

julia> @time by_age(24, 35, 12)
  2.340554 seconds (3.67 k allocations: 164.938 KiB)
Distributed Table with 151 rows in 1 chunks:
PUMA  total
───────────
101   30554
102   18177
500   46048
901   25682
902   12937
903   17459
904   18965
1101  21653
1102  18368
1103  16801
⋮

Repeated calls return the identical 151-row table, so here are just the timings:

julia> @time by_age(24, 35, 12)
  2.391163 seconds (3.67 k allocations: 164.813 KiB)

julia> @time by_age(24, 35, 12)
  2.178894 seconds (3.67 k allocations: 164.813 KiB)

julia> @time by_age(24, 35, 12)
  2.320857 seconds (3.67 k allocations: 164.813 KiB)
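
Since @time includes compilation on the first call and is noisy between runs, a more stable measurement could come from BenchmarkTools (a minimal sketch, assuming the package is installed):

using BenchmarkTools

# @btime runs the call repeatedly and reports the minimum time:
@btime by_age(24, 35, 12)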

For your reference, the data I am working with is from the U.S. Census Bureau and can be downloaded here: https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pus.zip

The ZIP file contains 4 CSV files (psam_pusa, psam_pusb, psam_pusc, psam_pusd). These are the files I'm loading via the loadtable call in my code above, so you should be able to reproduce my results with them. The CSV files range from 3.1 GB to 4.3 GB each, and after conversion to IndexedTables they balloon to 6.2 GB to 8.5 GB.
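
For anyone reproducing this, fetching and unpacking the data might look like the sketch below (it assumes a system unzip binary is on the PATH and extracts into the data/ directory that my glob pattern expects):

# Sketch: download the ZIP and extract the four CSVs into data/.
url = "https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pus.zip"
download(url, "csv_pus.zip")
run(`unzip csv_pus.zip -d data`)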

My CPU has 8 cores, and I'm working with only 8 GB of RAM.
