JuliaDB Parallel/Distributed Computing

@jpsamaroo @MaximilianJHuber Thanks to you both for the help, I really appreciate it! I tweaked the code a bit this morning and I’m now unable to reproduce the problem. The only change I made was to add the indexcols argument when calling loadtable:

loadtable(glob("data/*.csv"), output = "bin", chunks = 4, type_detect_rows = 2000, indexcols = [:ST])

After that, I did everything exactly as I did yesterday and the code runs just fine (albeit not much faster than it ran without the additional processes):

using Distributed
addprocs(3)  # three extra worker processes
@everywhere using JuliaDB, JuliaDBMeta, OnlineStats, Statistics
@everywhere using JuliaDBMeta: @filter

# Already run once; the binary chunks now live in the "bin" directory:
# loadtable(glob("data/*.csv"), output = "bin", chunks = 4, type_detect_rows = 2000, indexcols = [:ST])

tbl = load("bin")

# Total person weight (PWGTP) by PUMA for one state and an age range:
function by_age(lower::Int64, upper::Int64, state::Int64)
    @applychunked tbl begin
        @where :ST == state &&
               !ismissing(:AGEP) &&
               upper >= :AGEP >= lower
        @groupby _ :PUMA { total = sum(:PWGTP) }
    end
end
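
A quick way to sanity-check that the extra processes are actually available is plain Distributed (a minimal sketch, nothing JuliaDB-specific):

using Distributed

nworkers()  # should report 3 after addprocs(3)

# Ask each worker for its own id to confirm it responds:
for p in workers()
    @show remotecall_fetch(myid, p)
end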

Calling the function multiple times results in the following:

julia> @time by_age(24, 35, 12)
  2.340554 seconds (3.67 k allocations: 164.938 KiB)
Distributed Table with 151 rows in 1 chunks:
PUMA  total
───────────
101   30554
102   18177
500   46048
901   25682
902   12937
903   17459
904   18965
1101  21653
1102  18368
1103  16801
⋮

Repeated calls return the identical 151-row table, so here are just the timings:

julia> @time by_age(24, 35, 12)
  2.391163 seconds (3.67 k allocations: 164.813 KiB)

julia> @time by_age(24, 35, 12)
  2.178894 seconds (3.67 k allocations: 164.813 KiB)

julia> @time by_age(24, 35, 12)
  2.320857 seconds (3.67 k allocations: 164.813 KiB)
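
Since @time includes compilation on the first call and is noisy between runs, a more stable measurement could come from BenchmarkTools (a minimal sketch, assuming the package is installed):

using BenchmarkTools

# @btime runs the call repeatedly and reports the minimum time:
@btime by_age(24, 35, 12)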

For your reference, the data I am working with is from the U.S. Census Bureau and can be downloaded here: https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pus.zip

The ZIP file contains 4 CSV files (psam_pusa, psam_pusb, psam_pusc, psam_pusd). These are the files I'm loading via the loadtable call in my code above, so you should be able to reproduce my results with them. The CSV files range from 3.1 GB to 4.3 GB each, and after conversion to IndexedTables they balloon to 6.2 GB to 8.5 GB.
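
For anyone reproducing this, fetching and unpacking the data might look like the sketch below (it assumes a system unzip binary is on the PATH and extracts into the data/ directory that my glob pattern expects):

# Sketch: download the ZIP and extract the four CSVs into data/.
url = "https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pus.zip"
download(url, "csv_pus.zip")
run(`unzip csv_pus.zip -d data`)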

My CPU has 8 cores, and I'm working with only 8 GB of RAM.
