I have a use-case of large sparse arrays that can be built in parallel. A minimum working example similar to my code is:
using Distributed
@everywhere using SparseArrays
@everywhere function build_array()
    return sparsevec(Int32.(1:10), Int32(1), Int32(10))
end
function main()
    full_array = @distributed (+) for i in 1:10
        build_array()
    end
    return full_array
end
main()
I’m trying to reduce RAM usage, but I’ve noticed that summing two Int32 sparse vectors seems to automatically promote the index type to Int64. Just looking at the REPL:
julia> one = sparsevec(Int32.(1:10), Int32(1), Int32(10))
10-element SparseVector{Int32,Int32} with 10 stored entries
compared to a sum:
julia> two = sparsevec(Int32.(1:10), Int32(1), Int32(10)) + sparsevec(Int32.(1:10), Int32(1), Int32(10))
10-element SparseVector{Int32,Int64} with 10 stored entries
Here you can see that the index part of the SparseVector is now Int64, and varinfo() shows a correspondingly higher memory use than an identical SparseVector{Int32,Int32} object.
julia> equiv = sparsevec(Int32.(1:10), Int32(2), Int32(10))
10-element SparseVector{Int32,Int32} with 10 stored entries
julia> varinfo(r"one")
name size summary
–––– ––––––––– ––––––––––––––––––––––––––––––––––––
one 184 bytes 10-element SparseVector{Int32,Int32}
julia> varinfo(r"two")
name size summary
–––– ––––––––– ––––––––––––––––––––––––––––––––––––
two 224 bytes 10-element SparseVector{Int32,Int64}
julia> varinfo(r"equiv")
name size summary
––––– ––––––––– ––––––––––––––––––––––––––––––––––––
equiv 184 bytes 10-element SparseVector{Int32,Int32}
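To confirm it is the index vector (not the stored values) that grew, the element types can be inspected directly (a sketch; as far as I can tell, `SparseArrays.nonzeroinds` is the internal accessor for the index vector):

```julia
using SparseArrays

two = sparsevec(Int32.(1:10), Int32(1), Int32(10)) +
      sparsevec(Int32.(1:10), Int32(1), Int32(10))

eltype(two)                            # value type: still Int32
eltype(SparseArrays.nonzeroinds(two))  # index type: widened to Int64
```

So the 40 extra bytes are exactly the 10 indices going from 4 to 8 bytes each.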
I understand that in principle summing many Int32’s can overflow and might warrant an Int64. However, here it is the index type (the second type parameter) that gets promoted, and in my use case I can guarantee that both the values and the vector length stay below 2 billion, well within Int32 range. I’m a bit new to distributed workflows and sparse arrays and not quite sure which part of the code needs to change (can the reduction type for @distributed be pre-specified? Can the SparseVector index type be preserved? It doesn’t seem clear from the documentation).
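One workaround I’ve been considering is replacing the `(+)` reducer with a custom combine function that converts the index type back to Int32 after each pairwise sum (a sketch; I’m assuming `convert(SparseVector{Int32,Int32}, x)` works on sparse vectors, and `narrow_sum` is just a name I made up):

```julia
using Distributed
@everywhere using SparseArrays

@everywhere function build_array()
    return sparsevec(Int32.(1:10), Int32(1), Int32(10))
end

# Hypothetical combiner: sum, then force the index type back to Int32.
@everywhere narrow_sum(a, b) = convert(SparseVector{Int32,Int32}, a + b)

function main()
    # With no extra workers this just runs on the master process, but the
    # reduction path is the same once workers are added via addprocs().
    full_array = @distributed (narrow_sum) for i in 1:10
        build_array()
    end
    return full_array
end

main()
```

But I’m unsure whether this is idiomatic, or whether the intermediate Int64 allocation in each `a + b` defeats the purpose.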
In any case, I would appreciate a way to keep this parallel construction while staying with Int32 types, if that is possible. Thanks!