Sparse vectors with Int128 indices

eckofoid · June 22, 2020, 1:56am

I need to create a sparse vector with very large indices (Int128 would suffice). Naively using the default constructor, sparsevec, fails catastrophically:

 julia> using SparseArrays
 julia> c = sparsevec([0xfffffffffffffffffffffffffffffff], [true], 0xfffffffffffffffffffffffffffffff)
ERROR: InexactError: trunc(Int64, 21267647932558653966460912964485513215)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Type{Int64}, ::UInt128) at ./boot.jl:557
 [2] checked_trunc_sint at ./boot.jl:579 [inlined]
 [3] toInt64 at ./boot.jl:633 [inlined]
 [4] Int64 at ./boot.jl:707 [inlined]
 [5] convert at ./number.jl:7 [inlined]
 [6] SparseVector at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:26 [inlined]
 [7] SparseVector at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:30 [inlined]
 [8] _sparsevector!(::Array{UInt128,1}, ::Array{Bool,1}, ::UInt128, ::typeof(|)) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:155
 [9] sparsevec(::Array{UInt128,1}, ::Array{Bool,1}, ::UInt128, ::Function) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:210
 [10] sparsevec(::Array{UInt128,1}, ::Array{Bool,1}, ::UInt128) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:225
 [11] top-level scope at REPL[20]:1

I get the same exact results when I use type conversions:
c = (sparsevec([Int128(0xfffffffffffffffffffffffffffffff)], [true], Int128(0xfffffffffffffffffffffffffffffff)))
or type declarations:
c = sparsevec([0xfffffffffffffffffffffffffffffff]::Array{UInt128,1}, [true]::Array{Bool,1}, 0xfffffffffffffffffffffffffffffff::UInt128).

On the other hand, the following appears to succeed:

julia> a = [Int128(0xffffffffffffffffffff)]
1-element Array{Int128,1}:
 1208925819614629174706175

julia> c = SparseVector{Bool, Int128}(0xfffffffffffffff, a, [true])
1152921504606846975-element SparseVector{Bool,Int128} with 1 stored entry:
  [1208925819614629174706175]  =  1

However, accessing c[0xffffffffffffffffffff] yields an error: “BoundsError: attempt to access 1152921504606846975-element SparseVector{Bool,Int128} at index [1208925819614629174706175]”. Having a sparse vector with index 0xffffffffffffffffffff is useless if one can’t access the element at that index!

No doubt I’m doing something trivially wrong. Any help greatly appreciated.

ettersi · June 22, 2020, 2:40am

The size of a sparse vector is hard-coded to be of type Int:

github.com

JuliaLang/julia/blob/f6d34c3c32c535422491bfe980b1ebdb6bc123db/stdlib/SparseArrays/src/sparsevector.jl#L18


      
          ### The SparseVector
          
          ### Types
          
          """
              SparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
          
          Vector type for storing sparse vectors.
          """
          struct SparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
              n::Int              # Length of the sparse vector
              nzind::Vector{Ti}   # Indices of stored values
              nzval::Vector{Tv}   # Stored values, typically nonzeros
          
              function SparseVector{Tv,Ti}(n::Integer, nzind::Vector{Ti}, nzval::Vector{Tv}) where {Tv,Ti<:Integer}
                  n >= 0 || throw(ArgumentError("The number of elements must be non-negative."))
                  length(nzind) == length(nzval) ||
                      throw(ArgumentError("index and value vectors must be the same length"))
                  new(convert(Int, n), nzind, nzval)
              end
          end

So you cannot have sparse vectors of size larger than typemax(Int) (which is 9223372036854775807 if Int = Int64).

eckofoid:

julia> a = [Int128(0xffffffffffffffffffff)]
1-element Array{Int128,1}:
 1208925819614629174706175

julia> c = SparseVector{Bool, Int128}(0xfffffffffffffff, a, [true])
1152921504606846975-element SparseVector{Bool,Int128} with 1 stored entry:
  [1208925819614629174706175]  =  1

This works because you are missing some fs in the size parameter to SparseVector(), and it turns out SparseVector() does not check that all indices are <= to the size of the vector.

ettersi · June 22, 2020, 2:50am

One might think that SparseVector should be defined as follows.

struct SparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
    n::Ti               # Length of the sparse vector
    nzind::Vector{Ti}   # Indices of stored values
    nzval::Vector{Tv}   # Stored values, typically nonzeros
end

The reason why this isn’t done is that the abstract array interface requires length(array)::Int and size(array)::NTuple{N,Int}, hence there is no point in storing the size information as something other than an Int. I assume this could be fixed by rewriting the entire abstract array machinery to work with an automatically inferred Integer type, but this would significantly complicate the code for what is probably a very niche application.

eckofoid · June 22, 2020, 5:17pm

Thanks for your quick and helpful replies.

It turns out I could squeak by with UInt64 indices. Will they work, and, if so how, do I set up the sparse vector object?

I appreciate your patience – I’m a novice at Julia, and still struggle with the appropriate way to indicate types.

ettersi · June 23, 2020, 1:17am

No, it has to be Int. Anything else will not fit in with the AbstractArray framework.

Also, just out of curiosity, what is your application where you need to enumerate more than typemax(Int) ≈ 1e19 different objects?

eckofoid · June 25, 2020, 5:36am

Consider a group of N strings comprised of 20 symbols. The strings are evolutionarily related to each other by descent with mutation from a common ancestor, and are typically comprised of several million symbols. The problem is to calculate the number of substrings of length m common to each pairwise combination of strings. A score can be extracted from this data which aids in the reconstruction of a phylogenetic tree which is likely to recapitulate the actual descent of the strings.

I developed a working algorithm using Set{String} types, but it was very slow. I realized that I could convert the substrings to numbers base 20 and store them as indices in a sparse vector with the indexed element storing Boolean ‘true’. Bitwise intersection (“.&”) would give me the number of substrings common to each pair of such vectors. When I tested this for small values of m, it worked quite nicely. But, conversion of a 25-mer substring to a base-20 integer greatly exceeds the maximum Int64 value, although it’s easily accommodated by Int128.

Nevertheless, I’ve since found a much cleaner solution using my original algorithm by encoding the substrings as before in base 20, and storing them in a Set{Int128} object. This results in a 7-fold speedup, allows for much larger substrings, generates fewer allocations and uses much less memory.

Thanks for your help and interest.

Sukera · June 25, 2020, 7:59am

Maybe a Trie could be of help here? It’s pretty efficient for storing overlapping strings, though I’m not sure how well your problem maps to that datastructure or how the algorithm for solving the problem would look like.

FabianBallar · March 11, 2021, 3:48am

Hello! Is there still no workaround for this? I’m currently working on a project and this would be extremely helpful to have. Just for context, I have many size L arrays whose entries may take values 0,1,2 or 3. I hash them using a base 4 map so for instance [1,0,3,2] would be mapped to (1 * 4^0 + 0 * 4^1 + 3 * 4^2 + 2 * 4^3) and I store a 1 on that sparse vector index to signify that I have 1 of those arrays in my system. But I’m interested in studying L >31 so this would really help.

jling · March 11, 2021, 3:57am

If it’s just 0 and 1, can you use a Set{BigInt} (basically a dictionary)? How many such Ls do you have?

FabianBallar · March 11, 2021, 4:19am

Thank you for your response!
Unfortunately, this number could be different from 0 and 1 later on. I have many such L’s thousands at some point. As I see you’re a fellow physicist these are essentially Pauli Strings. So [1,0,3,2] is sort of saying the tensor product of X_1 Id_2 Z_3 Y_4. So I have some initial operator and under successive commutations with the hamiltonian, I obtain many many such terms. For instance, for the case where array size is 15, I end up using about 8614 such arrays. I’ve managed to run to way more array size but now I’m struggling with the index size.

Bernard_GODARD · March 11, 2021, 9:12pm

This seems related to the issue discussed here:

Why allow custom integers type to store the indices of sparse arrays but have length fixed to Int64?
Could length(AbstractArray) return T<:Integer with type T dependent of concrete type of array? Is there really much code which work with that assumption? Generally indices and length types are derived (eachindex, length…)

dpsanders · March 11, 2021, 10:07pm

I think you are just looking for a dictionary? That will allow you to index with whichever type you like.

Topic		Replies	Views
How to efficiently construct a large SparseArray? Packages for this? Performance package , performance , parallel , sparse	20	1713	May 15, 2022
SparseArray types over distributed reduction Performance parallel , sparse	2	409	January 14, 2021
Index type of sparse matrix General Usage type , linearalgebra , sparse	7	999	February 8, 2021
SparseArray cannot store arbitrary type, but only types which support zero General Usage question	6	312	July 8, 2021
Question regarding sparse vector New to Julia	10	417	December 7, 2021

Sparse vectors with Int128 indices

Related topics