Sparse vectors with Int128 indices

I need to create a sparse vector with very large indices (Int128 would suffice). Naively using the default constructor, sparsevec, fails catastrophically:

 julia> using SparseArrays
 julia> c = sparsevec([0xfffffffffffffffffffffffffffffff], [true], 0xfffffffffffffffffffffffffffffff)
ERROR: InexactError: trunc(Int64, 21267647932558653966460912964485513215)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Type{Int64}, ::UInt128) at ./boot.jl:557
 [2] checked_trunc_sint at ./boot.jl:579 [inlined]
 [3] toInt64 at ./boot.jl:633 [inlined]
 [4] Int64 at ./boot.jl:707 [inlined]
 [5] convert at ./number.jl:7 [inlined]
 [6] SparseVector at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:26 [inlined]
 [7] SparseVector at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:30 [inlined]
 [8] _sparsevector!(::Array{UInt128,1}, ::Array{Bool,1}, ::UInt128, ::typeof(|)) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:155
 [9] sparsevec(::Array{UInt128,1}, ::Array{Bool,1}, ::UInt128, ::Function) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:210
 [10] sparsevec(::Array{UInt128,1}, ::Array{Bool,1}, ::UInt128) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/SparseArrays/src/sparsevector.jl:225
 [11] top-level scope at REPL[20]:1

I get the same exact results when I use type conversions:
c = (sparsevec([Int128(0xfffffffffffffffffffffffffffffff)], [true], Int128(0xfffffffffffffffffffffffffffffff)))
or type declarations:
c = sparsevec([0xfffffffffffffffffffffffffffffff]::Array{UInt128,1}, [true]::Array{Bool,1}, 0xfffffffffffffffffffffffffffffff::UInt128).

On the other hand, the following appears to succeed:

julia> a = [Int128(0xffffffffffffffffffff)]
1-element Array{Int128,1}:
 1208925819614629174706175

julia> c = SparseVector{Bool, Int128}(0xfffffffffffffff, a, [true])
1152921504606846975-element SparseVector{Bool,Int128} with 1 stored entry:
  [1208925819614629174706175]  =  1

However, accessing c[0xffffffffffffffffffff] yields an error: ā€œBoundsError: attempt to access 1152921504606846975-element SparseVector{Bool,Int128} at index [1208925819614629174706175]ā€. Having a sparse vector with index 0xffffffffffffffffffff is useless if one canā€™t access the element at that index!

No doubt Iā€™m doing something trivially wrong. Any help greatly appreciated.

2 Likes

The size of a sparse vector is hard-coded to be of type Int:

So you cannot have sparse vectors of size larger than typemax(Int) (which is 9223372036854775807 if Int = Int64).

This works because you are missing some fs in the size parameter to SparseVector(), and it turns out SparseVector() does not check that all indices are <= to the size of the vector.

One might think that SparseVector should be defined as follows.

struct SparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
    n::Ti               # Length of the sparse vector
    nzind::Vector{Ti}   # Indices of stored values
    nzval::Vector{Tv}   # Stored values, typically nonzeros
end

The reason why this isnā€™t done is that the abstract array interface requires length(array)::Int and size(array)::NTuple{N,Int}, hence there is no point in storing the size information as something other than an Int. I assume this could be fixed by rewriting the entire abstract array machinery to work with an automatically inferred Integer type, but this would significantly complicate the code for what is probably a very niche application.

5 Likes

Thanks for your quick and helpful replies.

It turns out I could squeak by with UInt64 indices. Will they work, and, if so how, do I set up the sparse vector object?

I appreciate your patience ā€“ Iā€™m a novice at Julia, and still struggle with the appropriate way to indicate types.

No, it has to be Int. Anything else will not fit in with the AbstractArray framework.

Also, just out of curiosity, what is your application where you need to enumerate more than typemax(Int) ā‰ˆ 1e19 different objects?

2 Likes

Consider a group of N strings comprised of 20 symbols. The strings are evolutionarily related to each other by descent with mutation from a common ancestor, and are typically comprised of several million symbols. The problem is to calculate the number of substrings of length m common to each pairwise combination of strings. A score can be extracted from this data which aids in the reconstruction of a phylogenetic tree which is likely to recapitulate the actual descent of the strings.

I developed a working algorithm using Set{String} types, but it was very slow. I realized that I could convert the substrings to numbers base 20 and store them as indices in a sparse vector with the indexed element storing Boolean ā€˜trueā€™. Bitwise intersection (ā€œ.&ā€) would give me the number of substrings common to each pair of such vectors. When I tested this for small values of m, it worked quite nicely. But, conversion of a 25-mer substring to a base-20 integer greatly exceeds the maximum Int64 value, although itā€™s easily accommodated by Int128.

Nevertheless, Iā€™ve since found a much cleaner solution using my original algorithm by encoding the substrings as before in base 20, and storing them in a Set{Int128} object. This results in a 7-fold speedup, allows for much larger substrings, generates fewer allocations and uses much less memory.

Thanks for your help and interest.

2 Likes

Maybe a Trie could be of help here? Itā€™s pretty efficient for storing overlapping strings, though Iā€™m not sure how well your problem maps to that datastructure or how the algorithm for solving the problem would look like.

1 Like

Hello! Is there still no workaround for this? Iā€™m currently working on a project and this would be extremely helpful to have. Just for context, I have many size L arrays whose entries may take values 0,1,2 or 3. I hash them using a base 4 map so for instance [1,0,3,2] would be mapped to (1 * 4^0 + 0 * 4^1 + 3 * 4^2 + 2 * 4^3) and I store a 1 on that sparse vector index to signify that I have 1 of those arrays in my system. But Iā€™m interested in studying L >31 so this would really help.

If itā€™s just 0 and 1, can you use a Set{BigInt} (basically a dictionary)? How many such Ls do you have?

Thank you for your response! :slight_smile:
Unfortunately, this number could be different from 0 and 1 later on. I have many such Lā€™s thousands at some point. As I see youā€™re a fellow physicist these are essentially Pauli Strings. So [1,0,3,2] is sort of saying the tensor product of X_1 Id_2 Z_3 Y_4. So I have some initial operator and under successive commutations with the hamiltonian, I obtain many many such terms. For instance, for the case where array size is 15, I end up using about 8614 such arrays. Iā€™ve managed to run to way more array size but now Iā€™m struggling with the index size.

This seems related to the issue discussed here:

Why allow custom integers type to store the indices of sparse arrays but have length fixed to Int64?
Could length(AbstractArray) return T<:Integer with type T dependent of concrete type of array? Is there really much code which work with that assumption? Generally indices and length types are derived (eachindex, lengthā€¦)

1 Like

I think you are just looking for a dictionary? That will allow you to index with whichever type you like.

3 Likes