If you need to store 0 and 1, you only need one bit per matrix entry. Isn’t that right?
That’s true! Also, I should be able to use mul!(...)
to perform operations on bit matrices while ensuring that the output is also a BitMatrix.
However, would that be more efficient than using SparseArray?
Sparse representations need to store the indexes of the nonzero positions, so they probably will gain no advantage from making it only zeros or ones.
I would recommend profiling each alternative, it depends on how sparse is the matrix. If your matrix has a nonzero ratio higher than 1/32 then the BitMatrix
may be a good idea.
If you use SparseArrays
properly, then the storage will be proportional to the number of nonzero entries. How many nonzeros entries do you have?
(Note that you would never call zeros
, which allocates a dense matrix and stores all of the zeros explicitly.)
If you have a 500000 \times 500000 matrix with that little sparsity, then even a BitMatrix
requires over 30GB of memory, and the original post said that a “handful” of such matrices are required, which will quickly become impractical.
The basic point here is that if you have such huge matrices, you really need to think more carefully about what you are doing; you almost never want to store or compute with the whole matrix explicitly. You almost always want to exploit sparsity or some other structure.
Oh, sorry, I though the original matrix was 80Gib, so a 32~64x reduction could bring it to something workable, but now I see that it was the 100000x100000 matrix that was that size.
I don’t think this is true? Calling zeros(UInt8, 1000_000_000)
seems to first invoke the undef
constructor to get v = Vector{UInt8}(undef, (1000_000_000,))
which doesn’t have to allocate physical memory when it’s done via mmap
. But this is followed by fill!(v, zero(UInt8))
which would touch every byte in the array, causing many OS page faults which in turn allocate physical memory pages.
For another interesting thread on page faults, see here: Julia slower in Windows - #16 by c42f
(Edit: As a side note - it seems technically possible to do lazy zero fill in (a) the case that a large array happens to be allocated with mmap
and MAP_ANONYMOUS
somewhere deep in the allocator and (b) zero(eltype(a))
happens to be a bits type with a bit pattern of zeros. I think this would only help in specialized circumstances, though.)
About 10,000 ones and the remaining elements are zeros. It seems a clear case for SparseArray.
I will leave the post open for a few days, so should you have more suggestions please add them below!
For a detailed discussion on the challenges that would crop up almost immediately when trying this, check out Faster zeros with calloc. The TL;DR is that the cost of initialization doesn’t vanish but gets shifted to the first write instead and that growing such an array immediately looses the “zero initialized” property.
Definitely. (A sparse array should take < 1MB of memory.)
One final tip is that you should always construct a sparse array all at once if possible, e.g. using the sparse(I, J, V)
constructor. Adding or removing individual elements one at a time is costly.
This is a very helpful statement for me! Perhaps others knew this already. What about integers or strings? Where can I find that information?
I mean, a matrix is n
times n
, if double precision is used then each element is 8 bytes, so 8n^2
.
Int32
is 4 bytes, Int64
is 8 bytes. Almost every “normal and modern” computer today is 64bits, so Julia will be using Int64
(8 bytes).
String
is a little more complicated, but if I am not wrong in general they use UTF-8, what means that for ASCII (i.e., most characters normally used in english) it is one byte per character (plus a static cost of a zero byte at the end of the string and maybe an Int
with the total size too?). But if characters from other languages, emojis, etc… are used, then the that specific character is represented by 2-4 bytes in the sequence.
I believe this information is somewhere in the docs? I have to admit that this kinda of thing was inculcated into me by my graduation in computer science, so I do not have a source at the moment.