Hello I am new to julia and currently using DiskArrays for very large matrices.
I have a DiskArray of volumes
of dimension 12000 x 12000
and I have a vector of indices
Vector{CartesianIndex{2}}
of length 47852268
. I am trying to find a way to create another DiskArray of specific_volumes
by extracting the volumes at the specific indices, but when I try to do :
volumes.[indices]
I get the error : ERROR: syntax: invalid syntax "volume.[indices]" around REPL[20]:1
.
A simple for loop of the indices is expensive in terms of memory, is there some way of using broadcasting to solve this issue.
Thanks!
You could try broadcasting getindex.(Ref(volume), indices)
. volume[i]
is just syntactic sugar for getindex
calls in the end.
The Ref(volume)
tells Julia not to broadcast over volume
, but you can read more about Ref
in the manual
This likely doesn’t return another DiskArray
, however, as broadcasting will return an Array
Thanks for your response, unfortunately with 16Gb RAM this does not scale well and crashes Julia
In addition the broadcasting would be awfully slow, the whole point of DiskArrays is to avoid repeated calls to getindex. Unfortunately there is not yet a simple function that would implement your use case in DiskArrays.jl. There is a branch that supports fast batched getindex Batch getindex by meggart · Pull Request #59 · meggart/DiskArrays.jl · GitHub but this will not solve your problem right now since the output is larger than RAM and the batched getindex function I implemented there writes into a normal Array.
In case you have control over your input array, have you considered using mmap for both source and target arrays?
Thanks @fabiangans , since my source array is a ArchGDAL dataset when I extract a raster band I end up getting a DiskArray
. I have tried to convert the source array from a DiskArray to a memory mapped array but I keep getting a segmentation fault for a DiskArray of type Float64 and size 12000x12000
Allocations: 30695363 (Pool: 30675019; Big: 20344); GC: 38
Ok, thanks for the clarification. In this case I think it would be best to hard-code a loop that sorts your CartesianIndex list by the chunk they belong to and then you batch-wise copy the data from one array to the other.
I will modify the batch getindex PR a bit to export some helper functions that will be useful, but I am still struggling to think of a good interface that would allow the copy operation you describe. Out of curiosity: do you know about a tool/framework in another programming language that would solve your task for arrays of this size? Just to get some inspirations for possible interfaces