Bug in `sizeof` for nested arrays

Just checking before I raise an issue in the Julia repo: this is a bug, right?

sizeof([[rand(Float64) for j in 1:10] for i in 1:10])
# 80

sizeof([[rand(Float32) for j in 1:10] for i in 1:10])
# 80

sizeof(rand(Float64, (10, 10)))
# 800

sizeof(rand(Float32, (10, 10)))
# 400
$ julia --version
julia version 1.4.1

No, it contains 10 pointers (to the inner vectors) of 8 bytes each = 80 bytes.
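
A quick way to check this (assuming a 64-bit machine, where a pointer is 8 bytes):

```julia
# sizeof on a Vector{Vector{T}} only counts the outer vector's pointer
# slots, so the inner element type never changes the result.
v64 = [[rand(Float64) for j in 1:10] for i in 1:10]
v32 = [[rand(Float32) for j in 1:10] for i in 1:10]

@assert sizeof(v64) == 10 * sizeof(Ptr{Cvoid})  # 10 pointers * 8 bytes = 80
@assert sizeof(v32) == sizeof(v64)              # element type is irrelevant
```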


That’s very interesting and certainly unexpected, but when I think about it, it makes sense. Thank you.

What you might be looking for is Base.summarysize.

julia> sizeof([[rand(Float32) for j in 1:10] for i in 1:10])
80

julia> Base.summarysize([[rand(Float32) for j in 1:10] for i in 1:10])
920

Now I’m confused.

help?> Base.summarysize
  Base.summarysize(obj; exclude=Union{...}, chargeall=Union{...}) -> Int

  Compute the amount of memory, in bytes, used by all unique objects reachable
  from the argument.

Not sure what is stored internally. I think it might be calculating the size of the data (400 bytes) + the 80 bytes of pointers to the inner vectors + maybe some other housekeeping item? I don’t know.

According to Base.summarysize, using nested arrays more than doubles the size of the object, and even the flat 10 × 10 Float32 array carries 40 bytes of “gubbins”?

sizeof(rand(Float32, (10, 10)))
# 400

Base.summarysize(rand(Float32, (10, 10)))
# 440

Base.summarysize([[rand(Float32) for j in 1:10] for i in 1:10])
# 920

Is that right? If so it’s quite serious.

Why? It’s just 5 Ints, for example a pointer to the data, the number of dimensions, the length of each dimension…
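
If that per-array header really is 40 bytes (an observed implementation detail of the Julia 1.4 build shown here, not a documented guarantee), both summarysize results above can be reproduced by hand:

```julia
# Back-of-envelope accounting, assuming a 40-byte header per Array
# on 64-bit Julia 1.4. Other versions may use a different header size.
header = 40
flat   = header + 100 * sizeof(Float32)    # 40 + 400 = 440
inner  = header + 10 * sizeof(Float32)     # 40 +  40 =  80 per inner vector
outer  = header + 10 * sizeof(Ptr{Cvoid})  # 40 +  80 = 120 for the outer vector

@assert flat == 440                # matches summarysize(rand(Float32, (10, 10)))
@assert 10 * inner + outer == 920  # matches summarysize of the nested array
```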


True, and as the array size grows, the 40 bytes become insignificant. I’ll just be aware that creating nested arrays is much less memory efficient; it’s not something I previously gave much thought to.

The bigger problem with nested arrays when it comes to performance is how the data is laid out in memory. With a Matrix all the data is stored contiguously, while for nested arrays you have pointers to columns (rows) that can be at completely different places in memory, and accessing these can be quite penalising (see https://biojulia.net/post/hardware/#cachemisses).
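
One way to see the layout difference directly (the pointer arithmetic assumes a 64-bit machine; the inner-vector addresses are whatever the allocator happened to return):

```julia
# A Matrix is one contiguous block: column j starts exactly one column's
# worth of bytes after column j-1. Each inner vector of a nested array
# is a separate allocation with no guaranteed address relationship.
m = rand(Float32, 10, 10)
n = [rand(Float32, 10) for _ in 1:10]

# Element 11 (linear index) is the start of column 2, 40 bytes in.
@assert pointer(m, 11) - pointer(m, 1) == 10 * sizeof(Float32)

# The inner vectors' buffers can be anywhere on the heap.
println(pointer(n[1]), " vs ", pointer(n[2]))
```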


I’m not sure if this will help or hurt…

import Base.summarysize

mutable struct Bar
    a::Int64
    b::Int64
    c::Int64
    Bar() = new(0, 0, 0)
end
bar = Bar()
print("sizeof(bar)       = $(sizeof(bar))\n")
print("summarysize(bar)  = $(summarysize(bar))\n")

struct Foo
    a::Int64
    b::Union{Bar, Nothing}
    Foo() = new(0, nothing)
    Foo(a, b) = new(a, b)
end

foo = Foo()
print("sizeof(foo)       = $(sizeof(foo))\n")
print("summarysize(foo)  = $(summarysize(foo))\n")

foo2 = Foo(0, Bar())
print("sizeof(foo2)      = $(sizeof(foo2))\n")
print("summarysize(foo2) = $(summarysize(foo2))\n")

Produces the following output:

sizeof(bar)       = 24
summarysize(bar)  = 24
sizeof(foo)       = 16
summarysize(foo)  = 16
sizeof(foo2)      = 16
summarysize(foo2) = 40

Bar is three 8-byte values, 24 bytes in total; both sizeof and summarysize report the same amount.

Foo can have a child Bar or it can hold nothing; foo is one without a Bar, and foo2 is one with a Bar. You will notice that sizeof(foo) == sizeof(foo2): the structure needs 16 bytes of allocated space whether the Bar is there or not. When Foo HAS a Bar, summarysize reports 40 bytes, which is 16 bytes for Foo plus 24 bytes for Bar. summarysize is recursive, looking at any child objects and adding their bytes to the total.

Does that make a little more sense? The sizeof of an Array of Arrays is the amount of memory needed to reference all the child arrays; summarysize also includes the sizes of the referenced arrays themselves.


I’m hoping that using nested arrays will be better for parallelization. Aren’t nested arrays better for access by multiple processes? Can multiple threads simultaneously read data from a contiguous array?

Sure.

If the threads write back to the array it is good to keep https://en.wikipedia.org/wiki/False_sharing in mind though.

All I can see is that false sharing means attempts to write to memory that is being read will cause problems, but what does it mean for concurrent reads? As far as I know, when data is read from an array, the pointer has to be moved to that point; if multiple threads are attempting to read from the same chunk of memory, the pointer won’t be shared… doesn’t each thread have to wait until the one holding the pointer has finished? I’m only trying to read data, not write to the array. That’s why I’m interested in nested arrays.

The array is a blob of memory and that memory is shared between the threads (i.e. all the threads are able to access the same memory). I’m not sure what you mean by “the pointer”. Each thread can be running code that reads and/or writes whatever memory it has access to. It’s up to you to make sure that they’re not reading and writing the exact same memory addresses concurrently, but if the idea is to do some processing in parallel then generally you’ll have each thread working with a different part of the array (a different range of addresses).

False sharing is when multiple threads write to memory locations that are near each other, and due to how CPU caches work the cores need to “refresh” their view of that data, which takes some time. But if you are not writing, there is no risk of false sharing.
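
A hypothetical sketch of the write pattern this warns about (both functions and the `pad` constant are made up for illustration; indexing the accumulator by threadid assumes the static iteration schedule that @threads uses here, and the 64-byte cache-line size is a common but not universal assumption):

```julia
using Base.Threads: @threads, nthreads, threadid

# Reads of `data` are harmless, but each thread writing to its own slot
# of a small shared vector puts the slots on the same cache line, so the
# cores invalidate each other's caches on every write: false sharing.
function sum_per_thread(data)
    acc = zeros(nthreads())            # adjacent 8-byte slots
    @threads for i in eachindex(data)
        acc[threadid()] += data[i]
    end
    return sum(acc)
end

# Padding each slot out to a full 64-byte cache line removes the contention.
function sum_per_thread_padded(data)
    pad = 8                            # 8 Float64s = 64 bytes
    acc = zeros(pad * nthreads())
    @threads for i in eachindex(data)
        acc[pad * (threadid() - 1) + 1] += data[i]
    end
    return sum(acc)
end
```

Both return the same result; only the cache behaviour under many threads differs.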


I’ve read that arrays are a pointer + a memory block that the pointer can traverse and read/write at each point in that block (I’ve also written some code in languages like D and C++ that confirms this, though I’m no expert at pointer programming). I’ve always assumed that when you access an array you are moving that pointer to a specific point in the array and reading or writing. Are you saying that if I had a 12-core CPU (for the sake of argument), all the cores could simultaneously read from the same memory block in the array at the same time?

I guess I’m also asking whether multiple reads slow down the process or not? It’s clear that there’s nothing “bad” about it, but does performance suffer from it?

I guess read-only access doesn’t incur any cost on the same array; in fact it looks as if there is a performance penalty for nested arrays, as stated before:

using BenchmarkTools: @btime
function fun(x, y)
  T = typeof(x)
  if (x == y) && (x == 0)
    return zero(T)  # return a T, not an Int, to keep the function type stable
  end
  return T(-20*exp(-0.2*sqrt(0.5*(x^2 + y^2))) - exp(0.5*(cos(pi*2*x))) + exp(1) + 20)
end


import Base.Threads: @threads
function pFun1(z)
  n, p = size(z)
  @threads for j in 1:p
    for i in 1:(n - 1)
      fun(z[i, j], z[i + 1, j])
    end
  end
end

function pFun2(z)
  p = length(z); n = length(z[1])
  @threads for i in 1:p
    for j in 1:(n - 1)
      fun(z[i][j], z[i][j + 1])
    end
  end
end

n = 1_000_000; p = 48
mData = rand(Float32, (n, p));
pData = [rand(Float32, n) for i in 1:p];

# Warmup
pFun1(mData);
pFun2(pData);

mData = rand(Float32, (n, p));
pData = [rand(Float32, n) for i in 1:p];

@btime pFun1(mData)
@btime pFun2(pData)
#julia> @btime pFun1(mData)
#  232.162 ms (64 allocations: 8.63 KiB)
#julia> @btime pFun2(pData)
#  251.224 ms (64 allocations: 8.63 KiB)