Speed and type-stability of unsafe_pointer_to_objref

question
performance

#1

Hi,
I wanted to ask about unsafe_pointer_to_objref, or alternatives. Naively, I would expect this to be a NOP for Ptr{T} if T is not bitstype, and otherwise a single load.

Is there a way to get the NOP behaviour? Is there a way to force types, that is, perform a type assertion which is guaranteed to emit no code (either re-interpret by nop or memory-corrupt if my assumptions where wrong)?

Now, why would I want this? Of course to store unsafe object references in immutable bitstypes. This allows, for example, allocation-free unsafe SubArrays.

Say, e.g.:

import Base.size, Base.getindex, Base.setindex!, Base.indices, Base.IndexStyle

immutable evil_aptr{T,N} <: AbstractArray{T,N}
    aptr::Ptr{Array{T,N}}
end

@inline evil_aptr(x::Array{T,N}) where {T,N} = evil_aptr{T,N}(pointer_from_objref(x))

@inline evil_deref(a::evil_aptr{T,N}) where {T,N} = unsafe_pointer_to_objref(a.aptr)::Array{T,N}

@inline size(a::evil_aptr) = size(evil_deref(a))

@inline indices(a::evil_aptr) = indices(evil_deref(a))

@inline IndexStyle(::Type{<:evil_aptr}) = IndexLinear()

@inline getindex(a::evil_aptr, i) = getindex(evil_deref(a), i)

@inline getindex(a::evil_aptr, i...) = getindex(evil_deref(a), i...)

@inline setindex!(a::evil_aptr, v, i) = setindex!(evil_deref(a), v, i)

@inline setindex!(a::evil_aptr, v, i...) = setindex!(evil_deref(a), v, i...)

Now, subarrays of this unsafe array are bitstypes and don’t need heap allocation:

@noinline function testsum(arr::AbstractArray{T,1}) where {T}
    sum_i::T = zero(T)
    for i in arr
        sum_i += i
    end
    return sum_i
end

function test_subarray(a::AbstractArray{T,1}) where {T}
    sum_::T = zero(T)
    for i in 1:length(a)-100
        sum_ += testsum(view(a, i:i+100))
    end
    return sum_
end
X = Vector{Int}(100_000)
Xptr = evil_aptr(X)

test_subarray(X)
test_subarray(Xptr)
gc()

@time val1 = test_subarray(X)
@time val2 = test_subarray(Xptr)
assert(val1==val2)

  0.029448 seconds (99.91 k allocations: 4.573 MiB)
  0.034369 seconds (5 allocations: 176 bytes)

This is nice so far (no allocations), but real code using such a construction is quite slow. If it were possible to get a zero-overhead dereference for pointers to Julia objects, then such constructions could actually work.

PS, for convenience: pointer.jl defines

unsafe_pointer_to_objref(x::Ptr) = ccall(:jl_value_ptr, Any, (Ptr{Void},), x)

Array views becoming dominant source of memory allocation
#2

Yes, it is possible, but this is an invalid use case.


#3

How?

Do you mean this is possible now, or do you mean that you know how to modify the C codebase to make it possible?

The example I posted was just prompted by the question in the other thread about SubArray allocations.

My actual use case is storing structs which contain object references inside an array; I know that I will need to access the fields together, so I want to be nice to the cache. Say, e.g.:

immutable foo
    flags::Int64
    otherstuff::Float64
    neighbors_out::Vector{Int64}
    neighbors_in::Vector{Int64}    
end

x = Vector{foo}(1000)

Now, x actually contains an array of pointers to heap-allocated foo objects (I think because of limitations of the GC?), hence a lot of indirection.

Well, I need to iterate over these very often, so I would be happy to just store a GC-invisible pointer and build a separate array that keeps all my objects alive. However, this only makes sense if I can get a zero/low-overhead unsafe_pointer_to_objref in type-stable code.
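A minimal sketch of that idea (hypothetical names, current Julia syntax): the hot data holds only a raw, GC-invisible pointer, while a separate NODE_ROOTS vector keeps the referenced objects reachable so the GC cannot free them.

```julia
struct NodeRef
    flags::Int64
    neighbors_ptr::Ptr{Cvoid}  # GC-invisible pointer to a Vector{Int64}
end

const NODE_ROOTS = Vector{Int64}[]  # visible to the GC; keeps targets alive

function make_noderef(flags::Integer, neighbors::Vector{Int64})
    push!(NODE_ROOTS, neighbors)             # root first, then take the pointer
    NodeRef(flags, pointer_from_objref(neighbors))
end

# Only safe while the target stays rooted (and is never moved by the runtime).
neighbors(n::NodeRef) = unsafe_pointer_to_objref(n.neighbors_ptr)::Vector{Int64}

nodes = [make_noderef(i, [i, i + 1]) for i in 1:3]
isbitstype(NodeRef)   # true: NodeRef can be stored inline in an array
neighbors(nodes[2])   # [2, 3]
```

Whether the dereference is actually zero-cost is exactly the open question of this thread; the sketch only shows that the struct itself becomes a bitstype.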

Edit: One current possibility is to store all the bitstype fields together in one array, and all the non-bitstype fields in extra arrays, one per field. This should roughly amplify the number of cache misses by a factor of 1 + the number of objref fields. This is what I am currently doing.
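The field-splitting workaround can be sketched like this (hypothetical names, current Julia syntax): the bits fields live inline in one contiguous array, and each reference field gets its own parallel array.

```julia
struct FooBits
    flags::Int64
    otherstuff::Float64
end

struct FooColumns
    bits::Vector{FooBits}                  # stored inline: FooBits is a bitstype
    neighbors_out::Vector{Vector{Int64}}   # one extra indirection per reference field
    neighbors_in::Vector{Vector{Int64}}
end

FooColumns(n::Integer) = FooColumns(
    [FooBits(0, 0.0) for _ in 1:n],
    [Int64[] for _ in 1:n],
    [Int64[] for _ in 1:n],
)

foos = FooColumns(1000)

# Iterating over the bits fields now touches contiguous memory only:
total_flags(f::FooColumns) = sum(b.flags for b in f.bits)
```

Each reference field read costs its own pointer chase into a separate array, which is where the "1 + number of objref fields" cache-miss estimate comes from.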


#4

It is possible now, but you must not do this, so I won’t give the answer. It is the wrong solution to the problem and can crash everything.


#5

…what is the right solution, then? I cannot be the only one wondering about the right data layout for structs containing references.

I see, this is one of the secrets that must be learned, not told. One day, oh wise master, I will read the code for inference and AST manipulation, and then I shall be worthy of the grand and terrible responsibility of unsafe typecasting :slight_smile:


#6

You don’t need a solution. The objects are allocated very close to each other already.

And there’s also work on making it possible to have inlined non-pointerfree objects (and I’ll repeat what I’ve said elsewhere: the GC is NOT the issue there).

Also, before you read the code, I’ll just let you know that it’s not anywhere in Base.


#7

Would you mind telling me what the actual issue is (or linking to the possibly existing active discussion / WIP on this topic)?

Currently, the price for having references in immutable structs is quite high, both in terms of memory use and in terms of access speed. E.g.:

immutable some_non_bitstype
    member1::Float64
    member2::Vector{Int64}
end

immutable some_bitstype
    member1::Float64
    member2::Int64 # this might be some kind of hidden pointer/offset into some other array
end

Nmax = 100_000
x1 = Vector{some_non_bitstype}(Nmax)
for i in 1:length(x1)
    x1[i] = some_non_bitstype(1.0/convert(Float64, i), Vector{Int64}())
end

x2 = Vector{some_bitstype}(Nmax)
for i in 1:length(x2)
    x2[i] = some_bitstype(1.0/convert(Float64, i), i)
end

function testsum(x)
    sum_ = 0.0
    for i in x
        sum_ += i.member1
    end
    return sum_
end

r1=testsum(x1)
r2=testsum(x2)
assert(r1==r2) #no fastmath, hence we expect exact equality
@time testsum(x1)
@time testsum(x2)

  0.002411 seconds (5 allocations: 176 bytes)
  0.000669 seconds (5 allocations: 176 bytes)

(I am aware that this is not the right way of summing the harmonic series; floating-point summation is just a convenient stand-in for fast ops that cannot be reordered or vectorized.)


#8

https://github.com/JuliaLang/julia/pull/18632

AFAICT, there’s only a 2x difference between the two.


#9

Cool, thanks! A pity that this appears to be low priority at the moment.

Then I’ll just hope that a later Julia version will provide a way to explicitly control the memory layout. Unfortunately, I’m not yet ready to contribute there.

(Explicit control meaning, e.g.:

- inline struct vs struct to set a default for a type, and inlined/boxed in member-field definitions to override the default set by the member’s type, with a compile-time error when encountering impossible (circular) layouts;
- isinitialized defined only for pointer types;
- non-initialization optimized down to a single memset-zero over the relevant memory range if pointer-type members exist, and a NOP otherwise.

Regarding this being yet another keyword: sure, but the memory layout of structs and arrays of structs is sufficiently fundamental for performance and C-interop to warrant explicit control with sensible, predictable defaults.)
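For illustration, the proposed controls might read something like this (purely hypothetical syntax sketched from the wishlist above; not valid in any Julia version):

```
inline struct node                   # type-level default: store instances inline
    flags::Int64
    neighbors::boxed Vector{Int64}   # per-field override: keep this as a reference
end

struct edge                          # type-level default: boxed, as today
    a::inlined node                  # per-field override: store the node inline
    b::inlined node
end
```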