Immutables with reference-fields: Why boxed?

foobar_lv2 · December 12, 2017, 12:44pm

One of the most painful things in julia is that there is no zero-cost abstraction for bundling gc-visible references with anything. This manifests in allocating views, allocating Tuple{Vector{Int}, Int}, etc. More than just making some code slow, this severely constrains APIs and design of data layouts.

A priori there is no reason at all that immutable types cannot be always inline (self-referential / circular immutables make no sense and don’t even work in C); TBH, I see no point in immutability for boxed types (or are there any optimizations that become possible due to immutability of a boxed type? And don’t say unboxing, that’s my point!)

So I wanted to ask why this is the case, and whether this pain will go away. There must have been some reason that https://github.com/JuliaLang/julia/pull/18632 was not merged.

And I view this as a breaking change, not an optimization, in the sense that it will completely change best-practice for julia code (julia 0.6: don’t use views, ever; julia 0.7: after profiling and @inline, views might be usable, but YMMV).

Or, still current: Do not write many small inlined functions, write few big ones because you cannot logically bundle arguments and 10-argument functions defy the point. If you have an array that is supposed to hold an immutable with ref-fields, consider splitting your type; each ref-field goes into its own array, and the bitstypes go together into one; except if you do fast random-access reads and are bandwidth not latency constrained, and have many ref-fields, then you should merge them, and eat the additional indirection latency, because otherwise you only use 8 bytes out of every cache-line…argh!)

vchuravy · December 12, 2017, 1:10pm

https://github.com/JuliaLang/julia/pull/186324 was closed because all of the changes that it proposed are part of Julia 0.7 as far as I am aware and the code got quite stale.

One of the big 0.7 changes has been that structs that contain GC references and that don’t escape the function are not allocating. That made Ref and view non-allocating in most cases, but it is not limited to specify types. If you encounter cases on 0.7 where allocation happens and you think it shouldn’t a report would be welcomed.

foobar_lv2 · December 12, 2017, 1:17pm

function ffill(n,X)
       A = Vector{Tuple{Int, Vector{Int}}}(n)
       for i=1:n A[i] = (i,X) end
       A
end

@time ffill(1000, collect(1:2))
0.000049 seconds (1.01 k allocations: 39.500 KiB)

Is this supposed to have changed? As in, have I somehow checked out the wrong branch?

Also, in this case the bad thing is not even the allocation but the extra indirection (hence cache miss) on access. Storing refs to arrays (instead of data-pointers) already gives me an indirection; the tuple-boxing gives me another one.

…I’d really like an immutable array, length+data-pointer, that keeps the data-pointed region alive. Sure, you can’t resize, but you lose one cache-missed indirection.

Currently I need a second data-structure just for gc-visibility and need to be very careful that both never go out of sync

Edit: Still allocates and stores boxed on feec0ecf1c0189f2b47aa05680ee16257692d19e from yesterday. But as a bonus,

A= ffill(1000, collect(1:2));
show(A)

throws:

Tuple{Int64,Array{Int64,1}}[(1, ERROR: AssertionError: Array{Int64,1} is not a subtype of Tuple{Int64,Array{Int64,1}}
Stacktrace:
 [1] typeinfo_prefix(::IOContext{Base.TTY}, ::Array{Int64,1}) at ./arrayshow.jl:475
 [2] show_vector(::IOContext{Base.TTY}, ::Array{Int64,1}, ::Char, ::Char) at ./arrayshow.jl:428 (repeats 2 times)
 [3] show(::IOContext{Base.TTY}, ::Array{Int64,1}) at ./arrayshow.jl:442
 [4] show_delim_array(::IOContext{Base.TTY}, ::Tuple{Int64,Array{Int64,1}}, ::Char, ::Char, ::Char, ::Bool, ::Int64, ::Int64) at ./show.jl:501
 [5] show_delim_array at ./show.jl:489 [inlined]
 [6] show at ./show.jl:517 [inlined]
 [7] show_delim_array(::IOContext{Base.TTY}, ::Array{Tuple{Int64,Array{Int64,1}},1}, ::Char, ::String, ::Char, ::Bool, ::Int64, ::Int64) at ./show.jl:472
 [8] show_delim_array at ./show.jl:458 [inlined]
 [9] show_vector(::Base.TTY, ::Array{Tuple{Int64,Array{Int64,1}},1}, ::Char, ::Char) at ./arrayshow.jl:438
 [10] show_vector at ./arrayshow.jl:428 [inlined]
 [11] show(::Base.TTY, ::Array{Tuple{Int64,Array{Int64,1}},1}) at ./arrayshow.jl:442
 [12] show(::Array{Tuple{Int64,Array{Int64,1}},1}) at ./show.jl:130
 [13] top-level scope

vchuravy · December 12, 2017, 2:16pm

No, I don’t think that has improved.

julia> ccall(:jl_array_store_unboxed, Cint, (Any,), Tuple{Int})
1

julia> ccall(:jl_array_store_unboxed, Cint, (Any,), Tuple{Int, Vector{Int}})
0

kristoffer.carlsson · December 12, 2017, 2:17pm

Probably fixed by fix :typeinfo with views (closes #25038) by rfourquet · Pull Request #25040 · JuliaLang/julia · GitHub

jamii · December 12, 2017, 5:02pm

structs … that don’t escape the function are not allocating

Do not write many small inlined functions, write few big ones because you cannot logically bundle arguments and 10-argument functions defy the point.

This is a huge pain point for me (using Julia as a compile target for statistical relational ai). I spend an ungodly amount of time trying to come up with composable apis that don’t cause allocation across function boundaries and it’s likely to be one of the main barriers to adoption of Julia within the company.

it’s an unusual use case so I don’t expect it to be high priority for the core devs, but I would greatly appreciate some insight into the likely future. Is this still on the roadmap now that escape analysis handles the most common cases? Were there fundamental obstacles to implementing the original pull request or was it just lack of time? If, hypothetically, we just funneled money/engineers at it, would we be able to fix it or would there be other obstacles?

foobar_lv2 · December 12, 2017, 5:15pm

Probably fixed by fix :typeinfo with views (closes #25038) by rfourquet · Pull Request #25040 · JuliaLang/julia · GitHub

That’s fast, wow.

Another very annoying fact is that all the new allocation optimizations appear to work only when inlined. That is (sorry for the silly example):

@inline foo_in(n) = (n, Vector{Int}(n))
@noinline foo_ni(n) = (n, Vector{Int}(n))

function ft_in(n)
       s= 0
       for i= 1:n
        jj,v = foo_in(i)
       s+=sum(v)
       end
       s
end

function ft_ni(n)
       s= 0
       for i= 1:n
        jj,v = foo_ni(i)
       s+=sum(v)
       end
       s
end

@time ft_in(1000)
  0.001948 seconds (1.00 k allocations: 3.962 MiB)

@time ft_ni(1000)
  0.002083 seconds (2.00 k allocations: 3.992 MiB)

This means that multiple return values, some of which are not bitstype, are only performant when inlined.

This means that one needs to decide on API design whether to @inline, or make the API look like old-style C by demanding horrible Ref{Int} inputs for writing the output (where the caller is hopefully re-using the Ref{Int}). And also hope that the caller is re-using the Ref{Int} often enough to not cause additional cache-misses (because it can’t point to the stack, as in C).

I mean, we have a really nice ABI for returning bitstype-structs, please use it for immutables containing ref-fields as well!

foobar_lv2 · December 12, 2017, 5:46pm

I spend an ungodly amount of time trying to come up with composable apis that don’t cause allocation across function boundaries

Care to share some tips?

My code tends to converge to just having two mutable structs, state_shared and state_threadlocal (containing re-usable temporary variables and space for return values) that I just pass around (then the size of state_treadlocal does not really matter, since I only need one instance per thread*recursion_level). Chasing through these pointers is always in-cache and tends to be pulled out of loops during optimization, so I can organize my state-types logically.

This looks like crap, but kinda works. Have you discovered better ways?

vchuravy · December 12, 2017, 7:45pm

If non-hypothetically money or engineers would be funnelled towards it, it would certainly be fixable. From my understanding it is something that can be done as 1.X (X > 0).

StefanKarpinski · December 12, 2017, 8:06pm

Yes, this can be fixed with time and effort. A lot of the necessary infrastructure is in place and we’re already avoiding allocation of objects that don’t escape on 0.7 to some extent. Like most optimizations, this will be a process of getting reports of cases that don’t get optimized and working to make sure they do get optimized until we reach a point where we’re getting all the common cases that we should.

foobar_lv2 · December 13, 2017, 11:18am

Yes, but Carnaval’s PR was not about “common cases”, it was a fundamental solution.

So, to be more blunt: Is “immutables are always unboxed; no exceptions, no escape analysis, no heuristics, no optimization” something you want? Is the price (e.g. no circular immutable types, ever; maybe problems with Union might come up?) something you are willing to pay?

When reading the discussion on that thread, it was not clear whether immutables are currently boxed due to lack of devs / priority or due to fundamental design decisions.

Unless this is clear, there won’t be anyone willing to help make it happen.

andyferris · December 13, 2017, 12:13pm

Yes, I’m pretty sure everyone on the development team wants to have the bit layout of immutable structs to be flattened/inlined, even for cases where the GC has to track some of the struct field members. AFAIK there’s no fundamental design blocker - as Stefan clearly stated this can be fixed with time and effort.

I would also be very keen to have this feature. However, please be patient

(While it’s true that a fundamental fix to layout is required, the more recent ref optimization stuff goes beyond this in many ways, and having both the layout fix and the optimizations will be wonderful).

foobar_lv2 · December 13, 2017, 1:46pm

Thanks, and sorry if I sounded impatient or ungrateful! So, that is good to hear.

So time spent on looking over Carnaval’s PR (from 2016) and trying to make it compatible with the current state of julia would be likely non-wasted? But unless someone puts in a herculean effort this is more likely a 2019 than 2018 fix?

(I won’t repeat my arguments why the current state pollutes APIs and coding practices, which may be long-lived. I assume you understood them, and simply disagree / have different priorities. If I was unclear, tell me and I can write this down in longer form.)

andyferris · December 13, 2017, 2:14pm

If a person were up to the challenge and wanted to learn some internals, I’m guessing help from any quarter would be welcomed. Generally, I’ve found Julia developers to be rather welcoming!

(Please keep in mind that when you start hacking on parts of the system maintained by only a small number of contributors, those people may only have a limited amount of time to help out. Not to dissuade you, but something to remember if getting questions answered / code reviews take a while).

But unless someone puts in a herculean effort this is more likely a 2019 than 2018 fix?

Sorry, I really don’t know. I will say that from the above discussion it seems unlikely the current development effort with the current predicted timeline will get this done for v1.0.

I won’t repeat my arguments why the current state pollutes APIs and coding practices, which may be long-lived.

I happen to totally agree with this. Together with fast keyword arguments, the optimal API in many cases will be quite different to the current implementation. However, I don’t expect Julia to stagnate any time soon (though the pace, focus and scale of change may be different after v1.0). So I try and keep an open mind and relax (or else contribute fixes for things I feel strongly about)

jamii · December 13, 2017, 6:56pm

@StefanKarpinski is carnavals original approach still acceptable? ie if the PR was working today would it be merged, or are there still outstanding design questions that need discussion and agreement from the core team?

yuyichao · December 13, 2017, 7:20pm

The main (only) issue about it is that it destroys omission of null check. All other issues (api/syntax/loop detection/gc support/codegen support) are all figured out/easy to do/simply await decision.

However, pure optimization can potentially be possible in a lot of cases so it’ll be premature to do such a change only for the performance that can be obtained without any breaking change.

foobar_lv2 · December 13, 2017, 8:22pm

The main (only) issue about it is that it destroys omission of null check.

Could you explain / tell me whether I got this right?

As far as I understood, Null-checks are done for pointer-arrays and access to pointer-fields, and are omitted (1) in contexts where llvm can kill the branch directly (e.g. I write before I read and LLVM understands the aliasing well enough to see that it is unreachable), (2) for types that are supposed to have no uninitialized fields, as figured out by static analysis of inner constructors.

Now, if we store a ptr-containing immutable inline, then arrays are initialized with all-zero (memset-zero is faster than going over the offsets). Therefore, we cannot use static analysis of inner constructors anymore to rule out whether ptr-fields can be null.

That, in turn, makes every access to pointer-fields in immutables slower.

Did I understand the problem right?

Naively, I would guess that the optimal solution would be to skip all this checking business and instead to kindly ask the kernel to hand us the pagefault, so that we can raise an appropriate exception in julia?

I see two problems with this latter approach: (1) The OS needs to play along, and (2) we need to be capable of properly unwinding the stack.

Do you know which of these problems are real or unsurmountable? Or are they solved already?

One could also just eat the null-pointer deref and die, unless running a debug version? (It’s not like a nullpointer-deref is exploitable in such a context or could lead to silent data corruption, the user code is already crashed)

Edit: libsigsegv looks good for such a thing. And also, why do we emit positivity checks before square-roots? Shouldn’t this also be done by setting a FPU fault handler? A predicted branch is friggin expensive compared to the zero overhead of using fault handlers.

yuyichao · December 13, 2017, 10:08pm

Yes

Kind of. It require LLVM support. LLVM has it in 3.8 which had a bug that I fixed in 6.0. Ref Raise UndefRefError using SegFault handler by yuyichao · Pull Request #14147 · JuliaLang/julia · GitHub. It’ll still need testing to see how good it is though.

foobar_lv2 · December 13, 2017, 10:28pm

I see. So currently your patch for null-checks is in limbo (since 2 years), and carnaval’s patch is likely to be too expensive until your patch gets merged, and is hence in limbo as well (since 1 year)?

…this is painful. Thanks for the explanation.

Is there anything people could do to help on this? From outside it looks like all the problems are solved already, and this is a problem of organizational inertia, not code. Whining on the forums/github is unlikely to help, though

StefanKarpinski · December 14, 2017, 12:37am

The problem is that we’re not upgrading the default LLVM for 1.0 because that always ends up causing a lot of work that we can’t afford to do right now. As soon as 1.0 is out, we can upgrade LLVM in 1.1 and then this bug will be fixed and we can work on things that having this would enable.

Topic		Replies	Views
Memory layout of structs with mutable fields Performance	42	3405	March 3, 2019
Immutable vs Type: reassignment via arrays and performance General Usage question	11	2703	April 28, 2017
Idea for setting both mutables and immutables without allocation Internals & Design proposal	23	3963	April 15, 2017
Understanding generated assembly for simple loop Performance llvm	14	6083	March 19, 2018
Stack allocation for structs with heap references Internals & Design	2	2200	February 25, 2017

Immutables with reference-fields: Why boxed?

Related topics