Use of `pointer`


#1

Split from A plea for int overflow checking as the default

I am not sure this would be significantly better in the current state of the compiler. I haven’t looked into the implementation of BigInt, but just from poking around:

AU=Union{Int, BigInt}[1, BigInt(-1)]
pt = reinterpret(Ptr{Ptr{UInt64}}, pointer(AU))
Base.getindex(p::Ptr{T}) where T = unsafe_load(p)

pt[]
Ptr{UInt64} @0x00007febc26fadd0
pt[][]
0x0000000000000001

(pt+8)[]
Ptr{UInt64} @0x00007febc2d81730
(pt+8)[][]
0xffffffff00000001

Unless unions of bitstypes and non-bitstypes get significantly overhauled I cannot see any advantage of the union.

PS:

versioninfo()
#Julia Version 0.7.0-DEV.3666
#Commit 0f95988141* (2018-01-31 03:25 UTC)
pointer(AU)[]
#Segmentation fault (core dumped)

Not sure whether this is intended (I guess I would prefer it to throw instead of segfaulting, but trying to unsafe_load a pointer to a union is obviously wrong).


#2

There’s a good reason we never allow code like the above into Base and strongly discourage calling pointer. For the past couple of versions, it can be replaced with Ref in almost all cases, which now provides a safe and supported implementation. It is likely well past time for the pointer function to be deprecated and removed, but so far that has not been especially high priority.
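As a rough illustration of the Ref-based style, here is a minimal sketch assuming a recent Julia 1.x and a C runtime that exposes `memcpy` in the running process. The variable names are made up for the example and nothing here is taken from Base.

```julia
# A Ref gives C a GC-rooted, mutable slot; no call to `pointer` is needed.
out = Ref{Cint}(0)
src = Ref{Cint}(42)
ccall(:memcpy, Ptr{Cvoid}, (Ref{Cint}, Ref{Cint}, Csize_t), out, src, sizeof(Cint))
copied = out[]          # read the value that C wrote into the slot

# Ref(array, i) refers to a single element of an array.
v = [10, 20, 30]
r = Ref(v, 2)
r[] = 25                # writes through to v[2]
```

Because `ccall` keeps its arguments alive for the duration of the call, there is no window in which the garbage collector could free the memory behind the pointer.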


#3

No, please, not my pointers! Snippets like the above are invaluable for peeking under the hood and understanding how internals work; and they are useful for interfacing C, and also sometimes nice for performance (even though I understand why you consider getindex/setindex! for pointers as too convenient for Base).

I’ll take the segfault over a deprecation any day.

Or shall I write something to turn more of the obvious ones into throws? (We should know at compile-time that Unions cannot be unsafe_loaded from a pointer, zero runtime overhead but unsafe_load would need to become generated; I don’t see an obvious way to split throw vs pointerref on dispatch.)

edit: Huh. Apparently unsafe_load of unions does make sense sometimes (it assumes a different layout than array elements). So what I proposed can probably not be done; that’s life, this is an easy segfault to avoid.


#4

I am rather nervous that deprecating pointers would make some useful things impossible to do in Julia, even in principle. I’m working on something lately where I need to be able to reinterpret data from a Vector{UInt8} as another data type on demand. It seems that if I were to actually do this with reinterpret, I would have to allocate an array twice: once to copy a subset of the Vector{UInt8} and again to reinterpret it. With pointers and copy these can basically both be done in the same step. There are some other similar, subtle things that one might need to do especially in I/O applications that seem very difficult to do efficiently without pointers.
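The pointer-based approach described here might look roughly like the sketch below. `read_at` is a hypothetical helper written for this illustration, not a Base function, and it assumes Julia 1.x, where `GC.@preserve` is available.

```julia
# Hypothetical helper: read one value of bits type T starting at byte `offset`
# of `bytes`, without allocating an intermediate array.
function read_at(::Type{T}, bytes::Vector{UInt8}, offset::Integer) where {T}
    isbitstype(T) || throw(ArgumentError("T must be a bits type"))
    1 <= offset <= length(bytes) - sizeof(T) + 1 || throw(BoundsError(bytes, offset))
    # Keep `bytes` rooted while a raw pointer into it is in use.
    GC.@preserve bytes begin
        return unsafe_load(Ptr{T}(pointer(bytes, offset)))
    end
end

# Three UInt32 values laid out as 12 bytes in native byte order.
bytes = collect(reinterpret(UInt8, UInt32[10, 20, 30]))
second = read_at(UInt32, bytes, 5)
```

The explicit bounds check is what keeps this from being an easy segfault; the raw load itself performs no checking.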

I apologize that this has nothing to do with this thread. At some point I’m going to open a new thread about whether I can achieve these things without pointers.


#5

reinterpret doesn’t allocate a new array.
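A small sketch of what that means in practice, assuming Julia 1.x, where `reinterpret` returns a lazy wrapper and also accepts views (on 0.6 it only accepted plain arrays):

```julia
# 12 bytes holding three UInt32 values in native byte order.
bytes = collect(reinterpret(UInt8, UInt32[10, 20, 30]))

# Reinterpret a sub-range of the buffer; the data is not copied.
window = view(bytes, 5:12)
words = reinterpret(UInt32, window)

# Writing through the reinterpreted view changes the original buffer.
words[1] = UInt32(99)
whole = reinterpret(UInt32, bytes)
```

Here `words` and `whole` both alias `bytes`, so the write through `words` is visible in the other two.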


#6

Hm, it certainly seems to be doing that based on performance benchmarks. Like I said, when I get a chance I’m going to make a thread on this and hopefully you can help me :smile:


#7

I think the point of contention here is the following: The array-reinterpret does not allocate new space for the data, but it AFAIK does allocate a new jl_array struct on the heap (ca 1 cache-line), i.e. it is about as allocating as a view or an empty (length==0) array.

Whether you can afford to spend this amount of space as compared to 8 stack-bytes for a naked pointer is up to your application; consider reusing it (i.e.: define several reinterprets of the same buffer, for however many types you have, and stash these reinterprets somewhere instead of regenerating them on-demand (only works if your structs have the same size; else, use pointers)).

[edit: Not entirely sure about the current situation on 0.7]


#8

I also use pointer extensively, to get C performance from Julia.
I’ve done extensive benchmarking, and pointers have always given me the best results.


#9

Currently we pessimize (slow down) array loads in some cases because of the possibility that the arrays may have been used with the (now mostly deprecated) implementation of reinterpret. We always pessimize the performance of pointer calls and any code around them. In this, we are quite similar to C. In the future, we would like to be faster than C, like Fortran and Rust, by giving the compiler greater freedom to elide memory access and/or to use more SIMD instructions.


#10

Can you give some concrete examples of that?

I’ve found that for the sorts of things I’ve been doing, where I jump between using a Ptr{UInt64} and a Ptr{UInt8}/Ptr{UInt16}/Ptr{UInt32}, in order to perform operations on multiple characters at once,
that trying to use Julia arrays is simply much too slow.

In the future, I’d like to make it so that I can use SIMD instructions, to do that on up to 64 bytes (AVX-512) at a time (i.e. 64/32/16 code units at a time).
The current Julia SIMD support does not handle loops where you have to stop early, or conditionally execute some code depending on what is found when checking the chunk (for example, when scanning UTF-8, and you find a non-ASCII character in a chunk that needs to be handled, or a surrogate character when scanning UTF-16).
I’m not sure how Julia (or LLVM) could be made smart enough to handle those sorts of loops.


#11

I’m not sure what you mean. Julia defines that pointer / pointer_from_objref returns an invalid Ptr for use from Julia (it can be used from C, although implicit conversion via ccall or Ref should generally be better / easier), and that unsafe_load / unsafe_store! are unaligned operations that cannot operate on Julia objects (it’s undefined behavior if they alias in any way). So it’s not entirely just a question of how they benchmark, it’s simply that we define that the compiler may just ignore both if you intermix them (similar to how turning on TBAA in C can break non-standards-compliant code).
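The implicit conversion via ccall mentioned above can be sketched as follows, assuming Julia 1.x and that `memcpy` is resolvable in the running process. The arrays are passed directly and `ccall` derives the pointers itself.

```julia
src = UInt8[1, 2, 3, 4]
dst = zeros(UInt8, 4)

# Declaring the argument types as Ptr{UInt8} and passing the arrays themselves
# lets ccall convert them and keep both arrays rooted during the call.
ccall(:memcpy, Ptr{Cvoid}, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), dst, src, length(src))
```

No `Ptr` value is ever visible to Julia code in this form, which is the property the post is arguing for.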

My recollection is that SIMD doesn’t usually handle early-exit, so compiler authors haven’t put much effort into trying to sort out which cases might be feasible to handle. LLVM may handle vectorizing conditional execution if it can turn it into a select operation or can identify an applicable shufflevector or mask – this seems to be rather difficult for it however, as I’ve mostly only observed it succeeding on fairly simple cases.
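For contrast, a loop without an early exit, where the condition only feeds an addition, is the kind of simple case that has a chance of being vectorized. A minimal sketch (`count_newlines` is a name chosen for this example; whether LLVM actually vectorizes it depends on the Julia version and target CPU, and can be inspected with `@code_llvm`):

```julia
# Count newline bytes with a branch-free loop body: the comparison result
# is added to the counter instead of guarding a jump out of the loop.
function count_newlines(bytes::AbstractVector{UInt8})
    n = 0
    @inbounds @simd for i in eachindex(bytes)
        n += bytes[i] == 0x0a
    end
    return n
end

sample = Vector{UInt8}("a\nb\n\nc")
```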


#12

Rename pointer to Base.Really.Really.Dangerous.unsafe_pointer if you want, but please don’t get rid of it! I too have always been a fan of the fact that Julia provides enough foot-guns to have low-level fun — for pedagogic purposes if nothing else (like the old “ccall into an executable array” hack on the mailing list, among many others).


#13

Having it as: Core.unsafe_pointer I think would be enough to discourage the faint of heart.
I don’t know why it is exported. I’d make all the unsafe_* things unexported and make people work just a little bit to find them (but don’t remove them from the toolbox of people writing code at the level of Base, such as my package to replace Char and String).


#14

I’ve never heard that! Can you point to some documentation of that (which Base violates, BTW)?

I’ve been doing this for about 25 years with different platforms’ vector/SIMD instructions.
It’s not that you want a single SIMD instruction to somehow exit without processing all of the parallel operations, it’s that after operating a SIMD instruction, you can check to see if you need to conditionally perform some other operation, and continue with the next chunk, or exit the loop completely.

For example, for counting newline characters in an ASCII, Latin1, or UTF-8 string, you can use one of the SIMD instructions to compare 16/32/64 bytes in parallel, and count the matches, and continue the loop (possibly unrolled if you are dealing with very large strings). You can also use the SIMD instructions to quickly see if a string is all ASCII, for example. That’s the sort of “early-exit” I’m talking about.
If you need to return the offset of the first non-ASCII character in the string, a little more work is required, but it can still be done using the faster SIMD instructions. (SIMD isn’t just to speed up floating point operations!)

I’ve done that sort of thing manually, working 64-bits at a time, in my string package, and achieved very nice speedups compared to String, but I’d like to learn how to generate SIMD instructions from Julia if possible, using LLVM IR, so I don’t have to write an assembly language library for each processor to speed things up.
Any help in that direction would be greatly appreciated! :grinning:
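The manual 64-bits-at-a-time technique can be sketched like this. It is not the code from the string package mentioned above; `first_nonascii` is a hypothetical function written for illustration, assuming Julia 1.x. It checks eight bytes per iteration with ordinary integer registers and handles the leftover bytes one at a time, so it never reads past the end of the buffer.

```julia
# Return the 1-based index of the first byte >= 0x80, or 0 if all bytes are ASCII.
function first_nonascii(bytes::Vector{UInt8})
    n = length(bytes)
    nwords = n >> 3                      # number of complete 8-byte chunks
    himask = 0x8080808080808080          # high bit of every byte
    GC.@preserve bytes begin
        p = Ptr{UInt64}(pointer(bytes))
        for w in 1:nwords
            # ltoh puts the lowest-address byte in the low 8 bits on any host.
            hits = ltoh(unsafe_load(p, w)) & himask
            if hits != 0
                return 8 * (w - 1) + (trailing_zeros(hits) >> 3) + 1
            end
        end
    end
    # Partial chunk at the end: fall back to byte-sized steps.
    for i in (8 * nwords + 1):n
        bytes[i] >= 0x80 && return i
    end
    return 0
end
```

The exit test happens once per chunk rather than once per byte, which is the sense of "early-exit" used in this post.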


#15

Yes, but that’s not early-exit anymore (at least, not as the compiler usually sees it), since you’re iterating the loop some number of times after the exit (relative to the ops executed by the scalar version of the code). Your examples assume that it’s safe to read 15/31/63 bytes beyond the last byte (and that those garbage bytes won’t randomly contain the target character). It’s possible to construct the data such that it’s valid to assume – and thus make it possible to use SIMD instructions. But compilers are constrained by the early-exit appearing in the source and can’t typically make the assumption that the subsequent addresses would have been valid.


#16

I think the recent threads and digressions I’ve started on pointer demonstrate rather well why one should be really careful when using it. I thought I understood Julia rather well at this point (at least for someone who does not contribute to the Julia repo itself), but it turns out that there were a number of things regarding pointers and references that I was quite mixed up about. I’d imagine there are others in the same boat.


#17

Do you really think that I’d be stupid enough to do something as silly as that??? That’s rather insulting, to tell the truth. My examples don’t assume that at all - that’s your assumption, which is totally incorrect.

You don’t seem to have understood what I was saying at all.

With what I was doing back decades ago, I often had a full 2K or 8K block full of data, which was always at least 512byte or 4Kbyte aligned (that depended on the I/O system, some needed that sort of alignment for async I/O to work). In that case, I didn’t have to worry about it.
For general strings, you simply use the SIMD instructions for as many aligned chunks as possible, and then if there is a partial chunk at the end (depending on the size of the chunk, how the memory was allocated, and where the last safe-to-read byte is), it might need to be processed in smaller chunks, or read a full chunk and mask it, so as not to pick up any garbage data.

You can even take a look at how I handle that now, in my WIP version of Strs.jl, using only normal 64-bit registers to get a nice speed-up over String operations in Base.
There, I take advantage of the fact that I’m operating on strings that are always at least 8-byte aligned, and allocated in multiples of 16-bytes (I even double checked with Jeff about that!).

About the “early-exit”, that is something I’d discussed with Arch Robison, at at least one JuliaCon (when he left Intel and went to Nvidia). It has nothing to do with addresses being valid, IIRC, it’s just that things like the LLVM auto-vectorization can’t handle it.


#18

Yes - I agree :100:%.
It really isn’t for the vast majority of programmers coming to Julia, esp. the scientists, mathematicians, etc. who are experts in their domains, but really shouldn’t have to deal with the messiness of dealing with pointers, etc. (the same as they shouldn’t have to deal with the details of indexing into UTF-8 encoded strings :grinning:)
However, for those of us writing packages that need to be as fast as possible, to give those scientists, mathematicians, etc. the tools they can use to get their jobs done, it is critical that that low-level functionality is available in Julia.


#19

On x86, a tiny string (<= 4 bytes) could fit in the 12-byte pool and only be 4-aligned. Although outside of that, it’ll typically land in at least the 16-byte pool (like malloc, anything larger is aligned-16), which would guarantee that the string data would always be exactly pointer-aligned (4 or 8 bytes), and never be 16-byte aligned (or greater). But Strings found in the system image are only pointer-aligned and have no padding, so I suppose those could manage to end up 16-byte aligned.


#20

Did you mean, x86-32 (or 32-bit ARM, for that matter)? It’s important to be clear.
Just what are the pool sizes and alignments for 32-bit and 64-bit platforms for v0.6.x and master?

All that means is that when I start using SIMD instructions, the first chunk will need the first 8 bytes masked off (I’m not going to worry about SIMD on 32-bit platforms - my optimizations using a pointer sized chunk that I’ve already done will have to be enough).

Don’t they still have an Int length, followed directly by the bytes (and for String, have a '\0' at the end)?
Can some other data immediately follow that '\0' byte?

At least from Jeff’s replies to my messages, for memory that I allocate with Base._string_n, it is my understanding that it will always be allocated in at least 2*sizeof(Ptr{UInt8}) chunks.
Is there a 24-byte pool on 64-bit machines, that would make the allocation not always be 16-byte aligned (with the first byte after the length sometimes being 8-byte aligned, other times 16-byte aligned), or is it only 16 / 32 / …?
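One way to look at this empirically, as a sketch only: the helper below (`data_alignment`, a hypothetical name) reports the largest power of two that divides the address of a string's first byte. It shows what a given run happens to produce, not what the allocator guarantees, and it assumes Julia 1.x.

```julia
# Largest power of two dividing the address of the first data byte of `s`.
function data_alignment(s::String)
    GC.@preserve s begin
        addr = UInt(pointer(s))
        return 1 << trailing_zeros(addr)
    end
end

# Collect the observed alignments for a handful of freshly built strings.
observed = [data_alignment(repeat("x", n)) for n in 1:40]
```

Inspecting `observed` across string lengths gives a quick picture of which alignments show up on a particular platform and Julia version.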

You never answered another very important point - you’d said that pointer, and unsafe_load/unsafe_store! were not defined to work on Julia objects.
Can you point to any documentation of that? If it is true, then is the code doing just that in Base incorrect?