Reinterpret to non bits type

Lilith · June 6, 2022, 11:52am

I have a Vector{UInt8} and I want to get a Vector{T} for an arbitrary type T without or with minimal runtime penalty and without allocating. The actual contents of the input and output are unspecified.

The use case is in reusing a workspace buffer for sorting operations on different types.

Henrique_Becker · June 6, 2022, 12:57pm

Something like Base.reinterpret?

Lilith · June 6, 2022, 1:05pm

How do I get around an error like “ArgumentError: cannot reinterpret UInt8 as Vector{Int64}, type Vector{Int64} is not a bits type” when T = Vector{Int64}?

stevengj · June 6, 2022, 1:09pm

Sounds like you’re not calling reinterpret correctly: the type argument you pass should be the type In64 of the elements, not the type Vector{Int64} of the whole array:

julia> b = rand(UInt8, 64);

julia> reinterpret(Int64, b)
8-element reinterpret(Int64, ::Vector{UInt8}):
  1825538613332355507
   418583048467761498
 -8658478982884042069
 -9027638596725568490
  2879610717931555676
  7729036530860488291
 -3892794304566826536
  2529051593991046183

Sukera · June 6, 2022, 1:22pm

Arbitrary T are not possible for a number of reasons. The biggest one is that T may have internal padding between fields for alignment reasons, which you can’t convert/reinterpret to/from. The next biggest ones are that structures containing pointers are not likely to be serialized to pointers into that same array, and if they are, the array you have now is likely not at the same memory location as the original one was where the data was serialized/taken from.

If you require arbitrary reinterpretation, I’d suggest directly going to unsafe_*, but I’d think really hard about whether that is actually what is best here - parsing a byte stream safely is usually a better choice than going the unsafe_* route.

Lilith · June 6, 2022, 2:17pm

If T = Vector{Int64} then my elements are vectors of ints, so the final type I want is Vector{Vector{Int64}}.

I’m not interested in parsing because there is no useful information stored in the input vector. It is just a preallocated workspace. The key difference between my use case and the general reinterpret use case is that I don’t care about preserving semantics. The contents of the input array are unspecified, and I am perfectly happy with the output being gibberish. For example, when T = Vector{Int64}, I fully expect that dereferencing an element of the output vector will segfault.

The reason I am hesitant to use unsafe_* is because I want to produce a fully self-contained output Vector that does not need to be @gc.preserved.

Sukera · June 6, 2022, 3:18pm

I see - the trouble is that Vector{Int64} is a mutable object, so it’s not stored inline in the outer vector:

julia> Base.allocatedinline(Vector{Int64})
false

so a Vector{Vector{Int64}} is under the hood a vector of pointers to the contained Vector{Int64}. In part that’s also due to Vector{Int64} not being of fixed size, so there’s no other choice than to store pointers.

So if for example you have [0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8], you want to end up with [[578437695752307201]]? What happens for [0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8] - how many Vector{Int64} do you expect as output?

What’s the larger context you’re trying to use this in? It sounds to me like you’re trying to allocate blocks of memory outside of the control of GC, while then wanting to treat them like regularly allocated but differently typed objects, which means interfacing with GC on some level, to wrap the memory in your new type.

Lilith · June 6, 2022, 3:59pm

2

Sorting often requires a workspace buffer to swap elements into and out of. I want to re-use that vector like so:

function sort_many!(things_to_sort)
    workspace = UInt8[]
    for x in things_to_sort
        sort!(x; workspace)
    end
    things_to_sort
end

where sort! resizes and reinterprets the workspace itnernallly.

Lilith · June 6, 2022, 4:01pm

Yes, though I expect to get a segfault when dereferencing 0x0102030405060708 rather than the Int64 578437695752307201 stored at that memory address.

brenhinkeller · June 6, 2022, 4:02pm

As a general rule of thumb, vectors of vectors are in my experience rarely the right solution. I suspect there is an alternate way to express the problem that would not present this sort of difficulty.

brenhinkeller · June 6, 2022, 4:07pm

That said, if you just have a big array you’re using effectively as a manual memory buffer, you can always get the pointer to that array, do the pointer arithmetic yourself to get the start of where you want your vector to be and then Base.unsafe_wrap to get your vector. But you’re basically writing C at that point – don’t make an error or you’ll get a segfault!

Sukera · June 6, 2022, 4:09pm

Right - just reinterpret can’t know that though! It could have just as well been a single vector with two elements.

I don’t think you can do that in general without knowing more about things_to_sort, without falling back to something like workspace = Any[]. At least not in a safe way. Reusing a buffer typically requires knowing what you want to store in that buffer and allocating the workspace accordingly.

Yes, that’s be a little more involved, since julia does not allow you to do that without unsafe_*.

Lilith · June 6, 2022, 4:10pm

I want to support arbitrary vectors. That includes Int, Float64, Integer, Plots.Plot, Vector{Int}, etc. Vector{Int} is just an example of a difficult use case.

Sukera · June 6, 2022, 4:12pm

What would the type of things_to_sort be? How would you write this in e.g. C?

Lilith · June 6, 2022, 4:18pm

As you noted, a vector of vectors must simply store a vector of pointers. I think we can assume that the number of elements should be number of bits divided by size where size is

Base.elsize(Vector{T})

or

Base.aligned_sizeof(T)

In your example, this is 8, so 16 bytes / 8 bytes per input gives us an output of 2 elements. I don’t see a reasonable way to interpret bytes bits as [[3098029342423, 1098209414133]].

Lilith · June 6, 2022, 4:19pm

An iterable with eltype AbstractArray

brenhinkeller · June 6, 2022, 4:28pm

If you’re dealing with mutable objects like Plots.Plot or Vector{Int} you’re better off just embracing that and having your sorting buffer be a vector of pointers IMO, especially in the context of a sorting buffer.

It’s not 100% ideal, and means you need to differentiate between the code paths for mutable objects and immutable objects, but otherwise there’s no way to determine how much space is needed for the buffer at compile time so the without allocating part of your request is impossible otherwise.

It’ll also make the sorting much faster because then you just have to move pointers, not whole objects.

Lilith · June 6, 2022, 4:30pm

Yes. I am already using a vector of pointers.

Sukera · June 6, 2022, 4:31pm

Right - the difference is that in julia, Vector{Ptr{T}} (or Vector{Vector{Ptr{T}}}, for that matter) is not the same as Vector{T}, for T not being allocated inline and thus making Vector{T} being a vector of pointers under the hood. The types are different and you can’t freely convert between the two without going through unsafe_* (in C-speak, you’d cast a void* to my_type*, which may be undefined behavior, depending on where the pointer came from). One reason for that is that the “vector of pointers”-ness is AFAIK an implementation detail. In theory, if the compiler can figure out that none of the mutables in the vector outlives the vector itself, all of those mutable objects could be stored inline, as long as their sizes are known ahead of time. Vectors as elements of other vectors are not special cased here.

So you could then do this, can’t you?

workspace = eltype(eltype(things_to_sort))[]

This will in the worst case fall back to Any of course, but at that point your sorting will be type unstable anyway.

Lilith · June 6, 2022, 4:31pm

void sort_ints_and_floats(int *ints, size_t ints_len, double *floats, size_t floats_len) {
    workspace w = workspace();
    sort_ints(ints, ints_len, workspace);
    sort_floats(floats, floats_len, workspace);
    free_workspace(w)
}

workspace workspace() {
    workspace w;
    w.data = malloc(1024);
    w.size = 1024;
    return w;
}

void free_workspace(workspace w) {
    free(w.data);
}

void resize_workspace(workspace w, size_t new_size) {
    w.data = realloc(w.data, new_size);
    w.size = new_size;
}

void sort_ints(int *ints, size_t ints_len, workspace w) {
    if (ints_len * sizeof(int) > w.size) {
        resize_workspace(w, ints_len * sizeof(int));
    }
    int_workspace = (int *) w.data;
    // ...
}

void sort_floats(double *floats, size_t floats_len, workspace w) {
    if (floats_len * sizeof(double) > w.size) {
        resize_workspace(w, floats_len * sizeof(double));
    }
    float_workspace = (double *) w.data;
    // ...
}