Is there a way to check how "far away" a pointer is from a page boundary?


#1

I am quite new to understanding how memory works so please do correct me if I get the terminology and concepts wrong.

Update

Base on @stevengj’s suggestion, here is an MVE of my function. The unsafe part is step1 but I have provided the rest of the function for context.

function load_bits(s::String)
    step1 = s |> pointer |> Ptr{UInt} |> unsafe_load |> ntoh
    # now there are some bits at the end that should be padded with 0
    # but are not guaranteed to be 0 - so bitshift them away
    nbits_to_shift = 8(sizeof(UInt) - sizeof(s))
    step2 = step1 >> nbits_to_shift << nbits_to_shift
    step2
end

s = "abc"
load_bits(s)

svec = [randstring(rand(1:8)) for i=1:1_000_000]
@time load_bits.(svec)

What’s the best way to optimize the above but is still safe, in that in won’t be allowed to try and access memory that isn’t allocated to the program? Any explanation of the theories I need to make my own optimization in the future would be highly appreciated too.

Orig text
I found that the fastest way to load the underlying bytes of a string is to load 8 bytes at a time using unsafe_load as above. However the string may be less than 8 bytes in length, so in theory, one could be accessing memory that is not allocated and cause a segfault. But given I have a pointer which is just a memory address, is it possible to find out how “far away” that pointer is from the boundary? Because out of 100_000_000 strings, only one would be right on the boundary, so for the other 99_999_999 strings, I can just load 8 bytes and bitshift away the bits I don’t want. And for that 1 string I can just load it slowly byte by byte to avoid causing any issues.

New MWE showing load 8 bytes at a time is more efficient than loading 1 byte 8 times
A simple MVE to show that load 8 bytes at a time is more efficient than loading 1 byte at a time using codeunit 8 times.

svec = [randstring(rand(1:8)) for i=1:1_000_000];
using BenchmarkTools
@btime load_bits.($svec);
@btime codeunit.($svec,1);

#2

@xiaodai, can I just suggest that you give a minimal case that you want to optimize here and let some other people try to help you optimize it in a safe way?

As I understand it, you have a load_bits(T, s, skip) function that you want to make as fast as possible when applying to an array of strings via load_bits.(T, sarray, skip). Post a minimal version of your current implementation, e.g. just for T=UInt64, and a typical array of strings and skip value that you want to be fast.


#3

The typical way that compilers handle this (e.g. to SIMD loops) is to split the loop into handling loads where remaining bytes > 16 and the go byte-by-byte for the remaining bytes when < 16.


#4

Check for page boundary is trivial (though you can still interfere with compiler tranformations, asan for example)

Also note that this is a very hardward dependent optimization, it’ll usually better to rely on the compiler for most cases. AArch64-SVE (and IA64 iirc) has instructions that allow loading of vectors out of bound for example.


#5

Updated original post with

Update

Base on @stevengj’s suggestion, here is an MVE of my function. The unsafe part is step1 but I have provided the rest of the function for context.

function load_bits(s::String)
    step1 = s |> pointer |> Ptr{UInt} |> unsafe_load |> ntoh
    # now there are some bits at the end that should be padded with 0
    # but are not guaranteed to be 0 - so bitshift them away
    nbits_to_shift = 8(sizeof(UInt) - sizeof(s))
    step2 = step1 >> nbits_to_shift << nbits_to_shift
    step2
end

s = "abc"
load_bits(s)

svec = [randstring(rand(1:8)) for i=1:1_000_000]
@time load_bits.(svec)

What’s the best way to optimize the above but is still safe, in that in won’t be allowed to try and access memory that isn’t allocated to the program? Any explanation of the theories I need to make my own optimization in the future would be highly appreciated too.


#6

We still don’t know what step2 is. Until then, the most optimized version I can give you is:

function leading_bits(s::String)
    x = UInt(0)
    for i = 1:min(sizeof(x), sizeof(s))
        x |= UInt(codeunit(s, i)) << ((sizeof(x) - i) * 8)
    end
    return x
end

And until you @profile this, it’s also the fastest way to get the right answer (because the code that’s written is faster than the code that isn’t :wink:)


#7

step 2 is as I have shown. It’s to bitshift the bits at the end out then in to ensure that they zero


#8

Sorry, I meant that I have no idea what you will do with this data after the function returns, which is may have a non-trivial significance to the implementation.

(I also made a mistake above, and forgot to write @inbounds, so there was a slightly faster way to get the answer without much effort)


#9

Once I get the array of UInt I just radix sort it using the sorttwo! function that is undergoing a PR in SortingAlgorithms.jl. The function sorts the UInt array as well as the original string vector at the same time.


#10

Ok. I have benchmarked 4 implementations now. I want to test the case where strings are 8 bytes or shorter first

lead_bits - as shown by @jameson
unsafe_load - which loads 8 bytes regardless of string length
lead_bits_with_fast_path - basically lead_bits but checks if the string if length 8 if it is use unsafe_load to load all 8 bytes at onces
load_bits_with_padding - basically loads 8 bytes if of length 8, otherwise load 4 bytes, 2 bytes and 1 byte or a combination of the those

I have tested (code at end) the below by generating 10 million string vectors of length 8 or of variable length from 1 to 8. Reults are summarised below. As expected Variable length bits loading is slow. And the benchmark we should get close

method Fixed length 8 String timing Variable length timing
lead_bits 265ms 290ms
unsafe_load 65ms 62ms
lead_bits_with_fast_path 72ms 285 ms
load_bits_with_padding 74ms 200ms

I still think if we can figure which string is “close to the edge” and just load that with one of the slow loaders but every other string just load with unsafe_load and bitshift the redundant bits away would be a fast solution.

Noob question: the pointer(string_variable) is a numeric address. The larger it is the closer it is to the edge? Or something more nuanced is going on? E.g. using some pointer arithmetic I can get to the value that a pointer is pointing to. I think this will work if we can assume that all pointer addresses less than the largest address are used by Julia. If this cannot be assumed then the above technique won’t work

# pointer arithmetic trick to get to string values pointed at by pointer
x = "abc"
y = "defdgdfsf"
disposition = pointer(x) - pointer(y)
pointer_to_x = Int(pointer(y) + disposition - 8)
unsafe_pointer_to_objref(Ptr{UInt8}(pointer_to_x)) === x # true

Full code


function leading_bits(s::String)
    x = UInt(0)
    for i = 1:min(sizeof(x), sizeof(s))
        @inbounds x |= UInt(codeunit(s, i)) << ((sizeof(x) - i) * 8)
    end
    return x
end


function leading_bits_with_fast_path(s::String)
    if sizeof(s) == 8
        return ntoh(unsafe_load(Ptr{UInt64}(pointer(s))))
    end

    x = UInt(0)
    for i = 1:min(sizeof(x), sizeof(s))
        @inbounds x |= UInt(codeunit(s, i)) << ((sizeof(x) - i) * 8)
    end
    return x
end

function load_bits_with_padding(s::String, skipbytes = 0)  
    n = sizeof(s)    
    remaining_bytes_to_load = min(sizeof(UInt), n)
    # start
    res = zero(UInt)
    shift_for_padding = 64

    if remaining_bytes_to_load == 8
        res = ntoh(unsafe_load(Ptr{UInt64}(pointer(s, skipbytes+1))))
    else 
        if  remaining_bytes_to_load >= 4
            res |= Base.zext_int(UInt, ntoh(unsafe_load(Ptr{UInt32}(pointer(s, skipbytes+1))))) << (shift_for_padding - 32)
            skipbytes += 4
            remaining_bytes_to_load -= 4
            shift_for_padding -= 32
        end
        if  remaining_bytes_to_load >= 2
            res |= Base.zext_int(UInt, ntoh(unsafe_load(Ptr{UInt16}(pointer(s, skipbytes+1))))) << (shift_for_padding - 16)
            skipbytes += 2
            remaining_bytes_to_load -= 2
            shift_for_padding -= 16
        end
        if  remaining_bytes_to_load >= 1
            res |= Base.zext_int(UInt, ntoh(unsafe_load(Ptr{UInt8}(pointer(s, skipbytes+1))))) << (shift_for_padding - 8)
            skipbytes += 1
            remaining_bytes_to_load -= 1
            shift_for_padding -= 8
        end
    end

    return res
end

unsafe_loadbits(s) = s |> pointer |> Ptr{UInt} |> unsafe_load
leading_bitsdot(s) = leading_bits.(s)
leading_bits_with_fast_pathdot(s) = leading_bits_with_fast_path.(s)
unsafe_loadbitsdot(s) = unsafe_loadbits.(s)
load_bits_with_paddingdot(s) = load_bits_with_padding.(s)

Testig code

xvar = [randstring(rand(1:8)) for i in 1:100_000];
x = rand(xvar, 10_000_000);
using BenchmarkTools
@btime leading_bitsdot($x);
@btime leading_bits_with_fast_pathdot($x);
@btime unsafe_loadbitsdot($x);
@btime load_bits_with_paddingdot($x);

xfixed = [randstring(8) for i in 1:100_000];
x = rand(xfixed, 10_000_000);
using BenchmarkTools
@btime leading_bitsdot($x);
@btime leading_bits_with_fast_pathdot($x);
@btime unsafe_loadbitsdot($x);
@btime load_bits_with_paddingdot($x);

#11

Yes. Memory addresses are direct-mapped to virtual-memory pages, i.e. by looking at the memory address and masking out the least-significant bits. The least-significant bits are the offset with the page, and tell you how close you are to the page boundary. Since memory can only be marked unreadable by the VM system at the page level, in my understanding it is always safe (can’t segfault) to read memory within the same page as your data (if you are okay with reading garbage).

Assuming that the pages are a power-of-two size that is at least 4096 (4k, 0x1000 in hex) bytes, which I think is true on all x86_64 systems at least, you can therefore check whether the data of a string s is less than 8 bytes from a page boundary by (UInt(pointer(s)) & 0xfff) > 0xff8. I’m not sure about the page sizes on x86, ARM, etc.

This is very finicky low-level stuff, though, with problematic portability. I would be really sure that you have exhausted all other methods of optimization, and that this kind performance hack is really necessary, before considering it.


#12

Actually from own testing, this optimization at best shaves 3 seconds off a 20-second problem (sorting 100 million 8 bytes or fewer strings). So it’s not that critical but nice to know!

Also any garbage read in my use case will be bitshifted away.

The bigger gain will probably come from a more efficient radix sorting algorithm.


#13

Leads naturally to the question. How to determine mem pagesize from within Julia? Just nice to know.


#15

It is operating-system dependent, see e.g. here. Something like the following should work:

if Sys.isunix()
    pagesize() = ccall(:sysconf, Clong, (Cint,), #= _SC_PAGESIZE: =# Sys.islinux() ? 30 : 29)
elseif Sys.iswindows()
    function pagesize()
        info = zeros(Cint, 16) # >= sizeof SYSTEM_INFO
        ccall(:GetSystemInfo, Nothing, (Ptr{Cint},), info)
        return info[2] # dwPageSize
    end
end

I’m not 100% sure of the value of _SC_PAGESIZE on systems other than Mac and Linux x86_64, however. I suppose you could get it via something like:

const _SC_PAGESIZE = Sys.isapple() ? 29 : parse(Cint, match(r"\[\[\[([ 0-9]*)\]\]\]", read(pipeline(`echo $("#include <unistd.h>\n[[[_SC_PAGE_SIZE]]]")`, `cc -E -`),String)).captures[1])

since it is fairly safe to assume that non-Apple Unices will have cc installed.


#16

I am trying to avoid some ntoh calls and some shifting. Would you be so kind to benchmark it? :slight_smile:

julia> mutable struct MutableUInt val::UInt end

julia> function copy_bits(s::String, skipbytes = 0)  
    n = sizeof(s)    
    remaining_bytes_to_load = min(sizeof(UInt), n)
    # start
    res = MutableUInt(0)
    p=reinterpret(Ptr{UInt8}, pointer_from_objref(res))
    if remaining_bytes_to_load == 8
        unsafe_copyto!(p, pointer(s),8)
    else 
        if  remaining_bytes_to_load >= 4
            unsafe_copyto!(p, pointer(s), 4)
            skipbytes += 4
            remaining_bytes_to_load -= 4
        end
        if  remaining_bytes_to_load >= 2
            unsafe_copyto!(p+skipbytes, pointer(s, skipbytes+1),2)
            skipbytes += 2
            remaining_bytes_to_load -= 2
        end
        if  remaining_bytes_to_load >= 1
            unsafe_copyto!(p+skipbytes, pointer(s, skipbytes+1),1)
        end
    end

    ntoh(res.val)
end

PS. First version was with unsafe_copy! function which is deprecated. It had catastrophic impact to benchmark. It doesn’t seem that I gained something significant (if anything)…

Edit:

julia> x = rand(xvar, 10_000_000);

julia> @btime copy_bitsdot($x);
       @btime leading_bitsdot($x);
       @btime leading_bits_with_fast_pathdot($x);
       @btime unsafe_loadbitsdot($x);
       @btime load_bits_with_paddingdot($x);
  566.614 ms (2 allocations: 76.29 MiB)
  692.019 ms (2 allocations: 76.29 MiB)
  624.983 ms (2 allocations: 76.29 MiB)
  375.962 ms (2 allocations: 76.29 MiB)
  610.363 ms (2 allocations: 76.29 MiB)

julia> x = rand(xfixed, 10_000_000);

julia> @btime copy_bitsdot($x);
       @btime leading_bitsdot($x);
       @btime leading_bits_with_fast_pathdot($x);
       @btime unsafe_loadbitsdot($x);
       @btime load_bits_with_paddingdot($x);
  481.873 ms (2 allocations: 76.29 MiB)
  845.588 ms (2 allocations: 76.29 MiB)
  155.445 ms (2 allocations: 76.29 MiB)
  138.500 ms (2 allocations: 76.29 MiB)
  151.662 ms (2 allocations: 76.29 MiB)

#17

ccall(:jl_getpagesize, Int, ()). Also note that if you really want to do this optimization, you almost certainly want to just use a lower bound of 4096.


#18

You are using 0.7 then? I don’t have unsafe_copyto! in my Base only unsafe_copy!
From your benchmarking clearly the copy_itsdot is faster but is missing a fast path for exactly 8 bytes. So I have added that.

I have added a version where all lengths from 1 to 8 has a unsafe_load version by defining primitive types of byte size 1-7. But the clear winner is the check boundary version: it’s very close to the performance of the unsafe version!

# all char specialisation
primitive type Bits24 24 end
primitive type Bits40 40 end
primitive type Bits48 48 end
primitive type Bits56 56 end

function all_lens(s::String, skipbytes = 0)::UInt
    n = sizeof(s)    
    remaining_bytes_to_load = min(sizeof(UInt), n)

    if remaining_bytes_to_load == 8
        return ntoh(unsafe_load(Ptr{UInt64}(pointer(s))))
    elseif  remaining_bytes_to_load == 7
        return ntoh(Base.zext_int(UInt, unsafe_load(Ptr{Bits56}(s |> pointer))))
    elseif  remaining_bytes_to_load == 6
        return ntoh(Base.zext_int(UInt, unsafe_load(Ptr{Bits48}(s |> pointer))))
    elseif  remaining_bytes_to_load == 5
        return ntoh(Base.zext_int(UInt, unsafe_load(Ptr{Bits40}(s |> pointer))))
    elseif  remaining_bytes_to_load == 4
        return ntoh(Base.zext_int(UInt, unsafe_load(Ptr{UInt}(s |> pointer))))
    elseif  remaining_bytes_to_load == 3
        return ntoh(Base.zext_int(UInt, unsafe_load(Ptr{Bits24}(s |> pointer))))
    elseif  remaining_bytes_to_load == 2
        return ntoh(Base.zext_int(UInt, unsafe_load(Ptr{UInt16}(s |> pointer))))
    else
        return ntoh(Base.zext_int(UInt, unsafe_load(Ptr{UInt8}(s |> pointer))))
    end
end

function check_boundary(s::String)::UInt
    ptrs  = pointer(s)
    if (UInt(ptrs) & 0xfff) > 0xff8
        return all_lens(s)
    else
        return ntoh(unsafe_load(Ptr{UInt64}(ptrs)))
    end 
end

all_lensdot(s) = all_lens.(s)
check_boundarydot(s) = check_boundary.(s)

xvar = [randstring(rand(1:8)) for i in 1:100_000];
x = rand(xvar, 10_000_000);
using BenchmarkTools
# @btime leading_bitsdot($x);
# @btime leading_bits_with_fast_pathdot($x);
@btime unsafe_loadbitsdot($x);          # 70ms
@btime load_bits_with_paddingdot($x);   # 240ms
# @btime copy_bits_fast_pathdot($x)     # can't test 
@btime all_lensdot($x);                 # 190ms
@btime check_boundarydot($x);           # 72ms

xfixed = [randstring(8) for i in 1:100_000];
x = rand(xfixed, 10_000_000);
using BenchmarkTools
# @btime leading_bitsdot($x);
# @btime leading_bits_with_fast_pathdot($x);
@btime unsafe_loadbitsdot($x);          # 67ms
@btime load_bits_with_paddingdot($x);   # 78ms
# @btime copy_bits_fast_pathdot($x)     # can't test
@btime all_lensdot($x);                 # 81ms
@btime check_boundarydot($x);           # 74ms

#19

I am trying to evaluate both. As you can see my 0.7 from 3.1.2018 is obsolete now :stuck_out_tongue: because we won’t be able to use pointer_from_objref on Ints.

If you/we aim to create usable algorithm you/we have to be prepared to use sustainable methods.

  1. I still don’t believe that unsafe_copyto! is slower than unsafe_load so there is probably way to improve other parts of code
  2. if it is really slower we could simple replace >8 bytes code:
if remaining_bytes_to_load == 8
        unsafe_copyto!(p, pointer(s),8) # 312ms
        res.val = unsafe_load(Ptr{UInt64}(pointer(s, skipbytes+1))) # 273ms
        

My benchmarks are probably not very good because by repeating them I use to get different results. (memory swap problem? web browsers randomly consuming too much CPU? overheated CPU?)

I think that check_boundary will stay winner. :slight_smile:

You could simply change unsafe_copyto! to unsafe_copy! and test it. (I am curious but I don’t think it will be hilarious :stuck_out_tongue: )

BTW. You probably don’t miss that you could avoid to call in last if body in load_bits_with_padding:

            skipbytes += 1
            remaining_bytes_to_load -= 1
            shift_for_padding -= 8

But maybe it doesn’t matter because compiler optimize it away?


#20

I tried to generalize this. I am trying to find strings that are at least 15 bytes from the boundary, which means I should check for (UInt(pointer(s)) & 0xfff) > 0xff0.

On my Windows 10 64bit machine out of 1_000_000 length-15 strings I can always find 4000-6000 that have addresses that satisfy (UInt(pointer(s)) & 0xfff) > 0xff0 and they all end in ff8. Doesn’t this mean that the string is only 8 bytes from the boundary? So the machine stores 15 bytes across two pages? But unsafe_load.(Ptr{UInt128}.(ptrs)) on these pointers still work fine. I get page size of 4096 using both your and yuyichao’s method.

Actually I struggled to find strings, s, that satisfies (UInt(ptrs) & 0xfff) > 0xff8. Which makes testing hard. Is there a way to generate them?

x = [randstring(15) for i = 1:1_000_000];
ptrs = pointer.(x)

# keep only those close to the edge
isnearboundary(ptrs) = (UInt.(ptrs) .& 0xfff) .> 0xff0
# theoretically these are ptrs that are less than 15 bytes away from the boundary
ptrs = ptrs[find(isnearboundary, ptrs)]

Base.unsafe_load.(Ptr{UInt128}.(ptrs)) # works fine

using  StatsBase
countmap(UInt.(ptrs) .& 0xfff) # only has 0xff8 but nothing else even though I checked for `.> 0xff0`

#21

I did it was really slow so excluded from testing