OK. I have now benchmarked 4 implementations. I want to test the case where strings are 8 bytes or shorter first.

- `lead_bits` (`leading_bits` in the code below): as shown by @jameson
- `unsafe_load` (`unsafe_loadbits` in the code below): loads 8 bytes regardless of string length
- `lead_bits_with_fast_path` (`leading_bits_with_fast_path` in the code below): basically `lead_bits`, but it checks whether the string has length 8 and, if so, uses `unsafe_load` to load all 8 bytes at once
- `load_bits_with_padding`: loads 8 bytes if the string has length 8; otherwise loads 4 bytes, 2 bytes, and 1 byte, or a combination of those
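To make the "combination" in `load_bits_with_padding` concrete, here is a small sketch of how a length from 0 to 8 splits into load widths. `chunk_sizes` is a hypothetical helper written only for illustration; it is not part of the benchmarked code.

```julia
# Hypothetical helper: widths (in bytes) of the loads that
# load_bits_with_padding would issue for a string of n bytes.
function chunk_sizes(n::Integer)
    n = min(n, 8)            # only the first 8 bytes are ever loaded
    n == 8 && return [8]     # a full word is a single 8-byte load
    sizes = Int[]
    for w in (4, 2, 1)       # greedily take the widest load that still fits
        if n >= w
            push!(sizes, w)
            n -= w
        end
    end
    return sizes
end

chunk_sizes(7)  # [4, 2, 1]
chunk_sizes(5)  # [4, 1]
```

So a 7-byte string costs three loads plus three shifts, which is probably part of why the variable-length case is slower than the single 8-byte load.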

I have tested the above (code at end) on a vector of 10 million strings, either all of length 8 or of variable length from 1 to 8. Results are summarised below. As expected, loading the bits of variable-length strings is slow, and `unsafe_load` is the benchmark we should try to get close to.

| Method | Fixed length 8 string timing | Variable length timing |
|---|---|---|
| `lead_bits` | 265ms | 290ms |
| `unsafe_load` | 65ms | 62ms |
| `lead_bits_with_fast_path` | 72ms | 285ms |
| `load_bits_with_padding` | 74ms | 200ms |

I still think a fast solution would be to figure out which strings are “close to the edge” and load only those with one of the slow loaders, while loading every other string with `unsafe_load` and bit-shifting the redundant bits away.
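One way to make “close to the edge” concrete is sketched below, under assumptions that are not from the benchmarks above. Memory protection works at page granularity, so an 8-byte load that stays inside the page holding the string's first byte should not fault even if it runs past the end of the string. The sketch assumes 4096-byte pages; `PAGE_SIZE`, `crosses_page`, and `masked_load_bits` are hypothetical names. Reading past the end of an object is still outside what Julia guarantees, so treat this as an experiment rather than a vetted implementation.

```julia
# Assumption: pages are 4096 bytes (or a multiple of that).
const PAGE_SIZE = UInt(4096)

# True when an 8-byte load starting at p would spill into the next page.
crosses_page(p::Ptr) = (UInt(p) & (PAGE_SIZE - 1)) > PAGE_SIZE - sizeof(UInt64)

function masked_load_bits(s::String)
    n = min(sizeof(s), sizeof(UInt64))   # number of meaningful bytes
    GC.@preserve s begin
        p = pointer(s)
        if crosses_page(p)
            # Slow path: the string sits at the end of a page, so read byte by byte.
            x = zero(UInt64)
            for i in 1:n
                x |= UInt64(codeunit(s, i)) << (8 * (8 - i))
            end
            return x
        end
        # Fast path: load 8 bytes, put the first byte in the most significant position,
        # then zero out everything after the first n bytes.
        raw = ntoh(unsafe_load(Ptr{UInt64}(p)))
        return raw & (typemax(UInt64) << (8 * (8 - n)))
    end
end

masked_load_bits("abc")  # 0x6162630000000000
```

The mask replaces the “bitshift the redundant bits away” step: for `n` meaningful bytes it keeps the top `8n` bits and clears the rest, so both paths return the same value.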

Noob question: `pointer(string_variable)` is a numeric address. Does a larger address mean the string is closer to the edge, or is something more nuanced going on? E.g. using some pointer arithmetic I can get to the value that a pointer is pointing to. I think this will work if we can assume that all pointer addresses less than the largest address are used by Julia. If this cannot be assumed, then the above technique won’t work.

```
# pointer arithmetic trick to get to string values pointed at by pointer
x = "abc"
y = "defdgdfsf"
disposition = pointer(x) - pointer(y)
pointer_to_x = Int(pointer(y) + disposition - 8)
unsafe_pointer_to_objref(Ptr{UInt8}(pointer_to_x)) === x # true
```

**Full code**

```
# Byte-by-byte loader: packs up to 8 code units into a UInt, first byte most significant.
function leading_bits(s::String)
    x = UInt(0)
    for i = 1:min(sizeof(x), sizeof(s))
        @inbounds x |= UInt(codeunit(s, i)) << ((sizeof(x) - i) * 8)
    end
    return x
end

# Same as leading_bits, but strings of exactly 8 bytes are loaded in one go.
function leading_bits_with_fast_path(s::String)
    if sizeof(s) == 8
        return ntoh(unsafe_load(Ptr{UInt64}(pointer(s))))
    end
    x = UInt(0)
    for i = 1:min(sizeof(x), sizeof(s))
        @inbounds x |= UInt(codeunit(s, i)) << ((sizeof(x) - i) * 8)
    end
    return x
end

# Loads 8 bytes at once when possible, otherwise a combination of 4-, 2- and 1-byte loads.
function load_bits_with_padding(s::String, skipbytes = 0)
    n = sizeof(s)
    remaining_bytes_to_load = min(sizeof(UInt), n)
    # start
    res = zero(UInt)
    shift_for_padding = 64
    if remaining_bytes_to_load == 8
        res = ntoh(unsafe_load(Ptr{UInt64}(pointer(s, skipbytes+1))))
    else
        if remaining_bytes_to_load >= 4
            res |= Base.zext_int(UInt, ntoh(unsafe_load(Ptr{UInt32}(pointer(s, skipbytes+1))))) << (shift_for_padding - 32)
            skipbytes += 4
            remaining_bytes_to_load -= 4
            shift_for_padding -= 32
        end
        if remaining_bytes_to_load >= 2
            res |= Base.zext_int(UInt, ntoh(unsafe_load(Ptr{UInt16}(pointer(s, skipbytes+1))))) << (shift_for_padding - 16)
            skipbytes += 2
            remaining_bytes_to_load -= 2
            shift_for_padding -= 16
        end
        if remaining_bytes_to_load >= 1
            res |= Base.zext_int(UInt, ntoh(unsafe_load(Ptr{UInt8}(pointer(s, skipbytes+1))))) << (shift_for_padding - 8)
            skipbytes += 1
            remaining_bytes_to_load -= 1
            shift_for_padding -= 8
        end
    end
    return res
end

# Always loads 8 bytes, regardless of string length (no byte swap, no masking).
unsafe_loadbits(s) = s |> pointer |> Ptr{UInt} |> unsafe_load

# Broadcast wrappers used in the benchmarks.
leading_bitsdot(s) = leading_bits.(s)
leading_bits_with_fast_pathdot(s) = leading_bits_with_fast_path.(s)
unsafe_loadbitsdot(s) = unsafe_loadbits.(s)
load_bits_with_paddingdot(s) = load_bits_with_padding.(s)
```

**Testing code**

```
using Random  # provides randstring on Julia 0.7 and later
using BenchmarkTools

# variable-length strings (1 to 8 bytes)
xvar = [randstring(rand(1:8)) for i in 1:100_000];
x = rand(xvar, 10_000_000);
@btime leading_bitsdot($x);
@btime leading_bits_with_fast_pathdot($x);
@btime unsafe_loadbitsdot($x);
@btime load_bits_with_paddingdot($x);

# fixed-length strings (8 bytes)
xfixed = [randstring(8) for i in 1:100_000];
x = rand(xfixed, 10_000_000);
@btime leading_bitsdot($x);
@btime leading_bits_with_fast_pathdot($x);
@btime unsafe_loadbitsdot($x);
@btime load_bits_with_paddingdot($x);
```