# How to create a random binary array with b bits?

I would like to make a random binary array with b bits. In fact I will make a lot of them and compute their Hamming distances which I need to be able to do fast. At a low level that means computing XOR between binary arrays and then using count_ones.

How can I make the random binary arrays?

```julia
rand(Bool, 1000) |> BitArray
```

Or, applied to your problem:

```julia
julia> b = 100;

julia> x = rand(Bool, b) |> BitArray;

julia> y = rand(Bool, b) |> BitArray;

julia> sum(xor.(x,y))
81
```
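For the distance step itself, `count` is a slightly more direct alternative to `sum` over the XOR, since `count` on a packed `BitArray` can tally set bits a 64-bit chunk at a time via `count_ones`. A minimal sketch (the `hamming` name is just for illustration, not a standard function):

```julia
using Random: bitrand

# Hamming distance between two equal-length bit vectors:
# .⊻ marks the differing positions, count tallies them.
# (`hamming` is an illustrative name, not a standard function.)
hamming(x::BitVector, y::BitVector) = count(x .⊻ y)

x = bitrand(100)
hamming(x, x)    # 0: an array never differs from itself
hamming(x, .!x)  # 100: it differs from its complement everywhere
```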
```julia
julia> using Random: bitrand

julia> bitrand(10)
10-element BitArray{1}:
0
0
0
1
0
0
1
1
0
0
```

Performance of `bitrand` is quite good:

```julia
julia> b = 10^6;

julia> @btime BitArray(rand(Bool, $b));
1.623 ms (5 allocations: 1.07 MiB)

julia> @btime bitrand($b);
20.637 µs (3 allocations: 122.23 KiB)
```

It's blazing fast. Nice!

The whole thing seems pretty fast too:

```julia
julia> function count_rand_xor(n)
           a = bitrand(n)
           b = bitrand(n)
           count(a .⊻ b)
       end
count_rand_xor (generic function with 1 method)

julia> @time count_rand_xor(10^6)
0.000189 seconds (9 allocations: 366.703 KiB)
500442
```

Thank you so much! Do you happen to know how count is implemented at a low level?

You can type `@edit count(BitArray([1]))` to be moved directly to the implementation in a code editor.

I checked that it loops over the BitArray's internal vector of integers and calls `count_ones` on each. This function is super fast: it uses the `popcntq` instruction to count the bits on my machine.
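As a sketch of what that looks like, you can do the chunk-level loop yourself and compute the Hamming distance without materializing the XOR'd `BitArray` at all. Note that `chunks` is an undocumented internal field of `BitArray` (a `Vector{UInt64}` whose unused trailing bits are kept zero), so this is illustrative rather than a supported API:

```julia
using Random: bitrand

# Hamming distance computed directly on the packed 64-bit words:
# XOR each pair of chunks and popcount the result with count_ones.
# Relies on the (internal!) invariant that unused trailing bits are 0.
function hamming_chunks(x::BitVector, y::BitVector)
    length(x) == length(y) || throw(DimensionMismatch("lengths differ"))
    s = 0
    for (a, b) in zip(x.chunks, y.chunks)
        s += count_ones(a ⊻ b)
    end
    return s
end

x = bitrand(10^6); y = bitrand(10^6)
hamming_chunks(x, y) == count(x .⊻ y)  # true, with no large temporary allocated
```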

```julia
julia> x = 5
5

julia> @code_native count_ones(x)
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ int.jl:354 within `count_ones'
popcntq	%rdi, %rax
retq
nopw	%cs:(%rax,%rax)
; └
```

That's interesting. It seems that popcount can be optimized even further by using AVX2 instructions: https://arxiv.org/pdf/1611.07612.pdf


Could you check what `@code_native` returns for you? My CPU has no AVX2.

Unfortunately I don't have access to a Haswell CPU currently. But it has been pointed out to me that this is really an LLVM question rather than a Julia one.

I bet it's going to use those AVX2 instructions if you run it on a computer with AVX2. This is one of Julia's strengths: you are not limited by a library compiled for a specific instruction set (AFAIK, for instance, some AMD CPUs lack AVX2 instructions).
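One way to see this on your own machine is to check which CPU target Julia detected and inspect the generated code; the output is machine-dependent, so what you see will differ from the listings above:

```julia
using InteractiveUtils  # provides @code_native outside the REPL

# Julia compiles native code for the host CPU, so the generated
# instructions differ from machine to machine.
println(Sys.CPU_NAME)               # e.g. "haswell" or "skylake"; machine-dependent
@code_native count_ones(UInt64(5))  # look for popcnt on CPUs that support it
```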

On my system, yes, it's using AVX-512 instructions:

```julia
julia> @code_native debuginfo=:none count(a)
.section	__TEXT,__text,regular,pure_instructions
movq	(%rdi), %rax
movq	8(%rax), %rcx
testq	%rcx, %rcx
jle	L45
movq	%rcx, %rdx
sarq	$63, %rdx
andnq	%rcx, %rdx, %rcx
movq	(%rax), %rdx
cmpq	$16, %rcx
jae	L48
movl	$1, %esi
xorl	%eax, %eax
jmp	L349
L45:
xorl	%eax, %eax
retq
L48:
movabsq	$9223372036854775792, %r8 ## imm = 0x7FFFFFFFFFFFFFF0
andq	%rcx, %r8
leaq	1(%r8), %rsi
vpxor	%xmm0, %xmm0, %xmm0
xorl	%eax, %eax
movabsq	$4607319040, %rdi       ## imm = 0x1129E1C00
vmovdqa	(%rdi), %ymm1
movabsq	$4607319072, %rdi       ## imm = 0x1129E1C20
vmovdqa	(%rdi), %ymm2
vpxor	%xmm3, %xmm3, %xmm3
vpxor	%xmm4, %xmm4, %xmm4
vpxor	%xmm5, %xmm5, %xmm5
vpxor	%xmm6, %xmm6, %xmm6
nopw	%cs:(%rax,%rax)
nopl	(%rax)
L128:
vmovdqu	(%rdx,%rax,8), %ymm7
vmovdqu	32(%rdx,%rax,8), %ymm8
vmovdqu	64(%rdx,%rax,8), %ymm9
vmovdqu	96(%rdx,%rax,8), %ymm10
vpand	%ymm1, %ymm7, %ymm11
vpshufb	%ymm11, %ymm2, %ymm11
vpsrlw	$4, %ymm7, %ymm7
vpand	%ymm1, %ymm7, %ymm7
vpshufb	%ymm7, %ymm2, %ymm7
vpaddb	%ymm11, %ymm7, %ymm7
vpsadbw	%ymm0, %ymm7, %ymm7
vpaddq	%ymm3, %ymm7, %ymm3
vpand	%ymm1, %ymm8, %ymm7
vpshufb	%ymm7, %ymm2, %ymm7
vpsrlw	$4, %ymm8, %ymm8
vpand	%ymm1, %ymm8, %ymm8
vpshufb	%ymm8, %ymm2, %ymm8
vpaddb	%ymm7, %ymm8, %ymm7
vpsadbw	%ymm0, %ymm7, %ymm7
vpaddq	%ymm4, %ymm7, %ymm4
vpand	%ymm1, %ymm9, %ymm7
vpshufb	%ymm7, %ymm2, %ymm7
vpsrlw	$4, %ymm9, %ymm8
vpand	%ymm1, %ymm8, %ymm8
vpshufb	%ymm8, %ymm2, %ymm8
vpaddb	%ymm7, %ymm8, %ymm7
vpsadbw	%ymm0, %ymm7, %ymm7
vpaddq	%ymm5, %ymm7, %ymm5
vpand	%ymm1, %ymm10, %ymm7
vpshufb	%ymm7, %ymm2, %ymm7
vpsrlw	$4, %ymm10, %ymm8
vpand	%ymm1, %ymm8, %ymm8
vpshufb	%ymm8, %ymm2, %ymm8
vpaddb	%ymm7, %ymm8, %ymm7
vpsadbw	%ymm0, %ymm7, %ymm7
vpaddq	%ymm6, %ymm7, %ymm6
cmpq	%rax, %r8
jne	L128
vpaddq	%ymm3, %ymm4, %ymm0
vpaddq	%ymm0, %ymm5, %ymm0
vpaddq	%ymm0, %ymm6, %ymm0
vextracti128	$1, %ymm0, %xmm1
vpaddq	%xmm1, %xmm0, %xmm0
vpshufd	$78, %xmm0, %xmm1       ## xmm1 = xmm0[2,3,0,1]
vpaddq	%xmm1, %xmm0, %xmm0
vmovq	%xmm0, %rax
cmpq	%r8, %rcx
je	L371
L349:
decq	%rsi
L352:
xorl	%edi, %edi
popcntq	(%rdx,%rsi,8), %rdi
incq	%rsi
cmpq	%rsi, %rcx
jne	L352
L371:
vzeroupper
retq
nopw	(%rax,%rax)
```

I imagine these are the AVX-512 instructions (using write masks)?
Interesting that it is only using `ymm` registers.

Downclocking can happen when using 512-bit registers; maybe that's why LLVM decides to use 256-bit ones. It can hurt performance when mixed with non-vectorized code. Reference.

I'll try running a series of benchmarks in the next few months.
My machine runs non-AVX/AVX/AVX-512 code at 4.6/4.3/4.1 GHz. The difference between 4.3 and 4.1 GHz isn't very significant.

However, reading the blog post, it notes:

> Similarly, if you are using 256-bit heavy instructions in a sustained manner, you will move to L1. The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1: the processor stops and changes its frequency as soon as an instruction is encountered.

The 4x slower execution was also observed in this post:

Ctrl + F for "with an IPC of ~0.25" to find it. It also includes graphs showing this behavior.

The slowdown lasted about 9 microseconds.
In other words, if a function takes around that time to run, and you aren't executing a lot of other AVX512 code (so that the CPU is already in that state), that function will run at 1/4 speed if it uses 512 bit instructions. It will thus almost certainly be faster if it does not, and runs code similar to what the rest of the program is doing, so that the CPU can stay in the same "license".

When running numerical work, my CPU stays in the AVX512 license (as I can monitor with `watch -n0.5 "cat /proc/cpuinfo | grep MHz"`), so I'd like to be able to keep it there.

Another concern of mine is that LLVM handles loop remainders poorly, and it leaves a remainder twice as long when using full-width AVX512. You can look at all the graphs in my opening post from [ANN] LoopVectorization, where Clang and Julia consistently show this pattern of behavior:

They would benefit a lot just by having a shorter remainder.
This wasn't a problem for LoopVectorization, the GNU compilers, or the Intel compilers. Just LLVM.
(Both GNU and Intel also avoid AVX512 unless specified with a compilation flag; I did provide the flags for the above example, meaning they must have decided avoiding it was worth it despite not having the remainder issue.)


Wow, I read through this thread; really great package. I will try it for sure, and I have a nice idea for benchmarking it.
There is one more concern LLVM could take into account: heat dissipation. Even though the CPU runs at a slower clock, 512-bit operations may need much more power, which leads to throttling on laptops and desktops with stock cooling. My i7 4770K with a stock cooler was throttled all the time (I know it has no AVX-512, but still).
