How to create a random binary array with b bits?

I would like to make a random binary array with b bits. In fact I will make a lot of them and compute their Hamming distances which I need to be able to do fast. At a low level that means computing XOR between binary arrays and then using count_ones.

How can I make the random binary arrays?

rand(Bool, 1000) |> BitArray

or solving your problem:

julia> b=100;

julia> x = rand(Bool, b) |> BitArray;

julia> y = rand(Bool, b) |> BitArray;

julia> sum(xor.(x,y))
81
1 Like
julia> using Random: bitrand

julia> bitrand(10)
10-element BitArray{1}:
 0
 0
 0
 1
 0
 0
 1
 1
 0
 0

Performance of bitrand is quite good:

julia> b = 10^6;

julia> @btime BitArray(rand(Bool, $b));
  1.623 ms (5 allocations: 1.07 MiB)

julia> @btime bitrand($b);
  20.637 ÎĽs (3 allocations: 122.23 KiB)
11 Likes

It’s blazing fast. Nice!

Total thing seems pretty fast too:

julia> function count_rand_xor(n)
           a = bitrand(n)
           b = bitrand(n)
           count(a .⊻ b)
       end
count_rand_xor (generic function with 1 method)

julia> @time count_rand_xor(10^6)
  0.000189 seconds (9 allocations: 366.703 KiB)
500442
3 Likes

Thank you so much! Do you happen to know how count is implemented at a low level?

You can type @edit count(BitArray([1])) to be moved directly to the implementation in a code editor.

I checked that it loops over internal BitArray vector of integers, and cals count_ones on them. This functions is super fast, it uses SSE popcntq to count bits on my machine.

julia> x = 5
5

julia> @code_native count_ones(x)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ int.jl:354 within `count_ones'
	popcntq	%rdi, %rax
	retq
	nopw	%cs:(%rax,%rax)
; â””
3 Likes

That’s interesting. It seems that popcount can be optimized even further by using AVX2 instructions https://arxiv.org/pdf/1611.07612.pdf .

1 Like

could you check what code native returns for you? my cpu has no avx2

Unfortunately I don’t have access to a Haswell CPU currently. But it has been pointed to me that this is really a LLVM question rather than a Julia one.

I bet it gonna use this AVX2 instruction if you run it on a computer with AVX2. This is one of Julia’s strengths: you are not limited by library compiled for specific instruction set (AFAIK for instance no AMD CPUs include AVX2 instructions)

On my system, yes, it’s using AVX-512 instructions:

julia> @code_native debuginfo=:none count(a)
	.section	__TEXT,__text,regular,pure_instructions
	movq	(%rdi), %rax
	movq	8(%rax), %rcx
	testq	%rcx, %rcx
	jle	L45
	movq	%rcx, %rdx
	sarq	$63, %rdx
	andnq	%rcx, %rdx, %rcx
	movq	(%rax), %rdx
	cmpq	$16, %rcx
	jae	L48
	movl	$1, %esi
	xorl	%eax, %eax
	jmp	L349
L45:
	xorl	%eax, %eax
	retq
L48:
	movabsq	$9223372036854775792, %r8 ## imm = 0x7FFFFFFFFFFFFFF0
	andq	%rcx, %r8
	leaq	1(%r8), %rsi
	vpxor	%xmm0, %xmm0, %xmm0
	xorl	%eax, %eax
	movabsq	$4607319040, %rdi       ## imm = 0x1129E1C00
	vmovdqa	(%rdi), %ymm1
	movabsq	$4607319072, %rdi       ## imm = 0x1129E1C20
	vmovdqa	(%rdi), %ymm2
	vpxor	%xmm3, %xmm3, %xmm3
	vpxor	%xmm4, %xmm4, %xmm4
	vpxor	%xmm5, %xmm5, %xmm5
	vpxor	%xmm6, %xmm6, %xmm6
	nopw	%cs:(%rax,%rax)
	nopl	(%rax)
L128:
	vmovdqu	(%rdx,%rax,8), %ymm7
	vmovdqu	32(%rdx,%rax,8), %ymm8
	vmovdqu	64(%rdx,%rax,8), %ymm9
	vmovdqu	96(%rdx,%rax,8), %ymm10
	vpand	%ymm1, %ymm7, %ymm11
	vpshufb	%ymm11, %ymm2, %ymm11
	vpsrlw	$4, %ymm7, %ymm7
	vpand	%ymm1, %ymm7, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpaddb	%ymm11, %ymm7, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm3, %ymm7, %ymm3
	vpand	%ymm1, %ymm8, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpsrlw	$4, %ymm8, %ymm8
	vpand	%ymm1, %ymm8, %ymm8
	vpshufb	%ymm8, %ymm2, %ymm8
	vpaddb	%ymm7, %ymm8, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm4, %ymm7, %ymm4
	vpand	%ymm1, %ymm9, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpsrlw	$4, %ymm9, %ymm8
	vpand	%ymm1, %ymm8, %ymm8
	vpshufb	%ymm8, %ymm2, %ymm8
	vpaddb	%ymm7, %ymm8, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm5, %ymm7, %ymm5
	vpand	%ymm1, %ymm10, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpsrlw	$4, %ymm10, %ymm8
	vpand	%ymm1, %ymm8, %ymm8
	vpshufb	%ymm8, %ymm2, %ymm8
	vpaddb	%ymm7, %ymm8, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm6, %ymm7, %ymm6
	addq	$16, %rax
	cmpq	%rax, %r8
	jne	L128
	vpaddq	%ymm3, %ymm4, %ymm0
	vpaddq	%ymm0, %ymm5, %ymm0
	vpaddq	%ymm0, %ymm6, %ymm0
	vextracti128	$1, %ymm0, %xmm1
	vpaddq	%xmm1, %xmm0, %xmm0
	vpshufd	$78, %xmm0, %xmm1       ## xmm1 = xmm0[2,3,0,1]
	vpaddq	%xmm1, %xmm0, %xmm0
	vmovq	%xmm0, %rax
	cmpq	%r8, %rcx
	je	L371
L349:
	decq	%rsi
L352:
	xorl	%edi, %edi
	popcntq	(%rdx,%rsi,8), %rdi
	addq	%rdi, %rax
	incq	%rsi
	cmpq	%rsi, %rcx
	jne	L352
L371:
	vzeroupper
	retq
	nopw	(%rax,%rax)
3 Likes

I imagine these are the AVX-512 instructions (using write masks)?
Interesting that it is only using ymm registers.

Sometimes downclocking could happen when using 512 registers, maybe that’s why LLVM decides to use 256 bit ones. It could hit performance when mixed with not vectorized code. Reference.

I’ll try running a series of benchmarks in the next few months.
My machine runs non-avx/avx/avx512 code at 4.6/4.3/4.1 GHz. The difference between 4.3 and 4.1 GHz isn’t very significant.

However, reading the blog post, it notes

Similarly, if you are using 256-bit heavy instructions in a sustained manner, you will move to L1. The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1: the processor stops and changes its frequency as soon as an instruction is encountered.

The 4x slower execution was also observed in this post:

Ctrl + F for “with an IPC of ~0.25” to find it. It also includes graphs showing this behavior.

The slowdown lasted about 9 microseconds.
In other words, if a function takes around that time to run, and you aren’t executing a lot of other AVX512 code (so that the CPU is already in that state), that function will run at 1/4 speed if it uses 512 bit instructions. It will thus almost certainly be faster if it does not, and runs code similar to what the rest of the program is doing, so that the CPU can stay in the same “license”.

When running numerical work, my CPU stays in the AVX512 license (as I can monitor with watch -n0.5 "cat /proc/cpuinfo | grep MHz") , so I’d like to be able to keep it there.

Another concern of mine is that LLVM handles remainders on loops poorly, and it produces twice the remainder when using full-width AVX512. You can look at all the graphs in my opening post from [ANN] LoopVectorization, where Clang and Julia consistently show this pattern of behavior:
image
They would benefit a lot just by having a shorter remainder.
This wasn’t a problem for LoopVectorization, the GNU compilers, or Intel compilers. Just LLVM.
(Both GNU and Intel also avoid AVX512 unless specified with a compilation flag; I did provide the flags for the above example. Meaning they must have decided avoiding it was worth it, despite not having the remainder issue.)

3 Likes

Wow, I read through this thread, really great package, I will try it for sure, I have some nice idea to benchmark it.
There is one more concern that LLVM could take into account, I mean heat dissipation. Despite the fact CPU has slower clock, 512 bit operations might need much more power. It will lead to throttling on laptops and desktops with stock cooling. My i7 4770k with a stock cooler was throttled all the time (I know it had no AVX-512, but still)

1 Like