How to create a random binary array with b bits?

lesshaste · January 28, 2020, 9:51pm

I would like to make a random binary array with b bits. In fact I will make a lot of them and compute their Hamming distances which I need to be able to do fast. At a low level that means computing XOR between binary arrays and then using count_ones.

How can I make the random binary arrays?

Jakub_Wronowski · January 28, 2020, 10:01pm

rand(Bool, 1000) |> BitArray

or solving your problem:

julia> b=100;

julia> x = rand(Bool, b) |> BitArray;

julia> y = rand(Bool, b) |> BitArray;

julia> sum(xor.(x,y))
81

DNF · January 28, 2020, 10:15pm

julia> using Random: bitrand

julia> bitrand(10)
10-element BitArray{1}:
 0
 0
 0
 1
 0
 0
 1
 1
 0
 0

Performance of bitrand is quite good:

julia> b = 10^6;

julia> @btime BitArray(rand(Bool, $b));
  1.623 ms (5 allocations: 1.07 MiB)

julia> @btime bitrand($b);
  20.637 μs (3 allocations: 122.23 KiB)

Jakub_Wronowski · January 28, 2020, 10:24pm

It’s blazing fast. Nice!

StefanKarpinski · January 29, 2020, 2:34am

Total thing seems pretty fast too:

julia> function count_rand_xor(n)
           a = bitrand(n)
           b = bitrand(n)
           count(a .⊻ b)
       end
count_rand_xor (generic function with 1 method)

julia> @time count_rand_xor(10^6)
  0.000189 seconds (9 allocations: 366.703 KiB)
500442

lesshaste · January 29, 2020, 10:05am

Thank you so much! Do you happen to know how count is implemented at a low level?

Jakub_Wronowski · January 29, 2020, 10:21am

You can type @edit count(BitArray([1])) to be moved directly to the implementation in a code editor.

I checked that it loops over internal BitArray vector of integers, and cals count_ones on them. This functions is super fast, it uses SSE popcntq to count bits on my machine.

julia> x = 5
5

julia> @code_native count_ones(x)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ int.jl:354 within `count_ones'
	popcntq	%rdi, %rax
	retq
	nopw	%cs:(%rax,%rax)
; └

lesshaste · January 29, 2020, 10:41am

That’s interesting. It seems that popcount can be optimized even further by using AVX2 instructions https://arxiv.org/pdf/1611.07612.pdf .

Jakub_Wronowski · January 29, 2020, 10:47am

could you check what code native returns for you? my cpu has no avx2

lesshaste · January 29, 2020, 12:23pm

Unfortunately I don’t have access to a Haswell CPU currently. But it has been pointed to me that this is really a LLVM question rather than a Julia one.

Jakub_Wronowski · January 29, 2020, 12:50pm

I bet it gonna use this AVX2 instruction if you run it on a computer with AVX2. This is one of Julia’s strengths: you are not limited by library compiled for specific instruction set (~~AFAIK for instance no AMD CPUs include AVX2 instructions~~)

StefanKarpinski · January 29, 2020, 4:42pm

On my system, yes, it’s using AVX-512 instructions:

julia> @code_native debuginfo=:none count(a)
	.section	__TEXT,__text,regular,pure_instructions
	movq	(%rdi), %rax
	movq	8(%rax), %rcx
	testq	%rcx, %rcx
	jle	L45
	movq	%rcx, %rdx
	sarq	$63, %rdx
	andnq	%rcx, %rdx, %rcx
	movq	(%rax), %rdx
	cmpq	$16, %rcx
	jae	L48
	movl	$1, %esi
	xorl	%eax, %eax
	jmp	L349
L45:
	xorl	%eax, %eax
	retq
L48:
	movabsq	$9223372036854775792, %r8 ## imm = 0x7FFFFFFFFFFFFFF0
	andq	%rcx, %r8
	leaq	1(%r8), %rsi
	vpxor	%xmm0, %xmm0, %xmm0
	xorl	%eax, %eax
	movabsq	$4607319040, %rdi       ## imm = 0x1129E1C00
	vmovdqa	(%rdi), %ymm1
	movabsq	$4607319072, %rdi       ## imm = 0x1129E1C20
	vmovdqa	(%rdi), %ymm2
	vpxor	%xmm3, %xmm3, %xmm3
	vpxor	%xmm4, %xmm4, %xmm4
	vpxor	%xmm5, %xmm5, %xmm5
	vpxor	%xmm6, %xmm6, %xmm6
	nopw	%cs:(%rax,%rax)
	nopl	(%rax)
L128:
	vmovdqu	(%rdx,%rax,8), %ymm7
	vmovdqu	32(%rdx,%rax,8), %ymm8
	vmovdqu	64(%rdx,%rax,8), %ymm9
	vmovdqu	96(%rdx,%rax,8), %ymm10
	vpand	%ymm1, %ymm7, %ymm11
	vpshufb	%ymm11, %ymm2, %ymm11
	vpsrlw	$4, %ymm7, %ymm7
	vpand	%ymm1, %ymm7, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpaddb	%ymm11, %ymm7, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm3, %ymm7, %ymm3
	vpand	%ymm1, %ymm8, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpsrlw	$4, %ymm8, %ymm8
	vpand	%ymm1, %ymm8, %ymm8
	vpshufb	%ymm8, %ymm2, %ymm8
	vpaddb	%ymm7, %ymm8, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm4, %ymm7, %ymm4
	vpand	%ymm1, %ymm9, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpsrlw	$4, %ymm9, %ymm8
	vpand	%ymm1, %ymm8, %ymm8
	vpshufb	%ymm8, %ymm2, %ymm8
	vpaddb	%ymm7, %ymm8, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm5, %ymm7, %ymm5
	vpand	%ymm1, %ymm10, %ymm7
	vpshufb	%ymm7, %ymm2, %ymm7
	vpsrlw	$4, %ymm10, %ymm8
	vpand	%ymm1, %ymm8, %ymm8
	vpshufb	%ymm8, %ymm2, %ymm8
	vpaddb	%ymm7, %ymm8, %ymm7
	vpsadbw	%ymm0, %ymm7, %ymm7
	vpaddq	%ymm6, %ymm7, %ymm6
	addq	$16, %rax
	cmpq	%rax, %r8
	jne	L128
	vpaddq	%ymm3, %ymm4, %ymm0
	vpaddq	%ymm0, %ymm5, %ymm0
	vpaddq	%ymm0, %ymm6, %ymm0
	vextracti128	$1, %ymm0, %xmm1
	vpaddq	%xmm1, %xmm0, %xmm0
	vpshufd	$78, %xmm0, %xmm1       ## xmm1 = xmm0[2,3,0,1]
	vpaddq	%xmm1, %xmm0, %xmm0
	vmovq	%xmm0, %rax
	cmpq	%r8, %rcx
	je	L371
L349:
	decq	%rsi
L352:
	xorl	%edi, %edi
	popcntq	(%rdx,%rsi,8), %rdi
	addq	%rdi, %rax
	incq	%rsi
	cmpq	%rsi, %rcx
	jne	L352
L371:
	vzeroupper
	retq
	nopw	(%rax,%rax)

Elrod · January 29, 2020, 5:04pm

I imagine these are the AVX-512 instructions (using write masks)?
Interesting that it is only using ymm registers.

Jakub_Wronowski · January 29, 2020, 5:42pm

Sometimes downclocking could happen when using 512 registers, maybe that’s why LLVM decides to use 256 bit ones. It could hit performance when mixed with not vectorized code. Reference.

Elrod · January 29, 2020, 6:01pm

I’ll try running a series of benchmarks in the next few months.
My machine runs non-avx/avx/avx512 code at 4.6/4.3/4.1 GHz. The difference between 4.3 and 4.1 GHz isn’t very significant.

However, reading the blog post, it notes

Similarly, if you are using 256-bit heavy instructions in a sustained manner, you will move to L1. The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1: the processor stops and changes its frequency as soon as an instruction is encountered.

The 4x slower execution was also observed in this post:

Ctrl + F for “with an IPC of ~0.25” to find it. It also includes graphs showing this behavior.

The slowdown lasted about 9 microseconds.
In other words, if a function takes around that time to run, and you aren’t executing a lot of other AVX512 code (so that the CPU is already in that state), that function will run at 1/4 speed if it uses 512 bit instructions. It will thus almost certainly be faster if it does not, and runs code similar to what the rest of the program is doing, so that the CPU can stay in the same “license”.

When running numerical work, my CPU stays in the AVX512 license (as I can monitor with watch -n0.5 "cat /proc/cpuinfo | grep MHz") , so I’d like to be able to keep it there.

Another concern of mine is that LLVM handles remainders on loops poorly, and it produces twice the remainder when using full-width AVX512. You can look at all the graphs in my opening post from [ANN] LoopVectorization, where Clang and Julia consistently show this pattern of behavior:

They would benefit a lot just by having a shorter remainder.
This wasn’t a problem for LoopVectorization, the GNU compilers, or Intel compilers. Just LLVM.
(Both GNU and Intel also avoid AVX512 unless specified with a compilation flag; I did provide the flags for the above example. Meaning they must have decided avoiding it was worth it, despite not having the remainder issue.)

Jakub_Wronowski · January 29, 2020, 8:07pm

Wow, I read through this thread, really great package, I will try it for sure, I have some nice idea to benchmark it.
There is one more concern that LLVM could take into account, I mean heat dissipation. Despite the fact CPU has slower clock, 512 bit operations might need much more power. It will lead to throttling on laptops and desktops with stock cooling. My i7 4770k with a stock cooler was throttled all the time (I know it had no AVX-512, but still)

Topic		Replies	Views
Efficient binary vector New to Julia question	33	3644	July 1, 2018
Optimization: How to make sure XOR is performed in chunks New to Julia	33	2309	July 6, 2022
Why is BitArray so slow? Performance	28	5841	January 14, 2022
Shifting the bit in the array General Usage	29	1018	June 22, 2023
Performance hacking `(&)(a::BitArray, b::BitArray)` Performance	14	228	October 22, 2024

How to create a random binary array with b bits?

Related topics