Number of values in an array above a threshold

question
performance

#1

Hello I’m new to Julia and


#2

That’s right, randn(N) returns an array of N standard normal random numbers. (By the way, you can quote code by putting it between backticks ```).

For the second part, have you seen the count function in Julia? count(p, xs) takes a list xs of elements, applies a function p that returns true/false to each element, and returns the number of trues. The function p could be written as p = x -> (x > threshold), where threshold is a variable. The comparison x > threshold returns a Boolean, which could be what you need.
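A minimal sketch of that suggestion, assuming a threshold of 1.96 and a sample size of 1000 chosen only for illustration (the names xs, p, and threshold follow the description above; n_above is made up for the example):

```julia
# Draw 1000 standard normal samples (sample size chosen for illustration).
xs = randn(1000)

# Predicate: true when an element exceeds the threshold.
threshold = 1.96
p = x -> (x > threshold)

# count applies p to each element and returns the number of `true` results.
n_above = count(p, xs)
println(n_above)
```

The printed number varies from run to run because the samples are random.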


#3

I’m sorry, I didn’t specify that a hint suggests using an element-wise logical comparison, made by adding a . (dot) to the logical operator. Any idea how to do that?


#4

You might want to look at section 5.4.1 of the Julia manual


#5

It doesn’t say anything about how to compute the values above and below 1.96 in a standard normal distribution using the dot to access the array of random numbers


#6

This has a dot in it:

v0.5.0> a = randn(1000);

v0.5.0> length(a[a .< -1.96])
24


#7

Thank you :smiley:


#8

But that’s not a good hint. You should do as @felix suggests:

count(x -> x < -1.96, a)

Faster, and more Julian.


#9

I guess the answer is either count(x -> (x > 1.96) || (x < -1.96), x) or countnz((x .> 1.96) | (x .< -1.96)) if you like the dot notation. But as @felix and @DNF said, the first version is more Julian.

I am not sure whether this creates an additional array.


#10

As far as I can tell, it creates two extra arrays(!)
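
To see where the temporaries come from, here is a sketch in current Julia (1.x) syntax, assuming the snippet in question is the dot-indexed form from post #6. The names mask, selected, n_indexed, n_mask, and n_pred are illustrative only.

```julia
a = randn(1000)

# First temporary: the element-wise comparison allocates a Boolean mask.
mask = a .< -1.96

# Second temporary: logical indexing copies the selected elements into a new array.
selected = a[mask]
n_indexed = length(selected)

# Counting the trues in the mask skips the second temporary.
n_mask = count(mask)

# A predicate passed to count needs no temporary array at all.
n_pred = count(x -> x < -1.96, a)
```

All three counts agree; they differ only in how much memory is allocated along the way. In Julia 1.x, countnz is no longer available and count accepts a Boolean array directly, while the element-wise "or" is spelled .| rather than |.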


#11

You can also do

sum(randn() > 1.96 for i in 1:N)

(on Julia 0.5 or later).

This uses a generator that does not actually ever create the array of random numbers, and so is more efficient for large N:

julia> f1(N) = count(x->x>1.96, randn(N))
f1 (generic function with 2 methods)

julia> f2(N) = sum(randn() > 1.96 for i in 1:N)
f2 (generic function with 2 methods)

julia> f3(N) = sum(randn(N) .> 1.96)
f3 (generic function with 1 method)

julia> f4(N) = (a = randn(N); length(a[a .< -1.96]))
f4 (generic function with 1 method)

julia> @time f1(10^8)
  1.742257 seconds (8 allocations: 762.940 MB, 5.10% gc time)
2499801

julia> @time f2(10^8)
  0.879822 seconds (8 allocations: 256 bytes)
2502178

julia> @time f3(10^8)
  1.301290 seconds (73.26 k allocations: 778.590 MB, 7.58% gc time)
2501985

julia> @time f4(10^8)
  1.533186 seconds (73.26 k allocations: 797.675 MB, 1.40% gc time)
2501473

Note that, on my machine at least, the original suggestion (my f4) is faster than f1.

A reminder that for these kinds of performance tests, everything should be in a function, and timed only on the second run. Also, I should really be using BenchmarkTools.jl for this.
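
The warm-up advice can be sketched with Base alone, assuming a smaller N than in the timings above so that it runs quickly. f2 is copied from the session above; t is just an illustrative name. BenchmarkTools.jl handles this more carefully, so treat this only as a rough check.

```julia
# Same definition as f2 in the session above.
f2(N) = sum(randn() > 1.96 for i in 1:N)

# The first call includes compilation, so treat it as a warm-up and discard it.
f2(10)

# The second call measures run time only; @elapsed returns seconds as a Float64.
t = @elapsed f2(10^6)
println("f2(10^6) took ", t, " seconds")
```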


#12

It’s even slightly faster to do

f4(N) = sum(i -> randn() > 1.96, 1:N)

on my machine.


#13

Keep in mind that suggestion f1 assumed that the array a already existed.
Edit: still that is an extremely weird result.


#14

True. Here is the original version:

g1(r) = count(x->x>1.96, r)
g2(r) = length(r[r .> 1.96])
g3(r) = sum(r .> 1.96)
g4(r) = sum(i > 1.96 for i in r)
g5(r) = sum(x->x>1.96, r)

function run_bench(N)
    r = randn(N)
    
    @time g1(r)
    @time g2(r)
    @time g3(r)
    @time g4(r)
    @time g5(r)
    
end

After warm-up:

julia> run_bench(10^6)
0.001808 seconds
0.003731 seconds (745 allocations: 361.594 KB)
0.001243 seconds (741 allocations: 164.734 KB)
0.000444 seconds (2 allocations: 32 bytes)
0.000450 seconds

julia> run_bench(10^8)
0.134395 seconds
0.388841 seconds (73.25 k allocations: 34.730 MB)
0.195034 seconds (73.25 k allocations: 15.650 MB, 35.30% gc time)
0.078795 seconds (2 allocations: 32 bytes)
0.074950 seconds


I have opened an issue to deprecate `count` in favour of `sum`:
https://github.com/JuliaLang/julia/issues/20663

#15

Yeah, there’s clearly something wrong with the implementation of count. I do think that it is semantically different from sum, and should not be deprecated.
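
One way to illustrate the distinction, offered as an interpretation rather than the poster's stated reasoning: in Julia 1.x, count expects Boolean values, whereas sum adds up whatever the function returns. The array r and the other names below are made up for the example.

```julia
r = [-2.5, 0.3, 2.1, 3.0]

# With a Boolean predicate the two agree: both report how many elements pass.
n_count = count(x -> x > 1.96, r)
n_sum = sum(x -> x > 1.96, r)

# With a non-Boolean function they diverge: sum adds the values themselves...
total = sum(x -> 2x, r)

# ...while count rejects non-Boolean results with an error.
count_failed = try
    count(x -> 2x, r)
    false
catch
    true
end
```

So sum will silently total non-Boolean values, while count states the intent "how many elements satisfy this" and fails loudly otherwise.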