Number of values in an array above a threshold

Andrea_Tersigni · February 18, 2017, 8:21am

Hello I’m new to Julia and

felix · February 18, 2017, 9:12am

That’s right, randn(N) returns a vector of N normally distributed random numbers. (By the way, you can quote code by putting it between backticks ```).

For the second part, have you seen the count function in Julia? count(p, xs) takes a list xs of elements, applies a function p that returns true or false to each element, and then returns the number of trues. The function p could be written something like p = x -> (x > threshold), where threshold is a variable. The comparison x > threshold returns a boolean, which could be what you need.
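As a minimal sketch of the count approach described above (the seed, the array length, and the variable names a and n_above are arbitrary choices for illustration, not from the original question):

```julia
using Random

Random.seed!(1)                  # arbitrary seed, only to make the run repeatable
a = randn(1000)                  # 1000 standard normal samples
threshold = 1.96

p = x -> x > threshold           # predicate: returns true or false for one element
n_above = count(p, a)            # number of elements for which p is true

println(n_above, " of ", length(a), " values exceed ", threshold)
```

For a standard normal distribution, roughly 2.5% of the values should fall above 1.96, so the count should be in the neighbourhood of 25 for 1000 samples.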

Andrea_Tersigni · February 18, 2017, 9:17am

I’m sorry, I didn’t specify that a hint suggests using an element-wise logical comparison, made by adding a . (dot) to the comparison operator. Any idea how to do that?

Simon_Bolland · February 18, 2017, 10:35am

You might want to look at section 5.4.1 of the Julia manual

Andrea_Tersigni · February 18, 2017, 10:41am

It doesn’t say anything about how to compute the values above and below 1.96 in a standard normal distribution using the dot to access the array of random numbers

cormullion · February 18, 2017, 11:46am

This has a dot in it:

v0.5.0> a = randn(1000);

v0.5.0> length(a[a .< -1.96])
24

Andrea_Tersigni · February 18, 2017, 12:19pm

Thank you

DNF · February 18, 2017, 12:45pm

But that’s not a good hint. You should do as @felix suggests:

count(x->x<-1.96, a)

More faster, more Julian.

mzaffalon · February 18, 2017, 3:33pm

I guess the answer is either count(x -> (x > 1.96) || (x < -1.96), a) or countnz((a .> 1.96) | (a .< -1.96)) if you like the dot notation. But as @felix and @DNF said, the first version is more Julian.
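For readers on Julia 1.0 or later, here is a hedged sketch of the same two-sided count. It assumes a is a vector of samples; countnz no longer exists in those versions, and the element-wise "or" has to be written with a dot as .|, so the dotted variant below uses count on the boolean mask instead.

```julia
a = randn(10_000)                # sample data, size chosen arbitrarily

# Predicate form: no temporary arrays are created.
n_pred = count(x -> (x > 1.96) || (x < -1.96), a)

# Equivalent predicate using the absolute value.
n_abs = count(x -> abs(x) > 1.96, a)

# Dotted form: builds temporary boolean arrays before counting.
n_dot = count((a .> 1.96) .| (a .< -1.96))

println((n_pred, n_abs, n_dot))
```

All three expressions count the same elements; they differ only in how much temporary memory they use.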

I am not sure whether this

creates an additional array.

DNF · February 18, 2017, 4:13pm

As far as I can tell, it creates two extra arrays(!)

dpsanders · February 18, 2017, 4:20pm

You can also do

sum(randn() > 1.96 for i in 1:N)

(on Julia 0.5 or later).

This uses a generator that does not actually ever create the array of random numbers, and so is more efficient for large N:

julia> f1(N) = count(x->x>1.96, randn(N))
f1 (generic function with 2 methods)

julia> f2(N) = sum(randn() > 1.96 for i in 1:N)
f2 (generic function with 2 methods)

julia> f3(N) = sum(randn(N) .> 1.96)
f3 (generic function with 1 method)

julia> f4(N) = (a = randn(N); length(a[a .< -1.96]))
f4 (generic function with 1 method)

julia> @time f1(10^8)
  1.742257 seconds (8 allocations: 762.940 MB, 5.10% gc time)
2499801

julia> @time f2(10^8)
  0.879822 seconds (8 allocations: 256 bytes)
2502178

julia> @time f3(10^8)
  1.301290 seconds (73.26 k allocations: 778.590 MB, 7.58% gc time)
2501985

julia> @time f4(10^8)
  1.533186 seconds (73.26 k allocations: 797.675 MB, 1.40% gc time)
2501473

Note that, on my machine at least, the original suggestion (my f4) is faster than f1.

A reminder that for these kinds of performance tests, everything should be in a function, and timed only on the second run. Also, I should really be using BenchmarkTools.jl for this.

stevengj · February 18, 2017, 4:25pm

It’s even slightly faster to do

f4(N) = sum(i -> randn() > 1.96, 1:N)

on my machine.

DNF · February 18, 2017, 4:51pm

Keep in mind that suggestion f1 assumed that the array a already existed.
Edit: still that is an extremely weird result.

dpsanders · February 18, 2017, 5:31pm

True. Here is the original version:

g1(r) = count(x->x>1.96, r)
g2(r) = length(r[r .> 1.96])
g3(r) = sum(r .> 1.96)
g4(r) = sum(i > 1.96 for i in r)
g5(r) = sum(x->x>1.96, r)

function run_bench(N)
    r = randn(N)
    
    @time g1(r)
    @time g2(r)
    @time g3(r)
    @time g4(r)
    @time g5(r)
    
end

After warm-up:

julia> run_bench(10^6)
0.001808 seconds
0.003731 seconds (745 allocations: 361.594 KB)
0.001243 seconds (741 allocations: 164.734 KB)
0.000444 seconds (2 allocations: 32 bytes)
0.000450 seconds

julia> run_bench(10^8)
0.134395 seconds
0.388841 seconds (73.25 k allocations: 34.730 MB)
0.195034 seconds (73.25 k allocations: 15.650 MB, 35.30% gc time)
0.078795 seconds (2 allocations: 32 bytes)
0.074950 seconds


I have opened an issue to deprecate `count` in favour of `sum`:
https://github.com/JuliaLang/julia/issues/20663

DNF · February 18, 2017, 7:00pm

Yeah, there’s clearly something wrong with the implementation of count. I do think that it is semantically different from sum, and should not be deprecated.

Shuhua · September 16, 2020, 9:41am

More than three years have passed since this topic began, though this is the first time I have read the thread. In the 2017 discussion above, count was blamed for being slow. With Julia 1.5 in 2020, I ran a new benchmark reusing @dpsanders’s code (except that I replaced @time with @btime from BenchmarkTools for higher accuracy). The results are listed below.

julia> run_bench(10^6)
  361.700 μs (0 allocations: 0 bytes)
  676.700 μs (6 allocations: 322.50 KiB)
  523.100 μs (4 allocations: 126.42 KiB)
  361.999 μs (0 allocations: 0 bytes)
  358.700 μs (0 allocations: 0 bytes)

julia> run_bench(10^8)
  51.515 ms (0 allocations: 0 bytes)
  109.681 ms (6 allocations: 30.99 MiB)
  78.937 ms (4 allocations: 11.93 MiB)
  51.684 ms (0 allocations: 0 bytes)
  52.105 ms (0 allocations: 0 bytes)

Now it appears that count and the last two sum versions are equally fast, and faster than the rest. By contrast, the sum in g3 takes more time because it allocates a temporary array r .> 1.96. The length version in g2 is the slowest for the same allocation reason. This result is reasonable because all methods have the same complexity O(N), though some of them (g2 and g3) are slower due to the unnecessary allocation.

The take-home message is that we can now safely use count as in g1 without worrying about its efficiency. (However, do note that g6(r) = count(r .> 1.96) is as slow as g3; avoid memory allocation as much as you can.)

[Update: as pointed out by @rafael.guerra and @stevengj, the local variable r in run_bench should be used like @btime g1($r) (i.e., like string interpolation). Alternatively, we may declare r as a constant global variable outside run_bench like const r = randn(N).]

rafael.guerra · September 16, 2020, 1:42pm

Thanks for sharing the latest results using @btime.
PS: In the VS Code editor, I had to declare ‘r’ as global in @dpsanders’s function run_bench(), otherwise I get the error:

ERROR: UndefVarError: r not defined

stevengj · September 16, 2020, 1:59pm

For @btime you should do @btime g1($r) etcetera.

Also, @btime doesn’t need such a large array for benchmarking:

julia> run_bench(10^2)
  12.859 ns (0 allocations: 0 bytes)
  287.456 ns (3 allocations: 224 bytes)
  245.003 ns (2 allocations: 128 bytes)
  15.055 ns (0 allocations: 0 bytes)
  18.504 ns (0 allocations: 0 bytes)

rafael.guerra · September 16, 2020, 2:17pm

Thank you for the feedback on the proper @btime syntax.
Concerning the tip to use a small array in @btime, are the results representative? How should we interpret the fact that g2 is 22 times slower than g1 with a small array (10^2), while with a large array (10^8) the difference is only a factor of 2?

stevengj · September 16, 2020, 2:20pm

This is a real effect: for smaller arrays, the overhead of allocating a temporary array is more significant. Whether this is “representative” or not depends on your problem, of course.
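The allocation overhead described above can be checked directly with Base's @allocated macro. This is only a rough sketch: the helper name allocated_bytes is hypothetical, g1 and g2 are copied from the benchmark earlier in the thread, and the exact byte counts depend on the Julia version.

```julia
g1(r) = count(x -> x > 1.96, r)      # predicate form, no temporary array
g2(r) = length(r[r .> 1.96])         # builds a boolean mask and a filtered copy

# Hypothetical helper: bytes allocated by a single call of f(r).
function allocated_bytes(f, r)
    f(r)                             # warm-up call so compilation is not measured
    return @allocated f(r)
end

r = randn(100)
bytes_g1 = allocated_bytes(g1, r)
bytes_g2 = allocated_bytes(g2, r)
println("g1: ", bytes_g1, " bytes, g2: ", bytes_g2, " bytes")
```

For a small array the work per element is tiny, so the fixed cost of the allocations in g2 accounts for most of its run time, whereas for a large array that cost is spread over many more elements.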
