I see at least two problems.
Why do you insist on the parentheses? (Not a problem if they make things clearer to you, but it would be if Julia had different precedence.) They have no effect when juxtaposition is used; you get the precedence you would want from math, same as:
julia> x ./ 2π # and even x / 2π without the dot
2×2 Matrix{Float64}:
-0.19802 0.208875
-0.0571671 0.229319
but different from:
julia> x ./ 2*π # equivalent to (x ./ 2)*π, likely why you used them out of habit.
2×2 Matrix{Float32}:
-1.95438 2.06152
-0.564216 2.26329
but note that at least you got the right element type (and would have for whatever constant you put next to the pi). So you could do:
julia> x / 2 * 1/π # just less clear, but good to know... and the type is correct:
2×2 Matrix{Float32}:
-0.19802 0.208875
-0.0571671 0.229319
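The precedence difference is easy to check directly; a quick sketch:

```julia
# Juxtaposition (2π) binds tighter than / or *, so these two differ:
a = 1 / 2π       # parses as 1 / (2 * π)  -- usually what you want
b = 1 / 2 * π    # parses as (1 / 2) * π
@assert a ≈ 1 / (2 * π)
@assert b ≈ π / 2
@assert a != b
```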
Would there be a possibility for Julia to interpret x / 2π as what is semantically equivalent in math, i.e. the above (x / 2) * (1/π) = 0.5x/π (or, since that constant would be Float64, rather 0.5f0x/π, i.e. a two-over-pi constant in Float32 times x)?
@code_native c/2 gives me vdivss (on brand-new 1.9.0-DEV.1078), but should preferably give a multiply instruction.
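You can do that rewrite by hand: dividing a float by 2 gives bit-for-bit the same result as multiplying by 0.5f0 (a power of two, so no extra rounding), and a multiply is what the compiler should then emit. A sketch (`halve` is a made-up name):

```julia
# Multiply by the exact reciprocal 0.5f0 instead of dividing by 2;
# this should compile to a vmulss rather than a vdivss.
halve(c::Float32) = c * 0.5f0

c = 1.2345f0
@assert halve(c) == c / 2   # exact equality: 0.5 is a power of two
```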
Which brings me to the second problem: the division(s) got me thinking, is this slow? And timing turns out to be non-trivial (here for a 200×200 matrix):
julia> @time x_large / 2 * 1/π;
0.000133 seconds (6 allocations: 468.891 KiB)
julia> @time x_large / 2 * 1/π;
0.000330 seconds (6 allocations: 468.891 KiB)
julia> @time x_large / 2 * 1/π;
0.000469 seconds (6 allocations: 468.891 KiB)
The next 10 timings were just as slow, never as fast as the first one, so I suspected load from the web browser and killed it (well, actually suspended it):
julia> @time x_large / 2 * 1/π;
0.010223 seconds (6 allocations: 468.891 KiB, 95.40% gc time)
I had just "killed" Firefox before the excessive timing above; I guess the GC run is a coincidence, not induced by killing Firefox.
Time went down for a while, then up again:
julia> @time x_large / 2 * 1/π;
0.000128 seconds (6 allocations: 468.891 KiB)
julia> @time x_large / 2 * 1/π;
0.000128 seconds (6 allocations: 468.891 KiB)
julia> @time x_large / 2 * 1/π;
0.000126 seconds (6 allocations: 468.891 KiB)
julia> @time x_large / 2 * 1/π;
0.000402 seconds (6 allocations: 468.891 KiB)
julia> @time x_large / 2 * 1/π;
0.000425 seconds (6 allocations: 468.891 KiB)
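Single @time calls like these are dominated by GC pauses and background load; taking the minimum over many samples (which is what BenchmarkTools' @btime does for you) is more reliable. A base-Julia sketch, with a hypothetical wrapper `g`:

```julia
# Take the minimum over many samples to filter out GC and OS noise.
x_large = rand(Float32, 200, 200)
g(x) = x / 2 * 1/π
g(x_large)  # warm up: the first call includes compilation
t_min = minimum(@elapsed(g(x_large)) for _ in 1:1_000)
println("min time: ", t_min, " s")
```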
A better way:
julia> two_over_pi = Float32(1/2π)
julia> @time x_large .*= two_over_pi;
0.000070 seconds (2 allocations: 64 bytes)
julia> GC.gc(true) # seems to guarantee I see minimal time after.
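As a sanity check (with made-up variable names), the fused one-pass broadcast gives the same values, up to a rounding or two, as the three-pass expression:

```julia
x = rand(Float32, 4, 4)
three_pass = x / 2 * 1/π          # allocates an intermediate array per step
one_pass   = x .* Float32(1/2π)   # one fused pass, one allocation
@assert three_pass ≈ one_pass
```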
julia> @code_native @inbounds x_large / 2 * 1/π; # the output is very long, even longer without @inbounds, but misleadingly much shorter (only one page) with:
julia> f(x_large) = @inbounds x_large / 2 * 1/π
julia> @code_native f(x_large)
because I see three callq and no div or mul, so the code is likely still large, just elsewhere. Timing is the same, so in this case it is maybe NOT better to wrap in a function (as usually recommended), at least for inspecting the code.
Does anyone know whether broadcasting implies using threads? Or is that always in your hands? It should be easy to fix by writing a loop, but this didn't work, with or without a dot:
julia> @time @Threads.threads x_large .*= two_over_pi;
ERROR: LoadError: ArgumentError: @threads requires a `for` loop expression
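As the error says, @threads needs an explicit for loop, so the fix is to write one. A sketch (`scale!` is a made-up name), which only actually runs in parallel if Julia was started with threads, e.g. julia -t 4:

```julia
# In-place scaling with an explicit threaded loop; @threads splits
# the linear index range across the available threads.
function scale!(A::AbstractArray{Float32}, c::Float32)
    Threads.@threads for i in eachindex(A)
        @inbounds A[i] *= c
    end
    return A
end

x_large = rand(Float32, 200, 200)
scale!(x_large, Float32(1/2π))
```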
julia> f!(x_large) = @inbounds x_large .*= Float32(1/2π)
f! (generic function with 1 method)
julia> @time f!(x_large);
0.000052 seconds
That gives the longest assembly, but all of it in one place: no call, and only mul, no div, unless I missed something.