One question/comment: are you using threads, explicitly or implicitly? (I believe Pluto and VS Code may enable them for you.)
It might not matter, but I’m wondering whether it could in theory, or in a future Julia version.
The former is linear: it must call push! 10000 times. That does however NOT mean the array is enlarged that often, since Julia is clever behind the scenes.
By linear I mean it must add one element at a time to the end of the array (as visible to the user), even though in practice it only enlarges (and likely moves) the underlying storage 9 times (clever logarithmic growth, avoiding O(n) allocations):
julia> @time (data = Int64[]; for _=1:10000 push!(data, 1) end)
0.000566 seconds (9 allocations: 326.547 KiB)
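As an aside, if you know the final size in advance you can sidestep even those few reallocations. A minimal sketch (the helper name `fill_with_hint` is mine, not from the original code) using `sizehint!` to reserve capacity up front:

```julia
# Sketch: sizehint! reserves capacity once, so the subsequent push!
# calls should not need to reallocate/move the underlying storage.
function fill_with_hint(n)
    data = Int64[]
    sizehint!(data, n)      # one up-front reservation for n elements
    for _ in 1:n
        push!(data, 1)      # still user-visible, one-at-a-time appends
    end
    return data
end
```

The user-visible semantics are unchanged; only the growth strategy differs.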
The other way is better than that, and differs by more than just allocation count/speed (though at first it didn’t seem so, because of global scope):
julia> @time data = [1 for _=1:10000];
0.038610 seconds (19.32 k allocations: 1.376 MiB, 99.53% compilation time)
julia> test() = data = [1 for _=1:10000];
julia> @time test(); # I'm not worrying for now why I do not get the expected 1 allocation
0.000014 seconds (2 allocations: 78.172 KiB)
It’s 40x faster despite the other only having 4.5 times the number of allocations… Why? Probably because it uses SIMD/vectorized instructions (vmovups), while the former likely doesn’t. In your case you might be enabling SIMD optimization and hitting a bug (in LLVM). SIMD is concurrency, though not threads. Julia is allowed to use it without the @simd macro if it can prove the end result is identical. The macro allows SIMD, if I recall correctly, even if the result differs slightly.
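To illustrate the distinction: the @simd annotation grants the compiler permission to reorder a reduction (which can change floating-point rounding slightly), whereas without it Julia may only vectorize when the result is provably identical. A minimal sketch, with a helper name (`simd_sum`) of my own:

```julia
# Sketch: @simd permits vectorization even where reassociating the
# floating-point additions could change the rounding of the result.
function simd_sum(xs)
    s = 0.0
    @simd for i in eachindex(xs)
        @inbounds s += xs[i]   # @inbounds removes bounds checks, helping vectorization
    end
    return s
end
```

Inspecting the generated code with `@code_llvm simd_sum(rand(1000))` is one way to check whether vector instructions were actually emitted.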
Now, what I’m thinking: the former must run in order, because push! is user-visible, unless you have a sufficiently advanced compiler (which doesn’t, and likely never will, exist).
Despite the latter using for and looking linear, it wouldn’t have to be if threads, or other concurrency, were used. You’re only asking for the end result, so the compiler should even be allowed to allocate the full array up front and populate it in reverse order (given a pure foo function), or from both ends, or split n ways, etc.
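You can do that split-n-ways version by hand today: pre-allocate and let each thread write its own indices, since no index is written twice and no ordering is observable. A sketch (helper name `fill_threaded` is mine), assuming Julia is started with more than one thread:

```julia
using Base.Threads

# Sketch: allocate the full array once, then populate it in whatever
# order the thread scheduler picks; only the end result is observable.
function fill_threaded(n)
    data = Vector{Int64}(undef, n)
    Threads.@threads for i in 1:n
        data[i] = 1     # each index written exactly once; no push!, no ordering
    end
    return data
end
```

With a single thread this degenerates to the ordinary loop, so the result is the same either way.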
You might think you need to annotate locally that threads are used. I’m not sure; maybe that’s required, but it shouldn’t be. So: are threads used? If they are, foo needs to be re-entrant. Is it, or does foo use ccall? Are you showing an MWE that actually fails, or is it a simplification? E.g. if 10000 is not a constant but comes from a function call, that function might need to be called in each iteration.