There are a couple of things we should do:
- Write faster scalar functions. We use translated versions of openlibm (which were based on FreeBSD’s libm, which in turn descends from fdlibm, written by Sun in the 90s). These give respectable performance, but there is some room for improvement (e.g. by exploiting `fma` instructions on newer architectures); see the first sketch after this list.
- Provide hooks to use vectorised kernels, such as SVML, SLEEF, Yeppp or Apple’s Accelerate library, for operations like `broadcast` and `reduce`; see the second sketch after this list. LLVM provides such hooks for its own intrinsics: we don’t currently use these because they are hard-coded to call the system math library, not our own scalar functions. Apparently this is fixable, but the best option would be to have a general framework to do this for arbitrary functions, not just those blessed by LLVM.
- A framework to write our own vectorised kernels. SIMD.jl provides some low-level functionality, but ideally we would have a higher-level way to write SPMD code like ISPC, which would do things like rewrite branches into bitmasks; see the third sketch after this list.
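On the first point, the main gain from `fma` comes in polynomial kernels, where each Horner step is a multiply followed by an add. A minimal sketch, using nothing beyond Base: `muladd` lets LLVM fuse each step into a hardware fma where one exists, while `fma` itself is the right choice when the single-rounding guarantee is needed for accuracy. The `horner` helper and the coefficients are purely illustrative.

```julia
# Horner evaluation of a polynomial with coefficients in ascending order.
# `muladd` permits (but does not require) fusing the multiply and add into a
# single fma instruction on architectures that provide one; `fma` would force
# the correctly-rounded result even without hardware support, at some cost.
function horner(x, coeffs)
    acc = coeffs[end]
    for i in length(coeffs)-1:-1:1
        acc = muladd(x, acc, coeffs[i])
    end
    return acc
end

# Truncated exp series used as placeholder coefficients, not a tuned minimax fit:
horner(0.5, (1.0, 1.0, 0.5, 1/6, 1/24))
```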
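On the second point, a vectorised kernel can already be reached by hand through `ccall` with tuples of `VecElement`, which Julia passes in SIMD registers. The sketch below only shows the shape such a hook could take: it assumes a libsleef build on the library search path that exports the `Sleef_expd4_u10` dispatcher, and the `sleef_exp4` name and the 4-wide width are arbitrary choices here, not a proposed API.

```julia
# A 4-wide double vector in the representation that ccall lowers to an LLVM
# vector type, so the argument and result travel in SIMD registers.
const Double4 = NTuple{4, Core.VecElement{Float64}}

# Call SLEEF's 4-wide exp kernel (1.0 ulp variant); assumes libsleef is
# installed and exposes this dispatcher symbol.
sleef_exp4(x::Double4) =
    ccall((:Sleef_expd4_u10, "libsleef"), Double4, (Double4,), x)

x = map(Core.VecElement, (0.1, 0.2, 0.3, 0.4))
sleef_exp4(x)
```

A real hook would sit behind `broadcast` and `reduce`, so that something like `exp.(v)` could reach such a kernel without the user writing the `ccall` themselves.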
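On the third point, SIMD.jl already makes the branch-free form expressible by hand: a lane-wise comparison yields a mask, both sides of the branch are evaluated, and `vifelse` blends them. An ISPC-style SPMD layer would perform this rewrite automatically from ordinary branchy scalar code. A minimal sketch using SIMD.jl (the `vabs` name is just illustrative):

```julia
using SIMD

# The scalar branch `x < 0 ? -x : x` cannot be run lane-wise as written; in
# SPMD style both sides are computed and blended under a bitmask.
function vabs(x::Vec{N,Float64}) where {N}
    mask = x < Vec{N,Float64}(0.0)   # lane-wise comparison -> vector of Bools
    return vifelse(mask, -x, x)      # pick -x where the mask is set, x elsewhere
end

vabs(Vec{4,Float64}((-1.0, 2.0, -3.0, 4.0)))   # lanes become 1.0, 2.0, 3.0, 4.0
```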
Unfortunately the discussion of this issue has become somewhat fragmented, but the main issues are: