Massive data-dependent floating-point slowdown

I encountered a mysterious case where the same floating-point computation slows down massively depending on the data: by a factor of 20, or even 60 if I use @turbo from LoopVectorization.jl.

I’m guessing it has to do with floating-point exceptions arising from underflow, since the slowdown occurs when many of the inputs are very small, even though there are no NaNs or other non-finite values. Is there any way to prevent this slowdown (without explicitly filtering out small values)?
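
One way to check the underflow guess is to count how many inputs are subnormal, i.e. nonzero but smaller in magnitude than floatmin(Float64). The following is a minimal sketch using Base's issubnormal; the helper name subnormal_fraction and the sample vector are made up for illustration and are not part of the original post.

```julia
# Fraction of entries that are subnormal: nonzero, but below floatmin(T) in magnitude.
# `subnormal_fraction` is a hypothetical helper name, not part of any library.
subnormal_fraction(x::AbstractVector{<:AbstractFloat}) = count(issubnormal, x) / length(x)

# Illustrative data: two ordinary values, an exact zero, and two subnormals.
sample = [0.5, 1.0e-300, 0.0, 5.0e-324, 1.0e-310]

println("floatmin(Float64)  = ", floatmin(Float64))  # smallest positive normal Float64
println("nextfloat(0.0)     = ", nextfloat(0.0))     # smallest positive subnormal Float64
println("subnormal fraction = ", subnormal_fraction(sample))
```

A large fraction on the slow input and a fraction of zero on the fast input would be consistent with subnormal arithmetic being the cause, although it does not prove it.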

An MWE follows. Put @turbo before the for loop in foo! to see an even bigger slowdown.

using LoopVectorization, BenchmarkTools

function foo!(df,f,dz,beta)
    N_tip = length(df)
    N_tip == length(f) == length(beta) || throw(DimensionMismatch())

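    inv63dz = 1/63dz  # parses as 1/(63*dz): numeric-literal juxtaposition binds tighter than /

    # weights applied to the neighboring samples of f
    c1=-0.1;c2=1.125;c3=-6;c4=21;c5=-63;c6=12.6;c7=42;c8=-9;c9=1.5;c10=-0.125;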
    for i = 6:N_tip-5
        β = beta[i]
        f1=f[i-5]; f2=f[i-4]; f3=f[i-3]; f4=f[i-2]; f5=f[i-1]
        f6=f[i]; f7=f[i+1]; f8=f[i+2]; f9=f[i+3]; f10=f[i+4]; f11=f[i+5]
        df[i] = (β * (c1*f1+c2*f2+c3*f3+c4*f4+c5*f5+c6*f6+c7*f7+c8*f8+c9*f9+c10*f10)
                 - (1-β) * (c10*f2+c9*f3+c8*f4+c7*f5+c6*f6+c5*f7+c4*f8+c3*f9+c2*f10+c1*f11))*inv63dz
    end

    return df
end

N = 618
dv = zeros(N)
v = rand(N)
beta   = rand(N)
@btime foo!($dv, $v,0.5,$beta);

vslow = [0.92389089176693, 0.8669216264166248, 0.9643267121900607, 0.10687178415239451, 0.17733012955496164, 0.47887924355578426, 0.6765443971723215, 0.849698198732405, 0.3419321304294267, 0.8078658384691519, 0.4065365747432079, 0.33672667825701086, 0.1435714240114634, 0.5811386489818586, 0.5598845288829772, 0.9432016787872362, 0.09808086950372452, 0.5158320118921296, 0.01890937885354127, 0.552401791382257, 0.2744454125649549, 0.03480718621554635, 0.35098522170989477, 0.6754825232823312, 0.6569460856179217, 0.5598372539486034, 0.3676651592886826, 0.16517250621277801, 0.2117556327438619, 0.7538385064082065, 0.5247924688387187, 0.6308875010982402, 0.8912823323439869, 0.2776101010022227, 0.6657825833937678, 0.3432159986872547, 0.9966203807569654, 0.7990331764107279, 0.3512422679122813, 0.0026392659205256574, 0.501577888904658, 0.8724436367008459, 0.44290684201836483, 0.31581821048664005, 0.94092327913634, 0.988392620486205, 0.8215429391092703, 0.09838939994057339, 0.32875365761697806, 0.4299066986838942, 0.39963102948769835, 0.4548854683854471, 0.5439283885192217, 0.2860925765788005, 0.23034428113924066, 0.6305496114046247, 0.2616745129324316, 0.33076100575731027, 0.3173431246153131, 0.7949443654786901, 0.3818862711779931, 0.6816930830262389, 0.3989606492488338, 0.618882220192716, 0.7658793682110481, 0.40364556211048463, 0.2596996212569971, 0.5085737632649239, 0.999143827413224, 0.8371036494757915, 0.9931252532144197, 0.94105444678155, 0.25753885017429834, 0.8957273651184332, 0.11769307744765123, 0.7658687244642619, 0.2381032338533302, 0.4807127624369689, 0.10163089529950375, 0.656653420037872, 0.9939005483030139, 0.12265090176113369, 0.5163974314927324, 0.2964576641970207, 0.36893915106069297, 0.17328482336244178, 0.1078826493318068, 0.7674184565836584, 0.4493600604328467, 0.907622110262766, 0.13108247037891196, 0.9319164503423449, 0.6777684936836363, 0.25447393633258764, 0.17842011636485644, 0.8469873349000372, 0.11106036152641896, 0.6676323166224754, 
0.7936153631966576, 0.8613360551953659, 0.6997318209801424, 0.47677918237013084, 0.38058807051142374, 0.5452439997210576, 0.7155175058139713, 0.34311914221921436, 0.4972367855988775, 0.5335023032806896, 0.5621499401168588, 0.3359231643228331, 0.5456331986648786, 0.6313667319514233, 0.3371976764127489, 0.33128245204457296, 0.8363504877827797, 0.9748958863393007, 0.4719493762556044, 0.8683311381136678, 0.26425564174856486, 0.4944702337309639, 0.9569794988472287, 0.9625904975783761, 0.5613783230999425, 0.7881007984957311, 0.7605813470136449, 0.8153750088370297, 0.6491754288686233, 0.6940450274270096, 0.7151575904020699, 0.9278814282660308, 0.37903720206580394, 0.9715939686878396, 0.832082114074612, 0.035515673198750264, 0.08922760017851283, 0.6982712872498873, 0.6688638980993631, 0.6000866097616253, 0.4917270339899531, 0.9832024335014489, 0.7027578294787913, 0.8071951563731803, 0.684841333788939, 0.8944882757513914, 0.5262584621969904, 0.887518216247632, 0.5752010498661817, 0.7413125109657015, 0.6246338330288415, 0.47374821596797956, 0.9661555803521908, 0.9479564227707695, 0.47881526485992665, 0.12667594150835182, 0.9010984595939364, 0.8771826583975024, 0.18129434665565358, 0.8304940446284792, 0.6569138900413591, 0.9760070244992987, 0.22121652315925067, 0.7207660122970365, 0.8059162463642016, 0.2441170411536091, 0.03728694403898336, 0.11443947753543848, 0.23223649645161037, 0.009649744164245355, 0.3211408600610952, 0.15900545371452424, 0.29107123533594215, 0.3454857280246162, 0.6650112579584977, 0.6450221475744757, 0.14309312505152083, 0.08130023191147373, 0.6939114412652629, 0.746421041737394, 0.2284349919437434, 0.641479745490539, 0.6965031124328722, 0.9207597249936796, 0.6401655021575077, 0.8528985303085477, 0.21838177023256833, 0.4297253142623627, 0.6445324315639651, 0.8295384536571799, 0.4568580977623806, 0.9857722066730936, 0.537597015676752, 0.7622618136103223, 0.3625495057742667, 0.2712485884945044, 0.012893094908191571, 0.585629341018252, 0.46699242154539267, 
0.11358833939065738, 0.9150490760674559, 0.4328856743452416, 0.9343312132153554, 0.04061730097281435, 0.3662358128312546, 0.2183168917654117, 0.8996414723653301, 0.41416233113438605, 0.21391950329503673, 0.11251557313715477, 0.44219293720031105, 0.7013518152706855, 0.7551363646968627, 0.03628626328984974, 0.5250103771530934, 0.9086876582695576, 0.6713241657069196, 0.7394335486387631, 0.25693216255545237, 0.38815930748718674, 0.36626080983896614, 0.026900936311424095, 0.8845539545730887, 0.9110561311121375, 0.8782076230947837, 0.8380993564772417, 0.4682682790196473, 0.5947518647119967, 0.28155320478838974, 0.23414078446572972, 0.5013442457602677, 0.7754553965150925, 0.43116450235796533, 0.9891670010459419, 0.4982216624440392, 0.6636925927703816, 0.6426483913074257, 0.2964822439122954, 0.26011985233940504, 0.015264446634731366, 0.957524564325064, 0.0005576821560062672, 0.33742190131755456, 0.09984543544750468, 0.5989846769170539, 0.707683963921022, 0.3497829005110953, 0.6811859223201486, 0.20741279903181598, 0.051583470660704345, 0.0012402230863233221, 0.053899378191897496, 0.5790953028404207, 0.47052383435714695, 0.8998071936265029, 0.26057645607040714, 0.30469987928365816, 0.8338782999455379, 0.1277793686540667, 0.871923551477481, 0.5739136879534095, 0.03261374252253613, 0.09677780036291583, 0.40354294179170935, 0.34529801918849823, 0.16351831747880485, 0.5636521721129784, 0.04345789840386849, 0.3147441924660319, 0.35810680852192744, 0.8715764252487044, 0.12517103255395123, 0.2663979348108685, 0.8085146137300765, 0.5151946412375867, 0.7145959826424735, 0.598444450511417, 0.6655883279636559, 0.8011192317295561, 0.37566997246512357, 0.5111218397525528, 0.7655223251719856, 0.6845247375524903, 0.44134242489204834, 0.9189970043503797, 0.8709343922800639, 0.38859212071426086, 0.799123728704823, 0.1866855724087899, 0.4268512478985851, 0.5938837882085846, 0.9259937550409847, 0.7328069988244774, 0.05026880574851211, 0.5487608697040505, 0.5119239586898319, 
0.08476000113636628, 0.8396079842309567, 0.016632386860206294, 0.8714994618549998, 0.37509885877468685, 0.2916778638829083, 0.5076592079326836, 0.8170046767680188, 0.13844905572588262, 0.4143147871839381, 0.12488949832796115, 0.13724649267819156, 0.6961665486844544, 0.17962338559860958, 0.5060731415021129, 0.10028850199544803, 0.484842530683683, 0.5966956370387841, 0.3971554800151187, 0.36306918194700644, 0.5974090207260003, 0.9156645690289877, 0.5362637032330111, 0.9688914583796928, 0.6721699462181194, 0.879803630335946, 0.8168964135206993, 0.9997851251491714, 0.01969353649583505, 0.03332589771190908, 0.8093023194706404, 0.08889012688278997, 0.10246091547672065, 0.03980402020964102, 0.465994152384847, 0.24306603873367782, 0.7488390310309847, 0.9280859438509854, 0.22087631981656997, 0.6541238623457086, 0.7871315419680285, 0.9682875121916623, 0.40278758224742184, 0.04828365309746574, 0.07307157539203013, 0.7574112463810307, 0.4961992112449103, 0.982223053575185, 0.7920244171927737, 0.4754402938035476, 0.21476546138585406, 0.665681606018727, 0.994954015235614, 0.09612863664877924, 0.9589417724312574, 0.39107288105471905, 0.1734332024705958, 0.8509217942714025, 0.42393036159982533, 0.6625849908154338, 0.6026084663644262, 0.6599473806418119, 0.7535836073961031, 0.733211261035249, 0.3178277211052769, 0.21069042585166353, 0.43539213519160724, 0.5978572841369139, 0.9410992410145769, 0.45056930377462456, 0.8574896662214622, 0.02058128071427845, 0.6434960500233395, 0.24705389973234504, 0.8832265023614367, 0.34264800056841227, 0.40830393464242776, 0.8614801566040537, 0.7992488293360491, 0.2823167376055731, 0.7575890039055275, 0.8667010099294812, 0.3894568442158146, 0.3896054142532075, 0.3496469632127359, 0.38611829614854676, 0.3459134070000942, 0.3543662433366339, 5.0e-324, 0.0, 0.0, 1.5e-323, 5.0e-324, 0.0, 1.0e-323, 0.0, 0.0, 5.0e-324, 5.0e-324, 5.0e-324, 0.0, 0.0, 5.0e-324, 5.0e-324, 5.0e-324, 0.0, 5.0e-324, 3.5e-323, 5.0e-324, 3.5e-323, 0.0, 1.8122005587063045e-111, 0.0, 
1.0e-323, 5.123543115836636e-266, 2.0e-323, 0.0, 5.0e-324, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0e-324, 5.0e-324, 1.0e-323, 3.7259029162416435e-25, 1.0e-323, 0.0, 0.0, 5.0e-324, 0.0, 5.0e-324, 0.0, 0.0, 0.0, 1.0e-323, 0.0, 5.0e-324, 0.0, 0.0, 1.0e-323, 0.0, 0.0, 5.0e-324, 1.5e-323, 5.0e-324, 0.0, 0.0, 0.0, 2.0e-323, 0.0, 0.0, 0.0, 0.0, 1.0e-323, 0.0, 5.0e-324, 0.0, 0.0, 2.304620780551251e-225, 5.0e-324, 5.0e-324, 2.5e-323, 0.0, 5.0e-324, 5.0e-324, 0.0, 5.0e-324, 5.0e-324, 5.720452365941365e-47, 0.0, 1.5e-323, 2.0e-323, 5.0e-324, 2.0e-323, 5.0e-324, 1.0e-323, 0.0, 5.0e-324, 5.0e-324, 2.5e-323, 5.0e-324, 5.0e-324, 0.0, 1.0e-323, 0.0, 1.5e-323, 5.0e-324, 5.0e-324, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0e-324, 1.0e-323, 3.5e-323, 0.0, 0.0, 1.0e-323, 0.0, 0.0, 0.0, 2.2398108754413678e-225, 1.0e-323, 0.0, 5.0e-324, 1.5e-323, 1.0e-323, 5.0e-324, 0.0, 0.0, 1.0e-323, 5.0e-324, 1.8508124459876873e-294, 5.0e-324, 0.0, 0.0, 1.5e-323, 0.0, 5.0e-324, 3.0e-323, 0.0, 0.0, 0.0, 5.0e-324, 0.0, 5.0e-324, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0e-324, 1.0e-323, 2.8e-322, 0.0, 5.0e-324, 0.0, 5.0e-324, 5.0e-324, 0.0, 0.0, 1.1667118198008294e-186, 1.5e-323, 1.0e-323, 0.0, 0.0, 0.0, 3.0e-323, 1.5e-323, 0.0, 0.0, 0.0, 5.0e-324, 0.0, 0.0, 0.0, 3.5e-323, 1.0e-323, 0.0, 0.0, 0.0, 5.0e-324, 0.0, 0.0, 0.0, 0.0, 3.5e-323, 5.0e-324, 1.0e-323, 1.0e-323, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0e-324, 0.0, 5.0e-324, 1.4449015213598344e-277, 7.99266983673582e-259, 5.0e-324, 0.0, 5.0e-324, 3.0e-323, 1.725216128232676e-235, 0.0, 5.0e-324, 3.986386902150488e-224, 3.5e-323, 0.0, 0.0, 0.0, 1.5e-323, 1.0e-323, 0.0, 0.0, 0.0, 2.0e-323, 0.0, 5.0e-324, 0.0, 0.0, 5.0e-324, 0.0, 5.0e-324, 0.0, 1.0e-323, 0.0, 0.0, 0.0, 5.0e-324, 0.0, 1.0e-323, 0.0, 1.0e-323, 0.0, 1.0e-323, 0.0, 0.0]
@btime foo!($dv, $vslow,0.5,$beta);

FYI, interestingly, it only reproduces on Intel CPUs:

Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz:

  6.198 μs (0 allocations: 0 bytes)
  118.845 μs (0 allocations: 0 bytes)

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz:

  6.776 μs (0 allocations: 0 bytes)
  122.967 μs (0 allocations: 0 bytes)

Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz:

  11.202 μs (0 allocations: 0 bytes)
  202.572 μs (0 allocations: 0 bytes)

AMD EPYC 7502 32-Core Processor:

  4.367 μs (0 allocations: 0 bytes)
  4.664 μs (0 allocations: 0 bytes)

POWER9 Model 2.1 (pvr 004e 1201):

  7.100 μs (0 allocations: 0 bytes)
  7.107 μs (0 allocations: 0 bytes)

(These are without @turbo)
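
To put numbers on the reported slowdown, the ratios can be computed directly from the timings listed above. This is only a sketch: the values are copied by hand from the results in this post, and the shortened CPU labels are for display only.

```julia
# (fast, slow) timings in μs, copied from the benchmark results above (without @turbo).
timings = [
    "Xeon Silver 4114" => (6.198, 118.845),
    "Xeon E5-2699 v4"  => (6.776, 122.967),
    "Xeon E5-2603 v4"  => (11.202, 202.572),
    "EPYC 7502"        => (4.367, 4.664),
    "POWER9"           => (7.100, 7.107),
]

# Slowdown factor = time on vslow / time on the random input v.
slowdown = [cpu => t[2] / t[1] for (cpu, t) in timings]

for (cpu, ratio) in slowdown
    println(rpad(cpu, 18), round(ratio, digits=2), "x")
end
```

The three Intel machines come out between 18x and 20x, while the AMD and POWER9 machines stay within about 7% of their fast timing.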

If I run with set_zero_subnormals(true) (on an Intel CPU), I get equal times. So you’re probably right about underflow.

set_zero_subnormals(true) prevents the slowdown. See, e.g., this performance tip
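
Since set_zero_subnormals changes floating-point behavior for everything that runs afterwards on the current thread, it may be worth limiting the setting to the hot computation. Below is a minimal sketch of one way to do that; with_zero_subnormals is a hypothetical helper, not a Base function, and the stand-in data only marks where the foo! call from the original post would go.

```julia
# Hypothetical helper (not part of Base): run `f()` with subnormals flushed to zero
# on the current thread, then restore the previous setting.
function with_zero_subnormals(f)
    old = get_zero_subnormals()
    # set_zero_subnormals returns false if the hardware cannot flush subnormals.
    ok = set_zero_subnormals(true)
    ok || @warn "flush-to-zero not supported on this hardware; running with IEEE subnormals"
    try
        return f()
    finally
        set_zero_subnormals(old)  # restore even if f() throws
    end
end

# Usage sketch with stand-in data; in this thread the body would be the foo! call.
x = fill(5.0e-324, 1000)
total = with_zero_subnormals() do
    sum(x)
end
println("sum with flushing permitted: ", total)
```

Two caveats from the Julia documentation apply: the setting only affects the current thread, and flushing subnormals can break identities such as (x-y==0) == (x==y), so results may differ slightly from strict IEEE arithmetic.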
