Quick comment as a Tapir author here.
The model of parallelism there is essentially individual tasks, which can be optimized quite well and made very lightweight (some folks have even built custom hardware for lightweight threads compiled using Tapir).
Much like the canonical Enzyme example, the canonical Tapir example shows why you want to run optimizations on parallel code (and potentially get asymptotic speedups). For more details, see the latter half of https://c.wsmoses.com/presentations/defense.pdf (including how applying Tapir optimizations makes for much faster parallel gradients).
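For anyone who hasn't seen it, here is a rough sketch of that kind of example, modeled on the normalize loop from the Tapir paper rather than copied from it. Assumptions on my part: the `__cilk` guard for detecting OpenCilk, an L1 norm in place of the usual Euclidean one (to avoid linking libm), `pure` in place of the paper's attribute, and the function names. Calling `norm` inside the parallel loop costs O(n^2) work in total; once the compiler can reason about the parallel loop like any other loop, it can hoist the call and get O(n). The second function is that result written out by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: OpenCilk defines __cilk under -fopencilk.
   Otherwise fall back to an ordinary serial loop. */
#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_for for
#endif

/* O(n), no side effects. The attribute is the hint that repeated calls
   with unchanged inputs are redundant. L1 norm, chosen for simplicity. */
__attribute__((pure)) double norm(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] < 0.0 ? -a[i] : a[i];
    return sum;
}

/* As written by the programmer: norm() runs once per iteration,
   so the loop does O(n^2) work in total. */
void normalize_naive(double *restrict out, const double *restrict in, size_t n) {
    assert(n > 0);
    cilk_for (size_t i = 0; i < n; ++i)
        out[i] = in[i] / norm(in, n);
}

/* What loop-invariant code motion across the parallel loop amounts to:
   norm() runs once, and the loop does O(n) work. */
void normalize_hoisted(double *restrict out, const double *restrict in, size_t n) {
    assert(n > 0);
    const double nrm = norm(in, n);
    cilk_for (size_t i = 0; i < n; ++i)
        out[i] = in[i] / nrm;
}
```

Whether a given compiler actually performs this hoisting depends on how it represents the parallel loop. If the loop body has already been outlined into a runtime call before the optimizer runs, the call to `norm` typically stays where it is.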
While there has been various follow-up work, like Polygeist, that does autoparallelization, my (admittedly limited) read of Bend is that it focuses on identifying opportunities for parallelism via functional programming rather than on optimizing parallel code.
Of course, that actually means the two could work together! (Use Bend to find opportunities for parallelism and Tapir to optimize the result so it runs and scales better.)