I think a really key point here though is that having a standard way to do composable threading is very different from having an optional one that you can use if you want to. It’s only really effective if everyone uses it. That’s why I don’t credit Cilk towards C/C++: not only do you need a special compiler and language extensions, but if one library uses pthreads and another library uses Cilk, the combination is not going to scale well. The same goes for other systems: a composable threading system is only effective if everyone buys in.
Go’s superpower as a language seems to be the fact that they implemented an incredibly good built-in task-based threading system and absolutely everyone uses it. It’s so important that the keyword for it is go, the same as the name of the language. On top of that, they’ve optimized the heck out of it so that it’s really, really efficient and reliable, and the garbage collector is very low latency despite the threading.
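For anyone who hasn’t written Go, here is a rough sketch of what that keyword looks like in practice. The helper name sumSquares and the one-goroutine-per-input layout are just my illustration, not something from any particular library; the point is only that spawning a task is a single keyword that every Go library shares.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares is a hypothetical helper for illustration: it starts one
// goroutine per input and collects the results through a channel.
func sumSquares(n int) int {
	results := make(chan int, n) // buffered, so no goroutine blocks on send
	var wg sync.WaitGroup
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(x int) { // the `go` keyword starts a new goroutine
			defer wg.Done()
			results <- x * x
		}(i)
	}
	wg.Wait()      // wait until every goroutine has sent its result
	close(results) // safe to close: no more senders remain
	total := 0
	for v := range results {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumSquares(4))
}
```

Because any library can spawn work this way and it all lands on the same runtime scheduler, nested parallelism across libraries composes without each one bringing its own thread pool.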
Reducing spawn overhead is definitely a compiler team todo, but hasn’t lately been as high up as reducing compiler latency.