See my response and sample code for a somewhat similar problem here.
Also note that, IMO, this is the wrong end from which to approach multithreading. For several reasons, I'd recommend multithreading at as coarse a level as possible, rather than in tight loops and individual functions like this.
(With some exceptions, for example if your problem solely consists of multiplying a single large matrix with a vector.)
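To make "coarse" a bit more concrete, here is a rough sketch in Python (picked only for illustration, it is not taken from the linked sample code). The names `matvec` and `solve_all` and the toy `problems` list are made up for this example. The kernel itself stays serial, and the threads are handed one whole independent problem each. Keep in mind that in CPython the GIL stops pure-Python arithmetic like this from actually running in parallel, so treat it as a picture of the structure, not as a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec(matrix, vector):
    """Serial matrix-vector product; deliberately contains no threading."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def solve_all(problems, max_workers=4):
    """Coarse-grained threading: each task is one whole (matrix, vector) problem."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(lambda p: matvec(*p), problems))


# Hypothetical workload: several small, independent problems.
problems = [
    ([[1, 2], [3, 4]], [1, 1]),
    ([[0, 1], [1, 0]], [5, 7]),
    ([[2, 0], [0, 2]], [3, 4]),
]
results = solve_all(problems)
print(results)
```

The point of this layout is that the thread startup and synchronization cost is paid once per problem instead of once per inner-loop call, and the serial kernel stays easy to test and optimize on its own.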