Reduced performance for parallel loops in larger code?

If your actual problem involves arrays of small vectors, as in this example, you will probably do much better with StaticArrays.jl. That is not necessarily related to the parallelization issue, which might be solved with a different parallelization strategy, such as the one offered by FLoops.jl. In general, it is hard to offer much advice without a working example of the problem.
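To illustrate the StaticArrays point, here is a minimal sketch. The function `sum_dists`, the 3D points, and the problem size are all made up for illustration and are not taken from your code. The idea is that subtracting two `Vector{Float64}` allocates a new heap array on every iteration, while the same operation on `SVector`s does not need to.

```julia
using StaticArrays  # external package, install with: ] add StaticArrays

# Hypothetical workload: sum of all pairwise distances between 3D points.
function sum_dists(points)
    s = 0.0
    for i in eachindex(points)
        for j in i+1:lastindex(points)
            # For Vector{Float64} elements this creates a new heap-allocated
            # vector each time; for SVector elements it does not have to.
            d = points[i] - points[j]
            s += sqrt(sum(abs2, d))
        end
    end
    return s
end

n = 1_000
points_vec  = [rand(3) for _ in 1:n]                        # Vector{Vector{Float64}}
points_svec = [SVector{3,Float64}(p) for p in points_vec]   # Vector{SVector{3,Float64}}

# Run once first so compilation is not counted in the measurement.
sum_dists(points_vec)
sum_dists(points_svec)

alloc_vec  = @allocated sum_dists(points_vec)
alloc_svec = @allocated sum_dists(points_svec)
println("bytes allocated with Vector elements:  ", alloc_vec)
println("bytes allocated with SVector elements: ", alloc_svec)
```

Fewer allocations in the inner loop usually also help threaded code, since the threads put less pressure on the garbage collector, but whether that explains the slowdown you see is something I cannot tell without a runnable example.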