Performance optimization on lots of small linear algebra operations

Thank you!

The current version of the code with StaticArrays is already pretty close to the original C, after compilation. But simplifying the code with indexing will help too!