if your sparse matrix does not change for several iterations. The CSB data structure and code have been observed to be faster than MKL for many sparse matrix families.
Your source code will require a minor modification or abstraction to process the dense matrices in column batches. I think we compiled it with up to 32 dense columns per call.
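As a rough illustration of the column batching mentioned above, here is a minimal sketch in plain Python. It does not use the CSB API: `multiply_batch` is a hypothetical stand-in for whatever kernel you call, the dict-per-row sparse format is only a toy, and the limit of 32 columns per call is assumed from the remark above.

```python
MAX_BATCH_COLS = 32  # assumed per-call limit, taken from the "up to 32" remark


def multiply_batch(rows, dense_cols):
    """Hypothetical stand-in kernel (NOT the CSB API).

    rows:       sparse matrix as a list of {column_index: value} dicts
    dense_cols: a block of dense columns, each a list of floats
    Returns one output column per input column.
    """
    if len(dense_cols) > MAX_BATCH_COLS:
        raise ValueError("kernel accepts at most %d columns" % MAX_BATCH_COLS)
    return [
        [sum(v * col[j] for j, v in row.items()) for row in rows]
        for col in dense_cols
    ]


def multiply_in_batches(rows, dense_cols, batch=MAX_BATCH_COLS):
    """Multiply by an arbitrarily wide dense matrix, one column batch per call."""
    result = []
    for start in range(0, len(dense_cols), batch):
        # Slice out at most `batch` columns and hand them to the kernel.
        result.extend(multiply_batch(rows, dense_cols[start:start + batch]))
    return result
```

The wrapper is the only part your calling code would need to change: it keeps the dense matrix in column-major form and hides the per-call column limit from the rest of the solver.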