How to implement hybrid parallelized programs

Thanks for the simplified code :slight_smile:
I think you could gain quite a bit of performance by reusing the memory of the matrix. Assuming you can write a version of make_matrix that fills a preallocated matrix in place (called make_matrix! below), you could try:

using ChunkSplitters
using LinearAlgebra  # provides eigvals!

function calc_function(x)
    kvalue_range = range(0, 0.5, length=201)
    result_vector = zeros(length(kvalue_range))
    # Chunking the indices (not the values) yields index ranges;
    # @sync waits for all spawned tasks before integrating.
    @sync for chunk in chunks(eachindex(kvalue_range); n=Threads.nthreads())
        Threads.@spawn begin
            temp_matrix = zeros(1000, 1000) # or whatever type/size the matrix needs to have
            for k_ind in chunk
                kvalue = kvalue_range[k_ind]
                # make_matrix! fills a thousands x thousands matrix;
                # we need a different matrix for each kvalue.
                # eigvals! overwrites temp_matrix, so make_matrix! must rewrite every entry.
                energy = eigvals!(make_matrix!(temp_matrix, kvalue))
                result_vector[k_ind] = sub_calculation(energy)
            end
        end
    end

    # integrate uses the trapezoidal rule: just adding values
    result = integrate(result_vector)
    return result
end

This allocates only one matrix per task (one task per thread here) instead of one per k value. You could profile sub_calculation separately to make sure it does not allocate, or else try to preallocate more of its working memory. integrate is probably less important as it is called much less often, but you should still profile it to verify that it does not take a large share of the runtime.
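To make the "fill in place, then check for allocations" idea concrete, here is a minimal sketch. The body of make_matrix! (a small tridiagonal matrix) and the body of sub_calculation are purely hypothetical stand-ins, since I don't know your real ones, and allocated_by_fill is just a helper name I made up. The point is only the pattern: overwrite every entry of the preallocated matrix, and measure with @allocated inside a function after a warm-up call.

```julia
using LinearAlgebra

# Hypothetical in-place constructor: overwrites EVERY entry of M.
# The tridiagonal form is illustrative only, not the real model.
function make_matrix!(M::AbstractMatrix, kvalue::Real)
    n = size(M, 1)
    fill!(M, 0.0)                       # clear whatever eigvals! left behind
    for i in 1:n
        M[i, i] = 2 * cos(2π * kvalue)
        if i < n
            M[i, i + 1] = -1.0
            M[i + 1, i] = -1.0
        end
    end
    return M
end

# Hypothetical stand-in for sub_calculation: a non-allocating reduction.
sub_calculation(energy) = sum(abs2, energy)

# Measure inside a function so that global-variable overhead is not counted.
function allocated_by_fill(n)
    M = zeros(n, n)
    make_matrix!(M, 0.1)                # warm-up call, so compilation is excluded
    return @allocated make_matrix!(M, 0.1)
end

M = zeros(50, 50)
energy = eigvals!(make_matrix!(M, 0.1)) # M is overwritten by eigvals!
println("sub_calculation result: ", sub_calculation(energy))
println("bytes allocated by make_matrix!: ", allocated_by_fill(50))
```

If the reported number of bytes is anywhere near the size of the matrix, something in the constructor is still creating a temporary. The same @allocated check can be applied to your real sub_calculation.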