What type of code is `threads` good for?

I was reading the latest blog post on multithreading and have a few questions on what applications can benefit from threading.

This is how I visualize threads in my head: suppose a CPU has two threads. You can spawn tasks on both of these threads, but only one of them will be executing at any given time. The OS may rapidly switch between threads to give a feel of parallelism, but in fact it is only doing one job at a time.
For true parallelism, you need separate, independent cores on the CPU. Julia already provides great support for running parallel tasks through its Distributed API, such as `pmap`.

So to me, threading doesn’t feel like it should generate speedups in “computational” code. It makes sense when dealing with a GUI and a long-running task: pause the long-running task for a minuscule amount of time, update the mouse pointer on screen, and switch back to the long-running task. But how does this help with number crunching? Once a thread stops running, there is no number crunching going on.

I’d very much appreciate someone explaining this like ELI-(math grad).

Additional question, if someone has time. My research involves agent-based models and running Monte Carlo simulations. We have access to a cluster (18 nodes with 32 cores each, with hyperthreading enabled – so 64 logical cores?). The workflow involves using ClusterManagers to connect to the nodes and using `pmap` to spawn 32 independent simulations, one on each of the 32 cores (per node). I spawn 32 processes per node because, as far as I can tell, spawning 64 would actually slow everything down. Is this correct?

The problem I face is that my “independent” simulations also write results to disk (or STDOUT) and generate logfiles. This is inefficient because the disk is constantly busy with writes from each of the 32 processes. I am hoping to use threads here to send information back to the head node, which sits idle while the simulations are happening. The head node can receive messages sequentially and write them to disk. Can I use threading here at all?
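Something like this is what I have in mind – a minimal single-machine sketch, where `run_sim` is a stand-in for my actual simulation and the workers send results through a channel instead of touching the disk themselves:

```julia
using Distributed

addprocs(2)  # stand-in for the cluster workers

# One channel owned by the head process: workers push results into it,
# and the head drains it and does all the disk writes sequentially.
const results = RemoteChannel(() -> Channel{String}(128))

# Hypothetical simulation: sends its result back instead of writing to disk.
@everywhere function run_sim(id, chan)
    put!(chan, "sim $id finished")
end

pmap(id -> run_sim(id, results), 1:4)

# Single writer on the head node: sequential, so no disk contention.
lines = [take!(results) for _ in 1:4]
```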

Is your previous coding background Python, by any chance? I only ask because Python is rather unique in having the GIL, so that threads provide concurrency but not true parallelism. Threads in Julia provide parallelism as well as concurrency, meaning that computations can occur on multiple CPU cores at the same time. The difference between multi-threading and Distributed is that multi-threading provides shared-memory parallelism, which for many problems can be much more efficient than communicating across process boundaries.

Multithreading usually means that the threads are in fact executing at the same time. This is what is described in that blog post.

Since things are running at the same time, you can do computations faster. Maybe try a simple threaded loop vs a non-threaded one to see.
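For example – a minimal sketch; started with `julia -t 4` (or `JULIA_NUM_THREADS=4`), the threaded version runs its chunks on several cores at once:

```julia
using Base.Threads

# Serial sum of squares.
function sumsq_serial(xs)
    s = 0.0
    for x in xs
        s += x^2
    end
    s
end

# Threaded version: split the index range into one chunk per thread,
# sum each chunk in its own task, then combine. No shared mutable
# state, so no data races.
function sumsq_threaded(xs)
    n = nthreads()
    chunks = Iterators.partition(eachindex(xs), cld(length(xs), n))
    tasks = [Threads.@spawn sumsq_serial(view(xs, c)) for c in chunks]
    sum(fetch.(tasks))
end

xs = rand(10^7)
# @time sumsq_serial(xs) vs @time sumsq_threaded(xs) to compare.
```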

@affans I think you are describing hardware threads, which are referred to as hyperthreading.
Perhaps it is more clear to look at threads first - which are a software construct.
Threads are lightweight processes which share a common address space and common code. (*) (**)
Threads can execute concurrently - as in the Julia threads which you have seen announced. This means that care must be taken to avoid ‘treading on toes’.

Now we come to hyperthreading. This is related to threads - it is hardware which helps threads run better on a CPU. As you say, there are two or more software threads being run on a given CPU core. When one thread stalls, perhaps due to an IO wait, the core quickly switches to executing another thread.
I say two or more because, for example, the Xeon Phi cores can have 4 HT threads and SPARC processors up to 8.

Please, please do not think of this reply as a put-down. Rather, think that you are quite correct - hardware threading in CPUs assists the software construct of threads.

(*) I may not be 100% correct about the shared code, but I think I am close enough.
(**) The Wikipedia article on threads is pretty heavy going


I think you are correct. It of course depends on the code you are running - the answer, as always, is to try your code on 32 and 64 processes per node and benchmark.

Being more constructive, give some thought to the following:
You have a system which presents 64 cores to the Linux OS (I assume Linux). The scheduler distributes the tasks and may switch them around between cores.
However, you can explicitly bind the tasks to every second core. This can be done through your batch system - which one are you using?

You can also ‘pretend’ to the system that hyperthreading is disabled, by taking every odd-numbered logical CPU offline at run time. (You cannot disable CPU 0 - which stops you shooting your foot off.)
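On a typical Linux box the sysfs interface for this looks roughly as follows - a sketch only, and note that which logical CPUs are siblings of the same physical core is system-specific, so check your own topology first:

```shell
# Which logical CPUs share a physical core with cpu1?
cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list

# If siblings are paired as (0,32), (1,33), ... on a 32-core node,
# taking cpus 32-63 offline leaves one hardware thread per core.
# Requires root.
for n in $(seq 32 63); do
    echo 0 | sudo tee /sys/devices/system/cpu/cpu$n/online
done

# Re-enable later with, e.g.:
# echo 1 | sudo tee /sys/devices/system/cpu/cpu32/online
```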
Message me offline if you want to know more about this.

Without knowing too much about the topic, and speaking only from practical experience, I can say that I have had success reading external data (binary files etc.): using the `Threads.@threads` macro I see a doubling in performance. Again, without knowing the details, my assumption has been that it just uses multiple threads to open and read different files in parallel.

So for me it has been useful.
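Roughly this pattern (a simplified sketch, not my exact code): each loop iteration handles one whole file, and each result goes into its own slot, so the threads share no state.

```julia
using Base.Threads

# Read a list of files in parallel, one file per iteration.
function read_all(paths)
    contents = Vector{String}(undef, length(paths))
    @threads for i in eachindex(paths)
        contents[i] = read(paths[i], String)
    end
    contents
end
```

With files on a fast disk (or already in the page cache), the threads spend their time in IO and decoding independently, which is where the speedup comes from.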

Kind regards

At the risk of going off topic, hyperthreading is currently interesting due to Spectre/Meltdown and the recent MDS vulnerabilities. I will not steer the thread off course further - Google it if you are interested.

To your 32 processes vs 64 processes… What will probably happen is that each instance will run slower, because the CPU will be trying to pump 2 threads through a single core. However, it should not run 50% slower, so there will be a net gain.

Basically, there are a number of (independent) computational units in the CPU, so running 2 threads through it is an attempt to keep all those units active, whereas running only 1 thread will not fully utilize the CPU. But again, when running 2 threads there will be contention, so both threads will run slower… but not 50% slower.

That said, you might be able to create an algorithm that does cause extreme contention under hyperthreading… maybe something that performs complex and continuous operations on the same 4 floats. In this case there shouldn’t be any IO waits for data to come from memory, and the operations will require the floating-point unit continuously… this would force the CPU to divide the time the threads spend using that resource.

As to your disk usage issue, threads are probably not going to help you (much). A single CPU can easily max out disk and network bandwidth. So if your operations are being throttled waiting for the disk or network to finish, then threads are probably not going to be a silver bullet.

One thing they could help with is compressing the data before you transmit it over the network or write it to disk, thus reducing the number of bytes in transit. But you would want a compression algorithm that is fast enough to keep up with the input, yet compresses well enough that the disk/network is no longer the bottleneck.
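As a toy illustration of the idea - with a trivial run-length encoder standing in for a real codec (in practice you would reach for a package such as CodecZlib):

```julia
# Toy stand-in for a real compressor: run-length encode an ASCII string
# before "sending" it. Repetitive output (e.g. log lines) shrinks a lot.
function rle(s::String)
    io = IOBuffer()
    i = 1
    while i <= lastindex(s)
        c = s[i]
        n = 1
        # Count the run of identical characters starting at i.
        while i + n <= lastindex(s) && s[i + n] == c
            n += 1
        end
        print(io, n, c)
        i += n
    end
    String(take!(io))
end
```

A worker thread could apply this (or a real codec) to each buffered chunk before the single writer pushes it to disk, trading CPU time for IO bandwidth.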

Your data may be in a format that lets you meet both these needs, but it’s far more likely you will end up bottlenecked either at the compressor or still at the network/disk… however, the end result should be to shrink that bottleneck.


No, hardware threads represent actual processors or cores, so they support genuine simultaneous parallelism. Hyperthreads are “virtual threads” that support concurrency but not full parallelism.

Julia’s new multithreading can be mapped onto both hardware and software threads, and therefore supports actual, real simultaneous parallelism, which can give considerable performance improvements.

No. HT/SMT threads behave almost like multiple CPU cores, including the need for atomics and locking. They make a lot of sense for throughput-constrained workloads that bottleneck on main-memory latency (~90 ns).

The concurrent-but-not-parallel `@async` stuff is about IO (with syscalls and µs–ms latencies). Main memory is normally not considered IO.
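To illustrate the atomics point - the usual shared-counter race applies across hyperthreads just as it does across full cores (a sketch using `Base.Threads`):

```julia
using Base.Threads

# Racy increment: every thread does an unsynchronized
# read-modify-write on the same cell, so updates can be lost
# when run with more than one thread.
function count_racy(n)
    c = Ref(0)
    @threads for i in 1:n
        c[] += 1
    end
    c[]
end

# Safe version: an atomic counter serializes the updates.
function count_atomic(n)
    c = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(c, 1)
    end
    c[]
end
```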

Thanks everyone. I have a little better understanding now. I will have to benchmark some code.

It seems the general workflow for me would be to spawn 32 simulations, one on each of the 32 cores. Then, since hyperthreading is enabled, I can use `@threads` within each simulation to speed up calculations if necessary and perform IO.

I wonder, then, if the system is smart enough to “figure out” how that works. If a processor has 32 cores but Linux sees 64 with HT enabled, does that mean each “core” has two threads? If so, then my idea could work…

I’ve noticed that OSes try to schedule active threads on their own cores. It’s only when you have more active threads than cores that they will start to utilize hyperthreading.

I also believe they try to re-run threads on the previous CPU for the caching benefit… or at the very least keep a thread executing on the same CPU if there is no reason to switch it to another.

Mostly, yes. There are various ways to tweak scheduling, with the default being OK for getting some level of desktop responsiveness. I found

an informative read, though in practice I think this kind of tweaking matters more in an HPC context.

If you are using a recent (say, post-2017) Linux kernel, I don’t think it is worth investing too much in manual tweaking on a desktop/workstation running mostly one-off calculations (eg scientific computing). Chances are that your optimized settings will be suboptimal for other workloads, or will become outdated as the kernel’s default scheduling improves.