One way to do this would be to access the internals of Dict, which has a simple array of “slots” under the hood and an isslotfilled function to check whether a slot actually holds an element. You could then use a standard parallelization method to partition the array dict.slots among your threads, have each thread iterate over its slots, and sum the pairs for the slots that are filled.
An additional advantage of this approach is that you can skip the j < i check: to iterate over unique pairs for a filled slot i, you can just loop over slots j ≥ i.
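Here is a rough sketch of the slot-based idea, not a definitive implementation. The name pairsum_slots, the pairwise function f, and the Float64 accumulator are my own assumptions for illustration. It also relies on d.slots, d.vals, and Base.isslotfilled, which are undocumented internals and may change between Julia versions.

```julia
# Sketch: sum f(v_i, v_j) over unique pairs of values in a Dict by walking
# its internal slot array in parallel. Relies on undocumented internals.
function pairsum_slots(f, d::Dict; ntasks = Threads.nthreads())
    vals = d.vals               # internal array of values, indexed by slot
    n = length(d.slots)         # number of slots, filled or not
    tasks = map(1:ntasks) do t
        Threads.@spawn begin
            s = 0.0             # assumes f returns something that adds to a Float64
            # Task t handles slots t, t + ntasks, t + 2ntasks, ...
            for i in t:ntasks:n
                Base.isslotfilled(d, i) || continue
                # Looping over j ≥ i visits each unordered pair once (including
                # the pair (i, i)); start at i + 1 to exclude self-pairs.
                for j in i:n
                    Base.isslotfilled(d, j) || continue
                    s += f(vals[i], vals[j])
                end
            end
            s
        end
    end
    return sum(fetch, tasks)
end

d = Dict(1 => 1.0, 2 => 2.0, 3 => 3.0)
total = pairsum_slots(*, d)
```

The slots are interleaved across tasks rather than split into contiguous chunks because slot i has roughly n - i partners, so contiguous chunks would give the first task far more work than the last. The dict must not be modified while the tasks are running.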
Another option is to switch to another Dict-like collection. For example, the OrderedDict structure from OrderedCollections.jl has a simple array of keys that its iteration loops over, so you can parallelize a loop over dict.keys directly without worrying about “unfilled” slots.
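A similar sketch for the OrderedDict route, under the same caveats: pairsum_ordered and f are hypothetical names, and d.vals is an internal field of OrderedDict (the values array stored alongside d.keys). I am also assuming that no deletions are pending, so the internal arrays are dense. The length check below guards that assumption.

```julia
using OrderedCollections   # external package, assumed to be installed

# Sketch: same pairwise sum, but over the dense internal arrays of an OrderedDict.
function pairsum_ordered(f, d::OrderedDict; ntasks = Threads.nthreads())
    vals = d.vals           # internal field; parallel to d.keys
    n = length(vals)
    # Assumption: after deletions the arrays may contain holes, so bail out.
    n == length(d) || error("internal arrays are not dense")
    tasks = map(1:ntasks) do t
        Threads.@spawn begin
            s = 0.0         # assumes f returns something that adds to a Float64
            for i in t:ntasks:n, j in i:n   # no "is this slot filled?" check needed
                s += f(vals[i], vals[j])
            end
            s
        end
    end
    return sum(fetch, tasks)
end

od = OrderedDict(1 => 1.0, 2 => 2.0, 3 => 3.0)
total_od = pairsum_ordered(*, od)
```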