How best to parallelize a custom decision tree / forest?

Hello,

Could anyone give me advice on the best way to parallelize a custom decision tree or forest algorithm on CPU / GPU?
I am currently building my own from scratch to experiment with some of my ideas for multi-class / multi-output scenarios.
The available decision tree packages aren't quite aligned with what I want, hence the need to write it from scratch. The code I wrote is not parallelized and uses for-loops to build the tree (since I am not familiar with recursion). The data is saved in a Julia dictionary, which is causing me problems when I try to parallelize it.
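To make the dictionary problem concrete: a Julia `Dict` is not safe to modify from several threads at once, so inserting nodes from inside a `Threads.@threads` loop can corrupt it. Below is a minimal sketch of two possible workarounds, assuming each node can be identified by an integer id. `NodeInfo`, `make_node`, `fill_nodes_locked`, and `fill_nodes_vector` are hypothetical names used only for illustration; they are not part of my actual code.

```julia
using Base.Threads

# Hypothetical stand-in for whatever a node stores.
struct NodeInfo
    id::Int
    depth::Int
end

# Hypothetical per-node work; here the depth is just derived from a heap-style id.
make_node(id) = NodeInfo(id, ndigits(id, base = 2) - 1)

# Option 1: keep the Dict, but guard every insertion with a lock.
function fill_nodes_locked(n)
    nodes = Dict{Int,NodeInfo}()
    lk = ReentrantLock()
    @threads for id in 1:n
        node = make_node(id)      # the node work itself runs in parallel
        lock(lk) do
            nodes[id] = node      # only the insertion is serialized
        end
    end
    return nodes
end

# Option 2: preallocate a Vector; each iteration writes to its own index,
# so no lock is needed.
function fill_nodes_vector(n)
    nodes = Vector{NodeInfo}(undef, n)
    @threads for id in 1:n
        nodes[id] = make_node(id)
    end
    return nodes
end
```

Both functions should return the same nodes. Julia has to be started with more than one thread (for example `julia -t 4`) for either loop to actually run in parallel.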

Note: I have only been using Julia for 2 months; before that I had only ever used Python.

Any help would be appreciated.
Thanks

Example Code

mutable struct Tree
    name
    max_depth
    min_samples
    data
end

function build_tree(tree::Tree, X, y)
    for depth in 0:tree.max_depth
        # Goal: create the nodes at each depth in parallel.
        # Candidate tools:
        #   Threads.@threads
        #   Threads.@spawn
        #   Distributed.@distributed
        #   CUDA.jl ?
        for node in 1:2^depth
            # create nodes
        end
    end
end
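
For reference, here is a rough sketch of how the level-by-level loop above might look with `Threads.@threads`, assuming the nodes are stored in a preallocated vector with heap-style indexing (node `i` has children `2i` and `2i + 1`) rather than in a dictionary. The `Node` type, `split_node`, and `build_tree_levelwise` are hypothetical names, and the split rule (cycle through the features by depth and split at the mean) is only a placeholder, not a real impurity criterion. It also ignores `y` entirely.

```julia
using Base.Threads
using Statistics: mean

# Hypothetical node type: rows reaching the node plus the split chosen there.
struct Node
    rows::Vector{Int}
    feature::Int          # 0 marks a leaf
    threshold::Float64
end

# Placeholder split rule (an assumption made for this sketch only).
function split_node(X, rows, depth, min_samples)
    length(rows) < min_samples && return 0, NaN, Int[], Int[]
    f = mod1(depth + 1, size(X, 2))
    t = mean(@view X[rows, f])
    left = [r for r in rows if X[r, f] <= t]
    right = [r for r in rows if X[r, f] > t]
    (isempty(left) || isempty(right)) && return 0, NaN, Int[], Int[]
    return f, t, left, right
end

function build_tree_levelwise(X::AbstractMatrix, max_depth::Int, min_samples::Int)
    nnodes = 2^(max_depth + 1) - 1
    nodes = Vector{Union{Nothing,Node}}(nothing, nnodes)
    # Rows waiting to be processed at each node slot.
    pending = Vector{Union{Nothing,Vector{Int}}}(nothing, nnodes)
    pending[1] = collect(1:size(X, 1))
    for depth in 0:max_depth
        # Nodes on the same level are independent, and each iteration only
        # writes to its own slot and its own two child slots.
        @threads for i in 2^depth:(2^(depth + 1) - 1)
            rows = pending[i]
            if rows !== nothing
                if depth == max_depth
                    nodes[i] = Node(rows, 0, NaN)
                else
                    f, t, left, right = split_node(X, rows, depth, min_samples)
                    nodes[i] = Node(rows, f, t)
                    if f != 0
                        pending[2i] = left
                        pending[2i + 1] = right
                    end
                end
            end
        end
    end
    return nodes
end
```

A call such as `build_tree_levelwise(rand(100, 3), 4, 5)` would return a vector in which unused slots stay `nothing`. For a forest, spawning one task per tree with `Threads.@spawn` is probably the simpler unit of parallelism, since the first few levels of a single tree contain too few nodes to keep many threads busy.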