Threaded for loop over arbitrary indexable objects

bkamins · July 29, 2019, 9:24pm

Currently the following code fails:

julia> Threads.@threads for c in "ssf"
       end

Error thrown in threaded loop on thread 0: MethodError(f=typeof(Base.unsafe_getindex)(), args=("ssf", 1), world=0x000000000000643a)

(the original problem was posted in DataFrames.jl here)

The reason is that Base.unsafe_getindex is used internally and it is defined only for a limited number of collection types.

With Julia 1.3 I expect that threads will be used more often.

So my question is whether this restriction is intentional and required, or whether @threads could be allowed for a wider range of types that support getindex (as I suppose it could)?

EDIT: the example given above is probably not the best one, as strings are not indexable with step 1; but see the linked DataFrames.jl example, where GroupedDataFrame is properly indexable with 1-increment but we deliberately do not make it a subtype of AbstractArray.

EDIT2: probably the simplest example is with a Tuple:

julia> @Threads.threads for i in (1,2,3)
       end

Error thrown in threaded loop on thread 0: MethodError(f=typeof(Base.unsafe_getindex)(), args=((1, 2, 3), 1), world=0x00000000000063fd)

kristoffer.carlsson · September 9, 2019, 11:51am

Why not use (loosely)

@sync begin
    for i in (1,2,3)
        Threads.@spawn work(i)
    end
end

and let the dynamic scheduler do its job?

bkamins · September 9, 2019, 12:07pm

This is only possible in Julia 1.3 and, e.g. in DataFrames.jl we try to support Julia 1.0 LTS.

kristoffer.carlsson · September 9, 2019, 12:11pm

Just run things serially in 1.0?

nalimilan · September 9, 2019, 12:11pm

I guess Compat could define some no-op macros to allow that. But AFAICT the issue was that users wanted to write that, and it failed. Should it?

kristoffer.carlsson · September 9, 2019, 12:15pm

I’m a bit confused if this topic is about doing threading inside DataFrames.jl or just a general question about users doing threads on arbitrary objects.

nalimilan · September 9, 2019, 12:16pm

It’s not about writing threaded code inside DataFrames.jl. It’s about whether/how things like this should work for users:

  Threads.@threads for sdf ∈ groupby(df, :grpcol)
    sdf.x3 .= -1.
  end

But the example with a tuple is simpler:

@Threads.threads for i in (1,2,3)
end

bkamins · September 9, 2019, 12:20pm

I used String as an example, to avoid going into the details of design and intended usage of GroupedDataFrame object in DataFrames.jl (as the core of the issue in the given question is the same).

Essentially we are investigating how we can speed up the split-apply-combine ecosystem in DataFrames.jl using threading.

nalimilan · September 9, 2019, 12:35pm

Just to clarify: this will require internal changes in DataFrames which are well beyond the scope of this issue of course.

nilshg · September 10, 2019, 8:35am

This actually bit me yesterday, when I tried doing (you might remember the discussion on Slack) something like:

@Threads.threads for (n, r) in enumerate(eachrow(df))
   ...
end

So I’m also interested in the broader question of which iterators are expected to receive threads support. I naively expected this to work out of the box for everything and couldn’t quite work out what the unsafe_getindex error message was telling me.

kristoffer.carlsson · September 10, 2019, 11:56am

Threads.@threads statically schedules the work, so it needs random access into what is looped over. It could probably be made to work for tuples, but for simplicity I have always just looped over a vector of integers and then indexed into whatever I need based on that.
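
A minimal sketch of that workaround, assuming a small tuple as the collection (the names `items` and `results` are made up for illustration): loop over an integer range, which @threads supports, and index into the object by hand.

```julia
using Base.Threads

items = (10, 20, 30)                  # a Tuple: indexable, but not an AbstractArray
results = zeros(Int, length(items))   # one output slot per element

# @threads sees only the integer range; the tuple is indexed manually.
@threads for i in 1:length(items)
    results[i] = 2 * items[i]         # each iteration writes to its own slot
end
```

Because every iteration writes to a distinct element of `results`, no locking is needed here.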

nilshg · September 10, 2019, 12:27pm

Is this just a case of missing methods for iterables other than an integer range or is there a more fundamental issue that prevents this from working?

To me the ability to write for loops over general iterables without having to do integer indexing is one of the great Julia features that allows for clean and readable code, and it’d be great if throwing in @Threads.threads after the fact would just work.

kristoffer.carlsson · September 10, 2019, 12:32pm

For @threads it is fundamental in that it schedules all the work on threads statically.

If you want to do dynamic scheduling then use the Threads.@spawn functionality in 1.3+. Is there an issue with that?
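
A rough sketch of the dynamic route, assuming Julia 1.3 or later; `work` is a hypothetical stand-in for the real per-element computation. The collection only needs to support ordinary iteration, so even a String works.

```julia
using Base.Threads

work(c) = uppercase(c)    # hypothetical placeholder for the real work

# Spawn one task per element; the scheduler assigns tasks to threads.
tasks = [Threads.@spawn work(c) for c in "ssf"]

# fetch waits for each task and returns its result, in spawn order.
results = fetch.(tasks)
```

Spawning one task per element is only worthwhile when each call does a reasonable amount of work; for very cheap iterations the task overhead may dominate.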

nilshg · September 10, 2019, 12:46pm

Sorry I feel like I should be doing a bit more reading on how the new threading works rather than wasting your time with basic questions - I haven’t tried the @spawn route, so far all my parallel code uses SharedArrays and then @sync @distributed for loops, and my mental model was that @threads replaces this and allows me to work with “normal” Arrays given it’s shared memory parallelism.

I’m still not sure what is so bad about random access for general iterators. In my use case yesterday I was iterating through DataFrame rows, and for my task it didn’t matter in which order the rows were processed. I might be missing something fundamental about what random access and static scheduling mean…
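
For what it is worth, one possible way around this, sketched under the assumption that materializing the iterator is affordable: a lazy iterator such as enumerate(...) has no getindex, but collecting it into a Vector gives @threads something it can index. Here `rows` is a hypothetical stand-in for eachrow(df), so the snippet runs without DataFrames.jl.

```julia
using Base.Threads

rows = ["a", "b", "c"]                  # hypothetical stand-in for eachrow(df)
pairs = collect(enumerate(rows))        # Vector of (index, value) tuples
out = Vector{String}(undef, length(pairs))

# The Vector supports random access, so its index range can be split statically.
@threads for k in eachindex(pairs)
    n, r = pairs[k]
    out[n] = string(n, ":", r)          # order of execution does not matter
end
```

The cost is the extra allocation made by `collect`.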

bkamins · September 10, 2019, 12:56pm

@kristoffer.carlsson - for me the core of the question is why @threads requires an iterable to support Base.unsafe_getindex and not just getindex and length (which already guarantees that the work can be statically scheduled).
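
To illustrate the point, here is a sketch of static scheduling that relies only on length and getindex. `static_foreach` is a hypothetical helper, not part of Base and not how @threads is implemented, and it assumes 1-based indexing.

```julia
using Base.Threads

# Hypothetical helper: split 1:length(xs) into at most one contiguous
# chunk per thread, touching xs only through `length` and `getindex`.
function static_foreach(f, xs)
    n = length(xs)
    nchunks = min(n, nthreads())
    nchunks == 0 && return nothing
    @threads for c in 1:nchunks
        lo = div((c - 1) * n, nchunks) + 1   # first index of chunk c
        hi = div(c * n, nchunks)             # last index of chunk c
        for i in lo:hi
            f(xs[i])
        end
    end
    return nothing
end

# Usage with a Tuple; an atomic counter keeps the accumulation thread-safe.
acc = Atomic{Int}(0)
static_foreach(x -> atomic_add!(acc, x), (1, 2, 3))
```

The chunk boundaries are fixed before any work starts, which is all that static scheduling needs.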

kristoffer.carlsson · September 10, 2019, 1:04pm

Yeah, I realize this and I don’t think there is any reason. It was added a long time ago in the commit “Remove @threads for calls and blocks, since these require” (JuliaLang/julia@a312aad on GitHub), with the description:

Generalize @threads to work on any 1D range with random-access subscripting. Long term, we can generalize it to more general ranges similar to how @simd works

Previously, I think it only worked on unit ranges.

tkf · September 10, 2019, 7:28pm

Just FYI, reduce as defined in Transducers.jl can be used to do something like @threads for, not only over arbitrary indexable collections but also over any collection that can halve itself (…in principle; ATM only AbstractArray is supported). Roughly speaking,

Threads.@threads for x in xs
    use(x)
end

is equivalent to

reduce(Map(identity), xs; init=nothing) do _, x
    use(x)
end

It also supports (deterministic) “break”, which cannot be done with @threads for AFAICT.
