Threaded for loop over arbitrary indexable objects

Currently the following code fails:

julia> Threads.@threads for c in "ssf"
       end

Error thrown in threaded loop on thread 0: MethodError(f=typeof(Base.unsafe_getindex)(), args=("ssf", 1), world=0x000000000000643a)

(the original problem was posted in DataFrames.jl here)

The reason is that Base.unsafe_getindex is used internally and it is defined only for a limited number of collection types.

With Julia 1.3 I expect that threads will be used more often.

So my question is: is this restriction intentional and required, or could @threads be allowed for a wider range of types that support getindex (as I suppose it could)?

EDIT: the example given above is probably not the best one, as strings are not indexable with step 1. But see the linked DataFrames.jl example, where GroupedDataFrame is properly indexable with increments of 1, although we deliberately do not make it a subtype of AbstractArray.

EDIT2: probably the simplest example is with a Tuple:

julia> @Threads.threads for i in (1,2,3)
       end

Error thrown in threaded loop on thread 0: MethodError(f=typeof(Base.unsafe_getindex)(), args=((1, 2, 3), 1), world=0x00000000000063fd)

Why not use (loosely)

@sync begin
    for i in (1,2,3)
        Threads.@spawn work(i)
    end
end

and let the dynamic scheduler do its job?


This is only possible in Julia 1.3 and, e.g. in DataFrames.jl we try to support Julia 1.0 LTS.

Just run things serially in 1.0?

I guess Compat could define some no-op macros to allow that. But AFAICT the issue was that users wanted to write that, and it failed. Should it?
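
For what it's worth, a version-gated fallback along those lines could look roughly like the sketch below. The name `foreach_maybe_threaded` is hypothetical (it is not something Compat or DataFrames.jl provides); the only assumption is that `Threads.@spawn` exists from Julia 1.3 on, so older versions just run the loop serially.

```julia
# Hypothetical helper, not part of Compat.jl or DataFrames.jl.
# On Julia >= 1.3 each element gets its own task; on older versions
# (e.g. the 1.0 LTS) the loop simply runs serially.
function foreach_maybe_threaded(f, xs)
    @static if VERSION >= v"1.3"
        @sync for x in xs
            Threads.@spawn f(x)
        end
    else
        foreach(f, xs)
    end
    return nothing
end

# Usage: accumulate into an atomic so concurrent updates are safe.
total = Threads.Atomic{Int}(0)
foreach_maybe_threaded(x -> Threads.atomic_add!(total, x), (1, 2, 3))
```

Because `@static` picks the branch at macro-expansion time, the `Threads.@spawn` call is never expanded on versions where it does not exist.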

I’m a bit confused about whether this topic is about doing threading inside DataFrames.jl or is just a general question about users running threads over arbitrary objects.

It’s not about writing threaded code inside DataFrames.jl. It’s about whether/how things like this should work for users:

  Threads.@threads for sdf ∈ groupby(df, :grpcol)
    sdf.x3 .= -1.
  end

But the example with a tuple is simpler:

@Threads.threads for i in (1,2,3)
end

I used String as an example, to avoid going into the details of design and intended usage of GroupedDataFrame object in DataFrames.jl (as the core of the issue in the given question is the same).

Essentially we are investigating how we can speed up the split-apply-combine ecosystem in DataFrames.jl using threading.

Just to clarify: this will require internal changes in DataFrames which are well beyond the scope of this issue of course.


This actually bit me yesterday, when I tried doing (you might remember the discussion on Slack) something like:

@Threads.threads for (n, r) in enumerate(eachrow(df))
   ...
end

So I’m also interested in the broader question of which iterators are expected to receive threading support. I naively expected this to work out of the box for everything and couldn’t quite work out what the unsafe_getindex error message was telling me.

Threads.@threads statically schedules the work, so it needs random access into what is looped over. It could probably be made to work for tuples, but for simplicity I have always just looped over a vector of integers and then indexed into whatever I need based on that.
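
A minimal sketch of that pattern, using a Tuple as the thing that @threads rejects (the variable names are just for illustration):

```julia
items = (10, 20, 30)   # a Tuple: supports length and getindex, but is not an AbstractArray
squares = Vector{Int}(undef, length(items))

# Loop over an integer range, which @threads handles, and index into `items` manually.
Threads.@threads for i in 1:length(items)
    squares[i] = items[i]^2   # each iteration writes to its own slot, so no data race
end
```

The same idea should carry over to anything that supports `length` and integer `getindex`, such as the GroupedDataFrame case above.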


Is this just a case of missing methods for iterables other than an integer range or is there a more fundamental issue that prevents this from working?

To me the ability to write for loops over general iterables without having to do integer indexing is one of the great Julia features that allows for clean and readable code, and it’d be great if throwing in @Threads.threads after the fact would just work™.


For @threads it is fundamental in that it schedules all the work on threads statically.

If you want to do dynamic scheduling then use the Threads.@spawn functionality in 1.3+. Is there an issue with that?
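
As a rough sketch of what that looks like when you also want the results back (Julia 1.3+ only; `work` is a hypothetical stand-in for the real per-element computation):

```julia
work(x) = x^2   # hypothetical placeholder for the real work

xs = (1, 2, 3)  # any iterable works here; no random access is needed

# Spawn one task per element, then wait for each task and collect its result.
tasks = [Threads.@spawn work(x) for x in xs]
results = fetch.(tasks)
```

One task per element is fine when each piece of work is reasonably heavy; for very cheap iterations the per-task overhead may dominate.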

Sorry I feel like I should be doing a bit more reading on how the new threading works rather than wasting your time with basic questions - I haven’t tried the @spawn route, so far all my parallel code uses SharedArrays and then @sync @distributed for loops, and my mental model was that @threads replaces this and allows me to work with “normal” Arrays given it’s shared memory parallelism.

I’m still not sure what is so bad about random access for general iterators. In my use case yesterday I was iterating through DataFrame rows, and for my task it didn’t matter in which order the rows were processed. I might be missing something fundamental about what random access and static scheduling mean…

@kristoffer.carlsson - for me the core of the question is why @threads requires an iterable to support Base.unsafe_getindex and not just getindex and length (which already guarantees that the work can be statically scheduled).
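
To illustrate the point, here is a sketch of static scheduling that relies only on `length` and `getindex`. This is not how Base implements @threads; `static_chunks` and `threaded_foreach` are hypothetical names made up for the example.

```julia
# Split 1:n into `nchunks` contiguous ranges whose lengths differ by at most one.
function static_chunks(n::Integer, nchunks::Integer)
    q, extra = divrem(n, nchunks)
    ranges = UnitRange{Int}[]
    start = 1
    for k in 1:nchunks
        len = q + (k <= extra ? 1 : 0)
        push!(ranges, start:(start + len - 1))   # empty range when len == 0
        start += len
    end
    return ranges
end

# Apply `f` to every element of `xs`, using only length(xs) and xs[i].
function threaded_foreach(f, xs)
    chunks = static_chunks(length(xs), Threads.nthreads())
    Threads.@threads for c in chunks   # `chunks` is a Vector, so @threads accepts it
        for i in c
            f(xs[i])
        end
    end
    return nothing
end

acc = Threads.Atomic{Int}(0)
threaded_foreach(x -> Threads.atomic_add!(acc, x), (1, 2, 3))
```

The work assignment is fully determined before any thread starts, which is all that static scheduling needs.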


Yeah, I realize this and I don’t think there is any reason. It was added a long time ago in https://github.com/JuliaLang/julia/commit/a312aad47095ff1601d4a27afe5391184218e78d with the description:

Generalize @threads to work on any 1D range with random-access subscripting. Long term, we can generalize it to more general ranges similar to how @simd works

Previously, I think it only worked on unit ranges.

Just FYI, the reduce defined in Transducers.jl can be used to do something like @threads for, not only for arbitrary indexable collections but also for any collection that can halve itself (…in principle; ATM only AbstractArray is supported). Roughly speaking,

Threads.@threads for x in xs
    use(x)
end

is equivalent to

reduce(Map(identity), xs; init=nothing) do _, x
    use(x)
end

It also supports a (deterministic) break, which cannot be done with @threads for AFAICT.
