Currently the following code fails:
julia> Threads.@threads for c in "ssf"
Error thrown in threaded loop on thread 0: MethodError(f=typeof(Base.unsafe_getindex)(), args=("ssf", 1), world=0x000000000000643a)
(the original problem was posted in DataFrames.jl here)
The reason is that Base.unsafe_getindex is used internally, and it is defined only for a limited number of collection types.
With Julia 1.3 I expect that threads will be used more often.
So my question is whether this restriction is intentional and required, or whether it could be lifted for a wider range of types that support getindex (as I suppose it could)?
EDIT: the example given above is probably not the best, as strings are not indexable with step 1, but see the linked DataFrames.jl example where
GroupedDataFrame is properly indexable with 1-increment, but on purpose we do not make it a subtype of
EDIT2: probably the simplest example is with a
julia> @Threads.threads for i in (1,2,3)
Error thrown in threaded loop on thread 0: MethodError(f=typeof(Base.unsafe_getindex)(), args=((1, 2, 3), 1), world=0x00000000000063fd)
Why not use (loosely)
for i in (1,2,3)
and let the dynamic scheduler do its job?
This is only possible in Julia 1.3 and, e.g. in DataFrames.jl we try to support Julia 1.0 LTS.
Just run things serially in 1.0?
I guess Compat could define some no-op macros to allow that. But AFAICT the issue was that users wanted to write that, and it failed. Should it?
I’m a bit confused about whether this topic is about doing threading inside DataFrames.jl or is just a general question about users running threads over arbitrary objects.
It’s not about writing threaded code inside DataFrames.jl. It’s about whether/how things like this should work for users:
Threads.@threads for sdf ∈ groupby(df, :grpcol)
sdf.x3 .= -1.
But the example with a tuple is simpler:
@Threads.threads for i in (1,2,3)
String as an example, to avoid going into the details of design and intended usage of
GroupedDataFrame object in DataFrames.jl (as the core of the issue in the given question is the same).
Essentially we are investigating how we can speed up the split-apply-combine ecosystem in DataFrames.jl using threading.
Just to clarify: this will require internal changes in DataFrames which are well beyond the scope of this issue of course.
This actually bit me yesterday, when I tried doing (you might remember the discussion on Slack) something like:
@Threads.threads for (n, r) in enumerate(eachrow(df))
So I’m also interested in the broader question of which iterators are expected to receive threads support. I naively expected this to work out of the box for everything and couldn’t quite work out what the unsafe_getindex error message was telling me.
Threads.@threads statically schedules the work, so it needs random access into what is looped over. It could probably be made to work for tuples, but for simplicity I have always just looped over a vector of ints and then indexed into whatever I need based on that.
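A minimal sketch of that workaround, assuming the collection supports length and integer indexing (the names items and results are only illustrative):

```julia
items = (1, 2, 3)                     # any collection with length and getindex
results = Vector{Int}(undef, length(items))

# @threads receives a plain integer range, which it can split statically;
# each iteration then indexes into the original collection.
Threads.@threads for i in 1:length(items)
    results[i] = items[i]^2
end
```

Each iteration writes to its own slot of results, so no locking is needed in this particular case.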
Is this just a case of missing methods for iterables other than an integer range or is there a more fundamental issue that prevents this from working?
To me the ability to write for loops over general iterables without having to do integer indexing is one of the great Julia features that allows for clean and readable code, and it’d be great if throwing in @Threads.threads after the fact would just work.
For @threads it is fundamental, in that it schedules all the work on threads statically.
If you want to do dynamic scheduling then use the Threads.@spawn functionality in 1.3+. Is there an issue with that?
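For reference, a rough sketch of the @spawn route on the tuple example, assuming Julia 1.3 or later (items, tasks and results are illustrative names):

```julia
items = (1, 2, 3)

# Start one task per element; the scheduler decides which thread runs each task.
tasks = map(collect(items)) do x
    Threads.@spawn x^2
end

# fetch waits for a task to finish and returns its value.
results = fetch.(tasks)
```

This only needs iteration, not indexing, but spawning one task per element can be too fine-grained when the loop body is cheap.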
Sorry, I feel like I should be doing a bit more reading on how the new threading works rather than wasting your time with basic questions. I haven’t tried the @spawn route; so far all my parallel code uses SharedArrays and then @sync @distributed for loops, and my mental model was that @threads replaces this and allows me to work with “normal” Arrays, given it’s shared-memory parallelism.
I’m still not sure what is so bad about random access for general iterators. In my use case yesterday I was iterating through DataFrame rows, and for my task it didn’t matter in which order the rows were processed. I might be missing something fundamental about what random access and static scheduling mean…
@kristoffer.carlsson - for me the core of the question is why @threads requires an iterable to support Base.unsafe_getindex and not just getindex and length (which already guarantees that the work can be statically scheduled).
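To illustrate the point about length: a static schedule only needs the number of elements in order to split the index range into one contiguous chunk per thread. The helper static_chunks below is hypothetical, a sketch of the idea and not the code Base uses:

```julia
# Hypothetical helper: split 1:len into nchunks contiguous index ranges.
# The first `extra` chunks get one additional element each.
function static_chunks(len::Integer, nchunks::Integer)
    base, extra = divrem(len, nchunks)
    ranges = UnitRange{Int}[]
    start = 1
    for k in 1:nchunks
        n = base + (k <= extra ? 1 : 0)
        push!(ranges, start:(start + n - 1))
        start += n
    end
    return ranges
end
```

For example, static_chunks(10, 4) gives 1:3, 4:6, 7:8 and 9:10; each thread would then call getindex on the collection for the indices in its own chunk.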
Yeah, I realize this and I don’t think there is any reason. It was added a long time ago in https://github.com/JuliaLang/julia/commit/a312aad47095ff1601d4a27afe5391184218e78d with the description:
@threads to work on any 1D range with random-access subscripting. Long term, we can generalize it to more general ranges similar to how
Previously, I think it only worked on unit ranges.
reduce defined in Transducers.jl can be used to do something like @threads for, not only for arbitrary indexable collections but also for any collections that can halve themselves (…in principle; ATM only AbstractArray is supported). Roughly speaking,
Threads.@threads for x in xs
is equivalent to
reduce(Map(identity), xs; init=nothing) do _, x
It also supports (deterministic) “break”, which cannot be done with @threads for AFAICT.