Threads maxing out all cores, but no performance increase

Hello All

I started using Julia because of its speed, for an application whose workload looks like the pseudo-code I added below (no MWE yet, but hopefully this is diagnostic enough; otherwise I will create one, though this may just be a Julia-novice mistake that I wasn't able to solve through googling):

There are 4000 CSV files (4–7 MB each). For each one, the code does some data prep and then calls a function that iterates through every row and does a bunch of arithmetic. Unfortunately, the arithmetic is time-series/state dependent and cannot be vectorized.

Single-threaded, one CSV file takes 7 seconds on average. When I put Threads.@threads on the outer loop (Threads.@threads for i in 1:4000) with 8 threads, the average rises to 7.8 seconds per file.

I set the environment variable so Julia always starts with 8 threads and check for it before running anything. Most importantly, CPU utilization in the multi-threaded case goes up to 100% on all 8 cores of my CPU, yet execution is still slower. Any advice on how to use multi-threading so it actually speeds up execution?
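
For reference, this is roughly how I launch and verify (the assertion is just my sanity check, not part of the workload):

# started with JULIA_NUM_THREADS=8 set (alternatively: julia --threads 8)
@assert Threads.nthreads() == 8   # checked before running anything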

Any help is much appreciated

resArr = DataFrame()

for i in 1:4000
    df = DataFrame(CSV.File(file_i))
    dummy = fun1(df)
    resArr[i] = dummy
end

function fun1(df)
    # ... some data prep ...
    res = fun2(df)
    return res
end

function fun2(df)
    for (i, row) in enumerate(eachrow(df))
        # ... a lot of arithmetic ...
    end
    return x, y, z
end

Before you try parallelizing: have you profiled your code, e.g. with BenchmarkTools or Profile? You may have excessive allocations, and those tools will help you find and eliminate them.
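
For example, something along these lines (a sketch: files[1] stands in for one of your 4000 file paths, fun1 is from your snippet, and CSV/DataFrames are assumed to be loaded):

using BenchmarkTools
using Profile

df = DataFrame(CSV.File(files[1]))  # one representative file
@btime fun1($df)                    # reports time and, importantly, allocations

Profile.clear()
@profile fun1(df)
Profile.print(format=:flat)         # shows where the time actually goes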

2 Likes

Is each file taking 7.8 seconds (rather than 7), but you are doing ~8 files at a time? That would be an improvement. Or are you saying that your total throughput is now 1 file per 7.8 seconds?

1 Like

Unfortunately it's the latter case: the total time to completion is longer on 8 cores than single-threaded. I built the same code in Python (which is 4 times slower single-threaded) and used Dask to multi-thread it, getting roughly a 7× speedup (so right now the Python version is 1.75 times faster), but I hope to get a similar speedup in Julia once multi-threading works properly.

Sorry if this is a stupid question, but where is file_i defined in this code? Shouldn't it be files[i], where files is a vector of file names? (Maybe you are mistakenly reading the same file 4000 times here?)

Independently of the above, try putting this in a function. I am not sure how much it will help, but it certainly won't hurt:

function readdata(files)
    resArr = zeros(length(files)) # ?? Not sure how to initialize it here
    Threads.@threads for i in 1:length(files)
        df = DataFrame(CSV.File(files[i])) # ???
        dummy = fun1(df)
        resArr[i] = dummy
    end
    return resArr
end
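
If fun1 returns something that is not a number (above it looks like a tuple), a type-agnostic preallocation is an easy fallback (a sketch, not necessarily optimal):

function readdata(files)
    resArr = Vector{Any}(undef, length(files))  # holds whatever fun1 returns
    Threads.@threads for i in eachindex(files)
        df = DataFrame(CSV.File(files[i]))
        resArr[i] = fun1(df)
    end
    return resArr
end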

I called @benchmark with samples set to 5, and based on a print statement it did run 5 times, yet it somehow reports only 1 sample and 1 eval/sample; possibly the default time budget is cancelling further runs prematurely. In any case the numbers are consistent with what @time reports and vary by ±5% at most. The memory estimate seems high?

Single-threaded (entire program, 20 CSV files):

BenchmarkTools.Trial: 
  memory estimate:  35.15 GiB
  allocs estimate:  724012212
  --------------
  all times:     123.552 s (3.96% GC)
  --------------
  samples:          1
  evals/sample:     1

fun1 by itself (for 1 CSV file):

BenchmarkTools.Trial: 
  memory estimate:  1.99 GiB
  allocs estimate:  40912629
  --------------
  minimum time:     6.978 s (3.91% GC)
  median time:      6.986 s (3.89% GC)
  mean time:        6.993 s (3.90% GC)
  maximum time:     7.019 s (3.84% GC)
  --------------
  samples:          5
  evals/sample:     1

fun2 (for 1 CSV file, 1 call; this gets called 11 times per CSV file in the original program):

BenchmarkTools.Trial: 
  memory estimate:  185.95 MiB
  allocs estimate:  3718819
  --------------
  minimum time:     620.072 ms (3.64% GC)
  median time:      631.499 ms (3.73% GC)
  mean time:        631.443 ms (3.87% GC)
  maximum time:     646.031 ms (3.80% GC)
  --------------
  samples:          5
  evals/sample:     1

185 MiB in fun2 seems high, especially given that it is called 4000 × 11 times.

Working stepwise, I deleted all the arithmetic, then the iterator, and then everything except the following two lines, which cause 90% of the memory allocation:

function fun2_Barebone(df)
    df[!, "DateTime"] = floor.(Dates.DateTime.(df[!, :Date], "yyyy-mm-dd HH:MM:SS"), Dates.Hour)
    df = leftjoin(df, histP[!, ["varP", "Hour"]], on = [:DateTime => :Hour])
    return true
end

The df that goes into the function has a size of 3.4 MB (per Base.summarysize) and comes from the CSV file. The histP DataFrame I am merging is 0.2 MB, is defined as a global constant, and contains one time series of prices and dates that has to be merged with each CSV file separately, as the CSV files contain gaps.

Two questions: a) Why do those two lines allocate 164 MiB, and is there an easy way to make this more efficient? b) Why does this not scale across multiple threads?

Putting it in a function has no perceptible effect on the speed or the allocations.
The file comes from iterating over the directory with the cleaned .csv files (the directory string is stored in the variable cleanDir):

for (row, file) in collect(enumerate(readdir(cleanDir)))
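
In full, it is something like this (a sketch; the joinpath building the full path is the part I elided above):

files = readdir(cleanDir)                    # bare file names
for (row, file) in collect(enumerate(files))
    df = DataFrame(CSV.File(joinpath(cleanDir, file)))
    # ... fun1(df) etc., as above
end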

1 Like

Every time you write [something] (an array literal), you're telling Julia to allocate an array on the heap.

1 Like

As far as I understand DataFrames, histP[!,["varP","Hour"]] allocates a new array here; you could use @view histP[!,["varP","Hour"]] instead (or prefix the whole function definition with @views, as in @views function ...).

Also ["varP","Hour"] allocates that two-element small vector. If you can avoid that somehow, it would improve things as well.

2 Likes

Thank you everyone so far.

I dug a little bit deeper, worked with the view for histP, and got a 1 MiB improvement (nice, but not material). Of the two lines, 7 MiB come from the leftjoin, but 155 MiB come from what I have now rewritten as
df.DateTime = floor.(Dates.DateTime.(df.Date, "yyyy-mm-dd HH:MM:SS"), Dates.Hour)

I guess switching the notation does not change the memory allocation.

  1. Is there any faster way of converting a string column to DateTime and rounding it to hours?
  2. With 32 GB of RAM barely used on this machine and the SSD mostly idling, why does multi-threading not speed up execution, even though it utilizes more cores?

Here again, if I am not mistaken, you are allocating a new array on the right-hand side. Try using a broadcasted assignment (df.DateTime .= floor....), with a dot in the .=.
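
Spelled out, the suggestion is this (a sketch; I use df[!, :DateTime], which will create the column if it doesn't exist yet):

# broadcasted assignment into the column, instead of binding a fresh vector
df[!, :DateTime] .= floor.(Dates.DateTime.(df.Date, "yyyy-mm-dd HH:MM:SS"), Dates.Hour)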

If you provided a minimal working example, helping would be much easier. I would not focus on the parallel execution until you have the best possible serial version, but allocations can hurt parallel performance a lot, in particular if you have lots of them, as that means a lot of concurrent access to memory (and garbage-collection pauses that stall all threads).

6 Likes

I don't know if that matters; it might actually add unnecessary overhead. I don't think you can avoid allocating an array on the right-hand side at all? You'd have to write a plain loop over the individual elements to avoid it. But that should just add one allocation per file, which shouldn't be a big deal?
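
Such a loop could look roughly like this (a sketch; it still allocates the destination vector, but only once per file):

dts = Vector{Dates.DateTime}(undef, nrow(df))
for i in eachindex(dts)
    dts[i] = floor(Dates.DateTime(df.Date[i], "yyyy-mm-dd HH:MM:SS"), Dates.Hour)
end
df.DateTime = dts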

Hmm… I’m seeing that the type of the result here isn’t being inferred:

function g()
    o = "2001-04-05 12:34:25"
    floor(Dates.DateTime(o, "yyyy-mm-dd HH:MM:SS"), Dates.Hour)
end
@code_warntype g()

Variables
  #self#::Core.Const(g)
  o::String

Body::Any
1 ─      (o = "2001-04-05 12:34:25")
│   %2 = Dates.DateTime::Core.Const(DateTime)
│   %3 = o::Core.Const("2001-04-05 12:34:25")
│   %4 = (%2)(%3, "yyyy-mm-dd HH:MM:SS")::Any
│   %5 = Dates.Hour::Core.Const(Hour)
│   %6 = Main.floor(%4, %5)::Any
└──      return %6

Type inference failures are a big source of slow performance, because the compiler can't emit code optimized for the concrete types.

I see an old issue related to this: Date(::String, ::DateFormat) isn’t type stable · Issue #13644 · JuliaLang/julia (github.com)

I wonder if this is an edge case there? When I run the code in a function I see type instability, but not when I run it directly.

All that said, I found how to make it more performant: wrap the format string in Dates.DateFormat().
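
With that change the result also infers (a sketch; FMT is a name I chose, and the comment is what I would expect @code_warntype to show):

const FMT = Dates.DateFormat("yyyy-mm-dd HH:MM:SS")
function g2()
    o = "2001-04-05 12:34:25"
    floor(Dates.DateTime(o, FMT), Dates.Hour)
end
@code_warntype g2()   # Body should now be DateTime instead of Any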

1 Like

Ah, from the documentation:

DateTime(dt::AbstractString, format::AbstractString; locale="english") -> DateTime

  Construct a DateTime by parsing the dt date time string following the pattern given in the format string (see
  DateFormat for syntax).

  This method creates a DateFormat object each time it is called. If you are parsing many date time strings of the
  same format, consider creating a DateFormat object once and using that as the second argument instead.
1 Like

Oooh. I think I’ve run into this too, and wondered to myself why it was slow. Good to know.

That depends on the size of the array; hard to know. In any case, the improper type inference surely doesn't help.

1 Like

Awesome, everyone, that did the trick. It is a bit counterintuitive to me that broadcasting the function over the column re-creates the DateFormat for every element, but thank you for digging this up from the documentation.

These two lines

myDF = Dates.DateFormat("yyyy-mm-dd HH:MM:SS")
df.DateTime = floor.(Dates.DateTime.(df.Date, myDF), Dates.Hour)

reduce the allocation from 155 MiB down to 7.7 MiB :)

Broadcasting the assignment (.=) in this case increased the memory allocation to 8.06 MiB, so I am going without it for this line.

The best news: the multi-threading now works. Processing 100 CSV files takes:
Single-threaded: 45 seconds
8 threads on 8 cores: 11 seconds

Much appreciated! Together, the proper date parsing and the multi-threading speed up the code by a factor of ~56!

4 Likes