Performance with MongoDB

first-steps

#1

I recently changed jobs and I am now working at a company doing data science, with a heavy emphasis on geospatial time series.
I work with two statisticians who have so far implemented everything in R, with occasionally some Python.

After familiarizing myself with R and realizing just how horribly slow it is, I looked around for alternatives and got very excited about Julia - although I'm afraid we cannot adopt it because the Geo libraries are still at an early stage of development.

In any case, I wanted to try Julia anyway, so I rewrote in it a small Python script that basically consists of three nested for loops, each of which iterates over MongoDB queries.

I was surprised to find out that the resulting Julia script is about 50% slower than the equivalent Python script.
I tried to follow all the guidelines in the “performance tips” page, but to no avail.

I can certainly post the code, if it can help, but is it possible that this is intrinsically a bad benchmark because of the use of Dicts?
Should I try constructing the BSON directly instead?


#2

Possibly. It’s hard to know without seeing the code what is going on.


#3

Here’s the code.

function date_with_duplicates_for_grid(grid_id::Int32, client::MongoClient)
  command_simple(client, "xxx", OrderedDict(
    "aggregate" => "yyy",
    "pipeline" => [
      Dict("\$match" => Dict("id_grid" => grid_id)),
      Dict("\$group" => Dict(
        "_id" => Dict("date" => "\$date", "id_grid" => "\$id_grid"),
        "count" => Dict("\$sum" => 1))
      ),
      Dict("\$match" => Dict("count" => Dict("\$gt" => 1))),
      Dict("\$project" => Dict("_id.date" => 1))
    ]
  ))
end

function measurements_for(grid_id::Int32, date::DateTime, cdata::MongoCollection)
  find(cdata, Dict("id_grid" => grid_id, "date" => date), Dict("_id" => 1, "rain" => 1))
end

function run(grids::MongoCursor, client::MongoClient)
  cdata = MongoCollection(client, "xxx", "yyy")
  for grid in grids
    grid_id = grid["id_grid"]::Int32
    for doc in date_with_duplicates_for_grid(grid_id, client)["result"]
      d = doc["_id"]["date"]::DateTime
      mes = [m::BSONObject for m in measurements_for(grid_id, d, cdata)]
      vals = unique([m["rain"]::Real for m in mes])
      if length(vals) > 1
        println("More than one!")
      else
        for m in mes[2:end]
          id = m["_id"]
          delete(cdata, Dict("_id" => id))
        end
      end
    end
  end
end

#4

Any suggestion?


#5

I don’t see anything obviously slow, but you can benchmark each part with BenchmarkTools.jl and find out exactly what performs poorly.
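For instance, a minimal self-contained sketch of timing one piece in isolation (here `build_match` is a made-up stand-in for constructing one of the pipeline Dicts above, just to measure Dict-building cost on its own):

```julia
using BenchmarkTools

# Hypothetical stand-in for building the $match stage of the pipeline.
build_match(id) = Dict("\$match" => Dict("id_grid" => id))

# $-interpolation keeps @btime from timing global-variable access.
grid_id = Int32(7)
@btime build_match($grid_id)
```

In the real script you would wrap each stage the same way - the aggregate call, the find, the unique - and compare the reported times against the Python equivalents.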


#6

It seems @profile is more apt here?


#7

Right, I’ve been drilling through my own code with @btime for so long that I’ve forgotten there’s a more standard way :slight_smile: +1 for @profile.
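For completeness, a self-contained sketch of the @profile workflow from the stdlib Profile module, using a toy workload in place of the real run() loop:

```julia
using Profile

# Toy workload standing in for the real run(grids, client) loop.
function work()
    s = 0.0
    for i in 1:1_000_000
        s += sin(i)
    end
    return s
end

work()              # run once first so compilation time is not profiled
Profile.clear()     # drop any samples collected so far
@profile work()
Profile.print()     # prints a tree showing where the samples landed
```

The lines with the most samples in the tree are where the real time goes; that should show whether it is the Mongo queries themselves or the Dict/BSON handling that dominates.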