Performance with MongoDB

first-steps

#1

I recently changed jobs and I am now working at a company doing data science, with a heavy emphasis on geospatial time series.
I work with two statisticians who have so far implemented everything in R, with occasionally some Python.

After familiarizing myself with R and realizing just how horribly slow it is, I looked around for alternatives and got very excited about Julia - although I'm afraid we cannot adopt it because the Geo libraries are still at an early stage of development.

In any case, I wanted to try Julia anyway, so I rewrote in it a small Python script that basically consists of three nested for loops, each of which iterates over MongoDB queries.

I was surprised to find out that the resulting Julia script is about 50% slower than the equivalent Python script.
I tried to follow all the guidelines in the “performance tips” page, but to no avail.

I can certainly post the code, if it can help, but is it possible that this is intrinsically a bad benchmark because of the use of Dicts?
Should I try constructing the BSON directly instead?


#2

Possibly. It’s hard to know without seeing the code what is going on.


#3

Here’s the code.

function date_with_duplicates_for_grid(grid_id::Int32, client::MongoClient)
  command_simple(client, "xxx", OrderedDict(
    "aggregate" => "yyy",
    "pipeline" => [
      Dict("\$match" => Dict("id_grid" => grid_id)),
      Dict("\$group" => Dict(
        "_id" => Dict("date" => "\$date", "id_grid" => "\$id_grid"),
        "count" => Dict("\$sum" => 1))
      ),
      Dict("\$match" => Dict("count" => Dict("\$gt" => 1))),
      Dict("\$project" => Dict("_id.date" => 1))
    ]
  ))
end

function measurements_for(grid_id::Int32, date::DateTime, cdata::MongoCollection)
  find(cdata, Dict("id_grid" => grid_id, "date" => date), Dict("_id" => 1, "rain" => 1))
end

function run(grids::MongoCursor, client::MongoClient)
  cdata = MongoCollection(client, "xxx", "yyy")
  for grid in grids
    grid_id = grid["id_grid"]::Int32
    for doc in date_with_duplicates_for_grid(grid_id, client)["result"]
      d = doc["_id"]["date"]::DateTime
      mes = [m::BSONObject for m in measurements_for(grid_id, d, cdata)]
      vals = unique([m["rain"]::Real for m in mes])
      if length(vals) > 1
        println("More than one!")
      else
        for m in mes[2:end]
          id = m["_id"]
          delete(cdata, Dict("_id" => id))
        end
      end
    end
  end
end

#4

Any suggestion?


#5

I don’t see anything obviously slow, but you can benchmark each part with BenchmarkTools.jl and find out exactly what performs poorly.
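For instance, a minimal self-contained sketch of timing one piece in isolation (here `build_match` is a made-up stand-in for constructing one of the pipeline Dicts above, just to measure Dict-building cost on its own):

```julia
using BenchmarkTools

# Hypothetical stand-in for building the $match stage of the pipeline.
build_match(id) = Dict("\$match" => Dict("id_grid" => id))

# $-interpolation keeps @btime from timing global-variable access.
grid_id = Int32(7)
@btime build_match($grid_id)
```

In the real script you would wrap each stage the same way - the aggregate call, the find, the unique - and compare the reported times against the Python equivalents.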


#6

It seems @profile is more apt here?


#7

Right, I’ve been drilling through my own code with @btime for so long that I’ve forgotten there’s a more standard way :slight_smile: +1 for @profile.
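For completeness, a self-contained sketch of the @profile workflow from the stdlib Profile module, using a toy workload in place of the real run() loop:

```julia
using Profile

# Toy workload standing in for the real run(grids, client) loop.
function work()
    s = 0.0
    for i in 1:1_000_000
        s += sin(i)
    end
    return s
end

work()              # run once first so compilation time is not profiled
Profile.clear()     # drop any samples collected so far
@profile work()
Profile.print()     # prints a tree showing where the samples landed
```

The lines with the most samples in the tree are where the real time goes; that should show whether it is the Mongo queries themselves or the Dict/BSON handling that dominates.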