MongoDB Tutorial 4 - Performance
Storage engine
The storage engine sits between the MongoDB server and the file storage. Use db.serverStatus().storageEngine to check which engine is in use.
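For instance, in the mongo shell (exact output varies by server version and engine; only the name field is assumed here):
> db.serverStatus().storageEngine
{ "name" : "wiredTiger", ... }
> db.serverStatus().storageEngine.name
wiredTiger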
WiredTiger
For many applications this is the faster engine:
- Document-level concurrency with a lock-free, optimistic concurrency model: it assumes two writes will not hit the same document; if they do, one write is unwound and retried.
- Compression of data and indexes. WiredTiger manages its own memory/cache.
- Append-only storage; no in-place updates.
MMAP V1
Uses the mmap system call under the covers to map files or devices into memory; the operating system manages memory/virtual memory.
- collection level locking, multiple reader, single writer lock.
- in place update (database does not have to allocate and write a full new copy of the object).
- power-of-2 sizes when allocating initial record storage, with a minimum of 32 bytes.
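As a rough illustration of that allocation rule (a sketch of the power-of-2 strategy only, not MongoDB's actual implementation), in the mongo shell:
// illustrative only: round a record size up to the next power of 2,
// with a 32-byte floor, as MMAPv1's power-of-2 allocation does
function allocationSize(docBytes) {
    var size = 32;          // minimum allocation
    while (size < docBytes) {
        size *= 2;          // next power of 2
    }
    return size;
}
allocationSize(300)         // 512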
Indexes
Indexes are built with B-trees (or B+ trees). With indexes, writes are slower but reads are much faster. Without an index, a query on 10 million documents becomes a linear scan with far too much disk I/O.
Compound indexes follow the leftmost-prefix rule: with an index on (a, b, c), queries on a, (a, b), and (a, b, c) can use it fully, (a, c) can use it only partially (the a part), and other combinations cannot use it.
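A minimal sketch of the leftmost-prefix rule, using a hypothetical items collection with fields a, b, and c:
db.items.createIndex({ a: 1, b: 1, c: 1 })
db.items.find({ a: 5 })             // uses the index
db.items.find({ a: 5, b: 7 })       // uses the index
db.items.find({ a: 5, c: 9 })       // uses only the 'a' part of the index
db.items.find({ b: 7 })             // cannot use the index
db.items.find({ c: 9 }).explain()   // expect a COLLSCAN in the plan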
Indexes Size
It’s essential to fit the working set (of which indexes are a key component) into memory. To check index sizes in the mongo shell:
db.collection.stats() //detailed for each index
db.collection.totalIndexSize() //just the total
For a students collection of 10 million documents, each index is about 300 MB with the MMAPv1 storage engine. Since 3.0, the WiredTiger storage engine provides a few types of compression; one of them, prefix compression, allows smaller indexes: the same index is about 100 MB instead of 300 MB. The savings come at the cost of CPU and depend on whether the dataset can actually take advantage of something like prefix compression.
Number of Index Entries (Cardinality)
- Regular: 1:1, proportional to collection size (an index entry is needed even for null values)
- Sparse: <= number of documents (documents missing the indexed field get no entry)
- Multikey: an index entry for every array element, so it could be (significantly) greater than the number of documents
If a document is updated and needs to be moved on disk, its index entries need to be updated too. That cost only exists in the MMAPv1 storage engine. In the WiredTiger storage engine, index entries don’t contain pointers to actual disk locations; instead, the index points to an internal document identifier (the record id), which is immutable. Therefore, when a document is updated and relocated, its index entries do not need to be touched at all.
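A quick sketch of the sparse and multikey cases, on a hypothetical people collection:
// sparse: only documents that actually have twitter_handle get an index entry
db.people.createIndex({ twitter_handle: 1 }, { sparse: true })
// multikey: one index entry per array element of tags
db.people.insert({ name: "Ada", tags: ["math", "computing", "logic"] })  // 3 entries in the tags index
db.people.createIndex({ tags: 1 })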
Geospatial Indexes
2D Type
Allows finding things based on location. Documents have 'location': [x, y] stored in them, representing establishments around a person such as restaurants or coffee shops. Create the index with createIndex({'location': '2d', type: 1}) ('2d' is the reserved type for two-dimensional geo indexes). Use find({location: {$near: [x0, y0]}}).limit(20) to find the closest establishments, where $near is a query operator and [x0, y0] is the person’s standing point.
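Putting that together as a small sketch (the shops collection, its documents, and the coordinates are made up for illustration):
db.shops.insert({ name: "cafe corner", type: "coffee", location: [45.37, -21.74] })
db.shops.createIndex({ location: '2d', type: 1 })
db.shops.find({ location: { $near: [45, -21] } }).limit(20)  // 20 closest establishments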
Spherical Type
MongoDB supports a subset of the GeoJSON location specification. A GeoJSON document is used as the value of the location key. Google Maps shows @latitude,longitude as part of the URL for a particular location; MongoDB takes the opposite order, longitude,latitude. Although it is a spherical model, it only looks at points on the surface of the sphere rather than points in the air.
First, create the index:
db.places.createIndex({'location':'2dsphere'})
Then, to find the points nearest to Hoover Tower, in geonear.js:
db.places.find({
    location: {
        $near: {
            $geometry: {
                type: "Point",
                coordinates: [-122, 37]  // Hoover Tower
            },
            $maxDistance: 2000  // in meters
        }
    }
}).pretty()
To run the above query:
mongo < geonear.js
Text Indexes
aka full-text search index. Typically, when searching strings, the entire string must match. One exception: a regex search on an indexed field will search the index (rather than the full collection), and if you anchor it on the left by beginning with ^, you can often do better still. Alternatively, put every word into an array and use set operators (a sketch of that alternative appears at the end of this section).
db.sentences.createIndex({'words':'text'})
db.sentences.find({$text:{$search:'dog'}})
A regular search db.sentences.find({'words':'dog'}) will return nothing, since the whole stored string would have to match.
We can rank the results by a score,
db.sentences.find(
    { $text: { $search: 'dog tree obsidian' } },
    { score: { $meta: 'textScore' } }
).sort({ score: { $meta: 'textScore' } })
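For comparison, the word-array alternative mentioned above would look roughly like this (the tokens field is hypothetical):
db.sentences.insert({ words: "dog shrub ruby", tokens: ["dog", "shrub", "ruby"] })
db.sentences.createIndex({ tokens: 1 })                    // multikey index over the array
db.sentences.find({ tokens: { $all: ["dog", "ruby"] } })   // documents containing both words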
Index Design
The goal is efficient read/write operations.
- Selectivity - minimize number of records scanned for a given query pattern.
- Other operations - how are sorts handled?
Example query for 1 million student records:
db.students.find(
    { student_id: { $gt: 500000 }, class_id: 54 }
).sort({ student_id: 1 }).hint({ class_id: 1 }).explain("executionStats")  // hint() specifies the index to use
Note that without hint(), "totalKeysExamined" : 850433 is much greater than "nReturned" : 10118, so the index was not very selective. Check the queryPlanner, winningPlan, and rejectedPlans sections: the plan scanning the student_id index examines half of the collection. A plan using class_id as the index would need an in-memory sort ("stage" : "SORT") instead of getting the sort order from the index, so it was rejected. With hint(), "totalKeysExamined" drops to 20071, and "executionTimeMillis" drops to 79 ms from 2600 ms.
With db.students.createIndex({class_id: 1, student_id: 1}) we will be using the most selective part of the query, which is the equality/point query on class_id rather than the range query on student_id. Generally speaking, put equality fields first when building a compound index.
With the above compound index and query below,
db.students.find(
    { student_id: { $gt: 500000 }, class_id: 54 }
).sort({ final_grade: 1 }).explain("executionStats")
"totalKeysExamined"
and nReturned
are 10118, executionTimeMillis
138 ms with in memory sort on final_grade
done. It’s good to avoid in memory sort when you can, the trade off is to examine a few more keys. We can create another compound index, db.students.createIndex({class_id:1, final_grade:1, student_id:1})
, final_grade
has to be immediately after class_id
since we want to walk the index keys in order.
db.students.find(
    { student_id: { $gt: 500000 }, class_id: 54 }
).sort({ final_grade: -1 }).explain("executionStats")
Now the result shows "executionTimeMillis" : 27 and "totalKeysExamined" : 10204. The winningPlan no longer has a SORT stage, only a FETCH stage.
MongoDB can walk the index backward in order to sort on the final_grade field. That works because we are sorting on only that field; if we want to sort on multiple fields, the direction of each sort field in the query must match the direction of the corresponding field in the index (or all directions must be reversed together). So to sort with something like db.collection.find({ a: 75 }).sort({ a: 1, b: -1 }), we must create the index with the same directions, e.g., db.collection.createIndex({ a: 1, b: -1 }).
In general, the rules of thumb are:
- Equality fields before range fields
- Sort fields before range fields
- Equality fields before sort fields
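Applying those rules as a sketch on the hypothetical items collection from the prefix-rule example above (equality on a, sort on b, range on c, matching an { a: 1, b: 1, c: 1 } index):
db.items.createIndex({ a: 1, b: 1, c: 1 })   // equality field, then sort field, then range field
db.items.find({ a: 42, c: { $gt: 100 } }).sort({ b: 1 }).explain("executionStats")
// expected: an IXSCAN with no in-memory SORT stage in the winning plan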
Logging Slow queries
Slow queries taking longer than 100 ms are automatically logged by the default logging facility. Without an index, db.students.find({student_id: 10000}) takes about 4 s among 1 million student documents.
There is a profiler built into mongod, a more sophisticated facility that writes an entry to the system.profile collection for any query taking longer than a specified time.
- level 0: off
- level 1: log slow queries
- level 2: log all queries (general debugging)
To start the profiler,
mongod --dbpath /usr/local/var/mongodb --profile 1 --slowms 2
# logs queries above 2 ms
Use db.system.profile.find().pretty() to check for slow queries. For the same query above, the profile entry includes:
"query":{
"student_id" : 10000
}
"ntoreturn" : 0,
"nscanned" : 10000000,
...
"nReturned" : 1
"millis" : 4231
system.profile is a capped (fixed-size) collection, which recycles entries once its space is used up. Some example usages:
> db.system.profile.find({ ns: /school.students/ }).sort({ ts: 1 }).pretty()
// namespace: school database, students collection; sorted by timestamp
> db.system.profile.find({ millis: { $gt: 1 } }).sort({ ts: 1 }).pretty()
// queries that took longer than 1 ms
> db.getProfilingLevel()
1
> db.getProfilingStatus()
{"was" : 1, "slowms" : 2}
> db.setProfilingLevel(1, 4) // slow queries above 4 ms
{"was" : 1, "slowms" : 2, "ok" : 1}
> db.setProfilingLevel(0)
{"was" : 1, "slowms" : 4, "ok" : 1}
mongotop and mongostat
mongotop is named after the Unix top command. It gives a high-level view of where mongod spends most of its time, which one can then follow up with more detailed examination and optimization.
$ mongotop 3  # reports every 3 seconds; the default is every second
ns total read write 2016-06-25T23:32:52-07:00
admin.system.roles 0ms 0ms 0ms
admin.system.version 0ms 0ms 0ms
blog.posts 0ms 0ms 0ms
blog.sessions 0ms 0ms 0ms
mongostat is a performance-tuning command, similar to Unix iostat. It samples the database at 1-second intervals and gives information such as the number of queries, inserts, updates, deletes, etc. getmore counts fetches of more results from a cursor for a query with a large result set. qr|qw and ar|aw are the queued and active reads and writes. % dirty and % used refer to percentages of the WiredTiger cache. res is the resident memory size, which is smaller for WiredTiger. For the MMAPv1 engine, faults gives the number of page faults; for WiredTiger, the % used cache can indicate the amount of I/O being done. Check the fields section of the mongostat documentation.
$ mongostat [--port 27017]
insert query update delete getmore command % dirty % used res qr|qw ar|aw netIn netOut
*0 *0 *0 *0 0 1|0 0.0 0.0 52.0M 0|0 0|0 79b 18k
*0 *0 *0 *0 0 1|0 0.0 0.0 52.0M 0|0 0|0 79b 18k
Sharding Overview
Sharding means splitting a large collection amongst multiple servers. The details can be transparent to the client: mongos routes requests to the mongod shards, and each shard can contain multiple servers as a replica set.
An insert must contain the shard key, e.g. student_id. For update, remove, and find, if the shard key is not given, mongos will have to broadcast the request to all the shards.
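A minimal sketch of the commands involved, run against a mongos (the school.students namespace and the last_name query are assumed for illustration):
sh.enableSharding("school")                                // enable sharding on the database
sh.shardCollection("school.students", { student_id: 1 })   // shard key: student_id (an index on it is required)
db.students.find({ student_id: 530000 })                   // shard key given: routed to one shard
db.students.find({ last_name: "Nguyen" })                  // no shard key: broadcast to all shards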
Resources
Link to the MongoDB tutorial series.