Connecting MongoDB to Elasticsearch
Since you are already reading this, I assume you know what Mongo and Elastic are, so let’s skip to the main content.
There are a few blogs (such as this and this) which discuss this. Some of the mentioned tools are fairly old and no longer maintained. Some work only in batch mode, i.e., a one-time dump from Mongo to Elastic. I wanted to sync them in real time, i.e., as soon as a new document is inserted into Mongo, it should also be inserted into an Elasticsearch index.
My guess for why these tools are not actively maintained is that MongoDB now comes with better full-text search features (such as field boosting), so (at least for simple use cases) people end up using Mongo itself for their text search requirements. This was not the case for us because:
- Our data was relatively large in size (~10GB), and we had to perform full text search very frequently.
- Elasticsearch is more feature rich. For example, it lets us prioritise recent content in the search results more intelligently (penalising old content using decay functions) as opposed to a simple “sort by date”.
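To make the decay idea concrete, here is a rough sketch of what such a query body could look like, written as a Python dict. The field names `content` and `published_at`, and the 30-day scale, are made up for illustration; they are not from our actual index.

```python
def recency_boosted_query(text, date_field="published_at", scale="30d"):
    """Build a function_score query whose score decays for older documents.

    `content` and `published_at` are hypothetical field names.
    """
    return {
        "query": {
            "function_score": {
                # The usual full text match provides the base relevance score.
                "query": {"match": {"content": text}},
                "functions": [
                    {
                        # Gaussian decay: a document `scale` away from
                        # `origin` has its score multiplied by `decay`.
                        "gauss": {
                            date_field: {
                                "origin": "now",
                                "scale": scale,
                                "decay": 0.5,
                            }
                        }
                    }
                ],
                # Multiply the text relevance by the decay factor.
                "boost_mode": "multiply",
            }
        }
    }
```

Unlike a plain "sort by date", a very relevant but slightly older document can still outrank a barely relevant new one.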
The library I ended up using was Monstache. It is a daemon which keeps the Elasticsearch database in sync with a MongoDB database. It works by monitoring the MongoDB oplog for any operation on the database. Since the oplog is available only for MongoDB databases deployed as replica sets, Monstache does not work for standalone MongoDB instances. (I discuss a method for standalone MongoDB at the end.)
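To give a feel for what oplog-based syncing involves, here is a minimal sketch that turns a single oplog entry into Elasticsearch bulk actions. This is not Monstache's actual code, just the general idea. The `fetch_document` callback is a hypothetical helper that re-reads the current document from Mongo, since an update entry describes a change rather than the full document.

```python
def oplog_to_bulk_actions(entry, fetch_document):
    """Translate one oplog entry into a list of Elasticsearch bulk lines.

    `fetch_document(ns, doc_id)` is a hypothetical callback that returns
    the current document from MongoDB.
    """
    op, ns = entry["op"], entry["ns"]  # ns looks like "database.collection"

    if op == "i":  # insert: the full document is in "o"
        doc = dict(entry["o"])
        doc_id = str(doc.pop("_id"))  # _id goes in the action, not the body
        return [{"index": {"_index": ns, "_id": doc_id}}, doc]

    if op == "u":  # update: "o2" identifies the document, so re-read it
        doc_id = entry["o2"]["_id"]
        doc = dict(fetch_document(ns, doc_id))
        doc.pop("_id", None)
        return [{"index": {"_index": ns, "_id": str(doc_id)}}, doc]

    if op == "d":  # delete: "o" holds only the _id
        return [{"delete": {"_index": ns, "_id": str(entry["o"]["_id"])}}]

    return []  # no-ops, commands, and anything else are ignored here
```

For example, an insert entry like `{"op": "i", "ns": "blog.posts", "o": {"_id": 1, "title": "Hello"}}` becomes an `index` action for the index `blog.posts`, followed by the document body without its `_id`.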
Step 1: Setting up monstache
Monstache is a daemon. Where you choose to run it is up to you. We run it on the AWS EC2 instance which hosts our primary MongoDB instance. Of course, you can run it on a (small) dedicated server as well.
Get the latest monstache binary from here. Monstache expects a single configuration file as a parameter. This was ours (short explanation follows):
Authentication: Here, both the Mongo and Elastic authentication are basic username/password based, but Monstache also supports other types.
Mapping from Mongo to Elastic: By default, Monstache inserts the documents into an Elasticsearch index named <mongo-document-database>.<mongo-document-collection>
This can be changed by specifying an index-mapping.
Custom logic: You might want to filter out some documents, or pick only a few fields to insert into Elasticsearch. A custom script, like the one above, is one way of doing that.
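The mapping and filtering logic described above boils down to something like the following. Note that this is a Python sketch of the logic only, not Monstache configuration syntax, and the namespace `blog.posts`, the index name `posts-search`, and the field names are all hypothetical.

```python
# Hypothetical override: send one collection to a differently named index.
INDEX_MAPPING = {"blog.posts": "posts-search"}

# Hypothetical whitelist of fields we want to be searchable.
SEARCHABLE_FIELDS = ("title", "body", "published_at")


def target_index(namespace):
    """Default to '<database>.<collection>' unless a mapping overrides it."""
    return INDEX_MAPPING.get(namespace, namespace)


def transform(doc):
    """Return the document to index, or None to skip it entirely."""
    if doc.get("draft"):  # filter out documents we never want to index
        return None
    # Keep only the whitelisted fields that are actually present.
    return {key: doc[key] for key in SEARCHABLE_FIELDS if key in doc}
```

So `target_index("blog.users")` falls back to `blog.users`, while a document flagged as a draft is dropped before it ever reaches Elasticsearch.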
Step 2: Running monstache
./monstache -f monstache.toml
If you are running this on a remote machine, you will want to keep it running after you log out as well. Use this instead:
nohup ./monstache -f monstache.toml &
This will run it in the background, and the output will be written to a file nohup.out
$ tail -f nohup.out
Sending to Elastic : 85xc5z8f33ul4js
Sending to Elastic : 2p1ruc42ya8o3ko
Sending to Elastic : muw6lmtzwgr4ivr
Sending to Elastic : k6cdbf505qzy9ce
Sending to Elastic : 8p0vk79n7akr5f6
...
That’s it! Check your Elasticsearch database for the documents, and you should find them there.
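One quick way to check is to ask Elasticsearch how many documents the index holds. Here is a small sketch using only the Python standard library. It assumes an Elasticsearch instance reachable without authentication at a URL you pass in, so for a password-protected cluster like ours you would need to add credentials.

```python
import json
import urllib.request


def count_url(base_url, index):
    """Build the URL of the _count endpoint for an index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def count_documents(base_url, index):
    """Return the number of documents currently in the index."""
    with urllib.request.urlopen(count_url(base_url, index)) as resp:
        return json.load(resp)["count"]


# Example (assumes a local, unauthenticated instance and a hypothetical index):
# count_documents("http://localhost:9200", "blog.posts")
```

If the count matches what you see in the Mongo collection, the sync is working.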
Step 3 (Optional): Replicating this setup locally
It is possible that you want this Mongo-to-Elastic logic on your local dev setup as well. In case your local MongoDB setup is a replica set, the above approach should work.
But more likely, you use the standalone deployment method for MongoDB on your local dev machine. In that case, we need another approach. I use Logstash for this. This is a tool built by Elastic (the company behind Elasticsearch), and lets you connect different sources of data.
This is a much more powerful and general tool. But for our case, it works similarly to Monstache. We specify a source of data, our custom transformations, and an output ‘sink’ in a configuration file, and Logstash does the rest. Here is mine:
Fairly self-explanatory. Notably, we have used the MongoDB input plugin for Logstash. This plugin checks for MongoDB updates every 5 seconds, and pushes them to Elasticsearch.
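The polling approach is conceptually different from tailing the oplog. Here is a simplified sketch of the idea, with a plain Python list standing in for the Mongo collection. It is my own illustration and not the plugin's actual implementation: remember the last `_id` you have seen, and on every tick ask for anything newer.

```python
def poll_once(collection, last_seen_id):
    """Return documents newer than the checkpoint, plus the new checkpoint.

    `collection` is an in-memory list of dicts standing in for MongoDB.
    """
    new_docs = sorted(
        (d for d in collection
         if last_seen_id is None or d["_id"] > last_seen_id),
        key=lambda d: d["_id"],
    )
    if new_docs:
        last_seen_id = new_docs[-1]["_id"]  # advance the checkpoint
    return new_docs, last_seen_id


# A polling loop would then look roughly like this:
#
#   checkpoint = None
#   while True:
#       docs, checkpoint = poll_once(collection, checkpoint)
#       ...send docs to Elasticsearch...
#       time.sleep(5)
```

In this sketch a query runs on every tick, even when nothing has changed, and a checkpoint based only on `_id` picks up new documents but not edits to, or deletions of, existing ones.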
This approach can, obviously, be used for your production setup as well. But I found it to be very CPU-intensive. Also, I found Monstache to be an amazing tool — it’s well documented, (very) well maintained and overall just very easy to use (all attributes of a good open source library :)).