CouchDB - blessings and curses

John Wood (of Signal fame) has a post up about Signal's experience moving to, and away from, CouchDB.  It's an interesting real-world example of what I've described in "NoSQL - What you'll find (for sure!)".  To recap, when first getting into NoSQL, you are sure to find that

  • You didn't understand your own problem-space as well as you thought you did. 
  • You didn't understand the package that you are using as well as you thought you did.
  • It will not scale the way you thought it would.  Oh, it'll scale all right, just not the way you thought it would. 
  • Your object/document/JSON/whatever model really doesn't map exactly the way you expected it to.
In John's case, they found that
  • HTTP is a Very Slow Database Protocol
  • MVCC Overhead (is bad)
  • Large Databases Beat Up the Hard Disk
  • CouchDB is not a Distributed Database (by default)
  • map/reduce takes a while to get used to
  • Views take forever to build
  • Views are gigantic on disk
  • Replication issues
Taking these one by one from my perspective:
  1. HTTP is a Very Slow Database Protocol : Yes it is.  And if your system is User I/O bound (as is ours), then it really doesn't matter :-)  Which is a serious point - it all depends on where the bottle-neck of the system is.  
  2. MVCC Overhead (is bad) : Which can be a feature or a bug.  For us, it's a feature - we know that our records are safely written, and design accordingly.
  3. Large Databases Beat Up the Hard Disk : Absolutely true.  And we've got a couple of administrative scripts that are constantly running in the background compacting all the shards and views.  Which, in my mind, is no different from the VACUUM / REINDEX scripts that we had running on the PostgreSQL environment in our previous product.  There is always some administrative overhead, no way of getting around that.
  4. CouchDB is not a Distributed Database (by default) : Again, absolutely true.  That is why we went with bigcouch which works pretty spectacularly - auto-sharding, distributed queries, fault-tolerant, etc.  It works brilliantly, and I couldn't be happier w/ the folks at Cloudant.
  5. map/reduce takes a while to get used to : Yet again, absolutely true.  Then again, we're Erlang based, and that probably helped us quite a bit.  Also, it's kind of difficult to *not* think map/reduce after using it for a while (think working in Erlang, and then going back to Java.  Shudder)
  6. Views take forever to build : Sigh.  True again.  But then again, when using bigcouch, it's actually Forever / Nodes (ok, shards also matter).  Seriously though, this is an issue, and one that we address through some serious development / deployment methodology (each view is a different document, incremental changes, etc.) which allows for easy rollbacks, etc.
  7. Views are gigantic on disk : True true true.  But, three things here - you need to be careful about how many values you are emitting per row, you need to check if multiple views can be used instead of a single complex one, and you need to be careful about maintenance (point 3 above).  With a little care/feeding, our view sizes collapsed from Gigabytes to Tens of Megabytes (really!)
  8. Replication issues : Bigcouch.  'nuff said.  See point 4 above. 
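
To make point 7 a little more concrete, here is a rough sketch of why the emitted value matters so much.  It is plain Python, not CouchDB: the two map functions, the sample documents, and the JSON-length size estimate are all made up for illustration, and the real on-disk view format is different.  The only CouchDB-specific bit is the idea that a view can emit a tiny value and let the caller pull the document at query time (e.g. with include_docs=true).

```python
import json

def emit_whole_doc(doc):
    # "Heavy" view: the entire document is copied into every view row.
    yield doc["type"], doc

def emit_key_only(doc):
    # "Lean" view: emit only the key; the document itself can be
    # fetched at query time instead of being stored in the view.
    yield doc["type"], None

def approx_view_bytes(map_fn, docs):
    # Crude size estimate: total length of the JSON-encoded rows.
    # This is an illustrative proxy, not CouchDB's actual B-tree size.
    total = 0
    for doc in docs:
        for key, value in map_fn(doc):
            total += len(json.dumps([key, value]))
    return total

# Hypothetical data set: 1000 documents of roughly 1K each.
docs = [{"_id": str(i), "type": "event", "payload": "x" * 1000}
        for i in range(1000)]

heavy = approx_view_bytes(emit_whole_doc, docs)
lean = approx_view_bytes(emit_key_only, docs)
print(f"heavy view ~{heavy} bytes, lean view ~{lean} bytes")
```

The exact numbers don't matter.  The point is that the heavy view grows with the size of the documents, while the lean one grows only with the number of rows.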
Seriously, do understand that this is not a rebuttal.  Ok, maybe it is, but, I am not saying "John Wood is wrong".  If anything, I am absolutely agreeing with him when he says
I would never throw out a hammer because it didn’t help me drive in a screw.
To take that analogy to its logical extreme, he has a whole bunch of screws, and needed a screwdriver.  Me, I had a whole bunch of nails, and actually needed the hammer. :-)

For what it's worth, I hope this helps someone else when they get around to choosing between Hammers and Screwdrivers.

Note:  Our bigcouch installation sucks in ~2M documents per day, at around 1K/document...
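
For a sense of scale, the back-of-the-envelope arithmetic behind that note looks something like the sketch below.  It assumes "1K" means 1000 bytes and that the load is spread evenly over the day, which real traffic never is.  It also counts only raw document bytes, so revisions, replicas, and views all come on top of this.

```python
# Figures from the note above; the even spread over 24h is an assumption.
DOCS_PER_DAY = 2_000_000
BYTES_PER_DOC = 1_000            # "around 1K", taken here as 1000 bytes
SECONDS_PER_DAY = 24 * 60 * 60

# Average write rate if documents arrived at a steady pace.
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY

# Raw document volume per day, before revisions, replicas, or views.
raw_gb_per_day = DOCS_PER_DAY * BYTES_PER_DOC / 1e9

print(f"~{docs_per_second:.0f} docs/sec on average")
print(f"~{raw_gb_per_day:.0f} GB/day of raw documents")
```

That works out to roughly 23 documents per second and about 2 GB of raw documents per day.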

