Sad to see. We were one of their early paying customers in 2009/2010 (we got a g...

imaginenore · on Sept 22, 2014

There are so many scalable (and often free) databases out there. What do you that can't be done in Cassandra or Mongo or Vertica?

swartkrans · on Sept 22, 2014

Honest question, is MongoDB actually scalable? I keep reading about how it is not for serious business[1][2][3][4]. Who is using Mongo at scale and what sort of loads are they seeing and are the level of problems on par with other solutions? As a developer I like Mongo's API, but I would worry about using it given what has been written about it.

[1] http://aphyr.com/posts/284-call-me-maybe-mongodb

[2] http://svs.io/post/31724990463/why-i-migrated-away-from-mong...

[3] http://use-the-index-luke.com/blog/2013-10/mysql-is-to-sql-l...

[4] https://news.ycombinator.com/item?id=8353228

deepsun · on Sept 22, 2014

I agree with you, per our experience Mongo is not scalable. We use Mongo in our company (60GB per db.stats().fileSize), both for relational data, and for what we call "fat data" (just to avoid calling these mere gigabytes "big data" -- those are just lots of statistical numbers, more like OLAP cubes).

Over many problems and sad surprises, we came to the point that we'd better have used PostgreSQL (or any other SQL) for relational data, and some NoSQL for "fat data". Or just single PostgreSQL would do better, because Mongo's feature set is a subset of PostgreSQL's feature set. Just Mongo renamed them and pretends it's something new.

StillBored · on Sept 23, 2014

  "fat data" (just to avoid calling these mere gigabytes "big data"

Mere GB's of data is just a normal database.

I have a rule that should be more widespread:

If you can put the database in RAM on a x86 server, its not "big data" by any stretch. Then beyond that, it becomes more complex, but for starters lets consider whether the indexes fit in RAM.

If the indexes/hashes for your data cannot fit in RAM on a commodity x86 then you probably can consider that you have "big data".

So, currently its possible to buy supermicro systems that take 6TB of RAM (just a normal QPI link) without getting into any of the exotic SSI systems (like the SGI UV 2000).

We should also avoid talking about the physical plant requirements for "big data" as well, since as its possible to put over 350TB of storage in 4U with products from Nexsan, JetStor, etc. That is over 3PB per rack...

So, you can call your data set "big data" if the indexes are > 6TB or the actual data set is > 3PB. These numbers will change next year when new machines/storage arrive.

techdragon · on Sept 23, 2014

I find it easier to think about it from a data "usage" angle, "if you can query anything you like while running the 'database' on a single machine and get your answer within 24hrs your data is not 'big'"

I used to see 24hr OLAP cube runs and no one ever called that 'big data'. It's entirely a question of scale in my mind, because these days you can buy truly gigantic servers, and they are phenomenally powerful but if your data needs multiple servers dividing the load in order to perform queries in a timely manner then you start talking about big data, it's a question of scale.

deepsun · on Sept 23, 2014

Yes, we call it "fat data" because of different approach to it, "bigdatey". E.g. it's not relational -- we don't JOIN it with anything (we intentionally designed it so that we stay scalable while growing), and it would perfectly fit a NoSQL datastorage (just don't confuse NoSQL with MongoDB :)

After working in Google I (with another teammate) kinda "feel" what's big data: it's more about approach, mindset and toolset to work with it. I agree with you and @techdragon that if you can fit the data into one machine it's probably not really big. But one can also work with 1GB of data using bigdata approach, what we call "fat data". When we grow out of a single machine we won't need to rewrite our project.

Nevertheless, this all doesn't prevent our sales team to say "big data" and "cloud" in every their phrase :)

nl · on Sept 23, 2014

Or just single PostgreSQL would do better, because Mongo's feature set is a subset of PostgreSQL's feature set.

Just a slight warning on this, because I was mildly burnt by hearing this and assuming it was true.

For background, I'm a long time happy Postgres user.

Recently I decided to use it for a new project instead of Mongo and to utilise the new JSON support.

Turns out it is great for CRD apps, not so much for CRUD.

The current release version of Postgresql (9.3) has no capability to do updates to parts of JSON stored as the JSON datatype (ie, you have to read the entire JSON blob, change it and write it back)[1].

Updates within JSON fields are coming in 9.4.

[1] http://dba.stackexchange.com/questions/54663/how-can-i-updat...

jbellis · on Sept 23, 2014

How sure are you that mongodb doesn't rewrite the entire document with each update? I couldn't find a definitive answer without source diving.

calpaterson · on Sept 23, 2014

Some updates are done in place, some are not. To give a specific example - integer updates are always done in place. The most common reason that a document has to be rewritten is when it is updated size is too big to fit into its previous space. Then it has to be moved.

http://blog.mongodb.org/post/248614779/fast-updates-with-mon...

I have found "source diving" a scary experience with Mongodb in the past.

jbellis · on Sept 23, 2014

Could be, but that article just says that the update is "in place." Which may mean that just the changed int is written, or it may be the entire document.

tracker1 · on Sept 23, 2014

But, it's in the box API functionality... since it's a single record, there's minimal locking involved. Depending on your load that can be very important.

For that matter, imho doing a locked partial update via something like _.assign would be fine imho in postgres. It depends on how you really need to use your data... and how it fits into that.

If you have a lot of recursion in your data, it may be better suited towards sql... if you have a lot of data gathered around a group of objects/documents a document db like mongo/rethink/elasticsearch may be best... if you really need key/value lookups, then cassandra is hard to beat.

For that matter, having data duplicated/replicated to multiple types of db servers is entirely reasonable. You management UI can interact with an SQL datastore, and on save, you also save to Mongo.

That was the interim step I chose in migrating our data structures.. the queries that run against mongo work great, there's three servers in the replica set, for a relatively small data set, and it is really nice.

nl · on Sept 23, 2014

I'm more familiar with CouchDB, and I know Couch doesn't have to do that.

Even if MongoDB does at least you don't have to write the functionality yourself in client code.

jbellis · on Sept 23, 2014

Right. It's a confusing space because even though couch/riak/mongo are all "document databases" their storage engines are very different. You can't usefully generalize from couch's append-only b-tree to mongodb's mmap'd data files.

deepsun · on Sept 23, 2014

You're right. I just meant a little different thing: that we'd better store JSON in conventional way of storing values in schema-full columns. For example, object {foo: {bar: [1, 2], a: 'b'}} would be three columns foo.bar.0=1, foo.bar.1=2, and foo.bar.a='b'. In medium-sized project you'll write DAO layer to convert DBMS view on data to application code view on data anyway for various purposes. It would be even better, because column order in SQL doesn't matter (but in Mongo it does for searching, as it's essentially just BSON string). And column names won't be repeated each time and take space.

Of course, if you have a case to store unstructured data where you don't know the structure in advance, in this case it won't work for you. But for your own data -- we'd better let DBMS maintain the schema, instead of maintaining it in application code (inventing the wheel).

Side note, regarding updates of particular fields in objects: a NoSQL datastore must provide some "compare-and-set" functionality to avoid race conditions during updates. PostgreSQL way is to use row-level-locking transactions, but MongoDB locks the whole collections (well, several months ago it blocked the whole database, so it's and improvement :). They kinda offer findAndUpdate() for "compare-and-set", but see my another comment below on why it doesn't work.

annnnd · on Sept 23, 2014

Great info, thanks! We are using MongoDB for single-server installations and it fits our needs pretty well. Our biggest gripe is that disk space is not freed when collections are dropped, but we can live with it. Other things are just minor inconveniences which plague other DBs too. In our experience MongoDB "just works" - but then again, we don't use its scaling capabilities.

That said, we are looking around to see if there is an even better document-oriented DB available and PostgreSQL looks interesting (with its JSON). Haven't had the chance to try it yet though. Another interesting option is OrientDB (having graph database would be beneficial - but only for small part of our system). Does anyone have experience with other document-oriented storages? (primarily single node usage)

lvca · on Sept 23, 2014

OrientDB is the most interesting alternative to MongoDB with support for Multi-Master replication and relationships: you can decide to embed or link documents:

http://www.orientechnologies.com/orientdb-vs-mongodb/

donw · on Sept 22, 2014

This fits with my experience as well, with (possibly) a larger data set.

MongoDB is actually pretty good as a document store, where you can accept soft commits, and are dealing with non-relational data that doesn't need to be aggregated.

When you start needing more than just a loose pile of documents, or live in a world where you really need ACID, Mongo has this nasty habit of falling down on the floor and twitching.

Those problems are solveable, but it doesn't just happen "out-of-the-box".

e12e · on Sept 23, 2014

I wonder, what do you get over exporting a file system over webdav, if you only do "document storage" of json documents? After all filesystems are made for storing a hierarchy of files... some indexes for searching? More convenient rest api?

donw · on Sept 23, 2014

The API is much more convenient, and last time I checked, WebDAV didn't provide the same wealth of query options or multi-node replication. Plus, I can get solid commercial support for MongoDB, which matters a lot if you're using it to power a business.

e12e · on Sept 24, 2014

Multi-node replication for webdav would be done at the file system level (eg: gfs or whatnot). By query options, to you mean some kind of join, for example? Obviously anything beyond plain id-based get/set (by file path/url) isn't possible with "just" webdav. But if you already limit yourself to looking at single documents (ie: json-files) -- you could always fetch doc1, find the id/path of doc2 in doc1 and fetch that. I'm not saying this sounds like a sane db strategy, I'm just not convinced doing the same thing, with a little easier api sounds like a sane strategy either?

cal2014 · on Sept 23, 2014

Like all technologies MongoDB is a tool and you need to pick the right tool for the right job or in this case the right DB for the right work load. We run 100's of TB of MongoDB with 100's of millions of Mongo Ops. Does it have challenges for sure, does mean you can't scale it no. Here's a talk about scaling MongoDB specifically. http://engineering.objectrocket.com/2014/03/10/scaling-mongo...

deepsun · on Sept 23, 2014

Right, but we're disappointed that MongoDB markets itself to fit into NoSQL niche, while it doesn't. If it honestly declared its shortcomings, we'd have nothing against, we'd love it.

For example, it doesn't support crucial for NoSQL "compare-and-set" functionality to avoid transactions/locks. Their best suggestion is to use findAndUpdate() with the full object for "find". And it works (though slow) when you have static schema. But over time you'll want to change your objects, and findAndUpdate() won't find them anymore. Grief. Also, order of fields in nested JSON object matters for findAndUpdate(), so have a happy time debugging why it doesn't find some objects anymore.

colordrops · on Sept 23, 2014

You can scale the file system too but it doesn't mean that it's well suited to the job.

x0x0 · on Sept 22, 2014

No, and even modest write loads will cause immense pain -- there was a single lock per db when I last used them. Plus lots of the standard tricks for getting performance out of dbs don't work for them. I'd recommend staying away unless your performance needs are very moderate and you must have unstructured tables.

rogerbinns · on Sept 23, 2014

There is still a single lock per db. It also uses huge amounts of space (around double JSON of same data). That puts increased pressure on memory. It also has no usable compaction. And can be rather dumb with indexes. (note)

The main mongo solution for things is to run lots of copies of it across many machines each with small chunks of data. In theory what is then a pain point on bigger systems becomes lots of lesser pains on small systems.

(note) every criticism is answered with how the next release will improve things. Sometimes that happens.

_c3bx · on Sept 22, 2014

Check out tokumx, it fixes a lot of the problems with mongo.

cheald · on Sept 23, 2014

Seconded; we've had a very positive experience with it so far. We're running around 1.5TB in it (and growing) and it's working pretty decently.

Our biggest problems with it have been CPU saturation due to compression (solvable with sharding) and oplog size (due to supporting ACID; supposedly much better in the upcoming release), but both of those are surmountable. In exchange we get massively better disk usage characteristics, no global locks, ACID compliance, transactions, and generally better performance. It's not perfect, but it solved a lot of our problems.

ub · on Sept 23, 2014

My experience with MongoDB has been terrible. Apart from just look-ups I don't think it's meant for much data wrangling. Joins with different collections are harder to do. I see the best use case of Mongo is for data dumps.

tracker1 · on Sept 23, 2014

It's pretty good for a document store.. partial updates to documents, as well as indexing work well. Setting it up for replica sets with auto failover is much easier than say PostgreSQL, as is the API interface (especially geo searches). It's run well for most of my own uses with it, though I do keep an eye towards RethinkDB as well as ElasticSearch and Cassandra.

RethinkDB really needs to get the auto-hot failover and geo searches worked out, geo is on the table for the next release iirc, and failover the next after.

Cassandra is great for key/value searches, but falls down for range queries.

ElasticSearch is pretty awesome in its' own right, but not perfect either.

PostgreSQL has a lot to offer as well. 9.4 should be pretty sweet, and if they get automagic failover in the community versions worked out, I'm totally in.

It really just depends on what your workload is... MongoDB offers a lot of DB-like scenarios against indexes in a single collection, a clean set of interfaces, and a fairly responsive system overall. There have been growing pains, and problems... the same can be said of any database.

To each their own, it really just depends on your needs, and for that matter how far out your project's release is, vs. how long you need to support it.

Right now, I'm replacing an over-normalized SQL database structure that is pretty rancid. Most of the data fits so much better with a document db it isn't funny. When I had done the first parts, I had issues with geo searches in similar alternatives, and that has been a deal breaker for a lot of the options.

You don't use a document store if you need to use joins.. you're better off either duplicating the data, or using separate on-demand queries... odds are your data isn't shaped right and you should have used a structured database, or you aren't thinking of the problem right.

YMMV.

darkarmani · on Sept 23, 2014

> Setting it up for replica sets with auto failover is much easier than say PostgreSQL

MongoDB replica sets are for availability not for consistency. Even with a write concern of majority, you can still have inconsistency. Without heavy load you might never see this race condition.

baudehlo · on Sept 23, 2014

I don't see what's complicated about the Postgres goesearch API. Just use earthdistance not PostGIS.

illumen · on Sept 23, 2014

You may not need joins, but someone else May.

Someone else likely Will in my experience. Or I will, when a new requirement comes in.

tracker1 · on Sept 23, 2014

Again, I'd say it depends on the data... either by shape or need. I also wouldn't use NoSQL for highly structured and relational data. For example, a classifieds site, absolutely yes to nosql... for comment threads, I'd favor SQL.

If you need certain reporting, does it have to be real time, is real time enough okay, and what are the other needs. I find that sometimes duplicating data (one point of authority) is better than using one or the other.

aikah · on Sept 23, 2014

Mongodb is great ... for logging stuff,or quick prototyping.It's not that fast on writes ,but pretty fast on reads.

In my opinion,what people usually really want is a RDBMS + a full text search engine like elastic search. But again,one needs to set these things up.

Mongodb didnt have aggregating features in the past,and their map/reduce feature is not that good.But again,the product is still young,maybe it will get better.

notastartup · on Sept 23, 2014

Agreed but maybe it was because at a startup I worked for a few years ago used Meteor.

samcrawford · on Sept 22, 2014

Of course, you're right, but we didn't know that at the time. It turns out there's nothing we were doing that really warranted using a data warehouse (for that is what InfiniDB was really suited for). We were collecting a lot of numerical, time-series data, bulk loading it and then performing summarised queries periodically.

Mongo, Cassandra, etc are not good fits for this. Vertica was very expensive. In the end we went with a sharded and partitioned MySQL setup (partitioning really is great if you use it well). It's worked very well.

zenogais · on Sept 22, 2014

This actually sounds like a great fit for Cassandra and Hadoop. You could use Cassandra's built in TimeUUID as the primary key and have your data pre-sorted on disk in time-series. This would make big queries, using something like Hadoop, very efficient.

jhh · on Sept 22, 2014

That's not necessary though, if it works fast enough on a relational database in my opinion. I think this article has some valuable insight on the topic: http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

zenogais · on Sept 23, 2014

You'll get no argument from me here. Cassandra + Hadoop only make sense at a certain scale.

imaginenore · on Sept 22, 2014

Yeah, that's exactly what I suggested in the other comment. If you can get away with sharding, it's the easiest solution, and the developers/DBAs are relatively easy to find.

Glad it worked out for you guys.

jwegan · on Sept 22, 2014

FYI, for people using less than 1 TB of data, Vertica does have a free community edition. I used Vertica at my last job and it was blazingly fast compared to Hive (like 5x to 10x faster).

michaelcampbell · on Sept 23, 2014

Is that 1TB compressed or "raw"?

jeremymcanally · on Sept 23, 2014

For future reference, it really sounds like InfluxDB might be a perfect fit for you. I've been trying to find a reason to use it myself, but we don't do a lot of time-series stuff at work (right now at least).

nl · on Sept 23, 2014

Isn't this almost exactly the use-case Cassandra is designed for?

rubinelli · on Sept 23, 2014

My bet is what really took the wind out of InfiniDB's sails was Amazon's Redshift. It's new, shiny, ridiculously easy to use, has a strong pedigree and a fairly low initial capital outlay.

mhoglan · on Sept 23, 2014

Not really. Redshift is nice don't get me wrong, but we did not run into many people that were choosing Redshift over InfiniDB. The major players out there still run everything on premise and behind their firewall.

btw, InfiniDB was originally going to be the backend to Redshift. Lets just say the previous executive team 'screwed that up'

trhway · on Sept 22, 2014

is where a good way of doing MPP SQL in Cassandra or Mongo? and Vertica is really far from being free, isn't it?

electrum · on Sept 23, 2014

Presto is a fully distributed (MPP) SQL query engine that supports Cassandra: http://prestodb.io/docs/current/connector/cassandra.html

You can even join together data from Hive, Cassandra, MySQL, PostgreSQL, Kafka, etc., all in one query. We don't have a connector for Mongo yet, but contributions are welcome!

The Cassandra connector was actually an external contribution. We don't use it at Facebook.

kermatt · on Sept 22, 2014

Vertica has a community tier, free up to 1TB of source data, and a 3 node limit.

SQL is not part of the Cassandra or Mongo feature sets. They have certain analytic possibilities, but not if you want to use SQL (window functions for example), or most of the BI client tools associated with data warehousing.

imaginenore · on Sept 22, 2014

Postgres-XL can do that.

There's also Shard-Query and (not free) Amazon Redshift.

Or you could just shard your regular community Postgre or MySQL.

kermatt · on Sept 22, 2014

See also https://github.com/citusdata/cstore_fdw