Sad to see. We were one of their early paying customers in 2009/2010 (we got a great deal). Performance was fantastic over very large datasets, but the bugs, storage requirements (very expensive SANs) and query limitations became big problems for us. We moved away after a year.
Just recently one of our team has been looking at them again (following a strong benchmark being posted at mysqlperformanceblog). So I went and checked out the forums and saw it was pretty much in a similar state to when I last used it four years ago.
Sad to see they couldn't make it work. The team was always really friendly and quick to help with issues - good luck in the future.
Honest question: is MongoDB actually scalable? I keep reading about how it is not for serious business[1][2][3][4]. Who is using Mongo at scale, what sort of loads are they seeing, and is the level of problems on par with other solutions? As a developer I like Mongo's API, but I would worry about using it given what has been written about it.
I agree with you; in our experience Mongo is not scalable.
We use Mongo in our company (60GB per db.stats().fileSize), both for relational data, and for what we call "fat data" (just to avoid calling these mere gigabytes "big data" -- those are just lots of statistical numbers, more like OLAP cubes).
After many problems and sad surprises, we came to the conclusion that we'd have been better off using PostgreSQL (or any other SQL database) for the relational data and some NoSQL store for the "fat data". Or just PostgreSQL alone would do better, because Mongo's feature set is a subset of PostgreSQL's feature set - Mongo just renamed the features and pretends they're something new.
"fat data" (just to avoid calling these mere gigabytes "big data"
Mere GBs of data is just a normal database.
I have a rule that should be more widespread:
If you can put the database in RAM on an x86 server, it's not "big data" by any stretch. Beyond that it becomes more complex, but for starters let's consider whether the indexes fit in RAM.
If the indexes/hashes for your data cannot fit in RAM on a commodity x86 server, then you can probably consider that you have "big data".
So, currently it's possible to buy Supermicro systems that take 6TB of RAM (just a normal QPI link) without getting into any of the exotic SSI systems (like the SGI UV 2000).
We should also avoid talking about the physical plant requirements for "big data", since it's possible to put over 350TB of storage in 4U with products from Nexsan, JetStor, etc. That is over 3PB per rack...
So, you can call your data set "big data" if the indexes are > 6TB or the actual data set is > 3PB. These numbers will change next year when new machines/storage arrive.
I find it easier to think about it from a data "usage" angle: if you can query anything you like while running the database on a single machine and get your answer within 24hrs, your data is not "big".
I used to see 24hr OLAP cube runs and no one ever called that "big data". It's entirely a question of scale in my mind: these days you can buy truly gigantic, phenomenally powerful servers, but if your data needs multiple servers dividing the load in order to perform queries in a timely manner, then you start talking about big data.
Yes, we call it "fat data" because we take a different, "big-datey" approach to it. E.g. it's not relational -- we don't JOIN it with anything (we intentionally designed it so that we stay scalable while growing), and it would fit a NoSQL data store perfectly (just don't confuse NoSQL with MongoDB :)
After working at Google, I (along with another teammate) kinda "feel" what big data is: it's more about the approach, mindset and toolset used to work with it. I agree with you and @techdragon that if you can fit the data onto one machine it's probably not really big. But one can also work with 1GB of data using a big-data approach, which is what we call "fat data". When we grow out of a single machine we won't need to rewrite our project.
Nevertheless, none of this stops our sales team from saying "big data" and "cloud" in every other sentence :)
Or just PostgreSQL alone would do better, because Mongo's feature set is a subset of PostgreSQL's feature set.
Just a slight warning on this, because I was mildly burnt by hearing this and assuming it was true.
For background, I'm a long time happy Postgres user.
Recently I decided to use it for a new project instead of Mongo and to utilise the new JSON support.
Turns out it is great for CRD apps, not so much for CRUD.
The current release version of PostgreSQL (9.3) has no capability to update parts of JSON stored as the JSON datatype (i.e. you have to read the entire JSON blob, change it and write it back)[1].
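In other words, on 9.3 the whole cycle lives in application code. A rough sketch of what that looks like with psycopg2 - the docs(id, body json) table and its column names are made up, and the FOR UPDATE lock is optional but stops two writers clobbering each other:

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app")   # assumed connection string
    with conn:
        with conn.cursor() as cur:
            # lock the row so a concurrent writer can't overwrite our change
            cur.execute("SELECT body FROM docs WHERE id = %s FOR UPDATE", (42,))
            body = cur.fetchone()[0]
            doc = json.loads(body) if isinstance(body, str) else body
            doc["status"] = "archived"       # the "partial" update happens here, in Python
            cur.execute("UPDATE docs SET body = %s WHERE id = %s",
                        (json.dumps(doc), 42))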
Some updates are done in place, some are not. To give a specific example - integer updates are always done in place. The most common reason a document has to be rewritten is that its updated size is too big to fit into its previous space. Then it has to be moved.
Could be, but that article just says that the update is "in place", which may mean that just the changed int is written, or it may be the entire document.
But it's in-the-box API functionality... since it's a single record, there's minimal locking involved. Depending on your load that can be very important.
For that matter, imho doing a locked partial update via something like _.assign would be fine in Postgres. It depends on how you really need to use your data... and how it fits into that.
If you have a lot of recursion in your data, it may be better suited to SQL... if you have a lot of data gathered around a group of objects/documents, a document DB like Mongo/Rethink/ElasticSearch may be best... if you really need key/value lookups, then Cassandra is hard to beat.
For that matter, having data duplicated/replicated to multiple types of DB servers is entirely reasonable. Your management UI can interact with an SQL datastore, and on save, you also save to Mongo (rough sketch below).
That was the interim step I chose in migrating our data structures... the queries that run against Mongo work great, there are three servers in the replica set for a relatively small data set, and it is really nice.
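Something like this, roughly - hypothetical table/collection names, psycopg2 + pymongo assumed, with the SQL copy as the point of authority:

    import json
    import psycopg2
    from pymongo import MongoClient

    pg = psycopg2.connect("dbname=app")
    listings = MongoClient("mongodb://localhost:27017")["app"]["listings"]

    def save_listing(listing):
        with pg, pg.cursor() as cur:          # relational copy commits first
            cur.execute("UPDATE listings SET data = %s WHERE id = %s",
                        (json.dumps(listing), listing["id"]))
            if cur.rowcount == 0:
                cur.execute("INSERT INTO listings (id, data) VALUES (%s, %s)",
                            (listing["id"], json.dumps(listing)))
        # then mirror into the document store for the read-heavy queries
        listings.replace_one({"_id": listing["id"]}, listing, upsert=True)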
Right. It's a confusing space because even though couch/riak/mongo are all "document databases" their storage engines are very different. You can't usefully generalize from couch's append-only b-tree to mongodb's mmap'd data files.
You're right. I meant something slightly different: that we'd be better off storing the JSON the conventional way, as values in schema-full columns. For example, the object {foo: {bar: [1, 2], a: 'b'}} would become three columns: foo.bar.0=1, foo.bar.1=2, and foo.a='b'. In a medium-sized project you'll write a DAO layer to convert the DBMS's view of the data to the application code's view anyway, for various purposes. It would even be better, because column order in SQL doesn't matter (but in Mongo it does for searching, as a document is essentially just a BSON string), and column names won't be repeated in every document and take up space.
Of course, if you need to store unstructured data where you don't know the structure in advance, this won't work for you. But for your own data, better to let the DBMS maintain the schema instead of maintaining it in application code (reinventing the wheel).
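To make the flattening concrete, here's a toy version of that DAO-layer conversion (illustrative only; real code would also track column types):

    def flatten(value, prefix=""):
        """Turn nested dicts/lists into dotted column names, one per leaf value."""
        columns = {}
        if isinstance(value, dict):
            for key, sub in value.items():
                columns.update(flatten(sub, prefix + key + "."))
        elif isinstance(value, list):
            for idx, sub in enumerate(value):
                columns.update(flatten(sub, prefix + str(idx) + "."))
        else:
            columns[prefix[:-1]] = value      # strip the trailing dot
        return columns

    print(flatten({"foo": {"bar": [1, 2], "a": "b"}}))
    # {'foo.bar.0': 1, 'foo.bar.1': 2, 'foo.a': 'b'}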
Side note, regarding updates of particular fields in objects: a NoSQL datastore must provide some "compare-and-set" functionality to avoid race conditions during updates. The PostgreSQL way is to use row-level locking in transactions, but MongoDB locks the whole collection (well, several months ago it locked the whole database, so it's an improvement :). They kinda offer findAndUpdate() for "compare-and-set", but see my other comment below on why it doesn't work.
Great info, thanks! We are using MongoDB for single-server installations and it fits our needs pretty well. Our biggest gripe is that disk space is not freed when collections are dropped, but we can live with it. Other things are just minor inconveniences which plague other DBs too. In our experience MongoDB "just works" - but then again, we don't use its scaling capabilities.
That said, we are looking around to see if there is an even better document-oriented DB available, and PostgreSQL looks interesting (with its JSON). Haven't had the chance to try it yet though. Another interesting option is OrientDB (having a graph database would be beneficial - but only for a small part of our system). Does anyone have experience with other document-oriented stores? (primarily single-node usage)
OrientDB is the most interesting alternative to MongoDB, with support for multi-master replication and relationships: you can decide to embed or link documents.
This fits with my experience as well, with (possibly) a larger data set.
MongoDB is actually pretty good as a document store, where you can accept soft commits, and are dealing with non-relational data that doesn't need to be aggregated.
When you start needing more than just a loose pile of documents, or live in a world where you really need ACID, Mongo has this nasty habit of falling down on the floor and twitching.
Those problems are solvable, but it doesn't just happen "out of the box".
I wonder what you get over exporting a file system over WebDAV, if you only do "document storage" of JSON documents? After all, filesystems are made for storing a hierarchy of files... some indexes for searching? A more convenient REST API?
The API is much more convenient, and last time I checked, WebDAV didn't provide the same wealth of query options or multi-node replication. Plus, I can get solid commercial support for MongoDB, which matters a lot if you're using it to power a business.
Multi-node replication for WebDAV would be done at the file system level (e.g. GFS or whatnot). By query options, do you mean some kind of join, for example? Obviously anything beyond plain id-based get/set (by file path/URL) isn't possible with "just" WebDAV. But if you already limit yourself to looking at single documents (i.e. JSON files), you could always fetch doc1, find the id/path of doc2 in doc1 and fetch that. I'm not saying this sounds like a sane DB strategy; I'm just not convinced that doing the same thing with a slightly easier API sounds like a sane strategy either.
Like all technologies, MongoDB is a tool and you need to pick the right tool for the right job - or in this case, the right DB for the right workload. We run hundreds of TB of MongoDB with hundreds of millions of Mongo ops. Does it have challenges? For sure. Does that mean you can't scale it? No. Here's a talk specifically about scaling MongoDB: http://engineering.objectrocket.com/2014/03/10/scaling-mongo...
Right, but we're disappointed that MongoDB markets itself as fitting into the NoSQL niche, while it doesn't. If it honestly declared its shortcomings, we'd have nothing against it; we'd love it.
For example, it doesn't support the "compare-and-set" functionality that is crucial for NoSQL to avoid transactions/locks. Their best suggestion is to use findAndUpdate() with the full object as the "find" part. And it works (though slowly) when you have a static schema. But over time you'll want to change your objects, and findAndUpdate() won't find them anymore. Grief. Also, the order of fields in a nested JSON object matters for findAndUpdate(), so have a happy time debugging why it doesn't find some objects anymore.
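A common workaround is to compare-and-set against an explicit version counter instead of the whole previous document, so schema changes and field order stop mattering. A rough sketch with pymongo, with hypothetical collection and field names:

    from pymongo import MongoClient, ReturnDocument

    accounts = MongoClient()["app"]["accounts"]

    def cas_update(doc_id, expected_version, changes):
        # succeeds only if nobody bumped the version since we read the document
        return accounts.find_one_and_update(
            {"_id": doc_id, "version": expected_version},
            {"$set": changes, "$inc": {"version": 1}},
            return_document=ReturnDocument.AFTER,
        )

    updated = cas_update("acct-1", expected_version=7, changes={"plan": "pro"})
    if updated is None:
        pass  # lost the race: re-read the document and retry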
No, and even modest write loads will cause immense pain -- there was a single lock per db when I last used them. Plus lots of the standard tricks for getting performance out of dbs don't work for them. I'd recommend staying away unless your performance needs are very moderate and you must have unstructured tables.
There is still a single lock per db. It also uses huge amounts of space (around double that of the same data as JSON), which puts increased pressure on memory. It also has no usable compaction, and it can be rather dumb with indexes. (note)
The main mongo solution for things is to run lots of copies of it across many machines each with small chunks of data. In theory what is then a pain point on bigger systems becomes lots of lesser pains on small systems.
(note) every criticism is answered with how the next release will improve things. Sometimes that happens.
Seconded; we've had a very positive experience with it so far. We're running around 1.5TB in it (and growing) and it's working pretty decently.
Our biggest problems with it have been CPU saturation due to compression (solvable with sharding) and oplog size (due to supporting ACID; supposedly much better in the upcoming release), but both of those are surmountable. In exchange we get massively better disk usage characteristics, no global locks, ACID compliance, transactions, and generally better performance. It's not perfect, but it solved a lot of our problems.
My experience with MongoDB has been terrible. Apart from simple look-ups I don't think it's meant for much data wrangling. Joins across different collections are harder to do. I see the best use case for Mongo as data dumps.
It's pretty good as a document store... partial updates to documents, as well as indexing, work well. Setting it up for replica sets with auto failover is much easier than, say, PostgreSQL, as is the API (especially geo searches - there's a quick sketch after this comment). It's run well for most of my own uses, though I do keep an eye on RethinkDB as well as ElasticSearch and Cassandra.
RethinkDB really needs to get auto hot failover and geo searches worked out; geo is on the table for the next release iirc, and failover the one after that.
Cassandra is great for key/value searches, but falls down for range queries.
ElasticSearch is pretty awesome in its own right, but not perfect either.
PostgreSQL has a lot to offer as well. 9.4 should be pretty sweet, and if they get automagic failover in the community versions worked out, I'm totally in.
It really just depends on what your workload is... MongoDB offers a lot of DB-like scenarios against indexes in a single collection, a clean set of interfaces, and a fairly responsive system overall. There have been growing pains, and problems... the same can be said of any database.
To each their own, it really just depends on your needs, and for that matter how far out your project's release is, vs. how long you need to support it.
Right now, I'm replacing an over-normalized SQL database structure that is pretty rancid. Most of the data fits so much better with a document DB it isn't funny. When I did the first parts, I had issues with geo searches in similar alternatives, and that has been a deal breaker for a lot of the options.
You don't use a document store if you need joins... you're better off either duplicating the data or using separate on-demand queries... odds are your data isn't shaped right and you should have used a structured database, or you aren't thinking about the problem right.
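For reference, the geo-search setup mentioned above is only a few lines with pymongo (collection name, coordinates and distance are made-up values):

    from pymongo import MongoClient, GEOSPHERE

    places = MongoClient()["app"]["places"]
    places.create_index([("loc", GEOSPHERE)])    # 2dsphere index on a GeoJSON field

    places.insert_one({"name": "cafe",
                       "loc": {"type": "Point", "coordinates": [-122.4194, 37.7749]}})

    # everything within ~2km of a point, nearest first
    nearby = places.find({"loc": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [-122.42, 37.77]},
        "$maxDistance": 2000,   # meters
    }}})
    for doc in nearby:
        print(doc["name"])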
> Setting it up for replica sets with auto failover is much easier than say PostgreSQL
MongoDB replica sets are for availability, not for consistency. Even with a write concern of majority, you can still get inconsistency. Without heavy load you might never see this race condition.
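To illustrate: a majority write concern is just an option on the collection handle (a pymongo sketch below, with a made-up replica set name), and even then the caveats above about consistency and rollbacks still apply:

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://rs0-a,rs0-b,rs0-c/?replicaSet=rs0")
    events = client["app"]["events"].with_options(
        write_concern=WriteConcern(w="majority", wtimeout=5000))

    # acknowledged only once a majority of the set has the write,
    # but reads from secondaries can still observe stale or rolled-back data
    events.insert_one({"type": "signup"})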
Again, I'd say it depends on the data... either by shape or need. I also wouldn't use NoSQL for highly structured and relational data. For example, for a classifieds site, absolutely yes to NoSQL... for comment threads, I'd favor SQL.
If you need certain reporting: does it have to be real time, is "close to real time" okay, and what are the other needs? I find that sometimes duplicating data (with one point of authority) is better than using one or the other.
MongoDB is great... for logging stuff, or quick prototyping. It's not that fast on writes, but pretty fast on reads.
In my opinion, what people usually really want is an RDBMS + a full-text search engine like ElasticSearch. But again, one needs to set these things up.
MongoDB didn't have aggregation features in the past, and its map/reduce feature is not that good. But again, the product is still young; maybe it will get better.
Of course, you're right, but we didn't know that at the time. It turns out there's nothing we were doing that really warranted using a data warehouse (for that is what InfiniDB was really suited for). We were collecting a lot of numerical, time-series data, bulk loading it and then performing summarised queries periodically.
Mongo, Cassandra, etc are not good fits for this. Vertica was very expensive. In the end we went with a sharded and partitioned MySQL setup (partitioning really is great if you use it well). It's worked very well.
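For the curious, a rough sketch of the kind of time-range partitioning such a setup relies on (table/column names and the monthly boundaries are invented; assumes mysql-connector-python):

    import mysql.connector

    conn = mysql.connector.connect(user="app", password="secret", database="metrics")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE samples (
            sensor_id INT NOT NULL,
            ts DATETIME NOT NULL,
            value DOUBLE,
            PRIMARY KEY (sensor_id, ts)
        )
        PARTITION BY RANGE (TO_DAYS(ts)) (
            PARTITION p2014_09 VALUES LESS THAN (TO_DAYS('2014-10-01')),
            PARTITION p2014_10 VALUES LESS THAN (TO_DAYS('2014-11-01')),
            PARTITION pmax     VALUES LESS THAN MAXVALUE
        )""")
    # summary queries constrained on ts only touch the relevant partitions
    cur.execute("SELECT sensor_id, AVG(value) FROM samples "
                "WHERE ts >= '2014-10-01' AND ts < '2014-11-01' GROUP BY sensor_id")
    rows = cur.fetchall()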
This actually sounds like a great fit for Cassandra and Hadoop. You could use Cassandra's built-in TimeUUID as the primary key and have your data pre-sorted on disk in time-series order. This would make big queries, using something like Hadoop, very efficient.
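Presumably something like this (a sketch with the DataStax Python driver; keyspace and table names are hypothetical) - the timeuuid clustering column keeps each partition's rows sorted by time on disk:

    import time
    from cassandra.cluster import Cluster
    from cassandra.util import uuid_from_time

    session = Cluster(["127.0.0.1"]).connect("telemetry")
    session.execute("""
        CREATE TABLE IF NOT EXISTS samples (
            sensor_id text,
            ts        timeuuid,
            value     double,
            PRIMARY KEY (sensor_id, ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)""")

    session.execute("INSERT INTO samples (sensor_id, ts, value) VALUES (%s, %s, %s)",
                    ("sensor-1", uuid_from_time(time.time()), 42.0))

    # a single-partition range scan comes back already ordered by time
    rows = session.execute(
        "SELECT ts, value FROM samples WHERE sensor_id = %s LIMIT 100", ("sensor-1",))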
Yeah, that's exactly what I suggested in the other comment. If you can get away with sharding, it's the easiest solution, and the developers/DBAs are relatively easy to find.
FYI, for people using less than 1 TB of data, Vertica does have a free community edition. I used Vertica at my last job and it was blazingly fast compared to Hive (like 5x to 10x faster).
For future reference, it really sounds like InfluxDB might be a perfect fit for you. I've been trying to find a reason to use it myself, but we don't do a lot of time-series stuff at work (right now at least).
My bet is what really took the wind out of InfiniDB's sails was Amazon's Redshift. It's new, shiny, ridiculously easy to use, has a strong pedigree and a fairly low initial capital outlay.
Not really. Redshift is nice, don't get me wrong, but we did not run into many people who were choosing Redshift over InfiniDB. The major players out there still run everything on premises and behind their firewall.
btw, InfiniDB was originally going to be the backend to Redshift. Let's just say the previous executive team "screwed that up".
You can even join together data from Hive, Cassandra, MySQL, PostgreSQL, Kafka, etc., all in one query. We don't have a connector for Mongo yet, but contributions are welcome!
The Cassandra connector was actually an external contribution. We don't use it at Facebook.
Vertica has a community tier, free up to 1TB of source data, and a 3 node limit.
SQL is not part of the Cassandra or Mongo feature sets. They have certain analytic possibilities, but not if you want to use SQL (window functions for example), or most of the BI client tools associated with data warehousing.
Having actually been in a company like that, it's mostly that and bad overall management - at one point we had a ratio of two (useless...) PMs for one dev. Devs underpaid, PMs overpaid (as expert freelancers, of course, or with bonuses), huge expenses made for stuff that makes no sense (branding), contracts negotiated so badly that they cost the company more than they bring in...
Of course after a while this stops, since all this mismanagement corrupts the products too, and pretty soon you're left with a huge black hole for money that doesn't produce anything meaningful.
I've been playing with Infobright Community Edition but also evaluated InfiniDB. I found InfiniDB was hardly compressing my data at all, whereas Infobright was utterly jamming it down - factors of over 300x even for small datasets of ~2m rows.
I don't know what the differences are that produce that, but when it comes to storing as much crap as I was looking at, I was willing to design around the limitations that Infobright CE has (i.e. no insert/update queries) rather than deal with the massive extra disk cost. I currently have 223m rows sitting in Infobright and it's taking about 38MB.
I really hope that the OSS project takes off and that InfiniDB sees some better compression implemented, similar to Infobright. The extra features that InfiniDB has over Infobright CE (not only insert/update but also a multi-threaded infile loader, for example) would convert me if only the compression were better.
I'm all new to this though so if there's some good reason why they differ so greatly I'm all ears and would love to know. Maybe I screwed something up in the configuration? I'm not sure.
Either way, it's sad to see them go. Columnstore databases fill a really useful niche that I can only see growing over time as more and more operational data is collected by industry.
Phenomenal. Some of the columns are binary (sparse, too - 99% false sort of thing), another is an integer from -1 to 8, some are floats (decimal degrees for GPS) and one is a datetime. 38MB is what is reported when I use the following query, which I found somewhere:
    SELECT table_schema, sum(data_length + index_length) / 1024 / 1024 'Data Base Size in MB', TABLE_COMMENT
    FROM information_schema.TABLES
    GROUP BY table_schema
You're right though - the cost per row sounds really low, I might try to find it via other means tomorrow and report back.
Yup, just confirmed that the raw folder size of the database is 39.8MB, and my query reports 39MB (not 38). It is also 222.78M rows. Note that MySQL won't run the query if there is a space between sum and the bracket.
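Back-of-the-envelope, that works out to well under a byte per row:

    rows = 222.78e6
    size_bytes = 39.8 * 1024 * 1024
    print(size_bytes / rows)   # ~0.19 bytes per row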
I can only assume that IB does some sort of differential compression to get such a small file size, and that it's an artifact of the machine data I'm using. But that's what I think the next big wave of data will be - stuff generated by data loggers on machines and equipment and analysed to scrape out incremental improvements in efficiency, reliability, etc. at previously relatively low-tech industrial corporations.
If your database has specialised compression, or keeps most data in main memory compressed (I can think of at least one that does), then this approach won't work.
For almost everything else though, putting compression in the filesystem layer is better.
InfiniDB is/was a great idea. The unfortunate bit was just what a Rube Goldberg machine of a data store it was. I spent a good few days just getting everything provisioned from our automation.
It looks like most of the successful MPP analytic databases are based on PostgreSQL (e.g. Greenplum, Redshift). It's sad to see that InfiniDB could not make MySQL work for them reliably.
I don't think it's fair to characterise this as a MySQL vs Postgres situation. Having used InfiniDB quite a bit I am familiar with how it related to MySQL. It basically only used MySQL as a frontend (meaning as an application you could use the mysql libs and query protocol); everything beyond that was InfiniDB's own stuff.
> It's sad to see that InfiniDB could not make MySQL work for them reliably.
Integrating with MySQL at this level would probably only make sense if the team was already familiar with the internals; otherwise, PostgreSQL's code base would be a much cleaner choice.
The team knew MySQL internals very well. It is not so much that MySQL did not work out; it was very nice for people who needed to grow out of MySQL to jump over to InfiniDB. MySQL going from independent to Sun to Oracle changed a lot about how friendly it is to do these kinds of interactions.
I imagine the PostgreSQL players in the MPP space, such as CitusDB, have done very similar things to what we did with MySQL. And it is not that InfiniDB could not move away from MySQL, but that would be a lateral move, and it would have to be funded for no advancement in benefit.
Let me clear up a few things here for all the speculation. I was an architect at InfiniDB who came on in Nov 2013 to build out the Enterprise Manager, which was meant to alleviate many of the provisioning, management and monitoring woes that customers were experiencing and to help modernize those aspects. The first beta offering of it came in early July; unfortunately the ship had sailed, so to speak. So I know first hand how and why things did not work out here. As with all things, take your lessons learned and move on. Success is not the path of learning; failure is (see survivorship bias).
Some notes:
Labeling InfiniDB as MySQL+ is a gross underestimation of what it does. MySQL is used as the front end query parser, and that is about it. Everything else behind it was custom written, and that is where the power is.
As with all DB technologies, your use case is the primary thing that determines your mileage. Comparing InfiniDB to MongoDB is one of the first signs to me that you don't fully comprehend the differences between database architectures. For the use cases that InfiniDB was made for, we routinely performed faster on a smaller footprint. Using InfiniDB as a document store can be done, but that is not what it was made for.
What people call "big datasets" is relative. Some think 500GB is a lot; some think 5TB is a lot. Coming from a telecommunications monitoring background, I will appreciate your dataset when we are talking TBs a day of churn per monitoring point, with hundreds or thousands of monitoring points. The size of the dataset you are working with and your use case for analysis are the two most important things in determining the technology stack. InfiniDB operated at these higher-end scales very efficiently. There is a reason Impala was a primary comparison, and we would usually operate on a fraction of the hardware it needed.
Best technology does not always win. See InfiniDB.
Decisions made by previous executive teams years in the past can set a course that sometimes cannot be corrected (not efficiently or without a lot of money).
Patents are worth their weight in gold.
Being open source is great for the community, but it is a challenge for a business to build consistent revenue. There were many big projects running InfiniDB with the open-source version but not contributing to revenue. Even if they did sign up for support, you need custom feature development and other big-ticket items to make an impact. Or you have to build a large customer base paying for support, and that takes time. With the multiple iterations of adapting the technology to different architectures over the years, it was hard to retain those customers consistently. Also, many customers will pay for support for their rollout or initial deployment, but when the project is done, they feel they are adequate enough to live with open source only.
Just because a company raised $X in a given month does not mean all that money is slated for going forward from that month. On top of that, payroll is not cheap, and you would be surprised how quickly you can burn through money keeping the lights on. For those of you who think people at startups are working for pennies on the dollar, I would advise you it is not the case. And if you are one of those people, I wish you the best; odds are there are other reasons why you are doing so. Why would good engineers work at a discount? Equity? There is not enough of the pie to go around to make that sustainable. Most startups pay competitive market salaries.
InfiniDB was at a junction where it was time to go for it or go home, and that is exactly what happened in 2014. The marketplace for data solutions - with Hadoop rising, other MPP vendors consolidating, and bigger players entering the field - was very competitive, and the time to swing for the fences was now, versus treading water and hoping.
Even with the stars aligned and everything else, all you have done in a company is weight the odds of it succeeding; nothing is guaranteed.
I really enjoyed my time with InfiniDB and the team there. I really do feel it's a missed opportunity, with some decisions that could have been changed several years ago. Not securing patents, and probably choosing MySQL as a frontend, are some of those.
Side note: a core group of us from InfiniDB have landed at Tune, a company that appreciates the technology of InfiniDB and what it can offer for their solutions. We look forward to this new opportunity and what we can provide to the ad and mobile analytics space.
I've been a party to the 8-figure sales of more than one company and stand by what I said. I wasn't at Arbor when it sold (for what I assume to be mid-high 9-figures), but I know a lot about how unserious patents were there as well: not a real factor.
Most importantly: the "door-to-door" time from initiating a patent application to bringing it to bear in a legal dispute is something on the order of a decade.
Incidentally: I didn't take your analogy seriously; I just used it as a hook to disagree with you. I'm not an anti-patent zealot. I've just worked in startups for ~20 years and have come to the conclusion that patents are a total waste of time for software companies.
I agree that there are a lot of patents that are not factors and are generally frivolous. I would argue though that at Arbor the patents/algorithms around classification of flows and DoS detection are a cornerstone of the business being competitive and protected, and should someone else step into their space, they were positioned well for staying competitive. Before InfiniDB I was a Principal Engineer at Tektronix Communications, which acquired Arbor, so I know them very well too (nice company, and it sounded nice to work at too).
btw, don't get me wrong - I'm not saying that if they had their patents they would have been successful. It's just one cog in the whole machine. And the timeline I am referring to: there are things from 5-6 years ago that could have been filed before others were doing them, and that would have provided a nice differentiator in the market. It would have helped, but it was not the sole reason.