
I’m not saying you made an inefficient choice. If you only need to interact with objects that have no relations or nested lists, then denormalized data is likely not going to be a problem. But this is a remarkably niche use case. In practice, nearly all cases I’ve seen where people have come to this conclusion, it’s because they didn’t properly analyse their schema to begin with.

That’s not to say all use cases for such technology are niche. A lot of CMS applications can fit into that paradigm very well for instance.

I’d also say that storing binary data in an RDBMS is a separate anti-pattern altogether.




Every column you write to a DB is binary at some point. When you choose a type, all you've asked the DB to do is interpret the stored binary data that way. By choosing a binary type, you've declared to the DB not to interpret the data. You avoid issues with charset collation/encoding, database interpretation limits (e.g. the lack of 128-bit ints), and so on. Think about it: what is an integer? It's four bytes of data.
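That interpretation point can be sketched in a few lines of Python (a minimal illustration, not tied to any particular database):

```python
import struct

# A 32-bit integer is just four bytes; the column "type" is an
# instruction telling the database how to interpret those bytes.
raw = struct.pack("<i", -1)        # little-endian signed int32 -> bytes
assert len(raw) == 4               # b'\xff\xff\xff\xff'

# The same four bytes, reinterpreted under a different type,
# yield a different value entirely:
as_signed = struct.unpack("<i", raw)[0]    # -1
as_unsigned = struct.unpack("<I", raw)[0]  # 4294967295
print(as_signed, as_unsigned)
```

A binary column hands the application the raw bytes and leaves the interpretation entirely to it, which is exactly the trade being described.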

Whether it's an anti-pattern depends on the use case. And no, I absolutely do not agree with your generalization: storing binary data in an RDBMS is not an anti-pattern. Binary data can be of any size; the bigger the binary data, the fewer rows you should expect to store. At some point (maybe the binary data is images, watermarks, etc.), you have to choose a replicated file system and use that as part of the datastore operations.

Furthermore, I've only worked on high qps applications, so maybe I'm a bit biased on how to use the database efficiently. :)


Congratulations, you’ve offloaded to your app what your database is designed to deal with.

Storing all data as binary is an anti-pattern too, regardless of your qps.

I wouldn’t store anything but metadata in the database, the blob can be somewhere like S3.


It depends on the use case. If you can't accept that, then we'll agree to disagree and further discussion isn't warranted. I've been on both sides and take my experiences with me.

You can assert anti-pattern, but knowing how to structure your tables matters. SQL has BLOB, BINARY, and VARBINARY; choose the proper type depending on the trade-offs. Models that include blobs can be structured in the DB to avoid IO issues (the indexed data includes the id). Go to S3 with the id. How does what you're saying differ? My first post literally says this.
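The pattern being described (a narrow indexed table that yields ids, with the payload fetched from a separate store by id) can be sketched like this; SQLite and an in-memory dict stand in for the real database and object store, and all table/field names are made up for illustration:

```python
import sqlite3

# Hypothetical sketch: keep only the indexable fields in a narrow SQL
# table; the full serialized object lives elsewhere (S3, a blob store),
# keyed by id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post_index (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)")
conn.execute("CREATE INDEX idx_user ON post_index (user_id)")
conn.executemany(
    "INSERT INTO post_index VALUES (?, ?, ?)",
    [(1, 42, "2023-01-01"), (2, 42, "2023-01-02"), (3, 7, "2023-01-03")],
)

# Stand-in for the external object store (S3 in the comment above):
object_store = {1: b"payload-1", 2: b"payload-2", 3: b"payload-3"}

# Query the indexed table for ids, then fetch the blobs by id:
ids = [row[0] for row in conn.execute(
    "SELECT id FROM post_index WHERE user_id = ? ORDER BY id", (42,))]
blobs = [object_store[i] for i in ids]
print(ids)   # [1, 2]
```

The SQL side stays small and fully indexed; the heavy payload I/O never touches it.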

Not only can they be structured to avoid IO issues, they're also protected via a cache where the unindexed data is exploded.

Of course with high QPS I want to offload CPU cycles away from the DB. Scaling the app is easier than a DB.

Scratching my head here. Are you arguing for using only SQL? Why would I do that?


I’m sorry, but no matter what you think storing all unindexed columns as binary is very, very weird. You can write all the documentation you want as to why this is superior to what literally everyone else is doing and has been doing since the dawn of time, but it won’t stop people joining your project thinking that you are mad.

This kind of thing is the stuff of horror stories. Now I have no doubt that you’ve convinced yourself that this is a great approach, but it’s not, and it will be replaced as soon as you leave (assuming you’re not a one man band).


> I’m sorry, but no matter what you think storing all unindexed columns as binary is very, very weird

You may be surprised to learn that several very large and well-known social networks use this technique -- serializing unindexed columns into a blob (typically compressed at some level) -- for their core product tables. It's not really that "weird", if you consider that literally tens of thousands of engineers work at companies doing this.

Conceptually it's exactly the same technique as putting all unindexed columns in a single JSON or HSTORE column. Newer companies use those; older companies tend to have something more homegrown, but ultimately equivalent, typically wrapped by a data access service which handles the deserialization automatically.

This technique is especially advantageous if you have a huge number of flexible fields per row, and many of those fields are often missing / default / zero value, and the storage scheme permits just omitting such fields/values entirely. Multiplied by trillions of rows, that results in serious cost savings.
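A minimal sketch of that idea, with made-up field names, using JSON plus zlib purely for illustration (real systems tend to use a more compact homegrown binary format):

```python
import json
import zlib

# Hypothetical sketch: pack only the non-empty unindexed fields into a
# single compressed blob column, omitting missing/default values
# entirely.
EMPTY = (None, "", 0, [])

def pack_fields(fields):
    present = {k: v for k, v in fields.items() if v not in EMPTY}
    return zlib.compress(json.dumps(present, separators=(",", ":")).encode())

def unpack_fields(blob):
    return json.loads(zlib.decompress(blob))

# A sparse row: most fields are empty and are simply never stored.
row = {"bio": "", "location": None, "website": "", "theme": "dark"}
blob = pack_fields(row)
print(unpack_fields(blob))   # {'theme': 'dark'}
```

Omitting the empty fields is where the savings come from once this is multiplied across a very large table.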


It's a very boring technique in hindsight, really. I guess you're really in the top 1% when you work at a high-scale company and see techniques that question the assumptions you hold. Databases like MySQL, as you pointed out, are embracing this technique while making the inner data indexable.

Also, https://www.skeema.io/ looks like a good product that I'll have to check out. It looks like a better product so far compared to solutions like Flyway/Liquibase. A full-featured suite for DB migration from dev -> prod is exactly what I've been raving about. It's "boring" tech in the sense that no one really wants to touch it, but it's the easiest thing to screw up, and products like this really take it to the next level.

The responses I've seen in this post appear to be from people who've never used dynamic fields in a db and advise against it by saying it is an anti pattern. If it is an anti-pattern, bring on all the anti patterns as I'd like to not wake up at night or be pinged for slowness.


Yeah, it's always interesting seeing the contrast between textbook computer science approaches vs practical real-world solutions. For some reason, people get especially hung up about academic CS concepts in the database world in particular... I've found things are never that clean in the real world at scale.

Thank you for the kind words re: Skeema :)


Just because a very large and well known company does something, does not mean it’s a good idea.

If you put something in a JSON column you give your DB some expectation about the data.

If you store your data in a binary column you are just doing what the database already does in its backend with its normal columns, and any optimizations it might make are lost.

It might result in cost savings, but in my experience cost savings at the expense of transparency/usability are always a bad idea.


Sure, at small scale, it may be a bad idea. At no point did I say it was universally a good idea, just that it was not in fact "very very weird" as the GP claimed.

At large scale (especially social network scale that I'm describing and personally spent most of the past decade working in), the cost/benefit calculus is quite different. I'm talking 8-figure cost savings here. And there's no negative impact to "transparency/usability", since application queries are all going through a data access layer that understands the serialization scheme, and rando staff members are not allowed to arbitrarily make manual queries against users' data anyway.

As for optimizations: a custom serialization scheme can easily be more optimized than how the database stores JSON/BSON/hstore/etc. A custom scheme can be designed with full knowledge of the application's access patterns, conceptual schema, and field cardinality. Whereas whatever the db natively gives you is inherently designed to be generic, and therefore not necessarily optimized for your particular use-case.


I don’t think the argument that it doesn’t hurt transparency holds any water, since you have the exact same limitations on your development environment (e.g. no clue what the database actually stores)

If we’re talking about the scale you’re speaking at, your data storage costs might go down, but at the same time you’re now doing a deserialization action for every row, and you have no option of retrieving only some of the values, so you’re always going to deserialize all fields. The costs for your data access layer rise correspondingly (but maybe that’s a different team, so it doesn’t matter).

If you find you want to filter on one of those fields now, do you run a reindexing and update operation on a few trillion? rows (guess that might happen regardless though).

Whenever something sounds extremely counterintuitive, people show up defending it because it’s what they have been doing for years and it makes a sort of twisted sense.

I do appreciate all your messages on this topic though, they have been much more informative than the GP’s original comment, and I have no doubt there’s actually people doing it now.

If I ever run into a situation where storing all my fields as one big blob makes most sense, I’ll revise my opinion.


Thank you for being open-minded, I appreciate it. There's definitely truth to what you're saying -- at a high level these decisions do have major complexity trade-offs, and that inherently harms transparency.

That said, in large companies and large complex systems, it's not unusual for any given team to have no clue about how various far-flung parts of the stack work. Dev environments tend to look quite a bit different there too; for example there's often no notion of a singular local dev database that you can freely query anyway.

In terms of transparency and debugging, this usually means some combination of in-house tooling, plus using a REPL / code to examine persisted values instead of using a db client / SQL. Also automated test suites. Lots of tests -- better to programmatically confirm your code is correct than to rely on manual querying. Many SREs view having to SSH into a prod server to debug as an anti-pattern, and you could apply the same notion to having to manually query a database.

Regarding deserialization overhead, sure there's some, but deserialization is heavily optimized and ultimately becomes pretty cheap. Consider that more traffic is from apps than web these days, and app traffic = APIs = deserialization for every request itself. And lots more deserialization for each internal service call. And significantly more deserialization for every cache call, since caches (redis, memcached, etc) don't speak in columns / SQL anyway. Adding deserialization to database calls is honestly a drop in the bucket :)


I want to point out that the comments here in general were not conducive to sharing information. Your comment (shown below) was dismissive, so there was no way I was going to elaborate. I recommend that if you are skeptical about something, you ask questions about it so we can discuss it further, not simply dismiss it as an "anti-pattern". I would never talk/comment that way if I wanted to learn from others.

The point of my original comment got lost because of this "binary data in an RDBMS is an anti-pattern" detour. It was the least important thing, yet the one several people wanted to zero in on and pile onto.

> "Congratulations, you’ve offloaded to your app what your database is designed to deal with.

Storing all data as binary is an anti-pattern too, regardless of your qps.

I wouldn’t store anything but metadata in the database, the blob can be somewhere like S3."

> If you find you want to filter on one of those fields now, do you run a reindexing and update operation on a few trillion? rows (guess that might happen regardless though).

And my comment above answered this filtering question from the get-go, which underscores how much of a distraction this detour was: "Sometimes I may need higher performance or more dynamic queries, so just creating a table or elastic search with only indexable values to get the ids works. Use those IDs to fetch from original store."

Let's be honest, questions like "how to structure your tables" and "how to make your datastore operations more efficient" (a non-exhaustive list, it seems I need to point that out) verge on consulting, which is typically paid for, so of course comments on the topic will have less information than you would like. Skeptics who make blanket assertions will usually be disregarded. Some may be willing to divulge and others may not.


I think we’re on the same wavelength about this. We would have both dismissed each others’ comments and moved on our way.

I feel the need to thank the people that even so make an effort to elaborate their point.

Don’t get me wrong, I’m still convinced it’s an anti-pattern, it was just very politely explained to me by someone so I feel obliged to respond with the same.

Like I said in the other post, I’ll let you know if I ever reach a point where this is the only reasonable solution and I can’t think of something else.


Learning != convincing. Maybe that's some pessimism gained over time from people trying to convince you, among other things, of what is right. You missed the key words and the point of what I said, as did the others... the big picture. Goodbye.


> You may be surprised to learn that several very large and well-known social networks use this technique -- serializing unindexed columns into a blob

Not in an RDBMS though. The issues with blob storage I’ve mentioned in this thread relate specifically to RDBMS, and aren’t relevant to technology like Hadoop.


I'm talking about in an RDBMS, MySQL specifically. I am describing the core sharded relational product tables of well-known social networks.

I've personally worked on the database tiers of multiple social networks and am stating this concretely via many years of first-hand experience. At no point am I talking about Hadoop in any way, not sure why you've assumed that.


> At no point am I talking about Hadoop in any way, not sure why you've assumed that.

Because every large social network uses Hadoop to solve the type of problems you were describing.

> I am describing the core sharded relational product tables of well-known social networks.

Well, depending on how you set your clusters up, you’re probably not actually using an RDBMS for blob storage. If you have a multi-master setup with any form of weak consistency guarantees, then you’ve introduced a way to violate key and check constraints, and no longer have a system that complies with Codd’s 12 rules for defining an RDBMS. If you’re writing to a single node and offloading blob I/O to a separate node, then you’re essentially just using a MySQL server as an object store external to your RDBMS.


> Because every large social network uses Hadoop to solve the type of problems you were describing.

Hadoop is not used for serving real-time requests (user-facing page loads) in any social network that I am aware of.

You may be misunderstanding my comments. I'm describing use of the blob column type to store serialized data for unindexed fields. For example, metadata describing a particular user or piece of content. Let's say there are 100 possible fields/properties/columns for each user, but the data is sparse, meaning many of these fields are null/zero/default for each row. Rather than having 100 "real" columns mostly storing NULLs, social networks tend to serialize all these together (omitting the ones with empty values) into a single blob column, ideally using a strategy that maps the field names to a more compact numeric format, and storing this in a binary format (conceptually similar to BSON).
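A sketch of that compact numeric-tag idea (all field names and the wire format itself are invented for illustration; real implementations are conceptually similar but more elaborate):

```python
import struct

# Hypothetical sketch: field names map to small numeric tags, and only
# non-empty fields are written out as (tag, length, bytes) records.
FIELD_IDS = {"display_name": 1, "bio": 2, "location": 3}
FIELD_NAMES = {v: k for k, v in FIELD_IDS.items()}

def serialize(fields):
    out = b""
    for name, value in fields.items():
        if not value:                 # omit null/empty/default fields
            continue
        data = value.encode()
        out += struct.pack("<BH", FIELD_IDS[name], len(data)) + data
    return out

def deserialize(blob):
    fields, i = {}, 0
    while i < len(blob):
        tag, length = struct.unpack_from("<BH", blob, i)
        i += struct.calcsize("<BH")   # 3 bytes of header per field
        fields[FIELD_NAMES[tag]] = blob[i:i + length].decode()
        i += length
    return fields

blob = serialize({"display_name": "Ada", "bio": "", "location": "London"})
print(deserialize(blob))   # {'display_name': 'Ada', 'location': 'London'}
```

Each stored field costs three bytes of header instead of repeating a name, and empty fields cost nothing at all.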

This has nothing to do with "blob storage" meaning images/audio/video.

This has nothing to do with Hadoop. Social networks / UGC sites don't use Hadoop for "blob storage" either, btw.

This has nothing to do with multi-master setups, which are generally avoided by social networks due to risks of split brain and other inconsistency problems completely tangential to this topic.

This has nothing to do with foreign key constraints or check constraints, which are implemented entirely at the application level in a sharded social network, again for reasons completely tangential to this topic.

My original comment in this subthread was in response to a claim that using blobs for unindexed fields is "very very weird". I feel that claim is incorrect because it is a common approach, used e.g. by Facebook, Pinterest, and Tumblr (among others). I think you may have misread or misunderstood this subthread entirely.


> You may be misunderstanding my comments. I'm describing use of the blob column type to store serialized data for unindexed fields.

I understand perfectly. This is a very inefficient use of RDBMS. It violates every single normal form, and will lead to all of the performance bottlenecks I mentioned.

> This has nothing to do with foreign key constraints or check constraints, which are implemented entirely at the application level

As I stated before. This is not an RDBMS, it’s simply using RDBMS components as a storage service. An RDBMS must enforce key and check constraints, if this is enforced entirely at a higher level then you’ve violated the nonsubversion rule, and you’re dealing with a system that is not an RDBMS.

An RDBMS can only be called so if it conforms to Edgar Codd’s relational model; that is the exclusive definition of an RDBMS. Simply managing relational data on some level does not make an RDBMS. Excel does that and it’s certainly not an RDBMS. So does MongoDB, and it’s not an RDBMS either.


> An RDBMS must enforce key and check constraints, if this is enforced entirely at a higher level then you’ve violated the nonsubversion rule, and you’re dealing with a system that is not an RDBMS.

There are well over a hundred companies using a sharded database layout. This typically entails not using the RDBMS's built-in support for foreign key constraints, since they can't cross shard boundaries.

In your mind does this mean a sharded MySQL or Postgres environment is no longer a RDBMS? That's an extreme view if so, and IMO not a terribly useful or relevant distinction in the real world.

> will lead to all of the performance bottlenecks I mentioned.

In the situation I've described (large number of sparse fields), serializing non-empty non-indexed fields into a single blob results in smaller row sizes than making all the fields into real columns. And generally speaking, smaller row sizes tends to improve performance.

Look, I've literally spent a significant chunk of my life working on some of the largest databases that exist. I post here using my real name and have my information in my profile. If you want to have pseudonymous pedantic semantic arguments about why you claim the largest database fleets are no longer "relational", I suppose that's your prerogative, but kindly refrain from trying to lecture me about vague unsubstantiated "performance bottlenecks" in widely-used real-world approaches.


But you’re not talking about Postgres or MySQL. You’re talking about a larger bespoke system where one of the components is MySQL. The system you’re describing has:

* An object store

* A bespoke mechanism for defining schemas

* A bespoke mechanism for querying data

* A bespoke mechanism for mutating data

* A bespoke mechanism for enforcing relationships

* A bespoke mechanism for enforcing check constraints

The performance bottlenecks I described are not vague at all. They are well-known constraints of RDBMS. To come into a discussion about such constraints and say “well, if you only use the RDBMS for object storage, and then implement your own bespoke systems to replace all other RDBMS functionality with something completely different, then you don’t need to worry about that” is completely asinine, and to claim that such a system is still an RDBMS is just factually wrong.

You might be an excellent distributed system engineer, but it really seems like you don’t actually know what an RDBMS is.


In light of your earlier blatant misstatements about Hadoop, I don't think I'm the one with a knowledge problem here.

The original post here is entitled "Scaling Etsy". The ultimate solution that Etsy used, to solve the problem described in those tweets, was to move to sharded MySQL. With no database-side foreign key constraints. This architecture has served them well for over a decade now.

By your rigid definitions, Etsy isn't using an RDBMS, which would seemingly make all your RDBMS-related comments off-topic.


Proof of how well FB, Twitter, Tumblr, and all sorts of companies scale is already out there. Some people have a negative reaction (dissent and disbelief) to FB using PHP, saying they should have rewritten it in a different language, which is more than absurd. "PHP is an anti-pattern" or whatever.

I have seen zero examples so far of the fears and dissent raised by several people. Threats of horror stories, bottlenecks, or "just because you can do it doesn't mean you should" are all dismissive, anti-social responses that aren't conducive to learning.

I think we can leave it at that. No point in this endless replyfest. Some people will stay with their beliefs and never be open to what is done in practice and works, in hopes of just being "right".


I replied directly to you citing the specific reasons this is non-performant in an RDBMS, and you simply ignored it.

There is also nothing to disbelieve about the system described in the parent comment. It’s a perfectly sound design pattern. It’s just not an RDBMS. It is a system that uses an RDBMS as an object store, and then uses external systems for enforcing referential and check constraints, schema definition, querying, mutation, and consistency. If you’re just using a collection of RDBMSes to store objects, and performing all other operations in external systems, then RDBMS design constraints are mostly irrelevant.


It makes sense when you have enough scale that DDLing all the sharded databases for every release from every team is too much overhead.

(Storing unindexed columns (as JSON rather than binary though) is where I'm headed at my job, due to dynamic schema, the cost of coordinating DDL, and the excruciating cost of joins.)


I saw what you posted before: "You are mad". Before you press the send button you should make sure to follow the guidelines (https://news.ycombinator.com/newsguidelines.html) or your comment can be submitted for moderation. Leave the emotions out if you want to discuss, thanks. I could personally attack you too, but I'm not interested and it's not worth the time.

As I commented below, unsubstantiated generalizations are what I distrust. From your comment here, you say "horror stories"; where is the substantiation? Notice how my comment specifically said "if I need the speed", mentioned "database interpretation" of values, and described the use of caching for exploded unindexed values. I'm not sure what exactly you don't like other than binary data (512 B to 64 KB) in a database that is protected by a cache. There is literally no change over existing patterns of how to store data in a DB other than telling the DB not to interpret the values. Are you smearing the idea of lookaside caching in front of a DB (SQL no less)? Plenty of companies use it.
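For what it's worth, the lookaside (cache-aside) pattern mentioned here can be sketched in a few lines; plain dicts stand in for memcached/redis and for the SQL blob column, and all names are illustrative:

```python
import json

# Hypothetical sketch of lookaside caching: check the cache first; on a
# miss, read the row, deserialize ("explode") the blob, and populate
# the cache for subsequent reads.
cache = {}
db = {101: b'{"theme":"dark","lang":"en"}'}   # blob column keyed by user id

def get_user_fields(user_id):
    if user_id in cache:               # cache hit: already exploded
        return cache[user_id]
    fields = json.loads(db[user_id])   # miss: fetch and deserialize
    cache[user_id] = fields            # populate for next time
    return fields

first = get_user_fields(101)    # miss: goes to the "DB"
second = get_user_fields(101)   # hit: served from the cache
print(first == second)          # True
```

After the first read, the DB never sees the query again until the cache entry is evicted or invalidated.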

I prefer to use SQL data types unless I need the speed. What's so unreasonable about that? I never said I use it everywhere, as you purport.


> knowing how to structure your tables matters

I think we can agree on that. What I don’t think we agree on is that storing all non-indexed columns as binary is a good idea.

There are a lot of non-obvious issues with storing blobs in your relational database.

I’m sure you can mitigate them, but ultimately I think it’s better to just not give yourself that headache.


Unsubstantiated generalizations are the sort of thing I disagree with strongly. Why? Because they never look at the use case. Dismissing all ideas on the basis of one thing is itself a documented anti-pattern.

Let's do a Google search for 'storing blobs' to find out what people have to say about their experiences.

First link: https://dba.stackexchange.com/questions/2445/should-binary-f...

>Most RDBMS have optimizations for storing BLOBs (eg SQL Server filestream) anyway

Second to last link: https://www.quora.com/Is-it-a-bad-design-to-store-images-as-...

> Bill Karwin is correct that you can't make unequivocal statement that storing images in a database is always bad. I like a lot of his points. Most people who are against images in the DB usually are promoting the only thing they know how to do.

And I'm not talking about images. The object data I would serialize can vary between 512 B and 64 KB. (Always interesting to see the scale of memory nowadays; KB is so small.)


Those answers are both dubious and very outdated. At the time they were written, the most mature and widely available solution for blob storage was the filesystem, which has its own set of shortcomings for that use case. This is no longer true, and that alone frankly makes those answers obsolete.

Then you have the fact that those answers are ignoring a lot of facts. The answer you referenced from Bill Karwin talks about mutating blobs. This isn’t a reasonable design pattern at all, you create a new blob and update the reference. The downsides to storing blobs in an rdbms are so numerous that you’d really have to have a very strong justification for doing so, and I’m really struggling to think of any that are actually technical.


One I saw was a banking backend: stored blobs retrieved by account number from the mainframe were passed to mid-tier servers that mounted the blobs as r/w DB instances.

Sort of like piles of SQLite DBs on S3.


I mean, you do you, but you probably see the pattern in the responses to your comment.

It is theoretically possible that there’s a good use case for storing all fields in a blob. It’s just that none of the ones you mentioned are ones I agree with.

The only thing that makes sense to store in a blob is something that is otherwise incomprehensible (e.g. a jpeg file, doesn’t fit in any other column type), and even that is a bad idea.


The dissenting/disbelieving responses do not appear to be from people who've worked with an RDBMS at high scale. I'm content with being able to keep the system simple, sleep at night, not get pinged for database slowness, and feel good about the choice. That's a big enough reward for me and anyone else who has dealt with this.


> you’ve offloaded to your app what your database is designed to deal with.

A lot of NoSQL design patterns lead you into this trap tbh. If you’re trying to make NoSQL work with relational data, you end up in a lot of situations where you have to choose between complex, over-engineered interfaces, or pushing more work into the client. For example, requiring the client to manage intermediate states because you can’t make efficient use of transactions.


Storing blobs in a DB is nearly always a bad idea. The only reason I can think to do it would be if you had some insane business constraint which for some reason made it the least bad option. Even though most RDBMS support the blob type, they are absolutely not optimised for handling it. Blob access is always going to be slower through a DB, and blob I/O will take resources away from the rest of your app. Due to concurrency control, your DB is always going to be a bottleneck and is always going to be difficult to horizontally scale. It will also slow down your backups, and running backups eats into your DB performance again, so you want to make them as efficient as possible. Due to the sophistication of object storage solutions these days (which scale horizontally near-infinitely), you don’t even need to deal with the pain of the filesystem to avoid using blobs.


All good points, but there are other reasons to store images in the DB.

One is making it much easier to copy graphs of data, e.g. copying an account with a bunch of attached users, widgets, and their images.

When everything's in the DB this is just "INSERT INTO ... SELECT FROM". When images are held in e.g. S3, this gets an order of magnitude harder.
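A minimal sketch of that copy, using SQLite and made-up table/column names for illustration:

```python
import sqlite3

# Hypothetical sketch: when images live in the DB, copying an account's
# widgets (blobs included) is a single INSERT ... SELECT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY, account_id INTEGER, image BLOB)")
conn.executemany(
    "INSERT INTO widgets (account_id, image) VALUES (?, ?)",
    [(1, b"\x89PNG..."), (1, b"\xff\xd8JFIF...")],
)

# Copy account 1's widgets to account 2; the blobs come along for free:
conn.execute(
    "INSERT INTO widgets (account_id, image) "
    "SELECT 2, image FROM widgets WHERE account_id = 1"
)

count = conn.execute(
    "SELECT COUNT(*) FROM widgets WHERE account_id = 2").fetchone()[0]
print(count)   # 2
```

With the images in S3 instead, the same operation would mean listing, copying, and re-keying every object, plus handling partial failures.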



