Just because a very large and well-known company does something does not mean it’s a good idea.
If you put something in a JSON column, you give your DB some expectations about the data.
If you store your data in a binary column, you are just doing what the database already does in its backend with its normal columns, and any optimizations it might make are lost.
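To make that concrete, here’s a minimal sketch (Python + sqlite3; it assumes a SQLite build with the JSON functions available, and the table/field names are just illustrative): the DB can reach inside a JSON column, but a binary blob is opaque to it.

    import json, pickle, sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, doc TEXT, blob BLOB)")
    row = {"name": "alice", "age": 30}
    db.execute("INSERT INTO users VALUES (1, ?, ?)",
               (json.dumps(row), pickle.dumps(row)))

    # The DB can filter on a field inside the JSON document...
    print(db.execute(
        "SELECT id FROM users WHERE json_extract(doc, '$.age') > 21").fetchall())

    # ...but it has no idea what's inside the blob; the application has to
    # fetch and deserialize every candidate row itself.
    for (blob,) in db.execute("SELECT blob FROM users"):
        if pickle.loads(blob)["age"] > 21:
            print("matched, but only after decoding in application code")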
It might result in cost savings, but in my experience cost savings at the expense of transparency/usability are always a bad idea.
Sure, at small scale, it may be a bad idea. At no point did I say it was universally a good idea, just that it was not in fact "very very weird" as the GP claimed.
At large scale (especially social network scale that I'm describing and personally spent most of the past decade working in), the cost/benefit calculus is quite different. I'm talking 8-figure cost savings here. And there's no negative impact to "transparency/usability", since application queries are all going through a data access layer that understands the serialization scheme, and rando staff members are not allowed to arbitrarily make manual queries against users' data anyway.
As for optimizations: a custom serialization scheme can easily be more optimized than how the database stores JSON/BSON/hstore/etc. A custom scheme can be designed with full knowledge of the application's access patterns, conceptual schema, and field cardinality, whereas whatever the db natively gives you is inherently designed to be generic, and therefore not necessarily optimized for your particular use-case.
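As a hedged illustration of that point (the format, field names, and library choices below are my own assumptions, not anyone's actual scheme): when the application knows the schema up front, it can pack fields positionally and skip storing key names entirely, which a generic format like JSON cannot do.

    import json
    import struct

    # Fixed, application-known schema: user_id (u64), age (u8), flags (u16).
    FMT = "<QBH"  # little-endian, 11 bytes per record, no field names stored

    def serialize(user_id, age, flags):
        return struct.pack(FMT, user_id, age, flags)

    def deserialize(buf):
        return struct.unpack(FMT, buf)

    packed = serialize(12345, 30, 5)
    generic = json.dumps({"user_id": 12345, "age": 30, "flags": 5}).encode()
    print(len(packed), "bytes packed vs", len(generic), "bytes as JSON")

At billions of rows, that kind of per-record difference is presumably where the 8-figure savings come from.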
I don’t think the argument that it doesn’t hurt transparency holds any water, since you have the exact same limitations in your development environment (e.g. no clue what the database actually stores).
If we’re talking about the scale you’re describing, your data storage costs might go down, but you’re now doing a deserialization pass on every row, and you’ve lost the option of retrieving only a subset of the fields, so you’re always decoding all of them. The costs for your data access layer rise correspondingly (but maybe that’s a different team, so it doesn’t matter).
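A small sketch of that cost (table and field names are hypothetical): with normal columns the DB hands back just the field you asked for, while a blob forces a full fetch and a full decode even when you need a single value.

    import pickle, sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, "
               "email TEXT, bio TEXT, blob BLOB)")
    profile = {"email": "a@example.com", "bio": "..." * 1000}
    db.execute("INSERT INTO profiles VALUES (1, ?, ?, ?)",
               (profile["email"], profile["bio"], pickle.dumps(profile)))

    # Column storage: only the email travels over the wire.
    email, = db.execute("SELECT email FROM profiles WHERE id = 1").fetchone()

    # Blob storage: the whole record is fetched and decoded for one field.
    blob, = db.execute("SELECT blob FROM profiles WHERE id = 1").fetchone()
    email = pickle.loads(blob)["email"]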
If you find you want to filter on one of those fields now, do you run a reindexing and update operation on a few trillion(?) rows (though I guess that might happen regardless)?
Whenever something sounds extremely counterintuitive, people show up defending it because it’s what they have been doing for years and it makes a sort of twisted sense.
I do appreciate all your messages on this topic though; they have been much more informative than the GP’s original comment, and I have no doubt there are actually people doing it now.
If I ever run into a situation where storing all my fields as one big blob makes the most sense, I’ll revise my opinion.
Thank you for being open-minded, I appreciate it. There's definitely truth to what you're saying -- at a high level these decisions do have major complexity trade-offs, and that inherently harms transparency.
That said, in large companies and large complex systems, it's not unusual for any given team to have no clue about how various far-flung parts of the stack work. Dev environments tend to look quite a bit different there too; for example there's often no notion of a singular local dev database that you can freely query anyway.
In terms of transparency and debugging, this usually means some combination of in-house tooling, plus using a REPL / code to examine persisted values instead of a db client / SQL. Also automated test suites. Lots of tests -- better to programmatically confirm your code is correct rather than manually querying. Many SREs view having to SSH into a prod server to debug it as an anti-pattern, and you could apply the same notion to having to manually query a database.
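For a rough idea of what that looks like (everything below is hypothetical, just standing in for whatever in-house data access layer the application has), debugging happens through code that understands the serialization scheme rather than through a SQL client:

    >>> from myapp.dal import users       # hypothetical data access layer
    >>> u = users.get(user_id=12345)      # fetches the row, decodes the blob
    >>> u.email
    'a@example.com'
    >>> u.to_dict()                       # full decoded record, for inspection
    {'user_id': 12345, 'email': 'a@example.com', 'age': 30}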
Regarding deserialization overhead, sure there's some, but deserialization is heavily optimized and ultimately becomes pretty cheap. Consider that more traffic is from apps than web these days, and app traffic = APIs = deserialization for every request itself. And lots more deserialization for each internal service call. And significantly more deserialization for every cache call, since caches (redis, memcached, etc) don't speak in columns / SQL anyway. Adding deserialization to database calls is honestly a drop in the bucket :)
I want to point out that the comments here in general were not conducive to giving out information. Your comment (shown below) was dismissive, so there was no way I was going to elaborate. I recommend that if you are skeptical about something, you ask questions about it so we can discuss it more, rather than simply dismissing it as an "anti-pattern". I would never talk/comment that way if I wanted to learn from others.
The point of my original comment got lost because of this "binary data in rdbms is an anti-pattern" detour. It was the least important thing, yet it's what several people wanted to zero in on and pile onto.
> "Congratulations, you’ve offloaded to your app what your database is designed to deal with.
Storing all data as binary is an anti-pattern too, regardless of your qps.
I wouldn’t store anything but metadata in the database, the blob can be somewhere like S3."
> If you find you want to filter on one of those fields now, do you run a reindexing and update operation on a few trillion(?) rows (though I guess that might happen regardless)?
And my comment above answered this filtering question from the get-go, which underscores what a distracting detour this was: "Sometimes I may need higher performance or more dynamic queries, so just creating a table or elastic search with only indexable values to get the ids works. Use those IDs to fetch from original store."
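For anyone who skimmed past that, a minimal sketch of the pattern (sqlite3 and pickle here are just stand-ins for the real index store and serialization scheme): a narrow side table holds only the indexable fields plus the id, and the full records stay in the original blob store.

    import pickle, sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE store (id INTEGER PRIMARY KEY, blob BLOB)")
    db.execute("CREATE TABLE idx (id INTEGER, age INTEGER)")
    db.execute("CREATE INDEX idx_age ON idx (age)")

    for i, age in enumerate([17, 30, 45], start=1):
        db.execute("INSERT INTO store VALUES (?, ?)",
                   (i, pickle.dumps({"id": i, "age": age})))
        db.execute("INSERT INTO idx VALUES (?, ?)", (i, age))

    # Step 1: the dynamic/filtered query runs against the narrow index table.
    ids = [r[0] for r in db.execute("SELECT id FROM idx WHERE age > 21")]

    # Step 2: fetch the full records from the original store by id.
    marks = ",".join("?" * len(ids))
    rows = db.execute(f"SELECT blob FROM store WHERE id IN ({marks})", ids)
    print([pickle.loads(b)["age"] for (b,) in rows])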
Let's be honest: questions like "how to structure your tables" or "how to make your datastore operations more efficient" (a non-exhaustive list, it seems I need to point that out) veer into consulting, which is typically paid for, so of course comments on the topic will have less information than you would like. Skeptics who make blanket assertions will usually be disregarded. Some may be willing to divulge and others may not be.
I think we’re on the same wavelength about this. We would have both dismissed each other’s comments and moved on our way.
I feel the need to thank the people who nevertheless make the effort to elaborate their point.
Don’t get me wrong, I’m still convinced it’s an anti-pattern; it was just very politely explained to me, so I feel obliged to respond with the same politeness.
Like I said in the other post, I’ll let you know if I ever reach a point where this is the only reasonable solution and I can’t think of something else.
Learning != convincing. Maybe that’s some kind of pessimism gained over time from people trying to convince you (among other things) of what is right. You missed the key words and the point of what I said, as did the others... the big picture. Goodbye.