A new storage engine for PostgreSQL to provide better control over bloat

jacobkg · on March 6, 2018

This is potentially BIG news. At my last company Postgres bloat was a gigantic problem. We did a lot of data updates and sometimes customer databases (we sharded by schema) would be bloated by as much as 300%. This really hurts when some customers are 50+ GB. Not only does bloat take up more space but it slows down queries.

We would occasionally resort to taking the customer offline for scheduled maintenance to do a pg_dump/pg_restore to get their DB back in shape.

I'm very curious to hear when this might become available for production use cases.

EDIT: Article states this has been proposed for Postgres 12

grzm · on March 6, 2018

This can indeed be an issue, with both indexes and tables. For tables, I've had success with tuning fillfactor to take into account update patterns. If there's a deterministic subset of rows that are updated more frequently, splitting those into a smaller table/partition, perhaps with a different fillfactor can help as well. This also makes vacuum faster for this subset. Even if there isn't a deterministic subset, partitioning for smaller table size can speed up vacuum regardless.

For indexes, using concurrent reindex to create new indexes and dropping the old can make a difference.

Yeah, it would be nice if more of this were automated and it's good that there's work being done. There are some options available in existing that can ameliorate bloat.

jacobkg · on March 6, 2018

Solid advice! I remember reading about fillfactor but didn't get a chance to try it out on our workload.

yardstick · on March 6, 2018

Were the tables under your control? Could you set a low fill factor [1] for frequently updated tables (eg 30 or lower)? That would help avoid creating too many db pages.

[1] https://www.postgresql.org/docs/10/static/sql-createtable.ht...

felixge · on March 6, 2018

Are you aware of pg_repack [1]? I haven't had a chance to use it myself yet, but it's designed for reclaiming bloated space without downtime.

[1] https://github.com/reorg/pg_repack/

jacobkg · on March 6, 2018

We tested pg_repack early on. Our experience (at least 5 years ago) was it didn't reliably clean up after itself in all cases so we were left with random triggers in the database. With thousands of customer databases this was a huge potential hassle.

I have heard plenty of people use pg_repack so maybe it got better or we didn't test it enough.

al_james · on March 6, 2018

We use pg_repack and hit that exact problem (left over triggers etc). In the end we resorted to creating and dropping the pg_repack extension after each run.

(edit: Missed an "and")

kawsper · on March 6, 2018

Do you have any idea why their docs are not mentioning this, or even why their software aren't doing that itself?

cookiecaper · on March 6, 2018

I haven't used pg_repack in production, but on my development box with PgSQL 10.x, if you end the process early, pg_repack will complain on the next run and ask you to clean up by dropping and re-adding the extension. Maybe this occurs in the wild if the cleanup code is not reached, for whatever reason.

danmaz74 · on March 6, 2018

We use pg_repack with automated schedules and it's really helpful.

deepsun · on March 6, 2018

Was it an old version of Postgres? I'm just curious to know why VACUUM couldn't help. Did you try VACUUM FULL?

jacobkg · on March 6, 2018

Vacuum helps with deletes but updates create new rows and vacuum never cleans up the old ones.

Edit: My statement about updates is not correct, see comments below

Vacuum Full works but it requires a full table lock so we could only use it on select tables for smaller customers. For large tables Vacuum Full can take many minutes and once you need to take downtime pg_dump/pg_restore becomes an option.

pritambaral · on March 6, 2018

> Vacuum helps with deletes but updates create new rows and vacuum never cleans up the old ones.

Only if there are long running transactions older than the updates themselves. When no old transactions are left, vacuum reclaims old tuples.

If you're talking about vacuum being unable to reclaim pages that still have some live tuples, that problem affects deletes and updates equally.

jacobkg · on March 6, 2018

Thank you for correcting my misunderstanding!

riku_iki · on March 6, 2018

> Vacuum helps with deletes but updates create new rows and vacuum never cleans up the old ones.

Where did you read this?..

_pj_ · on March 6, 2018

I really wish the EDB guys (both Robert Haas and Amit Kapila) would mention in their blog posts about this feature that it's meant to use the pluggable storage API which is also slated for PG12. That would make it more clear that we are not abandoning the current storage format, just giving users choice (per table) to use what works best for their use case. It would also alleviate the worries that some comments expressed for both of the blog posts.

amitkapila · on March 6, 2018

To me that was obvious, but you are right, it is better to be clear. I have added a below sentence my blog post.

"I would like to mention that the above results don't mean that zheap will be better in all cases than heap. For example, rollbacks will be costlier in zheap. Just to be clear, this storage format is proposed as another format alongside current heap, so that users can decide which storage they want to use for their use case."

fanf2 · on March 6, 2018

Where can we find a summary of the pluggable storage API?

_pj_ · on March 6, 2018

I don't think there is a good summary, it hasn't been settled yet. The (rather lengthy) discussion is like with everything on -hackers mailing list - https://www.postgresql.org/message-id/flat/20160812231527.GA...

macdice · on March 6, 2018

There was an intelligent comment about the extra writes when writing to UNDO and the extra reads when anyone needs to read UNDO and suggesting that it might not always beat the traditional system which just has to write a new tuple. It was deleted while I was composing my reply, so here, without context, is my reply!

Certainly there are complicated trade-offs here. Some things to consider: (1) unless the pages get flushed to disk due to memory pressure or a checkpoint, the UNDO data won't hit the disk so that '2x write' isn't a disk write, (2) in-place updates don't require every index to be updated (only indexes whose key changed), skipping the famous "write amplification" problem (more generally than HOT can), (3) the 2x reads for transactions that need modified rows applies only to transactions running with old snapshots; in the traditional system every transaction pays for reading past the bloat that can't be cleaned up (in other words, an old snapshot is a tax on every transaction in the traditional design, but UNDO is only a tax on the transaction that's using an old snapshot in this proposed system).

CWuestefeld · on March 6, 2018

I'm a PostgreSQL newbie [1]. From the discussion below, it seems that most of what's being addressed [2] is only really relevant in cases where there are really long-lived transactions. So a better-behaved system in which the transactions are of fleeting duration wouldn't see as much benefit because they don't suffer much from the problem.

Am I understanding that correctly?

[1] Our DBs are currently all SQL Server, but due to licensing costs, we're going to do all new stuff in pg, and over time, migrate the legacy stuff.

[2] I see there are other across-the-board benefits from e.g., smaller tuple headers; I'm talking about the main thrust.

garyclarke27 · on March 6, 2018

Considering that: storage costs are falling rapidly - update in place is a poor design, immutable far superior - pg is rock solid because of this architecture including DDL rollback - vacuum can recover dead space anyway! - this does seem to me to be a bit of a wasted (large) effort. Reducing size of headers and padding sounds good though. I would much rather see effort spent on incremental auto refresh of materialized views. NOT IN Foreign Keys,would also be useful to enforce XOR constraints. And Table Input parameters for Functions.

amitkapila · on March 6, 2018

Vacuum can't recover dead space in all cases. Consider if the dead rows are distributed throughout the table, it won't be able to reclaim it, only it can facilitate the future insertions into it. Also, the work it does to do that is many-a-times quite heavy for many workloads as it can slow down the system. I think you haven't fully looked at all the other benefits of this system which includes reducing write amplification by (a) by avoiding rewrites due to freezing, hint bits (b) making it possible to do an update that touches indexed columns without updating every index.

I think there are other benefits too in terms of just raw performance because it avoids a lot of work like HOT pruning. Yes, it instead needs to discard undo, but that is much cheaper.

In general, I think it is an overstatement to say that this is a wasted effort.

garyclarke27 · on March 6, 2018

Yes I didn’t mean completely wasted, just feels to me that this requires a huge effort and is a risk to Postgres’s legendary reliability - for a relatively small gain, especially considering that ssd storage capacity will continue to increase dramatically and plummet in price. Would be better I think to prepare for new paradigm of unified immutable memory coming v soon, (Optane already shipping) which will enable Postgres to compete with ‘in memory, databases like SAP Hanna

grzm · on March 6, 2018

I agree that a lot of the features you list are desirable. Like many open source projects, contributors work on features they’re interested in or paid to work on. It’s likely that you could find developers who could be motivated to realign their priorities if properly incentivized.

garyclarke27 · on March 6, 2018

I agree, still amazes me how good Postgres is for free, coming from Microsoft partner world of insane SQL Enterprise per core licensing. How much incentive would you guess is required, to get auto incremental mat view refresh?

grzm · on March 6, 2018

It’s likely developer dependent. You could reach out to one of the consultancies, or post an email to one of the mailing lists, though I’m not sure which one. Starting with -general would be s safe bet, and including a request for recommendation for a better list would likely get you sorted.

https://www.postgresql.org/list/

kjerniga · on March 11, 2018

What's your use case for incremental materialized view refresh? Are you using materialized views within a single database/instance, or across links between databases?

jayflux · on March 6, 2018

> zheap will prevent bloat (a) by allowing in-place updates in common cases and (b) by reusing space as soon as a transaction that has performed a delete or non-in-place update has committed

I’m not going to pretend to understand all of this, but is there not a trade off here? Is MVCC not fast and concurrent because it does create new rows instead of trying to mutate the old/current ones, would it be slower writes?

In other words, are we sacrificing speed for less storage consumption?

amitkapila · on March 6, 2018

It appears like we are sacrificing speed in some cases, but it is not actually true. In the current heap, we have to modify a current row, add a new row, and then WAL log diff of both rows (or in some cases need to write both rows) and routinely perform Hot-pruning. Now, when the new row can't fit on the same page, we have to write both the pages, the page which contains old row and the page that contains new row.

In zheap, we need to write an old row in undo, but in many cases, it won't be written to disk as undo worker will discard the undo very quickly unless there are open snapshots. In this, we don't need to perform Hot-pruning. So, there is an additional cost of memory copy, but that is more than compensated by savings.

You can read 'macdice' reply in this regards.