Not sure tbh, but right now there's a supporter tier ($50) to support sustainable development; it includes everything from the free tier, plus early access to new features and a Discord badge.
This makes a ton of sense! We've faced the same exact problem because whenever we need to add a new materialized table or column, we need to backfill data into it.
For materialized columns, it's "easy" (not really, because you still need to monitor the backfill): we can run something like `ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1`. Depending on the materialization, that can drive up the load on ClickHouse, since it creates a lot of background mutations that we need to monitor; they can fail for all sorts of reasons (memory limit or disk space errors), in which case we have to jump in and fix things by hand.
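For anyone curious, a minimal sketch of what that looks like (the table/column names are just the ones from the example above; `system.mutations` is ClickHouse's built-in mutation log):

```sql
-- Kick off the backfill; this returns immediately and the actual rewrite
-- happens asynchronously as background mutations.
ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1;

-- Watch the in-flight mutations and catch failures (memory limit, disk space, ...):
SELECT mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE table = 'events' AND NOT is_done;
```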
For materialized tables, it's a bit harder: we need to write custom scripts to load data in day-wise (or month-wise, depending on the data size) chunks like you mentioned, since a plain `INSERT INTO table SELECT * FROM another_table` will almost certainly run into memory limit errors on larger tables.
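A rough sketch of the chunked version, assuming there's a `timestamp` column to slice on (the table and column names here are made up):

```sql
-- Backfill one day at a time so each INSERT stays within memory limits.
INSERT INTO events_daily_mv
SELECT user_id, event_type, timestamp
FROM events
WHERE toDate(timestamp) = '2024-01-01';

-- Then the next day, and so on; in practice a small script loops over the dates.
INSERT INTO events_daily_mv
SELECT user_id, event_type, timestamp
FROM events
WHERE toDate(timestamp) = '2024-01-02';
```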
I would love to think more about this to see if it's something that would make sense to handle within Houseplant.
Interesting, we didn't run into the need for blocking and async migrations for our use case, because we do incremental updates to our ClickHouse schema. What do you mean by needing to change the ordering of existing tables? I'll check out the background migrations link tomorrow!
Why not use comment-annotated .sql like the Liquibase or Flyway database migration tools do? How does YAML help?
As someone who did migrations wrapped in XML back in the day: those parsing wrappers around SQL (XML, YAML, JSON) add complexity for the user and can produce opaque error messages caused by the wrapper format itself (quoting, escaping, indenting, allowed characters, etc.).
I get that a certain fraction of your userbase may like YAML, and I am not saying you shouldn't support it. And I get that the YAML parsing library can accelerate certain dev steps for you. But the impedance mismatch between YAML and SQL, and the error handling it entails, will make your product's usability worse.
You are already building a domain-specific language of sorts. You will build a better product if you keep your "language" as close to the problem domain (SQL-centric migrations) as possible. YAML just adds complexity.
YAML is bad, first and foremost, because the language is hideously awful. Anything that can interpret Norway's country code (`no`) as a bool, or certain numbers as sexagesimal, has not been well thought out.
As to the rest, I continually fail to see what's wrong with SQL. It's incredibly easy to read, in plain English, and to understand what is being done. `ALTER TABLE foo ADD COLUMN bar INTEGER NOT NULL DEFAULT 0`… hmm, wonder what that's doing?
For people who say you get too many .sql files doing it this way:

a. Get your data models right the first time.
b. Build a tool to parse and combine statements, if you must.
I agree with your point about SQL's readability; however, I think there's value in having structured metadata around migrations, like dependencies between migrations, rollback instructions, and environment-specific variations, beyond just the raw SQL.
This could be done with SQL comments, but having it in a structured format makes it more reliable to parse and validate programmatically. I do see why YAML's quirks could become a problem in a tool that's meant to help you make sure your database is in order; we haven't run into issues like country codes becoming booleans or numbers being interpreted as sexagesimal (yet).
Perhaps a middle ground would be to keep the actual migrations in .sql files for readability, while using a separate metadata file (in JSON or TOML) for the orchestration details. What do you think?
Re: dependencies, SQL is declarative. While there might be benefit in keeping the last modification as its own entity, I think just dumping the output of `SHOW CREATE TABLE` or the equivalent is the best solution. It clearly shows the current state of the table. This of course requires that the migration has already occurred, or has been faked for display purposes.
I guess my stance is that if you need some kind of config file to create the migration, then fine, but as a DBRE, it’s much easier for me to reason about the current state by seeing the declarative output than by mentally summing migration files. As a bonus, this lets you see quite easily if you’ve done something silly like creating an additional index on a column that has a UNIQUE constraint (though you’d still have to know why that’s unnecessary, I suppose).
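To illustrate (a made-up table, not anyone's actual schema), this is the kind of single declarative snapshot I mean, versus mentally summing a stack of ALTERs:

```sql
SHOW CREATE TABLE events;

-- CREATE TABLE events
-- (
--     user_id UInt64,
--     event_type LowCardinality(String),
--     timestamp DateTime,
--     country LowCardinality(String) DEFAULT ''
-- )
-- ENGINE = MergeTree
-- ORDER BY (user_id, timestamp)
```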
Yep, that's the reason we generate a schema.sql at the end, so that you know the current state of the database after the migrations have been applied (you don't need to mentally sum those migrations). Coupled with git, you get to see the diff of the schema.sql to see what changed, which is super useful, like the same additional-index example you mentioned.
You can run DDL commands directly on ClickHouse, but once you have a lot of these commands in your codebase, they become a pain to manage. We had them defined in a lot of Ruby files named `<timestamp>_create_table.rb`, mimicking the way Rails and Active Record manage the app database migrations. Rails and Active Record also give you a schema.rb which shows the latest state of the app database at all times. After having no equivalent of schema.rb for a long time, and with the standalone Ruby migration files becoming a pain to manage, we decided to build this to manage migrations for ClickHouse, our analytics database. We love using it, and the schema.sql file (with the latest DB state) it generates at the end.
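For context, here's the kind of DDL those standalone migrations contain; a made-up pair of statements that would previously have lived in two separate timestamped files:

```sql
-- e.g. a "create events" migration
CREATE TABLE events
(
    user_id UInt64,
    event_type LowCardinality(String),
    timestamp DateTime
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp);

-- e.g. a later "add country to events" migration
ALTER TABLE events ADD COLUMN country LowCardinality(String) DEFAULT '';
```

The generated schema.sql is just the accumulated result of statements like these, so you read one file instead of replaying the whole history.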
As pointed out in this thread, right now it only works with text-based PDFs. But there's a PR[1] that will add OCR support (using EasyOCR) for image-based PDFs at some point.
Yes, I need to work on that PR; I haven't been getting a lot of free time these days. It adds OCR support using EasyOCR, which I found on HN some time ago!