Not sure tbh, but right now there's a supporter tier ($50) to support sustainable development; it includes everything from the free tier, plus early access to new features and a Discord badge.
This makes a ton of sense! We've faced the same exact problem because whenever we need to add a new materialized table or column, we need to backfill data into it.
For materialized columns, it's "easy" (not really, because you still need to monitor the backfill): we can run something like `ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1`. Depending on the materialization, that can drive up the load on ClickHouse, since it creates a lot of background mutations that we need to monitor; they can fail for all sorts of reasons (memory limit or disk space errors), in which case we have to jump in and fix things by hand.
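For anyone curious, a minimal sketch of what that looks like (the table/column names are just the ones from the example above; `system.mutations` is ClickHouse's built-in mutation log):

```sql
-- Kick off the backfill; this returns immediately and the actual rewrite
-- happens asynchronously as background mutations.
ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1;

-- Watch the in-flight mutations and catch failures (memory limit, disk space, ...):
SELECT mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE table = 'events' AND NOT is_done;
```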
For materialized tables, it's a bit harder: we need to write custom scripts to load data in day-wise (or month-wise, depending on the data size) chunks like you mentioned, since a plain `INSERT INTO table SELECT * FROM another_table` will almost certainly run into memory limit errors on larger tables.
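A rough sketch of the chunked version, assuming there's a `timestamp` column to slice on (the table and column names here are made up):

```sql
-- Backfill one day at a time so each INSERT stays within memory limits.
INSERT INTO events_daily_mv
SELECT user_id, event_type, timestamp
FROM events
WHERE toDate(timestamp) = '2024-01-01';

-- Then the next day, and so on; in practice a small script loops over the dates.
INSERT INTO events_daily_mv
SELECT user_id, event_type, timestamp
FROM events
WHERE toDate(timestamp) = '2024-01-02';
```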
I would love to think more about this to see if it's something that would make sense to handle within Houseplant.
Interesting, we didn't run into the need for blocking and async migrations for our use case, because we do incremental updates to our ClickHouse schema. What do you mean by needing to change the ordering of existing tables? I'll check out the background migrations link tomorrow!
Why not use comment-annotated .sql like the Liquibase or Flyway database migration tools do? How does YAML help?
As someone who did migrations wrapped in XML back in the day: those parsing wrappers around SQL (XML, YAML, JSON) add complexity for the user and can produce opaque error messages caused by the wrapper format itself (quoting, escaping, indenting, allowed characters, etc.).
I get that a certain fraction of your userbase may like YAML, and I am not saying you shouldn't support it. And I get that the YAML parsing library can accelerate certain dev steps for you. But the impedance mismatch between YAML and SQL, and the error handling it entails, will make your product's usability worse.
You are already building a domain-specific language of sorts. You will build a better product if you keep your "language" as close to the problem domain (SQL-centric migrations) as possible. YAML just adds complexity.
YAML is bad, first and foremost, because the language is hideously awful. Anything that can interpret Norway's country code (`no`) as a bool, or certain numbers as sexagesimal, has not been well thought out.
As to the rest, I continually fail to see what's wrong with SQL. It's incredibly easy to read, in plain English, and to understand what is being done. `ALTER TABLE foo ADD COLUMN bar INTEGER NOT NULL DEFAULT 0`… hmm, wonder what that's doing?
For people who say you get too many .sql files doing it this way:

a. Get your data models right the first time.
b. Build a tool to parse and combine statements, if you must.
I agree with your point about SQL's readability; however, I think there's value in having structured metadata around migrations, like dependencies between migrations, rollback instructions, and environment-specific variations, beyond just the raw SQL.
This could be done with SQL comments, but having it in a structured format makes it more reliable to parse and validate programmatically. I do see why YAML's quirks could become a problem in a tool that's meant to help you make sure your database is in order; we haven't run into issues like country codes becoming booleans or numbers being interpreted as sexagesimal (yet).
Perhaps a middle ground would be to keep the actual migrations in .sql files for readability, while using a separate metadata file (in JSON or TOML) for the orchestration details. What do you think?
Re: dependencies, SQL is declarative. While there might be benefit in keeping the last modification as its own entity, I think just dumping the output of `SHOW CREATE TABLE` or the equivalent is the best solution. It clearly shows the current state of the table. This of course requires that the migration has already occurred, or has been faked for display purposes.
I guess my stance is that if you need some kind of config file to create the migration, then fine, but as a DBRE, it’s much easier for me to reason about the current state by seeing the declarative output than by mentally summing migration files. As a bonus, this lets you see quite easily if you’ve done something silly like creating an additional index on a column that has a UNIQUE constraint (though you’d still have to know why that’s unnecessary, I suppose).
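To illustrate (a made-up table, not anyone's actual schema), this is the kind of single declarative snapshot I mean, versus mentally summing a stack of ALTERs:

```sql
SHOW CREATE TABLE events;

-- CREATE TABLE events
-- (
--     user_id UInt64,
--     event_type LowCardinality(String),
--     timestamp DateTime,
--     country LowCardinality(String) DEFAULT ''
-- )
-- ENGINE = MergeTree
-- ORDER BY (user_id, timestamp)
```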
Yep, that's the reason we generate a schema.sql at the end, so that you know the current state of the database after the migrations have been applied (you don't need to mentally sum those migrations). Coupled with git, you get to see the diff of the schema.sql to see what changed, which is super useful, like the same additional-index example you mentioned.
You can run DDL commands directly on ClickHouse, but once you have a lot of these commands in your codebase, they become a pain to manage. We had them defined in a lot of Ruby files named `<timestamp>_create_table.rb`, mimicking the way Rails and Active Record manage the app database migrations. Rails and Active Record also give you a schema.rb which shows the latest state of the app database at all times. After having no equivalent of schema.rb for a long time, and with the standalone Ruby migration files becoming a pain to manage, we decided to build this to manage migrations for ClickHouse, our analytics database. We love using it, and the schema.sql file (with the latest DB state) it generates at the end.
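For context, here's the kind of DDL those standalone migrations contain; a made-up pair of statements that would previously have lived in two separate timestamped files:

```sql
-- e.g. a "create events" migration
CREATE TABLE events
(
    user_id UInt64,
    event_type LowCardinality(String),
    timestamp DateTime
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp);

-- e.g. a later "add country to events" migration
ALTER TABLE events ADD COLUMN country LowCardinality(String) DEFAULT '';
```

The generated schema.sql is just the accumulated result of statements like these, so you read one file instead of replaying the whole history.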
As pointed out in this thread, right now it only works with text-based PDFs. But there's a PR[1] that will add OCR support (using EasyOCR) for image-based PDFs at some point.
Yes, I need to work on that PR; I haven't been getting a lot of free time these days. It adds OCR support using EasyOCR, which I found on HN some time ago!