DuckDB is mind-blowingly awesome. It is like SQLite: a lightweight, embeddable, serverless, in-process database, but optimized to be columnar (analytics-oriented). It can work with files on the local filesystem, S3, etc. without copying them (it reads only the regions of the file it needs) by just doing `select * from 's3://....something.parquet'`. It supports automatic compression and automatic indexing. It can read JSON Lines, Parquet, CSV, its own db format, SQLite databases, Excel, Google Sheets... It has a very convenient SQL dialect with many QoL improvements and extensions (and is PostgreSQL compatible). Best of all: it's incredibly fast. Sometimes it's so fast that I find myself puzzled ("how can it possibly analyze 10M rows in 0.1 seconds?") and I find it difficult to replicate the performance in pure Rust. It is an extremely useful tool. In the last year, it has become one of my everyday tools because the scope of problems you can just throw DuckDB at is gigantic. If you have a whole bunch of structured data that you want to learn something about, chances are DuckDB is the ideal tool.
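To give a concrete flavour, here is a rough sketch of the kind of queries I mean (the bucket, file, and column names below are made up for illustration; remote access needs the httpfs extension):

```sql
-- Query a Parquet file sitting in S3 directly; DuckDB fetches only
-- the byte ranges it needs (httpfs handles the remote access).
INSTALL httpfs;
LOAD httpfs;

SELECT country, avg(amount) AS avg_amount
FROM 's3://my-bucket/sales/2024.parquet'   -- hypothetical file
GROUP BY country
ORDER BY avg_amount DESC;

-- The same style of query works for local files in other formats.
SELECT count(*) FROM read_json_auto('events.jsonl');  -- JSON Lines
SELECT * FROM 'measurements.csv' LIMIT 10;            -- CSV
```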
PS: Not associated with the DuckDB team at all, I just love DuckDB so much that I shill for them when I see them on HN.
I'm sorry, I must be exceptionally stupid (or I haven't seriously worked in this particular problem domain and thus lack awareness), but I still can't figure out the use cases from this feature list.
What sort of thing should I be working on, to think "oh, maybe I want this DuckDB thing here to do this for me?"
I guess I don't really get the "that you want to learn something about" bit.
I’m not the person you asked, but here are some random, assorted examples of “structured data you want to learn something about”:
- data you’ve pulled from an API, such as stock history or weather data,
- banking records you want to analyze for patterns, trends, unauthorized transactions, etc.,
- your personal fitness data, such as workouts, distance, pace, etc.,
- your personal sleep patterns (data retrieved from a sleep tracking device),
- data you’ve pulled from an enterprise database at work: financial data, transactions, inventory, transit times, or anything else stored there that you might need to pull and analyze.
Here’s a personal example: I recently downloaded a publicly available dataset that came in the form of a 30 MB CSV file, but instead of using commas to separate fields, it used the pipe character (‘|’). I used DuckDB to quickly read the data from the file. I could have queried the file directly using DuckDB SQL, but in my case I saved it to a local DuckDB database and queried it from there.
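In case it helps, this is roughly what that workflow looks like (the file and table names here are placeholders, not the actual dataset):

```sql
-- Read the pipe-delimited file; DuckDB usually sniffs the delimiter,
-- but it can be set explicitly.
SELECT * FROM read_csv('dataset.csv', delim = '|', header = true) LIMIT 5;

-- Persist it into a local DuckDB database file and query it from there.
ATTACH 'local.duckdb' AS db;
CREATE TABLE db.dataset AS
    SELECT * FROM read_csv('dataset.csv', delim = '|', header = true);
SELECT count(*) FROM db.dataset;
```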
My dumb guy heuristic for DuckDB vs SQLite is something like:
- Am I doing data analysis?
- Is it read-heavy, write-light, using complex queries over large datasets?
- Is the dataset large (several GB to terabytes or more)?
- Do I want to use parquet/csv/json data without transformation steps?
- Do I need to distribute the workload across multiple cores?
If any of those are a yes, I might want DuckDB
- Do I need to write data frequently?
- Are ACID transactions important?
- Do I need concurrent writers?
- Are my data sets tiny?
- Are my queries super simple?
If most of the first questions are no and some of these are yes, SQLite is the right call
One way to think about it: it's SQLite for columnar/analytical data.
It works great against local files, but my favorite DuckDB feature is that it can run queries against remote Parquet files, fetching just the byte ranges it needs to answer the query using HTTP range requests.
This means you can run e.g. a count(*) against a 15 GB Parquet file from your laptop and only fetch a few hundred KB of data (if that).
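A sketch of what that looks like; the URL is a placeholder, and any HTTP(S) or S3 endpoint that supports range requests should behave the same way:

```sql
INSTALL httpfs;
LOAD httpfs;

-- For a plain count(*), DuckDB mostly needs the Parquet footer and
-- row-group metadata, so very little of the file gets downloaded.
SELECT count(*)
FROM read_parquet('https://example.com/data/big-file.parquet');

-- A filter on one column only pulls the column chunks involved.
SELECT count(*)
FROM read_parquet('https://example.com/data/big-file.parquet')
WHERE year = 2023;
```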
Small intro: it's primarily a relational database for analytical data. It's an "in-process" database, meaning it runs inside your application rather than as a separate server, and you can load files at runtime and query them directly. That's the main way it differs from regular relational systems.
For the average developer, I think the killer feature is that it lets you query whatever data you want (CSV, JSON, Parquet, even Google Sheets) as equals, directly in their file form; you can even join across them.
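A minimal sketch of what that looks like (file names and columns are invented for the example):

```sql
-- Join a Parquet file against a CSV as if both were ordinary tables.
SELECT o.order_id, o.total, c.segment
FROM 'orders.parquet' AS o
JOIN 'customers.csv' AS c USING (customer_id)
WHERE o.total > 100;
```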
It has great CSV and JSON processing, so I find it's almost better thought of as an Excel-like tool than a database. Great for quick analysis and exploratory work. Example: I need to do some reporting mock-ups on Jira data; DuckDB sucks it all in (or queries the exports in place), makes it easy to clean, filter, pivot, etc., then export back to CSV.
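Something along these lines, for example (the export file and column names are assumptions, just to show the shape of the workflow):

```sql
-- Pull the Jira issue export in, reshape it, write the result back out.
CREATE TABLE issues AS
    SELECT * FROM read_csv('jira_export.csv', header = true);

-- e.g. issues per assignee broken out by status, exported for the mock-up
COPY (
    SELECT *
    FROM (PIVOT issues ON status USING count(*) GROUP BY assignee)
    ORDER BY assignee
) TO 'report.csv' (HEADER, DELIMITER ',');
```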
If you're developing in the data space, you should consider your "small data" scenarios (e.g. the vast majority of our clients have < 1 GB of analytical data; Snowflake, etc. is overkill). Building a data warehouse that exists entirely in a local browser session is possible now; that's a big deal.