We built Korvus, an open-source RAG (Retrieval-Augmented Generation) pipeline that consolidates the entire RAG workflow - from embedding generation to text generation - into a single SQL query, significantly reducing architectural complexity and latency.
Here are some of the highlights:
- Full RAG pipeline (embedding generation, vector search, reranking, and text generation) in one SQL query
- SDKs for Python, JavaScript, and Rust (more languages planned)
- Built on PostgreSQL, leveraging pgvector and pgml
- Open-source, with support for open models
- Designed for high performance and scalability
Korvus utilizes Postgres' advanced features to perform complex RAG operations natively within the database. We're also the developers of PostgresML, so we're big advocates of in-database machine learning. This approach eliminates the need for external services and API calls, potentially reducing latency by orders of magnitude compared to traditional microservice architectures. It's how our founding team built and scaled the ML platform at Instacart.
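For readers picturing the pattern, here's a rough sketch (not the Korvus SDK; the table, column, and model names are made up) of what embedding generation plus vector search inside Postgres looks like with the pgml and pgvector extensions, issued as one query from Python:

    import psycopg  # psycopg 3

    # Hypothetical schema: a "documents" table with a text "chunk" column and a
    # pgvector "embedding" column. pgml.embed() generates the query embedding
    # inside the database and pgvector's <=> operator does the similarity search,
    # so the application sends one query instead of calling separate services.
    RAG_RETRIEVAL = """
    WITH query_embedding AS (
        SELECT pgml.embed('intfloat/e5-small-v2', %(question)s)::vector AS e
    )
    SELECT chunk
    FROM documents, query_embedding
    ORDER BY documents.embedding <=> query_embedding.e
    LIMIT 5;
    """

    with psycopg.connect("postgresql://localhost:5432/postgres") as conn:
        rows = conn.execute(RAG_RETRIEVAL, {"question": "What is Korvus?"}).fetchall()
        context = "\n".join(chunk for (chunk,) in rows)

    # The same query could keep going and call pgml.transform() for generation;
    # this sketch stops at retrieval.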
We're eager to get feedback from the community and welcome contributions. Check out our GitHub repo for more details, and feel free to hit us up in our Discord!
Very cool! I assume you use Postgres' native full-text search capabilities? Any plans for BM25 or similar? This would make Korvus the end-game for open-source RAG IMO.
I’d start with something very simple such as Reciprocal Rank Fusion. I’d also want to make sure I really trusted the outputs of each search pipeline before worrying too much about the appropriate algorithm for combining the rankings.
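For reference, a minimal sketch of Reciprocal Rank Fusion: each result list contributes 1 / (k + rank) per document, and documents are re-sorted by the summed score (k=60 is the conventional constant):

    from collections import defaultdict

    def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Sum 1 / (k + rank) for every list a document appears in, then sort.
        scores: dict[str, float] = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # e.g. fuse a full-text (BM25-style) ranking with a vector-search ranking
    fused = reciprocal_rank_fusion([
        ["doc_a", "doc_b", "doc_c"],  # full-text results
        ["doc_b", "doc_d", "doc_a"],  # vector search results
    ])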
IMHO it would be much clearer if you just used the normal %s for the "outer" string and left the implicit f-string syntax as it is, e.g.
    {
        "role": "user",
        # this is not an f-string, is rather replaced by TODO FIXME
        "content": "Given the context\n:{CONTEXT}\nAnswer the question: %s" % query,
    },
The way the example (in both the readme and the docs) is written, it seems to imply I can put my own fields as siblings to the chat key and they, too, will be resolved.
Very cool. I see more languages planned in your comment. Are you looking for community help developing SDKs in other languages? After spending an entire Saturday running a RAG pipeline for a POC for a "fun" side project, I definitely would've loved to have been able to use this instead.
I spent too long reading Python docs because I haven't touched the language since 2019. Happy to help develop a Ruby SDK!
We would love help developing a Ruby SDK! We programmatically generate our Python, JavaScript, and C bindings from our Rust library. Check out the rust-bridge folder for more info on how we do that.
Does this work by running an LLM such as Llama directly on the database server? If so, does that mean that your database and the LLM are competing for the same CPU and memory resources?
I'm not sure if this is a good idea; just like pretending a network request is a function call, it hides a lot of elements that shouldn't be ignored. I still prefer to keep embedding, LLM generation, etc. explicit.
> I'm not sure if this is a good idea, just like pretending a network request is a function call
This was my first reaction, too.
Perhaps there's something about data locality that makes it good for certain use cases?
> I still prefer to keep embedding, LLM generation, etc. explicit.
The bit that I usually need to control is how the retrieved results are formatted in the prompt. In order to make the context as information-dense as possible, I might strip out certain words and/or symbols. But it depends on the query, so it can't be done at ingestion time.
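To make that concrete, here's a toy sketch of query-time compaction; the stopword list and formatting rules are just illustrative choices, not anything Korvus prescribes:

    import re

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}

    def compact(chunk: str, keyword_query: bool) -> str:
        # Strip symbols, and for short keyword-style queries drop filler words
        # so the context packs more information into the same token budget.
        words = re.sub(r"[^\w\s]", " ", chunk).split()
        if keyword_query:
            words = [w for w in words if w.lower() not in STOPWORDS]
        return " ".join(words)

    retrieved_chunks = [
        "Korvus runs the whole RAG pipeline (retrieval, reranking, generation) in one SQL query!",
        "pgvector adds vector similarity search to Postgres.",
    ]
    query = "korvus latency"
    context = "\n".join(compact(c, keyword_query=True) for c in retrieved_chunks)
    prompt = f"Given the context:\n{context}\nAnswer the question: {query}"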
This sounds very promising, but let me ask an honest question: to me, it seems like databases are the hardest part to scale in your average IT infrastructure. How much load does it add to the database if you let it do all the ML-related work as well? How much work is saved by reducing the number of necessary queries?
Contrary to some of the sibling responses, my experience with pgvector specifically (with hundreds of millions or billions of vectors) is that the workload is quite different from your typical web-app workload, enough so that you really want them on separate databases. For example, you have to be really careful about how vacuum/autovacuum interacts with pgvector’s HNSW indices if you’re frequently updating data; you have to be aware that the tables and indices are huge and take up a ton of memory, which can have knock-on performance implications for other systems; etc.
This is a read workload that can be easily horizontally scaled. The reduction in dev and infrastructure complexity is well worth the slight increase in DB provisioning.
You can use PL/Python to make API calls outside of the database, you just don't need a separate service to interact with the DB to orchestrate all your ML stuff, only endpoints.
I was expecting to see something like a foreign table that managed the upload, chunking, embedding, everything in a transparent manner. But what I found in the examples was some Python code that looks a lot like what the other frameworks are doing.
What am I missing? Honest question. I want to like this :)
But let's take splitting as an example. Does it happen in the Python part or the Postgres part? Is it a feature of the Python SDK or is it a feature of pgml? I couldn't understand this from the docs.
This looks exciting! Will definitely be testing it out in the coming days.
I see you offer re-ranking using local models; will there be built-in support for making re-ranking calls to external services such as Cohere in the future?
Great question! Making calls to external services is not something we plan to support. The point of Korvus is to write SQL queries that take advantage of the pgml and pgvector extensions. Making calls to external services is something that could be done by users after retrieval.
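For example, a rough sketch of post-retrieval reranking with Cohere's Python client (the document list and query are made up, and the exact method and parameter names may differ across SDK versions, so check Cohere's docs):

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # hypothetical key

    retrieved_chunks = [
        "Korvus runs the whole RAG pipeline in one SQL query.",
        "pgvector adds vector similarity search to Postgres.",
        "PostgresML runs ML models inside the database.",
    ]

    # Rerank the chunks returned by Korvus before building the final prompt.
    response = co.rerank(
        model="rerank-english-v3.0",
        query="How does Korvus reduce latency?",
        documents=retrieved_chunks,
        top_n=2,
    )
    reranked = [retrieved_chunks[r.index] for r in response.results]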
Retrieval-Augmented Generation uses text that is stored in a database to augment user prompts that are sent to a generative AI, like a large language model. The retrieval results are selected based on their similarity to the user input. The goal is to improve the output of the generative AI by providing more information in the input (user prompt + retrieval results). For example, we can provide the LLM details from an internal knowledge base so it can generate responses that are specific to an organization rather than based on general information. It may also reduce errors and improve the relevancy of the model output, depending on the context.
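A toy, self-contained illustration of that flow; the in-memory list stands in for the database, the word-overlap score stands in for embedding similarity, and a real system would send the resulting prompt to an LLM:

    KNOWLEDGE_BASE = [
        "Korvus runs the RAG pipeline inside Postgres.",
        "Support hours are 9am-5pm Eastern, Monday through Friday.",
        "Expense reports must be filed within 30 days of purchase.",
    ]

    def similarity(query: str, passage: str) -> float:
        # Crude stand-in for vector similarity: word overlap (Jaccard).
        q, p = set(query.lower().split()), set(passage.lower().split())
        return len(q & p) / len(q | p)

    def build_prompt(query: str, top_k: int = 2) -> str:
        retrieved = sorted(KNOWLEDGE_BASE, key=lambda p: similarity(query, p), reverse=True)[:top_k]
        return "Context:\n" + "\n".join(retrieved) + f"\n\nQuestion: {query}"

    print(build_prompt("When are support hours?"))  # this prompt would go to the LLM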