Hacker News
GraphRAG is now on GitHub (microsoft.com)
282 points by alexchaomander 7 months ago | 49 comments
The team at Microsoft is pleased to announce that GraphRAG is now open source!

Check out the announcement video: https://youtu.be/dsesHoTXyk0

GraphRAG is a research project from Microsoft exploring the use of knowledge graphs and large language models for enhanced retrieval augmented generation. It is an end-to-end system for richly understanding text-heavy datasets by combining text extraction, network analysis, LLM prompting, and summarization.

For more details on GraphRAG check out aka.ms/graphrag

Try out the Python code on your own machine: https://github.com/microsoft/graphrag

Easily deploy GraphRAG in Azure: https://github.com/Azure-Samples/graphrag-accelerator

Leave a comment below about what you want to build with GraphRAG!




I find it interesting that their entity extraction method for building a knowledge graph does not use or require one of the 'in-vogue' extraction libraries like Instructor, Marvin, or Guardrails (all of which build off of pydantic). They just tell the LLM to list graph nodes and edges, do some basic delimiter parsing, and load the result right into a networkx graph [1]. Is this because GPT-4 and the like have become very reliable at following specific formatting instructions, such as a particular JSON schema?

It looks like they just provide in the prompt a number of examples that follow the schema they want [2].

[1] https://github.com/microsoft/graphrag/blob/main/graphrag/ind...

[2] https://github.com/microsoft/graphrag/blob/main/graphrag/ind...
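The parsing step described above is simple enough to sketch. This is a rough stand-in with made-up delimiters and field order (not GraphRAG's actual format), just to show the "split on a delimiter, load into networkx" shape:

```python
import networkx as nx

# Hypothetical raw LLM output in a tuple-delimited format (the delimiters and
# field order here are illustrative, not GraphRAG's actual prompt output).
llm_output = """\
("entity"|MICROSOFT|ORGANIZATION|Technology company)
("entity"|GRAPHRAG|PROJECT|Graph-based RAG research project)
("relationship"|MICROSOFT|GRAPHRAG|Microsoft develops GraphRAG|9)"""

def parse_to_graph(text: str) -> nx.Graph:
    g = nx.Graph()
    for line in text.splitlines():
        fields = line.strip().strip("()").replace('"', "").split("|")
        if fields[0] == "entity":
            _, name, etype, desc = fields
            g.add_node(name, type=etype, description=desc)
        elif fields[0] == "relationship":
            _, src, tgt, desc, weight = fields
            g.add_edge(src, tgt, description=desc, weight=float(weight))
    return g

g = parse_to_graph(llm_output)
```

No pydantic, no structured-output library: a few string operations and the graph is loaded.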


LLMs are very good at knowledge extraction and following instructions, especially if you provide examples. What you see in the prompt you linked is an example of in-context learning, in particular a method called Few-Shot prompting [1]. You provide the model some specific examples of input and desired output, and it will follow the example as best it can. Which, with the latest frontier models, is pretty darn well.

[1] https://www.promptingguide.ai/techniques/fewshot
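The shape of a few-shot extraction prompt is easy to sketch; the example pair and output format below are invented for illustration, not GraphRAG's actual prompt:

```python
# A minimal few-shot extraction prompt: one worked example, then the real input.
EXAMPLES = [
    {
        "text": "Ada Lovelace worked with Charles Babbage on the Analytical Engine.",
        "output": '{"source": "ADA LOVELACE", "target": "CHARLES BABBAGE", '
                  '"relationship": "collaborated on the Analytical Engine"}',
    },
]

def build_prompt(text: str) -> str:
    parts = ["Extract entity relationships as JSON, one object per line."]
    for ex in EXAMPLES:
        parts.append(f"Text: {ex['text']}\nOutput: {ex['output']}")
    # End with the unanswered input so the model continues the pattern.
    parts.append(f"Text: {text}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt("Microsoft released GraphRAG on GitHub.")
```

The model sees the completed example and, with a strong enough model, mirrors its format for the new input.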


Using a library like Instructor adds a significant token overhead. While that overhead can be justified, if their use case performs fine without it, I don’t see any reason to add such a dependency.


That's what I was wondering: how the token overhead of function calling with, say, Instructor compares with a regular chat response that includes few-shot schema examples.

Maybe Instructor makes the most sense only when you're working with potentially malicious user data


Anecdotally, I found GPT-3.5 was decently reliable at returning structured data, though it took some prompt tuning. GPT-4 has never returned invalid JSON when asked for it, though it doesn't always return the same fields or field names. It performs better when the schema is explicit and included in the prompt, but that can add a non-trivial number of tokens to each request.


Regardless of how reliable LLMs are, they will never be perfect, so GraphRAG will error out when the RNG gods feel like ruining your day. Use pydantic.
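A minimal sketch of what that guard looks like, assuming pydantic v2 (the field names mirror the JSON format GraphRAG's prompt asks for, but this model and helper are my own illustration):

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class Relationship(BaseModel):
    source: str
    target: str
    relationship: str
    relationship_strength: float

def safe_parse(raw: str) -> Optional[Relationship]:
    """Validate one LLM-emitted JSON record instead of trusting it blindly."""
    try:
        return Relationship.model_validate_json(raw)
    except ValidationError:
        # Malformed JSON or missing/mistyped fields: reject, don't crash.
        return None

ok = safe_parse('{"source": "A", "target": "B", '
                '"relationship": "knows", "relationship_strength": 7}')
bad = safe_parse('{"source": "A"}')  # missing fields -> None
```

When the RNG gods strike, `safe_parse` returns None and you can retry the record rather than crash the whole indexing run.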


Have you any insights about whether these python libraries are using extensions for the heavy lifting? Current situation involves layers upon layers of parsing and serializing/deserializing in Python.


Sorry, what kind of extensions do you mean? I'm not aware of any. This stuff is just boilerplate.


Was looking forward to this. Haven’t looked at the code yet, but yeah - it’s a bit of a red flag that it doesn’t use an established parsing and validation library.


I mean, don't throw the baby out with the bathwater. It's research code; take the good, chuck the bad.


I am ecstatic that Microsoft open sourced this. After watching the demo video[1], my mind raced with all of the possibilities that GraphRAG unlocks. I'm planning to try GraphRAG + Llama3 on my MacBook, since it has 96GB of unified (V)RAM. I think this tool could be a legit game changer.

[1] https://www.youtube.com/watch?v=r09tJfON6kE


How does this work?

I understand the issues with "baseline RAG", as I experienced them myself. They make sense to me, if I query "what are the major themes of this dataset" on a list of articles, and the RAG tries to find content that's similar to that query, it probably won't find much.

Why does GraphRAG find something here? What's happening with my query that it suddenly "matches" the dataset?


from https://www.microsoft.com/en-us/research/blog/graphrag-new-t...

> It does this by detecting “communities” of densely connected nodes in a hierarchical fashion, partitioning the graph at multiple levels from high-level themes to low-level topics, as illustrated in Figure 1. Using an LLM to summarize each of these communities creates a hierarchical summary of the data, providing an overview of a dataset without needing to know which questions to ask in advance. Each community serves as the basis of a community summary that describes its entities and their relationships.
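The community-detection step in that quote can be sketched on a toy graph. GraphRAG uses the Leiden algorithm; Louvain (built into networkx) stands in for it here, and the graph itself is made up:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy graph: two dense triangles joined by a single bridge edge.
g = nx.Graph()
g.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # cluster 1
                  ("X", "Y"), ("Y", "Z"), ("X", "Z"),   # cluster 2
                  ("C", "X")])                          # bridge

# Louvain here stands in for the Leiden algorithm GraphRAG actually uses.
communities = louvain_communities(g, seed=42)
for i, members in enumerate(communities):
    # In GraphRAG, an LLM would summarize each community at this point.
    print(f"community {i}: {sorted(members)}")
```

The detected communities are the units that get LLM-summarized, which is what lets a "major themes" query match something without knowing the questions in advance.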


Here's a write-up of the differences with GraphRAG - TL;DR: it's expensive but can improve accuracy (possibly related to how the LLM defines the nodes), so weigh whether it's worth it for your app. https://medium.com/@jonathan.bennion/graphrag-analysis-part-...


For those like me who were looking for something more substantial on the GraphRAG method -> https://arxiv.org/pdf/2404.16130


Did you read it? That paper mentions "Comprehensiveness" and "Diversity" lift to a degree of "substantial improvements over naive RAG baseline" - could they be any more vague?


txtai has been working in the graph-vector space since 2022. Building semantic graphs with vector similarity for example. [1] [2] [3] [4]

Disclaimer: I'm the author of txtai

[1] https://neuml.hashnode.dev/introducing-the-semantic-graph

[2] https://neuml.hashnode.dev/generate-knowledge-with-semantic-...

[3] https://neuml.hashnode.dev/build-knowledge-graphs-with-llm-d...

[4] https://neuml.hashnode.dev/advanced-rag-with-graph-path-trav...


Do you plan to introduce the workflow of GraphRAG into txtai?


Perhaps some of the entity extraction workflows. But otherwise what's in GraphRAG that's not in txtai?


I've been waiting for this.

Knowledge graphs don't replace traditional semantic search, but they do unlock a whole new set of abilities when performing RAG, like both traversing down extremely long contexts and traversing across different contexts in a coherent, efficient way.

The only thing about KGs is that it's garbage-in-garbage-out and I've found my feeble attempts at using LLMs to generate graphs sorely lacking.. I can't wait to try this out.


Indeed, if you build the knowledge graph in a "naive" way by just asking the LLM to generate one, you'll probably end up with a very "dirty" knowledge graph full of duplicate entities.


Can this be handled by a second (adversarial?) LLM that searches for bad entries?


If I understand the paper right...

At indexing time:

- run LLM over every data point multiple times ("gleanings") for entity extraction and constructing a graph index

- run an LLM over the graph multiple times to create clusters ("communities")

At query time:

- Run the LLM across all clusters, creating an answer from each and score them

- Run the LLM across all but the lowest scoring answers to produce a "global answer"

...aren't the compute requirements here untenable for any decent sized dataset?
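The query-time steps above have a map-reduce shape, sketched here with a stub standing in for the LLM calls and a made-up scoring rule (the real system has the LLM score its own partial answers):

```python
# Sketch of the query-time map-reduce; llm() is a stub for a real model call.
def llm(prompt: str) -> str:
    return f"answer derived from: {prompt[:40]}"

def global_answer(question: str, summaries: list[str],
                  min_score: float = 0.5) -> str:
    # Map: answer the question against each community summary and score it.
    partials = []
    for summary in summaries:
        answer = llm(f"Q: {question}\nContext: {summary}")
        words = question.lower().split()
        score = 1.0 if any(w in summary.lower() for w in words) else 0.0
        partials.append((score, answer))
    # Reduce: combine the partial answers that scored above the threshold.
    kept = [a for s, a in partials if s >= min_score]
    return llm("Combine: " + " | ".join(kept))

result = global_answer("themes about energy",
                       ["summary of the energy policy cluster",
                        "summary of the sports cluster"])
```

Note the map step is one LLM call per community summary, which is where the compute concern in the parent comment comes from.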


It really depends on the job you're trying to accomplish. I'd venture to say it's way too early for horizontal / massive-scale RAG apps.

Most solutions will want to focus on a very specific vertical application where the dataset is much more constrained. That's where this makes more sense.

There's also a lot of alpha in data augmentation.


It depends on your latency requirements; not every RAG task has a user waiting for an immediate response. For my use case it doesn't matter if an answer takes even tens of minutes to generate.


Given the cost of running an LLM for 10 minutes, it does matter. That's about 15 dollars.

The overall answer had better be very good and relevant every time for this tech to make sense.


Oh that’s a good point, I have my own GPU rack and run locally because it’s cheaper to do so, so I hadn’t considered that…


The GraphRAG project is great and really shows why vector databases can't provide a full RAG solution when it comes to non-trivial search queries. But in order to build a full and accurate knowledge graph, we found out you need more than just feeding the text to an LLM.

For that we wrote the GraphRAG-SDK, which also generates a stable ontology. https://github.com/FalkorDB/GraphRAG-SDK


This is awesome! I've done a lot of little projects exploring the use of graphs with LLMs and it's great to see that this approach really pays off. Stupid me for trying to prematurely optimize when the solution is just prompt engineering and burning a bunch of tokens on multiple passes. Going to give this a try and see if my jaw drops. If it's as good as it looks then I'll have to put in the work to get it out of Python-land.


Anyone know how to use GraphRAG to build the knowledge graph on a large collection of private documents, where some might have complex structure (tables, links to other docs), and the content or terms in one document could be related to other documents as well?


This is exactly why we're working on the GraphRAG-SDK to ease the process. You might want to check out https://github.com/FalkorDB/GraphRAG-SDK/ and we would love to hear your feedback


LlamaIndex has something called the Knowledge Graph RAG Query engine. Is this related in any way?


I've been looking forward to playing with this since reading the paper. I was considering implementing it myself based on the paper but I figured the code would just be a few weeks behind and patience did indeed pay off :)


How different is this compared to Facebook's open-source tool Faiss[1]?

[1] https://github.com/facebookresearch/faiss/


Faiss is for similarity search over vectors via k-NN. GraphRAG is, well, a graph. More precisely, GraphRAG has more in common with old school knowledge graph techniques involving named entity extraction and the various forms of black magic used to identify relationships between entities. If you remember RDF and the semantic web it's sort of along those lines. One of the uses of Faiss is in a k-NN graph but the edges between nodes in that graph are (similarity) distance based.

Looking at an example prompt from GraphRAG will make things clear https://github.com/microsoft/graphrag/blob/main/graphrag/pro...

especially these lines:

    Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.

    Format each relationship as a JSON entry with the following format:

    {{"source": <source_entity>, "target": <target_entity>, "relationship": <relationship_description>, "relationship_strength": <relationship_strength>}}


Excuse me, how is it not?


I find the Russo-Ukrainian war an interesting choice of example topic. I could see it being an intentional choice as a means to target military data-analysis contracts.


So can someone explain how this is different/superior to Raptor RAG? I don't have the current concentration to figure it out for myself...


This is great - I have been interested in KG-enhanced RAG for some time and think there is a lot of potential in this space!


Knowledge graph or just a graph? My fear is that the term is being borrowed to help hype more AI products.


These are knowledge graphs; perhaps not RDF, but a knowledge graph can also be based on property graphs.


Can we use GraphRAG with Ollama and other open-source embedding models instead of OpenAI and Azure OpenAI?


Does anyone know how to run it against Ollama?


Replying to my own comment - it's trivially possible to replace `api_base` in settings.yaml. The only problem is that Ollama doesn't support an OpenAI-style embedding endpoint yet, but looking through PRs to Ollama, that's very close to being merged. All steps before the actual embedding work fine with e.g. Llama3-8B.
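For reference, the override looks roughly like this in settings.yaml. Treat the section/key names as my best recollection of GraphRAG's config layout, and the localhost URL as an assumption about a default Ollama install:

```yaml
llm:
  type: openai_chat
  model: llama3
  api_base: http://localhost:11434/v1    # Ollama's OpenAI-compatible endpoint

embeddings:
  llm:
    type: openai_embedding
    model: nomic-embed-text
    api_base: http://localhost:11434/v1  # fails until Ollama ships this endpoint
```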


You can use litellm with drop_params for embedding endpoint


I'd just try setting the env vars and see what happens; if it breaks, you'll have to read the code, though.

https://microsoft.github.io/graphrag/posts/config/overview/


You will have to do a fair amount of refactoring.


does it support multi-tenancy?


[deleted]


I’m the creator of https://atomictessellator.com

While building the backend of this, I have focused on building a composable set of APIs suitable for machine consumption, i.e. to act as agentic tools.

I was looking for a good RAG framework to process the large number of PDFs I have crawled, so the agents can then design and run simulations. This comes at just the right time! I'm looking forward to trying it out.



