Hacker News
GraphRAG is now on GitHub (microsoft.com)
282 points by alexchaomander 7 months ago | 49 comments
The team at Microsoft is pleased to announce that GraphRAG is now open source!

Check out the announcement video: https://youtu.be/dsesHoTXyk0

GraphRAG is a research project from Microsoft exploring the use of knowledge graphs and large language models for enhanced retrieval augmented generation. It is an end-to-end system for richly understanding text-heavy datasets by combining text extraction, network analysis, LLM prompting, and summarization.

For more details on GraphRAG check out aka.ms/graphrag

Try out the Python code on your own machine: https://github.com/microsoft/graphrag

Easily deploy GraphRAG in Azure: https://github.com/Azure-Samples/graphrag-accelerator

Leave a comment below about what you want to build with GraphRAG!




I find it interesting that their entity extraction method for building a knowledge graph does not use or require one of the 'in-vogue' extraction libraries like Instructor, Marvin, or Guardrails (all of which build off of pydantic). They just tell the LLM to list graph nodes and edges, do some basic delimiter parsing, and load the result right into a networkx graph [1]. Is this because GPT-4 and the like have become very reliable at following specific formatting instructions, such as a particular JSON schema?

It looks like they just provide in the prompt a number of examples that follow the schema they want [2].

[1] https://github.com/microsoft/graphrag/blob/main/graphrag/ind...

[2] https://github.com/microsoft/graphrag/blob/main/graphrag/ind...
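The parsing step described above is simple enough to sketch. This is a rough stand-in with made-up delimiters and field order (not GraphRAG's actual format), just to show the "split on a delimiter, load into networkx" shape:

```python
import networkx as nx

# Hypothetical raw LLM output in a tuple-delimited format (the delimiters and
# field order here are illustrative, not GraphRAG's actual prompt output).
llm_output = """\
("entity"|MICROSOFT|ORGANIZATION|Technology company)
("entity"|GRAPHRAG|PROJECT|Graph-based RAG research project)
("relationship"|MICROSOFT|GRAPHRAG|Microsoft develops GraphRAG|9)"""

def parse_to_graph(text: str) -> nx.Graph:
    g = nx.Graph()
    for line in text.splitlines():
        fields = line.strip().strip("()").replace('"', "").split("|")
        if fields[0] == "entity":
            _, name, etype, desc = fields
            g.add_node(name, type=etype, description=desc)
        elif fields[0] == "relationship":
            _, src, tgt, desc, weight = fields
            g.add_edge(src, tgt, description=desc, weight=float(weight))
    return g

g = parse_to_graph(llm_output)
```

No pydantic, no structured-output library: a few string operations and the graph is loaded.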


LLMs are very good at knowledge extraction and following instructions, especially if you provide examples. What you see in the prompt you linked is an example of in-context learning, in particular a method called Few-Shot prompting [1]. You provide the model some specific examples of input and desired output, and it will follow the example as best it can. Which, with the latest frontier models, is pretty darn well.

[1] https://www.promptingguide.ai/techniques/fewshot
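The shape of a few-shot extraction prompt is easy to sketch; the example pair and output format below are invented for illustration, not GraphRAG's actual prompt:

```python
# A minimal few-shot extraction prompt: one worked example, then the real input.
EXAMPLES = [
    {
        "text": "Ada Lovelace worked with Charles Babbage on the Analytical Engine.",
        "output": '{"source": "ADA LOVELACE", "target": "CHARLES BABBAGE", '
                  '"relationship": "collaborated on the Analytical Engine"}',
    },
]

def build_prompt(text: str) -> str:
    parts = ["Extract entity relationships as JSON, one object per line."]
    for ex in EXAMPLES:
        parts.append(f"Text: {ex['text']}\nOutput: {ex['output']}")
    # End with the unanswered input so the model continues the pattern.
    parts.append(f"Text: {text}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt("Microsoft released GraphRAG on GitHub.")
```

The model sees the completed example and, with a strong enough model, mirrors its format for the new input.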


Using a library like Instructor adds a significant token overhead. While that overhead can be justified, if their use case performs fine without it, I don’t see any reason to add such a dependency.


That's what I was wondering: how the token overhead of function calling with, say, Instructor compares with a regular chat response that includes few-shot schema examples.

Maybe Instructor makes the most sense only when you're working with potentially malicious user data


Anecdotally, I found GPT-3.5 was decently reliable at returning structured data, though it took some prompt tuning. GPT-4 has never returned invalid JSON when asked for it, though it doesn't always return the same fields or field names. It performs better when the schema is explicit and included in the prompt, but that can add a non-trivial number of tokens to each request.


Regardless of how reliable LLMs are, they will never be perfect, so GraphRAG will error out when the RNG gods feel like ruining your day. Use pydantic.
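A minimal sketch of what that guard looks like, assuming pydantic v2 (the field names mirror the JSON format GraphRAG's prompt asks for, but this model and helper are my own illustration):

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class Relationship(BaseModel):
    source: str
    target: str
    relationship: str
    relationship_strength: float

def safe_parse(raw: str) -> Optional[Relationship]:
    """Validate one LLM-emitted JSON record instead of trusting it blindly."""
    try:
        return Relationship.model_validate_json(raw)
    except ValidationError:
        # Malformed JSON or missing/mistyped fields: reject, don't crash.
        return None

ok = safe_parse('{"source": "A", "target": "B", '
                '"relationship": "knows", "relationship_strength": 7}')
bad = safe_parse('{"source": "A"}')  # missing fields -> None
```

When the RNG gods strike, `safe_parse` returns None and you can retry the record rather than crash the whole indexing run.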


Have you any insights about whether these python libraries are using extensions for the heavy lifting? Current situation involves layers upon layers of parsing and serializing/deserializing in Python.


Sorry, what kind of extensions do you mean? I'm not aware of any. This stuff is just boilerplate.


Was looking forward to this. Haven’t looked at the code yet, but yeah - it’s a bit of a red flag that it doesn’t use an established parsing and validation library.


I mean, don't throw the baby out with the bathwater. It's research code; take the good, chuck the bad.


I am ecstatic that Microsoft open sourced this. After watching the demo video[1], my mind raced with all of the possibilities that GraphRAG unlocks. I'm planning to try GraphRAG + Llama3 on my MacBook, since it has 96GB of unified (V)RAM. I think this tool could be a legit game changer.

[1] https://www.youtube.com/watch?v=r09tJfON6kE


How does this work?

I understand the issues with "baseline RAG", as I experienced them myself. They make sense to me, if I query "what are the major themes of this dataset" on a list of articles, and the RAG tries to find content that's similar to that query, it probably won't find much.

Why does GraphRAG find something here? What's happening with my query that it suddenly "matches" the dataset?


from https://www.microsoft.com/en-us/research/blog/graphrag-new-t...

> It does this by detecting “communities” of densely connected nodes in a hierarchical fashion, partitioning the graph at multiple levels from high-level themes to low-level topics, as illustrated in Figure 1. Using an LLM to summarize each of these communities creates a hierarchical summary of the data, providing an overview of a dataset without needing to know which questions to ask in advance. Each community serves as the basis of a community summary that describes its entities and their relationships.
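The community-detection step in that quote can be sketched on a toy graph. GraphRAG uses the Leiden algorithm; Louvain (built into networkx) stands in for it here, and the graph itself is made up:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy graph: two dense triangles joined by a single bridge edge.
g = nx.Graph()
g.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # cluster 1
                  ("X", "Y"), ("Y", "Z"), ("X", "Z"),   # cluster 2
                  ("C", "X")])                          # bridge

# Louvain here stands in for the Leiden algorithm GraphRAG actually uses.
communities = louvain_communities(g, seed=42)
for i, members in enumerate(communities):
    # In GraphRAG, an LLM would summarize each community at this point.
    print(f"community {i}: {sorted(members)}")
```

The detected communities are the units that get LLM-summarized, which is what lets a "major themes" query match something without knowing the questions in advance.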


Here's a write-up of the differences with GraphRAG - TL;DR: it's expensive but can improve accuracy (possibly related to how the LLM defines the nodes), so weigh whether it's worth it for your app. https://medium.com/@jonathan.bennion/graphrag-analysis-part-...


For those like me who were looking for something more substantial on the GraphRAG method -> https://arxiv.org/pdf/2404.16130


Did you read it? That paper mentions "Comprehensiveness" and "Diversity" lift to a degree of "substantial improvements over naive RAG baseline" - could they be any more vague?


txtai has been working in the graph-vector space since 2022. Building semantic graphs with vector similarity for example. [1] [2] [3] [4]

Disclaimer: I'm the author of txtai

[1] https://neuml.hashnode.dev/introducing-the-semantic-graph

[2] https://neuml.hashnode.dev/generate-knowledge-with-semantic-...

[3] https://neuml.hashnode.dev/build-knowledge-graphs-with-llm-d...

[4] https://neuml.hashnode.dev/advanced-rag-with-graph-path-trav...


Do you plan to introduce the workflow of GraphRAG into txtai?


Perhaps some of the entity extraction workflows. But otherwise what's in GraphRAG that's not in txtai?


I've been waiting for this.

Knowledge graphs don't replace traditional semantic search, but they do unlock a whole new set of abilities when performing RAG, like both traversing down extremely long contexts and traversing across different contexts in a coherent, efficient way.

The only thing about KGs is that it's garbage-in-garbage-out and I've found my feeble attempts at using LLMs to generate graphs sorely lacking.. I can't wait to try this out.


Indeed, if you build the knowledge graph in a "naive" way by just asking the LLM to generate one, you'll probably end up with a very "dirty" knowledge graph full of duplicate entities.


Can this be handled by a second (adversarial?) LLM that searches for bad entries?


If I understand the paper right...

At indexing time:

- run LLM over every data point multiple times ("gleanings") for entity extraction and constructing a graph index

- run an LLM over the graph multiple times to create clusters ("communities")

At query time:

- Run the LLM across all clusters, creating an answer from each and score them

- Run the LLM across all but the lowest scoring answers to produce a "global answer"

...aren't the compute requirements here untenable for any decent sized dataset?
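The query-time steps above have a map-reduce shape, sketched here with a stub standing in for the LLM calls and a made-up scoring rule (the real system has the LLM score its own partial answers):

```python
# Sketch of the query-time map-reduce; llm() is a stub for a real model call.
def llm(prompt: str) -> str:
    return f"answer derived from: {prompt[:40]}"

def global_answer(question: str, summaries: list[str],
                  min_score: float = 0.5) -> str:
    # Map: answer the question against each community summary and score it.
    partials = []
    for summary in summaries:
        answer = llm(f"Q: {question}\nContext: {summary}")
        words = question.lower().split()
        score = 1.0 if any(w in summary.lower() for w in words) else 0.0
        partials.append((score, answer))
    # Reduce: combine the partial answers that scored above the threshold.
    kept = [a for s, a in partials if s >= min_score]
    return llm("Combine: " + " | ".join(kept))

result = global_answer("themes about energy",
                       ["summary of the energy policy cluster",
                        "summary of the sports cluster"])
```

Note the map step is one LLM call per community summary, which is where the compute concern in the parent comment comes from.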


It really depends on the job you're trying to accomplish. I'd venture to say it's way too early for horizontal / massive-scale RAG apps.

Most solutions will want to focus on a very specific vertical application where the dataset is much more constrained. That's where this makes more sense.

There's also a lot of alpha in data augmentation.


It depends on your latency requirements; not every RAG task has a user waiting for an immediate response. For my use case it doesn't matter if an answer takes even tens of minutes to generate.


Given the cost of running an LLM for 10 minutes, it does matter. That's about 15 dollars.

The overall answer had better be very good and relevant every time for this tech to make sense.


Oh that’s a good point, I have my own GPU rack and run locally because it’s cheaper to do so, so I hadn’t considered that…


The GraphRAG project is great and really shows why vector databases can't provide a full RAG solution when it comes to non-trivial search queries. But in order to build a full and accurate knowledge graph, we found out you need more than just feeding the text to an LLM.

For that we wrote the GraphRAG-SDK, which also generates a stable ontology. https://github.com/FalkorDB/GraphRAG-SDK


This is awesome! I've done a lot of little projects exploring the use of graphs with LLMs and it's great to see that this approach really pays off. Stupid me for trying to prematurely optimize when the solution is just prompt engineering and burning a bunch of tokens on multiple passes. Going to give this a try and see if my jaw drops. If it's as good as it looks then I'll have to put in the work to get it out of Python-land.


Anyone know how to use GraphRAG to build the knowledge graph on a large collection of private documents, where some might have complex structure (tables, links to other docs), and the content or terms in one document could be related to other documents as well?


This is exactly why we're working on the GraphRAG-SDK to ease the process. You might want to check out https://github.com/FalkorDB/GraphRAG-SDK/ and we would love to hear your feedback


LlamaIndex has something called the Knowledge Graph RAG Query engine. Is this related in any way?


I've been looking forward to playing with this since reading the paper. I was considering implementing it myself based on the paper but I figured the code would just be a few weeks behind and patience did indeed pay off :)


How different is this compared to Facebook's open-source tool Faiss[1]?

[1] https://github.com/facebookresearch/faiss/


Faiss is for similarity search over vectors via k-NN. GraphRAG is, well, a graph. More precisely, GraphRAG has more in common with old school knowledge graph techniques involving named entity extraction and the various forms of black magic used to identify relationships between entities. If you remember RDF and the semantic web it's sort of along those lines. One of the uses of Faiss is in a k-NN graph but the edges between nodes in that graph are (similarity) distance based.

Looking at an example prompt from GraphRAG will make things clear https://github.com/microsoft/graphrag/blob/main/graphrag/pro...

especially these lines:

    Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.

    Format each relationship as a JSON entry with the following format:

    {{"source": <source_entity>, "target": <target_entity>, "relationship": <relationship_description>, "relationship_strength": <relationship_strength>}}


Excuse me, how is it not?


I find the Russo-Ukrainian war an interesting choice of example topic. I could see it being an intentional choice as a means to target military data-analysis contracts.


So can someone explain how this is different/superior to Raptor RAG? I don't have the current concentration to figure it out for myself...


This is great - I have been interested in KG-enhanced RAG for some time and think there is a lot of potential in this space!


Knowledge graph or just a graph? My fear is that the term is being borrowed to help hype more AI products.


These are knowledge graphs; perhaps not RDF, but a knowledge graph can also be based on property graphs.


Can we use GraphRAG with Ollama and other open-source embedding models instead of OpenAI and Azure OpenAI?


Does anyone know how to run it against Ollama?


Replying to my own comment - it's trivially possible to replace `api_base` in settings.yaml. The only problem is that Ollama doesn't support an OpenAI-style embedding endpoint yet, but looking through PRs to Ollama, that's very close to being merged. All steps before the actual embedding work fine with e.g. Llama3-8B.
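For reference, the override looks roughly like this in settings.yaml. Treat the section/key names as my best recollection of GraphRAG's config layout, and the localhost URL as an assumption about a default Ollama install:

```yaml
llm:
  type: openai_chat
  model: llama3
  api_base: http://localhost:11434/v1    # Ollama's OpenAI-compatible endpoint

embeddings:
  llm:
    type: openai_embedding
    model: nomic-embed-text
    api_base: http://localhost:11434/v1  # fails until Ollama ships this endpoint
```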


You can use litellm with drop_params for embedding endpoint


I'd just try setting the env vars and see what happens; if it breaks, you'll have to read the code, though.

https://microsoft.github.io/graphrag/posts/config/overview/


You will have to do a fair amount of refactoring.


does it support multi-tenancy?


[deleted]


I’m the creator of https://atomictessellator.com

While building the backend of this, I have focused on building a composable set of APIs suitable for machine consumption, i.e. to act as agentic tools.

I was looking for a good RAG framework to process the large number of PDFs I have crawled, so the agents can then design and run simulations. This comes at just the right time! I'm looking forward to trying it out.



