Graph RAG with Kuzu #5346

sajozsattila · 2025-05-06T07:37:15Z

Hello,

I am building a Graph RAG (for hybrid search) with Kuzu. Currently, I see two options for implementation:

LlamaIndex – for example: https://www.youtube.com/watch?v=vctV3p7ex0o
Direct LLM Prompting – for example: https://github.com/kuzudb/graph-rag-workshop

The second option seems simpler, in my opinion, and has also yielded better results in the first couple of tests compared to LlamaIndex. This has made me wonder: what are the benefits of using LlamaIndex, as it appears to be the more popular approach? What is your opinion on this?

Regards,

Answered by prrao87

prrao87 · 2025-05-06T18:40:15Z

Hi @sajozsattila, there is a third option, which imo is the best one:
https://github.com/Connected-Data/cdkg-challenge/tree/main/src/kuzu

It's based on BAML. It's an improvement on the second option (that used ell), and the reason I think it's better is that you have much more fine-grained control over the prompting layer using BAML. You also have the ability to write tests and do a lot more rigorous prompt engineering than the earlier two approaches. LlamaIndex is the most rigid and inflexible of these approaches, as the prompt logic is lost in 5 layers of abstraction and you have to conform to their abstraction philosophy to use it. BAML is low level, and very, very powerful in this regard, and offers full transparency and flexibility to the developer. Highly recommend it.

I plan on adding vector search + graph traversal examples in due course, but if you look at our latest release of Kuzu (0.9.0), you have a vector index natively in Kuzu, which should allow you to add embeddings as properties on the nodes and use the results from the vector search for downstream graph traversal too. I will post more examples on this soon, but if you get stuck, please join us on Discord and ask us your questions there. Cheers!

sajozsattila · 2025-05-08T04:18:26Z

One thing regarding this: schema discovery is very important, and I think that for the best RAG solution, in an ideal case, we could add descriptions to the objects and properties. This way, the schema would be easier for the LLM to understand. For example, I would like to format the prompt something like this:

### **Schema Details:**
#### **Nodes:**
1. **Company**
   Represents companies, corporations, or organizations (e.g., Apple Inc., Tesla, Goldman Sachs).  
   - **Label:** `company`  
   - **Properties:**  
     - `name` (STRING): Name of the company.

2. **Person**
   Represents individuals, such as executives, investors, or notable figures (e.g., Elon Musk, Warren Buffett).  
   - **Label:** `person`  
   - **Properties:**  
     - `name` (STRING): Name of the person.
 
#### **Relationships:**
1. **Management Connection**  
   Links a `company` with a `person`.  
   - **Type:** `management`  
   - **Pattern:** `(:company)-[:management]->(:person)`

For this, I need to record these descriptions somewhere. Do you have any ideas on how to do this within Kuzu?

prrao87 May 11, 2025

Hi @sajozsattila, apologies for the delayed response, I was travelling and was at a conference. Does the below code help? I commonly use this to help the LLM discover the given graph's schema at runtime, prior to generating the Cypher. It's based on the code here.

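import kuzu


def get_schema_dict(conn: kuzu.Connection) -> dict[str, list[dict]]:
    """Return the graph schema (node and relationship tables) as a dict for the LLM."""
    # These underscore-prefixed helpers are private methods of kuzu.Connection,
    # so they may change between releases.
    nodes = conn._get_node_table_names()
    relationships = conn._get_rel_table_names()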

    schema = {"nodes": [], "edges": []}

    for node in nodes:
        node_schema = {"label": node, "properties": []}
        node_properties = conn.execute(f"CALL TABLE_INFO('{node}') RETURN *;")
        while node_properties.has_next():
            row = node_properties.get_next()
            node_schema["properties"].append({"name": row[1], "type": row[2]})
        schema["nodes"].append(node_schema)

    for rel in relationships:
        edge = {
            "label": rel["name"],
            "src": rel["src"],
            "dst": rel["dst"],
            "properties": [],
        }
        rel_properties = conn.execute(f"""CALL TABLE_INFO('{rel["name"]}') RETURN *;""")
        while rel_properties.has_next():
            row = rel_properties.get_next()
            edge["properties"].append({"name": row[1], "type": row[2]})
        schema["edges"].append(edge)

    return schema

The general idea is as follows:

1. Obtain the graph schema of the given Kuzu database in JSON format (it contains information about the nodes, the edges, their property names and their associated types).
2. Optional: Reformat the schema into a YAML- or XML-style string (LLMs tend to like these more than they do JSON). I plan on doing more experiments to test whether this is quantifiable. In the file I linked above, I use YAML formatting for the schema.
3. Pass the schema to the text2Cypher prompt.
4. Obtain the Cypher query as the result and run the query on the Kuzu database.
5. Pass the returned response as context to the LLM.
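
To make the flow above concrete, here is a minimal sketch of one round trip. The function name `answer_question` and all four callables are hypothetical placeholders (they are not part of Kuzu or BAML); you would plug in your own schema formatter, text2Cypher prompt, query runner and answer prompt.

```python
from typing import Any, Callable


def answer_question(
    question: str,
    get_schema: Callable[[], str],
    text2cypher: Callable[[str, str], str],
    run_query: Callable[[str], list[Any]],
    generate_answer: Callable[[str, str], str],
) -> str:
    """Run one Graph RAG round trip: schema -> Cypher -> results -> answer."""
    # Fetch the schema at query time, already formatted as prompt text.
    schema = get_schema()
    # Ask the LLM to translate the question into Cypher, given the schema.
    cypher = text2cypher(question, schema)
    # Execute the generated Cypher against the database.
    rows = run_query(cypher)
    # Flatten the returned rows into a plain-text context block.
    context = "\n".join(str(row) for row in rows)
    # Ask the LLM to answer the question from the retrieved context.
    return generate_answer(question, context)
```

Keeping each stage as a separate callable makes it easy to test the prompt layer in isolation, for example by passing in stub functions instead of real LLM calls.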

The formatted prompt from BAML I use tends to look like this:

    ALWAYS RESPECT THE EDGE DIRECTIONS:
    ---
    (:DrugGeneric) -[:CAN_CAUSE]-> (:Symptom)
    (:DrugGeneric) -[:HAS_BRAND]-> (:DrugBrand)
    (:Condition) -[:IS_TREATED_BY]-> (:DrugGeneric)
    ---

    Node properties:
    - DrugGeneric
        - name: string
    - DrugBrand
        - name: string
    - Condition
        - name: string
    - Symptom
        - name: string

    Edge properties:

Note that clearly specifying the directions of the edges is the most important thing (without this, the LLM will hallucinate the direction and the query will fail). I find that using the ASCII-art style syntax of Cypher to specify the direction of the edges works well across a variety of LLMs (they're really good at understanding these kinds of patterns).
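
As a rough illustration, a schema dict shaped like the output of `get_schema_dict` could be rendered into this prompt layout as follows. The function name `format_schema_for_prompt` is hypothetical, and lowercasing the type names is an assumption made only to match the sample prompt; this is a sketch, not the code from the linked repository.

```python
def format_schema_for_prompt(schema: dict[str, list[dict]]) -> str:
    """Render a schema dict as edge-direction patterns plus property lists."""
    lines = ["ALWAYS RESPECT THE EDGE DIRECTIONS:", "---"]
    # One Cypher-style pattern per relationship table, so the direction is explicit.
    for edge in schema["edges"]:
        lines.append(f"(:{edge['src']}) -[:{edge['label']}]-> (:{edge['dst']})")
    lines += ["---", "", "Node properties:"]
    for node in schema["nodes"]:
        lines.append(f"- {node['label']}")
        for prop in node["properties"]:
            lines.append(f"    - {prop['name']}: {prop['type'].lower()}")
    lines += ["", "Edge properties:"]
    # Only list relationship tables that actually carry properties.
    for edge in schema["edges"]:
        if edge["properties"]:
            lines.append(f"- {edge['label']}")
            for prop in edge["properties"]:
                lines.append(f"    - {prop['name']}: {prop['type'].lower()}")
    return "\n".join(lines)
```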

There are of course many more ways to do this, but I think this should be a good starting point for you to begin experimenting!

sajozsattila May 12, 2025
Author

Thank you for your answer. Yes, I have seen `get_schema_dict`, but my point was a little different: I would like to record a description for the nodes and their properties somewhere in the schema, so that we can give more details to the LLM. For example, what does "DrugGeneric" or "DrugBrand" mean? Obviously, I can use an extra JSON file to keep this information, but I think it would be more elegant if the longer descriptions of the graph nodes and properties could be collected from the graph DB itself.
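
For reference, the extra-JSON fallback could be as small as the sketch below. It assumes a schema dict with `nodes` and `edges` lists keyed by `label` (as returned by `get_schema_dict`); the function name `add_descriptions` and the sample description text are made up for illustration.

```python
import copy
import json


def add_descriptions(schema: dict, descriptions: dict[str, str]) -> dict:
    """Return a copy of the schema with a 'description' on each known table."""
    annotated = copy.deepcopy(schema)  # leave the caller's schema untouched
    for kind in ("nodes", "edges"):
        for table in annotated.get(kind, []):
            text = descriptions.get(table["label"])
            if text:
                table["description"] = text
    return annotated


# Descriptions kept outside the database, e.g. in a JSON file (inlined here).
descriptions = json.loads('{"DrugGeneric": "Generic (non-brand) name of a drug."}')
schema = {"nodes": [{"label": "DrugGeneric", "properties": []}], "edges": []}
annotated = add_descriptions(schema, descriptions)
```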

prrao87 May 12, 2025

Ah yes, we have 3 functions specifically for this, listed here:

CALL SHOW_TABLES() RETURN *: lists high-level info for all node/rel tables
CALL SHOW_CONNECTION('tableName') RETURN *: lists the source/target node tables of a given rel table
CALL TABLE_INFO('tableName') RETURN *: lists the properties of a specific table

I typically use a combination of these three functions to construct the objects I need in Python for use downstream. Because Kuzu is a columnar system (tables are a central construct of everything, unlike in Neo4j), providing this nested information in one single table didn't seem very nice to our core team, so we chose to implement separate functions for them. You can take a look at these functions and reach us on Discord with any more questions on this, thanks!
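
A possible way to pull table-level descriptions out of the database with these functions is sketched below. It rests on two assumptions that should be checked against the Kuzu docs for your version: that descriptions were attached beforehand with a `COMMENT ON TABLE ... IS '...'` statement, and that `SHOW_TABLES` reports `name`, `type` and `comment` columns. The helper names are hypothetical, and property-level descriptions are not covered.

```python
from typing import Any


def rows_as_dicts(result: Any) -> list[dict[str, Any]]:
    """Convert a query result into a list of {column name: value} dicts."""
    # Keying by column name avoids relying on the column order of the result.
    columns = result.get_column_names()
    rows = []
    while result.has_next():
        rows.append(dict(zip(columns, result.get_next())))
    return rows


def table_overview(conn: Any) -> list[dict[str, Any]]:
    """List every table with its type and its comment (empty if none is reported)."""
    tables = rows_as_dicts(conn.execute("CALL SHOW_TABLES() RETURN *;"))
    return [
        {"name": t["name"], "type": t["type"], "comment": t.get("comment", "")}
        for t in tables
    ]
```

The resulting list can then be merged into the schema dict before it is formatted into the prompt.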

what does "DrugGeneric" or "DrugBrand" mean?

These are just node table names in that database. In your case, there could be totally different node tables and other relationships that connect these node tables together. As I mentioned, the schema can be returned at runtime, so you can trust that the LLM sees the latest database schema every time a new query is run.

sajozsattila May 22, 2025
Author

Thanks, this is very helpful.
