Problem Motivation:
- Enterprise graph databases built on Neo4j employ large schemas with complex node relationships and defined attributes, often comprising millions of nodes and edges.
- Querying such a graph database requires manually writing complex Cypher statements, which is cumbersome and often infeasible at scale.
- An important feature is to enable querying the graph database with natural language input from the user.
- Existing solutions employ basic rules, created manually for each new schema, that match simple 1-1 node relationships based on keywords identified in the user query.
- LLM-based solutions fail to generalize to a new schema for queries beyond a single match, and are plagued by hallucination over node attributes, values and relationships as well as by incorrect syntax.
- Further, LLM-based solutions are often limited by the model's context length, which large enterprise schemas can exceed.
- There is no course correction to ensure the generated Cypher statements are correct, and no user input/feedback is considered in the generation process.
The main challenges with existing approaches are multi-fold: (i) lack of support for large enterprise graph databases with complex schemas; (ii) lack of user input in the Cypher generation process; (iii) no mechanism to identify issues in the generated Cypher statements, which are prone to hallucination and syntax errors; and (iv) limited LLM context length that cannot accommodate large enterprise schemas.
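To make the limitation of rule-based baselines concrete, the sketch below shows the kind of manually created keyword rule mentioned above: each rule maps a phrase to a single fixed 1-1 relationship pattern. The schema names (Employee, Department, WORKS_IN) and rule phrases are hypothetical, purely for illustration.

```python
from typing import Optional

# Hand-written keyword rules: each maps one phrase to one fixed
# 1-1 relationship pattern. (Hypothetical schema for illustration.)
RULES = {
    "works in": "MATCH (e:Employee)-[:WORKS_IN]->(d:Department) RETURN e, d",
    "manages": "MATCH (m:Employee)-[:MANAGES]->(d:Department) RETURN m, d",
}

def keyword_to_cypher(query: str) -> Optional[str]:
    """Return the Cypher of the first rule whose keyword appears in the query."""
    q = query.lower()
    for keyword, cypher in RULES.items():
        if keyword in q:
            return cypher
    return None  # multi-hop questions and unseen phrasings fall through

print(keyword_to_cypher("Who works in the Berlin office?"))
# A multi-step question, or even a morphological variant ("managed"),
# matches no rule:
print(keyword_to_cypher("Which projects span departments managed by Alice?"))
```

Even this trivial rephrasing defeats the rule set, which is why such rules must be rebuilt by hand for every new schema and query pattern.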
Challenges in Current systems:
- No support for large enterprise schemas: Current systems cannot handle large enterprise graph databases with complex schemas.
- No recommendations to the user on the input query: No way of mapping the input natural language (NL) query to the enterprise schema with an appropriate set of node and attribute values that preserves schema relationships.
- Unable to account for user preferences: No way of accounting for user feedback or preferences based on the input query.
- No mechanism for identifying issues in the generation process: No way to course-correct hallucinations, incorrect syntax, invalid relationships, etc. in the generated Cypher.
- No support for unseen schemas: No way to generalize to an unseen schema beyond the LLM's black-box zero-shot capability.
- No support for follow-up queries: No way to handle follow-up user queries based on the initial input query.
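One of the gaps above, the missing issue-identification step, can be sketched as a schema check: extract the node labels and relationship types referenced by a generated Cypher statement and flag any that do not exist in the schema. The schema contents and the regular expressions are illustrative assumptions, not a full Cypher parser.

```python
import re

# Hypothetical schema registry for illustration.
SCHEMA_LABELS = {"Employee", "Department", "Project"}
SCHEMA_REL_TYPES = {"WORKS_IN", "MANAGES", "ASSIGNED_TO"}

def find_schema_violations(cypher: str) -> list:
    """Return labels and relationship types in the query that the schema lacks."""
    labels = re.findall(r"\(\s*\w*\s*:\s*(\w+)", cypher)   # (n:Label)
    rels = re.findall(r"\[\s*\w*\s*:\s*(\w+)", cypher)     # [r:REL_TYPE]
    bad = [l for l in labels if l not in SCHEMA_LABELS]
    bad += [r for r in rels if r not in SCHEMA_REL_TYPES]
    return bad

# A hallucinated label ("Office") and relationship ("LOCATED_AT") are caught:
generated = "MATCH (e:Employee)-[:LOCATED_AT]->(o:Office) RETURN e"
print(find_schema_violations(generated))
```

In a full system this check would feed back into regeneration rather than merely reporting violations; the point here is only that a deterministic schema pass can catch hallucinated names that the LLM itself cannot.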
Our Proposition:
This project focuses on several important aspects that are missing from the literature and that together enable natural language querying of enterprise graph databases:
- Large Enterprise Schemas: Identify the subgraph of the large enterprise graph that covers the entire intent of the user query.
- Recommendations to User: Identify user intent and provide translations of the user query mapped to the subgraph with appropriate node, attribute and relationship names.
- Recommendations to User: Provide appropriate values for the identified nodes, attributes and relationships so that user preferences can be taken into account.
- Dynamic Templates: Dynamically generate (NL, Cypher) statement pairs from a repository of template patterns covering a variety of query types such as retrieval, path-based, multi-step and evaluation queries.
- Self-Healing: An agentic workflow to course-correct the generated Cypher based on user intent, the enterprise schema, syntax, etc.
- Conversational Search: An agentic workflow to enable conversational, multi-turn querying.
- Creating a Benchmark Dataset: Create a benchmark dataset of (NL, Cypher) pairs for training/fine-tuning LLMs.
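The dynamic-template idea above can be sketched as follows: a pattern repository holds paired NL/Cypher templates with slots that are filled from the schema, and the cross-product of slot values yields (NL, Cypher) pairs for the benchmark. The template wording and slot values are hypothetical.

```python
from itertools import product

# One illustrative template from a hypothetical pattern repository.
TEMPLATES = [
    {
        "pattern": "retrieval",
        "nl": "List all {label}s whose {attr} is {value}.",
        "cypher": "MATCH (n:{label}) WHERE n.{attr} = '{value}' RETURN n",
    },
]

# Slot values would be harvested from the enterprise schema; these are made up.
SCHEMA_SLOTS = {
    "label": ["Employee", "Project"],
    "attr": ["name"],
    "value": ["Alice"],
}

def instantiate(templates, slots):
    """Fill every template with every combination of slot values."""
    keys = list(slots)
    pairs = []
    for tpl in templates:
        for combo in product(*(slots[k] for k in keys)):
            fill = dict(zip(keys, combo))
            pairs.append((tpl["nl"].format(**fill), tpl["cypher"].format(**fill)))
    return pairs

for nl, cy in instantiate(TEMPLATES, SCHEMA_SLOTS):
    print(nl, "->", cy)
```

A real repository would add path-based, multi-step and evaluation patterns, and filter slot combinations that violate schema relationships; the cross-product above is the minimal core of the generation step.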
Expected Outcomes:
The expected outcomes are multi-fold:
1. Creation of a benchmark dataset for the task of text2cypher
2. Implementation of various end-to-end baselines for the task of text2cypher
3. Experimental analysis of baselines for the task
4. Writing and submitting a paper to a tier-1 AI conference such as IJCAI, AAAI, NeurIPS, KDD, ACL, etc.
Expected Skill Set:
The candidate must be well-versed in the theory and concepts of generative AI. The candidate is expected to know prompt engineering, few-shot learning, chain-of-thought prompting, RAG, agentic workflows, etc., and to have prior experience implementing these techniques. The candidate must also be self-sufficient in driving experiments with various LLMs and be familiar with small language models (SLMs); prior experience writing a research paper is also appreciated.