How to Build Knowledge Graphs With LLMs (python tutorial)
04 Dec 2023 (12 months ago)
- Discussed how graph databases and large language models (LLMs) can generate knowledge graphs from unstructured data.
- LLMs can be used to identify entities and relationships and interact with the database through a chat interface.
- Knowledge graphs have advantages over vector search in some applications.
- The video was well-received, prompting detailed explanations on building the system with code.
- The tutorial includes a step-by-step coding guide.
Environment overview (1m3s) & Neo4J & OpenAI Configuration (2m47s)
- Data consists of profiles, project briefs, and Slack messages linked to the same people.
- Data in JSON format and markdown files, generated using ChatGPT.
- An
.env
file holds secrets and credentials for the OpenAI API and Neo4J database.
- Python libraries required are listed in
requirements.txt
.
- Code execution is done in a Jupyter Notebook.
- Configuration parameters for OpenAI API include API key, endpoint, and version.
- Neo4J requirements include endpoint URL, username, password, and database setup.
- Azure is used to manage OpenAI models, with deployment names and keys gathered for use.
- Neo4J Aura's hosted setup allows for instance creation and database connection setup.
- Important to note down Neo4J credentials upon instance creation as the password is only shown once.
Helper functions (7m50s)
- A function is defined to facilitate calling the OpenAI API with a system message/prompt and content.
- The function uses the OpenAI Library, setting a maximum of 15,000 tokens and a temperature of zero.
- System messages and user messages structure the request, with results filtering for the final response.
- The function simplifies obtaining the text response from the OpenAI API.
Identifying entities and relationships in data (9m15s)
- Introduced a function to extract entities and relationships from a folder of files, returning a JSON object.
- The function receives a folder name and prompt templates, and processes all files in a specified data directory.
- It initializes a list to save results and loops through each file, stripping whitespace from the contents.
- Each file type has a dedicated prompt template, which includes placeholder values for file contents.
- A generic system message is defined to provide context for the model, such as "You are helpful IT project and accounts management experts".
- Results from the language model are appended to a list, and exceptions are printed with the file name.
- A timer is implemented to track processing duration for each folder.
- File prompts contain detailed instructions for the language model to identify entities, such as projects, technologies, and clients, and their attributes.
- The model extracts relationships based on document context and produces a JSON output with entities and relationships lists.
- Testing and error handling for rate limits are performed, with solutions such as requesting limit increases or adding delays between requests.
Function Implementation Details
- The function uses language model API calls to interpret each document.
- A project brief prompt instructs the model to extract entities with attributes and generate relationships, expecting a JSON result format.
- Entity IDs are created using lowercase names with special characters removed, though this approach is noted as not reliable for production.
- Technologies are identified with detail to distinguish specific instances.
- Clients and industries are inferred from document context.
- Relationships are structured based on entity co-mention within a document.
- Adjustments to the function include renaming variables for clarity, fixing mistakes, and ensuring JSON objects are correctly read into results.
- Rate limits are managed by adding delays between API requests.
- Results include node types, IDs, relationships, and summaries formatted as per instructions.
Generating Cypher Statements (21m49s)
- The function converts JSON into Cypher statements for Neo4j.
- Cypher statements are used to create entities and relationships within the database.
- The function loops through JSON objects to extract
label
and ID
keys, while special characters in IDs are removed.
- Nodes have properties, such as a project entity's
summary
, which are added dynamically along with other key-value pairs into the Cypher statement.
- The
merge
keyword is used in Cypher to avoid duplicate entities during multiple executions.
- Attributes for nodes are created with
oncreate set
.
- The function writes the generated Cypher statements into a text file.
- Relationship loops extract
source node ID
, relationship type
, and target node ID
, using a pipe character as a separator.
- A relationship Cypher statement defines the relationship between two nodes using their IDs and types.
- The function appends Cypher statements to a list and returns the combination of entity and relationship statements.
Running the full pipeline (33m20s)
- A pipeline function runs all defined functions in sequence to bring together the extraction and database insertion processes.
- The function processes a dictionary that maps folder names to prompt templates.
extract entities and relationships
function is called for each item, returning a list of entities and relationships which get extended into one comprehensive list.
generate Cypher
function takes the list in JSON format and outputs Cypher statements which are then executed in the database through a GDs execute query
method, with the results also written to a file.
- During execution, errors are caught and logged into a failure file.
- After execution, the database is checked in the Neo4j workspace to confirm the presence of nodes and relationships which indicates success in knowledge graph creation.