How to Build Knowledge Graphs With LLMs (python tutorial)

04 Dec 2023

Intro (0s)

  • Discussed how graph databases and large language models (LLMs) can generate knowledge graphs from unstructured data.
  • LLMs can be used to identify entities and relationships and interact with the database through a chat interface.
  • Knowledge graphs have advantages over vector search in some applications.
  • The original video was well-received, prompting this follow-up: a step-by-step coding guide to building the system.

Environment overview (1m3s) & Neo4j and OpenAI configuration (2m47s)

  • Data consists of profiles, project briefs, and Slack messages linked to the same people.
  • The data is a mix of JSON and Markdown files, generated using ChatGPT.
  • An .env file holds secrets and credentials for the OpenAI API and the Neo4j database (a minimal loading sketch follows this list).
  • Python libraries required are listed in requirements.txt.
  • Code execution is done in a Jupyter Notebook.
  • Configuration parameters for OpenAI API include API key, endpoint, and version.
  • Neo4j requirements include the endpoint URL, username, password, and database setup.
  • Azure is used to manage the OpenAI models, with deployment names and keys gathered for use.
  • Neo4j Aura, the hosted offering, is used to create an instance and configure the database connection.
  • Note down the Neo4j credentials when the instance is created, as the password is shown only once.
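A minimal configuration sketch, assuming python-dotenv, the openai 1.x client, and the official neo4j driver; the environment-variable names here are placeholders and should match whatever is in your own .env file.

```python
import os
from dotenv import load_dotenv
from openai import AzureOpenAI
from neo4j import GraphDatabase

load_dotenv()  # read secrets from the .env file

# Azure OpenAI configuration: API key, endpoint, and API version.
client = AzureOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    azure_endpoint=os.environ["OPENAI_ENDPOINT"],
    api_version=os.environ["OPENAI_API_VERSION"],
)
DEPLOYMENT = os.environ["OPENAI_DEPLOYMENT_NAME"]  # Azure deployment name

# Neo4j Aura connection: URI plus the credentials noted down at instance creation.
driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)
driver.verify_connectivity()  # fail fast if the credentials are wrong
```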

Helper functions (7m50s)

  • A function is defined to facilitate calling the OpenAI API with a system message/prompt and content.
  • The function wraps the OpenAI library, setting a maximum of 15,000 tokens and a temperature of zero.
  • The request is built from a system message and a user message, and the response object is filtered down to the final text.
  • The helper simplifies obtaining a plain-text response from the OpenAI API (see the sketch after this list).
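A sketch of such a helper, reusing the client and deployment name from the configuration sketch above; the 15,000-token cap and zero temperature follow the video, while the function name is my own.

```python
def call_llm(system_message: str, content: str) -> str:
    """Send a system prompt plus user content to the chat model; return plain text."""
    response = client.chat.completions.create(
        model=DEPLOYMENT,   # Azure deployment name from the configuration sketch
        temperature=0,      # deterministic output
        max_tokens=15000,   # generous cap; requires a large-context model
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": content},
        ],
    )
    # Filter the full response object down to just the generated text.
    return response.choices[0].message.content
```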

Identifying entities and relationships in data (9m15s)

  • Introduced a function that extracts entities and relationships from a folder of files, returning a list of JSON objects.
  • The function receives a folder name and prompt templates, and processes all files in a specified data directory.
  • It initializes a list to save results and loops through each file, stripping whitespace from the contents.
  • Each file type has a dedicated prompt template, which includes placeholder values for file contents.
  • A generic system message provides context for the model, such as "You are a helpful IT project and account management expert."
  • Results from the language model are appended to a list, and exceptions are printed with the file name.
  • A timer is implemented to track processing duration for each folder.
  • File prompts contain detailed instructions for the language model to identify entities, such as projects, technologies, and clients, and their attributes.
  • The model extracts relationships based on document context and produces a JSON output with entities and relationships lists.
  • Rate-limit errors are tested for and handled, with solutions such as requesting a limit increase or adding delays between requests (a sketch of the extraction loop follows this list).
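A sketch of the extraction loop under the assumptions above: call_llm is the helper from the previous section, the data lives under data/<folder>, and the {ctext} placeholder name inside the prompt template is hypothetical.

```python
import json
import time
from pathlib import Path

def extract_entities_relationships(folder: str, prompt_template: str) -> list:
    """Run a prompt template over every file in data/<folder>, collecting JSON results."""
    start = time.time()
    system_message = "You are a helpful IT project and account management expert."
    results = []
    for path in Path("data", folder).iterdir():
        try:
            text = path.read_text().strip()
            prompt = prompt_template.replace("{ctext}", text)  # hypothetical placeholder
            results.append(json.loads(call_llm(system_message, prompt)))
            time.sleep(8)  # crude delay to stay under API rate limits
        except Exception as e:
            print(f"Error processing {path.name}: {e}")
    print(f"Processed folder '{folder}' in {time.time() - start:.0f}s")
    return results
```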

Function Implementation Details

  • The function uses language model API calls to interpret each document.
  • A project brief prompt instructs the model to extract entities with attributes and generate relationships, expecting a JSON result format.
  • Entity IDs are created by lowercasing names and removing special characters, though this approach is noted as not reliable for production (see the snippet after this list).
  • Technologies are identified in enough detail to distinguish specific instances.
  • Clients and industries are inferred from document context.
  • Relationships are structured based on entity co-mention within a document.
  • Adjustments to the function include renaming variables for clarity, fixing mistakes, and ensuring JSON objects are correctly read into results.
  • Rate limits are managed by adding delays between API requests.
  • Results include node types, IDs, relationships, and summaries formatted as per instructions.
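A minimal sketch of that ID scheme using a regex; as the video notes, name-derived IDs are not collision-safe for production.

```python
import re

def make_entity_id(name: str) -> str:
    """Lowercase the name and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

make_entity_id("Project Lighthouse 2.0")  # -> "projectlighthouse20"
```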

Generating Cypher Statements (21m49s)

  • The function converts JSON into Cypher statements for Neo4j.
  • Cypher statements are used to create entities and relationships within the database.
  • The function loops through JSON objects to extract label and ID keys, while special characters in IDs are removed.
  • Nodes have properties, such as a project entity's summary, which are added dynamically along with other key-value pairs into the Cypher statement.
  • The MERGE keyword is used in Cypher so that repeated executions do not create duplicate entities.
  • Node attributes are set with ON CREATE SET, so they are written only when a node is first created (see the sketch after this list).
  • The function writes the generated Cypher statements into a text file.
  • Relationship loops extract source node ID, relationship type, and target node ID, using a pipe character as a separator.
  • A relationship Cypher statement defines the relationship between two nodes using their IDs and types.
  • The function appends Cypher statements to a list and returns the combination of entity and relationship statements.
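A sketch of the Cypher generation step, assuming the extraction output holds entities and relationships lists where each entity carries label and id keys and each relationship is a "source|TYPE|target" string; the naive string interpolation mirrors the tutorial style and is not injection-safe.

```python
def generate_cypher(json_objects: list) -> list:
    """Convert extracted entities/relationships into Cypher MERGE statements."""
    entity_statements, relationship_statements = [], []
    for obj in json_objects:
        # Entities: MERGE avoids duplicates; ON CREATE SET adds the attributes.
        for entity in obj["entities"]:
            label = entity.pop("label")
            node_id = make_entity_id(entity.pop("id"))
            props = ", ".join(f'n.{key} = "{value}"' for key, value in entity.items())
            statement = f'MERGE (n:{label} {{id: "{node_id}"}})'
            if props:
                statement += f" ON CREATE SET {props}"
            entity_statements.append(statement)
        # Relationships arrive as "source|TYPE|target"; split on the pipe separator.
        for rel in obj["relationships"]:
            source, rel_type, target = rel.split("|")
            relationship_statements.append(
                f'MATCH (a {{id: "{make_entity_id(source)}"}}), '
                f'(b {{id: "{make_entity_id(target)}"}}) '
                f"MERGE (a)-[:{rel_type}]->(b)"
            )
    statements = entity_statements + relationship_statements
    with open("cypher_statements.txt", "w") as f:  # keep a copy on disk
        f.write("\n".join(statements))
    return statements
```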

Running the full pipeline (33m20s)

  • A pipeline function runs all defined functions in sequence to bring together the extraction and database insertion processes.
  • The function processes a dictionary that maps folder names to prompt templates.
  • The extract-entities-and-relationships function is called for each item, and the returned entities and relationships are extended into one comprehensive list.
  • The generate-Cypher function takes that list of JSON objects and outputs Cypher statements, which are executed against the database via the GDS client's execute-query method and also written to a file (a sketch follows this list).
  • During execution, errors are caught and logged into a failure file.
  • After execution, the database is checked in the Neo4j workspace to confirm the presence of nodes and relationships which indicates success in knowledge graph creation.
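A sketch of the pipeline with a hypothetical folder-to-template mapping; the stub prompt strings stand in for the detailed templates described earlier, and where the video executes statements through a GDS client, this version substitutes the official neo4j driver's execute_query using the driver from the configuration sketch.

```python
# Hypothetical prompt templates (the real ones contain detailed instructions).
people_prompt = "Extract entities and relationships from this profile: {ctext}"
project_prompt = "Extract entities and relationships from this project brief: {ctext}"
slack_prompt = "Extract entities and relationships from these messages: {ctext}"

# Mapping of data folders to their prompt templates.
folder_prompts = {
    "people_profiles": people_prompt,
    "project_briefs": project_prompt,
    "slack_messages": slack_prompt,
}

def run_pipeline(folder_prompts: dict) -> None:
    """Extract from every folder, generate Cypher, and load it into Neo4j."""
    all_results = []
    for folder, template in folder_prompts.items():
        all_results.extend(extract_entities_relationships(folder, template))
    statements = generate_cypher(all_results)
    with open("failures.txt", "w") as failures:
        for statement in statements:
            try:
                driver.execute_query(statement)
            except Exception as e:
                failures.write(f"{statement}\n{e}\n\n")  # log the failure and continue

run_pipeline(folder_prompts)
```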
