Introduction
Knowledge graphs have emerged as a powerful and versatile approach in AI and data science for representing structured information in a way that supports efficient data retrieval, reasoning, and inference. This article examines the state of the art in knowledge graphs, including their construction, representation, querying, embeddings, reasoning, alignment, and fusion.
We also discuss the many applications of knowledge graphs, such as recommendation engines and question-answering systems. Finally, to pave the way for new advancements and research opportunities, we explore the field's open challenges and potential future directions.
Knowledge graphs have revolutionized how information is organized and used by providing a flexible and scalable mechanism to express complicated connections between entities and characteristics. Here, we give a general introduction to knowledge graphs, their importance, and their potential use across various fields.
Learning Objectives
- Understand the concept and purpose of knowledge graphs as structured representations of information.
- Learn about the key components of knowledge graphs: nodes, edges, and properties.
- Explore the construction process, including data extraction and integration techniques.
- Understand how knowledge graph embeddings represent entities and relationships as continuous vectors.
- Explore reasoning methods to infer new insights from existing knowledge.
- Gain insights into knowledge graph visualization for better understanding.
This article was published as a part of the Data Science Blogathon.
What is a Knowledge Graph?
A knowledge graph can store the information extracted during an information extraction operation. Many fundamental knowledge graph implementations are built on the idea of a triple: a collection of three elements (a subject, a predicate, and an object) that can capture almost any fact.
A graph is a collection of nodes and edges.
Two nodes connected by an edge form the smallest knowledge graph we can design, also known as a triple. Knowledge graphs come in many forms and sizes. Here, Node A and Node B are two separate entities, connected by an edge that represents the relationship between them.
Data Representation in Knowledge Graph
Take the following phrase as an illustration:
London is the capital of England. Westminster is located in London.
We will see some basic processing later, but initially, we would have two triples looking like this:
(London, be capital, England), (Westminster, locate, London)
In this example, we have three distinct entities (London, England, and Westminster) and two relations (capital, location). To construct a knowledge graph, we only need to place the related entities on nodes of the network and connect them with edges labeled by the relations.
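To make the resulting structure concrete, here is a minimal sketch that encodes these two triples with networkx (the same library we use later in this article); the node and edge labels simply mirror the triples above:

import networkx as nx

# Build the two triples as a small directed graph:
# (London, be capital, England) and (Westminster, locate, London)
G = nx.DiGraph()
G.add_edge("London", "England", label="be capital")
G.add_edge("Westminster", "London", label="locate")

print(list(G.edges(data=True)))
# [('London', 'England', {'label': 'be capital'}),
#  ('Westminster', 'London', {'label': 'locate'})]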
Creating a knowledge graph manually, however, is not scalable. No one is going to read through hundreds of documents to extract every entity and relationship by hand! Machines are better suited to this task than people, because they can easily sift through hundreds or even thousands of documents. The fact that machines cannot grasp natural language out of the box presents another difficulty. This is where natural language processing (NLP) comes in.
Making our computer understand natural language is crucial if we want to build a knowledge graph from text. We can do this with NLP techniques such as sentence segmentation, dependency parsing, parts-of-speech (POS) tagging, and entity recognition.
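As a quick, hedged sketch of what these techniques look like in practice (assuming spaCy and its en_core_web_sm model are installed, just as in the code that follows), we can run them on the example sentences from above:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("London is the capital of England. Westminster is located in London.")

# sentence segmentation
for sent in doc.sents:
    print(sent.text)

# POS tagging and dependency parsing
for tok in doc:
    print(tok.text, tok.pos_, tok.dep_)

# entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)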
Import Dependencies & Load dataset
import re
import pandas as pd
import bs4
import requests
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
from spacy.tokens import Span
import networkx as nx
import matplotlib.pyplot as plt
from tqdm import tqdm
pd.set_option('display.max_colwidth', 200)
%matplotlib inline
# import wikipedia sentences
candidate_sentences = pd.read_csv("../input/wiki-sentences1/wiki_sentences_v2.csv")
candidate_sentences.shape
candidate_sentences['sentence'].sample(5)
Sentence Segmentation
Splitting the text document or article into sentences is the first stage of building the knowledge graph. We will then shortlist only the sentences that have exactly one subject and one object.
doc = nlp("the drawdown process is governed by astm standard d823")
for tok in doc:
print(tok.text, "...", tok.dep_)
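One rough way to do that shortlisting (a hypothetical helper, not part of the original pipeline) is to count subject-like and object-like dependency tags per sentence and keep only the sentences where both counts are exactly one:

def has_one_subject_and_object(sentence):
    # count tokens whose dependency tag contains "subj" or "obj"
    doc = nlp(sentence)
    n_subj = sum(1 for tok in doc if "subj" in tok.dep_)
    n_obj = sum(1 for tok in doc if "obj" in tok.dep_)
    return n_subj == 1 and n_obj == 1

print(has_one_subject_and_object("the drawdown process is governed by astm standard d823"))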
Entities Extraction
Extracting a single-word entity from a sentence is easy. We can do it quickly using parts-of-speech (POS) tags: nouns and proper nouns would be our entities. However, when an entity spans multiple words, POS tags alone are inadequate; we need to parse the dependency tree of the sentence.
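To see that limitation concretely, here is a small sketch of POS-only extraction (the example sentence is made up for illustration); it returns individual tokens, so a multi-word entity gets split into pieces:

def get_pos_entities(sentence):
    # naive extraction: single-token nouns and proper nouns only
    return [tok.text for tok in nlp(sentence) if tok.pos_ in ("NOUN", "PROPN")]

# a multi-word entity like "film score" comes back as separate tokens
print(get_pos_entities("the film score was praised by critics"))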
The nodes and their relationships are most important when developing a knowledge graph.
These nodes will be made up of entities found in Wikipedia texts. Edges reflect the relationships between these elements. We will use an unsupervised approach to extract these elements from the phrase structure.
The basic idea is to go through a sentence and pick out the subject and the object when they are encountered. However, there are a few drawbacks. For example, "red wine" is an entity that spans two words, whereas dependency parsers tag only the individual words as subjects or objects.
Because of the issues mentioned above, I created the code below to extract the subject and the object (the entities) from a sentence. For your convenience, I've broken the code into several chunks:
def get_entities(sent):
    ## chunk 1
    ent1 = ""
    ent2 = ""

    prv_tok_dep = ""    # dependency tag of previous token in the sentence
    prv_tok_text = ""   # previous token in the sentence

    prefix = ""
    modifier = ""

    #############################################################

    for tok in nlp(sent):
        ## chunk 2
        # if token is a punctuation mark then move on to the next token
        if tok.dep_ != "punct":
            # check: token is a compound word or not
            if tok.dep_ == "compound":
                prefix = tok.text
                # if the previous word was also a 'compound' then add the current word to it
                if prv_tok_dep == "compound":
                    prefix = prv_tok_text + " " + tok.text

            # check: token is a modifier or not
            if tok.dep_.endswith("mod"):
                modifier = tok.text
                # if the previous word was also a 'compound' then add the current word to it
                if prv_tok_dep == "compound":
                    modifier = prv_tok_text + " " + tok.text

            ## chunk 3
            # token is the subject -> first entity
            if "subj" in tok.dep_:
                ent1 = modifier + " " + prefix + " " + tok.text
                prefix = ""
                modifier = ""
                prv_tok_dep = ""
                prv_tok_text = ""

            ## chunk 4
            # token is the object -> second entity
            if "obj" in tok.dep_:
                ent2 = modifier + " " + prefix + " " + tok.text

            ## chunk 5
            # update variables
            prv_tok_dep = tok.dep_
            prv_tok_text = tok.text

    #############################################################
    return [ent1.strip(), ent2.strip()]
Chunk 1
The code block above defines a few empty variables. The dependency tag of the preceding token and the preceding token's text will be kept in the variables prv_tok_dep and prv_tok_text, respectively. The prefix and modifier variables will hold the text that is associated with the subject or the object.
Chunk 2
Next, we loop over the tokens in the sentence one by one. We first check whether the token is a punctuation mark; if it is, we ignore it and move on to the next token. If the token is part of a compound word (dependency tag = "compound"), we store it in the prefix variable.
A compound word is a combination of multiple words that together form a phrase with a new meaning (examples include "Football Stadium" and "animal lover").
We append this prefix to each subject or object as it is encountered in the sentence. A similar approach is used for modifiers, such as "nice shirt" and "big house".
Chunk 3
If the token is the subject, it is stored as the first entity in the ent1 variable. The variables prefix, modifier, prv_tok_dep, and prv_tok_text are then reset.
Chunk 4
If the token is the object, it is stored as the second entity in the ent2 variable.
Chunk 5
Finally, at the end of each iteration, we update the previous token and its dependency tag with the current token's values.
Let’s use a phrase to test this function:
get_entities("the film had 200 patents")
Great, everything seems to be working as planned. In the above sentence, 'film' is the subject and '200 patents' is the object.
We can now use this function to extract entity pairs for all of the sentences in our data:
entity_pairs = []

for i in tqdm(candidate_sentences["sentence"]):
    entity_pairs.append(get_entities(i))
The list entity_pairs includes all of the subject-object pairings from Wikipedia sentences. Let’s take a look at a few of them.
entity_pairs[10:20]
As you can see, a few of these entity pairs contain pronouns such as 'we', 'it', 'she', and so on. We would rather have proper nouns or nouns instead. Ideally, we would update the get_entities() code to filter out pronouns; a simple post-processing alternative is sketched below.
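As a rough sketch of such a filter (a post-processing step rather than a change inside get_entities(); the pronoun list is a hand-picked assumption), we could simply drop the pairs whose subject or object is a pronoun:

PRONOUNS = {"i", "we", "you", "he", "she", "it", "they", "this", "that"}

# keep only pairs where neither the subject nor the object is a pronoun
filtered_pairs = [
    pair for pair in entity_pairs
    if pair[0].lower() not in PRONOUNS and pair[1].lower() not in PRONOUNS
]
print(len(entity_pairs), "->", len(filtered_pairs))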
Relations Extraction
The extraction of entities is only half the task. We need edges to link the nodes (entities) to form a knowledge graph. These edges represent the connections between two nodes.
According to our hypothesis, the predicate is the main verb in a sentence. For example, in the statement "Sixty Hollywood musicals were released in 1929," the main verb is "released", and together with the preposition "in" it gives us "released in" as the predicate for the triple formed from this sentence.
The following function may extract such predicates from sentences. I utilized spaCy’s rule-based matching in this case:
def get_relation(sent):
    doc = nlp(sent)

    # Matcher class object
    matcher = Matcher(nlp.vocab)

    # define the pattern
    pattern = [{'DEP': 'ROOT'},
               {'DEP': 'prep', 'OP': "?"},
               {'DEP': 'agent', 'OP': "?"},
               {'POS': 'ADJ', 'OP': "?"}]

    # spaCy v3 syntax; on spaCy v2 use matcher.add("matching_1", None, pattern)
    matcher.add("matching_1", [pattern])

    matches = matcher(doc)
    k = len(matches) - 1

    span = doc[matches[k][1]:matches[k][2]]

    return span.text
The function’s pattern attempts to discover the phrase’s ROOT word or primary verb. After identifying the ROOT, the pattern checks to see if it is followed by a preposition (‘prep’) or an agent word. If this is the case, it is appended to the ROOT word. Allow me to demonstrate this function:
get_relation("John completed the task")
relations = [get_relation(i) for i in tqdm(candidate_sentences['sentence'])]
Let’s look at the most common relations or predicates that we just extracted:
pd.Series(relations).value_counts()[:50]
Build Knowledge Graph
Finally, we will construct a knowledge graph using the retrieved entities (subject-object pairs) and predicates (relationships between entities). Let us build a dataframe with entities and predicates:
# extract subject
source = [i[0] for i in entity_pairs]
# extract object
target = [i[1] for i in entity_pairs]
kg_df = pd.DataFrame({'source':source, 'target':target, 'edge':relations})
The networkx library will then be used to form a network from this dataframe. The nodes will represent the entities, while the edges or connections between the nodes will reflect the nodes’ relationships.
This will be a directed graph. In other words, each linked node pair’s relationship is one-way only, from one node to another.
# create a directed graph from the dataframe
G = nx.from_pandas_edgelist(kg_df, "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
Before digging further into that graph, let's also plot a small, self-contained knowledge graph example:
import networkx as nx
import matplotlib.pyplot as plt

# Create a KnowledgeGraph class
class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_entity(self, entity, attributes):
        self.graph.add_node(entity, **attributes)

    def add_relation(self, entity1, relation, entity2):
        self.graph.add_edge(entity1, entity2, label=relation)

    def get_attributes(self, entity):
        return self.graph.nodes[entity]

    def get_related_entities(self, entity, relation):
        related_entities = []
        for _, destination, rel_data in self.graph.out_edges(entity, data=True):
            if rel_data["label"] == relation:
                related_entities.append(destination)
        return related_entities

if __name__ == "__main__":
    # Initialize the knowledge graph
    knowledge_graph = KnowledgeGraph()

    # Add entities and their attributes
    knowledge_graph.add_entity("United States", {"Capital": "Washington, D.C.", "Continent": "North America"})
    knowledge_graph.add_entity("France", {"Capital": "Paris", "Continent": "Europe"})
    knowledge_graph.add_entity("China", {"Capital": "Beijing", "Continent": "Asia"})

    # Add relations between entities
    knowledge_graph.add_relation("United States", "Neighbor of", "Canada")
    knowledge_graph.add_relation("United States", "Neighbor of", "Mexico")
    knowledge_graph.add_relation("France", "Neighbor of", "Spain")
    knowledge_graph.add_relation("France", "Neighbor of", "Italy")
    knowledge_graph.add_relation("China", "Neighbor of", "India")
    knowledge_graph.add_relation("China", "Neighbor of", "Russia")

    # Retrieve and print attributes and related entities
    print("Attributes of France:", knowledge_graph.get_attributes("France"))
    print("Neighbors of China:", knowledge_graph.get_related_entities("China", "Neighbor of"))

    # Visualize the knowledge graph
    pos = nx.spring_layout(knowledge_graph.graph, seed=42)
    edge_labels = nx.get_edge_attributes(knowledge_graph.graph, "label")

    plt.figure(figsize=(8, 6))
    nx.draw(knowledge_graph.graph, pos, with_labels=True,
            node_size=2000, node_color="skyblue", font_size=10)
    nx.draw_networkx_edge_labels(knowledge_graph.graph, pos,
                                 edge_labels=edge_labels, font_size=8)
    plt.title("Knowledge Graph: Countries and their Capitals")
    plt.show()
Returning to our Wikipedia graph: this isn't exactly what we were hoping for (though it's still quite a sight!). We ended up with a graph that contains every relation we extracted, and a graph with this many relations or predicates is very hard to read.
As a result, it is best to employ only a few key relationships to visualize a graph. I’ll tackle it one relationship at a time. Let us begin with the relationship “composed by”:
G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == "composed by"],
                            "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, k=0.5)  # k regulates the distance between nodes
nx.draw(G, with_labels=True, node_color='skyblue',
        node_size=1500, edge_cmap=plt.cm.Blues, pos=pos)
plt.show()
That is a much better graph. The arrows in this case point to the composers. In the graph above, A.R. Rahman, a well-known music composer, is linked to things such as “soundtrack score,” “film score,” and “music.”
Let’s look at some additional connections. Now I’d want to draw the graph for the “written by” relationship:
G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == "written by"],
                            "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, k=0.5)
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500,
        edge_cmap=plt.cm.Blues, pos=pos)
plt.show()
This knowledge graph gives us some remarkable information. Javed Akhtar, Krishna Chaitanya, and Jaideep Sahni are all famous lyricists, and this graph captures that relationship elegantly.
Let’s look at the knowledge graph for another crucial predicate, “released in”:
G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == "released in"],
                            "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, k=0.5)
nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500,
        edge_cmap=plt.cm.Blues, pos=pos)
plt.show()
Conclusion
Knowledge graphs have emerged as a powerful and versatile tool in AI and data science for representing structured information, enabling efficient data retrieval, reasoning, and inference. Throughout this article, we have explored the significance and impact of knowledge graphs across different domains. Here are the key takeaways:
- Knowledge graphs offer a structured representation of information in a graph format with nodes, edges, and properties.
- They enable flexible data modeling without fixed schemas, facilitating data integration from diverse sources.
- Knowledge graph reasoning allows for inferring new facts and insights based on existing knowledge.
- Applications span across domains, including natural language processing, recommendation systems, and semantic search engines.
- Knowledge graph embeddings represent entities and relationships as continuous vectors, enabling machine learning on graphs (see the short sketch below).
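As a minimal, illustrative sketch of that idea (using TransE, one popular embedding scheme that is not otherwise used in this article, and assuming numpy is available), the snippet below scores a triple with random, untrained vectors; in practice the vectors would be learned so that true triples score higher than corrupted ones:

import numpy as np

rng = np.random.default_rng(42)
dim = 16

# toy, untrained embedding tables (random vectors, for illustration only)
entities = {name: rng.normal(size=dim) for name in ["London", "England", "Westminster"]}
relations = {name: rng.normal(size=dim) for name in ["capital of", "located in"]}

def transe_score(head, relation, tail):
    # TransE: a triple (h, r, t) is plausible when h + r is close to t,
    # so a smaller distance (i.e., a higher score here) means a more plausible triple
    return -np.linalg.norm(entities[head] + relations[relation] - entities[tail])

print(transe_score("London", "capital of", "England"))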
In conclusion, knowledge graphs have become essential for organizing and making sense of vast amounts of interconnected information. As research and technology advance, knowledge graphs will undoubtedly play a central role in shaping the future of AI, data science, information retrieval, and decision-making systems across various sectors.
Frequently Asked Questions
Q: What are the benefits of using knowledge graphs?
A: Knowledge graphs enable efficient data retrieval, reasoning, and inference. They support semantic search, facilitate data integration, and provide a powerful foundation for building intelligent applications such as recommendation and question-answering systems.
Q: How are knowledge graphs constructed?
A: Knowledge graphs are constructed by extracting and integrating information from various sources, using data extraction techniques, entity resolution, and entity linking to build a coherent and comprehensive graph.
Q: What is knowledge graph alignment?
A: Knowledge graph alignment is the process of integrating information from multiple knowledge graphs or datasets to create a unified and interconnected knowledge base.
Q: How do knowledge graphs enhance natural language processing?
A: Knowledge graphs enhance natural language processing tasks by providing contextual information and semantic relationships between entities, improving entity recognition, sentiment analysis, and question-answering systems.
Q: What are knowledge graph embeddings?
A: Knowledge graph embeddings represent entities and relationships as continuous vectors in a low-dimensional space. They capture the semantic meaning and structural information of entities and relationships in the graph.