
Become a Knowledge Graph Jedi

Using NLP and search algorithms to derive meaning from your data

In a dark place we find ourselves, and a little more knowledge lights our way.

A knowledge graph is a representation of entities (e.g., a person, a place) and the relationships between them. Knowledge graphs are used to derive semantic understanding from these connections that wouldn’t otherwise be apparent. Most people try to derive semantic understanding from data. A knowledge graph Jedi does it.

Let’s start our understanding of knowledge graphs using a popular nursery rhyme.

[Image: Graph model for “Jack Be Nimble”]

The entity “Jack” is related to the entities “nimble,” “quick,” and “candlestick” with the actions “be” and “jump over the” (I promise that’s not a typo). Okay, so what, you might be asking.

We live in an age where it is nearly impossible to understand everything because knowledge and data are increasing exponentially [1]. We need tools to help us summarize and make sense of all this data. A knowledge graph is that tool. It increases our ability to search through copious amounts of data to help derive meaning and understanding. Let’s revisit our nursery rhyme example.

[Image: Expanded graph model for “Jack Be Nimble”]

If we expand out from the “candlestick” entity, we can discover why a candlestick was used in this nursery rhyme. We may be able to derive meaning from the time when this nursery rhyme was first created.

Of course, this example is child’s play. The candlestick background is easily searchable in the corresponding Wikipedia article [2]. That’s fair. But real-world knowledge graphs aren’t this simple. They typically look something more like the image below, with millions of interconnected entities.

[Image: Photo by TheDigitalArtist from Pixabay]

It is a daunting task to manually create all those entities and relationships and then sift through the resulting graph. There must be a better way, right?

Well yes, of course there is. Otherwise, I wouldn’t have written this blog post. That better way can be found at the intersection of two roads, Data Extraction Street and Search Algorithm Avenue. Let’s explore the former first.

Data Extraction

How do you eat an elephant?

One bite at a time.

Unless you’re a computer engineer, in which case you just build a mechanism to automate the process.

Luckily, our automation process won’t take longer than the manual approach. Let me explain. Typically, building a knowledge graph starts with defining a graph data model [3]. If you’re unfamiliar with graph theory, a graph data model is a way to represent data as a series of entities (i.e., nodes) and the relationships between those entities (i.e., edges).

[Image: Simple graph example]
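To make that concrete, here is a minimal sketch of the “Jack Be Nimble” graph expressed as nodes and labelled edges. It uses NetworkX purely as a stand-in for a graph data model; this is an illustration of the idea, not code from the project itself.

```python
# A minimal sketch of a graph data model, using NetworkX as a stand-in.
# Nodes are entities and edges carry the relationship ("action") as a label.
import networkx as nx

graph = nx.MultiDiGraph()

# Entities and relationships from "Jack Be Nimble"
graph.add_edge("Jack", "nimble", label="be")
graph.add_edge("Jack", "quick", label="be")
graph.add_edge("Jack", "candlestick", label="jump over the")

# Walk the edges: each one is an (entity, relationship, entity) triple
for subject, obj, data in graph.edges(data=True):
    print(f"({subject}) -[{data['label']}]-> ({obj})")
```

Every knowledge graph in this post boils down to triples like the ones this loop prints: an entity, a relationship, and another entity.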
The best graph data models come from a good understanding of the data. But we don’t have a good understanding of our data. I mean, that’s why we’re building the knowledge graph, right? I think you can see the chicken-and-egg problem here. What we need is an initial bootstrap of our data: a way to boil it down to its simplest components. For that, I’d like you to meet relationship extraction.

Relationship Extraction

Relationship extraction is a natural language processing (NLP) method for automatically deriving relationships between noun entities in a body of text. Let’s see an example. Here is an excerpt of a White House press briefing document [4]. Why am I using a White House press briefing document? Because, yay politics?

“Press Briefing by Press Secretary Jen Psaki”

This excerpt comes from the title of a press briefing on 2021/12/10. You and I can clearly see the relationship between Press Secretary Jen Psaki and the press briefing.

[Image: Manual relationship extraction]

But if we had to extract these relationships manually for an entire document, it would take longer than sitting through a Jar Jar Binks movie.

[Image: Photo by Conmongt from Pixabay]

The trick is to get a computer to automatically recognize and extract these relationships for us. I used a project called Open Information Extraction, or Open IE, to accomplish this [5]. When we supply our excerpt to the Open IE service, this is what we get back.

[Image: Automated relationship extraction]

Not bad, right?
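If you’re curious what that step looks like in code, here is a hedged sketch that sends text to a locally running Stanford CoreNLP server with its OpenIE annotator enabled and prints the extracted triples. This is just one common way to run Open IE; the service and wrapper used for this post (see the linked repository) may differ, and the URL and response fields below reflect CoreNLP’s JSON output rather than anything specific to the project.

```python
# Sketch: extracting (subject, relation, object) triples with Open IE.
# Assumes a Stanford CoreNLP server (with the "openie" annotator) is
# running locally on port 9000; the article's exact setup may differ.
import json
import requests

CORENLP_URL = "http://localhost:9000"  # assumed local server

def extract_triples(text: str):
    """Send text to the CoreNLP server and return Open IE triples."""
    properties = {
        "annotators": "tokenize,ssplit,pos,lemma,depparse,natlog,openie",
        "outputFormat": "json",
    }
    response = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(properties)},
        data=text.encode("utf-8"),
    )
    response.raise_for_status()
    triples = []
    for sentence in response.json().get("sentences", []):
        for triple in sentence.get("openie", []):
            triples.append((triple["subject"], triple["relation"], triple["object"]))
    return triples

if __name__ == "__main__":
    excerpt = "Press Briefing by Press Secretary Jen Psaki"
    for subj, rel, obj in extract_triples(excerpt):
        print(f"({subj}) -[{rel}]-> ({obj})")
```

Feeding whole press briefing documents through a loop like this is what produces the entity-relationship pairs we work with for the rest of the post.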
Let’s look at another excerpt from the same press briefing.

“And today I wanted to highlight that the Department of Transportation awarded $12.6 million in grants to nine marine highway projects across the country in Delaware, Hawaii, Indiana, Kentucky, Louisiana, North Carolina, New York, New Jersey, Tennessee, Texas, and Virginia.”

And this is what we get as output from Open IE.

(the Department of Transportation) -[awarded]-> ($12.6 million in grants)
(the Department of Transportation) -[awarded]-> ($12.6 million in grants to nine marine highway projects across the country in Virginia)
(the Department of Transportation) -[awarded]-> ($12.6 million in grants to nine marine highway projects across the country in Texas)
(the Department of Transportation) -[awarded]-> ($12.6 million in grants to nine marine highway projects across the country in New York)
(the Department of Transportation) -[awarded]-> ($12.6 million in grants to nine marine highway projects across the country in North Carolina)
(the Department of Transportation) -[awarded]-> ($12.6 million in grants to nine marine highway projects across the country in Hawaii)

Do you see what’s going on here?

Open IE was able to extract the relationship “awarded” between the Department of Transportation and the $12.6 million in grants given to the states listed. We now have discrete, indexable information about the Department of Transportation and, let’s say, Virginia, that can be the kick-off point for an analysis of those two entities. If the prospects of this don’t excite you like a stay-at-home mom when the schools opened back up after the COVID-19 restrictions, let me go one step further.

The simplest components in our knowledge graph are the entity and the relationship. That simple combination of two entities and the relationship between them is the backbone of our knowledge graph. Using the power of NLP to build that backbone gives us a superpower of near-limitless data consumption. But that alone is like a turkey with no stuffing. We still need an efficient way to search through and analyze all that data. For that, let’s turn down Search Algorithm Avenue.

Search Algorithms

If you’re familiar with Google (or is it Alphabet, or maybe it’s Meta), you’ve heard of the PageRank algorithm. You may also be aware that PageRank works by counting the number and quality of links to a web page to determine its importance [6]. What you may not know, however, is that PageRank can also be used to determine the importance of an entity in a graph network. In fact, PageRank is just one of many graph algorithms that we can use to explore and understand our knowledge graph data.

PageRank falls under a class of graph algorithms called centrality algorithms. Centrality algorithms are used to determine the importance of distinct nodes in a graph network [7]. Other useful classes of algorithms for our knowledge graph include community detection algorithms and similarity algorithms. Let’s go through a quick example of each class. For each example, I used ten White House press briefing documents, fed them to the Open IE relationship extractor, and loaded the resulting entity-relationship pairs into a Neo4j database.

Centrality Algorithms

As previously mentioned, centrality algorithms identify the importance of an entity in our knowledge graph. Here are the top five results after running the algorithm.

Unsurprisingly, the entity “the President” scored high in importance. Interestingly, “today” was identified as the most important entity. This probably speaks to the frequency with which press briefings refer to the current day, which makes intuitive sense. Let’s look at another algorithm class.

Community Detection Algorithms

Community detection algorithms are used to evaluate how groups of entities are clustered together [8]. I used a community detection algorithm called Weakly Connected Components (WCC) on the White House press briefing graph data, and here are the top five most strongly connected results.

The results above indicate a strong cluster within our graph network. One could deduce the U.S. government’s strategy in the Middle East based on this cluster, which could be a starting point for further analysis. Let’s look at one more algorithm class.

Similarity Algorithms

Similarity algorithms compute the similarity of pairs of nodes using different vector-based metrics [9]. What does that mean? Let’s say you and your co-worker are neighbours but didn’t know it. If the two of you kept bringing up the same landmarks and shops that you visit, chances are your addresses have a similarity. True story.

Let’s look at a node similarity algorithm on our press briefing data.

Now granted, the data here is dirtier than your kitchen after Thanksgiving dinner, but there are some rudiments of a cohesive similarity analysis here. The link between “Afghanistan” and “Our objective” may give some insights into the U.S. government’s current strategy, which also happens to be “similar” to our WCC algorithm results. Another example not shown in the table above was a link between “American soldier” and “Polish soldiers,” which was eighth in the list of results. If you were curious about American military collaborations around the world, then that similarity might prove useful to you.
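For readers who want to see roughly how these three algorithm classes can be run, here is a hedged sketch using the official Neo4j Python driver and the Graph Data Science (GDS) library. The connection details, the Entity label, the RELATED_TO relationship type, and the projected graph name are assumptions made for illustration; they are not necessarily what the project’s repository uses, and GDS procedure signatures can vary between versions.

```python
# Sketch: loading Open IE triples into Neo4j and running centrality,
# community detection, and similarity algorithms with the GDS library.
# Connection details, labels, and relationship types are illustrative only.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_triples(session, triples):
    """MERGE each (subject, relation, object) triple into the graph."""
    for subj, rel, obj in triples:
        session.run(
            "MERGE (s:Entity {name: $subj}) "
            "MERGE (o:Entity {name: $obj}) "
            "MERGE (s)-[:RELATED_TO {type: $rel}]->(o)",
            subj=subj, rel=rel, obj=obj,
        )

def run_algorithms(session):
    # Project the stored graph into the GDS in-memory catalog.
    session.run("CALL gds.graph.project('briefings', 'Entity', 'RELATED_TO')")

    # Centrality: PageRank scores for the most "important" entities.
    page_rank = session.run(
        "CALL gds.pageRank.stream('briefings') YIELD nodeId, score "
        "RETURN gds.util.asNode(nodeId).name AS entity, score "
        "ORDER BY score DESC LIMIT 5"
    )
    print([record.data() for record in page_rank])

    # Community detection: Weakly Connected Components (WCC).
    wcc = session.run(
        "CALL gds.wcc.stream('briefings') YIELD nodeId, componentId "
        "RETURN componentId, count(*) AS size ORDER BY size DESC LIMIT 5"
    )
    print([record.data() for record in wcc])

    # Similarity: node similarity between pairs of entities.
    similar = session.run(
        "CALL gds.nodeSimilarity.stream('briefings') "
        "YIELD node1, node2, similarity "
        "RETURN gds.util.asNode(node1).name AS a, "
        "gds.util.asNode(node2).name AS b, similarity "
        "ORDER BY similarity DESC LIMIT 5"
    )
    print([record.data() for record in similar])

with driver.session() as session:
    load_triples(session, [("the Department of Transportation", "awarded",
                            "$12.6 million in grants")])
    run_algorithms(session)
driver.close()
```

The top-five result tables discussed above come from queries along these lines, just run against the full set of triples extracted from the ten briefings.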
Future Work

The example I’ve shown is just the tip of the iceberg when it comes to what’s possible with a fully productionized knowledge graph solution. For that, there are several things we’d need to enhance.

1. More data. Surprise, surprise, the data engineer wants more data. It’s unavoidable, though. This example used ten press briefings. After going through data cleanup and filtering, that amounted to only 3,823 nodes. That is grossly underwhelming in the world of knowledge graphs. We need node counts on the order of millions, perhaps billions.

2. Better data cleaning. The data cleaning I did was rudimentary, removing duplicates and stop-word entities like “we,” “I,” and “they”. However, entities like “today” or “All these things” also don’t provide much context. Cleaning this type of data as well would remove ambiguous nodes from our knowledge graph and clean up the algorithm search results.

3. Enhance data quality. As an extension of the last point, entities like “we” were removed from the relationship extraction results. However, “we” means something in the context of the document: “we” could mean “our administration” or “the U.S. government”. If we enhanced our data to attribute these stop-word entities to specific entities instead of losing them altogether, it could provide additional context to our knowledge graph.

4. Enhance data quality, part two. Sequels are rarely better than the original, but I promise this is more of a Winter Soldier sequel. We can also enhance our knowledge graph using additional NLP techniques. For example, the relationship extraction identified “President” and “the President” as separate entities. If we used an NLP entity extractor on top of our relationship extraction, we could combine those entities into one (see the sketch after this list). That would make our knowledge graph more connected, which would provide better results when running our search algorithms.
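As a taste of what that fourth item could involve, here is a small, hedged sketch of one naive way to fold surface variants like “President” and “the President” into a single entity before loading. It relies on simple string normalization only; the post proposes a proper NLP entity extractor, which would handle far more cases, and the canonicalize helper below is hypothetical rather than part of the project.

```python
# Sketch: naive entity canonicalization so that "President" and
# "the President" collapse into one node before loading into the graph.
# A real solution would use an NLP entity extractor / entity resolution.
import re

_DETERMINERS = {"the", "a", "an"}

def canonicalize(entity: str) -> str:
    """Return a normalized key for an extracted entity string."""
    words = re.sub(r"\s+", " ", entity.strip()).split(" ")
    if words and words[0].lower() in _DETERMINERS:
        words = words[1:]  # drop a leading determiner
    return " ".join(words).lower()

triples = [
    ("President", "spoke about", "the economy"),
    ("the President", "met with", "the Secretary"),
]

for subj, rel, obj in triples:
    print(f"({canonicalize(subj)}) -[{rel}]-> ({canonicalize(obj)})")
# Both subject variants now map to the same key: "president"
```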
Conclusion

Hopefully I’ve been able to convey why the field of knowledge graphs is so exciting, and specifically why adding NLP techniques like relationship extraction is a game changer for their capabilities. There’s so much information out there that it is nearly impossible to effectively consume it all. Knowledge graphs give us that capability, and automated extraction techniques make it easier than ever. The combination of knowledge graphs and NLP data extraction makes the intimidating tasks of text extraction, search, and analysis a relatively trivial pursuit.

So, the next time you find yourself overwhelmed by the seemingly insurmountable task of deriving insights from an exorbitant amount of data, just use knowledge graphs and NLP data extraction so you can approach the task as cool and as calm as Qui-Gon Jinn fighting Darth Maul.

Thanks so much for reading. If you want to duplicate this knowledge graph project, check out the GitHub repository here.

References

1. Kurzgesagt — …And We’ll Do it Again
2. Jack Be Nimble
3. Graph Data Modeling
4. Press Briefing by Press Secretary Jen Psaki, December 10, 2021
5. Open IE
6. PageRank
7. Centrality Algorithms
8. Community Detection Algorithms
9. Similarity Algorithms

Additional Articles by this Author

Real-Time Question & Answer App using Gesture Recognition and OpenCV
Real-Time Custom-Trained Object Detection App with OpenCV and TensorFlow
Real-Time Object Detection App with OpenCV and TensorFlow

Become a Knowledge Graph Jedi was originally published in Slalom Build on Medium, where people are continuing the conversation by highlighting and responding to this story.

Sem Onyalo