Build Knowledge Graph from unstructured corpus using Machine Learning
The problem of creating a knowledge graph from unstructured data is a well-known machine learning problem. No organization has achieved 100% accuracy for a completely enriched knowledge graph. I have a few findings that should help someone who is new to this get started.
Before moving on to the findings, I will walk you through the problem of building a knowledge graph from an unstructured corpus. Let's consider this scenario. Suppose we have a very small corpus:
"Apple was founded by Steve jobs and current CEO is Tim Cook. Apple launched several products like Ipad, iphone , MAC etc. "
The corpus may also contain very complex sentences. The problem is how to build a knowledge graph out of this unstructured corpus. If we create a generic knowledge graph, then our system should be able to answer questions like "Who founded Apple?", "What are the products launched by Apple?" etc.
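To make the goal concrete, here is a minimal sketch of what the finished graph for this small corpus could look like when stored as (subject, relation, object) triples. The relation names (`founded_by`, `current_ceo`, `launched`) and the `query` helper are my own illustrative assumptions, not a standard schema.

```python
# Toy knowledge graph stored as (subject, relation, object) triples.
# Relation names are illustrative assumptions, not a standard schema.
TRIPLES = [
    ("Apple", "founded_by", "Steve Jobs"),
    ("Apple", "current_ceo", "Tim Cook"),
    ("Apple", "launched", "iPad"),
    ("Apple", "launched", "iPhone"),
    ("Apple", "launched", "Mac"),
]


def query(subject=None, relation=None, obj=None):
    """Return all triples matching the given fields; None acts as a wildcard."""
    return [
        (s, r, o)
        for (s, r, o) in TRIPLES
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]


# "Who founded Apple?"
founders = [o for (_, _, o) in query(subject="Apple", relation="founded_by")]
# "What are the products launched by Apple?"
products = [o for (_, _, o) in query(subject="Apple", relation="launched")]
```

The hard part, and the subject of the rest of this post, is producing those triples automatically from raw text.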
A few techniques to create a knowledge graph :
1.) Supervised Technique :
Supervised models used in the field of information extraction formulate the problem as a classification problem, and they generally learn a discriminative classifier given a set of positive and negative examples. Such approaches extract a set of features from the sentence, which generally include context words, part-of-speech tags, the dependency path between entities, edit distance, etc., and the corresponding labels are obtained from a large labelled training corpus.
- Sentence Segmentation : takes a raw corpus as input and splits it into sentences, i.e. a list of strings.
- Tokenization : takes the list of sentences and converts each one into tokens, i.e. a list of lists of strings.
- POS Tagging : converts the tokenized sentences into POS-tagged sentences, i.e. a list of lists of tuples.
- Entity detection : detects entities and chunks each sentence, producing a list of trees.
- Relation detection : classifies whether a particular relation holds for the given set of entities.
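The first four stages above can be sketched in plain Python to show the data shape each one produces. This is only a toy: the regex splitter, the `TOY_TAGS` lexicon and the proper-noun chunker are hypothetical stand-ins for trained models, and the corpus is a cleaned-up version of the sample above. Relation detection is left out here, since it needs a trained classifier.

```python
import re

corpus = ("Apple was founded by Steve Jobs and current CEO is Tim Cook. "
          "Apple launched several products like iPad, iPhone, Mac etc.")

# Hypothetical toy lexicon standing in for a trained POS tagger.
TOY_TAGS = {
    "Apple": "NNP", "Steve": "NNP", "Jobs": "NNP", "Tim": "NNP", "Cook": "NNP",
    "iPad": "NNP", "iPhone": "NNP", "Mac": "NNP", "CEO": "NN",
    "was": "VBD", "founded": "VBN", "launched": "VBD", "is": "VBZ", "by": "IN",
}


def segment(text):
    # Naive split after sentence-final punctuation -> list of strings.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentences):
    # Words and punctuation as separate tokens -> list of lists of strings.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]


def pos_tag(tokenized):
    # -> list of lists of (token, tag) tuples; unknown words get the tag "X".
    return [[(tok, TOY_TAGS.get(tok, "X")) for tok in sent] for sent in tokenized]


def chunk_entities(tagged):
    # Group consecutive proper nouns into ("ENTITY", [...]) subtrees -> list of trees.
    trees = []
    for sent in tagged:
        children, current = [], []
        for tok, tag in sent:
            if tag == "NNP":
                current.append((tok, tag))
            else:
                if current:
                    children.append(("ENTITY", current))
                    current = []
                children.append((tok, tag))
        if current:
            children.append(("ENTITY", current))
        trees.append(("S", children))
    return trees


sentences = segment(corpus)
tokens = tokenize(sentences)
tagged = pos_tag(tokens)
trees = chunk_entities(tagged)

# Flatten the entity subtrees back into strings for inspection.
entities = [" ".join(tok for tok, _ in node[1])
            for _, children in trees
            for node in children if node[0] == "ENTITY"]
```

In practice each of these functions would be replaced by a trained component from an NLP toolkit; the point here is only the sequence of shapes: strings, lists of strings, lists of tuples, trees.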
- It needs a set of relation types.
- A named entity tagger
- Lots of labeled data (split into a training set, a development set and a test set)
- Feature representation
- A classifier (Naïve Bayes, MaxEnt, SVM, …)
- Lightweight features – require little pre-processing
- Words: headwords, bag of words, bigrams (between, before or after)
- Entity type: PERSON, ORGANIZATION, FACILITY, LOCATION & Geo-Political Entity/GPE
- Entity level: NAME, NOMINAL & PRONOUN
- Medium-weight features – require base phrase chunking
- Base phrase chunk paths
- Bags of chunk heads
- Heavyweight features – require full syntactic parsing
- Dependency tree paths between entities
- Parse tree paths between entities
Pros :
- Can be adapted to a different domain
- High accuracy, given enough hand-labeled training data and test data that is similar enough to the training data
Cons :
- A large training set has to be labeled (expensive)
- Does not generalize well to different genres
- Extension to higher-order entity relations is difficult as well.
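As a rough illustration of the supervised setup, the sketch below extracts a few of the lightweight features listed above (entity types, words and bigrams between the entities, the word before and after) and feeds them to a small Naive Bayes classifier written from scratch. The four training sentences, the entity spans, the relation labels and the PRODUCT entity type are all made up for this example; a real system would need far more labeled data and a proper named entity tagger.

```python
import math
from collections import Counter, defaultdict


def extract_features(tokens, e1, e2):
    """Lightweight features for an entity pair.

    e1 and e2 are (start, end, entity_type) spans, end exclusive, e1 before e2.
    """
    between = tokens[e1[1]:e2[0]]
    feats = ["type_pair=%s-%s" % (e1[2], e2[2])]
    feats += ["between=" + w.lower() for w in between]
    feats += ["bigram=%s_%s" % (a.lower(), b.lower())
              for a, b in zip(between, between[1:])]
    if e1[0] > 0:
        feats.append("before=" + tokens[e1[0] - 1].lower())
    if e2[1] < len(tokens):
        feats.append("after=" + tokens[e2[1]].lower())
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over string features."""

    def fit(self, examples):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in examples:
            self.label_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in feats:
                if f in self.vocab:  # features never seen in training are skipped
                    score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def example(sentence, e1, e2, label):
    return extract_features(sentence.split(), e1, e2), label


# Hypothetical hand-labeled training data.
train = [
    example("Apple was founded by Steve Jobs", (0, 1, "ORGANIZATION"), (4, 6, "PERSON"), "founded_by"),
    example("Microsoft was founded by Bill Gates", (0, 1, "ORGANIZATION"), (4, 6, "PERSON"), "founded_by"),
    example("Apple launched iPad", (0, 1, "ORGANIZATION"), (2, 3, "PRODUCT"), "launched"),
    example("IBM launched DB2", (0, 1, "ORGANIZATION"), (2, 3, "PRODUCT"), "launched"),
]
model = NaiveBayes().fit(train)

test_tokens = "Amazon was founded by Jeff Bezos".split()
test_feats = extract_features(test_tokens, (0, 1, "ORGANIZATION"), (4, 6, "PERSON"))
prediction = model.predict(test_feats)
```

Because the features are plain strings, swapping Naive Bayes for MaxEnt or an SVM only changes the classifier, not the feature extraction.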
2.) Semi-Supervised Technique :
One of the more popular algorithms of this kind is the Snowball algorithm.
1.) Start with a seed set R of tuples.
2.) Generate a set P of patterns from R. Compute support and confidence for each pattern in P and discard those patterns with low support or confidence.
3.) Generate a new set T of tuples matching the patterns in P. Compute the confidence of each tuple in T, and add to R the tuples t in T with conf(t) > threshold.
4.) Go back to step 2.
1.) Start with seed examples
3.) Grab the extracted patterns
In general, a pattern is a 5-tuple of the form : (left, tag1, mid, tag2, right)
5.) Using the patterns, scan the collection to generate new seed tuples
The initial seed tuple will be of the form : (tag1, tag2, tag3, tag4 etc)
Example : (organization, product, location etc), so a seed example may be (Apple, iPad, California) or (IBM, DB2, Armonk) etc.
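The bootstrapping loop can be sketched as below. This is a heavily simplified version under my own assumptions: seeds are (organization, location) pairs, the text is already entity-tagged and reduced to (left, org, mid, loc, right) occurrences, and a pattern is just the exact `mid` string (the left and right contexts are ignored, where the real algorithm compares weighted context vectors). The occurrences are invented for illustration.

```python
from collections import defaultdict

# Hypothetical pre-tagged occurrences: (left, org, mid, loc, right).
OCCURRENCES = [
    ("", "Apple", "is headquartered in", "California", "."),
    ("", "IBM", "is headquartered in", "Armonk", "."),
    ("", "Microsoft", "is headquartered in", "Redmond", "."),
    ("", "Apple", ", based in", "California", ","),
    ("", "Intel", ", based in", "Santa Clara", ","),
    ("", "Apple", "opened an office in", "California", "."),
    ("", "IBM", "opened an office in", "Boston", "."),
    ("", "Google", "opened an office in", "Boston", "."),
]
SEEDS = {("Apple", "California"), ("IBM", "Armonk")}


def snowball_iteration(occurrences, seeds, min_support=1, min_conf=0.8,
                       min_tuple_conf=0.8):
    # Assumes each organization has exactly one correct location.
    known_loc = dict(seeds)

    # Step 2: generate patterns from occurrences of seed tuples, with support.
    support = defaultdict(int)
    for _, org, mid, loc, _ in occurrences:
        if (org, loc) in seeds:
            support[mid] += 1

    # Pattern confidence = positive / (positive + negative), where a match is
    # negative if it contradicts the known location of a seed organization.
    patterns = {}
    for pattern, sup in support.items():
        pos = neg = 0
        for _, org, mid, loc, _ in occurrences:
            if mid == pattern and org in known_loc:
                if known_loc[org] == loc:
                    pos += 1
                else:
                    neg += 1
        conf = pos / (pos + neg)
        if sup >= min_support and conf >= min_conf:
            patterns[pattern] = conf

    # Step 3: extract tuples with the kept patterns.
    # Tuple confidence = 1 - product of (1 - conf(P)) over matching patterns.
    miss = defaultdict(lambda: 1.0)
    for _, org, mid, loc, _ in occurrences:
        if mid in patterns:
            miss[(org, loc)] *= 1.0 - patterns[mid]
    new_tuples = {t for t, m in miss.items() if 1.0 - m > min_tuple_conf}
    return patterns, seeds | new_tuples


# Step 4: repeat until no new tuples are added.
tuples = set(SEEDS)
while True:
    patterns, expanded = snowball_iteration(OCCURRENCES, tuples)
    if expanded == tuples:
        break
    tuples = expanded
```

On this toy data the pattern "opened an office in" is generated from a seed occurrence but then discarded, because it also matches (IBM, Boston), which contradicts the seed (IBM, Armonk). That filtering step is what keeps a bad pattern from pulling wrong tuples into the seed set.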
Pros :
- Avoids manually labeling lots of data
Cons :
- Requires seeds for each relation (the quality of the original set of seeds is important)
- Big problem of semantic drift at each iteration
- Precision is not high
3.) Distant Supervision Approach :
It uses a database of relations (Freebase) to get lots of training examples. We build a feature vector in the training phase for an ‘unrelated’ relation by randomly selecting entity pairs that do not appear in any Freebase relation and extracting features for them.
We use a multi-class logistic classifier optimized using L-BFGS with Gaussian regularization. Our classifier takes as input an entity pair and a feature vector, and returns a relation name and a confidence score based on the probability of the entity pair belonging to that relation. Once all of the entity pairs discovered during testing have been classified, they can be ranked by confidence score and used to generate a list of the n most likely new relation instances.
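The labeling step can be sketched as follows. The tiny `KB` dictionary is a hypothetical stand-in for Freebase, the `MENTIONS` list is invented, and the only feature used is the text between the two entities. The underlying assumption is that any sentence mentioning both entities of a known pair expresses that pair's relation; features from all sentences for the same pair are pooled into one example.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for a relation database: (entity1, entity2) -> relation.
KB = {
    ("Apple", "Steve Jobs"): "founded_by",
    ("Apple", "Tim Cook"): "ceo",
    ("Microsoft", "Bill Gates"): "founded_by",
}

# Hypothetical entity-pair mentions found in text: (entity1, entity2, words between).
MENTIONS = [
    ("Apple", "Steve Jobs", "was founded by"),
    ("Apple", "Steve Jobs", "co-founder"),
    ("Apple", "Tim Cook", "chief executive"),
    ("Microsoft", "Bill Gates", "was started by"),
    ("Apple", "Bill Gates", "rival"),
    ("Microsoft", "Tim Cook", "competitor"),
]


def build_training_set(kb, mentions, negative_rate=0.5, seed=0):
    """Label entity pairs from the KB; sample the rest as 'unrelated'."""
    rng = random.Random(seed)

    # Pool the features of every sentence that mentions the same pair.
    features = defaultdict(list)
    for e1, e2, between in mentions:
        features[(e1, e2)].append("between=" + between)

    examples = []
    for pair, feats in features.items():
        if pair in kb:
            examples.append((pair, feats, kb[pair]))
        elif rng.random() < negative_rate:
            # Pair appears in no KB relation: randomly keep it as a negative.
            examples.append((pair, feats, "unrelated"))
    return examples


# negative_rate=1.0 keeps every negative pair, so this tiny demo is deterministic.
examples = build_training_set(KB, MENTIONS, negative_rate=1.0)
labels = {pair: label for pair, _, label in examples}
pooled = {pair: feats for pair, feats, _ in examples}
```

The resulting (features, label) examples are what the multi-class classifier described above would be trained on. No sentence was labeled by hand, which is also why some labels will be wrong: "co-founder" happens to fit `founded_by`, but a sentence about Steve Jobs leaving Apple would get the same label.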
Pros :
- Leverages unlimited amounts of text data
- Allows for very large number of weak features
- Not sensitive to training corpus: genre-independent