Sha Li

-- Once you've got a task to do, it's better to do it than live with the fear of it.
17 Sep 2018

Pattern-Based Information Extraction

Pattern-Based Information Extraction

There are many cases where we would like to extract information from unstructured text data: for example, finding out the cast of a movie or identifying the viewer’s attitude towards different aspects (plot, acting, special effects) from his/her review.

An Initial Attempt: Hearst Patterns

A straightforward idea is to look for “textual patterns” in the original corpus. A special case of information extraction is to extract word pairs that have hypernym-hyponym relationship such as (“animal”,”dog”). The textual pattern “X, such as Y” is a good indicator of this type of relationship: whenever we see the pattern “X, such as Y” appear in text, we can automatically infer that X is the hypernym of Y. Similarly, we can also discover a set of patterns that are useful for identifying other types of relationships (Part-of, synonym, causal etc.) between word pairs. Hearst 1 was the first to outline a procedure for discovering lexico-syntactic patterns for relation extraction but this procedure was not fully automated.

The procedure for uncovering new patterns:

  1. Decide on a lexical relation of interest. For example, the hypernym-hyponym relationship or the causal relationship.
  2. Collect a list of terms where this relation is known to hold. The list can be bootstrapped from manually selected seeds or from an existing resource.
  3. Find places in the corpus where the target word pairs occur syntactically near each other.
  4. Find the commonalities among the environments from the last step and transform the environment into a pattern.
  5. Once the pattern has been identified, use the new pattern to find new pairs of the target relation. (Step 2.)

An example of a pattern for hypernym-hyponym extraction:

There are a few places where we can improve upon. In step 2, where does our initial positive pair set come from? In step 3, even if the positive pair appear close to each other, the sentence may not be related to the target relation, so how are we going to eliminate the noisy patterns and focus on the high quality patterns? In step 4, how do we define “patterns” and how is the pattern distilled from the context? Later we can see that the definition of “pattern” can be rather broad and not restricted to matching the surface form of text.

Dependency Path Based Patterns

2 purposes to formalize the pattern space based on dependency paths. The expressibillity of patterns is directly connected to how we define the pattern space or the formal language of patterns. Hearst’s pattern language is close to a regular language, with an alphabet including words and noun phrases.

Every edge in a dependency parse tree can be represented as a 5 element tuple: Here, the category is a POS tag, such as N for noun and V for verb. Relation is the syntactic relationship between two words: OBJ for object, MOD for modifier etc. Now we can define our pattern space to be the shortest paths between two nouns in a dependency tree where the path length is not longer than 4. The Hearst patterns described in the previous section can be converted to patterns using linearized dependency parses.

Dependency Paths

The paper’s goal is to build a classifier when given a pair of words could determine whether they are related by hypernym or not. Patterns would serve as features to this classifier. The most useful features would naturally be the most reliable patterns to support unknown hypernym pair extraction.

###Typed Patterns

The Never-Ending Language Learner(NELL) is built to learn to read the web, each day a bit better than the day before, acquiring a huge knowledge base of facts as it learns. A critical component of NELL is its relation classification module, where pairs of noun phrases are classified to types of relations using three classification functions: CPL, OpenEval and SEAL. The first two are based on features from the text context and the last is based on the hyperlink structure of web pages. CPL, standing for coupled pattern learner, works in a self-supervised manner. It starts by training classifiers using a small amount of labeled data, then uses these classifiers to label more unlabeled data, bootstrapping the model along the way. Part-of-speech-tag heuristics are used to limit extraction to instances that appear to be noun phrases and patterns that are likely to be informative.

The initial version of NELL works on a pre-specified set of relations with type signatures. Later, the NELL architecture was extended to automatically compute new relation types (beyond the prespecified ones) for a given type signature of arguments, based on a clustering.

PATTY4 uses SOL patterns, which allows syntactic features(POS tags), ontological type signatures and lexical features(words). The above example is a pattern with types $PERSON and $SONG, the word “voice” and the POS tag ADJ, * is a wildcard symbol, standing for any sequence of words.

Such SOL patterns are derived from dependency paths. Whenever two entities are detected to co-occur in a sentence, the shortest dependency path is extracted. Adverbial and adjectival modifiers are included in the patterns as well to better capture the context. At this time, we get what the paper refers to as “textual patterns”. A SOL pattern is different from textual patterns in that only frequent n-grams are kept as lexicons and others are replaced by the wildcard symbol *. To find the frequent n-grams, they use frequent itemset mining where each sentence is treated as a “shopping basket”.

PATTY also organizes patterns into a subsumption hierarchy. For each pair of patterns, if the support set of pattern A is a superset of the support set of pattern B, then A is semantically more general than B. Suppose we have a huge dataset that covers all the possible entities, this definition naturally makes sense. However, in reality, two patterns may appear to have the same support set but not be synonymous simply because the corpus is rather small. To take this into consideration, the notion of soft set inclusion is used. The support set $S$ is viewed as a random sample from the true support set $S’$ given that the corpus is infinitely large. Then we can compute the Wilson score interval for the ratio of the target support set $S$ w.r.t. another support set $B$ .

Now that we can define the relationship between a pair of patterns, naively, we can build the pattern taxonomy by comparing the support set for all patterns. A more efficient way would be to build a prefix-tree to store the support set of patterns (similar to a FP-tree). After removing the potential cycles in the graph, we can get a DAG which is the final pattern taxonomy.

MetaPAD5 extracts patterns from the text by segmentation instead of parsing. The advantage is that more context is retained, but it is harder to capture relations between entities that are far away in the original sentence. This segmentation approach is similar to that in SegPhrase where frequent sequential patterns are treated as candidates and then scored using features of frequency, concordance, informativeness and completeness. Also, different ways of segmenting the original corpus would lead to different frequency counts, so it is necessary to run the segmentation for multiple rounds (usually two).

Instead of the soft subsumption in PATTY, this paper explicitly groups synonymous patterns. Recall that in [1] and [2], this is not a problem since only one type of relation is considered so it is sufficient to treat it as binary classification. They also point out that apart from having the same extractions, the context words in pattern itself is a good indicator of pattern semantics and can be used to group patterns. Of course, since they utilize a type hierarchy, the types are required to be compatible. A support vector regression machine is used to learn the similarity between patterns and then a clique detection method is applied to find all pattern groups.

MetaPAD also considers the type hierarchy in producing patterns with type signatures. For example $POLITICIAN is a fine-grained type under $PERSON. If we are interested in the president_of relationship, we should look for the fine grained type as in $POLITICIAN, president of $COUNTRY. If we are interested in the born_in relationship, it would be sufficient to stick to the coarse grained type $PERSON. The paper presents some metrics for topdown and bottomup type adjustment. The experiments show that neither is strictly dominant of the other and the preference depends on the dataset.

Open Information Extraction

6 and 7 are part of UW’s Open Information Extraction system. OpenIE aims at extracting tuples of (entity1, relation, entity2). All three fields may be extracted from the original text and need not be normalized. ReVerb places syntactic constraints and lexical constraints on the relation phrase extraction and assigns scores to extracted tuples based on a list of features. ReNoun focuses on extracting nominal attributes and supplements the relations from ReVerb. ReNoun is bootstrapped by a small group of manually specified patterns to extract tuples of high confidence. In turn, these facts will be used to learn more dependency parse patterns from the corpus. In OpenIE, the focus is more on extracting accurate values from the corpus than to extract meaningful patterns.

[1] Automatic Acquisition of Hyponyms from Large Text Corpora. Marti A. Hearst. COLING 1992.

[2] Learning Syntactic Patterns for Automatic Hypernym Discovery. Rion Snow et al. NIPS 2004.

[3] Never-Ending Learning. T. Mitchell et al. AAAI 2015

[4] PATTY: A Taxonomy of Relational Patterns with Semantic Types. Ndapandula Nakashole, Gerhard Weikum, Fabian M. Suchanek. EMNLP 2012

[5] MetaPAD: Meta Pa‚ttern Discovery from Massive Text Corpora. Meng Jiang et al. 2017 KDD

[6] ReVerb: “Identifying Relations for Open Information Extraction.” Fader, Anthony, Stephen Soderland and Oren Etzioni. EMNLP 2011.

[7] ReNoun: Fact Extraction for Nominal Attributes EMNLP 2014

Til next time,
Zoey at 00:00