spaCy rule-based phrase matching for "hello world"


spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT). The rule matcher also lets you pass in a custom callback to act on matches, for example, to merge entities and apply custom labels. You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation.

If you need to match large terminology lists, you can use the PhraseMatcher, which accepts Doc objects as match patterns instead of token patterns and is much more efficient overall. The Doc patterns can contain single or multiple tokens.

When using the REGEX operator, keep in mind that it operates on single tokens, not the whole text. Each expression you provide will be matched on a token. If you need to match on the whole text instead, see the details on regex matching on the whole text below.

The dependency matcher may be slow when token patterns can potentially match many tokens in the sentence, or when relation operators allow longer paths in the dependency parse, e.g. <<, >>, .* and ;*.
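Since the custom callback is only mentioned in passing above, here is a minimal sketch of an on_match callback (the pattern, the "GREETING" label, and the sample sentence are illustrative assumptions, not from the original):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

def on_match(matcher, doc, i, matches):
    # Called once for each match; i is the index into the matches list.
    match_id, start, end = matches[i]
    print("Callback saw:", doc[start:end].text)

# Illustrative pattern: "hello" followed by "world"
matcher.add("GREETING", [[{"LOWER": "hello"}, {"LOWER": "world"}]], on_match=on_match)
matcher(nlp("Hello world, and hello world again."))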

When writing patterns, keep in mind that each dictionary represents one token. If spaCy’s tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce any results. When developing complex patterns, make sure to check examples against spaCy’s tokenization:

doc = nlp("A complex-example,!")
print([token.text
   for token in doc
])

Example

# Matches "love cats"
or "likes flowers"
pattern1 = [{
      "LEMMA": {
         "IN": ["like", "love"]
      }
   },
   {
      "POS": "NOUN"
   }
]

# Matches tokens of length >= 10
pattern2 = [{"LENGTH": {">=": 10}}]

# Match based on morph attributes
pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}]
# "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets
# "Number=Plur|Gender=Neut" will not match
# "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset
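A minimal sketch (the sample sentence is illustrative) showing how the patterns above are registered and run with the Matcher:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("LIKE_LOVE", [pattern1])    # each string ID takes a list of patterns
matcher.add("LONG_TOKEN", [pattern2])

doc = nlp("She loves cats and extraordinarily long words.")
for match_id, start, end in matcher(doc):
    # Convert the match_id hash back to its string label
    print(nlp.vocab.strings[match_id], doc[start:end].text)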

There are many ways to do this and the most straightforward one is to create a dict keyed by characters in the Doc, mapped to the token they’re part of. It’s easy to write and less error-prone, and gives you a constant lookup time: you only ever need to create the dict once per Doc.

chars_to_tokens = {}
for token in doc:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i

You can then look up a character at a given position and get the index of the corresponding token that the character is part of. Your span would then be doc[token_start:token_end]. If a character isn't in the dict, it means it's the whitespace that tokens are split on. That hopefully shouldn't happen, though, because it'd mean your regex is producing matches with leading or trailing whitespace.

span = doc.char_span(start, end)
if span is not None:
    print("Found match:", span.text)
else:
    start_token = chars_to_tokens.get(start)
    end_token = chars_to_tokens.get(end)
    if start_token is not None and end_token is not None:
        span = doc[start_token:end_token + 1]
        print("Found closest match:", span.text)
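Putting the pieces together, a minimal sketch (the regex and sample text are illustrative assumptions) that produces character offsets with re.finditer over the raw text and maps them back to token spans:

import re
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America is commonly abbreviated as U.S.")

# Illustrative regex over the raw text of the Doc
for match in re.finditer(r"United States", doc.text):
    start, end = match.span()          # character offsets into doc.text
    span = doc.char_span(start, end)   # None if offsets don't align with token boundaries
    if span is not None:
        print("Found match:", span.text)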

Example

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}]

When working with entities, you can use displaCy to quickly generate a NER visualization from your updated Doc, which can be exported as an HTML file:

from spacy import displacy

html = displacy.render(doc, style="ent", page=True, options={"ents": ["EVENT"]})
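The render call returns the markup as a string, so exporting it is just a file write (the filename here is illustrative):

with open("ner_visualization.html", "w", encoding="utf-8") as f:
    f.write(html)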
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
   "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
   span = doc[start: end]
print(span.text)

By default, the PhraseMatcher will match on the verbatim token text, e.g. Token.text. By setting the attr argument on initialization, you can change which token attribute the matcher should use when comparing the phrase pattern to the matched Doc. For example, using the attribute LOWER lets you match on Token.lower and create case-insensitive match patterns:

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
matcher.add("Names", patterns)

doc = nlp("angela merkel and us president barack Obama")
for match_id, start, end in matcher(doc):
   print("Matched based on lowercase token text:", doc[start: end])

Another possible use case is matching number tokens like IP addresses based on their shape. This means that you won’t have to worry about how those strings will be tokenized and you’ll be able to find tokens and combinations of tokens based on a few examples. Here, we’re matching on the shapes ddd.d.d.d and ddd.ddd.d.d:

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", [nlp("127.0.0.1"), nlp("127.127.0.0")])

doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
for match_id, start, end in matcher(doc):
   print("Matched based on token shape:", doc[start: end])

Suggestion : 2

To match both "hello world" and "hello, world", you may use:

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
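A quick check (spaCy v3 API; the sample sentences are illustrative) that the optional-punctuation pattern matches both variants:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", [pattern])

for text in ("Hello world!", "Hello, world!"):
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        print(repr(text), "->", doc[start:end].text)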

Suggestion : 3

In this article, we are going to learn about the rule-based matching features in NLP.

Unlike regular expressions, where we get output for a fixed pattern, this lets us match a word, a phrase, or sometimes a whole sentence according to a predefined pattern.

How does this work? A matcher object is created from the Matcher class using nlp.vocab. We then add a pattern to be matched, along with a unique ID and an optional callback function. The pattern is given as a list of dictionaries, where each dictionary represents one token.

As we will see, the first output value is the match_id, followed by the start and end token indices of the matched span, and finally we print the text. The returned match_id is an integer hash value, which we can convert back to its Unicode string using nlp.vocab.strings.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

We will be using the small English model. Load it into the nlp variable using the load() function.

nlp = spacy.load("en_core_web_sm")

Instantiate a Matcher object with the vocab object from the Language we just created.

matcher = Matcher(nlp.vocab)
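The pattern referenced below is never shown in the original post; as an illustrative stand-in (an assumption, not the author's original), here is a simple pattern matching the token "Hello" followed by a punctuation mark:

# Hypothetical pattern standing in for the one missing from the original post
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}]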
Now add the pattern to the matcher along with a unique ID (in spaCy v3 the optional callback is passed as the on_match keyword argument, so the old positional None is dropped):

matcher.add("Matching", [pattern])

Take a string and store it in the doc variable:

doc = nlp("Hello, Good morning How was your day!")
print(doc)

# Output:
   Hello, Good morning How was your day!
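To complete the example, run the matcher over the doc and convert each match_id hash back to its string label via nlp.vocab.strings (a minimal sketch following the steps above):

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # "Matching"
    print(match_id, string_id, start, end, doc[start:end].text)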

Suggestion : 4

I am doing rule-based phrase matching in spaCy. I am trying the following example, but it is not working: the final matches list comes back empty. Would you please correct me?

Example

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello world!')

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

matcher = Matcher(nlp.vocab)
matcher.add('HelloWorld', [pattern])  # spaCy v3 API (v2 used: matcher.add('HelloWorld', None, pattern))

matches = matcher(doc)
print(matches)  # prints []: the pattern requires a punctuation token between "hello" and "world"

Your pattern matches "Hello, world" with a punctuation token in the middle, not "Hello world". To match both "hello world" and "hello, world", make the punctuation token optional:

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]

Suggestion : 5

December 31, 2020

First of all, make sure that you have installed the spaCy library and downloaded the en_core_web_sm model as follows.

pip install -U spacy
python -m spacy download en_core_web_sm

Let’s begin by reading the data and importing the libraries.

# reading the data
data = open('11-0.txt').read()

# if you get an error, try the following:
# data = open('11-0.txt', encoding='cp850').read()

import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(data)

Let’s say we want to find phrases starting with the word Alice followed by a verb.

#initialize matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "Alice" and a verb.
# TEXT is for the exact match and VERB for a verb.
pattern = [{"TEXT": "Alice"}, {"POS": "VERB"}]

# Add the pattern to the matcher.
# The first argument is a unique ID for the pattern ("alice").
# In spaCy v3 the optional callback is passed as the on_match keyword.
matcher.add("alice", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start: end].text
   for match_id, start, end in matches
])

Find adjectives followed by a noun.

matcher = Matcher(nlp.vocab)

pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]

matcher.add("id1", [pattern])
matches = matcher(doc)

# We will show you the first 20 matches
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {
   'grand words',
   'hot day',
   'legged table',
   'dry leaves',
   'great delight',
   'low hall',
   'own mind',
   'many miles',
   'little girl',
   'good opportunity',
   'right word',
   'long passage',
   'other parts',
   'low curtain',
   'large rabbit',
   'pink eyes',
   'several things',
   'golden key',
   'little door'
}

Suggestion : 6

Regular expressions allow you to find fixed patterns in text. With rule-based matching in spaCy, text is instead matched using tokens, phrases, entities, etc. against a set of predefined patterns. To achieve this you use the spaCy Matcher.

In this section, you will learn how to extract information by matching text against defined patterns. Before going to the demonstration part, make sure you have installed spaCy on your system, and follow all the steps for better understanding.

Let's create a pattern that will be used to match against the document and find text accordingly. For example, to find an email address, I will define the pattern as shown below. You can also define more than one pattern and find its text in your document; for example, to also find all names or proper nouns in the text, I would use the pattern [{"POS": "PROPN"}].

import spacy
from spacy.matcher import Matcher

Next, load the small English model (if you haven't downloaded it yet, run python -m spacy download en_core_web_sm in your terminal).

nlp = spacy.load("en_core_web_sm")

The third step is to pass the pipeline's vocabulary into the Matcher() constructor.

matcher = Matcher(nlp.vocab)
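The email pattern itself is missing from the post as scraped; a plausible stand-in (an assumption, not the author's original) uses spaCy's built-in LIKE_EMAIL token flag:

# Matches any single token that looks like an email address
pattern = [{"LIKE_EMAIL": True}]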

After that, add the pattern to the Matcher so it can be used for finding the text:

matcher.add("EMAIL", [pattern])
text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)
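To finish the example, run the matcher over the doc and print the matched spans (a sketch completing the steps above; note that the email address in the sample text was redacted by the page this snippet came from):

matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end].text)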