spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT). The rule matcher also lets you pass in a custom callback to act on matches – for example, to merge entities and apply custom labels. You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation.

To match large terminology lists, you can use the PhraseMatcher, which accepts Doc objects as match patterns instead of token patterns – this is much more efficient overall. The Doc patterns can contain single or multiple tokens.

When using the REGEX operator, keep in mind that it operates on single tokens, not the whole text. Each expression you provide will be matched on a token. If you need to match on the whole text instead, see the details on regex matching on the whole text.

The dependency matcher may be slow when token patterns can potentially match many tokens in the sentence or when relation operators allow longer paths in the dependency parse, e.g. <<, >>, .* and ;*.
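Because REGEX applies per token, an expression that spans several words can never match. A minimal pure-Python sketch (using the `re` module directly, without spaCy) illustrates the difference:

```python
import re

# Tokens as a tokenizer might produce them for "the US is a country"
tokens = ["the", "US", "is", "a", "country"]

# A per-token expression is tested against each token's text individually...
per_token = [t for t in tokens if re.search(r"^[Uu]\.?[Ss]\.?$", t)]
print(per_token)    # ['US']

# ...so an expression spanning several words never matches any single token
cross_token = [t for t in tokens if re.search(r"United States", t)]
print(cross_token)  # []
```

This is why a multi-word regex must instead be run over the whole text, with the character offsets mapped back to tokens.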
When writing patterns, keep in mind that each dictionary represents one token. If spaCy’s tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce any results. When developing complex patterns, make sure to check examples against spaCy’s tokenization:
doc = nlp("A complex-example,!")
print([token.text for token in doc])
Example
# Matches "love cats" or "likes flowers"
pattern1 = [{"LEMMA": {"IN": ["like", "love"]}}, {"POS": "NOUN"}]

# Matches tokens of length >= 10
pattern2 = [{"LENGTH": {">=": 10}}]

# Match based on morph attributes
pattern3 = [{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}]
# "", "Number=Sing" and "Number=Sing|Gender=Neut" will match as subsets
# "Number=Plur|Gender=Neut" will not match
# "Number=Sing|Gender=Neut|Polite=Infm" will not match because it's a superset
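The IS_SUBSET semantics in pattern3 can be sketched in plain Python to make the three comments above concrete (a simplified stand-in for spaCy's own morph comparison, assuming features are `|`-separated strings):

```python
def morph_is_subset(token_morph: str, allowed: list) -> bool:
    # A token's morph string (e.g. "Number=Sing|Gender=Neut") matches
    # IS_SUBSET if every one of its features appears in the allowed list.
    feats = set(token_morph.split("|")) if token_morph else set()
    return feats.issubset(allowed)

allowed = ["Number=Sing", "Gender=Neut"]
print(morph_is_subset("", allowed))                                     # True
print(morph_is_subset("Number=Sing", allowed))                          # True
print(morph_is_subset("Number=Sing|Gender=Neut", allowed))              # True
print(morph_is_subset("Number=Plur|Gender=Neut", allowed))              # False
print(morph_is_subset("Number=Sing|Gender=Neut|Polite=Infm", allowed))  # False (superset)
```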
There are many ways to do this and the most straightforward one is to create a dict keyed by characters in the Doc, mapped to the token they're part of. It's easy to write and less error-prone, and gives you a constant lookup time: you only ever need to create the dict once per Doc.
chars_to_tokens = {}
for token in doc:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i
You can then look up a character at a given position and get the index of the corresponding token that the character is part of. Your span would then be doc[token_start:token_end]. If a character isn't in the dict, it means it's the (white)space tokens are split on. That hopefully shouldn't happen, though, because it'd mean your regex is producing matches with leading or trailing whitespace.
span = doc.char_span(start, end)
if span is not None:
    print("Found match:", span.text)
else:
    start_token = chars_to_tokens.get(start)
    end_token = chars_to_tokens.get(end)
    if start_token is not None and end_token is not None:
        span = doc[start_token:end_token + 1]
        print("Found closest match:", span.text)
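The same lookup can be walked through end to end in pure Python, using a stand-in token class instead of a real Doc (assumptions: simple whitespace tokenization and a toy Token with the same idx/i attributes; note that a regex match's end offset is exclusive, so the last matched character is at end - 1):

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    idx: int   # character offset of the token in the text
    i: int     # token index

text = "spaCy is great"
tokens, offset = [], 0
for i, word in enumerate(text.split(" ")):
    tokens.append(Token(word, offset, i))
    offset += len(word) + 1

# Build the character-to-token table exactly as above
chars_to_tokens = {}
for token in tokens:
    for c in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[c] = token.i

match = re.search(r"is great", text)
start, end = match.span()
start_token = chars_to_tokens.get(start)
end_token = chars_to_tokens.get(end - 1)  # end is exclusive: look at the last character
print(tokens[start_token].text, tokens[end_token].text)  # is great
```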
Example
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}]
When working with entities, you can use displaCy to quickly generate a NER visualization from your updated Doc, which can be exported as an HTML file:
from spacy import displacy

html = displacy.render(doc, style="ent", page=True, options={"ents": ["EVENT"]})
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
By default, the PhraseMatcher will match on the verbatim token text, e.g. Token.text. By setting the attr argument on initialization, you can change which token attribute the matcher should use when comparing the phrase pattern to the matched Doc. For example, using the attribute LOWER lets you match on Token.lower and create case-insensitive match patterns:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
matcher.add("Names", patterns)

doc = nlp("angela merkel and us president barack Obama")
for match_id, start, end in matcher(doc):
    print("Matched based on lowercase token text:", doc[start:end])
Another possible use case is matching number tokens like IP addresses based on their shape. This means that you won't have to worry about how those strings will be tokenized and you'll be able to find tokens and combinations of tokens based on a few examples. Here, we're matching on the shapes ddd.d.d.d and ddd.ddd.d.d:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", [nlp("127.0.0.1"), nlp("127.127.0.0")])

doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])
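For intuition, a rough pure-Python approximation of what SHAPE computes for numeric tokens (spaCy's real token.shape_ also maps letters to x/X and truncates long character runs, which this sketch ignores):

```python
def digit_shape(text: str) -> str:
    # Digits become "d"; other characters (here, the dots) pass through.
    return "".join("d" if ch.isdigit() else ch for ch in text)

print(digit_shape("127.0.0.1"))    # ddd.d.d.d
print(digit_shape("192.168.1.1"))  # ddd.ddd.d.d
```

Both example IPs above therefore reduce to the same two shapes the matcher was given, which is why the new addresses are found.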
To match both hello world and hello, world, you can use
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
In this article, we are going to learn about rule-based matching features in NLP. Unlike regular expressions, where we get output for one fixed pattern, rule-based matching lets us match a word, a phrase, or sometimes a whole sentence according to a predefined pattern.

How does this work? A Matcher object is created from nlp.vocab. We then add a pattern to be matched, along with a unique ID and an optional callback function. The pattern is given as a list of dictionaries, where each dictionary represents one token.

As we will see in the output, each match consists of a match_id followed by the start and end token indices of the matched span, which we use to print the text. The returned match_id is an integer hash value, which can be converted back to its string form using nlp.vocab.strings.
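To see why the integer match_id can be converted back to a string, here is a toy version of the vocabulary's string store (not spaCy's actual implementation, which uses a fixed 64-bit hash function):

```python
class ToyStringStore:
    """Maps strings to integer ids and back, loosely like nlp.vocab.strings."""

    def __init__(self):
        self._by_id = {}

    def add(self, s: str) -> int:
        h = hash(s)  # stand-in for spaCy's 64-bit string hash
        self._by_id[h] = s
        return h

    def __getitem__(self, h: int) -> str:
        return self._by_id[h]

strings = ToyStringStore()
match_id = strings.add("Matching")  # the matcher stores the rule name as a hash
print(strings[match_id])            # Matching
```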
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
We will use spaCy's small English model, loading it into the nlp variable with the load() function.
nlp = spacy.load("en_core_web_sm")
Instantiate a Matcher object with the vocab object from the Language we created.
matcher = Matcher(nlp.vocab)
matcher.add("Matching", [pattern])
Take a string and store the processed Doc in the doc variable:
doc = nlp("Hello, Good morning How was your day!")
print(doc)
# Output: Hello, Good morning How was your day!
I am doing rule-based phrase matching in spaCy. I am trying the following example but it is not working – the final matches list comes back empty. Would you please correct me?

Answer: your pattern matches Hello, world with a punctuation token in the middle, not Hello world.
Example
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello world!')
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher = Matcher(nlp.vocab)
matcher.add('HelloWorld', [pattern])
matches = matcher(doc)
print(matches)  # [] – there is no punctuation between "Hello" and "world", so nothing matches
To match both hello world and hello, world, you may use
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
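To see why the "?" operator makes both variants match, here is a toy re-implementation of optional-token matching over token attribute dicts. This is an illustration of the semantics only, not spaCy's actual matching algorithm; tokens are represented as plain dicts with the attributes the pattern checks:

```python
def match_pattern(tokens, pattern):
    """Return True if pattern matches the full token sequence."""
    def match_at(ti, pi):
        if pi == len(pattern):
            return ti == len(tokens)
        spec = pattern[pi]

        def token_ok(tok):
            # Every attribute in the spec (except OP) must match the token
            return all(tok.get(k) == v for k, v in spec.items() if k != "OP")

        if spec.get("OP") == "?":
            # Optional token: first try skipping it entirely
            if match_at(ti, pi + 1):
                return True
        if ti < len(tokens) and token_ok(tokens[ti]):
            return match_at(ti + 1, pi + 1)
        return False

    return match_at(0, 0)

hello_world = [{"LOWER": "hello", "IS_PUNCT": False},
               {"LOWER": "world", "IS_PUNCT": False}]
hello_comma_world = [{"LOWER": "hello", "IS_PUNCT": False},
                     {"LOWER": ",", "IS_PUNCT": True},
                     {"LOWER": "world", "IS_PUNCT": False}]
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]

print(match_pattern(hello_world, pattern))        # True
print(match_pattern(hello_comma_world, pattern))  # True
```

Without "OP": "?", the first sequence fails, which is exactly the bug in the question above.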
December 31, 2020
First of all, be sure that you have installed the spaCy library and downloaded the en_core_web_sm model as follows.
pip install -U spacy
python -m spacy download en_core_web_sm
Let’s begin by reading the data and importing the libraries.
# reading the data
data = open('11-0.txt').read()
# if you get an error try the following
# data = open('11-0.txt', encoding='cp850').read()

import spacy
# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(data)
Let’s say we want to find phrases starting with the word Alice followed by a verb.
# initialize matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "Alice" and a verb
# TEXT is for the exact match and VERB for a verb
pattern = [{"TEXT": "Alice"}, {"POS": "VERB"}]

# Add the pattern to the matcher.
# The first argument is a unique id for the pattern ("alice"),
# the second is the list of patterns.
matcher.add("alice", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
Find adjectives followed by a noun.
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
# We will show you the first 20 matches
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))
Matches: {
'grand words',
'hot day',
'legged table',
'dry leaves',
'great delight',
'low hall',
'own mind',
'many miles',
'little girl',
'good opportunity',
'right word',
'long passage',
'other parts',
'low curtain',
'large rabbit',
'pink eyes',
'several things',
'golden key',
'little door'
}
Regular expressions allow you to find a fixed pattern in text. With rule-based matching in spaCy, however, text is matched using tokens, phrases, entities, etc. against a set of predefined patterns. To achieve this you use the spaCy Matcher.

Let's create a pattern that will be used to match against the entire document and find text accordingly. For example, to find an email address, we will define a suitable pattern. You can also define more than one pattern and find the corresponding text in your document; for example, to find all the names or proper nouns in the text, you can use the pattern [{"POS": "PROPN"}].

In this entire section, you will learn how to extract information by matching text against defined patterns. Before going to the demonstration part, make sure you have installed spaCy on your system, and follow all the steps for better understanding.
import spacy
from spacy.matcher import Matcher
To download the model, use the python -m spacy download command shown earlier in your terminal, then load it:
nlp = spacy.load("en_core_web_sm")
The third step is to pass the pipeline's vocabulary into the Matcher() constructor.
matcher = Matcher(nlp.vocab)
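The email pattern itself is not shown in the text above; a likely definition (an assumption on our part) uses spaCy's built-in LIKE_EMAIL token flag, which is true for tokens whose text resembles an email address:

```python
# One token whose text looks like an email address
pattern = [{"LIKE_EMAIL": True}]
```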
After that, add the pattern to the Matcher so it can be used for finding the text:
matcher.add("EMAIL", [pattern])
text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)
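As a rough illustration of what the EMAIL pattern does, here is a pure-Python sketch with a crude regex standing in for spaCy's LIKE_EMAIL flag, run over whitespace tokens (the sample address below is hypothetical, not the one from the text above):

```python
import re

# Crude stand-in for spaCy's LIKE_EMAIL flag
email_re = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
sample = "You can contact us at user@example.com for details"
emails = [tok for tok in sample.split() if email_re.fullmatch(tok)]
print(emails)  # ['user@example.com']
```

In the real pipeline, running matcher(doc) would return the matching span's start and end token indices instead.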