How to add tags to negated words in strings that follow "not", "no" and "never"

To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:

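import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"

# The outer pattern selects each span from a negation word up to the next
# punctuation mark; the inner substitution prefixes every following word.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)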

This prints:

It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !

The first step is to select the parts of your string you're interested in. This is done with

\b(?:not|never|no)\b[\w\s]+[^\w\s]

Now you are dealing with strings of the form "never going to work,". Within each of them, select the words preceded by whitespace with

(\s+)(\w+)

And replace them with what you want

\1NEG_\2
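The two steps above can also be wrapped in a reusable function. This is only a sketch: the names NEGATION_SPAN, FOLLOWING_WORD and tag_negated are made up for illustration, while the two patterns are the ones from the answer.

```python
import re

# Hypothetical helper names; the patterns are taken from the answer above.
NEGATION_SPAN = re.compile(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]', re.IGNORECASE)
FOLLOWING_WORD = re.compile(r'(\s+)(\w+)')


def tag_negated(text, prefix="NEG_"):
    """Prefix every word between a negation word and the next punctuation mark."""
    def tag_span(match):
        # Only words preceded by whitespace are rewritten, so the negation
        # word itself (which starts the span) is left untouched.
        return FOLLOWING_WORD.sub(
            lambda m: m.group(1) + prefix + m.group(2), match.group(0))
    return NEGATION_SPAN.sub(tag_span, text)


print(tag_negated("I did not like it at all, but she never complains."))
```

Compiling the patterns once avoids recompiling them on every call. Note that a span is only tagged if it ends in a punctuation character, so a negation at the very end of a string without closing punctuation is left as-is.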

Suggestion : 2

Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These "word classes" are not just the idle invention of grammarians, but are useful categories for many language processing tasks. As we will see, they arise from simple analysis of the distribution of words in text. The goal of this chapter is to answer the following questions:

Your Turn: Many words, like ski and race, can be used as nouns or verbs with no difference in pronunciation. Can you think of others? Hint: think of a commonplace object and try to put the word to before it to see if it can also be a verb, or think of an action and try to put the before it to see if it can also be a noun. Now make up a sentence with both uses of this word, and run the POS-tagger on this sentence.

Verbs are words that describe events and actions, e.g. fall, eat in 2.3. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.

The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in ed is the past participle of a verb, and any word ending with 's is a possessive noun. We can express these as a list of regular expressions:
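A minimal sketch of the pattern-list idea, using only the standard re module rather than NLTK's own tagger classes. The PATTERNS list, the tag names chosen for each pattern, and the regexp_tag function are illustrative assumptions, not part of the quoted chapter.

```python
import re

# Illustrative (pattern, tag) pairs, tried in order; the first match wins.
PATTERNS = [
    (r".*'s$", "NN$"),  # guess: possessive noun
    (r".*ed$", "VBN"),  # guess: past participle
    (r".*", "NN"),      # fallback: treat everything else as a noun
]


def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged


print(regexp_tag(["John's", "dog", "barked"]))
```

The order of the list matters: the catch-all pattern has to come last, otherwise it would shadow the more specific guesses.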

>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man day time year car moment world family house boy child country job
state girl place war way case question
>>> text.similar('bought')
made done put said found had seen given left heard been brought got
set was called felt in that told
>>> text.similar('over')
in on to of and for with from at by that into as up out down through
about all is
>>> text.similar('the')
a his this their its her an that our any all one these my in your no
some other and

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
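To make the word/TAG convention used above concrete, here is a rough standard-library sketch of how such a string can be split into a tuple. The function name split_tagged is hypothetical, and the upper-casing of the tag is an assumption made to mirror the output shown in the session; it is not a substitute for nltk.tag.str2tuple.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' at the last separator into a (word, tag) tuple."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator at all: return the token with no tag.
        return (token, None)
    # Assumption: tags are normalised to upper case.
    return (word, tag.upper())


sent = "The/AT grand/JJ jury/NN commented/VBD ,/, ./."
print([split_tagged(t) for t in sent.split()])
```

Splitting at the last separator rather than the first keeps tokens that themselves contain a slash intact.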

Suggestion : 3

In this case, the yes-pattern is never directly executed, and no no-pattern is allowed. Similar in spirit to (?{0}) but more efficient. See below for details. Full syntax: (?(DEFINE)definitions...)

A special form is the (DEFINE) predicate, which never executes its yes-pattern directly, and does not allow a no-pattern. This allows one to define subpatterns which will be executed only by the recursion mechanism. This way, you can define a set of regular expression rules that can be bundled into any pattern you choose.

Recall that which of yes-pattern or no-pattern actually matches is already determined. The ordering of the matches is the same as for the chosen subexpression.

In other words, once the (*COMMIT) has been entered, and if the pattern does not match, the regex engine will not try any further matching on the rest of the string.

There are a number of Unicode characters that match a sequence of multiple characters under /i. For example, LATIN SMALL LIGATURE FI should match the sequence fi. Perl is not currently able to do this when the multiple characters are in the pattern and are split between groupings, or when one or more are quantified. Thus

"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

# The below doesn't match, and it isn't clear what $1 and $2 would
# be even if it did!!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

The /n modifier prevents the grouping metacharacters () from capturing. This modifier, new in Perl 5.22, stops $1, $2, etc. from being filled in.

"hello" =~ /(hi|hello)/;   # $1 is "hello"
"hello" =~ /(hi|hello)/n;  # $1 is undef

This is equivalent to putting ?: at the beginning of every capturing group:

"hello" =~ /(?:hi|hello)/; # $1 is undef

/n can be negated on a per-group basis. Alternatively, named captures may still be used.

"hello" =~ /(?-n:(hi|hello))/n;   # $1 is "hello"
"hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is
                                  # "hello"

Flags described further in "Using regular expressions in Perl" in perlretut are:

c - keep the current position during repeated matching
g - globally match the pattern repeatedly in the string

Substitution-specific modifiers described in "s/PATTERN/REPLACEMENT/msixpodualngcer" in perlop are:

e  - evaluate the right-hand side as an expression
ee - evaluate the right side as a string then eval the result
o  - pretend to optimize your code, but actually introduce bugs
r  - perform non-destructive substitution and return the new value

Note that a comment can go just about anywhere, except in the middle of an escape sequence. Examples:

qr/foo(?#comment)bar/  # Matches 'foobar'

# The pattern below matches 'abcd', 'abccd', or 'abcccd'
qr/abc(?#comment between literal and its quantifier){1,3}d/

# The pattern below generates a syntax error, because the '\p' must
# be followed immediately by a '{'.
qr/\p(?#comment between \p and its property name){Any}/

# The pattern below generates a syntax error, because the initial
# '\(' is a literal opening parenthesis, and so there is nothing
# for the closing ')' to match
qr/\(?#the backslash means this isn't a comment)p{Any}/

# Comments can be used to fold long patterns into multiple lines
qr/First part of a long regex(?#
  )remaining part/
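For readers following along in Python rather than Perl, the re module offers comparable facilities. The sketch below assumes only standard re behaviour: an inline (?#...) comment, and re.VERBOSE for spreading a pattern over several commented lines. The variable names and the year-month pattern are made up for illustration, and edge cases (such as a comment placed between a literal and its quantifier) are not claimed to behave as in Perl.

```python
import re

# Inline comment: the (?#...) group is ignored by the engine.
inline = re.compile(r"foo(?#comment)bar")

# Verbose mode: unescaped whitespace and '#' comments inside the pattern
# are ignored, which lets a long pattern be folded over several lines.
verbose = re.compile(r"""
    \d{4}   # year
    -       # literal separator
    \d{2}   # month
""", re.VERBOSE)

print(bool(inline.fullmatch("foobar")))
print(bool(verbose.fullmatch("2003-01")))
```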

This is particularly useful for dynamically-generated patterns, such as those read in from a configuration file, taken from an argument, or specified in a table somewhere. Consider the case where some patterns want to be case-sensitive and some do not: The case-insensitive ones merely need to include (?i) at the front of the pattern. For example:

$pattern = "foobar";
if (/$pattern/i) {}

# more flexible:

$pattern = "(?i)foobar";
if (/$pattern/) {}
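The same idea carries over to Python's re module, which also accepts (?i) at the very start of a pattern. The sketch below is an illustration under that assumption; the patterns list stands in for patterns read from a configuration file, and the names are hypothetical.

```python
import re

# Hypothetical patterns, e.g. loaded from a configuration file. Each one
# decides for itself whether it is case-insensitive.
patterns = ["foobar", "(?i)foobar"]
text = "FOOBAR"

# No flags are passed by the caller; the pattern carries its own.
results = {pattern: bool(re.search(pattern, text)) for pattern in patterns}
print(results)
```

Because the flag travels with the pattern string, the calling code does not need a separate switch per pattern.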

These modifiers are restored at the end of the enclosing group. For example,

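( (?i) blah ) \s+ \g1

will match blah in any case, some spaces, and an exact (including the case!) repetition of the previous word, assuming the /x modifier, and no /i modifier outside this group.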
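A comparable scoping effect can be sketched in Python, assuming Python 3.6 or later, where the scoped form (?i:...) is available. This is an analogue in Python's re module, not the Perl syntax itself.

```python
import re

# (?i:...) limits case-insensitivity to the group. The backreference \1
# sits outside that group, so it compares case-sensitively against
# whatever text the group actually captured.
pattern = re.compile(r"((?i:blah))\s+\1")

print(bool(pattern.fullmatch("BLAH   BLAH")))  # exact repetition
print(bool(pattern.fullmatch("BLAH   blah")))  # repetition in another case
```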

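A modifier is overridden by later occurrences of this construct in the same scope containing the same modifier, so that

/((?im)foo(?-m)bar)/

matches all of foobar case-insensitively, but uses /m rules for only the foo portion. The same goes for x and xx. Hence, in

/(?-x)foo/xx

both /x and /xx are turned off during matching foo. And in

/(?x)foo/x

/x but not /xx is turned on for matching foo.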

The (?:pattern) construct is for clustering, not capturing; it groups subexpressions like "()", but doesn't make backreferences as "()" does. So

@fields = split(/\b(?:a|b|c)\b/)

matches the same field delimiters as

@fields = split(/\b(a|b|c)\b/)
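The practical difference between the two forms shows up in what the split returns. The sketch below uses Python's re.split as an analogue (an assumption of this example; it is not the Perl code above): with a capturing group the delimiters are included in the result, with a non-capturing group they are not.

```python
import re

text = "x a y b z"

# Non-capturing group: only the fields between delimiters are returned.
without_delims = re.split(r"\b(?:a|b|c)\b", text)

# Capturing group: the matched delimiters are returned as extra fields.
with_delims = re.split(r"\b(a|b|c)\b", text)

print(without_delims)
print(with_delims)
```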

Any letters between "?" and ":" act as flags modifiers as with (?adluimnsx-imnsx). For example,

/(?s-i:more.*than).*million/i

Starting in Perl 5.14, a "^" (caret or circumflex accent) immediately after the "?" is a shorthand equivalent to d-imnsx. Any positive flags (except "d") may follow the caret, so

(?^x:foo)

is equivalent to

(?x-imns:foo)

Suggestion : 4

A regular expression consisting of only nonspecial characters simply represents that sequence of characters. If abc occurs anywhere in the string we are testing against (not just at the start), test will return true.

The star (*) has a similar meaning but also allows the pattern to match zero times. Something with a star after it never prevents a pattern from matching; it'll just match zero instances if it can't find any suitable text to match.

Regular expression objects have a number of methods. The simplest one is test. If you pass it a string, it will return a Boolean telling you whether the string contains a match of the pattern in the expression.

Putting parentheses around the parts of the expression that we are interested in, we can now create a date object from a string.

A regular expression is a type of object. It can be either constructed with the RegExp constructor or written as a literal value by enclosing a pattern in forward slash (/) characters.

let re1 = new RegExp("abc");
let re2 = /abc/;

The second notation, where the pattern appears between slash characters, treats backslashes somewhat differently. First, since a forward slash ends the pattern, we need to put a backslash before any forward slash that we want to be part of the pattern. In addition, backslashes that aren’t part of special character codes (like \n) will be preserved, rather than ignored as they are in strings, and change the meaning of the pattern. Some characters, such as question marks and plus signs, have special meanings in regular expressions and must be preceded by a backslash if they are meant to represent the character itself.

let eighteenPlus = /eighteen\+/;

Regular expression objects have a number of methods. The simplest one is test. If you pass it a string, it will return a Boolean telling you whether the string contains a match of the pattern in the expression.

console.log(/abc/.test("abcde"));
// → true
console.log(/abc/.test("abxde"));
// → false

So you could match a date and time format like 01-30-2003 15:20 with the following expression:

let dateTime = /\d\d-\d\d-\d\d\d\d \d\d:\d\d/;
console.log(dateTime.test("01-30-2003 15:20"));
// → true
console.log(dateTime.test("30-jan-2003 15:20"));
// → false
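The excerpt above also mentions putting parentheses around the interesting parts of a pattern in order to build a date object. A minimal sketch of that idea follows, written in Python rather than JavaScript; the names DATE_TIME and parse_date_time are hypothetical, and the month-day-year order is assumed from the sample string 01-30-2003 15:20.

```python
import re
from datetime import datetime

# Same shape as the dateTime pattern above, with a group around each part.
DATE_TIME = re.compile(r"(\d\d)-(\d\d)-(\d{4}) (\d\d):(\d\d)")


def parse_date_time(text):
    """Return a datetime for the first MM-DD-YYYY HH:MM found, else None."""
    match = DATE_TIME.search(text)
    if match is None:
        return None
    month, day, year, hour, minute = (int(group) for group in match.groups())
    return datetime(year, month, day, hour, minute)


print(parse_date_time("01-30-2003 15:20"))
print(parse_date_time("30-jan-2003 15:20"))
```

The pattern only checks the shape of the input, so a string such as 13-45-2003 99:99 would still match and then make the datetime constructor raise ValueError.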

To invert a set of characters—that is, to express that you want to match any character except the ones in the set—you can write a caret (^) character after the opening bracket.

let notBinary = /[^01]/;
console.log(notBinary.test("1100100010100110"));
// → false
console.log(notBinary.test("1100100010200110"));
// → true

Suggestion : 5

However, keep in mind that this modifier just interleaves the ASCII codes of the characters in the string with zeroes; it does not truly support UTF-16 strings containing non-English characters. If you want to search for strings in both ASCII and wide form, you can use the ascii modifier in conjunction with wide, no matter the order in which they appear.

The ascii modifier can appear alone, without an accompanying wide modifier, but it's not necessary to write it because in the absence of wide the string is assumed to be ASCII by default.

In case you want to express that only some occurrences of the string should satisfy your condition, the same logic seen in the for..of operator applies here:

When writing the condition for a rule you can also make reference to a previously defined rule in a manner that resembles a function invocation of traditional programming languages. In this way you can create rules that depend on others. Let's see an example:

rule dummy
{
    condition:
        false
}

rule ExampleRule
{
    strings:
        $my_text_string = "text here"
        $my_hex_string = { E2 34 A1 C8 23 FB }

    condition:
        $my_text_string or $my_hex_string
}

/*
    This is a multi-line comment ...
*/

rule CommentExample // ... and this is single-line comment
{
   condition: false // just a dummy rule, don't do this
}

rule WildcardExample
{
    strings:
        $hex_string = { E2 34 ?? C8 A? FB }

    condition:
        $hex_string
}

rule JumpExample
{
    strings:
        $hex_string = { F4 23 [4-6] 62 B4 }

    condition:
        $hex_string
}
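
Any of the following byte sequences matches the pattern in JumpExample, because the jump [4-6] can be occupied by any 4 to 6 bytes:

F4 23 01 02 03 04 62 B4
F4 23 00 00 00 00 00 62 B4
F4 23 15 82 A3 04 45 22 62 B4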
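
The jump semantics can be checked outside YARA with a small Python sketch that emulates the hex string as a bytes regular expression. This is an approximation for illustration only, not how YARA evaluates rules; JUMP_PATTERN, samples and matches_jump are hypothetical names, and the fourth sample is added here as a counterexample.

```python
import re

# F4 23 [4-6] 62 B4: fixed bytes around a jump of 4 to 6 arbitrary bytes.
# re.DOTALL lets '.' match every byte value, including 0x0A (newline).
JUMP_PATTERN = re.compile(rb"\xF4\x23.{4,6}\x62\xB4", re.DOTALL)

samples = [
    "F4 23 01 02 03 04 62 B4",
    "F4 23 00 00 00 00 00 62 B4",
    "F4 23 15 82 A3 04 45 22 62 B4",
    "F4 23 01 02 03 62 B4",  # only 3 bytes in the jump, added for contrast
]


def matches_jump(hex_text):
    """Return True if the byte sequence contains the jump pattern."""
    return JUMP_PATTERN.search(bytes.fromhex(hex_text)) is not None


for sample in samples:
    print(sample, matches_jump(sample))
```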