python regex matching pattern not surrounded by double quotes

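You can try word boundaries (\b), but \b also matches between a double quote and a letter, so this pattern does not skip the quoted term. (Passing re.I as the fourth positional argument of re.sub sets count=2 rather than a flag, which hides the problem on this input.)

>>> import re
>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, flags=re.I)
'keyword1 keyword2   "" keyword3'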

Alternatively, you can use negative lookbehind and lookahead assertions to require that title:\w+ has no double quote directly before or after it:

>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
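
The lookaround pattern only inspects the single character on each side of the match, so it protects "title:quoted" but not a title: term in the middle of a longer quoted phrase. A minimal sketch of that limitation follows; the helper name strip_titles and the sample strings are made up for illustration:

```python
import re

# Hypothetical helper wrapping the lookaround pattern shown above.
PATTERN = re.compile(r'(?<!")title:\w+(?!")')

def strip_titles(text):
    """Remove title:word terms that have no quote directly around them."""
    return PATTERN.sub('', text)

# A term wrapped directly in quotes is kept.
print(strip_titles('a title:x "title:quoted" b'))   # a  "title:quoted" b

# A term inside a longer quoted phrase is still removed,
# because its neighbours are spaces, not quotes.
print(strip_titles('a "see title:inner here" b'))   # a "see  here" b
```

If quoted phrases can contain spaces, an approach that matches whole quoted strings first is a safer choice.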

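We can solve it with a beautifully simple regex. The left side of the alternation matches complete quoted strings so they can be left alone; the right side captures an unquoted title: term in group 1:

"[^"]*"|(\btitle:\S+)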

This program shows how to use the regex:

import re

subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')

def myreplacement(m):
    # Group 1 is set only for an unquoted title: term, which is dropped.
    if m.group(1):
        return ""
    # Otherwise the match is a quoted string, which is kept unchanged.
    return m.group(0)

replaced = regex.sub(myreplacement, subject)
print(replaced)
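
Another option is a negated character class in front of title:. Note that it also consumes the character before the term, and it cannot match a term at the very start of the string:

re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3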

A little violent, but it also copes with escaped and unbalanced quotes, and it avoids catastrophic backtracking:

import re

string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
"abcd \" title:bar"
title:foobar keyword3 keywordtitle:keyword "non balanced quote title:foobar'''

pattern = re.compile(r'''(?:
    (                       # other content
        (?:(?=(
            " (?:(?=([^\\"]+|\\.))\3)* (?:"|$)  # quoted content
          |
            [^t"]+             # all that is not a "t" or a quote
          |
            \Bt                # "t" preceded by word characters
          |
            t(?!itle:[a-z]+)   # "t" not followed by "itle:" + letters
        ))\2)+
    )
  |     # OR
    (?<!")                  # not preceded by a double quote
)
(?:\btitle:[a-z]+)?''',
re.VERBOSE)

# Python 3.5+: a reference to an unmatched group (\1) is replaced by an empty string.
print(re.sub(pattern, r'\1', string))

Suggestion : 2

Regular expressions are patterns. A string can be classified as either matching or not matching the pattern. Regular expressions are used in at least three different ways. In Java, two String methods that take a regular expression are:

  • public boolean matches(String regex) -- tests whether or not the entire string matches the regular expression regex.
  • public String[] split(String regex) -- breaks the string into "tokens" separated by substrings that match the regular expression regex. The return value is an array containing all the tokens, but not the substrings that match the delimiting regular expression.

As an example of writing a regular expression in Java, \([^"]*\) matches a string that starts with a left parenthesis, ends with a right parenthesis, and contains no double quotation marks. To write this as a Java string literal, you have to escape the special characters \ and " with backslashes and enclose it in quotes: "\\([^\"]*\\)"

Certain characters have special purposes in regular expressions. These are called meta-characters or meta-symbols. Meta-characters are not part of the strings that are matched by a pattern. Instead, they are part of the syntax that is used for representing patterns. Typically, the following characters are meta-characters:

          . * | ? + ( ) [ ] { } ^ $ \
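
As a small illustration of why meta-characters matter, the sketch below (in Python, with a made-up sample string) shows a piece of text that fails to match itself until its meta-characters are escaped with re.escape:

```python
import re

literal = "1+1=2 (approx.)"

# Used directly as a pattern, "+", "(", ")" and "." are treated as syntax,
# so the text does not match itself.
print(re.search(literal, literal))          # None

# re.escape() backslash-escapes the meta-characters so that the pattern
# matches the text literally.
escaped = re.escape(literal)
print(re.fullmatch(escaped, literal) is not None)   # True
```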

There is one more important aspect of regular expressions on a computer: backreferences. A backreference is a way of referring to a substring that was matched by an earlier part of the expression. Backreferences take the form \1, \2, \3, ..., \9. \1 represents the part of the string that was matched by the first parenthesized sub-expression in the regular expression; \2, the part that was matched by the second parenthesized sub-expression, and so on. For example, the expression

          ^(\w+).*\1$

matches a line of text that begins and ends with the same word. The \1 matches whatever sequence of characters was matched by the \w+ that is enclosed in the first (and only) set of parentheses in the expression. The numbering of sub-expressions is done by counting left parentheses, and sub-expressions can be nested. For example, in

          ((\d+)\s*[+\-*/]\s*(\d+))=(\d+)

\1 is the whole expression to the left of the equals sign, \2 and \3 are its two numbers, and \4 is the number after the equals sign.
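
Python's re module follows the same numbering rules, so the two expressions can be tried out directly. This is a minimal sketch; the sample strings are invented for illustration:

```python
import re

# \1 repeats whatever group 1 matched at the start of the line.
same_word = re.compile(r'^(\w+).*\1$')
print(bool(same_word.search('hello and again hello')))  # True
print(bool(same_word.search('hello and goodbye')))      # False

# Groups are numbered by their opening parenthesis, left to right,
# so the outer group is 1 and the nested groups are 2 and 3.
arith = re.compile(r'((\d+)\s*[+\-*/]\s*(\d+))=(\d+)')
m = arith.search('12 + 30=42')
print(m.groups())  # ('12 + 30', '12', '30', '42')
```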

UNIX utility programs such as grep are designed to work together using the "pipe" operator, "|". On the command line, this operator sends the output from one program into the input of the next program. For example, here is a command that will print an alphabetical list of all "tags" that are used in an html file, with duplicates removed:

          egrep -o '<\w+[ />]' index.html | egrep -o '\w+' | sort -u
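
The same three steps can be sketched in Python. To keep the example self-contained it uses a short inline HTML string (an assumption) instead of reading index.html:

```python
import re

html = '<html><body><p>Hi</p><br/><p class="x">Bye</p></body></html>'

# Step 1, like egrep -o '<\w+[ />]': each "<name" plus the character after it.
openings = re.findall(r'<\w+[ />]', html)

# Step 2, like egrep -o '\w+': keep only the tag name.
names = [re.search(r'\w+', opening).group() for opening in openings]

# Step 3, like sort -u: sorted, with duplicates removed.
tags = sorted(set(names))
print(tags)  # ['body', 'br', 'html', 'p']
```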

Java regular expressions are specified by strings that use the syntax described above. There is one unfortunate complication when specifying a regular expression as a String literal in Java: String literals themselves have special characters that have to be escaped. For example, suppose you want to write the regular expression

          \([^"]*\)

Suggestion : 3

A regular expression is similar to wildcard matching in X-PLOR. Table 5.9 is a list of conversions from X-PLOR style wildcards to the matching regular expression.

In brief, a regular expression selection allows matching to multiple possibilities, instead of just one character. Table 5.8 shows some of the methods that can be used.

Multiple terms can be provided on the list of matching keywords, for example to select residues starting with an A, the glycine residues, and residues ending with a T. As with a string, a regular expression in a numeric context gets converted to an integer, which will always be zero.

Selections containing special characters must escape them with the \ character; this applies, for example, to the + when selecting atoms named Na+.

A regular expression is written in double quotes after the keyword. For example, this selection matches names that begin with C:

        name "C.*"

Suggestion : 4

Last updated: Jun 26, 2022

import re

# ✅ extract string between double quotes
my_str = 'One "Two" Three "Four"'

my_list = re.findall(r'"([^"]*)"', my_str)

print(my_list)  # 👉️ ['Two', 'Four']
print(my_list[0])  # 👉️ 'Two'
print(my_list[1])  # 👉️ 'Four'

# ---------------------------------------------------

# ✅ extract string between single quotes
my_str_2 = "One 'Two' Three 'Four'"

my_list_2 = re.findall(r"'([^']*)'", my_str_2)

print(my_list_2)  # 👉️ ['Two', 'Four']

import re

my_str_2 = "One 'Two' Three 'Four'"

my_list_2 = re.findall(r"'([^']*)'", my_str_2)

print(my_list_2)  # 👉️ ['Two', 'Four']

print(my_list_2[0])  # 👉️ Two
print(my_list_2[1])  # 👉️ Four
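
When a string mixes both quote styles, one possible variation is to capture the opening quote and require the same character as the closing quote with a backreference. This is a sketch under that assumption, not part of the original snippets; the sample text is invented:

```python
import re

text = '''One "Two" Three 'Four' and "it's five"'''

# Group 1 captures the opening quote; \1 requires the same closing quote.
# Group 2 is the text in between (lazy, so it stops at the first closer).
pattern = re.compile(r'''(["'])(.*?)\1''')

found = [m.group(2) for m in pattern.finditer(text)]
print(found)  # ['Two', 'Four', "it's five"]
```

Because the closing quote must equal the opening one, the apostrophe inside the double-quoted phrase does not end the match.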

Suggestion : 5

  • A text field is enclosed in double quotes ("). For example: "HC Account", "Mary", and "|". This is correct and the data should be loaded without the quotes.
  • Sometimes only a starting quote is provided and there is no ending quote. For example: "Account1.
  • Some values will contain the pipe delimiter. For example: "STE|504". In this case, the field must necessarily be enclosed within double quotes. If it isn't, it falls into category three below.

TL;DR: Any field that starts with |" must end with a "|. If it doesn't, and another |" is encountered, the first double quote must be escaped.

This is the simplest regex that will work for most cases:

^((?:(?:(?!")[^|\r\n]*|"[^"\r\n]*"(?=$|\|))(?:$|\|))*+)(")

  • Capturing group 1, ((?:...)*+): all previous valid fields, where each field is
      • (?!")[^|\r\n]* for an unquoted field, OR
      • "[^"\r\n]*"(?=$|\|) for a quoted field,
    followed by (?:$|\|), the end of line or the pipe delimiter.
  • Capturing group 2, ("): the unclosed opening quote.

Use it with this replacement string:

$1\\$2
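
For readers working in Python, the same idea can be sketched as follows. Two things here are assumptions rather than part of the original answer: the possessive quantifier *+ (only available from Python 3.11) is emulated with a lookahead plus backreference, and the function name escape_unclosed_quote is made up for illustration:

```python
import re

# Group 1 (inside the lookahead) collects all leading valid fields; the
# (?=(...))\1 construct keeps the engine from giving any of them back,
# which is what the possessive "*+" does in the original pattern.
UNCLOSED_QUOTE = re.compile(
    r'^(?=((?:(?:(?!")[^|\r\n]*|"[^"\r\n]*"(?=$|\|))(?:$|\|))*))\1(")',
    re.MULTILINE,
)

def escape_unclosed_quote(text):
    """Backslash-escape the first unclosed opening quote on each line."""
    return UNCLOSED_QUOTE.sub(
        lambda m: m.group(1) + '\\' + m.group(2), text
    )

# A line with only well-formed fields is left unchanged.
print(escape_unclosed_quote('"HC Account"|"Mary"|"|"'))

# The opening quote of the second field is never closed, so it is escaped.
print(escape_unclosed_quote('abc|"Account1|def'))
```

Each pass fixes at most one quote per line, so a line with several unclosed quotes would need the substitution applied repeatedly.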

The regex can be improved to work even when the next following quoted field begins with a pipe:

^((?:(?:(?!")[^|\r\n]*|"[^"\r\n]*"(?:(?=$)|(?=\|)(?!(?:\|[^|"\r\n]*)+[^|\r\n]")))(?:$|\|))*+)(")

The modified lookahead, (?=\|)(?!(?:\|[^|"\r\n]*)+[^|\r\n]"), makes sure that the following | is not the first character of a properly quoted field.

Suggestion : 6

Last Updated : 05 Oct, 2021

Illustration:

Input: "Out of this String required only is 'Geeks for Geeks' only"
Output: Geeks for Geeks

Input: "The data wanted is 'Java Regex'"
Output: Java Regex

String to be extracted: Out of this String I want 'Geeks for Geeks' only
Extracted part: Geeks for Geeks

String to be extracted: The data that I want is 'Java Regex'
Extracted part: Java Regex
