why use regex finditer() rather than findall()


So

import re

CARRIS_REGEX = r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
mailbody = open("test.txt").read()
for match in pattern.finditer(mailbody):
    print(match)
print()
for match in pattern.findall(mailbody):
    print(match)

prints

<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>

('790', 'PR. REAL', '21:06', '04m')
('758', 'PORTAS BENFICA', '21:10', '09m')
('790', 'PR. REAL', '21:14', '13m')
('758', 'PORTAS BENFICA', '21:21', '19m')
('790', 'PR. REAL', '21:29', '28m')
('758', 'PORTAS BENFICA', '21:38', '36m')
('758', 'SETE RIOS', '21:49', '47m')
('758', 'SETE RIOS', '22:09', '68m')

If you want the same output from finditer as you're getting from findall, you need

for match in pattern.finditer(mailbody):
   print(tuple(match.groups()))
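As a minimal sketch of that equivalence, the snippet below runs both calls over a small made-up fragment and compares the results. The `sample` string is an assumption standing in for the contents of test.txt, which the original does not show.

```python
import re

# Hypothetical stand-in for the contents of test.txt (not shown in the original).
sample = (
    "<th>790</th><th>PR. REAL</th><th>21:06</th><th>04m</th>"
    "<th>758</th><th>PORTAS BENFICA</th><th>21:10</th><th>09m</th>"
)

pattern = re.compile(
    r"<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>"
)

# findall returns one tuple of groups per match.
from_findall = pattern.findall(sample)

# finditer returns Match objects; groups() gives the same tuple for each one.
from_finditer = [m.groups() for m in pattern.finditer(sample)]

print(from_findall)
print(from_finditer == from_findall)
```

Note that `match.groups()` already returns a tuple, so wrapping it in `tuple()` is harmless but not required.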

I took this example from the "Regular expression operations" page of the Python 2.* documentation and describe it here in detail, with some modifications. To explain the whole example, let's take a string variable called text,

text = "He was carefully disguised but captured quickly by police."

and a compiled regular expression pattern,

regEX = r"\w+ly"
pattern = re.compile(regEX)

The following lines give a basic understanding of re.search().

search = pattern.search(text)
print(search)
print(type(search))

#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>

pattern.finditer(text) gives us an iterator object, which needs to be looped over. Printing the iterator itself is obviously not the result we want, so let's loop over finditer and see what's inside this iterator.

finditer = pattern.finditer(text)
for anObject in finditer:
    print(anObject)
    print(type(anObject))
    print()

#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>

<re.Match object; span=(40, 47), match='quickly'>
<class 're.Match'>

Let's see what happens in re.findall().

findall = pattern.findall(text)
print(findall)
print(type(findall))

#output
['carefully', 'quickly']
<class 'list'>
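To put the two side by side, here is a small sketch using the same text and pattern. It illustrates one practical reason to prefer finditer(): each Match object carries the position of the match, which the plain strings from findall() do not. The variable names `words` and `located` are only illustrative.

```python
import re

text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

# findall: a list of plain strings, with no position information.
words = pattern.findall(text)

# finditer: Match objects, from which both text and positions can be read.
located = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]

print(words)    # ['carefully', 'quickly']
print(located)  # [('carefully', 7, 16), ('quickly', 40, 47)]
```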

You can't make them behave the same way, because they're different. If you really want to create a list of results from finditer, then you could use a list comprehension:

>>> [match for match in pattern.finditer(mailbody)]
[...]

In general, use a for loop to access the matches returned by re.finditer:

>>> for match in pattern.finditer(mailbody):
...     ...

The solution was practically that I needed to create at least one group, which enabled fetching it from the group dict

-yield from zip(re.finditer(r"\w+", line)...
+yield from zip(re.finditer(r"(\w+)", line)...
...
-block.(miscellaneous attempts)
+block.group(1)

For example:

text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ\s
321-555-4321
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
khanafsaan11.com
321-555-4321
123.555.1234
123*555*-1234
123.555.1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''
pattern = re.compile(r'M(r|rs|s)\.? [A-Z][a-z]*')
print(list(pattern.finditer(text_to_search)))  # converted to list
print(pattern.findall(text_to_search))

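Output (the first list holds five re.Match objects, abbreviated here; the second holds only the captured group of each match):

[<re.Match object; ...>, <re.Match object; ...>, <re.Match object; ...>, <re.Match object; ...>, <re.Match object; ...>]
['r', 'r', 's', 'rs', 'r']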

You can get the full text of each match from the finditer() output as follows (use obj.group(1) instead to get the same values that findall() returns here):

for obj in pattern.finditer(text_to_search):
    print(obj.group())  # group() is a method of the re.Match object
#output
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
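The difference comes from the capturing group: when a pattern contains exactly one group, findall() returns only that group, while a Match object from finditer() gives access to both the whole match and the group. A minimal sketch, assuming a shortened version of the sample text (the `names` string is illustrative):

```python
import re

# Hypothetical shortened sample; the names are taken from the example above.
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T\nMr_hello\n"
pattern = re.compile(r"M(r|rs|s)\.? [A-Z][a-z]*")

# findall with one capturing group returns only that group's text.
only_groups = pattern.findall(names)

# finditer lets you choose: group(0) is the whole match, group(1) the group.
full_matches = [m.group(0) for m in pattern.finditer(names)]
first_groups = [m.group(1) for m in pattern.finditer(names)]

print(only_groups)
print(full_matches)
print(first_groups == only_groups)
```

If only the full matches are wanted from findall(), one option is to make the group non-capturing, as in `M(?:r|rs|s)`.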

Suggestion : 2

re.finditer() returns an iterator of match objects for the string, while re.findall() returns a list of the matched strings. Refer to the snippets below to understand the difference between re.finditer() and re.findall().

  • Extracting urls from text using Python re.finditer

          import re

          text = '''Extract the domain from the urls www.gcptutorials.com,
          www.wikipedia.org, www.google.com'''

          # dots are escaped so that they match a literal "."
          pattern = r'(www\.([A-Za-z_0-9-]+)(\.\w+))'

          find_iter_result = re.finditer(pattern, text)

          print(type(find_iter_result))
          print(find_iter_result)

          for i in find_iter_result:
              print(i.group(2))

    Output

          <class 'callable_iterator'>
          <callable_iterator object at 0x7f0c5cc24e48>
          gcptutorials
          wikipedia
          google
  • Extracting urls from text using Python re.findall

          import re

          text = '''Extract the domain from the urls www.gcptutorials.com,
          www.wikipedia.org, www.google.com'''

          # dots are escaped so that they match a literal "."
          pattern = r'(www\.([A-Za-z_0-9-]+)(\.\w+))'

          find_all_result = re.findall(pattern, text)

          print(type(find_all_result))
          print(find_all_result)
          for i in find_all_result:
              print(i[1])

    Example Output

          <re.Match object; span=(12, 24), match='[email protected]'>
          [email protected]
          <re.Match object; span=(47, 60), match='[email protected]'>
          [email protected]
  • Extracting emails from text using re.findall

          import re

          sample_str = 'dummy email [email protected], one more dummy email [email protected] testing'

          emails = re.findall(r'[\w\.-]+@[\w\.-]+', sample_str)
          for email in emails:
              print(email)
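Because the addresses in the sample above are redacted, here is a runnable sketch of the same idea with made-up addresses (alice@example.com and bob@example.org are assumptions, not the original data). It shows findall() returning the strings and finditer() returning the strings together with their spans.

```python
import re

# Hypothetical addresses; the ones in the original sample are redacted.
sample_str = "dummy email alice@example.com, one more dummy email bob@example.org testing"
email_re = re.compile(r"[\w.-]+@[\w.-]+")

# findall: just the matched strings.
emails = email_re.findall(sample_str)

# finditer: Match objects, so the span of each address is available too.
spans = [(m.span(), m.group()) for m in email_re.finditer(sample_str)]

print(emails)
for span, address in spans:
    print(span, address)
```

The pattern is deliberately loose and is only meant to illustrate the two calls, not to validate email addresses.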

    Suggestion : 3

    Last Updated : 11 Jan, 2022

    There are a total of 14 metacharacters, which will be discussed as they come up in the functions below:

    \     Used to drop the special meaning of the character following it (discussed below)
    []    Represents a character class
    ^     Matches the beginning
    $     Matches the end
    .     Matches any character except newline
    ?     Matches zero or one occurrence
    |     Means OR (matches with any of the characters separated by it)
    *     Any number of occurrences (including 0 occurrences)
    +     One or more occurrences
    {}    Indicates the number of occurrences of a preceding RE to match
    ()    Encloses a group of REs

    Output:

    Match at index 14, 21
    Full match: June 24
    Month: June
    Day: 24
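The code that produced this output is not included above. As a sketch only, output of this shape can be produced with re.search() and two capturing groups; the input sentence and the pattern below are assumptions chosen so that the indices line up with the output shown.

```python
import re

# Assumed input sentence and pattern; the original snippet is not shown.
sentence = "I was born on June 24"
match = re.search(r"([a-zA-Z]+) (\d+)", sentence)

if match is not None:
    # start() and end() give the indices of the matched substring.
    print("Match at index %s, %s" % (match.start(), match.end()))
    # group(0) is the whole match; group(1) and group(2) are the captured parts.
    print("Full match: %s" % match.group(0))
    print("Month: %s" % match.group(1))
    print("Day: %s" % match.group(2))
else:
    print("The regex pattern does not match.")
```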

    Suggestion : 4

    The finditer() function matches a pattern in a string and returns an iterator that yields the Match objects of all non-overlapping matches. If the search is successful, the iterator yields one Match object per match. Otherwise, finditer() still returns an iterator, but it yields no Match objects.

    The following shows the syntax of the finditer() function:

    re.finditer(pattern, string, flags=0)

    The following example uses the finditer() function to search for all vowels in a string:

    import re

    s = 'Readability counts.'
    pattern = r'[aeoui]'

    matches = re.finditer(pattern, s)
    for match in matches:
        print(match)

    Output:

    <re.Match object; span=(1, 2), match='e'>
    <re.Match object; span=(2, 3), match='a'>
    <re.Match object; span=(4, 5), match='a'>
    <re.Match object; span=(6, 7), match='i'>
    <re.Match object; span=(8, 9), match='i'>
    <re.Match object; span=(13, 14), match='o'>
    <re.Match object; span=(14, 15), match='u'>
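Two properties of the returned iterator are worth seeing in code: it is still returned (just empty) when nothing matches, and it is consumed as it is read. The sketch below reuses the same string; the variable names are illustrative.

```python
import re

s = 'Readability counts.'

# No digits in s: finditer still returns an iterator, it just yields nothing.
no_matches = list(re.finditer(r'\d', s))

# With matches, each Match object exposes the matched text and its position.
vowels = [(m.group(), m.start()) for m in re.finditer(r'[aeoui]', s)]

# The iterator is consumed as it is read; a second pass yields nothing.
it = re.finditer(r'[aeoui]', s)
first_pass = sum(1 for _ in it)
second_pass = sum(1 for _ in it)

print(no_matches)
print(vowels)
print(first_pass, second_pass)
```

If the matches are needed more than once, one option is to store them with `list(re.finditer(...))`, at the cost of holding all Match objects in memory.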

    Suggestion : 5

    The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.

    Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.

    The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches.

    Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

    >>> import re
    >>> m = re.search('(?<=abc)def', 'abcdef')
    >>> m.group(0)
    'def'
    >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
    >>> m.group(0)
    'egg'

    \a \b \f \n \N \r \t \u \U \v \x \\

    a = re.compile(r"""\d +  # the integral part
                       \.    # the decimal point
                       \d *  # some fractional digits""", re.X)
    b = re.compile(r"\d+\.\d*")

    prog = re.compile(pattern)
    result = prog.match(string)

    result = re.match(pattern, string)
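As a quick check of the snippets above, the sketch below confirms that the verbose (re.X) and compact patterns find the same matches, and that calling match() on a compiled pattern gives the same result as the module-level re.match(). The `sample` string and the variable names are assumptions made for illustration.

```python
import re

# Verbose form: whitespace and comments inside the pattern are ignored.
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
# Compact form of the same pattern.
b = re.compile(r"\d+\.\d*")

sample = "pi is roughly 3.14159, e is roughly 2.718"  # hypothetical input

verbose_hits = a.findall(sample)
compact_hits = b.findall(sample)

# A compiled-pattern call and the module-level call give equivalent matches.
via_compiled = b.match("3.14 and more")
via_module = re.match(r"\d+\.\d*", "3.14 and more")

print(verbose_hits, compact_hits)
print(via_compiled.group(), via_module.group())
```

Compiling once is mainly a convenience when the same pattern is reused many times; as noted above, the module-level functions cache recently used patterns.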