So
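import re
# The pattern must sit on a single line: a quoted raw string cannot span lines.
CARRIS_REGEX = r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
mailbody = open("test.txt").read()
# finditer yields match objects
for match in pattern.finditer(mailbody):
    print(match)
print()
# findall returns tuples of the captured groups
for match in pattern.findall(mailbody):
    print(match)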
prints
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
('790', 'PR. REAL', '21:06', '04m')
('758', 'PORTAS BENFICA', '21:10', '09m')
('790', 'PR. REAL', '21:14', '13m')
('758', 'PORTAS BENFICA', '21:21', '19m')
('790', 'PR. REAL', '21:29', '28m')
('758', 'PORTAS BENFICA', '21:38', '36m')
('758', 'SETE RIOS', '21:49', '47m')
('758', 'SETE RIOS', '22:09', '68m')
If you want the same output from finditer as you're getting from findall, you need
for match in pattern.finditer(mailbody):
    print(match.groups())  # groups() already returns a tuple
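A minimal sketch of that equivalence, runnable without test.txt. The HTML rows below are made up for illustration (they only imitate the kind of content the question's file seems to hold), and the pattern is the one from the question written on a single line.

```python
import re

# Pattern from the question, on one line.
CARRIS_REGEX = r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
pattern = re.compile(CARRIS_REGEX)

# Hypothetical stand-in for the contents of test.txt.
mailbody = (
    "<tr><th>790</th><th>PR. REAL</th><th>21:06</th><th>04m</th></tr>\n"
    "<tr><th>758</th><th>PORTAS BENFICA</th><th>21:10</th><th>09m</th></tr>\n"
)

from_findall = pattern.findall(mailbody)
from_finditer = [match.groups() for match in pattern.finditer(mailbody)]

print(from_finditer)
print(from_findall == from_finditer)  # True
```

With several capturing groups, findall returns one tuple per match, which is exactly what groups() gives you for each match object.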
I took this example from the "Regular expression operations" page of the Python 2.* documentation and describe it here in detail, with some modifications. To explain the whole example, start with a string variable,
text = "He was carefully disguised but captured quickly by police."
and a compiled regular expression pattern,
import re
regEX = r"\w+ly"
pattern = re.compile(regEX)
The following lines give a basic understanding of re.search().
search = pattern.search(text)
print(search)
print(type(search))
#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>
pattern.search() returns only the first match. pattern.finditer(text), by contrast, returns an iterator object that has to be looped over; printing the iterator itself is obviously not the result we want. Let's loop over it and see what is inside.
for anObject in pattern.finditer(text):
    print(anObject)
    print(type(anObject))
    print()
#output
<re.Match object; span=(7, 16), match='carefully'>
<class 're.Match'>
<re.Match object; span=(40, 47), match='quickly'>
<class 're.Match'>
Now let's see what happens with re.findall().
findall = pattern.findall(text)
print(findall)
print(type(findall))
#output
['carefully', 'quickly']
<class 'list'>
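To put the three calls side by side, here is a small sketch using the same text and pattern. The variable names are mine, not from the documentation example.

```python
import re

text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

first = pattern.search(text)                 # first match only, or None
all_matches = list(pattern.finditer(text))   # every match, as Match objects
all_strings = pattern.findall(text)          # every match, as plain strings

print(first.group())                         # carefully
print([m.span() for m in all_matches])       # [(7, 16), (40, 47)]
print(all_strings == [m.group() for m in all_matches])  # True
```

When the pattern has no capturing groups, findall is effectively finditer with group() applied to each match; what you lose is the position information.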
You can't make them behave the same way, because they're different. If you really want to create a list of results from finditer, then you could use a list comprehension:
>>> [match for match in pattern.finditer(mailbody)]
[...]
In general, use a for loop to access the matches returned by re.finditer:
>>> for match in pattern.finditer(mailbody):
...     ...
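One practical difference worth seeing once: the object finditer returns is a one-shot iterator, so matches are handed out on demand and cannot be replayed. A small sketch with a made-up input string:

```python
import re

pattern = re.compile(r"\d+")
matches = pattern.finditer("10 20 30")   # an iterator; matches come out on demand

first = next(matches)                    # pulls only the first match
rest = [m.group() for m in matches]      # consumes the remaining matches
again = list(matches)                    # the iterator is now exhausted

print(first.group(), rest, again)        # 10 ['20', '30'] []
```

If you need to go over the matches more than once, build a list from the iterator first.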
In practice, the solution was to create at least one capturing group, which made it possible to fetch the text from the match's groups:
-yield from zip(re.finditer(r"\w+", line)...
+yield from zip(re.finditer(r"(\w+)", line)...
...
-block.(miscellaneous attempts)
+block.group(1)
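For context, a hedged sketch of what changes when the group is added. This is not the poster's code; the input line and variable names are invented. Note that group(0) (or group() with no argument) returns the whole match even when the pattern has no capturing group, so adding a group is only needed if you want to call group(1).

```python
import re

line = "alpha beta"

# Without a capturing group, only group(0), the whole match, exists.
plain = next(re.finditer(r"\w+", line))
print(plain.group(0))              # alpha

try:
    plain.group(1)
except IndexError as exc:
    print("no group 1:", exc)

# With one capturing group, group(1) becomes available.
grouped = next(re.finditer(r"(\w+)", line))
print(grouped.group(1))            # alpha
```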
For example:
import re
text_to_search = r'''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ\s
321-555-4321
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
khanafsaan11.com
321-555-4321
123.555.1234
123*555*-1234
123.555.1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
Mr_hello
'''
pattern = re.compile(r'M(r|rs|s)\.? [A-Z][a-z]*')
print(list(pattern.finditer(text_to_search)))  # converted to list
print(pattern.findall(text_to_search))
Output: the first print shows a list of five re.Match objects, one for each of Mr. Schafer, Mr Smith, Ms Davis, Mrs. Robinson and Mr. T. The second print shows what findall returns when the pattern has a single capturing group, namely only that group's text for each match:
['r', 'r', 's', 'rs', 'r']
You can get findall()-style output of the full matches from finditer() as follows:
for obj in pattern.finditer(text_to_search):
    print(obj.group())  # group() is a method of re.Match; with no argument it returns the whole match
#output
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
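If you would rather have findall return the full names directly, one option is to make the group non-capturing with (?:...). A sketch, using a shortened version of the sample text (the names variable is mine):

```python
import re

names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T\nMr_hello"

capturing = re.compile(r"M(r|rs|s)\.? [A-Z][a-z]*")
non_capturing = re.compile(r"M(?:r|rs|s)\.? [A-Z][a-z]*")

print(capturing.findall(names))       # ['r', 'r', 's', 'rs', 'r']
print(non_capturing.findall(names))   # full matches
print([m.group() for m in capturing.finditer(names)])  # same full matches
```

With no capturing groups in the pattern, findall falls back to returning the whole match, so the second and third prints agree.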
re.finditer() returns an iterator of match objects for the string, while re.findall() returns a list of the matched strings. Refer to the snippets below to understand the difference between re.finditer() and re.findall().
re.finditer
import re
text = '''
Extract the domain from the urls www.gcptutorials.com, www.wikipedia.org, www.google.com
'''
pattern = r'(www.([A-Za-z_0-9-]+)(.\w+))'
find_iter_result = re.finditer(pattern, text)
print(type(find_iter_result))
print(find_iter_result)
for i in find_iter_result:
    print(i.group(2))  # group 2 is the domain name without "www." and the suffix
Output
<class 'callable_iterator'>
<callable_iterator object at 0x7f0c5cc24e48>
gcptutorials
wikipedia
google
re.findall
import re
text = '''
Extract the domain from the urls www.gcptutorials.com, www.wikipedia.org, www.google.com
'''
pattern = r'(www.([A-Za-z_0-9-]+)(.\w+))'
find_all_result = re.findall(pattern, text)
print(type(find_all_result))
print(find_all_result)
for i in find_all_result:
    print(i[1])  # index 1 of each tuple corresponds to group 2
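To make the shape of the findall result concrete, here is a sketch with the same URLs. One deliberate change from the snippet above: the dots are escaped (\.) so they match a literal dot rather than any character. The variable names are mine.

```python
import re

text = "www.gcptutorials.com, www.wikipedia.org, www.google.com"
pattern = r"(www\.([A-Za-z_0-9-]+)(\.\w+))"   # three capturing groups

rows = re.findall(pattern, text)              # list of 3-tuples, one per match
print(type(rows))                             # <class 'list'>
print(rows[0])                                # ('www.gcptutorials.com', 'gcptutorials', '.com')

domains_from_findall = [row[1] for row in rows]
domains_from_finditer = [m.group(2) for m in re.finditer(pattern, text)]
print(domains_from_findall == domains_from_finditer)  # True
```

Group numbers start at 1 on a match object, while the tuples from findall are indexed from 0, which is why group(2) lines up with row[1].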
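Example output when iterating re.finditer over a string containing two email addresses, printing each match object and then its matched text: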
<re.Match object; span=(12, 24), match='[email protected]'>
[email protected]
<re.Match object; span=(47, 60), match='[email protected]'>
[email protected]
Extracting emails from text using re.findall
import re
sample_str = 'dummy email [email protected], one more dummy email [email protected] testing'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', sample_str)
for email in emails:
    print(email)
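A runnable sketch of the same idea with both functions. The addresses are hypothetical ones chosen for illustration, and the pattern is a deliberately loose "something@something" match, not a full email validator.

```python
import re

# Hypothetical addresses, chosen for illustration.
sample_str = "dummy email first@example.com, one more dummy email second@example.org testing"
email_pattern = r"[\w.-]+@[\w.-]+"

found = re.findall(email_pattern, sample_str)
print(found)

# finditer gives the same strings plus their positions.
for match in re.finditer(email_pattern, sample_str):
    print(match.span(), match.group())
```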
Last Updated : 11 Jan, 2022
There are 14 metacharacters in total:
\     Drops the special meaning of the character following it
[]    Represents a character class
^     Matches the beginning
$     Matches the end
.     Matches any character except newline
?     Matches zero or one occurrence
|     Means OR (matches any of the patterns separated by it)
*     Any number of occurrences (including 0 occurrences)
+     One or more occurrences
{}    Indicates the number of occurrences of the preceding RE to match
()    Encloses a group of REs
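A short sketch of why escaping matters for these characters. The strings are invented for the example; re.escape() is the standard-library helper that adds the backslashes for you.

```python
import re

price = "Total: $5.00 (approx.)"

# Unescaped, '.' matches any character.
print(re.findall(r"5.00", "5x00 5.00"))     # ['5x00', '5.00']

# Escaped with a backslash, '$' and '.' are matched literally.
print(re.findall(r"\$5\.00", price))        # ['$5.00']

# re.escape() builds a pattern that matches the given text literally.
literal = "$5.00 (approx.)"
escaped_pattern = re.escape(literal)
print(re.search(escaped_pattern, price).group() == literal)  # True
```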
Output:
Match at index 14, 21
Full match: June 24
Month: June
Day: 24
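The code that produced this output is not shown here. As a sketch, the following would print the same four lines; the input sentence and the pattern are assumptions, picked so that the match starts at index 14 and ends at index 21.

```python
import re

# Assumed input and pattern: a word followed by a space and a number.
sentence = "I was born on June 24"
match = re.search(r"([a-zA-Z]+) (\d+)", sentence)

if match is not None:
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % match.group(0))
    print("Month: %s" % match.group(1))
    print("Day: %s" % match.group(2))
else:
    print("The regex pattern does not match.")
```

group(0) is the whole match, while group(1) and group(2) are the two parenthesised parts in order.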
Summary: in this tutorial, you’ll learn how to use the Python regex finditer() function to find all matches in a string and return an iterator that yields match objects. The finditer() function matches a pattern in a string and returns an iterator that yields the Match objects of all non-overlapping matches. If the search is successful, the iterator yields those Match objects; otherwise, finditer() still returns an iterator, but one that yields no Match objects.
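The "no match" case is easy to check. In this small sketch (the input string is made up), finditer hands back an empty iterator, whereas re.search returns None:

```python
import re

no_hits = re.finditer(r"\d+", "no digits here")
as_list = list(no_hits)
print(as_list)                                  # []

print(re.search(r"\d+", "no digits here"))      # None
```

This means a for loop over finditer needs no special handling for the empty case; the loop body simply never runs.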
The following shows the syntax of the finditer() function:
re.finditer(pattern, string, flags=0)
The following example uses the finditer() function to search for all vowels in a string:
import re
s = 'Readability counts.'
pattern = r'[aeoui]'
matches = re.finditer(pattern, s)
for match in matches:
    print(match)
Output:
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='a'>
<re.Match object; span=(4, 5), match='a'>
<re.Match object; span=(6, 7), match='i'>
<re.Match object; span=(8, 9), match='i'>
<re.Match object; span=(13, 14), match='o'>
<re.Match object; span=(14, 15), match='u'>
The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.
The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches.
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
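A sketch of the raw-string point: the two pattern strings below contain exactly the same characters, so they behave identically. The variable names are mine.

```python
import re

raw = r"\W(.)\1\W"          # raw string: backslashes are kept as written
escaped = "\\W(.)\\1\\W"    # normal string: every backslash is doubled

print(raw == escaped)                        # True
print(re.match(raw, " ff ").group(1))        # f
print(re.match(escaped, " ff ").group(1))    # f
```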
>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
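Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:
\a  \b  \f  \n  \N  \r  \t  \u  \U  \v  \x  \\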
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
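A quick way to convince yourself that the verbose and compact forms agree is to run both against a few inputs. The sample strings here are arbitrary choices for the sketch.

```python
import re

verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

samples = ["3.14", "10.", "abc", ".5"]
results = [(s, bool(verbose.fullmatch(s)), bool(compact.fullmatch(s))) for s in samples]
for sample, v, c in results:
    print(sample, v, c)
```

Under re.X, whitespace in the pattern is ignored and everything from # to the end of a line is treated as a comment, which is why the two patterns accept and reject the same strings.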
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
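The "single master regular expression" technique mentioned earlier pairs naturally with finditer. Below is a minimal sketch, not the documentation's own tokenizer: the token categories, the TOKEN_SPEC and MASTER names, and the tokenize() helper are all invented for illustration.

```python
import re

# Hypothetical token categories for a tiny arithmetic language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d*)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
# Combine the categories into one pattern of named alternatives.
MASTER = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC))

def tokenize(source):
    """Yield (category, text) pairs, dropping whitespace."""
    for match in MASTER.finditer(source):
        kind = match.lastgroup      # name of the alternative that matched
        if kind != "SKIP":
            yield kind, match.group()

tokens = list(tokenize("total = price * 3.5"))
print(tokens)
```

Note that this sketch silently skips any character that fits no category; a real tokenizer would usually add a catch-all category and report an error for it.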