python - parse string, known structure

You could use the regex pattern

(?:(\d+)Y)?(?:(\d+)M)?

which means

(?:       start a non-grouping pattern
  (\d+)   match 1-or-more digits, grouped
  Y       followed by a literal Y
)?        end the non-grouping pattern; matched 0-or-1 times
(?:       start another non-grouping pattern
  (\d+)   match 1-or-more digits, grouped
  M       followed by a literal M
)?        end the non-grouping pattern; matched 0-or-1 times

When used in

re.match(r'(?:(\d+)Y)?(?:(\d+)M)?', text).groups()
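it returns a tuple of the two captured groups, with None for any part of the pattern that did not match; for example:

>>> import re
>>> re.match(r'(?:(\d+)Y)?(?:(\d+)M)?', '5Y3M').groups()
('5', '3')
>>> re.match(r'(?:(\d+)Y)?(?:(\d+)M)?', '6M').groups()
(None, '6')

That is why the full solution below maps a missing group to 0 and rejects input where both groups are None.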

import re

def parse_aYbM(text):
    a, b = re.match(r'(?:(\d+)Y)?(?:(\d+)M)?', text).groups()
    if a is None and b is None:
        raise ValueError('input does not match aYbM')
    a, b = [int(item) if item is not None else 0 for item in (a, b)]
    return a + b / 12.0

tests = [
   ('5Y3M', 5.25),
   ('5Y', 5.0),
   ('6M', 0.5),
   ('10Y11M', 10.917),
   ('3Y14M', 4.167),
]

for test, expected in tests:
    result = parse_aYbM(test)
    status = 'Failed'
    if abs(result - expected) < 0.001:
        status = 'Passed'
    print('{}: {} --> {}'.format(status, test, result))

yields

Passed: 5Y3M --> 5.25
Passed: 5Y --> 5.0
Passed: 6M --> 0.5
Passed: 10Y11M --> 10.9166666667
Passed: 3Y14M --> 4.16666666667

You may use re.findall

>>> def parse(m):
...     s = 0
...     j = re.findall(r'\d+Y|\d+M', m)
...     for i in j:
...         if 'Y' in i:
...             s += float(i[:-1])
...         if 'M' in i:
...             s += float(i[:-1]) / 12
...     print(s)

>>> parse('5Y')
5.0
>>> parse('6M')
0.5
>>> parse('10Y11M')
10.916666666666666
>>> parse('3Y14M')
4.166666666666667

Suggestion : 2


Output :

['geeks', 'for', 'geeks']
['geeks', ' for', ' geeks']
['geeks', 'for', 'geeks']
['Ca', 'Ba', 'Sa', 'Fa', 'Or']
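
Output of this shape can come from re.split and re.findall; a minimal sketch, assuming the inputs 'geeks, for, geeks' and 'CaBaSaFaOr' (both assumptions, not shown above):

import re

s = 'geeks, for, geeks'                  # assumed input
print(re.split(r', ', s))                # ['geeks', 'for', 'geeks']
print(re.split(r',', s))                 # ['geeks', ' for', ' geeks']
print(s.split(', '))                     # ['geeks', 'for', 'geeks']

t = 'CaBaSaFaOr'                         # assumed input
print(re.findall(r'[A-Z][a-z]*', t))     # ['Ca', 'Ba', 'Sa', 'Fa', 'Or']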

Suggestion : 3

These excerpts describe Py_BuildValue's string format units, which turn C strings into Python objects:

y# : converts a C string and its length to a Python bytes object. If the C string pointer is NULL, None is returned.
y : converts a C string to a Python bytes object. If the C string pointer is NULL, None is returned.
s# : converts a C string and its length to a Python str object using 'utf-8' encoding. If the C string pointer is NULL, the length is ignored and None is returned.
s : converts a null-terminated C string to a Python str object using 'utf-8' encoding. If the C string pointer is NULL, None is used.

status = converter(object, address);

static PyObject *
weakref_ref(PyObject *self, PyObject *args)
{
    PyObject *object;
    PyObject *callback = NULL;
    PyObject *result = NULL;

    if (PyArg_UnpackTuple(args, "ref", 1, 2, &object, &callback)) {
        result = PyWeakref_NewRef(object, callback);
    }
    return result;
}

PyArg_ParseTuple(args, "O|O:ref", &object, &callback)

Suggestion : 4

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your Python code.

Lark is a parser generator that works as a library. You write the grammar in a string or a file and then use it as an argument to dynamically generate the parser. Lark can use two algorithms: Earley is used when you need to parse all grammars and LALR when you need speed. Earley can also parse ambiguous grammars. Lark offers the chance to automatically resolve the ambiguity by choosing the simplest option or reporting all options.

TatSu is the successor of Grako, another parser generator tool, and it has a good level of compatibility with it. It can create a parser dynamically from a grammar or compile it into a Python module.

Lrparsing is a parser generator whose grammars are defined as Python expressions. These expressions are attributes of a class that correspond to the rules of a traditional grammar. They are usually generated dynamically, but the library provides a function to precompile a parse table beforehand.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

437 + 734
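
With a tool like TatSu, a parser for this kind of input can be generated dynamically from a grammar string. A minimal sketch, assuming TatSu is installed; the grammar itself is illustrative, not taken from the article:

import tatsu

# Illustrative grammar: an expression is either "term + term" or a bare term,
# and a term is a run of digits. TatSu skips whitespace by default.
grammar = r'''
    start = expression $ ;
    expression = term '+' term | term ;
    term = /\d+/ ;
'''

parser = tatsu.compile(grammar)
print(parser.parse('437 + 734'))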

The most commonly used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:

<symbol> ::= expression

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.

addition       ::= expression '+' expression
multiplication ::= expression '*' expression
// an expression could be an addition or a multiplication or a number
expression     ::= addition | multiplication | // a number

The following example grammar shows a useful feature of Lark: it includes rules for common things, like whitespace or numbers.

parser = Lark('''
        ?sum: product
            | sum "+" product -> add
            | sum "-" product -> sub

        ?product: item
            | product "*" item -> mul
            | product "/" item -> div

        ?item: NUMBER -> number
            | "-" item -> neg
            | "(" sum ")"

        %import common.NUMBER
        %import common.WS
        %ignore WS
''', start='sum')
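
As a quick usage sketch (assuming from lark import Lark was run earlier and the parser above compiled successfully), parsing the earlier example yields a tree whose node names follow the -> aliases in the grammar:

tree = parser.parse("437 + 734")
print(tree.pretty())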

Since their format depends on Python expressions, lrparsing grammars can be easy to read for Python developers, but they are harder to read than a traditional grammar.

# from the documentation
class ExprParser(lrparsing.Grammar):
    #
    # Put Tokens we don't want to re-type in a TokenRegistry.
    #
    class T(lrparsing.TokenRegistry):
        integer = Token(re="[0-9]+")
        integer["key"] = "I'm a mapping!"
        ident = Token(re="[A-Za-z_][A-Za-z_0-9]*")
    #
    # Grammar rules.
    #
    expr = Ref("expr")                # Forward reference
    call = T.ident + '(' + List(expr, ',') + ')'
    atom = T.ident | T.integer | Token('(') + expr + ')' | call
    expr = Prio(                      # If ambiguous choose atom 1st, ...
        atom,
        Tokens("+ - ~") >> THIS,      # >> means right associative
        THIS << Tokens("* / // %") << THIS,
        THIS << Tokens("+ -") << THIS,        # THIS means "expr" here
        THIS << (Tokens("== !=") | Keyword("is")) << THIS)
    expr["a"] = "I am a mapping too!"
    START = expr                      # Where the grammar must start
    COMMENTS = (                      # Allow C and Python comments
        Token(re="#(?:[^\r\n]*(?:\r\n?|\n\r?))") |
        Token(re="/[*](?:[^*]|[*][^/])*[*]/"))