PLY yacc: Pythonic syntax for accumulating a list of comma-separated values

There are two productions. Use two separate functions. (There is no extra cost :-) )

def p_type_list_1(p):
    '''type_list : type'''
    p[0] = [p[1]]

def p_type_list_2(p):
    '''type_list : type_list COMMA type'''
    p[0] = p[1] + [p[3]]

Note: I fixed your grammar to use left-recursion. With bottom-up parsing, left-recursion is almost always what you want, because it avoids unnecessary parser stack usage, and more importantly because it often simplifies actions. In this case, I could have written the second function as:

def p_type_list_2(p):
    '''type_list : type_list COMMA type'''
    p[0] = p[1]
    p[0] += [p[3]]

Or "simplify" p_type_list to (you reduce by 1 line of code, not sure if that's worth it):

def p_type_list(p):
    '''type_list : type
                 | type_list COMMA type'''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1] + [p[3]]
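
For completeness, here is a minimal sketch of how the rule plugs into a working parser. The ID-based lexer is an illustrative assumption, not part of the original question:

import ply.lex as lex
import ply.yacc as yacc

# Illustrative lexer: types are just identifiers here
tokens = ('ID', 'COMMA')

t_ID = r'[A-Za-z_][A-Za-z0-9_]*'
t_COMMA = r','
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

# The accumulating rule from above (left-recursive)
def p_type_list(p):
    '''type_list : type
                 | type_list COMMA type'''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1] + [p[3]]

def p_type(p):
    '''type : ID'''
    p[0] = p[1]

def p_error(p):
    print("Syntax error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('int, float, string'))  # -> ['int', 'float', 'string']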

Suggestion : 2

I'm using YACC for the first time and getting used to BNF grammar. I'm currently building a list of types from a comma-separated list (e.g. int, float, string). The rules work, but I get the sense that my p_type_list logic is a bit of a kludge that could be simplified into a one-liner:

def p_type(p):
    '''type : primitive_type
            | array
            | generic_type
            | ID'''
    p[0] = p[1]

def p_type_list(p):
    '''type_list : type
                 | type COMMA type_list'''
    if not isinstance(p[0], list):
        p[0] = list()
    p[0].append(p[1])
    if len(p) == 4:
        p[0] += p[3]


Suggestion : 3

PLY consists of two separate modules, lex.py and yacc.py, both of which are found in a Python package called ply. The lex.py module is used to break input text into a collection of tokens specified by a collection of regular expression rules. yacc.py is used to recognize language syntax that has been specified in the form of a context-free grammar.

The identification of tokens is typically done by writing a series of regular expression rules; the example below shows how this is done using lex.py. The lex.lex() function uses Python reflection (or introspection) to read the regular expression rules out of the calling context and build the lexer. Once the lexer has been built, two methods are used to control it: lexer.input() to feed it text and lexer.token() to fetch tokens. Building the lexer in debug mode will produce various sorts of debugging information, including all of the added rules, the master regular expressions used by the lexer, and the tokens generated during lexing.

For example, given the input string

x = 3 + 42 * (s - t)

a tokenizer splits it into individual tokens

'x', '=', '3', '+', '42', '*', '(', 's', '-', 't', ')'

which are given names such as

'ID', 'EQUALS', 'NUMBER', 'PLUS', 'NUMBER', 'TIMES',
'LPAREN', 'ID', 'MINUS', 'ID', 'RPAREN'

and are ultimately paired with their values:

('ID', 'x'), ('EQUALS', '='), ('NUMBER', '3'),
('PLUS', '+'), ('NUMBER', '42'), ('TIMES', '*'),
('LPAREN', '('), ('ID', 's'), ('MINUS', '-'),
('ID', 't'), ('RPAREN', ')')

# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +, -, *, /
# ------------------------------------------------------------
import ply.lex as lex

# List of token names. This is always required
tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

# Test it out
data = '''
3 + 4 * 10
  + -20 * 2
'''

# Give the lexer some input
lexer.input(data)

# Tokenize
while True:
    tok = lexer.token()
    if not tok:
        break  # No more input
    print(tok)
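
As noted above, the lexer can also be built in debug mode, which dumps the added rules and the master regular expressions (debug is a documented keyword argument of lex.lex()):

# Rebuild the lexer with debugging output enabled
lexer = lex.lex(debug=True)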

Suggestion : 4

To build the parser, call the yacc.yacc() function. This function looks at the module and attempts to construct all of the LR parsing tables for the grammar you have specified; the first time yacc.yacc() is invoked, it prints a message as the tables are generated. PLY's error messages and warnings are also produced using the logging interface; this can be controlled by passing a logging object using the errorlog parameter. If any errors are detected in your grammar specification, yacc.py will produce diagnostic messages and possibly raise an exception. To resolve ambiguity, especially in expression grammars, yacc.py allows individual tokens to be assigned a precedence level and associativity; this is done by adding a precedence variable to the grammar file.
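
Below is a minimal sketch tying these pieces together; the expression grammar is illustrative, while the precedence tuple format and the errorlog parameter follow the PLY documentation:

import logging
import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE')

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

# Precedence and associativity, lowest level first; this resolves
# the shift/reduce ambiguity in the grammar below
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    else:
        p[0] = p[1] / p[3]

def p_expression_number(p):
    '''expression : NUMBER'''
    p[0] = p[1]

def p_error(p):
    print("Syntax error")

lexer = lex.lex()
# Errors and warnings are routed through the logging object
parser = yacc.yacc(errorlog=logging.getLogger('ply'))
print(parser.parse('3 + 4 * 10'))  # -> 43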
