 | Level: Introductory David Mertz (mertz@gnosis.cx), Analyzer, Gnosis Software, Inc.
01 Jan 2002 Many parsing tools have been written for Python. This column discusses a high-level parsing language built on top of Python. SimpleParse provides an EBNF-style syntax on top of mxTextTools that can greatly clarify the expression of grammars.
Like most programmers, I have frequently needed
to identify parts and structures that exist inside textual
documents: log files, configuration files, delimited data, and
more free-form (but still semi-structured) report formats. All
of these documents have their own "little languages" for what
can occur within them.
The way I have programmed these informal parsing tasks has
always been somewhat of a hodgepodge of custom state-machines,
regular expressions, and context-driven string tests. The
pattern in these programs was always, roughly, "read a bit of
text, figure out if we can make something of it, maybe read a
bit more text afterwards, keep trying."
Parsers of the formal variety distill descriptions of the parts and
structures in documents into concise, clear, and declarative
rules for how to identify what makes up a document. The
declarative aspect is particularly interesting here. All my
old ad hoc parsers were imperative in flavor: read some
characters, make some decisions, accumulate some variables,
rinse, repeat. As this column's installments on functional
programming have observed, the recipe style of program flow is
comparatively error-prone and difficult to maintain.
Formal parsers almost always use variants on Extended
Backus-Naur Form (EBNF) to describe the "grammars" of the
languages they describe. Those tools we look at here do so, as
does the popular compiler development tool YACC (and its
variants). Basically, an EBNF grammar gives names to the
parts you might find in a document; additionally, larger
parts are frequently composed of smaller parts. The frequency
and order in which small parts may occur in larger parts is
specified by operators -- mostly the same symbols you see in
regular expressions. In parser-talk, each named part in a
grammar is called a "production."
Possibly without even knowing it, readers have already seen
EBNF descriptions at work. For example, the familiar Python
Language Reference defines what a floating point number looks
like in Python:
EBNF-style description of floating point number
floatnumber: pointfloat | exponentfloat
pointfloat: [intpart] fraction | intpart "."
exponentfloat: (nonzerodigit digit* | pointfloat) exponent
intpart: nonzerodigit digit* | "0"
fraction: "." digit+
exponent: ("e"|"E") ["+"|"-"] digit+ |
Or you might have seen an XML DTD element defined in an EBNF
style. For example, the <body> of a developerWorks tutorial
looks like:
EBNF-style description in a developerWorks DTD
<!ELEMENT body ((example-column | image-column)?, text-column) > |
Spellings vary slightly, but the general notions of
quantification, alternation, and sequencing exist in all
EBNF-style language grammars.
Building tag lists with SimpleParse
SimpleParse is an interesting tool. To use this module, you
need the underlying module mxTextTools, which implements a
"tagging engine" in C. mxTextTools (see Resources later in this article) is powerful, but rather difficult to use.
Once SimpleParse is layered on top of mxTextTools, the work
becomes a lot easier.
Using SimpleParse is really quite simple, because it
removes the need to think about most of the complexity of
mxTextTools. The first thing to do is create an EBNF-style
grammar that describes the language you want to handle. The
second step is to call mxTextTools to create a tag list
that describes all the successful productions when the grammar
is applied to the document. Finally, you actually do
something with the tag list returned by mxTextTools.
For this article, the "language" we will parse is the set of
markup codes used by "smart ASCII" to indicate things like
boldface, module names, and book titles. This is the very same
language mxTextTools was earlier used to identify, and
regular expressions and state-machines before that, in earlier
installments. The language is far simpler than a full
programming language would be, but complicated enough to be
representative.
We probably need to back up for one moment here. What the heck
is a "tag list" that mxTextTools gives us? Basically, this
is a nested structure that simply gives the character offsets
where every production was matched in the source text.
mxTextTools traverses a source text quickly, but it does not
do anything to the source text itself (at least not when
using the SimpleParse grammars). Let's look at an abridged
tag list:
Tag list produced from SimpleParse grammar
(1,
[('plain',
0,
15,
[('word', 0, 4, [('alphanums', 0, 4, [])]),
('whitespace', 4, 5, []),
('word', 5, 10, [('alphanums', 5, 10, [])]),
('whitespace', 10, 11, []),
('word', 11, 14, [('alphanums', 11, 14, [])]),
('whitespace', 14, 15, [])]),
('markup',
15,
27,
...
289) |
The elipses in the middle represent a bunch more matches. But
the part we see says the following. The root production
("para") succeeds and ends at offset 289 (the length of the
source text). The child production "plain" matches offsets 0
through 15. This "plain" child is itself composed of smaller
productions. After the "plain" production, the "markup"
production matches offsets 15 through 27. The details are left
out, but this first "markup" is made of components, and
additional productions succeed later in the source.
An EBNF-style grammar for "smart ASCII"
We have glanced at the tag list that SimpleParse +
mxTextTools can give us. But we really need to look at
the grammar that was used to generate this tag list. The
grammar is where the real work happens. EBNF grammars are
almost self-explanatory to read (although designing one does
require a bit of thought and testing):
typographify.def
para := (plain / markup)+
plain := (word / whitespace / punctuation)+
whitespace := [ \t\r\n]+
alphanums := [a-zA-Z0-9]+
word := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup := emph / strong / module / code / title
emph := '-', plain, '-'
strong := '*', plain, '*'
module := '[', plain, ']'
code := "'", plain, "'"
title := '_', plain, '_'
punctuation := (safepunct / mdash)
mdash := '--'
safepunct := [!@#$%^&()+=|\{}:;<>,.?/"] |
This grammar is almost exactly the way you would describe the
"smart ASCII" language verbally, which is a nice sort of
clarity. A paragraph consist of some plain text and some
marked-up text. Plain text consists of some collection of
words, whitespace, and punctuation. Marked-up text might be
emphasized, or strongly emphasized, or module names, etc.
Strongly emphasized text is surrounded by asterisks. And so
on. A couple of features like just what a "word" really is, or
just what a contraction can end with, take a bit of thought,
but the syntax of EBNF doesn't get in the way.
In contrast, the same sort of rules can be described even more
tersely using regular expressions. This is what the first
version of the "smart ASCII" markup program did. But this
terseness is much harder to write, and harder still to tweak
later. The following code expresses largely (but not
precisely) the same set of rules:
Python regexs for smart ASCII markup
# [module] names
re_mods = r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;'"?!/-])"""
# *strongly emphasize* words
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""
# -emphasize- words
re_emph = r"""([\(\s'/"]|^)-(.*?)-([\s\.\),:;'"?!/])"""
# _Book Title_ citations
re_title = r"""([\(\s'/"]|^)_(.*?)_([\s\.\),:;'"?!/-])"""
# 'Function()' names
re_funcs = r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])"""
|
If you discover or invent some slightly new variant of the
language, it is a lot easier to play with the EBNF grammar
than with those regular expressions. Moreover, using
mxTextTools will generally be even faster in performing the
manipulations of the patterns.
Generating and using a tag list
For our sample program, we put the actual grammar in a separate
file. For most purposes, this is a good organization to use.
Changing the grammar is usually a different sort of task than
changing the application logic; and the files reflect this.
But the whole of what we do with the grammar is pass it as a
string to a SimpleParse function, so in principle we could
include it in the main application (or even dynamically
generate it in some way).
Let's look at our entire (compact) tagging application:
typographify.py
import os
from sys import stdin, stdout, stderr
from simpleparse import generator
from mx.TextTools import TextTools
input = stdin.read()
decl = open('typographify.def').read()
from typo_html import codes
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag(input, parser)
for tag, beg, end, parts in taglist[1]:
if tag == 'plain':
stdout.write(input[beg:end])
elif tag == 'markup':
markup = parts[0]
mtag, mbeg, mend = markup[:3]
start, stop = codes.get(mtag, ('<!-- unknown -->','<!-- / -->'))
stdout.write(start + input[mbeg+1:mend-1] + stop)
stderr.write('parsed %s chars of %s\n' % (taglist[-1], len(input))) |
Here is what it does. First read in the grammar, and create an
mxTextTools parser from the grammar.
Next we
apply the tag-table/parser to the input source to create a tag
list. Finally, we loop through the tag list, and emit some new
marked-up text. The loop could, of course, do anything else
desired with each production encountered.
For the particular grammar used for smart ASCII, everything in
the source text is expected to fall into either a "plain"
production or a "markup" production. Therefore, it suffices to
loop across a single level in the tag list (except when we look
exactly one level lower for the specific markup production,
such as "title"). But a more free-form grammar -- such as occurs
for most programming languages -- could easily recursively
descend into the tag list, and look for production names at
every level. For example, if the grammar were to allow nested
markup codes, this recursive style would probably be used.
You might enjoy the exercise of figuring out how to adjust
the grammar (hint: remember that productions are allowed to be
mutually recursive).
The particular markup codes that go to the output live in yet
another file, for organizational not essential reasons. A
little trick of using a dictionary as a switch statement is
used here (although the otherwise case remains too narrow in
the example). The idea is just that we might in the future
want to create multiple "output format" files for, say, HTML,
DocBook, LaTeX, or others. The particular markup file used for
the example looks like:
typo_html.py
codes = \
{ 'emph' : ('<em>', '</em>'),
'strong' : ('<strong>', '</strong>'),
'module' : ('<em><code>', '</code></em>'),
'code' : ('<code>', '</code>'),
'title' : ('<cite>', '</cite>'),
} |
Extending this to other output formats is straightforward.
Conclusion
SimpleParse provides a concise and very readable EBNF-style
wrapper to the underlying power and speed of the cryptic
mxTextTools C module. Moreover, EBNF grammars are already
familiar to many programmers, even if only in passing. I
cannot prove anything about what is easier to
understand -- intuitions differ -- but I can comment quantitatively
on source length. The mxTypographify module that was
manually developed earlier is the following size:
wc mxTypographify.py
199 776 7041 mxTypographify.py
|
Of these 199 lines, a fair number are comments. And 18 of
those lines are an included regular expression version of the
markup function that is included for timing comparisons. But
what the program does is essentially identical to what
typographify.py -- listed above -- does. In contrast, our
SimpleParse program, including its support files comes to:
wc typo*.def typo*.py
19 79 645 typographify.def
20 79 721 typographify.py
6 25 205 typo_html.py
45 183 1571 total
|
In other words, about one fourth as many lines. This version
has fewer comments, but that is mostly because the EBNF grammar
is fairly self-documenting. I would not want to emphasize LOC
too strongly -- obviously, you can play games with minimizing or
maximizing code length. But in a general way, one of the few
real empirical results of work studies on programmers is that
kLOC/programmer-month is fairly close to constant across
languages and libraries. Of course, the regular expression
version is, in turn, one third as long as the SimpleParse
version -- but I think the density of its expression makes it
fragile to maintain and harder to write.
I think, on balance, SimpleParse wins of the
approaches considered.
Resources
About the author  | 
|  |
David Mertz would like to write, with Nietzsche, that these are
the musings of an old philologist, but that untruth would
unmask itself. But perhaps his (right here gratuitously
plugged) forthcoming book, Text Processing in Python, will
someday be mistaken for a cybernetic variant of philology. David may be reached at mertz@gnosis.cx; his
life pored over at http://gnosis.cx/dW/. Suggestions and
recommendations on this, past, or future columns are welcome.
|
Rate this page
|  |