Unix has a long-standing tradition of hosting tools that are specifically designed to generate code for various special purposes. The venerable monuments of this tradition, which go back to Version 7 and earlier days, and were actually used to write the original Portable C Compiler back in the 1970s, are lex(1) and yacc(1). Their modern, upward-compatible successors are flex(1) and bison(1), part of the GNU toolkit and still heavily used today. These programs have set an example that is carried forward in projects like GNOME's Glade interface builder.
yacc and lex are tools for generating language parsers. We observed in Chapter 8 that your first minilanguage is all too likely to be an accident rather than a design. That accident is likely to have a hand-coded parser that costs you far too much maintenance and debugging time — especially if you have not realized it is a parser, and have thus failed to properly separate it from the remainder of your application code. Parser generators are tools for doing better than an accidental, ad-hoc implementation; they don't just let you express your grammar specification at a higher level, they also wall off all the parser's implementation complexity from the rest of your code.
If you reach a point where you are planning to implement a minilanguage from scratch, rather than by extending or embedding an existing scripting language or parsing XML, yacc and lex will probably be your most important tools after your C compiler.
lex and yacc each generate code for a single function — respectively, “get a token from the input stream” and “parse a sequence of tokens to see if it matches a grammar”. Usually, the yacc-generated parser function calls a Lex-generated tokenizer function each time it wants to get another token. If there are no user-written C callbacks at all in the yacc-generated parser, all it will do is a syntax check; the value returned will tell the caller if the input matched the grammar it was expecting.
More usually, the user's C code, embedded in the generated parser, populates some runtime data structures as a side-effect of parsing the input. If the minilanguage is declarative, your application can use these runtime data structures directly. If your design was an imperative minilanguage, the data structures might include a parse tree which is immediately fed to some kind of evaluation function.
yacc has a rather ugly interface, through exported global variables with the name prefix yy_. This is because it predates structs in C; in fact, yacc predates C itself; the first implementation was written in C's predecessor B. The crude though effective algorithm yacc-generated parsers use to try to recover from parse errors (pop tokens until an explicit error production is matched) can also lead to problems, including memory leaks.
lex is a lexical analyzer generator. It's a member of the same functional family as grep(1) and awk(1), but more powerful because it enables you to arrange for arbitrary C code to be executed on each match. It accepts a declarative minilanguage and emits skeleton C code.
A crude but useful way to think about what a lex-generated tokenizer does is as a sort of inverse grep(1). Where grep(1) takes a single regular expression and returns a list of matches in the incoming data stream, each call to a lex-generated tokenizer takes a list of regular expressions and indicates which expression occurs next in the datastream.
lex was written to automate the task of generating lexical analyzers (tokenizers) for compilers. It turned out to have a surprisingly wide range of uses for other kinds of pattern recognition, and has since been described as “the Swiss-army knife of Unix programming”.[130]
If you are attacking any kind of pattern-recognition or state-machine problem in which all the possible input stimuli will fit in a byte, lex may enable you to generate code that will be more efficient and reliable than a hand-crafted state machine.
Most importantly, the lex specification minilanguage is much higher-level and more compact than equivalent handcrafted C. Modules are available to use flex, the open-source version, with Perl (find them with a Web search for “lex perl”), and a work-alike implementation is part of PLY in Python.
lex generates parsers that are up to an order of magnitude slower than hand-coded parsers. This is not a good reason to hand-code, however; it's an argument for prototyping with lex and hand-hacking only if prototyping reveals an actual bottleneck.
yacc is a parser generator. It, too, was written to automate part of the job of writing compilers. It takes as input a grammar specification in a declarative minilanguage resembling BNF (Backus-Naur Form) with C code associated with each element of the grammar. It generates code for a parser function that, when called, accepts text matching the grammar from an input stream. As each grammar element is recognized, the parser function runs the associated C code.
The combination of lex and yacc is very effective for writing language interpreters of all kinds. Though most Unix programmers never get to do the kind of general-purpose compiler-building that these tools were meant to assist, they're extremely useful for writing parsers for run-control file syntaxes and domain-specific minilanguages.
lex-generated tokenizers are very fast at recognizing low-level patterns in input streams, but the regular-expression minilanguage that lex knows is not good at counting things, or recognizing recursively nested structures. For parsing those, you want yacc. On the other hand, while you theoretically could write a yacc grammar to do its own token-gathering, the grammar to specify that would be hugely bloated and the parser extremely slow. For tokenizing input, you want lex. Thus, these tools are symbiotic.
If you can implement your parser in a higher-level language than C (which we recommend you do; see Chapter 14 for discussion), then look for equivalent facilities like Python's PLY (which covers both lex and yacc)[131] or Perl's PY and Parse::Yapp modules, or Java's CUP,[132] Jack,[133] or Yacc/M[134] packages.
As with macro processors, one of the problems with code generators and preprocessors is that compile-time errors in the generated code may carry line numbers that are relative to the generated code (which you don't want to edit) rather than the generator input (which is where you need to make corrections). yacc and lex address this by generating the same #line constructs that the C preprocessor does; these set the current line number for error reporting so the numbers will come out right. Any program that generates C or C++ should do likewise.
More generally, well-designed procedural-code generators should never require the user to hand-alter or even look at the generated parts. Getting those right is the code generator's job.
The canonical demonstration example that seems to have appeared in every lex and yacc tutorial ever written is a toy interactive calculator program that parses and evaluates arithmetic expressions entered by the user. We will spare you yet another repetition of this cliche; if you are interested, consult the source code of the bc(1) and dc(1) calculator implementations from the GNU project, or the paradigm example ‘hoc’[135] from [Kernighan-Pike84].
Instead, the grammar of fetchmail's run-control-file parser provides a good medium-sized case study in lex and yacc usage. There are a couple of points of interest here.
The lex specification, in rcfile_l.l, is a very typical implementation of a shell-like syntax. Note how two complementary rules support either single or double-quoted strings; this is a good idea in general. The rules for accepting (possibly signed) integer literals and discarding comments are also pretty generic.
The yacc specification, in rcfile_y.y, is long but straightforward. It does not perform any fetchmail actions, just sets bits in a list of internal control blocks. After startup, fetchmail's normal mode of operation is just to repeatedly walk that list, using each record to drive a retrieval session with a remote site.
We looked at Glade in Chapter 8 as a good example of a declarative minilanguage. We also noted that its back end produces a result by generating code in any one of several languages.
Glade is a good modern example of an application-code generator. What makes it Unixy in spirit are the following features, which most GUI builders (especially most proprietary GUI builders) don't have:
Rather than being glued together as one monster monolith, the Glade GUI and Glade code generator obey the Rule of Separation (following the “separated engine and interface” design pattern).
The GUI and code generator are connected by an (XML-based) textual data file format that can be read and modified by other tools.
Multiple target languages (as opposed to just C or C++) are supported. More could easily be added.
The design implies that it should also be possible to replace the Glade GUI editor component, should that ever become desirable.
[130] The common latter-day description of Perl as a “Swiss-army chainsaw” is derivative.
[131] PLY is downloadable.
[132] CUP is downloadable.
[133] Jack is downloadable.
[134] Yacc/M is downloadable.