The lexer produced by lex is a C routine called yylex, so we call it. We placed our original example in a file called ch. To create an executable program on our UNIX system we enter these commands. Lex translates the lex specification into a C source file called lex.yy.c. We then execute the resulting program to check that it works as we expect, as we saw earlier in this section.
Try it to convince yourself that this simple description really does recognize exactly the verbs we decided to recognize. We list more words than we did before, and in principle we could extend this example to as many words as we want. It would be more convenient, though, if we could build a table of words as the lexer is running, so we can add new words without modifying and recompiling the lex program.
In our next example, we do just that—allow for the dynamic declaration of parts of speech as the lexer is running, reading the words to declare from the input file. Declaration lines start with the name of a part of speech followed by the words to declare.
These lines, for example, declare four nouns and three verbs. The table of words is a simple symbol table, a common structure in lex and yacc applications. A C compiler, for example, stores the variable and structure names, labels, enumeration tags, and all other names used in the program in its symbol table. Each name is stored along with information describing the name.
In a C compiler the information is the type of symbol, declaration scope, variable type, etc. In our current example, the information is the part of speech.
Adding a symbol table changes the lexer quite substantially. The names of the parts of speech (noun, verb, and so on) are now keywords that introduce the declaration lines, though we still have a separate lex pattern for each reserved word. Example shows the definition section. We define an enum, which we use in our table to record the types of individual words and to declare a variable state.
We also declare our symbol table routines. Example shows the rules section. For declaring words, the first group of rules sets the state to the type corresponding to the part of speech being declared. We reset the state to LOOKUP at the beginning of each line so that after we add new words interactively we can test our table of words to determine if it is working correctly.
The user subroutines section in Example contains the same skeletal main routine and our two supporting functions. These last two functions create and search a linked list of words. If there are a lot of words, the functions will be slow since, for each word, they might have to search through the entire list. In a production environment we would use a faster but more complex scheme, probably using a hash table. Our simple example does the job, albeit slowly.
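The linked-list scheme described above can be sketched in C. The routine names and the enum members here are assumptions (the book's actual code may differ in detail), but the shape is the same: one function walks the list looking for a word, and the other pushes a new entry on the front.

```c
#include <stdlib.h>
#include <string.h>

/* Parts of speech for the table; the exact members are an assumption,
   with LOOKUP doubling as the "not found" value. */
enum word_type { LOOKUP = 0, NOUN, VERB, ADJECTIVE, ADVERB };

struct word {
    char *word_name;
    enum word_type word_type;
    struct word *next;
};

static struct word *word_list = NULL;   /* head of the linked list */

/* Walk the whole list; this is the linear search the text warns is slow. */
enum word_type lookup_word(const char *word)
{
    struct word *wp;
    for (wp = word_list; wp != NULL; wp = wp->next)
        if (strcmp(wp->word_name, word) == 0)
            return wp->word_type;
    return LOOKUP;                      /* not found */
}

/* Push a new entry on the front of the list; reject duplicates. */
int add_word(enum word_type type, const char *word)
{
    struct word *wp;
    if (lookup_word(word) != LOOKUP)
        return 0;                       /* already defined */
    wp = malloc(sizeof *wp);
    wp->word_name = malloc(strlen(word) + 1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    wp->next = word_list;
    word_list = wp;
    return 1;
}
```

Replacing the list with a hash table would change only these two functions; the lexer's calls to them stay the same, which is one reason to hide the table behind this small interface.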
Traditionally, a description of such a set of actions is known as a grammar. It seems especially appropriate for our example. Suppose that we wished to recognize common sentences. Here is a list of simple sentence types. At this point, it seems convenient to introduce some notation for describing grammars.
As an added example we could define an object as follows. Indeed, we could expand this definition of sentence to fit a much wider variety of sentences. However, at this stage we would like to build a yacc grammar so we can test our ideas out interactively. Before we introduce our yacc grammar, we must modify our lexical analyzer in order to return values useful to our new parser.
When you use a lex scanner and a yacc parser together, the parser is the higher-level routine. It calls the lexer yylex whenever it needs a token from the input. The lexer then scans through the input, recognizing tokens. The lexer and the parser have to agree on what the token codes are. We solve this problem by letting yacc define the token codes. Yacc defines each of these as a small integer using a preprocessor define.
Here are the definitions it used in this example. Token code zero is always returned for the logical end of the input. Yacc can optionally write a C header file containing all of the token definitions.
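Since the definitions themselves are not reproduced above, here is a hedged sketch of what such a yacc-written header might contain. The specific numbers are assumptions, but yacc does start token codes above 256 so they can never collide with literal single-character tokens, which yacc represents by their own character values.

```c
/* Sketch of a yacc-generated token header (what y.tab.h might hold
   for this grammar).  The exact values are assumptions; yacc assigns
   them, one small integer per token, starting at 257. */
#define NOUN        257
#define PRONOUN     258
#define VERB        259
#define ADVERB      260
#define ADJECTIVE   261
#define PREPOSITION 262
#define CONJUNCTION 263

/* Token code zero is reserved: yylex returns 0 at end of input. */
```

Because both the lexer and the parser include the same definitions, they automatically agree on the codes.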
You include this file, called y.tab.h, in the lexer. Example shows the declarations and rules sections of the new lexer. There are several important differences here. We have also added return statements to pass to the parser the token codes for the words that it recognizes.
These return statements show that yylex acts like a coroutine. Each time the parser calls it, it takes up processing at the exact point it left off. This allows us to examine and operate upon the input stream incrementally. The backslash in front of the period quotes the period, so this rule matches a period followed by a newline.
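The coroutine behavior can be illustrated without lex at all. In this sketch (an illustration, not the book's lexer) static variables hold the scan position between calls, so each call resumes exactly where the previous one returned, just as yylex does:

```c
#include <string.h>

/* Static state survives across calls, so the scanner resumes where it
   left off -- the essence of yylex's coroutine behavior. */
static const char *input = "dog is big";
static size_t pos = 0;

/* Copy the next blank-delimited token into buf.
   Return buf, or NULL at the logical end of the input. */
const char *next_token(char *buf, size_t bufsize)
{
    size_t i = 0;
    while (input[pos] == ' ')
        pos++;                          /* skip blanks */
    if (input[pos] == '\0')
        return NULL;                    /* end of input */
    while (input[pos] != '\0' && input[pos] != ' ' && i + 1 < bufsize)
        buf[i++] = input[pos++];
    buf[i] = '\0';
    return buf;
}
```

Three successive calls yield "dog", "is", and "big"; a fourth reports end of input. A yacc parser drives yylex in exactly this one-token-at-a-time fashion.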
The other change we made to our lexical analyzer was to omit the main routine, as it will now be provided within the parser. Finally, Example introduces our first cut at the yacc grammar. The structure of a yacc parser is, not by accident, similar to that of a lex lexer. We use it here for a C comment (as with lex, C comments belong inside C code blocks, at least within the definition section) and a single include file. Then come definitions of all the tokens we expect to receive from the lexical analyzer.
In this example, they correspond to the eight parts of speech. The name of a token does not have any intrinsic meaning to yacc, although well-chosen token names tell the reader what they represent. Although yacc lets you use any valid C identifier name for a yacc symbol, universal custom dictates that token names be all uppercase and other names in the parser mostly or entirely lowercase. The routine yyparse is the parser generated by yacc, so our main program repeatedly tries to parse sentences until the input runs out.
The rules section describes the actual grammar as a set of production rules or simply rules. Some people also call them productions. By default, the first rule is the highest-level rule. That is, the parser attempts to find a list of tokens which match this initial rule, or more commonly, rules found from the initial rule. The expression on the right-hand side of the rule is a list of zero or more names.
A typical simple rule has a single symbol on the right-hand side as in the object rule which is defined to be a NOUN. The symbol on the left-hand side of the rule can then be used like a token in other rules. From this, we build complex grammars. The parser executes an action at the end of a rule as soon as the rule matches. Since sentence is the top-level symbol, the entire input must match a sentence. The parser returns to its caller, in this case the main program, when the lexer reports the end of the input.
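The way rules build on one another can be mimicked by hand. This sketch is hand-written for illustration only (yacc generates a table-driven parser, not code like this); it matches a token stream against sentence: subject VERB object, with subject and object defined in terms of tokens, so the left-hand-side symbols behave exactly like the rules described above:

```c
/* Token codes for this sketch (assumed values, END = 0 as in yacc). */
enum { END = 0, NOUN, PRONOUN, VERB };

static const int *toks;   /* token stream being matched */
static int tpos;          /* current position in the stream */

static int peek(void) { return toks[tpos]; }
static int accept(int t) { if (peek() == t) { tpos++; return 1; } return 0; }

/* subject: NOUN | PRONOUN   -- a rule with two alternatives */
static int subject(void) { return accept(NOUN) || accept(PRONOUN); }

/* object: NOUN              -- a rule with a single symbol */
static int object(void) { return accept(NOUN); }

/* sentence: subject VERB object
   Return 1 if the entire stream matches a sentence. */
int match_sentence(const int *tokens)
{
    toks = tokens;
    tpos = 0;
    return subject() && accept(VERB) && object() && peek() == END;
}
```

Note the final check that the next token is END: as the text says, the entire input must match a sentence.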
Subsequent calls to yyparse reset the state and begin processing again. When the parser cannot match its input, it calls yyerror, which we provide in the user subroutines section, and then recognizes the special rule error.
You can provide error recovery code that tries to get the parser back into a state where it can continue parsing. If error recovery fails or, as is the case here, there is no error recovery code, yyparse returns to the caller after it finds an error.
This section can contain any C code and is copied, verbatim, into the resulting parser. In our example, we have provided the minimal set of functions necessary for a yacc-generated parser using a lex-generated lexer to compile: main and yyerror. The main routine keeps calling the parser until it reaches the end-of-file on yyin , the lex input file.
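The minimal glue just described might look like the following. The yyparse here is a stub that simply consumes its input so the sketch compiles and runs on its own; in the real program yacc supplies yyparse and lex supplies yyin.

```c
#include <stdio.h>

/* In the real program, yyin comes from the lex-generated lexer and
   yyparse from the yacc-generated parser; both are stand-ins here. */
FILE *yyin;

static int yyparse(void)
{
    while (fgetc(yyin) != EOF)   /* stub: just consume the input */
        ;
    return 0;
}

/* yacc calls yyerror when it cannot match the input. */
void yyerror(const char *s)
{
    fprintf(stderr, "%s\n", s);
}

/* The main loop: keep parsing until end-of-file on the input stream.
   (Named run() here so the sketch can be exercised directly.) */
int run(void)
{
    while (!feof(yyin))
        yyparse();
    return 0;
}
```

Each trip around the loop parses one sentence; yyparse returns after a complete match or an unrecovered error, and the loop starts it again.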
The only other necessary routine is yylex, which is provided by our lexer. In our final example of this chapter, Example, we expand our earlier grammar to recognize a richer, although by no means complete, set of sentences. We invite you to experiment further with this example; you will see how difficult English is to describe in an unambiguous way.
We have expanded our sentence rule by introducing a traditional grammar formulation from elementary school English class: a sentence can be either a simple sentence or a compound sentence which contains two or more independent clauses joined with a coordinating conjunction.
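A compound sentence of this kind is what a recursive grammar rule captures: a rule that refers to itself matches any number of clauses joined by conjunctions. This sketch (an illustration, not yacc output) checks the same shape over an array of token codes and counts the clauses, the way a left-recursive yacc rule would match them one at a time:

```c
/* Token codes for this sketch (assumed values, END = 0 as in yacc).
   CLAUSE stands in for a whole simple sentence already matched. */
enum { END = 0, CLAUSE, CONJ };

/* Match  CLAUSE (CONJ CLAUSE)*  -- the shape a left-recursive
   compound-sentence rule accepts.  Return the number of clauses
   matched, or 0 if the stream does not fit the shape. */
int count_clauses(const int *toks)
{
    int n, i = 0;
    if (toks[i++] != CLAUSE)
        return 0;
    n = 1;
    while (toks[i] == CONJ) {          /* each conjunction must be */
        i++;                           /* followed by another clause */
        if (toks[i++] != CLAUSE)
            return 0;
        n++;
    }
    return toks[i] == END ? n : 0;
}
```

The recursion in the grammar and the loop here are two views of the same thing: each additional "CONJ clause" pair extends the match by one step.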
Our current lexical analyzer does not distinguish between a coordinating conjunction (e.g., "and" or "but") and a subordinating conjunction (e.g., "if"). We have also introduced recursion into this grammar. Recursion, in which a rule refers directly or indirectly to itself, is a powerful tool for describing grammars, and we use the technique in nearly every yacc grammar we write. We called our various lexers ch1-N.
Similarly, we called our parsers ch1-M. Then, to build the output, we did the following in UNIX. The first line runs lex over the lex specification and generates a file, lex.yy.c. In the second line, we use yacc to generate both y.tab.c and y.tab.h. The next line compiles each of the two C files.
The final line links them together and uses the routines in the lex library libl.a.

As M. E. Lesk and E. Schmidt describe it, Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.
An input language may be as complex as a programming language, or as simple as a sequence of numbers. Unfortunately, usual input facilities are limited, difficult to use, and often are lax about checking their inputs for validity.
Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine.