Informatica 28 (2004) 257-263 257 Improved Error Recovery in Generated LR Parsers Boštjan Slivnik and Boštjan Vilfan University of Ljubljana Faculty of Computer and Information Science Tržaška 25, 1000 Ljubljana, Slovenia bostjan.slivnik@fri.uni-lj.si bostjan.vilfan@fri.uni-lj.si Keywords: LR parsing, error recovery and reporting Received: February 17, 2004 A new method for error recovery in LR parsers is described. An error recovery routine based on this new method can be generated automatically by a parser generator as a part of an LR parser. Based on the result that a viable suffix from which the unread part of the input is derived can be computed in certain states of an LR parser, the new method uses the viable suffix to discard the erroneous part of the input and to synchronize the parser stack with the rest of the input afterwards. Thus it resembles a simple but efficient error recovery method used by LL and other predictive parsers. It is proved that all states suitable for this kind of error recovery can automatically be identified by a parser generator. Povzetek: članek opisuje okrevanje po napaki v LR analizatorjih. 1 Introduction Compilers are programs that mostly process erroneous input. Robust error recovery and meaningful error reporting are therefore essential parts of any industrial-strength compiler. Nowadays many compilers perform syntax analysis using an LALR parser (more precisely, LA(1)LR(0) parser) that is generated automatically by a parser generator like yacc, bison, JavaCUP, etc. [3]. However, none of these parser generators uses any advanced method for error recovery and reporting mainly, because these methods are either (a) time consuming, or (b) require inspection of individual parser states and manual insertion of error recovery routines [4, 5]. LALR parser generators usually provide only a very simple mechanism for error recovery and none for error reporting. If a yacc generated parser is to recover after an error is encountered within a sentential form derived from a nonterminal A, a compiler writer should insert a production A —► error a manually where error is a yacc reserved word [2, 3]. In case of an error, a parser abandons other productions expanding A, moves forward over the erroneous part of the input and discards it until a string which can be reduced to a is seen. A reduction to A is performed and thus the parser is resynchronized. However, "proper placement of error tokens in a grammar is a black art" [3]. The paper is organized as follows. Section 2 includes definitions of some basic elements of formal language theory and parsing. Following the brief outline of the new method in Section 3, the construction of the finite automaton used by the error recovery routine is described in Section 4. This is followed in Section 5 by (a) the algorithm for computing the viable suffix needed for error recovery and (b) the algorithm for computing a grammatical context of the erroneous part of the input. Examples and figures are given along the way. 2 Basic definitions Standard terminology of formal language theory and parsing is assumed [4, 5]. Throughout the paper we assume that grammars are reduced (no useless or unreachable symbols) and $-augmented (production S' —► $S$ is added as the only production expanding the new start symbol S'). A string y e V* is a viable prefix (suffix) of G iff there exists a rightmost (leftmost) derivation S =^*G rm yu (S =^*gM uyR). ' The nondeterministic LR(fc) machine Nk for the grammar G = (V,, T, P, S) (where V contains both nonterminal and terminal symbols) is a finite (semi)automaton with the state set equal to Ik (the set of valid LR(fc)-items for G), an initial item i0 e Ik, and a mapping SN: Ik x (Vu{e}) —► 2Ik [1]. The deterministic LR(fc) (or LR(fc)LA(fc')) machine Mk for the grammar G is a finite (semi)automaton with a set of states Q C 2Ik, an initial state qs e Q, and a mapping Sm : Q x V —► Q. If S*M(qs, y) = q for some q e Q and y e V*, then [y] denotes the set of equivalent viable prefixes leading from the initial state qs to the state q. Furthermore, [y] uniquely determines q (and is thus just another name for q). 258 Informatica 28 (2004) 257-263 B. Slivnik et al. The LR(fc) parser is a pushdown transducer (M, t) (or simply M). M denotes the deterministic pushdown automaton based on the deterministic LR(fc) machine Mk, and t denotes the output effect (a mapping of parser actions into grammar productions). States of M are the same as the states of Mk . The stack alphabet of M is a set of states of Mk. A configuration (or instantaneous description) of a parser M is represented as $rl u$, where r g Q* and u e T* denote the stack contents and the unread part of the input, respectively. 3 The outline of the new method To relieve a compiler writer of the "black art" of proper placement of error productions, a better error recovery method is needed. Let us suppose that the input string uv' is being parsed with an LR(fc) parser M. Starting with the initial stack contents r0, the parser M performs the parser steps n(u) corresponding to the derivation $r01 uv'$ MU) $ri v'$ (1) and enters a configuration $7lv'$ (note that 7 and v' denote the stack contents and the unread part of the input, respectively). LR(k) parsers have the correct prefix property [4, 5], i.e., any string of terminals pushed on the parser stack is a prefix of some valid input. It follows from Derivation (1) that there exist derivations S- YVi (2) where the steps denoted by n(v[) are used to skip the part of the input derived from symbol X 1. Thus, the string V$ is the longest suffix of v'$ having a property that (k: V$) e FIRSTfc (S$). The second approach is to skip everything until a string which can be reduced to X2, the first symbol of S, is read, and then to resynchronize by pushing X1 and then X2 on the stack. Formally, Derivation (1) is extended with the derivation $ri v'$ = ^{■"i ) M n{"2 ) M $r v v2 v$ $r[7x1] iv2v$ $r[7Xi][7Xi X2] iv$ for various Vi and a single viable prefix 7 where r = r'7] (the stack contents r of parser M corresponds to the viable prefix 7 of Derivations 2). Therefore, there exist leftmost derivations S =^*m u5i =^*m uvi (3) for various viable suffixes Si. Now suppose that $r lv'$ is an error configuration. In other words, v' cannot be derived from any viable suffix Si in any of Derivations (3). The idea, on which the new method is based, can be outlined as follows: 1. If there is only one viable suffix S = X1S5 such that Si = S for any i in Derivations (3), and 2. if this particular S is known in the error configuration $r iv' $, then the parser should discard the next few tokens of input, resynchronize and continue parsing the string derived from S. If there exists a unique viable suffix S, there are two approaches to discard the erroneous part of the input. The first is to skip everything until a string from FIRSTk(5$) is seen, and then to resynchronize by pushing the symbol X1 on the stack. Using this approach, Derivation (1) can be extended with the derivation $ri v'$ = $r ivi v$ ) $r iv$ =^M $r[7Xi] 15$ where the steps denoted by n(v1) are used to skip the part of the input derived from symbol X1 (as in the case above) and where n(v2) is the parse of v2, the correct part of the input, in M. The parser using an error recovery routine based on either of the two strategies tries to skip as few symbols of the input string as possible. More precisely, there may be many different, but viable splittings of the input string v' either to strings v[ and 5 (as in the first approach) or to strings v[, v2 and 5 (as in the second approach). However, the problem of splitting the string v' is beyond the scope of this paper. Finally, if the viable suffix S cannot be determined uniquely in the parser state [7], the parser removes one state at the time from the parser stack until a state with a unique viable suffix is at the top of the stack. Then, one of the approaches described above is applied. 4 The construction of the error recovery routine Two conditions were set in the previous section that must be fulfilled if a parser is to recover from the error: (1) the viable suffix S must be unique and (2) it must be known. In general, a viable prefix 7 and thus a state [7] (a set of LR(k)-equivalent viable prefixes) can have many corresponding viable suffixes Si. To identify states of an LR(k) parser suitable for performing error recovery we start with the following two definitions [6]: Definition 1 LetNk = (Ij ,V,PNk ,i0) be a nondetermin-istic LR(k) machine for a grammar G = (V, T, P,S). A string of LR(k)-valid items i0i1 ...in e (IjG)* is called a (7, k) -path if there exists a sequence X1, X2,..., Xn e (VU{e}) so thatXn = e and[ij-1Xj ^ ij] e PNk where i = 1, 2,... ,n. Definition 2 (7, k) -paths p1 and p2, where p1 = i0i1 .. .in and p2 = i0 i 1 ... i'm are 0 -equivalent iff n = m and items ij and ij differ only in lookahead strings for all j = 1,2,...,n (i.e., if ij = [A ^ a • f3,x] and ij = [A' ^ a'• 3',xX], then A = A', a = a', and3 = 3'). uv IMPROVED ERROR RECOVERY IN. Informatica 28 (2004) 257-263 259 The reason for defining 0-equivalence of (y, k) -paths becomes obvious with the following lemma, which establishes a relationship between viable prefixes and suffixes on the one hand and LR(k) machines on the other. Lemma 1 Any two 0-equivalent (j,k)-paths define the same viable suffix. and Proof: Any (y, k)-path p = i0ii . ..in leftmost derivations all having the form specifies a set of Ai Kpi nm n(ai) Hm lm n(ai) lm KPm ''lm n(ai) lm Aj E N = { ((ii,qi), (i2,q2))\lX e VU{e} : [qiX — q2] e Pm A [iiX — i2] e Pn0 }, aiA2pi U1A2P1 uia2A3^2^i UiU2A3 uiu2 ... um-iamAm+ifimfim-i. ..pi uiu2 ... umAm+iPmpm_i . ..pi, —► a.j Aj+ipj is the production of the where pj j-th LR(k)-item of p having the dot in the initial position at the far left (and Am+i may be e). As any change of lookahead strings in LR(k)-items of p does not affect the leftmost derivations above, all (y, k) -paths which are 0-equivalent to p define the same viable suffix S, where SR = Am+ipmpm-i. ..pi. □ When the traditional LR(k) parser enters the error configuration $ri v'$, the error is recognized because no action is specified for q and x = k: v', where r = r'q, i.e., ACTION(q, x) = error. But as LR(k) parsers have the correct prefix property, the first (k -1) symbols of the lookahead buffer are correct — otherwise the error would have been detected earlier. The first step is to identify all states [y] with the property that for any y' e [y], all (y',k) -paths ending with an item (k - 1)-active for (k - 1): v' are 0-equivalent (an item [A — aX • p, y] is k-active for x if and only if x e FIRST^(Py)). In other words, the stack contents of the LR(k) parser and the first (k -1) symbols in the looka-head buffer must define the viable suffix uniquely (remember that in error configuration $r I v'$ the k-th symbol of v' is erroneous and thus not useful for error recovery). To do so, we change the focus to the nondeterministic LR(0) machine N0. Definition 3 An LR(0) -item [A — a • p] of N0 is relevant for state [y] of the deterministic LR(k) machine for G if and only if, for all y' e y and i' = [A — a • p,y] e [y], all (y', k) -paths p = p'i' are 0-equivalent. During the parser construction, we compute a directed graph Gn with a set of vertices VN and a set of edges EN defined as Vn = { ([A — a • p],q) \^q e Qm,y e T* : [A — a • p,y] e q} respectively. It is derived from the graph of the nondeterministic LR(0) machine by (1) replicating each LR(0)-item as many times as there are states in M in which an LR(k) item with the corresponding core appears, and (2) erasing edge labels. Thus every path in GN starting with ([S' — •$S$],qs) has its corresponding path in N0 (and vice versa). As an example, consider the following $-augmented LALR grammar Gex with productions S' —► $S$, S —► AB, A —► aA, A —> e, B —► Bb, B —> b. The LALR machine for the grammar Gex as constructed by a modified version of bison is shown in Figure 1. The graph Gn for the grammar Gex is shown in Figure 2. The irrelevant vertices of GN (i.e., vertices (i, q) where i is irrelevant for q) can be identified by the following simple algorithm: 1. Compute the set of all conflicting vertices, i.e., those with at least two different predecessors in the same state: V (i) N = {v\ v = ([A — •p],q)A ((ii,q),v), ((i2,q),v) e en a ii = i2} (Different predecessors from different states represent no problem as the state itself helps determining the right path.) 2. Compute the set of successors of conflicting vertices: V (2) N {v\v is-reachable-from v' e T/N\ri)} In Figure 2 the irrelevant vertices are shaded while in Figure 1 they are closed between double lines. We define the skeleton automaton U with the set of states Qu = {i\(i,q)eVN \ (N U VN2))}, alphabet QM, and the mapping SU: QU x QM defined as QU Su (i, q') = i' ((i',q'), (i,q))e En a (i,q), (i',q')eVN \ (V^ U VN2)). The skeleton automaton is just a compact representation of the graph GN with irrelevant vertices removed. The skeleton automaton for the grammar Gex is shown in Figure 3. The aforementioned modification of bison makes 260 Informatica 28 (2004) 257-263 B. Slivnik et al. Figure 1: Bison-generated LALR machine for the grammar G. Note the different grammar augmentation: instead of using production S' —► $S$, bison uses a production $accept —► S$. S - •$S$, qs)—KS' — $^S$, qo)—KS' - >$S•$,q2^ •aA, qo ) 1 (B - •Bb, q3y^( B - * B • b, qz) \ .r "" w (A - aA,qi) ( A y X B -•b,qa ) (B- U - > aA^,qA ) ( B Figure 2: Graph GN for the grammar Gex. IMPROVED ERROR RECOVERY IN. Informatica 28 (2004) 257-263 261 qo qo qi (v5: A ) (W A - •aA \ Tvr: A " q qo,qi qi a qa qi (v4: S ^ AB • ~ (v»: A ^ aA• ' Figure 3: Skeleton automaton U for the grammar Gex. bison capable of computing the skeleton automaton — bison-generated skeleton automaton for the same grammar (although differently augmented) is shown in Figure 4. (the modification of bison makes bison capable of computing the skeleton automaton as described below). If the LR(k) parser is to start the error recovery process in state q and with the string x in the lookahead buffer, it should be able to select the right vertex of the skeleton automaton U. Hence, apart from the skeleton automaton, the parser must contain the table ERROR, which maps the topmost state and the first (k -1) symbols of the lookahead buffer to a vertex of U: ERROR: Q x T*(k- i) N • LR STATE SKELETON STATE S [S' - - »$S$j 0 V2 : [S - ^ »AB] 1 V7: [A - a • A] 2 V9 : [S' - - $S • $] 3 V3: [S - A • B] 4 vg: [A - aA»] 5 vio: [S' - - $S$•] 6 undefined 7 V4: [S - * AB»] 8 undefined Construction of the table ERROR is straightforward. To compute the value of ERROR(q, x), apply the following procedure: 1. If there exists a state of the skeleton automaton U corresponding to an item i where (i, q) e Vn \ (VN1 u V^) and all items of the core of the state q which are (k -1)-active for x map to i, i.e., V[A ^ a • (, y] e Core(q): x e FIRSTfc_i(3y) [A ^ a • (] = i, then ERROR(q, x) = i. Otherwise, the value of ERROR(q, x) is undefined. (The set Core(q) contains either all items [A ^ a • 3, y] where a = e if q = qS orthe item [S ^ •$£$, e] if q = qs.) 2. If there exists exactly one node (i1, q) eVN \ (V^ u V^) where Su(i',q) = i and there exists an item [A ^ a • 3,y] e q so that x e FIRSTk-1(3y), then set ERROR(q, x) = i' and repeat Step 2; otherwise, terminate. In other words, make the path leading from q to qS as long as possible, but keep it unique in respect with the first (k - 1) symbols of the lookahead buffer. Table 1 shows the ERROR table for the grammar Gex. 5 Computing the context of the syntax error As mentioned at the end of Section 2, not every state of LR(k) parser is suitable for error recovery. If an error is de- Table 1: The ERROR table for grammar Gex. tected in the state where error recovery cannot start, i.e., in the error configuration $ri v'$ where r = r'q, x = k: v' and ERROR(q, x) = ±, then the LR(k) parser must remove the topmost state from the stack and repeat the spawning of the error recovery. But as the lookahead string x in that state is no longer available, the parser must push the appropriate vertex of the skeleton automaton together with the at the time when the state itself. More precisely, the parser stack should not contain just parser states, but pairs consisting of a parser state and a vertex of the skeleton automaton which is to be used if an error occurs. Hence, whenever the state q is pushed onto the stack (as a result of either shift or reduce action), it should be pushed as a pair (q, ±). In the next step, before checking the action table and deciding on the next action, the parser must check the table ERROR and correct the value of the second component of the topmost pair on the stack. The algorithm for computing the context in which a syntax error occurs is presented in Figure 5 (Su is a transition function corresponding to the set of rewriting rules Pu of skeleton automaton U). It starts at the top of the stack and proceeds downward. It produces a list of LR(0)-items i1i2 ■■ - im where ij = [Aj ^ a.j • Aj+1(j,yj], which determine the derivation Ai lm 'lm aiA2ßi aia.2A?,ß2ßi >\m ala2 • • • um-lamAm+l ßmßm- l ßl —> —> 262 Informatica 28 (2004) 257-263 B. Slivnik et al. Figure 4: Bison-generated skeleton automaton U for the grammar Gex. context stack i | i --= [S' — $ • S$] = £ context stack@((q, _): st) [A — •?] = ctx 0 [A — •] where i = Su ([A — •Iq) ctx = context stack i context stack@((q, _): st) [A — a! • 3] = ctx 0 [A — aX • 3] where ctx = context st [A — a • X3] Figure 5: Algorithm for computing the viable suffix. and the viable suffix 3R3R ... [3R. Hence, we can write down the following theorem: Theorem 1 If any two {7',k) -paths ending with items which are (k — 1) -active for x, are 0-equivalent for each 7' € [7], then a viable suffix Si in derivation (3) can be computed from the stack contents r = r' [7] in the parser configuration $ri v$ for any v = xv'. A parse of erroneous string aacbb is shown in Tables 2 and 3. Both possible solutions mentioned in Section 2 are shown. In Table 2, the erroneous part of the input is discarded until the string b € FIRSTi(AB$) is found in the lookahead buffer. The resulting stack contents after error recovery is performed is therefore $[$][$a][$aa][$aaA]. This is a simple but efficient solution. In Table 3, however, the string bb is reduced to the second symbol of the viable suffix AB $, namely B. Hence, the resulting stack contents is $[$][$A][$B]. Finding and reducing the substring bb can be performed in the same way as by parsers generated by existing LALR parser generators if error productions are used. Finally, the list of items [S' — $ • S$], [S — •AB], [A — •aA], [A — a • A], [A — •aA], [A — a • A] STACK INPUT 1. $(qo,v2) aacbb$ 2. $(qo,v2)(qi,v7) acbb$ 3. $(qo,v2)(qi,vr)(qi,vr) cbb$ ^ context returns [S' — $ • S$] [S — •ab] [A — •aA] [A — a • A] [A — •aA] [A — a • A] yielding viable suffix (AB$)R: 4. $(qo,v2)(qi,v7)(qi,v7)(q4,v8) bb$ Table 2: A trace of parsing with error recovery: erroneous part of the input is discarded until b € FIRSTi(AB$) is seen. IMPROVED ERROR RECOVERY IN. Informatica 28 (2004) 257-263 263 STACK INPUT 1. $(^0,^2) aacbb$ 2. $(qo,v2)(qi,v7) acbb$ 3. $(qo,V2)(qi,V7)(qi,V7) cbb$ ^ context returns [S' $ • S$] [S ^ •AB] [A ^ •aA] [A ^ a • A] [A ^ •aA] [A ^ a • A] yielding viable suffix (AB$)R: 4. $(qo,V2)(q3,V3)(q7,V4) $ Table 3: A trace of parsing with error recovery: erroneous part of the input is discarded until bb derived from B in AB$ is reduced. represents a particularly good starting point for printing out helpful error messages because it provides the compiler writer with the exact grammatical context within which the error occurred. 6 Conclusion The presented method works with both, canonical LR(fc) parsers as well as LA(fc)LR(fc') parsers. The error recovery routine does not slow down or influence the parser until it encounters the first error, and it can be generated automatically. Besides, it has two main benefits: (a) a compiler writer needs not add any additional productions to the grammar and (b) it is a good starting point for meaningful error reporting. However, the generation of parser is slower and the generated parser is larger. References [1] A. V. Aho and J. D. Hopcroft. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979. [2] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley series in Computer Science. Addison-Wesley, 1986. [3] J. Levine, T. Mason, and D. Brown. Lex & Yacc. O'Reilly & Associates, 1992. [4] S. Sippu and E. Soisalon-Soininen. Parsing Theory,, Volume I: Languages and Parsing, volume 15 of EATCS Monographs on Theoretical Computer Science. Springer-Verlag, 1988. [5] S. Sippu and E. Soisalon-Soininen. Parsing Theory,, Volume II: LR(k) and LL(k) Parsing, volume 20 of EATCS Monographs on Theoretical Computer Science. Springer-Verlag, 1990. [6] B. Slivnik. Kombinacija Knuthovega in Lewis-Stearnsovega sintaksnega analizatorja z minimalno uporabo Knuthove analize. PhD thesis, University of Ljubljana, Ljubljana, Slovenia, 2003.