Rats Parser Generators

Java Source Code / Java Documentation
1. 6.0 JDK Core
2. 6.0 JDK Modules
3. 6.0 JDK Modules com.sun
4. 6.0 JDK Modules com.sun.java
5. 6.0 JDK Modules sun
6. 6.0 JDK Platform
7. Ajax
8. Apache Harmony Java SE
9. Aspect oriented
10. Authentication Authorization
11. Blogger System
12. Build
13. Byte Code
14. Cache
15. Chart
16. Chat
17. Code Analyzer
18. Collaboration
19. Content Management System
20. Database Client
21. Database DBMS
22. Database JDBC Connection Pool
23. Database ORM
24. Development
25. EJB Server geronimo
26. EJB Server GlassFish
27. EJB Server JBoss 4.2.1
28. EJB Server resin 3.1.5
29. ERP CRM Financial
30. ESB
31. Forum
32. GIS
33. Graphic Library
34. Groupware
35. HTML Parser
36. IDE
37. IDE Eclipse
38. IDE Netbeans
39. Installer
40. Internationalization Localization
41. Inversion of Control
42. Issue Tracking
43. J2EE
44. JBoss
45. JMS
46. JMX
47. Library
48. Mail Clients
49. Net
50. Parser
51. PDF
52. Portal
53. Profiler
54. Project Management
55. Report
56. RSS RDF
57. Rule Engine
58. Science
59. Scripting
60. Search Engine
61. Security
62. Sevlet Container
63. Source Control
64. Swing Library
65. Template Engine
66. Test Coverage
67. Testing
68. UML
69. Web Crawler
70. Web Framework
71. Web Mail
72. Web Server
73. Web Services
74. Web Services apache cxf 2.0.1
75. Web Services AXIS2
76. Wiki Engine
77. Workflow Engines
78. XML
79. XML UI
Java
Java Tutorial
Java Open Source
Jar File Download
Java Articles
Java Products
Java by API
Photoshop Tutorials
Maya Tutorials
Flash Tutorials
3ds-Max Tutorials
Illustrator Tutorials
GIMP Tutorials
C# / C Sharp
C# / CSharp Tutorial
C# / CSharp Open Source
ASP.Net
ASP.NET Tutorial
JavaScript DHTML
JavaScript Tutorial
JavaScript Reference
HTML / CSS
HTML CSS Reference
C / ANSI-C
C Tutorial
C++
C++ Tutorial
Ruby
PHP
Python
Python Tutorial
Python Open Source
SQL Server / T-SQL
SQL Server / T-SQL Tutorial
Oracle PL / SQL
Oracle PL/SQL Tutorial
PostgreSQL
SQL / MySQL
MySQL Tutorial
VB.Net
VB.Net Tutorial
Flash / Flex / ActionScript
VBA / Excel / Access / Word
XML
XML Tutorial
Microsoft Office PowerPoint 2007 Tutorial
Microsoft Office Excel 2007 Tutorial
Microsoft Office Word 2007 Tutorial
Java Source Code / Java Documentation » Parser » Rats Parser Generators 
Rats! Parser Generators
License:GNU General Public License (GPL)
URL:http://www.cs.nyu.edu/rgrimm/xtc/rats.html
Description:
Package NameComment
antlr
cryptix.jce.provider.asn
cryptix.jce.provider.cipher
cryptix.jce.provider.md
java.security
java.security.interfaces
javax.crypto
xtc xtc-wide constants.
xtc.lang Language-specific support for C and Java.
xtc.lang.antlr A Java parser generated by ANTLR. The ANTLR Java grammar is based on the Java grammar distributed at the ANTLR web site, but with all AST building annotations removed.
xtc.lang.c Language-specific support for C.
xtc.lang.c4 C4, an aspect-enhanced version of C.
xtc.lang.javacc A Java parser generated by JavaCC.
xtc.lang.javacc.syntaxtree The AST for the JavaCC-generated parser.
xtc.lang.javacc.visitor Visitors for the JavaCC-generated Java AST.
xtc.lang.jeannie A compiler contributed to xtc that integrates Java with C. Both language and compiler are described in an OOPSLA '07 paper by Martin Hirzel and Robert Grimm.

Prerequisites

You need Java 1.5, gcc, and the usual GNU tooling (in particular, gcc and make). We have tested Jeannie under Mac OS X with HotSpot, and under Linux and Cygwin with IBM Java.

Environment variables

JAVA_DEV_ROOT set this such that $JAVA_DEV_ROOT/xtc is the top-level xtc directory
PATH_SEP ':' for MacOS or Linux, or ';' for Cygwin
CLASSPATH $JAVA_DEV_ROOT/classes$PATH_SEP$JAVA_DEV_ROOT/bin/junit.jar$PATH_SEP$JAVA_DEV_ROOT/bin/antlr.jar
JAVA_HOME set this such that $JAVA_HOME/bin/java is the Java virtual machine
CPATH should include the directory that contains jni.h, which is most likely $JAVA_HOME/include
PATH should include $JAVA_HOME/bin
OSTYPE should be either cygwin, or have linux or darwin as a substring

Testing using the Makefile

Try the following:
make -C $JAVA_DEV_ROOT/fonda/jeannie_testsuite test_000
If all goes well, that should produce the output:
==== integration test_000 ====
Processing tmp/000sugared/Main.jni ...
diff tmp/000mangled/output.txt tmp/000sugared/output.txt
What happened is that the Makefile compiled and ran the same test written in Jeannie (fonda/jeannie_testsuite/input/000sugared_Main.jni) and in JNI (fonda/jeannie_testsuite/input/000mangled_Main.{c,java}), and compared the output.

You can also run all included integration tests in batch mode:

make -C $JAVA_DEV_ROOT/fonda/jeannie_testsuite test
To find out the individual compilation steps, uncomment the following line in the Makefile:
# export VERBOSE_MAKE=true

Compiling your own programs

The easiest way is to follow the existing examples and use the existing Makefiles. But if you prefer to compile by hand, the following example compiles and runs foo/Bar.jni
  • Run Jeannie preprocessor to inject "#include <jni.h>" at the start of the file.
    java -ea xtc.lang.jeannie.PreJeannieParser foo/Main.jni > foo/Main.jni.pp
    
  • Run C prepreocessor to resolve #includes, #ifdefs, and macros.
    # Mac OS:
    cc -DSPECIALIZE_RELPROD -DSPECIALIZE_AND -DSPECIALIZE_OR -DSMALL_NODES -fomit-frame-pointer -fno-common -I/System/Library/Frameworks/JavaVM.framework/Headers -E -x c foo/Bar.jni.pp > foo/Bar.jni.i
    # Linux:
    gcc -E -x c foo/Bar.jni.pp > foo/Bar.jni.i
    # Cygwin:
    gcc -mno-cygwin -I$JAVA_HOME/include -E -x c foo/Bar.jni.pp > foo/Bar.jni.i
    
  • Run Jeannie compiler.
    # Mac OS or Linux:
    java -ea -DJNICALL='' xtc.lang.jeannie.Jeannie -analyze -translate -in foo foo/Bar.jni.i
    # Cygwin:
    java -ea -DJNICALL='__attribute__((__stdcall__))' xtc.lang.jeannie.Jeannie -analyze -translate -in foo foo/Bar.jni.i
    
  • Compile resulting C code into a shared object file (dynamically linked libary):
    # Mac OS:
    cc -DSPECIALIZE_RELPROD -DSPECIALIZE_AND -DSPECIALIZE_OR -DSMALL_NODES -fomit-frame-pointer -fno-common -I/System/Library/Frameworks/JavaVM.framework/Headers -dynamiclib -framework JavaVM -o foo/libBar.jnilib foo/Bar.i
    # Linux:
    gcc -shared -o foo/libBar.so foo/Bar.i
    # Cygwin:
    gcc -mno-cygwin -I$JAVA_HOME/include -Wl,--add-stdcall-alias -shared -o foo/Bar.dll foo/Bar.i
    
  • Compile resulting Java code into class files (bytecode):
    javac -sourcepath foo -d foo foo/Bar.java
    
  • Tell the dynamic linker where to find the shared object file.
    export PATH=foo:"$PATH"
    export LD_LIBRARY_PATH=foo:"$LD_LIBRARY_PATH"
    
  • Run the code with a Java virtual machine.
    java -cp foo -Djava.library.path=foo Bar
    
xtc.lang.p2 This package contains P2 related language parsers and analyzers.
xtc.parser Rats!, a parser generator supporting extensible syntax.

Grammars for Rats! build on the Parsing Expression Grammar (PEG) formalism described in Brian Ford's Parsing Expression Grammars paper. However, since Rats! produces working parsers, the syntax of Rats! grammars is somewhat different from and more expressive than the PEG syntax described in the paper. Additionally, to make grammars more easily reusable and extensible, Rats organizes grammar fragments into modules. A good starting point for learning how to use Rats!, in addition to this introduction, are Rats!' own grammar in package xtc.parser and the C and Java grammars in package xtc.lang.

The rest of this document covers the following topics:

Syntax

Here is the syntax of Rats!' grammar modules, expressed in PEG syntax:
Module        <- Spacing Intro Production* EOF

Intro         <- ModuleDecl Dependency* Header? Body? Footer? Option?
ModuleDecl    <- "module" FSpacing ModuleRef SEMICOLON
Dependency    <- Modification / Instantiation / Import
Modification  <- "modify" FSpacing ModuleRef ModuleTarget? SEMICOLON
Instantiation <- "instantiate" FSpacing ModuleRef ModuleTarget? SEMICOLON
Import        <- "import" FSpacing ModuleRef ModuleTarget? SEMICOLON
ModuleRef     <- QName ModuleParams?
ModuleParams  <- OPEN ( QName (COMMA QName)* )? CLOSE
ModuleTarget  <- "as" FSpacing QName
Header        <- "header" Spacing Action
Body          <- "body" Spacing Action
Footer        <- "footer" Spacing Action
Option        <- "option" FSpacing Attribute (COMMA Attribute)* SEMICOLON

Production    <- Full / Addition / Removal / Override
Full          <- PAttributes QName Identifier EQUAL Choice SEMICOLON
Addition      <- QName Identifier PLUSEQUAL
                 ( SName ELLIPSIS SLASH Choice SEMICOLON
                 / Choice SLASH SName ELLIPSIS SEMICOLON )
Removal       <- QName Identifier MINUSEQUAL
                 SName ( COMMA SName )* SEMICOLON
Override      <- QName Identifier COLONEQUAL Choice SEMICOLON
                 / QName Identifier COLONEQUAL ELLIPSIS SLASH Choice SEMICOLON
                 / QName Identifier COLONEQUAL Choice SLASH ELLIPSIS SEMICOLON
                 / PAttributes QName Identifier COLONEQUAL ELLIPSIS SEMICOLON
PAttributes   <- &(QName Identifier EQUAL) / Attribute PAttributes
Choice        <- Sequence (SLASH Sequence)*
Sequence      <- !(SName ELLIPSIS / ELLIPSIS) SName? Voided*
Voided        <- ("void" Spacing COLON)? Prefix
Prefix        <- (AND / NOT / CARET / Identifier COLON
                 / StringLit Spacing COLON)? Suffix
Suffix        <- Primary (QUESTION / STAR / PLUS)?
Primary       <- NullLit / QName / Literal / NodeMarker / Action 
                 / OPEN Choice CLOSE

NullLit       <- "null" Spacing

NodeMarker    <- '@' Id Spacing

Action        <- '{' ActionBody* '}' Spacing
ActionBody    <- Action / CharLit / StringLit / MLComment / SLComment / !'}' .

Attribute     <- Identifier (OPEN AttValue CLOSE)?
AttValue      <- Integer / QName / StringLit Spacing

QName         <- Id ('.' Id)* Spacing
SName         <- LESS Id GREATER
Identifier    <- Id Spacing
Id            <- [a-zA-Z] [a-zA-Z0-9]*

Literal       <- ('_' / CharLit / StringLit / Class) Spacing
CharLit       <- ['] (Escape / !['\\] .)  [']
StringLit     <- ["] (Escape / !["\\] .)* ["]
Class         <- '[' (Char '-' Char / Char)* ']'
Char          <- Escape / ![-\]\\] .
Escape        <- '\\' [btnfr"'\[\\\]-] / '\\' 'u' HexQuad / OctalEscape
OctalEscape   <- '\\' ([0-3] OctDigit OctDigit / OctDigit OctDigit / OctDigit)

Integer       <- (HexNumber / OctNumber / Number) Spacing
HexNumber     <- '0' [xX] HexDigit+
HexQuad       <- HexDigit HexDigit HexDigit HexDigit
HexDigit      <- [0-9a-fA-F]
Number        <- '0' / NZDigit Digit*
NZDigit       <- [1-9]
Digit         <- [0-9]
OctNumber     <- '0' OctDigit+
OctDigit      <- [0-7]

ELLIPSIS      <- "..." Spacing
PLUSEQUAL     <- "+=" Spacing
MINUSEQUAL    <- "-=" Spacing
COLONEQUAL    <- ":=" Spacing
COMMA         <- ',' Spacing
EQUAL         <- '=' Spacing
SLASH         <- '/' ![/*] Spacing
AND           <- '&' Spacing
NOT           <- '!' Spacing
CARET         <- '^' Spacing
COLON         <- ':' Spacing
QUESTION      <- '?' Spacing
STAR          <- '*' Spacing
PLUS          <- '+' Spacing
OPEN          <- '(' Spacing
CLOSE         <- ')' Spacing
SEMICOLON     <- ';' Spacing
LESS          <- '<'
GREATER       >- '>' Spacing

Spacing       <- (Space / SLComment / MLComment)*
FSpacing      <- (Space / SLComment / MLComment)+
Space         <- ' ' / '\t' / '\f' / EOL
SLComment     <- "//" (![\n\r] .)* EOL
MLComment     <- "/*" ('*' !'/' / !'*' .)* "*/"
EOL           <- '\r' '\n' / '\r' / '\n'
EOF           <- !.
Note that QName stands for "qualified name," SName stands for "sequence name," PAttributes for "production attributes," AttValue for "attribute value," NZDigit for "non-zero digit," FSpacing for "forced spacing," SLComment for "single-line comment," and MLComment for "multi-line comment."

Overview of Expressions and Operators

The biggest difference between parsing expression grammars and the more familiar context-free grammars (CFGs) is the use of ordered choices instead of symmetric choices. Hence, the different options in a PEG choice (and also a Rats! choice) are separated by a slash / instead of a vertical bar |. Furthermore, to emphasize that PEGs define how to parse a language instead of how to generate a language, PEGs use a left arrow <- instead of the right arrow -> used in CFGs. Note that Rats! uses an equal sign instead.

Otherwise, PEGs have many of the same expressions and operators as other syntax and grammar formalisms. For example, the any character constant . (_ for Rats!) matches any character in the input, and a character class, as defined by the [] operator, matches any of the characters in the class. The option operator ? makes an expression optional, and the star * and plus + operators indicate zero-or-more and one-or-more repetitions, respectively. Somewhat less common are the and & and not ! operators, which denote syntactic predicates. Expressions in a syntactic predicate must match (for the and & operator) or not match (for the not ! operator) the input, though the corresponding text in the input is not consumed.

Rats! grammars differ from PEGs in that they have additional expressions and operators necessary for generating actual parsers. Most importantly, Rats! grammars include actions that generate semantic values. Actions are surrounded by curly brackets {}, just like blocks of code in C, C++, or Java. To access such semantic values, Rats! grammars can also include bindings. Bindings first specify the variable name, followed by a colon :, and then the expression to be bound. Additionally, Rats! grammars support semantic predicates, which are denoted by the and & operator directly followed by an action (which must be an expression evaluating to a boolean value). Furthermore, Rats! grammars support text matching expressions, which first specify the text to be matched as a string literal, followed by a colon :, and then the expression to be matched. A text matching expression

      "text":expr

is equivalent to the expression

      fresh-variable:expr &{ "text".equals(fresh-variable) }

but implemented more efficiently. Rats! grammars also support a voiding operator, which is specified as "void:" followed by an expression, node markers, which are specified as an at sign '@' immediately followed by an identifier, and parser actions, which are actions prefixed by a caret '^'. Voided expressions and node markers help with the automatic generation of abstract syntax trees and are explained here. Parser actions provide a low-level interface for extending parsers generated by Rats! and are explained here.

Other differences between PEGs and Rats! grammars include that, as already mentioned Rats! uses an underscore '_' as the any character constant, while PEGs use a dot '.'. Furthermore, Rats!, just like C/C++/Java, requires that single-quoted literals specify exactly one character, while double-quoted literals may specify an arbitrary number of characters. Escape sequences include basic C/C++/Java escapes such as '\n' and '\\', Java Unicode escapes such as '\u00ff', and '\[', '\-', and '\]' for character class specifications. Rats! grammars also use standard C/C++/Java comments. Note that, while Rats! supports Unicode, it only supports 16 bit characters in the basic multilingual plane (i.e., with code points between U+0000 and U+FFFF). Expressions recognizing supplementary characters (i.e., with code points between U+10000 and U+10FFFF) need to use Java's encoding as a pair of char values in the surrogates range.

Grammar Modules

A Rats! grammar consists of one top-level module, which is the module specified on the command line when invoking Rats!, and zero or more dependent modules. Each module starts with several declarations, whose syntax follows from Intro in the Rats! syntax specification above:
  • The module declaration specifies the fully qualified name of a grammar module. Optionally, the module may have one or more parameters, which are treated as module names and are replaced with the actual arguments throughout the module, notably in module dependency declarations and in qualified nonterminals (a qualified nonterminal consists of the module name, followed by a dot '.', followed by the unqualified nonterminal). Module parameters are discussed in more detail here.

  • Zero or more module dependency declarations specify how the current module depends on other modules. Rats! supports three types of such dependency declarations. First, import dependencies make another module's nonterminals referenceable from within the current module. Second, modification dependencies make another module's productions available for modification in the current module. Each module can have at most one modification dependency. Finally, instantiation dependencies instantiate other modules (typically, modules that require arguments) and make their names available for use in other dependencies. The effects of import and instantiation declarations are discussed in more detail here; module modifications are discussed in detail here.

  • The action code following an optional header declaration is copied into the parser just before the parser class declaration. If a grammar spans several modules, all unique header actions are copied into the parser class.

  • The action code following an optional body declaration is copied into the body of the parser class. Again, if a grammar spans several modules, all unique body actions are copied into the parser class.

  • The action code following an optional footer declaration is copied just after the parser class declaration. As for header and body declarations, if a grammar spans several modules, all unique footer actions are copied into the parser class.

  • An optional option declaration specifies parser options. Each option is expressed as an attribute, which is an identifier specifying the attribute's name followed by an optional integer literal, qualified name, or string within parentheses () specifying the attribute's value. The currently recognized options are discussed here.

A grammar's top-level module (i.e., the module specified on the command line when invoking Rats!) is treated somewhat differently from its dependent modules:

  • By default, the Java package and class names of the generated parser are automatically deduced from the fully qualified name of the top-level module. For example, the default package and class names for module xtc.parser.PGrammar would be xtc.parser and PGrammar, respectively. However, this default can be overriden through the parser option (and is, in fact, overriden by that module).

  • Option declarations are outside the module system, and only the top-level module's options are observed by the parser generator. The exceptions are the following options. First, the stateful option specifies a global state object and must have the same value across all of a grammar's modules (if it appears at all). Second, the setOfString option specifies a global set of strings and has the corresponding field's name as its value; each unique value results in its own global field. Finally the flag option specifies a global boolean field and has the corresponding field's name as its value; each unique value results in its own global field.

  • The top-level module must have at least one production that is marked as public, which indicates that that production is a top-level production and thus visible outside the parser class. Other modules' public productions are not treated as top-level production for the grammar.

  • Finally, because the top-level module is directly instantiated, it cannot have any parameters. Only dependent modules may have parameters; the corresponding arguments are specified when importing, instantiating, or modifying the module.

Similar to the Java compiler automatically loading dependent Java classes, Rats! automatically loads dependent modules from the file system. For example, when looking for the dependent module xtc.util.Spacing, Rats! searches for the file xtc/util/Spacing.rats on Unix-like systems. One or more root directories for this search can be specified through the -in command line option.

A grammar's modules can be visualized by using Rats!' -html command line option combined with either the -instantiated, -applied, or -processed command line option to specify the stage of Rats!' processing. In all three cases, Rats! creates a hyperlinked version of the grammar at that processing stage. Make sure that the grammar.css stylesheet is in the same directory as the generated HTML file(s). The output directory can be controlled with the -out command line option.

Productions and Semantic Values

After the module declarations follow the productions, with each production defining how to parse a nonterminal. Regular modules may only contain full productions, whose syntax follows from Full in the Rats! syntax specification above. Module modifications may also contain alternative additions, alternative removals, and production overrides, which are discussed below.

Each full production first lists zero or more attributes, then declares the type of the semantic value it returns, and then specifies the name of the nonterminal it defines (which must be unqualified). The definition follows after the equals sign = and is terminated by a semicolon ;. Each full production can either be public, protected, or private, as indicated by the corresponding attribute of the same name. A public production is a top-level production for the grammar if it appears in the top-level module. Otherwise, it is treated as a protected production, which is the default, can be omitted, and makes the production referenceable from other modules. Finally, a private production can only be referenced from within the same module. Note that the use of the transient attribute is explained here. All per-production attributes supported by Rats! are discussed here. Further note that types cannot be primitive types, such as int; though, as discussed here, void is allowed.

A production's semantic value is usually defined in an action by setting the yyValue variable (so named in deference to the Yacc parser generator) to the desired value. The action code can use characters or strings matched by a literal as well as semantic values returned by nested nonterminals through bindings. As specified above, each binding first declares the variable, followed by a colon, and then the bound expression.

For example, this is the production for the FullProduction nonterminal from Rats!' own grammar:

FullProduction FullProduction =
   atts:ProductionAttributes
   type:Name nt:UnqualifiedNonTerminal "=":Symbol choice:Choice ";":Symbol
      { yyValue = new FullProduction(atts.list(), type, nt, choice); }
   ;

The production first declares the type of the semantic value it returns to be FullProduction (which only coincidentally has the same name as the production). In the definition, the semantic value of recognizing the production's attributes is bound to the variable atts, the semantic value returned by the Name production is bound to the variable type, the semantic value returned by the UnqualifiedNonTerminal production is bound to the variable nt, and the semantic value returned by the Choice production is bound to the variable choice. The action code then uses these variables to create a new FullProduction object and assigns that object to yyValue, thus making the newly created object the production's semantic value.

In the generated parser, Rats! declares the types of variables as following:

  • Variables bound to the any character constant _, to a character literal, or a character class are declared as char.

  • Variables bound to a string literal are declared as String.

  • Variables bound to a nonterminal are declared to be of the type of the corresponding production. For example, the declared type of the Name production from the above example is String and, consequently, type is declared to be a String as well.

  • Similarly, yyValue is declared to be of the type of the corresponding production.

Passing the Semantic Value Through

Sometimes, the semantic value of a production is the same as the semantic value of one of the expressions appearing in that production. To avoid first binding that expression to some variable and then assigning that variable to yyValue , it is possible to directly bind yyValue.

For example, the production for the Header nonterminal from Rats!' own grammar makes use of this feature:

Action Header = "header":Word yyValue:Action ;
An equivalent, but more verbose definition might look like this:
Action Header = "header":Word a:Action { yyValue = a; } ;

Options, Repetitions, and Nested Choices

In general, Rats! lifts options, repetitions, and nested choices into their own, newly generated productions. It also desugars options and repetitions into equivalent productions without the option or repetition operators. Note that nested choices are not lifted if they are the last element in a sequence. Further note that repetitions are not desugared if they appear in a transient production. Finally, options are not desugared if they are not bound or if Rats can automatically deduce the value of the bound optional expression.

This lifting and desugaring raises several questions:

  • What is the semantic value of an expression before applying the option or repetition operator? Similarly, what is the semantic value of a nested choice?

  • For options and repetitions, what is the semantic value after applying the option or repetition operator?

  • Finally, for each of these expressions, what is the type of the newly generated production?

We now answer these questions in turn.

To determine the semantic value of an expression before applying the option or repetition operator and that of a nested choice, Rats! generally requires that each expression has its own, embedded action that defines the corresponding value. As a result, the definition of a production may contain several actions that all assign values to yyValue, even in nested expressions.

However, for some expressions, Rats! is able to automatically deduce their semantic values. Notably, if a nonterminal is optional or repeated, the semantic value of the expression before applying the option or repetition operator simply is the semantic value of the nonterminal's production. Similarly, if a nonterminal is the only expression appearing in an option, the semantic value for that option simply is the semantic value of the nonterminal's production. In these cases, no embedded actions are necessary.

For options, the semantic value after applying the option operator is the semantic value of the expression, if that expression can be matched in the input. Otherwise, the overall semantic value is simply null. For repetitions, the semantic value after applying the repetition operator is a {@link xtc.util.Pair}. Pairs are used to implement singly-linked lists and contain the semantic values of the repeated expressions. In the case of zero matches, the pair is the special {@link xtc.util.Pair#EMPTY empty pair}, which is also accessible through the type-safe {@link xtc.util.Pair#empty()} method. Rats! uses pairs as they allow for efficient addition of an element to the front of a sequence of pairs. Note that pairs can easily be converted into a Java Collections Framework list by calling {@link xtc.util.Pair#list()} (as illustrated in the production for Rats! productions shown above).

The type of a newly generated production representing a desugared option generally is Object. However, if Rats! can automatically determine the semantic value of an optional expression, it uses the more specific type (which, in the case of desugared optional nonterminals, is the type of the corresponding production). The type of a newly generated production representing a desugared repetition always is xtc.util.Pair. Finally, the type of a newly generated production representing a lifted choice always is Object.

To illustrate the general case of repetitions and nested choices, consider the following snippet from Rats!' own grammar, which is taken from the Terminal production and parses character class specifications:

   / '['
      l:( c1:ClassChar '-' c2:ClassChar
         { 
            yyValue = new CharRange(Utilities.unescape(c1).charAt(0),
                                    Utilities.unescape(c2).charAt(0));
         }
        / c1:ClassChar
         {
            yyValue = new CharRange(Utilities.unescape(c1).charAt(0));
         }
        )*
     ']' Spacing
      { yyValue = new CharClass(l.list()); }
The nested choice has its own actions, which create character ranges as their semantic values. The parser collects these character ranges into a sequence of pairs, which is then bound to the l variable. The outer action uses this sequence of pairs to create a new character class as the overall semantic value (converting the sequence of pairs into a Java Collections Framework list along the way).

To illustrate the automatic deduction of a semantic value, consider the production for sequences from Rats!' own grammar:

Sequence Sequence =
   !Ellipsis n:SequenceName? l:Voided*
      { yyValue = new Sequence(n, l.list()); }
   ;
A sequence essentially consists of zero or more repetitions of the Voided nonterminal; no embedded action is necessary for the repetition. Rats! automatically collects the individual semantic values into a sequence of pairs. The optional sequence name preceding the repeated Voided nonterminal simply is an identifier between less-than < and greater-than > signs and is used when modifying productions. It can also be used for documentation, as Rats! tries to preserve it throughout its internal processing, so that it will be included in the generated parser (within a comment). The corresponding optional expression does not need an embedded action either. The syntactic predicate !Ellipsis excludes sequence names followed by an ellipsis ... or an ellipsis by itself from sequences, as they are used to modify productions.

Void and Text-Only Productions

Adding actions to create a production's semantic value can seem tedious and may make grammars unnecessarily verbose. To reduce the need for explicit actions, Rats! supports special types of productions, which need no semantic actions at all. We now discuss void and text-only productions, which are especially useful for recognizing lexical syntax (such as identifiers, punctuation, numbers, and so on).

A void production is a production that declares the type of its semantic value to be void. Such a production does not require any actions, but it may not be bound to an identifier either. A void production is typically used for ignored spacing or punctuation elements. Void productions also improve the accuracy of Rats!' deduction of a compound expression's semantic value. If the compound expression only contains a single non-void nonterminal, that nonterminal's semantic value is used as the semantic value of the entire expression.

Here is an example void production from Rats!' own grammar:

transient void Spacing =
  (Space / LineTerminator / TraditionalComment/ EndOfLineComment)*
  ;
This production matches space characters, line terminators, and comments. Note that the transient keyword is explained here.

The fact the Spacing is declared as void is then used in the production for Symbol:

String Symbol = SymbolCharacters Spacing ;
No action is necessary because only the SymbolCharacters nonterminal produces a semantic value. Rats! automatically detects this case and passes the value of SymbolCharacters through.

In general, Rats! tries to deduce a compound expression's value, even if it is nested within another expression, ignoring nonterminals referencing a void production or explicitly voided expressions. Furthermore, for repetitions, options, and nested choices that do not appear as the last expression in a sequence, it treats the entire repetition, option, or nested choice as void, if all component expressions are nonterminals referencing void productions, explicitly voided expressions, or predicates. For example, if nt references a void production, then the expression "nt*" is equivalent to "void:(nt*)" (note that the parentheses are not strictly necessary).

A text-only production is a production that declares the type of its semantic value to be a String, does not contain any bindings to yyValue nor actions that reference yyValue and references only other text-only productions (if any). The semantic value of a text-only production is the text matched in the input, represented as a String. Note that the above Symbol production is not text-only because it references a void production. Also note that bindings (besides to yyValue) and semantic predicates may appear in text-only productions.

To illustrate text-only productions, consider the following productions from Rats!' own grammar:

String StringLiteral            = ["] ( EscapeSequence / !["\\] _ )* ["] ;

transient String EscapeSequence = 
   '\\' ( [btnfr\"\'\-\[\\\]] / 'u' HexNumber ) ;
transient String HexNumber      = HexDigit HexDigit HexDigit HexDigit ;
transient String HexDigit       = [0-9a-fA-F] ;
The StringLiteral production only references terminals and other productions that are also text-only. As a result, its semantic value is the text matched in the input, exactly what we expect from a production that matches string literals. Note that the transient keyword is explained here.

A note on bindings within text-only productions: When binding to a character terminal (the any character constant, a character class, or a character literal), the bound variable is a char (just as for all other productions). However, for all other expressions (including options, repetitions, and nested choices), the value of the bound variable always is a string representing the text matched by that expression in the input. For options that cannot be matched in the input, the value is the empty string (and not null).

Generic Productions

Void and text-only productions typically help with recognizing lexical syntax; though, they are of limited use for productions that recognize hierarchical syntax, such as the productions for FullProduction, Header, or Sequence shown above. Generic productions help simplify productions for hierarchical syntax. Just as for void and text-only productions, no explicit semantic actions are required.

A generic production is a production that declares generic as its type and returns a semantic value that is a {@link xtc.tree.GNode generic node}. Rats! automatically creates the production's semantic value: its name is the name of the production, and its children are the semantic values of the component expressions in the recognized sequence, with the exception of any voided expressions (using the void: operator), void nonterminals, and character terminals, which are ignored.

For example, we could rewrite the production for FullProduction as a generic production:

generic FullProduction =
   ProductionAttributes
   Name UnqualifiedNonTerminal void:"=":Symbol Choice void:";":Symbol ;
   ;
The rewritten production has a generic node as its semantic value; the grammar author thus does not need to define a FullProduction class to represent the semantic value. Furthermore, the rewritten production is equivalent to the following production with an explicit semantic action:
Node FullProduction =
   v1:ProductionAttributes
   v2:Name v3:UnqualifiedNonTerminal "=":Symbol v4:Choice ";":Symbol
     { yyValue = GNode.create("Production", v1, v2, v3, v4); }
   ;
The two symbol references are not bound because they are explicitly voided.

Options, repetitions, and nested choices in generic productions are treated just like the corresponding expressions in regular productions. In other words, they may be desugared and lifted into their own productions. As a result, they require explicit actions to create their semantic values, unless, of course, Rats! can automatically determine the corresponding values.

List-Valued Productions

If a grammar has the flatten attribute, pairs resulting from a repeated expression are treated somewhat differently in generic productions: the values on the list are added to the production's generic node as individual children. For example, consider this production for recognizing parameter lists in C:

generic ParameterList =
   ParameterDeclaration (void:",":Symbol ParameterDeclaration)* ;
Because the semantic values of the embedded repetition are added as invidividual children to the production's generic node, that node's children are all parameter declarations, which is exactly the desired behavior.

Alternatively, Rats! can automatically deduce the semantic value of productions with a Pair type. The production's value is a list of all of a sequence's component expressions. For example, consider this rewritten production for recognizing parameter lists:

Pair<Node> ParameterList =
   ParameterDeclaration (void:",":Symbol ParameterDeclaration)* ;
Beacuse the declared type is Pair<Node>, Rats! automatically deduces the semantic value of the production. It is a list containing the value of the first parameter declaration followed by the value of the repetition. The production is equivalent to the following production using an explicit semantic action:
Pair<Node> ParameterList =
   head:ParamterDeclaration tail:(void:",":Symbol ParameterDeclaration)*
   { yyValue = new Pair<Node>(head, tail); } ;
As illustrated by this version, Rats! recognizes when the last component expression in a list-valued production already has a list value and turns that value into the overall list's tail. If the last component expression does not have a list value, the overall list's tail is the empty list.

Whether to (1) use the flatten attribute so that list elements are added as individual children to generic nodes or (2) use list-valued productions to create lists and preserve them as generic nodes' children is a trade-off. With the flatten attribute, ASTs are more uniform, typically consisting of only generic nodes and strings. In contrast, without the flatten attribute, generic nodes may also contain lists of, say, nodes or strings. However, flattening lists may result in the loss of information. For example, consider this production for Java compilation units:

public generic CompilationUnit =
  Spacing
  PackageDeclaration? ImportDeclaration* Declaration*
  EndOfFile ;
When lists are flattened, there is no way to distinguish import declarations from regular declarations without looking at the actual children. In contrast, when lists are not flattened, the generic node for compilation units has exactly three children, corresponding to package, import, and regular declarations respectively. Additionally, the preservation of lists in generic nodes enables more exact typing of ASTs through Rats!' -ast option. Consequently, the recommended practice is not to use the flatten attribute.

Left-Recursive Productions

To simplify the specification of productions that recognize expressions at different precedence levels, void, text-only, and generic productions may contain direct left-recursions; though arbitrary left-recursions are illegal in Rats! grammars just as for PEGs. Directly left-recursive productions are automatically transformed into equivalent right-iterative productions, while preserving the left-recursive structure of the original production's semantic value. As an example, consider the following generic production for recognizing logical and expressions in C:

generic LogicalAndExpression =
     LogicalAndExpression void:"&&":Symbol BitwiseOrExpression
   / BitwiseOrExpression
   ;
This directly left-recursive production is equivalent to the following two productions, which leverage so-called actions (i.e., promises) to create the semantic values:
Node LogicalAndExpression =
   seed:BitwiseOrExpression actions:LogicalAndExpressionTail*
      { yyValue = apply(actions, seed); }
   ;

Action<Node> LogicalAndExpressionTail =
   "&&":Symbol right:BitwiseOrExpression
      {
         yyValue = new Action<Node>() {
            public Node run(Node left) {
               Node result = GNode.create("LogicalAndExpression", left, right);
               result.setLocation(location(yyStart));
               return result;
            }};
      }
   ;
The LogicalAndExpressionTail* expression creates a list of {@link xtc.util.Action actions}, with each action on the list creating a generic node that is annotated with the source location corresponding to the start of the production. The LogicalAndExpression production uses {@link xtc.parser.ParserBase#apply(Pair,Object)} to apply each action in the list onto the result of the previous action, starting with the seed value for the first action. If the list is empty, the seed value is simply passed through, which is the desired behavior. By using actions and thus delaying the creation of generic nodes, the rewritten productions ensure that the resulting abstract syntax tree preserves left-associativity. Note that actions can also be used outside of generic productions to create left-associative abstract syntax trees.

In the previous example, the use of a list of actions results in the correct treatment of the base case: no new generic node is created, rather the semantic value of the bitwise or expression is simply passed through. When specifying productions that do not use actions, grammar writers can achieve similar results by explicitly binding yyValue. For alternatives in a generic production that assign yyValue either through a binding or a semantic action, Rats! does not create a new generic node but rather uses the explicitly specified semantic value.

Node Markers in Generic Productions

Also in the previous example, there is only one recursive alternative; consequently, the production's name, i.e., "LogicalAndExpression", provides a meaningful name for the newly created generic node. However, languages such as C contain more than one left-recursive operator at the same precedence level. Using the features described so far, grammar developers can either write one directly left-recursive generic production for all operators, thus using the same name for all operators' generic nodes, or they can write several right-iterative productions that create the generic nodes through explicit semantic actions. Neither option is particularly attractive and node markers provide a more elegant solution. They are written as an at sign '@' immediately followed by an identifier and specify a generic node's name for nodes generated by the sequence they appear in.

As an example, consider the following generic production for recognizing postfix expressions in C:

generic PostfixExpression =
    <Subscript>         PostfixExpression void:"[":Symbol Expression
                                          void:"]":Symbol
                                          @SubscriptExpression
  / <DirectSelection>   PostfixExpression void:".":Symbol Identifier
                                          @DirectComponentSelection
  / <IndirectSelection> PostfixExpression void:"->":Symbol Identifier
                                          @IndirectComponentSelection
  / <FunctionCall>      PostfixExpression void:"(":Symbol ExpressionList?
                                          void:")":Symbol
                                          @FunctionCall
  / <Increment>         PostfixExpression void:"++":Symbol
                                          @PostincrementExpression
  / <Decrement>         PostfixExpression void:"--":Symbol
                                          @PostdecrementExpression
  / <Compound>          CompoundLiteral
  / <Primary>           PrimaryExpression
  ;
In a single directly left-recursive production, the example specifies all of C's postfix expressions. Yet, each recursive alternative contains a distinct node marker, thus resulting in a differently named generic node. Internally (and as described above), Rats! converts the directly left-recursive production into the corresponding right-iterative version, which uses actions to create left-recursive AST nodes. Note that the last two alternatives do not require node markers (nor bindings to yyValue) because they provide the base cases for the recursion.

Modules, Name Spaces, and Parameters

The intent behind Rats!' module system is to facilitate the re-use and extensibility of grammar specifications. Even without parameters and modifications, basic modules help, as they allow a grammar writer to break up a grammar into several units. All modules exist in the same global name space. Without parameters and modifications, this name space consists simply of the module names as specified by the corresponding module declarations. With parameters and modifications, this name space consists of the module names of all instantiated modules (with instantiation being the process of providing arguments to parameterized modules).

In contrast to module names, nonterminals are only meaningful in relation to a specific module. Without any import or modify declarations, a module can only reference the nonterminals defined in that module. In the presence of import or modify declarations, a module can also reference the public and protected nonterminals defined by imported modules and all nonterminals defined by modified modules.

Nonterminals may be ambiguous, e.g., the same nonterminal may be defined within a module and an imported module or in several imported modules. Rats! gives precedence to nonterminals defined in the same module as the nonterminal reference. For example, if module Foo defines nonterminal Name and imports module Bar, which also defines nonterminal Name, then a reference to Name appearing in one of Foo's productions is interpreted to mean Foo's Name. Bar's Name can still be referenced by using the qualified notation Bar.Name. Furthermore, if module Foo does not define Name but imports Bar and Baz, which both define Name, any reference to Name in Foo must be qualified, writing Bar.Name and Baz.Name respectively. Note that, while qualified nonterminals are globally unique and thus always unambiguous, a nonterminal, whether qualified or not, can only be used if the corresponding production's module is the same module, imported by the referencing module, or modified by the referencing module.

Parameterized modules improve re-use when compared to basic modules, as the same parameterized module can be used with different actual dependencies. For example, module xtc.util.Symbol takes one parameter for a module defining spacing:

module xtc.util.Symbol(Spacing);

import Spacing;

String Symbol = SymbolCharacters Spacing ;

transient String SymbolCharacters = ..elided.. ;
As a result, this module can be instantiated with different forms of spacing, without changing xtc.util.Symbol. At the same time, an instantiation must provide a module that specifies a void nonterminal Spacing for the resulting module to be meaningful.

A parameterized module is instantiated by specifying the corresponding arguments in the corresponding import declaration:

import xtc.util.Symbol(xtc.util.Spacing);
This import declaration without a target name is equivalent to the following declaration with a target name:
import xtc.util.Symbol(xtc.util.Spacing) as xtc.util.Symbol;
In either case, all occurrences of the parameter module name Spacing are replaced by the specified argument module name xtc.util.Spacing. Furthermore, all occurrences of the module name itself are replaced by the specified target name (which, in this example, is the same as the module name). Finally, the instantiated module becomes the sole module of that name in the global module name space. No other instantiation with the same target name is possible, unless it instantiates exactly the same module with the same arguments.

Note that the nonterminal Spacing in the production for Symbol is not renamed during instantiation, as it denotes a nonterminal and not a module. However, if the nonterminal was written as Spacing.Spacing, it would be renamed to xtc.util.Spacing.Spacing. Further note that xtc.util.Symbol could be instantiated with a different argument within the same grammar, as long as that instantiation's target name is different. Finally, note that once instantiated, other modules can reference the instantiated module through its target name. For example, after module xtc.util.Symbol has been instantiated through either of the above import declarations, it can also be imported as following:

import xtc.util.Symbol;
This import declaration, when appearing in a dependent module, references the same instantiated module.

While module parameters and arguments can only be module names and never more complex expressions, sometimes more complex arguments are desirable. For example, a grammar writer may want to instantiate xtc.util.Symbol with a more complex module my.Spacing, which requires its own argument my.Argument. In this case, the grammar writer can use an instantiate declaration, which instantiates a module but does not make its nonterminals accessible from within the instantiating module. In the example, the grammar writer would use the following declarations:

instantiate my.Spacing(my.Argument) as Spacing;
import xtc.util.Symbol(Spacing);

Module dependencies are resolved through a breadth-first search starting with the top-level module. In other words, a dependent module's dependencies are only processed after all the dependencies of the depending module have been resolved. As a result, the order of import, instantiate, and modify declarations in a module does not matter and mutually dependent modules can be instantiated within the same module. For example, xtc.lang.CSpacing has a parameter for a constant module and xtc.lang.CConstant has a parameter for a spacing module, with each module importing the parameterized module. To instantiate these two mutually dependent modules, module xtc.lang.C simply declares (with some intermediate declarations omitted and ignoring the additional xtc.lang.CState argument):

instantiate xtc.lang.CConstant(xtc.lang.CSpacing);
instantiate xtc.lang.CSpacing(xtc.lang.CState, xtc.lang.CConstant);
Because it uses breadth-first search, Rats! processes these two instantiate declarations before processing the corresponding import declarations in xtc.lang.CConstant and xtc.lang.CSpacing. By the time Rats reaches the import declarations in the dependent modules, the modules have been instantiated and no arguments are necessary.

Module Modifications

To further facilitate re-use and extensibility, Rats! supports module modifications, which concisely express extensions to a grammar and, in addition to full productions, can also contain alternative additions, which add new alternatives to a production's top-level choice, alternative removals, which remove alternatives from a production's top-level choice, and production overrides, which can override an entire production, specific alternatives, or a production's attributes. All three types of partial production, whose syntax is specified above and illustrated below, depend on the different alternatives in a production's top-level choice having sequence names. Consequently, it is good practice to always name your sequences!

A module modification must contain a single modify declaration, which specifies the module to be modified. For example, the following declaration specifies that module xtc.lang.JavaSymbol modifies module Symbol:

module xtc.lang.JavaSymbol(Symbol);

modify Symbol;
The example uses a module parameter to ensure that module xtc.lang.JavaSymbol can be applied to different versions of Symbol. The resulting module contains all full productions appearing in both the modifying and modified modules (in the example, xtc.lang.JavaSymbol and Symbol, respectively). Furthermore, all productions in the resulting module are modified as specified by the alternative additions, alternative removals, and production overrides in the modifying module. Modifications are applied in the same order as they appear in the modifying module (in the example, xtc.lang.JavaSymbol). The resulting productions are the only versions; i.e., all nonterminals, whether they originally appear in the modifying module or the modified module reference the modified productions.

The resulting module's options are the options of the modifying module, with exception of any stateful, setOfString, and flag options, which are preserved if they appear in the modified module's options. Furthermore, the modifying module's header, body, and footer actions are combined with the modified module's actions (if they exist).

An alternative addition adds an expression before or after an existing sequence. For example, the following addition adds sequence s after sequence <Bar> in production Foo:

Type Foo += <Bar> ... / <NewSequence> s ;
Similarly, the following addition adds the new sequence before sequence <Bar>:
Type Foo += <NewSequence> s / <Bar> ... ;
Note that the new expression may actually consist of several sequences and not only one as suggested by the examples. Further note that the type and nonterminal must both be specified and match an existing full production.

An alternative removal removes one or more sequences from a production. For example, the following removal eliminates sequences <Bar> and <Baz> from production Foo:

Type Foo -= <Bar>, <Baz> ;
As before, the type and nonterminal must both be specified and match an existing full production.

A production override can replace all alternatives, only specific alternatives, or a production's attributes. For example, the following override replaces all alternatives of production Foo with expression e:

Type Foo := e ;
In contrast, this override only replaces alternatives <Bar> and <Baz<:
Type Foo := ... / <Bar> s1 / <Baz> s2 ;
Finally, this override replaces Foo's attributes with the public attribute:
public Type Foo := ... ;
Note that the list of attributes may be empty, thus removing all attributes from the full production.

Memoization and Transient Productions

Packrat parsers process their input in time linear to the size of the input, even if they need to backtrack. The reason they can do this is that they store, or memoize, the result of trying each nonterminal at every input position – hence the name "packrat" parser.

The benefits of memoization are best illustrated with a production from Rats!' own grammar:

Element Suffix =
   ( p:Primary "?":Symbol { yyValue = new Option(p);            }
   / p:Primary "*":Symbol { yyValue = new Repetition(false, p); }
   / p:Primary "+":Symbol { yyValue = new Repetition(true,  p); }
   / Primary
   )
   ;
The Suffix production matches possible suffix expressions in the input, with each of the four alternatives in the choice first trying to parse a primary expression. Assume that the current input contains a primary expression without a suffix (i.e., the fourth alternative is the alternative that will succeed). If the parser did not memoize its results, the primary expression would be parsed four times, once for each alternative, resulting in suboptimal performance. However, because parsers generated by Rats! memoize their results, the primary expression is parsed exactly once. Each subsequent alternative simply looks up the result of the first parse.

Note that it is possible to rewrite the production for Suffix to not require backtracking. An alternative expression, without semantic actions, might look like this:

   Primary ( Question / Star / Plus / /* Empty */ )
Further note that other top-down parsers, such as JavaCC and ANTLR, may not require backtracking, even with a comparable production to the one used in Rats!' own grammar as they support (limited) lookahead. However, the original form clearly and concisely expresses the intent behind this production and, because parsers generated by Rats! memoize their results, does not require any fiddling with lookahead (as would be necessary for other top-down parsers).

In theory, packrat parsers create a two-dimensional array, with characters in the input stream along the x-axis and nonterminals along the y-axis. In practice, not all values in this array are calculated (many of them represent mismatched nonterminals anyway) and the array is sparse. To avoid allocating memory for the entire array, parsers generated by Rats! break each column into several chunks. Memory for the different chunks is allocated independently and on demand (i.e., if a parsing result needs to be stored). Furthermore, for productions that are only referenced once in the entire grammar the corresponding row is never allocated.

Grammar writers have further control over memory allocation by declaring productions to be transient: if a production is transient, the corresponding row is not allocated either. The transient keyword can thus reduce a parser's memory footprint and also improve its performance. However, it is also dangerous because, if overused, the resulting parser may need to reparse the same input for the same nonterminal several times and thus perform in time superlinear to the size of the input.

So, which productions should be declared transient? In other parsers, the parser processes tokens generated by a separate lexer instead of accessing the character stream directly. Packrat parser grammars thus need to include productions that build up tokens. Examples include the Escape, HexNumber, and HexDigit productions in Rats!' own grammar (shown above). Such productions can be declared transient, as the parser typically does not backtrack at the level of individual characters. Furthermore, productions that cover ignored input characters, notably all productions that cover white space and comments, can be declared transient as well. However, when in doubt, it is best to not declare a production as transient, or to perform performance studies on several inputs to see how such declarations impact the parser's memory footprint and performance.

Grammar and Production Attributes

Rats! supports a number of options, which are either specified as a grammar-wide option in the top-level module's introduction or as per-production attributes:
  • public, protected, and private instruct Rats! to make a production a top-level production, referenceable from other modules, and referenceable only from the same module, respectively. These attributes do not have values. Furthermore, these attributes can only be specified as per-production attributes. The default is protected and can be omitted from productions.
  • transient instructs Rats! not to memoize the corresponding production. The attribute does not have a value and can only be specified as a per-production attribute.
  • memoized instructs Rats! to always memoize the corresponding production. The attribute does not have a value and can only be specified as a per-production attribute.
  • inline instructs Rats! not to memoize the corresponding production. For productions that are neither void nor text-only, it also instructs Rats! to inline the production if the corresponding nonterminal appears as the only element in an ordered choice's alternative. The attribute does not have a value and can only be specified as a per-production attribute.
  • noinline instructs Rats! not to inline the production, even if it is (recognized as) transient. The attribute does not have a value and can only be specified as a per-production attribute.
  • constant instructs Rats! to make all bindings constant and thus unmodifiable. The attribute does not have a value. This attribute can be specified either as a grammar-wide attribute or as a per-production attribute.
  • withLocation instructs Rats! to annotate all semantic values that are instances of {@link xtc.tree.Node} with their locations in the source file. The attribute does not have a value. This attribute can be specified either as a grammar-wide attribute or as a per-production attribute.
  • stateful instructs Rats! to include code for managing global parsing state through a {@link xtc.util.State state object}. This attribute is used both as a grammar-wide attribute and a per-production attribute. The grammar-wide attribute value specifies the class managing the global state (which must have a no-argument constructor). Rats! includes a static final field yyState, which references an instance of that class, in the generated parser. For a production with this attribute, the attribute does not have a value and Rats! includes the appropriate calls to {@link xtc.util.State#start()}, {@link xtc.util.State#commit()}, and {@link xtc.util.State#abort()}. Note that the state object must be reset explicitly with the per-production resetting attribute. Further note that, if a grammar spans several modules, all the module's stateful options must specify the same class name.
  • resetting instructs Rats! to include code to {@link xtc.util.State#reset(String) reset} the global state object. The attribut does not have a value. This attribute can only be specified as a per-production attribute and requires a grammar-wide stateful attribute.
  • ignoringCase instructs Rats! to perform comparisons for string matches in a case-insensitive manner. The attribute does not have a value. This attribute can be specified either as a grammar-wide attribute or as a per-production attribute.
  • flatten instructs Rats! to add the elements of list values (i.e., instances of {@link xtc.util.Pair}) to generic nodes instead of adding the list values themselves. This attribute does not have a value and can only be specified as a grammar-wide attribute.
  • variant instructs Rats! that all generic nodes returned by the production as a semantic value are variants of the same type. This attribute does not have a value and can only be specified as a per-production attribute.
  • withParseTree instructs Rats! to generate a parse tree instead of an abstract syntax tree. This attribute is specified as a grammar-wide attribute and does not have a value. For a grammar with this attribute, Rats! rewrites all generic, text-only, and void productions as well as productions that pass the semantic value through to preserve the input in the form of {@link xtc.tree.Formatting} annotations. The embedded AST generally has the same structure as for the grammr without the attribute. The exception are strings, which are represented as instances of {@link xtc.tree.Token}. Furthermore, generic nodes include additional children consisting of Formatting annotating null values if voided expressions or void nonterminals appear between two non-void repetitions.
  • verbose instructs Rats! to print debugging information to the console for each parser method invocation. The attribute does not have a value. This attribute can be specified either as a grammar-wide attribute or as a per-production attribute.
  • nowarn instructs Rats! to suppress warnings during parser generation. The attribute does not have a value; it can be specified either as a grammar-wide attribute or as a per-production attribute.
  • parser instructs Rats! to use the fully qualified class name specified by the attribute's value as the Java class name for the generated parser, instead of deducing that name from the module's name. This attribute can only be specified as a grammar-wide attribute.
  • factory instructs Rats! to use the (optionally qualified) class name specified by the attribute's value as the factory for creating generic nodes instead of using xtc.tree.GNode. This attribute can only be specified as a grammar-wide attribute.
  • visibility instructs Rats! to make the generated parser class either public or package private, depending on the attribute's value (public or packagePrivate). This attribute can only be specified as a grammar-wide attribute. The default for modules without this attribute is public.
  • rawTypes instructs Rats! to use raw types instead of generic types. Besides a "@SuppressWarnings("unchecked")" annotation for the parser class, the generated code is compatible with Java 1.4. The attribute does not have a value and can only be specified as a grammar-wide attribute.
  • main instructs Rats! to include a static main method that parses one or more input files and prints the corresponding results. The value of this attribute must be the name of the top-level nonterminal to parse. This attribute can only be specified as a grammar-wide attribute.
  • printer instructs Rats! to use the visitor specified by the attribute's value for printing the semantic value returned by a successful parse. This attribute can only be specified as a grammar-wide attribute and requires that the grammar also has the main attribute.
  • setOfString instructs Rats! to include a static final set of strings with the attribute's value as its name. Rats! also includes a convience method for filling sets add(Set,T[]). This attribute can only be specified as a grammar-wide attribute.
  • flag instructs Rats! to include a public static final boolean field with the attribute's value as its name. The field's value is true. This attribute can only be specified as a grammar-wide attribute.
  • genericAsVoid instructs Rats! to treat generic productions as void productions. The attribute does not have a value and can only be specified as a grammar-wide attribute.
  • dump instructs Rats! to include a method for dumping a plain-text representation of the parser's memoization table to a printer: dump({@link xtc.tree.Printer}). The attribute does not have a value and can only be specified as a grammar-wide attribute.

A note on licensing: Parsers generated with the main, printer, and dump attributes link with code licensed under the GNU GPL version 2. As a result, parsers generated with these attributes may not be used in software that is not compatible with the GPL version 2. To generate parsers that are not restricted by the GPL, use the -lgpl option when running Rats!.

Parser Actions

Some languages cannot be expressed by PEGs or Rats! grammars. For example, the on-the-wire format for SPKI/SDSI represents byte strings as one or more digits, followed by a colon, followed by the number of characters specified by the sequence of digits (when interpreted as a decimal integer); but parsing a specific number of characters cannot be expressed by PEGs. Parser actions allow grammar writers to still use Rats! for such languages by providing a low-level extensibility mechanism.

Parser actions work as follows. Before executing the code in a parser action, the variable yyBase is assigned the index of the current parser position. The parser action can then use this index together with the parser's methods to further consume the input, parsing the expression not expressible by PEGs. When finished, it assigns the corresponding result to yyResult, which must either be a {@link xtc.parser.SemanticValue} object indicating a successful parse or a {@link xtc.parser.ParseError} object indicating a parse error. In the case of a semantic value object, that object's semantic value becomes the semantic value of the production, unless another action after the parser action changes it.

The following grammar illustrates the use of a parser action for parsing byte strings as used by SPKI/SDSI. Note that the yyStart reference in the parser action is the parser position at the beginning of the production and is used for indicating the position of an error.

module ByteString;

body {
  Result parseChars(String number, int start, int base) throws IOException {
    int n;

    try {
      n = Integer.parseInt(number);
    } catch (NumberFormatException x) {
      return new ParseError("Malformed length", start);
    }
    
    StringBuilder buf = new StringBuilder(n);
    for (int i=0; i<n; i++) {
      int c = character(base + i);

      if (c != -1) {
        buf.append((char)c);
      } else {
        return new ParseError("Unexpected end of bytestring", base + i);
      }
    }

    return new SemanticValue(buf.toString(), base + n);
  }
}

option main(ByteString);

public String ByteString =
  n:Integer ':' ^{ yyResult = parseChars(n, yyStart, yyBase); } ;

String Integer = [0-9]+ ;
xtc.tree xtc's support for abstract syntax trees and for traversing them with visitors. This package defines the {@link xtc.tree.Node base class} for all abstract syntax tree nodes and the {@link xtc.tree.Visitor base class} for all visitors. It also defines a {@link xtc.tree.GNode generic node class}.

In contrast to the original visitor pattern, node visitors do not implement a fixed interface. Rather, the appropriate method is selected through reflection-based dynamic dispatch, based on the type of a node. On invocation of {@link xtc.tree.Visitor#dispatch}, the dynamic dispatch mechanism tries to find a method name visit() that takes either the specified node, any of its interfaces, or any of its superclaseses as its only argument. The search starts with the class of the node, then checks for all implemented interfaces, then for the superclass, then for all interfaces of the superclass, and so on until it reaches java.lang.Object, which is not considered.

For {@link xtc.tree.GNode generic nodes} the appropriate visit() method is selected based on the name of the generic node. For example, if the target name is Node, then the dynamic dispatch mechanism tries to locate a method visitNode(GNode). If no such method exists, the dynamic dispatch mechanism also tries to locate a visit(GNode) and visit(Node) method (in that order).

Note that to improve the performance of dynamic dispatch, our implementation uses a static method lookup cache that is not thread-safe. Source-to-source transformers thus cannot be multi-threaded.

xtc.type xtc's representation of types. This package defines a hierarchy of type classes, representing types of the C, Java, and ML programming languages as well as pseudo-types to help with type-checking programs. In particular:
  • {@link xtc.type.Type} is the superclass of all type classes and defines the common interface.
  • {@link xtc.type.VoidT} represents the void type.
  • {@link xtc.type.UnitT} represents the unit type.
  • {@link xtc.type.BooleanT} represents the boolean type.
  • {@link xtc.type.IntegerT} and {@link xtc.type.FloatT} represent integer and floating point {@link xtc.type.NumberT numbers}, respectively.
  • {@link xtc.type.InternalT} represents compiler-internal types, such as the __builtin_va_list type used by gcc for variable argument lists.
  • {@link xtc.type.LabelT} and {@link xtc.type.PackageT} represent labels and packages, respectively.
  • {@link xtc.type.PointerT}, {@link xtc.type.ArrayT}, {@link xtc.type.StructT}, {@link xtc.type.UnionT}, {@link xtc.type.FunctionT}, {@link xtc.type.MethodT}, {@link xtc.type.ClassT}, {@link xtc.type.InterfaceT}, {@link xtc.type.TupleT}, and {@link xtc.type.VariantT} represent {@link xtc.type.DerivedT derived types}.
  • {@link xtc.type.WrappedT} is the superclass of pseudo-types adding symbolic information to any of the previous types. {@link xtc.type.AliasT} represents type aliases; {@link xtc.type.AnnotatedT} provides annotations for another type; {@link xtc.type.EnumeratorT} and {@link xtc.type.EnumT} represent C enums; {@link xtc.type.VariableT} represents globals, locals, fields, and bitfields; {@link xtc.type.ParameterizedT} represents parameterized types; and {@link xtc.type.InstantiatedT} represents instantiations of parameterized types.
  • {@link xtc.type.Parameter} represents a type parameter and {@link xtc.type.Wildcard} represents a wildcard. Types containing instances of these classes are parameterized. They should be wrapped in a {@link xtc.type.ParameterizedT} and instantiated through a {@link xtc.type.InstantiatedT}.
  • Finally, {@link xtc.type.ErrorT} represents typing errors.
These type classes are complemented by a common interface for all {@link xtc.type.Tagged tagged types} (structs, unions, and enums) and for all {@link xtc.type.Constant constant-valued types} (enumerators and the ConstantT pseudo-type).

To model the memory shape of lvalues, this package also defines a separate hierarchy of references:

  • {@link xtc.type.Reference} is the superclass of all reference classes and defines the common interface. All references have a type specifying the referenced memory's shape or layout.
  • {@link xtc.type.NullReference#NULL} represents the zero location in memory. It has void as its type.
  • {@link xtc.type.StaticReference} and {@link xtc.type.DynamicReference} represent statically and dynamically allocated variables, respectively. Both reference classes are symbolic representations of memory; they do not have a known location.
  • {@link xtc.type.RelativeReference} is the superclass of all relative references, with each relative reference having another reference as its base.
  • {@link xtc.type.CastReference} represents a differently typed view on the same memory region.
  • {@link xtc.type.IndirectReference} represents an indirection through a memory region, i.e., pointer.
  • {@link xtc.type.IndexReference} represents an integer offset from another reference, while {@link xtc.type.FieldReference} represents a symbolic offset.
xtc.typical

The Typical type checker generator.

This package implements Typical, a language and compiler for specification and generation of type checkers. Typical is a domain specific language based on ML, and currently supports Java as a target language. The compiler's main class is {@link xtc.typical.Typical}.

The rest of this document covers the following topics:


ML Ancestry
Typical is a domain specific language built on the functional core of the O'Caml dialect of ML. Notably; Typical has the following syntactic differences from O'Caml:
  • ";" instead of ";;" for terminators
  • "," instead of ";" for list separators
  • "mltype" instead of "type"
  • "mlvalue" instead of "value"
  • no global "rec" since recursion is automatically detected

Features
Typical also provides the following built-in constructs, types, and features specifically for expressing type checker specifications.
  • Type Representation
    Typical represents source language types using the variant type raw_type. For example, the type system for an implementation of the Simply Typed Lambda Calculus with Integer and String values can be expressed as:
       mltype raw_type = IntegerT | StringT | FunctionT of type * type ;
    
    The raw_type variant is sufficient to represent the type system of some languages such as the Simply Typed Lambda Calculus, ML and Typical itself. For other languages, however, we need to capture information about attributes; for example, C's qualifiers and storage classes. For representing attributed type systems, Typical provides the attribute and eqattribute constructs. For example, C's type system could be expressed as
       mltype raw_type = VoidT | IntT | PointerT of type | ... ;
       mltype storage_class = Auto | Extern | Register | Static | Typedef ;
       mltype qualifier = Const | Volatile | Restrict ;
       attribute storage : storage_class ;
       attribute qualifiers : qualifier list; 
    
    The Typical compiler collects all attributes and combines them with the raw_type variant to create a record representation of types which will be used by the typechecker.
       mltype type = { type : raw_type; storage : storage_class; qualifiers = qualifer list; ... } ;
    
    By default, Typical evaluates type equality as structural equality ignoring attributes introduced with attribute as opposed to eqattribute. This default behavior may be overridden for variant types by using the equality construct which specifies which constructor arguments to compare. For example, consider the following definition of C’s structure type:
       mltype raw_type = ... | StructT of int * string * member list ;
    
    Consistent with the C standard, it uses its integer argument as a nonce that distinguishes between structures that have the same name and members but are declared in different scopes. The corresponding equality declaration (meaning two StructTs compare equal if they have equal first arguments) reads:
       equality raw_type = ... | StructT (n, _, _) ;
    
  • Name Management Features
    Typical supports name management by providing (1) a symbol table that is implicitly available to all programs, (2) the scope construct to declare which AST nodes introduce which scopes, and (3) the namespace construct to declare individual namespaces, the types of their values, and which AST nodes specify which names.

    Typical provides the following constructs for symbol table operations: define, redefine, lookup, and lookup_locally. The constructs have the structures shown:
      define no type [error|warn "error message"] [at loc]
      redefine no type
      lookup no [error|warn "error message"] [at loc]
      lookup_locally no [error|warn "error message"] [at loc]
    
    Both define and redefine take a node argument (no), a value argument (type), and optional error messages and source location nodes. The snippet below illustrates the use of define and lookup.
       | Abstraction (id, type, body) ->
         let param = analyze type in
         let _     = define id param in
         let res   = analyze body in
           { type = FunctionT (param, res) }
       | Identifier _ as id -> let t = lookup id in t
    
    In this snippet, define adds the name derived from the node id to the symbol table with the value type. lookup looks up the name derived from the node id. The symbol table operations have built in error management. If a name is previously defined (in the case of define) or not defined (in the case of lookup) the operations report default error messages and return an error. The source location for the error is also automatically derived from the current position in the ast. The default source location can be overridden by using the at loc construct where loc is the name of an ast node. The default error messages can be overridden by explicitly adding an error clause to the operations; for example:
       | Abstraction (id, type, body) ->
         let param = analyze type in
         let _     = define id param error "previously defined identifier" in
         let res   = analyze body in
           { type = FunctionT (param, res) }
       | Identifier _ as id -> let t = lookup id error "identifier undefined" in t
    
    The redefine operator is used for overriding previously defined symbol table entries without reporting errors. lookup_locally is used to retrieve entries that are only visibly within the current scope.

    The symbol table operations depend on Typical's namespace construct to introduce new namespaces and their types and also how to extract names from AST nodes. The namespace construct hast the structure shown:
      namespace "default"|identifier : value = PatternMatching [and "default"|identifier : value = PatternMatching]*
    
    Here, "default" and identifier identify the namespace, value indicates the type of values stored in this namespace, and PatternMatching is a matching from AST nodes to a Name constructor.

    For example the namespace declaration:
       namespace default : type = Identifier (id) -> SimpleName(id);
    
    indicates that there's a single default namespace, where all values have type type. Identifiers are matched to names by extracting the string child from an Identifier node and using it to create a new SimpleName. Names can be simple, omitting the scope, or qualified, explicitly specifying a scope according the built-in variant type declaration:
       mltype name =
         | SimpleName of string
         | QualifiedName of string list ;
    
    The symbol table operations also depend on the scope construct to manage scopes while traversing the AST. The scope construct (non-exhaustively) maps AST nodes to scopes; each right-hand side must evaluate to a Scope constructor value defined by the built-in variant declarations:
      mltype scope_kind =
        | Named of name
        | Anonymous of string
        | Temporary of string ;
      mltype scope =
        | Scope of scope_kind * node list ;
    
    A scope spans the specified list of nodes. It can be one of three kinds. First, a named scope is introduced by a function or class. Second, an anonymous scope is introduced by a block or compound statement. Third, a temporary scope, unlike a named or anonymous scope, is deleted after the AST traversal has left the scope’s AST nodes. It is necessary for typing C or C++ function declarations, which may contain parameter names different from the corresponding definitions. Named scopes assume the specified name, while anonymous and temporary scopes receive a fresh name based on the specified string. An implementation of the scope construct depends on the ability to accurately trace a Typical program’s AST traversal. To this end, we restrict AST node patterns: they may specify several (recursively) nested nodes but only use variable or alias patterns for arguments of a single node constructor. With this restriction in place, each right-hand side of a pattern match has a unique parent. As a result, we can easily model the path representing the AST traversal’s dynamic state through a stack:
       traversal_path : node list
    
    Typical’s runtime updates the stack with the parent and any matched ancestors before evaluating a pattern match’s right-hand side and restores the stack before returning the right-hand side’s value. Whenever the runtime pushes nodes on the stack, it evaluates the scope construct’s pattern match against each node being added to the stack. On a match, it updates the current scope as specified by the corresponding Scope constructor value. Independent of a match, it annotates the node with the result. It uses the annotations to restore previous scopes when popping nodes of the stack. It also uses the annotations to avoid re-evaluation of the scope construct’s pattern match for repeated passes over the same AST.

  • Error Management, Detection, and Reporting.
    For error management, Typical provides a system-wide no-information monad. All types are automatically injected into this monad and implicitly represent the union with the bottom type; that type’s only value is bottom. Typical's bottom value can be compared with Java's null value. In fact, in the generated Java code, bottom is represented by null. Built-in constructs and primitives generally return bottom if any of their arguments are bottom. Furthermore, all pattern matches return bottom due to an implicit clause mapping bottom to itself; this default can be overridden by explicitly matching bottom. However, type constructors such as those for tuples, lists, and variants treat bottom just like any other value. It allows for marking, say, a type attribute as having no information without forcing the entire type record to bottom. Finally, the is_bottom primitive enables the detection of bottom values, since the = and != operators yield bottom when either operand is bottom.

    To actually detect and report error conditions, Typical uses require and guard. The require construct enforces one or more boolean constraints on an expression. For example, consider :
       require param = tr error "argument type mismatch" in res
    
    If the constraints, here "param = tr", evaluate to true, require evaluates the expression, here "res" and returns the corresponding value. If the constraints evaluate to false, it reports an error, here "argument type mismatch", and returns bottom. Unless explicitly specified, the traversal_path’s top node supplies the error’s source location. However, if the constraints evaluate to bottom, reduce silently returns bottom. This feature avoids cascading errors since constraints that evaluate to bottom means that an error was previously generated (and reported).

  • List Reduction.
    C-like languages rely on lists of specifiers or modifiers to express the properties of declarations, including types and their attributes. When processing such lists, a type checker needs to (1) map AST nodes to the corresponding internal representation, (2) enforce semantic constraints on the number and combination of specifiers, and (3) provide comprehensive error reporting. Typical addresses these needs using the reduce construct. The reduce construct has the following structure:
      reduce to ["singleton" | "list" | "set" | "optional" | "required"] tag with PatternMatching
    
    As illustrated in Figure 1 using C’s type specifiers as an example, the reduce construct selects values from a list, while also mapping them to different values and enforcing constraints on their number and combination. It uses an extension of ML’s pattern syntax, the set pattern { ... }, to denote that the elements may appear in any order in the list; alternatively, a regular list pattern can be used to indicate that the elements must appear in the specified order. While mapping the (non-exhaustive) pattern match over a list, reduce ignores elements that do not match any patterns; we assume that lists, in particular those generated by the parser, only contain legal elements. While mapping, reduce also enforces the reduction constraints. The singleton constraint in the example indicates that the pattern match may be triggered at most once, while set and list constraints allow for multiple triggers (with duplicate triggers being ignored for set constraints). A further optional constraint specifies that the pattern match need not match any elements.

    The design of the reduce construct reflects our analysis of how to type check C specifiers or Java modifiers and supports a generalization of the corresponding requirements. For example, a list of C specifiers must include a single type specifier such as int, reducing the list to a "singleton". It may optionally include one storage class specifier such as register, reducing to an "optional singleton". It may also include one or more qualifiers such as const, which may be duplicated and thus reduce to an "optional set". Furthermore, Figure 5 illustrates how reduce combines several type specifiers into one internal value; as an added benefit, the code directly reflects the constraints specified in the C99 standard.

    Like require and guard, reduce integrates error checking and reporting, basing any message on a string tag that describes the kind of values being selected such as "type" or "access modifier". For example, if processing the C declaration long int float foo = 5;, the reduce construct shown below would generate a "multiple types detected" error and indicate the source location.
      (** Reduce a list of declaration specifiers to a type. *)
      mlvalue extractType = reduce to required singleton "type" with
          [Unsigned _, Long _, Long _, Int _]    -> {type = ULongLongT}
        | [Unsigned _, Long _, Long _]           -> {type = ULongLongT}
        | [Signed _, Long _, Long _, Int _]      -> {type = LongLongT} 
        | [Signed _, Long _, Long _]             -> {type = LongLongT}
        | [Unsigned _, Long _, Int _]            -> {type = ULongT}
        | [Signed _, Long _, Int _]              -> {type = LongT}
        | [Signed _, Short _, Int _]             -> {type = ShortT}
        | [Unsigned _, Short _, Int _]           -> {type = UShortT}
        | [Long _, Double _, Complex _]          -> {type = LongDoubleComplexT} 
        | [Long _, Long _, Int _]                -> {type = LongLongT} 
        | [Unsigned _, Long _ ]                  -> {type = ULongT}
        | [Signed _, Short _]                    -> {type = ShortT}
        | [Short _, Int _]                       -> {type = ShortT}
        | [Unsigned _, Short _]                  -> {type = UShortT}
        | [Signed _, Char _]                     -> {type = SCharT}
        | [Unsigned _, Char _]                   -> {type = UCharT}
        | [Float _, Complex _]                   -> {type = FloatComplexT}
        | [Double _, Complex _]                  -> {type = DoubleComplexT}
        | [Signed _, Long _]                     -> {type = LongT}
        | [Long _, Int _]                        -> {type = LongT}
        | [Long _, Double _]                     -> {type = LongDoubleT}
        | [Long _, Long _]                       -> {type = LongLongT}
        | [Unsigned _, Int _]                    -> {type = UIntT}
        | [Signed _, Int _]                      -> {type = IntT}
        | [Unsigned _]                           -> {type = UIntT}    
        | [Signed _]                             -> {type = IntT}
        | [Long _]                               -> {type = LongT} 
        | [Int _]                                -> {type = IntT}
        | [VoidTypeSpecifier _]                  -> {type = VoidT}
        | [Char _]                               -> {type = CharT}
        | [Float _]                              -> {type = FloatT}
        | [Double _]                             -> {type = DoubleT}
        | [Short _]                              -> {type = ShortT}
        | [Bool _]                               -> {type = IntT} 
        | [VarArgListSpecifier _]                -> {type = VarArgT}
        | [(StructureTypeDefinition _ ) as me]   -> analyze me
        |  _ (*Similar for enum, union and typedef *)
        ;
    
    Figure 1. reduce pattern to process C declaration specifiers
  • Module System.
Library
Typical provides a library of primitive operations. The Java implementations of these operations can be found in Primitives.java. The operations, their signatures and descriptions are listed below:
Map operations
get : 'a -> 'b                                  
put : 'a -> 'b                        
Prelude operations:
abs_float : float -> float                      Affirm a float
abs_int : int -> int                            Affirm an int
and_bits : int -> int -> int                    Perform logical and on two integers
ancestor : 'a -> node                           Get the ancestor of the top node of the traversal path if it matches a specified pattern
annotate : node -> string -> 'a -> 'a           Annotate a node with a named value
annotate_list node list -> string -> 'a -> 'a   Annotate a list of nodes
ftoi : float -> int                             Convert a float to an int
get_annotation : node -> string -> 'a           Get the named annotation from a node
has_annotation : node -> string -> bool         Test if a node has an annotation with the specified name
is_bottom : 'a -> bool                          Test a value for bottom
is_empty  : 'a list -> bool                     Test a list for emptiness
is_not_bottom : 'a -> bool                      Test if a value is not bottom 
negate_bits : int -> int                        Compute the bitwise negation of an integer
negate_float : float -> float                   Negate a float value
negate_int : int -> int                         Negate an int value
node_name : node -> string                      Get the name of a node 
nth : 'a list -> int -> 'a                      Get a list's nth element
or_bits : int -> int                            Compute the bitwise or of an int
parent : 'a -> node                             Get the parent of the top node on the traversal path if it matches a specified pattern
shift_left : int -> int -> int                  Left shift and int by the specified number of bits 
shift_right : int -> int -> int                 Right shift and int by the specified number of bits
trace : 'a -> 'a                                Debugging function to print a value to the standard output
trace2 : 'a -> string -> 'a                     Debugging function to print a prefied value to the standard output
xor_bits : int -> int -> int                    Compute the exclusive or of two integers 
List operations
append : 'a list -> 'a list -> 'a list          Append a list to another
cons : 'a -> 'a list                            Cons a new list from an value and an existing list
contains : 'a -> 'a list -> bool                Test if a list contains an element
exists : (fn : 'a -> bool) -> 'a list           Test if a list element satisfies a predicate
foldl : (fn: 'a -> 'a -> 'a) -> 'a list         Fold a list   
head : 'a list -> 'a                            Get the head of a list
intersection : 'a list -> 'a list 'a list       Compute the intersection of two lists
iter : (fn: 'a -> 'b) -> 'a list                Iterate a function over a list
length : 'a list -> int                         Compute the length of a list
subtraction : 'a list -> 'a list -> 'a list     Get the set subtraction of two lists
tail : 'a list -> 'a list                       Get the tail of a list
union : 'a list -> 'a list -> 'a list           Compute the set union of two lists
String operations
concat : string -> string -> string             Concatenate two strings
ends_with : string -> string -> bool            Test if a string ends with a substring
ends_withi : string -> string -> bool           Test if a string ends with a substring - case insensitive
join_strings : string list -> string            Combine a list of strings into a single string
ssize : string -> int                           Compute the size of a string
starts_with : string -> string -> bool          Test if a string begins with a specified substring
starts_withi : string -> string -> bool         Test if a string begins with a specified substring - case insensitive
stof : string -> float                          Convert a string to a float
stoi : string -> int                            Convert a string to an integer
substring : string -> int                       Get the substring starting at the specified index
substring2 : string -> int -> int               Get the substring starting at the specified range
Usage and Example
The Typical system encompasses three different programming languages and their compilers. The source language, the meta-language, and the bootstrap language. The source language is the language whose type-checker is being implemented. The meta-language is the language in which the type checker specification is written; i.e. Typical. The Typical compiler will generate code in the bootstrap language, which in the current implementation is Java.
As an introduction to Typical the following example presents the implementation of a complete type checker using the Simply Typed Lambda Calculus as the source language. In this example, the calculus is treated not as a formal system, but as a programming language whose front end we wish to implement.

Below we have the calculus' syntax. The corresponding grammar (found in lang/TypedLambda.rats) is written for the Rats! parser generator and largely follows the syntactic specification shown.
Expression      <- Application EOF
Application     <- Application BasicExpression / BasicExpression
Abstraction     <- LAMBDA Identifier COLON FunctionType DOT Application 
BasicExpression <- Abstraction / Identifier / IntegerConstant / StringConstant  / OPEN Application CLOSE
Identifier      <- [a-zA-Z]+ 
IntegerConstant <- [1-9] [0-9]* 
StringConstant  <- ["] (!["] _)* ["]"
FunctionType    <- BasicType ARROW FunctionType / BasicType
BasicType       <- IntegerType / StringType / OPEN FunctionType CLOSE
IntegerType     <- "int"
StringType      <- "string"
LAMBDA          <- "\\"
COLON           <- ":"
DOT             <- "."
ARROW           <- "->"
OPEN            <- "("
CLOSE           <- ")"
Figure 2.

Next, we use the Typical compiler and the complete TypedLambda.tpcl specification below in Figure 3 to create the type checker.
1.  (**
2.  * Typing rules for the simply typed lambda calculus.
3.  *
4.  * @author Robert Grimm
5.  * @version $Revision: 1.14 $
6.  *)
7.  module xtc.lang.TypedLambda;
8. 
9.  mltype node =
10.   | Application of node * node
11.   | Abstraction of node * node * node
12.   | Identifier of string
13.   | IntegerConstant of string
14.   | StringConstant of string
15.   | FunctionType of node * node
16.   | IntegerType
17.   | StringType
18.   ;
19. 
20. mltype raw_type = IntegerT | StringT | FunctionT of type * type ;
21. 
22. scope Abstraction (id, _, body) -> Scope(Anonymous("lambda"), [id, body]) ;
23. 
24. namespace default : type = Identifier (id) -> SimpleName(id) ;
25. 
26. mlvalue analyze = function
27.   | Application (lambda, expr) ->
28.       let tl = analyze lambda
29.       and tr = analyze expr in 
30.         require (predicate FunctionT _) tl.type
31.           error "application of non-function" in begin
32.             match tl.type, tr with
33.               | FunctionT (param, res), param -> res
34.               | _ -> error "argument/parameter type mismatch"
35.           end
36.   | Abstraction (id, type, body) ->
37.       let param = analyze type in
38.       let _     = define id param in
39.       let res   = analyze body in
40.         { type = FunctionT (param, res) }
41.   | Identifier _ as id ->
42.       let t = lookup id in
43.          require t != bottom error (get_name id) ^ " undefined" in t
44.   | IntegerConstant _ ->
45.       { type = IntegerT }
46.   | StringConstant _ ->
47.       { type = StringT }
48.   | FunctionType (parameter, result) ->
49.       { type = FunctionT (analyze parameter, analyze result) }
50.   | IntegerType ->
51.       { type = IntegerT }
52.   | StringType ->
53.       { type = StringT }
54.   ;
55. 
56. mlvalue get_name = function
57.   | Identifier (name) -> name
58.   | _ -> bottom
59.   ;
60.
Figure 3. Type checker specification for the Simply Typed Lambda Calculus

The specification begins with the module declaration on line 7.

Lines 9-17 declare the variant type node to represent the STLC's abstract syntax tree structure. The variants of the node type indicate the names and the type of the children of all the nodes that may appear in the STLC AST to be type checked. Note that this specification need not be written by hand, but can automatically generated from a Rats! grammar specification using the -ast option.

Line 20 declares the variant type raw_type to represent the type structure of our STLC. In this case we have three types: an integer type, a string type, and a function type.

Line 22 uses Typical's scope construct to declaratively specify scope management for the STLC type checker. The declaration can be interpreted as follows: If an Abstraction node is encountered while traversing the AST the type checker will enter a new Anonymous scope with a name derived from "lambda", and the first and third children of the Abstraction node will belong to the new scope.

Line 24 uses Typical's scope construct to declaratively specify the STLC's namespace management. The declaration can be interpreted as follows: All STLC names belong to a single default namespace and have value 'type'. Symbol table names are obtained by extracting the string value from an Identifier node, and using this string to create a new SimpleName.

Lines 26-55 define the type rules of our STLC as pattern matches from nodes to type. Lines 27-34 typecheck an Application node by first typechecking each child of the application, then using the require construct to enforce that the first child has function type, and the error construct to report failure. Finally the error construct is used to report an error on type/argument mismatch. Lines 36-40 process an abstraction node by analysing the second and third children, and adding an symbol table entry using the define construct. Since the define construct has built in error management, an error will be reported if the identifier was previously defined. Successful processing of the Abstraction node returns the appropriate FunctionType. Line 41 types an Identifier via a symbol table lookup on the node. Lines 44-54 are straightforward mapping of the IntegerConstant, StringConstant, FunctionType, IntegerType, and StringType nodes to their corresponding types.

Finally the specification can be compiled using the command java xtc.typical.Typical TypedLambda.tpcl. This generates three files: TypedLambdaTypes.java which contains definitions for all the types used by the STCL type checker, TypedLambdaSupport.java which contains general supporting infrastructure, and TypedLambdaAnalyzer.java which is the STLC typechecker itself. TypedLambdaAnalyzer can be incorporated into a Compiler or Driver (for example, TypedLambda.java) for processing STLC programs.
xtc.util Utility classes for xtc.
xtc.xform

The XForm query and transformation language.

This package provides a querying and transformation facility for abstract syntax trees (ASTs). The language, XForm, loosely follows the syntax and semantics of the XPath 2.0 XML query language, but with added support for AST transformations. However, there are some key differences between the XForm language and XPath, so a careful reading of this document is essential.

Also note that example XForm queries can be found here. Additional examples can be found in the xform/samples subdirectory.

The Language

Program Structure

XForm is a declarative language that can be used to query or transform an AST. An XForm script, or query, is composed of a series of one or more comma-separated expressions, each returning a sequence of tree items. If the query is composed of a single expression, its result is the value of the expression. However, if a query is composed of more than one expression, its result will be a list of the values of each expression (a list of lists). Note that internally, the query engine implicitly flattens all lists of lists.

It is also important to note that the result of each expression in a list of expressions provides the focus (discussed below) for the expression that follows it

Path Expressions

All XForm expressions generate ordered sets of tree items, called sequences A tree item may be a node within the AST, a string, or null. Nested sequences are not permitted.

The most basic expression in XForm is the path expression. A path expression produces a sequence of all items in an AST that conform to a specified template. The format of a path expression template is:

Path expression<-    "/" RelativePathExpression?
                           / "inside_out"? "//" RelativePathExpression  
                           / RelativePathExpression
RelativePathExpression <-    RelativePathExpression ("/" / "//") StepExpression 
                           / StepExpression
StepExpression         <-    ItemTest PredicateList / ".."
ItemTest               <-    PrimaryExpression / Identifier / "*"
PredicateList          <-    Predicate*
Predicate              <-    "[" (IntegerLiteral / Expression / FunctionCall ) "]"
PrimaryExpression      <-    "." / StringLiteral / VariableReference / FunctionCall 
                           / ParenthesizedExpression

The template begins with an optional path specifier - the initial focus of an expression. The focus of an expression can be thought of as the environment that an expression is evaluated in - it represents the context for each step of an expression. So in E1/E2 or E1 [E2], the value of E1 is computed, which then acts as the focus for computing the value of E2.

If the specifier is /, then the search will take place relative to the AST root. For example, the template/FOO/BAR will return all BAR nodes whose parent is a FOO node and whose grandparent is the AST root node. If the specifier is //, then all items that satisfy the template will be considered (so //FOO will collect all FOO nodes in the AST). If the keyword inside_out is prepended to a path expression using //, items will be chosen based on a bottom-up, breadth-first traversal of the AST. If the path specifier is omitted, then the nodes collected will have to be relative to the first step expression in the template (so $x/FOO will collect all FOO children of the nodes in the sequence $x). If a step expressions are separated by//, the latter expression will be matched against all descendents of the former, rather than just its children. (So, //FOO//BAR will collect all FOO descendents of all BAR nodes).

The individual steps of a path expression are known as step expressions. There are two different kinds of step expressions -forward and reverse. A forward step expression filters the children or the descendents of the current focus, whereas the reverse step expression, .., moves back up the tree, to the level of the focus' parents. (E.g.. //FOO/.. would collect the parents of every FOO node in the AST).

Of the forward step expressions, tests include matching against the names of an item (as in /FOO/BAR or/FOO/"j"), matching all nodes (/FOO/*), or matching against a primary expression.

A primary expression represents a value. There are five kinds.

  • The context item, ., which represents the current tree item being processed (so //FOO/. would resolve to all FOO nodes in the tree).
  • A string literal, such as "baz".
  • A variable reference, such as $stat. Variable references begin with a dollar-sign, and represent a previously bound sequence of items.
  • A function call, such as test($a,$b), which can be used to call an external function, and whose value and parameters are all sequences. Instructions on how to add external functions programmed in Java are described later.
  • A parenthesized expression, such as(//CompoundStatement) whose value is that of the expressions it contains, concatenated together.

A step expression may contain one or more predicates. A predicate takes the form of [ Expression ], and represents a sequence. Each predicate will intersect its sequence with the current focus, producing a new focus for the evaluation of any subsequent predicates. The value of the final intersection is the value of the step expression.

A predicate containing an integer literal (beginning at one) represents an index into the current focus. So //FOO [1] would return the first FOO node in the AST.

When using predicates, it is important to note that within a predicate, the focus becomes that of the current step expression. So, for example, the value of //FOO/BAR [BAZ], is a sequence containing all BAR nodes with FOO parents and BAZ children.

Also note that tree traversals are done in a depth-first, in-order manner. If you would like to do an "inner" traversal, that is, traverse the tree in a bottom-up manner, you may use the inner keyword. For example,

replace inner //ForStatement with FOO<>
will replace all ForStatement nodes in the AST with FOO nodes, but the order of replacement will be inside-out.

In addition to searching an AST, XForm provides facilities for manipulating sequences. These facilities include binding variables, looping through sequences, conditionals, creating new items and evaluating logical operations over sequences.

Binding Variables

To bind a sequence to a variable for later use, one can use a let expression. The syntax of a let expression is as follows:

LetExpression       <- "let" VariableBindingList "return" SingleExpression
VariableBindingList <- VariableBinding ("," VariableBinding)*
VariableBinding     <- VariableReference "be" SingleExpression

A let expression binds one or more sequences to variables for the duration of its single expression. For example,

let $f be //ForStatement, $cs be //CompoundStatement  return ($f, $cs)
would return a sequence composed of all the for statements and all the compound statements in an AST (a parenthesized expression concatenates the results of the expressions within).

Looping

Two options exist to iterate through sequences - for and cfor. Their formats are:

ForExpression        <- "for" IterativeBindingList "return" SingleExpression
CForExpression       <- "cfor" IterativeBindingList "return" SingleExpression
IterativeBindingList <- IterativeBinding ("," IterativeBinding)*
IterativeBinding     <- VariableReference "in" SingleExpression

A for statement iterates through one or more sequences, binding the resultant singletons to the specified variables (the bindings hold for the duration of the expression's single expression) and evaluating its single expression. If more than one sequence is specified to iterate through, the iterations will be implicitly nested. So,

for $x in //FOO, $y in //BAR return $x union $y
is equivalent to
for $x in //FOO return   for $y in //BAR return    $x union $y

A cfor statement acts like a for statement, except that it iterates each of its bound variables concurrently, halting when one of them reaches its end. In either case, the resulting sequence of a looping expression is the concatenation of each sequence that it returns.

Conditionals

Conditional expressions in XForm take the following form:

IfExpression <- "if" SingleExpression "then" SingleExpression "else" SingleExpression 

Conditional expressions evaluate the first single expression. If that expression is a non-empty sequence, the second single expression is evaluated and its value is returned. Otherwise, the third single expression is evaluated and its value is returned.

Set Operations

Set operations XForm take the following form:

UnionExpression <- UnionExpression "union" IntersectionExpression
IntersectionExpression <- IntersectionExpression "intersect" PathExpression
DifferExpression    <- DifferExpression "differ" LogicalExpression

A union expression returns the union of two sequences, whereas an intersection expression returns their intersection and the differ expression returns their difference. Note that these operations are based on identity (so union is not a concatenation). If you wish to concatenate two single expressions, wrap them in a parenthesized expression

Logical Operations

There are two supported logical operation expressions, or and .

An or expression allows you to group together a series of pathexpressions and select the value of the first path expression that returns a non-null sequence. For example, the query:

  //Foo or //Bar 
will select the sequence returned by //Foo if that sequence is non-null. Should the sequence be null, the results of //Bar will be used.

An and expression also lets you group together a series of pathexpressions. If the value of each path expression in the series is non-null, all of the values are concatenated together into a single sequence and then returned. Otherwise, a null value is returned. For example, the query:

  //Foo and //Bar
will return each Foo and Bar node in the tree, should the tree contain both Foo and Bar nodes. Otherwise, it will return a null value.

Note that these logical operations can be embedded inside of a pathexpression by way of parenthesized expressions. For example,

  //Foo/(Bar or Zoo)
will return all Bar children of Foo nodes, should any exist. Otherwise, it will attempt to return any Zoo children of Foo nodes. Whereas,
  //Foo/(Bar and Zoo)
will return any Bar nodes with Foo parents and Zoo siblings, and any Zoo nodes with Foo parents and Bar siblings, should both sequences be non-null.

Generating New Items

New tree items may be generated with a new item expression. A newitem expression creates a singleton sequence containing a new tree item. It takes the following form:

NewItemExpression <- "null" / StringLiteral / NewNodeExpression
NewNodeExpression <- Identifier "<" Children ">"
Children          <- Child ("," Child)*
Child             <- "null" / NewNodeExpression / SingleExpression / /* empty */

Note that the last type of child is blank, which means that the newly created node will have no children (which is not the same as having a null child). So FOO<> would create a FOO node with no children, whereas FOO<null> would create a FOO node with a single null child.

Transforming ASTs

Modifications to abstract syntax trees are done using replacement expressions, removal expressions, addition expressions, or insertion expressions. Users transform ASTs in one of two ways. The first is to query the AST for items to replace and then replacing them with new or existing items. The second way is to query the AST for insertion position markers and then inserting new or existing items.

The format of a replacement expression is:


ReplacementExpression <- "replace" SingleExpression "with" SingleExpression

The value of a replacement expression is a sequence containing the items that have been inserted (or moved) within the tree, otherwise, if no replacements have been made, it's an empty sequence. For example,

 for $f in //ForStatement replace $f with CompoundStatement<>
would replace all of the for statements in an AST with new compoundstatements, whereas
 for $f in //ForStatement replace $f with //CompoundStatement [1]
would replace each for statement in the tree with one the tree's first compound statement. Note that a sequence need not be bound to a variable for its items to be replaced, so
replace //BAR with FOO<>
would replace any BAR nodes in the AST with FOO nodes.

Note that in the case of a replacement expression, in the context of a for expression, you can omit the "return".

The format of a remove Expression is:

"remove" SingleExpression

For example, remove //IfStatement removes all IfStatements from the AST.

The format of an add Expression is:

 "add" SingleExpression "to" SingleExpression

For example, add Foo<> to //Bar adds a Foo node (as the last child) to all Bar nodes.

The format of an insert expression is:


insert SingleExpression before SingleExpression 
insert SingleExpression after SingleExpression

For example, insert Foo<> before //Bar<> would create and insert new Foo<> items before every Bar node in the AST.

The value of an insertion operation is a sequence containing the items inserted, or an empty sequence if no insertions are made. insert expression operate only on none empty sequences. To illustrate, insert Foo<> before //Block/*[1] adds a Foo<> node before the first child in a Block node. It will not, however, add a Foo<> node to a an empty Block. This behaviour, if desired, can be implemented with addtional commands; for example, replace empty(//Block<>) with Block< Foo<> >

Extending XForm

A user may add functionality to XForm by way of external functions. To add an external function to XForm, one must do the following:

  • Create a new class that implements the xtc.xform.Function interface. Make sure to have the getName method return the name of the function as it will be called, and have the apply method return a list of items.
  • When you want to use the external function, use an import statement at the beginning of your query. The import statement takes a list of quoted class names, and loads the named classes. For example,
    import "xtc.xform.IsNullFunction","xtc.xform.TestFunction"would load 
          the external function classes"IsNullFunction" and "TestFunction".

Example Queries

Example0: Using Xform queries to find empty IfStatements


1.  //example0.java
2.  class example1{
3.     public void bar(){
4.        Boolean foo = true;
5.        if( foo ){

6.           //empty!
7.        }
8.        int i = 0;
9.        if( foo ) //no braces!
10.          i++; //not empty
11.       if( !foo ); //empty!!
12.    }
13. }

In example0.java above, the IfStatements on lines 5 and 6 are empty. To find them using xform, we must first identify the AST structure of empty IfStatements. Using the xform driver's -preAST function, we deduce that empty IfStatements come in two flavours: either an IfStatement item with an empty Block child or an IfStatement item with an EmptyStatement child. The query


1. #find all the emptyIfStatements
2. empty(//IfStatement/Block)/.. union //IfStatement[ EmptyStatement ]

will return all emptyIfStatement. The XForm library function empty returns sequence items that have no children so empty(//IfStatement/Block) will return all empty blocks belong to IfStatement. Adding /.. will return the parents of those blocks. //IfStatement[ EmptyStatement ] returns all IfStatement items that have an EmptyStatement child. The union of the two expressions gives all Statements in the program.

For additional feedback, the library function lines() can be used to report the line and column information of the items identified. Executing the with query


1. #example0.xform
2. #find all the emptyIfStatements
3. lines( empty(//IfStatement/Block)/.. union //IfStatement[ EmptyStatement ] )


with java xtc.xform.Driver -java example0.xform example0.java would display:

example0.java: 5:7
example0.java: 11:7

Example1: Using XForm transformations to ensure that all IfStatements in a Java program have braces.


1.  //example1.java
2.  class example1{
3.     public void bar(){
4.        Boolean foo = true;
5         if( foo ){

6.           //this one is good
7.        }
8.        int i = 0;
9.        if( foo ) //no braces!
10.          i++;
11.    }
12. }

In example1.java above the IfStatement on line 8 violates the coding convention that states all if statements should have curly braces. To correct this using xform transformations, we must first query the program's AST to find all braceless IfStatement. Using the xform Driver's preAST command line option (or alternately figuring it out from the grammar), we realize that the AST representation of a braceless if Statement does not have a Block enclosing it's body.

The IfStatement on line 5 is represented as

 IfStatement<
    PrimaryIdentifier<"foo">,
    Block<>
 >
while the one on line 8 is represented as 
 IfStatement<
  PrimaryIdentifier<"foo">,
   ExpressionStatement<
    PostfixExpression<
      PrimaryIdentifier<"i">,
      PostincrementTail<>
    >
  >
 >
We can find braceless IfStatements with the XForm query
 //IfStatement differ IfStatement[ Block ]

The complete query to add missing braces is


1. #example1.xform
2  #add missing braces toIfStatement
3. for $i in //IfStatement differ //IfStatement[ Block ] return

4.    replace $i with IfStatement< $i/*[1], Block< $i/*[2] > >

Line 3 in the addifbrace.xform iterates over each Blockless if statement in the AST Line 4 creates a new IfStatement item with two children and replaces the original. The first child in the new item is the same as the first child of the original IfStatement - meaning the if( foo ) part of the statement is rewritten/preserved as if( foo ). The second child is a new Block item containing the second child of the original IfStatement.

The command java xtc.xform.Driver -java -printJava example1.xform example1.java produces the output


1.  class example1{
2.     public void bar(){
3.        Boolean foo = true;
4         if ( foo ){

5.        }
6.        int i = 0;
7.        if ( foo ){
8.          i++;
9.        }
10.    }
11. }

Example2: Transforming for loops to enforce 'Must Have Braces' rule.

The procedure for transforming ForStatements similar to the IfStatement transformation in Example1. The difference, however, is that the first 3 children of each Blockless ForStatement must be preserved in the rewrite. This can be done with the query


1. #add missing forStatement braces

2. for $f in //ForStatement differ //ForStatement[ Block ] return

3.    replace $f with ForStatement< $f/*[1], $f/*[2], $f/*[3], Block< $f/*[4]> >
or by using XForms subsequence() library function in the query

1. #example2.xform
2. for $f in //ForStatement differ //ForStatement[ Block ] return
3.   replace $f with ForStatement< subsequence( $f/*, 1, 3 ), Block< $f/*[4] > >

1. //example2.java
2. class example2{3. public void bar(){
4.   int j = 0;
5.   for( int i = 0; i < 10; i++ )
6.     j++;
7.
8.   for(;;)
9.      j--;
10.   }
11.}

On the file example2.java (shown above), the query example2.xform produces the following transformation.


1. class example2{
2.   public void bar(){
3.      int j = 0;
4.      for( int i = 0; i < 10; i++ ){
5.         j++;
6.        }
7.      for (;;){
8.        j--;
9.     }
10.  }
11.}

Example3: Using XTC and Xform to create an simple extension to Java (JavaProperty), that adds C# like properties. Note that this example does not have the full set of C# property features. Note also that all code in this example can be found in xform/samples/javaproperty

In our extension, we wish to write property declarations in the form "property int foo". A desugaring transformation convert the JavaProperty code to Java code that contains generate accessors methods for this variable in this declaration. For example the JavaProperty code:

1. //sample.jprop2. class sample{3.    property int foo;4. }

should desugared to the following Java code


1. //sample.java2. class sample{

3.   private int foo;

4.   public void setfoo( int val ){
5.     foo = val;
6.   }
7.   public int getfoo(){
8.      return foo;
9.   }
10. }

First we use xtc to define a grammar for the extension. This involves adding "property" as a keyword and as a Modifer.
The JavaPropertyKW.Rats file below adds 'property' as a keyword to the existing list of Java keywords.


1. //JavaPropertyKW.rats
2. module xtc.xform.samples.javaproperty.JavaPropertyKW;
3. import xtc.lang.JavaIdentifier(xtc.lang.JavaSymbol, xtc.util.Spacing);
4. 
5. body {
6.   static{
7.     add(JAVA_KEYWORDS, new Object[] { "property" });
8.   }
9. }

The JavaProperty.Rats file below modifies the existing Java core grammar (JavaCore) to add 'property' as a Modifier. For more details on grammar modification see: Rats!

1.  //JavaProperty.Rats
2.  module xtc.xform.samples.javaproperty.JavaProperty;
3.  instantiate xtc.lang.JavaType(xtc.lang.JavaIdentifier, xtc.lang.JavaSymbol);
4.  instantiate xtc.lang.JavaConstant(xtc.lang.JavaIdentifier, xtc.util.Spacing);
5.  instantiate xtc.lang.JavaIdentifier(xtc.lang.JavaSymbol, xtc.util.Spacing);
6.  instantiate xtc.util.Symbol(xtc.util.Spacing);
7.  instantiate xtc.lang.JavaSymbol(xtc.util.Symbol);
8.  instantiate xtc.util.Spacing;
9.  import      xtc.xform.samples.javaproperty.JavaPropertyKW;
10. 
11. modify xtc.lang.JavaCore(xtc.lang.JavaType, xtc.lang.JavaConstant, xtc.lang.JavaIdentifier, xtc.lang.JavaSymbol,xtc.util.Spacing, xtc.util.Null );
12. option withLocation, constant, parser(xtc.xform.JavaPropertyParser);
13. 
14. String Modifier += <Strictfp> ... /  "property":Word;

We can now generate a parser for our JavaProperty grammar by typing 'make' in xform/samples/javaproperty to execute the following commands:
java xtc.parser.Rats -in ~/java/src JavaProperty.rats
javac -source 1.4 -d ~/java/classes -sourcepath ~java/src JavaPropertyParser.java

Finally, using the Xform query JPTrans.xform shown below, we can transform JavaProperty code to Java code with the command:
java xtc.xform.Driver -printJava -parser xtc.xform.samples.javaproperty.JavaPropertyParser JPTrans.xform sample.jprop

This command line instructs the XForm engine to 1. parse sample.jprop using the JavaPropertyParser 2. Execute the query JPTrans.xform 3. Pretty print the resulting code using xtc's Java PrettyPrinter

JPTrans.xform is explained as follows: Line 5 finds all property declarations. Lines 6-18 inserts getter methods after each property declaration. Line 7 makes the method public. Line 8 sets the methods return type to be the same as the field type. Line 9 sets the method name as the concatenation of "get" and the field's name. Lines 14-16 add a return statement with the field name as an identifier. Lines 21-45 add a setter method after each property declaration. This is similar to the xform code for adding getter methods, the main differences are the return types and an assignment statement instead of a Return statement. Last, lines 49-55 rewrite all property declarations to private field declarations with the same name and type.

1.  #JPTrans.xform
2.  #XFrom desugaring from JavaProperty to pure Java.
3.
4.  #add a getter method for the field
5.  for $i in  //FieldDeclaration/Modifiers/"property"/../.. return 
6.    insert MethodDeclaration<7.              Modifiers<"public">,
8.              $i/*[2],
9.              concat( "get", $i/Declarators/Declarator/*[1] ),
10.             FormalParameters<>,
11.               null,
12.               null,
13.               Block<
14.                 ReturnStatement<
15.                    PrimaryIdentifier<$i/Declarators/Declarator/*[1]>
16.                >
17.              >
18.           >  after $i,
19. 
20. #add a setter method for the field
21. for $i in  //FieldDeclaration/Modifiers/"property"/../.. return 
22.   insert MethodDeclaration<
23.             Modifiers<"public">,
24.             VoidTypeSpecifier<>,
25.             concat( "set", $i/Declarators/Declarator/*[1] ) ,
26.             FormalParameters<
27.               FormalParameter<
28.                 null,
29.                 $i/*[2],  
30.                 "val",
31.                 null
32.               >       
33.             >,
34.             null,
35.             null,
36.             Block<
37.               ExpressionStatement<
38.                 Expression<
39.                   PrimaryIdentifier<$i/Declarators/Declarator/*[1]>,
40.                   "=",
41.                   PrimaryIdentifier<"val">
42.                 >
43.               >
44.             >
45.          > after $i, 
46. 
47. #replace the property declaration with a private field declaration
48. for $i in //FieldDeclaration/Modifiers/"property"/../.. return
49.    replace $i with FieldDeclaration<
50.                      Modifiers<"private">,
51.                      $i/*[2],
52.                      Declarators<
53.                        Declarator< $i/Declarators/Declarator/*[1], null, null >
54.                      >
55.                    >
56. 
xtc.xform.samples.javaproperty
www.java2java.com | Contact Us
Copyright 2009 - 12 Demo Source and Support. All rights reserved.
All other trademarks are property of their respective owners.