| Version 32 (modified by , 15 years ago) ( diff ) |
|---|
The TASS Abstract Syntax Tree
The TASS AST is the common front-end representation of a source program used by TASS. It is used to generate a model, the TASS intermediate representation.
Requirements:
- the AST should be "abstract": it should not be tied too tightly to the specific syntax of one programming language. It should be more-or-less language neutral: to the extent possible one common AST should be used for C, C++, Fortran(s), or similar imperative programming languages
- the AST should be a tree: every node other than root should have a unique parent and there should be no cycles
- the AST should deal primarily with syntax, though some amount of semantic information might be necessary
- should be easy to generate model from AST
- should be easy to generate AST from a typical parse tree
- should be easy to generate AST from ANTLR, ROSE, or Clang (other tools?)
- AST may go through many processing stages; at each stage it is still an AST
- persistent XML representation: should be able to read and write
- all nodes must carry precise source information: file, begin point (line and column numbers), and end point
Questions:
- How will it handle things like (foo)*bar: this could be either a cast of *bar to type foo, or it could be the product of foo and bar; you need to know whether foo defines a type, which is some semantic information
- cf. http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
- approach: just choose one way, then change in later pass when analyzing
- ANTLR C grammar shows how this can be handled directly in grammar by processing the typedefs as you parse
- What to do about #defines from the system (OSX, etc.)? Determine these automatically, or take as input?
Notes
- ANTLR grammar list: http://www.antlr.org/grammar/list
- includes grammars for C and for pre-processor
Comments:
- When multiple translation units are present, an AST is generated for each unit, and then the ASTs are merged (global scope variables are merged to one scope).
- For any global scope variable with the same name that is declared in multiple files, all but at most one must be extern.
- No extern variable declarations can have initializers.
- For scoping issues with blocks, see sec 8.4 of C: A Reference Manual.
- Rewrite blocks to have variables declared separately. Replace declarations that have initializations with assignment statements.
Example:
int x=1;
{ print x;
int x=2;
print x;
}
should be translated to an AST that looks like:
- Root
- Globals: x
- Statements
- Block
- Decls
- x
- int
- x
- Stmt
- print x /* the global x */
- x = 2 /* the local x */
- print x /* the local x */
- Decls
- Block
In other words, in the AST for the block, all the variable decls that occur anywhere in the block are grouped together under one node in the block. The initializers, however, are translated to statements at the point where the variable was declared in the original source. So in the end, local variables should not have any initializers associated to them.
Elements of a TASS AST
For one example of how to do this, see description of CIL AST:
Preprocessing: before an AST is created, the source file(s) are preprocessed to create a stream of tokens with complete source information. This stream is fed to the parser which creates the AST.
All AST nodes have source information.
AST Nodes
ASTNodeIF extends SourceableIF
- methods
- numChildren: int
- child(int i): ASTNodeIF
- children: Iterator<ASTNodeIF>
The following extend ASTNodeIF:
- ReferenceNodeIF
- has no children
- used to refer to another node u when u might be referenced from several places (shared)
- methods
- referent: ASTNodeIF
- identifier: string
- FunctionReferenceNodeIF extends ReferenceNodeIF
- methods
- referent: FunctionDeclarationNodeIF
- methods
- VariableReferenceNodeIF extends ReferenceNodeIF
- methods
- referent: VariableDeclarationNodeIF
- methods
- TypeReferenceNodeIF extends ReferenceNodeIF
- methods
- referent: TypeNodeIF
- methods
- LabelReferenceNodeIF
- methods
- referent: LabelNodeIF
- methods
- EnumElementReferenceNodeIF
- methods
- EnumElementNodeIF
- methods
- IdentifierNodeIF: wraps a string. represents an identifier occurrence in code. Has 0 children, i.e., is a leaf node.
- methods
- kind: UNKNOWN, VARIABLE, FUNCTION, TYPE
- name: string
- reference: ASTNodeIF : the thing that this identifier represents
- methods
- TypeNodeIF: a type expression
- VoidTypeNodeIF: the type "void" used, for example, as the return type for a function that does not return anything
- IntegerTypeNodeIF: an integer type, such as "int" or "long int" or "short int" in C
- integer sub-types, specified by parameters
- RealTypeNodeIF: a real type, such as "float" or "double" in C
- real sub-stypes, specified by parameters (e.g., IEEE754 floating point numbers)
- BooleanTypeNodeIF: the boolean type
- CharacterTypeNodeIF: the character type (char in C)
- ArrayTypeNodeIF: an array type
- methods
- elementType: TypeNodeIF
- extent: ? This is complicated. The extent might be an expression involving input variables.
- methods
- PointerTypeNodeIF
- methods
- baseType: TypeNodeIF
- methods
- CompositeTypeNodeIF: "struct" or "union" in C
- methods
- struct or union?: boolean
- name: IdentifierNodeIF
- numFields: int
- field(int i): NameTypePairNodeIF: this is an ordered pair (n,t) where n is an IdentifierNodeIF and t is a TypeNodeIF
- fields: Iterator<NameTypePairNodeIF>
- methods
- FunctionTypeNodeIF
- methods
- numInputs: int
- inputType(int i): TypeReferenceNodeIF
- inputTypes: Iterator<TypeNodeIF>
- methods
- EnumerationTypeNodeIF
- methods
- name: IdentifierNodeIF
- numElements: int
- element(int i): IdentifierLiteralPairNodeIF: an IdentifierNodeIF-LiteralNodeIF pair
- elements: Iterator<IdentifierLiteralPairNodeIF>
- methods
- VariableDefinitionNodeIF: Every variable is defined at one unique location in the code. The variable can be declared in several places. Each declaration has an associated definition. There can be many decls associated to one def. This is a linking thing: a single variable that is to be used in several files will be defined in one file but be declared in all
- methods
- scope: ? (global, input/output, local,...)
- name: IdentifierNodeIF (together with scope, this uniquely identifies the variable)
- type: TypeNodeIF
- initializer: ExpressionNodeIF
- typeQualifier: CONST, VOLATILE
- storageClass: EXTERNAL, AUTO, STATIC, REGISTER
- methods
- VariableDeclarationNodeIF
- methods
- definition: VariableDeclarationNodeIF (may be null at first)
- scope: ScopeIF
- name: IdentifierNodeIF
- type: TypeNodeIF
- typeQualifier: CONST, VOLATILE
- storageClass: EXTERNAL, AUTO, STATIC, REGISTER
- methods
- FunctionDefinitionNodeIF: a function with a body. As with variables, one def. can have many decls, but each decl is associated to at most one definition.
- methods
- name: IdentifierNodeIF
- numInputs: int
- input(int i): NameTypeReferencePairNodeIF
- outputType: TypeNodeIF
- body: BlockNodeIF
- methods
- FunctionDeclarationNodeIF: no body, just "prototype" in C
- methods
- definition: FunctionDefinitionNodeIF (may be null)
- name: IdentifierNodeIF
- numInputs: int
- input(int i): NameTypePairNodeIF
- outputType: TypeNodeIF
- methods
- TypeDefinitionNodeIF: "typedef" in C
- methods
- name: IdentifierNodeIF
- type: TypeNodeIF (type being assigned a name)
- methods
- LabelNodeIF: a label preceding a statement
- methods
- statement: StatementNodeIF : the statement this label precedes (this may be null at first)
- identifier: IdentifierNodeIF: the name of this label
- methods
- StatementNodeIF
- methods
- labels: Collection<LabelNodeIF>: set of labels associated to this statement. The labels all correspond to the point of control just before the statement.
- AssertStatementNodeIF: this could come from a program language assert or a TASS pragma
- methods
- expr: Expression
- methods
- AssumeStatementNodeIF: these come from TASS pragmas
- methods
- expr: Expression
- methods
- PragmaNodeIF:(any kind of pragma, represented as just a string
- methods
- string: this is an initial form, where the text of the pragma is uninterpreted. After a certain processing stage, TASS pragmas will be parsed and replaced with more specific nodes. Non-tass pragmas will just remain as strings and will be ignored by TASS.
- methods
- SwitchStatementNodeIF: this can be used to model C's switch
- methods
- expression: ExpressionNodeIF: expression that is evaluated
- numCases: int
- case(int i): ValueLocationPairNodeIF (ordered pair (v,l), where v is a ValueNodeIF, and l is an ASTStatementNodeIF)
- default: ValueLocationPairNodeIF
- methods
- IfThenElseStatementNodeIF
- methods
- condition: ExpressionNodeIF
- trueBranch: StatementNodeIF
- falseBranch: StatementNodeIF (may be null)
- methods
- WhileLoopNodeIF: loop. associated invariant (derived from TASS invariant pragma)
- methods
- condition: ExpressionNodeIF
- body: StatementNodeIF
- invariants: Collection<ExpressionNodeIF> ?
- methods
- ForLoopNodeIF
- methods
- initializer: StatementNodeIF
- condition: ExpressionNodeIF
- incrementer: StatementNodeIF
- body: StatementNodeIF
- methods
- DoUntilNodeIF
- BreakNodeIF
- ContinueNodeIF
- GotoNodeIF
- label: LabelNodeIF
- ReturnNodeIF
- expression (may be null)
- NoopNodeIF
- BlockNodeIF: ({...} in C)
- sequence of variable definitions
- sequence of statements
- ExpressionNodeIF: expressions--these are considered a kind of statement (as in C)
- PureExpressionNodeIF: side-effect-free expressions
- LiteralNodeIF (including named literals): integers, reals, strings, chars, enumValue
- VariableExpressionNodeIF
- OperatorNodeIF
- ADD, ADD_POINTER_INT, ADDRESS_OF, BIT_AND, BIT_NOT, BIT_OR, BIT_XOR, DEREFERENCE, DIVIDE, EQUALS, GREATER_THAN, GTE, IF_THEN_ELSE, LEQ, LESS_THAN, LOGICAL_AND, LOGICAL_NOT, LOGICAL_OR, MODULO, MULTIPLY, NAVIGATE (DOT), NEGATIVE, NOT_EQUALS, SHIFT_LEFT, SHIFT_RIGHT, SUBSCRIPT, SUBTRACT, SUBTRACT_POINTER_INT, SUBTRACT_POINTER_POINTER
- CastNodeIF
- newType: TypeNodeIF (type you are casting to)
- expression: ExpressionNodeIF
- QuantifierNodeIF
- quantifier: {FORALL, EXISTS, LAMBDA, UNIFORM, SUM}
- boundVariable: VariableDefinitionNodeIF
- constraint: ExpressionNodeIF
- expression: ExpressionNodeIF
- InitializerNodeIF: special kind of expression used to initialize variables in their decl
- SizeOfNodeIF: type, expression, or string literal
- StartOfNodeIF: &a[0] -- that is the Cil way of converting an array to pointer to first element
- FunctionApplicationNodeIF: invocation f(x) when f is abstract (pure) function
- ProcessReferenceNodeIF: PROC[expr].x
- SideEffectExpressionNodeIF: expressions with side-effects
- AssignmentNodeIF: x=expr, x++, x--, ++x, --x
- FunctionInvocationNodeIF: invocation f(x) when f is concrete
- PureExpressionNodeIF: side-effect-free expressions
- methods
Processing Stages
The AST can be used to represent the program at different stages of translation.
Stages
- The input source file(s) are pre-processed, creating a stream of tokens with source information
- #include
- #define X 100 /* object-like macro */
- #define min(X,Y) (X<=Y ? X : Y) /* function-like macro */
- #ifdef...#endif
- The token stream is parsed to produce an AST in which all variables, types, etc., are represented simply as identifiers
- The pragma strings are parsed and processed, and those AST nodes are filled out
- Symbol table information is associated to the identifiers in the AST
- (Static) types are created and associated to every expression (use the types in the model package?)
- Variable objects are created and inserted into AST
- Side effects are removed by introducing temp variables
- Model is produced
Modules
- CPreProcessor: an ANTLR-generated parser that takes source file and produces token stream, applies pre-processing rules, produces new token stream
- CParser: an ANTRL-generated parser that takes stream of sourced tokens from pre-processor and produces AST stage 0
- AST, ASTNode
