| Version 16 (modified by , 15 years ago) ( diff ) |
|---|
The TASS Abstract Syntax Tree
The TASS AST is the common front-end representation of a source program used by TASS. It is used to generate a model, the TASS intermediate representation.
Requirements:
- language neutral: to the extent possible one common AST should be used for C, C++, Fortran(s)
- should be easy to generate model from AST
- should be easy to generate AST from a typical parse tree
- should be easy to generate AST from ANTLR, ROSE, or Clang (other tools?)
- AST may go through many processing stages; at each stage it is still an AST
- persistent XML representation: should be able to read and write
- all nodes must carry precise source information: file, begin point (line and column numbers), and end point
Questions:
- How to handle pre-processor macros? Include in AST?
- How abstract should the AST be?
- Should it contain semantic information, e.g., types, and variables?
- How will it handle things like (foo)*bar: this could be either a cast of *bar to type foo, or it could be the product of foo and bar; you need to know whether foo defines a type, which is some semantic information
- cf. http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
- approach: just choose one way, then change in later pass when analyzing
Elements of a TASS AST
For one example of how to do this, see description of CIL AST:
Preprocessing: before an AST is created, the source file(s) are preprocessed to create a stream of tokens with complete source information. This stream is fed to the parser which creates the AST.
All AST nodes have source information.
ASTNodeIF extends SourceableIF
- methods
- numChildren: int
- child(int i): ASTNodeIF
The following extend ASTNodeIF:
- IdentifierNodeIF: wraps a string. represents an identifier occurrence in code
- methods
- name: string
- methods
- TypeNodeIF: a type expression
- IdentifierTypeNodeIF
- this is a type that is just a name, e.g., resulting from a C typedef. This will node will eventually be replaced by a more specific type tree once type definitions are processed.
- methods
- identifier: IdentifierNodeIF
- VoidTypeNodeIF: the type "void" used, for example, as the return type for a function that does not return anything
- IntegerTypeNodeIF: an integer type, such as "int" or "long int" or "short int" in C
- integer sub-types, specified by parameters
- RealTypeNodeIF: a real type, such as "float" or "double" in C
- real sub-stypes, specified by parameters (e.g., IEEE754 floating point numbers)
- BooleanTypeNodeIF: the boolean type
- CharacterTypeNodeIF: the character type (char in C)
- ArrayTypeNodeIF: an array type
- methods
- elementType: TypeNodeIF
- extent: ? This is complicated. The extent might be an expression involving input variables.
- methods
- PointerTypeNodeIF
- methods
- baseType: TypeNodeIF
- methods
- RecordTypeNodeIF: "struct" in C
- methods
- name: IdentifierNodeIF
- numFields: int
- field(int i): NameTypePairNodeIF: this is an ordered pair (n,t) where n is an IdentifierNodeIF and t is a TypeNodeIF
- methods
- FunctionTypeNodeIF
- methods
- numInputs: int
- inputType(int i): TypeNodeIF
- methods
- EnumerationTypeNodeIF
- methods
- name: IdentifierNodeIF
- numElements: int
- element(int i): IdentifierNodeIF
- methods
- IdentifierTypeNodeIF
- VariableDefinitionNodeIF
- methods
- scope: ? (global, input/output, local,...)
- name: IdentifierNodeIF (together with scope, this uniquely identifies the variable)
- type: TypeNodeIF
- initializer: ExpressionNodeIF
- methods
- VariableDeclarationNodeIF: each declaration has an associated definition. There can be many decls associated to one def. This is a linking thing: a single variable that is to be used in several files will be defined in one file but be declared in all
- methods
- definition: VariableDeclarationNodeIF (may be null)
- scope: ScopeIF
- name: IdentifierNodeIF
- type: TypeNodeIF
- methods
- FunctionDefinitionNodeIF: a function with a body. As with variables, one def. can have many decls, but each decl is associated to at most one definition.
- methods
- name: IdentifierNodeIF
- numInputs: int
- input(int i): NameTypePairNodeIF
- outputType: TypeNodeIF
- body: BlockNodeIF
- methods
- FunctionDeclarationNodeIF: no body, just "prototype" in C
- methods
- definition: FunctionDefinitionNodeIF (may be null)
- name: IdentifierNodeIF
- numInputs: int
- input(int i): NameTypePairNodeIF
- outputType: TypeNodeIF
- methods
- TypeDefinitionNodeIF: "typedef" in C
- methods
- name: IdentifierNodeIF
- type: TypeNodeIF (type being assigned a name)
- methods
- LabelNodeIF: a label preceding a statement
- methods
- statement: StatementNodeIF : the statement this label precedes (this may be null at first)
- identifier: IdentifierNodeIF: the name of this label
- methods
- StatementNodeIF
- methods
- labels: Collection<LabelNodeIF>: set of labels associated to this statement. The labels all correspond to the point of control just before the statement.
- AssertStatementNodeIF: this could come from a program language assert or a TASS pragma
- methods
- expr: Expression
- methods
- AssumeStatementNodeIF: these come from TASS pragmas
- methods
- expr: Expression
- methods
- PragmaNodeIF:(any kind of pragma, represented as just a string
- methods
- string: this is an initial form, where the text of the pragma is uninterpreted. After a certain processing stage, TASS pragmas will be parsed and replaced with more specific nodes. Non-tass pragmas will just remain as strings and will be ignored by TASS.
- methods
- SwitchStatementNodeIF: this can be used to model C's switch
- methods
- expression: ExpressionNodeIF: expression that is evaluated
- numCases: int
- case(int i): ValueLocationPairNodeIF (ordered pair (v,l), where v is a ValueNodeIF, and l is an ASTStatementNodeIF)
- default: ValueLocationPairNodeIF
- methods
- IfThenElseStatementNodeIF
- methods
- condition: ExpressionNodeIF
- trueBranch: StatementNodeIF
- falseBranch: StatementNodeIF (may be null)
- methods
- WhileLoopNodeIF: loop. associated invariant (derived from TASS invariant pragma)
- methods
- condition: ExpressionNodeIF
- body: StatementNodeIF
- invariants: Collection<ExpressionNodeIF> ?
- methods
- ForLoopNodeIF
- methods
- initializer: StatementNodeIF
- condition: ExpressionNodeIF
- incrementer: StatementNodeIF
- body: StatementNodeIF
- methods
- DoUntilNodeIF
- BreakNodeIF
- ContinueNodeIF
- GotoNodeIF
- label: LabelNodeIF
- ReturnNodeIF
- expression (may be null)
- NoopNodeIF
- BlockNodeIF: ({...} in C)
- sequence of variable definitions
- sequence of statements
- ExpressionNodeIF: expressions--these are considered a kind of statement (as in C)
- PureExpressionNodeIF: side-effect-free expressions
- LiteralNodeIF (including named literals): integers, reals, strings, chars
- VariableExpressionNodeIF
- OperatorNodeIF
- ADD, ADD_POINTER_INT, ADDRESS_OF, BIT_AND, BIT_NOT, BIT_OR, BIT_XOR, DEREFERENCE, DIVIDE, EQUALS, GREATER_THAN, GTE, IF_THEN_ELSE, INDEX, LEQ, LESS_THAN, LOGICAL_AND, LOGICAL_NOT, LOGICAL_OR, MODULO, MULTIPLY, NAVIGATE, NEGATIVE, NOT_EQUALS, SHIFT_LEFT, SHIFT_RIGHT, SUBSCRIPT, SUBTRACT, SUBTRACT_POINTER_INT, SUBTRACT_POINTER_POINTER
- CastNodeIF
- newType: TypeNodeIF (type you are casting to)
- expression: ExpressionNodeIF
- QuantifierNodeIF
- quantifier: {FORALL, EXISTS, LAMBDA, UNIFORM, SUM}
- boundVariable: VariableDefinitionNodeIF
- constraint: ExpressionNodeIF
- expression: ExpressionNodeIF
- InitializerNodeIF: special kind of expression used to initialize variables in their decl
- SizeOfNodeIF: type, expression, or string literal
- StartOfNodeIF: &a[0] -- that is the Cil way of converting an array to pointer to first element
- FunctionApplicationNodeIF: invocation f(x) when f is abstract (pure) function
- ProcessReferenceNodeIF: PROC[expr].x
- SideEffectExpressionNodeIF: expressions with side-effects
- AssignmentNodeIF: x=expr, x++, x--, ++x, --x
- FunctionInvocationNodeIF: invocation f(x) when f is concrete
- PureExpressionNodeIF: side-effect-free expressions
- methods
Processing Stages
The AST can be used to represent the program at different stages of translation.
Stages
- The input source file(s) are pre-processed, creating a stream of tokens with source information
- #include
- #define X 100 /* object-like macro */
- #define min(X,Y) (X<=Y ? X : Y) /* function-like macro */
- #ifdef...#endif
- The token stream is parsed to produce an AST in which all variables, types, etc., are represented simply as identifiers
- The pragma strings are parsed and processed, and those AST nodes are filled out
- Symbol table information is associated to the identifiers in the AST
- (Static) types are created and associated to every expression (use the types in the model package?)
- Variable objects are created and inserted into AST
- Side effects are removed by introducing temp variables
- Model is produced
Modules
- CPreProcessor: an ANTLR-generated parser that takes source file and produces token stream, applies pre-processing rules, produces new token stream
- CParser: an ANTRL-generated parser that takes stream of sourced tokens from pre-processor and produces AST stage 0
- AST, ASTNode
