wiki:AST

Context Navigation

Version 13 (modified by Stephen Siegel, 15 years ago) ( diff )
--

The TASS Abstract Syntax Tree

The TASS AST is the common front-end representation of a source program used by TASS. It is used to generate a model, the TASS intermediate representation.

Requirements:

language neutral: to the extent possible one common AST should be used for C, C++, Fortran(s)
should be easy to generate model from AST
should be easy to generate AST from a typical parse tree
should be easy to generate AST from ANTLR, ROSE, or Clang (other tools?)
AST may go through many processing stages; at each stage it is still an AST
persistent XML representation: should be able to read and write
all nodes must carry precise source information: file, begin point (line and column numbers), and end point

Questions:

How to handle pre-processor macros? Include in AST?
How abstract should the AST be?
Should it contain semantic information, e.g., types, and variables?
How will it handle things like (foo)*bar: this could be either a cast of *bar to type foo, or it could be the product of foo and bar; you need to know whether foo defines a type, which is some semantic information
- cf. http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
- approach: just choose one way, then change in later pass when analyzing

Elements of a TASS AST

For one example of how to do this, see description of CIL AST:

Preprocessing: before an AST is created, the source file(s) are preprocessed to create a stream of tokens with complete source information. This stream is fed to the parser which creates the AST.

All AST nodes have source information.

ASTNodeIF extends SourceableIF

methods
- numChildren: int
- child(int i): ASTNodeIF

The following extend ASTNodeIF:

IdentifierNodeIF: wraps a string. represents an identifier occurrence in code
- methods
  - name: string
TypeNodeIF: a type expression
- IdentifierTypeNodeIF
  - this is a type that is just a name, e.g., resulting from a C typedef. This will node will eventually be replaced by a more specific type tree once type definitions are processed.
- VoidTypeNodeIF: the type "void" used, for example, as the return type for a function that does not return anything
- IntegerTypeNodeIF: an integer type, such as "int" or "long int" or "short int" in C
  - integer sub-types, specified by parameters
- RealTypeNodeIF: a real type, such as "float" or "double" in C
  - real sub-stypes, specified by parameters (e.g., IEEE754 floating point numbers)
- BooleanTypeNodeIF: the boolean type
- CharacterTypeNodeIF: the character type (char in C)
- ArrayTypeNodeIF: an array type
  - methods
    - elementType: TypeNodeIF
    - extent: ? This is complicated. The extent might be an expression involving input variables.
- PointerTypeNodeIF
  - methods
    - baseType: TypeNodeIF
- RecordTypeNodeIF: "struct" in C
  - methods
    - name: IdentifierNodeIF
    - numFields: int
    - field(int i): NameTypePairNodeIF: this is an ordered pair (n,t) where n is an IdentifierNodeIF and t is a TypeNodeIF
- FunctionTypeNodeIF
  - methods
    - numInputs: int
    - inputType(int i): TypeNodeIF
- EnumerationTypeNodeIF
  - methods
    - name: IdentifierNodeIF
    - numElements: int
    - element(int i): IdentifierNodeIF
VariableDefinitionNodeIF
- methods
  - scope: ? (global, input/output, local,...)
  - name: IdentifierNodeIF (together with scope, this uniquely identifies the variable)
  - type: TypeNodeIF
  - initializer: ExpressionNodeIF
VariableDeclarationNodeIF: each declaration has an associated definition. There can be many decls associated to one def. This is a linking thing: a single variable that is to be used in several files will be defined in one file but be declared in all
- methods
  - definition: VariableDeclarationNodeIF (may be null)
  - scope: ScopeIF
  - name: IdentifierNodeIF
  - type: TypeNodeIF
FunctionDefinitionNodeIF: a function with a body. As with variables, one def. can have many decls, but each decl is associated to at most one definition.
- methods
  - name: IdentifierNodeIF
  - numInputs: int
  - input(int i): NameTypePairNodeIF
  - outputType: TypeNodeIF
  - body: BlockNodeIF
FunctionDeclarationNodeIF: no body, just "prototype" in C
- methods
  - definition: FunctionDefinitionNodeIF (may be null)
  - name: IdentifierNodeIF
  - numInputs: int
  - input(int i): NameTypePairNodeIF
  - outputType: TypeNodeIF
TypeDefinitionNodeIF: "typedef" in C
- methods
  - name: IdentifierNodeIF
  - type: TypeNodeIF (type being assigned a name)
StatementNodeIF
- methods
  - labels: Collection<ASTLabelIF>: set of labels associated to this statement. The labels all correspond to the point of control just before the statement.
- AssertStatementNodeIF: this could come from a program language assert or a TASS pragma
  - methods
    - expr: Expression
- AssumeStatementNodeIF: these come from TASS pragmas
  - methods
    - expr: Expression
- PragmaNodeIF:(any kind of pragma, represented as just a string
  - methods
    - string: this is an initial form, where the text of the pragma is uninterpreted. After a certain processing stage, TASS pragmas will be parsed and replaced with more specific nodes. Non-tass pragmas will just remain as strings and will be ignored by TASS.
- SwitchStatementNodeIF: this can be used to model C's switch
  - methods
    - expression: ExpressionNodeIF: expression that is evaluated
    - numCases: int
    - case(int i): ValueLocationPairNodeIF (ordered pair (v,l), where v is a ValueNodeIF, and l is an ASTStatementNodeIF)
    - default: ValueLocationPairNodeIF
- IfThenElseStatementNodeIF
  - methods
    - condition: ExpressionNodeIF
    - trueBranch: StatementNodeIF
    - falseBranch: StatementNodeIF (may be null)
- WhileLoopNodeIF: loop. associated invariant (derived from TASS invariant pragma)
  - methods
    - condition: ExpressionNodeIF
    - body: StatementNodeIF
    - invariants: Collection<ExpressionNodeIF> ?
- ForLoopNodeIF
  - methods
    - initializer: StatementNodeIF
    - condition: ExpressionNodeIF
    - incrementer: StatementNodeIF
    - body: StatementNodeIF
- DoUntilNodeIF
- BreakNodeIF
- ContinueNodeIF
- GotoNodeIF
  - label (this is the kind before processing)
  - statement (after processing)
- ReturnNodeIF
  - expression (may be null)
- NoopNodeIF
- BlockNodeIF: ({...} in C)
  - sequence of variable definitions
  - sequence of statements
- ExpressionNodeIF: expressions--these are considered a kind of statement (as in C)
  - PureExpressionNodeIF: side-effect-free expressions
    - LiteralNodeIF (including named literals): integers, reals, strings, chars
    - VariableExpressionNodeIF
    - OperatorNodeIF
      - ADD, ADD_POINTER_INT, ADDRESS_OF, BIT_AND, BIT_NOT, BIT_OR, BIT_XOR, DEREFERENCE, DIVIDE, EQUALS, GREATER_THAN, GTE, IF_THEN_ELSE, LEQ, LESS_THAN, LOGICAL_AND, LOGICAL_NOT, LOGICAL_OR, MODULO, MULTIPLY, NAVIGATE, NEGATIVE, NOT_EQUALS, SHIFT_LEFT, SHIFT_RIGHT, SUBSCRIPT, SUBTRACT, SUBTRACT_POINTER_INT, SUBTRACT_POINTER_POINTER
    - CastNodeIF
      - newType: TypeNodeIF (type you are casting to)
      - expression: ExpressionNodeIF
    - ArrayIndexNodeIF (a[i])
    - RecordNavigationNodeIF: (x.a)
    - QuantifierNodeIF
      - quantifier: {FORALL, EXISTS, LAMBDA, UNIFORM, SUM}
      - boundVariable: VariableDefinitionNodeIF
      - constraint: ExpressionNodeIF
      - expression: ExpressionNodeIF
    - InitializerNodeIF: special kind of expression used to initialize variables in their decl
    - SizeOfNodeIF: type, expression, or string literal
    - StartOfNodeIF: &a[0] -- that is the Cil way of converting an array to pointer to first element
    - FunctionApplicationNodeIF: invocation f(x) when f is abstract (pure) function
  - SideEffectExpressionNodeIF: expressions with side-effects
    - AssignmentNodeIF: x=expr, x++, x--, ++x, --x
    - FunctionInvocationNodeIF: invocation f(x) when f is concrete

Processing Stages

The AST can be used to represent the program at different stages of translation.

Stages

The input source file(s) are pre-processed, creating a stream of tokens with source information
1. #include
2. #define X 100 /* object-like macro */
3. #define min(X,Y) (X<=Y ? X : Y) /* function-like macro */
4. #ifdef...#endif
The token stream is parsed to produce an AST in which all variables, types, etc., are represented simply as identifiers
The pragma strings are parsed and processed, and those AST nodes are filled out
Symbol table information is associated to the identifiers in the AST
(Static) types are created and associated to every expression (use the types in the model package?)
Variable objects are created and inserted into AST
Side effects are removed by introducing temp variables
Model is produced

Modules

CPreProcessor: an ANTLR-generated parser that takes source file and produces token stream, applies pre-processing rules, produces new token stream
CParser: an ANTRL-generated parser that takes stream of sourced tokens from pre-processor and produces AST stage 0
AST, ASTNode

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text