The evaluation of our approach for verifying message-passing programs
against function contracts is based on the MPI function contract
system described in Chapter \ref{chp:mpi:system}.  We applied our
implementation to a number of C/MPI/CIVL-C collective-style functions,
including functions from CIVL's MPI library, implementations of
advanced algorithms for MPI collective functions, and collective-style
functions carved out of MPI scientific computing applications.

Two of these collective-style functions are presented in detail in
\S\ref{chp:eval:examples}.  The overall experimental results and
performance are discussed in \S\ref{chp:eval:results}.

\section{Running Examples} \label{chp:eval:examples}

Real MPI implementations, such as MPICH
\cite{Thakur:2005:OCC:2747766.2747771}, use advanced algorithms for
MPI collective functions, and more such algorithms continue to be
designed and implemented \cite{jesper2019optimal, KANG2019220}.  An
MPI contract system can be effective for verifying the correctness of
these algorithms since 1) it can verify function implementations in
isolation; and 2) different implementations of one MPI collective
function can share a single function contract.  To illustrate this
effectiveness, we show the verification of two
\texttt{MPI\U{}Allreduce} implementations in
\S\ref{sec:mpi:eval:allgather}.

The implementation of an MPI collective function, or more generally of
an MPI collective-style function, can be complicated.  For a contract
system, however, this complexity has no effect on the verification of
caller functions: thanks to the modular verification feature of
contract systems, the behavior of a call to a contracted function
depends only on its function contract.  We illustrate this advantage
of our MPI contract system in \S\ref{sec:mpi:eval:dotprod} by
presenting an example that uses the \texttt{MPI\U{}Allreduce}
function.

\subsection{Allreduce Implementations} \label{sec:mpi:eval:allgather}

\textbf{CIVL's implementation.} In CIVL, simple and deterministic algorithms are used to implement MPI
collective functions.  Figure \ref{fig:mpi:eval:allgather} shows
CIVL's implementation of the \texttt{MPI\U{}Allreduce} function, which
is simply a call to \texttt{MPI\U{}Reduce} followed by a call to
\texttt{MPI\U{}Bcast}.

Both \texttt{MPI\U{}Reduce} and \texttt{MPI\U{}Bcast} are annotated
with function contracts, so their definitions are not needed for
verifying the functional correctness of the \texttt{MPI\U{}Allreduce}
function.  Since all of these functions are MPI collective functions,
they are guaranteed not to interfere with each other, and there is no
need to specify an interference-related requirement or guarantee for
them.

To verify \texttt{MPI\U{}Allreduce} against these contracts, the user
specifies a bound on the number of processes.  The contract system is
then able to verify that \texttt{MPI\U{}Allreduce} satisfies its
function contract for the bounded number of processes and for
arbitrary message data, data size, data type, and reduction operation.

Next, we briefly explain these contracts.

For the contract of \texttt{MPI\U{}Bcast}, the state-requirement
states that

\begin{enumerate}
\item \texttt{root} must be at least 0 and less than the size of the
  communicator;
  
\item the type signature (i.e., the product of the \texttt{count} and
  the \texttt{datatype} extent) must be non-negative;

\item all processes must have the same values for \texttt{root} and
  the type signature,  respectively;

\item the pointer \texttt{buf} must point to an allocated memory
  region whose size is no less than the type signature.
\end{enumerate}
The state-guarantee ensures that eventually all processes will have
the same value in the memory regions referred to by their
\texttt{buf}s.  In addition, only the memory regions of the non-root
processes will be modified.  This frame condition is expressed with a
behavior named ``\texttt{nonroot}''.

For the contract of \texttt{MPI\U{}Reduce}, the state-requirement is
similar to that of \texttt{MPI\U{}Bcast} except that it is stronger on
the values of \texttt{count} and \texttt{datatype}: it requires that
all processes have the same values for \texttt{count} and
\texttt{datatype}, respectively, not merely the same type signature.
In fact, the author of this dissertation was not aware that
\texttt{MPI\U{}Reduce} needs a stronger state-requirement than
\texttt{MPI\U{}Bcast} until the contract system reported an error.
The state-guarantee ensures that 1) nothing will be modified on the
non-root processes; and 2) the memory region referred to by
\texttt{recvbuf} on the root process will eventually be assigned the
result of the reduction over the memory regions referred to by the
\texttt{sendbuf}s on all processes.

The contract of \texttt{MPI\U{}Allreduce} is similar to that of
\texttt{MPI\U{}Reduce} except that \texttt{MPI\U{}Allreduce} ensures
that all processes will eventually hold the reduction result.


\begin{figure}
  \begin{programSmall}
\mycomment{/*@ \BSL{}mpi\U{}collective(comm, COL):}\\
\mycomment{\ \ \ \ requires 0 <= root < \BSL{}mpi\U{}comm\U{}size \&\&
  0 <= count * \BSL{}mpi\U{}extent(datatype)}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}mpi\U{}agree(root) \&\& \BSL{}mpi\U{}agree(count * \BSL{}mpi\U{}extent(datatype))}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}mpi\U{}valid(buf, count, datatype);}\\
\mycomment{\ \ \ \ ensures \BSL{}mpi\U{}agree(\BSL{}mpi\U{}region(buf, count, datatype));}\\
\mycomment{\ \ \ \ behavior nonroot:}\\
\mycomment{\ \ \ \ \ \ assumes \BSL{}mpi\U{}comm\U{}rank != root;}\\
\mycomment{\ \ \ \ \ \ assigns  \BSL{}mpi\U{}region(buf, count,
  datatype); */}\\
\ccode{int} MPI\U{}Bcast(\ccode{void} * buf, \ccode{int} count, 
MPI\U{}Datatype datatype, \ccode{int} root, \\
\ \ \ \ \ \ \ \ \ \ \ \ \ \ MPI\U{}Comm
              comm);\\ \\
 \mycomment{/*@ \BSL{}mpi\U{}collective(comm, COL):} \\
 \mycomment{\ \ \ \   requires  0 <= count*\BSL{}mpi\U{}extent(datatype)} \\
 \mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}mpi\U{}valid(sendbuf, count, datatype)}\\
 \mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}mpi\U{}valid(recvbuf, count, datatype)}\\
 \mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}mpi\U{}agree(root) \&\& \BSL{}mpi\U{}agree(count) \&\& \BSL{}mpi\U{}agree(datatype)}\\
 \mycomment{\ \ \ \ \ \ \ \ \ \ \&\& 0 <= root < \BSL{}mpi\U{}comm\U{}size;}\\
 \mycomment{\ \ \ \   behavior root:}\\
 \mycomment{\ \ \ \ \ \   assumes \BSL{}mpi\U{}comm\U{}rank == root;}\\
 \mycomment{\ \ \ \ \ \   assigns \BSL{}mpi\U{}region(recvbuf, count, datatype);}\\
  \mycomment{\ \ \ \ \ \  ensures  \BSL{}mpi\U{}equals(\BSL{}mpi\U{}region(recvbuf, count, datatype),}\\
  \mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 
    \BSL{}mpi\U{}reduce(sendbuf, count, datatype, op)); */}\\
\ccode{int} MPI\U{}Reduce(\ccode{void} * sendbuf, \ccode{void} * recvbuf, \ccode{int} count, MPI\U{}Datatype datatype, \\
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ MPI\U{}Op op, \ccode{int} root, MPI\U{}Comm comm); \\
\\
\mycomment{/*@ \BSL{}mpi\U{}collective(comm, COL):}\\
\mycomment{\ \ \ \ requires count >= 0 \&\& \BSL{}mpi\U{}valid(sendbuf, count,
  datatype)}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}mpi\U{}valid(recvbuf, count, datatype)}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}mpi\U{}agree(count) \&\& \BSL{}mpi\U{}agree(op) \&\& \BSL{}mpi\U{}agree(datatype);}\\
\mycomment{\ \ \ \   assigns  \BSL{}mpi\U{}region(recvbuf, count, datatype);}\\
\mycomment{\ \ \ \   ensures  \BSL{}mpi\U{}equals(\BSL{}mpi\U{}region(recvbuf, count, datatype),}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \BSL{}mpi\U{}reduce(sendbuf, count, datatype, op));*/}\\
\ccode{int} MPI\U{}Allreduce(\ccode{void} * sendbuf, \ccode{void} * recvbuf, \ccode{int} count, \\
\ \ \ \ \ \ \ \ \ \ \ \ \ \ MPI\U{}Datatype datatype, MPI\U{}Op op, MPI\U{}Comm comm) \lb{} \\
\ \  MPI\U{}Reduce(sendbuf, recvbuf, count, datatype, op, 0, comm); \\
\ \  MPI\U{}Bcast(recvbuf, count, datatype, 0, comm); \\
\ \   \ccode{return} 0; \\
\rb{}\\
\end{programSmall}
\caption{Function contracts for the collective-style functions used by
  CIVL's implementation of \texttt{MPI\U{}Allreduce}.}
\label{fig:mpi:eval:allgather}
\end{figure}

\textbf{Recursive Doubling.}  Real MPI implementations use more
optimized algorithms for \texttt{MPI\U{}Allreduce}, such as the
\emph{recursive doubling} algorithm.  Figure
\ref{fig:mpi:eval:recursive-double} is a CIVL-C realization of
recursive doubling \texttt{MPI\U{}Allreduce} pseudocode given in
\cite{Ruefenacht:2017:GRD:3163938.3164013}.  The function contract of
\texttt{MPI\U{}Allreduce} is the same as in Figure
\ref{fig:mpi:eval:allgather} and hence is omitted from this figure.
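To make the non-power-of-two pre-processing step of the algorithm
concrete, the rank folding performed at the top of the figure can be
sketched in plain C.  The helper \texttt{fold\U{}rank} below is
hypothetical (it is not part of the implementation); it mirrors the
computation of \texttt{pof2}, \texttt{rem}, and \texttt{myrank}: when
the number of processes is not a power of two, each even rank below
$2 \cdot \texttt{rem}$ donates its value to its odd neighbor and sits
out of the doubling phase.

```c
#include <assert.h>

/* Hypothetical sketch of the figure's pre-processing step: fold the
 * ranks of a non-power-of-two communicator onto the largest power of
 * two pof2 <= nprocs.  Returns the folded rank, or -1 for the even
 * ranks below 2*rem that sit out of the recursive doubling phase. */
int fold_rank(int rank, int nprocs, int *pof2_out) {
    int pof2 = 1;
    while (pof2 * 2 <= nprocs)
        pof2 *= 2;                 /* pof2 = pow2(log2(nprocs)) */
    int rem = nprocs - pof2;       /* number of "extra" processes */
    *pof2_out = pof2;
    if (rank < 2 * rem)
        return (rank % 2 == 0) ? -1 : rank / 2;
    return rank - rem;
}
```

For instance, with 6 processes \texttt{pof2} is 4: ranks 0 and 2 sit
out while the remaining four ranks are renumbered 0--3.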

In this implementation, we used CIVL-C primitives to abstract details
that are irrelevant to the algorithm:

\begin{itemize}
\item The \texttt{\$mpi\U{}malloc} is a helper function written in
  CIVL-C.  It allocates a memory region of the size of the given type
  signature.  When the value of \texttt{datatype} is arbitrary,
  instead of elaborating all possible concrete choices, this function
  uses an uninterpreted function
  $\mymath{extent(\texttt{datatype})}$ to represent the extent of the
  \texttt{datatype} and allocates a memory region of size
  $\mymath{extent(\texttt{datatype}) * \texttt{count}}$.
  
\item In ACSL, a reduction can be expressed with different constructs
  depending on the reduction operation.  For example, summing up all
  $n$ elements of an array $a$ can be expressed as
  \texttt{\BSL{}sum(0, $n$-1, \BSL{}lambda int i; $a$[i])} in ACSL.
  However, when the reduction operation is arbitrary, specifying it in
  ACSL inevitably elaborates all possible concrete choices, which
  makes the verification much more expensive.  Hence, we used a CIVL-C
  system function \texttt{\$memapply} to implement the
  \texttt{reduce} function that performs the process-local reduction
  operation:
  
  \begin{programNoNum}
  \ccode{void} reduce(\ccode{void} * inout, \ccode{void} * in,
  \ccode{int} count, \ccode{int} size, \\
\ \ \ \ \ \ \ \ \ \ \ \ MPI\U{}Op op) \lb{}\\
\ \  \$memapply(inout, (\$operation)op, in, count, size, inout); \\
\rb{}
\end{programNoNum}

The \texttt{\$memapply} function performs an ``element-wise''
\texttt{\$operation} on the two memory regions referred to by
\texttt{inout} and \texttt{in}, respectively, and writes the result
back to \texttt{inout}.  Here ``element-wise'' means that the
operation is applied to every pair of (\texttt{size $/$ count})-byte
memory regions referred to by \texttt{(char*)inout + $i$ * (size $/$
count)} and \texttt{(char*)in + $i$ * (size $/$ count)},
respectively, for $0 \leq i < \texttt{count}$.

When the \texttt{op} has a non-concrete value, \texttt{\$memapply}
uses an uninterpreted function to represent the general reduction
result.
\end{itemize}

With these abstractions, this implementation can be verified by the
MPI contract system using the same configuration as for the previous
implementation.
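For a concrete operation, the element-wise semantics of
\texttt{\$memapply} described above can be sketched in plain C.  The
sketch below is ours, not CIVL's source: it fixes the operation to
integer addition (\texttt{int\U{}sum}) instead of a symbolic
\texttt{\$operation}.

```c
#include <assert.h>

typedef void (*op_t)(void *inout, const void *in);

/* A concrete example operation: integer addition on one element. */
static void int_sum(void *inout, const void *in) {
    *(int *)inout += *(const int *)in;
}

/* Sketch of $memapply's element-wise semantics: apply op to every
 * pair of (size/count)-byte elements of the two regions, writing the
 * result back into inout. */
void memapply_sketch(void *inout, const void *in,
                     int count, int size, op_t op) {
    int elem = size / count;       /* bytes per element */
    for (int i = 0; i < count; i++)
        op((char *)inout + i * elem, (const char *)in + i * elem);
}
```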

\begin{figure}
  \begin{programSmall}
\ccode{void} MPI\U{}Allreduce(\ccode{void} * sendbuf, \ccode{void} *
recvbuf, \ccode{int} count,  \\ 
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ MPI\U{}Datatype datatype, MPI\U{}Op op,  MPI\U{}Comm comm) \lb{}\\
\ \  \ccode{int} datasize = sizeofDatatype(datatype) * count; \\
\ \  \ccode{void} * global = \$mpi\U{}malloc(count, datatype); \\
\ \  \ccode{int} pof2, rem, nprocs, rank, myrank, mask = 1; \\
\\
\ \  MPI\U{}Comm\U{}size(comm, \&nprocs);   MPI\U{}Comm\U{}rank(comm, \&rank);\\
\ \  pof2 = log2(nprocs);  pof2 = pow2(pof2); \\
\ \  rem = nprocs - pof2; \\
\ \  memcpy(global, sendbuf, datasize);\\
\ \ \ccode{if} (rank < 2 * rem)  \lb{} \\
\ \ \ \  \ccode{if} (rank \% 2 == 0) \lb{} \\
\ \ \ \ \ \   MPI\U{}Send(global, count, datatype, rank + 1, 0, comm); \\
\ \ \ \ \ \    myrank = -1; \\
\ \ \ \  \rb{} \ccode{else} \lb{} \\
\ \ \ \ \ \   MPI\U{}Recv(recvbuf, count, datatype, rank - 1, 0, comm, MPI\U{}STATUS\U{}IGNORE);\\
\ \ \ \ \ \   reduce(global, recvbuf, count, datasize, op); \\
\ \ \ \ \ \   myrank = rank / 2;\\
\ \ \ \   \rb{}\\
\ \  \rb{} \ccode{else} \\
\ \ \ \     myrank = rank - rem;\\
\ \ \ccode{if} (myrank != -1)  \\
\ \ \ \  \ccode{while} (mask < pof2) \lb{} \\
\ \ \ \ \ \   \ccode{int} dst, newdst = myrank \char`\^{} mask; \\
\\
\ \ \ \ \ \    \ccode{if} (newdst < rem) \\
\ \ \ \ \ \ \ \ 	dst = newdst * 2 + 1;\\
\ \ \ \ \ \     \ccode{else} \\
\ \ \ \ \ \ \ \	dst = newdst + rem;\\
\ \ \ \ \ \  MPI\U{}Sendrecv(global, count, datatype, dst, 0, \\
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ recvbuf, count, datatype, dst, 0, comm, MPI\U{}STATUS\U{}IGNORE);\\
\ \ \ \ \ \      reduce(global, recvbuf, count, datasize, op);\\
\ \ \ \ \ \      mask = mask <{}< 1;\\
\ \ \ \    \rb{}\\
\ \  \ccode{if} (rank < 2 * rem) \\
\ \ \ \   \ccode{if} (rank \% 2 != 0) \\
\ \ \ \ \ \   MPI\U{}Send(global, count, datatype, rank - 1, 0, comm); \\
\ \ \ \  \ccode{else} \\
\ \ \ \ \ \   MPI\U{}Recv(global, count, datatype, rank + 1, 0, comm, MPI\U{}STATUS\U{}IGNORE);\\
\ \  memcpy(recvbuf, global, datasize);\\
\ \  free(global);\\
\rb{}
\end{programSmall}
\caption{An \texttt{MPI\U{}Allreduce} implementation based on the
  recursive doubling algorithm.}
\label{fig:mpi:eval:recursive-double}
\end{figure}
  
\subsection{Parallel Vector Product} \label{sec:mpi:eval:dotprod}

The two \texttt{MPI\U{}Allreduce} implementations presented in
\S\ref{sec:mpi:eval:allgather} differ considerably in complexity.  But
the choice of implementation is irrelevant to the functional
correctness of a caller of \texttt{MPI\U{}Allreduce} as long as
\texttt{MPI\U{}Allreduce} is annotated with a contract.  We present
such a collective-style caller of \texttt{MPI\U{}Allreduce} in Fig.\
\ref{fig:mpi:dotprod}.  The data structures used in this figure are
shown in Fig.\ \ref{fig:mpi:dotprod:structures}.  This C code was
carved out of the HYPRE \cite{falgout:2002:hypre} open source project.

%% what does this function do ?
The \texttt{hypre\U{}ParVectorInnerProd} function has every process
compute the dot product of two local sequences of elements in parallel
and then sums up the local dot products over all processes.

The collective behavior in the function contract of
\texttt{hypre\U{}ParVectorInnerProd} is associated with the MPI
communicator stored in the structure referred to by \texttt{x}.  The
state-requirement states that

\begin{enumerate}
\item the structures referred by \texttt{x} and \texttt{y} have been
  allocated,

\item the \texttt{hypre\U{}Vector}
structures referred to by the \texttt{local\U{}vector} fields of
\texttt{*x} and \texttt{*y} have been allocated,

\item the sequences of elements referred to by the \texttt{data}
  fields of \texttt{x->local\U{}vector} and \texttt{y->local\U{}vector}
  have been allocated, and

\item the size of an element sequence must be positive.
\end{enumerate}
The state-guarantee states that 1) nothing will be modified by this
function, and 2) the returned value equals the sum of the local dot
products computed by all the processes.
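As a sanity check, the value described by this state-guarantee can be
computed sequentially.  The function below is a hypothetical reference
sketch, not HYPRE code; it assumes the data of both vectors is laid
out as \texttt{nprocs} consecutive blocks of \texttt{par\U{}size}
elements, one block per simulated process.

```c
#include <assert.h>

/* Hypothetical reference semantics of the parallel inner product:
 * the sum over all processes k of the local dot product computed by
 * process k, here simulated over one flat array per vector. */
double par_inner_prod_spec(const double *x, const double *y,
                           int par_size, int nprocs) {
    double result = 0.0;
    for (int k = 0; k < nprocs; k++)        /* sum over processes */
        for (int t = 0; t < par_size; t++)  /* local dot product */
            result += x[k * par_size + t] * y[k * par_size + t];
    return result;
}
```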

This function relies on the sequential function
\texttt{hypre\U{}SeqVectorInnerProd} to perform the process-local dot
product operation.  To fully verify the functional correctness of
\texttt{hypre\U{}ParVectorInnerProd}, one must also verify
\texttt{hypre\U{}SeqVectorInnerProd} independently.  The contract of
this sequential function is purely in ACSL and easy to understand.  We
remark that in order to verify this function for arbitrary sizes of
the element sequences, we need to annotate the loop at lines 23-24
with a loop invariant.  The loop invariant describes the property of
the loop and is verified inductively during symbolic execution.  Once
the loop is verified to satisfy its invariant, the effect of the loop
on the program state can be derived directly from the invariant.  A
detailed description of CIVL's support for sequential loop invariants
can be found in \cite{10.1007/978-3-030-03421-4_12}.
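The inductive role of the invariant can be illustrated with a runtime
check.  The function below is our illustration only (it is neither
HYPRE nor CIVL code): it re-evaluates the invariant's partial-sum
property at the head of each iteration, which is the property the
contract system establishes symbolically once and for all.

```c
#include <assert.h>

/* Illustration only: the loop invariant for the dot-product loop
 * states that at the head of each iteration, result equals the dot
 * product over the indices [0, i).  Here we check it dynamically. */
double inner_prod_checked(const double *x, const double *y, int size) {
    double result = 0.0;
    for (int i = 0; i < size; i++) {
        double partial = 0.0;
        for (int j = 0; j < i; j++)
            partial += y[j] * x[j];
        assert(result == partial);  /* invariant holds here */
        result += y[i] * x[i];
    }
    return result;   /* invariant at exit (i == size): the full sum */
}
```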

By verifying \texttt{hypre\U{}SeqVectorInnerProd} and
\texttt{hypre\U{}ParVectorInnerProd} \\
against their contracts
separately, the functional correctness of both functions is thus
verified.  Of course, for \texttt{hypre\U{}ParVectorInnerProd}, the
result is limited to a bounded number of processes.

\begin{figure}[t]
  \begin{minipage}[t][][b]{.45\textwidth}
    \begin{programSmall}
      \ccode{typedef struct} \lb{}\\
      \ \  HYPRE\U{}Complex  *data;\\
      \ \   HYPRE\U{}Int      size; \\
      \ \   HYPRE\U{}Int   num\U{}vectors;   \\
      \ \   ... \\
      \rb{} hypre\U{}Vector;\\
    \end{programSmall}
  \end{minipage}
  \begin{minipage}[t][][b]{.45\textwidth} 
    \begin{programSmall}
      \ccode{typedef struct} hypre\U{}ParVector\U{}struct \lb{}\\
      \ \ MPI\U{}Comm	         comm;\\
      \ \ hypre\U{}Vector	*local\U{}vector; \\
      \ \ ... \\
      \rb{} hypre\U{}ParVector;
    \end{programSmall}
  \end{minipage}
  \caption{The data structures that are used in Fig.\
    \ref{fig:mpi:dotprod}.  Insignificant fields are omitted.}
  \label{fig:mpi:dotprod:structures}
\end{figure}

\begin{figure}
  \begin{programSmall}
\ccode{\#define} hypre\U{}ParVectorComm(vector)  	        ((vector) -> comm)\\
\ccode{\#define} hypre\U{}ParVectorLocalVector(vector)      ((vector) -> local\U{}vector)\\
\ccode{\#define} hypre\U{}VectorData(vector)      ((vector) -> data)\\
\ccode{\#define} hypre\U{}VectorSize(vector)      ((vector) -> size)\\
\ccode{\#define} hypre\U{}VectorNumVectors(vector) ((vector) -> num\U{}vectors)\\
\ccode{\#define} hypre\U{}conj(value) value\\
    \ccode{\#define} SIZE ((x->size)*(x->num\U{}vectors))\\
    \ccode{\#define} PAR\U{}SIZE
    ((x->local\U{}vector->size)*(x->local\U{}vector->num\U{}vectors)) \\
    \ccode{\#define} NPROCS   \BSL{}mpi\U{}comm\U{}size \\
\mycomment{/*@ requires \BSL{}valid(x) \&\& \BSL{}valid(y) \&\& 0 < SIZE}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \&\& \BSL{}valid(x->data + (0 .. SIZE-1)) \&\& \BSL{}valid(y->data + (0 .. SIZE-1));}\\
\mycomment{\ \ \ \  assigns  \BSL{}nothing;}\\
\mycomment{\ \ \ \  ensures  \BSL{}result == \BSL{}sum(0, SIZE-1, \BSL{}lambda int j;
  y->data[j] * x->data[j]); */}\\
HYPRE\U{}Real  hypre\U{}SeqVectorInnerProd(hypre\U{}Vector *x,
				      hypre\U{}Vector *y) \lb{}\\
\ \   HYPRE\U{}Complex *x\U{}data = hypre\U{}VectorData(x);\\
\ \   HYPRE\U{}Complex *y\U{}data = hypre\U{}VectorData(y);\\
\ \   HYPRE\U{}Int    i, size   = hypre\U{}VectorSize(x);  \\         
\ \   HYPRE\U{}Real     result = 0.0;\\
\ \   size *=hypre\U{}VectorNumVectors(x);\\
\ \ \mycomment{/*@ loop invariant 0 <= i <= size}\\
\ \ \mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \&\& result == \BSL{}sum(0, i-1, \BSL{}lambda int j; y\U{}data[j] * x\U{}data[j]);}\\
\ \  \mycomment{\ \ \ \ loop assigns result, i; */}\\
\ \  \ccode{for} (i = 0; i < size; i++) \\
\ \ \ \  result += hypre\U{}conj(y\U{}data[i]) * x\U{}data[i]; \\
\ \  \ccode{return} result;\\
\rb{}\\
    \\
\mycomment{/*@ \BSL{}mpi\U{}collective(hypre\U{}ParVectorComm(x), COL):}\\
\mycomment{\ \ \ \   requires \BSL{}valid(x) \&\& \BSL{}valid(y) \&\& 0 < PAR\U{}SIZE  \&\&}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \   \BSL{}valid(x->local\U{}vector) \&\& \BSL{}valid(y->local\U{}vector) \&\&}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \ \BSL{}valid(x->local\U{}vector->data + (0 .. PAR\U{}SIZE-1))  \&\&}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \  \BSL{}valid(y->local\U{}vector->data + (0 .. PAR\U{}SIZE-1));}\\
\mycomment{\ \ \ \    assigns \BSL{}nothing;}\\
\mycomment{\ \ \ \    ensures \BSL{}result == \BSL{}sum(0,  NPROCS-1, \BSL{}lambda int k;}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \BSL{}on(k,  \BSL{}sum(0,  PAR\U{}SIZE-1, \BSL{}lambda int t;}\\
\mycomment{\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \   x->local\U{}vector->data[t] *
  y->local\U{}vector->data[t]))); */}\\
HYPRE\U{}Real hypre\U{}ParVectorInnerProd(hypre\U{}ParVector *x,
				     hypre\U{}ParVector *y) \lb{} \\
\ \  MPI\U{}Comm      comm    = hypre\U{}ParVectorComm(x);\\
\ \   hypre\U{}Vector *x\U{}local = hypre\U{}ParVectorLocalVector(x);\\
\ \   hypre\U{}Vector *y\U{}local = hypre\U{}ParVectorLocalVector(y);\\
\ \   HYPRE\U{}Real result = 0.0;\\
\ \   HYPRE\U{}Real local\U{}result = hypre\U{}SeqVectorInnerProd(x\U{}local, y\U{}local);\\    
\ \   hypre\U{}MPI\U{}Allreduce(\&local\U{}result, \&result, 1, HYPRE\U{}MPI\U{}REAL,\\
\ \ \ \ \ \ \ \ 		      hypre\U{}MPI\U{}SUM, comm);   \\
\ \  \ccode{return} result;\\
\rb{}
\end{programSmall}
\caption{A function that computes a vector inner product in parallel
  using MPI.}
\label{fig:mpi:dotprod}
\end{figure}

\section{Experiment} \label{chp:eval:results}

We evaluate our MPI contract system implementation by using it to
specify and verify a set of C/MPI/CIVL-C collective-style functions.

This experiment consists of two parts.  In the first part, we skip the
checking of absence assertions in the contracts.  In other words, the
first part 1) assumes that no interference can happen in any
collective-style function; and 2) does not verify the absence
assertions in the guarantees of the verified functions.  In the second
part, we enable full validity verification for a selected subset of
the experiments.

For each collective-style function in the experiment, we prepared a
set of negative examples, each of which contains a specific kind of
contract violation, such as a state-guarantee violation, a
state-requirement too weak to ensure the state-guarantee, or some kind
of absence assertion violation.  Our system precisely reports the
expected error for each of these negative examples.  For brevity, we
do not show the results of all the negative examples.

%% why break up into 2 parts
Instead of directly verifying the full validity of every contracted
function, we restricted the first part to verifying state-requirements
and state-guarantees, for two reasons.  First, the absence assertions
are introduced for reasoning about interference.  Most of our examples
use only MPI collective functions and wildcard-free send/receive
operations, so they naturally fall into the category where no
interference can happen.  Second, full validity verification involves
checking path predicates (i.e., absence assertions).  CIVL was
originally designed for verifying properties that can be expressed by
state predicates only, so its sophisticated POR algorithm is not sound
for full validity verification.  In our contract system
implementation, we enforce soundness by instructing the CIVL model
checker to stop applying POR when a statement that may be visible to
the absence assertions is enabled.  This is a coarse approximation in
that the model checker eventually has to explore a large number of
commutative executions.  Therefore, in the second part of the
experiment, the examples cannot be scaled to the same number of
processes as in the first part.

%% setting of provers
CIVL uses automated theorem provers to reason about the logic formulas
that remain non-trivial after a series of simplifications applied by
CIVL.  Each prover is assigned a timeout so that it will not run
indefinitely on undecidable instances.  A call to a prover is
expensive: although a longer timeout increases the chance that a
prover eventually solves a formula, it also increases the time wasted
waiting on unsolvable formulas.  Therefore, in this experiment, we
carefully set the timeouts for the different provers.

\begin{enumerate}
\item Z3 and CVC4 are used in sequence for all kinds of reasoning,
  each with a 2-second timeout.
\item CVC4 is invoked only if Z3 is not able to solve a formula
  within 2 seconds.
\item Why3 is used only for checking assertions and is invoked only
  when neither Z3 nor CVC4 solves a formula within 2 seconds.  Why3 is
  assigned a default 5-second timeout.  In fact, Why3 itself is not an
  automated theorem prover but a platform that translates its input
  and sends it to various provers, including Z3 and CVC4.  We use Why3
  for two of its advantages over our direct translation to Z3 and
  CVC4: 1) in addition to the translation, Why3 introduces extra
  transformations and axioms for specific theories; and 2) Why3
  supplies libraries for rich theories such as permutations and
  multisets.
\end{enumerate}
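The dispatch policy above can be sketched as a simple fall-through
loop.  The code below is an idealized model with stub provers standing
in for Z3, CVC4, and Why3; it is not CIVL's actual prover interface.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { VALID, INVALID, UNKNOWN } verdict_t;
typedef verdict_t (*prover_t)(const char *formula, int timeout_s);

/* Try each prover in order with its own timeout; fall through to the
 * next on UNKNOWN (timeout or "don't know"), mirroring the
 * Z3 -> CVC4 -> Why3 cascade described above. */
verdict_t discharge(const char *formula, const prover_t *provers,
                    const int *timeouts, int n) {
    for (int i = 0; i < n; i++) {
        verdict_t v = provers[i](formula, timeouts[i]);
        if (v != UNKNOWN)
            return v;
    }
    return UNKNOWN;
}

/* Stub provers for illustration: the first always "times out". */
static verdict_t stub_timeout(const char *f, int t) {
    (void)f; (void)t; return UNKNOWN;
}
static verdict_t stub_valid(const char *f, int t) {
    (void)f; (void)t; return VALID;
}
```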
The experiment was run on an iMac with a 3.5 GHz Intel Core i7 and 32
GB of 1600 MHz DDR3 memory.

%% purely functional correctness run
Tables \ref{tab:experimental:result:part:I-1} and
\ref{tab:experimental:result:part:I-2} show the experimental results
of the first part.

Examples in Table \ref{tab:experimental:result:part:I-1} are the
implementations of MPI collective functions in CIVL's MPI library.
These implementations are all based on simple and intuitive
algorithms.  For example, the implementation of \texttt{bcast} has the
root process send messages to all the other processes in a fixed
order.  Although the implementations are naive, their contracts are
general and conform to the MPI standard.

%% what is concluded from this ? representation of non-trivial
%%data sturcture values matters ?
Reasoning about objects of arbitrary aggregate types can be expensive.
Observe the differences in the statistics of \texttt{bcast},
\texttt{gather}, and \texttt{gatherv}.  The naive implementation of
\texttt{gather} can be seen as the reverse of that of \texttt{bcast}:
the root process receives messages from all processes in a fixed
order.  Yet the two examples perform very differently: for runs with
the same number of processes, verifying \texttt{gather} generates
almost 9 times more prover calls than verifying \texttt{bcast}.  This
is because the value that eventually fills the receive buffer of the
\texttt{gather} function is a partitioned array in which each part was
sent by a different process.  Since both the content and the size of
each part are arbitrary, the general representation of such a value
involves a number of quantified formulas.  In \texttt{gatherv}, the
``gathered'' partitioned array is even more general in that the order
of the parts is also arbitrary.  This explains why verifying
\texttt{gatherv} takes twice the time of verifying \texttt{gather}.

%% concluded ? modular is effective
The contract system is effective in re-using verification results.
For example, the \texttt{allgather} example is implemented by
combining \texttt{gather} and \texttt{bcast}.  During the
verification, the behaviors of \texttt{gather} and \texttt{bcast} are
derived from their respective contracts.  In other words, by assuming
that \texttt{gather} and \texttt{bcast} satisfy their contracts, their
verification results are directly re-used.  According to the
statistics in the table, it is easier for the contract system to
verify \texttt{allgather} than \texttt{gather}.

%% conclude: contracts are re-usable.
A function contract can be re-used for different implementations.  In
Table \ref{tab:experimental:result:part:I-2}, \texttt{allreduce\U{}dr}
and \texttt{reduceScatter} are implementations of MPI collective
functions based on advanced algorithms.  Although their
implementations differ, \texttt{allreduce\U{}dr} and
\texttt{allreduce} share the same function contract.

%% work togather with loop invariants and sequential contracts,

The rest of the examples in Table
\ref{tab:experimental:result:part:I-2} are functions from different
MPI scientific computing applications.  In these examples, each
process is responsible for performing some computation locally.
Process-local functions can be specified with pure ACSL function
contracts, and verifying and re-using pure ACSL contracts is just a
special case of dealing with MPI contracts for our contract system.
Take \texttt{diff1dExchange} and \texttt{diff1dIter} as an example.
The \texttt{diff1dExchange} function performs the ``ghost cell
exchange''; \texttt{diff1dIter} is the loop body of the iterative
computational procedure of a parallel 1d-diffusion solver.  The
\texttt{diff1dIter} function is a combination of a call to
\texttt{diff1dExchange} and a call to a process-local computing
function, and the behaviors of both calls are summarized from their
contracts.  The verification result of the process-local function is
not shown in the table since it is irrelevant to our topic.  The same
holds for \texttt{diff2dExchange} and \texttt{diff2dIter}.

%% diff1d and diff2d again emphsizes the importance of the logic
%% reasoning
For the functions from the 1d-diffusion solver, the contracts make no
specific assumption about the pattern of distribution.  That is, the
left neighbor of a process can be any other process, and likewise for
the right neighbor.  However, for the functions from the 2d-diffusion
solver, we strengthened the contracts to assume a checkerboard
distribution.  This is a compromise made to cope with the complexity
of proving these functions for arbitrary parameters: this example
involves 2d arrays with arbitrary contents and sizes, inside which
partitions move around.  The general representation mentioned above
for arbitrarily partitioned arrays is powerful in expressiveness but
makes reasoning difficult.

%% matmat is typical example for collective-style division
The \texttt{matmat} example is a collective matrix multiplication
implementation.  It is a typical example showing that dividing MPI
programs into collective-style functions is effective.  Figure
\ref{fig:eval:matmat} shows the function definition, which can be
considered as being composed of three collective-style functions and a
process-local procedure.  Taking the process-local function
\texttt{matmat\U{}local}, which computes one row of the result, as a
special case of collective-style functions, the verification of this
function definition is thus divided into four sub-problems.

\begin{figure}
  \begin{center}
  \begin{programSmall}
\ccode{int} N, M, L, *A, *B, *C, size, rank; \\    
    \ccode{void} matmat() \lb{}\\
\ \  \ccode{int} tmpC[L], tmpA[M]; \\
\\      
\ \  MPI\U{}Bcast(B, M*L, MPI\U{}INT, 0, MPI\U{}COMM\U{}WORLD);\\
\ \  MPI\U{}Scatter(A, M, MPI\U{}INT, tmpA, M, MPI\U{}INT, 0,
MPI\U{}COMM\U{}WORLD); \\
\ \  \ccode{if} (rank != 0) \\
\ \ \ \    memcpy(A, tmpA, M * sizeof(int));\\
\ \   matmat\U{}local(); \\
\ \  memcpy(tmpC, C, L * sizeof(int)); \\
\ \  MPI\U{}Gather(tmpC, L, MPI\U{}INT, C, L, MPI\U{}INT, 0, MPI\U{}COMM\U{}WORLD); \\
\rb{}
\end{programSmall}
\end{center}
\caption{The definition of the \texttt{matmat} function that
  collectively performs matrix multiplication.}
\label{fig:eval:matmat}
\end{figure}


%% odd-even sort is a good example for abstraction
Function contracts can sometimes help the developer ignore the details
of communication and focus on the correctness of the algorithm.  For
instance, the \texttt{oddEvenSort} example comprises the loop body of
a parallel odd-even sorting algorithm.  The correctness of this
algorithm is not obvious, and parallelism makes it even harder for
developers to prove.  Figure \ref{fig:eval:oddEvenSort} shows the
function contract of this example.  Each process holds one element of
an unsorted sequence; the $i$-th element of the sequence can be
expressed as \texttt{\BSL{}on(i, myvalue)}.  In the contract, the
state-guarantee at lines 9-10 expresses the sortedness property; the
one at lines 11-12 expresses the permutation property.  The contract
only describes the input and output of the function while leaving
communication details invisible.  To prove the sortedness and
permutation properties of the complete algorithm, one can focus on
induction over the contract.
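The per-iteration effect that this contract summarizes can be
simulated sequentially.  The sketch below is ours, not the example's
source: it replaces the message exchange between neighboring ranks
with a direct compare-exchange on an array indexed by rank, which is
the net effect the ensures clauses describe.

```c
#include <assert.h>

/* Sequential simulation of one iteration of parallel odd-even sort:
 * in iteration i, each rank whose parity matches i compare-exchanges
 * its value with its right neighbor.  This is the net effect
 * summarized by the contract of oddEvenParIter, minus the messages. */
void odd_even_phase(int *a, int n, int i) {
    for (int id = i % 2; id + 1 < n; id += 2)
        if (a[id] > a[id + 1]) {
            int tmp = a[id];
            a[id] = a[id + 1];
            a[id + 1] = tmp;
        }
}
```

Running $n$ such iterations sorts the sequence while preserving its
multiset of elements, matching the sortedness and permutation
properties above.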

\begin{figure}
  \begin{center}
  \begin{programSmall}
\ccode{int} id, n; \mycomment{// id:\ process rank, n:\ number of processes} \\
T myvalue;     \mycomment{// element held by the process } \\
MPI\U{}Comm comm; \\
\\
\mycomment{/*@ \BSL{}mpi\U{}collective(comm, P2P):} \\
\mycomment{\ \ \ \  requires \BSL{}mpi\U{}agree(i);} \\
\mycomment{\ \ \ \  requires 0 <= i < n \&\& id == \BSL{}mpi\U{}comm\U{}rank \&\& n == \BSL{}mpi\U{}comm\U{}size;}\\
\mycomment{\ \ \ \  assigns  myvalue;}\\
\mycomment{\ \ \ \  ensures  (id \% 2 == 0 \&\& id < n-1 ==> myvalue <=
                                \BSL{}on(id+1,  myvalue))} \\
\mycomment{\ \ \ \  \ \ \ \ || (id \% 2 == 1 \&\& id < n-1 ==> myvalue <= \BSL{}on(id+1, myvalue));}\\
\mycomment{\ \ \ \  ensures  \BSL{}sum(0, n-1, \BSL{}lambda int t;
  \BSL{}on(t, myvalue) == myvalue ?\ 1 :\ 0) ==} \\
\mycomment{\ \ \ \ \ \ \ \ \ \ \BSL{}sum(0, n-1, \BSL{}lambda
  int t;  \BSL{}old(\BSL{}on(t, myvalue)) == myvalue ?\ 1 :\ 0);}\\
\mycomment{*/}\\
\ccode{void} oddEvenParIter(\ccode{int} i); \mycomment{// i is the
  loop iteration index}\\
\end{programSmall}
\end{center}
\caption{Function contract of the function in the \texttt{OddEvenSort}
  example.}
\label{fig:eval:oddEvenSort}
\end{figure}
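To make the compare-exchange pattern behind \texttt{oddEvenParIter} concrete, the following sequential C sketch (hypothetical, not the CIVL implementation) simulates one iteration on an array, where slot \texttt{a[id]} plays the role of \texttt{\BSL{}on(id, myvalue)}.  A real implementation would instead exchange \texttt{myvalue} with the neighboring rank via MPI sends and receives.

```c
#include <assert.h>

/* Hypothetical sequential sketch of one iteration of parallel
 * odd-even transposition sort.  Slot a[id] models the value held by
 * rank id; the compare-exchange between a[id] and a[id+1] models the
 * message exchange oddEvenParIter performs with its neighbor. */
static void odd_even_iter(int *a, int n, int i) {
    /* even iterations pair (0,1),(2,3),...; odd ones pair (1,2),(3,4),... */
    for (int id = (i % 2 == 0) ? 0 : 1; id + 1 < n; id += 2) {
        if (a[id] > a[id + 1]) {
            /* swapping preserves the multiset of values, which is the
             * permutation guarantee of the contract */
            int t = a[id];
            a[id] = a[id + 1];
            a[id + 1] = t;
        }
    }
}
```

Running $n$ such iterations sorts the array; this is what an induction over the per-iteration guarantees of the contract establishes, without ever reasoning about the underlying messages.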

%% run with assuming wildcard exists


%% table of experiments

\begin{table}
  \begin{center}
\begin{small}
  %% name   | mode 1 | np | states | prover calls | timeout | time |
  %% BLANK | mode 2 | np | states | prover calls | timeout | time | DESCRIPTION
  %% BLANK | mode 3 | np | states | prover calls | timeout | time |
  \begin{tabular}{| p{5cm} | r | r | r | r | r |}
    \hline 
    name \& description & \#procs & \#states & \#trans & \#prover  & time(s)\\
    \hline
        \multirow{5}{5cm}{\texttt{bcast:}
    root sends in a loop and the rest each perform one receive,
    implemented in CIVL's MPI library
    } & 2 & 1,121 & 3,500 & 7 & 6 \\
                        & 3 & 3,057 & 8,235 & 14 & 8 \\
                        & 4 & 5,805 & 15,632 & 23 & 12 \\
                        & 5 & 9,797 & 26,608 & 34 & 17 \\
                        & 6 & 15,545 & 42,832 & 47 & 25 \\
    \hline
    \multirow{5}{5cm}{\texttt{gather:}
    root receives in a loop and the rest each perform one send,
    implemented in CIVL's MPI library
    } & 2 & 2,608 & 7,246 & 47 & 9 \\
                        & 3 & 6,791 & 18,226 & 102 & 16 \\
                        & 4 & 14,788 & 38,659 & 187 & 28  \\
                        & 5 & 31,541 & 79,997 & 297 & 54  \\
                        & 6 & 71,850 & 175,925 & 428 & 123  \\
    \hline
    \multirow{5}{5cm}{\texttt{gatherv:}
    root receives in a loop and the rest each perform one send,
    implemented in CIVL's MPI library
    } & 2 & 1,266 & 3,692 & 41 & 8\\
                        & 3 & 3,053 & 8,640 & 85 & 13\\
                        & 4 & 5,852 & 16,398 & 140 & 27\\
                        & 5 & 10,121 & 28,029 & 215 & 78\\
                        & 6 & 16,612 & 45,512 & 310 & 236\\
    \hline
    \multirow{5}{5cm}{\texttt{scatter:}
    root sends in a loop and the rest each perform one receive,
    implemented in CIVL's MPI library
    } & 2 & 2,466 & 7,007 & 43 & 8 \\
                        & 3 & 6,332 & 17,438 & 101 & 14 \\
                        & 4 & 13,509 & 36,428 & 188 & 27 \\
                        & 5 & 28,067 & 73,801 & 275 & 46 \\
                        & 6 & 62,483 & 158,481 & 386 & 108 \\
    \hline
    \multirow{5}{5cm}{\texttt{alltoall:}
    each process performs \#procs sends and receives
    } & 2 & 786 & 2,038 & 17 & 6\\
                        & 3 & 1,382 & 3,424 & 45 & 7\\
                        & 4 & 2,153 & 5,111 & 67 & 10\\
                        & 5 & 3,114 & 7,135 & 89 & 14 \\
                        & 6 & 4,287 & 9,532 & 115 & 19\\
    \hline
    \multirow{5}{5cm}{\texttt{alltoall2:}
    each process collectively calls \texttt{scatter} \#procs times,
    implemented in CIVL's MPI library
    } & 2 & 933 & 2,986 & 19 & 7\\
                        & 3 & 1,769 & 5,905 & 32 & 10\\
                        & 4 & 3,011 & 10,351 & 66 & 15\\
                        & 5 & 4,779 & 16,972 & 84 & 31 \\
                        & 6 & 7,174 & 26,560 & 150 & 69\\
    \hline
    \multirow{5}{5cm}{\texttt{allgather:}
    each process collectively calls \texttt{gather} then \texttt{bcast},
    implemented in CIVL's MPI library
    } & 2 & 2,119 & 6,235 & 23 & 11 \\
                        & 3 & 3,690 & 10,699 & 34 & 15 \\
                        & 4 & 5,794 & 16,512 & 44 & 21 \\
                        & 5 & 8,944 & 24,820 & 55 & 31 \\
                        & 6 & 14,176 & 37,915 & 65 & 50 \\
    \hline
    \multirow{5}{5cm}{\texttt{allreduce:}
    each process collectively calls \texttt{reduce} then \texttt{bcast},
    implemented in CIVL's MPI library
    } & 2 & 841 & 2,958 & 3 & 6 \\
                        & 3 & 1,368 & 4,803 & 3 & 6 \\
                        & 4 & 1,903 & 6,763 & 3 & 7 \\
                        & 5 & 2,446 & 8,838 & 3 & 9 \\
                        & 6 & 2,997 & 11,028 & 3 & 10 \\
    \hline
  \end{tabular}
\end{small}
\end{center}
\caption{The experimental results of verifying CIVL's implementations
  of the MPI collective functions with respect to state-requirements and
  -guarantees.}
\label{tab:experimental:result:part:I-1}
\end{table}

\begin{table}
  \begin{center}
    \begin{small}
      \begin{tabular}{| p{5cm} | r | r | r | r | r |}
        \hline 
        name \& description & \#procs & \#states & \#trans & \#prover  & time(s)\\
        \hline
        \multirow{5}{5cm}{\texttt{allreduce\U{}dr:}
        an implementation of the recursive doubling algorithm given as
        pseudocode in \cite{Ruefenacht:2017:GRD:3163938.3164013}
        } & 2 & 648 & 1,904 & 4 & 7\\
                            & 3 & 1,060 & 3,006 & 4 & 7\\
                            & 4 & 1,562 & 4,349 & 4 & 8\\
                            & 5 & 2,017 & 5,615 & 4 & 10\\
                            & 6 & 2,486 & 6,942 & 4 & 11\\
        \hline
        \multirow{5}{5cm}{\texttt{reduceScatter:}
        an implementation of an algorithm optimized for
        non-commutative reduction operations \cite{Bernaschi:2002:EIR:1895489.1895529}
        } & 2 & 713 & 1,905 & 13 & 7\\
                            & 3 & 1,378 & 3,163 & 42 & 11\\
                            & 4 & 2,147 & 4,679 & 70 & 16\\
                            & 5 & 3,102 & 6,489 & 111 & 25\\
                            & 6 & 4,248 & 8,629 & 166 & 50\\
        \hline
        \multirow{5}{5cm}{\texttt{diff1dExchange:}
        the ghost cell exchange function in a 1d-diffusion implementation
        } & 2 & 1,771 & 5,083 & 19 & 7\\
                            & 3 & 6,979 & 19,164 & 40 & 12\\
                            & 4 & 25,546 & 68,328 & 69 & 25\\
                            & 5 & 93,086 & 244,148 & 102 & 84\\
                            & 6 & 345,202 & 891,940 & 135 & 382\\
        \hline 
        \multirow{4}{5cm}{\texttt{diff1dIter:}
        one time-step implementation of a 1d-diffusion solver
        } & 2 & 2,186 & 6,354 & 58 & 19 \\
                            & 3 & 8,646 & 23,677 & 161 & 85 \\
                            & 4 & 31,492 & 82,514 & 380 & 275 \\
                            & 5 & 114,108 & 287,467 & 829 & 731 \\
        \hline
        \multirow{2}{5cm}{\texttt{diff2dExchange:}
        $n \times{} n$ checkerboard ghost cell exchange 
        } & $2 \times 2$ & 2,010 & 4,787 & 52 & 43\\
                            & $3 \times 3$ & 5,154 & 12,817 & 135 & 115\\
        & & & & &\\
        \hline
        \multirow{3}{5cm}{\texttt{diff2dIter:}
        one time step for a 2d-diffusion solver with $n \times{} n$
        checkerboard distribution
        } & $2 \times 2$ & 1,727 & 5,126 & 66 & 95\\
                            & & & & & \\
                            & & & & & \\
                            & & & & & \\
        \hline
        \multirow{5}{5cm}{\texttt{dotProd:}
        parallel vector dot product function extracted from \cite{falgout:2002:hypre}
        } & 2 & 712 & 2,315 & 26 & 39\\
                            & 3 & 1,127 & 3,629 & 51 & 82\\
                            & 4 & 1,546 & 5,010 & 84 & 141 \\
                            & 5 & 1,995 & 6,458 & 125 & 220 \\
                            & 6 & 2,434 & 7,973 & 174 & 213 \\
        \hline
        \multirow{5}{5cm}{\texttt{matmat:}
        collective matrix multiplication; each process computes one
        row of the result
        } & 2 & 1,155 & 3,829 & 37 & 35\\
                            & 3 & 1,849 & 6,096 & 40 & 41\\
                            & 4 & 2,551 & 8,469 & 46 & 49\\
                            & 5 & 3,261 & 10,948 & 52 & 57\\
                            & 6 & 3,979 & 13,533 & 56 & 64\\
        \hline
        \multirow{5}{5cm}{\texttt{oddEvenSort:}
        one loop iteration of a parallel odd-even sorting implementation
        } & 2 & 1,115 & 3,505 & 4 & 6\\
                            & 3 & 2,141 & 6,699 & 6 & 8\\
                            & 4 & 3,947 & 12,458 & 8 & 12\\
                            & 5 & 6,320 & 20,217 & 10 & 13\\
                            & 6 & 10,713 & 34,757 & 12 & 19\\
        \hline
      \end{tabular}
    \end{small}
  \end{center}
  \caption{The experimental results of verifying MPI collective-style
    functions with respect to state-requirements and -guarantees.
    These experiments assume that no interference can occur.}
  \label{tab:experimental:result:part:I-2}
\end{table}

\begin{table}[t]
  \begin{center}
  \begin{tabular}{| l | r | r | r | r | r | r |}
    \hline 
    name & \#procs & \#states & \#seen & \#trans & \#prover  & time(s)\\
    \hline
    \texttt{gather\U{}w} & 2 & 5,953 & 5 & 19,597 & 48 & 12\\
         & 3 & 218,766 & 302 & 764,096 & 123 & 208 \\
    \hline
    \texttt{allgather} & 2 & 9,848 & 6 & 32,855 & 31 & 22 \\
         & 3 & 164,155 & 93 & 547,753 & 92 & 378 \\
    \hline
    \texttt{matmat} & 2 & 3,116 & 1 & 11,102 & 39 & 37 \\
         & 3 & 23,254 & 17 & 83,359 & 48 & 75 \\
         & 4 & 342,653 & 285 & 1,229,132 & 110 & 704\\
    \hline
    \texttt{oddEvenSort}
         & 2 & 1,824 & 0 & 6,327 & 4 & 6 \\
         & 3 & 8,706 & 8 & 30,164 & 6 & 10 \\
         & 4 & 96,961 & 115 & 338,889 & 10 & 66 \\
    \hline
    \texttt{diff1dIter} & 2 & 5,668 & 5 & 19,745 & 35 & 10 \\
         & 3 & 492,163 & 567 & 1,779,177 & 182 & 505\\
    \hline
    \texttt{diff1dExchange} & 2 & 3,485 & 3 & 11,300 & 21 & 8 \\
         & 3 & 114,173 & 150 & 388,751 & 66 & 75\\
    \hline
    \texttt{dotProd} & 2 & 1,711 & 1 & 6,058 & 6 & 7\\
         & 3 & 7,127 & 5 & 25,431 & 12 & 15\\
         & 4 & 36,450 & 23 & 130,936 & 18 & 43 \\
         & 5 & 225,223 & 119 & 811,950 & 24 & 389 \\
    \hline
  \end{tabular}
\end{center}
\caption{The experimental results of verifying full validity of MPI
  collective-style functions.}
  \label{tab:experimental:result:part:II}
\end{table}


%% verify guarantees
In the second part of the experiment, we enable the contract system to
verify the full validity of the examples listed in Table
\ref{tab:experimental:result:part:II}.  So that the experiment covers
all kinds of violations, we implemented a \texttt{gather\U{}w}
function using wildcard receives, which is intended to deliver the
same functionality as \texttt{gather}.  We then let the other examples
(i.e., \texttt{allgather} and \texttt{matmat}) use \texttt{gather\U{}w}
instead of \texttt{gather}.

In \texttt{gather\U{}w}, the root process receives messages from the
others in a non-deterministic order through wildcard receives.  Because
of these wildcard receives, \texttt{gather\U{}w} can be interfered with
by statements that follow it.  Therefore, the function contract of
\texttt{gather\U{}w} must include a path-requirement asserting that no
process sends a message with a specific tag to the root after exiting
\texttt{gather\U{}w} until the root process exits as well.  Otherwise,
the contract system cannot prove the validity of this function with
respect to its function contract.
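This interference scenario can be made concrete with a small sequential model (all names here are hypothetical, not CIVL or MPI API): a message is a (source, tag, value) triple, and a wildcard receive, like \texttt{MPI\U{}Recv} with \texttt{MPI\U{}ANY\U{}SOURCE}, matches the earliest pending message carrying the requested tag regardless of its source.

```c
#include <assert.h>

/* Hypothetical model of a wildcard receive: match the earliest
 * pending, not-yet-consumed message carrying the requested tag,
 * regardless of source -- like MPI_Recv with MPI_ANY_SOURCE. */
typedef struct { int source, tag, value; } msg;

static int wildcard_recv(const msg *pending, int n, int *consumed, int tag) {
    for (int i = 0; i < n; i++) {
        if (!consumed[i] && pending[i].tag == tag) {
            consumed[i] = 1;
            return i;            /* index of the matched message */
        }
    }
    return -1;                   /* no matching message pending */
}
```

If rank 1 exits \texttt{gather\U{}w} and immediately sends another message with the same tag, and that message arrives at the root before rank 2's contribution, the root's second wildcard receive inside \texttt{gather\U{}w} matches rank 1's later message instead of rank 2's, so the gathered data is wrong.  The path-requirement rules out exactly this arrival order.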

For the same reason, in the \texttt{allgather} and \texttt{matmat}
examples, the contracts of the collective functions must be strong
enough to ensure that the calls to \texttt{gather\U{}w} are never
interfered with.  For example, in \texttt{allgather}, a call to
\texttt{bcast} follows the call to \texttt{gather\U{}w}; hence the
contract of \texttt{bcast} must guarantee that it cannot interfere
with \texttt{gather\U{}w}.

For the rest of the examples in Table
\ref{tab:experimental:result:part:II}, interference is impossible
since only deterministic send and receive operations or MPI collective
functions are used.  So for these examples, in addition to what was
verified in the first part, their guarantees are verified in the
second part.

Most of the examples in the second part can barely scale to more
than 3 processes within 1,000 seconds.  Compared to the corresponding
results in the first part, the full-validity verification of an example
has to explore many more executions that are commutative to the ones
already explored in the first part.  In theory, commutative executions
eventually reach the same state.  This explains why the numbers of
prover calls of two corresponding runs in the first and second parts
are close even though the numbers of states they explore differ
significantly: state-guarantees are rarely re-checked along
commutative executions.  For example, the contract system explores
only 3,947 states for \texttt{oddEvenSort} with 4 processes in the
first part while it explores almost 25 times more states for
\texttt{oddEvenSort} with the same setting in the second part, yet the
difference in the number of prover calls is only 2.

We conclude that the current POR algorithm is sound for full-validity
verification, but this is insufficient to claim its effectiveness in
terms of performance.  Either a better POR algorithm or a new approach
for checking absence assertions is needed.

%% TODO: sound but too expensive, better POR

