Changes between Version 1 and Version 2 of Notes_on_CUDA_Semantics


Timestamp: 06/05/14 12:20:33
Author: andrevm
== OpenMP Primitives ==

Constructs
* `parallel`
* worksharing
  * `for`
  * `sections` and `section`
  * `single`
* synchronization
  * `barrier`
  * `critical`
  * `atomic`
  * `ordered`
  * `master`
* `threadprivate`
* `flush`

Clauses
* `private(list)`
* `firstprivate(list)`
* `lastprivate(list)`
* `copyin(list)`
* `shared(list)`
* `default(none|shared)`
* `num_threads(n)`
* `collapse(n)`
* `schedule(static, n)`
* `schedule(dynamic, n)`
* ...
* `ordered`
* `nowait`
* `reduction`

Functions
* `omp_get_num_threads()`
* `omp_get_thread_num()`

== Helper primitives ==

None.

== Modeling shared variables ==

For each shared variable `v`, introduce a second variable `v_state`.  The type of `v_state` is obtained from the type of `v` by replacing all primitive types (leaf nodes in the type tree) with `int`.  Initially all these ints are -1.  Both variables are declared in the same, shared, scope.

In addition to the shared variable `v`, each thread has its own local copy named `_v`, declared in thread-private scope.  It has the same type as `v`.

Protocols for reads, writes, and flushes:

A write to (some part of) the shared variable by thread tid:
* if the state value is -1, do the write to the local copy and set the state value to tid.  Now thread tid is the "owner" of that memory unit.
* if the state value is tid, do the write to the local copy.
* else report a memory model error: you are attempting to write to a variable when some other thread has un-flushed writes to the same variable.  The other thread should flush, then you should flush, before doing this write.

A read from (some part of) the shared variable by thread tid:
* if the state value is -1, read your local copy and compare it to the global copy.  If they differ, report a memory model error: some other thread has modified the variable and flushed, but you did not flush before performing the read.  (If you had flushed, your local copy and the shared copy would be equal.)
* if the state value is tid, read your local copy.
* else report a memory model error: you are reading from a variable when another thread has un-flushed writes to that variable.  The other thread should flush, and then you should flush, before doing this read.

A flush of (some part of) the shared variable by thread tid:
* if the state value is -1: copy the global value to your local copy of the variable.
* if the state value is tid: copy your local value to the global copy of the variable and set the state value to -1.
* else: report a memory model error, since you are doing a flush when some other thread has un-flushed writes to the variable.  The other thread should flush first.

We will implement the following function, which is implicit in many of the OpenMP constructs:
{{{
  barrier_and_flush();
}}}
It does a barrier on `_barrier` and a flush of all shared variables.  After this completes, all local copies will agree with each other and with the shared copy of the variable, and all state variables will be -1.
== Modeling worksharing state ==

The worksharing state will be stored in another handle-type object.  The situation here is analogous to the `$gcomm` and `$comm` objects used for MPI, which store the shared state for message-passing.  We need a similar pair of objects for the shared state that coordinates work-sharing and barrier constructs:
* `$omp_gws`: global work-sharing state
* `$omp_ws`: local state.  A reference to a global object and a thread ID.

The following object is used to specify the sequence of iterations to be assigned to one thread executing an omp for loop:
{{{
typedef struct {
  int numIters;
  int collapse;
  int iters[][];
} CIVL_omp_loop_info;
}}}

The dimensions are `iters[numIters][collapse]`.  The integer `iters[i][j]` is the value of the j-th loop variable in the i-th iteration performed by this thread.

The following object is used to specify the subset of sections assigned to one thread executing an omp sections construct:
{{{
typedef struct {
  int numSections;
  int sections[];
} CIVL_omp_sections_info;
}}}
The length of the array `sections` is `numSections`.  The integer `sections[i]` is the index of the i-th section that this thread will execute.
API:
{{{
/* Creates a new global work-sharing state object, returning a
 * handle to it.  nthreads is the number of threads in
 * the parallel region.  There is one of these per parallel region,
 * created upon entering the region. */
$omp_gws $omp_gws_create($scope scope, int nthreads);

void $omp_gws_destroy($omp_gws gws);

/* Creates a local work-sharing object, which is basically
 * a pair consisting of a global work-sharing handle and
 * a thread id. */
$omp_ws $omp_ws_create($scope scope, $omp_gws gws, int tid);

void $omp_ws_destroy($omp_ws ws);

/* For "for" loops only: called when a thread arrives, it
 * returns the sequence of loop iterations to be performed by
 * the thread.  Parameter location is the ID of the model location
 * of the top of the loop.  It is needed to check that all threads
 * encounter the same worksharing statements in the same order.
 * The implementation will need the value start, the initial value
 * of the loop variable; end, its final value; and inc, the increment
 * (which can be positive or negative).  These values can all be
 * obtained by getting the loop statement from the location and
 * evaluating the expressions occurring there. */
CIVL_omp_loop_info $omp_ws_arrive_loop($omp_ws ws, int location);

/* For sections: called at arrival, returns the sequence of sections to
 * be executed by the calling thread.  The sections are numbered in
 * order, starting from 0. */
CIVL_omp_sections_info $omp_ws_arrive_sections($omp_ws ws, int location);

/* For single: called on arrival, returns whether or not to execute
 * the single code. */
_Bool $omp_ws_arrive_single($omp_ws ws, int location);

/* Called when arriving at a barrier.  This does not
 * impose the barrier; you still need to call the system function
 * $barrier... for that.  This is needed to ensure all threads
 * in the team call the same sequence of worksharing and barrier
 * constructs. */
void $omp_ws_arrive_barrier($omp_ws ws, int location);
}}}

What these functions do: basically, the global data structure comprises a FIFO queue for each thread.  Each queue contains work-sharing records, one for each work-sharing or barrier construct encountered.  A record contains the basic information about the construct as provided by the arguments to the arrival function, as well as the distribution chosen for that thread.

The constructs are a lot like MPI collective operations, and are modeled similarly.

When a thread arrives at one of these constructs, it invokes the relevant arrival function.  At this point one can determine whether this thread is the first to arrive at that construct: if its queue is empty, it is the first; otherwise it is not, and the oldest entry in its queue will be the entry corresponding to this construct.

When a thread is the first to arrive at a construct, a distribution is chosen for every thread, and a record is created and enqueued in each thread's queue (including the caller's).  The distributions can be chosen nondeterministically, possibly with some restrictions to achieve some tractability/soundness compromise.  The record for this thread is then dequeued and the corresponding info object returned.

If a thread is not the first to arrive, its record is dequeued and compared with the arguments given in the function call.  They should match; if they don't, an error is reported.  This indicates that either threads encountered constructs in different orders or the loop parameters changed.
== Translations of specific directives ==

=== Translating `parallel` ===

`parallel`: this spawns some nondeterministic number of threads.  We will assume there is a constant `THREAD_MAX` defined somewhere.  The number of threads created will be between 1 and `THREAD_MAX` (inclusive).  Each thread is assigned an ID.  The original ("master") thread has ID 0.  All threads execute the parallel region.

{{{
  #pragma omp parallel ...
  S
}}}

=>

{{{
  {
    int _nthreads = 1+$choose_int(THREAD_MAX);
    $proc _threads[_nthreads];
    $omp_gws _gws = $omp_gws_create($here, _nthreads);
    $gbarrier _gbarrier = $gbarrier_create($here, _nthreads);
    // declare shared variables and corresponding state variables
    // initialize all state components to -1
    void _thread(int _tid) {
      $omp_ws _ws = $omp_ws_create($here, _gws, _tid);
      $barrier _barrier = $barrier_create($here, _gbarrier, _tid);
      // declare local copies of shared variables
      // declare private variables

      translate(S) but replace each private variable `x` with `_x`, and
        translate accesses to shared variables using the protocols above;
      flush any writes to shared variables;
    }
    for (int i=0; i<_nthreads; i++) _threads[i]=$spawn _thread(i);
    for (int i=0; i<_nthreads; i++) $wait(_threads[i]);
  }
}}}

All variables that occur in the parallel construct, i.e., in the lexical extent of the parallel construct, must be determined to be either private or shared.  This is determined by the clauses and the default rules specified in the OpenMP Standard.  Obviously any variable declared within the construct itself must be private.

For each private variable `x` not declared within the parallel construct, create a new variable of the same type, `_x`.  The new variable is declared within the thread scope.  If `x` is also firstprivate, then `_x` is initialized with the value of `x`, e.g. `int _x=x;`.  Otherwise, `_x` is uninitialized, so has an undefined value.

=== Translating `for` ===

Try to determine whether the loop iterations are independent.  If they are, they can all be executed by one thread.
Otherwise:

{{{
// location 23:
#pragma omp parallel for
for (i=0; i<n; i++)
  S
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);

  int numIters = info.numIters;
  for (int j=0; j<numIters; j++) {
    int i = info.iters[j][0];

    translate(S);
  }
  barrier_and_flush();
}
}}}

We can vary the way iterations are assigned to explore different tradeoffs and strategies.  At one extreme, every possible partition can be explored; at the other, some fixed strategy like round-robin with chunk size 1 can be used.  This only changes the definition of `$omp_ws_arrive_loop`, not the translation above.

{{{
// location 78:
#pragma omp parallel for collapse(3)
for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    for (k=0; k<l; k++) {
      S
    }
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 78);

  int numIters = info.numIters;
  for (int count=0; count<numIters; count++) {
    int i = info.iters[count][0];
    int j = info.iters[count][1];
    int k = info.iters[count][2];

    translate(S);
  }
  barrier_and_flush();
}
}}}

=== Translating `reduction` clause ===

{{{
#pragma omp for reduction(+:x,y)
for (i=a; i<b; i++) {
  S
}
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);
  double _x=0.0, _y=0.0;

  int numIters = info.numIters;
  for (int _count=0; _count<numIters; _count++) {
    int i = info.iters[_count][0];

    translate(S) but replace x with _x and y with _y;
  }
  x += _x;
  y += _y;
  // note: do something with POR so it knows the operations above from
  // different threads commute
  barrier_and_flush();
}
}}}

=== Translating `sections` ===

If there are n sections, create n functions: section1, section2, ....  Again the question is how to distribute them among threads and in what order.
As with loops, you really want to check that the sections are independent and only do the interleaving exploration as a last resort.

{{{
// location 42:
#pragma omp sections
#pragma omp section
  S0
#pragma omp section
  S1
...
}}}

=>

{{{
{
  CIVL_omp_sections_info info = $omp_ws_arrive_sections(_ws, 42);

  for (int _j=0; _j<info.numSections; _j++) {
    int _i = info.sections[_j];

    switch (_i) {
    case 0: {
      translate(S0);
      break;
    }
    case 1: {
      translate(S1);
      break;
    }
    ...
    } /* end of switch */
  } /* end of for loop */
  barrier_and_flush();
}
}}}

=== Translating `single` ===

{{{
// location 33:
#pragma omp single
S
}}}

=>

{{{
if ($omp_ws_arrive_single(_ws, 33)) {
  translate(S);
}
barrier_and_flush();
}}}

=== Translating `barrier` ===

{{{
// location 58:
#pragma omp barrier
}}}

=>

{{{
$omp_ws_arrive_barrier(_ws, 58);
barrier_and_flush();
}}}

=== Translating `critical` ===

Basically, use a lock for each critical name, plus one for the "no name" case.  A thread must obtain the lock to enter the critical section, then release it on exit.
I.e., if there are critical sections named a, b, and c, there should be global root-scope variables of boolean type named `_critical_noname`, `_critical_a`, etc.

{{{
#pragma omp critical (a)
S
}}}

=>

{{{
...
_Bool _critical_a = $false;
.
.
.
$when (!_critical_a) _critical_a=$true;
translate(S);
_critical_a=$false;
}}}

=== Translating `atomic` ===

In general, reads and writes to shared variables will be processed using the protocols described above (dealing with state variables, ownership, etc.).  However, if the operation occurs within an omp atomic construct, it is translated differently.

If sequentially consistent atomic...

If non-sequentially consistent atomic...

=== Translating `ordered` ===

This can only be used inside an OMP `for` loop whose pragma used the `ordered` clause.  (Check that.)  It indicates that the specified region must be executed in iteration order.

In this case `$omp_ws_arrive_loop` must return the iterations in loop order.

{{{
#pragma omp for ordered
for (i=a; i<b; i++) {
  ...
  #pragma omp ordered
  S1
  ...
  #pragma omp ordered
  S2
  ...
}
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);
  int order1=a, order2=a;
  int numIters = info.numIters;

  for (int _i=0; _i<numIters; _i++) {
    int i = info.iters[_i][0];
    ...
    $when (order1==i) {
      translate(S1);
      order1++;
    }
    ...
    $when (order2==i) {
      translate(S2);
      order2++;
    }
    ...
  }
}
}}}

=== Translating `master` ===

{{{
#pragma omp master
S
}}}

=>

{{{
if (_tid == 0) {
  translate(S);
}
}}}

    450 == Translating functions ==
    451 
    452 * `omp_get_num_threads()` => `_nthreads`
    453 * `omp_get_thread_num()` => `_tid`