Changes between Initial Version and Version 1 of Notes_on_CUDA_Semantics


Timestamp: 06/05/14 12:16:34
Author: andrevm
Comment: Initial creation. Copy of OpenMP translation page.
== OpenMP Primitives ==

Constructs
* `parallel`
* worksharing
  * `for`
  * `sections` and `section`
  * `single`
* synchronization
  * `barrier`
  * `critical`
  * `atomic`
  * `ordered`
  * `master`
* `threadprivate`
* `flush`

Clauses
* `private(list)`
* `firstprivate(list)`
* `lastprivate(list)`
* `copyin(list)`
* `shared(list)`
* `default(none|shared)`
* `num_threads(n)`
* `collapse(n)`
* `schedule(static, n)`
* `schedule(dynamic, n)`
* ...
* `ordered`
* `nowait`
* `reduction`

Functions
* `omp_get_num_threads()`
* `omp_get_thread_num()`

== Helper primitives ==

None.

== Modeling shared variables ==

For each shared variable `v`, introduce a second variable `v_state`.  The type of `v_state` is obtained from the type of `v` by replacing all primitive types (leaf nodes in the type tree) with `int`.  Initially all these ints are -1.  Both variables are declared in the same, shared, scope.

In addition to the shared variable `v`, each thread has its own local copy named `_v`, declared in thread-private scope.  It has the same type as `v`.

Protocols for reads, writes, and flushes:

A write to (some part of) the shared variable by thread tid:
* if the state value is -1, do the write to the local copy and set the state value to tid.  Now thread tid is the "owner" of that memory unit.
* if the state value is tid, do the write to the local copy.
* else report a memory model error: you are attempting to write to a variable when some other thread has un-flushed writes to the same variable.  The other thread should flush, then you should flush, before doing this write.

A read from (some part of) the shared variable by thread tid:
* if the state value is -1, read your local copy and compare it to the global copy.  If they differ, report a memory model error: some other thread has modified the variable and flushed, but you did not flush before performing the read.  (If you had flushed, your local copy and the shared copy would be equal.)
* if the state value is tid, read your local copy.
* else report a memory model error: you are reading from a variable when another thread has un-flushed writes to that variable.  The other thread should flush, and then you should flush, before doing this read.

A `flush` of (some part of) the shared variable by thread tid:
* if the state value is -1: copy the global value to your local copy of the variable.
* if the state value is tid: copy your local value to the global copy of the variable and set the state value to -1.
* else: report a memory model error, since you are doing a flush when some other thread has un-flushed writes to the variable.  The other thread should flush first.
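
The three protocols above can be sketched concretely for a single scalar shared variable.  This is only an illustrative model: the function names, the fixed thread count, and returning a boolean instead of reporting a memory model error are all assumptions here, and the real translation operates per memory unit of an arbitrary type.

```c
#include <stdbool.h>

#define NTHREADS 2

int v;                   /* shared copy */
int v_state = -1;        /* -1: no thread has un-flushed writes */
int _v[NTHREADS];        /* per-thread local copies */

/* write protocol; returns false on a memory model error */
bool shared_write(int tid, int value) {
  if (v_state == -1) { _v[tid] = value; v_state = tid; return true; }
  if (v_state == tid) { _v[tid] = value; return true; }
  return false;          /* another thread has un-flushed writes */
}

/* read protocol */
bool shared_read(int tid, int *out) {
  if (v_state == -1) {
    if (_v[tid] != v) return false;   /* missed a flush before reading */
    *out = _v[tid];
    return true;
  }
  if (v_state == tid) { *out = _v[tid]; return true; }
  return false;          /* another thread has un-flushed writes */
}

/* flush protocol */
bool shared_flush(int tid) {
  if (v_state == -1) { _v[tid] = v; return true; }
  if (v_state == tid) { v = _v[tid]; v_state = -1; return true; }
  return false;          /* another thread has un-flushed writes */
}
```

For example, if thread 0 writes and flushes, thread 1 must also flush before its read succeeds, exactly as the read protocol requires.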

We will implement the following function, which is implicit in many of the OpenMP constructs:
{{{
  barrier_and_flush();
}}}
It does a barrier on `_barrier` and a flush on all shared variables.  After this completes, all local copies will agree with each other and with the shared copy of the variable, and all state variables will be -1.

== Modeling worksharing state ==

The worksharing state will be stored in another handle-type object.  The situation here is analogous to the `$gcomm` and `$comm` objects used for MPI: those objects store the shared state for message passing.  We need a similar object for the shared state that coordinates the work-sharing and barrier constructs:
* `$omp_gws`: global work-sharing state
* `$omp_ws`: local state.  A reference to a global object and a thread ID.

The following object is used to specify the sequence of iterations to be assigned to one thread executing an omp for loop:
{{{
typedef struct {
  int numIters;
  int collapse;
  int iters[][];  /* dimensions: iters[numIters][collapse] */
} CIVL_omp_loop_info;
}}}

The dimensions are `iters[numIters][collapse]`.  The integer `iters[i][j]` is the value of the j-th loop variable in the i-th iteration performed by this thread.

The following object is used to specify the subset of sections assigned to one thread executing an omp sections construct:
{{{
typedef struct {
  int numSections;
  int sections[];
} CIVL_omp_sections_info;
}}}
The length of the array `sections` is `numSections`.  The integer `sections[i]` is the index of the i-th section that this thread will execute.

API:
{{{
/* Creates a new global work-sharing state object, returning a
 * handle to it.  nthreads is the number of threads in
 * the parallel region.  There is one of these per parallel region,
 * created upon entering the region. */
$omp_gws $omp_gws_create($scope scope, int nthreads);

void $omp_gws_destroy($omp_gws gws);

/* Creates a local work-sharing object, which is basically
 * a pair consisting of a global work-sharing handle and
 * a thread ID. */
$omp_ws $omp_ws_create($scope scope, $omp_gws gws, int tid);

void $omp_ws_destroy($omp_ws ws);

/* For "for" loops only: called when a thread arrives, it
 * returns the sequence of loop iterations to be performed by
 * the thread.  Parameter location is the ID of the model location
 * of the top of the loop.  It is needed to check that all threads
 * encounter the same worksharing statements in the same order.
 * The implementation will need three values: start, the initial
 * value of the loop variable; end, its final value; and inc, the
 * increment (which can be positive or negative).  These values can
 * all be obtained by getting the loop statement from the location
 * and evaluating the expressions occurring there. */
CIVL_omp_loop_info $omp_ws_arrive_loop($omp_ws ws, int location);

/* For sections: called at arrival, returns the sequence of sections to
 * be executed by the calling thread.  The sections are numbered in
 * order, starting from 0. */
CIVL_omp_sections_info $omp_ws_arrive_sections($omp_ws ws, int location);

/* For single: called on arrival, returns whether or not to execute
 * the single code. */
_Bool $omp_ws_arrive_single($omp_ws ws, int location);

/* Called when arriving at a barrier.  This does not
 * impose the barrier; you still need to call the system function
 * $barrier... for that.  This is needed to ensure all threads
 * in the team call the same sequence of worksharing and barrier
 * constructs. */
void $omp_ws_arrive_barrier($omp_ws ws, int location);
}}}

What these functions do: basically, the global data structure comprises a FIFO queue for each thread.  The queue contains work-sharing records, one record for each work-sharing or barrier construct encountered.  The record contains the basic information about the construct as provided by the arguments to the arrival function, as well as the distribution chosen for that thread.

The constructs are a lot like MPI collective operations, and are modeled similarly.

When a thread arrives at one of these constructs, it invokes the relevant arrival function.  At this point we can determine whether this thread is the first to arrive at that construct: if its queue is empty, it is the first; otherwise it is not, and the oldest entry in its queue will be the entry corresponding to this construct.

When a thread is the first to arrive at a construct, a distribution is chosen for every thread and a record is created and enqueued in each thread's queue (including the caller's).  The distributions can be chosen nondeterministically, possibly with some restrictions to achieve some tractability/soundness compromise.  The record for this thread is then dequeued and the iteration information returned.

If a thread is not the first to arrive, its record is dequeued and compared with the arguments given in the function call.  They should match; if they don't, an error is reported.  This indicates that either threads encountered constructs in different orders or the loop parameters changed.

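The arrival discipline just described can be sketched as follows.  This is a simplified model: records here carry only the location (a real record would also carry the loop parameters and the chosen distribution), the queues are fixed-size ring buffers, and all names are assumptions.

```c
#include <stdbool.h>

#define NTHREADS 2
#define QCAP 16

typedef struct { int location; } ws_record;

/* one FIFO queue of work-sharing records per thread */
static ws_record queue[NTHREADS][QCAP];
static int head[NTHREADS], tail[NTHREADS];

/* Called when thread tid reaches a work-sharing or barrier construct
 * at the given model location.  Returns false if the dequeued record
 * does not match, i.e., threads encountered constructs in different
 * orders. */
bool ws_arrive(int tid, int location) {
  if (head[tid] == tail[tid]) {
    /* first to arrive at this construct: enqueue a record for
     * every thread, including the caller */
    for (int t = 0; t < NTHREADS; t++) {
      queue[t][tail[t] % QCAP].location = location;
      tail[t]++;
    }
  }
  /* dequeue this thread's oldest record and compare */
  ws_record r = queue[tid][head[tid] % QCAP];
  head[tid]++;
  return r.location == location;
}
```

For instance, if thread 0 arrives at locations 23 then 58 while thread 1 arrives at 23 then 99, the mismatch is detected at thread 1's second arrival.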

== Translations of specific directives ==

=== Translating `parallel` ===

`parallel`: this spawns some nondeterministic number of threads.  We will assume there is a constant `THREAD_MAX` defined somewhere.  The number of threads created will be between 1 and `THREAD_MAX` (inclusive).  Each thread is assigned an ID.  The original ("master") thread has ID 0.  All threads execute the parallel region.

{{{
  #pragma omp parallel ...
  S
}}}

=>

{{{
  {
    int _nthreads = 1+$choose_int(THREAD_MAX);
    $proc _threads[_nthreads];
    $omp_gws _gws = $omp_gws_create($here, _nthreads);
    $gbarrier _gbarrier = $gbarrier_create($here, _nthreads);
    // declare shared variables and corresponding state variables
    // initialize all state components to -1
    void _thread(int _tid) {
      $omp_ws _ws = $omp_ws_create($here, _gws, _tid);
      $barrier _barrier = $barrier_create($here, _gbarrier, _tid);
      // declare local copies of shared variables
      // declare private variables

      translate(S) but replace each private variable `x` with `_x`, and
        translate access to shared variables using protocols above;
      flush any writes to shared variables;
    }
    for (int i=0; i<_nthreads; i++) _threads[i]=$spawn _thread(i);
    for (int i=0; i<_nthreads; i++) $wait(_threads[i]);
  }
}}}

All variables that occur in the parallel construct, i.e., the lexical extent of the parallel construct, must be determined to be either private or shared.  This is determined by the clauses and the default rules as specified in the OpenMP Standard.  Obviously, any variable declared within the construct itself must be private.

For each private variable `x` not declared within the parallel construct, create a new variable of the same type, `_x`.  The new variable is declared within the thread scope.  If `x` is also firstprivate, then `_x` is initialized with the value of `x`, e.g. `int _x=x;`.  Otherwise, `_x` is uninitialized, so has an undefined value.

=== Translating `for` ===

First, try to determine whether the loop iterations are independent.  In that case, they can all be executed by one thread.
Otherwise:

{{{
// location 23:
#pragma omp parallel for
for (i=0; i<n; i++)
  S
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);

  int numIters = info.numIters;
  for (int j=0; j<numIters; j++) {
    int i = info.iters[j][0];

    translate(S);
  }
  barrier_and_flush();
}
}}}

We can vary the way iterations are chosen to explore different tradeoffs and strategies.  On one extreme, every possible partition can be explored; on the other, some fixed strategy like round-robin with chunk size 1 can be used.  This only changes the definition of `$omp_ws_arrive_loop`, not the translation above.
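
For instance, under the fixed round-robin strategy with chunk size 1, iteration index i goes to thread i mod nthreads.  A sketch of how `$omp_ws_arrive_loop` might compute one thread's iteration list for a non-collapsed loop; the struct below is a fixed-size stand-in for `CIVL_omp_loop_info`, and the other names are assumptions.

```c
#define MAX_ITERS 64

/* fixed-size, collapse(1) stand-in for CIVL_omp_loop_info */
typedef struct {
  int numIters;
  int iters[MAX_ITERS];  /* iters[i]: loop-variable value in the
                            i-th iteration done by this thread */
} loop_info;

/* round-robin, chunk size 1: thread tid of nthreads receives the
 * iterations whose index is congruent to tid mod nthreads */
loop_info rr_distribute(int start, int end, int inc, int tid, int nthreads) {
  loop_info info = { .numIters = 0 };
  int index = 0;
  for (int i = start; inc > 0 ? i < end : i > end; i += inc, index++)
    if (index % nthreads == tid)
      info.iters[info.numIters++] = i;
  return info;
}
```

With `start=0`, `end=5`, `inc=1`, and two threads, thread 1 receives iterations 1 and 3.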

{{{
// location 78:
#pragma omp parallel for collapse(3)
for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    for (k=0; k<l; k++) {
      S
    }
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 78);

  int numIters = info.numIters;
  for (int count=0; count<numIters; count++) {
    int i = info.iters[count][0];
    int j = info.iters[count][1];
    int k = info.iters[count][2];

    translate(S);
  }
  barrier_and_flush();
}
}}}

=== Translating `reduction` clause ===

{{{
// location 23:
#pragma omp for reduction(+:x,y)
for (i=a; i<b; i++) {
  S
}
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);
  double _x=0.0, _y=0.0;

  int numIters = info.numIters;
  for (int _count=0; _count<numIters; _count++) {
    int i = info.iters[_count][0];

    translate(S) but replace x with _x and y with _y;
  }
  x += _x;
  y += _y;
  // note: do something with POR so it knows the operations above from
  // different threads commute
  barrier_and_flush();
}
}}}
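
The soundness of this translation rests on the reduction operator being associative and commutative, so the per-thread partial results can be combined in any order.  A sequentialized sketch of the scheme for `+` on ints; the function name and the round-robin split are assumptions.

```c
/* Each simulated thread accumulates into a private _x initialized to
 * the identity of + (zero); then the combine step x += _x is applied,
 * mirroring the translation above. */
int reduce_sum(const int a[], int n, int nthreads) {
  int x = 0;                         /* shared reduction variable */
  for (int tid = 0; tid < nthreads; tid++) {
    int _x = 0;                      /* private copy, identity of + */
    for (int i = tid; i < n; i += nthreads)
      _x += a[i];                    /* this thread's iterations */
    x += _x;                         /* combine step */
  }
  return x;
}
```

Any partition of the iterations gives the same result as the sequential sum, which is what the POR note above relies on.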

=== Translating `sections` ===

If there are n sections, create n functions: section1, section2, ....  Again, the question is how to distribute them among threads and in what order.
As with loops, you really want to check that the sections are independent and only do the interleaving exploration as a last resort.

{{{
// location 42:
#pragma omp sections
#pragma omp section
  S0
#pragma omp section
  S1
...
}}}

=>

{{{
{
  CIVL_omp_sections_info info = $omp_ws_arrive_sections(_ws, 42);

  for (int _j=0; _j<info.numSections; _j++) {
    switch (info.sections[_j]) {
    case 0: {
      translate(S0);
      break;
    }
    case 1: {
      translate(S1);
      break;
    }
    ...
    } /* end of switch */
  } /* end of for loop */
  barrier_and_flush();
}
}}}

=== Translating `single` ===

{{{
// location 33:
#pragma omp single
S
}}}

=>

{{{
if ($omp_ws_arrive_single(_ws, 33)) {
  translate(S);
}
barrier_and_flush();
}}}

=== Translating `barrier` ===

{{{
// location 58:
#pragma omp barrier
}}}

=>

{{{
$omp_ws_arrive_barrier(_ws, 58);
barrier_and_flush();
}}}

=== Translating `critical` ===

Basically, use a lock for each critical name, plus one for the "no name" case.  A thread must obtain the lock to enter the critical section, then release it on exit.
I.e., if there are critical sections named a, b, and c, there should be global root-scope variables of boolean type named `_critical_noname`, `_critical_a`, etc.

{{{
#pragma omp critical (a)
S
}}}

=>

{{{
...
_Bool _critical_a = $false;
.
.
.
$when (!_critical_a) _critical_a=$true;
translate(S);
_critical_a=$false;
}}}
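
The `$when` guard acts as an atomic test-and-set on the flag.  A plain-C model of the acquire/release discipline; this is a non-blocking sketch in which `critical_enter` returns false where the CIVL thread would block, and the function names are assumptions.

```c
#include <stdbool.h>

/* one flag per critical name, plus one for the unnamed critical */
static bool _critical_noname = false;
static bool _critical_a = false;

/* models "$when (!flag) flag = $true;" as a try-acquire */
bool critical_enter(bool *flag) {
  if (*flag)
    return false;        /* in CIVL the thread would block here */
  *flag = true;
  return true;
}

/* models "flag = $false;" on leaving the critical section */
void critical_exit(bool *flag) {
  *flag = false;
}
```

Note that distinct names use distinct flags, so critical sections with different names do not exclude each other.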

=== Translating `atomic` ===

In general, reads and writes to shared variables will be processed using the protocols described above (dealing with state variables, ownership, etc.).  However, if the operation occurs within an omp atomic construct, it is translated differently.

If sequentially consistent atomic...

If non-sequentially consistent atomic...


=== Translating `ordered` ===

This can only be used inside an OpenMP `for` loop whose pragma has the `ordered` clause.  (Check that.)  It indicates that the specified region must be executed in iteration order.

In this case the system function must return iteration information in which the iterations occur in loop order.

{{{
// location 23:
#pragma omp for ordered
for (i=a; i<b; i++) {
  ...
  #pragma omp ordered
  S1
  ...
  #pragma omp ordered
  S2
  ...
}
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);
  int order1=a, order2=a;
  int numIters = info.numIters;

  for (int _i=0; _i<numIters; _i++) {
    int i = info.iters[_i][0];
    ...
    $when (order1==i) {
      translate(S1);
      order1++;
    }
    ...
    $when (order2==i) {
      translate(S2);
      order2++;
    }
    ...
  }
}
}}}
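
The `order1`/`order2` counters act as tickets: an iteration may enter an ordered region only when its loop-variable value equals the counter, and leaving the region admits the next iteration.  A minimal model of one such counter, shown for a loop starting at 0 with increment 1; this is a non-blocking sketch, and the names are assumptions.

```c
#include <stdbool.h>

/* next loop-variable value admitted to the ordered region;
 * initialized to the loop's start value (0 here) */
static int order1 = 0;

/* models "$when (order1==i)": may iteration i enter now? */
bool ordered_may_enter(int i) {
  return order1 == i;
}

/* models "order1++" after translate(S1): admit the next iteration */
void ordered_exit(void) {
  order1++;
}
```

A thread holding a later iteration simply waits (in CIVL, blocks on the `$when`) until every earlier iteration has passed through the region.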

=== Translating `master` ===

{{{
#pragma omp master
S
}}}

=>

{{{
if (_tid == 0) {
  translate(S);
}
}}}

== Translating functions ==

* `omp_get_num_threads()` => `_nthreads`
* `omp_get_thread_num()` => `_tid`