Changes between Version 1 and Version 2 of Notes_on_CUDA_Semantics


Timestamp: 06/05/14 12:20:33
Author: andrevm
== OpenMP Primitives ==

Constructs
* `parallel`
* worksharing
  * `for`
  * `sections` and `section`
  * `single`
* synchronization
  * `barrier`
  * `critical`
  * `atomic`
  * `ordered`
  * `master`
* `threadprivate`
* `flush`

Clauses
* `private(list)`
* `firstprivate(list)`
* `lastprivate(list)`
* `copyin(list)`
* `shared(list)`
* `default(none|shared)`
* `num_threads(n)`
* `collapse(n)`
* `schedule(static, n)`
* `schedule(dynamic, n)`
* ...
* `ordered`
* `nowait`
* `reduction`

Functions
* `omp_get_num_threads()`
* `omp_get_thread_num()`

== Helper primitives ==

None.

== Modeling shared variables ==

For each shared variable `v`, introduce a second variable `v_state`.  The type of `v_state` is obtained from the type of `v` by replacing all primitive types (leaf nodes in the type tree) with `int`.  Initially all these ints are -1.  Both variables are declared in the same, shared, scope.

In addition to the shared variable `v`, each thread has its own local copy named `_v`, declared in thread-private scope.  It has the same type as `v`.

Protocols for reads, writes, and flushes:

A write to (some part of) the shared variable by thread tid:
* if the state value is -1, do the write to the local copy and set the state value to tid.  Now thread tid is the "owner" of that memory unit.
* if the state value is tid, do the write to the local copy.
* else report a memory model error: you are attempting to write to a variable when some other thread has un-flushed writes to the same variable.  The other thread should flush, then you should flush, before doing this write.

A read from (some part of) the shared variable by thread tid:
* if the state value is -1, read your local copy and compare it to the global copy.  If they differ, report a memory model error: some other thread has modified the variable and flushed, but you did not flush before performing the read.  (If you had flushed, your local copy and the shared copy would be equal.)
* if the state value is tid, read your local copy.
* else report a memory model error: you are reading from a variable when another thread has un-flushed writes to that variable.  The other thread should flush, and then you should flush, before doing this read.

A flush of (some part of) the shared variable by thread tid:
* if the state value is -1: copy the global value to your local copy of the variable.
* if the state value is tid: copy your local value to the global copy of the variable and set the state value to -1.
* else: report a memory model error, since you are doing a flush when some other thread has un-flushed writes to the variable.  The other thread should flush first.

We will implement the following function, which is implicit in many of the OpenMP constructs:
{{{
  barrier_and_flush();
}}}
It does a barrier on `_barrier` and a flush of all shared variables.  After this completes, all local copies will agree with each other and with the shared copy of the variable, and all state variables will be -1.
== Modeling worksharing state ==

The worksharing state will be stored in another handle-type object.  The situation here is analogous to the `$gcomm` and `$comm` objects used for MPI, which store the shared state for message-passing.  We need a similar pair of objects for the shared state that coordinates work-sharing and barrier constructs:
* `$omp_gws`: global work-sharing state
* `$omp_ws`: local state.  A reference to a global object and a thread ID.

The following object is used to specify the sequence of iterations to be assigned to one thread executing an omp for loop:
{{{
typedef struct {
  int numIters;
  int collapse;
  int iters[][];
} CIVL_omp_loop_info;
}}}

The dimensions are `iters[numIters][collapse]`.  The integer `iters[i][j]` is the value of the j-th loop variable in the i-th iteration performed by this thread.

The following object is used to specify the subset of sections assigned to one thread executing an omp sections construct:
{{{
typedef struct {
  int numSections;
  int sections[];
} CIVL_omp_sections_info;
}}}
The length of the array `sections` is `numSections`.  The integer `sections[i]` is the index of the i-th section that this thread will execute.
API:
{{{
/* Creates a new global work-sharing state object, returning a
 * handle to it.  nthreads is the number of threads in
 * the parallel region.  There is one of these per parallel region,
 * created upon entering the region. */
$omp_gws $omp_gws_create($scope scope, int nthreads);

void $omp_gws_destroy($omp_gws gws);

/* Creates a local work-sharing object, which is basically
 * a pair consisting of a global work-sharing handle and
 * a thread id. */
$omp_ws $omp_ws_create($scope scope, $omp_gws gws, int tid);

void $omp_ws_destroy($omp_ws ws);

/* For "for" loops only: called when a thread arrives, it
 * returns the sequence of loop iterations to be performed by
 * the thread.  Parameter location is the ID of the model location
 * of the top of the loop.  It is needed to check that all threads
 * encounter the same worksharing statements in the same order.
 * The implementation will need the value start, the initial value
 * of the loop variable; end, its final value; and inc, the increment
 * (which can be positive or negative).  These values can all be
 * obtained by getting the loop statement from the location and
 * evaluating the expressions occurring there. */
CIVL_omp_loop_info $omp_ws_arrive_loop($omp_ws ws, int location);

/* For sections: called at arrival, returns the sequence of sections to
 * be executed by the calling thread.  The sections are numbered in
 * order, starting from 0. */
CIVL_omp_sections_info $omp_ws_arrive_sections($omp_ws ws, int location);

/* For single: called on arrival, returns whether or not to execute
 * the single code. */
_Bool $omp_ws_arrive_single($omp_ws ws, int location);

/* Called when arriving at a barrier.  This does not
 * impose the barrier; you still need to call the system function
 * $barrier... for that.  This is needed to ensure all threads
 * in the team call the same sequence of worksharing and barrier
 * constructs. */
void $omp_ws_arrive_barrier($omp_ws ws, int location);
}}}

What these functions do: basically, the global data structure comprises a FIFO queue for each thread.  Each queue contains work-sharing records, one for each work-sharing or barrier construct encountered.  A record contains the basic information about the construct as provided by the arguments to the arrival function, as well as the distribution chosen for that thread.

The constructs are a lot like MPI collective operations, and are modeled similarly.

When a thread arrives at one of these constructs, it invokes the relevant arrival function.  At this point one can determine whether this thread is the first to arrive at that construct: if its queue is empty, it is the first; otherwise it is not, and the oldest entry in its queue will be the entry corresponding to this construct.

When a thread is the first to arrive at a construct, a distribution is chosen for every thread, and a record is created and enqueued in each thread's queue (including the caller's).  The distributions can be chosen nondeterministically, possibly with some restrictions to achieve some tractability/soundness compromise.  The record for this thread is then dequeued and the corresponding info object returned.

If a thread is not the first to arrive, its record is dequeued and compared with the arguments given in the function call.  They should match; if they don't, an error is reported.  This indicates that either threads encountered constructs in different orders or the loop parameters changed.
== Translations of specific directives ==

=== Translating `parallel` ===

`parallel`: this spawns some nondeterministic number of threads.  We will assume there is a constant `THREAD_MAX` defined somewhere.  The number of threads created will be between 1 and `THREAD_MAX` (inclusive).  Each thread is assigned an ID.  The original ("master") thread has ID 0.  All threads execute the parallel region.

{{{
  #pragma omp parallel ...
  S
}}}

=>

{{{
  {
    int _nthreads = 1+$choose_int(THREAD_MAX);
    $proc _threads[_nthreads];
    $omp_gws _gws = $omp_gws_create($here, _nthreads);
    $gbarrier _gbarrier = $gbarrier_create($here, _nthreads);
    // declare shared variables and corresponding state variables
    // initialize all state components to -1
    void _thread(int _tid) {
      $omp_ws _ws = $omp_ws_create($here, _gws, _tid);
      $barrier _barrier = $barrier_create($here, _gbarrier, _tid);
      // declare local copies of shared variables
      // declare private variables

      translate(S) but replace each private variable `x` with `_x`, and
        translate accesses to shared variables using the protocols above;
      flush any writes to shared variables;
    }
    for (int i=0; i<_nthreads; i++) _threads[i]=$spawn _thread(i);
    for (int i=0; i<_nthreads; i++) $wait(_threads[i]);
  }
}}}

All variables that occur in the parallel construct, i.e., in the lexical extent of the parallel construct, must be determined to be either private or shared.  This is determined by the clauses and the default rules specified in the OpenMP Standard.  Obviously any variable declared within the construct itself must be private.

For each private variable `x` not declared within the parallel construct, create a new variable of the same type, `_x`.  The new variable is declared within the thread scope.  If `x` is also firstprivate, then `_x` is initialized with the value of `x`, e.g. `int _x=x;`.  Otherwise, `_x` is uninitialized, so has an undefined value.

=== Translating `for` ===

Try to determine whether the loop iterations are independent.  If they are, they can all be executed by one thread.
Otherwise:

{{{
// location 23:
#pragma omp parallel for
for (i=0; i<n; i++)
  S
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);

  int numIters = info.numIters;
  for (int j=0; j<numIters; j++) {
    int i = info.iters[j][0];

    translate(S);
  }
  barrier_and_flush();
}
}}}

We can vary the way iterations are assigned to explore different tradeoffs and strategies.  At one extreme, every possible partition can be explored; at the other, some fixed strategy like round-robin with chunk size 1 can be used.  This only changes the definition of `$omp_ws_arrive_loop`, not the translation above.

{{{
// location 78:
#pragma omp parallel for collapse(3)
for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    for (k=0; k<l; k++) {
      S
    }
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 78);

  int numIters = info.numIters;
  for (int count=0; count<numIters; count++) {
    int i = info.iters[count][0];
    int j = info.iters[count][1];
    int k = info.iters[count][2];

    translate(S);
  }
  barrier_and_flush();
}
}}}

=== Translating `reduction` clause ===

{{{
#pragma omp for reduction(+:x,y)
for (i=a; i<b; i++) {
  S
}
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);
  double _x=0.0, _y=0.0;

  int numIters = info.numIters;
  for (int _count=0; _count<numIters; _count++) {
    int i = info.iters[_count][0];

    translate(S) but replace x with _x and y with _y;
  }
  x += _x;
  y += _y;
  // note: do something with POR so it knows the operations above from
  // different threads commute
  barrier_and_flush();
}
}}}

=== Translating `sections` ===

If there are n sections, create n functions: section1, section2, ....  Again the question is how to distribute them among threads and in what order.
As with loops, you really want to check that the sections are independent and only do the interleaving exploration as a last resort.

{{{
// location 42:
#pragma omp sections
#pragma omp section
  S0
#pragma omp section
  S1
...
}}}

=>

{{{
{
  CIVL_omp_sections_info info = $omp_ws_arrive_sections(_ws, 42);

  for (int _j=0; _j<info.numSections; _j++) {
    int _i = info.sections[_j];

    switch (_i) {
    case 0: {
      translate(S0);
      break;
    }
    case 1: {
      translate(S1);
      break;
    }
    ...
    } /* end of switch */
  } /* end of for loop */
  barrier_and_flush();
}
}}}

=== Translating `single` ===

{{{
// location 33:
#pragma omp single
S
}}}

=>

{{{
if ($omp_ws_arrive_single(_ws, 33)) {
  translate(S);
}
barrier_and_flush();
}}}

=== Translating `barrier` ===

{{{
// location 58:
#pragma omp barrier
}}}

=>

{{{
$omp_ws_arrive_barrier(_ws, 58);
barrier_and_flush();
}}}

=== Translating `critical` ===

Basically, use a lock for each critical name, plus one for the "no name" case.  A thread must obtain the lock to enter the critical section, then release it on exit.
I.e., if there are critical sections named a, b, and c, there should be global root-scope variables of boolean type named `_critical_noname`, `_critical_a`, etc.

{{{
#pragma omp critical (a)
S
}}}

=>

{{{
...
_Bool _critical_a = $false;
.
.
.
$when (!_critical_a) _critical_a=$true;
translate(S);
_critical_a=$false;
}}}

=== Translating `atomic` ===

In general, reads and writes to shared variables will be processed using the protocols described above (dealing with state variables, ownership, etc.).  However, if the operation occurs within an omp atomic construct, it is translated differently.

If sequentially consistent atomic...

If non-sequentially consistent atomic...

=== Translating `ordered` ===

This can only be used inside an OMP `for` loop whose pragma used the `ordered` clause.  (Check that.)  It indicates that the specified region must be executed in iteration order.

In this case `$omp_ws_arrive_loop` must return the iterations in loop order.

{{{
#pragma omp for ordered
for (i=a; i<b; i++) {
  ...
  #pragma omp ordered
  S1
  ...
  #pragma omp ordered
  S2
  ...
}
}}}

=>

{{{
{
  CIVL_omp_loop_info info = $omp_ws_arrive_loop(_ws, 23);
  int order1=a, order2=a;
  int numIters = info.numIters;

  for (int _i=0; _i<numIters; _i++) {
    int i = info.iters[_i][0];
    ...
    $when (order1==i) {
      translate(S1);
      order1++;
    }
    ...
    $when (order2==i) {
      translate(S2);
      order2++;
    }
    ...
  }
}
}}}

=== Translating `master` ===

{{{
#pragma omp master
S
}}}

=>

{{{
if (_tid == 0) {
  translate(S);
}
}}}

    450 == Translating functions ==
    451 
    452 * `omp_get_num_threads()` => `_nthreads`
    453 * `omp_get_thread_num()` => `_tid`