| | 1 | == Case study: integer sort from CORAL benchmarks == |
| | 2 | |
| | 3 | https://asc.llnl.gov/CORAL-benchmarks/#intsort |
| | 4 | |
| | 5 | The benchmark exercises OpenMP threads, MPI-IO, Posix IO, and all-to-all communications with both small and large message sizes. The majority of the time is spent in IO. |
| | 6 | |
| | 7 | It has three major phases: |
| | 8 | |
| | 9 | 1. Hash all elements to `N` bins, where `N = # of nodes`. Note the benchmark uses `1` MPI rank per node. By the end of this phase, each bin contains all elements falling within a sub range. For example, the first bin contains numbers ranging from `0` to `T/N`. Each bin is then stored as a “bin file”. In this phase, each rank performs three steps: |
| | 10 | |
| | 11 | a. Read a part (proportional to the DRAM size) of the input file (each rank reads a different part) |
| | 12 | b. Hash this part locally according to the bins defined above. |
| | 13 | c. Exchange elements among MPI ranks (all-to-all communication) to send them to the MPI rank that owns the corresponding bin. |
| | 14 | d. Each rank writes the received elements to its “bin file”. |
| | 15 | |
| | 16 | 2. Sort each "bin file" individually in each node. (Note: we assume the numbers are uniformly distributed so there is little concern for load balancing). No MPI communications. Each MPI rank treats its bin file as a collection of chunks, each chunk is of `page_size` (`page_size` is recommended to be no smaller than the page size of the file system). |
| | 17 | |
| | 18 | a. Sort chunks in parallel (with OpenMP) |
| | 19 | b. Merge every M chunks, repeat until there is only 1 chunk left. This chunk is the sorted bin file. |
| | 20 | Note: `M = DRAM_size/page_size`, so the number of merging stages is `log_M_(T/N/page_size)` |
| | 21 | |
| | 22 | 3. Concatenate all sorted bin files together into a single file. |