

International ACM Symposium on High Performance Distributed Computing

HPDC 2009, Munich, Germany, June 11-13, 2009

#### An Overview of High Performance Computing and Challenges for the Future

#### Jack Dongarra

University of Tennessee Oak Ridge National Laboratory University of Manchester





H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful Computers in the World
- Yardstick: Rmax from LINPACK MPP

Ax=b, dense problem

- Updated twice a year SC'xy in the States in November Meeting in Germany in June
- All data available from www.top500.org



**TPP** performance

# Performance Development



### Distribution of the Top500



32<sup>nd</sup> List: The TOP10

| Rank | Site                                  | Computer                                    | Country | Cores  | Rmax<br>[Tflops] | Rmax/<br>Rpeak |      | MF/W |
|------|---------------------------------------|---------------------------------------------|---------|--------|------------------|----------------|------|------|
| 1    | DOE/NNSA<br>Los Alamos Nat Lab        | IBM / Roadrunner -<br>BladeCenter QS22/LS21 | USA     | 129600 | 1105.0           | 76%            | 2.48 | 445  |
| 2    | DOE/OS<br>Oak Ridge Nat Lab           | Cray / Jaguar - Cray XT5<br>QC 2.3 GHz      | USA     | 150152 | 1059.0           | 77%            | 6.95 | 152  |
| 3    | NASA/Ames Research<br>Center/NAS      | SGI / Pleiades - SGI Altix<br>ICE 8200EX    | USA     | 51200  | 487.0            | 80%            | 2.09 | 233  |
| 4    | DOE/NNSA<br>Lawrence Livermore<br>NL  | IBM / eServer Blue Gene<br>Solution         | USA     | 212992 | 478.2            | 80%            | 2.32 | 205  |
| 5    | DOE/OS<br>Argonne Nat Lab             | IBM / Blue Gene/P Solution                  | USA     | 163840 | 450.3            | 81%            | 1.26 | 357  |
| 6    | NSF<br>TACC/ Univ. of Texas           | Sun / Ranger - SunBlade<br>x6420            | USA     | 62976  | 433.2            | 75%            | 2.0  | 217  |
| 7    | DOE/OS/NERSC/Law<br>rence Berkeley NL | Cray / Franklin - Cray XT4                  | USA     | 38642  | 266.3            | 75%            | 1.15 | 232  |
| 8    | DOE/OS<br>Oak Ridge Nat Lab           | Cray / Jaguar - Cray XT4                    | USA     | 30976  | 205.0            | 79%            | 1.58 | 130  |
| 9    | DOE/NNSA<br>Sandia Nat Lab            | Cray / Red Storm - XT3/4                    | USA     | 38208  | 204.2            | 72%            | 2.5  | 81   |
| 10   | Shanghai<br>Supercomputer Center      | Dawning 5000A, Windows<br>HPC 2008          | China   | 30720  | 180.6            | 77%            | .85  | 212  |



#### LANL Roadrunner A Petascale System in 2008



Dual Core Opteron Chip

## ORNL/UTK Computer Power Cost Projections 2008-2012

- Over the next 5 years ORNL/UTK will deploy 2 large Petascale systems
- Using 15 MW today
- By 2012 close to 50MW!!
- Power costs greater than \$10M today.
- Cost estimates based on \$0.07 per KwH



Power becomes the architectural driver for future large systems

### Something's Happening Here...



ICLUT

- In the "old days" it was: each year processors would become faster
- Today the clock speed is fixed or getting slower
- Things are still doubling every 18 -24 months
- Moore's Law reinterpretated.
  - Number of cores double every 18-24 months

## Moore's Law Reinterpreted

- Number of cores per chip doubles every 2 year, while clock speed remains fixed or decreases
- Need to deal with systems with millions of concurrent threads
  - Future generation will have billions of threads!
- Number of threads of execution doubles every 2 year



#### **Performance Development and Projections**



## Major Changes to Software

- Must rethink the design of our software
  - Another disruptive technology
    - Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
- Numerical libraries for example will change
  - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

#### A New Generation of Software: ICLU

Parallel Linear Algebra Software for Multicore Architectures (PLASMA)



Those new algorithms

- have a very low granularity, they scale very well (multicore, petascale computing, ...)
- removes a lots of dependencies among the tasks, (multicore, distributed computing)
- avoid latency (distributed computing, out-of-core)
- rely on fast kernels

Those new algorithms need new kernels and rely on efficient scheduling algorithms.

### Coding for an Abstract Multicore

# Parallel software for multicores should have two characteristics:

- •Fine granularity:
  - High level of parallelism is needed
  - Cores will probably be associated with relatively small local memories. This requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality.

#### •Asynchronicity:

• As the degree of thread level parallelism grows and granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.

# Steps in the LAPACK LU













FOR k = 0..TILES-1

 $A[k][k], T[k][k] \leftarrow \mathsf{DGRQRT}(A[k][k])$ 

```
FOR m = k+1..TILES-1
```

 $A[k][k], A[m][k], T[m][k] \leftarrow \mathsf{DTSQRT}(A[k][k], A[m][k], T[m][k])$ 

- **FOR** n = k+1..TILES-1
  - $A[k][n] \leftarrow \mathsf{DLARFB}(A[k][k], T[k][k], A[k][n])$

**FOR** m = k+1..TILES-1

```
A[k][n], A[m][n] \leftarrow \mathsf{DSSRFB}(A[m][k], T[m][k], A[k][n], A[m][n])
```

- input matrix stored and processed by square tiles
- complex DAG

### Achieving Fine Granularity

Fine granularity may require novel data formats to overcome the limitations of BLAS on small chunks of data.







PLASMA (Redesign LAPACK/ScaLAPACK) Parallel Linear Algebra Software for Multicore Architectures

- Asychronicity
  - Avoid fork-join (Bulk sync design)
- Dynamic Scheduling
  - Out of order execution
- Fine Granularity
  - Independent block operations
- Locality of Reference
  - Data storage Block Data Layout

Lead by Tennessee and Berkeley similar to LAPACK/ScaLAPACK as a community effort

### If We Had A Small Matrix Problem

- We would generate the DAG, find the critical path and execute it.
- DAG too large to generate ahead of time
  - Not explicitly generate
  - Dynamically generate the DAG as we go
- Machines will have large number of cores in a distributed fashion
  - Will have to engage in message passing
  - Distributed management
  - Locally have a run time system





 Here is the DAG for a factorization on a 20 x 20 matrix



- For a large matrix say O(10<sup>6</sup>) the DAG is huge
- Many challenges for the software













#### PLASMA Dynamic Task Scheduler



- task a unit of scheduling (quantum of work)
- slice a unit of dependency resolution (quantum of data)
- Current version uses one core to manage the task pool

# How to Deal with Complexity?

- Many parameters in the code needs to be optimized.
- Software adaptivity is the key for applications to effectively use available resources whose complexity is exponentially increasing
- Goal:
  - Automatically bridge the gap between the application and computers that are rapidly changing and getting more and more complex
- Non obvious interactions between HW/SW can effect outcome

# Auto-Tuning

 Best algorithm implementation can depend strongly on the problem, computer architecture, compiler,...

#### There are 2 main approaches

- Model-driven optimization

   [Analytical models for various parameters;
   Heavily used in the compilers community;
   May not give optimal results ]
- Empirical optimization
  - [Generate large number of code versions and runs them on a given platform to determine the best performing one; Effectiveness depends on the chosen parameters to optimize and the search heuristics used ]
- Natural approach is to combine them in a hybrid approach

[1<sup>st</sup> model-driven to limit the search space for a 2<sup>nd</sup> empirical part ]
 [Another aspect is adaptivity - to treat cases where tuning can not be restricted to optimizations at design, installation, or compile time ]

## Pruning the Search Space

Time serial core kernels (dgemm, dssrfb, dssssm).



Intel 64 - dgemm



Power 6 - dssrfb

- Pick up the 'best' NB/IB samples (pruning);
- Select one per matrix size and number of cores.

### DGETRF - Intel64 - 16 cores

DGETRF - Intel64 Xeon quad-socket quad-core (16 cores) - th. peak 153.6 Gflop/s



## Future Computer Systems

- Most likely be a hybrid design
- Think standard multicore chips and accelerator (GPUs)
- Today accelerators are attached
- Next generation more integrated
- Intel's Larrabee in 2010
  - 8,16,32,or 64 x86 cores
- AMD's Fusion in 2011



Intel Larrabee

- Multicore with embedded graphics ATI
- Nvidia's plans?

What's Next?



### Search Straig Hybrid Computing

 Match algorithmic requirements to architectural strengths of the hybrid components
 Multicore : small tasks/tiles
 Accelerator: large data parallel tasks



- e.g. split the computation into tasks; define critical path that "clears" the way for other large data parallel tasks; proper schedule the tasks execution
- Design algorithms with well defined "search space" to facilitate auto-tuning



### Current Work: MAGMA

 Algorithms (in particular LU) for Multicore + GPU systems

#### Challenges

- How to split the computation
- Software development
- Tuning



Work splitting (for single GPU + 8 cores host)





### Performance [in double precision]



### Exascale Computing

#### Google: exascale computing study

ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems

Peter Kogge, Editor & Study Lead Keren Bergman Shekhar Borkar Dan Campbell William Carlson William Dally Monty Denneau Paul Franzon William Harrod Kerry Hill Jon Hiller Sherman Karp Stephen Keckler Dean Klein Robert Lucas Mark Richards Al Scarpelli Steven Scott Allan Snavely **Thomas Sterling R. Stanley Williams** Katherine Yelick



This work was sponsored by DARPA IPTO in the ExaScale Computing Study with Dr. William Harrod as Program Manager; AFRL contract number FA8650-07-C-7724. This report is published in the interest of scientific and technical information exchange and its publication does not constitute the Government's approval or disapproval of its ideas or findings

#### NOTICE

Using Government drawings, specifications, or other data included in this document for any purpose other than Government procurement does not in any way obligate the U.S. Government. The fact that the Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey any rights or permission to manufacture, use, or sell any patented invention that may relate to them.

APPROVED FOR PUBLIC RELEASE, DISTRIBUTION UNLIMITED.



## Exascale Computing

- Exascale systems are likely feasible by 2017 2
- 10-100 Million processing elements (cores or mini-cores) with chips perhaps as dense as 1,000 cores per socket, clock rates will grow more slowly
- 3D packaging likely
- Large-scale optics based interconnects
- 10-100 PB of aggregate memory
- Hardware and software based fault management
- Heterogeneous cores
- Performance per watt stretch goal 100 GF/watt of sustained performance >> 10 - 100 MW Exascale system
- Power, area and capital costs will be significantly higher than for today's fastest systems

ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems

Peter Kogge, Editor & Study Lead Keren Bergman Shekhar Borkar Dan Campbell William Carlson William Dally Monty Denneau Paul Franzon William Harrod Kerry Hill Jon Hiller Sherman Ka Stephen Keck Dean Klein Robert Lucas Mark Richards Al Scarnelli Steven Scott Allan Snavely Thomas Sterling R. Stanley William Katherine Yelick

This work was sponsored by DARPA IPTO in the ExaScale Computing Study with Dr. William Harrod as Program Manager, AFRL contract number FA3660-07-C-7724. This report is published in the interest of scientific and technical information exchange and its publication does not constitute the

NOTICE Using Government drawings, specifications, or other data included in this document for any purpose other than Government foremolated or supplied the drawings, specifications, or other data does not license the holder or any pueplied the drawings, specifications, or other stata

nt's approval or disapproval of its ideas or findings

manufacture, use, or sell any patented invention that may relate to them. APPROVED FOR PUBLIC RELEASE, DISTRIBUTION UNLIMITED

September 28, 2008



#### Five Important Features to Consider When Computing at Scale

- Effective Use of Many-Core and Hybrid architectures
  - Dynamic Data Driven Execution
  - Block Data Layout
- Exploiting Mixed Precision in the Algorithms
  - Single Precision is 2X faster than Double Precision
  - With GP-GPUs 10x
- Self Adapting / Auto Tuning of Software
  - Too hard to do by hand
- Fault Tolerant Algorithms
  - With 1,000,000's of cores things will fail
- Communication Avoiding Algorithms
  - For dense computations from O(n *log* p) to O(*log* p) communications
  - GMRES s-step compute (x, Ax, A<sup>2</sup>x, ... A<sup>s</sup>x)



- For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
- This strategy needs to be rebalanced barriers to progress are increasingly on the software side.
- Moreover, the return on investment is more favorable to software.
  - Hardware has a half-life measured in years, while software has a half-life measured in decades.
- High Performance Ecosystem out of balance
  - Hardware, OS, Compilers, Software, Algorithms, Applications
    - No Moore's Law for software, algorithms and applications

## Collaborators / Support

Employment opportunities for post-docs in the ICL group at Tennessee

PLASMA Parallel Linear Algebra Software for Multicore Architectures

MAGMA Matrix Algebra on GPU and Multicore Architectures







Werben mit Google - Unternehmensangebote - Über Google - Google.com in English

©2009 - Datenschutz

#### **Contact Jack Dongarra**

#### If you are wondering what's beyond ExaFlops

| Mega, G | iga, <sup>-</sup> | Tera, |     |
|---------|-------------------|-------|-----|
| Peta,   | Exa,              | Zetta | ••• |

| 10 <sup>3</sup>         | kilo  |
|-------------------------|-------|
| 10 <sup>6</sup>         | mega  |
| 10 <sup>9</sup>         | giga  |
| <b>10</b> <sup>12</sup> | tera  |
| <b>10</b> <sup>15</sup> | peta  |
| 10 <sup>18</sup>        | exa   |
| <b>10</b> <sup>21</sup> | zetta |
|                         |       |

| <b>10</b> <sup>24</sup> | yotta |
|-------------------------|-------|
| 10 <sup>27</sup>        | xona  |
| <b>10</b> <sup>30</sup> | weka  |
| 10 <sup>33</sup>        | vunda |
| 10 <sup>36</sup>        | uda   |
| 10 <sup>39</sup>        | treda |
| <b>10</b> <sup>42</sup> | sorta |
| <b>10</b> <sup>45</sup> | rinta |
| 10 <sup>48</sup>        | quexa |
| <b>10</b> <sup>51</sup> | pepta |
| <b>10</b> <sup>54</sup> | ocha  |
| <b>10</b> <sup>57</sup> | nenaN |
| <b>10</b> <sup>60</sup> | minga |
| 1062                    |       |