

# INTRODUCTION TO HPC MAELSTROM BOOTCAMP ECMWF

8 November 2023 | Dr. Andreas Herten | Forschungszentrum Jülich, MAELSTROM





#### **Outline**

```
Introduction
Hardware
   Comparison to PC
   HPC vs. PC
   HPC
HPC System Overview
   Historical Machines
   JUWELS
      JUWELS Cluster
      JUWELS Booster
   GPUs
Software
   1: Core Utilization
   2: Parallelization
   3: Distribution
   Enablement
Conclusion
```





What is **HPC**?

High Performance Computing is computing with a powerful machine using the available resources efficiently.


- What kind of CPU does your computer have?
  CPU generation, clock rate, number of cores, vector length
- How much memory does your computer have?
  Amount of memory, type, links (GB *and* GB/s)
- What kind of GPU do you have?
  GPU generation, number of cores, power draw (TDP)
- How fast is your network?
  Throughput, latency































- Usually 2 CPU sockets, each with 64 cores; mostly used as one CPU with one memory
- 4 distinct GPUs, connected with each other (600 GB/s)
- 4 network connections, each 200 Gbit/s in each direction (InfiniBand HDR-200)












High Performance Computing is computing with a powerful machine using the available resources efficiently.

#### **Powerful Machines**

#### Now

- Powerful nodes (large CPUs, accelerating GPUs, much memory)
- Many nodes (well-connected through high-speed interconnect)
- → Beefed-up versions of commodity computers, with slight specializations; many of them

#### **Past**

- First computers: supercomputers! Mainframe machines: large installations with the most powerful hardware of the time
- PC era: even then, specialized computers, like vector machines or many low-speed (but well-connected) CPUs
- Recent history: x86, then PowerPC, then GPU accelerators, then specialized Arm CPUs





- CDC 6600 supercomputer
- Around 1965
- First supercomputer
- 3 MFLOP/s
- See Wikipedia for more
- Picture by Control Data Corporation





HPC performance is measured in **FLOP/s**.

- Floating-point (like 3.14) operations per second
- Example: Processor with 2 GHz; 10 cores; per core: 2 multiplications and 2 additions (FMA) per cycle

$$
\begin{aligned}
& 2 \times 10^9\ \tfrac{1}{\text{s} \cdot \text{core}} \times 10\ \text{cores} \times (2 + 2)\ \text{floating-point operations} \\
&= 2 \times 10^9 \times 10 \times 4\ \text{FLOP/s} \\
&= 80 \times 10^9\ \text{FLOP/s} \\
&= 80\ \text{GFLOP/s}
\end{aligned}
$$
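The same estimate can be written as a small C helper. This is only a sketch of the arithmetic above, not a benchmark: the function names `peak_flops` and `print_peak_example` are made up for illustration, and the result is a theoretical peak that real applications rarely reach.

```c
#include <assert.h>
#include <stdio.h>

/* Theoretical peak in FLOP/s:
   clock rate [1/s] x number of cores x floating-point operations per core per cycle.
   (Hypothetical helper, named here for illustration only.) */
double peak_flops(double clock_hz, int cores, int flop_per_cycle)
{
    assert(clock_hz >= 0.0 && cores >= 0 && flop_per_cycle >= 0);
    return clock_hz * cores * flop_per_cycle;
}

/* Prints the example from the slide: 2 GHz, 10 cores, 2 + 2 operations per cycle. */
void print_peak_example(void)
{
    double peak = peak_flops(2e9, 10, 2 + 2);
    printf("%.0f GFLOP/s\n", peak / 1e9);
}
```

Calling `print_peak_example()` from a `main` function prints `80 GFLOP/s`, matching the calculation above.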





- Cray-1 supercomputer
- Around 1978
- Very successful
- 160 MFLOP/s
- Probably pictured at NERSC





- Intel XP/S 140 supercomputer
- Around 1994
- 3680 Intel i860 RISC processors; large-scale parallel system
- 143 GFLOP/s
- Picture by top500.org





- JUGENE supercomputer
- **2008**
- 294 912 PowerPC 450 cores; energy-efficient
- 800 TFLOP/s
- Picture by top500.org





- Summit supercomputer
- **2018**
- 27 000 GPUs hosted by POWER9 CPUs; first #1 GPU supercomputer
- 200 PFLOP/s
- Picture by Oak Ridge National Lab





- Fugaku supercomputer
- **2020**
- 7 630 848 Arm A64FX cores; #1 supercomputer at release
- 537 PFLOP/s
- Picture by RIKEN





#### JUWELS Cluster - Jülich's Scalable System

- 2500 nodes with Intel Xeon CPUs (2 × 24 cores)
- 46 + 10 nodes with 4 NVIDIA Tesla V100 cards (16 GB memory)
- 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #86)





#### JUWELS Booster - Scaling Higher!

- 936 nodes with AMD EPYC Rome CPUs (2 × 24 cores)
- Each with 4 NVIDIA A100 Ampere GPUs (each: FP64TC: 19.5 TFLOP/s, 40 GB memory)
- InfiniBand DragonFly+ HDR-200 network; 4 × 200 Gbit/s per node
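As a rough cross-check, the per-node figures above can be multiplied up to system level. This is a back-of-the-envelope estimate only: it counts just the GPUs' FP64 tensor-core peak, ignores the CPUs, and the symbols $N_\text{GPU}$, $P_\text{GPU}$ and $B_\text{node}$ are notation introduced here, not taken from the slides.

$$
\begin{aligned}
N_\text{GPU} &= 936 \times 4 = 3744 \\
P_\text{GPU} &\approx 3744 \times 19.5\ \text{TFLOP/s} = 73\,008\ \text{TFLOP/s} \approx 73\ \text{PFLOP/s} \\
B_\text{node} &= 4 \times 200\ \text{Gbit/s} = 800\ \text{Gbit/s} = 100\ \text{GB/s}
\end{aligned}
$$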



Member of the Helmholtz Association 8 November 2023 Slide 13|24





### Top500 List Nov 2021:

- #1 Europe
- #8 World
- #4\* Top/Green500

- Current fastest supercomputer: Frontier at Oak Ridge (USA) with 38 000 AMD MI250X GPUs; 1.102 EFLOP/s; also most energy-efficient!
- 2023: Aurora at Argonne with > 60 000 Intel Ponte Vecchio GPUs; > 2 EFLOP/s
- 2024: El Capitan at Lawrence Livermore with AMD MI300 GPUs; > 2 EFLOP/s
- 2024: JUPITER at JSC with NVIDIA Hopper GPUs; 1 EFLOP/s!





#### GPUs: Exascale Enablers

- Processors efficient at applying the same (or a similar) instruction to a large set of data (e.g., an image)
- Over the last 15 years, extended from rendering to general-purpose computing
- Not good for every task, but great for some: those that compute on large amounts of similar data
- Programming model: SIMT, SIMD ⊗ SMT (vectors ⊗ threads)
- JUWELS Booster at 100 % thread occupancy: 3744 GPUs × 108 SMs × 2048 threads/SM = 828 112 896 threads
- JUPITER: > 10 000 000 000 threads
- Important vendors: first NVIDIA, then AMD, then Intel
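The occupancy figure for JUWELS Booster is a plain product and can be reproduced with a few lines of C. This is a sketch under the numbers quoted above; the function name `resident_threads` is hypothetical, and 64-bit integers are used because such products can exceed the 32-bit range on larger systems.

```c
#include <assert.h>
#include <stdint.h>

/* Threads resident at 100 % occupancy:
   GPUs x streaming multiprocessors (SMs) per GPU x threads per SM.
   (Hypothetical helper, named here for illustration only.) */
int64_t resident_threads(int64_t gpus, int64_t sms_per_gpu, int64_t threads_per_sm)
{
    assert(gpus >= 0 && sms_per_gpu >= 0 && threads_per_sm >= 0);
    return gpus * sms_per_gpu * threads_per_sm;
}
```

For example, `resident_threads(3744, 108, 2048)` yields 828 112 896, the value given for JUWELS Booster.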









High Performance Computing is computing with a powerful machine using the available resources efficiently.

### **Resource Utilization**

1. Exploit all capabilities of the processing entity (core)
2. Parallelize to all processing entities of the node
3. Distribute to all nodes



- Modern CPUs: many advanced instructions, high clock rate, large caches, high memory bandwidth
- Use via tailored algorithms, specific functions (intrinsics), modern compilers, optimized libraries
- Example: Vectorization/SIMD

| $A_0$ | × | $B_0$ | + | $C_0$ | = | $D_0$ |
|-------|---|-------|---|-------|---|-------|
| $A_1$ | × | $B_1$ | + | $C_1$ | = | $D_1$ |
| $A_2$ | × | $B_2$ | + | $C_2$ | = | $D_2$ |
| $A_3$ | × | $B_3$ | + | $C_3$ | = | $D_3$ |

Scalar: 4 multiplications (×) + 4 additions (+) = 4 assignments → 8 instructions



With SIMD, a single vector instruction processes all four elements at once:

- Scalar: 4 multiplications, 4 additions, 4 assignments → 8 instructions
- SIMD: 1 multiplication, 1 addition, 1 assignment → 2 instructions
  - CPU instruction: `VADDPD`
  - C intrinsic: `_mm256_add_pd();`
- SIMD with combined multiplication and addition: 1 assignment → 1 instruction
- Usually done by the compiler
- Goal: improve throughput

Analysis/plot by Stepan Nassyr, 2022.
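In practice, the vectorized form is often obtained by writing a plain loop and letting the compiler choose the instructions. The following is a minimal sketch of the $D_i = A_i \times B_i + C_i$ example. The function name `multiply_add` is made up for illustration, and whether SIMD or fused instructions are actually emitted depends on the compiler, its optimization flags, and the target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* d[i] = a[i] * b[i] + c[i] for all i in [0, n).
   `restrict` promises the compiler that the arrays do not overlap,
   which makes it easier for it to vectorize the loop.
   (Hypothetical helper, named here for illustration only.) */
void multiply_add(size_t n,
                  const double *restrict a,
                  const double *restrict b,
                  const double *restrict c,
                  double *restrict d)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; i++)
        d[i] = a[i] * b[i] + c[i];
}
```

The loop body carries no dependency between iterations, which is the property that allows several elements to be handled per instruction.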



- From core to cores
- From CPU cores to GPU cores
- Parallelization: tasks work on a portion of the full problem using some local shared memory; fine-grained split
- CPU: mostly through operating system capabilities
  - OS threads launched on cores
  - Easiest threading interface: OpenMP

```
#pragma omp parallel for
for (int i = 0; i < N; i++)
    y[i] = x[i] * 3.14 + a[i];
```

- GPU: through dedicated programming environments
  - Mostly explicit models

```
// Inside a CUDA kernel: each thread handles one element
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) y[i] = x[i] * 3.14 + a[i];
```

Also, higher-level models (OpenMP, OpenACC)
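The OpenMP fragment above can be wrapped into a complete function as follows. This is a sketch only: the name `scale_add` and the parameter order are assumptions made for illustration. When the code is built without OpenMP support, the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <assert.h>

/* y[i] = x[i] * 3.14 + a[i] for all i in [0, n).
   With OpenMP enabled, the loop iterations are divided among the threads;
   each iteration is independent, so no synchronization is needed.
   (Hypothetical helper, named here for illustration only.) */
void scale_add(int n, const double *x, const double *a, double *y)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = x[i] * 3.14 + a[i];
}
```

With GCC, OpenMP is typically enabled through the `-fopenmp` option; other compilers use their own switches.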





- From node to nodes
- Distribution: tasks work on a portion of the full problem using distributed memory; coarse-grained split
- Every task runs on its own node with a copy of the program and defined exchange functions
- High-speed network important! GPUs directly attached to network
- Classical programming model: MPI
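The coarse-grained split can be illustrated without any MPI calls: every task only needs to know its own index (its rank) and the total number of tasks to work out which part of the problem it owns. The sketch below shows one common block decomposition. The names `range_t` and `owned_range` are hypothetical, and in an MPI program the rank and task count would typically be obtained from `MPI_Comm_rank` and `MPI_Comm_size` rather than passed in by hand.

```c
#include <assert.h>

/* Half-open index range [begin, end) owned by one task. */
typedef struct {
    long begin;
    long end;
} range_t;

/* Split n items over `tasks` tasks as evenly as possible:
   the first (n % tasks) tasks receive one extra item.
   (Hypothetical helper, named here for illustration only.) */
range_t owned_range(long n, int tasks, int rank)
{
    assert(n >= 0 && tasks > 0 && rank >= 0 && rank < tasks);
    long base = n / tasks;   /* items every task gets */
    long rest = n % tasks;   /* leftover items, one each for the first tasks */
    range_t r;
    r.begin = rank * base + (rank < rest ? rank : rest);
    r.end   = r.begin + base + (rank < rest ? 1 : 0);
    return r;
}
```

For 10 items on 4 tasks this gives the ranges [0, 3), [3, 6), [6, 8) and [8, 10), so every item is owned by exactly one task.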





- **Compilers**: translate high-level code to low-level machine code, with general *and* very architecture-specific optimizations
- **Frameworks**: offer pre-programmed function primitives to build a program upon
- **Libraries**: back-end, low-level functions, usually optimized extensively, sometimes by vendors themselves

#### Compilers

- CPU: GCC, LLVM, Intel, Cray
- GPU: + NVIDIA CUDA, NVHPC, AMD
- Long history, constantly evolving

#### Frameworks

- MPI: Open MPI, MPICH
- Threads: pthreads, OpenMP
- GPU: CUDA, HIP, SYCL, pSTL, Kokkos

#### Libraries

- GPU: cuBLAS, rocBLAS, cuDNN
- → TensorFlow, PyTorch, ELPA



High Performance Computing is computing with a powerful machine using the available resources efficiently.

# Conclusion

- HPC is intensive computing with the largest machines
- Sometimes like Formula 1, sometimes like a tanker
- Sophisticated hardware underlies everything, delivering up to 1.1 EFLOP/s
- Advanced software holds everything together and enables science at the frontiers, like
  - Plasma physics simulations
  - Drug discovery
  - Material design
  - Weather and climate modelling
  - Precise Artificial Intelligence









- We are hiring! go.fzj.de/jsc-jobs









# Appendix




#### License

This slide deck is published under the following license: CC BY-SA 4.0





#### References: Images, Graphics

- [1] Forschungszentrum Jülich GmbH (Ralf-Uwe Limbach). *JUWELS Booster*.
- [2] Control Data Corporation. *Picture: CDC 6600*. Computer History Museum. URL: https://www.computerhistory.org/revolution/supercomputers/10/33
- [3] Sandia National Lab. *Picture: Intel XP/S 140*. Top500.org. URL: https://www.top500.org/resources/top-systems/intel-xps-140-paragon-sandia-national-labs/
- [4] Forschungszentrum Jülich. *Picture: JUGENE*. JUGENE Press Release. URL: https://www.fz-juelich.de/de/aktuelles/news/pressemitteilungen/2007/index4763\_htm
- [5] OLCF at ORNL. *Picture: Summit*. Flickr. URL: https://www.flickr.com/photos/olcf/42659222181/
- [6] RIKEN. *Picture: Fugaku*. Fujitsu.com. URL: https://blog.de.fujitsu.com/data-driven/fugaku-der-aktuellweltweit-leistungsstaerkste-supercomputer/
- [7] OLCF at ORNL. *Picture: Frontier*. Flickr. URL: https://www.flickr.com/photos/olcf/52117623843/
- [8] Nvidia Corporation. *Pictures: Ampere GPU*. Ampere Architecture Whitepaper. URL: http://www.nvidia.com/nvidia-ampere-architecture-whitepaper
- [9] AMD Inc. *Pictures: Instinct MI250 GPU*. Promotional Material. URL: https://videocardz.net/amd-instinct-mi250

