#### CHALMERS

# Graphics Hardware

## Ulf Assarsson

## **Graphics hardware – why?**

- About 100x faster!
- Another reason: about 100x faster!
- Simple to pipeline and parallelize
- Current hardware based on triangle rasterization with programmable shading (e.g., OpenGL acceleration)
- Ray tracing: there are research architectures, and a few commercial products
  - Renderdrive, RPU, (Gelato), NVIDIA OptiX
  - Or write your own GPU ray-tracer

## Perspective-correct interpolation of texture coordinates (and actually all screen-space-interpolated per-vertex data)



## **Perspective-correct texturing**

How are texture coordinates interpolated over a triangle? Linearly?





#### **Linear interpolation**

#### **Perspective-correct interpolation**

- Perspective-correct interpolation gives foreshortening effect!
- Hardware does this for you, but you need to understand this anyway!



## **Recall the following**

Vertices are projected onto the screen by a non-linear transform. Hence, tex coords cannot be linearly interpolated in screen space (just like a 3D position cannot be).

- Perspective projection introduces a non-linear transform by the homogenization step:
  - Before projection, v, and after p (p=Mv)
  - After projection  $p_w$  is not 1!
  - Homogenization:  $(p_x/p_w, p_y/p_w, p_z/p_w, 1)$
  - Gives  $(p'_x, p'_y, p'_z, 1)$

$$\mathbf{p} = \mathbf{M}\mathbf{v} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -1/d & 0 \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ v_z \\ 1 \end{pmatrix} = \begin{pmatrix} v_x \\ v_y \\ v_z \\ -v_z/d \end{pmatrix}$$
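The non-linearity of the homogenization step can be checked numerically. The sketch below is only an illustration: the value `d = 1`, the two segment endpoints and the helper names `project` and `lerp` are assumptions, not part of the slides. It projects the midpoint of a 3D segment and compares it with the midpoint of the two projected endpoints.

```python
def project(v, d=1.0):
    """Apply the matrix M from above, then homogenize. v = (x, y, z)."""
    x, y, z = v
    p = (x, y, z, -z / d)                        # p = M v, so p_w = -v_z / d
    w = p[3]
    return (p[0] / w, p[1] / w, p[2] / w, 1.0)   # divide by p_w


def lerp(a, b, t):
    """Component-wise linear interpolation between tuples a and b."""
    return tuple((1 - t) * ai + t * bi for ai, bi in zip(a, b))


# Endpoints of a 3D segment in front of the camera (negative z); assumed values.
v0, v1 = (0.0, 0.0, -1.0), (4.0, 0.0, -4.0)

mid_3d_projected = project(lerp(v0, v1, 0.5))            # true midpoint, projected
mid_of_projected = lerp(project(v0), project(v1), 0.5)   # screen-space midpoint

print(mid_3d_projected[0], mid_of_projected[0])
```

The two x-values differ, so the screen-space midpoint does not correspond to the 3D midpoint. The same holds for any attribute attached to the vertices.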

Mathematical derivation: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.211&rep=rep1&type=pdf

## Texture coordinate interpolation

- Linear interpolation does not work
- Rational linear interpolation does:
  - u(x)=(ax+b)/(cx+d) (along a scanline where y=constant)
  - *a,b,c,d* are computed from triangle's vertices (x,y,z,w,u,v)
- Not really efficient to compute *a*,*b*,*c*,*d* per scan line
- Smarter:
  - Compute (u/w, v/w, 1/w) per vertex
  - These quantities can be linearly interpolated!
  - Then at each pixel, compute 1/(1/w)=w
  - And obtain:  $(w \cdot u/w,\; w \cdot v/w) = (u, v)$
  - The (u,v) are perspectively-correct interpolated
- Need to interpolate shading this way too
  - Though, not as annoying as textures
- Since linear interpolation now is OK, compute, e.g., Δ(u/w)/Δx, and use this to update u/w when stepping in the x-direction (similarly for other parameters)
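The incremental scheme in the last bullet can be sketched for a single scanline. This is a minimal illustration, not how any particular GPU implements it; the function name `scanline_uv` and the example values are assumptions.

```python
def scanline_uv(x0, x1, attr0, attr1):
    """Yield (x, u, v) for the integer pixels x0..x1 of one scanline.

    attr0 and attr1 are (u, v, w) at the two ends of the span; x1 > x0 is assumed.
    """
    u0, v0, w0 = attr0
    u1, v1, w1 = attr1
    # These per-endpoint quantities are linear in screen space.
    a0 = (u0 / w0, v0 / w0, 1.0 / w0)
    a1 = (u1 / w1, v1 / w1, 1.0 / w1)
    n = x1 - x0
    # Constant per-pixel increments, e.g. delta(u/w) / delta(x).
    step = tuple((e1 - e0) / n for e0, e1 in zip(a0, a1))
    cur = a0
    for x in range(x0, x1 + 1):
        w = 1.0 / cur[2]                  # recover w = 1 / (1/w)
        yield x, cur[0] * w, cur[1] * w   # (u, v) = (w * u/w, w * v/w)
        cur = tuple(c + s for c, s in zip(cur, step))


# Span of five pixels; the right end is three times farther away (w = 3).
span = list(scanline_uv(0, 4, (0.0, 0.0, 1.0), (1.0, 1.0, 3.0)))
```

At the middle pixel the result is u = 0.25 rather than the 0.5 that plain linear interpolation of u would give, which is the foreshortening effect.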

## **Put differently:**



- Linear interpolation in screen space does not work for u,v
- Why:
  - We have applied a non-linear transform to each vertex position (x/w, y/w, z/w, w/w).
    - Non-linear due to 1/w factor from the homogenisation
- Solution:
  - We must apply the same non-linear transform to u,v
    - E.g. (u/w, v/w). This can now be correctly screenspace interpolated since it follows the same non-linear (1/w) transform (and interpolation) as (x/w, y/w, z/w).
    - When doing the texture lookups, we still need (u,v) and not (u/w, v/w).
    - So, multiply by w. But we don't have w at the pixel.
    - So, linearly interpolate (u/w, v/w, 1/w), which is computed in screenspace at each vertex.
    - Then at each pixel:
      - $u_i = (u/w)_i / (1/w)_i$
      - $v_i = (v/w)_i / (1/w)_i$
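For a whole triangle, the same per-pixel formulas can be written with screen-space barycentric weights. The sketch below assumes the weights are already known for the pixel; the function names and the vertex values are made up for illustration.

```python
def perspective_correct_uv(bary, verts):
    """bary: screen-space barycentric weights (b0, b1, b2) of the pixel.
    verts: three (u, v, w) tuples, one per triangle vertex."""
    u_over_w = sum(b * u / w for b, (u, v, w) in zip(bary, verts))
    v_over_w = sum(b * v / w for b, (u, v, w) in zip(bary, verts))
    one_over_w = sum(b / w for b, (u, v, w) in zip(bary, verts))
    return u_over_w / one_over_w, v_over_w / one_over_w


def linear_uv(bary, verts):
    """Naive screen-space interpolation of (u, v), shown only for comparison."""
    return (sum(b * u for b, (u, v, w) in zip(bary, verts)),
            sum(b * v for b, (u, v, w) in zip(bary, verts)))


verts = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 4.0)]  # third vertex far away
bary = (0.25, 0.25, 0.5)

correct = perspective_correct_uv(bary, verts)
naive = linear_uv(bary, verts)
```

With these values the perspective-correct result is (0.4, 0.2) while the naive one is (0.25, 0.5). When all three w are equal, the two functions agree.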

For a formal proof, see Jim Blinn, "W Pleasure, W Fun", IEEE Computer Graphics and Applications, p. 78-82, May/June 1998

#### Need to interpolate shading this way too, though, not as annoying as textures

## Overview of GPU architecture

- History / evolution
- GPU design: Several **cores** consisting of many **ALU**s (NVIDIA terminology: **Streaming Multiprocessors (SMMs)** of many **cores**)
- GPU vs CPU

Take-away: bandwidth (cost of memory accesses) is a major problem

## Background: Graphics hardware architectures

- Evolution of graphics hardware has started from the end of the pipeline
  - Rasterizer was put into hardware first (most performance to gain from this)
  - Then the geometry stage
  - Application will not be put into GPU hardware (?)
- Two major ways of getting better performance:
  - Pipelining
  - Parallelization
  - Combinations of these are often used

### Parallelism

- "Simple" idea: compute n results in parallel, then combine results
- Not always simple!
  - Try to parallelize a sorting algorithm...
  - But vertices are independent of each other, and also pixels, so simpler for graphics hardware
- Can parallelize both the geometry and the rasterizer stage:




#### Department of Computer Engineering





## **Graphics Processing Unit - GPU**



NVIDIA Geforce GTX 580

# NVIDIA Maxwell (GTX 980)

2014


- 16 SMMs ("Cores")
- 2MB L2 cache
- 64 output pixels / clock (i.e., 64 ROPs)
- 2048 ALUs ("cores")
- ~6 Tflops

#### Each SMM:

- 128 ALUs
- 96KB L1 cache
- 8 TexUnits
- 32 Load/Store units for access to global memory



## NVIDIA Pascal GP100 (GTX 1080 / Titan X)

- 3584 cores
- 11 Tflops
- 15.3B transistors
- 16 GB RAM
- 4MB L2
- ~64KB L1
- 256KB regs/SM
- 224 tex units

2016






## NVIDIA Volta GV100

2018




*Figure: block diagram of one Volta GV100 SM. An L1 instruction cache feeds four processing blocks, each with an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a register file (16,384 x 32-bit), FP64, INT and FP32 units, two tensor cores, LD/ST units and an SFU. The blocks share a 128KB L1 data cache / shared memory and four Tex units.*

SM:

- 64 32-bit fp/int cores
- 512 16-bit cores

Tensor core, per clock: $D = A \times B + C$, where $A$ and $B$ are 4x4 FP16 matrices, and $C$ and $D$ are 4x4 matrices in FP16 or FP32.

## **Graphics Hardware History**

- '80s:
  - Linear interpolation of color over a scanline
  - Vector graphics
- '91: Super Nintendo, Neo Geo
  - Rasterization of 1 single 3D rectangle per frame (F-Zero)
- '95-'96: Playstation 1, 3dfx Voodoo 1
  - Rasterization of whole triangles (Voodoo 2, 1998)
- '99: Geforce (256)
  - Transforms and Lighting (geometry stage)
- '02: 3DLabs WildCat Viper, P10
  - Pixel shaders, integers
- '02: ATI Radeon 9700, GeforceFX
  - Vertex shaders and **Pixel shaders** with floats
- '06: Geforce 8800
  - Geometry shaders, integers and floats, logical operations
- Then:
  - More general multiprocessor systems, higher SIMD-width, more cores





## **Direct View Storage Tube**

### Created by Tektronix

- Did not require constant refresh
- Standard interface to computers
  - Allowed for standard software
  - Plot3D in Fortran
- Relatively inexpensive
  - Opened door to use of computer graphics for CAD community



#### Tektronix 4014

## **Graphics Hardware History**

#### 2001: In GeForce3: 600-800 pipeline stages!

- **57** million transistors
- First Pentium IV: 20 stages, 42 million transistors

#### Evolution of cards:

- 2004 X800 165M transistors
- 2005 X1800 320M trans, 625 MHz, 750 Mhz mem, 10Gpixels/s, 1.25G verts/s
- 2004 GeForce 6800: 222 M transistors, 400 MHz, 400 MHz core/550 MHz mem
- 2005 GeForce 7800: 302M trans, 13Gpix/s, 1.1Gverts/s, bw 54GB/s, 430 MHz core, mem 650MHz(1.3GHz)
- 2006 GeForce 8800: 681M trans, 39.2Gpix/s, 10.6Gverts/s, bw:103.7 GB/s, 612 MHz core (1500 for shaders), 1080 MHz mem (effective 2160 MHz)
- 2008 Geforce 280 GTX: 1.4G trans, 65nm, 602/1296 MHz core, 1107(\*2)MHz mem, 142GB/s, 48Gtex/s
- 2009 ATI Radeon HD 5870: 2.15G trans, 153GB/s, 40nm, 850 MHz, GDDR5, 256bit mem bus
- 2010 Geforce GTX480: 3Gtrans, 700/1401 MHz core, Mem (1.848G(\*2)GHz), 177.4GB/s, 384bit mem bus, 40Gtexels/s
- 2011 GTX580: 3Gtrans, 772/1544, Mem: 2004/4008 MHz, 192.4GB/s, GDDR5, 384bit mem bus, 49.4 Gtex/s
- 2012 GTX680: 3.5Gtrans (7.1 for Tesla), 1006/1058, 192.2GB/s, 6GHz GDDR5, 256-bit mem bus.
- 2013 GTX780: 7.1G, core clock: 837MHz, 336 GB/s, Mem clock: 6GHz GDDR5, 384-bit mem bus
- 2014 GTX980: 7.1G?, core clock: ~1200MHz, 224GB/s, Mem clock: 7GHz GDDR5, 256-bit mem bus
- 2015 GTX Titan X: 8Gtrans, core clock: ~1000MHz, 336GB/s, Mem clock: 7GHz GDDR5, 384-bit mem bus
- 2016 Titan X: 12/15Gtrans, core clock: ~1500MHz, 480GB/s, Mem clock: 10Gbps GDDR5X, 4096-bit HBM2
- 2018 Nvidia Volta: 21.1Gtrans, core clock: ~1500MHz, 900GB/s, Mem: 4096-bit HBM2

Lesson learned: #trans doubles ~per 2 years. Core clock increases slowly. Mem clock increases with new technology (DDR2, DDR3, GDDR5, HBM2) and with more memory buses (à 64-bit). Now stacked.

- We want as fast memory as possible! Why?
  - Parallelization can cover for a slow clock. Parallelization is more energy efficient than high clock frequency. Power consumption is proportional to freq<sup>2</sup>.
  - Memory transfers are often the bottleneck
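Several of the bandwidth figures in the list above can be reproduced from the memory data rate and the bus width. The sketch below assumes that the quoted "Mem clock" values are effective per-pin data rates in Gbit/s; the function name is made up for illustration.

```python
def peak_bandwidth_gb_per_s(data_rate_gbit_per_pin, bus_width_bits):
    """Peak bandwidth = per-pin data rate * bus width, converted from bits to bytes."""
    return data_rate_gbit_per_pin * bus_width_bits / 8.0


# Inputs taken from the list above.
gtx980 = peak_bandwidth_gb_per_s(7.0, 256)      # 7 GHz GDDR5, 256-bit bus
gtx580 = peak_bandwidth_gb_per_s(4.008, 384)    # 4008 MHz effective, 384-bit bus
gtx480 = peak_bandwidth_gb_per_s(3.696, 384)    # 1848 MHz * 2, 384-bit bus

print(gtx980, gtx580, gtx480)
```

The results are close to the 224, 192.4 and 177.4 GB/s quoted for these cards. The formula also shows why a wider bus or a faster memory technology is needed to raise bandwidth.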





Intel's Sandybridge AMD's Bulldozer

## Memory bandwidth usage is huge!!

- On top of that, bandwidth usage is never 100%.
- However, there are many techniques to reduce bandwidth usage:
  - Texture caching with prefetching
  - Texture compression
  - Z-compression
  - Z-occlusion testing (HyperZ)

Bonus

## Z-occlusion testing and Z-compression

- One way of reducing bandwidth
  - ATI Inc. pioneered this with their HyperZ technology
- Very simple, and very effective
- Divide screen into tiles of 8x8 pixels
- Keep a status memory on-chip
  - Very fast access
  - Stores additional information that this algorithm uses
- Enables occlusion culling on triangle basis, z-compression, and fast Z-clears

#### Bonus

#### Architecture of Z-cull and Z-compress



- Store zmax per tile, and a flag (whether cleared, compressed/uncompressed)
- Rasterize one tile at a time
- Test if zmin on triangle is farther away than tile's zmax
  - If so, don't do any work for that tile!!!
  - Saves texturing and z-read for the entire tile: huge savings!
- Otherwise read compressed Z-buffer, and unpack
- Write to unpacked Z-buffer, and when finished compress and send back to memory, and also: update zmax
- For fast Z-clears: just set a flag to "clear" for each tile
  - Then we don't need to read from Z-buffer, just send cleared Z for that tile
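The per-tile logic above can be sketched in a few lines. This is a simplified software model, not the HyperZ implementation: compression is left out, depth is assumed to lie in [0, 1] with smaller values nearer, and the names `Tile` and `rasterize_into_tile` are hypothetical.

```python
TILE = 8  # tile side in pixels, as on the slide


class Tile:
    def __init__(self):
        self.cleared = True   # fast Z-clear: only this flag is set
        self.zmax = 1.0       # farthest depth stored in the tile
        self.z = None         # per-pixel depths (stands in for off-chip memory)

    def depths(self):
        """Return the tile's depths; a cleared tile needs no memory read."""
        if self.cleared:
            self.z = [1.0] * (TILE * TILE)
            self.cleared = False
        return self.z


def rasterize_into_tile(tile, tri_zmin, fragments):
    """fragments: list of (pixel_index, depth) covered by the triangle in this tile.
    Returns the number of fragments that passed the depth test."""
    if tri_zmin > tile.zmax:
        return 0                      # triangle fully hidden: skip all work
    z = tile.depths()
    passed = 0
    for idx, depth in fragments:
        if depth < z[idx]:
            z[idx] = depth
            passed += 1
    tile.zmax = max(z)                # update zmax after writing
    return passed


tile = Tile()
first = rasterize_into_tile(tile, 0.3, [(i, 0.3) for i in range(TILE * TILE)])
hidden = rasterize_into_tile(tile, 0.5, [(0, 0.5), (1, 0.6)])
front = rasterize_into_tile(tile, 0.1, [(0, 0.1)])
```

The second triangle is rejected by a single comparison against the on-chip zmax, so neither the Z-buffer nor any texture is touched for it.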



## Taxonomy of hardware design

for how to resynchronize (sort) parallelized work.

Outputs to frame buffers must respect incoming triangle order.

Take-aways: Sort-first, Sort-middle, Sort-Last Fragment, Sort-Last Image

## **Taxonomy of Hardware**

- We can do many computations in parallel:
  - Pixel shading, vertex shading, geometry shading
- But results need to be sorted somewhere before reaching the screen.
  - Operations can be parallelized but the result on screen must be as if each triangle were rendered one by one in their incoming order (according to the OpenGL spec)
    - I.e., for every pixel, the rasterized fragments must be merged to the buffers in the original input triangle order
    - E.g., for blending (transparency), (z-culling + stencil test)
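Why the order matters for blending can be seen with two half-transparent fragments on the same pixel. The sketch assumes the common "over" blend equation and made-up colors.

```python
def blend_over(src_rgb, src_alpha, dst_rgb):
    """'Over' blend: result = alpha * src + (1 - alpha) * dst."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))


background = (0.0, 0.0, 0.0)
red, green = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)

# Same two fragments, merged into the color buffer in different orders.
red_then_green = blend_over(green, 0.5, blend_over(red, 0.5, background))
green_then_red = blend_over(red, 0.5, blend_over(green, 0.5, background))

print(red_then_green, green_then_red)
```

The two results differ, so parallel hardware has to merge fragments for a given pixel in the submitted triangle order to produce a well-defined image.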

## **Taxonomy of hardware**

- Need to sort from model space to screen space
- Gives four major architectures:
  - Sort-first
  - Sort-middle
  - Sort-Last Fragment
  - Sort-Last Image

#### Sorting Taxonomy



 Will describe these briefly. Sort-last fragment (and sort middle) are most common in
 commercial hardware

#### Sorting/dividing work to parallel execution units.

## **Sort-First**

- Sorts primitives before geometry stage
  - Screen is divided into large regions
  - A separate pipeline is responsible for each region (or many)
  - But vertex shader can change screen location!
- G is geometry, FG & FM are parts of the rasterizer (R)
  - A fragment is all the generated information for a pixel on a triangle
  - FG is Fragment Generation (finds which pixels are inside triangle)
  - FM is Fragment Merge (merges the created fragments with various buffers (Z, color))
- Not explored much at all, since:
  - Poor load balancing if uneven triangle distribution between regions.
  - Vertex shader can change triangle position


## Sort-Middle

- Sorts between G and R
- Pretty natural, since after G, we know the screen-space positions of the triangles
- Older/cheaper hardware uses this
  - Examples include InfiniteReality (from SGI) and the KYRO architecture (from Imagination)
- Spread work arbitrarily among G's
- Then depending on screen-space position, sort to different R's
  - Screen can be split into "tiles". For example:
    - Rectangular blocks (8x8 pixels)
    - Every n scanlines
- The R is responsible for rendering inside tile
- Drawbacks:
  - A triangle can be sent to many FG's depending on overlap (over tiles)
  - May give poor load balancing if triangles are unevenly distributed over the screen tiles
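The sort step of a sort-middle design can be sketched as binning screen-space triangles into tiles. The version below is a conservative illustration that uses the triangle's bounding box; the function name, tile size and coordinates are assumptions.

```python
def bin_triangles(triangles, tile_size, tiles_x, tiles_y):
    """Assign each screen-space triangle to every tile its bounding box overlaps.

    triangles: list of triangles, each a list of three (x, y) pixel positions.
    Returns {(tx, ty): [triangle indices, in input order]}.
    """
    bins = {}
    for i, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Tile range covered by the bounding box, clamped to the screen.
        tx0 = max(int(min(xs)) // tile_size, 0)
        tx1 = min(int(max(xs)) // tile_size, tiles_x - 1)
        ty0 = max(int(min(ys)) // tile_size, 0)
        ty1 = min(int(max(ys)) // tile_size, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(i)
    return bins


small = [(1, 1), (6, 1), (1, 6)]       # fits in one 8x8 tile
large = [(4, 4), (20, 4), (4, 12)]     # spans several tiles
bins = bin_triangles([small, large], tile_size=8, tiles_x=4, tiles_y=4)
```

Here the large triangle lands in six bins, which illustrates the first drawback, and tile (0, 0) receives two triangles while most tiles receive none, which illustrates the second.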



## **Sort-Last Fragment**

- Sorts between FG and FM
  - XBOX, PS3, NVIDIA use this
- Again spread work among G's
- The generated work is sent to FG's
- Then sort fragments to FM's
  - An FM is responsible for a tile of pixels
- A triangle is only sent to one FG, so this avoids doing the same work twice
- (Bad: many more fragments to sort than triangles)
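The fragment sort can be pictured as routing each fragment to the FM that owns its tile. The tile-to-FM mapping below is a hypothetical interleaving chosen only for illustration; real hardware mappings are not described on the slides.

```python
def route_fragments(fragments, tile_size, num_fms):
    """fragments: list of (x, y, triangle_id) in generation order.
    Returns one queue per FM; each queue keeps the incoming order."""
    queues = [[] for _ in range(num_fms)]
    for x, y, tri_id in fragments:
        tile = (x // tile_size) + (y // tile_size)   # assumed interleaving
        queues[tile % num_fms].append((x, y, tri_id))
    return queues


frags = [(0, 0, 0), (8, 0, 0), (0, 0, 1), (9, 1, 1)]
queues = route_fragments(frags, tile_size=8, num_fms=2)
```

Because a pixel always maps to the same FM and each queue preserves arrival order, the two fragments at pixel (0, 0) are merged in triangle order.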



## Sort-Last Image

- Sorts after entire pipeline
- So each FG & FM has a separate frame buffer for entire screen (Z and color)
  - Typically: one whole graphics card per pipeline.



- After all primitives have been sent to the pipeline, the z-buffers and color buffers are merged into one color buffer
- Can be seen as a set of independent pipelines
- Huge memory requirements!
- Used in research, but probably not commercially
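The final merge of a sort-last image design can be sketched as a per-pixel depth comparison between full-screen buffers. This is a minimal illustration for opaque geometry, assuming smaller depth means nearer; the function name and buffer contents are made up.

```python
def composite(buffers):
    """buffers: list of (color, depth) pairs of equally sized lists, one per pipeline.
    Returns the merged (color, depth), keeping the nearest fragment per pixel."""
    color = list(buffers[0][0])
    depth = list(buffers[0][1])
    for c, z in buffers[1:]:
        for i in range(len(depth)):
            if z[i] < depth[i]:
                depth[i] = z[i]
                color[i] = c[i]
    return color, depth


pipeline_a = (["a", "a", "bg"], [0.2, 0.8, 1.0])
pipeline_b = (["b", "b", "b"], [0.5, 0.3, 0.9])
merged_color, merged_depth = composite([pipeline_a, pipeline_b])
```

Each pipeline needs color and depth for the entire screen, which is where the memory cost comes from. A depth-only merge like this one also loses the per-pixel triangle order, so it does not by itself handle order-dependent operations such as blending.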






## Near-future GPUs

## **Current and Future Multicores in Graphics**

- Cell 2005
  - 8 cores à 4-float SIMD
  - 256KB L2 cache/core
  - 128 entry register file
  - 3.2 GHz
- NVIDIA 8800 GTX Nov 2006
  - 16 cores à 8-float SIMD (GTX 280 30 cores à 8, june '08)
  - 16 KB L1 cache, 64KB L2 cache (rumour)
  - 1.2-1.625 GHz
- Larrabee "2010"
  - 16-24 cores à 16-float SIMD (Xeon Phi: 61 cores, 2012)
  - Core = 16-float SIMD (=512bit FPU) + x86 proc with loops, branches + scalar ops, 4 threads/core
  - 32KB L1cache, 256KB L2-cache (512KB/core)
  - 1.7-2.4 GHz (1.1 GHz)
- NVIDIA Fermi GF100 2010, (GF110 2011)
  - 16 cores à 2x16-float SIMD (1x16 double SIMD)
  - 16/48 KB L1 cache, 768 KB L2 cache
- NVIDIA Kepler 2012 16 cores à 2x3x16=96 float SIMD
- NVIDIA Kepler 2013 16 cores à 2x6x16=192 float SIMD
- NVIDIA Titan X 2016 60 cores à 2x4x8=64 float SIMD
- NVIDIA Volta 2018 84 cores à 64 float SIMD + tensor cores (16-bit matrix mul+add)

#### PowerXCell 8i Processor – 2008

- 8 cores à 4-float SIMD
- 256KB L2 cache
- 128 entry register file
- but has better double precision







## NVIDIA year 2020

- Exaflop machine:
- Google on:
   "The Challenge of Future High-Performance Computing" Uppsala
- http://media.medfarm.uu.se/play/video/3261#\_utma=1.4337140.1361541635.1361541635.1361541635.1&\_utmb=1.4.10.1361541635&utmc=1&utmx=-&utmz=1.1361541635.1.1.utmcsr=(direct)%7Cutmccn=(direct)%7Cutmcmd=(none)&utmv=-&utmk=104508928
- Bill Dally, Chief Scientist & sr VP of Research, NVIDIA, prof. of Engineering, Stanford Univ.

 "Energy efficiency is key to performance" – Flops/W.



## If we have time...

# How to create efficient GPU programs?

# Answer: coalesced memory accesses

## **Graphics Processing Unit - GPU**






**Beyond Programmable Shading** 

## Let's look at the GPU



#### NVIDIA Fermi – GTX480, 2010. 16 cores

NVIDIA Kepler: 15-16 multi-processors (GTX 680, ~2012)

## CUDA

- A kernel (=CUDA program) is executed by hundreds to millions of threads
  - A "warp" = 32 threads, one thread per ALU
  - Warps (one to ~32) are grouped into one block
  - Block: executed on one core
    - One to 48 warps execute on a core
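The grouping of threads into warps and blocks is simple arithmetic. The sketch below assumes a 1D launch and a block size chosen by the programmer; the function name and the example sizes are illustrative.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated above


def launch_shape(num_threads, threads_per_block):
    """Return (number of blocks, warps per block) for a 1D launch."""
    blocks = math.ceil(num_threads / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return blocks, warps_per_block


big = launch_shape(1_000_000, 256)
small = launch_shape(100, 32)
```

A million threads in blocks of 256 gives 3907 blocks of 8 warps each. The last block is only partly filled, so a kernel normally has to guard against thread indices beyond the problem size.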



## Memory Accesses – Global Memory





#### 4 GB RAM

- Coalesced reads and writes
- For maximum performance, all threads of a warp should read from the same 32-float block (128 bytes)
  - i.e., the same cache-line
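A rough way to reason about coalescing is to count how many 128-byte segments one warp touches. The model below is a simplification that assumes 4-byte words, 32 threads per warp and one transaction per distinct 128-byte segment; the base address and access patterns are made up.

```python
SEGMENT_BYTES = 128  # assumed cache-line / transaction size


def transactions(byte_addresses):
    """Number of distinct 128-byte segments touched by one warp's loads."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


base = 1024                                            # 128-byte aligned
coalesced = [base + 4 * t for t in range(32)]          # thread t reads float t
misaligned = [base + 64 + 4 * t for t in range(32)]    # same pattern, shifted 64 bytes
strided = [base + 128 * t for t in range(32)]          # one float per segment

print(transactions(coalesced), transactions(misaligned), transactions(strided))
```

Under this model the three patterns need 1, 2 and 32 transactions, although every warp reads the same 128 bytes of useful data.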

## Fermi

- Global memory accesses.
- One transaction:
- Two transactions:

![](_page_54_Figure_4.jpeg)

Aligned and non-sequential (address axis from 96 to 288, threads 0 to 31):

| Compute capability:  | 1.0 and 1.1 | 1.2 and 1.3 | 2.0 |
|----------------------|-------------|-------------|-----|
| Memory transactions: | Uncached    | Uncached    | Cached |
|                      | 8 x 32B at 128<br>8 x 32B at 160<br>8 x 32B at 192<br>8 x 32B at 224 | 1 x 64B at 128<br>1 x 64B at 192 | 1 x 128B at 128 |

![](_page_54_Figure_6.jpeg)

Figure G-1. Examples of Global Memory Accesses by a Warp, 4-Byte Word per Thread, and Associated Memory Transactions Based on Compute Capability

## **Efficient Programming**

- If your program can be constructed this way, you are a winner!
- More often possible than anticipated:
  - Stream compaction
  - Prefix sums
  - Sorting
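Stream compaction built on a prefix sum is a typical example of such a construction. The sketch below runs serially in Python and is only meant to show the data flow; on a GPU the scan and the scatter would each run in parallel. The function names and the input values are illustrative.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum of values[0..i-1]."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(items, keep):
    """Keep the items for which keep(x) is true, preserving their order."""
    flags = [1 if keep(x) else 0 for x in items]
    offsets = exclusive_scan(flags)          # output slot for each kept item
    total = offsets[-1] + flags[-1] if items else 0
    out = [None] * total
    for i, x in enumerate(items):            # independent per element
        if flags[i]:
            out[offsets[i]] = x
    return out


kept = compact([5, 100, 1, 63, 79], lambda x: x > 50)
```

Every element knows its output position from the scan alone, so neighbouring threads write to neighbouring addresses and no thread has to wait for another.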

![](_page_55_Figure_6.jpeg)

Fermi: 16 multi-processors à 2x16 SIMD width


![](_page_56_Figure_2.jpeg)

## Shaders and coalesced memory accesses

- Each core (e.g. 192-SIMD) executes the same instruction per clock cycle for either a:
  - Vertex shader:
    - E.g. 192 vertices
  - Geometry shader:
    - E.g. 192 triangles
  - Fragment shader:
    - E.g. 192 pixels in blocks of at least 2x2 pixels (to compute texture filter derivatives). Here is an example of blocks of 4x8 = 32 pixels:
- However, many architectures can execute different instructions, of the same shader, for different warps (groups of 32 ALUs)

![](_page_57_Figure_9.jpeg)

![](_page_57_Picture_10.jpeg)

![](_page_57_Figure_11.jpeg)

## Shaders and coalesced memory accesses

For mipmap-filtered texture lookups in a fragment shader, this can provide coalesced memory accesses.

GPU:

| Core 1 | Core 2 | ... | Core 16 |
|--------|--------|-----|---------|
| L1$    | L1$    | ... | L1$     |

![](_page_58_Figure_3.jpeg)

## Thread utilization

- Each core executes one program (=shader)
- Each of the 192 ALUs executes one "thread" (a shader for a vertex or fragment)
- Since the core executes the same instruction for at least 32 threads (as far as the programmer is concerned), consider:

  If (...) Then a = b + c; Else a = c + d;

- The core must execute both paths if some of the 32 threads need the if-path and others need the else-path, but not if all need the same path.
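The cost of a divergent branch can be modelled by counting how many of the two branch bodies a warp has to run. This is a simplified model under the assumption stated above that 32 threads share one instruction stream; the function name is hypothetical.

```python
WARP_SIZE = 32


def warp_passes(conditions):
    """conditions: one bool per thread in the warp.
    Returns how many of the two branch bodies the warp must execute."""
    takes_if = any(conditions)
    takes_else = not all(conditions)
    return int(takes_if) + int(takes_else)


uniform = warp_passes([True] * WARP_SIZE)                    # all take the if-path
divergent = warp_passes([t % 2 == 0 for t in range(WARP_SIZE)])
```

A single thread taking the other path is enough to make the whole warp execute both bodies, with the inactive threads masked out during each pass.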

## Need to know:

- Perspective-correct interpolation (e.g. for textures)
- Taxonomy:
  - Sort first
  - Sort middle
  - Sort last fragment
  - Sort last image
- Bandwidth
  - Why it is a problem and how to "solve" it
    - L1 / L2 caches
    - Texture caching with prefetching
    - Texture compression, Z-compression, Z-occlusion testing (HyperZ)
- Be able to sketch the functional blocks and relation to hardware for a modern graphics card (next slide→)

Linearly interpolate  $(u_i/w_i, v_i/w_i, 1/w_i)$  in screenspace from each triangle vertex i. Then at each pixel:

$$u_{ip} = (u/w)_{ip} / (1/w)_{ip}$$
  
$$v_{ip} = (v/w)_{ip} / (1/w)_{ip}$$

where ip = screen-space interpolated value from the triangle vertices.

![](_page_60_Figure_16.jpeg)


![](_page_61_Figure_2.jpeg)