Notes from David Kirk's lecture: CUDA and Tesla GPU Computing
TODO: proofread these notes
Tesla GPU Computing
Tesla Server
- PCI Express で接続する
GPU Computing Architecture
- same architecture : GeForce 8800
- 128 thread processors
- 518 GFLOPS peak
- 1.35GHz processor clock
- 1.5GB DRAM, 75GB/s peak, 800MHz GDDR3 clock
- Massively multithreaded parallel computing platform
- 12288 concurrent threads, hardware managed
- managed by hardware scheduler
- SM: Streaming Multiprocessor (multithreaded)
- IEEE 754 32-bit floating point
- 32-bit integer
- SM has SFU Special Function Units
- sin, cos, sqrt, log etc
- Scalar ISA
- barrier sync.
- Shared memory
- each SM runs up to 768 threads at a time
- threads run in smaller chunks called warps
- executed in SIMD fashion
- but you don't have to write a SIMD program
- Floating-point compliance
- not compliant with the full IEEE 754 floating-point spec
- implements a subset
- 1/4 throughput for special functions: RCP, RSQRT, EXP2, LOG2, SIN, COS
- evaluates function approximations
- quadratic interpolation with enhanced minimax approximation
- Interpolates pixel attributes
- Accuracy ranges from 22.5 to 24.0 good bits
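The SFU fast paths above are exposed in CUDA as intrinsics. A minimal sketch (the kernel name and buffers are illustrative, not from the lecture) contrasting the fast approximations with the standard math functions:

```cuda
// Sketch: fast SFU intrinsics vs. standard math functions.
// __sinf/__expf map to the SFU approximation hardware described above;
// sinf/expf use slower, more accurate instruction sequences.
__global__ void sfu_demo(const float *in, float *fast_out,
                         float *accurate_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast_out[i]     = __sinf(in[i]) * __expf(in[i]);  // SFU approximations
        accurate_out[i] = sinf(in[i])   * expf(in[i]);    // full-precision path
    }
}
```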
- SM SIMD Multithreaded Execution
- SM hardware implements zero-overhead warp and thread scheduling
- context switch with zero penalty
Multithreading and Thread Arrays
Data Parallel Problem Decomposition
- cooperative thread array
- array of threads that can cooperate: shared memory, memory barrier, locks
- sized from 1 to 512 threads
- can be shaped as 1D, 2D, or 3D
- Per-CTA shared memory
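The 1D/2D/3D shaping of CTAs is expressed with `dim3`. A hedged sketch (the `invert` kernel and image layout are illustrative assumptions) covering a 2D image with 16x16-thread CTAs:

```cuda
// Each CTA covers a 16x16 tile of the image; each thread one pixel.
__global__ void invert(unsigned char *img, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        img[y * w + x] = 255 - img[y * w + x];
}

// Host-side launch configuration:
// dim3 cta(16, 16);                          // 256 threads per CTA
// dim3 grid((w + 15) / 16, (h + 15) / 16);   // enough CTAs to cover the image
// invert<<<grid, cta>>>(d_img, w, h);
```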
Data Parallel Levels
- Thread
- Computes result elements
- thread id #
- CTA
- Grid of CTAs
- Sequential Grids
Parallel Memory Sharing
- Local memory
- each thread has a private memory
- auto variables, register spill, etc
- Shared memory
- shared by the threads of CTA
- Inter-thread communication
- Global Memory
- shared by all threads
- Inter-grid communication
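The three memory levels map directly onto CUDA qualifiers. A minimal sketch (the kernel is illustrative and assumes a CTA of exactly 256 threads) of per-CTA shared memory plus barrier synchronization:

```cuda
// Reverse each CTA's elements by staging them in per-CTA shared memory.
__global__ void block_reverse(float *data)   // data: global memory
{
    __shared__ float tile[256];              // shared by the threads of the CTA
    int t = threadIdx.x;                     // local (per-thread) variable
    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();                         // barrier: all writes visible
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}
```

Without the `__syncthreads()` barrier, a thread could read a `tile` slot another thread has not written yet.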
Transparent Scalability
- Transparent GPU scaling
- Challenge: Scale computing performance with GPU parallelism
- program must be insensitive to the number of cores: be abstract
- write one program for any number of SM cores
- Program runs on any size GPU without recompiling
- the hardware does all the scheduling
- Key: the transparent scalability!
- Programmer level
- Decomposes problem into sequential steps: grids
- CTAs
- threads
- Hardware: distributes CTA work to available SM cores
- CTA program computes a Block independently of others
- enables parallel computing of blocks of grid
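The scalability point above can be sketched in host code: the grid is sized from the problem, never from the SM count, so the same binary runs on any GPU (`kernel`, `d_data`, and `n` are placeholder names):

```cuda
// The grid size depends only on the problem size n; the hardware
// scheduler distributes the resulting CTAs to however many SMs exist.
int threads_per_cta = 256;
int num_ctas = (n + threads_per_cta - 1) / threads_per_cta;  // round up
kernel<<<num_ctas, threads_per_cta>>>(d_data, n);
```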
CUDA Programming Model
- SPMD: single-program, multiple-data programming model
CUDA: C on the GPU
- C program for a thread of a thread block in a grid
- extend C only where necessary
- simple explicit language mapping to parallel threads
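A minimal sketch of those C extensions (the `scale` kernel and host outline are illustrative): a `__global__` function runs on the GPU, and the `<<<grid, block>>>` syntax maps it explicitly onto parallel threads.

```cuda
// Kernel: one thread scales one element.
__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= s;
}

int main(void)
{
    // ... allocate d_x with cudaMalloc and copy inputs with cudaMemcpy ...
    // Launch: enough 256-thread CTAs to cover n elements.
    // scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    return 0;
}
```

Everything outside the kernel and the launch syntax is ordinary C, which is the "extend C only where necessary" point above.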