Notes from David Kirk's lecture: CUDA and Tesla GPU Computing

TODO: proofread these notes

Tesla GPU Computing

  • GeForce, Quadro, Tesla
  • don't have to use the fixed graphics pipeline
  • 50 - 200 GFLOPS
    • on ordinary C programs
    • performance pulled along by the insatiable demands of the PC game market
  • parallelism is doubling every year
  • SPMD
    • Single program multiple data
      • exploit data parallelism in application
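The SPMD idea can be sketched as a minimal CUDA kernel: every thread runs the same program, and each one selects its own data element from its indices (the kernel name and sizes are illustrative, not from the talk).

```cuda
// SPMD: single program, multiple data. All threads execute this same
// function; each computes one output element chosen by its own indices.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}
```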

Tesla Server

GPU Computing Architecture

  • 1.5GB DRAM 75GB/s peak 800MHz GDDR3 clock
  • massively multithreaded parallel computing platform
  • 12288 concurrent threads, hardware managed
    • managed by hardware scheduler
  • SM: multithreaded Streaming Multiprocessor
  • IEEE 754 32-bit floating point
  • 32-bit integer
  • SM has SFU Special Function Units
    • sin, cos, sqrt, log etc
  • Scalar ISA
    • barrier sync.
    • Shared memory
  • each SM runs 768 threads at a time
  • threads run in smaller chunks called warps
  • executed in SIMD
    • but don't have to write a SIMD program
  • floating-point compliance
    • not compliant with the full IEEE 754 floating-point spec
      • implements a subset
    • 1/4 throughput for special functions: RCP, RSQRT, EXP2, LOG2, SIN, COS
  • evaluates function approximations
    • quadratic interpolation with enhanced minimax approximation
    • Interpolates pixel attributes
  • accuracy ranges from 22.5 to 24.0 bits
  • SM SIMD Multithreaded Execution
    • SM hardware implements zero-overhead warp and thread scheduling
      • context switch with zero penalty
  • SIMD warp diverges and converges when threads branch independently
  • 24 warps (768 threads) per SM, all scheduled by the hardware scheduler
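Divergence and reconvergence can be shown with a small sketch (illustrative, not from the talk): when lanes of a 32-thread warp branch differently, the hardware serializes the two paths with inactive lanes masked off, then reconverges.

```cuda
// Warp divergence: even and odd lanes take different branches, so the
// hardware runs the two paths one after the other within the warp.
__global__ void divergeExample(float *x)
{
    int i = threadIdx.x;
    if (i % 2 == 0)              // even lanes take this path...
        x[i] = x[i] * 2.0f;
    else                         // ...odd lanes take this one, serialized
        x[i] = x[i] + 1.0f;
    // all lanes of the warp are active again here (reconverged)
}
```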

Multithreading and Thread Arrays

Data Parallel Problem Decomposition

  • cooperative thread array
    • array of threads that can cooperate: shared memory, memory barrier, locks
    • sized 1 - 512
    • can be layered as 1D, 2D, 3D
  • Per-CTA shared memory
    • keeps data close to processor: minimize trips to global memory
    • cf. global memory access: 76 GB/sec GDDR DRAM
      • high bandwidth, but high latency
      • keep the processor busy if you want to use it
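The "keep data close to the processor" pattern can be sketched as follows (kernel name and tile size are illustrative): each CTA stages a tile of global memory into per-CTA shared memory once, synchronizes, then reuses it, so one high-latency DRAM trip serves many accesses.

```cuda
#define TILE 256

// Sketch: stage a tile of global memory into fast per-CTA shared memory,
// barrier, then compute from the tile instead of re-reading DRAM.
__global__ void sumNeighbors(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];                 // per-CTA shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one trip to global memory
    __syncthreads();                             // barrier: tile is ready

    float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0.0f;
    if (i < n)
        out[i] = left + tile[threadIdx.x];       // reuse from shared memory
}
```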

Data Parallel Levels

  • Thread
    • Computes result elements
    • thread id #
  • CTA
  • Grid of CTAs
  • Sequential Grids
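These levels map directly onto CUDA's built-in index variables; a small sketch (names illustrative):

```cuda
// Thread: computes one result element, identified by threadIdx.
// CTA:    a block of threads, identified by blockIdx within the grid.
// Grid:   all CTAs of one kernel launch; grids launched in sequence run
//         one after another, so a later grid sees the earlier one's results.
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide element id
    if (i < n)
        data[i] += 1.0f;
}

// Host side (sequential grids):
//   step<<<grid, block>>>(d, n);   // grid 1
//   step<<<grid, block>>>(d, n);   // grid 2, runs after grid 1 completes
```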

Parallel Memory Sharing

  • Local memory
    • each thread has a private memory
      • auto variables, register spill, etc
  • Shared memory
    • shared by the threads of a CTA
    • inter-thread communication
  • Global Memory
    • shared by all threads
    • Inter-grid communication
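The three memory levels correspond to distinct declarations in CUDA source; a sketch with illustrative names:

```cuda
__device__ float globalBuf[256];      // global memory: shared by all threads

__global__ void memoryLevels(float *out)
{
    float priv = threadIdx.x * 0.5f;  // local: private to this thread
    __shared__ float tile[256];       // shared: one copy per CTA

    tile[threadIdx.x] = priv;         // inter-thread communication in a CTA
    __syncthreads();                  // barrier before other threads read it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tile[threadIdx.x] + globalBuf[threadIdx.x];
}
```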

Transparent Scalability

  • Transparent GPU scaling
    • ranges from 8 cores to many 100s of cores
    • ranges from 100 to many 100s of threads
    • doubling yearly
  • Challenge: Scale computing performance with GPU parallelism
    • program must be insensitive to the number of cores: be abstract
    • write one program for any number of SM cores
    • program runs on any size GPU without recompiling
      • the hardware does all the scheduling
  • Key: the transparent scalability!
    • Programmer level: decompose the problem into sequential steps (grids), grids into CTAs, CTAs into threads
    • Hardware: distributes CTA work to available SM cores
    • CTA program computes a Block independently of others
      • enables parallel computing of blocks of grid

CUDA Programming Model

  • SPMD model : single program multiple data programming model

CUDA: C on the GPU

  • C program for a thread of a thread block in a grid
  • extend C only where necessary
  • simple explicit language mapping to parallel threads
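The "extend C only where necessary" point shows up mainly in the kernel launch syntax; a minimal host-plus-kernel sketch (names and sizes illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n)   // one thread per element
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    int block = 256;                      // threads per CTA (1..512)
    int grid  = (n + block - 1) / block;  // enough CTAs to cover n
    scale<<<grid, block>>>(x, 2.0f, n);   // the C extension: <<<grid, block>>>
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```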

My Questions

  • 32-bit floating point
    • some applications DO need double precision
      • double-precision support will be coming out in a few months
  • OpenGL or CUDA for offline renderer
    • completely depends on rendering algorithms:
      • if fits in current graphics pipeline -> OpenGL, DirectX
      • completely different architecture -> CUDA

US NVidia internships

  • 200-300 people
  • about half are hired as full-time employees