Notes from David Kirk's lecture: CUDA and Tesla GPU Computing

TODO: proofread these notes

Tesla GPU Computing

  • GeForce, Quadro, Tesla
  • don't have to use the fixed graphics pipeline
  • 50 - 200 GFLOPS
    • on ordinary C programs
    • performance pulled along by the insatiable demands of the PC game market
  • parallelism is doubling every year
  • SPMD
    • Single program multiple data
      • exploit data parallelism in application
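The SPMD idea can be sketched as a minimal CUDA kernel: every thread runs the same program, and each one selects its own data element from its indices (the kernel name and sizes are illustrative, not from the talk).

```cuda
// SPMD: single program, multiple data. All threads execute this same
// function; each computes one output element chosen by its own indices.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}
```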

Tesla Server

GPU Computing Architecture

  • 1.5GB DRAM 75GB/s peak 800MHz GDDR3 clock
  • massively multithreaded parallel computing platform
  • 12288 concurrent threads, hardware managed
    • managed by hardware scheduler
  • SM: multithreaded Streaming Multiprocessor
  • IEEE 754 32-bit floating point
  • 32-bit integer
  • SM has SFU Special Function Units
    • sin, cos, sqrt, log etc
  • Scalar ISA
    • barrier sync.
    • Shared memory
  • each SM runs 768 threads at a time
  • threads run in smaller chunks called warps
  • executed in SIMD
    • but don't have to write a SIMD program
  • floating-point compliance
    • not compliant with the full IEEE 754 floating-point spec
      • implements a subset
    • 1/4 throughput for special functions: RCP, RSQRT, EXP2, LOG2, SIN, COS
  • evaluates function approximations
    • quadratic interpolation with enhanced minimax approximation
    • Interpolates pixel attributes
  • accuracy ranges from 22.5 to 24.0 bits
  • SM SIMD Multithreaded Execution
    • SM hardware implements zero-overhead warp and thread scheduling
      • context switch with zero penalty
  • SIMD warp diverges and converges when threads branch independently
  • 24 warps (768 threads) per SM, all scheduled by the hardware scheduler
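Divergence and reconvergence can be shown with a small sketch (illustrative, not from the talk): when lanes of a 32-thread warp branch differently, the hardware serializes the two paths with inactive lanes masked off, then reconverges.

```cuda
// Warp divergence: even and odd lanes take different branches, so the
// hardware runs the two paths one after the other within the warp.
__global__ void divergeExample(float *x)
{
    int i = threadIdx.x;
    if (i % 2 == 0)              // even lanes take this path...
        x[i] = x[i] * 2.0f;
    else                         // ...odd lanes take this one, serialized
        x[i] = x[i] + 1.0f;
    // all lanes of the warp are active again here (reconverged)
}
```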

Multithreading and Thread Arrays

Data Parallel Problem Decomposition

  • cooperative thread array
    • array of threads that can cooperate: shared memory, memory barrier, locks
    • sized 1 - 512
    • can be layered as 1D, 2D, 3D
  • Per-CTA shared memory
    • keeps data close to processor: minimize trips to global memory
    • cf. global memory access: 76 GB/sec GDDR DRAM
      • high bandwidth, but high latency
      • keep the processor busy if you want to use it
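The "keep data close to the processor" pattern can be sketched as follows (kernel name and tile size are illustrative): each CTA stages a tile of global memory into per-CTA shared memory once, synchronizes, then reuses it, so one high-latency DRAM trip serves many accesses.

```cuda
#define TILE 256

// Sketch: stage a tile of global memory into fast per-CTA shared memory,
// barrier, then compute from the tile instead of re-reading DRAM.
__global__ void sumNeighbors(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];                 // per-CTA shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one trip to global memory
    __syncthreads();                             // barrier: tile is ready

    float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0.0f;
    if (i < n)
        out[i] = left + tile[threadIdx.x];       // reuse from shared memory
}
```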

Data Parallel Levels

  • Thread
    • Computes result elements
    • thread id #
  • CTA
  • Grid of CTAs
  • Sequential Grids
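These levels map directly onto CUDA's built-in index variables; a small sketch (names illustrative):

```cuda
// Thread: computes one result element, identified by threadIdx.
// CTA:    a block of threads, identified by blockIdx within the grid.
// Grid:   all CTAs of one kernel launch; grids launched in sequence run
//         one after another, so a later grid sees the earlier one's results.
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide element id
    if (i < n)
        data[i] += 1.0f;
}

// Host side (sequential grids):
//   step<<<grid, block>>>(d, n);   // grid 1
//   step<<<grid, block>>>(d, n);   // grid 2, runs after grid 1 completes
```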

Parallel Memory Sharing

  • Local memory
    • each thread has a private memory
      • auto variables, register spill, etc
  • Shared memory
    • shared by the threads of a CTA
    • inter-thread communication
  • Global Memory
    • shared by all threads
    • Inter-grid communication
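The three memory levels correspond to distinct declarations in CUDA source; a sketch with illustrative names:

```cuda
__device__ float globalBuf[256];      // global memory: shared by all threads

__global__ void memoryLevels(float *out)
{
    float priv = threadIdx.x * 0.5f;  // local: private to this thread
    __shared__ float tile[256];       // shared: one copy per CTA

    tile[threadIdx.x] = priv;         // inter-thread communication in a CTA
    __syncthreads();                  // barrier before other threads read it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tile[threadIdx.x] + globalBuf[threadIdx.x];
}
```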

Transparent Scalability

  • Transparent GPU scaling
    • ranges from 8 cores to many 100s of cores
    • ranges from 100 to many 100s of threads
    • doubling yearly
  • Challenge: Scale computing performance with GPU parallelism
    • program must be insensitive to the number of cores: be abstract
    • write one program for any number of SM cores
    • program runs on any size GPU without recompiling
      • the hardware does all the scheduling
  • Key: the transparent scalability!
    • Programmer level: decompose the problem into sequential steps (grids), grids into CTAs, CTAs into threads
    • Hardware: distributes CTA work to available SM cores
    • CTA program computes a Block independently of others
      • enables parallel computing of blocks of grid

CUDA Programming Model

  • SPMD model : single program multiple data programming model

CUDA: C on the GPU

  • C program for a thread of a thread block in a grid
  • extend C only where necessary
  • simple explicit language mapping to parallel threads
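The "extend C only where necessary" point shows up mainly in the kernel launch syntax; a minimal host-plus-kernel sketch (names and sizes illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n)   // one thread per element
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    int block = 256;                      // threads per CTA (1..512)
    int grid  = (n + block - 1) / block;  // enough CTAs to cover n
    scale<<<grid, block>>>(x, 2.0f, n);   // the C extension: <<<grid, block>>>
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```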

My Questions

  • 32-bit floating point
    • some applications DO need double precision
      • double-precision support will be coming out in a few months
  • OpenGL or CUDA for offline renderer
    • completely depends on rendering algorithms:
      • if fits in current graphics pipeline -> OpenGL, DirectX
      • completely different architecture -> CUDA

US NVidia internships

  • 200-300 people
  • about half are hired as full-time employees