CUDA Threads.pptx

5/4/11 CUDA Threads
James Gain, Michelle Kuttel, Sebastian Wyngaard,
Simon Perkins and Jason Brownbridge
{ jgain | mkuttel | sperkins |jbrownbr}@cs.uct.ac.za
swyngaard@csir.co.za
3-6 May 2011
Origins
!  The CPU processing “core”
Origins
!  But graphics operations don’t need this
!  Highly parallel/synchronous execution
!  Many operations have spatial locality
!  Heavy use of SIMD
!  So we throw away the cache and other unnecessary overhead
Origins
!  Use multiple cores with many lightweight threads running concurrently on different cores
Origins
!  Still inefficient…
!  One processor, one thread
!  Yet, threads differ only in input and output
!  The execution path is the same!
Solution?
(SIMD)
Origins
!  Branching can still hurt performance.
!  Watch out for __syncthreads() inside divergent branches
Origins
!  Aren’t long wait times a concern without a data cache? No. Just interleave the processing: while one thread waits on memory, run another.
Some (G80) hardware info
[Figure: G80 hardware hierarchy. The Streaming Processor Array is made up of Texture Processor Clusters (TPC). Each TPC holds a texture unit (TEX) and Streaming Multiprocessors (SM). Each SM holds Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, SPs and SFUs.]
CUDA threads
!  IDs help determine the data to work on
!  Block ID: 1D or 2D
!  Thread ID: 1D, 2D, or 3D
!  Simplifies memory addressing…
!  Yesterday’s prac…
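The way IDs select data can be sketched on the host. This is a plain C++ model rather than device code: `Dim2` and `element_index` are hypothetical names standing in for CUDA's built-in `blockIdx`, `blockDim` and `threadIdx`, and the row-major layout with a given `width` is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CUDA's built-in index types (host-side model only).
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Flat row-major index of the element handled by one thread. In a real kernel
// the three Dim2 arguments would come from blockIdx, blockDim and threadIdx.
std::size_t element_index(Dim2 block_id, Dim2 block_dim, Dim2 thread_id,
                          unsigned width) {
    assert(thread_id.x < block_dim.x && thread_id.y < block_dim.y);
    unsigned col = block_id.x * block_dim.x + thread_id.x;  // global x
    unsigned row = block_id.y * block_dim.y + thread_id.y;  // global y
    return static_cast<std::size_t>(row) * width + col;
}
```

For example, thread (3, 4) of block (1, 2) with 16 x 16 blocks lands on column 19, row 36 of the data, so each thread gets a distinct element without any lookup table.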
CUDA thread blocks
!  So multiple threads produce good parallelism…
But why do we need thread blocks?
CUDA thread blocks
!  All threads in the same block execute the same instruction
!  Threads in a block can share data
!  They can synchronise while doing their share of the work
!  But threads in different blocks cannot communicate
!  “Transparent scalability”: lack of communication is a boon!
!  Since blocks can execute in any order
!  Threads of a block execute together on a single streaming multiprocessor
!  Thus, an increase in multiprocessor count produces a proportional increase in parallelism
CUDA thread blocks
!  G80 limitations (for example)
!  Grid dimension limit is 64K in x or y
!  Block dimension limit is
!  At most 512 threads, but also
!  dim x <= 512
!  dim y <= 512
!  dim z <= 64
!  And you can only have 768 threads per SM
!  OR whichever of these 5 limitations you hit first!
!  Question: Which of these are the most efficient block dimensions to use given this architectural description: 4x4, 8x8, 16x8, 16x16, 32x32?
!  HINT: SM occupancy…
CUDA thread blocks
!  So, out of the two possibilities below, which would you choose?
!  16 x 8 = 128 threads; 768 / 128 = 6 blocks < 8 block maximum, so we have full occupancy (768 threads) on a single SM
!  16 x 16 = 256 threads; 768 / 256 = 3 blocks < 8 block maximum, so again we have full occupancy
!  Answer: 16 x 8
Reason: With a larger block count, this option offers more opportunity to swap out stalled blocks. Plus, it uses more streaming processors at once.
Actually, there is another good reason for choosing a block size with an “x” dimension that is a multiple of 16…
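The occupancy arithmetic from the question can be checked with a small host-side sketch. It uses only the G80 figures quoted on the slides (512 threads per block, 768 threads per SM, 8 blocks per SM); the function names are hypothetical, and register and shared-memory limits, which can also cap occupancy on real hardware, are deliberately ignored.

```cpp
#include <algorithm>
#include <cassert>

// G80 figures quoted on the slides; other GPUs have different limits.
constexpr int kMaxThreadsPerBlock = 512;
constexpr int kMaxThreadsPerSM = 768;
constexpr int kMaxBlocksPerSM = 8;

// Number of blocks of block_x * block_y threads that can be resident on one SM.
// Returns 0 if the block itself exceeds the per-block thread limit.
int resident_blocks(int block_x, int block_y) {
    assert(block_x > 0 && block_y > 0);
    int threads = block_x * block_y;
    if (threads > kMaxThreadsPerBlock) return 0;
    return std::min(kMaxThreadsPerSM / threads, kMaxBlocksPerSM);
}

// Threads resident on one SM for this block shape (768 means full occupancy).
int resident_threads(int block_x, int block_y) {
    return resident_blocks(block_x, block_y) * block_x * block_y;
}
```

Under these assumptions 4x4 and 8x8 hit the 8-block limit before filling the SM (128 and 512 threads), 32x32 is not a legal block at all, and only 16x8 and 16x16 reach all 768 threads.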
CUDA Thread blocks: Warps
!  SM divides each thread block into warps
!  32 threads: [0,1,...,31], [32,33,...,63], etc.
!  SM exchanges stalled warps for waiting warps
!  A warp stalls if any thread in the warp stalls
!  If it stalls, we swap in some other warp
!  With zero-overhead thread scheduling
!  i.e. a reason for throwing away the data cache and branch logic!
!  Can sync all threads in a block (warp)
!  SM waits for all threads to reach sync point
!  Avoids read-after-write, write-after-write,... errors
!  A sync inside a conditional is allowed, but the condition must be uniform across the entire thread block.
!  OK, but you mentioned 16 was a good dimension…
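How a block is cut into warps can be modelled in a few lines of host-side C++. The sketch assumes the usual CUDA linearisation of a 2D block (x varies fastest) and the warp size of 32 from the slide; `WarpSlot`, `warp_slot` and `warps_per_block` are hypothetical names.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // warp size quoted on the slide

struct WarpSlot {
    int warp;  // which warp of the block the thread falls in
    int lane;  // position of the thread inside that warp
};

// Linear thread ID within a 2D block (x varies fastest), split into warp and lane.
WarpSlot warp_slot(int tx, int ty, int block_x) {
    assert(tx >= 0 && tx < block_x && ty >= 0);
    int linear = ty * block_x + tx;
    return {linear / kWarpSize, linear % kWarpSize};
}

// Warps needed for a block; a partially filled last warp still occupies a full warp.
int warps_per_block(int block_x, int block_y) {
    return (block_x * block_y + kWarpSize - 1) / kWarpSize;
}
```

With a block that is 16 threads wide, every row of the block fills exactly half of a warp, which is where the interest in 16 comes from.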
CUDA Thread blocks: Warps
!  Warps are not part of the CUDA specification
!  This means NVIDIA can do pretty much anything they want here
!  If you want scalable long-lasting code, then perhaps warps aren’t for you
!  The only programmatic significance of a warp arises when you divide it into two half-warps (16 threads)
!  When a half-warp accesses global memory the possibility for coalesced access arises
!  Much like pre-fetching, where you get spatially local accesses for free
!  However, there are some steep requirements…
CUDA Thread blocks: Warps
!  To get coalesced memory access you need to have
!  Arrays aligned on 4/8/16 byte boundaries
!  Half-warp threads serially accessing consecutive memory addresses
!  A thread usage pattern where only the “x”-direction is significant
!  Generally, means accessing things in the correct order
!  This is actually one good use for shared memory
!  …an illustration is in order…
!  The base address of these memory accesses must be aligned to a multiple of 16 times the element size (of the array data being accessed)
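The requirements above can be turned into a rough host-side checker. This is a simplified model, not the hardware's actual rule set: it assumes thread k of the half-warp must touch element k of a consecutive run, and that the first address must be aligned to 16 times the element size, which is an assumption based on the rules for early CUDA hardware. `is_coalesced` and `strided` are hypothetical helpers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHalfWarp = 16;

// Simplified model: a half-warp access counts as coalesced when thread k reads
// element k of a run of consecutive elements and the first address is aligned.
// The 16 * elem_size alignment is an assumption, not taken from the slides.
bool is_coalesced(const std::array<std::uintptr_t, kHalfWarp>& addr,
                  std::size_t elem_size) {
    assert(elem_size == 4 || elem_size == 8 || elem_size == 16);
    if (addr[0] % (kHalfWarp * elem_size) != 0) return false;
    for (std::size_t k = 1; k < kHalfWarp; ++k) {
        if (addr[k] != addr[0] + k * elem_size) return false;
    }
    return true;
}

// Helper for experiments: addresses produced when thread k reads element k * stride.
std::array<std::uintptr_t, kHalfWarp> strided(std::uintptr_t base,
                                              std::size_t elem_size,
                                              std::size_t stride) {
    std::array<std::uintptr_t, kHalfWarp> addr{};
    for (std::size_t k = 0; k < kHalfWarp; ++k) {
        addr[k] = base + k * stride * elem_size;
    }
    return addr;
}
```

A unit-stride read from an aligned base passes the check, while a stride of 2 or a base shifted by one element fails it. Those failing patterns are the ones usually staged through shared memory.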
CUDA Thread blocks: Warps
!  Question:
!  Suppose you decide to use 16x16 thread blocks. How many warps are there per SM?
!  Answer:
!  768 / (16x16) = 3 blocks of 256 threads
!  256 / 32 = 8 warps per block
!  8 warps per block means 8 * 3 = 24 warps on an SM
CUDA Thread blocks: Warps
!  Quick summary of warps
!  Essentially scheduling units of the SM
!  One and only one warp executes at a time on an SM
!  I’ll repeat that: At any point in time, only one warp is executed by an SM
!  Warps are scheduled by some unknown hardware algorithm according to some unspecified priority metric. NVIDIA has the answers but isn’t sharing.
!  They’re an implementation decision by NVIDIA, so everything I just told you could be a lie.
Some final points…
!  Bordering blocks do not in general run on the same streaming multiprocessor
!  Blocks cannot synchronise during kernel execution. The best you can do is wait for the kernel to finish.
!  You will live a long and productive life if you forget about warps. IMHO, algorithmic considerations and texture memory (that is, cached) accesses will likely bring you much joy. This is the substance of the next talk.