CUDA Basics

Learn the fundamental concepts of CUDA programming.

CUDA Program Structure

A typical CUDA program consists of code that runs on both the CPU (host) and GPU (device). The host code manages memory and launches kernels, while the device code runs in parallel on the GPU.

Your First CUDA Program

Let's look at a simple CUDA program that prints "Hello World" from multiple threads:

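A minimal program matching the key components described below might look like this (a sketch; the kernel name and exact message format are illustrative):

```cuda
#include <cstdio>

// Kernel: runs on the GPU; each thread prints its block and thread index
__global__ void helloKernel() {
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch the kernel with 2 blocks of 4 threads each (8 threads total)
    helloKernel<<<2, 4>>>();

    // Wait for the GPU to finish so the program doesn't exit
    // before the device-side printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}
```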

Key Components:

  • The __global__ keyword marks a function (a kernel) that runs on the GPU but is launched from host code
  • blockIdx.x and threadIdx.x are built-in variables that give each thread its block index and its thread index within that block
  • The <<<2, 4>>> launch syntax runs the kernel with 2 blocks of 4 threads each (8 threads in total)
  • cudaDeviceSynchronize() blocks the host until all previously launched GPU work has completed
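To try this outside the playground, save the code as, say, hello.cu and compile it with nvcc (nvcc hello.cu -o hello). Expect eight lines of output, but not in any fixed order: CUDA makes no guarantees about the order in which blocks and threads execute.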

Thread Hierarchy

CUDA organizes threads in a hierarchical structure:

  • Threads are grouped into blocks
  • Blocks are organized into a grid
  • This hierarchy allows CUDA to scale across different GPU architectures
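In practice, the block and thread indices are combined to give each thread a unique global index into the data. The sketch below (hypothetical names and sizes, not from this tutorial) uses that index so each thread scales one array element:

```cuda
#include <cstdio>

// Each thread handles one element, identified by a unique global index
__global__ void scaleKernel(float *data, int n) {
    // blockDim.x is the number of threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the grid may contain more threads than elements
        data[i] *= 2.0f;
    }
}

int main() {
    const int n = 1000;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // memory visible to host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Round the block count up so every element is covered
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(data, n);

    cudaDeviceSynchronize();
    printf("data[0] = %.1f\n", data[0]);  // expected: 2.0
    cudaFree(data);
    return 0;
}
```

Because the same index formula works for any block size, the kernel scales from a small GPU to a large one just by changing the launch configuration.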

Memory Model

CUDA provides different types of memory:

  • Global memory - accessible by all threads
  • Shared memory - shared between threads in a block
  • Local memory - private to each thread
  • Constant memory - read-only memory accessible by all threads
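To see how these spaces interact, the sketch below (hypothetical, not from this tutorial) stages data from global memory into shared memory, synchronizes the block, and writes the data back reversed:

```cuda
#include <cstdio>

#define N 64  // one block of N threads

// Reverse an array within a single block using shared memory
__global__ void reverseKernel(int *out, const int *in) {
    __shared__ int tile[N];      // shared memory: visible to all threads in this block
    int t = threadIdx.x;

    tile[t] = in[t];             // each thread loads one element from global memory
    __syncthreads();             // wait until every thread has filled its slot

    out[t] = tile[N - 1 - t];    // write the mirrored element back to global memory
}

int main() {
    int h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(int));   // allocations in global memory
    cudaMalloc(&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    reverseKernel<<<1, N>>>(d_out, d_in); // one block of N threads

    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %d (expected %d)\n", h_out[0], N - 1);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Without __syncthreads(), a thread could read its mirrored slot before the owning thread has written it; the barrier is what makes the shared tile safe to read.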

Try modifying and running the example code in our playground to better understand how CUDA threads work.