CUDA Basics

Learn the fundamental concepts of CUDA programming.

CUDA Program Structure

A typical CUDA program consists of code that runs on both the CPU (host) and GPU (device). The host code manages memory and launches kernels, while the device code runs in parallel on the GPU.

Your First CUDA Program

Let's look at a simple CUDA program that prints "Hello World" from multiple threads:

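A minimal program matching the key components described below might look like this (a sketch; the kernel name and exact message format are illustrative):

```cuda
#include <cstdio>

// Kernel: runs on the GPU; each thread prints its block and thread index
__global__ void helloKernel() {
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch the kernel with 2 blocks of 4 threads each (8 threads total)
    helloKernel<<<2, 4>>>();

    // Wait for the GPU to finish so the program doesn't exit
    // before the device-side printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}
```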

Key Components:

  • The __global__ keyword marks a function (a kernel) that runs on the GPU but is launched from host code
  • blockIdx.x and threadIdx.x are built-in variables that give each thread its block index and its thread index within that block
  • The <<<2, 4>>> launch syntax runs the kernel with 2 blocks of 4 threads each (8 threads in total)
  • cudaDeviceSynchronize() blocks the host until all previously launched GPU work has completed
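To try this outside the playground, save the code as, say, hello.cu and compile it with nvcc (nvcc hello.cu -o hello). Expect eight lines of output, but not in any fixed order: CUDA makes no guarantees about the order in which blocks and threads execute.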

Thread Hierarchy

CUDA organizes threads in a hierarchical structure:

  • Threads are grouped into blocks
  • Blocks are organized into a grid
  • This hierarchy allows CUDA to scale across different GPU architectures
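In practice, the block and thread indices are combined to give each thread a unique global index into the data. The sketch below (hypothetical names and sizes, not from this tutorial) uses that index so each thread scales one array element:

```cuda
#include <cstdio>

// Each thread handles one element, identified by a unique global index
__global__ void scaleKernel(float *data, int n) {
    // blockDim.x is the number of threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the grid may contain more threads than elements
        data[i] *= 2.0f;
    }
}

int main() {
    const int n = 1000;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // memory visible to host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Round the block count up so every element is covered
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(data, n);

    cudaDeviceSynchronize();
    printf("data[0] = %.1f\n", data[0]);  // expected: 2.0
    cudaFree(data);
    return 0;
}
```

Because the same index formula works for any block size, the kernel scales from a small GPU to a large one just by changing the launch configuration.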

Memory Model

CUDA provides different types of memory:

  • Global memory - accessible by all threads
  • Shared memory - shared between threads in a block
  • Local memory - private to each thread
  • Constant memory - read-only memory accessible by all threads
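To see how these spaces interact, the sketch below (hypothetical, not from this tutorial) stages data from global memory into shared memory, synchronizes the block, and writes the data back reversed:

```cuda
#include <cstdio>

#define N 64  // one block of N threads

// Reverse an array within a single block using shared memory
__global__ void reverseKernel(int *out, const int *in) {
    __shared__ int tile[N];      // shared memory: visible to all threads in this block
    int t = threadIdx.x;

    tile[t] = in[t];             // each thread loads one element from global memory
    __syncthreads();             // wait until every thread has filled its slot

    out[t] = tile[N - 1 - t];    // write the mirrored element back to global memory
}

int main() {
    int h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(int));   // allocations in global memory
    cudaMalloc(&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    reverseKernel<<<1, N>>>(d_out, d_in); // one block of N threads

    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %d (expected %d)\n", h_out[0], N - 1);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Without __syncthreads(), a thread could read its mirrored slot before the owning thread has written it; the barrier is what makes the shared tile safe to read.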

Try modifying and running the example code in our playground to better understand how CUDA threads work.