NVIDIA CUDA Compute Unified Device Architecture Programming Guide
4.5.1.5 Asynchronous Concurrent Execution
In order to facilitate concurrent execution between host and device, some runtime functions are asynchronous: control is returned to the application before the device has completed the requested task. These are:
- Kernel launches through __global__ functions or cuGridLaunch() and cuGridLaunchAsync();
- The functions that perform memory copies and are suffixed with Async;
- The functions that perform device ↔ device memory copies;
- The functions that set memory.
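As an illustration of the first item, the following minimal sketch launches a kernel, performs unrelated host work while the device may still be running, and only then waits for the device. The kernel addOne and the helper runAsyncLaunchDemo are hypothetical names introduced for this example. It assumes the runtime API of a recent CUDA toolkit, in which cudaDeviceSynchronize() is the call that blocks the host until the device is idle.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1.0f to each element.
__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns the number of elements with an unexpected value,
// or -1 if an allocation or CUDA call failed.
int runAsyncLaunchDemo()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_data = (float*)malloc(bytes);
    float* d_data = NULL;
    if (h_data == NULL) return -1;
    if (cudaMalloc((void**)&d_data, bytes) != cudaSuccess) {
        free(h_data);
        return -1;
    }

    // Setting device memory and launching a kernel both hand work to the device.
    cudaMemset(d_data, 0, bytes);
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);

    // Control is back on the host here; the kernel may still be executing,
    // so this loop can overlap with device work.
    long hostWork = 0;
    for (int i = 0; i < 1000; ++i) hostWork += i;

    // Block until the device has completed all previously requested tasks.
    cudaError_t err = cudaDeviceSynchronize();

    int mismatches = -1;
    if (err == cudaSuccess &&
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
        mismatches = 0;
        for (int i = 0; i < n; ++i)
            if (h_data[i] != 1.0f) ++mismatches;
    }
    printf("host work result: %ld, mismatches: %d\n", hostWork, mismatches);

    cudaFree(d_data);
    free(h_data);
    return mismatches;
}

int main()
{
    return runAsyncLaunchDemo() == 0 ? 0 : 1;
}
```

The explicit synchronization before reading results is the point of the sketch: without it, the host has no guarantee that the kernel has finished.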
Applications manage concurrency through streams. A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.
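The stream model described above can be sketched as follows. Each of two streams receives its own copy-in, kernel, and copy-out sequence. Operations within one stream run in order, while the two streams are free to overlap. The kernel scaleByTwo and the helper runTwoStreamDemo are hypothetical names introduced for this example. The sketch assumes the runtime API calls cudaStreamCreate(), cudaMemcpyAsync(), cudaStreamSynchronize(), and cudaMallocHost() from a recent CUDA toolkit, and uses page-locked host memory because asynchronous copies generally need it to overlap with other work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element.
__global__ void scaleByTwo(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the number of elements with an unexpected value,
// or -1 if an allocation failed.
int runTwoStreamDemo()
{
    const int nStreams = 2;
    const int chunk = 1 << 18;                      // elements per stream
    const size_t chunkBytes = chunk * sizeof(float);

    float* h_data = NULL;
    float* d_data = NULL;
    cudaStream_t streams[nStreams];

    // Page-locked host buffer so the Async copies can run asynchronously.
    if (cudaMallocHost((void**)&h_data, nStreams * chunkBytes) != cudaSuccess)
        return -1;
    if (cudaMalloc((void**)&d_data, nStreams * chunkBytes) != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }
    for (int i = 0; i < nStreams * chunk; ++i) h_data[i] = (float)i;

    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Within a stream: copy in, compute, copy out, strictly in this order.
    // Across streams: no ordering is implied, so the work may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scaleByTwo<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for every stream to finish before reading the results on the host.
    for (int s = 0; s < nStreams; ++s) cudaStreamSynchronize(streams[s]);

    int mismatches = 0;
    for (int i = 0; i < nStreams * chunk; ++i)
        if (h_data[i] != 2.0f * (float)i) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return mismatches;
}

int main()
{
    return runTwoStreamDemo() == 0 ? 0 : 1;
}
```

Each stream works on a disjoint slice of the buffers, so the result does not depend on how the device interleaves the two streams. Whether the streams actually overlap depends on the device's capabilities.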