How many threads per block should a CUDA kernel use? A single warp of 32 threads per block is usually not a good idea. It is generally good to choose the number of threads per block as a multiple of the warp size (generally 32), and such that the hardware's maximum number of threads per block is a multiple of your block size. A launch of 512 blocks of 256 threads (I changed the numbers to make them more reflective of CUDA codes I've seen) is typical. However, as the number of streaming multiprocessors is finite, there has to be a physical limit on what runs concurrently, so it is often worth having the code detect the maximum number of threads per block at runtime and calculate the required number of blocks in each direction from it. After you have that number, you simply launch the number of blocks that are required to get the total number of threads that you need.

A worked example from the GPU Teaching Kit: with 8x8 blocks we have 64 threads per block; since each SM can take up to 1536 threads, that translates to 24 blocks. If each thread works with 4 bytes of data, a warp makes coalesced memory access patterns of 32 * 4 = 128 bytes. Also note that if the kernel is launched with a number of threads per block larger than the device's MAX_THREADS_PER_BLOCK, the kernel launch will fail, and the same goes if the amount of shared memory per block, which may depend on the number of threads you choose, exceeds the per-block limit. The SM also has a maximum number of blocks that can be active at once.

In the following code, the use of thread and block indices is shown; these are what each thread uses to differentiate its work from every other thread's:

    #include <stdio.h>

    __global__ void kernelA() {
        // threadIdx.x: the thread id within this thread's block,
        //              from 0 to (thread count per block - 1)
        // blockIdx.x:  the block id within the grid (over all blocks in the kernel)
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

Occupancy is the ratio of active warps on an SM to the maximum number of active warps that the SM supports. Among the commonly used facilities are CUDA events, which add marker points to a stream's execution so you can check whether the executing stream has reached a given point. First, an event can be used to wait for and test completion of the operations issued before the insertion point, much like a stream synchronize. Second, events can be inserted into different streams and used to coordinate operations between streams.

When a kernel is launched, the number of threads per thread block and the number of thread blocks are specified; this, in turn, defines the total number of CUDA threads launched. If you also applied the maximum Y and Z grid dimensions, the grid would contain so many blocks that it would not finish in any practical amount of time. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator.

The CUDA model provides a fairly straightforward way to map a thread to a unique value, beginning at 0, to use as an index into an array. For a vector of 2^23 elements with 1024-thread blocks:

    my_addition_kernel<<<(N + 1023) / 1024, 1024>>>(da, db, dc);   // N = 1 << 23

So you'd have 8192 blocks with 1024 threads each. (Beware that ^ is XOR in C; write 1 << 23 for 2^23.)

A kernel is a function computed per thread: CUDA programming involves a huge number of threads, from thousands to tens of thousands computing in parallel, and all of those threads are in fact executing the same kernel function. That is why the work is divided the way it is, with the grid partitioned into multidimensional blocks and each block partitioned into threads.

A few more notes on the threads-per-block limitation. In Thrust, sizes like 257 rather than 256 are reportedly chosen to avoid shared memory bank conflicts. For a one-dimensional problem, a 1D grid and a 1D block should do, sized using the device properties returned by cudaGetDeviceProperties. The number of registers used per block is another limiting resource. All threads of the warps selected for execution run in parallel. The number of threads in a thread block was formerly limited by the architecture to a total of 512 threads per block, but as of March 2010, with compute capability 2.0 (Fermi), the limit is 1024.

• So, you can express your collection of blocks, and your collection of threads within a block, as a 1D array, a 2D array or a 3D array.

Finally, the maximum number of threads per block you can launch a particular kernel with also depends on the amount of registers it uses per thread.
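Here is a minimal host-side sketch of that runtime detection, assuming device 0 and the 2^23-element vector from the launch example above; the variable names are illustrative, not from the original discussion:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);            // properties of device 0

        int threads = prop.maxThreadsPerBlock;        // detected at runtime, e.g. 1024
        int n = 1 << 23;                              // 2^23 elements (^ is XOR in C!)
        int blocks = (n + threads - 1) / threads;     // round up so blocks*threads >= n

        printf("max threads/block: %d -> launch %d blocks\n", threads, blocks);
        printf("block dim limits: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("grid dim limits:  %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }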
Everything works great so far, but I need help understanding how blocks and threads work in general and via CUDAnative.

A few surrounding facts come up in every such discussion. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently. On the memory side, the B200 GPU enables a single thread block to address up to 227 KB of shared memory; to maintain architectural compatibility, static shared memory allocations remain limited to 48 KB. And without considering the occupancy limiters in a specific code (e.g. registers per thread, shared memory usage, etc.), the maximum number of threads per SM is a hardware limit that appears in your deviceQuery output as "Maximum number of threads per multiprocessor" (it is also documented in the programming guide).

For people using cc 8.6 GPUs (like yours) I generally recommend 512 threads per block (as a maximum, if that is relevant) for this reason: a cc 8.6 SM holds 1536 threads, so a 1024-thread block can never fill it (only one block fits and the remainder is wasted), while three 512-thread blocks fill it exactly. The maximum is hard-limited to 1024 threads per block in any case. Treat the block size as a tunable: pick a value and adjust it as you measure to choose a better one. To think about oddball cases: if you had 65 threads per block, someone might naively think that the threads-per-SM limit would imply a maximum of 31 blocks per SM (2048/65), but in actuality those thread blocks of 65 threads consist of 3 warps each, since warps are allocated whole, so the 64-warp limit caps you at 21 such blocks.

Assorted practical notes: there is a tutorial showing how to get CUDA device properties using C++. One user reported, "So I ran the kernel with blocks of 196 threads" — note that 196 is not a multiple of 32, so each such block occupies seven warps' worth of scheduling resources. Both the grid size and the block size must be greater than 0. In normal cases I would start with 256 threads per block and 256 blocks and tune from there. The per-kernel query functions (covered below) are documented in the Execution Control section of the respective API documentation. I know that if a block uses more registers than are available on an SM, the launch will fail.

Some useful identities: blockDim.x * gridDim.x gives the number of threads in a grid (in the x direction, in this case); block and grid variables can be 1, 2, or 3 dimensional; and 1536 threads / 32 = 48 warps per SM on such devices. Registers per thread are a budget of their own.

A typical hardware question: "I have an EVGA GTX 560 Ti 2GB (Fermi) GPU. From what I gathered, there are 48 CUDA cores per multiprocessor (SM), each SM can hold 48 resident warps, and each warp executes 32 threads — so how many threads are actually running?" And the opposite situation: "I have a vast number of blocks to saturate the hardware, but for algorithmic reasons, the preferred number of threads per block is 32." With blocks that small, a GPU with 16 SMs and 2048 threads per SM would need 16 x 2048/32 = 1024 resident blocks to be completely full, which the per-SM block limits will not allow. (The N = 2048*2048 quantity in the running example is the overall size of the data set, not the launch shape.) Using one thread per block is often used to introduce CUDA to new programmers, but generally should not be used for performance-oriented code. Since each thread is processing one element, and we have 1,048,576 elements total, we will need 1,048,576 threads. Within one GPU generation, the absolute number of threads and CUDA cores only scales with the number of multiprocessors (SMs).

Mapping a 1D grid of 1D thread blocks to an array: in many cases where we have a one-dimensional array of data values, our primary goal in CUDA programming is to have each thread work on one element of the array, and it's common practice when handling 1D data to only use the x dimensions of the grid and block. When processing 2D images with CUDA, a natural intuition is to use a 2D block and grid shape.
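As a sketch of that 2D intuition — one thread per pixel, with illustrative image dimensions and kernel name (none of these identifiers come from the discussion above):

    __global__ void invert(unsigned char *img, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)                     // guard the rounded-up grid
            img[y * width + x] = 255 - img[y * width + x];
    }

    // Launch with, e.g., 16x16 = 256 threads per block:
    //   dim3 block(16, 16);
    //   dim3 grid((width + block.x - 1) / block.x,
    //             (height + block.y - 1) / block.y);
    //   invert<<<grid, block>>>(d_img, width, height);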
"What is the max number of threads which can be initiated in a single CUDA kernel?" The deviceQuery output answers most of this. Especially important is the entry "Maximum number of threads per block:", which is the maximum number of threads you can configure for one block. The entry "Max dimension size of a thread block (x,y,z):" matters when you specify the threads per block in three dimensions for parallel processing on the GPU with CUDA.

The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). When you execute a kernel you decide how many "blocks of threads" you want to launch. If you launch with one thread per block, every block will launch with a single thread, and that thread will have thread indices (threadIdx.x, threadIdx.y, threadIdx.z) = (0, 0, 0). The running problem is a vector add, so the overall size of the vectors to be added is N elements.

With 32 threads per block, you cannot even fill the SM completely, because Ada has a 24-thread-block limit per SM (per your data), and 24 * 32 = 768 threads — only half of the SM's 1536-thread capacity.

Question 1: "Can I calculate the block/grid size using this method?" No. From Wikipedia, a thread block is the abstraction for a group of threads that execute together. More concretely: several threads are grouped into a block; threads in the same block can synchronize with one another and can communicate through shared memory; multiple blocks in turn make up a grid; within the CUDA software structure, the warp is the other grouping to know. "Hello, I am new to CUDA and trying to wrap my head around calculating the 'global thread id'. What I mean by this is the following: say we have a grid of (2,2,1) and blocks of (16,16,1); this will generate 2*2*16*16 = 1024 threads with the kernel invocation. I am referring to the 'global thread id' as each unique instance of a thread within the kernel." In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time — so take a look at the CUDA Occupancy Calculator; the relevant per-SM resources are blocks per SM, registers per thread, shared memory usage, and so on. In a launch configuration, grid_size and block_size represent the number of blocks and the number of threads in each block, respectively, for launching the kernel.

(For what it's worth, I got CUDAnative to work on a MacBook Pro 2016 on High Sierra with a GeForce 1080 Ti running as an eGPU connected via USB-C.)

On compute capability 1.0-1.1 hardware, the limit is 768 active/concurrent threads per multiprocessor (not per block, like you wrote), which are executed in SIMD fashion in warps of 32 threads. If your code has occupancy requirements, you can also declare them to the compiler with __launch_bounds__: its MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional.
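A minimal sketch of those launch bounds; the kernel name and body are illustrative, not from the discussion above:

    // 256 is the promised maximum block size (mandatory argument); 4 is the
    // desired minimum number of resident blocks per SM (optional argument).
    __global__ void __launch_bounds__(256, 4)
    scale(float *data, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;   // compiler caps register use to meet the bounds
    }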
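Returning to the "global thread id" question, one conventional flattening for that (2,2,1) grid of (16,16,1) blocks looks like this — the ordering is a convention choice, not the only valid one:

    #include <cstdio>

    __global__ void show_global_id() {
        int threadsPerBlock = blockDim.x * blockDim.y;                  // 16*16 = 256
        int threadInBlock   = threadIdx.y * blockDim.x + threadIdx.x;   // 0..255
        int blockInGrid     = blockIdx.y * gridDim.x + blockIdx.x;      // 0..3
        int globalId        = blockInGrid * threadsPerBlock + threadInBlock;
        printf("global thread id %d\n", globalId);                      // 0..1023
    }
    // Launch (then cudaDeviceSynchronize() to flush the printf output):
    //   show_global_id<<<dim3(2,2,1), dim3(16,16,1)>>>();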
One CUDAnative user ran the following: @cuda threads=1 blocks=1 add!(C, ...) — a launch of a single block containing a single thread. As you have written it, that kernel is completely serial: every thread launched to execute it (here, just one) is going to perform the same work. Threads must differentiate themselves, either by working on different data or, in some cases, by taking different logic paths. More generally, the higher the occupancy (warps per multiprocessor), the less likely the multiprocessor will have to wait (for memory transactions or data dependencies), but the more threads must share the same L1 caches, shared memory area and register file (see the CUDA optimization guide and also this presentation).

One blog post on the subject (about 1.6k views) explored the blocks-and-threads concepts: early CUDA devices limited a grid to at most 65535 blocks per dimension, while modern devices allow up to 2^31 - 1 in the x dimension; the number of threads per block normally does not exceed 1024 and should be a multiple of warpSize (usually 32); its code example showed how to query the device's warpSize; and the ideal thread count is usually 128 or 256.

Launching blocks of 512 threads means you have the potential to maximize occupancy. While it may not apply here, often smaller block sizes, say 256 threads per block, allow for better utilization of GPU resources due to effects of granularity of resource allocation inside the hardware. For example, (32,32,1) creates a block of 1024 threads, the maximum allowed. Each of the following sections describes, with illustrations and small code snippets, how we can set up different block sizes (numbers of threads per block) and different grid sizes (blocks per grid) for this code example. However, on early hardware each SM can only take up to 8 blocks — so in the 64-thread Teaching Kit example above, only 8 * 64 = 512 threads actually go into each SM — and a CUDA device's hardware implementation groups adjacent threads within a block into warps.

There is also a per-kernel bound: the maximum number of threads and blocks which can be launched for a given kernel. In one question the threads per block were already defined at 512; compute capability 1.2-1.3 devices support up to 1024 active threads per SM but still only 512 threads per block, so you can launch the following block configurations (compute capability >= 2.0 values shown in parentheses):

    512 x 1 x 1   (1024 x 1 x 1)
    128 x 2 x 2   (256 x 2 x 2)

Three-dimensional indexing provides a natural way to index elements in vectors, matrices and volumes, and makes CUDA programming easier.
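A sketch of that three-dimensional indexing, with illustrative volume dimensions and kernel name:

    __global__ void touch(float *vol, int nx, int ny, int nz) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x < nx && y < ny && z < nz)
            vol[(z * ny + y) * nx + x] += 1.0f;   // flatten (x,y,z) to a 1D offset
    }

    // e.g. one of the 512-thread shapes listed above, 128 x 2 x 2:
    //   dim3 block(128, 2, 2);
    //   dim3 grid((nx + 127) / 128, (ny + 1) / 2, (nz + 1) / 2);
    //   touch<<<grid, block>>>(d_vol, nx, ny, nz);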
– If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks residing from which the SM can choose warps to execute. Whenever a block is completely processed by a multiprocessor, a new thread block from the list of N thread blocks is plugged into the current multiprocessor. (If instead you launch 1024-thread blocks on a 1536-thread SM, only one block fits — how do we utilise the other 512 threads per SM? We can't; that is the occupancy loss discussed below.)

These limits are what deviceQuery reports. For example:

    Device 1: "Quadro FX 5800"
      CUDA Driver Version / Runtime Version:        5.0 / 5.0
      CUDA Capability Major/Minor version number:   1.3
      Total amount of global memory:                4096 MBytes (4294770688 bytes)
      (30) Multiprocessors x ( 8) CUDA Cores/MP:    240 CUDA Cores
      GPU Clock rate:                               1296 MHz (1.30 GHz)
      Memory Clock rate:                            ...

"As I know, this is the main control flow of a CUDA program: kernel -> thread block(s) -> one block executes on an SM at a time -> the thread block is divided into warps (32 threads per warp) -> all warps are handled concurrently (does this mean in parallel?)" So now assume we are using the Fermi architecture, which allows 1536 resident threads per SM (per SM, not per block). Early CUDA cards, up through compute capability 1.3, had a maximum of 512 threads per block and 65535 blocks in a single 1-dimensional grid (recall we set up a 1-D grid in this code).

No, you cannot run 4 trillion threads simultaneously. You can schedule them, but there aren't enough physical cores in the streaming multiprocessors to run them all at once. If we want to set the maximum possible block size, we have to make sure that the product of the block's x, y and z dimensions does not exceed the maximum threads per block. If each thread requires a lot of private memory, then using fewer threads per block helps, but not without bound: the block size should be soft-limited to a minimum like 32 or 64, depending on the algorithm. There is also a hard limit on the number of blocks per SM; if occupancy is below 100% and this factor is limiting active blocks, it means each block does not contain enough warps to reach the warp limit. Although the maximum number of threads per block is 1024, there is effectively no limit on the number of blocks one can use, up to the enormous grid-dimension maxima.
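A hedged host-side sketch of that resident-blocks arithmetic; it deliberately ignores register and shared memory limits and warp-granularity rounding:

    // Resident blocks per SM is the minimum over the thread limit and the
    // block limit (real hardware also applies register/shared-memory limits).
    int residentBlocksPerSM(int threadsPerBlock, int maxThreadsPerSM, int maxBlocksPerSM) {
        int byThreads = maxThreadsPerSM / threadsPerBlock;
        return byThreads < maxBlocksPerSM ? byThreads : maxBlocksPerSM;
    }
    // residentBlocksPerSM(256, 2048, 16) == 8  (the example above)
    // residentBlocksPerSM(64, 1536, 8)   == 8  (8x8 blocks: only 512 of 1536 threads used)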
After you have that number, check it against the device limits before launching. A typical tutorial listing of those limits:

    Maximum threads in X direction: 512 (1024 for compute capability >= 2.0)
    Maximum threads in Y direction: 512 (1024 for compute capability >= 2.0)
    Maximum threads in Z direction: 64

and, on a compute capability 2.x part:

    Maximum number of threads per block: 1024
    Maximum sizes of each dimension of a block: 1024 x 1024 x 64
    Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Suppose I have a GPU that allows MAX_THREAD threads per block; I was using the information from Appendix G of the CUDA Programming Guide 3.0 together with the device properties returned by cudaGetDeviceProperties. (The per-block maximum has been 1024 ever since compute capability 2.0.)

Hello, I think you are mixing up threads per block and blocks of threads. The canonical element-per-thread kernel is:

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

Now let's suppose that we compute a new grid size (blocks per 1D grid) based on the array size and number of threads per block; when we do so, the cuda.grid() statement in your (Numba) kernel code will return a tuple that has sensible ranges for indexing. In that case, the number of thread blocks to launch entirely depends on the size of the problem. When a kernel is started, the number of blocks per grid and the number of threads per block are fixed (gridDim and blockDim), and all the blocks in the same grid contain the same number of threads. You can find more information in the CUDA C Programming Guide and the CUDA Runtime API Reference.

Playing with CUDA block size — how do you decide how many blocks versus threads in a block, for example when estimating threads per block and block counts for 2D grid data? Use threads in multiples of the warp size, and measure. In one experiment, 128 threads per block resulted in faster performance than 1024; this was found on both a Titan V and a Titan X (Pascal) with CUDA 9. "I thought this design was optimal, but the occupancy told otherwise." @JimPivarski: you might also want to try varying the block size while keeping the total number of threads constant. You also asked about the difference between something like 1024 blocks of 128 threads vs. 512 blocks of 256 threads — well, they're running the same number of total threads, but depending on the shared resources those threads require they could run very differently (or not). On old compute capability 1.x hardware, with a 256-thread block you could have 3 active blocks per SM (768/256).
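Here is a sketch of such a block-size comparison using CUDA events as the stream markers described earlier; it assumes the add kernel and device arrays from above, and omits error checking:

    float timeLaunch(int threadsPerBlock, int n, int *da, int *db, int *dc) {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        add<<<blocks, threadsPerBlock>>>(da, db, dc, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);              // wait until the marker is reached

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // milliseconds between the markers
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }
    // e.g. compare timeLaunch(128, n, da, db, dc) against timeLaunch(1024, n, da, db, dc).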
"If MAX_THREAD = 1024, and dim3 threads_per_block is set to [32, 8, 4] (since 32*8*4 = 1024), how can I calculate each dimension of dim3 blocks_per_grid so that I cover the whole problem?" For each axis independently: divide the problem extent by the block extent and round up. Useful device properties here: regsPerBlock is the maximum number of 32-bit registers available to a thread block (this number is shared by all thread blocks simultaneously resident on a multiprocessor); warpSize is the warp size in threads; memPitch is the maximum pitch in bytes allowed by the memory copy functions that involve memory regions allocated through cudaMallocPitch(). For better process and data mapping, threads are grouped into thread blocks, and multiple blocks are combined to form a grid. After passing a barrier, the threads of a block are also guaranteed to see all writes to memory performed by threads in the block before the barrier.

Assuming shared memory and registers are not limiting factors, let us look at a couple of cases. Case 1: 32 threads per block and 64 blocks (with blocks this small, the blocks-per-SM limit binds before the thread count does). When launching 1024 threads per block on a 1536-thread SM, you can only reach a maximum of 2/3 of "full" occupancy. The maximum number of threads and blocks that can actually be resident is further restricted by the resource usage of your kernel — that's why you are better off using the occupancy API, which takes that into account.

Grids of blocks exist for when you have more data than the maximum number of threads per block can cover. CUDA makes four pieces of information available to each thread: the thread index (threadIdx), the block index (blockIdx), the size and shape of a block (blockDim), and the size and shape of the grid (gridDim). One user reported that for a number of small block size settings, such as 2, 4 or 8 threads, they were able to get faster speed (20% higher) compared to the case of 32 or 64 threads per block. As a rule, if the maximum x, y and z dimensions of a block are 512, 512 and 64, the block must be allocated such that x * y * z <= 512, which is the maximum number of threads per block.

"So my guess was that if I lower the number of threads per block I can increase the number of registers in use. But again the profiler shows 63 registers, even though 63 * 192 = 12,096 registers is far below the SM's register file." The maximum number of thread blocks per SM is 32 for devices of compute capability 10.0. You can access these limits with cuDeviceGetAttribute, which returns in *pi the integer value of the attribute attrib on device dev; the supported attributes include CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK (the maximum number of threads per block).

"Hello all, I need some clarification on the terms blocks, threads, multiprocessors and CUDA cores, and what the maximum value for each one is. I have tried various forums and searched everywhere but haven't got all my answers clarified." And another ask: "Hello folks. I would like assistance with a code that analyzes frames showing the profile of an object with a specific reflected laser (3D profilometry) at time 1 and compares it with the same object and reflected laser at time 2. At time 1, all reference frames of the object with the laser will be produced, and at time 2, all current frames of the object with the laser will be produced for comparison."
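A sketch of that occupancy API, reusing the add kernel from above; cudaOccupancyMaxPotentialBlockSize is one of several related calls:

    void launchWithOccupancyHint(int n, int *da, int *db, int *dc) {
        int minGridSize = 0, blockSize = 0;
        // Accounts for add's register and shared memory usage automatically.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, add, 0, 0);
        // blockSize: occupancy-maximizing threads per block for this kernel;
        // minGridSize: smallest grid that can still saturate the device.
        int gridSize = (n + blockSize - 1) / blockSize;  // then size the grid to the data
        add<<<gridSize, blockSize>>>(da, db, dc, n);
    }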
• These can be helpful when thinking of your data as 2D or 3D.

The maximum number of threads in the block is limited to 1024. A warp is considered active from the time its threads begin executing to the time when all threads in the warp have exited from the kernel. From Numba you can inspect the device interactively:

    In [9]:  ddd = numba.cuda.get_current_device()
    In [10]: print(ddd)
    <CUDA device 0 'b'GeForce GTX 970''>

You don't want to vary the number of threads per block arbitrarily: one of the most important elements of CUDA programming is choosing the right grid and block dimensions for the problem size. The limiting mechanism is described in the Programming Guide as the maximum number of threads and blocks which can run concurrently on the GPU. A short program can retrieve and print the properties of each CUDA device in the system — device name, compute capability, total global memory, maximum threads per block, etc.; the cuda-samples repo ships a deviceQuery utility that does exactly this.

"Hello CUDA users, we've just posted a new tool on the CUDA site, the CUDA Occupancy Calculator" (it can be found on the CUDA homepage). This tool is an MS Excel spreadsheet that helps you choose a thread block size for your kernel in order to achieve the highest occupancy of the GPU. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and important information about the number of active blocks; the first tab of the file calculates the actual use of an SM based on the resources entered. Related topics that recur: determining threads per block and blocks per grid; threads per SM versus threads per block; CUDA blocks and threads; warps and the optimal number of blocks. "My intention is to calculate these dynamically (rather than hardcoding values) for a feed-forward neural net library I am writing" — the block and grid size depend on a lot of things (the algorithm, the work per thread, resource usage, latency), so measure rather than guess.

"I'm having a hard time understanding how and why the number of threads per block affects the number of warps per SM. My GTX 570 device query says the following" — a representative deviceQuery excerpt reads:

    Maximum number of threads per multiprocessor: 2048
    Maximum number of threads per block:          1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)

As Pavan pointed out, if you do not provide a dim3 for the grid configuration, you will only use the x dimension, hence the per-dimension limit applies there. What you need is cudaFuncGetAttributes if you are using the CUDA runtime API, or cuFuncGetAttribute with CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK (as RoBiK pointed out in his answer) if you are using the CUDA driver API. Similarly to threads, blocks are indexed using the built-in 3D variable blockIdx. It is important to remember that these API calls provide the occupancy-maximizing number of threads per block, and not the block dimensions. A minimum of 64 threads per block should be used, and only if there are multiple concurrent blocks per multiprocessor. The Guide's "Features and Technical Specifications" table points out that the maximum number of threads per block and the maximum x- or y-dimension of a block are both 1024; from this I understand that the cuda.grid() statement stays within these ranges. For performance applications you need to store 2D matrix information as a single array in memory: if you have an M x N matrix, you can store it in a single array of length M*N.
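A sketch of the runtime-API query just mentioned, again using the add kernel as the example; the driver-API equivalent is shown as a comment:

    void printKernelLimits() {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, add);   // 'add' as defined earlier
        printf("add(): max %d threads/block, %d regs/thread, %zu B static smem\n",
               attr.maxThreadsPerBlock, attr.numRegs, attr.sharedSizeBytes);
        // Driver API:
        //   int v;
        //   cuFuncGetAttribute(&v, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, cuFunc);
    }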
"Maximum number of threads per block: 512; maximum sizes of each dimension of a block: 512 x 512 x 64 — does this mean that the maximum number of threads in a 2D thread block is 512x512, which gives me 262,144 threads in every block? If yes, is it good practice to have this number of threads in a kernel of minimum 256 blocks?" No — the dimension limits and the total limit apply together: the product of the block dimensions may not exceed 512 (the per-block thread maximum on that device), so if your thread blocks are square and you want the maximum threads per block possible, they top out at 22 x 22 = 484 threads (23 x 23 = 529 would exceed 512). How are the threads executed? In chunks of 32 threads, aka a warp — in this case threads 0-31 form the first warp, threads 32-63 the second, and so on.

You probably want many threads per block (e.g. 256 or 1024); otherwise, since n is quite large in your example, you may exceed your device's grid-dimension limit on the number of blocks. One common style is "one thread per work item", which is how CUDA is usually taught in introductions. The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" programming models) is that you take a data-parallel operation — one where the same, largely independent, operation must be performed on many data elements — and assign the elements to threads. According to the CUDA manuals, it is better to use 128 or 256 threads per block if you do not want to worry about the deep details of GPGPUs; between 128 and 256 threads per block is a good initial range for experimentation. Returning to the 1,048,576-element example: here we choose 256 threads per block, calculate the number of 256-thread blocks we'll need to reach that many threads, and it comes out to 4,096 blocks.

The numbers you quote (2048 threads per multiprocessor, three multiprocessors in total = 6144 threads) represent the first set of limits. In CUDA, threads are "packed" in blocks of size (x, y, z), and the number of blocks can be determined from the kernel launch, since blockDim.x * gridDim.x is the total thread count in x. A compute capability 1.0-1.1 multiprocessor can actually support 768 active threads, so if you make your thread block have 256 threads, you could run 3 blocks per multiprocessor (but then you cannot use more than 10 registers per thread, since 8192 / 768 = 10.67); with 512 threads in your block you could only have one active block on the SM. E.g., with a maximum of 768 threads, using 256 threads per block will tend to be better than 512, because multiple blocks can then run concurrently per multiprocessor. On later hardware, the limit of 64 warps per SM is implied by the limit of 2048 threads per SM. A Fermi-class deviceQuery, for reference:

    Maximum number of threads per multiprocessor: 1536
    Maximum number of threads per block:          1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z):    (65535, 65535, 65535)
    Maximum memory pitch:                         2147483647 bytes

See also the "Feature Support per Compute Capability" table of the CUDA C Programming Guide, Version 9.0.
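To close, a hedged helper that encodes the answer to the 512 x 512 x 64 question above — a dimension-wise check plus the product cap (assumes the CUDA runtime headers; the function name is illustrative):

    bool blockShapeOk(dim3 b, const cudaDeviceProp &p) {
        if (b.x > (unsigned)p.maxThreadsDim[0]) return false;
        if (b.y > (unsigned)p.maxThreadsDim[1]) return false;
        if (b.z > (unsigned)p.maxThreadsDim[2]) return false;
        // Each dimension may be legal while the product is not:
        return b.x * b.y * b.z <= (unsigned)p.maxThreadsPerBlock;
    }
    // blockShapeOk(dim3(512,512,64), prop) -> false (product far exceeds the cap)
    // blockShapeOk(dim3(16,16,1),   prop) -> true  (256 threads)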