Impact of Different Thread Block Sizes with Synchronization and Shared Memory on the Performance of GPGPU

International Journal of Electronics and Communication Engineering
© 2024 by SSRG - IJECE Journal
Volume 11 Issue 5
Year of Publication: 2024
Authors : Sonal John, Saurabh Jain
How to Cite?

Sonal John, Saurabh Jain, "Impact of Different Thread Block Sizes with Synchronization and Shared Memory on the Performance of GPGPU," SSRG International Journal of Electronics and Communication Engineering, vol. 11,  no. 5, pp. 108-114, 2024. Crossref, https://doi.org/10.14445/23488549/IJECE-V11I5P111

Abstract:

Graphics Processing Units (GPUs) have many cores and therefore deliver high execution throughput; they are designed for parallel computing. The size of a thread block plays a crucial role in determining a kernel's occupancy, and sufficient thread-level parallelism is necessary to maximize overall performance. At kernel launch, the number of threads per block and the number of blocks in the grid are specified, and varying these two parameters strongly influences the performance of CUDA applications. This work examines the impact of different thread block sizes, combined with shared memory and synchronization, on the total execution time of several CUDA programs, along with the resulting speed optimization. The results show that using shared memory together with synchronized thread blocks measurably improves the overall performance of CUDA applications on a GPGPU.
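
To make the launch configuration and the shared-memory-plus-synchronization pattern concrete, the sketch below shows a CUDA kernel that reduces each thread block's portion of an array in shared memory with __syncthreads() barriers, launched and timed with CUDA events for one candidate block size. This is an illustrative sketch only, not the benchmark code evaluated in the paper; the kernel name blockSum, the 256-thread block size, and the problem size are assumptions chosen for the example. Repeating the same launch with block sizes such as 128, 512, or 1024 and comparing the elapsed times is the kind of measurement the study describes.

// Illustrative sketch (assumed names and sizes, not the paper's code):
// a block-wise sum reduction that stages data in shared memory and uses
// __syncthreads() barriers, launched and timed for one candidate block size.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out, int n)
{
    extern __shared__ float tile[];             // per-block shared memory, sized at launch
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? in[gid] : 0.0f;     // stage one element per thread
    __syncthreads();                            // barrier: tile is fully written

    // Tree reduction within the block; block size must be a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                        // barrier after every reduction step
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];              // one partial sum per block
}

int main()
{
    const int n = 1 << 20;                      // assumed problem size
    const int block = 256;                      // thread block size under test
    const int grid = (n + block - 1) / block;   // number of blocks in the grid

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));     // zero input so the run is well defined

    cudaEvent_t start, stop;                    // CUDA events measure kernel time
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    blockSum<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block size %d: %.3f ms\n", block, ms);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}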

Keywords:

GPGPU, Synchronization, Shared Memory, Thread Block, Parallelism.
