2025 Volume 33 Pages 840-851
General-purpose computing on graphics processing units (GPGPU) follows an execution model in which the number and type of parallel tasks are managed by the CPU, which makes it difficult to efficiently execute fine-grained parallel programs containing nested parallel tasks of nonuniform granularity. This work addresses the problem by managing parallel tasks on the GPU itself through a fast memory allocation mechanism. As a preliminary implementation, it proposes a method that splits the computation of a fine-grained parallel fork-join program at each fork point and allocates each resulting subcomputation in GPU memory as a parallel task. In addition, kernel fusion, parallel task reuse, and parallel throttling are explored as optimizations of the proposed method. The method is implemented for a fine-grained parallel fork-join program in CUDA, and its scalability and execution speed are measured to evaluate its feasibility and performance.
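To make the idea concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of how a fork point might allocate child computations as tasks directly in GPU memory via a fast, lock-free bump allocator. All names here (`Task`, `g_pool`, `fork_task`, `worker`) are hypothetical.

```cuda
// Illustrative sketch only: a device-side bump allocator for parallel
// tasks, in the spirit of the fast memory allocation mechanism the
// abstract describes.
#include <cstdio>

struct Task { int lo, hi; };               // one fork-join subcomputation

__device__ unsigned int g_top = 0;         // bump-allocator cursor
__device__ Task        g_pool[1 << 20];    // pre-allocated task memory

// At a fork point, allocate a new task in GPU memory with a single
// atomic increment: O(1) and with no round trip to the CPU.
__device__ Task* fork_task(int lo, int hi) {
    unsigned int idx = atomicAdd(&g_top, 1u);
    Task* t = &g_pool[idx];
    t->lo = lo;
    t->hi = hi;
    return t;
}

// A worker splits its range at the midpoint (the fork point) and
// records both halves as new tasks instead of returning to the CPU.
__global__ void worker(int lo, int hi) {
    if (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        fork_task(lo, mid);                // left child task
        fork_task(mid, hi);                // right child task
    }
}

int main() {
    worker<<<1, 1>>>(0, 8);
    cudaDeviceSynchronize();
    unsigned int n = 0;
    cudaMemcpyFromSymbol(&n, g_top, sizeof(n));
    printf("tasks allocated: %u\n", n);
    return 0;
}
```

A scheduler kernel would then pull tasks from `g_pool` and repeat the split until the granularity threshold is reached; optimizations such as task reuse would recycle finished `Task` slots rather than always bumping the cursor.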