Performance-efficient mechanisms for managing irregularity in throughput processors

dc.contributor.advisor: Erez, Mattan [en]
dc.creator: Rhu, Minsoo [en]
dc.date.accessioned: 2014-07-01T15:03:16Z [en]
dc.date.accessioned: 2018-01-22T22:26:09Z
dc.date.available: 2018-01-22T22:26:09Z
dc.date.issued: 2014-05 [en]
dc.date.submitted: May 2014 [en]
dc.date.updated: 2014-07-01T15:03:17Z [en]
dc.description: text [en]
dc.description.abstract: Recent graphics processing units (GPUs) have emerged as a promising platform for general-purpose computing and have been shown to be very efficient at executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model, which balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads, and each thread supports conditional branches and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively group scalar threads for execution on a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs are adopted across a growing range of application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilized compute resources, inefficient on-chip caching, and wasted off-chip memory bandwidth when running highly irregular programs with divergent control flow and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures so that they can better manage irregularity. I first identify that previously proposed optimizations for the divergent control flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular and irregular codes, and 3) limited applicability. Based on these observations, I propose and evaluate three novel mechanisms that resolve these issues, providing significant performance improvements while minimizing implementation overhead. In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multithreaded nature of GPUs, which leads to substantial waste of off-chip memory bandwidth. I design and evaluate a locality-aware memory hierarchy for throughput processors that retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained accesses to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without adding control overhead or sacrificing prefetching potential for data with high spatial locality. [en]
dc.description.department: Electrical and Computer Engineering [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: http://hdl.handle.net/2152/24926 [en]
dc.language.iso: en [en]
dc.subject: Computer architecture [en]
dc.subject: GPU [en]
dc.subject: Graphics [en]
dc.subject: Throughput processors [en]
dc.subject: Memory systems [en]
dc.title: Performance-efficient mechanisms for managing irregularity in throughput processors [en]
dc.type: Thesis [en]
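
The two kinds of irregularity described in the abstract above can be illustrated with a minimal CUDA sketch. The kernel below, its name sparse_gather, and its parameters are hypothetical and are not taken from the dissertation; it only serves to show where branch divergence and uncoalesced memory access arise in SIMT code.

// Illustrative CUDA kernel (hypothetical): exhibits both control-flow
// divergence and irregular, hard-to-coalesce memory access.
// A host-side launch (e.g., sparse_gather<<<blocks, threads>>>(...)) is assumed.
__global__ void sparse_gather(const int *idx, const float *table,
                              float *out, int n, float threshold)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // one scalar thread per output element
    if (tid >= n) return;

    // Control-flow irregularity: threads of the same warp may take different
    // sides of this data-dependent branch, so the SIMD pipeline serializes
    // the two paths (branch divergence).
    if (table[idx[tid]] > threshold) {
        // Memory irregularity: idx[] is data dependent, so these loads scatter
        // across memory and cannot be coalesced into wide, coarse-grained
        // accesses; much of each fetched memory block may go unused.
        out[tid] = table[idx[tid]] * 0.5f;
    } else {
        out[tid] = 0.0f;
    }
}

On a SIMT machine, a warp whose threads disagree at the branch executes the two paths one after the other, and the scattered table[idx[tid]] loads pull in far more off-chip data than is actually consumed; these are the serialization and bandwidth-waste effects the dissertation targets.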
