Performance-efficient mechanisms for managing irregularity in throughput processors

dc.contributor.advisor: Erez, Mattan [en]
dc.creator: Rhu, Minsoo [en]
dc.date.accessioned: 2014-07-01T15:03:16Z [en]
dc.date.accessioned: 2018-01-22T22:26:09Z
dc.date.available: 2018-01-22T22:26:09Z
dc.date.issued: 2014-05 [en]
dc.date.submitted: May 2014 [en]
dc.date.updated: 2014-07-01T15:03:17Z [en]
dc.description: text [en]
dc.description.abstract: Recent graphics processing units (GPUs) have emerged as a promising platform for general-purpose computing and have been shown to be very efficient at executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model, which balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads, and each thread supports conditional branches and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively group scalar threads for execution on a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs are adopted across a growing range of application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilized compute resources, inefficient on-chip caching, and wasted off-chip memory bandwidth when running highly irregular programs with divergent control flow and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures so that they can better manage irregularity. I first identify that previously proposed optimizations for the divergent control flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular and irregular codes, and 3) limited applicability. Based on these observations, I propose and evaluate three novel mechanisms that resolve these issues, providing significant performance improvements while minimizing implementation overhead. In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multithreaded nature of GPUs, which leads to substantial waste of off-chip memory bandwidth. I design and evaluate a locality-aware memory hierarchy for throughput processors that retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained accesses to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without adding control overhead or sacrificing prefetching potential for data with high spatial locality. [en]
dc.description.department: Electrical and Computer Engineering [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: http://hdl.handle.net/2152/24926 [en]
dc.language.iso: en [en]
dc.subject: Computer architecture [en]
dc.subject: GPU [en]
dc.subject: Graphics [en]
dc.subject: Throughput processors [en]
dc.subject: Memory systems [en]
dc.title: Performance-efficient mechanisms for managing irregularity in throughput processors [en]
dc.type: Thesis [en]
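
The two kinds of irregularity described in the abstract above can be illustrated with a minimal CUDA sketch. The kernel below, its name sparse_gather, and its parameters are hypothetical and are not taken from the dissertation; it only serves to show where branch divergence and uncoalesced memory access arise in SIMT code.

// Illustrative CUDA kernel (hypothetical): exhibits both control-flow
// divergence and irregular, hard-to-coalesce memory access.
// A host-side launch (e.g., sparse_gather<<<blocks, threads>>>(...)) is assumed.
__global__ void sparse_gather(const int *idx, const float *table,
                              float *out, int n, float threshold)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // one scalar thread per output element
    if (tid >= n) return;

    // Control-flow irregularity: threads of the same warp may take different
    // sides of this data-dependent branch, so the SIMD pipeline serializes
    // the two paths (branch divergence).
    if (table[idx[tid]] > threshold) {
        // Memory irregularity: idx[] is data dependent, so these loads scatter
        // across memory and cannot be coalesced into wide, coarse-grained
        // accesses; much of each fetched memory block may go unused.
        out[tid] = table[idx[tid]] * 0.5f;
    } else {
        out[tid] = 0.0f;
    }
}

On a SIMT machine, a warp whose threads disagree at the branch executes the two paths one after the other, and the scattered table[idx[tid]] loads pull in far more off-chip data than is actually consumed; these are the serialization and bandwidth-waste effects the dissertation targets.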
