Exploiting hardware heterogeneity and parallelism for performance and energy efficiency of managed languages

Jibaja, Ivan

Date

2015-12

Authors

Jibaja, Ivan

Abstract

On the software side, managed languages and their workloads are ubiquitous, executing on mobile, desktop, and server hardware. Managed languages boost programmer productivity by abstracting away the hardware using virtual machine technology. On the hardware side, modern hardware increasingly exploits parallelism to boost energy efficiency and performance with homogeneous cores, heterogeneous cores, graphics processing units (GPUs), and vector instructions. Two major forms of parallelism are task parallelism across multiple cores and data parallelism through vector instructions. With task parallelism, the hardware executes multiple instruction streams simultaneously on multiple cores. With data parallelism, one core performs the same instruction on multiple pieces of data. Furthermore, we expect hardware parallelism to continue to evolve and provide more heterogeneity. Programming language runtimes must therefore evolve continuously so that programmers and their workloads can use this hardware efficiently for better performance and energy efficiency. However, efficiently exploiting hardware parallelism is at odds with programmer productivity, which depends on abstracting away hardware details. My thesis is that managed language systems should and can abstract hardware parallelism with modest to no burden on developers to achieve high performance, energy efficiency, and portability on ever-evolving parallel hardware. In particular, this thesis explores how the runtime can optimize and abstract heterogeneous parallel hardware and how the compiler can exploit data parallelism with new high-level language abstractions, with minimal burden on developers. We explore solutions at multiple levels of abstraction for different types of hardware parallelism.
(1) For asymmetric multicore processors (AMPs), which were recently introduced, we design and implement an application scheduler in the Java virtual machine (JVM) that requires no changes to existing Java applications. The scheduler uses feedback from dynamic analyses that automatically identify critical threads and classify application parallelism. Our scheduler automatically accelerates critical threads, honors thread priorities, considers core availability and thread sensitivity, and load balances scalable parallel threads on big and small cores. On frequency-scaled AMP hardware, it improves average performance by 20% and energy efficiency by 9% for scalable, non-scalable, and sequential workloads over prior research and existing schedulers.
(2) To exploit vector instructions, we design SIMD.js, a portable single instruction multiple data (SIMD) language extension for JavaScript (JS), and implement compiler support for it; together they add fine-grained data parallelism to JS. Our design principles seek portability, scalable performance across various SIMD hardware implementations, performance neutrality without SIMD hardware, and compiler simplicity to ease vendor adoption in multiple browsers. We introduce type speculation, compiler optimizations, and code generation that convert high-level JS SIMD operations into a minimal number of native SIMD instructions. Finally, to achieve wide adoption of our portable SIMD language extension, we explore, analyze, and discuss the trade-offs of four different approaches that provide the functionality of SIMD.js when the hardware does not support vector instructions. SIMD.js delivers an average performance improvement of 3.3× on microbenchmarks and key graphics algorithms across various hardware platforms, browsers, and operating systems. These language extensions and compiler technologies are in the final approval process to be included in the JavaScript standards.
This thesis shows that using virtual machine technologies shields programmers from the underlying details of hardware parallelism, achieves portability, and improves performance and energy efficiency.
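As an illustration of the data parallelism the abstract describes, the sketch below shows one way a four-lane vector operation could be emulated in software when no vector instructions are available. It is a minimal example under stated assumptions, not the thesis's implementation and not the SIMD.js API: the names `Float32x4`, `addFloat32x4`, and `addArrays` are hypothetical and chosen only for this example.

```typescript
// Hypothetical 4-lane vector of 32-bit floats (illustrative only, not the SIMD.js API).
type Float32x4 = [number, number, number, number];

// Scalar fallback: apply the same operation to every lane, one lane at a time.
// Math.fround rounds each result to single precision, as a 32-bit float lane would.
function addFloat32x4(a: Float32x4, b: Float32x4): Float32x4 {
  return [
    Math.fround(a[0] + b[0]),
    Math.fround(a[1] + b[1]),
    Math.fround(a[2] + b[2]),
    Math.fround(a[3] + b[3]),
  ];
}

// Add two arrays four elements at a time, then finish the leftover tail element by element.
function addArrays(x: Float32Array, y: Float32Array): Float32Array {
  if (x.length !== y.length) {
    throw new RangeError("input arrays must have the same length");
  }
  const out = new Float32Array(x.length);
  const vectorEnd = x.length - (x.length % 4);
  for (let i = 0; i < vectorEnd; i += 4) {
    const sum = addFloat32x4(
      [x[i], x[i + 1], x[i + 2], x[i + 3]],
      [y[i], y[i + 1], y[i + 2], y[i + 3]],
    );
    out.set(sum, i);
  }
  for (let i = vectorEnd; i < x.length; i++) {
    out[i] = x[i] + y[i];
  }
  return out;
}
```

On hardware with vector instructions, a compiler could in principle map each four-lane addition to a single native instruction; the lane-by-lane version above only preserves the same results without that hardware.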