Browsing by Subject "GPU"


A hybrid fluid simulation on the Graphics Processing Unit (GPU)
(Texas A&M University, 2008-10-10) Flannery, Rebecca Lynn
This thesis presents a method to implement a hybrid particle/grid fluid simulation on graphics hardware. The goal is to speed up the simulation by exploiting the parallelism of the graphics processing unit, or GPU. The Fluid Implicit Particle method is adapted to the programming style of the GPU. The methods were implemented on a current-generation graphics card. The GPU-based program exhibited a small speedup over its CPU-based counterpart.
Analysis of GPU-based convolution for acoustic wave propagation modeling with finite differences: Fortran to CUDA-C step-by-step
(2014-05) Sadahiro, Makoto; Stoffa, Paul L., 1948-; Tatham, R. H. (Robert H.), 1943-
By projecting observed microseismic data backward in time to when fracturing occurred, it is possible to locate the fracture events in space, assuming a correct velocity model. In order to achieve this task in near real-time, a robust computational system to handle backward propagation, or Reverse Time Migration (RTM), is required. We can then test many different velocity models for each run of the RTM. We investigate the use of a Graphics Processing Unit (GPU) based system using Compute Unified Device Architecture for C (CUDA-C) as the programming language. Our preliminary results show a large improvement in run-time over conventional Central Processing Unit (CPU) computing with Fortran. Considerable room for improvement still remains.
Comparison of linear and non-linear feature extraction on vegetation and oil spill hyperspectral images
Ramirez-Aguilar, Andres
Computational kinetics of a large scale biological process on GPU workstations : DNA bending
(2013-05) Ruymgaart, Arnold Peter; Elber, Ron
It has only recently become possible to study the dynamics of large time scale biological processes computationally in explicit solvent and atomic detail. This required a combination of advances in computer hardware, utilization of parallel and special purpose hardware as well as numerical and theoretical approaches. In this work we report advances in these areas contributing to the feasibility of a work of this scope in a reasonable time. We then make use of them to study an interesting model system, the action of the DNA bending protein 1IHF and demonstrate such an effort can now be performed on GPU equipped PC workstations. Many cellular processes require DNA bending. In the crowded compartment of the cell, DNA must be efficiently stored but this is just one example where bending is observed. Other examples include the effects of DNA structural features involved in transcription, gene regulation and recombination. 1IHF is a bacterial protein that binds and kinks DNA at sequence specific sites. The 1IHF binding to DNA is the cause or effect of bending of the double helix by almost 180 degrees. Most sequence specific DNA binding proteins bind in the major groove of the DNA and sequence specificity results from direct readout. 1IHF is an exception; it binds in the minor groove. The final structure of the binding/bending reaction was crystallized and shows the protein arm like features "latched" in place wrapping the DNA in the minor grooves and intercalating the tips between base pairs at the kink sites. This sequence specific, mostly indirect readout protein-DNA binding/bending interaction is therefore an interesting test case to study the mechanism of protein DNA binding and bending in general. Kinetic schemes have been proposed and numerous experimental studies have been carried out to validate these schemes. 
Experiments have included rapid kinetics laser T jump studies providing unprecedented temporal resolution and time resolved (quench flow) DNA foot-printing. Here we complement and add to those studies by investigating the mechanism and dynamics of the final latching/initial unlatching at an atomic level. This is accomplished with the computational tools of molecular dynamics and the theory of Milestoning. Our investigation begins by generating a reaction coordinate from the crystal structure of the DNA-protein complex and other images generated through modelling based on biochemical intuition. The initial path is generated by steepest descent minimization providing us with over 100 anchor images along the Steepest Descent Path (SDP) reaction coordinate. We then use the tools of Milestoning to sample hypersurfaces (milestones) between reaction coordinate anchors. Launching multiple trajectories from each milestone allowed us to accumulate average passage times to adjacent milestones and obtain transition probabilities. A complete set of rates was obtained this way allowing us to draw important conclusions about the mechanism of DNA bending. We uncover two possible metastable intermediates in the dissociation unkinking process. The first is an unexpected stable intermediate formed by initial unlatching of the IHF arms accompanied by a complete "psi-0" to "psi+140" conformational change of the IHF arm tip prolines. This unlatching (de-intercalation of the IHF tips from the kink sites) is required for any unkinking to occur. The second intermediate is formed by the IHF protein arms sliding over the DNA phosphate backbone and refolding in the next groove. The formation of this intermediate occurs on the millisecond timescale which is within experimental unkinking rate results. 
We show that our code optimization and parallelization enhancements allow the entire computational process for these millisecond-timescale events to be completed in about one month on 10 or fewer GPU-equipped workstations or cluster nodes, bringing these studies within reach of researchers who do not have access to supercomputer clusters.
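The Milestoning bookkeeping described in the abstract above (many short trajectories launched from each milestone, then averaged passage times and transition probabilities) can be illustrated with a minimal sketch. The record format, the function name `milestone_statistics`, and the sample numbers are assumptions made for illustration; they are not taken from the dissertation.

```python
from collections import defaultdict

def milestone_statistics(records):
    """Estimate transition probabilities and mean lifetimes per milestone.

    records: iterable of (start, end, passage_time) tuples, one per short
    trajectory launched from milestone `start` and stopped when it first
    reaches an adjacent milestone `end` (hypothetical record layout).
    """
    counts = defaultdict(lambda: defaultdict(int))
    total_time = defaultdict(float)
    launched = defaultdict(int)
    for start, end, passage_time in records:
        counts[start][end] += 1
        total_time[start] += passage_time
        launched[start] += 1
    # Fraction of trajectories from each milestone that ended at each neighbor.
    probabilities = {
        s: {e: n / launched[s] for e, n in ends.items()}
        for s, ends in counts.items()
    }
    # Average time a trajectory spent before hitting any neighbor.
    mean_lifetime = {s: total_time[s] / launched[s] for s in launched}
    return probabilities, mean_lifetime

# Made-up data: four trajectories launched from milestone 1 (times in ps).
records = [(1, 0, 2.0), (1, 2, 4.0), (1, 2, 3.0), (1, 2, 3.0)]
probabilities, mean_lifetime = milestone_statistics(records)
```

In a full calculation these per-milestone quantities would feed a rate computation over the whole set of milestones; that step is omitted here.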
Core-characteristic-aware off-chip memory management in a multicore system-on-chip
(2012-12) Jeong, Min Kyu; Erez, Mattan; John, Lizy K.; Chiou, Derek; Lin, Calvin; Schulte, Michael J.
Future processors will integrate an increasing number of cores because the scaling of single-thread performance is limited and because smaller cores are more power efficient. Off-chip memory bandwidth that is shared between those many cores, however, scales slower than the transistor (and core) count does. As a result, in many future systems, off-chip bandwidth will become the bottleneck of heavy demand from multiple cores. Therefore, optimally managing the limited off-chip bandwidth is critical to achieving high performance and efficiency in future systems. In this dissertation, I will develop techniques to optimize the shared use of limited off-chip memory bandwidth in chip-multiprocessors. I focus on issues that arise from the sharing and exploit the differences in memory access characteristics, such as locality, bandwidth requirement, and latency sensitivity, between the applications running in parallel and competing for the bandwidth. First, I investigate how the shared use of memory by many cores can result in reduced spatial locality in memory accesses. I propose a technique that partitions the internal memory banks between cores in order to isolate their access streams and eliminate locality interference. The technique compensates for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. For three different workload groups that consist of benchmarks with high spatial locality, low spatial locality, and mixes of the two, the average system efficiency improves by 10%, 7%, 9% for 2-rank systems, and 18%, 25%, 20% for 1-rank systems, respectively, over the baseline shared-bank system. Next, I improve the performance of a heterogeneous system-on-chip (SoC) in which cores have distinct memory access characteristics. I develop a deadline-aware shared memory bandwidth management scheme for SoCs that have both CPU and GPU cores. 
I show that statically prioritizing the CPU can severely constrict GPU performance, and propose to dynamically adapt the priority of CPU and GPU memory requests based on the progress of GPU workload. The proposed dynamic bandwidth management scheme provides the target GPU performance while prioritizing CPU performance as much as possible, for any CPU-GPU workload combination with different complexities.
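One way to picture the dynamic prioritization described above is a progress check against the GPU deadline. The sketch below is an assumption-laden illustration: the linear "expected progress" model, the function name `gpu_has_priority`, and its parameters are hypothetical and are not the dissertation's actual mechanism.

```python
def gpu_has_priority(work_done, work_total, time_elapsed, deadline):
    """Return True when GPU memory requests should be served before CPU ones.

    Assumption: GPU work is expected to progress linearly toward the deadline.
    The CPU keeps priority as long as the GPU is on or ahead of schedule.
    """
    if work_done >= work_total:
        return False  # GPU work finished: nothing left to protect
    if time_elapsed >= deadline:
        return True   # deadline reached with work remaining: GPU is urgent
    expected_progress = time_elapsed / deadline
    actual_progress = work_done / work_total
    return actual_progress < expected_progress

ahead = gpu_has_priority(work_done=50, work_total=100, time_elapsed=25, deadline=100)
behind = gpu_has_priority(work_done=10, work_total=100, time_elapsed=50, deadline=100)
```

Here `ahead` is False (the GPU is ahead of schedule, so the CPU stays prioritized) and `behind` is True (the GPU has fallen behind, so its requests are promoted).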
An enhanced GPU architecture for not-so-regular parallelism with special implications for database search
(2014-05) Narasiman, Veynu Tupil; Patt, Yale N.
Graphics Processing Units (GPUs) have become a popular platform for executing general purpose (i.e., non-graphics) applications. To run efficiently on a GPU, applications must be parallelized into many threads, each of which performs the same task but operates on different data (i.e., data parallelism). Previous work has shown that some applications experience significant speedup when executed on a GPU instead of a CPU. The applications that benefit most tend to have certain characteristics such as high computational intensity, regular control-flow and memory access patterns, and little to no communication among threads. However, not all parallel applications have these characteristics. Applications with a more balanced compute to memory ratio, divergent control flow, irregular memory accesses, and/or frequent communication (i.e., not-so-regular applications) will not take full advantage of the GPU's resources, resulting in performance far short of what could be delivered. The goal of this dissertation is to enhance the GPU architecture to better handle not-so-regular parallelism. This is accomplished in two parts. First, I analyze a diverse set of data parallel applications that suffer from divergent control-flow and/or significant stall time due to memory. I propose two microarchitectural enhancements to the GPU called the Large Warp Microarchitecture and Two-Level Warp Scheduling to address these problems respectively. When combined, these mechanisms increase performance by 19% on average. Second, I examine one of the most important and fundamental applications in computing: database search. Database search is an excellent example of an application that is rich in parallelism, but rife with not-so-regular characteristics. 
I propose enhancements to the GPU architecture including new instructions that improve intra-warp thread communication and decision making, and also a row-buffer locality hint bit to better handle the irregular memory access patterns of index-based tree search. These proposals improve performance by 21% for full table scans, and 39% for index-based search. The result of this dissertation is an enhanced GPU architecture that better handles not-so-regular parallelism. This increases the scope of applications that run efficiently on the GPU, making it a more viable platform not only for current parallel workloads such as databases, but also for future and emerging parallel applications.
Fast self-shadowing using occluder textures
(Texas A&M University, 2007-04-25) Coleman, Christopher Ryan
A real-time self-shadowing technique is described. State of the art shadowing techniques that utilize modern hardware often require multiple rendering passes and introduce rendering artifacts. Combining separate ideas from earlier techniques which project geometry onto a plane and project imagery onto an object results in a new real-time technique for self-shadowing. This technique allows an artist to construct occluder textures and assign them to shadow planes for a self-shadowed model. Utilizing a graphics processing unit (GPU), a vertex program computes shadowing coordinates in real-time, while a fragment program applies the shading and shadowing in a single rendering pass. The methodology used to create shadow planes and write the vertex and fragment programs is given, as well as the relation to the previous work. This work includes implementing this technique, applying it to a small set of test models, describing the types of models for which the technique is well suited, as well as those for which it is not well suited, and comparing the technique's performance and image quality to other state of the art shadowing techniques. This technique performs as well as other real-time techniques and can reduce rendering artifacts in certain circumstances.
Finite element modeling of electromagnetic radiation and induced heat transfer in the human body
(2013-08) Kim, Kyungjoo; Demkowicz, Leszek; Eijkhout, Victor; Van de Geijn, Robert A.
This dissertation develops adaptive hp-Finite Element (FE) technology and a parallel sparse direct solver enabling the accurate modeling of the absorption of Electro-Magnetic (EM) energy in the human head. With a large and growing number of cell phone users, the adverse health effects of EM fields have raised public concerns. Most research that attempts to explain the relationship between exposure to EM fields and its harmful effects on the human body identifies temperature changes due to the EM energy as the dominant source of possible harm. The research presented here focuses on determining the temperature distribution within the human body exposed to EM fields with an emphasis on the human head. Major challenges in accurately determining the temperature changes lie in the dependence of EM material properties on the temperature. This leads to a formulation that couples the BioHeat Transfer (BHT) and Maxwell equations. The mathematical model is formed by the time-harmonic Maxwell equations weakly coupled with the transient BHT equation. This choice of equations reflects the relevant time scales. With a mobile device operating at a single frequency, EM fields arrive at a steady-state in the micro-second range. The heat sources induced by EM fields produce a transient temperature field converging to a steady-state distribution on a time scale ranging from seconds to minutes; this necessitates the transient formulation. Since the EM material properties depend upon the temperature, the equations are fully coupled; however, the coupling is realized weakly due to the different time scales for Maxwell and BHT equations. The BHT equation is discretized in time with a time step reflecting the thermal scales. After multiple time steps, the temperature field is used to determine the EM material properties and the time-harmonic Maxwell equations are solved. The resulting heat sources are recalculated and the process continued. 
Due to the weak coupling of the problems, the corresponding numerical models are established separately. The BHT equation is discretized with H¹ conforming elements, and Maxwell equations are discretized with H(curl) conforming elements. The complexity of the human head geometry naturally leads to the use of tetrahedral elements, which are commonly employed by unstructured mesh generators. The EM domain, including the head and a radiating source, is terminated by a Perfectly Matched Layer (PML), which is discretized with prismatic elements. The use of high order elements of different shapes and discretization types has motivated the development of a general 3D hp-FE code. In this work, we present new generic data structures and algorithms to perform adaptive local refinements on a hybrid mesh composed of different shaped elements. A variety of isotropic and anisotropic refinements that preserve conformity of discretization are designed. The refinement algorithms support one-irregular meshes with the constrained approximation technique. The algorithms are experimentally proven to be deadlock free. A second contribution of this dissertation lies with a new parallel sparse direct solver that targets linear systems arising from hp-FE methods. The new solver interfaces to the hierarchy of a locally refined mesh to build an elimination ordering for the factorization that reflects the h-refinements. By following mesh refinements, not only the computation of element matrices but also their factorization is restricted to new elements and their ancestors. The solver is parallelized by exploiting two-level task parallelism: tasks are first generated from a parallel post-order tree traversal on the assembly tree; next, those tasks are further refined by using algorithms-by-blocks to gain fine-grained parallelism. The resulting fine-grained tasks are asynchronously executed after their dependencies are analyzed.
This approach effectively reduces scheduling overhead and increases flexibility to handle irregular tasks. The solver outperforms the conventional general sparse direct solver for a class of problems formulated by high order FEs. Finally, numerical results for a 3D coupled BHT with Maxwell equations are presented. The solutions of this Maxwell code have been verified using the analytic Mie series solutions. Starting with simple spherical geometry, parametric studies are conducted on realistic head models for a typical frequency band (900 MHz) of mobile phones.
GPU programming for real-time watercolor simulation
(Texas A&M University, 2005-02-17) Scott, Jessica Stacy
This thesis presents a method for combining GPU programming with traditional programming to create a fluid simulation based watercolor tool for artists. This application provides a graphical interface and a canvas upon which artists can create simulated watercolors in real time. The GPU, or Graphics Processing Unit, is an efficient and highly parallel processor located on the graphics card of a computer; GPU programming is touted as a way to improve performance in graphics and non-graphics applications. The effectiveness of this method in speeding up large, general purpose programs, however, is found here to be disappointing. In a small application with minimal CPU/GPU interaction, theoretical speedups of 10 times may be achieved, but with the limitations of communication speed between the GPU and the CPU, gains are slight when this method is used in conjunction with traditional programming.
GPU-based Parallel Computing Models and Implementations for Two-party Privacy-preserving Protocols
(2013-11-25) Pu, Shi
In (two-party) privacy-preserving-based applications, two users use encrypted inputs to compute a function without giving out plaintext of their input values. Privacy-preserving computing algorithms have to utilize a large amount of computing resources to handle the encryption-decryption operations. In this dissertation, we study optimal utilization of computing resources on the graphic processor unit (GPU) architecture for privacy-preserving protocols based on secure function evaluation (SFE) and the Elliptic Curve Cryptographic (ECC) and related algorithms. A number of privacy-preserving protocols are implemented, including private set intersection (PSI), secret handshaking (SH), secure Edit distance (ED) and Smith-Waterman (SW) problems. PSI is chosen to represent ECC point multiplication related computations, SH for bilinear pairing, and the last two for SFE-based dynamic programming (DP) problems. They represent different types of computations, so that in-depth understanding of the benefits and limitations of the GPU architecture for privacy preserving protocols is gained. For SFE-based ED and SW problems, a wavefront parallel computing model on the CPU-GPU architecture under the semi-honest security model is proposed. Low level parallelization techniques for GPU-based gate (de-)garbler, synchronized parallel memory access, pipelining, and general GPU resource mapping policies are developed. This dissertation shows that the GPU architecture can be fully utilized to speed up SFE-based ED and SW algorithms, which are constructed with billions of garbled gates, on a contemporary GPU card GTX-680, with very little waste of processing cycles or memory space. For PSI and SH protocols and underlying ECC algorithms, the analysis in this research shows that the conventional Montgomery-based number system is more friendly to the GPU architecture than the Residue Number System (RNS) is. 
Analysis of the experimental results further shows that lazy reduction in higher extension fields can have performance benefits only when the GPU architecture has enough fast memory. The resulting Elliptic curve Arithmetic GPU Library (EAGL) can run 3350.9 R-ate (bilinear) pairings/sec and 47000 point multiplications/sec at the 128-bit security level on one GTX-680 card. The primary performance bottleneck for bilinear pairing operations is found to be the lack of advanced memory management functions in the contemporary GPU architecture. Substantial performance gains can be expected when larger on-chip memory and/or more advanced memory prefetching mechanisms are supported in future generations of GPUs.
Improving energy efficiency of reliable massively-parallel architectures
(2012-05) Krimer, Evgeni; Erez, Mattan; John, Lizy K.; Orshansky, Michael; Gerstlauer, Andreas; Sentis, Luis
While transistor size continues to shrink every technology generation, increasing the number of transistors on a die, the reduction in energy consumption is less significant. Furthermore, newer technologies induce fabrication challenges resulting in uncertainties in transistor and wire properties. Therefore, to ensure correctness, design margins are introduced, resulting in significantly sub-optimal energy efficiency. While increasing parallelism and the use of gating methods contribute to energy consumption reduction, ultimately, more radical changes to the architecture and better integration of architectural and circuit techniques will be necessary. This dissertation explores one such approach, combining a highly-efficient massively-parallel processor architecture with a design methodology that reduces energy by trimming design margins. Using a massively-parallel GPU-like (graphics processing unit) baseline architecture, we discuss the different components of process variation and design microarchitectural approaches supporting efficient margins reduction. We evaluate our design using a cycle-based GPU simulator, describe the conditions where efficiency improvements can be obtained, and explore the benefits of decoupling across a wide range of parameters. We architect a test-chip that was fabricated and show these mechanisms to work. We also discuss why previously developed related approaches fall short when process variation is very large, such as in low-voltage operation or as expected for future VLSI technology. We therefore develop and evaluate a new approach specifically for high-variation scenarios. To summarize, in this work, we address the emerging challenges of modern massively parallel architectures including energy efficient, reliable operation and high process variation.
We believe that the results of this work are essential for breaking through the energy wall, continuing to improve the efficiency of future generations of the massively parallel architectures.
Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors
(2014-05) Li, Dong, active 21st century; Fussell, Donald S., 1951-; Burger, Douglas C., Ph. D.
Throughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize the on-chip and off-chip resources in the memory system more effectively to further improve the memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve the cache efficiency and the memory throughput. This thesis work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads to share the cache and the minimum number of threads bypassing the cache that saturate memory system resources. This approach reduces the cache thrashing problem and effectively employs chip resources that would otherwise go unused by a pure thread throttling approach. We observe a 67% improvement over the original as-is benchmarks and an 18% improvement over a better-tuned warp-throttling baseline. This work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks based on the scheduling priority of their fetching warp at replacement. Dynamic-AgeLRU selects the AgeLRU algorithm and the LRU algorithm adaptively to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization.
Compared to the LRU algorithm, the above mentioned three variants of the AgeLRU algorithm enable increases in performance of 4%, 8% and 28% respectively across a set of cache-sensitive benchmarks. This thesis work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority together with the fetching Program Counter (PC) to generate a signature as the index of the prediction table, RPR identifies and prioritizes the near-reuse blocks and high-reuse blocks to maximize the cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that the RPR algorithm results in a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks. The techniques proposed in this dissertation are able to alleviate the cache thrashing, cache pollution and resource saturation problems effectively. We believe when these techniques are combined, they will synergistically further improve GPU cache efficiency and the overall memory system throughput.
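To make the idea of priority-aware replacement concrete, the following sketch picks a victim block by the scheduling priority of the warp that fetched it, falling back to recency only to break ties. This is an illustrative guess at a replacement-only policy, not the dissertation's AgeLRU implementation; the `CacheBlock` fields, the "higher number means higher priority" convention, and `choose_victim` are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class CacheBlock:
    tag: int
    warp_priority: int  # assumed convention: higher = fetching warp scheduled first
    last_access: int    # logical timestamp of the most recent access

def choose_victim(cache_set):
    """Evict the block fetched by the lowest-priority warp; break ties by LRU."""
    return min(cache_set, key=lambda b: (b.warp_priority, b.last_access))

cache_set = [
    CacheBlock(tag=0xA, warp_priority=3, last_access=5),
    CacheBlock(tag=0xB, warp_priority=1, last_access=12),
    CacheBlock(tag=0xC, warp_priority=1, last_access=7),
]
victim = choose_victim(cache_set)
# Plain LRU, shown only for comparison, ignores warp priority entirely.
lru_victim = min(cache_set, key=lambda b: b.last_access)
```

In this made-up set, plain LRU would evict block 0xA even though a high-priority warp fetched it, whereas the priority-aware choice evicts 0xC and keeps the high-priority warp's data resident.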
Performance-efficient mechanisms for managing irregularity in throughput processors
(2014-05) Rhu, Minsoo; Erez, Mattan
Recent graphics processing units (GPUs) have emerged as a promising platform for general purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model that balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads and each thread is supported with conditional branch and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively enable the grouping of scalar threads to be executed in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs gain momentum in being utilized in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and waste in off-chip memory bandwidth utilization for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures such that they can better manage irregularity. I first identify that previously suggested optimizations to the divergent control flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on such observations, I propose and evaluate three novel mechanisms that resolve the aforementioned issues, providing significant performance improvements while minimizing implementation overhead.
In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth utilization. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality.
Perspective-Driven Radiosity on Graphics Hardware
(2011-08-08) Bozalina, Justin Taylor
Radiosity is a global illumination algorithm used by artists, architects, and engineers for its realistic simulation of lighting. Since the illumination model is global, complexity and run time grow as larger environments are provided. Algorithms exist which generate an incremental result and provide weighting based on the user's view of the environment. This thesis introduces an algorithm for directing and focusing radiosity calculations relative to the user's point-of-view and within the user's field-of-view, generating visually interesting results for a localized area more quickly than a traditional global approach. The algorithm, referred to as perspective-driven radiosity, is an extension of the importance-driven radiosity algorithm, which itself is an extension of the progressive refinement radiosity algorithm. The software implemented during research into the point-of-view/field-of-view-driven algorithm can demonstrate both of these algorithms, and can generate results for arbitrary geometry. Parameters can be adjusted by the user to provide results that favor speed or quality. To take advantage of the scalability of programmable graphics hardware, the algorithm is implemented as an extension of progressive refinement radiosity on the GPU, using OpenGL and GLSL. Results from each of the three implemented radiosity algorithms are compared using a variety of geometry.
Scaling reinforcement learning to the unconstrained multi-agent domain
(2009-06-02) Palmer, Victor
Reinforcement learning is a machine learning technique designed to mimic the way animals learn by receiving rewards and punishment. It is designed to train intelligent agents when very little is known about the agent?s environment, and consequently the agent?s designer is unable to hand-craft an appropriate policy. Using reinforcement learning, the agent?s designer can merely give reward to the agent when it does something right, and the algorithm will craft an appropriate policy automatically. In many situations it is desirable to use this technique to train systems of agents (for example, to train robots to play RoboCup soccer in a coordinated fashion). Unfortunately, several significant computational issues occur when using this technique to train systems of agents. This dissertation introduces a suite of techniques that overcome many of these difficulties in various common situations. First, we show how multi-agent reinforcement learning can be made more tractable by forming coalitions out of the agents, and training each coalition separately. Coalitions are formed by using information-theoretic techniques, and we find that by using a coalition-based approach, the computational complexity of reinforcement-learning can be made linear in the total system agent count. Next we look at ways to integrate domain knowledge into the reinforcement learning process, and how this can signifi-cantly improve the policy quality in multi-agent situations. Specifically, we find that integrating domain knowledge into a reinforcement learning process can overcome training data deficiencies and allow the learner to converge to acceptable solutions when lack of training data would have prevented such convergence without domain knowledge. We then show how to train policies over continuous action spaces, which can reduce problem complexity for domains that require continuous action spaces (analog controllers) by eliminating the need to finely discretize the action space. 
Finally, we look at ways to perform reinforcement learning on modern GPUs and show how doing so lets us tackle significantly larger problems. We find that by offloading some of the RL computation to the GPU, we can achieve a speedup factor of almost 4.5 in the total training process.
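As a rough illustration of the reward-driven idea described in this abstract (a designer supplies only a reward and the algorithm derives the policy), here is a minimal single-agent sketch using tabular Q-learning. It is not the dissertation's method: the five-state corridor, the names `step`, `train`, and `greedy_policy`, and all parameter values are assumptions made for this example, and nothing here covers coalitions, domain knowledge, continuous actions, or GPU offloading.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
GAMMA = 0.9           # discount factor (assumed value)
ALPHA = 1.0           # full-step update; safe only because this toy world is deterministic


def step(state, action):
    """Deterministic toy environment: reward 1.0 only on entering the last state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def train(episodes=50, seed=0):
    """Off-policy Q-learning: act uniformly at random, learn the greedy values."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)
            nxt, reward = step(state, action)
            target = reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Best action for each non-terminal state according to the learned table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The designer never specifies "move right"; that behavior emerges from the reward signal alone. The table holds one row per state, so a joint table over several agents grows with the product of their state spaces, which hints at why multi-agent settings raise the computational issues the abstract mentions.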
simCUDA: A C++ based CUDA simulation framework
(2016-05) Das, Abhishek; Gerstlauer, Andreas, 1970-; Touba, Nur A
The primary objective of this thesis is to develop a CUDA simulation framework (simCUDA) that effectively maps existing applications written in CUDA so that they execute on standard multi-core CPU architectures. This is done by annotating the application at the source level and making the changes required for it to run on a multi-core CPU in a manner functionally equivalent to how it would run on a CUDA-supported GPU. The simulation framework has been developed using C++11 threads, which provide an abstraction for a thread of execution, as well as several classes and class templates for mutexes, condition variables, and locks, to be used for their management. As an extension to the simulation framework, the basic block sequence of execution on a per-thread basis is also computed for analysis. This information can in turn be used to derive the basic block sequence of execution on a per-warp basis, and thus emulate and replicate the real-world behavior of a GPU.
Soft MIMO Detection on Graphics Processing Units and Performance Study of Iterative MIMO Decoding
(2012-10-19) Arya, Richeek
In this thesis we have presented an implementation of soft Multi Input Multi Output (MIMO) detection, using a single tree search algorithm, on Graphics Processing Units (GPUs). We have compared its performance on different GPUs and a Central Processing Unit (CPU). We have also done a performance study of iterative decoding algorithms. We have shown that by increasing the number of outer iterations, error rate performance can be further improved. GPUs are specialized devices designed to accelerate graphics processing. They are massively parallel devices which can run thousands of threads simultaneously. Because of their tremendous processing power there is an increasing interest in using them for scientific and general purpose computations. Hence companies like Nvidia and Advanced Micro Devices (AMD) have started to support General Purpose GPU (GPGPU) applications. Nvidia came up with Compute Unified Device Architecture (CUDA) to program its GPUs. Efforts are being made to come up with a standard language for parallel computing that can be used across platforms. OpenCL is the first such language which is supported by all major GPU and CPU vendors. A MIMO detector has a high computational complexity. We have implemented a soft MIMO detector on GPUs and studied its throughput and latency performance. We have shown that a GPU can give a throughput of up to 4 Mbps for a soft detection algorithm, which is more than sufficient for most general purpose tasks such as voice communication. Compared to a CPU, a throughput increase of ~7x is achieved. We also compared the performance of two GPUs, one with low computational power and one with high computational power. These comparisons show the effect of thread serialization on the algorithm, with the lower-end GPU's execution time curve showing a slope of 1/2. To further improve error rate performance, iterative decoding techniques are employed, in which a feedback path is added between detector and decoder.
With an eye towards GPU implementation we have explored these algorithms. Better error rate performance, however, comes at the price of higher power dissipation and more latency. By simulations we have shown that one can predict, based on the Signal to Noise Ratio (SNR) values, how many iterations need to be done before getting an acceptable Bit Error Rate (BER) and Frame Error Rate (FER) performance. The iterative decoding technique shows that an SNR gain of ~1.5 dB is achieved when the number of outer iterations is increased from zero. To reduce the complexity one can adjust the number of possible candidates the algorithm can generate. We showed that whereas a candidate list of 128 is not sufficient for acceptable error rate performance for a 4x4 MIMO system using a 16-QAM modulation scheme, performance is comparable for list sizes of 512 and 1024.
Transport in higher dimensional phase spaces
(2016-12) Curry, Christopher Timothy; Morrison, Philip J.; Horton, Jr., Claude W; Hazeltine, Richard; Matzner, Richard; Gamba, Irene
We use a four dimensional symplectic mapping, the coupled cubic-quadratic map, to provide evidence of Arnol’d Diffusion in phase space. We use the method of frequency analysis for dynamical systems to demonstrate the existence of regular orbits, and show that these orbits enclose weakly chaotic orbits which escape in finite time around the tori. A new collocation method for frequency analysis is employed by adapting it to allow for higher precision results. Arbitrary precision numerics are used to obtain highly accurate orbits for long timescales, and the adapted frequency method is used to obtain highly accurate frequencies of the mapping. We review the method of frequency analysis, demonstrate its effectiveness and accuracy in determining frequencies and finding tori in simple systems and low-dimensional mappings, and extend the results to higher dimensions. In the four dimensional mapping, we find several regular orbits with irrational frequency ratios, indicating the existence of tori in the phase space, as well as interior orbits that escape around these tori.
