Browsing by Subject "CUDA"

Now showing 1 - 5 of 5

A Phase Based Dense Stereo Algorithm Implemented in CUDA
(2012-07-16) Macomber, Brent David
Stereo imaging is routinely used in Simultaneous Localization and Mapping (SLAM) systems for the navigation and control of autonomous spacecraft proximity operations, advanced robotics, and robotic mapping and surveying applications. A key step (and generally the most computationally expensive step) in the generation of high fidelity geometric environment models from image data is the solution of the dense stereo correspondence problem. A novel method for solving the stereo correspondence problem to sub-pixel accuracy in the Fourier frequency domain by exploiting the Convolution Theorem is developed. The method is tailored to challenging aerospace applications by incorporation of correction factors for common error sources. Error-checking metrics verify correspondence matches to ensure high quality depth reconstructions are generated. The effect of geometric foreshortening caused by the baseline displacement of the cameras is modeled and corrected, drastically improving correspondence matching on highly off-normal surfaces. A metric for quantifying the strength of correspondence matches is developed and implemented to recognize and reject weak correspondences, and a separate cross-check verification provides a final defense against erroneous matches. The core components of this phase based dense stereo algorithm are implemented and optimized in the Compute Uni ed Device Architecture (CUDA) parallel computation environment onboard an NVIDIA Graphics Processing Unit (GPU). Accurate dense stereo correspondence matching is performed on stereo image pairs at a rate of nearly 10Hz.
Analysis of GPU-based convolution for acoustic wave propagation modeling with finite differences: Fortran to CUDA-C step-by-step
(2014-05) Sadahiro, Makoto; Stoffa, Paul L., 1948-; Tatham, R. H. (Robert H.), 1943-
By projecting observed microseismic data backward in time to when fracturing occurred, it is possible to locate the fracture events in space, assuming a correct velocity model. In order to achieve this task in near real-time, a robust computational system to handle backward propagation, or Reverse Time Migration (RTM), is required. We can then test many different velocity models for each run of the RTM. We investigate the use of a Graphics Processing Unit (GPU) based system using Compute Unified Device Architecture for C (CUDA-C) as the programming language. Our preliminary results show a large improvement in run-time over conventional programming methods based on conventional Central Processing Unit (CPU) computing with Fortran. Considerable room for improvement still remains.
Computational kinetics of a large scale biological process on GPU workstations : DNA bending
(2013-05) Ruymgaart, Arnold Peter; Elber, Ron
It has only recently become possible to study the dynamics of large time scale biological processes computationally in explicit solvent and atomic detail. This required a combination of advances in computer hardware, utilization of parallel and special purpose hardware as well as numerical and theoretical approaches. In this work we report advances in these areas contributing to the feasibility of a work of this scope in a reasonable time. We then make use of them to study an interesting model system, the action of the DNA bending protein 1IHF and demonstrate such an effort can now be performed on GPU equipped PC workstations. Many cellular processes require DNA bending. In the crowded compartment of the cell, DNA must be efficiently stored but this is just one example where bending is observed. Other examples include the effects of DNA structural features involved in transcription, gene regulation and recombination. 1IHF is a bacterial protein that binds and kinks DNA at sequence specific sites. The 1IHF binding to DNA is the cause or effect of bending of the double helix by almost 180 degrees. Most sequence specific DNA binding proteins bind in the major groove of the DNA and sequence specificity results from direct readout. 1IHF is an exception; it binds in the minor groove. The final structure of the binding/bending reaction was crystallized and shows the protein arm like features "latched" in place wrapping the DNA in the minor grooves and intercalating the tips between base pairs at the kink sites. This sequence specific, mostly indirect readout protein-DNA binding/bending interaction is therefore an interesting test case to study the mechanism of protein DNA binding and bending in general. Kinetic schemes have been proposed and numerous experimental studies have been carried out to validate these schemes. Experiments have included rapid kinetics laser T jump studies providing unprecedented temporal resolution and time resolved (quench flow) DNA foot-printing. Here we complement and add to those studies by investigating the mechanism and dynamics of the final latching/initial unlatching at an atomic level. This is accomplished with the computational tools of molecular dynamics and the theory of Milestoning. Our investigation begins by generating a reaction coordinate from the crystal structure of the DNA-protein complex and other images generated through modelling based on biochemical intuition. The initial path is generated by steepest descent minimization providing us with over 100 anchor images along the Steepest Descent Path (SDP) reaction coordinate. We then use the tools of Milestoning to sample hypersurfaces (milestones) between reaction coordinate anchors. Launching multiple trajectories from each milestone allowed us to accumulate average passage times to adjacent milestones and obtain transition probabilities. A complete set of rates was obtained this way allowing us to draw important conclusions about the mechanism of DNA bending. We uncover two possible metastable intermediates in the dissociation unkinking process. The first is an unexpected stable intermediate formed by initial unlatching of the IHF arms accompanied by a complete "psi-0" to "psi+140" conformational change of the IHF arm tip prolines. This unlatching (de-intercalation of the IHF tips from the kink sites) is required for any unkinking to occur. The second intermediate is formed by the IHF protein arms sliding over the DNA phosphate backbone and refolding in the next groove. The formation of this intermediate occurs on the millisecond timescale which is within experimental unkinking rate results. We show that our code optimization and parallelization enhancements allow the entire computational process of these millisecond timescale events in about one month on 10 or less GPU equipped workstations/cluster nodes bringing these studies within reach of researchers that do not have access to supercomputer clusters.
simCUDA: A C++ based CUDA simulation framework
(2016-05) Das, Abhishek; Gerstlauer, Andreas, 1970-; Touba, Nur A
The primary objective of this thesis is to develop a CUDA simulation framework (simCUDA) that effectively maps the existing application written in CUDA to be executed on top of standard multi-core CPU architectures. This is done by specifically annotating the application at the source level itself, and making the relevant changes required for the application to run in a similar and functionally equivalent manner on a multi-core CPU as it would run in a CUDA-supported GPU. The simulation framework has been developed using C++11 threads, which provides an abstraction for a thread of execution, as well as several classes and class templates for mutexes, condition variables, and locks, to be used for their management. As an extension to the simulation framework, the basic block sequence of execution on a per thread basis is also computed for analysis. This information can in turn be used to derive the basic block sequence of execution on a per warp basis, and thus emulate and replicate real-world behavior of a GPU.
Soft MIMO Detection on Graphics Processing Units and Performance Study of Iterative MIMO Decoding
(2012-10-19) Arya, Richeek
In this thesis we have presented an implementation of soft Multi Input Multi Output (MIMO) detection, single tree search algorithm on Graphics Processing Units (GPUs). We have compared its performance on different GPUs and a Central Processing Unit (CPU). We have also done a performance study of iterative decoding algorithms. We have shown that by increasing the number of outer iterations error rate performance can be further improved. GPUs are specialized devices specially designed to accelerate graphics processing. They are massively parallel devices which can run thousands of threads simultaneously. Because of their tremendous processing power there is an increasing interest in using them for scientific and general purpose computations. Hence companies like Nvidia, Advanced Micro Devices (AMD) etc. have started their support for General Purpose GPU (GPGPU) applications. Nvidia came up with Compute Unified Device Architecture (CUDA) to program its GPUs. Efforts are made to come up with a standard language for parallel computing that can be used across platforms. OpenCL is the first such language which is supported by all major GPU and CPU vendors. MIMO detector has a high computational complexity. We have implemented a soft MIMO detector on GPUs and studied its throughput and latency performance. We have shown that a GPU can give throughput of up to 4Mbps for a soft detection algorithm which is more than sufficient for most general purpose tasks like voice communication etc. Compare to CPU a throughput increase of ~7x is achieved. We also compared the performances of two GPUs one with low computational power and one with high computational power. These comparisons show effect of thread serialization on algorithms with the lower end GPU's execution time curve shows a slope of 1/2. To further improve error rate performance iterative decoding techniques are employed where a feedback path is employed between detector and decoder. With an eye towards GPU implementation we have explored these algorithms. Better error rate performance however, comes at a price of higher power dissipation and more latency. By simulations we have shown that one can predict based on the Signal to Noise Ratio (SNR) values how many iterations need to be done before getting an acceptable Bit Error Rate (BER) and Frame Error Rate (FER) performance. Iterative decoding technique shows that a SNR gain of ~1:5dB is achieved when number of outer iterations is increased from zero. To reduce the complexity one can adjust number of possible candidates the algorithm can generate. We showed that where a candidate list of 128 is not sufficient for acceptable error rate performance for a 4x4 MIMO system using 16-QAM modulation scheme, performances are comparable with the list size of 512 and 1024 respectively.

Browsing by Subject "CUDA"

Results Per Page

Sort Options