Browsing by Subject "Parallel processing (Electronic computers)"
Now showing 1 - 20 of 25
Item A comparative analysis of graph partitioning tools (Texas Tech University, 2004-08) Chinthapanti, Pavan K.
Not available

Item An adaptable computing structure for cepstrum calculations (Texas Tech University, 1984-12) Fowler, James McCormick
Not available

Item Architectural techniques to accelerate multimedia applications on general-purpose processors (2001-08) Talla, Deependra, 1975-; John, Lizy Kurian
General-purpose processors (GPPs) have been augmented with multimedia extensions to improve performance on multimedia-rich workloads. These extensions operate in a single instruction multiple data (SIMD) fashion to extract data-level parallelism in multimedia and digital signal processing (DSP) applications. This dissertation comprises a comprehensive evaluation of the execution characteristics of multimedia applications on SIMD-enhanced GPPs, detection of bottlenecks in their execution, and the design and implementation of architectural techniques that eliminate or alleviate those bottlenecks to accelerate multimedia applications. Several bottlenecks are identified in the processing of SIMD-enhanced multimedia and DSP applications on GPPs. Approximately 75-85% of the instructions in the dynamic instruction stream of media workloads perform no useful computation; they merely support the useful computation through address generation, address transformation/data reorganization, loads/stores, and loop branches. This leads to underutilization of the SIMD computation units, with only 1-12% of the peak SIMD throughput being achieved. The dissertation proposes hardware support to execute these overhead/supporting instructions efficiently by overlapping them with the useful computation instructions.
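The overhead accounting described in the Talla abstract can be made concrete with a toy dynamic-instruction trace. The instruction categories (address generation, loads/stores, loop branches) come from the abstract; the trace itself, its mnemonics, and its 90% overhead ratio are invented for illustration only.

```python
# Illustrative only: a toy dynamic-instruction trace for one iteration of a
# SIMD loop, used to reproduce the kind of overhead accounting the abstract
# describes (the categories are real; the trace is invented).
TRACE = [
    ("addr_gen", "lea    r1, [base1 + i*16]"),   # address generation
    ("load",     "movaps xmm0, [r1]"),           # data movement
    ("addr_gen", "lea    r2, [base2 + i*16]"),
    ("load",     "movaps xmm1, [r2]"),
    ("compute",  "paddd  xmm0, xmm1"),           # the one useful SIMD op
    ("addr_gen", "lea    r3, [dst + i*16]"),
    ("store",    "movaps [r3], xmm0"),
    ("loop",     "add    i, 4"),                 # loop bookkeeping
    ("loop",     "cmp    i, n"),
    ("loop",     "jl     top"),
]

def overhead_fraction(trace):
    """Fraction of dynamic instructions that merely support the SIMD compute."""
    support = sum(1 for kind, _ in trace if kind != "compute")
    return support / len(trace)

frac = overhead_fraction(TRACE)
print(f"supporting instructions: {frac:.0%}")  # 9 of 10 ops here are overhead
```

In this toy loop, one `paddd` does all the useful work while nine instructions feed it; MediaBreeze's premise is that hardware can overlap those nine with the compute.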
A 2-way GPP with SIMD extensions augmented with the proposed MediaBreeze hardware significantly outperforms a 16-way SIMD GPP without MediaBreeze hardware on multimedia kernels. On full multimedia applications, a 2-/4-way SIMD GPP augmented with MediaBreeze hardware is superior to a 4-/8-way SIMD GPP without it. These improvements are achieved at an area cost below 0.3% of current GPPs and a power cost below 1% of total processor power, without lengthening the processor's critical path.

Item Compiler directed speculation for embedded clustered EPIC machines (2004) Pillai, Satish; Jacome, Margarida F.
Very Large Instruction Word (VLIW)/Explicitly Parallel Instruction Computing (EPIC) processors are an attractive platform for many of today's multimedia and communications applications. In particular, clustered VLIW/EPIC machines can take aggressive advantage of the available instruction-level parallelism (ILP) while maintaining high energy-delay efficiency. However, multicluster machines are more challenging compilation targets than centralized machines. In this thesis, we propose a novel compiler-directed, resource-aware ILP extraction technique, called predicated switching, targeted at such multicluster VLIW/EPIC machines. The proposed technique integrates three powerful ILP extraction techniques, predication, speculation, and software pipelining, in a combined framework.
The three novel contributions are: (1) a compiler transformation, denoted Static Single Assignment - Predicated Switching (SSA-PS), that leverages required data transfers between clusters for performance gains; (2) a static speculation algorithm that decides which specific kernel operations should actually be speculated in a region of code (hyperblock), possibly while it is simultaneously software pipelined, so as to maximize execution performance on the target processor; and (3) an ILP extraction flow incorporating several code generation phases critical to profitable ILP extraction by the compiler. Experimental results on a representative set of time-critical kernels compiled for a number of target machines show that, when compared to two baseline "resource-unaware" speculation techniques (one that speculates aggressively and one that speculates conservatively), predicated switching improves performance over at least one of the baselines in 65% of the cases, by up to 50%. Moreover, we show that code size and register pressure are not adversely affected by our technique. Finally, we show that our ILP extraction framework combining speculation and software pipelining can effectively exploit the relative merits of both techniques.

Item Design and evaluation of a technology-scalable architecture for instruction-level parallelism (2007) Nagarajan, Ramadass, 1977-; Burger, Douglas C., Ph.D.
Future performance improvements must come from the exploitation of concurrency at all levels. Recent approaches that focus on thread-level and data-level concurrency are a natural fit for certain application domains, but it is unclear whether they can be adapted efficiently to eliminate serial bottlenecks. Conventional superscalar hardware that instead focuses on instruction-level parallelism (ILP) is limited by power inefficiency, on-chip wire latency, and design complexity.
Ultimately, poor single-thread performance and Amdahl’s law will inhibit the overall performance growth even on parallel workloads. To address this problem, we undertook the challenge of designing a scalable, wide-issue, large-window processor that mitigates complexity, reduces power overheads, and exploits ILP to improve single-thread performance at future wire-delay dominated technologies. This dissertation describes the design and evaluation of the TRIPS architecture for exploiting ILP. The TRIPS architecture belongs to a new class of instruction set architectures called Explicit Data Graph Execution (EDGE) architectures that use large dataflow graphs of computation and explicit producer-consumer communication to express concurrency to the hardware. We describe how these architectures match the characteristics of future sub-45 nm CMOS technologies to mitigate complexity and improve concurrency at reduced overheads. We describe the architectural and microarchitectural principles of the TRIPS architecture, which exploits ILP by issuing instructions widely, in dynamic dataflow fashion, from a large distributed window of instructions. We then describe our specific contributions to the development of the TRIPS prototype chip, which was implemented in a 130 nm ASIC technology and consists of more than 170 million transistors. In particular, we describe the implementation of the distributed control protocols that offer various services for executing a single program in the hardware. Finally, we describe a detailed evaluation of the TRIPS architecture and identify the key determinants of its performance. In particular, we describe the development of the infrastructure required for a detailed analysis, including a validated performance model, a highly optimized suite of benchmarks, and critical path models that identify various architectural and microarchitectural bottlenecks at a fine level of granularity. 
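The dynamic dataflow issue model described above, in which an instruction fires as soon as its operands arrive over explicit producer-consumer edges, can be sketched with a toy interpreter. The four-node graph below is invented for illustration and nothing here models the actual TRIPS hardware.

```python
# A minimal sketch of dataflow-order issue as in EDGE-style execution: each
# instruction names its consumers explicitly, and an instruction fires as
# soon as all of its operands have arrived.
from collections import deque

# name: (function, number_of_operands, [(consumer, operand_slot), ...])
GRAPH = {
    "a": (lambda: 2,          0, [("c", 0)]),   # produce constant 2
    "b": (lambda: 3,          0, [("c", 1)]),   # produce constant 3
    "c": (lambda x, y: x + y, 2, [("d", 0)]),   # add
    "d": (lambda x: x * 10,   1, []),           # scale; final result
}

def execute(graph):
    operands = {name: [None] * n for name, (_, n, _) in graph.items()}
    ready = deque(name for name, (_, n, _) in graph.items() if n == 0)
    values = {}
    while ready:
        name = ready.popleft()
        fn, _, consumers = graph[name]
        values[name] = fn(*operands[name])
        # forward the produced value along explicit producer-consumer edges
        for consumer, slot in consumers:
            operands[consumer][slot] = values[name]
            if all(v is not None for v in operands[consumer]):
                ready.append(consumer)   # all operands arrived: fire
    return values

print(execute(GRAPH)["d"])  # (2 + 3) * 10 = 50
```

No program counter orders `c` and `d`; they issue purely because their inputs became available, which is the property EDGE ISAs expose to a distributed microarchitecture.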
On a set of highly optimized benchmark kernels, the manufactured TRIPS parts outperform a conventional superscalar processor by a factor of 3× on average. We find that automatically compiled versions of the same kernels do not yet reap the full benefits of the high-ILP TRIPS core, but they exceed the performance of the superscalar processor in many cases. Our results indicate that the overhead of the various control protocols that manage overall execution in the processor has only a modest effect on performance. However, operand communication between components of the distributed microarchitecture contributes nearly a third of the execution cycles. Fanout instructions, necessitated by the limited, fixed-width encoding of the dataflow instruction set, also incur non-trivial performance overheads. Our results point to an exciting line of future research on overcoming these limitations to achieve low-overhead distributed dataflow execution.

Item Design, testing, and implementation of WiSeMote: A wireless sensor network for structural health monitoring (2012-05) Hoover, Davis; Karp, Tanja; Rice, Jennifer A.; Li, Changzhi
Much research has been done toward the realization of wireless sensor networks (WSNs) that can detect damage in civil structures by aggregating information in a distributed manner. These networks are still bound by system limitations such as limited battery power, low processor speeds, and limited radio transmission throughput. While many systems have been presented that can practically perform damage detection with their limited resources, there is still room for improvement in power consumption, noise immunity, and utilization of network resources. This research presents the WiSeMote: a new wireless sensor node and base station that improve upon the power consumption and noise immunity of existing systems.
Performance characterization test results are shown for communication, time synchronization, noise immunity, and data quality. Furthermore, this research takes an existing damage detection strategy, analyzes it from the perspective of the new sensor platform, and implements it on the WiSeMote in a more resource-efficient manner. Improvements in power consumption, external memory usage, and processing time are made possible through a novel parallel processing strategy for real-time calculation.

Item Evaluating Impulse C and multiple parallelism partitions for a low-cost reconfigurable computing system (Baylor University, Dept. of Electrical and Computer Engineering, 2009-04) Li Shen, Carmen C.; Duren, Russell Walker
Impulse C is a C-to-HDL compiler from Impulse Accelerated Technology that facilitates the introduction of software programmers, mathematicians, and scientists into the realm of FPGA-based algorithm development for high-speed numerical computation. This thesis evaluates the Impulse C programming language and explores differing levels of parallelism across multiple homogeneous FPGA development platforms using the Aurora serial communication scheme. Impulse C and Xilinx IP cores are employed in the numerical computation of a neural network with 27 inputs and 1200 outputs. The artificial neural network is capable of emulating an underwater acoustic environment and has been used to determine characteristic parameters of reflections from the ocean floor. Timing, logic utilization, and ease of use are the metrics used to evaluate Impulse C in the automatic generation of VHDL code for the network test application.
Implementations with parallelism at the system level and at the intermediate (loop) level are explored as part of this study.

Item Instruction history management for high-performance microprocessors (2003) Bhargava, Ravindra Nath; John, Lizy Kurian
History-driven dynamic optimization is an important factor in improving instruction throughput in future high-performance microprocessors. History-based techniques can improve instruction-level parallelism by breaking program dependencies, eliminating long-latency microarchitecture operations, and improving prioritization within the microarchitecture. However, a combination of factors, such as wider issue widths, smaller transistors, larger die area, and increasing clock frequency, has led to microprocessors that are sensitive to both wire delays and energy consumption. In this environment, the global structures and long-distance communications that characterize current history data management limit instruction throughput. This dissertation proposes the ScatterFlow Framework for Instruction History Management. Execution history management tasks, such as history data storage, access, distribution, collection, and modification, are partitioned and dispersed throughout the instruction execution pipeline. History data packets are associated with active instructions and flow with them as they execute, encountering the history management tasks along the way. Between dynamic instances of the instructions, the history data packets reside in trace-based history storage that is synchronized with the instruction trace cache. Compared to traditional history data management, the ScatterFlow method improves instruction coverage, increases history data access bandwidth, shortens communication distances, improves history data accuracy in many cases, and decreases the effective history data access time.
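For context on the table-based baselines that ScatterFlow is measured against, a minimal last-value predictor of the conventional kind, a global table indexed by instruction PC, looks roughly like this. The table size and the sample value stream are invented; this sketches the baseline idea, not the dissertation's mechanism.

```python
# A minimal last-value predictor: a direct-mapped global table indexed by the
# instruction's PC, predicting that an instruction reproduces its last result.
class LastValuePredictor:
    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}                          # index -> last produced value

    def predict(self, pc):
        return self.table.get(pc % self.entries)  # None until trained

    def update(self, pc, value):
        self.table[pc % self.entries] = value     # remember the last outcome

pred = LastValuePredictor()
hits = 0
stream = [(0x40, 7), (0x40, 7), (0x40, 7), (0x48, 1), (0x48, 2)]
for pc, value in stream:                          # a toy dynamic value stream
    if pred.predict(pc) == value:
        hits += 1
    pred.update(pc, value)
print(hits)  # 2: the repeat executions of pc 0x40 reproduce their value
```

Every lookup and update here touches one centralized structure, which is exactly the global-table, long-wire pattern the ScatterFlow packets are designed to avoid.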
A comparison of general history management effectiveness between the ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow Framework provides superior history maturity and instruction coverage. The unique properties that arise from trace-based history storage and partitioned history management are analyzed, and novel design enhancements are presented to increase the usefulness of instruction history data within the ScatterFlow Framework. To demonstrate the potential of the proposed framework, specific dynamic optimization techniques are implemented using it. These illustrative examples combine the history capture advantages with the access latency improvements while exhibiting desirable dynamic energy consumption properties. Compared to a traditional table-based predictor, ScatterFlow value prediction improves execution time and reduces dynamic energy consumption. In other detailed examples, ScatterFlow-enabled cluster assignment improves execution time over previous cluster assignment schemes, and ScatterFlow instruction-level profiling detects more useful execution traits than traditional fixed-size and infinite-size hardware tables.

Item Investigation of the suitability of parallel processing in local networks for variable-size Kalman filter (Texas Tech University, 1989-12) Ho, Meng Wong
Not available

Item Jack Rabbit : an effective Cell BE programming system for high performance parallelism (2011-05) Ellis, Apollo Isaac Orion; Lin, Yun Calvin; Fussell, Donald S., 1951-
The Cell processor is an example of the trade-offs made when designing a mass-market, power-efficient multi-core machine, but the machine-exposing architecture and raw communication mechanisms of Cell are hard for a programmer to manage. Cell's simple design pushes software complexity up in the areas of achieving low threading overhead, good bandwidth efficiency, and load balance.
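The thread-pool work-queue pattern at the heart of Jack Rabbit can be sketched with ordinary threads. This toy version uses a shared queue with sentinel shutdown; it does not model Cell's SPE threads or DMA double buffering, and the squaring kernel is a stand-in.

```python
# A sketch of the thread-pool work-queue pattern: a fixed pool of workers
# pulls tasks from a shared queue, which balances load automatically because
# idle workers grab the next available item.
import threading, queue

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:               # sentinel: shut this worker down
            tasks.task_done()
            return
        idx, value = item
        results[idx] = value * value   # stand-in for a real compute kernel
        tasks.task_done()

def run_pool(values, n_workers=4):
    tasks, results = queue.Queue(), [None] * len(values)
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in enumerate(values):     # enqueue the work items
        tasks.put(item)
    for _ in threads:                  # one shutdown sentinel per worker
        tasks.put(None)
    tasks.join()
    for t in threads:
        t.join()
    return results

print(run_pool([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

On Cell the same structure must additionally keep each worker's input buffers filled via DMA while it computes, which is where the double buffering mentioned below comes in.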
Several attempts have been made to produce efficient and effective programming systems for Cell, but they have been too specialized and thus fall short. We present Jack Rabbit, an efficient thread pool work queue implementation with load balancing mechanisms and double buffering. Our system incurs low threading overhead, achieves good load balance, and uses bandwidth efficiently. It represents a step toward an effective way to program Cell and any similar current or future processors.

Item Load balancing strategies for parallel architectures (2003) Iqbal, Saeed; Jacome, Margarida F.; Carey, Graham F.

Item Modeling and analysis of concurrent processing systems by temporal petri nets (Texas Tech University, 1987-05) Lu, Harngdar A.
Temporal Petri nets, a new class of Petri nets in which timing constraints are represented by the operators of temporal logic, have recently been proposed. Owing to the versatility of the temporal logic operators in expressing temporal assertions, temporal Petri nets are expected to be suitable for describing causal and temporal relationships between events of a system, including eventuality and fairness. In this thesis we investigate the potential of temporal Petri nets as a design/verification tool for concurrent systems through modeling and analysis of the handshake daisy-chain arbiters and the alternating bit protocol. We present some methods for analyzing temporal Petri nets.
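The plain Petri-net semantics underlying the temporal Petri net work above can be illustrated with a tiny interpreter: a transition is enabled when every input place holds a token, and firing moves tokens from inputs to outputs. The two-process mutual-exclusion net below is invented for illustration, and the temporal-logic layer is not modeled.

```python
# A toy Petri-net interpreter: markings are token counts per place, and a
# transition fires by consuming one token from each input place and
# producing one token in each output place.
def enabled(marking, inputs):
    return all(marking[p] > 0 for p in inputs)

def fire(marking, inputs, outputs):
    assert enabled(marking, inputs)
    m = dict(marking)
    for p in inputs:
        m[p] -= 1
    for p in outputs:
        m[p] += 1
    return m

# Places: idle1/idle2 (processes idle), lock (resource free),
# crit1/crit2 (process in its critical section).
marking = {"idle1": 1, "idle2": 1, "lock": 1, "crit1": 0, "crit2": 0}
enter1 = (["idle1", "lock"], ["crit1"])   # transition: process 1 enters
enter2 = (["idle2", "lock"], ["crit2"])   # transition: process 2 enters

marking = fire(marking, *enter1)          # process 1 takes the lock
print(enabled(marking, enter2[0]))        # False: process 2 is blocked
```

Temporal Petri nets layer temporal-logic assertions (e.g. "process 2 eventually enters its critical section") on top of exactly this firing rule, which is what makes eventuality and fairness expressible.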
A brief comparison of temporal Petri nets with other verification methods is also given.

Item Parallel machine scheduling with time windows (2004) Rojanasoonthon, Siwate; Bard, Jonathan F.

Item Parallel processing for system simulation (Texas Tech University, 1982-05) Tai, Heng-ming
In this thesis, a relaxation algorithm is proposed for solving large-scale system simulation problems in parallel. This algorithm combines a time-step parallel algorithm with a component-wise parallel algorithm and fully exploits the interconnected nature of the system, which is characterized by a component connection model. Furthermore, the possibility of implementing this algorithm with the structures suggested by Pottle [6] and Van Ness [2] is discussed. In chapter 2, the component connection model of a dynamical system and the classical (sequential) relaxation algorithm are reviewed. In chapter 3, we describe the time-step parallel algorithm and the component-wise parallel algorithm, as well as their combination into a "highly parallel" system simulation algorithm. Chapter 4 illustrates a number of examples in which the possible trade-offs between efficiency, speed-up ratio, and waiting time are analyzed. Finally, some concluding remarks are made in chapter 5.

Item Part and tool scheduling rules for a flexible manufacturing system (Texas Tech University, 1983-12) Acree, Elaine Strong
Scheduling rules for a specific general-purpose Flexible Manufacturing System (FMS) were investigated. The system consists of six general-purpose machines with local storage at each machine, a work-in-process queue, and a material handling cart. The primary purpose of this research was to investigate the effects and interactions of three types of scheduling rules on FMS performance. The rules included part scheduling on the machines and two resource allocation rules for tool scheduling and cart scheduling.
In addition, the part scheduling rule was modified with a tool look-ahead rule to minimize tool delay. Due to the numerous similarities between the scheduling and resource allocation problems of computer operating systems and those of the FMS, some techniques from operating systems were applied to the FMS to reduce thrashing, prevent deadlocks, and increase cart utilization efficiency. A simulation model was developed to investigate the scheduling rules; SLAM was the language used to simulate this system. The main performance criteria used in this model were machine utilization, cart utilization, total time in system, number of finished parts, and number of completed stages. The most important result of this research was the discovery that individual tool allocation is superior to total tool allocation. For the system studied, the performance measurements were insensitive to the part and cart scheduling rules. However, a severe reduction in cart speed caused the system to thrash. The reduction of cart speed provided the proper experimental conditions to prove that SDTF can be superior to ECFS for this system operating under thrashing conditions. The tool look-ahead feature provided a gross improvement in machine utilization and a reduction in time spent waiting for tools, though this improvement was not statistically significant in all cases.

Item Performance enhancement of grid-based applications (Texas Tech University, 2004-05) Pathak, Sameer
Not available

Item Performance enhancing software loop transformations for embedded VLIW/EPIC processors (2001-12) Akturan, Cagdas, 1973-; Jacome, Margarida F.
Software pipelining is a performance-enhancing loop optimization technique widely used in optimizing compilers. This technique is particularly effective for multimedia and signal processing embedded applications, since the time-critical segments of such applications are typically loops.
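The benefit software pipelining provides can be seen in a back-of-the-envelope cycle count: a loop body with three dependent one-cycle stages takes three cycles per iteration when run sequentially, but once the stages are overlapped across iterations, a new iteration can start every cycle after the pipeline fills. The stage count and cycle numbers below are illustrative, not from the dissertation.

```python
# Schematic cycle counts for software pipelining a loop whose body has three
# dependent one-cycle stages: load -> multiply -> store.
STAGES = 3

def sequential_cycles(iterations):
    # Without pipelining, each iteration runs its stages back to back.
    return iterations * STAGES

def pipelined_cycles(iterations, ii=1):
    # The prologue takes STAGES - 1 cycles to fill the pipeline; afterwards
    # one iteration completes every ii cycles (ii = initiation interval).
    return (STAGES - 1) + ii * iterations

n = 100
print(sequential_cycles(n), pipelined_cycles(n))  # 300 vs 102
```

The prologue and epilogue that fill and drain the pipeline are extra copies of parts of the loop body, which is why software pipelining's speedup comes with a code-size cost.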
Although software pipelining can dramatically increase the performance of a large segment of today's embedded applications, it has two important potential drawbacks. First, it may significantly increase code size, and thus lead to a costly increase in program memory requirements. Second, it typically increases register pressure; on register-limited embedded processors, such an increase may cause additional spills to memory and thus significant performance degradation. In this research, we studied the difficult cost-performance demands posed by the embedded systems market and developed effective performance-enhancing loop optimization techniques and algorithms that directly take these two critical cost factors into consideration. In this dissertation we propose a novel software pipelining framework suitable for compilers targeting clustered embedded VLIW/EPIC processors. The key difference from previous approaches is that our framework can handle code size constraints along with latency and resource constraints while minimizing the increase in register pressure (register file size requirements) typically incurred by software pipelining. This powerful and unique combination of optimization features allows embedded system designers to perform compiler-assisted exploration of "Pareto optimal" points with respect to performance, code size, and register requirements, all important figures of merit for embedded software.

Item Polymorphous architectures: a unified approach for extracting concurrency of different granularities (2006) Sankaralingam, Karthikeyan; Keckler, Stephen W.
Processor architects today face two daunting challenges: emerging applications with heterogeneous computation needs, and technology limitations of power, wire delay, and process variation.
Designing multiple application-specific processors or specialized architectures introduces design complexity, creates a software programmability problem, and reduces economies of scale. There is a pressing need for design methodologies that can support heterogeneous applications, combat processor complexity, and achieve economies of scale. In this dissertation, we introduce the notion of architectural polymorphism to build such scalable processors, which support heterogeneous computation by supporting different granularities of parallelism. Polymorphism configures coarse-grained microarchitecture blocks to provide an adaptive and flexible processor substrate. Technology scalability is achieved by designing the architecture from scalable and modular microarchitecture blocks. We use the dataflow graph as the unifying abstraction layer across three granularities of parallelism: instruction-level, thread-level, and data-level. To first order, this granularity of parallelism is the main difference between different classes of applications. All programs are expressed in terms of dataflow graphs and directly mapped to the hardware, appropriately partitioned as required by the granularity of parallelism. We introduce Explicit Data Graph Execution (EDGE) ISAs, a class of ISAs that serve as an architectural solution for efficiently expressing parallelism and building technology-scalable architectures. We developed the TRIPS architecture, which implements an EDGE ISA using a heavily partitioned and distributed microarchitecture to achieve technology scalability. The two most significant features of the TRIPS microarchitecture are its heavily partitioned and modular design and its use of microarchitecture networks for communication across modules. We have also built a prototype TRIPS chip in 130 nm ASIC technology, composed of two processor cores and a distributed 1 MB Non-Uniform Cache Access (NUCA) on-chip memory system.
Our performance results show that the TRIPS microarchitecture, which provides a 16-issue machine with a 1024-entry instruction window, can sustain good instruction-level parallelism. On a set of hand-optimized kernels, IPCs in the range of 4 to 6 are seen; on a set of benchmarks with ample data-level parallelism (DLP), compiler-generated code produces IPCs in the range of 1 to 4. On the EEMBC and SPEC CPU2000 benchmarks we see IPCs in the range of 0.5 to 2.3. Compared to the Alpha 21264, a high-performance architecture tuned for ILP, TRIPS is up to 3.4 times better on the hand-optimized kernels. However, compiler-generated binaries for the DLP, EEMBC, and SPEC CPU2000 benchmarks perform worse on TRIPS than on the Alpha 21264. With more aggressive compiler optimization we expect the performance of compiler-produced binaries to improve. The polymorphous mechanisms proposed in this dissertation are effective at exploiting thread-level parallelism and data-level parallelism. When executing four threads on a single processor, significantly high levels of processor utilization are seen: IPCs are in the range of 0.7 to 3.9 for an application mix consisting of EEMBC and SPEC CPU2000 workloads. When executing programs with DLP, the polymorphous mechanisms we propose provide harmonic mean speedups of 2.1× across a set of DLP workloads, compared to an execution model extracting only ILP. Compared to specialized architectures, these mechanisms provide competitive performance on a single execution substrate.

Item Processor allocation, message scheduling, and algorithm selection for parallel space-time adaptive processing (Texas Tech University, 2000-08) West, Jack M.
The minimization of execution time (which includes both computation and communication components) and/or the maximization of throughput are of great significance in embedded parallel environments.
Given the tight system constraints associated with applications in these environments, it is imperative to map the tasks and/or data of an application onto the processors efficiently so as to reduce the resulting inter-processor communication traffic. Beyond mapping tasks and data to processors efficiently, it is also important to schedule the communication of messages during phases of data movement so as to minimize network contention and attain the smallest possible communication time. Mapping and scheduling can thus be classified as optimization problems whose solutions greatly affect the performance of the parallel system. This dissertation involves optimizing the mapping of data and the scheduling of messages for a class of signal processing techniques known as space-time adaptive processing (STAP). An objective function is proposed to measure the quality of a mapping of data to the processing elements of a parallel system for a STAP algorithm. The objective function is a cost metric that quantifies the message traffic generated during phases of data movement for a given mapping of data to processors. The results show significant differences in the quality of data mappings under the proposed objective function. A genetic algorithm (GA) based approach for solving the message scheduling optimization problem is proposed, and numerical results from different scenarios are provided. The GA-based optimization is performed off-line, and its results are static schedules for each processing element in the parallel system. These static schedules are then used in the on-line parallel STAP application. The results of this research show that significant improvement in communication time is possible using the proposed GA-based approach to scheduling.
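A stripped-down version of the off-line GA idea in the West abstract can be sketched as follows: chromosomes are message orderings, fitness is the makespan under a toy port-contention model (a message starts once both its endpoint processors are free), and evolution uses elitist selection plus swap mutation. The message set, contention model, and GA operators are all invented for illustration; the dissertation's model and operators are richer.

```python
# A toy GA for contention-aware message scheduling: search over message
# orderings for one that minimizes the schedule's makespan.
import random

MESSAGES = [  # (source, destination, transfer_time) -- invented workload
    (0, 1, 4), (0, 2, 1), (1, 2, 3), (2, 3, 2), (0, 3, 2), (1, 3, 1),
]

def makespan(order):
    """Greedy list schedule: a message starts when both its ports are free."""
    free, end = {}, 0                        # port -> time it becomes free
    for i in order:
        src, dst, length = MESSAGES[i]
        start = max(free.get(src, 0), free.get(dst, 0))
        free[src] = free[dst] = start + length
        end = max(end, start + length)
    return end

def evolve(generations=200, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [rng.sample(range(len(MESSAGES)), len(MESSAGES))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        survivors = pop[:pop_size // 2]      # elitist selection
        children = []
        for parent in survivors:
            child = parent[:]
            a, b = rng.sample(range(len(child)), 2)
            child[a], child[b] = child[b], child[a]   # swap mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = evolve()
print(best, makespan(best))
```

For this workload the busiest port (processor 1, with 8 time units of traffic) lower-bounds the makespan at 8, while full serialization would take 13; the GA searches the permutation space for an ordering near that lower bound, mirroring how the dissertation's off-line GA produces static per-processor schedules.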