Browsing by Subject "Computer architecture"
Now showing 1 - 20 of 48
Item: A high level protocol for a loosely coupled multiprocessing environment (Texas Tech University, 1986-08) Wey, Yih-shyan
Not available

Item: A reference architecture for distributed intelligent systems and a preliminary description language for their integration (Texas Tech University, 1993-05) Bird, Shawn Daniel
Artificial Intelligence and information systems research is just beginning to grapple with the problems of integrating autonomous, intelligent machine and human agents into complex problem-solving systems. These powerful multi-agent collaborative systems are theoretically possible, but conceptual tools for their development are lacking. This research develops a general model of distributed intelligent systems and a layered language for agent interaction in an attempt to provide a foundation for the realization of such conceptual tools. The model provides a basic definition of distributed, asynchronous, intelligent problem-solving systems. The language, which is constructed from temporal and modal logics, provides an axiomatized logical system in which the knowledge, beliefs, and intentions of multiple agents are exchanged. Combined, the language and model provide a reference architecture for the development of, and future research into, distributed intelligent systems.

Item: Adaptive predication via compiler-microarchitecture cooperation (2007) Kim, Hyesoon, 1974-; Patt, Yale N.
Even after decades of research in branch prediction, branch predictors still remain imperfect, which results in significant performance loss in aggressive processors that support large instruction windows and deep pipelines. Predicated execution can reduce the number of branch mispredictions by eliminating hard-to-predict branches. However, the additional instruction overhead and data dependencies due to predicated execution sometimes offset the performance benefits of having fewer mispredictions.
This dissertation presents two cooperative compiler-microarchitecture mechanisms to reduce the branch misprediction penalty by combining predicated execution and branch prediction. The first mechanism is a set of new control flow instructions, called wish branches. With wish branches, the compiler generates code that can be executed either as normal branch code or as predicated code. At run-time, the hardware chooses between normal branch code and predicated code based on the run-time branch behavior and the estimated run-time effectiveness of each solution. The results show that wish branches can significantly improve both performance and energy efficiency compared to predication or branch prediction. To provide the benefit of predicated code to non-predicated Instruction Set Architectures (ISAs) and to increase the benefit of predicated execution beyond the benefit of wish branches, this dissertation also presents and evaluates the Diverge-Merge Processor (DMP) architecture. In the diverge-merge processor, the compiler analyzes the control-flow graphs of the program and marks branches suitable for dynamic predication –called diverge branches– and their corresponding control flow merge points. The hardware not only chooses whether to use branch prediction or predication, but also decides "which" instructions after a branch should be predicated based on run-time branch behavior. This solution significantly reduces the overhead of predicated code and allows a very large set of control-flow graphs to be predicated, neither of which was possible previously because predication was performed statically without any run-time information. This dissertation compares DMP with all other major previously-proposed branch processing paradigms available in the literature in terms of performance, power, energy consumption, and complexity. The results show that DMP is the most energy-efficient and high-performance paradigm for branch handling.
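The predication idea these mechanisms build on can be shown in a hedged sketch (an invented example, not code from the dissertation): if-conversion turns a hard-to-predict branch into guarded data flow, so both paths execute and a predicate selects the committed result, removing the branch and its mispredictions at the cost of extra executed instructions.

```python
# Hedged sketch (invented example, not code from the dissertation): predicated
# execution if-converts a hard-to-predict branch so both paths execute as data
# flow and a predicate selects the committed result.

def branch_version(x):
    # Normal branch code: only one path executes, but the branch can mispredict.
    if x % 2 == 0:
        return x + 1
    return x - 1

def predicated_version(x):
    # Predicated code: both paths execute, guarded by a predicate.
    p = (x % 2 == 0)      # predicate-defining instruction
    t = x + 1             # executes under predicate p
    f = x - 1             # executes under predicate not-p
    return t if p else f  # select, e.g. a conditional move

assert all(branch_version(x) == predicated_version(x) for x in range(-50, 50))
```

A wish branch, as the abstract describes, carries both forms so the hardware can fall back to normal branch code when the branch turns out to be easy to predict.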
Code generation algorithms for the DMP architecture and cost-benefit analysis models of dynamic predication are also evaluated.

Item: Algorithms for circuit layout compaction of building blocks (Texas Tech University, 1985-12) Varadarajan, Ramachandran
Compaction is the CAD tool used to pack rough sketches or symbolic diagrams to produce error-free IC layouts. With the ever-increasing complexity of VLSI circuitry, the building-block approach becomes very important for custom VLSI design. A graph-theoretic compaction algorithm is developed for the compaction of symbolically specified layouts of building-block LSIs. The layout area is reduced by minimizing the pitch in each dimension separately. The algorithm is capable of handling mixed constraints, i.e., the constraints arising from design-rule requirements (lower-bound type) and the user-defined constraints (equality and upper-bound type).

Item: An adaptable computing structure for cepstrum calculations (Texas Tech University, 1984-12) Fowler, James McCormick
Not available

Item: Architectural techniques to accelerate multimedia applications on general-purpose processors (2001-08) Talla, Deependra, 1975-; John, Lizy Kurian
General-purpose processors (GPPs) have been augmented with multimedia extensions to improve performance on multimedia-rich workloads. These extensions operate in a single instruction multiple data (SIMD) fashion to extract data-level parallelism in multimedia and digital signal processing (DSP) applications. This dissertation consists of a comprehensive evaluation of the execution characteristics of multimedia applications on SIMD-enhanced GPPs, detection of bottlenecks in the execution of multimedia applications on SIMD-enhanced GPPs, and the design and implementation of architectural techniques to eliminate and alleviate the impact of the various bottlenecks to accelerate multimedia applications.
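The overhead this evaluation uncovers can be previewed with a hedged toy model (my own illustration, not the dissertation's measurement methodology): even in a simple DSP kernel, most dynamic operations are "supporting" work rather than useful computation.

```python
# Hedged toy model (invented, not the dissertation's methodology): tally
# operation categories in a dot product. Address generation, loads/stores, and
# loop control dominate the useful multiply-accumulate work.

def dot_product_trace(n):
    """Run a dot product while tallying operation categories per iteration."""
    ops = {"compute": 0, "address": 0, "load_store": 0, "loop": 0}
    a, b = list(range(n)), list(range(n))
    acc = 0
    for i in range(n):
        ops["address"] += 2     # form the addresses of a[i] and b[i]
        ops["load_store"] += 2  # load the two operands
        acc += a[i] * b[i]
        ops["compute"] += 2     # multiply and accumulate
        ops["loop"] += 2        # induction-variable increment and branch
    return acc, ops

acc, ops = dot_product_trace(8)
overhead = ops["address"] + ops["load_store"] + ops["loop"]
assert acc == 140                                  # sum of i*i for i < 8
assert overhead / (overhead + ops["compute"]) == 0.75
```

In this toy tally three quarters of the operations are supporting work, in the same spirit as the 75-85% figure the dissertation reports for real media workloads.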
This dissertation identifies several bottlenecks in the processing of SIMD-enhanced multimedia and DSP applications on GPPs. It is found that approximately 75-85% of instructions in the dynamic instruction stream of media workloads are not performing useful computations but merely supporting the useful computations by performing address generation, address transformation/data reorganization, loads/stores, and loop branches. This leads to an underutilization of the SIMD computation units, with only 1-12% of the peak SIMD throughput being achieved. This dissertation proposes the use of hardware support to efficiently execute the overhead/supporting instructions by overlapping them with the useful computation instructions. A 2-way GPP with SIMD extensions augmented with the proposed MediaBreeze hardware significantly outperforms a 16-way SIMD GPP without MediaBreeze hardware on multimedia kernels. On multimedia applications, a 2-/4-way SIMD GPP augmented with MediaBreeze hardware is superior to a 4-/8-way SIMD GPP without MediaBreeze hardware. The performance improvements are achieved at an area cost that is less than 0.3% of current GPPs and power consumption that is less than 1% of the total processor power, without elongating the critical path of the processor.

Item: Architectures and algorithms for high performance switching (2004) Prakash, Amit; Aziz, Adnan
Switches are ubiquitous in modern computing, appearing in wide-area networks, multiprocessor servers, and data storage systems. With the advent of high-speed link technology, switches have become the bottleneck in moving data in the network. Existing switch architectures either require the interconnection network and packet buffers to work at a very high speed or require complex scheduling problems to be solved quickly.
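The scheduling problem alluded to above can be made concrete with a hedged sketch (my own illustration, not an algorithm from the dissertation): each time slot, an input-queued crossbar must choose a matching between inputs and outputs, with at most one packet per input and per output.

```python
# Hedged sketch (invented, not the dissertation's scheduler): a greedy maximal
# matching for an input-queued crossbar. Each input holds queued packets for a
# set of outputs; the scheduler pairs inputs with free outputs for one slot.

def greedy_match(requests):
    """requests[i] = set of outputs that input i has queued packets for."""
    taken_outputs = set()
    matching = {}
    for i, outs in enumerate(requests):
        for o in sorted(outs):
            if o not in taken_outputs:
                matching[i] = o          # input i sends to output o this slot
                taken_outputs.add(o)
                break
    return matching

# Three inputs contending for two outputs: the matching serves two of them.
m = greedy_match([{0, 1}, {0}, {1}])
assert m == {0: 0, 2: 1}
```

Even this naive matching must be recomputed every slot at line rate, which is why the dissertation trades architecture choices against scheduling complexity.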
In this dissertation we investigate whether there are switch architectures that can support high-speed links, are simultaneously easy to schedule, and can be built out of inexpensive components. The approach we take is to use parallelism to solve complex scheduling problems. We choose switching architectures such that the corresponding scheduling problem can be efficiently solved with a reasonable amount of hardware. In particular, we present two switch architectures for which we have developed efficient scheduling algorithms. The first switch achieves optimum throughput and optimum average latency, while the second switch guarantees optimum throughput only but uses considerably less hardware.

Item: Atomic block formation for explicit data graph execution architectures (2010-08) Maher, Bertrand Allen; McKinley, Kathryn S.; Burger, Douglas C., Ph. D.; Keckler, Stephen W.; Mahlke, Scott A.; Pingali, Keshav
Limits on power consumption, complexity, and on-chip latency have focused computer architects on power-efficient designs that exploit parallelism. One approach divides programs into atomic blocks of operations that execute semi-independently, which efficiently creates a large window of potentially concurrent operations. This dissertation studies the intertwined roles of the compiler, architecture, and microarchitecture in achieving efficiency and high performance with a block-atomic architecture. For such an architecture to achieve high performance, the compiler must form blocks effectively. The compiler must create large blocks of instructions to amortize the per-block overhead, but control flow and content restrictions limit the compiler's options. Block formation should consider factors such as frequency of execution and block size, for example selecting control-flow paths that are frequently executed and exploiting locality of computations to reduce communication overheads.
This dissertation determines what characteristics of programs influence block formation and proposes techniques to generate effective blocks. The first contribution is a method for solving phase-ordering problems inherent to block formation, mitigating the tension between block-enlarging optimizations---if-conversion, tail duplication, loop unrolling, and loop peeling---and scalar optimizations. Given these optimizations, analysis shows that the remaining obstacles to creating larger blocks are inherent in the control flow structure of applications, and furthermore that any fixed block size entails a sizable amount of wasted space. To eliminate this overhead, this dissertation proposes an architectural implementation of variable-size blocks that allows the compiler to dramatically improve block efficiency. We use these mechanisms to develop policies for block formation that achieve high performance on a range of applications and processor configurations. We find that the best policies differ significantly depending on the number of participating cores. Using machine learning, we discover generalized policies for particular hardware configurations and find that the best policy varies significantly between applications and based on the number of parallel resources available in the microarchitecture. These results show that effective and efficient block-atomic execution is possible when the compiler and microarchitecture are designed cooperatively.

Item: Braids: out-of-order performance with almost in-order complexity (2007-12) Tseng, Francis, 1976-; Patt, Yale N.
There is still much performance to be gained by out-of-order processors with wider issue widths. However, traditional methods of increasing issue width do not scale; that is, they drastically increase design complexity and power requirements.
This dissertation introduces the braid, a compile-time generated entity that enables the execution core to scale to wider widths by exploiting the small fanout and short lifetime of values produced by the program. A braid captures dataflow and register usage information of the program that is known to the compiler but is not traditionally conveyed to the microarchitecture through the instruction set architecture. Braid processing requires identification by the compiler, minor augmentations to the instruction set architecture, and support by the microarchitecture. The execution core of the braid microarchitecture consists of a number of braid execution units (BEUs). The BEU is tailored to efficiently carry out the execution of a braid in an in-order fashion. Each BEU consists of a FIFO scheduler, a busy-bit vector, two functional units, and a small internal register file. The braid microarchitecture provides a number of opportunities for the reduction of design complexity: it reduces the port requirements of the renaming mechanism, it simplifies the steering process, it reduces the area, size, and port requirements of the register file, and it reduces the paths and port requirements of the bypass network. The complexity savings result in a design characterized by a lower power requirement, a shorter pipeline, and a higher clock frequency. On an 8-wide design, the result from executing braids is performance within 9% of a very aggressive conventional out-of-order microarchitecture with the complexity of an in-order implementation. Three bottlenecks are identified in the braid microarchitecture, and a solution is presented to address each. The limitation on braid size is addressed by dynamic merging. The underutilization of braid execution resources caused by long-latency instructions is addressed by context sharing.
The poor utilization of braid execution resources caused by single-instruction braids is addressed by heterogeneous execution resources.

Item: Bridging the gap between mobile CPU design and user satisfaction via crowdsourcing (2016-05) Halpern, Matthew Franklin; Janapa Reddi, Vijay; Tiwari, Mohit
This report aims to provide an understanding of how mobile CPU designs have evolved and their influence on end-user satisfaction. To that end, a quantitative performance analysis is conducted across ten cutting-edge mobile CPU designs studied within top-selling off-the-shelf smartphones released over the past seven years. This analysis is then used to guide a large-scale user study spanning over 25,000 participants via crowdsourcing on the Amazon Mechanical Turk service. The user study asks participants to assess the responsiveness of interactive application use cases for a set of current-generation applications (e.g., Angry Birds and Facebook) and next-generation applications (i.e., face recognition and augmented reality) relative to the performance capabilities of the devices studied. This framework allows us to quantitatively link how the mobile CPU designs studied impacted end-user satisfaction. The study results indicate that mobile CPU designs have exhibited significant performance improvements through aggressive core scaling techniques prevalent in desktop CPUs. Just as was observed in desktop CPU design, these same techniques have led to excessive mobile CPU power consumption. However, from an end-user perspective this power consumption was not without success. Mobile CPUs have evolved to provide satisfactory experiences for the studied current-generation applications. The reason is that many of these applications rely heavily on single-threaded performance.
Other, more recent applications actually multi-thread user-critical parts of the applications, which also demonstrates that multi-core mobile CPUs are an important design consideration – contrary to conventional wisdom. However, looking ahead, the same mobile CPUs were not able to provide satisfactory experiences for many of the next-generation applications studied, questioning the sustainability of these power-hungry design techniques in future mobile CPU designs.

Item: Constructing adaptable and scalable synthetic benchmarks for microprocessor performance evaluation (2007-12) Joshi, Ajay Manohar, 1976-; John, Lizy Kurian
Benchmarks set standards for innovation in computer architecture research and industry product development. Consequently, it is of paramount importance that the benchmarks used in computer architecture research and development are representative of real-world applications. However, composing such representative workloads poses practical challenges to application analysis teams and benchmark developers: (1) benchmarks that are standardized are open-source whereas applications of interest are typically proprietary, (2) benchmarks are rigid, measure single-point performance, and only represent a sample of the application behavior space, (3) benchmark suites take several years to develop, but applications evolve at a faster rate, and (4) benchmarks geared towards temperature and power characterization are difficult to develop and standardize. The objective of this dissertation is to develop an adaptive benchmark generation strategy to construct synthetic benchmarks to address these benchmarking challenges. We propose an approach for automatically distilling key hardware-independent performance attributes of a proprietary workload and capturing them into a miniature synthetic benchmark clone.
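The cloning idea can be sketched in miniature (a hedged, invented illustration; `profile_mix` and `generate_clone` are not the dissertation's tools, and the real approach distills far richer attributes than an instruction mix): profile a statistical attribute of a proprietary workload, then generate a much shorter synthetic stream that matches it without revealing the original code.

```python
import random

# Hedged miniature of benchmark cloning (invented names, not the dissertation's
# tools): distill one hardware-independent attribute -- the instruction mix --
# and generate a shorter synthetic stream with the same statistical profile.

def profile_mix(trace):
    """Return the fraction of each instruction type in a trace."""
    mix = {}
    for op in trace:
        mix[op] = mix.get(op, 0) + 1
    total = len(trace)
    return {op: n / total for op, n in mix.items()}

def generate_clone(mix, length, seed=0):
    """Emit a synthetic instruction stream matching the given mix."""
    rng = random.Random(seed)
    ops, weights = zip(*sorted(mix.items()))
    return rng.choices(ops, weights=weights, k=length)

proprietary = ["load"] * 300 + ["alu"] * 500 + ["store"] * 100 + ["branch"] * 100
mix = profile_mix(proprietary)
clone = generate_clone(mix, length=200)   # far shorter than the original
assert len(clone) == 200 and set(clone) <= set(mix)
```

The clone hides the original's function while preserving the profiled statistics, which is the property the abstract highlights for proprietary workloads.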
The advantage of the benchmark clone is that it hides the functional meaning of the code but exhibits similar performance and power characteristics as the target application across a wide range of microarchitecture configurations. Moreover, the dynamic instruction count of the synthetic benchmark clone is substantially shorter than the proprietary application, greatly reducing overall simulation time -- for the SPEC CPU 2000 suite, the simulation time reduction is over five orders of magnitude compared to the entire benchmark execution. We develop an adaptive benchmark generation strategy that trades off accuracy to provide the flexibility to easily alter program characteristics. The parameterization of workload metrics makes it possible to succinctly describe an application's behavior using a limited number of fundamental program characteristics. This provides the ability to alter workload characteristics and construct scalable benchmarks that allow researchers to explore a wider range of the application behavior space, conduct program behavior studies, and model emerging workloads. The parameterized workload model is the foundation for automatically constructing power- and temperature-oriented synthetic workloads. We show that machine learning algorithms can be effectively used to search the application behavior space to automatically construct benchmarks for evaluating the power and temperature characteristics of a computer architecture design. The need for a scientific approach to construct synthetic benchmarks, to complement application benchmarks, has long been recognized by the computer architecture research community, and this dissertation work is a significant step towards achieving that goal.

Item: CORDIC-based high-speed direct digital frequency synthesis (2003) Kang, Chang Yong; Swartzlander, Earl E.
The circular-mode CORDIC (coordinate rotation digital computer) algorithm is analyzed in the context of direct digital frequency synthesis (DDFS).
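For reference, the textbook form of circular-mode CORDIC can be sketched as follows (a hedged illustration of the standard algorithm, not the dissertation's datapath; `cordic_sin_cos` is an invented helper name): the vector (1, 0) is rotated toward a target angle using only shift-and-add micro-rotations, and the constant CORDIC gain is compensated at the end.

```python
import math

# Hedged sketch of standard circular-mode CORDIC (not the dissertation's
# datapath): rotate (1, 0) toward `theta` with shift-add micro-rotations of
# atan(2^-i), then undo the constant gain K.

def cordic_sin_cos(theta, iterations=32):
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))  # accumulated gain inverse
    x, y, z = 1.0, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0                  # steer toward residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y * k, x * k                              # (sin(theta), cos(theta))

s, c = cordic_sin_cos(math.pi / 6)
assert abs(s - 0.5) < 1e-6 and abs(c - math.sqrt(3) / 2) < 1e-6
```

In a DDFS, a rotator of this kind replaces the sine lookup table: the phase accumulator output becomes the rotation angle fed to the CORDIC stage.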
It is shown how the CORDIC parameters should be selected to meet given DDFS parameters, and, through simulations, it is demonstrated that jamming outperforms truncation as a datapath quantization scheme, making it an attractive alternative to rounding. It is also shown that the CORDIC output can be made exact to the digits by an additional rounding process, which is especially useful for DDFS applications where the CORDIC output is truncated to the final digital-to-analog converter (DAC) width. Variations of the basic circular-mode CORDIC algorithm are investigated, and previous work on fast DDFS implementations based on those algorithms is discussed. A new DDFS architecture based on the differential CORDIC (DCORDIC) algorithm is proposed. The proposed architecture allows bit-level pipelining in the angle path by implementing a two-dimensional systolic array. Unlike other fast DDFS architectures, it incorporates the phase accumulator in the bit-level pipelining framework so that a bottleneck-free datapath throughout the whole system is achieved. The sequential implementation and the hybrid implementation of the architecture, for area-sensitive and low-latency designs respectively, are proposed for further research.

Item: Delay-sensitive branch predictors for future technologies (2002-05) Jiménez, Daniel Angel, 1969-; Lin, Yun Calvin

Item: Design of algorithm transformations for VLSI array processing (Texas Tech University, 1986-12) Dorairaj, Ravishankar
The rapid advances in very large scale integration (VLSI) technology have created a flurry of research in designing future computer architectures. Many methods have been developed for parallel processing of algorithms by directly mapping them onto parallel architectures. A procedure, based on the mathematical transformation of the index set and dependence vectors of an algorithm, is developed to find algorithm transformations for VLSI array processing.
The algorithm is modeled as a program graph, which is a directed graph. Techniques are suggested to regularize the data-flow in an algorithm, thereby minimizing the communication requirements of the architecture. We derive a set of sufficient conditions on the structure of data-flow of a class of algorithms for the existence of valid transformations. The VLSI array is modeled as a directed graph, and the program graph is mapped onto this using the algorithm transformation.

Item: Design of platforms for computing context with spatio-temporal locality (2011-05) Ziotopoulos, Agisilaos Georgios; De Veciana, Gustavo; Garg, Vijay; Mok, Al; Julien, Christine; Touba, Nur; Breternitz, Mauricio
This dissertation is in the area of pervasive computing. It focuses on designing platforms for storing, querying, and computing contextual information. More specifically, we are interested in platforms for storing and querying spatio-temporal events where queries exhibit locality. Recent advances in sensor technologies have made possible gathering a variety of information on the status of users, the environment, machines, etc. Combining this information with computation, we are able to extract context, i.e., a filtered high-level description of the situation. In many cases, the information gathered exhibits locality both in space and time, i.e., an event is likely to be consumed in a location close to the location where the event was produced, at a time which is close to the time the event was produced. This dissertation builds on this observation to create better platforms for computing context. We claim three key contributions. We have studied the problem of designing and optimizing spatial organizations for exchanging context. Our thesis has original theoretical work on how to create a platform based on cells of a Voronoi diagram for optimizing the energy and bandwidth required for mobiles to exchange contextual information that is tied to specific locations in the platform.
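The Voronoi organization can be illustrated with a hedged sketch (my own brute-force illustration, not the dissertation's protocol): the Voronoi cell of a peer is exactly the region closer to that peer than to any other, so deciding which peer's cell contains an event reduces to a nearest-site query.

```python
# Hedged sketch (brute force, invented; not the dissertation's platform): the
# peer whose Voronoi cell contains a point is simply the nearest peer, so an
# event is stored at -- and later queried from -- its nearest site.

def nearest_peer(peers, point):
    """Return the index of the peer whose Voronoi cell contains `point`."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(range(len(peers)), key=lambda i: dist2(peers[i], point))

peers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
event_location = (6.0, 1.0)
owner = nearest_peer(peers, event_location)
assert owner == 1   # peer 1's Voronoi cell contains the event
```

Because queries exhibit spatial locality, a query issued near the event's location resolves to the same nearby peer that stored it, which is the property the platform exploits.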
Additionally, we applied our results to the problem of optimizing a system for surveilling the locations of entities within a given region. We have designed a platform for storing and querying spatio-temporal events exhibiting locality. Our platform is based on a P2P infrastructure of peers, organized based on the Voronoi diagram associated with their locations, to store events based on their own associated locations. We have developed theoretical results based on spatial point processes for the delay experienced by a typical query in this system. Additionally, we used simulations to study heuristics to improve the performance of our platform. We also came up with protocols for the replicated storage of events in order to increase the fault-tolerance of our platform. Finally, in this thesis we propose a design for a platform, based on RFID tags, to support context-aware computing for indoor spaces. Our platform exploits the structure found in most indoor spaces to encode contextual information in suitably designed RFID tags. The elements of our platform collaborate based on a set of messages we developed to offer context-aware services to the users of the platform. We validated our research with an example hardware design of the RFID tag and a software emulation of the tag's functionality.

Item: Design of wide-issue high-frequency processors in wire delay dominated technologies (2004) Murukkathampoondi, Hrishikesh Sathyavasu; Burger, Douglas C., Ph. D.; Chase, Craig M.

Item: Designing on-chip memory systems for throughput architectures (2015-12) Diamond, Jeffrey Robert; Fussell, Donald S., 1951-; Keckler, Stephen W.; van de Geijn, Robert; Lin, Calvin; Eijkhout, Victor
Driven by the high arithmetic intensity and embarrassingly parallel nature of real-time computer graphics, GPUs became the first widespread throughput architecture.
With the end of Dennard scaling and the plateau of single-thread performance, nearly all computer chips at all scales have now become explicitly parallel, containing a hierarchy of cores and threads. Initially, these individual cores were imagined to be no different from traditional uniprocessors, and parallel programs no different from traditional parallel programs. Like GPUs, these modern chips share finite on-chip resources between threads. This results in novel performance and optimization issues at any granularity of parallelism, from cell phones to GPUs. Unfortunately, the performance characteristics of these systems tend to be non-linear and counter-intuitive. The programmer's software stack has been slow in adapting to this paradigm shift. Compilers still focus primarily on optimizing single-thread performance at the expense of throughput. Existing parallel applications are not a perfect match for modern multicore, multithreaded processors. And existing methodologies for performance analysis and simulation are not aligned with multicore issues. This dissertation begins with a mathematical analysis of throughput performance in the presence of shared on-chip resources. When cache hit rates begin to fall, there is a steep drop-off in throughput performance. An optimistic view of this regime is that even small improvements to cache efficiency offer significant benefits. This motivates the exploration of general throughput optimizations in both hardware and software that apply to both coarse-grained and fine-grained parallel architectures, requiring no programmer intervention or tuning. This dissertation provides two such solutions. The first solution is a compiler optimization called "loop microfission" that can boost throughput performance by up to 50%.
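The abstract does not spell out the "microfission" transform; as a hedged stand-in, classic loop fission (the general transform it refines) splits one loop into several so each pass streams over fewer arrays, shrinking the working set that competes for shared cache.

```python
# Hedged sketch of classic loop fission (the general transform behind
# "microfission"; the dissertation's exact algorithm is not shown here).
# Splitting the loop lets each pass stream over a smaller working set.

def fused(a, b, n):
    for i in range(n):
        a[i] = a[i] * 2       # statement S1
        b[i] = b[i] + a[i]    # statement S2 depends on S1 at the same i

def fissioned(a, b, n):
    for i in range(n):        # loop 1: only `a` is live
        a[i] = a[i] * 2
    for i in range(n):        # loop 2: streams `a` and `b`
        b[i] = b[i] + a[i]

a1, b1 = [1, 2, 3], [10, 20, 30]
a2, b2 = [1, 2, 3], [10, 20, 30]
fused(a1, b1, 3)
fissioned(a2, b2, 3)
assert (a1, b1) == (a2, b2)   # the transform preserves semantics
```

The split is legal here because the S1-to-S2 dependence is within the same iteration; when threads share a cache, touching one array per pass can reduce the conflict pressure the analysis above identifies.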
In the context of the intrachip scalability of supercomputing applications, we demonstrate the failings of conventional performance tuning software and compiler algorithms in the presence of shared resources. We introduce a new approach to throughput optimization, including a memory-friendly performance analysis tool, and show that techniques for throughput optimization are similar to traditional optimizations but require new priorities. The second solution is a hardware optimization called Arbitrary Modulus Indexing (AMI), a technique that generalizes efficient implementations of the DIV/MOD operation from Mersenne primes to all integers. We show that the primary performance bottlenecks in modern GPUs for regular, memory-intensive applications are bank and set conflicts in the shared on-chip memory system. AMI completely eliminates conflicts in all facets of the memory system at negligible hardware cost, and has even broader potential for optimizations throughout computer architecture.

Item: Distributed selective re-execution for EDGE architectures (2005) Desikan, Rajagopalan; Burger, Douglas C., Ph. D.

Item: Efficient fault tolerance for pipelined structures and its application to superscalar and dataflow machines (2008-12) Mizan, Elias, 1976-; Chiou, Derek
Silicon reliability has reemerged as a very important problem in digital system design. As voltage and device dimensions shrink, combinational logic is becoming sensitive to temporary errors caused by single-event upsets, transistor and interconnect aging, and circuit variability. In particular, computational functional units are very challenging to protect because current redundant execution techniques have a high power and area overhead, cannot guarantee detection of some errors, and cause a substantial performance degradation. As traditional worst-case design rules that guarantee error avoidance become too conservative to be practical, new microarchitectures need to be investigated to address this problem.
To this end, this dissertation introduces Self-Imposed Temporal Redundancy (SITR), a speculative microarchitectural temporal redundancy technique suitable for pipelined computational functional units. SITR is able to detect most temporary errors, is area- and energy-efficient, and can be easily incorporated in an out-of-order microprocessor. SITR can also be used as a throttling mechanism against thermal viruses and, in some cases, allows designers to design very aggressive bypass networks capable of achieving high instruction throughput by tolerating timing violations. To address the performance degradation caused by redundant execution, this dissertation proposes using a tiled-dataflow model of computation, because it enables the design of scalable, resource-rich computational substrates. Starting with the WaveScalar tiled-dataflow architecture, we enhance the reliability of its datapath, including computational logic, interconnection network, and storage structures. Computations are performed speculatively using SITR, while traditional information redundancy techniques are used to protect data transmission and storage. Once a value has been verified, confirmation messages are transmitted to consumer instructions. Upon error detection, nullification messages are sent to the instructions affected by the error. Our experimental results demonstrate that the slowdown due to redundant computation and error recovery on the tiled-dataflow machine is consistently smaller than on a superscalar von Neumann architecture. However, the number of additional messages required to support SITR execution is substantial, increasing power consumption. To reduce this overhead without significantly affecting performance, we introduce wave-based speculation, a mechanism targeted for dataflow architectures that enables speculation only when it is likely to benefit performance.

Item: Explicit data graph compilation (2009-12) Smith, Aaron Lee, 1977-; Burger, Douglas C., Ph. D.; John, Lizy K.; Keckler, Stephen W.; Lin, Calvin; McKinley, Kathryn S.
Technology trends such as growing wire delays, power consumption limits, and diminishing clock rate improvements present conventional instruction set architectures such as RISC, CISC, and VLIW with difficult challenges. To show continued performance growth, future microprocessors must exploit concurrency power-efficiently. An important question for any future system is the division of responsibilities between programmer, compiler, and hardware to discover and exploit concurrency. In this research we develop the first compiler for an Explicit Data Graph Execution (EDGE) architecture and show how to solve the new challenge of compiling to a block-based architecture. In EDGE architectures, the compiler is responsible for partitioning the program into a sequence of structured blocks that logically execute atomically. The EDGE ISA defines the structure of, and the restrictions on, these blocks. The TRIPS prototype processor is an EDGE architecture that employs four restrictions on blocks, intended to strike a balance between software and hardware complexity. They are: (1) fixed block sizes (maximum of 128 instructions), (2) restricted number of loads and stores (no more than 32 may issue per block), (3) restricted register accesses (no more than eight reads and eight writes to each of four banks per block), and (4) constant number of block outputs (each block must always generate a constant number of register writes and stores, plus exactly one branch). The challenges addressed in this thesis are twofold. First, we develop the algorithms and internal representations necessary to support the new structural constraints imposed by the block-based EDGE execution model. This first step provides correct execution and demonstrates the feasibility of EDGE compilers.
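The four TRIPS block restrictions quoted above can be phrased as a simple legality test that a block-forming pass must satisfy before emitting a block. This is a hedged sketch (an invented checker, not the TRIPS compiler itself), and it simplifies restriction (4) to the exactly-one-branch part, since the constant-output requirement also needs dataflow analysis.

```python
# Hedged sketch (invented checker, not the TRIPS compiler): test a candidate
# block against the four TRIPS restrictions listed in the abstract.

def block_is_legal(block):
    """block: list of instructions, e.g. {"op": "reg", "bank": 0, "kind": "read"}."""
    if len(block) > 128:                         # (1) fixed maximum block size
        return False
    mem_ops = sum(ins["op"] in ("load", "store") for ins in block)
    if mem_ops > 32:                             # (2) load/store issue limit
        return False
    for bank in range(4):                        # (3) per-bank register limits
        accesses = [ins for ins in block
                    if ins["op"] == "reg" and ins["bank"] == bank]
        reads = sum(ins["kind"] == "read" for ins in accesses)
        writes = sum(ins["kind"] == "write" for ins in accesses)
        if reads > 8 or writes > 8:
            return False
    # (4), simplified: exactly one branch; the constant-output part of this
    # restriction additionally requires dataflow analysis not modeled here.
    return sum(ins["op"] == "branch" for ins in block) == 1

assert block_is_legal([{"op": "add"}, {"op": "branch"}])
assert not block_is_legal([{"op": "add"}] * 130 + [{"op": "branch"}])
```

A block-forming compiler would run a test like this after each enlarging step (if-conversion, unrolling, tail duplication) and back off when the candidate block becomes illegal.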
Next, we show how to optimize blocks using a dataflow predication model, and provide results showing how the compiler meets this challenge on the SPEC2000 benchmarks. Using basic blocks as the performance baseline, we show that optimizations utilizing the dataflow predication model achieve up to 64% speedup on SPEC2000, with an average speedup of 31%.