Browsing by Subject "Computer storage devices"
Now showing 1 - 11 of 11
Item: Enhancing memory controllers to improve DRAM power and performance (2006)
Hur, Ibrahim; Lin, Calvin
Technological advances and new architectural techniques have enabled processor performance to double almost every two years. However, these performance improvements have not resulted in comparable speedups for all applications, because memory system performance has not kept pace with processor performance in modern systems. In this dissertation, by concentrating on the interface between the processors and memory, the memory controller, we propose novel solutions to all three aspects of the memory problem: bandwidth, latency, and power. To increase the available bandwidth between the memory controller and DRAM, we introduce a new scheduling approach. To hide memory latency, we introduce a new hardware prefetching technique that is useful for applications with regular or irregular memory accesses. Finally, we show how memory controllers can be used to reduce DRAM power consumption. We evaluate our techniques in the context of the memory controller of a highly tuned modern processor, the IBM Power5+. Our evaluation on both technical and commercial benchmarks in single-threaded and simultaneous multi-threaded environments shows that our techniques for bandwidth increase, latency hiding, and power reduction achieve significant improvements. For example, for single-threaded applications, when our scheduling approach and prefetching method are implemented together, they improve the performance of the SPEC2006fp, NAS, and a set of commercial benchmarks by 14.3%, 13.7%, and 11.2%, respectively. In addition to providing substantial performance and power improvements, our techniques are also superior to previously proposed methods in terms of cost. For example, a version of our scheduling approach has been implemented in the Power5+, and it increased the transistor count of the chip by only 0.02%.
This dissertation shows that, without increasing the complexity of either the processor or the memory organization, all three aspects of memory systems can be significantly improved with low-cost enhancements to the memory controller.

Item: A generic memory module for events (2007)
Tecuci, Dan Gabriel; Porter, Bruce, 1956-
The ability to remember past experiences enables a system to improve both its performance and its competence. For example, a system might be able to solve problems faster by adapting previous solutions. Additional tasks, such as avoiding unwanted behavior by detecting potential problems, monitoring long-term goals by remembering which subgoals have been achieved, and reflecting on past actions, become feasible. As the tasks that an intelligent system accomplishes become more and more complex, so does the experience it acquires in the process. Such experience has a temporal extent and is expressed in terms of concepts and relations with deep semantics associated with them. Memory systems should be able to deal with the temporal aspect of experience, exploit this semantic knowledge for storage and retrieval, and do so in a scalable fashion. However, relying on experience alone will not achieve broad coverage; it needs to be used in conjunction with other reasoning mechanisms. That is why we need the ability to add episodic memory functionality to intelligent systems. Today's knowledge-based systems are complex software applications, and the ability to develop them in a modular fashion, using generic, reusable components, is essential. We propose to separate the episodic memory from the system that uses it and to build a generic, reusable memory module that can be attached to a variety of applications in order to provide this functionality. Its goal is to provide accurate, scalable, efficient, and content-addressable access to prior episodes.
Having such a reusable memory module should allow research to focus on the generic aspects of memory representation, organization, and retrieval and their interaction with the external application; it should also reduce the complexity of the overall system. In this dissertation we propose a set of general requirements that any memory module should satisfy regarding memory encoding, storage, and retrieval. We present an implementation that satisfies these requirements and evaluate it on three different tasks: plan synthesis, plan recognition, and Physics problem solving. The memory module proved easily adaptable to these tasks, providing fast, accurate, and scalable retrieval.

Item: Improving magneto-optic data storage densities using nonlinear equalization (2004)
Gupta, Sunil; Womack, Baxter F., 1930-

Item: Qualification of the assembly process of flip-chip BGA packages for the next generation synchronous quad data rate SRAM device to ensure reliability (2012-05)
Shivan, Nivetha; Gale, Richard O.; Bayne, Stephen B.
Quad Data Rate SRAMs (QDR SRAMs) with a maximum speed of 550 MHz are the latest QDR technology on the market. These devices use traditional wire-bonding ball grid array package technology with about 165 signal pins. Next-generation QDR SRAMs are being designed that operate at speeds much higher than 550 MHz and have about twice as many signal pins as present QDRs. These new products require a new packaging interconnect technology, called Flip Chip, in order to accommodate the higher speed and increased number of signal pins, because Flip Chip shows improved electrical properties over wire-bonding technologies. In this thesis, we deal with the qualification of Flip Chip interconnect technology for a higher pin count device.

Item: Robust multithreaded applications (2008-05)
Napper, Jeffrey Michael; Alvisi, Lorenzo
This thesis discusses techniques for improving the fault tolerance of multithreaded applications.
We consider the impact of sharing address space and resources on fault tolerance methods. We develop techniques in two broad categories: conservative multithreaded fault tolerance (C-MTFT), which recovers an entire application on the failure of a single thread, and optimistic multithreaded fault tolerance (OMTFT), which recovers threads independently as necessary. In the latter category, we provide a novel approach to recovering hung threads that improves recovery time by managing access to shared resources, so that hung threads can be restarted while other threads continue execution.

Item: Scalable hardware memory disambiguation (2007-12)
Sethumadhavan, Lakshminarasimhan, 1978-; Burger, Douglas C., Ph. D.
This dissertation deals with one of the long-standing problems in computer architecture: memory disambiguation. Microprocessors typically reorder memory instructions during execution to improve concurrency. Such microprocessors use hardware memory structures for memory disambiguation, known as Load-Store Queues (LSQs), to ensure that memory instruction dependences are satisfied even when the memory instructions execute out of order. A typical LSQ implementation (circa 2006) holds all in-flight memory instructions in a physically centralized LSQ and performs a fully associative search on all buffered instructions to ensure that memory dependences are satisfied. These LSQ implementations do not scale because they use large, fully associative structures, which are slow and power hungry. The increasing trend towards distributed microarchitectures further exacerbates these problems: as on-chip wire delays increase and high-performance processors necessarily become distributed, centralized structures such as the LSQ can limit scalability. This dissertation describes techniques to create scalable LSQs in both centralized and distributed microarchitectures.
The problems and solutions described in this thesis are motivated and validated by real system designs. The dissertation starts with a description of the partitioned primary memory system of the TRIPS processor, of which the LSQ is an important component, and then, through a series of optimizations, describes how the power, area, and centralization problems of the LSQ can be solved with minor performance losses (if any), even for large numbers of in-flight memory instructions. The four solutions described in this dissertation — partitioning, filtering, late binding, and efficient overflow management — enable power- and area-efficient, distributed, and scalable LSQs, which in turn enable aggressive large-window processors capable of simultaneously executing thousands of instructions. To mitigate the power problem, we replace the power-hungry, fully associative search with a power-efficient hash table lookup using a simple address-based Bloom filter. Bloom filters are probabilistic data structures for testing set membership; they can quickly check whether an instruction with the same data address is likely to be found in the LSQ without performing the associative search. Bloom filters typically eliminate more than 80% of the associative searches, and they are highly effective because in most programs it is uncommon for loads and stores to have the same data address and be in execution simultaneously. To rectify the area problem, we observe that only a small fraction of all memory instructions are dependent, that only such dependent instructions need to be buffered in the LSQ, and that these instructions need to be in the LSQ only during certain parts of the pipelined execution. We propose two mechanisms to exploit these observations. The first mechanism, area filtering, is a hardware mechanism that couples Bloom filters and dependence predictors to dynamically identify and buffer only those instructions that are likely to be dependent.
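The address-based Bloom filter check described above can be sketched as follows. This is a minimal illustration, not the TRIPS hardware: the hash function, filter size, and hash count are assumptions chosen for clarity.

```python
import hashlib

class AddressBloomFilter:
    """Minimal Bloom filter over data addresses (illustrative sizes)."""

    def __init__(self, num_bits=256, num_hashes=2):
        self.bits = [False] * num_bits
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _indices(self, addr):
        # Derive num_hashes independent bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{addr}:{i}".encode(),
                                     digest_size=4).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def insert(self, addr):
        for idx in self._indices(addr):
            self.bits[idx] = True

    def may_contain(self, addr):
        # False means "definitely not in the LSQ", so the full
        # associative search can be skipped; True means "possibly there".
        return all(self.bits[idx] for idx in self._indices(addr))

# A load triggers the full associative LSQ search only when the filter
# reports a possible address match.
lsq_filter = AddressBloomFilter()
lsq_filter.insert(0x1000)               # a store's address enters the filter
assert lsq_filter.may_contain(0x1000)   # same address: must search the LSQ
```

The one-sided guarantee is what makes this safe in hardware: a false positive only costs an unnecessary search, never a missed dependence.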
The second mechanism, late binding, reduces the occupancy and hence the size of the LSQ. Together, these optimizations allow the number of LSQ slots to be reduced by up to one-half compared to a traditional organization, without any performance degradation. Finally, we describe a new decentralized LSQ design for handling LSQ structural hazards in distributed microarchitectures. Decentralization of LSQs (and, to a large extent, distributed microarchitectures with memory speculation) has proved impractical because of the high performance penalties of the mechanisms for dealing with hazards. To solve this problem, we apply classic flow-control techniques from interconnection networks to handle resource conflicts. The first method, memory-side buffering, buffers the overflowing instructions in a separate buffer near the LSQs. The second scheme, execution-side NACKing, sends the overflowing instruction back to the issue window, from which it is later re-issued. The third scheme, network buffering, uses the buffers in the interconnection network between the execution units and memory to hold instructions when the LSQ is full, and uses virtual channel flow control to avoid deadlocks. Network buffering is the most robust of the overflow schemes and shows less than 1% performance degradation due to overflows for a subset of the SPEC CPU 2000 and EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS processor. The techniques proposed in this dissertation are independent and architecture-neutral, and their cumulative benefits result in LSQs that can be partitioned at a fine granularity and have low design complexity. Each of these partitions selectively buffers only memory instructions with true dependences and can be closely coupled with the execution units, thus minimizing power, area, and latency.
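Execution-side NACKing can be illustrated with a toy model: instructions flow from an issue window into a bounded LSQ, and an instruction arriving at a full LSQ is NACKed back to the window for later re-issue. This is a sketch under assumed simplifications (a fixed completion period stands in for pipeline timing), not the actual TRIPS mechanism.

```python
from collections import deque

def execute_with_nack(memory_insts, lsq_capacity, completion_period=2):
    """Toy model of execution-side NACKing for LSQ overflow."""
    issue_window = deque(memory_insts)
    lsq = deque()
    completed = []
    cycle = 0
    while issue_window or lsq:
        cycle += 1
        if issue_window:
            inst = issue_window.popleft()
            if len(lsq) < lsq_capacity:
                lsq.append(inst)            # accepted into the LSQ
            else:
                issue_window.append(inst)   # NACK: re-issue later
        # Model completion latency: the oldest LSQ entry retires
        # once every completion_period cycles.
        if lsq and cycle % completion_period == 0:
            completed.append(lsq.popleft())
    return completed

# Every instruction eventually completes despite a single-entry LSQ.
done = execute_with_nack(["ld A", "st B", "ld C"], lsq_capacity=1)
assert sorted(done) == sorted(["ld A", "st B", "ld C"])
```

The point of the sketch is the liveness property: NACKed instructions are never dropped, only delayed, so forward progress depends on the LSQ draining rather than on extra overflow storage.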
Such LSQ designs with near-ideal characteristics are well suited for microarchitectures with thousands of instructions in flight and may enable even more aggressive microarchitectures in the future.

Item: Single-bit electrically erasable programmable read-only memory fails: inline charge monitors for screening plasma-damaged tunnel oxides (Texas Tech University, 2004-12)
Gopalakrishna, Amit Kumbasi
The purpose of this thesis is to inspect single-bit fails in electrically erasable programmable read-only memories (EEPROMs). A single-bit fail is a failure mechanism in which data stored in a memory cell is lost. One or more bit lines have a cell that programs to the required threshold voltage of a written cell, whereas the rest of the written cells in that line have a higher threshold value, due to leakage of the stored data; however, the data is not completely lost. Since gate oxide integrity (GOI) is considered to be the source of the memory's data-retention problem, this thesis examines the tunnel oxide and the process used in manufacturing EEPROMs. The solution is to develop an in-line charge monitor that can screen the tunnel oxide at various stages in the process flow. The idea behind the thesis is to identify damage to the tunnel oxide early in the flow so that further processing of the leaky oxide can be stopped, thereby saving money. The first chapter, Semiconductor Memories, gives a brief introduction to the different kinds of memories available in the market and their evolution, and discusses how the different classes of memories work. Chapter two, Non-Volatile Semiconductor Memory, introduces nonvolatile memory, the family to which EEPROM belongs; various storage mechanisms, the layout, and the physics of EEPROM are taken up. Failure modes and degradation mechanisms in EEPROM are discussed in chapter three, Failure Modes in EEPROM.
Since plasma is considered to be the source of tunnel oxide degradation, plasma physics and charging in plasma are also discussed there. Chapter four, Experimental Approach - Charge Monitors, covers the design and process flow of the in-line monitor. The testing of the wafers, followed by qualitative interpretation of the leakage curves, is included in chapter five, Results and Discussion; the quantitative analysis marks the end of that chapter. A physical model for the charging of the antennas is developed in chapter six, Conclusion and Future Work, which closes with the future course of work.

Item: Single-bit fails in electrically-erasable programmable read-only memory (EEPROM) (Texas Tech University, 2003-05)
Joseph, Aaron
The semiconductor industry relies heavily on the reliability of its products. This thesis examines single-bit fails that were detected in a subset of nonvolatile semiconductor memory, electrically-erasable programmable read-only memory (EEPROM). A brief introduction to the various memory types is presented for a better understanding of the evolution of EEPROMs. The test used to detect the fails is also described. This thesis also presents the results of several experiments done to evaluate the effect of several processing steps and process parameters on the overall functional yield of the device and on the level of single-bit fails.

Item: Switch-based Fast Fourier Transform processor (2008-12)
Mohd, Bassam Jamil, 1968-; Swartzlander, Earl E.
The demand for high-performance, power-scalable DSP processors for telecommunication and portable devices has increased significantly in recent years. The Fast Fourier Transform (FFT) computation is essential to such designs. This work presents a switch-based architecture for radix-2 FFT processors. The processor employs M processing elements, 2M memory arrays, and M read-only memories (ROMs). Each processing element performs one radix-2 butterfly operation.
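The radix-2 butterfly that each processing element performs can be sketched as follows. This is a generic decimation-in-time butterfly with floating-point complex arithmetic; the processor's fixed-point datapath and ROM organization are not modeled.

```python
import cmath

def radix2_butterfly(a, b, twiddle):
    """One radix-2 DIT butterfly: combine two complex inputs with a
    twiddle factor, producing the sum and difference outputs."""
    t = b * twiddle
    return a + t, a - t

def twiddle(k, n):
    # W_N^k = e^(-2*pi*j*k/N); in the hardware these values are
    # read from the ROMs rather than computed.
    return cmath.exp(-2j * cmath.pi * k / n)

# Example: the 2-point FFT of [1, 1] is a single butterfly.
x0, x1 = radix2_butterfly(1 + 0j, 1 + 0j, twiddle(0, 2))
assert (x0, x1) == (2 + 0j, 0j)
```

An N-point FFT needs (N/2)·log2(N) such butterflies, which is why distributing them over M processing elements yields the speedups of M (non-pipelined) and M·log2(N) (pipelined) quoted below.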
The memory arrays are designed as single-port memories, each of size N/(2M), where N is the number of FFT points. Compared with a single processing element, this approach provides a speedup of M. If not addressed, memory collisions degrade processor performance; a novel algorithm to detect and resolve the collisions is presented. When a collision is detected, a memory management operation is executed. The performance of the switch architecture can be further enhanced by pipelining the design, where each pipeline stage employs a switch component. The result is a speedup of M·log2(N) compared with a single processing element. The use of single-port memory reduces design complexity and area. Furthermore, memory arrays significantly reduce power compared with the delay elements used in some FFT processors. The switch-based architecture facilitates deactivating processing elements for power scalability, and it also facilitates implementing different FFT sizes. The VLSI implementation of a non-pipelined switch-based processor is presented. Matlab simulations are conducted to analyze the performance. The timing, power, and area results from RTL, synthesis, and layout simulations are discussed and compared with other processors.

Item: A technology-scalable composable architecture (2007)
Kim, Changkyu; Burger, Douglas C., Ph. D.
Clock rate scaling can no longer sustain computer system performance scaling, due to power and thermal constraints and the diminishing performance returns of deep pipelining. Future performance improvements must therefore come from mining concurrency from applications. However, increasing global on-chip wire delays will limit the amount of state reachable in a single cycle, hampering the ability to mine concurrency with conventional approaches. To address these technology challenges, the processor industry has migrated to chip multiprocessors (CMPs).
The disadvantage of conventional CMP architectures, however, is their relative inflexibility in meeting the wide range of application demands and operating targets that now exist. The granularity (e.g., issue width), the number of processors on a chip, and the memory hierarchies are fixed at design time based on the target workload mix, which results in suboptimal operation as the workload mix and operating targets change over time. In this dissertation, we explore the concept of composability to address both the increasing wire delay problem and the inflexibility of conventional CMP architectures. Composability is the ability to dynamically adapt to diverse applications and operating targets, in terms of both granularity and functionality, by aggregating fine-grained processing units or memory units. First, we propose a composable on-chip memory substrate, called the Non-Uniform Access Cache Architecture (NUCA), to address increasing on-chip wire delay in future large caches. The NUCA substrate breaks large on-chip memories into many fine-grained, independently accessible memory banks, with a switched network embedded in the cache. Lines can be mapped into this array of memory banks with fixed or dynamic mappings; with dynamic mapping, cache lines can move around within the cache to further reduce the average cache hit latency. Second, we evaluate a range of strategies for building a composable processor. Composable processors provide the flexibility to adapt the granularity of processors to various application demands and operating targets, and thus to choose the hardware configuration best suited to any given point. A composable processor consists of a large number of low-power, fine-grained processor cores that can be aggregated dynamically to form more powerful logical processors.
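The dynamic-mapping idea in the NUCA substrate can be sketched with a toy model in which a hit migrates a line one bank closer to the processor, so frequently used lines gravitate toward the fastest banks. The linear bank ordering, unbounded banks, and promote-by-one policy are simplifications for illustration, not the evaluated design.

```python
class ToyNUCA:
    """Toy NUCA cache: banks ordered by distance; bank 0 is closest."""

    def __init__(self, num_banks=4):
        self.banks = [dict() for _ in range(num_banks)]

    def access(self, addr):
        """Return the bank level that hit, or None on a miss."""
        for level, bank in enumerate(self.banks):
            if addr in bank:
                if level > 0:  # gradual promotion toward the processor
                    self.banks[level - 1][addr] = bank.pop(addr)
                return level
        # Miss: install the line in the farthest (slowest) bank.
        self.banks[-1][addr] = None
        return None

cache = ToyNUCA()
assert cache.access(0x40) is None   # cold miss: line lands in bank 3
assert cache.access(0x40) == 3      # distant hit, line promoted to bank 2
assert cache.access(0x40) == 2      # a hot line keeps moving closer
```

Hit latency in a real NUCA depends on the bank's physical distance, so this migration directly lowers the average latency of hot lines.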
We present architectural innovations to support composability in a power- and area-efficient manner.

Item: The use of memory state knowledge to improve computer memory system organization (2011-05)
Isen, Ciji; John, Lizy Kurian; McKinley, Kathryn S.; Erez, Mattan; Aziz, Adnan; Bhargava, Ravi; Gratz, Paul V.
The trends in virtualization and multi-core, multiprocessor environments have translated into a massive increase in the amount of main memory each individual system must be fitted with in order to effectively utilize this growing compute capacity. The increasing demand on main memory implies that main memory devices and their issues are as important a part of system design as the central processors. The primary issues of modern memory are power, energy, and the scaling of capacity. Nearly a third of system power and energy can come from the memory subsystem. At the same time, modern main memory devices are limited by technology in their future ability to scale and keep pace with modern program demands, requiring the exploration of alternative main memory storage technologies. This dissertation exploits dynamic knowledge of memory state and memory data values to improve memory performance and reduce memory energy consumption. This research proposes a cross-boundary approach that communicates information about dynamic memory management state (allocated and deallocated memory) between software and the hardware memory subsystem through a combination of ISA support and hardware structures. These mechanisms help identify memory operations to regions of memory that have no impact on the correct execution of the program because those regions were either freshly allocated or deallocated.
This inference stems from the fact that data in deallocated memory regions is no longer useful to the program, and data in freshly allocated memory is also not useful because the program has not yet defined it. By being cognizant of this, such memory operations can be avoided, saving energy and improving the usefulness of the main memory. Furthermore, when stores write zeros to memory, this research reduces the number of stores to memory by capturing the zeros as compressed information stored along with the memory management state. Using these methods, this dissertation harnesses memory management state and data value information to achieve significant savings in energy consumption while extending the endurance limit of memory technologies.
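The zero-store idea can be illustrated with a toy memory model that records all-zero blocks as a flag in the management metadata instead of issuing the data stores. The class name, block granularity, and flag representation are illustrative assumptions, not the dissertation's hardware design.

```python
class ZeroAwareMemory:
    """Toy main-memory model: all-zero writes are captured as a per-block
    flag alongside allocation state, so no data store is issued for them."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.is_zero = [True] * num_blocks  # freshly allocated reads as zero
        self.data = {}                      # only non-zero blocks stored
        self.stores_issued = 0

    def write(self, block, values):
        if any(values):                     # genuine data: store normally
            self.is_zero[block] = False
            self.data[block] = list(values)
            self.stores_issued += 1
        else:                               # all zeros: metadata only
            self.is_zero[block] = True
            self.data.pop(block, None)

    def read(self, block):
        if self.is_zero[block]:
            return [0] * self.block_size    # reconstructed, never stored
        return self.data[block]

mem = ZeroAwareMemory(num_blocks=8, block_size=4)
mem.write(0, [0, 0, 0, 0])   # zero write: captured as a flag, no store
mem.write(1, [7, 0, 0, 1])   # non-zero write: stored normally
assert mem.stores_issued == 1
assert mem.read(0) == [0, 0, 0, 0]
```

Every avoided store saves the energy of a memory write and, for endurance-limited technologies, one write cycle on the affected cells.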