Browsing by Subject "SoC"

Now showing 1 - 6 of 6

Analysis of high performance interconnect in SoC with distributed switches and multiple issue bus protocols
(2011-05) Narayanasetty, Bhargavi; John, Lizy Kurian; Korson, Steve
In a System on a Chip (SoC), interconnect is the factor limiting Performance, Power, Area and Schedule (PPAS). Distributed crossbar switches also called as Switching Central Resources (SCR) are often used to implement high performance interconnect in a SoC – Network on a Chip (NoC). Multiple issue bus protocols like AXI (from ARM), VBUSM (from TI) are used in paths critical to the performance of the whole chip. Experimental analysis of effects on PPAS by architectural modifications to the SCRs is carried out, using synthesis tools and Texas Instruments (TI) in house power estimation tools. The effects of scaling of SCR sizes are discussed in this report. These results provide a quick means of estimation for architectural changes in the early design phase. Apart from SCR design, the other major domain, which is a concern, is deadlocks. Deadlocks are situations where the network resources are suspended waiting for each other. In this report various kinds of deadlocks are classified and their respective mitigations in such networks are provided. These analyses are necessary to qualify distributed SCR interconnect, which uses multiple issue protocols, across all scenarios of transactions. The entire analysis in this report is carried out using a flagship product of Texas Instruments. This ASIC SoC is a complex wireless base station developed in 2010- 2011, having 20 major cores. Since the parameters of crossbar switches with multiple issue bus protocols are commonly used in SoCs across the semiconductor industry, this reports provides us a strong basis for architectural/design selection and validation of all such high performance device interconnects. This report can be used as a seed for the development of an interface tool for architects. For a given architecture, the tool suggests architectural modifications, and reports deadlock situations. This new tool will aid architects to close design problems and bring provide a competitive specification very early in the design cycle. A working algorithm for the tool development is included in this report.
Built-in self test of RF subsystems
(2008-12) Zhang, Chaoming, 1980-; Abraham, Jacob A.
With the rapid development of wireless and wireline communications, a variety of new standards and applications are emerging in the marketplace. In order to achieve higher levels of integration, RF circuits are frequently embedded into System on Chip (SoC) or System in Package (SiP) products. These developments, however, lead to new challenges in manufacturing test time and cost. Use of traditional RF test techniques requires expensive high frequency test instruments and long test time, which makes test one of the bottlenecks for reducing IC costs. This research is in the area of built-in self test technique for RF subsystems. In the test approach followed in this research, on-chip detectors are used to calculate circuits specifications, and data converters are used to collect the data for analysis by an on-chip processor. A novel on-chip amplitude detector has been designed and optimized for RF circuit specification test. By using on-chip detectors, both the system performance and specifications of the individual components can be accurately measured. On-chip measurement results need to be collected by Analog to Digital Converters (ADCs). A novel time domain, low power ADC has been designed for this purpose. The ADC architecture is based on a linear voltage controlled delay line. Using this structure results in a linear transfer function for the input dependent delay. The time delay difference is then compared to a reference to generate a digital code. Two prototype test chips were fabricated in commercial CMOS processes. One is for the RF transceiver front end with on-chip detectors; the other is for the test ADC. The 940MHz RF transceiver front-end was implemented with on-chip detectors in a 0.18 [micrometer] CMOS technology. The chips were mounted onto RF Printed Circuit Boards (PCBs), with tunable power supply and biasing knobs. The detector was characterized with measurements which show that the detector keeps linear performance over a wide input amplitude range of 500mV. Preliminary simulation and measurements show accurate transceiver performance prediction under process variations. A 300MS/s 6 bit ADC was designed using the novel time domain architecture in a 0.13 [micrometer] standard digital CMOS process. The simulation results show 36.6dB Signal to Noise Ratio (SNR), 34.1dB Signal to Noise and Distortion Ratio (SNDR) for 99MHz input, Differential Non-Linearity (DNL)<0.2 Least Significant Bit (LSB), and Integral Non-Linearity (INL)<0.5LSB. Overall chip power is 2.7mW with a 1.2V power supply. The built-in detector RF test was extended to a full transceiver RF front end test with a loop-back setup, so that measurements can be made to verify the benefits of the technique. The application of the approach to testing gain, linearity and noise figure was investigated. New detector types are also evaluated. In addition, the low-power delay-line based ADC was characterized and improved to facilitate gathering of data from the detector. Several improved ADC structures at the system level are also analyzed. The built-in detector based RF test technique enables the cost-efficient test for SoCs.
Core-characteristic-aware off-chip memory management in a multicore system-on-chip
(2012-12) Jeong, Min Kyu; Erez, Mattan; John, Lizy K.; Chiou, Derek; Lin, Calvin; Schulte, Michael J.
Future processors will integrate an increasing number of cores because the scaling of single-thread performance is limited and because smaller cores are more power efficient. Off-chip memory bandwidth that is shared between those many cores, however, scales slower than the transistor (and core) count does. As a result, in many future systems, off-chip bandwidth will become the bottleneck of heavy demand from multiple cores. Therefore, optimally managing the limited off-chip bandwidth is critical to achieving high performance and efficiency in future systems. In this dissertation, I will develop techniques to optimize the shared use of limited off-chip memory bandwidth in chip-multiprocessors. I focus on issues that arise from the sharing and exploit the differences in memory access characteristics, such as locality, bandwidth requirement, and latency sensitivity, between the applications running in parallel and competing for the bandwidth. First, I investigate how the shared use of memory by many cores can result in reduced spatial locality in memory accesses. I propose a technique that partitions the internal memory banks between cores in order to isolate their access streams and eliminate locality interference. The technique compensates for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. For three different workload groups that consist of benchmarks with high spatial locality, low spatial locality, and mixes of the two, the average system efficiency improves by 10%, 7%, 9% for 2-rank systems, and 18%, 25%, 20% for 1-rank systems, respectively, over the baseline shared-bank system. Next, I improve the performance of a heterogeneous system-on-chip (SoC) in which cores have distinct memory access characteristics. I develop a deadline-aware shared memory bandwidth management scheme for SoCs that have both CPU and GPU cores. I show that statically prioritizing the CPU can severely constrict GPU performance, and propose to dynamically adapt the priority of CPU and GPU memory requests based on the progress of GPU workload. The proposed dynamic bandwidth management scheme provides the target GPU performance while prioritizing CPU performance as much as possible, for any CPU-GPU workload combination with different complexities.
Correct low power design transformations for hardware systems
(2013-08) Viswanath, Vinod; Abraham, Jacob A.
We present a generic proof methodology to automatically prove correctness of design transformations introduced at the Register-Transfer Level (RTL) to achieve lower power dissipation in hardware systems. We also introduce a new algorithm to reduce switching activity power dissipation in microprocessors. We further apply our technique in a completely different domain of dynamic power management of Systems-on-Chip (SoCs). We demonstrate our methodology on real-life circuits. In this thesis, we address the dual problem of transforming hardware systems at higher levels of abstraction to achieve lower power dissipation, and a reliable way to verify the correctness of the afore-mentioned transformations. The thesis is in three parts. The first part introduces Instruction-driven Slicing, a new algorithm to automatically introduce RTL/System level annotations in microprocessors to achieve lower switching power dissipation. The second part introduces Dedicated Rewriting, a rewriting based generic proof methodology to automatically prove correctness of such high-level transformations for lowering power dissipation. The third part implements dedicated rewriting in the context of dynamically managing power dissipation of mobile and hand-held devices. We first present instruction-driven slicing, a new technique for annotating microprocessor descriptions at the Register Transfer Level in order to achieve lower power dissipation. Our technique automatically annotates existing RTL code to optimize the circuit for lowering power dissipated by switching activity. Our technique can be applied at the architectural level as well, achieving similar power gains. We first demonstrate our technique on architectural and RTL models of a 32-bit OpenRISC pipelined processor (OR1200), showing power gains for the SPEC2000 benchmarks. These annotations achieve reduction in power dissipation by changing the logic of the design. We further extend our technique to an out-of-order superscalar core and demonstrate power gains for the same SPEC2000 benchmarks on architectural and RTL models of PUMA, a fixed point out-of-order PowerPC microprocessor. We next present dedicated rewriting, a novel technique to automatically prove the correctness of low power transformations in hardware systems described at the Register Transfer Level. We guarantee the correctness of any low power transformation by providing a functional equivalence proof of the hardware design before and after the transformation. Dedicated rewriting is a highly automated deductive verification technique specially honed for proving correctness of low power transformations. We provide a notion of equivalence and establish the equivalence proof within our dedicated rewriting system. We demonstrate our technique on a non-trivial case study. We show equivalence of a Verilog RTL implementation of a Viterbi decoder, a component of the DRM System-On-Chip (SoC), before and after the application of multiple low power transformations. We next apply dedicated rewriting to a broader context of holistic power management of SoCs. This in turn creates a self-checking system and will automatically flag conflicting constraints or rules. Our system will manage power constraint rules using dedicated rewriting specially honed for dynamic power management of SoC designs. Together, this provides a common platform and representation to seamlessly cooperate between hardware and software constraints to achieve maximum platform power optimization dynamically during execution. We demonstrate our technique in multiple contexts on an SoC design of the state-of-the-art next generation Intel smartphone platform. Finally, we give a proof of instruction-driven slicing. We first prove that the annotations automatically introduced in the OR1200 processor preserve the original functionality of the machine using the ACL2 theorem prover. Then we establish the same proof within our dedicated rewriting system, and discuss the merits of such a technique and a framework. In the context of today's shrinking hardware and mobile internet devices, lowering power dissipation is a key problem. Verifying the correctness of transformations which achieve that is usually a time-consuming affair. Automatic and reliable methods of verification that are easy to use are extremely important. In this thesis we have presented one such transformation, and a generic framework to prove correctness of that and similar transformations. Our methodology is constructed in a manner that easily and seamlessly fits into the design cycle of creating complicated hardware systems. Our technique is also general enough to be applied in a completely different context of dynamic power management of mobile and hand-held devices.
Design, Implementation and Evaluation of a Configurable NoC for AcENoCs FPGA Accelerated Emulation Platform
(2011-10-21) Lotlikar, Swapnil Subhash
The heterogenous nature and the demand for extensive parallel processing in modern applications have resulted in widespread use of Multicore System-on-Chip (SoC) architectures. The emerging Network-on-Chip (NoC) architecture provides an energy-efficient and scalable communication solution for Multicore SoCs, serving as a powerful replacement for traditional bus-based solutions. The key to successful realization of such architectures is a flexible, fast and robust emulation platform for fast design space exploration. In this research, we present the design and evaluation of a highly configurable NoC used in AcENoCs (Accelerated Emulation platform for NoCs), a flexible and cycle accurate field programmable gate array (FPGA) emulation platform for validating NoC architectures. Along with the implementation details, we also discuss the various design optimizations and tradeoffs, and assess the performance improvements of AcENoCs over existing simulators and emulators. We design a hardware library consisting of routers and links using verilog hardware description language (HDL). The router is parameterized and has a configurable number of physical ports, virtual channels (VCs) and pipeline depth. A packet switched NoC is constructed by connecting the routers in either 2D-Mesh or 2D-Torus topology. The NoC is integrated in the AcENoCs platform and prototyped on Xilinx Virtex-5 FPGA. The NoC was evaluated under various synthetic and realistic workloads generated by AcENoCs' traffic generators implemented on the Xilinx MicroBlaze embedded processor. In order to validate the NoC design, performance metrics like average latency and throughput were measured and compared against the results obtained using standard network simulators. FPGA implementation of the NoC using Xilinx tools indicated a 76% LUT utilization for a 5x5 2D-Mesh network. A VC allocator was found to be the single largest consumer of hardware resources within a router. The router design synthesized at a frequency of 135MHz, 124MHz and 109MHz for 3-port, 4-port and 5-port configurations, respectively. The operational frequency of the router in the AcENoCs environment was limited only by the software execution latency even though the hardware itself could be clocked at a much higher rate. An AcENoCs emulator showed speedup improvements of 10000-12000X over HDL simulators and 5-15X over software simulators, without sacrificing cycle accuracy.
Integrated temperature sensors in deep sub-micron CMOS technologies
(2014-05) Chowdhury, Golam Rasul; Hassibi, Arjang
Integrated temperature sensors play an important role in enhancing the performance of on-chip power and thermal management systems in today's highly-integrated system-on-chip (SoC) platforms, such as microprocessors. Accurate on-chip temperature measurement is essential to maximize the performance and reliability of these SoCs. However, due to non-uniform power consumption by different functional blocks, microprocessors have fairly large thermal gradient (and variation) across their chips. In the case of multi-core microprocessors for example, there are task-specific thermal gradients across different cores on the same die. As a result, multiple temperature sensors are needed to measure the temperature profile at all relevant coordinates of the chip. Subsequently, the results of the temperature measurements are used to take corrective measures to enhance the performance, or save the SoC from catastrophic over-heating situations which can cause permanent damage. Furthermore, in a large multi-core microprocessor, it is also imperative to continuously monitor potential hot-spots that are prone to thermal runaway. The locations of such hot spots depend on the operations and instruction the processor carries out at a given time. Due to practical limitations, it is an overkill to place a big size temperature sensor nearest to all possible hot spots. Thus, an ideal on-chip temperature sensor should have minimal area so that it can be placed non-invasively across the chip without drastically changing the chip floor plan. In addition, the power consumption of the sensors should be very low to reduce the power budget overhead of thermal monitoring system, and to minimize measurement inaccuracies due to self-heating. The objective of this research is to design an ultra-small size and ultra-low power temperature sensor such that it can be placed in the intimate proximity of all possible hot spots across the chip. The general idea is to use the leakage current of a reverse-bias p-n junction diode as an operand for temperature sensing. The tasks within this project are to examine the theoretical aspect of such sensors in both Silicon-On-Insulator (SOI), and bulk Complementary Metal-Oxide Semiconductor (CMOS) technologies, implement them in deep sub-micron technologies, and ultimately evaluate their performances, and compare them to existing solutions.

Browsing by Subject "SoC"

Results Per Page

Sort Options