Browsing by Subject "NoC"

Now showing 1 - 12 of 12

A verilog-hdl implementation of virtual channels in a network-on-chip router
(2009-05-15) Park, Sungho
As feature sizes continue to shrink and integration density increases, interconnections have become a dominating factor in determining the overall quality of a chip. Because of its limited scalability, the system bus, which can support only a limited number of functional units, cannot meet the requirements of current System-on-Chip (SoC) implementations. Long global wires also cause many design problems, such as routing congestion, noise coupling, and difficult timing closure. Network-on-Chip (NoC) architectures have been proposed as an alternative that solves these problems by using a packet-based communication network. The processing elements (PEs) communicate with each other by exchanging messages over the network, and these messages pass through buffers in each router. Buffers are one of the major resources used by routers in virtual channel flow control. In this thesis, we analyze two buffer allocation approaches, static and dynamic. Both aim to increase throughput and minimize latency by means of virtual channel flow control. In a statically allocated buffer architecture, size and organization are design-time decisions and thus do not perform optimally for all traffic conditions. In addition, statically allocated virtual channels waste area and consume significant leakage power. Dynamic buffer allocation schemes, by contrast, claim that buffer utilization can be increased using dynamic virtual channels. The dynamic virtual channel regulator (ViChaR) has been proposed; it uses a centralized buffer architecture that dynamically allocates virtual channels and buffer slots in real time depending on traffic conditions. ViChaR's dynamic buffer management scheme increases buffer utilization, but it also increases design complexity. In this research, we reexamine the performance, power consumption, and area of ViChaR's buffer architecture through implementation. We implement a generic router and a ViChaR architecture using Verilog-HDL. The RTL code is verified by dynamic simulation and synthesized with Design Compiler to obtain area and power consumption. In addition, we obtain latency through static timing analysis. The results show that ViChaR's dynamic buffer management scheme increases latency and power consumption significantly even though it can increase buffer utilization. Therefore, a novel design is needed to achieve high buffer utilization without these losses.
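The utilization difference between static and dynamic buffer allocation described in this abstract can be illustrated with a toy calculation. The sketch below is not the thesis's router or ViChaR itself: the function name, the demand vector, and the buffer sizes are all hypothetical, and it ignores the control logic, latency, and power costs that the thesis measures.

```python
def accepted_flits(demand, slots_per_vc=None, pool_size=None):
    """Count how many waiting flits one input port can buffer.

    demand       -- flits waiting on each virtual channel (hypothetical values)
    slots_per_vc -- static scheme: every VC owns a fixed private slice
    pool_size    -- dynamic scheme: all VCs draw from one shared pool
    """
    if slots_per_vc is not None:
        # Static: a busy VC cannot borrow the idle slots of its neighbours.
        return sum(min(d, slots_per_vc) for d in demand)
    # Dynamic: only the total number of slots limits what can be stored.
    return min(sum(demand), pool_size)


# Skewed traffic: one VC is busy, the others are nearly idle.
demand = [7, 1, 0, 0]
static_accepted = accepted_flits(demand, slots_per_vc=4)   # 4 VCs x 4 slots
dynamic_accepted = accepted_flits(demand, pool_size=16)    # same 16 slots, shared

print("static :", static_accepted, "of 16 slots used")
print("dynamic:", dynamic_accepted, "of 16 slots used")
```

With the same total storage, the shared pool holds all eight waiting flits while the static split holds only five. This is the utilization gain the abstract refers to; the extra complexity of managing the shared pool is what the thesis finds costly.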
Analysis of high performance interconnect in SoC with distributed switches and multiple issue bus protocols
(2011-05) Narayanasetty, Bhargavi; John, Lizy Kurian; Korson, Steve
In a System on a Chip (SoC), interconnect is the factor limiting Performance, Power, Area and Schedule (PPAS). Distributed crossbar switches, also called Switching Central Resources (SCRs), are often used to implement the high performance interconnect, a Network on a Chip (NoC), in an SoC. Multiple issue bus protocols such as AXI (from ARM) and VBUSM (from TI) are used in paths critical to the performance of the whole chip. The effects of architectural modifications to the SCRs on PPAS are analyzed experimentally, using synthesis tools and Texas Instruments (TI) in-house power estimation tools. The effects of scaling SCR sizes are discussed in this report. These results provide a quick means of estimating the impact of architectural changes in the early design phase. Apart from SCR design, the other major area of concern is deadlocks. Deadlocks are situations in which network resources are suspended waiting for each other. In this report, various kinds of deadlocks are classified and their respective mitigations in such networks are provided. These analyses are necessary to qualify distributed SCR interconnect, which uses multiple issue protocols, across all transaction scenarios. The entire analysis in this report is carried out using a flagship product of Texas Instruments. This ASIC SoC is a complex wireless base station developed in 2010-2011, having 20 major cores. Since crossbar switches with multiple issue bus protocols are commonly used in SoCs across the semiconductor industry, this report provides a strong basis for architectural/design selection and validation of all such high performance device interconnects. This report can be used as a seed for the development of an interface tool for architects. For a given architecture, the tool suggests architectural modifications and reports deadlock situations. This new tool will help architects close design problems and provide a competitive specification very early in the design cycle. A working algorithm for the tool development is included in this report.
Asynchronous Bypass Channels Improving Performance for Multi-synchronous Network-on-chips
(2011-10-21) Jain, Tushar Naveen Kumar
Dr. Paul V. Gratz. Network-on-Chip (NoC) designs have emerged as a replacement for traditional shared-bus designs for on-chip communications. As with all current VLSI design, however, reducing power consumption in NoCs is a critical challenge. One approach to reducing power is to dynamically scale the voltage and frequency of each network node or group of nodes (DVFS). Another approach is to replace the balanced clock tree with a globally-asynchronous, locally-synchronous (GALS) clocking scheme. NoCs implemented with either of these schemes, however, tend to have high latencies because packets must be synchronized at the intermediate nodes between source and destination. In this work, we propose a novel router microarchitecture which offers superior performance versus typical synchronizing router designs. Our approach features Asynchronous Bypass Channels (ABCs) at intermediate nodes, thus avoiding synchronization delay. We also propose a new network topology and routing algorithm that leverage the advantages of the bypass channel offered by our router design. Our experiments show that our design improves the performance of a conventional synchronizing design with similar resources by up to 26 percent at low loads and increases saturation throughput by up to 50 percent.
Communication Reliability in Network on Chip Designs
(2012-10-19) Kumar, Reeshav
The performance of low latency Network on Chip (NoC) architectures, which incorporate fast bypass paths to reduce communication latency, is limited by crosstalk-induced skewing of signal transitions on link wires. As a result of crosstalk interactions between wires, signal transitions belonging to the same flit or bit vector arrive at the destination at different times and are likely to violate setup and hold time constraints for the design. This thesis proposes a two-step technique, TransSync-RecSync, to dynamically eliminate packet errors resulting from inter-bit-line transition skew. The proposed approach adds minimally to router complexity and involves no wire overhead. The actual throughput of NoC designs with asynchronous bypass designs is evaluated and the benefits of augmenting such schemes with the proposed design are studied. The TransSync, TransSync-2-lines and RecSync schemes described here are found to improve the average communication latency by 26%, 20% and 38% respectively in a 7x7 mesh NoC with asynchronous bypass channels. This work also evaluates the bit-error ratio (BER) performance of several existing crosstalk avoidance and error correcting schemes and compares them to that of the proposed schemes. Both the TransSync and RecSync schemes are dynamic in nature and can be switched on and off on the fly. The proposed schemes can therefore be employed to impart unequal error protection (UEP) against intra-flit skewing on NoC links. In UEP, a larger fraction of the energy budget is spent on protecting those parts of the data being transmitted on the link which have a higher priority, while less effort is expended on protecting relatively less important parts of the data. This allows us to achieve the prescribed level of performance with lower levels of power. The benefits of the presented technique are illustrated using an H.264 video decoder system-on-chip (SoC) employing an NoC architecture. We show that for Akiyo test streams transmitted over 3 mm long link wires, the power consumption can be reduced by as much as 20% at the cost of an acceptable degradation in average peak signal to noise ratio (PSNR) with UEP.
Control Techniques for Uncore Power Mangement in Chip Multiprocessor Designs
(2013-08-01) Xu, Zheng
In chip-multiprocessor (CMP) designs, as the number of cores increases, the size of the on-chip communication fabric and data storage grows accordingly, and the chip power challenge is therefore exacerbated. This thesis considers power management for the network-on-chip (NoC) and the last level cache, which together constitute the uncore in CMP designs. The NoC is regarded as a scalable approach to cope with the increasing demand for on-chip communication bandwidth. The last level cache is shared among all cores. The focus of this work is on control techniques for uncore dynamic voltage and frequency scaling. A realistic but not well-studied scenario is investigated: the entire uncore shares a single voltage/frequency domain, as opposed to the separate domains assumed in most previous works. One appealing advantage here is that data packets no longer experience the interfacing overhead across different voltage/frequency domains. The classic PI (Proportional and Integral) control method is adopted due to its simplicity, flexibility and low implementation overhead. The outcome of this thesis research includes three parts. First, the stability of the PI control is analyzed. Second, a model-assisted PI control scheme is proposed and studied. The model assist addresses the problem that no universally good reference point exists for the control. Third, the windup issue for the PI control is investigated. Full architecture simulations are performed on public benchmark suites to validate the proposed techniques. The results show a 76% energy reduction with less than 6% performance degradation compared to a constantly high voltage/frequency for the uncore.
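For readers unfamiliar with the PI control and windup issue mentioned in this abstract, the following is a minimal textbook sketch of a discrete PI controller with a simple anti-windup guard (conditional integration). It is not the thesis's controller or its model-assisted scheme: the class name, the gains, the frequency range, and the use of a "load" signal as the measured quantity are all assumptions chosen for illustration.

```python
class PIController:
    """Discrete PI controller with conditional-integration anti-windup."""

    def __init__(self, kp, ki, out_min, out_max):
        self.kp = kp
        self.ki = ki
        self.out_min = out_min      # e.g. lowest allowed frequency (hypothetical)
        self.out_max = out_max      # e.g. highest allowed frequency (hypothetical)
        self.integral = 0.0

    def update(self, target, measured):
        # Positive error means the measured load exceeds the target,
        # so the output (frequency) should rise.
        error = measured - target
        candidate = self.integral + error
        unclamped = self.kp * error + self.ki * candidate
        output = min(max(unclamped, self.out_min), self.out_max)
        # Anti-windup: accumulate the error only while the actuator is
        # not saturated, so the integral cannot grow without bound.
        if unclamped == output:
            self.integral = candidate
        return output


ctrl = PIController(kp=1.0, ki=0.5, out_min=0.5, out_max=2.0)

# A long burst of high load drives the output to its upper limit.
for _ in range(100):
    saturated_output = ctrl.update(target=0.5, measured=1.0)
integral_after_burst = ctrl.integral

# When the load drops, the output leaves saturation on the very next step.
recovered_output = ctrl.update(target=0.5, measured=0.0)
print(saturated_output, integral_after_burst, recovered_output)
```

Without the guard, the integral term would keep growing throughout the burst and the controller would stay pinned at the upper limit long after the load had dropped, which is the windup behaviour the abstract refers to.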
Design, Implementation and Evaluation of a Configurable NoC for AcENoCs FPGA Accelerated Emulation Platform
(2011-10-21) Lotlikar, Swapnil Subhash
The heterogeneous nature of modern applications and their demand for extensive parallel processing have resulted in widespread use of Multicore System-on-Chip (SoC) architectures. The emerging Network-on-Chip (NoC) architecture provides an energy-efficient and scalable communication solution for Multicore SoCs, serving as a powerful replacement for traditional bus-based solutions. The key to successful realization of such architectures is a flexible, fast and robust emulation platform for fast design space exploration. In this research, we present the design and evaluation of a highly configurable NoC used in AcENoCs (Accelerated Emulation platform for NoCs), a flexible and cycle accurate field programmable gate array (FPGA) emulation platform for validating NoC architectures. Along with the implementation details, we also discuss the various design optimizations and tradeoffs, and assess the performance improvements of AcENoCs over existing simulators and emulators. We design a hardware library consisting of routers and links using the Verilog hardware description language (HDL). The router is parameterized and has a configurable number of physical ports, virtual channels (VCs) and pipeline depth. A packet switched NoC is constructed by connecting the routers in either a 2D-Mesh or a 2D-Torus topology. The NoC is integrated in the AcENoCs platform and prototyped on a Xilinx Virtex-5 FPGA. The NoC was evaluated under various synthetic and realistic workloads generated by AcENoCs' traffic generators implemented on the Xilinx MicroBlaze embedded processor. In order to validate the NoC design, performance metrics like average latency and throughput were measured and compared against the results obtained using standard network simulators. FPGA implementation of the NoC using Xilinx tools indicated a 76% LUT utilization for a 5x5 2D-Mesh network. The VC allocator was found to be the single largest consumer of hardware resources within a router. The router design synthesized at a frequency of 135MHz, 124MHz and 109MHz for 3-port, 4-port and 5-port configurations, respectively. The operational frequency of the router in the AcENoCs environment was limited only by the software execution latency even though the hardware itself could be clocked at a much higher rate. The AcENoCs emulator showed speedup improvements of 10000-12000X over HDL simulators and 5-15X over software simulators, without sacrificing cycle accuracy.
Dynamic Power Management of High Performance Network on Chip
(2012-02-14) Mandal, Suman Kalyan
With the increased density of modern Systems on Chip (SoCs), communication between nodes has become a major problem. Network on Chip is a novel on-chip communication paradigm that addresses this by using a highly scalable and efficient packet switched network. The addition of intelligent networking on the chip adds to the chip's power consumption, thus making management of communication power an interesting and challenging research problem. While VLSI techniques have evolved over time to enable power reduction at the circuit level, the highly dynamic nature of modern large SoCs demands more than that. This dissertation explores some innovative dynamic solutions to manage the ever increasing communication power in the post sub-micron era. Today's highly integrated SoCs require a great level of cross layer optimization to provide maximum efficiency. This dissertation approaches the dynamic power management problem from the top down. Techniques ranging from system level power distribution and management down to microarchitecture enhancements were found necessary to deliver maximum power efficiency. A distributed power budget sharing technique is proposed. To efficiently satisfy the established power budget, a novel flow control and throttling technique is proposed. Finally, the power efficiency of the underlying microarchitecture is explored and novel buffer and link management techniques are developed. All of the proposed techniques yield improvement in the power-performance efficiency of the NoC infrastructure.
Hybrid Nanophotonic NOC Design for GPGPU
(2012-07-16) Yuan, Wen
Due to their massive computational power, Graphics Processing Units (GPUs) have become a popular platform for executing general purpose parallel applications. The majority of on-chip communication in a GPU architecture occurs between memory controllers and compute cores, so memory controllers become hot spots and bottlenecks when conventional mesh interconnection networks are used. Leveraging this observation, we reduce network latency and improve throughput by providing a nanophotonic ring network which connects all memory controllers. This new interconnection network employs a new routing algorithm that combines Dimension Ordered Routing (DOR) and nanophotonic ring algorithms. With this new topology, we reduce interconnection network latency by 17% on average (up to 32%) and improve IPC by 5% on average (up to 11.5%). We also analyze the application characteristics of six CUDA benchmarks on the GPGPU-Sim simulator to obtain a better perspective for designing high performance GPU interconnection networks.
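Dimension Ordered Routing, one of the two ingredients named in this abstract, can be sketched in a few lines for a 2D mesh. The example below shows only the generic XY form of DOR under assumed (x, y) node coordinates; the function name is hypothetical, and the nanophotonic ring and the combined routing algorithm proposed in the thesis are not modeled.

```python
def dor_route(src, dst):
    """Return the list of mesh nodes visited by XY dimension-ordered routing.

    src, dst -- (x, y) coordinates of routers in a 2D mesh (assumed layout).
    The packet first travels along X until the column matches, then along Y.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # resolve the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then resolve the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = dor_route((0, 0), (2, 1))
print(route)
```

Because every packet finishes its X hops before taking any Y hop, the route between two nodes is fixed, which is also why traffic converging on a few destinations such as memory controllers tends to create the hot spots the abstract mentions.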
Mobile Home Node: Improving Directory Cache Coherence Performance in NoCs via Exploitation of Producer-Consumer Relationships
(2011-10-21) Soni, Tarun
The implementation of multiple processors on a single chip has been made possible by advancements in process technology. The benefits of having multiple cores on a single chip bring with them a new set of constraints for maintaining fast and consistent memory accesses. Cache coherence protocols are needed to maintain the consistency of shared memory across individual caches. Current cache coherence protocols are either snoop based, which is not scalable but provides fast access for a small number of cores, or directory based, in which a directory acts as the ordering point, providing scalability with relatively slower access. Our focus is on improving the memory access time of the scalable directory protocol. We have observed that most memory requests follow a pattern wherein one of the processors, which we will dub the Producer, repeatedly writes to a particular memory location. A subset of the remaining cores, which we will dub the Consumers, repeatedly read the data from that same memory location. In our implementation we utilize this relationship to provide direct cache to cache transfers and minimize the access time by avoiding the indirection through the directory. We move the directory temporarily to the Producer node so that a Consumer can request the cache line directly from the Producer. Our technique improves the memory access time by 13 percent and reduces network traffic by 30 percent over the standard directory coherence protocol with very little area overhead.
NoC Resource Allocation Based on Physical Design Techniques
(2014-05-05) Yang, Gongming
The Network-on-Chip (NoC) has been recognized as a scalable approach for on-chip communication. Quality-of-Service (QoS) is a fundamental part of application-specific NoCs. This thesis focuses on resource allocation in NoCs, with the aim of improving their capability for Guaranteed Service (GS). A graph model is adopted to describe the physical and temporal resources of a NoC. Based on the graph model, an RRR-based algorithm is proposed for simultaneous routing and time slot allocation. In addition, a negotiation-based algorithm is suggested for achieving power-efficient QoS for application-specific NoCs. Last, a hybrid NoC architecture, which combines circuit switching and packet switching, is developed and investigated. Experimental results show that our techniques outperform previous works.
Ocin_tsim - A DVFS Aware Simulator for NoC Design Space Exploration and Optimization
(2010-07-14) Prabhu, Subodh
Networks-on-Chip (NoCs) are a general purpose, scalable replacement for shared medium wired interconnects, offering many practical applications in industry. Dynamic Voltage Frequency Scaling (DVFS) is a technique whereby a chip's voltage-frequency levels are varied at run time, often used to conserve dynamic power. Various DVFS-based NoC optimization techniques have been proposed. However, due to the resources required to validate architectural decisions through prototyping, few are implemented. As a result, designers are faced with a lack of insight into potential power savings or performance gains at early architecture stages. This thesis proposes a DVFS aware NoC simulator with support for per node power-frequency modeling to allow fine-tuning of such optimization techniques early in the design cycle. The proposed simulator also provides a framework for benchmarking various candidate strategies to allow selective prototyping and optimization. As part of the research, DVFS extensions were built for an existing NoC performance simulator and released for public use. This thesis presents some preliminary results from our simulator, which show that the average power consumed per node is quite similar across all the benchmarks in the SPLASH-2 benchmark suite [74]. This thesis also serves as a technical manual for the simulator extensions. Important links for downloading and using the simulator are provided at the end of this document in Appendix C.
SPAcENoCs : A Scalable Platform for FPGA Accelerated Emulator of NoCs
(2013-05-06) Chen, Guangming
The majority of modern high performance computing systems employ on-chip multi-processors. As the number of on-chip cores soars, traditional non-scalable communication infrastructures, commonly shared buses or crossbars, can no longer accommodate the increasing communication demand of modern multi-core chips. The newly emerging Network-On-Chip (NoC) interconnection scheme provides a scalable, robust and power-efficient solution that also satisfies the requirements on both bandwidth and latency. A tool that enables swift exploration of the vast NoC design space is therefore in great demand to meet the stiff time pressure on research and development. Building on AcENoCs, an NoC simulator based on software/hardware codesign that targets large simulatable network sizes, SPAcENoCs (Scalable Platform for FPGA Accelerated Emulator of NoCs) employs Time-Division Multiplexing (TDM) techniques to implement a simulator for even larger NoCs without sacrificing the simulation speed and cycle accuracy highlighted in the work on AcENoCs. This paper focuses on re-organization of the given software/hardware codesigned frameworks so that the TDM techniques may be applied. While both frameworks require re-design, the major effort involves reconstruction of the hardware framework by adding data buffers and affiliated logic to ensure that the data generated in different time divisions are properly preserved and transmitted. Various design tradeoffs between hardware budget and simulation performance are also discussed and attempted in this paper. During the development process, the techniques of device virtualization and generic programming are introduced to overcome the verification challenges that are commonly seen in software/hardware codesigned systems. The synthesis results of various design options suggest that the simulation of a 9 × 6 network, more than twice the largest applicable size in AcENoCs, can be accommodated by the device. Based on the simulation results of AcENoCs, the estimated speedup of SPAcENoCs over a software simulator for the 9 × 6 NoC is around 28-94X, twice that achieved by AcENoCs in a smaller network.
