Performance diagnosis in large operational networks




IP networks have become the unified platform that supports a rich and extremely diverse set of applications and services, including traditional IP data service, Voice over IP (VoIP), smart mobile devices (e.g., iPhone), Internet television (IPTV) and online gaming. Network performance and reliability are critical issues in today's operational networks because many applications impose increasingly stringent reliability and performance requirements on the network. Even the smallest network performance degradation could cause significant customer distress. In addition, new network and service features (e.g., MPLS fast re-route capabilities) are continually rolled out across the network to support new applications, improve network performance, and reduce operational cost. Network operators are challenged with ensuring that network reliability and performance improve over time, even in the face of constant changes, network and service upgrades, and recurring faulty behaviors. It is critical to detect, troubleshoot and repair performance degradations in a timely and accurate fashion. This is extremely challenging in large IP networks because of their massive scale, complicated topology, high protocol complexity, and continuous evolution through software or hardware upgrades, configuration changes, and traffic engineering.

In this dissertation, we first propose NICE (Network-wide Information Correlation and Exploration), a novel infrastructure that enables detection and troubleshooting of chronic network conditions by analyzing statistical correlations across multiple data sources. NICE uses a novel circular permutation test to determine the statistical significance of a correlation. It also allows flexible analysis at various spatial granularities (e.g., link, router, or network level). We validate NICE using real measurement data collected at a tier-1 ISP network, with positive results. We then apply NICE to troubleshoot real network issues in the tier-1 ISP network through three case studies. In all three, NICE successfully uncovers previously unknown chronic network conditions, resulting in improved network operations.
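The idea behind a circular permutation test can be illustrated with a minimal sketch. This is not the NICE implementation: the function name, the use of Pearson correlation, the `min_shift` parameter, and the standard-score summary are all assumptions made for illustration. The sketch rotates one time series against the other, which preserves each series' own autocorrelation while breaking their alignment, and then asks how unusual the unshifted correlation is compared with the shifted ones.

```python
import numpy as np

def circular_permutation_test(x, y, min_shift=1):
    """Score the correlation of x and y against circularly shifted copies of y.

    Hypothetical helper for illustration; returns (observed correlation, score).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D series of equal length")
    n = len(x)
    shifts = range(min_shift, n - min_shift + 1)
    if not shifts:
        raise ValueError("series too short for the requested min_shift")

    # Correlation of the two series as actually observed (zero shift).
    observed = np.corrcoef(x, y)[0, 1]

    # Null distribution: correlation after rotating y by every allowed offset.
    null = np.array([np.corrcoef(x, np.roll(y, s))[0, 1] for s in shifts])

    # Standard score of the observed correlation against the shifted ones
    # (an assumed summary statistic; a large value suggests significance).
    score = (observed - null.mean()) / null.std()
    return observed, score
```

Under these assumptions, two event series that spike together receive a high score, while a pair whose apparent correlation also shows up at arbitrary offsets does not. The cutoff that counts as significant is a tuning choice left to the caller.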

Second, we extend NICE to detect and troubleshoot performance problems in IPTV networks. Compared to traditional ISP networks, an IPTV distribution network typically adopts a different structure (tree-like multicast as opposed to mesh), imposes more restrictive service constraints (in both reliability and performance), and often faces a much larger scalability challenge (managing millions of residential gateways versus thousands of provider-edge routers). We propose Giza, a novel multi-resolution data analysis approach tailored to the scale and structure of IPTV networks, which enables fast detection and localization of the regions in the multicast tree hierarchy where a problem becomes significant. Furthermore, we develop several statistical data mining techniques to troubleshoot the identified problems and diagnose their root causes. Validation against operational experience demonstrates the effectiveness of our approach in detecting important performance issues and identifying interesting dependencies.
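One standard way to localize problems in a tree hierarchy is hierarchical heavy hitter detection, sketched below. This is an assumption chosen for illustration and is not claimed to be Giza's algorithm: the function name, the `children` and `leaf_counts` structures, the node names, and the fixed `threshold` are all hypothetical. The sketch flags a node when the event count in its subtree, after discounting descendants that were already flagged, reaches the threshold.

```python
def hierarchical_heavy_hitters(children, leaf_counts, root, threshold):
    """Return tree nodes whose residual event count reaches the threshold.

    children: maps a node to the list of its child nodes (leaves are absent).
    leaf_counts: maps a leaf to the number of problem events seen there.
    """
    hitters = []

    def residual(node):
        kids = children.get(node, [])
        if kids:
            # Aggregate only what the flagged descendants did not explain.
            count = sum(residual(kid) for kid in kids)
        else:
            count = leaf_counts.get(node, 0)
        if count >= threshold:
            hitters.append(node)
            return 0  # these events are now attributed to this node
        return count

    residual(root)
    return hitters


# Hypothetical three-level hierarchy: root, regions, offices, gateways.
tree = {
    "root": ["region-a", "region-b"],
    "region-a": ["office-1", "office-2"],
    "region-b": ["office-3"],
    "office-1": ["gw1", "gw2"],
    "office-2": ["gw3", "gw4"],
    "office-3": ["gw5", "gw6"],
}
counts = {"gw1": 12, "gw2": 1, "gw3": 4, "gw4": 4, "gw5": 0, "gw6": 1}
result = hierarchical_heavy_hitters(tree, counts, "root", threshold=8)
```

With these made-up counts, `result` is `["gw1", "office-2"]`: one gateway is a problem by itself, while at `office-2` no single gateway stands out but the office as a whole does. Reporting each problem at the level where it first becomes significant is the point of analyzing at multiple resolutions.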

Finally, we design and implement MERCURY, a novel infrastructure for detecting the impact of network upgrades on performance. It is crucial to monitor the network when upgrades are made, because upgrades can have a significant impact on network performance and, if not monitored, may lead to unexpected consequences in operational networks. Such monitoring can be done manually for a small number of devices, but it does not scale to large networks with hundreds or thousands of routers and an extremely large number of different upgrades made on a regular basis. MERCURY extracts interesting triggers from a large number of network maintenance activities. It then identifies behavior changes in network performance caused by the triggers. It uses statistical rule mining and network configuration to identify commonality across the behavior changes. We systematically evaluate MERCURY using data collected at a large tier-1 ISP network. By comparing against operational practice, we show that MERCURY is able to capture the interesting triggers and the behavior changes they induce. In some cases, MERCURY also discovers previously unknown network behaviors, demonstrating its effectiveness in identifying network conditions that would otherwise fly under the radar.
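As a rough illustration of what identifying a behavior change around a trigger can mean, the sketch below compares a performance metric before and after an upgrade using Welch's t statistic. This is a swapped-in, textbook technique and not MERCURY's detection method. The function name, the `upgrade_index` parameter, and the sample values are hypothetical.

```python
import numpy as np

def level_shift_score(series, upgrade_index):
    """Welch's t statistic for a change in mean level at upgrade_index.

    Hypothetical helper: positive values mean the metric rose after the
    upgrade, negative values mean it fell.
    """
    before = np.asarray(series[:upgrade_index], dtype=float)
    after = np.asarray(series[upgrade_index:], dtype=float)
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two samples on each side of the upgrade")
    # Standard error of the difference in means, without assuming equal variance.
    se = np.sqrt(before.var(ddof=1) / len(before) + after.var(ddof=1) / len(after))
    return (after.mean() - before.mean()) / se


# Made-up CPU utilization samples, with an upgrade applied after sample 8.
cpu = [10, 11, 9, 10, 12, 10, 11, 9, 15, 16, 14, 15, 17, 15, 16, 14]
shift = level_shift_score(cpu, upgrade_index=8)
```

A large absolute score suggests that the metric settled at a different level after the upgrade. A usable detector would also have to account for seasonality and for changes that coincide with the upgrade by chance, which this sketch ignores.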