Effective bug detection and localization using information retrieval

Saha, Ripon Kumar

Effective bug detection and localization using information retrieval

Date

2016-05

Authors

Saha, Ripon Kumar

Abstract

Software bugs pose a fundamental threat to the reliability of software systems, even in systems designed with the best software engineering (SE) teams using the best SE practices. Detecting bugs early and fixing them quickly are extremely important. However, they are very expensive and challenging, especially at-scale. While the sciences of bug detection (e.g., software testing) and localization via static and dynamic program analyses have been explored considerably, text-based Information Retrieval (IR) techniques for bug detection and localization are interesting and promising new approaches for these problems. One advantage of text-based approaches is that it can utilize a lot of (implicit) semantic information about a program’s functionality from the program text, which is almost impossible to extract using program analysis based techniques. This dissertation builds a deeper understanding of current bug triaging and fixing processes via mining software repositories, and introduces new techniques for effective bug detection and localization. The dissertation has three main parts. First, we perform a number of empirical studies to investigate the extent of and reasons for long lived bugs, their severities, and time spent in different phases of bug fixing process. We demonstrate that many bugs remain unfixed for inordinate period of time due to numerous reasons, including difficulties in detecting, localizing, and fixing them. Second, we demonstrate that developers use very similar program text in source code and their corresponding test cases, which could be utilized to implement powerful test prioritization techniques. We introduce a novel IR based regression test prioritization technique called REPiR that embodies our insight, and show that REPiR is more efficient than program analysis based or dynamic coverage based techniques. Third, we demonstrate that fine grained program text such as class names, method names, variable names, and comments carry different levels of information, and it can be utilized to improve IR based bug localization. We introduce a structured retrieval technique called BLUiR that embodies our insights and show that BLUiR outperforms the existing state-of-the-art IR-based bug localization approaches. Finally, we further improve BLUiR by natural language processing. We make four contributions in this dissertation. One, we provide empirical evidence that there are considerable numbers of non-trivial bugs in software projects that survive for a long time. We describe the reasons for delay in fixing, the nature of fixes, and overall fixing process of these long lived bugs in a great detail. Two, we introduce the notion of IR-based regression test prioritization based on program changes. Three, we introduce the notion of structured retrieval for bug localization. Four, we provide an in-depth analysis of the extent to which natural languages processing can play an important role in improving IR-based bug localization further. The central ideas are embodied in a suite of prototype tools. Rigorous empirical evaluation is performed to validate the efficacy of the proposed techniques using datasets containing a variety of real-world Java and C programs.