Abstract
The dissertation presents an approach that uncovers evolutionary couplings from the version histories of software systems. The version histories used are those stored in software repositories managed by version-control tools such as Subversion and CVS. A combination of lightweight source-code analysis and differencing, development heuristics, and sequential-pattern mining techniques is used to uncover the evolutionary couplings. Evolutionary couplings are identified between different artifact types (e.g., between source code and documentation, and between end-user documentation in different natural languages), as well as at fine granularities of source code (e.g., methods, control statements, preprocessor directives, and even comments). The specific archival source used is the set of commits stored in these repositories. This work falls within Mining Software Repositories (MSR) and Empirical Software Engineering. The applications of evolutionary couplings are demonstrated on a number of software evolution tasks, such as change analysis and prediction, traceability link recovery, and software documentation localization. The approach is applied to and evaluated on a number of versions of the K Desktop Environment (KDE), an open source system. The results show that the approach achieves high precision in predicting certain types of future changes from past evolutionary couplings. No other work has systematically shown the use of version archives in uncovering traceability links, supporting documentation localization, and predicting source code changes at fine granularity over multiple versions. Additionally, the approach is used to automatically mine latent programming rules, consisting of function calls and their syntactic context, and violations of those rules (e.g., a missing or out-of-order call) from a large body of source code. The inclusion of call order and syntactic context is a substantial step forward from previous work.
Version histories of two open source systems, the Linux kernel and the Apache httpd server, are used in the evaluation. The results show that the approach uncovers rules and violations that a previous approach did not. Finally, an in-depth survey and the first taxonomy of MSR approaches are byproducts of this work.