The duplicate bug report analysis was done using linguistic features alone [1]. The process is described in detail below.
Input : The fields "Summary", "Description" and "Comments" were considered; this configuration is denoted SDC (Summary, Description and Comments). A second configuration leaves out "Comments" and uses "Summary" and "Description" alone; it is denoted SD (Summary and Description).
Preprocessing : Stop words were removed and stemming was performed. To observe the impact of stemming, results are compared with and without it.
Feature vector : Both term frequency (TF) and term frequency-inverse document frequency (TF-IDF) weights are used, and the two are compared.
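The preprocessing and feature-vector steps above can be sketched roughly as follows. This is only an illustration, not the code used for the experiments: the stop-word list, the toy suffix stripper `naive_stem` (a stand-in for a real stemmer such as Porter's) and the IDF formula log(N/df) are all assumptions, since the page does not say which stop-word list, stemmer or TF-IDF variant was used.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the one used in the experiments.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "when", "it"}


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (hypothetical helper)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text, stem=True):
    """Lowercase, split on non-alphanumerics, drop stop words, optionally stem."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens] if stem else tokens


def tf_vector(tokens):
    """Term frequency: raw count of each term in one report."""
    return dict(Counter(tokens))


def idf_weights(tokenized_docs):
    """Inverse document frequency, assumed here to be log(N / df)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per report
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """TF-IDF: term count scaled by the term's IDF weight (0 for unseen terms)."""
    return {term: freq * idf.get(term, 0.0) for term, freq in Counter(tokens).items()}
```

For the SDC setting the text passed to `tokenize` would be the concatenation of summary, description and comments; for SD it would be summary and description only. Calling `tokenize(text, stem=False)` gives the "without stemming" variant.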
The OpenMRS dataset had 83 duplicate reports out of 2315. Each pair, consisting of a duplicate report and the report it duplicates, was collected manually from the OpenMRS TRAC page. 7 of the 83 reports do not mention the report they duplicate, i.e., they are just marked as duplicate. Also, 2 of the 83 duplicate more than one report. This finally gives a dataset of 78 pairs of duplicate reports.
The evaluation metric is chosen to be similar to that of [1]. For each report, the 5, 10 or 15 most similar reports are predicted; this number is denoted "K". If the other report of the pair is among the predictions, it is counted as a hit; otherwise it is counted as a non-hit. The ratio of hits is given in the following table.
S.no | TF/TF-IDF | Stemming | SDC/SD | K | Ratio of hits |
1 | TF | No | SDC | 5 | 0.346 |
2 | TF | No | SDC | 10 | 0.372 |
3 | TF | No | SDC | 15 | 0.423 |
4 | TF-IDF | No | SDC | 5 | 0.474 |
5 | TF-IDF | No | SDC | 10 | 0.55 |
6 | TF-IDF | No | SDC | 15 | 0.55 |
7 | TF | Yes | SDC | 5 | 0.5 |
8 | TF | Yes | SDC | 10 | 0.576 |
9 | TF | Yes | SDC | 15 | 0.5897 |
10 | TF-IDF | Yes | SDC | 5 | 0.55 |
11 | TF-IDF | Yes | SDC | 10 | 0.60 |
12 | TF-IDF | Yes | SDC | 15 | 0.63 |
13 | TF | No | SD | 5 | 0.37 |
14 | TF | No | SD | 10 | 0.397 |
15 | TF | No | SD | 15 | 0.410 |
16 | TF-IDF | No | SD | 5 | 0.487 |
17 | TF-IDF | No | SD | 10 | 0.538 |
18 | TF-IDF | No | SD | 15 | 0.564 |
19 | TF | Yes | SD | 5 | 0.474 |
20 | TF | Yes | SD | 10 | 0.487 |
21 | TF | Yes | SD | 15 | 0.512 |
22 | TF-IDF | Yes | SD | 5 | 0.512 |
23 | TF-IDF | Yes | SD | 10 | 0.63 |
24 | TF-IDF | Yes | SD | 15 | 0.67 |
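The "Ratio of hits" column can be understood through a small sketch of the top-K evaluation. It assumes that reports are already represented as sparse term-weight dictionaries and that similarity is measured with cosine similarity; the page does not state which similarity measure was used, so cosine is only an assumption here, and the names `top_k_similar` and `hit_ratio` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_similar(query_id, vectors, k):
    """Return the ids of the k reports most similar to the query report."""
    query = vectors[query_id]
    scored = [(cosine(query, vec), rid) for rid, vec in vectors.items() if rid != query_id]
    # Highest similarity first; ties are broken by report id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rid for _, rid in scored[:k]]


def hit_ratio(pairs, vectors, k):
    """Fraction of (duplicate, original) pairs whose original is in the top-k list."""
    hits = sum(1 for dup, orig in pairs if orig in top_k_similar(dup, vectors, k))
    return hits / len(pairs)


# Made-up toy data, not taken from the OpenMRS dataset.
vectors = {
    "r1": {"login": 1, "crash": 2},
    "r2": {"login": 1, "crash": 1},
    "r3": {"page": 1, "form": 1},
    "r4": {"form": 2, "save": 1},
    "r5": {"form": 1, "save": 1},
}
pairs = [("r2", "r1"), ("r4", "r3")]

print(hit_ratio(pairs, vectors, k=1))  # 0.5: r4's nearest report is r5, not r3
print(hit_ratio(pairs, vectors, k=2))  # 1.0: r3 appears once two candidates are allowed
```

As in the table, the ratio can only stay the same or grow as K increases, because a larger K extends the same ranked list.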
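From the results, TF-IDF with stemming gives the highest hit ratio at every K, for both SDC and SD, compared with TF and with the runs without stemming. The use of comments does not improve accuracy much: for TF-IDF with stemming it even degrades performance slightly at K = 10 and K = 15 (0.60 vs 0.63 and 0.63 vs 0.67).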
[1] Runeson, P., Alexandersson, M., and Nyholm, O. 2007. Detection of Duplicate Defect Reports Using Natural Language Processing. In Proceedings of the 29th international Conference on Software Engineering (May 20 - 26, 2007). International Conference on Software Engineering. IEEE Computer Society, Washington, DC, 499-510. DOI= http://dx.doi.org/10.1109/ICSE.2007.32