jira-bug-analysis: Duplicate bug report identification results

The duplicate bug report analysis was done using linguistic features [1] alone. The process is described in detail below.

Input : The fields "Summary", "Description" and "Comments" were considered; this combination is denoted SDC (Summary, Description and Comments). The analysis was also run without the "Comments" field, using "Summary" and "Description" alone; this is denoted SD (Summary and Description).
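The two input variants can be sketched as a single helper that joins the chosen fields into one text per report. This is only an illustration: the dictionary keys ("summary", "description", "comments") and the function name build_text are assumptions, not the actual field names or code used in the analysis.

```python
def build_text(report, use_comments=True):
    """Join the fields of one report into a single string.

    use_comments=True gives the SDC input, use_comments=False gives SD.
    The dictionary keys are hypothetical names chosen for this sketch.
    """
    parts = [report.get("summary", ""), report.get("description", "")]
    if use_comments:
        parts.extend(report.get("comments", []))
    # Skip empty fields so they do not add blank lines.
    return "\n".join(p for p in parts if p)


# Made-up report, for illustration only.
report = {
    "summary": "Login page crashes",
    "description": "Stack trace appears after submitting the form.",
    "comments": ["Same here on version 1.6."],
}
sd_text = build_text(report, use_comments=False)
sdc_text = build_text(report, use_comments=True)
```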

Preprocessing : Stop words were removed and stemming was performed. To observe the impact of stemming, results are compared with and without it.
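A minimal sketch of this preprocessing step follows. The stop-word list is a tiny made-up sample, and crude_stem is a deliberately simple suffix stripper standing in for a real stemmer such as Porter's; the post does not say which stop-word list or stemmer was actually used.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "were",
              "to", "of", "in", "on", "it", "when"}

# Suffixes removed by the stand-in stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(token):
    """Strip one common suffix. Not a real stemmer, only a placeholder."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stem=True):
    """Lowercase, tokenize, drop stop words, and optionally stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


print(preprocess("The pages crashed when loading", stem=False))
# ['pages', 'crashed', 'loading']
print(preprocess("The pages crashed when loading", stem=True))
# ['page', 'crash', 'load']
```

The stem flag mirrors the "Stemming" column of the results table: the same pipeline is run once with it and once without.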

Feature vector : Both term frequency (TF) and term frequency-inverse document frequency (TF-IDF) are used, and the two are compared.
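The difference between the two weightings can be illustrated with a small sketch. The IDF variant used here, log(N / df), and the cosine similarity measure are assumptions made for the example; the post does not state which IDF formula or similarity measure was used.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Raw term counts for one document."""
    return dict(Counter(tokens))


def idf_weights(docs):
    """log(N / df) per term, where docs is a list of token lists (assumed variant)."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Term counts scaled by the IDF weight of each term."""
    return {term: count * idf.get(term, 0.0)
            for term, count in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Three made-up, already preprocessed reports.
docs = [
    ["login", "page", "crash"],
    ["login", "page", "slow"],
    ["login", "export", "fail"],
]
idf = idf_weights(docs)
tf_sim = cosine(tf_vector(docs[0]), tf_vector(docs[2]))
tfidf_sim = cosine(tfidf_vector(docs[0], idf), tfidf_vector(docs[2], idf))
print(tf_sim)     # about 0.333: the shared word "login" counts fully under TF
print(tfidf_sim)  # 0.0: "login" occurs in every report, so its IDF weight is 0
```

In this toy corpus the only word shared by the first and third report appears in every report, so TF treats the two as somewhat similar while TF-IDF gives it no weight at all.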

The OpenMRS dataset had 83 duplicate reports out of 2315. Each pair, consisting of a duplicate report and the report that it duplicates, was collected manually from the OpenMRS TRAC page. 7 of the 83 reports do not name the report they duplicate, i.e., they are just marked as duplicate. Also, 2 of the 83 duplicate more than one report. So the final dataset has 78 pairs of duplicate reports. The evaluation metric is chosen similar to [1]. For each report, the 5, 10 and 15 most similar reports are predicted; this number is given by "K". If the other report of the pair is among the predictions, it is counted as a hit; otherwise it is counted as a non-hit. The ratio of hits is given in the following table.
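The hit ratio at K can be sketched as follows. The function names, the report ids and the similarity scores are all made up for illustration; the scores are assumed to come from some similarity measure computed beforehand.

```python
def top_k(scores, k):
    """Return the ids of the k highest-scoring candidate reports."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [report_id for report_id, _ in ranked[:k]]


def hit_ratio(pairs, similarity, k):
    """Fraction of pairs whose original report is among the top-k predictions.

    pairs: list of (duplicate_id, original_id) tuples.
    similarity: dict mapping duplicate_id -> {candidate_id: score}.
    """
    hits = sum(1 for dup, orig in pairs if orig in top_k(similarity[dup], k))
    return hits / len(pairs)


# Hypothetical scores for two duplicate reports against three candidates.
similarity = {
    "D1": {"A": 0.9, "B": 0.4, "C": 0.1},
    "D2": {"A": 0.2, "B": 0.3, "C": 0.8},
}
pairs = [("D1", "A"), ("D2", "B")]
print(hit_ratio(pairs, similarity, k=1))  # 0.5
print(hit_ratio(pairs, similarity, k=2))  # 1.0
```

With K = 1 only D1 finds its original; raising K to 2 also brings in the original of D2, which is why the ratio can only stay the same or grow as K increases.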

S.no	TF/TF-IDF	Stemming	SDC/SD	K	Ratio of hits
1	TF	No	SDC	5	0.346
2	TF	No	SDC	10	0.372
3	TF	No	SDC	15	0.423
4	TF-IDF	No	SDC	5	0.474
5	TF-IDF	No	SDC	10	0.55
6	TF-IDF	No	SDC	15	0.55
7	TF	Yes	SDC	5	0.5
8	TF	Yes	SDC	10	0.576
9	TF	Yes	SDC	15	0.5897
10	TF-IDF	Yes	SDC	5	0.55
11	TF-IDF	Yes	SDC	10	0.60
12	TF-IDF	Yes	SDC	15	0.63
13	TF	No	SD	5	0.37
14	TF	No	SD	10	0.397
15	TF	No	SD	15	0.410
16	TF-IDF	No	SD	5	0.487
17	TF-IDF	No	SD	10	0.538
18	TF-IDF	No	SD	15	0.564
19	TF	Yes	SD	5	0.474
20	TF	Yes	SD	10	0.487
21	TF	Yes	SD	15	0.512
22	TF-IDF	Yes	SD	5	0.512
23	TF-IDF	Yes	SD	10	0.63
24	TF-IDF	Yes	SD	15	0.67

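From the results, TF-IDF with stemming performs better than TF and better than the variants without stemming; the highest hit ratio, 0.67, is obtained with TF-IDF, stemming and SD at K = 15. The use of comments does not improve accuracy much, and in some configurations it even degrades performance slightly (for example, 0.63 for SDC against 0.67 for SD with TF-IDF, stemming and K = 15).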

[1] Runeson, P., Alexandersson, M., and Nyholm, O. 2007. Detection of Duplicate Defect Reports Using Natural Language Processing. In Proceedings of the 29th International Conference on Software Engineering (May 20-26, 2007). IEEE Computer Society, Washington, DC, 499-510. DOI: http://dx.doi.org/10.1109/ICSE.2007.32

Tuesday, June 1, 2010
