Tuesday, June 1, 2010

Duplicate bug report identification results



The duplicate bug report analysis was done using linguistic features [1] alone. Let us look at the process in detail below.

Input : The fields "Summary", "Description" and "Comments" were considered together; this is denoted as SDC (Summary, Description and Comments). A second variant leaves out the "Comments" field and uses "Summary" and "Description" alone; it is denoted as SD (Summary and Description).

Preprocessing : Stop words were removed and stemming was performed. To observe the impact of stemming, results are compared with and without it.
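
To make the preprocessing concrete, here is a small plain-Java sketch of it. The stop-word list and the suffix-stripping rules are toy stand-ins for the full stop-word list and Porter stemmer actually used, and the class and method names are only for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {

    // Toy stop-word list; the real run uses a full English stop-word list.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "is", "are", "of", "to", "and", "in"));

    // Naive suffix stripping, standing in for a real Porter stemmer.
    private static String stem(String token) {
        if (token.endsWith("ing") && token.length() > 5) return token.substring(0, token.length() - 3);
        if (token.endsWith("ed") && token.length() > 4) return token.substring(0, token.length() - 2);
        if (token.endsWith("s") && token.length() > 3) return token.substring(0, token.length() - 1);
        return token;
    }

    // Lower-case, split on non-letters, drop stop words, optionally stem.
    public static List<String> tokens(String text, boolean stemming) {
        List<String> result = new ArrayList<String>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (tok.length() == 0 || STOP_WORDS.contains(tok)) continue;
            result.add(stemming ? stem(tok) : tok);
        }
        return result;
    }
}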

Feature vector : Both term frequency (TF) and term frequency-inverse document frequency (TF-IDF) weighting are used, and the two are compared.
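
A small sketch of how the TF and TF-IDF weights are computed from the token lists is given below (again, names are only for illustration; it assumes every report being weighted is itself part of the corpus, so the document frequency is never zero).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Weights {

    // Raw term frequencies for a single report's token list.
    public static Map<String, Double> tf(List<String> tokens) {
        Map<String, Double> tf = new HashMap<String, Double>();
        for (String t : tokens) {
            Double c = tf.get(t);
            tf.put(t, c == null ? 1.0 : c + 1.0);
        }
        return tf;
    }

    // TF-IDF: term frequency scaled by log(N / document frequency).
    // Assumes the report is itself in the corpus, so df >= 1 for every term.
    public static Map<String, Double> tfIdf(List<String> tokens, List<List<String>> corpus) {
        Map<String, Double> weights = tf(tokens);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            int df = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(e.getKey())) df++;
            }
            e.setValue(e.getValue() * Math.log((double) corpus.size() / df));
        }
        return weights;
    }
}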

The OpenMRS dataset has 83 duplicate reports out of 2315. Each duplicate report was paired manually with the report it duplicates, using the OpenMRS TRAC page. 7 of the 83 reports do not mention which report they duplicate, i.e., they are merely marked as duplicate, and 2 of the 83 duplicate more than one report. This leaves a dataset of 78 duplicate report pairs. The evaluation metric is chosen to be similar to [1]: for each report the K most similar reports are predicted, with K = 5, 10 and 15. If the known duplicate is among the predictions, it is counted as a hit; otherwise it is a non-hit. The ratio of hits is given in the table below.
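
To make the metric concrete, here is a small sketch of the hit-ratio evaluation (names are only for illustration; topK is assumed to hold, for each duplicate report id, the K most similar report ids predicted by the similarity ranking).

import java.util.List;
import java.util.Map;

public class HitRatio {

    // pairs : each element is {duplicate report id, id of the report it duplicates}.
    // topK  : for each duplicate report id, the K most similar report ids predicted.
    public static double hitRatio(List<int[]> pairs, Map<Integer, List<Integer>> topK) {
        int hits = 0;
        for (int[] pair : pairs) {
            List<Integer> predicted = topK.get(pair[0]);
            if (predicted != null && predicted.contains(pair[1])) hits++;
        }
        return (double) hits / pairs.size();
    }
}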



S.no   TF/TF-IDF   Stemming   SDC/SD   K    Ratio of hits
1      TF          No         SDC      5    0.346
2      TF          No         SDC      10   0.372
3      TF          No         SDC      15   0.423
4      TF-IDF      No         SDC      5    0.474
5      TF-IDF      No         SDC      10   0.55
6      TF-IDF      No         SDC      15   0.55
7      TF          Yes        SDC      5    0.5
8      TF          Yes        SDC      10   0.576
9      TF          Yes        SDC      15   0.5897
10     TF-IDF      Yes        SDC      5    0.55
11     TF-IDF      Yes        SDC      10   0.60
12     TF-IDF      Yes        SDC      15   0.63
13     TF          No         SD       5    0.37
14     TF          No         SD       10   0.397
15     TF          No         SD       15   0.410
16     TF-IDF      No         SD       5    0.487
17     TF-IDF      No         SD       10   0.538
18     TF-IDF      No         SD       15   0.564
19     TF          Yes        SD       5    0.474
20     TF          Yes        SD       10   0.487
21     TF          Yes        SD       15   0.512
22     TF-IDF      Yes        SD       5    0.512
23     TF-IDF      Yes        SD       10   0.63
24     TF-IDF      Yes        SD       15   0.67


From the results, TF-IDF with stemming performs better than TF without stemming. Also, using comments does not improve accuracy much; in fact it degrades performance slightly.

[1] Runeson, P., Alexandersson, M., and Nyholm, O. 2007. Detection of Duplicate Defect Reports Using Natural Language Processing. In Proceedings of the 29th international Conference on Software Engineering (May 20 - 26, 2007). International Conference on Software Engineering. IEEE Computer Society, Washington, DC, 499-510. DOI= http://dx.doi.org/10.1109/ICSE.2007.32

Thursday, May 27, 2010

Choice of Machine Learning algorithms, configuring API

Let us look at each task individually. The feature vector construction for this phase of the analysis uses term frequency (TF) vectors and term frequency-inverse document frequency (TF-IDF) vectors, each with and without stemming. This gives four combinations of feature representation.
  • Bug classification : The dataset has 2315 reports, out of which 825 are bugs and 1490 are non-bug reports. Bugs are identified using the Type column (Type = "Defect"). This is a typical binary classification problem, where the labels are bug and non-bug (task, enhancement, etc.). The algorithms of choice include SVM, Naive Bayes, K-Nearest Neighbors and decision trees. The API that will be used is Weka; a minimal sketch appears after this list.

  • Duplicate identification : This task needs similarity to be computed between an input report and the other reports in the database; if the similarity exceeds a threshold, the report is flagged as a duplicate. The threshold must be fixed with respect to a hold-out set of 85 duplicate bug reports.

  • Expert identification : In the literature, this task is treated as a classification problem where experts are the class labels. If the organization has components and teams that work on them, expert identification becomes a two-phase task: Phase I identifies the component, and Phase II is a classification where the target labels are confined to the experts who work on the component identified in Phase I. Since the component is available for analysis (field name "components"), the focus is Phase II. Since we do not have a huge dataset to train a model for each expert (2315 reports and 184 experts), intuitively a classifier that suits this scenario is K-Nearest Neighbors. Thus the expert identification task must be evaluated with different classifiers, and the choice of K-Nearest Neighbors must be analysed.

  • The dataset, the feature vector construction code (TF, TF-IDF, with and without stemming) and the Weka API are in place. For duplicate identification, Euclidean distance and cosine similarity functions will be evaluated; sketches of both are given after this list. The following posts will contain performance results for each task.
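
For the classification tasks (bug vs. non-bug and expert identification), the following is a minimal Weka sketch. It assumes the feature vectors have already been exported to an ARFF file with the class label as the last attribute; the file name is hypothetical.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifyReports {
    public static void main(String[] args) throws Exception {
        // Load the pre-built feature vectors (hypothetical file name).
        Instances data = new DataSource("openmrs-reports.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute

        // K-Nearest Neighbors (IBk) with K = 5; SMO, NaiveBayes or J48 can be swapped in here.
        IBk knn = new IBk(5);

        // Estimate accuracy with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}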
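
And here is a small sketch of the two similarity functions to be evaluated for duplicate identification, computed over sparse term-weight maps (term to TF or TF-IDF weight); the class and method names are only for illustration.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Similarity {

    // Cosine similarity between two sparse term-weight vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) normB += w * w;
        return dot == 0.0 ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean distance between the same two vectors (a missing term counts as weight 0).
    public static double euclidean(Map<String, Double> a, Map<String, Double> b) {
        Set<String> terms = new HashSet<String>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0.0;
        for (String t : terms) {
            double wa = a.containsKey(t) ? a.get(t) : 0.0;
            double wb = b.containsKey(t) ? b.get(t) : 0.0;
            sum += (wa - wb) * (wa - wb);
        }
        return Math.sqrt(sum);
    }
}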

Tuesday, May 25, 2010

Training and test set creation

The first step in any data mining project is the creation and analysis of training and test sets, and this project is no exception. For this, the XML dump of the OpenMRS TRAC is used [shared by Burke]. Training and test sets were created for 1) expert identification, 2) duplicate bug report identification and 3) bug classification. The following posts will cover the issues and experience of each task individually.

GSoC 2010 - OpenMRS

Hi all,

This blog will hold my experiences and information regarding my GSoC 2010 (Google Summer of Code) project. The organization is OpenMRS and the project deals with adding functionality to bug tracking tools like JIRA, TRAC, etc. The target tool for this project is JIRA. The functionality involves the following:
  • Automatic bug triage

  • Duplicate bug report identification

  • Classifying a report as a bug or not

  • Finding the likelihood of a bug being fixed