Tuesday, June 1, 2010

Duplicate bug report identification results



The duplicate bug report analysis was done using linguistic features [1] alone. Let us look at the process in detail below.

Input : The fields "Summary", "Description" and "Comments" were considered together; this is denoted as SDC (Summary, Description and Comments). A second variant leaves out the "Comments" field and uses "Summary" and "Description" alone; it is denoted as SD (Summary and Description).

Preprocessing : Stop words were removed and stemming was performed. To observe the impact of stemming, results are compared with and without it.
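
To make the preprocessing concrete, here is a small plain-Java sketch of it. The stop-word list and the suffix-stripping rules are toy stand-ins for the full stop-word list and Porter stemmer actually used, and the class and method names are only for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {

    // Toy stop-word list; the real run uses a full English stop-word list.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "is", "are", "of", "to", "and", "in"));

    // Naive suffix stripping, standing in for a real Porter stemmer.
    private static String stem(String token) {
        if (token.endsWith("ing") && token.length() > 5) return token.substring(0, token.length() - 3);
        if (token.endsWith("ed") && token.length() > 4) return token.substring(0, token.length() - 2);
        if (token.endsWith("s") && token.length() > 3) return token.substring(0, token.length() - 1);
        return token;
    }

    // Lower-case, split on non-letters, drop stop words, optionally stem.
    public static List<String> tokens(String text, boolean stemming) {
        List<String> result = new ArrayList<String>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (tok.length() == 0 || STOP_WORDS.contains(tok)) continue;
            result.add(stemming ? stem(tok) : tok);
        }
        return result;
    }
}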

Feature vector : Both term frequency (TF) and term frequency-inverse document frequency (TF-IDF) weighting are used, and the two are compared.
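
A small sketch of how the TF and TF-IDF weights are computed from the token lists is given below (again, names are only for illustration; it assumes every report being weighted is itself part of the corpus, so the document frequency is never zero).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Weights {

    // Raw term frequencies for a single report's token list.
    public static Map<String, Double> tf(List<String> tokens) {
        Map<String, Double> tf = new HashMap<String, Double>();
        for (String t : tokens) {
            Double c = tf.get(t);
            tf.put(t, c == null ? 1.0 : c + 1.0);
        }
        return tf;
    }

    // TF-IDF: term frequency scaled by log(N / document frequency).
    // Assumes the report is itself in the corpus, so df >= 1 for every term.
    public static Map<String, Double> tfIdf(List<String> tokens, List<List<String>> corpus) {
        Map<String, Double> weights = tf(tokens);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            int df = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(e.getKey())) df++;
            }
            e.setValue(e.getValue() * Math.log((double) corpus.size() / df));
        }
        return weights;
    }
}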

The OpenMRS dataset has 83 duplicate reports out of 2315. Each duplicate report was paired manually with the report it duplicates, using the OpenMRS TRAC page. 7 of the 83 reports do not mention which report they duplicate, i.e., they are merely marked as duplicate, and 2 of the 83 duplicate more than one report. This leaves a dataset of 78 duplicate report pairs. The evaluation metric is chosen to be similar to [1]: for each report the K most similar reports are predicted, with K = 5, 10 and 15. If the known duplicate is among the predictions, it is counted as a hit; otherwise it is a non-hit. The ratio of hits is given in the table below.
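
To make the metric concrete, here is a small sketch of the hit-ratio evaluation (names are only for illustration; topK is assumed to hold, for each duplicate report id, the K most similar report ids predicted by the similarity ranking).

import java.util.List;
import java.util.Map;

public class HitRatio {

    // pairs : each element is {duplicate report id, id of the report it duplicates}.
    // topK  : for each duplicate report id, the K most similar report ids predicted.
    public static double hitRatio(List<int[]> pairs, Map<Integer, List<Integer>> topK) {
        int hits = 0;
        for (int[] pair : pairs) {
            List<Integer> predicted = topK.get(pair[0]);
            if (predicted != null && predicted.contains(pair[1])) hits++;
        }
        return (double) hits / pairs.size();
    }
}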



S.no   TF/TF-IDF   Stemming   SDC/SD   K    Ratio of hits
1      TF          No         SDC      5    0.346
2      TF          No         SDC      10   0.372
3      TF          No         SDC      15   0.423
4      TF-IDF      No         SDC      5    0.474
5      TF-IDF      No         SDC      10   0.55
6      TF-IDF      No         SDC      15   0.55
7      TF          Yes        SDC      5    0.5
8      TF          Yes        SDC      10   0.576
9      TF          Yes        SDC      15   0.5897
10     TF-IDF      Yes        SDC      5    0.55
11     TF-IDF      Yes        SDC      10   0.60
12     TF-IDF      Yes        SDC      15   0.63
13     TF          No         SD       5    0.37
14     TF          No         SD       10   0.397
15     TF          No         SD       15   0.410
16     TF-IDF      No         SD       5    0.487
17     TF-IDF      No         SD       10   0.538
18     TF-IDF      No         SD       15   0.564
19     TF          Yes        SD       5    0.474
20     TF          Yes        SD       10   0.487
21     TF          Yes        SD       15   0.512
22     TF-IDF      Yes        SD       5    0.512
23     TF-IDF      Yes        SD       10   0.63
24     TF-IDF      Yes        SD       15   0.67


From the results, TF-IDF with stemming performs better than TF without stemming. Also, using comments does not improve accuracy much; in fact it degrades performance slightly.

[1] Runeson, P., Alexandersson, M., and Nyholm, O. 2007. Detection of Duplicate Defect Reports Using Natural Language Processing. In Proceedings of the 29th international Conference on Software Engineering (May 20 - 26, 2007). International Conference on Software Engineering. IEEE Computer Society, Washington, DC, 499-510. DOI= http://dx.doi.org/10.1109/ICSE.2007.32

Thursday, May 27, 2010

Choice of Machine Learning algorithms, configuring API

Let us look at each task individually. The feature vector construction for this phase of the analysis uses term frequency (TF) vectors and term frequency-inverse document frequency (TF-IDF) vectors, each with and without stemming. This gives four combinations of feature representation.
  • Bug classification : The dataset has 2315 reports, out of which 825 are bugs and 1490 are non-bug reports. Bugs are identified using the Type column (Type = "Defect"). This is a typical binary classification problem, where the labels are bug and non-bug (task, enhancement, etc.). The algorithms of choice include SVM, Naive Bayes, K-Nearest Neighbors and decision trees. The API that will be used is Weka; a minimal sketch appears after this list.

  • Duplicate identification : This task needs similarity to be computed between an input report and the other reports in the database; if the similarity exceeds a threshold, the report is flagged as a duplicate. The threshold must be fixed with respect to a hold-out set of 85 duplicate bug reports.

  • Expert identification : In the literature, this task is treated as a classification problem where experts are the class labels. If the organization has components and teams that work on them, expert identification becomes a two-phase task: Phase I identifies the component, and Phase II is a classification where the target labels are confined to the experts who work on the component identified in Phase I. Since the component is available for analysis (field name "components"), the focus is Phase II. Since we do not have a huge dataset to train a model for each expert (2315 reports and 184 experts), intuitively a classifier that suits this scenario is K-Nearest Neighbors. Thus the expert identification task must be evaluated with different classifiers, and the choice of K-Nearest Neighbors must be analysed.

  • The dataset, the feature vector construction code (TF, TF-IDF, with and without stemming) and the Weka API are in place. For duplicate identification, Euclidean distance and cosine similarity functions will be evaluated; sketches of both are given after this list. The following posts will contain performance results for each task.
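
For the classification tasks (bug vs. non-bug and expert identification), the following is a minimal Weka sketch. It assumes the feature vectors have already been exported to an ARFF file with the class label as the last attribute; the file name is hypothetical.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifyReports {
    public static void main(String[] args) throws Exception {
        // Load the pre-built feature vectors (hypothetical file name).
        Instances data = new DataSource("openmrs-reports.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute

        // K-Nearest Neighbors (IBk) with K = 5; SMO, NaiveBayes or J48 can be swapped in here.
        IBk knn = new IBk(5);

        // Estimate accuracy with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}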
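
And here is a small sketch of the two similarity functions to be evaluated for duplicate identification, computed over sparse term-weight maps (term to TF or TF-IDF weight); the class and method names are only for illustration.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Similarity {

    // Cosine similarity between two sparse term-weight vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) normB += w * w;
        return dot == 0.0 ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean distance between the same two vectors (a missing term counts as weight 0).
    public static double euclidean(Map<String, Double> a, Map<String, Double> b) {
        Set<String> terms = new HashSet<String>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0.0;
        for (String t : terms) {
            double wa = a.containsKey(t) ? a.get(t) : 0.0;
            double wb = b.containsKey(t) ? b.get(t) : 0.0;
            sum += (wa - wb) * (wa - wb);
        }
        return Math.sqrt(sum);
    }
}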

Tuesday, May 25, 2010

Training and test set creation

The first step in any data mining project is the creation and analysis of training and test sets, and this project is no exception. For this, the XML dump of the OpenMRS TRAC is used [shared by Burke]. Training and test sets were created for 1) expert identification, 2) duplicate bug report identification and 3) bug classification. The following posts will cover the issues and experience of each task individually.

GSoC 2010 - OpenMRS

Hi all,

This blog will hold my experiences and information regarding my GSoC 2010 (Google Summer of Code) project. The organization is OpenMRS and the project deals with adding functionality to bug tracking tools like JIRA, TRAC, etc. The target tool for this project is JIRA. The functionality involves the following:
  • Automatic bug triage

  • Duplicate bug report identification

  • Classifying a report as a bug or not

  • Finding the likelihood of a bug being fixed