Thursday, May 27, 2010

Choice of Machine Learning algorithms, configuring API

Let us look at each task individually. Feature vector construction for this phase of analysis includes term frequency (TF) vectors and term frequency - inverse document frequency (TF-IDF) vectors. Tokens will also be tried both stemmed and unstemmed. This gives four combinations for feature representation.
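As a rough sketch of these feature representations, the following Python snippet builds TF and TF-IDF vectors (the actual project uses Weka's tooling; the crude suffix stemmer here is only a stand-in for a real stemmer such as Porter's):

```python
import math
from collections import Counter

def naive_stem(token):
    # Crude suffix stripping; a placeholder for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tf_vector(tokens, stem=False):
    # Term frequency vector, optionally over stemmed tokens.
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return Counter(tokens)

def tfidf_vectors(docs, stem=False):
    # docs: list of token lists -> list of {term: tf * idf} dicts
    tfs = [tf_vector(d, stem) for d in docs]
    n = len(docs)
    df = Counter(term for tf in tfs for term in tf)
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in tf.items()} for tf in tfs]

docs = [["crash", "saving", "report"], ["crash", "on", "startup"]]
vecs = tfidf_vectors(docs)
# "crash" occurs in every report, so its IDF (and TF-IDF weight) is 0;
# "saving" is unique to the first report, so it gets a positive weight.
print(vecs[0]["crash"], vecs[0]["saving"] > 0)
```

Toggling the `stem` flag and switching between `tf_vector` and `tfidf_vectors` yields the four combinations described above.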
  • Bug classification : The dataset has 2315 reports, of which 825 are bugs and 1490 are non-bug reports. This is identified using the Type column (Type = "Defect"). This is a typical binary classification problem, where the labels are bug and non-bug (task, enhancement, etc.). The candidate algorithms are SVM, Naive Bayes, K Nearest Neighbors, and Decision Tree. The API that will be used is Weka.

  • Duplicate identification : This task requires the similarity between an input report and the other reports in the DB to be computed; if the similarity exceeds a threshold, the report is flagged as a duplicate. The threshold must be fixed with respect to a hold-out set of 85 duplicate bug reports.

  • Expert identification : In the literature, this task is treated as a classification problem where experts are the class labels. If the organization has components and a team for each, then expert identification is a two-phase task: Phase I identifies the component, and Phase II is a classification where the target labels are confined to the experts who work on the component identified in Phase I. Since the component is available for analysis (field name "components"), the focus is Phase II. Since we do not have a large dataset to train a model for each expert (2315 reports and 184 experts), intuitively a suitable classifier in this scenario is K Nearest Neighbors. Thus the expert identification task must be evaluated with different classifiers and the choice of K Nearest Neighbors must be analysed.

  • The dataset, the feature vector construction code (TF, TF-IDF, with and without stemming), and the Weka API are in place. For duplicate identification, Euclidean distance and cosine similarity functions will be evaluated. The following posts will contain performance results for each task.
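To illustrate the binary bug/non-bug set-up described above, here is a minimal multinomial Naive Bayes in Python (the real experiments use Weka's classifiers; the toy reports and vocabulary here are invented for illustration only):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Multinomial Naive Bayes training: per-class word counts and priors.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
    vocab = {t for counts in word_counts.values() for t in counts}
    return word_counts, class_counts, vocab

def predict_nb(model, tokens):
    # Pick the class maximizing log P(class) + sum log P(token | class),
    # with Laplace (add-one) smoothing for unseen tokens.
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for y in class_counts:
        lp = math.log(class_counts[y] / n)
        total = sum(word_counts[y].values())
        for t in tokens:
            lp += math.log((word_counts[y][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

docs = [["crash", "error"], ["null", "exception"],
        ["add", "feature"], ["please", "improve"]]
labels = ["bug", "bug", "non-bug", "non-bug"]
model = train_nb(docs, labels)
print(predict_nb(model, ["crash", "exception"]))  # "bug"
```

The same train/predict shape carries over to the Weka classifiers (SVM, KNN, Decision Tree) that will actually be compared.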
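For the duplicate identification task, the core operation is a similarity comparison against every stored report. A sketch of the cosine-similarity variant in Python (the 0.8 threshold below is a placeholder; the real value will be tuned on the hold-out set of 85 duplicates):

```python
import math

def cosine_similarity(a, b):
    # a, b: sparse {term: weight} vectors (e.g. TF or TF-IDF).
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_duplicates(new_vec, db_vecs, threshold=0.8):
    # Flag every stored report whose similarity exceeds the threshold.
    return [i for i, v in enumerate(db_vecs)
            if cosine_similarity(new_vec, v) >= threshold]

db = [{"crash": 1.0, "save": 2.0}, {"login": 1.0, "fail": 1.0}]
new = {"crash": 2.0, "save": 3.0}
print(find_duplicates(new, db))  # only report 0 is similar enough: [0]
```

Swapping `cosine_similarity` for a Euclidean distance (with the comparison direction reversed) gives the second variant to be evaluated.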
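The intuition behind K Nearest Neighbors for Phase II expert identification is that a new report is routed to whichever expert's past reports it most resembles; KNN needs no per-expert model, which matters with only 2315 reports spread over 184 experts. A from-scratch sketch (expert names and feature terms are invented for illustration):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between sparse {term: weight} vectors.
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in keys))

def knn_predict(train, query, k=3):
    # train: list of (vector, expert_label); majority vote among k nearest.
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [
    ({"ui": 2.0, "form": 1.0}, "alice"),
    ({"ui": 1.0, "layout": 2.0}, "alice"),
    ({"db": 2.0, "query": 1.0}, "bob"),
    ({"db": 1.0, "sql": 2.0}, "bob"),
]
print(knn_predict(train, {"ui": 1.0, "form": 2.0}, k=3))  # "alice"
```

In Phase II, `train` would be restricted to reports from the component identified in Phase I, which is what keeps the label set small.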

    Tuesday, May 25, 2010

    Training and test set creation

    The first step in any data mining project is training and test set creation and analysis, and this project is no exception. As part of this step, the XML dump of the OpenMRS TRAC is used [Burke shared this]. Training and test sets were created for 1) expert identification, 2) duplicate bug report identification, and 3) bug classification. The following posts will cover issues and experiences with each task individually.
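A minimal sketch of the kind of split used here, assuming the TRAC reports have already been parsed into a list (the 80/20 fraction and fixed seed are illustrative choices, not the project's final settings):

```python
import random

def train_test_split(reports, test_fraction=0.2, seed=42):
    # Shuffle with a fixed seed so the split is reproducible,
    # then hold out the final test_fraction of reports for testing.
    rng = random.Random(seed)
    shuffled = reports[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

reports = list(range(100))  # stand-ins for parsed TRAC reports
train, test = train_test_split(reports)
print(len(train), len(test))  # 80 20
```

Each of the three tasks gets its own split, since their labels (expert, duplicate flag, bug type) come from different fields of the dump.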

    GSoC 2010 - OpenMRS

    Hi all,

    This blog will hold my experiences and information regarding my GSoC 2010 (Google Summer of Code) project. The organization is OpenMRS, and the project deals with adding functionality to bug tracking tools like JIRA, TRAC, etc. The target tool for this project is JIRA. The functionality involves the following:
  • Automatic bug triage

  • Duplicate bug report identification

  • Classifying a report as bug or not

  • Finding the likelihood of a bug being fixed