Thursday, May 27, 2010

Choice of Machine Learning algorithms, configuring API

Let us look at each task individually. Feature vectors for this phase of the analysis are built in two ways: term frequency (TF) vectors and term frequency - inverse document frequency (TF-IDF) vectors. Tokens will also be tried both stemmed and unstemmed. This gives four combinations of feature representation.
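The TF and TF-IDF representations can be sketched as follows. This is a minimal Python illustration of the idea, not the actual feature-construction code used in this project (which targets the Weka API); tokenization is plain whitespace splitting and stemming is omitted for brevity.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Term-frequency vectors: raw token counts per document."""
    return [Counter(doc.split()) for doc in docs]

def tfidf_vectors(docs):
    """TF-IDF: term frequency scaled by inverse document frequency."""
    tfs = tf_vectors(docs)
    n = len(docs)
    df = Counter()               # number of documents each term appears in
    for tf in tfs:
        df.update(tf.keys())
    return [{t: c * math.log(n / df[t]) for t, c in tf.items()} for tf in tfs]

docs = ["crash on save", "crash on load", "add dark theme"]
vecs = tfidf_vectors(docs)
# "crash" appears in 2 of the 3 documents, so its idf is log(3/2)
```

A term that occurs in every report gets an idf of log(1) = 0, which is exactly why TF-IDF down-weights uninformative common terms relative to plain TF.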
  • Bug classification: The dataset has 2315 reports, of which 825 are bugs and 1490 are non-bug reports, identified using the Type column (Type = "Defect"). This is a typical binary classification problem, where the labels are bug and non-bug (task, enhancement, etc.). The candidate algorithms are SVM, Naive Bayes, K Nearest Neighbors, and decision trees. The API that will be used is Weka.
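To make the binary classification setup concrete, here is a toy multinomial Naive Bayes (one of the classifiers listed above) in plain Python with Laplace smoothing. It is a sketch of the technique only; the actual experiments use Weka's implementations, and the example reports and labels below are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class; these counts define the multinomial NB model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    """Pick the class maximizing log P(class) + sum log P(word|class)."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for y in class_counts:
        total = sum(word_counts[y].values())
        lp = math.log(class_counts[y] / n)
        for w in doc.split():
            # Laplace smoothing so unseen words do not zero out the class
            lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Hypothetical toy reports, not drawn from the real dataset
docs = ["crash error stack trace", "exception crash",
        "please add feature", "new enhancement request"]
labels = ["bug", "bug", "non-bug", "non-bug"]
model = train_nb(docs, labels)
print(predict_nb(model, "crash exception"))  # → "bug" under these toy counts
```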

  • Duplicate identification: This task requires computing the similarity between an incoming report and the reports already in the database; if the similarity exceeds a threshold, the report is flagged as a duplicate. The threshold will be fixed using a held-out set of 85 duplicate bug reports.
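The flagging rule above amounts to a nearest-neighbour check against a threshold. A minimal sketch with cosine similarity over sparse vectors (the 0.8 threshold below is a placeholder assumption; as stated above, the real value is fixed on the held-out set):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def is_duplicate(report, db, threshold=0.8):
    """Flag as duplicate if any stored report exceeds the threshold.
    threshold=0.8 is illustrative only, not the tuned value."""
    return any(cosine(report, other) >= threshold for other in db)

db = [{"crash": 1, "save": 1}, {"theme": 1, "dark": 1}]
print(is_duplicate({"crash": 1, "save": 1}, db))  # identical vectors → True
```

Swapping in Euclidean distance changes the decision to `distance <= threshold`, which is the other similarity function to be evaluated.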

  • Expert identification: In the literature, this task is treated as a classification problem where the experts are the class labels. If the organization assigns a team to each component, expert identification becomes a two-phase task: Phase I identifies the component, and Phase II is a classification in which the target labels are confined to the experts who work on the component identified in Phase I. Since the component is available for analysis (field name "components"), the focus here is Phase II. With only 2315 reports spread across 184 experts, there is not enough data to train a model per expert, so intuitively K Nearest Neighbor is a suitable classifier for this scenario. The task will therefore be evaluated with different classifiers, and the choice of K Nearest Neighbor analysed.
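Phase II can be sketched as KNN restricted to the reports of the identified component. The expert names, component names, and vectors below are hypothetical; the real system works on Weka and the dataset's "components" field.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two sparse term-weight vectors (dicts)."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0) - v.get(t, 0)) ** 2 for t in keys))

def knn_expert(report, history, component, k=3):
    """Phase II: among past reports of the same component, vote over the
    experts of the k nearest neighbours."""
    same = [(vec, expert) for vec, expert, comp in history if comp == component]
    nearest = sorted(same, key=lambda pair: euclidean(report, pair[0]))[:k]
    return Counter(expert for _, expert in nearest).most_common(1)[0][0]

# Hypothetical history: (feature vector, expert, component)
history = [
    ({"crash": 2, "save": 1}, "alice", "editor"),
    ({"crash": 1, "load": 1}, "alice", "editor"),
    ({"theme": 2}, "bob", "ui"),
]
print(knn_expert({"crash": 1, "save": 1}, history, "editor", k=2))  # → "alice"
```

Restricting the candidate labels to one component's experts is what makes KNN workable here: each prediction needs only a handful of comparable neighbours rather than a trained model per expert.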

  • The dataset, the feature vector construction code (TF and TF-IDF, with and without stemming), and the Weka API are in place. For duplicate identification, Euclidean distance and cosine similarity will be evaluated. The following posts will report the performance results of each task.
