Duplicate Bug Report Detection

During the hands-on session, we investigate the first step of duplicate bug report detection (DBRD). We will implement a model that can predict whether a bug report is going to be a duplicate or not (i.e., binary classification).

Learning Objective

Understand how duplicate bug report detection work
Be able to classify duplicate bug reports

Prerequisite

Installation of a package manager, conda (https://docs.conda.io/en/latest/). Then, install the below Python libraries via conda:
- pandas: https://pandas.pydata.org/
- scikit-learn: https://scikit-learn.org/stable/

If you use M1 Mac (ARM architecture) and cannot install the conda properly, please check this article: https://blog.donggyun.com/article/4

A Dataset for DBRD

The dataset covers the bug reports submitted to Mozilla Bugzilla. The crawled date range is from 1 Jan 2021 to 30 May 2021. It contains 15,076 pre-processed bug reports in JSON format. A bug report in the dataset looks as below:

{
    "bug_id":1710189,
    "title":"Lock Icon overlaps text in connection info popup in Proton",
    "status":"RESOLVED",
    "resolution":"DUPLICATE",
    "product":"Firefox",
    "component":"Site Identity",
    "version":"Firefox 90",
    "priority":"--",
    "type":"defect",
    "description":"Created attachment 9220926... "
}

Categorical Information Based DBRD

1. Clone the tutorial repo

> git clone git@github.com:handk85/dbrd-hands-on.git
> cd dbrd-hands-on

2. Classification script

Implementation (scripts/dbrd-rf.py)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# The names of properties defined in pre-processed data
CATEGORICAL_FEATURES = ["product", "component", "version", "priority", "type"]
LABEL_FIELD = ["resolution"]

# Load data
dataset = pd.read_json("../data/preprocessed-data-MOZILLA.json")
# This lambda converts resolution into either True (i.e., duplicate) or False (i.e., non-duplicate)
dataset[LABEL_FIELD] = dataset[LABEL_FIELD].apply(lambda x: x == "DUPLICATE")
# Dataset MUST BE SORTED! You cannot train a model with future data in real practice.
dataset = dataset.sort_values("bug_id")
# Sample the dataset. We will use only 50% of the randomly selected dataset.
dataset = dataset.sample(frac=0.5)

# Since classification algorithm cannot take string values, transform the string values into numeric values
le = LabelEncoder()
features = dataset[CATEGORICAL_FEATURES]
X = features.apply(le.fit_transform)

y = dataset[LABEL_FIELD].values.ravel()

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# standardize both train and test data set to optimize the model
scaler = StandardScaler()
# fit only on training data
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Print the results
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy: %s" % accuracy_score(y_test, y_pred))
print("AUC-ROC: %s" % roc_auc_score(y_test, y_pred))

Output

[[1185   87]
 [ 126  110]]
              precision    recall  f1-score   support

       False       0.90      0.93      0.92      1272
        True       0.56      0.47      0.51       236

    accuracy                           0.86      1508
   macro avg       0.73      0.70      0.71      1508
weighted avg       0.85      0.86      0.85      1508

Accuracy: 0.8587533156498673
AUC-ROC: 0.69885273425008

Tasks

Task 1

Please use 100% dataset for the classification and check the performance difference.

Task 2

Please use 10% of dataset as test dataset. Also, please use 30% of dataset as test dataset.

Task 3

Please replace the random forest classifier with decision tree classifier. You can import decision tree classifier as below:

from sklearn.tree import DecisionTreeClassifier

You can also try other classifiers in scikit-learn (https://scikit-learn.org/stable/modules/classes.html).

Natural Language Based DBRD

1. Classification script

Implementation (scripts/dbrd-mlp.py)

import pandas as pd
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# The names of properties defined in pre-processed data
CATEGORICAL_FEATURES = ["product", "component", "version", "priority", "type"]
NL_FEATURES = ["title", "description"]
LABEL_FIELD = ["resolution"]

# Load data
dataset = pd.read_json("../data/preprocessed-data-MOZILLA.json")
# This lambda converts resolution into either True (i.e., duplicate) or False (i.e., non-duplicate)
dataset[LABEL_FIELD] = dataset[LABEL_FIELD].apply(lambda x: x == "DUPLICATE")
# Dataset MUST BE SORTED! You cannot train a model with future data in real practice.
dataset = dataset.sort_values("bug_id")

# Select 250 bug reports for each label
duplicates = dataset[dataset.resolution].sample(n=250)
non_duplicates = dataset[~dataset.resolution].sample(n=250)
dataset = pd.concat([duplicates, non_duplicates])

# Vectorize the categorical features
le = LabelEncoder()
categorical_features = dataset[CATEGORICAL_FEATURES].apply(le.fit_transform)

# You can adopt text preprocessing techniques here
texts = dataset["title"] + "\n\n" + dataset["description"]

# Prepare features and labels
X = pd.concat([categorical_features, texts.rename("texts")], axis=1)
y = dataset[LABEL_FIELD].values.ravel()

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Use bag of words to convert texts into vectors
cv = CountVectorizer()
X_train = sp.sparse.hstack((cv.fit_transform(X_train.texts), X_train[CATEGORICAL_FEATURES]), format='csr')
X_test = sp.sparse.hstack((cv.transform(X_test.texts), X_test[CATEGORICAL_FEATURES]), format='csr')

# standardize both train and test data set to optimize the model
scaler = StandardScaler(with_mean=False)
# fit only on training data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

mlp = MLPClassifier()
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

# Print the results
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy: %s" % accuracy_score(y_test, y_pred))
print("AUC-ROC: %s" % roc_auc_score(y_test, y_pred))

Output

[[238  19]
 [ 33  12]]
              precision    recall  f1-score   support

       False       0.88      0.93      0.90       257
        True       0.39      0.27      0.32        45

    accuracy                           0.83       302
   macro avg       0.63      0.60      0.61       302
weighted avg       0.81      0.83      0.81       302

Accuracy: 0.8278145695364238
AUC-ROC: 0.5963683527885862

Tasks

Task 1

The above implementation does not leverage natural language pre-processing techniques (e.g., stopwords removal). Please install nltk via conda.

conda install nltk

You can modify scripts/dbrd-mlp.py to remove the common stopwords in the dataset as below:

... 
from nltk.corpus import stopwords

# The below two lines only need to be called once on your machine.
# Once you download the stopwords, you don't need to download the stopwords again.
import nltk
nltk.download('stopwords')
...

# Locate the below line in a proper position in the source code
texts=texts.apply(lambda x: " ".join([word for word in x.split() if word not in (stopwords.words('english'))]))

Task 2

Please adopt stemming in scripts/dbrd-mlp.py by using Porter stemmer. The expected change is as below:

...
from nltk.stem.porter import PorterStemmer
...

# Locate the below lines in a proper position in the source code
stemmer = PorterStemmer()
texts=texts.apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))

Task 3

We only used small amount of the entire dataset due to the time limits. Please try the entire dataset to train the model. Please note that it takes long time to train the model.

Task 4

In this hands-on, we only discuss about using features from a single bug report, no pairwise comparison. Thus, our model can only predict whether a bug report will have a label of duplicate or not. Please come up with a solution to recommend ranked list of potential duplicates.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
data		data
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Duplicate Bug Report Detection

Learning Objective

Prerequisite

A Dataset for DBRD

Categorical Information Based DBRD

1. Clone the tutorial repo

2. Classification script

Tasks

Task 1

Task 2

Task 3

Natural Language Based DBRD

1. Classification script

Tasks

Task 1

Task 2

Task 3

Task 4

About

Releases

Packages

Languages

handk85/dbrd-hands-on

Folders and files

Latest commit

History

Repository files navigation

Duplicate Bug Report Detection

Learning Objective

Prerequisite

A Dataset for DBRD

Categorical Information Based DBRD

1. Clone the tutorial repo

2. Classification script

Tasks

Task 1

Task 2

Task 3

Natural Language Based DBRD

1. Classification script

Tasks

Task 1

Task 2

Task 3

Task 4

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages