Bacteria and other micro-organisms have been very important to the field of biology. In this project, E. coli DNA nucleotide sequences are classified according to their promoter class (promoter or non-promoter).
We explore bioinformatics using Markov models, K-nearest neighbors (KNN), the widely used Support Vector Machines (SVM), AdaBoost, decision trees, random forests and other algorithms.
# Let's start by importing all the required modules and packages and checking their versions
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(np.__version__))
print('Sklearn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pd.__version__))
# Next, let's load our data from the UCI Machine Learning Repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'
# Explicitly define the column names of our data
col_names = ['Class', 'id', 'Sequence']
data = pd.read_csv(url, names=col_names)
data.head()
# The raw data has tab characters between id and Sequence, so each Sequence string starts with '\t'
# Remove those extra characters from each sequence string
classes = data.loc[:, 'Class']
sequences = list(data.loc[:, 'Sequence'])
dataset = {}
for i, seq in enumerate(sequences):
    # split the sequence into single nucleotides and drop the tab characters
    nucleotides = [x for x in seq if x != '\t']
    # append the class label as the last element
    nucleotides.append(classes[i])
    dataset[i] = nucleotides
print(dataset[0])
# Each entry holds the sequence of DNA bases (a: adenine, t: thymine, g: guanine, c: cytosine)
# The last element is the class of the sequence (promoter '+' or non-promoter '-')
# Next, let's convert the dict above into a pandas DataFrame
df = pd.DataFrame(dataset)
print(df.head())
# The DataFrame above has one column per sample, which is not what we want, so transpose it
df = df.transpose()
print(df.head())
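The transpose is needed because `pd.DataFrame` treats each dict key as a column. As a minimal sketch, assuming a made-up two-sample dict (`toy_dataset` is hypothetical and not part of the promoter data), `pd.DataFrame.from_dict` with `orient='index'` should produce the row-per-sample layout in a single step:

```python
import pandas as pd

# Hypothetical toy data: two samples, three bases each, class label last
toy_dataset = {0: ['t', 'a', 'c', '+'], 1: ['g', 'c', 'a', '-']}

# Default constructor: every dict key becomes a column
by_column = pd.DataFrame(toy_dataset)

# orient='index': every dict key becomes a row instead
by_row = pd.DataFrame.from_dict(toy_dataset, orient='index')

print(by_column.shape, by_row.shape)
# Compare the cell values of the one-step result with the transposed frame
print(by_row.values.tolist() == by_column.transpose().values.tolist())
```

Either route gives the same table; the transpose used here keeps each step visible.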
# Changing the column name 57 to Class for better readability
df.rename(columns={57: 'Class'},inplace=True)
print(df.head())
# This looks better: columns 0 to 56 hold the bases of the DNA sequence
# (adenine, thymine, guanine, cytosine) and the last column holds the promoter class
# Our final aim is to predict the promoter class of a DNA sequence
test = df.iloc[:,-1]
print(test.head())
# Exploring the data
df.describe()
# describe() tells us little when the data has object (text) dtype, so count how often each value occurs in every column
val_count = []
for name in df.columns:
    val_count.append(df[name].value_counts())
info = pd.DataFrame(val_count)
info = info.transpose()
print(info)
# Our dataset has equal counts of both classes, promoter (+) and non-promoter (-)
# Even so, we can't apply ML models directly to data stored as strings
# So we need to convert the object dtype into a numerical dtype
# Let's use the pandas get_dummies function for that
numerical_df = pd.get_dummies(df)
print(numerical_df.head())
# Great! But the class is now split into 2 columns even though it is binary, so we keep only one of them
df = numerical_df.drop(columns=['Class_-'])
df.rename(columns = {'Class_+': 'Class'}, inplace=True)
print(df.head())
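To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does, assuming a made-up three-row frame (`toy` is hypothetical, not taken from the dataset). Each column/value pair becomes its own indicator column, and the two class columns carry the same information, so one of them can be dropped:

```python
import pandas as pd

# Hypothetical toy frame: two base positions plus the class label
toy = pd.DataFrame({0: ['a', 't', 'g'],
                    1: ['c', 'c', 'a'],
                    'Class': ['+', '-', '+']})

# One indicator column per (column, value) pair, named '<column>_<value>'
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# 'Class_+' and 'Class_-' are complementary, so keep only one of them
encoded = encoded.drop(columns=['Class_-']).rename(columns={'Class_+': 'Class'})
print(encoded)
```

Depending on the pandas version, the indicator columns are stored as booleans or as 0/1 integers; scikit-learn accepts both.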
# Using Train test split from sklearn.model_selection
from sklearn import model_selection
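# Create X as features and y as label (use the columns= keyword; the positional axis argument is not accepted by recent pandas versions)
X = np.array(df.drop(columns=['Class']))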
y = np.array(df['Class'])
# defining seed for reproducibility
seed = 1
# splitting data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)
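Because the two classes are balanced, it may be worth keeping that balance in both splits. The split above does not do this; the following is only a sketch of an optional variation, using the `stratify` parameter of `train_test_split` on synthetic stand-in data (`X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 40 samples, 20 per class
X_demo = np.arange(80).reshape(40, 2)
y_demo = np.array([1] * 20 + [0] * 20)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo)

# Fraction of positive labels in each split
print(y_tr.mean(), y_te.mean())
```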
Now that we have preprocessed the data and built our training and testing datasets, we can start to apply different classification algorithms. It is relatively easy to test multiple models, so we will compare the performance of ten different algorithms using accuracy_score and classification_report. The latter gives the more detailed picture, since it reports precision, recall and F1 score for each class.
import warnings
warnings.filterwarnings('ignore')
# We can start building algorithms! We'll need to import each algorithm we plan on using from sklearn.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
# defining scoring method
scoring = 'accuracy'
# we have 10 models to train
names = ["Nearest Neighbors", "Random Forest", "Neural Net",
         "Decision Tree", "AdaBoost", "Gaussian Process",
         "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]
# let's define each of the classifiers
classifier = [
    KNeighborsClassifier(n_neighbors=3),
    RandomForestClassifier(n_estimators=10, max_depth=5, max_features=1),
    MLPClassifier(alpha=1),
    DecisionTreeClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    GaussianNB(),
    SVC(kernel='linear'),
    SVC(kernel='rbf'),
    SVC(kernel='sigmoid')
]
models = zip(names, classifier)
# evaluate models
results = []
names = []
for name, model in models:
    # shuffle=True is needed for random_state to have an effect; recent scikit-learn versions raise an error otherwise
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # mean and standard deviation of the cross-validation accuracy
    formatting = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(formatting)
    # fit on the full training set and score on the held-out test set
    print("Testing Scores")
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name)
    print(accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions))
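The numbers reported by accuracy_score and classification_report can be reproduced by hand from the four confusion-matrix counts. A minimal sketch, assuming made-up labels (`y_true` and `y_pred` are illustrative only and are not results from this dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels for illustration (1 = promoter, 0 = non-promoter)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Confusion-matrix counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with scikit-learn's
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```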
Accuracy - ratio of correctly predicted observations to the total number of observations.
Precision - ratio of correctly predicted positive observations to all predicted positive observations; it drops as false positives increase.
Recall (Sensitivity) - ratio of correctly predicted positive observations to all observations that are actually positive; it drops as false negatives increase.
F1 score - the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false