DNA Sequences Classification

Bacteria and other micro-organisms have been very import for the field of biology. In this project E. Coli Bacteria DNA nucleotide sequences have been classified based on its Promoter class.

We shall explore the world of Bioinformatics by using Markov models, K-nearest neighbor (KNN) algorithms, Support Vector Machines (widely used), adaboost algorithm, Decision tree, Random forest classifier and such more algorithms.

  • Finally we shall test our data based on the training data model and compare the results of all the algorithms through classification report.
In [1]:
# Lets start with importing all the required modules and packages and ensure their versions

import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(np.__version__))
print('Sklearn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pd.__version__))
Python: 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 21:41:56) 
[GCC 7.3.0]
Numpy: 1.15.2
Sklearn: 0.20.0
Pandas: 0.23.4

Step:1 Importing the dataset

In [2]:
# Moving further lets import our data from UCI machine learning repo

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'

# Explicitly defining the features(columns) of our data
col_names = ['Class','id','Sequence']

data  =pd.read_csv(url,names=col_names)
In [3]:
data.head()
Out[3]:
Class id Sequence
0 + S10 \t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
1 + AMPC \t\ttgctatcctgacagttgtcacgctgattggtgtcgttacaat...
2 + AROH \t\tgtactagagaactagtgcattagcttatttttttgttatcat...
3 + DEOP2 \taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaa...
4 + LEU1_TRNA \ttcgataattaactattgacgaaaagctgaaaaccactagaatgc...

Step 2: Data Preprocessing

In [4]:
# We now see here that our data has tab spaces between id and sequence, thus we see '\t' in front of Sequence string

# removing those extra charaters from sequence string
classes = data.loc[:, 'Class']
sequences = list(data.loc[:,'Sequence'])

dataset = {}

for i, seq in enumerate(sequences):
    
    nucleotides = list(seq)
    nucleotides = [x for x in nucleotides if x != '\t']
    
    nucleotides.append(classes[i])
    
    dataset[i] = nucleotides
    
print(dataset[0])
# Here we get all the sequence of DNA base pairs (like a:adenine, t:thymine, g:guanine, c:cytosine)
# Also the last term is the class our nucleotide(promotor class either +/-)
['t', 'a', 'c', 't', 'a', 'g', 'c', 'a', 'a', 't', 'a', 'c', 'g', 'c', 't', 't', 'g', 'c', 'g', 't', 't', 'c', 'g', 'g', 't', 'g', 'g', 't', 't', 'a', 'a', 'g', 't', 'a', 't', 'g', 't', 'a', 't', 'a', 'a', 't', 'g', 'c', 'g', 'c', 'g', 'g', 'g', 'c', 't', 't', 'g', 't', 'c', 'g', 't', '+']
In [5]:
# now moving on lets convert the above dict into pandas dataframe

df = pd.DataFrame(dataset)
print(df.head())
  0   1   2   3   4   5   6   7   8   9   ... 96  97  98  99  100 101 102 103  \
0   t   t   g   a   t   a   c   t   c   t ...   c   c   t   a   g   c   g   c   
1   a   g   t   a   c   g   a   t   g   t ...   c   g   a   g   a   c   t   g   
2   c   c   a   t   g   g   g   t   a   t ...   g   c   t   a   g   t   a   c   
3   t   t   c   t   a   g   g   c   c   t ...   a   t   g   g   a   c   t   g   
4   a   a   t   g   t   g   g   t   t   a ...   g   a   a   g   g   a   t   a   

  104 105  
0   c   t  
1   t   a  
2   c   a  
3   g   c  
4   t   a  

[5 rows x 106 columns]
In [6]:
# Above dataframe doesn't look what we wanted so try and transpose it

df = df.transpose()

print(df.head())
  0  1  2  3  4  5  6  7  8  9  ... 48 49 50 51 52 53 54 55 56 57
0  t  a  c  t  a  g  c  a  a  t ...  g  c  t  t  g  t  c  g  t  +
1  t  g  c  t  a  t  c  c  t  g ...  c  a  t  c  g  c  c  a  a  +
2  g  t  a  c  t  a  g  a  g  a ...  c  a  c  c  c  g  g  c  g  +
3  a  a  t  t  g  t  g  a  t  g ...  a  a  c  a  a  a  c  t  c  +
4  t  c  g  a  t  a  a  t  t  a ...  c  c  g  t  g  g  t  a  g  +

[5 rows x 58 columns]
In [7]:
# Changing the column name 57 to Class for better readability
df.rename(columns={57: 'Class'},inplace=True) 
print(df.head())
   0  1  2  3  4  5  6  7  8  9  ...  48 49 50 51 52 53 54 55 56 Class
0  t  a  c  t  a  g  c  a  a  t  ...   g  c  t  t  g  t  c  g  t     +
1  t  g  c  t  a  t  c  c  t  g  ...   c  a  t  c  g  c  c  a  a     +
2  g  t  a  c  t  a  g  a  g  a  ...   c  a  c  c  c  g  g  c  g     +
3  a  a  t  t  g  t  g  a  t  g  ...   a  a  c  a  a  a  c  t  c     +
4  t  c  g  a  t  a  a  t  t  a  ...   c  c  g  t  g  g  t  a  g     +

[5 rows x 58 columns]
In [8]:
# Now it looks more better with each column till 56 representing 
# base pairs of DNA (adenine,thymine, guanine, cytosine) and last column is of promotor class

# What our final aim was also to predict the promotor class of the DNA sequence
test = df.iloc[:,-1]
print(test.head())
0    +
1    +
2    +
3    +
4    +
Name: Class, dtype: object

Let's start to familiarize ourselves with the dataset so we can pick the most suitable algorithms for this data

In [9]:
# Exploring the data
df.describe()
Out[9]:
0 1 2 3 4 5 6 7 8 9 ... 48 49 50 51 52 53 54 55 56 Class
count 106 106 106 106 106 106 106 106 106 106 ... 106 106 106 106 106 106 106 106 106 106
unique 4 4 4 4 4 4 4 4 4 4 ... 4 4 4 4 4 4 4 4 4 2
top t a a c a a a a a a ... c c c t t c c c t -
freq 38 34 30 30 36 42 38 34 33 36 ... 36 42 31 33 35 32 29 29 34 53

4 rows × 58 columns

In [10]:
# Describe doesn't tell much when our data is of object(text) datatype, so we should count the number of each seq.

val_count = []

for name in df.columns:
    val_count.append(df[name].value_counts())

info = pd.DataFrame(val_count)
info = info.transpose()
print(info)
      0     1     2     3     4     5     6     7     8     9  ...      48  \
t  38.0  26.0  27.0  26.0  22.0  24.0  30.0  32.0  32.0  28.0  ...    21.0   
c  27.0  22.0  21.0  30.0  19.0  18.0  21.0  20.0  22.0  22.0  ...    36.0   
a  26.0  34.0  30.0  22.0  36.0  42.0  38.0  34.0  33.0  36.0  ...    23.0   
g  15.0  24.0  28.0  28.0  29.0  22.0  17.0  20.0  19.0  20.0  ...    26.0   
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...     NaN   
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...     NaN   

     49    50    51    52    53    54    55    56  Class  
t  22.0  23.0  33.0  35.0  30.0  23.0  29.0  34.0    NaN  
c  42.0  31.0  32.0  21.0  32.0  29.0  29.0  17.0    NaN  
a  24.0  28.0  27.0  25.0  22.0  26.0  24.0  27.0    NaN  
g  18.0  24.0  14.0  25.0  22.0  28.0  24.0  28.0    NaN  
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  

[6 rows x 58 columns]
In [11]:
# Our dataset has equal counts of both the classes promotor(+) as well as non-promotor(-)

# But knowing all this then too we can't apply ML models directly on data in 'String' formats
# So we need to convert object datatype into that of numerical data type 

# Let's use pandas get_dummies function for that
numerical_df = pd.get_dummies(df)
print(numerical_df.head())
   0_a  0_c  0_g  0_t  1_a  1_c  1_g  1_t  2_a  2_c   ...     55_a  55_c  \
0    0    0    0    1    1    0    0    0    0    1   ...        0     0   
1    0    0    0    1    0    0    1    0    0    1   ...        1     0   
2    0    0    1    0    0    0    0    1    1    0   ...        0     1   
3    1    0    0    0    1    0    0    0    0    0   ...        0     0   
4    0    0    0    1    0    1    0    0    0    0   ...        1     0   

   55_g  55_t  56_a  56_c  56_g  56_t  Class_+  Class_-  
0     1     0     0     0     0     1        1        0  
1     0     0     1     0     0     0        1        0  
2     0     0     0     0     1     0        1        0  
3     0     1     0     1     0     0        1        0  
4     0     0     0     0     1     0        1        0  

[5 rows x 230 columns]
In [12]:
# Great! but we see that our class is also divided into 2 columns though it is only has binary categories

df = numerical_df.drop(columns=['Class_-'])

df.rename(columns = {'Class_+': 'Class'}, inplace=True)

print(df.head())
   0_a  0_c  0_g  0_t  1_a  1_c  1_g  1_t  2_a  2_c  ...    54_t  55_a  55_c  \
0    0    0    0    1    1    0    0    0    0    1  ...       0     0     0   
1    0    0    0    1    0    0    1    0    0    1  ...       0     1     0   
2    0    0    1    0    0    0    0    1    1    0  ...       0     0     1   
3    1    0    0    0    1    0    0    0    0    0  ...       0     0     0   
4    0    0    0    1    0    1    0    0    0    0  ...       1     1     0   

   55_g  55_t  56_a  56_c  56_g  56_t  Class  
0     1     0     0     0     0     1      1  
1     0     0     1     0     0     0      1  
2     0     0     0     0     1     0      1  
3     0     1     0     1     0     0      1  
4     0     0     0     0     1     0      1  

[5 rows x 229 columns]

Step 3: Spliting our data into training and testing set

In [13]:
# Using Train test split from sklearn.model_selection
from sklearn import model_selection

# Create X as features and y as label
X = np.array(df.drop(['Class'], 1))
y = np.array(df['Class'])

# defining seed for reproducibility
seed = 1

# spliting data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)

Step 4: Applying Machine learning algorithms to our training sets

Now that we have preprocessed the data and built our training and testing datasets, we can start to deploy different classification algorithms. It's relatively easy to test multiple models; as a result, we will compare and contrast the performance of ten different algorithms on some performance metrics such as accuracy_score and classification_report (best way).

In [25]:
import warnings
warnings.filterwarnings('ignore')
# We can start building algorithms! We'll need to import each algorithm we plan on using from sklearn.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# defining scoring method
scoring = 'accuracy'

# we have 10 models to train
names = ["Nearest Neighbors", "Random Forest","Neural Net",
         "Decision Tree","AdaBoost","Gaussian Process",
         "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]

# lets define each of the classifier
classifier = [
    KNeighborsClassifier(n_neighbors=3),
    RandomForestClassifier(n_estimators=10,max_depth=5,max_features=1),
    MLPClassifier(alpha=1),
    DecisionTreeClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianProcessClassifier(1.0*RBF(1.0)),
    GaussianNB(),
    SVC(kernel = 'linear'),
    SVC(kernel = 'rbf'),
    SVC(kernel='sigmoid')
]

models = zip(names,classifier)

# evaluate models

results = []
names = []

for name,model in models:
    kfold = model_selection.KFold(n_splits=10,random_state=seed)
    cv_results = model_selection.cross_val_score(model,X_train,y_train,cv=kfold,scoring=scoring)
    results.append(cv_results)
    names.append(name)
    formating = "%s: %f (%f)" %(name, cv_results.mean(),cv_results.std())
    print(formating)
    print("Testing Scores")
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name)
    
    print(accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions))
Nearest Neighbors: 0.823214 (0.113908)
Testing Scores
Nearest Neighbors
0.7777777777777778
              precision    recall  f1-score   support

           0       1.00      0.65      0.79        17
           1       0.62      1.00      0.77        10

   micro avg       0.78      0.78      0.78        27
   macro avg       0.81      0.82      0.78        27
weighted avg       0.86      0.78      0.78        27

Random Forest: 0.633929 (0.150308)
Testing Scores
Random Forest
0.5925925925925926
              precision    recall  f1-score   support

           0       0.88      0.41      0.56        17
           1       0.47      0.90      0.62        10

   micro avg       0.59      0.59      0.59        27
   macro avg       0.67      0.66      0.59        27
weighted avg       0.73      0.59      0.58        27

Neural Net: 0.887500 (0.087500)
Testing Scores
Neural Net
0.9259259259259259
              precision    recall  f1-score   support

           0       1.00      0.88      0.94        17
           1       0.83      1.00      0.91        10

   micro avg       0.93      0.93      0.93        27
   macro avg       0.92      0.94      0.92        27
weighted avg       0.94      0.93      0.93        27

Decision Tree: 0.721429 (0.174818)
Testing Scores
Decision Tree
0.8518518518518519
              precision    recall  f1-score   support

           0       1.00      0.76      0.87        17
           1       0.71      1.00      0.83        10

   micro avg       0.85      0.85      0.85        27
   macro avg       0.86      0.88      0.85        27
weighted avg       0.89      0.85      0.85        27

AdaBoost: 0.925000 (0.114564)
Testing Scores
AdaBoost
0.8518518518518519
              precision    recall  f1-score   support

           0       1.00      0.76      0.87        17
           1       0.71      1.00      0.83        10

   micro avg       0.85      0.85      0.85        27
   macro avg       0.86      0.88      0.85        27
weighted avg       0.89      0.85      0.85        27

Gaussian Process: 0.873214 (0.056158)
Testing Scores
Gaussian Process
0.8888888888888888
              precision    recall  f1-score   support

           0       1.00      0.82      0.90        17
           1       0.77      1.00      0.87        10

   micro avg       0.89      0.89      0.89        27
   macro avg       0.88      0.91      0.89        27
weighted avg       0.91      0.89      0.89        27

Naive Bayes: 0.837500 (0.137500)
Testing Scores
Naive Bayes
0.9259259259259259
              precision    recall  f1-score   support

           0       1.00      0.88      0.94        17
           1       0.83      1.00      0.91        10

   micro avg       0.93      0.93      0.93        27
   macro avg       0.92      0.94      0.92        27
weighted avg       0.94      0.93      0.93        27

SVM Linear: 0.850000 (0.108972)
Testing Scores
SVM Linear
0.9629629629629629
              precision    recall  f1-score   support

           0       1.00      0.94      0.97        17
           1       0.91      1.00      0.95        10

   micro avg       0.96      0.96      0.96        27
   macro avg       0.95      0.97      0.96        27
weighted avg       0.97      0.96      0.96        27

SVM RBF: 0.737500 (0.117925)
Testing Scores
SVM RBF
0.7777777777777778
              precision    recall  f1-score   support

           0       1.00      0.65      0.79        17
           1       0.62      1.00      0.77        10

   micro avg       0.78      0.78      0.78        27
   macro avg       0.81      0.82      0.78        27
weighted avg       0.86      0.78      0.78        27

SVM Sigmoid: 0.569643 (0.159209)
Testing Scores
SVM Sigmoid
0.4444444444444444
              precision    recall  f1-score   support

           0       1.00      0.12      0.21        17
           1       0.40      1.00      0.57        10

   micro avg       0.44      0.44      0.44        27
   macro avg       0.70      0.56      0.39        27
weighted avg       0.78      0.44      0.34        27

Remember, performance on the training data is not that important. We want to know how well our algorithms can generalize to new data.

Finally, SVM linear performs the best and also it is not that surprising because, the reason SVM is used in the field of Bioinformatics widely.

Definations of attributes in classification report

Accuracy - ratio of correctly predicted observation to the total observations.

Precision - (false positives) ratio of correctly predicted positive observations to the total predicted positive observations

Recall (Sensitivity) - (false negatives) ratio of correctly predicted positive observations to the all observations in actual class - yes.

F1 score - F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false