Ensemble Techniques
Overview
Help the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.
Problem Statement
Using the data collected from existing customers, build a model that will help the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.
Solution
General and domain knowledge assumptions
This problem statement relates to the banking and financial sector. For the time being, let us set the data set aside (though we know its source and content) and assume a few things.
What can the ages of the people in the data set be? We do not know which region of the world this data belongs to, but since this is the banking and financial domain we would mostly have verified and authenticated customers with a minimum age of 18 or 21 years, and the upper limit would be around ~100.
From past experience of working with banking data sets, we know that experience, salary, loans and credit-card expenditure are some of the inputs we can expect to encounter in a new data set, and they can weigh heavily on the output variable we need to predict.
We also need to consider the profession of the individual whose data we are taking as input. A person with a high income usually invests in more than one financial instrument, but still has a good chance of being among the people applying for a deposit.
People in the low and mid income range are very particular about investment and tend to trust banks more than other avenues; still, as we do encounter outliers in our data set, some people in this group will invest in places other than banks. They are usually risk takers.
Our final outcome is a prediction of whether an individual would be interested in a term deposit or not, so why are we talking so much about investment? There is an inverse relation between other investments and term deposits. It may look like a contradiction, since a deposit is also an investment, but if an individual is investing more in other investment plans, then naturally their investment in term deposits will be fairly low.
Existing Algorithms and approaches
Since this is a binary prediction problem with a modest number of inputs, we already have a few approaches in mind, such as a Naive Bayes classifier, kNN and logistic regression, all of which seem a good fit. With 17+ dimensions to consider we can also use a random forest with random subsets of features. A minimal baseline comparison of these candidates is sketched below.
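As a quick sanity check of those candidates, a hedged sketch like the following could compare the baselines with cross-validation. It uses a synthetic stand-in dataset (`make_classification`) only so the snippet runs on its own; in this notebook it would instead be run on the encoded features and target prepared later.

# Hedged baseline comparison of the candidate classifiers mentioned above.
# The synthetic data below is a stand-in; substitute the encoded bank data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=16, random_state=2)

baselines = {
    "Naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=2),
}

for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=5)
    print("{:<20} mean accuracy {:.3f}".format(name, scores.mean()))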
General Imports
# Import necessary libraries

# NumPy: for mathematical functions, array and matrix operations
import numpy as np

# Pandas and Seaborn: data frames, plotting graphs and other visual tools
import pandas as pd
import seaborn as sns

# sns.set_palette("muted")
# sns.set(color_codes=True)
# sns.color_palette("colorblind", 10)
# color_palette = sns.color_palette()

# Matplotlib, and inline plotting of graphs
import matplotlib.pyplot as plt
%matplotlib inline

# palette = sns.color_palette("muted")
# sns.set_palette(palette)
# sns.palplot(palette)

flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]

sns.set_palette(flatui)

sns.palplot(sns.color_palette())
# Custom notebook printer
# Used later to print unique values from object columns (and other messages) as highlighted Markdown
from IPython.display import Markdown, display

def printTextAsMarkdown(title, content, color=None):
    if title is None:
        colorStr = "<span style='color:{}'>{}</span>".format(color, content)
    else:
        colorStr = "**<span style='color:{}'>{}</span>** : {}".format(color, title, content)

    display(Markdown(colorStr))
# Load data set
# Import CSV data using a pandas data frame
df_original = pd.read_csv('bank-full.csv')

# Print total columns
print("Total Columns in dataframe: ", len(df_original.columns))

# Prepare column names
df_original_columns = list(df_original.columns)

print("Columns list {}".format(df_original_columns))
print("***********************************************************************************************************************")

# Prepare a mapping of column index to column name for quick access
df_original_columns_map = dict(enumerate(df_original_columns))

print("Columns Map {}".format(df_original_columns_map))

# We have separated out the columns and their mapping from the data; at any point during data analysis or cleaning we
# can directly refer to or fetch data by either index or column identifier
Total Columns in dataframe: 17
Columns list ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'Target']
***********************************************************************************************************************
Columns Map {0: 'age', 1: 'job', 2: 'marital', 3: 'education', 4: 'default', 5: 'balance', 6: 'housing', 7: 'loan', 8: 'contact', 9: 'day', 10: 'month', 11: 'duration', 12: 'campaign', 13: 'pdays', 14: 'previous', 15: 'poutcome', 16: 'Target'}
# Data frame general analysis
df_original.head(16)
# Dataframe information
# Let's analyse the data based on the following checks:
# 1. Check whether all rows x columns are loaded as given in the question; all data must match before we even start to operate on it.
# 2. Print the shape of the data.
# 3. Check the data type of each field.
# 4. Find the presence of null or missing values.
# 5. Visually inspect the data and check for the presence of outliers, and see whether
#    they are small enough to drop or need to be considered during model building.
# 6. Decide whether we need to consider all data columns given in the data set for model building.
# 7. Find correlation, median, mean, standard deviation, min and max for the columns.

df_original.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 45211 non-null int64
1 job 45211 non-null object
2 marital 45211 non-null object
3 education 45211 non-null object
4 default 45211 non-null object
5 balance 45211 non-null int64
6 housing 45211 non-null object
7 loan 45211 non-null object
8 contact 45211 non-null object
9 day 45211 non-null int64
10 month 45211 non-null object
11 duration 45211 non-null int64
12 campaign 45211 non-null int64
13 pdays 45211 non-null int64
14 previous 45211 non-null int64
15 poutcome 45211 non-null object
16 Target 45211 non-null object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
We cannot use this raw data set as it is, because it contains fields of type object. This data is usually in the form of strings, and we should be able to extract categories out of this object type.
# Data types

df_original.dtypes

# Also part of info(), indicating which columns are int and which are object type, though redundant.
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
Target object
dtype: object
# Check the presence of any null values

df_original.isnull().values.any()

# This returns `False`, meaning there are no null values present
False
# Check the presence of missing values

df_original.isna().values.any()

# This returns `False`, meaning there are no missing values present
False
# Shape of the data

df_original.shape

# We have 45211 rows and 17 columns
(45211, 17)
# Check data loading and analyse the data description

df_original.describe()
# Print the different data types present in the dataframe and their counts

df_original.dtypes.value_counts()
object 10
int64 7
dtype: int64
As we see, only 7 columns appear in the description and the other 10 are missing from it. These seem to be categorical columns, and hence we need to convert them into numerical columns.
Before we move on to converting these values into categorical variables, let's examine what the values are. This can be done by checking the unique values and their counts for those columns.
# Print unique values for each numeric column

for name in df_original_columns:
    if df_original[name].dtype == np.int64:
        # Sorting for better readability
        sortedCategories = sorted(df_original[name].unique().tolist())

        formattedText = "has unique data in this range {}".format(sortedCategories)

        printTextAsMarkdown(name, formattedText, color="red")
        print("\n**************************************************************************************")
# We are most interested in the unique values of the object columns, so let's filter out only the object data type
# Printing each column and its unique values

# Container for object column names; later on, while label encoding, we only need to convert the columns of type object
objectColumns = []

for name in df_original_columns:
    if df_original[name].dtype == object:

        # Sorting for better readability
        sortedCategories = sorted(df_original[name].unique().tolist())

        formattedText = "has unique data in this range {}".format(sortedCategories)

        printTextAsMarkdown(name, formattedText, color="red")

        objectColumns.append(name)
        print("\n**************************************************************************************")
job : has unique data in this range ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']
**************************************************************************************
marital : has unique data in this range ['divorced', 'married', 'single']
**************************************************************************************
education : has unique data in this range ['primary', 'secondary', 'tertiary', 'unknown']
**************************************************************************************
default : has unique data in this range ['no', 'yes']
**************************************************************************************
housing : has unique data in this range ['no', 'yes']
**************************************************************************************
loan : has unique data in this range ['no', 'yes']
**************************************************************************************
contact : has unique data in this range ['cellular', 'telephone', 'unknown']
**************************************************************************************
month : has unique data in this range ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep']
**************************************************************************************
poutcome : has unique data in this range ['failure', 'other', 'success', 'unknown']
**************************************************************************************
Target : has unique data in this range ['no', 'yes']
**************************************************************************************
Let us examine these object columns:
- The data spread is very small, i.e. there are only a few categories per column.
- There is some presence of placeholder data, namely the value unknown, in job, education, contact and poutcome; for poutcome, unknown indicates that we do not know whether the previous campaign got a failure or a success response from this person.

A quick way to quantify these placeholders is sketched below.
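The following is a minimal sketch (not part of the original notebook) for counting how many unknown entries each object column contains, which helps decide whether to keep, impute or drop them. It assumes df_original and objectColumns as defined above.

# Count 'unknown' placeholder values per object column (illustrative sketch)
for name in objectColumns:
    unknown_count = (df_original[name] == 'unknown').sum()
    if unknown_count > 0:
        print("{:<10} {:>6} unknown values ({:.1%} of rows)".format(
            name, unknown_count, unknown_count / len(df_original)))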
# Copy of the original dataframe

df_main = df_original.copy()
LabelEncoder and Caching
from sklearn import preprocessing

# Create an empty map of label encoders for the different columns;
# we wish to cache each encoder against its column title

columnEncoders = {}

for name in objectColumns:
    le = preprocessing.LabelEncoder()
    # Fit the encoder to the pandas column
    le.fit(df_main[name])
    # Apply the transformation and assign it back to the dataframe
    df_main[name] = le.transform(df_main[name])
    # Put the name and encoder in the map
    columnEncoders[name] = le
# Now let's revisit the basic operations on the data frame

df_main.head()

# We can now see all data in numeric form
# Let's quickly print the data types

df_main.dtypes

# Should print all int
age int64
job int64
marital int64
education int64
default int64
balance int64
housing int64
loan int64
contact int64
day int64
month int64
duration int64
campaign int64
pdays int64
previous int64
poutcome int64
Target int64
dtype: object
# Let's analyse the data further

# df_main.describe() is difficult to view as-is, hence apply transpose() to see it better visually

df_main.describe().transpose()
Visual Analysis
# Let's see the response distribution for the target column
print(df_main['Target'].value_counts())

sns.countplot(x='Target', data=df_original)

# Here we have imbalanced data: whatever model we build will be dominated by the `no` class of the
# output variable, because most of the people in the data set have not opted for a term deposit.
# One possible way to account for this is sketched after the output below.
0 39922
1 5289
Name: Target, dtype: int64
<matplotlib.axes._subplots.AxesSubplot at 0x7f0633313bd0>
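The notebook itself does not rebalance the classes; as a hedged illustration only, one way to account for the imbalance noted above would be to pass class weights to the classifiers used later, for example:

# Illustrative only: weight the minority 'yes' class more heavily.
# class_weight='balanced' reweights classes inversely proportional to their frequency.
from sklearn.ensemble import RandomForestClassifier

balanced_rf = RandomForestClassifier(n_estimators=128, class_weight='balanced', random_state=2)
# balanced_rf.fit(X_train, y_train)  # X_train / y_train as prepared later in the notebook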
# Histogram
df_main.hist(figsize=(15,15))
sns.distplot(df_main['age'], kde=True)

# We can conclude that the data set has people ranging mostly from 20 to 60, and there are some outliers present as well,
# because the tail on the right spreads a little further
<matplotlib.axes._subplots.AxesSubplot at 0x7f06331df590>
sns.distplot(df_main['job'])
# Looking at the graph we see that there are multiple groups present here; we have already converted them to a categorical variable.
# Let's print the classes from the cached map `columnEncoders`
print(columnEncoders['job'].classes_)
['admin.' 'blue-collar' 'entrepreneur' 'housemaid' 'management' 'retired'
'self-employed' 'services' 'student' 'technician' 'unemployed' 'unknown']
# Relation of age and job to term deposit
sns.barplot(x='job', y='age', hue='Target', data=df_main, ci=None)
print(columnEncoders['job'].classes_[4])

# We can see that people in management have opted for the term deposit more
management
# Finding the relation between job and term deposit
pd.crosstab(df_original['job'], df_original['Target'])

# We conclude here that management, technician and blue-collar are some of the categories that tend to apply for a term deposit.
# This conclusion is based on the assumption that their earnings are on the higher side; a general human assumption.
# Let's analyse, education-wise, which category tends to apply more for a term deposit
sns.countplot(x='education', hue='Target', data=df_main)
print(columnEncoders['education'].classes_)

# Here we conclude that the order of applying for a term deposit is secondary > tertiary > primary > unknown
['primary' 'secondary' 'tertiary' 'unknown']
# Let's analyse, marital-status-wise, which category tends to apply more for a term deposit

sns.countplot(x='marital', hue='Target', data=df_main)

print(columnEncoders['marital'].classes_)

# Here we conclude that the order of applying for a term deposit is married > single > divorced
['divorced' 'married' 'single']
pd.crosstab(df_original['Target'], df_original['month'])

# We see here that:
# - May has the highest counts of both success and failure Target values
# - August has the second highest acceptance count
# But in terms of percentage acceptance, August has a higher value than May; a way to verify this is sketched below.
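As a hedged check of that percentage claim (not in the original notebook), a normalized crosstab shows, for each month, what fraction of contacted customers accepted:

# Acceptance rate per month: normalize each month's column to proportions
monthly_rate = pd.crosstab(df_original['Target'], df_original['month'], normalize='columns')
print(monthly_rate.loc['yes'].sort_values(ascending=False).round(3))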
# Deposits by month, visual
sns.countplot(x='month', hue='Target', data=df_original)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0630af7b90>
# Checking whether people who have given contact details have taken the term deposit
sns.countplot(x='contact', hue='Target', data=df_original)

# This indicates that people who have registered cellular contacts have a slightly higher rate of applying for a term deposit.
# This again points to people working in higher job profiles such as management and technicians.
<matplotlib.axes._subplots.AxesSubplot at 0x7f0630a26710>
# Converting duration, which is in seconds in the data set, to minutes rounded to 2 decimal places
decimal_points = 2
df_main['duration'] = df_main['duration'] / 60
df_main['duration'] = df_main['duration'].apply(lambda x: round(x, decimal_points))

df_main.head()
# The balance column seems to dominate all other values, so let's scale it
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df_main[['balance']] = scaler.fit_transform(df_main[['balance']])
# Looking closely at our columns, we see some columns prefixed with the letter `p`. Reading the problem
# statement, we learn that these fields are indicators of the previous analysis or campaign.
# For example, poutcome need not necessarily be part of the training data, because it is not an attribute of the input
# but a conclusion of the previous analysis/campaign.

# So we could even build our data model after removing these `p{x}` columns
df_main.head()
df_main.corr()
corr = df_main.corr()
plt.figure(figsize=(12,12))

sns.heatmap(corr, annot=True, linewidths=.5, cmap="YlGnBu")  # , cbar=False

# Based on this we could drop poutcome, pdays, previous and campaign; the sketch below lists the correlations with Target numerically.
<matplotlib.axes._subplots.AxesSubplot at 0x7f0630d41250>
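To complement the heatmap, the following hedged snippet (not part of the original notebook) ranks each feature by the strength of its correlation with the encoded Target column, which makes the weakly related `p{x}` columns easier to spot:

# Correlation of every feature with the encoded Target, strongest first
target_corr = df_main.corr()['Target'].drop('Target')
print(target_corr.abs().sort_values(ascending=False).round(3))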
Training constants and general imports
# Training constants and general imports

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn import metrics
from sklearn.metrics import classification_report

# Taking a 70:30 training and test split
test_size = 0.30

# Random number seed for repeatability of the code
seed = 2  # Spirit and Opportunity Mars exploration rovers

# Integer square root via Newton's method; used as an upper bound for the n_estimators search
def isqrt(n):
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x
Data Preparation
# Prepare input columns
df_main_x = df_main.copy()

# Columns we are dropping: the previous-campaign outcome, the call duration, and the target itself
df_main_x = df_main_x.drop(['poutcome', 'duration', 'Target'], axis=1)

# df_main_x_ary = np.asarray(df_main_x)

df_main_y = df_original['Target']

df_main_x.head()
# Separate the target column
df_main_y.head()
0 no
1 no
2 no
3 no
4 no
Name: Target, dtype: object
Training
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(np.asarray(df_main_x), np.asarray(df_main_y), test_size=test_size, random_state=seed)
# Holder class for data from the different classifiers
class EnsembleTechnique:
    def __init__(self, score, prediction, accuracy, confusion_matrix, classification_report, n_estimators):
        self.score = score
        self.prediction = prediction
        self.accuracy = accuracy
        self.confusion_matrix = confusion_matrix
        self.classification_report = classification_report
        self.n_estimators = n_estimators
rows = df_main_x.shape[0]
print("Total rows {}".format(rows))
maxLimit = isqrt(rows)
print("Limit till we find Nestimators {}".format(maxLimit))

# Result map to hold the name and score for each model
results = {}
Total rows 45211
Limit till we find Nestimators 212
Decision Tree using entropy model
# Init
decisionTreeClassifier = DecisionTreeClassifier(criterion='entropy')

# Fit data
decisionTreeClassifier.fit(X_train, y_train)

# Predict
dtc_y_pred = decisionTreeClassifier.predict(X_test)

# Model score
dtc_model_score = decisionTreeClassifier.score(X_test, y_test)

# Accuracy
dtc_model_accuracy = metrics.accuracy_score(y_test, dtc_y_pred)

print("Prediction: {}".format(dtc_y_pred))
print("Score: {}".format(dtc_model_score))
print("Accuracy {}".format(dtc_model_accuracy))
print("Confusion matrix")
print(metrics.confusion_matrix(y_test, dtc_y_pred))
print(classification_report(y_test, dtc_y_pred))

results['Decision Tree'] = dtc_model_score
Prediction: ['no' 'no' 'no' ... 'no' 'no' 'no']
Score: 0.8261574756708936
Accuracy 0.8261574756708936
Confusion matrix
[[10738 1261]
[ 1097 468]]
precision recall f1-score support
no 0.91 0.89 0.90 11999
yes 0.27 0.30 0.28 1565
accuracy 0.83 13564
macro avg 0.59 0.60 0.59 13564
weighted avg 0.83 0.83 0.83 13564
Random Forest Classifier
# Determining n_estimators: we search powers of 2, and before stopping at the best outcome
# we compare each result with the previous best
previous = EnsembleTechnique(0.0, 0.0, 0.0, None, None, 0)

counter = 1
estimator = 0
while estimator < maxLimit:

    estimator = pow(2, counter)
    counter = counter + 1

    # print("Estimating {}".format(estimator))

    # Init
    randomForestClassifier = RandomForestClassifier(n_estimators=estimator)
    # Fit data
    randomForestClassifier.fit(X_train, y_train)

    # Predict
    rfc_y_pred = randomForestClassifier.predict(X_test)

    # Model score
    rfc_model_score = randomForestClassifier.score(X_test, y_test)

    # Accuracy
    rfc_model_accuracy = metrics.accuracy_score(y_test, rfc_y_pred)
    # print("Score {} for e {}".format(rfc_model_score, estimator))
    if rfc_model_score > previous.score:
        previous = EnsembleTechnique(rfc_model_score, rfc_y_pred, rfc_model_accuracy,
                                     metrics.confusion_matrix(y_test, rfc_y_pred),
                                     classification_report(y_test, rfc_y_pred),
                                     estimator)


print("Prediction: {}".format(previous.prediction))
print("Score: {}".format(previous.score))
print("Accuracy {}".format(previous.accuracy))
print("n estimators : {}".format(previous.n_estimators))
print("Confusion matrix")
print(previous.confusion_matrix)
print(previous.classification_report)

results['Random Forest'] = previous.score
Prediction: ['no' 'no' 'no' ... 'no' 'no' 'no']
Score: 0.8882335594219994
Accuracy 0.8882335594219994
n estimators : 128
Confusion matrix
[[11775 224]
[ 1292 273]]
precision recall f1-score support
no 0.90 0.98 0.94 11999
yes 0.55 0.17 0.26 1565
accuracy 0.89 13564
macro avg 0.73 0.58 0.60 13564
weighted avg 0.86 0.89 0.86 13564
Adaboost Classifier
previous = EnsembleTechnique(0.0, 0.0, 0.0, None, None, 0)

counter = 1
estimator = 0
while estimator < maxLimit:

    estimator = pow(2, counter)
    counter = counter + 1
    # Init
    adaBoostClassifier = AdaBoostClassifier(n_estimators=estimator)
    # Fit data
    adaBoostClassifier.fit(X_train, y_train)

    # Predict
    abc_y_pred = adaBoostClassifier.predict(X_test)

    # Model score
    abc_model_score = adaBoostClassifier.score(X_test, y_test)

    # Accuracy
    abc_model_accuracy = metrics.accuracy_score(y_test, abc_y_pred)

    if abc_model_score > previous.score:
        previous = EnsembleTechnique(abc_model_score, abc_y_pred, abc_model_accuracy,
                                     metrics.confusion_matrix(y_test, abc_y_pred),
                                     classification_report(y_test, abc_y_pred),
                                     estimator)


print("Prediction: {}".format(previous.prediction))
print("Score: {}".format(previous.score))
print("Accuracy {}".format(previous.accuracy))
print("n estimators : {}".format(previous.n_estimators))
print("Confusion matrix")
print(previous.confusion_matrix)
print(previous.classification_report)

results['Adaboost Classifier'] = previous.score
Prediction: ['no' 'no' 'no' ... 'no' 'no' 'no']
Score: 0.8882335594219994
Accuracy 0.8882335594219994
n estimators : 256
Confusion matrix
[[11847 152]
[ 1364 201]]
precision recall f1-score support
no 0.90 0.99 0.94 11999
yes 0.57 0.13 0.21 1565
accuracy 0.89 13564
macro avg 0.73 0.56 0.57 13564
weighted avg 0.86 0.89 0.86 13564
Bagging Classifier
previous = EnsembleTechnique(0.0, 0.0, 0.0, None, None, 0)

counter = 1
estimator = 0
while estimator < maxLimit:

    estimator = pow(2, counter)
    counter = counter + 1
    # Init
    baggingClassifier = BaggingClassifier(n_estimators=estimator, max_samples=.7, bootstrap=True)
    # Fit data
    baggingClassifier.fit(X_train, y_train)

    # Predict
    bc_y_pred = baggingClassifier.predict(X_test)

    # Model score
    bc_model_score = baggingClassifier.score(X_test, y_test)

    # Accuracy
    bc_model_accuracy = metrics.accuracy_score(y_test, bc_y_pred)
    print("{} for n estimator {}".format(bc_model_score, estimator))
    if bc_model_score > previous.score:
        previous = EnsembleTechnique(bc_model_score, bc_y_pred, bc_model_accuracy,
                                     metrics.confusion_matrix(y_test, bc_y_pred),
                                     classification_report(y_test, bc_y_pred),
                                     estimator)


results['Bagging Classifier'] = previous.score
print("Prediction: {}".format(previous.prediction))
print("Score: {}".format(previous.score))
print("Accuracy {}".format(previous.accuracy))
print("n estimators : {}".format(previous.n_estimators))
print("Confusion matrix")
print(previous.confusion_matrix)
print(previous.classification_report)
0.8773223237982896 for n estimator 2
0.8773960483633146 for n estimator 4
0.8824093187850192 for n estimator 8
0.8850634031259216 for n estimator 16
0.8866116189914479 for n estimator 32
0.8859480979062223 for n estimator 64
0.8872014155116484 for n estimator 128
0.8858006487761723 for n estimator 256
Prediction: ['no' 'no' 'no' ... 'no' 'no' 'no']
Score: 0.8872014155116484
Accuracy 0.8872014155116484
n estimators : 128
Confusion matrix
[[11684 315]
[ 1215 350]]
precision recall f1-score support
no 0.91 0.97 0.94 11999
yes 0.53 0.22 0.31 1565
accuracy 0.89 13564
macro avg 0.72 0.60 0.63 13564
weighted avg 0.86 0.89 0.87 13564
Gradient Boost Classifier
previous = EnsembleTechnique(0.0, 0.0, 0.0, None, None, 0)

counter = 1
estimator = 0
while estimator < maxLimit:

    estimator = pow(2, counter)
    counter = counter + 1
    # Init
    gradientBoostClassifier = GradientBoostingClassifier(n_estimators=estimator, learning_rate=0.05)
    # Fit data
    gradientBoostClassifier.fit(X_train, y_train)

    # Predict
    gb_y_pred = gradientBoostClassifier.predict(X_test)

    # Model score
    gb_model_score = gradientBoostClassifier.score(X_test, y_test)

    # Accuracy
    gb_model_accuracy = metrics.accuracy_score(y_test, gb_y_pred)

    # print("{} for n estimator {}".format(gb_model_score, estimator))
    if gb_model_score > previous.score:
        previous = EnsembleTechnique(gb_model_score, gb_y_pred, gb_model_accuracy,
                                     metrics.confusion_matrix(y_test, gb_y_pred),
                                     classification_report(y_test, gb_y_pred),
                                     estimator)


results['Gradient Boost Classifier'] = previous.score
print("Prediction: {}".format(previous.prediction))
print("Score: {}".format(previous.score))
print("Accuracy {}".format(previous.accuracy))
print("n estimators : {}".format(previous.n_estimators))
print("Confusion matrix")
print(previous.confusion_matrix)
print(previous.classification_report)
/home/ashish/installed_apps/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Prediction: ['no' 'no' 'no' ... 'no' 'no' 'no']
Score: 0.8904452963727514
Accuracy 0.8904452963727514
n estimators : 256
Confusion matrix
[[11804 195]
[ 1291 274]]
precision recall f1-score support
no 0.90 0.98 0.94 11999
yes 0.58 0.18 0.27 1565
accuracy 0.89 13564
macro avg 0.74 0.58 0.61 13564
weighted avg 0.86 0.89 0.86 13564
Analysis Result
print("Model scores are ")
print(results)

best_score = max(results, key=results.get)

resultString = " has the best score with accuracy **{}** ".format(results[best_score])

printTextAsMarkdown(best_score, resultString, color="blue")
Model scores are
{'Decision Tree': 0.8261574756708936, 'Random Forest': 0.8882335594219994, 'Adaboost Classifier': 0.8882335594219994, 'Bagging Classifier': 0.8872014155116484, 'Gradient Boost Classifier': 0.8904452963727514}
Gradient Boost Classifier : has the best score with accuracy 0.8904452963727514
Analysis Report
Recall: of all the actual "Yes" labels in the data, the fraction that our model correctly detects.
Precision: of all the cases our model predicts as "Yes", the fraction that are actually "Yes". A worked check against the confusion matrix above is sketched below.
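As a hedged worked example (not in the original notebook), precision and recall for the "yes" class can be recomputed directly from the Gradient Boost confusion matrix reported above, where TP = 274, FP = 195 and FN = 1291:

# Precision and recall for the 'yes' class, from the Gradient Boost confusion matrix above
tp, fp, fn = 274, 195, 1291

precision = tp / (tp + fp)   # ~0.58: how many predicted 'yes' were actually 'yes'
recall = tp / (tp + fn)      # ~0.18: how many actual 'yes' were found by the model

print("precision {:.2f}, recall {:.2f}".format(precision, recall))

These values match the 0.58 precision and 0.18 recall shown in the classification report for the Gradient Boost Classifier.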
A single decision tree will not yield the best result, since it is built on all the individual attributes, whereas Random Forest randomly picks subsets of the columns and aggregates the results, so it generally performs better on accuracy. The Gradient Boost Classifier improves incrementally: each new tree tries to correct the errors of the previous one. In terms of training performance, Random Forest beats the Gradient Boost Classifier due to its parallel nature of execution, whereas Gradient Boost works sequentially.
From the analysis it is clear that the Gradient Boost Classifier gives the best model score. We have also seen that the number of trees should be within a certain range; too few or too many will not yield a proper result.
Data
Source Code
bank_solution_term_deposit.ipynb