Supervised Learning: Thera Bank Case Study


Overview

The classification goal is to predict the likelihood of a liability customer buying personal loans.

Solution

# Import necessary libraries

# NumPy: mathematical functions, array and matrix operations
import numpy as np

# pandas: tabular data handling; seaborn: plotting and other visual tools
import pandas as pd
import seaborn as sns

sns.set_palette("muted")

# color_palette = sns.color_palette()
# To enable inline plotting of graphs
import matplotlib.pyplot as plt
%matplotlib inline
# Import CSV data into a pandas data frame
df_original = pd.read_csv('bank.csv')

# Prepare column names
df_columns = []
for column in df_original.columns:
    df_columns.append(column)

# Prepare a mapping of column index to column name for quick access
df_columns_map = {}
map_index: int = 0
for column in df_columns:
    df_columns_map[map_index] = column
    map_index = map_index + 1

print("Columns {}".format(df_columns))
print("***********************************************************************************************************************")
print("Columns Map {}".format(df_columns_map))

# We have separated out the columns and their index mapping from the data, so at any point during data analysis or cleaning
# we can refer to a column either by its index or by its column identifier

# See data overview

df_original.head()
Columns ['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard']
***********************************************************************************************************************
Columns Map {0: 'ID', 1: 'Age', 2: 'Experience', 3: 'Income', 4: 'ZIP Code', 5: 'Family', 6: 'CCAvg', 7: 'Education', 8: 'Mortgage', 9: 'Personal Loan', 10: 'Securities Account', 11: 'CD Account', 12: 'Online', 13: 'CreditCard'}
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
# Let's analyse the data based on the following conditions:
# 1. Check whether all rows x columns are loaded as given in the problem statement; the data must match before we operate on it.
# 2. Print the shape of the data.
# 3. Check the data types of each field.
# 4. Find the presence of null or missing values.
# 5. Visually inspect the data, check for the presence of outliers, and decide whether
#    they are few enough to drop or need to be considered during model building.
# 6. Decide whether all data columns in the data set are needed for model building.
# 7. Find correlation, median, mean, standard deviation, min and max for the columns.

# Below is the info for our data

df_original.info();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
# 1. Check whether all rows x columns are loaded as given in the problem statement; all data must match before we operate on it.

# df_original.describe() is difficult to view as-is, so let's apply transpose() to see it better

df_original.describe().transpose()
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIP Code 5000.0 93152.503000 2121.852197 9307.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Data Analysis Column Wise

  • ID: just an identity/row counter for each item in the data frame; this can be dropped when building the model.
  • Age: based on the std and quartiles (Q1, Q2, Q3) the values look valid.
  • Experience: look at the min, it says -3; experience cannot be negative, so this field needs correction. Ideal values should be roughly 0-80, assuming a person starts working around age 20 and lives at most about 100 years.
  • ZIP Code: all values seem fine. If we discard region from the model, dropping this column does not hurt. But if at a later stage we want to predict which areas accept more personal loans, we would need to keep this field.
  • Family: the data looks fine and can play a role: fewer children means less responsibility and hence less need for a loan, while more children can push people towards extra loans (education and otherwise). So this is an important field in model building.
  • CreditCard: a reasonable human assumption is that a person who has a credit card is unlikely to go for a personal loan. But if a longer-term need arises, there may still be data points with both a credit card and a personal loan.

Skipping the other fields, which are self-explanatory; a quick programmatic check of the points above follows.
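As a minimal sketch (assuming df_original has been loaded from bank.csv as above), these observations can be verified programmatically before any cleaning is done:

# Illustrative quick check of the column-wise observations above
# ID should be a pure row identifier: one unique value per row
print("ID unique values: {} of {} rows".format(df_original['ID'].nunique(), len(df_original)))

# Experience should never be negative; count the offending rows
print("Rows with negative Experience: {}".format((df_original['Experience'] < 0).sum()))

# Age should fall within a plausible range
print("Age range: {} to {}".format(df_original['Age'].min(), df_original['Age'].max()))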

# Looking at the column labels, all columns from the given data set have been loaded successfully.
# Let's check the shape of the data
df_original.shape

# Here we see a total of 5000 rows and 14 columns.
(5000, 14)
1# sns.boxplot(y="Age", orient="v", x="Personal Loan", hue="Education", data=df_original)
# Data types of the fields

df_original.dtypes

# We see that everything is numeric data and does not need any conversion
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal Loan           int64
Securities Account      int64
CD Account              int64
Online                  int64
CreditCard              int64
dtype: object
# Presence of null or missing values
df_original.isnull().values.any()

# This tells us that we have a value for every row x column, i.e. no missing data.
False
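For a per-column view of missing values, a small sketch against the same data frame (not in the original cell) would be:

# Count missing values per column; every count should be zero for this data set
df_original.isnull().sum()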

Next, check the validity of the data. As we have seen, there is no missing data in our data frame, but are the values themselves valid? For example, the Experience field has -3; a negative experience adds no value to our model and may affect the final result, depending on how many such values are present. Let's print the column composition, i.e. the spread of categories, for the suspicious column.

# Let's see what range of Experience values our data set has.

print("Experience has unique data in this range {}".format(df_original["Experience"].unique().tolist()))

# len(df_original[df_original["Experience"] < 0]['Experience'].unique().tolist())
df_experience = df_original[df_original['Experience'] < 0]
df_experience['Experience'].value_counts()

# So there are 52 rows with a negative (invalid) Experience value, which constitutes ~1% of the data for this column.
# We have a few options to deal with this:
# - purge the invalid data
# - replace it with something meaningful
Experience has unique data in this range [1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18, 21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29, 3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43]





-1    33
-2    15
-3     4
Name: Experience, dtype: int64
# Let's quickly check Age

print("Age has unique data in this range {}".format(df_original['Age'].unique().tolist()))
# This looks ok.

# Value counts
df_original['Age'].value_counts()

# There are no suspicious values
Age has unique data in this range [25, 45, 39, 35, 37, 53, 50, 34, 65, 29, 48, 59, 67, 60, 38, 42, 46, 55, 56, 57, 44, 36, 43, 40, 30, 31, 51, 32, 61, 41, 28, 49, 47, 62, 58, 54, 33, 27, 66, 24, 52, 26, 64, 63, 23]





35    151
43    149
52    145
58    143
54    143
50    138
41    136
30    136
56    135
34    134
39    133
59    132
57    132
51    129
60    127
45    127
46    127
42    126
40    125
31    125
55    125
62    123
29    123
61    122
44    121
32    120
33    120
48    118
38    115
49    115
47    113
53    112
63    108
36    107
37    106
28    103
27     91
65     80
64     78
26     78
25     53
24     28
66     24
23     12
67     12
Name: Age, dtype: int64

We were able to inspect the values visually because they are only in the thousands; what if there were hundreds of thousands? It is neither feasible nor a good approach to manually scan terminal output for invalid values. In that case we can try to parse the values in each field instead. We could create a helper method per data type (string, bool, int and so on); the signature might look like this:

def all_int(column_int_values_list):
    # Iterate over the list, checking that every value parses as an int
    return all(isinstance(value, (int, np.integer)) for value in column_int_values_list)

Let's try to check the validity of the values using this parse approach.

# Validate function

def validate_column(column_as_list, column_name):
    print("Analysing {} column for unique value {}".format(column_name, column_as_list))
    # Try a numeric operation on every value; a TypeError means a non-numeric (invalid) entry
    for value in column_as_list:
        try:
            value += 1
        except TypeError:
            print("Error identifying {} in {} column ".format(value, column_name))
            return False
    return True
1print("Is Age column valid {}".format(validate_column(df_original['Age'].unique().tolist(), 'Age')))
Analysing Age column for unique value [25, 45, 39, 35, 37, 53, 50, 34, 65, 29, 48, 59, 67, 60, 38, 42, 46, 55, 56, 57, 44, 36, 43, 40, 30, 31, 51, 32, 61, 41, 28, 49, 47, 62, 58, 54, 33, 27, 66, 24, 52, 26, 64, 63, 23]
Is Age column valid True
1print("Is Experience column valid {}".format(validate_column(df_original['Experience'].unique().tolist(), 'Experience')))
Analysing Experience column for unique value [1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18, 21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29, 3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43]
Is Experience column valid True
1print("Is Income column valid {}".format(validate_column(df_original['Income'].unique().tolist(), 'Income')))
Analysing Income column for unique value [49, 34, 11, 100, 45, 29, 72, 22, 81, 180, 105, 114, 40, 112, 130, 193, 21, 25, 63, 62, 43, 152, 83, 158, 48, 119, 35, 41, 18, 50, 121, 71, 141, 80, 84, 60, 132, 104, 52, 194, 8, 131, 190, 44, 139, 93, 188, 39, 125, 32, 20, 115, 69, 85, 135, 12, 133, 19, 82, 109, 42, 78, 51, 113, 118, 64, 161, 94, 15, 74, 30, 38, 9, 92, 61, 73, 70, 149, 98, 128, 31, 58, 54, 124, 163, 24, 79, 134, 23, 13, 138, 171, 168, 65, 10, 148, 159, 169, 144, 165, 59, 68, 91, 172, 55, 155, 53, 89, 28, 75, 170, 120, 99, 111, 33, 129, 122, 150, 195, 110, 101, 191, 140, 153, 173, 174, 90, 179, 145, 200, 183, 182, 88, 160, 205, 164, 14, 175, 103, 108, 185, 204, 154, 102, 192, 202, 162, 142, 95, 184, 181, 143, 123, 178, 198, 201, 203, 189, 151, 199, 224, 218]
Is Income column valid True
1print("Is ZIP Code colum valid {}".format(validate_column(df_original['ZIP Code'].unique().tolist(), 'ZIP Code')))
Analysing ZIP Code column for unique value [91107, 90089, 94720, 94112, 91330, 92121, 91711, 93943, 93023, 94710, 90277, 93106, 94920, 91741, 95054, 95010, 94305, 91604, 94015, 90095, 91320, 95521, 95064, 90064, 94539, 94104, 94117, 94801, 94035, 92647, 95814, 94114, 94115, 92672, 94122, 90019, 95616, 94065, 95014, 91380, 95747, 92373, 92093, 94005, 90245, 95819, 94022, 90404, 93407, 94523, 90024, 91360, 95670, 95123, 90045, 91335, 93907, 92007, 94606, 94611, 94901, 92220, 93305, 95134, 94612, 92507, 91730, 94501, 94303, 94105, 94550, 92612, 95617, 92374, 94080, 94608, 93555, 93311, 94704, 92717, 92037, 95136, 94542, 94143, 91775, 92703, 92354, 92024, 92831, 92833, 94304, 90057, 92130, 91301, 92096, 92646, 92182, 92131, 93720, 90840, 95035, 93010, 94928, 95831, 91770, 90007, 94102, 91423, 93955, 94107, 92834, 93117, 94551, 94596, 94025, 94545, 95053, 90036, 91125, 95120, 94706, 95827, 90503, 90250, 95817, 95503, 93111, 94132, 95818, 91942, 90401, 93524, 95133, 92173, 94043, 92521, 92122, 93118, 92697, 94577, 91345, 94123, 92152, 91355, 94609, 94306, 96150, 94110, 94707, 91326, 90291, 92807, 95051, 94085, 92677, 92614, 92626, 94583, 92103, 92691, 92407, 90504, 94002, 95039, 94063, 94923, 95023, 90058, 92126, 94118, 90029, 92806, 94806, 92110, 94536, 90623, 92069, 92843, 92120, 95605, 90740, 91207, 95929, 93437, 90630, 90034, 90266, 95630, 93657, 92038, 91304, 92606, 92192, 90745, 95060, 94301, 92692, 92101, 94610, 90254, 94590, 92028, 92054, 92029, 93105, 91941, 92346, 94402, 94618, 94904, 9307, 95482, 91709, 91311, 94509, 92866, 91745, 94111, 94309, 90073, 92333, 90505, 94998, 94086, 94709, 95825, 90509, 93108, 94588, 91706, 92109, 92068, 95841, 92123, 91342, 90232, 92634, 91006, 91768, 90028, 92008, 95112, 92154, 92115, 92177, 90640, 94607, 92780, 90009, 92518, 91007, 93014, 94024, 90027, 95207, 90717, 94534, 94010, 91614, 94234, 90210, 95020, 92870, 92124, 90049, 94521, 95678, 95045, 92653, 92821, 90025, 92835, 91910, 94701, 91129, 90071, 96651, 94960, 91902, 90033, 95621, 90037, 90005, 93940, 91109, 93009, 93561, 95126, 94109, 93107, 94591, 92251, 92648, 92709, 91754, 92009, 96064, 91103, 91030, 90066, 95403, 91016, 95348, 91950, 95822, 94538, 92056, 93063, 91040, 92661, 94061, 95758, 96091, 94066, 94939, 95138, 95762, 92064, 94708, 92106, 92116, 91302, 90048, 90405, 92325, 91116, 92868, 90638, 90747, 93611, 95833, 91605, 92675, 90650, 95820, 90018, 93711, 95973, 92886, 95812, 91203, 91105, 95008, 90016, 90035, 92129, 90720, 94949, 90041, 95003, 95192, 91101, 94126, 90230, 93101, 91365, 91367, 91763, 92660, 92104, 91361, 90011, 90032, 95354, 94546, 92673, 95741, 95351, 92399, 90274, 94087, 90044, 94131, 94124, 95032, 90212, 93109, 94019, 95828, 90086, 94555, 93033, 93022, 91343, 91911, 94803, 94553, 95211, 90304, 92084, 90601, 92704, 92350, 94705, 93401, 90502, 94571, 95070, 92735, 95037, 95135, 94028, 96003, 91024, 90065, 95405, 95370, 93727, 92867, 95821, 94566, 95125, 94526, 94604, 96008, 93065, 96001, 95006, 90639, 92630, 95307, 91801, 94302, 91710, 93950, 90059, 94108, 94558, 93933, 92161, 94507, 94575, 95449, 93403, 93460, 95005, 93302, 94040, 91401, 95816, 92624, 95131, 94965, 91784, 91765, 90280, 95422, 95518, 95193, 92694, 90275, 90272, 91791, 92705, 91773, 93003, 90755, 96145, 94703, 96094, 95842, 94116, 90068, 94970, 90813, 94404, 94598]
Is ZIP Code column valid True
1print("Is Family colum valid {}".format(validate_column(df_original['Family'].unique().tolist(), 'Family')))
Analysing Family column for unique value [4, 3, 1, 2]
Is Family column valid True
1print("Is CCAvg colum valid {}".format(validate_column(df_original['CCAvg'].unique().tolist(), 'CCAvg')))
Analysing CCAvg column for unique value [1.6, 1.5, 1.0, 2.7, 0.4, 0.3, 0.6, 8.9, 2.4, 0.1, 3.8, 2.5, 2.0, 4.7, 8.1, 0.5, 0.9, 1.2, 0.7, 3.9, 0.2, 2.2, 3.3, 1.8, 2.9, 1.4, 5.0, 2.3, 1.1, 5.7, 4.5, 2.1, 8.0, 1.7, 0.0, 2.8, 3.5, 4.0, 2.6, 1.3, 5.6, 5.2, 3.0, 4.6, 3.6, 7.2, 1.75, 7.4, 2.67, 7.5, 6.5, 7.8, 7.9, 4.1, 1.9, 4.3, 6.8, 5.1, 3.1, 0.8, 3.7, 6.2, 0.75, 2.33, 4.9, 0.67, 3.2, 5.5, 6.9, 4.33, 7.3, 4.2, 4.4, 6.1, 6.33, 6.6, 5.3, 3.4, 7.0, 6.3, 8.3, 6.0, 1.67, 8.6, 7.6, 6.4, 10.0, 5.9, 5.4, 8.8, 1.33, 9.0, 6.7, 4.25, 6.67, 5.8, 4.8, 3.25, 5.67, 8.5, 4.75, 4.67, 3.67, 8.2, 3.33, 5.33, 9.3, 2.75]
Is CCAvg column valid True
1print("Is Education colum valid {}".format(validate_column(df_original['Education'].unique().tolist(), 'Education')))
Analysing Education column for unique value [1, 2, 3]
Is Education column valid True
1print("Is Mortgage colum valid {}".format(validate_column(df_original['Mortgage'].unique().tolist(), 'Mortgage')))
Analysing Mortgage column for unique value [0, 155, 104, 134, 111, 260, 163, 159, 97, 122, 193, 198, 285, 412, 153, 211, 207, 240, 455, 112, 336, 132, 118, 174, 126, 236, 166, 136, 309, 103, 366, 101, 251, 276, 161, 149, 188, 116, 135, 244, 164, 81, 315, 140, 95, 89, 90, 105, 100, 282, 209, 249, 91, 98, 145, 150, 169, 280, 99, 78, 264, 113, 117, 325, 121, 138, 77, 158, 109, 131, 391, 88, 129, 196, 617, 123, 167, 190, 248, 82, 402, 360, 392, 185, 419, 270, 148, 466, 175, 147, 220, 133, 182, 290, 125, 124, 224, 141, 119, 139, 115, 458, 172, 156, 547, 470, 304, 221, 108, 179, 271, 378, 176, 76, 314, 87, 203, 180, 230, 137, 152, 485, 300, 272, 144, 94, 208, 275, 83, 218, 327, 322, 205, 227, 239, 85, 160, 364, 449, 75, 107, 92, 187, 355, 106, 587, 214, 307, 263, 310, 127, 252, 170, 265, 177, 305, 372, 79, 301, 232, 289, 212, 250, 84, 130, 303, 256, 259, 204, 524, 157, 231, 287, 247, 333, 229, 357, 361, 294, 86, 329, 142, 184, 442, 233, 215, 394, 475, 197, 228, 297, 128, 241, 437, 178, 428, 162, 234, 257, 219, 337, 382, 397, 181, 120, 380, 200, 433, 222, 483, 154, 171, 146, 110, 201, 277, 268, 237, 102, 93, 354, 195, 194, 238, 226, 318, 342, 266, 114, 245, 341, 421, 359, 565, 319, 151, 267, 601, 567, 352, 284, 199, 80, 334, 389, 186, 246, 589, 242, 143, 323, 535, 293, 398, 343, 255, 311, 446, 223, 262, 422, 192, 217, 168, 299, 505, 400, 165, 183, 326, 298, 569, 374, 216, 191, 408, 406, 452, 432, 312, 477, 396, 582, 358, 213, 467, 331, 295, 235, 635, 385, 328, 522, 496, 415, 461, 344, 206, 368, 321, 296, 373, 292, 383, 427, 189, 202, 96, 429, 431, 286, 508, 210, 416, 553, 403, 225, 500, 313, 410, 273, 381, 330, 345, 253, 258, 351, 353, 308, 278, 464, 509, 243, 173, 481, 281, 306, 577, 302, 405, 571, 581, 550, 283, 612, 590, 541]
Is Mortgage column valid True
1print("Is Securities Account colum valid {}".format(validate_column(df_original['Securities Account'].unique().tolist(), 'Securities Account')))
Analysing Securities Account column for unique value [1, 0]
Is Securities Account column valid True
1print("Is CD Account colum valid {}".format(validate_column(df_original['CD Account'].unique().tolist(), 'CD Account')))
Analysing CD Account column for unique value [0, 1]
Is CD Account column valid True
1print("Is Online colum valid {}".format(validate_column(df_original['Online'].unique().tolist(), 'Online')))
Analysing Online column for unique value [0, 1]
Is Online column valid True
1print("Is CreditCard colum valid {}".format(validate_column(df_original['CreditCard'].unique().tolist(), 'CreditCard')))
Analysing CreditCard column for unique value [0, 1]
Is CreditCard column valid True
print("Is Personal Loan column valid {}".format(validate_column(df_original['Personal Loan'].unique().tolist(), 'Personal Loan')))
Analysing Personal Loan column for unique value [0, 1]
Is Personal Loan column valid True

Let's visually inspect the distribution of values across each column and check for the presence of outliers; a quick numeric check is sketched below before the plots.
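As a complement to the plots, here is a minimal sketch, assuming the common 1.5 × IQR rule, to count potential outliers per continuous column (illustrative only, not part of the original notebook):

# IQR-based outlier count per column (1.5 * IQR rule), as a rough guide alongside the plots
for col in ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']:
    q1, q3 = df_original[col].quantile(0.25), df_original[col].quantile(0.75)
    iqr = q3 - q1
    outliers = ((df_original[col] < q1 - 1.5 * iqr) | (df_original[col] > q3 + 1.5 * iqr)).sum()
    print("{}: {} potential outliers".format(col, outliers))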

# Age
sns.distplot(df_original['Age'], kde=True)
# Here we see that the data set covers a wide range of age groups
<matplotlib.axes._subplots.AxesSubplot at 0x7f68473dc9d0>

png

# Experience
sns.distplot(df_original['Experience'], kde=True)

# Again, a wide range of experience levels
<matplotlib.axes._subplots.AxesSubplot at 0x7f6846b40590>

png

# Let's analyse the relation between Age and Personal Loan

sns.catplot(y='Age', x='Personal Loan', data=df_original)
# Apart from 0 being denser than 1, this relationship gives no clear visual indication of who would take more loans
<seaborn.axisgrid.FacetGrid at 0x7f68473edf10>

png

# Let's try Income vs Personal Loan
sns.catplot(y='Income', x='Personal Loan', data=df_original)
# Quite evident: people whose income is between 100 and 200 tend to take loans more than those in the lower income range.
# So Income has a good influence on Personal Loan; let's see which other fields relate to it
print(df_columns)
['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard']

png

sns.catplot(y='CCAvg', x='Personal Loan', data=df_original)
# People whose credit card average spend is between 2 and 6 seem to have more personal loans
# than other ranges, as seen from the graph below
<seaborn.axisgrid.FacetGrid at 0x7f68469db990>

png

# One more column in our dataset that does not correspond to income values but affects it:
# Family. Can we say that people with more family members have taken a personal loan?
# sns.relplot(x='Personal Loan', y='Family', data=df_original, fit_reg=False)
sns.relplot(x="Family", y="Income", hue="Personal Loan", data=df_original);

# We can conclude from this graph that people who have 3 or 4 family members and whose income is roughly between 100 and 200
# tend to opt for a personal loan more than the other ranges.

png

# A little more exploration using categorical columns
pd.crosstab(df_original['Personal Loan'], df_original['Family'])
Family 1 2 3 4
Personal Loan
0 1365 1190 877 1088
1 107 106 133 134
# Box plot comparing Family and Personal Loan
sns.boxplot(y="Family", orient="v", x="Personal Loan", data=df_original)
# Loan takers mostly come from families of 3-4 members; people with one or more children seem to have opted for a personal loan
<matplotlib.axes._subplots.AxesSubplot at 0x7f6846954250>

png

# Income and Personal Loan
sns.boxplot(y="Income", orient="v", x="Personal Loan", data=df_original)
<matplotlib.axes._subplots.AxesSubplot at 0x7f68469d6550>

png

# Let's analyse using a count plot
sns.countplot(x='Personal Loan', data=df_original)
# Even though we have established relationships between Family, Age, Income and CCAvg and decided to go further,
# the composition of our data is imbalanced: we have very few cases of people who opted for a personal loan and
# many more cases of people who rejected the bank's personal loan offer. This imbalance can sometimes affect model building,
# but in our case it is acceptable, because in real life the people who take a personal loan are indeed the minority
# (a quick quantification follows below).
<matplotlib.axes._subplots.AxesSubplot at 0x7f6846a85bd0>

png
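To put a number on the imbalance noted above, one quick check (a sketch, not in the original notebook) is the normalized class distribution of the target:

# Share of each class in the target column; roughly 90% "no loan" vs 10% "loan", consistent with df.describe() above
print(df_original['Personal Loan'].value_counts(normalize=True))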

sns.countplot(x='Education', hue='Personal Loan', data=df_original)
# From the graph it is evident that 3 > 2 > 1 in loan acceptance, where
# 3: Advanced/Professional, 2: Graduate, 1: Undergraduate
<matplotlib.axes._subplots.AxesSubplot at 0x7f68467da1d0>

png

1# Relation between Family& Education to personal load
2sns.barplot('Education','Family',hue='Personal Loan',data=df_original,ci=None)
<matplotlib.axes._subplots.AxesSubplot at 0x7f68467c9310>

png

# Classes for all models go here
# 1. Class Logistic
# 2. Class Knn
# 3. Class NaiveBayes
# Remember, df_main_x will be the main dataframe we operate on; df_original is the unmodified loaded data set, which stays pristine :)
# Separate the data, i.e. the input columns and the column to be predicted.
# Drop 'Personal Loan' from the dataframe, as this is the dependent variable, and copy it alone into the y dataframe.

df_main_x = df_original[['Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
                         'Education', 'Mortgage', 'Securities Account', 'CD Account', 'Online',
                         'CreditCard']]

# Replace all negative values in the Experience column with 0

# df_main_x.Experience[df_main_x.Experience.lt(0)] = 0

# df_main_x['Experience'] = df_main_x['Experience'].map(lambda value: value if value >= 0 else 0)

df_main_x.loc[df_main_x['Experience'] < 0, 'Experience'] = 0


print("Experience values unique {}".format(validate_column(df_main_x['Experience'].unique().tolist(), 'Experience')))

df_main_y = df_original['Personal Loan']


# Also remember that we dropped the ID column, as it is just a row counter

df_main_x.describe().T

# Now we see that there is no negative value in the Experience column
Analysing Experience column for unique value [1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18, 21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29, 3, 22, 0, 34, 38, 40, 33, 4, 42, 43]
Experience values unique True


/home/ashish/installed_apps/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:966: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
count mean std min 25% 50% 75% max
Age 5000.0 45.338400 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 20.119600 11.440484 0.0 10.0 20.0 30.0 43.0
Income 5000.0 73.774200 46.033729 8.0 39.0 64.0 98.0 224.0
ZIP Code 5000.0 93152.503000 2121.852197 9307.0 91911.0 93437.0 94608.0 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 1.881000 0.839869 1.0 1.0 2.0 3.0 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.0 0.0 101.0 635.0
Securities Account 5000.0 0.104400 0.305809 0.0 0.0 0.0 0.0 1.0
CD Account 5000.0 0.060400 0.238250 0.0 0.0 0.0 0.0 1.0
Online 5000.0 0.596800 0.490589 0.0 0.0 1.0 1.0 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.0 0.0 1.0 1.0
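A note on the SettingWithCopyWarning printed above: it appears because df_main_x was created as a slice of df_original. One common way to avoid it, shown here only as a hedged sketch (the original notebook keeps the slice as-is), is to take an explicit copy before modifying:

# Taking an explicit copy makes df_main_x an independent dataframe,
# so the .loc assignment no longer writes into a view of df_original
df_main_x = df_original.drop(columns=['ID', 'Personal Loan']).copy()
df_main_x.loc[df_main_x['Experience'] < 0, 'Experience'] = 0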
# X data frame
df_main_x.head()
Age Experience Income ZIP Code Family CCAvg Education Mortgage Securities Account CD Account Online CreditCard
0 25 1 49 91107 4 1.6 1 0 1 0 0 0
1 45 19 34 90089 3 1.5 1 0 1 0 0 0
2 39 15 11 94720 1 1.0 1 0 0 0 0 0
3 35 9 100 94112 1 2.7 2 0 0 0 0 0
4 35 8 45 91330 4 1.0 2 0 0 0 0 1
# Y data frame
df_main_y.head()
0    0
1    0
2    0
3    0
4    0
Name: Personal Loan, dtype: int64
# Training constants and general imports

from math import sqrt

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn import metrics
from sklearn.metrics import classification_report

# Taking a 70:30 training and test split
test_size = 0.30

# Random number seeding for repeatability of the code
seed = 29  # My birth date :)

# Integer square root (Newton's method); used later to bound the k search for K-NN
def isqrt(n):
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x

Logistic Regression

Training General

# Why are we doing logistic regression? In linear regression the response of the system is continuous, whereas in
# logistic regression there is a limited number of possible outcomes, i.e. in our case [0] or [1]: whether a person
# is likely to take a loan or not, [yes] or [no].

# Class LogisticRegressionProcess
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_main_x, df_main_y, test_size=test_size, random_state=seed)

Predicting Logistic Regression

lr_model = LogisticRegression()

lr_model.fit(X_train, y_train)

lr_predict = lr_model.predict(X_test)

lr_score = lr_model.score(X_test, y_test)

Evaluating Logistic Regression

1print("Model Score")
2print(lr_score)
3print("Model confusion matrix")
4print(metrics.confusion_matrix(y_test, lr_predict))
5
6
7print(classification_report(y_test,lr_predict))
Model Score
0.9053333333333333
Model confusion matrix
[[1314   49]
 [  93   44]]
              precision    recall  f1-score   support

           0       0.93      0.96      0.95      1363
           1       0.47      0.32      0.38       137

    accuracy                           0.91      1500
   macro avg       0.70      0.64      0.67      1500
weighted avg       0.89      0.91      0.90      1500

Naive Bayes

Predicting Naive Bayes

nb_model = GaussianNB()

nb_model.fit(X_train, y_train)

y_nb_predict = nb_model.predict(X_test)

nb_score = nb_model.score(X_test, y_test)

Evaluating Naive Bayes

1print("Model Score")
2print(nb_score)
3print("Model confusion matrix")
4print(metrics.confusion_matrix(y_test, y_nb_predict))
5
6
7print(classification_report(y_test,y_nb_predict))
Model Score
0.876
Model confusion matrix
[[1240  123]
 [  63   74]]
              precision    recall  f1-score   support

           0       0.95      0.91      0.93      1363
           1       0.38      0.54      0.44       137

    accuracy                           0.88      1500
   macro avg       0.66      0.72      0.69      1500
weighted avg       0.90      0.88      0.89      1500

K-NN

Predicting K-NN

knn_predict = 0
knn_score = 0
knn_value = 0
# We have a total of 5000 rows; as a rule of thumb, search k from 1 up to sqrt(n)
print(isqrt(df_main_x.shape[0]))
for i in range(isqrt(df_main_x.shape[0])):
    kvalue = i + 1
    knn_model = KNeighborsClassifier(n_neighbors=kvalue)
    knn_model.fit(X_train, y_train)
    new_knn_predict = knn_model.predict(X_test)
    new_knn_score = knn_model.score(X_test, y_test)
    if new_knn_score >= knn_score:
        knn_score = new_knn_score
        knn_predict = new_knn_predict
        knn_value = kvalue

print("Knn evaluation completed, best value is {}".format(knn_value))
70
Knn evaluation completed, best value is 28

Evaluating K-NN

1print("Model Score")
2print(knn_score)
3print("Model confusion matrix")
4print(metrics.confusion_matrix(y_test, knn_predict))
5
6
7print(classification_report(y_test,knn_predict))
Model Score
0.9106666666666666
Model confusion matrix
[[1360    3]
 [ 131    6]]
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1363
           1       0.67      0.04      0.08       137

    accuracy                           0.91      1500
   macro avg       0.79      0.52      0.52      1500
weighted avg       0.89      0.91      0.87      1500

Analysis Result

results = {'Logistic Regression': lr_score, 'Naive Bayes': nb_score, 'K-NN': knn_score}

print("Model scores are ")
print(results)

best_score = max(results, key=results.get)

print("Best score is for {} with accuracy {} ".format(best_score, results[best_score]))

if best_score == 'K-NN':
    # Use knn_value (the best k found), not kvalue (the last k tried in the loop)
    print(' with kvalue {}'.format(knn_value))
Model scores are 
{'Logistic Regression': 0.9053333333333333, 'Naive Bayes': 0.876, 'K-NN': 0.9106666666666666}
Best score is for K-NN with accuracy 0.9106666666666666 
 with kvalue 28

Analysis Report

Our main aim was to find out which people would accept a personal loan, based on the given data.

From the output we see that K-NN turned out to be the best model, with an accuracy of 0.91. The nearest other accuracy is 0.90, from Logistic Regression. For our use case, given the problem statement and the options we had, we can consider Logistic Regression to be the best approach: the output to be predicted was 0/1, and that is exactly what logistic regression models, transforming its output with the sigmoid (logistic) function. Also, Logistic Regression is a parametric algorithm whereas K-NN is not. Theoretically K-NN is a little slower; as we have seen, finding the best possible k value required iterating, and this cost depends on the input data size (row count).
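For reference, a minimal sketch of the sigmoid function mentioned above (illustrative only, not part of the original notebook): it maps any real-valued combination of the features to a probability in (0, 1), which is then thresholded to give the 0/1 prediction.

# Sigmoid: squashes a real number z into the (0, 1) range
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Conceptually, the prediction for one row x is: 1 if sigmoid(w . x + b) >= 0.5 else 0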

So K-NN would perform well when there are no time constraints on finding the best score, whereas Logistic Regression would be the optimal choice under time constraints and when the target column is a binary prediction, as in this case.

Data

bank.csv

Source Code

bank_solution.ipynb
