Unit Structure
1.0 Objectives
1.1 Introduction
1.2 Logic Programming with PROLOG
1.3 Relationships among Objects and Properties Of Objects
1.4 Problem solving
1.4.1 Water jug problem
1.4.2 Tic-Tac-Toe problem
1.4.3 8-Puzzle Problem
1.5 Summary
1.6 References
1.7 Bibliography
1.8 Unit End Exercises
1.0 OBJECTIVES
After reading this chapter, students will be able to:
● Explain the structure of PROLOG
● Describe the logic programming of PROLOG
● Understand objects and their working principles in PROLOG
● Write applications and solve Artificial Intelligence problems
using PROLOG programs
1.1 INTRODUCTION
PROLOG (PROgramming in LOGic) was designed in the 1970s by
Alain Colmerauer and a team of researchers, with the idea that
it was possible to use logic to represent knowledge and to write programs.
It uses a subset of predicate logic and draws its structure from the theoretical
work of earlier logicians such as Herbrand (1930) and Robinson (1965)
on the automation of theorem proving.
PROLOG supports:
● Natural Language Understanding
● Formal logic and associated forms of programming
● Reasoning modeling
● Database programming
● Expert System Development
● Real time AI programs
1.2 LOGIC PROGRAMMING WITH PROLOG
PROLOG programs are often described as declarative, although they
unavoidably also have a procedural element. Programs are based on the
techniques developed by logicians to form valid conclusions from
available evidence. There are only two components to any program: facts
and rules. The PROLOG system reads in the program and simply stores it.
The user then enters queries, which the system answers using the
facts and rules available to it. A simple example is given below.
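dog(puppy).
dog(kutty).
dog(jimmy).
cat(valu).
cat(miaw).
cat(mouse).
animal(Y) :- dog(Y).
Queries:
?- dog(puppy).
?- cat(kar).
The first query succeeds because dog(puppy) is a stated fact; the second
fails because no fact or rule establishes cat(kar). The example shows a
PROLOG program made up of facts and rules, and the use of queries that make
PROLOG search through its facts and rules to work out the answer.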
Determining that puppy is an animal involves a very simple form of
logical reasoning:
Given that any Y is an animal if it is a dog, and puppy is a dog,
deduce that puppy must be an animal.
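The deduction above can also be imitated outside PROLOG. The following is a minimal Python sketch, not a description of how PROLOG works internally; the names `dogs`, `cats` and `is_animal` are illustrative assumptions that mirror the facts and the single rule animal(Y) :- dog(Y).

```python
# Facts from the example program, stored as plain Python sets.
dogs = {"puppy", "kutty", "jimmy"}
cats = {"valu", "miaw", "mouse"}


def is_animal(y):
    """Rule animal(Y) :- dog(Y): Y is an animal if Y is a dog."""
    return y in dogs


print(is_animal("puppy"))  # puppy is a dog, so the rule applies
print(is_animal("valu"))   # valu is only known as a cat; no rule covers cats
```

As in the PROLOG program, nothing can be concluded about cats, because the only rule for animal mentions dog.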
1.3 RELATIONSHIPS AMONG OBJECTS AND PROPERTIES OF OBJECTS
The relationships among objects, and the properties of individual objects,
are explained through the following example.
Each family has three components: husband, wife and children, which are the objects
of the family. As the number of children varies from family to family, the
children are represented by a list that is capable of accommodating any
number of items. Each person is, in turn, represented by a structure of four
components: name, surname, date of birth and job. The job information is
either unemployed or it specifies the working organization and salary. The
Fox family can be stored in the database by the clause
family(
person( tom, fox, date(7,may,1950), works(bbc,15200) ),
person( ann, fox, date(9,may,1951), unemployed),
[person( pat, fox, date(5,may,1973), unemployed),
person( jim, fox, date(5,may,1973), unemployed) ] ).
This program can be extended by adding information on the gender of
the people that occur in the parent relation. This can be done by simply
adding the following facts to our program:
female( pam).
male( tom).
male( bob).
female( liz).
female( pat).
female( ann).
male( jim).
The relations introduced here are male and female. These relations are
unary relations.
A binary relation like parent defines a relation between pairs of objects; on
the other hand, unary relations can be used to declare simple yes/no
properties of objects. The first unary clause above can be read: Pam is a
female. The same information could instead be declared with one
binary relation, gender, in place of the two unary relations. An alternative code snippet is:
gender( pam, feminine).
gender( tom, masculine).
gender( bob, masculine).
The offspring relation is the inverse of the parent relation. We could
define offspring in a similar way as the parent relation; that is, by simply
providing a list of simple facts about the offspring relation, each fact
mentioning one pair of people such that one is an offspring of the other.
For example:
offspring( liz, tom).
However, the offspring relation can be defined much more elegantly by
making use of the fact that it is the inverse of parent, and that parent has
already been defined. This alternative way can be based on the following
logical statement:
For all X and Y,
Y is an offspring of X if
X is a parent of Y.
This formulation is already close to the formalism of PROLOG. The
corresponding PROLOG clause which has the same meaning is:
offspring( Y, X) :- parent( X, Y).
This clause can also be read as:
For all X and Y,
if X is a parent of Y then
Y is an offspring of X.
PROLOG clauses of this kind are called rules.
Difference between facts and rules: A fact is something that is always,
unconditionally, true. On the other hand, rules specify things that may be
true if some condition is satisfied. Therefore we say that rules have
a condition part and a conclusion part.
The conclusion part is also called the head of a clause and the condition
part the body of a clause. For example:
offspring( Y, X) :- parent( X, Y).
head: offspring( Y, X)    body: parent( X, Y)
If the condition parent( X, Y) is true then a logical consequence of this is
offspring( Y, X).
How rules are actually used by PROLOG is illustrated by the query
?- offspring( liz, tom).
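The way a rule derives new relationships from stored facts can be sketched in Python. This is only an illustration under stated assumptions: the set `parent` is hypothetical, and of its pairs only (tom, liz) is implied by the text; the function `offspring` mirrors the rule offspring(Y, X) :- parent(X, Y).

```python
# Hypothetical parent facts as (parent, child) pairs; only the pair
# (tom, liz) is implied by the text, the other is made up for illustration.
parent = {("tom", "liz"), ("tom", "bob")}


def offspring(y, x):
    """Rule offspring(Y, X) :- parent(X, Y)."""
    return (x, y) in parent


print(offspring("liz", "tom"))  # True: tom is a parent of liz
print(offspring("tom", "liz"))  # False: the relation is not symmetric
```

No offspring facts are stored at all; every answer is computed from the parent facts, which is the advantage of defining offspring by a rule.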
1.4 PROBLEM SOLVING
1.4.1 Water jug problem:
Problem Statement:
In the water jug problem in Artificial Intelligence, we are provided with
two jugs: one with the capacity to hold 3 gallons of water and the other
with the capacity to hold 4 gallons of water.
There is no other measuring equipment available and the jugs do not
have any markings on them. The agent's task is to get exactly 2 gallons
of water into the 4-gallon jug by using only these two jugs and no
other material. Initially, both jugs are empty.
To solve this problem, the following set of rules was proposed:
Production rules for solving the water jug problem
Here, let x denote the contents of the 4-gallon jug and y the contents of the 3-gallon jug.
S.No.  Initial state  Condition            Final state    Description of action taken
1.     (x,y)          if x<4               (4,y)          Fill the 4-gallon jug completely
2.     (x,y)          if y<3               (x,3)          Fill the 3-gallon jug completely
3.     (x,y)          if x>0               (x-d,y)        Pour some water out of the 4-gallon jug
4.     (x,y)          if y>0               (x,y-d)        Pour some water out of the 3-gallon jug
5.     (x,y)          if x>0               (0,y)          Empty the 4-gallon jug
6.     (x,y)          if y>0               (x,0)          Empty the 3-gallon jug
7.     (x,y)          if x+y>=4 and y>0    (4,y-[4-x])    Pour water from the 3-gallon jug until the 4-gallon jug is full
8.     (x,y)          if x+y>=3 and x>0    (x-[3-y],3)    Pour water from the 4-gallon jug until the 3-gallon jug is full
9.     (x,y)          if x+y<=4 and y>0    (x+y,0)        Pour all water from the 3-gallon jug into the 4-gallon jug
10.    (x,y)          if x+y<=3 and x>0    (0,x+y)        Pour all water from the 4-gallon jug into the 3-gallon jug
To solve the water jug problem in a minimum number of moves, the
following rules should be applied in the given sequence:
Solution of the water jug problem according to the production rules:
S.No.  4-gallon jug contents  3-gallon jug contents  Rule followed
1.     0 gallons              0 gallons              Initial state
2.     0 gallons              3 gallons              Rule no. 2
3.     3 gallons              0 gallons              Rule no. 9
4.     3 gallons              3 gallons              Rule no. 2
5.     4 gallons              2 gallons              Rule no. 7
6.     0 gallons              2 gallons              Rule no. 5
7.     2 gallons              0 gallons              Rule no. 9
At the 7th step, the goal state is reached.
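A shortest sequence of moves can also be found mechanically. The sketch below is one possible approach, assuming a breadth-first search over states (x, y); the function names `successors` and `solve` are illustrative, and the partial-pour rules 3 and 4 are left out because the amount d cannot be measured.

```python
from collections import deque

CAP_X, CAP_Y = 4, 3  # capacities of the 4-gallon and 3-gallon jugs


def successors(x, y):
    """Yield every state reachable from (x, y) in one move."""
    yield (CAP_X, y)                     # fill the 4-gallon jug
    yield (x, CAP_Y)                     # fill the 3-gallon jug
    yield (0, y)                         # empty the 4-gallon jug
    yield (x, 0)                         # empty the 3-gallon jug
    pour = min(y, CAP_X - x)             # pour 3-gallon jug into 4-gallon jug
    yield (x + pour, y - pour)
    pour = min(x, CAP_Y - y)             # pour 4-gallon jug into 3-gallon jug
    yield (x - pour, y + pour)


def solve(start=(0, 0), target=2):
    """Breadth-first search for a shortest path to a state with x == target."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state[0] == target:
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return path[::-1]
        for nxt in successors(*state):
            if nxt not in parents:
                parents[nxt] = state
                queue.append(nxt)
    return None


path = solve()
print(path)
```

Because breadth-first search explores states in order of distance from the start, the path it returns uses the minimum number of moves, although it need not be the same sequence as the one tabulated above.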
Aim: Writing clauses in PROLOG to solve water jug problem
Software used: SWI-Prolog
Program Listing:
:- dynamic visited_state/2.
/* Goal state: 2 gallons in the 4-gallon jug. */
state(2,0).
/* Fill the 4-gallon jug. */
state(X,Y) :- X < 4,
not(visited_state(4,Y)),
assert(visited_state(X,Y)),
format("Fill the 4-Gallon Jug: (~w,~w) --> (~w,~w)~n", [X,Y,4,Y]),
state(4,Y).
/* Fill the 3-gallon jug. */
state(X,Y) :- Y < 3,
not(visited_state(X,3)),
assert(visited_state(X,Y)),
format("Fill the 3-Gallon Jug: (~w,~w) --> (~w,~w)~n", [X,Y,X,3]),
state(X,3).
/* Empty the 4-gallon jug. */
state(X,Y) :- X > 0,
not(visited_state(0,Y)),
assert(visited_state(X,Y)),
format("Empty the 4-Gallon jug on ground: (~w,~w) --> (~w,~w)~n", [X,Y,0,Y]),
state(0,Y).
/* Empty the 3-gallon jug. */
state(X,Y) :- Y > 0,
not(visited_state(X,0)),
assert(visited_state(X,Y)),
format("Empty the 3-Gallon jug on ground: (~w,~w) --> (~w,~w)~n", [X,Y,X,0]),
state(X,0).
/* Pour from the 3-gallon jug until the 4-gallon jug is full. */
state(X,Y) :- X + Y >= 4,
Y > 0,
NEW_Y is Y - (4 - X),
not(visited_state(4,NEW_Y)),
assert(visited_state(X,Y)),
format("Pour water from 3-Gallon jug to 4-gallon until it is full: (~w,~w) --> (~w,~w)~n", [X,Y,4,NEW_Y]),
state(4,NEW_Y).
/* Pour from the 4-gallon jug until the 3-gallon jug is full. */
state(X,Y) :- X + Y >= 3,
X > 0,
NEW_X is X - (3 - Y),
not(visited_state(NEW_X,3)),
assert(visited_state(X,Y)),
format("Pour water from 4-Gallon jug to 3-gallon until it is full: (~w,~w) --> (~w,~w)~n", [X,Y,NEW_X,3]),
state(NEW_X,3).
/* Pour all the water from the 3-gallon jug into the 4-gallon jug. */
state(X,Y) :- X + Y =< 4,
Y > 0,
NEW_X is X + Y,
not(visited_state(NEW_X,0)),
assert(visited_state(X,Y)),
format("Pour all the water from 3-Gallon jug to 4-gallon: (~w,~w) --> (~w,~w)~n", [X,Y,NEW_X,0]),
state(NEW_X,0).
/* Pour all the water from the 4-gallon jug into the 3-gallon jug. */
state(X,Y) :- X + Y =< 3,
X > 0,
NEW_Y is X + Y,
not(visited_state(0,NEW_Y)),
assert(visited_state(X,Y)),
format("Pour all the water from 4-Gallon jug to 3-gallon: (~w,~w) --> (~w,~w)~n", [X,Y,0,NEW_Y]),
state(0,NEW_Y).
/* Start from two empty jugs. */
goal :-
retractall(visited_state(_,_)),
state(0,0).
1.4.2 Tic-Tac-Toe Problem:
Aim: Writing a PROLOG program to play Tic-Tac-Toe.
Theory: A board game (such as tic-tac-toe) is usually programmed as a
state machine. Depending on the current state and the player's
move, the game goes into the next state.
Tic-tac-toe (also called noughts and crosses, or Xs and Os) is a paper-and-pencil
game for two players, X and O, who take turns marking the spaces in a
3×3 grid.
The player who succeeds in placing three of their marks in a
horizontal, vertical or diagonal row wins the game. Players soon
discover that best play from both parties leads to a draw.
The game generalizes to an m,n,k-game in which two players
alternate placing stones of their own colour on an m×n board, with the
goal of getting k of their own colour in a row. Tic-tac-toe is the (3,3,3)-game.
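The winning condition described above (three marks in a horizontal, vertical or diagonal row) can be checked with a small lookup table. The following Python sketch is an illustration only; the names `LINES` and `wins`, and the convention that "b" marks a blank cell in a flat list of 9 cells, are assumptions chosen to resemble the PROLOG listing.

```python
# Index triples (0-based) that form a row, a column or a diagonal
# on a 3x3 board stored as a flat list of 9 cells.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]


def wins(board, player):
    """Return True if player occupies all three cells of some line."""
    return any(all(board[i] == player for i in line) for line in LINES)


board = ["x", "o", "b",
         "o", "x", "b",
         "b", "b", "x"]
print(wins(board, "x"))  # True: x holds the main diagonal
print(wins(board, "o"))  # False
```

The eight index triples play the same role as the eight rowwin, colwin and diagwin clauses of a PROLOG version.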
/*A Tic-Tac-Toe program in PROLOG. */
/*Predicates that define the winning conditions:*/
win(Board, Player) :- rowwin(Board, Player).
win(Board, Player) :- colwin(Board, Player).
win(Board, Player) :- diagwin(Board, Player).
rowwin(Board, Player) :- Board = [Player,Player,Player,_,_,_,_,_,_].
rowwin(Board, Player) :- Board = [_,_,_,Player,Player,Player,_,_,_].
rowwin(Board, Player) :- Board = [_,_,_,_,_,_,Player,Player,Player].
colwin(Board, Player) :- Board = [Player,_,_,Player,_,_,Player,_,_].
colwin(Board, Player) :- Board = [_,Player,_,_,Player,_,_,Player,_].
colwin(Board, Player) :- Board = [_,_,Player,_,_,Player,_,_,Player].
diagwin(Board, Player) :- Board = [Player,_,_,_,Player,_,_,_,Player].
diagwin(Board, Player) :- Board = [_,_,Player,_,Player,_,Player,_,_].
/*Helping predicates for alternating play in a "self" game: */
other(x,o).
other(o,x).
game(Board, Player) :- win(Board, Player), !, write([player, Player, wins]).
game(Board, Player) :-
other(Player, Otherplayer),
move(Board, Player, Newboard),
!,
display(Newboard),
game(Newboard, Otherplayer).
/* b marks a blank square; move/3 places Player on one blank square. */
move([b,B,C,D,E,F,G,H,I], Player, [Player,B,C,D,E,F,G,H,I]).
move([A,b,C,D,E,F,G,H,I], Player, [A,Player,C,D,E,F,G,H,I]).
move([A,B,b,D,E,F,G,H,I], Player, [A,B,Player,D,E,F,G,H,I]).
move([A,B,C,b,E,F,G,H,I], Player, [A,B,C,Player,E,F,G,H,I]).
move([A,B,C,D,b,F,G,H,I], Player, [A,B,C,D,Player,F,G,H,I]).
move([A,B,C,D,E,b,G,H,I], Player, [A,B,C,D,E,Player,G,H,I]).
move([A,B,C,D,E,F,b,H,I], Player, [A,B,C,D,E,F,Player,H,I]).
move([A,B,C,D,E,F,G,b,I], Player, [A,B,C,D,E,F,G,Player,I]).
move([A,B,C,D,E,F,G,H,b], Player, [A,B,C,D,E,F,G,H,Player]).
display([A,B,C,D,E,F,G,H,I]) :-
write([A,B,C]),nl,write([D,E,F]),nl,
write([G,H,I]),nl,nl.
selfgame :- game([b,b,b,b,b,b,b,b,b],x).
/* Predicates to support playing a game with the user:*/
x_can_win_in_one(Board) :- move(Board, x, Newboard), win(Newboard, x).
/*The predicate orespond generates the computer's (playing o) response from the
current Board. */
orespond(Board,Newboard) :-
move(Board, o, Newboard),
win(Newboard, o),
!.
orespond(Board,Newboard) :-
move(Board, o, Newboard),
not(x_can_win_in_one(Newboard)).
orespond(Board,Newboard) :-
move(Board, o, Newboard).
orespond(Board,Newboard) :-
not(member(b,Board)),
!,
write('Cats game!'), nl,
Newboard = Board.
/* Translation from an integer description of x's move to a board
transformation. */
xmove([b,B,C,D,E,F,G,H,I], 1, [x,B,C,D,E,F,G,H,I]).
xmove([A,b,C,D,E,F,G,H,I], 2, [A,x,C,D,E,F,G,H,I]).
xmove([A,B,b,D,E,F,G,H,I], 3, [A,B,x,D,E,F,G,H,I]).
xmove([A,B,C,b,E,F,G,H,I], 4, [A,B,C,x,E,F,G,H,I]).
xmove([A,B,C,D,b,F,G,H,I], 5, [A,B,C,D,x,F,G,H,I]).
xmove([A,B,C,D,E,b,G,H,I], 6, [A,B,C,D,E,x,G,H,I]).
xmove([A,B,C,D,E,F,b,H,I], 7, [A,B,C,D,E,F,x,H,I]).
xmove([A,B,C,D,E,F,G,b,I], 8, [A,B,C,D,E,F,G,x,I]).
xmove([A,B,C,D,E,F,G,H,b], 9, [A,B,C,D,E,F,G,H,x]).
xmove(Board, _, Board) :- write('Illegal move.'), nl.
/* The 0-place predicate playo starts a game with the user. */
playo :- explain, playfrom([b,b,b,b,b,b,b,b,b]).
explain :-
write('You play X by entering integer positions followed by a period.'),
nl,
display([1,2,3,4,5,6,7,8,9]).
playfrom(Board) :- win(Board, x), write('You win!').
playfrom(Board) :- win(Board, o), write('I win!').
playfrom(Board) :- read(N),
xmove(Board, N, Newboard),
display(Newboard),
orespond(Newboard, Newnewboard),
display(Newnewboard),
playfrom(Newnewboard).
1.4.3 8-Puzzle Problem:
/* This predicate initialises the problem states. The first argument of
solve/3 is the initial state, the 2nd the goal state, and the third the plan that
will be produced. */
test(Plan) :-
write('Initial state:'),nl,
Init= [at(tile4,1), at(tile3,2), at(tile8,3), at(empty,4), at(tile2,5),
at(tile6,6), at(tile5,7), at(tile1,8), at(tile7,9)],
write(Init),nl,
Goal= [at(tile1,1), at(tile2,2), at(tile3,3), at(tile4,4), at(empty,5),
at(tile5,6), at(tile6,7), at(tile7,8), at(tile8,9)],
nl,write('Goal state:'),nl,
write(Goal),nl,nl,
solve(Init, Goal, Plan).
solve(State, Goal, Plan) :-
solve(State, Goal, [], Plan).
/*Determines whether Current and Destination tiles are a valid move. */
is_movable(X1,Y1) :- (1 is X1 - Y1) ; (-1 is X1 - Y1) ; (3 is X1 - Y1) ; (-3
is X1 - Y1).
/*This predicate produces the plan. Once the Goal list is a subset of the
current State the plan is complete and it is written to the screen using
write_sol */
solve(State, Goal, Plan, Plan) :-
is_subset(Goal, State), nl,
write_sol(Plan).
solve(State, Goal, Sofar, Plan) :-
act(Action, Preconditions, Delete, Add),
is_subset(Preconditions, State),
\+ member(Action, Sofar),
delete_list(Delete, State, Remainder),
append(Add, Remainder, NewState),
solve(NewState, Goal, [Action|Sofar], Plan).
/* The action: its preconditions, delete list and add list.
The name of the action term, move(X,Y,Z), is assumed here. */
act(move(X,Y,Z),
[at(X,Y), at(empty,Z), is_movable(Y,Z)],
[at(X,Y), at(empty,Z)],
[at(X,Z), at(empty,Y)]).
/*Check if the first list is a subset of the second */
is_subset([H|T], Set) :-
member(H, Set),
is_subset(T, Set).
is_subset([], _).
/* Remove all elements of 1st list from second to create third. */
delete_list([H|T], Curstate, Newstate) :-
remove(H, Curstate, Remainder),
delete_list(T, Remainder, Newstate).
delete_list([], Curstate, Curstate).
remove(X, [X|T], T).
remove(X, [H|T], [H|R]) :-
remove(X, T, R).
/* Write the plan, earliest action first. */
write_sol([]).
write_sol([H|T]) :-
write_sol(T),
write(H), nl.
append([H|T], L1, [H|L2]) :-
append(T, L1, L2).
append([], L, L).
member(X, [X|_]).
member(X, [_|T]) :-
member(X, T).
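The notion of a valid move in the 8-puzzle can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not a translation of the listing: positions are numbered 1 to 9 row by row, 0 marks the empty cell, and the helper names `neighbours` and `slide` are hypothetical. Adjacency is checked by row and column, because a position difference of 1 alone would also treat positions 3 and 4 (end of one row, start of the next) as neighbours.

```python
def neighbours(pos):
    """Return the board positions (1..9) adjacent to pos on a 3x3 grid."""
    row, col = divmod(pos - 1, 3)
    result = []
    if row > 0:
        result.append(pos - 3)  # cell above
    if row < 2:
        result.append(pos + 3)  # cell below
    if col > 0:
        result.append(pos - 1)  # cell to the left
    if col < 2:
        result.append(pos + 1)  # cell to the right
    return result


def slide(state, tile_pos):
    """Slide the tile at tile_pos into the empty cell, if adjacent.

    state is a tuple of 9 entries (index 0 = position 1); 0 marks the
    empty cell. Returns the new state, or None if the move is illegal.
    """
    empty_pos = state.index(0) + 1
    if tile_pos not in neighbours(empty_pos):
        return None
    cells = list(state)
    cells[empty_pos - 1], cells[tile_pos - 1] = cells[tile_pos - 1], 0
    return tuple(cells)


# Initial state from the listing: tile numbers by position, 0 = empty.
init = (4, 3, 8, 0, 2, 6, 5, 1, 7)
print(neighbours(4))   # [1, 7, 5]
print(slide(init, 5))  # tile 2 moves from position 5 into position 4
```

A search procedure such as breadth-first search could call `slide` for each neighbour of the empty cell to generate successor states.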
1.5 SUMMARY
This chapter explains how PROLOG is used for logic programming.
Different applications such as the water jug problem, tic-tac-toe and the
8-puzzle problem are described.
1.6 UNIT END EXERCISES
1. Write a PROLOG program to prove that a person is a human
2. Explain the object and property relations
3. Write a Towers of Hanoi program using PROLOG concepts
1.7 REFERENCES
1. Logic Programming with Prolog, Max Bramer, Springer
2. Prolog Programming for Artificial Intelligence, Ivan Bratko, E. Kardelj University
and J. Stefan Institute
Unit Structure
2.1 NumPy
2.2 Pandas
2.3 SciPy
2.4 Matplotlib
2.5 Scikit Learn.
2.1 NUMPY
● A Python library is a collection of ready-made modules.
● The library can be used whenever we need it.
● If a particular requirement arises while we are writing code, then
instead of writing all the code ourselves we can use the ready-made
code available in the library.
● Using a library therefore saves a great deal of time.
● We can relate a Python library to a real-world book library.
A book library holds a whole set of books, and
we choose a book according to our requirements. Similarly, from
a Python library we choose the particular piece of code which is
required.
● Python library files are usually ".py" source files; compiled extension
modules use platform formats such as ".pyd" on Windows (a DLL) or ".so" on Linux.
● The full form of DLL is Dynamic Link Library.
● When we import a library in our program, Python searches for it during
execution and loads the particular module that is needed.
● In this module we study NumPy, which is one of the
libraries in Python.
● NumPy stands for Numerical Python.
● It is one of the most widely used libraries.
● As it contains code for numerical work, it is most popular
in data science and machine learning, as both these fields apply a
lot of numerical logic.
● It is used whenever code needs to work with an
array.
● It has methods for linear algebra and related operations.
● NumPy was created in the year 2005.
● Example:
Let's try to create an array using NumPy:
import numpy as ab
ar = ab.array([1, 2, 3, 4, 5])
print(ar)
Output:
[1 2 3 4 5]
● In the above example, the first line imports the library by
naming numpy.
● We have given the library the name ab, so now wherever the program
needs numpy we just type ab.
● Then we created the variable called ar and stored the array data in
it.
● Then we printed it.
● So the output shows the array data that has been inserted.
# The standard way to import NumPy:
import numpy as np
# Create a 2-D array, set every second element in
# some rows and find max per row:
x = np.arange(15, dtype=np.int64).reshape(3, 5)
x[1:, ::2] = -99
x
# array([[  0,   1,   2,   3,   4],
#        [-99,   6, -99,   8, -99],
#        [-99,  11, -99,  13, -99]])
x.max(axis=1)
# array([ 4,  8, 13])
# Generate normally distributed random numbers:
rng = np.random.default_rng()
samples = rng.normal(size=2500)
samples
# array([ 0.38054775, -0.06020411,  0.07380668, ...,  1.07546484,
#        -0.20855135,  0.09773109])
2.2 PANDAS
● The main role of the pandas library is to analyze data.
● It is open source.
● It is used with relational (tabular) data.
● Pandas is built on top of the NumPy library.
● It is fast.
● It was created in the year 2008.
● It handles data efficiently.
● Pandas does not require the data to belong to a single
category; it supports many kinds of data.
● By using pandas you can reshape, analyze, and change your data very
easily.
● Pandas supports two data structures:
1. Series:
It is a one-dimensional labeled array.
It can hold any data type, such as integer, float, character etc.
It corresponds to a single column.
Example 1: In the example below, each column, i.e. Name and Roll no,
corresponds to a Series. Assuming pandas has been imported as se and df is a
DataFrame holding the table, it is written in the following manner in the code:
ab=se.Series(df['Name'])
ab=se.Series(df['Roll no'])
Name       Roll no
Madhusri   01
Srivatsan  22
Anuradha   6
Balaguru   55
Example 2:
import pandas as ab
import numpy as sj
# Creating empty series
ser = ab.Series(dtype="float64")
print(ser)
# simple array
data = sj.array(['g', 'e', 'e', 'k', 's'])
ser = ab.Series(data)
print(ser)
In the above example two libraries, NumPy and pandas, are imported and used.
The pandas library is referred to as ab and the NumPy library as sj.
First an empty Series is created and printed.
Then an array is created using NumPy, a Series is built from it,
and finally that Series is printed.
The output is shown below.
Series([], dtype: float64)
0 g
1 e
2 e
3 k
4 s
dtype: object
2. Data Frame:
It consists of 3 parts: data, columns and rows.
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Madhu', 'For', 'Madhusri', 'is',
'portal', 'for', 'students']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
          0
0     Madhu
1       For
2  Madhusri
3        is
4    portal
5       for
6  students
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Output:
    Duration  Pulse  Maxpulse  Calories
0         60    110       130     409.1
1         60    117       145     479.0
2         60    103       135     340.0
3         45    109       175     282.4
4         45    117       148     406.0
5         60    102       127     300.5
6         60    110       136     374.0
7         45    104       134     253.3
8         30    109       133     195.1
9         60     98       124     269.0
10        60    103       147     329.3
11        60    100       120     250.7
12        60    106       128     345.3
13        60    104       132     379.3
14        60     98       123     275.0
15        60     98       120     215.2
And so on.
2.3 SCIPY
● SciPy is built on top of NumPy.
● It provides scientific and mathematical routines.
● It makes Python very effective for scientific work, including interactive use.
● It stands for "Scientific Python".
● It is open source.
● SciPy routines operate on N-dimensional NumPy arrays.
● Some sub-packages of SciPy are as follows:
● scipy.cluster: K-means and similar algorithms can be
done using this package.
● scipy.io: Inputs and outputs are handled here.
import scipy
print(scipy.__version__)
2.4 MATPLOTLIB
● It is used to plot graphs.
● John D. Hunter created it.
● It is open source.
● Matplotlib must be installed with pip, otherwise the code will
not execute. To do this, open a command prompt, go to the folder where Python
is located and type the following command: pip install matplotlib
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
2.5 SCIKIT LEARN
● It is mainly used in machine learning.
● It has a lot of statistics-related tools.
● It is open source.
● Using scikit-learn saves considerable effort, because its implementations
are efficient and well tested.
● It is very useful for well-known machine learning algorithms
such as K-means clustering and K-nearest neighbours.
● It is freely available, so any programmer who wishes to
use it can do so.
● Scikit-learn requires NumPy.
● Scikit-learn must be installed for the programs to run; this can be done
in the following manner: pip install -U scikit-learn
● Example:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 10 rows of X:
[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
● Features of Scikit learn are as follows:
● Clustering: Scikit can be applied in clustering algorithm, in clustering
the grouping is done on the basis of similarities like eg: age, color etc.
● Cross validation
● Feature selection
● Example:
# importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# read the train and test dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)
# Now, we need to predict the missing target variable in the test data
# target variable - Item_Outlet_Sales
# separate the independent and target variable on training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'])
train_y = train_data['Item_Outlet_Sales']
# separate the independent and target variable on testing data
test_x = test_data.drop(columns=['Item_Outlet_Sales'])
test_y = test_data['Item_Outlet_Sales']
# Create the object of the Linear Regression model
# You can also add other parameters and test your code here
# Some parameters are : fit_intercept and normalize
# (normalize was removed in scikit-learn 1.2)
# Documentation of sklearn Linear Regression:
# https://scikit -
model = LinearRegression()
# fit the model with the training data
model.fit(train_x,train_y)
# coefficients of the trained model
print('\nCoefficient of model :', model.coef_)
# intercept of the model
print('\nIntercept of model',model.intercept_)
# predict the target on the training dataset
predict_train = model.predict(train_x)
print('\nItem_Outlet_Sales on training data',predict_train)
# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)
# predict the target on the testing dataset
predict_test = model.predict(test_x)
print('\nItem_Outlet_Sales on test data',predict_test)
# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)
Item_Weight ... Outlet_Type_Supermarket Type3
0 6.800000 ... 0
1 15.600000 ... 0
2 12.911575 ... 1
3 11.800000 ... 0
4 17.850000 ... 0
[5 rows x 36 columns]
Shape of training data : (1364, 36)
Shape of testing data : (341, 36)
Coefficient of model :
[-3.84197604e+00 9.83065945e+00 1.61711856e+01 6.09197622e+01
-8.64161561e+01 1.23593376e+02 2.34714039e+02 -2.44597425e+02
-2.72938329e+01 -8.09611456e+00 -3.01147840e+02 1.70727611e+02
-5.40194744e+01 7.34248834e+01 1.70313375e+00 -5.07701615e+01
1.63553657e+02 -5.85286125e+01 1.04913492e+02 -6.01944874e+01
1.98948206e+02 -1.40959023e+02 1.19426257e+02 2.66382669e+01
Unit Structure
3.0 Objectives
3.1 Introduction - Regression
3.1.1 What is a Regression
3.2 Types of Regression models
3.2.1 Linear Regression
3.2.2 Need of a Linear regression
3.2.3 Positive Linear Relationship
3.2.4 Negative Linear Relationship
3.3 Cost function
3.3.1 Gradient descent
3.3.2 Impact of different values for learning rate
3.3.3 Use case
3.3.4 Steps to implement linear regression model
3.4 What is logistic regression?
3.4.1 Hypothesis
3.4.2 A sigmoid function
3.5 Cost function
3.5.1 Gradient Descent
3.6 Lets Sum up
3.7 Exercises
3.8 References
3.0 OBJECTIVES
This chapter will help you understand the following concepts:
● What is regression?
● Types of regression.
● What linear regression means and the importance of linear
regression.
● Importance of the cost function and gradient descent in linear
regression.
● Impact of different values for the learning rate.
● What logistic regression means and the importance of logistic
regression.
● Importance of the cost function and gradient descent in logistic
regression.
3.1 INTRODUCTION – REGRESSION
Regression is a supervised learning technique that helps find the
correlation among variables.
A regression problem is one in which the output variable is a real or continuous
value.
3.1.1 What is a Regression:
In regression, we fit a line or curve through the variables that best fits the
given data points. The machine learning model can then deliver predictions
regarding the data. In simple terms, regression finds a line or curve
on a target-predictor graph such that the vertical distance between
the data points and the regression line is minimum. It is used principally for prediction,
forecasting, time series modeling, and determining the cause-and-effect
relationship between variables.
3.2 TYPES OF REGRESSION MODELS
1. Linear Regression
2. Polynomial Regression
3. Logistic Regression
3.2.1 Linear Regression:
Linear regression is a simple statistical regression method used
for predictive analysis; it shows the relationship between continuous
variables. It models a linear relationship between the
independent variable (X-axis) and the dependent variable (Y-axis),
hence the name linear regression. If there is a single input variable (x),
it is called simple linear regression. If there is
more than one input variable, it is called multiple
linear regression. The linear regression model gives a sloped straight line
describing the relationship between the variables.
The graph presents the linear relationship between the dependent
variable and the independent variable. When the value of x (independent
variable) increases, the value of y (dependent variable) likewise
increases. The red line is referred to as the best-fit straight line. Based on
the given data points, we try to plot a line that models the points the best.
To calculate the best-fit line, linear regression uses the traditional slope-intercept
form:
$$y = a_0 + a_1 x$$
y = Dependent variable.
x = Independent variable.
a0 = Intercept of the line.
a1 = Linear regression coefficient.
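For a single input variable, the intercept a0 and coefficient a1 of the best-fit line can be computed directly. The sketch below assumes the ordinary least-squares closed form (slope = covariance of x and y divided by variance of x), which is one standard way to obtain the line; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a0, a1) of the least-squares line y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

Because the sample points lie exactly on a line, the fitted intercept and slope recover that line; with noisy data they give the line with the smallest sum of squared vertical distances.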
3.2.2 Need of a Linear regression:
Linear regression estimates the relationship between a dependent variable
and an independent variable. Let's say we want to estimate the salary of an
employee based on years of experience. We have recent company data
which indicates the relationship between experience and salary. Here
years of experience is the independent variable, and the salary of an
employee is the dependent variable, as the salary
depends on the experience of the employee. Using this insight, we can
predict the future salary of an employee based on current and past
information.
A regression line can show a Positive Linear Relationship or a Negative
Linear Relationship.
3.2.3 Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent
variable increases on the X-axis, then such a relationship is termed a positive
linear relationship.
3.2.4 Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent
variable increases on the X-axis, such a relationship is called a negative
linear relationship.
The goal of the linear regression algorithm is to get the best values for a0
and a1 to find the best-fit line. The best-fit line should have the least error,
which means the error between predicted values and actual values should be
minimized.
3.3 COST FUNCTION
The cost function helps to figure out the best possible values for a0 and a1,
which provide the best-fit line for the data points.
The cost function is used to optimize the regression coefficients or weights and
measures how well a linear regression model is performing. The cost function
is used to find the accuracy of the mapping function that maps the input
variable to the output variable. This mapping function is also known
as the hypothesis function.
In linear regression, the Mean Squared Error (MSE) cost function is used,
which is the average of the squared errors between the predicted
values and actual values.
For the linear equation y = mx + b (here m = a1 and b = a0), the MSE over N data points is:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (a_0 + a_1 x_i) \right)^2$$
where y_i are the actual values and a_0 + a_1 x_i are the predicted values.
Using the MSE function, we change the values of a0 and a1 such that
the MSE value settles at the minimum. The model parameters (a0, a1) can be
adjusted to minimize the cost function. These parameters can be
determined using the gradient descent method so that the cost function
value is minimum.
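The MSE cost can be computed in a few lines. The following is a minimal Python sketch in which the function name `mse` and the sample data are assumptions made for illustration; it evaluates the cost of a candidate line y = a0 + a1 x on a small data set.

```python
def mse(xs, ys, a0, a1):
    """Mean squared error of the line y = a0 + a1 * x on the data."""
    n = len(xs)
    return sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys)) / n


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
print(mse(xs, ys, 1, 2))  # 0.0: this line passes through every point
print(mse(xs, ys, 0, 2))  # 1.0: every prediction is off by exactly 1
```

A lower value means the candidate line fits the data better, which is why minimizing this function leads to the best-fit line.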
3.3.1 Gradient descent:
Gradient descent is a method of updating a0 and a1 to minimize the cost function (MSE). A regression model starts from an arbitrary (for example, random) choice of coefficient values and then iteratively updates them to reduce the cost function until it reaches its minimum.
Imagine a U-shaped pit. You are standing at the topmost point of the pit, and your objective is to reach the treasure at the bottom in a discrete number of steps. If you take one small footstep at a time, you will eventually get to the bottom, but it will take a long time. If you take longer steps each time, you may get there sooner, but there is a chance that you overshoot the bottom of the pit and end up nowhere near it. In the gradient descent algorithm, the size of the steps you take is the learning rate, and it decides how fast the algorithm converges to the minimum.
To update a0 and a1, we take gradients of the cost function. To find these gradients, we take the partial derivatives with respect to a0 and a1.
The partial derivatives are the gradients, and they are used to update the values of a0 and a1. Alpha is the learning rate.
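The partial derivatives can be written out explicitly. The following is a reconstruction, assuming the cost is the MSE of the line a0 + a1*x over n data points and that α denotes the learning rate:

```latex
\begin{aligned}
J(a_0, a_1) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (a_0 + a_1 x_i)\bigr)^2 \\
\frac{\partial J}{\partial a_0} &= -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - (a_0 + a_1 x_i)\bigr) \\
\frac{\partial J}{\partial a_1} &= -\frac{2}{n}\sum_{i=1}^{n} x_i\,\bigl(y_i - (a_0 + a_1 x_i)\bigr) \\
a_0 &\leftarrow a_0 - \alpha\,\frac{\partial J}{\partial a_0}, \qquad
a_1 \leftarrow a_1 - \alpha\,\frac{\partial J}{\partial a_1}
\end{aligned}
```

The implementation in section 3.3.4 sums the per-point terms without the constant factor 1/n; this only rescales the effective learning rate.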
3.3.2 Impact of different values for learning rate:
The blue line represents the optimal value of the learning rate: the cost function value is minimized in a few iterations. The green line represents a learning rate lower than the optimal value: more iterations are required to minimize the cost function. If the learning rate selected is very high, the cost function can continue to increase with iterations and saturate at a value higher than the minimum; this is represented by the red and black lines.
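The effect of the learning rate can be reproduced numerically. The following is a minimal sketch, assuming batch gradient descent on the MSE of a straight line; the function `run_gradient_descent`, the data, and the three learning rates are hypothetical choices for illustration.

```python
import numpy as np

def run_gradient_descent(x, y, lr, iterations=50):
    """Batch gradient descent on the MSE of y_hat = a0 + a1*x.

    Returns the cost recorded at the start of every iteration.
    """
    a0, a1 = 0.0, 0.0
    n = len(x)
    costs = []
    for _ in range(iterations):
        residual = y - (a0 + a1 * x)
        costs.append(np.mean(residual ** 2))
        grad_a0 = -2.0 / n * np.sum(residual)        # dJ/da0
        grad_a1 = -2.0 / n * np.sum(x * residual)    # dJ/da1
        a0 -= lr * grad_a0
        a1 -= lr * grad_a1
    return costs

# Hypothetical data lying on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

slow = run_gradient_descent(x, y, lr=0.001)   # too small: the cost falls slowly
good = run_gradient_descent(x, y, lr=0.05)    # reasonable: the cost falls quickly
large = run_gradient_descent(x, y, lr=0.5)    # too large: the cost grows at every step

print(slow[-1], good[-1], large[-1])
```

On this data, the small learning rate is still far from the minimum after 50 iterations, the moderate one is close to it, and the large one diverges.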
3.3.3 Use case:
In this use case, we take random numbers for the dependent variable (salary) and the independent variable (experience) and predict the impact of a year of experience on salary.
3.3.4 Steps to implement linear regression model:
a) Import some required libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
b) Define the dataset
experience = np.array([2.4,5.0,1.5,3.8,8.7,3.6,1.2,8.1,2.5,5,1.6,1.6,2.4,3.9,5.4])   # independent variable
salary = np.array([2.1,4.7,1.7,3.6,8.7,3.2,1.0,8.0,2.4,6,1.1,1.3,2.4,3.9,4.8])   # dependent variable
n = np.size(experience)
c) Plot the data points
plt.scatter(experience,salary, color = 'red')
The main function to calculate the values of the coefficients:
1. Initialize the parameters.
2. Predict the value of the dependent variable for a given independent variable.
3. Calculate the error in prediction for all data points.
4. Calculate the partial derivatives w.r.t. a0 and a1.
5. Calculate the cost for each data point and add them.
6. Update the values of a0 and a1.
Initialize the parameters:
a0 = 0 # intercept
a1 = 0 # slope
lr = 0.0001 # learning rate
iterations = 1000 # number of iterations
error = [] # list that records the cost for each iteration
for itr in range(iterations):
    error_cost = 0
    cost_a0 = 0
    cost_a1 = 0
    for i in range(len(experience)):
        y_pred = a0 + a1*experience[i]    # predict value for given x
        error_cost = error_cost + (salary[i] - y_pred)**2
    for j in range(len(experience)):
        # partial derivative w.r.t. a0
        partial_wrt_a0 = -2*(salary[j] - (a0 + a1*experience[j]))
        # partial derivative w.r.t. a1
        partial_wrt_a1 = (-2*experience[j])*(salary[j] - (a0 + a1*experience[j]))
        cost_a0 = cost_a0 + partial_wrt_a0    # accumulate the gradient over all points
        cost_a1 = cost_a1 + partial_wrt_a1
    a0 = a0 - lr*cost_a0    # update a0
    a1 = a1 - lr*cost_a1    # update a1
    print(itr, a0, a1)    # check iteration and updated a0 and a1
    error.append(error_cost)    # record the cost of this iteration
In this run, the values of a0 and a1 largely settle after approximately 50-60 iterations.
Plotting the error for each iteration:
plt.plot(np.arange(1,len(error)+1),error,color='red',linewidth = 5)
plt.title("Iteration vs error")
Predicting the values :
pred = a0+a1*experience
Plot the regression line:
plt.scatter(experience,salary,color = 'red')
plt.plot(experience,pred, color = 'green')
Analyze the performance of the model by calculating the mean
squared error.
error1 = salary - pred
se = np.sum(error1 ** 2)
mse = se/n
print("mean squared error is", mse)
Use the scikit-learn library to confirm the above steps:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
experience = experience.reshape(-1,1)   # scikit-learn expects a 2-D feature array
model = LinearRegression()
model.fit(experience,salary)
salary_pred = model.predict(experience)
Mse = mean_squared_error(salary, salary_pred)
print('slope', model.coef_)
print("Intercept", model.intercept_)
print("MSE", Mse)
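As a further cross-check that does not depend on the learning rate or the iteration count, the least-squares line can also be computed in closed form. The sketch below uses `np.polyfit` on the same experience/salary values as the use case; treating this as the reference that gradient descent approaches is an assumption of the example, not part of the original steps.

```python
import numpy as np

experience = np.array([2.4, 5.0, 1.5, 3.8, 8.7, 3.6, 1.2, 8.1, 2.5, 5, 1.6, 1.6, 2.4, 3.9, 5.4])
salary = np.array([2.1, 4.7, 1.7, 3.6, 8.7, 3.2, 1.0, 8.0, 2.4, 6, 1.1, 1.3, 2.4, 3.9, 4.8])

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
a1, a0 = np.polyfit(experience, salary, deg=1)

pred = a0 + a1 * experience
mse = np.mean((salary - pred) ** 2)

# The same slope from the textbook formula cov(x, y) / var(x)
slope_from_formula = np.cov(experience, salary, bias=True)[0, 1] / np.var(experience)

print("slope", a1, "intercept", a0, "MSE", mse)
```

Because the closed-form solution minimizes the MSE exactly, any other line (for example a0 = 0, a1 = 1) has an MSE that is at least as large.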
3.4 WHAT IS LOGISTIC REGRESSION?
Logistic regression is a supervised learning algorithm that outputs values between zero and one.
3.4.1 Hypothesis:
The objective of logistic regression is to learn a function that outputs the probability that the dependent variable is one for each training sample. To achieve that, a sigmoid (logistic) function is required for the hypothesis.
3.4.2 A sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
Visually, it looks like this:
Fig. 1. Sigmoid Function
The hypothesis is typically represented by the following function:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
θ is a vector of parameters, one for each independent variable
x is a vector of independent variables
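3.5 COST FUNCTION
The cost function for logistic regression is derived from statistics using the principle of maximum likelihood estimation, which allows efficient identification of parameters. In addition, the convex property of the cost function allows gradient descent to work effectively.
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\bigl(h_\theta(x_i), y_i\bigr), \qquad \mathrm{Cost}\bigl(h_\theta(x_i), y_i\bigr) = \begin{cases} -\log\bigl(h_\theta(x_i)\bigr) & \text{if } y_i = 1 \\ -\log\bigl(1 - h_\theta(x_i)\bigr) & \text{if } y_i = 0 \end{cases}$$
● i indexes one of the m training samples
● hƟ(xi) is the predicted value for the training sample
● yi is the actual value for the training sample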
To understand the cost function, we can look into each of the two components in isolation:
Suppose yi=1:
if hƟ(xi) = 1, then the prediction error = 0
if hƟ(xi) approaches 0, then the prediction error approaches infinity
These two scenarios are represented by the blue line in Figure 2 below.
Suppose yi=0:
if hƟ(xi) = 0, then the prediction error = 0
if hƟ(xi) approaches 1, then the prediction error approaches infinity
These two scenarios are represented by the blue line in Figure 2 below.
Fig. 2. Logistic Regression Cost Function
The logistic regression cost function can be further simplified into a one-line equation:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr)\Bigr]$$
The overall objective is to minimise the cost function by iterating through different values of Ɵ.
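3.5.1 Gradient Descent
The gradient descent algorithm is as follows:
repeat until convergence
$$\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x_i) - y_i\bigr)\,x_{ij}$$
● for all values of j = 0, 1, …, n, updated simultaneously
● α is the learning rate
Note: The gradient descent update has the same form as linear regression's; only the hypothesis hƟ(x) differs.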
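The cost function and the update rule for logistic regression can be sketched in NumPy as follows. This is a minimal illustration, not the manual's own code: the function names, the tiny one-feature dataset, the learning rate, and the iteration count are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) averaged over the m training samples."""
    h = sigmoid(X @ theta)
    eps = 1e-12                      # guards against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Repeat theta := theta - alpha * (1/m) * X^T (h - y)."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / m
    return theta

# Hypothetical data: larger x values tend to have label 1 (not perfectly separable)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])   # first column is x_0 = 1 for the intercept

theta = gradient_descent(X, y)
initial_cost = cost(np.zeros(2), X, y)      # every prediction is 0.5, so the cost is log(2)
final_cost = cost(theta, X, y)
probabilities = sigmoid(X @ theta)          # predicted P(y = 1) for each sample

print(theta, initial_cost, final_cost)
```

Starting from θ = 0, every prediction is 0.5 and the cost equals log(2); gradient descent then lowers the cost while the predicted probabilities stay between zero and one.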
3.6 LETS SUM UP
What is a Regression?
Types of a Regression.
What is meant by linear regression and the importance of linear regression.
Importance of cost function and gradient descent in linear regression.
Impact of different values for learning rate.
What is meant by logistic regression and the importance of logistic regression.
Importance of cost function and gradient descent in logistic regression.
3.7 EXERCISES
Differentiate linear regression and logistic regression with a real-time example.
Unit Structure
4.0 Objectives
4.1 Advanced Optimization Algorithms
4.1.1 Multiclass Classification
4.1.2 Bias-Variance Tradeoff
4.1.3 Regularization
4.2 Applications of Linear/Logistic regression.
4.2.1 Two things you can do using regression are
4.2.2 Application of logistic regression
4.3 K-nearest Neighbors (KNN) Classification Model
4.4 Lets Sum up
4.5 References
4.6 Exercises
4.0 OBJECTIVES
This chapter will help you understand the following concepts:
Advanced Optimization Algorithms
Applications of Linear/Logistic regression.
KNN - classification
4.1 ADVANCED OPTIMIZATION ALGORITHMS
Gradient descent is not the only algorithm that can minimize the cost function. Other advanced optimization algorithms include:
● Conjugate gradient
While these advanced algorithms are more complex and difficult to understand, they have the advantages of converging faster and not requiring you to pick a learning rate.
4.1.1 Multiclass Classification:
One-vs-rest is a method where you turn an n-class classification problem into n separate binary classification problems.
To deal with a multiclass problem, we train a logistic regression binary classifier for each class i to predict the probability that y = i. The prediction output for a given new input is chosen based on the classifier that has the highest probability:
$$\hat{y} = \arg\max_i \; h_\theta^{(i)}(x)$$
where $h_\theta^{(i)}(x)$ is the binary classifier for class i.
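One-vs-rest can be sketched by hand with scikit-learn's `LogisticRegression` on the Iris data. This is an illustrative sketch under stated assumptions (the choice of dataset, `max_iter=1000`, and the variable names are not from the original text), and it evaluates on the training data only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Train one binary classifier per class: "class c" versus "everything else"
classifiers = []
for c in classes:
    binary_target = (y == c).astype(int)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, binary_target)
    classifiers.append(clf)

# Column i holds the probability that y = i according to classifier i
probabilities = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])

# Predict the class whose classifier reports the highest probability
y_pred = classes[np.argmax(probabilities, axis=1)]
accuracy = np.mean(y_pred == y)
print(accuracy)
```

Each classifier only answers a yes/no question about its own class; the argmax over the three probabilities turns those answers into a single multiclass prediction.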
4.1.2 Bias-Variance Tradeoff:
Overfitting occurs when the algorithm tries too hard to fit the training data. This usually results in a learned hypothesis that is too complex, fails to generalize to new examples, and has a cost function that is very close to zero on the training set. On the contrary, underfitting occurs when the algorithm tries too little to fit the training data. This usually results in a learned hypothesis that is not complex enough and fails to generalize to new examples.
Underfitting and Overfitting
Conceptually speaking, bias measures the difference between model predictions and the correct values. Variance refers to the variability of a model prediction for a given data point if you re-build the model multiple times.
As seen in Figure 4, the optimal level of model complexity is where prediction error on unseen data points is minimized. Below the optimal level of model complexity, bias will increase while variance will decrease due to a hypothesis that is too simplified. On the contrary, a very complex model will result in a low bias and high variance situation.
Bias-Variance Tradeoff
4.1.3 Regularization:
For a model to generalize well, regularization is usually introduced to reduce overfitting of the training data.
This is represented by a regularization term that is added to the cost function and penalizes all parameters that are high in value. This leads to a simpler hypothesis that is less prone to overfitting. The new cost function then becomes:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr)\Bigr] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
● i is one of the training samples
● hƟ(xi) is the predicted value for the training sample i
● yi is the actual value for the training sample i
● λ is the regularization parameter that controls the tradeoff between fitting the training dataset well and keeping the parameters θ small in value
● j indexes one of the parameters θ
The overall objective remains the same: minimize the cost function J(θ) over θ.
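Gradient descent remains the same as well:
repeat until convergence
$$\theta_j := \theta_j - \alpha\Bigl[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x_i) - y_i\bigr)\,x_{ij} + \frac{\lambda}{m}\theta_j\Bigr] \quad \text{for } j = 1, \dots, n$$
(θ0 is updated without the λ term.)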
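The effect of the regularization term can be seen in a small experiment. The sketch below assumes an L2 penalty of (λ/2m)·Σθj² that leaves the intercept θ0 unpenalized; the function name, data, and the value λ = 10 are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_regularized(X, y, lam, alpha=0.1, iterations=5000):
    """Gradient descent on the L2-regularized logistic cost.

    theta[0] (the intercept) is not penalized.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m
        grad[1:] += (lam / m) * theta[1:]   # gradient of the penalty term
        theta -= alpha * grad
    return theta

# Hypothetical one-feature data with an intercept column
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

theta_plain = fit_regularized(X, y, lam=0.0)    # no regularization
theta_reg = fit_regularized(X, y, lam=10.0)     # strong regularization

print(theta_plain, theta_reg)
```

With the larger λ, the slope parameter is pulled toward zero, which corresponds to the simpler hypothesis described in this section.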
4.2 APPLICATIONS OF LINEAR/LOGISTIC REGRESSION
Regression models are generally built on historical data which has some independent variables and a dependent variable. A dependent variable is a characteristic or quantity that you want to measure using the independent variables.
4.2.1 Two things you can do using regression are:
1. Find the impact of the independent variables on the response based on the historical data.
2. Use this generalization to predict what can happen in the future using new cases.
Linear regression is used when the response is a continuous variable (CV). Some examples of CVs are the height of a person, sales of a product, revenues of a company, etc.
Logistic regression is used when the response you want to predict/measure is categorical with two or more levels. Some examples are the gender of a person, the outcome of a football match, etc.
For example, let's take a scenario where you are analyzing the voting patterns of the USA to predict who will win the next election.
In such a case you would use:
1. Linear Regression: if you want to predict the number of people (continuous response) who will vote for Democrats/Republicans in each county/city/state, etc.
2. Logistic Regression: if you want to predict the probability that a certain person will vote for a Democrat/Republican or not.
Regressions can be used in real-world applications such as:
1. Credit scoring
2. Measuring the success rates of marketing campaigns
3. Predicting the revenues of a certain product
4. Predicting whether there is going to be an earthquake on a particular day, etc.
4.2.2 Application of logistic regression
Logistic Regression Real Life Example: 1
Medical researchers want to know how exercise and weight impact the
probability of having a heart attack. To understand the relationship
between the predictor variables and the probability of having a heart
attack, researchers can perform logistic regression.
The response variable in the model will be heart attack and it has two
potential outcomes:
● A heart attack occurs.
● A heart attack does not occur.
The results of the model will tell researchers exactly how changes in exercise and weight affect the probability that a given individual has a heart attack. The researchers can also use the fitted logistic regression model to predict the probability that a given individual has a heart attack, based on their weight and their time spent exercising.
Logistic Regression Real Life Example: 2
Researchers want to know how GPA, ACT score, and number of AP classes taken impact the probability of getting accepted into a particular university. To understand the relationship between the predictor variables and the probability of getting accepted, researchers can perform logistic regression.
The response variable in the model will be “acceptance” and it has two potential outcomes:
● A student gets accepted.
● A student does not get accepted.
The results of the model will tell researchers exactly how changes in GPA, ACT score, and number of AP classes taken affect the probability that a given individual gets accepted into the university. The researchers can also use the fitted logistic regression model to predict the probability that a given individual gets accepted, based on their GPA, ACT score, and number of AP classes taken.
Logistic Regression Real Life Example: 3
A business wants to know whether word count and country of origin impact the probability that an email is spam. To understand the relationship between these two predictor variables and the probability of an email being spam, researchers can perform logistic regression.
The response variable in the model will be “spam” and it has two potential outcomes:
● The email is spam.
● The email is not spam.
The results of the model will tell the business exactly how changes in
word count and country of origin affect the probability of a given email
being spam. The business can also use the fitted logistic regression model
to predict the probability that a given email is spam, based on its word
count and country of origin.
Logistic Regression Real Life Example: 4
A credit card company wants to know whether transaction amount and credit score impact the probability of a given transaction being fraudulent. To understand the relationship between these two predictor variables and the probability of a transaction being fraudulent, the company can perform logistic regression.
The response variable in the model will be “fraudulent” and it has two
potential outcomes:
● The transaction is fraudulent.
● The transaction is not fraudulent.
The results of the model will tell the company exactly how changes in
transaction amount and credit score affect the probability of a given
transaction being fraudulent. The company can also use the fitted logistic
regression model to predict the probability that a given transaction is
fraudulent, based on the transaction amount and the credit score of the
individual who made the transaction.
4.3 K-NEAREST NEIGHBORS (KNN) CLASSIFICATION MODEL
1. Evaluation procedure 1 - Train and test on the entire dataset
1. Train the model on the entire dataset.
2. Test the model on the same dataset, and evaluate how well we did by comparing the predicted response values with the true response values.
In [1]:
# read in the iris data
from sklearn.datasets import load_iris
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
1a. Logistic regression
In [2]:
# import the class
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X, y)
# predict the response values for the observations in X
logreg.predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [3]:
# store the predicted response values
y_pred = logreg.predict(X)
# check how many predictions were generated
len(y_pred)
Classification accuracy:
● Proportion of correct predictions
● Common evaluation metric for classification problems
In [4]:
# compute classification accuracy for the logistic regression model
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
● Known as training accuracy when you train and test the model on
the same data
● 96% of our predictions are correct
1b. KNN (K=5)
In [5]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))
The accuracy appears higher here, but testing on your training data is a serious problem.
1c. KNN (K=1)
In [6]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))
● KNN model:
1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
● With K=1, testing on the training data always gives 100% accuracy:
KNN searches for the one nearest observation and finds that exact same observation.
KNN has memorized the training set.
Because we are testing on the exact same data, it always makes the correct prediction.
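The three KNN steps can be sketched from scratch in a few lines of NumPy. This is an illustrative sketch, not scikit-learn's implementation: the function `knn_predict`, the Euclidean distance, and the toy data are assumptions of the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical two-class training data
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# An unseen point close to the first cluster
label = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)

# With k=1, each training point is its own nearest neighbour (distance 0),
# so predicting on the training data reproduces the training labels
train_preds = [knn_predict(X_train, y_train, x, k=1) for x in X_train]
print(label, train_preds)
```

The second part shows why K=1 scores perfectly on its own training data as long as the training points are distinct: the model simply looks each point up.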
1d. Problems with training and testing on the same data:
● Goal is to estimate likely performance of a model on out-of-sample data
● But, maximizing training accuracy rewards overly complex models that won't necessarily generalize
● Unnecessarily complex models overfit the training data
Image Credit: Overfitting by Chabacano. Licensed under GFDL via Wikimedia Commons.
● Green line (decision boundary): overfit
Your accuracy would be high but may not generalize well for future observations
Your accuracy is high because it is perfect in classifying your training data but not out-of-sample data
● Black line (decision boundary): just right
Good for generalizing for future observations
● Hence we need to solve this issue using a train/test split that will be explained below
2. Evaluation procedure 2 - Train/test split
1. Split the dataset into two pieces: a training set and a testing set.
2. Train the model on the training set.
3. Test the model on the testing set, and evaluate how well we did.
In [7]:
# print the shapes of X and y
# X is our features matrix with 150 x 4 dimension
print(X.shape)
# y is our response vector with 150 elements
print(y.shape)
(150, 4)
(150,)
In [8]:
# STEP 1: split X and y into training and testing sets
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
● test_size=0.4
40% of observations to test set
60% of observations to training set
● data is assigned randomly, so the split differs from run to run unless you fix random_state
If you use random_state=4
Your data will be split exactly the same way every time
What did this accomplish?
● Model can be trained and tested on different data
● Response values are known for the testing set, and thus predictions
can be evaluated
● Testing accuracy is a better estimate than training accuracy of out-of-sample performance
In [9]:
# print the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)
(90, 4)
(60, 4)
In [10]:
# print the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)
(90,)
(60,)
In [11]:
# STEP 2: train the model on the training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[11]:
LogisticRegression(C=1.0, class_weight=None, dual=False,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
In [12]:
# STEP 3: make predictions on the testing set
y_pred = logreg.predict(X_test)
# compare actual response values (y_test) with predicted response values (y_pred)
print(metrics.accuracy_score(y_test, y_pred))
Repeat for KNN with K=5:
In [13]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
Repeat for KNN with K=1:
In [14]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
Can we locate an even better value for K?
In [15]:
# try K=1 through K=25 and record testing accuracy
k_range = range(1, 26)
# scores is a Python list, created using [] (or list())
scores = []
# loop over K = 1 to 25 (range(1, 26) excludes 26)
# and append each testing accuracy to the list
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
print(scores)
[0.94999999999999996, 0.94999999999999996, 0.96666666666666667,
0.96666666666666667, 0.96666666666666667, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.96666666666666667,
0.98333333333333328, 0.96666666666666667, 0.96666666666666667,
0.96666666666666667, 0.96666666666666667, 0.94999999999999996,
In [16]:
# import Matplotlib (scientific plotting library)
import matplotlib.pyplot as plt
# allow plots to appear within the notebook
%matplotlib inline
# plot the relationship between K and testing accuracy
# plt.plot(x_axis, y_axis)
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
● Training accuracy rises as model complexity increases
● Testing accuracy penalizes models that are too complex or not complex enough
● For KNN models, complexity is determined by the value of K (lower value = more complex)
3. Making predictions on out-of-sample data:
In [17]:
# instantiate the model with the best known parameters
knn = KNeighborsClassifier(n_neighbors=11)
# train the model with X and y (not X_train and y_train)
knn.fit(X, y)
# make a prediction for an out-of-sample observation
# (a single sample must be passed as a 2-D array: one row, four features)
knn.predict([[3, 5, 4, 2]])
Note: passing a 1-D list such as [3, 5, 4, 2] triggered a DeprecationWarning in scikit-learn 0.17 ("Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19"). Reshape your data using X.reshape(-1, 1) if it has a single feature, or X.reshape(1, -1) if it contains a single sample.
4. Downsides of train/test split:
● Provides a high-variance estimate of out-of-sample accuracy
● K-fold cross-validation overcomes this limitation
● But, train/test split is still useful because of its flexibility and speed
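K-fold cross-validation can be sketched with scikit-learn's `cross_val_score`. The sketch below assumes 10 folds and K=5 neighbors on the Iris data; both values are illustrative choices rather than recommendations from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)

# Split the data into 10 folds; each fold serves once as the testing set
# while the other 9 folds are used for training
scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")

mean_accuracy = scores.mean()
print(scores)
print(mean_accuracy)
```

Averaging over the folds gives an estimate of out-of-sample accuracy that depends less on how one particular train/test split happened to fall.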
4.4 LETS SUM UP
Advanced Optimization Algorithms.
Applications of Linear/Logistic regression.
KNN - classification.
4.5 EXERCISES
Apply linear regression and logistic regression to a real-time example.
Take a real-time example and apply KNN classification to it.
Unit Structure
5.1 Dimensionality reduction
5.2 Feature selection
5.3 Normalization
5.1 DIMENSIONALITY REDUCTION
Dimensionality reduction eliminates some features of the dataset and creates a restricted set of features that contains the information needed to predict the target variables more efficiently and accurately.
Reducing the number of features normally also reduces the output variability and the complexity of the learning process. Computing the covariance matrix is an important step in the dimensionality reduction process; it is used to check the correlation between different features.
Correlation and its Measurement:
In machine learning there is a form of correlation called multicollinearity. Multicollinearity exists when two or more independent variables are highly correlated with each other, which makes the variables' coefficients highly unstable.
The coefficients are a significant part of regression, and if they are unstable, the regression result will be poor.
If multicollinearity is suspected, it can be checked using the variance inflation factor (VIF).
Rules from VIF:
● A VIF of 1 would indicate complete independence from any other variable.
● A VIF between 5 and 10 indicates a very high level of collinearity [4].
● The closer we get to 1, the more ideal the scenario for predictive modeling.
● Each independent variable is regressed against the other independent variables, and the VIF is calculated from that regression.
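The VIF calculation can be sketched directly from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one independent variable on the others. The function name `variance_inflation_factors` and the synthetic data below are assumptions made for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing that column on the others."""
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])   # add an intercept column
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical data: c is almost a copy of a, while b is unrelated to both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = variance_inflation_factors(X)
print(vifs)
```

The two nearly identical columns receive a VIF far above 10, while the unrelated column stays close to 1, matching the rules listed above.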
A heatmap also plays a crucial role in understanding the correlation between variables.
The type of relationship between any two quantities varies over a period of time.
Correlation varies from -1 to +1.
To be precise,
● Values that are close to +1 indicate a positive correlation.
● Values close to -1 indicate a negative correlation.
● Values close to 0 indicate no correlation at all.
Below is the heatmap that shows which features are highly correlated with the target feature, so that they can be considered.
The Covariance Matrix and Heatmap:
The covariance matrix is usually the first step in dimensionality reduction because it gives an idea of the number of strongly related features, so that those features can be discarded.
It also gives the detail of all independent features and provides an idea of the correlation between all the different pairs of features.
Identification of features in the Iris dataset that are strongly correlated:
Import all the required packages:
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
Load Iris dataset:
iris = datasets.load_iris()
List all features:
iris.feature_names
Create the correlation matrix (np.corrcoef returns correlation coefficients, that is, covariances normalized to the range -1 to +1; each row of its input must be one feature, hence the transpose):
cov_data = np.corrcoef(iris.data.T)
Plot the matrix to identify the correlation between features using a heatmap:
img = plt.matshow(cov_data)
plt.colorbar(img, ticks = [-1, 0, 1], fraction=0.045)
for x in range(cov_data.shape[0]):
    for y in range(cov_data.shape[1]):
        plt.text(x, y, "%0.2f" % cov_data[x,y], size=12, color='black', ha="center", va="center")
plt.show()
Heatmap of the correlation matrix.
Correlations visible in the heatmap:
● Between the first and the third features.
● Between the first and the fourth features.
● Between the third and the fourth features.
Independent features:
● The second feature is almost independent of the others.
Here the correlation matrix and its pictorial representation give an idea of the potential for feature reduction. Two features can be kept (for example, the second feature and one of the three correlated features), and the other features can be removed.
Feature Selection:
In feature selection, usually, a subset of the original features is selected.
Feature selection
Feature Extraction:
In feature extraction, a set of new features is found through some mapping from the existing features. The mapping can be either linear or non-linear.
Feature Extraction
Linear Feature Extraction:
Linear feature extraction is straightforward to compute and analytically tractable.
Widespread linear feature extraction methods:
● Principal Component Analysis (PCA): It seeks a projection that preserves as much information as possible in the data.
● Linear Discriminant Analysis (LDA): It seeks a projection that best discriminates the data.
What is Principal Component Analysis?
Principal component analysis (PCA) is an unsupervised linear transformation technique which is primarily used for feature extraction and dimensionality reduction. It aims to find the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one. In the diagram given below, note the directions of maximum variance of data. These are represented using PC1 (first maximum variance) and PC2 (second maximum variance).
Fig 1. PCA – Directions of maximum variance
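The idea of finding the direction of maximum variance and projecting onto it can be sketched with NumPy alone. The example below is a minimal sketch under stated assumptions: the synthetic 2-D data stretched along the diagonal, the use of `np.linalg.eigh` on the covariance matrix, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data that varies mostly along the direction (1, 1)
t = rng.normal(size=300)
data = np.column_stack([t + 0.1 * rng.normal(size=300),
                        t + 0.1 * rng.normal(size=300)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)               # 2 x 2 covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]              # sort from largest to smallest variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

pc1 = eigenvectors[:, 0]                           # direction of maximum variance
explained_ratio = eigenvalues / eigenvalues.sum()  # share of variance per component
projected = centered @ eigenvectors[:, :1]         # reduce from 2 dimensions to 1

print(pc1, explained_ratio)
```

Here the first principal component lies along the diagonal and carries almost all of the variance, so projecting onto it keeps most of the information while halving the number of dimensions.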
It is the direction of maximum variance of data that helps us identify an
object. For example, in a movie, it is okay to identify objects by 2 -dimensions
as these dimen sions represent direction of maximum variance. Take a look at
a real -world example of understanding direction of maximum variance in the
following picture representing Taj Mahal of Agra. The diagram below
represents the side view of Taj Mahal. There are m ultiple dimensions
consisting of information (maximum variance) which helps identify the
picture as Taj Mahal.
Fig.2 Taj Mahal Side View
Page 62
62 Features And Extraction Take a look the following picture of Taj Mahal from top view. Note that there
are only fewer dimensions in which info rmation is varying and the variance is
also not much. Hence, it is difficult to identify from top view whether the
picture is of Taj Mahal. Thus, top view can be ignored easily.
Fig3. Taj Mahal Top View
Thus, when training a model to classify whether a given structure is the Taj
Mahal or not, one would want to ignore the dimensions / features related to the
top view, as they do not provide much information (as a result of low variance).
How is PCA different from other feature selection techniques?
PCA differs from feature selection techniques such as
random forest, regularization, and forward/backward selection
in that it does not require class labels to be present (which is why it is
called unsupervised).
PCA Algorithm for Feature Extraction :
The following are the 6 steps of principal component analysis (PCA):
1. Standardize the dataset : Standardizing / normalizing the dataset is the
first step one needs to take before performing PCA. PCA
calculates a new projection of the given data set representing one or more
features. The new axes are based on the standard deviation of the values
of these features. So, a feature / variable with a high standard deviation
will have a higher weight in the calculation of an axis than a variable /
feature with a low standard deviation. If the data is normalized /
standardized, all features / variables are
measured on the same scale. Thus, all variables have the same weight
and PCA calculates the relevant axes appropriately. Note that the data is
standardized / normalized after creating the training / test split, so that the
scaling parameters are learned from the training data only. Python's
sklearn.preprocessing StandardScaler class can be used for
standardizing the dataset.
2. Construct the covariance matrix : Once the data is standardized, the
next step is to create the n x n covariance matrix, where n is the
number of dimensions in the dataset. The covariance matrix stores the
pairwise covariances between the different features. Note that a positive
covariance between two features indicates that the features increase or
decrease together, whereas a negative covariance indicates that the
features vary in opposite directions. NumPy's cov method can
be used to create the covariance matrix.
3. Perform eigendecomposition of the covariance matrix : The next step is
to decompose the covariance matrix into its eigenvectors and
eigenvalues. The eigenvectors of the covariance matrix represent the
principal components (the directions of maximum variance), whereas the
corresponding eigenvalues define their magnitude. NumPy's
linalg.eig or linalg.eigh can be used for decomposing the covariance matrix
into eigenvectors and eigenvalues (eigh is intended for symmetric matrices
such as the covariance matrix).
4. Selection of the most important eigenvectors / eigenvalues : Sort the
eigenvalues in decreasing order to rank the corresponding eigenvectors.
Select the k eigenvectors that correspond to the k largest eigenvalues,
where k is the dimensionality of the new feature subspace (k ≤ d, with d
the number of original dimensions). One
can use the concept of explained variance to select the k most
important eigenvectors.
5. Projection matrix creation from the important eigenvectors : Construct a
projection matrix, W, from the top k eigenvectors.
6. Training / test dataset transformation : Finally, transform the
d-dimensional input training and test datasets using the projection matrix to
obtain the new k-dimensional feature subspace.
PCA Python Implementation Step-by-Step :
This section presents custom Python code for extracting features
using PCA.
Dataset for PCA
Here are the steps followed for performing PCA:
● Perform one-hot encoding to transform the categorical data set into a numerical
data set
● Perform training / test split of the dataset
● Standardize the training and test data set
● Construct the covariance matrix of the training data set
● Perform eigendecomposition of the covariance matrix
● Select the most important components using explained variance
● Construct the projection matrix; in the code below, the projection matrix is
created using the five eigenvectors that correspond to the top five
eigenvalues (largest), to capture about 75% of the variance in this dataset
● Transform the training data set into the new feature subspace
Here is the custom Python code (without using the sklearn.decomposition PCA
class) that carries out the above PCA steps for feature extraction. It assumes
the dataset has already been loaded into a pandas DataFrame named df:
import numpy as np
import pandas as pd
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object]  # Find all categorical columns
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[df.columns != 'salary']],
    df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; this is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)  # learn the mean and standard deviation from the training data only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Import eigh method for calculating eigenvalues and eigenvectors
#
from numpy.linalg import eigh
#
# Determine covariance matrix
#
cov_matrix = np.cov(X_train_std, rowvar=False)
#
# Determine eigenvalues and eigenvectors
#
egnvalues, egnvectors = eigh(cov_matrix)
#
# Determine explained variance and select the most important eigenvectors based on explained variance
#
total_egnvalues = sum(egnvalues)
var_exp = [(i/total_egnvalues) for i in sorted(egnvalues, reverse=True)]
#
# Construct projection matrix using the five eigenvectors that
# correspond to the top five eigenvalues (largest), to capture about 75%
# of the variance in this dataset
#
egnpairs = [(np.abs(egnvalues[i]), egnvectors[:, i])
            for i in range(len(egnvalues))]
egnpairs.sort(key=lambda k: k[0], reverse=True)
projectionMatrix = np.hstack((egnpairs[0][1][:, np.newaxis],
                              egnpairs[1][1][:, np.newaxis],
                              egnpairs[2][1][:, np.newaxis],
                              egnpairs[3][1][:, np.newaxis],
                              egnpairs[4][1][:, np.newaxis]))
#
# Transform the training data set
#
X_train_pca = X_train_std.dot(projectionMatrix)
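The custom code above fixes the number of eigenvectors at five. As a minimal sketch of how k could instead be derived from the cumulative explained variance, the example below uses small synthetic data rather than the salary dataset. The function name pca_by_explained_variance, the 0.75 threshold and the synthetic features are assumptions made for illustration only.

```python
import numpy as np

def pca_by_explained_variance(X_std, threshold=0.75):
    """Project standardized data onto the fewest components reaching `threshold`."""
    cov_matrix = np.cov(X_std, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    # eigh returns eigenvalues in ascending order; reorder to descending
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # cumulative share of the total variance explained by the top components
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # smallest k whose cumulative share reaches the threshold
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(eigenvalues))
    projection_matrix = eigenvectors[:, :k]
    return X_std @ projection_matrix, k, cumulative

# Synthetic data (an assumption for illustration): two independent signals,
# each observed twice with a little noise, giving four correlated features.
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 4))
X = np.column_stack([signals[:, 0], signals[:, 0],
                     signals[:, 1], signals[:, 1]]) + noise
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

X_pca, k, cumulative = pca_by_explained_variance(X_std, threshold=0.75)
print("components kept:", k)
print("cumulative explained variance:", np.round(cumulative, 3))
```

Because the four synthetic features carry only two underlying signals, two components are enough to pass the threshold here; on real data the printed cumulative values show how many components a chosen threshold requires.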
Python Sklearn Example :
This section presents Python code for extracting features
using the PCA class from sklearn.decomposition. In the dataset
used, salary is the label. The goal is to predict the salary.
Here are the steps followed for performing PCA:
● Perform one-hot encoding to transform the categorical data set into a numerical
data set
● Perform training / test split of the dataset
● Standardize the training and test data set
● Perform PCA by fitting and transforming the training data set to the new
feature subspace and later transforming test data set.
● As a final step, the transformed dataset can be used for training/testing
the model
Here is the Python code that carries out the above PCA
steps for feature extraction. As before, it assumes the dataset has already
been loaded into a pandas DataFrame named df:
import pandas as pd
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object]  # Find all categorical columns
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[df.columns != 'salary']],
    df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; this is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)  # learn the mean and standard deviation from the training data only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Perform PCA
#
from sklearn.decomposition import PCA
pca = PCA()
#
# Determine transformed features
#
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
5.2 FEATURE SELECTION
Feature selection is one of the core concepts in machine learning
and hugely impacts the performance of your model. The data
features that you use to train your machine learning models have a huge
influence on the performance you can achieve. Irrelevant or partially
relevant features can negatively impact model performance. Feature
selection and data cleaning should be the first and most important steps of
designing your model.
Feature selection is the process where you automatically or manually
select those features which contribute most to the prediction variable or
output in which you are interested.
Having irrelevant features in your data can decrease the accuracy of the
models and make your model learn based on irrelevant features.
How to select features, and what are the benefits of performing feature
selection before modeling your data?
• Reduces Overfitting : Less redundant data means less opportunity to
make decisions based on noise.
• Improves Accuracy : Less misleading data means modeling accuracy
improves.
• Reduces Training Time : Less data (fewer features) reduces algorithm
complexity, and algorithms train faster.
I want to share my personal experience with this.
I prepared a model using all the features and got an accuracy of
around 65%, which is not very good for a predictive model. After
doing some feature selection and feature engineering, without making any
logical changes to my model code, my accuracy jumped to 81%, which is
quite impressive.
Now you know why I say feature selection should be the first and most
important step of your model design.
Feature Selection Methods:
I will share 3 feature selection techniques that are easy to use and also
give good results.
1. Univariate Selection
2. Feature Importance
3. Correlation Matrix with Heatmap
Let’s have a look at these techniques one by one with an example.
Description of the variables in the dataset file :
battery_power: Total energy a battery can store in one time, measured in mAh
blue: Has Bluetooth or not
clock_speed: The speed at which the microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: The longest talk time that a single battery charge will last
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0(low cost),
1(medium cost), 2(high cost) and 3(very high cost).
1. Univariate Selection :
Statistical tests can be used to select those features that have the strongest
relationship with the output variable.
The scikit-learn library provides the SelectKBest class that can be used
with a suite of different statistical tests to select a specific number of
features.
The example below uses the chi-squared (chi²) statistical test for
non-negative features to select 10 of the best features from the Mobile Price
Range Prediction Dataset.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features
Top 10 Best Features using SelectKBest class
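To make the score behind this ranking less of a black box, here is a minimal sketch of a chi-squared statistic for a single non-negative feature, written in plain Python. It compares the per-class totals of the feature with the totals expected if the feature were independent of the class, which is the idea behind the chi2 score function. The function name chi2_score and the tiny ram / wifi / price_range values are made up for illustration and are not taken from the dataset.

```python
def chi2_score(feature, target):
    """Chi-squared statistic between one non-negative feature and a class label."""
    n = len(target)
    total = sum(feature)
    score = 0.0
    for label in sorted(set(target)):
        # observed: sum of the feature values inside this class
        observed = sum(x for x, t in zip(feature, target) if t == label)
        # expected: the share of the total this class would get under independence
        class_count = sum(1 for t in target if t == label)
        expected = total * class_count / n
        score += (observed - expected) ** 2 / expected
    return score

# Hypothetical toy values: ram grows with the price range, wifi does not.
price_range = [0, 0, 1, 1]
ram = [1, 1, 3, 3]
wifi = [1, 0, 1, 0]

print("ram score :", chi2_score(ram, price_range))
print("wifi score:", chi2_score(wifi, price_range))
```

A feature whose totals differ strongly between classes gets a high score, while a feature spread evenly across classes scores close to zero; SelectKBest keeps the k features with the highest scores.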
2. Feature Importance :
You can get the feature importance of each feature of your dataset by
using the feature importance property of the model.
Feature importance gives you a score for each feature of your data; the
higher the score, the more important or relevant the feature is to your
output variable.
Feature importance is an inbuilt attribute (feature_importances_) that comes with tree-based
classifiers. We will be using the Extra Trees Classifier for extracting the top 10
features for the dataset.
import pandas as pd
import numpy as np
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1]    #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)  #use the inbuilt feature_importances_ attribute of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
Top 10 most important features in data
3. Correlation Matrix with Heatmap :
Correlation states how the features are related to each other or to the target
variable.
Correlation can be positive (an increase in the value of a feature increases the
value of the target variable) or negative (an increase in the value of a feature
decreases the value of the target variable).
A heatmap makes it easy to identify which features are most related to the
target variable. We will plot a heatmap of correlated features using the
seaborn library.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
#get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()
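Each cell of such a heatmap is a correlation coefficient. As a minimal sketch of what one cell contains, the Pearson correlation between a feature and the target can be computed in plain Python as below. The helper name pearson and the four toy rows are assumptions for illustration; only the column names come from the variable description above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical toy rows: ram rises with the price range, m_dep falls with it.
price_range = [0, 1, 2, 3]
features = {
    "ram": [512, 1024, 2048, 4096],
    "m_dep": [0.9, 0.7, 0.5, 0.3],
}

# Rank the features by the strength (absolute value) of their correlation.
correlations = {name: pearson(values, price_range) for name, values in features.items()}
for name, r in sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{name}: {r:+.3f}")
```

The sign tells the direction of the relationship and the absolute value its strength, which is what the colour scale of the heatmap encodes.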
5.3 NORMALIZATION
Normalization is a technique often applied as part of data preparation for
machine learning. The goal of normalization is to change the values of
numeric columns in the dataset to use a common scale, without distorting
differences in the ranges of values or losing information. Normalization is
also required for some algorithms to model the data correctly.
For example, assume your input dataset contains one column with values
ranging from 0 to 1, and another column with values ranging from 10,000
to 100,000. The great difference in the scale of the numbers could cause
problems when you attempt to combine the values as features during
modelling.
Normalization avoids these problems by creating new values that maintain
the general distribution and ratios in the source data, while keeping values
within a scale applied across all numeric columns used in the model.
The Normalize Data component in Azure Machine Learning offers several
options for transforming numeric data:
● You can change all values to a 0-1 scale, or transform the values by
representing them as percentile ranks rather than absolute values.
● You can apply normalization to a single column, or to multiple
columns in the same dataset.
● If you need to repeat the pipeline, or apply the same normalization
steps to other data, you can save the steps as a normalization
transform, and apply it to other datasets that have the same schema.
Normalization Techniques at a Glance :
Four common normalization techniques may be useful:
● scaling to a range
● clipping
● log scaling
● z-score
The following charts show the effect of each normalization technique on
the distribution of the raw feature (price) on the left. The charts are based
on the data set from 1985 Ward's Automotive Yearbook that is part of
the UCI Machine Learning Repository under Automobile Data Set .
Figure 1. Summary of normalization techniques.
Scaling to a Range :
Scaling means converting floating-point feature
values from their natural range (for example, 100 to 900) into a standard
range, usually 0 to 1 (or sometimes -1 to +1). Use the following simple
formula to scale to a range:
\[ x' = (x - x_{min}) / (x_{max} - x_{min}) \]
Scaling to a range is a good choice when both of the following conditions
are met:
● You know the approximate upper and lower bounds on your data with
few or no outliers.
● Your data is approximately uniformly distributed across that range.
A good example is age. Most age values fall between 0 and 90, and every
part of the range has a substantial number of people.
In contrast, you would not use scaling on income, because only a few
people have very high incomes. The upper bound of the linear scale for
income would be very high, and most people would be squeezed into a
small part of the scale.
Feature Clipping :
If your data set contains extreme outliers, you might try feature clipping,
which caps all feature values above (or below) a certain value to a fixed
value. For example, you could clip all temperature values above 40 to be
exactly 40.
You may apply feature clipping before or after other normalizations.
Formula: Set min/max values to avoid outliers :
Figure 2. Comparing a raw distribution and its clipped version.
Another simple clipping strategy is to clip by z-score to ±Nσ (for
example, limit to ±3σ). Note that σ is the standard deviation.
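Both clipping strategies can be sketched in a few lines of plain Python. The helper names clip and clip_by_zscore and the sample values are assumptions for illustration; the z-score variant uses the population standard deviation.

```python
import statistics

def clip(values, lower, upper):
    """Cap every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def clip_by_zscore(values, n_sigma=3.0):
    """Cap values lying more than n_sigma standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return clip(values, mu - n_sigma * sigma, mu + n_sigma * sigma)

# Clip all temperature values above 40 to be exactly 40 (no lower cap).
temperatures = [12, 25, 31, 40, 47, 55]
clipped_temperatures = clip(temperatures, lower=float("-inf"), upper=40)
print(clipped_temperatures)

# Hypothetical readings: nineteen zeros and one extreme value of 100.
readings = [0] * 19 + [100]
clipped_readings = clip_by_zscore(readings, n_sigma=3.0)
print(max(clipped_readings))
```

In the second case only the extreme reading moves; the values inside the ±3σ band are left untouched.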
Log Scaling :
Log scaling computes the log of your values to compress a wide range to a
narrow range.
\[ x' = log(x) \]
Log scaling is helpful when a handful of your values have many points,
while most other values have few points. This data distribution is known
as the power law distribution. Movie ratings are a good example. In the
chart below, most movies have very few ratings (the data in the tail), while
a few have lots of ratings (the data in the head). Log scaling changes the
distribution, helping to improve linear model performance.
Figure 3. Comparing a raw distribution to its log.
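A minimal sketch of log scaling in plain Python is given below. The rating counts are invented for illustration, and the natural logarithm is assumed; note that the logarithm is only defined for positive values, so zero counts would need separate handling.

```python
import math

def log_scale(values):
    """Apply x' = log(x) to every value (values must be positive)."""
    return [math.log(v) for v in values]

# Hypothetical number of ratings per movie: a long tail and a heavy head.
rating_counts = [1, 3, 10, 250, 40000]
scaled = log_scale(rating_counts)

print("raw range   :", max(rating_counts) - min(rating_counts))
print("scaled range:", round(max(scaled) - min(scaled), 2))
```

The ordering of the values is preserved, but the gap between the head and the tail shrinks from tens of thousands to roughly ten units.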
Z-Score :
Z-score is a variation of scaling that represents the number of standard
deviations away from the mean. You would use z-score to ensure your
feature distributions have mean = 0 and std = 1. It’s useful when there are
a few outliers, but not so extreme that you need clipping.
The formula for calculating the z -score of a point, x, is as follows:
\[ x' = (x - μ) / σ \]
Note: μ is the mean and σ is the standard deviation.
Figure 4. Comparing a raw distribution to its z-score distribution.
Notice that z-score squeezes raw values that have a range of ~40000 down
into a range from roughly -1 to +4.
Suppose you're not sure whether the outliers truly are extreme. In this
case, start with z-score unless you have feature values that you don't want
the model to learn; for example, the values are the result of measurement
error or a quirk.
Configure Normalize Data :
You can apply only one normalization method at a time using this
component. Therefore, the same normalization method is applied to all
columns that you select. To use different normalization methods, use a
second instance of Normalize Data.
1. Add the Normalize Data component to your pipeline. You can find
the component in Azure Machine Learning, under Data
Transformation, in the Scale and Reduce category.
2. Connect a dataset that contains at least one column of all numbers.
3. Use the Column Selector to choose the numeric columns to
normalize. If you don't choose individual columns, by
default all numeric type columns in the input are included, and the
same normalization process is applied to all selected columns.
This can lead to strange results if you include numeric columns that
shouldn't be normalized! Always check the columns carefully.
If no numeric columns are detected, check the column metadata to verify
that the data type of the column is a supported numeric type.
To ensure that columns of a specific type are provided as input, try using
the Select Columns in Dataset component before Normalize Data.
4. Use 0 for constant columns when checked : Select this option when
any numeric column contains a single unchanging value. This ensures
that such columns are not used in normalization operations.
5. From the Transformation method dropdown list, choose a single
mathematical function to apply to all selected columns.
Zscore : Converts all values to a z-score.
The values in the column are transformed using the following formula:
\[ x' = (x - μ) / σ \]
Here μ is the mean and σ is the standard deviation of the column.
Mean and standard deviation are computed for each column separately.
Population standard deviation is used.
MinMax : The min-max normalizer linearly rescales every feature to
the [0,1] interval.
Rescaling to the [0,1] interval is done by shifting the values of each
feature so that the minimal value is 0, and then dividing by the new
maximal value (which is the difference between the original maximal and
minimal values).
The values in the column are transformed using the following formula:
\[ x' = (x - x_{min}) / (x_{max} - x_{min}) \]
Logistic : The values in the column are transformed using the
following formula:
\[ x' = 1 / (1 + e^{-x}) \]
LogNormal : This option converts all values to a lognormal scale.
The values in the column are transformed using the cumulative distribution
function of the lognormal distribution:
\[ x' = \mathrm{LogNormal.CDF}(x; μ, σ) \]
Here μ and σ are the parameters of the distribution, computed empirically
from the data as maximum likelihood estimates, for each column separately.
TanH : All values are converted to a hyperbolic tangent.
The values in the column are transformed using the following formula:
\[ x' = tanh(x) \]
6. Submit the pipeline, or double-click the Normalize Data component
and select Run Selected.
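Outside the designer, some of the transformation methods from step 5 can be sketched in plain Python to see what they do to a column. This is only an illustration under stated assumptions: the function names are hypothetical, Logistic and TanH are taken to be the standard sigmoid and hyperbolic tangent with no extra scaling, and constant columns (which would cause a division by zero in the first two functions) are not handled.

```python
import math
import statistics

def zscore(column):
    """(x - mean) / population standard deviation, computed per column."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

def min_max(column):
    """Shift so the minimum is 0, then divide by the original range."""
    low, high = min(column), max(column)
    return [(x - low) / (high - low) for x in column]

def logistic(column):
    """Standard sigmoid; maps any real value into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in column]

def tanh(column):
    """Hyperbolic tangent; maps any real value into (-1, 1)."""
    return [math.tanh(x) for x in column]

column = [-2.0, 0.0, 2.0]
for transform in (zscore, min_max, logistic, tanh):
    print(transform.__name__, [round(v, 3) for v in transform(column)])
```

Applying the same function to every selected column mirrors the component's behaviour of using one method for all chosen columns.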
Data Normalization with Pandas :
● Pandas: Pandas is an open-source library built on top of the
NumPy library. It is a Python package that provides various data
structures and operations for manipulating numerical data and
statistics. It is popular mainly because it makes importing and analysing data
much easier. Pandas is fast, high-performance and
productive for users.
● Data Normalization : Data normalization is a
common practice in machine learning which consists of
transforming numeric columns to a standard scale. In machine
learning, some feature values can be many times larger than others.
The features with higher values will dominate the learning
process.
Steps Needed:
Here, we will apply some techniques to normalize the data and
discuss these with the help of examples. For this, let’s understand the
steps needed for data normalization with Pandas.
1. Import Library (Pandas)
2. Import / Load / Create data.
3. Use the technique to normalize the data.
Examples :
Here, we create data by some random values and apply some
normalization techniques to it.
# importing packages
import pandas as pd
# create data
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]],
    columns=['Col A', 'Col B', 'Col C', 'Col D'])
# view data
display(df)
See the plot of this dataframe:
import matplotlib.pyplot as plt
df.plot(kind = 'bar')
Let’s apply normalization techniques one by one.
Using the maximum absolute scaling :
The maximum absolute scaling rescales each feature between -1 and
1 by dividing every observation by the feature's maximum absolute value. We
can apply the maximum absolute scaling in Pandas using the .abs()
and .max() methods, as shown below:
# copy the data
df_max_scaled = df.copy()
# apply normalization techniques
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
# view normalized data
display(df_max_scaled)
See the plot of this dataframe:
import matplotlib.pyplot as plt
df_max_scaled.plot(kind = 'bar')
Using the min-max feature scaling :
The min-max approach (often called normalization) rescales the
feature to a fixed range of [0,1] by subtracting the minimum
value of the feature and then dividing by the range. We can apply the
min-max scaling in Pandas using the .min() and .max() methods.
# copy the data
df_min_max_scaled = df.copy()
# apply normalization techniques
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())
# view normalized data
print(df_min_max_scaled)
Let’s draw a plot with this dataframe:
import matplotlib.pyplot as plt
df_min_max_scaled.plot(kind = 'bar')
Using the z-score method :
The z-score method (often called standardization) transforms the
data into a distribution with a mean of 0 and a standard deviation of 1.
Each standardized value is computed by subtracting the mean of the
corresponding feature and then dividing by the standard deviation.
# copy the data
df_z_scaled = df.copy()
# apply normalization techniques
# note: pandas .std() uses the sample standard deviation (ddof=1) by default
for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column] - df_z_scaled[column].mean()) / df_z_scaled[column].std()
# view normalized data
display(df_z_scaled)
Let’s draw a plot with this dataframe:
import matplotlib.pyplot as plt
df_z_scaled.plot(kind='bar')
Unit Structure
6.1 Introduction
6.2 Transformers
6.3 Principal Component Analysis (PCA)
6.1 INTRODUCTION
What is AI Transformation? :
AI transformation is the next step after digital transformation. After a
company adopts digital processes, the next step is to improve the
intelligence of those processes. This increases the level of
automation as well as the effectiveness of those processes.
AI transformation touches all aspects of the modern enterprise, including
both commercial and operational activities. Tech giants are integrating AI
into their processes and products. For example, Google calls itself
an “AI-first” organization. Besides tech giants, IDC estimates that at
least 90% of new organizations will insert AI technology into their
processes and products by 2025.
What are the steps to AI transformation? :
We have listed below the top 6 steps for Fortune 500 firms.
Smaller firms could skip having in-house teams and strive for less risky
and less investment-heavy approaches, such as relying on consultants for
targeted projects.
1. Outline your company’s AI strategy :
An AI strategy should include initiatives which will be uncovered as a
result of these exercises:
● Identify your company’s most valuable unique data sources
● Identify the most important processes which can benefit from AI
● Identify internal resources to drive the AI transformation
● Set ambitious, time -bound business targets
2. Execute pilot projects to gain momentum :
The first few projects should create measurable business value while being
attainable. This is important for the transformation to gain trust across the
organization through completed projects, and it creates momentum that will
lead to AI projects with greater success.
These projects can rely on AI/ML powered tools in the marketplace or, for
more custom solutions, your company can run a data science
competition and rely on the wisdom of hundreds of data scientists. These
competitions use encrypted data and provide a low-cost way to find
high-performing data science solutions.
Implementing process mining is one of those easy-to-achieve and
impactful projects. With a process mining tool, your business can identify
existing inefficiencies and automate or improve those processes to achieve
savings or customer experience improvements. In addition, some process mining
tools generate a digital twin of an organization (DTO), which provides an
end-to-end overview of the processes in the company and offers
simulation capabilities to compare actual and hypothetical scenarios.
Another easy-to-deploy and impactful project is automating
document-based processes. While digital transformation projects in the 2000s just
dealt with removing paper from processes, a modern AI/digital
transformation project would reduce manual labour and automate data
extraction and processing of document data.
3. Build an in-house AI transformation team :
Outsourcing the AI work eases the start of the AI transformation process,
but building an in-house AI transformation team can be more
advantageous in the long run. If necessary, outsourced partners can help
train your staff for upcoming projects.
4. Provide broad AI training :
Organizations should not expect their staff to already have adequate
knowledge of AI technologies. For a successful AI
transformation, training each employee in accordance with their role can
help achieve objectives.
● Executives and senior managers should know what AI can do
for the enterprise, how to develop an AI strategy, and how to make proper
resource allocation decisions.
● Leaders of AI project teams should learn how to set direction for AI
projects, allocate resources, and monitor and track progress.
● AI engineers should learn how to gather data, train AI models, and
deliver specific AI projects.
5. Develop internal and external communications :
For AI transformation to succeed, the organization should
ensure alignment across the business by improving internal and external
communications.
6. Update the company’s AI strategy and continue with AI
transformation :
When the team gains momentum from the initial AI projects and forms a
deeper understanding of AI, the organization will have a better
understanding of improvement areas where AI can create the most value.
An updated strategy that considers the company’s track record can set a
better direction for the company.
Here are the four types of transformation in more detail:
Process Transformation :
A significant focus of corporate activity has been on business processes.
Data, analytics, APIs, machine learning and other technologies offer
corporations valuable new ways to reinvent processes throughout the
corporation, with the goal of lowering costs, reducing cycle times, or
increasing quality. We see process transformation on the shop floor, where
companies like Airbus have used heads-up display glasses to improve
the quality of human inspection of airplanes. We also see process
transformations in customer experience, where companies like Domino's
Pizza have completely re-imagined the food ordering process; Domino's
AnyWare lets customers order from any device. This innovation increased
customer convenience so much that it helped push the company to
overtake Pizza Hut in sales. And we see companies implementing
technologies like robotic process automation to streamline back-office
processes such as accounting and legal. Process transformation
can create significant value, and adopting technology in these areas is fast
becoming table stakes. Because these transformations tend to be focused
efforts around specific areas of the business, they are often successfully
led by a CIO or CDO.
Business Model Transformation:
Some companies are pursuing digital technologies to transform traditional
business models. Whereas process transformation focuses on finite areas
of the business, business model transformations are aimed at the
fundamental building blocks of how value is delivered in the industry.
Examples of this kind of innovation are well known, from Netflix's
reinvention of video distribution, to Apple's reinvention of music delivery
(iTunes), to Uber's reinvention of the taxi industry. But this kind of
transformation is occurring elsewhere. Insurance companies like Allstate
and Metromile are using data and analytics to unbundle insurance
contracts and charge customers by the mile, a wholesale change to the
auto insurance business model. And, though not yet a reality, there are
numerous efforts underway to transform the business of mining into a
wholly robotic exercise, where no humans travel below the surface.
The complex and strategic nature of these opportunities requires
involvement and leadership by Strategy and/or Business Units, and they
are often launched as separate initiatives while continuing to operate the
traditional business. By changing the fundamental building blocks of
value, corporations that achieve business model transformation open
significant new opportunities for growth. More companies should pursue
this path.
Domain Transformation:
An area where we see surprisingly little focus, but enormous
opportunity, is domain transformation. New technologies are
redefining products and services, blurring industry boundaries and
creating entirely new sets of non-traditional competitors. What many
executives don’t appreciate is the very real opportunity for these new
technologies to unlock wholly new businesses for their companies beyond
currently served markets. And often, it is this type of transformation that
offers the greatest opportunities to create new value.
A clear example of how domain transformation works is the online
retailer Amazon. Amazon expanded into a new market domain with the
launch of Amazon Web Services (AWS), now the largest cloud
computing/infrastructure service, in a domain formerly owned by IT
giants like Microsoft and IBM. What made Amazon’s entry into this
domain possible was a combination of the strong digital capabilities it had
built in storage, computing, and databases to support its core retail business,
coupled with an installed base of thousands of relationships with young,
growing companies that increasingly needed computing services to
grow. AWS is not a mere adjacency or business extension for Amazon,
but a wholly different business in a fundamentally different market space.
The AWS business now represents nearly 60% of Amazon’s annual
It may be tempting for executives of non-tech businesses to view the
experience of Amazon or other digitally native companies (such as Apple
or Google, which have also expanded into new domains) as special; their
ability to acquire and leverage technology may be greater than that of other
companies. But in today’s digital world, technology gaps are no longer a
barrier. Any company can access and acquire the new technologies needed
to unlock new growth, and do so cheaply and efficiently. The building
block technologies that are unlocking new business domains (artificial
intelligence, machine learning, Internet of Things (IoT), augmented reality,
etc.) can be sourced today not only from the traditional IT supply base like
Microsoft or IBM but also from a growing startup ecosystem, where we
see the greatest innovation taking place. Corporations that know how to
reach and leverage this innovation efficiently, particularly from new
sources, are reaping the benefits of new growth.
We see (and have helped) numerous industrial companies that have
undergone domain transformations. ThyssenKrupp, a diversified industrial
engineering company, broadened its offerings to introduce a lucrative new
digital business alongside its traditional business. The company leveraged
a strong industrial market position and Internet of Things (IoT)
capabilities to help clients manage the maintenance of elevators with asset
health and predictive maintenance offerings, creating a significant new
source of revenue beyond the core. In another example, a major equipment
manufacturer is moving beyond its core machine offerings to introduce a
digital platform of solutions for its client sites: job-site activity
coordination, remote equipment tracking, situational awareness, and
supply chain optimization. The company is moving to become no longer
merely a heavy equipment provider, but also a digital solutions company.
The lesson is to recognize the new domain opportunities afforded by new
technologies and to understand that they can be captured, even by traditional
incumbents. Because these opportunities involve redefining business
boundaries, pursuing these opportunities often involves Strategy and the
Cultural/Organizational Transformation:
Full, long-term digital transformation requires redefining organizational
mindsets, processes, and talent and capabilities for the digital world.
Best-in-class corporations recognize that digital requires agile workflows, a bias
toward testing and learning, decentralized decision-making, and a greater
reliance on business ecosystems. And they take active steps to bring
change to their organizations. Experian, the consumer credit agency and
one of the most successful digital transformations, changed its
organization by embedding agile development and collaboration into its
workflows and by driving a fundamental shift in employee focus from
equipment to data, company-wide. Similarly, Pitney Bowes, the 100-year-old
postage equipment company, made the successful transition to become
a “technology company” by promoting a “culture of innovation,”
according to its head of innovation, and by shifting company values to
focus on customer-centricity.
But neither of these companies focused initially on organization and
culture; being digital isn’t the same as creating value from digital. Instead,
these companies pulled innovation skills, digital mindsets and agility into
the corporation on the back of concrete initiatives to drive
growth. Experian recognized the importance of beginning with a
lighthouse digital project to create internal APIs. It forced teams to adopt
digital workflow practices, but in doing so demonstrated the power of
digital to change old organizational norms. Similarly, Pitney Bowes CEO
Marc Lautenbach began the company's transformation with a primary focus on
customer-facing offerings, developing a new commerce cloud to allow
customers to better manage and pay for shipments. “As you’re thinking
about transforming a company… try to realize those cores, those gems that
you have that you can pivot off of to create that next chapter,” he told
Fortune. Progress on business initiatives dragged organizational change
like agile development and innovation along. Cultural/organizational
change is a long-term requirement of success, but best-in-class companies
regard the building of these capabilities as a product of, rather than a
prerequisite for, business transformation initiatives.
As technology change increases, industries will continue to be forced to
change. Corporations that regard and pursue digital transformation in a
multi-dimensional way will find greater success than those that don’t.
Transformers can be understood in terms of their three components:
1. An Encoder that encodes an input sequence into a state representation
vector.
2. An Attention mechanism that enables the Transformer model to focus
on the right aspects of the sequential input stream. It is used
repeatedly within both the encoder and the decoder to help them
contextualize the input data.
3. A Decoder that decodes the state representation vector to generate the
target output sequence.
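The text does not spell out the attention formula. As a minimal sketch, assuming the scaled dot-product form commonly used in Transformers, each output row is a weighted average of the values, with weights given by a softmax over query-key similarities. The function name and the random test arrays are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Weight each value by how well its key matches each query."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)           # shape (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ values, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))   # 3 query positions, dimension 4
k = rng.normal(size=(5, 4))   # 5 key positions
v = rng.normal(size=(5, 2))   # one value vector per key
context, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` sums to 1, so every query position receives a convex combination of the value vectors.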
Understanding the Training Data:
Sample data point: “write a function that adds two numbers”:
Python Code:
def add_two_numbers(num1, num2):
    sum = num1 + num2
    return sum
Tokenizing the Data:
Our Input (SRC) and Output (TRG) sequences exist in the form of single
strings that need to be tokenized before they can be sent into the
transformer model.
To tokenize the Input sequence we make use of spaCy.
Input = data.Field(tokenize='spacy',
                   lower=True)
To tokenize our Output sequence we make use of a custom tokenizer
built upon Python’s source code tokenizer. Python’s tokenizer returns
several attributes for each token. We extract only the token type and the
corresponding string attribute, in the form of a tuple (i.e., (token_type_int,
token_string)), as the final token.
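A minimal sketch of such a tokenizer, using the standard library `tokenize` module, is given here. The function name `tokenize_python_code` is hypothetical, and the integer token codes depend on the Python version, so they may differ from the values printed in this section.

```python
import io
import tokenize

def tokenize_python_code(source):
    """Return (token_type_int, token_string) pairs for a Python source string."""
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [(tok.type, tok.string) for tok in tokenize.tokenize(readline)]

code = "def add_two_numbers(num1, num2):\n    return num1 + num2\n"
tokens = tokenize_python_code(code)

# tokenize.tok_name maps the integer codes back to readable names.
readable = [(tokenize.tok_name[t], s) for t, s in tokens]
```

The first pair is always the encoding token and the last one is the end marker, which matches the shape of the TRG sequences listed in this section.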
Tokenized Input:
SRC = [' ', 'write', 'a', 'python', 'function', 'to', 'add', 'two', 'user', 'provided',
'numbers', 'and', 'return', 'the', 'sum']
Tokenized Output:
TRG = [(57, 'utf-8'), (1, 'def'), (1, 'add_two_numbers'), (53, '('), (1, 'num1'),
(53, ','), (1, 'num2'), (53, ')'), (53, ':'), (4, '\n'), (5, ' '), (1, 'sum'), (53, '='),
(1, 'num1'), (53, '+'), (1, 'num2'), (4, '\n'), (1, 'return'), (1, 'sum'), (4, ''),
(6, ''), (0, '')]
Data Augmentations:
While tokenizing the Python code, we randomly mask the names of certain
variables (with ‘var_1’, ‘var_2’, etc.) to ensure that the model we train
does not merely fixate on the way the variables are named and actually
tries to understand the inherent logic and syntax of the Python code.
For example, consider the following program.
def add_two_numbers(num1, num2):
    sum = num1 + num2
    return sum
We can replace some of the above variables to create new data points. The
following are valid augmentations.
def add_two_numbers(var_1, num2):
    sum = var_1 + num2
    return sum

def add_two_numbers(num1, var_1):
    sum = num1 + var_1
    return sum

def add_two_numbers(var_1, var_2):
    sum = var_1 + var_2
    return sum
In the above example, we have expanded a single data point into
3 more data points using our random variable replacement technique.
We implement our augmentations at the time of generating our tokens.
While randomly picking variables to mask, we avoid keyword
literals (keyword.kwlist), control structures (as can be seen in the
skip_list below), and object properties. We add all such literals that need
to be skipped to the skip_list.
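The source does not list its augmentation function, so the following is only a sketch of the idea under stated assumptions: `augment_tokenize`, `SKIP_LIST` and `mask_prob` are hypothetical names, the skip list is a small illustrative subset, and a name is left alone when it directly follows `.`, `def` or `class`. The decision to mask is made once per name so that every occurrence is replaced consistently.

```python
import io
import keyword
import random
import tokenize

# Hypothetical skip list: keywords plus a few built-in names.
SKIP_LIST = set(keyword.kwlist) | {"range", "len", "print", "enumerate"}

def augment_tokenize(source, mask_prob=0.3, seed=None):
    """Tokenize `source`, masking some variable names as var_1, var_2, ..."""
    rng = random.Random(seed)
    readline = io.BytesIO(source.encode("utf-8")).readline
    tokens = list(tokenize.tokenize(readline))
    replacement = {}   # name -> masked name (or itself), decided once per name
    n_masked = 0
    result = []
    for i, tok in enumerate(tokens):
        text = tok.string
        prev = tokens[i - 1].string if i > 0 else ""
        maskable = (tok.type == tokenize.NAME
                    and text not in SKIP_LIST
                    and prev not in (".", "def", "class"))
        if maskable:
            if text not in replacement:
                if rng.random() < mask_prob:
                    n_masked += 1
                    replacement[text] = f"var_{n_masked}"
                else:
                    replacement[text] = text
            text = replacement[text]
        result.append((tok.type, text))
    return result

code = "def add_two_numbers(num1, num2):\n    sum = num1 + num2\n    return sum\n"
fully_masked = [s for _, s in augment_tokenize(code, mask_prob=1.0)]
unmasked = [s for _, s in augment_tokenize(code, mask_prob=0.0)]
```

With `mask_prob=1.0` every eligible variable is replaced, while the function name after `def` is kept. Intermediate probabilities yield the partial augmentations shown earlier in this section.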
We now apply our augmentations and tokenization using PyTorch’s
data.Field (from torchtext):
Output = data.Field(tokenize=augment_tokenize_python_code,
                    lower=False)
Our tokenized Output after applying augmentation and tokenization:
TRG = [(57, 'utf-8'), (1, 'def'), (1, 'add_two_numbers'), (53, '('), (1, 'num1'),
(53, ','), (1, 'var_1'), (53, ')'), (53, ':'), (4, '\n'), (5, ' '), (1, 'sum'), (53, '='),
(1, 'num1'), (53, '+'), (1, 'var_1'), (4, '\n'), (1, 'return'), (1, 'sum'), (4, ''),
(6, ''), (0, '')]
Feeding Data:
To feed data into our model we first create batches. The tokenized
predictions are then untokenized via the untokenize function of Python’s
source code tokenizer.
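The round trip from (type, string) pairs back to source text can be sketched with `tokenize.untokenize`. The example function is illustrative; when only 2-tuples are supplied, the original spacing is not preserved, which is why generated code tends to contain extra spaces around names and operators.

```python
import io
import tokenize

source = "def add_two_numbers(num1, num2):\n    return num1 + num2\n"
readline = io.BytesIO(source.encode("utf-8")).readline
pairs = [(tok.type, tok.string) for tok in tokenize.tokenize(readline)]

# The first token is ENCODING, so untokenize returns bytes.
rebuilt = tokenize.untokenize(pairs).decode("utf-8")

# The rebuilt text is spaced differently but is still valid Python.
namespace = {}
exec(rebuilt, namespace)
result = namespace["add_two_numbers"](2, 3)
```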
Loss Function:
We have used augmentations in our dataset to mask variable literals. This
means that our model can predict a variety of values for a particular
variable, and all of them are correct as long as the predictions are
consistent throughout the code. Our training labels are therefore
not fully certain, so it makes more sense to treat them as
correct with probability 1 - smooth_eps and incorrect otherwise. This is
what label smoothing does. By adding label smoothing to the
cross-entropy loss, we ensure that the model does not become too confident in
predicting variable names that can be replaced via augmentations.
Now, with all our components set, we can train our model using
backpropagation. We split our dataset into training and validation data.
Our model is trained until the validation loss stops improving.
It is important to note that label smoothing leads to much higher loss
values than those of models that do not use it.
This is expected, as we do not intend to be certain about our label
predictions. This is particularly the case with variables, as there can be
multiple correct options as long as the predictions are consistent throughout
the target code sequence.
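To make the effect on the loss concrete, here is a small sketch of cross-entropy against a smoothed target for a single prediction. It assumes the remaining probability mass smooth_eps is spread evenly over the other classes, which the text does not specify; the function name is illustrative.

```python
import math

def smoothed_cross_entropy(logits, target, smooth_eps=0.1):
    """Cross-entropy of one prediction against a smoothed target distribution."""
    n_classes = len(logits)
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]      # log-softmax
    off_value = smooth_eps / (n_classes - 1)        # assumed: spread evenly
    loss = 0.0
    for k, log_p in enumerate(log_probs):
        weight = 1.0 - smooth_eps if k == target else off_value
        loss -= weight * log_p
    return loss

confident = [10.0, 0.0, 0.0]   # the model is very sure about class 0
plain = smoothed_cross_entropy(confident, target=0, smooth_eps=0.0)
smoothed = smoothed_cross_entropy(confident, target=0, smooth_eps=0.1)
```

Even for a correct and confident prediction, `smoothed` is far larger than `plain`, which illustrates why smoothed models report higher loss values.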
Sample Results:
Input: “program to sort a list of dictionaries by key”
Output:
var_1 ={'Nikhil':{'roll':24 ,'marks':17 },
'Akshat':{'roll':54 ,'marks':12 },
'Akash':{'roll':15 },'marks':15 }}
sort_key ='marks'
res ='marks'
res =var_2 (test_dict .items (),key =lambda x :x [1 ][sort_key ])
print ("The sorted dictionary by marks is : "+str (res ))
Input: “function to sum odd elements of list”
Output:
def sum_odd_elements (l :list ):
    return sum ([i for i in l if i %2 ==1 ])
Input: “program to reverse a string”
Output:
var_1 = 'Today is bad day'
var_1 [::-1 ]
6.3 PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis is an unsupervised learning algorithm that
is used for dimensionality reduction in machine learning. It is a
statistical process that converts observations of correlated features into
a set of linearly uncorrelated features with the help of an orthogonal
transformation. These new transformed features are called the Principal
Components. It is one of the popular tools used for exploratory
data analysis and predictive modelling. It is a technique for drawing out strong
patterns from a given dataset by reducing the number of dimensions while
retaining as much of the variance as possible.
PCA generally tries to find a lower-dimensional surface onto which to
project the high-dimensional data.
PCA works by considering the variance of each attribute, because an
attribute with high variance shows a good split between the classes, and
hence it reduces the dimensionality. Some real-world applications of PCA
are image processing, movie recommendation systems, and optimizing the
power allocation in various communication channels. It is a feature extraction
technique, so it retains the important variables and drops the least
important ones.
The PCA algorithm is based on mathematical concepts such as:
● Variance and covariance
● Eigenvalues and eigenvectors
Some common terms used in the PCA algorithm:
● Dimensionality: The number of features or variables present in
the given dataset. Put simply, it is the number of columns in
the dataset.
● Correlation: It signifies how strongly two variables are related to
each other, i.e. if one changes, the other variable also
changes. The correlation value ranges from -1 to +1. Here, -1 occurs
if the variables have a perfect inverse linear relationship, and +1 indicates
a perfect direct linear relationship.
● Orthogonal: It means that the variables are not correlated with each other,
and hence the correlation between the pair of variables is zero.
● Eigenvectors: Given a square matrix M and a non-zero vector v,
v is an eigenvector of M if Mv is a scalar multiple of v.
● Covariance Matrix: A matrix containing the covariances between each
pair of variables is called the Covariance Matrix.
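The eigenvector definition can be checked numerically. This short sketch uses an arbitrary symmetric 2 x 2 matrix chosen for illustration and verifies that Mv equals the eigenvalue times v for each eigenvector returned by NumPy.

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(M)

# Each column v of `eigenvectors` satisfies M v = lambda v.
checks = [np.allclose(M @ v, lam * v)
          for lam, v in zip(eigenvalues, eigenvectors.T)]
```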
Principal Components in PCA:
As described above, the transformed new features, i.e. the output of PCA,
are the Principal Components. The number of these PCs is either equal to
or less than the number of original features present in the dataset. Some
properties of these principal components are given below:
● Each principal component must be a linear combination of the
original features.
● The components are orthogonal, i.e., the correlation between any pair
of components is zero.
● The importance of each component decreases when going from 1 to n:
the first PC has the most importance, and the n-th PC has the least.
Steps for PCA Algorithm:
1. Getting the dataset: First, we take the input dataset and
divide it into two subparts X and Y, where X is the training set and Y
is the validation set.
2. Representing data in a structure: Next we represent our
dataset in a structure, namely the two-dimensional
matrix of independent variables X. Here each row corresponds to a
data item, and each column corresponds to a feature. The number
of columns is the number of dimensions of the dataset.
3. Standardizing the data: In this step, we standardize our dataset.
Without standardization, features with high variance dominate
features with lower variance.
If the importance of a feature should be independent of its variance,
we subtract the column mean from each data item in a column and divide
it by the standard deviation of the column. We name the resulting matrix Z.
4. Calculating the covariance of Z: To calculate the covariance of Z,
we take the matrix Z and transpose it. We then multiply the transpose
by Z. The output matrix (up to a scaling factor of 1/(n - 1), where n is
the number of data items) is the covariance matrix of Z.
5. Calculating the eigenvalues and eigenvectors: Now we need to
calculate the eigenvalues and eigenvectors of the resultant covariance
matrix. The eigenvectors of the covariance matrix are the directions of
the axes carrying the most information (variance), and the eigenvalues
are the coefficients that give the amount of variance along each eigenvector.
6. Sorting the eigenvectors: In this step, we take all the
eigenvalues and sort them in decreasing order, that is, from
largest to smallest. We simultaneously sort the eigenvectors
accordingly in the matrix P of eigenvectors. The resultant matrix is
named P*.
7. Calculating the new features or principal components: Here we
calculate the new features. To do this, we multiply the matrix Z by
P*. In the resultant matrix Z*, each observation is a
linear combination of the original features. The columns of the Z* matrix
are uncorrelated with each other.
8. Removing less important or unimportant features from the new dataset:
With the new feature set in hand, we decide what to keep and
what to remove. That is, we keep only the relevant or important
features in the new dataset, and unimportant features are removed.
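Steps 3 to 8 can be sketched in a few lines of NumPy. The data here are synthetic and the variable names (Z, P_star, Z_star) simply mirror the notation of the steps; this is an illustration under those assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)   # two correlated features

# Step 3: standardize each column (zero mean, unit standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 4: covariance of Z, i.e. Z^T Z scaled by 1 / (n - 1).
cov = Z.T @ Z / (Z.shape[0] - 1)

# Step 5: eigen-decomposition (eigh, because cov is symmetric).
eigenvalues, P = np.linalg.eigh(cov)

# Step 6: sort eigenvalues in decreasing order and reorder the eigenvectors.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, P_star = eigenvalues[order], P[:, order]

# Step 7: project the data onto the sorted eigenvectors.
Z_star = Z @ P_star

# Step 8: keep only the first k components.
k = 2
Z_reduced = Z_star[:, :k]
```

The covariance matrix of `Z_star` is diagonal, with the sorted eigenvalues on the diagonal, which is the sense in which the new features are uncorrelated.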
Applications of Principal Component Analysis:
PCA is mainly used as a dimensionality reduction technique in
various AI applications such as computer vision, image
compression, etc.
It can also be used for finding hidden patterns in data with many
dimensions. Some fields where PCA is used are finance, data mining,
psychology, etc.
We can use principal component analysis (PCA) for the following purposes:
● To reduce the number of dimensions in the dataset
● To find patterns in a high-dimensional dataset
● To visualize data of high dimensionality
● To ignore noise
● To improve classification
● To get a compact description
● To capture as much of the original variance in the data as possible
In summary, we can define principal component analysis (PCA) as the
transformation of any high number of variables into a smaller number of
uncorrelated variables called principal components (PCs), developed to
capture as much of the data’s variance as possible.
PCA was invented in 1901 by Karl Pearson, as an analog of the principal
axis theorem, and was later independently developed and named by Harold
Hotelling in the 1930s [1] [2] [3].
Mathematically, the main objective of PCA is to:
● Find an orthonormal basis for the data.
● Sort dimensions in the order of importance.
● Discard the low significance dimensions.
● Focus on uncorrelated and Gaussian components.
Steps involved in PCA:
● Standardize the data.
● Calculate the covariance matrix.
● Find the eigenvalues and eigenvectors for the covariance matrix.
● Plot the vectors on the scaled data.
Example of a problem where PCA is required:
There are 100 students in a class with m different features like grade, age,
height, weight, hair color, and others.
Most of these features may not be relevant for describing a student.
Therefore, it is vital to find the critical features that characterize a student.
Some analysis based on the observation of different features of a student:
● Every student has a vector of data of length m that describes them,
e.g. (height, weight, hair_color, grade, ...) or (181, 68, black, 99, ...).
● Each column is one student vector, so n = 100.
● This creates an m × n matrix.
● Each student lies in an m-dimensional vector space.
Features to Ignore:
● Collinear features or linearly dependent features, e.g., leg size and
● Noisy features that are constant, e.g., the thickness of hair.
● Constant features, e.g., number of teeth.
Features to Keep:
● Non-collinear features, i.e., those with low covariance.
● Features that change a lot, i.e., those with high variance, e.g., grade.
Math Behind PCA:
It is essential to understand the mathematics involved before starting
with PCA. Eigenvalues and eigenvectors play important roles in PCA.
Eigenvectors and eigenvalues:
The eigenvectors and eigenvalues of a covariance (or correlation) matrix
form the core of PCA. The eigenvectors (principal components)
determine the directions of the new attribute space, and the eigenvalues
determine their magnitude.
PCA’s main objective is to reduce the data’s dimensionality by
projecting it into a smaller subspace, where the eigenvectors form the
axes. However, the eigenvectors define only the directions of the new axes,
because they all have a length of 1. Consequently, to decide which
eigenvector(s) we can discard without losing much information when
constructing the subspace, we check the corresponding eigenvalues. The
eigenvectors with the highest eigenvalues are the ones that carry the most
information about the distribution of our data.
Covariance Matrix:
The classic PCA approach calculates the covariance matrix, where each
element represents the covariance between two attributes. The covariance
between two attributes is calculated as shown below:
Figure 10: The equation to calculate the covariance between two attributes.
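Figure 10 is not reproduced in this text. Assuming the usual sample covariance is intended, for two attributes X and Y observed over n data items, with sample means \bar{x} and \bar{y}, the covariance is:

```latex
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

When X and Y are the same attribute, this reduces to the sample variance, which is why the variances appear on the main diagonal of the covariance matrix.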
Create a matrix:
import pandas as pd
import numpy as np
matrix = np.array([[0, 3, 4], [1, 2, 4], [3, 4, 5]])
Figure 11: Matrix.
Convert matrix to covariance matrix:
Figure 12: Covariance matrix.
A useful property of the covariance matrix is that the sum of the matrix’s
main diagonal (its trace) is equal to the sum of its eigenvalues.
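The trace property can be verified on the small matrix defined in this section. The source does not show how the covariance matrix was computed, so treating each column as one attribute (`rowvar=False`) is an assumption made here.

```python
import numpy as np

matrix = np.array([[0, 3, 4], [1, 2, 4], [3, 4, 5]])

# Assumption: each column is one attribute, each row one observation.
cov_matrix = np.cov(matrix, rowvar=False)

# eigvalsh returns the eigenvalues of a symmetric matrix.
eigenvalues = np.linalg.eigvalsh(cov_matrix)

trace = np.trace(cov_matrix)
eigenvalue_sum = eigenvalues.sum()
```

Both quantities equal the total variance of the data, which is what PCA redistributes across the principal components.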
Correlation Matrix:
Another way to calculate eigenvalues and eigenvectors is by using the
correlation matrix. In general the covariance and correlation matrices
are different, but normalizing the covariance matrix gives the correlation
matrix, so for standardized data both yield the same eigenvalues and
eigenvectors (shown later).
Figure 13: Equation of the correlation matrix.
Create a matrix:
matrix_a = np.array([[0.1, .32, .2, 0.4, 0.8],
                     [.23, .18, .56, .61, .12],
                     [.9, .3, .6, .5, .3],
                     [.34, .75, .91, .19, .21]])
Convert to correlation matrix:
Figure 14: Correlation matrix.
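As a hedged illustration of the link between the two matrices, this sketch computes the correlation matrix of matrix_a with `np.corrcoef` and compares it with the covariance matrix of the standardized columns. Treating columns as attributes (`rowvar=False`) is an assumption, since the original conversion code is not shown.

```python
import numpy as np

matrix_a = np.array([[0.1, .32, .2, 0.4, 0.8],
                     [.23, .18, .56, .61, .12],
                     [.9, .3, .6, .5, .3],
                     [.34, .75, .91, .19, .21]])

# Correlation between the columns (assumed to be the attributes).
corr = np.corrcoef(matrix_a, rowvar=False)

# Standardize each column, then take the covariance.
z = (matrix_a - matrix_a.mean(axis=0)) / matrix_a.std(axis=0, ddof=1)
cov_of_standardized = np.cov(z, rowvar=False)
```

The two results coincide, so on standardized data it makes no difference which matrix is decomposed.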
How does PCA work?
Figure 15: Working with PCA [5].
PCA is the orthogonal projection of data from high dimensions to lower
dimensions such that it (from Figure 15):
● Maximizes the variance of the data projected onto the line (purple)
● Minimizes the MSE between the data points and their projections (blue)
Applications of PCA:
These are the typical applications of PCA:
● Data Visualization.
● Data Compression.
● Noise Reduction.
● Data Classification.
● Image Compression.
● Face Recognition.
Implementation of PCA With Python:
Implementation of principal component analysis (PCA) on the Iris dataset
with Python:
Load Iris dataset:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target
df
Figure 16: Iris dataset.
Get the value of x and y:
x = df.drop(labels='class', axis=1).values
y = df['class'].values
Implementation of PCA with a covariance matrix:
class convers_pca():
    def __init__(self, no_of_components):
        self.no_of_components = no_of_components
        self.eigen_values = None
        self.eigen_vectors = None

    def transform(self, x):
        # Center the data and project it onto the selected components
        # (np.dot is assumed here; the call name is missing in the source).
        return np.dot(x - self.mean, self.projection_matrix.T)

    def inverse_transform(self, x):
        # Map projected data back to the original feature space.
        return np.dot(x, self.projection_matrix) + self.mean

    def fit(self, x):
        self.no_of_components = x.shape[1] if self.no_of_components is None else self.no_of_components
        self.mean = np.mean(x, axis=0)
        cov_matrix = np.cov(x - self.mean, rowvar=False)
        self.eigen_values, self.eigen_vectors = np.linalg.eig(cov_matrix)
        self.eigen_vectors = self.eigen_vectors.T  # one eigenvector per row
        self.sorted_components = np.argsort(self.eigen_values)[::-1]
        # Assumed reconstruction: keep the eigenvectors of the largest eigenvalues.
        self.projection_matrix = self.eigen_vectors[self.sorted_components[:self.no_of_components]]
        self.explained_variance = self.eigen_values[self.sorted_components]
        # Assumed reconstruction: normalize by the total variance.
        self.explained_variance_ratio = self.explained_variance / self.eigen_values.sum()
Standardization of x:
std = StandardScaler()
transformed = std.fit_transform(x)
PCA with two components:
pca = convers_pca(no_of_components=2)
pca.fit(transformed)
Check eigenvectors:
pca.eigen_vectors
Check eigenvalues:
pca.eigen_values
Check sorted component:
pca.sorted_components
Plot PCA with number of components = 2:
import matplotlib.pyplot as plt

x_std = pca.transform(transformed)
plt.figure()
plt.scatter(x_std[:, 0], x_std[:, 1], c=y)
plt.show()
Figure 17 : PCA visualization.
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Definition
7.3 Basic Algorithms
7.3.1 K-Means clustering
7.3.2 Practical advantages
7.4 Stages
7.5 Pseudo-code
7.6 The K-Means Algorithm Fits within the Framework of Cover’s Theorem
7.7 Partitioning Clustering Approach
7.8 The K-means algorithm: a heuristic method
7.8.1 How K-means partitions?
7.8.2 K-means Demo
7.8.3 Application
7.8.4 Relevant issues of K-Means algorithm
7.9 Let's Sum Up
7.10 Unit End Exercises
7.11 References
7.0 OBJECTIVES
This chapter will help you understand the following concepts:
What the K-Means clustering algorithm is
Definition of the K-Means clustering algorithm
Basics of K-Means clustering
Practical advantages of the K-Means clustering algorithm
Stages of the K-Means clustering algorithm
Pseudo-code of the K-Means clustering algorithm
The K-Means Algorithm Fits within the Framework of Cover’s Theorem
Partitioning Clustering Approach
The K-means algorithm: a heuristic method
How K-means partitions?
K-means Demo
Application of the K-Means algorithm
Relevant issues of the K-Means algorithm
7.1 INTRODUCTION: K-MEANS CLUSTERING ALGORITHM
K-Means Clustering is an unsupervised learning algorithm that is used to
solve clustering problems in machine learning or data science.
7.2 DEFINITION: K-MEANS CLUSTERING ALGORITHM
A prototypical unsupervised learning algorithm is K-means, which is a
clustering algorithm. Given $X = \{x_1, \ldots, x_m\}$, the goal of K-means is to
partition it into k clusters such that each point in a cluster is more similar to
points from its own cluster than to points from some other cluster.
7.3 BASIC ALGORITHMS
Towards this end, define prototype vectors $\mu_1, \ldots, \mu_k$ and an indicator
$r_{ij}$ which is 1 if, and only if, $x_i$ is assigned to cluster j. To cluster our
dataset we will minimize the following distortion measure, which
minimizes the distance of each point from its prototype vector:
$$J(r, \mu) := \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{k} r_{ij} \, \|x_i - \mu_j\|^2$$
where $r = \{r_{ij}\}$, $\mu = \{\mu_j\}$, and $\|\cdot\|^2$ denotes the usual Euclidean
square norm.
7.3.1 K-Means clustering:
The computation is to be performed in an unsupervised manner. In this
section, we describe a solution to this problem that is rooted in clustering,
by which we mean the following:
Clustering is a form of unsupervised learning whereby a set of
observations (i.e., data points) is partitioned into natural groupings or
clusters of patterns in such a way that the measure of similarity between
any pair of observations assigned to each cluster minimizes a specified
cost function.
We have chosen to focus on the so-called K-means algorithm because it is
simple to implement, yet effective in performance, two features that have
made it highly popular.
Let $\{x_i\}_{i=1}^{N}$ denote a set of multidimensional observations that is to be
partitioned into a proposed set of K clusters, where K is smaller than the
number of observations, N. Let the relationship
$$j = C(i), \quad i = 1, 2, \ldots, N$$
denote a many-to-one mapper, called the encoder, which assigns the i-th
observation $x_i$ to the j-th cluster according to a rule yet to be defined. To do
this encoding, we need a measure of similarity between every pair of
vectors $x_i$ and $x_{i'}$, which is denoted by $d(x_i, x_{i'})$. When the measure
$d(x_i, x_{i'})$ is small enough, both $x_i$ and $x_{i'}$ are assigned to the same cluster;
otherwise, they are assigned to different clusters.
To optimize the clustering process, we introduce the following cost
function (Hastie et al., 2001):
$$J(C) = \frac{1}{2} \sum_{j=1}^{K} \sum_{C(i)=j} \sum_{C(i')=j} d(x_i, x_{i'})$$
For a prescribed K, the requirement is to find the encoder C(i) = j for which
the cost function J(C) is minimized. At this point in the discussion, we
note that the encoder C is unknown, hence the functional dependence of
the cost function J on C.
In K-means clustering, the squared Euclidean norm is used to define the
measure of similarity between the observations $x_i$ and $x_{i'}$, as shown by
$$d(x_i, x_{i'}) = \|x_i - x_{i'}\|^2$$
Hence,
$$J(C) = \frac{1}{2} \sum_{j=1}^{K} \sum_{C(i)=j} \sum_{C(i')=j} \|x_i - x_{i'}\|^2$$
We now make two points:
1. The squared Euclidean distance between the observations $x_i$ and $x_{i'}$ is
symmetric; that is, $d(x_i, x_{i'}) = d(x_{i'}, x_i)$.
2. The inner summation reads as follows: For a given $x_{i'}$, the encoder C
assigns to cluster j all the observations $x_i$ that are closest to $x_{i'}$. Except
for a scaling factor, the sum of the observations so assigned is an
estimate of the mean vector pertaining to cluster j; the scaling factor
in question is $1/N_j$, where $N_j$ is the number of data points within
cluster j.
On account of these two points, we may therefore reduce the cost function to
the simplified form
$$J(C) = \sum_{j=1}^{K} \sum_{C(i)=j} \|x_i - \hat{\mu}_j\|^2$$
where $\hat{\mu}_j$ denotes the “estimated” mean vector associated with cluster j. In
effect, the mean $\hat{\mu}_j$ may be viewed as the center of cluster j. In light of this
simplified form, we may now restate the clustering problem as follows:
Given a set of N observations, find the encoder C that assigns these
observations to the K clusters in such a way that, within each cluster, the
average measure of dissimilarity of the assigned observations from the
cluster mean is minimized.
Indeed, it is because of the essence of this statement that the clustering
technique described herein is commonly known as the K-means algorithm.
For an interpretation of the cost function J(C) we may say that, except for
a scaling factor $1/N_j$, the inner summation in this equation is an estimate
of the variance of the observations associated with cluster j for a given
encoder C, as shown by
$$\hat{\sigma}_j^2 = \sum_{C(i)=j} \|x_i - \hat{\mu}_j\|^2$$
Accordingly, we may view the cost function J(C) as a measure of the total
cluster variance resulting from the assignments of all the N observations to
the K clusters that are made by encoder C.
With encoder C being unknown, how do we minimize the cost function
J(C)? To address this key question, we use an iterative descent algorithm,
each iteration of which involves a two-step optimization. The first step
minimizes the cost function J(C) with respect to the mean vector
$\hat{\mu}_j$ for a given encoder C. The second step uses the nearest-neighbor
rule to minimize the inner summation with respect to the encoder C for a given
mean vector $\hat{\mu}_j$. This two-step iterative procedure is continued until
convergence is attained.
Thus, in mathematical terms, the K-means algorithm proceeds in two steps:
Step 1: For a given encoder C, the total cluster variance is minimized with
respect to the assigned set of cluster means $\{\hat{\mu}_j\}_{j=1}^{K}$; that is, we perform the
following minimization:
$$\min_{\{\hat{\mu}_j\}_{j=1}^{K}} \sum_{j=1}^{K} \sum_{C(i)=j} \|x_i - \hat{\mu}_j\|^2 \quad \text{for a given } C$$
Step 2: Having computed the optimized cluster means in step 1, we next
optimize the encoder as follows:
$$C(i) = \arg\min_{1 \le j \le K} \|x_i - \hat{\mu}_j\|^2$$
Starting from some initial choice of the encoder C, the algorithm goes
back and forth between these two steps until there is no further change in
the cluster assignments.
Each of these two steps is designed to reduce the cost function J(C) in its
own way; hence, convergence of the algorithm is assured. However,
because the algorithm lacks a global optimality criterion, the result may
converge to a local minimum, resulting in a suboptimal solution to the
clustering assignment.
7.3.2 Practical advantages:
Nevertheless, the algorithm has practical advantages:
1. The K-means algorithm is computationally efficient, in that its
complexity is linear in the number of clusters.
2. When the clusters are compactly distributed in data space, they are
faithfully recovered by the algorithm.
One last comment is in order: To initialize the K-means algorithm, the
recommended procedure is to start the algorithm with many different
random choices for the means for the proposed size K and then choose the
particular set for which the double summation in the cost function J(C)
assumes the smallest value.
7.4 STAGES OF K-MEANS CLUSTERING ALGORITHM
Our goal is to find r and µ, but since it is not easy to jointly minimize J
with respect to both r and µ, we will adopt a two-stage strategy:
Stage 1:
Keep the µ fixed and determine r.
In this case, it is easy to see that the minimization decomposes into m
independent problems. The solution for the i-th data point x_i can be found
by setting:
$$r_{ij} = 1 \quad \text{if } j = \arg\min_{j'} \| x_i - \mu_{j'} \|^2$$
and 0 otherwise.
Stage 2:
Keep the r fixed and determine µ. Since the r's are fixed, J is a quadratic
function of µ. It can be minimized by setting the derivative with respect to
µ_j to be 0:
$$\sum_{i=1}^{m} r_{ij} (x_i - \mu_j) = 0 \quad \text{for all } j$$
Rearranging obtains
$$\mu_j = \frac{\sum_i r_{ij} x_i}{\sum_i r_{ij}}$$
Since $\sum_i r_{ij}$ counts the number of points assigned to cluster j, we are
essentially setting µ_j to be the sample mean of the points assigned to
cluster j.
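As a quick numerical check of Stage 2, the sketch below compares the sum of squared distances at the sample mean with the same sum at two nearby candidate centers. The one-dimensional cluster values and the helper name `sum_sq` are illustrative assumptions, not taken from the text.

```python
def sum_sq(points, mu):
    """Sum of squared distances from every point to a candidate center mu."""
    return sum((x - mu) ** 2 for x in points)

# Hypothetical 1-D cluster.
cluster = [1.0, 2.0, 6.0]
mean = sum(cluster) / len(cluster)

# Evaluate the cost slightly to the left of, at, and to the right of the mean.
candidates = [mean - 0.5, mean, mean + 0.5]
costs = [sum_sq(cluster, c) for c in candidates]

for c, j in zip(candidates, costs):
    print(f"center = {c:.2f}  cost = {j:.2f}")
```

The cost is smallest at the sample mean, in line with setting the derivative to zero.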
7.5 PSEUDO-CODE
Detailed pseudo-code for the K-Means algorithm:
Cluster(X) {Cluster dataset X}
Initialize cluster centers µ_j for j = 1,...,k randomly
repeat
    for i = 1 to m do
        Compute j' = argmin_{j=1,...,k} d(x_i, µ_j)
        Set r_ij' = 1 and r_ij = 0 for all j ≠ j'
    end for
    for j = 1 to k do
        Compute µ_j = (Σ_i r_ij x_i) / (Σ_i r_ij)
    end for
until Cluster assignments r_ij are unchanged
return {µ_1,...,µ_k} and r_ij
The algorithm stops when the cluster assignments do not change.
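The pseudo-code above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `cluster`, the `seed` parameter, the choice of k random data points as initial centers, and the sample data are not from the text, and an empty cluster simply keeps its previous center.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance d(x_i, mu_j)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cluster(X, k, seed=0):
    """K-means following the pseudo-code: returns (centers, r) with r one-hot."""
    rng = random.Random(seed)
    mu = [tuple(x) for x in rng.sample(X, k)]  # random initial centers
    assignment = None
    while True:
        # Assignment loop: index j' of the nearest center for every point.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(x, mu[j])) for x in X
        ]
        if new_assignment == assignment:       # assignments unchanged: stop
            break
        assignment = new_assignment
        # Update loop: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [x for x, a in zip(X, assignment) if a == j]
            if members:
                mu[j] = tuple(sum(col) / len(members) for col in zip(*members))
    r = [[1 if assignment[i] == j else 0 for j in range(k)]
         for i in range(len(X))]
    return mu, r

# Hypothetical data: two well-separated groups of three points each.
X = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, r = cluster(X, k=2)
print(centers)
print(r)
```

Each row of `r` contains exactly one 1, which is the hard assignment of that point to a cluster.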
7.6 THE K-MEANS ALGORITHM FITS WITHIN THE FRAMEWORK OF COVER'S THEOREM
The K-means algorithm applies a nonlinear transformation to the input
signal x. We say so because the measure of dissimilarity on which it is
based (namely, the squared Euclidean distance) is a nonlinear function
of the input signal x for a given cluster center x_j. Furthermore, with each
cluster discovered by the K-means algorithm defining a particular
computational unit in the hidden layer, it follows that if the number of
clusters, K, is large enough, the K-means algorithm will satisfy the other
requirement of Cover's theorem, that is, that the dimensionality of the
hidden layer is high enough. We therefore conclude that the K-means
algorithm is indeed computationally powerful enough to transform a set of
nonlinearly separable patterns into separable ones in accordance with this
theorem. Now that this objective has been satisfied, we are ready to
consider designing the linear output layer of the RBF network.
7.7 PARTITIONING CLUSTERING APPROACH
● A typical clustering analysis approach that iteratively partitions the
training data set to learn a partition of the given data space
● Learning a partition on a data set produces several non-empty
clusters (usually, the number of clusters is given in advance)
● In principle, the optimal partition is achieved by minimising the sum of
squared distances of the points to the "representative object" of their cluster:
$$E = \sum_{k=1}^{K} \sum_{x \in C_k} d^2(x, m_k)$$
e.g., with the Euclidean distance $d^2(x, m_k) = \sum_{n=1}^{N} (x_n - m_{kn})^2$
● Given a K, find a partition of K clusters that optimises the chosen
partitioning criterion (cost function)
● Global optimum: exhaustively search all partitions
7.8 THE K-MEANS ALGORITHM: A HEURISTIC METHOD
● K-means algorithm (MacQueen'67): each cluster is represented by the
centre of the cluster and the algorithm converges to stable centroids of
clusters.
● K-means algorithm is the simplest partitioning method for clustering
analysis and is widely used in data mining applications.
Given the cluster number K, the K-means algorithm is carried out in three
steps after initialisation:
Initialisation: set seed points (randomly)
1. Assign each object to the cluster of the nearest seed point, measured
with a specific distance metric.
2. Compute new seed points as the centroids of the clusters of the
current partition (the centroid is the centre, i.e., mean point, of the
cluster).
3. Go back to Step 1; stop when there are no more new assignments (i.e.,
membership in each cluster no longer changes).
7.8.1 How does K-means partition the data?
When K centroids are set/fixed, they partition the whole data space into K
mutually exclusive subspaces to form a partition.
A partition amounts to a Voronoi diagram. Changing the positions of the
centroids leads to a new partitioning.
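The Voronoi view can be illustrated with a few lines of Python: a point belongs to the region of whichever centroid is nearest, so moving a centroid can change the region of a point that did not move. The centroid coordinates, the query point and the helper name `region` are illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def region(point, centroids):
    """Index of the Voronoi cell (nearest centroid) that contains the point."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))

# Hypothetical centroids and query point.
centroids = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
query = (2.5, 1.0)
before = region(query, centroids)

# Move the second centroid further away; the partition changes.
moved = [(0.0, 0.0), (6.0, 0.0), (2.0, 3.0)]
after = region(query, moved)

print("region before:", before, "region after:", after)
```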
7.8.2 K-means Demo:
[Figure: K-means demo]
7.8.3 Application:
Colour-Based Image Segmentation Using K-means
Step 1: Load a colour image of tissue stained with haematoxylin and
eosin (H&E).
Step 2: Convert the image from RGB colour space to L*a*b* colour space.
● Unlike the RGB colour model, L*a*b* colour is designed to
approximate human vision.
● There is a complicated transformation between RGB and L*a*b*:
(L*, a*, b*) = T(R, G, B).
(R, G, B) = T'(L*, a*, b*).
Step 3: Undertake clustering analysis in the (a*, b*) colour space with the
K-means algorithm.
● In the L*a*b* colour space, each pixel has a property or feature
vector: (L*, a*, b*).
● As in feature selection, the L* feature is discarded. As a result, each pixel
has a feature vector (a*, b*).
● Apply the K-means algorithm to the image in the a*b* feature space,
where K = 3 is chosen using domain knowledge.
Step 4: Label every pixel in the image using the results from
K-means clustering (indicated by three different grey levels).
Step 5: Create images that segment the H&E image by colour.
● Apply the label and the colour information of each pixel to obtain
separate colour images corresponding to the three clusters ("blue" pixels,
"white" pixels and "pink" pixels).
Step 6: Segment the nuclei into a separate image with the L* feature.
● In cluster 1, there are dark and light blue objects (pixels). The dark blue
objects (pixels) correspond to nuclei (from domain knowledge).
● The L* feature specifies the brightness value of each colour.
● With a threshold for L*, we obtain an image containing only the nuclei.
7.8.4 Relevant issues of K-Means algorithm
Computational complexity
● O(tKn), where n is the number of objects, K is the number of clusters, and t is
the number of iterations. Normally, K, t << n.
Local optimum
● Sensitive to initial seed points
● Converges to a local optimum, which may be an unwanted solution
Other problems
● Need to specify K, the number of clusters, in advance
● Unable to handle noisy data and outliers (addressed by the K-Medoids algorithm)
● Not suitable for discovering clusters with non-convex shapes
● How to evaluate the K-means performance?
Two issues with K-Means are worth noting.
First, it is sensitive to the choice of the initial cluster centers µ. A number
of practical heuristics have been developed. For instance, one could
randomly choose k points from the given dataset as cluster centers. Other
methods try to pick k points from X which are farthest away from each
other.
Second, it makes a hard assignment of every point to a cluster center.
Variants which we will encounter later in the book will relax this. Instead
of letting r_ij ∈ {0,1}, these soft variants will replace it with the probability
that a given x_i belongs to cluster j.
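To make the contrast between hard and soft assignment concrete, the sketch below replaces the 0/1 indicator with a probability. The specific formula (a softmax over negative squared distances, with a stiffness parameter `beta`) is an assumption chosen for illustration; it is one common choice, not necessarily the variant introduced later in the book.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def soft_assign(x, centers, beta=1.0):
    """Soft memberships of x: softmax of -beta * squared distance (assumed form)."""
    weights = [math.exp(-beta * sq_dist(x, c)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical cluster centers.
centers = [(0.0, 0.0), (2.0, 0.0)]

midpoint = soft_assign((1.0, 0.0), centers)     # equally far from both centers
near_first = soft_assign((0.5, 0.0), centers)   # closer to the first center

print("midpoint:", midpoint)
print("near first center:", near_first)
```

A larger `beta` pushes the probabilities towards 0 and 1, so the hard assignment of K-Means is recovered in the limit.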
The K -Means algorithm concludes our discussion of a set of basic
machine learning methods for classification and regression. They provide
a useful starting point for an aspiring machine learning researcher.
7.9 LET'S SUM UP
In this chapter we covered the definition, basic algorithm, stages and
pseudo-code of the K-Means clustering algorithm.
7.10 UNIT END EXERCISES
Take an available data set and run the K-Means clustering algorithm on it
with different inputs.
Unit Structure
8.0 Objectives
8.1 Definition – K-Medoid clustering algorithm
8.2 Introduction - K-Medoid clustering algorithm
8.3 K-Means & K -Medoids Clustering - Outliers Comparison
8.4 K-Medoi ds - Basic Algorithm
8.5 K-Medoids - Pam Algorithm
8.5.1 Typical PAM Example
8.6 Advantages and Disadvantages of PAM
8.7 CLARA – Clustering Large Applications
8.7.1 CLARA Algorithm
8.8 Comparison CLARA Vs PAM
8.9 Applications
8.10 General Applications of Clustering
8.11 Working of the K -Medoids approach
8.11.1 Complexity of K -Medoids algorithm
8.11.2 Advantages of the technique
8.12 Practical Implementation
8.13 Lets Sum up
8.14 Unit End Exercises
8.15 References
8.0 OBJECTIVES
This chapter will help you understand the following concepts:
What is K-Medoid clustering algorithm
Definition of K-Medoid clustering algorithm
Comparison of K-Medoid clustering algorithm
K-Medoid Basic algorithm
K-Medoid PAM algorithm
Clara – Clustering Large Applications
Working and Practical Implementation
8.1 DEFINITION – K-MEDOID CLUSTERING ALGORITHM
K-Medoids is a clustering algorithm resembling the K-Means clustering
technique. It falls under the category of unsupervised machine learning.
8.2 INTRODUCTION - K-MEDOID CLUSTERING ALGORITHM
K-Medoids differs from the K-Means algorithm mainly in the way it
selects the clusters' centres. K-Means selects the average of a cluster's
points as its centre (which may or may not be one of the data points), while
K-Medoids always picks actual data points from the clusters as their
centres (also known as 'exemplars' or 'medoids'). K-Medoids also
differs in this respect from the K-Medians algorithm, which is the same as
K-Means, except that it chooses the medians (instead of means) of the
clusters as centres.
8.3 K-MEANS & K-MEDOIDS CLUSTERING - OUTLIERS COMPARISON
The mean in K-Means clustering is sensitive to outliers, since an object
with an extremely high value may substantially distort the distribution of
data. Hence we move to K-Medoids: instead of taking the mean of a cluster, we
take the most centrally located point in the cluster as its centre. These points are
called medoids.
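The effect of an outlier on the two kinds of centre can be checked with a small example. The five values below are an illustrative assumption; the medoid is computed as the data point with the smallest total absolute distance to all the others.

```python
# Hypothetical 1-D cluster with one extreme value.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

# Mean: pulled strongly towards the outlier.
mean = sum(values) / len(values)

# Medoid: the actual data point with the smallest total distance to the rest.
medoid = min(values, key=lambda c: sum(abs(c - v) for v in values))

print("mean   =", mean)
print("medoid =", medoid)
```

The mean lands far from every typical value, while the medoid stays inside the main group of points.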
8.4 K-MEDOIDS - BASIC ALGORITHM
Input: K, the number of clusters to form
Initialize: Select K points as the initial representative objects, i.e., the initial K
medoids of our K clusters.
Repeat:
    Assign each point to the cluster with the closest medoid m.
    Randomly select a non-representative object o_i.
    Compute the total cost S of swapping the medoid m with o_i.
    If S < 0:
        Swap m with o_i to form the new set of medoids.
Stop when the convergence criterion is met.
8.5 K-MEDOIDS - PAM ALGORITHM
PAM stands for Partitioning Around Medoids.
GOAL: To find clusters that have minimum average dissimilarity
between objects that belong to the same cluster.
1. Start with an initial set of medoids.
2. Iteratively replace one of the medoids with a non-medoid if doing so reduces
the total SSE of the resulting clusters.
SSE is calculated as below:
$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, M_i)^2$$
where k is the number of clusters, x is a data point in cluster C_i and M_i is the
medoid of C_i.
8.5.1 Typical PAM Example:
For K = 2
Randomly select m1 = (3,4) and m2 = (7,4), i.e., objects o2 and o8.
Using Manhattan distance as the dissimilarity metric we get,
C1 = (o1, o2, o3, o4)
C2 = (o5, o6, o7, o8, o9, o10)
Compute the absolute error as follows, where d(a, b) is the Manhattan distance between a and b:
E = d(o1,o2) + d(o3,o2) + d(o4,o2) + d(o5,o8) + d(o6,o8) + d(o7,o8) + d(o9,o8) + d(o10,o8)
E = (3+4+4) + (3+1+1+2+2)
Therefore, E = 20
Swapping o8 with o7
Compute the absolute error as follows:
E = d(o1,o2) + d(o3,o2) + d(o4,o2) + d(o5,o7) + d(o6,o7) + d(o8,o7) + d(o9,o7) + d(o10,o7)
E = (3+4+4) + (2+2+1+3+3)
Therefore, E = 22
Let's now calculate the cost function S for this swap: S = E for (o2, o7) - E for
(o2, o8)
S = 22 - 20 = 2
Therefore S > 0,
This swap is undesirable.
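The arithmetic of this example can be reproduced in a few lines of Python. The text gives only the two initial medoids, (3,4) and (7,4); the remaining coordinates below are an assumption, chosen so that every distance matches the numbers shown in the worked example. The helper names `manhattan` and `total_cost` are likewise illustrative.

```python
# Assumed coordinates for o1..o10 (only o2 and o8 are stated in the text).
points = {
    "o1": (2, 6), "o2": (3, 4), "o3": (3, 8), "o4": (4, 7), "o5": (6, 2),
    "o6": (6, 4), "o7": (7, 3), "o8": (7, 4), "o9": (8, 5), "o10": (7, 6),
}

def manhattan(a, b):
    """Manhattan (city-block) distance between two 2-D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(medoid_names):
    """Absolute error E: each point contributes the distance to its nearest medoid."""
    medoids = [points[name] for name in medoid_names]
    return sum(min(manhattan(p, m) for m in medoids) for p in points.values())

E_before = total_cost(["o2", "o8"])   # medoids (3,4) and (7,4)
E_after = total_cost(["o2", "o7"])    # o8 swapped for o7
S = E_after - E_before                # cost of the swap

print("E before swap:", E_before)
print("E after swap: ", E_after)
print("S =", S, "-> swap rejected" if S > 0 else "-> swap accepted")
```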
8.6 ADVANTAGES AND DISADVANTAGES OF PAM
● PAM is more flexible as it can use any similarity measure.
● PAM is more robust than K-Means as it handles noise better.
The PAM algorithm for K-medoid clustering works well for small data sets but does not
scale well to large data sets due to its high computational overhead.
PAM complexity: O(k(n-k)²) per iteration. This is because we compute the distance of the n-k
non-medoid points to each of the k medoids to decide in which cluster each point falls, and after
this we try to replace each medoid with a non-medoid and recompute its
distances to the n-k points.
To overcome this we make use of CLARA.
8.7 CLARA – CLUSTERING LARGE APPLICATIONS
● Improvement over PAM
● Finds medoids in a sample from the dataset
● [Idea]: If the samples are sufficiently random, the medoids of the
sample approximate the medoids of the dataset
● [Heuristics]: 5 samples of size 40+2k give satisfactory results
● Works well for large datasets (n=1000, k=10)
8.7.1 CLARA Algorithm:
1. Randomly draw multiple subsets of fixed size (sampsize) from the data set.
2. Run the PAM algorithm on each subset and choose the corresponding
k representative objects (medoids). Assign each observation of the
entire data set to the closest medoid.
3. Calculate the mean (or the sum) of the dissimilarities of the
observations to their closest medoid. This is used as a measure of the
goodness of the clustering.
4. Retain the sub-dataset for which the mean (or sum) is minimal. A
further analysis is carried out on the final partition.
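The four steps above can be sketched as follows. This is a simplified illustration, not a reference implementation: the function names (`pam`, `clara`, `total_cost`), the greedy swap strategy used inside `pam`, the use of Manhattan distance, the `seed` parameter and the synthetic two-blob data set are all assumptions. The defaults of 5 samples of size 40+2k follow the heuristic quoted in the text.

```python
import random

def manhattan(a, b):
    """Manhattan distance between two points."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def total_cost(data, medoids):
    """Sum of dissimilarities of all points to their closest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in data)

def pam(data, k):
    """Simplified swap-based PAM: accept any swap that lowers the total cost."""
    medoids = list(data[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in data:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                if total_cost(data, trial) < total_cost(data, medoids):
                    medoids, improved = trial, True
    return medoids

def clara(data, k, n_samples=5, sample_size=None, seed=0):
    """Run PAM on random samples; keep the medoids that score best on ALL data."""
    rng = random.Random(seed)
    size = min(len(data), sample_size or 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(data, size)            # step 1: draw a subset
        medoids = pam(sample, k)                   # step 2: PAM on the subset
        cost = total_cost(data, medoids) / len(data)  # step 3: mean dissimilarity
        if cost < best_cost:                       # step 4: retain the best
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Hypothetical data: two 10 x 10 grids of points, far apart from each other.
data = [(x, y) for x in range(10) for y in range(10)]
data += [(x + 50, y + 50) for x in range(10) for y in range(10)]

best_medoids, best_cost = clara(data, k=2)
print("medoids:", best_medoids, "mean dissimilarity:", round(best_cost, 2))
```

Only the samples are clustered with PAM, while the quality measure is evaluated on the whole data set, which is what keeps the method affordable for large n.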
8.8 COMPARISON CLARA vs PAM
Strength:
● CLARA deals with larger data sets than PAM
● CLARA outperforms PAM in terms of running time and
quality of clustering
Weakness:
● Efficiency depends on the sample size
● A good clustering based on samples will not necessarily
represent a good clustering of the whole data set
8.9 APPLICATIONS
● Social networks
● Document clustering
8.10 GENERAL APPLICATIONS OF CLUSTERING
1. Pattern Recognition
2. Spatial Data Analysis
a. create thematic maps in GIS by clustering feature spaces
b. detect spatial clusters and explain them in spatial data mining
3. Image Processing
4. Economic Science (especially market research)
5. WWW
a. Document classification
b. Cluster Weblog data to discover groups of similar access patterns
8.11 WORKING OF THE K-MEDOIDS APPROACH
The steps followed by the K-Medoids algorithm for clustering are as
follows:
1. Randomly choose 'k' points from the input data ('k' is the number of
clusters to be formed). The correctness of the choice of k's value can be
assessed using methods such as the silhouette method.
2. Each data point gets assigned to the cluster to which its nearest medoid
belongs.
3. For each data point of cluster i, its distance from all other data points is
computed and added. The point of the i-th cluster for which the computed
sum of distances from other points is minimal is assigned as the medoid
for that cluster.
4. Steps (2) and (3) are repeated until convergence is reached, i.e., the
medoids stop moving.
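Steps (1) to (4) can be sketched in plain Python as shown below. The function name `k_medoids`, the `seed` and `max_iter` parameters, the Manhattan distance and the sample points are illustrative assumptions. Which clustering is found depends on the random initial medoids, so the example prints the result rather than asserting a particular partition.

```python
import random

def manhattan(a, b):
    """Manhattan distance between two points."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def k_medoids(data, k, dist, seed=0, max_iter=100):
    """Alternating K-Medoids following steps (1)-(4) of the text."""
    rng = random.Random(seed)
    medoids = rng.sample(data, k)                    # step 1: random medoids
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]            # step 2: nearest medoid
        for p in data:
            j = min(range(k), key=lambda c: dist(p, medoids[c]))
            clusters[j].append(p)
        new_medoids = [                              # step 3: most central point
            min(members, key=lambda c: sum(dist(c, q) for q in members))
            if members else medoids[j]
            for j, members in enumerate(clusters)
        ]
        if new_medoids == medoids:                   # step 4: medoids stopped moving
            break
        medoids = new_medoids
    return medoids, clusters

# Hypothetical data: two compact groups plus one far-away outlier.
data = [(1, 1), (1, 2), (2, 1), (2, 2),
        (8, 8), (8, 9), (9, 8), (9, 9), (30, 30)]

medoids, clusters = k_medoids(data, k=2, dist=manhattan)
print("medoids: ", medoids)
print("clusters:", clusters)
```

Every medoid returned is one of the original data points, which is the defining difference from the centroids of K-Means.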
8.11.1 Complexity of K-Medoids algorithm:
The complexity of the K-Medoids algorithm comes to O(N²CT), where N,
C and T denote the number of data points, number of clusters and number
of iterations respectively. With similar notation, the complexity of the K-Means
algorithm can be given as O(NCT).
8.11.2 Advantages of the technique:
The mean of the data points is a measure that is strongly affected by
extreme points. So in the K-Means algorithm, the centroid may get shifted to a
wrong position, and hence result in incorrect clustering, if the data has
outliers, because the other points will then lie far away from it. On the contrary, a
medoid in the K-Medoids algorithm is the most central element of the
cluster, such that the sum of its distances from the other points is minimum. Since
medoids are not influenced by extreme values, the K-Medoids algorithm is
more robust to outliers and noise than the K-Means algorithm.
[Figure: how the positions of the mean and the medoid vary in the presence of an outlier]
Besides, the K-Medoids algorithm can be used with an arbitrarily chosen
dissimilarity measure (e.g. cosine dissimilarity) or any distance metric, unlike
K-Means, which usually needs the Euclidean distance metric to arrive at
efficient solutions.
The K-Medoids algorithm is found useful for practical applications such as
face recognition. The medoid can correspond to the typical photo of the
individual whose face is to be recognized. But if the K-Means algorithm is
used instead, some blurred image may get assigned as the centroid, which
has mixed features from several photos of the individual and hence makes
the face recognition task difficult.
8.12 PRACTICAL IMPLEMENTATION
Here's a demonstration of implementing the K-Medoids algorithm on a
dataset containing 8×8-pixel images of handwritten digits. The task
is to divide the data points into 10 clusters (for classes 0-9) using K-
Medoids. The dataset used is a copy of the test set of the original
dataset available on the UCI ML Repository. The code here has been
implemented in Google Colab using Python 3.7.10 and scikit-learn-extra
0.1.0b2.
A step-wise explanation of the code is as follows:
1. Install:
The scikit-learn-extra Python module is an extension of scikit-learn designed for
implementing more advanced algorithms that are not available through
scikit-learn alone.
!pip install scikit-learn-extra
2. Import required libraries and modules:
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
#Import the digits dataset available in the sklearn.datasets package
from sklearn.datasets import load_digits
Instead of using all 64 attributes of the dataset, we use Principal
Component Analysis (PCA) to reduce the dimensions of the feature set such
that most of the useful information is retained.
from sklearn.decomposition import PCA
Import the module for standardizing the dataset, i.e., rescaling the data such
that it has a mean of 0 and a standard deviation of 1:
from sklearn.preprocessing import scale
3. Prepare the input data:
#Load the digits dataset
dataset = load_digits()
#Standardize the data
digit_data = scale(dataset.data)
Compute the number of output classes, i.e., the number of digits for which we have
data (here 10, for digits 0-9):
num_digits = len(np.unique(dataset.target))
4. Reduce the dimensions of the data using PCA:
red_data = PCA(n_components=2).fit_transform(digit_data)
PCA constructs new components as linear combinations of the original
features. The 'n_components' parameter denotes the number of newly formed
components to be considered. The fit_transform() method fits the PCA
model and performs dimensionality reduction on digit_data.
5. Plot the decision boundaries for each cluster. Assign a different
color to each for differentiation:
h = 0.02 #step size of the mesh
#Minimum and maximum x-coordinates
xmin, xmax = red_data[:, 0].min() - 1, red_data[:, 0].max() + 1
#Minimum and maximum y-coordinates
ymin, ymax = red_data[:, 1].min() - 1, red_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(xmin, xmax, h), np.arange(ymin, ymax, h))
6. Define an array of K-Medoids variants to be used:
We have used three different distance metrics (Manhattan distance,
Euclidean distance and Cosine dissimilarity/distance) for computing
the distance of each data point from every other data point while selecting
the medoid.
The parameters we have specified in the KMedoids() constructor have the
following significance:
● metric – distance metric to be used (default: 'euclidean')
● n_clusters – number of clusters to be formed and hence the number of
medoids (one per cluster) (default value: 8)
● init – 'heuristic' method used for medoid initialization:
for each data point, its distance from all other points is computed and
the distances are summed up. The n_clusters points for which this
sum of distances is smallest are chosen as the initial medoids.
● max_iter – maximum number of the algorithm's iterations to be
performed when fitting the data
In recent scikit-learn-extra releases, KMedoids() can find the medoids either
with a faster alternating procedure (method='alternate', the default) or with
the PAM (Partitioning Around Medoids) algorithm (method='pam').
models = [
    (KMedoids(metric="manhattan", n_clusters=num_digits,
              init="heuristic", max_iter=2), "Manhattan metric"),
    (KMedoids(metric="euclidean", n_clusters=num_digits,
              init="heuristic", max_iter=2), "Euclidean metric"),
    (KMedoids(metric="cosine", n_clusters=num_digits,
              init="heuristic", max_iter=2), "Cosine metric"),
]
7. Initialize the number of rows and columns of the plot for plotting
subplots of each of the three metrics' results:
#number of rows = integer(ceiling(number of model variants/2))
num_rows = int(np.ceil(len(models) / 2.0))
#number of columns
num_cols = 2
8. Fit each of the model variants to the data and plot the resultant
clustering:
#Clear the current figure first (if any)
plt.clf()
#Initialize dimensions of the plot (the size is an illustrative choice)
plt.figure(figsize=(15, 10))
The 'models' array defined in step (6) contains three tuples, each having a
model variant and its descriptive text. We iterate through
each of the tuples, fit the model to the data and plot the results.
for i, (model, description) in enumerate(models):
    #Fit the model to the reduced data
    model.fit(red_data)
    #Predict the labels for points in the mesh
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    #Put the result into a color plot
    Z = Z.reshape(xx.shape)
    #Subplot for the ith model variant
    plt.subplot(num_rows, num_cols, i + 1)
    #Display the subplot
    plt.imshow(
        Z, #data to be plotted
        interpolation="nearest",
        #bounding box coordinates (left,right,bottom,top)
        extent=(xx.min(), xx.max(), yy.min(), yy.max()),
        cmap=plt.cm.Paired, #colormap
        aspect="auto", #aspect ratio of the axes
        origin="lower", #set origin as lower left corner of the axes
    )
    #Plot the data points as small black dots
    plt.plot(
        red_data[:, 0], red_data[:, 1], "k.", markersize=2, alpha=0.3
    )
    #Plot the medoids as white cross marks
    centroids = model.cluster_centers_
    plt.scatter(
        centroids[:, 0],
        centroids[:, 1],
        marker="x",
        s=169, #marker's size (points^2)
        linewidths=3, #width of boundary lines
        color="w", #white color for medoid markings
        zorder=10, #drawing order of axes
    )
    #describing text of the tuple will be title of the subplot
    plt.title(description)
    plt.xlim(xmin, xmax) #limits of x-coordinates
    plt.ylim(ymin, ymax) #limits of y-coordinates
    plt.xticks(())
    plt.yticks(())
#Upper title of the whole plot
plt.suptitle(
    #Text to be displayed
    "K-Medoids algorithm implemented with different metrics\n\n",
    fontsize=20, #size of the fonts
)
plt.show()
8.13 LET'S SUM UP
In this chapter we covered:
● What the K-Medoid clustering algorithm is
● Definition of the K-Medoid clustering algorithm
● Comparison of the K-Medoid clustering algorithm with K-Means
● K-Medoid basic algorithm
● K-Medoid PAM algorithm
● CLARA – Clustering Large Applications
● Working and practical implementation
8.14 UNIT END EXERCISES
Take an available data set and run the K-Medoid clustering algorithm on it
with different inputs.
Unit Structure
9.0 Introduction to SVMS
9.1 What Is A Support Vector Machine, And How Does It Work?
9.2 What Is The Purpose of SVM?
9.3 Importing Datasets
9.4 The Establishment of A Support Vector Machine
9.5 A Simple Description of The SVM Classification Algorithm
9.6 What Is The Best Way To Transform This Problem Into A Linear One?
9.7 Kernel For The Radial Basis Function (RBF) And Python Examples
9.8 Build A Model With Default Values For C And Gamma
9.9 Radial Basis Function (RBF) Kernel: The Go-To Kernel
9.10 Conclusion
9.11 References
9.0 INTRODUCTION TO SVMS
Support vector machines (SVMs, also known as support vector networks)
are supervised learning models with associated learning algorithms for
classification and regression analysis in machine learning. A Support
Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyperplane. In other words, given labeled training
data (supervised learning), the algorithm produces an optimal hyperplane
that categorizes new samples.
9.1 WHAT IS A SUPPORT VECTOR MACHINE, AND HOW DOES IT WORK?
An SVM model is a representation of the examples as points in space,
mapped so that the examples of the different categories are separated by as
wide a gap as possible. In addition to linear classification, SVMs can
perform non-linear classification by implicitly mapping their inputs into
high-dimensional feature spaces.
9.2 WHAT IS THE PURPOSE OF SVM?
Given a set of training examples, each marked as belonging to one of two
categories, an SVM training algorithm builds a model that assigns new examples to
one of the two categories, making it a non-probabilistic binary linear
classifier.
Before you go any further, make sure you have a basic knowledge of this
topic. In this chapter, we show how to use machine learning tools
such as scikit-learn to classify the UCI cancer dataset using SVM.
NumPy, Pandas, Matplotlib and scikit-learn are required.
Let's look at a simple support vector classification example. To begin, we
must first generate a dataset:
Implementation in Python
# importing make_blobs from scikit-learn
from sklearn.datasets import make_blobs
# creating dataset X containing n_samples
# and Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)
import matplotlib.pyplot as plt
# plotting scatters
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.show()
In addition to drawing a line between the two classes, support vector machines
consider a region of a particular width around the line. Here's an
example of how it may appear:
import numpy as np
# creating line space between -1 and 3.5
xfit = np.linspace(-1, 3.5)
# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5)
plt.show()
9.3 IMPORTING DATASETS
Support vector machines build on this intuition: they optimize a linear
discriminant model that maximizes the perpendicular distance (the margin)
between the classes. Let's now use our training data to train the classifier. We
must first import the cancer dataset as a CSV file, from which we will use
two of the features for training.
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# reading csv file and extracting class column to y.
x = pd.read_csv(r"C:\...\cancer.csv")
a = np.array(x)
y = a[:,30] # classes having 0 and 1
# extracting two features
x = np.column_stack((x.malignant,x.benign))
# 569 samples and 2 features
print(x, y)
[[ 122.8 1001. ]
[ 132.9 1326. ]
[ 130. 1203. ]
[ 108.3 858.1 ]
[ 140.1 1265. ]
[ 47.92 181. ]]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0. ,
0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
1., 0., 0., 1., 0., 0 ., 1., 1., 1., 1., 0., 1., ....,
9.4 THE ESTABLISHMENT OF A SUPPORT VECTOR MACHINE
We will now fit a Support Vector Machine classifier to these points.
While the mathematical details of the likelihood model are
interesting, we'll save those for another time. Instead, we'll treat the
scikit-learn algorithm as a black box that performs the aforementioned
task.
# import support vector classifier
# "Support Vector Classifier"
from sklearn.svm import SVC
clf = SVC(kernel='linear')
# fitting x samples and y classes
clf.fit(x, y)
After it has been fitted, the model can be used to predict the class of new samples:
clf.predict([[120, 990]])
clf.predict([[85, 550]])
array([ 0.])
array([ 1.])
Let's have a look at the graph to see what this means.
9.5 A SIMPLE DESCRIPTION OF THE SVM CLASSIFICATION ALGORITHM
Assume we have a set of points that are divided into two classes. We want
to separate those two classes so that we can accurately assign any new points
to one or the other in the future.
The SVM algorithm seeks a hyperplane that separates these two
classes by the greatest margin possible. A hard margin can be used if the
classes are entirely linearly separable. Otherwise, a soft margin is
required.
Note that support vectors are the points that end up on the margins.
Hard-margin:
[Figure: Separating two classes of points with the SVM method, hard-margin scenario.]
● The "H1" hyperplane is incapable of accurately separating the two
classes; hence it is not a suitable solution to our problem.
● The "H2" hyperplane accurately splits the classes. The distance between
the hyperplane and the nearest blue and green points, on the other
hand, is extremely small. As a result, there's a good chance that
future new points will be classified erroneously. The algorithm, for
example, would allocate the new grey point (x1=3, x2=3.6) to the
green class when it is evident that it should belong to the blue class
instead.
● Finally, the "H3" hyperplane divides the two classes correctly and with
the greatest possible margin (yellow shaded area). A
solution has been found!
It's worth noting that finding the maximum feasible margin allows for
a more accurate classification of new data, resulting in a far more
robust model. When using the "H3" hyperplane, you can see that the
new grey point is correctly allocated to the blue class.
It may not always be possible to completely separate the two classes. In
such cases, a soft margin is employed, with some points permitted to be
misclassified or to fall within the margin (yellow shaded area). This is
where the "slack" value, represented by ξ (xi), comes in.
[Figure: Separating two classes of points with the SVM method, soft-margin scenario.]
In this case, the "H4" hyperplane treats the green point inside the margin
as an outlier. As a result, the support vectors are the two green
points closest to the main group. This allows for a larger margin, which
increases the model's robustness.
Note that you can tune the hyperparameter C to decide how much the
algorithm cares about misclassifications (and points inside the margin).
C is essentially a weight assigned to the slack variables ξ. A low C
tolerates some misclassified training examples in exchange for a wider
margin, making the model more robust, whereas a high C
strives to classify all training examples correctly, producing a closer fit to
the training data but making the model less robust.
While a high C value will likely result in higher model performance on the
training data, there is a substantial risk of overfitting the model, which
will result in poor performance on test data.
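The role of C can be illustrated by evaluating the standard soft-margin objective, ½‖w‖² + C Σ ξ_i with ξ_i = max(0, 1 − y_i(w·x_i + b)), for two candidate hyperplanes. This is a minimal sketch: the one-dimensional data set, the two hand-picked hyperplanes and the helper names are illustrative assumptions, and labels are taken to be −1 and +1.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    """Return 0.5*||w||^2 + C * sum(slack), with hinge slack for each sample."""
    slack = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slack)

# Hypothetical 1-D data: the point at 0.5 is a negative sample near the positives.
X = [(-3.0,), (-2.0,), (0.5,), (2.0,), (3.0,)]
y = [-1, -1, -1, 1, 1]

wide = ((1.0,), 0.0)                # wide margin, leaves the point 0.5 with slack
narrow = ((4.0 / 3.0,), -5.0 / 3.0)  # narrow margin, separates every sample

for C in (0.1, 10.0):
    j_wide = soft_margin_objective(*wide, X, y, C)
    j_narrow = soft_margin_objective(*narrow, X, y, C)
    preferred = "wide" if j_wide < j_narrow else "narrow"
    print(f"C = {C}: wide = {j_wide:.3f}, narrow = {j_narrow:.3f} -> {preferred}")
```

With a low C the wide-margin hyperplane has the lower objective even though it leaves one training point inside the margin; with a high C the narrow, perfectly separating hyperplane wins.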
Kernel Trick:
SVM was previously explained in the context of linearly separable blue and green classes. What if we wanted to use SVMs to solve non-linear problems? How would we go about doing that? The kernel trick comes into play at this point. A kernel is a function that takes a nonlinear problem and converts it to a linear problem in a higher-dimensional space. Let's look at an example to demonstrate this method.
Assume you have two classes, red and black, as indicated in the diagram below:
Data in its original two-dimensional form.
As you can see, the red and black points are not linearly separable because there is no way to draw a line that separates these two classes. We can, however, distinguish them by drawing a circle with all of the red points inside and the black points outside.
9.6 WHAT IS THE BEST WAY TO TRANSFORM THIS PROBLEM INTO A LINEAR ONE?
Make a third dimension out of the sum of squared x and y values:
z = x² + y²
Using this three-dimensional space with x, y, and z values, we can now construct a hyperplane (a flat 2D surface) that separates the red and black points. As a result, the linear SVM classification algorithm can now be applied.
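The effect of the extra dimension can be checked with a short sketch. It uses scikit-learn's make_circles as stand-in data for the red and black points (an assumption, since the chapter's own points are not available) and compares a linear SVM on the original two features with one trained on x, y and z = x² + y².

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Stand-in data: an inner circle (one class) surrounded by an outer ring (other class).
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Add the third dimension z = x^2 + y^2.
z = (X ** 2).sum(axis=1)
X3 = np.column_stack([X, z])

# A linear SVM cannot separate the classes in 2D, but it can in 3D.
acc_2d = SVC(kernel='linear').fit(X, y).score(X, y)
acc_3d = SVC(kernel='linear').fit(X3, y).score(X3, y)
print("training accuracy in 2D:", acc_2d)
print("training accuracy in 3D:", acc_3d)
```

Accuracy is measured on the training points only, which is enough to show separability but says nothing about generalization.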
9.7 KERNEL FOR THE RADIAL BASIS FUNCTION (RBF) AND PYTHON EXAMPLES
The default kernel in sklearn's SVM classification algorithm is RBF, which can be defined using the formula:
K(x, x') = exp(-γ ||x - x'||²)
where gamma (γ) can be adjusted manually and must be greater than zero. In sklearn's SVM classification method, the default value for gamma ('scale') is:
γ = 1 / (n_features × Var(X))
Briefly: ||x - x'||² is the squared Euclidean distance between two feature vectors (two points), and gamma is a scalar that determines how far the influence of a single training sample (point) reaches.
As a result of the above design, we can control the influence of specific points on the overall algorithm. The bigger the gamma, the closer the other points must be to have an impact on the model. In the Python examples below, we'll see how adjusting gamma affects the results.
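Before moving to the full examples, a minimal sketch of the kernel itself may help. It assumes the gamma form of the RBF kernel, exp(−γ·||x − x'||²); the helper name rbf and the two sample points are illustrative.

```python
import numpy as np

def rbf(x, x_prime, gamma):
    """RBF kernel value exp(-gamma * ||x - x'||^2) for two feature vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

a, b = [0.0, 0.0], [3.0, 4.0]   # squared Euclidean distance is 25
for gamma in (0.01, 0.1, 1.0):
    print(f"gamma={gamma}: similarity={rbf(a, b, gamma):.6f}")
```

For the same pair of points the similarity shrinks as gamma grows, so with a large gamma only very close points influence each other.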
The following data and libraries will be used:
● Kaggle chess games data
● Scikit-learn library for separating the data into train-test samples, creating SVM classification models, and model evaluation
● Data manipulation with Pandas and Numpy
Let's import all the libraries:
import pandas as pd  # for data manipulation
import numpy as np  # for data manipulation
from sklearn.model_selection import train_test_split  # for splitting the data into train and test samples
from sklearn.metrics import classification_report  # for model evaluation
from sklearn.svm import SVC  # for Support Vector Classification model
import plotly.express as px  # for data visualization
import plotly.graph_objects as go  # for data visualization
After you've saved the data to your machine, use the code below to ingest it. We also derive a few new variables that we can use in the modeling.
# Read in the csv
df = pd.read_csv('games.csv', encoding='utf-8')
# Difference between white rating and black rating - independent variable
df['rating_difference'] = df['white_rating'] - df['black_rating']
# White wins flag (1=win vs. 0=not-win) - dependent (target) variable
df['white_win'] = df['winner'].apply(lambda x: 1 if x == 'white' else 0)
# Print a snapshot of a few columns
print(df[['white_rating', 'black_rating', 'rating_difference', 'winner', 'white_win']].head())
Let's now write a few functions that we can use to generate different models and plot the results.
This function divides the data into train and test samples, fits the model, predicts the outcome on a test set, and calculates model performance metrics.
def fitting(X, y, C, gamma):
    # Create training and testing samples (random_state fixed for reproducibility)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Fit the model
    # Note, available kernels: {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
    model = SVC(kernel='rbf', probability=True, C=C, gamma=gamma)
    clf = model.fit(X_train, y_train)
    # Predict class labels on training data
    pred_labels_tr = model.predict(X_train)
    # Predict class labels on test data
    pred_labels_te = model.predict(X_test)
    # Use score method to get accuracy of the model
    print('----- Evaluation on Test Data -----')
    score_te = model.score(X_test, y_test)
    print('Accuracy Score: ', score_te)
    # Look at classification report to evaluate the model
    print(classification_report(y_test, pred_labels_te))
    print('--------------------------------------------------------')
    print('----- Evaluation on Training Data -----')
    score_tr = model.score(X_train, y_train)
    print('Accuracy Score: ', score_tr)
    # Look at classification report to evaluate the model
    print(classification_report(y_train, pred_labels_tr))
    print('--------------------------------------------------------')
    # Return relevant data for chart plotting
    return X_train, X_test, y_train, y_test, clf
With the test data and the model prediction surface, the following function will create a Plotly 3D scatter graph.
def Plot_3D(X, X_test, y_test, clf):
    # Specify the size of the mesh to be used
    mesh_size = 5
    margin = 1
    # Create a mesh grid on which we will run our model
    x_min, x_max = X.iloc[:, 0].min() - margin, X.iloc[:, 0].max() + margin
    y_min, y_max = X.iloc[:, 1].min() - margin, X.iloc[:, 1].max() + margin
    xrange = np.arange(x_min, x_max, mesh_size)
    yrange = np.arange(y_min, y_max, mesh_size)
    xx, yy = np.meshgrid(xrange, yrange)
    # Calculate predictions on grid
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)
    # Create a 3D scatter plot of the test points (z = actual class)
    fig = px.scatter_3d(x=X_test['rating_difference'], y=X_test['turns'], z=y_test,
                        opacity=0.8, color_discrete_sequence=['black'])
    # Set figure colors
    fig.update_layout(paper_bgcolor='white',
                      scene=dict(xaxis=dict(backgroundcolor='white'),
                                 yaxis=dict(backgroundcolor='white'),
                                 zaxis=dict(backgroundcolor='white')))
    # Update marker size (value assumed; it is missing from the source listing)
    fig.update_traces(marker=dict(size=1))
    # Add prediction plane
    fig.add_traces(go.Surface(x=xrange, y=yrange, z=Z, name='SVM Prediction',
                              colorscale='RdBu', showscale=False,
                              contours={"z": {"show": True, "start": 0.2, "end": 0.8,
                                              "size": 0.05}}))
    fig.show()
9.8 BUILD A MODEL WITH DEFAULT VALUES FOR C AND GAMMA
Let's create our first SVM model with the 'rating_difference' and 'turns' fields as independent variables (attributes/predictors) and the 'white_win' flag as the target.
Note that we're cheating a little because the final number of moves won't be known until after the match. As a result, if we were to make model predictions before the match, we wouldn't be able to use 'turns'. However, this is merely for demonstration purposes, so we'll use it in the examples below.
The code is brief because we're using our previously defined 'fitting' function:
# Select data for modeling
X = df[['rating_difference', 'turns']]
y = df['white_win'].values
# Fit the model and display results
X_train, X_test, y_train, y_test, clf = fitting(X, y, 1, 'scale')
The function prints the following model evaluation metrics:
SVM model performance metrics.
We can see that the model's performance on test data is similar to that on
training data, indicating that the default hyperparameters allow the model
to generalize well.
Now we'll use the Plot_3D function to see the prediction:
Plot_3D(X, X_test, y_test, clf)
SVM classification model prediction plane using default hyperparameters.
Note that the top black points are actual class=1 (white won), whereas the bottom black points are actual class=0 (white did not win). Meanwhile, the surface represents the model's predicted probability of a white win.
While the probability varies locally, the decision boundary is at about x=0 (i.e., rating difference=0) because this is where the probability crosses the p=0.5 line.
Let's examine what happens if we set gamma to a relatively high value (gamma=0.1).
SVM model performance metrics with Gamma=0.1.
As can be seen, raising gamma improves model performance on training data but degrades model performance on test data. The graph below explains why this is the case.
Prediction plane for a gamma=0.1 SVM classification model. Colorscale='Aggrnyl'
was used in the featured image.
Rather than a smooth prediction surface, we now have one that is highly
"spiky." We need to look into the kernel function a little more to see why
this happens.
When we use a high gamma value, we are telling the function that close points are significantly more important for the prediction than distant points. As a result, we see these "spikes", since the prediction is driven by individual training points rather than by their wider neighbourhood.
Reducing gamma, on the other hand, tells the function that when generating a prediction, it's not only the specific point that matters, but also the points around it. Let's look at another case with a low gamma value to see if this is correct.
SVM MODEL 3 — GAMMA = 0.000001
Let's rerun the functions:
SVM model performance metrics with Gamma=0.000001.
Reducing gamma improved the model's robustness, as expected, with an increase in model performance on the test data (accuracy = 0.66). The graph below shows how much smoother the prediction surface has become after giving more distant points more influence.
Prediction plane for SVM classification model with gamma=0.000001.
Hyperparameter C adjustment:
No examples with different C values are included here because C affects the smoothness of the prediction plane similarly to gamma, although for different reasons. You can observe this for yourself by calling the "fitting" function with a value of C=100.
9.9 RADIAL BASIS FUNCTION (RBF) KERNEL: THE GO-TO KERNEL
Suppose you are working on a non-linear dataset with a machine learning technique like Support Vector Machines, but you can't seem to figure out the correct feature transform or kernel to employ. Fear not, because the Radial Basis Function (RBF) kernel is here to save the day.
Due to its resemblance to the Gaussian distribution, the RBF kernel is the most general form of kernelization and one of the most extensively used kernels. For two points X₁ and X₂, the RBF kernel function computes their similarity, or how near they are to one another. This kernel can be expressed mathematically as follows:
K(X₁, X₂) = exp(-||X₁ - X₂||² / (2σ²))
where
1. 'σ' is the width of the kernel (σ² is the variance) and our hyperparameter
2. ||X₁ - X₂|| is the Euclidean (L₂-norm) distance between the two points X₁ and X₂
Let d₁₂ be the distance between the two points X₁ and X₂. We can now represent d₁₂ as follows:
d₁₂ = ||X₁ - X₂||
Fig 2: The distance between two points in space.
The kernel equation can then be rewritten as:
K(X₁, X₂) = exp(-d₁₂² / (2σ²))
The RBF kernel reaches its maximum value of 1 when d₁₂ is 0, which means that the points are identical, i.e. X₁ = X₂.
1. When the points are the same there is no distance between them, so they are as similar as possible.
2. When the points are separated by a large distance, the kernel value is less than 1 and close to 0, indicating that the points are dissimilar.
Because the points become less similar as the distance between them increases, distance can be thought of as an analogue of dissimilarity.
Fig 3: As distance grows, similarity reduces.
Finding the proper value of σ to determine which points should be regarded as similar is critical, and this can be demonstrated on a case-by-case basis.
a] σ = 1
When σ = 1, σ² = 1 and the RBF kernel's mathematical equation will be as follows:
K(X₁, X₂) = exp(-d₁₂² / 2)
The curve for this equation is shown below. The RBF kernel decays exponentially as the distance rises and is practically 0 for distances larger than 4.
Fig 4: RBF Kernel for σ = 1 [Image by Author]
1. When d₁₂ = 0 the similarity is 1, and when d₁₂ exceeds 4 units the similarity is approximately 0.
2. We can read from the graph that points closer than 4 units are treated as similar, and points farther apart than 4 units as dissimilar.
b] σ = 0.1
When σ = 0.1, σ² = 0.01 and the RBF kernel's mathematical equation will be as follows:
K(X₁, X₂) = exp(-d₁₂² / 0.02)
For σ = 0.1, the width of the Region of Similarity is the smallest, therefore only extremely close points are considered similar.
Fig 4 a: RBF Kernel for σ = 0.1
1. The curve is sharply peaked and falls rapidly towards 0 for distances larger than about 0.2.
2. Only if the distance between the points is roughly 0.2 or less is the pair considered similar.
c] σ = 10
When σ = 10, σ² = 100 and the RBF kernel's mathematical equation will be as follows:
K(X₁, X₂) = exp(-d₁₂² / 200)
For σ = 10, the width of the Region of Similarity is enormous, so points that are far apart can still be considered similar.
Fig 5: RBF Kernel for σ = 10
1. The curve is very wide.
2. For distances up to about 10 units the points are deemed similar; for greater distances the similarity decays gradually and the points are increasingly treated as dissimilar.
The width of the Region of Similarity changes as σ changes, as shown in the examples above.
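The three cases can be reproduced numerically with a short sketch. It assumes the σ form of the kernel, exp(−d² / (2σ²)); the helper names and the list of distances are illustrative. The second helper converts σ into the equivalent gamma, which follows from comparing this form with exp(−γ·d²).

```python
import math

def rbf_similarity(distance, sigma):
    """RBF kernel value for two points separated by `distance`."""
    return math.exp(-distance ** 2 / (2 * sigma ** 2))

def sigma_to_gamma(sigma):
    """Gamma for which exp(-gamma * d^2) equals exp(-d^2 / (2 * sigma^2))."""
    return 1.0 / (2 * sigma ** 2)

distances = [0.0, 0.2, 1.0, 4.0, 10.0]
for sigma in (0.1, 1.0, 10.0):
    values = [round(rbf_similarity(d, sigma), 4) for d in distances]
    print(f"sigma={sigma}: gamma={sigma_to_gamma(sigma):.3f}, similarities={values}")
```

Reading across each printed row shows how quickly the similarity decays for a narrow kernel and how slowly for a wide one.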
Using hyperparameter tuning approaches such as Grid Search Cross-Validation and Random Search Cross-Validation, you can find the appropriate σ for a particular dataset.
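A grid search of this kind might look like the sketch below. It assumes scikit-learn's GridSearchCV and the built-in Iris data; the candidate values for C and gamma are arbitrary illustrative choices, and scikit-learn exposes the kernel width through gamma rather than σ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values (illustrative, not tuned recommendations).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# 5-fold cross-validation over every (C, gamma) combination.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

RandomizedSearchCV from the same module can be used in a similar way when the grid is too large to search exhaustively.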
The RBF kernel is popular partly because of its resemblance to the K-Nearest Neighbor algorithm. Because RBF kernel Support Vector Machines only need to keep the support vectors after training, not the complete dataset, they have the advantages of K-NN while avoiding its space complexity problem.
RBF kernel Support Vector Machines are included in the scikit-learn toolkit and have two hyperparameters: 'C' for the SVM and 'γ' for the RBF kernel. Here, γ is inversely proportional to σ² (γ = 1 / (2σ²)).
Fig 6: RBF Kernel SVM for Iris Dataset
9.10 CONCLUSION
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. An SVM training algorithm builds a model that assigns new examples to one of two categories, making it a non-probabilistic binary linear classifier. To train the classifier, we first import the cancer dataset as a CSV file, then select two features from the samples and fit the model on them. The SVM algorithm seeks the hyperplane that separates the two classes by the greatest possible margin.
A hard margin can be used if the classes are entirely linearly separable. Otherwise, a soft margin is required, with some points permitted to be misclassified or to fall within the margin. This increases the model's resilience by allowing for a bigger margin.
9.11 REFERENCES ● -data-using -support -vector -
machinessvms -in-python/
● https://towardsda -classifier -and-rbf-kernel -how-to-
make -better -models -in-python -73bb4914af5b
● -basis -functi on-rbf-kernel -the-
go-to-kernel -acf0d22c798a
Unit structure
10.0 Objectives
10.1 Decision Tree
10.2 Ensemble Techniques – Bagging
10.3 Ensemble Techniques – Boosting
10.4 Ensemble Techniques – Stacking
10.5 Ensemble Techniques – Voting
10.6 Random Forest - Bagging, Attribute Bagging and Voting for Class Selection
10.7 Summary
10.8 References
10.0 OBJECTIVES
This chapter will enable students to:
● Make use of data sets in implementing the machine learning algorithms
● Implement the machine learning concepts and algorithms in any suitable language of choice.
Data sets can be taken from standard repositories or constructed by the students.
10.1 DECISION TREE
The decision-tree algorithm falls under the category of supervised learning algorithms. It works for both continuous and categorical output variables and makes use of a tree representation, so it can be used for classification as well as regression; CART stands for Classification and Regression Trees. Given a decision tree, how do we predict a class label? We start from the root of the tree and follow the branch chosen by each test until we reach a leaf.
For example, consider a dataset of cats and dogs, with their features. The label here is accordingly "cat" or "dog", and the goal is to identify the animal based on its features, using a decision tree. If, at a particular node in the tree, the input contains only a single type of label, say cats, we can infer that it is perfectly grouped, or "unmixed". On the other hand, if the input contains a mix of cats and dogs, we have to ask another question about the features in the dataset that can help us narrow down and divide the mix further, to try and "unmix" them.
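How "mixed" a node is can be quantified with an impurity measure. The sketch below uses Gini impurity, the criterion behind the "gini" option in the program that follows; the function name and the small label lists are illustrative.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0.0 for an unmixed node, larger values for more mixing."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

unmixed = ["cat", "cat", "cat", "cat"]
mixed = ["cat", "cat", "dog", "dog"]
mostly_cats = ["cat", "cat", "cat", "dog"]

for name, node in [("unmixed", unmixed), ("mixed", mixed), ("mostly cats", mostly_cats)]:
    print(name, gini_impurity(node))
```

A decision tree picks, at each node, the question whose answer groups have the lowest weighted impurity.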
# Program to implement a decision tree in Python
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# The data file location is missing from the source text.
# DATA_PATH is a placeholder: set it to your copy of the data before running.
DATA_PATH = ''

# Function importing the dataset
def importdata():
    balance_data = pd.read_csv(DATA_PATH, sep=',', header=None)
    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data
# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable (column 0) from the features (columns 1-4)
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test
# Function to perform training with the Gini index.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
        random_state=100, max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy.
def tarin_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy
# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set with the given classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))
# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = tarin_using_entropy(X_train, X_test, y_train)
    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)
    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

# Calling main function
if __name__ == "__main__":
    main()
To summarize: a decision tree is a supervised learning algorithm that makes use of a tree representation and can be used for classification.
10.2 ENSEMBLE TECHNIQUES – BAGGING
# importing utility modules
# download the train data set from
# “ -traincsv ”
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
import xgboost as xgb
# importing bagging module
from sklearn.ensemble import BaggingRegressor
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# Splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
# initializing the bagging model using XGBoost as base model with default parameters
# (scikit-learn versions before 1.2 name this argument base_estimator)
model = BaggingRegressor(estimator=xgb.XGBRegressor())
# training the model
model.fit(X_train, y_train)
# predicting the output on the test dataset
pred = model.predict(X_test)
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred))
10.3 ENSEMBLE TECHNIQUES – BOOSTING
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
from sklearn.ensemble import GradientBoostingRegressor
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# Splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
# initializing the boosting module with default parameters
model = GradientBoostingRegressor()
# training the model on the train dataset
model.fit(X_train, y_train)
# predicting the output on the test dataset
pred_final = model.predict(X_test)
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
10.4 ENSEMBLE TECHNIQUES – STACKING
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression
# importing stacking lib
from vecstack import stacking
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# Splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(
    train, target, test_size=0.20)
# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
# putting all base model objects in one list
all_models = [model_1, model_2, model_3]
# computing the stack features
# (vecstack expects the argument order X_train, y_train, X_test)
s_train, s_test = stacking(all_models, X_train, y_train, X_test,
                           regression=True, n_folds=4)
# initializing the second-level model
final_model = model_1
# fitting the second-level model with stack features
final_model = final_model.fit(s_train, y_train)
# predicting the final output from the stack features of the test set
pred_final = final_model.predict(s_test)
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
10.5 ENSEMBLE TECHNIQUES – VOTING
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# importing voting classifier
from sklearn.ensemble import VotingClassifier
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["Weekday"]
# getting train data from the dataframe
train = df.drop("Weekday", axis=1)
# Splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(train, target,
                                                    test_size=0.20)
# initializing all the model objects with default parameters
model_1 = LogisticRegression()
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()
# Making the final model using voting classifier
# (soft voting averages class probabilities, which log_loss below requires)
final_model = VotingClassifier(
    estimators=[('lr', model_1), ('xgb', model_2), ('rf', model_3)],
    voting='soft')
# training all the models on the train dataset
final_model.fit(X_train, y_train)
# predicting class probabilities on the test dataset
pred_final = final_model.predict_proba(X_test)
# printing log loss between actual and predicted value
print(log_loss(y_test, pred_final))
10.6 RANDOM FOREST - BAGGING, ATTRIBUTE BAGGING AND VOTING FOR CLASS SELECTION
Random forest is like a bootstrapping algorithm combined with the decision tree (CART) model. Suppose we have 1000 observations in the complete population with 10 variables. Random forest will try to build multiple CART models with different samples and different initial variables. For instance, it will take a random sample of 100 observations and then choose 5 initial variables randomly to build a CART model. It will repeat the process, say, about 10 times and then make a final prediction for each observation. The final prediction is a function of the individual predictions; it can simply be the mean of each prediction.
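The procedure described above can be sketched by hand. The example follows the simplified description in the text (one random subset of variables per tree, predictions averaged); the synthetic data from make_regression and all names are assumptions made for illustration. Library implementations such as scikit-learn's random forest instead draw a fresh feature subset at every split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Stand-in population: 1000 observations with 10 variables.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

n_trees, sample_size, n_vars = 10, 100, 5
forest = []
for _ in range(n_trees):
    rows = rng.choice(len(X), size=sample_size, replace=True)    # bootstrap sample of observations
    cols = rng.choice(X.shape[1], size=n_vars, replace=False)    # random subset of variables
    tree = DecisionTreeRegressor(random_state=0).fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))

def forest_predict(X_new):
    """Final prediction = mean of the individual tree predictions."""
    per_tree = np.column_stack([tree.predict(X_new[:, cols]) for tree, cols in forest])
    return per_tree.mean(axis=1)

y_hat = forest_predict(X)
print("first five averaged predictions:", np.round(y_hat[:5], 2))
```

For classification the averaging step would be replaced by a vote over the predicted classes.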
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestClassifier
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["Weekday"]
# getting train data from the dataframe
train = df.drop("Weekday", axis=1)
# Splitting the train data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(train, target,
                                                    test_size=0.20)
# initializing the model object with default parameters
model_3 = RandomForestClassifier()
# training the model on the train dataset
model_3.fit(X_train, y_train)
# predicting class probabilities on the test dataset (log loss is computed from probabilities)
pred_final = model_3.predict_proba(X_test)
# printing log loss between actual and predicted value
print(log_loss(y_test, pred_final))
Example 2: Using Random Forest for Regression
import pandas as pd
import numpy as np
dataset = pd.read_csv('/content/petrol_consumption.csv')
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:',
      np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Using Random Forest for Classification:
import pandas as pd
import numpy as np
dataset = pd.read_csv('/content/bill_authentication.csv')
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
# Repeat with a larger forest (200 trees)
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
10.7 SUMMARY
Ensemble means a group of elements viewed as a whole rather than individually. An ensemble method creates multiple models and combines them to solve a problem. Ensemble methods help to improve the robustness and generalizability of the model. In this chapter, we discussed several ensemble methods together with their implementation in Python.
10.8 REFERENCES
1 Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition.
2 Paul J. Deitel, Python Fundamentals.
3 Stuart Russell, Peter Norvig, Artificial Intelligence – A Modern Approach, Pearson Education / Prentice Hall of India, 3rd Edition.
4 Ethem Alpaydın, Introduction to Machine Learning, PHI, Third Edition, ISBN No. 978-81-203-5078-6.
5 Peter Harrington, Machine Learning in Action, Manning Publications, April 2012, ISBN 9781617290183.
6 John V. Guttag, Introduction to Computer Programming using Python.
7 R. Nageswara Rao, Core Python Programming.
8 -intelligence -machine -learning -iiit-
hprogra m/program -details.pdf
9 -robotics
12 https://scikit
13 https :// -machine -learning -
algorithmspython -scikit -learn/
14 -learning#syllabus
15 https://data -ml-data-preprocessing/
Unit Structure
11.0 Boosting Algorithms
11.1 How it works
11.2 Types of boosting Algorithms
11.3 Introduction to AdaBoost Algorithm
11.3.1 What is AdaBoost Algorithm
11.3.2 How it works
11.3.3 What is AdaBoost algorithm used for
11.3.4 Pros and Cons
11.3.5 Pseudocode of AdaBoost
11.4 Gradient Boosting Machines Algorithm
11.4.1 Implementation
11.4.2 Implementation using Scikit learn
11.4.3 Stochastic Gradient Boosting
11.4.4 Shrinkage
11.4.5 Regularization
11.4.6 Tree constraints
11.0 BOOSTING ALGORITHM
Boosting algorithms are a special class of algorithms used to improve the existing results of a data model and to help fix its errors [1,4,7]. They combine weak learners into a strong learner, using weighted averages and majority votes for prediction. They use decision stumps and margin-maximizing classification for processing. Machine learning algorithms such as AdaBoost (Adaptive Boosting), Gradient Boosting, XGBoost and Voting Ensemble follow this process of training, predicting and fine-tuning the result [1,4,7].
Let’s understand this with an example of the email, which recognize
whether the email, is a spam or not? It can be recognized it by the
following conditions:
If an email contains lots of source like that means it is spam.
Page 162
162 Boosting Algorithms If an email contains only one file image, then it is spam.
If an email contains the message of “You Own a lottery of $ xxxxx”,
that means it is spam.
Not Spam:
If an email contains some known source, then it is not spam.
If it contains the official domain like, etc., that means it is
not spam.
The above-mentioned rules are not powerful enough on their own to decide whether an email is spam; hence these rules are called weak learners.
To convert weak learners into a strong learner, combine the predictions of the weak learners using one of the following methods:
Using a weighted average.
Taking the prediction that has the higher vote.
Consider the above 5 rules; there are 3 votes for spam and 2 votes for not spam. As spam has the higher vote, we consider the email to be spam.
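The voting idea above can be sketched in a few lines of Python. This is only an illustration: the five rule functions, the email fields, the domain `example.org` and the weights are hypothetical stand-ins for the rules in the text, not part of any library.

```python
# Five hypothetical weak rules. Each returns +1 (spam) or -1 (not spam).
# Field names, thresholds and the domain are illustrative assumptions.

def many_links(email):
    return 1 if email["num_links"] > 5 else -1

def only_image(email):
    return 1 if email["only_image"] else -1

def lottery_text(email):
    return 1 if "lottery" in email["text"].lower() else -1

def known_sender(email):
    return -1 if email["sender_known"] else 1

def official_domain(email):
    return -1 if email["domain"].endswith("example.org") else 1

RULES = [many_links, only_image, lottery_text, known_sender, official_domain]

def majority_vote(email):
    """Combine the weak rules by simple majority (the higher vote wins)."""
    total = sum(rule(email) for rule in RULES)
    return "spam" if total > 0 else "not spam"

def weighted_vote(email, weights):
    """Combine the weak rules by a weighted sum of their votes."""
    total = sum(w * rule(email) for w, rule in zip(weights, RULES))
    return "spam" if total > 0 else "not spam"

email = {
    "num_links": 10,
    "only_image": True,
    "text": "You won a lottery",
    "sender_known": True,
    "domain": "mail.example.org",
}

votes = [rule(email) for rule in RULES]      # three +1 votes, two -1 votes
print(votes, majority_vote(email))
# Trusting the two "not spam" rules more can flip the decision.
print(weighted_vote(email, [0.1, 0.1, 0.1, 0.4, 0.4]))
```

With equal votes the email is labelled spam (3 against 2). Giving the two sender-based rules larger weights changes the outcome, which is the difference between majority voting and a weighted average.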
11.1 HOW IT WORKS?
To choose the right distribution at each round, boosting follows these steps:
Step 1: The base learning algorithm takes the whole distribution and assigns equal weight to each observation.
Step 2: If the first base learning algorithm makes prediction errors, we pay higher attention to the observations with prediction errors, and then apply the next base learning algorithm.
Step 3: Repeat step 2 until the limit of the base learning algorithm is reached or high accuracy is achieved.
Step 4: Combine the outputs of all the weak learners to create one strong prediction rule.
11.2 TYPES OF BOOSTING ALGORITHM
1. AdaBoost (Adaptive Boosting) algorithm
2. Gradient Boosting algorithm
3. XG Boost algorithm
4. Voting Ensemble
11.3 INTRODUCTION TO ADABOOST ALGORITHM
The AdaBoost algorithm can be used to boost the performance of any machine learning algorithm. Machine learning has become a powerful tool that can make predictions based on large amounts of data. It has become so popular in recent times that applications of machine learning can be found in our day-to-day activities [1,4,7]. A common example is getting product recommendations while shopping online, based on the items the customer bought in the past. Machine learning, often referred to as predictive analysis, can be defined as the ability of computers to learn without being explicitly programmed. Instead, it uses algorithms to analyze input data and predict output within a specified range [1,4,7].
11.3.1 What is AdaBoost Algorithm? :
Boosting originated from the question of whether a set of weak classifiers could be converted into a strong classifier. A weak learner is a learner that performs only slightly better than random guessing. AdaBoost transforms weak learners or predictors into strong predictors in order to solve classification problems [1,4,7].
For classification, the final equation can be put as below:
$$F(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \theta_m f_m(x)\right)$$
Here $f_m$ designates the mth weak classifier, and $\theta_m$ represents its corresponding weight.
11.3.2 How it works? :
AdaBoost can be used to improve the performance of machine learning algorithms. It works best with weak learners, which it combines to achieve high accuracy [1,4,7]. Consider a data set containing n points:
$$x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}, \quad i = 1, \dots, n$$
Here -1 represents the negative class and 1 the positive class. The weight of each data point is initialized as:
$$w_1(x_i, y_i) = \frac{1}{n}, \quad i = 1, \dots, n$$
For each iteration m from 1 to M, we perform the steps below.
First, we fit the weak classifiers to the data set and select the one with the lowest weighted classification error:
$$\epsilon_m = \sum_{i=1}^{n} w_m(x_i, y_i)\, \mathbb{1}\left[y_i \neq f_m(x_i)\right]$$
Then we calculate the weight of the mth weak classifier:
$$\theta_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$$
The weight is positive for any classifier with an accuracy above 50%, becomes larger as the classifier becomes more accurate, and is negative if the classifier has an accuracy below 50%. A negative weight inverts the sign of the classifier's prediction, so a classifier with 40% accuracy is effectively converted into one with 60% accuracy [1,4,7].
Finally, we update the weight of each data point:
$$w_{m+1}(x_i, y_i) = \frac{w_m(x_i, y_i)\, \exp\left(-\theta_m y_i f_m(x_i)\right)}{Z_m}$$
Here $Z_m$ is the normalization factor. It ensures that the instance weights sum to 1.
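The procedure described above (uniform initial weights, selecting the weak classifier with the lowest weighted error, computing its weight, and re-weighting the data points) can be sketched from scratch as follows. This is a minimal illustration that assumes the standard AdaBoost formulas, uses threshold stumps on a one-dimensional input as the weak classifiers, and runs on a small hand-made data set; all function names are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier: +polarity left of the threshold, -polarity otherwise."""
    return polarity if x < threshold else -polarity

def fit_adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                       # uniform initial weights
    pts = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    ensemble = []                                 # (theta, threshold, polarity)
    for _ in range(n_rounds):
        # Select the stump with the lowest weighted classification error.
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, t, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity)
        err, t, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        theta = 0.5 * math.log((1 - err) / err)   # weight of this classifier
        ensemble.append((theta, t, polarity))
        # Increase the weight of misclassified points, then normalize.
        weights = [w * math.exp(-theta * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)                          # normalization factor
        weights = [w / z for w in weights]
    return ensemble, weights

def predict(ensemble, x):
    """Sign of the weighted sum of the weak classifiers."""
    score = sum(theta * stump_predict(x, t, p) for theta, t, p in ensemble)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, final_weights = fit_adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions == ys)
```

No single stump can separate this data, yet on this toy set three rounds of boosting are enough for the weighted combination to classify all ten points correctly.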
11.3.3 What is AdaBoost Algorithm Used for? :
AdaBoost is widely used for face detection, where it is regarded as a standard algorithm for detecting faces in images. It employs a rejection cascade made up of many layers of classifiers. When the detection window is not recognized as a face at any layer, it is rejected. The first classifiers in the cascade discard most negative windows, keeping the computational cost to a minimum. Although AdaBoost is used to combine the weak classifiers, its principles are also used to select the best features for each layer of the cascade [1,4,7].
11.3.4 Pros and Cons :
Pros: AdaBoost is fast, simple and easy to program. It can be combined with any machine learning algorithm, and there are no parameters to tune except T, the number of boosting rounds. It has been extended to learning problems beyond binary classification, and it is versatile, as it can be used with text or numeric data [1,4,7].
Cons: Weak classifiers that are too weak can lead to low margins and overfitting.
11.3.5 Pseudocode of AdaBoost [2,3,6] :
1. Initially set uniform example weights.
2. for each base learner do:
    Train base learner with a weighted sample.
    Test base learner on all data.
    Set learner weight with a weighted error.
    Set example weights based on ensemble predictions.
3. end for
Implementation of AdaBoost Using Python :
Step 1: Importing the Modules :
Import the required packages and modules.
In Python we have the AdaBoostClassifier and AdaBoostRegressor classes from the scikit-learn library. As we are dealing with a classification problem, we import AdaBoostClassifier. The train_test_split method is used to split our dataset into training and test sets. We also import datasets, from which we will use the Iris dataset [2,3,6].
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics
Step 2: Exploring the data :
This dataset contains four features describing different types of Iris flowers (sepal length, sepal width, petal length, petal width). The goal is to predict the type of flower from three possibilities: Setosa, Versicolour, and Virginica. The dataset is available in the scikit-learn library, or you can download it from the UCI Machine Learning Repository [2,3,6].
Next, we prepare our data by loading it from the datasets package using the load_iris() method. We assign the data to the iris variable [2,3,6].
Further, we split our dataset into the input variable X, which contains the features sepal length, sepal width, petal length, and petal width.
y is our target variable, or the class that we have to predict: either Iris Setosa, Iris Versicolour, or Iris Virginica. Below is an example of what our data looks like.
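The loading step described above might look like the following sketch. The variable names X and y follow the text; printing the first five rows is just one way to inspect the data.

```python
from sklearn import datasets

# Load the Iris data bundled with scikit-learn.
iris = datasets.load_iris()
X = iris.data      # features: sepal length/width, petal length/width
y = iris.target    # class labels encoded as 0, 1, 2

print(iris.feature_names)
print(iris.target_names)
print(X[:5])       # first five rows of features
print(y[:5])       # labels of those rows
```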
Step 3: Splitting the data :
Splitting the dataset into training and testing datasets is a good idea to see if our model is classifying the data points correctly on unseen data [2,3,6].
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Split the dataset into 70% training and 30% test.
Step 4: Fitting the Model :
Building the AdaBoost model. AdaBoost uses a decision tree as its base learner by default. We make an AdaBoostClassifier object and name it abc [2,3,6]. A few important parameters of AdaBoost are:
base_estimator: The weak learner used to train the model (renamed to estimator in scikit-learn 1.2 and later).
n_estimators: The maximum number of weak learners to train, one per boosting iteration.
learning_rate: It scales the contribution of each weak learner. It uses 1 as a default value.
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)
We then fit our object abc to our training dataset. We call the result model.
model = abc.fit(X_train, y_train)
Step 5: Making the Predictions :
Our next step is to see how well our model predicts our target values.
y_pred = model.predict(X_test)
Step 6: Evaluating the model :
The model accuracy tells us how often our model predicts the correct class.
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
An accuracy of 86.66% is achieved.
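For convenience, the six steps can be assembled into one script. This is a sketch rather than the exact code behind the accuracy quoted above: the random_state values are assumptions added for reproducibility, so the printed accuracy may differ from the figure in the text.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Steps 1 and 2: import the modules and load the data.
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Step 3: 70% training, 30% test (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 4: fit AdaBoost with its default decision-tree base learner.
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model = abc.fit(X_train, y_train)

# Step 5: predict the classes of the unseen test rows.
y_pred = model.predict(X_test)

# Step 6: fraction of test rows classified correctly.
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```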
11.4 GRADIENT BOOSTING ALGORITHM
Gradient boosting is a machine learning technique that defines a loss function and minimizes it [4,7,8]. It is used to solve classification problems by combining prediction models, and it involves the following three elements:
1. Loss Function :
The choice of loss function depends on the type of problem. An advantage of gradient boosting is that a new boosting algorithm does not have to be derived for each loss function [4,7,8].
2. Weak Learner :
In gradient boosting, decision trees are used as the weak learner. Regression trees are used because they output real values, which can be added together to correct the predictions. As in the AdaBoost algorithm, small trees with a single split, i.e. decision stumps, can be used. Larger trees, typically with 4 to 8 levels, can also be used [4,7,8].
3. Additive Model :
In this model, trees are added one at a time, and existing trees remain unchanged. When adding trees, gradient descent is used to minimize the loss function.
The Gradient Boosting Machine is a powerful ensemble machine learning algorithm that uses decision trees.
Gradient boosting is a generalization of AdaBoost that improves the performance of the approach and introduces ideas from bootstrap aggregation to further improve the models, such as randomly sampling the samples and features when fitting ensemble members.
Gradient boosting performs well, if not the best, on a wide range of tabular datasets, and versions of the algorithm like XGBoost and LightGBM often play an important role in winning machine learning competitions [4,7,8].
A Gradient Boosting ensemble is an ensemble created from decision trees added sequentially to the model.
11.4 GRADIENT BOOSTING MACHINES ALGORITHM
Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.
Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting, and gradient boosting machines. Models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss gradient is minimized as the model is fit, much like a neural network.
Gradient boosting works by building weak prediction models sequentially, where each model tries to predict the error left over by the previous model. Because of this, the algorithm tends to overfit rather quickly.
Implementations of the algorithm:
1. Gradient Boosting from scratch
2. Using the scikit-learn built-in function.
In gradient boosting decision trees, we combine many weak learners to come up with one strong learner. The weak learners here are the individual decision trees. All the trees are connected in series, and each tree tries to minimise the error of the previous tree. Sequential boosting algorithms are slow to learn, but highly accurate [1,4,7].
The weak learners are fit in such a way that each new learner is fit to the residuals of the previous step, so that the model improves. The final model aggregates the result of each step, and thus a strong learner is achieved. A loss function is used to compute the residuals. Mean squared error (MSE) is used for a regression task and logarithmic loss (log loss) is used for classification tasks [1,4,7].
Learning rate and n_estimators (Hyperparameters) :
Hyperparameters are key parts of learning algorithms which influence the performance and accuracy of a model. Learning rate and n_estimators are two basic hyperparameters for gradient boosting decision trees. The learning rate, denoted as α, controls how quickly the model learns. Each tree added modifies the overall model, and the size of the modification is controlled by the learning rate. The lower the learning rate, the slower the model learns. The advantage of a slower learning rate is that the model becomes more robust [1,4,7]. n_estimators is the number of trees used in the model.
Note :
A problem in gradient boosting decision trees is overfitting caused by adding too many trees, whereas in random forests adding too many trees does not cause overfitting.
Algorithm :
Let’s say the output $y$ of the model, when fit with only 1 decision tree, is given by:
$$y = A_1 + B_1x + e_1$$
where $e_1$ is the residual from this decision tree. In gradient boosting, we fit each consecutive decision tree on the residual from the previous one [1,4,7]. So when gradient boosting is applied to this model, the consecutive decision trees will be mathematically represented as:
$$e_1 = A_2 + B_2x + e_2$$
$$e_2 = A_3 + B_3x + e_3$$
Note that here we stop at 3 decision trees, but in an actual gradient boosting model the number of learners or decision trees is much larger [1,4,7]. The final model will be given by:
$$y = A_1 + A_2 + A_3 + B_1x + B_2x + B_3x + e_3$$
11.4.1 Implementation :
Implementation from Scratch
Consider simulated data, as shown in the scatter plot below, with 1 input (x) and 1 output (y) variable.
1. Fit a simple model on the data [call its prediction y_predicted1].
2. Calculate the error residuals: actual target value minus predicted target value [e1 = y - y_predicted1].
3. Fit a new model on the error residuals as the target variable, with the same input variables [call it e1_predicted].
4. Add the predicted residuals to the previous predictions [y_predicted2 = y_predicted1 + e1_predicted].
5. Fit another model on the residuals that are still left, i.e. [e2 = y - y_predicted2], and repeat steps 2 to 5 until the model starts overfitting or the sum of residuals becomes constant. Overfitting can be controlled by consistently checking accuracy on validation data.
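A minimal from-scratch sketch of this residual-fitting loop is shown below. It is an illustration under stated assumptions, not the original listing: the weak model is a one-split regression stump, the simulated data is a sine curve, no learning rate is applied, and the helper names are hypothetical.

```python
import math

def fit_stump(xs, targets):
    """Fit a one-split regression stump that minimises squared error."""
    pts = sorted(set(xs))
    best = None
    for a, b in zip(pts, pts[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

def gradient_boost(xs, ys, n_rounds):
    predictions = [0.0] * len(xs)
    models, history = [], []
    for _ in range(n_rounds):
        # Residuals: actual target minus current prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        # Fit a new weak model with the residuals as the target.
        model = fit_stump(xs, residuals)
        # Add the predicted residuals to the previous predictions.
        predictions = [p + model(x) for p, x in zip(predictions, xs)]
        models.append(model)
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return models, history

def predict(models, x):
    """The ensemble prediction is the sum of all weak model outputs."""
    return sum(m(x) for m in models)

# Simulated data with 1 input and 1 output (an assumption for the demo).
xs = [i / 2 for i in range(40)]
ys = [math.sin(x) for x in xs]
models, history = gradient_boost(xs, ys, n_rounds=25)
print("squared error after round 1: ", round(history[0], 3))
print("squared error after round 25:", round(history[-1], 3))
```

Because the first round starts from a prediction of zero, its residuals equal y, so the first stump plays the role of the simple model in step 1. On the training data the squared error can only go down with each round, which is exactly why a validation set is needed to detect overfitting.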
The code above is a very basic implementation of gradient boosting trees.
The actual libraries have a lot of hyperparameters that can be tuned for
better results. This can be better understood by using the gradient boosting
algorithm on a real dataset.
11.4.2 Implementation using Scikit-learn :
We use the PIMA Indians Diabetes dataset, which has information about an individual’s health parameters and an output of 0 or 1, depending on whether or not the individual has diabetes. The task here is to classify an individual as diabetic, given the required inputs about their health.
The accuracy is 73%, which is average. This can be improved by tuning the hyperparameters or processing the data to remove outliers.
Improving performance of gradient boosted decision trees [1,4,7] :
Gradient boosting algorithms are prone to overfitting and consequently poor performance on the test dataset. There are some pointers you can keep in mind to improve the performance of a gradient boosting algorithm.
11.4.3 Stochastic Gradient Boosting :
Stochastic gradient boosting involves subsampling the training dataset and training individual learners on the random samples created by this subsampling. This reduces the correlation between the results of individual learners, and combining results with low correlation gives a better overall result.
11.4.4 Shrinkage :
The predictions of each tree are added together sequentially. Rather than adding them at full strength, the contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called shrinkage or the learning rate. Using a low learning rate can dramatically improve the performance of your gradient boosting model. Usually a learning rate in the range of 0.1 to 0.3 gives the best results [1,4,7].
11.4.5 Regularization :
L1 and L2 regularization penalties can be applied to the leaf weight values to slow down learning and prevent overfitting. Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees’ terminal nodes.
11.4.6 Tree Constraints :
There are a number of ways in which a tree can be constrained to improve performance:
Number of trees
Tree depth
Minimum improvement in loss
Number of observations per split
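As a rough guide, the techniques in sections 11.4.3 to 11.4.6 correspond to constructor arguments of scikit-learn's GradientBoostingClassifier, as in the sketch below. The data is synthetic and the specific values are illustrative assumptions, not tuned recommendations. This class does not expose L1/L2 penalties on leaf weights, and min_impurity_decrease is only an approximate counterpart of a minimum improvement in loss.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; any tabular dataset would do).
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = GradientBoostingClassifier(
    n_estimators=100,           # number of trees
    learning_rate=0.1,          # shrinkage
    subsample=0.8,              # below 1.0 gives stochastic gradient boosting
    max_depth=3,                # tree depth
    min_samples_split=10,       # observations required to split a node
    min_samples_leaf=5,         # observations required in a terminal node
    min_impurity_decrease=0.0,  # minimum impurity improvement for a split
    random_state=1,
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```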
Unit Structure
12.0 Examples
12.1 Example 1
12.2 Example 2
12.3 Gradient Boosting for classification
12.4 Gradient Boosting for regression
12.5 Gradient Boosting hyperparameters
12.6 Explore number of Samples
12.7 Explore Number of features
12.8 Explore learning rate
12.9 Explore Tree depth
12.10 Grid search hyperparameters
12.1 EXAMPLE 1
Gradient Boosting is a popular boosting algorithm. In gradient boosting, each predictor corrects its predecessor’s error. Gradient Boosted Trees is a variant whose base learner is CART (Classification and Regression Trees) [5].
The below diagram explains how gradient boosted trees are trained for
regression problems.
Gradient Boosted Trees for Regression :
The ensemble consists of N trees. Tree1 is trained using the feature matrix X and the labels y. The predictions, labelled y1(hat), are used to determine the training set residual errors r1. Tree2 is then trained using the feature matrix X and the residual errors r1 of Tree1 as labels. The predicted results r1(hat) are then used to determine the residual r2. The process is repeated until all the N trees forming the ensemble are trained [5].
There is an important parameter used in this technique known as shrinkage.
Shrinkage refers to the fact that the prediction of each tree in the ensemble is shrunk after it is multiplied by the learning rate (eta), which ranges between 0 and 1. There is a trade-off between eta and the number of estimators: decreasing the learning rate needs to be compensated by increasing the number of estimators in order to reach a certain model performance. Once all trees are trained, predictions can be made [5].
Each tree predicts a label, and the final prediction is given by the formula:
y(pred) = y1 + (eta * r1) + (eta * r2) + ....... + (eta * rN)
The scikit-learn class for gradient boosting regression is GradientBoostingRegressor. The corresponding class for classification is GradientBoostingClassifier.
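The additive formula can be checked against a fitted GradientBoostingRegressor, as in the sketch below. It assumes the default squared-error loss and synthetic data. One difference from the formula in the text: in scikit-learn the starting prediction is the output of the fitted init_ estimator (the mean of y by default), not an unshrunk first tree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (an assumption for the demo).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

eta = 0.1
gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=eta, random_state=0)
gbr.fit(X, y)

# Start from the initial prediction ...
manual = gbr.init_.predict(X).astype(float)
# ... and add each tree's predicted residual, shrunk by the learning rate.
for tree in gbr.estimators_[:, 0]:
    manual += eta * tree.predict(X)

print(np.allclose(manual, gbr.predict(X)))
```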
12.2 EXAMPLE 2
Gradient Boosting Scikit-Learn API :
First, confirm that you are using a modern version of the library by running the following script:
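A version check of this kind can be as short as the following sketch:

```python
# Check the installed scikit-learn version.
import sklearn

print(sklearn.__version__)
```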
Running the script will print your version of scikit-learn.
Gradient boosting is provided via the GradientBoostingRegressor and GradientBoostingClassifier classes.
Both models operat e the same way and take the same arguments that
influence how the decision trees are created and added to the ensemble.
Randomness is used in the construction of the model. This means that
each time the algorithm is run on the same data, it will produce a slightly
different model.
When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.
Let’s take a look at how to develop a Gradient Boosting ensemble for both classification and regression.
12.3 GRADIENT BOOSTING FOR CLASSIFICATION [1, 4, 7]
In this section, we will look at using Gradient Boosting for a classification problem.
First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.
The complete example is listed below.
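A sketch of such a dataset definition follows. The text fixes only the number of examples and features; n_informative, n_redundant and random_state are assumptions chosen here for illustration.

```python
from sklearn.datasets import make_classification

# 1,000 examples with 20 input features. The split into informative and
# redundant features and the seed are assumptions.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)

# Summarize the shape of the input and output components.
print(X.shape, y.shape)
```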
Running the example creates the dataset and summarizes the shape of the input and output components.
(1000, 20) (1000,)
Next, we can evaluate a Gradient Boosting algorithm on this dataset [3,9].
We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds [1].
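The evaluation procedure described above might be written as follows. The dataset arguments beyond the sample and feature counts, and the random_state values, are assumptions, so the reported score will not necessarily match the figure quoted in the text.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic binary classification data (seed and feature mix are assumptions).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)

model = GradientBoostingClassifier()   # default hyperparameters

# Repeated stratified k-fold: 10 folds, 3 repeats, 30 scores in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

print("Mean Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))
```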
Running the example reports the mean and standard deviation of the accuracy of the model.
The Gradient Boosting ensemble with default hyperparameters achieves a classification accuracy of about 89.9 percent on this test dataset.
Mean Accuracy: 0.899 (0.030)
First, the Gradient Boosting ensemble is fit on all available data, then the
predict() function can be called to make predictions on new data.
The example below demonstrates this on our binary classification dataset.
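A sketch of fitting a final model and predicting one row is given below. Since no new observation is specified in the text, the first row of the synthetic dataset is reused as a stand-in for new data; that choice, and the dataset seed, are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (seed and feature mix are assumptions).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)

# Fit the final model on all available data.
model = GradientBoostingClassifier()
model.fit(X, y)

# predict() expects a 2-D array, so a single row is passed as X[:1].
row = X[:1]
yhat = model.predict(row)
print("Predicted Class: %d" % yhat[0])
```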
Running the example fits the Gradient Boosting ensemble model on the entire dataset, and the model is then used to make a prediction on a new row of data, as we might when using the model in an application.
Predicted Class: 1
Now that we are familiar with using Gradient Boosting for classification, let’s look at the API for regression.
12.4 GRADIENT BOOSTING FOR REGRESSION
First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.
The complete example is listed below.
Running the example creates the dataset and summarizes the shape of the input and output components.
(1000, 20) (1000,)
Next, we can evaluate a Gradient Boosting algorithm on this dataset.
As we did in the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that negative MAE values closer to zero are better, and a perfect model has a MAE of 0.
The complete example is listed below [1].
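The regression workflow might be sketched as follows. The n_informative, noise and random_state arguments are assumptions, so the MAE obtained will not necessarily match the figure in the text.

```python
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data (seed, noise and feature mix are assumptions).
X, y = make_regression(
    n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7
)
print(X.shape, y.shape)

model = GradientBoostingRegressor()   # default hyperparameters

# Repeated k-fold: 10 folds, 3 repeats. Scores are negated MAE values,
# so every score is at most 0 and values closer to 0 are better.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=cv)

print("MAE: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))
```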
Running the example reports the mean and standard deviation of the MAE of the model.
In this case, we can see the Gradient Boosting ensemble with default hyperparameters achieves a MAE of about 62.
MAE: -62.475 (3.254)
We can also use the Gradient Boosting model as a final model and make
predictions for regression.
First, the Gradient Boosting ensemble is fit on all available data, then the
predict() function can be called to make predictions on new data.
The example below demonstrates this on our regression dataset [1].
Running the example fits the Gradient Boosting ensemble model on the entire dataset, and the model is then used to make a prediction on a new row of data, as we might when using the model in an application.
Prediction: 37
Now that we are familiar with using the scikit-learn API to evaluate and use Gradient Boosting ensembles, let’s look at configuring the model [1].
12.5 GRADIENT BOOSTING HYPERPARAMETERS
The number of trees can be set via the “n_estimators” argument and defaults to 100.
The example below explores the effect of the number of trees with values between 10 and 5,000.
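One way to organize such an exploration is sketched below. The helper names (get_dataset, get_models, evaluate_model), the particular tree counts inside the 10 to 5,000 range, and the dataset seed are assumptions. The block only defines the helpers, because evaluating the larger ensembles can take a long time.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def get_dataset():
    """Synthetic classification data (seed and feature mix are assumptions)."""
    return make_classification(
        n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
    )

def get_models(tree_counts=(10, 50, 100, 500, 1000, 5000)):
    """One unfitted model per ensemble size, keyed by the number of trees."""
    return {str(n): GradientBoostingClassifier(n_estimators=n) for n in tree_counts}

def evaluate_model(model, X, y, n_splits=10, n_repeats=3):
    """Accuracy scores from repeated stratified k-fold cross-validation."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1)
    return cross_val_score(model, X, y, scoring="accuracy", cv=cv)
```

To run the study, call get_dataset() once, loop over get_models().items(), and report the mean and standard deviation of evaluate_model(model, X, y) for each entry. The same pattern carries over to the other hyperparameters by changing the argument that get_models varies.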
Running the example first reports the mean accuracy for each configured number of decision trees.
In this case, we can see that performance improves on this dataset until about 500 trees, after which performance appears to level off. Unlike AdaBoost, Gradient Boosting appears not to overfit as the number of trees is increased in this case [1].
A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.
We can see the general trend of increasing model performance with ensemble size.
Box Plot of Gradient Boosting Ensemble Size vs. Classification Accuracy
12.6 EXPLORE NUMBER OF SAMPLES
The number of samples used to fit each tree can be varied. This means that each tree is fit on a randomly selected subset of the training dataset [1, 4, 7].
Using fewer samples introduces more variance for each tree, although it can improve the overall performance of the model.
The number of samples used to fit each tree is specified by the “subsample” argument and can be set to a fraction of the training dataset size. By default, it is set to 1.0 to use the entire training dataset.
The example below demonstrates the effect of the sample size on model performance [1, 4, 7].
In this case, we can see that mean performance is probably best for a sample size that is about half the size of the training dataset, such as 0.4 or higher [1, 4, 7].
Box Plot of Gradient Boosting Ensemble Sample Size vs. Classification Accuracy
12.7 EXPLORE NUMBER OF FEATURES [1, 4, 7]
The number of features used to fit each decision tree can be varied.
Like changing the number of samples, changing the number of features introduces additional variance into the model, which may improve performance, although it might require an increase in the number of trees.
The number of features used by each tree is taken as a random sample and is specified by the “max_features” argument, which defaults to all features in the training dataset.
The example below explores the effect of the number of features on model performance for the test dataset, with values between 1 and 20.
A box and whisker plot is created for the distribution of accuracy scores for each configured number of features [1, 4, 7].
We can see the general trend of increasing model performance, perhaps peaking around eight or nine features and staying somewhat level.
Box Plot of Gradient Boosting Ensemble Number of Features vs. Classification Accuracy
12.8 EXPLORE LEARNING RATE [1, 4, 7]
Learning rate controls the amount of contribution that each model has on the ensemble prediction. Smaller rates may require more decision trees in the ensemble, whereas larger rates may require an ensemble with fewer trees. It is common to explore learning rate values on a log scale, such as between a very small value like 0.0001 and 1.0. The learning rate can be controlled via the “learning_rate” argument and defaults to 0.1.
The example below explores the learning rate and compares the effect of values between 0.0001 and 1.0.
This highlights the trade-off between the number of trees (speed of training) and learning rate, e.g. we can fit a model faster by using fewer trees and a larger learning rate.
A box and whisker plot is created for the distribution of accuracy scores for each configured learning rate.
Box Plot of Gradient Boosting Ensemble Learning Rate vs. Classification Accuracy
12.9 EXPLORE TREE DEPTH [1, 4, 7]
Like varying the number of samples and features used to fit each decision tree, varying the depth of each tree is another important hyperparameter for gradient boosting.
The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general and not too deep and specialized.
Gradient boosting performs well with trees that have a modest depth, finding a balance between skill and generality [1, 4, 7].
Tree depth is controlled via the “max_depth” argument and defaults to 3.
The example below explores tree depths between 1 and 10 and the effect on model performance.
Running the example first reports the mean accuracy for each configured tree depth.
Performance improves with tree depth, perhaps peaking around a depth of 3 to 6, after which the deeper, more specialized trees result in worse performance.
A box and whisker plot is created for the distribution of accuracy scores for each configured tree depth.
We can see the general trend of increasing model performance with the tree depth to a point, after which performance begins to degrade rapidly with the over-specialized trees.
Box Plot of Gradient Boosting Ensemble Tree Depth vs. Classification Accuracy
12.10 GRID SEARCH HYPERPARAMETERS [1,4,7]
Gradient boosting can be challenging to configure, as the algorithm has many key hyperparameters that influence the behavior of the model on training data, and the hyperparameters interact with each other.
As such, it is a good practice to use a search process to discover a configuration of the model hyperparameters that works well or best for a given predictive modeling problem. Popular search processes include a random search and a grid search.
In this section we will look at grid searching common ranges for the key hyperparameters of the gradient boosting algorithm, which you can use as a starting point for your own projects. This can be achieved using the GridSearchCV class and specifying a dictionary that maps model hyperparameter names to the values to search.
In this case, we will grid search four key hyperparameters for gradient boosting: the number of trees used in the ensemble, the learning rate, the subsample size used to train each tree, and the maximum depth of each tree. We will use a range of popular, well-performing values for each hyperparameter.
Each configuration combination will be evaluated using repeated k-fold cross-validation, and configurations will be compared using the mean score, in this case, classification accuracy.
The complete example of grid searching the key hyperparameters of the gradient boosting algorithm on our synthetic classification dataset is listed below.
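A sketch of how such a search could be set up is shown below. The candidate values in the grid are assumptions chosen for illustration, since the text names the four hyperparameters but not the values searched. The block only constructs the search object, because fitting it evaluates every combination under 30-fold repeated cross-validation and can run for a long time.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, RepeatedStratifiedKFold

# Dictionary mapping hyperparameter names to the values to search.
# The candidate values are illustrative assumptions.
grid = {
    "n_estimators": [10, 50, 100, 500],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [3, 7, 9],
}

# Repeated k-fold cross-validation used to score each configuration.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid=grid,
    scoring="accuracy",
    cv=cv,
    n_jobs=-1,
)

# Number of configurations the search would evaluate.
print(len(ParameterGrid(grid)))
```

Calling grid_search.fit(X, y) on the dataset runs the search. Afterwards, best_score_ and best_params_ hold the best configuration, and cv_results_ holds the mean score of every configuration.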
Running the example may take a while depending on your hardware. At the end of the run, the configuration that achieved the best score is reported first, followed by the scores for all other configurations that were considered.
A configuration with a learning rate of 0.1, max depth of 7 levels, 500 trees and a subsample of 70% performed the best with a classification accuracy of about 94.6 percent.
The model may perform even better with more trees such as 1,000 or 5,000 although these configurations were not tested in this case to ensure that the grid search completed in a reasonable time.
Unit Structure
13.1 XG Boost
13.1.0 Boosting
13.1.1 Using XGBoost in Python
13.1.2 k-fold cross validation using XGBoost
13.1.3 XGBoost Installation Guide
13.2 Voting Ensembles
13.2.1 Voting ensemble for classification
13.2.2 Hard voting ensemble for classification
13.1 XG BOOST
Extreme Gradient Boosting (XGBoost) is an upgraded implementation of the Gradient Boosting algorithm, developed for high computational speed, scalability, and better performance [1-4,7].
XGBoost has various features, which are as follows:
1. Parallel Processing
2. Cross-Validation
3. Cache Optimization
4. Distributed Computing
XGBoost is becoming popular because of:
Speed and performance
Core algorithm is parallelizable
Consistently outperforms other algorithm methods
Wide variety of tuning parameters
XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library. But wait, what is boosting? Well, keep on reading.
13.1.0 Boosting [1-4,7]:
Boosting is a sequential technique which works on the principle of an ensemble. It combines a set of weak learners and delivers improved prediction accuracy. At any instant t, the model outcomes are weighted based on the outcomes of the previous instant t-1. The outcomes predicted correctly are given a lower weight and the ones misclassified are weighted higher. Let's understand boosting in general with a simple illustration.
Four classifiers (in 4 boxes), shown above, are trying to classify + and - classes as homogeneously as possible.
1. Box 1: The first classifier (usually a decision stump) creates a vertical line (split) at D1. It says anything to the left of D1 is + and anything to the right of D1 is -. However, this classifier misclassifies three + points.
Note : a Decision Stump is a Decision Tree model that only splits off at one level, therefore the final prediction is based on only one feature.
2. Box 2: The second classifier gives more weight to the three + misclassified points (see the bigger size of +) and creates a vertical line at D2. Again it says, anything to the right of D2 is - and left is +. Still, it makes mistakes by incorrectly classifying three - points.
3. Box 3: Again, the third classifier gives more weight to the three - misclassified points and creates a horizontal line at D3. Still, this classifier fails to classify the points (in the circles) correctly.
Page 193
193 Artificial Intelligence & Machine Learning Lab 4. Box 4: This is a weighted combination of the weak clas sifiers (Box 1,2
and 3). As you can see, it does a good job at classifying all the points
The basic idea behind boosting algorithms is to build a weak model,
draw conclusions about feature importance and parameters, and then use
those conclusions to build a new, stronger model that capitalizes on the
misclassification error of the previous model and tries to reduce it.
Now, let's come to XGBoost. To begin with, you should know about the
default base learners of XGBoost: tree ensembles. The tree ensemble model
is a set of classification and regression trees (CART). Trees are grown
one after another, and attempts to reduce the misclassification rate are
made in subsequent iterations. The XGBoost documentation illustrates this
with a simple CART that classifies whether someone will like computer
games.
In the Tree Ensemble section of that documentation, you will notice that
each tree gives a different prediction score depending on the data it sees,
and the scores of the individual trees are summed to get the final score.
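The idea of growing trees one after another and summing their scores can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming scikit-learn decision stumps as the weak learners and squared error as the loss; it is not XGBoost's actual implementation, and the data, learning rate and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (synthetic data, purely illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3                        # shrinkage applied to each tree's score
prediction = np.full(y.shape, y.mean())    # start from a constant model
trees, errors = [], []

for _ in range(20):
    residual = y - prediction              # what the current ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    prediction += learning_rate * stump.predict(X)   # add the new tree's score
    trees.append(stump)
    errors.append(float(np.mean((y - prediction) ** 2)))

print("training MSE after 1 tree:   %.4f" % errors[0])
print("training MSE after 20 trees: %.4f" % errors[-1])
```

Each new stump is fitted to the errors left by the trees before it, so the training error cannot increase from one round to the next in this setup.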
13.1.1 Using XGBoost in Python [1-4,7]:
First, import the Boston Housing dataset and store it in a variable called
boston. Note that load_boston() was removed in scikit-learn 1.2, so this
step requires an older scikit-learn release.
from sklearn.datasets import load_boston
boston = load_boston()
The boston variable is a dictionary-like object, so you can check its keys
using the .keys() method.
print(boston.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
You can easily check its shape by using the boston.data.shape attribute,
which will return the size of the dataset.
print(boston.data.shape)
(506, 13)
As you can see, it returned (506, 13), which means there are 506 rows of
data with 13 columns. Now, if you want to know what the 13 columns are,
you can simply use the .feature_names attribute and it will return the
feature names.
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
The description of the dataset is available in the dataset itself. You can
take a look at it using .DESCR.
print(boston.DESCR)
Boston House Prices dataset
Notes:
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
Now let's convert it into a pandas DataFrame. For that you need to import
the pandas library and call the DataFrame() function, passing the argument
boston.data. To label the columns, assign boston.feature_names to the
.columns attribute of the pandas DataFrame.
import pandas as pd
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
Explore the top 5 rows of the dataset by using the head() method on your
pandas DataFrame.
data.head()
You'll notice that there is no column called PRICE in the DataFrame. This
is because the target column is available in another attribute called
boston.target. Append it to your pandas DataFrame.
data['PRICE'] = boston.target
Run the .info() method on your DataFrame to get useful information about
the data.
data.info()
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
PRICE 506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB
It turns out that this dataset has 14 columns (including the target variable
PRICE) and 506 rows. Notice that the columns are of float data type,
indicating the presence of only continuous features, with no missing values
in any of the columns. To get more summary statistics of the different
features in the dataset, use the describe() method on your DataFrame.
data.describe()
Note that describe() only gives summary statistics of columns which are
continuous in nature and not categorical.
If you plan to use XGBoost on a dataset which has categorical features,
you may want to consider applying some encoding (like one-hot encoding)
to such features before training the model.
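As a small illustration of the one-hot encoding mentioned above, the sketch below uses pandas' get_dummies() on a made-up frame. The column names and values are hypothetical and are not part of the housing dataset.

```python
import pandas as pd

# Hypothetical toy frame: "city" is categorical, "rooms" is numeric.
df = pd.DataFrame({
    "city": ["north", "south", "east", "south"],
    "rooms": [3, 4, 2, 5],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())
```

Each distinct category becomes its own indicator column, which gives a tree-based learner purely numeric inputs.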
Without delving into more exploratory analysis and feature engineering,
you will now focus on applying the algorithm to train the model on this
data.
Install the xgboost Python library on your system by running pip install
xgboost at the command prompt.
XGBoost's hyperparameters:
At this point, before building the model, you should be aware of the tuning
parameters that XGBoost provides. There are a plethora of tuning
parameters for tree-based learners in XGBoost, all of which are described
in the XGBoost documentation. The most common ones that you should know are:
learning_rate: step size shrinkage used to prevent overfitting. Range is
[0,1].
max_depth: determines how deeply each tree is allowed to grow during
any boosting round.
subsample: fraction of samples used per tree. A low value can lead to
underfitting.
colsample_bytree: fraction of features used per tree. A high value can
lead to overfitting.
n_estimators: number of trees you want to build.
objective: determines the loss function to be used, such as reg:linear
(named reg:squarederror in newer XGBoost releases) for regression
problems, reg:logistic for classification problems with only a decision,
and binary:logistic for classification problems with probability.
XGBoost also supports regularization parameters to penalize models as
they become more complex and reduce them to simple (parsimonious)
models [1-4,7].
gamma: controls whether a given node will split based on the expected
reduction in loss after the split. A higher value leads to fewer splits.
Supported only for tree-based learners.
alpha: L1 regularization on leaf weights. A large value leads to more
regularization.
lambda: L2 regularization on leaf weights; it is smoother than L1
regularization.
It's also worth mentioning that though you are using trees as your base
learners, you can also use XGBoost's relatively less popular linear base
learners and one other tree learner known as dart. All you have to do is set
the booster parameter to either gbtree (default), gblinear or dart.
Now, you will create the train and test sets for validating the
results using the train_test_split function from sklearn's model_selection
module, with test_size equal to 20% of the data. Also, to maintain
reproducibility of the results, a random_state is assigned.
You can see that your RMSE for the price prediction came out to be
around 10.8 (in units of $1000).
13.1.2 k-fold Cross Validation using XGBoost [1-4,7]:
In order to build more robust models, it is common to do a k-fold cross
validation, where all the entries in the original training dataset are used
for both training and validation, and each entry is used for validation
just once. XGBoost supports k-fold cross validation via the cv() method.
All you have to do is specify the nfold parameter, which is the number of
cross validation sets you want to build. It also supports many other
parameters, such as:
num_boost_round: denotes the number of trees you build (analogous to
n_estimators).
metrics: tells the evaluation metrics to be watched during CV.
as_pandas: to return the results in a pandas DataFrame.
early_stopping_rounds: finishes training of the model early if the hold-out
metric ("rmse" in our case) does not improve for a given number of
rounds.
seed: for reproducibility of results.
This time you will create a hyper-parameter dictionary params which
holds all the hyper-parameters and their values as key-value pairs, but
excludes n_estimators because you will use num_boost_round instead.
You will use these parameters to build a 3-fold cross validation model by
invoking XGBoost's cv() method and store the results in a cv_results
DataFrame. Note that here you are using the DMatrix object you created
earlier.
params = {"objective": "reg:linear", 'colsample_bytree': 0.3, 'learning_rate': 0.1,
          'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10, metrics="rmse",
                    as_pandas=True, seed=123)
cv_results contains train and test RMSE metrics for each boosting round.
Extract and print the final boosting round metric.
print((cv_results["test-rmse-mean"]).tail(1))
49 4.031162
Name: test-rmse-mean, dtype: float64
You can see that your RMSE for the price prediction has reduced as
compared to last time and came out to be around 4.03 (in units of $1000).
You can reach an even lower RMSE for a different set of hyper-parameters.
You may consider applying techniques like Grid Search, Random Search and
Bayesian Optimization to reach the optimal set of hyper-parameters.
Visualize Boosting Trees and Feature Importance [1-4,7]:
You can also visualize individual trees from the fully boosted model that
XGBoost creates using the entire housing dataset. XGBoost has a
plot_tree() function that makes this type of visualization easy. Once you
train a model using the XGBoost learning API, you can pass it to the
plot_tree() function along with the index of the tree you want to plot,
given in the num_trees argument.
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
Plotting the first tree with the matplotlib library:
These plots provide insight into how the model arrived at its final
decisions and what splits it made to arrive at those decisions.
Another way to visualize your XGBoost models is to examine the
importance of each feature column in the original dataset within the
model.
One simple way of doing this involves counting the number of times each
feature is split on across all boosting rounds (trees) in the model, and then
visualizing the result as a bar graph, with the features ordered according to
how many times they appear. XGBoost has a plot_importance() function
that allows you to do exactly this.
As you can see the feature RM has been given the highest importance
score among all the features.
Example 2:
XGBoost Regression API [1-4,7]
XGBoost can be installed as a standalone library, and an XGBoost model
can be developed using the scikit-learn API.
First, install the XGBoost library.
sudo pip install xgboost
You can then confirm that the XGBoost library was installed correctly and
can be used by running the following script.
# check xgboost version
import xgboost
print(xgboost.__version__)
Running the script prints the version of the XGBoost library you have
installed. Your version should match or exceed the version used to prepare
these examples; if it does not, upgrade your version of the XGBoost library.
If you do have errors when trying to run the above script, I recommend
downgrading to version 1.0.1 (or lower). This can be achieved by
specifying the version to install to the pip command, as follows:
sudo pip install xgboost==1.0.1
If you require specific instructions for your development environment, see
the tutorial:
13.1.3 XGBoost Installation Guide [1-4,7]:
The XGBoost library has its own custom API, although we will use it
via the scikit-learn wrapper classes: XGBRegressor and
XGBClassifier. This will allow us to use the full suite of tools from the
scikit-learn machine learning library to prepare data and evaluate models.
An XGBoost regression model can be defined by creating an instance of
the XGBRegressor class; for example:
# create an xgboost regression model
from xgboost import XGBRegressor
model = XGBRegressor()
You can specify hyperparameter values to the class constructor to
configure the model.
Perhaps the most commonly configured hyperparameters are the
following:
n_estimators: The number of trees in the ensemble, often increased until
no further improvements are seen.
max_depth: The maximum depth of each tree; values are often between 1
and 10.
eta: The learning rate used to weight each model, often set to small values
such as 0.3, 0.1, 0.01, or smaller.
subsample: The fraction of samples (rows) used in each tree, set to a
value between 0 and 1, often 1.0 to use all samples.
colsample_bytree: The fraction of features (columns) used in each tree, set
to a value between 0 and 1, often 1.0 to use all features.
For example:
# create an xgboost regression model
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1,
    subsample=0.7, colsample_bytree=0.8)
Good hyperparameter values can be found by trial and error for a given
dataset, or by systematic experimentation such as a grid search across a
range of values.
Randomness is used in the construction of the model. This means that
each time the algorithm is run on the same data, it may produce a slightly
different model.
When using machine learning algorithms that learn stochastically, it is
good practice to evaluate them by averaging their
performance across multiple runs or repeats of cross-validation. When
fitting a final model, it may be desirable to either increase the number of
trees until the variance of the model is reduced across repeated
evaluations, or to fit multiple final models and average their predictions.
Let's take a look at how to develop an XGBoost ensemble for regression.
XGBoost Regression Example [1-4,7]:
In this section, we will look at how we might develop an XGBoost model
for a standard regression predictive modeling dataset.
First, let's introduce a standard regression dataset.
We will use the housing dataset.
The housing dataset is a standard machine learning dataset comprising 506
rows of data with 13 numerical input variables and a numerical target
variable.
Using a test harness of repeated 10-fold cross-validation with
three repeats, a naive model can achieve a mean absolute error (MAE) of
about 6.6. A top-performing model can achieve a MAE on this same test
harness of about 1.9. This provides the bounds of expected performance
on this dataset.
The dataset involves predicting the house price given details of the
house's suburb in the American city of Boston.
Housing Dataset (housing.csv) [1-4,7]:
Housing Description (housing.names)
No need to download the dataset; we will download it automatically as
part of our worked examples.
The example below downloads and loads the dataset as a Pandas
DataFrame and summarizes the shape of the dataset and the first five rows
of data.
# load and summarize the housing dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
url =
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
Running the example confirms the 506 rows of data and 13 input variables
and a single numeric target variable (14 in total). We can also see that all
input variables are numeric.
(506, 14)
         0     1     2  3      4      5  ...  8      9    10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575  ...  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421  ...  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185  ...  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998  ...  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147  ...  3  222.0  18.7  396.90  5.33  36.2
[5 rows x 14 columns]
Next, let's evaluate a regression XGBoost model with default
hyperparameters on the problem.
First, we can split the loaded dataset into input and output columns for
training and evaluating a predictive model.
# split data into input and output columns
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
Next, we can create an instance of the model with a default configuration.
# define model
model = XGBRegressor()
We will evaluate the model using the best practice of repeated k-fold
cross-validation with 3 repeats and 10 folds.
This can be achieved by using the RepeatedKFold class to configure the
evaluation procedure and calling the cross_val_score() function to evaluate
the model using the procedure and collect the scores.
Model performance will be evaluated using mean absolute error (MAE).
Note that MAE is made negative in the scikit-learn library so that it can be
maximized. As such, we can ignore the sign and assume all errors are
positive.
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y,
    scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
Once evaluated, we can report the estimated performance of the model
when used to make predictions on new data for this problem.
In this case, because the scores were made negative, we can use the
absolute() NumPy function to make the scores positive.
We then report a statistical summary of the performance using the mean
and standard deviation of the distribution of scores, another good practice.
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
Tying this together, the complete example of evaluating an XGBoost
model on the housing regression predictive modeling problem is listed
below.
scores = cross_val_score(model, X, y,
    scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
Running the example evaluates the XGBoost Regression algorithm on the
housing dataset and reports the average MAE across the three repeats of
10-fold cross-validation.
In this case, we can see that the model achieved a MAE of about 2.1.
This is a good score: it is better than the baseline, meaning the model has
skill, and it is close to the best score of 1.9.
Mean MAE: 2.109 (0.320)
We may decide to use the XGBoost Regression model as our final model
and make predictions on new data.
This can be achieved by fitting the model on all available data and calling
the predict() function, passing in a new row of data.
13.2 VOTING ENSEMBLES [1-4,7]
A voting ensemble (or a "majority voting ensemble") is an ensemble
machine learning model that combines the predictions from multiple other
models.
It is a technique that may be used to improve model performance, ideally
achieving better performance than any single model used in the ensemble.
A voting ensemble works by combining the predictions from multiple
models. It can be used for classification or regression. In the case of
regression, this involves calculating the average of the predictions from
the models. In the case of classification, the predictions for each label are
summed and the label with the majority vote is predicted.
Regression Voting Ensemble: Predictions are the average of contributing
models.
Classification Voting Ensemble: Predictions are the majority vote of
contributing models.
There are two approaches to the majority vote prediction for classification;
they are hard voting and soft voting.
Hard voting involves summing the predictions for each class label and
predicting the class label with the most votes. Soft voting involves
summing the predicted probabilities (or probability-like scores) for each
class label and predicting the class label with the largest probability.
Hard Voting: Predict the class with the largest sum of votes from models.
Soft Voting: Predict the class with the largest summed probability from
models.
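The difference between the two rules is easiest to see on a single sample. The sketch below uses made-up probabilities from three hypothetical classifiers, chosen so that hard and soft voting disagree.

```python
import numpy as np

# Hypothetical outputs of three classifiers for ONE sample and two classes (0 and 1).
# Each row holds one model's predicted probabilities [P(class 0), P(class 1)].
probas = np.array([
    [0.55, 0.45],   # model A: weakly prefers class 0
    [0.60, 0.40],   # model B: weakly prefers class 0
    [0.10, 0.90],   # model C: strongly prefers class 1
])

# Hard voting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=1)                     # crisp label per model
hard_label = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: sum the probabilities per class, then pick the largest total.
soft_label = int(probas.sum(axis=0).argmax())

print("hard voting predicts:", hard_label)
print("soft voting predicts:", soft_label)
```

Two of the three models vote for class 0, so hard voting picks class 0, while the single confident model pushes the summed probability of class 1 higher, so soft voting picks class 1.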
A voting ensemble may be considered a meta-model, a model of models.
As a meta-model, it could be used with any collection of existing trained
machine learning models, and the existing models do not need to be aware
that they are being used in the ensemble. This means you could explore
using a voting ensemble on any set or subset of fit models for your
predictive modeling task.
A voting ensemble is appropriate when you have two or more models that
perform well on a predictive modeling task. The models used in the
ensemble must mostly agree in their predictions.
Use voting ensembles when:
All models in the ensemble have generally the same good
performance.
All models in the ensemble mostly already agree.
Hard voting is appropriate when the models used in the voting ensemble
predict crisp class labels. Soft voting is appropriate when the models used
in the voting ensemble predict the probability of class membership. Soft
voting can be used for models that do not natively predict a class
membership probability, although they may require calibration of their
probability-like scores prior to being used in the ensemble (e.g. support
vector machine, k-nearest neighbors, and decision trees).
Hard voting is for models that predict class labels.
Soft voting is for models that predict class membership probabilities.
The voting ensemble is not guaranteed to provide better performance than
any single model used in the ensemble. If any given model used in the
ensemble performs better than the voting ensemble, that model should
probably be used instead of the voting ensemble.
This is not always the case. A voting ensemble can offer lower variance in
the predictions made over individual models. This can be seen in a lower
variance in prediction error for regression tasks. This can also be seen in a
lower variance in accuracy for classification tasks. This lower variance
may result in a lower mean performance of the ensemble, which might be
desirable given the higher stability or confidence of the model.
Use a voting ensemble if:
It results in better performance than any model used in the ensemble.
It results in a lower variance than any model used in the ensemble.
A voting ensemble is particularly useful for machine learning models that
use a stochastic learning algorithm and result in a different final model
each time they are trained on the same dataset. One example is neural
networks that are fit using stochastic gradient descent.
Another particularly useful case for voting ensembles is when combining
multiple fits of the same machine learning algorithm with slightly different
hyperparameters.
Voting ensembles are most effective when:
Combining multiple fits of a model trained using stochastic learning
algorithms.
Combining multiple fits of a model with different hyperparameters.
A limitation of the voting ensemble is that it treats all models the same,
meaning all models contribute equally to the prediction. This is a problem
if some models are good in some situations and poor in others.
An extension to the voting ensemble to address this problem is to use a
weighted average or weighted voting of the contributing models. This is
sometimes called blending. A further extension is to use a machine
learning model t o learn when and how much to trust each model when
making predictions. This is referred to as stacked generalization, or
stacking for short.
Extensions to voting ensembles:
Weighted Average Ensemble (blending).
Stacked Generalization (stacking).
Voting Ensemble Scikit-Learn API [1-4,7]:
Voting ensembles can be implemented from scratch, although this can be
challenging for beginners.
The scikit-learn Python machine learning library provides an
implementation of voting for machine learning.
It is available in version 0.22 of the library and higher.
First, confirm that you are using a modern version of the library by
running the following script:
# check scikit-learn version
import sklearn
print(sklearn.__version__)
Running the script will print your version of scikit-learn.
Your version should be 0.22 or higher. If not, you must upgrade your
version of the scikit-learn library.
Voting is provided via the VotingRegressor and VotingClassifier classes.
Both models operate the same way and take the same arguments. Using
the model requires that you specify a list of estimators that make
predictions and are combined in the voting ensemble.
A list of base models is provided via the "estimators" argument. This is a
Python list where each element in the list is a tuple with the name of the
model and the configured model instance. Each model in the list must
have a unique name.
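A minimal sketch of the "estimators" argument is shown below. The three base models and their names ('lr', 'nb', 'cart') are arbitrary choices for illustration, and the small synthetic dataset is only there to show that the ensemble fits and predicts like any other scikit-learn model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Each entry is a (unique name, configured model) tuple; the names are arbitrary.
models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('nb', GaussianNB()),
    ('cart', DecisionTreeClassifier(random_state=1)),
]
ensemble = VotingClassifier(estimators=models, voting='hard')

# Small synthetic problem just to exercise fit() and predict().
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```

VotingRegressor takes the same list of (name, model) tuples, but has no voting argument because it always averages the predictions.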
Now that we are familiar with the voting ensemble API in scikit-learn,
let's look at some worked examples.
13.2.1 Voting Ensemble for Classification [1-4,7]:
First, we can use the make_classification() function to create a synthetic
binary classification problem with 1,000 examples and 20 input features.
The complete example is listed below.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20,
    n_informative=15, n_redundant=5, random_state=2)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the
input and output components.
(1000, 20) (1000,)
Next, we will demonstrate hard voting and soft voting for this dataset.
13.2.2 Hard Voting Ensemble for Classification [1-4,7]:
We can demonstrate hard voting with the k-nearest neighbors (KNN)
algorithm.
We can fit five different versions of the KNN algorithm, each with a
different number of neighbors used when making predictions. We will use
1, 3, 5, 7, and 9 neighbors (odd numbers in an attempt to avoid ties).
Our expectation is that by combining the class labels predicted
by each different KNN model, the hard voting ensemble will achieve
better predictive performance than any standalone model used in the
ensemble, on average.
First, we can create a function named get_voting() that creates each KNN
model and combines the models into a hard voting ensemble.
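One way such a get_voting() function could be written is sketched below. The estimator names ('knn1', 'knn3', and so on) are an assumption made for illustration; any unique names would do.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

def get_voting():
    """Build a hard voting ensemble of KNN models with 1, 3, 5, 7 and 9 neighbors."""
    models = [('knn%d' % k, KNeighborsClassifier(n_neighbors=k))
              for k in (1, 3, 5, 7, 9)]
    return VotingClassifier(estimators=models, voting='hard')

# Same synthetic dataset as defined earlier in this section.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5, random_state=2)
ensemble = get_voting().fit(X, y)
print(len(ensemble.estimators_), "KNN models combined")
```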
We can then create a list of models to evaluate, including each standalone
version of the KNN model configurations and the hard voting ensemble.
This will help us directly compare each standalone configuration of the
KNN model with the ensemble in terms of the distribution of classification
accuracy scores. The get_models() function below creates the list of
models for us to evaluate.
Each model will be evaluated using repeated k-fold cross-validation.
The evaluate_model() function below takes a model instance and returns
a list of scores from three repeats of stratified 10-fold cross-validation.
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv,
        n_jobs=-1, error_score='raise')
    return scores
We can then report the mean performance of each algorithm, and also
create a box and whisker plot to compare the distribution of accuracy
scores for each algorithm.
# compare hard voting to standalone classifiers
Running the example first reports the mean and standard deviation of the
accuracy for each model.
Note: Your results may vary given the stochastic nature of the algorithm
or evaluation procedure, or differences in numerical precision. Consider
running the example a few times and compare the average outcome.
We can see the hard voting ensemble achieves a better classification
accuracy of about 90.2% compared to all standalone versions of the
model.
A box-and-whisker plot is then created comparing the distribution of
accuracy scores for each model, allowing us to clearly see that the hard
voting ensemble performs better than all standalone models on average.
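The comparison described above can be sketched end to end as follows. The get_models() and get_voting() bodies and the display names are assumptions written for illustration, and the exact accuracies printed may differ from the figures quoted in the text.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

NEIGHBORS = (1, 3, 5, 7, 9)

def get_voting():
    # Hard voting ensemble over all KNN configurations.
    models = [('knn%d' % k, KNeighborsClassifier(n_neighbors=k)) for k in NEIGHBORS]
    return VotingClassifier(estimators=models, voting='hard')

def get_models():
    # Standalone KNN configurations plus the ensemble, keyed by a display name.
    models = {'knn%d' % k: KNeighborsClassifier(n_neighbors=k) for k in NEIGHBORS}
    models['hard_voting'] = get_voting()
    return models

def evaluate_model(model, X, y):
    # Three repeats of stratified 10-fold cross-validation, scored by accuracy.
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv,
                           error_score='raise')

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5, random_state=2)
results = {}
for name, model in get_models().items():
    results[name] = evaluate_model(model, X, y)
    print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
```

The lists of scores in results can be passed to a box-and-whisker plot to compare the distributions visually.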
First, the hard voting ensemble is fit on all available data, then the
predict() function can be called to make predictions on new data.
Running the example fits the hard voting ensemble model on the entire
dataset, which is then used to make a prediction on a new row of data, as we
might when using the model in an application.
Predicted Class: 1
Soft Voting Ensemble for Classification
We can demonstrate soft voting with the support vector machine (SVM)
algorithm.
The SVM algorithm does not natively predict probabilities, although it can
be configured to predict probability-like scores by setting the "probability"
argument to "True" in the SVC class.
We can fit five different versions of the SVM algorithm with a polynomial
kernel, each with a different polynomial degree, set via the "degree"
argument. We will use degrees 1-5.
Our expectation is that by combining the class membership
probability scores predicted by each different SVM model, the soft
voting ensemble will achieve better predictive performance than any
standalone model used in the ensemble, on average.
First, we can create a function named get_voting() that creates the SVM
models and combines them into a soft voting ensemble.
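A possible shape for this get_voting() function is sketched below. The estimator names ('svm1' to 'svm5') are an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

def get_voting():
    """Build a soft voting ensemble of polynomial-kernel SVMs with degrees 1 to 5."""
    models = [('svm%d' % d, SVC(probability=True, kernel='poly', degree=d))
              for d in range(1, 6)]
    return VotingClassifier(estimators=models, voting='soft')

# Same synthetic dataset as defined earlier in this section.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=15, n_redundant=5, random_state=2)
ensemble = get_voting().fit(X, y)

# With voting='soft' the ensemble averages the members' class probabilities.
proba = ensemble.predict_proba(X[:3])
print(proba.shape)
```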
We can then create a list of models to evaluate, including each standalone
version of the SVM model configurations and the soft voting ensemble.
This will help us directly compare each standalone configuration of the
SVM model with the ensemble in terms of the distribution of classification
accuracy scores. The get_models() function below creates the list of
models for us to evaluate.
return models
We can evaluate and report model performance using repeated k-fold
cross-validation as we did in the previous section.
Tying this together, the complete example is listed below.
Running the example first reports the mean and standard deviation of the
accuracy for each model.
Note: Your results may vary given the stochastic nature of the algorithm
or evaluation procedure, or differences in numerical precision. Consider
running the example a few times and compare the average outcome.
We can see the soft voting ensemble achieves a better classification
accuracy of about 92.4% compared to all standalone versions of the
model.
A box-and-whisker plot is then created comparing the distribution of
accuracy scores for each model, allowing us to clearly see that the soft
voting ensemble performs better than all standalone models on average.
If we choose a soft voting ensemble as our final model, we can fit and use
it to make predictions on new data just like any other model.
First, the soft voting ensemble is fit on all available data, then the predict()
function can be called to make predictions on new data.
Running the example fits the soft voting ensemble model on the entire
dataset and is then used to make a prediction on a new row of data, as we
might when using the model in an application.
Predicted Class: 1
Voting Ensemble for Regression
We will look at using voting for a regression problem.
First, we can use the make_regression() function to create a synthetic
regression problem with 1,000 examples and 20 input features.
The complete example is listed below.
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=20,
n_informative=15, noise=0.1, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the
input and output components.
(1000, 20) (1000,)
We can demonstrate ensemble voting for regression with a decision tree
algorithm, sometimes referred to as a classification and regression tree
(CART) algorithm.
We can fit five different versions of the CART algorithm, each with a
different maximum depth of the decision tree, set via the “max_depth”
argument. We will use depths of 1-5.
Our expectation is that, by combining the values predicted by each
different CART model, the voting ensemble will achieve a better
predictive performance than any standalone model used in the ensemble,
on average.
First, we can create a function named get_voting() that creates each CART
model and combines the models into a voting ensemble.
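One possible way to write get_voting() is sketched below, using scikit-learn's VotingRegressor and DecisionTreeRegressor. The estimator names ("cart1" to "cart5") are illustrative choices.

```python
from sklearn.ensemble import VotingRegressor
from sklearn.tree import DecisionTreeRegressor

def get_voting():
    # one CART model per maximum depth, 1 through 5
    models = [("cart%d" % d, DecisionTreeRegressor(max_depth=d)) for d in range(1, 6)]
    # VotingRegressor averages the predictions of its fitted base models
    return VotingRegressor(estimators=models)

ensemble = get_voting()
print([name for name, _ in ensemble.estimators])
```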
We can then create a list of models to evaluate, including each standalone
version of the CART model configurations and the voting ensemble.
This will help us directly compare each standalone configuration of the
CART model with the ensemble in terms of the distribution of error
scores. The get_models() function below creates the list of models for us
to evaluate.
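A sketch of get_models() under the same assumptions follows. get_voting() is repeated in compact form so that the block runs on its own, and the dictionary keys are illustrative names.

```python
from sklearn.ensemble import VotingRegressor
from sklearn.tree import DecisionTreeRegressor

def get_voting():
    # voting ensemble of CART models with max_depth 1-5
    models = [("cart%d" % d, DecisionTreeRegressor(max_depth=d)) for d in range(1, 6)]
    return VotingRegressor(estimators=models)

def get_models():
    # standalone CART models first, then the ensemble that combines them
    models = dict()
    for d in range(1, 6):
        models["cart%d" % d] = DecisionTreeRegressor(max_depth=d)
    models["voting"] = get_voting()
    return models

print(list(get_models().keys()))
```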
We can evaluate and report model performance using repeated k-fold
cross-validation as we did in the previous section.
Models are evaluated using mean absolute error (MAE). The scikit-learn
library negates the score so that it can be maximized. This means that the
reported MAE scores are negative, larger values are better, and 0
represents no error.
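The evaluation step can be sketched as follows. The fold settings (10 folds, 3 repeats) are assumed, and a single depth-3 tree is scored here only to keep the example short.

```python
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

def evaluate_model(model, X, y):
    # repeated k-fold cross-validation (assumed: 10 folds, 3 repeats)
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # the "neg_" prefix means scikit-learn returns MAE multiplied by -1
    return cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=cv)

X, y = make_regression(n_samples=1000, n_features=20,
                       n_informative=15, noise=0.1, random_state=1)
scores = evaluate_model(DecisionTreeRegressor(max_depth=3), X, y)
print("%.3f (%.3f)" % (mean(scores), std(scores)))
```

Every returned score is at most 0, so the model with the score closest to 0 has the lowest error.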
Tying this together, the complete example is listed below.
Running the example first reports the mean and standard deviation of the
negative MAE for each model.
Note: Your results may vary given the stochastic nature of the algorithm
or evaluation procedure, or differences in numerical precision. Consider
running the example a few times and comparing the average outcome.
We can see the voting ensemble achieves a better mean absolute error of
about -136.338, which is larger (better) compared to all standalone
versions of the model.
A box-and-whisker plot is then created comparing the distribution of
negative MAE scores for each model, allowing us to clearly see that the
voting ensemble performs better than all standalone models on average.
If we choose a voting ensemble as our final model, we can fit and use it to
make predictions on new data just like any other model.
First, the voting ensemble is fit on all available data, then the predict()
function can be called to make predictions on new data.
The example below demonstrates this on our regression dataset.
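A minimal sketch of this step on the synthetic regression dataset defined earlier is shown here. The first row of the data is used as a stand-in for a new row, and the printed value depends on the fitted trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.tree import DecisionTreeRegressor

# same synthetic dataset as in the earlier regression example
X, y = make_regression(n_samples=1000, n_features=20,
                       n_informative=15, noise=0.1, random_state=1)

# voting ensemble of CART models with max_depth 1-5
models = [("cart%d" % d, DecisionTreeRegressor(max_depth=d)) for d in range(1, 6)]
ensemble = VotingRegressor(estimators=models)

# fit on all available data, then predict a single row
ensemble.fit(X, y)
row = X[:1]  # stand-in for a new, unseen row
yhat = ensemble.predict(row)
print("Predicted Value: %.3f" % yhat[0])
```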
Running the example fits the voting ensemble model on the entire dataset
and then uses it to make a prediction on a new row of data, as we might
when using the model in an application.
Predicted Value: 141.319
Unit Structure
14.1 Deploy your Machine Learning Models
14.1.0 How to deploy machine learning models
14.1.1 Test and clean code ready for deployment
14.1.2 Prepare the model for container deployment
14.1.3 Beyond machine learning deployment
14.1.4 Challenges for machine learning deployment
14.2 Ways to Deploy Machine Learning Models in Production
14.2.1 To create a machine learning web service, you need at least
three steps
14.2.2 Deploying machine learning models for batch prediction
14.2.3 Deploying machine learning models on edge devices as
embedded models
Video Lectures
14.1 DEPLOY YOUR MACHINE LEARNING MODELS [12]
Machine learning deployment is the process of deploying a machine
learning model in a live environment. The model can be deployed across a
range of different environments and will often be integrated with apps
through an API. Deployment is a key step in an organisation gaining
operational value from machine learning.
Machine learning models will usually be developed in an offline or local
environment, so will need to be deployed to be used with live data. A data
scientist may create many different models, some of which never make it
to the deployment stage. Developing these models can be very resource
intensive. Deployment is the final step before an organisation can start
generating a return on that investment.
However, deployment from a local environment to a real-world
application can be complex. Models may need specific infrastructure and
will need to be closely monitored to ensure ongoing effectiveness. For this
reason, machine learning deployment must be properly managed so it’s
efficient and streamlined.
This guide explores the basic steps required for machine learning
deployment in a containerised environment, the challenges organisations
may face, and the tools available to streamline the process.
14.1.0 How to deploy machine learning models [12] :
Machine learning deployment can be a complex task and will differ
depending on the system environment and type of machine learning
model. Each organisation will likely have existing DevOps processes that
may need to be adapted for machine learning deployment. However, the
general deployment process for machine learning models deployed to a
containerised environment will consist of four broad steps.
The four steps to machine learning deployment include:
● Develop and create a model in a training environment.
● Test and clean the code ready for deployment.
● Prepare for container deployment.
● Plan for continuous monitoring and maintenance after machine
learning deployment.
Create the machine learning model in a training environment
Data scientists will often create and develop many different machine
learning models, of which only a few will make it into the deployment
phase. Models will usually be built in a local or offline environment, fed
by training data. There are different types of machine learning processes
for developing different models. These will differ depending on the task
the algorithm is being trained to complete. Examples include supervised
machine learning in which a model is trained on labelled datasets or
unsupervised machine learning where the algorithm identifies patterns and
trends in data.
Organisations may use machine learning models for a range of reasons.
Examples include streamlining monotonous administrative tasks, fine-tuning
marketing campaigns, driving system efficiency, or completing the
initial stages of research and development. A popular use is the
categorisation and segmentation of raw data into defined groups. Once the
model is trained and performing to a given accuracy on training data, it is
ready to be prepared for deployment.
14.1.1 Test and clean code ready for deployment [12] :
The next step is to check if the code is of sufficient quality to be deployed.
This is to ensure the model functions in a new live environment, but also
so other members of the organisation can understand the model’s creation
process. The model is likely to have been developed in an offline
environment by a data scientist. So, for deployment in a live setting the
code will need to be scrutinised and streamlined where possible.
Accurately explaining the results of a model is a key part of the machine
learning oversight process. Clarity around development is needed for the
results and predictions to be accepted in a business setting. For this
reason, a clear explanatory document or ‘read me’ file should be created.
There are three simple steps to prepare for deployment at this stage:
● Create a ‘read me’ file to explain the model in detail ready for
deployment by the development team.
● Clean and scrutinise the code and functions and ensure clear naming
conventions using a style guide.
● Test the code to check if the model functions as expected.
14.1.2 Prepare the model for container deployment [12] :
Containerisation is a powerful tool in machine learning deployment.
Containers are a well-suited environment for machine learning deployment
and can be described as a kind of operating system virtualisation. It’s a
popular environment for machine learning deployment and development
because containers make scaling easy. Containerised code also makes
updating or deploying distinct areas of the model straightforward. This
lowers the risk of downtime for the whole model and makes maintenance
more efficient.
The containers contain all elements needed for the machine learning code
to function, ensuring a consistent environment. Numerous containers will
often make up machine learning model architecture. Yet, as each container
is deployed in isolation from the wider operating system and
infrastructure, it can draw resources from a range of settings including
local and cloud systems. Container orchestration platforms like
Kubernetes help with the automation of container management such as
monitoring, scheduling, and scaling.
14.1.3 Beyond machine learning deployment [12] :
Successful machine learning deployment is more than just ensuring the
model is initially functioning in a live setting. Ongoing governance is
needed to ensure the model is on track and working effectively and
efficiently. Beyond the development of machine learning models,
establishing the processes to monitor and deploy the model can be a
challenge. However, it’s a vital part of the ongoing success of machine
learning deployment, and models can be kept optimised to avoid data drift
or outliers.
Once the processes are planned and in place to monitor the machine
learning model, data drift and emerging inefficiencies can be detected and
resolved. Some models can also be regularly retrained with new data to
avoid the model drifting too far from the live data. Considering the model
after deployment means machine learning will be effective in an
organisation for the long term.
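Data drift monitoring can be illustrated with a small sketch. The approach used here, a two-sample Kolmogorov-Smirnov test per feature via scipy.stats.ks_2samp, is one common choice and is an assumption, not the method prescribed by the text. The function name drifted_features and the threshold alpha are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(train, live, alpha=0.01):
    """Return indices of columns whose live distribution differs from training."""
    flagged = []
    for j in range(train.shape[1]):
        # two-sample Kolmogorov-Smirnov test on one feature at a time
        result = ks_2samp(train[:, j], live[:, j])
        if result.pvalue < alpha:
            flagged.append(j)
    return flagged

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(2000, 3))  # data seen during training
live = rng.normal(0.0, 1.0, size=(2000, 3))   # data seen in production
live[:, 2] += 1.5                             # simulate a shift in the third feature
print(drifted_features(train, live))
```

A flagged feature is a signal to investigate or retrain, not proof that model accuracy has dropped.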
14.1.4 Challenges for machine learning deployment [12] :
The training and development of machine learning models is usually
resource-intensive and will often be the focus of an organisation. The
process of machine learning deployment is also a complex task and
requires a high degree of planning to be effective. Taking a model
developed in an offline environment and deploying it in a live
environment will always bring unique risks and challenges. A major
challenge is bridging the gap between data scientists who developed the
model and the developers that will deploy the model. Skillsets and
expertise may not overlap in these distinct areas, so efficient workflow
management is vital.
Machine learning deployment can be a challenge for many organisations,
especially if infrastructure must be built for deployment. Considerations
around scaling the model to meet capacity add another layer of
complexity. The effectiveness of the model itself is also a key challenge.
Ensuring results are accurate with no bias can be difficult. After machine
learning deployment, the model should be continuously tested and
monitored to drive improvements and continuous optimisation.
The main challenges for machine learning deployment include [12]:
● A lack of communication between the development team and data
scientists causing inefficiencies in the deployment process.
● Ensuring the right infrastructure and environment is in place for
machine learning deployment.
● The ongoing monitoring of model accuracy and efficiency in a real-world
setting can be difficult but is vital to achieving optimisation.
● Scaling machine learning models from training environment to real-world
data, especially when capacity needs to be elastic.
● Explaining predictions and results from a model so that the algorithm
is trusted within the organisation.
Products for streamlining machine learning deployment
Planning and executing machine learning deployment can often be a
complex task. Models need to be managed and monitored to ensure
ongoing functionality, and initial deployment must be expertly planned for
peak efficiency. Products like Seldon Deploy provide all the elements for
a successful machine learning deployment, as well as the insight tools
needed for ongoing maintenance.
The platform is language-agnostic, so it is prepared for any model
developed by a development team. It can easily integrate deployed
machine learning models with other apps through API connections. It’s a
platform for collaboration between data scientists and the development
team, helping to simplify the deployment process.
Seldon Deploy features for machine learning deployment include [12]:
● Workflow management tools to test and deploy models and make
planning more straightforward.
● Integration with Seldon Core, a platform for containerised machine
learning deployment using Kubernetes. It converts machine learning
models in a range of languages ready for containerised deployment.
● Accessible analytics dashboards to monitor and visualise the ongoing
health of the model including monitoring data drift and detecting
● Innate scalability to help organisations expand to meet varying levels
of capacity, avoiding the risk of downtime.
● The ability to be installed across different local or cloud systems to fit
the organisation’s current system architecture.
14.2 WAYS TO DEPLOY MACHINE LEARNING MODELS IN PRODUCTION
Deploy ML models and make them available to users or other
components of your project [12].
Deploying machine learning models as web services [12] :
The simplest way to deploy a machine learning model is to create a web
service for prediction. In this example, we use the Flask web framework to
wrap a simple random forest classifier built with scikit-learn.
14.2.1 To create a machine learning web service, you need at least
three steps [12]:
The first step is to create a machine learning model, train it and validate its
performance. The following script will train a random forest classifier.
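A minimal sketch of such a training script is given here. The synthetic dataset from make_classification and the hyperparameters are assumptions chosen for illustration. The variable name classifier matches the one used in the persistence step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed synthetic training data; a real project would load its own dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)

# Train the random forest classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=1)
classifier.fit(X, y)

# Quick sanity check: predict the class of the first training row
print(classifier.predict(X[:1]))
```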
Model testing and validation are not included here to keep it simple. But
do remember those are an integral part of any machine learning project.
In the next step, we need to persist the model. The environment where we
deploy the application is often different from the one where we train the
model. Training usually requires a different set of resources, so this
separation helps organizations optimize their budget and efforts.
Scikit-learn models can be saved with Python-specific serialization (pickle
or joblib), which makes model persistence and restoration straightforward.
The following is an example of how we can store the trained model in a
pickle file.
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib
joblib.dump(classifier, 'classifier.pkl')
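Restoration in the serving environment is the mirror image of the dump call. The sketch below round-trips a small model through a temporary file. The dataset, the model size, and the temporary path are assumptions made so that the block runs on its own.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small assumed model so the example is self-contained
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
classifier = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# Persist the trained model to disk
path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
joblib.dump(classifier, path)

# Later, possibly on another machine: restore the model and use it
restored = joblib.load(path)
print(restored.predict(X[:3]))
```

The serving environment should use the same scikit-learn version that wrote the file, because pickled estimators are not guaranteed to load across versions.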
Finally, we can serve the persisted model using a web framework. The
following code creates a REST API using Flask. This file is hosted in a
different environment, often in a cloud server.
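A minimal sketch of such a Flask service follows. The factory name create_app and the "instances" key of the request body are hypothetical choices, and the model is passed in as an argument so the same code works with any fitted estimator.

```python
from flask import Flask, jsonify, request

def create_app(model):
    """Build a Flask app that serves predictions from a fitted model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Assumed request body: {"instances": [[f1, f2, ...], ...]}
        payload = request.get_json(force=True)
        predictions = model.predict(payload["instances"])
        # Convert the NumPy array to a plain list so it can be serialized
        return jsonify({"prediction": predictions.tolist()})

    return app

# To serve the persisted model on port 8080 (not executed here):
#   import joblib
#   create_app(joblib.load("classifier.pkl")).run(host="0.0.0.0", port=8080)
```

A client would then POST a JSON body such as {"instances": [[...feature values...]]} to the /predict route.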
The above code takes input in a POST request through
https://localhost:8080/predict and returns the prediction in a JSON response.
14.2.2 Deploying machine learning models for batch prediction [12]:
While online models can serve predictions on demand, batch predictions
are sometimes preferable.
Offline models can be optimized to handle a high volume of job instances
and run more complex models. In batch production mode, you don't need
to worry about scaling or managing servers either.
Batch prediction can be as simple as calling the predict function with a
data set of input variables. The following command does it.
prediction = classifier.predict(UNSEEN_DATASET)
Sometimes you will have to schedule the training or prediction in the
batch processing method. There are several ways to do this. My favorite is
to use either Airflow or Prefect to automate the task.
# Written against the Prefect 1.x API (Flow, max_retries, prefect.schedules)
import requests
from datetime import timedelta, datetime
import pandas as pd
from prefect import task, Flow
from prefect.schedules import IntervalSchedule

# timedelta(minutes=5) gives the 5-minute retry delay described in the docstring
@task(max_retries=3, retry_delay=timedelta(minutes=5))
def predict(input_data_path: str):
    """
    This task loads the saved model and the input data, and returns the prediction.
    If it fails, this task is retried 3 times at 5-minute intervals and then
    fails permanently.
    """
The above script schedules prediction on a weekly basis starting from 5
seconds after the script execution. Prefect will retry the tasks 3 times if
they fail.
However, building the model may require multiple stages in the batch
processing framework. You need to decide what features are required and
how you should construct the model for each stage.
Train the model on a high-performance computing system with an
appropriate batch-processing framework.
Usually, you partition the training data into segments that are processed
sequentially, one after the other. You can do this by splitting the dataset
using a sampling scheme (e.g., balanced sampling, stratified sampling) or
via some online algorithm (e.g., map-reduce).
The partitions can be distributed to multiple machines, but they must all
load the same set of features. Feature scaling is recommended. If you used
unsupervised pre-training (e.g., autoencoders) for transfer learning, you
must undo each partition.
After all the stages have been executed, you can predict unseen data with
the resulting model by iterating sequentially over the partitions.
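Iterating over partitions at prediction time can be sketched as below. The dataset, the model, and the helper name predict_in_partitions are assumptions for illustration. The unseen data is split into equal parts that are scored one after the other.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed data: the first 800 rows train the model, the rest play the unseen batch
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
classifier = RandomForestClassifier(n_estimators=20, random_state=1).fit(X[:800], y[:800])
unseen = X[800:]

def predict_in_partitions(model, data, n_partitions=4):
    # split the batch into partitions and score them sequentially
    parts = np.array_split(data, n_partitions)
    return np.concatenate([model.predict(part) for part in parts])

batch_predictions = predict_in_partitions(classifier, unseen)
print(batch_predictions.shape)
```

Because each partition is scored independently, the same loop can be distributed across workers as long as every worker loads the same model and feature set.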
14.2.3 Deploying machine learning models on edge devices as
embedded models [12]:
Computing on edge devices such as mobile and IoT has become very
popular in recent years. The benefits of deploying a machine learning
model on edge devices include, but are not limited to:
● Reduced latency, as the device is likely to be closer to the user than a
server far away.
● Reduced data bandwidth consumption, as we ship processed results back to
the cloud instead of raw data that requires big size and eventually more
Edge devices such as mobile and IoT devices have limited computation
power and storage capacity due to the nature of their hardware. We cannot
simply deploy machine learning models to these devices directly,
especially if our model is big or requires extensive computation to run
inference on them.
Instead, we should simplify the model using techniques such as
quantization and aggregation while maintaining accuracy. These
simplified models can be deployed efficiently on edge devices with
limited computation, memory, and storage.
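The idea behind quantization can be shown with a small NumPy sketch. It maps 32-bit float weights to 8-bit integers with an affine scale and zero point. The function names are hypothetical and this illustrates the principle only, not how any particular library implements it.

```python
import numpy as np

def quantize_int8(w):
    """Affine quantization of a float array to int8; returns (q, scale, zero_point)."""
    w64 = w.astype(np.float64)
    w_min, w_max = w64.min(), w64.max()
    # one int8 step corresponds to `scale` in float units
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant array: any scale works
    # choose zero_point so that w_min maps to -128
    zero_point = -128 - int(np.round(w_min / scale))
    q = np.clip(np.round(w64 / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # recover approximate float values from the int8 codes
    return (q.astype(np.float64) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

print("bytes before/after:", weights.nbytes, q.nbytes)
print("max abs error:", float(np.abs(weights - restored).max()))
```

Storage drops by a factor of four, and the reconstruction error stays on the order of one quantization step.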
We can use the TensorFlow Lite library on Android to simplify our
TensorFlow model. TensorFlow Lite is an open-source software library
for mobile and embedded devices that tries to do what the name says: run
TensorFlow models on mobile and embedded platforms.
The following example converts a Keras TensorFlow model.
REFERENCES
1. Quick Introduction to Boosting Algorithms in Machine Learning. -introduction-boosting-algorithms-machine-learning/. [Last Accessed on
2. Boosting Algorithms Explained. -algorithms-explained-d38f56ef3f30 [Last Accessed on 10.03.2022]
3. A Comprehensive Guide To Boosting Machine Learning Algorithms. -machine-learning/ [Last Accessed on 10.03.2022]
4. Essence of Boosting Ensembles for Machine Learning. https://machinel -of-boosting-ensembles-for-machine-learning/ [Last Accessed on 10.03.2022]
5. Boosting in Machine Learning | Boosting and AdaBoost. -vs-boosting-in-machine-learning/ [Last Accessed on 10.03.2022]
6. [Last Accessed on 10.03.2022]
7. Machine Learning Plus Platform. [Last Accessed on
8. Weights & Biases with Gradient. [Last Accessed on 10.03.2022]
9. Build a machine Learning Web App in 5 Minutes. [Last Accessed on 10.03.2022]
10. AdaBoost Algorithm. -algorithm/. [Last Accessed on 10.03.2022]
11. Implementing the AdaBoost Algorithm From Scratch. -the-adaboost-algorithm-from-scratch/?ref=gcse. [Last Accessed on 10.03.2022]
12. Optimisation algorithms for differentiable functions. -optimisation-for-machine-learning. [Last Accessed on 10.03.2022]
13. Quiz – Machine Learning. [Last Accessed on
TUTORIALS
1. How to Develop a Weighted Average Ensemble for Deep Learning
Neural Networks: -average-ensemble-for-deep-learning-neural-networks/
[Last Accessed on 10.03.2022]
2. How to Develop a Stacking Ensemble for Deep Learning Neural
Networks in Python With Keras: -ensemble-for-deep-learning-neural-networks/
[Last Accessed on 10.03.2022]
BOOKS
1. Schapire RE, Freund Y. Boosting: Foundations and algorithms.
Kybernetes. 2013 Jan 4.
2. Zhou ZH. Ensemble methods: foundations and algorithms. CRC
press; 2012 Jun 6.
3. Mohri M, Rostamizadeh A, Talwalkar A. Foundations of machine
learning. MIT press; 2018 Dec 25.
4. Data Mining: Practical Machine Learning Tools and Techniques,
MOOCS
Machine Learning: Classification. https://www. -
classification/boosting -rV0iX
Advanced Machine Learning and Signal Processing. -machine -learning -signal -
processing/boosting -and-gradient -boosted -trees -
8MEjw?redirectTo=%2Flearn%2Fadvanced -machine -learning -signal -
Boosting Machine Learning Models in Python. -machine -learning -models -in-
Boosting Algorithm in Python. https://python -
learning/boosting-algorithm-in-python.php
Gradient Boosting Algorithm. -boosting -
Learning: Boosting. -engineering -
and-computer -science/6 -034-artificial -intelligence -fall-2010/lecture -
videos/lecture -17-learning -boosting/.
Bagging and Boosting. -
for-free/courses/bagging -and-boosting.
APIS
1. Ensemble methods scikit-learn API.
2. sklearn.ensemble.VotingClassifier API.
3. sklearn.ensemble.VotingRegressor API.
VIDEO LECTURES
1. Boosting Machine Learning Tutorial | Adaptive Boosting, Gradient
Boosting, XGBoost | Edureka. [Last Accessed
on 10.03.2022]
2. A Quick Guide to Boosting in Machine Learning.
3. Introduction To Gradient Boosting algorithm (simplistic n graphical) -
Machine Learning.
4. Gradient Boosting In Depth Intuition - Part 1 Machine Learning. h?v=Nol1hVtLOSg
5. Gradient Boosting - Math Clearly Explained Step By Step | Machine
Learning Step By Step. -
6. Visual Guide to Gradient Boosted Trees (xgboost).
7. Xgboost Classification Indepth Maths Intuition - Machine Learning
8. Trevor Hastie - Gradient Boosting Machine Learning.
9. Machine Learning Lecture 32 "Boosting " -Cornell CS4780 SP17.
QUIZ
1. Ensemble learning can only be applied to supervised learning
A. True
B. False
2. Ensembles will yield bad results when there is significant diversity
among the models.
Note: All individual models have meaningful and good predictions.
A. true
B. false
3. Which of the following is / are true about weak learners used in
ensemble model?
1. They have low variance and they don’t usually overfit
2. They have high bias, so they cannot solve hard learning problems
3. They have high variance and they don’t usually overfit
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. none of these
4. An ensemble of classifiers may or may not be more accurate than any of
its individual models.
A. true
B. false
5. If you use an ensemble of different base models, is it necessary to
tune the hyperparameters of all base models to improve the ensemble
performance?
A. yes
B. no
C. can’t say
6. Generally, an ensemble method works better, if the individual base
models have ____________?
Note: Suppose each individual base model has accuracy greater
than 50%.
A. less correlation among predictions
B. high correlation among predictions
C. correlation does not have any impact on ensemble output
D. none of the above
7. In an election, N candidates are competing against each other and
people are voting for either of the candidates. Voters don’t
communicate with each other while casting their votes. Which of the
following ensemble methods works similarly to the above-discussed election
procedure?
Hint: Persons are like base models of ensemble method.
A. bagging
B. boosting
C. a or b
D. none of these
8. Suppose there are 25 base classifiers. Each classifier has an error rate of
e = 0.35.
Suppose you are using averaging as the ensemble technique. What will be
the probability that the ensemble of the above 25 classifiers will make a
wrong prediction?
Note: All classifiers are independent of each other
A. 0.05
B. 0.06
C. 0.07
D. 0.09
9. In machine learning, an algorithm (or learning algorithm) is said to be
unstable if a small change in training data causes a large change in
the learned classifiers. True or False: Bagging of unstable classifiers is
a good idea
A. true
B. false
10. Which of the following parameters can be tuned for finding good
ensemble model in bagging based algorithms?
1. Max number of samples
2. Max features
3. Bootstrapping of samples
4. Bootstrapping of features
A. 1 and 3
B. 2 and 3
C. 1 and 2
D. all of above
11. How is the model capacity affected by the dropout rate (where model
capacity means the ability of a neural network to approximate
complex functions)?
A. model capacity increases with an increase in dropout rate
B. model capacity decreases with an increase in dropout rate
C. model capacity is not affected by an increase in dropout rate
D. none of these
12. Dropout is computationally expensive technique w.r.t. bagging
A. true
B. false
13. Suppose, you want to apply a stepwise forward selection method for
choosing the best models for an ensemble model. Which of the
following is the correct order of the steps?
Note: You have predictions from more than 1000 models
1. Add the model predictions (or in other words, take the average)
one by one to the ensemble whenever this improves the metrics on the
validation set.
2. Start with an empty ensemble
3. Return the ensemble from the nested set of ensembles that has
maximum performance on the validation set
A. 1-2-3
B. 1-3-4
C. 2-1-3
D. none of above
14. Suppose, you have 2000 different models with their predictions and
want to ensemble predictions of best x models. Now, which of the
following can be a possible method to select the best x models for an
ensemble?
A. step wise forward selection
B. step wise backward elimination
C. both
D. none of above
15. Below are the two ensemble models:
1. E1(M1, M2, M3) and
2. E2(M4, M5, M6)
Above, Mx are the individual base models.
Which of the following are you more likely to choose if the following
conditions for E1 and E2 are given?
E1: Individual model accuracies are high but the models are of the same type,
or in other words less diverse
E2: Individual model accuracies are high but the models are of different types,
or in other words highly diverse in nature
A. e1
B. e2
C. any of e1 and e2
D. none of these
16. In boosting, individual base learners can be parallel.
A. true
B. false
17. Which of the following is true about bagging?
1. Bagging can be parallel
2. The aim of bagging is to reduce bias not variance
3. Bagging helps in reducing overfitting
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of these
18. Suppose you are using stacking with n different machine learning
algorithms with k folds on data.
Which of the following is true about one level (m base models + 1
stacker) stacking?
Note: Here, we are working on binary classification problem
All base models are trained on all features
You are using k folds for base models
A. you will have only k features after the first stage
B. you will have only m features after the first stage
C. you will have k+m features after the first stage
D. you will have k*n features after the first stage
19. Which of the following is the difference between stacking and
blending?
A. stacking has less stable cv compared to blending
B. in blending, you create out of fold prediction
C. stacking is simpler than blending
D. none of these
20. Which of the following can be one of the steps in stacking?
1. Divide the training data into k folds
2. Train k models on each k-1 folds and get the out of fold predictions
for remaining one fold
3. Divide the test data set in “k” folds and get individual fold
predictions by different algorithms
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of above
21. Which of the following are advantages of stacking?
1) More robust model
2) better prediction
3) Lower time of execution
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of the above
22. Which of the following are correct statement(s) about stacking?
1. A machine learning model is trained on predictions of multiple machine
learning models
2. A Logistic regression will definitely work better in the second stage as
compared to other classification methods
3. First stage models are trained on full / partial feature space of training data
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of above
23. Which of the following is true about weighted majority votes?
1. We want to give higher weights to better performing models
2. Inferior models can overrule the best model if collective weighted
votes for inferior models is higher than best model
3. Voting is special case of weighted voting
A. 1 and 3
B. 2 and 3
C. 1 and 2
D. 1, 2 and 3
24. Which of the following is true about averaging ensemble?
A. it can only be used in classification problem
B. it can only be used in regression problem
C. it can be used in both classification as well as regression
D. none of these
25. How can we assign the weights to output of different models in an
ensemble?
1. Use an algorithm to return the optimal weights
2. Choose the weights using cross validation
3. Give high weights to more accurate models
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. all of above