## Page 1

1

UNIT I

1

APPLIED MATH AND MACHINE

LEARNING BASICS

Unit Structure

1.0 Objectives

1.1 Introduction and Overview of Applied M ath and Machine Learning

Basics

1.2 Linear Algebra

1.2.1 Scalars, Vectors, Matrices and Tensors

1.2.2 Multiplying Matrices and Vectors

1.2.3 Identity and Inverse Matrices

1.2.4 Linear Dependence and Span

1.2.5 Norms

1.2.6 Special Kinds of Matrices and Vectors

1.2.7 Eigende composition

1.3 Summary

Unit End Exercises

Bibliography

1.0 OBJECTIVES

This chapter will help to you understand the:

Concepts of basic Mathematics useful in Machine learning and deep

Learning

Basic concepts related scalar, vectors, matrix and tensor

Different types of matrix and its operations

Decomposition of matrix

1.1 INTRODUCTION AND OVERVIEW OF APPLIED MATH AND MACHINE LEARNING BASICS

This section of the book tells some basic mathematical concepts which

helps to understand the deep Learning. Deep learning is a subdomain of

machine learning which is concerned with algorithms, mathematical

functions and artificial neural network.

munotes.in

## Page 2

2

1.2 LINEAR ALGEBRA

Linear Algebra is one of the widely used branches of mathematics related

with mathematical structures which is continuous rather than discrete

mathematics . It includes operations like addition, scalar multiplication that

helps to understand concepts like linear transformations , vector spaces ,

linear equations , matrices and determinants . A good knowledge of Linear

Algebra is important to understand and worki ng with many essential

machine learning algorithm s, especially algorithms related with deep

learning.

1.2.1 Scalars, Vectors, Matrices and Tensors :

Let‟s start with some basic definitions:

(1.1)

Understanding of a scalar, a vector, a matrix, a tensor

Scalar: A scalar is a single number represent 0th order tensor . Scalars are

written in low ercase and italics. For Example: n. there is many different

sets of numbers with interest in deep learning. The notation x ∈ ℝ

represents x is a scalar belonging to a real values numbers i.e. ℝ. ℕ states

the set of positive integers (1, 2, 3, …). ℤ states the integers, which is

combination of positive , negative and zero values . Rational numbers are

representing by notation ℚ

Python code to explains arithmetic operations on Scalars :

Code: # In-Built Scalars a = 15 b = 3 print(type(a)) print(type(b)) print(a + b) print(a - b) print(a * b) print(a / b) Output: 15

munotes.in

## Page 3

3

12 45 5 # Python code snippet checks if the given variable is scalar or not.

import numpy as np

# Is Scalar Function

def isscalar(num):

if isinstance(num, generic):

return True

else:

return False

print(np.isscalar(2.4))

print(np.isscalar([4.1]))

print(np.isscalar (False))

Output:

True

False

True

Vector: A vector is an ordered array of single numbers represent 1st order

tensor. Vectors should be written in lowercase, bold, and italics . For

Example : x.

x =[x1 x2 x3 … xn] or x = [

] (1.2)

Here, first element of x is x1, the next second element is x2 and so on

element gets listed. To understand the necessary element of a vector index

wise, the scalar element of a vector positioned ith is written as x[i].

Suppose S={1,2,4} then x1=1, x2=2, x4=4. The – sign to index is used to

indicate the complement of a S, like for example x-1 is the vector

consisting of all elements of x except x1, and x-s is the vector consist of all

elements of x except x1 , x2 and x4 . Vectors are pieces of objects known as

vector spaces , which can be considered as collection of all possible vectors

of a particular dimension.

#Python code demonstrating Vectors import numpy as np # Declaring Vectors x = [2, 4, 6] y = [3, 5, 1] print(type(x)) # Vector addition using Numpy munotes.in

## Page 4

4

z = np.add(x, y) print(z) print(type(z)) Output: [2, 4, 6, 3, 5, 1] [5 9 7]

Matrix: Matrix is a 2 -D rectangular array consisting of numbers

represents 2nd order tensor. Matrices should be written in uppercase,

bold and italics . For Example: X. If p and q are positive integers, that

is p, q ∈ ℕ then the p×q matrix contains p*q numbers, with p rows

and q columns

A = [

] (1.3)

Full matrix component can be express as follows:

A = (1.4)

Some of the operations of matrices are as follows :

Matrix Addition :

We can do addition of Matrices to scalars, vectors and other matrices.

These precise techniques are often used in machine learning and deep

learning.

Matrix -Matrix Addition :

Z=X+Y

# Python code for Matrix Addition import numpy as np

x = np.matrix([[5, 3], [2, 6]])

sum = x.sum()

print(sum)

Output: 16 munotes.in

## Page 5

5

Here , when shapes of two matrices are equal addition is possible otherwise

not.

Matrix -Scalar Addition :

Here, addition of given scalar with all elements of matrix takes place.

Matrix Scalar Multiplication :

Here, Multiplication of scalar element with matrix element takes place

# Python code for Matrix Scalar Multiplication import numpy as np x = np.matrix([[2, 3], [3, 4]]) s_mul = x * 3

# Python code for Matrix-Matrix Addition import numpy as np

p = np.matrix([[1, 1], [2, 2]])

q = np.matrix([[3, 3], [4, 4]])

m_sum = np.add(p, q)

print(m_sum)

Output :

[[ 4 4] [ 6 6]]

# Python code for Matrix -Scalar Addition

import numpy as np

x = np.matrix([[1, 1], [3, 3]])

s_sum = x + 1

print(s_sum)

Output:

[[2 2]

[4 4]] munotes.in

## Page 6

6

print(s_mul) Output: [[ 6 9] [9 12]]

Matrix Transpose :

With transpose operation horizontal row vector can be convert ed into

vertical column vector and vice versa.

(1.5)

(1.6)

A= (1.7)

AT =

(1.8)

# Python code for Matrix Transpose import numpy as np a = np.array([[2, 3], [5, 7]]) print(a) a.transpose() print(a) Output: [[2 3] [5 7]] array([[2, 5], [3, 7]])

zyxzyxmmaaaaaaaaaaaaaaaaaa

TTijTjircccrrTrcrrcc

][,635241 654321 :examplesi.e.,,212221212111212222111211

MMmunotes.in

## Page 7

7

Tensors: Sometimes we need an array with more than 2 ax is; a tensor is

an array of numbers arranged on a grid. It wraps scalar, vector and matrix.

We can represent tensor name “A” as A. we can access element of A at

coordinates (i, j, k) by writing

1.2.2 Multiplying Matrices and Vectors :

When A is a m × n matrix & B is a k × p matrix, AB is only possible if

n=k. The result will be an m ×p matrix.

Simple to know we can perform A*B IF: Number of columns in A =

Number of rows in B

A*B =

Another example:

The dot product operation is defined as:

= = (1.9)

Need to understand product of two matrices is not just a matrix having

individual elements in that some operation are exist that called element

wise product or hadmard product which is denoted as A

B.

Properties of the dot product:

Simplification of the matrix product

- (AB)T=BTAT (1.10)

2015161316114520351044213411432233124321504132

1812615105128436261635251534241432165432635241654321654,321BAABBAmunotes.in

## Page 8

8

Matrix multiplication is NOT commutative means order of

multiplication is important

- AB≠BA (1.11)

Matrix multiplication IS associative

- A(BC)=(AB)C (1.12)

Matrix multiplication IS distributive

- A(B+C)=AB+AC (1.13)

- (A+B)C=AC+BC (1.14)

1.2.3 Identity and Inverse Matrices :

Identity Matrix: An i dentity matrix is a square matrix where value of

diagonal elements is one and rest of matrix elements values are equal to

zero. The product is the matrix A and identi ty matrix is matrix A itself.

Another name for Identity Matrix is Unit Matrix or Elementary Matrix .

Identity Mat rix is denoted with the letter n, where n n represents the

order of the matrix .

IA = A and AI = A (1.15)

AI = IA = A (1.16)

One of the important properties of identity matrix is: A = A, where A

is any square matrix of order n n

= , =[

], =[

]

Square matrix : A matrix with the same number of rows and columns is

called a square matrix .

Example: [

]

Note: An identity matrix is a perpetually square matrix .

Inverse Matrix : Let „A‟ be any square matrix. An inverse matrix of „ A‟ is

denoted by „ A-1‟ and is such a matrix that AA-1=A-1A= n. if we get A-1 of

Matrix „A‟ then it is known as invertible. Non -square matrices do not have

inverses . It is not necessary all square matrices have inverses. If a square

matrix has an inverse then it is known as invertible or non-singular

munotes.in

## Page 9

9

Example

A= [

] and its inverse A-1= [

]

A A-1= [

][

]= [

]

A-1 A= [

][

]= [

]

Here, we can say that [

] and [

]are inverses of each other .

1.2.4 Linear Dependence and Span :

Suppose we have Ax = b, to get A-1 here has exactly one solution for every

value of b. There is a possibility, to have no solution or infinitely many

solutions for the systems of values b. To understand and analyse what kind

of different solutions the equation has we can consider columns of A

which specifies different direction can travel from origin and determine

number of ways reaching towards b.

A linear combination of a list ( ⃗⃗⃗⃗ ⃗⃗⃗⃗ ) is some set of vector

of the form which is given by multiplying each vector ( ) by respective

scalar coefficient and adding the results.

( )

(1.17)

Given a set of vectors { ⃗⃗⃗⃗ ⃗⃗⃗⃗ } in a vector space V , any vector

of the form

v = ⃗⃗⃗⃗ ⃗⃗⃗⃗ ⃗⃗⃗⃗⃗ for some scalars a 1, a2, . . ., a k is called a

linear combination of v 1, v1, . . . , v k.

The set of all points obtainable by linear combination of the original

vectors is called Span . In other words Span (( ⃗⃗⃗⃗ ⃗⃗⃗⃗ ∈

ℝ)

Now determining Ax = b need to test whether b is in the span of columns

of A, this particular span is known as column space or the range of A.

Linear Independence : Given a set of vectors { ⃗⃗⃗⃗ ⃗⃗⃗⃗ } in a

vector space V, they are said to be linearly independent if the equation

⃗⃗⃗⃗ ⃗⃗⃗⃗ ⃗⃗⃗⃗⃗ = 0 has only the trivial solution.

If ( ⃗⃗⃗⃗ ⃗⃗⃗⃗ ) are not linearly independent they are linearly

dependent.

munotes.in

## Page 10

10

1.2.5 Norms :

Norm is a function that in return measures length/size of any vector

(excluding zero vectors ). In machine learning multiple times we measure

the size of vectors by using norms. The norms are clubbed under p -norms

or (lₚ-norms) family, where p is arbitrary number greater than or equal to

1.

The p -norm of vector x can be presented as

(1.17)

Every element of vector x is has the power p. Then their sum is raised to

the power ( ⁄)

Simply represented as ,

(1.18)

Any norm function f should be satisfied following conditions:

1. If norm of x is greater than Zero then x is not equal to 0 (Zero V ector)

and if norm is equal to Zero then x is a zero vector.

If f (x) > 0 then x 0 (1.19)

If f (x) = 0 then x = 0 (1.20)

2. For any scalar quantity, say K

f (Kx) = K f(x) (1.21)

3. Suppose we have another vector y

f (x + y) f (x) + f (y) (1.22)

If above three properties are satisfied then function f is norm .

Manhattan Distance (1 -norm) :

Manhattan distance also called 1 norm it measure the distance between

two points that can travel along orthogonal blocks.

Example:

= [2,4]

munotes.in

## Page 11

11

Infinity -norm :

The infinity -norm also called max -norm which returns absolute value in

the given vector.

Suppose, we want to find infini ty-norm of another vector, like a

= [2,4, -5]

= 5

Euclidean Norm (2 -norm) :

One of the most used Norms is 2 -norm, used to calculate magnitude of

vector.

| | = ( ) ⁄

= ( ) ⁄

= √

1.2.6 Special Kinds of Matrices and Vectors :

Some of the matrices:

Diagonal Matrix : Diagonal matrix is a matrix Denoted by D where

diagonal element values are nonzero and other entries of matrix are

zero.

Example:

D = [

]

As a Vector it will be represented as,

d= (d11, d22, d33)

With scalar values it can be represented as follows:

d = (1, 2, 3)

munotes.in

## Page 12

12

#Python code for diagonal matrix from numpy import array from numpy import diag A = array([[1, 2, 3], [1, 2, 3], [1, 2, 3]]) print(A) # Print diagonal vector d = diag(A) print(d) # create diagonal matrix from vector D = diag(d) print(D) Output: [[1 2 3] [1 2 3] [1 2 3]] [1 2 3] [[1 0 0] [0 2 0] [0 0 3]]

It is not compulsory diagonal matrix is square, there is possible to

construct rectangular diagonal matrix as well. We can use Diagonal matrix

in machine learning to obtain less expensive alg orithm.

Triangular Matrix: A triangular matrix is a type of square matrix

where values are filled in the upper -right or lower -left of the matrix

with the remaining elements of the matrix are filled with zero values. A

triangular matrix with filled some values which lie above the main

diagonal is called an upper triangular matrix. Whereas, a triangular

matrix with filled values which lie below the main diagonal is called a

lower triangular matrix.

Example of a 3×3 upper triangular matrix

A = [

]

Example of a 3×3 lower triangular matrix

D = [

]

munotes.in

## Page 13

13

# Python code for triangular Matrices from numpy import array from numpy import tril from numpy import triu A = array([[1, 2, 3], [1, 2, 3], [1, 2, 3]]) print(A) lower = tril(A) print(lower) upper = triu(A) print(upper) Output: [[1 2 3] [1 2 3] [1 2 3]] [[1 0 0] [1 2 0] [1 2 3]] [[1 2 3] [0 2 3] [0 0 3]]

Symmetric Matrix : Symmetric matrix is a matrix where top -right

triangle is the same as the bottom -left triangle.

Example:

A= [

]

Transpose of Symmetric Matrix is equal to original Symmetric Matrix .

A= AT (1.23)

Orthogonal Matrix : Orthogonal matrix is a matrix if a dot product of

two vectors is equals to zero, and then is called Orthogonal . It is a type

of square matrix whose columns and rows are orthonormal unit vectors,

e.g. it is perpendicular and have a length or magnitude of One . Here

rows are mutually orthonormal and columns are mutually orthonormal.

AT A = AAT = I (1.24)

munotes.in

## Page 14

14

# python code for orthogonal atrix from numpy import array from numpy.linalg import inv Q = array([[1, 0], [0, -1]]) print(Q) # inverse equivalence V = inv(Q) print(Q.T) print(V) # identity equivalence I = Q.dot(Q.T) print(I) Output: [[ 1 0] [ 0 -1]] [[ 1 0] [ 0 -1]] [[ 1. 0.] [-0. -1.]] [[1 0] [0 1]]

1.2.7 Eigende composition :

We usually used to see multiple mathematical structures which are

complex to understand, so need to break it with some simplified form and

understand useful properties for the same. We can decomposed integers

with some prime factors, same way we can decompose matrix.

To make complex operations into simple structure m atrix decompositions

are very useful. Matrix decomposition tools helps for reducing a matrix to

their constituent parts .

Eigende composition of a matrix is a widely used matrix decomposition

which involves decomposition of a square matrix into a set of eigenvectors

and eigenvalues. This kind of decomposition also helps in machine

learning like Principal Component Analysis method .

A vector is an eigenvector of a matrix if it fulfils the following equation.

Av = λ (λ=lambda ) (1.25)

munotes.in

## Page 15

15

The equation is called the eigenvalue equation, where A is the p arent

square matrix that we decompose, v is the non-zero eigenve ctor of the

matrix, and lambda represents the eigenvalue scalar . We can also find left

eigenvector like vTA = λ vT, but we normally use right eigenvector.

It is possibility matrix can have one eigenvector and eigenvalue for each

aspect of the parent matrix. Not necessary all square matrices can be

decomposed into eigenvectors and eigenvalues , some may be decomposed

in a way that requires complex numbers.

The parent matrix would be presented as to be a product of the

eigenvectors and eigenvalues

A = V diag(λ)V-1 (1.26)

Here, V is a matrix comprised of the eigenvectors, diag( λ) is a diagonal

matrix comprised of the eigenvalues along the diagonal , V-1 is the inverse

of the matrix comprised of the eigenvectors .

Eigen is not a name it is pronounced as “eye-gan” is a German wo rd that

means “own” as in belonging to the parent matrix.

Decomposition does not mean compression of the m atrix perhaps it breaks

it down into constituent parts to make certain operations on the matrix

easier to perform .

The eigendecomposition of a matrix gives many useful facts about the

matrix. If the eigen values are zero then t he matrix is singular.

Eigenvectors and Eigenvalues : Eigenvectors are unit vectors that mean

their length or magnitude is equal to one . They are often known as right

vectors, which simply mean a column vector (as opposed to a row vector

or a left vector).

A matrix contains only positive eigen values is known as a positive

definite matrix , and if it contains all negative eigenvalues, it is known as

a negative definite matrix .

Eigendecomposition Calculation :

# Python code for eigendecomposition from numpy import array from numpy.linalg import eig # Define matrix A = array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) print(A) # Calculate eigendecomposition val, vect = eig(A) munotes.in

## Page 16

16

print( val)

print( vect)

Output:

[[1 2 3]

[4 5 6]

[7 8 9]]

[ 1.61168440e+01 -1.11684397e+00 -9.75918483e -16]

[[-0.23197069 -0.78583024 0.40824829]

[-0.52532209 -0.08675134 -0.81649658]

[-0.8186735 0.61232756 0.40824829]]

1.3 SUMMARY

This chapter includes basic concepts and differentiation related to scalar,

vector, matrix and tensor which is mostly used in machine learning as well

as in deep learning. This chapter explain Linear Algebra concepts broadly.

Different types of Matrix, Operation explain with python programs.

Specifically learned:

- Basic difference between scalar, vector, matrix and tensor.

- Matrix types, different operations related with matrices in Python with

NumPy .

- Eigendecomposition and the role of eigenvectors and eigenvalues .

UNIT END EXERCISES

1. Differentiate between scalar and vector

2. Explain Arithmetic operations on scalar with example.

3. What is Matrix and explain matrices addition

4. Explain Matrix transpose with an example.

5. Write Properties of dot products.

6. Explain Inverse matrix with an example

7. Write a short note on Norms.

8. Explain Eigenvectors and Eigenvalues.

9. Compare Symmetric Matrix and Orthogon al Matrix

10. Write a note on linear combination

BIBLIOGRAPHY

1. Ian Goodfellow, Yoshua Bengio, Aaron Courvile “ Deep

Learning”,1st edition 2016 munotes.in

## Page 17

17

2. https://hadrienj.github.io/posts/Deep -Learning -Book -Series -2.1-

Scalars -Vectors -Matrices -and-Tensors/

3. https://hadrienj.github.io/posts/Deep -Learning-Book -Series -2.2-

Multiplying -Matrices -and-Vectors/

4. https://medium.com/linear -algebra/part -18-norms -30a8b3739bb

*****

munotes.in

## Page 18

18

2

NUMERICAL COMPUTATION

Unit Structure

2.0 Objectives

2.1 Introduction to Numerical Computation

2.2 Overflow and Underflow

2.3 Poor Conditioning

2.4 Gradient Based Optimization

2.5 Constraint optimization

2.6 Summary

Unit End Exercises

Bibliography

2.0 OBJECTIVES

After going through this unit, you will be able to:

Define Overflow and Underflow techniques

Clear Concepts like Poor Conditioning, Gradient based optimization

Different ways of Constraint Optimization

2.1 INTRODUCTION TO NUMERICAL COMPUTATION

A lot amount of computation is needed in machine learning algorithms to

solve complex and vital problems. Here need to follow iterative process

where mathematical methods used to get best solutions. Numerical

Computation needed large number of arithmetic calculation hence

required efficient and fast computational devices. The approach is to

formulate mathematical model and solve problems with arithmetic

calculations. .

2.2 OVERFLOW AND UNDERFLOW

Difficulties which are faced going from mathematics to computers are

representing infinitely many real number on less or finite memory. This

means that every calculations experience some approximation error. One

of them is rounding error, where undergoes many operations, but

algorithms which in work get failed due to not designing of minimum

accumulation to the rounding error. In computers numbers are stored as

discrete digits, so when we used to make arithme tic calculations which

result in extra digits that result we cannot append into main result and face

overflow or underflow error .

munotes.in

## Page 19

19

Underflow: Underflow is a situation it happens in a computer or similar

device s when a result of mathematical operation is sm aller than the device

is capable of storing. It is kind of trickier to recognize because it has to

work with precision in floating points. Here, cause of discrete nature of

storage capacity in computers, we cannot store any small number either.

The floatin g-point format comes up with some techniques to represent

fractional numbers. When we implement these in calculations that result in

a smaller number than our desire least value . This happens when numbers

close to 0 are rounded to 0.

Like, suppose we want to perform a calculation, 0.005×0.005. The answer

to this is 0.000025, but we don’t have this many decimal places available.

Here , we discard the least -signi ficant bits and store 0.000 , which is

absolute an erroneous answer. Underflow is representational error and

occurs mostly when dealing with decimal arithmetic. It also happens when

two negative numbers are added and the result is out of range for the

device to store

Overflow: Overflow shows that we have done a calculation , the result of

that is so larger than we can hold and represent. Like, the processor is

running an operation as an increment but the operand having same

capacity, it cannot hold the result. The whole issue is occurs cause of the

memory size in computers. In mathematics, numbers are absolute and

there’s no concept of precision. But in case of computers, due to hardware

limitations and especially in memory size, real numbers are rounded to the

nearest value.

Suppose we have an integer stored in 1 byte. We can store greatest number

that is 255 in one byte, i.e. 11111111. Now, Let’s add 2 to it to get

00000010. The result is 257, which is 100000001 . The result has 9 bits,

whereas the integers we are working with consist of only 8 .

One of the function that m ust be stabilized against underflow and overflow

name is the softmax function. The SoftMax Function is used to predict the

probabilities associated with a Multinoulli Distribution

(2.1)

If , then we can show that . But if we want to

calculate the same value in computers numerically, depending on the value

of you could get different results.

If is a big positive number then will be overflow and the softmax will

be undefined and return NaN . And if is a big negative number will be

a underflow and softmax will become zero, resulting in “division by zero”.

The solution to both of these problems is a technique called stabilization .

munotes.in

## Page 20

20

Unfortunately, stabilization is not a complete solution. It is used to sol ve

mathematical problem analytically. As for the softmax given above, let’s

define as follow:

(2.2)

is a new vector where all the elements of are subtracted by its biggest

element. Doing so will result in a vector where at least one of its elements

is zero ( ‘s biggest element). While analytically you can show

that , the problem of overflow is solved

as well. At the same time, the problem of underflow and division by zero

is fixed too. That’s because at least one element of the denominator is one

(the biggest element which is now zero). So there’s no way division by

zero could happen now. This process is called stabilization.

2.3 POOR CONDITIONING

When function changes with respect to small changes in its inputs it’s

called Conditioning. Functions that changes rapidly when their inputs are

unsettle slightly can be problematic for scientific computation because

rounding errors in the inputs can result in big changes in the output.

Consider the function on vector x:

(2.3)

Assuming:

(2.4)

Has eigenvalue decomposition, its condition number is:

(2.5)

This ratio is the magnitude of the largest and smallest eigenvalues. When

this number is big, matrix inversion is particularly sensitive to error in the

input.

2.4 GRADIENT BASED OPTIMIZATION

Most Machine learning , deep learning algorithms involve optimization,

Optimization presents to the task of either minimizing or maximizing

some function f(x) by altering x. The function which needs to optimize is

the Objective Function, sometimes called the Criterion .

munotes.in

## Page 21

21

We denote the value that minimizes or maxi mizes a function with a

superscript *.

For a single variable function:

(2.6)

For a multi variable function:

(2.7)

• Minimize/maximize a function f (x) by altering x

– Usually stated a minimization

– Maximization accomplished by minimizing –f(x)

• f (x) referred to as objective function or criterion

– In minimization also referred to as loss function cost, or error

– Example is linear least squares

– Denote opti mum value by x*=arg min f (x)

Calculus in Optimization :

Suppose function y=f (x), x, y real numbers, Derivative of function

denoted: f’(x) or as dy/dx

• Derivative f’(x) gives the slope of f (x) at point x

• It specifies how to scale a small change in input to obtaina

corresponding change in the output: f (x + ε) ≈ f (x) + ε f’ (x)

– It tells how you make a small change in input to make a small

improvement in y

– We know that f (x - ε sign (f’(x))) is less than f (x) for small ε. Thus

we can reduce f (x) by moving x in small steps with opposite sign of

derivative, this technique is called gradient descent

Gradient Descent Illustrated:

munotes.in

## Page 22

22

Figure1 : An illustration of how the gradient descent algorithm uses the

derivatives of a function can be used to follow the function downhill to a

minimum

Source: Ian Goodfellow, Yoshua Bengio, Aaron Courvile “Deep

Learning”,1st edition 2016

For x>0, f(x) increases with x and f’(x)>0

• For x<0, f(x) is decreases with x and f’(x)<0

• Use f’(x) to follow function d ownhill

• Reduce f(x) by going in direction opposite sign of derivative f’(x)

Stationary points, Local Optima:

Here, When f’(x)=0 derivative provides no information about direction of

move .

Points where f’(x)=0 are known as stationary or critical points .

– Local minimum/maximum: a point where f(x) lower/higher than all its

neighbours

– Some critical points are neither maxima nor minima. These are known

as saddle points

Figure 2: Minimum, Maximum, Saddle Point

Source: Ian Goodfellow, Yoshua Bengio, Aaron Courvile “Deep

Learning”,1st edition 2016

Optimization algorithms may fail to find global minimum

• Generally accept such solutions

munotes.in

## Page 23

23

Figure 3: Local minimum

Source: Ian Goodfellow, Yoshua Bengio, Aaron Courvile “Deep

Learning”,1st edition 2016

We often minimize functions with multiple inputs: f: 𝑅 →R. For

minimization to make sense there must still be only one (scalar) output .

Here , we need partial derivatives డ

డೣ f(x) measures how f changes as only

variable 𝑥 increases at point x

• Gradient generalizes notion of derivative where derivative is wrt a

vector

• Gradient is vector containing all of the partial derivatives denoted ∇௫

f(x)

– Element i of the gradient is t he partial derivative of f wrt 𝑥

– Critical points are where every element of the gradient is equal to zero

Directional Derivative:

The directional derivate in direction u, a unit vector, is the slope of the

function f in direction u. It’s the derivat ive of the function f(x + au) with

respect to a

(2.8)

It is used to minimize f, we would like to find the direction in which f

decreases the fastest., we can do this using the directional derivative:

(2.9)

munotes.in

## Page 24

24

(2.10)

Where theta is the angle between u and the gradient.

Where ||u||2 = 1 and ignoring all things dependent on u we get:

The gradient points directly uphill, and the negative gradient po ints

directly downhill

• Thus we can decrease f by moving in the direction of the negative

gradient

– This is known as the method of steepest descent or gradient

descent

• Steepest descent proposes a new point

(2.11)

– Where ε is the learning rate, a positive scalar. Set to a small

constant.

Here , We can choose ε in several different ways like follows

• Popular approach: set ε to a small constant

• Another a pproach is called line search:

• Evaluate for several values of ε and choose the one

that results in smallest objective function value.

Steepest descent converges when every element of the gradient is zero

– In practice, very close to zero

• We may be able to avoid iterative algorithm and jump to the critical

point by solving the equation for x.

Gradient descent is limited to continuous spaces

• Concept of repeatedly making the best small move can be generalized

to discrete spaces

• Ascending an objective function of discrete parameters is called hill

climbing .

Beyond Gradient: Jacobian and Hessian matrices:

Sometimes we need to find all derivatives of a function whose input and

output are both vectors

munotes.in

## Page 25

25

• If we have function f: 𝑅 →𝑅 then the matrix of partial derivatives is

known

as the Jacobian matrix J defined as

We are also sometimes interested in a derivative of a derivative.

• For a function f: 𝑅 →R the derivative wrt 𝑥 of the derivative of f wrt 𝑥

is denoted as

• In a single dimension we can denote by f’’(x)

• Tells us how the first derivative will change as we vary the input

• This important as it tells us whether a gradient step will cause as much

of an improvement as based on gradient alone.

Figure 4: Quadratic functions with different curvatures

Source: https://cedar.buffalo.edu/~srihari/CSE676/

2.5 CONSTRAINT OPTIMIZATION

We may wish to optimize f(x) when the solution x is constrained to lie in

set S

– Such values of x are feasible solutions

• Often we want a solution that is small, such as||x||≤1

• Simple approach: modify gradient descent taking constraint into

account (using Lagrangian formulation)

Least squares with Lagrangian:

We wish to minimize

• Subject to const raint 𝑥்x ≤ 1

munotes.in

## Page 26

26

• We introduce the Lagrangian

– And solve the problem

• For the unconstrained problem (no Lagrangian)

the smallest norm solution is x=A+b

– If this solution is not feasible, differentiate

Lagrangian wrt x to obtain 𝐴்Ax-𝐴்b+2λx=0

– Solution takes the form x = ( 𝐴்A+2λI)-1𝐴்b

– Choosing λ: continue solving linear equation and increasing λ until x

has the correct norm.

Karush -Kuhn -Tucker is a very general solution to constrained

optimization

• While Lagrangian allows equality constraints, KKT allows both

equality and inequality constraints

• To define a generalized Lagrangian we need to describe S in terms of

equalities and inequalities .

Generalized Lagrangian: Set S is described in terms of m functions g(i)

and n functions h(j) so that

(2.12)

– Functions of g are equality constraints and functions of h are inequality

constraints

• Introduce new variables λ i and αj for each constraint (called KKT

multipliers) giving the generalized Lagrangian and here We can now

solve the unconstrained optimization problem

2.6 SUMMARY

This chapter is focused on need of Numerical computation. We faced

many problems while calculations with respect to memory, multiple errors

we faced like Overflow and Underflow, We can use softmax function to

resolved problems. Use of Gradient Based Optimization and Constraint

optimization explain well for better calculations.

UNIT END EXERCISES

1. Compare Overflow and Underflow

2. Explain Softmax function in detail

3. Write note on Poor Conditioning

munotes.in

## Page 27

27

4. What is a need of Gradient Optimization and Explain in detail.

5. Define Minimum, Maximum and Saddle point

6. Define Local minimum

7. Explain Constraint Optimization ways.

BIBLIOGRAPHY

1. Ian Goodfellow, Yoshua Bengio, Aaron Courvile “Deep

Learning”,1st edition 2016

2. https://towardsdatascience.com/deep -learning -chapter -4-numerical -

computation -14de09526931

3. https://medium.com/computronium/numerical -method -considerations -

for-machine -learning -e69331d28611

4. https://challengeenthusias t.com/2018/07/10/stabilizing -against -

overflow -and-underflow/

5. https://cedar.buffalo.edu/~srihari/CSE676/

*****

munotes.in

## Page 28

28

UNIT II

3

DEEP NETWORKS

Unit Structure

3.0 Objectives

3.1 Introduction

3.2 Deep feedforward network

3.2.1 Simple Deep Neural Network

3.2.2 Generic Deep Neural Network

3.2.3 Computations in Deep Neural Network

3.2.4 Gradient -Based Learning

3.3 Regulari zation for deep learning

3.3.1 L2 regularization

3.3.2 L1 Regularization

3.3.3 Entropy Regularization

3.3.4 Dropout

3.3.5 Data augmentation

3.4 Optimization for Training deep models

3.4.1 How Learning Differs from Pure Optimization

3.4.2. Challenges in Ne ural Network Optimization

3.4.2.1 Ill -conditioning

3.4.2.2 Local minima

3.4.2.3 Plateaus, Saddle Points and Other Flat Regions

3.4.3 Stochastic Gradient Descent

3.5 Summary

3.6 List of References

Unit End Exercises

Bibliography

3.0 OBJECTIVES

After going through this unit, you will be able to:

Define Deep feedforward network and techniques

State the characteristics in Deep feedforward network

Describe the basic concept of Regularization for deep learning and its

types.

Explain Optimization for Tr aining deep models

munotes.in

## Page 29

29

3.1 INTRODUCTION

A deep neural network (DNN) is an artificial neural network (ANN) with

multiple layers between the input and output layers. There are different

types of neural networks but they always consist of the same components:

neurons, synapses, weights, biases, and functions. These components

functioning similar to the human brains and can be trained like any other

ML algorithm.

Deep learning is part of a broader family of machine learning methods

based on artificial neural netw orks with representation learning. Learning

can be supervised, semi -supervised or unsupervised.

For example, a DNN that is trained to recognize dog breeds will go over

the given image and calculate the probability that the dog in the image is a

certain bre ed. The user can review the results and select which

probabilities the network should display (above a certain threshold, etc.)

and return the proposed label. Each mathematical manipulation as such is

considered a layer, and complex DNN have many layers, h ence the name

"deep" networks.

DNNs can model complex non -linear relationships. DNN architectures

generate compositional models where the object is expressed as a layered

composition of primitives. The extra layers enable composition of features

from lower layers, potentially modeling complex data with fewer units

than a similarly performing shallow network. For instance, it was proved

that sparse multivariate polynomials are exponentially easier to

approximate with DNNs than with shallow networks.

Deep arc hitectures include many variants of a few basic approaches. Each

architecture has found success in specific domains. It is not always

possible to compare the performance of multiple architectures, unless they

have been evaluated on the same data sets.

3.2 DEEP FEEDFORWARD NETWORK

Multi -layered Network of neurons is composed of many sigmoid neurons.

MLNs are capable of handling the non -linearly separable data. The layers

present between the input and output layers are called hidden layers. The

hidden laye rs are used to handle the complex non -linearly separable

relations between input and the output.

Deep feedforward networks, also often called feedforward neural

networks, or multilayer perceptrons (MLPs), are the quintessential deep

learning models. The go al of a feedforward network is to approximate

some function f ∗. For example, for a classifier, y = f ∗(x) maps an input x

to a category y. A feedfor ward network defines a mapping y = f (x; θ) and

learns the value of the parameters θ that result in the best function

approximation.

These models are called feedforward b ecause information ﬂows through

the function being evaluated from x, through the intermediate

computations used to define f , and finally to the output y. There are no

feedback connectio ns in which outputs of the model are fed back into munotes.in

## Page 30

30

itself. When feedfo rward neural networks are extended to include

feedback connections, they are called recurrent neural networks.

Feedforward networks are of extreme importance to machine learning

practitioners. They form the basis of many important commercial

applications.

3.2.1 Simple Deep Neural Network :

In simple neural network to solve complex non -linear decision boundaries.

For example of a mobile phone like/dislike predictor with two variables:

screen size, and the cost. It has a complex d ecision boundary as

shown below,

Fig. 1 Mobile phone Predictor

Decision Boundary :

In single sigmoid neuron, it is impossible to obtain this kind of non -linear

decision boundary. Regardless of how we vary the sigmoid neuron

parameters w and b. Now change the situation and use a s imple network

of neurons for the same problem and see how it handles.

Fig. 2 Simple Neural Network

munotes.in

## Page 31

31

Simple Neural Network:

Here inputs x ₁ - screen size and x ₂ - price going into the network along

with the bias b ₁ and b₂. In break down the model neuron by neuron to

understand. We have our first neuron (leftmost) in the first layer which is

connected to inputs x ₁ and x ₂ with the w eights w ₁₁ and w ₁₂ and the

bias b ₁ and b ₂. The output of that neuron represented as h ₁, which is a

function of x ₁ and x ₂ with parameters w ₁₁ and w₁₂.

h1 = f1 (x1, x2)

Output of the first Neuron:

If we apply the sigmoid function to the inputs x ₁ and x ₂ with the

appropriate weights w ₁₁, w₁₂ and bias b ₁ we would get an output h ₁,

which would be some real value between 0 and 1. The sigmoid output for

the first neuron h ₁ will be given by the following equation,

Output Sigmoid for First Neuron:

Next up, w e have another neuron in the first layer which is connected to

inputs x ₁ and x ₂ with the weights w ₁₃ and w ₁₄ along with the bias

b₃ and b ₄. The output of the second neuron represented as h₂.

H2 = f2 (x1, x2)

Output of the second Neuron:

Similarly, the s igmoid output for the second neuron h ₂ will be given by

the following equation,

So far we have seen the neurons present in the first layer but we also have

another output neuron which takes h ₁ and h ₂ as the input as opposed to

the previous neurons. The output from this neuron will be the final

predicted output, which is a function of h ₁ and h ₂. The predicted output

is given by the following equation,

munotes.in

## Page 32

32

We can adjust only w ₁, w₂ and b - parameters of a single sigmoid

neuron. Now we can adjust the 9 p arameters (w ₁₁, w₁₂, w₁₃, w₁₄,

w₂₁, w₂₂, b₁, b₂, b₃), which allows the handling of much complex

decision boundary. By trying out the different configurations for these

parameters we would be able to find the optimal surface where output for

the entire midd le area (red points) is one and everywhere else is zero, what

we desire.

Fig. 3 Decision Boundary from Network

3.2.2 Generic Deep Neural Network :

Previous ly, we have seen the neural network for a specific task, now we

will check about the neural networ k in generic form.

Fig. 4 Generic Network without Connections

munotes.in

## Page 33

33

In network of neurons with two hidden layers (in blue but there can more

than 2 layers if needed) and each hidden layer has 3 sigmoid neurons there

can be more neurons. In three inputs going into the network and there are

two neurons in the output layer. We will take this network as it is and

understand the intricacies of the deep neural network.

The terminology then we will go into how these neurons interact with each

other. For each of thes e neurons, two things will happen

1. Pre-activation represented by ‘a’: It is a weighted sum of inputs plus

the bias.

2. Activation represented by ‘h’: Activation function is Sigmoid

function.

Fig. 5 Generic Network with Connections

Let’s understand the netw ork neuron by neuron. Consider the first neuron

present in the first hidden layer. The first neuron is connected to each of

the inputs by weight W₁.

Going forward I will be using this format of the indices to represent the

weights and biases associated wi th a particular neuron,

W₁₁₁— Weight associated with the first neuron present in the first

hidden layer connected to the first input.

W₁₁₂— Weight associated with the first neuron present in the first

hidden layer connected to the second input.

b₁₁ — Bias associated with the first neuron present in the first

hidden layer.

b₁₂ — Bias associated with the second neuron present in the first

hidden layer.

munotes.in

## Page 34

34

Here W ₁ a weight matrix containing the individual weights associated

with the respective inputs. The pre -activation at each layer is the weighted

sum of the inputs from the previous layer plus bias. The mathematical

equation for pre -activation at each layer ‘i’ is given by,

ai (x) = W i hi-1 (x) +b i

Pre-activation Function :

The activation at each layer is equ al to applying the sigmoid function to

the output of pre -activation of that layer. The mathematical equation for

the activation at each layer ‘i’ is given by,

hi (x) = g(a i(x))

where ‘g’ is called activation Function.

Finally, we can get the predicted output of the neural network by applying

some kind of activation function (could be softmax depending on the task)

to the pre -activation output of the previous layer. The equation for the

predicted output is shown below,

f (x) = h L = O(a L)

Where ‘O’ is c alled as the output activation function.

3.2.3 Computations in Deep Neural Network :

Consider that you have 100 inputs and 10 neurons in the first and second

hidden layers. Each of the 100 inputs will be connected to the neurons will

be The weight matrix of the first neuron W ₁ will have a total of 10 x

100 weights.

Weight Matrix :

Remember, we are following a very specific format for the indices of the

weight and bias variable as shown below,

W(Layer number)(Neuron number in the layer)(Input number)

munotes.in

## Page 35

35

b(Layer number)(Bias number associated for that input)

Now let’s see how we can compute the pre -activation for the first neuron

of the first layer a ₁₁. We know that pre -activation is nothing but the

weighted sum of inputs plus bias. In other words, it is t he dot product

between the first row of the weight matrix W ₁ and the input matrix X

plus bias b₁₁.

Similarly the pre -activation for other 9 neurons in the first layer given by,

In short, the overall pre -activation of the first layer is given by,

a1 = W 1 * x + b 1

Where,

W₁ is a matrix containing the individual weights associated with the

corresponding inputs and b ₁ is a vector containing(b ₁₁, b₁₂,

b₁₃,….,b₁₀) the individual bias associated with the sigmoid neurons.

The activation for the first lay er is given by,

h1 = g(a 1)

Where ‘g’ represents the sigmoid function.

Remember that a ₁ is a vector of 10 pre -activation values, here we are

applying the element -wise sigmoid function on all these 10 values and

storing them in another vector represented as h ₁. Similarly, we can

compute the pre -activation and activation values for ’n’ number of hidden

layers present in the network.

Output Layer of DNN :

So far we have talked about the computations in the hidden layer. Now we

will talk about the computatio ns in the output layer.

munotes.in

## Page 36

36

Fig. 6 The output Activation function is chosen depending on the task at hand, can

be softmax or linear.

Softmax Function :

We will use the Softmax function as the output activation function. The

most frequently used activation f unction in deep learning for classification

problems.

Softmax Function :

In the Softmax function, the output is always positive, irrespective of

the input.

Fig. 6 Softmax Fubction .

munotes.in

## Page 37

37

Now, let us illustrate the Softmax function on the above -shown networ k

with 4 output neurons. The output of all these 4 neurons is represented in a

vector ‘a’. To this vector, we will apply our softmax activation function to

get the predicted probability distribution as shown below,

By applying the softmax function we w ould get a predicted probability

distribution and our true output is also a probability distribution, we can

compare these two distributions to compute the loss of the network.

Loss Function :

In this section, we will talk about the loss function for bina ry and multi -

class classification. The purpose of the loss function is to tell the model

that some correction needs to be done in the learning process.

In general, the number of neurons in the output layer would be equal to the

number of classes. But in th e case of binary classification, we can use only

one sigmoid neuron which outputs the probability P(Y=1) therefore we

can obtain P(Y=0) = 1 -P(Y=1). In the case of classification, We will use

the cross -entropy loss to compare the predicted probability distr ibution

and the true probability distribution.

Cross -entropy loss for binary classification is given by,

Cross -entropy loss for multi -class classification is given by,

3.2.4 Gradient -Based Learning :

Designing and training a neural network is not much di ﬀerent from

training any other machine learning model with gradient descent. The

largest di ﬀerence between the linear models we have seen so far and neural

networks is that the nonlinearity of a neural network causes most

interesting loss functions to become non -convex. This means that neural

networks are usually trained by using iterative, gradient -based optimizers

that merely drive the cost function to a very low value, rather than the

linear equation solvers used to train linear regression models or the convex

munotes.in

## Page 38

38

optimization algorithms with global convergence guarantees used to train

logistic regression or SVMs.

Fig 7. Ball diagram depicting the visualisation of how gradient based learning occur.

As we can see at right of figure 7, an analogy of a ball dropped in a deep

valley and it settle downs at the bottom of the valley. Similarly we want

our cost function to get minimize and get to the minimum value possible.

When we move the ball a small amount Δv1 in the v1 direction, and a

small amount Δv2 in the v2 direction. Calculus tells us that C changes as

follows:

Change of the v1 and v2 such that the change in cost is negative is

desirable. We can also denote ΔC≈ ∇C⋅Δv. where ∇C is,

and Δv is,

Indeed, there’s even a sense in which gradient descent is the optimal

strategy for searching for a minimum. Let’s suppose that we’re trying to

make a move Δv in position so as to decrease C as much as possible. We’ll

constrain the size of the move so that ǁΔvǁ=ϵ for some small fixed ϵ>0. In

other word s, we want a move that is a small step of a fixed size, and we’re

trying to find the movement direction which decreases C as much as

possible. It can be proved that the choice of Δv which minimizes ∇C⋅Δv is

Δv=−η∇C, where η=ϵ/ǁ ∇Cǁ is determined by the size constraint ǁΔvǁ= ϵ. So

gradient descent can be viewed as a way of taking small steps in the

direction which does the most to immediately decrease C. Now that the

gradient vector ∇C has corresponding components ∂C/∂w 𝑘 and ∂C/∂b ℓ.

Writing out the gradient descent update rule in terms of components, we

have

munotes.in

## Page 39

39

Now there are many challenges in training gradient based learning. But for

now I just want to mention one problem. When the number of training

inputs is very large this can take a long time, and learni ng thus occurs

slowly. An idea called stochastic gradient descent can be used to speed up

learning. To make these ideas more precise, stochastic gradient descent

works by randomly picking out a small number m of randomly chosen

training inputs. We’ll label those random training inputs X1,X2,…,Xm and

refer to them as a mini-batch . Provided the sample size mm is large enough

we expect that the average value of the ∇CXj will be roughly equal to the

average over all ∇Cx, that is,

This modification helps us in reducing a good amount of computational

load. Stochastic gradient descent applied to non -convex loss functions has

no such convergence guarantee, and is sensitive to the values of the initial

parameters. For feedforward neural networks, it is important to initialize all

weights to small random values. The biases may be initialized to zero or to

small positive values.

3.3 Regularization for deep learning

Regularization is a set of strategies used in Machine Learning to reduce

the generalization error. M ost models, after training, perform very well on

a specific subset of the overall population but fail to generalize well. This

is also known as overfitting. Regularization strategies aim to reduce

overfitting and keep, at the same time, the training error as low as

possible.

ResNet CNN architecture were originally proposed in 2015. A recent

paper called “Revisiting ResNets: Improved Training and Scaling

Strategies” applied modern regularization methods and achieved more

than 3% test set accuracy on Imagenet . If the test set consists of 100K

images, this means that 3K more images were classified correctly!

munotes.in

## Page 40

40

Fig. 8 Revisiting ResNets: Improved Training and Scaling Strategies

In the context of deep learning, most regularization strategies are based on

regular izing estimators. Regularization of an estimator works by trading

increased bias for reduced variance. An e ﬀective regularizer is one that

makes a proﬁtabletrade, reducing variance signiﬁcantly while not overly

increasing the bias.

In simple terms, regula rization results in simpler models . And as the

Occam’s razor principle argues: the simplest models are the most likely to

perform better. Actually, we constrain the model to a smaller set of

possible solutions by introducing different techniques.

To get a better insight you need to understand the famous bias -variance

tradeoff.

The bias -variance tradeoff: overfitting and underfitting

First, let’s clarify that bias -variance tradeoff and overfitting -underfitting

are equivalent.

Overfitting refers to the ph enomenon where a neural network models the

training data very well but fails when it sees new data from the same

problem domain. Overfitting is caused by noise in the training data that the

neural network picks up during training and learns it as an underl ying

concept of the data.

Fig. 9 Underfitting and overfitting

munotes.in

## Page 41

41

The bias error is an error from wrong assumptions in the learning

algorithm. High bias can cause an algorithm to miss the relevant relations

between features and target outputs. This is called underfitting.

The variance is an error from sensitivity to small fluctuations in the

training set. High variance may result in modeling the random noise in the

training data. This is called overfitting.

The bias-variance tradeoff is a term to describe t he fact that we can

reduce the variance by increasing the bias. Good regularization techniques

strive to simultaneously minimize the two sources of error. Hence,

achieving better generalization.

3.3.1 L2 regularization :

The L2 regularization is the most common type of all regularization

techniques and is also commonly known as weight decay or Ride

Regression.

The mathematical derivation of this regularization, as well as the

mathematical explanation of why this method works at reducing

overfitting, is qu ite long and complex. Since this is a very practical article I

don’t want to focus on mathematics more than it is required. Instead, I want

to convey the intuition behind this technique and most importantly how to

implement it so you can address the overfi tting problem during your deep

learning projects.

During the L2 regularization the loss function of the neural network as

extended by a so -called regularization term, which is called here Ω.

The L2 regularizer will have a big impact on the directions of the weight

vector that don’t “contribute” much to the loss function. On the other

hand, it will have a relatively small effect on the directions that contribute

to the loss function. As a result, we reduce the variance of our model,

which makes it easier t o generalize on unseen data.

The regularization term Ω is defined as the Euclidean Norm (or L2 norm)

of the weight matrices, which is the sum over all squared weight values of

a weight matrix. The regularization term is weighted by the scalar alpha

divided by two and added to the regular loss function that i s chosen for the

current task. This leads to a new expression for the loss function:

Eq 2. Regularization loss during L2 regularization.

Alpha is sometimes called as the regularization rate and is an additional

hyperparameter we introduce into the neur al network. Simply speaking

alpha determines how much we regularize our model.

munotes.in

## Page 42

42

In the next step we can compute the gradient of the new loss function and

put the gradient into the update rule for the weights:

Eq. 3 Gradient Descent during L2 Regular ization.

Some reformulations of the update rule lead to the expression which very

much looks like the update rule for the weights during regular gradient

descent:

Eq.4 Gradient Descent during L2 Regularization.

The only difference is that by adding t he regularization term we introduce

an additional subtraction from the current weights (first term in the

equation).

In other words independent of the gradient of the loss function we are

making our weights a little bit smaller each time an update is perf ormed.

3.3.2 L1 regularization :

In the case of L1 regularization (also knows as Lasso regression), we

simply use another regularization term Ω. This term is the sum of the

absolute values of the weight parameters in a weight matrix:

Eq.5 Regularization Term for L1 Regularization.

As in the previous case, we multiply the regularization term by alpha and

add the entire thing to the loss function.

Eq.6 Loss function during L1 Regularization.

The derivative of the new loss function leads to the followi ng expression,

which the sum of the gradient of the old loss function and sign of a weight

value times alpha.

munotes.in

## Page 43

43

Eq. 7 Gradient of the loss function during L1 Regularization.

As we can see, the regularization term does not scale linearly, contrary to

L2 regularization, but it’s a constant factor with an alternating sign. How

does this affect the overall training?

The L1 regularizer introduces sparsity in the weights by forcing more

weights to be zero instead of reducing the average magnitude of all

weig hts ( as the L2 regularizer does). In other words, L1 suggests that

some features should be discarded whatsoever from the training process.

Elastic net :

Elastic net is a method that linearly combines L1 and L2 regularization

with the goal to acquire the best of both worlds . More specifically the

penalty term is as follows.

\Omega( \theta) = \lambda_1 ||w||_1 + \lambda_2||w||^2_2 Ω(θ)=λ1∣∣w∣∣1

+λ2∣∣w∣∣22

Elastic Net regularization reduces the effect of certain features, as L1 does,

but at the same time, it does not eliminate them. So it combines feature

elimination from L1 and feature coefficient reduction from the L2.

What does Regularization achieve? :

Performing L2 regularization encourages the weight values tow ards

zero (but not exactly zero)

Performin g L1 regularization encourages the weight values to be zero

Intuitively speaking smaller weights reduce the impact of the hidden

neurons. In that case, those hidden neurons become neglectable and the

overall complexity of the neural network gets reduced.

As mentioned earlier l ess complex models typically avoid modeling noise

in the data, and therefore, there is no overfitting.

But you have to be careful. When choosing the regularization term α. The

goal is to strike the right balance between low complexity of the model and

accuracy

If your alpha v alue is too high, your model will be simple, but you run

the risk of underfitting your data. Your model won’t learn enough

about the training data to make useful predictions.

If your alpha value is too low, your mod el will be more complex, and

you run the risk of overfitting your data. Your model will learn too munotes.in

## Page 44

44

much about the particularities of the training data, and won’t be able to

generalize to new data.

L2 & L1 regularizatio :

L1 and L2 are the most common types o f regularization. These update the

general cost function by adding another term known as the regularization

term.

Cost function = Loss (say, binary cross entropy) + Regularization term

Due to the addition of this regularization term, the values of weight

matrices decrease because it assumes that a neural network with smaller

weight matrices leads to simpler models. Therefore, it will also reduce

overfitting to quite an extent.

However, this regularization term differs in L1 and L2.

In L2, we have:

Here, lambda is the regularization parameter. It is the hyperparameter

whose value is optimized for better results. L2 regularization is also

known as weight decay as it forces the weights to decay towa rds zero (but

not exactly zero).

In L1, we have:

In this, w e penalize the absolute value of the weights. Unlike L2, the

weights may be reduced to zero here. Hence, it is very useful when we are

trying to compress our model. Otherwise, we usually prefer L2 over it.

Similarly, we can also apply L1 regularization.

3.3.3 Entropy Regularization :

Entropy regularization is another norm penalty method that applies to

probabilistic models. It has also been used in different Reinforcement

Learning techniques such as A3C and policy optimization techniques.

Similarly to the pre vious methods, we add a penalty term to the loss

function.

The term “Entropy” has been taken from information theory and

represents the average level of "information" inherent in the variable's

munotes.in

## Page 45

45

possible outcomes. An equivalent definition of entropy is the expected

value of the information of a variable.

One very simple explanation of why it works is that it forces the

probability distribution towards the uniform distribution to reduce

variance.

In the context of Reinforcement Learning, one can say that the entropy

term added to the loss, promotes action diversity and allows better

exploration of the environment.

3.3.4 Dropout :

In addition to the L2 and L1 regularization, another famous and powerful

regularization technique is called the dropout regularizati on. The procedure

behind dropout regularization is quite simple.

In a nutshell, dropout means that during training with some probability P a

neuron of the neural network gets turned off during training.

Another strategy to regularize deep neural networks is dropout. Dropout

falls into noise injection techniques and can be seen as noise injection into

the hidden units of the network .

In practice, during training, some number of layer outputs are randomly

ignored (dropped out) with probability pp.

During tes t time, all units are present , but they have been scaled down

by pp. This is happening because after dropout, the next layers will receive

lower values. In the test phase though, we are keeping all units so the

values will be a lot higher than expected. Th at’s why we need to scale

them down.

By using dropout, the same layer will alter its connectivity and will search

for alternative paths to convey the information in the next layer. As a

result, each update to a layer during training is performed with a dif ferent

“view” of the configured layer. Conceptually, it approximates training a

large number of neural networks with different architectures in parallel.

"Dropping" values means temporarily removing them from the network

for the current forward pass, along with all its incoming and outgoing

connections. Dropout has the effect of making the training process noisy.

The choice of the probability pp depends on the architecture.

Other Dropout variations :

There are many more variations of Dropout that have been pr oposed over

the years. To keep this article relatively digestible, I won’t go into many

details for each one. But I will briefly mention a few of them. munotes.in

## Page 46

46

1. Inverted dropout : also randomly drops some units with a

probability pp. The difference with traditional dropout is: During

training, it also scales the activations by the inverse of the keep

probability 1-p1−p. The reason behind this is: to prevent the

activations from becoming too large thus the need to modify the

network during the testing phase. The end result will be similar to the

traditional dropout.

2. Gaussian dropout : instead of dropping units during tr aining, is

injecting noise to the weights of each unit. The noise is, more often

than not ,Gaussian. This results in:

1. A reduction in the computati onal effort during testing time.

2. No weight scaling is required.

3. Faster training overall

3. DropConnect follows a slightly different approach. Instead of

zeroing out random activations (units), it zeros random weights

during each forward pass . The weights are dropped with a

probability of 1-p1−p. This essentially transforms a fully connected

layer to a sparsely connected layer. Mathematically we can represent

DropConnect as: r = a \left(\left(M *

W\right){v} \right) r=a((M∗W)v) where rr is the layers’ output, vv the

input, WW the weights and MM a binary matrix. MM is a mask that

instantiates a different connectivity pattern from each data sample.

Usually, the mask is derived from each training

example. DropConnect can be seen as a generalization of Dropout to

the full -connection structure of a lay er.

4. Variational Dropout : we use the same dropout mask on each

timestep. This means that we will drop the same network units each

time.

5. Attention Dropout : popular over the past years because of the rapid

advancements of attention -based models like Transfor mers . As you

may have guessed, we randomly dropped certain attention units with a

probability pp.

6. Adaptive Dropout: a technique that extends dropout by allowing the

dropout probability to be different for different units. The intuition is

that there may be hidden units that can individually make confident

predictions for the presence or absence of an important feature or

combination of features.

7. Embedding Dropout : a strategy that performs dropout on the

embedding matrix and is used for a full forward and ba ckward pass. munotes.in

## Page 47

47

8. DropBlock: is used in Convolutional Neural networks and it discards

all units in a continuous region of the feature map.

Stochastic Depth :

Stochastic depth goes a step further. It drops entire network blocks while

keeping the model intact durin g testing. The most popular application is in

large ResNets where we bypass certain blocks through their skip

connections.

In particular, Stochastic depth drops out each layer in the network that has

residual connections around it. It does so with a specif ied

probability pp that is a function of the layer depth.

Fig. 10 Deep Networks with Stochastic Depth

Early stopping :

Early stopping is one of the most commonly used strategies because it is

very simple and quite effective. It refers to the process of stopping the

training when the training error is no longer decreasing but the validation

error is starting to rise .

Fig. 11 Process of Early Stopping.

This implies that we store the trainable parameters periodically and track

the validation error. After th e training stopped, we return the trainable

parameters to the exact point where the validation error started to rise,

instead of the last ones.

A different way to think of early stopping is as a very efficient

hyperparameter selection algorithm, which sets the number of epochs to

munotes.in

## Page 48

48

the absolute best. It essentially restricts the optimization procedure to a

small volume of the trainable parameters space close to the initial

parameters.

It can also be proven that in the case of a simple linear model with a

quad ratic error function and simple gradient descent, early stopping is

equivalent to L2 regularization.

Parameter sharing :

Parameter sharing follows a different approach. Instead of penalizing

model parameters, it forces a group of parameters to be equal . This can

be seen as a way to apply our previous domain knowledge to the training

process. Various approaches have been proposed over the years but the

most popular one is by far Convolutional Neural Networks.

Convolutional Neural Networks take advantage of the spatial structure of

images by sharing parameters across different locations in the input. Since

each kernel is convoluted with different blocks of the input image, the

weight is shared among the blocks instead of having separate ones.

Batch normalization :

Batch normalization (BN) can also be used as a form of regularization.

Batch normalization fixes the means and variances of the input by

bringing the feature in the same range. More specifically, we concentrate

the features in a compact Gaussian -like spac e.

Visually this can be represented as:

Fig. 12 Batch Normalization

munotes.in

## Page 49

49

Batch normalization can implicitly regularize the model and in many

cases, it is preferred over Dropout.

One can think of batch normalization as a similar process with dropout

because i t essentially injects noise. Instead of multiplying each hidden unit

with a random value, it multiplies them with the deviation of all the hidden

units in the minibatch. It also subtracts a random value from each hidden

unit at each step.

3.3.5 Data augme ntation :

Data augmentation is the final strategy that we need to mention. Although

not strictly a regularization method, it sure has its place here.

Data augmentation refers to the process of generating new training

examples to our dataset . More training da ta means lower model’s

variance, a.k.a lower generalization error. Simple as that. It can also be

seen as a form of noise injection in the training dataset.

Data augmentation can be achieved in many different ways. Let’s explore

some of them.

1. Basic Data Ma nipulations : The first simple thing to do is to perform

geometric transformations on data. Most notably, if we’re talking

about images we have solutions such as: Image flipping, cropping,

rotations, translations, image color modification, image mixing

etc. Cutout is a commonly used idea where we remove certain image

regions. Another idea, called Mixup , is the process of blending two

images from the dataset into one image.

2. Feature Space Augmentation : Instead of transforming data in the

input space as above, we can apply transformations on the feature

space. For example, an autoencoder might be used to extract the latent

representation. Noise can then be added in the latent representation

which results in a transformation of the original data point.

3. GAN-based Augmentation: Generative Adversarial Networks have

been proven to work extremely well on data generation so they are a

natural choice for data augmentation.

4. Meta -Learning: In meta -learning, we use neural networks to

optimize other neural networks by tunin g their hyperparameters,

improving their layout, and more. A similar approach can also be

applied in data augmentation. In simple terms, we use a classification

network to tune an augmentation network into generating better

images. Example: We feed random images to an Augmentation

Network (most likely a GAN), which will generate augmented

images. Both the augmented image and the original are passed into a munotes.in

## Page 50

50

second network, which compares them and tells us how good the

augmented image is. After repeating the p rocess the augmentation

network becomes better and better at producing new images.

Regularization is an integral part of training Deep Neural Networks. In all

the aforementioned strategies fall into two different high -level categories.

They either penalize the trainable parameters or they inject noise

somewhere along the training lifecycle. Whether this is on the training

data, the network architecture, the trainable parameters or the target labels.

3.4 OPTIMIZATION FOR TRAINING DEEP MODELS

Deep learning al gorithms involve optimization in many contexts. For

example, performing inference in models such as PCA involves solving an

optimization problem. We often use analytical optimization to write proofs

or design algorithms. Of all of the many optimization pro blems involved

in deep learning, the most difficult is neural network training. It is quite

common to invest days to months of time on hundreds of machines in

order to solve even a single instance of the neural network training

problem. Because this proble m is so important and so expensive, a

specialized set of optimization techniques have been developed for solving

it.

3.4.1 How Learning Differs from Pure Optimization :

Optimization algorithms used for training of deep models differ from

traditional optimiz ation algorithms in several ways. Machine learning

usually acts indirectly. In most machine learning scenarios, we care about

some performance measure P, that is defined with respect to the test set and

may also be intractable. We therefore optimize P only indirectly. We

reduce a different cost function J(θ) in the hope that doing so will improve

P . This is in contrast to pure optimization, where minimizing J is a goal in

and of itself. Optimization algorithms for training deep models also

typically includ e some specialization on the specific structure of machine

learning objective functions. In Machine Learning (ML), we care about a

certain performance measure (say P, for e.g. accuracy) defined w.r.t the test

set and optimize J(θ) (for e.g. cross -entropy l oss) with the hope that it

improves P as well. In pure optimization, optimizing J(θ) is the final goal.

The expected generalization error (risk) is taken over the true data -

generating distribution p_data . If we do have that, it becomes an

optimization prob lem.

Notice how the expectation is taken over the true data generating

distribution.

munotes.in

## Page 51

51

When we don’t have p_data but a finite training set, we have a ML

problem. The latter can be converted back to an optimization problem by

replacing p_data with the empir ical distribution with p ̂_data obtained from

the training set, thereby reducing the empirical risk . This is

called empirical risk minimization (ERM):

Notice the change in distribution over which the expectation is taken.

Although this might look relatively similar to optimizati on, there are two

main problems. Firstly, ERM is prone to overfitting with the possibility of

the dataset being learned by high capacity models (models with the ability

to learn extremely complex functions). Secondly, ERM might not be

feasible. Most optimi zation algorithms now are based on Gradient

Descent (GD) which requires a derivative calculation and hence, may not

work with certain loss functions like the 0 –1 loss (as it is not

differentiable).

It is for the reasons mentioned above that a surrogate los s

function (SLF) is used instead, that acts as a proxy. For e.g.

the negative log -likelihood of the true class is used as a surrogate for

0–1 loss. I’ve added a code snippet below that would help you

understand why 0 –1 loss won’t work for Gradient Descent but cross -

entropy, being a smooth function, would.

Fig. 13 Compari son of Zero -one loss and cross - entropy loss.

It can be seen that the 0 –1 loss is a non -differentiable function and hence,

not compatible with gradient -based algorithms like Gradient Desce nt.

Cross -entropy is a smooth approximation of the 0 –1 loss.

Using a SLF might even turn out to be beneficial as you can keep

continuing to obtain a better test error by pushing the classes even further

apart to get a more reliable classifier. By this, I m ean that suppose we were

using a 0 –1 loss with a threshold of, say, 0.5 to assign each class. Here, in

munotes.in

## Page 52

52

case the true class is 1, our model would have no motivation to push the

prediction score close to 1 once it’s able to get it above 0.5. However, using

cross-entropy, since we are using the raw prediction scores, the model

keeps trying to push the prediction closer to the true class.

Another common difference is that training might be halted following

some convergence criterion based on Early Stopping to p revent

overfitting, when the derivative of the surrogate loss function might

still be large . This is different from pure optimization which is halted

only when the derivative becomes very small.

In ML optimization algorithms, the objective function decomp oses as

a sum over the examples and we can perform updates by randomly

sampling a batch of examples and taking the average over the

examples in that batch. If we consider n random variables, each having

the true mean given by μ, the Standard Error (S.E.) o f the

mean estimated from those n random variables is given as follows:

This indicates that as we include more examples for making an update,

the returns of additional examples in improving the error is less than

linear . Thus, if we use 100 and 10000 exa mples separately to make an

update, the latter takes 100 times more compute, but reduces the error only

by a factor of 10. Thus, it’s better to compute rapid approximate

updates rather than a slow exact update .

There are 3 types of sampling based algorithm s — batch gradient

descent (BGD), where the entire training set is used to make a single

update, stochastic gradient descent (SGD), where a single example is

used to make a weight update and mini-batch gradient descent

(MBGD), where a batch (not to be conf used with BGD) of examples is

randomly sampled from the entire training set and is used to make an

update. Mini-batch GD is nowadays commonly referred to as

Stochastic GD.

It is a common practise to use batch sizes of powers of 2 to offer better run -

time w ith certain hardware. Small batches tend to have a regularizing effect

because of the noise they inject as each update is made by seeing only a

very small portion of the entire training set.

The mini -batches should be selected randomly. It is sufficient to

shuffle the dataset once and iterate over it multiple times. In the first

epoch, the network sees each example for the first time and hence, the

estimate of gradient is an unbiased estimate of the gradient of the true

generalization error. However, from t he second epoch onward, the

estimate becomes biased as it is re -sampling from data that it has

already seen. Such a sampling algorithm is called Random Reshuffling

munotes.in

## Page 53

53

and although their analysis even for generalized linear models, which

are strongly convex, i s an open problem till date, reasonable efforts

have been made to show that this biased estimate o f the gradient is

decent enough.

3.4.2 Challenges in Neural Network Optimization :

Optimization in general is an extremely difficult task. Traditionally,

machi ne learning has avoided the difficulty of general optimization by

carefully designing the objective function and constraints to ensure that the

optimization problem is convex. When training neural networks, we must

confront the general non -convex case. Eve n convex optimization is not

without its complications. The optimization problem for training neural

networks is generally non-convex . Some of the challenges faced are

mentioned below:

3.4.2.1 Ill-conditioning of the Hessian Matrix :

Some challenges arise even wh en optimizing convex functions. Of these,

the most prominent is ill -conditioning of the Hessian matrix H. This is a

very general problem in most numerical optimization. For the sake of

completion, the Hessian matrix H of a function f with a vector -valued

input x is given as:

Ill-conditioning : is said to happen when the first term exceeds the second

term as then the cost would be increasing. In many cases, the gradient

might be large leading to a large gradient norm (i.e. g’g).

However, g’Hg might be even larger than the gradient norm. This would

lead to slower learning as we would need to reduce the learning rate to

make the first term lower than the second. To clarify more on this, since the

first term contains the 2nd power of ϵ, and ϵ being less than 1 , ϵ² < ϵ. So, to

prevent ill -conditioning, the first term should be lower than the second, but

munotes.in

## Page 54

54

given that g’Hg > g’g, this can be achieved only by reducing the learning

rate, leading to slower learning. Thus, although ideally the gradient norm

should decre ase during training (since our primary aim is to reach a global

minima where the gradient is 0), we can still get successful training even

with the gradient norm increasing as shown below:

Fig 14 Gradient norm

Figure 14 Gradient descent often does not ar rive at a critical point of any

kind. In this example, the gradient norm increases throughout training of a

convolutional network used for object detection. (Left)A scatterplot

showing how the norms of individual gradient evaluations are distributed

over t ime. To improve legibility, only one gradient norm is plotted per

epoch. The running average of all gradient norms is plotted as a solid

curve. The gradient norm clearly increases over time, rather than

decreasing as we would expect if the training process converged to a

critical point. (Right)Despite the increasing gradient, the training process is

reasonably successful. The validation set classification error decreases to a

low level.

3.4.2.2 Local m inima :

One of the most prominent features of a convex opt imization problem is

that it can be reduced to the problem of finding a local minimum. Any local

minimum is gua ranteed to be a global minimum. Some convex functions

have a flat region at the bottom rather than a single global minimum point,

but any point w ithin such a flat region is an acceptable solution. When

optimizing a convex function, we know that we have reached a good

solution if we find a critical point of any kind. With non -convex functions,

such as neural nets, it is possible to have many local m inima. Indeed,

nearly any deep model is essentially guaranteed to have an extremely large

number of local minima. However, as we will see, this is not necessarily a

major problem. Neural networks and any models with multiple

equivalently parametrized laten t variables all have multiple local minima

because of the model identifiability problem. A model is said to be

identifiable if a sufficiently large training set can rule out all but one setting

of the model’s parameters. Models with latent variables are of ten not

identifiable because we can obtain equivalent models by exchanging latent

munotes.in

## Page 55

55

variables with each other. For example, we could take a neural network and

modify layer 1 by swapping the incoming weight vector for unit i with the

incoming weight vector fo r unit j, then doing the same for the outgoing

weight vectors. If we have m layers with n units each, then there are n!m

ways of arranging the hidden units. This kind of non -identifiability is

known as weight space symmetry. Nearly any Deep Learning (DL ) model

is guaranteed to have an extremely large number of local minima (LM)

arising due to the model identifiability problem.

A model is said to be identifiable if a sufficiently large training set can rule

out all but one setting of the model parameters. I n case of neural networks,

we can obtain equivalent models by swapping the position of the neurons.

Thus, they are not identifiable.

(a)

(b)

Fig. 15 Local Minima

Swapping the two hidden nodes leads to equivalent models. Thus, even

after having a suffic iently large training set, there is not a unique setting of

parameters. This is the model identifiability problem that neural networks

suffer from.

However, all the local minima caused due to this have the same value of

the cost function, thus not being a problem. However, if local minima with

high cost are common, it becomes a serious problem as shown above.

Many points other than local minima can lead to low gradients. Nowadays,

it’s common to aim for a low but not minimal cost value.

munotes.in

## Page 56

56

3.4.2.3 Plateaus, Sa ddle Points and Other Flat Regions :

Saddle point (SP) is another type of point with zero gradient where some

points around it have higher value and the others have lower. Intuitively,

this means that a saddle point acts as both a local minima for some

neigh bors and a local maxima for the others. Thus, Hessian at SP has both

positive and negative eigenvalues for a function to curve upwards or

downwards around a point as in the case of local minima and local

maxima, the eigenvalues should have the same sign, p ositive for local

minim a and negative for local maxima .

Fig. 16 Saddle Point

It’s called a saddle point as it looks like the saddle of a horse. For many

classes of random functions, saddle points become more common at high

dimensions with the ratio of nu mber of SPs to LMs growing exponentially

with n for an n -dimensional space. Many random functions have an

amazing property that near points with low cost, the Hessian tends to take

up mostly positive eigenvalues. SGD empirically tends to rapidly avoid

enco untering a high -cost saddle point.

Fig 17 Position of Plateau

It is problematic to get stuck in a plateau where the value of the cost

function is high.

Cliffs and Exploding Gradients : Neural Networks (NNs) might

sometimes have extremely steep regions re sembling cliffs due to the

repeated multiplication of weights. Suppose we use a 3 -layer (input -

hidden -output) neural network with all the activation functions as

munotes.in

## Page 57

57

linear. We choose the same number of input, hidden an d output

neurons, thus, using the same weight W for each layer. The output

layer y = W*h where h = W*x represents the hidden layer, finally

giving y = W*W x . So, deep neural networks involve multiplication of

a large number of parameters leading to sharp n on-linearities in the

parameter space. These non -linearities give rise to high gradients in

some places. At the edge of such a cliff, an update step might throw

the parameters extremely far.

Fig. 18 Cliffs and Exploding Gradients

Image depicting the prob lem of exploding gradients when approaching a

cliff. 1) Usual training going on with the parameters moving towards the

lower cost region. 2) The gradient at the bottom left -most point pointed

downwards (correct direction) but the step -size was too large, w hich

caused the parameters to land at a point having large cost value. 3) The

gradient at this new point moved the parameters in a completely different

position undoing most of the training done until that point.

It can be taken care of by using Gradient C lipping (GC) . The gradient

indicates only the direction in which to make the update. If the GD update

proposes making a very large step, GC intervenes to reduce the step size.

Long -Term Dependencies : This problem is encountered when the

NN becomes sufficie ntly deep. For example, if the same weight

matrix W is used in each layer, after t steps, we’d get W *W * W …

(t times) . Using the eigendecomposition of W:

Here, V is an orthonormal matrix, i.e. V V’ = I

Thus, any eigenvalues not near an absolute value o f one would either

explode or vanish leading to the Vanishing and Exploding

Gradient problem. The use of the same weight matrix is especially the case

in Recurrent NNs (RNNs), where this is a serious problem.

munotes.in

## Page 58

58

Inexact Gradients : Most optimization algorithms use a noisy/biased

estimate of the gradient in cases where the estimate is based on

sampling, or in cases where the true gradient is intractable for e.g. in

the case of training a Restricted Boltzmann Machine (RBM), an

approximation of the gradient is use d. For RBM, the contrastive

divergence algorithm gives a technique for approximating the gradient

of its intractable log -likelihood.

Neural Networks might not end up at any critical point at all and such

critical points might not even necessarily exist. A lot of the problems

might be avoided if there exists a space connected reasonably directly

to the solution by a path that local descent can follow and if we are

able to initialize learning within that well -behaved space. Thus,

choosing good initial points should be studied.

Stochastic Gradient Descent :

This has already been described before but there are certain things that

should be kept in mind regarding SGD. The learning rate ϵ is a very

important parameter for SGD. ϵ should be reduced after each epoch in

general. This is due to the fact that the random sampling of batches acts as

a source of noise which might make SGD keep oscillating around the

minima without actually reaching it. This is shown below:

Fig. 19 Stochastic Gradient Descent

The true gradie nt of the total cost function (involving the entire

dataset) actually becomes 0 when we reach the minimum. Hence, BGD can

use a fixed learning rate. The following conditions guarantee convergence

under convexity assumptions in case of SGD:

(∈)= ∞ஶ

ୀଵ

(∈𝑘ଶ)= ∞ஶ

ୀଵ

Setting it too low makes the training proceed slowly which might lead to

the algorithm being stuck at a high cost value. Setting it too high would

lead to large oscillations which might even push the learning outs ide the

optimal region. The best way is to monitor the first several iterations and

munotes.in

## Page 59

59

set the learning rate to be higher than the best performing one, but not too

high to cause instability.

Fig. 20 Learning Rate

A big advantage of SGD is that the time take n to compute a weight update

doesn’t grow with the number of training examples as each update is

computed after observing a batch of samples which is independent of the

total number of training examples. Theoretically, for a convex problem,

BGD makes the e rror rate O(1/k) after k iterations whereas SGD makes

it O(1/√k). However, SGD compensates for this with its advantages after a

few iterations along with the ability to make rapid updates in the case of a

large training set.

Momentum : The momentum algorith m accumulates the exponentially

decaying moving average of past gradients (called as velocity ) and

uses it as the direction in which to take the next step. Momentum is

given by mass times velocity, which is equal to velocity if we’re using

unit mass. The m omentum update is given by:

Momentum weight update

The step size (earlier equal to learning rate * gradient) now depends on

how large and aligned the sequence of gradients are. If the gradients at

each iteration point in the same direction (say g), it will lead to a higher

value of the step size as they just keep accumulating. Once it reaches a

constant (terminal) velocity, the step size becomes ϵ || g|| / (1 — α). Thus,

using α as 0.9 makes the speed 10 times. Common values of α are 0.5, 0.9

and 0.99 .

munotes.in

## Page 60

60

Fig. 21 Momentum

Momentum aims primarily to solve two problems: poor conditioning of the

Hessian matrix and variance in the stochastic gradient. Here, we illustrate

how momentum overcomes the first of these two problems. The contour

lines depict a quad ratic loss function with a poorly conditioned Hessian

matrix. The red path cutting across the contours indicates the path followed

by the momentum learning rule as it minimizes this function. At each step

along the way, we draw an arrow indicating the step that gradient descent

would take at that point. We can see that a poorly conditioned quadratic

objective looks like a long, narrow valley or canyon with steep sides.

Momentum correctly traverses the canyon lengthwise, while gradient steps

waste time movin g back and forth across the narrow axis of the canyon.

Viewing it as the Newtonian dynamics of a particle sliding down a hill, the

momentum algorithm consists of solving a set of differential equations via

numerical simulation. There are two kinds of forc es involved as shown

below:

Fig. 22 Momentum (forces)

Momentum can be seen as two forces operating together. 1) Proportional

to the negative of the gradient such that whenever it descends a steep part

of the surface, it gathers speed and continues slidin g in that direction until

it goes uphill again. 2) A viscous drag force (friction) proportional to -

v(t) without the presence of which the particle would keep oscillating back

and forth as the negative of the gradient would keep forcing it to move

downhill . Viscous force is suitable as it is weak enough to allow the

gradient to cause motion and strong enough to resist any motion if the

gradient doesn’t justify moving

munotes.in

## Page 61

61

Nesterov Momentum : This is a slight modification of the usual

momentum equation. Here, the gradient is calculated after applying the

current velocity to the parameters, which can be viewed as adding a

correction factor:

Nesterov momentum weight update

The intuition behind Nesterov momentum is that upon being at a point θ in

the parameter space, the momentum update is going to shift the point

by αv. So, we are soon going to end up in the vicinity of (θ + αv). Thus, it

might be better to compute the gradient from that point onward. The figure

below describes this visually:

Fig. 23 Nesterov momentum weight update

3.5 SUMMARY

Feedforward networks continue to have unfulfilled potential. In the future,

we expect they will be applied to many more tasks, and that advances in

optimization algorithms and model design will improv e their performance

even further. In This unit, firstly described the neural network family of

models. This module introduced the basic concepts of generalization,

underfitting, overfitting, bias, variance and regularization. In Second part,

we describe re gularization in more detail, focusing on regularization

strategies for deep models or models that may be used as building blocks to

form deep models. In second part described most of the general strategies

used to regularize neural networks. In third part begin with a description of

how optimization used as a training algorithm for a machine learning task

differs from pure optimization. We then define several practical algorithms,

including both optimization algorithms themselves and strategies for

initiali zing the parameters. We have now described the basic family of

neural network models and how to regularize and optimize them. In the

chapters ahead, we turn to specializations of the neural network family, that

allow neural networks to scale to very large sizes and process input data

that has special structure. The optimization methods discussed in this

chapter are often directly applicable to these specialized architectures with

little or no modification.

munotes.in

## Page 62

62

3.6 LIST OF REFERENCES

Ackley, D. H., Hinton, G. E. , and Sejnowski, T. J. (1985). A learning

algorithm for Boltzmann machines. Cognitive Science, 9, 147 –169.

Alain, G. and Bengio, Y. (2013). What regularized auto -encoders

learn from the data generating distribution. In ICLR’2013,

arXiv:1211.4246.

Baldi, P. and Hornik, K. (1989). Neural networks and principal

component analysis: Learning from examples without local minima.

Neural Networks, 2, 53 –58.

Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent

networks. ArXiv e -prints.

Bengio, Y., Lambl in, P., Popovici, D., and Larochelle, H. (2007).

Greedy layer -wise training of deep networks. In NIPS’2006 .

Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels

for semi -supervised learning. In S. Becker, S. Thrun, and K.

Obermayer, editors , Advances in Neural Information Processing

Systems 15 (NIPS’02), pages 585 –592, Cambridge, MA. MIT Press.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun,

Y. (2014). The loss surface of multilayer networks.

Dauphin, Y., Pascanu, R., Gulc ehre, C., Cho, K., Ganguli, S., and

Bengio, Y. (2014). Identifying and attacking the saddle point problem

in high -dimensional non -convex optimization. In NIPS’2014.

Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural

neural networks. In Advan ces in Neural Information Processing

Systems , pages 2062 –2070. 320

E. Aljalbout, V. Golkov, Y. Siddiqui, and D. Cremers. Clustering with

deep learning: Taxonomy and new methods. arXiv:1801.07648, 2018.

https://arxiv.org/abs/1801.07648

R. Al -Rfou, B. Perozzi, and S. Skiena. Polyglot: Distributed word

representations for multilingual nlp. arXiv:1307.1662, 2013.

https://arxiv.org/abs/1307.1662

UNIT END EXERCISES

Define and explain Deep Networks with example?

Describe Deep feedforward network with its types.

What Simple Deep Neural Network? Explain with Example.

How to compute Deep Neural Network. Explain.

Explain Gradient -Based Learning?

Define Regularizaion wit example?

Compare L1 Regularization and L2 Regularization?

Define Underfitting and overfitting.

What is Dropou t. Explain in details?

Define and Explain Data Augmentation.

Explain Local Minima with Diagram. munotes.in

## Page 63

63

Explain Momentum. Give details.

Define Plateaus, Saddle Points and Other Flat Regions . Explain

with Diagram.

Explain Challenges in Neural Network Optimization .

BIBLIOGRAPHY

Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep

Learning. MIT Press. ISBN 978 -0-26203561 -3. Archived from the

original on 2016 -04-16. Retrieved 2021 -05-09, introductory textbook.

Charu C. Aggarwa, Neural Networks and Deep Learning, ISBN 978 -

3-319-94462 -3 IBM T. J. Watson Research Center, International

Business Machines, Yorktown Heig hts, NY, USA.

R. Ahuja, T. Magnanti, and J. Orlin. Network flows: Theory,

algorithms, and applications. Prentice Hall, 1993.

*****

munotes.in

## Page 64

64

UNIT II I

4

CONVOLUTIONAL NEURAL NETWORK

Unit Structure

4.0 Objectives

4.1 Introduction

4.2 What is Convolutional Neural Network

4.3 Why ConvNets over Feed -Forward Neural Nets?

4.4 Convolutional Operation

4.5 Pooling

4.6 Data Types

4.7 Convolution Algorithms

4.8 Relation of Convolutional Network with Deep Learning

4.9 Difference between CNN and RNN

4.10 Conclusion

Exercise

4.0 OBJECTIVES:

In this chapter the student will learn about:

Convolution concept

Convolution Operations

Convents over Feed -Forward Neural Nets

Examples

Applications

4.1 INTRODUCTION

Artificial Intelligence has been witnessing a monumental growth in

bridging the gap between the capabilities of humans and machines.

Researchers and enthusiasts alike, work on numerous aspects of the field to

make amazing things happen. One of many such areas is the domain of

Computer Vision. A Convolutional Neural Network (ConvNet/CNN) is

a Deep Learning algorithm which can take in an input image, assign

importan ce (learnable weights and biases) to various aspects/objects in the

image and be able to differentiate one from the other. The pre -processing

required in a ConvNet is much lower as compared to other classification

algorithms. While in primitive methods fil ters are hand -engineered, with

enough training, ConvNets can learn these filters/characteristics.

munotes.in

## Page 65

65

4.2 WHAT IS CONVOLUTIONAL NEURAL NETWORK?

Convolutional Neural Network is one of the main categories to do image

classification and image recognition in ne ural networks. Scene labelling,

objects detections, and face recognition, etc., are some of the areas where

convolutional neural networks are widely used.

As shown in fig. 4.1, CNN takes an image as input, which is classified and

process under a certain c ategory such as car, truck, van, etc. The computer

sees an image as an array of pixels and depends on the resolution of the

image. Based on image resolution, it will see as h * w * d , where h= height

w= width and d= dimension. For example, An RGB image is 6 * 6 *

3 array of the matrix, and the grayscale image is 4 * 4 * 1 array of the

matrix.

Fig. 4.1: Convolution Neural Network

In CNN, each input image will pass through a sequence of convolution

layers along with pooling, fully connected layers, filters (Also known as

kernels). After that, we will apply the Soft -max function to classify an

object with probabilistic values 0 and 1.

4.2.1 Convolution Layer :

Convolution layer is the first layer to extract features from an input image.

By learning image features using a small square of input data, the

convolutional layer preserves the relationship between pixels. It is a

mathematical operation which takes two inputs such as image matrix and a

kernel or filter.

o The dimension of the image matrix is h×w×d .

o The dimension of the filter is fh×fw×d.

o The dimension of the output is (h-fh+1)×(w -fw+1)×1 .

munotes.in

## Page 66

66

Let's start with consideration a 5*5 image whose pixel values are 0, 1, and

filter matrix 3*3 as:

The convolution of 5*5 image matrix multiplies with 3*3 filter matrix is

called " Features Map " and show as an output.

Convolution of an image with different filters can perform an operation

such as blur, sharpen, and edge detection by applying filters.

4.2.2 STRIDES

Stride is the number of pixels which are shift over the input matrix. When

the stride is equalled to 1, then we move the filters to 1 pixel at a time and

similarly, if the stride is equalled to 2, then we move the filters to 2 pixels

at a time. The following figure shows that the convolution would work

with a stride of 2.

munotes.in

## Page 67

67

Fig. 4.2: Convolutional Strides

4.2.3 Padding

Padding plays a crucial role in building the convolutional neural network.

If the image will get shrink and if we will take a neural network with 100's

of layers on it, it will give us a small image after filtered in the end.

If we take a three by three filter on top of a grayscale image and do the

convolving then what will happen?

Fig. 4.3: Convolutional Padding

It is clear from the above picture that the pixel in the corner will only get

covers one time, but the middle pixel will get covered more than once. It

means that we have more information on that middle pixel, so there are

two downsides:

o Shrinking outputs

o Losing information on the corner of the image.

To overcome this, we have introduced padding to an image. "Padding is

an additional layer which can add to the border of an image."

munotes.in

## Page 68

68

4.3 WHY CONVNETS OVER FEED -FORWARD NEURAL NETS?

The architecture of a ConvNet is inspired by the organisation of the Visual

Cortex and is akin to the connectivity pattern of Neurons in the Human

Brain. Individual neurons can only respond to stimuli in a small area of the

visual field called the Receptive Field. A group of similar fields will

encompass the full visual region if th ey overlap.

Fig 4.4: Flattening of a 3x3 image matrix into a 9x1 vector

An image is nothing, but a matrix of pixel values as shown in fig. 4.4. In

cases of extremely basic binary images, the method might show an average

precision score while performing prediction of classes but would have little

to no accuracy when it comes to complex images having pixel

dependencies throughout.

A ConvNet is able to successfully capture the Spatial and Temporal

dependencies in an image through the application of relevan t filters. The

architecture performs a better fitting to the image dataset due to the

reduction in the number of parameters involved and reusability of weights.

In other words, the network can be trained to understand the sophistication

of the image better .

Fig 4.5: 4x4x3 RGB Image

munotes.in

## Page 69

69

In the figure 4.5, there is an RGB image which has been separated by its

three color planes — Red, Green, and Blue. There are several such color

spaces in which images exist — Grayscale , RGB, HSV, CMYK, etc. You

can imagine how computationally hard things will get once the photos

exceed 8K (76804320) dimensions. The ConvNet's job is to compress the

images into a format that is easier to process while preserving elements that

are importan t for obtaining a decent prediction. This is critical for

designing an architecture that is capable of learning features while also

being scalable to large datasets.

4.4 CONVOLUTIONAL OPERATION

Convolution is a specialized kind of linear operation. Convo lution is an

operation on two functions of a real - valued argument. To motivate the

definition of convolution, we start with examples of two functions we

might use. Convnets are simply neural networks that use convolution in

place of general matrix multipl ication in at least one of their layers.

4.4.1 Convolution Kernels :

A kernel is a small 2D matrix whose contents are based upon the

operations to be performed. A kernel maps on the input image by simple

matrix multiplication and addition, the output obta ined is of lower

dimensions and therefore easier to work with.

Fig 4.6: Kernel types

To smoothen the image before processing, Sharpen image(enhance the

depth of edges) and edge detection the above example shown in fig 4.6.

The shape of a kernel is heavily dependent on the input shape of the image

and architecture of the entire network, mostly the size of kernels

is (MxM) i.e a square matrix. The movement of a kernel is always from

left to right and top to bottom.

munotes.in

## Page 70

70

Fig. 4.7: Kernel Movement

As discussed above the stride defines for example, a stride of 1 causes the

kernel to slide by one row/column at a time, whereas a stride of 2 causes

the kernel to slide by two rows/columns.

the input matrix has shape 4x4x1 and the kernel is of size 3x 3 since the

shape of input is larger than the kernel, we are able to implement a sliding

window protocol and apply the kernel over entire input. First entry in the

convoluted result is calculated as:

45*0 + 12*( -1) + 5*0 + 22*( -1) + 10*5 + 35*( -1) + 88*0 + 26*(-1) +

51*0 = -45

munotes.in

## Page 71

71

4.4.2 Sliding window protocol:

1. The kernel gets into position at the top -left corner of the input matrix.

2. Then it starts moving left to right, calculating the dot product and

saving it to a new matrix until it has reached the last column.

3. Next, kernel resets its position at first column but now it slides one row

to the bottom. Thus following the fashion left -right and top -bottom.

4. Steps 2 and 3 are repeated till the entire input has been processed.

The kernel will move from front to back, left to right, and top to bottom for

a 3D input matrix. You should have a fundamental knowledge of the

Convolution operation, which is the heart of a Convolutional Neural

Network by now.

4.5 POOLING

Pooling layer plays an important role in pre -processing of an image.

Pooling layer reduces the number of parameters when the images are too

large. Pooling is " downscaling " of the image obtained from the previous

layers. It can be compared to shrinking an image to reduce its pixel density.

Spatial pooling is also called downsampling or subsampling, which reduces

the dimensionality of each map but retains the important infor mation.

There are the following types of spatial pooling:

Max Pooling :

Max pooling is a sample -based discretization process . Its main objective

is to downscale an input representation, reducing its dimensionality and

allowing for the assumption to be mad e about features contained in the sub -

region binned.

Max pooling is done by applying a max filter to non -overlapping sub -

regions of the initial representation.

munotes.in

## Page 72

72

Average Pooling :

Down -scaling will perform through average pooling by dividing the input

into rectangular pooling regions and computing the average values of each

region.

layer = averagePooling2dLayer(poolSize)

layer = averagePooling2dLayer(poolSize,Name,Value)

Sum Pooling :

The sub -region for sum pooling or mean pooling are set exactly the same

as for max-pooling but instead of using the max function we use sum or

mean.

4.6 DATA TYPES

The data used with a convolutional network usually consist of several

channels, each channel being the observation of a different quantity at

some point in space or time. One advantage to convolutional networks is

that they can also process inputs with var ying spatial extents. These kinds

of input simply cannot be represented by traditional, matrix multiplication -

based neural networks. This provides a compelling reason to use

convolutional networks even when computational cost and overfitting are

not signif icant issues. For example, consider a collection of images in

which each image has a different width and height. It is unclear how to

model such inputs with a weight matrix of fixed size. Convolution is

straightforward to apply; the kernel is simply applie d a different number of

times depending on the size of the input, and the output of the convolution

operation scales accordingly. Convolution may be viewed as matrix

multiplication; the same convolution kernel induces a different size of

munotes.in

## Page 73

73

doubly block circu lant matrix for each size of input. Sometimes the output

of the network as well as the input is allowed to have variable size, for

example, if we want to assign a class label to each pixel of the input. In

this case, no further design work is necessary. In other cases, the network

must produce some fixed -size output, for example, if we want to assign a

single class label to the entire image. In this case, we must make some

additional design steps, like inserting a pooling layer whose pooling

regions scale i n size proportional to the size of the input, to maintain a

fixed number of pooled outputs.

Single channel Multichannel 1-D Audio waveform: The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step. Skeleton animation data: Animations of 3-D computer-rendered characters are generated by altering the pose of a “skeleton” over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character’s skeleton 2-D Audio data that has been pre-processed with a Fourier transform: We can transform the audio waveform into a 2-D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution in the time makes the model equivariant to shifts in time. Color image data: One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and the vertical axes of the image, conferring translation equivariance in both directions 3-D Volumetric data: A common source of this kind of data is medical imaging technology, such as CT Color video data: One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.

4.7 CONVOLUTION ALGORITHMS:

Convolution is equivalent to converting both the input and the kernel to the

frequency domain using a Fourier transform, performing point -wise

multiplication of the two signals, and converting back to the time domain

using an inverse Fourier transform. For some problem sizes, this can be

faster than the naive implementation of discrete convolution. When a d -

dimensional kernel can be expressed as the outer product of d vectors, one

vector per dimension, the kernel is called separable. When the kernel is munotes.in

## Page 74

74

separable, naive convolution is inefficient. It is equivalent to compose d

one-dimensional convolutions with each of these vectors. The composed

approach is significantly faster than performing one d -dimensional

convolution with their outer product. The kernel also takes fewer

parameters to represent as vectors. If the kernel is w elements wide in each

dimension, then naive multidimensional convolution requires O(wd )

runtime and parameter storage space, while separable convolution requires

O(w × d) runtime and parameter storage space.

4.8 RELATION OF CONVOLUTIONAL NETWORK WITH DEEP LEARNING

A Convolutional Neural Network (ConvNet /CNN) is a Deep Learning

algorithm which can take in an input image, assign importance (learnable

weights and biases) to various aspects/objects in the image and be able to

differentiate one from the other. The pre -processing required in a ConvNet

is much lower as compared to other classification algorithms. While in

primitive methods filters are hand -engineered, with enough training,

ConvNets can learn these filters/characteristics. The architecture of a

ConvNet is analogous to that of the connectivity pat tern of Neurons in the

Human Brain and was inspired by the organization of the Visual Cortex.

Individual neurons respond to stimuli only in a restricted region of the

visual field known as the Receptive Field. A collection of such fields

overlaps to cover the entire visual area.

4.9 DIFFERENCE BETWEEN CNN AND RNN S.no CNN RNN 1 CNN stands for Convolutional Neural Network. RNN stands for Recurrent Neural Network. 2 CNN is more potent than RNN. RNN includes less feature compatibility when compared to CNN. 3 CNN is ideal for images and video processing. RNN is ideal for text and speech Analysis. 4 It is suitable for spatial data like images. RNN is used for temporal data, also called sequential data. 5 The network takes fixed-size inputs and generates fixed size outputs. RNN can handle arbitrary input/ output lengths. munotes.in

## Page 75

75

6 CNN is a type of feed-forward artificial neural network with variations of multilayer perceptron's designed to use minimal amounts of pre-processing. RNN, unlike feed-forward neural networks- can use their internal memory to process arbitrary sequences of inputs.

4.10 CONCLUSION

In this chapter we learned about the fundamental concept of neural network

with its application for classifying the image. A Convolutional Neural

Network ( ConvNet/CNN) is a Deep Learning algorithm which can take in

an input image, assign importance (learnable weights and biases) to various

aspects/objects in the image and be able to differentiate one from the other.

Pooling layer reduces the number of parame ters when the images are too

large. Convolution is straightforward to apply; the kernel is simply applied

a different number of times depending on the size of the input, and the

output of the convolution operation scales accordingly. Convolution may

be vie wed as matrix multiplication; the same convolution kernel induces a

different size of doubly block circulant matrix for each size of input. CNN

is a type of feed -forward artificial neural network with variations of

multilayer perceptron's designed to use m inimal amounts of pre -processing.

RNN, unlike feed -forward neural networks - can use their internal memory

to process arbitrary sequences of inputs.

EXERCISE

1. What is convolution neural network? How it is different from neural

network.

2. Explain the mechanism of convolution neural network.

3. What is Pooling? Explain the role of pooling.

4. Explain the working of MAX and Average Pooling.

4. Explain different types of data types.

6. Write a note on Convolution algorithm.

7. Give comparison between recurrent neural network and convolutional

neural network.

REFERENCES

1. Ian Goodfellow, Yoshua Bengio, Aaron Courvile, Deep Learning,

MIT Press, 2016

2. Nikhil Buduma, Fundamentals of Deep Learning, O’Reilly,

*****

munotes.in

## Page 76

76

5

SEQUENCE MODELLING

Unit Structure

5.0 Objectives

5.1 Introduction

5.2 Auto -Completion

5.2.1 Parts of Speech Tagging

5.2.2 Sequence Classification

5.3 Unfolding Computational Graphs

5.4 Recurrent Neural Networks

5.5 Types of RNNs

5.6 Natural Language Processing and Word Embeddings

5.6.1 Introduction to Word Embeddings

5.6.2 Learning Word Embeddings: Word2vec

5.6.3 Applications using Word Embeddings

5.7 Conclusion

Unit End Exercise

5.0 OBJECTIVES

In this chapter the student will learn about:

Recurrent neural networks

Use of Sequence modelling

Applications of sequence modelling

5.1 INTRODUCTION

Having a solid grasp on deep learning techniques feels like acquiring a

super power these days. From classifying images and translating

languages to building a self -driving car, all these tasks are being driven by

computers rather than manual human effort. Deep learning has penetrated

multiple and diverse industries, and it continues to break new gr ound on

an almost weekly basis. Sequence Modelling is the task of predicting what

word/letter comes next. Unlike the FNN and CNN, in sequence modelling,

the current output is dependent on the previous input and the length of the

input is not fixed. The abi lity to predict what comes next in a sequence is

fascinating. Sequence models, in supervised learning, can be used to

address a variety of applications including financial time series prediction,

speech recognition, music generation, sentiment classificati on, machine

translation and video activity recognition. The obvious question that

always pop -ups that, Why not a standard network?. We can say that munotes.in

## Page 77

77

Traditional feedforward neural networks do not share features across

different positions of the network. In other words, these models assume

that all inputs (and outputs) are independent of each other. This model

would not work in sequence prediction since the previous inputs are

inherently important in predicting the next output. For example, if you

were predic ting the next word in a stream of text, you would want to know

at least a couple of words before the target word.

5.2 AUTO -COMPLETION

Auto -completion is a feature for predicting the rest of a que ry a user is

typing, which can improve the user search experience and accelerate the

shopping process before checkout. It can also improve the search response

quality and thus create higher revenue by providing well -formatted

queries.

5.2.1 Parts of Speec h Tagging :

Part-of-Speech tagging is a well -known task in Natural Language

Processing . It refers to the process of classifying words into their parts of

speech (also known as words classes or lexical categories). This is a

supervised learning approach. Parts of speech tags are the properties of the

words, which define their main context, functions, and usage in a sentence.

Some of the commonly used parts of speech tags are

Nouns : Which defines any object or entity

Verbs : That defines some action.

Adjectives and Adverbs : This acts as a modifier, quantifier, or intensifier

in any sentence.

Further, Has and purchased belong to the verb indicating that they are the

actions. The Laptop and Apple store are the nouns. New is the adjective

whose role is to modify the context of the laptop. Parts of speech tags are

defined by the relationship of words with the other words in the sentence.

5.2.2 Sequence Classification :

Sequence classification is a predictive modeling problem where you have

some sequence of inputs over space or time and the task is to predict a

munotes.in

## Page 78

78

category for the sequence. Few examples where sequence models are used

in real -world scenarios.

Speech recognition:

Here, the input is an audio clip and the model has to produce the text

transcript. The audio is considered a sequence as it plays over time. Also,

the transcript is a sequence of words.

Sentiment Classification:

Another popular application of sequence models. We pass a text sentence

as input and the model has to predict the sentiment of the sentence

(positive, negative, angry, elated, etc.). The output can also be in the form

of ratings or stars.

DNA sequence analysis:

Given a DNA sequence as input, we want our model to predict which part

of the DNA belongs to which protein.

Machine Translation:

We input a sentence in one language, say French, and we want our model

to convert it into another language, say English. Here, both the input and

the output are sequences:

Video activity recognition:

This is actually a very upcoming (and current trending) use of sequence

models. The model predicts what activity is going on in a given video.

Here, the input is a sequence of frames.

Modelling Sequence Le arning Problems

munotes.in

## Page 79

79

5.3 UNFOLDING COMPUTATIONAL GRAPHS

A computational graph is a way to formalize the structure of a set of

computations, such as those involved in mapping inputs and

parameters to outputs and loss. we explain the idea of unfolding a

recursive or recurrent computation into a computational graph that

has a repetitive structure, typically corresponding to a chain of

events. Unfolding this graph results in the sharing of parameters

across a deep network structure.

• Each node represents the state at some time t, and the function f maps

the state at t to the state at t+ 1.

• The same parameters (the same value of θ used to parametrize f) are

used for all time steps.

• These cycles represent the influence of the present value of a variable

on its own value at a future time step. Such computational graphs

allow us to define recurrent neural networks. We then describe many

different ways to construct, train, and use recurrent neural networks.

A computational graph is a way to formalize the structure of a set of

computations, such as those involved in mapping inputs and

parameters to outputs and loss.

• We explain the idea of unfolding a recur sive or recurrent computation

into a computational graph that has a repetitive structure, typically

corresponding to a chain of events.

• Unfolding this graph results in the sharing of parameters across a deep

network structure.

A recurrent network with n o outputs. This recurrent network just processes

information from the input x by incorporating it into the state h that is

passed forward through time. (Left) Circuit diagram. The black square

indicates a delay of a single time step. (Right) The same net work seen as

an unfolded computational graph, where each node is now associated

with one particular time instance. Recurrent neural networks can be built

in many di ﬀerent ways. Much as almost any function can be considered a

f

h

h(t—1)

h(t)

h(t+1)

x(t—1)

x(t)

x(t+

1)

h(... )

h(... )

f

Unfold

f

f

f munotes.in

## Page 80

80

feedforward neural network, e ssentially any function involving recurrence

can be considered a recurrent neural network. When the recurrent network

is trained to perform a task that requires predicting the future from the

past, the network typically learns to use h(t) as a kind of loss y summary of

the task -relevant aspects of the past sequence of inputs up to t.

5.4 RECURRENT NEURAL NETWORKS

Recurrent Neural Networks are used to learn mapping from X to Y, when

either X or Y, or both X and Y, are some sequences. But it is not use as a

standard neural network for these sequence problems? Some examples of

important design patterns for recurrent n eural networks include the

following: Recurrent networks that produce an output at each time step

and have recurrent connections between hidden units, illustrated in figure

There are primarily two problems with this:

1 Inputs and outputs do not have a fixed length, i.e., some input

sentences can be of 10 words while others could be <> 10. The same

is true for the eventual output

2 We will not be able to share features learned across different positions

of text if we use a standard neural network

We need a representation that will help us to parse through different

sentence lengths as well as reduce the number of parameters in the model.

This is where we use a recurrent neural network. This is how a typical

RNN looks like:

munotes.in

## Page 81

81

A RNN takes the first word (x <1>) and feeds it into a neural network layer

which predicts an output (y’<1>). This process is repeated until the last

time step x which generates the last output y’. This is the

network where the number of words in input as well as the output are

same. The RNN scans through the data in a left to right sequence. Note

that the parameters that the RNN uses for each time step are shared. We

will have parameters shared between each input and hidden layer (Wax),

every timestep (Waa) and between the hidd en layer and the output (Wya).

So if we are making predictions for x<3>, we will also have information

about x<1> and x<2>. A potential weakness of RNN is that it only takes

information from the previous timesteps and not from the ones that come

later. Thi s problem can be solved using bi -directional RNNs which we

will discuss later. For now, let’s look at forward propagation steps in a

RNN model:

a<0> is a vector of all zeros and we calculate the further activations

similar to that of a standard neural network:

a<0> = 0

a<1> = g(W aa * a<0> + W ax * x<1> + b a)

y<1> = g’(W ya * a<1> + b y)

Similarly, we can calculate the output at each time step. The generalized

form of these formulae can be written as:

We horizontally stack W aa and W ya to get W a. a and x are stacked

vertically. Rather than carrying around 2 parameter matrices, we now have

just 1 matrix. And that, in a nutshell, is how forward propagation works

for recurrent neural networks.

5.5 TYPES OF RNNS

We can have different types of RNNs to deal with use cases where the

sequence length differs. These problems can be classified into the

following categories:

Many -to-many :

The name entity recognition examples we saw earlier fall under this

category. We have a sequence of words, and for each word, we have to

munotes.in

## Page 82

82

predict whether it is a name or not. The RNN architecture for such a

problem looks like this:

For every input word, we predict a corresponding output word.

Many -to-one:

Consider the sentiment classification problem. We pass a sentenc e to the

model and it returns the sentiment or rating corresponding to that sentence.

This is a many -to-one problem where the input sequence can have varied

length, whereas there will only be a single output. The RNN architecture

for such problems will loo k something like this:

Here, we get a single output at the end of the sentence.

One-to-many :

Consider the example of music generation where we want to predict the

lyrics using the music as input. In such scenarios, the input is just a single

word (or a single integer), and the output can be of varied length. The

RNN architecture for this type of problems looks like the below:

There is one more type of RNN which is popularly used in the industry.

Consider the machine translation application where we take an input

sentence in one language and translate it into another language. It is a

munotes.in

## Page 83

83

many -to-many problem but the length of the input sequence might or

might not be equal to the length of output sequence.

In such cases, we have an encoder part and a de coder part. The encoder

part reads the input sentence and the decoder translates it to the output

sentence:

5.6 NATURAL LANGUAGE PROCESSING AND WORD EMBEDDINGS

Natural language processing with deep learning is an important

combination. Using word vector representations and embedding layers you

can train recurrent neural networks with outstanding performances in a

wide variety of industries. Examples of applications are sentiment

analysis, named entity recognition and machine translation.

5.6.1 Introduction to Word Embeddings :

Word Representation :

One of the weaknesses of one -hot representation is that it treats each

word as a thing unto itself, and it doesn't allow an algorithm to easily

generalize across words.

o Because the any product between any two different one -hot vector is

zero.

o It doesn't know that somehow apple and orange are much more

similar than king and orange or queen and orange.

Instead we can learn a futurized representation.

o But by a lot of the features of apple and orange are ac tually the same,

or take on very similar values. And so, this increases the odds of the

learning algorithm that has figured out that orange juice is a thing, to

also quickly figure out that apple juice is a thing.

o The features we'll end up learning, won't have a easy to interpret

interpretation like that component one is gender, component two is

royal, component three is age and so on. What they're representing

will be a bit harder to figure out.

o But nonetheless, the featurized representations we will learn , will

allow an algorithm to quickly figure out that apple and orange are

more similar than say, king and orange or queen and orange.

munotes.in

## Page 84

84

features\words Man (5391) Woman (9853) King (4914) Queen (7157) Apple (456) Orange (6257) Gender -1 1 -0.95 0.97 0.00 0.01 Royal 0.01 0.02 0.93 0.95 -0.01 0.00 Age (adult?) 0.03 0.02 0.7 0.69 0.03 -0.02 Food 0.09 0.01 0.02 0.01 0.95 0.97 Size ... ... ... ... ... ... ... ... ... ... ... ... ...

Using word embeddings :

Word embeddings tend to make the biggest difference when the task

you're trying to carry out has a relatively smaller training set.

Useful for NLP standard tasks.

Named entity recognition

Text summarization

Co-reference

Parsing

5.6.2 Learning Word Embeddings: Word2vec :

Learning word embeddings:

In the history of deep learning as applied to learning word

embeddings, people actually started off with relatively complex

algorithms. And then over time, researchers discovered they can use

simpler and simpler and simpler algori thms and still get very good

results especially for a large dataset.

A more complex algorithm: a neural language model, by Yoshua

Bengio, Rejean Ducharme, Pascals Vincent, and Christian Jauvin: A

Neural Probabilistic Language Model .

o Let's start to build a neural network to predict the next word in the

sequence below.

o I want a glass of orange ______.

o 4343 9665 1 3852 6163 6257

munotes.in

## Page 85

85

o If we have a fixed historical window of 4 words (4 is a

hyperparameter), then we take the four embedding vectors and

stack them together, and feed them into a neural network, and

then feed this neural network output to a softmax, and the

softmax classifies among the 10,000 possible outputs in the vocab

for the final word we're trying to predict. These two layers have

their own parameters W1,b1 and W2, b2.

o This is one of the earlier and pretty successful algorithms for

learning word embeddings.

A more genera lized algorithm.

o We have a longer sentence: I want a glass of orange juice to go along with

my cereal . The task is to predict the word juice in the middle.

o If it goes to build a language model then is natural for the context to be

a few words right before the target word. But if your goal isn't to learn

the language model per se, then you can choose other contexts.

o Contexts:

Last 4 words: descibed previously.

4 words on left & right: a glass of orange ___ to go along with

Last 1 word: orange , much more simp ler context.

Nearby 1 word: glass. This is the idea of a Skip -Gram model, which works

surprisingly well.

o If your main goal is really to learn a word embedding, then you can use all

of these other contexts and they will result in very meaningful work

embedd ings as well.

5.6.3 Applications using Word Embeddings:

Sentiment Classification :

Comments Stars The dessert is excellent. 4 Service was quite slow. 2 Good for a quick meal, but nothing special. 3 Completely lacking in good taste, good service, and good ambience. 1

munotes.in

## Page 86

86

A simple sentiment classification model :

So one of the challenges of sentiment classification is you might not have a

huge label data set.

If this was trained on a very large data set, like a hundred billion words, then

this allows you to take a lot of knowledge even from infrequent words and

apply them to your problem, even words that weren't in your labeled training

set.

Notice that by usin g the average operation here, this particular algorithm

works for reviews that are short or long because even if a review that is 100

words long, you can just sum or average all the feature vectors for all

hundred words and so that gives you a representati on, a 300 -dimensional

feature representation, that you can then pass into your sentiment classifier.

One of the problems with this algorithm is it ignores word order .

o "Completely lacking in good taste, good service, and good ambiance".

o This is a very negat ive review. But the word good appears a lot.

A more sophisticated model :

Instead of just summing all of your word embeddings, you can instead use a

RNN for sentiment classification.

munotes.in

## Page 87

87

o In the graph, the one -hot vector representation is skipped.

o This is an example of a many -to-one RNN architecture.

Debiasing word embeddings :

Word embeddings maybe have the bias problem such as gender bias,

ethnicity bias and so on. As word embeddings can learn analogies like

man is to woman like king to queen. The paper sh ows that a learned word

embedding might output:

Man: Computer_Programmer as Woman: Homemaker

Learning algorithms are making very important decisions and so I think

it's important that we try to change learning algorithms to diminish as

much as is possible, or, ideally, eliminate these types of undesirable

biases.

Identify bias direction :

o The first thing we're going to do is to identify the direction corresponding to

a particular bias we want to reduce or eliminate.

o And take a few of these differences and basically average them. And this

will allow you to figure out in this case that what looks like this direction is

the gender direction, or the bias direction. Suppose we have a 50 -

dimensional word embedding.

g1 = e she - ehe

g2 = e girl - eboy

g3 = e mother - efather

g4 = e woman - eman

o g = g 1 + g 2 + g 3 + g 4 + ... for gender vector.

o Then we have

cosine_similarity(sophie, g)) = 0.318687898594

cosine_similarity(john, g)) = -0.23163356146

to see male names tend to have positive similarity with gender vector

wherea s female names tend to have a negative similarity. This is acceptable.

o But we also have

cosine_similarity(computer, g)) = -0.103303588739

cosine_similarity(singer, g)) = 0.185005181365

It is astonishing how these results reflect certain unhealthy gender

stereotypes.

o The bias direction can be higher than 1 -dimensional. Rather than taking

an average, SVD (singular value decomposition) and PCA might help.

Neutralize

o For every word that is not definitional, project to get rid of bias.

munotes.in

## Page 88

88

5.7 CONCLUSION

Recurrent neural networks, or RNNs are a family of neural

networks for processing sequential data. A computational graph is a

way to formalize the structure of a set of computations, such as

those involved in mapping inputs and parameters to outputs and

loss. Traditional neural networks require the input and output

sequence lengths to be constant across all predictions.Recurrent

neural networks can be built in many di ﬀerent ways. When the

recurrent network is trained to perform a task that requires

predicti ng the future from the past. Natural language processing

with deep learning is an important combination. Using word vector

representations and embedding layers you can train recurrent neural

networks with outstanding performances in a wide variety of

indus tries.

UNIT END EXERCISE

1. What is sequence modelling? State its applications.

2. Explain the mechanism of sequence modelling.

3. Write a note on Parts of Speech.

4. Explain the classification process using sequence modelling.

5. Explain the arch itecture of RNN.

5. What is word embedding?

7. Explain different types of RNN.

REFERENCES

1. Ian Goodfellow, Yoshua Bengio, Aaron Courvile, Deep Learning,

MIT Press, 2016

2. Nikhil Buduma, Fundamentals of Deep Learning, O’Reilly, 2017

3. Shamsi Fatma Al and Guessoum Ahmed. 2005. A hidden Markov

model -based POS tagger for Arabic. In Proceedings of the 8th

International Conference on the Statistical Analysis of Textual

Data. 31–42

*****

munotes.in

## Page 89

89

6

APPLICATIONS

Unit Structure

6.0 Objectives

6.1 Introduction

6.2 Large -Scale Deep Learning

6.3 Computer Vision

6.4 Speech Recognition

6.5 Natural Language Processing

6.6 Other Applications

6.7 Summary

Exercise

References

6.0 OBJECTIVES

In this chapter the student will learn about:

how to use deep learning to solve applications in computer vision,

speech recognition, natural language processing, and other

application areas of commercial interest.

6.1 INTRODUCTION

Deep learning is based on the philosophy of connectionism. Deep

Learning is a field of Artificial Intelligence (AI) that aims to imbibe

human brain function in data processing machines. The way the human

brain works serves as the foundation of deep learning which is also called

deep ne ural learning. Deep learning (DL) has achieved promising results

on a wide spectrum of AI application domains ranging from computer

vision. Data processing is a significant field and deep learning helps in

processing vast amounts of data with the help of i dentified and verified

patterns established by the human brain. A revolutionary technique of

machine learning, deep learning has helped the field of technology

advance manifold. The rise of deep learning in AI has helped the digital

domain excel and evolve unstoppably. Numerous applications and

advantages of deep learning can further be used to understand the concept

in a better manner.

munotes.in

## Page 90

90

6.2 LARGE -SCALE DEEP LEARNING

Deep learning (DL) has achieved promising results on a wide spectrum of

AI application domain s ranging from computer vision, natural language

processing and machine translation, information retrieval and many others.

The scale is the main driver behind the rise of DL. Larger datasets and

neural networks consistently yield better performance across all tasks that

generally require more computation and longer training time. Therefore,

recent years have witnessed a surge of interests from both academia and

industry in scaling up DL with distributed training on a large cluster of

devices such as TPUs a nd GPUs with higher computation capability and

memory limit. Data parallelism has become a dominant practice for

distributed training. It distributes a large batch to multiple devices, where

each device holds an identical model replica, computes the gradie nt of a

local batch, and finally gathers the gradients at each iteration for

synchronous parameter update. With recent optimization techniques, it is

now able to train very large batches on thousands of GPU devices.

However, training at such scales require s overcoming both algorithmic

and systems related challenges. One of the main challenges is the

degradation of model accuracy with large batch size beyond a certain point

(e.g., 32k). Naively increasing the batch size typically results in

degradation of ge neralization performance and reduces computational

benefits. Additionally, we cannot always improve the training speed by

just using more processors as the communication cost is a non -negligible

overhead. Intuitively multiple processors collaboratively tra ining one task

can reduce the overall training time, but the corresponding communication

cost between processors is heavy and limits the model scalibility. Worse

still, models with tens of billions to trillions of parameters clearly do not

fit into memory of a single device, and simply adding more devices will

not help scale the training.

6.2.2 Fast CPU Implementations :

Traditionally, neural networks were trained using the CPU of a single

machine. Today, this approach is generally considered insufficient. We

now mostly use GPU computing or the CPUs of many machines

networked together. Before moving to these expensive setups, researchers

worked hard to demonstrate that CPUs could not manage the high

computational workload required by neural networks. Each n ew model of

CPU has different performance characteristics, so sometimes floating -

point implementations can be faster too. The important principle is that

careful specialization of numerical computation routines can yield a large

payoff.

6.2.3 GPU Implemen tations :

Graphics processing units (GPUs) are specialized hardware components

that were originally developed for graphics applications. The consumer munotes.in

## Page 91

91

market for video gaming systems spurred development of graphics

processing hardware. The performance chara cteristics needed for good

video gaming systems turn out to be beneficial for neural networks as

well. GPU hardware was originally so specialized that it could only be

used for graphics tasks. Over time, GPU hardware became more flexible,

allowing custom s ubroutines to be used to transform the coordinates of

vertices or assign colors to pixels.

6.2.4 Large -Scale Distributed Implementations :

Distributing inference is simple, because each input example we want to

process can be run by a separate machine. This is known as data

parallelism. It is also possible to get model parallelism, where multiple

machines work together on a single datapoint, with each machine running

a different part of the model. This is feasible for both inference and

training. Data parallelism during training is somewhat harder. We can

increase the size of the minibatch used for a single SGD step, but usually

we get less th an linear returns in terms of optimization performance. It

would be better to allow multiple machines to compute multiple gradient

descent steps in parallel.

6.3 COMPUTER VISION

Computer vision is a very broad field encompassing a wide variety of

ways of processing images, and an amazing diversity of applications.

Applications of computer vision range from reproducing human visual

abilities, such as recognizing faces, to creating entirely new categories of

visual abilities. As an example of the latter c ategory, one recent computer

vision application is to recognize sound waves from the vibrations they

induce in objects visible in a video. Most deep learning research on

computer vision has not focused on such exotic applications that expand

the realm of w hat is possible with imagery but rather a small core of AI

goals aimed at replicating human abilities. Most deep learning for

computer vision is used for object recognition or detection of some form,

whether this means reporting which object is present in an image,

annotating an image with bounding boxes around each object, transcribing

a sequence of symbols from an image, or labelling each pixel in an image

with the identity of the object it belongs to.

6.3.1 Preprocessing :

Computer vision usually requir es relatively little of this kind of pre -

processing. The images should be standardized so that their pixels all lie in

the same, reasonable range, like [0,1] or [ -1, 1]. Mixing images that lie in

[0,1] with images that lie in [0, 255] will usually result i n failure.

Formatting images to have the same scale is the only kind of pre -

processing that is strictly necessary. Many computer vision architectures

require images of a standard size, so images must be cropped or scaled to munotes.in

## Page 92

92

fit that size. Even this rescali ng is not always strictly necessary. Some

convolutional models accept variably sized inputs and dynamically adjust

the size of their pooling regions to keep the output size constant. Dataset

augmentation is an excellent way to reduce the generalization err or of

most computer vision models. A related idea applicable at test time is to

show the model many different versions of the same input (for example,

the same image cropped at slightly different locations) and have the

different instantiations of the mode l vote to determine the output.

6.3.2 Dataset Augmentation :

It is an excellent way to reduce the generalization error of most computer

vision models. A related idea applicable at test time is to show the model

many different versions of the same input ( for example, the same image

cropped at slightly different locations) and have the different instantiations

of the model vote to determine the output. Object recognition is a

classification task that is especially amenable to this form of dataset

augmentati on because the class is invariant to so many transformations

and the input can be easily transformed with many geometric operations.

As described before, classifiers can benefit from random translations,

rotations, and in some cases, flips of the input to augment the dataset. In

specialized computer vision applications, more advanced transformations

are commonly used for dataset augmentation.

6.4 SPEECH RECOGNITION

The task of speech recognition is to map an acoustic signal containing a

spoken natural lan guage utterance into the corresponding sequence of

words intended by the speaker. Let X = (x(1), x(2) , . . . , x(T) ) denote the

sequence of acoustic input vectors (traditionally produced by splitting the

audio into 20ms frames). Most speech recognition s ystems pre -process the

input using specialized hand -designed features, but some deep learning

systems learn features from raw input. Let y = (y1, y2, . . ., yN ) denote the

target output sequence (usually a sequence of words or characters). The

automatic s peech recognition (ASR) task consists of creating a function f ∗

ASR that computes the most probable linguistic sequence y given the

acoustic sequence X:

f∗ ASR(X) = arg max y P ∗ (y | X = X)

where P ∗ is the true conditional distribution relating the inputs X to the

targets y.

6.5 NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is the use of human languages, such as

English or French, by a computer. Computer programs typically read and

emit specialized languages designed to allow efficient and unambiguous munotes.in

## Page 93

93

parsing by simple programs. More naturally occurring languages are often

ambiguous and defy formal description. Natural language processing

includes applications such as machine transla tion, in which the learner

must read a sentence in one human language and emit an equivalent

sentence in another human language. Many NLP applications are based on

language models that define a probability distribution over sequences of

words, characters o r bytes in a natural language.

6.5.1 n-grams’:

An n -gram is a sequence of n tokens. Models based on n -grams define the

conditional probability of the n -th token given the preceding n − 1 tokens.

The model uses products of these conditional distributions to define the

probability distribution over longer sequences:

This decomposition is justified by the chain rule of probability. The

probability distribut ion over the initial sequenceP (x1, . . . , xn−1) may be

modeled by a different model with a smaller value of n. Training n -gram

models is straightforward because the maximum likelihood estimate can

be computed simply by counting how many times each possib le n gram

occurs in the training set.

6.5.2 Hierarchical Softmax :

A classical approach (Goodman, 2001) to reducing the computational

burden of high -dimensional output layers over large vocabulary sets V is

to decompose probabilities hierarchically. Instead of necessitating a

number of computations proportional to |V| (an d also proportional to the

number of hidden units, nh), the |V| factor can be reduced to as low as log

|V|. To predict the conditional probabilities required at each node of the

tree, we typically use a logistic regression model at each node of the tree,

and provide the same context C as input to all of these models. Because

the correct output is encoded in the training set, we can use supervised

learning to train the logistic regression models.

6.6 OTHER APPLICATIONS

The types of applications of deep lear ning that are different from the

standard object recognition, speech recognition and natural language

processing tasks discussed above.

6.6.1 Recommender Systems :

One of the major families of applications of machine learning in the

information technology secto r is the ability to make recommendations of

munotes.in

## Page 94

94

items to potential users or customers. Two major types of applications can

be distinguished: online advertising and item recommendations. Both rely

on predicting the association between a user and an item, either to predict

the probability of some action or the expected gain. If an ad is shown or a

recommendation is made regarding that product to that user. Companies

including Amazon and eBay use machine learning, including deep

learning, for their product recomme ndations.

6.6.2 Exploration Versus Exploitation :

Many recommendation problems are most accurately described

theoretically as contextual bandits. The issue is that when we use the

recommendation system to collect data, we get a biased and incomplete

view of the preferences of users: we only see the responses of users to the

items they were recommended and not to the other items. This would be

like training a classifier by picking one class yˆ for each training example

x (typically the class with the highest prob ability according to the model)

and then only getting as feedback whether this was the correct class or not.

The bandit problem is easier in the sense that the learner knows which

reward is associated with which action. In the general reinforcement

learnin g scenario, a high reward or a low reward might have been caused

by a recent action or by an action in the distant past.

6.6.3 Knowledge Representation, Reasoning and Question Answering:

Deep learning approaches have been very successful in language

modeling, machine translation and natural language processing due to the

use of embeddings for symbols and words. These embeddings represent

semantic knowledge about individual words and concepts.

6.7 SUMMARY

Deep learning has been applied to many other applications besides the

ones described here, and will surely be applied to even more after this

writing. Given larger datasets and bigger models consistently yielding

significant improvements in accuracy, large -scale deep learning has

become an inevitable trend. As da tasets increase in size and DNNs in

complexity, the computational intensity, communication cost and memory

demands of deep learning increase proportionally. Recent years have

witnessed a surge of interests from both academia and industry in scaling

up DL with distributed training on a large cluster of devices such as TPUs

and GPUs with higher computation capability and memory limit.

Eventually, knowledge of relations combined with a reasoning process and

understanding of natural language could allow us to build a general

question answering system.

munotes.in

## Page 95

95

UNIT END EXERCISE

1. Explain deep learning application with reference to Natural Language

Processing

2. Explain deep learning role in speech recognition.

3. Write a note on large scale deep learning.

4. Write a note on computer vision.

5. Give comparison between Exploration and Exploitation.

6. Explain in brief about recommender system.

REFERENCES

1. Ian Goodfellow, Yoshua Bengio, Aaron Courvile, Deep Learning,

MIT Press, 2016

2. Nikhil Buduma, Fundamentals of Deep Learning, O’Reilly, 2017

3. Shamsi Fatma Al and Guessoum Ahmed. 2006. A hidden Markov

model -based POS tagger for Arabic. In Proceedings of the 8th

International Conference on the Statistical Analysis of Textual

Data. 31–42.

*****

munotes.in

## Page 96

96

UNIT IV

7

DEEP LEARNING RESEARCH

Unit Structure

7.0 Objectives

7.1 Introduction

7.2 Linear Factor Models

7.2.1 Probabilistic PCA (Principal Component Analysis )

7.2.2 Factor Analysis

7.2.3 ICA ( Independent Component Analysis )

7.2.4 SFA ( Slow feature analysis )

7.2.5 Sparse coding

7.2.6 Manifold Interpretation of PCA

7.3 Autoencoders

7.3.1 Undercomplete Autoencoders

7.3.2 Regularized Autoencoders

7.3.3 Denoising Autoencoders

7.3.4 Applications of Autoencoders

7.4 Representation learning

7.5 Summary

Questions

References

7.0 OBJECTIVES

After going through this unit, you will be able to:

Understand and classify Linear factor Models.

Understand significance of Autoencoders

Understand representation learning and its use in deep learning

models.

7.1 INTRODUCTION

In the earlier units we have studied about supervised learning algorithms

and other models. We understand that a good accuracy can be achieved

with a good amount of labeled data, which at times doesn’t seem to exist.

Quality of data used in supervised learning algorithms is always

questionable. Thus, we look ahead for developing general models in deep

learning which can work better in absence of labeled data or supervised

data. This will also help in good applicability and ac hieve higher accuracy. munotes.in

## Page 97

97

In this chapter we shall focus on un super vised learning. Though

unsupervised learning doesn’t solve the problem as supervised learning

does, but a lot can be explored and researched. A major problem with

unsupervised learning is the high dimensionality of the problem which

eventually generate s other problems like computations, statistical

calculation s etc. M any of this can be handled with proper design and other

approaches. Let us have a look at existing unsupervised learning

approac hes in the following part .

7.2 LINEAR FACTOR MODELS

Probabilistic models use probabilistic inference to predict variables using

other given variables. A linear factor model is a probabilistic model where

X is obtained by using a linear decoder function. This linear decoder

function is created by combining linear transformation (Wh+b) with noise.

X=Wh+b+noise

Where ,

h is the latent variable,

W is the weight matrix

b denotes bias

noise , is normally distributed or diagonal.

In linear factor model, h is an explanatory factor obtained from a factorial

distribution .

h~p(h), where , p(h)= ∏𝑝(ℎ𝑖)

here all hi’s are independent .

Graphically relation between xi’s and h i’s can be represented as follows,

Figure 7.1: relation between x i’s and h i’s

We can see above how the observed variables x i are obtained from hidden

variables hi.

The probabilistic PCA (principal component analysis) , factor analysis and

ICA(Independent component analysis) models are typically the variations

h1

h2

h3

X1

X2

X3 munotes.in

## Page 98

98

of Linear factor models with different choices of noise and the distribution

of latent variable.

7.2.1 Probabilistic PCA :

Principal component analysis (PCA) is a popular technique used in

variable selection, the concept of PCA is to fin d the set of axes which

communicate the most information of the data set, thus reducing the

dimensions to work with. As we know v ariance corresponds to

information, the process in PCA is to find vectors with maximum variance

and keep repeating it till we f ind the vectors equal to the dimension of the

dataset. We select m vectors (axes) for a dataset with dimension d, such

that m

PCA is been widely used for dimensionality reduction but it fails to

capture informatio n if the relationships are nonlinear. We have

probabilistic Principal component analysis which takes advantage of the

observations that most variations in the data can be captured by the latent

variables with residual error 𝜎ଶ.As 𝜎ଶ 𝑡𝑒𝑛𝑑𝑠 𝑡𝑜 0 the probabilistic PCA

becomes PCA.

In probabilistic PCA , noise is considered to be drawn from a diagonal

covariance Gaussian distribution, where the

covariance matrix is a diagonal matrix of equal variances.

Covariance matrix=diag( 𝜎ଶ)

Where 𝜎ଶ=[𝜎 ଶ,𝜎 ଶ,𝜎 ଶ,…………,𝜎 ଶ]், a v ariance vector.

In this case h captures, the dependencies between variable s x.

Thus we have

ℎ∼N(h;0,I)

𝑥∼N(x;b,WWT+𝜎ଶ𝐼)

Thus, x=Wh+b+ 𝜎𝑧, where z is Gaussian noise.

7.2.2 Factor Analysis :

In contrast to PCA factor analysis focusses on locating independent

variables. Considering the Linear factor models discussed above in Factor

analysis linear model, the latent variable h is unit variance Gaussian and

the variable x are assumed to be conditionally dependent such that h is

given.

In Factor analysis noise is considered to be drawn from a diagonal

covariance Gaussian distribution, where the covariance matrix is a

diagonal matrix of variances.

Covaria nce matrix=diag( 𝜎ଶ) munotes.in

## Page 99

99

Where 𝜎ଶ=[𝜎ଵଶ,𝜎ଶଶ,𝜎ଷଶ,…………𝜎ଶ]், a v ariance vector.

In this case h captures the dependencies between variable x .

Thus we have

ℎ∼N(h;0,I)

𝑥∼N(x;b,WWT+diag( 𝜎ଶ))

7.2.3 Independent Component Analysis (ICA) :

ICA dea ls with separating an observed signal into the underlying signals

which are scaled and added together to form observed data. These signals

are supposed to be independent.

For example, separating speech signal of people ta lking simultaneously

Figure 7.2: Algorithm separates speech signal

In this model the prior distribution p(h) from which h is generated, is fixed

well before.

Thus,

x=Wh

This Model can be trained using Maximum likelihood.

If p(h) is independent, then we can obtain underlying factors that are likely

to be independent.

For example, if we have n microphones placed in different locations such

that x i is an observation of mixed signals,

hi is an estimate of original independent signal

ICA can help se paration of signals i.e. each hi contains only one person

speaking clearly.

Many variants of ICA are possible. And all these variant s require p(h) to

be non-Gaussian .

munotes.in

## Page 100

100

7.2.4 Slow Feature Analysis (SFA) :

Slow Feature Analysis is based on slowness principle; it is applied to any

model trained with gradient descent. In slow feature analysis the

information is extracted from time signals. It’s an unsupervised learning

algorithm which helps in extracting slowly varying features from a quick

varying sig nal.

For example, in video frame of moving zebra, a zebra moves from left to

right, an individual pixel will quickly change from black to white and back

again as the zebra’s stripes pass over the pixel. But the feature indicating

whether a zebra is in th e image will not change, and the feature describing

the zebra’s position will change slowly.

Slow feature analysis is efficient because it is used in linear feature

extractor. It is possible theoretically to predict the feature SFA will learn.

7.2.5 Sparse coding :

Sparse coding is a class of unsupervised learning methods for learning sets

to represent data efficiently. This is the most used Linear factor model

from the unsupervised feature learning. It helps in deciding the value of

latent variable in the model.

In this model also, x is obtained using decoder and reconstructions.

X=Wh+b+noise

The model assumes that linear factor model has Gaussian noise with

isotropic precision β.

p(x|h)=N(x;Wh+b,(1/ β)I)

7.2.6 Manifold Interpretation of PCA :

Linear factor models can be explained as a manifold. We know that

Principal Component analysis helps in dimensionality reduction, but it

works well when the data has linear relationship. Problem arises when

data has nonlinear relationships. These problems can be handled using

manifold learning. Manifold learning describes the high dimensional

datasets into low dimensional manifolds. Probabilistic PCA can be viewed

as a region of the shape of a thin pancake with high probability. PCA can

be interpreted as al igning this pancake with a linear manifold in a higher -

dimensional space .

7.3 AUTOENCODERS

Autoencoder neural network is trained using unsupervised learning.

Autoencoders are feedforward NN having outpust same as input.The input

is compressed and then reconstructed using Autoencoders.

Input -->Encoder -->code -->Decoder -->Output ( same as input) munotes.in

## Page 101

101

An auto encoder has 3 components :

1. Encoder: The encoder compresses the input and produces the code.

This is done with the help of an encoding method.

2. Code: it is the latent representation of the code

3. Decoder : input is reconstructed by the decoder. This is done with the

help of a decoding method.

Autoencoders is mainly used in dimensionality reduction, but they also

exhibit other properties.

They are Data -specific . Autoencoders are lossy they don’t yield the exact

same ouput as the input, it can be a degraded representation. To train an

autoencode r we need the raw input data. Autoencoders generate their own

labels from the training data and therefore they are termed as self -

supervised or unsupervised algorithms.

7.3.1 Undercomplete Autoencoders :

An autoencoder whose code dimension is less than the input dimension is

called undercomplete. Learning an undercomplete representation pushes

the autoencoder to capture the most significant features of the training

data. Thus, to obtain main features from the autoencoder is by constraining

h to have smaller dimension than x.

7.3.2 Regularized Autoencoders :

We have seen above that, Undercomplete autoencoders, with code

dimension less than the input dimension, can learn the most salient

features of the data distribution. Also, autoencoders fail to learn anything

useful if the encoder and decoder are given too much amount of data .

Possibly, one can train any architecture of autoencoder by taking the code

dimension and the capacity of the encoder and decoder based on the

complexity of distribution to be modeled. This can be achieved with

Regularized autoencoders. Instead of limiting the model capacity,

regularized autoencoders use a loss function that promotes the model to

have other properties apart from copying its input to its output. These

properties include :

1. sparseness of the representation

2. compactness of the derivative of the repres entation

3. robustness to noise / missing inputs.

A regularized autoencoder can be nonlinear and overcomplete however it

can still learn something important about the data distribution.

Typically, sparse autoencoders are used to learn features for a different

job, such as classification. Instead of just operating as an identity function, munotes.in

## Page 102

102

an autoencoder that has been regularized to be sparse must respond to

unique statistical properties of the dataset it has been trained on. In this

sense, using a sparsity penalty to train a model to execute a copying task

can result in a model that has learnt important features as a side effect.

Autoencoders with Denoising Rather than adding a penalty to the cost

function, we can change the reconstruction error term of the cost function

to obtain an autoencoder that learns anything meaningful. Instead, a

denoising autoencoder, minimizes L(x, g(f(x~ )), where x~ is a duplicate of

x that has been distorted b y noise. Instead of just copying their input,

denoising autoencoders must repair the damage.

Autoencoders are frequently learned using just a single layer encoder and

decoder. This is not, however, a must. Deep encoders and decoders

provide numerous benef its. Remember that depth in a feedforward

network has a lot of advantages. These benefits also apply to autoencoders

because they are feedforward networks. Furthermore, because the encoder

and decoder are both feedforward networks, each of these autoencode r

components can benefit from depth on its own. The universal

approximator theorem implies that a feedforward neural network with at

least one hidden layer can achieve non -trivial depth.

Given enough hidden units, a deep autoencoder with at least one extr a

hidden layer inside the encoder can approximate any mapping from input

to code arbitrarily accurately. The computational cost of modelling some

functions can be reduced by an order of magnitude when using depth.

Depth can also reduce the quantity of trai ning data required to learn some

functions tremendously. Because greedily pretraining the deep architecture

by training a stack of shallow autoencoders is a typical technique for

training a deep autoencoder, we frequently meet shallow autoencoders,

even wh en the end goal is to train a deep autoencoder.

7.3.3 Denoising Autoencoders :

An alternate way to make the autoencoder to learn some important

features is by adding random noise to the input . Because the input contains

random noise, it wont be possible for an autoencoder to replicate the input

to the output.

Thus the input can be d ecod ed by removing the noise, this process is

called denoising autoencoder.

Autoencoder is expected to generate the input image even if it not actually

seeing that input.

munotes.in

## Page 103

103

Figure 7.3 Autoencoder

7.3.4 Applications of Autoencoders :

Autoencoders have been effectively used in dimensionality reduction

and information retrieval tasks.

Autoencoders are good at denoising of images.

Autoencoders are used in Feature Extraction, they help to learn

important hidden feature s of the input data.

Variational Autoencoder (VAE), is used to generate images.

Autoencoders are used in Sequence -to-Sequence Prediction.

Deep Autoencoders can be used in recommending movies, books, or

other items.

7.4 REPRESENTATION LEARNING

Representation learning is learning representations of input data typically

by transforming it or extracting features from it (by some means), that

makes it easier to perform a task like classification or prediction.

Representation makes subsequent learning easier.

Information processing tasks can be easy or difficult depending on how

the information is represented and viewed. Thus, a representation is said to

be good if it makes the subsequent learning ta sks easier, and the choice of

representation depends on what is the successive task. This determines if

the representation is good or bad. For example, it will be easy for a person

to divide 670 by 6 using long division method, but once we change the

representation of numbers to other forms like hexadecimal or roman then

it becomes difficult. Thus representation matters.

Feedforward networks taught by supervised learning can be thought of as

doing a sort of representation learning. A linear classifier, such as a

SoftMax regression classifier, is often used as the network's final layer.

The rest of the network learns to give this classifier a representation. The

representation at every hidden layer takes on qualities that make the

classification task easier when trained with a supervised criterion.

We frequently have a lot of unlabeled training data and a small amount of

annotated training data. On the labelled subset, supervised learning

techniques often result in significant overfitting. Semi -supervised learning

munotes.in

## Page 104

104

can help overcome the overfitting pr oblem by learning from unlabeled

data as well. We can develop good unlabeled data representations and then

use these representations to perform the supervised learning challenge.

The procedure to train a deep supervised network without requiring

architect ural specializations like convolution or recurrence is termed

as greedy layer wise unsupervised pretraining. This process shows how a

representation is learned for one task can be useful for another task.

Greedy layer -wise pretraining gets its name from t he fact that it's a greedy

algorithm, which means it optimizes each piece of the solution separately,

one at a time, rather than all at once. Layer -wise refers to the independent

elements that make up the network's layers. Because each layer is trained

with an unsupervised representation learning algorithm, it is called

unsupervised.

Other unsupervised learning techniques, such as deep autoencoders and

probabilistic models with multiple layers of latent variables, can benefit

from greedy layer -wise unsuper vised pretraining.

5.6 SUMMARY

Linear factor models are the simplest generative models.

Probabilistic PCA, Factor models, independent component Analysis

are all obtained with variations in Linear factor models.

Slow Feature Analysis is based on slowness principle and is used in

important feature extraction.

Autoencoders are feedforward neural networks where the input is

same as the output.

An autoencoder has 3 components: Encoder, Code & Decoder .

Significant features can be learnt using undercomplete, reg ularized

and denoised autoencoders.

The concept of representation learning links together all the many

forms of deep learning. Feedforward and recurrent networks,

autoencoders and probabilistic models all learn and develop

representations.

QUESTIONS

1. What are Linear factor Models.

2. Explain the concept of Probabilistic PCA

3. Compare Factor Analysis & Independent Component Analysis munotes.in

## Page 105

105

4. Write short note on Slow feature analysis

5. What is Sparse coding ?

7. What is Manifold Interpretation of PCA ?

7. What are Autoencoders?

8. How do we obtain Undercomplete Autoencoders & Regularized

Autoencoders?

9. What is the significance of Denoising Autoencoders

10. List Applications of Autoencoders

11. Write a short not on importance of Representation lear ning in deep

learning.

REFERENCES & BIBLIOGRAPHY Deep Learning Ian Goodfellow, Yoshua Bengio, Aaron Courvile An

MIT Press book 1st 2016

Deep Learning: Methods and Applications Deng & Yu Now

Publishers 1st 2013

https://towardsdatascience.com/applied -deep -learning -part-3-

autoencoders -1c083af4d798

https://towardsdatascience.com/tagged/representation -learning

https://cedar.buffalo.edu/~srihari/CSE676/

Youtube Channel : Sargur Srihari

*****

munotes.in

## Page 106

106

UNIT V

8

APPROXIMATE INFERENCE

Unit Structure

8.0 Objectives

8.1 Introduction to Approximate Inference

8.2 Approximate Inference in machine learning

8.3 Approximate Inference in deep learning

8.4. Inference as Optimization

8.5 Expectation Maximization

8.5.1 Algorithm of Expectation Maximization

8.6 Maximum a Posteriori (MAP)

8.9 Variational Inference and Learning

8.8 Discrete Latent Variables

8.9 Summary

8.0 OBJECTIVES

This chapter would make you understand the following concepts:

Fundamentals of approximate inference

Algorithm and working of Expectation Maximization

8.1 INTRODUCTION

Approximate inference methods useful to learn realistic models from large

dataset (image or video or textual or any other.) by falling computation

time for accuracy.

Exact inference is carried out with the posterior probability of the

parameters. However, we often do not have access to that posterior — it

may be difficult to compute, sample, or both! In those cases, if we can find

a — necessarily bi ased, but simpler — approximation to the posterior, we

can use that approximation to carry out inference.

8.1.1 Posterior Probability Definition :

● A posterior probability, in Bayesian statistics, is the revised or

updated probability of an event occurring after taking into

consideration new information. munotes.in

## Page 107

107

● The posterior probability is calculated by updating the prior

probability using Bayes' theorem.

● In statistical terms, the posterior probability is the probability of event

A occurring given that event B has occurred.

Bayes' Theorem Formula :

The formula to calculate a posterior probability of A occurring given

that B occurred:

𝑃(𝐴∣𝐵)=𝑃(𝐴∩𝐵)

𝑃(𝐵)=𝑃(𝐴)×𝑃(𝐴)

𝑃(𝐵)

Remark:

A, B = events

P (B∣A) = the probability of B occurring given that Ais true

P (B) and P(B) = the probabilities of A occurring and B occurring

independently of each other

Bayes' theorem can be used in many applications, such as economics,

medicine, and finance. In finance, Bayes' theorem can be used to update a

previous belief once new information is obtained.

Prior probability represents what is originally believed before new

evidence is introduced, and posterior probability takes this new

information into account.

8.2 APPROXIMATE INFERENCE IN MACHINE LEARNING

From a probabilistic perspective, we can frame the problem of learning as

an inference problem. We have a model 𝑀 which is parameterized by a set

of variables 𝜃. We also have access to some observed data 𝐷={𝑥}ୀଵே and

that can include labels or anything else. Our goal in learning is to find a

setting for θ such that our model is useful. In the language of probability,

we are looking for a posterior distribution over the parameters, and Bayes

theorem tells us exactly what to do:

𝑝(𝜃|𝐷,𝑀)=𝑝(𝜃,𝑀)𝑝(𝑀)

𝑝(𝐷|𝑀)

where 𝑝(𝑥|𝜃,𝑀),𝑝(𝜃|𝑀) are the likelihood and prior distribution

(respectively) and are specified by 𝑀.

The simplicity of the Bayes theorem has complexities that in most cases

we cannot exac tly solve the inference problem. In most cases, 𝑝(𝐷|𝑀) is

completely intractable, and we cannot really do anything with that

equation.

munotes.in

## Page 108

108

The notion of approximate inference addresses the above issue: so we can

approximately solve Bayes theorem for compl ex cases, ie. We scale up

Bayesian learning to the types of interesting, high -dimensional datasets

that we want to deal with today’s Machine Learning.

We can roughly divide approximate inference schemes into two

categories: deterministic and stochastic. S tochastic methods are based on

the idea of Monte -Carlo sampling i.e., we can approximate any

expectation w.r.t. a distribution as a mean of samples from it:

𝐸𝑝(𝑥)[𝑓(𝑥)]≈1

𝐿𝑓(𝑥)

ୀଵ 𝑤𝑖𝑡ℎ 𝑥∼𝑝(𝑥)

The problem of inference can be converted into sampling from the

posterior distribution. Happily, there are many cases where the posterior is

intractable, but we can still sample from it. Here, Markov -Chain Monte

Carlo algorithms dominate the landscape.

Deterministic methods substitute the proble m of inference with

optimization. We can parameterize the approximation with

some variational parameters, and then minimize a probabilistic divergence

w.r.t. the variational parameters. We then use the trained approximate

distribution instead of the true, intractable one.

8.3 APPROXIMATE INFERENCE IN DEEP LEARNING

In the context of deep learning, we usually have a set of visible

variables 𝑣 and a set of latent variables ℎ. The challenge of inference

usually refers to the difficult problem of computing 𝑝(ℎ | 𝑣) or taking

expectations with respect to it. Such operations are often necessary for

tasks like maximum likelihood learning.

In most graphical models with multiple layers of hidden variables have

intractable posterior distributions. Exact inferen ce requires an exponential

amount of time in these models. Even some models with only a single

layer, such as sparse coding, have the same problem.

Intractable inference problems in deep learning usually arise from

interactions between latent variables in a structured graphical model. (Ex.

Refer Figure No 8.1)

Intractable inference problems in deep learning are usually the result of

interactions between latent variables in a structured graphical model.

These interactions can be due to edges directly connecting one latent

variable to another or longer paths that are activated when the child of a

V-structure is observed. munotes.in

## Page 109

109

Figure No 8.1: Sample full y connected Network

● (Left part of Figure No 8.1) A semi -restricted Boltzmann machine

with connections between hidden units. These direct connections

between latent variables make the posterior distribution intractable

because of the large cliques of laten t variables.

● (Center part of Figure No 8.1)A deep Boltzmann machine, organized

into layers of variables without intra -layer connections, still has an

intractable posterior distribution because of the connections between

layers.

● (Right part of Figure No 8.1)This directed model has interactions

between latent variables when the visible variables are observed

because every two latent variables are coparents.

● Some probabilistic models are able to provide tractable inference over

the latent variables despite having one of the graph structures depicted

above.

8.4 INFERENCE AS OPTIMIZATION

Approximate inference algorithms may then be derived by approximating

the underlying optimization problem.

To compute the log -probability of the observed data, 𝑙𝑜𝑔 𝑝(𝑣; 𝜃). we can

compute a lower bound 𝐿(𝑣,𝜃,𝑞) on 𝑙𝑜𝑔 𝑝(𝑣;𝜃). This bound is called the

evidence lower bound (ELBO). Another commonly used name for this

lower bound is the negative variational free energy .

The evidence lower bound is defined to be

𝐿(𝑣,𝜃,𝑞)= 𝑙𝑜𝑔 𝑝(𝑣; 𝜃)− 𝐷 (𝑞(ℎ |𝑣) || 𝑝(ℎ | 𝑣; 𝜃))

Remark:

Observed variables 𝑣

Latent variables ℎ

𝑞 is an arbitrary probability distribution over ℎ.

munotes.in

## Page 110

110

The difference between 𝑙𝑜𝑔 𝑝(𝑣) and 𝐿(𝑣,𝜃,𝑞) is given by the 𝐾𝐿

divergence, and because the 𝐾𝐿 divergence is always nonnegative, we can

see that 𝐿 always has at most the same value as the desired log probability .

The two are equal if and only if 𝑞 is the same distribution as 𝑝(ℎ | 𝑣).

The canonical definition of the evidence lower bound

𝐿(𝑣,𝜃,𝑞) = 𝐸∼ [𝑙𝑜𝑔 𝑝(ℎ,𝑣)] + 𝐻(𝑞).

For an appropriate choice of 𝑞, 𝐿 is tractable to compute. For any choice

of 𝑞,𝐿 provides a lower bound on the likelihood. For 𝑞(ℎ | 𝑣) that are

better approximations of 𝑝(ℎ | 𝑣), the lower bound L will be tighter, in

other words, closer to 𝑙𝑜𝑔𝑙𝑜𝑔 𝑝(𝑣) . When 𝑞(ℎ | 𝑣) = 𝑝(ℎ | 𝑣), the

approximation is perfect, and 𝐿(𝑣,𝜃,𝑞)=𝑙𝑜𝑔𝑙𝑜𝑔 𝑝(𝑣; 𝜃) .

The procedure for finding the 𝑞 that maximizes 𝐿. Exact inference

maximizes 𝐿 perfectly by searching over a family of functions q that

includes 𝑝(ℎ | 𝑣).

To derive different forms of approximate inference from approximate

optimization to find 𝑞.

It can make the optimization procedure less expensive but approximate by

restricting the family of distributions 𝑞 that the optimization is allowed to

search over or by using an imperfect optimization procedure that may not

completely maximize 𝐿

but may merely increase it by a significant amount. No matter what choice

of q we use, L is a lower bound. It can get tighter or looser bounds that are

cheaper or more expensive to compute depending on how it chooses to

approach this optimization problem.

It can obtain a poorly matched 𝑞 but reduce the computational cost by

using an imperfect optimization procedure, or by using a perfect

optimization procedure over a restricted family of 𝑞 distributions.

8.5 EXPECTATION MAXIMIZATION

An expectation -maximization algorithm is an approach for performing

maximum likelihood estimation in the presence of latent variables. It does

this by first estimating the values for the latent variables, then optimizing

the model, then repeating these two st eps until convergence. It is an

effective and general approach and is most commonly used for density

estimation with missing data, such as clustering algorithms like the

Gaussian Mixture Model.

The EM algorithm is an iterative approach that cycles between two modes.

The first mode attempts to estimate the missing or latent variables called

the estimation -step or E -step. The second mode attempts to optimize the munotes.in

## Page 111

111

parameters of the model to best explain the data called the maximization -

step or M -step.

● E-Step. Estimate the missing variables in the dataset.

● M-Step. Maximize the parameters of the model in the presence of the

data.

The E -step (expectation step):

Let θ(0) denote the value of the parameters at the beginning of the step.

Set 𝑞(ℎ(𝑖) | 𝑣) = 𝑝(ℎ(𝑖) | 𝑣(𝑖); 𝜃(0)) for all indices i of the training

examples v(i) ; to train on (both batch and minibatch variants are valid).

q is defined in terms of the current parameter value of θ(0); if we vary θ,

then 𝑝(ℎ | 𝑣;𝜃) will change, but q(h | v) w ill remain equal to

𝑝(ℎ | 𝑣; 𝜃(0)).

The M -step (maximization step):

Completely or partially maximize

L(v(),θ,q)

with respect to θ using any optimization algorithm can be used.

8.5.1 Algorithm of Expectation Maximization :

1. Given a set of incomplete data, consider a set of starting parameters.

2. Expectation step (E – step): Using the observed available data of the

dataset, estimate (guess) the values of the missing data.

3. Maximization step (M – step): Complete data generated after the

expectat ion (E) step is used in order to update the parameters.

4. Repeat Step 2 to 3 until convergence

The EM algorithm is useful for discovering the values of latent variables.

It can be used for the purpose of estimating the parameters of the Hidden

Markov Model (HMM). It can be used to fill the missing data in a sample.

It can be used as the basis of unsupervised learning of clusters.

The E -step and M -step are often pretty easy for many problems in terms of

implementation.

It is always guaranteed that likelihood will increase with every iteration.

Solutions to the M -steps often exist in the closed -form.

EM algorithm has slow convergence. It makes convergence to the local

optima only. It requires both the probabilities, forward and backward

(numerical optimization requires only forward probability). munotes.in

## Page 112

112

8.6 MAXIMUM A POSTERIORI (MAP)

The term inference is referring to computing the probability distribution

over one set of variables given another. When training probabilistic

models with latent variables, we are usually interested in computing

𝑝(ℎ | 𝑣). An alternative form of inference is to compute the single most

likely value of the missing variables, rather than to infer the entire

distribution over their possible values. In the context of latent variable

models, this means computing

ℎ∗ = 𝑎𝑟𝑔 𝑚𝑎𝑥 (𝑝(ℎ | 𝑣))

ℎ

This is known as maximum a posteriori (MAP) inference.

Bayes theorem provi des a principled way of calculating conditional

probability.

It involves calculating the conditional probability of one outcome given

another outcome, using the inverse of this relationship, stated as follows:

𝑃(𝐴 | 𝐵) = (𝑃(𝐵 | 𝐴) ∗ 𝑃(𝐴)) / 𝑃(𝐵)

The quantity that we are calculating is typically referred to as the posterior

probability of A given B and 𝑃(𝐴) is referred to as the prior probability of

A. The normalizing constant of 𝑃(𝐵) can be removed, and the posterior

can be shown to be prop ortional to the probability of B given A multiplied

by the prior.

𝑃(𝐴 | 𝐵) is proportional to 𝑃(𝐵 | 𝐴) ∗ 𝑃(𝐴)

Or, simply:

𝑃(𝐴 | 𝐵) = 𝑃(𝐵 | 𝐴) ∗ 𝑃(𝐴)

This is a helpful simplification as we are not interested in estimating a

probability, b ut instead in optimizing a quantity. A proportional quantity is

good enough for this purpose.

We can now relate this calculation to our desire to estimate a distribution

and parameters (theta) that best explains our dataset (X),

𝑃(𝑡ℎ𝑒𝑡𝑎 | 𝑋) = 𝑃(𝑋 | 𝑡ℎ𝑒𝑡𝑎) ∗ 𝑃(𝑡ℎ𝑒𝑡𝑎)

Maximizing this quantity over a range of theta solves an optimization

problem for estimating the central tendency of the posterior probability

(e.g. the model of the distribution).

𝑚𝑎𝑥𝑖𝑚𝑖𝑧𝑒 𝑃(𝑋 | 𝑡ℎ𝑒𝑡𝑎) ∗ 𝑃(𝑡ℎ𝑒𝑡𝑎)

In machine learning, Maximum a Posteriori optimization provides a

Bayesian probability framework for fitting model parameters to training

data and an alternative and sibling to the perhaps more common Maximum

Likelihood E stimation framework. munotes.in

## Page 113

113

8.7 VARIATIONAL INFERENCE AND LEARNING

The evidence lower bound 𝐿(𝑣,𝜃,𝑞) is a lower bound on 𝑙𝑜𝑔 𝑝(𝑣;𝜃),

how inference can be viewed as maximizing 𝐿 with respect to 𝑞, and how

learning can be viewed as maximizing 𝐿 with respect to 𝜃. The EM

algorithm enables us to make large learning steps with a fixed 𝑞 and that

learning algorithms based on MAP inference enable us to learn using a

point estimate of 𝑝(ℎ | 𝑣) rather than inferring the entire distribution.

The co re idea behind variational learning is that we can maximize 𝐿 over a

restricted family of distributions q. This family should be chosen so that it

is easy to compute 𝐸𝑞 𝑙𝑜𝑔 𝑝(ℎ,𝑣). A typical way to do this is to introduce

assumptions about how q factorizes. A common approach to variational

learning is to impose the restriction that 𝑞 is a factorial distribution:

𝑞(ℎ | 𝑣) = ෑ𝑞(ℎ𝑖 | 𝑣)

This is called the mean -field approach. More generally, we can impose

any graphical model struc ture we choose on q, to flexibly determine how

many interactions we want our approximation to capture. This fully

general graphical model approach is called structured variational

inference

The beauty of the variational approach is that we do not need to specify a

specific parametric form for 𝑞.

8.8 DISCRETE LATENT VARIABLES

Variational inference with discrete latent variables is relatively

straightforward. We define a distribution 𝑞, typically one where each

factor of 𝑞 is just defined by a lookup table over discrete states. In the

simplest case, ℎ is binary and we make the mean -field assumption that q

factorizes over each individual ℎ. In this case, we can parametrize 𝑞 with

a vector ℎ whose entries are probabilities. Then 𝑞(ℎ = 1 | 𝑣) = ℎప

After determining how to represent q, we simply optimize its parameters.

With discrete latent variables, this is just a standard optimization problem.

In principle, the selection of q c ould be done with any optimization

algorithm, such as gradient descent.

Because this optimization must occur in the inner loop of a learning

algorithm, it must be very fast. To achieve this speed, we typically use

special optimization algorithms that are designed to solve comparatively

small and simple problems in few iterations. A popular choice is to iterate

fixed -point equations, in other words, to solve

munotes.in

## Page 114

114

𝜕

𝜕ℎ𝑖𝐿=0

for ℎ𝑖. We repeatedly update different elements of ℎ until we satisfy a

convergence criterion.

8.9 SUMMARY

Approximate inference methods useful to learn realistic models from

large dataset

approximate inference schemes into two categories: deterministic and

stochastic.

The challenge of inference usually refers to the difficult problem of

computing p(h | v) or taking expectations with respect to it.

The optimization procedure less expensive than approximate

An expectation -maximization algorithm is an approach for perfor ming

maximum likelihood estimation in the presence of latent variables.

Maximum a Posteriori optimization provides a Bayesian probability

framework for fitting model parameters to training data

Experiment P ractice :

● Implement Bayes' Theorem Formula.

● Implement Bayesian -Neural -Networks

● Implement Expectation -Maximization Algorithm

● Implement MAP Algorithm

LIST OF REFERENCES

[1] Heskes, Tom & Albers, Kees & Kappen, Hilbert. (2012).

Approximate Inference and Constrained Optimization. Proceedings

of Uncertainty in AI.

[2] Shin Kamada, Takumi Ichimura.: An adaptive learning method of

Deep Belief Network by layer generation algorithm. IEEE Xplore.

Nov 2016.

[3 ] Shin Kamada, Takumi Ichimura.: An adaptive learning method of

Deep Belief Network by layer g eneration algorithm. IEEE Xplore.

Nov 2016.

[4] Mohamed A,Dahl G E, Hinton. G.: Acoustic modeling using deep

belief networks[J]. Audio Speech and Language Processing IEEE

Transactions an, vol. 20, no. 1, pp. 14 -22, 2012

[5] Boureau Y, Cun L. Y.: Sparse f eature learning for deep belief

networks [C] //Advances in neural information processing systems.

pp. 1185 -1192, 2008 munotes.in

## Page 115

115

[6] Hinton. G.: A practical guide to training restricted Boltzmann

machines[J]. Momentum, vol. 9, no. 1, pp. 926, 2010

[7] Bengio Y, Lam blin P, Popovici Det al.:Greedy layer -wise training of

deep networks. Advances in neural information processing systems,

vol. 19, pp. 153, 200 8.

[8] Ranzato A M, Szummer M.: Semi -supervised learning of compact

document representations with deep networks[C]//Proceedings of the

25th international conference on Machine learning. ACM, pp. 792 -

799, 2008.

[9] Neal R M, Hinton G. E.: A view of the EM algorithm that justifies

incremental sparse and other variants[M]//Learning in graphical

models. pp. 355 -368, 1998.

[10] Hinton G E, Salakhutdinov R. R.: Reducing the dimensionality of

data with neural networks[J]. Science, vol. 3 13, no. 5786, pp 504 -

507, 2006.

[11] Goodfellow and Bengio, “Deep learning ” 2016

MODEL QUESTIONS

1. Explain how approximate Inference works machine learning

2. Explain relationship between approximate Inference deep learning

3. Define Posterior Probability

4. Explain Expectation Maximization algorithm

5. Write Expectation Maximization algorithm

6. Write Maximum a Posteriori (MAP) algorithm

8. Compare Supervised and Unsupervised Learning

*****

munotes.in

## Page 116

116

9

DEEP GENERATIVE MODELS

Unit Structure

9.0 Objectives

9.1 Introduction

9.2 Deep Generative Models

9.3 Generative Adversarial Networks

9.4 GANs as a Two Player Game

9.5. Boltzmann Machines

9.6 Restricted Boltzmann Machines

9.7 Deep Belief Networks

9.8 Summary

Experiment Practice

9.0 OBJECTIVES

This chapter would make you und erstand the following concepts

Fundamentals of Deep Generative Models

Algorithm and working of Restricted Boltzmann Machines

Algorithm and working of Deep Belief Networks

9.1 INTRODUCTION

A generative model includes the distribution of the data itself, and tells us

how likely a given example s are . For example, models that predict the

next word in a sequence are typically generative models (usually much

simpler than GANs) because they can assign a probability to a sequence of

words.

Some applications are as follows :

Generate Examples for Image Datasets ,

Generate Realistic Photographs

Image -to-Image Translation ,

Face Frontal View Generation

Generate Photographs of Human Faces

Photos to Emojis

Photograph Editing

Face Aging

Photo Blending munotes.in

## Page 117

117

Super Resolution

Photo Inpainting

Generate Cartoon Characters

Text-to-Image Translation

Semantic -Image -to-Photo Translation

Generate New Human Poses

Clothing Translation

Video Prediction

3D Object Generation

9.1 DEEP GENERATIVE MODELS

Deep generative models (DGM) are neural networks with many hidden

layers trained to approximate complicated, high -dimensional probability

distributions using a large number of s amples. When trained successfully,

we can use the DGMs to estimate the likelihood of each observation and to

create new samples from the underlying distribution. Applications of deep

generative models (DGM), such as creating fake portraits from celebrity

images are quite popular.

The ambitious goal in DGM training is to learn an unknown or intractable

probability distribution from a typically small number of independent and

identically distributed samples. When trained successfully, we can use the

DGM to estimate the likelihood of a given sample and to create new

samples that are similar to samples from the unknown distribution. These

problems have been at the core of probability and statistics for decades b ut

remain computationally challenging to solve, particularly in high

dimensions.

Three key mathematical challenges in DGM :

1. DGM training is an ill -posed problem since uniquely identifying a

probability distribution from a finite number of samples is impossible.

Hence, the performance of the DGM will depend heavily on so -called

hyper -parameters, which include the design of the network, the choice

of training objective, regularization, and training algorithms.

2. Training the generator requires a way t o quantify its samples’

similarity to those from the intractable distribution. In the approaches

considered here, this either requires the inversion of the generator or

comparing the distribution of generated samples to the given dataset.

Both of these ave nues have their distinct challenges. Inverting the

generator is complicated in most cases, particularly when it is

modeled by a neural network that is nonlinear by design. Quantifying

the similarity of two probability distributions from samples leads to

two-sample test problems, which are especially difficult without prior

assumptions on the distributions. munotes.in

## Page 118

118

3. Most common approaches for training DGMs assume that we can

approximate the intractable distribution by transforming a known and

much simpler probabi lity distribution (for instance, a Gaussian) in a

latent space of a known dimension. In most practical cases,

determining the latent space dimension is impossible and is left as a

hyper -parameter that the user needs to choose. This choice is both

difficult and important. With an overly conservative estimate, the

generator may not approximate the data well enough, and an

overestimate can render the generator non -injective, which

complicates the training.

9.1.1 Supervised vs. Unsupervised Learning :

A typical machine learning problem involves using a model to make a

prediction, e.g. predictive modeling.

This requires a training dataset that is used to train a model, comprised of

multiple examples, called samples, each with input variables (X) and

outpu t class labels (y). A model is trained by showing examples of inputs,

having it predict outputs, and correcting the model to make the outputs

more like the expected outputs.

In the predictive or supervised learning approach, the goal is to learn a

mapping from inputs x to outputs y, given a labeled set of input -output

pairs.

Examples of supervised learning problems include classification and

regression, and examples of supervised learning algorithms include

logistic regression and random forest.

There is another paradigm of learning where the model is only given the

input variables (X) and the problem does not have any output variables

(y). A model is constructed by extracting or summarizing the patterns in

the input data. There is no correction of the mo del, as the model is not

predicting anything.

Figure 9.1 Example of Supervised Learning

munotes.in

## Page 119

119

The second main type of machine learning is the descriptive or

unsupervised learning approach. Here we are only given inputs, and the

goal is to find “interesting patterns” in the data. […] This is a much less

well-defined problem, since we are not told wh at kinds of patterns to look

for, and there is no obvious error metric to use (unlike supervised learning,

where we can compare our prediction of y for a given x to the observed

value). This lack of correction is generally referred to as an unsupervised

form of learning or unsupervised learning.

Figure 9.2 Unsupervised Learning

Example of Unsupervised Learning :

Examples of unsupervised learning problems include clustering and

generative modeling, and examples of unsupervised learning algorithms

are K-means and Generative Adversarial Networks.

9.1.2 Discriminative vs. Generative Modeling :

In supervised learning, we may be interested in developing a model to

predict a class label given an example of input variables.

This predictive modeling task is called classification.

Classification is also traditionally referred to as discriminative modeling.

We use the training data to find a discriminant function f(x) that maps

each x directly onto a class label, thereby combining the inference and

decision s tages into a single learning problem.

This is because a model must discriminate examples of input variables

across classes; it must choose or make a decision as to what class a given

example belongs to.

Figure 9.3 Example of Discriminative Modeling

munotes.in

## Page 120

120

Alternately, unsupervised models that summarize the distribution of input

variables may be able to be used to create or generate new examples in the

input distribution. As such, these types of models are referred to as

generative models.

Figure 9.4 Example of Generative Modeling

For example, a single variable may have a known data distribution, such

as a Gaussian distribution, or bell shape. A generative model may be able

to sufficiently summarize this data distribution, and then be used to

generate new variables that plausibly fit into the distribution of the input

variable.

Approaches that explicitly or implicitly model the distribution of inputs , as

well as outputs, are known as generative models because by sampling

from them it is possible to ge nerate synthetic data points in the input

space.

In fact, a really good generative model may be able to generate new

examples that are not just plausible, but indistinguishable from real

examples from the problem domain.

Examples of Generative Models

Naive Bayes is an example of a generative model that is more often used

as a discriminative model.

For example, Naive Bayes works by summarizing the probability

distribution of each input variable and the output class. When a prediction

is made, the proba bility for each possible outcome is calculated for each

variable, the independent probabilities are combined, and the most likely

outcome is predicted. Used in reverse, the probability distributions for

each variable can be sampled to generate newly plausi ble (independent)

feature values.

Other examples of generative models include Latent Dirichlet Allocation,

or LDA, and the Gaussian Mixture Model, or GMM.

Deep learning methods can be used as generative models. Two popular

examples include the Restricted Boltzmann Machine, or RBM, and the

Deep Belief Network, or DBN.

munotes.in

## Page 121

121

Two modern examples of deep learning generative modeling algorithms

include the Variational Autoencoder, or VAE, and the Generative

Adversarial Network, or GAN.

9.2 GENERATIVE ADVERSARIAL NETWORKS

Generative Adversarial Networks, or GANs, are a deep -learning -based

generative model.

More generally, GANs are a model architecture for training a generative

model, and it is most common to use deep learning models in thi s

architecture.

The GAN architecture was first described in the 2014 paper by Ian

Goodfellow, et al. titled “Generative Adversarial Networks.”

A standardized approach called Deep Convolutional Generative

Adversarial Networks, or DCGAN, that led to more st able models was

later formalized by Alec Radford, et al. in the 2015 paper titled

“Unsupervised Representation Learning with Deep Convolutional

Generative Adversarial Networks“.

The GAN model architecture involves two sub -models: a generator model

for gen erating new examples and a discriminator model for classifying

whether generated examples are real, from the domain, or fake, g enerated

by the generator model.

1. Generator . The model is used to generate new plausible examples

from the problem domain.

2. Discriminator . The model that is used to classify examples as real

(from the domain) or fake (generated).

The generator network directly produces samples. Its adversary, the

discriminator network, attempts to distinguish between samples drawn

from the tra ining data and samples drawn from the generator.

9.2.1 The Generator Model :

The generator model takes a fixed -length random vector as input and

generates a sample in the domain.

The vector is drawn randomly from a Gaussian distribution, and the vector

is used to seed the generative process. After training, points in this

multidimensional vector space will correspond to points in the problem

domain, forming a compressed representation of the data distribution.

This vector space is referred to as a late nt space, or a vector space

comprised of latent variables. Latent variables, or hidden variables, are

those variables that are important for a domain but are not directly munotes.in

## Page 122

122

observable. A latent variable is a random variable that we cannot observe

directly.

We often refer to latent variables, or a latent space, as a projection or

compression of data distribution. That is, a latent space provides

compression or high -level concepts of the observed raw data such as the

input data distribution. In the case of GAN s, the generator model applies

meaning to points in a chosen latent space, such that new points drawn

from the latent space can be provided to the generator model as input and

used to generate new and different output examples.

Machine -learning models can learn the statistical latent space of images,

music, and stories, and they can then sample from this space, creating new

artworks with characteristics similar to those the model has seen in its

training data.

After training, the generator model is kept a nd used to generate new

samples.

Figure 9.5 Example of the GAN Generator Model

9.2.2 The Discriminator Model :

The discriminator model takes an example from the domain as input (real

or generated) and predicts a binary class label of real or fake (generated).

The real example comes from the training dataset. The generated examples

are output by the generator model.

The discriminator is a normal (and well understood) classification model.

After the training process, the discriminator model is disc arded as we are

interested in the generator.

Sometimes, the generator can be repurposed as it has learned to effectively

extract features from examples in the problem domain. Some or all of the

feature extraction layers can be used in transfer learning ap plications using

the same or similar input data.

We propose that one way to build good image representations is by

training Generative Adversarial Networks (GANs), and later reusing parts

munotes.in

## Page 123

123

of the generator and discriminator networks as feature extractors f or

supervised tasks

Figure 9.6 Example of the GAN Discriminator Model

9.3 GANS AS A TWO -PLAYER GAME

Generative modeling is an unsupervised learning problem, as we discussed

in the previous section, although a clever property of the GAN architecture

is that the training of the generative model is framed as a supervised

learning problem.

The two models, the generator and discriminator, are trained together. The

generator generates a batch of samples, and these, along with real

examples from the domain, are provided to the discriminator and classified

as real or fake.

The discriminator is then updated to get better at discriminating real and

fake samples in the next round, and importantly, the generator is updated

based on how well, or not, the generated samples fooled the discriminator.

We can think of the generator as being like a counterfeiter, trying to make

fake money, and the discriminator as being like police, trying to allow

legitimate money and catch counterfeit money. To succeed in this game,

the counterfeiter must learn to make money that is indistinguishable from

genuine money, and the generator network must learn to create samples

that are drawn from the same distribution as the training data.

In this way, the two models are competing against each other, they are

adversarial in the game theory sense, and are playing a zero -sum game.

Because the GAN framework can naturally be analyzed with the tools of

game theory, we call GANs “adversarial”.

In this case, zero -sum means that when the discrim inator successfully

identifies real and fake samples, it is rewarded or no change is needed to

the model parameters, whereas the generator is penalized with large

updates to model parameters.

munotes.in

## Page 124

124

Alternately, when the generator fools the discriminator, it is rewarded, or

no change is needed to the model parameters, but the discriminator is

penalized and its model parameters are updated.

At a limit, the generator generates perfect replicas from the input domain

every time, and the discriminator cannot tell the difference and predicts

“unsure” (e.g. 50% for real and fake) in every case. This is just an example

of an idealized case; we do not need to get to this point to arrive at a useful

generator model.

Figure 9.7 Example of the Generative Adversarial Network Model

Architecture

Training drives the discriminator to attempt to learn to correctly classify

samples as real or fake. Simultaneously, the generator attempts to fool the

classifier into believing its samples are real. At convergence, the

generator’s samples are indistinguishable from real data, and the

discriminator outputs 1/2 everywhere. The discriminator may then be

discarded.

9.3.1 GANs and Convolutional Neural Networks :

GANs typically work with image data and use Convolutional Neural

Networks, or CNNs, as the generator and discriminator models.

The reason for this may be both because the first description of the

technique was in the field of computer vision and used CNNs and image

data, and because of the remarkable progress th at has been seen in recent

years using CNNs more generally to achieve state -of-the-art results on a

suite of computer vision tasks such as object detection and face

recognition.

munotes.in

## Page 125

125

Modeling image data means that the latent space, the input to the

generator, provides a compressed representation of the set of images or

photographs used to train the model. It also means that the generator

generates new images or photographs, providing an output that can be

easily viewed and assessed by developers or users of the model.

It may be this fact above others, the ability to visually assess the quality of

the generated output, that has both led to the focus of computer vision

applications with CNNs and on the massive leaps in the capability of

GANs as compared to other g enerative models, deep learning -based or

otherwise.

9.3.2 Conditional GANs :

An important extension to the GAN is in its use for conditionally

generating an output.

The generative model can be trained to generate new examples from the

input domain, where the input, the random vector from the latent space, is

provided with (conditioned by) some additional input.

The additional input could be a class value, such as male or female in the

generation of photographs of people, or a digit, in the case of genera ting

images of handwritten digits.

Generative adversarial nets can be extended to a conditional model if both

the generator and discriminator are conditioned on some extra information

y. y could be any kind of auxiliary information, such as class labels o r data

from other modalities. We can perform the conditioning by feeding y into

both the discriminator and generator as [an] additional input layer.

The discriminator is also conditioned, meaning that it is provided both

with an input image that is either real or fake and the additional input. In

the case of a classification label type conditional input, the discriminator

would then expect that the input would be of that class, in turn teaching

the generator to generate examples of that class in order to f ool the

discriminator.

In this way, a conditional GAN can be used to generate examples from a

domain of a given type.

Taken one step further, the GAN models can be conditioned on an

example from the domain, such as an image. This allows for applications

of GANs such as text -to-image translation, or image -to-image translation.

This allows for some of the more impressive applications of GANs, such

as style transfer, photo colorization, transforming photos from summer to

winter or day to night, and so on.

In the case of conditional GANs for image -to-image translation, such as

transforming day to night, the discriminator is provided examples of real munotes.in

## Page 126

126

and generated nighttime photos as well as (conditioned on) real daytime

photos as input. The generator is provi ded with a random vector from the

latent space as well as (conditioned on) real daytime photos as input.

Figure 9.8 Example of a Conditional Generative Adversarial Network

Model Architecture

One of the many major advancements in the use of deep learning methods

in domains such as computer vision is a technique called data

augmentation.

Data augmentation results in better -performing models, both increasing

model skill and providing a regularizing effect, reducing generalization

error. It works by creating new, artificial but plausible examples from the

input problem domain on which the model is trained.

The techniques are primitive in the case of image data, involving crops,

flips, zooms, and other simple transforms of existing images in the trai ning

dataset.

Successful generative modeling provides an alternative and potentially

more domain -specific approach for data augmentation. In fact, data

augmentation is a simplified version of generative modeling, although it is

rarely described this way.

In complex domains or domains with a limited amount of data, generative

modeling provides a path towards more training for modeling. GANs have

seen much success in this use case in domains such as deep reinforcement

learning.

Among these reasons, he high lights GANs’ successful ability to model

high-dimensional data, handle missing data, and the capacity of GANs to

provide multi -modal outputs or multiple plausible answers.

Perhaps the most compelling application of GANs is in conditional GANs

for tasks th at require the generation of new examples.

● Image Super -Resolution. The ability to generate high -resolution

versions of input images.

munotes.in

## Page 127

127

● Creating Art. The ability to create new and artistic images, sketches,

paintings , and more.

● Image -to-Image Translation. The ability to translate photographs across

domains, such as day to night, summer to winter, and more.

Perhaps the most compelling reason that GANs are widely studied,

developed, and used is because of their success. GANs have been able to

generate photos so realistic that humans are unable to tell that they are of

objects, scenes, and people that do not exist in real life.

Astonishing is not a sufficient adjective for their capability and success.

9.4 BOLTZMANN MACHINES

Boltzmann Machines is an unsupervised DL model in which every node is

connected to every other node. That is, unlike the ANNs, CNNs, RNNs,

and SOMs, the Boltzmann Machines are undirected (or the connections

are bidirectional). Boltzmann Machine is not a deterministic DL model but

a stochastic or generative DL model. It is rather a representation of a

certain system. There are two types of nodes in the Boltzmann Machine —

Visible nodes — those nodes which we can and do measure, and the

Hidden nodes – those nodes which we cannot or d o not measure. Although

the node types are different, the Boltzmann machine considers them as the

same and everything works as one single system. The training data is fed

into the Boltzmann Machine and the weights of the system are adjusted

accordingly. Bo ltzmann machines help us understand abnormalities by

learning about the working of the system in normal conditions.

Figure 9.9 Boltzmann Machine

Boltzmann Distribution is used in the sampling distribution of the

Boltzmann Machine. The Boltzmann distribution is governed by the

equation –

Pi = e(-∈i/kT)/ ∑e(-∈j/kT)

Pi - the probability of the system being in state i

munotes.in

## Page 128

128

∈i - Energy of system in state i

T - Temperature of the system

k - Boltzmann constant

∑e(-∈j/kT) - Sum of values for all possible states of the system

9.5 RESTRICTED BOLTZMANN MACHINES

In a full Boltzmann machine, each node is connected to every other node

and hence the connections grow exponentially. This is the reason we use

RBMs. The restrictions in the node connections in RBMs are as follows –

● Hidden nodes cannot be connected to one another.

● Visible nodes connected to one another.

Figure 9.10 Boltzmann Machine units structure

9.5.1 Energy function example for Restricted Boltzmann Machine :

E(v, h) = -∑ aivi - ∑ bjhj - ∑∑ viwi,jhj

a, v - biases in the system - constants

vi, hj - visible node, hidden node

P(v, h) = Probability of being in a certain state

P(v, h) = e(-E(v, h))/Z

Z - sum if values for all possible states

9.5.2 Features of Restricted Boltzmann Machine :

● They use recurrent and symmetric structures .

● RBMs in their learning process try to associate a high probability with

low energy states and vice -versa.

● There are no intralayer connections.

● It is an unsupervised learning algorithm i.e, it makes inferences fr om

input data without labeled responses.

munotes.in

## Page 129

129

9.5.3 Working of Restricted Boltzmann Machine :

Figure 9.11 working of Restricted Boltzmann Machine

The above image shows the first step in training an RBM with multiple

inputs. The inputs are multiplied by the weights and then added to the bias.

The result is then passed through a sigmoid activation function and the

output determines if the hidden state gets activated or not. Weights will be

a matrix with the number of input nodes as the number of rows and the

number of hidden nodes as the number of columns. The first hidden node

will receive the vector multiplication of the inputs multiplied by the first

column of weights before the corresponding bias term is added to it.

And if you are wondering what a sigmoid function is, here is the formula:

So the equation that we get in this step would be,

where h(1) and v(0) are the corresponding vectors (column matrices) for

the hidden and the visible layers with the superscript as the iteration v(0)

means the input that we provide to the network) and a is the hidden layer

bias vector. (Note that we are dealing with vectors and matrices here and

not one -dimensional values.)

munotes.in

## Page 130

130

Figure 9.12 reconstructions in Restricted Boltzmann Machine

Now this image shows the reverse phase or the reconstruction phase. It is

similar to the first pass but in the opposite direction. The equation comes

out to be:

where v(1) and h(1) are the corresponding vectors (column matrices) for

the visible and the hidden layers with the superscrip t as the iteration and b

is the visible layer bias vector.

9.5.4 The learning process :

Now, the difference v(0)−v(1) can be considered as the reconstruction

error that we need to reduce in subsequent steps of the training process. So

the weights are adjusted in each iteration so as to minimize this error and

this is what the learning process essentially is. Now, let us try to

understand this process in mathematical terms without going too deep into

mathematics. In the forward pass, we are calculating the probability of

output h(1) given the input v(0) and the weights W denoted by:

And in the backward pass, while reconstructing the input, we are

calculating the probability of output v(1) given the input h(1) and the

weights W denoted by:

munotes.in

## Page 131

131

The weights used in both the forward and the backward pass are the same.

Together, these two conditional probabilities lead us to the joint

distribution of inputs and the activations:

Reconstruction is different from regression or classification in that it

estimates the probability distribution of the original input instead of

associating a continuous/discrete value to an input example. This means it

is trying to guess multiple values at the same time. This is known as

generative learning as opposed to discr iminative learning that happens in a

classification problem (mapping input to labels).

9.5.5Advantages and Disadvantages of RBM :

9.5.5.1 Advantages:

● Expressive enough to encode any distribution and computationally

efficient.

● Faster than traditional Boltzmann Machine due to the restrictions in

terms of connections between nodes.

● Activations of the hidden layer can be used as input to other models as

useful features to improve performance

9.5.5.1 Disadvantages:

● Training is more difficult as it is difficult to calculate the Energy

gradient function.

● The CD -k algorithm used in RBMs is not as familiar as the

backpropagation algorithm.

● Weight Adjustment

9.5.6 Applications of RBM :

● Hand Written Digit Recognition

● Real-time intrapulse recognition of radar

9.5.7 Difference between Autoencoders & RBMs :

Autoencoder is a simple 3 -layer neural network where output units are

directly connected back to input units. Typically, the number of hidden

units is much less than the number of visible ones. The task of training is

to minimize an error or reconstruction, i.e. find the most efficient compact

representation for input data.

RBM shares a similar idea, but it uses stochastic units with particular

distribution instead of deterministic distribution. The task of training is to

munotes.in

## Page 132

132

find out how these two sets of variables are actually connected to each

other.

One aspect that distinguishes RBM from other autoencoders is that it has

two biases. The hidden bias helps the RBM produce the activations on the

forward pass, while the visible layer’s biases help the RBM learn the

reconstructions on the backward pass.

9.6 DEEP BELIEF NETWORKS

Deep belief nets are probabilistic generative models that are composed of

multiple layers of stochastic, latent variables. The latent variables typically

have binary values and are often called hidden units or feature detectors.

The top two layers have undirected, symmetric connections between them

and form an associative memory. The lower layers receive top -down,

directed connections fr om the layer above. The states of the units in the

lowest layer represent a data vector.

The two most significant properties of deep belief nets are:

There is an efficient, layer -by-layer procedure for learning the top -down,

generative weights that determine how the variables in one layer depend

on the variables in the layer above.

After learning, the values of the latent variables in every layer can be

inferred by a single, bottom -up pass that starts with an observed data

vector in the bottom layer and uses the generative weights in the reverse

direction.

Deep belief nets are learned one layer at a time by treating the values of

the latent variables in one layer, when they are being inferred from data, as

the data for training the next layer. This efficient, greedy learning can be

followed by, or combined with, other learning procedures that fine -tune all

of the weights to improve the generative or discriminative performance of

the whole network.

Discriminative fine -tuning can be performed by addin g a final layer of

variables that represent the desired outputs and backpropagating error

derivatives. When networks with many hidden layers are applied to highly

structured input data, such as images, backpropagation works much better

if the feature detec tors in the hidden layers are initialized by learning a

deep belief net that models the structure in the input data

Deep B elief Networks have two phases:

● Pre-train Phase

● Fine-tune Phase

munotes.in

## Page 133

133

The pre -train phase is nothing but multiple layers of RBNs, while Fine

Tune Phase is a feed-forward neural network. Let us visualize both the

steps: -

Figure 9.13 Deep Belief Networks

Algorithm for Training DBN :

Input: Dataset

Output: Trained network

Step 1 : Train the first layer as RBM that models the input 𝑎 = ℎ() as its

visible layer.

Step 2 : By using the first layer obtain the representation of the input that

will be used as

input for the next layer

𝑝൫ℎ() ൯ or 𝑝൫ℎ() ൯

Step 3: Train the second layer as an RBM

Step 4 : Repeat step 2 and step 3 for all the number of layers

Steps to update the weigh t

Input: Random Weight initialized

Output: Wight updated based on the error rate

Update the weight of Edge

munotes.in

## Page 134

134

Positive phase:

Compute positive statistics for edge 𝐸𝑖𝑗

Positive ( 𝑬𝒊𝒋)⇒ 𝒑(𝑯𝑱=1|v)

The individual activation probabilities for the hidden layer is

Negative Phase:

Compute negative statistics for edge 𝐸𝑖𝑗

Negative ( 𝑬𝒊𝒋)⇒ 𝑝(𝑣𝑖=1|H)

The individual activation pro babilities for visible layer is

Deep belief networks can substitute for a deep feedforward network or,

given more complex arrangements, even a convolutional neural network.

They have the advantages that they are less computationally expens ive

(they grow linearly in computational complexity with the number of

layers, instead of exponentially, as with feedforward neural networks); and

that they are significantly less vulnerable to the vanishing gradients

problem.

However, because deep belief networks impart significant restrictions on

their weight connections, they are also vastly less expressive than a deep

neural network, which outperforms them on tasks for which sufficient

input data is available.

Even in their prime, deep belief networks were rarely used in direct

application. They were instead used as a pretraining step: a deep belief

network with the same overall architecture as a corresponding deep neural

network is defined and trained. Its weights are then taken and placed into

the co rresponding deep neural network, which is then fine -tuned and put

to application.

Deep belief networks eventually fell out of favor in this application as

well. For one thing, RBMs are just a special case of autoencoders, which

were found to be more broad ly flexible and useful both for pretraining and

for other applications. For another thing, the introduction of ReLU

activation and its further refinement into leaky ReLU, along with the

introduction of more sophisticated optimizers, learning late scheduler s,

and dropout techniques, have worked to greatly alleviate the vanishing

gradient problem in practice, at the same time that increased data volumes

and computes power have made direct deep neural network applications to

problems more tractable.

munotes.in

## Page 135

135

9.7 SUMMARY

We have discovered a large number of applications of Generative

Adversarial Networks, or GANs.

Broad catagory of GAN are live supervised and unsupervised

The GAN model architecture involves two sub -models: a generator

and discriminator

The generator model takes a fixed -length random vector as input and

generates a sample in the domain

The discriminator model takes an example from the domain as input

(real or generated) and predicts a binary class label of real or fake

(generated).

Boltzmann Machines is an unsupervised DL model in which every

node is connected to every other node.

Deep belief nets are probabilistic generative models that are

composed of multiple layers of stochastic, latent variables.

Experiment P ractice :

● Apply Restricted Bol tzmann Machine implementation for

Recommender System (Amazon product suggestions or Netflix

movie)

● Apply Deep Belief Networks implementation for Recommender

System (Amazon product suggestions or Netflix movie)

LIST OF REFERENCES

[1] Heskes, Tom & Albers, Kees & Kappen, Hilbert. (2012).

Approximate Inference and Constrained Optimization. Proceedings

of Uncertainty in AI.

[2] Shin Kamada, Takumi Ichimura.: An adaptive learning method of

Deep Belief Network by layer generation algorithm. IEEE Xplore.

Nov 2016.

[3 ] Shin Kamada, Takumi Ichimura.: An adaptive learning method of

Deep Belief Network by layer generation algorithm. IEEE Xplore.

Nov 2016.

[4] Mohamed A, Dahl G E, Hinton. G.: Acoustic modeling using deep

belief networks[J]. Audio Speech and L anguage Processing IEEE

Transactions an, vol. 20, no. 1, pp. 14 -22, 2012

[5] Boureau Y, Cun L. Y.: Sparse feature learning for deep belief

networks [C] //Advances in neural information processing systems.

pp. 1185 -1192, 2008

[6] Hinton. G.: A practical g uide to training restricted Boltzmann

machines[J]. Momentum, vol. 9, no. 1, pp. 926, 2010 munotes.in

## Page 136

136

[7] Bengio Y, Lamblin P, Popovici Det al.:Greedy layer -wise training of

deep networks. Advances in neural information processing systems,

vol. 19, pp. 153, 2007.

[8] Ranzato A M, Szummer M.: Semi -supervised learning of compact

document representations with deep networks[C]//Proceedings of the

25th international conference on Machine learning. ACM, pp. 792 -

799, 200 9.

[9] Neal R M, Hinton G. E.: A view of the EM algorithm that justifies

incremental sparse and other variants[M]//Learning in graphical

models. pp. 355 -368, 199 9.

[10] Hinton G E, Salakhutdinov R. R.: Reducing the dimensionality of

data with neural networks[J]. Science, vol. 313, no. 5786, pp 504 -

507, 2006.

[11] Goodfellow and Bengio, “Deep learning ” 2016

[12] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in

neural information processing systems. 2014.

[13] Kingma, Diederik P., and Max Welling. "Auto -encoding variational

bayes." ar Xiv preprint arXiv:1312.6114 (2013).

[14] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised

representation learning with deep convolutional generative

adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

[15] Nowozin, Sebastian, Bot ond Cseke, and Ryota Tomioka. "f -GAN:

Training generative neural samplers using variational divergence

minimization." Advances in Neural Information Processing Systems.

2016.

[16] Denton, Emily L., Soumith Chintala, and Rob Fergus. "Deep

Generative Image M odels using a￼ Laplacian Pyramid of

Adversarial Networks."

Advances in neural information processing systems. 2015.

[17] Chen, Xi, et al. "Infogan: Interpretable representation learning by

information maximizing generative adversarial nets." Advances in

Neural

Information Processing Systems. 2016.

[18] Zhao, Junbo, Michael Mathieu, and Yann LeCun. "Energy -based

generative adversarial network." arXiv preprint arXiv:1609.03126

(2016).

[19] Doersch, Carl. "Tutorial on variational autoencoders." arXiv prepri nt

arXiv:1606.05908 (2016).

[20] Wang, K -C., and Richard Zemel. "classifying NBA offensive plays

using neural networks." MIT Sloan Sports Analytics Conference,

2016.

[21] Wikipedia “Kernel density estimation” munotes.in

## Page 137

137

[22] Wikipedia “Principal component analysis”

[23] Wikipedia “Autoencoder

MODEL QUESTIONS

1. Explain how approximate Inference works machine learning

2. Explain relationship between approximate Inference deep learning

3. Define Posterior Probability

4. Explain Expectation Maximization algorithm

5. Write Expectation Maximization algorithm

6. Write Maximum a Posteriori (MAP) algorithm

7. Compare Supervised and Unsupervised Learning

9. Compare Discriminative Generative Modeling

9. Explain Generative Adversarial Networks in detail.

9. Explain t he Generator Model and t he Discriminator Model of GAN

11. Explain Conditional GANs

12. Write short note on Boltzmann Machines

13. Explain Restricted Boltzmann Machines

14. Write Features of Restricted Boltzmann Machine

15. Compare Boltzmann Machin es and Restricted Boltzmann Machine

16. Write Advantages and Disadvantages of RBM

17. Write Applications of RBM also explain any one

19. Explain Deep Belief Networks with its working.

19. Write Algorithm for Training DBN

20. Explain how weights are updated in DBN .

*****

munotes.in