## Page 1

1

UNIT I

1

THE MEAN, MEDIAN, MODE, AND OTHER MEASURES OF CENTRAL TENDENCY

Unit structure

1.0 Objectives

1.1 Introduction

1.2 Index or Subscript Notation, Summation Notation

1.3 Averages or Measures of Central Tendency

1.4 Arithmetic Mean

1.4.1 Arithmetic Mean Computed from Grouped Data

1.4.2 Properties of the Arithmetic Mean

1.5 Weighted Arithmetic Mean

1.6 Median

1.7 Mode

1.8 Empirical Relation between the Mean, Median, and Mode

1.9 Geometric Mean

1.10 Harmonic Mean

1.11 Relation between the Arithmetic, Geometric, and Harmonic Means

1.12 Root Mean Square

1.13 Quartiles, Deciles and Percentiles

1.14 Software and Measures of Central Tendency

1.15 Summary

1.16 Exercise

1.17 References

1.0 OBJECTIVES

After going through this chapter, students will be able to learn:

To present huge data in a summarized form

To calculate and interpret the mean, the median and the mode,

To facilitate comparison

To calculate geometric mean, harmonic mean

To trace precise relationship munotes.in

## Page 2

2

To calculate Quartiles, deciles and percentiles

To help in decision-making

1.1 INTRODUCTION

A measure of central tendency is a single value that describes a set of data by identifying the central position within that set. The mean, median and mode are different measures of central tendency for numerical data.

The word average is common in day-to-day conversation: we often talk about the average height of the girls, an average student of the class, or the average run rate of a match. In everyday use, average means neither too good nor too bad. In statistics, however, the term has a different meaning. An average is a single value that represents a group of values; a good average is easy to understand, easy to compute and based on all observations.

1.2 INDEX OR SUBSCRIPT NOTATION, SUMMATION NOTATION

Let the symbol $X_i$ (read "X subscript i") denote any of the N values $X_1, X_2, X_3, \ldots, X_N$ assumed by a variable X. The letter i in $X_i$, which can stand for any of the numbers 1, 2, 3, …, N, is called a subscript or index. Any letter other than i, such as j, k, p, q or r, could be used instead.

Summation Notation:

The symbol $\sum_{i=1}^{N} X_i$ is used to denote the sum of all the $X_i$'s from i = 1 to N.

$$\sum_{i=1}^{N} X_i = X_1 + X_2 + X_3 + \cdots + X_N$$

We generally denote this sum simply by $\sum X$ or $\sum X_i$.

The symbol ∑ is the Greek capital letter sigma, denoting sum.

Ex. $$\sum_{i=1}^{N} aX_i = aX_1 + aX_2 + aX_3 + \cdots + aX_N = a(X_1 + X_2 + X_3 + \cdots + X_N) = a\sum_{i=1}^{N} X_i,$$ where a is a constant.

1.3 AVERAGES OR MEASURES OF CENTRAL TENDENCY

There are different ways of measuring the central tendency of a set of values. Various authors have defined the average differently.

“Average is an attempt to find one single figure to describe whole of figures.” – Clark

## Page 3

3

“An average is a single value selected from a group of values to represent them in some way - a value which is supposed to stand for whole group, of which it is a part, as typical of all the values in the group.” – A. E. Waugh

“An average is a typical value in the sense that it is sometimes employed to represent all the individual values in the series or of a variable.” – Ya-Lun Chou

Types of Averages:

Arithmetic Mean: a. Simple, b. Weighted

Median

Mode

Geometric Mean

Harmonic Mean

1.4 ARITHMETIC MEAN

The most popular and widely used measure for representing an entire data set by one value is the arithmetic mean, commonly called the average.

It is obtained by taking the sum of a group of numbers and dividing that sum by the number of values in the group.

Arithmetic mean can be of two types.

a. Simple arithmetic mean

b. Weighted arithmetic mean

A. Simple Arithmetic Mean – Individual Observations:

Calculation of the mean for individual observations (i.e. when frequencies are not given) is very simple. We add all the values of the variable and divide the total by the number of items.

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N}$$

where $\bar{X}$ = arithmetic mean; N = number of observations; $\sum X$ = sum of all the values of the variable X, i.e. $X_1 + X_2 + \cdots + X_N$.

Ex 1. Find the arithmetic mean of the following five values: 8, 45, 49, 54, 79.

Sol: We know that $\bar{X} = \frac{\sum X}{N}$.

$$\bar{X} = \frac{8 + 45 + 49 + 54 + 79}{5} = \frac{235}{5} = 47$$

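Ex 2. Find the arithmetic mean of the following values.

4350, 7200, 6750, 5480, 7940, 3820, 5920, 8450, 4900, 5350.

Sol: We know that $\bar{X} = \frac{\sum X}{N}$.

## Page 4

4

$$\bar{X} = \frac{4350 + 7200 + 6750 + 5480 + 7940 + 3820 + 5920 + 8450 + 4900 + 5350}{10} = \frac{60160}{10} = 6016$$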

Short cut method: $\bar{X} = A + \frac{\sum d}{N}$

where A is the assumed mean and d is the deviation of each item from the assumed mean, i.e. d = (X − A).

Ex 3. Calculate the arithmetic mean from the following data.

2690, 3670, 4580, 5660, 2750, 2830, 4100, 5720, 5040, 4840

Sol: Consider the assumed mean, A = 5000.

| X | d = (X − A) |
|---|---|
| 2690 | −2310 |
| 3670 | −1330 |
| 4580 | −420 |
| 5660 | 660 |
| 2750 | −2250 |
| 2830 | −2170 |
| 4100 | −900 |
| 5720 | 720 |
| 5040 | 40 |
| 4840 | −160 |
| | ∑d = −8120 |

$$\bar{X} = A + \frac{\sum d}{N} = 5000 - \frac{8120}{10} = 5000 - 812 = 4188$$
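The two methods above can be checked with a few lines of Python. This is a minimal sketch; the function names `mean_direct` and `mean_shortcut` are illustrative choices, not part of the text.

```python
def mean_direct(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def mean_shortcut(values, assumed_mean):
    """Short cut method: A + (sum of deviations from A) / N."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


# Data of Ex 3, with the assumed mean A = 5000 used in the solution
data = [2690, 3670, 4580, 5660, 2750, 2830, 4100, 5720, 5040, 4840]
print(mean_direct(data))          # 4188.0
print(mean_shortcut(data, 5000))  # 4188.0
```

Any assumed mean gives the same answer; a value close to the data simply keeps the deviations small when working by hand.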

1.4.1 Arithmetic Mean Computed from Grouped Data:

Simple Arithmetic Mean – Discrete Series:

When frequencies are given, the mean is calculated as

$$\bar{X} = \frac{\sum fX}{N}$$

where f = frequency; X = the variable; N = total number of observations, i.e. $\sum f$.

Here, first multiply the frequency in each row by the value of the variable to obtain the total $\sum fX$, and then divide this total by the number of observations, i.e. the total frequency.

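Ex 4. Following are the marks obtained by 60 students. Calculate the arithmetic mean.

## Page 5

5

| Marks | 15 | 30 | 45 | 60 | 70 | 80 |
|---|---|---|---|---|---|---|
| No. of students | 6 | 14 | 15 | 15 | 4 | 6 |

Sol: Let the marks be denoted by X and the number of students by f.

| Marks X | No. of students f | fX |
|---|---|---|
| 15 | 6 | 90 |
| 30 | 14 | 420 |
| 45 | 15 | 675 |
| 60 | 15 | 900 |
| 70 | 4 | 280 |
| 80 | 6 | 480 |
| | N = 60 | ∑fX = 2845 |

$$\bar{X} = \frac{\sum fX}{N} = \frac{2845}{60} = 47.42$$

Short cut method: $\bar{X} = A + \frac{\sum fd}{N}$

where A is the assumed mean, d is the deviation of each item from the assumed mean, i.e. d = (X − A), and N = $\sum f$.

Ex 5. Calculate the arithmetic mean by the short cut method using the data from Ex. 4.

Sol: Assumed mean, A = 45.

| Marks X | No. of students f | d = (X − A) | fd |
|---|---|---|---|
| 15 | 6 | −30 | −180 |
| 30 | 14 | −15 | −210 |
| 45 | 15 | 0 | 0 |
| 60 | 15 | 15 | 225 |
| 70 | 4 | 25 | 100 |
| 80 | 6 | 35 | 210 |
| | N = 60 | | ∑fd = 145 |

$$\bar{X} = A + \frac{\sum fd}{N} = 45 + \frac{145}{60} = 47.42$$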

Simple Arithmetic Mean – Continuous Series:

$$\bar{X} = \frac{\sum fm}{N}$$

where m = mid-point of each class; f = the frequency of each class; N = the total frequency.

## Page 6

6

Here, first obtain the mid-point of each class and denote it by m.

Multiply these mid-points by the respective frequency of each class and obtain the total $\sum fm$.

Divide the total by the sum of the frequencies, i.e. N.

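Ex 5. From the following data, compute the arithmetic mean.

| Marks | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 |
|---|---|---|---|---|---|---|
| No. of students | 5 | 10 | 25 | 30 | 20 | 10 |

Sol:

| Marks | Mid-point m | No. of students f | fm |
|---|---|---|---|
| 0-10 | 5 | 5 | 25 |
| 10-20 | 15 | 10 | 150 |
| 20-30 | 25 | 25 | 625 |
| 30-40 | 35 | 30 | 1050 |
| 40-50 | 45 | 20 | 900 |
| 50-60 | 55 | 10 | 550 |
| | | N = 100 | ∑fm = 3300 |

$$\bar{X} = \frac{\sum fm}{N} = \frac{3300}{100} = 33$$

Ex 6. From the following data, compute the arithmetic mean.

| Class intervals | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 |
|---|---|---|---|---|---|---|---|
| Frequency | 4 | 4 | 7 | 10 | 12 | 8 | 5 |

Sol:

| Class interval | Mid-point m | Frequency f | fm |
|---|---|---|---|
| 0-10 | 5 | 4 | 20 |
| 10-20 | 15 | 4 | 60 |
| 20-30 | 25 | 7 | 175 |
| 30-40 | 35 | 10 | 350 |
| 40-50 | 45 | 12 | 540 |
| 50-60 | 55 | 8 | 440 |
| 60-70 | 65 | 5 | 325 |
| | | N = 50 | ∑fm = 1910 |

$$\bar{X} = \frac{\sum fm}{N} = \frac{1910}{50} = 38.2$$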

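Short cut method: $\bar{X} = A + \frac{\sum fd}{N}$

## Page 7

7

where A is the assumed mean, d is the deviation of each mid-point from the assumed mean, i.e. d = (m − A), m = mid-point, and N = $\sum f$.

Ex 7. Calculate the arithmetic mean by the short cut method using the data from Ex. 5 (continuous series, marks 0-10 to 50-60).

Sol: Assumed mean, A = 35.

| Marks | Mid-point m | No. of students f | d = (m − A) | fd |
|---|---|---|---|---|
| 0-10 | 5 | 5 | −30 | −150 |
| 10-20 | 15 | 10 | −20 | −200 |
| 20-30 | 25 | 25 | −10 | −250 |
| 30-40 | 35 | 30 | 0 | 0 |
| 40-50 | 45 | 20 | 10 | 200 |
| 50-60 | 55 | 10 | 20 | 200 |
| | | N = 100 | | ∑fd = −200 |

$$\bar{X} = A + \frac{\sum fd}{N} = 35 - \frac{200}{100} = 33$$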
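The grouped-data calculations can be sketched in Python as follows. The helper names (`class_midpoints`, `grouped_mean`, `grouped_mean_shortcut`) are assumptions made for this illustration; classes are given as (lower limit, upper limit) pairs.

```python
def class_midpoints(classes):
    """Mid-point m of each class given as (lower, upper)."""
    return [(lower + upper) / 2 for lower, upper in classes]


def grouped_mean(midpoints, freqs):
    """Direct method: sum(f * m) / N, where N = sum(f)."""
    return sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)


def grouped_mean_shortcut(midpoints, freqs, assumed_mean):
    """Short cut method: A + sum(f * d) / N, with d = m - A."""
    total_fd = sum(f * (m - assumed_mean) for f, m in zip(freqs, midpoints))
    return assumed_mean + total_fd / sum(freqs)


# Marks 0-10 ... 50-60 with frequencies 5, 10, 25, 30, 20, 10
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [5, 10, 25, 30, 20, 10]
mids = class_midpoints(classes)

print(grouped_mean(mids, freqs))               # 33.0
print(grouped_mean_shortcut(mids, freqs, 35))  # 33.0
```

For a discrete series the same functions apply with the values of X passed in place of the mid-points.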

1.4.2 Properties of the Arithmetic Mean:

1. The sum of the deviations of the items from their arithmetic mean is always equal to zero.

Symbolically, $\sum (X - \bar{X}) = 0$

Ex 8:

| X | 10 | 20 | 30 | 40 | 50 | ∑X = 150 |
|---|---|---|---|---|---|---|
| X − X̄ | −20 | −10 | 0 | 10 | 20 | ∑(X − X̄) = 0 |

$$\bar{X} = \frac{\sum X}{N} = \frac{150}{5} = 30$$

When we calculate the deviations of all the items from their arithmetic mean ($\bar{X} = 30$), we find that the sum of the deviations from the arithmetic mean is zero, i.e. $\sum (X - \bar{X}) = 0$.

2. The sum of the squared deviations of the items from the arithmetic mean is a minimum, that is, less than the sum of the squared deviations of the items from any other value.

Ex 9:

## Page 8

8

| X | X − X̄ | (X − X̄)² |
|---|---|---|
| 2 | −2 | 4 |
| 3 | −1 | 1 |
| 4 | 0 | 0 |
| 5 | 1 | 1 |
| 6 | 2 | 4 |
| ∑X = 20 | ∑(X − X̄) = 0 | ∑(X − X̄)² = 10 |

$$\bar{X} = \frac{\sum X}{N} = \frac{20}{5} = 4$$

The sum of the squared deviations is equal to 10 in the above example. If the deviations are taken from any other value, the sum of the squared deviations would be greater than 10.

Let us calculate the squares of the deviations of the items from a value less than the arithmetic mean, say 3.

| X | X − 3 | (X − 3)² |
|---|---|---|
| 2 | −1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 2 | 4 |
| 6 | 3 | 9 |
| ∑X = 20 | | ∑(X − 3)² = 15 |

The sum of the squared deviations from 3 is 15, which is greater than 10.

3. Arithmetic mean is NOT independent of change of origin.

If each observation of a series is increased (or decreased) by a constant,

then the mean of these observations is also increased (or decreased) by

that constant.

4. Arithmetic mean is NOT independent of change of scale.

If each observation of a series is multiplied (or divided) by constant, then

the mean of these observations is also multiplied (or divided) by that

constant.

5. If the arithmetic means and the numbers of items of two or more related groups are given, we can compute the combined mean using the formula given below.

$$\bar{X}_{12} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2}{N_1 + N_2}$$

where

$\bar{X}_{12}$ = combined mean of the two groups;

$N_1$ = number of items in the first group; $N_2$ = number of items in the second group;

$\bar{X}_1$ = arithmetic mean of the first group; $\bar{X}_2$ = arithmetic mean of the second group.
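The properties above can be verified numerically. The sketch below uses the five values 2, 3, 4, 5, 6; the split of 10, 20, 30, 40, 50 into two groups for the combined mean is a hypothetical choice made only for illustration, as are the function names.

```python
def mean(values):
    return sum(values) / len(values)


def sum_squared_deviations(values, centre):
    """Sum of squared deviations of the values from an arbitrary centre."""
    return sum((x - centre) ** 2 for x in values)


def combined_mean(n1, mean1, n2, mean2):
    """Combined mean of two groups from their sizes and means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


x = [2, 3, 4, 5, 6]
x_bar = mean(x)                                   # 4.0

# Property 1: deviations from the mean sum to zero
print(sum(v - x_bar for v in x))                  # 0.0

# Property 2: squared deviations are smallest about the mean
print(sum_squared_deviations(x, x_bar))           # 10.0
print(sum_squared_deviations(x, 3))               # 15

# Properties 3 and 4: shifting or scaling the data shifts or scales the mean
print(mean([v + 10 for v in x]), mean([v * 2 for v in x]))  # 14.0 8.0

# Property 5: combined mean of two groups (hypothetical split)
group1, group2 = [10, 20, 30], [40, 50]
print(combined_mean(len(group1), mean(group1), len(group2), mean(group2)))  # 30.0
```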

## Page 9

9

1.5 WEIGHTED ARITHMETIC MEAN

The arithmetic mean gives equal importance to all the items. When the items are not equally important, we compute the weighted arithmetic mean. The term weight refers to the relative importance of an item.

$$\bar{X}_w = \frac{w_1X_1 + w_2X_2 + \cdots + w_nX_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum WX}{\sum W}$$

where

$\bar{X}_w$ represents the weighted arithmetic mean; X represents the variable values, i.e. $X_1, X_2, \ldots, X_n$;

W represents the weights attached to the variable values, i.e. $w_1, w_2, \ldots, w_n$ respectively.

To calculate the weighted arithmetic mean, multiply each weight by the corresponding value of X and obtain the total $\sum WX$. Then divide this total by the sum of the weights, i.e. $\sum W$.

In the case of a frequency distribution, if $f_1, f_2, \ldots, f_n$ are the frequencies of the variable values $X_1, X_2, \ldots, X_n$ respectively, then the weighted arithmetic mean is given by

$$\bar{X}_w = \frac{w_1(f_1X_1) + w_2(f_2X_2) + \cdots + w_n(f_nX_n)}{w_1 + w_2 + \cdots + w_n} = \frac{\sum W(fX)}{\sum W}$$

Note: The weighted arithmetic mean is equal to the simple arithmetic mean if all the weights are equal.

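Ex. 10 Calculate the weighted mean for the following data.

| X | 1 | 2 | 5 | 7 |
|---|---|---|---|---|
| W | 2 | 14 | 8 | 32 |

Sol:

| X | W | WX |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 14 | 28 |
| 5 | 8 | 40 |
| 7 | 32 | 224 |
| | ∑W = 56 | ∑WX = 294 |

## Page 10

10

$$\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{294}{56} = 5.25$$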

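Ex. 11 Calculate the weighted mean for the following data.

| Wages per day (X) | 200 | 150 | 85 |
|---|---|---|---|
| No. of workers (W) | 25 | 20 | 10 |

Sol:

| X | W | WX |
|---|---|---|
| 200 | 25 | 5000 |
| 150 | 20 | 3000 |
| 85 | 10 | 850 |
| | ∑W = 55 | ∑WX = 8850 |

$$\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{8850}{55} = 160.91$$

Ex. 12. Calculate the weighted mean for the following data and compare it with the arithmetic mean.

| Subject | Weight | Student X | Student Y | Student Z |
|---|---|---|---|---|
| Physics | 2 | 72 | 42 | 52 |
| Chemistry | 3 | 75 | 52 | 62 |
| Biology | 5 | 58 | 88 | 68 |

Sol: For Student X,

Arithmetic mean, $\bar{X}_X = \frac{\sum X}{N} = \frac{72 + 75 + 58}{3} = \frac{205}{3} = 68.33$

Weighted arithmetic mean, $\bar{X}_{wX} = \frac{\sum WX}{\sum W} = \frac{(2 \times 72) + (3 \times 75) + (5 \times 58)}{2 + 3 + 5} = \frac{144 + 225 + 290}{10} = \frac{659}{10} = 65.9$

For Student Y,

Arithmetic mean, $\bar{X}_Y = \frac{\sum X}{N} = \frac{42 + 52 + 88}{3} = \frac{182}{3} = 60.67$

Weighted arithmetic mean, $\bar{X}_{wY} = \frac{\sum WX}{\sum W} = \frac{(2 \times 42) + (3 \times 52) + (5 \times 88)}{2 + 3 + 5} = \frac{84 + 156 + 440}{10} = \frac{680}{10} = 68$

For Student Z,

Arithmetic mean, $\bar{X}_Z = \frac{\sum X}{N} = \frac{52 + 62 + 68}{3} = \frac{182}{3} = 60.67$

## Page 11

11

Weighted arithmetic mean, $\bar{X}_{wZ} = \frac{\sum WX}{\sum W} = \frac{(2 \times 52) + (3 \times 62) + (5 \times 68)}{2 + 3 + 5} = \frac{104 + 186 + 340}{10} = \frac{630}{10} = 63$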
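A short Python sketch of the weighted mean, applied to the three students of Ex. 12, makes the comparison with the simple mean explicit. The function name `weighted_mean` and the dictionary layout are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * X) / sum(W)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


weights = [2, 3, 5]  # Physics, Chemistry, Biology
marks = {
    "X": [72, 75, 58],
    "Y": [42, 52, 88],
    "Z": [52, 62, 68],
}

for student, scores in marks.items():
    simple = sum(scores) / len(scores)
    weighted = weighted_mean(scores, weights)
    print(student, round(simple, 2), round(weighted, 2))
```

Student Y scores highest in the most heavily weighted subject, so the weighted mean for Y is above the simple mean, while for X it is below.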

1.6 MEDIAN

The median is the middle value of a distribution. It is the numeric value that separates the higher half of a data set from the lower half, so that the number of observations above it is equal to the number of observations below it. The median is thus a positional average.

For example, if the salaries of five employees are 6100, 7150, 7250, 7500 and 8500, the median is 7250.

When the number of observations is odd, the calculation of the median is simple. When the number of observations is even, there is no single middle value, and the median is taken to be the arithmetic mean of the two middlemost items.

If, in the above example, we are given the salaries of six employees as 6100, 7150, 7250, 7500, 8500 and 9000, the median salary would be

$$\text{Median} = \frac{7250 + 7500}{2} = \frac{14750}{2} = 7375$$

Hence, for an even number of observations the median is found by averaging the two middle values.

Calculations of Median – Individual Observations:

Arrange the data in ascending or descending order of magnitude.

In a group composed of an odd number of values, adding 1 to the total number of values and dividing by 2 gives the position of the median.

Median = size of the $\left(\frac{N+1}{2}\right)$th item

Ex. 13 From the following data, compute the median:

15, 9, 7, 23, 25, 25, 42, 25, 16, 14, 58, 25, 31

Sol: Arrange the numbers in ascending order: 7, 9, 14, 15, 16, 23, 25, 25, 25, 25, 31, 42, 58

Median = size of the $\left(\frac{N+1}{2}\right)$th item = $\frac{13+1}{2}$ = 7th item = 25

Median = 25

For an even number of items the procedure differs slightly. The median of a group composed of an even number of items is the arithmetic mean of the two middle values, i.e. the two values in the middle are added and the sum is divided by 2.

## Page 12

12

Ex. 14 From the following data, compute the median:

451, 502, 523, 512, 622, 612, 754, 732, 701, 721

Sol: Arrange the numbers in ascending order:

451, 502, 512, 523, 612, 622, 701, 721, 732, 754

Median = size of the $\left(\frac{N+1}{2}\right)$th item = $\frac{10+1}{2}$ = 5.5th item

Size of the 5.5th item = $\frac{\text{5th item} + \text{6th item}}{2} = \frac{612 + 622}{2} = \frac{1234}{2}$

Median = 617
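The positional rule for individual observations can be sketched in Python. The function name `median_positional` is an illustrative choice; Python's standard `statistics.median` follows the same rule and is used here only as a cross-check.

```python
import statistics


def median_positional(values):
    """Median of individual observations by the (N + 1) / 2 rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)th item, i.e. 0-based index N // 2
        return ordered[mid]
    # Even N: mean of the two middle items
    return (ordered[mid - 1] + ordered[mid]) / 2


ex13 = [15, 9, 7, 23, 25, 25, 42, 25, 16, 14, 58, 25, 31]
ex14 = [451, 502, 523, 512, 622, 612, 754, 732, 701, 721]

print(median_positional(ex13))   # 25
print(median_positional(ex14))   # 617.0
print(statistics.median(ex14))   # 617.0
```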

Calculations of Median – Discrete Series:

Steps:

1. Arrange the data in ascending or descending order.

2. Find the cumulative frequencies.

3. Apply the formula: Median = size of the $\left(\frac{N+1}{2}\right)$th item, where N = $\sum f$.

4. In the cumulative frequency column, find the total that is equal to or next higher than that value, and determine the value of the variable corresponding to it. That gives the median value.

Ex. 15 From the following data, find the value of the median.

| Income (Rs.) | 450 | 500 | 630 | 550 | 710 | 580 |
|---|---|---|---|---|---|---|
| No. of persons | 29 | 31 | 21 | 25 | 11 | 35 |

Sol:

| Income (Rs.), ascending order | No. of persons f | Cumulative frequency c.f. |
|---|---|---|
| 450 | 29 | 29 |
| 500 | 31 | 60 |
| 550 | 25 | 85 |
| 580 | 35 | 120 |
| 630 | 21 | 141 |
| 710 | 11 | 152 |

Median = size of the $\left(\frac{N+1}{2}\right)$th item = $\frac{152+1}{2}$ = 76.5th item

The cumulative frequency next higher than 76.5 is 85, so the size of the 76.5th item is Rs. 550. This is the median income.

Calculations of Median – Continuous Series:

The following formula is used to calculate the median for a continuous series.

$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times i$$

L = lower limit of the median class; f = simple frequency of the median class;

c.f. = cumulative frequency of the class preceding the median class;

i = class interval of the median class

## Page 13

13

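Ex. 16 From the following data, find the value of the median.

| Marks | 70-80 | 60-70 | 50-60 | 40-50 | 30-40 | 20-30 | 10-20 |
|---|---|---|---|---|---|---|---|
| No. of students | 10 | 15 | 26 | 30 | 42 | 31 | 24 |

Sol: Arrange the data in ascending order.

| Marks | f | c.f. |
|---|---|---|
| 10-20 | 24 | 24 |
| 20-30 | 31 | 55 |
| 30-40 | 42 | 97 |
| 40-50 | 30 | 127 |
| 50-60 | 26 | 153 |
| 60-70 | 15 | 168 |
| 70-80 | 10 | 178 |

Median = size of the $\left(\frac{N}{2}\right)$th item = $\frac{178}{2}$ = 89th item

The median lies in the class 30-40.

$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times i$$

L = 30, $\frac{N}{2}$ = 89, c.f. = 55, f = 42, i = 10

$$\text{Median} = 30 + \frac{89 - 55}{42} \times 10 = 30 + 8.10 = 38.10$$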
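The interpolation formula for a continuous series can be sketched as a small Python function. The name `grouped_median` and its argument layout (lower class limits, frequencies and a common class width, all in ascending order) are assumptions for this illustration.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = L + ((N/2 - c.f.) / f) * i for equal-width classes."""
    n = sum(freqs)
    target = n / 2                 # the (N/2)th item
    cumulative = 0                 # c.f. of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= target:
            # This is the median class
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must not be empty")


# Ex. 16: classes 10-20 ... 70-80 in ascending order
lower_limits = [10, 20, 30, 40, 50, 60, 70]
freqs = [24, 31, 42, 30, 26, 15, 10]
print(round(grouped_median(lower_limits, freqs, 10), 2))  # 38.1
```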

1.7 MODE

The mode or the modal value is that value in a series of observations

which occurs with the greatest frequency.

For example, the mode of the values 4, 6, 9, 6, 5, 6, 9, 4 would be 6.

Calculations of Mode – Discrete Series:

Ex. 17 From the following data, find the value of the mode.

| Size of cloth | 28 | 29 | 30 | 31 | 32 | 33 |
|---|---|---|---|---|---|---|
| No. of persons wearing | 15 | 25 | 45 | 70 | 55 | 20 |

Sol: The mode or modal size is 31, because the value 31 occurs the maximum number of times (70).

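Calculations of Mode – Continuous Series:

The following formula is used to calculate the mode for a continuous series.

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$$

L = lower limit of the modal class; $f_1$ = frequency of the modal class;

## Page 14

14

$f_0$ = frequency of the class preceding the modal class;

$f_2$ = frequency of the class succeeding the modal class;

i = class interval of the modal class

Ex. 18 From the following data, find the value of the mode.

| Marks | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 | 80-90 | 90-100 |
|---|---|---|---|---|---|---|---|---|---|---|
| No. of students | 3 | 5 | 7 | 10 | 12 | 15 | 12 | 6 | 2 | 8 |

Sol: The class 50-60 has the highest frequency (15), so it is the modal class. Here L = 50, $f_1$ = 15, $f_0$ = 12, $f_2$ = 12 and i = 10.

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 50 + \frac{15 - 12}{2(15) - 12 - 12} \times 10 = 50 + \frac{3}{6} \times 10 = 55$$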
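The modal-class formula can be sketched in Python. The function name `grouped_mode` is illustrative, and treating a missing neighbouring class as having frequency 0 (when the modal class is the first or last one) is an assumption of this sketch, not something stated in the text.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * i for equal-width classes."""
    k = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                    # assumption: 0 if no preceding class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0       # assumption: 0 if no succeeding class
    return lower_limits[k] + (f1 - f0) / (2 * f1 - f0 - f2) * width


# Ex. 18: marks 0-10 ... 90-100
lower_limits = list(range(0, 100, 10))
freqs = [3, 5, 7, 10, 12, 15, 12, 6, 2, 8]
print(grouped_mode(lower_limits, freqs, 10))  # 55.0
```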

1.8 EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN, AND MODE

For moderately asymmetrical distributions, Karl Pearson expressed the relationship between the mean, median and mode as follows:

Mode = Mean – 3 [Mean – Median]

Mode = 3 Median – 2 Mean

If we know any two of the three values, we can compute the third from this relationship.

1.9 GEOMETRIC MEAN

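The geometric mean of a set of n observations is the nth root of their product.

$$G.M. = \sqrt[n]{(X_1)(X_2)(X_3)\cdots(X_n)}$$

The G. M. of the 3 values 2, 4, 8 would be

$$G.M. = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$$

For calculation purposes, take the logarithm of both sides:

$$\log G.M. = \frac{\log X_1 + \log X_2 + \cdots + \log X_n}{n}$$

$$\log G.M. = \frac{\sum \log X}{n}$$

$$G.M. = \text{Antilog}\left[\frac{\sum \log X}{n}\right]$$

In a discrete series, $G.M. = \text{Antilog}\left[\frac{\sum f \log X}{N}\right]$

In a continuous series, $G.M. = \text{Antilog}\left[\frac{\sum f \log m}{N}\right]$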

## Page 15

15

Calculations of Geometric Mean – Discrete/Individual Series:

Ex. 19 The daily incomes of ten families of a particular place are given below. Calculate the geometric mean.

85, 70, 15, 75, 500, 8, 45, 250, 40, 36

Sol:

| X | log X |
|---|---|
| 85 | 1.9294 |
| 70 | 1.8451 |
| 15 | 1.1761 |
| 75 | 1.8751 |
| 500 | 2.6990 |
| 8 | 0.9031 |
| 45 | 1.6532 |
| 250 | 2.3979 |
| 40 | 1.6021 |
| 36 | 1.5563 |
| | ∑ log X = 17.6373 |

$$G.M. = \text{Antilog}\left[\frac{\sum \log X}{n}\right] = \text{Antilog}\left[\frac{17.6373}{10}\right] = \text{Antilog}(1.7637) = 58.03$$

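Calculations of Geometric Mean – Continuous Series:

Ex 20. Calculate the geometric mean from the following data.

| Marks | 4-8 | 8-12 | 12-16 | 16-20 | 20-24 | 24-28 | 28-32 | 32-36 | 36-40 |
|---|---|---|---|---|---|---|---|---|---|
| Frequency | 8 | 12 | 20 | 30 | 15 | 12 | 10 | 6 | 2 |

Sol:

| Marks | Mid-point m | f | log m | f log m |
|---|---|---|---|---|
| 4-8 | 6 | 8 | 0.7782 | 6.2256 |
| 8-12 | 10 | 12 | 1.0000 | 12.0000 |
| 12-16 | 14 | 20 | 1.1461 | 22.9220 |
| 16-20 | 18 | 30 | 1.2553 | 37.6590 |
| 20-24 | 22 | 15 | 1.3424 | 20.1360 |
| 24-28 | 26 | 12 | 1.4150 | 16.9800 |
| 28-32 | 30 | 10 | 1.4771 | 14.7710 |
| 32-36 | 34 | 6 | 1.5315 | 9.1890 |
| 36-40 | 38 | 2 | 1.5798 | 3.1596 |
| | | N = 115 | | ∑ f log m = 143.0422 |

## Page 16

16

$$G.M. = \text{Antilog}\left[\frac{\sum f \log m}{N}\right] = \text{Antilog}\left[\frac{143.0422}{115}\right] = \text{Antilog}(1.2438) = 17.53$$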
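The logarithmic method translates directly into Python, with `10 ** x` playing the role of the antilog. This is a minimal sketch; the function names are illustrative, and because it uses full-precision logarithms its results can differ in the last digit from values read off four-figure log tables.

```python
import math


def geometric_mean(values):
    """G.M. = antilog(sum(log X) / n), using base-10 logarithms."""
    return 10 ** (sum(math.log10(x) for x in values) / len(values))


def grouped_geometric_mean(midpoints, freqs):
    """G.M. = antilog(sum(f * log m) / N) for a frequency distribution."""
    total = sum(f * math.log10(m) for m, f in zip(midpoints, freqs))
    return 10 ** (total / sum(freqs))


print(round(geometric_mean([2, 4, 8]), 4))     # 4.0

# Ex 20: mid-points of the classes 4-8, 8-12, ..., 36-40
midpoints = [6, 10, 14, 18, 22, 26, 30, 34, 38]
freqs = [8, 12, 20, 30, 15, 12, 10, 6, 2]
print(round(grouped_geometric_mean(midpoints, freqs), 2))  # 17.53
```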

1.10 HARMONIC MEAN

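The harmonic mean of a number of observations, none of which is zero, is the reciprocal of the arithmetic mean of the reciprocals of the given values. Thus, the harmonic mean (H. M.) of n observations $x_i$, i = 1, 2, …, n is given by

$$H.M. = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} = \frac{n}{\sum \frac{1}{x_i}}$$

Calculations of Harmonic Mean – Individual Observations:

Ex. 21 Find the harmonic mean of 4, 36, 45, 50, 75.

Sol:

$$H.M. = \frac{5}{\frac{1}{4} + \frac{1}{36} + \frac{1}{45} + \frac{1}{50} + \frac{1}{75}} = \frac{5}{\frac{225 + 25 + 20 + 18 + 12}{900}} = \frac{5 \times 900}{300} = 15$$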

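Calculations of Harmonic Mean – Discrete Series:

Formula for the harmonic mean in a discrete series:

$$H.M. = \frac{N}{\sum \left(f \times \frac{1}{X}\right)} = \frac{N}{\sum \frac{f}{X}}$$

Ex. 22 From the following data, find the harmonic mean.

| Marks | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| No. of students | 20 | 40 | 60 | 30 | 10 |

Sol:

| Marks X | f | f/X |
|---|---|---|
| 10 | 20 | 2 |
| 20 | 40 | 2 |
| 30 | 60 | 2 |
| 40 | 30 | 0.75 |
| 50 | 10 | 0.20 |
| | N = 160 | ∑(f/X) = 6.95 |

$$H.M. = \frac{N}{\sum \frac{f}{X}} = \frac{160}{6.95} = 23.02$$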

## Page 17

17

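Calculations of Harmonic Mean – Continuous Series:

Formula for the harmonic mean in a continuous series:

$$H.M. = \frac{N}{\sum \frac{f}{m}}$$

Ex. 23 From the following data, compute the value of the harmonic mean.

| Class interval | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 |
|---|---|---|---|---|---|
| Frequency | 6 | 8 | 12 | 9 | 5 |

Sol:

| Class interval | Mid-point m | f | f/m |
|---|---|---|---|
| 10-20 | 15 | 6 | 0.4000 |
| 20-30 | 25 | 8 | 0.3200 |
| 30-40 | 35 | 12 | 0.3429 |
| 40-50 | 45 | 9 | 0.2000 |
| 50-60 | 55 | 5 | 0.0909 |
| | | N = 40 | ∑(f/m) = 1.3538 |

$$H.M. = \frac{N}{\sum \frac{f}{m}} = \frac{40}{1.3538} = 29.55$$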

1.11 RELATION BETWEEN THE ARITHMETIC, GEOMETRIC AND HARMONIC MEAN

For a set of positive numbers, the arithmetic mean is greater than or equal to the geometric mean, and the geometric mean is greater than or equal to the harmonic mean.

$$A.M. \geq G.M. \geq H.M.$$

The equality signs hold only if all the numbers $X_1, X_2, X_3, \ldots, X_n$ are identical.
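The relation can be checked numerically with Python's standard `statistics` module (its `geometric_mean` function requires Python 3.8 or later). The values 2, 4, 8 and the helper name `three_means` are choices made for this illustration.

```python
import math
import statistics


def three_means(values):
    """Return (A.M., G.M., H.M.) of a list of positive numbers."""
    return (
        statistics.mean(values),
        statistics.geometric_mean(values),
        statistics.harmonic_mean(values),
    )


am, gm, hm = three_means([2, 4, 8])
print(round(am, 4), round(gm, 4), round(hm, 4))   # 4.6667 4.0 3.4286
print(am >= gm >= hm)                             # True

# Equality holds when all the values are identical
same = three_means([5, 5, 5])
print(all(math.isclose(m, 5) for m in same))      # True
```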

1.12 ROOT MEAN SQUARE

The root mean square (RMS) is defined as the square root of the mean square (the arithmetic mean of the squares of a set of numbers). It is also called the quadratic mean. It is sometimes denoted by $\sqrt{\overline{X^2}}$ and is given by

$$RMS = \sqrt{\overline{X^2}} = \sqrt{\frac{\sum_{i=1}^{N} X_i^2}{N}} = \sqrt{\frac{\sum X^2}{N}}$$

It is very useful in fields that study sine waves, such as electrical engineering.

## Page 18

18

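Ex. 24 Find the RMS of 1, 3, 5, 7 and 9.

Sol:

$$RMS = \sqrt{\frac{\sum X^2}{N}} = \sqrt{\frac{1^2 + 3^2 + 5^2 + 7^2 + 9^2}{5}} = \sqrt{\frac{165}{5}} = \sqrt{33} = 5.74$$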
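A minimal Python sketch of the RMS formula follows; the function name `rms` is an illustrative choice.

```python
import math


def rms(values):
    """Root mean square: square root of the mean of the squared values."""
    return math.sqrt(sum(x * x for x in values) / len(values))


# 1, 3, 5, 7, 9: the squares sum to 165, so RMS = sqrt(165 / 5) = sqrt(33)
print(round(rms([1, 3, 5, 7, 9]), 2))  # 5.74
```

The RMS is never smaller than the arithmetic mean of the same values; here the arithmetic mean is 5.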

1.13 QUARTILES, DECILES AND PERCENTILES

By definition, the median is the middle point that divides a set of ordered data into two equal parts. In the same way, we can divide the set into four equal parts; the dividing values, denoted by $Q_1$, $Q_2$ and $Q_3$, are called the first, second and third quartiles respectively. Similarly, the values that divide the data into 10 equal parts are called deciles and are denoted by $D_1, D_2, \ldots, D_9$, whereas the values dividing the data into 100 equal parts are called percentiles and are denoted by $P_1, P_2, \ldots, P_{99}$. The second quartile, the fifth decile and the 50th percentile all correspond to the median.

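Formulae:

In the formulae for individual observations and discrete series, the expression gives the position of the required item in the ordered data; a fractional position is rounded up to the next whole number.

Quartiles:

For individual observations, $Q_i = \left(\frac{i}{4}\right) \times$ No. of observations, i = 1, 2, 3

For discrete series, $Q_i = \left(\frac{i}{4}\right) N$, N = $\sum f$ and i = 1, 2, 3

For continuous series, $Q_i = L + \frac{\frac{iN}{4} - cf}{f} \cdot c$,

where i = 1, 2, 3; c = size of the class interval;

L = lower limit of the class interval in which the quartile lies;

f = frequency of the class interval in which the quartile lies;

cf = cumulative frequency of the class preceding the quartile class.

Deciles:

For individual observations, $D_i = \left(\frac{i}{10}\right) \times$ No. of observations, i = 1, 2, …, 9

For discrete series, $D_i = \left(\frac{i}{10}\right) N$, N = $\sum f$ and i = 1, 2, …, 9

For continuous series, $D_i = L + \frac{\frac{iN}{10} - cf}{f} \cdot c$, i = 1, 2, …, 9

Percentiles:

For individual observations, $P_i = \left(\frac{i}{100}\right) \times$ No. of observations, i = 1, 2, …, 99

For discrete series, $P_i = \left(\frac{i}{100}\right) N$, N = $\sum f$ and i = 1, 2, …, 99

For continuous series, $P_i = L + \frac{\frac{iN}{100} - cf}{f} \cdot c$, i = 1, 2, …, 99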

## Page 19

19

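Ex. 25 Find $Q_1$, $Q_3$, $D_1$, $D_5$, $D_8$, $P_8$, $P_{50}$ and $P_{85}$ of the following data: 20, 30, 25, 23, 22, 32, 36.

Sol: Arrange the data in ascending order; n = 7, i.e. an odd number.

20, 22, 23, 25, 30, 32, 36

| Measure | Position | Position rounded up | Value |
|---|---|---|---|
| Q1 | q1 = (1/4) × 7 = 1.75 | q1 = 2 | Q1 = 22 |
| Q3 | q3 = (3/4) × 7 = 5.25 | q3 = 6 | Q3 = 32 |
| D1 | d1 = (1/10) × 7 = 0.7 | d1 = 1 | D1 = 20 |
| D5 | d5 = (5/10) × 7 = 3.5 | d5 = 4 | D5 = 25 |
| D8 | d8 = (8/10) × 7 = 5.6 | d8 = 6 | D8 = 32 |
| P8 | p8 = (8/100) × 7 = 0.56 | p8 = 1 | P8 = 20 |
| P50 | p50 = (50/100) × 7 = 3.5 | p50 = 4 | P50 = 25 |
| P85 | p85 = (85/100) × 7 = 5.95 | p85 = 6 | P85 = 32 |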
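The positional rule used in Ex. 25 (compute the position, round a fractional position up, read off that item) can be sketched in Python. The function name `quantile_item` and its `(k, parts)` arguments are assumptions made for this illustration; other textbooks and software packages use different conventions and can return slightly different values.

```python
def quantile_item(values, k, parts):
    """k-th of `parts` quantiles: item at position ceil(k * n / parts), counted from 1."""
    ordered = sorted(values)
    n = len(ordered)
    position = max(1, -(-k * n // parts))   # integer ceiling of k * n / parts
    return ordered[position - 1]


data = [20, 30, 25, 23, 22, 32, 36]

print(quantile_item(data, 1, 4), quantile_item(data, 3, 4))          # Q1, Q3  -> 22 32
print(quantile_item(data, 1, 10), quantile_item(data, 5, 10),
      quantile_item(data, 8, 10))                                    # D1, D5, D8 -> 20 25 32
print(quantile_item(data, 8, 100), quantile_item(data, 50, 100),
      quantile_item(data, 85, 100))                                  # P8, P50, P85 -> 20 25 32
```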

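Ex. 26 Find $Q_1$, $Q_3$, $D_4$ and $P_{27}$ for the following data.

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| f | 1 | 9 | 26 | 59 | 72 | 52 | 29 | 7 | 1 |
| c.f. | 1 | 10 | 36 | 95 | 167 | 219 | 248 | 255 | 256 |

Sol: We know that $Q_i = \left(\frac{i}{4}\right) N$.

$Q_1$: $\left(\frac{1}{4}\right) \times 256 = 64$, and the c.f. just greater than 64 is 95. Hence $Q_1$ = 3.

$Q_3$: $\left(\frac{3}{4}\right) \times 256 = 192$, and the c.f. just greater than 192 is 219. Hence $Q_3$ = 5.

$D_4$: $\left(\frac{4}{10}\right) \times 256 = 102.4$, and the c.f. just greater than 102.4 is 167. Hence $D_4$ = 4.

$P_{27}$: $\left(\frac{27}{100}\right) \times 256 = 69.12$, and the c.f. just greater than 69.12 is 95. Hence $P_{27}$ = 3.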

Ex. 27 Find $Q_1$, $Q_3$, $D_2$ and $P_{90}$ for the following data.

| Marks | Below 10 | 10-20 | 20-40 | 40-60 | 60-80 | Above 80 |
|---|---|---|---|---|---|---|
| No. of students | 8 | 10 | 22 | 25 | 10 | 5 |

Sol: We know that $Q_i = L + \frac{\frac{iN}{4} - cf}{f} \cdot c$.

| Marks | Below 10 | 10-20 | 20-40 | 40-60 | 60-80 | Above 80 |
|---|---|---|---|---|---|---|
| f | 8 | 10 | 22 | 25 | 10 | 5 |
| cf | 8 | 18 | 40 | 65 | 75 | 80 |

## Page 20

20

Q1 = size of the (N/4)th item = size of the (80/4) = 20th item. Q1 lies in the class 20-40.

L = 20, N/4 = 20, cf = 18, f = 22 and c = 20

Q1 = 20 + {(20 – 18)/22} × 20 = 20 + 1.82 = 21.82

Q3 = size of the (3N/4)th item = size of the (3 × 80/4) = 60th item. Q3 lies in the class 40-60.

L = 40, 3N/4 = 60, cf = 40, f = 25 and c = 20

Q3 = 40 + {(60 – 40)/25} × 20 = 56

D2 = size of the (2N/10)th item = size of the (2 × 80/10) = 16th item. D2 lies in the class 10-20.

L = 10, 2N/10 = 16, cf = 8, f = 10 and c = 10

D2 = 10 + {(16 – 8)/10} × 10 = 18

P90 = size of the (90N/100)th item = size of the (90 × 80/100) = 72nd item. P90 lies in the class 60-80.

L = 60, 90N/100 = 72, cf = 65, f = 10 and c = 20

P90 = 60 + {(72 – 65)/10} × 20 = 74.
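The continuous-series formula handles quartiles, deciles and percentiles alike, which a single Python function makes clear. The name `grouped_quantile` is illustrative. The limits assigned to the two open-ended classes of Ex. 27 ("Below 10" as 0-10 and "Above 80" as 80-100) are assumptions made so the table can be typed in; none of the four results falls in those classes, so the assumption does not affect them.

```python
def grouped_quantile(lower_limits, widths, freqs, k, parts):
    """L + ((k*N/parts - cf) / f) * c for the class containing the (k*N/parts)th item."""
    n = sum(freqs)
    target = k * n / parts
    cumulative = 0                     # cf of the classes before the current one
    for lower, width, f in zip(lower_limits, widths, freqs):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must not be empty")


# Ex. 27 (open-ended classes closed by assumption, see above)
lower_limits = [0, 10, 20, 40, 60, 80]
widths = [10, 10, 20, 20, 20, 20]
freqs = [8, 10, 22, 25, 10, 5]

print(round(grouped_quantile(lower_limits, widths, freqs, 1, 4), 2))     # Q1  -> 21.82
print(round(grouped_quantile(lower_limits, widths, freqs, 3, 4), 2))     # Q3  -> 56.0
print(round(grouped_quantile(lower_limits, widths, freqs, 2, 10), 2))    # D2  -> 18.0
print(round(grouped_quantile(lower_limits, widths, freqs, 90, 100), 2))  # P90 -> 74.0
```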

1.14 SOFTWARE AND MEASURES OF CENTRAL TENDENCY

Many software packages are available to calculate measures of central tendency. We can use Excel to calculate the standard measures of central tendency (mean, median and mode). In Microsoft Excel, the mean can be calculated by using one of the functions AVERAGE, AVERAGEA, AVERAGEIF or AVERAGEIFS. The median can be calculated by using the MEDIAN function. We can calculate the mode by using the MODE function, and use GEOMEAN to calculate the geometric mean and HARMEAN to calculate the harmonic mean.

We can use SPSS to calculate the standard measures of central tendency (mean, median and mode). To get SPSS to compute the mean, median and mode, go to the Statistics menu, select the Analyse submenu, then the Descriptive Statistics submenu and then the Frequencies option. We can use MINITAB to calculate the standard measures of central tendency using the functions Mean, Median, Mode and GMEAN. To compute these, go to Stat > Tables > Descriptive Statistics.

Using R, one can easily obtain the values of the mean and the median with the summary function. The randomForest library can be used to impute missing values, using the median for numeric variables and the mode for categorical variables. The mode can also be located graphically.

## Page 21

21

It may be surprising that R's mode function (mode()) does not return the modal value. It returns the data type of the variable, which does not match our statistical expectation. So how would one find the mode using R? We need to use the table function, which provides the frequency distribution of the variable. The value with the highest frequency is the modal value.

The geometric mean is the average recommended for finding average growth (or decline) rates. It is defined as the nth root of the product of n terms. Since it is defined as a product, the observations should not include zero or negative values. Base R does not have a built-in function for computing it, but one can find it by using its formula directly in R.

1.15 SUMMARY

A measure of central tendency tells us where the middle of a group of data lies. The mean, median and mode are the most important measures of central tendency, and a complete data set may be represented by these values. The mean, median and mode need not have the same value. The mean is sensitive to extreme data values, so the median is a better way to describe a skewed distribution. It is possible that there is no mode in the data. For data with no negative values, the mean cannot be zero unless all data values are zero.

1.16 EXERCISE

1. Find the arithmetic mean of the following distribution:

| X | 10 | 30 | 50 | 70 | 89 |
|---|---|---|---|---|---|
| f | 7 | 8 | 10 | 15 | 10 |

2. Find the arithmetic mean of the following distribution:

| X | 3 | 9 | 12 | 14 | 15 | 17 |
|---|---|---|---|---|---|---|
| f | 1 | 3 | 4 | 1 | 4 | 2 |

3. Find the arithmetic mean of the following data.

| Class interval | 15-25 | 25-35 | 35-45 | 45-55 | 55-65 | 65-75 | 75-85 |
|---|---|---|---|---|---|---|---|
| Frequency | 6 | 11 | 7 | 4 | 4 | 2 | 1 |

4. Find the arithmetic mean of the following data.

## Page 22

22

| Class interval | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 |
|---|---|---|---|---|---|
| Frequency | 30 | 27 | 14 | 17 | 2 |

5. Obtain the median for the following frequency distribution:

| X | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| f | 8 | 10 | 11 | 16 | 20 | 25 | 15 | 9 | 6 |

[Ans: Median = 5]

6. Obtain the median from the following data.

| X | 20-25 | 25-30 | 30-35 | 35-40 | 40-45 | 45-50 | 50-55 | 55-60 |
|---|---|---|---|---|---|---|---|---|
| f | 35 | 45 | 70 | 105 | 90 | 74 | 51 | 30 |

7. Find the mode for the following distribution.

| Marks | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 |
|---|---|---|---|---|---|---|---|---|
| No. of students | 5 | 8 | 7 | 12 | 28 | 20 | 10 | 10 |

[Ans: Mode = 46.67]

8. Calculate the geometric mean from the following data.

125, 1462, 38, 7, 0.22, 0.08, 12.75, 0.5

[Ans: 6.952]

9. Find the geometric mean, harmonic mean and root mean square of the numbers 3, 5, 6, 6, 7, 10 and 12.

[Ans: G. M. = 6.43, H. M. = 5.87, RMS = 7.55]

10. Find the arithmetic mean, geometric mean and harmonic mean of the numbers 2, 4 and 8. Check the relation between them.

11. Calculate the third quartile, the seventh decile and the 20th percentile from the following data.

| Class | 2-4 | 4-6 | 6-8 | 8-10 |
|---|---|---|---|---|
| Frequency | 3 | 4 | 2 | 1 |

12. Calculate $Q_1$, $Q_2$, $Q_3$, $D_1$, $D_5$, $D_9$, $P_{11}$ and $P_{65}$ from the following data.

| Wages | No. of employees |
|---|---|
| 250.00 – 259.99 | 8 |
| 260.00 – 269.99 | 10 |
| 270.00 – 279.99 | 16 |
| 280.00 – 289.99 | 14 |
| 290.00 – 299.99 | 10 |
| 300.00 – 309.99 | 5 |
| 310.00 – 319.99 | 2 |

## Page 23

23

1.17 REFERENCES

Fundamentals of Mathematical Statistics by S. C. Gupta and V. K. Kapoor

Statistical Methods by S. P. Gupta

Statistics by Murray R. Spiegel and Larry J. Stephens

*****

## Page 24

24

2

THE STANDARD DEVIATION AND OTHER MEASURES OF DISPERSION

Unit structure

2.0 Objectives

2.1 Introduction

2.2 Dispersion, or Variation

2.3 Range

2.4 Semi-Interquartile Range

2.5 Mean Deviation

2.6 10–90 Percentile Range

2.7 Standard Deviation

2.8 Short Methods for Computing the Standard Deviation

2.9 Properties of the Standard Deviation

2.10 Variance

2.11 Charlier’s Check

2.12 Sheppard’s Correction for Variance

2.13 Empirical Relations between Measures of Dispersion

2.14 Absolute and Relative Dispersion

2.15 Coefficient of Variation

2.16 Standardized Variable and Standard Scores

2.17 Software and Measures of Dispersion

2.18 Summary

2.19 Exercise

2.20 Reference

2.0 OBJECTIVES

After going through this chapter, students will be able to learn:

To understand the importance of the concept of dispersion

To calculate the range, semi-interquartile range and mean deviation

To explain why measures of dispersion must be reported in addition to measures of central tendency

To calculate the standard deviation, variance and standard scores

To trace precise relationship

To compare two or more series with regard to their variability

## Page 25

25

2.1 INTRODUCTION

The measures of central tendency, or averages, give us an idea of the concentration of the observations about the central part of the distribution. But an average alone cannot adequately describe a set of observations. It must be supported and supplemented by some other measures, called measures of dispersion.

2.2 DISPERSION OR VARIATION

The literal meaning of dispersion is ‘scatteredness’. In two or more distributions the central value may be the same, but there can still be wide differences in the formation of the distributions. Measures of dispersion help us in studying this important characteristic of a distribution.

Definitions of Dispersion:

1. “Dispersion is the measure of the variation of the items.” – A. L.

Bowley

2. “Dispersion is the measure of extent to which individual item vary.” –

L. R. Connor

3. “The degree to which numerical data tend to spread about an average

value is called variation or dispersion of the data”. – Spiegel

2.3 RANGE

Range is the difference between the two extreme observations of the distribution. Symbolically,

Range = L – S, where L = largest item, S = smallest item

The relative measure corresponding to the range is called the coefficient of range.

$$\text{Coefficient of range} = \frac{L - S}{L + S}$$

Since the range is based on only the two extreme observations, it is not a reliable measure of dispersion.

Ex 1. From the following data, calculate the range and the coefficient of range.

| Day | Mon | Tues | Wed | Thurs | Fri | Sat |
|---|---|---|---|---|---|---|
| Price | 20 | 21 | 18 | 16 | 22 | 25 |

Sol: Range = L – S = 25 – 16 = 9

## Page 26

26

$$\text{Coefficient of range} = \frac{L - S}{L + S} = \frac{25 - 16}{25 + 16} = \frac{9}{41} = 0.22$$

For a continuous series, find the difference between the upper limit of the highest class and the lower limit of the lowest class.

Ex 2. From the following data, calculate the coefficient of range.

| Marks | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 |
|---|---|---|---|---|---|
| No. of students | 10 | 12 | 14 | 8 | 6 |

Sol: Here L = 60 (upper limit of the highest class) and S = 10 (lower limit of the lowest class).

$$\text{Coefficient of range} = \frac{L - S}{L + S} = \frac{60 - 10}{60 + 10} = \frac{50}{70} = 0.71$$
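A minimal Python sketch of the range and its coefficient follows; the function names are illustrative. For a continuous series, the upper limit of the highest class and the lower limit of the lowest class are passed in directly.

```python
def value_range(largest, smallest):
    """Range = L - S."""
    return largest - smallest


def coefficient_of_range(largest, smallest):
    """Coefficient of range = (L - S) / (L + S)."""
    return (largest - smallest) / (largest + smallest)


# Individual observations: daily prices
prices = [20, 21, 18, 16, 22, 25]
L, S = max(prices), min(prices)
print(value_range(L, S))                        # 9
print(round(coefficient_of_range(L, S), 2))     # 0.22

# Continuous series with classes 10-20 ... 50-60: L = 60, S = 10
print(round(coefficient_of_range(60, 10), 2))   # 0.71
```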

2.4 SEMI-INTERQUARTILE RANGE OR QUARTILE DEVIATION

The semi-interquartile range, or quartile deviation, is given by

$$Q.D. = \frac{Q_3 - Q_1}{2}$$

The quartile deviation is a better measure than the range, as it makes use of the middle 50% of the data. But since it ignores the other 50% of the data, it cannot be considered a reliable measure.

The relative measure corresponding to Q. D. is called the coefficient of Q. D.

$$\text{Coefficient of Q.D.} = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

The coefficient of Q. D. can be used to compare the degree of variation in different distributions.

Computation of Quartile Deviation – Individual Observations:

Ex. 3 Find the quartile deviation and the coefficient of quartile deviation from the following data.

25, 33, 45, 17, 35, 20, 55

Sol: Arrange the data in ascending order:

17, 20, 25, 33, 35, 45, 55

Q1 = size of the $\left[\frac{N+1}{4}\right]$th item = size of the $\left[\frac{7+1}{4}\right]$th item = 2nd item

## Page 27

27

∴ Q1 = 20

Q3 = size of the $3\left[\frac{N+1}{4}\right]$th item = size of the $3\left[\frac{7+1}{4}\right]$th item = 6th item

∴ Q3 = 45

$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{45 - 20}{2} = 12.5$$

$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{45 - 20}{45 + 20} = \frac{25}{65} = 0.385$$

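Computation of Quartile Deviation – Discrete Series:

Ex. 4 Find the quartile deviation and the coefficient of quartile deviation from the following data.

| Marks | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|
| No. of students | 7 | 10 | 18 | 12 | 10 | 6 |

Sol:

| Marks | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|
| Frequency f | 7 | 10 | 18 | 12 | 10 | 6 |
| cf | 7 | 17 | 35 | 47 | 57 | 63 |

Q1 = size of the $\left[\frac{N+1}{4}\right]$th item = size of the $\left[\frac{63+1}{4}\right]$th item = 16th item

∴ Q1 = 20

Q3 = size of the $3\left[\frac{N+1}{4}\right]$th item = size of the $3\left[\frac{63+1}{4}\right]$th item = 48th item

∴ Q3 = 50

$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{50 - 20}{2} = 15$$

$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{50 - 20}{50 + 20} = \frac{30}{70} = 0.4286$$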

Computation of Quartile Deviation - Continuous Series:

Ex. 5 Find out Quartile Deviation and Coefficient of Quartile Deviation

from following data.

Marks             35-44   45-54   55-64   65-74   75-84
No. of Students   12      40      33      13      12

Sol:

Marks         35-44   45-54   55-64   65-74   75-84
Frequency f   12      40      33      13      12
cf            12      52      75      88      100

$Q_1$ = size of $\left[\dfrac{N}{4}\right]$th item = size of $\left[\dfrac{100}{4}\right]$th item = 25th item

∴ $Q_1$ lies in the class 45 - 54

$Q_1 = L + \dfrac{N/4 - c.f.}{f} \times i$

L = 45, N/4 = 25, c.f. = 12 [c.f. of previous class], f = 40, i = 9

$Q_1 = 45 + \dfrac{25 - 12}{40} \times 9 = 47.925$

$Q_3$ = size of $3\left[\dfrac{N}{4}\right]$th item = size of $3\left[\dfrac{100}{4}\right]$th item = 75th item

∴ $Q_3$ lies in the class 55 - 64

$Q_3 = L + \dfrac{3N/4 - c.f.}{f} \times i$

L = 55, 3N/4 = 75, c.f. = 52 [c.f. of previous class], f = 33, i = 9

$Q_3 = 55 + \dfrac{75 - 52}{33} \times 9 = 61.2727$

Q. D. = $\dfrac{Q_3 - Q_1}{2} = \dfrac{61.2727 - 47.925}{2} = 6.67$

Coefficient of Q. D. = $\dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{61.2727 - 47.925}{61.2727 + 47.925} = \dfrac{13.3477}{109.1977} = 0.122$
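The interpolation formula for a continuous series can be sketched as a small R function. The function name grouped_quantile and its arguments are hypothetical, introduced only for this illustration. The call uses L = 45 or 55 and i = 9 exactly as in the worked example; working with class boundaries (44.5, 54.5, ...) and a width of 10 would give slightly different values.

```r
# Quantile of a grouped frequency distribution by linear interpolation.
# lower: lower limit of each class, width: class interval i,
# freq: class frequencies, p: proportion (0.25 for Q1, 0.75 for Q3).
grouped_quantile <- function(lower, width, freq, p) {
  cf      <- cumsum(freq)              # cumulative frequencies
  target  <- p * sum(freq)             # position N * p
  k       <- which(cf >= target)[1]    # first class whose c.f. reaches the target
  cf_prev <- if (k > 1) cf[k - 1] else 0
  lower[k] + (target - cf_prev) / freq[k] * width
}

lower <- c(35, 45, 55, 65, 75)
freq  <- c(12, 40, 33, 13, 12)

q1 <- grouped_quantile(lower, width = 9, freq, p = 0.25)
q3 <- grouped_quantile(lower, width = 9, freq, p = 0.75)
qd <- (q3 - q1) / 2
coef_qd <- (q3 - q1) / (q3 + q1)

print(round(c(Q1 = q1, Q3 = q3, QD = qd, Coef = coef_qd), 4))
```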

2.5 MEAN DEVIATION

Mean deviation is also known as the average deviation.

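If $x_i \mid f_i$, i = 1, 2, …, n is the frequency distribution, then the mean deviation from the average A (usually the mean, median or mode) is given by,

M. D. = $\dfrac{1}{N} \sum_{i=1}^{n} f_i \, |x_i - A|$, where $N = \sum_{i=1}^{n} f_i$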

Since mean deviation is based on all the observations, it is a better

measure of dispersion than range and quartile deviation.

Note: Mean deviation is least when taken from the median.

The relative measure corresponding to the mean deviation is called the coefficient of mean deviation and is obtained by,

Coefficient of M. D. = $\dfrac{\text{M. D.}}{A}$, where A is the average (mean, median or mode) from which the deviations are taken.

Computation of Mean deviation – Individual observations

M. D. = $\dfrac{1}{N} \sum |X - A| = \dfrac{1}{N} \sum |D|$, where $|D| = |X - A|$ is the modulus or absolute value of the deviation, ignoring plus and minus signs.

Ex. 6 Calculate mean deviation and coefficient of mean deviation from

following data:

600, 620, 640, 660, 680

Sol: From the above data, Median = 640

Data     Deviation from median 640, |D|
600      40
620      20
640      0
660      20
680      40
N = 5    ∑|D| = 120

M. D. = $\dfrac{1}{N} \sum |D| = \dfrac{120}{5} = 24$

Coefficient of M. D. = $\dfrac{\text{M. D.}}{\text{Median}} = \dfrac{24}{640} = 0.0375$
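A minimal R sketch of the same computation follows. The helper name mean_deviation is an assumption made for this illustration (base R has no function of that name); by default it measures deviations from the median.

```r
x <- c(600, 620, 640, 660, 680)

# Mean deviation about a chosen average A (default: the median)
mean_deviation <- function(x, A = median(x)) {
  mean(abs(x - A))
}

md      <- mean_deviation(x)
coef_md <- md / median(x)

print(md)       # 24
print(coef_md)  # 0.0375
```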

Computation of Mean deviation – Discrete series:

M. D. = $\dfrac{1}{N} \sum f|D|$, where $|D| = |X - A|$

Ex. 7 Calculate mean deviation from following data.

X   20   21   22   23   24
f   6    15   21   15   6

Sol:

X    f    c.f.   |D|   f|D|
20   6    6      2     12
21   15   21     1     15
22   21   42     0     0
23   15   57     1     15
24   6    63     2     12
     N = 63            ∑f|D| = 54

Median = size of $\dfrac{N + 1}{2}$th item = size of $\dfrac{63 + 1}{2}$th item = 32nd item

The size of the 32nd item is 22, hence Median = 22

M. D. = $\dfrac{1}{N} \sum f|D| = \dfrac{54}{63} = 0.857$

Computation of Mean deviation – Continuous series:

Here we have to obtain the mid -point of the various classes and take

deviations of these points from median. Formula is same.

M. D. = $\dfrac{1}{N} \sum f|D|$

Ex. 8 Calculate the median and the mean deviation from the median for the following data.

Size        0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency   7      12      18      25      16      14      8

Sol:

Size     f     c.f.   m.p. (m)   |D| = |m - 35.2|   f|D|
0-10     7     7      5          30.2               211.4
10-20    12    19     15         20.2               242.4
20-30    18    37     25         10.2               183.6
30-40    25    62     35         0.2                5.0
40-50    16    78     45         9.8                156.8
50-60    14    92     55         19.8               277.2
60-70    8     100    65         29.8               238.4
         N = 100                                    ∑f|D| = 1314.8

Median = size of $\dfrac{N}{2}$th item = size of $\dfrac{100}{2}$th item = 50th item

Median lies in the class 30 - 40

Median = $L + \dfrac{N/2 - c.f.}{f} \times i$

L = 30, N/2 = 50, c.f. = 37, f = 25, i = 10

Median = $30 + \dfrac{50 - 37}{25} \times 10 = 35.2$

M. D. = $\dfrac{1}{N} \sum f|D| = \dfrac{1314.8}{100} = 13.148$

2.6 10-90 PERCENTILE RANGE:

The 10-90 percentile range of a set of data is defined by,

10-90 percentile range = $P_{90} - P_{10}$

where $P_{10}$ and $P_{90}$ are the 10th and 90th percentiles of the data.

Semi 10-90 percentile range = $\dfrac{P_{90} - P_{10}}{2}$

2.7 STANDARD DEVIATION :

Standard deviation is the positive square root of the arithmetic mean of the

squares of the deviations of the given values from their arithmetic mean.

Standard deviation is also known as root mean square deviation as it is the square root of the mean of the squared deviations from the arithmetic mean. Standard deviation is denoted by the small Greek letter $\sigma$ (read as sigma).

Calculation of Standard Deviation - Individual Observations:

$\sigma = \sqrt{\dfrac{\sum x^2}{N}}$, where $x = X - \bar{X}$

Calculation of Standard Deviation - Discrete Series:

$\sigma = \sqrt{\dfrac{\sum f x^2}{N}}$, where $x = X - \bar{X}$ and $N = \sum f$

Calculation of Standard Deviation - Continuous Series:

$\sigma = \sqrt{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2} \times i$, where $d = \dfrac{m - A}{i}$, m = mid-point of the class, A = assumed mean and i = class interval

Ex. 9 Calculate mean and standard deviation from the following data.

Size        0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency   7      10      32      43      50      35      23

Sol:

Size     m.p. (m)   f     d = (m-35)/10   d2   fd    fd2
0-10     5          7     -3              9    -21   63
10-20    15         10    -2              4    -20   40
20-30    25         32    -1              1    -32   32
30-40    35         43    0               0    0     0
40-50    45         50    1               1    50    50
50-60    55         35    2               4    70    140
60-70    65         23    3               9    69    207
                    N = 200                    ∑fd = 116   ∑fd2 = 532

Assumed mean, A = 35

$\bar{X} = A + \dfrac{\sum f d}{N} \times i = 35 + \dfrac{116}{200} \times 10 = 40.8$

$\sigma = \sqrt{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2} \times i = \sqrt{\dfrac{532}{200} - \left(\dfrac{116}{200}\right)^2} \times 10$

$= \sqrt{2.66 - 0.3364} \times 10 = 1.5243 \times 10 = 15.243$
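The step-deviation method of Ex. 9 can be sketched in R as follows. The variable names are assumptions for this illustration. The last calculation cross-checks the result against the direct definition using class mid-points.

```r
m <- c(5, 15, 25, 35, 45, 55, 65)      # class mid-points
f <- c(7, 10, 32, 43, 50, 35, 23)      # frequencies
A <- 35                                # assumed mean
i <- 10                                # class interval

N <- sum(f)
d <- (m - A) / i                       # step deviations

mean_x <- A + sum(f * d) / N * i
sigma  <- sqrt(sum(f * d^2) / N - (sum(f * d) / N)^2) * i

# Cross-check against the direct definition using the mid-points
sigma_direct <- sqrt(sum(f * (m - mean_x)^2) / N)

print(mean_x)            # 40.8
print(round(sigma, 3))   # 15.243
```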

2.8 SHORT METHODS FOR COMPUTING THE STANDARD DEVIATION:

Calculation of Standard Deviation - Individual Observations :

When the actual mean is fractional, e.g. 568.245, the calculations become cumbersome. In such a case either the mean may be approximated or the deviations may be taken from an assumed mean A. The formula for deviations taken from an assumed mean A is:

$\sigma = \sqrt{\dfrac{\sum d^2}{N} - \left(\dfrac{\sum d}{N}\right)^2}$, where $d = X - A$

Ex. 10 Calculate standard deviation with the help of assumed mean.

340, 360, 390, 345, 355, 388, 372, 363, 377, 351

Sol: Consider assumed mean A = 364

X      d = (X - 364)   d2
340    -24             576
360    -4              16
390    26              676
345    -19             361
355    -9              81
388    24              576
372    8               64
363    -1              1
377    13              169
351    -13             169
       ∑d = 1          ∑d2 = 2689

$\sigma = \sqrt{\dfrac{\sum d^2}{N} - \left(\dfrac{\sum d}{N}\right)^2} = \sqrt{\dfrac{2689}{10} - \left(\dfrac{1}{10}\right)^2} = \sqrt{268.9 - 0.01} = 16.398$
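The assumed-mean shortcut can be checked with a few lines of R. The sketch uses the ten values from the solution table; the variable names are illustrative assumptions. It also shows that the choice of A does not affect the result, and that base R's sd() uses the divisor N - 1 rather than N.

```r
x <- c(340, 360, 390, 345, 355, 388, 372, 363, 377, 351)
A <- 364                       # assumed mean
N <- length(x)

d     <- x - A
sigma <- sqrt(sum(d^2) / N - (sum(d) / N)^2)

# The result does not depend on the choice of A: it equals the
# population standard deviation about the actual mean.
sigma_actual <- sqrt(mean((x - mean(x))^2))

# Note: base R's sd() divides by N - 1 (sample standard deviation),
# so it is slightly larger than the value above.
print(round(sigma, 3))   # 16.398
print(sd(x) > sigma)     # TRUE
```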

Calculation of Standard Deviation - Discrete Series :

Assumed mean method: $\sigma = \sqrt{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2}$, where $d = X - A$

Ex. 11 Calculate standard deviation from the following data.

Size        3.5   4.5   5.5   6.5   7.5   8.5   9.5
Frequency   4     8     21    60    85    30    9

Sol:

Size   f     d = (X - 6.5)   d2   fd    fd2
3.5    4     -3              9    -12   36
4.5    8     -2              4    -16   32
5.5    21    -1              1    -21   21
6.5    60    0               0    0     0
7.5    85    1               1    85    85
8.5    30    2               4    60    120
9.5    9     3               9    27    81
       N = 217                    ∑fd = 123   ∑fd2 = 375

Assumed mean, A = 6.5

$\sigma = \sqrt{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2} = \sqrt{\dfrac{375}{217} - \left(\dfrac{123}{217}\right)^2}$

$= \sqrt{1.7281 - 0.3213} = 1.1861$

2.9 PROPERTIES OF THE STANDARD DEVIATION

1. Combined standard deviation: We can compute the combined standard deviation of two or more groups. For two groups it is denoted by $\sigma_{12}$ and given by

$\sigma_{12} = \sqrt{\dfrac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 d_1^2 + N_2 d_2^2}{N_1 + N_2}}$

where $\sigma_{12}$ = combined standard deviation;

$\sigma_1$ = standard deviation of first group;

$\sigma_2$ = standard deviation of second group;

$d_1 = |\bar{X}_1 - \bar{X}_{12}|$;

$d_2 = |\bar{X}_2 - \bar{X}_{12}|$, with $\bar{X}_{12}$ the combined mean of the two groups.

2. The standard deviation of the first N natural numbers can be obtained by,

$\sigma = \sqrt{\dfrac{1}{12}(N^2 - 1)}$

Thus the standard deviation of the natural numbers 1 to 20 will be

$\sigma = \sqrt{\dfrac{1}{12}(20^2 - 1)} = \sqrt{\dfrac{399}{12}} = 5.77$

3. Standard deviation is always computed from the arithmetic mean

because the sum of the squares of the deviations of items from their

arithmetic mean is minimum.

4. For normal distribution,

Mean ± 1 𝜎 covers 68.27% of the items.

Mean ± 2 𝜎 covers 95.45% of the items.

Mean ± 3 𝜎 covers 99.73% of the items.
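Properties 1 and 2 can be verified numerically. The R sketch below is only an illustration: the helper pop_sd and the two small groups g1 and g2 are made-up assumptions, not data from the text.

```r
# Population standard deviation (divisor N), as used in this chapter
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Property 2: first N natural numbers
N <- 20
sd_formula <- sqrt((N^2 - 1) / 12)
sd_direct  <- pop_sd(1:N)

# Property 1: combined standard deviation of two hypothetical groups
g1 <- c(4, 8, 6, 10)
g2 <- c(12, 15, 9, 20, 14)
n1 <- length(g1)
n2 <- length(g2)
mean12 <- (n1 * mean(g1) + n2 * mean(g2)) / (n1 + n2)   # combined mean
d1 <- mean(g1) - mean12
d2 <- mean(g2) - mean12
sd_combined <- sqrt((n1 * pop_sd(g1)^2 + n2 * pop_sd(g2)^2 +
                     n1 * d1^2 + n2 * d2^2) / (n1 + n2))

print(round(sd_formula, 2))                          # 5.77
print(all.equal(sd_formula, sd_direct))              # TRUE
print(all.equal(sd_combined, pop_sd(c(g1, g2))))     # TRUE
```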

2.10 VARIANCE

The square of standard deviation is called the variance and is given by,

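Variance = $\dfrac{\sum (X - \bar{X})^2}{N}$

i.e. Variance = $\sigma^2$ or $\sigma = \sqrt{\text{Variance}}$

In the frequency distribution where deviations are taken from an assumed mean,

Variance = $\left\{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2\right\} \times i^2$, where $d = \dfrac{m - A}{i}$ and i = class interval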

Ex. 12 Calculate the variance and standard deviation from the following data.

Marks             10-20   20-30   30-40   40-50   50-60   60-70
No. of students   2       6       8       12      7       5

Sol:

Marks    m.p. (m)   f    d = (m-35)/10   d2   fd   fd2
10-20    15         2    -2              4    -4   8
20-30    25         6    -1              1    -6   6
30-40    35         8    0               0    0    0
40-50    45         12   1               1    12   12
50-60    55         7    2               4    14   28
60-70    65         5    3               9    15   45
                    N = 40                    ∑fd = 31   ∑fd2 = 99

Variance = $\left\{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2\right\} \times i^2 = \left\{\dfrac{99}{40} - \left(\dfrac{31}{40}\right)^2\right\} \times 10^2$

= (2.475 - 0.6006) × 100 = 187.44

Standard deviation $\sigma = \sqrt{187.44} = 13.69$

2.11 CHARLIER'S CHECK

Errors may be made while calculating the mean and standard deviation by different methods. The accuracy of the calculations can be checked by using the following formulae.

$\sum f(u + 1) = \sum f u + \sum f = \sum f u + N$

$\sum f(u + 1)^2 = \sum f(u^2 + 2u + 1) = \sum f u^2 + 2\sum f u + \sum f = \sum f u^2 + 2\sum f u + N$

$\sum f(u + 1)^3 = \sum f u^3 + 3\sum f u^2 + 3\sum f u + N$

Ex. 13 Use Charlier's check to verify the mean and the standard deviation.

Size   20-30   30-40   40-50   50-60   60-70   70-80   80-90
Freq   9       12      8       10      11      35      15

Sol:

X       f     m.p. (m)   u = (m-55)/i   u+1   f(u+1)   u2   fu    fu2
20-30   9     25         -3             -2    -18      9    -27   81
30-40   12    35         -2             -1    -12      4    -24   48
40-50   8     45         -1             0     0        1    -8    8
50-60   10    55         0              1     10       0    0     0
60-70   11    65         1              2     22       1    11    11
70-80   35    75         2              3     105      4    70    140
80-90   15    85         3              4     60       9    45    135
        N = ∑f = 100                    ∑f(u+1) = 167       ∑fu = 67   ∑fu2 = 423

$\sum f(u + 1) = 167$

$\sum f u + N = 67 + 100 = 167$

∴ $\sum f(u + 1) = \sum f u + N$

This provides the required check on the mean.

X       f     m.p. (m)   u = (m-55)/i   u+1   f(u+1)   f(u+1)2
20-30   9     25         -3             -2    -18      36
30-40   12    35         -2             -1    -12      12
40-50   8     45         -1             0     0        0
50-60   10    55         0              1     10       10
60-70   11    65         1              2     22       44
70-80   35    75         2              3     105      315
80-90   15    85         3              4     60       240
        N = ∑f = 100                    ∑f(u+1) = 167   ∑f(u+1)2 = 657

$\sum f(u + 1)^2 = 657$

$\sum f u^2 + 2\sum f u + N = 423 + 2 \times 67 + 100 = 657$

∴ $\sum f(u + 1)^2 = \sum f u^2 + 2\sum f u + N$

This provides the required check on the standard deviation.
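Both identities can be checked mechanically. The following R sketch recomputes each side for the data of Ex. 13; the variable names are assumptions for this illustration.

```r
f <- c(9, 12, 8, 10, 11, 35, 15)          # frequencies
m <- c(25, 35, 45, 55, 65, 75, 85)        # class mid-points
u <- (m - 55) / 10                        # step deviations, A = 55, i = 10
N <- sum(f)

# Check on the mean: sum f(u + 1) should equal sum fu + N
lhs_mean <- sum(f * (u + 1))
rhs_mean <- sum(f * u) + N

# Check on the standard deviation:
# sum f(u + 1)^2 should equal sum fu^2 + 2 * sum fu + N
lhs_sd <- sum(f * (u + 1)^2)
rhs_sd <- sum(f * u^2) + 2 * sum(f * u) + N

print(c(lhs_mean, rhs_mean))   # 167 167
print(c(lhs_sd, rhs_sd))       # 657 657
```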

2.12 SHEPPARD’S CORRECTION FOR VARIANCE

The computation of the standard deviation is somewhat in error as a result of grouping the data into classes (grouping error). To adjust for grouping error, we use the formula,

Corrected variance = variance from grouped data - $\dfrac{i^2}{12}$

where i is the class interval size. The correction $\dfrac{i^2}{12}$ is called Sheppard's correction. It is used for distributions of continuous variables where the tails tend to zero in both directions.

Ex. 14 Apply Sheppard's correction to determine the standard deviation of the data in Ex. 9.

Sol: $\sigma$ = 15.243 ∴ $\sigma^2$ = 232.349 and i = 10.

Corrected variance = variance from grouped data - $\dfrac{i^2}{12}$

= $232.349 - \dfrac{10^2}{12} = 224.016$

Corrected standard deviation = $\sqrt{224.016} = 14.967$
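The correction is a one-line adjustment, sketched here in R with the grouped-data standard deviation 15.243 and class interval 10 used in the solution above. The variable names are illustrative assumptions.

```r
sigma <- 15.243     # grouped-data standard deviation used in Ex. 14
i     <- 10         # class interval

variance_grouped   <- sigma^2
variance_corrected <- variance_grouped - i^2 / 12   # Sheppard's correction
sigma_corrected    <- sqrt(variance_corrected)

print(round(variance_corrected, 3))  # 224.016
print(round(sigma_corrected, 3))     # 14.967
```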

2.13 EMPIRICAL RELATIONS BETWEEN MEASURES OF DISPERSION :

There is a fixed relationship between the three measures of dispersion in a normal distribution.

Q. D. = $\dfrac{2}{3}\sigma$ or $\sigma = \dfrac{3}{2}$ Q. D. and

M. D. = $\dfrac{4}{5}\sigma$ or $\sigma = \dfrac{5}{4}$ M. D.

The quartile deviation is smallest, the mean deviation next and the

standard deviation is largest.

2.14 ABSOLUTE AND RELATIVE DISPERSION:

Measures of dispersion may be either absolute or relative. Absolute measures of dispersion are expressed in the same statistical unit in which the original data are given, such as kilograms, tons, rupees etc. These values may be used to compare the variations in two distributions provided the variables are expressed in the same units and are of the same average size. In case the two sets of data are expressed in different units, such as quintals of sugar versus tons of sugarcane, the absolute measures of dispersion are not comparable. In such cases measures of relative dispersion are used.

A measure of relative dispersion is the ratio of a measure of absolute dispersion to an appropriate average. It is sometimes called the coefficient of dispersion.

Relative dispersion = $\dfrac{\text{absolute dispersion}}{\text{average}}$

2.15 COEFFICIENT OF VARIATION:

The coefficient of variation is used in problems where we want to compare the variability of two or more series. The series or group for which the coefficient of variation is greater is said to be more variable, or less consistent, less uniform, less stable or less homogeneous. The series for which the coefficient of variation is smaller is said to be less variable, or more consistent, more uniform, more stable or more homogeneous.

If the absolute dispersion is the standard deviation $\sigma$ and the average is the mean $\bar{X}$, then the relative dispersion is called the coefficient of variation. It is denoted by C. V. and is given by,

Coefficient of variation (C. V.) = $\dfrac{\sigma}{\bar{X}} \times 100$

Ex. 15 Calculate arithmetic mean, standard deviation and coefficient of

variation.

Class   23-27   28-32   33-37   38-42   43-47   48-52   53-57   58-62   63-67   68-72
Freq    2       6       7       12      18      13      9       7       4       2

Sol:

Class   m.p. (m)   f    d = (m-50)/5   d2   fd    fd2
23-27   25         2    -5             25   -10   50
28-32   30         6    -4             16   -24   96
33-37   35         7    -3             9    -21   63
38-42   40         12   -2             4    -24   48
43-47   45         18   -1             1    -18   18
48-52   50         13   0              0    0     0
53-57   55         9    1              1    9     9
58-62   60         7    2              4    14    28
63-67   65         4    3              9    12    36
68-72   70         2    4              16   8     32
                   N = 80                   ∑fd = -44   ∑fd2 = 380

Mean ($\bar{X}$) = $A + \dfrac{\sum f d}{N} \times i = 50 + \dfrac{-44}{80} \times 5 = 47.25$

S. D. ($\sigma$) = $\sqrt{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2} \times i = \sqrt{\dfrac{380}{80} - \left(\dfrac{-44}{80}\right)^2} \times 5$

$= \sqrt{4.75 - 0.3025} \times 5 = 2.1089 \times 5 = 10.5445$

C. V. = $\dfrac{\sigma}{\bar{X}} \times 100 = \dfrac{10.5445}{47.25} \times 100 = 22.32\%$
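As an independent check on Ex. 15, the mean, standard deviation and coefficient of variation can be computed directly from the class mid-points in R. This is a sketch with illustrative variable names; it uses the divisor N, as in this chapter. Note that the square root is taken before multiplying by the class interval, which gives a standard deviation of about 10.54 and a C. V. of about 22.32%.

```r
m <- seq(25, 70, by = 5)                        # class mid-points
f <- c(2, 6, 7, 12, 18, 13, 9, 7, 4, 2)         # frequencies
N <- sum(f)

mean_x <- sum(f * m) / N
sigma  <- sqrt(sum(f * (m - mean_x)^2) / N)     # population SD (divisor N)
cv     <- sigma / mean_x * 100                  # coefficient of variation, %

print(mean_x)            # 47.25
print(round(sigma, 2))   # 10.54
print(round(cv, 2))      # 22.32
```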

2.16 STANDARDIZED VARIABLE AND STANDARD SCORES

The variable that measures the deviation from the mean in units of the standard deviation is called a standardized variable. It is independent of the units used and is given by,

$z = \dfrac{X - \bar{X}}{\sigma}$

If the deviations from the mean are given in units of the standard deviation, they are said to be expressed in standard units or standard scores. These are of great value in the comparison of distributions. The variable z is often used in educational testing, where it is called a standard score.

Ex. 16 Your test score is 160 while the test has a mean of 120 and

standard deviation of 15. If the distribution is normal, what is your z

score? Explain the meaning of the result.

Sol: $z = \dfrac{X - \bar{X}}{\sigma} = \dfrac{160 - 120}{15} = 2.67$

The score is about 2.67 standard deviations above the mean.

Ex. 17 A student received a grade of 84 on a final examination in English

for which mean grade was 76 and the standard deviation was 10. On a

final examination in Science for which mean grade was 82 and the

standard deviation was 16, she received a grade of 90. In which subject

was her relative standing higher?

Sol: Standardized variable $z = \dfrac{X - \bar{X}}{\sigma}$

For English, $z = \dfrac{84 - 76}{10} = 0.8$

For Science, $z = \dfrac{90 - 82}{16} = 0.5$

Thus, the student had a grade 0.8 of a standard deviation above the mean in English but only 0.5 of a standard deviation above the mean in Science. Thus her relative standing was higher in English.
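The comparison in Ex. 17 can be sketched in R with a small helper. The function name z_score and its argument names are assumptions made for this illustration.

```r
# Standard score: deviation from the mean in units of the standard deviation
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_english <- z_score(84, mu = 76, sigma = 10)
z_science <- z_score(90, mu = 82, sigma = 16)

print(c(English = z_english, Science = z_science))  # 0.8 0.5
```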

2.17 SOFTWARE AND MEASURES OF DISPERSION:

Statistical software provides a variety of measures of dispersion, usually as part of descriptive statistics. EXCEL and MINITAB allow the computation of all the measures discussed above. The output from MINITAB and STATISTIX, together with the graphics involved, helps clarify some statistical concepts that are otherwise hard to understand.

Calculating Range In Excel: Excel does not offer a function to compute the range. However, we can easily compute it by subtracting the minimum value from the maximum value. The formula is =MAX() - MIN(), where the dataset is referenced in both parentheses. The =MAX() and =MIN() functions find the maximum and the minimum values in the data. The difference between the two is the range.

Microsoft Excel has two functions to compute quartiles, =QUARTILE.INC() and =QUARTILE.EXC(). Both functions calculate the quartiles from the percentiles of the data. The inter-quartile range has to be calculated as the difference between the quartile 3 and quartile 1 values. For the standard deviation, Excel offers two functions: =STDEV.S() for the sample standard deviation and =STDEV.P() for the population standard deviation. Similarly, it offers =VAR.S() for the sample variance and =VAR.P() for the population variance.

Minitab may be used to compute descriptive statistics for numeric variables, including the mean, median, mode, standard deviation, variance and coefficient of variation. To compute these go to Stat - Tables - Descriptive statistics.

You can use SPSS to calculate measures of dispersion such as the range, semi-interquartile range, standard deviation and variance. To compute these, go to the Statistics menu, select the Analyse submenu, then the Descriptive Statistics submenu and then the Frequencies option. In MINITAB, the measures of dispersion can also be calculated with the functions Q1, Q3, Range, StDev, Variance and CoefVar.

2.18 SUMMARY

A measure of dispersion indicates the scattering of data. Dispersion is the extent to which values in a distribution differ from the average of the distribution. A measure of dispersion gives us an idea of how much the individual items vary around the central value. The range and interquartile range are generally ineffective measures of the dispersion of a set of data because they use only part of the data. The most useful measure, which describes the dispersion of all the values, is the standard deviation or variance. Dispersion, used together with central tendency, can prove very effective in making any statistical decision.

2.19 EXERCISE

1. Calculate Quartile Deviation (Q. D.) and Mean Deviation (M. D.) from mean for the following data.

Marks             0-10   10-20   20-30   30-40   40-50   50-60   60-70
No. of Students   6      5       8       15      7       6       3

[Ans: Q.D. = 11.23, Mean = 33.4, M.D from mean = 13.184]

2. Calculate Mean Deviation (M. D.) from mean for the following data.

Size   2   4   6   8   10   12   14   16
f      2   2   4   5   3    2    1    1

[Ans: Mean = 8, M.D from mean = 2.8]

3. Calculate Mean Deviation and its coefficient from mean for the following data.

Size   0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Freq   5      8       12      15      20      14      12      6

[Ans: Median = 43, M.D = 15.37, Coe. of M. D. = 0.357]

4. Find the standard deviation of the following data.

i. 12, 6, 7, 3, 15, 10, 18, 5

ii. 9, 3, 8, 8, 9, 8, 9, 18

[Ans: i. St. dev. $\sigma$ = 4.87, ii. St. dev. $\sigma$ = 3.87]

5. Find the standard deviation of the following data.

Age              20-25   25-30   30-35   35-40   40-45   45-50
No. of persons   170     110     80      45      40      35

Take assumed average = 32.5

[Ans: Standard deviation $\sigma$ = 7.936]

6. Calculate the standard deviation from the following data by short method.

240.12, 240.13, 240.15, 240.12, 240.17, 240.15, 240.17, 240.16, 240.22, 240.21

7. Calculate standard deviation from the following data by short method.

Salary           45   50   55   60   65   70   75   80
No. of persons   3    5    8    7    9    7    4    7

[Ans: Standard deviation = 10.35]

8. Calculate arithmetic mean, standard deviation and coefficient of variation.

Class       20-25   25-30   30-35   35-40   40-45
Frequency   1       22      64      10      3

[Ans: $\bar{X}$ = 32.1, S. D. ($\sigma$) = 3.441, C. V. = 10.72]

2.20 REFERENCE

FUNDAMENTAL OF MATHEMATICAL STATISTICS by S. C

Gupta and V. K. Kapoor

Statistical Methods by S. P. Gupta

STATISTICS by Murray R. Spiegel, Larry J. Stephens

*****

3

INTRODUCTION TO R

Unit structure

3.0 Objectives

3.1 Introduction

3.2 Basic syntax

3.3 Data types

3.4 Variables

3.5 Operators

3.6 Control statements

3.7 R-functions

3.8 R –Vectors

3.9 R – lists

3.10 R Arrays

3.11 Summary

3.12 Exercise

3.13 References

3.0 OBJECTIVES

After going through this chapter, students will be able to:

1. Understand the different data types and variables in R.

2. Understand the basics of R programming in terms of operators and control statements.

3. Use built-in and user-defined functions.

4. Understand the different data structures in R.

3.1 INTRODUCTION

R is a programming language and software environment for statistical computing and graphics. It is an open source programming language. It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in 1993. It was released on 31-Oct-2014 by the R Development Core Team. It is widely used by researchers from diverse disciplines to estimate and display results and by teachers of statistics and research methods. Today, millions of analysts, researchers, and brands such as Facebook, Google, Bing, Accenture, and Wipro are using R to solve complex issues. The applications of R are not limited to just one sector; we can see the use of R in banking, e-commerce, finance, and many more sectors.

It is freely available on www.r-project.org and can also be downloaded from the CRAN (Comprehensive R Archive Network) website http://CRAN.R-project.org.

R Command Prompt:

We will be using RStudio. Once the R environment is set up, it is easy to start the R command prompt by typing the following command at the command prompt:

$ R

This will launch the R interpreter and you will get a prompt > where you can start typing your program.

> "Hello, World!"

[1] "Hello, World!"

Usually, we will write our code inside scripts, which are called RScripts in R.

Write the code given below in a file

print("Hello, World!")

and save it as myfirstprogram.R, then run it in the console by writing:

Rscript myfirstprogram.R

It will produce the following output

[1] "Hello, World!"

3.2 BASIC SYNTAX

Any program in R is made up of three things: Variables, Comments, and Keywords. Variables are used to store the data.

Comments are used to improve code readability. They are like helping text in your R program. A single-line comment is written using # at the beginning of the statement.

Eg. # This is my first R program

Keywords are reserved words that hold a specific meaning for the R interpreter. A keyword cannot be used as a variable name or function name.

Following are the reserved words in R: if, else, while, repeat, for, function, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_ etc.

We can view these keywords by using either help(reserved) or ?reserved

R is a case-sensitive language.

3.3 DATA TYPES

Variables can store data of different types and different types can do

different things. Variables are the reserved memory location to store

values. As we create a variable in our program some space is reserved in

memory.

Following are the data types used in R programming.

Data type   Example                      Description
Numeric     50, 25.65, 999               Decimal values
Logical     TRUE, FALSE                  Data with only two possible values, true or false
Character   'A', "Excellent", '50.50'    A character is used to represent string values
Integer     5L, 70L, 9876L               The suffix L tells R to store the value as an integer
Complex     x = 5+4i                     A complex value with a real part and an imaginary part (written with the suffix i)
Raw         -                            A raw data type is used to hold raw bytes

We can use the class() function to check the data type of a variable.

# numeric
a <- 25.5
class(a)

# complex
a <- 10+5i
class(a)

# integer
a <- 100L
class(a)

# logical/boolean
a <- TRUE
class(a)

# character/string
a <- "I am doing R programming"
class(a)

Output:

[1] "numeric"
[1] "complex"
[1] "integer"
[1] "logical"
[1] "character"

3.4 VARIABLES

Variables are used to store the information to be manipulated in the R program. A variable in R can store an atomic vector, a group of atomic vectors or a combination of many R objects. A valid variable name consists of letters, numbers and the dot or underline characters. The variable name must start with a letter, or with a dot that is not followed by a number.

Ex - Valid - a, a_b, a.b, a1, a1., a.c

Invalid - 2a, _a
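One rough way to test whether a string is a syntactically valid name is to compare it with the result of base R's make.names(), which repairs invalid names and leaves valid ones unchanged. The helper is_valid_name below is a hypothetical name introduced for this sketch.

```r
# A name is syntactically valid if make.names() leaves it unchanged.
is_valid_name <- function(x) make.names(x) == x

candidates <- c("a", "a_b", "a.b", "a1", "a1.", "a.c", "2a", "_a", ".2a", "if")
result <- setNames(is_valid_name(candidates), candidates)
print(result)
```

The first six candidates are reported as valid. The last four are not: "2a" and "_a" start with an illegal character, ".2a" is a dot followed by a number, and "if" is a reserved word.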

R does not have a command for declaring a variable. A variable is created the moment you first assign a value to it.

In R, the assignment can be denoted in three ways:

1. = (Simple Assignment)

2. <- (Leftward Assignment)

3. -> (Rightward Assignment)

name = "Ajay"
gender <- "Male"
age <- 25

Here, name, gender and age are variables and "Ajay", "Male", 25 are values.

To print/output a variable, you do not need any function. You can just type the name of the variable.

name

O/P:

[1] "Ajay"   # auto-prints the value of the name variable

However, R also has the print() and cat() functions, which are used to print the value of a variable. The cat() function combines multiple values into a continuous print output.

cat("My name is", name, "\n")
cat("My age is", age, "\n")

O/P: My name is Ajay
My age is 25

ls() function: To know all the variables currently available in the workspace, use the ls() function.

# using equal to operator
a = "Good morning"
# using leftward operator
b <- "Good morning"
# using rightward operator
"Good morning" -> c
print(ls())

O/P: [1] "a" "b" "c"

# List the variables whose names contain the pattern "var".
> print(ls(pattern = "var"))

Variables starting with a dot (.) are hidden; they can be listed using the "all.names = TRUE" argument to the ls() function.

> print(ls(all.names = TRUE))

rm() function: This is a built-in function used to delete unwanted variables.

> rm(variable)

# using equal to operator
a = "Good morning"
# using leftward operator
b <- "Good morning"
# using rightward operator
"Good morning" -> c
# Removing variable
rm(a)
print(a)

O/P: Error in print(a) : object 'a' not found

All the variables can be deleted by using the rm() and ls() functions together.

> rm(list = ls())

> print(ls())

3.5 OPERATORS

Operators are symbols that direct the R interpreter to perform various kinds of operations on the operands. There are different types of operators, and each operator performs a different task. Operators carry out the mathematical, logical, and decision operations performed on complex, integer, and numeric values supplied as operands.

Types of Operators used in R programming:

• Arithmetic Operators

• Relational Operators

• Logical Operators

• Assignment Operators

• Miscellaneous Operators

Arithmetic Operators:

Arithmetic operators are used with numeric values to perform common mathematical operations. In the examples below, x <- c(2, 5.5, 6) and y <- c(8, 3, 4).

Operator   Description       Example
<-, =      Assignment        a <- 5; b = 10
+          Addition          print(x + y)     # O/P [1] 10.0 8.5 10.0
-          Subtraction       print(x - y)
*          Multiplication    print(x * y)
/          Division          print(x / y)
%%         Remainder         print(x %% y)    # O/P [1] 2.0 2.5 2.0
%/%        Integer quotient  print(x %/% y)   # O/P [1] 0 1 1
^, **      Exponent          print(x ^ y)     # O/P [1] 256.000 166.375 1296.000

Relational Operators: Relational/Comparison operators are used to compare two values. In the examples below, x <- c(2, 5.5, 6, 9) and y <- c(8, 2.5, 14, 9).

Operator   Description             Example
>          Greater than            print(x > y)    # O/P [1] FALSE TRUE FALSE FALSE
<          Less than               print(x < y)    # O/P [1] TRUE FALSE TRUE FALSE
<=         Less than or equal to   print(x <= y)   # O/P [1] TRUE FALSE TRUE TRUE
>=         Greater than or equal to   print(x >= y)   # O/P [1] FALSE TRUE FALSE TRUE
==         Equal                   print(x == y)   # O/P [1] FALSE FALSE FALSE TRUE
!=         Not equal               print(x != y)   # O/P [1] TRUE TRUE TRUE FALSE

Logical Operators: Logical operators are used to combine conditional statements.

Operator   Description and example
&          Element-wise logical AND.
           x <- c(3, 1, TRUE, 2+3i); y <- c(4, 1, FALSE, 2+3i)
           print(x & y)   # O/P [1] TRUE TRUE FALSE TRUE
|          Element-wise logical OR.
           x <- c(3, 0, TRUE, 2+2i); y <- c(4, 5, FALSE, 2+3i)
           print(x | y)   # O/P [1] TRUE TRUE TRUE TRUE
!          Element-wise logical NOT.
           x <- c(8, 0, FALSE, 4+4i)
           print(!x)   # O/P [1] FALSE TRUE TRUE FALSE
&&         Logical AND on single values. It gives TRUE only if both operands are TRUE.
           x <- c(3, 0, TRUE, 8+9i); y <- c(1, 3, TRUE, 3+4i)
           print(x[1] && y[1])   # O/P [1] TRUE
||         Logical OR on single values. It returns TRUE if at least one of the operands is TRUE.
           x <- c(4, 0, TRUE, 8+9i); y <- c(3, 5, TRUE, 2+3i)
           print(x[1] || y[1])   # O/P [1] TRUE

Note: older versions of R applied && and || to the first element of longer vectors. From R 4.3.0 onwards, passing a vector of length greater than one to && or || is an error, so the first elements are selected explicitly above.

Miscellaneous Operators: Miscellaneous operators are used to manipulate data:

Operator   Description and example
:          Creates a series of numbers in sequence.
           x <- 2:9
           print(x)   # O/P [1] 2 3 4 5 6 7 8 9
%in%       Finds out whether an element belongs to a vector.
           x <- 8; y <- 12; z <- 1:10
           print(x %in% z); print(y %in% z)   # O/P [1] TRUE [1] FALSE
%*%        Matrix multiplication, for example multiplying a matrix by its transpose.
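A short sketch of %*% in R follows; the matrix m is an arbitrary example chosen for illustration. It contrasts the matrix product with the ordinary * operator, which multiplies element by element.

```r
m <- matrix(1:4, nrow = 2)     # filled column by column: rows are (1, 3) and (2, 4)

product      <- m %*% t(m)     # matrix product of m with its transpose
element_wise <- m * m          # for contrast: multiplies matching entries only

print(product)
print(element_wise)
```

Here product is the 2 x 2 matrix with rows (10, 14) and (14, 20), while element_wise has rows (1, 9) and (4, 16).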

3.6 CONTROL STATEMENTS

Control statements are expressions used to control the execution and flow of the program based on the conditions provided in the statements.

if condition: The if statement checks whether the expression provided in the parentheses is true. The block of code inside the if statement is executed only when the expression evaluates to true.

Syntax:

if (expression) {
  # statements will execute if expression is true.
}

a <- 500
if (a > 100) {
  print(paste(a, "is greater than 100"))
}

O/P: [1] "500 is greater than 100"

if ... else condition: If the expression evaluates to true, the if block of code is executed; otherwise the else block of code is executed.

Syntax:

if (expression) {
  # statements will execute if expression is true.
} else {
  # statements will execute if expression is false.
}

a <- 500
if (a > 100) {
  print(paste(a, "is greater than 100"))
} else {
  print(paste(a, "is smaller than 100"))
}

O/P: [1] "500 is greater than 100"

Repeat loop: The repeat loop executes the same code again and again until a stop condition is met.

Syntax:

repeat {
  commands
  if (condition) {
    break
  }
}

a <- 1
repeat {
  print(a)
  a = a + 1
  if (a > 5) {
    break
  }
}

O/P:

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

return statement: The return statement is used to return the result of an executed function and to hand control back to the calling function.

Syntax:

return(expression)

Example:

fun <- function(a) {
  if (a > 0) {
    return("POSITIVE")
  } else if (a < 0) {
    return("NEGATIVE")
  } else {
    return("ZERO")
  }
}
fun(1)
fun(0)
fun(-1)

O/P:

[1] "POSITIVE"
[1] "ZERO"
[1] "NEGATIVE"

next statement: next statement is useful when we want to skip the current

iteration of a loop without terminating it.

Syntax:

next

Example:

a <- 1:8
# Print even numbers
for (i in a) {
  if (i %% 2 != 0) {
    next
  }
  print(i)
}

O/P:

[1] 2

[1] 4

[1] 6

[1] 8

break statement: The break keyword is a jump statement that is used to terminate the loop at a particular iteration.

Syntax:

if (test_expression) {
  break
}

switch Statement: A switch statement is a selection control mechanism. Switch case is a multiway branch statement. It allows a variable to be tested for equality against a list of values. If there is more than one match for a specific value, the switch statement returns the first match found.

Syntax:

switch(expression, case1, case2, case3, ...)

Example:

a <- switch(2, "Nagpur", "Mumbai", "Delhi", "Raipur")
print(a)

O/P:

[1] "Mumbai"
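The example above selects by position. switch() can also select by name when the expression is a character string, as sketched below. The function city_name and the city codes are made-up assumptions for this illustration.

```r
# switch() on a character value: the matching *named* argument is returned.
# The final unnamed argument acts as a default.
city_name <- function(code) {
  switch(code,
         NGP = "Nagpur",
         BOM = "Mumbai",
         DEL = "Delhi",
         "Unknown")
}

print(city_name("BOM"))   # "Mumbai"
print(city_name("XYZ"))   # "Unknown"

# switch() on a number picks the argument at that position;
# an out-of-range index returns NULL invisibly.
print(is.null(switch(5, "Nagpur", "Mumbai")))   # TRUE
```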

while loop: The while loop executes the same code again and again as long as the test condition remains true.

Syntax:

while (test_expression) {
  statement
}

Example:

a <- c("Hello", "World")
count <- 1
while (count < 5) {
  print(a)
  count = count + 1
}

O/P:

[1] "Hello" "World"
[1] "Hello" "World"
[1] "Hello" "World"
[1] "Hello" "World"

for loop: The for loop executes a group of statements repeatedly, once for each element of an object. It is an entry-controlled loop: the test condition is checked first, and the body of the loop is executed only if the condition is true.

Syntax:

for (value in vector) {
  statements
}

Example:

v <- LETTERS[1:5]
for (x in v) {
  print(x)
}

O/P:

[1] "A"
[1] "B"
[1] "C"
[1] "D"
[1] "E"

Example:

for (x in c(-5, 8, 9, 11))

{ print(x)

}

O/P:

[1] -5

[1] 8

[1] 9

[1] 11

Nested for-loop: Nested loops are similar to simple loops. Nested means a loop inside another loop. R allows any type of loop to be placed inside any other type of loop. For example, a while loop can be inside a for loop or vice versa. Nested loops are often used to work with the rows and columns of a matrix.

for ( i in 1:3)

{

for ( j in 1:i)

{

print( i * j)

}

}

O/P:

[1] 1

[1] 2

[1] 4

[1] 3

[1] 6

[1] 9

3.7 R-FUNCTIONS

A function is a set of statements organized together to perform a specific task. R has a large number of built-in functions, and users can create their own functions. An R function is created by using the keyword function.

Syntax :

Function_name <- function(arg1, arg2, ...) {

function body

}

The different components of a function are:

Function name: the actual name of the function.

Arguments: an argument is a placeholder for a value passed to the function. Arguments are optional, that is, a function may or may not have arguments, and arguments can have default values.

Function body: a set of statements that defines what the function does.

Return value: the last expression evaluated in the function body.

R has two types of function: built-in functions and user-defined functions.

Built-in function: Functions that are already defined in the programming framework are known as built-in functions. Simple examples of built-in functions are seq(), mean(), max(), sum() and paste(). They are called directly by user-written programs.

print(seq(50, 60))

O/P: [1] 50 51 52 53 54 55 56 57 58 59 60

print(mean(c(30, 40)))

O/P: [1] 35
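A common slip with mean() is passing the numbers as separate arguments. The sketch below (variable names are illustrative only) contrasts the two calls: mean() averages only its first argument, and a second positional argument is matched to the trim parameter rather than averaged.

```r
# Both numbers inside one vector: this is the average of 30 and 40.
single_vector <- mean(c(30, 40))

# Two separate arguments: only 30 is averaged; 40 is taken as 'trim'.
first_arg_only <- mean(30, 40)

print(single_vector)
print(first_arg_only)
```

Wrapping the values in c() is therefore the safe habit whenever a function expects a single vector of data.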

User defined Function: R allows us to create our own function in our

program. They are specific to what a user wants and once created they can

be used like built-in functions.

Example:

areaofCircle <- function(radius) {

area = pi*radius^2

return(area)

}

print(areaofCircle(2))

O/P: [1] 12.56637

Example:

# create a function without an argument.

a.function <- function() {

for(i in 1:5) {

b <- i^2

print( b)

}

}

# call the function a.function without supplying an argument

a.function()

O/P

[1] 1

[1] 4

[1] 9

[1] 16

[1] 25

# create a function with an argument.

a.function <- function(a) {

for(i in 1:a) {

b <- i^2

print(b)

}

}

# call the function a.function, supplying 5 as an argument

a.function(5)

O/P

[1] 1

[1] 4

[1] 9

[1] 16

[1] 25

Calling a function with argument values :

#Create a function with argument

a.function <- function(x, y, z) {

result <- x * y + z

print(result)

}

# call the function by position of arguments

a.function(4, 2, 10)

# call the function by names of the arguments

a.function(x = 10, y = 4, z = 2)

O/P:

[1] 18

[1] 42

Calling a function with default argument:

#Create a function with argument

a.function <- function(x = 5, y = 7) {

result <- x * y

print( result)

}

# call the function without giving any arguments

a.function()

# call the function with giving new values of the argument.

a.function(10, 6)

O/P:

[1] 35

[1] 60

3.8 R-VECTORS

A vector is the basic data structure in R: a sequence of elements that share the same data type. A vector can hold logical, integer, double, character, complex, or raw data. The length of a vector is the number of elements in it, and it is obtained with the length() function.

Vectors are classified into two kinds: atomic vectors and lists. The key difference is that all the elements of an atomic vector are of the same type, whereas the elements of a list can be of different types. The elements contained in a vector are known as its components. The type of a vector can be checked with the typeof() function.

Creation of atomic vector

Single Element Vector: when you write just one value, it becomes a

vector of length 1.

print("xyz")

print(25.5)

print(TRUE)

O/P

[1] "xyz"

[1] 25.5

[1] TRUE

Multiple Elements vector:

1. Using the colon ( : ) operator:

# Creating a sequence from 1 to 8

v <- 1:8

print(v)

# Creating a sequence from 1.5 to 8.5

v <- 1.5:8.5

print (v)

O/P:

[1] 1 2 3 4 5 6 7 8

[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5

2. Using the seq() function:

# Create a vector from 1 to 5 incrementing by 0.6

print (seq( 1, 5, by = 0.6 ))

O/P [1] 1.0 1.6 2.2 2.8 3.4 4.0 4.6

3. Using the c() function: Non-character values are converted to character type if any one of the elements is a character.

x <- c('mango', 'yellow', 10, TRUE)

print(x)

O/P:

[1] "mango" "yellow" "10" "TRUE"

Accessing Vector Elements: Elements of a vector are accessed using indexing with the [ ] brackets. Indexing starts at position 1. A negative index drops that element from the result. Logical values (TRUE and FALSE) can also be used as an index, keeping the positions marked TRUE. In a numeric index, 0 selects nothing and 1 selects the first element, so a vector of 0s and 1s does not behave like a logical mask.

# Accessing vector elements using position

x <- c("Jan", "Feb", "Mar", "April", "May", "Jun", "July", "Aug",

"Sept", "Oct", "Nov", "Dec")

a <- x[c(2,4,8)]

print(a)

# Accessing vector elements using logical indexing.

b <- x[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,

FALSE, TRUE, TRUE, FALSE, FALSE)]

print(b)

# Accessing vector elements using negative indexing.

c <- x[c(-1, -3, -4, -8, -9, -10, -11)]

print(c)

# Accessing vector elements using 0/1 indexing (each 1 returns the first element).

d <- x[c(0,1,0,0,1,0,0,0,0,0,0,1)]

print(d)

O/P:

[1] "Feb" "April" "Aug"

[1] "Jan" "Feb" "May" "Sept" "Oct"

[1] "Feb" "May" "Jun" "July" "Dec"

[1] "Jan" "Jan" "Jan"

Vector Manipulation:

Vector arithmetic : Two vectors of same length can be added, subtracted,

multiplied or divided giving the result as a vector output.

#create two vectors.

x <- c(2,7,3,4,0,10)

y <- c(3,10,0,7,1,1)

add.result <- x + y

print(add.result)

multi.result <- x * y

print(multi.result)

O/P:

[1] 5 17 3 11 1 11

[1] 6 70 0 28 0 10

Vector Element Recycling: If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operations.

x <- c(2,7,3,4,0,10)

y <- c(3,10)

# y becomes c(3,10,3,10,3,10)

add.result <- x + y

print(add.result)

O/P: [1] 5 17 6 14 3 20
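To make the recycling rule concrete, the following sketch writes out the expansion explicitly with rep_len() and checks that it gives the same sum. The name y_expanded is introduced here only for illustration.

```r
x <- c(2, 7, 3, 4, 0, 10)
y <- c(3, 10)

# Explicit version of what R does implicitly when lengths differ.
y_expanded <- rep_len(y, length(x))
print(y_expanded)

# The implicit and explicit forms agree element by element.
print(all(x + y == x + y_expanded))
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign that the vectors were not meant to be combined.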

Vector Element Sorting: Elements in a vector can be sorted using

the sort() function.

x <- c(2,7,3,-11,4,0,210)

sort.result <- sort(x)

print(sort.result)

resort.result <- sort(x, decreasing = TRUE)

print(resort.result)

O/P:

[1] -11 0 2 3 4 7 210

[1] 210 7 4 3 2 0 -11

3.9 R-LISTS

Lists are heterogeneous data structures: R objects that can contain elements of different types. Like vectors, they are one-dimensional. A list can hold vectors, matrices, character strings, functions, and so on. A list is created using the list() function.

Creating a List:

# Create a list containing strings, numbers and a vector

list_1 <- list("Apple", "Mango", 25.25, 60.5, c(16,25,36))

print(list_1)

O/P:

[[1]]

[1] "Apple"

[[2]]

[1] "Mango"

[[3]]

[1] 25.25

[[4]]

[1] 60.5

[[5]]

[1] 16 25 36

Naming List Elements: The list elements can be given names, and they can then be accessed using these names.

list_1 <- list(c("Mon", "Tues", "Wed"), matrix(c(1,2,3,4,5,6), nrow = 2))

# Give names to the elements in the list.

names(list_1) <- c("Days of Week", "Matrix")

print(list_1)

O/P:

$`Days of Week`

[1] "Mon" "Tues" "Wed"

$Matrix

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

Accessing List Elements:

Elements of a list can be accessed by their index in the list. In a named list they can also be accessed by name.

list_1 <- list(c("Mon", "Tues", "Wed"), matrix(c(1,2,3,4,5,6), nrow = 2))

names(list_1) <- c("Days of Week", "Matrix")

print(list_1[1])

print(list_1$Matrix)

O/P:

$`Days of Week`

[1] "Mon" "Tues" "Wed"

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

Manipulating List Elements:

We can add, delete and update list elements. New elements are usually added at the end of a list; any existing element can be updated by assigning to it, or removed by assigning NULL. The example below adds an element.

list_1 <- list(c("Mon", "Tues", "Wed"), matrix(c(1,2,3,4,5,6), nrow = 2))

names(list_1) <- c("Days of Week", "Matrix")

# add element at the end of the list

list_1[3] <- "Add Element"

print(list_1[3])

O/P:

[[1]]

[1] "Add Element"

Merging Lists:

You can merge many lists into one list by placing all the lists inside one c() function.

list_a <- list(1,2)

list_b <- list("Ankit", "Pooja")

# merge two lists

merged.list <- c(list_a, list_b)

print(merged.list)

O/P:

[[1]]

[1] 1

[[2]]

[1] 2

[[3]]

[1] "Ankit"

[[4]]

[1] "Pooja"

Converting List to vector:

A list can be converted to a vector so that the elements of the vector can be

used for further manipulation. All the arithmetic operations on vectors can

be applied after the list is converted into vectors.

list_a <- list(10:13)

print(list_a)

list_b <- list(20:23)

print(list_b)

# Convert the lists to vectors

x1 <- unlist(list_a)

x2 <- unlist(list_b)

print(x1)

print(x2)

add <- x1 + x2

print(add)

O/P:

[[1]]

[1] 10 11 12 13

[[1]]

[1] 20 21 22 23

[1] 10 11 12 13

[1] 20 21 22 23

[1] 30 32 34 36

3.10 R ARRAYS

Arrays are R data objects that can store data in more than two dimensions. An array is created with the array() function, which takes a vector as input and uses the values in the dim parameter to set the size of each dimension. If the input vector is shorter than the array, its values are recycled.

# create two vectors of different lengths

v1 <- c(1,2,3)

v2 <- c(4,5,6,7,8,9)

# Take these vectors as input to the array

array1 <- array(c(v1,v2), dim = c(3,3,2))

print(array1)

O/P

, , 1

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

, , 2

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

Naming Columns and Rows: We can give names to the rows, columns

and matrices in the array by using the dimnames parameter.

#create two vectors of different lengths

v1 <- c(1,2,3)

v2 <- c(4,5,6,7,8,9)

column.names <- c("Col1", "Col2", "Col3")

row.names <- c("Row1", "Row2", "Row3")

matrix.names <- c("Matrix1", "Matrix2")

array1 <- array(c(v1,v2), dim = c(3,3,2), dimnames = list(row.names,

column.names, matrix.names))

print(array1)

O/P:

, , Matrix1

Col1 Col2 Col3

Row1 1 4 7

Row2 2 5 8

Row3 3 6 9

, , Matrix2

Col1 Col2 Col3

Row1 1 4 7

Row2 2 5 8

Row3 3 6 9

Accessing Array Elements:

#create two vectors of different lengths

v1 <- c(1,2,3)

v2 <- c(4,5,6,7,8,9)

column.names <- c("Col1", "Col2", "Col3")

row.names <- c("Row1", "Row2", "Row3")

matrix.names <- c("Matrix1", "Matrix2")

array1 <- array(c(v1,v2), dim = c(3,3,2), dimnames = list(row.names,

column.names, matrix.names))

# Print the second row of the second matrix of the array.

print(array1[2, , 2])

#Print the element in the first row and 3rd column of the first matrix.

print(array1[1,3,1])

#Print the first Matrix

print(array1[, ,1])

O/P

Col1 Col2 Col3

2 5 8

[1] 7

Col1 Col2 Col3

Row1 1 4 7

Row2 2 5 8

Row3 3 6 9

Manipulating Array Elements: As an array is made up of matrices in multiple dimensions, operations on the elements of an array are carried out by accessing the elements of those matrices.

# create two vectors of different lengths

v1 <- c(5,9,3)

v2 <- c(10,11,12,13,14,15)

# Take these vectors as input to the array

array1 <- array(c(v1,v2), dim = c(3,3,2))

# create two more vectors of different lengths

v3 <- c(9,1,0)

v4 <- c(6,0,11,3,14,1,2,6,9)

array2 <- array(c(v3,v4), dim = c(3,3,2))

# create matrices from these arrays

matrix1 <- array1[, , 2]

matrix2 <- array2[, , 2]

# add the matrices

add1 <- matrix1 + matrix2

print(add1)

O/P:

[,1] [,2] [,3]

[1,] 7 19 19

[2,] 15 12 14

[3,] 12 12 26

3.11 SUMMARY

R is one of the world's most widely used statistical programming languages. It is a popular choice among data scientists and is applied to critical business problems. In addition, R is a full-fledged programming language, with a rich complement of mathematical functions, matrix operations and control structures, and it is easy to write your own functions. This chapter covered basic programming and the different types of R data objects, with suitable examples in simple and easy steps.

3.12 EXERCISE

1. Find the output of following code.

1) b= "15"

a = switch ( b,

"5"="Hello A",

"10"="Hello B",

"15"="Hello C",

"20"="Hello D" )

print (a)

2) a= 1

b = 2

y = switch (a+b, "Hello, A", "Hello B", "Hello C", "Hello D" )

print (y)

3) # Create vegetable vector

vegetable <- c('Potato', 'Onion', 'Brinjal', 'Pumpkin')

for ( x in vegetable) {

print(x)

}

4) for (i in c(5, 10, 15, 20, 0, 25))

{

if (i == 0)

{

break

}

print (i)

}

print("outside loop")

5) for (i in c(5, 10, 15, 20, 0, 25))

{

if (i == 0)

{

next

}

print (i)

}

print("outside loop")

6) a <- 10

b <- 14

count = 0

if (a < b) {

cat(a,"is a smaller number \n")

count=1

}

if(count==1){

cat("Block is successfully execute")

}

7) a <- 1

b <- 24

count = 0

while (a < b) {

cat(a, "is a smaller number \n")

a = a + 2

if (a == 15)

break

}

8) a <- 24

if(a%%2==0){

cat(a," is an even number")

}

if(a%%2!=0){

cat(a," is an odd number")

}

9) x <- c("Hardwork","is","the","key","of","success")

if("key" %in% x) {

print("key is found")

} else {

print("key is not found")

}

10) Rectangle = function(l=6, w=5){

area = l * w

return(area)

}

print(Rectangle(3, 4))

print(Rectangle(w = 9, l = 3))

print(Rectangle())

3.13 REFERENCES

The Art of R Programming: A Tour of Statistical Software Design by

Norman Matloff

Beginning R – The Statistical Programming Language by Mark

Gardener

https://www.javatpoint.com/

https://www.ict.gnu.ac.in/content/r-programming

https://www.geeksforgeeks.org/

*****

UNIT II

4

MOMENTS, SKEWNESS,

AND KURTOSIS

Unit Structure

4.1 Objective

4.2 Introduction

4.3 Moments

4.3.1 Moments for Grouped Data.

4.3.2 Relations between Moments.

4.3.3 Computation of Moments for Grouped Data.

4.4 Charlier's Check and Sheppard's Corrections.

4.5 Moments in Dimensionless Form

4.6 Skewness

4.7 Kurtosis

4.8 Software Computation of Skewness and Kurtosis .

4.9 Summary

4.10 Exercise

4.11 List of References

4.1 OBJECTIVE

After going through this unit, you will be able to:

Define moments and calculate them for ungrouped and grouped data.

Explain the types of moments.

Find the relation between raw and central moments.

Use Charlier's check when computing moments by the coding method.

Define Sheppard's corrections for moments.

Define moments in dimensionless form.

Define skewness and kurtosis.

Calculate moments, skewness and kurtosis using software.

4.2 INTRODUCTION

Measures of central tendency (location) and measures of dispersion (variation) are both useful for describing a data set, but neither tells us anything about the shape of the distribution. For that we need further measures, called moments, which identify the shape of the distribution through its skewness and kurtosis. Moments are statistical measures that describe particular characteristics of a distribution. Moments provide sufficient information to reconstruct a frequency distribution function. Four moments are commonly used: the 1st moment for the average, the 2nd for the variance, the 3rd for skewness and the 4th for kurtosis.

4.3 MOMENTS

The arithmetic mean of the $r$-th powers of the deviations of the values, taken from the mean, from zero, or from any arbitrary origin, is called a moment. Suppose we have $n$ observations $x_1, x_2, x_3, \ldots, x_n$. The first sample moment about the origin is the usual average. Three types of moments are defined as follows.

When the deviations are computed from the arithmetic mean $\bar{x}$, the moments are called moments about the mean, or central moments, denoted by $\mu_r$. Hence, for ungrouped data:

i) The first moment about the mean: $\mu_1 = \frac{\sum (x - \bar{x})}{n}$.

ii) The second moment about the mean: $\mu_2 = \frac{\sum (x - \bar{x})^2}{n}$.

iii) The third moment about the mean: $\mu_3 = \frac{\sum (x - \bar{x})^3}{n}$.

iv) The fourth moment about the mean: $\mu_4 = \frac{\sum (x - \bar{x})^4}{n}$.

When the deviations are computed from an arbitrary value (provisional mean) $A$, the moments are called moments about an arbitrary point or provisional mean, denoted by $\mu_r(a)$. Hence, for ungrouped data:

i) The first moment about $A$: $\mu_1(a) = \frac{\sum (x - A)}{n}$.

ii) The second moment about $A$: $\mu_2(a) = \frac{\sum (x - A)^2}{n}$.

iii) The third moment about $A$: $\mu_3(a) = \frac{\sum (x - A)^3}{n}$.

iv) The fourth moment about $A$: $\mu_4(a) = \frac{\sum (x - A)^4}{n}$.

When the deviations are computed from the origin (zero), the moments are called moments about the origin, or raw moments, denoted by $\mu'_r$:

i) The first moment about the origin: $\mu'_1 = \frac{\sum x}{n}$.

ii) The second moment about the origin: $\mu'_2 = \frac{\sum x^2}{n}$.

iii) The third moment about the origin: $\mu'_3 = \frac{\sum x^3}{n}$.

iv) The fourth moment about the origin: $\mu'_4 = \frac{\sum x^4}{n}$.

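Example 1: Find the raw moments for the following data: 5, 8, 2, 4, 6.

Solution:

| $x$ | $x^2$ | $x^3$ | $x^4$ |
|---|---|---|---|
| 5 | 25 | 125 | 625 |
| 8 | 64 | 512 | 4096 |
| 2 | 4 | 8 | 16 |
| 4 | 16 | 64 | 256 |
| 6 | 36 | 216 | 1296 |
| $\sum x = 25$ | $\sum x^2 = 145$ | $\sum x^3 = 925$ | $\sum x^4 = 6289$ |

i) The first moment about the origin: $\mu'_1 = \frac{\sum x}{n} = \frac{25}{5} = 5$.

ii) The second moment about the origin: $\mu'_2 = \frac{\sum x^2}{n} = \frac{145}{5} = 29$.

iii) The third moment about the origin: $\mu'_3 = \frac{\sum x^3}{n} = \frac{925}{5} = 185$.

iv) The fourth moment about the origin: $\mu'_4 = \frac{\sum x^4}{n} = \frac{6289}{5} = 1257.8$.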
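The raw-moment definition can be sketched in a few lines of R. The helper name raw_moment is hypothetical (it is not a built-in function), and the data are the five values whose powers are tabulated above.

```r
# Values from the worked table: 5, 8, 2, 4, 6.
x <- c(5, 8, 2, 4, 6)

# Hypothetical helper: r-th raw moment = mean of the r-th powers.
raw_moment <- function(x, r) sum(x^r) / length(x)

# First four raw moments in one vector.
raw <- sapply(1:4, function(r) raw_moment(x, r))
print(raw)
```

Because the r-th raw moment is simply the mean of x^r, the same result could be obtained with mean(x^r); the explicit sum is used here to mirror the formula.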

4.3.1 Moments for Grouped Data:

1. Moments about an arbitrary point: Let $x$ represent a variable occurring with frequency $f$ in a given distribution. Then the $r$-th moment $\mu_r(a)$ about $A$ is defined as

$\mu_r(a) = \frac{\sum f (x - A)^r}{N}$, where $N = \sum f$.

We generally find moments up to $r = 4$.

Therefore we can write:

i) The first moment about $A$: $\mu_1(a) = \frac{\sum f (x - A)}{N}$.

ii) The second moment about $A$: $\mu_2(a) = \frac{\sum f (x - A)^2}{N}$.

iii) The third moment about $A$: $\mu_3(a) = \frac{\sum f (x - A)^3}{N}$.

iv) The fourth moment about $A$: $\mu_4(a) = \frac{\sum f (x - A)^4}{N}$.

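Example 2: For the following distribution find all four moments about 5.

| X | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| F | 4 | 6 | 12 | 5 | 3 |

Solution: Let us prepare the table first.

| $x$ | $f$ | $(x-5)$ | $f(x-5)$ | $f(x-5)^2$ | $f(x-5)^3$ | $f(x-5)^4$ |
|---|---|---|---|---|---|---|
| 2 | 4 | -3 | -12 | 36 | -108 | 324 |
| 4 | 6 | -1 | -6 | 6 | -6 | 6 |
| 6 | 12 | 1 | 12 | 12 | 12 | 12 |
| 8 | 5 | 3 | 15 | 45 | 135 | 405 |
| 10 | 3 | 5 | 15 | 75 | 375 | 1875 |
| Total | 30 | | 24 | 174 | 408 | 2622 |

The moments about the arbitrary point $A = 5$ are given by:

The first moment about $A$: $\mu_1(a) = \frac{\sum f (x - A)}{N} = \frac{24}{30} = 0.8$.

The second moment about $A$: $\mu_2(a) = \frac{\sum f (x - A)^2}{N} = \frac{174}{30} = 5.8$.

The third moment about $A$: $\mu_3(a) = \frac{\sum f (x - A)^3}{N} = \frac{408}{30} = 13.6$.

The fourth moment about $A$: $\mu_4(a) = \frac{\sum f (x - A)^4}{N} = \frac{2622}{30} = 87.4$.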
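The grouped-data formula translates directly into vectorised R. This is a sketch only; the function name moment_about and its argument names are made up for illustration.

```r
x <- c(2, 4, 6, 8, 10)   # values
f <- c(4, 6, 12, 5, 3)   # frequencies
A <- 5                   # arbitrary origin

# Hypothetical helper: r-th moment about A for a frequency distribution.
moment_about <- function(x, f, A, r) sum(f * (x - A)^r) / sum(f)

m <- sapply(1:4, function(r) moment_about(x, f, A, r))
print(m)
```

Setting A to 0 in the same helper gives the raw moments, and setting it to the weighted mean sum(f * x) / sum(f) gives the central moments.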

2. Moments about the mean (central moments):

These are moments about the arithmetic mean $\bar{x}$. Hence, when $A$ is taken as $\bar{x}$, we obtain these moments. They are given by:

i) The first moment about $\bar{x}$: $\mu_1 = \frac{\sum f (x - \bar{x})}{N}$.

ii) The second moment about $\bar{x}$: $\mu_2 = \frac{\sum f (x - \bar{x})^2}{N}$.

iii) The third moment about $\bar{x}$: $\mu_3 = \frac{\sum f (x - \bar{x})^3}{N}$.

iv) The fourth moment about $\bar{x}$: $\mu_4 = \frac{\sum f (x - \bar{x})^4}{N}$.

From the definitions of the mean $\bar{x}$ and the standard deviation $\sigma$, it immediately follows that $\mu_1 = 0$ and $\mu_2 = \sigma^2$, while $\mu_3$ measures the asymmetry of the curve. These moments are important for studying the nature of the distribution.

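Example 3: Find the central moments for the following distribution:

| X | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| F | 2 | 5 | 6 | 5 | 2 |

Solution:

| $x$ | $f$ | $fx$ | $(x-\bar{x})$ | $f(x-\bar{x})$ | $f(x-\bar{x})^2$ | $f(x-\bar{x})^3$ | $f(x-\bar{x})^4$ |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | -2 | -4 | 8 | -16 | 32 |
| 2 | 5 | 10 | -1 | -5 | 5 | -5 | 5 |
| 3 | 6 | 18 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 20 | 1 | 5 | 5 | 5 | 5 |
| 5 | 2 | 10 | 2 | 4 | 8 | 16 | 32 |
| Total | 20 | 60 | | 0 | 26 | 0 | 74 |

Here, $\bar{x} = \frac{\sum fx}{N} = \frac{60}{20} = 3$.

Therefore, the central moments are given by:

i) The first moment about $\bar{x}$: $\mu_1 = \frac{\sum f (x - \bar{x})}{N} = \frac{0}{20} = 0$.

ii) The second moment about $\bar{x}$: $\mu_2 = \frac{\sum f (x - \bar{x})^2}{N} = \frac{26}{20} = 1.3$.

iii) The third moment about $\bar{x}$: $\mu_3 = \frac{\sum f (x - \bar{x})^3}{N} = \frac{0}{20} = 0$.

iv) The fourth moment about $\bar{x}$: $\mu_4 = \frac{\sum f (x - \bar{x})^4}{N} = \frac{74}{20} = 3.7$.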
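The same central moments can be reproduced in R as a quick check. This is a minimal sketch; the variable names x_bar and central are chosen here for illustration.

```r
x <- 1:5
f <- c(2, 5, 6, 5, 2)

N     <- sum(f)            # total frequency
x_bar <- sum(f * x) / N    # weighted arithmetic mean

# r-th central moment: frequency-weighted mean of (x - x_bar)^r.
central <- sapply(1:4, function(r) sum(f * (x - x_bar)^r) / N)
print(central)
```

The third moment is zero because the frequencies are symmetric about the mean, which matches the remark that the third central moment measures asymmetry.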

3. Moments about the origin (raw moments):

As the name suggests, taking $A$ as the origin ($A = 0$), we get these moments. They are given by:

i) The first moment about the origin: $\mu'_1 = \frac{\sum f x}{N}$.

ii) The second moment about the origin: $\mu'_2 = \frac{\sum f x^2}{N}$.

iii) The third moment about the origin: $\mu'_3 = \frac{\sum f x^3}{N}$.

iv) The fourth moment about the origin: $\mu'_4 = \frac{\sum f x^4}{N}$.

Note that the first moment about the origin is the mean of the data.

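Example 4: Find the raw moments for the following data:

| X | -1 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| F | 2 | 4 | 3 | 7 | 3 | 1 |

Solution: Let us prepare the table.

| $x$ | $f$ | $fx$ | $fx^2$ | $fx^3$ | $fx^4$ |
|---|---|---|---|---|---|
| -1 | 2 | -2 | 2 | -2 | 2 |
| 0 | 4 | 0 | 0 | 0 | 0 |
| 1 | 3 | 3 | 3 | 3 | 3 |
| 2 | 7 | 14 | 28 | 56 | 112 |
| 3 | 3 | 9 | 27 | 81 | 243 |
| 4 | 1 | 4 | 16 | 64 | 256 |
| Total | 20 | 28 | 76 | 202 | 616 |

Therefore, the raw moments are given by:

i) The first moment about the origin: $\mu'_1 = \frac{\sum f x}{N} = \frac{28}{20} = 1.4$.

ii) The second moment about the origin: $\mu'_2 = \frac{\sum f x^2}{N} = \frac{76}{20} = 3.8$.

iii) The third moment about the origin: $\mu'_3 = \frac{\sum f x^3}{N} = \frac{202}{20} = 10.1$.

iv) The fourth moment about the origin: $\mu'_4 = \frac{\sum f x^4}{N} = \frac{616}{20} = 30.8$.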

4.3.2 Relations between Moments:

We have studied three different types of moments. It is useful to have simple relations between them. We now give the inter-relations between the various moments and solve examples using these relations.

Relation between moments about an arbitrary point and the central moments:

i) $\mu_1 = \mu_1(a) - \mu_1(a) = 0$

ii) $\mu_2 = \mu_2(a) - \mu_1(a)^2$

iii) $\mu_3 = \mu_3(a) - 3\mu_1(a)\mu_2(a) + 2\mu_1(a)^3$

iv) $\mu_4 = \mu_4(a) - 4\mu_1(a)\mu_3(a) + 6\mu_1(a)^2\mu_2(a) - 3\mu_1(a)^4$

Conversely, the moments $\mu_r(a)$ about $A$ in terms of the central moments $\mu_r$ are given as follows:

i) $\mu_1(a) = \bar{x} - A$

ii) $\mu_2(a) = \mu_2 + \mu_1(a)^2$

iii) $\mu_3(a) = \mu_3 + 3\mu_2\mu_1(a) + \mu_1(a)^3$

iv) $\mu_4(a) = \mu_4 + 4\mu_3\mu_1(a) + 6\mu_2\mu_1(a)^2 + \mu_1(a)^4$

Relation between raw moments and central moments:

Recall that the raw moments $\mu'_r$ are obtained from the general moments $\mu_r(a)$ when $A$ is taken as 0.

Hence, taking $A = 0$ and replacing each $\mu_r(a)$ by the corresponding $\mu'_r$ in the formulas, we get

i) $\mu_1 = \mu'_1 - \mu'_1 = 0$

ii) $\mu_2 = \mu'_2 - (\mu'_1)^2$

iii) $\mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3$

iv) $\mu_4 = \mu'_4 - 4\mu'_1\mu'_3 + 6(\mu'_1)^2\mu'_2 - 3(\mu'_1)^4$

Conversely, the raw moments $\mu'_r$ in terms of the central moments $\mu_r$ are given as follows:

i) $\mu'_1 = \bar{x}$

ii) $\mu'_2 = \mu_2 + (\mu'_1)^2$

iii) $\mu'_3 = \mu_3 + 3\mu_2\mu'_1 + (\mu'_1)^3$

iv) $\mu'_4 = \mu_4 + 4\mu_3\mu'_1 + 6\mu_2(\mu'_1)^2 + (\mu'_1)^4$
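One way to gain confidence in these relations is to apply them to a small frequency distribution and compare with central moments computed directly from the definition. The sketch below does this for an illustrative distribution (values -1 to 4 with frequencies 2, 4, 3, 7, 3, 1); the function name raw_to_central is hypothetical.

```r
x <- c(-1, 0, 1, 2, 3, 4)
f <- c(2, 4, 3, 7, 3, 1)
N <- sum(f)

# First four raw moments of the distribution.
raw <- sapply(1:4, function(r) sum(f * x^r) / N)

# Hypothetical helper applying the raw-to-central relations.
# m is the vector c(m1, m2, m3, m4) of raw moments.
raw_to_central <- function(m) {
  c(0,
    m[2] - m[1]^2,
    m[3] - 3 * m[1] * m[2] + 2 * m[1]^3,
    m[4] - 4 * m[1] * m[3] + 6 * m[1]^2 * m[2] - 3 * m[1]^4)
}

from_raw <- raw_to_central(raw)

# Central moments straight from the definition, for comparison.
x_bar  <- raw[1]
direct <- sapply(1:4, function(r) sum(f * (x - x_bar)^r) / N)

print(from_raw)
print(isTRUE(all.equal(from_raw, direct)))
```

Comparing with all.equal() rather than == allows for the small rounding differences that floating-point arithmetic introduces.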

Example 5: The first four central moments of a distribution are 0, 3, 5, 10. If the mean of the distribution is 2, find the moments about 3.

Solution: We have $A = 3$, $\bar{x} = 2$, $\mu_1 = 0$, $\mu_2 = 3$, $\mu_3 = 5$, $\mu_4 = 10$.

Using the relation between central moments and moments about an arbitrary point:

i) $\mu_1(a) = \bar{x} - A = 2 - 3 = -1$.

ii) $\mu_2(a) = \mu_2 + \mu_1(a)^2 = 3 + (-1)^2 = 4$.

iii) $\mu_3(a) = \mu_3 + 3\mu_2\mu_1(a) + \mu_1(a)^3 = 5 + 3(3)(-1) + (-1)^3 = -5$.

iv) $\mu_4(a) = \mu_4 + 4\mu_3\mu_1(a) + 6\mu_2\mu_1(a)^2 + \mu_1(a)^4 = 10 + 4(5)(-1) + 6(3)(-1)^2 + (-1)^4 = 10 - 20 + 18 + 1 = 9$.

Example 6: The first four raw moments about the origin are 2, 12, 74 and 384. Find the mean $\bar{x}$ and the first four central moments.

Solution: The raw moments are the moments $\mu_r(a)$ with $A = 0$. We are given $\mu_1(a) = 2$, $\mu_2(a) = 12$, $\mu_3(a) = 74$ and $\mu_4(a) = 384$, with $A = 0$.

Therefore, Mean $= \bar{x} = \mu_1(a) + A = 2 + 0 = 2$.

Using the relation between raw moments and central moments:

i) $\mu_1 = \mu_1(a) - \mu_1(a) = 2 - 2 = 0$

ii) $\mu_2 = \mu_2(a) - \mu_1(a)^2 = 12 - 2^2 = 8$

iii) $\mu_3 = \mu_3(a) - 3\mu_1(a)\mu_2(a) + 2\mu_1(a)^3 = 74 - 3(2)(12) + 2(2^3) = 74 - 72 + 16 = 18$

iv) $\mu_4 = \mu_4(a) - 4\mu_1(a)\mu_3(a) + 6\mu_1(a)^2\mu_2(a) - 3\mu_1(a)^4 = 384 - 4(2)(74) + 6(2^2)(12) - 3(2^4) = 384 - 592 + 288 - 48 = 32$.

Example 7: The first four central moments for a distribution are 0, 3, 0 and 7. If the mean $\bar{x}$ of the distribution is 4, find the first four raw moments.

Solution: The raw moments are the moments about the origin. We are given $\mu_1 = 0$, $\mu_2 = 3$, $\mu_3 = 0$ and $\mu_4 = 7$, with $\bar{x} = 4$.

Using the relation between central moments and raw moments:

i) $\mu'_1 = \bar{x} = 4$

ii) $\mu'_2 = \mu_2 + (\mu'_1)^2 = 3 + 4^2 = 19$.

iii) $\mu'_3 = \mu_3 + 3\mu_2\mu'_1 + (\mu'_1)^3 = 0 + 3(3)(4) + 4^3 = 100$.

iv) $\mu'_4 = \mu_4 + 4\mu_3\mu'_1 + 6\mu_2(\mu'_1)^2 + (\mu'_1)^4 = 7 + 4(0)(4) + 6(3)(4^2) + 4^4 = 7 + 0 + 288 + 256 = 551$.

4.3.3 Computation of Moments for Grouped Data:

We have already found the mean and standard deviation for continuous (grouped) data. To calculate moments for continuous data we use the coding method (short method).

When the values of $x$ are not consecutive but are equally spaced at an interval of length $c$, we divide the deviations by $c$. This is called a change of scale by $c$.

We take $x = a + cu$, or $u = \frac{x - a}{c}$.

We give below the effect of a change of origin and scale on moments.

Let $x = a + cu$,

$\therefore \bar{x} = a + c\bar{u}$.

i) The moments of $x$ about $A$ are given by

$\mu_r(a) = \frac{\sum f u^r}{N} \times c^r$

ii) The central moments of $x$ are given by

$\mu_r = \frac{\sum f (u - \bar{u})^r}{N} \times c^r$

Note: When $A = 0$, we get the raw moments.

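Example 8: Find the central moments for the following data:

| Class interval | 0-20 | 20-40 | 40-60 | 60-80 |
|---|---|---|---|---|
| Frequency | 4 | 7 | 6 | 3 |

Solution: First find the mean by the coding method, taking $a = 30$ and $c = 20$.

| C.I. | $f$ | Class mark $x$ | $u = \frac{x-a}{c}$ | $fu$ | $(u-0.4)$ | $f(u-0.4)$ | $f(u-0.4)^2$ | $f(u-0.4)^3$ | $f(u-0.4)^4$ |
|---|---|---|---|---|---|---|---|---|---|
| 0-20 | 4 | 10 | -1 | -4 | -1.4 | -5.6 | 7.84 | -10.976 | 15.3664 |
| 20-40 | 7 | 30 | 0 | 0 | -0.4 | -2.8 | 1.12 | -0.448 | 0.1792 |
| 40-60 | 6 | 50 | 1 | 6 | 0.6 | 3.6 | 2.16 | 1.296 | 0.7776 |
| 60-80 | 3 | 70 | 2 | 6 | 1.6 | 4.8 | 7.68 | 12.288 | 19.6608 |
| Total | 20 | | | 8 | | 0 | 18.8 | 2.16 | 35.984 |

Here, $\bar{u} = \frac{\sum fu}{N} = \frac{8}{20} = 0.4$.

The central moments of $x$ are given by

$\mu_r = \frac{\sum f (u - \bar{u})^r}{N} \times c^r$

i) $\mu_1 = \frac{\sum f (u - \bar{u})}{N} \times c = \frac{0}{20} \times 20 = 0$.

ii) $\mu_2 = \frac{\sum f (u - \bar{u})^2}{N} \times c^2 = \frac{18.8}{20} \times 20^2 = 376$.

iii) $\mu_3 = \frac{\sum f (u - \bar{u})^3}{N} \times c^3 = \frac{2.16}{20} \times 20^3 = 864$.

iv) $\mu_4 = \frac{\sum f (u - \bar{u})^4}{N} \times c^4 = \frac{35.984}{20} \times 20^4 = 2,87,872$.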
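The coding method can be cross-checked in R by computing the central moments twice: once from the coded values u and once directly from the class marks. This is a sketch under the assumptions a = 30 and class width 20; the names mid, c_width, mu_coded and mu_direct are illustrative.

```r
mid     <- c(10, 30, 50, 70)   # class marks of 0-20, 20-40, 40-60, 60-80
f       <- c(4, 7, 6, 3)       # class frequencies
a       <- 30                  # assumed origin for coding
c_width <- 20                  # class width

N     <- sum(f)
u     <- (mid - a) / c_width   # coded values
u_bar <- sum(f * u) / N

# Central moments via the coding method: scale back by c^r.
mu_coded <- sapply(1:4, function(r) sum(f * (u - u_bar)^r) / N * c_width^r)

# Central moments directly from the class marks.
x_bar     <- sum(f * mid) / N
mu_direct <- sapply(1:4, function(r) sum(f * (mid - x_bar)^r) / N)

print(mu_coded)
print(isTRUE(all.equal(mu_coded, mu_direct)))
```

Agreement between the two vectors confirms that a change of origin leaves central moments unchanged, while a change of scale multiplies the r-th moment by c^r.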

4.4 CHARLIER'S CHECK AND SHEPPARD'S CORRECTIONS

Charlier's check is used to verify the computations in a table of grouped classes. For example, consider the following table with specified class limits and frequencies $f_i$. The class marks $x_i$ are computed first, and then the rescaled (coded) class marks $u_i$, which are given by

$u_i = \frac{x_i - x_0}{c}$

where the origin is taken as the class mark $x_0 = 44.5$ and the class interval is $c = 10$. The remaining quantities are then computed as follows.

| Class interval | $x$ | $f$ | $u$ | $fu$ | $fu^2$ | $f(u+1)^2$ |
|---|---|---|---|---|---|---|
| 0-9 | 4.5 | 2 | -4 | -8 | 32 | 18 |
| 10-19 | 14.5 | 3 | -3 | -9 | 27 | 12 |
| 20-29 | 24.5 | 11 | -2 | -22 | 44 | 11 |
| 30-39 | 34.5 | 20 | -1 | -20 | 20 | 0 |
| 40-49 | 44.5 | 32 | 0 | 0 | 0 | 32 |
| 50-59 | 54.5 | 25 | 1 | 25 | 25 | 100 |
| 60-69 | 64.5 | 7 | 2 | 14 | 28 | 63 |
| Total | | 100 | | -20 | 176 | 236 |

In order to compute the variance, note that

$V(u) = \frac{\sum f_i u_i^2}{\sum f_i} - \left(\frac{\sum f_i u_i}{\sum f_i}\right)^2 = \frac{176}{100} - \left(\frac{-20}{100}\right)^2 = 1.72$

So the variance of the original data is

$V(x) = c^2 V(u) = 100 \times 1.72 = 172$.

Charlier's check makes use of the additional column $f(u+1)^2$ on the right side of the table. The identity

$f_i (u_i + 1)^2 = f_i (u_i^2 + 2u_i + 1) = f_i u_i^2 + 2 f_i u_i + f_i$

connects columns five through seven, so summing it over all classes checks that the computations have been done correctly. In the example above, $236 = 176 + 2(-20) + 100$. Hence, the computations pass Charlier's check.
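Charlier's check for the second moment is easy to automate. The sketch below uses the frequencies and coded values from the table above and compares the two sides of the identity; the names lhs and rhs are introduced only for this example.

```r
f <- c(2, 3, 11, 20, 32, 25, 7)   # class frequencies
u <- -4:2                         # coded class marks

# Left side: total of the extra column f(u + 1)^2.
lhs <- sum(f * (u + 1)^2)

# Right side: sum(f u^2) + 2 sum(f u) + sum(f).
rhs <- sum(f * u^2) + 2 * sum(f * u) + sum(f)

print(c(lhs, rhs))
print(lhs == rhs)
```

If a single entry in the fu or fu^2 column were mistyped, the two totals would differ, which is exactly what the check is designed to reveal.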

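Charlier's check in computing moments by the coding method uses the following identities:

$\sum f(u+1) = \sum fu + N$

$\sum f(u+1)^2 = \sum fu^2 + 2\sum fu + N$

$\sum f(u+1)^3 = \sum fu^3 + 3\sum fu^2 + 3\sum fu + N$

$\sum f(u+1)^4 = \sum fu^4 + 4\sum fu^3 + 6\sum fu^2 + 4\sum fu + N$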

Sheppard's Corrections:

When the frequency distribution consists of class intervals, we take $x$ as the class mark of the interval and use this $x$ in all the formulae.

In doing so, we assume that all the values in an interval are concentrated at its class mark. This assumption is not always true, so the calculation is likely to contain some error.

The well-known statistician Sheppard gave the corrected values of the moments as follows:

$\mu_1(\text{corrected}) = \mu_1$

$\mu_2(\text{corrected}) = \mu_2 - \frac{c^2}{12}$

$\mu_3(\text{corrected}) = \mu_3$

$\mu_4(\text{corrected}) = \mu_4 - \frac{1}{2}c^2\mu_2 + \frac{7}{240}c^4$

where $c$ is the length of the class interval, which is the same as the spacing between the mid values.

Note that although this correction has great mathematical significance, it is often not applied in practice, because the error is usually small and because in statistics we look for estimates, which are approximate values.

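Example 9: Apply Sheppard's corrections to determine the moments about the mean for the data

| Class Interval | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 |
|---|---|---|---|---|---|
| Frequency | 1 | 2 | 9 | 2 | 6 |

Solution: Let us prepare the table, taking $A = 25$ and $c = 10$.

| Class Interval | Class Mark $x$ | $f$ | $u = \frac{x-a}{c}$ | $fu$ | $(u-0.5)$ | $f(u-0.5)$ | $f(u-0.5)^2$ | $f(u-0.5)^3$ | $f(u-0.5)^4$ |
|---|---|---|---|---|---|---|---|---|---|
| 0-10 | 5 | 1 | -2 | -2 | -2.5 | -2.5 | 6.25 | -15.625 | 39.0625 |
| 10-20 | 15 | 2 | -1 | -2 | -1.5 | -3 | 4.5 | -6.75 | 10.125 |
| 20-30 | 25 | 9 | 0 | 0 | -0.5 | -4.5 | 2.25 | -1.125 | 0.5625 |
| 30-40 | 35 | 2 | 1 | 2 | 0.5 | 1 | 0.5 | 0.25 | 0.125 |
| 40-50 | 45 | 6 | 2 | 12 | 1.5 | 9 | 13.5 | 20.25 | 30.375 |
| Total | | 20 | | 10 | | 0 | 27 | -3 | 80.25 |

Here, $\bar{u} = \frac{\sum fu}{N} = \frac{10}{20} = 0.5$.

The central moments of $x$ are given by

$\mu_r = \frac{\sum f (u - \bar{u})^r}{N} \times c^r$

i) $\mu_1 = \frac{\sum f (u - \bar{u})}{N} \times c = \frac{0}{20} \times 10 = 0$.

ii) $\mu_2 = \frac{\sum f (u - \bar{u})^2}{N} \times c^2 = \frac{27}{20} \times 10^2 = 135$.

iii) $\mu_3 = \frac{\sum f (u - \bar{u})^3}{N} \times c^3 = \frac{-3}{20} \times 10^3 = -150$.

iv) $\mu_4 = \frac{\sum f (u - \bar{u})^4}{N} \times c^4 = \frac{80.25}{20} \times 10^4 = 40,125$.

Sheppard's corrected values of the moments are as follows:

$\mu_1(\text{corrected}) = \mu_1 = 0$

$\mu_2(\text{corrected}) = \mu_2 - \frac{c^2}{12} = 135 - \frac{10^2}{12} = 126.67$

$\mu_3(\text{corrected}) = \mu_3 = -150$

$\mu_4(\text{corrected}) = \mu_4 - \frac{1}{2}c^2\mu_2 + \frac{7}{240}c^4 = 40,125 - \frac{100}{2} \times 135 + \frac{7}{240} \times 10,000 = 40,125 - 6,750 + 291.67 = 33,666.67$.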
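Sheppard's corrections are simple enough to apply programmatically once the central moments are known. The following sketch assumes class marks 5, 15, 25, 35, 45 and a class width of 10 for the distribution above; the names mid, c_width, mu and corrected are illustrative.

```r
mid     <- c(5, 15, 25, 35, 45)   # class marks of the five intervals
f       <- c(1, 2, 9, 2, 6)       # class frequencies
c_width <- 10                     # assumed class width

N     <- sum(f)
x_bar <- sum(f * mid) / N

# Uncorrected central moments computed from the class marks.
mu <- sapply(1:4, function(r) sum(f * (mid - x_bar)^r) / N)

# Sheppard's corrections: only the 2nd and 4th moments change.
corrected <- c(
  mu[1],
  mu[2] - c_width^2 / 12,
  mu[3],
  mu[4] - 0.5 * c_width^2 * mu[2] + 7 / 240 * c_width^4
)

print(mu)
print(corrected)
```

The odd moments are left untouched, which reflects the fact that grouping errors on either side of a class mark tend to cancel for odd powers.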

4.5 MOMENTS IN DIMENSIONLESS FORM

To avoid particular units, we can define the dimensionless central moments as

$a_r = \frac{\mu_r}{\sigma^r}$

where $\sigma$ is the standard deviation, so that $\sigma = \sqrt{\mu_2}$.

We already know that for central moments $\mu_1 = 0$ and $\mu_2 = \sigma^2$.

So we get $a_1 = 0$ and $a_2 = 1$.

4.6 SKEWNESS

Skewness is one more concept, which deals with the symmetry, or rather asymmetry, of the values of a distribution around its central value. When a frequency distribution is plotted on a chart, an ideal distribution is represented by a nice, symmetric, bell-shaped curve around the central value. Such a distribution is called a symmetric distribution or a normal distribution. However, in practice not every distribution that we come across is normal. Its graph may be asymmetric, or skewed. Such distributions are called skewed distributions.

Definition: Skewness was defined by the statistician Garrett as follows: "A distribution is said to be skewed when the mean and median fall at different points of the distribution and the balance is shifted to one side or the other, to left or right."

Types of Skewness: In order to understand this concept we draw the following graphs, where $\bar{x}$ = Mean, $M$ = Median and $M_0$ = Mode of the distributions.

Figure 4.1

It is clear from the diagram that

i) represents a symmetric distribution, for which Mean = Median = Mode.

ii) represents a positively skewed distribution, for which Mode < Median < Mean.

iii) represents a negatively skewed distribution, for which Mean < Median < Mode.

Measure of Skewness:

Since the mean, median and mode are different for a skewed distribution, the simplest measure would be the difference between two of these, taken in pairs. Though such measures are simple to calculate, their main drawback is that they are expressed in the units of the distribution. Therefore two distributions with different units cannot be compared. In order to overcome this difficulty, relative measures are defined. These are called coefficients of skewness.

Karl Pearson's Coefficient of Skewness: It is defined as

$S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{\bar{x} - M_0}{\sigma}$

Using the relation between mean, median and mode,

Mean - Mode = 3 (Mean - Median), we can write

$S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{3(\bar{x} - M)}{\sigma}$

Interpretation of $S_k$:

i) If $S_k$ is positive then the distribution is positively skewed.

ii) If $S_k$ is negative then the distribution is negatively skewed.

iii) If $S_k = 0$ then the distribution is symmetric.

iv) Theoretically the limits of $S_k$ are from -3 to +3.

Example 10: For the following ungrouped data find Karl Pearson's Coefficient of Skewness.

12, 18, 25, 15, 16, 10, 8, 15, 27, 14

Solution: For Karl Pearson's Coefficient of Skewness we need to find the mean, mode and standard deviation of the data.

Mean $= \bar{x} = \frac{\sum x}{n} = \frac{12+18+25+15+16+10+8+15+27+14}{10} = \frac{160}{10} = 16$.

Mode = 15 (the value that is repeated the maximum number of times).

$\sum x^2 = 144+324+625+225+256+100+64+225+729+196 = 2,888$

$\sigma = \sqrt{\frac{\sum x^2}{n} - (\bar{x})^2} = \sqrt{\frac{2,888}{10} - (16)^2} = \sqrt{288.8 - 256} = \sqrt{32.8} = 5.73$

Therefore, Karl Pearson's Coefficient of Skewness is

$S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{\bar{x} - M_0}{\sigma} = \frac{16 - 15}{5.73} = \frac{1}{5.73} = 0.175$
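Karl Pearson's coefficient can be computed in R along the same lines. The sketch below uses the population form of the standard deviation (dividing by n), as in the formula above, rather than R's built-in sd(), which divides by n - 1. The names counts, mode_x and sk are illustrative.

```r
x <- c(12, 18, 25, 15, 16, 10, 8, 15, 27, 14)

x_bar <- mean(x)

# Population standard deviation: sqrt(mean of squares - square of mean).
sigma <- sqrt(mean(x^2) - x_bar^2)

# Mode: the value with the highest count in the frequency table.
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])

# Karl Pearson's coefficient of skewness.
sk <- (x_bar - mode_x) / sigma

print(c(mean = x_bar, mode = mode_x, sd = sigma, skewness = sk))
```

Note that which.max() returns the first value in case of ties, so for multimodal data a different rule for choosing the mode would be needed.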

Example 11: For the following grouped data find th e Karl Pearson’s

Coefficient of Skewness. Also interpret the type of distribution. C.I 0-4 4-8 8-12 12-16 16-20 F 1 3 10 4 2

Solution: First we find mean, mode and standard deviation. C.I F x fx 𝑓𝑥ଶ 0-4 1 2 2 4 4-8 3 6 18 108 8-12 10 10 100 1000 12-16 4 14 56 784 16-20 2 18 36 648 Total 20 212 2,544

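Mean $= \bar{x} = \dfrac{\sum fx}{N} = \dfrac{212}{20} = 10.6.$

Standard deviation:

$$\sigma = \sqrt{\frac{\sum fx^2}{N} - (\bar{x})^2} = \sqrt{\frac{2{,}544}{20} - (10.6)^2} = \sqrt{127.2 - 112.36} = \sqrt{14.84} = 3.85.$$

The modal class is 8-12 (highest frequency), so $l_1 = 8$, $f_1 = 10$, $f_0 = 3$, $f_2 = 4$ and $h = 4$:

$$\text{Mode} = l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h = 8 + \frac{10 - 3}{2(10) - 3 - 4} \times 4 = 10.15.$$

Therefore, Karl Pearson's Coefficient of Skewness is

$$S = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{\bar{x} - \text{Mode}}{\sigma} = \frac{10.6 - 10.15}{3.85} = \frac{0.45}{3.85} = 0.12.$$

Since $S > 0$, the distribution is (slightly) positively skewed.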
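The grouped-data calculation of Example 11 can be checked with a short Python sketch. The function name `grouped_pearson_skewness` and its argument layout are illustrative assumptions, not part of the text; the formulas are the ones used in the solution (class midpoints, population standard deviation, and the modal-class interpolation formula).

```python
from math import sqrt

# Class intervals (lower, upper) and frequencies from Example 11
classes = [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)]
freqs = [1, 3, 10, 4, 2]

def grouped_pearson_skewness(classes, freqs):
    """Return (mean, sd, mode, S) for a grouped frequency distribution."""
    n = sum(freqs)
    mids = [(lo + hi) / 2 for lo, hi in classes]          # class midpoints x
    mean = sum(f * x for f, x in zip(freqs, mids)) / n    # sum(fx) / N
    var = sum(f * x * x for f, x in zip(freqs, mids)) / n - mean ** 2
    sd = sqrt(var)
    # Modal class = class with the highest frequency
    k = max(range(len(freqs)), key=freqs.__getitem__)
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                     # frequency before modal class
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0        # frequency after modal class
    lo, hi = classes[k]
    mode = lo + (f1 - f0) / (2 * f1 - f0 - f2) * (hi - lo)
    return mean, sd, mode, (mean - mode) / sd

mean, sd, mode, s = grouped_pearson_skewness(classes, freqs)
print(round(mean, 2), round(sd, 2), round(mode, 2), round(s, 2))
```

Working with unrounded intermediate values gives the same coefficient to two decimal places as the hand calculation.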

Bowley's Coefficient of Skewness:

This measure is based on quartiles, hence it is also known as the Quartile Coefficient of Skewness. It is given by

$$S_B = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

The limits of Bowley's Coefficient of Skewness are −1 and +1.

Example 12: Find Bowley's coefficient of skewness, given that $Q_1 = 12.5$, $Q_2 = 17.2$, $Q_3 = 24.7$.

## Page 80

80

Solution: Given that $Q_1 = 12.5$, $Q_2 = 17.2$, $Q_3 = 24.7$.

Bowley's coefficient of skewness is given by

$$S_B = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} = \frac{24.7 + 12.5 - 2(17.2)}{24.7 - 12.5} = \frac{2.8}{12.2} = 0.23$$

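Example 13: Find Bowley's coefficient of skewness for the following distribution:

| X | 1 | 3 | 5 | 7 | 9 | 11 |
|---|---|---|---|---|---|---|
| F | 3 | 8 | 14 | 20 | 19 | 7 |

Solution: Let us find all three quartiles for the distribution:

| X | F | cf (cumulative frequency) |
|---|---|---|
| 1 | 3 | 3 |
| 3 | 8 | 11 |
| 5 | 14 | 25 |
| 7 | 20 | 45 |
| 9 | 19 | 64 |
| 11 | 7 | 71 |
| Total | N = 71 | |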

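Therefore,

$$Q_1 = \text{value of } \left[1\left(\frac{N+1}{4}\right)\right]^{th} \text{ item} = \text{value of } \left[1\left(\frac{71+1}{4}\right)\right]^{th} \text{ item} = \text{value of 18th item} = 5$$

$$Q_2 = \text{value of } \left[2\left(\frac{N+1}{4}\right)\right]^{th} \text{ item} = \text{value of } \left[2\left(\frac{71+1}{4}\right)\right]^{th} \text{ item} = \text{value of 36th item} = 7$$

$$Q_3 = \text{value of } \left[3\left(\frac{N+1}{4}\right)\right]^{th} \text{ item} = \text{value of } \left[3\left(\frac{71+1}{4}\right)\right]^{th} \text{ item} = \text{value of 54th item} = 9$$

Bowley's coefficient of skewness is given by

$$S_B = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} = \frac{9 + 5 - 2(7)}{9 - 5} = \frac{0}{4} = 0.$$

Therefore, the distribution is symmetric.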
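The quartile lookup and Bowley's coefficient can be sketched in Python as follows. The helper names `quartile` and `bowley` are hypothetical, and the frequencies are those of the cumulative-frequency table in the solution (N = 71). The lookup simply takes the first value whose cumulative frequency reaches the position k(N+1)/4, which is a simplification when that position is not a whole number.

```python
from bisect import bisect_left
from itertools import accumulate

def quartile(values, freqs, k):
    """k-th quartile as the value of the k(N+1)/4-th item of a discrete distribution."""
    cum = list(accumulate(freqs))            # cumulative frequencies
    position = k * (cum[-1] + 1) / 4
    # index of the first cumulative frequency that reaches the position
    return values[bisect_left(cum, position)]

def bowley(q1, q2, q3):
    """Bowley's (quartile) coefficient of skewness."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)

values = [1, 3, 5, 7, 9, 11]
freqs = [3, 8, 14, 20, 19, 7]                # N = 71, as in the solution table

q1, q2, q3 = (quartile(values, freqs, k) for k in (1, 2, 3))
print(q1, q2, q3, bowley(q1, q2, q3))
print(round(bowley(12.5, 17.2, 24.7), 2))    # quartiles given in Example 12
```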


## Page 81

81

4.7 KURTOSIS

Kurtosis in Greek means ‘Bulginess’. In statistics, Kurtosis refers to the

degree of flatness or peakedness around the mode of a frequency curve.

The measure of kurtosis is with respect to a normal curve, which is

accepted as a yardstick to decide the nature of other curves.

In other words measures of kurtosis tell us to what extent the given

distribution is flat or peaked with respect to the standard normal curve.

Figure 4.2

i) A normal curve is called mesokurtic (M).

ii) A flat curve is called platykurtic (P).

iii) A peaked curve is called leptokurtic (L).

Measures of Kurtosis:

The most prominent measure of kurtosis is the coefficient $\beta_2$, given by

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

where the $\mu$'s are the moments about the mean $\bar{x}$.

A larger value of $\beta_2$ indicates a more peaked distribution. For the normal distribution $\beta_2 = 3$.

Hence the given distribution is:

i) Leptokurtic, if $\beta_2 > 3$.

ii) Mesokurtic, if $\beta_2 = 3$.

iii) Platykurtic, if $\beta_2 < 3$.

Example 14: For the following distribution find $\beta_1$ and $\beta_2$ and comment on the skewness and kurtosis of the distribution.

## Page 82

82

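| X | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| f | 4 | 3 | 2 | 1 |

Solution: First calculate the moments about the mean for the given distribution.

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{2(4) + 3(3) + 4(2) + 5(1)}{4 + 3 + 2 + 1} = \frac{30}{10} = 3.$$

| $x$ | $f$ | $(x-3)$ | $f(x-3)$ | $f(x-3)^2$ | $f(x-3)^3$ | $f(x-3)^4$ |
|---|---|---|---|---|---|---|
| 2 | 4 | -1 | -4 | 4 | -4 | 4 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 1 | 2 | 2 | 2 | 2 |
| 5 | 1 | 2 | 2 | 4 | 8 | 16 |
| Total | 10 | | 0 | 10 | 6 | 22 |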

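The central moments are given by

i) The first moment about $\bar{x}$:

$$\mu_1 = \frac{\sum f(x - \bar{x})}{N} = \frac{0}{10} = 0.$$

ii) The second moment about $\bar{x}$:

$$\mu_2 = \frac{\sum f(x - \bar{x})^2}{N} = \frac{10}{10} = 1.$$

iii) The third moment about $\bar{x}$:

$$\mu_3 = \frac{\sum f(x - \bar{x})^3}{N} = \frac{6}{10} = 0.6.$$

iv) The fourth moment about $\bar{x}$:

$$\mu_4 = \frac{\sum f(x - \bar{x})^4}{N} = \frac{22}{10} = 2.2.$$

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(0.6)^2}{(1)^3} = 0.36.$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{2.2}{(1)^2} = 2.2$$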

Since $\beta_1 \neq 0$, the curve is not symmetric. Also $\mu_3 = 0.6 > 0$.

Therefore, the curve is positively skewed.

Since $\beta_2 = 2.2 < 3$, the curve is flat as compared to the normal curve.

Hence the distribution is platykurtic.
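As a cross-check of Example 14, the central moments and the two beta coefficients can be computed directly from the frequency table. This is a minimal sketch; the function name `central_moments` is an assumption introduced here for illustration.

```python
def central_moments(values, freqs):
    """Return the mean and the first four central moments of a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    moments = [sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n
               for r in (1, 2, 3, 4)]
    return mean, moments

values = [2, 3, 4, 5]
freqs = [4, 3, 2, 1]

mean, (mu1, mu2, mu3, mu4) = central_moments(values, freqs)
beta1 = mu3 ** 2 / mu2 ** 3     # moment coefficient of skewness
beta2 = mu4 / mu2 ** 2          # moment coefficient of kurtosis

print(mean, mu1, mu2, mu3, mu4)
print(round(beta1, 2), round(beta2, 2))
print("platykurtic" if beta2 < 3 else "leptokurtic" if beta2 > 3 else "mesokurtic")
```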

4.8 SOFTWARE COMPUTATION OF SKEWNESS AND KURTOSIS

Skewness and kurtosis can be computed with different software packages, as described below:

Sigma Magic: Using the Sigma Magic software, calculating the skewness and kurtosis is relatively straightforward. Just add a new Basic Statistics template to Excel by clicking on Stat > Basic Statistics. Copy and paste

## Page 83

83

the data for which you want the skewness and kurtosis into the input area and then click on Compute Outputs. The analysis results will include the skewness and kurtosis values.

Excel: You can also calculate these values in Excel by using the formula =SKEW(…) for the skewness value and =KURT(…) for the kurtosis value.

Minitab: If you use the Minitab software, you can copy and paste the data into Minitab and then click on Stat > Basic Statistics > Display Descriptive Statistics. Then select the data column and click on OK. This will print out the quartiles for the sample values. If you want the skewness and kurtosis values, you have to go back to the menu, click on Statistics and select the checkboxes next to Skewness and Kurtosis in the statistics options. Note that the values provided by Minitab may be slightly different from those of Excel and Sigma Magic.

4.10 SUMMARY

In this unit, we have discussed:

Moments and their types for ungrouped and grouped data.

The relation between raw, arbitrary and central moments.

The effect of change of origin and scale on moments.

Charlier's check and Sheppard's correction for moments.

Skewness and the symmetry of a distribution.

Kurtosis.

4.11 EXERCISE

1.The first four moments of a distribution are 1, 4, 10 and 46 respectively.

Compute the moment coefficients of skewness and kurtosis and comment

upon the nature of the distribution.

2. Compute the first four central moments from the following data. Also

find the two beta coefficients.

| X | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
|---|---|---|---|---|---|---|---|
| f | 8 | 15 | 20 | 32 | 23 | 17 | 5 |

3. The first four central moments of a distribution are 0, 2.5, 0.7 and 18.75.

Examine the skewness and kurtosis of the distribution.

4. Calculate the first four central moments for the following distribution:

| Class interval | 0-4 | 4-8 | 8-12 | 12-16 | 16-20 |
|---|---|---|---|---|---|
| Frequency | 5 | 8 | 13 | 9 | 5 |

5. Find the first four arbitrary moments about A = 7 for the following:

## Page 84

84

10, 5, 8, 7, 2, 3, 12, 14

6. Find the raw moments for the following data:

| C.I | 5-10 | 10-15 | 15-20 | 20-25 | 25-30 |
|---|---|---|---|---|---|
| f | 3 | 4 | 7 | 4 | 2 |

7. The first four central moments of a distribution are 0, 15, 36, 78. If the

mean of the distribution is 8, find the moments about A = 5.

8. The first four raw moments about origin are 4, 16, 33, 89. Find mean

and the first four central moments.

9. For the following data verify Charlier's check:

| C.I | 2-8 | 8-14 | 14-20 | 20-26 |
|---|---|---|---|---|
| f | 1 | 4 | 3 | 2 |

10. Find the first four central moments using the coding method, and also apply Sheppard's correction for moments.

| C.I | 0-5 | 5-10 | 10-15 | 15-20 | 20-25 | 25-30 |
|---|---|---|---|---|---|---|
| f | 3 | 8 | 12 | 13 | 7 | 2 |

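11. For the following data, find Karl Pearson's coefficient of skewness and also find the type of distribution:

i) 12, 15, 17, 12, 8, 25, 16, 6, 7, 41

ii) 3, 7, 8, 12, 15

iii)

| X | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| f | 2 | 8 | 12 | 15 | 18 | 9 | 6 |

iv)

| C.I | 0-2 | 2-4 | 4-6 | 6-8 | 8-10 |
|---|---|---|---|---|---|
| frequency | 4 | 7 | 13 | 10 | 6 |

12. Find Bowley's coefficient of skewness for each of the following:

i) $Q_1 = 165.5$, $Q_2 = 184.3$, $Q_3 = 196.7$.

ii) 2, 8, 7, 12, 14, 17, 20.

iii)

| X | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| F | 2 | 8 | 12 | 15 | 18 | 9 | 6 |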

4.12 LIST OF REFERENCES

Statistics by Murray R. Spiegel and Larry J. Stephens. Publication: McGraw-Hill International.

Fundamentals of Statistics by S.C. Gupta.

*****

## Page 85

85

5

ELEMENTARY PROBABILITY

THEORY

Unit Structure

5.1 Objective

5.2 Introduction

5.3 Definitions of Probability

5.4 Conditional Probability

5.4.1 Independent and Dependent Events, Mutually Exclusive

Events

5.5 Probability Distributions

5.6 Mathematical Expectation

5.7 Combinatorial Analysis

5.8 Combinations, Stirling’s Approximation to n!

5.9 Relation of Probability to Point Set Theory, Euler or Venn Diagrams

and Probability

5.10 Summary

5.11 Exercise

5.12 List of References

5.1 OBJECTIVE

After going through this unit, you will be able to:

Determine the probability of different experimental results.

Explain the concept of probability.

Calculate probability for simple, compound and complementary events.

Work with conditional probability and its examples.

Use independent events and the multiplication theorem of probability.

Describe a probability distribution and its expected value.

Use combinations and Stirling's approximation.

Relate probability to set theory with the help of Venn diagrams.

5.2 INTRODUCTION

Sometimes in daily life certain thoughts come to mind, such as "I will be successful today", "I will complete this work in an hour", "I will be selected for the job", and so

## Page 86

86

on. There are many possible results for these things, but we are happy when we get the required result. Probability theory deals with experiments whose outcome is not predictable with certainty. Probability is a very useful concept. These days it is used in many fields of computer science, such as machine learning, computational linguistics, cryptography, computer vision and robotics, and in other areas such as science, engineering, medicine and management.

Probability is a mathematical measure of the chance that some event occurs. To define it we need some basic concepts: random experiment, sample space, and events.

Basic concept of probability:

Random experiment: An experiment that can be repeated any number of times under similar conditions, but whose result may differ from trial to trial and cannot be predicted in advance, is called a random experiment. For example: a coin is tossed, a die is rolled, and so on.

Outcomes: The results obtained from a random experiment are called the outcomes of the random experiment.

Sample space: The set of all possible outcomes of a random experiment is called the sample space. The sample space is denoted by S, and the number of its elements is written $n(S)$. For example, when a die is rolled, $S = \{1, 2, 3, 4, 5, 6\}$ and $n(S) = 6$.

Events: Any subset of the sample space is called an event. Equivalently, an event is a set of sample points that satisfy a required condition. The number of elements in an event E is denoted by $n(E)$. For example, in the experiment of throwing a die the sample space is S = {1, 2, 3, 4, 5, 6}, and each of the following is an event:

i) A: even number, i.e. A = {2, 4, 6}

ii) B: multiple of 3, i.e. B = {3, 6}

iii) C: prime number, i.e. C = {2, 3, 5}.

Types of events:

Impossible event: An event that cannot occur in a random experiment is called an impossible event. It is denoted by the empty set $\emptyset$, i.e. $n(\emptyset) = 0$. For example, getting the number 7 when a die is rolled. The probability assigned to an impossible event is zero.

Equally likely events: Events that all have an equal chance of occurring are called equally likely events. For example, head and tail in the toss of a coin are equally likely events.

Certain event: An event that contains all elements of the sample space is called a certain event, i.e. $n(A) = n(S)$.

Mutually exclusive events: Two events A and B of a sample space S that have no elements in common are called mutually exclusive events. For example, in the experiment of throwing a die let A: number less than 2 and B: multiple of 3. Then A = {1} and B = {3, 6}, therefore $n(A \cap B) = 0$.

Exhaustive events: Two events A and B of a sample space S are called exhaustive events if together they cover the whole sample space, i.e. $A \cup B = S$. For example, in a

## Page 87

87

throw of a fair die, the occurrence of an even number and the occurrence of an odd number are exhaustive events. Therefore $P(A \cup B) = 1$.

Complement event: Let S be a sample space and A be any event. Then the complement of A, denoted by $\bar{A}$, is the set of elements of the sample space S that do not belong to A. For example, if a die is thrown, S = {1, 2, 3, 4, 5, 6} and A: odd numbers, A = {1, 3, 5}, then $\bar{A} = \{2, 4, 6\}$.

5.3 DEFINITIONS OF PROBABILITY

Probability: For a random experiment with sample space S whose outcomes are equally likely, the probability of an event E is defined as

$$P(E) = \frac{n(E)}{n(S)}$$

Basic properties of probability:

1) The probability of an event E lies between 0 and 1, i.e. $0 \le P(E) \le 1$.

2) The probability of an impossible event is zero, i.e. $P(\emptyset) = 0$.

3) The probability of a certain event is unity, i.e. $P(S) = 1$.

4) If A and B are exhaustive events, then $P(A \cup B) = 1$.

5) If A and B are mutually exclusive events, then $P(A \cap B) = 0$.

6) If A is any event of the sample space, then $P(A) + P(\bar{A}) = 1$, so the probability of the complement of A is $P(\bar{A}) = 1 - P(A)$.

Probability Axioms:

Let S be a sample space. A probability function P from the set of all events in S to the set of real numbers satisfies the following three axioms for all events A and B in S.

i) $P(A) \ge 0$.

ii) $P(\emptyset) = 0$ and $P(S) = 1$.

iii) If A and B are two disjoint sets (i.e. $A \cap B = \emptyset$), then the probability of the union of A and B is $P(A \cup B) = P(A) + P(B)$.

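Theorem: For every event A of a sample space S, $0 \le P(A) \le 1$.

Proof: $S = A \cup \bar{A}$ and $A \cap \bar{A} = \emptyset$, so A and $\bar{A}$ are disjoint.

$\therefore 1 = P(S) = P(A \cup \bar{A}) = P(A) + P(\bar{A})$

$\Rightarrow P(A) = 1 - P(\bar{A})$ or $P(\bar{A}) = 1 - P(A)$.

By axiom i), $P(\bar{A}) \ge 0$, so $P(A) = 1 - P(\bar{A}) \le 1$. Also $P(A) \ge 0$.

$\therefore$ for every event A, $0 \le P(A) \le 1$.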

Addition theorem of probability:

Theorem: If A and B are two events of a sample space S, then the probability of the union of A and B is given by $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

## Page 88

88

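Proof: A and B are two events of the sample space S.

From the Venn diagram, the union of A and B splits into three disjoint regions, so

$$P(A \cup B) = P(A \cap \bar{B}) + P(A \cap B) + P(B \cap \bar{A})$$

But $P(A \cap \bar{B}) = P(A) - P(A \cap B)$ and $P(B \cap \bar{A}) = P(B) - P(A \cap B)$.

$$\therefore P(A \cup B) = P(A) - P(A \cap B) + P(A \cap B) + P(B) - P(A \cap B)$$

$$\therefore P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Note: The above theorem can be extended to three events A, B and C as shown below:

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(C \cap A) + P(A \cap B \cap C)$$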

Example 1: A bag contains 4 black and 6 white balls; two balls are selected at random. Find the probability that the balls are i) of different colors, ii) of the same color.

Solution: Total number of balls in the bag = 4 black + 6 white = 10 balls.

Two balls are selected at random, so

$$n(S) = C(10, 2) = 45.$$

i) Let A be the event that the two balls are of different colors.

$$n(A) = C(4, 1) \times C(6, 1) = 4 \times 6 = 24.$$

$$P(A) = \frac{n(A)}{n(S)} = \frac{24}{45} = 0.53.$$

ii) Both balls are of the same color.

Let A be the event that both balls are black.

$$n(A) = C(4, 2) = 6, \qquad P(A) = \frac{n(A)}{n(S)} = \frac{6}{45}$$

Let B be the event that both balls are white.

$$n(B) = C(6, 2) = 15$$

[Figure: Venn diagram of the sample space S showing the regions $A \cap \bar{B}$, $A \cap B$ and $B \cap \bar{A}$]

## Page 89

89

$$P(B) = \frac{n(B)}{n(S)} = \frac{15}{45}.$$

A and B are disjoint events.

$\therefore$ The required probability is

$$P(A \cup B) = P(A) + P(B) = \frac{6}{45} + \frac{15}{45} = \frac{21}{45} = 0.467.$$
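The counts in Example 1 can be reproduced with Python's standard library. This is only an illustrative sketch; the variable names are assumptions, and `math.comb(n, r)` plays the role of C(n, r).

```python
from fractions import Fraction
from math import comb

black, white = 4, 6
total = comb(black + white, 2)                    # n(S) = C(10, 2)

# i) one black and one white ball
p_different = Fraction(comb(black, 1) * comb(white, 1), total)

# ii) both black or both white (disjoint events, so the counts add)
p_same = Fraction(comb(black, 2) + comb(white, 2), total)

print(total, p_different, float(p_different))
print(p_same, round(float(p_same), 3))
```

Since "different colors" and "same color" are complementary events, the two probabilities add up to 1, which gives a quick consistency check.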

Example 2: From 40 tickets marked from 1 to 40, one ticket is drawn at random. Find the probability that it is marked with a multiple of 3 or 4.

Solution: From 40 tickets marked 1 to 40, one ticket is drawn at random, so

$$n(S) = C(40, 1) = 40.$$

The ticket is marked with a multiple of 3 or 4, so we consider two events.

Let A be the event of selecting a multiple of 3, i.e. A = {3, 6, 9, …, 39}.

$$n(A) = C(13, 1) = 13, \qquad P(A) = \frac{n(A)}{n(S)} = \frac{13}{40}$$

Let B be the event of selecting a multiple of 4, i.e. B = {4, 8, 12, …, 40}.

$$n(B) = C(10, 1) = 10, \qquad P(B) = \frac{n(B)}{n(S)} = \frac{10}{40}$$

Here A and B are not disjoint. $A \cap B$ is the event of selecting a multiple of both 3 and 4, i.e. $A \cap B$ = {12, 24, 36}.

$$n(A \cap B) = C(3, 1) = 3, \qquad P(A \cap B) = \frac{n(A \cap B)}{n(S)} = \frac{3}{40}$$

$\therefore$ The required probability is

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{13}{40} + \frac{10}{40} - \frac{3}{40} = \frac{20}{40} = 0.5.$$
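The addition theorem in Example 2 can also be verified by listing the events as Python sets. The helper `P` below is a hypothetical name for the classical probability n(E)/n(S).

```python
from fractions import Fraction

tickets = set(range(1, 41))                       # sample space S
A = {t for t in tickets if t % 3 == 0}            # multiples of 3
B = {t for t in tickets if t % 4 == 0}            # multiples of 4

def P(event):
    """Classical probability n(E) / n(S) for equally likely tickets."""
    return Fraction(len(event), len(tickets))

p_union = P(A) + P(B) - P(A & B)                  # addition theorem
print(len(A), len(B), sorted(A & B), p_union)
```

Counting the union directly, `P(A | B)`, gives the same value, which is exactly what the addition theorem asserts.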

Example 3: Among graduate applicants, the probability of applying for a program development job is 0.45, the probability of applying for a networking job is 0.8, and the probability of applying for both is 0.35. Find the probability that an applicant applied for at least one of the jobs. If the number of graduates is 500, how many did not apply for either job?

Solution: Let the probability of applying for the program development job be $P(A) = 0.45$.

Probability of applying for the networking job: $P(B) = 0.8$.

Probability of applying for both jobs: $P(A \cap B) = 0.35$.

## Page 90

90

The probability of at least one job is $P(A \cup B)$.

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

$$P(A \cup B) = 0.45 + 0.8 - 0.35 = 0.9$$

There are 500 graduates. First we find the probability that a graduate did not apply for either job:

$$P(\overline{A \cup B}) = 1 - P(A \cup B) = 1 - 0.9 = 0.1$$

Number of graduates who did not apply for a job = $0.1 \times 500 = 50$.

Check your Progress:

1. A card is drawn from pack of 52 cards at random. Find the probability

that it is a face card or a diamond card.

2. If 𝑃(𝐴)=ଷ

଼and (𝐵)=ହ

଼ , 𝑃(𝐴∪𝐵)=

଼ than find i) 𝑃(𝐴∪𝐵തതതതതതത) ii)

𝑃(𝐴∩𝐵).

3. In a class of 60 students, 50 passed in computers, 40 passed in mathematics and 35 passed in both. What is the probability that a student selected at random has i) passed in at least one subject, ii) failed in both subjects, iii) passed in only one subject?

5.4 CONDITIONAL PROBABILITY

In many cases an event A is known to have occurred, and we are required to find the probability of an event B that depends on A. Problems of this kind are called conditional probability problems.

Definition: Let A and B be two events. The conditional probability of event B, given that event A has occurred, is defined by the relation

$$P(B|A) = \frac{P(A \cap B)}{P(A)}, \quad \text{provided } P(A) > 0.$$

When $P(A) = 0$, $P(B|A)$ is not defined, because then $P(B \cap A) = 0$ and $P(B|A) = \frac{0}{0}$, which is an indeterminate quantity.

Similarly, the conditional probability of event A, given that event B has occurred, is defined by the relation

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0.$$

Example 4: A pair of fair dice is rolled. What is the probability that the sum of the uppermost faces is 6, given that both of the numbers are odd?

Solution: A pair of fair dice is rolled, therefore $n(S) = 36$.

Let A be the event that both numbers are odd, i.e. A = {(1,1), (1,3), (1,5), (3,1), (3,3), (3,5), (5,1), (5,3), (5,5)}.

$$P(A) = \frac{n(A)}{n(S)} = \frac{9}{36}$$

## Page 91

91

Let B be the event that the sum is 6, i.e. B = {(1,5), (2,4), (3,3), (4,2), (5,1)}.

$$P(B) = \frac{n(B)}{n(S)} = \frac{5}{36}$$

$A \cap B$ = {(1,5), (3,3), (5,1)}

$$P(A \cap B) = \frac{n(A \cap B)}{n(S)} = \frac{3}{36}$$

By the definition of conditional probability,

$$P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{3/36}{9/36} = \frac{1}{3}.$$
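The same conditional probability can be obtained by enumerating all 36 outcomes. The sketch below is illustrative; the names `S`, `A`, `B` follow the solution, while the helper `P` is an assumption.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))                    # all 36 ordered pairs
A = {(a, b) for a, b in S if a % 2 == 1 and b % 2 == 1}     # both numbers odd
B = {(a, b) for a, b in S if a + b == 6}                    # sum is 6

def P(event):
    """Classical probability for equally likely outcomes."""
    return Fraction(len(event), len(S))

p_b_given_a = P(A & B) / P(A)                               # definition of P(B|A)
print(len(A), len(B), sorted(A & B), p_b_given_a)
```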

Example 5: If A and B are two events of a sample space S such that $P(A) = 0.85$, $P(B) = 0.7$ and $P(A \cup B) = 0.95$, find i) $P(A \cap B)$, ii) $P(A|B)$, iii) $P(B|A)$.

Solution: Given that $P(A) = 0.85$, $P(B) = 0.7$ and $P(A \cup B) = 0.95$.

i) By the addition theorem,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

$$0.95 = 0.85 + 0.7 - P(A \cap B)$$

$$P(A \cap B) = 1.55 - 0.95 = 0.6.$$

ii) By the definition of conditional probability,

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{0.6}{0.7} = 0.857.$$

iii) $$P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{0.6}{0.85} = 0.706$$

Example 6: An urn A contains 4 red and 5 green balls. Another urn B contains 5 red and 6 green balls. A ball is transferred from urn A to urn B, then a ball is drawn from urn B. Find the probability that it is red.

Solution: There are two cases for the ball transferred from urn A to urn B.

Case I: A red ball is transferred from urn A to urn B.

The probability of drawing a red ball from urn A is $P(R) = \frac{4}{9}$.

After the transfer of a red ball, urn B contains 6 red and 6 green balls.

The probability of then drawing a red ball from urn B is $P(R|R) \times P(R) = \frac{6}{12} \times \frac{4}{9} = \frac{24}{108}$.

Case II: A green ball is transferred from urn A to urn B.

The probability of drawing a green ball from urn A is $P(G) = \frac{5}{9}$.

After the transfer of a green ball, urn B contains 5 red and 7 green balls.

## Page 92

92

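The probability of then drawing a red ball from urn B is $P(R|G) \times P(G) = \frac{5}{12} \times \frac{5}{9} = \frac{25}{108}$.

Therefore the required probability $= \frac{24}{108} + \frac{25}{108} = \frac{49}{108} = 0.4537$.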
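The two-case argument of Example 6 is a sum over the colour of the transferred ball. A minimal sketch with exact fractions follows; all variable names are assumptions chosen for readability.

```python
from fractions import Fraction

red_a, green_a = 4, 5        # urn A
red_b, green_b = 5, 6        # urn B before the transfer

total_a = red_a + green_a
total_b_after = red_b + green_b + 1          # urn B holds one more ball after the transfer

# Case I: a red ball is transferred, then a red ball is drawn from urn B
case_red = Fraction(red_a, total_a) * Fraction(red_b + 1, total_b_after)
# Case II: a green ball is transferred, then a red ball is drawn from urn B
case_green = Fraction(green_a, total_a) * Fraction(red_b, total_b_after)

p_red = case_red + case_green
print(case_red, case_green, p_red, round(float(p_red), 4))
```

`Fraction` reduces automatically, so `case_red` prints as 2/9 rather than 24/108; the two are equal.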

Check your progress:

1. A family has two children. What is the probability that both are boys, given that at least one is a boy?

2. Two dice are rolled. What is the conditional probability that the sum of the numbers on the dice exceeds 8, given that the first shows 4?

3. Consider a medical test that screens for COVID-19, which affects 10 people in 1000. Suppose that the false positive rate is 4% and the false negative rate is 1%. Then 99% of the time a person who has the condition tests positive for it, and 96% of the time a person who does not have the condition tests negative for it. a) What is the probability that a randomly chosen person who tests positive for COVID-19 actually has the disease? b) What is the probability that a randomly chosen person who tests negative for COVID-19 does indeed not have the disease?

5.4.1 Independent and Dependent Events, Mutually Exclusive Events:

Independent events:

Two events are said to be independent if the occurrence of one of them does not affect, and is not affected by, the occurrence or non-occurrence of the other,

i.e. $P(B|A) = P(B)$ or $P(A|B) = P(A)$.

Multiplication theorem of probability: If A and B are any two events associated with an experiment, then the probability of the simultaneous occurrence of events A and B is given by

$$P(A \cap B) = P(A)\,P(B|A)$$

where $P(B|A)$ denotes the conditional probability of event B given that event A has already occurred,

OR

$$P(A \cap B) = P(B)\,P(A|B)$$

where $P(A|B)$ denotes the conditional probability of event A given that event B has already occurred.

## Page 93

93

Multiplication theorem for independent events:

If A and B are independent events, then the multiplication theorem can be written as

$$P(A \cap B) = P(A)\,P(B)$$

Proof: By the multiplication theorem, if A and B are any two events associated with an experiment, then the probability of the simultaneous occurrence of events A and B is given by

$$P(A \cap B) = P(A)\,P(B|A)$$

By the definition of independent events, $P(B|A) = P(B)$ or $P(A|B) = P(A)$.

$$\therefore P(A \cap B) = P(A)\,P(B).$$

Note:

1) If A and B are independent events, then $\bar{A}$ and $\bar{B}$ are independent events.

2) If A and B are independent events, then $\bar{A}$ and B are independent events.

3) If A and B are independent events, then A and $\bar{B}$ are independent events.

Example 7: Manish and Mandar are each trying to build software for a company. The probability that Manish succeeds is $\frac{1}{5}$ and the probability that Mandar succeeds is $\frac{3}{5}$; they work independently. Find the probability that i) both succeed, ii) at least one succeeds, iii) neither succeeds, iv) only Mandar succeeds and Manish does not.

Solution: Let the probability that Manish succeeds be $P(A) = \frac{1}{5} = 0.2$.

Therefore the probability that Manish does not succeed is $P(\bar{A}) = 1 - P(A) = 1 - 0.2 = 0.8$.

The probability that Mandar succeeds is $P(B) = \frac{3}{5} = 0.6$.

Therefore the probability that Mandar does not succeed is $P(\bar{B}) = 1 - P(B) = 1 - 0.6 = 0.4$.

i) Both succeed, i.e. $P(A \cap B)$.

$P(A \cap B) = P(A) \times P(B) = 0.2 \times 0.6 = 0.12$, since A and B are independent events.

ii) At least one succeeds, i.e. $P(A \cup B)$.

By the addition theorem,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.2 + 0.6 - 0.12 = 0.68.$$

iii) Neither succeeds, i.e. $P(\overline{A \cup B})$ or $P(\bar{A} \cap \bar{B})$ (by De Morgan's law both are the same).

$$P(\overline{A \cup B}) = 1 - P(A \cup B) = 1 - 0.68 = 0.32.$$

Or

If A and B are independent, then $\bar{A}$ and $\bar{B}$ are also independent.

## Page 94

94

$$P(\bar{A} \cap \bar{B}) = P(\bar{A}) \times P(\bar{B}) = 0.8 \times 0.4 = 0.32.$$

iv) Only Mandar succeeds and Manish does not, i.e. $P(\bar{A} \cap B)$.

$$P(\bar{A} \cap B) = P(\bar{A}) \times P(B) = 0.8 \times 0.6 = 0.48$$
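The four answers of Example 7 follow from the product rule for independent events. The sketch below uses exact fractions to avoid rounding; the variable names are illustrative assumptions.

```python
from fractions import Fraction

p_a = Fraction(1, 5)      # Manish succeeds
p_b = Fraction(3, 5)      # Mandar succeeds

both = p_a * p_b                               # independence: P(A and B) = P(A) P(B)
at_least_one = p_a + p_b - both                # addition theorem
neither = (1 - p_a) * (1 - p_b)                # complements are also independent
only_mandar = (1 - p_a) * p_b

for name, p in [("both", both), ("at least one", at_least_one),
                ("neither", neither), ("only Mandar", only_mandar)]:
    print(name, p, float(p))
```

Computing "neither" as a product of complements and as 1 minus "at least one" gives the same value, matching the two routes shown in part iii).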

Example 8: 50 coding problems are attempted by two students A and B, working independently. Student A codes 35 of them correctly and student B codes 40 correctly. Find the probability that only one of them codes a problem correctly.

Solution: The probability that student A codes correctly is $P(A) = \frac{35}{50} = 0.7$.

The probability that student A codes wrongly is $P(\bar{A}) = 1 - 0.7 = 0.3$.

The probability that student B codes correctly is $P(B) = \frac{40}{50} = 0.8$.

The probability that student B codes wrongly is $P(\bar{B}) = 1 - 0.8 = 0.2$.

Only one of them codes correctly when A is correct and B is not, or B is correct and A is not:

$$P(A \cap \bar{B}) + P(B \cap \bar{A}) = P(A) \times P(\bar{B}) + P(B) \times P(\bar{A}) = 0.7 \times 0.2 + 0.8 \times 0.3 = 0.14 + 0.24 = 0.38$$

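Example 9: Given that $P(A) = \frac{3}{7}$, $P(B) = \frac{2}{7}$, and that A and B are independent events, find i) $P(A \cap B)$, ii) $P(\bar{B})$, iii) $P(A \cup B)$, iv) $P(\bar{A} \cap \bar{B})$.

Solution: Given that $P(A) = \frac{3}{7}$, $P(B) = \frac{2}{7}$.

i) A and B are independent events,

$$\therefore P(A \cap B) = P(A) \times P(B) = \frac{3}{7} \times \frac{2}{7} = \frac{6}{49} = 0.122$$

ii) $P(\bar{B}) = 1 - P(B) = 1 - \frac{2}{7} = \frac{5}{7} = 0.714.$

iii) By the addition theorem,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{3}{7} + \frac{2}{7} - \frac{6}{49} = \frac{29}{49} = 0.592.$$

iv) $P(\bar{A} \cap \bar{B}) = P(\overline{A \cup B}) = 1 - P(A \cup B) = 1 - 0.592 = 0.408.$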

Check your progress:

1. If $P(A) = \frac{2}{5}$, $P(B) = \frac{1}{3}$ and A and B are independent events, find (i) $P(A \cap B)$, (ii) $P(A \cup B)$, (iii) $P(\bar{A} \cap \bar{B})$.

2. The probabilities that A, B and C can solve the same problem independently are $\frac{1}{3}$, $\frac{2}{5}$ and $\frac{3}{4}$ respectively. Find the probability that i) the problem remains unsolved, ii) the problem is solved, iii) only one of them solves the problem.

3. The probability that Ram can hit a target is $\frac{2}{5}$ and the probability that Laxman can hit the same target is $\frac{4}{5}$. They shoot independently. Find the probability that (i) the target is not hit at all, (ii) the target is hit by at least one of them, (iii) the target is hit by only one of them, (iv) the target is hit by both.

## Page 95

95

5.5 PROBABILITY DISTRIBUTIONS

To understand the behavior of a random variable, we may want to look at its average value. In probability this average is called the expected value of the random variable X. To define it, we first need some basic concepts about random variables.

Random variable: A real-valued function X defined over the sample space of a random experiment, whose values occur with respective probabilities, is called a random variable.

Types of random variables: There are two types of random variable.

Discrete random variable: A random variable is said to be discrete if it takes a finite or countably infinite number of values. Thus a discrete random variable takes only isolated values.

Continuous random variable: A random variable is continuous if its set of possible values consists of an entire interval on the number line.

Probability distribution of a random variable: All possible values $x_i$ of the random variable, along with their corresponding probabilities $P(x_i)$, such that $\sum_{i=1}^{n} P(x_i) = 1$, is called the probability distribution of the random variable.

The probability function always satisfies the following properties:

i) $P(x_i) \ge 0$ for all values of $i$.

ii) $\sum_{i=1}^{n} P(x_i) = 1$.

The set of values $x_i$ with their probabilities $P(x_i)$ constitutes a discrete probability distribution of the discrete variable X.

For example, when three coins are tossed, the probability distribution of the discrete variable X = number of heads is:

| X = x | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| P(x) | 1/8 | 3/8 | 3/8 | 1/8 |
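The three-coin distribution can be generated by listing the 8 equally likely outcomes and counting heads. This is a small illustrative sketch; the names `outcomes` and `dist` are assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=3))              # 8 equally likely outcomes
counts = Counter(o.count("H") for o in outcomes)      # number of heads in each outcome

# Probability distribution of X = number of heads
dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in dist.items():
    print(x, p)
print("total probability:", sum(dist.values()))
```

The printed probabilities are non-negative and sum to 1, which are the two properties listed above.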

5.6 MATHEMATICAL EXPECTATION

All the probability information about a random variable is contained in its probability mass function, but it is often useful to consider various numerical characteristics of the random variable. One such number is the expectation of the random variable.

If a random variable X takes the values $x_1, x_2, \ldots, x_n$ with corresponding probabilities $p_1, p_2, \ldots, p_n$ respectively, then the expectation of the random variable X is

$$E(X) = \sum_{i=1}^{n} p_i x_i, \quad \text{where } \sum_{i=1}^{n} p_i = 1.$$

## Page 96

96

Example 10: At Vijay Sales, past experience gives the following probabilities for the number of laptops sold per day:

| No. of laptops | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Probability | 0.05 | 0.15 | 0.25 | 0.2 | 0.15 | 0.2 |

Find the expected number of laptops sold per day.

Solution: Let X be the random variable denoting the number of laptops sold per day.

The expected value is $E(X) = \sum_{i=1}^{n} p_i x_i$:

$$E(X) = (0 \times 0.05) + (1 \times 0.15) + (2 \times 0.25) + (3 \times 0.2) + (4 \times 0.15) + (5 \times 0.2) = 2.85 \approx 3$$

Therefore the expected number of laptops sold per day is about 3.

Example 11: A random variable X has the following probability mass function:

| X = x | -1 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| P(x) | k | 0.2 | 0.3 | 2k | 2k |

Find the value of k and the expected value.

Solution: Since X has a probability mass function, $\sum_{i=1}^{n} P(x_i) = 1$.

⇒ k + 0.2 + 0.3 + 2k + 2k = 1

⇒ 5k = 0.5

⇒ k = 0.1

Therefore the probability distribution of the random variable X is

| X = x | -1 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| P(x) | 0.1 | 0.2 | 0.3 | 0.2 | 0.2 |

The expected value is $E(X) = \sum_{i=1}^{n} p_i x_i$:

$$E(X) = (-1 \times 0.1) + (0 \times 0.2) + (1 \times 0.3) + (2 \times 0.2) + (3 \times 0.2) = 1.2.$$
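The expectation formula used in Examples 10 and 11 can be written as a small Python function. The name `expectation` and the tolerance used to check that the probabilities sum to 1 are assumptions made for this sketch.

```python
def expectation(values, probs):
    """E(X) = sum of p_i * x_i for a discrete random variable."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probs))

# Example 10: number of laptops sold per day
laptops = expectation([0, 1, 2, 3, 4, 5], [0.05, 0.15, 0.25, 0.2, 0.15, 0.2])

# Example 11: solve k + 0.2 + 0.3 + 2k + 2k = 1 for k, then take the expectation
k = (1 - 0.2 - 0.3) / 5
pmf_mean = expectation([-1, 0, 1, 2, 3], [k, 0.2, 0.3, 2 * k, 2 * k])

print(round(laptops, 2), round(k, 2), round(pmf_mean, 2))
```

The values are rounded when printed because decimal probabilities are not represented exactly in floating point.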

Example 12: A box contains 5 white and 7 black balls. A person draws 3 balls at random. He gets Rs. 50 for every white ball and loses Rs. 10 for every black ball. Find his expectation.

Solution: Total number of balls in the box = 5 white + 7 black = 12 balls.

Three balls are selected at random, so $n(S) = C(12, 3) = \frac{12 \times 11 \times 10}{3 \times 2 \times 1} = 220$.

Let A be the number of white balls drawn.

A takes the values 0, 1, 2 and 3.

Case I: no white ball, i.e. A = 0,

## Page 97

97

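$$P(A = 0) = \frac{C(7, 3)}{220} = \frac{35}{220}$$

Case II: one white ball, i.e. A = 1,

$$P(A = 1) = \frac{C(5, 1) \times C(7, 2)}{220} = \frac{105}{220}$$

Case III: two white balls, i.e. A = 2,

$$P(A = 2) = \frac{C(5, 2) \times C(7, 1)}{220} = \frac{70}{220}$$

Case IV: three white balls, i.e. A = 3,

$$P(A = 3) = \frac{C(5, 3)}{220} = \frac{10}{220}$$

Now let X be the amount he gets from the game. For example, one white and two black balls give 50 − 20 = Rs. 30.

Therefore the probability distribution of X is as follows:

| X = x | -30 | 30 | 90 | 150 |
|---|---|---|---|---|
| P(x) | 35/220 | 105/220 | 70/220 | 10/220 |

The expected value is $E(X) = \sum_{i=1}^{n} p_i x_i$:

$$E(X) = \left(-30 \times \frac{35}{220}\right) + \left(30 \times \frac{105}{220}\right) + \left(90 \times \frac{70}{220}\right) + \left(150 \times \frac{10}{220}\right) = \frac{9900}{220} = \text{Rs. } 45.$$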
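The four cases of Example 12 can be generated in a loop over the number of white balls drawn. This is a sketch under the rules stated in the example (Rs. 50 per white ball, minus Rs. 10 per black ball); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

white, black, drawn = 5, 7, 3
total = comb(white + black, drawn)           # n(S) = C(12, 3)

expected = Fraction(0)
distribution = {}
for w in range(drawn + 1):                   # w = number of white balls drawn
    b = drawn - w                            # remaining balls are black
    p = Fraction(comb(white, w) * comb(black, b), total)
    amount = 50 * w - 10 * b                 # winnings in rupees
    distribution[amount] = p
    expected += p * amount

print(total, distribution, expected)
```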

5.7 COMBINATORIAL ANALYSIS

Multiplication Rule:

If a procedure can be broken into a first and a second stage, and if there are m possible outcomes for the first stage and, for each of these outcomes, n possible outcomes for the second stage, then the total procedure can be carried out, in the designated order, in $m \times n$ ways.

This principle can be extended to a general form as follows:

Theorem: If a process consists of n steps, and

i) the first step can be performed in $n_1$ ways,

ii) the second step can be performed in $n_2$ ways,

iii) the $i^{th}$ step can be performed in $n_i$ ways,

then the whole process can be completed in $n_1 \times n_2 \times \cdots \times n_n$ different ways.

Example 13: There are 8 men and 7 women in a drama company. In how many ways can the director choose a couple to play the lead roles in a stage show?

## Page 98

98

Solution: The director can choose a man (task 1) in 8 ways and then a woman (task 2) in 7 ways. Then by the multiplication rule he can choose a couple in $8 \times 7 = 56$ ways.

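Example 14: How many four-digit numbers can be formed that contain each of the digits 7, 8 and 9 exactly once?

Solution: To construct a four-digit number we have four places.

| Thousands place | Hundreds place | Tens place | Units place |
|---|---|---|---|
| ___ | ___ | ___ | ___ |

For '7' there are 4 places, for '8' there are then 3 places and for '9' there are 2 places. For the last digit we can choose any of 0, 1, 2, 3, 4, 5, 6, so there are 7 choices.

Thus this can be done in $4 \times 3 \times 2 \times 7 = 168$ ways.

Note that this count allows 0 in the thousands place. If numbers with a leading 0 are excluded, the 3! = 6 arrangements of 7, 8 and 9 behind a leading 0 must be removed, leaving 168 − 6 = 162.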

Example 15: A typical personal identification number (PIN) is a sequence of any four symbols chosen from the letters of the alphabet and the digits. How many different PINs can be generated if

i) repetition is not allowed?

ii) repetition is allowed?

Solution: There are 26 letters of the alphabet and 10 digits, so there are 36 different symbols in total.

i) Repetition is not allowed:

There are four places to fill with four symbols. The first place can be filled in 36 ways, the second place in 35 ways, the third place in 34 ways and the fourth place in 33 ways.

By the multiplication rule, this can be done in $36 \times 35 \times 34 \times 33 = 1{,}413{,}720$ ways.

ii) Repetition is allowed:

Since repetition is allowed, each place can be filled in 36 ways.

By the multiplication rule, this can be done in $36 \times 36 \times 36 \times 36 = 1{,}679{,}616$ ways.
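Both PIN counts are direct applications of the multiplication rule and can be checked in Python. `math.perm(n, r)` (available from Python 3.8) gives the number of ordered selections without repetition; the variable names are illustrative.

```python
from math import perm
from string import ascii_uppercase, digits

symbols = ascii_uppercase + digits           # 26 letters + 10 digits
n, length = len(symbols), 4

without_repetition = perm(n, length)         # 36 * 35 * 34 * 33
with_repetition = n ** length                # 36 * 36 * 36 * 36

print(n, without_repetition, with_repetition)
```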

Check your progress:

1. A license plate is made of 2 letters followed by 3 digits. How many different license plates can be made if i) repetition is not allowed, ii) repetition is allowed?

2. Mr. Modi, buying a personal computer system, is offered a choice of 4 models of basic unit, 2 models of keyboard and 3 models of printer. How many distinct systems can be purchased?

## Page 99

99

Counting elements of disjoint sets with the Addition Rule:

In the above section we discussed counting problems that can be solved using a possibility tree. Here we discuss counting problems that can be solved using set operations such as union, intersection and the difference between two sets.

The addition rule:

If a task can be performed in m ways and another task in n ways, and these two tasks cannot be performed simultaneously, then performing either task can be accomplished in any one of $m + n$ ways.

In general form:

If there are $n_1, n_2, n_3, \ldots, n_m$ different objects in m different sets respectively and the sets are disjoint, then the number of ways to select an object from one of the m sets is $n_1 + n_2 + n_3 + \cdots + n_m$.

Example 16: How many different signals can be sent with 5 flags of different colours, taking one or more at a time?

Solution: Number of signals made with one flag = 5.

Number of signals made with two flags = $5 \times 4 = 20$.

Number of signals made with three flags = $5 \times 4 \times 3 = 60$.

Number of signals made with four flags = $5 \times 4 \times 3 \times 2 = 120$.

Number of signals made with five flags = $5 \times 4 \times 3 \times 2 \times 1 = 120$.

Using the addition rule, the total number of signals = $5 + 20 + 60 + 120 + 120 = 325$.
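Example 16 combines the multiplication rule (ordered arrangements of k flags) with the addition rule (the cases k = 1, …, 5 are disjoint). A brief sketch follows; the flag colours are hypothetical placeholders, and the brute-force enumeration is included only as a cross-check.

```python
from itertools import permutations
from math import perm

flags = ["red", "green", "blue", "yellow", "white"]   # hypothetical colours

# Multiplication rule for each k, addition rule across the disjoint cases
per_length = [perm(len(flags), k) for k in range(1, len(flags) + 1)]
total_signals = sum(per_length)

# Brute-force cross-check: list every ordered arrangement of k distinct flags
enumerated = sum(1 for k in range(1, len(flags) + 1)
                 for _ in permutations(flags, k))

print(per_length, total_signals, enumerated)
```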

Example 17: There are 4 different English books, 5 different Hindi books and 7 different Marathi books. How many ways are there to pick a pair of books that are not both of the same subject?

Solution: One English and one Hindi book can be chosen in 4×5 = 20 ways.

One English and one Marathi book can be chosen in 4×7 = 28 ways.

One Hindi and one Marathi book can be chosen in 5×7 = 35 ways.

These three types of selection are disjoint, therefore by the addition rule,

the total number of selections = 20+28+35 = 83 ways.
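As a cross-check of Example 17, one can list every unordered pair of books and keep only the mixed-subject pairs. This is a minimal sketch; the book labels are hypothetical and serve only to make the 16 books distinct.

```python
from itertools import combinations

# Hypothetical labels for the 4 English, 5 Hindi and 7 Marathi books.
books = ([("English", i) for i in range(4)]
         + [("Hindi", i) for i in range(5)]
         + [("Marathi", i) for i in range(7)])

# Keep the unordered pairs whose two books belong to different subjects.
mixed_pairs = [pair for pair in combinations(books, 2)
               if pair[0][0] != pair[1][0]]
print(len(mixed_pairs))  # 83
```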

Additive Principle with Disjoint sets:

Given two sets A and B that are disjoint, i.e. $A \cap B = \emptyset$, then $|A \cup B| = |A| + |B|$.

## Page 100

Example 18: In a college, 200 students visit the canteen every day, of which 80 like coffee and 70 like tea. If no student likes both, find i) the number of students who like at least one of them, ii) the number of students who like neither.

Solution: Total number of students = 200.

Number of students who like coffee = $|A| = 80$.

Number of students who like tea = $|B| = 70$.

Number of students who like at least one = $|A \cup B| = |A| + |B| = 80 + 70 = 150$.

Number of students who like neither = 200 − 150 = 50.

Definition: An r-combination of n distinct objects is an unordered selection, or subset, of r out of the n objects. We use $C(n,r)$ or ${}^{n}C_{r}$ to denote the number of r-combinations. This number is called a binomial coefficient.

If $x_1, x_2, x_3, \ldots, x_n$ are n distinct objects and r is any integer with $1 \le r \le n$, then the number of ways of selecting r objects from the n objects is given by

$$C(n,r) = \frac{n!}{r!\,(n-r)!}$$

Example 19: How many 3-bit strings have weight 2 (that is, contain exactly two 1s)?

Solution: The strings have 3 bits and weight 2, i.e. $n=3$, $r=2$.

The number of such strings is $C(n,r) = C(3,2) = 3$.

The bit strings are 011, 101, 110.
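The correspondence between r-combinations and bit strings of a given weight can be made concrete in a few lines. In this sketch the helper name `bit_strings` is an assumption made for illustration; each combination of positions marks where the 1s go.

```python
from itertools import combinations
from math import comb

def bit_strings(n: int, weight: int) -> list[str]:
    """All n-bit strings that contain exactly `weight` ones."""
    strings = []
    for ones in combinations(range(n), weight):
        # `ones` holds the positions that receive a 1; all others get 0.
        strings.append("".join("1" if i in ones else "0" for i in range(n)))
    return sorted(strings)

print(comb(3, 2))         # 3
print(bit_strings(3, 2))  # ['011', '101', '110']
```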

Example 20: A bag contains 4 red marbles and 5 green marbles. Find the number of ways that 4 marbles can be selected from the bag if i) there is no restriction on colors, ii) all are of the same color.

Solution: Total number of marbles: 4 red + 5 green = 9 marbles.

To select 4 marbles from the bag with the given condition:

i) No restriction on colors:

This can be done in $C(9,4) = 126$ ways.

ii) All are of the same color:

If all are red, this can be done in $C(4,4) = 1$ way.

If all are green, this can be done in $C(5,4) = 5$ ways.

The two cases are disjoint, so by the addition rule the total number of ways = 1 + 5 = 6 ways.

Example 21: There are 10 members in a society who are eligible to attend the annual meeting. Find the number of ways 4 members can be selected under each of the following conditions:

i) No restriction.

## Page 101

ii) 2 particular members will not attend the meeting together.

iii) 2 particular members will always attend the meeting together.

Solution:

i) Selecting 4 members from 10 members can be done in $C(10,4) = 210$ ways.

ii) Let A and B denote the 2 members who will not attend the meeting together.

If exactly one of A or B attends, the remaining 3 members are chosen from the other 8, which can be done in $2 \times C(8,3) = 112$ ways.

It is also possible that neither A nor B attends; then all 4 members are chosen from the other 8, which can be done in $C(8,4) = 70$ ways.

Therefore total number of ways = 112 + 70 = 182 ways.

iii) Let A and B denote the 2 members who will always attend the meeting together.

If both A and B attend, the remaining 2 members are chosen from the other 8, which can be done in $C(8,2) = 28$ ways.

It is also possible that neither A nor B attends; then all 4 members are chosen from the other 8, which can be done in $C(8,4) = 70$ ways.

Therefore total number of ways = 28 + 70 = 98 ways.
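A brute-force enumeration is a useful check on case analyses such as the one in this example. The sketch below labels the ten members 0 to 9 and treats members 0 and 1 as the two special members A and B (these labels are assumptions made only for illustration). It lists every 4-member selection and counts those meeting each condition.

```python
from itertools import combinations

members = range(10)   # members labelled 0..9 (hypothetical labels)
A, B = 0, 1           # the two special members
committees = list(combinations(members, 4))

no_restriction = len(committees)
# ii) A and B are never both selected.
not_together = sum(1 for c in committees if not (A in c and B in c))
# iii) A and B are either both selected or both left out.
always_together = sum(1 for c in committees if (A in c) == (B in c))

print(no_restriction, not_together, always_together)  # 210 182 98
```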

Example 21: How many diagonals does a regular polygon with n sides have?

Solution: A regular polygon with n sides has n vertices. Any two vertices determine either a side or a diagonal, and two vertices can be chosen in $C(n,2) = \frac{n(n-1)}{2}$ ways. But n of these are sides, which are not diagonals.

Therefore the total number of diagonals is

$$\frac{n(n-1)}{2} - n = \frac{n^2 - n}{2} - \frac{2n}{2} = \frac{n^2 - 3n}{2} = \frac{n(n-3)}{2}.$$

r-combinations with Repetition Allowed:

Till now, we have seen the formula for the number of combinations when r objects are chosen from a collection of n distinct objects. The following result is very important for finding the number of selections when the chosen objects need not all be distinct.

The number of selections with repetition of r objects chosen from n types of objects is

$$C(n+r-1,\,r)$$

Example 22: How many ways are there to fill a box with a dozen marbles chosen from five different colors of marbles, with the requirement that at least one marble of each color is picked?

Solution: One can pick one marble of each color and then the remaining seven marbles in any way. There is no choice in picking one marble of each type.

## Page 102

The choice occurs in picking the remaining 7 marbles from 5 colors. By the result for r-combinations with repetition allowed,

this can be done in $C(5+7-1,\,7) = C(11,7) = 330$ ways.

Example 23: How many solutions does the equation $x_1 + x_2 + x_3 + x_4 = 15$ have, where $x_1, x_2, x_3$ and $x_4$ are non-negative integers?

Solution: Assume we have four types of unknown, $x_1, x_2, x_3$ and $x_4$, and 15 items or units (since we are looking for integer solutions). Every time an item is selected, it adds one to the type it is assigned to. Observe that a solution corresponds to a way of selecting 15 items from a set of four types. Therefore, the number of solutions equals the number of r-combinations with repetition allowed from a set with four elements:

$$C(4+15-1,\,15) = C(18,15) = C(18,3) = \frac{18 \times 17 \times 16}{3 \times 2 \times 1} = 816$$
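The formula $C(n+r-1, r)$ can be compared with a direct count. The following sketch is an illustration under stated assumptions: the helper name `multiset_count` is hypothetical, and the brute-force loop simply tries every value of $x_1, x_2, x_3$ and lets $x_4$ take up the remainder.

```python
from itertools import product
from math import comb

def multiset_count(n_types: int, r: int) -> int:
    """Number of r-combinations with repetition from n_types types."""
    return comb(n_types + r - 1, r)

# Brute force for x1 + x2 + x3 + x4 = 15: choose x1, x2, x3 freely;
# x4 = 15 - (x1 + x2 + x3) is valid whenever it is non-negative.
brute = sum(1 for x in product(range(16), repeat=3) if sum(x) <= 15)

print(multiset_count(5, 7))   # 330 (seven marbles from five colors)
print(multiset_count(4, 15))  # 816 (fifteen units among four unknowns)
print(brute)                  # 816
```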

Example 24: In how many ways can a teacher choose one or more students from 5 students?

Solution: The set of students has 5 elements, therefore the total number of subsets is $2^5 = 32$.

To select one or more students, we must exclude the empty set.

Therefore the total number of selections = 32 − 1 = 31 ways.

5.8 COMBINATIONS, STIRLING’S APPROXIMATION TO N!

A helpful and commonly used approximation for evaluating the factorials of large numbers is Stirling's approximation. It is a good approximation, leading to accurate results even for small values of n. It is given by

$$n! \approx \sqrt{2\pi n}\; n^{n} e^{-n}$$

where e = 2.71828… is the base of natural logarithms.
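To get a feel for how good the approximation is, one can compare it with exact factorials. This is a minimal sketch using the standard form $\sqrt{2\pi n}\,(n/e)^n$; the function name `stirling` and the chosen values of n are assumptions for illustration.

```python
import math

def stirling(n: int) -> float:
    """Stirling's approximation to n!: sqrt(2*pi*n) * (n/e)**n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

for n in (1, 5, 10, 20):
    exact = math.factorial(n)
    approx = stirling(n)
    rel_err = (exact - approx) / exact   # shrinks as n grows
    print(f"n={n:2d}  exact={exact:.6g}  stirling={approx:.6g}  rel. error={rel_err:.2%}")
```

The relative error falls as n increases, which is why the formula is mainly used for large n.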

5.9 RELATION OF PROBABILITY TO POINT SET THEORY, EULER OR VENN DIAGRAMS AND PROBABILITY

a) Relation of Probability to Point Set Theory:

In discrete probability we assume a well-defined experiment, such as flipping a coin or rolling a die. Each individual result that could occur is called an outcome. The set of all outcomes is called the sample space, and any subset of the sample space is called an event.

The union of two or more sets is the set that contains all the elements of those sets. Union is denoted by the symbol ∪.

## Page 103

The general probability addition rule for the union of two events states that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, where $A \cap B$ is the intersection of the two sets.
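The addition rule can be verified on a small sample space by treating events as Python sets. The example below is a sketch: the die experiment and the two events (an even number, a number greater than 3) are assumptions chosen for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))               # one roll of a fair die
A = {x for x in sample_space if x % 2 == 0}   # event: even number
B = {x for x in sample_space if x > 3}        # event: number greater than 3

def prob(event: set) -> Fraction:
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

lhs = prob(A | B)                        # P(A ∪ B) counted directly
rhs = prob(A) + prob(B) - prob(A & B)    # addition rule
print(lhs, rhs)  # 2/3 2/3
```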

Euler or Venn Diagrams and Probability:

In probability, a Venn diagram is a figure with one or more circles inside a rectangle that describes the logical relations between events. The rectangle in a Venn diagram represents the sample space or universal set, that is, the set of all possible outcomes. A circle inside the rectangle represents an event, that is, a subset of the sample space. We consider the following Venn diagram involving two events, $A$ and $B$.

In the above diagram, we have two events $A$ and $B$ within the sample space $S$ (or universal set).

The white and red regions represent event A, the black and red regions represent event B, the red region alone represents $A \cap B$, and the white, black and red regions together represent $A \cup B$. The blue region represents $\overline{A \cup B}$.

If the circles do not overlap, then $A$ and $B$ are mutually exclusive events, i.e. $A \cap B = \emptyset$.

If the circles overlap, then $A$ and $B$ intersect each other, i.e. $A \cap B \neq \emptyset$.

The region outside both circles but within the rectangle represents the complement of the union of the two events, i.e. $\overline{A \cup B}$.

Within each divided region of a Venn diagram, we can add data in one of

the following ways:

The outcomes of the event,

The number of outcomes in the event,

The probability of the event.

## Page 104

5.10 LET US SUM UP

In this unit we have learned:

Basic concept and Definitions of Probability.

Conditional Probability.

Independent and Dependent Events, Mutually Exclusive Events.

Probability Distributions for discrete distribution.

Mathematical Expectation for probability distribution.

Relation between Population, Sample Mean, and Variance.

Combinations, Stirling’s Approximation to n!.

Relation of Probability to Point Set Theory, represented with Euler or Venn Diagrams.

5.11 UNIT END EXERCISES

1. A card is drawn at random from a well-shuffled pack of cards. Find the probability that it is a red card or a king.

2. There are 30 tickets bearing numbers from 1 to 15 in a bag. One ticket is drawn from the bag at random. Find the probability that the ticket bears a number which is even or a multiple of 3.

3. In a group of 200 persons, 100 like sweet food items, 120 like salty food items and 50 like both. A person is selected at random. Find the probability that the person (i) likes sweet food items but not salty food items, (ii) likes neither.

4. A bag contains 7 white balls and 5 red balls. One ball is drawn from the bag and it is replaced after noting its color. In the second draw, again one ball is drawn and its color is noted. Find the probability of the event that both the balls drawn are of different colors.

5. The probability of A winning a race is $\frac{1}{3}$ and that of B winning the race is $\frac{3}{5}$. Find the probability that (a) either of the two wins the race, (b) no one wins the race.

6. Three machines A, B and C manufacture respectively 0.3, 0.5 and 0.2 of the total production. The percentages of defective items produced by A, B and C are 4, 3 and 2 percent respectively. For an item chosen at random, what is the probability that it is defective?

7. An urn A contains 3 white and 5 black balls. Another urn B contains 5 white and 7 black balls. A ball is transferred from urn A to urn B, then a ball is drawn from urn B. Find the probability that it is white.

8. A husband & wife appear in an interview for two vacancies in the

same post. The probability of husband selection is ଵ

& that of wife’s

selection is $\frac{1}{5}$. What is the probability that (a) both of them will be selected, (b) only one of them will be selected, (c) none of them will be selected?

## Page 105

9. A problem in statistics is given to 3 students A, B and C, whose chances of solving it are $\frac{1}{2}$, $\frac{3}{4}$ and $\frac{1}{4}$ respectively. What is the probability that the problem will be solved?

10. A bag contains 8 white and 6 red balls. Find the probability of drawing 2 balls of the same color.

11. Find the probability of drawing an ace or a spade or both from a deck of cards.

12. A can hit a target 3 times in 5 shots, B 2 times in 5 shots and C 3 times in 4 shots. They fire a volley. What is the probability that (a) 2 shots hit? (b) at least 2 shots hit?

13. A purse contains 2 silver and 4 copper coins, and a second purse contains 4 silver and 4 copper coins. If a coin is selected at random from one of the two purses, what is the probability that it is a silver coin?

14. The contents of three urns are: 1 white, 2 red, 3 green balls; 2 white, 1 red, 1 green balls; and 4 white, 5 red, 3 green balls. Two balls are drawn from an urn chosen at random. These are found to be 1 white and 1 green. Find the probability that the balls so drawn came from the second urn.

15. Three machines A, B and C produce identical items. Of their respective output, 2%, 4% and 5% of items are faulty. On a certain day A has produced 30% of the total output, B has produced 25% and C the remainder. An item selected at random is found to be faulty. What are the chances that it was produced by the machine with the highest output?

16. A person speaks truth 3 times out of 7. When a die is thrown, he says

that the result is a 1. What is the probability that it is actually a 1?

17. There are three radio stations A, B and C which can be received in a

city of 1000 families. The following information is available on the

basis of a survey:

(a) 1200 families listen to radio station A

(b) 1100 families listen to radio station B.

(c) 800 families listen to radio station C.

(d) 865 families listen to radio station A & B.

(e) 450 families listen to radio station A & C.

(f) 400 families listen to radio station B & C.

(g) 100 families listen to radio station A,B & C.

Find the probability that a family selected at random listens to at least one radio station.

18. The probability distribution of a random variable X is as follows.

| X | 1 | 3 | 5 | 7 | 9 |
|---|---|---|---|---|---|
| P(x) | k | 2k | 3k | 3k | k |

## Page 106

Find the value of (i) k, (ii) E(X).

19. A player tossed 3 coins. He wins Rs. 200 if all 3 coins show tail, Rs.

100 if 2 coins show tail, Rs. 50 if one tail appears and loses Rs. 40 if

no tail appears. Find his mathematical expectation.

20. The probability distribution of daily demand of cell phones in a mobile gallery is given below. Find the expected mean.

| Demand | 5 | 10 | 15 | 20 |
|---|---|---|---|---|
| Probability | 0.4 | 0.22 | 0.28 | 0.10 |

21. If 𝑃(𝐴)=ସ

ଵହ ,𝑃(𝐵)=

ଵହ and if A and B are independent events, find

(𝑖)𝑃(𝐴∩𝐵),(𝑖𝑖)𝑃(𝐴∪𝐵),(𝑖𝑖𝑖)𝑃(𝐴̅∩𝐵ത).

22. If $P(A) = \frac{5}{9}$, $P(\bar{B}) = \frac{2}{9}$ and if A and B are independent events, find (i) $P(A \cap B)$, (ii) $P(A \cup B)$, (iii) $P(\bar{A} \cap \bar{B})$.

23. If 𝑃(𝐴)=0.65 ,𝑃(𝐵)=0.75 and 𝑃(𝐴∩𝐵)=0.45, where A and B

are events of sample space S , find (𝑖)𝑃(𝐴|𝐵) ,(𝑖𝑖)𝑃(𝐴∪𝐵),

(𝑖𝑖𝑖)𝑃(𝐴̅∩𝐵ത).

24. A box containing 5 red and 3 black balls, 3 balls are drawn at random

from box. Find the expected number of red balls drawn.

25. Two fair dice are rolled. X denotes the sum of the numbers appearing

on the uppermost faces of the dice. Find the expected value.

26. A bag contains 5 black marbles and 6 white marbles. Find the number

of ways that five marbles can be drawn from the bag such that it

contains i) No restriction ii) no black marbles, iii) 3 black and 2 white,

iv) at least 4 black, v) All are of same colors .

27. A student is to answer 8 out of 10 questions on an exam. Find the

number of ways that the student can chose the 8 questions if i) No

restriction, ii) student must answer the first 4 questions, iii) student

must answer atleast 4 out of the five questions.

28. There are 12 points in a given plane, no three on the same line. i) How

many triangle are determine by the points? ii) How many of these

triangle contain a particular point as a vertex?

29. How many committees of two or more can be selected from 8 people?

30. Find the number of combinations if the letters of the letters of the

word EXAMINATION taken out at a time.

5.12 LIST OF REFERENCES

Statistics by Murray R. Spiegel and Larry J. Stephens. Publication: McGraw-Hill International.

Fundamental Mathematics and Statistics by S.C. Gupta and V.K. Kapoor.

Mathematical Statistics by J.N. Kapur and H.C. Saxena.

*****

## Page 107

6

ELEMENTARY SAMPLING

THEORY

Unit Structure

6.1 Objective

6.2 Introduction

6.3 Sampling Theory

6.3.1 Random Samples and Random Numbers,

6.3.2 Sampling With and Without Replacement,

6.4 Sampling Distributions,

6.4.1 Sampling Distribution of Means,

6.4.2 Sampling Distribution of Proportions,

6.4.3 Sampling Distributions of Differences and Sums,

6.5 Standard Errors,

6.6 Summary

6.7 Exercise

6.8 List of References

6.1 OBJECTIVE

After going through this chapter you will be able to know:

Sampling and its requirements in statistics.

Random sampling with and without replacement.

Sampling distribution of Mean, Proportions, difference and sum.

Standard errors in sampling distribution.

Some software to use for sampling.

6.2 INTRODUCTION

In the previous chapters, we have discussed probability theory. In this

chapter, we will introduce some basic concepts in statistics. The basic idea

of statistical inference is to assume that the observed data is generated

from some unknown probability distribution, which is often assumed to

have a known functional form up to some unknown parameters. The

purpose of statistical inference is to develop theory and methods to make

inference on the unknown parameters based on observed data.

Sampling theory provides the tools and techniques for data collection, keeping in mind the objectives to be fulfilled and the nature of the population. Sample surveys collect information on a fraction of the total population, whereas in a census the information is collected on the whole population.

## Page 108

Sampling is widely used, and its applications are seen in many vital fields. Sampling theory matters because different sampling methods give statistical analyses of different efficiency. There are three different methods of sampling, and this chapter describes the process and methods of sampling.

6.3 SAMPLING THEORY

Often we are interested in drawing some valid inferences about a large group of individuals or objects, called the population in statistics. Instead of studying the entire population, which may be difficult or even impossible, we may study only a small portion of the population, called a sample. Our objective is to draw valid inferences about certain facts for the population from results found in the sample, a process known as statistical inference. The process of obtaining samples is called sampling, and the theory concerning sampling is called sampling theory.

In sampling theory, the central step is the creation of a sample set. When the sample is drawn properly, it retains enough accuracy to bring out correct statistical information. The population is usually a huge set, and studying it in full is exhausting in both money and time. Creating a sample set saves time and effort and is a vital step in the process of statistical data analysis.

Process of Sampling:

In this part of the chapter, we discuss a few details regarding the process of sampling. The steps are as follows:

The first step is a wise choice of the population set.

The second step is focusing on the sample set and its size.

Then, one needs to choose an identifiable property based on which the samples will be created out of the population set.

Then, the samples can be chosen using any of the types of sampling: simple random, systematic, or stratified.

Next, check for inaccuracy, if there is any.

Finally, the sample set is obtained as the result.

Sampling can be done by three different methods, which are given below:

1. Simple random sampling.

2. Systematic sampling.

3. Stratified sampling.

6.3.1 Random Samples and Random Numbers:

Definition: Simple random sampling is defined as a sampling technique where every item in the population has an equal chance and likelihood of being selected in the sample.

## Page 109

Here the selection of items depends entirely on luck or probability, and therefore this sampling technique is also sometimes known as a method of chances. For example, the lottery method is one of the oldest ways and is a mechanical example of random sampling. In this method, the researcher gives each member of the population a number. Researchers draw numbers from the box randomly to choose samples. The use of random numbers is an alternative method that also involves numbering the population. A table of random numbers can help with this sampling technique.

Simple random sampling (SRS) is a method of selecting a sample comprising n sampling units out of a population having N sampling units such that every sampling unit has an equal chance of being chosen.

Simple random sampling methods:

Researchers follow these methods to select a simple random sample:

1. They first prepare a list of all the population members, and then each member is marked with a specific number (for example, if there are N members, they will be numbered from 1 to N).

2. From this population, researchers choose random samples in one of two ways: random number tables or random number generator software. Researchers prefer random number generator software, as no human interference is necessary to generate samples.
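The random number generator approach can be sketched with Python's standard `random` module. The population of 20 numbered members, the sample size of 5 and the fixed seed are all assumptions made so that the illustration is small and reproducible.

```python
import random

population = list(range(1, 21))   # hypothetical members numbered 1..N, with N = 20
rng = random.Random(42)           # fixed seed only so the sketch is reproducible

# Simple random sample of size n = 5: every member is equally likely
# to be chosen and no member can appear twice.
sample = rng.sample(population, k=5)
print(sorted(sample))
```

Changing or removing the seed gives a different sample on each run, which is the behaviour wanted in practice.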

Advantages of simple random sampling:

1. It is a fair method of sampling, and if applied appropriately, it helps to reduce bias compared with other sampling methods.

2. Since it involves a large sample frame, it is usually easy to pick a smaller sample from the existing larger population.

3. The person conducting the research does not need prior knowledge of the data being collected. One can simply ask a question to gather the data; the researcher need not be a subject expert.

4. This sampling method is a fundamental method of collecting data. You do not need any technical knowledge. You only require essential listening and recording skills.

5. Since the population size is vast in this type of sampling method, there is no restriction on the sample size that the researcher needs to create. From a larger population, you can get a small sample quite quickly.

6. The data collected through this sampling method is well informed; the larger the sample, the better the quality of the data.

Disadvantages:

1. Sampling is not feasible where knowledge about each element or unit of a statistical population is needed.

## Page 110

2. The sampling procedures must be correctly designed and followed; otherwise, what we call a wild sample would crop up, with misleading results.

3. Each type of sampling has its own limitations.

4. There are numerous situations in which the units to be measured are highly variable. Here a very large sample is required in order to yield enough cases for achieving statistically reliable information.

5. To know certain population characteristics like population growth rate, population density etc., a census of the population at regular intervals is more appropriate than studying by sampling.

6.3.2 Sampling With and Without Replacement:

Selection with Replacement (SWR): In this case, a unit is selected from a population with a known probability, and the unit is returned to the population before the next selection is made (after recording its characteristic). Thus, in this method the population size remains constant at each selection, and the probability at each selection or draw remains the same. Under this sampling plan, a unit has a chance of being selected more than once. For example, a card is randomly drawn from a pack of cards and placed back in the pack, after noting its face value, before the next card is drawn. Such a sampling method is known as sampling with replacement. There are $N^n$ possible samples of size n from a population of N units in the case of sampling with replacement.

Sampling without replacement (SWOR): In this selection procedure, if a unit from a population of size N is selected, it is not returned to the population. Thus, for any subsequent selection, the population size is reduced by one. Obviously, at the time of the first selection, the population size is N and the probability of a unit being selected randomly is $1/N$; for the second unit to be randomly selected, the population size is $(N-1)$ and the probability of selection of any one of the remaining sampling units is $1/(N-1)$; similarly, at the third draw, the probability of selection is $1/(N-2)$, and so on.
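The two schemes can be contrasted on a tiny population. In the sketch below the four-unit population and the sample size are assumptions for illustration. Ordered samples are counted, so the with-replacement count is $N^n$ and the without-replacement count is $N(N-1)\cdots(N-n+1)$.

```python
import random
from itertools import permutations, product

population = ["a", "b", "c", "d"]   # hypothetical population, N = 4
n = 2                               # sample size

with_repl = list(product(population, repeat=n))    # ordered, repeats allowed
without_repl = list(permutations(population, n))   # ordered, no repeats
print(len(with_repl), len(without_repl))           # 16 12

rng = random.Random(0)                 # seeded only for reproducibility
print(rng.choices(population, k=n))    # one draw with replacement
print(rng.sample(population, k=n))     # one draw without replacement
```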

6.4 SAMPLING DISTRIBUTIONS

A sampling distribution is the probability distribution of a statistic (such as a mean or a proportion) computed from samples of a given size drawn from a large population. Its primary purpose is to let results from small samples represent a comparatively larger population. Since the population is too large to analyze, a smaller group is selected and repeatedly sampled, or analyzed. The gathered data, or statistic, is used to calculate the likely occurrence, or probability, of an event. Using a sampling distribution simplifies the process of making inferences, or conclusions, about large amounts of data.

## Page 111

The idea behind a sampling distribution is that when you have a large amount of data (gathered from a large group), the value of a statistic from random samples of a small group will inform you of that statistic's value for the entire group. Once the data is plotted on a graph, the values of a statistic such as the mean in random samples will, for sufficiently large samples, form an approximately normal distribution from which you can draw inferences.

Each random sample selected will have a different value of the statistic being studied. For example, if you randomly sample data three times and determine the mean, or the average, of each sample, all three means are likely to be different and fall somewhere along the graph. That is variability. If you do that many times, the data you plot should eventually look like a bell curve. The distribution obtained in this way is the sampling distribution.
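The repeated-sampling idea can be simulated directly. The sketch below is an illustration under stated assumptions: the skewed population of 10,000 values, the sample size of 30 and the 2,000 repetitions are arbitrary choices, and the seed is fixed only for reproducibility.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical skewed population: 10,000 exponential values with mean about 10.
population = [rng.expovariate(1 / 10) for _ in range(10_000)]

n = 30   # size of each sample
sample_means = [
    statistics.fmean(rng.choices(population, k=n))   # one sample, with replacement
    for _ in range(2_000)
]

print(round(statistics.fmean(population), 2))              # population mean
print(round(statistics.fmean(sample_means), 2))            # mean of the sample means
print(round(statistics.pstdev(population) / n ** 0.5, 2))  # sigma / sqrt(n)
print(round(statistics.pstdev(sample_means), 2))           # observed spread of the means
```

The first two printed values should be close to each other, as should the last two, even though the population itself is far from bell-shaped.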

Factors that influence sampling distribution:

The sampling distribution's variability can be measured either by its standard deviation, also called the "standard error of the mean," or by the population variance, depending on the context and the inferences you are trying to draw. Both are mathematical formulas that measure the spread of data points in relation to the mean.

There are three primary factors that influence the variability of a sampling distribution. They are:

The number observed in a population: This variable is represented by "N." It is the measure of observed activity in a given group of data.

The number observed in the sample: This variable is represented by "n." It is the measure of observed activity in a random sample of data that is part of the larger grouping.

The method of choosing the sample: How the samples were chosen can account for variability in some cases.

Types of distributions :

There are three standard types of sampling distributions in statistics.

1. Sampling Distribution of Means.

2. Sampling Distribution of Proportions.

3. Sampling Distributions of Differences and Sums.

6.4.1 Sampling Distribution of Means:

The most common type of sampling distribution is that of the mean. It focuses on calculating the mean of every sample group chosen from the population and plotting the data points. The graph shows a normal distribution where the center is the mean of the sampling distribution, which represents the mean of the entire population.

The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled. Therefore, if a population has a mean $\mu$, then the mean of the sampling distribution of the mean is also $\mu$. The symbol $\mu_{\bar{X}}$ is used to refer to the mean of the sampling distribution of the mean.

## Page 112

Therefore, the formula for the mean of the sampling distribution of the mean can be written as:

$$\mu_{\bar{X}} = \mu$$

The standard deviation of the sampling distribution of the mean is computed as follows:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

That is, the standard deviation of the sampling distribution of the mean is the population standard deviation divided by $\sqrt{n}$, where n is the sample size (the number of scores used to compute a mean). Thus, the larger the sample size, the smaller the standard deviation of the sampling distribution of the mean.

When sampling is done without replacement from a finite population of size N,

the mean of the sampling distribution of means is given by

$$\mu_{\bar{X}} = \mu$$

and the standard deviation of the sampling distribution of means is given by

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}.$$

Example 1: A population consists of the five numbers 5, 6, 7, 12, and 15. Consider all possible samples of size 2 that can be drawn with replacement from this population. Find

a) the mean of the population,

b) the standard deviation of the population,

c) the mean of the sampling distribution of means,

d) the standard deviation of the sampling distribution of means.

Solution: Here the population size is $N = 5$ and the sample size is $n = 2$.

a) The mean of the population is given by

$$\mu = \frac{5+6+7+12+15}{5} = \frac{45}{5} = 9$$

b) The standard deviation of the population is given by

$$\sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}}$$

$$\sigma = \sqrt{\frac{(5-9)^2+(6-9)^2+(7-9)^2+(12-9)^2+(15-9)^2}{5}}$$

$$\sigma = \sqrt{\frac{16+9+4+9+36}{5}}$$

## Page 113

$$\sigma = \sqrt{\frac{74}{5}} = 3.85$$

c) The mean of the sampling distribution of means:

There are $5 \times 5 = 25$ samples of size 2 that can be drawn with replacement. The samples are

(5,5), (5,6), (5,7), (5,12), (5,15),
(6,5), (6,6), (6,7), (6,12), (6,15),
(7,5), (7,6), (7,7), (7,12), (7,15),
(12,5), (12,6), (12,7), (12,12), (12,15),
(15,5), (15,6), (15,7), (15,12), (15,15).

Therefore, the corresponding sample means are

5, 5.5, 6, 8.5, 10,
5.5, 6, 6.5, 9, 10.5,
6, 6.5, 7, 9.5, 11,
8.5, 9, 9.5, 12, 13.5,
10, 10.5, 11, 13.5, 15.

The mean of the sampling distribution of means is given by

$$\mu_{\bar{X}} = \frac{\text{sum of sample means}}{25} = \frac{225}{25} = 9,$$

i.e. $\mu_{\bar{X}} = \mu$.

d) The standard deviation of the sampling distribution of means is

$$\sigma_{\bar{X}} = \sqrt{\frac{(5-9)^2+(5.5-9)^2+(6-9)^2+\cdots+(11-9)^2+(13.5-9)^2+(15-9)^2}{25}} = \sqrt{\frac{185}{25}} = \sqrt{7.4} = 2.72$$

This agrees with the formula: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{3.85}{\sqrt{2}} = 2.72$.
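The whole of Example 1 can be reproduced by enumerating the 25 ordered samples. This is a minimal sketch using only the standard library; the variable names are assumptions for illustration.

```python
from itertools import product
from statistics import fmean, pstdev

population = [5, 6, 7, 12, 15]
n = 2   # sample size

# All 25 ordered samples of size 2 drawn with replacement, and their means.
means = [fmean(s) for s in product(population, repeat=n)]

mu, sigma = fmean(population), pstdev(population)
print(mu, round(sigma, 2))                    # 9.0 3.85
print(fmean(means), round(pstdev(means), 2))  # 9.0 2.72
print(round(sigma / n ** 0.5, 2))             # 2.72
```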

Example 2: A population consists of the five numbers 7, 9, 10, 14, and 20.

Consider all possible samples of size 2 that can be drawn without

replacement from this population. Find

a) the mean of the population,

b) the standard deviation of the population,

c) the mean of the sampling distribution of means,

## Page 114

d) the standard deviation of the sampling distribution of means.

Solution: Here the population size is $N = 5$ and the sample size is $n = 2$.

a) The mean of the population is given by

$$\mu = \frac{7+9+10+14+20}{5} = \frac{60}{5} = 12$$

b) The standard deviation of the population is given by

$$\sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}}$$

$$\sigma = \sqrt{\frac{(7-12)^2+(9-12)^2+(10-12)^2+(14-12)^2+(20-12)^2}{5}}$$

$$\sigma = \sqrt{\frac{25+9+4+4+64}{5}}$$

$$\sigma = \sqrt{\frac{106}{5}} = 4.60$$

c) The mean of the sampling distribution of means:

There are ${}^{5}C_{2} = 10$ samples of size 2 that can be drawn without replacement (all samples are distinct selections). The samples are

(7,9), (7,10), (7,14), (7,20), (9,10), (9,14), (9,20), (10,14), (10,20), (14,20).

Therefore, the corresponding sample means are

8, 8.5, 10.5, 13.5, 9.5, 11.5, 14.5, 12, 15, 17.

The mean of the sampling distribution of means is given by

$$\mu_{\bar{X}} = \frac{\text{sum of sample means}}{10} = \frac{120}{10} = 12,$$

i.e. $\mu_{\bar{X}} = \mu$.

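d) The standard deviation of the sampling distribution of means is

## Page 115

$$\sigma_{\bar{X}} = \sqrt{\frac{(8-12)^2+(8.5-12)^2+(10.5-12)^2+\cdots+(12-12)^2+(15-12)^2+(17-12)^2}{10}} = \sqrt{\frac{79.5}{10}} = \sqrt{7.95} = 2.82$$

This agrees with the formula: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = \frac{4.60}{\sqrt{2}}\sqrt{\frac{5-2}{5-1}} = 2.82$.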
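Example 2 can be checked the same way, this time enumerating the 10 unordered samples drawn without replacement and applying the finite population correction. The sketch below uses only the standard library; the name `fpc` is an assumption for illustration.

```python
from itertools import combinations
from statistics import fmean, pstdev

population = [7, 9, 10, 14, 20]
N, n = len(population), 2

# All 10 samples of size 2 drawn without replacement, and their means.
means = [fmean(s) for s in combinations(population, n)]

sigma = pstdev(population)
fpc = ((N - n) / (N - 1)) ** 0.5          # finite population correction factor
print(fmean(means))                       # 12.0
print(round(pstdev(means), 4))            # 2.8196
print(round(sigma / n ** 0.5 * fpc, 4))   # 2.8196
```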

6.4.2 Sampling Distribution of Proportions:

This sampling distribution focuses on proportions in a population. Samples are selected and their proportions are calculated. The mean of the sample proportions from each group represents the proportion of the entire population.

Suppose random samples of size $n$ are drawn from a population in which the proportion with a characteristic of interest is $p$.

The sampling distribution of proportions measures the proportion of successes, i.e. the chance of occurrence of a certain event, by dividing the number of successes $x$ by the sample size $n$. Thus, the sample proportion is defined as

$$\hat{p} = \frac{x}{n}$$

The mean $\mu_{\hat{p}}$ and standard deviation $\sigma_{\hat{p}}$ of the sampling distribution are given by

$$\mu_{\hat{p}} = p, \qquad \sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$$

where $q$ is the probability of non-occurrence of the event, given by $q = 1-p$.

The following formula is used when the population is finite and the sampling is done without replacement:

$$\sigma_{\hat{p}} = \sqrt{\frac{N-n}{N-1}}\sqrt{\frac{pq}{n}}$$

If $n$ is large and $p$ is not too close to 0 or 1, the binomial distribution can be approximated by the normal distribution. In practice, the normal approximation can be used when both $np \ge 10$ and $nq \ge 10$.

Once we have the mean and standard deviation, we can find the probability of a sample proportion. Here, the Z score conversion formula is used to find the required probability, i.e.

## Page 116

$$Z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}}.$$

Example 3: A random sample of 100 students is taken from the population of all part-time students in Maharashtra, for which the overall proportion of females is 70%. Find the mean and standard deviation of the sample proportion.

Solution: Here $n = 100$, $p = 70\% = \frac{70}{100} = 0.7$,

$\therefore q = 1-p = 1-0.7 = 0.3$.

The mean is given by

$$\mu_{\hat{p}} = p = 0.7$$

and the standard deviation is given by

$$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.7 \times 0.3}{100}} = \sqrt{\frac{0.21}{100}} = 0.0458.$$

Example 4: Suppose it is known that 47% of Indians own a smartphone. If a random sample of 50 Indians is surveyed, what is the probability that the proportion of the sample who own a smartphone is between 50% and 54%?

Solution: Here we have $n = 50$, $p = 47\% = 0.47$,

$\therefore q = 1 - p = 1 - 0.47 = 0.53.$

Now we check the conditions for the normal approximation to the sampling distribution of the sample proportion:

$np = 50 \times 0.47 = 23.5 \geq 10, \qquad nq = 50 \times 0.53 = 26.5 \geq 10.$

Since both conditions are satisfied, the sampling distribution is approximately normal with mean and standard deviation

$\mu_{\hat{p}} = p = 0.47,$

$\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{0.47 \times 0.53}{50}} \approx 0.07$

Now, the probability that the proportion of the sample who own a smartphone is between 50% and 54% is given by

$P(0.50 < \hat{p} < 0.54) = P\left(\dfrac{0.50 - 0.47}{0.07} < Z < \dfrac{0.54 - 0.47}{0.07}\right)$

$= P(0.429 < Z < 1)$

$= P(Z < 1) - P(Z < 0.429)$

$= 0.8413 - 0.6627$

$= 0.1786$

## Page 117

∴ If the true proportion of Indians who own a smartphone is 47%, then there is a 17.86% chance that we would see a sample proportion between 50% and 54% when the sample size is 50.
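The calculations in Examples 3 and 4 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `proportion_sampling_params` and its `population_size` argument are illustrative assumptions, not part of the text. Because the code does not round the standard deviation to 0.07 or truncate the Z-scores, its answer for Example 4 comes out slightly lower than the hand-computed 0.1786.

```python
from math import sqrt
from statistics import NormalDist


def proportion_sampling_params(p, n, population_size=None):
    """Return (mean, std dev) of the sampling distribution of a proportion.

    population_size is a hypothetical optional argument: when given, the
    finite population correction for sampling without replacement is applied.
    """
    q = 1 - p
    sigma = sqrt(p * q / n)
    if population_size is not None:
        sigma *= sqrt((population_size - n) / (population_size - 1))
    return p, sigma


# Example 3: n = 100, p = 0.7
mu3, sigma3 = proportion_sampling_params(0.7, 100)
print("Example 3:", round(mu3, 4), round(sigma3, 4))

# Example 4: n = 50, p = 0.47, P(0.50 < p_hat < 0.54)
mu4, sigma4 = proportion_sampling_params(0.47, 50)
p_hat = NormalDist(mu4, sigma4)  # normal approximation (np, nq >= 10)
prob4 = p_hat.cdf(0.54) - p_hat.cdf(0.50)
print("Example 4:", round(prob4, 4))
```

Passing `population_size` would reproduce the finite-population formula given earlier in this section.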

6.4.3 Sampling Distributions of Differences and Sums:

Statistical analyses are very often concerned with the difference between

means. A typical example is an experiment designed to compare the mean

of a control group with the mean of an experimental group. Inferential

statistics used in the analysis of this type of experiment depend on the

sampling distribution of the difference between means.

The sampling distribution of the difference between means can be thought

of as the distribution that would result if we repeated the following three

steps over and over again:

1. Sample $n_1$ scores from Population 1 and $n_2$ scores from Population 2.

2. Compute the means of the two samples, $M_1$ and $M_2$.

3. Compute the difference between the means, $M_1 - M_2$.

The distribution of these differences between means is the sampling distribution of the difference between means.

As you might expect, the mean of the sampling distribution of the difference between means is

$\mu_{M_1 - M_2} = \mu_{M_1} - \mu_{M_2},$

which says that the mean of the distribution of differences between sample means is equal to the difference between the population means.

From the variance sum law, for independent samples we know that

$\sigma^2_{M_1 - M_2} = \sigma^2_{M_1} + \sigma^2_{M_2}.$

We can therefore write the standard deviation of the sampling distribution of the difference between means as

$\sigma_{M_1 - M_2} = \sqrt{\sigma^2_{M_1} + \sigma^2_{M_2}}.$

Similarly, the mean and standard deviation of the sampling distribution of the sum of means are given by

$\mu_{M_1 + M_2} = \mu_{M_1} + \mu_{M_2},$

$\sigma_{M_1 + M_2} = \sqrt{\sigma^2_{M_1} + \sigma^2_{M_2}}.$

Example 5: Let $S_1$ be a variable that stands for any of the elements of the population 4, 6, 8 and $S_2$ be a variable that stands for any of the elements of the population 3, 5. Compute a) $\mu_{S_1}$, b) $\mu_{S_2}$, c) $\mu_{S_1 - S_2}$, d) $\sigma_{S_1}$, e) $\sigma_{S_2}$, and f) $\sigma_{S_1 - S_2}$.

Solution:

a) Here $S_1$ ranges over the population 4, 6, 8.

## Page 118

$\mu_{S_1} = \dfrac{4 + 6 + 8}{3} = \dfrac{18}{3} = 6.$

b) Here $S_2$ ranges over the population 3, 5.

$\mu_{S_2} = \dfrac{3 + 5}{2} = 4$

c) The mean of the population consisting of the differences of any member of $S_1$ and any member of $S_2$ is given by

$\mu_{S_1 - S_2} = \dfrac{(4-3) + (6-3) + (8-3) + (4-5) + (6-5) + (8-5)}{6} = \dfrac{12}{6} = 2.$

Therefore, we can verify that

$\mu_{S_1 - S_2} = \mu_{S_1} - \mu_{S_2} = 6 - 4 = 2.$

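d) $\sigma_{S_1} = \sqrt{\dfrac{(4-6)^2 + (6-6)^2 + (8-6)^2}{3}} = \sqrt{\dfrac{8}{3}} = \sqrt{2.67} = 1.63$

e) $\sigma_{S_2} = \sqrt{\dfrac{(3-4)^2 + (5-4)^2}{2}} = \sqrt{\dfrac{2}{2}} = 1.$

f) The standard deviation of the population consisting of the differences of any member of $S_1$ and any member of $S_2$ is given by

$\sigma_{S_1 - S_2} = \sqrt{\dfrac{(1-2)^2 + (3-2)^2 + (5-2)^2 + (-1-2)^2 + (1-2)^2 + (3-2)^2}{6}} = \sqrt{\dfrac{22}{6}} = \sqrt{\dfrac{11}{3}} = 1.91$

Therefore, we can verify that

$\sigma_{S_1 - S_2} = \sqrt{\sigma^2_{S_1} + \sigma^2_{S_2}} = \sqrt{2.67 + 1} = \sqrt{3.67} = 1.91$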
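Example 5 can also be verified by enumerating every difference directly. The sketch below is one way to do this with the Python standard library; the variable names are illustrative only.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pstdev

s1 = [4, 6, 8]
s2 = [3, 5]

# All differences between a member of S1 and a member of S2.
diffs = [a - b for a, b in product(s1, s2)]

mu_diff = fmean(diffs)        # mean of the differences
sigma_diff = pstdev(diffs)    # population standard deviation of the differences

# Values predicted by the formulas for differences of independent variables.
mu_formula = fmean(s1) - fmean(s2)
sigma_formula = sqrt(pstdev(s1) ** 2 + pstdev(s2) ** 2)

print(mu_diff, mu_formula)
print(round(sigma_diff, 2), round(sigma_formula, 2))
```

Note that `pstdev` divides by the number of elements, matching the population standard deviation used in the worked solution.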

Example 6: The smartphone batteries of manufacturer A have a mean lifetime of 1050 days with a standard deviation of 150 days, while those of manufacturer B have a mean lifetime of 800 days with a standard deviation of 120 days. If random samples of 100 batteries of each brand are tested, what is the probability that the brand A batteries will have a mean lifetime that is at least (a) 200 days and (b) 280 days more than the brand B batteries?

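Solution: Let $\bar{X}_A$ and $\bar{X}_B$ denote the mean lifetimes of samples A and B, respectively. Then

## Page 119

$\mu_{\bar{X}_A - \bar{X}_B} = \mu_{\bar{X}_A} - \mu_{\bar{X}_B} = 1050 - 800 = 250$ days,

$\sigma_{\bar{X}_A - \bar{X}_B} = \sqrt{\dfrac{\sigma_A^2}{N_A} + \dfrac{\sigma_B^2}{N_B}} = \sqrt{\dfrac{(150)^2}{100} + \dfrac{(120)^2}{100}} = \sqrt{\dfrac{36900}{100}} = \sqrt{369} = 19.21$ days.

Therefore, the standardized variable for the difference in means is

$Z = \dfrac{(\bar{X}_A - \bar{X}_B) - \mu_{\bar{X}_A - \bar{X}_B}}{\sigma_{\bar{X}_A - \bar{X}_B}}$

and is very nearly normally distributed.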

a) The probability that the brand A batteries will have a mean lifetime that is at least 200 days more than the brand B batteries is

$P(\bar{X}_A - \bar{X}_B \geq 200) = P\left(Z \geq \dfrac{200 - 250}{19.21}\right) = P(Z \geq -2.60) = 0.5 + 0.4953 = 0.9953.$

b) The probability that the brand A batteries will have a mean lifetime that is at least 280 days more than the brand B batteries is

$P(\bar{X}_A - \bar{X}_B \geq 280) = P\left(Z \geq \dfrac{280 - 250}{19.21}\right) = P(Z \geq 1.56) = 0.5 - 0.4406 = 0.0594.$
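As a cross-check of Example 6, the difference of the two sample means can be modelled as a single normal variable. This is a minimal sketch assuming the normal approximation holds; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_a, sigma_a, n_a = 1050, 150, 100   # brand A
mu_b, sigma_b, n_b = 800, 120, 100    # brand B

# Mean and standard deviation of the difference of sample means.
mu_diff = mu_a - mu_b
sigma_diff = sqrt(sigma_a ** 2 / n_a + sigma_b ** 2 / n_b)
diff = NormalDist(mu_diff, sigma_diff)

p_at_least_200 = 1 - diff.cdf(200)    # part (a)
p_at_least_280 = 1 - diff.cdf(280)    # part (b)

print(round(sigma_diff, 2))
print(round(p_at_least_200, 4), round(p_at_least_280, 4))
```

Small differences from the tabulated answers are expected, because the hand calculation rounds the Z-scores to two decimals before using the table.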

6.5 STANDARD ERRORS

Another measure is the standard error, which is the standard deviation of the sampling distribution of an estimator. The idea is that if we draw repeated samples of fixed size $n$ from a population having mean $\mu$ and variance $\sigma^2$, each sample mean, say $\bar{x}$, will have a different value. Here $\bar{x}$ is a random variable and hence has a distribution. The standard deviation of $\bar{x}$ is called the standard error. It can be shown that the standard error $\sigma_{\bar{x}}$ of the mean $\bar{x}$ based on a sample of size $n$ is

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

From the above formula, it is clear that the larger the sample size, the smaller the standard error, and vice versa. An advantage of considering the standard error instead of the standard deviation is that, for large samples, this measure is less influenced by extreme values present in the population under consideration.

In practice we neither use $\sigma$ to calculate the standard error of $\bar{x}$ nor take more than one sample. Instead, we select a single sample, find its standard deviation $s$, and use the following formula to estimate the standard error of $\bar{x}$:

## Page 120

$S.E.(\bar{x}) = \dfrac{s}{\sqrt{n}}$

The standard error is commonly used in testing of hypotheses and in interval estimation. Even when the parent distribution is not normal, the distribution of the mean $\bar{x}$ can be treated as approximately normal for large $n$.
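The estimate $S.E.(\bar{x}) = s/\sqrt{n}$ is straightforward to compute from a single sample. The sketch below uses made-up data purely for illustration; the function name `standard_error` is an assumption, not something defined in the text.

```python
from math import sqrt
from statistics import fmean, stdev


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


# Hypothetical measurements, invented for this illustration only.
sample = [52, 48, 55, 50, 47, 53, 49, 51]

print("mean:", fmean(sample))
print("standard error:", round(standard_error(sample), 3))
```

Here `stdev` is the sample standard deviation (divisor $n - 1$); a larger sample with similar spread would give a smaller standard error.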

6.6 SUMMARY

In this unit we have learned:

In sampling theory we have random samples, and sampling with and without replacement.

Sampling distributions and their types.

Standard errors of sampling distributions.

6.7 EXERCISE

1. A population consists of the four numbers 8, 11, 12, and 19. Consider all possible samples of size 2 that can be drawn with replacement from this population. Find

a) the mean of the population,

b) the standard deviation of the population,

c) the mean of the sampling distribution of means,

d) the standard deviation of the sampling distribution of means.

2. A population consists of the seven numbers 3, 5, 7, 9, 11, 13, and 15.

Consider all possible samples of size 2 that can be drawn without

replacement from this population. Find

a) the mean of the populat ion,

b) the standard deviation of the population,

c) the mean of the sampling distribution of means,

d) the standard deviation of the sampling distribution of means.

3. Let 𝑆ଵ be a variable that stands for any of the elements of the

population 5, 8, 12 and 𝑆ଶ be a variable that stands for any of the

elements of the population 2, 6. Compute a) 𝜇ௌభ, b) 𝜇ௌమ, c) 𝜇ௌభିௌమ,

d)𝜎ௌభ, e) 𝜎ௌమ, and f) 𝜎ௌభିௌమ.

4. A certain t ype of electric light bulb has a mean lifetime of 1500 h and a

standard deviation of 150 h. Three bulbs are connected so that when

one burns out, another will go on. Assuming that the lifetimes are

normally distributed, what is the probability that lightin g will take place

for (a) at least 5000 h and (b) at most 4200 h?

5. Two distances are measured as 27.3 centimeters (cm) and 15.6 cm with standard deviations (standard errors) of 0.16 cm and 0.08 cm, respectively. Determine the mean and standard deviation of (a) the sum and (b) the difference of the distances.

## Page 121

6. A and B play a game of ‘‘heads and tails,’’ each tossing 50 coins. A

will win the game if she tosses 5 or more heads than B; otherwise, B

wins. Determine the odds against A winning any particular game.

7. The average income of men in a city is Rs. 20,000 with standard deviation Rs. 10,500, and the average income of women is Rs. 16,000 with standard deviation Rs. 8,000. A sample of 100 is selected from the population. Find the probability that the income is between Rs. 15,000 and Rs. 18,000.

8. A sample of 300 items selected at random had 32 defective items. Find

mean and standard deviation of sampling distribution of proportion.

9. Ball bearings of a given brand weigh 0.50 g with a standard deviation

of 0.02 g. What is the probability that two lots of 1000 ball bearings

each will differ in weight by more than 2 g?

10. Find the probability that in 120 tosses of a fair coin (a) less than 40%

or more than 60% will be heads and (b) 5/8 or more will be heads.

6.8 LIST OF REFERENCES

Statistics by Murray R. Spiegel and Larry J. Stephens, McGraw-Hill International.

Fundamental Mathematics and Statistics by S.C. Gupta and V.K. Kapoor.

Mathematical Statistics by J.N. Kapur and H.C. Saxena.

*****

## Page 122

UNIT III

7

STATISTICAL ESTIMATION

THEORY

Unit Structure

7.0 Objectives

7.1 Basic definitions

7.1.1 Population

7.1.2 Sample

7.1.3 Parameter

7.1.4 Statistic

7.1.5 Sampling distribution

7.1.6 Parameter Space

7.1.7 Estimator

7.1.8 Estimate

7.2 Point estimation

7.2.1 Unbiasedness

7.2.2 Consistency

7.2.3 Efficiency

7.2.4 Minimum variance unbiased estimator

7.2.5 Uniformly minimum variance unbiased estimator

7.2.6 Likelihood function

7.2.7 Sufficiency

7.3 Interval estimation

7.3.1 Probable error

7.4 Exercise

7.5 Summary

7.6 References

7.0 OBJECTIVES

To understand basic definitions related to point estimation.

To find the best point estimators to represent population characteristics.

To learn the requirements of a good estimator.

## Page 123

To find an appropriate confidence interval for the population parameters.

7.1 BASIC DEFINITIONS

7.1.1 Population :

A collection of all well-defined objects under study is called

population.

Example : Suppose we want to study the economic conditions of

primary teachers in Maharashtra, then the group of all primary

teachers in the state of Maharashtra is a population.

7.1.2 Sample :

A well defined finite subset of the population is called a sample.

Example : Suppose we want to study the economic conditions of

primary teachers in the state of Maharashtra, then the few primary

teachers (set of few teachers ) in the state of Maharashtra forms a

sample.

7.1.3 Parameter :

An unknown constant of a population that summarises or describes an aspect of the population (such as a mean or a standard deviation) is called a parameter. For example, if f(x, θ) is the pdf of a random variable X, the unknown constant θ is a parameter.

7.1.4 Statistic:

Any function of the sample values (observed values) is called a statistic. A statistic is constant for a given sample, but its value differs from sample to sample.

7.1.5 Sampling distribution :

The probability distribution of the sample statistic is called a sampling

distribution.

7.1.6 Parameter space :

The set of all admissible values of a parameter of the distribution is called the parameter space. It is denoted by Θ.

Example: $X \sim \text{Normal}(\mu, \sigma^2)$

$\Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma > 0\}$

7.1.7 Estimator :

Let $x_1, x_2, \dots, x_n$ be a sample of size n taken from a distribution having pdf f(x, θ), where θ ∈ Θ is an unknown parameter. A function $T = T(x_1, x_2, \dots, x_n)$ which maps the sample space (S) to the parameter space Θ is called an estimator. In other words, if a statistic $T = T(x_1, x_2, \dots, x_n)$ is used to estimate θ, and its value belongs to the parameter space, then it is said to be an estimator of θ.

## Page 124

7.1.8 Estimate :

A particular value of an estimator corresponding to the given sample

values is called an estimate of the population parameter.

In the theory of estimation , there are two parts, 1) Point Estimation, 2)

Interval Estimation.

7.2 POINT ESTIMATION

Let $x_1, x_2, \dots, x_n$ be a sample of size n taken from a distribution having pdf (probability density function) f(x, θ), where θ ∈ Θ is an unknown parameter. The method of using a sample statistic T to estimate the value of the parameter θ, which is a point on the real line R, is called "point estimation".

Requirements of good and reliable e stimators:

1. Unbiasedness

2. Consistency

3. Efficiency

4. Sufficiency

7.2.1 Unbiasedness :

An estimator $T = T(x_1, x_2, \dots, x_n)$ is said to be an unbiased estimator of θ if and only if

$E(T) = \theta, \quad \forall\, \theta \in \Theta.$

A parametric function $\phi(\theta)$ is said to be estimable if there exists a statistic $h(T)$ such that $E(h(T)) = \phi(\theta)$ $\forall\, \theta \in \Theta$.

7.2.1.1 The bias of an estimator:

An estimator $T = T(x_1, x_2, \dots, x_n)$ is said to be a biased estimator with bias $\dfrac{b(\theta)}{n}$ if

$E(T) = \theta + \dfrac{b(\theta)}{n} \quad \forall\, \theta \in \Theta$

or

$E(T - \theta) = \dfrac{b(\theta)}{n} \quad \forall\, \theta \in \Theta.$

If $b(\theta) > 0$, then the estimator T is called a biased estimator with an upward (positive) bias $\dfrac{b(\theta)}{n}$.

If $b(\theta) < 0$, then the estimator T is called a biased estimator with a downward (negative) bias $\dfrac{b(\theta)}{n}$.

If $b(\theta) = 0$, then the estimator T is called an unbiased estimator.

## Page 125

Example: Let $x_1, x_2, x_3$ be independent observations from Poisson($\lambda$). Show that $T = \dfrac{x_1 + x_2 + x_3}{3}$ is an unbiased estimator of $\lambda$.

Since $x_1, x_2, x_3$ are i.i.d. Poisson($\lambda$),

$E[x_1] = E[x_2] = E[x_3] = \lambda.$

Consider

$E[T] = E\left[\dfrac{x_1 + x_2 + x_3}{3}\right] = \dfrac{1}{3} E[x_1 + x_2 + x_3] = \dfrac{1}{3}(3\lambda) = \lambda.$

Hence, T is an unbiased estimator of $\lambda$.

7.2.1.2 MSE of an estimator:

The mean square error (MSE) of an estimator $T = T(x_1, x_2, \dots, x_n)$ of θ is

$MSE = E(T - \theta)^2$

7.2.2 Consistency:

A sequence of estimators $T_n = T(x_1, x_2, \dots, x_n)$ is said to be a consistent estimator for the parameter θ if, for any given $\epsilon > 0$,

$P[\,|T_n - \theta| < \epsilon\,] \to 1 \text{ as } n \to \infty.$

That is, the difference between $T_n$ and θ becomes smaller and smaller, in probability, as n goes to infinity.

Sufficient condition: a sequence of estimators $T_n = T(x_1, x_2, \dots, x_n)$ is a consistent estimator for the parameter θ if

$E(T_n) = \theta \quad \forall\, \theta \in \Theta$

$V[T_n] \to 0 \text{ as } n \to \infty$

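Example: Show that the sample mean is a consistent estimator of the population mean.

Let $x_1, x_2, \dots, x_n$ be a random sample of size n from a population with mean $\mu$ and variance $\sigma^2$.

First, we define the sample mean,

$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}, \qquad E[x_i] = \mu \ \text{and}\ V[x_i] = \sigma^2 \quad \forall\, i = 1, 2, \dots, n.$

$E[\bar{x}] = E\left[\dfrac{\sum_{i=1}^{n} x_i}{n}\right] = \dfrac{\sum_{i=1}^{n} E[x_i]}{n} = \dfrac{n\mu}{n}$

## Page 126

$E[\bar{x}] = \mu$

Consider, using the independence of the observations,

$V[\bar{x}] = V\left[\dfrac{\sum_{i=1}^{n} x_i}{n}\right] = \dfrac{1}{n^2} \sum_{i=1}^{n} V(x_i) = \dfrac{1}{n^2}\, n\sigma^2$

$V[\bar{x}] = \dfrac{\sigma^2}{n}$

$\lim_{n \to \infty} E[\bar{x}] = \mu \quad \text{and} \quad V[\bar{x}] = \dfrac{\sigma^2}{n} \to 0.$

Therefore, the sample mean is a consistent estimator of the population mean.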

Remark : If 𝑇 is a consistent estimator for 𝜃, then 𝜙 (𝑇) is a consistent

estimator for 𝜙 (𝜃), where 𝜙 is a continuous function.
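The consistency of the sample mean shown in the example above can be illustrated by simulation. The sketch below is an assumption-laden illustration, not part of the original text: it draws repeated samples from a normal population with hypothetical parameters $\mu = 10$, $\sigma = 2$ and shows that the variance of $\bar{x}$ shrinks roughly like $\sigma^2/n$.

```python
import random
from statistics import fmean, pvariance


def variance_of_sample_mean(n, trials=1000, mu=10.0, sigma=2.0, seed=0):
    """Empirical variance of the sample mean over repeated samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return pvariance(means)


for n in (5, 50, 500):
    empirical = variance_of_sample_mean(n)
    print(n, round(empirical, 4), "theory:", 4.0 / n)
```

The empirical values only approximate $\sigma^2/n$ because the number of trials is finite; increasing `trials` tightens the agreement.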

7.2.3 Efficiency:

An estimator $T_1$ is said to be more efficient than an estimator $T_2$ of the parameter θ if $V(T_1) < V(T_2)$. The relative efficiency of $T_2$ with respect to $T_1$ is defined as

$e = \dfrac{V[T_1]}{V[T_2]}.$

Example: Let $x_1, x_2, \dots, x_n$ be a random sample of size n from a Normal($\mu$, $\sigma^2$) population. Select the more efficient estimator between $\bar{x}$ (mean) and $\tilde{x}$ (median).

We know that

$E[\bar{x}] = \mu, \qquad V[\bar{x}] = \dfrac{\sigma^2}{n} \to 0 \text{ as } n \to \infty$

and, for large n,

$E[\tilde{x}] = \mu, \qquad V[\tilde{x}] = \left(\dfrac{\pi}{2}\right) \left(\dfrac{\sigma^2}{n}\right).$

Now we calculate the efficiency as

$e = \dfrac{V[\bar{x}]}{V[\tilde{x}]} = \dfrac{2}{\pi} < 1.$

Therefore, the sample mean is a more efficient estimator than the sample median.

## Page 127

7.2.4 Minimum Variance Unbiased Estimator (MVUE):

An estimator $T = T(x_1, x_2, \dots, x_n)$ is said to be a minimum variance unbiased estimator of the parameter θ if

1) $E[T] = \theta \quad \forall\, \theta \in \Theta$, and

2) $V[T] \leq V[T']$, where $T'$ is any other unbiased estimator of θ.

7.2.5 Uniformly minimum variance unbiased estimator (UMVUE):

Let θ be the unknown parameter and Θ be the parameter space of θ. Let $U(\theta)$ be the class of unbiased estimators T of θ such that $E[T^2] < \infty \ \forall\, \theta \in \Theta$, i.e.,

$U(\theta) = \{T : E[T] = \theta,\ E[T^2] < \infty \ \forall\, \theta \in \Theta\}.$

Then $T_0 \in U(\theta)$ is the UMVUE of θ if

$E[T_0 - \theta]^2 \leq E[T - \theta]^2 \quad \forall\, \theta \in \Theta \text{ and } \forall\, T \in U(\theta).$

7.2.6 Likelihood function:

Let $x_1, x_2, \dots, x_n$ be a sample of size n taken from a distribution having pdf f(x, θ), where θ ∈ Θ is an unknown parameter. Then the likelihood function of $x_1, x_2, \dots, x_n$ is defined as

$L(x_1, x_2, \dots, x_n, \theta) = f(x_1, \theta) \cdot f(x_2, \theta) \cdot f(x_3, \theta) \cdots f(x_n, \theta)$

$L(\underline{x}, \theta) = \prod_{i=1}^{n} f(x_i, \theta).$

A statistic T that maximizes the likelihood function $L(\underline{x}, \theta)$ is said to be the maximum likelihood estimator of θ.

7.2.7 Sufficiency:

Definition I: A statistic $T = T(x_1, x_2, \dots, x_n)$ based on a sample of size n from a distribution with pmf/pdf $f(x, \theta)$, $\theta \in \Theta$, is said to be a sufficient statistic for θ if and only if the information contained in T about θ is the same as the information contained in $x_1, x_2, \dots, x_n$ about θ.

Definition II: A statistic $T = T(x_1, x_2, \dots, x_n)$ based on a sample of size n from a distribution with pmf/pdf $f(x, \theta)$, $\theta \in \Theta$, is said to be a sufficient statistic for θ if and only if the conditional distribution of $x_1, x_2, \dots, x_n$ given T is independent of θ.

Definition III (Neyman factorization criterion): Let $x_1, x_2, \dots, x_n$ be a sample of size n from a distribution with pmf/pdf $f(x, \theta)$, $\theta \in \Theta$. A statistic $T = T(x_1, x_2, \dots, x_n)$ is sufficient for θ if and only if the joint probability function of $x_1, x_2, \dots, x_n$ can be expressed as the product of a function of T and θ and a function of $x_1, x_2, \dots, x_n$ only, i.e.,

$L(\underline{x}, \theta) = g(T, \theta) \cdot h(\underline{x}),$

## Page 128

where $g(T, \theta)$ is a function of T and θ only and $h(\underline{x})$ is a function of $x_1, x_2, \dots, x_n$ only.

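Example: Let $x_1, x_2, \dots, x_n$ be a random sample from a population having pdf

$f(x) = \begin{cases} \theta x^{\theta - 1}, & 0 \leq x \leq 1 \\ 0, & \text{otherwise} \end{cases}$

The likelihood function of $x_1, x_2, \dots, x_n$ is

$L(\underline{x}, \theta) = \prod_{i=1}^{n} \theta\, x_i^{\theta - 1} = \theta^n \prod_{i=1}^{n} x_i^{\theta - 1} = \left[\theta^n \prod_{i=1}^{n} x_i^{\theta}\right] \left[\dfrac{1}{\prod_{i=1}^{n} x_i}\right]$

$L(\underline{x}, \theta) = g(T, \theta) \cdot h(\underline{x}),$

where $g(T, \theta) = \theta^n \prod_{i=1}^{n} x_i^{\theta}$ and $h(\underline{x}) = \dfrac{1}{\prod_{i=1}^{n} x_i}$.

Therefore, by the Neyman factorization criterion, $\prod_{i=1}^{n} x_i$ is a sufficient statistic for θ.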

7.3 INTERVAL ESTIMATION

A confidence interval for an unknown parameter is an interval of possible values for the parameter. It is constructed so that, with a chosen degree of confidence, the actual value of the parameter lies within the lower and upper bounds of the interval.

Let $T_1$ and $T_2$ be two statistics such that

$P(T_1 > \theta) = \alpha_1$ … (1)

and $P(T_2 < \theta) = \alpha_2$ … (2)

where $\alpha_1$ and $\alpha_2$ are constants independent of θ. Equations (1) and (2) can be combined to give

$P(T_1 < \theta < T_2) = 1 - \alpha$ … (3)

where $\alpha = \alpha_1 + \alpha_2$.

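Example: If we take a large sample from a normal population with mean $\mu$ and standard deviation $\sigma$, then

$Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1),$

and this statistic can be used to construct a $100(1 - \alpha)\%$ confidence interval for the population mean, for example at $\alpha = 0.05$.

## Page 129

In general, a confidence interval for $\mu$ can be constructed by using the following normal probability statement,

$P(Z_{\alpha/2} \leq Z \leq Z_{1-\alpha/2}) = 1 - \alpha,$

where $P(Z < Z_{\alpha/2}) = \alpha/2$.

In the particular case $\alpha = 0.05$, we have $Z_{\alpha/2} = -Z_{1-\alpha/2} = -1.96$, which implies

$P(-1.96 \leq Z \leq 1.96) = 0.95$ … (from normal probability tables)

$\implies P\left(-1.96 \leq \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} \leq 1.96\right) = 0.95$

$\implies P\left(\bar{x} - 1.96 \dfrac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + 1.96 \dfrac{\sigma}{\sqrt{n}}\right) = 0.95$

Thus, $\bar{x} \pm 1.96 \dfrac{\sigma}{\sqrt{n}}$ are the 95% confidence limits for the unknown parameter $\mu$, the population mean, and the interval $\left(\bar{x} - 1.96 \dfrac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \dfrac{\sigma}{\sqrt{n}}\right)$ is called the 95% confidence interval.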
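The confidence limits $\bar{x} \pm Z_{1-\alpha/2}\, \sigma/\sqrt{n}$ derived above can be computed directly. The following is a minimal sketch; the function name `z_confidence_interval` and the input values are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical inputs: sample mean 50, known sigma 5, sample size 100.
low, high = z_confidence_interval(sample_mean=50.0, sigma=5.0, n=100)
print(round(low, 2), round(high, 2))
```

Choosing a smaller `alpha` (for example 0.01) widens the interval, reflecting the higher confidence level.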

Remark: For a small sample with unknown variance, use Student's t-distribution to construct a confidence interval for the population mean.

7.3.1 Probable error:

Probable error defines the half-range of an interval about the mean of a distribution, such that half of the values from the distribution will lie within the interval and half outside.

Thus, for a symmetric distribution it is equivalent to half the interquartile range, or the median absolute deviation.

For a normal distribution the probable error is $0.6745\sigma$.

7.4 EXERCISE

1. What do you understand by point estimation?

2. Explain the terms statistic and its sampling distribution.

3. Define estimator and estimate.

4. Write a note on the unbiasedness, consistency, and efficiency of an estimator. Also, define sufficient statistics.

5. Define MVUE and UMVUE.

6. State the Neyman factorization criterion.

7. Construct a confidence interval for a population mean where the population variance is known.

## Page 130

7.5 SUMMARY

In this chapter we studied basic definitions related to point estimation of population parameters.

A good estimator must satisfy the conditions of unbiasedness, consistency, efficiency and sufficiency.

Uniformly minimum variance unbiased estimators are a popular technique for estimating population parameters.

In this chapter we also studied confidence interval estimation of a population parameter.

7.6 REFERENCES

1. Gupta S. C. and Kapoor V. K., 2011, Fundamentals of Mathematical Statistics, 11th Ed., Sultan Chand and Sons.

2. Rohatgi V. K., 1939, Introduction to Probability and Statistics, Wiley.

3. Murray R. Spiegel and Larry J. Stephens, Statistics, 4th Ed., McGraw-Hill International.

4. J. N. Kapur and H. C. Saxena, 2005, Mathematical Statistics, 12th rev. Ed., S. Chand.

5. Agrawal B. L., 2003, Programmed Statistics, 2nd Ed., New Age International.

*****

## Page 131

8

STATISTICAL DECISION THEORY

Unit Structure

8.0 Objectives

8.1 Basic definitions

8.1.1 Statistical decision

8.1.2 Hypothesis

8.1.3 Null hypothesis

8.1.4 Alternative hypothesis

8.1.5 Simple and composite hypothesis

8.1.6 Errors in the test of significance

8.1.7 Critical region

8.1.8 Level of significance

8.1.9 Test statistic

8.1.10 One-tailed and two -tailed tests

8.1.11 Central limit theorem

8.1.12 Critical value

8.1.13 P-value

8.1.14 Power of the test

8.1.15 Procedure for testing a hypothesis

8.2 Large sample test

8.2.1 One sample Z -test for a mean

8.2.2 Two sample Z -test for the difference of two mean

8.2.3 Test for a testing population proportion

8.2.4 Test for testing equality of population proportion

8.3 Small sample test

8.3.1 A t-test for testing a population mean

8.3.2 Two sample t -test for the difference of two mean

8.3.3 Paired t -test

8.4 Control chart

8.4.1 A lot acceptance sampling plan

8.4.2 OC curve

8.4.3 Control chart

8.4.4 Variable control chart

8.4.5 Attribute control chart

8.5 Summary

## Page 132

8.6 Exercise

8.7 References

8.0 OBJECTIVES

In this chapter, students can learn:

Basic definitions related to the testing of hypotheses and decision making

Decision making through testing of hypotheses

Large and small sample tests

Decision making through statistical control charts

8.1 BASIC DEFINITIONS

8.1.1 Statistical decision :

Statistical decisions are decisions made on the basis of observations of a phenomenon that obeys probabilistic laws which are not completely known.

8.1.2 Hypothesis :

A definite statement about the population parameter is called a

hypothesis. A hypothesis is a claim to be tested.

Example : A particular scooter gives an average of 50 km per litre.

8.1.3 Null hypothesis :

A hypothesis of no difference is called the null hypothesis. It is usually denoted by $H_0$.

Example: If the population mean $\mu$ is claimed to equal a specified value $\mu_0$, the null hypothesis is $H_0: \mu = \mu_0$.

8.1.4 Alternative hypothesis:

A hypothesis that is accepted when $H_0$ is rejected is called the alternative hypothesis and is usually denoted by $H_1$. It is complementary to $H_0$.

Example: If $H_0: \mu = \mu_0$, i.e., the population has a specified mean $\mu_0$, then the alternative hypothesis could be:

i. $H_1: \mu \neq \mu_0$ ($\mu > \mu_0$ or $\mu < \mu_0$) … Two-tailed alternative

ii. $H_1: \mu > \mu_0$ … Right-tailed alternative

iii. $H_1: \mu < \mu_0$ … Left-tailed alternative

8.1.5 Simple and composite hypothesis :

A statistical hypothesis that completely specifies the population parameters is called a simple hypothesis, and a hypothesis that does not completely specify them is called a composite hypothesis.

Example: If $x_1, x_2, \dots, x_n$ is a random sample from a normal population with mean $\mu$ and variance $\sigma^2$, then $H_0: \mu = \mu_0$ and $\sigma^2 = \sigma_1^2$ is a simple hypothesis.

## Page 133

The following hypotheses, which are not fully specified, are composite hypotheses:

1) $H_1: \mu \neq \mu_0$  2) $H_0: \sigma^2 \neq \sigma_0^2$  3) $H_0: \mu = \mu_0$ and $\sigma^2 > \sigma_0^2$, etc.

8.1.6 Errors in the test of significance :

The main objective in sampling theory is to draw valid inferences about the population parameters based on sample results. In practice, we decide to accept or reject a lot after examining a sample drawn from it. In doing so, we are liable to commit two types of errors: rejection of a good lot and acceptance of a bad lot.

i. Type-I error: Rejecting $H_0$ when $H_0$ is true.

ii. Type-II error: Accepting $H_0$ when it is false (accepting $H_0$ when $H_1$ is true).

iii. Sizes of Type-I and Type-II errors:

$P[\text{Reject } H_0 \text{ when it is true}] = P[\text{Reject } H_0 \mid H_0] = P[\text{Reject a lot when it is good}] = \alpha$

$P[\text{Accept } H_0 \text{ when it is wrong}] = P[\text{Accept } H_0 \mid H_1] = P[\text{Accept a lot when it is bad}] = \beta$

In the above probabilities, $\alpha$ and $\beta$ are called the sizes of the Type-I and Type-II errors, respectively.

The four types of decisions are shown in the following table.

| Actual situation | Decision: Reject $H_0$ | Decision: Accept $H_0$ |
|---|---|---|
| $H_0$ is true | Type-I error | Correct decision |
| $H_0$ is false | Correct decision | Type-II error |

8.1.7 Critical region:

A region in the sample space S which amounts to rejection of $H_0$ is called the critical region or rejection region of $H_0$.

8.1.8 Level of significance:

The probability $\alpha$ that the value of the test statistic falls in the critical region when $H_0$ is true is known as the level of significance. That is, the level of significance is the probability of committing a Type-I error. Usually, we use a level of significance of 5% or 1%. The level of significance is always fixed in advance, before collecting the sample information.

8.1.9 Test statistic:

A function of the sample observations that is used to test $H_0$ is called a test statistic.

## Page 134

8.1.10 One-tailed and two-tailed tests:

A test of any statistical hypothesis where the alternative hypothesis is one-tailed (right-tailed or left-tailed) is called a one-tailed test.

Example: A test for the mean of a population, $H_0: \mu = \mu_0$ versus $H_1: \mu > \mu_0$ (right-tailed) or $H_1: \mu < \mu_0$ (left-tailed), is a one-tailed test. In the right-tailed test, the critical region lies entirely in the right tail of the sampling distribution of $\bar{X}$, while for the left-tailed test, the critical region lies entirely in the left tail of the sampling distribution of $\bar{X}$.

A test of any statistical hypothesis where the alternative hypothesis is two-tailed, such as $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ ($\mu > \mu_0$ or $\mu < \mu_0$), is known as a two-tailed test. In such a case, the critical region is given by the portion of the area lying in both tails (sides) of the probability curve of the test statistic $\bar{X}$.

8.1.11 Central limit theorem:

In many cases, the exact probability distribution of the test statistic T cannot be obtained. The difficulty is overcome using the normal approximation: the probability distribution of the standardized T is taken to be N(0, 1) as the sample size $n \to \infty$ (i.e., when n is sufficiently large). The theorem that supports this normal approximation is known as the central limit theorem.

Case I: Parent population is normal:

Let the random sample be drawn from $N(\mu, \sigma^2)$. By the definition of a random sample, the values $x_1, x_2, \dots, x_n$ of the sample are independent and identically distributed as $N(\mu, \sigma^2)$. Then the sample mean $\bar{X}$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$, i.e., $\bar{X} \sim N(\mu, \sigma^2/n)$. This result shows how the precision of a sample mean increases as the sample size increases, and

$Z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

is a standard normal variate.

Case II: Parent population is non-normal:

If the population from which the random sample is drawn has a non-normal distribution with finite mean $\mu$ and finite standard deviation $\sigma$, then, by the central limit theorem,

$Z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \text{ as } n \to \infty.$

8.1.12 Critical value:

The value of the test statistic that separates the critical (rejection) region from the acceptance region is called the critical value. It depends upon (1) the level of significance $\alpha$ used and (2) the alternative hypothesis, i.e., whether it is two-tailed or one-tailed.

## Page 135

8.1.13 P-value :

Another approach to testing is to find the p-value at which $H_0$ becomes significant, that is, the smallest level of significance $\alpha$ at which $H_0$ is rejected. The experimenter can then decide whether to accept or reject $H_0$ by comparing the chosen level $\alpha$ with the p-value. The criterion is: if the p-value is less than or equal to $\alpha$, reject $H_0$; otherwise, accept $H_0$.

Example: [Fig. 4: P-value for a one-tailed Z-test]

## Page 136
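The p-value decision rule of 8.1.13 can be sketched in code. This is an illustrative sketch only: the function name `p_value`, its `tail` argument, and the observed value $Z = 1.8$ are assumptions made for the example, not taken from the text.

```python
from statistics import NormalDist


def p_value(z, tail="right"):
    """p-value of an observed Z statistic under the standard normal."""
    std = NormalDist()
    if tail == "right":
        return 1 - std.cdf(z)
    if tail == "left":
        return std.cdf(z)
    if tail == "two":
        return 2 * (1 - std.cdf(abs(z)))
    raise ValueError("tail must be 'right', 'left' or 'two'")


alpha = 0.05
z_observed = 1.8  # hypothetical calculated value of the test statistic

for tail in ("right", "two"):
    p = p_value(z_observed, tail)
    decision = "reject H0" if p <= alpha else "accept H0"
    print(tail, round(p, 4), decision)
```

The same observed statistic can lead to different decisions for one-tailed and two-tailed alternatives, because the two-tailed p-value is twice the one-tailed one.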

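8.1.14 Power of test:

$\text{Power of test} = P[\text{Rejecting } H_0 \text{ given that } H_1 \text{ is true}]$

$= P[\text{Rejecting } H_0 \mid H_1 \text{ is true}]$

$= 1 - P[\text{Accepting } H_0 \mid H_1 \text{ is true}]$

$= 1 - P[\text{Type II error}]$

$= 1 - \beta$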

8.1.15 The procedure of testing of hypothesis:

Testing a hypothesis is a stepwise procedure that leads to rejection or acceptance of the null hypothesis based on samples drawn from the population. The various steps in testing a statistical hypothesis are as follows:

i. Set up the null hypothesis $H_0$.

ii. Set up the alternative hypothesis $H_1$. This enables us to decide whether we have to use a one-tailed (right or left) test or a two-tailed test.

iii. Choose an appropriate level of significance $\alpha$, i.e., $\alpha$ is fixed in advance.

iv. Choose the appropriate test statistic Z or T and find its value under the null hypothesis $H_0$.

v. Determine the critical values and critical region corresponding to the level of significance and the alternative hypothesis ($Z_\alpha$ for a one-tailed $H_1$ or $Z_{\alpha/2}$ for a two-tailed $H_1$).

vi. Decision rule: Compare the calculated value of Z with the tabulated value $Z_\alpha$ if $H_1$ is one-tailed, and compare |Z| with $Z_{\alpha/2}$ if $H_1$ is two-tailed (for a symmetrical distribution). If the calculated value is greater than the tabulated value, reject $H_0$ at the $\alpha$% level of significance and conclude that there is a significant difference; otherwise accept $H_0$ at the $\alpha$% level of significance and conclude that there is no significant difference.

8.2 LARGE SAMPLE TESTS

8.2.1 Test for a population mean:

Let $x = \{x_1, x_2, \dots, x_n\}$ be a random sample of size n taken from a normally distributed population having population mean $\mu$ and population variance $\sigma^2$.

We have to test the hypothesis

$H_0: \mu = \mu_0$

against

$H_1: \mu \neq \mu_0$ OR

$H_1: \mu > \mu_0$ OR

$H_1: \mu < \mu_0$

## Page 137

Under $H_0$, the test statistic is

$Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1),$

where n is the sample size, $\bar{x}$ is the sample mean, and $\sigma$ is the population standard deviation.

Let $Z_\alpha$ be the critical value at the $\alpha$ level of significance, where $P(Z \leq Z_\alpha) = \alpha$. We compare the calculated value of Z with the tabulated critical value:

If $H_1$ is two-tailed ($\mu \neq \mu_0$) and $|Z| \geq Z_{1-\alpha/2}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu > \mu_0$) and $Z \geq Z_{1-\alpha}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu < \mu_0$) and $Z \leq Z_\alpha$, then reject $H_0$.
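The one-sample Z-test described above can be sketched as a small function. This is a minimal illustration under the stated assumptions (known $\sigma$, normal or large sample); the function name `one_sample_z_test`, the `alternative` labels, and the scooter mileage figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n,
                      alternative="two-sided", alpha=0.05):
    """Return (Z, reject_H0) for H0: mu = mu0 with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std = NormalDist()
    if alternative == "two-sided":      # H1: mu != mu0
        reject = abs(z) >= std.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":      # H1: mu > mu0
        reject = z >= std.inv_cdf(1 - alpha)
    elif alternative == "less":         # H1: mu < mu0
        reject = z <= std.inv_cdf(alpha)
    else:
        raise ValueError("unknown alternative")
    return z, reject


# Hypothetical data: claimed mileage 50 km per litre, sample of 64 scooters
# with mean 48.5 and known population standard deviation 4.
z, reject = one_sample_z_test(48.5, 50, 4, 64, alternative="two-sided")
print(round(z, 2), reject)
```

Here `inv_cdf(q)` returns the value below which a fraction `q` of the standard normal distribution lies, which matches the convention $P(Z \leq Z_\alpha) = \alpha$.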

8.2.2 Test for the difference of two population means:

Let $x_1 = \{x_{11}, x_{12}, \dots, x_{1n_1}\}$ and $x_2 = \{x_{21}, x_{22}, \dots, x_{2n_2}\}$ be two independent samples from normally distributed populations having population means $\mu_1$ and $\mu_2$ and population variances $\sigma_1^2$ and $\sigma_2^2$, respectively. $\bar{x}_1$ and $\bar{x}_2$ are the sample means.

We have to test the hypothesis

$H_0: \mu_1 - \mu_2 = \mu_0$

against

$H_1: \mu_1 - \mu_2 \neq \mu_0$ OR

$H_1: \mu_1 - \mu_2 > \mu_0$ OR

$H_1: \mu_1 - \mu_2 < \mu_0$

Under $H_0$, the test statistic is

$Z = \dfrac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1).$

Let $Z_\alpha$ be the critical value at the $\alpha$ level of significance, where $P(Z \leq Z_\alpha) = \alpha$. We compare the calculated value of Z with the tabulated critical value:

If $H_1$ is two-tailed ($\mu_1 - \mu_2 \neq \mu_0$) and $|Z| \geq Z_{1-\alpha/2}$, then reject $H_0$.

## Page 138

If $H_1$ is one-tailed ($\mu_1 - \mu_2 > \mu_0$) and $Z \geq Z_{1-\alpha}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu_1 - \mu_2 < \mu_0$) and $Z \leq Z_\alpha$, then reject $H_0$.
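The two-sample statistic can be computed in the same way. The sketch below only evaluates Z and applies the two-tailed rule; the function name and all input values are hypothetical and chosen so that the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, mu0=0.0):
    """Z statistic for H0: mu1 - mu2 = mu0 with known variances."""
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return ((mean1 - mean2) - mu0) / standard_error


# Hypothetical summary data for two independent samples.
z = two_sample_z(mean1=75, mean2=72, sigma1=8, sigma2=6, n1=100, n2=100)

alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
reject_two_tailed = abs(z) >= critical

print(round(z, 2), round(critical, 2), reject_two_tailed)
```

With these inputs the standard error is exactly 1, so Z equals the observed difference of 3 and the two-tailed test rejects $H_0$ at the 5% level.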

8.2.3 Test for a population proportion:

We have to test the hypothesis

$H_0: P = P_0$

against

$H_1: P \neq P_0$ OR

$H_1: P > P_0$ OR

$H_1: P < P_0$

Under $H_0$, the test statistic is

$$Z = \frac{p - P_0}{\sqrt{\frac{P_0 Q_0}{n}}} \sim N(0, 1),$$

where $n$ is the sample size, $p$ is the sample proportion, and $Q_0 = 1 - P_0$.

Let $Z_\alpha$ be the critical value at the $\alpha$ level of significance, defined by $P(Z \le Z_\alpha) = \alpha$. We compare the calculated value of $Z$ with the tabulated critical value.

If $H_1$ is two-tailed ($P \neq P_0$) and $|Z| \ge Z_{1-\alpha/2}$, then reject $H_0$.

If $H_1$ is one-tailed ($P > P_0$) and $Z \ge Z_{1-\alpha}$, then reject $H_0$.

If $H_1$ is one-tailed ($P < P_0$) and $Z \le Z_\alpha$, then reject $H_0$.
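The single-proportion statistic can be sketched in R as follows. The helper `prop_z_test` and the counts are hypothetical; `x` denotes the number of successes in `n` trials.

```r
# Large-sample Z test for a single proportion.
# prop_z_test is a hypothetical helper; x is the number of successes.
prop_z_test <- function(x, n, P0, alpha = 0.05) {
  p <- x / n                                   # sample proportion
  z <- (p - P0) / sqrt(P0 * (1 - P0) / n)      # Q0 = 1 - P0
  list(p = p, z = z, reject_two_sided = abs(z) >= qnorm(1 - alpha / 2))
}

# Illustrative numbers only: 60 successes in 100 trials, H0: P = 0.5
res3 <- prop_z_test(x = 60, n = 100, P0 = 0.5)
res3$z  # 0.1 / 0.05 = 2
```

For comparison, base R's `prop.test(60, 100, p = 0.5, correct = FALSE)` reports a chi-square statistic equal to the square of this Z value.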

8.2.4 Test for equality of two population proportions:

We have to test the hypothesis

$H_0: P_1 = P_2$

against

$H_1: P_1 \neq P_2$ OR

$H_1: P_1 > P_2$ OR

$H_1: P_1 < P_2$

## Page 139

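Under $H_0$, the test statistic is

$$Z = \frac{p_1 - p_2}{\sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim N(0, 1),$$

where $p_1$ and $p_2$ are the sample proportions, $P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$ is the pooled proportion, and $Q = 1 - P$.

Let $Z_\alpha$ be the critical value at the $\alpha$ level of significance, defined by $P(Z \le Z_\alpha) = \alpha$. We compare the calculated value of $Z$ with the tabulated critical value.

If $H_1$ is two-tailed ($P_1 \neq P_2$) and $|Z| \ge Z_{1-\alpha/2}$, then reject $H_0$.

If $H_1$ is one-tailed ($P_1 > P_2$) and $Z \ge Z_{1-\alpha}$, then reject $H_0$.

If $H_1$ is one-tailed ($P_1 < P_2$) and $Z \le Z_\alpha$, then reject $H_0$.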
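The pooled-proportion calculation is easy to get wrong by hand, so a small R sketch is given here. The helper `two_prop_z_test` and the counts are hypothetical; `x1` and `x2` are the numbers of successes in the two samples.

```r
# Z test for equality of two proportions using the pooled estimate P.
# two_prop_z_test is a hypothetical helper; x1, x2 are success counts.
two_prop_z_test <- function(x1, n1, x2, n2, alpha = 0.05) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  P  <- (x1 + x2) / (n1 + n2)   # same as (n1*p1 + n2*p2) / (n1 + n2)
  Q  <- 1 - P
  z  <- (p1 - p2) / sqrt(P * Q * (1 / n1 + 1 / n2))
  list(P = P, z = z, reject_two_sided = abs(z) >= qnorm(1 - alpha / 2))
}

# Illustrative numbers only
res4 <- two_prop_z_test(x1 = 60, n1 = 100, x2 = 40, n2 = 100)
res4$P  # 0.5
res4$z  # 0.2 / sqrt(0.25 * 0.02), about 2.83
```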

8.3 SMALL SAMPLE TESTS

8.3.1 Test for the population mean:

Let $x = \{x_1, x_2, \ldots, x_n\}$ be a random sample of size $n$ ($n$ small) taken from a normally distributed population whose mean $\mu$ and variance $\sigma^2$ are unknown.

We have to test the hypothesis

$H_0: \mu = \mu_0$

against

$H_1: \mu \neq \mu_0$ OR

$H_1: \mu > \mu_0$ OR

$H_1: \mu < \mu_0$

Under $H_0$, the test statistic is

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \sim t_{n-1},$$

where $n$ is the sample size, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.

Let $t_{(n-1,\alpha)}$ be the critical value based on Student's t-distribution with $n-1$ degrees of freedom at the $\alpha$ level of significance, defined by $P(T \le t_{(n-1,\alpha)}) = \alpha$.

## Page 140

We compare the calculated value of $t$ with the tabulated critical value.

If $H_1$ is two-tailed ($\mu \neq \mu_0$) and $|t| \ge t_{(n-1,\,1-\alpha/2)}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu > \mu_0$) and $t \ge t_{(n-1,\,1-\alpha)}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu < \mu_0$) and $t \le t_{(n-1,\,\alpha)}$, then reject $H_0$.

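8.3.2 Test for the difference of two population means (two samples):

Let $x = \{x_1, x_2, \ldots, x_{n_1}\}$ and $y = \{y_1, y_2, \ldots, y_{n_2}\}$ be two independent samples from normally distributed populations with unknown means $\mu_1$ and $\mu_2$ and unknown (but assumed equal) variances $\sigma_1^2$ and $\sigma_2^2$ respectively. $\bar{x}$ and $\bar{y}$ are the sample arithmetic means.

We have to test the hypothesis

$H_0: \mu_1 - \mu_2 = \mu_0$

against

$H_1: \mu_1 - \mu_2 \neq \mu_0$ OR

$H_1: \mu_1 - \mu_2 > \mu_0$ OR

$H_1: \mu_1 - \mu_2 < \mu_0$

Under $H_0$, the test statistic is

$$t = \frac{(\bar{x} - \bar{y}) - \mu_0}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}, \qquad \text{where } s^2 = \frac{(n_1 - 1)s_x^2 + (n_2 - 1)s_y^2}{n_1 + n_2 - 2}.$$

Let $t_{(n_1+n_2-2,\,\alpha)}$ be the critical value based on Student's t-distribution with $n_1 + n_2 - 2$ degrees of freedom at the $\alpha$ level of significance, defined by $P(T \le t_{(n_1+n_2-2,\,\alpha)}) = \alpha$. We compare the calculated value of $t$ with the tabulated critical value.

If $H_1$ is two-tailed ($\mu_1 - \mu_2 \neq \mu_0$) and $|t| \ge t_{(n_1+n_2-2,\,1-\alpha/2)}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu_1 - \mu_2 > \mu_0$) and $t \ge t_{(n_1+n_2-2,\,1-\alpha)}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu_1 - \mu_2 < \mu_0$) and $t \le t_{(n_1+n_2-2,\,\alpha)}$, then reject $H_0$.

## Page 141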

8.3.3 Paired t-test for difference of means:

Let $\{x_i, y_i\}$, $i = 1, 2, \ldots, n$, be $n$ paired (dependent) observations from normally distributed populations with unknown means $\mu_1$ and $\mu_2$ and unknown variances $\sigma_1^2$ and $\sigma_2^2$ respectively.

We have to test the hypothesis

$H_0: \mu_1 - \mu_2 = 0$

against

$H_1: \mu_1 - \mu_2 \neq 0$ OR

$H_1: \mu_1 - \mu_2 > 0$ OR

$H_1: \mu_1 - \mu_2 < 0$

Under $H_0$, the test statistic is

$$t = \frac{\bar{d}}{s / \sqrt{n}} \sim t_{n-1}, \qquad \text{where } d_i = x_i - y_i,\ \bar{d} = \bar{x} - \bar{y} \text{ and } s^2 = \frac{1}{n-1}\sum (d_i - \bar{d})^2.$$

Let $t_{(n-1,\alpha)}$ be the critical value based on Student's t-distribution with $n-1$ degrees of freedom at the $\alpha$ level of significance, defined by $P(T \le t_{(n-1,\alpha)}) = \alpha$. We compare the calculated value of $t$ with the tabulated critical value.

If $H_1$ is two-tailed ($\mu_1 - \mu_2 \neq 0$) and $|t| \ge t_{(n-1,\,1-\alpha/2)}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu_1 - \mu_2 > 0$) and $t \ge t_{(n-1,\,1-\alpha)}$, then reject $H_0$.

If $H_1$ is one-tailed ($\mu_1 - \mu_2 < 0$) and $t \le t_{(n-1,\,\alpha)}$, then reject $H_0$.

8.4 STATISTICAL CONTROL CHARTS

8.4.1 A lot acceptance sampling plan (LASP):

A lot acceptance sampling plan (LASP) is a sampling scheme and a set of rules for making decisions. The decision, based on counting the number of defectives in a sample, can be to accept the lot, reject the lot, or even, for multiple or sequential sampling schemes, to take another sample and then repeat the decision process.

8.4.2 OC curve:

The Operating Characteristic (OC) curve plots the probability of accepting the lot (Y-axis) versus the lot fraction or percent defectives (X-axis). The OC curve is the primary tool for displaying and investigating the properties of a LASP.

## Page 142

8.4.3 Control charts:

Control charts are a statistical process control tool used to determine whether a manufacturing or business process is in a state of control. More precisely, control charts are graphical devices for Statistical Process Monitoring (SPM). Traditional control charts are mostly designed to monitor process parameters when the underlying form of the process distribution is known.

8.4.4 Variable control chart ($\bar{X}$ chart):

Dr Walter A. Shewhart proposed a general model for control charts in the 1920s. Let $w$ be a sample statistic that measures some continuously varying quality characteristic of interest (e.g., thickness), and suppose that the mean of $w$ is $\mu_w$, with a standard deviation of $\sigma_w$. Then the centre line, the Upper Control Limit (UCL), and the Lower Control Limit (LCL) are:

$UCL = \mu_w + k\sigma_w$

Centre line $= \mu_w$

$LCL = \mu_w - k\sigma_w$

where $k$ is the distance of the control limits from the centre line, expressed in terms of standard deviation units. When $k$ is set to 3, we speak of 3-sigma control charts. Historically, $k = 3$ has become an accepted standard in industry. The centre line is the process mean, which in general is unknown. We replace it with a target or the average of all the data. The quantity that we plot is the sample average, $\bar{X}$, and the chart is called the $\bar{X}$ chart.

We also have to deal with the fact that $\sigma$ is, in general, unknown. Here we replace $\sigma_w$ with a given standard value, or we estimate it by a function of the average standard deviation.
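The control limits can be sketched in R for subgroup data. This is a minimal illustration, not a prescribed method: the data are simulated, and the "function of the average standard deviation" is assumed here to be the usual $\bar{s}/c_4$ estimate, where the bias-correction constant $c_4$ is an addition that the text does not name.

```r
# X-bar chart limits from subgroup data (rows = subgroups).
# The data are simulated purely for illustration.
set.seed(1)
subgroups <- matrix(rnorm(20 * 5, mean = 10, sd = 0.2), nrow = 20)

n     <- ncol(subgroups)                 # subgroup size
xbar  <- rowMeans(subgroups)             # plotted statistic for each subgroup
s_bar <- mean(apply(subgroups, 1, sd))   # average subgroup standard deviation

# Assumed estimator: sigma is estimated by s_bar / c4 (c4 is not from the text)
c4      <- sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
sigma_w <- (s_bar / c4) / sqrt(n)        # estimated sd of a subgroup mean

k      <- 3                              # 3-sigma limits
center <- mean(xbar)                     # average of all the data
UCL    <- center + k * sigma_w
LCL    <- center - k * sigma_w

out_of_control <- which(xbar > UCL | xbar < LCL)  # subgroups outside the limits
```

Plotting `xbar` against the subgroup index with horizontal lines at `LCL`, `center` and `UCL` gives the chart itself.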

8.4.5 Attributes control charts:

The Shewhart control chart plots quality characteristics that can be measured and expressed numerically. We measure weight, height, position, thickness, etc. If we cannot represent a particular quality characteristic numerically, or if it is impractical to do so, we then often resort to using a quality characteristic to sort or classify an inspected item into one of two "buckets".

An example of a common quality characteristic classification would be designating units as "conforming units" or "nonconforming units". Another quality characteristic criterion would be sorting units into "non-defective" and "defective" categories. Quality characteristics of that type are called attributes.

Control charts dealing with the number of defects or nonconformities are called c charts (for the count).

## Page 143

Control charts dealing with the proportion or fraction of defective products are called p charts (for proportion).

There is another chart that handles defects per unit, called the u chart (for the unit). This applies when we wish to work with the average number of nonconformities per unit of product.

8.5 SUMMARY

Students can get an idea about the testing of hypotheses and about making decisions on parameters of interest. In this chapter, we briefly studied various large-sample and small-sample tests, for one and two samples, for the mean and the proportion. Variable and attribute control charts were also discussed.

8.6 EXERCISE

1. Explain the term hypothesis and its types.

2. Define Type-I and Type-II errors.

3. Explain the terms level of significance, p-value and power of a test.

4. Write down the stepwise procedure of testing of hypothesis.

5. Write down the procedure for testing the equality of two population proportions.

6. Write down the procedure to test the specified population mean, in the case of a small sample.

7. Explain in detail the paired t-test for difference of means.

8.7 REFERENCES

Gupta S. C. and Kapoor V. K., 2011, Fundamentals of Mathematical Statistics, 11th Ed, Sultan Chand & Sons.

Rohatgi V. K, 1939, Introduction to Probability and Statistics, Wiley.

Murray R. Spiegel, Larry J. Stephens, Statistics, 4th Ed, McGraw-Hill International.

J. N. Kapur and H. C. Saxena, 2005, Mathematical Statistics, 12th rev. ed, S. Chand.

Agrawal B. L., 2003, Programmed Statistics, 2nd Ed, New Age International.

Kanji G. K., 2006, 100 Statistical Tests, 3rd Ed, SAGE Publications.

*****

## Page 144

9

STATISTICS IN R

Unit Structure

9.0 Objectives

9.1 Descriptive statistics in R

9.2 Normal distribution

9.3 Binomial distribution

9.4 Frequency distribution

9.5 Data import and export

9.6 Summary

9.7 Exercise

9.8 References

9.0 OBJECTIVES

Use of R -software to find basic statistical measures

Use of R -software to simulate Normal Distributions

Use of R -software to simulate Binomial Distributions

Data import and Export in R.

9.1 DESCRIPTIVE STATISTICS IN R-SOFTWARE

1. Observation vector (c function)

Syntax

x <- c(1,4,7,12,19,15,21,20)

x

Output

[1] 1 4 7 12 19 15 21 20

2. Arithmetic mean

Syntax

AM <- mean(x)

AM

Output

[1] 12.375

3. Mode

Syntax

# Create the function. If several values tie, the first one encountered is returned.

getmode <- function(x) {

uniqv <- unique(x)

uniqv[which.max(tabulate(match(x, uniqv)))]

}

v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

Mode <- getmode(v)

Mode

Output

[1] 2

4. Median

Syntax

Med <- median(x)

Med

Output

[1] 13.5

## Page 145

9.2 NORMAL DISTRIBUTION

R has four inbuilt functions to generate normal distribution. They are

described below.

1. Density function: dnorm(x, mean, sd). This function gives the height of the probability density at each point for a given mean and standard deviation.

2. Cumulative probability: pnorm(x, mean, sd). This function gives the probability that a normally distributed random variable is less than or equal to a given value. It is also called the "Cumulative Distribution Function".

3. Inverse function: qnorm(p, mean, sd). This function takes a probability value and gives the number whose cumulative probability matches it.

4. Random number: rnorm(n, mean, sd). This function is used to generate random numbers whose distribution is normal. It takes the sample size as input and generates that many random numbers.

The parameters used in the above functions are described below:

x: is a vector of numbers.

p: is a vector of probabilities.

n: is the number of observations (sample size).

mean: is the mean of the distribution. Its default value is zero.

sd: is the standard deviation. Its default value is 1.
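The four functions can be tried directly at the R prompt. The values below are illustrative calls on the standard normal distribution; the seed and the mean and sd passed to `rnorm()` are arbitrary choices.

```r
dnorm(0)                 # density of N(0, 1) at 0: 1 / sqrt(2 * pi), about 0.3989
pnorm(1.96)              # P(Z <= 1.96), about 0.975
qnorm(0.975)             # 97.5% quantile, about 1.96
set.seed(42)             # fix the seed so the draw is reproducible
r <- rnorm(5, mean = 50, sd = 10)   # five random values from N(50, 10^2)
length(r)                # 5
```

Note that `pnorm()` and `qnorm()` are inverses of each other: `pnorm(qnorm(p))` returns `p`.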

9.3 BINOMIAL DISTRIBUTION

R has four in -built functions to generate binomial distribution. They are

described below.

## Page 146

1. Probability: dbinom(x, size, prob). This function gives the probability (mass) of the distribution at each point.

2. Cumulative probability: pbinom(x, size, prob). This function gives the cumulative probability of an event. It is a single value representing the probability.

3. Inverse function: qbinom(p, size, prob). This function takes a probability value and gives the number whose cumulative probability matches it.

4. Random number: rbinom(n, size, prob). This function generates the required number of random values from a binomial distribution with the given number of trials and probability of success.

The parameters used are described below:

x: is a vector of numbers.

p: is a vector of probabilities.

n: is the number of observations.

size: is the number of trials.

prob: is the probability of success of each trial.
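A few illustrative calls show how the four binomial functions relate. The choice of 5 trials with success probability 0.5, and the seed, are arbitrary.

```r
dbinom(3, size = 5, prob = 0.5)    # P(X = 3) = choose(5, 3) / 2^5 = 0.3125
pbinom(3, size = 5, prob = 0.5)    # P(X <= 3) = 26 / 32 = 0.8125
qbinom(0.9, size = 5, prob = 0.5)  # smallest x with P(X <= x) >= 0.9, here 4
set.seed(42)                       # fix the seed so the draw is reproducible
b <- rbinom(10, size = 5, prob = 0.5)  # ten random counts between 0 and 5
```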

9.4 FREQUENCY DISTRIBUTION IN R

The table() function in R performs categorical tabulation of data, listing each value of the variable with its frequency. The table() function is also helpful in creating frequency tables with conditions and cross-tabulations.

Syntax

x <- c(1,2,3,2,4,2,5,4,6,7,8,9)

freq <- data.frame(table(x))

Output

> freq

  x Freq

1 1 1

2 2 3

3 3 1

4 4 2

5 5 1

6 6 1

7 7 1

8 8 1

9 9 1

## Page 147

9.5 DATA IMPORT AND EXPORT

i. Data importing:

Sample data is frequently stored in Excel format and needs to be imported into R before use. For this, we can use the function read.xls from the gdata package. It reads from an Excel spreadsheet and returns a data frame.

> library(gdata) # load gdata package

> help(read.xls) # documentation

> mydata = read.xls("mydata.xls") # read from the first sheet

ii. Data export:

There are numerous methods for exporting R objects into other formats.

A tab-delimited text file

write.table(mydata, "c:/mydata.txt", sep="\t")

MS-Excel spreadsheet

library(xlsx)

write.xlsx(mydata, "c:/mydata.xlsx")

9.6 SUMMARY

R is a programming language and free software environment for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. It is widely used for data analysis in analytical industries. In this chapter, we studied various descriptive measures in R, and the syntax for probabilities, quantiles and random number generation for the normal and binomial distributions.

9.7 EXERCISE

1. Write down the R -command for the arithmetic mean and compute for

the given data.

1, 4, 7, 12, 19, 15, 21 and 20.

2. Generate 10 random numbers using R -command from a normal

distribution with mean zero and standard deviation one.

3. Calculate cumulative distribution function at point zero for normal

distribution with mean zero and standard deviation one.

4. Compute probability density function at 1, 2, 3, 4, 5 for normal

distribution with mean zero and standard deviation one.

5. Compute quantiles at 0.1, 0.2, 0.3, 0.4, 0.5 for binomial distribution

with n=5 and p=0.7.

6. Generate 10 random numbers using R -command from a binomial

distribution with n=5 and p=0.5.

## Page 148

7. Explain frequency distribution in R.

9.8 REFERENCES

Gupta S. C. and Kapoor V. K., 2011, Fundamentals of Mathematical

Statistics, 11th Ed, Sultan Chand & Sons.

R.B. Patil, H.J. Dand and R. Bhavsar, 2017, A Practical Approach

using R, 1st ed, SPD.

https://www.tutorialspoint.com/r/r_normal_distribution.htm

https://cran.r-project.org/bin/windows/base/

*****

## Page 149

UNIT IV

10

SMALL SAMPLING THEORY

Unit Structure

10.0 Objectives

10.1 Introduction

10.2 Student’s t distribution

10.3 Graph of t-distribution

10.4 Critical values of t

10.5 Application of t -distribution

10.6 Test of Hypothesis and Significance

10.7 Confidence Interval

10.8 t-Test for Difference of Means

10.9 Degrees of Freedom

10.10 The F -Distribution

10.11 Summary

10.12 Reference for further reading

10.13 Exercises

10.14 Solution to Exercises

10.15 Tables of t-distribution and F-distribution

10.0 OBJECTIVES

In this chapter we will study tests suitable for small samples, i.e., sample sizes less than or equal to 30, for which the tests studied in previous chapters are not applicable. We will also study the test for equality of variances.

10.1 INTRODUCTION

The entire large sample theory was based on the application of the "Normal Test". However, if the sample size $n$ is small, the distributions of the various statistics, e.g. $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ or $Z = \frac{p - P}{\sqrt{PQ/n}}$, are far from normal, and as such the 'normal test' cannot be applied if $n$ is small. In such cases exact sample tests, pioneered by W. S. Gosset (1908), who wrote under the pen name of Student, and later developed and extended by Prof. R. A. Fisher (1926), are used.

## Page 150

In the following sections we shall discuss i) the t-test and ii) the F-test.

The exact sample tests can, however, be applied to large samples, though the converse is not true. In all exact sample tests, the basic assumption is that "the population(s) from which the sample(s) is (are) drawn is (are) normal, i.e., the parent population(s) is (are) normally distributed."

10.2 STUDENT’S T DISTRIBUTION

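Let $x_i$ ($i = 1, 2, \ldots, n$) be a random sample of size $n$ from a normal population with mean $\mu$ and variance $\sigma^2$. Then Student's t is defined by the statistic:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \bar{x} = \frac{\sum x_i}{n},$$

where $\bar{x}$ is the sample mean and $s^2$ is an unbiased estimate of the population variance $\sigma^2$. It follows Student's t-distribution with $v = n - 1$ degrees of freedom, with probability density function:

$$f(t) = \frac{1}{\sqrt{v}\, B\left(\frac{1}{2}, \frac{v}{2}\right)} \cdot \frac{1}{\left(1 + \frac{t^2}{v}\right)^{\frac{v+1}{2}}}, \qquad -\infty < t < \infty$$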

10.3 GRAPH OF t-DISTRIBUTION

The probability density function of the t-distribution with $n$ degrees of freedom is:

$$f(t) = C \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}, \qquad -\infty < t < \infty$$

Since $f(t)$ is an even function, the probability curve is symmetric about the line $t = 0$. As $t$ increases, $f(t)$ decreases rapidly and tends to zero as $t \to \infty$, so that the t-axis is an asymptote to the curve. We know that

$$\mu_2 = \frac{n}{n-2},\ n > 2; \qquad \beta_2 = \frac{3(n-2)}{n-4},\ n > 4$$

Hence for $n > 2$, $\mu_2 > 1$, i.e., the variance of the t-distribution is greater than that of the standard normal distribution, and for $n > 4$, $\beta_2 > 3$, and thus the t-distribution is more flat on the top than the normal curve. In fact, for small $n$, we have

$$P(|t| \ge t_0) > P(|Z| \ge t_0), \qquad Z \sim N(0, 1)$$

i.e., the tails of the t-distribution have a greater probability (area) than the tails of the standard normal distribution. Moreover, we can check that for large $n$ the t-distribution tends to the standard normal distribution.

## Page 151

The graph of the t-distribution is given by the following diagram.

10.4 CRITICAL VALUES OF t

The critical values of t at level of significance $\alpha$ and $v$ degrees of freedom for a two-tailed test are given by the equation:

$$P\{|t| > t_v(\alpha)\} = \alpha$$

$$P\{|t| \le t_v(\alpha)\} = 1 - \alpha$$

The values $t_v(\alpha)$, for different values of $\alpha$ and $v$, are tabulated at the end of the chapter.

Since the t-distribution is symmetric about $t = 0$, we get

$$P(t > t_v(\alpha)) + P(t < -t_v(\alpha)) = \alpha \Rightarrow 2P(t > t_v(\alpha)) = \alpha$$

$$\Rightarrow P(t > t_v(\alpha)) = \frac{\alpha}{2} \quad \therefore\ P(t > t_v(2\alpha)) = \alpha$$

$t_v(2\alpha)$ (from the tables at the end of the chapter) gives the significant value of t for a single-tailed test (right tail or left tail, since the distribution is symmetrical) at level of significance $\alpha$ and $v$ degrees of freedom.

## Page 152

Hence the significant values of t at level of significance $\alpha$ for a single-tailed test can be obtained from those of the two-tailed test by looking up the values at level of significance $2\alpha$. For example,

$t_8(0.05)$ for a single-tailed test $= t_8(0.10)$ for a two-tailed test $= 1.86$

$t_{15}(0.01)$ for a single-tailed test $= t_{15}(0.02)$ for a two-tailed test $= 2.602$
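The relation between one-tailed and two-tailed critical values can be checked in R with the quantile function `qt()`. The two helper names below are hypothetical wrappers that follow this section's notation, in which $t_v(\alpha)$ is the two-tailed value.

```r
# Two-tailed critical value t_v(alpha) in the notation of this section:
# P(|t| > t_v(alpha)) = alpha, so it is the upper alpha/2 point.
t_crit_two_tailed <- function(alpha, v) qt(1 - alpha / 2, df = v)

# A one-tailed critical value at level alpha equals t_v(2 * alpha).
t_crit_one_tailed <- function(alpha, v) t_crit_two_tailed(2 * alpha, v)

round(t_crit_one_tailed(0.05, 8), 3)    # 1.86, matching the example
round(t_crit_one_tailed(0.01, 15), 3)   # 2.602, matching the example
```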

10.5 APPLICATION OF t-DISTRIBUTION

The t -distribution has a whole number of applications in Statistics, some

of which are given here below:

i) to test if the sample mean (𝑥̅) differs significantly from the

hypothetical value 𝜇 of the population mean.

ii) to test the significance of the difference between two sample means.

iii) to test the significance of an observed sample correlation coefficient and sample regression coefficient.

iv) to test the significance of observed partial correlation coefficient.

10.6 TEST OF HYPOTHESIS AND SIGNIFICANCE

Suppose we want to test:

i) whether a random sample $x_i$ ($i = 1, 2, \ldots, n$) of size $n$ has been drawn from a normal population with a specified mean, say $\mu_0$, or whether the sample mean differs significantly from the hypothetical value $\mu_0$ of the population mean,

i.e. $H_0: \mu = \mu_0$, $H_1: \mu \neq \mu_0$ or $\mu > \mu_0$ or $\mu < \mu_0$, where $H_0$ is the null hypothesis and $H_1$ is the alternative hypothesis.

ii) Calculate

$$t = \frac{\bar{x} - \mu_0}{S/\sqrt{n}}, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

iii) Degrees of freedom: df = $v = n - 1$

iv) From the table, find $t_v(\alpha)$.

v) Conclusion: Reject $H_0$ if calculated $|t|$ > tabulated t, and do not reject $H_0$ if calculated $|t|$ ≤ tabulated t.

Remark: We know the sample variance:

$$s^2 = \frac{1}{n}\sum(x_i - \bar{x})^2 \Rightarrow s^2 = \frac{1}{n}(n-1)S^2 \Rightarrow \frac{s^2}{n-1} = \frac{S^2}{n}$$

## Page 153

Hence for numerical problems, the test statistic t stated above becomes

$$t = \frac{\bar{x} - \mu_0}{S/\sqrt{n}} = \frac{\bar{x} - \mu_0}{s/\sqrt{n-1}}$$

Eg: 1) A machinist is making engine parts with an axle diameter of 0.7 inch. A random sample of 10 parts shows a mean diameter of 0.742 inch with a standard deviation of 0.04 inch. Compute the statistic you would use to test whether the work is meeting the specifications and state the conclusion.

Solution:

i) $H_0: \mu = 0.7$, $H_1: \mu \neq 0.7$

ii) $\bar{x} = 0.742$, $s = 0.04$, $n = 10$,

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n-1}} = \frac{0.742 - 0.7}{0.04/\sqrt{9}} = \frac{0.042 \times \sqrt{9}}{0.04} = 3.15$$

iii) Degrees of freedom: $v = n - 1 = 9$

iv) $t_v(\alpha) = t_9(0.05) = 2.262$ (the two-tailed 5% value, i.e. 2.5% in each tail)

v) Conclusion: $t = 3.15 > t_9(0.05) = 2.262$

Since the calculated t is greater than the tabulated t, we reject the null hypothesis $H_0$, i.e., the product is not conforming to specifications.
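The arithmetic of this example can be reproduced in R from the summary figures alone. The sketch assumes, as the Remark above does, that the quoted standard deviation uses divisor $n$, so the denominator is $s/\sqrt{n-1}$; the variable names are arbitrary.

```r
# Eg 1 recomputed from the summary figures given in the text.
xbar <- 0.742; mu0 <- 0.7; s <- 0.04; n <- 10   # s computed with divisor n
t_stat <- (xbar - mu0) / (s / sqrt(n - 1))      # 3.15
t_crit <- qt(1 - 0.05 / 2, df = n - 1)          # two-tailed 5% value, about 2.262
abs(t_stat) > t_crit                            # TRUE: reject H0
```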

Eg: 2) The mean weekly sales of soap bars in departmental stores was 146.3 bars per store. After an advertising campaign the mean weekly sales in 22 stores for a typical week increased to 153.7 and showed a standard deviation of 17.2. Was the advertising campaign successful?

Solution:

i) $H_0: \mu = 146.3$, $H_1: \mu > 146.3$

ii) $\bar{x} = 153.7$, $s = 17.2$, $n = 22$

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n-1}} = \frac{153.7 - 146.3}{17.2/\sqrt{21}} = \frac{7.4 \times \sqrt{21}}{17.2} = 1.97$$

iii) Degrees of freedom: $v = n - 1 = 21$

iv) The test is one-tailed, so the 5% critical value is read from the two-tailed table at $2\alpha$: $t_{21}(0.10) = 1.72$

v) Conclusion: $t = 1.97 > 1.72$

Since the calculated t is greater than the tabulated t, we reject the null hypothesis $H_0$, i.e., the advertising campaign was successful in promoting sales.

## Page 154

Eg: 3) A random sample of 10 boys had the following I.Q.'s: 70, 120, 110, 101, 88, 83, 95, 98, 107, 100. Do these data support the assumption of a population mean I.Q. of 100?

Solution: First we find the values of $\bar{x}$ and $S^2$.

Here $n = 10$, $\sum x = 972$, $\therefore \bar{x} = \frac{\sum x}{n} = \frac{972}{10} = 97.2$

$$S^2 = \frac{1}{n-1}\sum(x - \bar{x})^2 = \frac{1}{9}(1833.6) = 203.73 \Rightarrow S = \sqrt{203.73} = 14.273$$

i) $H_0: \mu = 100$, $H_1: \mu \neq 100$

ii) $\bar{x} = 97.2$, $S^2 = 203.73$, $n = 10$

$$t = \frac{|\bar{x} - \mu_0|}{S/\sqrt{n}} = \frac{|97.2 - 100|}{14.273/\sqrt{10}} = \frac{2.8 \times \sqrt{10}}{14.273} = 0.6203$$

iii) Degrees of freedom: $v = n - 1 = 9$

iv) $t_v(\alpha) = t_9(0.05) = 2.262$ (the two-tailed 5% value)

v) Conclusion: $t = 0.6203 < t_9(0.05) = 2.262$

Since the calculated t is less than the tabulated t, we do not reject the null hypothesis $H_0$, i.e., the data are consistent with the assumption of a mean I.Q. of 100 in the population.
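With the raw observations available, base R's `t.test()` performs the same test in one call. It uses the divisor $n-1$ for the variance, which matches $S^2$ in this example; the variable name `iq` is arbitrary.

```r
iq  <- c(70, 120, 110, 101, 88, 83, 95, 98, 107, 100)
res <- t.test(iq, mu = 100)   # two-sided test by default
res$statistic                 # t about -0.62 (the text reports |t| = 0.6203)
res$parameter                 # df = 9
res$p.value > 0.05            # TRUE: do not reject H0 at the 5% level
```

The sign of t is negative here because the sample mean (97.2) lies below the hypothesised mean; the worked solution reports the absolute value.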

10.7 CONFIDENCE INTERVAL

As was done with the normal distribution in the earlier chapter, we can define 95%, 99% or other confidence intervals by using the table given at the end of this chapter. We can estimate the population mean $\mu$ within specified limits of confidence. In general, the confidence limits are given by the formula

$$\bar{x} \pm t_v(\alpha)\frac{S}{\sqrt{n}}$$

Specifically, the 95% confidence interval is given by

$$\left(\bar{x} - t_v(0.05)\frac{S}{\sqrt{n}},\ \bar{x} + t_v(0.05)\frac{S}{\sqrt{n}}\right)$$

and the 99% confidence interval is given by

## Page 155

$$\left(\bar{x} - t_v(0.01)\frac{S}{\sqrt{n}},\ \bar{x} + t_v(0.01)\frac{S}{\sqrt{n}}\right)$$

Eg: 1) A random sample of 16 values from a normal population showed a mean of 41.5 inches and the sum of squares of deviations from this mean = 135 sq. inches. Obtain the 95% and 99% confidence limits for the population mean.

Solution: $n = 16$, $\bar{x} = 41.5$, $\sum(x - \bar{x})^2 = 135$

$$\therefore S^2 = \frac{1}{n-1}\sum(x - \bar{x})^2 = \frac{1}{15}(135) = 9 \Rightarrow S = 3$$

From the table of the t-distribution, we get $t_{15}(0.05) = 2.131$ and $t_{15}(0.01) = 2.947$.

95% confidence limits for the population mean are given by

$$\bar{x} \pm t_{15}(0.05)\frac{S}{\sqrt{n}} = 41.5 \pm 2.131 \times \frac{3}{\sqrt{16}} = 41.5 \pm 2.131 \times 0.75$$

$$\Rightarrow 39.902 < \mu < 43.098$$

99% confidence limits for the population mean are given by

$$\bar{x} \pm t_{15}(0.01)\frac{S}{\sqrt{n}} = 41.5 \pm 2.947 \times \frac{3}{\sqrt{16}} = 41.5 \pm 2.947 \times 0.75$$

$$\Rightarrow 39.29 < \mu < 43.71$$
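The same limits can be obtained in R, with `qt()` replacing the table lookup. The helper `ci` is a hypothetical convenience function; because `qt()` returns unrounded critical values, the results may differ from the hand calculation in the third decimal place.

```r
n <- 16; xbar <- 41.5; ss <- 135       # ss = sum of squared deviations
S <- sqrt(ss / (n - 1))                # 3

# ci is a hypothetical helper returning the limits for a confidence level
ci <- function(level) {
  alpha <- 1 - level
  half  <- qt(1 - alpha / 2, df = n - 1) * S / sqrt(n)  # half-width
  c(lower = xbar - half, upper = xbar + half)
}

ci(0.95)   # close to (39.90, 43.10)
ci(0.99)   # close to (39.29, 43.71)
```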

10.8 t-TEST FOR DIFFERENCE OF MEANS

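Suppose we want to test whether two independent samples $x_i$ ($i = 1, 2, 3, \ldots, n_1$) and $y_j$ ($j = 1, 2, 3, \ldots, n_2$) of sizes $n_1$ and $n_2$ have been drawn from two normal populations with means $\mu_x$ ($\mu_1$) and $\mu_y$ ($\mu_2$) respectively.

Under the null hypothesis ($H_0$) that the samples have been drawn from normal populations with means $\mu_x$ ($\mu_1$) and $\mu_y$ ($\mu_2$), and under the assumption that the population variances are equal, i.e. $\sigma_x^2 = \sigma_y^2 = \sigma^2$, the statistic

$$t = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{\bar{x} - \bar{y} - d}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad d = \mu_x - \mu_y,$$

where $\bar{x} = \frac{\sum x_i}{n_1}$, $\bar{y} = \frac{\sum y_j}{n_2}$ and $S^2 = \frac{1}{n_1 + n_2 - 2}\left[\sum(x_i - \bar{x})^2 + \sum(y_j - \bar{y})^2\right]$ is an unbiased estimate of the common population variance $\sigma^2$, follows Student's t-distribution with $v = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$ degrees of freedom.

## Page 156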

Paired t-test for difference of means:

Let us now consider the case when (i) the sample sizes are equal, i.e. $n_1 = n_2 = n$, and (ii) the two samples are not independent but the sample observations are paired together, i.e., the pair of observations $(x_i, y_i)$, $1 \le i \le n$, corresponds to the same $i$-th sample unit. The problem is to test whether the sample means differ significantly or not.

For example, suppose we want to test the efficacy of a particular drug, say, for inducing sleep. Let $x_i$ and $y_i$ ($i = 1, 2, \ldots, n$) be the readings, in hours of sleep, on the $i$-th individual, before and after the drug is given respectively. Here, instead of applying the difference of means test discussed above in the same section, we apply the paired t-test given below.

Here we consider the increments $d_i = x_i - y_i$, $1 \le i \le n$.

Under the null hypothesis $H_0$ that the increments are due to fluctuations of sampling, i.e., the drug is not responsible for these increments, the statistic

$$t = \frac{\bar{d}}{S/\sqrt{n}} = \frac{\bar{d} \times \sqrt{n}}{S},$$

where

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i \quad \text{and} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2 = \frac{1}{n-1}\left\{\sum_{i=1}^{n} d_i^2 - \frac{\left(\sum_{i=1}^{n} d_i\right)^2}{n}\right\},$$

follows Student's t-distribution with $(n - 1)$ degrees of freedom.

Eg: 1) For a random sample of 10 pigs fed on diet A, the increases in weight in pounds in a certain period were: 10, 6, 16, 17, 13, 12, 8, 14, 15, 9. For another sample of 12 pigs, fed on diet B, the increases in the same period were: 7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17. Test whether diets A and B differ significantly as regards their effect on increase in weight.

Solution:

i) Null hypothesis, $H_0: \mu_x = \mu_y$, i.e., there is no significant difference between the mean increase in weight due to diets A and B.

Alternative hypothesis, $H_1: \mu_x \neq \mu_y$ (two-tailed)

ii)

| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ | $y_j$ | $y_j - \bar{y}$ | $(y_j - \bar{y})^2$ |
|---|---|---|---|---|---|
| 10 | -2 | 4 | 7 | -8 | 64 |
| 6 | -6 | 36 | 13 | -2 | 4 |
| 16 | 4 | 16 | 22 | 7 | 49 |
| 17 | 5 | 25 | 15 | 0 | 0 |
| 13 | 1 | 1 | 12 | -3 | 9 |
| 12 | 0 | 0 | 14 | -1 | 1 |
| 8 | -4 | 16 | 18 | 3 | 9 |
| 14 | 2 | 4 | 8 | -7 | 49 |
| 15 | 3 | 9 | 21 | 6 | 36 |
| 9 | -3 | 9 | 23 | 8 | 64 |
| | | | 10 | -5 | 25 |
| | | | 17 | 2 | 4 |
| Total 120 | 0 | 120 | 180 | 0 | 314 |

## Page 157

$$\bar{x} = \frac{\sum x}{n_1} = \frac{120}{10} = 12, \qquad \bar{y} = \frac{\sum y}{n_2} = \frac{180}{12} = 15$$

$$S^2 = \frac{1}{n_1 + n_2 - 2}\left[\sum(x - \bar{x})^2 + \sum(y - \bar{y})^2\right] = \frac{1}{20}[120 + 314]$$

$$S^2 = 21.7 \Rightarrow S = \sqrt{21.7} = 4.6583$$

$$t = \frac{\bar{x} - \bar{y} - d}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{12 - 15 - 0}{4.6583\sqrt{\frac{1}{10} + \frac{1}{12}}} = \frac{-3 \times \sqrt{120}}{4.6583 \times \sqrt{22}} = \frac{-32.8634}{21.8494}$$

$$\Rightarrow t = -1.5041$$

iii) df = $v = 10 + 12 - 2 = 20$

iv) $t_{20}(0.05) = 2.086$

v) Conclusion: $|t| = 1.5041 < t_{20}(0.05) = 2.086$

Since the calculated value of t is less than the tabulated value, the null hypothesis $H_0$ is accepted at the 5% level of significance and we may conclude that the two diets do not differ significantly as regards their effect on increase in weight.
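The pooled-variance test in this example corresponds to `t.test()` with `var.equal = TRUE` in base R (the default, `var.equal = FALSE`, would run Welch's test instead). The variable names are arbitrary.

```r
diet_a <- c(10, 6, 16, 17, 13, 12, 8, 14, 15, 9)
diet_b <- c(7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17)

res <- t.test(diet_a, diet_b, var.equal = TRUE)  # pooled-variance t-test
res$statistic   # t about -1.504
res$parameter   # df = 20
res$p.value     # greater than 0.05, so H0 is not rejected
```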

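Eg: 2) The yields of two types, 'Type A' and 'Type B', of grains in pounds per acre in 6 replications are given below. What comments would you make on the difference in the mean yield?

| Replication | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Yield of Type A | 20.5 | 24.6 | 23.06 | 29.98 | 30.37 | 23.83 |
| Yield of Type B | 24.86 | 26.39 | 28.19 | 30.75 | 29.97 | 22.04 |

Solution: i) $H_0: \mu_x = \mu_y$, $H_1: \mu_x \neq \mu_y$

ii) $t = \frac{\bar{d}}{S/\sqrt{n}}$, $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$, $d_i = x_i - y_i$, $v = n - 1$

$$S^2 = \frac{1}{n-1}\left\{\sum d^2 - \frac{(\sum d)^2}{n}\right\}$$

## Page 158

| Replication | Type A | Type B | d | d² |
|---|---|---|---|---|
| 1 | 20.5 | 24.86 | -4.36 | 19.0096 |
| 2 | 24.6 | 26.39 | -1.79 | 3.2041 |
| 3 | 23.06 | 28.19 | -5.13 | 26.3169 |
| 4 | 29.98 | 30.75 | -0.77 | 0.5929 |
| 5 | 30.37 | 29.97 | 0.4 | 0.1600 |
| 6 | 23.83 | 22.04 | 1.79 | 3.2041 |
| Total | | | -9.86 | 52.4876 |

iii)

$$S^2 = \frac{1}{5}\left[52.4876 - \frac{(-9.86)^2}{6}\right] = \frac{1}{5}\left[52.4876 - \frac{97.2196}{6}\right] = \frac{36.2843}{5} = 7.2569$$

$$\therefore S = \sqrt{7.2569} = 2.6939$$

$$\bar{d} = \frac{-9.86}{6} = -1.6433, \qquad t = \frac{\bar{d}}{S/\sqrt{n}} = \frac{-1.6433 \times \sqrt{6}}{2.6939} = -1.4942$$

iv) $t_5(0.05) = 2.571$

v) Conclusion: $|t| = 1.4942 < t_5(0.05) = 2.571$

Since the calculated value of t is less than the tabulated value of t, we accept $H_0$ at the 5% level of significance, i.e., there is no significant difference between the mean yields of the two types of grain.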
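The paired test can be reproduced with `t.test(..., paired = TRUE)` in base R. The yields entered below are the ones used in the working table of the solution (23.06 for replication 3 of Type A and 29.97 for replication 5 of Type B), since those are the values consistent with the listed differences; the variable names are arbitrary.

```r
type_a <- c(20.50, 24.60, 23.06, 29.98, 30.37, 23.83)
type_b <- c(24.86, 26.39, 28.19, 30.75, 29.97, 22.04)

res <- t.test(type_a, type_b, paired = TRUE)  # paired t-test on d = A - B
res$statistic   # t about -1.494
res$parameter   # df = 5
sum(type_a - type_b)   # sum of differences, -9.86
```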

10.9 DEGREES OF FREEDOM

In order to discuss the statistic t as discussed above, it is necessary to use

observations obtained from a sample as well as certain population

parameters. If these parameters are unknown, they must be estimated from

the sample.

The number of degrees of freedom of a statistic, generally denoted by $v$, is defined as the number $N$ of independent observations in the sample (i.e. the sample size) minus the number $k$ of population parameters which must be estimated from sample observations. In symbols,

$v = N - k$.

In the case of the statistic t, the number of independent observations in the sample is $N$, from which we compute $\bar{x}$ and $S$. However, since we must estimate $\mu$, $k = 1$ and $v = N - 1$.

10.10 THE F-DISTRIBUTION

If X and Y are two independent chi-square variates with $v_1$ and $v_2$ degrees of freedom respectively, then the F-statistic is defined by

## Page 159

$$F = \frac{X/v_1}{Y/v_2}$$

In other words, F is defined as the ratio of two independent chi-square variates, each divided by its respective degrees of freedom, and it follows Snedecor's F-distribution with $(v_1, v_2)$ degrees of freedom, with probability density function given by:

$$f(F) = \frac{\left(\frac{v_1}{v_2}\right)^{\frac{v_1}{2}}}{B\left(\frac{v_1}{2}, \frac{v_2}{2}\right)} \cdot \frac{F^{\frac{v_1}{2} - 1}}{\left(1 + \frac{v_1}{v_2}F\right)^{\frac{v_1 + v_2}{2}}}$$

Remark: The sampling distribution of the F-statistic does not involve any population parameters and depends only on the degrees of freedom $v_1$ and $v_2$.

F-test for equality of two population variances

Suppose we want to test (i) whether two independent samples $x_i$ ($i = 1, 2, \ldots, n_1$) and $y_j$ ($j = 1, 2, \ldots, n_2$) have been drawn from normal populations with the same variance $\sigma^2$, or (ii) whether two independent estimates of the population variance are homogeneous or not.

Under the null hypothesis $H_0: \sigma_x^2 = \sigma_y^2 = \sigma^2$, i.e., the population variances are equal, or the two independent estimates of the population variance are homogeneous, the statistic F is given by

$$F = \frac{S_x^2}{S_y^2}$$

where

$$S_x^2 = \frac{1}{n_1 - 1}\sum(x_i - \bar{x})^2, \qquad S_y^2 = \frac{1}{n_2 - 1}\sum(y_j - \bar{y})^2$$

are unbiased estimates of the common population variance $\sigma^2$ obtained from two independent samples, and it follows Snedecor's F-distribution with $(v_1, v_2) = (n_1 - 1, n_2 - 1)$ degrees of freedom.

The shaded region in the diagram indicates the acceptance region $(1 - \alpha)$ and the unshaded region indicates the rejection region $\alpha$.

## Page 160

Eg: 1) In one sample of 8 observations, the sum of the squares of deviations of the sample values from the sample mean was 84.4, and in the other sample of 10 observations it was 102.6. Test whether this difference is significant at 5% LOS, given that the 5% point of F for $v_1 = 7$ and $v_2 = 9$ degrees of freedom is 3.29.

Solution: $H_0: \sigma_x^2 = \sigma_y^2$, i.e., the estimates of variance given by the samples are homogeneous,

$H_1: \sigma_x^2 \neq \sigma_y^2$

$n_1 = 8$, $n_2 = 10$, $\sum(x - \bar{x})^2 = 84.4$, $\sum(y - \bar{y})^2 = 102.6$

$$S_x^2 = \frac{1}{n_1 - 1}\sum(x - \bar{x})^2 = \frac{1}{7} \times 84.4 = 12.0571$$

$$S_y^2 = \frac{1}{n_2 - 1}\sum(y - \bar{y})^2 = \frac{1}{9} \times 102.6 = 11.4$$

$$F = \frac{S_x^2}{S_y^2} = \frac{12.0571}{11.4} = 1.0576$$

Tabulated $F_{0.05}(7, 9) = 3.29$

$F = 1.0576 < F_{0.05}(7, 9) = 3.29$

$\therefore$ Accept $H_0$ at 5% LOS.

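Eg: 2) Two random samples gave the following results:

| Sample no. | Size | Sum of squares of deviations from the mean |
|---|---|---|
| 1 | 10 | 90 |
| 2 | 12 | 108 |

Test at 5% LOS whether there is a difference in variance.

{Given $F_{0.05}(9, 11) = 2.9$, $F_{0.05}(11, 9) = 3.1$}

Solution: $H_0: \sigma_x^2 = \sigma_y^2$, $H_1: \sigma_x^2 \neq \sigma_y^2$

$n_1 = 10$, $n_2 = 12$, $\sum(x - \bar{x})^2 = 90$, $\sum(y - \bar{y})^2 = 108$

$$S_x^2 = \frac{1}{n_1 - 1}\sum(x - \bar{x})^2 = \frac{1}{9} \times 90 = 10$$

$$S_y^2 = \frac{1}{n_2 - 1}\sum(y - \bar{y})^2 = \frac{1}{11} \times 108 = 9.8182$$

$$F = \frac{S_x^2}{S_y^2} = \frac{10}{9.8182} = 1.0185$$

$F = 1.0185 < F_{0.05}(9, 11) = 2.9$

$\therefore$ Accept $H_0$.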
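The F ratio and its critical value can be computed in R from the sums of squares; `qf()` gives the tabulated point. The sketch below uses the figures of Eg: 1 (84.4 with 8 observations and 102.6 with 10 observations); the variable names are arbitrary.

```r
ss_x <- 84.4;  n1 <- 8     # sum of squared deviations and size, first sample
ss_y <- 102.6; n2 <- 10    # sum of squared deviations and size, second sample

F_stat <- (ss_x / (n1 - 1)) / (ss_y / (n2 - 1))   # about 1.0576
F_crit <- qf(0.95, df1 = n1 - 1, df2 = n2 - 1)    # 5% point, about 3.29
F_stat < F_crit                                   # TRUE: do not reject H0
```

When the raw observations are available rather than the sums of squares, base R's `var.test(x, y)` computes the same ratio directly.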

## Page 161

10.11 SUMMARY

In this chapter we learnt how to apply the t-test for sample sizes less than or equal to 30. We also learnt to test whether the sample mean ($\bar{x}$) differs significantly from the hypothetical value $\mu_0$ of the population mean, and to test the significance of the difference between two sample means. We also learnt about the confidence interval and about the F-test, which tests whether two independent estimates of the population variance are homogeneous or not.

10.13 EXERCISES

1) A researcher is interested in determing whether or not review sessions

affect exam performance. The independent variable, a review session,

is administered to a sample of students (n=9) in an attempt to

determine if this has an effect on the dependent variable, exam

performance. Based on the information gathered in previous

semesters, the researcher knows that the population mean for a given

exam is 24. The sample mean is 25. with a S.D of 4 , LOS = 5%.

2) You conduct a survey of a sample of 25 members of this year’s

graduating marketing students and find that average GPA is 3.2. The

standard deviation of the sample is 4. Over the last year the average

GPA has been 3.0. Is the GPA of this year’s students significantly

different from long run average?

3) The heights of 10 males of given locality are found to be 70, 67, 62,

68, 61, 68, 70, 64, 64, 66 inches. Is it reasonable to believe that the

average height is greater than 64 inches?

4) A random sample of 16 values from a normal population showed a

mean of 41.5 inches and the sum of squares of deviations from this

mean = 135 sq. inches. Show that the assumptions of mean 43.5

inches for the population is not reasonable. Obtain the 95 % and 99 %

confidence limits.

5) Below are given the gain in weights (in kgs) of pigs fed on two Diets

A and B

Test if the two diets differ significantly as regards to their effect on

increase in weight.

6) Samples of two types of electric light bulbs were tested for length of

life and following data were obtained:

| | Type I | Type II |
|---|---|---|
| Sample size | 8 | 7 |
| Sample mean | 1234 | 1036 |
| Sample S.D. | 36 | 40 |

Is the difference in the means sufficient to warrant that Type I is

superior to Type II regarding length of life?

7) Two laboratories carry out independent estimates of a particular chemical in a medicine produced by a certain firm. A sample is taken from each batch, halved, and the separate halves sent to the two laboratories. The following data are obtained.

| Quantity | Value |
|---|---|
| No. of samples | 10 |
| Mean value of the differences of estimates | 0.6 |
| Sum of the squares of the differences from their mean | 20 |

Is the difference significant at 5% LOS?

8) A certain stimulus administered to each of the following 12 patients resulted in the following increase of blood pressure: 5, 2, 8, -1, 0, -2, 1, 5, 0, 4 and 6. Can it be concluded that the stimulus will, in general, be accompanied by an increase in blood pressure?

9) Two independent samples of 8 and 7 items respectively had the

following values of the variables:

Sample I: 9, 11, 13, 11, 15, 9, 12, 14

Sample II: 10, 12, 10, 14, 9, 8, 10

Do the estimates of population variance differ significantly?

10.14 SOLUTION TO EXERCISES

10.15 TABLES OF t-DISTRIBUTION AND F-DISTRIBUTION

t Distribution: Critical Values of t

Significance level

| Two-tailed test: | 10% | 5% | 2% | 1% | 0.2% | 0.1% |
|---|---|---|---|---|---|---|
| One-tailed test: | 5% | 2.5% | 1% | 0.5% | 0.1% | 0.05% |
| Degrees of freedom | | | | | | |
| 1 | 6.314 | 12.706 | 31.821 | 63.657 | 318.309 | 636.619 |
| 2 | 2.920 | 4.303 | 6.965 | 9.925 | 22.327 | 31.599 |
| 3 | 2.353 | 3.182 | 4.541 | 5.841 | 10.215 | 12.924 |
| 4 | 2.132 | 2.776 | 3.747 | 4.604 | 7.173 | 8.610 |
| 5 | 2.015 | 2.571 | 3.365 | 4.032 | 5.893 | 6.869 |
| 6 | 1.943 | 2.447 | 3.143 | 3.707 | 5.208 | 5.959 |
| 7 | 1.894 | 2.365 | 2.998 | 3.499 | 4.785 | 5.408 |
| 8 | 1.860 | 2.306 | 2.896 | 3.355 | 4.501 | 5.041 |
| 9 | 1.833 | 2.262 | 2.821 | 3.250 | 4.297 | 4.781 |
| 10 | 1.812 | 2.228 | 2.764 | 3.169 | 4.144 | 4.587 |
| 11 | 1.796 | 2.201 | 2.718 | 3.106 | 4.025 | 4.437 |
| 12 | 1.782 | 2.179 | 2.681 | 3.055 | 3.930 | 4.318 |
| 13 | 1.771 | 2.160 | 2.650 | 3.012 | 3.852 | 4.221 |
| 14 | 1.761 | 2.145 | 2.624 | 2.977 | 3.787 | 4.140 |
| 15 | 1.753 | 2.131 | 2.602 | 2.947 | 3.733 | 4.073 |
| 16 | 1.746 | 2.120 | 2.583 | 2.921 | 3.686 | 4.015 |
| 17 | 1.740 | 2.110 | 2.567 | 2.898 | 3.646 | 3.965 |
| 18 | 1.734 | 2.101 | 2.552 | 2.878 | 3.610 | 3.922 |
| 19 | 1.729 | 2.093 | 2.539 | 2.861 | 3.579 | 3.883 |
| 20 | 1.725 | 2.086 | 2.528 | 2.845 | 3.552 | 3.850 |
| 21 | 1.721 | 2.080 | 2.518 | 2.831 | 3.527 | 3.819 |
| 22 | 1.717 | 2.074 | 2.508 | 2.819 | 3.505 | 3.792 |
| 23 | 1.714 | 2.069 | 2.500 | 2.807 | 3.485 | 3.768 |
| 24 | 1.711 | 2.064 | 2.492 | 2.797 | 3.467 | 3.745 |
| 25 | 1.708 | 2.060 | 2.485 | 2.787 | 3.450 | 3.725 |
| 26 | 1.706 | 2.056 | 2.479 | 2.779 | 3.435 | 3.707 |
| 27 | 1.703 | 2.052 | 2.473 | 2.771 | 3.421 | 3.690 |
| 28 | 1.701 | 2.048 | 2.467 | 2.763 | 3.408 | 3.674 |
| 29 | 1.699 | 2.045 | 2.462 | 2.756 | 3.396 | 3.659 |
| 30 | 1.697 | 2.042 | 2.457 | 2.750 | 3.385 | 3.646 |
| 32 | 1.694 | 2.037 | 2.449 | 2.738 | 3.365 | 3.622 |
| 34 | 1.691 | 2.032 | 2.441 | 2.728 | 3.348 | 3.601 |
| 36 | 1.688 | 2.028 | 2.434 | 2.719 | 3.333 | 3.582 |
| 38 | 1.686 | 2.024 | 2.429 | 2.712 | 3.319 | 3.566 |
| 40 | 1.684 | 2.021 | 2.423 | 2.704 | 3.307 | 3.551 |
| 42 | 1.682 | 2.018 | 2.418 | 2.698 | 3.296 | 3.538 |
| 44 | 1.680 | 2.015 | 2.414 | 2.692 | 3.286 | 3.526 |
| 46 | 1.679 | 2.013 | 2.410 | 2.687 | 3.277 | 3.515 |
| 48 | 1.677 | 2.011 | 2.407 | 2.682 | 3.269 | 3.505 |
| 50 | 1.676 | 2.009 | 2.403 | 2.678 | 3.261 | 3.496 |
| 60 | 1.671 | 2.000 | 2.390 | 2.660 | 3.232 | 3.460 |
| 70 | 1.667 | 1.994 | 2.381 | 2.648 | 3.211 | 3.435 |
| 80 | 1.664 | 1.990 | 2.374 | 2.639 | 3.195 | 3.416 |
| 90 | 1.662 | 1.987 | 2.368 | 2.632 | 3.183 | 3.402 |
| 100 | 1.660 | 1.984 | 2.364 | 2.626 | 3.174 | 3.390 |
| 120 | 1.658 | 1.980 | 2.358 | 2.617 | 3.160 | 3.373 |
| 150 | 1.655 | 1.976 | 2.351 | 2.609 | 3.145 | 3.357 |
| 200 | 1.653 | 1.972 | 2.345 | 2.601 | 3.131 | 3.340 |
| 300 | 1.650 | 1.968 | 2.339 | 2.592 | 3.118 | 3.323 |
| 400 | 1.649 | 1.966 | 2.336 | 2.588 | 3.111 | 3.315 |
| 500 | 1.648 | 1.965 | 2.334 | 2.586 | 3.107 | 3.310 |
| 600 | 1.647 | 1.964 | 2.333 | 2.584 | 3.104 | 3.307 |
| ∞ | 1.645 | 1.960 | 2.326 | 2.576 | 3.090 | 3.291 |

F Distribution: Critical Values of F (5% significance level)

Columns: $v_1$ = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20; rows: $v_2$.

F Distribution: Critical Values of F (5% significance level), continued

| $v_2$ \ $v_1$ | 25 | 30 | 35 | 40 | 50 | 60 | 75 | 100 | 150 | 200 |
|---|---|---|---|---|---|---|---|---|---|---|
| 750 | 1.52 | 1.47 | 1.44 | 1.41 | 1.37 | 1.34 | 1.30 | 1.26 | 1.22 | 1.20 |
| 1000 | 1.52 | 1.47 | 1.43 | 1.41 | 1.36 | 1.33 | 1.30 | 1.26 | 1.22 | 1.19 |

F Distribution: Critical Values of F (1% significance level)

Columns: $v_1$ = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20; rows: $v_2$.

F Distribution: Critical Values of F (1% significance level), continued

Columns: $v_1$ = 25, 30, 35, 40, 50, 60, 75, 100, 150, 200; rows: $v_2$.

10.12 REFERENCE FOR FURTHER READING

Following books are recommended for further reading:

Statistics by Murray R. Spiegel and Larry J. Stephens, McGraw-Hill International Publisher, 4th edition

Fundamentals of Mathematical Statistics by S. C. Gupta and V. K. Kapoor, Sultan Chand and Sons publisher, 11th edition

*****


## Page 169

11

THE CHI-SQUARE TEST

Unit Structure

11.0 Objectives

11.1 Introduction

11.2 Properties of Chi Square variate

11.3 The Chi Square test for Goodness of fit

11.3.1 Decision Criterion

11.4 Test for Independence of Attributes

11.5 Yates' Correction for continuity

11.6 Test in r × c Contingency Table

11.7 Coefficient of Contingency

11.8 Correlation of Attributes

11.9 Additive Property of Chi Square variate

11.10 Summary

11.11 Exercises

11.12 Solution to Exercises

11.13 Table of Chi-Square distribution

11.14 Reference for further reading

11.0 OBJECTIVES

The chi-square test is a non-parametric test that compares two or more variables from randomly selected data. It helps find the relationship between two or more variables. The chi-square distribution is a theoretical or mathematical distribution which has wide applicability in statistical work. The term ‘chi square’ (pronounced with a hard ‘ch’) is used because the Greek letter χ is used to define this distribution. It will be seen that the elements on which this distribution is based are squared, so that the symbol $\chi^2$ is used to denote the distribution.

11.1 INTRODUCTION

We know that if the probability distribution of a discrete random variable X is known, we can find the probability distribution of the random variable $Y = X^2$. One may ask whether the probability distribution of $Y = X^2$ can also be found when X is a continuous random variable with known probability distribution. The answer is affirmative. In particular, if X is a standard normal variate (a continuous random variable), so that its probability density function is given by

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}, \quad -\infty < x < \infty$$

then $Y = X^2$ is also a continuous random variable, whose probability density function is given by

$$g(y) = \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sqrt{y}}\, e^{-\frac{y}{2}}, \quad y > 0; \qquad g(y) = 0 \text{ otherwise}$$

Here, the distribution of Y is known as the Chi-square distribution with one degree of freedom.

More generally, if $X_1, X_2, \dots, X_n$ are n independent standard normal variates, then the distribution of the random variable $U = X_1^2 + X_2^2 + \dots + X_n^2$ is given by the probability density function

$$h(u) = \frac{1}{2^{n/2}\, \Gamma\!\left(\frac{n}{2}\right)}\, e^{-\frac{u}{2}}\, u^{\frac{n}{2} - 1}, \quad u \ge 0,\ n \in \mathbb{N}; \qquad h(u) = 0 \text{ otherwise}$$

where $\Gamma\!\left(\frac{n}{2}\right)$ is the gamma function evaluated at $\frac{n}{2}$, given by

$$\Gamma\!\left(\frac{n}{2}\right) = \int_0^\infty e^{-t}\, t^{\frac{n}{2} - 1}\, dt$$

Here, the distribution of the random variable U is called the Chi-square distribution with n degrees of freedom, and U is called a Chi-square variate. Generally, a chi-square variate is denoted by the square of the Greek letter chi, i.e. $\chi^2$. Thus, if $\chi^2$ denotes a Chi-square variate with n degrees of freedom, then its probability density function is given by

$$f(\chi^2) = \frac{1}{2^{n/2}\, \Gamma\!\left(\frac{n}{2}\right)}\, e^{-\frac{\chi^2}{2}}\, (\chi^2)^{\frac{n}{2} - 1}, \quad \chi^2 \ge 0,\ n \in \mathbb{N}; \qquad f(\chi^2) = 0 \text{ otherwise}$$

11.2 PROPERTIES OF CHI-SQUARE VARIATE WITH n DEGREES OF FREEDOM

Let $\chi^2$ denote a Chi-square variate with n degrees of freedom. Then we have the following properties:

1) The probability density function of $\chi^2$ is given by

$$f(\chi^2) = \frac{1}{2^{n/2}\, \Gamma\!\left(\frac{n}{2}\right)}\, e^{-\frac{\chi^2}{2}}\, (\chi^2)^{\frac{n}{2} - 1}, \quad \chi^2 \ge 0,\ n \in \mathbb{N}; \qquad f(\chi^2) = 0 \text{ otherwise}$$

2) The mean of $\chi^2$ is $E(\chi^2) = n$

3) The variance of $\chi^2$ is $V(\chi^2) = 2n$

4) The mode of $\chi^2$ is $n - 2$ (for $n \ge 2$)

5) The frequency curve $y = f(\chi^2)$ lies in the first quadrant and is positively skewed; its tail on the right extends to infinity. The area under the curve to the right of a point $\chi^2$, $p = \Pr[X \ge \chi^2]$, is denoted by $\alpha$, and the point itself is written $\chi^2 = \chi^2_{n,\alpha}$.

6) The total area under the Chi-square curve is 1.

7) $p(\chi^2 > c)$ = area under the curve $y = f(\chi^2)$ to the right of $\chi^2 = c$

8) For a chi-square variate with n degrees of freedom, if $p(\chi^2 > c) = \alpha$, then c is denoted by $\chi^2_{n,\alpha}$, i.e. $p\left(\chi^2 > \chi^2_{n,\alpha}\right) = \alpha$

9) $\chi^2_{n,\alpha}$ is called the $\alpha$ probability point of the chi-square distribution with n degrees of freedom.

Eg: 1) If a random variable X follows the chi-square distribution with 10 degrees of freedom, find i) $x_0$, ii) $x_1$ and iii) $\alpha$ such that $p(X > x_0) = 0.95$, $p(X \le x_1) = 0.01$ and $p(X > 18.3) = \alpha$.

Solution: n = 10 degrees of freedom

i) To find $x_0$ such that $p(X > x_0) = 0.95$: $x_0 = \chi^2_{10,0.95} = 3.9403$

ii) To find $x_1$ such that $p(X \le x_1) = 0.01$:

$p(X \le x_1) = 1 - p(X > x_1) \Rightarrow p(X > x_1) = 1 - 0.01 = 0.99$

$\Rightarrow x_1 = \chi^2_{10,0.99} = 2.5582$

iii) $p(X > 18.3) = \alpha \Rightarrow \chi^2_{10,\alpha} = 18.3 \Rightarrow \alpha = 0.05$
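The table look-ups in Eg 1 can be cross-checked numerically. The sketch below is an illustration only and is not part of the tabular method used in this chapter. It relies on the standard closed form of the upper-tail probability for an even number of degrees of freedom, $P(\chi^2_n > x) = e^{-x/2} \sum_{k=0}^{n/2-1} \frac{(x/2)^k}{k!}$, and finds critical values by bisection. The function names are hypothetical.

```python
import math


def chi2_upper_tail_even(x, df):
    """P(chi-square variate with df degrees of freedom > x), for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0  # k = 0 term of the sum
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


def chi2_critical_even(alpha, df, upper=200.0):
    """Find c with P(chi-square > c) = alpha by bisection (even df only)."""
    lo, hi = 0.0, upper
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail_even(mid, df) > alpha:
            lo = mid  # tail area still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


x0 = chi2_critical_even(0.95, 10)       # compare with 3.9403 in the table
x1 = chi2_critical_even(0.99, 10)       # compare with 2.5582 in the table
alpha = chi2_upper_tail_even(18.3, 10)  # compare with 0.05
print(round(x0, 4), round(x1, 4), round(alpha, 3))
```

For odd degrees of freedom this closed form does not apply, and the printed table should be used instead.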

Eg: 2) A random variable Y follows the chi-square distribution with S.D. 4. Find $y_0$ if $p(Y \le y_0) = 0.05$.

Solution: S.D. = 4 ⇒ variance = 16 ⇒ 2n = 16 ⇒ n = 8

$p(Y \le y_0) = 0.05$ and $p(Y \le y_0) = 1 - p(Y > y_0) \Rightarrow p(Y > y_0) = 1 - 0.05 = 0.95$

$p(Y > y_0) = 0.95 \Rightarrow y_0 = \chi^2_{8,0.95} \Rightarrow y_0 = 2.7326$

Eg: 3) If a random variable X follows the chi-square distribution with S.D. 4, find the mean and the mode.

Solution: S.D. = 4 ⇒ variance = 16, but the variance of $\chi^2$ is 2n ⇒ n = 8

Mean of $\chi^2$ = n = 8 and mode of $\chi^2$ = n − 2 = 6

11.3 THE CHI SQUARE TEST FOR GOODNESS OF FIT

When we come across some observations on a random variable, we may wish to investigate whether it can be considered to follow a certain specified probability law. A technique to test whether a given frequency distribution of a random variable follows a certain specified distribution (known as the theoretical distribution) was proposed by Karl Pearson. It involves a test statistic that can be shown to follow the Chi-square distribution under certain assumptions, and the test is known as the Chi-square test for goodness of fit.

Suppose we have an observed frequency distribution with n classes having observed frequencies $O_1, O_2, O_3, \dots, O_n$ with $\sum_{i=1}^{n} O_i = N$. (Here the classes may correspond to the discrete values of a variable, to groups of values of a variable, or even to the groups corresponding to an attribute.)

Further, suppose that according to our assumption, called the null hypothesis $H_0$, the expected frequencies are $E_1, E_2, E_3, \dots, E_n$ such that $\sum_{i=1}^{n} E_i = N$.

Then the test statistic proposed by Karl Pearson is given by

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

It can be shown that under the assumptions

1) observations are drawn independently and at random,

2) the null hypothesis $H_0$ is true, i.e. the expected frequencies are $E_1, E_2, E_3, \dots, E_n$ respectively,

3) the total number of observations N is large,

4) the observed frequencies $O_i$ are large,

5) the expected frequencies $E_i$ are large, so that the terms $\frac{O_i - E_i}{E_i}$ can be considered to be negligible,

the test statistic $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$ follows the Chi-square distribution with (n − 1) degrees of freedom.

For all practical purposes, the assumptions regarding the distribution of the test statistic $\chi^2$ can be considered reasonable if

a) the number of observations N ≥ 50

b) the expected frequencies $E_i$ ≥ 5.

When an expected frequency is less than 5, we combine two or more neighbouring classes so that the expected frequency of the combined class is not less than 5.

11.3.1 DECISION CRITERION

While performing a test for goodness of fit, we shall use the following decision criteria:

Reject $H_0$ if $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} > \chi^2_{(n-1),\alpha}$

Do not reject $H_0$, i.e. accept $H_0$, if $\chi^2 \le \chi^2_{(n-1),\alpha}$

Eg: 1) The following data represent the last digit of the number plates of 180 cars passing a certain traffic signal, observed during the last 30 minutes.

| Last digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 12 | 20 | 14 | 12 | 21 | 18 | 17 | 26 | 19 | 21 |

Can we retain at 5% level of significance the hypothesis that all the digits are equally likely to occur?

Solution: We want to test whether all digits are equally likely to occur. If all digits are equally likely, $p_i = \frac{1}{10}$ and $E_i = N p_i = 180 \times \frac{1}{10} = 18$.

$H_0$: All digits are equally likely to occur

$H_1$: not $H_0$, i.e. the logical negation of $H_0$ (the digits are not all equally likely)

LOS = 5%, i.e. $\alpha = 0.05$; N = 180 ≥ 50 and $E_i$ = 18 ≥ 5; n = 10

$\chi^2 = \sum_{i=1}^{10} \frac{(O_i - E_i)^2}{E_i}$ follows the chi-square distribution with 9 degrees of freedom.

The decision criterion is:

Reject $H_0$ if $\chi^2 > \chi^2_{9,0.05}$, where $\chi^2_{9,0.05} = 16.9190$

Do not reject $H_0$ if $\chi^2 \le \chi^2_{9,0.05}$

| Digit | Observed freq. ($O_i$) | Expected freq. ($E_i$) | $O_i - E_i$ | $(O_i - E_i)^2 / E_i$ |
|---|---|---|---|---|
| 0 | 12 | 18 | -6 | 36/18 |
| 1 | 20 | 18 | 2 | 4/18 |
| 2 | 14 | 18 | -4 | 16/18 |
| 3 | 12 | 18 | -6 | 36/18 |
| 4 | 21 | 18 | 3 | 9/18 |
| 5 | 18 | 18 | 0 | 0 |
| 6 | 17 | 18 | -1 | 1/18 |
| 7 | 26 | 18 | 8 | 64/18 |
| 8 | 19 | 18 | 1 | 1/18 |
| 9 | 21 | 18 | 3 | 9/18 |
| Total | | | | 176/18 = 9.7778 |

$\chi^2 = \sum_{i=1}^{10} \frac{(O_i - E_i)^2}{E_i} = 9.7778 < 16.9190 = \chi^2_{9,0.05}$

Do not reject $H_0$, i.e. accept $H_0$.
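The computation in Eg 1 follows a fixed recipe: form the expected frequencies, sum $(O_i - E_i)^2 / E_i$, and compare with the tabulated value. A minimal Python sketch of that recipe is given below. The helper name `pearson_chi_square` is hypothetical, and the critical value 16.9190 is taken from the table rather than computed.

```python
def pearson_chi_square(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Last-digit data of Eg 1: ten classes, each equally likely under H0
observed = [12, 20, 14, 12, 21, 18, 17, 26, 19, 21]
n_total = sum(observed)
expected = [n_total / len(observed)] * len(observed)  # 18 for every digit

chi_sq = pearson_chi_square(observed, expected)
critical = 16.9190  # tabulated chi-square value, 9 d.f., alpha = 0.05
decision = "reject H0" if chi_sq > critical else "do not reject H0"
print(round(chi_sq, 4), decision)
```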

Eg: 2) As per Mendel’s theory, a certain variety of pea can be classified by shape and color into four categories, Round and yellow, Round and green, Angular and yellow, Angular and green, which occur in the proportion 9:3:3:1. To test this, a sample of N = 128 peas was taken and the following observed frequencies were obtained:

RY – 66, RG – 28, AY – 29, AG – 5

Perform the chi-square test for goodness of fit.

Solution: N = 128, n = 4

The probabilities of occurrence and the expected frequencies are given by

| Category | $p_i$ | $E_i = N p_i$ |
|---|---|---|
| RY | 9/16 | 128 × 9/16 = 72 |
| RG | 3/16 | 128 × 3/16 = 24 |
| AY | 3/16 | 128 × 3/16 = 24 |
| AG | 1/16 | 128 × 1/16 = 8 |

$\therefore H_0$: The four categories of peas, i.e. RY, RG, AY, AG, have expected frequencies 72, 24, 24, 8 respectively.

$H_1$: not $H_0$

LOS = 5%, $\alpha = 0.05$

The decision criterion is: Reject $H_0$ if $\chi^2 > \chi^2_{3,0.05} = 7.8147$

Do not reject $H_0$ if $\chi^2 \le 7.8147$, where $\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}$

| Category of peas | Obs. freq. ($O_i$) | Exp. freq. ($E_i$) | $O_i - E_i$ | $(O_i - E_i)^2 / E_i$ |
|---|---|---|---|---|
| RY | 66 | 72 | -6 | 0.5 |
| RG | 28 | 24 | 4 | 0.6667 |
| AY | 29 | 24 | 5 | 1.0417 |
| AG | 5 | 8 | -3 | 1.125 |
| Total | | | | 3.3333 |

$\chi^2 = 3.3333 < \chi^2_{3,0.05} = 7.8147$

⇒ Accept $H_0$ at 5% LOS.
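When the hypothesis is stated as a ratio such as 9:3:3:1, the expected frequencies follow directly from the ratio and the sample size. The sketch below illustrates this for the pea data, using exact fractions so that no rounding enters the total. The function name `expected_from_ratio` is hypothetical.

```python
from fractions import Fraction


def expected_from_ratio(ratio, total):
    """Split `total` observations according to the integer ratio given."""
    parts = sum(ratio)
    return [Fraction(total * r, parts) for r in ratio]


# Pea data of Eg 2, in the order RY, RG, AY, AG
observed = [66, 28, 29, 5]
expected = expected_from_ratio([9, 3, 3, 1], sum(observed))

# Pearson's statistic, kept as an exact fraction
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([int(e) for e in expected], chi_sq, round(float(chi_sq), 4))
```

The exact value is 10/3, i.e. 3.3333 to four decimals, which is well below the tabulated 7.8147.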

Eg: 3) Four identical coins are tossed 100 times and the following results are obtained.

| No. of heads (x) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Frequency | 8 | 29 | 40 | 19 | 4 |

Is there sufficient evidence to conclude that the coins are biased at 5% LOS?

Solution: Let p denote the probability of getting a head with each of the four coins.

X: no. of heads follows the binomial distribution with n = 4 and parameter p

X ~ B(n, p), i.e. X ~ B(4, p)

$H_0$: the coins are unbiased, i.e. p = ½

$H_1$: the coins are biased, i.e. p ≠ ½

LOS = 5%, $\alpha = 0.05$

$X \sim B(4, p) \Rightarrow p(x) = \binom{4}{x} p^x q^{4-x}$

Under $H_0$, $p = q = \frac{1}{2}$, so

$p(0) = q^4 = \frac{1}{16}$, $p(1) = 4pq^3 = \frac{4}{16}$, $p(2) = 6p^2q^2 = \frac{6}{16}$, $p(3) = 4p^3q = \frac{4}{16}$, $p(4) = p^4 = \frac{1}{16}$

Decision criterion: Reject $H_0$ if $\chi^2 > \chi^2_{4,0.05} = 9.4877$

Expected frequencies: $E_i = N \cdot p(x_i)$

| x | Obs. freq. $O_i$ | Exp. freq. $E_i$ | $O_i - E_i$ | $(O_i - E_i)^2 / E_i$ |
|---|---|---|---|---|
| 0 | 8 | 100p(0) = 6.25 | 1.75 | 0.49 |
| 1 | 29 | 100p(1) = 25 | 4 | 0.64 |
| 2 | 40 | 100p(2) = 37.5 | 2.5 | 0.1667 |
| 3 | 19 | 100p(3) = 25 | -6 | 1.44 |
| 4 | 4 | 100p(4) = 6.25 | -2.25 | 0.81 |
| Total | | | | 3.5467 |

$\chi^2 = 3.5467 < \chi^2_{4,0.05} = 9.4877$

Accept $H_0$ at 5% LOS ⇒ the coins are unbiased
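The binomial expected frequencies used in Eg 3 can be generated rather than worked out term by term. The following sketch assumes p = ½ as stated in $H_0$. The function name `binomial_expected` is hypothetical.

```python
from math import comb


def binomial_expected(n_trials, p, n_obs):
    """Expected frequency of x successes, x = 0..n_trials, in n_obs repetitions."""
    q = 1 - p
    return [n_obs * comb(n_trials, x) * p**x * q ** (n_trials - x)
            for x in range(n_trials + 1)]


# Coin data of Eg 3: number of heads in 100 tosses of four coins
observed = [8, 29, 40, 19, 4]
expected = binomial_expected(4, 0.5, sum(observed))

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(expected, round(chi_sq, 4))
```

Because p is fixed by the hypothesis and not estimated from the data, the degrees of freedom stay at 5 − 1 = 4, as in the example.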

Eg: 4) The random variable X denotes the number of street accidents per week.

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Obs. freq. | 15 | 30 | 28 | 14 | 8 | 4 | 0 | 1 |
| Exp. freq. | 14 | 27 | 27 | 18 | 9 | 4 | 1 | 0 |

Test whether the random variable X follows the Poisson distribution with parameter m = 2 at 1% level of significance.

Solution: As the expected frequency is less than 5 for X = 5, 6 and 7, we combine them into one class.

| X | 0 | 1 | 2 | 3 | 4 | 5-7 |
|---|---|---|---|---|---|---|
| Obs. freq. | 15 | 30 | 28 | 14 | 8 | 5 |
| Exp. freq. | 14 | 27 | 27 | 18 | 9 | 5 |

$H_0$: X follows the Poisson distribution with parameter m = 2

$H_1$: not $H_0$

LOS = 1%, $\alpha = 0.01$

The decision criterion is: Reject $H_0$ if $\chi^2 > \chi^2_{5,0.01} = 15.086$

Do not reject $H_0$ if $\chi^2 \le \chi^2_{5,0.01}$, where $\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i}$

| i | X | Obs. freq. $O_i$ | Exp. freq. $E_i$ | $O_i - E_i$ | $(O_i - E_i)^2 / E_i$ |
|---|---|---|---|---|---|
| 1 | 0 | 15 | 14 | 1 | 0.0714 |
| 2 | 1 | 30 | 27 | 3 | 0.3333 |
| 3 | 2 | 28 | 27 | 1 | 0.0370 |
| 4 | 3 | 14 | 18 | -4 | 0.8889 |
| 5 | 4 | 8 | 9 | -1 | 0.1111 |
| 6 | 5-7 | 5 | 5 | 0 | 0 |
| Total | | | | | 1.4417 |

$\chi^2 = 1.4417 < \chi^2_{5,0.01} = 15.086$

Accept $H_0$, i.e. at 1% level of significance the hypothesis that the variable X follows the Poisson distribution with parameter m = 2 is retainable.
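The pooling step in Eg 4, where the classes X = 5, 6, 7 are merged because their expected frequencies fall below 5, can be sketched as follows. This illustration only merges classes from the upper tail, which is all the example needs, and `pool_tail` is a hypothetical helper name.

```python
def pool_tail(observed, expected, min_expected=5):
    """Merge classes from the right until the last expected frequency >= min_expected."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < min_expected:
        # fold the last class into its left-hand neighbour
        exp[-2] += exp[-1]
        obs[-2] += obs[-1]
        exp.pop()
        obs.pop()
    return obs, exp


# Accident data of Eg 4 for X = 0, 1, ..., 7
observed = [15, 30, 28, 14, 8, 4, 0, 1]
expected = [14, 27, 27, 18, 9, 4, 1, 0]

obs_pooled, exp_pooled = pool_tail(observed, expected)
chi_sq = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
dof = len(obs_pooled) - 1  # m = 2 is specified, not estimated from the data
print(obs_pooled, exp_pooled, round(chi_sq, 4), dof)
```

Without rounding the individual terms the total comes to about 1.4418; the 1.4417 in the table is the sum of the rounded entries.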

11.4 TEST FOR INDEPENDENCE OF ATTRIBUTES

In this section, we consider the presence of two attributes among units from a single population, and the interest is centered on the possible dependence or independence of the attributes. In other words, on the basis of data regarding two attributes for some units from the population, we shall investigate whether the observed data provide sufficient reason to reject the claim that the two attributes are independent of each other for the population under consideration. Such a test is called a test for independence of attributes.

To understand the mechanism of the test, we consider the following table representing data for two attributes, known as a 2 × 2 contingency table. It is called a contingency table as it represents information which can be attributed to chance, the information being about randomly selected persons.

The 2 × 2 contingency table is given below:

| | Attribute B: $B_1$ | Attribute B: $B_2$ | Total |
|---|---|---|---|
| Attribute A: $A_1$ | a | b | a+b |
| Attribute A: $A_2$ | c | d | c+d |
| Total | a+c | b+d | a+b+c+d = N |

With the help of the test statistic

$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

we can test the hypothesis $H_0$: the attributes A and B under consideration are independent, against the logical alternative $H_1$: not $H_0$, i.e. the attributes A and B are dependent, subject to the conditions N ≥ 50 and each of the observed frequencies a, b, c, d ≥ 5.

The decision criterion at level of significance $\alpha$ is given by

Reject $H_0$ if $\chi^2 > \chi^2_{1,\alpha}$; do not reject $H_0$ if $\chi^2 \le \chi^2_{1,\alpha}$

where a, b, c, d are the observed frequencies with a+b+c+d = N.

Eg: 1) The following results are obtained at the end of six months of a kind of psychotherapy given to a group of 120 patients, and also for another group of 120 patients who were not given the psychotherapy.

| | Psychotherapy given | Psychotherapy not given |
|---|---|---|
| Condition improved | 71 | 42 |
| Condition did not improve | 49 | 78 |

Can we conclude at 5% LOS that the psychotherapy is effective?

Solution: $H_0$: psychotherapy is not effective

$H_1$: psychotherapy is effective

LOS = 5%, $\alpha = 0.05$

Decision criterion: Reject $H_0$ iff $\chi^2 > \chi^2_{1,0.05} = 3.8415$

N = 240, a = 71, b = 42, c = 49, d = 78

$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)} = \frac{240(71 \times 78 - 49 \times 42)^2}{113 \cdot 120 \cdot 120 \cdot 127}$$

$$\chi^2 = \frac{240 \times 12110400}{206654400} = 14.0645$$

$\Rightarrow \chi^2 = 14.0645 > \chi^2_{1,0.05} = 3.8415$

⇒ Reject $H_0$ at 5% level of significance; we may say that the psychotherapy is effective at 5% LOS.
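The shortcut formula for a 2 × 2 table is easy to mechanize. A minimal sketch follows, checked against the psychotherapy data above. The function name `chi_square_2x2` is hypothetical, and the critical value is the tabulated one.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic N(ad - bc)^2 / [(a+b)(a+c)(b+d)(c+d)] for a 2 x 2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


# Psychotherapy data: a = 71, b = 42, c = 49, d = 78
chi_sq = chi_square_2x2(71, 42, 49, 78)
critical = 3.8415  # tabulated chi-square value, 1 d.f., alpha = 0.05
decision = "reject H0" if chi_sq > critical else "do not reject H0"
print(round(chi_sq, 4), decision)
```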

11.5 YATES' CORRECTION FOR CONTINUITY

When the cell frequencies a, b, c, d observed for the four classes corresponding to two attributes are small, we cannot use the test statistic $\chi^2$ as defined in the previous section. That is, when the assumption that a, b, c, d are all greater than or equal to 5 does not hold, the distribution of

$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

cannot be considered to be Chi-square with one degree of freedom.

In such a case, i.e. if the cell frequencies a, b, c, d are not all greater than or equal to 5, we make the following adjustment, called Yates' correction:

if ad < bc, add ½ to a and d and subtract ½ from b and c;

if ad > bc, subtract ½ from a and d and add ½ to b and c.

With this adjustment, the test statistic becomes

$$\chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}$$

The decision criterion at level of significance $\alpha$ is given by

Reject $H_0$ if $\chi^2 > \chi^2_{1,\alpha}$; do not reject $H_0$ if $\chi^2 \le \chi^2_{1,\alpha}$

Eg: 1) In an experiment on immunization of cattle from tuberculosis, the following results were obtained:

| | Affected | Unaffected |
|---|---|---|
| Inoculated | 11 | 31 |
| Not inoculated | 14 | 4 |

Examine the effect of the vaccine in controlling the incidence of the disease at 1% LOS.

Solution: $H_0$: the attributes are independent

$H_1$: the attributes are not independent

LOS = 1%, $\alpha = 0.01$

Decision criterion: Reject $H_0$ iff $\chi^2 > \chi^2_{1,0.01} = 6.6349$

N = 60, a = 11, b = 31, c = 14, d = 4

$$\chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)} = \frac{60(|11 \times 4 - 31 \times 14| - 30)^2}{42 \cdot 18 \cdot 25 \cdot 35}$$

$$\chi^2 = \frac{60(390 - 30)^2}{661500} = \frac{60 \times 360 \times 360}{661500} = 11.7551$$

$\chi^2 = 11.7551 > \chi^2_{1,0.01} = 6.6349$

⇒ reject $H_0$ at 1% LOS

We can say at 1% level of significance that inoculation and being affected by the disease are dependent.
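The corrected statistic can be sketched in the same way as the uncorrected one. The illustration below also checks, for the cattle data, that moving each cell by ½ toward independence and applying the uncorrected formula gives the same value as the closed-form corrected statistic. Both function names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected statistic N(ad - bc)^2 / [(a+b)(a+c)(b+d)(c+d)]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def chi_square_2x2_yates(a, b, c, d):
    """Statistic with Yates' correction: N(|ad - bc| - N/2)^2 / [(a+b)(a+c)(b+d)(c+d)]."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


# Cattle data: a = 11, b = 31, c = 14, d = 4, so ad < bc
corrected = chi_square_2x2_yates(11, 31, 14, 4)

# Same result via the half-unit adjustment: add 1/2 to a and d, subtract 1/2 from b and c
adjusted = chi_square_2x2(11.5, 30.5, 13.5, 4.5)
print(round(corrected, 4), round(adjusted, 4))
```

The adjustment leaves all row and column totals unchanged, which is why the denominator is the same in both versions.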

11.6 TEST IN r × c CONTINGENCY TABLE

If we have two attributes A and B classified into r and c classes respectively, denoted by $A_1, A_2, \dots, A_r$ and $B_1, B_2, \dots, B_c$, then the observed frequencies can be put in a tabular form with r rows and c columns, called an r × c contingency table.

If we use $O_{ij}$ (i = 1, 2, 3, …, r; j = 1, 2, 3, …, c) to denote the observed frequency for the attribute class $A_i B_j$, then the r × c contingency table can be represented as shown below:

| Attribute | $B_1$ | $B_2$ | $B_3$ | ⋯ | $B_c$ | Total |
|---|---|---|---|---|---|---|
| $A_1$ | $O_{11}$ | $O_{12}$ | $O_{13}$ | ⋯ | $O_{1c}$ | $a_1$ |
| $A_2$ | $O_{21}$ | $O_{22}$ | $O_{23}$ | ⋯ | $O_{2c}$ | $a_2$ |
| $A_3$ | $O_{31}$ | $O_{32}$ | $O_{33}$ | ⋯ | $O_{3c}$ | $a_3$ |
| ⋮ | | | | | | |
| $A_r$ | $O_{r1}$ | $O_{r2}$ | $O_{r3}$ | ⋯ | $O_{rc}$ | $a_r$ |
| Total | $b_1$ | $b_2$ | $b_3$ | ⋯ | $b_c$ | N |

Here the $a_i$ represent the total observed frequencies for the attribute classes $A_i$, the $b_j$ represent the same for the classes $B_j$, and N is the overall total frequency.

With these notations, the test statistic is

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad \text{where } E_{ij} = \frac{a_i b_j}{N} \text{ are the expected frequencies.}$$

With the help of this test statistic we can test the hypothesis $H_0$: the attributes A and B under consideration are independent, against the logical alternative $H_1$: not $H_0$, i.e. the attributes A and B are dependent, subject to the conditions N ≥ 50 and each of the observed frequencies $O_{ij}$ ≥ 5.

The decision criterion at level of significance $\alpha$ is given by

Reject $H_0$ if $\chi^2 > \chi^2_{(r-1)(c-1),\alpha}$

Do not reject $H_0$ if $\chi^2 \le \chi^2_{(r-1)(c-1),\alpha}$

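Eg: 1) Using the data given in the following table, decide whether we can conclude that the standard of clothing of a salesman has a significant effect on his performance in field selling at 5% LOS.

| | Disappointing | Satisfactory | Excellent | Total |
|---|---|---|---|---|
| Poorly dressed | 21 | 15 | 6 | 42 |
| Well dressed | 24 | 35 | 26 | 85 |
| Very well dressed | 35 | 80 | 58 | 173 |
| Total | 80 | 130 | 90 | 300 |

(Columns: performance in field selling.)

Solution: $H_0$: attributes are independent

LOS = 5%, $\alpha = 0.05$

Reject $H_0$ iff $\chi^2 > \chi^2_{4,0.05} = 9.4877$, where $\chi^2 = \sum \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

$E_{11} = \frac{a_1 b_1}{N} = \frac{42 \cdot 80}{300} = 11.2$, $E_{12} = \frac{a_1 b_2}{N} = \frac{42 \cdot 130}{300} = 18.2$, $E_{13} = \frac{a_1 b_3}{N} = \frac{42 \cdot 90}{300} = 12.6$

$E_{21} = \frac{a_2 b_1}{N} = \frac{85 \cdot 80}{300} = 22.67$, $E_{22} = \frac{a_2 b_2}{N} = \frac{85 \cdot 130}{300} = 36.83$, $E_{23} = \frac{a_2 b_3}{N} = \frac{85 \cdot 90}{300} = 25.5$

$E_{31} = \frac{a_3 b_1}{N} = \frac{173 \cdot 80}{300} = 46.13$, $E_{32} = \frac{a_3 b_2}{N} = \frac{173 \cdot 130}{300} = 74.97$, $E_{33} = \frac{a_3 b_3}{N} = \frac{173 \cdot 90}{300} = 51.9$

| $O_{ij}$ | $E_{ij}$ | $O_{ij} - E_{ij}$ | $(O_{ij} - E_{ij})^2 / E_{ij}$ |
|---|---|---|---|
| 21 | 11.2 | 9.8 | 8.575 |
| 15 | 18.2 | -3.2 | 0.5626 |
| 6 | 12.6 | -6.6 | 3.4571 |
| 24 | 22.67 | 1.33 | 0.078 |
| 35 | 36.83 | -1.83 | 0.0909 |
| 26 | 25.5 | 0.5 | 0.0098 |
| 35 | 46.13 | -11.13 | 2.6854 |
| 80 | 74.97 | 5.03 | 0.3375 |
| 58 | 51.9 | 6.1 | 0.7170 |
| Total | | | 16.5133 |

$\chi^2 = 16.5133 > \chi^2_{4,0.05} = 9.488$

⇒ reject $H_0$ at 5% LOS

We decide to reject $H_0$ at 5% level of significance and conclude that the standard of clothing of a salesman has a significant effect on his performance in field selling.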
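For larger tables the expected frequencies are tedious to compute by hand. The following sketch derives them from the row and column totals and accumulates the statistic for the clothing data above. The function names are hypothetical and the table is passed as a list of rows without the totals.

```python
def expected_frequencies(table):
    """E_ij = (row total i) * (column total j) / N for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    expected = expected_frequencies(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: poorly, well, very well dressed; columns: disappointing, satisfactory, excellent
clothing = [[21, 15, 6],
            [24, 35, 26],
            [35, 80, 58]]
expected = expected_frequencies(clothing)
chi_sq, dof = chi_square_independence(clothing)
print(round(chi_sq, 3), dof)
```

With unrounded expected frequencies the statistic comes to about 16.516. The 16.5133 in the worked solution results from rounding the $E_{ij}$ to two decimals first, and the conclusion is the same either way.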

11.7 COEFFICIENT OF CONTINGENCY

A measure of the degree of relationship, association, or dependence of the classifications in a contingency table is given by

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$

which is called the coefficient of contingency. The larger the value of C, the greater is the degree of association. The number of rows and columns in the contingency table determines the maximum value of C, which is never greater than 1. If the number of rows and the number of columns of a contingency table are both equal to k, the maximum value of C is given by $\sqrt{\frac{k-1}{k}}$.

Eg: 1) Using the data given in the following table, decide whether we can conclude that the standard of clothing of a salesman has a significant effect on his performance in field selling at 5% LOS. Also find the coefficient of contingency.

| | Disappointing | Satisfactory | Excellent | Total |
|---|---|---|---|---|
| Poorly dressed | 21 | 15 | 6 | 42 |
| Well dressed | 24 | 35 | 26 | 85 |
| Very well dressed | 35 | 80 | 58 | 173 |
| Total | 80 | 130 | 90 | 300 |

(Columns: performance in field selling.)

Solution: $H_0$: attributes are independent

LOS = 5%, $\alpha = 0.05$

Reject $H_0$ iff $\chi^2 > \chi^2_{4,0.05} = 9.4877$, where $\chi^2 = \sum \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

The expected frequencies $E_{ij} = \frac{a_i b_j}{N}$, e.g. $E_{11} = \frac{42 \cdot 80}{300} = 11.2$, and the resulting contributions are:

| $O_{ij}$ | $E_{ij}$ | $O_{ij} - E_{ij}$ | $(O_{ij} - E_{ij})^2 / E_{ij}$ |
|---|---|---|---|
| 21 | 11.2 | 9.8 | 8.575 |
| 15 | 18.2 | -3.2 | 0.5626 |
| 6 | 12.6 | -6.6 | 3.4571 |
| 24 | 22.67 | 1.33 | 0.078 |
| 35 | 36.83 | -1.83 | 0.0909 |
| 26 | 25.5 | 0.5 | 0.0098 |
| 35 | 46.13 | -11.13 | 2.6854 |
| 80 | 74.97 | 5.03 | 0.3375 |
| 58 | 51.9 | 6.1 | 0.7170 |
| Total | | | 16.5133 |

$\chi^2 = 16.5133 > \chi^2_{4,0.05} = 9.488$ ⇒ reject $H_0$ at 5% LOS

We decide to reject $H_0$ at 5% level of significance and conclude that the standard of clothing of a salesman has a significant effect on his performance in field selling.

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}} = \sqrt{\frac{16.5133}{16.5133 + 300}} = 0.2284$$
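Since C can never reach 1, it helps to read it against its maximum for the table size. The short sketch below computes both for the 3 × 3 clothing table, taking $\chi^2$ = 16.5133 from the solution above. The function names are hypothetical.

```python
import math


def contingency_coefficient(chi_sq, n):
    """C = sqrt(chi^2 / (chi^2 + N))."""
    return math.sqrt(chi_sq / (chi_sq + n))


def max_contingency_coefficient(k):
    """Largest possible C for a k x k table: sqrt((k - 1) / k)."""
    return math.sqrt((k - 1) / k)


c = contingency_coefficient(16.5133, 300)
c_max = max_contingency_coefficient(3)
print(round(c, 4), round(c_max, 4))
```

Here C = 0.2284 against a maximum of about 0.8165, so the association is statistically significant but modest in strength.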

## Page 183

11.8 CORRELATION OF ATTRIBUTES

Because classifications in a contingency table often describe characteristics of individuals or objects, they are often referred to as attributes, and the degree of dependence, association, or relationship is called the correlation of attributes. For k × k tables, we define

$$r = \sqrt{\frac{\chi^2}{N(k - 1)}}$$

as the correlation coefficient between attributes (or classifications). This coefficient lies between 0 and 1. For 2 × 2 tables, in which k = 2, the correlation is often called tetrachoric correlation.
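As an illustration, the formula can be applied to the 3 × 3 clothing table of Section 11.7, for which $\chi^2$ = 16.5133 and N = 300. This value of r is not worked out in the text; the sketch simply substitutes into the definition, and the function name is hypothetical.

```python
import math


def attribute_correlation(chi_sq, n, k):
    """r = sqrt(chi^2 / (N (k - 1))) for a k x k contingency table."""
    return math.sqrt(chi_sq / (n * (k - 1)))


# 3 x 3 clothing table: chi-square = 16.5133, N = 300, k = 3
r = attribute_correlation(16.5133, 300, 3)
print(round(r, 4))
```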

11.9 ADDITIVE PROPERTY OF CHI SQUARE VARIATE

Suppose that the results of repeated experiments yield sample values of $\chi^2$ given by $\chi_1^2, \chi_2^2, \chi_3^2, \dots, \chi_k^2$ with $v_1, v_2, v_3, \dots, v_k$ degrees of freedom, respectively. Then the result of all these experiments can be considered equivalent to a $\chi^2$ value given by $\chi_1^2 + \chi_2^2 + \chi_3^2 + \dots + \chi_k^2$ with $v_1 + v_2 + v_3 + \dots + v_k$ degrees of freedom.

11.10 SUMMARY

In this chapter we discussed the chi-square test, a non-parametric test that compares two or more variables from randomly selected data. It helps to find the relationship between two or more variables. The chi-square distribution is a theoretical or mathematical distribution which has wide applicability in statistical work. We also saw the various properties of the Chi-square variate, along with the tests based on it: the test for goodness of fit and the test for independence of attributes in 2 × 2 and r × c contingency tables, together with Yates' correction.

11.11 EXERCISES

1) If X is a chi-square variate with 17 degrees of freedom, find $x_0$, $x_1$ and $\alpha$ such that $p(X > x_0) = 0.01$, $p(X \le x_1) = 0.95$ and $p(X \le 8.67) = \alpha$.

2) The following table gives the result of an investigation into the association of eye color and hair color. Can we deduce that the two attributes are independent?

| Eye color \ Hair color | Brown | Black |
|---|---|---|
| Blue | 75 | 25 |
| Brown | 65 | 35 |

3) The following data show the classification of individuals with respect to gender and literacy in a random sample of 200. Test the data for independence of attributes using chi square at 1% LOS.

| | Literate | Illiterate |
|---|---|---|
| Male | 95 | 5 |
| Female | 75 | 25 |

4) The eyesight of 100 randomly selected people from a town were tested with the following results:

| | Poor eyesight | Good eyesight |
|---|---|---|
| Male | 200 | 350 |
| Female | 200 | 250 |

Can we conclude at 5% level of significance that gender has no bearing on the quality of eyesight?

5) The following table shows the results of inoculation against cholera:

| | Not attacked | Attacked |
|---|---|---|
| Inoculated | 446 | 4 |
| Not inoculated | 291 | 9 |

Can we say at 5% LOS that inoculation is effective in controlling susceptibility to cholera?

6) The following are the results of the tests performed on two brands of tyres manufactured by a manufacturer.

| | Brand A | Brand B |
|---|---|---|
| Lasted more than 30000 km | 27 | 38 |
| Failed to last 30000 km | 18 | 27 |

Use chi square at 5% LOS to test whether we can say that the two brands of tyres differ significantly or not as regards their lifespan.

7) Determine at 1% LOS whether vaccination can be regarded as a

preventive measure for small pox on the basis of following report:

Out of 1482 persons in a locality exposed to small pox, 368 in all were

attacked. Out of 1482 persons, 343 had been vaccinated and of these

only 35 were attacked.

8) During a market research survey organized by ABC Ltd., the households were asked whether they used “Beauty Soap” (the price of which is Rs. 7.25 a piece) and whether their per capita monthly expenditure exceeded Rs. 800, with the following results.

| Used “Beauty Soap” \ Monthly per capita expenditure | Exceeded Rs. 800 | Did not exceed Rs. 800 |
|---|---|---|
| Yes | 6 | 20 |
| No | 4 | 30 |

Can we say that the use of “Beauty Soap” depends on monthly per capita expenditure at 5% LOS?

9) In a household survey conducted in a certain locality, the following information is collected.

| | Exclusive indoor toilet facility available: yes | Exclusive indoor toilet facility available: no |
|---|---|---|
| Owned house | 9 | 4 |
| Rented house | 21 | 16 |

Use the chi-square test for independence at 1% LOS and give your conclusions.

10) The following data refer to an investigation carried out to examine the effect of T.V. as a medium of advertisement on the turnover of companies manufacturing and selling consumer products. A random sample of 40 companies was selected. Analyze the data and comment on your findings.

| | Annual turnover exceeding one crore rupees: Yes | Annual turnover exceeding one crore rupees: No |
|---|---|---|
| T.V. advertisement | 7 | 3 |
| No T.V. advertisement | 10 | 20 |

11) A random sample of students of Mumbai University was selected and asked their opinion about autonomous colleges. The results are given below. The same number of each gender was included within each class group. Test the hypothesis at 5% LOS that opinions are independent of the class groupings.

| | Favoring autonomous colleges | Opposed to autonomous colleges |
|---|---|---|
| First Year | 120 | 80 |
| Second Year | 130 | 70 |
| Third Year | 70 | 30 |
| Post Graduation | 80 | 20 |

12) Test for independence between health and working capacity from the following data:

| Working capacity \ Health | Very good | Good | Fair |
|---|---|---|---|
| Good | 20 | 25 | 15 |
| Bad | 10 | 15 | 15 |

13) ABC Ltd. employ a large number of handicapped persons. The following is an account of the performance of 200 randomly chosen employees of the company:

| Performance | Above average | Average | Below average |
|---|---|---|---|
| Handicapped | 29 | 31 | 20 |
| Non-handicapped | 55 | 30 | 35 |

Can we retain at 5% LOS the hypothesis that handicapped employees are as efficient as the non-handicapped employees of the company?

14) In a survey, 100 couples were interviewed and asked to give an opinion on the importance of the amiable nature of the partner in the selection of a bride or a groom. The ranks were given independently by them as I, II or III.

| Ranking by husbands \ Ranking by wives | I | II | III |
|---|---|---|---|
| I | 25 | 12 | 8 |
| II | 10 | 23 | 8 |
| III | 5 | 5 | 4 |

Use the chi-square test for independence and comment at 5% LOS.

15) A socio-economic survey conducted in 1981 in Mumbai revealed the following results:

| Monthly family income | Below 1200 | 1200 to 1800 | 1800 and above |
|---|---|---|---|
| No child | 18 | 15 | 12 |
| One child | 31 | 34 | 25 |
| Two or more children | 81 | 51 | 63 |

Can we regard at 1% LOS that the number of children in the family has no association with monthly income?

16) The data in the table were collected on how individuals prepared their taxes and their education level. The null hypothesis is that the way people prepare their taxes (computer software or pen and paper) is independent of their education level. The following table is the contingency table.

| Tax preparation \ Education | High School | Bachelors | Masters |
|---|---|---|---|
| Computer software | 23 | 35 | 42 |
| Pen and paper | 45 | 30 | 25 |

Find the coefficient of contingency.

11.12 SOLUTION TO EXERCISE

| Q. No. | Solution | Q. No. | Solution |
|---|---|---|---|
| 1 | $x_0 = 33.4087$, $x_1 = 27.5871$, $\alpha = 0.05$ | 2 | 2.38, accept $H_0$ |
| 3 | 15.686, reject $H_0$ | 4 | 3.84, do not reject $H_0$ |
| 5 | 3.552, accept $H_0$ | 6 | 0.026, accept |
| 7 | 51.157, reject | 8 | 0.6652, accept |
| 9 | 0.2122, accept | 10 | 2.762, accept |
| 11 | 12.046, reject | 12 | 1.9097, accept |
| 13 | 4.904, accept | 14 | 10.8125, accept |
| 15 | 4.305, accept | 16 | 0.236 |

11.13 TABLE OF CHI-SQUARE DISTRIBUTION

Rows give the degrees of freedom (df); columns give the area to the right of the tabulated chi-square value.

| df \ area | 0.995 | 0.99 | 0.975 | 0.95 | 0.9 | 0.75 | 0.5 | 0.25 | 0.1 | 0.05 | 0.025 | 0.01 | 0.005 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00004 | 0.00016 | 0.00098 | 0.00393 | 0.01579 | 0.10153 | 0.45494 | 1.3233 | 2.70554 | 3.84146 | 5.02389 | 6.6349 | 7.87944 |
| 2 | 0.01003 | 0.0201 | 0.05064 | 0.10259 | 0.21072 | 0.57536 | 1.38629 | 2.77259 | 4.60517 | 5.99146 | 7.37776 | 9.21034 | 10.59663 |
| 3 | 0.07172 | 0.11483 | 0.2158 | 0.35185 | 0.58437 | 1.21253 | 2.36597 | 4.10834 | 6.25139 | 7.81473 | 9.3484 | 11.34487 | 12.83816 |
| 4 | 0.20699 | 0.29711 | 0.48442 | 0.71072 | 1.06362 | 1.92256 | 3.35669 | 5.38527 | 7.77944 | 9.48773 | 11.14329 | 13.2767 | 14.86026 |
| 5 | 0.41174 | 0.5543 | 0.83121 | 1.14548 | 1.61031 | 2.6746 | 4.35146 | 6.62568 | 9.23636 | 11.0705 | 12.8325 | 15.08627 | 16.7496 |
| 6 | 0.67573 | 0.87209 | 1.23734 | 1.63538 | 2.20413 | 3.4546 | 5.34812 | 7.8408 | 10.64464 | 12.59159 | 14.44938 | 16.81189 | 18.54758 |
| 7 | 0.98926 | 1.23904 | 1.68987 | 2.16735 | 2.83311 | 4.25485 | 6.34581 | 9.03715 | 12.01704 | 14.06714 | 16.01276 | 18.47531 | 20.27774 |
| 8 | 1.34441 | 1.6465 | 2.17973 | 2.73264 | 3.48954 | 5.07064 | 7.34412 | 10.21885 | 13.36157 | 15.50731 | 17.53455 | 20.09024 | 21.95495 |
| 9 | 1.73493 | 2.0879 | 2.70039 | 3.32511 | 4.16816 | 5.89883 | 8.34283 | 11.38875 | 14.68366 | 16.91898 | 19.02277 | 21.66599 | 23.58935 |
| 10 | 2.15586 | 2.55821 | 3.24697 | 3.9403 | 4.86518 | 6.7372 | 9.34182 | 12.54886 | 15.98718 | 18.30704 | 20.48318 | 23.20925 | 25.18818 |
| 11 | 2.60322 | 3.05348 | 3.81575 | 4.57481 | 5.57778 | 7.58414 | 10.341 | 13.70069 | 17.27501 | 19.67514 | 21.92005 | 24.72497 | 26.75685 |
| 12 | 3.07382 | 3.57057 | 4.40379 | 5.22603 | 6.3038 | 8.43842 | 11.34032 | 14.8454 | 18.54935 | 21.02607 | 23.33666 | 26.21697 | 28.29952 |
| 13 | 3.56503 | 4.10692 | 5.00875 | 5.89186 | 7.0415 | 9.29907 | 12.33976 | 15.98391 | 19.81193 | 22.36203 | 24.7356 | 27.68825 | 29.81947 |
| 14 | 4.07467 | 4.66043 | 5.62873 | 6.57063 | 7.78953 | 10.16531 | 13.33927 | 17.11693 | 21.06414 | 23.68479 | 26.11895 | 29.14124 | 31.31935 |
| 15 | 4.60092 | 5.22935 | 6.26214 | 7.26094 | 8.54676 | 11.03654 | 14.33886 | 18.24509 | 22.30713 | 24.99579 | 27.48839 | 30.57791 | 32.80132 |
| 16 | 5.14221 | 5.81221 | 6.90766 | 7.96165 | 9.31224 | 11.91222 | 15.3385 | 19.36886 | 23.54183 | 26.29623 | 28.84535 | 31.99993 | 34.26719 |
| 17 | 5.69722 | 6.40776 | 7.56419 | 8.67176 | 10.08519 | 12.79193 | 16.33818 | 20.48868 | 24.76904 | 27.58711 | 30.19101 | 33.40866 | 35.71847 |
| 18 | 6.2648 | 7.01491 | 8.23075 | 9.39046 | 10.86494 | 13.67529 | 17.3379 | 21.60489 | 25.98942 | 28.8693 | 31.52638 | 34.80531 | 37.15645 |
| 19 | 6.84397 | 7.63273 | 8.90652 | 10.11701 | 11.65091 | 14.562 | 18.33765 | 22.71781 | 27.20357 | 30.14353 | 32.85233 | 36.19087 | 38.58226 |
| 20 | 7.43384 | 8.2604 | 9.59078 | 10.85081 | 12.44261 | 15.45177 | 19.33743 | 23.82769 | 28.41198 | 31.41043 | 34.16961 | 37.56623 | 39.99685 |
| 21 | 8.03365 | 8.8972 | 10.2829 | 11.59131 | 13.2396 | 16.34438 | 20.33723 | 24.93478 | 29.61509 | 32.67057 | 35.47888 | 38.93217 | 41.40106 |
| 22 | 8.64272 | 9.54249 | 10.98232 | 12.33801 | 14.04149 | 17.23962 | 21.33704 | 26.03927 | 30.81328 | 33.92444 | 36.78071 | 40.28936 | 42.79565 |
| 23 | 9.26042 | 10.19572 | 11.68855 | 13.09051 | 14.84796 | 18.1373 | 22.33688 | 27.14134 | 32.0069 | 35.17246 | 38.07563 | 41.6384 | 44.18128 |
| 24 | 9.88623 | 10.85636 | 12.40115 | 13.84843 | 15.65868 | 19.03725 | 23.33673 | 28.24115 | 33.19624 | 36.41503 | 39.36408 | 42.97982 | 45.55851 |
| 25 | 10.51965 | 11.52398 | 13.11972 | 14.61141 | 16.47341 | 19.93934 | 24.33659 | 29.33885 | 34.38159 | 37.65248 | 40.64647 | 44.3141 | 46.92789 |
| 26 | 11.16024 | 12.19815 | 13.8439 | 15.37916 | 17.29188 | 20.84343 | 25.33646 | 30.43457 | 35.56317 | 38.88514 | 41.92317 | 45.64168 | 48.28988 |
| 27 | 11.80759 | 12.8785 | 14.57338 | 16.1514 | 18.1139 | 21.7494 | 26.33634 | 31.52841 | 36.74122 | 40.11327 | 43.19451 | 46.96294 | 49.64492 |
| 28 | 12.46134 | 13.56471 | 15.30786 | 16.92788 | 18.93924 | 22.65716 | 27.33623 | 32.62049 | 37.91592 | 41.33714 | 44.46079 | 48.27824 | 50.99338 |
| 29 | 13.12115 | 14.25645 | 16.04707 | 17.70837 | 19.76774 | 23.56659 | 28.33613 | 33.71091 | 39.08747 | 42.55697 | 45.72229 | 49.58788 | 52.33562 |
| 30 | 13.78672 | 14.95346 | 16.79077 | 18.49266 | 20.59923 | 24.47761 | 29.33603 | 34.79974 | 40.25602 | 43.77297 | 46.97924 | 50.89218 | 53.67196 |

11.14 REFERENCE FOR FURTHER READING

The following books are recommended for further reading:

Statistics by Murray R. Spiegel and Larry J. Stephens, McGraw-Hill International, 4th edition

Fundamentals of Mathematical Statistics by S. C. Gupta and V. K. Kapoor, Sultan Chand and Sons, 11th edition

*****


## Page 188

UNIT V

12

CURVE FITTING AND THE METHOD OF

LEAST SQUARES

Unit Structure

12.0 Objectives

12.1 Introduction

12.2 Relationship between variables

12.3 Curve fitting

12.4 Equations of Approximating Curves

12.5 Freehand Method of Curve Fitting

12.6 The Straight line Method

12.7 Least Square Curve fitting

12.7.1 Straight Line

12.7.2 Parabola

12.7.3 Non-Linear relationship

12.8 Regression

12.9 Applications to Time Series

12.10 Problems involving more than two variables

12.11 Summary

12.12 Exercises

12.13 Solution to Exercises

12.14 Logarithm tables

12.15 Reference for further reading

12.0 OBJECTIVES

The main objective is to study the fitting of curves using the method of least squares, which minimizes the sum of the squares of the errors.

12.1 INTRODUCTION

When we come across data on two variables and think about how they are related, two objectives come to mind:

1) To translate the possible relationship into a mathematical equation (called the equation of the curve).

2) To exploit the relationship between the variables to estimate the value of one variable corresponding to a given value of the other variable.

## Page 189

Curve fitting serves the first objective directly, i.e., it leads to a mathematical equation describing the relationship between the variables. Using this equation, one can then estimate the value of one variable from a given value of the other.

12.2 RELATIONSHIP BETWEEN VARIABLES

Very often in practice a relationship is found to exist between two (or

more) variables. For example, weights of adult males depend to some

degree on their heights, the circumferences of circles depend on their radii,

and the pressure of a given mass of gas depends on its temperature and

volume.

It is frequently desirable to express this relationship in mathematical form

by determining an equation that connects the variables.

12.3 CURVE FITTING

To determine an equation that connects variables, a first step is to collect

data that show corresponding values of the variables under consideration.

For example, suppose X and Y denote, respectively, the height and weight of adult males; then a sample of N individuals would reveal the heights $X_1, X_2, X_3, \cdots, X_N$ and the corresponding weights $Y_1, Y_2, Y_3, \cdots, Y_N$.

The next step is to plot the points $(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), \cdots, (X_N, Y_N)$ on a rectangular coordinate system. The resulting set of points is sometimes called a scatter diagram. From the scatter diagram it is often possible to visualize a smooth curve that approximates the data. Such a curve is called an approximating curve.

For example, in the following diagram the data appear to be approximated by a straight line, and so we say that a linear relationship exists between the variables.

[Figure: scatter diagram showing a linear relationship]

## Page 190

In the following diagram, however, although a relationship exists between the variables, it is not a linear relationship, and so we call it a non-linear relationship.

The general problem of finding equations of approximating curves that fit

the given sets of data is called curve fitting.

12.4 EQUATIONS OF APPROXIMATING CURVES

Various common types of approximating curves and their equations are given below for reference. All letters other than x and y represent constants. The variables x and y are termed the independent variable and the dependent variable respectively; the roles of the variables can be interchanged as required.

Straight line: $y = a + bx$

Parabola or quadratic curve: $y = a + bx + cx^2$

Cubic curve: $y = a + bx + cx^2 + dx^3$

Quartic curve: $y = a + bx + cx^2 + dx^3 + ex^4$

nth degree curve: $y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$

The above equations are polynomial equations of degree one, two, three, four and n respectively. The following are some other commonly used equations.

Hyperbola: $y = \dfrac{1}{a_0 + a_1 x}$

Exponential curve: $y = a \cdot b^x$

Geometric curve: $y = a \cdot x^b$

Gompertz curve: $y = p q^{b^x}$

Logistic curve: $y = \dfrac{1}{a b^x + g}$

[Figure: scatter diagram showing a non-linear relationship]

## Page 191

To decide which curve to use, it is helpful to obtain scatter diagrams of the transformed variables. For example, if a scatter diagram of $\log y$ versus $x$ shows a linear relationship, then the equation is of the form $y = a \cdot b^x$, while if $\log y$ versus $\log x$ shows a linear relationship, the equation is of the form $y = a \cdot x^b$.

12.5 FREEHAND METHOD OF CURVE FITTING

Individual judgment can often be used to draw an approximating curve to

fit a set of data. This is called a freehand method of curve fitting. If the

type of equation of this curve is known, it is possible to obtain the

constants in the equation by choosing as many points on the curve as there

are constants in the equation. For example, if the curve is a straight line,

two points are necessary; if it is a parabola, three points are necessary. The

method has the disadvantage that different observers will obtain different

curves and equations.

12.6 THE STRAIGHT LINE METHOD

The simplest type of approximating curve is a straight line, whose equation can be written as

$y = a + bx$

Given any two points $(x_1, y_1)$ and $(x_2, y_2)$ on the line, the constants a and b can be determined. The resulting equation of the straight line can be written as

$y - y_1 = m(x - x_1), \quad m = \dfrac{y_2 - y_1}{x_2 - x_1}$

where m is known as the slope of the line.

When the equation is written in the form $y = a + bx$, the constant b is the slope m. The constant a, which is the value of y when x = 0, is called the y-intercept.

12.7 LEAST SQUARE CURVE FITTING

We can find the trend curve by fitting a mathematical equation. This method is more precise than the freehand method and can also be used for forecasting. We can fit either a straight line or a curve to the given data. Suppose we fit a straight line $y = a + bx$, where a and b are constants. We determine a and b so that the following conditions are fulfilled:

i) The sum of the deviations of all the values of y from their trend values is zero, where deviations above the line are given a positive sign and deviations below the line a negative sign. That is, if $\hat{y}$ is the trend value obtained from the trend line and y is the actual value in the data, then

$\sum (y - \hat{y}) = 0$

ii) The sum of the squares of the deviations is the least, i.e., $\sum (y - \hat{y})^2$ is minimum.

## Page 192

The method gets the name 'least squares method' because of this second property. The fitted line is also called the line of best fit. In a sense this line is like the arithmetic mean, since the arithmetic mean is a single value possessing the above two properties.

Remark: When x is given in years, or the values of x are large, we code the values of x as follows:

1) If n is odd, we assign the value 0 to the middle observation, i.e., the $\frac{n+1}{2}$th observation; we add 1 for each step downward and subtract 1 for each step upward.

2) If n is even, we assign the value -1 to the $\frac{n}{2}$th observation; we add 2 for each step downward and subtract 2 for each step upward.
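The coding rule in the remark can be sketched in Python as follows. The function name `code_time_values` is an illustrative choice and not part of the text; the sketch assumes the observations are equally spaced and listed in order.

```python
def code_time_values(n):
    """Return coded x values for n equally spaced, ordered observations."""
    if n % 2 == 1:
        # Odd n: the middle observation gets 0, neighbours differ by 1.
        middle = (n + 1) // 2
        return [i - middle for i in range(1, n + 1)]
    # Even n: the (n/2)th observation gets -1, neighbours differ by 2.
    half = n // 2
    return [-1 + 2 * (i - half) for i in range(1, n + 1)]


print(code_time_values(7))
print(code_time_values(8))
```

For n = 7 the coded values run from -3 to 3, and for n = 8 they run from -7 to 7 in steps of 2. In both cases the coded values sum to zero, which is what makes the normal equations easy to solve by hand.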

12.7.1 Straight Line :

Suppose that n pairs of observations $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$ on two variables x and y are available, and we want to fit a linear curve of the form $y = a + bx$ to the data. According to the least squares method, we have to find a and b such that $\sum (y - \hat{y})^2$ is minimum, where $\hat{y} = a + bx$.

Thus, we have to minimize $D = \sum (y - a - bx)^2$. The first order conditions are

$\dfrac{\partial D}{\partial a} = 0$ and $\dfrac{\partial D}{\partial b} = 0$

that is, $-2\sum (y - a - bx) = 0$ and $-2\sum x(y - a - bx) = 0$. These are known as the normal equations and can be written as

$\sum y = an + b\sum x$ and $\sum xy = a\sum x + b\sum x^2$

It can be easily verified that the second order conditions for a minimum are satisfied, and therefore the best choice of a and b is obtained by solving the normal equations. These equations will be used to fit a linear curve to the given data.

Eg: 1) Fit a straight line of the form y = a + bx using the least squares method.

| x | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| y | 1 | 2.9 | 4.8 | 6.7 | 8.6 |

Solution: n = 5

| | x | y | xy | $x^2$ |
|---|---|---|---|---|
| | 0 | 1 | 0 | 0 |
| | 1 | 2.9 | 2.9 | 1 |
| | 2 | 4.8 | 9.6 | 4 |
| | 3 | 6.7 | 20.1 | 9 |
| | 4 | 8.6 | 34.4 | 16 |
| Total | 10 | 24.0 | 67.0 | 30 |

## Page 193

The straight line equation is $y = a + bx$. To find the values of a and b we solve the normal equations

$\sum y = an + b\sum x, \quad \sum xy = a\sum x + b\sum x^2$ --- (1)

Substituting the values in eq (1), we get

$24 = 5a + 10b$ --- (2)

$67 = 10a + 30b$ --- (3)

Multiplying eq (2) by 2, we get $48 = 10a + 20b$ --- (4)

Subtracting eq (4) from eq (3), we get $19 = 10b \Rightarrow b = 1.9$

Substituting $b = 1.9$ in eq (2) gives $a = 1$

$\therefore y = 1 + 1.9x$

is the required straight line.
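The hand calculation above can be checked with a short Python sketch that solves the two normal equations directly. The function name `fit_line` is an illustrative assumption; the data are those of Eg 1.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Eliminate a from the normal equations
    #   sum_y  = a*n     + b*sum_x
    #   sum_xy = a*sum_x + b*sum_xx
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line([0, 1, 2, 3, 4], [1, 2.9, 4.8, 6.7, 8.6])
print(round(a, 4), round(b, 4))
```

The sketch reproduces a = 1 and b = 1.9. The same function applies to any data set for which the x values are not all equal.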

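Eg: 2) Fit a straight line of the form $y = a_0 + a_1 x$

| x | 1 | 2 | 3 | 4 | 6 | 8 |
|---|---|---|---|---|---|---|
| y | 2.4 | 3.1 | 3.5 | 4.2 | 5 | 6 |

Solution: n = 6

| | x | y | xy | $x^2$ |
|---|---|---|---|---|
| | 1 | 2.4 | 2.4 | 1 |
| | 2 | 3.1 | 6.2 | 4 |
| | 3 | 3.5 | 10.5 | 9 |
| | 4 | 4.2 | 16.8 | 16 |
| | 6 | 5 | 30 | 36 |
| | 8 | 6 | 48 | 64 |
| Total | 24 | 24.2 | 113.9 | 130 |

The straight line equation is $y = a_0 + a_1 x$. To find the values of $a_0$ and $a_1$ we solve the normal equations

$\sum y = a_0 n + a_1\sum x, \quad \sum xy = a_0\sum x + a_1\sum x^2$ --- (1)

Substituting the values in eq (1), we get

$24.2 = 6a_0 + 24a_1$ --- (2)

$113.9 = 24a_0 + 130a_1$ --- (3)

Multiplying eq (2) by 4, we get $96.8 = 24a_0 + 96a_1$ --- (4)

Subtracting eq (4) from eq (3), we get $17.1 = 34a_1 \Rightarrow a_1 = 0.5029$

Substituting $a_1 = 0.5029$ in eq (2) gives $a_0 = 2.0217$

$\therefore y = 2.0217 + 0.5029x$

is the required straight line.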

Eg: 3) Fit a straight line using the least squares method and estimate the exchange rate for the years 1993-94 and 1984-85.

## Page 194

| Year | 1985-86 | 1986-87 | 1987-88 | 1988-89 | 1989-90 | 1990-91 | 1991-92 |
|---|---|---|---|---|---|---|---|
| Exchange rate | 12.24 | 12.78 | 12.97 | 14.48 | 16.65 | 17.94 | 24.47 |

Solution: n = 7

The straight line equation we take is $y = a + bx$. To find the values of a and b we solve the normal equations

$\sum y = an + b\sum x, \quad \sum xy = a\sum x + b\sum x^2$ --- (1)

| Year | Exchange rate (y) | x | xy | $x^2$ |
|---|---|---|---|---|
| 1985-86 | 12.24 | -3 | -36.72 | 9 |
| 1986-87 | 12.78 | -2 | -25.56 | 4 |
| 1987-88 | 12.97 | -1 | -12.97 | 1 |
| 1988-89 | 14.48 | 0 | 0 | 0 |
| 1989-90 | 16.65 | 1 | 16.65 | 1 |
| 1990-91 | 17.94 | 2 | 35.88 | 4 |
| 1991-92 | 24.47 | 3 | 73.41 | 9 |
| Total | 111.53 | 0 | 50.69 | 28 |

Substituting the values in eq (1), we get

$111.53 = 7a$ --- (2)

$50.69 = 28b$ --- (3)

From eq (2), we get $a = \dfrac{111.53}{7} = 15.9328$

From eq (3), we get $b = \dfrac{50.69}{28} = 1.8104$

$\therefore y = 15.9328 + 1.8104x$ --- (4)

is the required straight line.

To get the exchange rate for the year 1993-94, we substitute x = 5 in eq (4)

$\therefore y = 15.9328 + 1.8104(5) = 24.9848$

To get the exchange rate for the year 1984-85, we substitute x = -4 in eq (4)

$\therefore y = 15.9328 + 1.8104(-4) = 8.6912$

12.7.2 Parabola :

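Suppose we want to fit a second degree curve, called a parabola, of the form $y = a + bx + cx^2$ to n pairs of observations $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$. According to the least squares method, we have to minimize $D = \sum (y - a - bx - cx^2)^2$ with respect to a, b and c. The first order conditions are

$\dfrac{\partial D}{\partial a} = 0, \quad \dfrac{\partial D}{\partial b} = 0$ and $\dfrac{\partial D}{\partial c} = 0$

Applying these, we get the normal equations

$\sum y = na + b\sum x + c\sum x^2$

$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$

$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$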

## Page 195

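Eg: 1) Fit a parabola of the form $y = a + bx + cx^2$ using the least squares method.

| x | 0 | 1 | 2 |
|---|---|---|---|
| y | 1 | 6 | 17 |

Solution: n = 3

| | x | y | $x^2$ | $x^3$ | $x^4$ | xy | $x^2 y$ |
|---|---|---|---|---|---|---|---|
| | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| | 1 | 6 | 1 | 1 | 1 | 6 | 6 |
| | 2 | 17 | 4 | 8 | 16 | 34 | 68 |
| Total | 3 | 24 | 5 | 9 | 17 | 40 | 74 |

The equation of the parabola is $y = a + bx + cx^2$. To find the values of a, b and c we solve the normal equations

$\sum y = na + b\sum x + c\sum x^2, \quad \sum xy = a\sum x + b\sum x^2 + c\sum x^3, \quad \sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$ --- (*)

Substituting the values in eq (*), we get

$24 = 3a + 3b + 5c$ --- (1)

$40 = 3a + 5b + 9c$ --- (2)

$74 = 5a + 9b + 17c$ --- (3)

Subtracting eq (1) from eq (2), we get $16 = 2b + 4c \Rightarrow 8 = b + 2c$ --- (4)

Multiplying eq (2) by 5, we get $200 = 15a + 25b + 45c$ --- (5)

Multiplying eq (3) by 3, we get $222 = 15a + 27b + 51c$ --- (6)

Subtracting eq (5) from eq (6), we get $22 = 2b + 6c \Rightarrow 11 = b + 3c$ --- (7)

Subtracting eq (4) from eq (7), we get $c = 3$

Substituting $c = 3$ in eq (4), we get $b = 2$

Substituting $b = 2, c = 3$ in eq (1), we get $a = 1$

$\therefore y = 1 + 2x + 3x^2$ is the required equation of the parabola.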
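As a check on the elimination above, the three normal equations can also be solved mechanically. The following Python sketch uses Cramer's rule; this is one convenient choice for a 3 x 3 system, not the method used in the text, and the names `det3` and `fit_parabola` are illustrative assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_parabola(xs, ys):
    """Return (a, b, c) of the least squares parabola y = a + b*x + c*x^2."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    # Coefficient matrix and right-hand side of the normal equations.
    matrix = [[n, s1, s2], [s1, s2, s3], [s2, s3, s4]]
    rhs = [t0, t1, t2]
    d = det3(matrix)
    coeffs = []
    for col in range(3):
        # Replace one column by the right-hand side (Cramer's rule).
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


print(fit_parabola([0, 1, 2], [1, 6, 17]))
```

For the data of Eg 1 the sketch returns a = 1, b = 2 and c = 3, in agreement with the hand calculation. At least three distinct x values are needed, otherwise the determinant `d` is zero.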

Eg: 2) Fit a parabola of the form $y = a + bx + cx^2$

| Year | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 |
|---|---|---|---|---|---|---|---|---|
| No. of students | 15 | 17 | 20 | 25 | 30 | 31 | 30 | 32 |

Also find the no. of students for the years 2000 and 1987.

Solution: n = 8

| Year | No. of students (y) | x | $x^2$ | $x^3$ | $x^4$ | xy | $x^2 y$ |
|---|---|---|---|---|---|---|---|
| 1989 | 15 | -7 | 49 | -343 | 2401 | -105 | 735 |
| 1990 | 17 | -5 | 25 | -125 | 625 | -85 | 425 |
| 1991 | 20 | -3 | 9 | -27 | 81 | -60 | 180 |
| 1992 | 25 | -1 | 1 | -1 | 1 | -25 | 25 |
| 1993 | 30 | 1 | 1 | 1 | 1 | 30 | 30 |
| 1994 | 31 | 3 | 9 | 27 | 81 | 93 | 279 |
| 1995 | 30 | 5 | 25 | 125 | 625 | 150 | 750 |
| 1996 | 32 | 7 | 49 | 343 | 2401 | 224 | 1568 |
| Total | 200 | 0 | 168 | 0 | 6216 | 222 | 3992 |

## Page 196

The equation of the parabola is $y = a + bx + cx^2$. To find the values of a, b and c we solve the normal equations

$\sum y = na + b\sum x + c\sum x^2, \quad \sum xy = a\sum x + b\sum x^2 + c\sum x^3, \quad \sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$ --- (*)

Substituting the values in eq (*), we get

$200 = 8a + 168c$ --- (1)

$222 = 168b$ --- (2)

$3992 = 168a + 6216c$ --- (3)

From eq (2), we get $b = \dfrac{222}{168} = 1.3214$

Multiplying eq (1) by 21, we get $4200 = 168a + 3528c$ --- (4)

Subtracting eq (3) from eq (4), we get $208 = -2688c \Rightarrow c = \dfrac{-208}{2688} = -0.0774$

Substituting $c = -0.0774$ in eq (1), we get $a = \dfrac{200 - 168(-0.0774)}{8} = 26.6254$

$\therefore y = 26.6254 + 1.3214x - 0.0774x^2$ --- (5)

is the required equation of the parabola.

To get the no. of students for the year 2000, we substitute x = 15 in eq (5)

$\therefore y = 26.6254 + 1.3214(15) - 0.0774(15)^2 = 29.0314$

To get the no. of students for the year 1987, we substitute x = -11 in eq (5)

$\therefore y = 26.6254 + 1.3214(-11) - 0.0774(-11)^2 = 2.7246$

Eg: 3) Fit a parabola of the form $y = a + bx + cx^2$

| Year | 1985 | 1986 | 1987 | 1988 | 1989 |
|---|---|---|---|---|---|
| Milk (in 100 litres) | 20 | 25 | 27 | 35 | 38 |

Find the production of milk for the years 1982 and 1995.

Solution: n = 5

| Year | Milk (y) | x | $x^2$ | $x^3$ | $x^4$ | xy | $x^2 y$ |
|---|---|---|---|---|---|---|---|
| 1985 | 20 | -2 | 4 | -8 | 16 | -40 | 80 |
| 1986 | 25 | -1 | 1 | -1 | 1 | -25 | 25 |
| 1987 | 27 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1988 | 35 | 1 | 1 | 1 | 1 | 35 | 35 |
| 1989 | 38 | 2 | 4 | 8 | 16 | 76 | 152 |
| Total | 145 | 0 | 10 | 0 | 34 | 46 | 292 |

## Page 197

The equation of the parabola is $y = a + bx + cx^2$. To find the values of a, b and c we solve the normal equations

$\sum y = na + b\sum x + c\sum x^2, \quad \sum xy = a\sum x + b\sum x^2 + c\sum x^3, \quad \sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$ --- (*)

Substituting the values in eq (*), we get

$145 = 5a + 10c$ --- (1)

$46 = 10b$ --- (2) $\Rightarrow b = 4.6$

$292 = 10a + 34c$ --- (3)

Multiplying eq (1) by 2, we get $290 = 10a + 20c$ --- (4)

Subtracting eq (4) from eq (3), we get $2 = 14c \Rightarrow c = \dfrac{2}{14} = 0.1428$

Substituting c = 0.1428 in eq (1), we get $a = 28.7144$

$\therefore y = 28.7144 + 4.6x + 0.1428x^2$ --- (5)

is the required equation of the parabola.

To get the production of milk for the year 1995, we put x = 8 in eq (5)

$\therefore y = 28.7144 + 4.6(8) + 0.1428(8)^2 = 74.6536$

To get the production of milk for the year 1982, we put x = -5 in eq (5)

$\therefore y = 28.7144 + 4.6(-5) + 0.1428(-5)^2 = 9.2844$

12.7.3 Non-Linear Relationship :

Fitting of a power curve:

Suppose we want to fit a curve whose equation is of the form $y = a \cdot x^b$ to variables x and y on which n paired observations are available. We first take logarithms on both sides and rewrite the equation of the curve as

$\log y = \log a + b\log x$

which can be written in the linear form $Y = A + bX$, where Y = log y, A = log a and X = log x. Therefore this curve can be fitted by the method of least squares by solving the following normal equations.

$\sum Y = nA + b\sum X, \quad \sum XY = A\sum X + b\sum X^2$

Once we get the value of A, we can find a = antilog (A). The method of fitting is explained in the following example.

Eg: 1) Fit a curve of the form $y = a \cdot x^b$ for the following data

| x | 1 | 2 | 4 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| y | 2.1 | 4.9 | 20.8 | 32.7 | 60 | 131 |

## Page 198

Solution: n = 6

We can write the given curve in the form $Y = A + bX$, where Y = log y, A = log a, X = log x

| x | y | X = log x | Y = log y | XY | $X^2$ |
|---|---|---|---|---|---|
| 1 | 2.1 | 0 | 0.3222 | 0 | 0 |
| 2 | 4.9 | 0.3010 | 0.6902 | 0.2078 | 0.0906 |
| 4 | 20.8 | 0.6021 | 1.3181 | 0.7936 | 0.3625 |
| 5 | 32.7 | 0.6990 | 1.5145 | 1.0586 | 0.4886 |
| 7 | 60 | 0.8451 | 1.7782 | 1.5028 | 0.7142 |
| 10 | 131 | 1 | 2.1173 | 2.1173 | 1 |
| Total | | 3.4472 | 7.7405 | 5.6801 | 2.6559 |

The given curve is of the form $y = a x^b$; therefore the normal equations are

$\sum Y = nA + b\sum X, \quad \sum XY = A\sum X + b\sum X^2$ --- (1)

Substituting the values in eq (1), we get

$7.7405 = 6A + 3.4472b$ --- (2)

$5.6801 = 3.4472A + 2.6559b$ --- (3)

Multiplying eq (2) by 3.4472 and eq (3) by 6, then subtracting the two equations, we get

$4.0522b = 7.3975 \Rightarrow b = \dfrac{7.3975}{4.0522} = 1.8256$

Substituting b = 1.8256 in eq (2), we get A = 0.2412

Now a = antilog(A) = antilog(0.2412) = 1.7426

$\therefore y = 1.7426x^{1.8256}$

is the required equation.
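The log-log transformation used in this example can be sketched in Python as follows. The function name `fit_power_curve` is an illustrative assumption. Because the sketch uses full-precision logarithms rather than four-figure tables, its results differ slightly from the hand calculation in the later decimal places.

```python
import math


def fit_power_curve(xs, ys):
    """Return (a, b) of the least squares curve y = a * x**b.

    Fits the straight line Y = A + b*X with X = log10(x), Y = log10(y),
    so all x and y values must be positive.
    """
    X = [math.log10(x) for x in xs]
    Y = [math.log10(y) for y in ys]
    n = len(X)
    sum_X = sum(X)
    sum_Y = sum(Y)
    sum_XY = sum(p * q for p, q in zip(X, Y))
    sum_XX = sum(p * p for p in X)
    b = (n * sum_XY - sum_X * sum_Y) / (n * sum_XX - sum_X ** 2)
    A = (sum_Y - b * sum_X) / n
    return 10 ** A, b  # a = antilog(A)


a, b = fit_power_curve([1, 2, 4, 5, 7, 10], [2.1, 4.9, 20.8, 32.7, 60, 131])
print(round(a, 3), round(b, 3))
```

Note that this procedure minimizes the squared deviations of log y, not of y itself, which is the same convention as in the worked example.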

Fitting of the curve $y = ae^{bx}$:

We first convert the equation $y = ae^{bx}$ to a linear equation by taking logarithms to base 10 on both sides, which gives

$\log y = \log a + bx\log_{10} e$

Let Y = log y, A = log a and $B = b\log_{10} e = 0.4343b$. Then the equation of the curve reduces to a linear equation of the form

$Y = A + Bx$

By the least squares method, the constants A and B can be found by solving the normal equations

$\sum Y = nA + B\sum x, \quad \sum xY = A\sum x + B\sum x^2$

## Page 199

Once A and B are found, we can find a = antilog (A) and $b = \dfrac{B}{0.4343}$.
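The transformation just described can be sketched in Python as follows. The function name `fit_exponential` is an illustrative assumption, and the data in the usage line are those of the worked example in this section; full-precision logarithms are used, so the last decimal places may differ slightly from a calculation with four-figure tables.

```python
import math


def fit_exponential(xs, ys):
    """Return (a, b) of the least squares curve y = a * exp(b*x).

    Fits the straight line Y = A + B*x with Y = log10(y), then converts
    back using a = antilog(A) and b = B / log10(e), where log10(e) is
    approximately 0.4343. All y values must be positive.
    """
    Y = [math.log10(y) for y in ys]
    n = len(xs)
    sum_x = sum(xs)
    sum_Y = sum(Y)
    sum_xY = sum(x * v for x, v in zip(xs, Y))
    sum_xx = sum(x * x for x in xs)
    B = (n * sum_xY - sum_x * sum_Y) / (n * sum_xx - sum_x ** 2)
    A = (sum_Y - B * sum_x) / n
    return 10 ** A, B / math.log10(math.e)


a, b = fit_exponential([0, 2, 4, 6, 8], [3, 55, 1095, 22000, 442000])
print(round(a, 3), round(b, 3))
```

Using natural logarithms instead of base 10 would give b directly as the slope; base 10 is kept here only to mirror the steps in the text.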

Eg: 1) Fit a curve of the form $y = ae^{bx}$ for the following data

| x | 0 | 2 | 4 | 6 | 8 |
|---|---|---|---|---|---|
| y | 3 | 55 | 1095 | 22000 | 442000 |

Solution: n = 5

| | x | y | Y = log y | xY | $x^2$ |
|---|---|---|---|---|---|
| | 0 | 3 | 0.4771 | 0 | 0 |
| | 2 | 55 | 1.7404 | 3.4808 | 4 |
| | 4 | 1095 | 3.0394 | 12.1576 | 16 |
| | 6 | 22000 | 4.3424 | 26.0544 | 36 |
| | 8 | 442000 | 5.6454 | 45.1632 | 64 |
| Total | 20 | | 15.2447 | 86.8560 | 120 |

The given equation is of the form $y = ae^{bx}$; therefore the normal equations are

$\sum Y = nA + B\sum x, \quad \sum xY = A\sum x + B\sum x^2$ --- (1)

Substituting the values in eq (1), we get

$15.2447 = 5A + 20B$ --- (2)

$86.8560 = 20A + 120B$ --- (3)

Multiplying eq (2) by 4 and subtracting it from eq (3), we get

$25.8772 = 40B \Rightarrow B = \dfrac{25.8772}{40} = 0.6469$

Substituting B = 0.6469 in eq (2), we get

$15.2447 = 5A + 20(0.6469) \Rightarrow A = \dfrac{2.3067}{5} = 0.4613$

Now a = antilog (A) = antilog(0.4613) = 2.892

$b = \dfrac{B}{0.4343} = \dfrac{0.6469}{0.4343} = 1.4895$

$\therefore y = 2.892e^{1.4895x}$

is the required equation.

12.8 REGRESSION

Often, on the basis of sample data, we wish to estimate the value of a variable y corresponding to a given value of a variable x. This can be accomplished by estimating the value of y from a least-squares curve that fits the sample data. The resulting curve is called a regression curve of y on x, since y is estimated from x.

## Page 200

If we want to estimate the value of x from a given value of y, we would

use a regression curve of x on y, which amounts to interchanging the

variables in the scatter diagram so that x is the dependent variable and y is

the independent variable. This is equivalent to replacing the vertical

deviations in the definition of the least -squares curve with horizontal

deviations.

In general, the regression line or curve of y on x is not the same as the

regression line or curve of x on y.

12.9 APPLICATIONS TO TIME SERIES

If the independent variable x is time, the data show the values of y at

various times. Data arranged according to time are called time series. The

regression line or curve of y on x in this case is often called a trend line or

trend curve and is often used for purposes of estimation, prediction, or

forecasting.

12.10 PROBLEMS INVOLVING MORE THAN TWO VARIABLES

Problems involving more than two variables can be treated in a manner analogous to that for two variables. For example, there may be a relationship between the three variables x, y and z that can be described by the equation

$z = a + bx + cy$

which is called a linear equation in the variables x, y and z.

In a three-dimensional rectangular coordinate system this equation represents a plane, and the actual sample points $(x_1, y_1, z_1), (x_2, y_2, z_2), \cdots, (x_N, y_N, z_N)$ may scatter not too far from this plane, which we call an approximating plane.

By extension of the least squares method, we can speak of a least squares plane approximating the data. If we are estimating z from given values of x and y, this is called a regression plane of z on x and y. The normal equations corresponding to the least squares plane are

$\sum z = aN + b\sum x + c\sum y$

$\sum xz = a\sum x + b\sum x^2 + c\sum xy$

$\sum yz = a\sum y + b\sum xy + c\sum y^2$

Problems involving the estimation of a variable from two or more variables are called problems of multiple regression, which are outside the scope of this syllabus.
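Although multiple regression is not pursued further here, the three normal equations above can be solved in the same way as those for the parabola. The following Python sketch is only an illustration: the names `solve_linear_system` and `fit_plane` are assumptions, and the sample points are hypothetical values chosen to lie exactly on the plane z = 1 + 2x + 3y.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination
    with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest pivot into position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_plane(xs, ys, zs):
    """Return [a, b, c] of the least squares plane z = a + b*x + c*y."""
    n = len(xs)
    sx, sy, sz = sum(xs), sum(ys), sum(zs)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxz = sum(x * z for x, z in zip(xs, zs))
    syz = sum(y * z for y, z in zip(ys, zs))
    matrix = [[n, sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]
    return solve_linear_system(matrix, [sz, sxz, syz])


# Hypothetical points lying exactly on z = 1 + 2x + 3y.
print(fit_plane([0, 1, 0, 1, 2], [0, 0, 1, 1, 1], [1, 3, 4, 6, 8]))
```

The points must not all lie on one straight line in the x-y plane, otherwise the system has no unique solution.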

12.11 SUMMARY

In this chapter we learnt about curve fitting, various equations of approximating curves, and the straight line method. We also learnt about fitting linear and non-linear curves using the method of least squares, which minimizes the sum of the squares of the errors.

## Page 201

12.12 EXERCISES

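1) Fit a straight line of the form y = ax + b using the least squares method.

| x | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| y | 2 | 5 | 8 | 11 |

2) Fit a straight line of the form y = a + bx

| x | 10 | 12 | 13 | 16 | 17 | 20 | 25 |
|---|---|---|---|---|---|---|---|
| y | 19 | 22 | 24 | 27 | 29 | 33 | 37 |

3) Fit a straight line of the form y = a + bx

| x | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 |
|---|---|---|---|---|---|---|---|---|---|
| y | 10 | 15 | 20 | 27 | 31 | 35 | 30 | 35 | 40 |

Also estimate y when x = 21.

4) Fit a parabola of the form $y = a + bx + cx^2$ using the least squares method.

| x | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| y | 1 | 0 | 3 | 10 | 21 |

5) Fit a parabola of the form $y = a + bx + cx^2$

| x | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| y | 1 | 1.8 | 1.3 | 2.5 | 6.3 |

6) Fit a parabola of the form $y = ax + bx^2$ using the least squares method.

| x | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| y | 1.8 | 5.1 | 8.9 | 14.1 | 19.8 |

7) Fit a curve of the form $y = ax^b$ for the following data

| x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| y | 0.7 | 0.86 | 0.97 | 1.06 |

8) Fit a curve of the form $y = ae^{bx}$ for the following data

| x | 0 | 2 | 4 |
|---|---|---|---|
| y | 5.012 | 10 | 31.62 |

9) Fit a curve of the form $y = ae^{bx}$ for the following data

| x | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| y | 1.230 | 2.042 | 3.162 | 3.981 | 5.624 |

## Page 202

10) The number y of bacteria per unit volume present in a culture after x hours is given in the following table. Fit a curve of the form $y = ab^x$ using the least squares method.

| x | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| y | 30 | 45 | 60 | 90 | 130 | 190 | 275 |

11) Fit a curve of the form $y = ab^x$ for the following data

| x | 1 | 2 | 3 |
|---|---|---|---|
| y | 8.3 | 15.4 | 33.1 |

12) The population of a state at ten-yearly intervals is given below. Fit a curve of the form $y = ab^x$ using the least squares method and also estimate the population for the year 1961.

| Year | 1881 | 1891 | 1901 | 1911 | 1921 | 1931 | 1941 | 1951 |
|---|---|---|---|---|---|---|---|---|
| Population (in millions) | 3.9 | 5.3 | 7.3 | 9.6 | 12.9 | 17.1 | 23.2 | 30.5 |

13) Fit a straight line of the form y = a + bx for the following data

| x | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|
| y | 1 | 2 | 3 | 4 | 5 |

14) Fit a straight line of the form y = a + bx for the following data

| x | 3 | 5 | 7 | 9 | 11 |
|---|---|---|---|---|---|
| y | 2.3 | 2.6 | 2.8 | 3.2 | 3.5 |

15) Fit a straight line to the following data on production

| Year | 1996 | 1997 | 1998 | 1999 | 2000 |
|---|---|---|---|---|---|
| Production | 40 | 50 | 62 | 58 | 60 |

16) Fit a straight line to the following data on profit

| Year | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 |
|---|---|---|---|---|---|---|---|---|
| Profit | 38 | 40 | 65 | 72 | 69 | 60 | 87 | 95 |

17) Fit a second degree parabolic equation of the form $y = a + bx + cx^2$

| x | 12 | 10 | 8 | 6 | 4 | 2 |
|---|---|---|---|---|---|---|
| y | 6 | 5 | 4 | 3 | 2 | 1 |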


## Page 203

12.13 SOLUTION TO EXERCISES

| Q. No. | Solution |
|---|---|
| 1 | 2, 3 |
| 2 | a = 7.7044, b = 1.213 |
| 3 | a = 11.43, b = 1.73 |
| 4 | 1, -3, 2 |
| 5 | 1.42, -1.07, 0.55 |
| 6 | a = 1.44, b = 0.51 |
| 7 | $y = 0.70x^{0.30}$ |
| 8 | $y = 4.642e^{0.46x}$ |
| 9 | $y = 0.7762e^{0.42x}$ |
| 10 | $y = 30(1.306)^x$ |
| 11 | $y = 8.099(1.997)^x$ |
| 12 | $y = 11.074(1.0298)^{x - 1916}$ |
| 13 | 6, -1 |
| 14 | 1.83, 0.15 |
| 15 | 54, 4.8 |
| 16 | 62.0833, 7.3333 |
| 17 | 0, 0.5, 0 |

12.14 LOGARITHM TABLES

## Page 205

Note: The logarithm tables were taken directly from the internet and inserted in picture format.

12.15 REFERENCE FOR FURTHER READING

The following books are recommended for further reading:

Statistics by Murray R. Spiegel and Larry J. Stephens, McGraw-Hill International, 4th edition

Fundamentals of Mathematical Statistics by S. C. Gupta and V. K. Kapoor, Sultan Chand and Sons, 11th edition

Descriptive Statistics by R. J. Shah

*****


## Page 206

13

CORRELATION THEORY

Unit Structure

13.0 Objectives

13.1 Introduction

13.2 Correlation

13.2.1 Scatter Diagram and Linear Correlation

13.2.2 Coefficient of Correlation

13.2.3 Coefficient of Rank Correlation

13.2.4 The Least Square Regression Lines

13.3 The Least Square Regression Lines

13.3.1 Regression

13.3.2 Least Square Method

13.3.3 Regression Lines and Regression Coefficients

13.4 Standard Error of Estimate

13.5 Explained and Unexplained Variation

13.6 Coefficient of determination

13.7 Summary

13.8 Exercises

13.9 Solution to exercises

13.10 Reference for further reading

13.0 OBJECTIVES

This chapter will help you understand the following concepts of correlation and regression:

Scatter Diagram

Linear Correlation

Least square regression

Explained and unexplained variation

Regression Lines

Correlation of Time series and Attributes

13.1 INTRODUCTION

By now we know how to find averages and the dispersion of a distribution involving one variable. These measures give a good idea of the structure of such a distribution.

Sometimes it is necessary to know the relationship between two variables. For instance, a businessman would like to know the effect of production on price. If the two are related, he would like to know the nature of the relationship and to use that knowledge to his benefit.

## Page 207

If we record the heights and weights of a group of students, we can see that in general taller students are heavier, though there will be some exceptions. As a person's income increases, he or she spends more; that is, there is some relation between the income and expenditure of any person. For such data we would like to answer the following questions:

i) Are the two variables related?

ii) If they are related, how?

iii) To what extent they are related?

Correlation helps to find out answers to these questions.

If two variables vary together in the same direction or in opposite directions, they are said to be correlated. If Y increases consistently as X increases, we say that X and Y are positively correlated, i.e., the variables are directly related to each other. In this case the values of X and Y for a particular individual have roughly the same relative position in their respective distributions: if X is far above the mean of X, then the corresponding Y will be above the mean of Y, and if X is near the mean of X, then Y will be near the mean of Y. For example, if the weight of the parents is above average, the son or daughter also tends to weigh more than average.

Some variables are negatively correlated: as X increases, Y decreases, and as X decreases, Y increases, i.e., the variables are inversely related to each other. For example, price increases as supply decreases; that is, as a commodity becomes scarce, its price increases.

If the change in one variable is proportional to the change in the other, the two variables are said to be perfectly correlated.

If the two variables are not related to each other, we say that they have zero correlation, e.g., the length of an individual's hair and the I.Q. level of the same individual.

13.2 CORRELATION

The various methods of finding whether two given variables are related or

not are:

Scatter diagram

Coefficient of Correlation

13.2.1 Scatter Diagram :

A scatter diagram can be obtained by simply plotting the points on a graph where the two variables, say x and y, are taken along the x-axis and y-axis respectively.

Eg: A manager of a firm may want to appoint salesmen to promote his sales. When he gets a number of applications, he conducts an aptitude test and selects candidates on the basis of their results. But after employing them, he wants to know whether there is any relation between the actual sales and the marks in the aptitude test.

## Page 208

The data for seven salesmen are as follows:

| Salesman | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Aptitude test score | 47 | 49 | 60 | 55 | 59 | 70 | 83 |
| Actual sales (in Rs. '000) | 70 | 69 | 80 | 75 | 75 | 87 | 90 |

Solution:

Here the test scores are plotted on the X-axis and the actual sales on the Y-axis.

[Figure: scatter diagram of actual sales (in Rs. '000) against test scores]

Here we can see that as X increases, Y also increases; therefore the two variables are positively related. The businessman learns from this that it is useful to select salesmen on the basis of the aptitude test conducted in this case.

Various possible patterns are shown below.


## Page 209

Types of Correlation:

Perfect Positive Correlation: When all plotted values lie exactly on a straight line that runs from the lower left to the upper right corner, the correlation is called perfect positive correlation. Here r = 1.

Perfect Negative Correlation: When all plotted values lie exactly on a straight line that runs from the upper left to the lower right corner, the correlation is called perfect negative correlation. Here r = -1.

Positive Correlation: If the values of the two variables deviate in the same direction, the correlation is known as positive correlation or direct correlation.

Negative Correlation: If the values of the two variables deviate in opposite directions, the correlation is known as negative correlation or inverse correlation.

No Correlation: The variables are unrelated, i.e., there is no relation between the two variables. Here r = 0.

13.2.2 Coefficient of Correlation :

The discussion in the previous section indicates only the direction of the correlation. We also need an exact numerical measure of the degree or extent of correlation. It is useful to have a numerical measure that is independent of the units of the original data, so that different pairs of variables can be compared. For this we calculate the coefficient of correlation, denoted by $r$.

Definition: The coefficient of correlation, denoted by $r$ and named after Karl Pearson, is defined as

$r = r_{x,y} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{N\sigma_x\sigma_y}$

$\sigma_x = \sqrt{\dfrac{\sum (x - \bar{x})^2}{N}}, \quad \sigma_y = \sqrt{\dfrac{\sum (y - \bar{y})^2}{N}}$

where there are N pairs of observations.

This is also called the product moment coefficient of correlation.

The covariance of x and y is defined as $cov(x, y) = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{N}$

$\therefore r = \dfrac{cov(x, y)}{\sigma_x\sigma_y}$

The formula for r can be simplified as

$r = \dfrac{\sum xy - N\bar{x}\bar{y}}{\sqrt{\sum x^2 - N\bar{x}^2}\sqrt{\sum y^2 - N\bar{y}^2}} = \dfrac{\sum xy - \frac{\sum x\sum y}{N}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{N}}\sqrt{\sum y^2 - \frac{(\sum y)^2}{N}}}$

## Page 210

Properties of Correlation:

1) $-1 \le r \le 1$

2) If r = 1, there is perfect positive correlation

3) If r = -1, there is perfect negative correlation

4) If 0 < r < 1, the correlation is positive

5) If -1 < r < 0, the correlation is negative

6) If r = 0, there is no correlation

7) r is a pure number and is not affected in magnitude by a change of origin and scale, i.e., if $u = \dfrac{x - a}{b}$ and $v = \dfrac{y - c}{d}$, then $r_{xy} = \dfrac{bd}{|b||d|} r_{uv}$

a) If b and d are of the same sign, then $r_{xy} = r_{uv}$

b) If b and d are of opposite signs, then $r_{xy} = -r_{uv}$

8) If y = ax + b, then $r_{xy} = 1$ if a > 0 and $r_{xy} = -1$ if a < 0

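Eg 1: $r_{xy} = 0.6$, find $r_{uv}$ if 2u - 3x + 4 = 0, 4v - 16y + 11 = 0

Solution: $2u - 3x + 4 = 0 \Rightarrow u = \dfrac{3x - 4}{2} = \dfrac{x - \frac{4}{3}}{\frac{2}{3}} \Rightarrow b = \dfrac{2}{3}$

$4v - 16y + 11 = 0 \Rightarrow v = \dfrac{16y - 11}{4} = \dfrac{y - \frac{11}{16}}{\frac{4}{16}} \Rightarrow d = \dfrac{4}{16}$

b and d are of the same sign $\therefore r_{uv} = r_{xy} = 0.6$

Eg 2: $r_{xy} = 0.6$, find $r_{uv}$ if 2u + 3x + 4 = 0, 4v - 16y + 11 = 0

Solution: $2u + 3x + 4 = 0 \Rightarrow u = -\dfrac{3x + 4}{2} = \dfrac{x + \frac{4}{3}}{-\frac{2}{3}} \Rightarrow b = -\dfrac{2}{3}$

$4v - 16y + 11 = 0 \Rightarrow v = \dfrac{16y - 11}{4} = \dfrac{y - \frac{11}{16}}{\frac{4}{16}} \Rightarrow d = \dfrac{4}{16}$

b and d are of opposite signs $\therefore r_{uv} = -r_{xy} = -0.6$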

## Page 211

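Eg 3: If 2x + y = 3, what is the value of $r_{xy}$?

Solution: $2x + y = 3 \Rightarrow y = -2x + 3 \Rightarrow a = -2 < 0 \Rightarrow r_{xy} = -1$

Eg 4: Calculate Karl Pearson's coefficient of correlation given $cov(x, y) = -15, \sigma_x = 5, \sigma_y = 4$.

Solution: $r = \dfrac{cov(x, y)}{\sigma_x\sigma_y} = \dfrac{-15}{5 \times 4} = -\dfrac{15}{20} = -\dfrac{3}{4} = -0.75$

Eg 5: Calculate the coefficient of correlation for the following:

| x | -2 | -1 | 0 | 1 | 2 |
|---|---|---|---|---|---|
| y | 4 | 1 | 0 | 1 | 4 |

Solution: $N = 5, \sum x = 0, \sum y = 10, \sum (x - \bar{x})(y - \bar{y}) = 0, \sigma_x = \sqrt{\dfrac{10}{5}} = \sqrt{2}, \sigma_y = \sqrt{\dfrac{14}{5}} = \sqrt{2.8}$

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{N\sigma_x\sigma_y} = \dfrac{0}{5 \cdot \sqrt{2} \cdot \sqrt{2.8}} = 0$

Remark: If there is no correlation between the two variables, then r = 0, but the converse is not true: r = 0 means only that there is no linear relationship. In this example the values are related by the equation $y = x^2$, yet the observed value of the coefficient of correlation is zero.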

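Eg 6: Calculate the coefficient of correlation for the following:

| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 2 | 4 | 9 | 7 | 10 | 5 | 14 | 16 | 2 | 20 |

Solution: N = 10

| | | | | | | | | | | | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 55 |
| y | 2 | 4 | 9 | 7 | 10 | 5 | 14 | 16 | 2 | 20 | 89 |
| xy | 2 | 8 | 27 | 28 | 50 | 30 | 98 | 128 | 18 | 200 | 589 |
| $x^2$ | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 81 | 100 | 385 |
| $y^2$ | 4 | 16 | 81 | 49 | 100 | 25 | 196 | 256 | 4 | 400 | 1131 |

$r = \frac{\sum xy - \frac{\sum x \sum y}{N}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{N}}\,\sqrt{\sum y^2 - \frac{(\sum y)^2}{N}}} = \frac{589 - \frac{55 \times 89}{10}}{\sqrt{385 - \frac{(55)^2}{10}}\,\sqrt{1131 - \frac{(89)^2}{10}}}$

$r = \frac{589 - 489.5}{\sqrt{82.5} \times \sqrt{338.9}} = \frac{99.5}{(9.08)(18.41)} = 0.6$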
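The computational formula used in Eg 6 translates directly into code. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` is chosen here for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw sums: sum x, sum y, sum xy, sum x^2, sum y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return numerator / denominator

# Data of Eg 6.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 9, 7, 10, 5, 14, 16, 2, 20]
r = pearson_r(x, y)
print(round(r, 2))  # 0.6

# Data of Eg 5: y = x^2 on a symmetric range gives r = 0.
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The second call reproduces the remark after Eg 5: an exact non-linear relation can still give a zero coefficient.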

Eg 7: Find the coefficient of correlation for the following:

| x | 60 | 50 | 45 | 47 | 53 | 70 | 75 | 57 | 73 | 48 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 30 | 29 | 29 | 28 | 29 | 35 | 40 | 32 | 35 | 28 |

## Page 212

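Solution: N = 10; we take a = 60, c = 35 (assumed means)

| | | | | | | | | | | | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| x | 60 | 50 | 45 | 47 | 53 | 70 | 75 | 57 | 73 | 48 | |
| y | 30 | 29 | 29 | 28 | 29 | 35 | 40 | 32 | 35 | 28 | |
| u = x - 60 | 0 | -10 | -15 | -13 | -7 | 10 | 15 | -3 | 13 | -12 | -22 |
| v = y - 35 | -5 | -6 | -6 | -7 | -6 | 0 | 5 | -3 | 0 | -7 | -35 |
| uv | 0 | 60 | 90 | 91 | 42 | 0 | 75 | 9 | 0 | 84 | 451 |
| $u^2$ | 0 | 100 | 225 | 169 | 49 | 100 | 225 | 9 | 169 | 144 | 1190 |
| $v^2$ | 25 | 36 | 36 | 49 | 36 | 0 | 25 | 9 | 0 | 49 | 265 |

Here b and d are the same, i.e. 1, $\therefore r_{xy} = r_{uv} = \frac{\sum uv - \frac{\sum u \sum v}{N}}{\sqrt{\sum u^2 - \frac{(\sum u)^2}{N}}\,\sqrt{\sum v^2 - \frac{(\sum v)^2}{N}}}$

$= \frac{451 - \frac{(-22)(-35)}{10}}{\sqrt{1190 - \frac{(-22)^2}{10}}\,\sqrt{265 - \frac{(-35)^2}{10}}} = \frac{374}{\sqrt{1141.6}\,\sqrt{142.5}} = 0.927$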

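Eg 8: Calculate the coefficient of correlation for the following:

| x | 5 | 10 | 10 | 15 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|---|
| y | 15 | 17 | 17 | 19 | 21 | 21 | 19 | 17 |

Solution: N = 8; we take a = 20, c = 19, b = 5, d = 2

| | | | | | | | | | Total |
|---|---|---|---|---|---|---|---|---|---|
| x | 5 | 10 | 10 | 15 | 15 | 20 | 25 | 30 | |
| y | 15 | 17 | 17 | 19 | 21 | 21 | 19 | 17 | |
| $u = \frac{x - 20}{5}$ | -3 | -2 | -2 | -1 | -1 | 0 | 1 | 2 | -6 |
| $v = \frac{y - 19}{2}$ | -2 | -1 | -1 | 0 | 1 | 1 | 0 | -1 | -3 |
| uv | 6 | 2 | 2 | 0 | -1 | 0 | 0 | -2 | 7 |
| $u^2$ | 9 | 4 | 4 | 1 | 1 | 0 | 1 | 4 | 24 |
| $v^2$ | 4 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 9 |

Since b and d are of the same sign, $r_{xy} = r_{uv} = \frac{\sum uv - \frac{\sum u \sum v}{N}}{\sqrt{\sum u^2 - \frac{(\sum u)^2}{N}}\,\sqrt{\sum v^2 - \frac{(\sum v)^2}{N}}}$

$r_{xy} = r_{uv} = \frac{7 - \frac{(-6)(-3)}{8}}{\sqrt{24 - \frac{(-6)^2}{8}}\,\sqrt{9 - \frac{(-3)^2}{8}}} = \frac{7 - 2.25}{\sqrt{19.5}\,\sqrt{7.87}} = 0.38$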

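Eg 9: From the data given below find the number of items n: $r = 0.5$, $\sum (x - \bar{x})(y - \bar{y}) = 120$, $\sigma_y = 8$, $\sum (x - \bar{x})^2 = 90$.

Solution: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n \sigma_x \sigma_y}$, $\sigma_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$, $\sigma_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}}$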

## Page 213

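$0.5 = \frac{120}{n \times \sqrt{\frac{90}{n}} \times 8} \Rightarrow 0.5 = \frac{120}{\sqrt{n^2 \times \frac{90}{n}} \times 8}$

$0.5 = \frac{120}{\sqrt{90n} \times 8} \Rightarrow \sqrt{90n} = \frac{120}{0.5 \times 8} = 30$

Squaring, we get $90n = 900 \Rightarrow n = 10$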

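Eg 10: Calculate the coefficient of correlation between x and y.

| | x | y |
|---|---|---|
| no. of observations | 15 | 15 |
| arithmetic mean | 25 | 18 |
| standard deviation | 3.01 | 3.03 |
| sum of squares of the deviations from the arithmetic mean | 136 | 138 |

$\sum (x - \bar{x})(y - \bar{y}) = 122$

Solution: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n \sigma_x \sigma_y}$, $\sigma_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$, $\sigma_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}}$

$r = \frac{122}{15 \times \sqrt{\frac{136}{15}} \times \sqrt{\frac{138}{15}}} = \frac{122}{15 \times 3.01 \times 3.03} = 0.89$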

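Eg 11: n = 20, r = 0.3, $\bar{x} = 15$, $\bar{y} = 20$, $\sigma_x = 4$, $\sigma_y = 5$. One pair (27, 30) was wrongly taken as (17, 35). Find the corrected value of r.

Solution: $\bar{x} = 15$, $\frac{\sum x}{n} = 15 \Rightarrow \sum x = 15 \times 20 = 300$

corrected value of $\bar{x} = \frac{300 - 17 + 27}{20} = \frac{310}{20} = 15.5$

$\bar{y} = 20$, $\frac{\sum y}{n} = 20 \Rightarrow \sum y = 20 \times 20 = 400$

corrected value of $\bar{y} = \frac{400 - 35 + 30}{20} = \frac{395}{20} = 19.75$

$\sigma_x = 4 \Rightarrow (\sigma_x)^2 = 16 \Rightarrow \frac{\sum x^2}{n} - (\bar{x})^2 = 16$

$\sum x^2 = (16 + 225) \times 20 = 4820$

corrected value of $\sum x^2 = 4820 + 27 \times 27 - 17 \times 17 = 5260$

corrected value of $\sigma_x = \sqrt{\frac{5260}{20} - (15.5)^2} = \sqrt{22.75} = 4.77$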

## Page 214

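$\sigma_y = 5 \Rightarrow (\sigma_y)^2 = 25 \Rightarrow \frac{\sum y^2}{n} - (\bar{y})^2 = 25$

$\sum y^2 = (25 + 400) \times 20 = 8500$

corrected value of $\sum y^2 = 8500 + 30 \times 30 - 35 \times 35 = 8175$

corrected value of $\sigma_y = \sqrt{\frac{8175}{20} - (19.75)^2} = \sqrt{18.6875} = 4.32$

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n \sigma_x \sigma_y} \Rightarrow 0.3 = \frac{\sum (x - \bar{x})(y - \bar{y})}{20 \times 4 \times 5} \Rightarrow \sum (x - \bar{x})(y - \bar{y}) = 120$

Since both means change, every deviation changes, so the correction is made on $\sum xy$ rather than on the deviations:

$\sum xy = \sum (x - \bar{x})(y - \bar{y}) + n \bar{x} \bar{y} = 120 + 20 \times 15 \times 20 = 6120$

corrected value of $\sum xy = 6120 + 27 \times 30 - 17 \times 35 = 6335$

corrected value of $\sum (x - \bar{x})(y - \bar{y}) = 6335 - 20 \times 15.5 \times 19.75 = 6335 - 6122.5 = 212.5$

corrected value of $r = \frac{212.5}{20 \times 4.77 \times 4.32} = 0.52$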
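A correction problem of this kind can be cross-checked by rebuilding the raw sums from the given summary statistics, swapping the wrong pair for the right one, and recomputing r from scratch. The sketch below does this for the figures of Eg 11; all variable names are chosen here for illustration.

```python
from math import sqrt

# Summary statistics as given in Eg 11 (computed with the wrong pair).
n = 20
mean_x, mean_y = 15, 20
sd_x, sd_y = 4, 5
r_given = 0.3
wrong_pair, right_pair = (17, 35), (27, 30)

# Rebuild the raw sums implied by the summary statistics.
sum_x = n * mean_x
sum_y = n * mean_y
sum_x2 = n * (sd_x ** 2 + mean_x ** 2)
sum_y2 = n * (sd_y ** 2 + mean_y ** 2)
sum_xy = n * (r_given * sd_x * sd_y + mean_x * mean_y)

# Remove the wrong pair and add the right one in every sum.
sum_x += right_pair[0] - wrong_pair[0]
sum_y += right_pair[1] - wrong_pair[1]
sum_x2 += right_pair[0] ** 2 - wrong_pair[0] ** 2
sum_y2 += right_pair[1] ** 2 - wrong_pair[1] ** 2
sum_xy += right_pair[0] * right_pair[1] - wrong_pair[0] * wrong_pair[1]

# Recompute means, variances, covariance and r from the corrected sums.
new_mean_x, new_mean_y = sum_x / n, sum_y / n
var_x = sum_x2 / n - new_mean_x ** 2
var_y = sum_y2 / n - new_mean_y ** 2
cov_xy = sum_xy / n - new_mean_x * new_mean_y
r_corrected = cov_xy / sqrt(var_x * var_y)
print(round(r_corrected, 3))
```

Working with raw sums avoids having to adjust each deviation for the shifted means.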

13.2.3 Coefficient of Rank Correlation:

For certain characteristics, e.g. beauty or smartness, numerical measurements are not possible, but the individuals can be ranked in order according to our own judgment. If two judges rank a given group of individuals and we have to find how far the two judges agree with each other, the technique of rank correlation can be used. In some cases, though actual measurements are available, we may be interested only in the ranks, that is, the relative position of each individual in the group. Here also rank correlation is used.

The formula for Spearman's rank correlation coefficient is

$R = 1 - \frac{6 \sum d^2}{N(N^2 - 1)}$, where d = difference between the ranks of the same individual, N = number of individuals.

Remark: R has the same properties as r.

Eg 1: The ranks according to two judges in a beauty contest are

| R1 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| R2 | 4 | 1 | 2 | 3 | 6 | 5 |

Find the coefficient of rank correlation.

## Page 215

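Solution: N = 6

| | | | | | | | Total |
|---|---|---|---|---|---|---|---|
| R1 ($d_1$) | 1 | 2 | 3 | 4 | 5 | 6 | |
| R2 ($d_2$) | 4 | 1 | 2 | 3 | 6 | 5 | |
| $d = d_1 - d_2$ | -3 | 1 | 1 | 1 | -1 | 1 | |
| $d^2$ | 9 | 1 | 1 | 1 | 1 | 1 | 14 |

$R = 1 - \frac{6 \sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 14}{6(36 - 1)} = 1 - \frac{14}{35} = \frac{21}{35} = \frac{3}{5} = 0.6$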

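Eg 2: Find Spearman's rank correlation coefficient between cost and sales for the following:

| cost | 39 | 65 | 62 | 90 | 82 | 75 | 25 | 98 | 36 | 78 |
|---|---|---|---|---|---|---|---|---|---|---|
| sales | 47 | 53 | 58 | 86 | 62 | 68 | 60 | 91 | 51 | 84 |

Solution: N = 10. Here $d_1$ and $d_2$ are the ranks of cost and sales, with rank 1 given to the largest value.

| | | | | | | | | | | | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cost (X) | 39 | 65 | 62 | 90 | 82 | 75 | 25 | 98 | 36 | 78 | |
| sales (Y) | 47 | 53 | 58 | 86 | 62 | 68 | 60 | 91 | 51 | 84 | |
| $d_1$ | 8 | 6 | 7 | 2 | 3 | 5 | 10 | 1 | 9 | 4 | |
| $d_2$ | 10 | 8 | 7 | 2 | 5 | 4 | 6 | 1 | 9 | 3 | |
| $d = d_1 - d_2$ | -2 | -2 | 0 | 0 | -2 | 1 | 4 | 0 | 0 | 1 | |
| $d^2$ | 4 | 4 | 0 | 0 | 4 | 1 | 16 | 0 | 0 | 1 | 30 |

$R = 1 - \frac{6 \sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 30}{10(100 - 1)} = 1 - \frac{18}{99} = \frac{81}{99} = \frac{9}{11} = 0.8182$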
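The ranking and the formula of Eg 2 can be sketched in a few lines. This is an illustration only: the helper names are chosen here, rank 1 is given to the largest value as in the worked solution, and the ranking helper assumes there are no tied values.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """Spearman's R = 1 - 6 * sum(d^2) / (N (N^2 - 1)) for untied data."""
    n = len(xs)
    rank_x, rank_y = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Data of Eg 2.
cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]
cost_ranks = ranks_descending(cost)
sales_ranks = ranks_descending(sales)
R = spearman_r(cost, sales)
print(round(R, 4))  # 0.8182
```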

Coefficient of Rank correlation when ranks are repeated

In the above example the ranks were different for all the items, but in some cases two or more items may have the same numerical value, and the ranks should then be the same for these values. Suppose we have given ranks 1, 2, 3 and the next two values have to be given the same rank. In this case the next two ranks, 4 and 5, are distributed equally: both individuals get the rank $\frac{4 + 5}{2} = 4.5$, and the next one gets the rank 6.

When there are groups sharing the same rank, there is an adjustment in the formula also. If $m_1, m_2, m_3, \ldots$ denote the number of times each repeated rank appears, the coefficient of rank correlation is

$R = 1 - \frac{6\{\sum d^2 + CF\}}{N(N^2 - 1)}$, where $CF = \frac{1}{12}\left[(m_1^3 - m_1) + (m_2^3 - m_2) + (m_3^3 - m_3) + \cdots\right]$, CF = Correction Factor

## Page 216

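Eg 1: The marks of 10 students in two I.Q. tests carried out in two successive months are given below. Find the coefficient of rank correlation.

| marks test 1 | 75 | 60 | 60 | 73 | 55 | 57 | 53 | 72 | 65 | 69 |
|---|---|---|---|---|---|---|---|---|---|---|
| marks test 2 | 65 | 65 | 64 | 70 | 58 | 60 | 58 | 68 | 63 | 65 |

Solution: N = 10. Tied marks share the average of the ranks they occupy: in test 1 the two marks of 60 occupy ranks 6 and 7 and get 6.5 each; in test 2 the three marks of 65 occupy ranks 3, 4 and 5 and get 4 each, and the two marks of 58 occupy ranks 9 and 10 and get 9.5 each.

| | | | | | | | | | | | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| marks test 1 | 75 | 60 | 60 | 73 | 55 | 57 | 53 | 72 | 65 | 69 | |
| marks test 2 | 65 | 65 | 64 | 70 | 58 | 60 | 58 | 68 | 63 | 65 | |
| $d_1$ | 1 | 6.5 | 6.5 | 2 | 9 | 8 | 10 | 3 | 5 | 4 | |
| $d_2$ | 4 | 4 | 6 | 1 | 9.5 | 8 | 9.5 | 2 | 7 | 4 | |
| $d = d_1 - d_2$ | -3 | 2.5 | 0.5 | 1 | -0.5 | 0 | 0.5 | 1 | -2 | 0 | |
| $d^2$ | 9 | 6.25 | 0.25 | 1 | 0.25 | 0 | 0.25 | 1 | 4 | 0 | 22 |

$m_1 = 2$ (number of times the mark 60 is repeated in test 1)

$m_2 = 3$ (number of times the mark 65 is repeated in test 2)

$m_3 = 2$ (number of times the mark 58 is repeated in test 2)

$CF = \frac{1}{12}\{(m_1^3 - m_1) + (m_2^3 - m_2) + (m_3^3 - m_3)\} = \frac{1}{12}\{6 + 24 + 6\} = \frac{1}{12}\{36\} = 3$

$R = 1 - \frac{6\{\sum d^2 + CF\}}{n(n^2 - 1)} = 1 - \frac{6\{22 + 3\}}{10(99)} = 1 - \frac{6 \times 25}{990} = 1 - \frac{15}{99} = \frac{84}{99} = 0.8485$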
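The tied-rank procedure of this example can be sketched as follows. The helper names are introduced here for illustration; the code assigns average ranks to ties (rank 1 for the largest mark) and applies the correction factor CF exactly as defined above.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1           # first rank occupied by v
        count = ordered.count(v)               # how many times v occurs
        rank_of[v] = first + (count - 1) / 2   # mean of first .. first+count-1
    return [rank_of[v] for v in values]

def correction_factor(*series):
    """CF = (1/12) * sum of (m^3 - m) over every group of m tied values."""
    total = 0
    for values in series:
        for m in Counter(values).values():
            total += m ** 3 - m                # contributes 0 when m == 1
    return total / 12

def spearman_tied(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    cf = correction_factor(xs, ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n * n - 1))

test_1 = [75, 60, 60, 73, 55, 57, 53, 72, 65, 69]
test_2 = [65, 65, 64, 70, 58, 60, 58, 68, 63, 65]
ranks_1 = average_ranks(test_1)
ranks_2 = average_ranks(test_2)
cf = correction_factor(test_1, test_2)
R = spearman_tied(test_1, test_2)
print(round(R, 4))  # 0.8485
```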

Eg 2: The coefficient of rank correlation for certain data is found to be 0.6. If the sum of the squares of the differences is given to be 66, find the number of observations.

Solution: Given R = 0.6, $\sum d^2 = 66$. To find the number of observations, i.e. to find N:

$R = 1 - \frac{6 \sum d^2}{N(N^2 - 1)} \Rightarrow 0.6 = 1 - \frac{6 \times 66}{N(N^2 - 1)}$

$\Rightarrow \frac{6 \times 66}{N(N^2 - 1)} = 1 - 0.6$

## Page 217

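$\Rightarrow \frac{6 \times 66}{N(N^2 - 1)} = 0.4 \Rightarrow N(N^2 - 1) = \frac{6 \times 66}{0.4}$

$\Rightarrow (N - 1)N(N + 1) = 990$

$\Rightarrow (N - 1)N(N + 1) = 9 \times 10 \times 11 \Rightarrow N = 10$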

13.3 THE LEAST SQUARE REGRESSION LINES

When we know that two given variables are correlated, we try to establish a relation between the two so that we can estimate the value of one of the variables given the value of the other. E.g. if we know that there is positive correlation between the heights and weights of a group of individuals, we can find an equation relating height and weight, and then estimate the weight of an individual belonging to the same population given his height.

The correlation coefficient only determines whether the variables are related and, if so, how strong the relationship is; by itself it does not give an equation for prediction. The equations used for prediction or estimation are known as regression equations, also called estimation equations. With the help of regression analysis we establish a model which expresses the functional relationship between the two variables.

13.3.1 Regression Lines :

We can find two straight lines for the bivariate data. The first is used for estimating Y when X is given; it is known as the regression of Y on X, and here X is considered the independent variable. The second is used for estimating X when Y is given; it is known as the regression of X on Y, and here Y is considered the independent variable. In each case we fit a straight line to the set of points given in the bivariate data.

13.3.2 Least Square Method:

We know that a degree one equation represents a straight line. Therefore we take the equation of the regression line of Y on X as $Y = a + bX$, where a and b are constants. The constant a determines the point where the line cuts the Y-axis and the constant b determines the slope of the line. The method of least squares is used to determine the constants: we get a and b by solving the following two equations simultaneously, which are known as normal equations.

$\sum Y = Na + b \sum X$, $\quad \sum XY = a \sum X + b \sum X^2$

Similarly, we take the regression of X on Y as $X = c + dY$, where c and d are constants which can be determined by solving the following normal equations.

$\sum X = Nc + d \sum Y$, $\quad \sum XY = c \sum Y + d \sum Y^2$

## Page 218

Eg 1: Find the regression of Y on X and X on Y for the following data. Also estimate Y when X = 7 and estimate X when Y = 16.

| X | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Y | 10 | 12 | 15 | 14 | 15 |

Solution: N = 5

| | | | | | | Total |
|---|---|---|---|---|---|---|
| X | 1 | 2 | 3 | 4 | 5 | 15 |
| Y | 10 | 12 | 15 | 14 | 15 | 66 |
| XY | 10 | 24 | 45 | 56 | 75 | 210 |
| $X^2$ | 1 | 4 | 9 | 16 | 25 | 55 |
| $Y^2$ | 100 | 144 | 225 | 196 | 225 | 890 |

Normal equations for the regression of Y on X are given by

$\sum Y = Na + b \sum X$, i.e. $66 = 5a + 15b$

$\sum XY = a \sum X + b \sum X^2$, i.e. $210 = 15a + 55b$

Solving the two equations simultaneously we get a = 9.6 and b = 1.2

∴ The regression line of Y on X is given by $Y = 9.6 + 1.2X$ --- (1)

For estimating the value of Y when X = 7, we put X = 7 in eq (1)

∴ The estimate will be Y = 9.6 + 1.2(7) = 18

Normal equations for the regression of X on Y are given by

$\sum X = Nc + d \sum Y$, i.e. $15 = 5c + 66d$

$\sum XY = c \sum Y + d \sum Y^2$, i.e. $210 = 66c + 890d$

Solving the two equations simultaneously we get c = -5.4255, d = 0.6383

∴ The regression line of X on Y is given by $X = -5.4255 + 0.6383Y$ --- (2)

For estimating the value of X when Y = 16, we put Y = 16 in eq (2)

∴ The estimate will be X = -5.4255 + 0.6383(16) = 4.7873
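The two pairs of normal equations in this example can be solved in closed form. The sketch below is an illustration with a helper name, `fit_line`, chosen here; it eliminates the intercept from the two normal equations and is called once for each regression line.

```python
def fit_line(xs, ys):
    """Solve the normal equations for ys = a + b * xs and return (a, b)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Eliminating a from the two normal equations gives the slope b.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

X = [1, 2, 3, 4, 5]
Y = [10, 12, 15, 14, 15]

a, b = fit_line(X, Y)   # regression of Y on X: Y = a + bX
c, d = fit_line(Y, X)   # regression of X on Y: X = c + dY

y_at_7 = a + b * 7      # estimate of Y when X = 7
x_at_16 = c + d * 16    # estimate of X when Y = 16
print(round(a, 4), round(b, 4), round(c, 4), round(d, 4))
```

Swapping the roles of the two lists is all that is needed to switch from the regression of Y on X to the regression of X on Y.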

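13.3.3 Regression Lines and Regression Coefficients:

The regression of Y on X is given by

$Y - \bar{Y} = b_{YX}(X - \bar{X})$, $\quad b_{YX} = r\frac{\sigma_Y}{\sigma_X}$

The regression of X on Y is given by

$X - \bar{X} = b_{XY}(Y - \bar{Y})$, $\quad b_{XY} = r\frac{\sigma_X}{\sigma_Y}$

where $b_{YX}$ and $b_{XY}$ are known as regression coefficients.

Properties:

1) $b_{YX} = r\frac{\sigma_Y}{\sigma_X}$, $b_{XY} = r\frac{\sigma_X}{\sigma_Y}$

$r^2 = b_{YX} \times b_{XY} \Rightarrow r = \pm\sqrt{b_{YX} \times b_{XY}}$

$r > 0$ if $b_{YX}$ and $b_{XY} > 0$

$r < 0$ if $b_{YX}$ and $b_{XY} < 0$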

## Page 219

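2) If $u = X - a$, $v = Y - b$, then $b_{XY} = b_{uv}$ and $b_{YX} = b_{vu}$

3) $b_{YX} = r\frac{\sigma_Y}{\sigma_X} = \frac{cov(X, Y)}{\sigma_X^2} = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sum X^2 - \frac{(\sum X)^2}{N}}$

4) $b_{XY} = r\frac{\sigma_X}{\sigma_Y} = \frac{cov(X, Y)}{\sigma_Y^2} = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sum Y^2 - \frac{(\sum Y)^2}{N}}$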

Eg: 1) The following data are given about the expenditure on clothes and the expenditure on entertainment: average expenditure on clothes Rs. 300, average expenditure on entertainment Rs. 100, S.D. of expenditure on clothes Rs. 20, S.D. of expenditure on entertainment Rs. 15, coefficient of correlation 0.78. Find the two regression equations.

Solution: Let the expenditure on clothes be denoted by x and the expenditure on entertainment be denoted by y.

Given $\bar{x} = 300$, $\bar{y} = 100$, $\sigma_x = 20$, $\sigma_y = 15$, $r = 0.78$

$\therefore b_{xy} = r\frac{\sigma_x}{\sigma_y} = 0.78 \times \frac{20}{15} = 0.78 \times \frac{4}{3} = 1.04$

and $b_{yx} = r\frac{\sigma_y}{\sigma_x} = 0.78 \times \frac{15}{20} = 0.78 \times \frac{3}{4} = 0.585$

The regression equation of y on x is given by

$(y - \bar{y}) = b_{yx}(x - \bar{x}) \Rightarrow y - 100 = 0.585(x - 300)$

$-0.585x + y = 100 - 175.5 \Rightarrow y = 0.585x - 75.5$

The regression equation of x on y is given by

$(x - \bar{x}) = b_{xy}(y - \bar{y}) \Rightarrow x - 300 = 1.04(y - 100)$

$x = 1.04y + 300 - 104 \Rightarrow x = 1.04y + 196$

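Eg 2: If $\sum x = 37$, $\sum y = 71$, $\sum xy = 563$, $\sum x^2 = 297$, $\sum y^2 = 1079$, $n = 5$, find the two regression equations.

Solution: $\bar{x} = \frac{37}{5} = 7.4$, $\bar{y} = \frac{71}{5} = 14.2$

$b_{yx} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{563 - \frac{37 \times 71}{5}}{297 - \frac{(37)^2}{5}}$

$b_{yx} = \frac{563 \times 5 - 37 \times 71}{297 \times 5 - 37 \times 37} = \frac{2815 - 2627}{1485 - 1369} = \frac{188}{116} = 1.62$

$b_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} = \frac{563 - \frac{37 \times 71}{5}}{1079 - \frac{(71)^2}{5}}$

$b_{xy} = \frac{563 \times 5 - 37 \times 71}{1079 \times 5 - 71 \times 71} = \frac{188}{354} = 0.53$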

## Page 220

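Regression equation of y on x:

$y - \bar{y} = b_{yx}(x - \bar{x}) \Rightarrow y - 14.2 = 1.62 \times (x - 7.4)$

$y = 1.62x + 2.21$

Regression equation of x on y:

$x - \bar{x} = b_{xy}(y - \bar{y}) \Rightarrow x - 7.4 = 0.53 \times (y - 14.2)$

$x = 0.53y - 0.126$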

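Eg 3: Find the two regression lines for the following data:

| Price in Rs. | 100 | 120 | 110 | 110 | 160 | 150 |
|---|---|---|---|---|---|---|
| Demand (in units) | 40 | 38 | 43 | 45 | 37 | 23 |

Also estimate the demand when the price is 130 and the price when the demand is 30 units.

Solution: N = 6

| | | | | | | | Total |
|---|---|---|---|---|---|---|---|
| Price in Rs. (X) | 100 | 120 | 110 | 110 | 160 | 150 | 750 |
| Demand in units (Y) | 40 | 38 | 43 | 45 | 37 | 23 | 226 |
| XY | 4000 | 4560 | 4730 | 4950 | 5920 | 3450 | 27610 |
| $X^2$ | 10000 | 14400 | 12100 | 12100 | 25600 | 22500 | 96700 |
| $Y^2$ | 1600 | 1444 | 1849 | 2025 | 1369 | 529 | 8816 |

$\bar{x} = \frac{\sum x}{n} = \frac{750}{6} = 125$, $\bar{y} = \frac{\sum y}{n} = \frac{226}{6} = 37.67$

$b_{yx} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{27610 - \frac{750 \times 226}{6}}{96700 - \frac{750 \times 750}{6}} = \frac{-3840}{17700} = -0.22$

$b_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} = \frac{-3840}{1820} = -2.11$

Regression equation of y on x:

$y - \bar{y} = b_{yx}(x - \bar{x}) \Rightarrow y - 37.67 = -0.22(x - 125)$

$y = -0.22x + 65.17$ --- (1)

Regression equation of x on y:

$x - \bar{x} = b_{xy}(y - \bar{y}) \Rightarrow x - 125 = -2.11(y - 37.67)$

$x = -2.11y + 204.48$ --- (2)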

## Page 221

To find the demand when the price is 130, we put x = 130 in (1):

$y = -0.22 \times 130 + 65.17 = 36.57$

To find the price when the demand is 30, we put y = 30 in (2):

$x = -2.11 \times 30 + 204.48 = 141.18$

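Eg 4: The two regression lines are given by $x + 2y = 5$ and $2x + 3y - 8 = 0$, and $(\sigma_x)^2 = 12$. Find the values of $\bar{x}$, $\bar{y}$, $(\sigma_y)^2$.

Solution: Both regression lines pass through $(\bar{x}, \bar{y})$, so we solve the two regression equations simultaneously to find $\bar{x}$ and $\bar{y}$.

$x + 2y = 5$ --- (1)

$2x + 3y - 8 = 0$ --- (2)

Multiplying eq (1) by 2 and subtracting eq (2) gives $\bar{y} = 2$; substituting y = 2 in eq (1) gives $\bar{x} = 1$.

Suppose eq (1) is the regression of x on y and eq (2) is the regression of y on x:

$x + 2y = 5 \Rightarrow x = -2y + 5$, $b_{xy} = -2$

$2x + 3y - 8 = 0 \Rightarrow y = -\frac{2}{3}x + \frac{8}{3}$, $b_{yx} = -\frac{2}{3}$

$b_{xy} \times b_{yx} = \frac{4}{3} > 1$, which is not possible since $r^2 \le 1$.

∴ The regression of y on x is given by $x + 2y = 5$ and the regression of x on y is given by $2x + 3y - 8 = 0$.

$x + 2y = 5 \Rightarrow y = -\frac{x}{2} + \frac{5}{2}$, $b_{yx} = -\frac{1}{2}$

$2x + 3y = 8 \Rightarrow x = -\frac{3}{2}y + 4$, $b_{xy} = -\frac{3}{2}$

$r^2 = b_{yx} \times b_{xy} \Rightarrow r^2 = \frac{3}{4} = 0.75 \Rightarrow r = -\sqrt{\frac{3}{4}} = -0.87$ (negative, since both regression coefficients are negative)

$b_{yx} = r\frac{\sigma_y}{\sigma_x} \Rightarrow (b_{yx})^2 = r^2 \frac{(\sigma_y)^2}{(\sigma_x)^2} \Rightarrow \frac{1}{4} = \frac{3}{4} \times \frac{(\sigma_y)^2}{12}$

$\Rightarrow (\sigma_y)^2 = \frac{1}{4} \times 12 \times \frac{4}{3} = 4$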

13.4 STANDARD ERROR OF ESTIMATE

Let $y_1$ be the value of y estimated from the regression line for a given value of x. A measure of the scatter about the regression line of y on x is given by

$S_{y,x} = \sqrt{\frac{\sum (y - y_1)^2}{n}}$

## Page 222

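which is called the standard error of estimate of y on x.

The standard error of estimate of x on y is

$S_{x,y} = \sqrt{\frac{\sum (x - x_1)^2}{n}}$

In general, $S_{y,x} \ne S_{x,y}$

Eg: 1) If the regression line of y on x is $y = a_0 + a_1 x$, show that $S_{y,x}^2 = \frac{\sum y^2 - a_0 \sum y - a_1 \sum xy}{n}$.

Solution: $y_1 = a_0 + a_1 x$

$S_{y,x}^2 = \frac{\sum (y - y_1)^2}{n} = \frac{\sum (y^2 - 2yy_1 + y_1^2)}{n}$

$= \frac{\sum (y^2 - 2y(a_0 + a_1 x) + (a_0 + a_1 x)^2)}{n}$

$= \frac{\sum (y^2 - 2a_0 y - 2a_1 xy + a_0^2 + 2a_0 a_1 x + a_1^2 x^2)}{n}$

$= \frac{\sum (y^2 - a_0 y - a_1 xy - a_0 y - a_1 xy + a_0^2 + a_0 a_1 x + a_1^2 x^2 + a_0 a_1 x)}{n}$

$= \frac{\sum y(y - a_0 - a_1 x) - a_0 \sum (y - a_1 x - a_0) - a_1 \sum x(y - a_1 x - a_0)}{n}$

To find $a_0$ and $a_1$ for $y = a_0 + a_1 x$ by the least square method, we solve $\sum y = a_0 n + a_1 \sum x$ and $\sum xy = a_0 \sum x + a_1 \sum x^2$

$\Rightarrow \sum (y - a_0 - a_1 x) = 0$, $\sum (xy - a_0 x - a_1 x^2) = 0$

$S_{y,x}^2 = \frac{\sum y(y - a_0 - a_1 x)}{n} = \frac{\sum y^2 - a_0 \sum y - a_1 \sum xy}{n}$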

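Eg: 2) If the regression line of y on x is y = 35.82 + 0.476x and the values of x and y are

| x | 65 | 63 | 67 | 64 | 68 | 62 | 70 |
|---|---|---|---|---|---|---|---|
| y | 68 | 66 | 68 | 65 | 69 | 66 | 68 |

find $S_{y,x}^2$.

Solution: y = 35.82 + 0.476x

| x | y | $y_1 = 35.82 + 0.476x$ | $y - y_1$ | $(y - y_1)^2$ |
|---|---|---|---|---|
| 65 | 68 | 66.76 | 1.24 | 1.538 |
| 63 | 66 | 65.808 | 0.192 | 0.037 |
| 67 | 68 | 67.712 | 0.288 | 0.083 |
| 64 | 65 | 66.284 | -1.284 | 1.649 |
| 68 | 69 | 68.188 | 0.812 | 0.659 |
| 62 | 66 | 65.332 | 0.668 | 0.446 |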

## Page 223

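| x | y | $y_1 = 35.82 + 0.476x$ | $y - y_1$ | $(y - y_1)^2$ |
|---|---|---|---|---|
| 70 | 68 | 69.14 | -1.14 | 1.30 |
| Total | | | | 5.712 |

$S_{y,x}^2 = \frac{\sum (y - y_1)^2}{n} = \frac{5.712}{7} = 0.816$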
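The tabular computation above can be reproduced in a few lines. This sketch uses the regression line and the seven observations exactly as given in the example; the variable names are chosen here for illustration.

```python
# Data and regression line as given in the example.
x = [65, 63, 67, 64, 68, 62, 70]
y = [68, 66, 68, 65, 69, 66, 68]
a0, a1 = 35.82, 0.476

# y1: values of y estimated from the line for each x.
y_estimated = [a0 + a1 * xi for xi in x]

# Sum of squared deviations of the observed y from the estimated y.
sum_sq_residuals = sum((yi - ei) ** 2 for yi, ei in zip(y, y_estimated))

# Squared standard error of estimate of y on x (divisor n, as in the text).
s2_yx = sum_sq_residuals / len(x)
print(round(sum_sq_residuals, 3), round(s2_yx, 3))
```

Summing the unrounded squares gives 5.711 rather than the 5.712 obtained from the rounded table entries; the quotient still rounds to 0.816.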

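Eg: 3) Show that $\sum (y - \bar{y})^2 = \sum (y - y_1)^2 + \sum (y_1 - \bar{y})^2$

Solution: LHS $= \sum (y - \bar{y})^2 = \sum (y - y_1 + y_1 - \bar{y})^2$

$= \sum \{(y - y_1)^2 + 2(y - y_1)(y_1 - \bar{y}) + (y_1 - \bar{y})^2\}$

$= \sum (y - y_1)^2 + \sum (y_1 - \bar{y})^2 + 2\sum (y - y_1)(y_1 - \bar{y})$

It remains to prove that $\sum (y - y_1)(y_1 - \bar{y}) = 0$, i.e.

$\sum (y - y_1)(y_1 - \bar{y}) = \sum (y - a_0 - a_1 x)(a_0 + a_1 x - \bar{y})$

$= a_0 \sum (y - a_0 - a_1 x) + a_1 \sum x(y - a_0 - a_1 x) - \bar{y} \sum (y - a_0 - a_1 x) = 0$

since, to find $a_0$ and $a_1$ for $y = a_0 + a_1 x$ by the least square method, we solve $\sum y = a_0 n + a_1 \sum x$ and $\sum xy = a_0 \sum x + a_1 \sum x^2$

$\Rightarrow \sum (y - a_0 - a_1 x) = 0$, $\sum (xy - a_0 x - a_1 x^2) = 0$

$\Rightarrow \sum (y - \bar{y})^2 = \sum (y - y_1)^2 + \sum (y_1 - \bar{y})^2$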

13.5 EXPLAINED AND UNEXPLAINED VARIATION

In the previous section we proved

$\sum (y - \bar{y})^2 = \sum (y - y_1)^2 + \sum (y_1 - \bar{y})^2$

$\sum (y - y_1)^2$ is called the unexplained variation, as the deviations $(y - y_1)$ behave in a random and unpredictable manner.

$\sum (y_1 - \bar{y})^2$ is called the explained variation, as the deviations $(y_1 - \bar{y})$ have a definite pattern.

$\sum (y - \bar{y})^2$ is called the total variation.

Eg: 1) For the following data calculate the total variation, explained variation and unexplained variation, given the regression of y on x as y = 35.82 + 0.476x

| x | 65 | 63 | 67 | 64 | 68 | 62 | 70 |
|---|---|---|---|---|---|---|---|
| y | 68 | 66 | 68 | 65 | 69 | 66 | 68 |

Solution: y = 35.82 + 0.476x, $\bar{y} = \frac{\sum y}{n} = 67.143$

| x | y | $y_1 = 35.82 + 0.476x$ | $y - y_1$ | $(y - y_1)^2$ | $(y_1 - \bar{y})^2$ |
|---|---|---|---|---|---|
| 65 | 68 | 66.76 | 1.24 | 1.538 | 0.147 |
| 63 | 66 | 65.808 | 0.192 | 0.037 | 1.306 |
| 67 | 68 | 67.712 | 0.288 | 0.083 | 0.734 |
| 64 | 65 | 66.284 | -1.284 | 1.649 | 4.592 |
| 68 | 69 | 68.188 | 0.812 | 0.659 | 3.448 |

## Page 224

Unexplained variation = $\sum (y - y_1)^2 = 5.712$

Explained variation = $\sum (y_1 - \bar{y})^2 = 12.267$

Total variation = $\sum (y - \bar{y})^2 = \sum (y - y_1)^2 + \sum (y_1 - \bar{y})^2 = 17.979$

13.6 COEFFICIENT OF DETERMINATION

The ratio of the explained variation to the total variation is called the coefficient of determination. It is given by

$\frac{\sum (y_1 - \bar{y})^2}{\sum (y - \bar{y})^2}$.

If there is zero explained variation (i.e. the total variation is the same as the unexplained variation), the coefficient of determination will be 0. If there is zero unexplained variation (i.e. the total variation is the same as the explained variation), the coefficient of determination will be 1. Since the ratio is always non-negative, we denote it by $r^2$. The quantity r, called the coefficient of correlation, is given by

$r = \pm\sqrt{\frac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{\frac{\sum (y_1 - \bar{y})^2}{\sum (y - \bar{y})^2}}$

and varies between -1 and +1. The + and - signs indicate positive and negative linear correlation respectively. Note that r is a dimensionless quantity, i.e. it does not depend on the units.

For the case of linear correlation, the quantity r is the same regardless of whether X or Y is considered the independent variable. Thus r is a good measure of the linear correlation between two variables.
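The decomposition of the total variation and the coefficient of determination can be illustrated numerically. The sketch below uses the twelve observations of Exercise 16 in this chapter, for which the least-squares line works out to approximately y = 35.82 + 0.476x; the additive identity relies on the line being the least-squares line of the data actually used. The variable names are chosen here for illustration.

```python
from math import sqrt

# The twelve (x, y) observations of Exercise 16.
x = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
y = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares line of y on x: y1 = a0 + a1 * x.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
a1 = sxy / sxx
a0 = mean_y - a1 * mean_x
y_estimated = [a0 + a1 * xi for xi in x]

# The three variations defined in Sections 13.5 and 13.6.
unexplained = sum((yi - ei) ** 2 for yi, ei in zip(y, y_estimated))
explained = sum((ei - mean_y) ** 2 for ei in y_estimated)
total = sum((yi - mean_y) ** 2 for yi in y)

r_squared = explained / total   # coefficient of determination
r = sqrt(r_squared)             # positive root, since the slope a1 > 0
print(round(a0, 2), round(a1, 3))
print(round(unexplained, 3), round(explained, 3), round(total, 3))
print(round(r_squared, 4), round(r, 4))
```

For these data the explained and unexplained variations add up to the total variation, which is the identity proved in Section 13.4.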

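Eg: 1) Show that $s_{y,x}^2 = s_y^2(1 - r^2)$.

Solution: $r = \pm\sqrt{\frac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{\frac{\sum (y_1 - \bar{y})^2}{\sum (y - \bar{y})^2}}$

$r^2 = \frac{\sum (y_1 - \bar{y})^2}{\sum (y - \bar{y})^2} = \frac{\sum (y - \bar{y})^2 - \sum (y - y_1)^2}{\sum (y - \bar{y})^2}$

$\Rightarrow r^2 = 1 - \frac{\sum (y - y_1)^2}{\sum (y - \bar{y})^2} = 1 - \frac{s_{y,x}^2}{s_y^2}$

$\Rightarrow s_{y,x}^2 = s_y^2(1 - r^2)$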

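(Remaining rows of the table in Section 13.5, Eg: 1)

| x | y | $y_1 = 35.82 + 0.476x$ | $y - y_1$ | $(y - y_1)^2$ | $(y_1 - \bar{y})^2$ |
|---|---|---|---|---|---|
| 62 | 66 | 65.332 | 0.668 | 0.446 | 1.306 |
| 70 | 68 | 69.14 | -1.14 | 1.30 | 0.734 |
| Total | | | | 5.712 | 12.267 |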

## Page 225

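Eg: 2) Given the regression of y on x as y = 35.82 + 0.476x, find the coefficient of determination and the coefficient of correlation for the following data:

| x | 65 | 63 | 67 | 64 | 68 | 62 | 70 |
|---|---|---|---|---|---|---|---|
| y | 68 | 66 | 68 | 65 | 69 | 66 | 68 |

Solution: y = 35.82 + 0.476x, $\bar{y} = \frac{\sum y}{n} = 67.143$

| x | y | $y_1 = 35.82 + 0.476x$ | $y - y_1$ | $(y - y_1)^2$ | $(y_1 - \bar{y})^2$ |
|---|---|---|---|---|---|
| 65 | 68 | 66.76 | 1.24 | 1.538 | 0.147 |
| 63 | 66 | 65.808 | 0.192 | 0.037 | 1.306 |
| 67 | 68 | 67.712 | 0.288 | 0.083 | 0.734 |
| 64 | 65 | 66.284 | -1.284 | 1.649 | 4.592 |
| 68 | 69 | 68.188 | 0.812 | 0.659 | 3.448 |
| 62 | 66 | 65.332 | 0.668 | 0.446 | 1.306 |
| 70 | 68 | 69.14 | -1.14 | 1.30 | 0.734 |
| Total | | | | 5.712 | 12.267 |

Unexplained variation = $\sum (y - y_1)^2 = 5.712$

Explained variation = $\sum (y_1 - \bar{y})^2 = 12.267$

Total variation = $\sum (y - \bar{y})^2 = \sum (y - y_1)^2 + \sum (y_1 - \bar{y})^2 = 17.979$

Coefficient of determination = $\frac{\sum (y_1 - \bar{y})^2}{\sum (y - \bar{y})^2} = \frac{12.267}{17.979} = 0.682$

Coefficient of correlation $r = \pm\sqrt{\frac{\text{explained variation}}{\text{total variation}}} = \pm\sqrt{0.682}$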

13.7 SUMMARY

In this chapter we have learnt about the scatter diagram, which helps us to find the nature and extent of the relationship between two variables, and about the coefficient of correlation, which is a numerical measure of the nature and extent of the relationship between two given variables and whose value lies between -1 and +1. We have also learnt about the coefficient of rank correlation, which is used in cases where it is not possible to get numerical measurements but we can rank the individuals in order according to our judgment. This chapter also deals with the least square regression lines of y on x as well as x on y, and explains the standard error of estimate, explained and unexplained variation, and the coefficient of determination.

13.8 EXERCISES

1) The following table gives the marks obtained by 6 students in the mid term exam and the final semester exam. Find the coefficient of correlation.

## Page 226

| Mid term Exam | 12 | 14 | 23 | 18 | 10 | 19 |
|---|---|---|---|---|---|---|
| Final semester Exam | 68 | 78 | 85 | 75 | 70 | 74 |

2) Given covariance = 27, r = 0.6, variance of y = 25. Find the variance of x.

3) Find the coefficient of correlation between the heights of male students and female students.

| height of male students | 65 | 66 | 67 | 68 | 69 | 70 | 71 |
|---|---|---|---|---|---|---|---|
| height of female students | 67 | 68 | 66 | 69 | 72 | 72 | 69 |

4) If $r = 0.38$, $cov(x, y) = 10.2$, $\sigma_x = 16$, find $\sigma_y$.

5) Find r, if $cov(x, y) = 6$, $\sigma_x = 2.45$, $\sigma_y = 3.41$.

6) Find r, given $\sum (x - \bar{x})(y - \bar{y}) = 29$, $\sum (x - \bar{x})^2 = 28$, $\sum (y - \bar{y})^2 = 42$.

7) Ten competitors in Miss Universe are ranked by three judges in the following order:

| J1 | 1 | 6 | 5 | 10 | 3 | 2 | 4 | 9 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| J2 | 3 | 5 | 8 | 4 | 7 | 10 | 2 | 1 | 6 | 9 |
| J3 | 6 | 4 | 9 | 8 | 1 | 2 | 3 | 10 | 5 | 7 |

Using the rank correlation coefficient, determine which pair of judges has the nearest approach to common tastes in beauty.

8) The coefficient of rank correlation for certain data is found to be 0.6. If the sum of the squares of the differences is given to be 66, find the number of items in the group.

9) The ranks of 16 students in the subjects DBMS and CN are given as follows. Calculate the rank coefficient of correlation.

| Rank in DBMS | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank in CN | 1 | 10 | 3 | 4 | 5 | 7 | 2 | 6 | 8 | 11 | 15 | 9 | 14 | 12 | 16 | 13 |

10) The coefficient of rank correlation of the marks obtained by 9 students was calculated to be 0.4. It was later discovered that the value of the difference between ranks for one student was written wrongly as 6 instead of 8. Find the correct value of the coefficient of rank correlation.

11) Find the two regression equations given $\bar{x} = 8$, $\bar{y} = 2000$, $\sigma_x = 2$, $\sigma_y = 80$, $r = 0.7$. Also estimate y when x = 10 and estimate x when y = 2500.

## Page 227

12) Find the two regression equations for the following data: $\sum (x - \bar{x})(y - \bar{y}) = 135$, $\sum (x - \bar{x})^2 = 96$, $\sum (y - \bar{y})^2 = 206$, $\sum x = 120$, $\sum y = 180$, $n = 5$

13) n = 50; the regression equation of marks in mathematics (Y) on marks in English (X) was 4Y - 5X = 8. Mean marks in English are 40. The ratio of the two standard deviations is $\sigma_y : \sigma_x = 5 : 2$. Find the average marks in mathematics and the coefficient of correlation between the marks in the two subjects.

14) The two regression lines between x and y are $2x + 3y = 61$ and $x + y = 25$. Find $\bar{x}$, $\bar{y}$ and r.

15) Find the regression line of profits on output from the following data using the least square method.

| Output (100 tons) | 5 | 7 | 9 | 11 | 13 | 15 |
|---|---|---|---|---|---|---|
| Profit per unit (Rs.) | 1.7 | 2.4 | 2.8 | 3.4 | 3.7 | 4.4 |

16) For the following find the total variation, given the regression of y on x as y = 35.82 + 0.476x

| x | 65 | 63 | 67 | 64 | 68 | 62 | 70 | 66 | 68 | 67 | 69 | 71 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 68 | 66 | 68 | 65 | 69 | 66 | 68 | 65 | 71 | 67 | 68 | 70 |

13.9 SOLUTION TO EXERCISE

| Q. No. | Solution |
|---|---|
| 1 | r = 0.81 |
| 2 | variance of x = 81 |
| 3 | r = 0.67 |
| 4 | 1.68 |
| 5 | r = 0.72 |
| 6 | r = 0.85 |
| 7 | R12 = -0.212, R13 = 0.636, R23 = -0.297 |
| 8 | n = 10 |
| 9 | R = 0.8 |
| 10 | R = 0.17 |
| 11 | y = 28x + 1776, x = 0.02y - 32, y = 2056, x = 18 |
| 12 | y = 1.41x + 2.16, x = 0.66y + 0.24 |
| 13 | 52, r = 0.5 |
| 14 | 14, 11, -0.82 |
| 15 | y = 0.26x + 0.5 |
| 16 | 38.917 |

13.10 REFERENCES

Following books are recommended for further reading:

Statistics by Murray R. Spiegel and Larry J. Stephens, McGraw-Hill International, 4th edition

Fundamentals of Mathematical Statistics by S. C. Gupta and V. K. Kapoor, Sultan Chand and Sons, 11th edition

Mathematical Statistics by J. N. Kapur and H. C. Saxena, S. Chand, 12th edition

*****