Page 1
UNIT I1INTRODUCTION TO MACHINELEARNINGUnit Structure1.0Introduction1.1Machine learning1.2Examples of Machine Learning Problems1.3Structure of Learning1.4Learning versus Designing1.5Training versus Testing1.6Characteristics ofMachine learning tasks1.7Predictive and descriptive tasksSummaryUnit End QuestionsReferences1.0INTRODUCTIONA human child learns new things and uncovers the structure of their worldyear by year as they grow to adulthood.A child's brain andsensesperceive the facts of their surroundings and gradually learn the hiddenpatterns of life which help the child to craft logical rules to identifylearned patterns. The learning process of the human brain makes humansthe most sophisticated living creature of this world. Learning continuouslyby discovering hidden patterns and then innovating on those patternsenables us to make ourselves better and better throughout our lifetime.Superficially, we can draw some motivational similarities between thelearning process of the human brain and the concepts of machine learning.(jlooper, n.d.)Thehuman brainperceives things from the real world, processes theperceived information, makes rational decisions, and performs certainactions based on circumstances. When we program a replica of theintelligent behavioural process to a machine, it is called artificialintelligence (AI).Machine learning (ML) is animportant subset of artificialintelligence.ML is concerned with using specialized algorithms tomunotes.in
Page 2
uncover meaningful information and find hidden patterns from perceiveddata tosupportthelogicaldecision-making process.1.1MACHINE LEARNINGMachine learning, from a systems perspective, is defined as the creation ofautomated systems that can learn hidden patterns from data to aid inmaking intelligent decisions.This motivation is loosely inspired by how the human brain learns certainthings based on the data it perceives from the outside world.Machinelearning is the systematic study of algorithms and systems that improvetheir knowledge or performance with experience.A Machine Learning systemlearns from historical data, builds theprediction models, and whenever it receives new data, predicts the outputfor it. The accuracy of predicted output depends upon the amount of data,as the huge amount of data helps to build a better model which predicts theoutput more accurately.Suppose we have a complex problem, where we need to perform somepredictions, so instead of writing a code for it, we just need to feed thedata to generic algorithms, and with the help of these algorithms, machinebuilds the logic as per the data and predict the output.Two definitions of Machine Learning are offered.Arthur Samueldescribed it as: "The field of study that gives computersthe ability to learn from data without being explicitly programmed."Thisis an older, informal definition.TomMitchellprovides a more modern definition. According to him,"Acomputer program is said to learn from experience E with respect to someclass of tasks T and performance measure P, if its performance at tasks inT, as measured by P, improves with experience E."Example: playing checkers.T = the task of playing checkers.P = the probability that the program will win the next game.E = the experience of playing many games of checkersLet us now understand,Supervised Machine Learning,UnsupervisedMachine Learningand Reinforcement Learning:Supervised Machine Learning:Supervised learning is the types of machine learning in which machinesare trained using well "labelled" training data, and on basis of that data,munotes.in
Page 3
machines predict the output. The labelled data means some input data isalready tagged with the correct output.In supervised learning, the training data provided to the machines work asthe supervisor that teaches the machines to predict the output correctly. Itapplies the same concept asa student learns in the supervision of theteacher.Supervised learning is a process of providing input data as well as correctoutput data to the machine learning model. The aim of a supervisedlearning algorithm is tofind a mapping function to map theinputvariable(x) with the output variable(y).In the real-world, supervised learning can be used forRisk Assessment,Image classification, Fraud Detection, spam filtering, etc.Supervised learning can be further divided into two types ofproblems:1.Regression:Regression algorithms are used if there is a relationship between the inputvariable and the output variable. It is used for the prediction of continuousvariables, such as Weather forecasting, Market Trends, etc. LinearRegression, RegressionTrees, Non-Linear Regression, Bayesian LinearRegression, Polynomial Regression are some popular Regressionalgorithms which come under supervised learning.2. Classification:Classification algorithms are used when the output variable is categorical,which means there are two classes such as Yes-No, Male-Female, True-false, etc. Spam Filtering, Random Forest, Decision Trees, LogisticRegression, Support vector Machines are some examples of classification.Unsupervised Machine Learning:There may be manycases in which we do not have labeled data and needto find the hidden patterns from the given dataset. So, to solve such typesof cases in machine learning, we need unsupervised learning techniques.unsupervised learning is a machine learning technique in which modelsare not supervised using training dataset. Instead, models itself find thehidden patterns and insights from the given data. It can be compared tolearning which takes place in the human brain while learning new things.Unsupervised learning cannot be directly applied to a regression orclassification problem because unlike supervised learning, we have theinput data but no corresponding output data. The goal of unsupervisedlearning is to find the underlying structure of dataset, group thatdataaccording to similarities, and represent that dataset in a compressedformat.munotes.in
Page 4
Example:Suppose the unsupervised learning algorithm is given an inputdataset containing images of different types of cats and dogs. Thealgorithm is never trained upon the given dataset, which means it does nothave any idea about the features of the dataset. The task of theunsupervised learning algorithm is to identify the image features on theirown. Unsupervised learning algorithm will perform this task by clusteringthe image dataset into the groups according to similarities between images.The unsupervisedlearning algorithm can be further categorized into twotypes of problems:ClusteringAssociation1. Clustering:Clustering is a method of grouping the objects into clusters such thatobjects with most similarities remains into a group and has less or nosimilarities with the objects of another group. Cluster analysis finds thecommonalities between the data objects and categorizes them as per thepresence and absence of those commonalities.2. Association:An association rule is an unsupervised learningmethod which is used forfinding the relationships between variables in the large database. Itdetermines the set of items that occurs together in the dataset. Associationrule makes marketing strategy more effective. Such as people who buy Xitem (supposea bread) are also tend to purchase Y (Butter/Jam) item. Atypical example of Association rule is Market Basket Analysis.K-means clustering, KNN (k-nearest neighbors), Hierarchal clustering,Anomaly detection, Neural Networks are the examples of unsupervisedlearning.Reinforcement Learning:Reinforcement Learning is a feedback-based Machine learning techniquein which an agent learns to behave in an environment by performing theactions and seeing the results of actions. For each good action, the agentgets positive feedback, and for each badaction, the agent gets negativefeedback or penalty. In Reinforcement Learning, the agent learnsautomatically using feedbacks without any labeled data, unlike supervisedlearning.Since there is nolabelleddata, so the agent is bound to learn by itsexperience only. RL solves a specific type of problem where decisionmaking is sequential, and the goal is long-term, such as game-playing,robotics, etc. The agent interacts with the environment and explores it byitself. The primary goal of an agent in reinforcement learning is toimprove the performance by getting the maximum positive rewards.munotes.in
Page 5
The agent learns with the process of hit and trial, and based on theexperience, it learns to perform the task in a better way. Hence, we can saythat "Reinforcement learning is a type of machine learning method wherean intelligent agent (computer program) interacts with the environmentand learns to act within that”. How a Robotic dog learns the movement ofhis arms is an example of Reinforcement learning.1.2 EXAMPLES OF MACHINE LEARNING PROBLEMS:Machine learning is a buzzword for today's technology, and it is growingvery rapidly day by day. We are using machine learning in our daily lifeeven without knowing it such as Google Maps, Google assistant, Alexa,etc. Below are some most trending real-world applications of MachineLearning:1. Image Recognition:Image recognition is one of the most common applications of machinelearning. It is used to identify objects, persons, places, digital images, etc.The popular use case of image recognition and face detection is,Automatic friend tagging suggestion:Facebook provides us a feature of auto friend tagging suggestion.Whenever we upload a photo with our Facebook friends, then weautomatically get a tagging suggestion with name, and the technologybehind this is machine learning's face detection and recognition algorithm.It is based on the Facebook project named "Deep Face," which isresponsible for face recognition and person identification in the picture.2. Speech Recognition:While using Google, we get an option of "Search by voice," it comesunder speech recognition, and it's a popular application of machinelearning.Speech recognition is a process of converting voice instructions into text,and it isalso known as "Speech to text", or "Computer speechrecognition." At present, machine learning algorithms are widely used byvarious applications of speech recognition. Google assistant, Siri, Cortana,and Alexa are using speech recognition technology to follow the voiceinstructions.3. Traffic prediction:If we want to visit a new place, we take help of Google Maps, whichshows us the correct path with the shortest route and predicts the trafficconditions.It predicts the traffic conditions such as whether traffic is cleared, slow-moving, or heavily congested with the help of two ways:munotes.in
Page 6
•Real Time location of the vehicle form Google Map app and sensors•Average time has taken on past days at the same time.Everyone who is using Google Map is helping thisapp to make it better. Ittakes information from the user and sends back to its database to improvethe performance.4. Product recommendations:Machine learning is widely used by various e-commerce andentertainment companies such as Amazon, Netflix,etc., for productrecommendation to the user. Whenever we search for some product onAmazon, then we started getting an advertisement for the same productwhile internet surfing on the same browser and this is because of machinelearning.Google understands the user interest using various machine learningalgorithms and suggests the product as per customer interest.As similar,when we use Netflix, we find some recommendations for entertainmentseries, movies, etc., and this is also done with the help of machinelearning.5. Self-driving cars:One of the most exciting applications of machine learning is self-drivingcars. Machine learning plays a significant role in self-driving cars. Tesla,the most popular car manufacturing company is working on self-drivingcar. It is using unsupervised learning method to train the car models todetect people and objects while driving.6. Email Spam and Malware Filtering:Whenever we receive a new email, it is filtered automatically as important,normal, and spam. We always receive an important mail in our inbox withthe important symbol and spam emails in our spam box, and thetechnology behind this is Machine learning. Below are some spam filtersused by Gmail:•Content Filter•Header filter•General blacklists filter•Rules-based filters•Permission filtersSome machine learning algorithms such as Multi-Layer Perceptron,Decision tree, and Naïve Bayes classifier are used for email spam filteringand malware detection.7. Virtual Personal Assistant:We have various virtualpersonal assistants such as Google assistant,Alexa, Cortana, Siri. As the name suggests, they help us in finding themunotes.in
Page 7
information using our voice instruction. These assistants can help us invarious ways just by our voice instructions such as Play music,callsomeone,open an email, Scheduling an appointment, etc.These virtual assistants use machine learning algorithms as an importantpart.These assistantsrecord our voice instructions, send it over the serveron a cloud, and decode it using ML algorithms and act accordingly.8. Online Fraud Detection:Machine learning is making our online transaction safe and secure bydetecting fraud transaction. Whenever we perform some onlinetransaction, there may be various ways that a fraudulent transaction cantake place such as fake accounts, fake ids, and steal money in the middleof a transaction. So,to detect this, Feed Forward Neural network helps usby checking whether it is a genuine transaction or a fraud transaction.For each genuine transaction, the output is converted into some hashvalues, and these values become the input for the next round. For eachgenuine transaction, there is a specific pattern which gets change for thefraud transaction hence, it detects it and makes our online transactionsmore secure.9. Stock Market trading:Machine learning is widely used in stock market trading. In the stockmarket, there is always a risk of up and downs in shares, so for thismachine learning's long short-term memory neural network is used for theprediction of stock market trends.10. Medical Diagnosis:In medical science, machine learning is used for diseases diagnoses. Withthis, medical technology isgrowing very fast and able to build 3D modelsthat can predict the exact position of lesions in the brain.It helps in findingbraintumoursand other brain-related diseases easily.11. Automatic Language Translation:Nowadays, if we visit a new place andwe are not aware of the languagethen it is not a problem at all, as for this also machine learning helps us byconverting the text into our known languages. Google's GNMT (GoogleNeural Machine Translation) provide this feature, which is a NeuralMachineLearning that translates the text into our familiar language, and itcalled as automatic translation.The technology behind the automatic translation is asequence-to-sequencelearning algorithm, which is used with image recognition and translatesthe text from one language to another language.munotes.in
Page 8
1.3 STRUCTURE OF LEARNINGLike all other machine learning models, patterns are a manifestation ofunderlyingstructure in the data. Sometimes this structure takes the form ofa single hidden or latentvariable, much like unobservable but neverthelessexplanatory quantities in physics,such as energy. Consider the followingmatrix:
(FLACH, 2012)Imagine these represent ratings by six different people (in rows), on ascale of 0 to 3, of four different films–sayKhosla Ka Ghosla (KG),Drishyam (D),BADLA (B),Hera Phery (HP), (in columns, from left toright).BADLA (B)seems to be the most popular of the four with anaverage rating of 1.5, andKhosla Ka Ghosla (KG)is the least appreciatedwith an average rating of 0.5.Try to finda structure in this matrix.(FLACH, 2012)Try to look for columns or rows that are combinations of other columns orrows. For instance, the third column turns out to be the sum of the first andsecond columns. Similarly, the fourth row is the sum of the first andsecond rows. What this means is that the fourth person combines theratings of the first and second person. Similarly,BADLA (B)’s ratings arethe sum of the ratings of the first two films. This is made more explicit bywriting the matrix as the following product:
(FLACH, 2012)Notice that the first and third matrix on the right-hand side are nowBoolean, and the middle one is diagonal (all off-diagonal entries are zero).Moreover, these matrices have a very natural interpretation in terms offilm genres.
munotes.in
Page 9
The right-most matrix associates films (in columns) with genres (in rows):Khosla Ka Ghosla (KG)andDrishyam (D)belong to two different genres,say drama and crime,BADLA (B)belongs to both, andHera Phery (HP)is a crime film and also introduces a newgenre (say comedy).The tall, 6-by-3 matrix then expresses people’s preferences in terms ofgenres: the first, fourth and fifth person like drama, the second, fourth andfifth person like crime films, and the third, fifth and sixth person likecomedies.Finally, the middle matrix states that the crime genre is twice asimportant as the other two genres in terms of determining people’spreferences.1.4 LEARNING VERSUS DESIGNINGAccording to Arthur Samuel “Machine Learning enables a Machine toAutomatically learn from Data,improve performance from an Experienceand predict things without explicitly programmed.”(https://www.tutorialspoint.com/, n.d.)In Simple Words,when we fed the Training Data to Machine LearningAlgorithm, this algorithm will produce a mathematical model and with thehelp of the mathematical model, the machine will make a prediction andtake a decision without being explicitly programmed, as shown in figure1.1. Also, during training data, the more machine will work with it themore it will get experience and the more it will get experience the moreefficient result is produced.Figure 1.1: Learn from dataExample:In Driverless Car, the training data is fed to Algorithm likehow to Drive Car in Highway, Busy and Narrow Street with factors likespeed limit, parking, stop at signal etc. After that, a Logical andMathematical model is createdbased onthat and afterthat, the car willwork according to the logical model. Also, the more data the data is fedthe more efficient output is produced.Designing a Learning System in Machine Learning:According to Tom Mitchell, “A computer program is said to be learningfromexperience (E), with respect to some task (T). Thus, the performancemeasure (P) is the performance at task T, which is measured by P, and itimproves with experience E.”Example:In Spam E-Mail detection, /!),),'!1!!#(),%%!/,),'*'-/)1(+2)*$),'-')#!*!1(%+!1)#!*-$%*21.21munotes.in
Page 10
Task, T: To classify mails into Spam or Not Spam.Performance measure, P: Total percent of mails being correctly classifiedas being “Spam” or “Not Spam”.Experience, E: Set of Mails with label “Spam”Steps for Designing Learning System areas shown in figure 1.2 below:Step 1) Choosing the Training Experience:The very important and firsttask is to choose the training data or training experience which will be fedto the Machine Learning Algorithm. It is important to note that the data orexperience that we fed to the algorithm must have a significantimpact onthe Success or Failure of the Model. So, training data or experience shouldbe chosen wisely.
Figure 1.2: Steps for Designing Learning SystemBelow are the attributes which will impact on Success and Failure of Data:The training experiencewill be able to provide direct or indirect feedbackregarding choices. For example: While Playing chess the training data willprovide feedback to itself like instead of this move if this is chosen thechances of success increases.Second importantattribute is the degree to which the learner will controlthe sequences of training examples. For example: when training data is fedto the machine then at that time accuracy is very less but when it gainsexperience while playing again and again with itself or opponent themachine algorithm will get feedback and control the chess gameaccordingly.Third important attribute is how it will represent the distribution ofexamples over which performance will be measured. For example, aMachine learning algorithm will get experience while going through anumber of different cases and different examples. Thus, Machine LearningAlgorithm will get more and more experience by passing through moreand more examples and hence its performance will increase.),!*%0)',(--0),'2,#1)-,../-4)+!1)-,*'-/)1(+(--0),'%./%0%,1!1)-,&-/ !/'%1&2,#1)-,(--0),'1!/'%1&2,#1)-,(--0),'1(% /!),),'4.%/)%,#%munotes.in
Page 11
Step 2)Choosing target function:The next important step is choosingthe target function. It means according to the knowledge fed to thealgorithm the machine learning will choose NextMove function which willdescribe what type of legal moves should be taken. Forexample: Whileplaying chess with the opponent, when opponent will play then themachine learning algorithm will decide what be the number of possiblelegal moves taken in order to get success.Step 3)Choosing Representation for Target function:When themachine algorithm will know all the possible legal moves the next step isto choose the optimized move using any representation i.e. using linearEquations, Hierarchical Graph Representation, Tabular form etc. TheNextMove function will move the Target move like out of these moveswhich will provide more success rate. For Example: while playing chessmachine have 4 possible moves, so the machine will choose that optimizedmove which will provide success to it.Step 4)Choosing Function Approximation Algorithm:An optimizedmove cannot be chosen just with the training data. The training data had togo through with set of examples and through these examples the trainingdata will approximates which steps are chosen and after that machine willprovide feedback on it. For Example: When a training data of Playingchess is fed to algorithm so at that time it is not machine algorithm willfail or get success and again from that failure or success it will measurewhile next move what step should be chosen and whatis its success rate.Step 5)Final Design:The final design is created at last when system goesfrom number of examples, failures and success, correct and incorrectdecision and what will be the next step etc. Example: DeepBlue is anintelligent computerwhich is ML-based won chess game against the chessexpert Garry Kasparov, and it became the first computer which had beatena human chess expert.1.5 TRAINING VERSUS TESTINGTraining data and test data are two important concepts in machinelearning.Training Data:The observations in the training set form the experience that the algorithmuses to learn. In supervised learning problems, each observation consistsof an observed output variable and one or more observed input variables.Test Data:The test set is a set of observations used to evaluate the performance of themodel using some performance metric. It is important that no observationsfrom the training set are included in the test set. If the test set does containexamples from the training set, it will be difficult to assess whether themunotes.in
Page 12
algorithm has learned to generalize from the training set or has simplymemorized it.A program that generalizes well will be able to effectively perform a taskwith new data. In contrast, a program that memorizes the training data bylearning an overly complex model could predict the values of the responsevariable for the training setaccurately butwill fail to predict the value ofthe response variable for new examples.Memorizing the training set is calledover-fitting. A program thatmemorizes its observations may not perform its task well, as it couldmemorize relations andstructures that are noise or coincidence. Balancingmemorization and generalization, or over-fitting and under-fitting, is aproblem common to many machine learningalgorithms.Regularizationmay be applied to many models to reduce over-fitting.In addition to the training and test data, a third set of observations, calledavalidationorhold-out set,is sometimes required. The validation set isused to tune variables calledhyper parameters, which control how themodel is learned. The program is still evaluated on the test set to providean estimate of its performance in the real world; its performance on thevalidation set should not be used as an estimate of the model's real-worldperformance since the program has been tuned specifically to thevalidation data.It is common to partition a single set of supervised observations intotraining, validation, and test sets. There are no requirements for the sizesof the partitions, and they may vary according to the amount of dataavailable. It is common to allocate 50 percent or more of the data to thetraining set, 25 percent to the test set, and the remainder to the validationset.Some training sets may contain only a few hundred observations; othersmay include millions. Inexpensive storage, increased networkconnectivity, the ubiquity of sensor-packed smartphones, and shiftingattitudes towards privacy have contributed to the contemporary state of bigdata, or training sets with millions or billions of examples.However, machine learning algorithms alsofollow the maxim "garbage in,garbage out." A student who studies for a test by reading a large,confusing textbook that contains many errors will likely not score betterthan a student who reads a short but well-written textbook. Similarly, analgorithmtrained on a large collection of noisy, irrelevant, or incorrectlylabelled data will not perform better than an algorithm trained on a smallerset of data that is more representative of problems in the real world.Many supervised training sets are prepared manually, or by semi-automated processes. Creating a large collection of supervised data can bemunotes.in
Page 13
costly in some domains. Fortunately, several datasets are bundledwithscikit-learn, allowing developers to focus on experimenting withmodels instead.During development, and particularly when training data is scarce, apractice calledcross-validationcan be used to train and validate analgorithm on the same data. In cross-validation, the training data ispartitioned. The algorithm is trained using all butone of thepartitions andtested on the remaining partition. The partitions are then rotated severaltimes so that the algorithm is trained and evaluated on all of the data.Consider for example that the original dataset is partitioned into fivesubsetsof equal size, labelled A through E. Initially, the model is trainedon partitions B through E, and tested on partition A. In the next iteration,the model is trained on partitions A, C, D, and E, and tested on partition B.The partitions are rotated untilmodels have been trained and tested on allof the partitions. Cross-validation provides a more accurate estimate of themodel's performance than testing a single partition of the data.1.6 CHARACTERISTICS OF MACHINE LEARNINGTASKSTo understand the actual power of machine learning, we must consider thecharacteristics of this technology. There are lots of examples that echo thecharacteristics of machine learning in today’s data-rich world. Here areseven key characteristics of machine learning for whichcompanies shouldprefer it over other technologies:1.The ability to perform automated data visualization2.Automation at its best3.Customer engagement like never before4.The ability to take efficiency to the next level when merged withIoT5.The ability tochange the mortgage market6.Accurate data analysis7.Business intelligence at its best1. The ability to perform automated data visualization:A massive amount of data is being generated by businesses and commonpeople on a regular basis. By visualizing notable relationships in data,businesses can not only make better decisions but build confidence aswell. Machine learning offers several tools that provide rich snippets ofdata which can be applied to both unstructured and structured data. Withthe help ofuser-friendly automated data visualization platforms inmachine learning, businesses can obtain a wealth of new insights toincrease productivity in their processes.munotes.in
Page 14
2.Automation at its best:Figure 1.3 Machine Learning workflowFigure 1.3 shows Machine Learning workflow. One of the biggestcharacteristics of machine learning is its ability to automate repetitivetasks and thus, increasing productivity. A huge number of organizationsare already using machine learning-powered paperwork and emailautomation. In the financial sector, for example, a huge number ofrepetitive, data-heavy and predictable tasks are needed to be performed.Because of this, this sector uses different types of machine learningsolutions to a great extent. The make accounting tasks faster, moreinsightful, and more accurate. Some aspects that have been alreadyaddressed by machine learning include addressing financial queries withthe help of chatbots, making predictions, managing expenses, simplifyinginvoicing, and automating bank reconciliations.3.Customer engagement like never before:For any business, one of the most crucial ways to drive engagement,promote brand loyalty and establish long-lasting customer relationships isby triggering meaningful conversations with its target customer base.Machine learning plays a critical role in enabling businesses and brands tospark more valuable conversations in terms of customer engagement. Thetechnology analyzes particular phrases, words, sentences, idioms, andcontent formats which resonate with certain audience members.Wecanthink of Pinterestwhich is successfully using machine learning topersonalize suggestions to its users. It uses the technology to sourcecontent in which users will be interested, based on objects which they havepinned already.4.The ability to take efficiency to the next level when merged withIoT:IoT is being designated as a strategically significant area by manycompanies. And many others have launched pilot projects to gauge thepotential of IoT in the context of business operations. But attainingfinancial benefits through IoT isn’t easy. In order to achieve success,companies, which are offering IoT consulting services and platforms, needto clearly determine the areas that will change with the implementation ofIoT strategies. Many of these businesses have failed to address it. In thisscenario, machine learning is probably the best technology that can beused to attain higher levels of efficiency. By merging machine learningwith IoT, businesses can boost the efficiency of their entire productionprocesses.)+.-/1./-#%003)02!*)6%+-$%*%3!*2!1%munotes.in
Page 15
5.The ability to change the mortgage market:It’s a fact that fostering a positive credit score usually takes discipline,time, and lots of financial planning for a lot of consumers. When it comesto the lenders, the consumer credit score is one of the biggest measures ofcreditworthiness that involve a number of factors including paymenthistory, total debt, length of credit history etc. But wouldn’t it be great ifthere is a simplified and better measure? With the help of machinelearning, lenders can now obtain a more comprehensive consumer picture.They can now predict whether the customer is a low spender or a highspender and understand his/her tipping point of spending. Apart frommortgage lending, financial institutions are using the same techniques forother types of consumer loans.6.Accurate data analysis:Traditionally, data analysis has always been encompassing trial and errormethod, an approach which becomes impossible when we are workingwith large and heterogeneous datasets. Machine learning comes as the bestsolution to all these issues by offering effective alternatives to analyzingmassive volumes of data. By developing efficient and fast algorithms, aswell as, data-driven models for processing of data in real-time, machinelearning is able to generate accurate analysis and results.7.Business intelligence at its best:Machine learning characteristics, when merged with big data analyticalwork, can generate extreme levels of business intelligencewith the help ofwhich several different industries are making strategic initiatives. Fromretail to financial services to healthcare, and many more–machinelearning has already become one of the most effective technologies toboost business operations.1.7 PREDICTIVE AND DESCRIPTIVE TASKSIn the similar fashion, as the distinction between supervised learning fromlabelled data andunsupervised learning from unlabelled data, we can drawa distinction between whether the model output involves the targetvariable or not: we call it a predictive model if it does, and a descriptivemodel if it does not. This leads to the four different machine learningsettings summarised in Table 1.1.Predictive modelDescriptivemodelSupervised learningclassification, regressionsubgroupdiscoveryUnsupervised learningpredictive clusteringdescriptiveclustering,association rule discoveryTable 1.1. An overview of different machine learning settings.(FLACH, 2012)munotes.in
Page 16
The rows refer to whether thetraining data is labelled with a targetvariable, while the columns indicate whether the modelslearned are usedto predict a target variable or rather describe the given data.The table 1.1 indicates following points:•The most common setting is supervised learning of predictive models–in fact, this is what people commonly mean when they refer tosupervised learning. Typical tasks are classification and regression.•It is also possible to use labelled training data to build adescriptivemodel that is not primarily intended to predict the target variable, butinstead identifies, say, subsets of the data that behave differently withrespect to the target variable. This example of supervised learning of adescriptive model is called subgroup discovery.•Descriptive models can naturally be learned in an unsupervised setting,and we have just seen a few examples of that (clustering, associationrule discovery and matrix decomposition). This is often the impliedsetting when people talk about unsupervised learning.•A typical example of unsupervised learning of a predictive modeloccurs whenwe cluster data with the intention of using the clusters toassign class labels to new data. We will call this predictive clustering todistinguish it from the previous, descriptive form of clustering.SUMMARYThis chapter gives brief introduction of Machine Learning. After studyingthis chapter, you will learn about definition of machine learning, what issupervised, unsupervised andreinforcement learning, applications ofmachine learning, how a pattern can be found from data, what are trainingdata and test data and predictive and descriptive tasks with respect tosupervised and unsupervised learning.UNIT END QUESTIONS1.Define and explain Machine Learning. Also explain its examples inbrief.2.Explain supervised learning and unsupervised learning in detail.3.Write a short note on learning verses designing.4.Explain training data and test data in detail.5.What arethe characteristics of machine learning tasks? Explain eachone in brief.6.What are predictive and descriptive tasks? Explain with respect tosupervised and unsupervised learning.munotes.in
Page 17
REFERENCES•FLACH, P. (2012). MACHINE LEARNING The Art and Science ofAlgorithms that Make Sense of Data. Cambridge, New York,Melbourne, Madrid, Cape Town, Singapore, S˜ao Paulo, Delhi,Mexico City: cambridge university press.•jlooper, s. l. (n.d.).microsoft/ ML-For-Beginners. Retrieved fromhttps://github.com/microsoft/ML-For-Beginners/blob/main/1-Introduction/1-intro-to-ML/README.md•https://www.tutorialspoint.com/. (n.d.). Retrieved fromhttps://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_training_test_data.htm.
-3 (11.0+!',)+),$!#!$%+5#-+"*-'#(!/!#1%/)01)#0-&+!#(),%*%!/,),'munotes.in
Page 18
2MACHINE LEARNING MODELSUnit Structure2.0Introduction2.1Geometric Models2.2Logical Models2.3Probabilistic Models2.4Features2.5Feature types2.6Feature Construction and Transformation2.7Feature SelectionSummaryUnit EndQuestionsReferences2.0 INTRODUCTIONModels form the central concept in machine learning as they are what isbeing learned from the data, in order to solve a given task. There is aconsiderable–not to say be wildering–range of machine learning modelsto choose from. One reason for this is the ubiquity of the tasks thatmachine learning aims to solve: classification, regression, clustering,association discovery, to name but a few. Examples of each of these taskscan be found in virtually every branch of science and engineering.Mathematicians, engineers, psychologists, computer scientists and manyothers havediscovered–and sometimes rediscovered–ways to solvethese tasks. They have all brought their specific background to bear, andconsequently the principles underlying these models are also diverse. Mypersonal view is that this diversity is a good thingas it helps to makemachine learning the powerful and exciting discipline it is. It doesn’t,however, make the task of writing a machine learning book any easier!Luckily, a few common themes can be observed, which allow me todiscuss machine learning models in a somewhat more systematic way.Wewill discuss three groups of models: geometric models, probabilisticmodels, and logical models. These groupings are not meant to be mutuallyexclusive, and sometimes a particular kind of model has, for instance, botha geometric and a probabilistic interpretation. Nevertheless, it provides agood starting point for our purposes.2.1GEOMETRIC MODELSTheinstance spaceis the set of all possible or describable instances,whether they are present in our data set ornot. Usually this set has somegeometric structure. For instance, if all features are numerical, then we canmunotes.in
Page 19
use each feature as a coordinate in a Cartesian coordinate system. Ageometric modelis constructed directly in instance space, using geometricconcepts such as lines, planes and distances. For instance, the linearclassifier depicted in Figure 1 on p.5 is a geometric classifier. One mainadvantage of geometric classifiers is that they are easy to visualise, as longas we keep to two or three dimensions. It is important to keep in mind,though, that a Cartesian instance space has as many coordinates as thereare features, which can be tens, hundreds, thousands, or even more. Suchhigh-dimensional spaces are hard to imagine but are nevertheless verycommon in machine learning. Geometric concepts that potentially apply tohigh-dimensional spaces are usually prefixed with ‘hyper-’: for instance, adecision boundary in an unspecified number of dimensions is called ahyperplane.If there existsa linear decision boundary separating the two classes, wesay that the data is linearly separable. As we have seen, a linear decisionboundary is defined by the equation w·x = t, where w is a vectorperpendicular to the decision boundary, x points to an arbitrary point onthe decision boundary, and t is the decision threshold. A good way to thinkof the vector w is as pointing from the ‘centre of mass’ of the negativeexamples, n, to the centre of mass of the positives p. In other words, w isproportional(or equal) to p−n. One way to calculate these centres of massis by averaging. For instance, if P is a set of n positive examples, then wecan defineP=andsimilarly for n. By setting the decisionthreshold appropriately,we can intersect thelinefrom n to p half-way(Figure2.1).
Source:(FLACH, 2012)We will call this the basic linear classifier.It has the advantage ofsimplicity, being defined in terms of addition, subtraction andrescaling of
munotes.in
Page 20
examplesonly (in other words, w is a linear combination of the examples).However, if those assumptions do not hold, thebasic linear classifier canperform poorly–for instance, note that it may not perfectlyseparate thepositives from the negatives, even if the data is linearly separable.Becausedata is usually noisy, linear separability doesn’t occur very often inpractice, unless the data is very sparse, as in text classification. Recall thatwe used a large vocabulary, say 10 000 words, each word correspondingto a Boolean feature indicating whether or not that word occurs in thedocument. This means that the instance space has 10 000 dimensions, butfor any one document no more than a small percentage of the features willbe non-zero. As a result there is much ‘empty space’ between instances,which increases the possibility of linear separability. However, becauselinearly separable data doesn’t uniquely define a decision boundary, weare now faced with a problem: which of the infinitely many decisionboundaries should we choose? One natural option is to prefer large marginclassifiers, where the margin of a linear classifier is the distance betweenthe decision boundary and the closest instance. Support vector machinesare a powerful kind of linear classifier that find a decision boundary whosemargin is as large as possible (Figure2.2).
Source:(FLACH, 2012)A very useful geometric concept in machine learning is the distance. If thedistance between two instances is small then the instances are similar interms of their feature values, and so nearby instances would be expected toreceive the same classification or belong to the same cluster.In a Cartesian coordinate system, distance can be measured by Euclideandistance, which is the square root of the sum of the squared distancesalong each coordinate:. For n points, thegeneral formula is:
munotes.in
Page 21
nearest-neighbour classifier:A very simple distance-based classifier works as follows:To classify a new instance, we retrieve from memory the most similartraining instance (i.e., the traininginstance with smallest Euclideandistance from the instance to be classified), and simply assign that traininginstance’s class. This classifier is known as the nearest-neighbourclassifier.Suppose we want to cluster our data into K clusters, and we have an initialguess of how the data should be clustered. We then calculate the means ofeach initial cluster and reassign each point to the nearest cluster mean.Unless our initial guess was a lucky one, this will have changed some ofthe clusters, so we repeat these two steps (calculating the cluster meansand reassigning points to clusters) until no change occurs.It remains to be decided how we construct our initial guess. This is usuallydone randomly: either by randomly partitioning the data set into K‘clusters’ or by randomly guessing K ‘cluster centres’. Instead of usingEuclidean distance, which May not work exactly the way it should, foroutliers, other distances can be used such as Manhattan distance, whichsums the distances along each coordinate:.2.2LOGICAL MODELSFor a given problem, the collection of all possible outcomes representsthesample space or instance space.The basic idea for creating a taxonomyof algorithms is that we divide the instance space by using one of threeways:•Using a Logical expression.•Using the Geometry of the instance space.•Using Probability to classify the instance space.The outcome of the transformation of the instance spaceby a machinelearning algorithm using the above techniques should be exhaustive (coverall possible outcomes) and mutually exclusive (non-overlapping).Logical modelscan also be expressed asTree models and Rule modelsLogical modelsuse a logical expression to divide the instance space intosegments and hence construct grouping models. Alogical expressionis anexpression that returns a Boolean value, i.e., a True or False outcome.Once the data is grouped using a logical expression, the data is dividedinto homogeneous groupings for the problem we are trying to solve.Forexample, for a classification problem, all the instances in the group belongto one class.
munotes.in
Page 22
There are mainly two kinds of logical models:Tree modelsandRulemodels.Rule models consist of a collection of implications or IF-THEN rules. Fortree-based models, the ‘if-part’ defines a segment and the ‘then-part’defines the behaviour of the model for this segment. Rule models followthe same reasoning.Tree models can be seen as a particular type of rule model where the if-parts of the rules are organised in a tree structure. Both Tree models andRule models use the same approach to supervised learning.The approachcan be summarised in two strategies: we could first find the body of therule (the concept) that covers a sufficiently homogeneous set of examplesand then find a label to represent the body. Alternately, we could approachit from the other direction, i.e., first select a class we want to learn andthen find rules that coverexamples of the class.The models of this type can be easily translated into rules that areunderstandable by humans, such as ·if Bonus = 1 then Class = Y = spam·.Such rules are easily organized in a tree structure, such as the one inFigure 2.3, whichis called a feature tree. The idea of such a tree is thatfeatures are used to iteratively partition the instance space.
Source:(FLACH, 2012)The leaves of the tree therefore correspond to rectangular areas in theinstance space, which we will call instance space segments, or segmentsfor short. Depending on the task we are solving, we can then label theleaves with a class, a probability, a real value, and so on. Feature treeswhose leaves are labelled with classes are commonly calleddecision trees.A complete feature tree, which contains all features, one at each level ofthe tree is shown in figure 2.4.
munotes.in
Page 23
A feature list is a binary feature tree which always branches in the samedirection, either left or right. The tree in Figure 2.3 is a left-branchingfeature list. Such feature lists can be written as nested if–then–elsestatements that will be familiar to anyone with a bit of programmingexperience. For instance, if we were to label the leaves in Figure 2.3 bymajority class, we obtain the following decision listas per the Rulelearning:if bonus = 1 then Class = Y = spamelse if lottery = 1 then Class = Y = spamelse Class = Y = hamBoth tree learning and rule learning are implemented in top-down fashion.Select a feature from the instance space, which best splits the entiretraining sets into different number of subsets. Each subset can then furtherderive into subsets. Finally, allbelongs to each node of a class. In treelearning, we follow divide and conquer approach.
Source:(FLACH, 2012)In rule based, first write a rule, based on some condition and then step bystep, we add more conditions torule by using some set of examples fromthe training dataset. Now remove those examples from the dataset. Here,we find the class for each feature, ultimately. Here, we follow separate andconquer.Logical models often have different, equivalent formulations. For instance,two alternative formulations for this model are:if bonus = 1
lottery = 1 then Class = Y = spam·else Class = Y = ham·if bonus = 0 lottery = 0 then Class = Y = ham·else Class = Y = spam·We can also represent the same model as un-nested rules:
munotes.in
Page 24
if bonus = 1 then Class = Y = spam·if bonus = 0 lottery = 1 then Class = Y = spam·if bonus = 0 lottery = 0 then Class = Y = ham.Here, every path from root to a leaf is translated into arule. As a result,although rules from the same sub-tree share conditions (such as bonus=0),every pair of rules will have at least some mutually exclusive conditions(such as lottery = 1 in the second rule and lottery = 0 in the third).However, this is not always the case: rules can have a certain overlap.Before learning more on logical models let us understand theterminologies–grouping and grading.Grouping and grading:Grouping is breaking the instance space into groups or segments, thenumber ofwhich is determined at training time.Figure2.4shows theexample of Grouping.Grading models are able to distinguish between arbitrary instances, whenworking in cartesian instance space.The basic linear classifier constructs adecision boundary by half-way intersecting the line between the positive(p)and negative(n)centers of mass. It is described by the equation w·x = t(x is any arbitrary point), as shown inFigure2.5–example of Grading.
Figure2.5–example of GradingSource:(FLACH, 2012)Let us now continue understanding logical models.An interesting aspectof logical models, which sets them aside from most geometric andprobabilistic models, is that they can, to some extent, provide explanationsfor their predictions.For example, a prediction assigned by a decision tree could be explainedby reading off the conditions that led to the prediction from root to leaf.The model itself can also easily be inspected by humans, which is whythey are sometimes called declarative. Declarative models do not need tobe restricted to the simple rules that we have considered so far.
munotes.in
Page 25
The logical rule learning system Progolfound the following set ofconditions to predict whether a molecular compound is carcinogenic(causes cancer):1.it tests positive in the Salmonella assay; or2.it tests positive for sex-linked recessive lethal mutation in Drosophila;or3.it tests negative for chromosome aberration; or4.it has a carbon in a six-membered aromatic ring with a partial chargeof−0.13; or5.it has a primary amine group and no secondary or tertiary amines; or6.it has an aromatic (or resonant) hydrogen with partial charge≥ 0.168;or7.it has a hydroxy oxygen with a partial charge≥−0.616 and anaromatic (or resonant)hydrogen; or8.it has a bromine; or9.it has a tetrahedral carbon with a partial charge≤−0.144 and testspositive on Progol’smutagenicity rules.The first three conditions concerned certain tests that were carried out forall molecules and whose results were recorded in the data as Booleanfeatures. In contrast, the remaining six rules all refer to the structure of themolecule and were constructed entirely by Progol.For instance, rule 4 predicts that a molecule is carcinogenic if it contains acarbon atom with certain properties. This condition is different from thefirst three in that it is not a pre-recorded feature in the data, but a newfeature that is constructed by Progol during the learning process because ithelps to explain the data.Statisticians work very often with different conditional probabilities, givenby the likelihood function P(X|Y ).For example,if somebody was to sendme a spam e-mail, how likely would it be that it contains exactly the wordsof the e-mail I’m looking at? And how likely if it were a ham e-mailinstead?With so many words to choose from, the probability of any particularcombination of words would be very small indeed. What really matters isnot the magnitude of these likelihoods, but their ratio: how much morelikely is it to observe this combination of words in a spam e-mail than it isin a non-spam e-mail.For instance, suppose that for a particular e-mail described by X we haveP(X|Y = spam) = 3.5 · 10−5and P(X|Y = ham) = 7.4 · 10−6, then observingX in a spam e-mail is much more likely than it is in a ham e-mail.This suggests the following decision rule:munotes.in
Page 26
predict spam ifthe likelihood ratio is larger than 1 and ham otherwise.So, which one should we use: posterior probabilities or likelihoods? As itturns out, we can easily transform one into the other using Bayes’ rule, asimple property of conditional probabilities which states thatWhere, P(Y |X) is conditional probabilityP(X |Y) is likelihood functionP(Y) is prior probability without observing data X andP(X) is probability of features independent of YThe first decision rule above suggested that we predictthe class withmaximum posterior probability, which using Bayes’ rule can be written interms of the likelihood function.2.3PROBABILISTIC MODELSThethirdtype of models are probabilistic in nature, like the Bayesianclassifier weconsidered earlier.Many of these models are based aroundthe following idea. Let Xdenote the variables we know about, e.g., ourinstance’s feature values; and let Y denotethe target variables we’reinterested in, e.g., the instance’s class. The key questionin machinelearning is how to model the relationship between X and Y.Since X is known for a particular instance but Y may not be, we areparticularly interested in the conditional probabilities P(Y |X). Forinstance, Y could indicate whether the e-mail is spam, and Xcouldindicate whether the e-mail contains the words ‘bonus’ and ‘lottery’. Theprobability of interest is then P(Y | bonus, lottery), with bonus and lotterytwo Boolean variables which together constitute the feature vector X. Fora particular e-mail we know the feature values and so we might writeP(Y|bonus =1,lottery = 0) if the e-mail contains the word ‘bonus’ but not theword ‘lottery’. This is called a posterior probability because it is used afterthe features X are observed.Table 2.1 shows an example of how these probabilities might bedistributed. From this distribution you can conclude that, if an e-maildoesn’t contain the word ‘Bonus’, then observing the word ‘lottery’increases the probability of the e-mail being spam from 0.31 to 0.65; butifthe e-mail does contain the word ‘Bonus’, then observing the word‘lottery’ as well decreases the spam probability from 0.80 to 0.40.
munotes.in
Page 27
Table 2.1. An example posterior distribution.‘Bonus’ and ‘lottery’ are two Booleanfeatures; Y is the class variable, with values ‘spam’ and ‘ham’. In each row the mostlikely class is indicated in blue.Source:(FLACH, 2012)Even though this example table is small, itwill grow unfeasibly large veryquickly,with n Boolean variables 2ncases have to be distinguished.Wetherefore don’t normally have access to the full joint distribution and haveto approximate it using additional assumptions, as we will see below.Assuming that X and Y are the only variables we know and care about, theposterior distribution P(Y |X) helps us to answer many questions ofinterest. For instance, to classify a new e-mail we determine whether thewords ‘Bonus’ and ‘lottery’ occur in it,look up the correspondingprobability P(Y = spam | Bonus,Lottery), and predict spam if thisprobability exceeds 0.5 and ham otherwise. Such a recipe to predict avalue of Y on the basis of the values of X and the posterior distributionP(Y |X) is called adecision rule.2.4FEATURESMACHINE LEARNING IS ALL ABOUT using the right features to buildthe right models that achieve the right tasks–this is the slogan, visualisedin Figure2.6.In essence,featuresdefine a ‘language’ in which wedescribe the relevant objects in our domain.We should not normally haveto go back to the domain objects themselves once we have a suitablefeature representation, which is why features play such an important rolein machine learning.Ataskis an abstract representation of a problemwe want to solveregarding those domain objects: the most common form of these isclassifying them into two or more classes. Many of these tasks can berepresented as a mapping from data points to outputs.This mapping ormodelis itself produced as the output of a machinelearning algorithm applied to training data; there is a wide variety ofmodels to choose from.No matter what variety of machine learningmodels you may encounter, you will find that they are designed to solveone of only a small number of tasks and use only a few different types of
munotes.in
Page 28
features. One could say thatmodels lend the machine learning fielddiversity, but tasks and features give it unity.
Figure 2.6.An overviewof how machine learning is used to address a given task. A task(upper box) requires an appropriate mapping–a model–from data described by featuresto outputs. Obtaining such a mapping from training data is what constitutes a learningproblem (lower box).Source:(FLACH, 2012)Features determine much of the success of a machine learning application,because a model is only as good as its features. A feature can be thoughtof as a kind of measurement that can be easily performed on any instance.Mathematically, they are functions, that map from the instance space tosome set of feature values called the domain of the feature. Sincemeasurements are often numerical, the most common feature domain isthe set of real numbers. Other typical feature domains include the set ofintegers, for instance when the feature counts something, such as thenumber of occurrences of a particular word; the Booleans, if our feature isa statement that can be true or false for a particular instance, such as ‘thise-mail is addressed to Beena Kapadia’;and arbitrary finite sets, such as aset of colours, or a set of shapes.Suppose we have a number of learning modelsthat we want to describe interms of a number of properties:•the extent to which the models are geometric, probabilistic or logical•whether they are grouping or grading models•the extent to which they can handle discrete and/or real-valuedfeatures•whether they are used in supervised or unsupervised learning; and•the extent to which they can handle multi-class problems.The first twoproperties could be expressed by discrete features with threeandtwo values, respectively; or if the distinctions are more gradual, eachaspect couldbe rated on some numerical scale.2.5FEATURE TYPES
munotes.in
Page 29
There are mainly three kinds of features–Quantitative, Ordinal andCategorical.
Kinds of features, their properties and allowable statistics.Each kind inheritsthe statistics from the kinds above it in the table. For instance, the mode is a statistic ofcentral tendency that can be computed for any kind of featureSource:(FLACH, 2012)Quantitative:They have a meaningful numerical scale and order.They most ofteninvolve a mapping into the reals or continuous. Even if a feature maps intoa subset of the reals, such as age expressed in years, the various statisticssuch as mean or standard deviation still require the full scale of the reals.Ordinal:Features with an ordering but without scale are called ordinal features. Thedomain of an ordinal feature is some totally ordered set, such as the set ofcharacters or strings. Even if the domain of a feature is the set of integers,denoting the feature as ordinal means that we have to dispense with thescale, as we did with house numbers. Another common example arefeatures that express a rank order: first, second, third, and so on. Ordinalfeatures allowthe mode and median as central tendency statistics, andquantiles as dispersion statistics.Categorical:Features without ordering or scale are called categorical features (orsometimes ‘nominal’ features). They do not allow any statistical summaryexcept the mode. One subspecies of the categorical features is the Booleanfeature, which maps into the truth values true and false. The situation issummarised in Table2.1.Models treat these different kinds of feature in distinct ways. First,consider treemodels such as decision trees. A split on a categorical featurewill have as many children as there are feature values. Ordinal andquantitative features, on the other hand, give rise to a binary split, byselecting a value v0 such that all instances witha feature value less than orequal to v0 go to one child, and the remaining instances to the other child.It follows that tree models are insensitive to the scale of quantitativefeatures. For example, whether a temperature feature is measured on theCelsius scale or on the Fahrenheit scale will not affect the learned tree.Neither will switching from a linear scale to a logarithmic scale have anyeffect: the split threshold will simply be logv0 instead of v0. In general,tree models are insensitive to monotonic transformations on the scale of afeature, which are those transformations that do not affect the relativeorder of the feature values. In effect, tree models ignore the scale of
munotes.in
Page 30
quantitative features, treating them as ordinal. The same holds for rulemodels.Now let’s consider the naive Bayes classifier. We have seen that thismodel works by estimating a likelihood function P(X|Y) for each featureX given the class Y. For categorical and ordinal features with k values thisinvolves estimating P(X =v1|Y), . . . ,P(X = vk |Y). In effect, ordinalfeatures are treated as categorical ones, ignoring the order.Quantitative features cannot be handled at all, unless they are discretisedinto a finite number of bins and thus converted to categorical.Alternatively, we could assume a parametric form for P(X|Y), for instancea normal distribution.While naive Bayes only really handles categorical features,manygeometric models go in the other direction: they can only handlequantitative features. Linearmodels are a case in point: the very notion oflinearity assumes a Euclidean instance space in which features act asCartesian coordinates, and thus need to be quantitative. Distance-basedmodels such as k-nearest neighbour and K-means require quantitativefeatures if their distance metric is Euclidean distance, but we can adapt thedistance metric to incorporate categorical features by setting the distanceto 0 for equal values and 1 for unequal values.In a similar vein, for ordinal features we can count the number of valuesbetween two feature values (if we encode the ordinal feature by means ofintegers, this would simply be their difference). This means that distance-based methods can accommodate all feature types by using an appropriatedistance metric. Similar techniques can be used to extend support vectormachines and other kernel-based methods to categorical and ordinalfeatures.2.6FEATURE CONSTRUCTION ANDTRANSFORMATIONThere is a lot of scope in machine learning for playing around withfeatures. In the spam filter example, and text classification more generally,the messages or documents don’t come with built-in features; rather, theyneed to be constructed by the developer of the machine learningapplication. This feature construction process is absolutely crucial for thesuccess of a machine learning application.Indexing an e-mail by the words that occur in it (called a bag of wordsrepresentation as it disregards the order of the words in the e-mail) is acarefully engineered representation that manages to amplify the ‘signal’and attenuate the ‘noise’ in spam e-mail filtering and related classificationtasks. However, it is easy to conceive of problems where this would beexactly the wrong thing to do: for instance if we aim to train a classifier tomunotes.in
Page 31
distinguish between grammatical and ungrammatical sentences, wordorder is clearly signal rather than noise, and a different representation iscalled for.
Figure 2.7. (left) Artificial data depicting a histogram of body weightmeasurements of people with (blue) and without (red) diabetes, with eleven fixedintervals of 10 kilograms width each. (right) By joining the first and second, thirdand fourth, fifth and sixth, and the eighth, ninth and tenth intervals, we obtain adiscretisation such that the proportion of diabetes cases increases from left toright. This discretisation makes the featuremore useful in predicting diabetes.(FLACH, 2012)It is often natural to build a model in terms of the given features. However,we are free to change the features aswe see fit, or even to introduce newfeatures. For instance, real-valued features often contain unnecessarydetail that can be removed by discretisation. Imagine you want to analysethe body weight of a relatively small group of, say, 100 people, bydrawing a histogram.If you measure everybody’s weight in kilograms with one position afterthe decimal point (i.e., your precision is 100 grams), then your histogramwill be sparse and spiky. It is hard to draw any general conclusions fromsuch a histogram. It would be much more useful to discretise the bodyweight measurements into intervals of 10 kilograms. If we are in aclassification context, say we’re trying to relate body weight to diabetes,we could then associate each bar of the histogram with the proportion ofpeople having diabetes among the people whose weight falls in thatinterval.In fact, we can even choose the intervals such that this proportionis monotonically increasing as shown in Figure2.7.The previous example gives another illustration of how, for a particulartask such as classification, we can improve the signal-to-noise ratio of afeature. In more extreme cases of feature construction, we transform theentire instance space. Consider Figure2.6: the data on the left is clearlynot linearly separable, but by mapping the instance space into a new‘feature space’ consisting of the squares of the original features we seethat the data becomes almostlinearly separable. In fact, by adding in athird feature we can perform a remarkable trick: we can build this featurespace classifier without actually constructing the feature space.2.7FEATURE SELECTION
munotes.in
Page 32
Once we have constructed new features it is often a good idea to select asuitable subset of them prior to learning. Not only will this speed uplearning as fewer candidate features need to be considered, it also helps toguard against overfitting.
(FLACH, 2012)There are two main approaches to feature selection, The filter approachand the relief approach.The filter approach scoresthefeatures on a particular metric and the top-scoring features are selected. Many of the metrics we have seen so far canbe used for feature scoring, including information gain, the χ2 statistic, thecorrelation coefficient, to name just a few.An interesting variation is provided by the Relief feature selection method,which repeatedly samples a random instance x and finds its nearest hit h(instance of the same class) as well as its nearest miss m (instance ofopposite class). The i-th feature’s score isthen decreased by Dis(xi , hi)2and increased by Dis(xi , mi)2, where Dis is some distance measure (e.g.,Euclidean distance for quantitative features, Hamming distance forcategorical features). The intuition is that we want to move closer to thenearesthit while differentiating from the nearest miss.One drawback of a simple filter approach is that no account is taken ofredundancy between features. Imagine, for the sake of the argument,duplicating a promising feature in the data set: both copies score equallyhigh and will be selected, whereas the second one provides no added valuein the context of the first one.Secondly, feature filters do not detect dependencies between features asthey are solely based on marginaldistributions. For example, consider twoBoolean features such that half the positives have the value true for bothfeatures and the other half have the value false for both, whereas allnegatives have opposite values (again distributed half-half over the two
munotes.in
Page 33
possibilities). It followsthat each feature in isolation has zero informationgain and hence is unlikely to be selected by a feature filter, despite theircombination being a perfect classifier. One could say that feature filtersare good at picking out possible root features for adecision tree, but notnecessarily good at selecting features that are useful further down the tree.To detect features that are useful in the context of other features, we needto evaluate sets of features; this usually goes under the name of wrapperapproaches. The idea is that feature selection is ‘wrapped’ in a searchprocedure that usually involves training and evaluating a model with acandidate set of features.Forward selection methods start with an empty set of features and addfeatures to theset one at a time, as long as they improve the performanceof the model. Backward elimination starts with the full set of features andaims at improving performance by removing features one at a time. Sincethere are an exponential number of subsets of features it is usually notfeasible to search all possible subsets, and most approaches apply a‘greedy’ search algorithm that never reconsiders the choices it makes.SUMMARYAfter studying this chapter, you will understand different modes likeGeometricModels,Logical ModelsandProbabilistic Models. You willunderstand about features usage and why it is very important in modeldesigning. You will also understand about differentFeature types, howthey can beConstructedandwhy theirTransformationrequired and how itcan be done. You will also understand howFeature Selectionplays animportant role in designing a model and how to do it.UNIT ENDQUESTIONS1.How a linear classifier construct decision boundary using linearseparable data? Explain itin detail with respect to geometric modelsof Machine Learning.2.Explain the working of decision boundary learned by Support VectorMachine from linear separable data with respect to geometric modelsof Machine Learning.3.Describe logical models.4.Write a short note on probabilistic models.5.Machine learning is all about using the right features to build the rightmodels that achieve the right tasks–justify this sentence.6.What are various types of features available? Explain each one inbrief.munotes.in
Page 34
7.Why are feature construction and feature transformation required?How to achieve them?8.What are the approaches to feature selection? Explain each one indetail.REFERENCES•FLACH, P. (2012). MACHINE LEARNING The Art and Science ofAlgorithms thatMake Sense of Data. Cambridge, New York,Melbourne, Madrid, Cape Town, Singapore, S˜ao Paulo, Delhi,Mexico City: cambridge university press.munotes.in
Page 35
UNIT II3CLASSIFICATION AND REGRESSIONUnit structure3.0Objectives3.1Introduction3.2Classification3.3Binary Classification3.4Assessing Classification performance3.5Class probability Estimation3.6Assessing class probability Estimates3.7Multiclass ClassificationSummaryUnit End QuestionsReferences3.0 OBJECTIVESA learner will learn:-Machine learning methods like classification-Will also explore the classification algorithms with practicalapproach.-Concept of multiclass classification3.1INTRODUCTIONClassification may be defined as the process of predicting class orcategory from observed values or given data points. The categorizedoutput can have the form such as “Red” or “Blue” or “spam” or “nospam”.Conceptually, classification is the task of approximating a mappingfunction (f) from input variables (X) that tends to output variables (Y). Itis basically belonging to the supervised machine learning in which targetsare also provided along with the input data set.An exampleof classification problem can be the spam detection in emails.There can be only two categories of output, “spam” and “no spam”; hencethis is a binary type classification.munotes.in
Page 36
To implement this classification, we first need to train the classifier. Forthisexample, “spam” and “no spam” emails would be used as the trainingdata. After successfully train the classifier, it can be used to detect anunknown email.3.1.1Types of Learners in Classification:There exist two types of learners in classification problems–a.Lazy Learners:These learnerswaitfor the testing data to be appeared after storing thetraining data. Classification is done only after getting the testing data.They spend less time on training but more time on predicting. Examples oflazy learners are K-nearest neighbor and case-based reasoning.b.Eager LearnersAs opposite to lazy learners, eager learners construct classification modelwithout waiting for the testing data to be appeared after storing thetraining data. They spend more timeon training but less time onpredicting. Examples of eager learners are Decision Trees, Naïve Bayesand Artificial Neural Networks (ANN).’3.2 CLASSIFICATIONIn machine learning, classification refers to a predictive modeling problemwhere a class labelis predicted for a given example of input data.Classification is the most common task in machine learning. A classifier iscˆ:X→ C, where C = {C1,C2,...,Ck } is a finite and usually small set ofclass labels. We will sometimes also use Cito indicate the set of examplesof that class. We use the ‘hat’ to indicate that cˆ(x) is an estimate of thetrue but unknown function c(x).Examples for a classifier take the form(x,c(x)), where xX is an instance and c(x) is the true class of theinstance. Learning a classifier involves constructing the function cˆ suchthat it matches c as closely as possible (and not just on the trainingset, butideally on the entire instance space X ).Classification in machine learning and statistics is a supervised learningapproach in which the computer program learns from the data given to itand make new observations or classifications. Classification is a processof categorizing a given set of data into classes, It can be performed onboth structured or unstructured data. The process starts with predicting theclass of given data points. The classes are often referred to as target, labelor categories. The classification predictive modelling is the task ofapproximating the mapping function from input variables to discreteoutput variables. The main goal is to identify which class/category the newdata will fall into.munotes.in
Page 37
Fig 3.1: Email classification exampleThe algorithm which implements the classification on a dataset is knownas a classifier. There are two types of Classifications:•Binary Classifier:If the classification problem has only two possibleoutcomes, then it is called as Binary Classifier.Examples:YES or NO, MALE or FEMALE, SPAM or NOT SPAM,CAT or DOG, etc.•Multi-class Classifier:If a classification problem has more than twooutcomes, then it is called as Multi-class Classifier.Example:Classifications of types of crops, Classification of types ofmusic.3.3 ASSESSING CLASSIFICATION PERFORMANCEThe contingencytable is usedto assess the performance of classification.The table is also known as confusion matrix. The table is constituted withrows and columns. Each row refers to actual classes as recorded in thetest set, and each column to classes as predicted by the classifier. In thegiven table 3.1 , the first row states that the test set contains 50 positives,30 of which were correctly predicted and 20 incorrectly. The lastcolumnand the last row give the marginals (i.e., column and row sums). Marginalsare important because they allow us to assess statistical significance.Predicted (+ve)Predicted (-ve)Actual (+ve)302050Actual (-ve)1040504060100Table 3.1: Confusion Matrix+ve-ve+ve203050-ve2030504060100Table 3.2: two-class contingency tableThe table 3.2, has the same marginals, but the classifier clearly makes arandom choice as to which predictions are positive and which are negative
munotes.in
Page 38
–as a result the distribution of actual positives and negatives in eitherpredicted class is the same.From a contingency table we can calculate a range of performanceindicators. The simplest of these is accuracy, which is the proportion ofcorrectly classified test instances. In the notation introduced at thebeginning of this chapter, accuracy over a test set Te is defined as shownin equation 3.1:……….(3.1)As stated in the equation 3.1, the function I[·] denotes the indicatorfunction, which is1 if its argument evaluates to true, and 0 otherwise. Inthis case it is a convenient way to count the number of test instances thatare classified correctly by the classifier (i.e., the estimated class label cˆ(x)is equal to the true class label c(x)).Alternatively, we can calculate theerror rate as the proportion of incorrectly classified instances, here 0.30and 0.50, respectively. Clearly, accuracy and error rate sum to 1.Test set accuracy can be seen as an estimate of the probability that anarbitrary instance xX is classified correctly: more precisely, it estimatesthe probability.We have access to the true classes of a small fraction of the instance spaceand so an estimate is all we can hope to get. It is therefore important thatthe test set is as representative as possible. This is usually formalised bythe assumption that the occurrence of instances in the world. Correctlyclassified positives and negatives are referred to as true positives and truenegatives, respectively. Incorrectly classified positives are, perhapssomewhat confusingly, called false negatives; similarly, misclassifiednegatives are called false positives. The positive/negative refers to theclassifier’s prediction, and true/false refers to whether the prediction iscorrect or not. So, a false positive is something that was incorrectlypredicted as positive, and therefore an actual negative (e.g., a ham emailmisclassified as spam, or a healthy patient misclassified as having thedisease in question).The true positive rate is the proportion of positives correctly classified,and can be defined mathematically as given in equation 3.2:………(3.2)
munotes.in
Page 39
True positive rate is an estimate of the probability that an arbitrary positiveis classified correctly, that is, an estimate of PX(cˆ(x) =+ve|c(x) = +ve).Analogously, the true negative rate is the proportion of negatives correctlyclassified and estimates PX(cˆ(x) =-ve|c(x) =-ve). These rates, which aresometimes called sensitivity and specificity, can be seen as per-classaccuracies. Inthe contingency table, the true positive and negative ratescan be calculated by dividing the number on the descending (good)diagonal by the row total. In table 3.2, we have a true positive rate of 60%,a true negative rate of 80%, a false negative rateof 40% and a falsepositive rate of 20%. We have a true positive rate of 40%, a true negativerate of 60%, a false negative rate of 60% and a false positive rate of 40%.Notice that the accuracy in both cases is the average of the true positiverate and the true negative rate.Example 2.1 (Accuracy as a weighted average). Suppose a classifier’spredictions on a test set are as in the following table:Predicted (+ve)Predicted (-ve)Actual (+ve)601575Actual (-ve)1015257030100From thistable, we see that the true positive rate is tpr = 60/75 = 0.80 andthe true negative rate is tnr = 15/25 = 0.60. The overall accuracy is acc =(60 + 15)/100 = 0.75, which is no longer the average of true positive andnegative rates. However, taking intoaccount the proportion of positivespos = 0.75 and the proportion of negatives neg = 1−pos = 0.25, we see thatacc = pos·tpr +neg ·tnrThis equation holds in general: if the numbers of positives and negativesare equal, we obtain the unweighted average from the earlier example (acc= (tpr + tnr)/2). Thegiven equation has a neat intuition: good performanceon either class contributes to good classification accuracy, but the moreprevalent class contributes more strongly. In order to achieve goodaccuracy, a classifier should concentrate on the majority class, particularlyif the class distribution is highly unbalanced. However, it is often the casethat the majority class is also the least interesting class.3.4 CLASS PROBABILITY ESTIMATIONA class probability estimator–or probability estimator in short–is ascoring classifier that outputs probability vectors over classes, i.e., amapping pˆ : X→ [0,1]k . We write pˆ (x) = pˆ1(x),...,pˆk (x) , wherepˆi(x) is the probability assigned to class Ci for instance x, and k i=1 pˆi(x)= 1. If we have only two classes, the probability associated with one classis 1 minus the probability of the other class; in that case, we use pˆ(x) todenote the estimated probability of the positive class for instance x. Asmunotes.in
Page 40
with scoring classifiers, we usually do not havedirect access to the trueprobabilities pi(x).First, assume a situation in which any two instances are similar to eachother. We then have PC (c(x ) =|xx) = PC (c(x ) =) which issimply estimated by the proportion pos of positives in our data set (I amgoing to drop the subscript C from now on). In other words, in thisscenario we predict pˆ(x) = pos regardless
Figure 3.2. A probability estimation treeof whether we know anything about x’s true class. At the other extreme,consider a situation in which no two instances are similar unless they arethe same, i.e., xx if x = x, and xx otherwise. In this case we haveP(c(x ) = +ve|xx) = P(c(x) = +ve), which–because x is fixed–is 1 ifc(x) = +ve and 0 otherwise. Put differently, we predict pˆ(x) = 1 for allknown positives and pˆ(x) = 0 for all known negatives, but we can’tgeneralise this to unseen instances.A feature tree allows us tostrike a balance between these extreme andsimplistic scenarios, using the similarity relationT associated withfeature tree T : xT x if, and only if, x and x are assigned to the same leafof the tree. In each leaf we then predict the proportion of positivesassigned to that leaf.3.5ASSESSING CLASS PROBABILITY ESTIMATESAs with classifiers,we can now ask the question of how good these classprobability estimators are. A slight complication here is that, as already
munotes.in
Page 41
remarked, we do not have access to the true probabilities. One trick that isoften applied is to define a binary vector (I[c(x)= C1],...,I[c(x) = Ck ]),which has the i-th bit set to 1 if x’s true class is Ci and all other bits set to0, and use these as the ‘true’ probabilities. We can then define the squarederror (SE) of the predicted probability vector pˆ (x) = (pˆ1(x),...,pˆk (x)) as:……. (3.3)and the mean squared error (MSE) as the average squared error over allinstances in the test set:………. (3.4)This definition of error in probability estimates is often used in forecastingtheory where it is called the Brier score.Example 2.6 (Squared error).:Suppose one model predicts (0.70,0.10,0.20) for a particular example x ina three-class task, while another appears much more certain by predicting(0.99, 0, 0.01). If the first class is the actual class, the second prediction isclearly better than the first: the SE of the first prediction is ((0.70−1)2+(0.10−0)2+(0.20−0)2)/2 = 0.07, while for the second prediction it is((0.99− 1)2+(0−0)2+(0.01−0)2)/2 = 0.0001. The first model gets punishedmore because, although mostly right, it isn’t quite sure of it. However, ifthe third class is the actual class, the situation is reversed: now the SE ofthe first prediction is ((0.70−0)2 +(0.10−0)2 +(0.20−1)2)/2 = 0.57, and ofthe second ((0.99−0)2 +(0−0)2 +(0.01−1)2)/2 = 0.98. Thesecond modelgets punished more for not just being wrong, but being presumptuous.Wit reference to figure 3.2, we calculate the squared error per leaf asfollows (left to right): SE1 = 20(0.33−1)2 +40(0.33−0)2 = 13.33 SE2 =10(0.67−1)2 +5(0.67−0)2 = 3.33SE3 = 20(0.80−1)2 +5(0.80−0)2 = 4.00.which leads to a mean squared error of MSE = 1 100 (SE1+SE2+SE3) =0.21An interesting question is whether we can change the predictedprobabilities in each leaf to obtain a lower mean squared error. It turns outthat this is not possible: predicting probabilities obtained from the classdistributions in each leaf is optimal in the sense of lowest MSE.For instance, changing the predicted probabilities in the left-most leaf to0.40 for spam and 0.60 for ham, or 0.20for spam and 0.80 for ham, resultsin a higher squared error: SE 1 = 20(0.40−1)2 +40(0.40−0)2 = 13.6 SE 1 =20(0.20−1)2 +40(0.20−0)2 = 14.4 The reason for this becomes obvious ifwe rewrite the expression for two-class squared error of a leaf as follows,
munotes.in
Page 42
using the notation nand nfor the numbers of positive and negativeexamples in the leaf:where p˙ = n/(n+n) is the relative frequency of the positive classamong the examples covered by the leaf, also called the empiricalprobability. As the term p˙(1−p˙) does not depend on the predictedprobability pˆ, we see immediately that we achieve lowest squared error inthe leaf if we assign pˆ = p˙.3.6 MULTI-CLASS PROBABILITY ESTIMATIONConsider the standard setting of multi-class classification with an instancespace X and a set of classes Y = {y1, . . . , yK}. We are interested inlearning a probabilistic classifier, that is, a model that estimates theconditional probabilities of classes given an instance xX:(p1, . . . , pK) = (Py(y1 | x), . . . ,Py(yK | x)) (3.4)Since true probability degrees are rarely available for training,probabilistic classifiers are typically trained on standard classificationdata, that is, observations of the form (x, y)X × Y, where the class labely is assumed tobe generated according to Py (· | x).Probability estimation is known to be a quite hard problem, especially incomparison to standard classification. This comes at no surprise, notingthat proper probability estimation is a sufficient but not necessarycondition for proper classification: If the conditional class probabilities (1)are predicted accurately, an optimal classification can simply be made bypicking the class with highest probability:……………….. (3.5)The Bayes decision can be taken so asto minimize any loss in expectation.On the other hand, a correct classification can also be obtained based onless accurate probability estimates. In fact, the classification will remaincorrect as long as the estimated probability is highest for the trueclass. Or,stated differently, an estimation error will remain ineffective unless itchanges the result of the arg max operation.Methods like naive Bayes and decision trees are multi-class classifiers andcan in principle be used to produce probabilityestimates in this setting. Inpractice, however, one often prefers to estimate probabilities in the two-class setting, especially because estimating a single probability (of thepositive class) is much simpler than estimating K− 1 probabilitiessimultaneously. Moreover, the binary case is amenable to a broader
munotes.in
Page 43
spectrum of classifiers, including logistic regression, which is a provenmethod for probability estimation. On the other hand, the reduction ofmultinomial to binomial probability estimation obviously involves anaggregation problem, namely the need to combine probabilities on pairs ofclasses into probabilities on the label set Y. This is the idea of “pairwisecoupling” techniques.SUMMARYClassification deals with labelling the tuples base on some attribute.Binary classification refers to predicting one of two classes and multi-classclassification involves predicting one of more than two classes. Multi-label classification involves predicting one or more classes for eachexample and imbalancedclassification refers to classification tasks wherethe distribution of examples across the classes is not equal.Classprobability Estimation. A class probability estimator–or probabilityestimator is a scoring classifier. Multi-class classification makes theassumption that each sample is assigned to one and only one label.UNIT END QUESTIONS1.Explain the concept ofclassification with suitable example.2.Illustrate the assessment of classification with suitable example.3.Write a note on binaryclassification4.Briefly explain the concept of class probability Estimation5.Explain Multiclass Classification with concept note.6.How are classification estimates assessed? Explain with suitableexample.REFERENCES•Peter Flach, Machine LearningThe Art and Science of Algorithms that Make Senseof Data, Cambridge Press, 2012•Baidaa M Alsafy,Zahoor Mosad, Wamidh k. Mutlag,Multiclass ClassificationMethods: A Review,International Journal of Advanced Engineering Technology andInnovative Science (IJAETIS), 2020.•Robust Model-Free Multiclass Probability Estimation (nih.gov)•Probability Estimation-an overview | ScienceDirect Topicsmunotes.in
Page 44
4REGRESSIONUnit structure4.0Objectives4.1Introduction4.2Assessing performance of Regression4.2.1Error measures4.3Overfitting4.3.1Catalysts for Overfitting4.4Case study of Polynomial RegressionSummaryUnit End QuestionsReferences4.0 OBJECTIVESA learning will learn:-regression method with practical cases-how to apply regression method-identify the error measures with overfitting values4.1 INTRODUCTIONRegression is a method of modelling a target value based on independentpredictors. This method is mostly used for forecasting and finding outcause and effect relationship between variables. Regression techniquesmostly differ based on the number of independent variables and the typeof relationship between the independent and dependent variables.Regression is asupervised learning techniquewhich helps in finding thecorrelation between variables and enables us to predict the continuousoutput variable based on the one or more predictor variables. It is mainlyused forprediction, forecasting, time series modeling, and determining thecausal-effect relationship between variables.As mentioned earlier Linear Regression is a supervised machine learningalgorithm. It predicts the value within a continuous range of numbers.munotes.in
Page 45
1. Simple regression:Simple linear regression uses traditional slope-intercept form to producethe most accurate predictions.xrepresents our input data andyrepresentsour prediction.The motive of the linear regression algorithm is to find the best valuesformandcin the equationy = mx + c.
Fig. 4.1: Simple Linear RegressionSimple linear regression is a type of regression analysis where the numberof independent variables is one and there is a linear relationship betweenthe independent(x) and dependent(y) variable. The red line in the abovegraph is referred to as the best fit straight line. Based on the given datapoints, we try to plot a line that models the points the best. The line can bemodelled based on the linear equation shown below.y = a_0 + a_1 * x # Linear EquationThe motive of the linear regression algorithm is to find the best values fora_0 and a_1. Before moving on to the algorithm, let’s have a look at twoimportant concepts you must know to better understand linear regression.4.1.1Cost Function:The cost function helps us to figure out the best possible values for a_0and a_1 which would provide the best fit line for the data points. Since wewant the best values for a_0 and a_1, we convert this search problem intoa minimization problem where we would like to minimize the errorbetween the predicted value and the actual valueWe choose the above function to minimize. The difference between thepredicted values and ground truth measures the error difference. Wesquare the error difference and sum over all data points and divide thatvalue by the total number of data points. This provides the averagesquared error over all the data points. Therefore, this cost function is alsoknown as the Mean Squared Error(MSE) function. Now, using this MSE
munotes.in
Page 46
function we are going to change the values of a_0 and a_1 such that theMSE value settles at the minima.Cost function optimizes the regressioncoefficients or weights. It measures how a linear regression model isperforming. We can use the cost function to find the accuracy ofthemapping function, which maps the input variable to the outputvariable. This mapping function is also known asHypothesis function.4.1.2Gradient Descent:The next important concept needed to understand linear regression isgradient descent.Gradient descent is a method of updating a_0 and a_1 toreduce the cost function(MSE).The idea is that we start with some valuesfor a_0 and a_1 and then we change these values iteratively to reduce thecost. Gradient descent helps us on how to change the values.Gradientdescent is used to minimize the MSE by calculating the gradient of thecost function. A regression model uses gradient descent toupdate thecoefficients of the line by reducing the cost function. It is done by arandom selection of values of coefficient and then iteratively update thevalues to reach the minimum cost function.
Fig 4.2:Gradient DescentTo draw an analogy, imagine a pit in the shape of U and you are standing atthe topmost point in the pit and your objective is to reach the bottom of thepit. There is a catch, you can only take a discrete number of steps to reachthe bottom. If you decide to take one step at a time you would eventuallyreach the bottom of the pit but this would take a longer time. If you chooseto take longer steps each time, you would reach sooner but, there is achance that you could overshoot the bottom of the pit and not exactly at thebottom.In the gradient descent algorithm, the number of steps you take isthe learning rate. This decides on how fast the algorithm converges to theminima.
munotes.in
Page 47
Fig 4.3: Convex and Non Convex4.2 ASSESSING PERFORMANCE OF REGRESSIONIn regression model, the most commonly known evaluation metricsinclude:1.Mean Absolute Error(MAE), like the RMSE, the MAE measures theprediction error. Mathematically, it is the average absolute differencebetween observed and predicted outcomes,MAE =mean(abs(observeds-predicteds)). MAE is less sensitive to outlierscompared to RMSE.2.Root Mean Squared Error(RMSE), which measures the averageerror performed by the model in predicting the outcome for anobservation. Mathematically, the RMSE is the square root of themeansquared error (MSE), which is the average squared difference betweenthe observed actual outome values and the values predicted by themodel. So,MSE = mean((observeds-predicteds)^2)andRMSE =sqrt(MSE). The lower the RMSE, the better the model.3.R-squared(R2), which is the proportion of variation in the outcomethat is explained by the predictor variables. In multiple regressionmodels, R2 corresponds to the squared correlation between theobserved outcome values and the predicted values by the model. TheHigher the R-squared, the better the model.4.Residual Standard Error(RSE), also known as themodel sigma, is avariant of the RMSE adjusted for the number of predictors in themodel. The lower the RSE, the better the model. In practice, thedifference between RMSEand RSE is very small, particularly forlarge multivariate data.4.2.1MeanAbsolute Error(MAE):It is the average absolute difference between observed and predictedoutcomes.We defined the MAEas,
munotes.in
Page 48
whereyyis the actual valueŷis the predicted value and|y−y^|is theabsolute value of the difference between the actual and predicted value. Nis the number of sample points.For Example:Take a look at the following plot, which shows the numberof failures for a piece of machinery against the age of the machine:
In order to predict the number of failures from the age, we would want tofit a regression line such as this:
In order to understand how well this line represents the actual data, weneed to measurehow good a fitit is. We can do this by measuring thedistance from the actual data points to the line:
You may recall that these distances are called residuals or errors. Themean size of these errors is the MAE. We can calculate it as follows:
munotes.in
Page 49
here is how the table and formularelate:
The MAE has a big advantage in thatthe units of the MAE are the sameas the units ofyy,the feature we want to predict. Inthe example above,we have an MAE of 8.5, so it means that on average our predictions of thenumber ofmachine failures are incorrect by 8.5 machine failures.4.2.2Root Mean Squared Error(RMSE):It measures the average error performed by the model in predicting theoutcome for an observation.Its calculation is very similar to MAE, butinstead of takingtheabsolutevalue to get rid of the sign on the individualerrors, wesquarethe error (because the square of a negative number ispositive).The formula for RMSE is:
munotes.in
Page 50
As with MAE, we can think ofRMSE as being measured in the yunits.So the aboveerror can be read as an error of 9.9 machine failures onaverage per observation.4.2.3 R-Squared:It tells us the degree to which the model explains the variance in the data.In other words how much better it is than just predicting the mean.•A value of 1 indicates a perfect fit.•A value of 0 indicates a model no better than the mean.•A value less than 0 indicates a model worse than just predicting themean.4.2.4 Residual Standard Error:Theresidualstandarderroris√MSE. TheMSEis an unbiased estimatorofσ2, whereσ2=Var(y|x).Forexample:AnovatableofSLR/SimpleLinearRegression(DFisdifferentformultipleregression):
4.3 OVERFITTINGOverfitting a model is a condition where a statistical model begins todescribe the random error in thedata rather than the relationships betweenvariables. This problem occurs when the model is too complex.
munotes.in
Page 51
Inregression analysis, overfitting can produce misleadingR-squaredvalues,regression coefficients, and p-values. In this post, I explainhow overfitting models is a problem andhow you canidentify and avoidit. Overfitregressionmodels have too many terms for the number ofobservations. When this occurs, theregression coefficientsrepresent thenoise rather than the genuine relationships in thepopulation.That’s problematic by itself. However, there is another problem.Eachsamplehas its own unique quirks. Consequently, a regression modelthat becomes tailor-made to fit the random quirks of one sample isunlikely to fit the randomquirks of another sample. Thus, overfitting aregression model reduces its generalizability outside the original dataset.4.3.1 Graphical Illustration of Overfitting Regression Models:The image below illustrates an overfit model. The green line represents thetrue relationship between the variables. The random error inherent in thedata causes the data points to fall randomly around the greenfit line. Thered line represents an overfit model. This model is too complex, and itattempts to explain the random error present in the data.
Fig 4.4: OverfittingThe example above is very clear. However, it’s not always that obvious.Below, the fitted line plot shows an overfit model. In the graph, it appearsthat the model explains a good proportion of thedependentvariablevariance.
Fig 4.5: Overfitting Line
munotes.in
Page 52
4.3.2Catalysts for Overfitting:This concept is fairly intuitive. Suppose we have a total sample size of 20and we need toestimateone populationmeanusing a 1-sample t-test.We’ll probablyobtain a good estimate. However, if we want to use a 2-sample t-test to estimate the means of two populations, it’s not as goodbecause we have only ten observations to estimate each mean. If we wantto estimate three or more means using one-way ANOVA, it becomespretty bad.As the number of observations per estimate decreases (20, 10, 6.7, etc.),theestimatesbecome more erratic. Furthermore, a new sample is unlikelyto replicate the inconsistent estimates produced by the smaller samplesizes.In short, the quality of the estimates deteriorates as you draw moreconclusions from a sample. This idea is directly related to the degrees offreedom in the analysis. To learn more about this concept, read mypost:Degrees of Freedom in Statistics.4.3.3 Applying These Concepts to Overfitting Regression Models:Overfitting a regression model is similar to the example above. Theproblems occur when you try to estimate too manyparametersfrom thesample. Each term in the model forces the regression analysis to estimateaparameterusing a fixed sample size. Therefore, the size of your samplerestricts the number of terms that you can safely add to the model beforeyou obtain erratic estimates.Similar to the example with the means, you need asufficient number ofobservations for each termin the regression model to help ensuretrustworthy results.Statisticianshave conducted simulation studies* whichindicate you should have at least 10-15 observations for each term in alinear model. The number of terms in a model is the sum of all theindependent variables, their interactions, andpolynomial terms to modelcurvature.For instance, if the regression model has two independent variables andtheir interaction term, you have three terms and need 30-45 observations.Although, if themodel has multicollinearityor if theeffectsize is small,you might need more observations.To obtain reliable results, you need a sample size that is large enough tohandlethe model complexity that your study requires. If your study callsfor a complex model, you must collect a relatively large sample size. If thesample is too small, you can’t dependably fit a model that approaches thetrue model for yourindependent variable. In that case, the results can bemisleading.munotes.in
Page 53
4.3.4 How to Detect Overfit Models:Linear regression, there is an excellent accelerated cross-validationmethod called predictedR-squared. This method doesn’t require you tocollect a separate sample or partition your data, and you can obtain thecross-validated results as you fit the model. Statistical software calculatespredicted R-squared using the following automated procedure:•It removes a data point from the dataset.•Calculates the regression equation.•Evaluates how well the model predicts the missing observation.•And, repeats this for all data points in the dataset.Predicted R-squared has several cool features. First,youcan justinclude itin the output as you fit the model without any extra steps on your part.Second, it’s easy to interpret. You simply compare predicted R-squared totheregular R-squaredand see if there is a big difference.If there is a large discrepancy between the two values, your model doesn’tpredict new observations as well as it fits the original dataset. The resultsare not generalizable, and there’s a good chance you’re overfitting themodel.For the fitted line plot above, the model produces a predicted R-squared(not shown) of 0%, which reveals the overfitting. For more information,read my post abouthow to interpret predicted R-squared, which alsocovers the model in the fitted line plot in more detail.`4.3.5 How to Avoid Overfitting Models:To avoid overfitting a regression model, you should draw a randomsample that is large enough to handle all of the terms that you expect toinclude in your model. This process requires that you investigate similarstudies before you collect data. The goal is to identify relevant variablesand terms that you are likely to include in your own model. After you get asense of the typical complexity of models in your study area, you’ll beable to estimate a good sample size.4.4CASE STUDY OF POLYNOMIAL REGRESSIONPolynomial Regression is a regression algorithm that models therelationship between a dependent(y) and independent variable(x) as nthdegree polynomial. The Polynomial Regression equation is given below:y= b0+b1x1+ b2x12+ b2x13+...... bnx1n•It is also called the special case of Multiple Linear Regression in ML.Because we add some polynomial terms to the Multiple Linearregression equation to convert it into Polynomial Regression.•It is a linear model with some modification in order to increase theaccuracy.munotes.in
Page 54
•The datasetused in Polynomial regression for training is of non-linearnature.•It makes use of a linear regression model to fit the complicated andnon-linear functions and datasets.•Hence,"In Polynomial regression, the original features areconverted into Polynomial features of required degree (2,3,..,n) andthen modeled using a linear model."4.4.1 Need for Polynomial Regression:•The need of Polynomial Regression in ML can be understood in thebelow points:•If we apply a linear model on alinear dataset, then itprovides us agood result as we have seen in Simple Linear Regression, but if weapply the same model without any modification on anon-lineardataset, then it will produce a drastic output. Due to which lossfunction will increase, the error rate will behigh, and accuracy will bedecreased.•So for such cases,where data points are arranged in a non-linearfashion, we need the Polynomial Regression model. We canunderstand it in a better way using the below comparison diagram ofthe linear dataset and non-linear dataset.
Fig 4.6: linear regression and polynomial model•In the abovefigure, we have taken a dataset which is arranged non-linearly. So if we try to cover it with a linear model, then we canclearly see that it hardly covers any data point. On the other hand, acurve is suitable to cover most of the data points, which is of thePolynomial model.•Hence,if the datasets are arranged in a non-linear fashion, then weshould use the Polynomial Regression model instead of Simple LinearRegression.4.4.2Equation of the Polynomial Regression Model:Simple Linear Regression equation:y = b0+b1x.................(a)
munotes.in
Page 55
Multiple Linear Regression equation:y= b0+b1x+ b2x2+ b3x3+....+ bnxn..................(b)Polynomial Regression equation:y= b0+b1x + b2x2+ b3x3+....+ bnxn...................(c)When we compare the above three equations, we can clearly see that allthree equations are Polynomial equations but differ by the degree ofvariables. The Simple and Multiple Linear equations are also Polynomialequations with a single degree, and the Polynomial regression equation isLinear equation with the nth degree. So if we add a degree to our linearequations, then it will be converted into Polynomial Linear equations.Here we will implement the Polynomial Regression using Python. We willunderstand it by comparing Polynomial Regression model with the SimpleLinear Regression model. So first, let's understand the problem for whichwe are going to build themodel.Problem Description:There is a Human Resource company, which isgoing to hire a new candidate. The candidate has told his previous salary160K per annum, and the HR have to check whether he is telling the truthor bluff. So to identify this, theyonly have a dataset of his previouscompany in which the salaries of the top 10 positions are mentioned withtheir levels. By checking the dataset available, we have found that there isanon-linear relationship between the Position levels and the salaries.Our goal is to build aBluffing detector regressionmodel, so HR can hirean honest candidate. Below are the steps to build such a model.
Steps for Polynomial Regression:The main steps involved in Polynomial Regression are given below:•Data Pre-processing
munotes.in
Page 56
•Build a Linear Regression model and fit it to the dataset•Build a Polynomial Regression model and fit it to the dataset•Visualize the result for Linear Regression and Polynomial Regressionmodel.•Predicting the output.Data Pre-processing Step:The data pre-processing step will remain the same as in previousregression models, except for some changes. In the Polynomial Regressionmodel, we will not use feature scaling, and also we will not split ourdataset into training and test set. It has two reasons:•The dataset contains very less information which is not suitable todivide it into a test and training set, else our model will not be able tofind the correlations between the salaries and levels.•In this model, we want very accurate predictions for salary, so themodel should have enough information.The code for pre-processing step is given below:#importinglibrariesimportnumpyasnmimportmatplotlib.pyplotasmtpimportpandasaspd#importingdatasetsdata_set=pd.read_csv('Position_Salaries.csv')#ExtractingIndependentanddependentVariablex=data_set.iloc[:,1:2].valuesy=data_set.iloc[:,2].values•In the above lines of code, we have imported the important Pythonlibraries to import dataset and operate on it.•Next, we have imported the dataset 'Position_Salaries.csv', whichcontains three columns (Position, Levels, and Salary), but we willconsider only two columns (Salary and Levels).•After that, we have extracted the dependent(Y) and independentvariable(X) from the dataset. For x-variable, we have taken parametersas [:,1:2], because we want 1 index(levels), and included :2 to make itas a matrix.munotes.in
Page 57
Now, we will build and fit the Linear regression model to the dataset. Inbuilding polynomial regression, wewill take the Linear regression modelas reference and compare both the results. The code is given below:1.#FittingtheLinearRegressiontothedataset2.fromsklearn.linear_modelimportLinearRegression3.lin_regs=LinearRegression()4.lin_regs.fit(x,y)In the above code, we have created the Simple Linear modelusinglin_regsobject ofLinearRegressionclass and fitted it to the datasetvariables (x and y).Output:Out[5]: LinearRegression(copy_X=True, fit_intercept=True,n_jobs=None, normalize=False)Building the Polynomial regression model:Now we will build the Polynomial Regression model, but it will be a littledifferent from the Simple Linear model. Because here we willusePolynomialFeaturesclass ofpreprocessinglibrary. We are using thisclass to add some extra features to our dataset.1.#FittingthePolynomialregressiontothedataset2.fromsklearn.preprocessingimportPolynomialFeatures3.poly_regs=PolynomialFeatures(degree=2)4.x_poly=poly_regs.fit_transform(x)5.lin_reg_2=LinearRegression()6.lin_reg_2.fit(x_poly,y)
munotes.in
Page 58
In the above lines of code, we have usedpoly_regs.fit_transform(x),because first we are converting our feature matrix into polynomial featurematrix, and then fitting it to the Polynomial regression model. Theparameter value(degree= 2) depends on our choice. We can choose itaccording to our Polynomial features.After executing the code, we will get another matrixx_poly, which can beseen under the variable explorer option:
Output:Out[11]: LinearRegression(copy_X=True, fit_intercept=True,n_jobs=None, normalize=False)Visualizing the result for Linear regression:Now we will visualize the result for Linear regression model as we did inSimple Linear Regression. Below is the code for it:1.#VisulaizingtheresultforLinearRegressionmodel2.mtp.scatter(x,y,color="blue")3.mtp.plot(x,lin_regs.predict(x),color="red")4.mtp.title("Bluffdetectionmodel(LinearRegression)")5.mtp.xlabel("PositionLevels")6.mtp.ylabel("Salary")7.mtp.show()
munotes.in
Page 59
Output:
In the above output image, we can clearly see that the regression line is sofar from the datasets. Predictions are in a red straight line, and blue pointsare actual values. If we consider this output to predict the value of CEO, itwill give a salary of approx. 600000$, which is far away from the realvalue.Visualizing the result for Polynomial Regression:Here we will visualize the result of Polynomial regression model, code forwhich is little different from the above model.Code for this is given below:#VisulaizingtheresultforPolynomialRegressionmtp.scatter(x,y,color="blue")mtp.plot(x,lin_reg_2.predict(poly_regs.fit_transform(x)),color="red")mtp.title("Bluffdetectionmodel(PolynomialRegression)")mtp.xlabel("PositionLevels")mtp.ylabel("Salary")mtp.show()In the above code, we have takenlin_reg_2.predict(poly_regs.fit_transform(x), instead of x_poly, becausewe want a Linear regressor object to predict the polynomial featuresmatrix.Output:
munotes.in
Page 60
As we can see in the above output image, the predictions are close to thereal values. The above plot will vary as we will change the degree.For degree= 3:If we change the degree=3, then we will give a more accurate plot, asshown in the below image.
SO as we can see here in the above output image, the predicted salary forlevel 6.5 is near to 170K$-190k$, which seems that future employee issaying the truth about his salary.Degree= 4:Let's again change the degree to 4, and now will get the mostaccurate plot. Hence we can get more accurate results by increasing thedegree of Polynomial.
Predicting the final result with the Linear Regression model:Now, we will predict the final output using the Linear regression model tosee whether an employee is saying truth or bluff. So, for this, we will usethepredict()method and will pass the value 6.5. Below is the code for it:1.lin_pred=lin_regs.predict([[6.5]])2.print(lin_pred)
munotes.in
Page 61
Output:[330378.78787879]Predicting the final result with the Polynomial Regression model:Now, we willpredict the final output using the Polynomial Regressionmodel to compare with Linear model. Below is the code for it:1.poly_pred=lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))2.print(poly_pred)Output:[158862.45265153]SUMMARYRegression analysisis a statistical technique for studying linearrelationships.It begins by supposing a general form for therelationship, known as theregression model. The ultimate goal of theregression algorithm is to plot a best-fit line or a curve between thedata. The three main metrics that are used for evaluating the trainedregression model are variance, bias and error. If the variance is high, itleads to overfitting and when the bias is high, it leads to underfitting.Based on the number of input featuresand output labels, regression isclassified as linear (one input and one output), multiple (many inputsand one output) and multivariate (many outputs). Linear regressionallows us to plot a linear equation, i.e., a straight line. We need to tunethe coefficient and bias of the linear equation over the training data foraccurate predictions. The tuning of coefficient and bias is achievedthrough gradient descent or a cost function—least squares method.Learning a linear regression model means estimatingthe values of thecoefficients used in the representation withthe data that we haveavailable.UNIT ENDQUESTIONS1.What is regression? Explain types of regression.2.Give the illustration of regression performance.3.Explain the methods used for regressionanalysis.4.Write a note on R square method.5.Write a note on Mean absolute error.6.Explain Root mean square method with suitable example.7.Discuss Polynomial Regression in detail.munotes.in
Page 62
References•Peter Flach, Machine Learning The Art and Science of Algorithms thatMake Sense of Data, Cambridge Press, 2012•https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a•Baidaa M Alsafy,Zahoor Mosad, Wamidh k. Mutlag,MulticlassClassification Methods: A Review,International Journal of AdvancedEngineering Technology and Innovative Science (IJAETIS), 2020.•Dragos D. Margineantu, Class Probability Estimation and Cost-Sensitive Classification Decisions, Inc Springer-Verlag BerlinHeidelberg 2002•https://www.educative.io/edpresso/what-is-linear-regressionmunotes.in
Page 63
5THEORY OF GENERALIZATIONUnit Structure5.0Objectives5.1Effective number of hypothesis5.2Bounding the Growth function5.3VC Dimensions5.4Regularization theorySummaryUnit End QuestionsReferences5.0 OBJECTIVESA learning will learn:•hyptothesis implementation•to implement the regularization theory•understand the VC dimensions in Machine learning.5.1 EFFECTIVE NUMBER OF HYPOTHESISA hypothesis is an explanation for something. It is a provisional idea, aneducated guess that requires some evaluation. A goodhypothesis istestable; it can be either true or false. In science, a hypothesis must befalsifiable, meaning that there exists a test whose outcome could mean thatthe hypothesis is not true. The hypothesis must also be framed before theoutcome of the test is known.Consider the regression estimation problem where X = Y = R and the dataare the following points:
Fig 5.1: regression estimation
munotes.in
Page 64
where the dash-dot line represents are fairly complex model and fits thedata set perfectly, and the straightline does not completely “explain” thedata but seems to be a “simpler” model, if we argue that the residuals arepossibly measurement errors.Statistical hypothesis tests are techniques used to calculate acriticalvaluecalled an “effect.” The critical value can then be interpreted in orderto determine how likely it is to observe the effect if a relationship does notexist. If the likelihood is very small, then it suggests that the effect isprobably real. If the likelihood is large, then we may have observed astatistical fluctuation, and the effect is probably not real. For example, wemay be interested in evaluating the relationship between the means of twosamples, e.g. whether the samples were drawn from the same distributionor not, whether there is a difference between them.One hypothesis is that there is no difference between the populationmeans, based on the data samples. This is a hypothesis of no effect and iscalled the null hypothesis and we can use the statistical hypothesis test toeither reject this hypothesis, or fail to reject (retain) it. We don’t say“accept” because the outcome is probabilistic and could still be wrong,just witha very low probability.5.1.1 Underfitting versus overfitting:In the machine learning literature the more complex model is said to showsigns of overfitting, while the simpler model underfitting. Often severalheuristic are developed in order to avoid overfitting, for example, whendesigning neural networks one may:1.limit the number of hidden nodes (equivalent to reducing the numberof support vectors),2.stop training early to avoid a perfect explanation of the training set,and3.apply weightdecay to limit the size of the weights, and thus of thefunction class implemented by the network (equivalent to theregularization used in ridge regression).5.1.2 Minimizing risk (expected loss):The risk or expected loss (L) is defined as:where Stis a test set. This problem is intractable since we do not knowD(x, y) which is the probability distribution on X ×Y which governs thedata generation and underlying functional dependency.
munotes.in
Page 65
The only thing we do have it our training set S sampled from thedistribution D and we can use this to approximate the above integral bythe finite sum:
5.2 BOUNDING THE GROWTH FUNCTIONWe have considered the case when H is finite or countably infinite. Inpractice, however, the function class H could be uncountable. Under thissituation, the previous method does not work. The key idea is to groupfunctions based on the sample. Given a sample Dn = {(x1, y1),...,(xn, yn)},and define S = {x1, . . . , xn}. Consider the set HS = Hx1,...,xn={(h(x1),...,h(xn) : hH} . The size of this set is the total number ofpossible ways that S = {x1, . . . , xn} can be classified. For binaryclassification the cardinality of this set is always finite, no matter howlarge H is.The growth function is the maximum number of ways into which n pointscan be classified by the function class:GH(n) = sup |HS| .x1,...,xnGrowth function can be thought as a measure of the “size” for the class offunctions H. Several facts about the growth function:•When H is finite, we always have GH(n)≤ |H| = m.•Since h(x){0, 1}, we have GH(n)≤ 2n. If GH(n) = 2n, then there isa set of n points such that the class of functions H can generate anypossible classification result on these points.5.3 VC (VAPNIK-CHERVONENKIS) DIMENSIONSThe VC (Vapnik-Chervonenkis) dimension is a single parameter thatcharacterizes the growth function:The VC dimension of a hypothesis set H, denoted by dVC(H), is the largestvalue of m, for which ΠH(m) = 2m. If ΠH(m) = 2m, then dVC(H) =∞.
munotes.in
Page 66
To illustrate this definition, we will now take the examples for the growthfunction to learn their VC-dimension.To find a lower bound we have tosimply find a set S that can be shattered by H. To give an upper bound, weneed to prove that no set S of d + 1 points exists, that can be shattered byH, which is usually more difficult.For Example,Positive rays:We have shown that the growth function forpositive rays isΠH(m) = m+1. Only for m=0 ,1 we have ΠH(m) = 2m,therefore dVC(H)=1Positive intervals:The growth function for positive intervals is, ΠH(m) =m2+m + .We haveΠH(m) = 2mfor m =0,1,2 which yields dVC(H)=2.Convex sets:We have seen that by arrange convex sets in the right way,sets of every size can be shattered.ThereforeΠH(m) = 2m for all mdVC(H)=∞.Perceptrons:The next example will be a bit more elaborate. We willshow that for the perceptron in Rdthe VC-dimension is always d + 1. Wewon’t explicitly calculate the growth function ΠH(m) for all m.Therefore,for determining the VC dimension dVC(H) = d + 1, we have to show that:a)The VC dimension is at least d+1: To prove this, we have to find d+ 1 points in the input space X = Rdthat the perceptron can shatter. Wefirst consider the set of hyperplanes in R2. We have already seen that anythree non-collinear points in R2can be shattered by Perceptron. dVC(H) =(hyperplanes in R2) = 3.If we now show the general case of hyperplanesin Rd. We pick a set of d + 1 points inRd, setting x0to be the origin anddefining xi, i{1, ..., x}as the points whose i-th coordinate is 1 and allothers are 0.Let y0, ..., yn{−1, +1} be an arbitrary set of labels for x0, x1,...xd.Let wbe the vector whose ith coordinate is yi.The perceptron definedby the hyperplane of equation w · x += 0is shatters x0, ..., xdbecausefor any i{0, ..., d}:sgn(w · xi+) = sgn(yi+).Now we need to show that b) the VC-dimension is at most d + 1. This is abit trickier, because we have to show that there is no subset of more than d+ 1 points that can be shattered by the perceptron. To prove this, we willshow that on any d + 2 points, there is a dichotomy that can not be realizedby the perceptron classifier.Choose the points x1, ..., xd+2at random. There are more points thandimension, therefore we must have
munotes.in
Page 67
i.e. one point is a linear combination of the rest of the points. This willapply to any set of d + 2 points you choose. Also, some of the ai’s must benonzero, because the first coordinate of the xi’s is always one(seedefinition of the perceptron, first coordinate is one to include the bias termof the hyperplane into the form w · x).Now we show a dichotomy thatcan’t be implemented: Consider the following dichotomy.Let y1, ..., yd+2the labels of x1, ..., xd+2. Give xiwith nonzero coefficient ai get the label+1, give any label to the xiwith ai= 0 and set yj=−1 as the label of xj.Let wRd+ 1 be the weight vector to any hyperplane h. Now we have,If yi= sgn(wTxi) = sgn(ai), then aiwTxi> 0 for all 0 0. However, we set yj=sgn(wTxj) = sgn(wTxi)< 0 that gives a contradiction. The dichotomy can’t be implemented onany set of d+2 points by the perceptron classifier.Combining both parts of the proof, we get: dVC(hyperplanes in Rd) = d +1.Consider again the perceptron in the two-dimensional space. Asshownabove, its VCdimension is three. The fact that no four points can beshattered by H limits the number of the dichotomies that can be realizedsignificantly. We exploit that fact to get a bound
Fig 5.2: Construction of G1 and G2for ΠH(m) for all mN. We will prove, that, if the VC-dimension for aset of hypotheses is finite, then there is a polynomial that bounds ΠH(m)for all values of m. If such a polynomial exists, and ΠH(m) can replace |H|in above equation then the generalization error will go to zero as m→ ∞.The next result uses the VC-dimension to define a bound for the growthfunction.
munotes.in
Page 68
5.4 REGULARIZATION THEORYMany strategies used in machine learning are explicitly designed to reducethe test error, possibly at the expense of increased training error. In anotherway we can say that any modification we make to a learning algorithm thatis intended to reduce its generalization error but not its training error isregularization.Regularization is a technique used to reduce the errors byfitting the function appropriately on the given training set and avoidoverfitting.There are many regularization strategies. Some put extra constraints on amachine learning model, such as adding restrictions on the parametervalues. Some add extra terms in the objective function that can be thoughtof as corresponding to a soft constraint on the parameter values. Thesestrategies are collectively known asregularization.In fact, developing more effective regularization strategies has been one ofthe major research efforts in the machine learningfield. Sometimes theseconstraints and penalties are designed to encode specific kinds of priorknowledge. Other times, these constraints and penalties are designed toexpress a generic preferencefor a simpler model class in order to promotegeneralization. Sometimes penalties and constraints are necessary to makean underdetermined problem determined.There are several form ofregularizationsby which we can preventoverfitting in our network ormachine learning model.Parameter Norm Penalties:Many regularization approaches are based on limiting the capacity ofmodels, such as neural networks, linear regression, or logistic regression,by adding a parameter norm penaltyΩ(θ) to the objective function J. Wedenote the regularized objective function byJ.J(θ;X,y) = J(θ;X,y) + αΩ(θ)—{1}where α[0,∞) is a hyperparameter that weights the relative contributionof the norm penalty term,Ω, relative to the standard objective function J.Setting α to zero results in no regularization. Larger values of α correspondto more regularization.We note that for neural networks, we typically choose to use a parameternorm penaltyΩ that penalizes only the weights of the a ne transformationat each layer and leaves the biases unregularized.The biases typicallyrequire less data tofit accurately than the weights. Each weightspecifies how two variables interact. Fitting the weight well requiresobserving both variables in a variety of conditions. Each bias controls onlya single variable. This means that we do not induce too much variance byleaving the biases unregularized. Also,regularizing the bias parameterscan introduce a significant amount of underfitting. We therefore use themunotes.in
Page 69
vectorwto indicate all the weights that should be a ected by a normpenalty, while the vectorθdenotes all of the parameters, includingbothwand the unregularized parameters.
Fig 5.3: Overfitting and RegularizationL² regularization:It is one of the commonly used regularization form. TheL² parameter norm penalty commonly known as weight decay. L²regularization drives the weights closer to origin by adding a regularizationtermΩ(θ) = 1/2||w||² to the objective function. Such a model hasfollowing total objective function:J(w;X,y) =α/2(w`w) + J(w;X,y) (`means transpose)The L² regularization has the intuitive interpretation of heavily penalizingpeaky weight vectors and preferring diffuse weight vectors. Due tomultiplicative interactions between weights andinputs this has theappealing property of encouraging the network to use all of its inputs alittle rather that some of its inputs a lot.Lastly, also notice that during gradient descent parameter update, using theL² regularization ultimately means thatevery weight is decayedlinearly:W+=-lambda *Wtowards zero. Let’s see what this means, Wecan see that the addition of the weight decay term has modified the learningrule to multiplicatively shrink the weight vector by a constant factor oneach step, just before performing the usual gradient update.This describeswhat happens in a single step. But what happens over the entire course oftraining? The L² regularization causes the learning algorithm to “perceive”the inputXas having higher variance, which makes it shrink the weightson features whosecovariance with the output target is low compared to thisadded variance.L¹regularization:The L¹ regularization has the intriguing and fascinatingproperty that it leads the weight vectors to become sparse duringoptimization (i.e. very close to exactly zero). In other words, neurons withL¹ regularization end up using only a sparse subset of their most importantinputs as most weight goes very close to zero and become nearly invariantto the “noisy” inputs. In comparison, final weight vectors from L²regularization are usually diffuse, small numbers.The sparsity propertyinduced by L¹ regularization has been used extensively as a feature
munotes.in
Page 70
selection mechanism.The L¹ penalty causes a subset of the weights tobecome zero, suggesting that the correspondingfeatures may safely bediscarded. In practice, if you are not concerned with explicit featureselection, L² regularization can be expected to give superior performanceover L1.Formally, L² regularization on the model parameterwis defined asSUMMARYWe have derivated bounds for finite hypothesis sets. But in machinelearning, the hypothesis sets are usually infinite.We show that thehypotheses in a hypothesis set H can be "similar" to each other andtherefore their "bad events" with poor generalization can overlap.Therefore, we define the growth function, that formalizes the number of"effective" hypotheses in a hypothesis set.The VC (Vapnik-Chervonenkis)dimension is a single parameter that characterizes the growth function.Aregression model which usesL1 Regularizationtechnique iscalledLASSO(LeastAbsolute Shrinkage and SelectionOperator)regression.Aregression model that usesL2regularizationtechniqueis calledRidge regression.LassoRegressionadds“absolute value of magnitude”of coefficient as penaltyterm to the loss function(L).UNIT ENDQUESTIONS1.What is hypothesis? Explain different types of hypothesis.2.Explain underfitting and overfitting with suitable example.3.Explain the growth bounding function with suitable derivation.4.Give the illustration of VC (Vapnik-Chervonenkis) Dimensions.5.What is regularization? Explain its theory.6.Explain L1 and L2 regularization with suitable example.REFERENCES•Peter Flach, Machine Learning The Art and Science of Algorithms thatMake Senseof Data, Cambridge Press, 2012.•https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
munotes.in
Page 71
71UNITIII6LINEAR MODELSUnit Structure6.0Objective6.1IntroductionLeast Square Method6.1.1Definition:6.1.2Least square methodgraph6.1.3Least Square Method Formula6.1.4Advantages of Least Square method6.1.5Disadvantages of Least Square Method6.2Multivariate linear regression6.2.1Normal Equation6.2.2Examples6.2.3Steps for Multivariate Linear Regression6.2.3.1Normalizing Features6.2.3.2Select Loss function and Hypothesis6.2.3.3Set Hypothesis Parameters6.2.3.4Minimizethe Loss Function6.2.3.5Test the hypothesis function6.2.3.6Multivariate Linear Regression model: Scalar Model6.2.4Advantages of MultivariateRegression6.2.5Disadvantages of Multivariate Regression6.3Regularization6.3.1Definition6.3.2Types of regularized regression6.3.3Ridge regression6.3.4Lasso Regression6.3.5Comparison between ridge regression and lasso regression6.4Least square regression for classification6.4.1Linear regression and least squares problem6.4.2Introduction6.4.3Non-Regularized Least Squares Problem6.4.4Regularized Least Squares Problem6.5Perceptron6.5.1Introduction6.5.2Types of Perceptron6.5.3Single layer Perceptron6.5.3.1Working of Single Layer Perceptron6.5.3.2Advantages:6.5.3.3Disadvantages:6.5.4Multi layer perceptron:6.5.4.1Neurons6.5.4.2Activationmunotes.in
Page 72
726.5.4.3Networks of Neurons6.5.4.4Input or Visible Layers6.5.4.5Hidden Layers6.5.4.6Output Layer6.5.4.7Stochastic Gradient Descent6.5.4.8Weight Updates6.5.4.9Prediction6.5.4.10Advantages:6.5.4.11Disadvantages:SummaryQuestionReference6.0 ObjectivesTheObjectivefor this chapter is to present the background andmathematical function needed to construct a statistical model thatdescribes a particular scientific or engineering process. These types ofmodels can be used for prediction of process outputs, for calibration, or forprocess optimization.6.1INRODUCTIONLEAST SQUARE METHODThe most widely used modeling method is Linear least squares regression.It is what most people mean when they say they have used "regression","linear regression" or "least squares"to fit a model to their data. Not onlyis linear least squares regression the most widely used modeling method,but it has been adapted to a broad range of situations that are outside itsdirect scope. It plays a strong underlying role in many other modelingmethods, including the other methods discussed in this section6.1.1Definition:The process of finding the best-fitting curve for a set of data points byreducing the sum of the squares of the offsets of the points from the curveis called theleastsquare method.The method of least squaresdefinesthe solution for the minimization ofthe sum of squares of errors in equation. to find the variation in observeddata we need to find theformula for sum of squares of errors.Thismethodis applied in data fitting. The resultof this method is usedto reduce thesum of squared errorswhich aredifferences between the observed orexperimental value and corresponding fitted value given in the model.The regression analysis is the process of finding the relation between twovariables, the trend of outcomesisestimated quantitatively. The method ofcurve fitting is an approachto regression analysis. This method of fittingequations which approximates the curves to given raw data is the leastsquares.munotes.in
Page 73
73If we add up all of the errors, the sum will be zero. So how do we measureoverall error? We use a little trick: we square theerrors and find a line thatminimizes this sum of the squared errors.Followingarethebasic categories of least-squares problems:•Ordinary or linear least squares•Nonlinear least squaresThese depend upon linearity or nonlinearity of theerrors. The linearproblems are often seen in regression analysis in statistics. On the otherhand, the non-linear problems are generally used in the iterative method ofrefinement in which the model is approximated to the linear one with eachiteration.
Figure 1:Least square method graph6.1.3Formula:The least-square method states that the curve that best fits a given set ofobservations, is said to be a curve having a minimum sum of the squaredresiduals (or deviations or errors) from the given datapoints. Let usassume that the given points of data are (x1,y1), (x2,y2), (x3,y3), …,(xn,yn) andfitting curvef(x)withd represents error or deviation fromeach given point.Now, we can write:d1 = y1− f(x1)d2 = y2− f(x2)d3 = y3− f(x3)…..dn = yn–f(xn)
munotes.in
Page 74
74The least-squares explain that the curve that best fits is represented by theproperty that the sum of squares of all the deviations from given valuesmust be minimum.i.e:
6.1.4Advantagesof Least Square method:1.This method is completelyfree from personal bias of the analyst as itis very objective in nature.2.This method provides us with a rate of growth per period.6.1.5Disadvantagesof Least Square Method:1.Sensitivity to outliers: One or two outliers can sometimes seriouslyskew theresults of a least-squares analysis.2.Tendency to overfit data6.2 MULTIVARIATE LINEAR REGRESSION (MLR)The general multivariate regression model is a compact way of writingseveral multiple linear regression modelssimultaneously. In that sense it isnot a separate statistical linear model. The various multiple linearregression models may be compactly written asfollows6.2.1 Normal EquationY=XB+U
Where,Y is a matrix with series of multivariate measurementsX is a matrix of observations on independent variables
munotes.in
Page 75
75B is a matrix containing parameters that are usually to be estimated andU is a matrix containing errorsi.e.noise.Inmultivariate linear regression,thefocusison selecting the best possibleindependent variables that contribute well to the dependent variable.1.create a correlation matrix for all the independent variables and thedependent variable from the observed data.2.The correlation value gives us an idea about which variable issignificant and by what factor.3.Low value of correlation between two independent variables showslow overlap between the respective variables and high value pointstowards high overlap.4.If two variables highly overlap each other thentheir contribution inminimizing the loss function is the same therefore making thecontribution of one of the layers redundant.This method can get complicated when there are large no. of independentfeatures that have significant contribution in deciding our dependentvariable.Multivariate regression is a technique that estimates a singleregression model with more than one outcome variable.6.2.2Example:Alice and Bob are planning to buy a new home. Nancy has mentioned arate for the home they like, but they want a method to verify it. All theyhave got is the data set of size M house price and 3 feature counts (no.ofbedrooms, size of home in sq mtrs and age of the home). Bob has decidedto use his statistical data processing experience from his previous job anddo a multivariate linear regression. IfB,S,Aare the parameter values ofthe house they like andPis the price mentioned by Nancy, help Bobdecide if they are being cheated or not. (If the price mentioned by Nancy-expected price≥ 2000dollars-then they are being cheated)Input formatLine 1: M valueNext M lines contain 4 values separated by spacesLine 2 to m+1: Bi Si Ai PiLine m+2: B S A (features of their future home)Line m+3: P (price mentioned by Nancy)Output formatPexp CPexp is expected price of their home and C is a binary value (0: beingcheated, 1: not cheatedmunotes.in
Page 76
76The general linear model incorporates a number of different statisticalmodels: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linearregression, t-test and F-test. The general linear model is a generalization ofmultiple linear regression to the case of more than one dependent variable.If Y, B, and U were column vectors, the matrix equation above wouldrepresent multiple linear regression.6.2.3Steps for Multivariate Linear Regression:Steps involved for Multivariate regression analysis are1)feature selection,2)normalizing the features,3)selecting the loss function4)hypothesis,setting hypothesis parameters,5)minimizing the loss function,6)testingthe hypothesis, and7)generating the regression model.6.2.3.1Normalizing Features:We need to scale the features as it maintains general distribution and ratiosin data. This will lead to an efficient analysis. The value of each featurecan also bechanged.6.2.3.2Select Loss function and Hypothesis:The loss function predicts whenever there is an error. Meaning, when thehypothesis prediction deviates from actual values. Here, the hypothesis isthe predicted value from the feature/variable.6.2.3.3Set Hypothesis Parameters:The hypothesis parameter needs to be set in such a way that it reduces theloss function and predicts well.6.2.3.4Minimize the Loss Function:The loss function needs to be minimized by using a loss minimizationalgorithm on the dataset, which will help in adjusting hypothesisparameters. After the loss is minimized, it can be used for further action.Gradient descent is one of the algorithms commonly used for lossminimization.6.2.3.5Test the hypothesis function:Thehypothesis function needs to be checked on as well, as it is predictingvalues. Once this is done, it has to be tested on test data.6.2.3.6Multivariate Linear Regression model: Scalar ModelIt has the following form
munotes.in
Page 77
77For i€ {1,….,n} and k €{1,…..,m}Where,is the kth real valued response for ith observationis the regression intercept for the kth responseis the jth predictor’s regression slope for kth responseis the jth predictor for ith responseis a multivariate Gaussian error vector.6.2.4Advantages:1.MLRis it helps us to understand the relationships among variablespresent in the dataset.2.MLRis a widely used machine learning algorithm.6.2.5Disadvantages:1.This Techniquearea bit complex and require a high-levels ofmathematical calculation.2.MLRmodel’s output is not easy to interpret3.MLRmodel does not have much scope for smaller datasets.6.3 REGULARIZATION6.3.1Definition:Thetype of regression where thecoefficient estimates are constrained tozerois called Regularization. The magnitude (size) of coefficients, anderror term, are penalized.Formula:Here Y represents the learned relationβrepresents the coefficient estimates for different variables orpredictors(X).
munotes.in
Page 78
78TheModelfitting procedure involves a loss function, known as residualsum of squares(RSS). The coefficients are chosen, such that they minimizethis loss function.Now,this will adjust the coefficients based on your training data. If thereis noise in the training data, then the estimated coefficientswill not giveefficient results. This is where regularization comes inandregularizesthese learned estimates towards zero.6.3.2Types of regularized regression:1.Ridge regressionis a way to create a sparing model when the numberof predictor variables in a set are more than the number ofobservations.2.Least Absolute Shrinkage and Selection Operator(LASSO)regressionis a type of linear regression that uses shrinkage.Shrinkage is where data values are shrunk towards a central point, likethe mean.6.3.3Ridge regression:Itis a technique that is implemented by adding bias to a multilinearregression model to expecta much more accurate regression with testeddata.The general equation of a best-fit line for multilinear regression isy = β0 + β1x1 + β2x2 + ··· βkxkwhere y is the output variable and x1,x2…xk are predictor variables.The penalty term for ridge regression is λ(slope) ², where lambda denotesthe degree of deflection from the original curve by restricting thecoefficients of predictor variables but never makes them zero.Therefore the equation for ridge regression isy = β0 + β1x1 + β2x2 + ··· βkxk+ λ(slope) ²Above function shows ridge regression, where the RSS is modified byadding the shrinkage quantity. Now, the coefficients are estimated byminimizing this function. Here, λ is the tuning parameter that decides howmuch we want to penalize the flexibility of our model. The increase inflexibility of a model is represented by an increase in its coefficients, and
munotes.in
Page 79
79if we want to minimize the above function, then these coefficients need tobe small. This is how the Ridge regression technique prevents coefficientsfrom rising too high.6.3.4Lasso Regression:Lasso regression is much similar to ridge regression but only differs in thepenalty term. The penalty for lasso regression is λ|slope|.Lasso regression can even eliminate the variables bymaking theircoefficients to zero thus removing the variables that have high covariancewith other predictor variables.The equation for lasso regression isy = β0 + β1x1 + β2x2 + ··· βkxk + λ|slope|Lasso is another variation, in which the abovefunction is minimized. It'sclear that this variation differs from ridge regression only in penalizing thehigh coefficients. It uses |βj|(modulus)instead of squares of β, as itspenalty. In statistics, this is known as the L1 norm.6.3.5Comparison between ridge regression and lasso regression:According to the above formulation, the ridge regression is expressed byβ1² + β2² ≤ s. This implies that ridge regression coefficients have thesmallest RSSfor all points that lie within the circle given by β1² +β2² ≤ s.Similarly, for lasso, the equation becomes,|β1|+|β2|≤ s. This implies thatlasso coefficients have the smallest RSS for all points that lie within thediamond given by |β1|+|β2|≤ s.Comparison of lasso and ridge with the linear regression model,we get:
Figure2:Comparison of lasso and ridge with the linearregression modelNote:
variable.
munotes.in
Page 80
806.4 LEASTSQUARE REGRESSION FORCLASSIFICATIONRegression and classification are basic in learning. In regression, theoutput variable takes continuous values, while in classification, the outputvariable takes class labels.6.4.1Linear regression and least squares problem:The linear regression is similar to linear least squares problem and can beused for classification problems appearing in machine learning algorithms.We will revise the solution of the linear least squares problem in terms oflinear regression.The simplest linear model for regression isf(x,ω)=ω0
1+ω1x1+...+ωMxM.Here, ω={ωi},i=0,...,M are weights with bias parameter ω0, {xi},i=1,...,Mare training examples. Target values (known data) are {ti},i=1,...,N whichcorrespond to {xi},i=1,...,M. Here, M is the number of weights and N isthe number of data points.6.4.2 Introduction:Each set consists of sample data points repressing two classes. One of thesets represents a linearly-separable classification problem, and the otherset is for a non-linearly separable problem.To use the Least SquaresRegression to solve a classification problem, a simple trick is used. Thedata points of the first and second classes are extended by adding a newextradimension. This produces an augmented cloud of points in n+1dimensional space, where n is the size of the original data space. In thatextra dimension, the data points belonging to the first and second classestake values of-1 and +1 respectively.
Figure3: Least square methods to classify linear data from differentpoints of viewFigure3shows the decision boundary of classifying linear data fromdifferent point of views, and figure 2 shows the same for the wave-alikedata, where misclassified samples are circled in red. In such 2D datapoints case, the decision boundary is the intersection of the fitted
munotes.in
Page 81
81polynomial and the horizontal plane passing by z=0 (z is the extradimension here).
Figure4: Least square models to classify wave-alike data fromdifferent points of view, misclassified samples are circled in red.For classification accuracy, we use the Minimum Correct ClassificationRate (MCCR). MCCR is defined as the minimum of CCR1 and CCR2.CCRn is the ratio of the correctly classified test points in class n dividedby the total number of test points in class n.6.4.3Non-regularized Least Squares Problem:In non-regularized linear regression or least squares problem the goal is tominimize the sum of squaresE(ω)=12N∑n=1(tn−f(x,ω))2=12N∑n=1(tn−ωTφ(xn))2:=12 t−ωTφ(x) 22to findminωE(ω)=minω12 t−ωTφ(x) 22.with the residualr(ω)=t−ωTφ(x). The test functionsφ(x)form the designmatrixA and the regression problem (or the least squares problem) iswritten as:minω12 r(ω) 22=minω12 Aω−t 22,whereAis of the sizeN×MwithN>M,tis the target vector of thesizeN,andωis vector of weights of the sizeM.6.4.4Regularized Least Squares Problem:Let now the matrixAwill have entriesaij=ϕj(xi),i=1,...,N;j=1,...,M.Recall, that functionsϕj(x),j=0,...,Mare called basis functions whichshould be chosen andare known. Then the regularized least squaresproblem takes the formminω12 r(ω) 22+γ2 ω 22=minω12 Aω−t 22+γ2 ω 22.To minimize the regularized squarederrorswehave toderive thenormalequations. Similarly as the Fréchet derivative for the non-regularized
munotes.in
Page 82
82regression problem, we look for theωwhere the gradientof12||Aω−t||22+γ2 ω 22=12(Aω−t)T(Aω−t)+γ2ωTωvanishes. In otherwords, we consider12lim e →0(A(ω+e)−t)T(A(ω+e)−t)−(Aω−t)T(Aω−t)||e||2+lim e →0γ2(ω+e)T(ω+e)−γ2ωTω||e||2We finally get0=lim e →0eT(ATAω−ATt)||e||2+γeTω||e||2The expression above means that the factorATAω−ATt+γωmust also bezero, or(ATA+γI)ω=Att6.5 PERCEPTRON6.5.1 Introduction:The perceptron is an algorithm for supervised learning of binaryclassifiers. A binary classifier is a function which can decide whetherinput, represented by a vector of numbers, belongs to some specific class.It is a type of linear classifier, i.e. a classification algorithm that makes itspredictions based on a linear predictor functioncombining a set of weightswith the feature vector. The perceptron algorithm was invented in 1958 atthe Cornell Aeronautical Laboratory by Frank Rosenblatt.
Figure5: Perceptron model6.5.2Types of Perceptron:There are two types of Perceptrons: Single layer and Multilayer.1.Single layer-Single layer perceptrons can learn only linearlyseparable patterns2.Multilayer-Multilayer perceptrons or feedforward neural networkswith two or more layers have the greater processing power6.5.3Single layerPerceptron:It is the first and basic model of the artificial neural networks. It is alsocalled the feed-forward neural network. Single layer perceptrons can learn
munotes.in
Page 83
83only linearly separable patterns The working of the single-layer perceptron(SLP) is basedon the threshold transfer between the nodes. This is thesimplest form of ANN and it is generally used in the linearly based casesfor the machine learning problems.6.5.3.1Working of Single Layer Perceptron:
Figure 6:Single layer Perceptron●In asingle layer perceptron, the weights to each input node areassigned randomly since there is no a priori knowledge associatedwith the nodes.●Now SLP sums all the weights which are inputted and if the sums areis above the threshold then the network is activated.●If the calculated value is matched with the desired value, then themodel is successful●If it is not, then since there is no back-propagation technique involvedin this the error needs to be calculated using the below formula andthe weights needto be adjusted again.6.5.3.1Advantages:1.Single Layer Perceptron is quite easy to set up and train.2.The neural network model can be explicitly linked to statisticalmodels3.The SLP outputs a function which is a sigmoid and that sigmoidfunction can easilybe linked to posterior probabilities.4.We can interpret and input the output as well since the outputs are theweighted sum of inputs.6.5.3.2Disadvantages:1.This neural network can represent only a limited set of functions.2.The decision boundaries that are the threshold boundaries are onlyallowed to be hyperplanes.3.This model only works for the linearly separable data.
munotes.in
Page 84
846.5.4Multi layer perceptron:A multilayer perceptron (MLP) is a class of feedforward artificial neuralnetwork (ANN). The term MLP is used ambiguously, sometimes looselyto any feedforward ANN, sometimes strictly to refer to networkscomposed of multiple layers of perceptrons (with threshold activation).An MLP consists of at least three layers of nodes:1)an input layer,2)ahidden layer and3)an output layer. Except for the input nodes, each nodeis a neuron that uses a nonlinear activation function. MLP utilizes asupervisedlearning technique called backpropagation for training. Itsmultiple layers and non-linear activation distinguish MLP from a linearperceptron. It can distinguish data that is not linearly separable.
Figure 7: Multilayer PerceptronFollowing are the building blocks of perceptron1. Neurons2. Activation3. Networks of Neurons4. Input or Visible Layers5. Hidden Layers6. Output Layer6.5.4.1Neurons:The building block for neural networksisartificial neurons.These are simple computational unitsthat have weighted input signals andproduce an output signal using an activation function.
Figure8:Artificial Neurons
munotes.in
Page 85
85You may be familiar with linear regression, in which case the weights onthe inputs are very much like the coefficients used in a regressionequation.Like linear regression, each neuron also has a bias which can bethought of as an input that always has the value 1.0 and it too must beweighted.6.5.4.2Activation:The weighted inputs are summed and passed through an activationfunction, sometimes called a transfer function.Traditionally non-linearactivation functions are used. This allows the network to combine theinputs in more complex ways and in turn provide a richer capability in thefunctions they can model. Non-linear functions like the logistic also calledthe sigmoid function were used that output a value between 0 and 1 withan s-shaped distribution, and the hyperbolic tangent function also calledtanh that outputs the same distribution over the range-1 to +1.Morerecently the rectifier activation function has been shown to provide betterresults6.5.4.3Networks of Neurons:Neurons are arranged into networks of neurons.A row of neurons is calleda layer and one network can have multiple layers. The architecture oftheneurons in the network is often called the network topology.
Figure 9: Networks of Neurons6.5.4.4Input or Visible Layers:The bottom layer that takes input from your dataset is called the visiblelayer, because it is the exposed part of the network. Often a neural networkis drawn with a visible layer with one neuron per input value or column inyour dataset. These are not neurons as described above, but simply passthe input value though to the next layer.6.5.4.5Hidden Layers:Layers after the input layer are called hidden layers because that are notdirectly exposed to the input. The simplest network structure is to haveasingle neuron in the hidden layer that directly outputs the value.Given increases in computing power and efficient libraries, very deepneural networks can be constructed. Deep learning can refer to havingmany hidden layers in your neural network. They are deep because theywould have been unimaginably slow to train historically, but may takeseconds or minutes to train using modern techniques and hardware.
munotes.in
Page 86
866.5.4.6Output Layer:The final hidden layer is called the output layer and it is responsibleforoutputting a value or vector of values that correspond to the formatrequired for the problem.The choice of activation function in he output layer is strongly constrainedby the type of problem that you are modeling.For example:1.A regressionproblem may have a single output neuron and the neuronmay have no activation function.2.A binary classification problem may have a single output neuron anduse a sigmoid activation function to output a value between 0 and 1 torepresent the probability ofpredicting a value for the class 1.3.A multi-class classification problem may have multiple neurons in theoutput layer, one for each class (e.g. three neurons for the threeclasses in the famous iris flowers classification problem).6.5.4.7Stochastic Gradient Descent:The classical and still preferred training algorithm for neural networks iscalled stochastic gradient descent.This is where one row of data isexposed to the network at a time as input. The network processes the inputupward activating neurons as it goes to finally produce an output value.This is called a forward pass on the network. It is the type of pass that isalso used after the network is trained in order to make predictions on newdata.The process is repeated for all of the examples in your training data. Oneround of updating the network for the entire training dataset is called anepoch. A network may be trained for tens, hundreds or many thousands ofepochs.6.5.4.8Weight Updates:The weights in the network can be updatedfrom the errors calculated foreach training example and this is called online learning. It can result in fastbut also chaotic changes to the network.Alternatively, the errors can be saved up across all of the trainingexamples and the network can be updated at the end. This is called batchlearning and is often more stable.Typically, because datasets are so large and because of computationalefficiencies, the size of the batch, the number of examples the network isshown before an update is oftenreduced to a small number, such as tens orhundreds of examples.munotes.in
Page 87
876.5.4.9Prediction:Once a neural network has been trained it can be used to make predictions.You can make predictions on test or validation data in order to estimatethe skill of the model on unseen data. You can also deploy it operationallyand use it to make predictions continuously.An MLP is a network of simple neurons called perceptrons. The basicconcept of a single perceptron was introduced by Rosenblatt in 1958. Theperceptron computes a single output from multiple real-valued inputs byforming a linear combination according to its input weights and thenpossibly putting the output through some nonlinear activation function.Mathematically this can be written aswhere W denotesthe vector of weights, X is the vector of inputs, b is thebias and φ is the activation function. A signal-flow graph of this operationis shown in Figure given below6.5.4.10Advantages:●Can be applied to complex non-linear problem●Works well with largeinput data●Provides quick prediction after training●The same accuracy can be achieved even with smaller data6.5.4.11Disadvantages:●It is not known to what extent each independent variable is affectedby the dependent variable.●Computations aredifficult and time consuming●The proper functioning of the model depends on the quality oftrainingSUMMARYThis chapter gives a breakdown of linear regression model. Regressionanalysis is one of the first modeling techniques to learn as a data scientist.It can helpful when forecasting continuous values, e.g., sales, temperature.There are quite a few formulas to learn but they are necessary tounderstand when we run linear regression models. As you saw above thereare many ways to check the assumptions of linear regressionlike LeastSquares method, Multivariate Linear Regression, Regularized Regression,Using Least Square regression for Classification. Also we saw Perceptronthat classifies patterns and groups by finding the linear separation betweendifferent objects and patterns that are received through numeric or visualinput.
munotes.in
Page 88
88UNIT ENDQUESTIONS•Least Square method:1.Explain least square method and its limitations.2.Explain the types of least square method.3.What is the difference between Linearand non-linear least squaremethod?4.Solve the following using least square method.•Multivariate Linear Regression:1.Explain Multivariate linear regression with an example.2.What are the steps for Multivariate Linear regression?•Regularized Regression:1.Explain Regularized regression.2.What are the types of regularized regression?3.Give comparison of Lasso and Ridge with linear regrrssion model.•Using Least Square Regression for classification:1.Explain the use of least square regression for classification.•Perceptron1.Explain perceptron algorithm.2.Explain the types of perceptron algorithms3.Explain the working of single layer perceptron.4.Explain Single layer perceptron with advantages and disadvantages.5.Explain Multilayer perceptron with advantages anddisadvantages.REFERENCE•Bevans, R. (2020, October 26). Simple Linear Regression: An EasyIntroduction & Examples. Retrievedfromhttps://www.scribbr.com/statistics/simple-linear-regression/•Bevans, R. (2020, October 26). Multiple Linear Regression: A Quickand Simple Guide. Retrievedfromhttps://www.scribbr.com/statistics/multiple-linear-regression/•International Encyclopedia of the Social Sciences. . Encyclopedia.com.16 Oct. 2020 . (2020, November 27). Retrievedfromhttps://www.encyclopedia.com/social-sciences/applied-and-social-sciences-magazines/ordinary-least-squares-regression
munotes.in
Page 89
89•Jcf2d, W. B. (n.d.). University of Virginia Library Research DataServices Sciences. Retrievedfromhttps://data.library.virginia.edu/understanding-q-q-plots/•Stephanie. (2020, September 16). Q Q Plots: Simple Definition &Example. Retrieved fromhttps://www.statisticshowto.com/q-q-plots/•Assumptions of Linear Regression. (2020, June 22). Retrievedfromhttps://www.statisticssolutions.com/assumptions-of-linear-regression/*****munotes.in
Page 90
907:SUPPORT VECTOR MACHINEUnit Structure7.0Objective7.1Support Vector Machines7.1.1Definition7.1.2Hyperplane and Support Vectors:7.1.2.1Hyperplane:7.1.2.2Support Vectors:7.1.3Types ofSVM7.1.4Working of SVM7.1.4.1Linear SVM:7.1.4.2Non-linear SVM:7.2Soft Margin SVM7.2.1What Soft Margin does is7.2.2Better of soft margin7.2.3Degree of tolerance7.2.4 Formulation7.3Linear Classifiers7.3.1Regularized Discriminant Analysis7.3.2Linear Discriminant Analysis7.3.3Quadratic Discriminant Analysis7.3.4Logistic Regression7.4Kernel Functions in Non-linear Classification397.4.1Kernel Functions7.4.2Kernel Composition Rules7.4.3Radial Basis Kernel7.4.4Kernel in ActionSummaryUnit EndQuestionReference7.0 OBJECTIVESObjective: Support Vector Machineis a linear model for classification andregression problems. It can solve linear and non-linear problems and workwell for many practical problems. Through this chapter we are exhibitingSVM model.munotes.in
Page 91
917.1INTRODUCTIONSUPPORT VECTOR MACHINES7.1.1Definition:Support Vector Machine(SVM)is one of the most popular SupervisedLearning algorithms, which is used for Classification as well asRegression problems. Mainly it is used for Classification problems inMachine Learning.The goal of the SVM algorithm is to create the best line or decisionboundary that can segregate n-dimensional space into classes so that wecan easily put the new data point in the correct category in the future. Thisbest decision boundary is called a hyperplane. SVM chooses the extremepoints/vectors that help in creating the hyperplane. These extreme casesare called support vectors, and hence the algorithm is termed as SupportVector Machine. Consider the below diagram in which there are twodifferentcategories that are classified using a decision boundary orhyperplane:
Figure 10: Decision boundary or hyperplane7.1.2Hyperplane and Support Vectors:7.1.2.1Hyperplane:There can be multiple lines/decision boundaries to segregate the classes inn-dimensional space, but we need to find out the best decision boundarythat helps to classify the data points. This best boundary is known as thehyperplaneof SVM.7.1.2.2Support Vectors:The data points or vectors that are the closest to the hyperplane and whichaffect the position of the hyperplane are termed asSupport Vectors.These vectors support the hyperplane, hence called a Support vector.
munotes.in
Page 92
927.1.3Types of SVM:SVM can be of two types:1.Linear SVM:Linear SVM is used for linearly separable data, whichmeans if a dataset can be classified into two classes by using a singlestraight line, then such data is termed as linearly separable data, andclassifier is used called as Linear SVM classifier.2.Non-linear SVM:Non-Linear SVM is used for non-linearlyseparated data, which means if a dataset cannot be classified by usinga straight line, then such data is termed as non-linear data andclassifier used is called as Non-linear SVM classifier.7.1.4Working of SVM:7.1.4.1Linear SVM:Suppose we have a dataset that has two tags (green and blue), and thedataset has two features x1 and x2. We want a classifier that can classifythe pair(x1, x2) of coordinates in either green or blue. Consider the belowimage:
Figure 11: Linear SVMSo as it is 2-d space so by just using a straight line, we can easily separatethese two classes. But there can be multiple lines that can separate theseclasses
.Figure 12: Separated classes
munotes.in
Page 93
93Hence, the SVM algorithm helps to find the best line or decisionboundary; this best boundary or region is called a hyperplane. SVMalgorithm finds the closest point of the lines from both the classes. Thesepoints are called support vectors. The distancebetween the vectors and thehyperplane is called the margin. And the goal of SVM is to maximize thismargin. The hyperplane with maximum margin is called theoptimalhyperplane.
Figure 13: Optimal hyperplane7.1.4.2Non-linear SVM:For non-linear data,we cannot separate by drawing a single straight line.Consider the below image:
Figure 14: Non Linear SVMSo to separate these data points, we need to add one more dimension. Forlinear data, we have used two dimensions x and y, so for non-linear data,we will add a third dimension z. It can be calculated as:
munotes.in
Page 94
94z=x2 +y2
Figure 15: Third dimension zSo now, SVM will divide the datasets into classes in the following way.
Figure 16: SVM divide dataset in classesSince we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in 2d space with z=1, then it will become as:
Figure 17: Best Hyperplane
munotes.in
Page 95
95Hence we get a circumference of radius 1 in case of non-linear data.7.2SOFT MARGIN SVM7.2.1What Soft Margin doesis:1.it tolerates a few dots to get misclassified2.it tries to balance the trade-off between finding a line that maximizesthe margin and minimizes the misclassification.This idea is based on a simple premise: allow SVM to make a certainnumber of mistakesand keep the margin as wide as possible so that otherpoints can still be classified correctly. This can be done simply bymodifying the objective of SVM.The main idea behind the support vectorclassifier is to find a decision boundary with a maximum width that canclassify the two classes. Maximum margin classifiers are super sensitive tooutliers in the training data and that makes them pretty lame. Choosing athreshold that allows misclassifications is an example of the Bias-Variancetradeoff that plagues all the machine learning algorithms.When we allowsome misclassifications (slack variables), the distance between theobservations and the threshold is called a “soft margin”.This idea is based on a simple premise: allow SVM to make a certainnumberof mistakes and keep the margin as wide as possible so that otherpoints can still be classified correctly. This can be done simply bymodifying the objective of SVM.7.2.2Bettersoft margin:We use cross-validation to determine how many misclassifications andobservations to allow inside of the soft margin to get the bestclassification.The name support vector classifier comes from the fact that theobservations on the edge that helps us to draw the margin are calledsupport vectors.7.2.3Degree oftoleranceHow much tolerance we want to set when finding the decision boundary isan important hyper-parameter for the SVM.7.2.4Formulation●Almost all real-world applications have data that is linearlyinseparable.●Insomecases where the data is linearly separable, we might not wantto choose a decision boundary that perfectly separates the data toavoid overfitting. For example, consider the following diagram:munotes.in
Page 96
96Figure 18: Betterdecision boundaryHere the red decision boundary perfectly separates all the training points.However, is it really a good idea to have a decision boundary with suchless margin The green decision boundary has a wider margin that wouldallow it to generalize well on unseendata. In that sense, soft marginformulation would also help in avoiding the overfitting problem.Let us see how we can modify our objective to achieve the desiredbehavior. In this new setting, we would aim to minimize the followingobjective:…Equation 1Here, C is a hyperparameter that decides the trade-off betweenmaximizing the margin and minimizing the mistakes. When C is small,classification mistakes are given less importance and focus is more onmaximizing the margin, whereas when C is large, the focus is more onavoiding misclassification at the expense of keeping the margin small.Let’s see how this could be incorporated with the help of the followingdiagram.
Figure 19: Avoiding MisclassificationFigure19: The penalty incurred bydata points for being on the wrong sideof the decision boundary
munotes.in
Page 97
97The idea is: for every data point x_i, we introduce a slack variable ξ_i. Thevalue of ξ_i is the distance of x_i from the corresponding class’s margin ifx_iis on the wrong side of the margin, otherwise zero. Thus the points thatare far away from the margin on the wrong side would get more penalty.With this idea, each data point x_i needs to satisfy the followingconstraint:……..Equation 2Here, the left-hand side of the inequality could be thought of as theconfidence of classification.7.3 LINEAR CLASSIFIERSSolving classification tasks are based on linear models. What this means isthat they aim at dividing the feature space into a collection of regionslabeled according to the values the target can take, where the decisionboundaries between those regions are linear: they are lines in 2D, planes in3D, and hyperplanes with more features.Some of linear classifier techniques include:1.Regularized Discriminant Analysis2.Quadratic Discriminant Analysis3.Linear Discriminant Analysis4.Logistic Regression7.3.1Regularized Discriminant Analysis:Likelinear models for regression can be regularizedto improve accuracy,so can linear classifiers. One can introduce a shrinkingparameter α thatshrinks the separate covariance matrices of QDA towards a common LDAmatrix:The shrinkage parameter can take values from 0 (LDA) to 1 (QDA) andany value in between is a compromise between the two approaches. Thebest value of α can be chosen based on cross-validation. To do this inPython, we need to pass the shrinkage argument to the LDA function, aswell as specify the computation algorithm to be least squares, as othercomputation methods do not support shrinkage.7.3.2Linear Discriminant Analysis:The method to be discussed is the Linear Discriminant Analysis (LDA). Itassumes that the joint density of all features, conditional on the target's
munotes.in
Page 98
98class, is a multivariate Gaussian. This means that the density P of thefeatures X, given the target y is in class k, are assumed to be given bywheredis the number of features, µ is a mean vector, and Σ_kthecovariance matrix of the Gaussian density for classk.The decision boundary between two classes, saykandl, is the hyperplaneon which the probability of belonging to either class is the same. Thisimplies that, on this hyperplane, the difference between the two densitiesshould be zero.7.3.3Quadratic Discriminant Analysis:This performs a quadratic discriminant analysis(QDA). QDA is closelyrelated to linear discriminant analysis (LDA), where it is assumed that themeasurements are normally distributed. Unlike LDA however, in QDAthere is no assumption that the covariance of each of the classes isidentical. To estimatethe parameters required in quadratic discriminationmore computation and data is required than in the case of lineardiscrimination. If there is not a great difference in the group covariancematrices, then the latter will perform as well as quadratic discrimination.Quadratic Discrimination is the general form of Bayesian discrimination.Discriminant analysis is used to determine which variables discriminatebetween two or more naturally occurring groups. For example, aneducational researcher may want toinvestigate which variablesdiscriminate between high school graduates who decide (1) to go tocollege, (2) NOT to go to college. For that purpose the researcher couldcollect data on numerous variables prior to students' graduation. Aftergraduation, most students will naturally fall into one of the two categories.Discriminant Analysis could then be used to determine which variable(s)are the best predictors of students' subsequent educational choice.Computationally, discriminant function analysis is very similar to analysisof variance (ANOVA). For example, suppose the same student graduationscenario. We could have measured students' stated intention to continue onto college one year prior to graduation. If the means for the two groups(those who actually went to college and those who did not) are different,then we can say that the intention to attend college as stated one year priorto graduation allows us to discriminate between those who are and are notcollege bound (and this information may be used by career counselors toprovide the appropriate guidance to the respective students). The basicidea underlying discriminant analysis is to determine whether groupsdiffer with regard to the mean of a variable, and then to use that variable topredict group membership (e.g. of new cases).
munotes.in
Page 99
99Discriminant Analysis may be used for two objectives: either we want toassess the adequacy of classification, given the group memberships of theobjects under study; or we wish to assign objects to one of a number of(known) groups of objects. Discriminant Analysis may thus have adescriptive or a predictive objective. In both cases, some groupassignments must be known before carrying out the DiscriminantAnalysis. Such group assignments, or labeling, may be arrived at in anyway. Hence Discriminant Analysis can be employed as a usefulcomplement to Cluster Analysis (in order to judge the results of the latter)or Principal Components Analysis.7.3.4Logistic Regression:Another approach to linear classification is the logistic regression model,which, despite its name, is a classification rather than a regression method.Logistic regression models the probabilities of an observation belonging toeach of theKclasses via linear functions, ensuring these probabilitiessumup to one and stay in the (0, 1) range.The model is specified in termsofK-1 log-odds ratios, with an arbitrary class chosen as reference class (inthis example it is the last class,K). Consequently, the difference betweenlog-probabilities of belonging to a given class and to the reference class ismodeled linearly as
whereGstands for the true, observed class. From here, the probabilities ofan observation belonging to each of the classes can be calculated aswhich clearly shows that all class probabilities sum up to one.
munotes.in
Page 100
100Logistic regression models are typically estimated by maximumlikelihood. Just likelinearmodels for regression can be regularizedtoimprove accuracy, so can logistic regression.7.4 KERNEL FUNCTIONS IN NON-LINEARCLASSIFICATION
Figure 20: Kernal FunctionOnce the data points are non-linear separable in their original featurespace, the linear classifier may fail to determine where the decisionboundary is. However, mapping the original feature space (x) intothe higher dimensional feature space (ϕ(x) , e>d) can help toresurrect the linear classifier to do the job correctly.
Figure21:Mapping data points with 2-D feature vectors into 3-Dfeature vectorsFigure illustratesthe concepts of classifying data points through featuremapping. Originally, the data points with the feature vectorsx= [x,x] inthe 2-D space havethe concentrically circular distribution. It is impossibleto use a linear classifier to distinguish the decision boundary.Nonetheless, by incorporating a certain mapping functionϕ(x), the featurevectors can be transformed into 3-D feature space. The new data points
munotes.in
Page 101
101with 3-D feature vectorsϕ(x) = [x,x,(x²+x²)] can now be using thelinear classifier to determine the decision boundary hyperplane. This is thepower of feature mapping that can allow us to deal with the more complexdata distribution pattern with more expressive ability.However, thedrawbacks of usingϕ(x) directly are thatIt is sometimes hard to explicitlyconstruct aϕ(x) directly.Increase computational power quickly with theincreased feature dimensions.But the kernel functions can provide anefficient way to solve this.7.4.1Kernel Functions:The idea of kernel functions is to take the inner products between twofeature vectors, and evaluate inner products is not computationally costly.We can then exploit only the result of the inner products in our algorithms.For example, if we want to have theϕ(x) as follows,The kernel function is take the inner products between two feature vectorsas follows,As a result, the form of the kernel functions will be easier for us toconstruct than directly using the mapping functions in the higher featuredimensions.7.4.2Kernel Composition Rules:There are several kernelcompositionsrules that can be used to constructmore complex kernel functions.
munotes.in
Page 102
1027.4.3Radial Basis Kernel:The kernel functions can even empower the feature vectors to be infinitedimensional. One of the common kernel functions is the radial basiskernel. The definition is as follows.Because the exponential can be expanded to the infinite power series, theradial basis kernel gives much more expressiveness to the featuremapping. The following is the proof of the radial basis kernel that is akernel function.
Kernel Perceptron Algorithm:Recalling the perceptron algorithmhere, the perceptron algorithmupdatesθ = θ +yʲxʲonce a data point is misclassified. In the otherword, theθcan be expressed alternatively as follows.where αis the number of mistakes the perceptron made on thej-th datapoint. If it is in the mapping feature space, theθcan be expressed asfollows.
munotes.in
Page 103
103SUMMARYWe have two choices, we can either use the sci-kit learn library to importthe SVM model and use itdirectly or we can write our model fromscratch. Instead, using a library fromsklearn.SVMmodule whichincludes Support Vector Machine algorithms will be much easier inimplementation as well as to tune the parameters. You can try for hand-onwith the SVM algorithms including classification and regression problems.SVM Use CasesSome use-cases of SVM are as below.•Face Detection•Bioinformatics•Classification of Images•Remote Homology Detection•Handwriting Detection•Generalized Predictive Control•Text and Hypertext CategorizationUNIT ENDQUESTION•Support vector Machine1.Explain support vector machines with example.2.What are the types of Support vector machines?3.What are Hyperplane and Support vectors in the SVM algorithms?4.Explain the working of SVM.5.Why SVM is an example of a large margin classifier?6.What is a kernel in SVM? Why do we use kernels in SVM?7.Explain the key terminologies of Support Vector Machine.8.Define support vector machine(SVM) and further explain themaximum margin linear separators concept.•Soft Margin SVM1.What is soft margin SVM?2.Explain the working of Soft margin SVM.3.Explain the formulation of soft margin SVM?•Obtaining probabilities from linear classifiers:1.How to obtain probabilities from linear classifiers using logisticregression?•Kernal methods for non-linearity1.Explain Kernel methods for non-linearitymunotes.in
Page 104
1042.What are the limitations of the kernel method?3.Explain optimization problem for SVM with non-linear kernalREFERENCE•"A First Course in Machine Learningby Simon Rogers and MarkGirolami•Machine Learning Algorithms: A reference guide to popularalgorithms for data science and machine learningJuly 2017 Author:Giuseppe Bonaccorso Publisher:Packt Publishing ISBN:978-1-78588-962-2.•Deep Learning IanGoodfellow, Yoshua Bengio, and Aaron Courville•Hands-On Machine Learning with Scikit-Learn and TensorFlow•Concepts, Tools, and Techniques to Build Intelligent Systems (2ndedition)munotes.in
Page 105
105UNITIV8DISTANCE BASED MODELSUnit Structure8.0Objectives8.1Introductionto Algebric Model8.1.1 Distance based models8.1.2Distance Calculation Methods8.2Neighbours and Exemplars8.3Nearest Neighbours Classification8.3.1 What is Nearest Neighbour?8.3.2Working ofK-NN Algorithm8.3.3 Examples of K-NN Algorithm8.4K-Means Algorithm8.4.1 K-Means algorithm working8.4.2 Examples of K-Means algorithm8.5HierarchicalClustering8.5.1Agglomerative Clustering8.5.2Examples ofHierarchical ClusteringSummaryUnit End ExercisesList of References8.0 OBJECTIVESThis chapter would make you understand the following concepts:•Meaning of distance based model•Distance Computation methods•Concepts of different clustering algorithms•Application of clustering algorithms to solve problems8.1 INTRODUCTIONTO ALGEBRIC MODELIn this section, we consider models that define similarity by consideringthe geometry of the instance space.InAlgebricmodels, features could bedescribed aspoints in twodimensions (x-andy-axis) or a three-dimensional space (x,y,andz). Even when features are not intrinsicallygeometric, they could be modelled in a geometric manner (for example,munotes.in
Page 106
106temperature as a function of time can be modelled in two axes). Inalgebraicmodels, there are two ways we could impose similarity.•We could use geometric concepts likelines or planes to segment(classify)the instance space. These are calledLinear models.•Alternatively, we can use the geometric notion of distance to representsimilarity. In this case, if two points are close together, they havesimilar values for features and thus can be classed as similar. We callsuch models asDistance-based models.8.1.1 Distance based models:In the firstsection wehave seen the concept ofAlgebricmodels. Secondclass or type of thealgebricmodels is known as theDistance-basedmodels.Geometry of data is used to design theDistance-based models.Working of the distance-based modelsis basedon the concept ofdistance.With respect to machine learning, the concept of distance is notbased onjustthe physical distance between two points. Instead, we couldthink of the distance between two points considering themode oftransportbetween two points.For example if we aretravellingby planefrom one city to otherthen the plane willcover less distance physicallyascompared to travellingby train. The reason for this is the unrestrictedroute for a plane. In the same manner forchess, the concept of distancedepends on the piece used. For example, a Bishop can move diagonally.Thus, depending on the entity and the mode of travel, the concept ofdistance can be experienced differently.8.1.2Distance Calculation Methods:Thefollowingdistance metricsarecommonly usedto calculate thedistance.1.Euclidean distance:For geometrical problemsEuclideandistance isusedas the standardmetric.It is simply the ordinary distance between two points. Euclideandistance ismainlyextensively used in clustering problems.In K-meansalgorithms bydefaultEuclidean distance is used asdistance measure.TheEuclidean distanceis calculated by takingthe root of square differencesbetween the coordinates of a pair of objects( x1,y1) and (x2,y2)as shown inequation given belowDistance =.2.Manhattan distance:Manhattan distance is a distance metric that calculates the absolutedifferences between coordinates of pair of data objects as shown inequation given below:Distance=
munotes.in
Page 107
1078.2NEIGHBOURS AND EXEMPLARSIn the Distance based models distance is applied through the conceptofneighbours and exemplars. Neighbours are points in proximity withrespect to the distance measure expressed through exemplars. Exemplarsare eithercentroidsthatfind a centre of mass according to a chosendistance metric ormedoidsthatfind the most centrally located data point.The most commonly used centroid is the arithmetic mean, whichminimises squared Euclidean distance to all other points.•Thecentroidrepresents the geometric centre of a plane figure, i.e., thearithmetic mean position of all the points in the figure from thecentroid point. This definition extends to any object inn-dimensionalspace: its centroid is the mean position of all the points.•Medoidsare similar in concept to means or centroids. Medoids aremost commonly used on data when a mean or centroid cannot bedefined. They are used in contexts where the centroid is notrepresentative of the dataset, such as in image data.8.3NEAREST NEIGHBOURS CLASSIFICATIONExamples of distance-based models include thenearest-neighbourmodels,which use the training data as exemplars–for example, in classification.8.3.1 What is NearestNeighbour?:Nearestneighbouris a method inmachine learning method that aims atlabellingpreviously unseen query objects while distinguishing two ormore destination classes. As anyclassifier, in general, it requires sometraining data with given labelsand, thus, is an instance of supervisedlearning.K-nearest neighbors (KNN) algorithm is a type of supervised MLalgorithm which can be used for both classification as well as regressionpredictive problems. However, it is mainly used for classificationpredictive problems in industry. The following two properties woulddefine KNN well−•Lazy learning algorithm:KNN is a lazy learning algorithm because itdoes not have a specialized training phase and uses all the data fortraining while classification.•Non-parametric learning algorithm:KNN is also a non-parametriclearning algorithm because it doesn’t assume anything about theunderlying data.8.3.2Working of KNN Algorithm:K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predictthe values of new datapoints which further means that the new data pointwill be assigned a value based on how closely it matches the points in themunotes.in
Page 108
108training set. We can understand its working with the help of followingstepsStep1For implementing any algorithm, we need dataset. So during thefirst step of KNN, we must load the training as well as test data.Step2Next, we need to choose the value of K i.e. the nearest datapoints. K can be any integer.Step 3For each point in the test data do the following–•3.1− Calculate the distance between test data and each row oftraining data with the help of any of the method namely: EuclideanorManhattan distance. The most commonly used method tocalculate distance is Euclidean.•3.2− Now, based on the distance value,sort them in ascendingorder.•3.3− Next, it will choose the top K rows from the sorted array.•3.4− Now, it will assign a class to the test point based on mostfrequent class of these rows.Step 4–End:*(*
% * # 0*,#+& #+#*
)*%*-% %'+*)$'#%*( % %)$'#&(** )*%"%()*% &+()''#/) $'#$!&( */%munotes.in
Page 109
1098.3.3Examples of K-NN Algorithm:Example 1:The following is an example to understand the concept ofK and workingof KNN algorithmSuppose we have a datasetwhich can be plotted as follows:
Now, we need to classify new data point (60,60) into blue or red class. Weareassuming K = 3 i.e. it would find three nearest data points. It is shownin the next figure
We can see in the abovefigurethe three nearestneighboursof the datapoint with black dot. Among those three, two of themliein Red classhence the black dot will also be assigned in red class.Example 2:We havecollecteddata from thesamplesurvey. This data represents thetwo attributes as rating of acting of actors in that movie and other is rating
munotes.in
Page 110
110of story line of that movie.The rating scale is usedfrom 1(excellent) to 7(poor).Now we needto classify whether agiven movieis goodor not.Nowwe want to check whether new movie with rating asX1 =3 and X2= 7 is good or not.Here are four training samples:X1 =Rating ofActing skills ofmovie actorsX2 =Rating of storyline of movieY = Classification77Bad74Bad34Good14GoodStep 1 :Initialize and Define k.Lets say, k = 3NOTE :Always choose k as an odd number if the number of attributes isevento avoid a tie in the class predictionStep2 :Compute the distance between input sample andtrainingsample-Co-ordinate of the input sample is (3,7).-Instead of calculating the Euclidean distance, wecalculate theSquaredEuclidean distance.X1 =Rating of Actingskills of movie actorsX2 =Rating of storyline of movieSquared Euclideandistance77(7-3)2+ (7-7)2=1674(7-3)2+ (4-7)2=2534(3-3)2+ (4-7)2=09Step3:Sort the distance and determine the nearest neighbours based ofthe Kthminimum distance :X1 =RatingofActingskills ofmovieactorsX2 =Ratingof story lineof movieSquaredEuclideandistanceRankminimumdistanceIs itincluded in3-NearestNeighbour?77163Yes74254No34091Yes14132YesStep 4 :Take 3-Nearest Neighbours:Gather the category Y of the nearest neighbours.munotes.in
Page 111
111X1 =RatingofActingskills ofmovieactorsX2 =Ratingof storyline ofmovieSquaredEuclideandistanceRankminimumdistanceIs itincluded in3-NearestNeighbour?Y =Categoryof thenearestneighbour77163YesBad74254No-34091YesGood14132YesGoodStep5:Applysimple majority•Use simple majority of the category of the nearestneighbours as theprediction value of the query instance.•We have 2 “good” and 1 “bad”. Thuswe conclude that the newmoviewith X1 = 3 and X2 = 7 is included in the “good” category.8.4K-MEANS CLUSTERINGTo solve the wellknown clustering problem K-means is used, which is oneof the simplest unsupervised learning algorithms. Given data set isclassified assuming some prior number of clusters through a simple andeasy procedure. In k-means clustering for each cluster one centroid isdefined. Total there are k centroids. The centroids should be defined in atricky way because result differs based on the location of centroids. To getthe better results we need to place the centroids far away from each otheras much as possible. Next, each point from the given data set is stored in agroup with closest centroid. This process is repeated for all the points. Thefirst step is finished when all points are grouped. In the next step new kcentroids are calculated again from the result of the earlier step. Afterfinding these new k centroids, a new grouping is done for the data pointsand closest new centroids. This process is done iteratively. The process isrepeated unless and until no data point moves from one group to another.The aim of this algorithm is to minimize an objective function such as sumof a squared error function. The objective functionis defined as follows:J=kj = 1xi = 1|| xji–Cj||2Here|| xji–Cj||2shows the selected distance measure between a data pointxjiand the cluster centreCj. It is a representationof the distance of the ndata points from their respective cluster centers.munotes.in
Page 112
1128.4.1Working of K-means Algorithm:The algorithm is comprises of the following steps:1.Identify the K centroids for the given data points that we want tocluster.2.Store each data point in the group that has the nearest centroid.3.When all data points have been stored, redefine the K centroids.4.Repeat Steps 2 and 3 until the no data points move from one group toanother. The result of this process is the clusters fromwhich the metricto be minimized can be calculated.
8.4.3Examples ofK-means Algorithm:Example1:Given{2, 4, 10, 12, 3, 20, 30, 11, 25}. Assume number of clusters i.e. K =2Solution:Randomly assign means: m1= 3, m2= 4The numbers which are close to mean m1= 3 are grouped into cluster k1and others in k2.Again calculate new mean for new cluster group.StartInitialise number of clusterCalculate/InitialiseCentroidDistance objects tocentroidsGrouping based onminimum distanceNoobjectmovegroup?End)&munotes.in
Page 113
113K1=(2, 3}, k2= {4, 10, 12, 20, 30, 11, 25} m1= 2.5, m2= 16K1=(2, 3, 4}, k2= {10, 12, 20, 30, 11, 25} m1= 3, m2= 18K1=(2, 3, 4, 10}, k2= {12, 20, 30, 11, 25} m1= 4.75, m2= 19.6K1=(2, 3, 4, 10, 11, 12}, k2= {20, 30, 25} m1= 7, m2= 25Final clustersK1=(2, 3, 4, 10, 11, 12}, k2= {20, 30, 25}Example2:Given {10, 4, 2, 12, 3, 20, 30, 11, 25, 31} Assume number of clusters i.e.K = 2Solution:Randomly assign alternative values to each clusterK1=(10, 2, 3, 30, 25}, k2= {4, 12, 20, 11, 31} m1= 14, m2= 15.6Re assignK1=(2, 3, 4, 10, 11, 12}, k2= {20, 25, 30, 31} m1= 7, m2= 26.5Re assignK1=(2, 3, 4, 10, 11, 12}, k2= {20, 25, 30, 31} m1= 7, m2= 26.5Final clustersK1=(2, 3, 4, 10, 11, 12}, k2= {20, 25, 30, 31}Example3:Let’s assume that we have 4 types of items and each item has 2 attributesor features. We need to group these items in to k = 2 groups of items basedon the two features.ObjectAttribute 1(x)Number of partsAttribute 2(y)Colour codeItem 111Item 221Item 343Item 454Solution :Initial value of centroid:Suppose we use item 1 and 2 as the first centroids, c1= (1, 1) and c2= (2,1)The distance of item 1 = (1, 1) to c1= (1, 1) and with c2= (2, 1) iscalculated as,D=(1–1)2+(1–1)2= 0D=(1–2)2+(1–1)2= 1munotes.in
Page 114
114The distance of item 2 = (2, 1) to c1= (1, 1) andwith c2= (2, 1) iscalculated as,D=(2–1)2+(1–1)2= 1D=(2–2)2+(1–1)2= 0The distance of item 3 = (4, 3) to c1= (1, 1) and with c2= (2, 1) iscalculated as,D=(4–1)2+(3–1)2= 3.61D=(4–2)2+(3–1)2= 2.83The distance of item 4 = (5, 4) to c1= (1, 1) and with c2= (2, 1) iscalculated as,D=(5–1)2+(4–1)2= 5D=(5–2)2+(4–1)2= 4.24Objects-centroids distance:D0=
013.615102.834.24c1=(1,1)group 1c2=(2,1)group 2To find the cluster of each item we consider the minimum Euclidiandistance between group1 and group 2.From the above object centroid distance matrix we can see,Item 1 has minimum distance for group1, so we cluster item 1 in group1.Item 2 has minimum distance for group 2, so we cluster item 2 ingroup 2.Item 3 has minimum distance for group 2, so we cluster item 3 ingroup 2.Item 4 has minimum distance for group 2, so we cluster item 4 ingroup 2.Object Clustering:G0=
10000111Iteration 1 : Determine centroids:C1has only one member thus c1= (1, 1) remains same.C2=(2 + 4 + 5/3, 1 + 3 + 4/3) = (11/3, 8/3)The distance of item 1 = (1, 1) to c1= (1, 1) and with c2= (11/3, 8/3) iscalculated as,munotes.in
Page 115
115D=(1–1)2+(1–1)2= 0D=(1–11/3)2+(1–8/3)2= 3.41The distance of item 2 = (2, 1) to c1= (1, 1) and with c2=(11/3, 8/3) iscalculated as,D=(2–1)2+(1–1)2= 1D=(2–11/3)2+(1–8/3)2= 2.36The distance of item 3 = (4, 3) to c1= (1, 1) and with c2= (2, 1) iscalculated as,D=(4–1)2+(3–1)2= 3.61D=(4–11/3)2+(3–8/3)2= 0.47The distance of item 4 = (5, 4) to c1= (1, 1) and with c2= (2, 1) iscalculated as,D=(5–1)2+(4–1)2= 5D=(5–11/3)2+(4–8/3)2= 1.89Objects-centroids distanceD2=
013.6153.412.360.471.89c1=(1,1)group 1c2=
113,83group 2From the above object centroid distance matrix we can see,Item 1 has minimum distance for group1, so wecluster item 1 in group1.Item 2 has minimum distance for group 1, so we cluster item 2 ingroup 1.Item 3 has minimum distance for group 2, so we cluster item 3 ingroup 2.Item 4 has minimum distance for group 2, so we cluster item 4 ingroup 2.ObjectClustering:G1=
11000011Iteration 2 : Determine centroids:C1=(1 + 2/2, 1 + 1/2) = (3/2, 1)munotes.in
Page 116
116C2=(4 + 5/2, 3 + 4/2) = (9/2, 7/2)The distance of item 1 = (1, 1) to c1= (3/2, 1) and with c2= (9/2, 7/2)iscalculated as,D=(1–3/2)2+(1–1)2=0.5D=(1–9/2)2+(1–7/2)2=4.3The distance of item 2 = (2, 1) to c1= (3/2, 1) and with c2= (9/2, 7/2) iscalculated as,D=(2–3/2)2+(1–1)2= 0.5D=(2–9/2)2+(1–7/2)2= 3.54The distance of item 3 = (4, 3) to c1= (3/2, 1) and with c2= (9/2, 7/2) iscalculated as,D=(4–3/2)2+(3–1)2= 3.20D=(4–9/2)2+(3–7/2)2= 0.71The distance of item 4 = (5, 4)to c1= (3/2, 1) and with c2= (9/2, 7/2) iscalculated as,D=(5–3/2)2+(4–1)2= 4.61D=(5–9/2)2+(4–7/2)2= 0.71Objects-centroids distance:D2=
0.50.53.204.614.33.540.710.71c1=
32,1group 1c2=
92,72group 2From the above object centroid distance matrix we can see,Item 1 has minimum distance for group1, so we cluster item 1 in group1.Item 2 has minimumdistance for group 1, so we cluster item 2 ingroup 1.Item 3 has minimum distance for group 2, so we cluster item 3 ingroup 2.Item 4 has minimum distance for group 2, so we cluster item 4 ingroup 2.Object Clustering:G2=
11000011munotes.in
Page 117
117G2= G1, Objects does not move from group any more. So, the finalclusters are as follows:Item 1 and 2 are clustered in group 1Item 3 and 4 are clustered in group 2Example 4 :Suppose we have eight data points and each datapoint has 2 features.Cluster the data points into 3 clusters using k-means algorithm.Data pointsAttribute 1(x)Attribute 2(y)1210225384458575664712849Solution:Initial value of centroid:Suppose we use data points 1,4 and 7 as the first centroids, c1= (2, 10), c2= (5, 8) and c3= (1, 2)The distance of data point 1 = (2, 10) to c1= (2, 10), c2= (5, 8) and with c3= (1, 2) is,D=(2–2)2+(10–10)2= 0D=(2–5)2+(10–8)2= 3.61D=(2–1)2+(10–2)2= 8.06The distance of data point 1 = (2, 5) to c1= (2, 10), c2= (5, 8) and with c3= (1, 2) is,D=(2–2)2+(5–10)2= 5D=(2–5)2+(5–8)2= 4.24D=(2–1)2+(5–2)2= 3.16The distance of data point 1 = (8, 4) to c1= (2, 10), c2= (5, 8) and with c3= (1, 2) is,D=(8–2)2+(4–10)2= 8.48D=(8–5)2+(4–8)2= 5munotes.in
Page 118
118D=(8–1)2+(4–2)2= 7.28The distance of data point 1 = (5, 8)to c1= (2, 10), c2= (5, 8) and with c3= (1, 2) is,D=(5–2)2+(8–10)2= 3.61D=(5–5)2+(8–8)2= 0D=(5–1)2+(8–2)2= 7.21The distance of data point 1 = (7, 5) to c1= (2, 10), c2= (5, 8) and withc3= (1, 2) is,D=(7–2)2+(5–10)2= 7.07D=(7–5)2+(5–8)2= 3.61D=(7–1)2+(5–2)2= 6.71The distance of data point 1 = (6, 4) to c1= (2, 10), c2= (5, 8) and with c3= (1, 2) is,D=(6–2)2+(4–10)2= 7.21D=(6–5)2+(4–8)2= 4.12D=(6–1)2+(4–2)2= 5.39The distance of data point 1 = (1, 2) to c1= (2, 10), c2= (5, 8) and with c3= (1, 2) is,D=(1–2)2+(2–10)2= 8.06D=(1–5)2+(2–8)2= 7.21D=(1–1)2+(2–2)2= 0The distance of data point 1 = (4, 9) to c1= (2, 10), c2= (5, 8) and with c3= (1, 2) is,D=(4–2)2+(9–10)2= 2.24D=(4–5)2+(9–8)2= 1.4D=(4–1)2+(9–2)2= 7.62Objects-centroids distance:D0=
058.483.617.077.218.062.243.614.24503.614.127.211.48.063.167.287.216.715.3907.62c1=(2,10)group 1c2=(5,8)group 2c3=(1,2)group 3munotes.in
Page 119
119From the above object centroid distance matrix we can see,Data point 1 has minimum distance for group1, so we cluster datapoint 1in group 1.Data point 2 has minimum distance for group3, so we cluster datapoint 2in group 3.Data point 3 has minimum distance for group 2, so we cluster datapoint 3 in group 2.Data point 4 has minimum distance for group 2, so we cluster datapoint 4 in group 2.Data point 5 has minimum distance for group 2, so we cluster datapoint 5 in group 2.Data point 6 has minimum distance for group 2, so we cluster datapoint 6 in group 2.Data point 7 has minimum distance for group 3, so we cluster datapoint 7 in group 3.Data point 8 has minimum distance for group 2, so we cluster datapoint 8 in group 2.Object Clustering:G0=
100000000011110101000010Iteration1:Determine centroidsC1has only one member thus c1= (2, 10) remains same.C2=(8 + 5 + 7 + 6 + 4/5, 4 +8 + 5 + 4 + 9/5) = (6, 6)C3=(2 + 1/2, 5 + 2/2) = (1.5, 3.5)Objects-centroids distance:D1=
058.483.617.077.218.062.245.664.122.832.241.4126.403.166.521.586.255.75.74.521.586.04c1=(2,10)group 1c2=(6,6)group 2c3=(1.5,3.5)group 3Object Clustering:G1=
100000010011110001000010munotes.in
Page 120
120Iteration 2 : Determine centroids:C1=(2 + 4/2, 10 + 9/2) = (3, 9.5)C2=(8 + 5 + 7 + 6/4, 4 + 8 + 5 + 4/4) = (6.5, 5.25)C3=(2 + 1/2, 5 + 2/2) = (1.5, 3.5)D2=
1.122.357.432.56.026.267.761.126.544.511.953.130.561.356.387.686.521.586.525.75.74.521.586.04c1=(3,9.5)group 1c2=(6.5,5.25)group 2c3=(1.5,3.5)group 3Object Clustering:G2=
100100010010110001000010Iteration 3 : Determine centroids:C1=(2 + 5 + 4/3, 10 + 9 + 8/3) = (3.67, 9)C2=(8 + 7 + 6/3, 4 + 5 + 4/3) = (7, 4.33)C3=(2 + 1/2, 5 + 2/2) = (1.5, 3.5)D2=
1.954.336.611.665.25.527.490.336.015.041.054.170.671.056.445.556.521.586.525.75.74.521.586.04c1=(3.67,9)group 1c2=(7,4.33)group 2c3=(1.5,3.5)group 3Object Clustering:G3=
100100010010110001000010G3= G2, Objects does not move from group any more. So, the finalclusters are as follows:Data points 1, 4 and 8 are clustered in group 1Data points 3, 5 and 6 are clustered in group 2Data points 2 and 7 are clustered in group 3munotes.in
Page 121
1218.5HIERARCHICAL CLUSTERINGHierarchical clustering algorithms works in top down manner or bottomup manner. Hierarchical clustering is known as Hierarchicalagglomerative clustering.8.5.1Agglomerative Hierarchical Clustering:In agglomerative clusteringinitially each data point is considered as asingle cluster. In the next step, pairs of clusters are merged oragglomerated. This step is repeated until all clusters have been merged into a single cluster. At the end a single cluster remains that contains all thedata points. Hierarchical clustering algorithms works in top-down manneror bottom-up manner. Hierarchical clustering is known as Hierarchicalagglomerative clustering.In agglomerative, clustering is represented as a dendogram as shownbelow whereeach merge is represented by a horizontal line
A clustering of the data objects is obtained by cutting the dendogram atthe desired level,then each connected forms a cluster.The basic steps of Agglomerative hierarchical clustering are asfollows:1.Compute the proximity matrix(distance matrix)2.Assume each data point as a cluster.3.Repeat4.Merge the two nearest clusters.5.Update the proximity matrix6.Until only a single cluster remainsIn Agglomerative hierarchical clustering proximity matrix is symmetricie., the number on lower half will be same as the numbers on top half.Different approaches to defining the distance between clusters distinguishthe different algorithm’s ie., Single linkage,Complete linkage and Averagelinkage clusters.In single linkage, the distance between two clusters is considered to be&&*%%&
% , +##+)*()
munotes.in
Page 122
122equal to shortest distance from any member of one cluster to any memberof other cluster.D(r,s) =Min { d(i,j), object i-> cluster r and object j-> cluster sIncomplete linkage, the distance between two clusters is considered to beequal to greatest distance from any member of one cluster to any memberof other cluster.D(r,s) =Max { d(i,j), object i-> cluster r and object j-> cluster sIn average linkage, weconsider the distance between any two clusters Aand B is taken to be equal to average of all distances between pairs ofobject I in A and j in B.ie., mean distance between elements of each other.D(r,s) =Mean { d(i,j), object i-> cluster r and object j->cluster s*(*!*%* ($)+(*+() &$'+* )*%$*( .*&!*)#+)*(&&#+)*(
(#&))*#+)*()'* )*%$*( .%)&munotes.in
Page 123
1238.5.2. Examples of Hierarchical Clustering:Example1:The table below shows the six data points. Use all link methods to findclusters. Use Euclidian distance measure.XyD10.40.53D20.220.38D30.350.32D40.260.19D50.080.41D60.450.30Solution:First we will solve using single linkage:The distance of data point D1= (0.4, 0.53) to D2= (0.22, 0.38) is,D=(0.4–0.22)2+(0.53–0.38)2= 0.24The distance of data point D1= (0.4, 0.53) to D3= (0.35, 0.32) is,D=(0.4–0.35)2+(0.53–0.32)2= 0.22The distance of data point D1= (0.4, 0.53) to D4= (0.26, 0.19) is,D=(0.4–0.26)2+(0.53–0.19)2= 0.37The distance of data point D1= (0.4, 0.53) to D5= (0.08, 0.41) is,D=(0.4–0.08)2+(0.53–0.41)2= 0.34The distance of data point D1= (0.4, 0.53) to D6= (0.45, 0.30) is,D=(0.4–0.45)2+(0.53–0.30)2= 0.23Similarly we will calculate all distances.Distance matrix:D10D20.240D30.220.150D40.370.200.150D50.340.140.280.290D60.230.250.110.220.390D1D2D3D4D5D6munotes.in
Page 124
1240.11 is smallest. D3and D6have smallest distance. So, we combine thistwo in one cluster and recalculate distance matrix.Distance ((D3, D6), D1) = min (distance (D3, D1), distance (D6, D1)) = min(0.22, 0.23) = 0.22Distance ((D3, D6), D2) = min (distance (D3, D2), distance (D6, D2)) = min(0.15, 0.25) = 0.15Distance ((D3, D6), D4) = min(distance (D3, D4), distance (D6, D4)) = min(0.15, 0.22) = 0.15Distance ((D3, D6), D5) = min (distance (D3, D5), distance (D6, D5)) = min(0.28, 0.39) = 0.28Similarly we will calculate all distances.Distance matrix:D10D20.240(D3, D6)0.220.150D40.370.200.150D50.340.140.280.290D1D2(D3, D6)D4D50.14 is smallest. D2and D5have smallest distance. So, we combine thistwo in one cluster and recalculate distance matrix.Distance ((D3, D6), (D2, D5)) = min (distance (D3, D2), distance(D6, D2), distance (D3, D5), distance (D6, D6))=min (0.15, 0.25, 0.28, 0.29) = 0.15Similarly, we will calculate all distances.Distance matrix:D10(D2, D5)0.240(D3, D6)0.220.150D40.370.200.150D1(D2, D5)(D3, D6)D40.15 is smallest. (D2, D5) and (D3, D6) as well as D4and (D3, D6) havesmallest distance. We can pick either one.munotes.in
Page 125
125
Distance matrix:D10(D2, D5, D3, D6)0.220D40.370.150D1(D2, D5, D3, D6)D40.15 is smallest. (D2,D5, D3, D6) and D4have smallest distance. So, wecombine this two in one cluster and recalculate distance matrix.Distance matrix:D10(D2, D5, D3, D6, D4)0.220D1(D2, D5, D3, D6, D4)Now a single clusterremains (D2, D5, D3, D6,D4, D1)Next, we represent thefinal dendogram for singlelinkage as,Now we will solve using complete linkage:Distance matrix:D10D20.240D30.220.150D40.370.200.150D50.340.140.280.290D60.230.250.110.220.390D1D2D3D4D5D60.11 is smallest. D3and D6have smallest distance. So, we combine thistwo in one cluster and recalculate distance matrix.Distance ((D3, D6), D1) = max (distance (D3, D1), distance (D6, D1)) = max(0.22, 0.23) = 0.23Similarly, we willcalculate all distances.munotes.in
Page 126
126
Distance matrix:D10D20.240(D3, D6)0.230.250D40.370.200.220D50.340.140.390.290D1D2(D3, D6)D4D50.14 is smallest. D2and D5have smallest distance. So, we combine thistwo in one clusterand recalculate distance matrix.Distance matrix:D10(D2, D5)0.340(D3, D6)0.230.390D40.370.290.220D1(D2, D5)(D3, D6)D40.22 is smallest.Here (D3, D6) and D4have smallest distance. So, wecombine these two in one cluster and recalculate distance matrix.Distance matrix:D10(D2, D5)0.340(D3, D6, D4)0.370.390D1(D3, D6, D4)(D3, D6, D4)0.34 is smallest. (D2, D5) and D1have smallest distance so, we combinethese two in one cluster and recalculate distance matrix.Distance matrix:(D2, D5, D1)00(D3, D6, D4)0.390(D2, D5, D1)(D3, D6, D4)Now a single clusterremains (D2, D5, D1,D3, D6, D4)Next, we represent thefinal dendogram forcomplete linkage as,munotes.in
Page 127
127Now we will solve using average linkageDistance matrix:D10D20.240D30.220.150D40.370.200.150D50.340.140.280.290D60.230.250.110.220.390D1D2D3D4D5D60.11 is smallest. D3and D6have smallest distance. So, we combine thistwo inone cluster and recalculate distance matrix.Distance ((D3, D6), D1) = 1/2 (distance (D3, D1) + distance (D6, D1)) = 1/2(0.22 + 0.23) = 0.23Similarly, we will calculate all distances.Distance matrix:D10D20.240(D3, D6)0.230.20D40.370.200.190D50.340.140.340.290D1D2(D3, D6)D4D50.14 is smallest. D2and D5have smallest distance. So, we combine thistwo in one cluster and recalculate distance matrix.Distance matrix:D10(D2, D5)0.290(D3, D6)0.220.270D40.370.220.150D1(D2, D5)(D3, D6)D4(D3, D6) and D4have smallest distance. So, we combine this two in onecluster and recalculate distance matrix.Distance matrix:D10(D2, D5)0.240(D3, D6, D4)0.270.260D1(D2, D5)(D3, D6, D4)munotes.in
Page 128
1280.24 is smallest. (D2, D5) and D1have smallest distance. So, we combinethis two in one cluster and recalculate distance matrix.Distance matrix:(D2, D5, D1)00(D3, D6, D4)0.260(D2, D5, D1)(D3, D6, D4)Now a single clusterremains (D2, D5, D1, D3, D6, D4)Next, we represent the final dendogram for average linkage as,Example2:Apply single linkage, completelinkage and average linkage on thefollowing distance matrix and draw dendogram.Solution :First we will solve using single linkageDistance matrix:P10P220P3630P410970P598540P1P1P3P4P52 is smallest. P1and P2have smallest distance.So, we combine this two inone cluster and recalculate distance matrix.Distance ((P1,P2), P3) = min (distance (P, P3), distance (P2, P3)) = min (6,3) = 3Similarly, we will calculate all distances.Distance matrix:(P1, P2)0P330P4970P58540(P1, P2)P3P4P53 is smallest. (P1, P2) and P3have smallest distance.So, we combine thistwo in one cluster and recalculate distance matrix.Distance ((P1, P2, P3), P4)) = min (distance (P1, P4), distance (P2, P4),distance (P3, P4)) = min (9, 7) = 7munotes.in
Page 129
129
Similarly, we will calculate all distances.Distance matrix:(P1,P2, P3)0P470P5540(P1, P2, P3)P4P54 is smallest. P4and P5have smallest distance.Distance matrix:(P1, P2, P3)0(P4, P5)50(P1, P2, P3)(P4, P5)Now a single cluster remains (P1, P2, P3, P4, P5)Next, we represent the final dendogram forsingle linkage as,Now we will solve using complete linkageDistance matrixP10P220P3630P410970P598540P1P2P3P4P52 is smallest. P1and P2have smallest distance.So, we combine this two inone cluster and recalculate distance matrix.Distance ((P1, P2), P3) = max (distance (P1, P3), distance (P2, P3)) = max(6, 3) = 6Similarly, we will calculate all distances.Distance matrix(P1, P2)0P360P41070P59540(P1, P2)P3P4P5munotes.in
Page 130
130
4 is smallest. P4and P5have smallest distance.So, we combine this two inone cluster and recalculate distance matrix.Distance matrix:(P1, P2)0P360(P4, P5)1070(P1, P2)P3(P4, P5)6 is smallest. (P1, P2) and P3have smallest distance.So, we combine thistwo in one cluster and recalculate distance matrix.Distance matrix:(P1, P2, P3)0(P4, P5)100(P1, P2, P3)(P4, P5)Now a single cluster remains (P1, P2, P3, P4, P5)Next, we represent the final dendogramforcomplete linkage as,Now we will solve using average linkageDistance matrix:P10P220P3630P410970P598540P1P2P3P4P52 is smallest. P1and P2have smallest distance.So, we combine this two inone clusterand recalculate distance matrix.Distance ((P1, P2), P3) = 1/2 (distance (P1, P3), distance (P2, P3)) = 1/2 (6,3) = 4.5Similarly, we will calculate all distances.munotes.in
Page 131
131
Distance matrix:(P1, P2)0P34.50P49.570P58.5540(P1, P2)P3P4P54 is smallest. P4and P5have smallest distance.So, we combine this two inone cluster and recalculate distance matrix.Distance matrix:(P1, P2)0P34.50(P4, P5)960(P1, P2)P3(P4, P5)4.5 is smallest. (P1, P2) and P3have smallest distance.So, we combine thistwo in one cluster and recalculate distance matrix.Distance matrix:(P1, P2, P3)0(P4, P5)80(P1, P2, P3)(P4, P5)Now a single cluster remains (P1, P2,P3, P4, P5)Next, we represent the finaldendogramfor average linkage as,SUMMARYIn this chapter we have seen distance based model which is based on theconcept of distance. We have seen how to calculate the distance betweenthe data using Euclidean and manhattan distance. In this chapter we haveseen nearest neighbor method which is used to classify the data point in toone of the classes based on the concept of minimum distance. Here wehave also seen K means algorithm in which we calculate the centroid andthen distance of each data point is calculated from this centroid. Datapoints are clustered based on minimum distance and this process isrepeated unless and until there is no change in the clutsers. We have alsoseen agglomerative clustering in which we calculate the distance matrixmunotes.in
Page 132
132which isused to find minimum distance. The data points having minimumdistance are clustered together and distance matrix is updated. Thisprocess is repeated unless and until single cluster remains.UNIT END EXERCISES1. Describe the essential steps of K-means algorithm for clusteringanalysis.2. Apply K-means algorithm algorithm on given data for k=3.Usec1(2),c2(16) and c3(38) as initial cluster centres.Data: 2,4,6,3,31,12,15,16,38,35,14,21,23,25,303. Apply K-means algorithm algorithm on given data for k=3.Usec1(2),c2(16) and c3(38) as initial cluster centres.Data: 2,4,6,3,31,12,15,16,38,35,14,21,23,25,304. Apply Agglomerative clustering algorithm on given data and drawdendogram. Show three clusters with its allocated points.Use singlelink method.abcdeFa0
b
0
31
c
0
2d
1
023e
1
20
f
23
05.For the given set of points identify clusters using complete link andaverage link using Agglomerative clustering.ABP111P21.51.5P355P434P544P633.5ABmunotes.in
Page 133
133LIST OF REFERENCES•kdnuggets.com/2019/06/main-approaches-machine-learning-models.html•https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering•https://www.displayr.com/what-is-hierarchical-clustering•https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning.munotes.in
Page 134
9RULE BASED MODELSUnit Structure9.0Objectives9.1Introduction to Logic based Model9.1.1 Rule basedclassifier9.1.2 Example of Rule basedclassifier9.1.3 Application of Rule basedclassifier9.1.4 Characteristics of Rule basedclassifier9.1.5Rule Building using Coverage Algorithm9.2Rule learning for subgroup discovery9.2.1 Subgroup discovery9.2.2 Working of Rule learning for subgroup discovery9.2.3Measures in subgroup discovery9.2.4 Weighted Coverage Algorithm for subgroup discovery9.3Introduction to Association rule mining9.3.1 Apriori Algorithm9.3.1.1 What is Frequent Itemset?9.3.1.2 Steps for Apriori Algorithm9.3.1.3 Apriori Algorithm Working9.3.2 Association Rule Mining9.3.2.1 How does Association Rule Mining work?9.3.2.2Association Rule Mining using Apriori Algorithm9.3.2.3 Applications of Association Rule MiningSummaryUnit End ExercisesList ofReferences9.0 OBJECTIVES1. To identify label or class of a given example2. To identify rules for subgroup discovery3.To identify frequent item set.9.1 INTRODUCTION TO LOGICAL MODELSLogical modelsuse a logical expression to divide the instance space intosegments and hence construct grouping models. Alogical expressionismunotes.in
Page 135
an expression that returns a Boolean value,i.e., a True or False outcome.Once the data is grouped using a logical expression, the data is dividedinto homogeneous groupings for the problem we are trying to solve.Forexample, for a classification problem, all the instances in the group belongto one class.There are mainly two kinds of logical models:Tree modelsandRulemodels.9.1.1 Rule based Classifier:Rule models consist of a collection of implications or IF-THEN rules. Fortree-based models, the ‘if-part’ defines a segment and the ‘then-part’defines the behaviour of the model for this segment. Rule models followthe same reasoning.Rule-based classifier classifies records using a set of IF-THEN rules. Therules can be expressed in the following from−(Condition)->yWhere-LHS of above rule is known as an antecedent or condition-RHS of above rule is known as rule consequent-Condition is a conjunction of attribute. Herecondition consist of oneor more attribute tests which are logically ANDed.-Y represents the class labelAssume a rule R1,R1: IF Buying-Price = high AND Maintenance_Price = high AND Safety= low THEN Car_evaluation = unacceptableRule R1 can also be rewritten as:R1: (Buying-Price = high) ^ (Maintenance_Price = high)^ (Safety = low)->Car_evaluation = unacceptableIf the antecedent is true for a given record, then the consequent is given asoutput.9.1.2Example of Rule basedclassifier:Buying_PriceMaintenance_ PriceLug_BootSafetyEvaluation?HighHighSmallHighUnacceptableHighHighSmallLowUnacceptableMediumHighSmallHighAcceptableLowMediumSmallHighAcceptableLowLowBigHighAcceptableLowLowBigLowUnacceptableMediumLowBigLowAcceptablemunotes.in
Page 136
HighMediumSmallHighUnacceptableHighLowBigHighAcceptableLowMediumBigHighAcceptableHighMediumBigLowAcceptableMediumMediumSmallLowAcceptableMediumHighBigHighAcceptableLowMediumSmallLowUnacceptableR1:(Buying-Price = high) ^ (Maintenance_Price = high)^ (Safety =low)->Car_evaluation = unacceptableR2:(Buying-Price = medium) ^( Maintenance_Price = high)^(Lug_Boot= big)->Car_evaluation = acceptableR3:(Buying-Price = high) ^( Lug_Boot = big)->Car_evaluation =unacceptableR4:(Maintenance_Price= medium) (Lug_Boot = big)^ (Safety = high)->Car_evaluation = acceptable9.1.3Application of Rule basedclassifier:A record x is said to be covered by a rule, if all the attributes present in therecord satisfy the antecedent of the rule.CarBuying_PriceMaintenance_PriceLug_BootSafetyEvaluation?1HighHighSmalllow?2mediumHighbigLow?3Highmediumbighigh?4Highmediumsmalllow?Car 1 triggers rule R1=>unacceptableCar 2 triggers rule R2 =>acceptableCar 3 triggers both R3 and R4Car 4 triggers none of the rules9.1.4Characteristicsof Rule basedclassifier:1. Mutually Exclusive rules-Rule based Classifier comprises of mutuallyexclusive rules if the rules are independent of each other. Each and everyrecord is covered by at most one rule.Solution:-Arrange rules in the orderArrangement of Rules in the Order:Rules are assigned a priority and based on this they are arranged and ranksare associated. When a test record is given as a input to the classifier, alabel of the class with highest priority triggered rule is assigned. If the testrecord does not trigger any of the rule then a default class is assigned.Rule-based ordering:-In rule based ordering individual rules are rankedbased on their quality.munotes.in
Page 137
R1:(Buying-Price = high) ^ (Maintenance_Price = high)^ (Safety = low)->Car_evaluation = unacceptableR2:(Buying-Price = medium) ^ ( Maintenance_Price = high) ^(Lug_Boot= big)->Car_evaluation = acceptableR3:(Buying-Price = high) ^( Lug_Boot = big)->Car_evaluation =unacceptableR4:(Maintenance_Price= medium) (Lug_Boot = big)^ (Safety = high)->Car_evaluation = acceptableCarBuying_PriceMaintenance_PriceLug_BootSafetyEvaluation?3HighMediumbighigh?Car 3 triggers rule R3 first =>unacceptableClass-based ordering:-In class based ordering rules which belongs to thesame class are grouped together.R2:(Buying-Price = medium) ^( Maintenance_Price = high)^(Lug_Boot= big)->Car_evaluation = acceptableR4:(Maintenance_Price=medium) (Lug_Boot = big)^ (Safety = high)->Car_evaluation = acceptableR1:(Buying-Price = high) ^ (Maintenance_Price = high)^ (Safety =low)->Car_evaluation = unacceptableR3: (Buying-Price = high) ^( Lug_Boot = big)->Car_evaluation =unacceptableCar3 triggers rule R4 =>acceptable2. Exhaustive rules:It said to has a complete coverage for the rule basedClassifierif it accountsfor eachdoableattributevaluescombination.Everyinstanceis roofedwitha minimum ofone rule.Solution:-Use adefault class9.1.5RuleBuilding usingCoverageAlgorithm:CoverageAlgorithm can be used to extract IF-THEN rules form thetraining data. We do not require to generate a decision tree first. In thisalgorithm, each rule for a given class covers many of the tuples of thatclass.As per the general strategy the rules arelearned one at a time. For eachtime rules are learned, a tuple covered by the rule is removed and theprocess continues for the rest of the tuples. This is because the path toeach leaf in a decision tree corresponds to a rule.munotes.in
Page 138
Note− The Decision treeinduction can be considered as learning a set ofrules simultaneously.The Following is the sequential learning Algorithm where rules arelearned for one class at a time. When learning a rule from a class Ci, wewant the rule to cover all the tuples fromclass C only and no tuple formany other class.Algorithm:CoverageInput:D, a data set class-labeled tuples,Att_vals, the set of all attributes and their possible values.Output: A Set of IF-THEN rules.Method:Rule_set={ }; // initial set of rules learned is emptyfor each class c dorepeatRule = Learn_One_Rule(D, Att_valls, c);remove tuples covered by Rule form D;until termination condition;Rule_set=Rule_set+Rule; // add a new rule to rule-setend forreturnRule_Set;9.2 RULELEARNING FOR SUBGROUP DISCOVERY9.2.1Subgroup discovery:•Imagine you want to market the new version of a successful product.You have a database of people who have been sent information aboutthe previous version, containing all kinds of demographic, economicand social information about those people, as well as whether or notthey purchased the product.•If you were to build a classifier or ranker to find the most likelycustomers for your product, it is unlikely to outperform the majorityclass classifier (typically, relatively few people will have bought theproduct).•However, what you are really interested in is finding reasonably sizedsubsets of people with a proportion of customers that is significantlyhigher than in the overall population. Youcan then target those peoplein your marketing campaign, ignoring the rest of your database.munotes.in
Page 139
•Subgroups are subsets of the instance space–or alternatively,mappings gˆ : X→ {true,false} that are learned from a set of labelledexamples (xi,label(xi)),where label: X→ C is the true labellingfunction.•A good subgroup is one whose class distribution is significantlydifferent from the overall population. This is by definition true for puresubgroups, but these are not the only interesting ones.•Consider small dolphins data set with positive examplesp1: Length = 3Gills = noBeak = yesTeeth = manyp2: Length = 4Gills = noBeak = yesTeeth = manyp3: Length = 3Gills = noBeak = yesTeeth = fewp4: Length = 5Gills = noBeak = yesTeeth = manyp5: Length = 5Gills = noBeak = yesTeeth = fewand negativesn1: Length = 5Gills = yesBeak = yesTeeth = manyn2: Length = 4Gills = yesBeak = yesTeeth = manyn3: Length = 5Gills = yesBeak = noTeeth = manyn4: Length = 4Gills = yesBeak = noTeeth = manyn5: Length = 4Gills = noBeak = yesTeeth = few•For instance, one could argue that the complement of a subgroup is asinteresting as the subgroup itself: in our dolphin example, the conceptGills = yes, which covers four negatives and no positives, could beconsidered as interesting as its complement Gills = no, which coversone negative and all positives.•This means that we need to move away from impurity-basedevaluation measures.9.2.2Working ofRule learning for subgroup discovery:Classification identifies a rule such that it forms a pure class. From atraining set rules are used to identify pure subset of examples means onlypositive or negativeinstances.This is called subgroups. Based on rules wehave to find subset of instance space. Along with finding subgroups weneed to find interesting patterns.Means if we apply one rule we get oneoutput and applying second rule we get other output.If subgroup isinteresting pattern then its complementary is alsoaninteresting pattern.Eg Gills=yes it identifies 4 negative and 0 positive. If we are assumingthat it is aninteresting pattern then its complementary Gills=No. Herewewillassume thatit will identify5 positive,but it maynot be true. It mayidentify5 positiveand 1 negative. means Complementary of rule is notcomplementaryof subgroup.munotes.in
Page 140
9.2.3Measures in subgroupdiscovery:If we drawagraph considering positives alongyaxisand negatives alongx axis. suppose there are differentsubgroupsare formed. Any subgroupthat is present on diagonalhave equal proportion of positivesto overallpopulation.1.Precision:Any subgroupthat is present ondiagonalhave samepropotionof positivesto overall positives.Exampleswhich are presenton above diagonal has morenumber ofpositivesandexamples thatarebelowdiagonalhave less numberof positives.Precision =2.Aveargerecall:The subgroupsthat arepresenton diagonal haveaveargerecall of 0.5Average recall =If above value is positive means the exmaples are present on abovediagonal else below diagonal.3.Weighted relative accuracy:Hereaccuracyis assignedto eachsubgroup and one subgroup iscompared to other one.4.Weighted relative accuracy = pos*neg (true positive rate–falsepositive rate)9.2.4 Weighted Coverage Algorithm for subgroup discovery:The main difference inrule basedclassification and subgroup discover isoverlapping rules.In classification rulesarenotoverlapped. Wehave seencoverage algorithm in section9.1. Whenevernew rule is generated,weidentify the instances thataresatisfied bytherule then we remove thoseexamples. In subgroup discovery the rules areoverlapped. Whenever newrule is generated then examplesarenot directly removed. For thispurposeweareusingweighted coverage algorithm.Herefor each exampleweareassigning weightas 1. Whenever ruleis generated and exampleis coveredbyrule then weight of that exampleis halved. When it is 0 then weremove thatexamplefrom set.Algorithm: WeightedcoverageInput:Data set DOutput: Rule set RR= nullWhile some examples in D,have weight=1 dor= LearnRule(D)//Coverage algorithm ofclassificationappend r to the end Rdecrease weights of examamplecovered by rreturn R
munotes.in
Page 141
9.3 INTRODUCTION TO ASSOCIATION RULE MINING•Associations are things that usually occur together. For example, inmarket basket analysis we are interested in items frequently boughttogether. An example of an association rule isif milk then bread,stating that customers who buymilktend to also buybread.•In abakery shopmost clients will buycake. This means that there willbe many frequent item sets involvingcake, such as {candle,cake}.•This might suggest the construction of an association ruleifcandlethencake–however, this is predictable given that {cake} isalready a frequent item set (and clearly at least as frequent as{candle,cake}).•Of more interest would be the converse ruleifcakethencandlewhichexpresses that a considerable proportion of the people buyingcakealsobuy acandle.9.3.1Apriori Algorithm:The Apriori algorithm uses frequent itemsets to generate associationrules, and it is designed to work on the databases that contain transactions.With the help of these association rule, itdetermines how strongly or howweakly two objects are connected. This algorithm uses abreadth-firstsearchandHash Treeto calculate the itemset associations efficiently. Itis the iterative process for finding the frequent itemsets from the largedataset.This algorithm was given by theR. AgrawalandSrikantin theyear1994. It is mainly used formarket basket analysisand helps to findthose products that can be bought together. It can also be used in thehealthcare field to find drug reactions forpatients.9.3.1.1What is Frequent Itemset?:Frequent itemsets are those items whose support is greater than thethreshold value or user-specified minimum support. It means if A & B arethe frequent itemsets together, then individually A and B should alsobethe frequent itemset.Suppose there are the two transactions: A= {1,2,3,4,5}, and B= {2,3,7}, inthese two transactions, 2 and 3 are the frequent itemsets.9.3.1.2Steps for Apriori Algorithm:Below are the steps for the apriori algorithm:Step-1:Determine the support of itemsets in the transactional database,and select the minimum support and confidence.munotes.in
Page 142
Step-2:Take all supports in the transaction with higher support valuethan the minimum or selected support value.Step-3:Find all the rulesof these subsets that have higher confidencevalue than the threshold or minimum confidence.Step-4:Sort the rules as the decreasing order of lift.9.3.1.3Apriori Algorithm Working:We will understand the apriori algorithm using an example andmathematical calculation:Example:Suppose we have the following dataset that has varioustransactions, and from this dataset, we need to find the frequent itemsetsand generate the association rules using the Apriori algorithm givenminimum support 2 and minimumconfidence 50%TIDITEM SETST1A,BT2B,DT3B,CT4A,B,DT5A,CT6B,CT7A,CT8A,B,C,ET9A,B,CSolution:Step-1: Calculating C1 and L1:•In the first step, we will create a table that contains support count (Thefrequency of each itemsetindividually in the dataset) of each itemsetin the given dataset. This table is called theCandidate set or C1.ItemsetSupport_countA6B7C5D2E1Now, we will take out all the itemsets that have the greater support countthat the MinimumSupport (2). It will give us the table for thefrequentitemset L1.Since all the itemsets have greater or equal support count than theminimum support, except the E, so E itemset will be removed.munotes.in
Page 143
ItemsetSupport_countA6B7C5D2Step-2:Candidate Generation C2, and L2:In this step, we will generate C2 with the help of L1. In C2, we will createthe pair of the itemsets of L1 in the form of subsets.After creating the subsets, we will again find the support count from themain transactiontable of datasets, i.e., how many times these pairs haveoccurred together in the given dataset. So, we will get the below table forC2:ItemsetSupport_count{A,B}4{A,C}4{A,D}1{B,C}4{B,D}2{C,D}0Again, we need to compare the C2Support count with the minimumsupport count, and after comparing, the itemset with less support countwill be eliminated from the table C2. It will give us the below table for L2ItemsetSupport_count{A,B}4{A,C}4{B,C}4{B,D}2Step-3:Candidate generation C3, and L3:For C3, we will repeat the same two processes, but now we will form theC3 table with subsets of three itemsets together, and will calculate thesupport count from the dataset. It will give the below table:ItemsetSupport_count{A,B,C}2{B,C,D}1{A,C,D}0{A,B,D}0Now we will create the L3 table. As we can see from the above C3 table,there is only one combination of itemset that has support count equal tothe minimum support count. So, the L3 will have only one combination,i.e.,{A, B, C}.munotes.in
Page 144
Step-4:Finding the association rules for the subsets:To generate the association rules, first, we will create a new table with thepossible rules from the occurred combination {A, B.C}. For all the rules,we will calculate the Confidence using formulasup( A^B)/A.Aftercalculating the confidence value for all rules, we will exclude the rulesthat have less confidence than the minimum threshold(50%).Consider the below table:RulesSupportConfidenceA ^B→C2Sup{(A ^B) ^C}/sup(A ^B)= 2/4=0.5=50%B^C→A2Sup{(B^C) ^A}/sup(B ^C)= 2/4=0.5=50%A^C→B2Sup{(A ^C) ^B}/sup(A ^C)= 2/4=0.5=50%C→ A^B2Sup{(C^( A ^B)}/sup(C)= 2/5=0.4=40%A→B^C2Sup{(A^( B ^C)}/sup(A)= 2/6=0.33=33.33%B→B^C2Sup{(B^( B ^C)}/sup(B)= 2/7=0.28=28%As the giventhreshold or minimum confidence is 50%, so the first threerulesA ^B→ C, B^C → A, and A^C → Bcan be considered as thestrong association rules for the given problem.9.3.2Association Rule Mining:Association rule learning is a type of unsupervisedlearning technique thatchecks for the dependency of one data item on another data item and mapsaccordingly so that it can be more profitable. It tries to find someinteresting relations or associations among the variables of dataset. It isbased on different rules to discover the interesting relations betweenvariables in the database.The association rule learning is one of the very important conceptsofmachine learning, and it is employed inMarket Basket analysis, Webusage mining, continuous production, etc.Here market basket analysisis a technique used by the various big retailer to discover the associationsbetween items. We can understand it by taking an example of amunotes.in
Page 145
supermarket, as ina supermarket, all products that are purchased togetherare put together.For example, if a customer buys bread, he most likely can also buy butter,eggs, or milk, so these products are stored within a shelf or mostly nearby.ForAssociation rule learningApriori algorithm can be used.9.3.2.1How does Association RuleMiningwork?Association rule learning works on the concept of If and Else Statement,such as if A then B.Here the If element is calledantecedent, and then statement is calledasConsequent. These types of relationships where we can find out someassociation or relation between two items is knownas single cardinality. Itis all about creating rules, and if the number of items increases, thencardinality also increases accordingly.So, to measure the associationsbetween thousands of data items, there are several metrics. These metricsare given below:oSupportoConfidenceoLiftLet's understand each of them:Support:Support is the frequency of A or how frequently an item appears inthedataset. It is defined as the fraction of the transaction T that contains theitemset X. If there are X datasets, then for transactions T, it can be writtenas:Support(X) = Freq(X) / TConfidence:Confidence indicates how often the rule has beenfound to be true. Or howoften the items X and Y occur together in the dataset when the occurrenceof X is already given. It is the ratio of the transaction that contains X andY to the number of records that contain X.Confidence = Freq( X,Y) / Freq(X)Lift:It is the strength of any rule, which can be defined as below formula:Lift = Support (X,Y) / ( Support(X) * Support(Y))It is the ratio of the observed support measure and expected support if Xand Y are independent of each other. It has three possible values:•IfLift= 1: The probability of occurrence of antecedent and consequentis independent of each other.munotes.in
Page 146
•Lift>1: It determines the degree to which the two itemsets aredependent to each other.•Lift<1: It tells us that one item is a substitute forother items, whichmeans one item has a negative effect on another.9.3.2.2 Association RuleMiningusing Apriori Algorithm:This algorithm uses frequent datasets to generate association rules. It isdesigned to work on the databases that containtransactions. Thisalgorithm uses a breadth-first search and Hash Tree to calculate theitemset efficiently.It is mainly used for market basket analysis and helps to understand theproducts that can be bought together. It can also be used in the healthcarefield to find drug reactions for patients.9.3.2.3Applications of Association RuleMining:It has various applications in machine learning and data mining. Below aresome popular applications of association rule learning:oMarket Basket Analysis:It isone of the popular examples andapplications of association rule mining. This technique is commonlyused by big retailers to determine the association between items.oMedical Diagnosis:With the help of association rules, patients can becured easily, as it helps in identifying the probability of illness for aparticular disease.oProtein Sequence:The association rules help in determining thesynthesis of artificial Proteins.oIt is also used for theCatalog DesignandLoss-leader Analysisandmany more other applications.SUMMARYIn this chapter we have seen rule based model. In rule based model rulesare defined in the form of if-then. Rules are used to identify the label orclass of a given example. In classification when a new rule is generatedthen theexample that covers this rule is eliminated from training set. Insubgroup discovery we identify a pure class means all examples arecorresponding to eithrt positive label or negative label. In subgroupdiscovery when new rule is generated the example which covers this ruleis not directly eliminated. Weight of this example is halved and thisprocess is repeated till it becomes zero. When a weight becomes zero thenthat example is eliminated from training set. This algorithm is called asweighted coveragealgorithm. In this chapter we have also seenassociation rule mining. In this we find the frequent item set using apriorialgorithm. Then using these frequent item set rules are defined.munotes.in
Page 147
UNIT END EXCERCISES1.Find frequent item set and associationrules for minimum supportcount 2 and minimum confidence is 60%TIDItemsT1i1,i2,i5T2i2,i4T3i2,i3T4i1,i2,i4T5i1,i3T6i2,i3T7i1,i3T8i1,i2,i3,i5T9i1,i2,i3LIST OF REFERENCES•https://www.geeksforgeeks.org/rule-based-classifier-machine-learning/•Using rule learning for subgroupdiscovery,BrankoKavšek(2004)Using rule learning for subgroupdiscovery.• ###! !""# • ###! ! •https://medium.com/analytics-vidhya/association-rule-mining-7f06401f0601\munotes.in
Page 148
10TREE BASED MODELSUnit Structure10.0Objectives10.1Introduction to tree model10.2Decision Trees10.2.1. Where Decision Tree is applicable?10.2.2. Decision Tree Representation10.2.3. Attribute Selection Measure10.2.4. AvoidOver fittingin classification (Tree pruning)10.2.5. Strengths of Decision Tree Method10.2.6. Weakness of Decision Tree Method10.2.7 Constructing Decision Trees10.2.8 Example of Classification Tree Using ID310.2.9 Example of Decision Tree Using Gini Index10.3Ranking and Probability Estimation Trees10.3.1 Choosing alabelingbased on costs10.4RegressionTrees10.4.1 Example of Regression Tree10.5Clustering TreesSummaryUnit End QuestionsList of References10.0 OBJECTIVES•Gain conceptual picture of decision trees, regression trees andclustering trees.•Learn to predict the class using decision tree.•Learn to predict value of continuous variable by using regression tree.•Learn to identify the ranking.10.1 INTRODUCTIONTO TREE MODEL•A tree model is a hierarchical structure of conditions, where leafscontaintree outcome.munotes.in
Page 149
•They represent recursive divide-and-conquer strategies.•Tree models are among the most popular models in machine learning,because they are easy to understand and interpret:•Tree models can be seen as a particular type of rule model where theif-parts of the rules are organised in a tree structure.•Both Tree models and Rule models use the same approach tosupervised learning.•The approach can be summarised in two strategies: we could first findthe body of the rule (the concept) that covers a sufficientlyhomogeneous set of examples and then find a label to represent thebody.•Alternately, we could approach it from the other direction, i.e., firstselect a class we want to learn and thenfind rules that cover examplesof the class.10.2DECISION TREES•Decision trees are very strong and most suitable tools for classificationand prediction. The attractiveness of decision trees is due to the factthat, in contrast to neural network, decision trees represent rules.•Rules are represented using linguistic variables so that userinterpretability may be achieved. By comparing the records with therules one can easily find a particular category to which the recordbelongs to.•In some applications, the accuracy of a classification or prediction isthe only thing that matters in such situations we do not necessarily carehow or why the model works. In other situations, the ability to explainthe reason for a decision is crucial, in marketing one has described thecustomer segments to marketing professionals, so that they can usethis knowledge to start a victorious marketing campaign.•This domain expert must acknowledge and approve this discoveredknowledge and for this we need good descriptions. There are a varietyof algorithms for building decision trees that share the desirablequality of interpretability (ID3).10.2.1Where Decision Tree is applicable? :Decision tree method is mainly used for the tasks that possess thefollowing properties.•The tasks or the problems in which the records are represented byattribute-value pairs.Records are represented by a fixed set of attribute and their valueExample: For ‘temperature’ attribute the value is ‘hot’.munotes.in
Page 150
When there are small numbers of disjoint possible values for eachattribute, then decision tree learning becomes very simple.Example: Temperature attribute takes three values as hot, mild andcold.Basic decision tree algorithm may be extended to allow real valuedattributes as well.Example: we can define floating point temperature.•An application where the target function takes discrete output values.In Decision tree methods an easiest situation exists, if there areonly two possible classes.Example: Yes or NoWhen there are more than two possible output classes thendecision tree methods can also be easily extended.A more significant extension allows learning target functions withreal valued outputs, although the application of decision trees inthis area is not frequent.•The tasks orthe problems where the basic requirement is thedisjunctive descriptors.Decision trees naturally represent disjunctive expressions.•In certain cases where the training data may contain errors.Decision tree learning methods are tolerant to errors that canbe aclassification error of training records or attribute-valuerepresentation error.•The training data may be incomplete as there are missing attributevaluesAlthough some training records have unknown values, decisiontree methods can be used.10.2.2Decision Tree Representation:Decision tree is a classifier which is represented in the form of a treestructure where each node is either a leaf node or a decision node.•Leaf node represents the value of the target or response attribute(class) of examples.•Decision node represents some test to be carried out on a singleattribute-value, with one branch and sub tree for each possibleoutcome of the test.Decision tree generates regression or classification models in the form of atree structure. Decisiontree divides a dataset into smaller subsets withincrease in depth of tree. The final decision tree is a tree withdecisionnodesandleaf nodes. A decision node (e.g., Buying_Price) has two ormore branches (e.g., High, Medium and Low). Leaf node (e.g.,Evaluation)munotes.in
Page 151
shows a classification or decision. The topmost decision node in a treewhich represents the best predictor is calledroot node. Decision trees canbe used to represent categorical as well as numerical data.1.Root Node:It represents entire setof records or dataset and this isagain divided into two or more similar sets.2.Splitting:Splitting procedure is used to divide a node into two or moresub-nodes depending on the criteria.3.Decision Node:A decision node is asub-node which is divided intomore sub-nodes.4.Leaf/ Terminal Node:Leaf node is a node which is not further dividedor a node with no children.5.Parent and Child Node:Parent node is anode, which is split into sub-nodes and sub-nodes are called as child of parent node.6.Branch / Sub-Tree:A branch or sub-tree is a sub part of decision tree.7.Pruning:Pruning method is used toreduce the size of decision trees byremoving nodes.
10.2.3Attribute Selection Measure:1)Gini Index:•All attributes are assumed tobe continuous valued.•It is assumed that there exist several possible split values for eachattribute.•Gini index method can be modified for categorical attributes.•Gini is used in Classification and Regression Tree (CART).If a data set T contains example from n classes, gini index, gini (T) isdefined as,Gini (T) =1-(Eq.1)In the above equationrepresents the relative frequency of class j in T.After splitting T into two subsets T1and T2 with sizes N1 and N2,giniindex of split data is,(-&%%$"(-$"#,.$'")$ ! +.,"((+AcceptableUnacceptableAcceptableUnacceptable$"# $,&
munotes.in
Page 152
Ginisplit(T) =(Eq.2)The attribute with smallest ginisplit(T) is selected to split the node.2)Information Gain (ID3):Inthis method all attributes are assumed to be categorical. The method canbe modified for continuous valued attributes. Here we select the attributewith highest information gain.Assume there are 2 classes P and N. Let the set of records S contain precords of class P and n records of N.The amount of information required to decide if a random record in Sbelongs to P or N is defined as,I (p, n) =log2log2(Eq.3)Assume that using attribute A, a set S will be partitioned in to sets{s1, s2,----sk}If Sihas pirecords of P and nirecords of N, the entropy or the expectedinformation requiredto classify objects in all subtrees Siis,E (A) =I (pi, ni)(Eq.4)Entropy (E):-Expected amount of information (in bits) needed to assign aclass to a randomly drawn object in S under the optimal shortest lengthcode.Gain (A):-Measures reduction in entropy achieved because of split.Choose split that achieves most reduction (maximum Gain).Gain (A) = I (p, n)-E (A)(Eq.5)10.2.4. Avoid Overfitting inclassification (Tree pruning):•The generated tree may overfit the training data.If there are too many branches then some may reflect anomaliesdue to noise or outliers.Overfitting result in poor accuracy for unseen samples.•There are two approaches to avoid overfitting, prune the tree so that itis not too specific.Prepruning ( prune while building tree):-Stop tree construction early do not divide a nodeif this wouldresult inthe goodness measure falling below threshold.Postpruning (prune after building tree):-
munotes.in
Page 153
Fully constructed tree get a sequence of progressively pruned trees.10.2.5. Strengths of Decision Tree Method:-•Able to generate understandable rules•Performs classification without requiring much computation.•Able to handle both continuous and categorical variables.•Decision tree clearly indicates which fields are most important forprediction or classification.10.2.6. Weakness ofDecision Tree Method:-•Not suitable for prediction of continuous attribute•Perform poorly with many class and small data•Computationally expensive to train10.2.7Constructing Decision Trees:The ID3 algorithm starts with the original setSas the root node. On eachiteration of the algorithm, it iterates through every unused attribute of thesetSand calculates theentropy(orinformation gain) of that attribute.Algorithm next select the attribute which has the smallest entropy (orlargest information gain) value. The setSis then divided by the chosenattribute (e.g. Income is less than20 K , Income is between 20 K and 40K, Income is greater than 40 K) to produce subsets of the data. Thealgorithm is recursively called for each subset, considering the attributeswhich are not selected before.The stopping criteria for recursion can beone of these situations:•When all records in the subset belongs to the same class (+ or-), thenthe node is converted into a leaf node and labeled with the class of therecords.•When we have selected all the attributes , but the records still do notbelong to the same class (some are + and some are-), then the node isconverted into a leaf node and labeled with the most frequent class ofthe records in the subset•When there are no records in the subset, this is due to the non coverageof a specific attribute value for the record in the parent set, for exampleif there was no record with income = 40 K. Then a leaf node isgenerated, and labeled with the most frequent class of the record in theparent set.Decision tree is generated with each non-terminalnode representing theselected attribute on which the data was split, and terminal nodesrepresenting the class label of the final subset of this branch.munotes.in
Page 154
SummaryEntropy of each and every attribute is calculated using the data set1.Divide the set Sinto subsets using the attribute for which the resultingentropy (after splitting) is minimum (or, equivalently, information gainis maximum)2.Make a decision tree node containing that attributeRecurse on subsetsusing remaining attributes.10.2.8Example of Classification Tree Using ID3:Example 1:Suppose we want ID3 to evaluate car database as whether the car isacceptable or not. The target classification is “Should we accept car?”which can be acceptable or unacceptable.Buying_PriceMaintenance_PriceLug_BootSafetyEvaluation?HighHighSmallHighUnacceptableHighHighSmallLowUnacceptableMediumHighSmallHighAcceptableLowMediumSmallHighAcceptableLowLowBigHighAcceptableLowLowBigLowUnacceptableMediumLowBigLowAcceptableHighMediumSmallHighUnacceptableHighLowBigHighAcceptableLowMediumBigHighAcceptableHighMediumBigLowAcceptableMediumMediumSmallLowAcceptableMediumHighBigHighAcceptableLowMediumSmallLowUnacceptableClass P: Evaluation = “Acceptable”Class N: Evaluation = “Unacceptable”Total records = 14No. of records with Acceptable =9 and Unacceptable =5I (p, n) =log2log2I (9, 5) =log2log2= 0.940Step 1->1.Compute entropy for Buying_PriceFor Buying_Price = HighPi=2 and ni=3
munotes.in
Page 155
I (pi, ni) = I (2, 3) =log2log2= 0.971Similarly we will calculate I (pi, ni) for Medium and Low.Buying_PricepiniI(pi, ni)High230.971Medium400Low320.971E (A) =I (pi, ni)E(Buying_Price)=I(2,3)+I(4,0)+I(3,2)= 0.694Gain(S, Buying_Price) = I (p, n)-E (Buying_Price) =0.940-0.694 =0.246Similarly, Gain (S, Maintenance_ Price) = 0.029, Gain (S, Lug_Boot) =0.151, Gain (S, Safety) = 0.048Since Buying_Price is the highest we selectBuying_Price as the rootnode.Step2->As attribute Buying_Price at root, we have to decide on remaining treeattribute for High branch.Buying_PriceMaintenance_PriceLug_BootSafetyEvaluation?HighHighSmallHighUnacceptableHighHighSmallLowUnacceptableHighMediumSmallHighUnacceptableHighLowBigHighAcceptableHighMediumBigLowAcceptableNo. Of records with Acceptable =2 and Unacceptable =3I(p,n) =log2log2I(2,3)=log2log2= 0.971Buying_Price$"# $,&(-
munotes.in
Page 156
1. Compute entropy for Maintenance_ PriceMaintenance_PricepiniI(Pi, ni)High020Medium111Low100E(Maintenance_ Price)=I(0,2)+I(1,1)+I(1,0)= 0.4Gain(SHigh, Maintenance_ Price) = I (p, n)-E(Maintenance_ Price) =0.971-0.4 =0.5712. Compute entropy for Lug_BootPi=0 and ni=3I (Pi, ni) = I(0,3)= 0Lug_BootpiniI(Pi, ni)Small030Big200E (Lug_Boot) =I (0,3)+I(2,0)= 0Gain (SHigh, Lug_Boot) =I (p, n)-E(Lug_Boot) =0.971-0 =0.9713. Compute entropy for Safetypi=1 and ni=2I (pi, ni) = I(1,2)= 0.918SafetypiniI(Pi, ni)High120.918Low111E (Safety) =I (1,2)+I(1,1)= 0.951Gain(SHigh, Safety) =I (p, n)-E (Safety) =0.971-0.951 =0.02Since Lug_Boot is the highest we select Lug_Boot as a next node belowHigh branch.,.$'")$ $"# $,&(-Lug_Boot&%%$"#$"
munotes.in
Page 157
Step 3->Consider now only Maintenance_ Price and Safety for Buying_Price =MediumBuying_PriceMaintenance_PriceLug_BootSafetyEvaluation?MediumHighSmallHighAcceptableMediumLowBigLowAcceptableMediumMediumSmallLowAcceptableMediumHighBigHighAcceptableSince for any combination of values of Maintenance_ Priceand Safety,Evaluation? value is Acceptable, so we can directly write down the answeras Acceptable.
Step 4->Consider now only Maintenance_ Price and Safety for Buying_Price =Low
Pi=3 and ni=2I(Pi, ni) = I(3,2)= 0.9701. Compute entropy for SafetySafetypiniI(Pi, ni)High300Low020E (Safety) =I (3, 0) +I (0, 2) = 0Gain (SLow, Safety) =I(p,n)-E(Safety) =0.970-0 =0.970Buying_PriceMaintenance_PriceLug_BootSafteyEvaluation?LowMediumSmallHighAcceptableLowLowBigHighAcceptableLowLowBigLowUnacceptableLowMediumBigHighAcceptableLowMediumSmallLowUnacceptable,.$'")$ $"# $,&(-,"((+&%%$"
munotes.in
Page 158
2.Compute entropy for Maintenance_ PriceMaintenance_PricepiniI(Pi, ni)High000Medium210.918Low111E(Maintenance_ Price)=I(0,0)+I(2,1)+I(1,1)= 0.951Gain (SLow, Maintenance_ Price) =I (p, n)-E (Maintenance_ Price)=0.970-0.951 =0.019Since, Safety is the highest we select Safety below Low branch.
Now we will check the value of ‘Evaluation?’ from the database, for allbranches,Buying_Price=High and Lug_Boot =Small-> Evaluation? = UnacceptableBuying_Price=High and Lug_Boot =Big-> Evaluation? =AcceptableBuying_Price=Low and Safety =Low-> Evaluation? = UnacceptableBuying_Price= Low and Safety = High-> Evaluation? = AcceptableFinal Decision Tree is,.$'")$ $"# $,&(-,"((+&%%$"
(-&%%$"(-$"#,.$'")$ ! +.,"((+AcceptableeUnacceptableAcceptableUnacceptable$"# $,&$"#Safety(-Acceptable
munotes.in
Page 159
10.2.9Example of Decision Tree Using Gini Index:Example1:Create a decision tree using Gini Index to classify following dataset.Sr.No.IncomeAgeOwn Car1Very HighYoungYes2HighMediumYes3LowYoungNo4HighMediumYes5Very HighMediumYes6MediumYoungYes7HighOldYes8MediumMediumNo9LowMediumNo10LowOldNo11HighYoungYes12MediumOldNoIn this example there are two classes Yes and No.No. Of records for Yes = 7No. Of records for No = 5Total No. Of records = 12Now we will calculate Gini of the complete database as,Gini (T) == 0.48Next we will calculate Split for all attributes, i.e. Income and Age.Income->Split===0.1125Age->Split =
munotes.in
Page 160
==0.4375Split value of Income is smallest, so we will select Income as root node.
From the database we can see that,Own Car = Yes for Income = Very High, so we can directly write down‘Yes’ for Very High branch.Own Car = Yes for Income = High, so we can directly write down ‘Yes’for High branch.Own Car = No for Income = Low, so we can directly write down ‘No’ forLow branch.Since Income is taken as root node, now we have to decide on the Ageattribute, so we will take Age as next node below Medium branch.
From the database we can see that,Own Car = Yes for Income= Medium and Age = Young, so we candirectly write down ‘Yes’ for Young branch.Own Car = No for Income = Medium and Age = Medium, so we candirectly write down ‘No’ for medium branch.Own Car = No for Income = Medium and Age = Old, so we can directlywrite down ‘No’ for Old branch.Final Decision Tree is, * *(" '(&
(,'" $,&%'(& ).$"#$"#(- $,&
).$"#$"#(- $,&
munotes.in
Page 161
10.3 RANKING AND PROBABILITY ESTIMATIONTREES•Decision trees divide the instance space into segments, by learningordering onthose segments the decision trees can be turned intorankers.•Decision trees canaccesslocalclass distribution, means how manynumber of +ve and–ve instances are there in each class.•Decision trees are directly used to constructleafordering in anoptimalway in the training data.•Onecommonlyused procedure isempirical probability ,means if +veand–ve instances in a class are equal then give highest priority topositive class(means use ranking).•The ranking obtained from the empirical probabilities in the leaves ofa decisiontree yields a convex ROC curve on the training data.
Figure10.3.1.a Figure10.3.1.b•Consider the tree inabove figure10.3.1.a. Each node is labelled withthe numbers ofpositive and negative examples covered by it: so, forinstance, the root of the treeis labelled with the overall classdistribution (50 positives and 100 negatives),resulting in the trivial
'(& ).$"#$"#(- $,&" (,'" $,&%
munotes.in
Page 162
ranking [50+,100-]. The corresponding one-segmentcoverage curve isthe ascending diagonalfigure10.3.1.b.•Adding split (1) refines this ranking into [30+,35-][20+,65-], resultingin atwo-segment curve.•Adding splits (2) and (3) again breaks up the segment corresponding totheparent into two segments corresponding to the children.•However, the ranking produced by the full tree [15+,3-][29+,10-][5+,62-][1+,25-] is different from the left-to-rightordering of itsleaves, hence we need to reorder the segments of thecoverage curve,leading to the top-most, solid curve. This reordering alwaysleads to aconvex coverage curve•Figure10.3.1.a is anabstract representation of a tree with numbers ofpositive and negative examplescovered in each node. Binary splits areadded to the tree in the order indicated.•Figure10.3.1.b shows after addinga splitto the treehow itwill addnew segments to the coverage curve as indicated bythe arrows. After asplit is added the segments may need reordering, and so only thesolidlines represent actual coverage curves.10.3.1Choosing a labelling based on costs•Assume the training set class ratioclr=50/100 is representative. Wehave achoice of five labellings, depending on the expected cost ratioC=CFN/ CFPofmisclassifying a positive in proportion to the cost ofmisclassifying a negative:+-+-would be thelabelling of choice ifC=1, or more generally if10/29
Page 163
Figure10.3.2.a Figure10.3.2.b•Figure10.3.2.ais used to achieve thelabeling+-++we don’t need theright-most split, which cantherefore be pruned away.•In figure10.3.2.bpruning doesn’t affect the chosen operating point, butit does decrease the ranking performance of the tree.•If we know class distribution in leaves of feature tree then it isconverted in to Rankers, Probability estimation and Classifiers.•Rankers will give simple order to leaves in descending order based onempirical probability•Probability estimation predicts empirical probabilities of each leaveand calculateLaplaceor m estimation to smooth curve•In classifier we have to choose the operating conditions and findoperating point that fits the condition.•First and foremost, I would concentrate on getting good rankingbehaviour,because from a good ranker I can get good classificationand probabilityestimation, but not necessarily the other way round.•I would therefore try to use an impurity measure that isdistribution-insensitive, such as pGini; if that isn’t available and I can’t hackthecode, I would resort to oversampling the minority class to achieve abalanced class distribution.•I would disable pruning and smooth the probability estimates bymeans ofthe Laplace correction (or them-estimate).•Once I know the deployment operation conditions, I would use thesetoselect the best operating point on the ROC curve (i.e., a threshold onthepredicted probabilities, or alabelingof the tree).•(optional) Finally, I would prune away anysub treewhose leaves allhave thesame label.10.4 REGRESSION TREESClassification trees are used to divide the dataset into classes belonging tothe target variable. Mainly the target variable has two classes that can beyes or no. When the target variable type is categorical classification treesare used.
munotes.in
Page 164
In certain applications the target variable is numeric or continuous in thatcase regression trees are used. Let’s take an example of prediction ofprice of a flat. Hence regression trees are used for problems or tasks wherewe want to predict some data instead of classifying the data.Based on the similarity of the data the records are classified in a standardclassification tree. Let’s take an example of an Income tax evades. In thisexample we have two variables, Income and marital status that predict if aperson is going to evade the income tax or not. In our training data if itshowed that 85% of people who are married does not evade the incometax, we split the data here and Marital status becomes a root node in tree.Entropy or Ginny index is used in classification trees.The main basic working of regressiontree is to fit a model. The target orresponse variable does not have classes so a regression model is fit usingeach independent variable to the target variable. Then the data is split atvarious split points for each independent variable. At each splitpoint sumof squared errors (SSE) is calculated by taking the square of the differencebetween predicted and actual value. The criteria for root node is to selectthe node which is having minimum SSE among all split point errors. Thefurther tree is builtusing the recursive procedure.10.4.1Example of Regression Tree:Example 1:Buying_PriceLug_BootSafetyMaintenance_ Price?(in thousand)LowSmallHigh25LowSmallLow30MediumSmallHigh46HighSmallHigh45HighBigHigh52HighBigLow23MediumBigLow43LowSmallHigh35LowBigHigh38HighBigHigh46LowBigLow48MediumSmallLow52MediumBigHigh44HighSmallLow30Standard deviation->A decision tree is built up top down from root node and involvedpartitioning the data into subsets that contain instances with similar values.We use SD to calculate homogeneity of a numerical sample.SD, S== 9.32
munotes.in
Page 165
SD Reduction->It is based on the decrease in SD after a dataset is split on an attribute.Constructing a tree is all about finding attribute that returns highest SDR.Step 1->SD (Maintenance_Price?) = 9.32Step 2->The dataset is then split on the different attribute.SD for each branch iscalculated. The resulting SD is subtracted from SD beforesplit.Maintenance_Price(SD)Buying_PriceLow7.78Medium3.49High10.87SD(Maintenance_Price, Buying_Price)=P(Low)SD(Low)+P(Medium)SD(Medium)+P(High)SD(High)=
= 7.66SDR = SD(Maintenance_ Price)-SD(Maintenance_ Price,Buying_Price)= 9.32-7.66=1.66Maintenance_ Price (SD)Lug_BootSmall9.36Big8.37SD(Maintenance_ Price,Lug_Boot)=P(Small)SD(Small)+P(Big)SD(Big)=
= 8.86SDR= SD(Maintenance_ Price)-SD(Maintenance_ Price, Lug_Boot)=9.32-8.86=0.46Maintenance_ Price (SD)SafetyHigh7.87Low10.59
munotes.in
Page 166
SD(Maintenance_ Price, Safety)=P(High)SD(High)+P(Low)SD(Low)=
= 9.02SDR= SD(Maintenance_ Price)-SD(Maintenance_ Price, Safety)= 9.32-9.02=0.3SDR of Buying_Price is highest so we select Buying_Price as our rootnode.Step 2->Now we will consider the records of ‘High’. For High SD is 7.66 (which isnot less than 50% global SD therefore branch will be splitted.Buying_PriceLug_BootSafetyMaintenance_ Price?(in thousand)HighSmallHigh45HighBigHigh52HighBigLow23HighBigHigh46HighSmallLow30For Buying_Price = High, SD=10.87We will calculate SDR of only Lug_Boot and SafetyMaintenance_ Price (SD)Lug_BootSmall7.5Big12.49SD(High, Lug_Boot)=P(Small)SD(Small)+P(Big)SD(Big)=
= 10.49SDR= SD(High)-SD(Maintenance_ Price, Lug_Boot) = 10.87-10.49=0.38Maintenance_ Price (SD)SafetyHigh3.09Low3.50SD(High, Safety)=P(High)SD(High)+P(Low)SD(Low)=
= 3.25SDR= SD(High)-SD(Maintenance_ Price, Safety)= 10.87-3.25=7.62Buying_Price$"# $,&(-
munotes.in
Page 167
SDR of Safety is highest so we select Safety as next node below Highbranch.
For Buying_Price =High and Safety= High, SD is 3.50 which is less than50% SD of database, so we can directly write down the answer.For Buying_Price =High and Safety= Low, SD is 3.09 which is less than50% SD of database, so we can directly write down the answer.To write down the answer we take average of values of following records,For Buying_Price =High and Safety= High, Maintenance _Price =45+52+46/3= 47.7For Buying_Price =High and Safety= Low, Maintenance _Price =23+30/2= 26.5Step 3->Now we will consider the records of ‘Medium’Buying_PriceLug_BootSafetyMaintenance_ Price? (in thousand)MediumSmallHigh46MediumBigLow43MediumSmallLow52MediumBigHigh44For Buying_Price = Medium SD is 3.49 which is less than 50% SD ofdatabase, so we can directly write down the answer as 46.3. The answer iscalculated by taking the average of values of Maintenance_ Price forMedium records (average of 46, 43, 52, and 44).,.$'")$ $"# $,&(-! +.$"#(- ,.$'")$ $"# $,&(-! +.$"#(- munotes.in
Page 168
Step 4->Now we will consider the records of ‘Low’. For Low SD is 7.78 (which isnot lessthan 50% global SD therefore branch will be splitted.Buying_PriceLug_BootSafetyMaintenance_ Price?(in thousand)LowSmallHigh25LowSmallLow30LowSmallHigh35LowBigHigh38LowBigLow48For Buying_Price= Low, SD=7.78Now only Lug_Boot attribute is remaining, so we can directly takeLug_Boot as a next node below Low branch.To write down the answer we take average of values of following records,For Buying_Price = Low and Lug_Boot = Small, Maintenance_Price =25+30+35/3= 30For Buying_Price = Low and Lug_Boot = Big, values ofMaintenance_Price=38+ 48/2=43.Final Regression Tree is,
10.5 CLUSTERING TREESImagine you are a collector of vintage Hammond tonewheel organs. Youhavebeen monitoring an online auction site, from which you collectedsome dataabout interesting transactions,.$'")$ $"# $,&(-! +.$"#(-
,"((+&%%$"munotes.in
Page 169
ModelConditionLesliePriceResesrveBidsB3ExcellentNo453022T202FairYes609A100GoodNo11813T202GoodNo301M102GoodYes952A100ExcellentNo181515T202FairNo103A100GoodYes19191E112FairNo105.•The means of the three numerical features are (13.3, 8.6,7.9) and theirvariances are (158,101.8,48.8). The average squared Euclideandistanceto the mean is then the sum of these variances, which is 308.6.•For the A100 cluster these vectors are (16,14,9.7) and(12.7,20.7,38.2),with average squared distance to the mean 71.6; forthe T202 cluster theyare (3.3, 0,4.3) and (4.2, 0,11.6), with averagesquared distance 15.8.•Usingthis split we can construct a clustering tree whose leaves arelabeledwith the mean vectors (Figure10.5.1).•A clustering tree learned from the data inexample using Euclideandistance on thenumerical features.
Figure10.5.1SUMMARYIn this chapterwe have seen Tree model whichcan be seen as a particulartype of rule model where the if-parts of the rules are organised in a treestructure.Decision trees are very strong and most suitable tools forclassification and prediction.If we want to predictclass of instances thenwe have to use decision trees whereas if we want to predict the value oftarget variable then we have to use regression tree.Rankers will givesimple order to leaves in descending order based on empirical probability.Probability estimation predicts empirical probabilities of each leave andcalculate Laplace or m-estimation to smooth curve.In classifier we have
munotes.in
Page 170
to choose the operating conditions and find operating point that fits thecondition.Gettinggood ranking behaviour,because from a good ranker.Wecan get good classification and probability estimation, but notnecessarily the other way round.UNITENDQUESTIONS1.Create a decision tree for the attribute “ class” using the respective valuesEyecolourMarriedSexHairlengthclassBrownyesmaleLongFootballBlueyesmaleShortFootballBrownyesmaleLongFootballBrownnofemaleLongNetballBrownnofemaleLongNetballBluenomaleLongFootballBrownnofemaleLongNetballBrownnomaleShortFootballBrownyesfemaleShortNetballBrownnofemaleLongNetballBluenomaleLongFootballBluenomaleShortFootball2.Write short note on Issues in Decision Tree3.For the given data determine the entropy after classification using eachattribute for classification seperatley and find which attribute is taken asdecision attribute for the root by finding information gain with respect toentropy of Temperature as reference attribute.Sr.no.TemperatureWindHumidity1HotWeakHigh2HotStrongHigh3MildWeakNormal4CoolStrongHigh5CoolWeakNormal6MildStrongNormal7MildWeakHigh8HotStrongHigh9MildWeakNormal10HotStrongNormal4.For a Sunburn dataset given below, construct a decision treeNameHairHeightWeightLocationClassSwatiBlondeAverageLightNoYesSunitaBlondeTallAverageYesNoAnitaBrownShortAverageYesNoLataBlondeShortAverageNoYesRadhaRedAverageHeavyNoYesMayaBrownTallHeavyNoNoLeenaBrownAverageHeavyNoNoRinaBlondeShortLightYesNomunotes.in
Page 171
LIST OF REFERENCES•https://www.kdnuggets.com/2019/06/main-approaches-machine-learning-models.html•Foster Provost, pedro Domingos,”TreeInduction for Probability-BasedRanking”,MachineLearning,•SpringerLink,volume52,pages199–215 (2003)•https://www.solver.com/regression-trees•Luke Zappia, Alicia Oshlack, “Clustering trees: a visualization forevaluating clusterings at multiple resolutions”,GigaScience, Volume7, Issue 7, July 2018, giy083,munotes.in
Page 172
172UNIT V11PROBABILISTIC MODELUnit Structure11.0Objective11.1Introduction: Probabilistic Model11.1.2Normal Distribution And Its Geometric Interpretation:11.2Standard Normal Variate11.2.1standard Normal Distribution:11.2.2 Application In Machine Learning:11.2.2.1 Histograms:11.2.3 Feature Analysis:11.2.4 Central Limit Theorem And Normal Distribution:11.2.5 Naïve Bayes Classifier Algorithm11.2.6 Why Is It Called Naïve Bayes:11.2.7 Advantagesof Naïve Bayes Classifier:11.2.8 Applicationsof Naïve Bayes Classifier:11.2.9 Typesof Naïve Bayes Model:11.2.10 Descriptive Learning Maximum Like Hood:11.2.11 Problemof Probability Density Estimation:11.2.12 Maximum Likelihood Estimation:11.2.13 Relationshipto Machine Learning:11.2.14 Probabilistic Model With Hidden Variable:11.3Why Probabilistic Ml Models11.3.1 Objective Functions:11.3.2 Maximization Method:11.3.3 Usageof Em Algorithm–11.3.4 Advantagesof Em Algorithm:11.4Gaussian Mixture Model & Compression Based Model11.4.1 Gaussian Mixture Model:SummaryUnit EndQuestionsReferences11.0OBJECTIVEProbabilistic modelling provides a framework for understanding whatlearning is, and has therefore emerged as one of the principal theoreticaland practical approaches for designing machines that learn from datamunotes.in
Page 173
173acquired through experience. The probabilistic framework, whichdescribes how to represent and manipulate uncertainty about models andpredictions, has a central role in scientific data analysis, machine learning,robotics, cognitive science and artificial intelligence11.1 INTRODUCTION: PROBABILISTIC MODELProbabilistic models are widely used in text mining and applications rangefrom topic modelling, language modelling, document classification andclustering to information extraction. And the examples of models are the:topic modelling methods PLSA and LDA are special applications ofmixture models.The probabilistic model is a model that uses probability theory to modelthe uncertainty in the data. Like terms in topics are modelled bymultinomial distribution and the observations for a random field aremodelled by Gibbs distribution. Which are the major probabilistic modelsbeing Mixture and Models Mixture models are used for clustering datapoints, where each component is a distribution for that cluster, and eachdata point belongs to one cluster with a certain probability.Finite mixturemodels require the user to specify the number of clusters.The typicalapplications of mixture models in text mining include topic models, likePLSA and LDA. Bayesian Nonparametric Models Bayesiannonparametric models refer to probabilistic models with infinite-dimensional parameters, which usually have a stochastic process that isinfinite-dimensional as the prior distribution. Infinite mixture model is onetype of nonparametric model, which can deal with the problem ofselecting the number of clusters for clustering. The Dirichlet processmixture model belongs to the infinite mixture model, and can help todetect the number of topics in topic modelling.Mixture Models:Mixture model is a probabilistic model originallyproposed to address the multi-modal problem in the data, and now isfrequently used for the task of clustering in data mining, machine learningand statistics. defines the distribution of a random variable, which containsmultiple components and each component represents a differentdistribution following the same distribution family with differentparameters. Mixture Model Basic framework of mixture models, Theirvariations and applications in text mining area, and the standard learningalgorithms for them. General Mixture Model Framework In a mixturemodel, given a set of data points, e.g., the height of people in a region,they are treated as an instantiation of a set of random variables. Accordingto the observed data points, the parameters in the mixture model can belearned. For example, we can learn the mean and standard deviation forfemale and male height distributions, if we model height of people as amixturemodel of two Gaussian distributions. Assume n random variablesX1,X2, . . . , xn with observations x1, x2, . . . , xn, following the mixturemodel with K components.munotes.in
Page 174
17411.1.2Normal Distribution and its Geometric Interpretation:Normal Distribution is animportant concept in statistics and the backboneof Machine Learning. A Data Scientist needs to know about NormalDistribution when they work with Linear Models (perform well if the datais normally distributed), Central Limit Theorem, and exploratory dataanalysis. As discovered byCarl Friedrich Gauss,NormalDistribution/Gaussian Distributionis a continuous probabilitydistribution. It has a bell-shaped curve that is symmetrical from the meanpoint to both halves of the curve.
Figure 11.1.2Normal Distribution and its Geometric InterpretationA continuous random variable “x” is said to follow a normal distributionwith parameter µ(mean) and σ(standard deviation), if it’s probabilitydensity function is given by,
This is also called anormal variate.11.2 STANDARD NORMAL VARIATEIf “x” isa normal variable with a mean(µ) and a standard deviation(σ)then,where z = standard normal variate11.2.1Standard Normal Distribution:
munotes.in
Page 175
175The simplest case of the normal distribution, known as the StandardNormal Distribution, has an expected value of µ(mean) 0 and σ(s.d.) 1,and is described by this probability density function,2
21
2z
fz e
2<9F9z Distribution Curve Characteristics:1.The totalarea underthenormal curveis equal to 1.2.It is a continuous distribution.3.It is symmetrical about themean. Each half of the distribution is amirror image of the other half.4.It is asymptotic to the horizontal axis.5.It is unimodal.Area Properties:The normal distribution carries with it assumptions and can be completelyspecified by two parameters: the mean and the standard deviation. If themean and standard deviation are known, you can access every data pointon the curve.The empirical rule is a handy quick estimate of the data's spread given themean and standard deviation of a data set thatfollows a normaldistribution. It states that:●68.26% of the data will fall within 1 sd of the mean(µ±1σ)●95.44% of the data will fall within 2 sd of the mean(µ±2σ)●99.7% of the data will fall within 3 sd of the mean(µ±3σ)●95%—(µ±1.96σ)●99%—(µ±2.75σ)
Figure 11.2.1 slandered normal distribution
munotes.in
Page 176
176Thus, almost all the data lies within3 standard deviations. This ruleenables us to check forOutliersand is very helpful when determining thenormality of any distribution.11.2.2 Application in Machine Learning:In Machine Learning, data satisfying Normal Distribution is beneficial formodel building. It makes math easier. Models like LDA, Gaussian NaiveBayes, Logistic Regression, Linear Regression, etc., are explicitlycalculated from the assumption that the distribution is a bivariate ormultivariate normal. Also,Sigmoid functionswork most naturally withnormally distributed data. Many natural phenomena in the world follow alog-normal distribution, such asfinancial dataandforecasting data.Byapplying transformation techniques, we can convert the data into a normaldistribution. Also, many processes follow normality, such asmanymeasurement errors in an experiment, the position of a particle thatexperiences diffusion,etc.So, it’s better to critically explore the data andcheck for the underlying distributions for each variable before going to fitthe model.Visualization Techniques:11.2.2.1Histograms:It is a kind of bar graph which is an estimate of the probability distributionof a continuous variable. It defines numerical dataand divided them intouniform bins which are consecutive, non-overlapping intervals of avariable.
Figure 11.2.2.1Histograms1
munotes.in
Page 177
177kdeplot:It is a Kernel Distribution Estimation Plotwhich depicts the probabilitydensity function of the continuous or non-parametric data variables i.e. wecan plot for the univariate or multiple variables altogether.
Figure 11.2.2.1Histograms211.2.3 Feature Analysis:Let’s take an example of featurerm(average number of rooms perdwelling)closely resembling a normal distribution.
Though it has some distortion in the right tail, We need to check howclose it resembles a normal distribution. For that, we need to check theQ-Q Plot.When thequantilesof two variables are plotted against each other,then the plot obtained is known as quantile—quantile plot or plot. Thisplot provides a summary of whether the distributions of two variables aresimilar or not with respect to the locations.
munotes.in
Page 178
178Figure11.2.3 plots variableHere we can clearly see that feature is not normally distributed. But itsomewhat resembles it. We can conclude that standardizing(StandardScaler) this feature before feeding it to a model can generate agood result.11.2.4 Central Limit Theorem and Normal Distribution:CLT states that when we add a large number of independent randomvariables to a dataset, irrespective of these variables' original distribution,their normalized sum tends towards a Gaussian distribution. MachineLearning models generally treat training data as a mixofdeterministicandrandomparts. Let thedependent variable(Y)consistsof these parts. Models always want to express thedependentvariables(Y)as some function of severalindependent variables(X). Ifthefunction is sum (or expressed as a sum of some other function) and thenumber ofXis really high, thenYshould have a normal distribution. Hereml models try to express thedeterministic part as a sum of deterministicindependent variables(X):deterministic + random = fun (deterministic(1)) +…+ fun(deterministic(n)) + model errorIf the whole deterministicpart ofYis explained byX,then themodel errordepicts only therandompartand should have a normal distribution. So if the error distribution isnormal, then we may suggest that the model is successful. Else some otherfeatures are absent in the model but have a large enough influence onY,orthe model is incorrect.11.2.5 Naïve Bayes Classifier Algorithm●Naïve Bayes algorithm is a supervised learning algorithm, which isbased onBayes theoremand used for solving classification problems.●It is mainly used intext classificationthat includes a high-dimensionaltraining dataset.
munotes.in
Page 179
179●Naïve Bayes Classifier is one of the simple and most effectiveClassification algorithms which helps in building the fast machinelearning models that can make quick predictions.●It is a probabilistic classifier, which means it predicts on the basisof the probability of an object.●Some popular examples of Naïve BayesAlgorithm arespamfiltration, Sentimental analysis, and classifying articles.11.2.6 Why is it called Naïve Bayes:●The Naïve Bayes algorithm is comprised of two words Naïve andBayes, which can be described as:●Naïve: It is called Naïve because it assumes that the occurrence of acertain feature is independent of the occurrence of other features. Suchas if the fruit is identified on the bases of color, shape, and taste, thenred, spherical, and sweet fruit is recognized as an apple. Hence eachfeature individually contributes to identify that it is an apple withoutdepending on each other.●Bayes: It is called Bayes because it depends on the principle ofBayes'Theorem.●Bayes' Theorem:●Bayes' theorem is also known asBayes' RuleorBayes' law, which isused to determine the probability of a hypothesis with priorknowledge. It depends on the conditional probability.●The formula for Bayes' theorem is given as:Where,P(A|B) is Posterior probability: Probability of hypothesis A on theobserved event B.P(B|A) is Likelihood probability: Probability of the evidence given thatthe probability of a hypothesis is true.P(A) is Prior Probability: Probability of hypothesis beforeobserving theevidence.P(B) is Marginal Probability: Probability of Evidence.Working of Naïve Bayes' Classifier:Working of Naïve Bayes' Classifier can be understood with the help of thebelow example:Suppose we have a dataset ofweather conditionsand correspondingtarget variable "Play". So using this dataset we need to decide thatwhether we should play or not on a particular day according to the weatherconditions. So to solve this problem, we need to follow the below steps:
munotes.in
Page 180
1801.Convert the given dataset into frequency tables.2.Generate Likelihood table by finding the probabilities of givenfeatures.3.Now, use Bayes theorem to calculate the posterior probability.Problem:If the weather is sunny, then the Player should play or not?Solution:To solve this, first consider the below dataset:OutlookPlay0RainyYes1SunnyYes2OvercastYes3OvercastYes4SunnyNo5RainyYes6SunnyYes7OvercastYes8RainyNo9SunnyNo10SunnyYes11RainyNo12OvercastYes13OvercastYesTable 11.2.6 :Frequency table for the Weather ConditionsWeatherYesNoOvercast50Rainy22Sunny32Total105Table 11.2.6 :Frequency table for the Weather Conditions solution11.2.7 Advantages of Naïve Bayes Classifier:●Naïve Bayes is one of the fast and easy ML algorithms to predict aclass of datasets.●It can be used for Binary as well as Multi-class Classifications.●It performs well in Multi-class predictions as compared to the otherAlgorithms.●It is the most popular choice fortext classification problems.●Naive Bayes assumes that all features are independent or unrelated, soit cannot learn the relationship between features.11.2.8 Applications of Naïve Bayes Classifier:●It is used forCredit Scoring.●It is used inmedical data classification.munotes.in
Page 181
181●It can be used inreal-time predictionsbecause Naïve BayesClassifier is an eager learner.●It is used in Text classification such asSpam filteringandSentimentanalysis.11.2.9 Types of Naïve Bayes Model:There are three types of Naive Bayes Model, which are given below:●Gaussian: The Gaussian model assumes that features follow a normaldistribution. This means if predictors take continuous values instead ofdiscrete, then the model assumes that these values are sampled fromthe Gaussian distribution.●Multinomial: The Multinomial Naïve Bayes classifier is used whenthe data is multinomial distributed. It is primarily used for documentclassification problems, it means a particular document belongs towhich category such as Sports, Politics, education, etc.The classifier uses the frequency of words for the predictors.●Bernoulli: The Bernoulli classifier works similar to the Multinomialclassifier, but the predictor variables are the independent Booleansvariables. Such as if a particular word is presentor not in a document.This model is also famous for document classification tasks.11.2.10 Descriptive Learning maximum like hood:Density estimation is the problem of estimating the probability distributionfor a sample of observations from a problem domain. There are manytechniques for solving density estimation, although a common frameworkused throughout the field of machine learning is maximum likelihoodestimation. Maximum likelihood estimation involves defining a likelihoodfunction for calculatingthe conditional probability of observing the datasample given a probability distribution and distribution parameters. Thisapproach can be used to search a space of possible distributions andparameters. This flexible probabilistic framework also provides thefoundation for many machines learning algorithms, including importantmethods such as linear regression and logistic regression for predictingnumeric values and class labels respectively, but also more generally fordeep learning artificial neural networks.●Maximum Likelihood Estimation is a probabilistic framework forsolving the problem of density estimation.●It involves maximizing a likelihood function in order to find theprobability distribution and parameters that best explain the observeddata.●It provides a framework for predictive modelling in machine learningwhere finding model parameters can be framed as an optimizationproblem.Descriptive Learning maximum like hood divided into three parts:1.Problem of Probability Density Estimationmunotes.in
Page 182
1822.Maximum Likelihood Estimation3.Relationship to Machine Learning11.2.11 Problem of Probability Density Estimation:A common modelling problem involves how to estimate a joint probabilitydistribution for a dataset.For example, given a sample of observation(X) from a domain (x1, x2, x3,…,xn), where each observation is drawn independently from the domainwith the same probability distribution (so-called independent andidentically distributed, i.i.d., or close to it).Density estimation involves selecting aprobability distribution functionand the parameters of that distribution that best explain the jointprobability distribution of the observed data (X).●How do you choose the probability distribution function?●How do you choose the parameters for the probability distributionfunction?This problem is made more challenging as sample (X) drawn from thepopulation is small and has noise, meaning that any evaluation of anestimated probability density function and its parameters will have someerror.●There are many techniques for solving this problem, although twocommon approaches are:●Maximum a Posteriori (MAP), a Bayesian method.●Maximum Likelihood Estimation (MLE), frequentist method.●The main difference is that MLE assumes that all solutions are equallylikely beforehand, whereas MAP allows prior information about theform of the solution to be harnessed.●In this post, we will take a closer look at the MLE method and itsrelationship to applied machine learning.11.2.12 Maximum Likelihood Estimation:Onesolution to probability density estimation is referred to as MaximumLikelihood Estimation, or MLE for short.Maximum Likelihood Estimationinvolves treating the problem as anoptimization or search problem, where we seek a set of parameters thatresultsin the best fit for the joint probability of the data sample (X).First, it involves defining a parameter calledthetathat defines both thechoice of the probability density function and the parameters of thatdistribution. It may be a vector of numerical values whose values changesmoothly and map to different probability distributions and theirparameters.munotes.in
Page 183
183In Maximum Likelihood Estimation, we wish to maximize the probabilityof observing the data from the joint probability distribution given aspecificprobability distribution and its parameters, stated formally as:●P(X | theta)This conditional probability is often stated using the semicolon (;) notationinstead of the bar notation (|) becausethetais not a random variable, butinstead an unknown parameter. For example:●P(X ; theta)or●P(x1, x2, x3, …, xn ; theta)This resulting conditional probability is referred to as the likelihood ofobserving the data given the model parameters and written using thenotationL()to denote thelikelihood function. For example:●L(X ; theta)The objective of Maximum Likelihood Estimation is to find the set ofparameters (theta) that maximize the likelihood function, e.g. result in thelargest likelihood value.●maximize L(X ; theta)29 75B IBD57? H<9 7CB8=H=CB5@ DFC656=@=HM75@7I@5H986MH<9@=?9@=
Page 184
184model and model parameters is referred to as a modelling hypothesish,and the problem involves findinghthat best explains the dataX.●P(X ; h)We can, therefore, find the modelling hypothesis that maximizes thelikelihood function.●maximize L(X ; h)Or, more fully:●maximize sum i to n log(P(xi ; h))This provides the basis for estimating the probability density of a dataset,typically used in unsupervised machine learning algorithms.11.2.14 Probabilistic model with hidden variable:Mathematics is the foundation of Machine Learning, and its branches suchas Linear Algebra, Probability, and Statistics can be considered as integralpartsof ML. As a Computer Science and Engineering student, one of thequestions I had during my undergraduate days was in which ways theknowledge that was acquired through math courses can be applied to MLand what are the areas of mathematics that play a fundamental role in ML.I believe this is a common question among most of the people who areinterested in Machine Learning. Therefore, I decided to write a blog serieson some of the basic concepts related to “Mathematics for MachineLearning”. In this series, my intention is to provide some directions intowhich areas to look at and explain how those concepts are related to ML. Iam not going deep into the concepts and I believe there are a lot ofresources with quite good examples that explain each of theseconcepts ina detailed manner.As the first step, I would like to write about the relationship betweenprobability and machine learning. In machine learning, there areprobabilistic models as well as non-probabilistic models. In order to havea better understanding of probabilistic models, the knowledge about basicconcepts of probability such as random variables and probabilitydistributions will be beneficial. I will write about such concepts in my nextblog. However, in this blog, the focus will be on providing some idea onwhat are probabilistic models and how to distinguish whether a model isprobabilistic or not.What are Probabilistic Machine Learning Models?In order to understand what is a probabilistic machine learning model,let’s consider a classification problem with N classes. If the classificationmodel (classifier) is probabilistic, for a given input, it will provideprobabilities for each class (of the N classes) as the output. In other words,a probabilistic classifier will provide a probability distribution over the Nclasses. Usually, the class with the highest probability is then selected asthe Class for which the input data instance belongs.munotes.in
Page 185
185However, logistic regression (which is a probabilistic binary classificationtechnique based on the Sigmoid function) can be considered as anexception, as it provides the probability in relation to one class only(usually Class 1, and it is not necessary to have “1—probability of Class1= probability of Class 0”relationship). Because of these properties,Logistic Regression is useful inMulti-Label Classificationproblems aswell, where a single data point can have multiple class labels.Some examples for probabilistic models are Logistic Regression, BayesianClassifiers, Hidden Markov Models,and Neural Networks (with aSoftMax output layer).If the model is Non-Probabilistic (Deterministic), itwill usually output only the most likely class that the input data instancebelongs to. Vanilla “Support Vector Machines” is a popular non-probabilisticclassifier.Let’s discuss an example to better understand probabilistic classifiers.Take the task of classifying an image of an animal into five classes—{Dog, Cat, Deer, Lion, Rabbit} as the problem. As input, we have animage (of a dog). For this example, let’s consider that the classifier workswell and provides correct/ acceptable results for the particular input we arediscussing. When the image is provided as the input to the probabilisticclassifier, it will provide an output such as (Dog (0.6),Cat (0.2),Deer(0.1), Lion(0.04), Rabbit(0.06)). But, if the classifier is non-probabilistic, it will only output “Dog”.11.3 WHY PROBABILISTIC ML MODELSOne of the major advantages of probabilistic models is that they providean idea about the uncertainty associated with predictions. In other words,we can get an idea of how confident a machine learning model is on itsprediction. If we consider the above example, if the probabilistic classifierassigns a probability of 0.9 for ‘Dog’ class instead of 0.6, it means theclassifier is more confident that the animal in the image is a dog. Theseconcepts related to uncertainty and confidence are extremely useful whenit comes to critical machine learning applications such as disease diagnosisand autonomous driving. Also, probabilistic outcomes would be useful fornumerous techniques related to Machine Learning such asActiveLearning.11.3.1 Objective Functions:In order to identify whether a particular model is probabilistic or not, wecan look at its Objective Function. In machine learning, we aim tooptimize a model to excel at a particular task. The aim of having anobjective function is to provide a value based on the model’s outputs, sooptimization can be done by either maximizing or minimizing theparticular value. In Machine Learning, usually, the goal is to minimizeprediction error. So, we define what is called a loss function as theobjective function and tries to minimize the loss function in the trainingphase of an ML model.munotes.in
Page 186
186If we take a basic machine learning model such as Linear Regression, theobjective function is based on the squared error. The objective of thetraining is to minimize the Mean Squared Error / Root Mean SquaredError (RMSE) (Eq. 1). The intuition behind calculating Mean SquaredError is, the loss/ error created by a prediction given to a particular datapoint is based on the difference between the actual value and the predictedvalue (note that when it comes to Linear Regression, we are talking abouta regression problem, nota classification problem).The loss created by aparticular data point will be higher if the prediction gives by the model issignificantly higher or lower than the actual value. The loss will be lesswhen the predicted value is very close to the actual value. As you can see,the objective function here is not based on probabilities, but on thedifference (absolute difference) between the actual value and the predictedvalue.Here, n indicates the number of data instances in the data set, true is thecorrect/ true value and predict is the predicted value (by the linearregression model).When it comes to Support Vector Machines, the objective is to maximizethe margins or the distance between support vectors. This concept is alsoknown as the ‘Large Margin Intuition’. As you can see, in both LinearRegression and Support Vector Machines, the objective functions are notbased on probabilities. So, they can be considered as non-probabilisticmodels. On the other hand, if we consider a neural network with aSoftMax output layer, the loss function is usually defined using Cross-Entropy Loss (CE loss) (Eq. 2). Note that we are considering a trainingdataset with ’n’ number of data points, so finally take the average of thelosses of each data point as the CE loss of the dataset. Here, Yi means thetrue label of the data point i and p(Yi) means the predicted probability forthe class Yi (probability of this data point belongs to the class Yiasassigned by the model).The intuition behind Cross-Entropy Loss is ;if the probabilistic model isable to predict the correct class of a data point with high confidence, theloss will be less. In the example we discussed about image classification,if the model provides a probability of 1.0 to the class ‘Dog’ (which is thecorrect class), the loss due to that prediction =-log(P(‘Dog’)) =-log(1.0)=0. Instead, if the predicted probability for ‘Dog’ class is 0.8, theloss =-log(0.8)= 0.097. However, if the model provides a low probabilityfor the correct class, like 0.3, the loss =-log(0.3) = 0.523, which can beconsidered as a significant loss.
munotes.in
Page 187
187In a binary classification model based on Logistic Regression, the lossfunction is usually defined using the Binary Cross Entropy loss (BCEloss).Here y_i is the class label (1if similar, 0 otherwise) and p(s_i) is thepredicted probability of a point being class 1 for each point ‘i’ in thedataset. N is the number of data points. Note that as this is a binaryclassification problem, there are only two classes, class 1 and class0.Asyou can observe, these loss functions are based on probabilities and hencethey can be identified as probabilistic models. Therefore, if you want toquickly identify whether a model is probabilistic or not, one of the easiestways is to analyse the loss function of the model.So, that’s all for this article. I hope you were able to get a clearunderstanding of what is meant by a probabilistic model. In the next blog,I will explain some probability concepts such as probability distributionsand random variables, which will be useful in understanding probabilisticmodels. If you find anything written here which you think is wrong, pleasefeel free to comment.11.3.2 Maximization method:In the real-world applications of machine learning, it is very common thatthere are many relevant features available for learning but only a smallsubset of them are observable. So, for the variables which are sometimesobservable and sometimes not, then we can use the instances when thatvariable is visible is observed for the purpose of learning and then predictits value in the instances when it is not observable. On the otherhand,Expectation-Maximization algorithmcan be used for the latentvariables (variables that are not directly observable and are actuallyinferred from the values of the other observed variables) too in order topredict their values with the condition that the general form of probabilitydistribution governing those latent variables is known to us. Thisalgorithm is actually at the base of manyunsupervised clusteringalgorithms in the field of machine learning.It was explained, proposed and given its name in a paper published in1977 by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to findthelocal maximum likelihood parametersofa statistical model in thecases where latent variables are involved and the data is missing orincomplete.1.Algorithm:1.Given a set of incomplete data, consider a set of starting parameters.2.Expectation step (E–step):Using the observed available dataofthe dataset, estimate (guess) the values of the missing data.3.Maximization step (M–step):Complete data generated after theexpectation (E) step is used in order to update the parameters.4.Repeat step 2 and step 3 until convergence.munotes.in
Page 188
188Figure 11.3.2 statistics of parameter●The essence of Expectation-Maximization algorithm is to use theavailable observed data of the dataset to estimate the missing data andthen using that data to update the values of the parameters. Let usunderstand the EM algorithm in detail.●Initially, a set of initial values of the parameters are considered. A setof incomplete observed data is given to the system with theassumption that the observed data comes from a specific model.●The next stepis known as “Expectation”–step orE-step. In this step,we use the observed data in order to estimate or guess the values of themissing or incomplete data. It is basically used to update the variables.●The next step is known as “Maximization”-step orM-step. In this step,we use the complete data generated in the preceding “Expectation”–step in order to update the values of the parameters. It is basically usedto update the hypothesis.●Now, in the fourth step, it is checked whether the values areconverging or not, if yes, then stop otherwise repeatstep-2andstep-3i.e. “Expectation”–step and “Maximization”–step until theconvergence occurs.
Figure 11.3.2 Algorithm of maximization
munotes.in
Page 189
18911.3.3 Usage of EM algorithm:●It can be used to fill the missing data in a sample.●It can be used as the basis of unsupervised learning of clusters.●It can be used for the purpose of estimating the parameters of HiddenMarkov Model (HMM).●It can be used for discovering the values of latent variables.11.3.4 Advantages of EM algorithm:●It is always guaranteed that likelihood will increase with eachiteration.●The E-step and M-step are often pretty easy for many problems interms of implementation.●Solutions to the M-steps often exist in the closed form.11.3.5 Disadvantages of EM algorithm:●It has slow convergence.●It makes convergence to the local optima only.●It requires both the probabilities, forward and backward (numericaloptimization requires only forward probability).11.4 GAUSSIAN MIXTURE MODEL & COMPRESSIONBASED MODELSuppose there are set of data points that needs to be grouped into severalparts or clusters based on their similarity. In machine learning, this isknown asClustering.●There are several methods available for clustering like:●K Means Clustering●Hierarchical Clustering●Gaussian Mixture ModelsNormal or Gaussian Distribution:In real life, many datasets can be modeled by Gaussian Distribution(Univariate or Multivariate). So it is quite natural and intuitive to assumethat the clusters comefrom different Gaussian Distributions. Or in otherwords, it is tried to model the dataset as a mixture of several GaussianDistributions. This is the core idea of this model.11.4.1 Gaussian Mixture Model:Data scientists use various machine learning algorithms to discoverpatterns in large data that can lead to actionable insights. In general, high-dimensional data are reduced by obtaining a set of principal componentsso as to highlight similarities and differences. In this work, we deal withmunotes.in
Page 190
190the reduced data using a bivariate mixture model and learning with abivariate Gaussian mixture model. We discuss a heuristic for detectingimportant components by choosing the initial values of locationparameters using two different techniques: cluster means,k-means andhierarchical clustering, and default values in the “mixtools” R package.The parameters of the model are obtained via an expectation maximizationalgorithm. The criteria from Bayesian point are evaluated for bothtechniques, demonstrating that both techniques are efficient with respect tocomputation capacity. The effectiveness of the discussed techniques isdemonstrated through a simulation study and using real data sets fromdifferent fields.In real data such as engineering data, efficient dimension reduction isrequired to reveal underlying patterns of information. Dimensionreduction can be used to convert data sets containing millions of functionsinto manageable spaces for efficient processing and analysis.Unsupervised learning is the mainapproach to reducing dimensionality.Conventional dimensional reduction approaches can be combined withstatistical analysis to improve the performance of big data systems [1].Many dimension reduction techniques have been developed by statisticaland artificial intelligence researchers. Principal component analysis(PCA), introduced in 1901 by Pearson [2], is one of the most popular ofthese methods. The main purpose of PCA is to reduce the dimensionalityof a data set consisting of a large number of interrelated variables, whileretaining as much as possible of the variation present in the data set.Among the many PCA methods, singular value decomposition is used innumerical analysis and Korhonen–Loève expansion in electricalengineering. Eigenvector analysis and characteristic vector analysis areoften used in the physical sciences. In image analysis, the Hotellingtransformation is often used for principal component projection.In recent years, there has been increasing interest in PCA mixture models.Mixture models provide a useful framework for the modelling of complexdata with a weighted component distribution. Owing to their highflexibility and efficiency, they are used widely in many fields, includingmachine learning, image processing, and data mining. However, becausethe component distributions in a mixture model are commonly formalizedas probability density functions, implementations in high-dimensionalspaces are constrained by practical considerations.PCA mixture modelsare based on a mixture-of-experts technique, which models a nonlineardistribution through a combination of local linear sub models, each with afairly simple distribution [3]. For the selection of the model, a PCAmixture model was proposed by Kim, Kim, and Bang [4], which has amore straightforward expectation maximization (EM) calculation, does notrequire a Gaussian error term for each mixture component, and uses anefficient technique for model order selection. The researchers applied theproposed model to the classification of synthetic data and eye detection[4].munotes.in
Page 191
191For multimode processes, the Gaussian mixture model (GMM) wasdeveloped to estimate the probability density function of the process dataunder normal operating conditions. However, in the case of high andcollinear process variables, learning from process data with GMM can bedifficult or impossible. A novel multimode monitoring approach based onthe PCA mixture model was proposed by Xu, Xie, and Wang [5] toaddress this issue. In this method, first, the PCA technique is applieddirectly to each Gaussian component’s covariance matrix to reduce thedimension of process variables and to obtain non-singular covariancematrices. Then, an EM algorithm is used to automatically optimize thenumber of mixture components.A novel process monitoring scheme forthe detection of multimode processes was developed using the resultingPCA mixture model. The monitoring performance of the proposedapproach has been evaluated through case studies [5].In recent years,hyperspectral imaging has become an important research subject in thefield of remote sensing. An important application of hyperspectral imagingis the identification of land cover areas. The rich content of hyperspectraldata enables forests, urban areas, crop species,and water supplies to berecognized and classified. In 2016, Kutluk, Kayabol, and Akan [6]proposed a supervised classification and dimensionality reduction methodfor hyperspectral images, using a mixture of probability PCA (PPCA)models. The proposed mixture model simultaneously allows the reductionof dimensionality and spectral classification of the hyperspectral image.Experimental findings obtained using real hyperspectral data indicate thatthe proposed approach results in better classification thanthe state-of-the-art methods [6].PROBABILITY CONCEPTS-SUMMARY1.Probabilities of outcomes.(!#%C6G9FJ9GCA9H<=B;
&% CB9C:DCGG=6@9H<=B;GK975BC6G9FJ9!$!G9HC:DCGG=6@9CIH7CA9GO
X
Q#OQ@=A H=A9GC77IFG=B F9D9H=H=CBG :CF957<CIH7CA9
'BHI=H=J989:B#OQY:CF957<#OQmunotes.in
Page 192
192/%($#CAD5BMA5?9G8=C89G-=7?58=C89:FCADFC8I7H=CB@=B9O89:97H=J9
;CC8QOQ-FC656=@=HM8=C89=G89:97H=J9#OQ
#OQ89:97H=J98=C89G#OQ5G8=C89G
/%($.C@@58=9O
Q-FC656=@=HMC:#OQ
9H7/%($ +9KGGH5B86IMG5B8G9@@G
%#% ,6G9FJ9=BHI=H=J989:B#OO
QQ#OQ
DF97=G989:B#O
Q#OQ#OQ=:5B85F98=G>C=BH
%'%)/%($
)$++'/%($)CC?5HHKC8=C89GO
Q#OQ
#OQ
#OQ
#OQO
Q9J9BHK<9F9:=FGH8=C89=G;CC8
9H7-FC656=@=HM:=FGH8=C89=G;CC8#OQ#OO
QQ#OQ#OQ
9H7/%($ /%($ '&+#&,+9KGGH5B8GHC7?G7CD=9GC:
%#% CB579FH5=B85M-FC656=@=HM9BCI;<7CD=9G:CF5@@H<97IGHCA9FGH<5HK5BHHC6IMCB9#OO
QQ#OQ#OQ#OQ/%($ /%($ '&+#&,.C@@58=9)9HO
Q9J9B5B8O
QCF55B85F98=G>C=BH
O
Q5BMH<=B;9L79DH5munotes.in
Page 193
193 29 <5J9#OQ
#OQ
5B8#O
Q +CH9#O
Q#OQ#OQ)'$%,::=79*5L?99DG579FH5=BBIA69FC:GH5D@9FGCB<5B8':H<9MG9@@CIHCB579FH5=B85M
H<9MCF89FACF9:FCAH<98=GHF=6IHCF5B8H<9G95F989@=J9F98=BH=A9:CFH<9GH5FHC:H<9B9LH85M0HC957<CIH7CA9O(QO
(QG9HC:CIH7CA9GGI7<H<5H
(
566F9J=5H=CB#O(Q#O
(Q-FC656=@=HMH<5H5GGIA9GJ5@I9(#OQ#O
Q-FC656=@=HMH<5HJ5@I9C:@=9G=B#O(
)Q#O
(5B8
)Q-FC656=@=HMH<5H5GGIA9GJ5@I9(5B85GGIA9GJ5@I9)
(#O(Q!# %)$$&% C:F5B8CAJ5F=56@9/%($/%($
'&+#&,)CC?5HHKC8=C89GF9GI@HC::=FGH8=C895B8F9GI@HC:G97CB88=C89
OQO
Q#OQ#O
Q#OQ#OQ/%($.')'$$*' #
#O
Q:CF5@@
F9GI@HC::=FGHFC@@
F9GI@HC:G97CB8FC@@
OQ9J9BHK<9F9:=FGHFC@@=GO
Qmunotes.in
Page 194
194#OQ-FC656=@=HMH<9:=FGHFC@@=G5#O
Q#O
Q#O
Q#O
Q#O
Q#O
Q#O
Q#OQ#OQ:CF5@@
#O
Q:CF5@@
#OO
QQ-FC656=@=HMH<9:=FGHFC@@=G
CF#OQ#OQ#OQU
GIAC:H<9HKCFC@@GO
Q9J9BHK<9F9GIA=GO
Q#O
Q#O
Q#O
Q
PWP44444444444444444444444444444444444444444444444444444444444444
)'$%)$++'/%($)CC?5HH#OQ#OQ#OQ
#OQ#OQ#OQ
#OQ
7CB8=H=CBC::=FGH8=C89
7CB8=H=CBC:G97CB88=C89
7CB8=H=CBC:H<=F88=C89
BIA69FC:8=C89G=BH<965H7<C:HIGH H<9 DFC656=@=HM C:GCA9H<=B;<5DD9B=B;577CF8=B;HC;=J9B=B:CFA5H=CBmunotes.in
Page 195
195#OPQ#OQFOQ % !# %)C:9J9BH'9J9BH/%($ #& '&+/+ ' /%($
05?9 HKC GI779GG=J9 8=C89G :FCADFC8I7H=CB@=B909GH:=FGH8=C89=H=G89:97H=J9$C9GH<=G5::97HH<9DFC656=@=HMH<5HH<9G97CB88=C89=G89:97H=J9 'BHI=H=J95DDFC57<05?95@5F;9BIA69FC:D5=FGC:8=C89G5B8H9GH6CH<8=C89G=B957<D5=F,B@M7CBG=89FD5=FGK<9F9:=FGH8=C89=G89:97H=J9#CIBHH<9BIA69F=BK<=7<H<9G97CB88=C89=G5@GC89:97H=J9&CF5@5F;9BIA69FC:D5=FG-FC656=@=HMG97CB88=C89=G89:97H=J9=::=FGH8=C89=G89:97H=J9
K<=7<6CH<5F989:97H=J9K<=7<H<9:=FGH=G89:97H=J9K<=7<6CH<5F989:97H=J9HCH5@C:D5=FGK<=7<H<9:=FGH=G89:97H=J9HCH5@C:D5=FG#O6CH<5F989:97H=J9Q#O:=FGH=G89:97H=J9Q
5GH<9BIA69FC:D5=FG0<9C6G9FJ5H=CBH<5HH<9:=FGH8=C89=G89:97H=J98C9G5::97HH<9DFC656=@=HMH<5HH<9G97CB88=C89=G89:97H=J9/%($#&'&+/+' /%($.C@@58=9&=B87CB8=H=CB5@DFC656=@=HMBIA69F=G9J9B;=J9BH<5H=:=GCF@5F;9F)9HO
QBIA69F=G9J9B5B8O
QBIA69F=GCF@5F;9F#OPQ#OQFOQ)'$% 3CICKBGHC7?=B*9;56MH9#CADIH9F#CFDCF5H=CB0<9F9=G5B 7<5B79 C: *9;56MH9 A5?=B; 5 DFC:=H =: =H A5?9G 5 H97
Page 196
196/=B79#OPQ#OQFOQH<=G=G9EI=J5@9BHHC#OQ#OQ#OQ
ACGH6CC?GIG9H<=G89:B/%($/%($'&+#&,'B%L5AD@9H<9G97CB88=C8969=B;89:97H=J9=GBCH=B89D9B89BHC:H<9:=FGH8=C8969=B;89:97H=J9/%($ /%($'&+#&,'B%L5AD@9;9HH=B;5B9J9BBIA69F=GBCH=B89D9B89BHC:;9HH=B;CF@5F;9FG=B79#O9J9BPCF@5F;9FQ
K<=@9#O9J9BQ,BH<9CH<9F<5B8=:O
Q=GH<99J9BHC:;9HH=B;CF@5F;9FH<9B#OPQGC;9HH=B;5B9J9BBIA69F=G=B89D9B89BHC:;9HH=B;CF@5F;9F)'$%
53CIFC@@58=9HK=795G=B%L5AD@9#CBG=89FH<99J9BHK<9F9H<9GIAC:H<9BIA69FGCBH<9HKCFC@@G=G'GH<=G=B89D9B89BHC:FC@@=B;5CBH<9:=FGHFC@@ !BG39G6)9H"69H<99J9BHC:FC@@=B;5CBH<9:=FGHFC@@CFG97CB8FC@@CF6CH<'G9J9BHK<9F9H<9GIAC:H<9BIA69FGCBH<9HKCFC@@G=G=B89D9B89BHGC:"!BG+C3.Independent Random Variables..5B8CAJ5F=56@9G5B85F9!%=:?BCK@98;9C:H<9J5@I9GC:CB9C:H<9A8C9GB H=B:@I9B79H<9DFC656=@=HMH<5HH<9CH<9F5GGIA9GJ5F=CIGJ5@I9G
=9#O(5B8)Q#O(Q#O)Q:CF5BM(5B8)/%($
#&+"'&+/+' /%($.C@@HKC8=79F9GI@HC::=FGHFC@@5B8F9GI@HC:G97CB8FC@@5F9=B89D9B89BHG=B79#O
Q#OQ#OQ:CF5@@
/%($#&+"'&+/+' /%($
)CC?5HHKC8=C89G7CB8=H=CBC::=FGH8=C895B87CB8=H=CBC:G97CB88=C895F9BCH=B89D9B89BHG=B79#O
Q#OQ#OQ
/%($-)#+#'&'&/%($
)CC?5HHKC8=C89G7CB8=H=CBC::=FGH8=C89 5B8 7CB8=H=CB C: G97CB8 8=C89 /IDDCG9 H<9 DFC656=@=HM A5GG:IB7H=CBGC:5B85F9;=J9B6M
#OQ
#OQ
#OQ
#OQ&IFH<9FACF9
GIDDCG95B85F9=B89D9B89BH&=B8H<9DFC656=@=H=9GC:H<9:CIFCIH7CA9G
5B8munotes.in
Page 197
1971G=B;H<9=B89D9B89B79CB9<5G#OQ#O
Q#OQ#OQ
#OQ#O
Q#OQ#OQ
#OQ#O
Q#OQ#OQ
#OQ#O
Q#OQ#OQ
0<=G=@@IGHF5H9G57CAACBG=HI5H=CBK<9F9K989G7F=695DFC656=@=HMAC89@6M;=J=B;CB9CFACF9F5B8CAJ5F=56@9G5@CB;K=H<H<9=FDFC656=@=HMA5GG:IB7H=CBGHC;9H<9F K=H< GCA9 =B:CFA5H=CB
GI7< 5G =B89D9B89B79
56CIH H<9 >C=BH69<5J=CFC:H<9F5B8CAJ5F=56@9G298CH<=G=BGH958C:@=GH=B;H<99@9A9BHG=BH<9G5AD@9GD5795@CB;K=H<H<9=FDFC656=@=H=9G.5B8CAJ5F=56@9G
H<9BH<9M5F9!%=:?BCK@98;9C:H<9J5@I9GC:GCA9C:H<9J5F=56@9G8C9GB H7<5B;9H<9DFC656=@=HMH<5HH<9CH<9FG5GGIA9J5F=CIGJ5@I9G
=9#O(
(Q#O(Q#O(Q:CF5BM(
(/%($'&+#&,+#'&' /%($)CC?5HHKC8=C89G7CB8=H=CBC:H<8=C89)9H69H<97CB8=H=CBC:H<9H<8=C89:CF
X
/IDDCG95@@H<9<5J9H<9G5A9DFC656=@=HMA5GG:IB7H=CB;=J9B6M
#OQ
#OQ&IFH<9FACF9
GIDDCG95@@H<95F9=B89D9B89BH&=B8H<9DFC656=@=HMH<5H5@@8=C89G5F9;CC81G=B;H<9=B89D9B89B79CB9<5G#OXQ#O
X
Q#OQ#OQX#OQ
X
7.Averages of Data(
(
X
(G9EI9B79C:C6G9FJ5H=CBGC:GCA9H<=B;(45J9F5;9C:(
(
X
((((/%($3CI5F95K
Page 198
198"
"
"
"
"K(#O(QDFC656=@=HMA5GG:IB7H=CBC:µ
(!%'& #O(Q(#O(Q(#O(Q(#O(Q(
((
((
((
((+#'&$29 5F9 AC89@=B; 5 G=HI5H=CB K<9F9 K9 5F9 ;C=B; HC A5?95G9EI9B79 C: F9@5H98 C6G9FJ5H=CBG 6M 5 G9EI9B79
X
C: F5B8CAJ5F=56@9GK<9F9=GH<9F9GI@HC:H<9H(K<9F9
(#O(Q:CF957<5B8/IDDCG9"
"
X
"5F9H<9J5@I9GK957HI5@@MC6G9FJ9:CFH<9F5B8CAJ5F=56@9G
X
'BCIF7CADIH5H=CBC:"4@9H G;FCID5@@H<9J5@I9GC:"H<5H9EI5@(HC;9H<9F5B85@@H<9J5@I9GC:"H<5H9EI5@(HC;9H<9F
9H70<9BK9<5J9"4"""
(((
(((
(((((((((K<9F9=GH<9BIA69FC:H=A9GH<5H(5DD95FG=B"
"
X
"!GK99LD97H#O(Q
(K<9F989BCH9G5BMC:H<9/C5GK99LD97H"4
((
((
((µ0<9 :57H H<5H H<=G
Page 199
199/%($/IDDCG9=B%L5AD@9H<9G9HC:DCGG=6@9J5@I9G:CFH<9K
':H<95F95@@=B89D9B89BH
H<9BK9KCI@89LD97HH<95J9F5;9"4C:H<957HI5@DF=79GCJ9FK99?GHC5DDFC57<5G)'$%+9KGGH5B86IMG5B8G9@@G
%#% BIA69F<9G9@@G=B585M#OQ
#OQ
#OQ
#OQ
#OQ
&=B8µ!BGK9F)'$%)CC?5H58=C89O
Q#OQ
#OQ.5B8CAJ5F=56@9=G89:=B986M
5B8
&=B8µ!BGµ#OQ9.Properties of Means
=:5B85F9=B89D9B89BH
;
(
(/%($ ! 7CAD5BM DFC8I79G HF5BG=GHCFG 0<9M 9GH=A5H9 H<5H H<9DFC656=@=HMC:5BMCB9C:H<9HF5BG=GHCFG=G89:97H=J9=G/IDDCG956CL7CBH5=BGHF5BG=GHCFG2<5H=GH<99LD97H98BIA69FC:89:97H=J9HF5BG=GHCFG=B56CL '$,+#'&)9H=:H<9H
X
X
0<=G9L5AD@9=@@IGHF5H9GH<9:C@@CK=B;;9B9F5@DFCDCG=H=CB/%($#CBG=89F5F5B8CAK5@?K<9F9H<9DFC656=@=HMC:5GH9DHCH<9F=;'$,+#'&
#OQ
#OQ
#OQ
':K9K9F9HC7CADIH9
:FCAH<989:=B=H=CB
H<9B
#OQ
#OQ
munotes.in
Page 200
200".' )!,%)*)9H
X
X695G9EI9B79C:=B89D9B89BHF5B8CAJ5F=56@9G5@@H5?=B;CBH<9G5A9G9HC:J5@I9G5B8<5J=B;H<9G5A9DFC656=@=HMA5GG:IB7H=CB
()9H40<9B#O4
µQ5GQUESTIONS1.What is a Probabilistic model?2.What is the difference between deterministic and Probabilisticmachine learning modes??3.How is probabilities used in machine learning4.What is a Probabilistic model in information retrieval?REFERENCES Mitchell Machine Learning, Tata McGraw-Hill Education; Firstedition Python Willi Richert, Shroff/Packt Publishing Building MachineLearningSystems With; First edition (2013) .munotes.in
Page 201
12MACHINELEARNINGINHYPER-AUTOMATIONUnit Structure12.0Objective12.1Introduction12.1.1Business Analysis And Predictions:12.1.2Automated Machine Learning:12.1.3 Synchronizationof Machine Learning And Iot12.1.4 Faster Computing Power12.1.5 Reinforcement Learning12.1.6 Machine Learning In Cybersecurity12.2Models’ Symbols Bagging And Boosting12.2.1 Bias And Variance12.2.2 Ensemble Methods12.2.2.1parallel Ensemble Methods12.2.2.2 Sequential Ensemble Methods12.3Bagging12.3.1bootstrapping12.3.2 Aggregation12.4How Is Bagging Performed12.4.1 Implementationof Bagging12.5Boosting12.5.1 How Is Boosting Performed:12.5.2 Similarities Between Bagging And Boosting:12.5.3 Bagging Vs Boosting:12.5.4 MultitaskLearning:12.5.4.1 Whento Use Multi-Task Learning12.5.4.2 Building A Multi-Task Model12.6Learning A Shared Representation12.6.1 Optimizing For Multiple Tasks12.6.2 What Is Online Machine Learning?12.6.2.1 Objective:12.6.2.2 Offline VsOnline Learning12.6.2.3 Online Learning Use Cases12.7Sequences Predictionmunotes.in
Page 202
12.7.1 Types Of Sequence Prediction Problems:12.7.2 Predicting The Next Value:12.7.3 Time-Series Forecasting:12.7.4 Webpage/Product Recommendation:12.7.5 Predicting AClass Label:12.7.5.1 Examples Of Sequence ClassificationApplications:12.8What Is Active Learning12.8.1 How Does Active Learning Work:12.8.2 Stream-Based Selective Sampling:12.8.3 Pool-Based Sampling:SummaryUnit EndQuestionsReferences12.0OBJECTIVEHyper parameters can be classified as model hyper parameters, that cannotbe inferred whilefitting the machine to the training setbecause they referto themodel selectiontask, or algorithm hyper parameters, that inprinciple have noinfluence on the performance of the model but affect thespeed and quality of the learning process.12.1INTRODUCTIONHyper-automation, which Gartner has identified as an IT mega-trend, isthe likelihood that virtually everything in an organization canbeautomated. Legacy business procedures, for example, should actually beautomated. The COVID-19 crisis has significantly increased the adoptionof this phenomenon, which is otherwise known as intelligent processautomation and digital process automation.Machine learning and AI are crucial aspects and indispensable propellersof hyper-automation (in addition to various innovations such as processautomation tools). For the sake of effectiveness, hyper-automationprocesses cannot rely on statically packaged software. Automatedenterprise processes must have had the ability to adapt to evolvingconditions and respond to unexpected situations.12.1.1 Business AnalysisandPredictions:This strategy allows experts to collect and analyse a set of data over atimeframe that is thereafter screened and used to make smart decisions.Machine learning networks can provide suppositions with accuracy asmuch as about 95% whenever trained using multiple data sets.In 2021andbeyond, we can expect that companies should integrate recurrent neuralmunotes.in
Page 203
networks for high-fidelity prediction. For instance, machine learningsolutions can be integrated to unravel hidden trends and precisepredictions. A clear illustration of this can be seen in insurance companiesidentifying likely frauds that could in one way or anotherhave a greatimpact on them.12.1.2Automated Machine Learning:The next phase of growth in machine learning trends is automatedmachine learning. It is a good idea for individuals who are notprofessionals in the complex world of machine learning as well as forprofessional data analysts and scientists. Automated machine learningenables these data experts to build machine learning models withincreased productivity and efficiency while featuring extremely highquality. Thus, a tool such as AutoMachine Learning can be utilized to traintop-quality custom ML models forclustering, regression, andclassification without extensive knowledge of how to program. It canseamlessly produce an adequate level of customization without an in-depth understanding of the complicated workflow of machine learning. Itcan also be usefulin utilizing machine learning best practices while savingresources and time. A good example of such AutoML is automatedmachine learning created by Microsoft Azure which can design and launchpredictive models.12.1.3 SynchronizationofMachine LearningandIot:The Internet of Things is already a developed technology that involves theconnection of several “things” or devices across a network, each havingthe ability to interact with each other. These devices are continuallyincreasing, at such a rate that there is the possibility of having over 64billion IoT devices by the end of 2025.The function of all of these devicesis to collect data that can be evaluated and processed to generate usefulinsights. This is where ML becomes very pertinent. ML algorithms can beutilized to transform the data gathered by IoT devices into usefulactionable outcomes.A good example of this is Green Horizons, which is a project launched byIBM’s Research Lab in China whose goal is to regulate the pollution ratesto better breathable levels. This is achievable with the use of an IoTnetwork wherein sensors gather emissions from automobiles, traffic levels,weather, airflow direction, and pollen levels, among other things. Thenmachine learning algorithms are employed to figure out the most effectiveway to minimize these emissions. The synchronization of machinelearning and IoT can as well be observed in the area of smart cars whereautonomous-driving vehicles must be very precise and each part mustcommunicate with one another in split seconds on the road. Thisdemonstrates how essential the integration of these technologies is.In fact,Gartner forecasts that over 80% of business IoT projects will utilize AIand ML in one way or the other by 2022. This paced growth is a lotgreater than the 10% of enterprise projects currently utilizing it.munotes.in
Page 204
12.1.4 Faster Computing Power:AI analysts are basically close to the beginning of understanding the fieldof artificial neural networks and the most suitable approach to arrangingthem. This suggests that within the next year, algorithmic successes willcontinue increasing at an overwhelming pace with pragmatic progress andbetter problem-solving mechanisms. Similarly, cloud ML solutions areadding momentum as third-party cloud serviceplatforms support thedeployment of machine learning algorithms in the cloud. AI can resolve areasonable range of unfavourable issues that require discovering insightsand making decisions. Although, in the absence of the ability to lay handson a machine’s proposition, people will assume that it is cumbersome toaccept that suggestion. With defined lines, conceive continueddevelopment in the transitional period increasing the explain ability andtransparency regarding Artificial Intelligence algorithms.12.1.5 Reinforcement Learning:Reinforcement Learning (RL) is also one of themachine learning trendstolook out for this year. It can generally be used by businesses in the nearfuture. It is an innovative use of deep learning which makes use of itspersonal experiences to enhance the effectiveness of accumulated data RL,Artificial Intelligence programming is deployed with several conditionsthat determine what kind of activity will be executed by the software. Inrespect to various actions and outcomes, the software auto-learns actionsto work to achieve the proper ultimate goal. A typical illustration of RL isa chatbot that attends to basic user queries such as consultation calls, orderbooking, greetings. ML Development Organizations can make use ofReinforcement Learning to ensure the ingenuity of the chatbot byincluding sequential conditions to it–for instance, differentiatingprospective buyers and transferring calls to the appropriate service agent.Some other applications of reinforcement Learning are aircraft control,industrial automation, robot motion control, as well asbusiness processesand strategic planning.12.1.6 Machine LearninginCybersecurity:Machine learning continues to skyrocket in this contemporary time,it islikewise spreading its applications in several various sectors. One of themost common among these industries is the cybersecurity industry.Machine learning has a lot of applications in cybersecurity, some of whichare detection of cyber threats, combating cyber-crime that equally utilizesML capabilities, enhancing available antivirus software, among otherthings.Machine learning is also employed in the creation of smart antivirussoftware that can detect any malware or virus by its irregular behaviourinstead of just utilizing its signature like regular antivirus. Hence, thesmart antivirus has the capacity to detect older threats from formerlyexperienced viruses and also fresh threats from viruses that were newlycreated by screening their abnormal behaviour. Several companies areintegrating machine learning into cybersecurity.munotes.in
Page 205
12.2 MODELS’ SYMBOLS BAGGING AND BOOSTINGThis blog will explain ‘Bagging and Boosting’ most simply and shortly.But let us first understand some important terms whichare going to beused later in the main content. Let’s start with an example, If we want topredict ‘sales’ of a particular company based on its certain features, thenmany algorithms likeLinear RegressionandDecision Tree Regressorcanbe used. But both of these algorithms will make different predictions. Whyis it so? Oneof the key factors is how much bias and variance theyproduce.Cool, but what if we don’t know anything about Bias andVariance. So let’s jump to Bias and Variance first.12.2.1 Bias and Variance:Bias:When we train our model on the given data, it makes certainassumptions, some correct and some incorrect. But when the model isintroduced with testing or validation data, these assumptions (obviouslyincorrect ones) may create a difference in predicted value.So, to concludefrom this,” Bias is the difference between the Predicted Value and theExpected Value”. When there is ahigh biaserror, it results in a model thatis not capable of taking so much variation. Since it does not learn enoughfrom the training data, it is calledUnderfitting.Varianceisthe error that occurs when the model captures fluctuations ornoises of the data.To explain further, the model learns too much from the training data, sothat when it is introduced with new testing data, it is unable to predict theresult accurately. When there is a highvarianceerror, your model is sospecific to the trained data, it is calledOverfitting.12.2.2 Ensemble Methods:Let’s take an example, If you want to purchase a new laptop, then youwould browse different shopping websites/apps and check the price,features, and reviews/ratings and then compare them. You would also visitsome local stores and also ask your friends or relatives for their opinion.Then at the end, you take a decision considering all these factors.Ensemble models in machine learning work on a similar idea. Ensemblemethods create multiple models (called base learners/weak learners.) andcombine/aggregate them into one final predictive model to decrease theerrors (variance or bias). This approach allows us to produce better andmore accurate predictive performance compared to a single model.Ensemble methods can be divided into two groups:12.2.2.1Parallel ensemble methods:In this method base learners are generated parallelly, hence encouragingindependence between the base learners. Due to the application ofaverages, the overall error is reduced.munotes.in
Page 206
12.2.2.2 Sequential ensemble methods:In this method base learners are generated by sequence try; hence baselearners are dependent on each other. Overall performance of themodel isthen increased by allocating higher weights to previously mislabelled/mis-predicted learners.Boosting and bagging are the two most popularly used ensemble methodsin machine learning. Now as we have already discussed prerequisites, let’sjump tothis blog’s main content.12.3 BAGGINGBagging stands forBootstrapAggregating or simply Bootstrapping +Aggregating.12.3.1Bootstrappingin Bagging refers to a technique where multiplesubsets are derived from the whole (set) using the replacementprocedure.12.3.2 Aggregationin Bagging refers to a technique that combines allpossible outcomes of the prediction and randomizes the outcome.Hence many weak models are combined to form a better model.Bagging is aParallelensemble method, where every model is constructedindependently. Bagging is used when the aim is toreduce variance.So,now let’s see how bagging is performed.12.4 HOW IS BAGGING PERFORMEDThe whole process of Bagging is explained in just a few steps. Please referto thediagram below for a clearer understanding and visualization.1.‘n’ number of data subsets (d1, d2, d3…. dn) are generated randomlywith replacement from theoriginal dataset ‘D’;Bootstrapping.2.Now these multiple sub-datasets are used to train multiple models(which are called ‘WeakLearners’) like m1, m2, m3….mn.3.Final prediction (Ensemble model) is given based on the aggregationof predictions from all weak models;Aggregating.In the case of Regressor: the average/mean of these predictions isconsidered as the finalprediction.In the case of Classifiers: the majority vote gained from the votingmechanism is considered as the final prediction.munotes.in
Page 207
Figure 12.4.1 Bagging RepresentationAs “random samplingwith replacement” is used here therefore everyelement has the same probability to appear in a new training sub-datasetand also some observations may be repeated. Ensemble model producedwith these weak models is much more robust and with low variance.12.4.1 Implementation of Bagging:Random forest algorithm uses the concept of Bagging. Here is a samplecode for how Bagging can be implemented practically. Remember it’s justa sample code, just to introduce you to the required library.12.5 BOOSTINGBoosting is aSequentialensemble method, where each consecutive modelattempts to correct the errors of the previous model.If a base classifier is misclassified in one weak model, its weight will getincreased and the next base learner will classify itmore correctly. Sincethe output of one base learner will be input to another, hence every modelis dependent on its previous model. Boosting is used when the aim istoreduce bias.So now let’s see how bagging is performed.12.5.1 How is Boosting performed:The whole process of Boosting is explained in just a few steps. Pleaserefer to the diagram below for a clearer understanding and visualization.1.let‘d1’ be the data-subset, which is generated randomly withreplacement from the original dataset ‘D’.2.Now this subset is used to train the model ‘m1’(which is called a weaklearner).
munotes.in
Page 208
3.This model is then used to make predictions on the original(complete)dataset. Elements or instances which are misclassified/mis-predictedby this model, will be given more weights while choosing the nextdata-subset.4.Let ‘d2’ be the data-subset, which is generated randomly withreplacement from the dataset ‘D’ (which is now updated with weights).In this step, instances which have more weights (concluded from theprevious step) will be more likely to be chosen.4.Now this subset is again used to train the model ‘m2’(which is called aweak learner).5.Above stepsare repeated for ‘n’ number of times, to get ‘n’ suchmodels(m1,m2,m3…..mn)6.Results of these ‘n’ weak models are combined to make a finalprediction.
12.5.1 Figure BoostingImplementation:▪AdaBoost▪Gradient boostingThese algorithms use Boosting in a different-different manner which wewill see in detail in the next blog.Here is a sample code for how boosting can be implemented practically onthe AdaBoostalgorithm. Remember it’s just a sample code, just tointroduce you to the required library.12.5.2 Similarities Between Bagging and Boosting:1.Both of them are ensemble methods to get N learners from one learner.2. Both of them generate several sub-datasets for training by randomsampling.3. Both of them make the final decision by averaging the N learners (or byMajority Voting).
munotes.in
Page 209
4. Both of them are good at providing higher stability.12.5.3 Bagging Vs Boosting:1.The Main Goal of Bagging is todecrease variance, not bias. The MainGoal of Boosting is to decrease bias, not variance.2.In Bagging multiple training data-subsets are drawn randomly withreplacement from theoriginal dataset. In Boosting new sub-datasetsare drawn randomly with replacement from the weighted(updated)dataset3.In Bagging, every model is constructed independently.In Boosting,new models are dependent on the performance of the previous model.4.In Bagging, every model receives an equal weight. In Boosting,models are weighted by their performance.5.In Bagging, the final prediction is just the normal average. InBoosting, the final prediction is a weighted average.6.Bagging is usually applied where the classifier is unstable and has ahigh variance. Boosting is usually applied where the classifier is stableand has a high bias.7.Bagging is used for connecting predictions of the same type. Boostingis used for connecting predictions that are of different types.8.Bagging is an ensemble method of type Parallel. Boosting is anensemble method of type Sequential12.5.4 Multitask learning:In mostmachine learning contexts, we are concerned with solvingasingletask at a time. Regardless of what that task is, the problem istypically framed as using data to solve a single task or optimize a singlemetric at a time. However, this approach will eventually hit a performanceceiling, oftentimes due to the size of the data-set or the ability of themodel to learn meaningful representations from it. Multi-task learning, ontheother hand, is a machine learning approach in which we try tolearnmultipletasks simultaneously, optimizing multiple loss functions atonce. Rather than training independent models for each task, we allow asingle model to learn to complete all of the tasks at once. In this process,the model uses all of the available data across the different tasks to learngeneralized representations of the data that are useful in multiple contexts.Multi-task learning has seen widespread usage across multiple domainssuch as natural language processing, computer vision, andrecommendation systems. It is also commonly leveraged in industry, suchas atGoogle, due to its ability to effectively leverage large amounts ofdata in order to solve related tasks.12.5.4.1 When to use multi-task learning:Before going into the specifics of how to implementa multi-task learningmodel, it is first important to go through situations in which multi-tasklearning is, and is not, appropriate. Generally, multi-task learning shouldmunotes.in
Page 210
be used when the tasks have some level ofcorrelation. In other words,multi-task learning improves performance when there are underlyingprinciples or information shared between tasks. For example, two tasksinvolving classifying images of animals are likely to be correlated, as bothtasks will involve learning to detect fur patterns and colours. This wouldbe a good use-case formulti-task learning since learning these imagesfeatures is useful for both tasks.On the other hand, sometimes training on multiple tasks resultsinnegative transferbetween the tasks, in which the multi-taskmodelperforms worse than the equivalent single-task models. This generallyhappens when the different tasks are unrelated to each other, or when theinformation learned in one task contradicts that learned in another task.12.5.4.2 Building a multi-taskmodel:Now that we know when we should use multi-task learning, we will gothrough a simplemodel architecture for a multi-task model. This willfocus on a neural network architecture (deep multi-task learning), sinceneural networks are by far the most common type of model used in multi-task learning.12.6 LEARNING A SHARED REPRESENTATIONAtits core, deep multi-task learning aims to learn to produce generalizedrepresentations that are powerful enough to be shared across differenttasks. I will focus on hard parameter sharing here, in which the differenttasks use exactly the same base representation of the input data.
! As we can see, hard parameter sharing forces the model to learn anintermediate representation that conveys enough information for all of thetasks. The task-specific portions ofthe network all start with the samebase representation from the last shared layer.
munotes.in
Page 211
Multi-task learning improves the generalizability of this representationbecause learning multiple tasks forces the model to focus on the featuresthat are useful across all of the tasks. Assuming the tasks are correlated, afeature that is important for Task A is also likely to be important for TaskC. The opposite is also true; unimportant features are likely to beunimportant across all of the tasks.Multi-task learning also effectively increases the size of your data-set,since you are combining the data-sets from each task. By adding moresamples to the training set from different tasks, the model will learn tobetter ignore the task-specific noise or biases within each individual data-set.12.6.1 Optimizing for multiple tasks:Once the model’s architecture has been decided, we need to decide whatloss function to optimize. The simplest approach is to minimize a linearcombination of the individual tasks’ loss functions. Each task will have itsown individual loss functionLi. So in our multi-task model, we simplyweight each loss function and minimize the sum of these weighted losses.Now that we know how to build a multi-task model, we need to identifyhow to applythis method to a given task. In many cases, it may seem likeyou truly only have one task to solve, and it may not be obvious how toadapt your problem into a multi-task learning situation. In cases where youdo not explicitly have multiple tasks, you cancreateauxiliary tasksthataim to solve a problem that is related, but not identical, to your singleoriginal task. By creating an auxiliary task, you can still apply multi-tasklearning to your single primary task, and hopefully improve performanceonit.Identifying an auxiliary task is generally domain-specific and there is noone-size-fits-all approach to coming up with one. However, there are somegeneral principles that they will have in common. In general, the auxiliarytask should be related to the primary task, and should nudge the networkinto learning important features for the primary task. For example, if theprimary task is to classify sequences of data, we can create an auxiliarytask that is to reconstruct the sequence with an autoencoder. This auxiliarytask explicitly forces the network to learn a sequence encoder thatproduces a representation that is informative enough to be able toreconstruct the original sequence. This is likely to improve performanceon the original task, classifying the sequence, as well, simply by producinga more informative intermediate representation of the sequence.12.6.2 What is Online Machine Learning?:While you may not knowbatchoroffline learningby name, you surelyknow how it works. It’s the standard approach to machine learning.Basically, you source a dataset and build a model on the whole dataset atonce. This is why it’s called batch learning. You may be wondering why itmunotes.in
Page 212
goes by yet another name: offline learning. That’s because offline learningis the polar opposite of another machine learning approach that you maynot even be aware of. It’s calledonline learningand you should knowwhat it can do for you.12.6.2.1 Objective:My objective in this post is to introduce you to online learning, describeitsuse cases, and show you how to get started in Scikit-learn. To helpmotivate things, know that online learning is a powerful tool that opens upa whole new world. It’s a tool you can add to your toolbox, giving youcapabilities to tackle problems thatmay have once been beyond yourreach.12.6.2.2 Offline vs Online Learning:So what differentiates offline and online learning? In the simplest sense,offline learning is an approach that ingests all the data at one time to builda model whereas online learning is an approach that ingests data oneobservation at a time. There’s a perfect one-to-one analogy here for thosefamiliar withGradient Descent. Offline learning, also known as batchlearning, is akin to batch gradient descent. Online learning, on the otherhand, is the analog of stochasticgradient descent. In fact, as we’ll see,implementing online learning in Scikit-learn will utilize stochasticgradient descent with a variety of loss functions to create online learningversions of algorithms like logistic regression and support vectormachines. There’s more to online learning, though.Online learning is data efficient and adaptable. Online learning is dataefficient because once data has been consumed it is no longer required.Technically, this means you don’t have to store your data. Online learningis adaptable because it makes no assumption about the distribution of yourdata. As your data distribution morphs or drifts, due to say changingcustomer behaviour, the model can adapt on-the-fly to keep pace withtrends in real-time. In order to do something similar with offline learningyou’d have to create a sliding window of your data and retrain every time.And if you’ve been paying attention, you surely noticed that you can usethis methodology todo streaming analytics.12.6.2.3 Online Learning Use Cases:Now that you know the difference between offline and online learning,you may be wondering when to consider the latter. Simply put, consideronline learning when:1.Your data doesn’t fit intomemory2.You expect the distribution of your data to morph or drift over time3.Your data is a function of time (e.g. stock prices)munotes.in
Page 213
12.7 SEQUENCES PREDICTION12.7.1 Types of Sequence Prediction Problems:Sequence prediction is a popular machine learningtask, which consists ofpredicting the next symbol(s) based on the previously observed sequenceof symbols. These symbols could be a number, an alphabet, a word, anevent, or an object like a webpage or product. For example:●A sequence of words or characters in a text●A sequence of products bought by a customer●A sequence of events observed on logsSequence prediction is different from other types of supervised learningproblems, as it imposes that the order in the data must be preserved whentrainingmodels and making predictions.Sequence prediction is a common problem which finds real-lifeapplications in various industries. In this article, I will introduce to youthree types of sequence prediction problems:●Predicting the next value●Predicting a class label●Predicting a sequence12.7.2 Predicting the next value:Being able to guess the next element of a sequence is an importantquestion in many applications. A sequence prediction model learns toidentify the pattern in the sequential input dataand predict the next value.
12.7.2. Figure: Sequence Prediction Model12.7.3 Time-series forecasting:Time-series refers to an ordered series of data, where the sequence ofobservations is sequentially in the time dimension. Time-series forecastingis about making predictions of what comes next in the series. Thus, Time-series forecasting involves training the model on historical data and usingthem to predict future observations.But what makes time-series forecasting different from a regressionproblem? There are 2 things:
munotes.in
Page 214
●Time series is time-dependent, which is ordered by time. ButRegression can be applied to non-ordered data where a target variableis dependent on values taken by features.●Time series looks for seasonality trends. For example,the powerdemand in a day will drop at night, and the number of air passengerswill increase during the summer.12.7.4 Webpage/product recommendation:Have you searched for something, and every advertisement you saw nextis related to what you searched for?For example, after watching the movieAvengers:Endgame, I wassearching for explanations of certain scenes. Ever since then,GoogleDiscover Feedstarted to show me content revolve around the MarvelCinematic Universe, even until today.Even thoughit seems like Google Discover Feed is recommending acollection of webpages, each webpage is an individual output.12.7.5 Predicting a class label:Sequence classification uses labelled datasets with some sequence inputsand class labels as outputs, to train a classification model which can beused to predict the class label of an unseen sequence.
!!
12.7.5.1 Examples of sequence classification applications:Text categorization. Assigning labels to documents written in a naturallanguage has numerous real-world applications includingsentimentclassificationandtopic categorization.Anomaly detection. Researchers exploredetecting abnormal behavioursin4 different time-series datasets, 1) electrocardiograms,2) Space ShuttleMarotta valve, 3) power demand, and engine sensors datasets.Genomic research. Researchers have beenclassifying protein sequencesinto categories. This work could be potentially useful for the discovery ofa wide range of protein functions.Health-informatics.Researchers use LSTM toclassify electrocardiogram(ECG) signalsinto five different classes to identify the condition of apatient’s heart. This allows end-to-end learning, extracting features relatedto ECG signals without any expert intervention.
munotes.in
Page 215
Brain-computer interface. We extractbrain signals from theElectroencephalogram, decoding of the user’s intentions to operate theassistive devices.12.8 WHAT IS ACTIVE LEARNINGActive learning is the subset of machine learning in which a learningalgorithmcan query a user interactively to label data with the desiredoutputs. In active learning, the algorithm proactively selects the subset ofexamples to be labelled next from the pool of unlabelled data. Thefundamental belief behind the active learner algorithm concept is that anML algorithm could potentially reach a higher level of accuracy whileusing a smaller number of training labels if it were allowed to choose thedata it wants to learn from.Therefore, active learners are allowed to interactively pose queries duringthe training stage. These queries are usually in the form of unlabelled datainstances and the request is to a human annotator to label the instance.This makes active learning part of the human-in-the-loop paradigm, whereit is one of the most powerful examples of success.12.8.1 How does active learning work:Active learning works in a few different situations. Basically, the decisionof whether or not to query each specific label depends on whether the gainfrom querying the label isgreater than the cost of obtaining thatinformation. This decision making, in practice, can take a few differentforms based on the data scientist’s budget limit and other factors.12.8.2 Stream-based selective sampling:In this scenario, the algorithmdetermines if it would be beneficial enoughto query for the label of a specific unlabelled entry in the dataset. Whilethe model is being trained, it is presented with a data instance andimmediately decides if it wants to query the label. This approach has anatural disadvantage that comes from the lack of guarantee that the datascientist will stay within budget.12.8.3 Pool-based sampling:This is the most well-known scenario for active learning. In this samplingmethod, the algorithm attempts to evaluate the entire dataset before itselects the best query or set of queries. The active learner algorithm isoften initially trained on a fully labelled part of the data which is then usedto determine which instances would be most beneficial to insert into thetraining set for the next active learning loop. The downside of this methodis the amount of memory it can require.Membership query synthesis:This scenario is not applicable to all cases, because it involves thegeneration of synthetic data. The active learner in this method is allowedmunotes.in
Page 216
to create its own examples for labelling. This method is compatible withproblems where it is easy to generate a data instance.SUMMARYBootstrap aggregating, also calledbagging(from bootstrap aggregating),is amachine learningensemble meta-algorithm designed to improve thestability and accuracy ofmachine learningalgorithms used in statisticalclassification and regression. It alsoreduces variance and helps to avoidoverfitting.UNIT ENDQUESTIONSWhat is bagging and boostingWhat is Active learningWhat is sequences predictionExplain in detail BoostingREFERENCES"Alpaydin Ethem,Introductionto Machine Learning by MIT; 2 edition(2010)●Jacek M. Zurada , Jaico Publishing House Introduction to ArtificialNeuralSystems by ; First edition (25 January 1994).●Simon Haykin, PHI Private Ltd Neural Networks and LearningMachines 2013.munotes.in