Figure 3.7 SQL Language Statement
1. DML (Data Manipulation Language)
It is used to select records from a table and to insert, delete, update or modify existing records.
The following SQL commands are used for DML:
Select: It is used to select records from a table or schema.
Syntax: SELECT column_list FROM table_name;
Insert: It is used to insert new records into a table.
Syntax: INSERT INTO table_name VALUES (value_list);
Update: It is used to update or modify an existing table or records in the given database.
Syntax: UPDATE table_name SET column = value;
Delete: It is used to delete existing records from the database.
Syntax: DELETE FROM table_name;
2. DDL (Data Definition Language)
It is used to create, modify or alter a database or table. It is mainly used for database design and storage.
The following SQL commands are used for DDL:
Create: It is used to create a new database, table or schema.
Syntax: CREATE DATABASE db_name; CREATE TABLE table_name (...);
Alter: It is used to alter an existing table or column definition.
Syntax: ALTER DATABASE db_name ...; ALTER TABLE table_name ...;
Drop: It is used to delete an existing table.
Syntax: DROP TABLE table_name;
3. DCL (Data Control Language)
It is used to control the level of access to the database.
The following SQL commands are used for DCL:
Grant: It is used to allow a user to read or write on specific database objects.
Syntax: GRANT privileges ON object TO user;
Privileges may be Select, Insert, Update, Delete, Alter, etc.
Revoke: It is used to withdraw a user's read and write permissions on database objects.
Syntax: REVOKE privileges ON object FROM user;
Privileges to revoke may be Select, Insert, Update, Delete, Alter, etc.
4. TCL (Transaction Control Language)
Transaction Control Language is used to control and manage the
transactions to maintain the integrity of the database with the help of SQL
statements.
The following SQL commands are used for TCL:
Begin Transaction: It is used to open a transaction.
Commit Transaction: It is used to commit a transaction.
Rollback Transaction: It is used to roll back a transaction.
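A small, self-contained way to try these command families is sketched below using Python's built-in sqlite3 module on a hypothetical students table. Note that SQLite does not support the DCL commands GRANT and REVOKE, so only DML, DDL and TCL are demonstrated here.

import sqlite3

# In-memory database used purely for illustration (hypothetical "students" table)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a table
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")

# DML: insert, update, select and delete records
cur.execute("INSERT INTO students (name, marks) VALUES (?, ?)", ("Asha", 82))
cur.execute("UPDATE students SET marks = 85 WHERE name = ?", ("Asha",))
cur.execute("SELECT * FROM students")
print(cur.fetchall())          # [(1, 'Asha', 85)]
cur.execute("DELETE FROM students WHERE name = ?", ("Asha",))

# TCL: commit the transaction (conn.rollback() would undo uncommitted changes)
conn.commit()
conn.close()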
3.3 STRUCTURED DATA
Data: Data is a collection of facts and figures that can be processed to
produce information.
Database: A database is an organized collection of data and is an important component of many different applications. It is used for storing a variety of data, and the stored information can be accessed very easily, which makes data management as well as updating of data convenient.
Structured Data: It is defined by a data model, which means the data conforms to a pre-defined schema or structure. It is easily used and accessed by users. Generally it is stored in tabular form (rows and columns) with its attributes.
SQL (Structured Query Language) is used to store, manage and access data stored in databases in structured form.
Sources of Structured Data:
Highly structured data is typically held in an RDBMS such as Oracle, Microsoft SQL Server or PostgreSQL (an advanced open-source RDBMS). Such systems hold operational or transactional data generated and collected from day-to-day business activity. The data that comes from online transaction processing (OLTP) systems is structured data.
Figure 3.8: Sources of Structured Data
[source: https://nitsri.ac.in/Department/Computer%20Science%20&%20Engineering/BDL2.pdf]
Uses of Structured Data
1. Commands such as Insert, Update and Delete, which are part of DML (Data Manipulation Language) operations, are used to input, store, access and analyse the data.
2. It is stored in a well-defined format in the database, such as tabular form with rows and columns.
3. It is very easy to access and manage the data from the tabular form.
4. Data mining is easy, as knowledge can be extracted from the data easily.
3.4 SEMI-STRUCTURED DATA
Semi-structured data is partially structured and partially unstructured data. It is data that has not been organized into a specific database or repository, but that nevertheless has associated information, such as metadata.
Semi-structured data does not have a specific format, but it contains semantic tags.
Examples of semi-structured data types: Email, JSON, NoSQL, XML, etc.
3.4.1 XML
As we all know, XML stands for Extensible Markup Language.
It is a markup language and file format that helps in storing and transporting data.
It is designed to carry data and not just to display data, as it is self-descriptive.
It was formed by extracting properties of SGML (Standard Generalized Markup Language).
It supports exchanging of information between computer systems. They can be websites, databases, and any third-party applications.
It consists of predefined rules which make it easy to transmit data as XML files over any network.
The components of an XML file are:
XML Document:
The content written between the document's opening and closing tags makes up the XML document; these tags appear at the beginning and the end of an XML file.
XML Declaration:
The document begins with some information about XML itself, including the XML version. For example: <?xml version="1.0" encoding="UTF-8"?>
XML Elements:
The other tags you create within an XML document are called XML elements. An element can contain the following:
1. Text
2. Attributes
3. Other elements
For example, an XML document describing fruits might list Strawberry, Blueberry, Raspberry, Oranges, Lemons and Limes as the text content of its elements (see the worked sketch after the figure below).
Here, the outermost tag is the root element, and the tags nested inside it are the other element names.
4. XML Attributes:
XML elements can carry additional descriptors called XML attributes. One can define one's own attribute name and attribute values within quotation marks inside the element's opening tag.
For example, see the attribute shown in the sketch below.
5. XML Content:
The data that is present in the XML file is called the XML content. In the fruits example above, Strawberry, Blueberry, Raspberry, Oranges, Lemons and Limes are the content.
Example: Figure: XML Document [source: tutorials.com]
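The original example figure is not reproduced here. The following sketch, using Python's standard xml.etree.ElementTree module, illustrates the same components (declaration, root element, child elements, an attribute and text content); the tag name fruits and the attribute type are illustrative assumptions, not taken from the figure.

import xml.etree.ElementTree as ET

# Illustrative XML document: <fruits> is the root element, each <fruit> is a
# child element with a hypothetical "type" attribute and text content.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<fruits>
    <fruit type="berry">Strawberry</fruit>
    <fruit type="berry">Blueberry</fruit>
    <fruit type="citrus">Lemons</fruit>
</fruits>"""

root = ET.fromstring(doc)
print(root.tag)                                        # fruits (the root element)
for fruit in root:
    print(fruit.tag, fruit.attrib["type"], fruit.text)  # element, attribute, content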
3.4.2 XQuery
XQuery is an abbreviation for XML Query.
XQuery is basically considered as the language for querying XML
data.
It is built on XPath expressions.
XQuery for XML = SQL for Databases.
All major databases support XQuery.
It is used for finding and extracting elements and attributes from XML
documents.
One can search web documents for relevant information and generate summary reports.
It replaces complex Java or C++ programs with a few lines of code.
3.4.3 XPath
XPath (XML Path Language) is a query language used to navigate through
an XML document and select specific elements or attributes. It is widely
used in web scraping and data extraction, as well as in data science for
parsing and analyzing XML data.
In data science, XPath can be used to extract information from XML files
or APIs. For example, you might use XPath to extract specific data fields
from an XML response returned by a web API, such as stock prices or weather data.
XPath can also be used in combination with other tools and languages
commonly used in data science, such as Python and Beautiful Soup, to
scrape data from websites and extract structured data for analysis. By
using XPath to select specific elements and attributes, you can quickly and
easily extract the data you need for analysis.
Operators in XPath
Different types of operators are:
Addition (+): It performs addition in the given field.
Subtraction (-): It performs subtraction in the given field.
Multiplication (*): It performs multiplication in the given field.
Div: It performs division in the given field.
Mod: It performs the modulo operation (remainder of a division) in the given field.
[ / ]: This stepping operator helps in selecting a specific node (a specific path) starting from the root node.
[ // ]: This descendant operator is used to select nodes at any depth below the current node, starting from the root node.
[ ... ]: This operator helps in checking a node value from the node-set.
[ | ]: It is used to compute the union of two node-sets; duplicate values are filtered out and the results are arranged in sorted order.
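A short sketch of these operators in use is given below with the third-party lxml package (assumed to be installed via pip install lxml); the catalog/book document is made up for illustration.

from lxml import etree

doc = etree.fromstring(
    "<catalog>"
    "  <book price='450'><title>Data Science</title></book>"
    "  <book price='300'><title>Statistics</title></book>"
    "</catalog>"
)

# "/" steps from the root, "//" selects descendants anywhere, "[...]" filters with a predicate
titles = doc.xpath("//book[@price > 400]/title/text()")
print(titles)                                   # ['Data Science']

# Arithmetic operators such as div can be used inside expressions as well
average = doc.xpath("sum(//book/@price) div count(//book)")
print(average)                                  # 375.0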
3.4.4 JSON
JSON (JavaScript Object Notation) is a lightweight data interchange
format that is easy for humans to read and write, and easy for
machines to parse and generate.
It is used for exchanging data between web applications and servers,
and can be used with many programming languages.
JSON data is represented in key-value pairs, similar to a dictionary or
a hash table.
The key represents a string that identifies the value, and the value can
be a string, number, Boolean, array, or another JSON object.
JSON objects are enclosed in curly braces {}, and arrays are enclosed
in square brackets [].
JSON is often used in web development because it can easily be parsed by JavaScript, which is a commonly used programming language for front-end web development. JSON data can be easily converted to JavaScript objects, and vice versa. Additionally, JSON is supported by many modern web APIs, making it a popular choice for exchanging data between web applications and servers.
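The sketch below uses Python's built-in json module on a made-up record to show the key-value structure, nesting, and the round trip between a JSON string and a Python object.

import json

# A hypothetical record: keys are strings; values may be strings, numbers,
# Booleans, arrays (lists) or nested JSON objects (dicts).
record = {
    "name": "Asha",
    "age": 21,
    "enrolled": True,
    "courses": ["Data Science", "Statistics"],
    "address": {"city": "Mumbai", "pin": "400001"},
}

text = json.dumps(record, indent=2)   # serialize a Python dict into a JSON string
print(text)

parsed = json.loads(text)             # parse the JSON string back into a dict
print(parsed["address"]["city"])      # Mumbai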
3.5 UNSTRUCTURED DATA
Unstructured data is data that is not organized in a pre-defined manner or does not have a pre-defined data model; thus it is not a good fit for a mainstream relational database.
For unstructured data, there are alternative platforms for storing and managing it. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications.
Example: Word, PDF, Text, Media logs.
Characteristics of Unstructured Data
o It is not based on a schema
o It is not suitable for a relational database
o Unstructured data makes up a large share of the data generated today (often estimated at 80–90%) and is growing rapidly
o It includes digital media files, Word documents, PDF files, etc.
o It is stored in NoSQL databases
3.6 SUMMARY
This chapter contains a detailed study of what data is, the different types of data, and data curation and its various steps; query languages and their various operators; structured, unstructured and semi-structured data with examples; aggregate and group functions; a detailed study of structured query languages like SQL and of non-procedural query languages; and XML, XQuery, XPath and JSON.
3.7 UNIT END QUESTIONS
1. What is Data? Discuss different types of data.
2. What is Data Curation?
3. Explain Query languages and their operations.
4. Explain in detail structured data.
5. Explain in detail unstructured data.
6. Write a note on semi structured data with example.
7. What is XML? Explain its advantages and disadvantages.
8. Write a note on
a. XQuery b. XPath c. JSON
9. Give the difference between structured and unstructured data.
3.8 REFERENCES
1. Doing Data Science, Rachel Schutt and Cathy O’Neil, O’Reilly, 2013
2. Mastering Machine Learning with R, Cory Lesmeister, PACKT Publication, 2015
3. Hands-On Programming with R, Garrett Grolemund, 1st Edition, 2014
4. An Introduction to Statistical Learning, James, G., Witten, D., Hastie, T., Tibshirani, R., Springer, 2015
5. http://www.icet.ac.in/Uploads/Downloads/1._MOdule_1_PDD_KQB__(1)%20(1).PDF
6. https://www.researchgate.net/figure/Diagram-of-the-digital-curation-lifecycle_fig3_340183022
4
DATABASE SYSTEMS
Unit Structure
4.0 Objective
4.1 Web Crawler & Web Scraping
4.1.1 Difference between Web Crawler and Web Scraping
4.2 Security and Ethical Considerations in Relation to Authenticating
And Authorizing
4.2.1 Access to Data on Remote Systems
4.3 Software Development Tools
4.3.1 Version Control/Source Control
4.3.2 Github
4.4 Large Scale Data Systems
4.4.1 Paradigms of Distributed Database Storage
4.4.1.1 Homogeneous Distributed Databases
4.4.1.2 Heterogeneous Distributed Databases
4.4.2 NoSQL
4.4.3 MongoDB
4.4.4 HBase
4.5 AWS (Amazon Web Services)
4.5.1 AWS Basic Architecture
4.5.2 Cloud Services
4.5.3 Map Reduce
4.0 OBJECTIVES
In this chapter the students will learn about:
Web Crawler
Web Scraping
Security and ethical considerations in relation to authenticating and authorizing access to data on remote systems
Software Development Tools
Version control terminology and functionalities
Github
Large Scale Data Systems
Distributed Database Storage
NoSQL
MongoDB
HBase
AWS
Cloud Services
Map Reduce
4.1 WEB CRAWLER & WEB SCRAPER
Web Crawler
A web crawler, also known as a web spider or search engine bot, takes content from the internet and then downloads and indexes it.
The main aim of a web crawler or bot is to learn what every webpage on the web is about, so that the content or information can be retrieved when the user needs it.
They are known as "web crawlers" because crawling is the technical term for automatically accessing a website and obtaining information or data with the help of software programs. These bots are operated by search engines. The search engine applies its search algorithms to the data collected by web crawlers and provides relevant links to that information or content. When a user types a search into Google, Bing, Yahoo or any other search engine, the search engine generates a list of webpages that contain the relevant content.
A web crawler is like someone who goes through all the books in a disorganized library and compiles a card catalog, so that anyone who visits the library can easily and quickly find the content or information they need.
To sort and categorize the library's books by topic, the organizer first reads the title and summary of each book to find out what it is about; if a reader needs a particular book, it can then be retrieved and used as required.
In short, a book corresponds to information on the web, which the crawler organizes in a systematic manner (sorting and indexing). The user can then retrieve whichever item is relevant and use it as per the need.
The crawling sequence starts with a certain set of known webpages and then follows the links (hyperlinks) from those pages to other pages; following the hyperlinks from those pages opens additional pages from which information or data can be gathered. In this way the internet is crawled by search engine bots.
Examples of web crawlers: Amazonbot (Amazon), Bingbot (Bing), Yahoo, Baiduspider (Baidu), Googlebot (Google), DuckDuckbot (DuckDuckGo), etc.
Search Indexing
Search indexing is like creating a library card catalog for the internet, so that a search engine can retrieve information or data when the user searches for it. It can also be compared to the index at the back of a book, which lists all the places in the book where a particular topic or phrase appears.
The main aim of search indexing is to make the text that appears on webpages searchable over the internet.
Metadata is data that tells search engines what a webpage is about. The meta description is what appears on search engine result pages.
How web crawlers work
A crawler is a program developed by vendors such as Google. The main job of these crawlers is to collect data and send it to Google or the respective search engine. The behaviour of the crawler comes from the commands in its programming script.
[source: techtarget.com/whatis/definition/crawler]
Crawling process: The crawler collects data from various websites that allow crawling and indexing. Once the data is collected, it is sent to the respective search engine, such as Google or any other search engine.
Indexing process: After the crawling process, Google or the respective search engine shelves the data based on its relevance and its importance to users. With the help of hyperlinks or URLs, the data present on various sites is processed and stored in the search engine's database.
Ranking process: After the indexing process is complete, when a user enters a query on the search engine (e.g., Google), the search engine shows results from the stored database to the user. Results are returned together with the relevant keywords. The ranking of a website on a particular search engine is determined mainly by its relevance.
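The sketch below imitates this crawl-and-index loop on a very small scale. It assumes the third-party requests and beautifulsoup4 packages are installed, uses a hypothetical seed URL, and omits the politeness rules (robots.txt, rate limiting) that real crawlers follow.

import requests                      # third-party packages, assumed installed:
from bs4 import BeautifulSoup        # pip install requests beautifulsoup4
from urllib.parse import urljoin

def crawl(start_url, max_pages=10):
    """Follow hyperlinks starting from a seed page and build a tiny title index."""
    to_visit, seen, index = [start_url], set(), {}
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=5).text
        soup = BeautifulSoup(html, "html.parser")
        # Index the page title against its URL, then queue the page's outgoing links
        index[url] = soup.title.string if soup.title else ""
        for a in soup.find_all("a", href=True):
            to_visit.append(urljoin(url, a["href"]))
    return index

# print(crawl("https://example.com"))   # hypothetical seed URL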
Web Scraper
Online scraping is a computerised technique for gathering copious volumes of data from websites. The majority of this data is unstructured, in HTML format, and is transformed into structured data in a database or spreadsheet so that it can be used in multiple applications. Web scraping can be done in a variety of ways to collect data from websites. Options include leveraging specific APIs, online services, or even writing your own code from scratch for web scraping. You may access the structured data of many huge websites, including Google, Twitter, Facebook, StackOverflow, and others, using their APIs.
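A corresponding scraping sketch, under the same package assumptions as the crawler above, pulls specific fields out of a page's HTML and turns them into structured records; the URL and the CSS class names used here are hypothetical and would differ on a real site.

import requests
from bs4 import BeautifulSoup    # assumed installed, as in the crawler sketch above

# Hypothetical page: product names and prices marked with CSS classes
# "product", "name" and "price" -- real sites will use different markup.
html = requests.get("https://example.com/products", timeout=5).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select(".product"):
    name = item.select_one(".name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append({"name": name, "price": price})

print(rows)   # structured records, ready for a database or spreadsheet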
Difference between Web Scraping and Web Crawling
1. Web Scraping is used for downloading information, whereas Web Crawling is used for indexing web pages.
2. A scraper need not visit all the pages of a website for information, whereas a crawler visits each and every page, until the last line, for information.
3. A web scraper does not obey robots.txt in most cases, whereas not all web crawlers obey robots.txt.
4. Web Scraping is done on both small and large scales, whereas Web Crawling is mostly employed at a large scale.
5. Application areas of Web Scraping include retail marketing, equity research and machine learning, whereas Web Crawling is used in search engines to give search results to the user.
6. Data de-duplication is not necessarily a part of Web Scraping, whereas it is an integral part of Web Crawling.
7. Web Scraping needs a crawl agent and a parser for parsing the response, whereas Web Crawling needs only a crawl agent.
8. ProWebScraper and Web Scraper.io are examples of web scrapers, whereas Google, Yahoo or Bing do Web Crawling.
4.2 SECURITY AND ETHICAL CONSIDERATIONS IN RELATION TO AUTHENTICATING AND AUTHORIZING
Authentication And Authorization for Storage System
Security is an important parameter for any data storage system. Various
security attacks that can be faced in any system can be:
1. Password guessing attack
2. Replay attack
3. Man-in-the-middle attack
4. Phishing attack
5. Masquerade attack
6. Shoulder surfing attack
7. Insider attack
Authentication and Authorization are two major processes used for the security of data on the remote system.
1. Denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks: These overwhelm a system's resources so that it cannot respond to
service requests.
2. Man-in-the-middle (MitM) attack : A MitM attack occurs when a
hacker inserts itself between communications of a client and a server.
It causes misuse of data.
3. Phishing and spear phishing attacks: A phishing attack is the act of sending emails that appear to be from trusted sources in order to obtain personal information or to influence users to do something.
4. Drive-by attack: Drive-by download attacks are used to spread malware. Hackers look for insecure websites and plant a malicious script into HTTP or PHP code on one of the pages. This script might install malware directly onto the computer of someone who visits the site, or it might re-direct the victim to a site controlled by the hackers. Drive-by downloads can happen when visiting a website, viewing an email message or viewing a pop-up window.
5. Password attack: Because passwords are used to authenticate users to an information system, obtaining passwords is a common and effective attack approach. Access to a person's password can be obtained by looking around the person's desk, "sniffing" the connection to the network to acquire unencrypted passwords, using social engineering, gaining access to a password database or outright guessing.
6. SQL injection attack: SQL injection has become a common issue with database-driven websites. It occurs when a malefactor executes a SQL query against the database via the input data passed from the client to the server.
7. Cross-site scripting (XSS) attack: XSS attacks use third-party web resources to run scripts in the victim's web browser or scriptable application.
8. Malware attack: Malicious software can be described as unwanted
software that is installed in your system without your consent.
Examples of data security technologies include data backups, data masking and data erasure.
A key data security technology measure is encryption, where digital data, software/hardware, and hard drives are encrypted so that they are made unreadable to unauthorized users and hackers.
One of the most commonly used methods for data security is the use of authentication and authorization.
With authentication, users must provide a password, code, biometric data, or some other form of data to verify the identity of the user before access to a system or data is granted.
4.2.1 ACCESS TO DATA ON REMOTE SYSTEMS
There are various major processes used for the security of data on a remote system.
Authentication
It is a process for confirming the identity of the user. The basic way of providing authentication is through a username and password, but many a time this approach fails due to hackers or attackers: if a hacker is able to crack the password and username, then even the hacker will be able to use the system.
Authentication is part of a three -step process for gaining access to digital
resources:
1. Identification — Who are you?
2. Authentication —Prove it.
3. Authorization —Do you have permission?
Identification requires a user ID like a username. But without identity
authentication, there’s no way to know if that username actually belongs
to them. That's where authentication comes in — pairing the username with a password or other verifying credentials.
The most common method of authentication is a unique login and
password, but as cybersecurity threats have increased in recent years, most
organizations use and recommend additional authentication factors for
layered security.
Authorization
It follows the authentication step, which means that once the authentication of a particular user is done, the next step is authorization, which is to check what rights are given to that user.
During the process of authorization, policies are made which define the authorities of that user.
Various algorithms used for authentication and authorization are:
1. RSA algorithm.
2. AES algorithm and MD5 hashing algorithm.
3. OTP password algorithm.
4. Data encryption standard algorithm.
5. Rijndael encryption algorithm.
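As a simplified illustration of how a storage system can verify credentials without keeping plain-text passwords, the sketch below uses Python's standard hashlib (PBKDF2) and hmac modules; it is not an implementation of the specific algorithms listed above, and the salt size and iteration count are illustrative choices only.

import hashlib, hmac, os

def hash_password(password):
    """Store only a salted hash of the password, never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)   # constant-time comparison

salt, digest = hash_password("s3cret")
print(verify_password("s3cret", salt, digest))   # True  -> authentication succeeds
print(verify_password("guess", salt, digest))    # False -> access is denied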
4.3 SOFTWARE DEVELOPMENT TOOLS
Software development tools play a crucial role in data science workflows,
especially as projects become more complex and involve larger amounts
of data.
Here are some of the most commonly used software development tools in
data science:
4.3.1 VERSION CONTROL/SOURCE CONTROL
Version control systems (VCS)
Basically, version control (also known as source control) is the practice of tracking and managing changes to software code.
Any multinational company may face several problems such as collaboration among employees, storing the several versions of files being made, and backing up data. All these challenges must be resolved for a company to be successful. This is when a Version Control System comes into the picture.
In other words, it allows developers to track changes to code and
collaborate on projects with other team members.
Git is the most commonly used VCS in data science, and platforms like
GitHub and GitLab provide hosting services for Git repositories.
Let’s try to understand the process with the help of this diagram
Source: ( https://youtu.be/Yc8sCSeMhi4 )
There are three workstations, or three different developers at three different locations, and there is one repository acting as a server. The workstations use that repository either for committing or for updating their tasks.
There may be a large number of workstations using a single server repository. Each workstation has its own working copy, and all these workstations save their source code into that server repository.
This makes it easy for any developer to access the work being done using the repository. If any specific developer's system breaks down, the work won't stop, as there will be a copy of the source code in the central repository.
Finally, some of the most widely used version control systems in the market are Git, Subversion (SVN), and Mercurial.
Integrated Development Environments (IDEs)
IDEs are software applications that provide a comprehensive environment
for coding, debugging, and testing code.
Popular IDEs for data science include
PyCharm
Spyder
Jupyter Notebook
Package managers
Package managers make it easy to install, update, and manage software
libraries and dependencies.
Popular package managers for Python include
pip
conda
Data analysis and visualization tools
Data analysis and visualization tools help data scientists to explore, clean,
and visualize data.
Popular tools include
Pandas
NumPy
Matplotlib
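A tiny sketch of this explore–clean–visualize loop using Pandas and Matplotlib (both assumed to be installed) on made-up data:

import pandas as pd                 # assumed installed: pip install pandas matplotlib
import matplotlib.pyplot as plt

# A small, made-up dataset used only to illustrate explore/clean/visualize steps
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, None, 150]})

df["sales"] = df["sales"].fillna(df["sales"].mean())   # clean: fill the missing value
print(df.describe())                                   # explore: summary statistics

df.plot(x="month", y="sales", kind="bar")              # visualize
plt.show()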
Automated testing tools
Automated testing tools help to ensure the quality and correctness of
code.
Popular tools include
pytest
unittest
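A minimal pytest example: the file below defines a small function and two tests that pytest discovers and runs automatically (the function and file name are illustrative).

# test_stats.py -- run with:  pytest test_stats.py
def mean(values):
    return sum(values) / len(values)

def test_mean_of_known_values():
    assert mean([2, 4, 6]) == 4

def test_mean_single_value():
    assert mean([10]) == 10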
Deployment tools
Deployment tools are used to deploy models and applications to
production environments.
Popular deployment tools include
Docker
Kubernetes
In addition to these tools, data scientists may also use cloud platforms
such as AWS, Google Cloud, and Microsoft Azure for data storage,
computing resources, and machine learning services.
4.3.2 GITHUB
GitHub is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project.
Projects on GitHub.com can be accessed and managed using the standard Git command-line interface; all standard Git commands work with it. GitHub.com also allows users to browse public repositories on the site. Multiple desktop clients and Git plugins are also available. The site provides social networking-like functions such as feeds, followers and wikis. Anyone can browse and download public repositories, but only registered users can contribute content to repositories.
Git
Git is a DevOps tool used for source code management. It is a free and open-source version control system used to handle small to very large projects efficiently. Git is used for tracking changes in the source code, enabling multiple developers to work together on non-linear development. While Git is a tool used to manage multiple versions of source code edits that are then transferred to files in a Git repository, GitHub serves as a location for uploading copies of a Git repository.
Need of Github
It is used for storing, tracking, and collaborating on software projects. It makes it easy for developers to share code files and collaborate with fellow developers on open-source projects. GitHub also serves as a social networking site where developers can openly network, collaborate, and pitch their work.
Languages used in GitHub: Core languages for GitHub features include C, C++, C#, Go, Java, JavaScript, PHP, Python, Ruby, Scala, and TypeScript.
4.4 LARGE SCALE DATA SYSTEMS
To store very large data, normal databases cannot be used, and hence databases like NoSQL stores such as MongoDB and HBase are good options for large-scale data systems. Large-scale systems do not always have centralized data storage. The distributed database approach is widely used in many applications.
4.4.1 PARADIGMS OF DISTRIBUTED DATABASE STORAGE
A distributed database is basically a database that is not limited to one system; it is spread over different sites, i.e., on multiple computers or over a network of computers. A distributed database system is located on various sites that don't share physical components. This may be required when a particular database needs to be accessed by various users globally. It needs to be managed such that, for the users, it looks like one single database.
Distributed databases are capable of modular development, meaning that systems can be expanded by adding new computers and local data to the new site and connecting them to the distributed system without interruption. When failures occur in centralized databases, the system comes to a complete stop. When a component fails in distributed database systems, however, the system will continue to function at reduced performance until the error is fixed. Data is physically stored across multiple sites. Data in each site can be managed by a DBMS independent of the other sites. The processors in the sites are connected via a network. They do not have any multiprocessor configuration. A distributed database is not a loosely connected file system.
A distributed database incorporates transaction processing, but it is not synonymous with a transaction processing system.
Distributed database systems are mainly classified as homogeneous and heterogeneous databases.
Figure: Distributed Database
4.4.1.1 HOMOGENEOUS DISTRIBUTED DATABASES
Homogeneous Distributed Databases:
▪ Homogeneous distributed databases are systems in which identical DBMS and OS are used at all sites.
▪ Homogeneous distributed databases have identical software, and every site knows what is happening at the other sites and where they are located.
▪ Homogeneous distributed databases are further classified as autonomous and non-autonomous.
▪ In an autonomous database, each site is independent in its processing; only the integration is done using some controlling application.
▪ In non-autonomous databases, data is distributed across the various nodes or sites and one node manages all the other nodes, as in a client-server model.
Figure: Homogeneous distributed system
4.4.1.2 HETEROGENEOUS DISTRIBUTED DATABASES
Heterogeneous Distributed Databases:
In heterogeneous distributed databases, every site has a different database, OS and different software.
In such systems querying is complex, as the environment and all the tools are different.
Heterogeneous distributed databases are further classified as federated and un-federated databases.
In a federated database system, every site is independent of the others and hence acts as an individual database system.
In an un-federated database system, there is a single central coordinator module through which all the sites communicate.
Figure: Heterogeneous database system
4.4.2 NOSQL
NoSQL is a broad term that refers to non -relational databases that don't
use the traditional SQL querying language. NoSQL databases come in
different types, such as key-value stores, document-oriented databases, graph databases, and column-family stores.
1. Schema flexibility: NoSQL databases allow for flexible schema
designs that can be easily adapted to changing data requirements. This
allows for more agile development and easier scaling of the database.
2. Horizontal scalability: NoSQL databases are designed to scale
horizontally, meaning that new nodes can be added to the cluster to
increase storage and processing capacity. This allows for virtually
unlimited scalability and high availability.
3. High performance: NoSQL databases are designed for high
performance and low latency, which makes them well -suited for
handling real -time data processing and analytics workloads.
4. Replication and availability: Most NoSQL databases provide built -in
replication and fault -tolerance features that ensure high availability
and data durability.
5. Distributed architecture: NoSQL databases are typically designed as
distributed systems, which allows them to distribute data across
multiple nodes in the cluster. This enables efficient handling of large
volumes of data and high performance at scale.
6. No fixed schema: Unlike traditional relational databases, NoSQL
databases do not require a predefined schema. This means that you can
add new fields or attributes to the data on the fly, without having to
modify the entire database schema.
4.4.3 MONGODB
MongoDB is a document-oriented NoSQL database that stores data in the form of JSON-like documents.
Automatic sharding: MongoDB can automatically split data across multiple servers, allowing it to handle large volumes of data and scale horizontally.
Indexing: MongoDB supports indexes on any field, including fields within nested documents and arrays.
Rich query language: MongoDB supports a rich query language that includes filtering, sorting, and aggregation.
Dynamic schema: MongoDB's flexible schema allows you to add new fields or change existing ones without affecting the existing data.
Replication: MongoDB supports replica sets, which provide automatic failover and data redundancy.
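A minimal sketch using the third-party pymongo driver (assumed to be installed) against a local MongoDB server; the database, collection and document contents are hypothetical.

from pymongo import MongoClient    # assumed installed: pip install pymongo

# Assumes a MongoDB server is running locally on the default port
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]                          # hypothetical database name

# Documents are JSON-like and need no fixed schema
db.products.insert_one({"name": "laptop", "price": 55000, "tags": ["electronics"]})

# Rich query language: filtering and sorting
for doc in db.products.find({"price": {"$lt": 60000}}).sort("price"):
    print(doc["name"], doc["price"])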
4.4.4 HBASE
HBase is also a NoSQL database, but it is a column-oriented database
built on top of Hadoop. HBase is an excellent choice for applications that
require random read/write access to large amounts of data.
Built on Hadoop: HBase is built on top of Hadoop, allowing it to
leverage Hadoop's distributed file system (HDFS) for storage and
MapReduce for processing.
Strong consistency: HBase provides strong consistency guarantees,
ensuring that all reads and writes are seen by all nodes in the cluster.
Scalability: HBase can scale to handle petabytes of data and billions
of rows.
Data compression: HBase provides data compression options,
reducing the amount of storage required for large datasets.
Transactions: HBase supports multi-row transactions, allowing for
complex operations to be executed atomically.
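For completeness, a minimal sketch of random reads and writes through the third-party happybase client is shown below; it assumes an HBase Thrift server is reachable on localhost and that a table named metrics with column family cf already exists — all of these names are illustrative.

import happybase    # third-party client, assumed installed: pip install happybase

# Assumes an HBase Thrift server on localhost and an existing table "metrics"
# with a column family "cf" (both names are illustrative)
connection = happybase.Connection("localhost")
table = connection.table("metrics")

# Random write and read keyed by row key
table.put(b"row-2024-01-01", {b"cf:visits": b"1523"})
row = table.row(b"row-2024-01-01")
print(row[b"cf:visits"])    # b'1523'

connection.close()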
4.5 AWS (AMAZON WEB SERVICES)
Amazon Web Service
Amazon Web Services is a platform that offers scalable, easy-to-use, flexible and cost-effective cloud computing platforms, APIs and solutions to individuals, businesses and companies. AWS provides different IT resources available on demand. It also provides different services such as infrastructure as a service (IaaS), platform as a service (PaaS) and packaged software as a service (SaaS). Amazon's first cloud computing service was S3 (Simple Storage Service), released in March 2006. Using AWS, instead of building large-scale infrastructure and storage, companies can opt for Amazon Cloud Services, where they can get all the infrastructure they could ever need.
4.5.1 AWS BASIC ARCHITECTURE
Figure: Amazon web services architecture
Source: [Tutorialspoint.com]
This includes EC2, S3, EBS Volume.
EC2 stands for Elastic Compute Cloud. EC2 provides the opportunity for users to choose a virtual machine as per their requirements. It gives the user the freedom to choose between a variety of storage options, configurations, services, etc.
S3 stands for Simple Storage Service, using which online backup and archiving of data becomes easier. It allows users to store and retrieve various types of data using API calls. It doesn't contain any computing element.
EBS, also known as Elastic Block Store, provides persistent block storage volumes which are used in instances created by EC2. It has the ability to replicate itself to maintain its availability throughout.
The important cloud services provided by AWS, according to various categories, are given below:
1. Compute
Amazon EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It allows organisations to obtain and configure virtual compute capacity in the cloud. Amazon EC2 is an example of Infrastructure as a Service (IaaS).
AWS Elastic Beanstalk: AWS Elastic Beanstalk is a Platform as a Service that facilitates quick deployment of your applications by providing all the application services that you need. Elastic Beanstalk supports a large range of platforms like Node.js, Java, PHP, Python, and Ruby.
2. Networking
Amazon Route 53: Amazon Route 53 is a highly available and scalable
cloud Domain Name System (DNS) web service. It is designed to give
developers and businesses an extremely reliable and cost-effective way to route end users to Internet applications by translating human-readable
names, such as www.geeksforgeeks.com, into the numeric IP addresses
that computers use to connect to each other. Amazon Route 53 is fully
compliant with IPv6 as well.
3. Storage
Amazon S3 (Simple Storage Service): Amazon Simple Storage Service
(Amazon S3) is object storage with a simple web service interface to store
and retrieve any amount of data from anywhere on the web. You can use
Amazon S3 as primary storage for cloud-native applications, or as a target for
backup and recovery and disaster recovery.
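A brief sketch of these API calls using boto3, the AWS SDK for Python (assumed to be installed and configured with credentials); the bucket and object names are hypothetical and the local file report.csv is assumed to exist.

import boto3    # assumed installed: pip install boto3
                # also assumes AWS credentials are configured (e.g. via "aws configure")

s3 = boto3.client("s3")
bucket = "my-example-bucket"        # hypothetical bucket name

# Store an object and then list the bucket's contents via API calls
s3.upload_file("report.csv", bucket, "backups/report.csv")
response = s3.list_objects_v2(Bucket=bucket)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])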
Amazon Glacier: Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving and long-term backup. Data stored in Amazon Glacier takes several hours to retrieve, which is why it's ideal for archiving.
4. Databases
Amazon RDS (Relational Database Service): Amazon Relational Database
Service (Amazon RDS) makes it easy to set up, operate, and scale a
relational database in the cloud. You can find Amazon RDS available on several database instance types, optimised for memory, performance, or I/O.
Benefits of AWS
High Availability.
Parallel Processing.
Security.
Low Latency.
Fault Tolerance and disaster recovery.
Cost-effective.
4.5.2 CLOUD SERVICES
What is Cloud Computing?
Cloud computing is a technology that allows users to use storage, applications, and services over the internet, without having to own or manage their own infrastructure. Instead of having to purchase, configure, and maintain
hardware and software, users can simply rent resources from cloud service
providers. There are several types of cloud computing models, including:
1. Infrastructure as a Service (IaaS): Provides users with access to
computing resources such as servers, storage, and networking.
2. Platform as a Service (PaaS): Provides users with access to a
platform for developing, testing, and deploying applications.
3. Software as a Service (SaaS): Provides users with access to software
applications over the internet.
Cloud computing has revolutionized the way businesses access computing resources, without having to invest in expensive infrastructure. As a result, cloud computing has become an essential technology for businesses of all sizes.
Advantages of Cloud Computing
There are numerous advantages of cloud computing, some of which include:
1. Scalability: Cloud computing offers the ability to quickly scale up or
down computing resources based on demand. This can be done
automatically or manually, making it easier for businesses to manage spikes in usage or traffic.
2. Cost-effectiveness: Cloud computing reduces the need for businesses
to invest in expensive hardware and infrastructure. Instead, they can
rent computing resources from cloud service providers on a pay-as-you-go basis. This allows businesses to only pay for what they use,
reducing overall costs.
3. Accessibility: With cloud computing, users can access computing resources from anywhere with an internet connection. This means that employees can work remotely and collaborate on projects from different locations.
4. Security: Cloud service providers offer robust security measures, including encryption, firewalls, and access controls to protect data and applications. Additionally, cloud providers often employ dedicated security teams to monitor and respond to potential security threats.
5. Reliability: Cloud service providers offer high levels of uptime and availability, ensuring that resources are always accessible when needed. Additionally, cloud providers typically have redundant infrastructure in place to ensure that services remain available even if there is an outage in one location.
6. Flexibility: Cloud computing allows businesses to experiment with new applications and services without having to commit to long-term investments. This means that businesses can test new ideas quickly and easily, without worrying about the cost of hardware or infrastructure. Overall, cloud computing offers numerous advantages for businesses of all sizes, making it a popular choice for many organizations.
Disadvantages of Cloud Computing
While cloud computing offers many benefits, there are also some potential disadvantages to consider. Some of these include:
1. Dependence on the Internet: Cloud computing requires a reliable
internet connection in order to access computing resources. If the
internet is slow or unavailable, this can impact the ability to access
critical resources.
2. Security concerns: While cloud providers often offer robust security
measures, there is still the potential for security breaches and data
theft. Additionally, if a cloud provider experiences a security breach,
this can impact multiple customers at once.
3. Limited control: When using cloud computing, businesses may have limited control over their computing resources. This can make it more
difficult to customize applications or infrastructure to meet specific
needs.
4. Downtime: While cloud providers offer high levels of uptime, there is
still the potential for downtime due to outages, maintenance, or other
issues. This can impact productivity and cause disruption to business
operations.
5. Cost: While cloud computing can be cost-effective in some cases, it
can also be expensive if usage levels are high or if resources are not
managed effectively. Additionally, cloud providers may raise prices or
change their pricing models over time, which can impact the cost of
using cloud computing.
6. Data privacy and compliance: Businesses may face challenges in
ensuring that data stored in the cloud is compliant with regulatory
requirements. Additionally, some organizations may have concerns
about data privacy and how data is used by cloud providers.
Overall, while cloud computing offers many benefits, it is important for
businesses to carefully consider the potential drawbacks and risks before
deciding to adopt cloud computing.
Need for Cloud Computing
The need for cloud computing arises from the fact that businesses require
access to powerful computing resources to support their operations, but
investing in and maintaining their own infrastructure can be costly and
time-consuming. Cloud computing allows businesses to access computing
resources over the internet, rather than having to build and maintain their
own infrastructure. Overall, cloud computing addresses many of the key
needs that businesses face, including scalability, flexibility, cost-efficiency, reliability, security, and innovation. As a result, cloud
computing has become an essential technology for many businesses.
4.5.3 MAP REDUCE
MapReduce
MapReduce is a programming model and data processing framework used for parallel computing of large datasets on clusters of commodity hardware. It was originally developed by Google to process large amounts of data in a distributed environment. The MapReduce programming model allows developers to write simple and scalable code for processing large datasets. It also provides fault tolerance and automatic parallelization, making it well-suited for big data applications.
The basic idea of MapReduce is to split a large data processing task into smaller sub-tasks and execute them in parallel across a cluster of computers. The sub-tasks are divided into two phases:
1. Map phase: In this phase, the input data is divided into smaller chunks and processed by individual nodes in the cluster. Each node processes its assigned data and produces key-value pairs as output.
2. Reduce phase: In this phase, the output of the map phase is collected and processed to produce the final result. The reduce phase takes in the key-value pairs produced by the map phase and applies a reduce function to aggregate the values with the same key.
MapReduce is widely used in big data processing because it allows developers to write code that can be easily parallelized and distributed across a large number of machines. This enables the processing of very large datasets that would otherwise be difficult or impossible to handle with traditional data processing techniques.
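MapReduce itself runs on a cluster (for example, Hadoop); the following single-machine Python sketch only imitates the two phases for a word-count job, to make the map → group → reduce flow concrete. The input documents are made up for illustration.

from collections import defaultdict

documents = ["big data is big", "map reduce processes big data"]

# Map phase: each input chunk is turned into (key, value) pairs
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/group step: collect all values that share the same key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # {'big': 3, 'data': 2, 'is': 1, 'map': 1, 'reduce': 1, 'processes': 1}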
Uses of MapReduce
Scalability: MapReduce is highly scalable as it allows parallel
processing of large datasets across a large number of machines. This
makes it ideal for handling big data workloads.
Fault tolerance: MapReduce is designed to handle failures in the
cluster. If a machine fails, the MapReduce framework automatically
reassigns the tasks to other machines, ensuring the job is completed
without any data loss or errors.
Flexibility: MapReduce is flexible as it can be used with a variety of
data storage systems, including Hadoop Distributed File System
(HDFS), Amazon S3, and Google Cloud Storage.
Cost-effective: MapReduce is cost-effective as it uses commodity
hardware to process data. This makes it an affordable solution for
handling big data workloads.
Efficient: MapReduce is efficient because it performs data processing
operations in parallel, which reduces the overall processing time. This
makes it possible to process large datasets in a reasonable amount of
time.
Overall, MapReduce is used to process and analyze large volumes of data in a distributed computing environment, making it an essential tool for handling big data workloads.
4.6 SUMMARY
This chapter gives a brief introduction to database systems. After studying this chapter, you will have learned about the concept of web crawling and web scraping, the various security and ethical considerations in relation to authentication and authorization, software development tools, version control and GitHub, a detailed study of large-scale systems with their different types, namely homogeneous and heterogeneous distributed systems, NoSQL, HBase, MongoDB, AWS, cloud services and MapReduce.
4.7 UNIT END QUESTIONS
Q.1) What is Web Crawling and Web Scraping?
Q.2) Give the difference between Web Crawling and Web Scraping.
Q.3) Explain briefly Authentication and Authorization for storage systems.
Q.4) Elaborate the concept of version control.
Q.5) Write a note on GitHub.
Q.6) What is Distributed Database Storage? Explain with its types.
Q.7) Write briefly about large scale data systems.
Q.8) Give the difference between Homogeneous and Heterogeneous data storage.
Q.9) Write a note on
a) NoSQL b) HBase c) MongoDB
Q.10) Explain AWS in detail.
Q.11) What is MapReduce? Explain with its architecture.
Q.12) What is cloud computing? Explain with its types.
4.8 REFERENCES
1. Doing Data Science, Rachel Schutt and Cathy O’Neil, O’Reilly, 2013
2. Mastering Machine Learning with R, Cory Lesmeister, PACKT Publication, 2015
3. Hands-On Programming with R, Garrett Grolemund, 1st Edition, 2014
4. An Introduction to Statistical Learning, James, G., Witten, D., Hastie, T., Tibshirani, R., Springer, 2015
5. https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/#:~:text=A%20web%20crawler%2C%20or%20spider,appear%20in%20search%20engine%20results
6. https://capsicummediaworks.com/web-crawler-guide/
5
INTRODUCTION TO MODEL SELECTION
Unit Structure
5.0 Objectives
5.1 Introduction
5.2 Regularization
5.2.1 Regularization techniques
5.3 Bias/variance tradeoff
5.3.1 What is Bias?
5.3.2 What is Variance?
5.3.3 Bias -Variance Tradeoff
5.4 Parsimony Model
5.4.1 How to choose a Parsimonious Model
5.4.1.1 AIC
5.4.1.2 BIC
5.4.1.3 MDL
5.5 Cross validation
5.5.1 Methods used for Cross-Validation
5.5.2 Limitations of Cross-Validation
5.5.3 Applications of Cross-Validation
5.6 Summary
5.7 List of References
5.8 Unit End Exercises
5.0 OBJECTIVES
To understand the factors that need to be considered while selecting a
model
To get familiar with the regularization techniques and bias-variance tradeoffs
To understand the parsimony and cross-validation techniques
5.1 INTRODUCTION
The process of choosing a single machine learning model out of a group of
potential candidates for a training dataset is known as model selection.
Model selection is a procedure that can be used to compare models of the
same type that have been set up with various model hyperparameters (e.g.,
different kernels in an SVM) and models of other types (such as logistic regression, SVM, KNN, etc.).
A "good enough" model is particular to your project and might mean
many different things, including:
A design that satisfies the demands and limitations of project
stakeholders
A model that, given the time and resources at hand, is suitably skilled
A skilled model as opposed to unsophisticated models
A model that performs well compared to other models that have been
examined
A model that is proficient in terms of current technology
5.2 REGULARIZATION
The term "regularization" describes methods for calibrating machine
learning models to reduce the adjusted loss function and avoid overfitting
or underfitting.
Figure 1: Regularization on an over-fitted model
Using regularization, we can fit our machine learning model so that it generalizes properly to a particular test set, which lowers the errors on that test set.
5.2.1 Regularization techniques
There are two main types of regularizat ion techniques: Ridge
Regularization and Lasso Regularization.
1] Ridge Regularization
It is also referred to as Ridge Regression and modifies over- or under-fitted models by applying a penalty equal to the sum of the squares of the coefficient magnitudes.
As a result, coefficients are produced and the mathematical function that represents our machine learning model is minimized. The coefficients' magnitudes are squared and summed. Ridge Regression applies regularization by shrinking the coefficients. The cost function of ridge regression is shown below:
Figure 2: Cost Function of Ridge Regression
The penalty term is represented by lambda (λ) in the cost function. We can control the penalty term by varying the value of λ. The magnitude of the coefficients decreases as the penalty increases, so the parameters are shrunk. As a result, it serves to prevent multicollinearity and, through coefficient shrinkage, lowers the model's complexity.
Have a look at the graph below, which shows linear regression:
Figure 3: Linear regression model
Cost function = Loss + λ × Σ w²
For the linear regression line, let's consider two points that are on the line,
Loss = 0 (considering the two points on the line)
λ = 1
w = 1.4
Then, Cost function = 0 + 1 × 1.4²
= 1.96
For Ridge Regression, let's assume,
Loss = 0.3² + 0.2² = 0.13
λ = 1
w = 0.7
Then, Cost function = 0.13 + 1 × 0.7²
= 0.62
Figure 4: Ridge regression model
Comparing the two models, with all data points, we can see that the Ridge
regression line fits the model more accurately than the linear regression
line.
Figure 5: Optimization of model fit using Ridge Regression
2] Lasso Regularization
By imposing a penalty equal to the total of the absolute values of the coefficients, it alters models that are either overfitted or underfitted.
Lasso regression likewise attempts coefficient minimization, but it uses the absolute values of the coefficients rather than squaring their magnitudes. Since negative coefficients occur, the plain sum of the coefficients could even be 0, which is why absolute values are used. Think about the Lasso regression cost function:
Figure 6: Cost function for Lasso Regression
We can control the coefficient values by controlling the penalty terms, just
like we did in Ridge Regression. Again, consider a Linear Regression
model:
Figure 7: Linear Regression Model
Cost function = Loss + λ × Σ |w|
For the Linear Regression line, let's assume,
Loss = 0 (considering the two points on the line)
λ = 1
w = 1.4
Then, Cost function = 0 + 1 × 1.4
= 1.4
For Lasso Regression, let's assume,
Loss = 0.3² + 0.1² = 0.1
λ = 1
w = 0.7
Then, Cost function = 0.1 + 1 × 0.7
= 0.8
Figure 8: Lasso regression
Comparing the two models, with all data points, we can see that the Lasso
regression line fits the model more accurately than the linear regression
line.
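As a rough sketch of how the two penalties behave in practice, the snippet below fits ordinary linear regression, Ridge and Lasso on a small synthetic dataset using scikit-learn (assumed to be installed); the alpha argument plays the role of the penalty term λ above, and the data and alpha values are arbitrary illustrations.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso   # pip install scikit-learn

# Tiny synthetic dataset: y really depends only on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Ridge shrinks all coefficients towards zero, while Lasso can drive some
# of them exactly to zero.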
5.3 BIAS/ VARIANCE TRADEOFF
5.3.1 What is Bias?
Our model will examine our data and look for patterns before making
predictions. We can draw conclusions about specific cases in our data
using these patterns. Following training, our model picks up on these
trends and uses them to predict the test set.
The bias is the discrepancy between our actual values and the predictions.
In order for our model to be able to forecast new data, it must make some
basic assumptions about our data.
Figure 9: Bias
When the bias is significant, our model's assumptions are too simplistic,
and the model is unable to capture the crucial aspects of our data. As a
result, our model cannot successfully analyze the testing data because it
has not been able to recognize patterns in the training data. If so, our
model is unable to operate on fresh data and cannot be put into use.
Underfitting refers to the situation where the model is unable to recognize
patterns in our training set and hence fails for both seen and unseen data.
The following figure provides an illustration of underfitting. The line of best
fit is a straight line that doesn't go through any of the data points, as can be
seen by the model's lack of pattern detection in our data. The model was
unable to effectively train on the provided data and is also unable to
predict fresh data.
Figure 10: Underfitting
5.3.2 What is Variance?
Variance is the complete opposite of bias. Our model is given a limited number of opportunities to "view" the data during training in order to look for patterns. Insufficient time spent working with the data will result in bias, because patterns won't be discovered. On the other hand, if our model is given access to the data too frequently, it will only be able to train very well for that data. The majority of patterns in the data will be captured, but it will also learn from extraneous data, or noise, that is there.
Variance can be thought of as the model's sensitivity to changes in the data. Our model might learn from the noise. This will lead our model to value unimportant features highly.
Figure 11: Example of Variance
We can see from the above picture how effectively our model has learned from the training data, which has trained it to recognize cats. Nevertheless, given fresh information, like the image of a fox, our model predicts it to be a cat, because that is what it has learnt to do. When variance is high, our model will catch all the properties of the data provided, including the noise; it will adjust to the data and predict it extremely well. However, when given new data, it is unable to forecast, since it is too specific to the training data.
As a result, while our model will perform admirably on the training data and achieve high accuracy, it will underperform on brand-new, unforeseen data. The model won't be able to forecast new data very effectively, because the new data may not have exactly the same characteristics. Overfitting is the term for this.
Figure 12: Over-fitted model, where we see model performance on a) training data and b) new data
5.3.3 Bias -Variance Tradeoff
We need to strike the ideal balance between bias and variance for every
model. This only makes sure that we record the key patterns in our model
and ignore the noise it generates. The term for this is bias -variance
tradeoff. It aids in optimizing and maintaining the lowest feasible level of
inaccuracy in our model.
A model that has been optimized will be sensitive to the patterns in our
data while also being able to generalize to new data. This should have a
modest bias and variance to avoid overfitting and underfittin g.
Figure 13: Error in Training and Testing with high Bias and Variance
We can observe from the above figure that when bias is large, the error in both the training set and the test set is also high. When the variance is high, the model performs well on the training set and the training error is low, but the error on the testing set is significant. We can see that there is a zone
in the middle where the bias and variance are perfectly balanced and the
error in both the training and testing set is minimal.
Figure 14: Bull's Eye Graph for Bias and Variance
The bull's eye graph above clarifies the bias and variance tradeoff. When the data is concentrated in the center, i.e., at the target, the fit is optimal. We can see that the error in our model grows as we move farther and farther from the center. The ideal model has low bias and low variance.
5.4 PARSIMONY MODEL
A parsimonious model is one that employs the fewest number of
explanatory variables necessary to reach the desired level of goodness of
fit.
The theory behind this kind of model is Occam's Razor, often known as
the "Principle of Parsimony," which holds that the best explanation is
usually the simplest one.
In terms of statistics, a model with fewer parameters but a reasonable
degree of goodness of fit ought to be chosen over one with many
parameters but a marginally higher level of goodness of fit.
This is due to two factors:
1. It is simpler to interpret and comprehend parsimonious models. Less
complicated models are simpler to comprehend and justify.
2. Parsimonious models typically exhibit higher forecasting accuracy.
When used on fresh data, models with fewer parameters typically
perform better.
To demonstrate these concepts, think about the following two situations.
Example 1: Parsimonious Models = Simple Interpretation
Assume that we wish to create a model to forecast house prices using a set
of real estate-related explanatory factors. Take into account the two models below, together with their adjusted R-squared values:
Model 1:
Equation: House price = 8,830 + 81*(sq. ft.)
Adjusted R2: 0.7734
Model 2:
Equation: House price = 8,921 + 77*(sq. ft.) + 7*(sq. ft.)^2 – 9*(age) + 600*(rooms) + 38*(baths)
Adjusted R2: 0.7823
While the second model includes five explanatory variables and only a marginally higher adjusted R2, the first model has just one explanatory variable with an adjusted R2 of 0.7734.
According to the parsimony principle, we would like to choose the first
model since it is simpler to comprehend and explain and has nearly the
same ability to explain the fluctuation in home prices as the second model. For instance, according to the first model, an increase of one unit in a home's square footage corresponds to an $81 rise in the average price of a
home. That is easy to comprehend and explain.
The coefficient estimates in the second model, however, are significantly more challenging to understand. For instance, if the house's square footage, age, and number of bathrooms are all kept the same, adding
one room to the home will boost the price by an average of $600. That is
considerably more difficult to comprehend and justify.
Example 2: Parsimonious Models = Better Predictions
Because they are less prone to overfitting the initial dataset, parsimonious models also tend to make more accurate predictions on fresh data sets.
In comparison to models with fewer parameters, models with more
parameters typically result in tighter fits and higher R2 values. Sadly, if a
model has too many parameters, the model may end up fitting the data's
noise or "randomness" rather than the actual underlying link between the
explanatory and response variables.
This indicates that compared to a simpler model with fewer parameters, a
very complicated model with many parameters is likely to perform poorly
on a fresh dataset that it hasn't seen before.
5.4.1 How to choose a Parsimonious Model
Model selection could be the subject of an entire course, but ultimately,
picking a parsimonious model comes down to picking one that performs
well based on some criteria.
Typical metrics that assess a model's effectiveness on a training dataset
and the quantity of its parameters include:
5.4.1.1 Akaike Information Criterion (AIC)
The AIC of a model can be calculated as:
AIC = -2/n * LL + 2 * k/n
where:
n: Number of observations in the training dataset.
LL: Log-likelihood of the model on the training dataset.
k: Number of parameters in the model.
The AIC of each model may be determined using this procedure, and the
model with the lowest AIC value will be chosen as the best model.
When compared to the next method, BIC, this strategy tends to prefer
more intricate models.
5.4.1.2 Bayesian Information Criterion (BIC)
The BIC of a model can be calculated as:
BIC = -2 * LL + log(n) * k
where:
n: Number of observations in the training dataset.
log: The natural logarithm (with base e)
LL: Log-likelihood of the model on the training dataset.
k: Number of parameters in the model.
Using this method, you can calculate the BIC of each model and then
select the model with the lowest BIC value as the best model.
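As a rough sketch of how these criteria are applied in practice, the code below computes AIC and BIC directly from the formulas given above for two hypothetical candidate models; the sample size, log-likelihoods and parameter counts are invented numbers used only for illustration.

import math

def aic(n, log_likelihood, k):
    # Normalized AIC exactly as stated in the text: AIC = -2/n * LL + 2 * k/n
    return -2 / n * log_likelihood + 2 * k / n

def bic(n, log_likelihood, k):
    # BIC = -2 * LL + log(n) * k, using the natural logarithm
    return -2 * log_likelihood + math.log(n) * k

# Hypothetical candidate models (illustrative values only)
candidates = {
    "model_1 (1 predictor)":  {"n": 100, "LL": -520.3, "k": 3},
    "model_2 (5 predictors)": {"n": 100, "LL": -515.9, "k": 7},
}

for name, m in candidates.items():
    print(name,
          "AIC=%.4f" % aic(m["n"], m["LL"], m["k"]),
          "BIC=%.2f" % bic(m["n"], m["LL"], m["k"]))
# The candidate with the lowest AIC (or BIC) value would be preferred.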
5.4.1.3. Minimum Description Length (MDL)
The MDL is a way of evaluating models that comes from the field of
information theory. It can be calculated as:
MDL = L(h) + L(D | h)
where:
h: The model.
D: Predictions made by the model.
L(h): Number of bits required to represent the model.
L(D | h): Number of bits required to represent the predictions from the
model on the training data.
Using this method, you can calculate the MDL of each model and then
select the model with the lowest MDL value as the best model.
Depending on the type of problem you’re working on, one of these
methods – AIC, BIC, or MDL – may be preferred over the others as a way
of selecting a parsimonious model.
5.5 CROSS VALIDATION
By training the model on a subset of the input data and testing it on a
subset of the input data that hasn't been used before, you may validate the
model's effectiveness. It is also a method for determining how well a
statistical model generalizes to a different dataset.
Testing the model's stability is a necessary step in machine learning (ML). This indicates that we cannot fit our model to the training dataset alone. We set aside a specific sample of the dataset, one that wasn't included in the training dataset, for this use. After that, before deployment, we test our model on that sample, and the entire procedure is referred to as cross-validation. It differs from the typical train-test split in this way.
Hence, the fundamental cross-validation stages are:
Set aside a portion of the dataset as a validation set.
Use the training dataset to provide the model with training.
Use the validation set to assess the model's performance right now. Do
the next step if the model works well on the validation set; otherwise,
look for problems.
5.5.1 Methods used for Cross-Validation
There are some common methods that are used for cross -validation. These
methods are given below:
1] Validation Set Approach
With the validation set approach, we separate our input dataset into a training set and a test or validation set, with 50% of the dataset assigned to each of the two subsets.
Nevertheless, it has a significant drawback in that we are only using 50% of the dataset to train our model, which means that the model can fail to capture crucial dataset information. It frequently produces an underfitted
model as well.
2] Cross-validation using Leave-P-out
In this method, p data points are excluded from the training data. This means that if the original input dataset has a total of n data points, then n-p data points will be utilised as the training dataset, and p data points will be used as the validation set. The entire procedure is repeated for every possible choice of the p held-out points, and the average error is computed to assess the model's efficacy.
This method has a drawback in that it can be computationally challenging
for large p.
3] Leave-one-out cross-validation
This technique is similar to leave-p-out cross-validation, but we exclude one data point from training instead of p. It means that in this method, only one data point is set aside for each learning set, while the remaining dataset is used to train the model. This process is repeated for each data point. Hence, for n samples, n distinct training sets and n test sets are obtained. It has these characteristics:
As all the data points are used, the bias is minimal in this method.
Because the process is run n times, the execution time is long.
Due to the iterative nature of this method, measuring the model's
efficacy against a single data point is highly variable.
4] K-Fold Cross-Validation
The k-fold cross-validation approach divides the input dataset into K groups of samples of equal size. These samples are called folds. For each learning set, the prediction function uses k-1 folds, and the remaining fold is used for the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than that of other methods.
The steps for k-fold cross-validation are:
o Split the input dataset into K groups
o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the
model using the test set.
Let's take an example of 5-fold cross-validation. So, the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train the model. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train the model. This process continues until each fold has been used as the test fold.
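In place of the original diagram, the following minimal sketch shows 5-fold cross-validation with scikit-learn; the synthetic dataset and the logistic-regression estimator are arbitrary choices used only to demonstrate the mechanics.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic classification data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each iteration reserves one fold for testing and trains on the remaining four
scores = cross_val_score(model, X, y, cv=kfold)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())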
5] Cross-validation with a stratified k-fold
With a few minor adjustments, this method is identical to k-fold cross-validation. The stratification principle underlies this method, which involves rearranging the data to make sure that each fold or group is a good representation of the entire dataset. It is one of the finest strategies for addressing bias and variance.
It can be understood by utilizing the example of housing costs, where
some homes may have substantially higher prices than others. A stratified
k-fold cross -validation technique is helpful to handle such circumstances.
6] Holdout Method
This methodology for cross-validation is the simplest one available. With this technique, we must hold out a portion of the data and train the model on the remaining dataset to obtain the prediction results.
The error that results from this procedure provides insight into how effectively our model will work with an unseen dataset. Although this method is straightforward to use, it still struggles with high variance and occasionally yields inaccurate findings.
5.5.2 Limitations of Cross-Validation
There are some limitations of the cross -validation technique, which are
given below:
It delivers the best results under ideal circumstances; inconsistent data, however, can lead to drastically different outcomes. When there is uncertainty over the type of data used in machine learning, this is one of the major drawbacks of cross-validation.
Because data in predictive modelling changes over time, there may be variations between the training set and validation sets. For instance, if we develop a stock market value prediction model and the data is trained on the stock prices from the previous five years, the realistic future stock prices for the following five years could be very different, so it is challenging to predict the correct output in such circumstances.
5.5.3 Applications of Cross-Validation
This method can be used to evaluate how well various predictive modelling approaches work.
It has a lot of potential for medical study.
Although data scientists are already using it in the field of medical statistics, it can also be utilised for meta-analysis.
5.6 SUMMARY
We have studied the following points from this chapter:
Model selection is a procedure used by statisticians to examine the
relative merits of various statistical models and ascertain which one
best fits the observed data.
The process of selecting a model from a large pool of potential models
for a predictive modelling issue is known as model selection.
Beyond model performance, there may be several competing
considerations to consider throughout the model selection process,
including complexity, maintainability, and resource availability.
Probabilistic measurements and resampling procedures are the two
primary groups of model selection strategies.
5.7 LIST OF REFERENCES
1. Doing Data Science, Rachel Schutt and Cathy O'Neil, O'Reilly, 2013.
2. Mastering Machine Learning with R, Cory Lesmeister, PACKT Publication, 2015.
3. Hands-On Programming with R, Garrett Grolemund, 1st Edition, 2014.
4. An Introduction to Statistical Learning, James, G., Witten, D., Hastie, T., Tibshirani, R., Springer, 2015.
5.8 UNIT END EXERCISES
1) What is Regularization?
2) What are the different Regularization techniques?
3) Explain the Bias/variance tradeoff.
4) What is Bias?
5) What is Variance?
6) Describe the Bias-Variance Tradeoff.
7) Explain the Parsimony Model.
8) How you will choose a Parsimonious Model?
9) Explain: AIC, BIC and MDL.
10) Explain the Cross validation.
11) Describe the methods used for Cross -Validation.
12) Write a note on limitations and applications of Cross -Validation
techniques.
6
DATA TRANSFORMATIONS
Unit Structure
6.0 Objectives
6.1 Introduction
6.2 Dimension reduction
6.2.1 The curse of dimensionality
6.2.2 Benefits of applying dimensionality reduction
6.2.3 Disadvantages of dimensionality reduction
6.2.4 Approaches of dimension reduction
6.2.5 Common techniques of dimensionality reduction
6.3 Feature extraction
6.3.1 Why feature extraction is useful?
6.3.2 Applications of Feature Extraction
6.3.3 Benefits
6.3.4 Feature extraction techniques
6.4 Smoothing
6.5 Aggregating
6.5.1 Working of data aggregation
6.5.2 Examples of aggregate data
6.5.3 Data aggregators
6.6 Summary
6.7 List of References
6.8 Unit End Exercises
6.0 OBJECTIVES
To understand the various data transformations involved in machine
learning
To get familiar with the concept of dimensionality reduction and its
effect on performance
To acquaint with the concepts of data aggregation and smoothing
6.1 INTRODUCTION
It's challenging to track or comprehend raw data. Because of this, it needs
to be preprocessed before any information can be extracted from it. The
process of transforming raw data into a format that makes it easier to
conduct data mining and recover strategic information is known as data
transformation. In order to change the data into the right form, data
transformation techniques also include data cleansing and data reduction.
To produce patterns that are simpler to grasp, data transformation is a
crucial data preprocessing technique that must be applied to the data
before data mining.
Data transformation transforms the data into clean, useable data by
altering its format, structure, or values. In two steps of the data pipeline for
data analytics projects, data can be modified. Data transformation is the
middle phase of an ETL (extract, transform, and load ) process, which is
commonly used by businesses with on -premises data warehouses. The
majority of businesses now increase their compute and storage resources
with latency measured in seconds or minutes by using cloud -based data
warehouses. Organizations can load raw data directly into the data
warehouse and perform preload transformations at query time thanks to
the scalability of the cloud platform.
Data transformation may be used in data warehousing, data wrangling,
data integration, and migration. Data transformation makes business and
analytical processes more effective and improves the quality of data -
driven decisions made by organizations. The structure of the data will be
determined by an analyst throughout the data transformation process.
Hence, data transformation might be:
o Constructive: The data transformation process adds, copies, or
replicates data.
o Destructive: The system deletes fields or records.
o Aesthetic: The transformation standardizes the data to meet
requirements or parameters.
o Structural: The database is reorganized by renaming, moving, or
combining columns.
6.2 DIMENSION REDUCTION
Dimensionality refers to how many input features, variables, or columns
are present in a given dataset, while dimensionality reduction refers to the
process of reducing these features.
In many circumstances, a dataset has a significant number of input
features, which complicates the process of predictive modelling. For
training datasets with a large number of features, it is extremely challenging to visualize or anticipate the results; hence, dimensionality
reduction techniques must be used.
Dimensionality reduction can be described as a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it conveys similar information. These
methods are frequently used in machine learning to solve classification
and regression issues while producing a more accurate predictive model.
It is frequently utilized in disciplines like speech recognition, signal
processing, bioinformatics, etc. that deal with high -dimensional data.
Moreover, it can be applied to cluster analysis, noise reduction, and data
visualization.
6.2.1 The curse of dimensionality
The "curse of dimensionality", that is, the difficulty of handling high-dimensional data, is a well-known phenomenon. Any machine learning algorithm and model becomes increasingly complex as the dimensionality of the input dataset rises. As the number of features rises, the number of samples needed rises correspondingly as well, and the possibility of overfitting also increases. A machine learning model that is overfitted after being trained on high-dimensional data performs poorly.
As a result, it is frequently necessary to decrease the number of features,
which can be accomplished by dimensionality reduction.
6.2.2 Benefits of applying dimensionality reduction
Following are some advantages of using the dimensionality reduction
technique on the provided dataset:
The space needed to store the dataset is decreased by lowering the dimensionality of the features.
Reduced feature dimensions call for shorter computation and training
times.
The dataset's features with reduced dimensions make the data easier to
visualize rapidly.
By taking care of the multicollinearity, it removes the redundant
features (if any are present).
6.2.3 Disadvantages of dimensionality reduction
The use of dimensionality reduction also has some drawbacks:
The reduction in dimensionality may result in some data loss.
Sometimes the principal components that need to be considered in the PCA dimensionality reduction technique are unknown.
6.2.4 Approaches of dimension reduction
There are two ways to apply the dimension reduction technique, which are
given below:
A] Feature Selection
In order to create a high-accuracy model, a subset of the important features from a dataset must be chosen, and the irrelevant characteristics must be excluded. This process is known as feature selection. To put it another way, it is a method of choosing the best characteristics from the input dataset.
The feature selection process employs three techniques:
1] Filter methods
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2] Wrapper methods
The wrapper technique uses a machine learning model to evaluate itself,
but it has the same objective as the filter method. With this approach,
some features are provided to the ML model, and performance is assessed.
To improve the model's accuracy, the performance determines whether to
include or exclude certain features. Although it is more difficult to use,
this method is more accurate than the filtering method. The following are
some typical wrapper method techniques:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3] Embedded Methods: Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature. Some common techniques of embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
B] Feature extraction
The process of converting a space with many dimensions into one with
fewer dimensions is known as feature extraction. This strategy is helpful
when we want to retain all of the information while processing it with
fewer resources.
Some common feature extraction techniques are:
Principal Component Analysis
Linear Discriminant Analysis
Kernel PCA
Quadratic Discriminant Analysis
6.2.5 Common techniques of dimensionality reduction
Principal Component Analysis
Backward Elimination
Forward Selection
Score comparison
Missing Value Ratio
Low Variance Filter
High Correlation Filter
Random Forest
Factor Analysis
Auto -Encoder
Principal Component Analysis (PCA)
Principal Component Analysis is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. It is one of the
popular tools that is used for exploratory data analysis and predictive
modelling.
PCA works by considering the variance of each attribute, because an attribute with high variance indicates a good split between the classes, and hence it reduces the
dimensionality. Some real -world applications of PCA are image
processing, movie recommendation system, optimizing the power
allocation in various communication channels.
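A minimal illustrative sketch of PCA with scikit-learn is given below; the synthetic correlated data, the standardization step, and the choice of two components are assumptions made for demonstration only.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with 6 correlated features (illustrative only)
rng = np.random.RandomState(0)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))])

# Standardize, then project onto the top 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)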
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing a Linear Regression or Logistic Regression model. The following steps
are performed in this technique to reduce the dimensionality or in feature
selection:
o In this technique, firstly, all the n variables of the given dataset are taken to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n -1
features for n times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance of the model, and then we will drop that variable or feature; after that, we will be left with n-1 features.
o Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and
maximum tolerable error rate, we can define the optimal number of
features required for the machine learning algorithms.
Forward Feature Selection
Forward feature selection follows the inverse process of the backward
elimination process. It means, in this technique, we don't eliminate the
feature; instead, we will find the best features that can produce the highest
increase in the performance of the model. The below steps are performed in this technique (a short code sketch of both selection directions follows the steps):
o We start with a single feature only, and progressively we will add one feature at a time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process will be repeated until we get a significant increase in the
performance of the model.
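Both forward selection and backward elimination can be sketched with scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the diabetes dataset, the linear-regression estimator, and the target of four features are illustrative assumptions, not part of the original text.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
estimator = LinearRegression()

# Forward selection: start empty and greedily add the most helpful feature
forward = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                    direction="forward", cv=5).fit(X, y)

# Backward elimination: start with all features and greedily drop the least helpful
backward = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                     direction="backward", cv=5).fit(X, y)

print("Forward keeps features:", forward.get_support(indices=True))
print("Backward keeps features:", backward.get_support(indices=True))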
Missing Value Ratio
If a dataset has too many missing values, then we drop those variables as
they do not carry much useful information. To perform this, we can set a
threshold level, and if a variable has missing values more than that
threshold, we will drop that variable. The higher the threshold value, the
more efficient the reduction.
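A small pandas sketch of this filter follows; the tiny DataFrame and the 40% threshold are hypothetical values chosen only for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 41, 38, np.nan],
    "income": [50_000, np.nan, np.nan, 72_000, np.nan, 61_000],
    "city":   ["Pune", "Mumbai", "Delhi", np.nan, "Pune", "Nagpur"],
})

threshold = 0.40                      # drop columns with more than 40% missing values
missing_ratio = df.isnull().mean()    # fraction of missing values per column
keep = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[keep]

print(missing_ratio)
print("Columns kept:", list(df_reduced.columns))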
Low Variance Filter
Similar to the missing value ratio technique, data columns with very little variation in the data carry less information. Therefore, we need to calculate the variance of each variable, and all data columns with variance lower than a given threshold are dropped, because low-variance features will not affect the target variable.
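The sketch below applies scikit-learn's VarianceThreshold to synthetic data; the columns and the variance threshold are illustrative assumptions.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
X = np.column_stack([
    rng.normal(size=100),            # informative, higher variance
    rng.normal(size=100) * 0.01,     # nearly constant, very low variance
    np.ones(100),                    # completely constant
])

selector = VarianceThreshold(threshold=1e-3)   # drop columns with variance below 0.001
X_filtered = selector.fit_transform(X)

print("Original shape:", X.shape, "-> filtered shape:", X_filtered.shape)
print("Columns kept:", selector.get_support(indices=True))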
High Correlation Filter
High Correlation refers to the case when two variables carry
approximately similar information. Due to this factor, the performance of
the model can be degraded. This correlation between independent numerical variables gives the calculated value of the correlation coefficient.
If this value is higher than the threshold value, we can remove one of the
variables from the dataset. We can consider those variables or features that
show a high correlation with the target variable.
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. This algorithm contains an in-built feature importance
package, so we do not need to program it separately. In this technique, we
need to generate a large set of trees against the target variable, and with
the help of usage statistics of each attribute, we need to find the subset of
features.
The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using one-hot encoding.
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its correlation with other variables; variables within a group can have a high correlation among themselves, but they have a low correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, Income and Spending. These two variables have a high correlation, which means people with high income spend more, and vice versa. So, such
variables are put into a group, and that group is known as the factor. The
number of these factors will be reduced as compared to the original
dimension of the dataset.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN (artificial neural network), and its main aim is to copy the inputs to their outputs. In this, the input is compressed into a latent-space representation, and the output is produced from this representation. It has mainly two parts:
o Encoder: The function of the encoder is to compress the input to form the latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space representation.
6.3 FEATURE EXTRACTION
Feature extraction is a method for extracting important features from a
huge input data collection. Dimensionality reduction is used in this
procedure to break up enormous input data sets into more manageable
processing units.
The dimensionality reduction method, which divides and condenses a
starting set of raw data into smaller, easier -to-manage groupings, includes
feature extraction. As a result, processing will be simpler. The fact that
these enormous data sets contain a lot of different variables is their most
crucial feature. Processing these variables takes a lot of computing power.
In order to efficiently reduce the amount of data, feature extraction helps
to extract the best feature from those large data sets by choosing and
combining variables into features. These features are simple to use while
still accurately and uniquely describing the real data set.
6.3.1 Why feature extraction is useful?
When you have a large data set and need to conserve resources without
losing any crucial or pertinent information, the feature extraction
technique can be helpful. The amount of redundant data in the data
collection is decreased with the aid of feature extraction.
In the end, the data reduction speeds up the learning and generalization
phases of the machine learning process while also enabling the model to
be built with less machine effort.
6.3.2 Applications of Feature Extraction
Bag of Words: Bag-of-Words is the most widely used technique for natural language processing. In this process, the words or features are extracted from a sentence, document, website, etc., and then classified by their frequency of use. So, in this whole process, feature extraction is one of the most important parts.
Image Processing: Image processing is one of the best and most interesting domains. In this domain, you basically start playing with your images in order to understand them. So here we use many techniques, which include feature extraction, as well as algorithms to detect features such as shapes, edges, or motion in a digital image or video in order to process them.
Auto-encoders: The main purpose of auto-encoders is efficient data coding, which is unsupervised in nature; this process comes under unsupervised learning. So, the feature extraction procedure is applicable here to identify the key features from the data to code, by learning from the coding of the original data set to derive new ones.
6.3.3 Benefits
Feature extraction can prove helpful when training a machine learning
model. It leads to:
A Boost in training speed
An improvement in model accuracy
A reduction in risk of overfitting
A rise in model explainability
Better data visualization
6.3.4 Feature extraction techniques
The following is a list of some common feature extraction techniques:
Principal Component Analysis (PCA)
Independent Component Analysis (ICA)
Linear Discriminant Analysis (LDA)
Locally Linear Embedding (LLE)
t-distributed Stochastic Neighbor Embedding (t -SNE)
6.4 DATA SMOOTHING
Data smoothing is the process of taking out noise from a data set using an
algorithm. Important patterns can then be more easily distinguished as a
result.
Data smoothing can be used in economic analysis as well as to assist
predict trends, such as those seen in securities prices. The purpose of data
smoothing is to eliminate singular outliers and account for seasonality.
Advantages and disadvantages
The identification of patterns in the economy, in financial instruments like stocks, and in consumer mood can be aided by data smoothing. Further commercial uses for data smoothing are possible.
By minimizing the changes that may occur each month, such as vacations
or petrol prices, an economist can smooth out data to make seasonal
adjustments for particular indicators, such as retail sales.
Yet, there are drawbacks to using this technology. When identifying trends
or patterns, data smoothing doesn't necessarily explain them. It might also
cause certain data points to be overlooked in favor of others.
Pros
Helps identify real trends by eliminating noise from the data
Allows for seasonal adjustments of economic data
Easily achieved through several techniques including moving
averages
Cons
Removing data always leaves less information to analyze, increasing the risk of errors in analysis
Smoothing may emphasize analysts' biases and ignore outliers that
may be meaningful
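As a rough illustration of smoothing, the pandas sketch below applies a simple moving average and exponential smoothing to a synthetic noisy series; the series, the window size, and the smoothing factor are arbitrary choices made only for demonstration.

import numpy as np
import pandas as pd

# A noisy monthly sales-like series (synthetic, for illustration only)
rng = np.random.RandomState(1)
dates = pd.date_range("2022-01-01", periods=36, freq="MS")
series = pd.Series(100 + np.linspace(0, 20, 36) + rng.normal(scale=8, size=36), index=dates)

smoothed = series.rolling(window=5, center=True).mean()   # 5-month moving average
exp_smoothed = series.ewm(alpha=0.3).mean()               # exponential smoothing

print(pd.DataFrame({"raw": series, "moving_avg": smoothed, "exp_smooth": exp_smoothed}).head(8))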
6.5 AGGREGATING
Finding, gathering, and presenting data in a condensed style is the process
of aggregation, which is used to do statistical analysis of business plans or
analysis of behavioral patterns in people. It's essential to acquire reliable
data when a lot of data is collected from several sources in order to
produce meaningful results. Aggregating data can assist in making wise
selections in marketing, finances, product pricing, etc. The statistical
summaries replace aggregated data groups. As aggregated data is present
in the data warehouse, using it to address rational issues might speed up
the process of answering queries from data sets.
6.5.1 Working of data aggregation
When a dataset's total amount of information is useless and unable to be
used for analysis, data aggregation is required. To achieve desired results
and improve the user experience or the application itself, the data sets are
compiled into useable aggregates. They offer aggregation metrics
including sum, count, and average. Summarized data is useful for
researching client demographics and patterns of activity. After being
written as reports, aggregated data assist in uncovering insightful facts
about a group. Understanding, capturing, and visualizing data aids in data
lineage, which aids in identifying the primary causes of errors in data
analytics. An aggregated element does not necessarily have to be a
number. We can also find the count of non-numeric data. Aggregation
must be done for a group of data and not based on individual data.
6.5.2 Examples of aggregate data
Finding the average age of customer buying a particular product
which can help in finding out the targeted age group for that
particular product. Instead of dealing with an individual customer,
the average age of the customer is calculated.
Finding the number of consumers by country. This can increase sales
in the country with more buyers and help the company to enhance its
marketing in a country with low buyers. Here also, instead of an
individual buyer, a group of buyers in a country are considered.
By collecting the data from online buyers, the company can analyze
the consumer behaviour pattern, the success of the product, which
helps the marketing and finance department to find new marketing
strategies and planning the budget.
Finding the value of voter turnout in a state or country. It is done by
counting the total votes of a candidate in a particular region instead
of counting the individual voter records.
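The pandas sketch below mirrors the first two examples above (average customer age per product and buyer counts per country); the order data is made up purely for illustration.

import pandas as pd

orders = pd.DataFrame({
    "product": ["phone", "phone", "laptop", "laptop", "phone", "tablet"],
    "country": ["India", "India", "USA", "UK", "USA", "India"],
    "age":     [23, 31, 45, 38, 27, 19],
})

# Average age of customers buying each product
avg_age_per_product = orders.groupby("product")["age"].mean()

# Number of buyers per country
buyers_per_country = orders.groupby("country").size()

print(avg_age_per_product)
print(buyers_per_country)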
6.5.3 Data aggregators
A system used in data mining called a "data aggregator" gathers
information from many sources, analyses it, and then repackages it in
usable packages. They significantly contribute to the improvement of
client data by serving as an agent. When a consumer requests data examples
concerning a specific product, it aids in the query and delivery process.
The customers receive matching records for the goods from the
aggregators. The consumer can thereby purchase any matching record
instances.
Working
The working of data aggregators takes place in three steps:
Collection of data: Collecting data from different datasets from the enormous database. The data can be extracted using IoT (Internet of Things) sources such as:
Communications in social media
Speech recognition like call centers
News headlines
Browsing history and other personal data of devices.
Processing of data: After collecting data, the data aggregator finds
the atomic data and aggregates it. In the processing technique,
aggregators use various algorithms from the field of Artificial
Intelligence or Machine learning techniques. It also incorporates
statistical methods to process it, like the predictive analysis. By this,
various useful insights can be extracted from raw data.
Presentation of data: After the processing step, the data will be in a
summarized format which can provide a desirable statistical result
with detailed and accurate data.
6.6 SUMMARY
The modification of data characteristics for better access or storage is
known as data transformation. Data's format, structure, or values may all
undergo transformation. Data analytics transformation typically takes
place after data has been extracted or loaded (ETL/ELT).
Data transformation improves the effectiveness of analytical procedures
and makes it possible to make judgements using data. There is a need for
clean, usable data since raw data is frequently challenging to examine and
has a size that is too great to yield useful insight.
An analyst or engineer will choose the data structure before starting the
transformation procedure. The following are the most typical types of data
transformation:
Constructive: The process of data transformation adds, duplicates, or
copies data.
Destructive: The system deletes fields or records.
Aesthetic: The data are standardized through the transformation to
adhere to specifications or guidelines.
Structural: A restructured database results from combining, renaming, or moving a number of columns.
A practitioner may additionally map data and save data using the right
database technology.
6.7 LIST OF REFERENCES
1. Doing Data Science, Rachel Schutt and Cathy O'Neil, O'Reilly, 2013.
2. Mastering Machine Learning with R, Cory Lesmeister, PACKT Publication, 2015.
3. Hands-On Programming with R, Garrett Grolemund, 1st Edition, 2014.
4. An Introduction to Statistical Learning, James, G., Witten, D., Hastie, T., Tibshirani, R., Springer, 2015.
6.8 UNIT END EXERCISES
1] What do you mean by Dimension reduction?
2] What is the curse of dimensionality?
3] What are the benefits of applying dimensionality reduction?
4] State the disadvantages of dimensionality reduction.
5] Explain the different approaches of dimension reduction.
6] What are the common techniques of dimensionality reduction?
7] What is Feature extraction?
8] Why feature extraction is useful?
9] State the benefits and applications of Feature Extraction.
10] Describe various feature extraction techniques.
11] Define Smoothing and Aggregating.
12] Explain the working of data aggregation.
13] What are the different examples of aggregate data?
14] What is a data aggregator?
7
SUPERVISED LEARNING
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Linear models
7.2.1 What is linear model?
7.2.2 Types of linear model
7.2.3 Applications of linear model
7.3 Regression trees
7.3.1 What are regression trees?
7.3.2 Mean square error
7.3.3 Building a regression tree
7.4 Time-series Analysis
7.4.1 What is time series analysis?
7.4.2 Types of time series analysis
7.4.3 Value of time series analysis
7.4.4 Time series models and techniques
7.5 Forecasting
7.5.1 Time series forecasting in machine learning
7.5.2 Machine learning models for time series forecasting
7.5.3 Machine learning time series forecasting applications
7.6 Classification trees
7.7 Logistic regression
7.8 Classification using separating hyperplanes
7.9 k-NN
7.9.1 Need of KNN Algorithm
7.9.2 Working of KNN Algorithm
7.9.3 Selecting value of k in KNN Algorithm
7.9.4 Advantages of KNN Algorithm
7.9.5 Disadvantages of KNN Algorithm
7.10 Summary
7.11 List of References
7.12 Unit End Exercises
7.0 OBJECTIVES
To understand the supervised learning mechanisms
To learn about different regression and classification models
7.1 INTRODUCTION
A class of techniques and algorithms known as "supervised learning" in
machine learning and artificial intelligence develops predictive models
utilizing data points with predetermined outcomes. The model is trained
using an appropriate learning technique (such as neural networks, linear
regression, or random forests), which often employs some sort of
optimization procedure to reduce a loss or error function.
In other words, supervised learning is the process of training a model by
providing both the right input data and output data. The term "labelled
data" is typically used to describe this input/output pair. Consider a
teacher who, armed with the right answers, will award or deduct points
from a student depending on how accurately she answered a question. For
two different sorts of issues, supervised learning is frequently utilized to
develop machine learning models.
Regression: The model identifies outputs that correspond to real variables (numbers which can have decimals).
Classification: The model creates categories for its inputs.
7.2 LINEAR MODELS
One of the simplest models in machine learning is the linear model. It
serves as the foundation for many sophisticated machine learning
techniques, such as deep neural networks. Using a linear function of the
input data, linear models forecast the target variable. Here, we've covered
linear regression and logistic regression, two essential linear models in
machine learning. While logistic regression is a classification algorithm,
linear regression is utilized for jobs involving regression.
7.2.1 What is linear model?
One of the simplest models in machine learning is the linear model. It
attempts to determine the importance of each feature while assuming that
the data can be linearly separated. In mathematics, it may be expressed as
Y = W^T X
To turn the continuous -valued variable Y into a discrete category for the
classification issue, we apply a transformation function or threshold. Here,
we'll quickly go over the models for the classification and regression
tasks, respectively: logistic and linear regression.
7.2.2 Types of linear model
1] Linear regression
A statistical method known as "linear regression" makes predictions about the outcome of a response variable by combining a variety of influencing variables. It makes an effort to depict the target's (dependent variable's) linear relationship with the features (independent variables). We can determine the ideal model parameter values using the cost function.
Example: An analyst would be interested in seeing how market
movement influences the price of ExxonMobil (XOM). The value of the
S&P 500 index will be the independent variable, or predictor, in this
example, while the price of XOM will be the dependent variable. In
reality, various elements influence an event's result. Hence, we usually
have many independent features.
2] Logistic regression
A progression from linear regression is logistic regression. The result of
the linear regression is first transformed between 0 and 1 by the sigmoid
function. Following that, a predetermined threshold aids in calculating the
likelihood of the output values. Values above the threshold tend to be assigned a probability of 1, whereas values below the threshold tend to be assigned a probability of 0.
Example: A bank wants to predict if a customer will default on their loan
based on their credit score and income. The independent variables would
be credit score and income, while the dependent variable would be
whether the customer defaults (1) or not (0).
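The scikit-learn sketch below loosely mirrors the two examples above; the synthetic index-level and loan-default data, the coefficients used to generate them, and the chosen models are assumptions made only for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(0)

# Linear regression: predict a stock price from an index level (synthetic data)
index_level = rng.uniform(3500, 4500, size=(200, 1))
stock_price = 0.02 * index_level.ravel() + 5 + rng.normal(scale=2, size=200)
lin = LinearRegression().fit(index_level, stock_price)
print("slope=%.4f  intercept=%.2f" % (lin.coef_[0], lin.intercept_))

# Logistic regression: predict loan default from credit score and income (synthetic data)
X = np.column_stack([rng.uniform(300, 850, 300), rng.uniform(20_000, 120_000, 300)])
logit = -0.01 * (X[:, 0] - 600) - 0.00002 * (X[:, 1] - 60_000)
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-logit))).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted default probability:", clf.predict_proba([[580, 40_000]])[0, 1])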
7.2.3 Applications of linear model
There are many situations in real life when dependent and independent
variables follow linear relationships. Such instances include:
The connection between elevation variation and the boiling point of
water.
The connection between an organization's revenue and its advertising
expenditures.
The connection between fertilizer application rates and agricultural
yields.
Athletes' performances and their training schedules.
7.3 REGRESSION TREES
7.3.1 What are regression trees?
A regression tree, which is used to predict continuous valued outputs
rather than discrete outputs, is essentially a decision tree that is employed
for the regression task.
7.3.2 Mean square error
To provide accurate and effective classifications, decision trees for
classification pose the proper questions at the appropriate nodes. Entropy
and Information Gain are the two metrics used in Classifier Trees to
accomplish this. But, since we are making predictions about continuous
variables, we are unable to compute the entropy and follow the same
procedure. Now, we require a different approach. The mean square error is
a measurement that indicates how much our projections stray from the
initial goal.
We only care about how far the prediction deviates from the target, not in which direction; Y is the actual value, and Y hat is the prediction. We square the difference and then divide the total by the number of records.
We follow the same procedure as with classification trees in the regression
tree approach. But rather than focusing on entropy, we strive to lower the
Mean Square Error for each child.
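A tiny sketch of the mean square error computation described above, using arbitrary example values:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual target values (Y)
y_pred = np.array([2.8, 5.4, 2.0, 8.1])   # predictions (Y hat)

mse = np.mean((y_true - y_pred) ** 2)     # square the differences, then average
print("MSE =", mse)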
7.3.3 Building a regression tree
Consider the dataset below, which has 2 variables.
Figure 1: Dataset where X and Y are continuous variables
Figure 2: Actual dataset
We need to build a Regression tree that best predicts the Y given the X.
Step 1
The first step is to sort the data based on X (in this case, it is already sorted). Then, take the average of the first 2 values of variable X (which is (1+2)/2 = 1.5 according to the given dataset). Divide the dataset into 2 parts (Part A and Part B), separated by X < 1.5 and X ≥ 1.5.
Now, Part A consists of only one point, which is the first row (1,1), and all the other points are in Part B. Now, take the average of all the Y values
in Part A and average of all Y values in Part B separately. These 2 values
are the predicted output of the decision tree for x < 1.5 and x ≥ 1.5
respectively. Using the predicted and original values, calculate the mean
square error and note it down.
Step 2
In step 1, we calculated the average for the first 2 numbers of sorted X and
split the dataset based on that and calculated the predictions. Then, we do
the same process again but this time, we calculate the average for the
second 2 numbers of sorted X ( (2+3)/2 = 2.5 ). Then, we split the dataset
again based on X < 2.5 and X ≥ 2.5 into Part A and Part B again and
predict outputs, and find the mean square error as shown in step 1. This process is
repeated for the third 2 numbers, the fourth 2 numbers, the 5th, 6th, 7th till
n-1th 2 numbers ( where n is the number of records or rows in the dataset ). munotes.in
Page 107
Supervised Learning
107 Step 3
Now that we have n -1 mean squared errors ca lculated, we need to choose
the point at which we are going to split the dataset. and that point is the
point, which resulted in the lowest mean squared error on splitting at it. In
this case, the point is x=5.5. Hence the tree will be split into 2 parts. x<5.5
and x ≥ 5.5. The Root node is selected this way and the data points that go
towards the left child and right child of the root node are further recursively
exposed to the same algorithm for further splitting.
Brief Explanation of working of the algori thm:
The basic idea behind the algorithm is to find the point in the independent
variable to split the data -set into 2 parts, so that the mean squared error is
the minimised at that point. The algorithm does this in a repetitive fashion
and forms a tree -like structure.
A regression tree for the dataset shown above is illustrated in the figures below.
Figure 3: Resultant decision tree and the resulting prediction visualisation
Figure 4: The decision boundary
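The same idea can be sketched with scikit-learn's DecisionTreeRegressor; since the toy dataset from Figures 1 and 2 is not reproduced here, the (X, Y) values below are made up, and the library's built-in squared-error splitting stands in for the manual procedure described above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Small made-up dataset standing in for the one shown in Figures 1-2
X = np.arange(1, 11).reshape(-1, 1)                     # X = 1..10
y = np.array([1.0, 1.2, 1.1, 1.3, 1.2, 6.0, 6.2, 6.1, 6.3, 6.2])

# max_depth=2 keeps the tree small; splits are chosen to minimize the MSE
tree = DecisionTreeRegressor(max_depth=2, criterion="squared_error")
tree.fit(X, y)

print(export_text(tree, feature_names=["X"]))
print("Prediction for X=4:", tree.predict([[4]])[0])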
7.4 TIME-SERIES ANALYSIS
7.4.1 What is time series analysis?
A method of examining a collection of data points gathered over time is a
time series analysis. Additionally, it is specifically utilized for non -
stationary data, or data that is constantly changing over time. The time
series data varies from all other data due to this component as well. Time
series analysis is also used to predict future data based on the past. As a
result, we can conclude that it goes beyond simply gathering data.
Predictive analytics includes the subfield of time seri es analysis. It
supports in forecasting by projecting anticipated variations in data, such as
seasonality or cyclical activity, which provides a greater understanding of
the variables.
7.4.2 Types of time series analysis
Time series are used to collect a v ariety of data kinds;thus, analysts have
created some intricate models to help with understanding. Analysts, on the
other hand, are unable to take into account all variations or generalize a
specific model to all samples. These are the typical time series analysis
methods:
o Classification: This model is used for the identification of data. It also
allocates categories to the data.
o Descriptive Analysis: As time series data has various components,
this descriptive analysis helps to identify the varied pattern s of time
series including trend, seasonal, or cyclic fluctuations.
o Curve Fitting: Under this type of time series analysis, we generally
plot data along some curve in order to investigate the correlations
between variables in the data.
o Explanative Analysis: This model basically explains the correlations
between the data and the variables within it, and also explains the
causes and effects of the data on the time series.
o Exploratory Analysis: The main function of this model is to highlight
the key feature s of time series data, generally in a graphic style.
o Forecasting: As the name implies, this form of analysis is used to
forecast future data. Interestingly, this model uses the past data (trend)
to forecast the forthcoming data, thus, projecting what could happen at
future plot points.
o Intervention Analysis: This analysis model of time series denotes or
investigates how a single incident may alter data.
munotes.in
Page 109
Supervised Learning
109 o Segmentation: This type typically divides the data into several
segments in order to display the underlying attributes of the original
data.
o Data Variation: It includes
Functional Analysis : It helps in the picking of patterns within data
and also correlates a notable relationship.
Trend Analysis: It refers to the constant movement in a specific
directio n. Trends are classified into two types: deterministic
(determining core causes) and stochastic (inexplicable).
o Seasonal Variation : It defines event occurrences that specifically
happen at certain and regular periods throughout the year.
7.4.3 Value of ti me series analysis
Our lives are significantly impacted by time series analysis. It aids
businesses and organizations in examining the root causes of trends or
other systematic patterns across time. Moreover, with all these facts, you
can put them down in a chart visually and that assists in a deeper
knowledge of the industry. In turn, this will help firms delve more into the
fundamental causes of seasonal patterns or trends.
Also, it aids organizations in projecting how specific occurrences will turn
out in the future. This may be achieved by performing ongoing analyses of
past data. Predictive analytics therefore includes time series forecasting as
a subset. It enables more precise data analysis and forecasting by
anticipating projected variations in data, such as seasonality, trend, or
cyclic behavior.
Time series forecasting also contains additional crucial components.
Dependable: Time series forecasting is one of the most dependable
methods available today. It is trustworthy when the data represents a
long-time span. At regular periods, various significant information can
be gleaned from the data fluctuations.
Seasonal Patterns: Changes in data points indicate a seasonal
fluctuation that forms the basis for projections of the future. This
information is essential to the market since it allows for a basic strategy
for production and other costs in a market where the product swings
seasonally.
Estimated trend: Time series analysis, together with seasonal patterns,
is helpful in identifying trends. This wil l eventually assist the
management in keeping track of data trends that show an uptick or
decline in sales of a particular product.
munotes.in
Page 110
Data science
110 Growth: Another significant feature of time series analysis is that it
also adds to the financial as well as endogenous gro wth of an
organization. Endogenous growth is the internal expansion of a
business that resulted in increased financial capital. Time series
analysis can be used to detect changes in policy factors, which is a great
illustration of the value of this series in many domains.
Several sectors have noted the prevalence of time series analysis. Statistics
professionals frequently utilize it to determine probability and other
fundamentals. Also, it is crucial in the medical sectors.
Mathematicians also prefer time series because econometrics uses them as
well. It is crucial for predicting earthquakes and other natural disasters,
estimating their impact zones, and identifying weather patterns for
forecasting.
7.4.4 Time series models and techniques
Data in a time series can be examined in a variety of ways. These popular
time series models can be applied to data analysis:
1. Decompositional Models
The time series data shows certain patterns. Consequently, it is quite
beneficial to divide the time series into differen t parts for simple
comprehension. Each element represents a particular pattern. The term
"decompositional models" refers to this procedure. The time series is
primarily broken down into three main components: trend, seasonality,
and noise. Predictability and change-rate decomposition are the two types
of decomposition.
2. Smoothing -based Model
This technique is one of the most statistical ones for time series because it
concentrates on removing outliers from the data and enhancing the
pattern's visibility. The process of gathering data over time involves some
random fluctuation. In order to show underlying patterns and cyclic
components, data smoothing removes or reduces random fluctuation.
3. Moving Average Model
Moving Average, or MA model, is a well -liked technique for modelling
univariate time series in time series analysis. The anticipated output is
linearly correlated with the present and other prior values of a probabilistic
term, according to the moving -average model.
4. Exponential Smoothing Model
A quick method for blending time series data that use the "exponential
window function" is exponential smoothing. This process is simple to
learn and can be used to base decisions on historical user expectations,
such as seasonality. This model comes in e ssentially three varieties: single,
double, and triple exponential smoothing. munotes.in
Page 111
Supervised Learning
111 Moreover, it is a crucial component of the ARMA and ARIMA models.
Moreover, this model is employed because of the TBATS forecasting
model.
5. ARIMA
AutoRegressive Integrated Movi ng Average is abbreviated as ARMA. It is
the forecasting technique in time series analysis that is most frequently
utilised. The Moving Average Model and the Autoregressive Model are
combined to create it.
Hence, rather than focusing on individual values in the series, the model
instead seeks to estimate future time series movement. When there is
evidence of non -stationarity in the data, ARIMA models are applied.
A linear mixture of seasonal historical values and/or prediction errors is
added to the SARIMA model, also known as the Seasonal ARIMA model,
in addition to these.
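As a rough sketch of how an ARIMA model is fitted and used to forecast, the statsmodels example below applies an arbitrary (1, 1, 1) order to a synthetic random-walk series; the data and the order are assumptions for illustration only, and in practice the order would be chosen using ACF/PACF plots or criteria such as AIC/BIC.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic non-stationary series (random walk with drift), for illustration only
rng = np.random.RandomState(42)
idx = pd.date_range("2020-01-01", periods=120, freq="D")
series = pd.Series(np.cumsum(rng.normal(loc=0.2, scale=1.0, size=120)), index=idx)

# Order (p, d, q) = (1, 1, 1) is an arbitrary illustrative choice
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

print("AIC:", fitted.aic)
print(fitted.forecast(steps=5))       # forecast the next 5 time steps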
7.5 FORECASTING
Forecasting is a method of foretelling the future using the outcomes of the
past data. In order to anticipate future events, a thorough analysis of past
and present trends or events is required. It makes use of statistical methods
and tools.
Time series forecasting is employed in various sectors, including finance,
supply chain management, production, and inventory planning, making it
one of the most widely used data scienc e approaches. Time series
forecasting has many applications, including resource allocation, business
planning, weather forecasts, and stock price prediction.
7.5.1 Time series forecasting in machine learning
A set of observations made through time, whether daily, weekly, monthly,
or yearly, make up a time series forecasting method. Time series analysis
involves building models in order to describe the observed time series and
understand the "why" underlying its dataset. This includes speculating and
interpr eting scenarios based on the information at hand. In time series
forecasting, the best -fitting model is employed to predict future
observations based on meticulously processed recent and previous data.
Machine learning -based time series analysis forecastin g has been
demonstrated to be the most effective at finding trends in both structured
and unstructured data.
Understanding the components of the time series data is essential to
using an appropriate deep learning model for time series forecasting.
Finding repeating changes in a time series and determining whether
they are cyclical.
munotes.in
Page 112
Data science
112 The term "trends" is used to characterize the upward or downward
motion of time series, which is often displayed in linear modes.
Seasonality: To highlight the repeating patte rns of behavior across
time.
To take into account the random component of time series that
deviates from the conventional model values.
7.5.2 Machine learning models for time series forecasting
There are numerous models that can be used for time series fo recasting.
One particular kind of neural network that uses historical data to forecast
outcomes is the LSTM Network. It is frequently employed for a variety of
tasks, including language recognition and time series analysis. Models like
the random forest, g radient boosting regressor, and time delay neural
networks can contain temporal information and represent the data at
different points in time by adding a series of delays to the input.
1] Naive model
Naive models are often implemented as a random walk or a seasonal random walk, with the most recently observed value serving as the forecast for the following period (in the seasonal variant, the forecast uses the value from the same time period in the most recent season).
2] Exponential smoothing model
An exponential smoothing time series forecasting technique can be expanded to support data with a systematic trend or seasonal component. It is a potent forecasting technique that can be employed in place of the well-known Box-Jenkins ARIMA family of techniques.
3] ARIMA / SARIMA
For building a composite time series model, the Autoregressive (AR) and Moving Average (MA) approaches are combined under the term ARIMA. ARIMA models can incorporate seasonal and trend factors (for example, dummy variables for weekdays) and differencing. Additionally, they handle the underlying autocorrelation in the data by using moving-average and autoregressive terms.
The seasonal autoregressive integrated moving average, or SARIMA, extends ARIMA by incorporating a linear combination of previous seasonal values and/or forecast errors.
4] Linear regression method
The simple statistical technique known as linear regression is commonly used in predictive modelling. In its most basic form, all that is required is an equation relating the target variable to the independent variables on which it depends.
5] Multi-layer perceptron (MLP)
The term “MLP” is used ambiguously; sometimes it is used broadly to refer to any feedforward ANN, and other times it is used specifically to describe networks made up of several layers of perceptrons.
6] Recurrent neural network (RNN)
RNNs can predict time-dependent targets because they are essentially memory-enhanced neural networks: recurrent neural networks are capable of remembering the state of previously processed input when handling the current time step. Recurrent networks have lately undergone several improvements that can be applied in a variety of sectors.
7] Long short-term memory (LSTM)
LSTM cells (special RNN cells) were developed to address the vanishing-gradient problem by giving the model multiple gate options. These gates let the model choose which data to treat as meaningful and which data to disregard. The GRU is yet another variety of gated recurrent network.
In addition to the methods mentioned above, CNNs (convolutional neural network models), decision tree-based models like Random Forest, and variations of gradient boosting (LightGBM, CatBoost, etc.) can be used for time series forecasting.
7.5.3 Machine learning time series forecasting applications
Time series forecasting can be used by any business or organization dealing with continuously generated data and the need to adjust to operational shifts and changes. Here, machine learning acts as a key enabler, improving forecasting in areas such as:
Web traffic forecasting: Common data on normal traffic rates among competing websites is paired with input data on traffic-related trends to anticipate online traffic rates for certain times.
Sales and demand forecasting: Machine learning models can identify the most in-demand products and arrange them precisely in the dynamic market using data on customer behavior patterns along with inputs from purchase history, demand history, seasonal influence, etc.
Weather forecasting: Time-based data are regularly gathered from numerous worldwide networked weather stations, and machine learning techniques enable in-depth analysis and interpretation of the data for upcoming forecasts based on statistical dynamics.
Stock price forecasting: To produce precise predictions of the most
likely upcoming stock price movements, one can combine historical
stock price data with knowledge of both common and atypical stock
market surges and drops.
Forecasting based on economic and demographic factors: Economic and demographic factors contain a wealth of statistical information that can be used to accurately forecast time series data. With it, the optimum target market may be defined, and the most effective tactics for interacting with that specific target audience (TA) may be developed.
Academics: Deep learning and machine learning significantly speed up the process of developing and testing scientific theories. For instance, machine learning models can make the analysis of scientific data that must undergo countless iterations much swifter.
7.6 CLASSIFICATION TREES
A supervised learning method called a decision tree can be used to solve both classification and regression problems, but it is typically favored for classification. It is a tree-structured classifier, where internal nodes stand for a dataset's features, branches for the decision rules, and each leaf node for the classification result. A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outcomes of those decisions and do not have any further branches. The decisions or tests are performed on the basis of features of the given dataset.
A decision tree is a graphical representation for obtaining all feasible answers to a decision or problem based on predetermined conditions. It is called a decision tree because, like a tree, it begins with the root node and grows on subsequent branches to form a tree-like structure. The CART algorithm, which stands for Classification and Regression Tree, is used to construct the tree. A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees. The general structure of a decision tree is shown in the diagram below:
Decision Tree Terminologies:
Root Node: Root node is from where the decision tree starts. It
represents the entire dataset, which further gets divided into two or
more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches
from the tree.
Parent/Child node: The root node of the tree is called the parent
node, and other nodes are called the child nodes.
Working of the algorithm:
In a decision tree, the algorithm begins at the root node and works its way down the tree to predict the class of a given record. It compares the value of the root attribute with the corresponding attribute of the record (in the real dataset) and, based on the comparison, follows a branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the attributes of the other sub-nodes and continues in the same way until it reaches a leaf node of the tree. The following steps can help you understand the entire procedure:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are the leaf nodes.
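A minimal sketch of this procedure, using scikit-learn's DecisionTreeClassifier (an implementation of CART), is shown below; the Iris dataset and the chosen settings are illustrative assumptions.

```python
# Hypothetical sketch: building a classification tree with scikit-learn (CART).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# criterion="gini" is one possible Attribute Selection Measure (ASM);
# "entropy" (information gain) is another common choice.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree))                       # text view of the learned splits
print("test accuracy:", tree.score(X_test, y_test))
```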
7.7 LOGISTIC REGRESSION
One of the most often used machine learning algorithms, within the category of supervised learning, is logistic regression. It is used to predict a categorical dependent variable from a given set of independent variables. Because the dependent variable is categorical, the output must be a discrete or categorical value, such as Yes or No, 0 or 1, true or false, etc. Rather than providing the exact values 0 and 1, however, logistic regression provides probabilistic values that fall between 0 and 1. Logistic regression and linear regression are very similar except in how they are applied: linear regression is used to solve regression problems, whereas logistic regression is used to solve classification problems.
In logistic regression, rather than fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). The curve of the logistic function indicates the likelihood of something, for example whether or not cells are malignant, or whether or not a mouse is obese depending on its weight. Logistic regression is a significant machine learning technique because it can provide probabilities and classify new data using both continuous and discrete datasets. When classifying observations using various sources of data, logistic regression can be used to quickly identify the factors that will work well. The logistic function is displayed in the graphic below:
Logistic function (Sigmoid function):
The predicted values are converted to probabilities using a mathematical function called the sigmoid function. It maps any real value to a value between 0 and 1. Because the output of logistic regression must fall between 0 and 1 and cannot go beyond these limits, its curve has an "S" shape; this S-shaped curve is called the sigmoid function or the logistic function. In logistic regression we also use the idea of a threshold value, which decides between the classes 0 and 1: values above the threshold tend to 1, and values below it tend to 0.
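A small sketch of the sigmoid function and the thresholding idea described above is given below; the 0.5 cutoff and the toy inputs are assumptions for illustration.

```python
# Hypothetical sketch: sigmoid (logistic) function and a 0.5 threshold rule.
import numpy as np

def sigmoid(z):
    """Map any real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])    # example linear scores
probs = sigmoid(z)                            # probabilities in (0, 1)
labels = (probs >= 0.5).astype(int)           # values above the threshold go to class 1

print(probs)
print(labels)                                 # [0 0 1 1 1]
```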
Assumptions for logistic regression:
o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.
Types of logistic regression:
On the basis of the categories, Logistic Regression can be classified into
three types:
o Binomial: In binomial Logistic regression, there can be only two
possible types of the dependent variables, such as 0 or 1, Pass or Fail,
etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dogs", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more
possible ordered types of dependent variables, such as "low",
"Medium", or "High".
7.8 CLASSIFICATION USING SEPARATING
HYPERPLANES
Suppose that we have an n×p data matrix X that consists of n training observations in p-dimensional space, and that these observations fall into two classes; that is, y1, ..., yn ∈ {−1, 1}, where −1 represents one class and 1 the other class. We also have a test observation, a p-vector of observed features x∗ = (x1∗, ..., xp∗)T. Our goal is to develop a classifier, based on the training data, that will correctly classify the test observation using its feature measurements.
Suppose that it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels. Examples of three such separating hyperplanes are shown in the left-hand panel of the Figure. We can label the observations from the blue class as yi = 1 and those from the purple class as yi = −1. Then a separating hyperplane has the property that
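(in the notation of the ISLR reference cited in Section 7.11)
β0 + β1xi1 + β2xi2 + ... + βpxip > 0 if yi = 1, and
β0 + β1xi1 + β2xi2 + ... + βpxip < 0 if yi = −1,
or, equivalently, yi(β0 + β1xi1 + ... + βpxip) > 0 for all i = 1, ..., n.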
If a separating hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located. The right-hand panel of the Figure shows an example of such a classifier. That is, we classify the test observation x∗ based on the sign of
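f(x∗) = β0 + β1x1∗ + β2x2∗ + ... + βpxp∗ (again following the notation of the reference cited in Section 7.11).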
If f(x∗) is positive, then we assign the test observation to class 1, and if
f(x∗) is negative, then we assign it to class −1. We can also make use of
the magnitude of f(x∗). If f(x∗) is far from zero, then this means that x∗ lies
far from the hyperplane, and so we can be confident about our class
assignment for x∗. On the other hand, if f(x∗) is close to zero, then x∗ is
located near the hyperplane, and so we are less certain about the class
assignment for x∗. As we see in Figure, a classifier that is based on a
separating hyperplane leads to a linear decision boundary.
7.9 K-NN
One of the simplest machine learning algorithms, based on the supervised learning method, is K-Nearest Neighbors (K-NN). The K-NN algorithm assumes that the new case and the existing cases are comparable, and it places the new instance in the category that is most similar to the existing categories. A new data point is classified based on its similarity to the stored data, which means that fresh data can be quickly and accurately sorted into a suitable category using the K-NN method.
Although the K-NN approach is most frequently employed for classification problems, it can also be used for regression. Since K-NN is a non-parametric technique, it makes no assumptions about the underlying data. It is also known as a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the training dataset and only acts on it when classifying new data. During the training phase the K-NN method simply keeps the dataset, and it later classifies new data into the category that is most similar to the new data.
Consider the following scenario: we have a photograph of a creature that resembles both cats and dogs, but we are unsure of its identity. Since the KNN algorithm is based on a similarity measure, we can use it for this identification. Our KNN model will examine the new image for features that are comparable to those found in the photographs of cats and dogs and, based on the most similar features, will classify it as belonging to either the cat or the dog group.
7.9.1 Need of KNN Algorithm
If there are two categories, Category A and Category B, and we have a new data point, x1, which category does this data point belong in? We require a K-NN algorithm to address this kind of problem. K-NN makes it simple to determine the category or class of a given data point. Take a look at the diagram below:
7.9.2 Working of KNN Algorithm
The working of K-NN can be explained on the basis of the algorithm below:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new point to the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
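A minimal sketch of these steps with scikit-learn's KNeighborsClassifier follows; the two-feature toy data and the choice K=5 are assumptions for illustration.

```python
# Hypothetical sketch: K-NN classification with scikit-learn (Euclidean distance, K=5).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy two-dimensional points for Category A (0) and Category B (1).
X = np.array([[1, 2], [2, 3], [3, 3], [2, 1], [3, 2],
              [7, 8], [8, 8], [8, 7], [9, 9], [7, 9]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)

new_point = np.array([[3, 4]])               # the point to classify
print(knn.predict(new_point))                # majority vote among the 5 nearest neighbors
```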
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors; here we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the distance between two points, which we
have already studied in geometry. It can be calculated as:
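For two points (x1, y1) and (x2, y2) in the plane, the Euclidean distance is d = √((x2 − x1)² + (y2 − y1)²).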
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this
new data point must belong to category A.
7.9.3 Selecting value of k in KNN Algorithm
o There is no particular way to determine the best value for "K", so we
need to try some values to find the best out of them. The most
preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but they may cause some difficulties.
7.9.4 Advantages of KNN Algorithm
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
7.9.5 Disadvantages of KNN Algorithm
o The value of K always needs to be determined, which can be complex at times.
o The computation cost is high because the distance between the new data point and all the training samples must be calculated.
7.10 SUMMARY
Supervised learning, commonly referred to as supervised machine learning, is a subset of machine learning and artificial intelligence. It is defined by its use of labelled datasets to train algorithms that classify data or predict outcomes effectively. Supervised learning is the most widely used machine learning approach because it is simple to understand and apply; the model uses labelled data and variables as inputs to obtain reliable results. The aim of supervised learning is to build an artificial system that can learn the relationship between the input and the output and anticipate the system's output given new inputs. We have also covered several supervised learning algorithms along with their working and delved into the various fundamental aspects affecting their performance.
7.11 LIST OF REFERENCES
1] Doing Data Science, Rachel Schutt and Cathy O’Neil, O’Reilly, 2013.
2] Mastering Machine Learning with R, Cory Lesmeister, PACKT Publication, 2015.
3] Hands-On Programming with R, Garrett Grolemund, 1st Edition, 2014.
4] An Introduction to Statistical Learning, James, G., Witten, D., Hastie, T., Tibshirani, R., Springer, 2015.
7.12 UNIT END EXERCISES
1] Explain the concept of linear models.
2] State the types of linear models.
3] Illustrate the applications of linear models.
4] What are regression trees?
5] Explain the steps involved in building a regression tree.
6] What is time-series analysis?
7] Explain the term Forecasting.
8] What are classification trees?
9] What do you mean by logistic regression?
10] Describe the classification process using separating hyperplanes.
11] Explain the k-NN algorithm in detail.
8
UNSUPERVISED LEARNING
Unit Structure
8.0 Objectives
8.1 Introduction
8.2 Principal Components Analysis (PCA)
8.2.1 Principal components in PCA
8.2.2 Steps for PCA algorithm
8.2.3 Applications of PCA
8.3 k-means clustering
8.3.1 k-means algorithm
8.3.2 Working of k-means algorithm
8.4 Hierarchical clustering
8.5 Ensemble methods
8.5.1 Categories of ensemble methods
8.5.2 Main types of ensemble methods
8.6 Summary
8.7 List of References
8.8 Unit End Exercises
8.0 OBJECTIVES
To get familiar with the fundamentals and principles involved in unsupervised learning
To get acquainted with the different algorithms associated with unsupervised learning
8.1 INTRODUCTION
As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.”
Unlike supervised learning, we have the input data but no corresponding output data, so unsupervised learning cannot be used to solve a regression or classification problem directly. The objectives of unsupervised learning are to find the underlying structure of a dataset, group the data based on similarities, and represent the dataset in a compressed format.
Consider the following scenario: an input dataset containing photos of various breeds of cats and dogs is provided to the unsupervised learning algorithm. The algorithm is never trained on the provided dataset, so it has no knowledge of its characteristics. Its job is to let the image features speak for themselves: the unsupervised learning algorithm will cluster the image collection into groups based on visual similarities.
The following are a few key arguments for the significance of unsupervised learning:
Finding valuable insights from the data is made easier with the aid of unsupervised learning.
Unsupervised learning is considerably closer to how humans learn to think through their own experiences, which brings it closer to true artificial intelligence.
Unsupervised learning works on unlabeled and uncategorized data, which makes it especially useful.
Unsupervised learning is necessary for handling real-world situations where we do not always have input data with corresponding output.
8.2 PRINCIPAL COMPONENTS ANALYSIS (PCA)
Principal component analysis is an unsupervised learning approach used in machine learning to reduce dimensionality. It is a statistical process that, using an orthogonal transformation, converts observations of correlated features into a set of linearly uncorrelated features; these newly transformed features are called the principal components. It is one of the widely used tools for exploratory data analysis and predictive modelling, and it identifies significant patterns in the given dataset by reducing the number of dimensions while retaining as much of the variance as possible.
Typically, PCA looks for the lowest-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because a high variance indicates a good split between classes, and this is what reduces the dimensionality. Image processing, movie recommendation systems, and power allocation optimization in multiple communication channels are some examples of PCA's practical uses. Since it is a feature extraction technique, it keeps the important variables and discards the unimportant ones.
The PCA algorithm is founded on mathematical ideas such as:
Variance and covariance
Eigenvalues and eigenvectors
Some common terms used in the PCA algorithm:
o Dimensionality: It is the number of features or variables present in the given dataset; more simply, it is the number of columns present in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other also changes. The correlation value ranges from -1 to +1. Here, -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
o Orthogonal: It indicates that variables are not correlated with each other, and hence the correlation between the pair of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between each pair of variables is called the covariance matrix.
8.2.1 Principal components in PCA
The principal components are the newly transformed features, i.e., the output of PCA, as mentioned above. The number of these PCs is either equal to or less than the number of original features in the dataset. A few properties of these principal components are:
Each principal component must be a linear combination of the original features.
The components are orthogonal, so there is no correlation between any pair of them.
The importance of each component decreases when going from 1 to n, making PC-1 the most important and PC-n the least important.
8.2.2 Steps for PCA algorithm
1] Obtaining the dataset
Firstly, we need to take the input dataset and divide it into two halves, X
and Y, where X represents the training set and Y represents the validation
set.
2] Data representation in a structure
We will now create a structure to represent our dataset. We'll use the two-dimensional matrix of the independent variable X as an example. Thus, each row represents a data item and each column represents a feature. The dataset's dimensions are determined by the number of columns.
3] Data standardization
In this stage we will standardize our dataset. In a given column, for instance, features with higher variance are more significant than features with lower variance. If the importance of features is to be independent of their variance, we divide each entry in a column by the column's standard deviation. The resulting matrix is called Z.
4] Determining the covariance of Z
To determine the covariance of Z, we take the transpose of the Z matrix and multiply it by Z. The resulting matrix is the covariance matrix of Z.
5] Calculating the eigenvalues and eigenvectors
Now the eigenvalues and eigenvectors of the resulting covariance matrix must be determined. The eigenvectors of the covariance matrix are the directions of the axes with the highest information (variance), and the eigenvalues are the coefficients attached to these eigenvectors.
6] Sorting the eigenvectors
In this step, we take all the eigenvalues and sort them in decreasing order, from largest to smallest, and simultaneously sort the corresponding eigenvectors accordingly in the matrix P. The resulting matrix is called P*.
7] Calculating the new features or principal components
Here we will calculate the new features. To do this, we multiply the P* matrix by Z. Each observation in the resulting matrix Z* is a linear combination of the original features, and the columns of Z* are independent of one another.
8] Removing less significant or irrelevant features from the new dataset
Now that the new feature set is in place, we decide what to keep and what to remove: only the relevant or significant features are retained in the new dataset, and the unimportant ones are excluded.
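A compact sketch of these steps using scikit-learn's PCA is given below; scikit-learn carries out the projection internally via singular value decomposition rather than an explicit covariance eigen-decomposition, and the Iris data and the choice of two components are illustrative assumptions.

```python
# Hypothetical sketch: reducing a dataset to two principal components with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Step 3 in the text: standardize the features so each column has unit variance.
Z = StandardScaler().fit_transform(X)

# Steps 4-8: covariance, eigen-decomposition, sorting and projection are handled by PCA.
pca = PCA(n_components=2)
Z_star = pca.fit_transform(Z)               # the new features (principal components)

print(pca.explained_variance_ratio_)        # share of variance kept by each component
print(Z_star[:5])                           # first five transformed observations
```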
8.2.3 Applications of PCA
PCA is primarily utilized as a dimensionality reduction technique in a
variety of AI applications, including image compression and computer
vision.
If the data has a high dimension, PCA can also be used to uncover hidden patterns. Banking, data mining, psychology, and other industries are just a few of the fields in which PCA is applied.
8.3 K-MEANS CLUSTERING
Clustering problems in machine learning or data science are solved using the unsupervised learning algorithm K-Means Clustering.
8.3.1 k-means algorithm
The unsupervised learning algorithm K-Means Clustering divides the unlabeled dataset into various clusters. Here, K specifies how many pre-defined clusters must be produced as part of the process; for example, if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.
The unlabeled dataset is divided into k separate clusters through an iterative process, with each data point belonging to only one group of points that share similar characteristics.
It gives us the ability to divide the data into different groups and provides a practical way of automatically discovering the groups in an unlabeled dataset, without the need for any training.
Because the algorithm is centroid-based, each cluster has a centroid assigned to it. The primary goal of the algorithm is to minimize the sum of distances between each data point and its corresponding cluster centroid.
The algorithm takes an unlabeled dataset as input, divides it into k clusters, and repeats the process until the best clusters are found. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for the K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. The data points that are close to a particular k-center form a cluster. As a result, each cluster is distinct from the others and contains data points with some commonality.
The K-means Clustering Algorithm is explained in the diagram below:
8.3.2 Working of k-means algorithm
The following steps illustrate how the K-means algorithm works:
Step-1: Choose the number K to decide the number of clusters.
Step-2: Select K random points or centroids (these may or may not be points from the input dataset).
Step-3: Assign each data point to its nearest centroid, which will create the K predefined clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, reassigning each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise, finish.
Step-7: The model is ready.
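Before walking through the visual example, here is a minimal sketch of the same procedure with scikit-learn's KMeans; the synthetic blobs and K=2 are assumptions chosen to mirror the two-cluster walkthrough below.

```python
# Hypothetical sketch: K-means clustering with scikit-learn (K=2).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with two natural groups (illustrative only).
X, _ = make_blobs(n_samples=200, centers=2, random_state=42)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)              # cluster index (0 or 1) for every point

print(kmeans.cluster_centers_)              # final centroids after convergence
print(labels[:10])                          # cluster assignments of the first ten points
```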
Let's analyze the visual plots in order to comprehen d the
aforementioned steps:
Consider that there are two variables, M1 and M2. The following shows the x-y axis scatter plot of these two variables:
o Let's take the number k of clusters, i.e., K=2, to identify the dataset and to put it into different clusters. It means that here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics we have already studied for calculating the distance between two points. So, we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster and will find the new centroids as below:
o Next, we will reassign each data point to the new closest centroid. For this, we will repeat the same process of finding a median line. The median will look like the below image:
o From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
o As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
8.4 HIERARCHICAL CLUSTERING
In a hierarchical clustering method, data are grouped into clusters arranged in a tree structure. Every data point is first treated as a separate cluster. The following steps are then carried out repeatedly:
Identify the two clusters that are closest to one another, and merge the two most similar clusters. These steps are repeated until all of the clusters are merged together.
The goal of hierarchical clustering is to create a hierarchy of nested clusters. This hierarchy is depicted graphically by a dendrogram, a tree-like diagram that records the sequence of merges or splits: an inverted tree that shows the order in which elements are combined (bottom-up view) or clusters are broken apart (top-down view).
Hierarchical clustering is a data mining technique that builds a hierarchical representation of the clusters in a dataset. Each data point is initially treated as an independent cluster, and from there the algorithm iteratively merges the nearest clusters until a stopping criterion is met. The outcome of hierarchical clustering is a dendrogram, the tree-like structure that shows the hierarchical links between the clusters.
Compared to other clustering techniques, hierarchical clustering has a variety of benefits, such as:
1. The ability to handle non-convex clusters, as well as clusters of various densities and sizes.
2. The ability to deal with noisy and missing data.
3. The ability to display the data's hierarchical structure, which is useful for understanding the connections between the clusters.
It does, however, have several shortcomings, such as:
1. The requirement for a threshold to stop clustering and establish the total number of clusters.
2. High processing costs and memory needs, particularly for huge datasets.
3. Sensitivity of the results to the initial conditions, linkage criterion, and distance metric.
In conclusion, hierarchical clustering is a data mining technique that groups related data points into clusters by giving the clusters a hierarchical structure. The technique can handle various data formats and show the connections between the clusters, but the results can be sensitive to the chosen settings and the computational cost can be large.
1. Agglomerative: At first, treat each data point as a separate cluster. Then, at each step, merge the closest pair of clusters. It uses a bottom-up approach: every data point is first viewed as a distinct cluster, and clusters are merged with other clusters at each iteration until only one cluster remains.
Agglomerative hierarchical clustering uses the following algorithm:
Treat each data point as a separate cluster.
Determine how similar each cluster is to each of the other clusters (calculate the proximity matrix).
Merge the clusters that are most similar to one another or closest together.
Recalculate the proximity matrix for the new clusters.
Repeat steps 3 and 4 until only one cluster remains.
Let's look at this algorithm's visual representation using a dendrogram.
Let’s say we have six data points A, B, C, D, E, and F.
Figure: Agglomerative Hierarchical clustering
Step-1: Consider each letter as a single cluster and calculate the distance of each cluster from all the other clusters.
Step-2: Comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them in the second step; similarly for clusters (D) and (E). At the end we get the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters ([(DE), (F)]) together to form new clusters, [(A), (BC), (DEF)].
Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
Step-5: At last, the two remaining clusters are merged together to form a single cluster, [(ABCDEF)].
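A short sketch of agglomerative clustering and its dendrogram using SciPy is shown below; the six labelled toy points and the Ward linkage are assumptions for illustration.

```python
# Hypothetical sketch: agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Six toy points standing in for A, B, C, D, E, F (illustrative coordinates).
points = np.array([[1.0, 1.0], [2.0, 1.0], [2.2, 1.1],
                   [8.0, 8.0], [8.3, 8.1], [5.0, 5.0]])
labels = ["A", "B", "C", "D", "E", "F"]

Z = linkage(points, method="ward")                  # the merge history (proximity updates)
clusters = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters

print(dict(zip(labels, clusters)))
# dendrogram(Z, labels=labels)  # uncomment to plot the merge tree (requires matplotlib)
```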
2. Divisive:
Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start by treating all of the data points as a single cluster, and in every iteration we separate out the data points that are not comparable to the rest of their cluster. In the end, we are left with N clusters.
Figure: Divisive Hierarchical clustering
8.5 ENSEMBLE METHODS
Ensemble methods are machine learning techniques that combine multiple base models to create a single, optimal predictive model. By mixing numerous models rather than relying on just one, ensemble approaches seek to increase the accuracy of the results, and the combined models considerably improve accuracy in practice. Because of this, ensemble approaches in machine learning have gained prominence.
8.5.1 Categories of ensemble methods
Ensemble methods fall into two main categories: sequential ensemble techniques and parallel ensemble techniques. In sequential ensemble approaches, such as adaptive boosting (AdaBoost), the base learners are generated consecutively, which encourages dependency between them. The model's performance is then enhanced by giving more weight to examples that were previously misclassified.
In parallel ensemble approaches, such as random forest, the base learners are generated in parallel, which encourages independence among them. The error resulting from averaging is greatly decreased by this independence of the base learners.
The majority of ensemble techniques use a single base learning algorithm, which makes all the base learners homogeneous. Base learners that are of the same type and have comparable traits are referred to as homogeneous base learners. Some approaches instead use heterogeneous base learners, i.e., learners of different types, producing heterogeneous ensembles.
8.5.2 Main types of ensemble methods
1] Bagging
Bagging, short for bootstrap aggregating, is commonly used in classification and regression. It improves the accuracy of models, typically built on decision trees, by greatly reducing variance. Reducing variance also helps to eliminate the overfitting that many prediction models struggle with.
Bagging has two parts: bootstrapping and aggregation. Bootstrapping is a sampling strategy in which samples are drawn with replacement from the entire population (set); sampling with replacement helps randomize the selection process. The base learning algorithm is then applied to each of the samples.
In bagging, aggregation is used to combine all the possible outcomes of the predictions and randomize the result. Predictions made without aggregation will not be accurate, because all possible outcomes are not taken into account. The aggregate is therefore based either on all of the outputs of the predictive models or on probability estimates obtained by bootstrapping.
Bagging is useful because it creates a single strong learner that is more stable than the individual weak base learners. Moreover, it reduces variance, which lessens overfitting in models. One drawback of bagging is its computational cost; also, applying bagging incorrectly can result in higher bias in models.
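A minimal sketch of bagging with scikit-learn's BaggingClassifier (whose default base learner is a decision tree) follows; the dataset and parameter values are assumptions for illustration.

```python
# Hypothetical sketch: bagging decision trees with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base tree is trained on a bootstrap sample; predictions are aggregated by voting.
bagging = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)

print("test accuracy:", bagging.score(X_test, y_test))
```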
2] Boosting
Boosting is an ensemble strategy that improves future predictions by learning from the errors of previous predictors. The method greatly increases model predictability by combining numerous weak base learners into one strong learner. Boosting works by arranging weak learners in a sequence, so that each learner can learn from the mistakes of the previous one and thereby improve the predictive model.
There are many different types of boosting, such as gradient boosting, Adaptive Boosting (AdaBoost), and XGBoost (Extreme Gradient Boosting). AdaBoost employs weak learners in the form of decision trees, the majority of which include a single split, known as a decision stump. The first decision stump in AdaBoost is built with all observations weighted equally.
Gradient boosting adds predictors to the ensemble progressively, allowing later predictors to correct the errors of earlier ones and thus improving the model's accuracy. New predictors are fitted to offset the consequences of errors in the earlier models. Gradient descent allows the gradient booster to identify and address problems in the learners' predictions.
XGBoost uses gradient-boosted decision trees and offers faster performance; it focuses heavily on the computational efficiency and effectiveness of the target model. Gradient boosted machines are relatively slow to train, since model training must proceed sequentially.
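As a hedged illustration, the sketch below fits AdaBoost, whose default base learners are decision stumps, using scikit-learn; the dataset and hyperparameters are assumptions for demonstration.

```python
# Hypothetical sketch: AdaBoost with decision stumps (trees with a single split).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round re-weights the training points so the next stump focuses on previous mistakes.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)

print("test accuracy:", ada.score(X_test, y_test))
```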
3] Stacking
Stacking, another ensemble method, is also known as stacked generalization. It works by allowing a training algorithm to combine the predictions of several other learning algorithms. Stacking has been used successfully in regression, density estimation, distance learning, and classification. It can also be used to measure the error rate involved during bagging.
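A short sketch of stacked generalization with scikit-learn's StackingClassifier follows; the choice of base learners and the logistic-regression meta-learner are assumptions for illustration.

```python
# Hypothetical sketch: stacked generalization with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners make predictions; a final (meta) learner combines those predictions.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)

print("test accuracy:", stack.score(X_test, y_test))
```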
8.6 SUMMARY
Any machine learning challenge aims to choose a single model that can most accurately forecast the desired result. Rather than creating one model and hoping that it is the best or most accurate predictor we can make, ensemble approaches consider a wide range of models and average them to build one final model.
Unsupervised learning, commonly referred to as unsupervised machine learning, analyses and groups unlabeled datasets using machine learning algorithms. These algorithms identify hidden patterns or data clusters without the assistance of a human.
Unsupervised learning's main objective is to find hidden and interesting patterns in unlabeled data. In contrast to supervised learning, unsupervised learning techniques cannot be used to solve a regression or classification problem directly, because it is unknown what the output values will be. We have also studied several unsupervised techniques and algorithms, along with ways to boost their performance.
In conclusion, unsupervised learning algorithms allow you to accomplish more complex processing tasks. There are various benefits of unsupervised learning: for example, it can take place in real time, so that all of the input data is analyzed and labeled in the presence of learners.
8.7 LIST OF REFEREN CES
1] Doing Data Science, Rachel Schutt and Cathy O’Neil, O’Reilly, 2013.
2] Mastering Machine Learning with R, Cory Lesmeister, PACKT Publication, 2015.
3] Hands-On Programming with R, Garrett Grolemund, 1st Edition, 2014.
4] An Introduction to Statistical Learning, James, G., Witten, D., Hastie, T., Tibshirani, R., Springer, 2015.
8.8 UNIT END EXERCISES
1] Explain the Principal Components Analysis (PCA).
2] What are the principal components in PCA?
3] Explain the steps involved in and applications of the PCA algorithm.
4] Describe k-means clustering.
5] Explain the working of the k-means algorithm.
6] Write a note on Hierarchical clustering.
7] Explain Ensemble methods.
8] What are the categories of ensemble methods?
9] Describe the main types of ensemble methods.