
Unit I

1
DATA SCIENCE TECHNOLOGY STACK

Unit Structure
1.1 Introduction
1.2 Summary
1.3 Unit End Questions
1.4 References

RAPID INFORMATION FACTORY (RIF) ECOSYSTEM
The Rapid Information Factory (RIF) system is a technique and tool used for processing data during development. The Rapid Information Factory is a massively parallel data processing platform capable of processing theoretically unlimited data set sizes.
The Rapid Information Factory (RIF) platform supports five high-level layers:
●Functional Layer:
The functional layer is the core processing capability of the factory. The core functional data processing methodology is the R-A-P-T-O-R framework.
●Retrieve Super Step.
The Retrieve super step supports the interaction between external data sources and the factory.
●Assess Super Step.
The Assess super step supports the data quality clean-up in the factory.
●Process Super Step.
The Process super step converts data into a data vault.
●Transform Super Step.
The Transform super step converts the data vault, via sun modeling, into dimensional models that form a data warehouse.



●Organize Super Step.
The Organize super step sub-divides the data warehouse into data marts.
●Report Super Step.
The Report super step is the virtualization capacity of the factory.
●Operational Management Layer.
●Audit, Balance and Control Layer.
●Utility Layer.
Common components supporting the other layers:
●Maintenance Utilities
●Data Utilities
●Processing Utilities

Business Layer:
Contains the business requirements (functional and non-functional).

Data Science Storage Tools:
●The data science ecosystem has a series of tools that are used to build your solution. By using these tools and techniques you gain rapid insight, and new developments appear each day.
●There are two basic data processing approaches used in practical data science, as given below.

Schema-on-write ecosystem:
●A traditional Relational Database Management System (RDBMS) requires a schema before the data is loaded. The schema describes the organization of the data; it is like a blueprint of how the database is to be constructed.
●A schema is a single structure that represents the logical view of the entire database. It represents how the data is organized and how the items relate to one another.
●It is the responsibility of the database designer, together with the programmer, to design the database so that its logic and structure are well understood.
●A relational database management system is used and designed to store the data.
●To retrieve data from a relational database system, you run specific Structured Query Language (SQL) statements.


●A traditional database management system works only with a schema: processing can begin once the schema is described, and that schema is the single point of view through which the data in the database is described and viewed.
●It stores a dense set of data; all of the data is kept in the data store, and schema-on-write is the widely used methodology for storing such dense data.
●Schema-on-write schemas are purpose-built, which makes them hard to change while they are maintaining data in the database.
●When a large amount of raw data is available for processing, some of that data is discarded during loading, which weakens it for future analysis.
●If important data is not stored in the database, you cannot process that data in later analysis.

Schema-on-read ecosystem:
●A schema-on-read ecosystem does not need a schema up front; you can load the data into the store without one.
●This approach stores minimally processed data values in the store, and the relevant schema is applied during the query phase.
●It can store structured, semi-structured and unstructured data, and it offers great flexibility because structure is applied only when a query is executed.
●These ecosystems are suitable for both experimentation and exploration of data.
●Schema-on-read encourages the generation of fresh, new insight, increases the speed of data generation, and reduces the cycle time between data availability and actionable information.
●Both approaches, schema-on-read and schema-on-write, are essential for data scientists and data engineers to understand for data preparation, modeling, development, and deployment of data into production.
●When you apply schema-on-read to structured, unstructured and semi-structured data, queries can run slowly, because there is no predefined schema supporting fast retrieval from the data store.
●Schema-on-read follows an agile way of working and has the capability to work in the same way a NoSQL database works in its environment.
●Schema-on-read can sometimes throw errors at query time, because three kinds of data are held in the store (structured, unstructured and semi-structured) and there are no predefined rules that guarantee fast, reliable retrieval as there are in a structured database.
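The difference between the two ecosystems can be illustrated with a minimal Python sketch (the table, field and file contents below are invented for illustration): schema-on-write refuses data until a table structure exists, while schema-on-read keeps the raw records and only imposes structure when a query is run.

import json
import sqlite3

# --- Schema-on-write: the table structure must exist before any data is loaded.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT, country TEXT)")
con.execute("INSERT INTO customer VALUES (?, ?, ?)", (1, "Alice", "IE"))
print(con.execute("SELECT name FROM customer WHERE country = 'IE'").fetchall())

# --- Schema-on-read: raw records are kept as-is (here, JSON text lines);
#     structure is only applied at query time, so new fields need no reload.
raw_records = [
    '{"id": 2, "name": "Bob", "country": "DE"}',
    '{"id": 3, "name": "Carol", "country": "DE", "segment": "retail"}',
]
parsed = [json.loads(line) for line in raw_records]          # schema applied now
print([r["name"] for r in parsed if r.get("country") == "DE"])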


Data Lake:
●A data lake is a storage repository for large amounts of raw data, that is, structured, semi-structured and unstructured data.
●It is the place where you can store these three types of data with no fixed limit on size or storage.
●If we compare schema-on-write with a data lake, schema-on-write stores data in a data warehouse with a predefined schema, whereas a data lake stores data with far less predefined structure.
●A data lake imposes little structure on the stored data because it follows the schema-on-read architecture.
●A data lake allows us to transform the raw data, that is, structured, semi-structured and unstructured data, into a structured format so that SQL queries can be performed for analysis.
●Most of the time a data lake is deployed using distributed data object storage, which enables schema-on-read so that business analytics and data mining tools and algorithms can be applied to the data.
●Retrieval of data is fast because no schema is applied on load; data must be accessible without failure or unnecessary complexity.
●A data lake is similar to a real river or lake, where water arrives from many different places; all the small rivers and lakes eventually merge into a big river or lake where a large amount of water is stored, and whenever water is needed it can be used by anyone.
●It is a low-cost and effective way to store large amounts of data in a centralized store for further organizational analysis and deployment.
Figure 1.1

Data Vault:
●Data vault is a database modeling method designed to store long-term historical data, and that history can be controlled through the data vault.
●In a data vault, data comes from different sources, and it is designed in such a way that the data can be loaded in parallel, so that very large implementations can be handled without major redesign or failure.
●Building a data vault is part of the process of transforming a schema-on-read data lake into a schema-on-write structure.
●Data vaults are designed for schema-on-read query requests against the data lake, because schema-on-read increases the speed of generating new data for analysis and implementation.
●A data vault stores a single version of the data and does not distinguish between good data and bad data.
●A data vault is built from three main components or structures: hubs, links, and satellites.

Hub:
●A hub has a unique business key, a low rate of change, and metadata; the data itself is the main source for generating the hubs.
●A hub contains a surrogate key for each hub item and metadata describing the origin of the business key.
●A hub contains a set of unique business keys that will never change over time.
●There are different types of hubs, such as a person hub, time hub, object hub, event hub and location hub. The time hub contains IDNumber, IDTimeNumber, ZoneBaseKey, DateTimeKey and DateTimeValue, and it is connected to the other hubs through links such as Time-Person, Time-Object, Time-Event, Time-Location and Time-Links.
●The person hub contains IDPersonNumber, FirstName, SecondName, LastName, Gender, TimeZone, BirthDateKey and BirthDate, and it is connected to the other hubs through links such as Person-Time, Person-Object, Person-Location, Person-Event and Person-Link.
●The object hub contains IDObjectNumber, ObjectBaseKey, ObjectNumber and ObjectValue, and it is connected to the other hubs through links such as Object-Time, Object-Link, Object-Event, Object-Location and Object-Person.
●The event hub contains IDEventNumber, EventType and EventDescription, and it is connected to the other hubs through links such as Event-Person, Event-Location, Event-Object and Event-Time.


●The location hub contains IDLocationNumber, ObjectBaseKey, LocationNumber, LocationName, Longitude and Latitude, and it is connected to the other hubs through links such as Location-Person, Location-Time, Location-Object and Location-Event.

Link:
●Links play a very important role in recording transactions and associations between business keys. Tables relate to each other depending on their attributes, as one-to-one, one-to-many, many-to-one or many-to-many relationships.
●A link represents and connects only the elements in a business relationship; when one node or link relates to another, data transfers smoothly between them.

Satellites:
●Hubs and links form the structure of the model but hold no chronological or descriptive data; on their own they cannot provide information such as the mean, median, mode, maximum, minimum or sum of the data.
●Satellites are the structures that store detailed information about the related business characteristics of a key, and they hold the large volumes of data in the data vault.
●The combination of these three components, hub, link and satellite, helps data analysts, data scientists and data engineers store the business structure and its information in the vault.
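A minimal sketch of the hub-link-satellite idea, using SQLite; the table and column names below are invented for illustration and are not a full data vault design.

import sqlite3

con = sqlite3.connect(":memory:")
# Hub: only the unchanging business key plus a surrogate key and load metadata.
con.execute("""CREATE TABLE hub_person (
    person_sk INTEGER PRIMARY KEY, id_person_number TEXT UNIQUE,
    load_date TEXT, record_source TEXT)""")
con.execute("""CREATE TABLE hub_location (
    location_sk INTEGER PRIMARY KEY, id_location_number TEXT UNIQUE,
    load_date TEXT, record_source TEXT)""")
# Link: only the relationship between hub keys (here Person-Location).
con.execute("""CREATE TABLE link_person_location (
    link_sk INTEGER PRIMARY KEY,
    person_sk INTEGER REFERENCES hub_person(person_sk),
    location_sk INTEGER REFERENCES hub_location(location_sk),
    load_date TEXT, record_source TEXT)""")
# Satellite: the descriptive, historical detail attached to a hub.
con.execute("""CREATE TABLE sat_person_detail (
    person_sk INTEGER REFERENCES hub_person(person_sk),
    first_name TEXT, last_name TEXT, gender TEXT,
    valid_from TEXT, valid_to TEXT)""")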
Figure 1.2

Data Science Processing Tools:
●Processing is the act of transforming the data in the data lake into a data vault, and then transferring the data vault into a data warehouse.
●Data scientists, data analysts and data engineers use the data science processing tools below to process the data and move it from data vault to data warehouse.

1. Spark:
●Apache Spark is an open source cluster computing framework. Open source means it is freely available on the internet: you can search for Apache Spark, download the source code, and use it as you wish.
●Apache Spark was developed at the AMPLab of the University of California, Berkeley, and the code was later donated to the Apache Software Foundation, which keeps improving it over time to make it more effective, reliable and portable so that it runs on all platforms.
●Apache Spark provides an interface for programmers and developers to interact directly with the system and make data processing parallel, in a way that suits data scientists and data engineers.
●Apache Spark has the capability to process all types and varieties of data against repositories including the Hadoop Distributed File System and NoSQL databases.
●Companies such as IBM hire many data scientists and data engineers who know the Apache Spark project well, so that innovation can happen easily and new features keep arriving.
●Apache Spark can process data very fast because it holds the data in memory, using an in-memory data processing engine.
●It is built on top of the Hadoop Distributed File System, which makes it more efficient, more reliable and more extendable than Hadoop MapReduce.
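A minimal sketch of getting started with Spark from Python, assuming the pyspark package is installed and running in local mode on one machine (the application name is invented):

# A local PySpark session; "local[*]" uses all cores on one machine.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rif-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Distribute a small collection and process it in parallel, entirely in memory.
numbers = sc.parallelize(range(1, 1_000_001))
total = numbers.filter(lambda n: n % 2 == 0).sum()
print("sum of even numbers:", total)

spark.stop()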
Figure 1.3

2. Spark Core:
●Spark Core is the base and foundation of the overall project; it provides the most important facilities: distributed task dispatching, scheduling, and the basic input and output functionality.

●Using Spark Core, you can run more complex queries that help you work with complex environments.
●The distributed nature of the Spark ecosystem lets the same processing you run on a small cluster scale to hundreds or thousands of nodes without any changes.
●Apache Spark uses Hadoop in two ways: one for storage and the second for processing.
●Spark is not a modified version of the Hadoop Distributed File System and is not strictly dependent on Hadoop, because Hadoop has its own features and tools for data storage and data processing; Spark simply uses Hadoop storage as one option.
●Apache Spark has many features that make it compatible and reliable. Speed is one of the most important: applications can run directly on Hadoop data and run up to 100 times faster when the data fits in memory.
●Spark Core supports many languages; it has built-in functions and APIs in Java, Scala and Python, which means you can write applications in Java, Scala, Python and R.
●Spark Core also comes with advanced analytics: it does not only support map and reduce, it also has the capability to support SQL queries, machine learning and graph algorithms.
Figure 1.4

3. Spark SQL:
●Spark SQL is a component on top of Spark Core that presents a data abstraction called DataFrames.
●Spark SQL provides a fast, cluster-wide data abstraction, so that data manipulation can be done with fast computation.
●It enables the user to run SQL/HQL on top of Spark, and with this we can process structured, unstructured and semi-structured data.

●Apache Spark SQL bridges the gap between relational databases and procedural processing. This matters when we want to load data from a traditional system into a data lake ecosystem.
●Spark SQL is Apache Spark's module for working with structured and semi-structured data, and it originated to overcome the limitations of Apache Hive.
●Hive depends on Hadoop's MapReduce engine for execution and processing of data, and it only allows batch-oriented operation.
●Hive lags in performance because it uses MapReduce jobs to execute ad hoc queries, and Hive does not allow you to resume a job if it fails in the middle.
●Spark performs better than Hive in many situations, for example in latency (which for Hive can be measured in hours) and in CPU reservation time.
●You can integrate Spark SQL and query structured and semi-structured data inside Apache Spark.
●Spark SQL follows the RDD model, and it also supports large jobs and mid-query fault tolerance.
●You can easily connect Spark SQL through JDBC and ODBC for business connectivity.

4. Spark Streaming:
●Apache Spark Streaming enables powerful, interactive data analytics applications over live streaming data. In streaming, the data is not fixed; it arrives continuously from different sources.
●The stream divides the incoming input data into small units of data for further analytics and processing at the next level.
●There are multiple levels of processing involved. Live streaming data is received and divided into small parts, or batches, and these batches are then processed by the Spark engine to produce the final stream of results.
●Processing data in Hadoop has very high latency, which means the data is not available in a timely manner, so Hadoop is not suitable for real-time processing requirements.
●Stream processors such as Storm reprocess a record if it was not processed the first time, but this kind of failure handling and latency can lead to data loss or to records being processed more than once.
●In most scenarios, Hadoop is used for batch processing of data, while Apache Spark is used for live streaming of data.
●Apache Spark Streaming helps fix these kinds of issues and provides a reliable, portable, scalable, efficient and integrated system.
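A short sketch of the Spark SQL DataFrame abstraction described above, assuming pyspark in local mode; the table contents and column names are invented. The same DataFrame API is reused by Structured Streaming over unbounded sources.

# Spark SQL: register a DataFrame and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", "IE"), (2, "Bob", "DE"), (3, "Carol", "DE")],
    ["id", "name", "country"],
)
df.createOrReplaceTempView("customer")
spark.sql("SELECT country, COUNT(*) AS n FROM customer GROUP BY country").show()

# Streaming uses the same DataFrame API over an unbounded source, e.g.
# spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
spark.stop()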


Figure 1.5

5. GraphX:
●GraphX is a very powerful graph processing application programming interface for the Apache Spark analytics engine.
●GraphX is the Spark component for graphs and graph-parallel computation.
●GraphX follows the ETL process, that is, Extract, Transform and Load, together with exploratory analysis and iterative graph computation, within a single system.
●Its usage can be seen in Facebook friend networks, LinkedIn connections, Google Maps and internet routers, which use this kind of tool for better response and analysis.
●A graph is an abstract data type used to implement the directed and undirected graph concepts from graph theory in mathematics.
●In graph theory, each piece of data is associated with some other data through an edge, which can carry, for example, a numeric value.
●Every edge and every node (or vertex) has user-defined properties and values associated with it.
Figure 1.6
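A tiny sketch of the vertex/edge property idea described above, using plain Python dictionaries; the names, relations and weights are invented. GraphX itself exposes a comparable property-graph model through its Scala API.

# A minimal property graph: vertices and edges both carry user-defined properties.
vertices = {
    "v1": {"name": "Alice", "role": "analyst"},
    "v2": {"name": "Bob", "role": "engineer"},
    "v3": {"name": "Carol", "role": "manager"},
}
edges = [
    ("v1", "v2", {"relation": "works_with", "weight": 3}),
    ("v2", "v3", {"relation": "reports_to", "weight": 1}),
]

# A simple graph-parallel style step: sum the incoming edge weights per vertex.
in_weight = {v: 0 for v in vertices}
for src, dst, props in edges:
    in_weight[dst] += props["weight"]
print(in_weight)   # {'v1': 0, 'v2': 3, 'v3': 1}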

●GraphX has great flexibility for working with graphs and graph computations; it combines ETL (Extract, Transform and Load), exploratory analysis and iterative graph computation within a single system.
●Speed is one of its most important points: it is comparable with the fastest specialized graph systems, while still providing fault tolerance and ease of use.
●It offers a growing library of graph algorithms, which gives more flexibility and reliability.

6. Mesos:
●Apache Mesos is an open source cluster manager, and it was developed at the University of California, Berkeley.
●It provides the resource isolation and sharing required across distributed applications.
●The Mesos software provides resource sharing in a fine-grained manner, so that utilization can be improved.
●Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it specializes in running Kafka, Cassandra, Spark and Akka.
●It can handle workloads in distributed environments using dynamic sharing and isolation.
●Apache Mesos can be deployed to manage large-scale data distribution in clustered environments.
●The resources available in the existing systems, that is, the machines or nodes of the cluster, are grouped together into a single pool so that the load can be optimized.
●Apache Mesos is, in a sense, the opposite of virtualization: in virtualization one physical resource is shared among multiple virtual resources, while in Mesos multiple physical resources are grouped together to form a single virtual machine.
Figure 1.7


7. Akka:
●Akka is an actor-based, message-driven runtime for building concurrent, elastic and resilient processes.
●An actor can be controlled and limited to perform only its intended task. Akka is an open source library, or toolkit.
●Akka is used to create distributed and fault-tolerant applications, and the library can be integrated into the Java Virtual Machine (JVM) to support the language.
●Akka is written in Scala and integrates with the Scala programming language; it helps developers deal with explicit locking and thread management.
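Akka itself is a Scala/Java toolkit, but the actor idea described here can be sketched in plain Python (the class and message names are invented): each actor owns its state and reacts only to messages taken from its mailbox, so no explicit locks are needed.

import queue
import threading

class CounterActor:
    """A minimal actor: private state, a mailbox, and a message loop."""
    def __init__(self):
        self.count = 0                       # state owned by this actor only
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):                 # the only way to interact with the actor
        self.mailbox.put(message)

    def _run(self):
        while True:                          # react to one message at a time
            message = self.mailbox.get()
            if message == "increment":
                self.count += 1
            elif message == "stop":
                break

actor = CounterActor()
for _ in range(5):
    actor.tell("increment")
actor.tell("stop")
actor._thread.join()
print(actor.count)   # 5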
Figure 1.8

●An actor is an entity that communicates with other actors by passing messages to them, and each actor has its own state and behavior.
●Just as in object-oriented programming everything is an object, in Akka everything is an actor: it is an actor-based, message-driven system.
●In other words, an actor is an object that encapsulates its state and behavior.

8. Cassandra:
●Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.
●Cassandra can be used both as a real-time operational data store for online transactional applications and as a read-intensive database for large-scale systems.


●Cassandra is designed as a peer-to-peer system of continuously running nodes, instead of having master or named nodes, to ensure that there is no single point of failure.
●Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data; it is a type of NoSQL database.
●A NoSQL database is a database that provides a mechanism to store and retrieve data that is modeled differently from the tabular relations of a relational database.
●A NoSQL database uses different data structures compared to a relational database, and it supports only a very simple query language.
●A NoSQL database has no fixed schema and does not provide full data transactions.
●Cassandra is used by some very well-known companies, such as Facebook, Twitter, Cisco and Netflix.
●The components of Cassandra are the node, data center, commit log, cluster, mem-table, SSTable and bloom filter.
Figure 1.9
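A minimal sketch of talking to Cassandra from Python, assuming the DataStax cassandra-driver package and a node running on localhost; the keyspace, table and values are invented.

# Assumes the "cassandra-driver" package and a Cassandra node on 127.0.0.1.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.person (
        id_person_number text PRIMARY KEY, first_name text, last_name text)
""")
session.execute(
    "INSERT INTO demo.person (id_person_number, first_name, last_name) VALUES (%s, %s, %s)",
    ("P-001", "Alice", "Smith"),
)
for row in session.execute("SELECT * FROM demo.person"):
    print(row.id_person_number, row.first_name)

cluster.shutdown()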

Figure 1.10

9. Kafka:
●Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
●Apache Kafka is a highly scalable, reliable, fast and distributed system. Kafka is suitable for both offline and online message consumption.
●Kafka messages are stored on disk and replicated within the cluster to prevent data loss.
●Kafka is distributed, partitioned, replicated and fault tolerant, which makes it reliable.
●The Kafka messaging system scales easily without downtime, which makes it scalable. Kafka has high throughput for both publishing and subscribing to messages, and it can store data up to terabytes.
●Kafka provides a unique platform for handling real-time data feeds, and it can deliver large amounts of data to diverse consumers.
●Kafka persists all data to disk, which essentially means that all writes go to the page cache of the operating system (RAM). This makes it very efficient to transfer data from the page cache to a network socket.
Figure 1.11
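A minimal publish/subscribe sketch, assuming the kafka-python package and a broker listening on localhost:9092; the topic name, key and payload are invented.

# Assumes the "kafka-python" package and a broker on localhost:9092.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", key=b"device-42", value=b'{"temp": 21.5}')
producer.flush()                  # block until the message is acknowledged

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,     # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.key, message.value)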

Different programming languages and tools in data science processing:

1. Elastic Search:
●Elasticsearch is an open source, distributed search and analytics engine.
●It is scalable, meaning it can grow from any starting point, and reliable, meaning it can be trusted and managed without stress.
●It combines the power of search with the power of analytics, so that developers, programmers, data engineers and data scientists can work smoothly with structured, unstructured and time-series data.
●Elasticsearch is open source, so anyone can download it and work with it; it is developed in Java, and many large organizations use this search engine for their needs.
●It enables users to explore very large amounts of data at very high speed.
●It can be used as a replacement for document and data stores such as MongoDB.
●Elasticsearch is one of the most popular search engines and is used by organizations such as Google, Stack Overflow and GitHub.
●Elasticsearch is an open source search engine and is available under the Apache License, version 2.0.

2. R Language:
●R is a programming language used for statistical computing and graphics.
●R is used by data engineers, data scientists, statisticians and data miners for developing software and performing data analytics.
●A core requirement before learning R is to understand its library and package concepts and how to work with them easily.
●Related R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2 and reshape.
●R is freely available under the GNU General Public License, and it supports many platforms, such as Windows, Linux/UNIX and Mac.
●R has built-in capabilities to integrate with procedural languages written in C, C++, Java, .NET and Python.
●R has strong capacity and potential for handling data and data storage.


3. Scala:
●Scala is a general-purpose programming language that supports functional programming and a strong static type system.
●Many data science projects and frameworks are built using Scala, because it has the capability and potential to handle them well.
●Scala integrates features of object-oriented and functional languages, and it interoperates with code written in languages such as Java.
●The types and behavior of objects are described by classes, and a class can be extended by another class, inheriting its properties.
●Scala supports higher-order functions: a function can be passed to and called by another function in code.
●Once a Scala program is compiled, it is converted into bytecode (machine-understandable code) that is executed by the Java Virtual Machine.
●This means that Scala and Java programs are compiled and executed on the same JVM, so you can easily move from Java to Scala and vice versa.
●Scala lets you use and import existing Java classes, objects and functions, because Scala and Java both run on the Java Virtual Machine, and you can also create your own classes and objects.
●Instead of requiring thousands of lines of code, Scala reduces the code so that it stays readable, reliable and portable, helping developers and programmers write code in an easier way.

4. Python:
●Python is a programming language; it can be used on a server to create web applications.
●Python can be used for web development, mathematics and software development, and it can connect to databases to create and modify data.
●Python can handle large amounts of data and is capable of performing complex tasks on data.
●Python is reliable, portable and flexible, and works across different platforms such as Windows, Mac and Linux.
●Compared with other programming languages, Python is easy to learn and can perform simple as well as complex tasks; it reduces the number of lines of code and helps programmers and developers work in an easy, friendly manner.
●Python supports object-oriented and functional programming and works well with structured data.


●Python supports dynamic data types and dynamic type checking.
●Python is an interpreted language, and its philosophy is to reduce the number of lines of code.

SUMMARY
This chapter helps you recognize the basics of data science tools and their influence on modern data lake development. You will discover techniques for transforming a data vault into a data warehouse bus matrix. It explains the use of Spark, Mesos, Akka, Cassandra and Kafka to tame your data science requirements, and it guides you in the use of Elasticsearch and MQTT (MQ Telemetry Transport) to enhance your data science solutions. It helps you recognize the influence of R as a creative visualization solution, and it introduces the impact and influence of programming languages such as R, Python and Scala on the data science ecosystem.

UNIT END QUESTIONS
1. Explain the basics of the business layer.
2. Explain the basics of the utility layer.
3. Explain the basics of the operational management layer.
4. Explain the basics of the audit, control and balance layer.

REFERENCES
●Andreas Francois Vermeulen, Practical Data Science, Apress, 2018.
●Sinan Ozdemir, Principles of Data Science, Packt, 2016.
●Joel Grus, Data Science from Scratch, O'Reilly, 2015.
●Joel Grus, Data Science from Scratch: First Principles with Python, Shroff Publishers, 2017.
●N. C. Das, Experimental Design in Data Science with Least Resources, Shroff Publishers, 2018.


2
VERMEULEN-KRENNWALLNER-HILLMAN-CLARK

●Vermeulen-Krennwallner-Hillman-Clark (VKHCG) is a small international group of companies, and it consists of four subcompanies: 1. Vermeulen PLC, 2. Krennwallner AG, 3. Hillman Ltd, 4. Clark Ltd.

1. Vermeulen PLC:
●Vermeulen PLC is a data processing company that processes all the data within the group's companies.
●This is the company for which we hire most of the data engineers and data scientists to work.
●The company supplies data science tools; networks, servers and communication systems; internal and external web sites; decision science; and process automation.

2. Krennwallner AG:
●This is an advertising and media company that prepares the advertising and media information required for customers.
●Krennwallner supplies advertising on billboards, and advertising and content management for online delivery.
●Using the many records and data available on the internet about media streams, it analyses which media streams are watched by customers, how many times, and which content on the internet is watched the most.
●Using surveys, it selects the content for the billboards and understands how many times customers visit each channel.

3. Hillman Ltd:
●This is a logistics and supply chain company; it supplies logistics data around the world for the business.
●This includes client warehousing, international shipping, and home-to-home logistics.


4. Clark Ltd:
●This is the financial company that processes all the financial data required for the group, including support money, venture capital planning, and putting money on the share market.

Scala:
●Scala is a general-purpose programming language that supports functional programming and a strong static type system.
●Most of the group's data science projects and frameworks are built using Scala, because of its capability and potential.
●Scala integrates features of object-oriented and functional languages, and it interoperates with code written in languages such as Java.
●The types and behavior of objects are described by classes, and a class can be extended by another class, inheriting its properties.
●Scala supports higher-order functions: a function can be passed to and called by another function in code.

Apache Spark:
●Apache Spark is an open source cluster computing framework. Open source means it is freely available on the internet: you can search for Apache Spark, download the source code, and use it as you wish.
●Apache Spark was developed at the AMPLab of the University of California, Berkeley, and the code was later donated to the Apache Software Foundation, which keeps improving it over time to make it more effective, reliable and portable so that it runs on all platforms.
●Apache Spark provides an interface for programmers and developers to interact directly with the system and make data processing parallel, in a way that suits data scientists and data engineers.
●Apache Spark has the capability to process all types and varieties of data against repositories including the Hadoop Distributed File System and NoSQL databases.
●Companies such as IBM hire many data scientists and data engineers who know the Apache Spark project well, so that innovation can happen easily and new features keep arriving.

Apache Mesos:
●Apache Mesos is an open source cluster manager, and it was developed at the University of California, Berkeley.
●It provides the resource isolation and sharing required across distributed applications.



●The Mesos software provides resource sharing in a fine-grained manner, so that utilization can be improved.
●Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it specializes in running Kafka, Cassandra, Spark and Akka.

Akka:
●Akka is an actor-based, message-driven runtime for building concurrent, elastic and resilient processes.
●An actor can be controlled and limited to perform only its intended task. Akka is an open source library, or toolkit.
●Akka is used to create distributed and fault-tolerant applications, and the library can be integrated into the Java Virtual Machine (JVM) to support the language.
●Akka is written in Scala and integrates with the Scala programming language; it helps developers deal with explicit locking and thread management.

Apache Cassandra:
●Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.
●Cassandra can be used both as a real-time operational data store for online transactional applications and as a read-intensive database for large-scale systems.
●Cassandra is designed as a peer-to-peer system of continuously running nodes, instead of having master or named nodes, to ensure that there is no single point of failure.
●Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data; it is a type of NoSQL database.
●A NoSQL database is a database that provides a mechanism to store and retrieve data that is modeled differently from the tabular relations of a relational database.

Kafka:
●Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
●Apache Kafka is a highly scalable, reliable, fast and distributed system. Kafka is suitable for both offline and online message consumption.
●Kafka messages are stored on disk and replicated within the cluster to prevent data loss.
●Kafka is distributed, partitioned, replicated and fault tolerant, which makes it reliable.



●The Kafka messaging system scales easily without downtime, which makes it scalable. Kafka has high throughput for both publishing and subscribing to messages, and it can store data up to terabytes.

Python:
●Python is a programming language; it can be used on a server to create web applications.
●Python can be used for web development, mathematics and software development, and it can connect to databases to create and modify data.
●Python can handle large amounts of data and is capable of performing complex tasks on data.
●Python is reliable, portable and flexible, and works across different platforms such as Windows, Mac and Linux.
●Python can be installed on all common operating systems, for example Windows, Linux and Mac, and it works on all these platforms. You can gain much more knowledge by installing and working on all three platforms for data science and data engineering.
●To install the packages required for data science in Python on Ubuntu, run the following command:
sudo apt-get install python3 python3-pip python3-setuptools
●To install the packages required for data science in Python on Linux (Red Hat based), run the following command:
sudo yum install python3 python3-pip python3-setuptools
●To install Python and the packages required for data science on Windows, download it from:
https://www.python.org/downloads/

Python Libraries:
●A Python library is a collection of functions and methods that allows you to perform many actions without writing your own code.

Pandas:
●Pandas stands for "panel data", and it is the core library for data manipulation and data analysis.
●It provides one-dimensional and multi-dimensional data structures for data analysis.
●To install pandas on UBUNTU, use the following command:
sudo apt-get install python-pandas
●To install pandas on LINUX (Red Hat based), use the following command:
yum install python-pandas




●To install pandas on WINDOWS, use the following command:
pip install pandas

Matplotlib:
●Matplotlib is used for data visualization and is one of the most important packages of Python.
●Matplotlib is used to display and visualize 2D data, and it is written in Python.
●It can be used from Python scripts, Jupyter notebooks, and web application servers.
●To install the Matplotlib library on UBUNTU, use the following command:
sudo apt-get install python-matplotlib
●To install the Matplotlib library on LINUX (Red Hat based), use the following command:
sudo yum install python-matplotlib
●To install the Matplotlib library on WINDOWS, use the following command:
pip install matplotlib

NumPy:
●NumPy is a fundamental package of the Python language and is used for numerical computation.
●NumPy is used together with the SciPy and Matplotlib packages of Python, and it is freely available on the internet.

SymPy:
●SymPy is a Python library used for symbolic mathematics, and it can work with complex algebraic formulas.

R:
●R is a programming language used for statistical computing and graphics.
●R is used by data engineers, data scientists, statisticians and data miners for developing software and performing data analytics.
●A core requirement before learning R is to understand its library and package concepts and how to work with them easily.
●Related R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2 and reshape.
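A small sketch tying together the Python libraries described above, pandas, NumPy and Matplotlib; the column names and values are invented sample data.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: generate some sample numbers.
rng = np.random.default_rng(seed=42)
sales = rng.integers(low=10, high=100, size=12)

# pandas: hold them as a labelled table and summarize them.
df = pd.DataFrame({"month": range(1, 13), "sales": sales})
print(df.describe())

# Matplotlib: visualize the table.
plt.plot(df["month"], df["sales"], marker="o")
plt.xlabel("month")
plt.ylabel("sales")
plt.title("Monthly sales (sample data)")
plt.show()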



SUMMARY
This chapter introduces new concepts that enable us to share insights through a common understanding and terminology. It defines the Data Science Framework in detail, while introducing the Homogeneous Ontology for Recursive Uniform Schema (HORUS). It takes you on a high-level tour of the top layers of the framework, by explaining the fundamentals of the business layer, utility layer and operational management layer, plus the audit, balance, and control layers. It discusses how to engineer a layered framework for improving the quality of data science when you are working in a large team, in parallel, with common business requirements.

UNIT END QUESTIONS
1. Define the Data Science Framework. Explain the Homogeneous Ontology for Recursive Uniform Schema.
2. Discuss the Cross-Industry Standard Process for Data Mining (CRISP-DM).
3. State and explain the top layers of the data science framework.
4. Explain the Rapid Information Factory ecosystem.
5. Explain schema-on-write and schema-on-read.
6. Explain the data lake and the data vault.
7. What is a data vault? Explain hubs, links and satellites with respect to the data vault.
8. Explain Spark and its components as data science processing tools.
9. Explain Kafka and its components as data science processing tools.
10. Explain Mesos, Akka and Cassandra as data science processing tools.
11. List and explain different programming languages used in data science processing.
12. What is MQTT? Explain the use of MQTT in data science.

REFERENCES
●Andreas Francois Vermeulen, Practical Data Science, Apress, 2018.
●Sinan Ozdemir, Principles of Data Science, Packt, 2016.
●Joel Grus, Data Science from Scratch, O'Reilly, 2015.
●Joel Grus, Data Science from Scratch: First Principles with Python, Shroff Publishers, 2017.
●N. C. Das, Experimental Design in Data Science with Least Resources, Shroff Publishers, 2018.



Unit II

3
THREE MANAGEMENT LAYERS

Unit Structure
3.0 Objectives
3.1 Introduction
3.2 Operational Management Layer
3.2.1 Definition and Management of Data Processing Stream
3.2.2 Ecosystem Parameters
3.2.3 Overall Process Scheduling
3.2.4 Overall Process Monitoring
3.2.5 Overall Communication
3.2.6 Overall Alerting
3.3 Audit, Balance, and Control Layer
3.4 Yoke Solution
3.5 Functional Layer
3.6 Data Science Process
3.7 Unit End Questions
3.8 References

3.0 OBJECTIVES
●The objective is to explain in detail the core operations of the three management layers, that is, the operational management layer and the audit, balance, and control layer, together with the functional layer.

3.1 INTRODUCTION
●The three management layers are a very important part of the framework.
●They watch the overall operations in the data science ecosystem and make sure that things are happening as per plan.
●If things are not going as per plan, they have contingency actions in place for recovery or cleanup.



3.2 OPERATIONAL MANAGEMENT LAYER
●Operations management is the area of the ecosystem responsible for designing and controlling the process chains of a data science environment.
●This layer is the center of the complete processing capability in the data science ecosystem.
●This layer stores what you want to process, along with every processing schedule and workflow for the entire ecosystem.
●This area enables us to see an integrated view of the entire ecosystem; it reports the status of each and every process in the ecosystem, and it is where we plan our data science processing pipelines.
●We record the following in the operations management layer:
●Definition and management of the data processing stream
●Ecosystem parameters
●Overall process scheduling
●Overall process monitoring
●Overall communication
●Overall alerting

Definition and Management of the Data Processing Stream:
●The processing-stream definitions are the building blocks of the data science environment.
●This section of the ecosystem stores all currently active processing scripts.
●Management here refers to definition management: it describes the workflow of the scripts throughout the ecosystem and manages the correct execution order according to the workflow designed by the data scientist.

Ecosystem Parameters:
●The processing parameters are stored in this section, where it is ensured that a single location is available for all the system parameters.
●In any production system, for every existing customer, all the parameters can be placed together in a single location, and calls are made to this location every time a parameter is needed.
●Two ways to maintain a central location for all parameters are:
1. Having a text file which we can import into every processing script.
2. A standard parameter setup script that defines a parameter database which we can import into every processing script.



●Example: an ecosystem setup phase (a sketch of such a setup script follows below).
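A minimal sketch of a parameter setup script; the company name, directory layout and parameter names below are invented for illustration. Every processing script would import this single module instead of hard-coding its own values.

# ecosystem_setup.py - one central place for the ecosystem parameters (sketch).
import os

COMPANY = "vkhcg"
BASE_DIR = os.path.join(os.path.expanduser("~"), "rif", COMPANY)

PARAMETERS = {
    "raw_zone":       os.path.join(BASE_DIR, "01-raw"),
    "vault_zone":     os.path.join(BASE_DIR, "02-vault"),
    "warehouse_zone": os.path.join(BASE_DIR, "03-warehouse"),
    "batch_size":     10_000,
    "log_level":      "INFO",
}

def setup():
    """Create the storage zones once, so every script starts from the same layout."""
    for key, value in PARAMETERS.items():
        if key.endswith("_zone"):
            os.makedirs(value, exist_ok=True)
    return PARAMETERS

if __name__ == "__main__":
    print(setup())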
Overall Process Scheduling:
●The scheduling plan, among other things, is stored in this section; it enables centralized control and visibility of the complete scheduling plan for the entire system.
●One of the scheduling methods is the Drum-Buffer-Rope method.
Figure 3.1: Original Drum-Buffer-Rope use

The Drum-Buffer-Rope Method:
●It is standard practice to identify the slowest process among all of them.
●Once identified, it is used to control the speed of the complete pipeline.
●This is done by tying, or binding, the remaining processes of the pipeline to this process.
●The method implies that:
●the "drum" is placed at the slow part of the pipeline, to set the processing pace;
●the "rope" is attached to all the processes from beginning to end of the pipeline, which makes sure that no processing is done that is not attached to the drum.
●This approach ensures that all the processes in the pipeline complete more efficiently, as no process enters or leaves the pipeline without being recorded by the drum's beat.
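A small sketch of the drum-buffer-rope idea; the stage names, item counts and timings are invented. A bounded queue plays the role of the buffer and rope, so the fast upstream stage can never run further ahead than the slow "drum" stage allows.

import queue
import threading
import time

buffer = queue.Queue(maxsize=5)   # the "rope": upstream blocks when 5 items are waiting

def fast_feeder():                # an upstream process that could run much faster
    for item in range(20):
        buffer.put(item)          # blocks once the buffer is full
        print(f"fed item {item}")

def drum():                       # the slowest process sets the pace for everything
    for _ in range(20):
        item = buffer.get()
        time.sleep(0.1)           # pretend this is the expensive step
        print(f"drum processed item {item}")

threads = [threading.Thread(target=fast_feeder), threading.Thread(target=drum)]
for t in threads:
    t.start()
for t in threads:
    t.join()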


Process Monitoring:
●The central monitoring process makes sure that there is a single, unified view of the complete system.
●We should always ensure that the monitoring of our data science is done from a single point.
●Without central monitoring, running different data science processes on the same ecosystem makes management a difficult task.

Overall Communication:
●Operations management handles all communication from the system; it makes sure that any activities that are happening are communicated to the system.
●To make sure that all our data science processes are tracked, we may use a complex communication process.

Overall Alerting:
●The alerting section of the operations management layer uses communication to inform the correct person, at the correct time, of the correct status of the complete system.

3.3 AUDIT, BALANCE, AND CONTROL LAYER
●Any process currently executing is controlled by the audit, balance, and control layer.
●It is this layer that has the engine that makes sure that every processing request is completed by the ecosystem according to the plan.
●This is the only area where you can observe which processes are currently running within your data science environment.
●It records the following information:
•Process-execution statistics
•Balancing and controls
•Rejects- and error-handling
•Fault codes management



3.3.1 Audit:
●An audit is an examination of the ecosystem that is systematic and independent.
●This sublayer records which processes are running at any given point within the ecosystem.
●Data scientists and engineers use the information collected here to better understand the system and to plan future improvements to the processing.
●The audit in the data science ecosystem consists of a series of observers that record prespecified processing indicators related to the ecosystem.

The following are good indicators for audit purposes:
●Built-in Logging
●Debug Watcher
●Information Watcher
●Warning Watcher
●Error Watcher
●Fatal Watcher
●Basic Logging
●Process Tracking
●Data Provenance
●Data Lineage

Built-in Logging:
●It is always good to design your logging around an organized, prespecified location; this ensures that every relevant log entry is captured in one location.
●Changing the internal or built-in logging process of the data science tools should be avoided, as this makes any future upgrade complex and very costly to correct.
●A built-in logging mechanism, together with a cause-and-effect analysis system, allows you to handle more than 95% of all issues that can arise in the ecosystem.
●Since there are five logging levels, it is good practice to have five watchers, one for each logging location, independent of each other, as described below.

Debug Watcher:
●This is the most verbose logging level.
●If any debug logs are discovered in the ecosystem, an alarm should be raised, indicating that the tool is using precious processing cycles to perform low-level debugging.

Information Watcher:
●The information watcher logs information that is beneficial to the running and management of a system.



;It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.
Warning Watcher:
;Warning is usually used for exceptions that are handled or for other important log events.
;Usually this means that the issue was handled by the tool, which also took corrective action for recovery.
;It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.
Error Watcher:
;An error log records all unhandled exceptions in the data science tool.
;An error is a state of the system. This state is not good for the overall processing, since it normally means that a specific step did not complete as expected.
;In case of an error, the ecosystem should handle the issue and take the necessary corrective action for recovery.
;It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.
Fatal Watcher:
;Fatal is a state reserved for special exceptions or conditions for which it is mandatory that the event causing this state be identified immediately.
;This state is not good for the overall processing, since it normally means that a specific step did not complete as expected.
;In case of a fatal error, the ecosystem should handle the issue and take the necessary corrective action for recovery.
;It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.
;Basic Logging: Every time a process is executed, this logging allows you to log everything that occurs to a central file. A short sketch of the five watcher levels writing to such a central file is shown below.
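A minimal sketch of the watcher idea using Python's standard logging module; the logger name, the log-file path, and the message texts are illustrative assumptions, and in practice these records would be piped on to the central Audit, Balance, and Control data store rather than left in a flat file. Python's CRITICAL level plays the role of the fatal watcher here.

import logging

# one central, prespecified location for every watcher level
logging.basicConfig(
    filename="rif_central_watchers.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s %(message)s")

watcher = logging.getLogger("rapid.information.factory")

watcher.debug("Debug Watcher: low-level debugging is consuming processing cycles")
watcher.info("Information Watcher: retrieve super step started for data source 001")
watcher.warning("Warning Watcher: handled exception, corrective action taken")
watcher.error("Error Watcher: unhandled exception, step did not complete as expected")
watcher.critical("Fatal Watcher: fatal condition, identify the causing event immediately")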


Process Tracking:
;For process tracking, it is advised to create a tool that will perform a controlled, systematic, and independent examination of the process for the hardware logging.
;There is numerous server-based software that monitors system-hardware-related parameters such as voltage, fan speeds, temperature sensors, and clock speeds of a computer system.
;It is advised to use the tool with which your customer and you are both most comfortable working.
;It is also advised that the logs generated should be used for the cause-and-effect analysis system.
Data Provenance:
;For every data entity, all the transformations in the system should be tracked so that a record can be generated for the activity.
;This ensures two things: 1. that we can reproduce the data, if required, in the future, and 2. that we can supply a detailed history of the data's source in the system throughout its transformation.
Data Lineage:
;This involves keeping records of every change, whenever it happens, to every individual data value in the data lake.
;This helps us to figure out the exact value of any data item in the past.
;This is normally accomplished by enforcing a valid-from and valid-to audit entry for every data item in the data lake.
3.3.2 Balance:
;The balance sublayer has the responsibility to make sure that the data science environment is balanced between the available processing capability and the required processing capability, or has the ability to upgrade processing capability during periods of extreme processing.
;In such cases, the on-demand processing capability of a cloud environment becomes highly desirable.
3.3.3 Control:
;The execution of the currently active data science processes is controlled by the control sublayer.
;The control elements of the control sublayer are a combination of:
;the control elements available in the Data Science Technology Stack's tools, and
;a custom interface to control the overarching work.
;When the processing pipeline encounters an error, the control sublayer attempts a recovery as per our prespecified requirements; if recovery does not work out, it will schedule a cleanup utility to undo the error.
;The cause-and-effect analysis system is the core data source for the distributed control system in the ecosystem.


3.4 YOKE SOLUTION
;The yoke solution is a custom design built around Apache Kafka.
;Apache Kafka is developed as an open source stream-processing platform. Its function is to deliver a platform that is unified and has high throughput and low latency for handling real-time data feeds.
;Kafka provides a publish-subscribe solution that can handle all activity-stream data and processing. The Kafka environment enables you to send messages between producers and consumers, which lets you transfer control between different parts of your ecosystem while ensuring a stable process.
3.4.1 Producer:
;The producer is the part of the system that generates the requests for data science processing, by creating structured messages for each type of data science process it requires.
;The producer is the end point of the pipeline that loads messages into Kafka.
3.4.2 Consumer:
;The consumer is the part of the process that takes in messages and organizes them for processing by the data science tools.
;The consumer is the end point of the pipeline that offloads the messages from Kafka.
3.4.3 Directed Acyclic Graph Scheduling:
;This solution uses a combination of graph theory and publish-subscribe stream data processing to enable scheduling.
;You can use the Python NetworkX library to resolve any conflicts, by simply formulating the graph into a specific point before or after you send or receive messages via Kafka (a sketch appears at the end of this section).
;That way, you ensure an effective and an efficient processing pipeline.
3.4.4 Cause-and-Effect Analysis System:
;The cause-and-effect analysis system is the part of the ecosystem that collects all the logs, schedules, and other ecosystem-related information, and
;enables data scientists to evaluate the quality of their system.
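The following sketch combines the two ideas above: NetworkX orders the supersteps as a directed acyclic graph, and a Kafka producer publishes one scheduling message per step. It assumes the networkx and kafka-python packages are available; the broker address, the topic name, and the message fields are illustrative, not part of the original design.

import json
import networkx as nx
from kafka import KafkaProducer

# describe the processing pipeline as a directed acyclic graph
dag = nx.DiGraph()
dag.add_edges_from([
    ("retrieve", "assess"),
    ("assess", "process"),
    ("process", "transform"),
    ("transform", "organize"),
    ("organize", "report"),
])

# topological_sort only succeeds if the graph really is acyclic,
# which resolves any scheduling conflicts before messages are sent
schedule = list(nx.topological_sort(dag))

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

for step in schedule:
    # one structured message per super step; the consumer side would
    # pick these up and hand them to the data science tools
    producer.send("rif-scheduling", {"superstep": step, "status": "requested"})

producer.flush()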



3.5 FUNCTIONAL LAYER
;The functional layer of the data science ecosystem is the largest and most essential layer for programming and modeling. Any data science project must have processing elements in this layer.
3.6 DATA SCIENCE PROCESS
;Following are the five fundamental data science process steps:
;Begin the process by asking a What if question.
;Attempt to guess at a probable potential pattern.
;Create a hypothesis by putting together observations.
;Verify the hypothesis using real-world evidence.
;Promptly and regularly collaborate with subject matter experts and customers as and when you gain insights.
;Begin the process by asking a What if question: Decide what you want to know, even if it is only the subset of the data lake you want to use for your data science, which is a good start.
;Attempt to guess at a probable potential pattern: Use your experience or insights to guess a pattern you want to discover, to uncover additional insights from the data you already have.
;Create a hypothesis by putting together observations: A hypothesis is a proposed explanation, prepared on the basis of limited evidence, as a starting point for further investigation.
;Verify the hypothesis using real-world evidence: Now, we verify our hypothesis by comparing it with real-world evidence.
;Promptly and regularly collaborate with subject matter experts and customers as and when you gain insights: Things that are communicated with experts may include technical aspects like workflows or, more specifically, data formats and data schemas.
;Data structures in the functional layer of the ecosystem are:
•Data schemas and data formats: Functional data schemas and data formats deploy onto the data lake's raw data, to perform the required schema-on-query via the functional layer.
•Data models: These form the basis for future processing to enhance the processing capabilities of the data lake, by storing already processed data sources for future use by other processes against the data lake.
•Processing algorithms: The functional processing is performed via a series of well-designed algorithms across the processing chain.
•Provisioning of infrastructure: The functional infrastructure provision enables the framework to add processing capability to the ecosystem, using technology such as Apache Mesos, which enables the dynamic provisioning of processing work cells.


;The processing algorithms and data models are spread across sixsupersteps for processing the datalake.1. Retrieve: This super step contains all the processing chains forretrieving data from the raw data lake into a more structuredformat.3. Assess:This super step contains all the processing chains for qualityassurance and additional data enhancements.3.Process:This super step contains all the processing chains forbuilding the data vault.4. Transform:This super step contains all the processing chains forbuilding the data warehouse from the core data vault.5. Organize:This super stepcontains all the processing chains forbuilding the data marts from the core data warehouse.6.Report:This super step contains all the processing chains forbuilding virtualization and reporting of the actionable knowledge.UNIT END QUESTION1.Explainin detail the function of Operational Management Layer2.Give an overview of the Drum-buffer-rope Method3.Give an overview of the functions of Audit, Balance, and ControlLayer4.Explain the different ways of implementing the Built-in Logging in theAudit phase.5.Explain the different ways of implementing the Basic Logging in theAudit phase.6.Explain Directed Acyclic Graph Scheduling7.List &Explain the data structures in the functional layer of theecosystem8.Explain the fundamental data science process steps9.Listthe super steps for processing the data lake.REFERENCESAndreas François Vermeulen, “Practical Data Science-A Guide toBuilding the Technology Stack for Turning Data Lakes into BusinessAssets”munotes.in

Page 34

4RETRIEVE SUPER STEPUnit Structure4.0Objectives4.1Introduction4.2Data Lakes4.3Data Swamps4.3.1Start with Concrete Business Questions4.3.2Data Quality4.3.4Audit and Version Management4.3.5Data Governance4.3.5.1. Data Source Catalog4.3.5.2. Business Glossary4.3.5.3. Analytical Model Usage4.4Training the Trainer Model4.5Shipping Terminologies4.5.1 Shipping Terms4.5.2 Incoterm 20104.6Other Data Sources /Stores4.7Review Questions4.8References4.0 OBJECTIVES;The objective of this chapter is to explain in detail the core operationsin the Retrieve Super step.;This chapter explains important guidelines which if followed willprevent the data lake turning into a data swamp.;This chapter explains another important example related to shippingterminology called Incoterm;Finally this chapter explains the different possible data sources toretrieve data from.4.1 INTRODUCTION;The Retrieve super step is a practical method for importing a data lakeconsisting of different external data sources completely into theprocessing ecosystem.munotes.in

Page 35

;The Retrieve super step is the first contact between your data scienceand the source systems.;The successful retrieval of the data is a major stepping-stone toensuring that you are performing good data science.;Data lineage delivers the audittrail of the data elements at the lowestgranular level, to ensure full data governance.;Data governance supports metadata management for systemguidelines, processing strategies, policies formulation, andimplementation of processing.;Data quality andmaster data management helps to enrich the datalineage with more business values, if you provide complete datasource metadata.;The Retrieve super step supports the edge of the ecosystem, whereyour data science makes direct contact with the outside data world. Iwill recommend a current set of data structures that you can use tohandle the deluge of data you will need to process to uncover criticalbusiness knowledge.4.2 DATA LAKES;A company’s data lake covers all data that your business is authorizedto process, to attain an improved profitability of your business’s coreaccomplishments.;The data lake is the complete data world your company interacts withduring its business life span.;In simple terms, if you generate data or consume data to perform yourbusiness tasks, that data is in your company’s data lake.;Just as a lake needs rivers and streams to feed it, the data lake willconsume an unavoidable deluge of data sources from upstream anddeliver it to downstream partners4.3 DATA SWAMPS;Data swamps are simply data lakes that are not managed.;They are not to be feared. They need to be tamed.;Following are four critical steps to avoid a data swamp.1.Start with Concrete Business Questions2.Data Quality3.Audit and Version Management4.Data Governancemunotes.in

Page 36

4.3.1 Start with Concrete Business Questions:;Simply dumping a horde of data into a data lake, with no tangiblepurpose in mind, will result in a big business risk.;The data lake must be enabled to collect the data required to answeryour business questions.;It is suggested to perform a comprehensive analysis of the entire set ofdata you have and then apply a metadata classification for the data,stating full data lineage for allowing it into the data lake.4.3.2 Data Quality:;More data points donot mean that data quality is less relevant.;Data quality can cause the invalidation of a complete data set, if notdealt with correctly.4.3.3 Aud: it and Version Management:;You must always report the following:•Who used the process?•When was itused?•Which version of code was used?4.3.4 Data Governance:;The role of data governance, data access, and data security does not goaway with the volume of data in the data lake.;It simply collects together into a worse problem, if not managed.;DataGovernance can be implemented in the following ways:•Data Source Catalog•Business Glossary•Analytical Model Usage4.3.4.1. Data Source Catalog:;Metadata that link ingested data-to-data sources are a must-have forany data lake.;Data processing should include the following rules:;Unique data catalog number•use YYYYMMDD/ NNNNNN/NNN.•E.g. 20171230/000000001/001 for data first registered into themetadata registers on December 30, 2017, as data source 1 ofdata type 1.•This is a critical requirement.;Short description (It should be under 100 characters)•Country codes and country namesmunotes.in

Page 37

•Ex. ISO 3166 defines Country Codes as per United NationsSources;Long description (It should kept as complete as possible)•Country codes and country names used by your organization asstandard for country entries;Contact information for external data source•ISO 3166-1:2013 code lists fromwww.iso.org/iso-3166-country-codes.html;Expected frequency•Irregular i.e., no fixed frequency, also known as ad hoc, everyminute, hourly, daily, weekly, monthly, or yearly.•Other options are near-real-time, every 5 seconds, every minute,hourly, daily, weekly, monthly, or yearly.;Internal business purpose•Validate country codes and names.4.3.4.2. Business Glossary:;The business glossary maps the data-source fields and classifies theminto respective lines of business.;This glossary is a must-have for any good data lake.;The business glossary records the data sources ready for the retrieveprocessing to load the data.;Create a data-mapping registry with the following information:;Unique data catalog number:use YYYYMMDD/NNNNNN/NNN.;Unique data mapping number:use NNNNNNN/NNNNNNNNN. E.g., 0000001/000000001 for field 1 mapped to;internal field 1;External data source field name:States the field as found in theraw data source;External data source field type:Records the full set of the field’sdata types when loading the data lake;Internal data source field name:Records every internal data fieldname to use once loaded from the data lake;Internal data source field type:Records the full set of the field’stypes to use internally once loaded;Timestamp of last verification of the data mapping:useYYYYMMDD-HHMMSS-SSSthat supports timestamp down to athousandth of a second.munotes.in

Page 38

4.3.4.3 Analytical Model Usage:;Data tagged in respective analytical models define the profile of thedata that requires loading and guides the data scientist to whatadditional processing is required.;The following data analytical models should be executed on every dataset in the data lake by default.;Data Field Name Verification;Unique Identifier of Each DataEntry;Data Type of Each DataColumn;Histograms of Each Column;Minimum Value;MaximumValue;Mean;Median;Mode;Range;Quartiles;Standard Deviation;Skewness;Missing or UnknownValues;Data Pattern;The models can be applied using R or Python, we will use R;The data set used to demonstrate the models is INPUT_DATA.csv;Data Field Name Verification•This is used to validate and verify the data field’s names in theretrieve processing in an easy manner.•Example•library(table)•set_tidy_names(INPUT_DATA, syntactic = TRUE,quiet = FALSE)•Reveals field names that are not easy to use;Unique Identifierof Each Data Entry•Allocate a unique identifier within the system that is independent ofthe given file name.•This ensures that the system can handle different files fromdifferent paths and keep track of all data entries in an effectivemanner.•Then allocate a unique identifier for each record or data element inthe files that are retrieved.•Example:To add the unique identifier, run the following commandINPUT_DATA_with_ID=Row_ID_to_column(INPUT_DATA_FIX, var ="Row_ID")munotes.in

Page 39

;Data Type of Each Data Column•Determine the best data type for each column, to assist you incompleting the business glossary, to ensure that you record thecorrect import processing rules.•Example: To find datatype of each columnsapply(INPUT_DATA_with_ID, typeof);Histograms ofEach Column•I always generate a histogram across every column, to determinethe spread of the data value.•Example: to compute histogramlibrary(data.table)country_histogram=data.table(Country=unique(INPUT_DATA_with_ID[is.na(INPUT_DATA_with_ID ['Country'])== 0, ]$Country));Minimum Value•Determine the minimum value in a specific column.•Example: find minimum valuemin(country_histogram$Country)orsapply(country_histogram[,'Country'], min, na.rm=TRUE);Maximum Value•Determine the maximum value in a specific column.•Example: find maximum valuemax(country_histogram$Country)orsapply(country_histogram[,'Country'], max, na.rm=TRUE);Mean•If the column is numeric in nature, determine the average value in aspecific column.•Example: find mean of latitudesapply(lattitue_histogram_with_id[,'Latitude'], mean,na.rm=TRUE);Median•Determine the value that splits the data set into two parts in aspecific column.•Example:find median of latitudesapply(lattitue_histogram_with_id[,'Latitude'], median,na.rm=TRUE)munotes.in

Page 40

;Mode•Determine the value that appears most in a specific column.•Example: Find mode for column countryINPUT_DATA_COUNTRY_FREQ=data.table(with(INPUT_DATA_with_ID, table(Country)));Range•For numeric values, you determine the range of the values by takingthe maximum value and subtracting the minimum value.•Example: find range of latitudesapply(lattitue_histogram_with_id[,'Latitude'], range,na.rm=TRUE;Quartiles•These are the base values that divide a data set in quarters. This isdone by sorting the data column first and then splitting it in groupsof four equal parts.•Example: find quartile of latitudesapply(lattitue_histogram_with_id[,'Latitude'], quantile,na.rm=TRUE);Standard Deviation•The standard deviation is a measure of the amount of variation ordispersion of a set of values.•Example: find standard deviation of latitudesapply(lattitue_histogram_with_id[,'Latitude'], sd,na.rm=TRUE);Skewness•Skewness describes the shape or profile of the distribution of thedata in the column.•Example: find skewness of latitudelibrary(e1071)skewness(lattitue_histogram_with_id$Latitude, na.rm =FALSE, type = 2);Missing or Unknown Values•Identify if you have missing or unknown values in the data sets.Example: find missing value in country columnmissing_country=data.table(Country=unique(INPUT_DATA_with_ID[is.na(INPUT_DATA_with_ID ['Country']) ==1, ]))munotes.in

Page 41

;Data Pattern•I have used the following process for years, to determine a patternof the data values themselves.•Here is my standard version:•Replace all alphabet values with an uppercase case A, all numberswith an uppercase N, and replace any spaces with a lowercase letterband all other unknown characters with a lowercase u.•As a result, “Data Science 102” becomes"AAAAbAAAAAAAbNNNu.” This pattern creation is beneficialfor designing any specific assess rules.4.4 TRAINING THE TRAINER MODEL;To prevent a data swamp, it is essential that you train your team also.Data science is a team effort.;People, process, and technology are the three cornerstones to ensurethat data is curated and protected.;You are responsible for your people; share the knowledge you acquirefrom this book. The process I teach you, you need to teach them.Alone, you cannot achieve success.;Technology requires that youinvest time to understand it fully. We areonly at the dawn of major developments in the field of dataengineering and data science.;Remember: A big part of this process is to ensure that business usersand data scientists understand the need to start small, have concretequestions in mind, and realize that there is work to do with all data toachieve success.4.5SHIPPING TERMINOLOGIESIn this section we discuss two things : shipping terms and Incoterm 2010.4.5.1 Shipping Terms;These determine therules of the shipment, the conditions under whichit is made. Normally, these are stated on the shipping manifest.;Following are the terms used:•Seller-The person/company sending the products on the shippingmanifest is the seller. This is not a location but a legal entitysending the products.•Carrier-The person/company that physically carries the productson the shipping manifest is the carrier. Note that this is not alocation but a legal entity transporting the products.munotes.in

Page 42


•Port-A Port is any pointfrom which you have to exit or enter acountry. Normally, these are shipping ports or airports but can alsoinclude border crossings via road. Note that there are two ports inthe complete process. This is important. There is a port of exit anda port ofentry.•Ship-Ship is the general term for the physical transport methodused for the goods. This can refer to a cargo ship, airplane, truck,or even person, but it must be identified by a unique allocationnumber.•Terminal-A terminal is the physical point at which the goods arehanded off for the next phase of the physical shipping.•Named Place-This is the location where the ownership is legallychanged from seller to buyer. This is a specific location in theoverall process. Remember this point, asit causes many legaldisputes in the logistics industry.•Buyer-The person/company receiving the products on theshipping manifest is the buyer. In our case, there will bewarehouses, shops, and customers. Note that this is not a locationbut a legal entity receiving the products.4.5.2 Incoterm 2010:;Incoterm 2010 is a summary of the basic options, as determined andpublished by a standard board;This option specifies which party has an obligation to pay if somethinghappens to the product being shipped (i.e. if the product is damaged ordestroyed inroute before it reaches to the buyer);EXW—Ex Works•Here the seller will make the product or goods available at hispremises or at another named place. This term EXW puts theminimum obligations on the seller of the product /item andmaximum obligation on the buyer.•Here is the data science version: If I were to buy an item a localstore and take it home, and the shop has shipped it EXW—ExWorks, the moment I pay at the register, the ownership istransferred to me. If anything happens to the book, I would have topay to replace it.;FCA—Free Carrier•In this condition, the seller is expected to deliver the product orgoods, that are cleared for export, at a named place.•The data science version: : If I wereto buy an item at an overseasduty-free shop and then pick it up at the duty-free desk beforetaking it home, and the shop has shipped it FCA—Free Carrier—to the duty-free desk, the moment I pay at the register, themunotes.in

Page 43

ownership is transferred to me, but ifanything happens to the bookbetween the shop and the duty-free desk, the shop will have to pay.•It is only once I pick it up at the desk that I will have to pay, ifanything happens. So, the moment I take the book, the transactionbecomes EXW, so I haveto pay any necessary import duties onarrival in my home country.;CPT—Carriage Paid To•Under this term, the seller is expected to pay for the carriage ofproduct or goods up to the named place of destination.•The moment the product or goods are delivered to the first carrierthey are considered to be delivered, and the risk gets transferred tothe buyer.•All the costs including origin costs, clearance of export and freightcosts for carriage till the place of named destination have to bepaid by the seller to the named place of destination. This is couldbe anything like the final destination like the buyer's facility, or aport of at the destination country. This has to be agreed upon byboth seller and buyer in advance.•The data science version: If I were to buy an item at an overseasstore and then pick it up at the export desk before taking it homeand the shop shipped it CPT—Carriage Paid To—the duty desk forfree, the moment I pay at the register, the ownership is transferredto me, but if anything happens to the book between the shop andthe duty desk of the shop, I will have to pay.•It is only once I have picked up the book at the desk that I have topay if anything happens. So, the moment I take the book, thetransaction becomes EXW, so I must payany required export andimport duties on arrival in my home country.;CIP-Carriage& Insurance Paid•The seller has to get insurance for the goods for shipping thegoods.•The data science version If I were to buy an item at an overseasstore and then pick it up at the export desk before taking it home,and the shop has shipped it CPT—Carriage Paid To—to the dutydesk for free, the moment I pay at the register, the ownership istransferred to me. However, if anything happens to the bookbetween the shop and the duty desk at the shop, I have to take outinsurance to pay for the damage.•It is only once I have picked it up at the desk that I have to pay ifanything happens. So, the moment I take the book, it becomesEXW, so I have to pay any export and import duties on arrival inmy home country. Note that insurance only covers that portion ofthe transaction between the shop and duty desk.munotes.in

Page 44

;DAT—Delivered at a Terminal•According to this term the seller has to deliver and unload thegoods at a named terminal.The seller assumes all risks till thedelivery at the destination and has to pay all incurred costs oftransport including export fees, carriage, unloading from the maincarrier at destination port, and destination port charges.•The terminal can be a port, airport, or inland freight interchange,but it must be a facility with the capability to receive the shipment.If the seller is not able to organize unloading, it should considershipping under DAP terms instead. All charges after unloading (forexample,import duty, taxes, customs and on-carriage costs) are tobe borne by buyer.•The data science version. If I were to buy an item at an overseasstore and then pick it up at a local store before taking it home, andthe overseas shop shipped it—Delivered atTerminal (LocalShop)—the moment I pay at the register, the ownership istransferred to me.•However, if anything happens to the book between the paymentand the pickup, the local shop pays. It is picked up only once at thelocal shop. I have to pay if anything happens. So, the moment Itake it, the transaction becomes EXW, so I have to pay any importduties on arrival in my home.;DAP—Delivered at Place•Under this option the seller delivers the goods at a given place ofdestination. Here, the risk willpass from seller to buyer fromdestination point.•Packaging cost at the origin has to be paid by the seller alsoall thelegal formalities in the exporting country will be carried out by theseller at his own expense.•Once the goods are delivered in the destination country the buyerhas to pay for the customs clearance.•Here is the data science version. If I were to buy 100 pieces of aparticular item from an overseas web site and then pick up thecopies at a local store before taking them home, and the shopshipped the copies DAP-Delivered At Place (Local Shop)—themoment I paid at the register, the ownership would be transferredto me. However, if anything happened to the item between thepayment and the pickup, the web site owner pays. Once the 100pieces are picked up at the local shop, I have to pay to unpackthem at store. So, the moment I take the copies, the transactionbecomes EXW, so I will have to pay costs after I take the copies.;DDP—Delivered Duty Paid•Here the seller is responsible for the delivery of the products orgoods to an agreed destination place in the country of the buyer.munotes.in

Page 45

The seller has to pay for all expenses like packing at origin,delivering the goods to the destination, import duties and taxes,clearing customs etc.•The seller is not responsible for unloading. This termDDPwillplace the minimum obligations on the buyer and maximumobligations on the seller. Neither the risk nor responsibility istransferred to the buyer until delivery of the goods is completed atthe named place of destination.•Here is the data science version. If I were to buy an item inquantity 100 at an overseas web site and then pick them up at alocal store before taking them home, and the shop shipped DDP—Delivered Duty Paid (my home)—the moment I pay at the till, theownership is transferred to me. However, if anything were tohappen to the items between the payment and the delivery at myhouse, the store must replace the items as the term covers thedelivery to my house.4.6OTHER DATA SOURCES /STORES;While performing data retrieval you may have to work with one of thefollowing data stores;SQLite•This requires a package named sqlite3.;Microsoft SQL Server•Microsoft SQL server is common in companies, and this connectorsupports your connection to thedatabase. Via the directconnection, usefrom sqlalchemy import create_engineengine =create_engine('mssql+pymssql://scott:tiger@hostname:port/folder');Oracle•Oracle is a common database storage option in bigger companies.It enables you to load data from the following data source withease:from sqlalchemy import create_engineengine =create_engine('oracle://andre:vermeulen@127.0.0.1:1521/vermeulen');MySQL•MySQL is widely used by lots of companies for storing data. Thisopens that data to your datascience with the change of a simpleconnection string.•There are two options. For direct connect to the database, usefrom sqlalchemy import create_enginemunotes.in

Page 46

engine =create_engine('mysql+mysqldb://scott:tiger@localhost/vermeulen');Apache Cassandra•Cassandra is becoming a widely distributed databaseengine in thecorporate world.•To access it, use the Python package cassandra.from cassandra.cluster import Clustercluster = Cluster()session = cluster.connect(‘vermeulen’);Apache Hadoop•Hadoop is one ofthe most successful data lake ecosystems inhighly distributed data Science.•The pydoop package includes a Python MapReduce and HDFSAPI for Hadoop.;Pydoop 9•It is a Python interface to Hadoop that allows you to writeMapReduce applications and interactwith HDFS in pure Python;Microsoft Excel•Excel is common in the data sharing ecosystem, and it enables youto load files using this format with ease.;Apache Spark•Apache Spark is now becoming the next standard for distributeddata processing. The universal acceptance and support of theprocessing ecosystem is starting to turn mastery of this technologyinto a must-have skill.;Apache Hive•Access to Hive opens its highly distributed ecosystem for use bydata scientists.;Luigi•Luigi enables a series of Python features that enable you to buildcomplex pipelines into batch jobs. It handles dependencyresolution and workflow management as part of the package.•This will save you from performing complex programming whileenabling good quality processing;AmazonS3 Storage•S3, or Amazon Simple Storage Service (Amazon S3), createssimple and practical methods to collect, store, and analyze data,irrespective of format, completely at massive scale. I store most ofmy base data in S3, as it is cheaper than most other methods.munotes.in

Page 47

•Package s3-Python’s s3 module connects to Amazon’s S3 RESTAPI•Package Boot-The Botopackage is another useful too thatconnects to Amazon’s S3 REST API;Amazon Redshift•Amazon Redshift is cloud service that is a fully managed,petabyte-scaledata warehouse.•The Python package redshift-sqlalchemy, is an Amazon Redshiftdialect for sqlalchemythat opens this data source to your datascience;Amazon Web Services•The boto3 package is an Amazon Web Services Library Pythonpackage that provides interfaces to Amazon Web ServicesUNIT END QUESTION1.Explain the Retrieve Superstep.2.Explain Data Lakes and Data Swamps.3.Explain the general rules for data source catalog.4.State and explain the four critical steps to avoid data swamps.5.Why is it necessary to train the data science team?6.Explain the following shipping terms:i.Sellerii.Carrieriii.Port,iv.Ship,v. Terminal, Named Placevi.Buyer.7.Explain the following shipping terms with example:iEx WorksiiFree CarrieriiiCarriage Paid ToivCarriage and Insurance Paid TovDelivered at TerminalviDelivered at PlaceviiDelivery Duty Paid8.List and explain the different data stores used in data science.REFERENCESBooks:;Andreas François Vermeulen, “Practical Data Science-A Guide toBuilding the Technology Stack for Turning Data Lakes into BusinessAssets”Websites:;https://www.aitworldwide.com/incoterms;Incoterm:https://www.ntrpco.com/what-is-incoterms-part2/munotes.in

Page 48

Unit III5ASSESS SUPERSTEPUnit Structure5.0Objectives5.1Assess Superstep5.2Errors5.2.1 Accept the Error5.2.2 Reject the Error5.2.3 Correct the Error5.2.4 Create a Default Value5.3Analysis of Data5.3.1 Completeness5.3.2 Consistency5.3.3 Timeliness5.3.4 Conformity5.3.5 Accuracy5.3.6Integrity5.4Practical Actions5.4.1Missing Values in Pandas5.4.1.1Drop the Columns Where All Elements Are MissingValues5.4.1.2Drop the Columns Where Any of the Elements IsMissing Values5.4.1.3Keep Only the Rows That Contain a Maximum ofTwoMissing Values5.4.1.4Fill All Missing Values with the Mean, Median,Mode,Minimum, and Maximum ofthe ParticularNumericColumn5.5Let us Sum up5.6Unit End Questions5.7List of References5.0 OBJECTIVESThis chapter makes you understand the following concepts:;Dealing with errors in data;Principles of data analysismunotes.in

Page 49

;Different ways to correct errors in data5.1 ASSESS SUPERSTEPData quality problems result in a 20% decrease in workerproductivity and explain why 40% of business initiatives fail to achieveset goals. Incorrect data can harm a reputation, misdirect resources, slowdown theretrieval of information, and lead to false insights and missedopportunities.For example, if an organization has the incorrect name or mailingaddress of a prospective client, their marketing materials could go to thewrong recipient. If sales data isattributed to the wrong SKU or brand, thecompany might invest in a product line with less than stellarcustomer demand.Data profiling is the process of examining, analyzing andreviewing data to collect statistics surrounding the quality and hygiene ofthe dataset. Data quality refers to the accuracy, consistency, validity andcompleteness of data. Data profiling may also be known as dataarcheology, data assessment, data discovery or data quality analysis5.2 ERRORSErrors are the norm, not the exception, when working with data.By now, you’ve probably heard the statistic that 88% of spreadsheetscontain errors. Since we cannot safely assume that any of the data wework with is error-free, our mission should be to find and tackle errors inthe most efficient way possible.5.2.1 Accept the Error:If an error falls within an acceptable standard (i.e., Navi Mumbaiinstead of Navi Mum.), then it could be accepted and move on to the nextdata entry. But remember that if you accept the error, you will affect datascience techniques and algorithms that perform classification, such asbinning, regression, clustering, and decision trees, because these processesassume that the values in this example are not the same. This option is theeasy option, but not always the best option.5.2.2 Reject the Error:Unless the nature of missing data is ‘Missing completely atrandom’, the best avoidable method in many cases is deletion. a. Listwise:In this case, rows containing missing variables are deleted.a.Listwise: In this case, rows containing missing variables are deletedmunotes.in

Page 50

UserDeviceOSTransactionsAMobileAndroid5BMobileWindow3CTabletNA4DNAAndroid1EMobileIOS2Table 5.1In the above case, the entire observation for User C and User Dwill be ignored for listwise deletion. b. Pairwise: In this case, only themissing observations are ignored and analysis is In the above case, 2separate sample data will be analyzed, one with the combination of User,Device and Transaction and the other with the combination of User, OSand Transaction. In such a case, one won't be deleting any observation.Each of thesamples will ignore the variable which has the missing valuein it.Both the above methods suffer from loss of information. Listwisedeletion suffers the maximum information loss compared to Pairwisedeletion. But, the problem with pairwise deletion is that even though ittakes the available cases, one can’t compare analyses because the sampleis different every time.Use reject the error option if you can afford to lose a bit of data.This is an option to be used only if the number of missing values is2% ofthe whole dataset or less.5.2.3 Correct the Error:Identify the Different Error Types:We are going to look at a few different types of errors. Let’s takethe example of a sample of people described by a number of differentvariables:
Table 5.2
UserDeviceOSTransactionsAMobileAndroid5BMobileWindow3CTabletNA4DNAAndroid1EMobileIOS2Table 5.1In the above case, the entire observation for User C and User Dwill be ignored for listwise deletion. b. Pairwise: In this case, only themissing observations are ignored and analysis is In the above case, 2separate sample data will be analyzed, one with the combination of User,Device and Transaction and the other with the combination of User, OSand Transaction. In such a case, one won't be deleting any observation.Each of thesamples will ignore the variable which has the missing valuein it.Both the above methods suffer from loss of information. Listwisedeletion suffers the maximum information loss compared to Pairwisedeletion. But, the problem with pairwise deletion is that even though ittakes the available cases, one can’t compare analyses because the sampleis different every time.Use reject the error option if you can afford to lose a bit of data.This is an option to be used only if the number of missing values is2% ofthe whole dataset or less.5.2.3 Correct the Error:Identify the Different Error Types:We are going to look at a few different types of errors. Let’s takethe example of a sample of people described by a number of differentvariables:
Table 5.2
UserDeviceOSTransactionsAMobileAndroid5BMobileWindow3CTabletNA4DNAAndroid1EMobileIOS2Table 5.1In the above case, the entire observation for User C and User Dwill be ignored for listwise deletion. b. Pairwise: In this case, only themissing observations are ignored and analysis is In the above case, 2separate sample data will be analyzed, one with the combination of User,Device and Transaction and the other with the combination of User, OSand Transaction. In such a case, one won't be deleting any observation.Each of thesamples will ignore the variable which has the missing valuein it.Both the above methods suffer from loss of information. Listwisedeletion suffers the maximum information loss compared to Pairwisedeletion. But, the problem with pairwise deletion is that even though ittakes the available cases, one can’t compare analyses because the sampleis different every time.Use reject the error option if you can afford to lose a bit of data.This is an option to be used only if the number of missing values is2% ofthe whole dataset or less.5.2.3 Correct the Error:Identify the Different Error Types:We are going to look at a few different types of errors. Let’s takethe example of a sample of people described by a number of differentvariables:
Table 5.2
munotes.in

Page 51

Can you point out a few inconsistencies? Write them down a few andcheck your answers below!1.First, there are empty cells for the "country" and "date of birthvariables". We call thesemissing attributes.2.If you look at the "Country" column, you see a cell that contains 24.“24” is definitely not a country! This is known as alexical error.3.Next, you may notice in the "Height" column that there is an entrywith a different unit of measure. Indeed, Rodney's height is recorded infeet and incheswhile the rest are recorded in meters. This is anirregularity errorbecause the unit of measures are not uniform.5.Mark has two email addresses. It’s is not necessarily a problem, but ifyou forget about this and code an analysis program based on theassumption that each person has only one email address, your programwill probably crash! This is called aformatting error.5.Look at the "date of birth" variable. There is also aformatting errorhere as Rob’sdate of birth is not recorded in the sameformat asthe others.6.Samuel appears on two different rows. But, how can we be sure this isthe same Samuel? By his email address, of course! This is called aduplication error. But look closer, Samuel’s two rows each give adifferent value for the "height variable": 1.67mand 1.45m. This iscalled acontradiction error.7.Honey is apparently 9'1". This height diverges greatly from the normalheights of human beings. This value is, therefore, referred to asanoutlier.The termoutliercan indicate two different things: anatypicalvalue and anaberration.Deal With These Errors:When it comes to cleansing data sets, there is no set rule.Everything you do depends on how you plan to use your data. No two dataanalysts will cleanse the same data setthe same way—not if theirobjectives are different!So there’s no set rule, but I can give you a few pointers:1.Missing attributes will be addressed in the following chapter.2.For the invalid country, it’s possible to supply a list of authorizedcountries in advance, then eliminate all of the values that are not foundon this list (hint: 24 will not be found). Such a list is often referred toas a dictionary.3.For irregularity errors, it’s more complicated! You can, for example,set a fixed format (here: a decimal number followed bythe letter “m”for “meter”) and eliminate values that don’t adhereto it. But wecan dobetter, by first detecting whatunit the valueis expressed in(meters orcentimeters) then converting everythingto thesame unit.munotes.in

Page 52


4.For the formatting error of the duplicate email address, it all dependson what you want to do. If you won’t be looking at emailsin yourfuture analysis, there’s no need to correct this error. If, onthe otherhand, you want to know the proportion of people whose address endsin, for example @example.com, or @supermail.eu, etc., then you canchoose between:1. Taking the first email address and forgetting the second one.2. Keeping all email addresses.5.Let’s move on to the Date of Birthvariable. There are many differentformats; each country has its own custom when it comes to writingdates (India and North America, for example, do not use the sameformat). Add to this the problem of time zones! In our case, thesimplest solution wouldbe to eliminate dates that are not in the desiredformat month/day/year.6.Duplicates.7.Outliers!5.2.4 Create a Default Value:NaN is the default missing value marker for reasons ofcomputational speed and convenience. This is a sentinel value, inthesense that it is a dummy data or flag value that can be easily detected andworked with using functions in pandas.5.3 ANALYSIS OF DATA
Figure 5.1

4.For the formatting error of the duplicate email address, it all dependson what you want to do. If you won’t be looking at emailsin yourfuture analysis, there’s no need to correct this error. If, onthe otherhand, you want to know the proportion of people whose address endsin, for example @example.com, or @supermail.eu, etc., then you canchoose between:1. Taking the first email address and forgetting the second one.2. Keeping all email addresses.5.Let’s move on to the Date of Birthvariable. There are many differentformats; each country has its own custom when it comes to writingdates (India and North America, for example, do not use the sameformat). Add to this the problem of time zones! In our case, thesimplest solution wouldbe to eliminate dates that are not in the desiredformat month/day/year.6.Duplicates.7.Outliers!5.2.4 Create a Default Value:NaN is the default missing value marker for reasons ofcomputational speed and convenience. This is a sentinel value, inthesense that it is a dummy data or flag value that can be easily detected andworked with using functions in pandas.5.3 ANALYSIS OF DATA
Figure 5.1

4.For the formatting error of the duplicate email address, it all dependson what you want to do. If you won’t be looking at emailsin yourfuture analysis, there’s no need to correct this error. If, onthe otherhand, you want to know the proportion of people whose address endsin, for example @example.com, or @supermail.eu, etc., then you canchoose between:1. Taking the first email address and forgetting the second one.2. Keeping all email addresses.5.Let’s move on to the Date of Birthvariable. There are many differentformats; each country has its own custom when it comes to writingdates (India and North America, for example, do not use the sameformat). Add to this the problem of time zones! In our case, thesimplest solution wouldbe to eliminate dates that are not in the desiredformat month/day/year.6.Duplicates.7.Outliers!5.2.4 Create a Default Value:NaN is the default missing value marker for reasons ofcomputational speed and convenience. This is a sentinel value, inthesense that it is a dummy data or flag value that can be easily detected andworked with using functions in pandas.5.3 ANALYSIS OF DATA
Figure 5.1
munotes.in

Page 53

One of the causes of data quality issues is in source data that ishoused in a patchwork of operational systems and enterprise applications.Each of these data sources can have scattered or misplaced values,outdated and duplicate records, and inconsistent (or undefined) datastandards and formats across customers, products, transactions, financialsandmore.Data quality problems can also arise when an enterpriseconsolidates data during a merger or acquisition. But perhaps the largestcontributor to data quality issues is that the data are being entered, edited,maintained, manipulated and reported onby people.To maintain the accuracy and value of the business-criticaloperational information that impact strategic decision-making, businessesshould implement a data quality strategy that embeds data qualitytechniques into their business processes and into their enterpriseapplications and data integration.5.3.1 Completeness:Completeness is defined as expected comprehensiveness. Data canbe complete even if optional data is missing. As long as the data meets theexpectations then the data is considered complete.For example, a customer’s first name and last name are mandatorybut middle name is optional; so a record can be considered complete evenif a middle name is not available.Questions you can ask yourself: Is all the requisite informationavailable? Do any data values have missing elements? Or are they in anunusable state?5.3.2 Consistency:Consistency means data across all systems reflects the sameinformation and are in synch with each other across the enterprise.Examples:;A business unit status is closed but there are sales for that businessunit.;Employee status is terminated but pay status is active.Questions you can ask yourself: Are data values the same across the datasets? Are there any distinct occurrences of the same datainstances thatprovide conflicting information?munotes.in

Page 54

5.3.3 Timeliness:Timeliness referes to whether information is available when it is expectedand needed. Timeliness of data is very important. This is reflected in:;Companies that are required to publish their quarterly results within agiven frame of time;Customer service providing up-to date information to the customers;Credit system checking in real-time on the credit card account activityThe timeliness depends on user expectation. Online availability ofdata could be required for room allocation system in hospitality, butnightly data could be perfectly acceptable for a billing system.5.3.4 Conformity:Conformity means the data is following the set of standard datadefinitions like data type, size and format. For example, date of birth ofcustomer is in the format “mm/dd/yyyy” Questions you can ask yourself:Do data values comply with the specified formats? If so, do all the datavalues comply with those formats?Maintaining conformance to specific formats is important.5.3.5 Accuracy:Accuracy is the degree to which data correctly reflects the real worldobject OR an event being described. Examples:;Sales of the business unit are the real value.;Address of an employee in the employee databaseis the real address.Questions you can ask yourself: Do data objects accuratelyrepresent the “real world” values they are expected to model? Are thereincorrect spellings of product or person names, addresses, and evenuntimely or not current data?These issues can impact operational and advanced analytics applications.5.3.6 Integrity:Integrity means validity of data across the relationships andensures that all data in a database can be traced and connected to otherdata.For example, in a customer database, there should be a validcustomer, addresses and relationship between them. If there is an addressrelationship data without a customer then that data is not valid and isconsidered an orphaned record.munotes.in

Page 55

Ask yourself: Is there are any data missing important relationshiplinkages? The inability to link related records together may actuallyintroduce duplication across your systems.5.4 PRACTICAL ACTIONSIn Unit 2, you have been introduced to the Python package pandas.The package enables severalautomatic error-management features.5.4.1 Missing Values in Pandas:Following are four basic processing concepts.1.Drop the Columns Where All Elements Are Missing Values2.Drop the Columns Where Any of the Elements Is Missing Values3.Keep Onlythe Rows That Contain a Maximum of Two MissingValues4.Fill All Missing Values with the Mean, Median, Mode, Minimum5.4.1.1. Drop the Columns Where All Elements Are Missing ValuesImporting data:Step 1: Importing necessary libraries:import osimport pandas as pdStep 2: Changing the working directory:os.chdir("D:\Pandas")Pandas provides various data structures and operations formanipulating numerical data and time series. However, there can be caseswhere some data might be missing. In Pandas missing data is representedby two values:;None:None is a Python singleton object that is often used for missingdata in Python code.;NaN:NaN (an acronym for Not a Number), is a special floating-pointvalue recognized by all systems that use the standardIEEE floating-point representationPandas treat None and NaN as essentially interchangeable forindicating missing or null values. In order to drop a null values from adataframe, we used dropna() function this function drop Rows/Columns ofdatasets withNull values in different ways.munotes.in

Page 56

Syntax:DataFrame.dropna(axis=0, how=’any’, thresh=None, subset=None,inplace=False)Parameters:;axis:axis takes int or string value for rows/columns. Input can be 0 or1 for Integer and ‘index’ or ‘columns’ for String.;how:how takes string value of two kinds only (‘any’ or ‘all’). ‘any’drops the row/column if ANY value is Null and ‘all’ drops only ifALL values are null.;thresh:thresh takes integer value which tells minimum amount of navalues to drop.;subset:It’s anarray which limits the dropping process to passedrows/columns through list.;inplace:It is a boolean which makes the changes in data frame itself ifTrue.Let’s take an example of following dataframe:ABCD0NaN2.0NaN013.04.0NaN12NaNNaNNaN5Table 5.3Here, column C is having all NaN values.Let’s drop this column. For thisuse the following code.  import pandas as pdimport numpy as npdf = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],[np.nan, np.nan, np.nan, 5]],columns=list('ABCD'))$&)14)**./),11(%$!1!&/!+%$&$/-.,!!5)0 (-4!**1()0#-$%4)**$%*%1%1(%#-*2+,04)1(!**,2**3!*2%0munotes.in

Page 57

%/%!5)0 +%!,0#-*2+,0!,$(-47!**7+%!,0$/-.1(%#-*2+,04)1(!**!3!*2%0ABD0NaN2.0013.04.012NaNNaN5Table 5.45.4.1.2. Drop the Columns Where Any of the Elements Is MissingValues:Let’s consider the same dataframe again:ABCD0NaN2.0NaN013.04.0NaN12NaNNaNNaN5Table 5.5%/%#-*2+,!,$!/%(!3),'!**!3!*2%0%170$/-.1(%0%#-*2+,0-/1()020%1(%&-**-4),'#-$%  )+.-/1.!,$!0!0.$)+.-/1,2+.6!0,.$&.$!1!/!+%,.,!,
,.,!,   ,.,!, ,.,!,,.,!,,.,!, #-*2+,0*)01$&)14)**./),11(%$!1!&/!+%$&$/-.,!!5)0 (-4!,61()0#-$%4)**$%*%1%1(%#-*2+,04)1(!**,2**3!*2%0%/% !5)0  +%!,0 #-*2+,0 !,$ (-47!,67 +%!,0 $/-. 1(%#-*2+,04)1(-,%-/,-/%!3!*2%0
Table 5.6munotes.in

Page 58

 $ "   #!"! %170#-,0)$%/1(%0!+%$!1!&/!+%!'!),ABCD0NaN2.0NaN013.04.0NaN12NaNNaNNaN5Table 5.7%/%/-4
)0(!3),'+-/%1(!,
!3!*2%0-1()0/-44)**'%1$/-..%$-/1()020%1(%&-**-4),'#-$%  )+.-/1),'.!,$!0!0.$)+.-/1.!,$!0!0.$)+.-/1,2+.6!0,.$&.$!1!/!+%,.,!,
,.,!,   ,.,!, ,.,!,,.,!,,.,!, #-*2+,0*)01$&$&$/-.,!1(/%0(
1()0#-$%4)**$%*%1%1(%/-404)1(+-/%1(!,14-,2**3!*2%0%/%1(/%0(
+%!,0+!5)+2+14-!4)**"%!**-4%$.%//-4ABCD0NaN2.0NaN013.04.0NaN1Table 5.8 
!"  !Another approach to handling missing values is to imputeorestimate them. Missing value imputation has a long history in statisticsand has been thoroughly researched. In essence, imputation usesinformation and relationships among the non-missing predictors to providean estimate to fill in the missing value.The goal of these techniques is toensure that the statistical distributions are tractable and of good enoughquality to support subsequent hypothesis testing. The primary approach inmunotes.in

Page 59

this scenario is to use multiple imputations; several variations of thedataset are created with different estimates of the missing values. Thevariations of the data sets are then used as inputs to models and the teststatistic replicates are computed for each imputed data set. From thesereplicate statistics, appropriate hypothesis tests can be constructed andused for decision making.A simple guess of a missing value is the mean, median, or mode(most frequently appeared value) of that variable.Replacing Nan values with mean:In pandas, .fillna can be used to replaceNA’s with a specified value.AppleOrangeBananaPearBasket 110NaN3040Basket27142128Basket355NaN812Basket41514NaN8Basket5711NaNBasket6NaN492Table 5.9Here, we can see NaN in all the columns. Let’s fill it by their mean.For this, use the following code:import pandas as pdimport numpy as npdf = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8,12],[15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])dfdf.fillna(df.mean())munotes.in

Page 60

Output:AppleOrangeBananaPearBasket 1108.253040Basket27142128Basket3558.25812Basket4151413.88Basket571118Basket618.8492Table 5.10Here, the mean of Apple Column = (10 + 7 + 55 + 15 + 7)/5 = 18.8.So, Nan value is replaced by 18.8. Similarly, in Orange Column Nan’s arereplaced with 8.25, in Banana’s column Nan replaced with 13.8 and inPear’s column it is replaced with 18.Replacing Nan values with median:Let’s take an example:AppleOrangeBananaPearBasket 110NaN3040Basket27142128Basket355NaN812Basket41514NaN8Basket5711NaNBasket6NaN492Table 5.11Here, we can see NaN in all the columns. Let’s fill it by their median. Forthis, use the following code:import pandas as pdimport numpy as npdf = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8,12], [15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],munotes.in

Page 61

 columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1','Basket2', 'Basket3','Basket4', 'Basket5', 'Basket6'])$&$&&)**,!$&+%$)!,Output:AppleOrangeBananaPearBasket 1109.03040Basket27142128Basket3559.0812Basket415149.08Basket571112.0Basket610.0492Table 5.12Here, the median of Apple Column = (7, 7, 10, 15, 55) = 10. So, Nanvalue is replaced by 10. Similarly, in Orange Column Nan’s are replacedwith 9, in Banana’s column Nan replaced with 9 and in Pear’s column it isreplaced with 12.Replacing Nan valueswith mode:Let’s take an exampleAppleOrangeBananaPearBasket 110NaN3040Basket2714828Basket355NaN812Basket41514NaN12Basket5711NaNBasket6NaN492Table 5.13Here, we can see NaN in all the columns. Let’s fill it by their mode.For this, use the following code:munotes.in

Page 62


import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 8, 28], [55, np.nan, 8, 12],
                   [15, 14, np.nan, 12], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])
df
for column in df.columns:
    df[column].fillna(df[column].mode()[0], inplace=True)
df

Output:

          Apple  Orange  Banana  Pear
Basket1      10      14      30    40
Basket2       7      14       8    28
Basket3      55      14       8    12
Basket4      15      14       8    12
Basket5       7       1       1    12
Basket6     7.0       4       9     2

Table 5.14

Here, the mode of the Apple column = mode of (10, 7, 55, 15, 7) = 7, so the NaN value is replaced by 7. Similarly, in the Orange column NaN's are replaced with 14, in the Banana column NaN is replaced with 8, and in the Pear column it is replaced with 12.

Replacing NaN values with min:

Let's take an example:

          Apple  Orange  Banana  Pear
Basket1      10     NaN      30    40
Basket2       7      14      21    28
Basket3      55     NaN       8    12
Basket4      15      14     NaN     8
Basket5       7       1       1   NaN
Basket6     NaN       4       9     2

Table 5.15

Page 63

Here, we can see NaN in all the columns. Let's fill them with the column minimums. For this, use the following code:

import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8, 12],
                   [15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])
df
df.fillna(df.min())

Output:

          Apple  Orange  Banana  Pear
Basket1      10       1      30    40
Basket2       7      14      21    28
Basket3      55       1       8    12
Basket4      15      14       1     8
Basket5       7       1       1     2
Basket6       7       4       9     2

Table 5.16

Here, the minimum of the Apple column = min(10, 7, 55, 15, 7) = 7, so the NaN value is replaced by 7. Similarly, in the Orange column NaN's are replaced with 1, in the Banana column NaN is replaced with 1, and in the Pear column it is replaced with 2.

5.5 LET US SUM UP

This chapter focused on dealing with errors in data. The main options for handling errors are: accept the errors, which is sometimes the only practical choice; reject the errors, which can be used if you can take that risk and no more than about 10-15% of the data is compromised; and correct the errors, for which several practical error-correction methods are available. Principles of data analysis were also discussed.

Page 64

Practical solutions for missing values were also covered: dropping the columns where all elements are missing values, dropping the columns where any of the elements is a missing value, keeping only the rows that contain a maximum of two missing values, and filling all missing values with the mean, median, mode, minimum, or maximum of the particular numeric column.

UNIT END QUESTIONS

1. Explain errors in data.
2. Explain the different ways to deal with errors.
3. Explain the principles of data analysis.
4. How will you handle missing values in Pandas? Explain.
5. Write a Python program to drop the columns where all elements are missing values.
6. Write a Python program to drop the columns where any of the elements is a missing value.
7. Write a Python program to keep only the rows that contain a maximum of two missing values.
8. Write a Python program to fill all missing values with the mean of the particular column.
9. Write a Python program to fill all missing values with the median of the particular column.
10. Write a Python program to fill all missing values with the mode of the particular column.
11. Write a Python program to fill all missing values with the minimum of the particular column.
12. Write a Python program to fill all missing values with the maximum of the particular column.

LIST OF REFERENCES

• Python for Data Science For Dummies, by Luca Massaron and John Paul Mueller, ISBN-13: 978-8126524938, Wiley
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, by William McKinney, ISBN-13: 978-9352136414, Shroff/O'Reilly

Page 65

• Data Science From Scratch: First Principles with Python, Second Edition, by Joel Grus, ISBN-13: 978-9352138326, Shroff/O'Reilly
• Data Science from Scratch, by Joel Grus, ISBN-13: 978-1491901427, O'Reilly
• Data Science Strategy For Dummies, by Ulrika Jagare, ISBN-13: 978-8126533367, Wiley
• Pandas for Everyone: Python Data Analysis, by Daniel Y. Chen, ISBN-13: 978-9352869169, Pearson Education
• Practical Data Science with R (MANNING), by Nina Zumel and John Mount, ISBN-13: 978-9351194378, Dreamtech Press

Page 66

6

ASSESS SUPERSTEP

Unit Structure
6.0 Objectives
6.1 Engineering a Practical Assess Superstep
6.2 Unit End Questions
6.3 References

6.0 OBJECTIVES

This chapter will make you understand the practical concepts of:
• the Assess superstep;
• the Python NetworkX library, used to draw network routing graphs;
• the Python Schedule library, used to schedule various jobs.

6.1 ENGINEERING A PRACTICAL ASSESS SUPERSTEP

Let us first consider an example of network routing. Python uses a library called NetworkX for network routing. To use the NetworkX library, first install it on your machine by running the following command at your command prompt:

pip install networkx

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. NetworkX provides:
• tools for the study of the structure and dynamics of social, biological, and infrastructure networks;
• a standard programming interface and graph implementation that is suitable for many applications;
• a rapid development environment for collaborative, multidisciplinary projects;
• an interface to existing numerical algorithms and code written in C, C++, and FORTRAN; and the ability to painlessly work with large nonstandard data sets.

With NetworkX you can load and store networks in standard and nonstandard data formats, generate many types of random and classic

Page 67

networks, analyze network structure, build network models, design new network algorithms, draw networks, and much more.

Graph Theory:

In graph theory, a graph has a finite set of vertices (V) connected by a set of edges (E), each edge joining two vertices. Each connection between two destinations, or nodes, is called a link or an edge. Consider the graph of bike paths below: the sets {K,L}, {F,G}, {J,H}, {H,L}, {A,B}, and {C,E} are examples of edges.

Figure 6.1

The total number of edges for each node is the degree of that node. In the graph above, M has a degree of 2 ({M,H} and {M,L}) while B has a degree of 1 ({B,A}). Degree is described formally as:
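In standard notation, the degree of a node v in a graph G = (V, E) is the number of edges incident to it:

\[
\deg(v) = \lvert \{\, e \in E : v \in e \,\} \rvert
\]

For the bike-path graph above this gives deg(M) = 2 and deg(B) = 1.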
munotes.in

Page 68

Connections through use of multiple edges are called paths. {F, H,M, L, H, J, G, I} is an example of a path. A simple path is when a pathdoes not repeat a node—formally known as Eulerian path. {I, G, J, H, F}is an example of a simple path. The shortest simple path is calledGeodesic. Geodesic between I and J is {I, G, J} or {I, K, J}. Finally, acycle is when a path’s start and end points are the same (ex. {H,M,L,H}).In some notebooks, a cycle is formally referred to as Eulerian cycle.Not all networks in a Graph system are interconnected. Thisdisconnection is when components are formed. As shown in the graphbelow, a component is formed only when every node has a path to othernodes. .
Figure 6.2Neo4J’s book on graph algorithms provides a clear summary
Connections through use of multiple edges are called paths. {F, H,M, L, H, J, G, I} is an example of a path. A simple path is when a pathdoes not repeat a node—formally known as Eulerian path. {I, G, J, H, F}is an example of a simple path. The shortest simple path is calledGeodesic. Geodesic between I and J is {I, G, J} or {I, K, J}. Finally, acycle is when a path’s start and end points are the same (ex. {H,M,L,H}).In some notebooks, a cycle is formally referred to as Eulerian cycle.Not all networks in a Graph system are interconnected. Thisdisconnection is when components are formed. As shown in the graphbelow, a component is formed only when every node has a path to othernodes. .
Figure 6.2Neo4J’s book on graph algorithms provides a clear summary
Connections through use of multiple edges are called paths. {F, H,M, L, H, J, G, I} is an example of a path. A simple path is when a pathdoes not repeat a node—formally known as Eulerian path. {I, G, J, H, F}is an example of a simple path. The shortest simple path is calledGeodesic. Geodesic between I and J is {I, G, J} or {I, K, J}. Finally, acycle is when a path’s start and end points are the same (ex. {H,M,L,H}).In some notebooks, a cycle is formally referred to as Eulerian cycle.Not all networks in a Graph system are interconnected. Thisdisconnection is when components are formed. As shown in the graphbelow, a component is formed only when every node has a path to othernodes. .
Figure 6.2Neo4J’s book on graph algorithms provides a clear summary
munotes.in

Page 69

Figure 6.3For example:# ### Creating a graph# Create an empty graph with no nodes and no edges.import networkx as nxG = nx.Graph()# By definition, a `Graph` is a collection of nodes (vertices) along withidentified pairs of# nodes (called # edges, links, etc). In NetworkX, nodes can be any[hashable] object e.g., a# text string, an image, an # XML object, another Graph, a customizednode object, etc.# # Nodes# The graph `G` can be grown in several ways. NetworkX includes manygraph generator# functions # and facilities to read and write graphs in many formats.# To get started # though we’ll look at simple manipulations. You can addone node at a# time,G.add_node(1)# or add nodes from any [iterable] container, such as a listG.add_nodes_from([2, 3])# Nodes from one graph can be incorporated into another:H = nx.path_graph(10)G.add_nodes_from(H)# `G` now contains the nodes of `H` as nodes of `G`.
Figure 6.3For example:# ### Creating a graph# Create an empty graph with no nodes and no edges.import networkx as nxG = nx.Graph()# By definition, a `Graph` is a collection of nodes (vertices) along withidentified pairs of# nodes (called # edges, links, etc). In NetworkX, nodes can be any[hashable] object e.g., a# text string, an image, an # XML object, another Graph, a customizednode object, etc.# # Nodes# The graph `G` can be grown in several ways. NetworkX includes manygraph generator# functions # and facilities to read and write graphs in many formats.# To get started # though we’ll look at simple manipulations. You can addone node at a# time,G.add_node(1)# or add nodes from any [iterable] container, such as a listG.add_nodes_from([2, 3])# Nodes from one graph can be incorporated into another:H = nx.path_graph(10)G.add_nodes_from(H)# `G` now contains the nodes of `H` as nodes of `G`.

Page 70

# In contrast, you could use the graph `H` as a node in `G`.G.add_node(H)# The graph `G` now contains `H` as a node. This flexibility is verypowerful as it allows# graphs of graphs, graphs of files, graphs of functions and much more. Itis worth thinking# about how to structure # your application sothat the nodes are usefulentities. Of course# you can always use a unique identifier # in `G` and have a separatedictionary keyed by# identifier to the node information if you prefer.# # Edges# `G` can also be grown by adding one edge at a time,G.add_edge(1, 2)e = (2, 3)G.add_edge(*e) # unpack edge tuple*# by adding a list of edges,G.add_edges_from([(1, 2), (1, 3)])# or by adding any ebunch of edges. An *ebunch* is any iterable containerof edge-tuples.# An edge-tuple can be a 2-tuple of nodes or a 3-tuple with 2 nodesfollowed by an edge# attribute dictionary, e.g.,# `(2, 3, {'weight': 3.1415})`. Edge attributes are discussed further below.G.add_edges_from(H.edges)# There are no complaints when adding existing nodes or edges.# Forexample, after removing all # nodes and edges,G.clear()# we add new nodes/edges and NetworkX quietly ignores any that arealready present.G.add_edges_from([(1, 2), (1, 3)])G.add_node(1)G.add_edge(1, 2)G.add_node("spam") # adds node "spam"G.add_nodes_from("spam") # adds 4 nodes: 's', 'p', 'a', 'm'G.add_edge(3, 'm')# At this stage the graph `G` consists of 8 nodes and 3 edges, as can beseen by:G.number_of_nodes()munotes.in

Page 71

 G.number_of_edges()# # Examining elements of a graph# We can examine the nodesand edges. Four basic graph propertiesfacilitate reporting:#`G.nodes`,# `G.edges`, `G.adj` and `G.degree`. These are set-like views of the nodes,edges, neighbors# (adjacencies), and degrees of nodes in a graph. They offer a continuallyupdated read-only#view into the graph structure. They are also dict-like in that you can lookup node and edge#data attributes via the views and iterate with data attributes usingmethods `.items()`,#`.data('span')`.# If you want a specific container type instead ofa view, you can specifyone.# Here we use lists, though sets, dicts, tuples and other containers may bebetter in other#contexts.list(G.nodes)list(G.edges)list(G.adj[1]) # or list(G.neighbors(1))G.degree[1] # the number of edges incident to 1# One can specify to report the edges and degree from a subset of allnodes using an#nbunch.# An *nbunch* is any of: `None` (meaning all nodes), a node, or aniterable container of nodes that is # not itself a node in the graph.G.edges([2, 'm'])G.degree([2, 3])# # Removing elements from a graph# One can remove nodes and edges from the graph in a similar fashion toadding.# Use methods `Graph.remove_node()`, `Graph.remove_nodes_from()`,#`Graph.remove_edge()`# and `Graph.remove_edges_from()`, e.g.G.remove_node(2)G.remove_nodes_from("spam")munotes.in

Page 72


list(G.nodes)G.remove_edge(1, 3)# # Using the graph constructors# Graph objects do not have to be built up incrementally-data specifying# graph structure can be passed directly to the constructors of the variousgraph classes.# When creating a graph structure by instantiating one of the graph# classes you can specify data in several formats.G.add_edge(1, 2)H = nx.DiGraph(G) # create a DiGraph using the connections from Glist(H.edges())edgelist = [(0,1), (1, 2), (2, 3)]H = nx.Graph(edgelist)# # What to use as nodes and edges# You might notice that nodes and edges are not specified as NetworkX# objects. This leaves you free to use meaningful items as nodes and# edges. The most common choices are numbers or strings, but a node can# be any hashable object (except `None`), and an edge can be associated# with any object `x` using `G.add_edge(n1, n2, object=x)`.# As an example, `n1` and `n2` could be protein objects from the RCSBProtein Data Bank,#and `x` # could refer to an XML record of publications detailingexperimental observations#of their interaction.# We have found this power quite useful, but its abuse can lead tosurprising behavior#unless one is # familiar with Python.# If in doubt,consider using `convert_node_labels_to_integers()` to obtaina moretraditional graph with # integer labels. Accessing edges and neighbors# In addition to the views `Graph.edges`, and `Graph.adj`, access to edgesand neighbors is#possible using subscript notation.G = nx.Graph([(1, 2, {"color": "yellow"})])G[1] # same as G.adj[1]G[1][2]G.edges[1, 2]# You can get/set the attributes of an edge using subscript notation# if the edge already existsmunotes.in

Page 73

 G.add_edge(1, 3)G[1][3]['color'] = "blue"G.edges[1,2]['color'] = "red"G.edges[1, 2]# Fast examination of all (node, adjacency) pairs is achieved using# `G.adjacency()`, or `G.adj.items()`.# Note that for undirected graphs, adjacency iteration sees each edgetwice.FG = nx.Graph()FG.add_weighted_edges_from([(1, 2, 0.125), (1, 3, 0.75), (2, 4, 1.2), (3, 4,0.375)])for n, nbrs in FG.adj.items():for nbr, eattr in nbrs.items():wt = eattr['weight']if wt< 0.5: print(f"({n}, {nbr}, {wt:.3})")# Convenient access to all edges is achieved with the edges propertyfor (u, v, wt) in FG.edges.data('weight'):if wt< 0.5:print(f"({u}, {v}, {wt:.3})")# # Adding attributes to graphs, nodes, and edges## Attributes such as weights, labels, colors, or whatever Python object youlike,# can be attached to graphs, nodes, or edges.## Each graph, node, and edge can hold key/value attribute pairs in anassociated# attribute dictionary (the keys must be hashable). By default these areempty,# but attributes can be added or changed using `add_edge`, `add_node` ordirect# manipulation of the attribute dictionaries named `G.graph`, `G.nodes`,and# `G.edges` for a graph `G`.# ## Graph attributes# Assign graph attributes when creating a new graphG = nx.Graph(day="Friday")G.graph# Or you can modify attributeslaterG.graph['day'] = "Monday"G.graphmunotes.in

Page 74

 # # Node attributes# Add node attributes using `add_node()`, `add_nodes_from()`, or`G.nodes`G.add_node(1, time='5pm')G.add_nodes_from([3], time='2pm')G.nodes[1]G.nodes[1]['room'] = 714G.nodes.data()# Note that adding a node to `G.nodes` does not add it to the graph, use# `G.add_node()` to add new nodes. Similarly for edges.# # Edge Attributes# Add/change edge attributes using `add_edge()`, `add_edges_from()`,# or subscript notation.G.add_edge(1, 2,weight=4.7 )G.add_edges_from([(3, 4), (4, 5)], color='red')G.add_edges_from([(1, 2, {'color': 'blue'}), (2, 3, {'weight': 8})])G[1][2]['weight'] = 4.7G.edges[3, 4]['weight'] = 4.2# The special attribute `weight` should be numeric as it is used by#algorithms requiring weighted edges.# Directed graphs# The `DiGraph` class provides additional methods and properties specific# to directed edges, e.g.,# `DiGraph.out_edges`, `DiGraph.in_degree`,# `DiGraph.predecessors()`, `DiGraph.successors()` etc.# To allow algorithms to work with both classes easily, the directedversions of# `neighbors()` is equivalent to `successors()` while `degree` reports# the sum of `in_degree` and `out_degree` even though that may feel# inconsistent at times.DG = nx.DiGraph()DG.add_weighted_edges_from([(1, 2, 0.5), (3, 1, 0.75)])DG.out_degree(1, weight='weight')DG.degree(1, weight='weight')list(DG.successors(1))list(DG.neighbors(1))# Some algorithms work only for directed graphs and others are not well# definedfor directed graphs. Indeed the tendency to lump directed# and undirected graphs together is dangerous. If you want to treat# a directed graph as undirected for some measurement you shouldprobably# convert it using `Graph.to_undirected()` or withmunotes.in

Page 75

 H =nx.Graph(G) # create an undirected graph H from a directed graph G# # Multigraphs# NetworkX provides classes for graphs which allow multiple edges# between any pair of nodes. The `MultiGraph` and# `MultiDiGraph`# classes allow you to add the same edge twice, possibly with different# edge data. This can be powerful for some applications, but many# algorithms are not well defined on such graphs.# Where results are well defined,# e.g., `MultiGraph.degree()` we provide the function. Otherwise you# should convert to a standard graph in a way that makes the measurementwell definedMG = nx.MultiGraph()MG.add_weighted_edges_from([(1, 2, 0.5), (1, 2, 0.75), (2, 3, 0.5)])dict(MG.degree(weight='weight'))GG = nx.Graph()for n, nbrs in MG.adjacency():for nbr, edict in nbrs.items():minvalue = min([d['weight'] for d in edict.values()])GG.add_edge(n, nbr, weight = minvalue)nx.shortest_path(GG, 1, 3)# # Graph generators and graph operations# In addition to constructing graphs node-by-node or edge-by-edge, they# can also be generated by# 1. Applying classic graph operations, such as:# 1. Using a call to one of the classic small graphs, e.g.,# 1. Using a (constructive) generator for a classic graph, e.g.,# like so:K_5 = nx.complete_graph(5)K_3_5= nx.complete_bipartite_graph(3, 5)barbell = nx.barbell_graph(10, 10)lollipop = nx.lollipop_graph(10, 20)# 1. Using a stochastic graph generator, e.g, like so:er = nx.erdos_renyi_graph(100, 0.15)ws = nx.watts_strogatz_graph(30, 3, 0.1)ba = nx.barabasi_albert_graph(100, 5)red = nx.random_lobster(100, 0.9, 0.9)# 1. Reading a graph stored in a file using common graph formats,# such as edge lists, adjacency lists, GML, GraphML, pickle, LEDA andothers.munotes.in

Page 76

nx.write_gml(red, "path.to.file")mygraph = nx.read_gml("path.to.file")# For details on graph formats see Reading and writing graphs# and for graph generator functions see Graph generators# # Analyzing graphs# The structure of `G` can be analyzed using various graph-theoreticfunctions such as:G = nx.Graph()G.add_edges_from([(1, 2), (1, 3)])G.add_node("spam") # adds node "spam"list(nx.connected_components(G))sorted(d for n, d in G.degree())nx.clustering(G)# Some functions with large output iterate over (node, value) 2-tuples.# These areeasily stored in a[dict](https://docs.python.org/3/library/stdtypes.html#dict)# structure if you desire.sp = dict(nx.all_pairs_shortest_path(G))sp[3]# See Algorithms for details on graph algorithms supported.# # Drawing graphs# NetworkX is not primarily a graph drawing package but basic drawingwith# Matplotlib as well as an interface to use the open source Graphvizsoftware# package are included. These are part of the `networkx.drawing` moduleand will# be imported if possible.# First importMatplotlib’s plot interface (pylab works too)import matplotlib.pyplot as plt# To test if the import of `networkx.drawing` was successful draw `G`using one ofG = nx.petersen_graph()plt.subplot(121)nx.draw(G, with_labels=True, font_weight='bold')plt.subplot(122)nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True,font_weight='bold')# when drawing to an interactive display. Note that you may need to issuea Matplotlibmunotes.in

Page 77

plt.show()options = {'node_color': 'black','node_size': 100,'width': 3,}plt.subplot(221)nx.draw_random(G, **optionsplt.subplot(222)nx.draw_circular(G, **options)plt.subplot(223)nx.draw_spectral(G, **options)plt.subplot(224)nx.draw_shell(G, nlist=[range(5,10), range(5)], **options)# You can find additionaloptions via `draw_networkx()` and# layouts via `layout`.# You can use multiple shells with `draw_shell()`.G = nx.dodecahedral_graph()shells = [[2, 3, 4, 5, 6], [8, 1, 0, 19, 18, 17, 16, 15, 14, 7], [9, 10, 11, 12,13]]nx.draw_shell(G, nlist=shells,**options)# To save drawings to a file, use, for examplenx.draw(G)plt.savefig("path.png")# writes to the file `path.png` in the local directory.Output:G = nx.petersen_graph()plt.subplot(121)nx.draw(G, with_labels=True, font_weight='bold')plt.subplot(122)nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True,font_weight='bold')munotes.in

Page 78

Figure 6.4plt.show()options = {'node_color': 'black','node_size': 100,'width': 3,}plt.subplot(221)nx.draw_random(G, **options)plt.subplot(222)nx.draw_circular(G, **options)plt.subplot(223)nx.draw_spectral(G, **options)plt.subplot(224)nx.draw_shell(G, nlist=[range(5,10), range(5)], **options
Figure 6.5
Figure 6.4plt.show()options = {'node_color': 'black','node_size': 100,'width': 3,}plt.subplot(221)nx.draw_random(G, **options)plt.subplot(222)nx.draw_circular(G, **options)plt.subplot(223)nx.draw_spectral(G, **options)plt.subplot(224)nx.draw_shell(G, nlist=[range(5,10), range(5)], **options
Figure 6.5
Figure 6.4plt.show()options = {'node_color': 'black','node_size': 100,'width': 3,}plt.subplot(221)nx.draw_random(G, **options)plt.subplot(222)nx.draw_circular(G, **options)plt.subplot(223)nx.draw_spectral(G, **options)plt.subplot(224)nx.draw_shell(G, nlist=[range(5,10), range(5)], **options
Figure 6.5
munotes.in

Page 79

G = nx.dodecahedral_graph()shells = [[2, 3, 4, 5, 6], [8, 1, 0, 19, 18, 17, 16, 15, 14, 7], [9, 10, 11, 12,13]]nx.draw_shell(G, nlist=shells, **options)nx.draw(
Figure 6.6nx.draw(G)plt.savefig("path.png")
Figure 6.7 Building a DAG for Scheduling Jobs
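Figure 6.7 refers to building a directed acyclic graph (DAG) for scheduling jobs. A minimal sketch of that idea with NetworkX and a topological sort follows; the job names and dependencies are illustrative assumptions, not taken from the figure:

import networkx as nx

# Each edge (a, b) means job a must complete before job b can start
dag = nx.DiGraph()
dag.add_edges_from([('retrieve', 'assess'),
                    ('assess', 'process'),
                    ('process', 'transform'),
                    ('transform', 'organize'),
                    ('transform', 'report')])

print(nx.is_directed_acyclic_graph(dag))   # True, so a valid schedule exists
print(list(nx.topological_sort(dag)))      # one valid execution order for the jobs

A topological sort only exists when the graph has no cycles, which is exactly why job dependencies are modelled as a DAG.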
munotes.in

Page 80

Python:Python Schedule Library:Schedule is in-process scheduler for periodicjobs that use thebuilder pattern for configuration. Schedule lets you run Python functions(or any other callable) periodically at pre-determined intervals using asimple, human-friendly syntax.Schedule Library is used to schedule a task at a particular time everyday or a particular day of a week. We can also set time in 24 hours formatthat when a task should run. Basically, Schedule Library matches yoursystems time to that of scheduled time set by you. Once the scheduledtime and system time matchesthe job function (command function that isscheduled ) is called.Installation:$ pip install scheduleschedule.Scheduler class:;schedule.every(interval=1) : Calls every on the default schedulerinstance. Schedule a new periodic job.;schedule.run_pending() : Calls run pending on the default schedulerinstance. Run all jobs that are scheduled to run.;schedule.run_all(delay_seconds=0) : Calls run_all on the defaultscheduler instance. Run all jobs regardless if they are scheduled torun or not.schedule.idle_seconds() : Calls idle_seconds on the default schedulerinstance.;schedule.next_run() : Calls next_run on the default schedulerinstance. Datetime whenthe next job should run.;schedule.cancel_job(job) : Calls cancel_job on the default schedulerinstance. Delete ascheduled job.;schedule.Job(interval, scheduler=None) classA periodic job as used by Scheduler.Parameters:;interval: A quantity of a certain time unit;scheduler: The Scheduler instance that this job will register itselfwithonce it has been fully configured in Job.do().Basic methods for Schedule.job:;at(time_str) : Schedule the job every day at a specific time. Callingthisisonly valid for jobs scheduled to run every N day(s).Parameters:time_str–A string in XX:YY format. Returns: Theinvoked job instancemunotes.in

Page 81

 ;do(job_func, *args, **kwargs) : Specifies the job_func that shouldbecalled every time the job runs. Any additional arguments arepassedon to job_func when the job runs.Parameters: job_func–The function to be scheduled. Returns: Theinvoked job instance;run() : Run the job and immediately reschedule it. Returns: Thereturnvalue returned by the job_func;to(latest) : Schedule the job to run at an irregular (randomized)interval. For example, every(A).to(B).seconds executes the jobfunction every N seconds such that A <= N <= B.For example# Schedule Library importedimport scheduleimport time# Functions setupdef placement():print("Get ready for Placement at various companies")def good_luck():print("Good Luckfor Test")def work():print("Study and work hard")def bedtime():print("It is bed time go rest")def datascience():print("Data science with python is fun")# Task scheduling# After every 10mins datascience() is called.schedule.every(10).minutes.do(datascience# After every hour datascience() is called.schedule.every().hour.do(datascience)# Every day at 12am or 00:00 time bedtime() is called.schedule.every().day.at("00:00").do(bedtime)# After every 5 to 10 mins in between run work()schedule.every(5).to(10).minutes.do(work)# Every mondaygood_luck() is calledschedule.every().monday.do(good_luck)# Every tuesday at 18:00 placement() is calledschedule.every().tuesday.at("18:00").do(placementmunotes.in

Page 82


# Loop so that the scheduling task# keeps on running all time.while True:# Checks whether a scheduled task# is pending to run or notschedule.run_pending()time.sleep(1)UNIT END QUESTIONS1.Write Python program to create the network routing diagram from thegiven data.2.Write a Python program to build directed acyclic graph.3.Write a Python program to pick the content for Bill Boards from thegiven data.4.Write a Python program to generate visitors data from the given csvfile.REFERENCES;Python for Data Science For Dummies, by Luca Massaron John PaulMueller (Author),;ISBN-13 : 978-8126524938, Wiley;Python for Data Analysis: Data Wrangling with Pandas, NumPy, andIPython, 2nd Edition by William McKinney (Author), ISBN-13 :978-9352136414 , Shroff/O'Reilly;Data Science From Scratch: FirstPrinciples with Python, SecondEdition by Joel Grus, ISBN-13 : 978-9352138326, Shroff/O'Reilly;Data Science from Scratch by Joel Grus, ISBN-13 : 978-1491901427,O′Reilly;Data Science Strategy For Dummies by Ulrika Jagare, ISBN-13 :978-8126533367 , Wiley;Pandas for Everyone: Python Data Analysis, by Daniel Y. Chen,ISBN-13 : 978-9352869169, Pearson Education;Practical Data Science with R (MANNING) by Nina Zumel, JohnMount, ISBN-13 : 978-9351194378, Dreamtech Pressmunotes.in

Page 83

 Unit IV7PROCESS SUPERSTEPUnit Structure7.0Objectives7.1Introduction7.2Data Vault7.2.1 Hubs7.2.2 Links7.2.3 Satellites7.2.4 Reference Satellites7.3Time-Person-Object-Location-Event Data Vault7.4Time Section7.4.1 Time Hub7.4.2 Time Links7.4.3 Time Satellites7.5Person Section7.5.1 Person Hub7.5.2 Person Links7.5.3 Person Satellites7.6Object Section7.6.1 Object Hub7.6.2 Object Links7.6.3 Object Satellites7.7Location Section7.7.1 Location Hub7.7.2 Location Links7.7.3 Location Satellites7.8Event Section7.8.1 Event Hub7.8.2 Event Links7.8.3 Event Satellites7.9Engineering a Practical Process Superstep7.9.1 Event7.9.2 Explicit Event7.9.3 Implicit Event7.105-Whys Technique7.10.1 Benefits of the 5 Whysmunotes.in

Page 84

7.10.2 When Are the 5 Whys Most Useful?
7.10.3 How to Complete the 5 Whys
7.11 Fishbone Diagrams
7.12 Monte Carlo Simulation
7.13 Causal Loop Diagrams
7.14 Pareto Chart
7.15 Correlation Analysis
7.16 Forecasting
7.17 Data Science

7.0 OBJECTIVES

The objective of this chapter is to learn the Time-Person-Object-Location-Event (T-P-O-L-E) design principle and the various concepts that are used to create and define relationships among this data.

7.1 INTRODUCTION

The Process superstep uses the assess results of the retrieve versions of the data sources to build a highly structured data vault. These data vaults form the basic data structure for the rest of the data science steps. The Process superstep is the amalgamation procedure that pipes your data sources into five primary classifications of data.
Figure 7.1 Categories of data

7.2 DATA VAULT

Data Vault modelling is a technique to manage the long-term storage of data from multiple operational systems. It stores historical data in the database.
munotes.in

Page 85

 7.2.1 Hubs:Data vault hub is used to store business key. These keys do notchange over time. Hub also contains a surrogate key for each hub entryand metadata information for a business key.7.2.2 Links:Data vault links are join relationship between business keys.7.2.3 Satellites:Data vault satellites stores thechronological descriptive andcharacteristics for a specific section of business data. Using hub and linkswe get model structure but no chronological characteristics. Satellitesconsist of characteristics and metadata linking them to their specific hub.7.2.4 Reference Satellites:Reference satellites are referenced from satellites that can be usedby other satellites to prevent redundant storage of reference characteristics.7.3 TIME-PERSON-OBJECT-LOCATION-EVENT DATAVAULTWe will useTime-Person-Object-Location-Event (T-P-O-L-E) designprinciple.All five sections are linked with each other, resulting into sixteen links.
Figure7.2 Time-Person-Object-Location-Event high-level design
 7.2.1 Hubs:Data vault hub is used to store business key. These keys do notchange over time. Hub also contains a surrogate key for each hub entryand metadata information for a business key.7.2.2 Links:Data vault links are join relationship between business keys.7.2.3 Satellites:Data vault satellites stores thechronological descriptive andcharacteristics for a specific section of business data. Using hub and linkswe get model structure but no chronological characteristics. Satellitesconsist of characteristics and metadata linking them to their specific hub.7.2.4 Reference Satellites:Reference satellites are referenced from satellites that can be usedby other satellites to prevent redundant storage of reference characteristics.7.3 TIME-PERSON-OBJECT-LOCATION-EVENT DATAVAULTWe will useTime-Person-Object-Location-Event (T-P-O-L-E) designprinciple.All five sections are linked with each other, resulting into sixteen links.
Figure7.2 Time-Person-Object-Location-Event high-level design
 7.2.1 Hubs:Data vault hub is used to store business key. These keys do notchange over time. Hub also contains a surrogate key for each hub entryand metadata information for a business key.7.2.2 Links:Data vault links are join relationship between business keys.7.2.3 Satellites:Data vault satellites stores thechronological descriptive andcharacteristics for a specific section of business data. Using hub and linkswe get model structure but no chronological characteristics. Satellitesconsist of characteristics and metadata linking them to their specific hub.7.2.4 Reference Satellites:Reference satellites are referenced from satellites that can be usedby other satellites to prevent redundant storage of reference characteristics.7.3 TIME-PERSON-OBJECT-LOCATION-EVENT DATAVAULTWe will useTime-Person-Object-Location-Event (T-P-O-L-E) designprinciple.All five sections are linked with each other, resulting into sixteen links.
Figure7.2 Time-Person-Object-Location-Event high-level design
munotes.in

Page 86

7.4 TIME SECTIONTime section contain data structure to store all time relatedinformation.For example, time at which event has occurred.7.4.1Time Hub:This hub act as connector between time zones.Following are the fields of time hub.
7.4.2 Time Links:Time Links connect time hub to other hubs.
Figure7.3 Time linkFollowing are the time links that can be stored as separate links.;Time-Person Link•This link connects date-time values from time hub to person hub.•Dates such as birthdays, anniversaries, book access date, etc.
7.4 TIME SECTIONTime section contain data structure to store all time relatedinformation.For example, time at which event has occurred.7.4.1Time Hub:This hub act as connector between time zones.Following are the fields of time hub.
7.4.2 Time Links:Time Links connect time hub to other hubs.
Figure7.3 Time linkFollowing are the time links that can be stored as separate links.;Time-Person Link•This link connects date-time values from time hub to person hub.•Dates such as birthdays, anniversaries, book access date, etc.
7.4 TIME SECTIONTime section contain data structure to store all time relatedinformation.For example, time at which event has occurred.7.4.1Time Hub:This hub act as connector between time zones.Following are the fields of time hub.
7.4.2 Time Links:Time Links connect time hub to other hubs.
Figure7.3 Time linkFollowing are the time links that can be stored as separate links.;Time-Person Link•This link connects date-time values from time hub to person hub.•Dates such as birthdays, anniversaries, book access date, etc.
munotes.in

Page 87

;Time-Object Link•This link connects date-time values from time hub to object hub.•Dates such as when you buy or sell car, house or book, etc.;Time-Location Link•This link connects date-time values from time hub to location hub.•Dates such as when you moved or access book from post code,etc.;Time-Event Link•This link connects date-time values from time hub to event hub.•Dates such as when you changed vehicles, etc.7.4.3 Time Satellites:Following are the fields of time satellites.
Time satellite can be used to move from one time zone to othervery easily. This feature will be used during Transform superstep.7.5 PERSON SECTIONPerson section contains data structure to store all data related to person.7.5.1 Person Hub:Following are the fields of Person hub.
munotes.in

Page 88

7.5.2 Person Links:Person Links connect person hub to other hubs.
Figure 7.4 Person LinkFollowing are the person links that can be stored as separate links.;Person-Time Link•This link contains relationship between person hub and time hub.;Person-Object Link•This link contains relationship between person hub and object hub.;Person-Location Link•This link contains relationship between person hub and locationhub.;Person-Event Link•This link contains relationship between person hub and event hub.7.5.3 Person Satellites:Person satellites are part of vault. Basically, it is information aboutbirthdate, anniversary or validity dates of ID for respective person.
7.5.2 Person Links:Person Links connect person hub to other hubs.
Figure 7.4 Person LinkFollowing are the person links that can be stored as separate links.;Person-Time Link•This link contains relationship between person hub and time hub.;Person-Object Link•This link contains relationship between person hub and object hub.;Person-Location Link•This link contains relationship between person hub and locationhub.;Person-Event Link•This link contains relationship between person hub and event hub.7.5.3 Person Satellites:Person satellites are part of vault. Basically, it is information aboutbirthdate, anniversary or validity dates of ID for respective person.
7.5.2 Person Links:Person Links connect person hub to other hubs.
Figure 7.4 Person LinkFollowing are the person links that can be stored as separate links.;Person-Time Link•This link contains relationship between person hub and time hub.;Person-Object Link•This link contains relationship between person hub and object hub.;Person-Location Link•This link contains relationship between person hub and locationhub.;Person-Event Link•This link contains relationship between person hub and event hub.7.5.3 Person Satellites:Person satellites are part of vault. Basically, it is information aboutbirthdate, anniversary or validity dates of ID for respective person.
munotes.in

Page 89

7.6 OBJECT SECTIONObject section contains data structure to store all data related to object.7.6.1 Object Hub:Object hub represent a real-world object with few attributes.Following are the fields of object hub.
7.6.2 Object Links:Object Links connect object hub to other hubs.
Figure 7.5 Object LinkFollowing are the object links that can be stored as separate links.;Object-Time Link•This link contains relationship between Object hub and time hub.;Object-Person Link•This link contains relationship between Object hub and Personhub.
7.6 OBJECT SECTIONObject section contains data structure to store all data related to object.7.6.1 Object Hub:Object hub represent a real-world object with few attributes.Following are the fields of object hub.
7.6.2 Object Links:Object Links connect object hub to other hubs.
Figure 7.5 Object LinkFollowing are the object links that can be stored as separate links.;Object-Time Link•This link contains relationship between Object hub and time hub.;Object-Person Link•This link contains relationship between Object hub and Personhub.
7.6 OBJECT SECTIONObject section contains data structure to store all data related to object.7.6.1 Object Hub:Object hub represent a real-world object with few attributes.Following are the fields of object hub.
7.6.2 Object Links:Object Links connect object hub to other hubs.
Figure 7.5 Object LinkFollowing are the object links that can be stored as separate links.;Object-Time Link•This link contains relationship between Object hub and time hub.;Object-Person Link•This link contains relationship between Object hub and Personhub.
munotes.in

Page 90

;Object-Location Link•This link contains relationship between Object hub and Locationhub.;Object-Event Link•This link contains relationship between Object hub and event hub.7.6.3 Object Satellites:Object satellites are part of vault. Basically, it is information aboutID,UUID, type, key, etc. for respective object.
7.7 LOCATION SECTIONLocation section contains data structure to store all data related tolocation.7.7.1 Location Hub:The location hub consists of a series of fields that supports a GPSlocation. The locationhub consists of the following fields:
7.7.2 Location Links:Location Links connect location hub to other hubs.
munotes.in

Page 91

 Figure 7.6 Location LinkFollowing are the location links that can be stored as separate links.;Location-Time Link•This link contains relationship between location hub and time hub.;Location-Person Link•This link contains relationship between location hub and personhub.;Location-Object Link•This link contains relationship between location hub and objecthub.;Location-Event Link•This link contains relationship between locationhub and eventhub.7.7.3 Location Satellites:Location satellites are part of vault that contains locations of entities.
 Figure 7.6 Location LinkFollowing are the location links that can be stored as separate links.;Location-Time Link•This link contains relationship between location hub and time hub.;Location-Person Link•This link contains relationship between location hub and personhub.;Location-Object Link•This link contains relationship between location hub and objecthub.;Location-Event Link•This link contains relationship between locationhub and eventhub.7.7.3 Location Satellites:Location satellites are part of vault that contains locations of entities.
 Figure 7.6 Location LinkFollowing are the location links that can be stored as separate links.;Location-Time Link•This link contains relationship between location hub and time hub.;Location-Person Link•This link contains relationship between location hub and personhub.;Location-Object Link•This link contains relationship between location hub and objecthub.;Location-Event Link•This link contains relationship between locationhub and eventhub.7.7.3 Location Satellites:Location satellites are part of vault that contains locations of entities.
munotes.in

Page 92


7.8 EVENT SECTIONIt contains data structure to store all data of entities related to eventthat has occurred.7.8.1 Event Hub:Event hub contains various fields that stores real world events.
7.8.2 Event Links:Event Links connect event hub to other hubs.
Figure7.7 Event LinkFollowing are the time links that can be stored as separate links.;Event-Time Link•This link contains relationship between event hub and time hub.;Event-Person Link•This link contains relationship between event hub and person hub.

7.8 EVENT SECTIONIt contains data structure to store all data of entities related to eventthat has occurred.7.8.1 Event Hub:Event hub contains various fields that stores real world events.
7.8.2 Event Links:Event Links connect event hub to other hubs.
Figure7.7 Event LinkFollowing are the time links that can be stored as separate links.;Event-Time Link•This link contains relationship between event hub and time hub.;Event-Person Link•This link contains relationship between event hub and person hub.

7.8 EVENT SECTIONIt contains data structure to store all data of entities related to eventthat has occurred.7.8.1 Event Hub:Event hub contains various fields that stores real world events.
7.8.2 Event Links:Event Links connect event hub to other hubs.
Figure7.7 Event LinkFollowing are the time links that can be stored as separate links.;Event-Time Link•This link contains relationship between event hub and time hub.;Event-Person Link•This link contains relationship between event hub and person hub.
munotes.in

Page 93

 ;Event-Object Link•This link contains relationship between event hub and object hub.;Event-Location Link•This link containsrelationship between event hub and locationhub.7.8.3 Event Satellites:Event satellites are part of vault it contains event information that occursin the system.7.9 ENGINEERING A PRACTICAL PROCESSSUPERSTEPTime:Time is most important characteristics of data used to record eventtime. ISO 8601-2004 defines an international standard for interchangeformats for dates and times.The following entities are part of ISO 8601-2004 standard:Year, month, day, hour, minute, second, and fraction of a secondThe data/time is recorded from largest (year) to smallest (fractionof second). These values must have a pre-approved fixed number of digitsthat are padded with leading zeros.YearThe standard uses four digits to represent year. The values ranges from0000 to 9999.AD/BC requires conversionYearConversionN ADYear N3 ADYear 31 ADYear 11 BCYear 02 BCYear–12020AD+20202020BC-2019 (year-1 for BC)Table 7.1from datetime import datetimefrom pytz import timezone, all_timezonesnow_date = datetime(2020,1,2,3,4,5,6)now_utc=now_date.replace(tzinfo=timezone('UTC'))munotes.in

Page 94

 print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)(%z)")))print('Year:',str(now_utc.strftime("%Y")))Output:Month:The standard uses two digits to represent month. The values ranges from01 to 12.The rule for a valid month is 12 January 2020 becomes 2020-11-12.Above program can be updated to extract month value.print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)(%z)")))print('Month:',str(now_utc.strftime("%m")))print('Month Name:',str(now_utc.strftime("%B")))Output:Following are the English names for monthNumberName01January02February03March04April05May06June04.1July08August09September10October11November12DecemberTable 7.2DayThe standard uses two digits to represent month. The values ranges from01 to 31.
munotes.in

Page 95

 The rule for a valid month is 22 January 2020 becomes 2020-01-22 or+2020-01-22.print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)(%z)")))print('Day:',str(now_utc.strftime("%d")))Output:Hour:The standard uses two digits to represent hour. The values ranges from 00to 24.The valid format is hhmmss or hh:mm:ss. The shortened format hhmm orhh:mm is acceptedThe use of 00:00:00 is the beginning of the calendar day. The use of24:00:00 is only to indicate the end of the calendar day.print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)(%z)")))print('Hour:',str(now_utc.strftime("%H")))Output:Minute:The standard uses two digits to represent minute. The values ranges from00 to 59.The standard minute must use two-digit values within the range of 00through 59.The valid format is hhmmss or hh:mm:ss.Output:
munotes.in

Page 96

Second:The standard uses two digits to representsecond. The values ranges from00 to 59.The valid format is hhmmss or hh:mm:ss.print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)(%z)")))print('Second:',str(now_utc.strftime("%S")))Output:The fraction of a second is only defined as a format: hhmmss,sss orhh:mm:ss,sss orhhmmss.sss or hh:mm:ss.sss.The current commonly used formats are the following:• hh:mm:ss.s: Tenth of a second• hh:mm:ss.ss: Hundredth of a second• hh:mm:ss.sss: Thousandth of a secondprint('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)(%z)")))print('Millionth of Second:',str(now_utc.strftime("%f")))Coordinated Universal Time (UTC)A sample program to display current time.from datetime import datetimefrompytz import all_timezones,timezone#get the current timenow_date_local=datetime.now()#Change the local time to 'Etc/GMT-4.1'now_date=now_date_local.replace(tzinfo=timezone('Etc/GMT-4.1'))#get the time in Mumbai, Indianow_india=now_date.astimezone(timezone('Etc/GMT-4.1'))print('India Date Time:',str(now_india.strftime("%Y-%m-%d %H:%M:%S(%Z)(%z)")))
munotes.in

Page 97

Output:7.9.1 Event:This structure records any specific event or action that isdiscovered in the data sources. Anevent is any action that occurs withinthe data sources. Events are recorded using three main data entities: EventType, Event Group, and Event Code. The details of each event arerecorded as a set of details against the event code. There are two maintypes of events.7.9.2 Explicit Event:This type of event is stated in the data source clearly and with fulldetails. There is cleardata to show that the specific action was performed.Following are examples of explicit events:•A security card with number 1234 was used to open door A.•You are reading Chapter 9 of Practical DataScience.•I bought ten cans of beef curry.Explicit events are the events that the source systems supply, asthese have directdata that proves that the specific action was performed.7.9.3 Implicit Event:This type of event is formulated from characteristics of the data inthe source systems plus a series of insights on the data relationships.The following are examples of implicit events:•A security card with number 8884.1 was used to open door X.•A security card with number 8884.1 was issued to Mr. Vermeulen.•Room 302 is fitted with a security reader marked door X.These three events would imply that Mr. Vermeulen entered room302 as an event. Not true!7.10 5-WHYS TECHNIQUEData science is at its core about curiosity and inquisitiveness.Thiscore is rooted in the 5Whys. The 5 Whys is a technique used in theanalysis phase of data science.
munotes.in

Page 98

7.10.1 Benefits of the 5 Whys:The 5 Whys assist the data scientist to identify the root cause of aproblem and determine the relationship between different root causes ofthe same problem. It is one of the simplest investigative tools—easy tocomplete without intense statistical analysis.7.10.2 When Are the 5 Whys Most Useful?:The 5 Whys are most useful for finding solutions to problems thatinvolve human factors or interactions that generate multi-layered dataproblems. In day-to-day business life, they can be used in real-worldbusinesses to find the root causes of issues.7.10.3 How to Complete the 5 Whys?:Write down the specific problem. This will help you to formalizethe problem and describe it completely. It also helps the data science teamto focus on the same problem. Ask why the problem occurred and writethe answer below the problem. If the answer you provided doesn’t identifythe root cause of the problem that you wrote down first, ask why again,and write down that answer. Loop back to the preceding step until you andyour customer are in agreement that the problem’s root cause is identified.Again, this may require fewer or more than the5 Whys.Example:Problem Statement: Customers are unhappy because they are beingshipped products that don’t meet their specifications.1. Why are customers being shipped bad products?•Because manufacturing built the products to a specification that isdifferent from what the customer and the salesperson agreed to.2. Why did manufacturing build the products to a different specificationthan that of sales?•Because the salesperson accelerates work on the shop floor by callingthe head of manufacturing directly to begin work. An error occurredwhen the specifications were being communicated or written down.3. Why does the salesperson call the head of manufacturing directly tostart work instead of following the procedure established by the company?•Because the “start work” form requires the sales director’s approvalbefore work can begin and slows the manufacturing process (or stopsit when the director is out of the office).4. Why does the form contain an approval for the sales director?•Because the sales director must be continually updated on sales fordiscussions with the CEO, as my retailer customer was a top ten keyaccount.munotes.in

Page 99

In this case, only four whys were required to determine that a non-value-add edsignature authority helped to cause a process breakdown inthe quality assurance for a key account! The rest was just criminal.The external buyer at the wholesaler knew this process wasregularly by passed and started buying the bad tins to act as an unofficialbackfill for the failingprocess in the quality-assurance process inmanufacturing, to make up the shortfalls in sales demand. The wholesalersimply relabelled the product and did not change how it wasmanufactured. The reason? Big savings lead to big bonuses. A key client’sorders had to be filled. Sales are important!7.11 FISHBONE DIAGRAMSThe fishbone diagram or Ishikawa diagram is a useful tool to findwhere each data fits into data vault. This is a cause-and-effect diagram thathelps managers to track down the reasons for imperfections, variations,defects, or failures. The diagram looks just like a fish’s skeleton with theproblem at its head and the causes for the problem feeding into the spine.Once all the causes that underlie the problem have been identified,managers can start looking for solutions to ensure that the problem doesn’tbecome a recurring one. It can also be used in product development.Having a problem-solving product will ensure that your new developmentwill be popular–provided people care about the problem you’re trying tosolve. The fishbone diagram strives to pinpoint everything that’s wrongwith current market offerings so that you can develop an innovation thatdoesn’t have these problems. Finally, the fishbone diagram is also a greatway to look for and prevent quality problems before they ever arise. Use itto troubleshoot before there is trouble, and you can overcome all or mostof your teething troubles when introducing something new.
Figure7.8 Fishbone diagram
In this case, only four whys were required to determine that a non-value-add edsignature authority helped to cause a process breakdown inthe quality assurance for a key account! The rest was just criminal.The external buyer at the wholesaler knew this process wasregularly by passed and started buying the bad tins to act as an unofficialbackfill for the failingprocess in the quality-assurance process inmanufacturing, to make up the shortfalls in sales demand. The wholesalersimply relabelled the product and did not change how it wasmanufactured. The reason? Big savings lead to big bonuses. A key client’sorders had to be filled. Sales are important!7.11 FISHBONE DIAGRAMSThe fishbone diagram or Ishikawa diagram is a useful tool to findwhere each data fits into data vault. This is a cause-and-effect diagram thathelps managers to track down the reasons for imperfections, variations,defects, or failures. The diagram looks just like a fish’s skeleton with theproblem at its head and the causes for the problem feeding into the spine.Once all the causes that underlie the problem have been identified,managers can start looking for solutions to ensure that the problem doesn’tbecome a recurring one. It can also be used in product development.Having a problem-solving product will ensure that your new developmentwill be popular–provided people care about the problem you’re trying tosolve. The fishbone diagram strives to pinpoint everything that’s wrongwith current market offerings so that you can develop an innovation thatdoesn’t have these problems. Finally, the fishbone diagram is also a greatway to look for and prevent quality problems before they ever arise. Use itto troubleshoot before there is trouble, and you can overcome all or mostof your teething troubles when introducing something new.
Figure7.8 Fishbone diagram
7.12 MONTE CARLO SIMULATION

The Monte Carlo simulation technique performs analysis by building models of possible results, by substituting a range of values (a probability distribution) for parameters that have inherent uncertainty. It then calculates results over and over, each time using a different set of random values from the probability functions. Depending on the number of uncertainties and the ranges specified for them, a Monte Carlo simulation can involve thousands or tens of thousands of recalculations before it is complete. Monte Carlo simulation produces distributions of possible outcome values. As a data scientist, this gives you an indication of how your model will react under real-life situations. It also gives the data scientist a tool to check complex systems, where the input parameters are high-volume or complex. A small simulation sketch follows Figure 7.9.

7.13 CAUSAL LOOP DIAGRAMS

A causal loop diagram (CLD) is a causal diagram that aids in visualizing how a number of variables in a system are interrelated and drive cause-and-effect processes. The diagram consists of a set of nodes and edges. Nodes represent the variables, and edges are the links that represent a connection or a relation between the two variables.

Example: The challenge is to keep the "Number of Employees Available to Work" and "Productivity" as high as possible.
Figure 7.9 Causal loop diagram
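To make the Monte Carlo idea from Section 7.12 concrete, the short sketch below simulates an uncertain monthly profit as revenue minus cost, where both inputs are drawn from assumed probability distributions. The distribution parameters and variable names are illustrative assumptions, not values from the text.

import numpy as np

rng = np.random.default_rng(42)
runs = 10000  # number of Monte Carlo recalculations

# Assumed input distributions for two uncertain parameters.
revenue = rng.normal(loc=120000, scale=15000, size=runs)                 # roughly normal revenue
cost = rng.triangular(left=70000, mode=90000, right=130000, size=runs)   # skewed cost

profit = revenue - cost  # one simulated outcome per run

# The result is a distribution of possible outcomes, not a single number.
print('Mean profit     :', round(profit.mean(), 2))
print('5th percentile  :', round(np.percentile(profit, 5), 2))
print('95th percentile :', round(np.percentile(profit, 95), 2))
print('P(loss)         :', round((profit < 0).mean(), 4))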
7.14 PARETO CHART

A Pareto chart is a bar graph, also called a Pareto diagram or Pareto analysis. The lengths of the bars represent frequency or cost (time or money), and the bars are arranged with the longest on the left and the shortest on the right. In this way the chart visually depicts which situations are most significant.

When to use a Pareto chart:
• When analysing data about the frequency of problems or causes in a process.
• When there are many problems or causes and you want to focus on the most significant.
• When analysing broad causes by looking at their specific components.
• When communicating with others about your data.

The following diagram shows how many customer complaints were received in each of five categories; a small plotting sketch follows the figure caption.

Figure 7.10 Pareto Chart
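As a hedged illustration of how such a chart can be produced, the sketch below builds a Pareto chart with matplotlib from made-up complaint counts; the five category names and counts are assumptions for demonstration only.

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical complaint counts per category (not from the text).
categories = ['Delivery', 'Billing', 'Quality', 'Service', 'Other']
counts = np.array([120, 80, 45, 30, 15])

# Sort bars from longest to shortest, as a Pareto chart requires.
order = np.argsort(counts)[::-1]
categories = [categories[i] for i in order]
counts = counts[order]
cumulative_pct = counts.cumsum() / counts.sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(categories, counts, color='steelblue')
ax1.set_ylabel('Number of complaints')

# Second axis carries the cumulative percentage line.
ax2 = ax1.twinx()
ax2.plot(categories, cumulative_pct, color='darkred', marker='o')
ax2.set_ylabel('Cumulative %')
ax2.set_ylim(0, 110)

plt.title('Pareto chart of customer complaints')
plt.show()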
7.15 CORRELATION ANALYSIS

The most common analysis I perform at this step is the correlation analysis of all the data in the data vault. Feature development is performed between data items, to find relationships between data values.

import pandas as pd

a = [[1, 2, 4], [5, 4.1, 9], [8, 3, 13], [4, 3, 19],
     [5, 6, 12], [5, 6, 11], [5, 6, 4.1], [4, 3, 6]]
df = pd.DataFrame(data=a)
cr = df.corr()
print(cr)

7.16 FORECASTING

Forecasting is the ability to project a possible future by looking at historical data. The data vault enables these types of investigations, owing to the complete history it collects as it processes the source systems' data. You will perform many forecasting projects during your career as a data scientist and supply answers to questions such as the following:
• What should we buy?
• What should we sell?
• Where will our next business come from?
People want to know what you calculate to determine what is about to happen. A minimal forecasting sketch is shown after Section 7.17 below.

7.17 DATA SCIENCE

Data science works best when approved techniques and algorithms are followed. After performing various experiments on data, the results must be verified and must have support. Data science that works follows these steps:

Step 1: It begins with a question.
Step 2: Design a model, select a prototype for the data, and start a virtual simulation. Some statistical and mathematical solutions can be added to start a data science model. All questions must be related to the customer's business, in such a way that the answer provides an insight into the business.
Step 3: Formulate a hypothesis based on the collected observations. Process the observations with the model and prove whether the hypothesis is true or false.
Step 4: Compare the result with the real-world observations and provide these results to the real-life business.
Step 5: Communicate the progress and intermediate results with customers and subject experts, and involve them in the whole process to ensure that they are part of the journey of discovery.
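The text does not prescribe a specific forecasting method for Section 7.16, so the sketch below shows one minimal, commonly used approach: fitting a trend line to a short monthly sales history and projecting it forward. The sales figures and the choice of a simple linear trend are assumptions for illustration.

import numpy as np

# Hypothetical monthly sales history (units sold).
sales = np.array([102, 110, 115, 121, 130, 138, 141, 150, 158, 163, 171, 180])
months = np.arange(len(sales))

# Fit a degree-1 polynomial (a straight trend line) to the history.
slope, intercept = np.polyfit(months, sales, deg=1)

# Project the next three months from the fitted trend.
future_months = np.arange(len(sales), len(sales) + 3)
forecast = slope * future_months + intercept

for m, f in zip(future_months, forecast):
    print('Month', m + 1, ': forecast', round(f, 1), 'units')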
SUMMARY

The Process superstep takes the assessed results of the retrieve process from the data sources and converts them into a highly structured data vault that acts as the basic data structure for the remaining data science steps.

UNIT END QUESTIONS

1. Explain the Process superstep.
2. Explain the concept of a data vault.
3. What are the different typical reference satellites? Explain.
4. Explain the TPOLE design principle.
5. Explain the Time section of TPOLE.
6. Explain the Person section of TPOLE.
7. Explain the Object section of TPOLE.
8. Explain the Location section of TPOLE.
9. Explain the Event section of TPOLE.
10. Explain the different date and time formats. What is a leap year? Explain.
11. What is an event? Explain explicit and implicit events.
12. How do you complete the 5 Whys?
13. What is a fishbone diagram? Explain with an example.
14. Explain the significance of Monte Carlo simulation and causal loop diagrams.
15. What are Pareto charts? What information can be obtained from Pareto charts?
16. Explain the use of correlation and forecasting in data science.
17. State and explain the five steps of data science.

REFERENCES

• https://asq.org/
• https://scikit-learn.org/
• https://www.geeksforgeeks.org/
• https://statistics.laerd.com/spss-tutorials/
• https://www.kdnuggets.com/
8
TRANSFORM SUPERSTEP

Unit Structure
8.0 Objectives
8.1 Introduction
8.2 Dimension Consolidation
8.3 Sun Model
    8.3.1 Person-to-Time Sun Model
    8.3.2 Person-to-Object Sun Model
    8.3.3 Person-to-Location Sun Model
    8.3.4 Person-to-Event Sun Model
    8.3.5 Sun Model to Transform Step
8.4 Transforming with Data Science
8.5 Common Feature Extraction Techniques
    8.5.1 Binning
    8.5.2 Averaging
8.6 Hypothesis Testing
    8.6.1 T-Test
    8.6.2 Chi-Square Test
8.7 Overfitting & Underfitting
    8.7.1 Polynomial Features
    8.7.2 Common Data-Fitting Issue
8.8 Precision-Recall
    8.8.1 Precision-Recall Curve
    8.8.2 Sensitivity & Specificity
    8.8.3 F1-Measure
    8.8.4 Receiver Operating Characteristic (ROC) Analysis Curves
8.9 Cross-Validation Test
8.10 Univariate Analysis
8.11 Bivariate Analysis
8.12 Multivariate Analysis
8.13 Linear Regression
    8.13.1 Simple Linear Regression
    8.13.2 RANSAC Linear Regression
    8.13.3 Hough Transform
8.14 Logistic Regression
    8.14.1 Simple Logistic Regression
    8.14.2 Multinomial Logistic Regression
    8.14.3 Ordinal Logistic Regression
8.15 Clustering Techniques
    8.15.1 Hierarchical Clustering
    8.15.2 Partitional Clustering
8.16 ANOVA
Decision Trees

8.0 OBJECTIVES

The objective of this chapter is to learn data transformation techniques, feature extraction techniques, missing data handling, and various techniques to categorise data into suitable groups.

8.1 INTRODUCTION

The Transform superstep allows us to take data from the data vault and answer the questions raised by the investigation. It applies standard data science techniques and methods to attain insight and knowledge about the data, which can then be transformed into actionable decisions. These results can be explained to non-data scientists.

The Transform superstep uses the data vault from the Process step as its source data.

8.2 DIMENSION CONSOLIDATION

The data vault consists of five categories of data, with linked relationships and additional characteristics in satellite hubs. To perform dimension consolidation, you start with a given relationship in the data vault and construct a sun model for that relationship, as shown in the figure below.
Figure 8.1 Categories of data
8.3 SUN MODEL

The sun model technique is used by data scientists to perform consistent dimension consolidation. It allows us to explain data relationships to the business without going into technical details.

8.3.1 Person-to-Time Sun Model:
The Person-to-Time sun model explains the relationship between the Person and Time categories in the data vault. The sun model is constructed to show all the characteristics from the two data vault hub categories. It explains how you will create two dimensions and a fact via the Transform step, as shown in the figure below.
Figure 8.2 Person-to-Time sun model

The sun model is constructed to show all the characteristics from the two data vault hub categories you are planning to extract. It explains how you will create two dimensions and a fact via the Transform step from the figure above. You will create two dimensions (Person and Time) with one fact (PersonBornAtTime), as shown in the figure below. A small illustrative sketch follows Figure 8.3.
Figure 8.3 Person-to-Time sun model (explained)
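As a minimal sketch of what this sun model turns into, the following hypothetical pandas example builds the two dimensions and the fact as small tables and joins them back together. The column names mirror the sun model above, and the sample values echo the worked example in Section 8.3.5; everything else is invented for illustration.

import pandas as pd

# Dimension: Person (illustrative values).
dim_person = pd.DataFrame([
    {'PersonID': 'P1', 'FirstName': 'Guðmundur', 'LastName': 'Gunnarsson'},
])

# Dimension: Time.
dim_time = pd.DataFrame([
    {'TimeID': 'T1', 'UTCDate': '1960-12-20 10:15:00', 'TimeZone': 'Atlantic/Reykjavik'},
])

# Fact: PersonBornAtTime links the two dimensions by their keys.
fact_person_born_at_time = pd.DataFrame([
    {'IDNumber': 'F1', 'PersonID': 'P1', 'TimeID': 'T1'},
])

# Resolving the fact against its dimensions reproduces the business view.
view = (fact_person_born_at_time
        .merge(dim_person, on='PersonID')
        .merge(dim_time, on='TimeID'))
print(view)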
8.3.2 Person-to-Object Sun Model:
The Person-to-Object sun model explains the relationship between the Person and Object categories in the data vault. The sun model is constructed to show all the characteristics from the two data vault hub categories. It explains how you will create two dimensions and a fact via the Transform step, as shown in the figure below.
Figure 8.4 Sun model for the PersonIsSpecies fact

8.3.3 Person-to-Location Sun Model:
The Person-to-Location sun model explains the relationship between the Person and Location categories in the data vault. The sun model is constructed to show all the characteristics from the two data vault hub categories. It explains how you will create two dimensions and a fact via the Transform step, as shown in the figure below.
Figure 8.5 Sun model for the PersonAtLocation fact

8.3.4 Person-to-Event Sun Model:
The Person-to-Event sun model explains the relationship between the Person and Event categories in the data vault.
Figure 8.6 Sun model for the PersonBorn fact

8.3.5 Sun Model to Transform Step:
You must build three items: dimension Person, dimension Time, and fact PersonBornAtTime. Open your Python editor and create a file named Transform-Gunnarsson-Sun-Model.py:

import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid

pd.options.mode.chained_assignment = None
################################################################
if sys.platform == 'linux' or sys.platform == 'darwin':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
################################################################
Company = '01-Vermeulen'
################################################################
sDataBaseDir = Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName = sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataWarehousetDir = Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName = sDataWarehousetDir + '/datawarehouse.db'
conn2 = sq.connect(sDatabaseName)

print('\n#################################')
print('Time Dimension')
BirthZone = 'Atlantic/Reykjavik'
BirthDateUTC = datetime(1960, 12, 20, 10, 15, 0)
BirthDateZoneUTC = BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr = BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal = BirthDate.strftime("%Y-%m-%d %H:%M:%S")
################################################################
IDTimeNumber = str(uuid.uuid4())
TimeLine = [('TimeID', [IDTimeNumber]),
            ('UTCDate', [BirthDateZoneStr]),
            ('LocalTime', [BirthDateLocal]),
            ('TimeZone', [BirthZone])]
TimeFrame = pd.DataFrame.from_items(TimeLine)
################################################################
DimTime = TimeFrame
DimTimeIndex = DimTime.set_index(['TimeID'], inplace=False)

sTable = 'Dim-Time'
print('\n#################################')
print('Storing :', sDatabaseName, '\n Table:', sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")

print('\n#################################')
print('Dimension Person')
print('\n#################################')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
################################################################
IDPersonNumber = str(uuid.uuid4())
PersonLine = [('PersonID', [IDPersonNumber]),
              ('FirstName', [FirstName]),
              ('LastName', [LastName]),
              ('Zone', ['UTC']),
              ('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame.from_items(PersonLine)
################################################################
DimPerson = PersonFrame
DimPersonIndex = DimPerson.set_index(['PersonID'], inplace=False)
################################################################
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :', sDatabaseName, '\n Table:', sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")

print('\n#################################')
print('Fact-Person-time')
print('\n#################################')
IDFactNumber = str(uuid.uuid4())
PersonTimeLine = [('IDNumber', [IDFactNumber]),
                  ('IDPersonNumber', [IDPersonNumber]),
                  ('IDTimeNumber', [IDTimeNumber])]
PersonTimeFrame = pd.DataFrame.from_items(PersonTimeLine)
################################################################
FctPersonTime = PersonTimeFrame
FctPersonTimeIndex = FctPersonTime.set_index(['IDNumber'], inplace=False)
################################################################
sTable = 'Fact-Person-Time'
print('\n#################################')
print('Storing:', sDatabaseName, '\n Table:', sTable)
print('\n#################################')
FctPersonTimeIndex.to_sql(sTable, conn1, if_exists="replace")
FctPersonTimeIndex.to_sql(sTable, conn2, if_exists="replace")

Save and run Transform-Gunnarsson-Sun-Model.py from its directory.

8.4 TRANSFORMING WITH DATA SCIENCE

8.4.1 Missing Value Treatment:
We must describe the missing value treatment in the transformation. The missing value treatment must be acceptable to the business community.

8.4.2 Why Missing Value Treatment Is Required:
Missing data in the training data set can reduce the power or fit of a model, or can lead to a biased model, because we have not analyzed the behavior and relationship with other variables correctly. It can lead to wrong predictions or classifications. A short, hedged sketch of common treatments appears after Section 8.4.3.

8.4.3 Why Data Has Missing Values:
The following are some common reasons for missing data:
• Data fields were renamed during upgrades.
• Mappings were incomplete during the migration processes from old systems to new systems.
• The wrong table name was provided during loading.
• Data was not available.
• Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR).
• Poor data science. People and projects make mistakes during the data science process.
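The source text names the missing-data problem but not a specific remedy, so the sketch below shows two widely used, assumed treatments on a made-up DataFrame: dropping rows that are missing a critical field, and filling a numeric gap with the column median. Column names and values are hypothetical.

import pandas as pd
import numpy as np

# Hypothetical customer records with gaps.
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Age': [34, np.nan, 29, 41, np.nan],
    'Country': ['IS', 'UK', None, 'DE', 'UK'],
})

# Treatment 1: drop rows where a critical business key is missing.
df = df.dropna(subset=['Country'])

# Treatment 2: impute a numeric column with its median and keep a flag,
# so the business community can see which values were filled in.
df['AgeWasMissing'] = df['Age'].isna()
df['Age'] = df['Age'].fillna(df['Age'].median())

print(df)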
8.5 COMMON FEATURE EXTRACTION TECHNIQUES

The following are common feature extraction techniques that help us to enhance an existing data warehouse by applying data science to the data in the warehouse.

8.5.1 Binning:
The binning technique is used to reduce the complexity of data sets, to enable the data scientist to evaluate the data with an organized grouping technique. Binning is a good way for you to turn continuous data into a data set that has specific features that you can evaluate for patterns. For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals (for example, grouping every five years together).

import numpy

data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
print(bin_means)

# The second option is to use the histogram function.
bin_means2 = (numpy.histogram(data, bins, weights=data)[0] /
              numpy.histogram(data, bins)[0])
print(bin_means2)

8.5.2 Averaging:
The use of averaging enables you to reduce the number of records you require to report any activity that demands a more indicative, rather than a precise, total.

Example: Create a model that enables you to calculate the average position for ten sample points. First, set up the ecosystem.

import numpy as np
import pandas as pd

# Create two series to model the latitude and longitude ranges.
LatitudeData = pd.Series(np.array(range(-90, 91, 1)))
LongitudeData = pd.Series(np.array(range(-180, 181, 1)))

# Select 10 samples from each range:
LatitudeSet = LatitudeData.sample(10)
LongitudeSet = LongitudeData.sample(10)

# Calculate the average of each data set.
LatitudeAverage = np.average(LatitudeSet)
LongitudeAverage = np.average(LongitudeSet)

# See the results.
print('Latitude')
print(LatitudeSet)
print('Latitude (Avg):', LatitudeAverage)
print('##############')
print('Longitude')
print(LongitudeSet)
print('Longitude (Avg):', LongitudeAverage)

The next sections cover a set of common data science terminology and techniques.

8.6 HYPOTHESIS TESTING

Hypothesis testing must be known to any data scientist. You cannot progress until you have thoroughly mastered this technique. Hypothesis testing is a statistical test to check whether a hypothesis is true, based on the available data. Based on the testing, data scientists choose to accept or reject (not accept) the hypothesis. Hypothesis testing is necessary to check whether an event is an important occurrence or just happenstance. When an event occurs, it can be a trend or random.

8.6.1 T-Test:
The t-test is one of many tests used for the purpose of hypothesis testing in statistics. A t-test is a popular statistical test to make inferences about single means, or inferences about two means or variances, to check whether two groups' means are statistically different from each other, where n (sample size) < 30 and the standard deviation is unknown.

The one-sample t-test determines whether the sample mean is statistically different from a known or hypothesised population mean. The one-sample t-test is a parametric test.

H0: The mean age of the given sample is 30.
H1: The mean age of the given sample is not 30.

# pip3 install scipy
# pip3 install numpy
from scipy.stats import ttest_1samp
import numpy as np

ages = np.genfromtxt('ages.csv')
print(ages)
ages_mean = np.mean(ages)
print("Mean age:", ages_mean)
print("Test 1: m=30")
tset, pval = ttest_1samp(ages, 30)
print('p-value:', pval)
if pval < 0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

8.6.2 Chi-Square Test:
A chi-square (χ2) test is used to check whether two categorical variables are significantly associated with each other.

import numpy as np
import pandas as pd
import scipy.stats as stats

np.random.seed(10)
stud_grade = np.random.choice(a=["O", "A", "B", "C", "D"],
                              p=[0.20, 0.20, 0.20, 0.20, 0.20], size=100)
stud_gen = np.random.choice(a=["Male", "Female"], p=[0.5, 0.5], size=100)
mscpart1 = pd.DataFrame({"Grades": stud_grade, "Gender": stud_gen})
print(mscpart1)

stud_tab = pd.crosstab(mscpart1.Grades, mscpart1.Gender, margins=True)
stud_tab.columns = ["Male", "Female", "row_totals"]
stud_tab.index = ["O", "A", "B", "C", "D", "col_totals"]
observed = stud_tab.iloc[0:5, 0:2]
print(observed)

expected = np.outer(stud_tab["row_totals"][0:5],
                    stud_tab.loc["col_totals"][0:2]) / 100
print(expected)

chi_squared_stat = (((observed - expected) ** 2) / expected).sum().sum()
print('Calculated : ', chi_squared_stat)
crit = stats.chi2.ppf(q=0.95, df=4)
print('Table Value : ', crit)
if chi_squared_stat >= crit:
    print('H0 is Rejected ')
else:
    print('H0 is Accepted ')
Figure 8.7Overfitting & Underfitting8.7.1 Polynomial Features:The polynomic formula is the following:(a1x + b1) (a2x + b2) = a1a2x2+ (a1b2+ a2b1) x + b1b2.The polynomial feature extraction can use a chain of polynomicformulas to create a hyperplane that will subdivide any data sets into thecorrect cluster groups. The higherthe polynomic complexity, the moreprecise the result that can be achieved.Example:import numpy as npimport matplotlib.pyplot as pltfrom sklearn.linear_model import Ridgefrom sklearn.preprocessing import PolynomialFeaturesfrom sklearn.pipeline import make_pipelinedef f(x):""" function to approximate by polynomial interpolation"""
8.7 OVERFITTING & UNDERFITTINGOverfitting and Underfitting, these are the major problems facedby the data scientists when they retrieve the data insights from the trainingdata sets which they are using.They refer to the deficiencies that themodel’s performance might suffer from.Overfitting occurs when the model or the algorithm fits the datatoo well.When a model gets trained with so much of data, it startslearning from the noise and inaccurate data entries in our data set.But theproblem then occurred is, the model will not be able to categorize the datacorrectly, and this happens because of too much of details and noise.Underfitting occurs when the model or the algorithmcannotcapture the underlying trendof the data.Intuitively, underfitting occurswhen the model or the algorithm does not fit the data well enough. It isoften a result of an excessively simple model. It destroys the accuracy ofour model.
Figure 8.7Overfitting & Underfitting8.7.1 Polynomial Features:The polynomic formula is the following:(a1x + b1) (a2x + b2) = a1a2x2+ (a1b2+ a2b1) x + b1b2.The polynomial feature extraction can use a chain of polynomicformulas to create a hyperplane that will subdivide any data sets into thecorrect cluster groups. The higherthe polynomic complexity, the moreprecise the result that can be achieved.Example:import numpy as npimport matplotlib.pyplot as pltfrom sklearn.linear_model import Ridgefrom sklearn.preprocessing import PolynomialFeaturesfrom sklearn.pipeline import make_pipelinedef f(x):""" function to approximate by polynomial interpolation"""
8.7 OVERFITTING & UNDERFITTINGOverfitting and Underfitting, these are the major problems facedby the data scientists when they retrieve the data insights from the trainingdata sets which they are using.They refer to the deficiencies that themodel’s performance might suffer from.Overfitting occurs when the model or the algorithm fits the datatoo well.When a model gets trained with so much of data, it startslearning from the noise and inaccurate data entries in our data set.But theproblem then occurred is, the model will not be able to categorize the datacorrectly, and this happens because of too much of details and noise.Underfitting occurs when the model or the algorithmcannotcapture the underlying trendof the data.Intuitively, underfitting occurswhen the model or the algorithm does not fit the data well enough. It isoften a result of an excessively simple model. It destroys the accuracy ofour model.
Figure 8.7Overfitting & Underfitting8.7.1 Polynomial Features:The polynomic formula is the following:(a1x + b1) (a2x + b2) = a1a2x2+ (a1b2+ a2b1) x + b1b2.The polynomial feature extraction can use a chain of polynomicformulas to create a hyperplane that will subdivide any data sets into thecorrect cluster groups. The higherthe polynomic complexity, the moreprecise the result that can be achieved.Example:import numpy as npimport matplotlib.pyplot as pltfrom sklearn.linear_model import Ridgefrom sklearn.preprocessing import PolynomialFeaturesfrom sklearn.pipeline import make_pipelinedef f(x):""" function to approximate by polynomial interpolation"""
munotes.in

Page 116

    return x * np.sin(x)

# generate points used to plot
x_plot = np.linspace(0, 10, 100)

# generate points and keep a subset of them
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)

# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

colors = ['teal', 'yellowgreen', 'gold']
lw = 2
plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw,
         label="Ground Truth")
plt.scatter(x, y, color='navy', s=30, marker='o', label="training points")

for count, degree in enumerate([3, 4, 5]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw,
             label="Degree %d" % degree)

plt.legend(loc='lower left')
plt.show()

8.7.2 Common Data-Fitting Issue:
These higher-order polynomic formulas are, however, more prone to overfitting, while lower-order formulas are more likely to underfit. It is a delicate balance between two extremes that supports good data science.

Example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/-{:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()

8.8 PRECISION-RECALL

Precision-recall is a useful measure for evaluating prediction success when classes are extremely imbalanced. In information retrieval:
• Precision is a measure of result relevancy.
• Recall is a measure of how many truly relevant results are returned.

8.8.1 Precision-Recall Curve:
The precision-recall curve shows the trade-off between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low
false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall will return many results, with all results labelled correctly.

Precision (P) is defined as the number of true positives (Tp) over the number of true positives (Tp) plus the number of false positives (Fp):
P = Tp / (Tp + Fp)

Recall (R) is defined as the number of true positives (Tp) over the number of true positives (Tp) plus the number of false negatives (Fn):
R = Tp / (Tp + Fn)

The true negative rate (TNR) is the rate that indicates the recall of the negative items:
TNR = Tn / (Tn + Fp)

Accuracy (A) is defined as:
A = (Tp + Tn) / (Tp + Tn + Fp + Fn)

8.8.2 Sensitivity & Specificity:
Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function. Sensitivity (also called the true positive rate, the recall, or probability of detection) measures the proportion of positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition). Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
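As a hedged illustration of these measures, the sketch below computes precision, recall, sensitivity, and specificity from a small set of made-up true and predicted labels using scikit-learn; the label vectors are assumptions chosen only to exercise the formulas above.

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Hypothetical binary ground truth and model predictions (1 = positive class).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)          # Tp / (Tp + Fp)
recall = recall_score(y_true, y_pred)                # Tp / (Tp + Fn), i.e. sensitivity
specificity = tn / (tn + fp)                         # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = f1_score(y_true, y_pred)                        # harmonic mean of precision and recall

print('Precision  :', round(precision, 3))
print('Recall     :', round(recall, 3))
print('Specificity:', round(specificity, 3))
print('Accuracy   :', round(accuracy, 3))
print('F1-score   :', round(f1, 3))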
Page 119

8.8.3 F1-Measure:The F1-score is a measure that combines precision and recall in theharmonic mean of precision and recall.Note: The precision may not decrease with recall.The following sklearn functions are useful when calculating thesemeasures:• sklearn.metrics.average_precision_score• sklearn.metrics.recall_score• sklearn.metrics.precision_score• sklearn.metrics.f1_score8.8.4 Receiver Operating Characteristic (ROC) Analysis Curves:A receiver operatingcharacteristic (ROC) analysis curve is agraphical plot that illustrates the diagnostic ability of a binary classifiersystem as its discrimination threshold is varied. The ROC curve plots thetruepositive rate (TPR) against the false positive rate (FPR)at variousthreshold settings. The true positive rate is also known as sensitivity,recall, or probability of detection.You will find the ROC analysis curves useful for evaluatingwhether your classification or feature engineering is good enough todetermine the value of the insights you are finding. This helps withrepeatable results against a real-world data set. So, if you suggest that yourcustomers should take as pecific action as a result of your findings, ROCanalysis curves will support your advice and insights but also relay thequality of the insights at given parameters.8.9 CROSS-VALIDATION TESTCross-validation is a model validation technique for evaluatinghow the results of a statistical analysis will generalize to an independentdata set.It is mostly used in settings where the goal is the prediction.Knowing how to calculate a test such as this enables you to validate theapplication of your model on real-world, i.e., independent data sets.Example:import numpy as npfrom sklearn.model_selection import cross_val_scorefrom sklearn import datasets, svmimport matplotlib.pyplot as plt
munotes.in

Page 120


digits = datasets.load_digits()X = digits.datay = digits.targetLet’s pick three different kernels and compare how they will perform.kernels=['linear', 'poly', 'rbf']for kernel in kernels:svc = svm.SVC(kernel=kernel)C_s = np.logspace(-15, 0, 15)scores = list()scores_std = list()for C in C_s:svc.C = Cthis_scores = cross_val_score(svc, X, y, n_jobs=1)scores.append(np.mean(this_scores))scores_std.append(np.std(this_scores))You must plot your results.Title="Kernel:>" + kernelfig=plt.figure(1, figsize=(4.2, 6))plt.clf()fig.suptitle(Title, fontsize=20)plt.semilogx(C_s, scores)plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')plt.semilogx(C_s, np.array(scores)-np.array(scores_std), 'b--')locs, labels = plt.yticks()plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))plt.ylabel('Cross-Validation Score')plt.xlabel('Parameter C')plt.ylim(0, 1.1)plt.show()Well done. You can now perform cross-validation of your results.8.10 UNIVARIATE ANALYSISThis type of data consists of only one variable. The analysis ofunivariate data is thus the simplest form of analysis since the informationdeals with only one quantity that changes. It does not deal with causes orrelationships and the main purpose of the analysis is to describe the datamunotes.in

Page 121


and find patterns that exist within it. An example of univariate data is height.

Table 8.1

Suppose that the heights of seven students of a class are recorded (as in the table above); there is only one variable, height, and it does not deal with any cause or relationship. The description of patterns found in this type of data can be made by drawing conclusions using central tendency measures (mean, median, and mode), dispersion or spread of data (range, minimum, maximum, quartiles, variance, and standard deviation), and by using frequency distribution tables, histograms, pie charts, frequency polygons, and bar charts.

8.11 BIVARIATE ANALYSIS

This type of data involves two different variables. The analysis of this type of data deals with causes and relationships, and the analysis is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.
Table 8.2

Suppose temperature and ice cream sales are the two variables of a bivariate data set (as in the table above). Here, the relationship is visible from the table: temperature and sales are directly proportional to each other, and thus related, because as the temperature increases, the sales also increase. Thus, bivariate data analysis involves comparisons, relationships, causes, and explanations. These variables are often plotted on the X and Y axes of a graph for a better understanding of the data, and one of these variables is independent while the other is dependent.
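A minimal sketch of this kind of bivariate analysis follows, assuming a small made-up temperature and ice cream sales data set; the numbers are invented purely to show how the strength of the relationship can be quantified with a correlation coefficient.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical bivariate data: daily temperature (°C) and ice cream sales (units).
data = pd.DataFrame({
    'Temperature': [20, 22, 25, 27, 30, 32, 35],
    'Sales':       [110, 125, 150, 170, 210, 230, 265],
})

# A Pearson correlation close to +1 indicates a strong direct relationship.
corr = data['Temperature'].corr(data['Sales'])
print('Correlation between temperature and sales:', round(corr, 3))

# Scatter plot of the two variables (independent on X, dependent on Y).
data.plot.scatter(x='Temperature', y='Sales', title='Temperature vs ice cream sales')
plt.show()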


8.12 MULTIVARIATE ANALYSIS

When the data involves three or more variables, it is categorized as multivariate. An example of this type of data: suppose an advertiser wants to compare the popularity of four advertisements on a website; their click rates could be measured for both men and women, and relationships between variables can then be examined. It is similar to bivariate analysis but contains more than one dependent variable. The way to perform analysis on this data depends on the goals to be achieved. Some of the techniques are regression analysis, path analysis, factor analysis, and multivariate analysis of variance (MANOVA).

8.13 LINEAR REGRESSION

Linear regression is a statistical modelling technique that endeavours to model the relationship between an explanatory variable and a dependent variable by fitting the observed data points on a linear equation, for example, modelling the body mass index (BMI) of individuals by using their weight. A short sketch of fitting such a model appears after the list below.

Linear regression is often used in business, government, and other scenarios. Some common practical applications of linear regression in the real world include the following:

• Real estate: A simple linear regression analysis can be used to model residential home prices as a function of the home's living area. Such a model helps set or evaluate the list price of a home on the market. The model could be further improved by including other input variables such as the number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.
• Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services. For example, restaurant chains can appropriately prepare for the predicted type and quantity of food that customers will consume based upon the weather, the day of the week, whether an item is offered as a special, the time of day, and the reservation volume. Similar models can be built to predict retail sales, emergency room visits, and ambulance dispatches.
• Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumour sizes. Input variables might include the duration of a single radiation treatment, the frequency of radiation treatment, and patient attributes such as age or weight.
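Ahead of the detailed treatment in Section 8.13.1, here is a minimal, hedged sketch of fitting a simple linear regression with scikit-learn; the weight and BMI values are invented for illustration, and the single-feature model mirrors the Y = a + bX form discussed below.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: weight (kg) as the explanatory variable, BMI as the response.
weight = np.array([[55], [62], [70], [78], [85], [92], [100]])
bmi = np.array([20.1, 21.5, 23.0, 24.8, 26.3, 27.9, 30.2])

model = LinearRegression()
model.fit(weight, bmi)

print('Intercept (a):', round(model.intercept_, 3))
print('Slope (b)    :', round(model.coef_[0], 3))

# Predict BMI for a new observation.
print('Predicted BMI at 75 kg:', round(model.predict([[75]])[0], 2))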


8.13.1 Simple Linear Regression:
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables. A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.

A linear regression line has an equation of the form (without error):
Y = a + bX
where
X = explanatory variable
Y = dependent variable
b = slope of the line
a = intercept (the value of Y when X = 0)

A linear regression model can be expressed as follows (with error):
Y = a + bX + e, where e is the error (residual) term.
Figure 8.8

8.13.2 RANSAC Linear Regression:
RANSAC is an acronym for RANdom SAmple Consensus. What this algorithm does is fit a regression model on a subset of the data that the algorithm judges as inliers, while removing outliers. This naturally improves the fit of the model due to the removal of some data points. An

8.13.1 Simple Linear Regression:Linear regression attempts to model the relationship between twovariables byfitting a linear equation to observed data. One variable isconsidered to be an explanatory variable, and the other is considered to bea dependent variable. For example, a modeler might want to relate theweights of individuals to their heights using a linear regression model.Before attempting to fit a linear model to observed data, a modelershould first determine whether or not there is a relationship between thevariables of interest. This does not necessarily imply that one variablecauses the other(for example, higher SAT scores do not cause highercollege grades), but that there is some significant association between thetwo variables. A scatterplot can be a helpful tool in determining thestrength of the relationship between two variables. If there appears to beno association between the proposed explanatory and dependent variables(i.e., the scatterplot does not indicate any increasing or decreasing trends),then fitting a linear regression model to the data probably will not providea useful model. A valuable numerical measure of association between twovariables is the correlation coefficient, which is a value between-1 and 1indicating the strength of the association of the observed data for the twovariables.A linear regression line has an equation of the form (without error):Y = a + bX,Where, X = explanatory variableY = dependent variableb = slope of the linea = intercept (the value of y when x = 0)A linear regression model can be expressed as follows (with error):
Figure 8.88.13.2 RANSAC Linear Regression:RANSAC is an acronym for Random Sample Consensus. Whatthis algorithm does is fit a regression model on a subset of data that thealgorithm judges as inliers while removing outliers. This naturallyimproves the fit of themodel due to the removal of some data points. An

8.13.1 Simple Linear Regression:Linear regression attempts to model the relationship between twovariables byfitting a linear equation to observed data. One variable isconsidered to be an explanatory variable, and the other is considered to bea dependent variable. For example, a modeler might want to relate theweights of individuals to their heights using a linear regression model.Before attempting to fit a linear model to observed data, a modelershould first determine whether or not there is a relationship between thevariables of interest. This does not necessarily imply that one variablecauses the other(for example, higher SAT scores do not cause highercollege grades), but that there is some significant association between thetwo variables. A scatterplot can be a helpful tool in determining thestrength of the relationship between two variables. If there appears to beno association between the proposed explanatory and dependent variables(i.e., the scatterplot does not indicate any increasing or decreasing trends),then fitting a linear regression model to the data probably will not providea useful model. A valuable numerical measure of association between twovariables is the correlation coefficient, which is a value between-1 and 1indicating the strength of the association of the observed data for the twovariables.A linear regression line has an equation of the form (without error):Y = a + bX,Where, X = explanatory variableY = dependent variableb = slope of the linea = intercept (the value of y when x = 0)A linear regression model can be expressed as follows (with error):
Figure 8.88.13.2 RANSAC Linear Regression:RANSAC is an acronym for Random Sample Consensus. Whatthis algorithm does is fit a regression model on a subset of data that thealgorithm judges as inliers while removing outliers. This naturallyimproves the fit of themodel due to the removal of some data points. Anmunotes.in

Page 124


advantage of RANSAC is its ability to do robust estimation of the modelparameters, i.e., it can estimate the parameters with a high degree ofaccuracy, even when a significant number of outliers are present in thedata set. The process will find a solution, because it is so robust.The process that is used to determine inliers and outliers is describedbelow.1.The algorithm randomly selects a random number of samples to beinliers in the model.2.Alldata is used to fit the model and samples that fall with a certaintolerance are relabelled as inliers.3.Model is refitted with the new inliers.4.Error of the fitted model vs the inliers is calculated.5.Terminate or go back to step 1 if a certaincriterion of iterations orperformance is not met.8.13.3 Hough Transform:The Hough transform is a feature extraction technique used inimage analysis, computer vision, and digital image processing. Thepurpose of the technique is to find imperfect instances of objects within acertain class of shapes, by a voting procedure. This voting procedure iscarried out in a parameter space, from which object candidates areobtained as local maxima in a so-called accumulator space that isexplicitly constructed by the algorithm for computing the Houghtransform.With the help of the Hough transformation, this regressionimproves the resolution of the RANSAC technique, which is extremelyuseful when using robotics and robot vision in which the robot requires theregression of the changes between two data frames or data sets to movethrough an environment.8.14 LOGISTIC REGRESSIONIn linear regression modelling, the outcome variable is acontinuous variable. When the outcome variable is categorical in nature,logistic regression can be used to predict the likelihood of an outcomebased on the input variables. Although logistic regression can be appliedto an outcome variable that represents multiple values, but we willexamine the case in which the outcome variablerepresents two valuessuch as true/false, pass/fail, or yes/no.For example, a logistic regression model can be built to determineif a person will or will not purchase a new automobile in the next 12months. The training set could include input variables for a person's age,munotes.in

Page 125


income, and gender as well as the age of an existing automobile. Thetraining set would also include the outcome variable on whether theperson purchased a new automobile over a 12-month period. The logisticregression model providesthe likelihood or probability of a person makinga purchase in the next 12 months.The logistic regression model is applied to a variety of situations in boththe public and the private sector. Some common ways that the logisticregression model is usedinclude the following:•Medical:Develop a model to determine the likelihood of a patient'ssuccessful response to a specific medical treatment or procedure. Inputvariables could include age, weight, blood pressure, and cholesterollevels.•Finance:Using a loan applicant's credit history and the details on theloan, determine the probability that an applicant will default on theloan. Based on the prediction, the loan can be approved or denied, orthe terms can be modified.•Marketing:Determinea wireless customer's probability of switchingcarriers (known as churning) based on age, number of family memberson the plan, months remaining on the existing contract, and socialnetwork contacts. With such insight, target the high-probabilitycustomers with appropriate offers to prevent churn.•Engineering:Based on operating conditions and various diagnosticmeasurements, determine the probability of a mechanical partexperiencing a malfunction or failure. With this, probability estimate,schedulethe appropriate preventive maintenance activity.8.14.1 Simple Logistic Regression:Simple logistic regression can be used when you have one nominalvariable with two values (male/female, dead/alive, etc.) and onemeasurement variable. The nominal variable is the dependent variable,and the measurement variable is the independent variable. LogisticRegression, also known as Logit Regression or Logit Model. LogisticRegression works with binary data, where either the event happens (1) orthe event does not happen (0).In linear regression modelling, the outcome variable is acontinuous variable. When the outcome variable is categorical in nature,logistic regression can be used to predict the likelihood of an outcomebased on the input variables. Althoughlogistic regression can be appliedto an outcome variable that represents multiple values, but we willexamine the case in which the outcome variable represents two valuessuch as true/false, pass/fail, or yes/no.munotes.in
Simple logistic regression is analogous to linear regression, except that the dependent variable is nominal rather than a measurement. One goal is to see whether the probability of getting a particular value of the nominal variable is associated with the measurement variable; the other goal is to predict the probability of getting a particular value of the nominal variable, given the measurement variable.

Logistic regression is based on the logistic function f(y), as given in the equation below:

f(y) = e^y / (1 + e^y)

which always lies between 0 and 1 and can therefore be interpreted as a probability.
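As a rough sketch of how this looks in practice, the fragment below fits a binary logistic regression with scikit-learn on made-up data shaped like the automobile example above (age, income, and age of the current car); the feature values and the pipeline choices are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented training data: [age, income, age_of_current_car]
X = np.array([[25, 30000, 2], [40, 52000, 9], [35, 61000, 7],
              [52, 45000, 3], [46, 80000, 10], [23, 28000, 1],
              [58, 75000, 12], [31, 39000, 4]])
# 1 = bought a new automobile within 12 months, 0 = did not.
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Scale the features, then fit the logit model.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# predict_proba returns [P(y=0), P(y=1)]; the second column is the
# likelihood of a purchase in the next 12 months.
new_customer = np.array([[45, 55000, 8]])
print("purchase probability:", model.predict_proba(new_customer)[0, 1])
print("predicted class     :", model.predict(new_customer)[0])
```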
8.14.2 Multinomial Logistic Regression:
Multinomial logistic regression (often just called 'multinomial regression') is used to predict a nominal dependent variable given one or more independent variables. It is sometimes considered an extension of binomial logistic regression that allows for a dependent variable with more than two categories. As with other types of regression, multinomial logistic regression can have nominal and/or continuous independent variables and can include interactions between independent variables to predict the dependent variable. Multinomial logistic regression is the regression analysis to conduct when the dependent variable is nominal with more than two levels.

For example, you could use multinomial logistic regression to understand which type of drink consumers prefer based on location in the UK and age (i.e., the dependent variable would be "type of drink", with four categories – Coffee, Soft Drink, Tea and Water – and your independent variables would be the nominal variable "location in UK", assessed using three categories – London, South UK and North UK – and the continuous variable "age", measured in years). Alternately, you could use multinomial logistic regression to understand whether factors such as employment duration within the firm, total employment duration, qualifications and gender affect a person's job position (i.e., the dependent variable would be "job position", with three categories – junior management, middle management and senior management – and the independent variables would be the continuous variables "employment duration within the firm" and "total employment duration", both measured in years, the nominal variable "qualifications", with four categories – no degree, undergraduate degree, master's degree and PhD – and "gender", which has two categories: "males" and "females").

8.14.3 Ordinal Logistic Regression:
Ordinal logistic regression (often just called 'ordinal regression') is used to predict an ordinal dependent variable given one or more independent variables. It can be considered either a generalisation of multiple linear regression or a generalisation of binomial logistic regression, but this guide will concentrate on the latter. As with other types of regression, ordinal regression can also use interactions between independent variables to predict the dependent variable.

For example, you could use ordinal regression to predict the belief that "tax is too high" (your ordinal dependent variable, measured on a 4-point Likert item from "Strongly Disagree" to "Strongly Agree"), based on two independent variables: "age" and "income". Alternately, you could use ordinal regression to determine whether a number of independent variables, such as "age", "gender" and "level of physical activity" (amongst others), predict the ordinal dependent variable "obesity", where obesity is measured using three ordered categories: "normal", "overweight" and "obese".

8.15 CLUSTERING TECHNIQUES

In general, clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structure within unlabelled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. The structure of the data describes the objects of interest and determines how best to group the objects. Clustering is a method often used for exploratory analysis of the data. In clustering, no predictions are made. Rather, clustering methods find the similarities between objects according to the object attributes and group the similar objects into clusters. Clustering techniques are utilized in marketing, economics, and various branches of science.

Clustering is often used as a lead-in to classification. Once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Some specific applications of clustering are image processing, medicine, and customer segmentation.
• Image Processing: Video is one example of the growing volumes of unstructured data being collected. Within each frame of a video, k-means analysis can be used to identify objects in the video. For each frame, the task is to determine which pixels are most similar to each other. The attributes of each pixel can include brightness, color, and location, the x and y coordinates in the frame. With security video images, for example, successive frames are examined to identify any changes to the clusters. These newly identified clusters may indicate unauthorized access to a facility.
• Medical: Patient attributes such as age, height, weight, systolic and diastolic blood pressure, cholesterol level, and other attributes can identify naturally occurring clusters. These clusters could be used to target individuals for specific preventive measures or clinical trial participation. Clustering, in general, is useful in biology for the classification of plants and animals, as well as in the field of human genetics.
• Customer Segmentation: Marketing and sales groups use k-means to better identify customers who have similar behaviours and spending patterns. For example, a wireless provider may look at the following customer attributes: monthly bill, number of text messages, data volume consumed, minutes used during various daily periods, and years as a customer. The wireless company could then look at the naturally occurring clusters and consider tactics to increase sales or reduce the customer churn rate, the proportion of customers who end their relationship with a particular company.

8.15.1 Hierarchical Clustering:
Hierarchical clustering is a method of cluster analysis whereby you build a hierarchy of clusters. This works well for data sets that are complex and have distinct characteristics for separated clusters of data. Hierarchical cluster analysis, or HCA, is an unsupervised clustering algorithm which involves creating clusters that have a predominant ordering from top to bottom.

For example: all files and folders on our hard disk are organized in a hierarchy.

The algorithm groups similar objects into groups called clusters. The endpoint is a set of clusters or groups, where each cluster is distinct from every other cluster, and the objects within each cluster are broadly similar to each other.

This clustering technique is divided into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering

Agglomerative Hierarchical Clustering:
Agglomerative hierarchical clustering is the most common type of hierarchical clustering, used to group objects in clusters based on their similarity. It is also known as AGNES (Agglomerative Nesting). It is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
How does it work?
1. Make each data point a single-point cluster → this forms N clusters.
2. Take the two closest data points and make them one cluster → this forms N-1 clusters.
3. Take the two closest clusters and make them one cluster → this forms N-2 clusters.
4. Repeat step 3 until you are left with only one cluster.

Divisive Hierarchical Clustering:
Divisive clustering, or DIANA (DIvisive ANAlysis clustering), is a top-down clustering method in which we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation. This clustering approach is therefore exactly the opposite of agglomerative clustering.

Figure 8.9 Agglomerative and Divisive
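A minimal sketch of the bottom-up merging described above, using scikit-learn's AgglomerativeClustering on a small invented 2-D data set (the points and the choice of two clusters are illustrative assumptions, not from the text):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two loose groups of 2-D points, invented for illustration.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],      # group near (1, 1)
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])     # group near (5, 5)

# Ward linkage merges, at every step, the pair of clusters whose union gives
# the smallest increase in within-cluster variance (steps 2-4 above).
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)

print("cluster label per point:", labels)   # e.g. [0 0 0 1 1 1]
```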
8.15.2 Partitional Clustering:
Partitional clustering divides the set of data objects into non-overlapping subsets (clusters) such that each data object belongs to exactly one subset, rather than to a nested hierarchy of clusters.
Many partitional clustering algorithms try to minimize an objective function. For example, in K-means and K-medoids the function (also referred to as the distortion function) is the sum of the squared distances between each data point and the centre of the cluster to which it is assigned:

D = Σ (over clusters j = 1..k) Σ (over points x in cluster Cj) ||x − cj||²

where k is the number of clusters, Cj is the j-th cluster, and cj is its centre.
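As an illustrative sketch (the data and the choice of k = 2 are invented), scikit-learn's KMeans minimizes exactly this distortion; the inertia_ attribute it reports is the value of D for the fitted clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D points forming two rough groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.3, 7.7], [7.9, 8.4]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# inertia_ is the distortion D: the sum of squared distances from each
# point to its assigned cluster centre.
print("labels    :", km.labels_)
print("centres   :", km.cluster_centers_)
print("distortion:", km.inertia_)

# The same quantity recomputed by hand from the definition of D.
d = sum(np.sum((x - km.cluster_centers_[l]) ** 2)
        for x, l in zip(X, km.labels_))
print("recomputed:", d)
```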
8.16 ANOVA

The ANOVA test is the initial step in analysing factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the methodical factors that measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test results in an f-test to generate additional data that aligns with the proposed regression models. The ANOVA test allows a comparison of more than two groups at the same time, to determine whether a relationship exists between them.

Example: A BOGOF (buy-one-get-one-free) campaign is executed on 5 groups of 100 customers each. Each group is different in terms of its demographic attributes. We would like to determine whether these five groups respond differently to the campaign. This would help us optimize the right campaign for the right demographic group, increase the response rate, and reduce the cost of the campaign.

The analysis of variance works by comparing the variance between the groups to the variance within the groups. The core of this technique lies in assessing whether all the groups are in fact part of one larger population or of completely different populations with different characteristics.

The formula for ANOVA is:

F = MST / MSE

where F is the ANOVA coefficient (the F statistic), MST is the mean sum of squares due to treatment (the between-group variance), and MSE is the mean sum of squares due to error (the within-group variance).

There are two types of ANOVA: one-way (or unidirectional) and two-way. One-way or two-way refers to the number of independent variables in your analysis of variance test. A one-way ANOVA evaluates the impact of a sole factor on a sole response variable. It determines whether all the samples are the same. The one-way ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

A two-way ANOVA is an extension of the one-way ANOVA. With a one-way ANOVA, you have one independent variable affecting a dependent variable. With a two-way ANOVA, there are two independent variables. For example, a two-way ANOVA allows a company to compare worker productivity based on two independent variables, such as salary and skill set. It is utilized to observe the interaction between the two factors and tests the effect of two factors at the same time.
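A small sketch of a one-way ANOVA on the campaign example, using SciPy; the five response samples below are invented purely to show the call:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Invented response values for five demographic groups of 100 customers each.
groups = [rng.normal(loc=m, scale=5.0, size=100)
          for m in (20, 21, 19, 25, 20)]     # group means, made up

f_stat, p_value = f_oneway(*groups)          # one-way ANOVA: F = MST / MSE
print("F statistic:", round(f_stat, 2))
print("p-value    :", round(p_value, 4))

# A small p-value suggests at least one group responds differently to the
# campaign; a large one suggests the groups behave alike.
```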
8.17 DECISION TREES

A decision tree (also called a prediction tree) uses a tree structure to specify sequences of decisions and consequences. Given input X = {x1, x2, ..., xn}, the goal is to predict a response or output variable Y. Each member of the set {x1, x2, ..., xn} is called an input variable. The prediction can be achieved by constructing a decision tree with test points and branches. At each test point, a decision is made to pick a specific branch and traverse down the tree. Eventually, a final point is reached, and a prediction can be made. Due to its flexibility and easy visualization, decision trees are commonly deployed in data mining applications for classification purposes.

The input values of a decision tree can be categorical or continuous. A decision tree employs a structure of test points (called nodes) and branches, which represent the decision being made. A node without further branches is called a leaf node. The leaf nodes return class labels and, in some implementations, they return probability scores. A decision tree can be converted into a set of decision rules. In the following example rule, income and mortgage_amount are input variables, and the response is the output variable default, with a probability score.

IF income < 50,000 AND mortgage_amount > 100K
THEN default = True WITH PROBABILITY 75%

Decision trees have two varieties: classification trees and regression trees. Classification trees usually apply to output variables that are categorical, often binary, in nature, such as yes or no, purchase or not purchase, and so on. Regression trees, on the other hand, can apply to output variables that are numeric or continuous, such as the predicted price of a consumer good or the likelihood that a subscription will be purchased.
Example:

Figure 8.10 Decision Tree

The above figure shows an example of using a decision tree to predict whether customers will buy a product. The term branch refers to the outcome of a decision and is visualized as a line connecting two nodes. If a decision is numerical, the "greater than" branch is usually placed on the right, and the "less than" branch is placed on the left. Depending on the nature of the variable, one of the branches may need to include an "equal to" component.

Internal nodes are the decision or test points. Each internal node refers to an input variable or an attribute. The top internal node is called the root. The decision tree in the above figure is a binary tree, in that each internal node has no more than two branches. The branching of a node is referred to as a split.

The depth of a node is the minimum number of steps required to reach the node from the root. In the above figure, for example, the nodes Income and Age have a depth of one, and the four nodes on the bottom of the tree have a depth of two. Leaf nodes are at the end of the last branches of the tree. They represent the class labels, the outcome of all the prior decisions. The path from the root to a leaf node contains a series of decisions made at the various internal nodes.

The decision tree in the above figure shows that females with income less than or equal to $45,000 and males 40 years old or younger are classified as people who would purchase the product. In traversing this tree, age does not matter for females, and income does not matter for males.
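The sketch below trains a small scikit-learn classification tree on fabricated purchase records with the same three inputs (gender, income, age) and prints the splits it learns. The data is invented and will not reproduce the exact tree in the figure; it only demonstrates the mechanics of fitting a tree and reading its rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Fabricated training records: [is_female, income, age]
X = np.array([[1, 30000, 25], [1, 40000, 50], [1, 60000, 30], [1, 70000, 45],
              [0, 35000, 30], [0, 90000, 38], [0, 40000, 55], [0, 80000, 60]])
# 1 = bought the product, 0 = did not.
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# export_text prints the split condition at each internal node,
# i.e. the IF ... THEN rules the tree has learned from this data.
print(export_text(tree, feature_names=["is_female", "income", "age"]))
```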
Where are decision trees used?
• Decision trees are widely used in practice.
• To classify animals, questions (such as cold-blooded or warm-blooded, mammal or not mammal) are answered to arrive at a certain classification.
• A checklist of symptoms during a doctor's evaluation of a patient.
• The artificial intelligence engine of a video game commonly uses decision trees to control the autonomous actions of a character in response to various scenarios.
• Retailers can use decision trees to segment customers or predict response rates to marketing and promotions.
• Financial institutions can use decision trees to help decide whether a loan application should be approved or denied. In the case of loan approval, computers can use the logical if-then statements to predict whether the customer will default on the loan.

SUMMARY

The Transform superstep allows us to take data from the data vault and formulate answers to questions raised by the investigations. The transformation step is the data science process that converts results into insights.

UNIT END QUESTIONS

1. Explain the Transform superstep.
2. Explain the sun model for TPOLE.
3. Explain the Person-to-Time sun model.
4. Explain the Person-to-Object sun model.
5. Why does data have missing values? Why do missing values need treatment? What methods treat missing values?
6. What is feature engineering? What are the common feature extraction techniques?
7. What is binning? Explain with an example.
8. Explain averaging and Latent Dirichlet Allocation with respect to the Transform step of data science.
9. Explain hypothesis testing, the t-test and the chi-square test with respect to data science.
10. Explain overfitting and underfitting. Discuss the common fitting issues.

9
TRANSFORM SUPERSTEP

Unit Structure
9.0 Objectives
9.1 Introduction
9.2 Overview
9.3 Dimension Consolidation
9.4 The SUN Model
9.5 Transforming with Data Science
9.5.1 Missing Value Treatment
9.5.2 Techniques of Outlier Detection and Treatment
9.6 Hypothesis Testing
9.7 Chi-Square Test
9.8 Univariate Analysis
9.9 Bivariate Analysis
9.10 Multivariate Analysis
9.11 Linear Regression
9.12 Logistic Regression
9.13 Clustering Techniques
9.14 ANOVA
9.15 Principal Component Analysis (PCA)
9.16 Decision Trees
9.17 Support Vector Machines
9.18 Networks, Clusters, and Grids
9.19 Data Mining
9.20 Pattern Recognition
9.21 Machine Learning
9.22 Bagging Data
9.23 Random Forests
9.24 Computer Vision (CV)
9.25 Natural Language Processing (NLP)
9.26 Neural Networks
9.27 TensorFlow

9.0 OBJECTIVES

The objective of this chapter is to learn data transformation, which brings data to knowledge and converts results into insights.

9.1 INTRODUCTION

The Transform superstep allows us to take data from the data vault and formulate answers to questions. The transformation step is the data science process that converts results into meaningful insights.

9.2 OVERVIEW

To explain this, the scenario below is used. Data is categorised into 5 different dimensions:
1. Time
2. Person
3. Object
4. Location
5. Event

9.3 DIMENSION CONSOLIDATION

The data vault consists of five categories of data, with linked relationships and additional characteristics in satellite hubs.

Figure 9.1

9.4 THE SUN MODEL

The use of sun models is a technique that enables the data scientist to perform consistent dimension consolidation by explaining the intended data relationship with the business, without exposing it to the technical details required to complete the transformation processing.

The sun model is constructed to show all the characteristics from the two data vault hub categories you are planning to extract. It explains how you will create two dimensions and a fact via the Transform step.

9.5 TRANSFORMING WITH DATA SCIENCE

9.5.1 Missing Value Treatment:
You must describe in detail what the missing value treatments are for the data lake transformation. Make sure you take your business community with you along the journey. At the end of the process, they must trust your techniques and results. If they trust the process, they will implement the business decisions that you, as a data scientist, aspire to achieve.

Why is missing value treatment required?
Explain, with notes on the data traceability matrix, why there is missing data in the data lake. Remember: every inconsistency in the data lake is conceivably the missing insight your customer is seeking from you as a data scientist. So, find them and explain them. Your customer will exploit them for business value.

Why does data have missing values?
The 5 Whys is the technique that helps you to get to the root cause of your analysis. The use of cause-and-effect fishbone diagrams will assist you in resolving those questions. I have found the following common reasons for missing data:
• Data fields renamed during upgrades
• Migration processes from old systems to new systems where mappings were incomplete
• Incorrect tables supplied in loading specifications by the subject-matter expert
• Data simply not recorded, as it was not available
• Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR), resulting in a not-to-process tag on the data entry

• Someone else's "bad" data science. People and projects make mistakes, and you will have to fix their errors in your own data science.

9.6 HYPOTHESIS TESTING

Hypothesis testing is not precisely an algorithm, but it is a must-know for any data scientist. You cannot progress until you have thoroughly mastered this technique. Hypothesis testing is the process by which statistical tests are used to check whether a hypothesis is true, by using data. Based on hypothesis testing, data scientists choose to accept or reject the hypothesis. When an event occurs, it can be a trend or it can happen by chance. To check whether the event is an important occurrence or just happenstance, hypothesis testing is necessary.

There are many tests for hypothesis testing, but the following two are the most popular: the t-test and the chi-square test.

9.7 CHI-SQUARE TEST

There are two types of chi-square tests. Both use the chi-square statistic and distribution, but for different purposes:

A chi-square goodness-of-fit test determines whether sample data matches a population. For more details on this type, see: Goodness of Fit Test.

A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from each other. A very small chi-square test statistic means that your observed data fits your expected data extremely well; in other words, the observed and expected values agree. A very large chi-square test statistic means that the observed data does not fit the expected data well; in other words, they disagree.
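A short sketch of a chi-square test for independence with SciPy; the 2x2 contingency table below (gender versus product purchased) is invented solely to show the call:

```python
from scipy.stats import chi2_contingency

# Invented contingency table: rows = gender, columns = bought / did not buy.
observed = [[30, 70],    # female
            [45, 55]]    # male

chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi-square statistic:", round(chi2, 3))
print("degrees of freedom  :", dof)
print("p-value             :", round(p_value, 4))
print("expected counts     :", expected)

# A large statistic (small p-value) is evidence that the two categorical
# variables are related; a small one is not.
```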

9.8 UNIVARIATE ANALYSIS

Univariate analysis is the simplest form of analysing data. "Uni" means "one"; in other words, your data has only one variable. It doesn't deal with causes or relationships (unlike regression) and its major purpose is to describe: it takes data, summarizes that data and finds patterns in the data.

Univariate analysis is used to identify those individual metabolites which, either singly or multiplexed, are capable of differentiating between biological groups, such as separating tumour-bearing mice from non-tumour-bearing (control) mice. Statistical procedures used for this analysis include the t-test, ANOVA, the Mann–Whitney U test, the Wilcoxon signed-rank test, and logistic regression. These tests are used to individually or globally screen the measured metabolites for an association with a disease.

9.9 BIVARIATE ANALYSIS

Bivariate analysis is when two variables are analysed together for any possible association or empirical relationship, for example, the correlation between gender and graduation with a data science degree. Canonical correlation in the experimental context is to take two sets of variables and see what is common between the two sets. Graphs that are appropriate for bivariate analysis depend on the type of variable. For two continuous variables, a scatterplot is a common graph. When one variable is categorical and the other continuous, a box plot is common, and when both are categorical, a mosaic plot is common.

9.10 MULTIVARIATE ANALYSIS

A single metabolite biomarker is generally insufficient to differentiate between groups. For this reason, a multivariate analysis, which identifies sets of metabolites (e.g., patterns or clusters) in the data, can result in a higher likelihood of group separation. Statistical methods for this analysis include unsupervised methods, such as principal component analysis (PCA) or cluster analysis, and supervised methods, such as latent Dirichlet allocation (LDA), partial least squares (PLS), PLS Discriminant Analysis (PLS-DA), artificial neural networks (ANN), and machine learning methods. These methods provide an overview of a large dataset that is useful for identifying patterns and clusters in the data and for expressing the data so as to visually highlight similarities and differences. Unsupervised methods may reduce potential bias, since the classes are unlabelled.

Regardless of one's choice of method for statistical analysis, it is necessary to subsequently validate the identified potential biomarkers and therapeutic targets by examining them in new and separate sample sets (for biomarkers), and in in vitro and/or in vivo experiments evaluating the identified pathways or molecules (for therapeutic targets).

9.11 LINEAR REGRESSION

Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. A key assumption is that the relationship between an input variable and the outcome variable is linear. Although this assumption may appear restrictive, it is often possible to properly transform the input or outcome variables to achieve a linear relationship between the modified input and outcome variables.

A linear regression model is a probabilistic one that accounts for the randomness that can affect any particular outcome. Based on known input values, a linear regression model provides the expected value of the outcome variable, but some uncertainty may remain in predicting any particular outcome.

Regression analysis is useful for answering the following kinds of questions:
• What is a person's expected income?
• What is the probability that an applicant will default on a loan?

Linear regression is a useful tool for answering the first question, and logistic regression is a popular method for addressing the second.

9.12 LOGISTIC REGRESSION

Logistic regression is another technique for converting a binary classification (dichotomous) problem into a linear regression problem. Logistic regression is a predictive analysis technique. It can be difficult to interpret, so there are tools available for it. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

It can answer complex but dichotomous questions such as:
• The probability of attending college (YES or NO), given that the syllabus is finished and the faculty is interesting but boring at times, depending on the mood and behaviour of the students in class.
• The probability of finishing the lunch sent by my mother (YES or NO), which depends on multiple aspects: a) mood, b) better options available, c) food taste, d) scolding by mom, e) a friend's open treat, and so forth.

Hence, logistic regression predicts the probability of an outcome that can only have two values (YES or NO).

9.13 CLUSTERING TECHNIQUES

Clustering is an unsupervised learning model; similar to classification, it helps create different sets of classes. It groups similar types together by creating or identifying clusters of similar types. Clustering is the task of dividing a homogeneous population or data set into groups. It does so by identifying similar data types or nearby data elements on a graph. In classification, the classes are defined in advance with the help of algorithms or predefined classes, and the data inputs are then assigned to them, while in clustering the algorithm itself decides, from the inputs, the number of clusters, depending on their similarity traits.

These similar sets of inputs form a group, called a cluster. Clustering is a more dynamic model in terms of grouping.

A basic comparison of clustering and classification is given below:

PARAMETER        CLASSIFICATION                            CLUSTERING
Fundamental      Classifies the data into one of the       Maps the data into one of multiple clusters,
                 given pre-defined, definite classes.      where the arrangement of data items relies on
                                                           the similarities between them.
Involved in      Supervised learning                       Unsupervised learning
Training sample  Provided                                  Not provided

We can classify clustering into two categories: soft clustering and hard clustering. Let me give one example to explain the same. Suppose we are developing a website for writing blogs. A blog belongs to a particular category, such as Science, Technology, Arts or Fiction. It is possible that a written article could belong or relate to two or more categories. If we restrict our blogger to choose only one of the categories, we call this a "hard or strict clustering" method, where an item can remain in only one category. Now suppose this work is automated by our piece of code, which chooses categories on the basis of the blog content. If the algorithm chooses exactly one of the given clusters for the blog, it is again "hard or strict clustering". In contrast, if the algorithm can select more than one cluster for the blog content, it is called a "soft or loose clustering" method.

Clustering methods should consider the following important requirements:
; Robustness
; Flexibility
; Efficiency

Clustering algorithms/methods:
There are several clustering algorithms/methods available, of which we will explain a few:
; Connectivity Clustering Method: This model is based on the connectivity between the data points. These models are based on the notion that data points closer together in the data space exhibit more similarity to each other than data points lying farther away.


; Clustering Partition Method: This works on a division method, where divisions or partitions of the data set are created. These partitions are predefined, non-empty sets. This is suitable for a small dataset.
; Centroid Cluster Method: This model revolves around a centre element of the dataset. The data points closest to the centre data point (the centroid) are considered to form a cluster. The K-means clustering algorithm is the best-fit example of such a model.
; Hierarchical Clustering Method: This method describes a tree-based structure of nested clusters. In this method we have clusters based on divisions and their sub-divisions in a hierarchy (nested clustering). The hierarchy can be pre-determined based on user choice. Here the number of clusters can remain dynamic and does not need to be predetermined.
; Density-based Clustering Method: In this method the density of the closest data points is considered to form a cluster. The more close data points there are (the denser the data inputs), the better the cluster formation. The problem here comes with outliers, which are handled in classification (support vector machine) algorithms.

9.14 ANOVA

ANOVA is an acronym which stands for "ANalysis Of VAriance". An ANOVA test is a way to find out whether survey or experiment results are significant. In other words, it helps you to figure out whether you need to reject the null hypothesis or accept the alternate hypothesis. Basically, you are testing groups to see whether there is a difference between them. Examples of when you might want to test different groups:
; A group of psychiatric patients are trying three different therapies: counselling, medication and biofeedback. You want to see if one therapy is better than the others.
; A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other.
; Students from different colleges take the same exam. You want to see if one college outperforms the other.

Formula of ANOVA:

F = MST / MSE

where F is the ANOVA coefficient, MST is the mean sum of squares due to treatment, and MSE is the mean sum of squares due to error.

The ANOVA test is the initial step in analysing factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the methodical factors that measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test results in an f-test to generate additional data that aligns with the proposed regression models.

The ANOVA test allows a comparison of more than two groups at the same time, to determine whether a relationship exists between them. The result of the ANOVA formula, the F statistic (also called the F-ratio), allows for the analysis of multiple groups of data to determine the variability between samples and within samples.

(citation: https://www.investopedia.com/terms/a/anova.asp, https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/)

9.15 PRINCIPAL COMPONENT ANALYSIS (PCA)

PCA is a widely covered method, and there are many articles about it, but only a few go straight to the point and explain how it works without diving too much into the technicalities and the 'why' of things. The explanation below presents it in a simplified way, describing logically what PCA does in each step and simplifying the mathematical concepts behind it, such as standardization, covariance, eigenvectors and eigenvalues, without focusing on how to compute them.

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and they make analysing data much easier and faster for machine learning algorithms, without extraneous variables to process. To sum up, the idea of PCA is simple: reduce the number of variables of a data set, while preserving as much information as possible.

(citation: https://builtin.com/data-science/step-step-explanation-principal-component-analysis)
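A minimal PCA sketch with scikit-learn: it standardizes a small invented four-variable data set, projects it onto two principal components, and reports how much variance those components retain (all numbers are illustrative assumptions).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Invented data set: 50 samples, 4 correlated variables.
base = rng.normal(size=(50, 2))
X = np.hstack([base,
               base @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(50, 2))])

# Standardize first, so every variable contributes on the same scale.
X_std = StandardScaler().fit_transform(X)

# Keep the two directions of maximum variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("reduced shape            :", X_reduced.shape)
print("explained variance ratio :", pca.explained_variance_ratio_)
print("total variance preserved :", pca.explained_variance_ratio_.sum())
```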

9.16 DECISION TREES

A decision tree represents classification. Decision tree learning is the most promising technique for supervised classification learning. Since it is a decision tree, it is meant to take decisions, and being a learning decision tree, it trains itself and learns from the experience of a set of input iterations. These input iterations are also known as "input training sets" or "training set data".

Decision trees predict the future based on previous learning and input rule sets. A decision tree takes multiple input values and returns the probable output as a single value, which is considered the decision. The inputs and outputs can be continuous as well as discrete. A decision tree takes its decision based on the defined algorithms and rule sets.

For example, suppose you want to decide whether to buy a pair of shoes. We start with a set of questions:
1. Do we need one?
2. What would be the budget?
3. Formal or informal?
4. Is it for a special occasion?
5. Which colour suits me better?
6. Which would be the most durable brand?
7. Shall we wait for a special sale, or just buy one now, since it is needed?
Similar further questions would give us a choice for selection. This prediction works on classification, where the choices of output are classified and the possibility of occurrence is decided on the basis of the probability of occurrence of that particular output.

Example:

Fig 9.2 Example showing the decision tree of a weather forecast.

The above figure shows how a decision needs to be taken in a weather forecast scenario, where the day is specified as Sunny, Cloudy or Rainy. Depending on the metrics received by the algorithm, it will take the decision. The metrics could be humidity, sky visibility and others. We can also see that the cloudy situation has two possibilities, partially cloudy and dense clouds, where having partial clouds is also a subset of a sunny day. Such occurrences make the decision tree bivalent.

9.17 SUPPORT VECTOR MACHINES

A support vector machine is an algorithm used for classification in a supervised learning setting. It classifies the inputs received on the basis of the rule set. It also works on regression problems. Classification is needed to differentiate two or more sets of similar data. Let us understand how it works.

Scene one:
Figure 9.3

The above scene shows A, B and C as three line segments creating hyperplanes by dividing the plane. The graph shows the two kinds of inputs, circles and stars. The inputs could be from two classes. Looking at the scenario, we can say that A is the line segment which divides the two hyperplanes showing the two different input classes.

Scene two:
Figure 9.4

In scene 2 we see another rule: the hyperplane which cuts the inputs into the better halves is considered. Hence, hyperplane C is the best choice of the algorithm.

Scene three:

Figure 9.5

Here in scene 3, we see one circle overlapping hyperplane A; hence, according to rule 1 of scene 1, we will choose B, which cuts the co-ordinates into two better halves.

Scene four:

Figure 9.6

Scene 4 shows one hyperplane dividing the two better halves, but there is one extra circle co-ordinate in the other half hyperplane. We call this an outlier, which is generally discarded by the algorithm.

Scene five:

Figure 9.7

Scene 5 shows another strange scenario, where we have co-ordinates in all four quadrants. In this scenario we will fold the x-axis, cut the y-axis into two halves, transfer the stars and circles to one side of the quadrant, and simplify the solution.

The representation is shown below:

Figure 9.8

This gives us again a chance to divide the two classes into two better halves by using a hyperplane. In the above scenario we have scooped out the stars from the circle co-ordinates and shown them as a different hyperplane.
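A brief sketch of a linear support vector machine with scikit-learn on two invented groups of points (the "circles" and "stars" of the scenes above); the data and parameters are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)

# Invented training data: two separable classes ("circles" vs "stars").
circles = rng.normal(loc=[1.0, 1.0], scale=0.4, size=(20, 2))
stars = rng.normal(loc=[4.0, 4.0], scale=0.4, size=(20, 2))
X = np.vstack([circles, stars])
y = np.array([0] * 20 + [1] * 20)

# A linear kernel looks for the hyperplane that separates the two classes
# with the widest margin (the "best cut" of the scenes above).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("hyperplane coefficients :", clf.coef_[0])
print("intercept               :", clf.intercept_[0])
print("support vectors per class:", clf.n_support_)
print("prediction for (2.5, 2.5):", clf.predict([[2.5, 2.5]])[0])
```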
Neural Networks:

Artificial Neural Networks: the term is quite fascinating when a student first starts learning it. Let us break down the term and understand its meaning.
Artificial = man-made.
Neural = comes from the term neurons in the brain, the complex structure of nerve cells that keeps the brain functioning. Neurons are the vital part of the human brain and handle everything from simple input/output to complex problem solving.
Network = a connection of two or more entities (in our case "neurons", not just two but millions of them).

What is a neural network?
A neural network is a network of nerve cells in the brain. There are about 100 billion neurons in our brain. Here are a few more facts about the human brain:
; Our brain weighs approximately 1200-1500 grams.
; It has a complex structure of neurons (nerve cells) with a lot of grey matter, which is essential to keep you working fine.
; There are around 100 billion neurons in our brain, which keep our brain and body functioning by constantly transferring data 24x7 for as long as we are alive.
; The data transfer is achieved by the exchange of electrical or chemical signals with the help of synapses (the junctions between two neurons).
; They exchange about 1000 trillion synaptic signals per second, which is roughly equivalent to a 1 trillion bit-per-second computer processor.
; The amount of energy generated by these synaptic signal exchanges is enough to light a 5-volt bulb.
; A human brain can hence store up to roughly 1000 terabytes of data.
; The information transfer happens with the help of the synaptic exchange.
; It takes about 7 years to replace every single neuron in the brain, so we tend to forget content after 7 years: during the synaptic exchange there is a loss of energy, which means a loss of information, and if we do not recall something for 7 years, that information is effectively erased from our memory.
; Similar to these neurons, computer scientists build complex artificial neural networks using arrays of logic gates. The most preferred gates used are XOR gates.
; An advantage of an artificial brain is that it can store information without forgetting, unlike the human brain, and it can store much more information than an individual brain, although being unable to forget has its own side effects.

Artificial neural networks (ANN) are very useful for solving complex problems and for decision making. An ANN is an artificial representation of a human brain that tries to simulate its various functions, such as learning, calculating, understanding and decision making. It does not reach exact human-brain-like function; it is a connection of logical gates that uses a mathematical computational model to work and produce an output.

In 1943, Warren McCulloch and Walter Pitts modelled an artificial neuron to perform computation. They did this by developing a neuron from a logic gate.
Page 149

Here, the neuron isactually a processing unit, it calculates theweighted sum of the input signal to the neuron to generate the activationsignal a, given by :
InputsWeightshidden layerOutputA single Artificial Neuron representation.
Another representation showing detailed multiple neurons working.Here it shows that, the inputs of all neurons is calculated alongwith their weights. Hence the weighted sum of all the inputs XiWi( X1W1,X2W2, X3W3.......XnWn), Where X represents input signals and Wrepresents weights is considered as an output to the equation “a”.These neurons are connected in a long logical network to create apolynomial function(s). So that to calculate multiple complex problems.In the architecture, more element needsto be added that is a threshold.Threshold defines the limits to the model. Threshold is defined asTHETA (Θ) in neural network model. It is added or subtracted to theoutput depending upon the model definitions.This theta defines additional limits acting as a filter to the inputs.With the help of which we can filter out unwanted stuff and get morefocus onthe needed ones. Another fact about theta is that its value isdynamic according to the environment. For an instance it can beunderstood as + or-tolerance value in semiconductors / resistors.
W1W2......WnX1X2.......Xna =Sum ofinputs∑
Here, the neuron isactually a processing unit, it calculates theweighted sum of the input signal to the neuron to generate the activationsignal a, given by :
InputsWeightshidden layerOutputA single Artificial Neuron representation.
Another representation showing detailed multiple neurons working.Here it shows that, the inputs of all neurons is calculated alongwith their weights. Hence the weighted sum of all the inputs XiWi( X1W1,X2W2, X3W3.......XnWn), Where X represents input signals and Wrepresents weights is considered as an output to the equation “a”.These neurons are connected in a long logical network to create apolynomial function(s). So that to calculate multiple complex problems.In the architecture, more element needsto be added that is a threshold.Threshold defines the limits to the model. Threshold is defined asTHETA (Θ) in neural network model. It is added or subtracted to theoutput depending upon the model definitions.This theta defines additional limits acting as a filter to the inputs.With the help of which we can filter out unwanted stuff and get morefocus onthe needed ones. Another fact about theta is that its value isdynamic according to the environment. For an instance it can beunderstood as + or-tolerance value in semiconductors / resistors.
W1W2......WnX1X2.......Xna =Sum ofinputs∑
Here, the neuron isactually a processing unit, it calculates theweighted sum of the input signal to the neuron to generate the activationsignal a, given by :
InputsWeightshidden layerOutputA single Artificial Neuron representation.
Another representation showing detailed multiple neurons working.Here it shows that, the inputs of all neurons is calculated alongwith their weights. Hence the weighted sum of all the inputs XiWi( X1W1,X2W2, X3W3.......XnWn), Where X represents input signals and Wrepresents weights is considered as an output to the equation “a”.These neurons are connected in a long logical network to create apolynomial function(s). So that to calculate multiple complex problems.In the architecture, more element needsto be added that is a threshold.Threshold defines the limits to the model. Threshold is defined asTHETA (Θ) in neural network model. It is added or subtracted to theoutput depending upon the model definitions.This theta defines additional limits acting as a filter to the inputs.With the help of which we can filter out unwanted stuff and get morefocus onthe needed ones. Another fact about theta is that its value isdynamic according to the environment. For an instance it can beunderstood as + or-tolerance value in semiconductors / resistors.W1W2......WnX1X2.......Xna =Sum ofinputs∑munotes.in


9.18 TENSORFLOW

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications.

It is an open source artificial intelligence library that uses data flow graphs to build models, and it allows developers to create large-scale neural networks with many layers. TensorFlow is mainly used for classification, perception, understanding, discovering, prediction and creation. A minimal illustrative sketch is given after the unit end questions below; more can be learnt from https://www.tensorflow.org/learn.

UNIT END QUESTIONS

1. Explain regression and its types.
2. What is the ANOVA method?
3. Explain the support vector machine.
4. Where is the chi-square test used?
5. Write a note on Principal Component Analysis.
6. Explain TensorFlow with an example.
7. Write a note on machine learning.
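As referenced in Section 9.18 above, the following is a minimal sketch of a TensorFlow model, assuming TensorFlow 2.x with its bundled Keras API is installed; the toy data, layer sizes, and training settings are illustrative only.

```python
# Minimal TensorFlow/Keras sketch: a tiny binary classifier on made-up data.
import numpy as np
import tensorflow as tf

# Toy data: 100 samples with 4 features each, and a simple binary label.
X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2.0).astype("float32")

# A small feed-forward network built layer by layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)

print(model.predict(X[:3], verbose=0))  # probabilities for the first three samples
```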


10
ORGANIZE AND REPORT SUPERSTEPS

Unit Structure
10.1 Organize Superstep
10.2 Report Superstep
10.3 Graphics, Pictures
10.4 Unit End Questions

Organize Superstep, Report Superstep, Graphics, Pictures, Showing the Difference (citation: from the book Practical Data Science by Andreas François Vermeulen)

10.1 ORGANIZE SUPERSTEP

The Organize superstep takes the complete data warehouse you built at the end of the Transform superstep and subsections it into business-specific data marts. A data mart is the access layer of the data warehouse environment, built to expose data to the users. The data mart is a subset of the data warehouse and is generally oriented to a specific business group.

Horizontal Style:
Performing horizontal-style slicing or subsetting of the data warehouse is achieved by applying a filter technique that forces the data warehouse to show only the data for a specific preselected set of filtered outcomes against the data population. Horizontal-style slicing selects a subset of rows from the population while preserving the columns. That is, the data science tool can see the complete record for each record in the subset.

Vertical Style:
Performing vertical-style slicing or subsetting of the data warehouse is achieved by applying a filter technique that forces the data warehouse to show only the data for specific preselected filtered outcomes against the data population. Vertical-style slicing selects a subset of columns from the population while preserving the rows. That is, the data science tool can see only the preselected columns from a record, for all the records in the population.
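To make the two slicing styles concrete, here is a minimal pandas sketch; the DataFrame stands in for a data warehouse, and the column names and filter values are made-up assumptions, not from the text.

```python
# Minimal pandas sketch of horizontal- and vertical-style slicing of a "warehouse".
import pandas as pd

warehouse = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "country":  ["UK", "UK", "DE", "FR"],
    "revenue":  [100, 250, 175, 90],
})

# Horizontal style: keep only some rows, but every column (complete records).
horizontal_mart = warehouse[warehouse["country"] == "UK"]

# Vertical style: keep only some columns, but every row.
vertical_mart = warehouse[["customer", "revenue"]]

# Applying both at once gives the island style described in the next subsection.
island_mart = warehouse.loc[warehouse["country"] == "UK", ["customer", "revenue"]]

print(horizontal_mart, vertical_mart, island_mart, sep="\n\n")
```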



Island Style:
Performing island-style slicing or subsetting of the data warehouse is achieved by applying a combination of horizontal- and vertical-style slicing. This generates a subset of specific rows and specific columns at the same time.

Secure Vault Style:
The secure vault is a version of one of the horizontal, vertical, or island slicing techniques, but the outcome is also attached to the person who performs the query. This is common in multi-security environments, where different users are allowed to see different data sets.

This process works well if you use a role-based access control (RBAC) approach to restricting system access to authorized users. The security is applied against the "role," and a person can then simply be added to or removed from the role by the security system, to enable or disable access. The security in most data lakes I deal with is driven by an RBAC model: an approach that restricts system access to authorized users by allocating them to a layer of roles into which the data lake is organized to support security access. It is also possible to use a time-bound RBAC that grants different access rights during office hours than after hours.

Association Rule Mining:
Association rule learning is a rule-based machine-learning method for discovering interesting relations between variables in large databases, similar to the data you will find in a data lake. The technique enables you to investigate the interaction between data within the same population. The example I will discuss is also called "market basket analysis"; it investigates a customer's purchases during a period of time.

The new measure you need to understand is called "lift." Lift is estimated as the ratio of the joint probability of two items x and y, divided by the product of their individual probabilities:

Lift(x, y) = P(x, y) / (P(x) P(y))
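As a quick worked illustration with made-up probabilities (not from the text):

```python
# Worked lift example with made-up probabilities.
p_x, p_y = 0.40, 0.25      # 40% of baskets contain x, 25% contain y
p_xy = 0.15                # 15% of baskets contain both x and y

lift = p_xy / (p_x * p_y)  # 0.15 / 0.10 = 1.5
print(lift)                # > 1, so x and y co-occur more often than chance
```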


If the two items are statistically independent, then P(x, y) = P(x)P(y), corresponding to a lift of 1. Note that anti-correlation yields lift values less than 1, which is also an interesting discovery, corresponding to mutually exclusive items that rarely co-occur.

You will require the following additional library: conda install -c conda-forge mlxtend.

The general algorithm used for this is the Apriori algorithm, for frequent item set mining and association rule learning over the content of the data lake. It proceeds by identifying the frequent individual items in the data lake and extending them to larger and larger item sets, as long as those item sets appear sufficiently frequently in the data lake. The frequent item sets determined by Apriori can be used to derive association rules that highlight common trends in the overall data lake. I will guide you through an example; start with the standard ecosystem (a minimal sketch using mlxtend appears below).

(Citation: from the book Practical Data Science by Andreas François Vermeulen)

10.2 REPORT SUPERSTEP

The Report superstep is the step in the ecosystem that enhances the data science findings with the art of storytelling and data visualization. You can perform the best data science, but if you cannot execute a respectable and trustworthy Report step by turning your data science into actionable business insights, you have achieved no advantage for your business.

Summary of the Results:
The most important step in any analysis is the summary of the results. Your data science techniques and algorithms can produce the most methodical, most advanced mathematical or most specific statistical results for the requirements, but if you cannot summarize those into a good story, you have not achieved your requirements.

Understand the Context:
What differentiates good data scientists from the best data scientists is not the algorithms or the data engineering; it is the ability of the data scientist to apply the context of his findings to the customer.
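Returning to the association rule mining example promised above, here is a minimal sketch using mlxtend's TransactionEncoder, apriori, and association_rules; the transaction list and thresholds are made up for illustration and are not from the text.

```python
# Minimal market-basket sketch with mlxtend (conda install -c conda-forge mlxtend).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# A small, made-up set of customer baskets.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["milk", "butter"],
    ["bread", "butter"],
]

# One-hot encode the baskets into a True/False item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent item sets first, then rules ranked by lift.
frequent_items = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_items, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```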


Appropriate Visualization:
It is true that a picture tells a thousand words, but in data science you only want your visualizations to tell one story: the findings of the data science you prepared. It is absolutely necessary to ensure that your audience gets your most important message clearly and without any other meanings. Practice with your visual tools and achieve a high level of proficiency. I have seen numerous data scientists lose the value of great data science results because they did not produce an appropriate visual presentation.

Eliminate Clutter:
Have you ever attended a presentation where the person has painstakingly prepared 50 slides to feed back his data science results? The most painful image is the faces of the people suffering through such a presentation for over two hours. The biggest task of a data scientist is to eliminate clutter in the data sets. There are various algorithms for this, such as principal component analysis (PCA), multicollinearity checks using the variance inflation factor to eliminate dimensions, imputing or eliminating missing values, decision trees to subdivide, and backward feature elimination, but the biggest contributor to eliminating clutter is good and solid feature engineering.

10.3 GRAPHICS, PICTURES

Graphic visualisation is the most important part of data science. Hence, plotting graphical representations using Python's matplotlib and similar data visualisation libraries is prominent and useful. Try using such libraries to plot the outcomes as:

- Pie Graph
- Double Pie
- Line Graph
- Bar Graph
- Horizontal Bar Graph
- Area Graph
- Scatter Graph

and so forth. A short sketch of a few of these follows below.
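A minimal matplotlib sketch of a few of the chart types listed above; the labels and values are made up for illustration.

```python
# Minimal matplotlib sketch: pie, line, and bar charts side by side.
import matplotlib.pyplot as plt

labels = ["North", "South", "East", "West"]
sales = [35, 20, 30, 15]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].pie(sales, labels=labels, autopct="%1.0f%%")  # pie graph
axes[0].set_title("Pie")

axes[1].plot(labels, sales, marker="o")               # line graph
axes[1].set_title("Line")

axes[2].bar(labels, sales)                            # bar graph
axes[2].set_title("Bar")

plt.tight_layout()
plt.show()
```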


Channels of Images:
The interesting fact about any picture is that it is a complex data set in its own right. Pictures are built from many layers, or channels, that assist the visualization tools in rendering the required image. Open your Python editor and investigate the inner workings of an image (a short sketch follows the unit end questions below).

UNIT END QUESTIONS

1. What is the Organize superstep? Explain in brief.
2. Explain the importance of graphics in the Organize superstep.
3. Why is the Report superstep important? Explain its importance.
4. Explain the importance of organizing and reporting data.
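As referenced in the "Channels of Images" discussion above, here is a short sketch for inspecting the channels of an image, assuming matplotlib is available; the file name example.png is a placeholder for any RGB image on disk.

```python
# Minimal sketch: load an image and look at its colour channels.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread("example.png")   # placeholder file; array of shape (height, width, channels)
print("Image shape:", img.shape)

# Display the red, green, and blue channels separately.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for i, name in enumerate(["Red", "Green", "Blue"]):
    axes[i].imshow(img[:, :, i], cmap="gray")
    axes[i].set_title(name)
    axes[i].axis("off")

plt.tight_layout()
plt.show()
```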