1
SET UP AND CONFIGURATION OF HADOOP USING CLOUDERA: CREATING A HDFS SYSTEM WITH MINIMUM 1 NAME NODE AND 1 DATA NODE, AND HDFS COMMANDS

Unit Structure :
1.1 Objectives
1.2 Prerequisite
1.3 GUI Configuration
1.4 Command Line Configuration
1.5 Summary
1.6 Sample Questions
1.7 References

1.1 OBJECTIVES
The Hadoop file system stores data in multiple copies, and it is a cost-effective way for any business to store its data efficiently. HDFS operations act as the key that opens the vaults in which you store data, making it available from remote locations. This chapter describes how to set up and edit the deployment configuration files for HDFS.

1.2 PREREQUISITE: TO INSTALL HADOOP, YOU SHOULD HAVE JAVA VERSION 1.8 ON YOUR SYSTEM.
Check your Java version with this command at the command prompt:
java -version
Create a new user variable. Set the Variable_name to HADOOP_HOME and the Variable_value to the path of the bin folder where you extracted Hadoop.
Likewise, create a new user variable with the variable name JAVA_HOME and the variable value set to the path of the bin folder in the Java directory. Now we need to add the Hadoop bin directory and the Java bin directory to the system PATH variable. Edit Path in the system variables.
Click on New and add the bin directory paths of Hadoop and Java to it.

1.3 GUI CONFIGURATIONS
Now we need to edit some files located in the etc/hadoop directory of the folder where we installed Hadoop. The files that need to be edited are covered in the steps below.
1. Edit the file core-site.xml in the hadoop directory. Copy this XML property into the configuration element of the file:

<property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000</value>
</property>

2. Edit mapred-site.xml and copy this property into the configuration:

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>

3. Create a folder 'data' in the hadoop directory.
4. Create a folder with the name 'datanode' and a folder 'namenode' in this data directory.
5. Edit the file hdfs-site.xml and add the properties below to the configuration.
Note: The paths in the values should be the paths of the namenode and datanode folders you just created.

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
</property>

6. Edit the file yarn-site.xml and add the properties below to the configuration:

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
7. Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of the Java folder where your JDK 1.8 is installed.
8. Hadoop needs Windows-specific files which do not come with the default download of Hadoop. To include those files, replace the bin folder in the hadoop directory with the bin folder provided at this GitHub link:
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as a zip file, extract it, and copy the bin folder inside it. If you want to keep the old bin folder, rename it to something like bin_old and then paste the copied bin folder into that directory.
Check whether Hadoop is successfully installed by running this command on cmd:
hadoop version

Format the NameNode
Formatting the NameNode is done only once, when Hadoop is first installed, and never again while the filesystem is in use; reformatting it will delete all the data inside HDFS. Run this command:
hdfs namenode -format
Now change the directory in cmd to the sbin folder of the hadoop directory, and start the NameNode and DataNode with this command:
start-dfs.cmd
Two more cmd windows will open, one for the NameNode and one for the DataNode. Now start YARN with this command:
start-yarn.cmd
Note: Make sure all 4 Apache Hadoop Distribution windows are up and running. If they are not running, you will see an error or a shutdown message; in that case you need to debug the error.
To access information about the Resource Manager's current, successful, and failed jobs, go to this link in a browser:
http://localhost:8088/cluster
To check the details about HDFS (NameNode and DataNode):
http://localhost:9870/
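A quick way to confirm that all four daemons actually started is the jps command, which lists the running Java processes. This is only an illustrative check; the process IDs shown here are made up and will differ on your machine:

jps
4587 NameNode
5123 DataNode
6034 ResourceManager
6291 NodeManager
7012 Jps

If any of the four daemons is missing from the list, check the corresponding cmd window for the error message before proceeding.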
1.4 COMMAND LINE CONFIGURATION

Starting HDFS
Format the configured HDFS file system, then open the NameNode (HDFS server) and execute the following command.
$ hadoop namenode -format
Start the distributed file system with the command below, which starts the NameNode as well as the DataNodes in the cluster.
$ start-dfs.sh

Read & Write Operations in HDFS
You can execute almost all of the operations on the Hadoop Distributed File System that you can execute on the local file system: reading and writing operations such as creating a directory, setting permissions, copying files, updating files, deleting files, and so on. You can also set access rights and browse the file system to get cluster information such as the number of dead nodes, live nodes, space used, etc.

HDFS Operations to Read a File
To read any file from HDFS you have to interact with the NameNode, as it stores the metadata about the DataNodes. The client gets a token from the NameNode that specifies the address where the data is stored. You send a read request to the NameNode for a particular block location through the distributed file system. The NameNode then checks your privileges and, if the access is valid, allows you to read the block from the DataNode.
$ hadoop fs -cat <file_path>

HDFS Operations to Write to a File
Similar to the read operation, the HDFS write operation writes the file to a particular address obtained through the NameNode. The NameNode provides the slave (DataNode) address where the client can write or append data. After the block is written, the slave replicates that block to other slave locations according to the replication factor (3 by default), and an acknowledgement is then returned to the client. The process for contacting the NameNode is very similar to that of a read operation. Below is the HDFS write command (the file and directory names are placeholders):
bin/hdfs dfs -put <local_file> <hdfs_dir>
Listing Files in HDFS
Use the 'ls' command in the terminal to list the files in a directory and to check the status of a file. 'ls' can be given a directory or a filename as an argument:
$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS
The steps below insert a file into the Hadoop file system.
Step 1: Create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2: Use the put command to transfer the data file from the local system to HDFS.
$ $HADOOP_HOME/bin/hadoop fs -put /home/intellipaat.txt /user/input
Step 3: Verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS
For instance, if you have a file in HDFS called intellipaat, retrieve it from the Hadoop file system as follows:
Step 1: View the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/intellipaat
Step 2: Get the file from HDFS to the local file system using the get command as shown below.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS
Shut down HDFS with the following command:
$ stop-dfs.sh

Multi-Node Cluster
Installing Java
Syntax of the java version command:
$ java -version
The following output is presented:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

Creating a User Account
A system user account is used on both the master and slave systems for the Hadoop installation.
# useradd hadoop
# passwd hadoop

Mapping the nodes
The hosts file in the /etc/ folder should be edited on every node; the IP address of each system followed by its host name must be specified.
# vi /etc/hosts
Enter the following lines in the /etc/hosts file.
192.168.1.109 hadoop-master
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2

Configuring Key Based Login
SSH should be set up on each node so that the nodes can talk to one another without being prompted for a password.
# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub tutorialspoint@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp1@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp2@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
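Before moving on it is worth confirming that the key-based login actually works. A minimal check, assuming the host names and users configured above, is to ssh from the master to each slave and make sure no password prompt appears:

$ ssh hadoop_tp1@hadoop-slave-1 hostname
hadoop-slave-1
$ ssh hadoop_tp2@hadoop-slave-2 hostname
hadoop-slave-2

If either command still asks for a password, re-check the permissions on ~/.ssh and authorized_keys on that slave.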
Installation of Hadoop
Hadoop should be downloaded on the master server using the following procedure.
# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/

Configuring Hadoop
The Hadoop server must be configured in core-site.xml; add the following properties inside the configuration element and edit them as required.

<property>
   <name>fs.default.name</name>
   <value>hdfs://hadoop-master:9000/</value>
</property>
<property>
   <name>dfs.permissions</name>
   <value>false</value>
</property>

The hdfs-site.xml file should be edited as follows.

<property>
   <name>dfs.data.dir</name>
   <value>/opt/hadoop/hadoop/dfs/name/data</value>
   <final>true</final>
</property>
<property>
   <name>dfs.name.dir</name>
   <value>/opt/hadoop/hadoop/dfs/name</value>
   <final>true</final>
</property>
<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>

The mapred-site.xml file should be edited as per the requirement; an example is shown below.

<property>
   <name>mapred.job.tracker</name>
   <value>hadoop-master:9001</value>
</property>

In hadoop-env.sh, JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS should be edited as follows:
export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf

Installing Hadoop on Slave Servers
Hadoop should be installed on all the slave servers.
# su hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop
Configuring Hadoop on the Master Server
Master server configuration:
# su hadoop
$ cd /opt/hadoop/hadoop
Master Node Configuration
$ vi etc/hadoop/masters
hadoop-master
Slave Node Configuration
$ vi etc/hadoop/slaves
hadoop-slave-1
hadoop-slave-2
NameNode format on the Hadoop Master
# su hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
25/05/22 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.1.109
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Monday May 23 06:59:37 UTC 2022
STARTUP_MSG: java = 1.7.0_71
************************************************************
25/05/22 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
14 Big Data Analytics and Visualization Lab 14 25/05/22 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted. 25/05/22 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG: ************************************************************ SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15 ************************************************************ Hadoop Services Starting Hadoop services on the Hadoop-Master procedure explains its setup. $ cd $HADOOP_HOME/sbin $ start-all.sh Addition of a New DataNode in the Hadoop Cluster is as follows: Networking Add new nodes to an existing Hadoop cluster with some suitable network configuration. Consider the following network configuration for new node Configuration: IP address : 192.168.1.103 netmask : 255.255.255.0 hostname : slave3.in Adding a User and SSH Access Add a user working under “hadoop” domain and the user must have the access added and password of Hadoop user can be set to anything one wants. useradd hadoop passwd hadoop To be executed on master mkdir -p $HOME/.ssh chmod 700 $HOME/.ssh ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys munotes.in
15 Set up and Configuration Hadoop using Cloudera creating a HDFS System with Minimum 1 Name Node and 1 Data Nodes HDFS Commands chmod 644 $HOME/.ssh/authorized_keys Copy the public key to new slave node in hadoop user $HOME directory scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/ Execution done on slaves su hadoop ssh -X hadoop@192.168.1.103 Content of public key must be copied into file “$HOME/.ssh/authorized_keys” and then the permission for the same must be changed as per the requirement. cd $HOME mkdir -p $HOME/.ssh chmod 700 $HOME/.ssh cat id_rsa.pub >>$HOME/.ssh/authorized_keys chmod 644 $HOME/.ssh/authorized_keys ssh login must be changed from the master machine. It is possible that the ssh to the new node without a password from the master must be verified. ssh hadoop@192.168.1.103 or hadoop@slave3 Setting Hostname for New Node Hostname is setup in the file directory /etc/sysconfig/network On new slave3 machine NETWORKING=yes HOSTNAME=slave3.in Machine must be restarted again or hostname command should be run under new machine with the corresponding hostname to make changes effectively. On slave3 node machine: hostname slave3.in /etc/hosts must be updated on all machines of the cluster 192.168.1.102 slave3.in slave3 ping the machine with hostnames to check whether it is resolving to IP address. ping master.in munotes.in
Start the DataNode on the New Node
The DataNode daemon should be started manually using the $HADOOP_HOME/bin/hadoop-daemon.sh script. The master (NameNode) is contacted automatically and the new node joins the cluster. The new node should also be added to the conf/slaves file on the master server so that it is identified by the script-based commands.
Login to the new node:
su hadoop or ssh -X hadoop@192.168.1.103
Start HDFS on the newly added slave node:
./bin/hadoop-daemon.sh start datanode
The jps command output must be checked on the new node.
$ jps
7141 DataNode
10312 Jps

Removing a DataNode
A node can be removed from a cluster while it is running, without any worry of data loss. HDFS provides a decommissioning feature which ensures that removing a node is performed safely.
Step 1
Login to the master machine on which Hadoop is installed.
$ su hadoop
Step 2
Before starting the cluster, an exclude file must be configured: a key named dfs.hosts.exclude should be added to the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file. Its value is the full path to a file on the NameNode's local file system that contains the list of machines which are not permitted to connect to HDFS.

<property>
   <name>dfs.hosts.exclude</name>
   <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
   <description>DFS exclude</description>
</property>
17 Set up and Configuration Hadoop using Cloudera creating a HDFS System with Minimum 1 Name Node and 1 Data Nodes HDFS Commands Step 3 Hosts with respect to decommission are determined. File reorganization by the hdfs_exclude.txt for each and every machine to be decommissioned which will results in preventing them from connecting to the NameNode. slave2.in Step 4 Force configuration reloads. “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” should be run $ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes NameNode will be forced made to re-read its configuration, as this is inclusive for the newly updated ‘excludes’ file. Nodes will be decommissioned over a period of time intervals, and allowing time for each node’s blocks to be replicated onto machines which are scheduled to be active.jps command output should be checked on slave2.in. Once the work is done DataNode process will shutdown automatically. Step 5 Shutdown nodes. The decommissioned hardware can be carefully shut down for maintenance purpose after the decommission process has been finished. $ $HADOOP_HOME/bin/hadoop dfsadmin -report Step 6 Excludes are edited again and once the machines have been decommissioned, they are removed from the ‘excludes’ file. “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” will read the excludes file back into the NameNode. Data Nodes will rejoin the cluster after the maintenance has been completed, or if additional capacity is needed in the cluster again is being informed. To run/shutdown tasktracker $ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker $ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker Add a new node with the following steps 1) Take a new system which gives access to create a new username and password munotes.in
18 Big Data Analytics and Visualization Lab 18 2) Install the SSH and with master node setup ssh connections 3) Add sshpublic_rsa id key having an authorized keys file 4) Add the new data node hostname, IP address and other informative details in /etc/hosts slaves file192.168.1.102 slave3.in slave3 5) Start the DataNode on the New Node 6) Login to the new node command like suhadoop or Ssh -X hadoop@192.168.1.103 7) Start HDFS of newly added in the slave node by using the following command ./bin/hadoop-daemon.sh start data node 8) Check the output of jps command on a new node. 1.5 SUMMARY Hadoop Distributed File System is a highly scalable, flexible, fault-tolerant, and reliable system that stores the data across multiple nodes on different servers. It follows a master-slave architecture, where the NameNode acts as a master, and the DataNode as the slave. HDFS Operations are used to access these NameNodes and interact with the data. The files are broken down into blocks where the client can store the data, read, write, and perform various operations by completing the authentication process. 1.6 SAMPLE QUESTIONS 1. What are the different vendor-specific distributions of Hadoop? 2. What are the different Hadoop configuration files? 3. What are the three modes in which Hadoop can run? 4. What are the differences between regular FileSystem and HDFS? 5. Why is HDFS fault-tolerant? 6. Explain the architecture of HDFS. 7. What are the two types of metadata that a NameNode server holds? 8. What is the difference between a federation and high availability? 9. If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split? 10. How does rack awareness work in HDFS? 11. How can you restart NameNode and all the daemons in Hadoop? 12. Which command will help you find the status of blocks and FileSystem health? munotes.in
19 Set up and Configuration Hadoop using Cloudera creating a HDFS System with Minimum 1 Name Node and 1 Data Nodes HDFS Commands 13. What would happen if you store too many small files in a cluster on HDFS? 14. How do you copy data from the local system onto HDFS? 1.7 REFERENCES 1 Tom White, “HADOOP: The definitive Guide” O Reilly 2012, Third Edition 2 Chuck Lam, “Hadoop in Action”, Dreamtech Press 2016, First Edition 3 Shiva Achari,” Hadoop Essential “PACKT Publications 4 Radha Shankarmani and M. Vijayalakshmi,” Big Data Analytics “Wiley Textbook Series, Second Edition 5 Jeffrey Aven,”Apache Spark in 24 Hours” Sam’s Publication, First Edition 6 Bill Chambers and MateiZaharia,”Spark: The Definitive Guide: Big Data Processing Made Simple “O’Reilly Media; First edition 7 James D. Miller,” Big Data Visualization” PACKT Publications. 8 https://hadoop.apache.org/docs/stable/ 9 https://pig.apache.org/ 10 https://hive.apache.org/ 11 https://spark.apache.org/documentation.html 12 https://help.tableau.com/current/pro/desktop/en-us/default.htm 13 https://www.youtube.com/watch?v=HloGuAzP_H8 munotes.in
2
EXPERIMENT 1
AIM
Write a simple program for Word Count using MapReduce programming.
OBJECTIVE
In the MapReduce word count example, we find the frequency of each word. Here the role of the Mapper is to map keys to values, and the role of the Reducer is to aggregate the values of common keys. So everything is represented in the form of key-value pairs.
THEORY
In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of the tasks can be joined together to compute the final results.
Pre-requisite
Make sure that Hadoop is installed on your system along with the Java SDK and the Eclipse editor. This applies to all experiments.
Steps
1. Open Eclipse > File > New > Java Project > (Name it - MRProgramsDemo) > Finish.
2. Right Click > New > Package (Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - WordCount).
4. Add the following reference libraries:
   1. Right Click on Project > Build Path > Add External JARs
   2. /usr/lib/hadoop-0.20/hadoop-core.jar
   3. /usr/lib/hadoop-0.20/lib/Commons-cli-1.2.jar
5. Type the following code:
package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
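// NOTE: the page of this extract that contained the remaining imports, the driver
// (main) method and the Mapper class is missing. The sketch below is a reconstruction
// that assumes the standard WordCount structure implied by the Reducer shown further
// down and by the comma-separated sample input (CAR,BUS,TRAIN,...); adjust it to your
// own source if it differs.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Driver: wires the Mapper and Reducer together and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        Job j = Job.getInstance(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each comma-separated line into words and emits (WORD, 1).
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");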
            for(String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for(IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
The above program consists of three classes:
● The Driver class (the public static void main method; this is the entry point).
● The Map class, which extends the Mapper class and implements the map function.
● The Reduce class, which extends the Reducer class and implements the reduce function.
6. Make a jar file
Right Click on Project > Export > Select export destination as JAR file > Next > Finish.
7. Take a text file and move it into HDFS:
To move the file into Hadoop directly, open the terminal and enter the following commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
9. Open the result:
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r-- 1 training supergroup 0 2022-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup 0 2022-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup 20 2022-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
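Note: step 8, which actually runs the exported jar, is not reproduced in this extract (it was most likely shown as a screenshot). Assuming the project, package, and class names used above, and MRDir1 as the output directory inspected in step 9, the invocation would look roughly like the following; adjust the jar name and paths to your own build:

[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1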
25 Experiments EXPERIMENT 2 AIM : Write a program in Map Reduce for Union operation. OBJECTIVE In MapReduce word count example, we find out the frequency of each word. Here, the role of Mapper is to map the keys to the existing values and the role of Reducer is to aggregate the keys of common values. So, everything is represented in the form of Key-value pair. THEORY: In Hadoop, Map Reduce is a computation that decomposes large manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of tasks can be joined together to compute final results. Pre-requisite Make sure that Hadoop is installed on your system with the Java SDK, ECLIPSE editor. For all Experiment. Steps 1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish. 2. Right Click > New > Package ( Name it - PackageDemo) > Finish. 3. Right Click on Package > New > Class (Name it - Union). 4. Add Following Reference Libraries: 1. Right Click on Project > Build Path> Add External 1. /usr/lib/hadoop-0.20/hadoop-core.jar 2. Usr/lib/hadoop-0.20/lib/Commons-cli-1.2.jar 5. Type the following code: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; munotes.in
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class Union {
    private static Text emptyWord = new Text("");

    public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, Text> {
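        // NOTE: the map() body, the Reducer class and the opening lines of main() are
        // missing from this extract (the generic type parameters above were also stripped
        // along with the angle brackets). The reconstruction below assumes the usual
        // approach for a relational UNION: emit each input line once as a key and let the
        // reducer collapse duplicates. Type parameters and the job name are assumptions;
        // the class names match the setXxxClass() calls that follow.
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // emit the whole line as the key, with an empty value
            context.write(value, emptyWord);
        }
    }

    public static class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // each distinct line is written exactly once: the union of the inputs
            context.write(key, emptyWord);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "union");
        job.setJarByClass(Union.class);
        job.setMapperClass(Mapper.class);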
        job.setCombinerClass(Reducer.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run the above code in the same way as the first experiment.
28 Big Data Analytics and Visualization Lab 28 EXPERIMENT 3 AIM : Write a program in Map Reduce for Intersection operation. OBJECTIVE In MapReduce word count example, we find out the frequency of each word. Here, the role of Mapper is to map the keys to the existing values and the role of Reducer is to aggregate the keys of common values. So, everything is represented in the form of Key-value pair. THEORY In Hadoop, Map Reduce is a computation that decomposes large manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of tasks can be joined together to compute final results. Pre-requisite Make sure that Hadoop is installed on your system with the Java SDK, ECLIPSE editor. For all Experiment. Steps 1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish. 2. Right Click > New > Package ( Name it - PackageDemo) > Finish. 3. Right Click on Package > New > Class (Name it - Intersection). 4. Add Following Reference Libraries: 1. Right Click on Project > Build Path> Add External 1. /usr/lib/hadoop-0.20/hadoop-core.jar 2. Usr/lib/hadoop-0.20/lib/Commons-cli-1.2.jar Type the following code: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; munotes.in
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class Intersection {
    public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, IntWritable> {
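        // NOTE: the map() body and the Combiner class referenced in main() are missing
        // from this extract (the generic type parameters above were also stripped). A
        // likely reconstruction is sketched below: the mapper emits each line with a
        // count of 1, and the combiner collapses duplicates within one input split so
        // that the reducer effectively counts in how many inputs a line appears.
        private final static IntWritable one = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            context.write(value, one);
        }
    }

    public static class Combiner extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // emit each key only once per split, regardless of how many times it occurred
            context.write(key, new IntWritable(1));
        }
    }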
    public static class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, Text> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum++;
            }
            if (sum > 1) {
                context.write(key, key);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(Intersection.class);
        job.setMapperClass(Mapper.class);
        job.setCombinerClass(Combiner.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run the above code in the same way as the first experiment.
31 Experiments EXPERIMENT 4 AIM : Write a program in Map Reduce for GroupSum operation. OBJECTIVE In MapReduce word count example, we find out the frequency of each word. Here, the role of Mapper is to map the keys to the existing values and the role of Reducer is to aggregate the keys of common values. So, everything is represented in the form of Key-value pair. THEORY: In Hadoop, Map Reduce is a computation that decomposes large manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of tasks can be joined together to compute final results. Pre-requisite Make sure that Hadoop is installed on your system with the Java SDK, ECLIPSE editor. For all Experiment. Steps 1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish. 2. Right Click > New > Package ( Name it - PackageDemo) > Finish. 3. Right Click on Package > New > Class (Name it - GroupSum). 4. Add Following Reference Libraries: 1. Right Click on Project > Build Path> Add External 1. /usr/lib/hadoop-0.20/hadoop-core.jar 2. Usr/lib/hadoop-0.20/lib/Commons-cli-1.2.jar Type the following code: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; munotes.in
import java.io.IOException;
public class GroupSum {
    public static class MyMapper extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, IntWritable> {
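        // NOTE: the map() body, the MyReducer class and the opening lines of main() are
        // missing from this extract (the generic type parameters above were also
        // stripped). The reconstruction below assumes input records of the form
        // "group,number" (comma-separated) and computes the sum of the numbers for each
        // group; adjust the parsing to your own input format.
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            // parts[0] is the grouping attribute, parts[1] the numeric attribute to sum
            context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }

    public static class MyReducer extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();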
        Job job = Job.getInstance(conf, "Word sum");
        job.setJarByClass(GroupSum.class);
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run the above code in the same way as the first experiment.
EXPERIMENT 5
AIM
Write a program in MapReduce for Matrix Multiplication.
OBJECTIVE
In this experiment we multiply two matrices with MapReduce: the Mapper emits each matrix element once for every output cell it contributes to, and the Reducer aggregates the partial products for each output cell. Everything is represented in the form of key-value pairs.
THEORY
In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of the tasks can be joined together to compute the final results.
Pre-requisite
Make sure that Hadoop is installed on your system along with the Java SDK and the Eclipse editor.
Steps
1. Open Eclipse > File > New > Java Project > (Name it - MRProgramsDemo) > Finish.
2. Right Click > New > Package (Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - MatrixMultiplication).
4. Add the following reference libraries:
   1. Right Click on Project > Build Path > Add External JARs
   2. /usr/lib/hadoop-0.20/hadoop-core.jar
   3. /usr/lib/hadoop-0.20/lib/Commons-cli-1.2.jar
Type the following code (the Map, Reduce, and MatrixMultiply driver classes):
// Mapper for one-step matrix multiplication. Matrix M is m-by-n and N is n-by-p;
// the dimensions are read from the job configuration. (The usual org.apache.hadoop.*
// imports, as in the previous experiments, are required.)
public class Map extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));
        int p = Integer.parseInt(conf.get("p"));
        String line = value.toString();
        // input record: (M, i, j, Mij)
        String[] indicesAndValue = line.split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k); // outputKey = (i,k)
                outputValue.set(indicesAndValue[0] + "," + indicesAndValue[2] + "," + indicesAndValue[3]); // outputValue = (M,j,Mij)
                context.write(outputKey, outputValue);
            }
        } else {
            // input record: (N, j, k, Njk)
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }
}

public class Reduce extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String[] value;
        // key = (i,k), values = [(M/N, j, V/W), ...]
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values) {
            value = val.toString().split(",");
            if (value[0].equals("M")) {
                hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            } else {
                hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        float m_ij;
        float n_jk;
        for (int j = 0; j < n; j++) {
            m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += m_ij * n_jk;
        }
        if (result != 0.0f) {
            context.write(null,
                new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}

public class MatrixMultiply {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MatrixMultiply <input_dir> <output_dir>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        // M is an m-by-n matrix; N is an n-by-p matrix.
        conf.set("m", "1000");
        conf.set("n", "100");
        conf.set("p", "1000");
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Run the above code in the same way as the first experiment.
Sources:
www.geeksforgeeks.org
https://github.com/wosiu/Hadoop-map-reduce-relational-algebra/blob/master/src/main/java/WordCount.java
39 3 MONGO DB Unit Structure : 3.0 Objectives 3.1 Introduction 3.1.1 How it Works 3.1.2 How MongoDB is different from RDBMS 3.1.3 Features of MongoDB 3.1.4 Advantages of MongoDB 3.1.5 Disadvantages of MongoDB 3.2 MongoDB Environment 3.2.1 Install MongoDB in Windows 3.3 Creating a Database Using MongoDB 3.4 Creating Collections in MongoDB 3.5 CRUD Document 3.5.1 Create Operations 3.5.2 Read Operations 3.5.3 Update Operations 3.5.4 Delete Operations 3.6 Let’s Sum up 3.7 Website references 3.0 OBJECTIVES After going through this unit, you will be able to: • Have a solid understanding of basics of MongoDB • Learn to run queries against a MongoDB instance • Manipulate and retrieve data • Different techniques and algorithms used in association rules. • Understand the difference between Data Mining and KDD in Databases. 3.1 INTRODUCTION MongoDB is a free, open-source document-oriented database that can hold a big amount of data while also allowing you to deal with it quickly. Because MongoDB does not store and retrieve data in the form of tables, it is classified as a NoSQL (Not Only SQL) database. munotes.in
40 Big Data Analytics and Visualization Lab 40 MongoDB is a database developed and managed by MongoDB.Inc, which was first released in February 2009 under the SSPL (Server Side Public License). It also includes certified driver support for all major programming languages, including C, C++, C#, etc. Net, Go, Java, Node.js, Perl, PHP, Python, Motor, Ruby, Scala, Swift, and Mongoid are examples of programming languages. As a result, any of these languages can be used to develop an application. MongoDB is now used by a large number of firms, including Face book and Google. 3.1.1 How it works Now we'll see how things really work behind the scenes. As we all know, MongoDB is a database server that stores data in databases. In other words, the MongoDB environment provides you with a server on which you may install MongoDB and then construct several databases. The data is saved in collections and documents thanks to the NoSQL database. As a result, the database, collection, and documents are linked as follows: Figure: 4.1 Mongo DB Structure ● Collections exist in the MongoDB database, exactly as tables do in the MYSQL database. You have the option of creating several databases and collections. ● We now have papers within the collection. These documents hold the data we want to store in the MongoDB database, and a single collection can contain several documents. You are schema-less, which means that each document does not have to be identical to the others. ● The fields are used to build the documents. Fields in documents are key-value pairs, similar to columns in a relation database. The fields' values can be of any BSON data type, such as double, string, Boolean, and so on. ● The data in MongoDB is saved in the form of BSON documents. BSON stands for Binary representation of JSON documents in this munotes.in
context. To put it another way, the MongoDB server translates JSON data into a binary format called BSON, which can then be stored and searched more efficiently.
● It is possible to store nested data in MongoDB documents. In contrast to SQL, nesting data allows you to express complicated relationships between data and store them in the same document, which makes working with and retrieving the data very efficient; to retrieve related data from tables 1 and 2 in SQL you must use joins. A BSON document can be up to 16MB in size. In MongoDB, users are allowed to run multiple databases.

3.1.2 How MongoDB is different from RDBMS
Some major differences between MongoDB and RDBMS are as follows.

MongoDB | RDBMS
It is a non-relational and document-oriented database. | It is a relational database.
It is suitable for hierarchical data storage. | It is not suitable for hierarchical data storage.
It has a dynamic schema. | It has a predefined schema.
It centers on the CAP theorem (Consistency, Availability, and Partition tolerance). | It centers on the ACID properties (Atomicity, Consistency, Isolation, and Durability).
In terms of performance, it is much faster than RDBMS. | In terms of performance, it is slower than MongoDB.

3.1.3 Features of MongoDB
● Schema-less Database: MongoDB is a schema-less database, which means that different types of documents can be stored in the same collection. In other words, a single collection in a MongoDB database can hold numerous documents, each of which may have a different number of fields, different content, and a different size. In contrast to relational databases, one document is not required to be similar to another. This gives MongoDB databases a lot of flexibility.
● Document Oriented: Unlike an RDBMS, MongoDB stores all data in documents rather than tables. Data is kept in fields (key-value pairs) rather than rows and columns, making the data far more flexible than in an RDBMS. Each document also has its own unique object id.
● Indexing: Every field in a MongoDB document can be indexed with primary and secondary indices, making it easier and
42 Big Data Analytics and Visualization Lab 42 faster to get or search data from the pool of data. If the data isn't indexed, the database will have to search each document individually for the query, which takes a long time and is inefficient. ● Scalability: Sharding is used by MongoDB to achieve horizontal scalability. Sharding is a method of distributing data across numerous servers. A significant quantity of data is partitioned into data chunks using the shard key, and these data pieces are evenly dispersed across shards located on multiple physical servers. It can also be used to add additional machines to an existing database. ● Replication: MongoDB delivers high availability and redundancy through replication, which produces multiple copies of data and sends them to a different server so that if one fails, the data may be retrieved from another. ● Aggregation: It enables you to conduct operations on grouped data and obtain a single or computed result. It's analogous to the GROUPBY clause in SQL. Aggregation pipelines map-reduce functions, and single-purpose aggregation methods are among the three types of aggregations available. ● High Performance: Due to characteristics such as scalability, indexing, replication, and others, MongoDB has a very high performance and data persistence when compared to other databases. 3.1.4 Advantages of MongoDB ● It's a NoSQL database with no schema. When working with MongoDB, you don't have to worry about designing the database schema. ● The operation of joining is not supported. ● It gives the fields in the papers a lot more flexibility. ● It contains a variety of data. ● It offers excellent availability, scalability, and performance. ● It effectively supports geospatial. ● The data is stored in BSON documents in this document-oriented database. ● It also allows ACID transitions between multiple documents (string from MongoDB 4.0). ● There is no need for SQL injection. ● It is simple to integrate with Hadoop Big Data munotes.in
43 MONGO DB 3.1.5 Disadvantages of MongoDB ● It stores data in a large amount of memory. ● You may not keep more than 16MB of data in your documents. ● The nesting of data in BSON is likewise limited; you can only nest data up to 100 levels. 3.2 MONGO DB ENVIRONMENT To get started with MongoDB, you have to install it in your system. You need to find and download the latest version of MongoDB, which will be compatible with your computer system. You can use this (http://www.mongodb.org/downloads) link and follow the instruction to install MongoDB in your PC. The process of setting up MongoDB in different operating systems is also different, here various installation steps have been mentioned and according to your convenience, you can select it and follow it. 3.2.1 Install Mongo DB in Windows The website of MongoDB provides all the installation instructions, and MongoDB is supported by Windows, Linux as well as Mac OS. It is to be noted that, MongoDB will not run in Windows XP; so you need to install higher versions of windows to use this database. Once you visit the link (http://www.mongodb.org/downloads), click the download button. 1. Once the download is complete, double click this setup file to install it. Follow the steps: 2.
44 Big Data Analytics and Visualization Lab 44 3. Now, choose Complete to install MongoDB completely. 4. Then, select the radio button "Run services as Network service user." 5. The setup system will also prompt you to install MongoDB Compass, which is MongoDB official graphical user interface (GUI). You can tick the checkbox to install that as well. munotes.in
45 MONGO DB Once the installation is done completely, you need to start MongoDB and to do so follow the process: 1. Open Command Prompt. 2. Type: C:\Program Files\MongoDB\Server\4.0\bin 3. Now type the command simply: mongodb to run the server. In this way, you can start your MongoDB database. Now, for running MongoDB primary client system, you have to use the command: C:\Program Files\MongoDB\Server\4.0\bin>mongo.exe
46 Big Data Analytics and Visualization Lab 46 3.3 CREATING A DATABASE USING MONGODB To build a database in MongoDB, first construct a MongoClient object, and then supply a connection URL with the right IP address and the database name. If the database does not already exist, MongoDB will create it and connect to it. Example: Create a database called “mydb” var MongoClient = require('mongodb').MongoClient; var url = "mongodb://localhost:27017/mydb"; MongoClient.connect(url, function(err, db) { if (err) throw err; console.log("Database created!"); db.close(); }); Save the code above in a file called "demo_create_mongo_db.js" and run the file: Run "demo_create_mongo_db.js" C:\Users\Your Name>node demo_create_mongo_db.js This will give you this result: Database created! Note: MongoDB waits until you have created a collection (table), with at least one document (record) before it actually creates the database (and collection). The use Command MongoDB use DATABASE_NAME is used to create database. The command will create a new database if it doesn't exist, otherwise it will return the existing database. Syntax Basic syntax of use DATABASE statement is as follows – use DATABASE_NAME Example If you want to use a database with name , then use DATABASE statement would be as follows − >use mydb switched to db mydb To check your currently selected database, use the command db >db Mydb munotes.in
47 MONGO DB If you want to check your databases list, use the command show dbs. >show dbs local 0.78125GB test 0.23012GB Your created database (mydb) is not present in list. To display database, you need to insert at least one document into it. >db.movie.insert({"name":"tutorials point"}) >show dbs local 0.78125GB mydb 0.23012GB test 0.23012GB In MongoDB default database is test. If you didn't create any database, then collections will be stored in test database. DROP DATABASE The dropDatabase() Method MongoDB db.dropDatabase() command is used to drop a existing database. Syntax Basic syntax of dropDatabase() command is as follows – db.dropDatabase() This will delete the selected database. If you have not selected any database, then it will delete default 'test' database. Example First, check the list of available databases by using the command, show dbs. >show dbs local 0.78125GB mydb 0.23012GB test 0.23012GB > munotes.in
48 Big Data Analytics and Visualization Lab 48 If you want to delete new database , then dropDatabase() command would be as follows − >use mydb switched to db mydb >db.dropDatabase() >{ "dropped" : "mydb", "ok" : 1 } > Now check list of databases. >show dbs local 0.78125GB test 0.23012GB > 3.4 CREATING COLLECTIONS IN MONGO DB The createCollection() Method MongoDB db.createCollection(name, options) is used to create collection. Syntax Basic syntax of createCollection() command is as follows − db.createCollection(name, options) In the command, name is name of collection to be created. Options is a document and is used to specify configuration of collection. Parameter Type Description Name String Name of the collection to be created Options Document (Optional) Specify options about memory size and indexing munotes.in
49 MONGO DB Options parameter is optional, so you need to specify only the name of the collection. Following is the list of options you can use – Field Type Description capped Boolean (Optional) If true, enables a capped collection. Capped collection is a fixed size collection that automatically overwrites its oldest entries when it reaches its maximum size. If you specify true, you need to specify size parameter also. autoIndexId Boolean (Optional) If true, automatically create index on _id field.s Default value is false. size number (Optional) Specifies a maximum size in bytes for a capped collection. If capped is true, then you need to specify this field also. max number (Optional) Spe cifies the maximum number of documents allowed in the capped collection. While inserting the document, MongoDB first checks size field of capped collection, then it checks max field. Examples Basic syntax of createCollection() method without options is as follows − >use test switched to db test >db.createCollection("mycollection") { "ok" : 1 } > You can check the created collection by using the command show collections. >show collections mycollection system.indexes The following example shows the syntax of createCollection() method with few important options − > db.createCollection("mycol", { capped : true, autoIndexID : true, size : 6142800, max : 10000 } ){ munotes.in
50 Big Data Analytics and Visualization Lab 50 "ok" : 0, "errmsg" : "BSON field 'create.autoIndexID' is an unknown field.", "code" : 40415, "codeName" : "Location40415" } > In MongoDB, you don't need to create collection. MongoDB creates collection automatically, when you insert some document. >db.tutorialspoint.insert({"name" : "tutorialspoint"}), WriteResult({ "nInserted" : 1 }) >show collections mycol mycollection system.indexes tutorialspoint > The drop() Method MongoDB's db.collection.drop() is used to drop a collection from the database. Syntax Basic syntax of drop() command is as follows − db.COLLECTION_NAME.drop() Example First, check the available collections into your database mydb. >use mydb switched to db mydb >show collections mycol mycollection system.indexes tutorialspoint > Now drop the collection with the name mycollection. >db.mycollection.drop() true > Again check the list of collections into database. munotes.in
51 MONGO DB >show collections mycol system.indexes tutorialspoint > drop() method will return true, if the selected collection is dropped successfully, otherwise it will return false. 3.5 CRUD DOCUMENT As we know, we can use MongoDB for a variety of purposes such as building an application (including web and mobile), data analysis, or as an administrator of a MongoDB database. In all of these cases, we must interact with the MongoDB server to perform specific operations such as entering new data into the application, updating data in the application, deleting data from the application, and reading the data of the application. MongoDB provides a set of simple yet fundamental operations known as CRUD operations that will allow you to quickly interact with the MongoDB server. 3.5.1 Create Operations — these operations are used to insert or add new documents to the collection. If a collection does not exist, a new collection will be created in the database. MongoDB provides the following methods for performing and creating operations: Method Description db.collection.insertOne() It is used to insert a single document in the collection. db.collection.insertMany() It is used to insert multiple documents in the collection. insertOne() As the namesake, insertOne() allows you to insert one document into the collection. For this example, we’re going to work with a collection called munotes.in
52 Big Data Analytics and Visualization Lab 52 RecordsDB. We can insert a single entry into our collection by calling the insertOne() method on RecordsDB. We then provide the information we want to insert in the form of key-value pairs, establishing the Schema. Example: db.RecordsDB.insertOne({ name: "Marsh", age: "6 years", species: "Dog", ownerAddress: "380 W. Fir Ave", chipped: true }) If the create operation is successful, a new document is created. The function will return an object where “acknowledged” is “true” and “insertID” is the newly created “ObjectId.” > db.RecordsDB.insertOne({ ... name: "Marsh", ... age: "6 years", ... species: "Dog", ... ownerAddress: "380 W. Fir Ave", ... chipped: true ... }) { "acknowledged" : true, "insertedId" : ObjectId("5fd989674e6b9ceb8665c57d") } insertMany() It’s possible to insert multiple items at one time by calling the insertMany() method on the desired collection. In this case, we pass multiple items into our chosen collection (RecordsDB) and separate them by commas. Within the parentheses, we use brackets to indicate that we are passing in a list of multiple entries. This is commonly referred to as a nested method. munotes.in
53 MONGO DB Example: db.RecordsDB.insertMany([{ name: "Marsh", age: "6 years", species: "Dog", ownerAddress: "380 W. Fir Ave", chipped: true}, {name: "Kitana", age: "4 years", species: "Cat", ownerAddress: "521 E. Cortland", chipped: true}]) db.RecordsDB.insertMany([{ name: "Marsh", age: "6 years", species: "Dog", ownerAddress: "380 W. Fir Ave", chipped: true}, {name: "Kitana", age: "4 years", species: "Cat", ownerAddress: "521 E. Cortland", chipped: true}]) { "acknowledged" : true, "insertedIds" : [ ObjectId("5fd98ea9ce6e8850d88270b4"), ObjectId("5fd98ea9ce6e8850d88270b5") ] } 3.5.2 Read Operations You can give specific query filters and criteria to the read operations to indicate the documents you desire. More information on the possible query filters can be found in the MongoDB manual. Query modifiers can also be used to vary the number of results returned. MongoDB offers two ways to read documents from a collection: munotes.in
● db.collection.find()
● db.collection.findOne()

find()
In order to get all the documents from a collection, we can simply use the find() method on our chosen collection. Executing the find() method with no arguments returns all records currently in the collection.
db.RecordsDB.find()

findOne()
In order to get one document that satisfies the search criteria, we can use the findOne() method on our chosen collection. If multiple documents satisfy the query, this method returns the first document according to the natural order, which reflects the order of documents on the disk. If no documents satisfy the search criteria, the function returns null. The function takes the following form:
db.{collection}.findOne({query}, {projection})

3.5.3 Update Operations
Update operations, like create operations, work on a single collection and are atomic at the document level. Filters and criteria are used to choose the documents to be updated during an update procedure. You should be cautious while altering documents, since alterations are permanent and cannot be reversed. This also applies to remove operations. There are three techniques for updating documents in MongoDB CRUD:
● db.collection.updateOne()
● db.collection.updateMany()
● db.collection.replaceOne()

updateOne()
With an update procedure we may edit a single document and update an existing record. To do this, we use the updateOne() method on a specified collection, in this case "RecordsDB." To update a document, we pass two arguments to the method: an update filter and an update action. The update filter specifies which items should be updated, and the update action specifies how those items should be updated. We start with the update filter, then we use the "$set" key and supply the values for the fields we wish to change. This method updates the first record that matches the specified filter.
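For illustration, an updateOne() call on the RecordsDB collection used earlier might look like the following in the mongo shell. The field values here are made up for the example, and the matched/modified counts depend on your own data:

db.RecordsDB.updateOne(
    { name: "Marsh" },
    { $set: { ownerAddress: "451 W. Coffee St." } }
)
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }

The same filter-plus-action pattern carries over to updateMany(), described next, which applies the change to every matching document instead of only the first.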
55 MONGO DB updateMany() updateMany() allows us to update multiple items by passing in a list of items, just as we did when inserting multiple items. This update operation uses the same syntax for updating a single document. replaceOne() The replaceOne() method is used to replace a single document in the specified collection. replaceOne() replaces the entire document, meaning fields in the old document not contained in the new will be lost. 3.5.4 Delete Operations Delete operations, like update and create operations, work on a single collection. For a single document, delete actions are similarly atomic. You can provide delete actions with filters and criteria to indicate which documents from a collection you want to delete. The filter options use the same syntax as the read operations. MongoDB provides two ways for removing records from a collection: ● db.collection.deleteOne() ● db.collection.deleteMany() deleteOne() deleteOne() is used to remove a document from a specified collection on the MongoDB server. A filter criterion is used to specify the item to delete. It deletes the first record that matches the provided filter. deleteMany() deleteMany() is a method used to delete multiple documents from a desired collection with a single delete operation. A list is passed into the method and the individual items are defined with filter criteria as in deleteOne() 3.6 Let us sum up ● MongoDB is an open-source database that uses a document-oriented data model and a non-structured query language ● It is one of the most powerful NoSQL systems and databases around, today. ● The data model that MongoDB follows is a highly elastic one that lets you combine and store data of multivariate types without having to compromise on the powerful indexing options, data access, and validation rules. ● A group of database documents can be called a collection. The RDBMS equivalent to a collection is a table. The entire collection exists within a single database. There are no schemas when it comes munotes.in
3.6 Let us sum up
● MongoDB is an open-source database that uses a document-oriented data model and a non-structured query language.
● It is one of the most powerful NoSQL systems and databases around today.
● The data model that MongoDB follows is a highly elastic one that lets you combine and store data of multivariate types without having to compromise on the powerful indexing options, data access, and validation rules.
● A group of database documents can be called a collection. The RDBMS equivalent of a collection is a table. The entire collection exists within a single database. There are no schemas when it comes to collections: inside a collection, different documents can have different fields, but the documents within a collection are usually meant for the same purpose or for serving the same end goal.
● A set of key-value pairs can be designated as a document. Documents are associated with dynamic schemas. The benefit of having dynamic schemas is that the documents in a single collection do not have to possess the same structure or fields. Also, the common fields in a collection's documents can hold varied types of data.
3.7 WEBSITE REFERENCES
1. https://www.mongodb.com
2. https://www.tutorialspoint.com
3. https://www.w3schools.com
4. https://www.educba.com
4
HIVE
Unit Structure :
4.0 Objectives
4.1 Introduction
4.2 Summary
4.3 References
4.4 Unit End Exercises
4.0 OBJECTIVE
Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop and is designed to work quickly on petabytes of data.
4.1 INTRODUCTION
Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook. Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which are internally converted into MapReduce jobs. Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).
Features of Hive
These are the main features of Hive:
● Hive is fast and scalable.
● It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or Spark jobs.
● It is capable of analyzing large datasets stored in HDFS.
● It allows different storage types such as plain text, RCFile, and HBase.
● It uses indexing to accelerate queries.
● It can operate on compressed data stored in the Hadoop ecosystem.
● It supports user-defined functions (UDFs) through which users can plug in their own functionality.
Limitations of Hive
● Hive is not capable of handling real-time data.
● It is not designed for online transaction processing.
● Hive queries have high latency.
Hive is a database technology that can define databases and tables to analyze structured data. The theme of structured data analysis is to store the data in a tabular manner and pass queries to analyze it. This chapter explains how to create a Hive database. Hive contains a default database named default.
Create Database Statement
Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause which notifies the user if a database with the same name already exists. We can use SCHEMA in place of DATABASE in this command. The following query is executed to create a database named userdb:
hive> CREATE DATABASE [IF NOT EXISTS] userdb;
or
hive> CREATE SCHEMA userdb;
The following query is used to verify the list of databases:
hive> SHOW DATABASES;
default
userdb
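To complement the statements above, the sketch below shows how one might switch to the new database and later remove it; both are standard HiveQL statements and are shown purely for illustration.
hive> USE userdb;
hive> DROP DATABASE IF EXISTS userdb CASCADE;   -- CASCADE also drops any tables inside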
JDBC Program
The JDBC program to create a database is given below. (This example uses the legacy HiveServer1 driver; on HiveServer2 installations the driver class is org.apache.hive.jdbc.HiveDriver and the URL scheme is jdbc:hive2://.)

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDb {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register the Hive JDBC driver
      Class.forName(driverName);

      // Get a connection to the default database
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      // Execute the DDL statement
      stmt.execute("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");

      con.close();
   }
}
Save the program in a file named HiveCreateDb.java. The following commands are used to compile and execute this program.
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
Database userdb created successfully.
Hive - Partitioning
Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of partitioning columns such as date, city, or department. Using partitions, it is easy to query only a portion of the data. Tables or partitions can be further sub-divided into buckets, which provide extra structure that may be used for more efficient querying. Bucketing works on the value of a hash function applied to some column of the table.
For example, suppose a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining), and you need to retrieve the details of all employees who joined in 2012. An ordinary query searches the whole table for the required information. However, if you partition the employee data by year and store each year in a separate file, the query processing time is reduced. The following example shows how to partition a file and its data; a minimal HiveQL sketch of declaring such a partitioned table follows the listing.
The following file contains the employeedata table.
/tab1/employeedata/file1
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
The above data is partitioned into two files using the year.
/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012
/tab1/employeedata/2013/file3
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
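The layout above can be expressed directly in HiveQL. The sketch below is illustrative only: the table and column names mirror the example data, and the LOAD path is assumed.
-- Declare a table partitioned on the year of joining (yoj)
hive> CREATE TABLE employeedata (id INT, name STRING, dept STRING)
    > PARTITIONED BY (yoj INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Load one year's file into its own partition (path is assumed)
hive> LOAD DATA INPATH '/tab1/employeedata/2012/file2'
    > INTO TABLE employeedata PARTITION (yoj=2012);
-- A query that filters on the partition column reads only that partition
hive> SELECT * FROM employeedata WHERE yoj = 2012;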
Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have a table called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;

partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
The following query is used to add a partition to the employee table.
hive> ALTER TABLE employee
    > ADD PARTITION (year='2012')
    > LOCATION '/2012/part2012';
Renaming a Partition
The syntax of this command is as follows.
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
The following query renames the partition with year '1203' to '1204':
hive> ALTER TABLE employee PARTITION (year='1203')
    > RENAME TO PARTITION (year='1204');
Dropping a Partition
The following syntax is used to drop a partition:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec, ...;
The following query is used to drop a partition:
hive> ALTER TABLE employee DROP [IF EXISTS]
    > PARTITION (year='1203');
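After altering partitions it is often useful to confirm which partitions exist. SHOW PARTITIONS is a standard Hive statement; the table name simply reuses the employee example above. The output lists one line per partition spec, for example year=2012, and depends on which of the statements above were actually run.
hive> SHOW PARTITIONS employee;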
4.2 SUMMARY
Hive - Built-in Functions
Here we explain the built-in functions available in Hive. The functions look quite similar to SQL functions, except in their usage.
Built-In Functions
Hive supports the following built-in functions (return type, signature, description):
BIGINT round(double a): Returns the rounded BIGINT value of the double.
BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
Double rand(), rand(int seed): Returns a random number that changes from row to row.
String concat(string A, string B, ...): Returns the string resulting from concatenating B after A.
String substr(string A, int start): Returns the substring of A starting from the start position till the end of string A.
String substr(string A, int start, int length): Returns the substring of A starting from the start position with the given length.
String upper(string A): Returns the string resulting from converting all characters of A to upper case.
String ucase(string A): Same as upper().
String lower(string A): Returns the string resulting from converting all characters of A to lower case.
String lcase(string A): Same as lower().
String trim(string A): Returns the string resulting from trimming spaces from both ends of A.
String ltrim(string A): Returns the string resulting from trimming spaces from the beginning (left hand side) of A.
String rtrim(string A): Returns the string resulting from trimming spaces from the end (right hand side) of A.
String regexp_replace(string A, string B, string C): Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
Int size(Map<K.V>): Returns the number of elements in the map type.
Int size(Array<T>): Returns the number of elements in the array type.
value of <type> cast(<expr> as <type>): Converts the result of the expression expr to <type>; e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
String from_unixtime(int unixtime): Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string
representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
String to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
Int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
Int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
Int day(string date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
String get_json_object(string json_string, string path): Extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It returns NULL if the input json string is invalid.
Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
Aggregate Functions
Hive supports the following built-in aggregate functions. These functions are used in the same way as the SQL aggregate functions.
BIGINT count(*), count(expr): count(*) returns the total number of retrieved rows; count(expr) returns the number of rows for which the supplied expression is non-NULL.
DOUBLE sum(col), sum(DISTINCT col): Returns the sum of the elements in the group, or the sum of the distinct values of the column in the group.
DOUBLE avg(col), avg(DISTINCT col): Returns the average of the elements in the group, or the average of the distinct values of the column in the group.
DOUBLE min(col): Returns the minimum value of the column in the group.
DOUBLE max(col): Returns the maximum value of the column in the group.
A short usage sketch of these aggregates is given after the list of operator types below.
Hive - Built-in Operators
Here we explain the operators available in Hive. There are four types of operators in Hive:
● Relational Operators
● Arithmetic Operators
● Logical Operators
● Complex Operators
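As noted above, here is a minimal usage sketch of the aggregate functions. It reuses the employee table and Salary column from the examples in this chapter; grouping by Dept is only illustrative.
hive> SELECT Dept, count(*), avg(Salary), max(Salary)
    > FROM employee
    > GROUP BY Dept;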
Relational Operators
These operators are used to compare two operands. The following list describes the relational operators available in Hive (operator, operand types, description):
A = B (all primitive types): TRUE if expression A is equivalent to expression B, otherwise FALSE.
A != B (all primitive types): TRUE if expression A is not equivalent to expression B, otherwise FALSE.
A < B (all primitive types): TRUE if expression A is less than expression B, otherwise FALSE.
A <= B (all primitive types): TRUE if expression A is less than or equal to expression B, otherwise FALSE.
A > B (all primitive types): TRUE if expression A is greater than expression B, otherwise FALSE.
A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
A IS NULL (all types): TRUE if expression A evaluates to NULL, otherwise FALSE.
A IS NOT NULL (all types): FALSE if expression A evaluates to NULL, otherwise TRUE.
A LIKE B (strings): TRUE if string pattern A matches B, otherwise FALSE.
A RLIKE B (strings): NULL if A or B is NULL; TRUE if any substring of A matches the Java regular expression B, otherwise FALSE.
A REGEXP B (strings): Same as RLIKE.
Example
Let us assume the employee table is composed of fields named Id, Name, Salary, Designation, and Dept as shown below. Generate a query to retrieve the employee details whose Id is 1205.
+------+--------------+--------+--------------------+--------+
| Id   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1201 | Gopal        | 45000  | Technical manager  | TP     |
| 1202 | Manisha      | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali  | 40000  | Technical writer   | TP     |
| 1204 | Krian        | 40000  | Hr Admin           | HR     |
| 1205 | Kranthi      | 30000  | Op Admin           | Admin  |
+------+--------------+--------+--------------------+--------+
The following query is executed to retrieve the employee details using the above table:
hive> SELECT * FROM employee WHERE Id=1205;
On successful execution of the query, you get to see the following response:
+------+--------------+--------+--------------------+--------+
| Id   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1205 | Kranthi      | 30000  | Op Admin           | Admin  |
+------+--------------+--------+--------------------+--------+
The following query is executed to retrieve the employee details whose salary is more than or equal to Rs 40000.
hive> SELECT * FROM employee WHERE Salary>=40000;
On successful execution of the query, you get to see the following response:
+------+--------------+--------+--------------------+--------+
| Id   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1201 | Gopal        | 45000  | Technical manager  | TP     |
| 1202 | Manisha      | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali  | 40000  | Technical writer   | TP     |
| 1204 | Krian        | 40000  | Hr Admin           | HR     |
+------+--------------+--------+--------------------+--------+
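The remaining relational operators are used in exactly the same way. The queries below are an illustrative sketch against the same employee table; the 'K%' pattern and the NULL check are only examples.
hive> SELECT * FROM employee WHERE Name LIKE 'K%';
hive> SELECT * FROM employee WHERE Designation IS NOT NULL;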
Arithmetic Operators
These operators support the common arithmetic operations on operands. All of them return number types. The following list describes the arithmetic operators available in Hive (operator, operand types, description):
A + B (all number types): Gives the result of adding A and B.
A - B (all number types): Gives the result of subtracting B from A.
A * B (all number types): Gives the result of multiplying A and B.
A / B (all number types): Gives the result of dividing A by B.
A % B (all number types): Gives the remainder resulting from dividing A by B.
A & B (all number types): Gives the result of bitwise AND of A and B.
A | B (all number types): Gives the result of bitwise OR of A and B.
A ^ B (all number types): Gives the result of bitwise XOR of A and B.
~A (all number types): Gives the result of bitwise NOT of A.
Example
The following query adds two numbers, 20 and 30.
hive> SELECT 20+30 ADD FROM temp;
On successful execution of the query, you get to see the following response:
+--------+
|  ADD   |
+--------+
|   50   |
+--------+
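The modulus and bitwise operators follow the same pattern. The query below is a small illustrative sketch; it reuses the temp table from the example above, and the alias names are arbitrary. Both expressions evaluate to 2 for each row of temp.
hive> SELECT 17 % 5 MODVAL, 6 & 3 ANDVAL FROM temp;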
Logical Operators
These operators form logical expressions. All of them return either TRUE or FALSE.
A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE.
A && B (boolean): Same as A AND B.
A OR B (boolean): TRUE if either A or B or both are TRUE, otherwise FALSE.
A || B (boolean): Same as A OR B.
NOT A (boolean): TRUE if A is FALSE, otherwise FALSE.
!A (boolean): Same as NOT A.
Example
The following query is used to retrieve the details of employees whose Department is TP and whose Salary is more than Rs 40000.
hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';
On successful execution of the query, you get to see the following response:
+------+--------------+--------+--------------------+--------+
| Id   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1201 | Gopal        | 45000  | Technical manager  | TP     |
+------+--------------+--------+--------------------+--------+
Complex Operators
These operators provide an expression to access the elements of complex types.
A[n] (A is an array and n is an int): Returns the nth element in the array A. The first element has index 0.
M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map.
S.x (S is a struct): Returns the x field of S.
Hive - Views and Indexes
Views are generated based on user requirements. You can save any result set data as a view. The usage of views in Hive is the same as that of views in SQL; it is a standard RDBMS concept. We can execute all DML operations on a view.
Creating a View
You can create a view at the time of executing a SELECT statement. The syntax is as follows:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...
Example
Let us take an example of a view. Assume the employee table as given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
+------+--------------+--------+--------------------+--------+
| Id   | Name         | Salary | Designation        | Dept   |
+------+--------------+--------+--------------------+--------+
| 1201 | Gopal        | 45000  | Technical manager  | TP     |
| 1202 | Manisha      | 45000  | Proofreader        | PR     |
| 1203 | Masthanvali  | 40000  | Technical writer   | TP     |
| 1204 | Krian        | 40000  | Hr Admin           | HR     |
| 1205 | Kranthi      | 30000  | Op Admin           | Admin  |
+------+--------------+--------+--------------------+--------+
The following query creates the view for the above scenario:
hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE Salary>30000;
Dropping a View
Use the following syntax to drop a view:
DROP VIEW view_name
The following query drops the view named emp_30000:
hive> DROP VIEW emp_30000;
Creating an Index
An index is nothing but a pointer on a particular column of a table; creating an index means creating such a pointer. Its syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[ [ ROW FORMAT ...] STORED AS ... | STORED BY ... ]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Example
Let us take an example of an index. Use the same employee table used earlier, with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.
The following query creates the index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column is modified, the changes are stored using an index value.
Dropping an Index
The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>
The following query drops the index named index_salary:
hive> DROP INDEX index_salary ON employee;
4.3 REFERENCES
● “The Visual Display of Quantitative Information” by Edward R. ...
● “Storytelling With Data: A Data Visualization Guide for Business Professionals” by Cole Nussbaumer Knaflic.
● “Data Visualization – A Practical Introduction” by Kieran Healy.
4.4 UNIT END EXERCISES
Write a program to perform table operations such as creating, altering, and dropping tables in Hive.
5
PIG
Unit Structure :
5.0 Objectives
5.1 Introduction
5.2 Summary
5.3 References
5.4 Unit End Exercises
5.0 OBJECTIVE
Pig is an open-source, high-level data flow system. It provides a simple language called Pig Latin for queries and data manipulation, which is then compiled into MapReduce jobs that run on Hadoop.
5.1 INTRODUCTION
Pig is important because companies like Yahoo, Google and Microsoft collect huge data sets in the form of click streams, search logs and web crawls. Pig is also used for ad-hoc processing and analysis of all this information.
Why Do You Need Pig?
● It is easy to learn, especially if you are familiar with SQL.
● Pig's multi-query approach reduces the number of times data is scanned. This means roughly 1/20th the lines of code and 1/16th the development time compared with writing raw MapReduce.
● Performance of Pig is on par with raw MapReduce.
● Pig provides data operations like filters, joins and ordering, and nested data types like tuples, bags, and maps, which are missing from MapReduce.
● Pig Latin is easy to write and read.
Why Was Pig Created?
Pig was originally developed by Yahoo in 2006 so that researchers would have an ad-hoc way of creating and executing MapReduce jobs on very large data sets. It was created to reduce development time through its multi-query approach. Pig was also created for professionals from non-Java backgrounds, to make their job easier.
Where Should Pig Be Used?
Pig can be used in the following scenarios:
● When data loads are time sensitive.
● When processing various data sources.
● When analytical insights are required through sampling.
Pig Latin – Basics
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter we discuss the basics of Pig Latin, such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDFs.
Pig Latin – Data Model
As discussed in the previous chapters, the data model of Pig is fully nested. A relation is the outermost structure of the Pig Latin data model, and it is a bag, where:
● A bag is a collection of tuples.
● A tuple is an ordered set of fields.
● A field is a piece of data.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
● These statements work with relations. They include expressions and schemas.
● Every statement ends with a semicolon (;).
● We perform various operations through statements, using the operators provided by Pig Latin.
● Except for LOAD and STORE, all other Pig Latin statements take a relation as input and produce another relation as output.
● As soon as you enter a LOAD statement in the Grunt shell, its semantic checking is carried out. To see the contents of the relation, you need to use the DUMP operator; only after performing the dump operation is the MapReduce job for loading the data from the file system actually carried out.
Example
Given below is a Pig Latin statement, which loads data into Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') AS
   ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Pig Latin – Data Types
The list below describes the Pig Latin data types.
1. int: Represents a signed 32-bit integer. Example: 8
2. long: Represents a signed 64-bit integer. Example: 5L
3. float: Represents a signed 32-bit floating point. Example: 5.5F
4. double: Represents a 64-bit floating point. Example: 10.5
5. chararray: Represents a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
6. bytearray: Represents a byte array (blob).
7. boolean: Represents a Boolean value. Example: true/false.
8. datetime: Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger: Represents a Java BigInteger. Example: 60708090709
10. bigdecimal: Represents a Java BigDecimal. Example: 185.98376256272893883
Complex Types
11. tuple: A tuple is an ordered set of fields. Example: (raja, 30)
12. bag: A bag is a collection of tuples. Example: {(raju,30),(Mohammad,45)}
13. map: A map is a set of key-value pairs. Example: [ 'name'#'Raju', 'age'#30 ]
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a similar way to SQL. A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
Pig Latin – Arithmetic Operators
The following list describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.
+ (Addition): Adds values on either side of the operator. Example: a + b gives 30.
- (Subtraction): Subtracts the right-hand operand from the left-hand operand. Example: a - b gives -10.
* (Multiplication): Multiplies values on either side of the operator. Example: a * b gives 200.
/ (Division): Divides the left-hand operand by the right-hand operand. Example: b / a gives 2.
% (Modulus): Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a gives 0.
? : (Bincond): Evaluates a Boolean condition. It has three operands, as shown below.
variable x = (expression) ? value1 if true : value2 if false
Example: b = (a == 1) ? 20 : 30;  if a = 1 the value of b is 20; if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END (Case): The case operator is equivalent to a nested bincond operator.
Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
Pig Latin – Comparison Operators
The following list describes the comparison operators of Pig Latin (again with a = 10 and b = 20).
== (Equal): Checks whether the values of the two operands are equal; if yes, the condition becomes true. Example: (a == b) is not true.
!= (Not Equal): Checks whether the values of the two operands are unequal; if they are not equal, the condition becomes true. Example: (a != b) is true.
> (Greater than): Checks whether the value of the left operand is greater than the value of the right operand. If yes, the condition becomes true. Example: (a > b) is not true.
< (Less than): Checks whether the value of the left operand is less than the value of the right operand. If yes, the condition becomes true. Example: (a < b) is true.
>= (Greater than or equal to): Checks whether the value of the left operand is greater than or equal to the value of the right operand. If yes, the condition becomes true. Example: (a >= b) is not true.
<= (Less than or equal to): Checks whether the value of the left operand is less than or equal to the value of the right operand. If yes, the condition becomes true. Example: (a <= b) is true.
matches (Pattern matching): Checks whether the string on the left-hand side matches the constant (regular expression) on the right-hand side. Example: f1 matches '.*tutorial.*'
Pig Latin – Type Construction Operators
The following list describes the type construction operators of Pig Latin.
() (Tuple constructor operator): Used to construct a tuple. Example: (Raju, 30)
{} (Bag constructor operator): Used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[] (Map constructor operator): Used to construct a map. Example: [name#Raja, age#30]
Pig Latin – Relational Operations
The following list describes the relational operators of Pig Latin.
Loading and Storing
LOAD: To load data from the file system (local/HDFS) into a relation.
STORE: To save a relation to the file system (local/HDFS).
Filtering
FILTER: To remove unwanted rows from a relation.
DISTINCT: To remove duplicate rows from a relation.
FOREACH, GENERATE: To generate data transformations based on columns of data.
STREAM: To transform a relation using an external program.
Grouping and Joining
JOIN: To join two or more relations.
COGROUP: To group the data in two or more relations.
GROUP: To group the data in a single relation.
CROSS: To create the cross product of two or more relations.
Sorting
ORDER: To arrange a relation in sorted order based on one or more fields (ascending or descending).
LIMIT: To get a limited number of tuples from a relation.
Combining and Splitting
UNION: To combine two or more relations into a single relation.
SPLIT: To split a single relation into two or more relations.
Diagnostic Operators
DUMP: To print the contents of a relation on the console.
DESCRIBE: To describe the schema of a relation.
EXPLAIN: To view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE: To view the step-by-step execution of a series of statements.
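The short script below is an illustrative sketch of how several of these operators fit together. It reuses the Student_data relation loaded earlier in this chapter; the file name, schema, and the 'Mumbai' filter value are assumptions for the example.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') AS
   (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> city_students = FILTER Student_data BY city == 'Mumbai';
grunt> by_city = GROUP Student_data BY city;
grunt> counts = FOREACH by_city GENERATE group, COUNT(Student_data);
grunt> ordered = ORDER counts BY $1 DESC;
grunt> DUMP ordered;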
Apache Pig - Grunt Shell
After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition, the Grunt shell provides certain useful shell and utility commands. This chapter explains the shell and utility commands provided by the Grunt shell.
Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke shell commands using sh and fs.
sh Command
Using the sh command, we can invoke shell commands from the Grunt shell. However, we cannot execute commands that are part of the shell environment itself (e.g. cd).
Syntax
Given below is the syntax of the sh command.
grunt> sh shell command parameters
Example
We can invoke the ls command of the Linux shell from the Grunt shell using the sh option as shown below. In this example, it lists the files in the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell command from the Grunt shell.
Syntax
Given below is the syntax of the fs command.
grunt> fs File System command parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using the fs command. In the following example, it lists the files in the HDFS root directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx   - Hadoop supergroup          0 2015-09-08 14:13 Hbase
drwxr-xr-x   - Hadoop supergroup          0 2015-09-09 14:52 seqgen_data
drwxr-xr-x   - Hadoop supergroup          0 2015-09-08 11:30 twitter_data
In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.
Utility Commands
The Grunt shell provides a set of utility commands. These include commands such as clear, help, history, quit, and set, and commands such as exec, kill, and run to control Pig from the Grunt shell. Given below is a description of the utility commands provided by the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
You can clear the screen of the Grunt shell using the clear command as shown below.
grunt> clear
help Command
The help command gives you a list of Pig commands and Pig properties.
Usage
You can get a list of Pig commands using the help command as shown below.
grunt> help
Commands: <pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig
File system commands: fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic Commands: describe <alias>[::<nested_alias>] - Show the schema for the alias.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.
   -script - Explain the entire script.
   -out - Store the output into directory rather than print to stdout.
   -brief - Don't expand nested plans (presenting a smaller graph for overview).
   -dot - Generate the output in .dot format. Default is text format.
   -xml - Generate the output in .xml format. Default is text format.
   -param <param_name> - See parameter substitution for details.
   <alias> - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.
Utility Commands: exec [-param <param_name>=<param_value>] [-param_file <file_name>]