Hadoop 3.0 cluster - installation, configuration, tests on CentOS 7


This article presents, step by step, how to create a Hadoop cluster with one name node and three slaves.

As a starting point, there is one VM (provided by the hardware administrator) with the following properties:

       
10.242.5.88
root/123456Qw
 

       
# cat /etc/centos-release
CentOS Linux release 7.1.1503 (Core)

# uname -a
Linux ders-hadoop1 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 
x86_64 x86_64 GNU/Linux
       
 


Storage information:
       
Filesystem                 Size         Used Avail           Use% Mounted on
/dev/mapper/centos-root    50G         914M   50G            2% /
devtmpfs                  1,9G            0  1,9G            0% /dev
tmpfs                     1,9G            0  1,9G            0% /dev/shm
tmpfs                     1,9G         8,3M  1,9G            1% /run
tmpfs                     1,9G            0  1,9G            0% /sys/fs/cgroup
/dev/mapper/centos-home    73G          33M   73G            1% /home
/dev/sda1                 497M         119M  379M           24% /boot
       
 

CPU information:

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz

 

RAM information:

# cat /proc/meminfo
MemTotal:        3876056 kB


and the network interface properties:
       
# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.242.5.88  netmask 255.255.255.0  broadcast 10.242.5.255
        inet6 fe80::215:5dff:fe04:6f05  prefixlen 64  scopeid 0x20<link>
        ether 00:15:5d:04:6f:05  txqueuelen 1000  (Ethernet)
        RX packets 46021  bytes 42368623 (40.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14164  bytes 1063856 (1.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# hostname
ders-hadoop1

 

The next step is to set the IPs and hostnames in /etc/hosts and install Java 8.

# vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.242.5.88 hadoop-master hadoop-master
10.242.5.89 hadoop-slave-1 hadoop-slave-1
10.242.5.90 hadoop-slave-2 hadoop-slave-2
10.242.5.91 hadoop-slave-3 hadoop-slave-3

# vi /etc/hostname

and write hadoop-master there.

Restart the network:
# systemctl restart network.service

and reboot the VM:
# reboot


Now we have the correct hostname

# hostname
hadoop-master
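
As a side note, CentOS 7 also ships hostnamectl, which sets the hostname in one step, with no manual edit of /etc/hostname needed:

# hostnamectl set-hostname hadoop-master
# hostnamectl status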


Now we are ready to install Java.

cd /opt/
wget --no-cookies --no-check-certificate \
  --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" \
  "http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/jdk-8u161-linux-x64.tar.gz"

or http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz

# ls -lh
total 181M
-rw-r--r-- 1 root root 181M jdk-8u161-linux-x64.tar.gz

tar xzf jdk-8u161-linux-x64.tar.gz

If you don't have wget, install it with: yum install wget


After extracting the archive, use the alternatives command to register the new JDK. The alternatives command is provided by the chkconfig package.


# cd /opt/jdk1.8.0_161/
# alternatives --install /usr/bin/java java /opt/jdk1.8.0_161/bin/java 2
# alternatives --config java

There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /opt/jdk1.8.0_161/bin/java

# alternatives --install /usr/bin/jar jar /opt/jdk1.8.0_161/bin/jar 2
# alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_161/bin/javac 2
# alternatives --set jar /opt/jdk1.8.0_161/bin/jar
# alternatives --set javac /opt/jdk1.8.0_161/bin/javac

# java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

# vi /etc/profile
and add the following lines:

export JAVA_HOME=/opt/jdk1.8.0_161
export JRE_HOME=/opt/jdk1.8.0_161/jre
export PATH=$PATH:/opt/jdk1.8.0_161/bin:/opt/jdk1.8.0_161/jre/bin

pathmunge () {
...

and reboot the server.
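
If you want to check the variables before (or instead of) the reboot, re-read the profile in the current shell:

# source /etc/profile
# echo $JAVA_HOME
/opt/jdk1.8.0_161
# java -version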

Add a hadoop user:

# useradd hadoop
# passwd hadoop
Changing password for user hadoop.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.


Configuring Key-Based Login

The hadoop user must be able to ssh to itself without a password.
# su - hadoop
[hadoop@hadoop-master ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:UT1YjUi5WO9ldVq2VN7/6KzPZMKMKa08JVVYo/eKhIY hadoop@hadoop-master
The key's randomart image is:
+---[RSA 2048]----+
|         .oO+o  o|
|         .*.=..o*|
|        .o.+...==|
|       ..oo...+..|
|      E S... o. .|
|       ..o.*.. ..|
|        .o= = + .|
|       ..o   B   |
|        o.  .o=  |
+----[SHA256]-----+
[hadoop@hadoop-master ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'hadoop-master (10.242.5.88)' can't be established.
ECDSA key fingerprint is SHA256:3i32OhdNiKfXfaHUGHQP5dfb+9YHkDbjajRxYKmp8Do.
ECDSA key fingerprint is MD5:05:24:e7:9b:2f:a7:c4:9b:2a:ca:85:96:7b:67:1b:bc.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@hadoop-master's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop@hadoop-master'"
and check to make sure that only the key(s) you wanted were added.
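
A quick way to confirm that key-based login works is to run a remote command; it should print the hostname without asking for a password:

[hadoop@hadoop-master ~]$ ssh hadoop@hadoop-master hostname
hadoop-master
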
Download and Extract Hadoop

# cd /opt
# wget http://apache-mirror.rbc.ru/pub/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
# tar -xzf hadoop-3.0.0.tar.gz
# mv hadoop-3.0.0 hadoop
# chown -R hadoop /opt/hadoop
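
A quick listing confirms that the extraction and rename went through; the Hadoop 3 binary distribution should contain at least bin, etc, sbin and share:

# ls /opt/hadoop
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share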

Configure Hadoop

[root@hadoop-master hadoop]# pwd
/opt/hadoop/etc/hadoop
[root@hadoop-master hadoop]# vi core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000/</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
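
Note that fs.default.name is the old spelling of fs.defaultFS; Hadoop 3 still accepts it with a deprecation warning. Once $HADOOP_HOME/bin is on the PATH (see the environment variables step below), hdfs getconf is a convenient way to confirm that the value is picked up:

$ hdfs getconf -confKey fs.defaultFS
hdfs://hadoop-master:9000/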

Configure hdfs-site.xml and directories

# mkdir /opt/hadoop/dfs
# mkdir /opt/hadoop/dfs/name
# mkdir /opt/hadoop/dfs/data
# pwd
/opt/hadoop/etc/hadoop
# vi hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/opt/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Configure mapred-site.xml

# vi mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>

Edit the hadoop-env.sh file

# vi hadoop-env.sh

# Set Hadoop-specific environment variables here.

export JAVA_HOME=/opt/jdk1.8.0_161
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop


Now I ask our administrator to make 3 clones of the VM with the IPs:

10.242.5.89
10.242.5.90
10.242.5.91
 


On each clone, edit

# vi /etc/hostname

and write the correct slave hostname there, then restart the network:

# systemctl restart network.service

Now check that the nodes can ping each other. The next step is to check SSH between them; a quick loop is shown below.
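
A small loop from the master covers both checks at once, assuming the cloned VMs kept the authorized_keys created earlier (the very first connection to each slave will still ask to confirm its host key):

[hadoop@hadoop-master ~]$ for h in hadoop-slave-1 hadoop-slave-2 hadoop-slave-3; do ping -c 1 $h > /dev/null && ssh $h hostname; done
hadoop-slave-1
hadoop-slave-2
hadoop-slave-3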

Configure the slaves and workers files, on the name node only.

# su - hadoop
$ cd /opt/hadoop/etc/hadoop
$ vi workers

and write these 3 lines:
hadoop-slave-1
hadoop-slave-2
hadoop-slave-3

$ cp workers slaves
(Hadoop 3 reads the workers file; the slaves copy is kept for backward compatibility.)
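
The workers/slaves files are read only on the name node, but any later change to the *-site.xml files has to reach every node. Since the layout is identical on all machines, a loop like this (a sketch) keeps the slaves in sync:

$ for h in hadoop-slave-1 hadoop-slave-2 hadoop-slave-3; do rsync -av /opt/hadoop/etc/hadoop/ $h:/opt/hadoop/etc/hadoop/; done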



Setup Environment Variables

# su - hadoop
$ vi ~/.bashrc

export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

and check the hadoop version:

$ source .bashrc
$ hadoop version


Hadoop 3.0.0
Source code repository https://git-wip-us.apache.org/repos/asf/hadoop.git -r c25427ceca461ee979d30edd7a4b0f50718e6533
Compiled by andrew on 2017-12-08T19:16Z
Compiled with protoc 2.5.0
From source with checksum 397832cb5529187dc8cd74ad54ff22
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.0.0.jar


Next, copy .bashrc to all nodes with:

[hadoop@hadoop-master ~]$ rsync .bashrc hadoop@hadoop-slave-1:/home/hadoop
[hadoop@hadoop-master ~]$ rsync .bashrc hadoop@hadoop-slave-2:/home/hadoop
[hadoop@hadoop-master ~]$ rsync .bashrc hadoop@hadoop-slave-3:/home/hadoop



Run this command on all nodes:
# chown -R hadoop /opt/hadoop

and on the master execute (in Hadoop 3 this is a deprecated alias for hdfs namenode -format):
# hadoop namenode -format

STARTUP_MSG:   java = 1.8.0_161
************************************************************/
2018-02-19 16:08:45,031 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2018-02-19 16:08:45,045 INFO namenode.NameNode: createNameNode [-format]
2018-02-19 16:08:46,623 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-02-19 16:08:49,568 WARN common.Util: Path /opt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
2018-02-19 16:08:49,570 WARN common.Util: Path /opt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-34b701e3-4fd1-4680-beb7-3ac01da82a2a
2018-02-19 16:08:49,917 INFO namenode.FSEditLog: Edit logging is async:true
2018-02-19 16:08:50,052 INFO namenode.FSNamesystem: KeyProvider: null
2018-02-19 16:08:50,071 INFO namenode.FSNamesystem: fsLock is fair: true
2018-02-19 16:08:50,078 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2018-02-19 16:08:50,103 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
2018-02-19 16:08:50,103 INFO namenode.FSNamesystem: supergroup          = supergroup
2018-02-19 16:08:50,103 INFO namenode.FSNamesystem: isPermissionEnabled = true
2018-02-19 16:08:50,104 INFO namenode.FSNamesystem: HA Enabled: false
2018-02-19 16:08:50,229 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2018-02-19 16:08:50,270 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
2018-02-19 16:08:50,270 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2018-02-19 16:08:50,294 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2018-02-19 16:08:50,294 INFO blockmanagement.BlockManager: The block deletion will start around 2018 Feb 19 16:08:50
2018-02-19 16:08:50,299 INFO util.GSet: Computing capacity for map BlocksMap
2018-02-19 16:08:50,299 INFO util.GSet: VM type       = 64-bit
2018-02-19 16:08:50,321 INFO util.GSet: 2.0% max memory 916.4 MB = 18.3 MB
2018-02-19 16:08:50,321 INFO util.GSet: capacity      = 2^21 = 2097152 entries
2018-02-19 16:08:50,453 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2018-02-19 16:08:50,471 INFO Configuration.deprecation: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
2018-02-19 16:08:50,471 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2018-02-19 16:08:50,471 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2018-02-19 16:08:50,471 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2018-02-19 16:08:50,472 INFO blockmanagement.BlockManager: defaultReplication         = 2
2018-02-19 16:08:50,472 INFO blockmanagement.BlockManager: maxReplication             = 512
2018-02-19 16:08:50,472 INFO blockmanagement.BlockManager: minReplication             = 1
2018-02-19 16:08:50,472 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
2018-02-19 16:08:50,472 INFO blockmanagement.BlockManager: redundancyRecheckInterval  = 3000ms
2018-02-19 16:08:50,472 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
2018-02-19 16:08:50,472 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
2018-02-19 16:08:50,968 INFO util.GSet: Computing capacity for map INodeMap
2018-02-19 16:08:50,968 INFO util.GSet: VM type       = 64-bit
2018-02-19 16:08:50,969 INFO util.GSet: 1.0% max memory 916.4 MB = 9.2 MB
2018-02-19 16:08:50,969 INFO util.GSet: capacity      = 2^20 = 1048576 entries
2018-02-19 16:08:50,971 INFO namenode.FSDirectory: ACLs enabled? false
2018-02-19 16:08:50,971 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2018-02-19 16:08:50,971 INFO namenode.FSDirectory: XAttrs enabled? true
2018-02-19 16:08:50,971 INFO namenode.NameNode: Caching file names occurring more than 10 times
2018-02-19 16:08:50,989 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true
2018-02-19 16:08:51,004 INFO util.GSet: Computing capacity for map cachedBlocks
2018-02-19 16:08:51,004 INFO util.GSet: VM type       = 64-bit
2018-02-19 16:08:51,005 INFO util.GSet: 0.25% max memory 916.4 MB = 2.3 MB
2018-02-19 16:08:51,005 INFO util.GSet: capacity      = 2^18 = 262144 entries
2018-02-19 16:08:51,053 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2018-02-19 16:08:51,053 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2018-02-19 16:08:51,053 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2018-02-19 16:08:51,084 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2018-02-19 16:08:51,084 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2018-02-19 16:08:51,089 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2018-02-19 16:08:51,089 INFO util.GSet: VM type       = 64-bit
2018-02-19 16:08:51,090 INFO util.GSet: 0.029999999329447746% max memory 916.4 MB = 281.5 KB
2018-02-19 16:08:51,090 INFO util.GSet: capacity      = 2^15 = 32768 entries
2018-02-19 16:08:51,268 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1839426364-10.242.5.88-1519045731213
2018-02-19 16:08:51,421 INFO common.Storage: Storage directory /opt/hadoop/dfs/name has been successfully formatted.
2018-02-19 16:08:51,482 INFO namenode.FSImageFormatProtobuf: Saving image file /opt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2018-02-19 16:08:52,004 INFO namenode.FSImageFormatProtobuf: Image file /opt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds.
2018-02-19 16:08:52,095 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-02-19 16:08:52,120 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/10.242.5.88
************************************************************/



Start the cluster:

If etc/hadoop/workers and SSH trusted access are configured
(see the Hadoop Single Node Setup guide), all of the HDFS processes can be started with a utility script. As the hadoop user:

$ start-dfs.sh

Starting namenodes on [hadoop-master]
Starting datanodes
hadoop-slave-3: WARNING: /opt/hadoop/logs does not exist. Creating.
hadoop-slave-2: WARNING: /opt/hadoop/logs does not exist. Creating.
hadoop-slave-1: WARNING: /opt/hadoop/logs does not exist. Creating.
Starting secondary namenodes [hadoop-master]
2018-02-19 16:14:24,025 WARN util.NativeCodeLoader: 
Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable
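
Before opening the web UI it is worth checking with jps which JVMs actually started: the master should run a NameNode and a SecondaryNameNode, each slave a DataNode (the pids below are illustrative):

[hadoop@hadoop-master ~]$ jps
11216 NameNode
11525 SecondaryNameNode
11788 Jps

[hadoop@hadoop-slave-1 ~]$ jps
10030 DataNode
10164 Jps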




Go to the web UI: http://10.242.5.88:9870 (in Hadoop 3 the NameNode web UI moved from port 50070 to 9870).




$ hdfs dfsadmin -report

2018-02-19 16:45:18,724 WARN util.NativeCodeLoader: 
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Configured Capacity: 160982630400 (149.93 GB)
Present Capacity: 154067443712 (143.49 GB)
DFS Remaining: 154067419136 (143.49 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 10.242.5.89:9866 (hadoop-slave-1)
Hostname: hadoop-slave-1
Decommission Status : Normal
Configured Capacity: 53660876800 (49.98 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 2305089536 (2.15 GB)
DFS Remaining: 51355779072 (47.83 GB)
DFS Used%: 0.00%
DFS Remaining%: 95.70%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 19 16:45:17 MSK 2018
Last Block Report: Mon Feb 19 16:44:14 MSK 2018


Name: 10.242.5.90:9866 (hadoop-slave-2)
Hostname: hadoop-slave-2
Decommission Status : Normal
Configured Capacity: 53660876800 (49.98 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 2304983040 (2.15 GB)
DFS Remaining: 51355885568 (47.83 GB)
DFS Used%: 0.00%
DFS Remaining%: 95.70%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 19 16:45:17 MSK 2018
Last Block Report: Mon Feb 19 16:14:47 MSK 2018


Name: 10.242.5.91:9866 (hadoop-slave-3)
Hostname: hadoop-slave-3
Decommission Status : Normal
Configured Capacity: 53660876800 (49.98 GB)
DFS Used: 8192 (8 KB)
Non DFS Used: 2305114112 (2.15 GB)
DFS Remaining: 51355754496 (47.83 GB)
DFS Used%: 0.00%
DFS Remaining%: 95.70%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 19 16:45:18 MSK 2018
Last Block Report: Mon Feb 19 16:14:23 MSK 2018
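
As a final test, put a small file into HDFS and read it back (the file name is arbitrary); with dfs.replication = 2 each block should end up on two of the three datanodes, which fsck will confirm:

[hadoop@hadoop-master ~]$ echo "hello hadoop" > /tmp/test.txt
[hadoop@hadoop-master ~]$ hdfs dfs -mkdir -p /user/hadoop
[hadoop@hadoop-master ~]$ hdfs dfs -put /tmp/test.txt /user/hadoop/
[hadoop@hadoop-master ~]$ hdfs dfs -cat /user/hadoop/test.txt
hello hadoop
[hadoop@hadoop-master ~]$ hdfs fsck /user/hadoop/test.txt -files -blocks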
