New server for Big Data experiments with LXC containers (Cassandra, Spark, Hadoop)

This article begins a long story about the design and development of a Big Data infrastructure.
References used:
LXC

I have one server with the following properties:


[root]# cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)

[root]# cat /proc/meminfo
MemTotal:       32781268 kB

[root]# cat /proc/cpuinfo
processor       : 4 core
vendor_id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz


My idea is to use LXC for this purpose, starting from the following architecture:



First of all, we need to make some preparations for the LXC infrastructure.



yum update
yum install systemd-services cgroup-bin
yum -y install epel-release
yum -y install lxc lxc-templates lxc-extras dnsmasq-base dnsmasq bridge-utils \
               iptables-services debootstrap perl libvirt

Create the following systemd unit files (their contents are shown below):
1) vi /etc/systemd/system/lxc-net.service
2) vi /etc/systemd/system/lxc-dhcp.service

systemctl enable lxc-net.service
systemctl enable lxc-dhcp.service
systemctl start lxc-net.service
systemctl start lxc-dhcp.service
systemctl enable iptables
systemctl start iptables

cd ~
3) vi lxc-net

chmod +x lxc-net && ./lxc-net
/sbin/service iptables save

4) vi /etc/sysctl.conf
sysctl -p
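
The post does not show the actual change to /etc/sysctl.conf; presumably it enables IP forwarding, which the MASQUERADE rule in the lxc-net script below requires. A minimal sketch (the setting name is standard, the rest is an assumption):

# /etc/sysctl.conf -- assumed addition, not shown in the original post
net.ipv4.ip_forward = 1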



/etc/systemd/system/lxc-net.service

[Unit]
Description=Bridge interface for LXC Containers

[Service]
Type=oneshot

# Bring up bridge interface
ExecStart=/sbin/brctl addbr lxcbr0
ExecStart=/sbin/ip address add 10.0.3.1/24 dev lxcbr0
ExecStart=/sbin/ip link set lxcbr0 up

RemainAfterExit=yes

# Bring bridge interface down
ExecStop=/sbin/ip link set lxcbr0 down
ExecStop=/sbin/brctl delbr lxcbr0


/etc/systemd/system/lxc-dhcp.service

[Unit]
Requires=lxc-net.service
Requires=sys-devices-virtual-net-lxcbr0.device
After=sys-devices-virtual-net-lxcbr0.device

[Service]
ExecStart=/sbin/dnsmasq \
            --dhcp-leasefile=/var/run/lxc-dnsmasq.leases \
            --user=nobody \
            --group=nobody \
            --keep-in-foreground \
            --conf-file=/etc/lxc/dnsmasq.conf \
            --listen-address=10.0.3.1 \
            --except-interface=lo \
            --bind-interfaces \
            --dhcp-range=10.0.3.2,10.0.3.254

[Install]
WantedBy=default.target


~/lxc-net

iptables -I INPUT -i lxcbr0 -p udp --dport 67 -j ACCEPT
iptables -I INPUT -i lxcbr0 -p tcp --dport 67 -j ACCEPT
iptables -I INPUT -i lxcbr0 -p tcp --dport 53 -j ACCEPT
iptables -I INPUT -i lxcbr0 -p udp --dport 53 -j ACCEPT
iptables -I FORWARD -i lxcbr0 -j ACCEPT
iptables -I FORWARD -o lxcbr0 -j ACCEPT
iptables -t nat -A POSTROUTING -s 10.0.3.0/24 ! -d 10.0.3.0/24 -j MASQUERADE
iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill


Checking phase:

[root]# systemctl start libvirtd
[root]# systemctl start lxc.service
[root]# systemctl status lxc.service
● lxc.service - LXC Container Initialization and Autoboot Code
   Loaded: loaded (/usr/lib/systemd/system/lxc.service; disabled; vendor preset: disabled)
   Active: active (exited) since Thu 2018-09-27 09:16:59 MSK; 5min ago
  Process: 10272 ExecStart=/usr/libexec/lxc/lxc-autostart-helper start (code=exited, status=0/SUCCESS)
  Process: 10269 ExecStartPre=/usr/libexec/lxc/lxc-devsetup (code=exited, status=0/SUCCESS)
 Main PID: 10272 (code=exited, status=0/SUCCESS)

Sep 27 09:16:29 db-ders2.moon.lan systemd[1]: Starting LXC Container Initialization and Autoboot Code...
Sep 27 09:16:29 db-ders2.moon.lan lxc-devsetup[10269]: /dev is devtmpfs
Sep 27 09:16:59 db-ders2.moon.lan lxc-autostart-helper[10272]: Starting LXC autoboot containers:  [  OK  ]
Sep 27 09:16:59 db-ders2.moon.lan systemd[1]: Started LXC Container Initialization and Autoboot Code.





[root@db-ders2 ~]# lxc-checkconfig
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-3.10.0-862.11.6.el7.x86_64
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
newuidmap is not installed
newgidmap is not installed
Network namespace: enabled
Multiple /dev/pts instances: enabled

--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
Bridges: enabled
Advanced netfilter: enabled
CONFIG_NF_NAT_IPV4: enabled
CONFIG_NF_NAT_IPV6: enabled
CONFIG_IP_NF_TARGET_MASQUERADE: enabled
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled

--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities: enabled

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig



Disabling SELinux

vi /etc/selinux/config
and set the SELINUX mode to disabled
(otherwise you will get an "Authentication token manipulation error" when changing the password after an LXC container is created), then
reboot

Check after reboot:
[root]# sestatus
SELinux status:                 disabled

One more thing: lxc-ls is part of the lxc-extra package, so try

yum install /usr/bin/lxc-ls

The reason it was moved into a separate package is that it uses python3 and the python3 lxc bindings.


Now everything is ready and we can create and clone containers.
I begin with the scX containers, which will host the Cassandra cluster and the Spark worker nodes. It is possible to install and configure one container and then clone it, fixing hostnames and network settings afterwards.

Creating the first container:

[root]# ls /usr/share/lxc/templates/
lxc-alpine    lxc-archlinux  lxc-centos  lxc-debian    lxc-fedora  lxc-openmandriva  lxc-oracle  lxc-sshd    lxc-ubuntu-cloud
lxc-altlinux  lxc-busybox    lxc-cirros  lxc-download  lxc-gentoo  lxc-opensuse      lxc-plamo   lxc-ubuntu

[root]# lxc-create -n sc1 -t centos
...
...
The temporary root password is stored in:

        '/var/lib/lxc/sc1/tmp_root_pass'


The root password is set up as expired and will require it to be changed
at first login, which you should do as soon as possible.  If you lose the
root password or wish to change it without starting the container, you
can change it from the host by running the following command (which will
also reset the expired flag):

        chroot /var/lib/lxc/sc1/rootfs passwd

Change the root password of sc1 to sctest2018:

[root]# chroot /var/lib/lxc/sc1/rootfs passwd
...
passwd: all authentication tokens updated successfully. (This is the step that fails if SELinux is enabled.)

[root]# lxc-start -d -n sc1

[root]# lxc-ls -f
NAME  STATE    IPV4             IPV6  AUTOSTART
-----------------------------------------------
sc1   RUNNING  192.168.122.192  -     NO

connect to sc1 with ssh:
[root]# ssh root@192.168.122.192

Inside the container, we need to run this command:
ln -s /usr/lib/systemd/system/halt.target /etc/systemd/system/sigpwr.target
(from https://github.com/lxc/lxd/issues/1183)


Configure autostart for the container.

[root]# vi /var/lib/lxc/sc1/config

Add:

lxc.start.auto  = 1 # enabled
lxc.start.delay = 5 # delay in seconds
lxc.start.order = 100 # higher value means starts earlier

Check:
[root@db-ders2 sc1]# lxc-ls -f
NAME  STATE    IPV4             IPV6  AUTOSTART
-----------------------------------------------
sc1   RUNNING  192.168.122.192  -    YES


Connect to sc1 with ssh and do some validation and preparation for installing Cassandra and Spark.

[root]# ssh root@192.168.122.192
root@192.168.122.192's password:
Last login: Thu Sep 27 10:51:29 2018 from gateway
[root@sc1 ~]#

[root@sc1 ~]# cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)
[root@sc1 ~]# hostname
sc1.moon.lan

[root@sc1 ~]# yum install net-tools    # because there is no ifconfig

[root@sc1 ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.192  netmask 255.255.255.0  broadcast 192.168.122.255
        inet6 fe80::fc76:f2ff:fecc:9964  prefixlen 64  scopeid 0x20<link>
        ether fe:76:f2:cc:99:64  txqueuelen 1000  (Ethernet)
        RX packets 1498  bytes 407242 (397.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 453  bytes 52331 (51.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@sc1 ~]# yum update



Install Java(JDK) and set environment variables.


[root@sc1 opt]# yum install wget
[root@sc1 ~]# cd /opt
[root@sc1 opt]# wget --no-cookies --no-check-certificate \
    --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" \
    http://download.oracle.com/otn-pub/java/jdk/8u181-b13/96a7b8442fe848ef90c96a2fad6ed6d1/jdk-8u181-linux-x64.tar.gz
[root@sc1 opt]# tar xzf jdk-8u181-linux-x64.tar.gz

# vi /etc/profile
and add the following lines:

...
export JAVA_HOME=/opt/jdk1.8.0_181
export JRE_HOME=/opt/jdk1.8.0_181/jre
export PATH=$PATH:/opt/jdk1.8.0_181/bin:/opt/jdk1.8.0_181/jre/bin

pathmunge () {
...

apply changes with: source /etc/profile

and check:

[root@sc1 opt]# echo $JAVA_HOME
/opt/jdk1.8.0_181

[root@sc1 opt]# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

[root@sc1 opt]# javac -version
javac 1.8.0_181



Install Cassandra (as of 27/09/2018 the latest version is 3.11.3).


[root@sc1 opt]# vi /etc/yum.repos.d/cassandra.repo
[cassandra]
name=Apache Cassandra
baseurl=https://www.apache.org/dist/cassandra/redhat/311x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://www.apache.org/dist/cassandra/KEYS

yum install cassandra

[root@sc1 opt]# systemctl daemon-reload

[root@sc1 opt]# systemctl start cassandra

[root@sc1 opt]# systemctl enable cassandra
cassandra.service is not a native service, redirecting to /sbin/chkconfig.
Executing /sbin/chkconfig cassandra on

[root@sc1 opt]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  103.67 KiB  256          100.0%            f705c139-ad17-4454-af4c-a03e8a19c1f9  rack1

[root@sc1 opt]# cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>



Cassandra configuration.

At this step I only download the Spark binary, for installation later.
[root@sc1 opt]# wget http://apache-mirror.rbc.ru/pub/apache/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
[root@sc1 opt]# tar xzf spark-2.3.2-bin-hadoop2.7.tgz
[root@sc1 opt]# mv spark-2.3.2-bin-hadoop2.7 spark-2.3.2
[root@sc1 opt]# rm spark-2.3.2-bin-hadoop2.7.tgz

[root@sc1 opt]# vi /etc/hosts
127.0.0.1       localhost sc1
192.168.122.192 sc1 sc1
192.168.122.193 sc2 sc2
192.168.122.194 sc3 sc3

[root@sc1 opt]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO="static"
ONBOOT=yes
HOSTNAME=sc1.moon.lan
NM_CONTROLLED=no
TYPE=Ethernet
IPADDR=192.168.122.192
BROADCAST=192.168.122.255
NETMASK=255.255.255.0
MTU=
DHCP_HOSTNAME=`hostname`

[root@sc1 opt]# systemctl restart network.service

[root@sc1 opt]# vi /etc/cassandra/default.conf/cassandra.yaml
cluster_name: 'cass cluster'
num_tokens: 256
listen_address: 192.168.122.192
rpc_address: 192.168.122.192

Do not make all nodes seed nodes. - https://docs.datastax.com/en/cassandra/3.0/cassandra/initialize/initSingleDS.html

seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring.  You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: ",,"
          - seeds: "192.168.122.192"

[root@sc1 opt]# rm -rf /var/lib/cassandra/data/system/*



After these steps we stop the container and clone it. We also set memory limits for the containers.


[root@db-ders2 sc1]# lxc-stop -n sc1
[root@db-ders2 sc1]# vi /var/lib/lxc/sc1/config
add lxc.cgroup.memory.limit_in_bytes = 4096M

Try running it with these settings and with logging enabled:
[root@db-ders2 sc1]# lxc-start -d -n sc1 -l debug -o test.log

[root@db-ders2 sc1]# vi /var/lib/lxc/sc1/test.log
[root@db-ders2 sc1]# cat /var/lib/lxc/sc1/test.log  | grep -i memory
      lxc-start 1538056914.122 DEBUG    lxc_cgfs - cgfs.c:do_setup_cgroup_limits:1999 - cgroup 'memory.limit_in_bytes' set to '4096M'
      lxc-start 1538056914.244 DEBUG    lxc_conf - conf.c:umount_oldrootfs:1167 - umounted '/lxc_putold/sys/fs/cgroup/memory'
      lxc-start 1538056914.336 DEBUG    lxc_cgfs - cgfs.c:do_setup_cgroup_limits:1999 - cgroup 'memory.limit_in_bytes' set to '4096M'

[root@db-ders2 sc1]# lxc-stop -n sc1

[root@db-ders2 sc1]# lxc-clone -o sc1 -n sc2
Created container sc2 as copy of sc1
[root@db-ders2 sc1]# lxc-clone -o sc1 -n sc3
Created container sc3 as copy of sc1

Modify /etc/hosts on the host server:
 
[root@db-ders2 sc1]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.122.192 sc1 sc1
192.168.122.193 sc2 sc2
192.168.122.194 sc3 sc3

[root@db-ders2 sc1]# systemctl restart network.service


Start all three nodes and do the additional configuration.


[root@db-ders2 sc1]# lxc-start -d -n sc1
[root@db-ders2 sc1]# lxc-start -d -n sc2
[root@db-ders2 sc1]# lxc-start -d -n sc3
[root@db-ders2 sc1]# lxc-ls -f
NAME  STATE    IPV4             IPV6  AUTOSTART
-----------------------------------------------
sc1   RUNNING  192.168.122.192  -     YES
sc2   RUNNING  192.168.122.192  -     YES
sc3   RUNNING  192.168.122.192  -     YES

Huh, each node has the same IP.
Next we need to log into each node and modify ifcfg-eth0:

vi /etc/sysconfig/network-scripts/ifcfg-eth0
systemctl restart network.service
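
The per-node edits are not shown in the post; for sc2, presumably, the changes are along these lines (IPs taken from the /etc/hosts above, the cassandra.yaml lines inferred from the cluster state shown later):

/etc/sysconfig/network-scripts/ifcfg-eth0:   IPADDR=192.168.122.193, HOSTNAME=sc2.moon.lan
/etc/cassandra/default.conf/cassandra.yaml:  listen_address: 192.168.122.193
                                             rpc_address: 192.168.122.193

and the same with .194 / sc3 on the third node.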

After all changes:
[root@db-ders2 ~]# lxc-ls -f
NAME  STATE    IPV4             IPV6  AUTOSTART
-----------------------------------------------
sc1   RUNNING  192.168.122.192  -     YES
sc2   RUNNING  192.168.122.193  -     YES
sc3   RUNNING  192.168.122.194  -     YES

Next, we need to copy ssh keys to each node from the host server.

[root@db-ders2 ~]#  ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:7HZ/qrIZ6HrYAZbSLgfm9vOFabZk5ciGV7zJ5/d6q7Q root@db-ders2.moon.lan
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|   . .           |
|  + =  o         |
| o = .  S        |
|  + oo.@ o       |
| . +.o&.X o .    |
|    +Oo+.* o..o  |
|    .=+ ooooEBo. |
+----[SHA256]-----+

[root@db-ders2 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.122.192
[root@db-ders2 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.122.193
[root@db-ders2 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.122.194

and check passwordless connections:

[root@db-ders2 ~]# ssh root@192.168.122.19X
OK



Start the Cassandra cluster and check the configuration and node states.

Summary:

[root@sc1 ~]# nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.122.192  271.31 KiB  256          64.6%             82d7a870-a8ba-4294-8f15-5b42efe93810  rack1
UN  192.168.122.193  108.6 KiB  256          69.1%             45ce9639-d743-497d-ac9d-7dcf9dd6f49d  rack1
UN  192.168.122.194  119.87 KiB  256          66.4%             19e9e2a7-a58d-4f17-ba9f-47ab48813dbc  rack1


[root@sc1 ~]# nodetool describecluster
Cluster Information:
        Name: cass cluster
        Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
        DynamicEndPointSnitch: enabled
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
                ea63e099-37c5-3d7b-9ace-32f4c833653d: [192.168.122.192, 192.168.122.193, 192.168.122.194]


[root@sc1 ~]# cqlsh 192.168.122.192
Connected to cass cluster at 192.168.122.192:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>

Now we need to configure port forwarding on the host server, for external connections to the LXC containers.


[root@db-ders2 ~]# cd ~
[root@db-ders2 ~]# vi port-forv-cxs

iptables -t nat -A PREROUTING -d 10.241.5.234  -p tcp --dport 9042  -j DNAT --to 192.168.122.192:9042
iptables -I FORWARD -d 192.168.122.192/32 -p tcp -m state --state NEW -m tcp --dport 9042 -j ACCEPT

[root@db-ders2 ~]# chmod +x port-forv-cxs && ./port-forv-cxs

Now from my local work PC I can connect to the Cassandra cluster with DataStax DevCenter through server 10.241.5.234.



Now I can connect to the LXC Cassandra cluster from my local PC (schema here).
It is possible to run DevCenter and connect to Cassandra, and I have also tested a Scala application that reads data from an Oracle RDBMS and writes it into Cassandra tables:
oratocass (commit #11)


Here we begin installing Spark on each Cassandra node.

Repeat for all nodes:
[root@sc1 opt]# cd /opt
[root@sc1 opt]# wget http://downloads.typesafe.com/scala/2.11.8/scala-2.11.8.tgz
[root@sc1 opt]# tar xvf scala-2.11.8.tgz
[root@sc1 opt]# mv scala-2.11.8 /usr/lib
[root@sc1 opt]# ln -s /usr/lib/scala-2.11.8 /usr/lib/scala
[root@sc1 opt]# vi /etc/profile
modify: export PATH=$PATH:/opt/jdk1.8.0_181/bin:/opt/jdk1.8.0_181/jre/bin:/usr/lib/scala/bin
[root@sc1 opt]# source /etc/profile
[root@sc1 opt]# scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
[root@sc1 opt]# rm /opt/scala-2.11.8.tgz

[root@sc1 opt]# vi /etc/profile
export SPARK_HOME=/opt/spark-2.3.2
export PATH=$PATH:$SPARK_HOME/bin
[root@sc1 opt]# source /etc/profile

Now we can copy the file to the other nodes:
[root@sc1 opt]# scp /etc/profile root@192.168.122.193:/etc/profile
[root@sc1 opt]# scp /etc/profile root@192.168.122.194:/etc/profile



For the next activity, I will use this source - apache-spark-on-a-multi-node-cluster


Stop all LXC containers and clone sc1 for the Spark Master node.
On each node: ssh in, systemctl stop cassandra, exit.
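
A minimal sketch of that step, assuming the root ssh access from the host configured earlier:

for h in sc1 sc2 sc3; do ssh root@$h 'systemctl stop cassandra'; done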

lxc-stop -n sc1

lxc-clone -o sc1 -n smn

Then modify /etc/hosts on all nodes:

127.0.0.1       localhost sc1
192.168.122.219 smn smn
192.168.122.192 sc1 sc1
192.168.122.193 sc2 sc2
192.168.122.194 sc3 sc3

We also need to configure key-based SSH login between all nodes.

ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub USER@HOST

[root@db-ders2 ~]# lxc-ls -f
NAME  STATE    IPV4             IPV6  AUTOSTART
-----------------------------------------------
sc1   RUNNING  192.168.122.192  -     YES
sc2   RUNNING  192.168.122.193  -     YES
sc3   RUNNING  192.168.122.194  -     YES
smn   RUNNING  192.168.122.219  -     YES




Spark Master Configuration


[root@smn conf]# pwd
/opt/spark-2.3.2/conf

[root@smn conf]# cp spark-env.sh.template spark-env.sh

[root@smn conf]# vi spark-env.sh

SPARK_MASTER_HOST="192.168.122.219"
JAVA_HOME="/opt/jdk1.8.0_181"

[root@smn conf]# cp slaves.template slaves
[root@smn conf]# vi slaves

smn
sc1
sc2
sc3

Try to start the cluster:
[root@smn sbin]# cd /opt/spark-2.3.2/sbin/
[root@smn sbin]# ./start-all.sh

starting org.apache.spark.deploy.master.Master, logging to /opt/spark-2.3.2/logs/spark-root-org.apache.spark.deploy.master.Master-1-smn.out
smn: Warning: Permanently added 'smn,192.168.122.219' (ECDSA) to the list of known hosts.
smn: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-smn.out
sc1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-smn.out
sc3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-sc3.out
sc2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-sc2.out

We can also forward port 8080 from the host to smn, to use the Web UI from the work PC.

iptables -t nat -A PREROUTING -d 10.241.5.234  -p tcp --dport 8080  -j DNAT --to 192.168.122.219:8080
iptables -I FORWARD -d 192.168.122.219/32 -p tcp -m state --state NEW -m tcp --dport 8080 -j ACCEPT

I save these rules in the file ~/port-forv-cxs.



After port forwarding I can see the Spark Web UI: http://10.241.5.234:8080/






There was a little mistake in the slaves file: remove the smn name (the corrected file is shown below).
And check the first line in every /etc/hosts! :)
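
The corrected /opt/spark-2.3.2/conf/slaves on smn then contains only the worker nodes:

sc1
sc2
sc3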

After that, repeat:
[root@smn sbin]# ./stop-all.sh
[root@smn sbin]# ./start-all.sh

Now you can see the correct Workers at http://10.241.5.234:8080 and via jps:

[root@smn sbin]# jps
2223 Master
2303 Jps
[root@smn sbin]# ssh root@sc1
Last login: Thu Oct  4 14:31:16 2018 from smn
[root@sc1 ~]# jps
1443 Worker
1517 Jps
[root@sc1 ~]# exit
logout
Connection to sc1 closed.
[root@smn sbin]# ssh root@sc2
Last login: Thu Oct  4 14:30:48 2018 from smn
[root@sc2 ~]# jps
1651 Worker
1724 Jps



A little test of Spark (spark-shell), reading data from the Cassandra cluster: https://github.com/AlexGruPerm/oratocass - commit #15. I have built the Scala Spark application with oratocass> sbt package



spark-submit --class OraToCass --master spark://192.168.122.219:6066 --deploy-mode cluster C:\oratocass\target\scala-2.11\oratocass_2.11-1.0.jar

[root@smn ~]# spark-submit --class OraToCass --master spark://192.168.122.219:6066 --deploy-mode cluster C:\oratocass\target\scala-2.11\oratocass_2.11-1.0.jar
2018-10-04 14:53:05 WARN  Utils:66 - Your hostname, smn resolves to a loopback address: 127.0.0.1; using 192.168.122.219 instead (on interface eth0)
2018-10-04 14:53:05 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
Running Spark using the REST application submission protocol.
2018-10-04 14:53:05 INFO  RestSubmissionClient:54 - Submitting a request to launch an application in spark://192.168.122.219:6066.
2018-10-04 14:53:06 INFO  RestSubmissionClient:54 - Submission successfully created as driver-20181004145306-0000. Polling submission state...
2018-10-04 14:53:06 INFO  RestSubmissionClient:54 - Submitting a request for the status of submission driver-20181004145306-0000 in spark://192.168.122.219:6066.
2018-10-04 14:53:06 INFO  RestSubmissionClient:54 - State of driver driver-20181004145306-0000 is now RUNNING.
2018-10-04 14:53:06 INFO  RestSubmissionClient:54 - Driver is running on worker worker-20181004143241-192.168.122.192-39112 at 192.168.122.192:39112.
2018-10-04 14:53:06 INFO  RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20181004145306-0000",
  "serverSparkVersion" : "2.3.2",
  "submissionId" : "driver-20181004145306-0000",
  "success" : true
}
2018-10-04 14:53:06 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-10-04 14:53:06 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-7397c5e4-36bd-4329-98e5-7d20b5c82149

But in the Spark console the worker shows an error label.
The error message:

2018-10-05 06:02:21 ERROR RestSubmissionClient:70 - Exception from the cluster:
java.lang.NullPointerException
        org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:151)
        org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)
        org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
2018-10-05 06:02:21 INFO  RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:

Try copying the .jar to each node.
There was also a typo in the path to the jar.

Use sbt assembly to get a fat jar.

[root@smn ~]# spark-submit --class OraToCass --master spark://192.168.122.219:6066 --deploy-mode cluster /root/oratocass_v1.jar

--packages datastax:spark-cassandra-connector:2.3.2-s_2.11
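
Putting it together, the working submit command presumably looks like this (a sketch combining the two lines above; it is not shown verbatim in the post):

spark-submit --class OraToCass \
  --master spark://192.168.122.219:6066 \
  --deploy-mode cluster \
  --packages datastax:spark-cassandra-connector:2.3.2-s_2.11 \
  /root/oratocass_v1.jar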

[root@db-ders2 forticlient]# systemctl disable firewalld
[root@db-ders2 forticlient]# systemctl stop firewalld



I also had to fix some access problems with FortiClient on Linux (the main host server). Now the Spark application runs successfully in spark-submit cluster mode: it reads data from the Oracle DB and puts it into the Cassandra DB, 10-20x faster and without copying the data to my local work PC. I consider this a checkpoint: https://github.com/AlexGruPerm/oratocass - commit #17. Next, we need to install more software in the LXC containers.


Statistics


select round(sum(s.BYTES)/1024/1024/1024,2) as SizeGb 
  from user_segments s 
 where s.segment_name='T_DATA';
SQL> 
 
    SIZEGB
----------
     10,63

select round(sum(s.BYTES)/1024/1024/1024,2) as SizeGb 
  from user_segments s 
 where s.segment_name='T_KEYS';

SQL> 
 
    SIZEGB
----------
      7,95


[root@sc1 ~]# nodetool tablestats --human-readable  msk_arm_lead.t_data
Total number of tables: 41
----------------
Keyspace : msk_arm_lead
        Read Count: 0
        Read Latency: NaN ms
        Write Count: 4474322
        Write Latency: 0.037354793642478124 ms
        Pending Flushes: 0
                Table: t_data
                SSTable count: 3
                Space used (live): 427.35 MiB
                Space used (total): 427.35 MiB
                Space used by snapshots (total): 204.89 MiB
                Off heap memory used (total): 377.94 KiB
                SSTable Compression Ratio: 0.13985382171169486
                Number of partitions (estimate): 1046
                Memtable cell count: 410576
                Memtable data size: 34.77 MiB
                Memtable off heap memory used: 0 bytes
                Memtable switch count: 21
                Local read count: 0
                Local read latency: NaN ms
                Local write count: 3002166
                Local write latency: NaN ms
                Pending flushes: 0
                Percent repaired: 0.0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 1.02 KiB
                Bloom filter off heap memory used: 1016 bytes
                Index summary off heap memory used: 208 bytes
                Compression metadata off heap memory used: 376.74 KiB
                Compacted partition minimum bytes: 87
                Compacted partition maximum bytes: 52066354
                Compacted partition mean bytes: 4352659
                Average live cells per slice (last five minutes): NaN
                Maximum live cells per slice (last five minutes): 0
                Average tombstones per slice (last five minutes): NaN
                Maximum tombstones per slice (last five minutes): 0
                Dropped Mutations: 0 bytes

[root@sc1 ~]# nodetool tablestats --human-readable  msk_arm_lead.t_keys
Total number of tables: 41
----------------
Keyspace : msk_arm_lead
                Space used (total): 190.18 MiB



Next, I will install a Hadoop 3.1.1 cluster (1 NameNode and 5 DataNodes).
Using this instruction, I created a new LXC container named hdpnn (for the NameNode) from smn (with lxc-clone) and prepared it for subsequent cloning into DataNodes.



Modify /etc/hosts on the host machine and in hdpnn:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.122.219 smn smn
192.168.122.192 sc1 sc1
192.168.122.193 sc2 sc2
192.168.122.194 sc3 sc3
192.168.122.240 hdpnn hdpnn
192.168.122.241 hdp1 hdp1
192.168.122.242 hdp2 hdp2
192.168.122.243 hdp3 hdp3
192.168.122.244 hdp4 hdp4
192.168.122.245 hdp5 hdp5

useradd hadoop
passwd hadoop
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hdpnn
cd /opt
wget http://apache-mirror.rbc.ru/pub/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzf hadoop-3.1.1.tar.gz
mv hadoop-3.1.1 hadoop
chown -R hadoop /opt/hadoop

vi /opt/hadoop/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hdpnn:9000</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

mkdir /opt/hadoop/dfs
mkdir /opt/hadoop/dfs/name
mkdir /opt/hadoop/dfs/data

vi /opt/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
     <name>dfs.data.dir</name>
     <value>/opt/hadoop/dfs/data</value>
     <final>true</final>
  </property>
  <property>
     <name>dfs.name.dir</name>
     <value>/opt/hadoop/dfs/name</value>
     <final>true</final>
  </property>
  <property>
     <name>dfs.replication</name>
     <value>3</value>
  </property>
</configuration>

vi /opt/hadoop/etc/hadoop/mapred-site.xml

<configuration>
   <property>
      <name>mapred.job.tracker</name>
      <value>hdpnn:9001</value>
   </property>
</configuration>

vi /opt/hadoop/etc/hadoop/hadoop-env.sh

# Set Hadoop-specific environment variables here.
export JAVA_HOME=/opt/jdk1.8.0_181
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

Stop the LXC container and clone it for the DataNodes:

 lxc-clone -o hdpnn -n hdp1
 lxc-clone -o hdpnn -n hdp2
 lxc-clone -o hdpnn -n hdp3
 lxc-clone -o hdpnn -n hdp4
 lxc-clone -o hdpnn -n hdp5

After cloning we need to modify the network interface properties on each hdpX:
lxc-start -d -n hdp1

vi /etc/sysconfig/network-scripts/ifcfg-eth0  
change IPADDR=192.168.122.24X
systemctl restart network.service

After these changes I have:

[root@db-ders2 ~]# lxc-ls -f
NAME   STATE    IPV4             IPV6  AUTOSTART
------------------------------------------------
hdp1   RUNNING  192.168.122.241  -     YES
hdp2   RUNNING  192.168.122.242  -     YES
hdp3   RUNNING  192.168.122.243  -     YES
hdp4   RUNNING  192.168.122.244  -     YES
hdp5   RUNNING  192.168.122.245  -     YES
hdpnn  RUNNING  192.168.122.240  -     YES
sc1    STOPPED  -                -     YES
sc2    STOPPED  -                -     YES
sc3    STOPPED  -                -     YES
smn    STOPPED  -                -     YES




[hadoop@hdpnn ~]$ vi ~/.bashrc
...
fi

export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

# Uncomment the following line if you don't like systemctl's auto-paging feature:
...

[hadoop@hdpnn ~]$ hadoop version
Hadoop 3.1.1
Source code repository https://github.com/apache/hadoop -r 2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c
Compiled by leftnoteasy on 2018-08-02T04:26Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.1.1.jar

Copy .bashrc to each DataNode:

rsync .bashrc hadoop@hdpX:/home/hadoop
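
The workers file, NameNode formatting and HDFS start-up are not shown in the post, although the dfsadmin report below implies they were done. A minimal sketch, run on hdpnn as the hadoop user (standard Hadoop 3.x steps; exact paths assumed from the configuration above):

vi /opt/hadoop/etc/hadoop/workers     # list the DataNodes: hdp1 hdp2 hdp3 hdp4 hdp5, one per line
hdfs namenode -format
start-dfs.sh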





On the host server, configure port forwarding:

[root@db-ders2 ~]# 
iptables -t nat -A PREROUTING -d 10.241.5.234  -p tcp --dport 9870  -j DNAT --to 192.168.122.240:9870
[root@db-ders2 ~]# 
iptables -I FORWARD -d 192.168.122.240/32 -p tcp -m state --state NEW -m tcp --dport 9870 -j ACCEPT

[hadoop@hdpnn ~]$ hdfs dfsadmin -report
2018-10-30 06:04:51,933 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Configured Capacity: 1100006051840 (1.00 TB)
Present Capacity: 797274091520 (742.52 GB)
DFS Remaining: 797274071040 (742.52 GB)
DFS Used: 20480 (20 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (5):

Name: 192.168.122.241:9866 (hdp1)
Hostname: hdp1
Decommission Status : Normal
Configured Capacity: 220001210368 (204.89 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 60546392064 (56.39 GB)
DFS Remaining: 159454814208 (148.50 GB)
DFS Used%: 0.00%
DFS Remaining%: 72.48%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 30 06:04:51 UTC 2018
Last Block Report: Tue Oct 30 05:55:22 UTC 2018
Num of Blocks: 0


Name: 192.168.122.242:9866 (hdp2)
Hostname: hdp2
Decommission Status : Normal
Configured Capacity: 220001210368 (204.89 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 60546392064 (56.39 GB)
DFS Remaining: 159454814208 (148.50 GB)
DFS Used%: 0.00%
DFS Remaining%: 72.48%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 30 06:04:52 UTC 2018
Last Block Report: Tue Oct 30 05:55:22 UTC 2018
Num of Blocks: 0


Name: 192.168.122.243:9866 (hdp3)
Hostname: hdp3
Decommission Status : Normal
Configured Capacity: 220001210368 (204.89 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 60546392064 (56.39 GB)
DFS Remaining: 159454814208 (148.50 GB)
DFS Used%: 0.00%
DFS Remaining%: 72.48%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 30 06:04:52 UTC 2018
Last Block Report: Tue Oct 30 05:55:22 UTC 2018
Num of Blocks: 0


Name: 192.168.122.244:9866 (hdp4)
Hostname: hdp4
Decommission Status : Normal
Configured Capacity: 220001210368 (204.89 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 60546392064 (56.39 GB)
DFS Remaining: 159454814208 (148.50 GB)
DFS Used%: 0.00%
DFS Remaining%: 72.48%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 30 06:04:51 UTC 2018
Last Block Report: Tue Oct 30 05:55:22 UTC 2018
Num of Blocks: 0


Name: 192.168.122.245:9866 (hdp5)
Hostname: hdp5
Decommission Status : Normal
Configured Capacity: 220001210368 (204.89 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 60546392064 (56.39 GB)
DFS Remaining: 159454814208 (148.50 GB)
DFS Used%: 0.00%
DFS Remaining%: 72.48%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 30 06:04:52 UTC 2018
Last Block Report: Tue Oct 30 05:55:22 UTC 2018
Num of Blocks: 0




And we have the Hadoop cluster web UI.




Next, we install Hive, and DBeaver to work with it, and will try to work with data from HDFS (Parquet files). Hive needs a metastore database, so first I set up PostgreSQL 11 in a new container, hmpg:



# hostname
hmpg
yum install https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm
yum install postgresql11
yum install postgresql11-server
/usr/pgsql-11/bin/postgresql-11-setup initdb
systemctl enable postgresql-11
systemctl start postgresql-11

su - postgres
psql
postgres=# create database hive_meta;
postgres=# create user hive with encrypted password 'hive';
postgres=# grant all privileges on database hive_meta to hive;

vi /var/lib/pgsql/11/data/pg_hba.conf

# TYPE  DATABASE        USER            ADDRESS                 METHOD

local   all             all                                     peer
host    all             all             127.0.0.1/32            md5
host    all             all             0.0.0.0/0               md5
host    all             all             ::1/128                 md5
host    all             all             all                     md5

vi /var/lib/pgsql/11/data/postgresql.conf

# - Connection Settings -
listen_addresses = '*'          # what IP address(es) to listen on;
port = 5432                     # (change requires restart)

-bash-4.2$ /usr/pgsql-11/bin/pg_ctl restart

And port forwarding:

#postgres - Hive metastore database.
iptables -t nat -A PREROUTING -d 10.241.5.234  -p tcp --dport 5432  -j DNAT --to 192.168.122.231:5432
iptables -I FORWARD -d 192.168.122.231/32 -p tcp -m state --state NEW -m tcp --dport 5432 -j ACCEPT



And now I can connect to the Postgres 11 DB from my local PC.
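
The Hive installation itself (on the hv container, 192.168.122.230) is not shown in this post; presumably hive-site.xml points the metastore at this PostgreSQL database, roughly along these lines (the property names are standard Hive settings; the values are assumptions based on the database created above):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://192.168.122.231:5432/hive_meta</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>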





[hadoop@hdpnn ~]$ vi js1.json

{"ts":1520318907,"device":1,"metric":"p","value":100}
{"ts":1520318908,"device":2,"metric":"p","value":110}
{"ts":1520318909,"device":1,"metric":"v","value":8}
{"ts":1520318910,"device":2,"metric":"v","value":9}
{"ts":1520318911,"device":1,"metric":"p","value":120}
{"ts":1520318912,"device":2,"metric":"p","value":140}
{"ts":1520318913,"device":1,"metric":"v","value":10}
{"ts":1520318914,"device":2,"metric":"v","value":11}

[hadoop@hdpnn ~]$ hadoop fs -copyFromLocal  /home/hadoop/js1.json /user/data/js_db/js1.json

[hadoop@hdpnn lib]$
wget http://www.congiu.net/hive-json-serde/1.3.8/hdp23/json-serde-1.3.8-jar-with-dependencies.jar

Restart hiveserver2.
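
The Hive DDL is not shown in the post; with the openx JSON SerDe downloaded above, a table over the uploaded file would presumably look something like this (the table and column names are assumptions matching js1.json and the HDFS path used above):

CREATE EXTERNAL TABLE js_data (
  ts     BIGINT,
  device INT,
  metric STRING,
  value  INT
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/data/js_db';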

New port forwarding rules:

#HIVE
iptables -t nat -A PREROUTING -d 10.241.5.234  -p tcp --dport 10000  -j DNAT --to 192.168.122.230:10000
iptables -I FORWARD -d 192.168.122.230/32 -p tcp -m state --state NEW -m tcp --dport 10000 -j ACCEPT

#HiveServer2
iptables -t nat -A PREROUTING -d 10.241.5.234  -p tcp --dport 10002  -j DNAT --to 192.168.122.230:10002
iptables -I FORWARD -d 192.168.122.230/32 -p tcp -m state --state NEW -m tcp --dport 10002 -j ACCEPT

And in DBeaver (5.3.0) you can now connect to jdbc:hive2://10.241.5.234:10000/default



And you can see the HiveServer2 Web UI at http://10.241.5.234:10002





Who is listening on which port, on hv:
[root@hv ~]# lsof -i -P -n

COMMAND  PID   USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
sshd     289   root    3u  IPv4   55952      0t0  TCP *:22 (LISTEN)
sshd     289   root    4u  IPv6   55954      0t0  TCP *:22 (LISTEN)
sshd     596   root    3u  IPv4  189470      0t0  TCP 192.168.122.230:22->192.168.122.1:41136 (ESTABLISHED)
sshd    1175   root    3u  IPv4  214741      0t0  TCP 192.168.122.230:22->192.168.122.1:41248 (ESTABLISHED)
java    4602 hadoop  505u  IPv4 1183239      0t0  TCP 192.168.122.230:40514->192.168.122.231:5432 (ESTABLISHED)
java    4602 hadoop  506u  IPv4 1185255      0t0  TCP 192.168.122.230:40516->192.168.122.231:5432 (ESTABLISHED)
java    4602 hadoop  507u  IPv4 1185259      0t0  TCP 192.168.122.230:40522->192.168.122.231:5432 (ESTABLISHED)
java    4602 hadoop  508u  IPv4 1185804      0t0  TCP 192.168.122.230:40524->192.168.122.231:5432 (ESTABLISHED)
java    4602 hadoop  509u  IPv4 1185805      0t0  TCP *:9083 (LISTEN)
java    4602 hadoop  510u  IPv4 1201090      0t0  TCP 192.168.122.230:9083->192.168.122.230:50108 (ESTABLISHED)
java    4602 hadoop  511u  IPv4 1201094      0t0  TCP 192.168.122.230:9083->192.168.122.230:50112 (ESTABLISHED)
java    4602 hadoop  512u  IPv4 1184351      0t0  TCP 192.168.122.230:9083->192.168.122.230:50098 (ESTABLISHED)
java    4726 hadoop  505u  IPv4 1186149      0t0  TCP 192.168.122.230:10000->10.242.4.61:55477 (ESTABLISHED)
java    4726 hadoop  511u  IPv4 1196169      0t0  TCP 192.168.122.230:10002->10.242.4.61:55668 (FIN_WAIT2)
java    4726 hadoop  514u  IPv4 1186148      0t0  TCP *:10000 (LISTEN)
java    4726 hadoop  515u  IPv4 1184348      0t0  TCP *:10002 (LISTEN)
java    4726 hadoop  519u  IPv4 1184350      0t0  TCP 192.168.122.230:10002->10.242.4.61:55667 (FIN_WAIT2)
java    4726 hadoop  521u  IPv4 1199786      0t0  TCP 192.168.122.230:50108->192.168.122.230:9083 (ESTABLISHED)
java    4726 hadoop  522u  IPv4 1186673      0t0  TCP 192.168.122.230:10000->10.242.4.61:55478 (ESTABLISHED)
java    4726 hadoop  523u  IPv4 1186674      0t0  TCP 192.168.122.230:50098->192.168.122.230:9083 (ESTABLISHED)
java    4726 hadoop  525u  IPv4 1199787      0t0  TCP 192.168.122.230:50112->192.168.122.230:9083 (ESTABLISHED)
sshd    4952   root    3u  IPv4 1208630      0t0  TCP 192.168.122.230:22->192.168.122.1:45192 (ESTABLISHED)

