Write data into HDFS with Java

In a previous post I showed how to set up and configure a simple Hadoop 3.0 cluster.

Now we can read from and write to HDFS with Java.

Create a directory in HDFS (from the master node):

# hadoop fs -mkdir -p /user/data
# hadoop fs -ls /user/data
2018-02-19 18:17:58,784 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
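
The Java client only needs the hadoop-client artifact on the classpath. Assuming a Maven build matched to the cluster version, a minimal dependency looks like this (commons-io, used below for IOUtils, comes in transitively):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.0.0</version>
</dependency>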


HadoopSimple.java

package gdev;

import java.io.IOException;
import java.net.URI;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;

public class HadoopSimple {

    final static Logger logger = Logger.getLogger(HadoopSimple.class);

    void writeIntoHdfs() throws IOException {

        String hdfsUri = "hdfs://10.242.5.88:9000";
        String path = "/user/data/";
        String fileName = "hello.csv";
        String fileContent = "hello;world";

        Configuration conf = new Configuration();

        // Set FileSystem URI
        conf.set("fs.defaultFS", hdfsUri);

        // Act as the "hadoop" user (simple authentication)
        System.setProperty("HADOOP_USER_NAME", "hadoop");

        // Get the filesystem - HDFS
        FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf);

        Path newFolderPath = new Path(path);
        if (!fs.exists(newFolderPath)) {
            // Create new directory
            fs.mkdirs(newFolderPath);
            logger.info(">>>>>>  Path " + path + " created.");
        }

        //==== Write file
        logger.info("Begin Write file into hdfs");
        // Create a path
        Path hdfsWritePath = new Path(newFolderPath + "/" + fileName);
        // Init output stream
        FSDataOutputStream outputStream = fs.create(hdfsWritePath);
        // Classical output stream usage
        outputStream.writeBytes(fileContent);
        outputStream.close();
        logger.info("End Write file into hdfs");

        //==== Read file
        logger.info("Read file into hdfs");
        // Create a path
        Path hdfsReadPath = new Path(newFolderPath + "/" + fileName);
        // Init input stream
        FSDataInputStream inputStream = fs.open(hdfsReadPath);
        // Classical input stream usage
        String out = IOUtils.toString(inputStream, "UTF-8");
        logger.info("---------- >>> [" + out + "]");
        inputStream.close();
        fs.close();
    }
}
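
One caveat: writeBytes() writes only the low byte of every character, which is fine for the ASCII content above but mangles anything non-Latin. A sketch of a safer variant (writeUtf8IntoHdfs is a hypothetical helper; it assumes the imports java.io.BufferedWriter, java.io.OutputStreamWriter and java.nio.charset.StandardCharsets):

// Hypothetical helper: write a String to HDFS as UTF-8 instead of raw low bytes.
void writeUtf8IntoHdfs(FileSystem fs, Path file, String content) throws IOException {
    // true = overwrite the file if it already exists
    FSDataOutputStream out = fs.create(file, true);
    BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(out, StandardCharsets.UTF_8));
    writer.write(content);
    // close() flushes the writer and closes the underlying HDFS stream
    writer.close();
}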



And call it with:

package gdev;

import java.io.IOException;

import org.apache.log4j.Logger;

public class App {

    final static Logger logger = Logger.getLogger(App.class);

    public static void main(String[] args) {
        logger.info("info message");
        HadoopSimple hs = new HadoopSimple();
        try {
            hs.writeIntoHdfs();
        } catch (IOException e) {
            logger.error("Failed to write into HDFS", e);
        }
    }
}
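
The DEBUG lines in the output below come from log4j; without a log4j.properties on the classpath you would only get a "no appenders" warning. A minimal configuration sketch that roughly matches the pattern of the output (an assumption - your own layout may differ):

log4j.rootLogger=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-4r [%t] %-5p %c %x - %m%n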




Full output (note in the DEBUG lines how the client builds a write pipeline across the three datanodes 10.242.5.89-10.242.5.91):

2018-02-19 20:38:43 0    [main] INFO  gdev.App  - info message
2018-02-19 20:38:44 218  [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory  - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of successful kerberos logins and latency (milliseconds)], valueName=Time)
2018-02-19 20:38:44 227  [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory  - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of failed kerberos logins and latency (milliseconds)], valueName=Time)
2018-02-19 20:38:44 227  [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory  - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[GetGroups], valueName=Time)
2018-02-19 20:38:44 228  [main] DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - UgiMetrics, User and group related metrics
2018-02-19 20:38:44 262  [main] DEBUG org.apache.hadoop.security.Groups  -  Creating new Groups object
2018-02-19 20:38:44 264  [main] DEBUG org.apache.hadoop.util.NativeCodeLoader  - Trying to load the custom-built native-hadoop library...
2018-02-19 20:38:44 266  [main] DEBUG org.apache.hadoop.util.NativeCodeLoader  - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
2018-02-19 20:38:44 266  [main] DEBUG org.apache.hadoop.util.NativeCodeLoader  - java.library.path=C:\Java8\jdk1.8.0_161\bin;C:\windows\Sun\Java\bin;C:\windows\system32;C:\windows;C:\spark-2.2.1-bin-hadoop2.7\jars;C:\Python35\Lib\site-packages\PyQt5;C:\Program Files\ImageMagick-7.0.5-Q16;C:\oracle\product\11.2.0\client_2;C:\instantclient_12_1;C:\Windows\System32\INSTANTCLIENT_11_2;C:\Windows\SysWOW64\INSTANTCLIENT_11_2;C:\oracle\product\11.2.0\client_1;C:\oracle\product\11.2.0\client_1\bin;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\ProgramData\Oracle\Java\javapath;C:\PostgreSQL\pg95\bin;C:\Program Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Program Files\TortoiseGit\bin;C:\Program Files (x86)\pgmodeler;C:\Program Files (x86)\Skype\Phone\;C:\scala\bin;C:\Users\Yakushev\AppData\Local\Programs\Python\Python36-32\Scripts\;C:\Users\Yakushev\AppData\Local\Programs\Python\Python36-32\;C:\instantclient_12_1;C:\Windows\System32\INSTANTCLIENT_11_2;C:\Windows\SysWOW64\INSTANTCLIENT_11_2;C:\Program Files (x86)\OpenVPN_HARD\bin;C:\Program Files (x86)\OpenVPN\bin;C:\mingw_w64\mingw64\bin;C:\Users\Yakushev\AppData\Local\GitHubDesktop\bin;.
2018-02-19 20:38:44 266  [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-02-19 20:38:44 267  [main] DEBUG org.apache.hadoop.util.PerformanceAdvisory  - Falling back to shell based
2018-02-19 20:38:44 267  [main] DEBUG org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback  - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
2018-02-19 20:38:44 329  [main] DEBUG org.apache.hadoop.security.Groups  - Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=300000; warningDeltaMs=5000
2018-02-19 20:38:44 333  [main] DEBUG org.apache.hadoop.security.UserGroupInformation  - hadoop login
2018-02-19 20:38:44 334  [main] DEBUG org.apache.hadoop.security.UserGroupInformation  - hadoop login commit
2018-02-19 20:38:44 335  [main] DEBUG org.apache.hadoop.security.UserGroupInformation  - Using user: "hadoop" with name hadoop
2018-02-19 20:38:44 336  [main] DEBUG org.apache.hadoop.security.UserGroupInformation  - User entry: "hadoop"
2018-02-19 20:38:44 337  [main] DEBUG org.apache.hadoop.security.UserGroupInformation  - UGI loginUser:hadoop (auth:SIMPLE)
2018-02-19 20:38:44 401  [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal  - dfs.client.use.legacy.blockreader.local = false
2018-02-19 20:38:44 401  [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal  - dfs.client.read.shortcircuit = false
2018-02-19 20:38:44 401  [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal  - dfs.client.domain.socket.data.traffic = false
2018-02-19 20:38:44 401  [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal  - dfs.domain.socket.path = 
2018-02-19 20:38:44 411  [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - No KeyProvider found.
2018-02-19 20:38:44 429  [main] DEBUG org.apache.hadoop.io.retry.RetryUtils  - multipleLinearRandomRetry = null
2018-02-19 20:38:44 442  [main] DEBUG org.apache.hadoop.ipc.Server  - rpcKind=RPC_PROTOCOL_BUFFER, rpcRequestWrapperClass=class org.apache.hadoop.ipc.ProtobufRpcEngine$RpcRequestWrapper, rpcInvoker=org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker@2e4b8173
2018-02-19 20:38:44 556  [main] DEBUG org.apache.hadoop.ipc.Client  - getting client out of cache: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:44 758  [main] DEBUG org.apache.hadoop.util.PerformanceAdvisory  - Both short-circuit local reads and UNIX domain socket are disabled.
2018-02-19 20:38:44 763  [main] DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil  - DataTransferProtocol not using SaslPropertiesResolver, no QOP found in configuration for dfs.data.transfer.protection
2018-02-19 20:38:44 777  [main] DEBUG org.apache.hadoop.ipc.Client  - The ping interval is 60000 ms.
2018-02-19 20:38:44 779  [main] DEBUG org.apache.hadoop.ipc.Client  - Connecting to /10.242.5.88:9000
2018-02-19 20:38:44 829  [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop: starting, having connections 1
2018-02-19 20:38:44 839  [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #0
2018-02-19 20:38:44 849  [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #0
2018-02-19 20:38:44 849  [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine  - Call: getFileInfo took 83ms
2018-02-19 20:38:44 872  [main] INFO  gdev.HadoopSimple  - Begin Write file into hdfs
2018-02-19 20:38:44 873  [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - /user/data/hello.csv: masked=rw-r--r--
2018-02-19 20:38:44 891  [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #1
2018-02-19 20:38:44 916  [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #1
2018-02-19 20:38:44 916  [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine  - Call: create took 26ms
2018-02-19 20:38:44 918  [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - computePacketChunkSize: src=/user/data/hello.csv, chunkSize=516, chunksPerPacket=127, packetSize=65532
2018-02-19 20:38:44 932  [LeaseRenewer:hadoop@10.242.5.88:9000] DEBUG org.apache.hadoop.hdfs.LeaseRenewer  - Lease renewer daemon for [DFSClient_NONMAPREDUCE_-1239533414_1] with renew id 1 started
2018-02-19 20:38:44 935  [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - DFSClient writeChunk allocating new packet seqno=0, src=/user/data/hello.csv, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
2018-02-19 20:38:44 936  [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - Queued packet 0
2018-02-19 20:38:44 936  [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - Queued packet 1
2018-02-19 20:38:44 936  [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient  - Allocating new block
2018-02-19 20:38:44 936  [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - Waiting for ack for: 1
2018-02-19 20:38:44 945  [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #2
2018-02-19 20:38:44 969  [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #2
2018-02-19 20:38:44 969  [Thread-4] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine  - Call: addBlock took 24ms
2018-02-19 20:38:44 977  [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient  - pipeline = 10.242.5.91:9866
2018-02-19 20:38:44 977  [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient  - pipeline = 10.242.5.90:9866
2018-02-19 20:38:44 977  [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient  - pipeline = 10.242.5.89:9866
2018-02-19 20:38:44 977  [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient  - Connecting to datanode 10.242.5.91:9866
2018-02-19 20:38:44 981  [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient  - Send buf size 131072
2018-02-19 20:38:44 981  [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #3
2018-02-19 20:38:44 984  [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #3
2018-02-19 20:38:44 985  [Thread-4] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine  - Call: getServerDefaults took 3ms
2018-02-19 20:38:44 989  [Thread-4] DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient  - SASL client skipping handshake in unsecured configuration for addr = /10.242.5.91, datanodeId = 10.242.5.91:9866
2018-02-19 20:38:44 1138 [DataStreamer for file /user/data/hello.csv block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient  - DataStreamer block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002 sending packet packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 11
2018-02-19 20:38:45 1206 [ResponseProcessor for block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient  - DFSClient seqno: 0 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 12764600 4: "\000\000\000"
2018-02-19 20:38:45 1206 [DataStreamer for file /user/data/hello.csv block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient  - DataStreamer block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002 sending packet packet seqno:1 offsetInBlock:11 lastPacketInBlock:true lastByteOffsetInBlock: 11
2018-02-19 20:38:45 1251 [ResponseProcessor for block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient  - DFSClient seqno: 1 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 37259500 4: "\000\000\000"
2018-02-19 20:38:45 1252 [DataStreamer for file /user/data/hello.csv block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient  - Closing old block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002
2018-02-19 20:38:45 1253 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #4
2018-02-19 20:38:45 1274 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #4
2018-02-19 20:38:45 1274 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine  - Call: complete took 21ms
2018-02-19 20:38:45 1276 [main] INFO  gdev.HadoopSimple  - End Write file into hdfs
2018-02-19 20:38:45 1276 [main] INFO  gdev.HadoopSimple  - Read file into hdfs
2018-02-19 20:38:45 1280 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #5
2018-02-19 20:38:45 1286 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #5
2018-02-19 20:38:45 1286 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine  - Call: getBlockLocations took 6ms
2018-02-19 20:38:45 1289 [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - newInfo = LocatedBlocks{
  fileLength=11
  underConstruction=false
  blocks=[LocatedBlock{BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002; getBlockSize()=11; corrupt=false; offset=0; locs=[10.242.5.90:9866, 10.242.5.91:9866, 10.242.5.89:9866]; storageIDs=[DS-72760395-0b7b-474a-9439-37f6ecca5b73, DS-9af4c3dd-074a-463c-8fc1-fe26574721ba, DS-06124844-d0c1-4162-b03f-113730966916]; storageTypes=[DISK, DISK, DISK]}]
  lastLocatedBlock=LocatedBlock{BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002; getBlockSize()=11; corrupt=false; offset=0; locs=[10.242.5.91:9866, 10.242.5.89:9866, 10.242.5.90:9866]; storageIDs=[DS-9af4c3dd-074a-463c-8fc1-fe26574721ba, DS-06124844-d0c1-4162-b03f-113730966916, DS-72760395-0b7b-474a-9439-37f6ecca5b73]; storageTypes=[DISK, DISK, DISK]}
  isLastBlockComplete=true}
2018-02-19 20:38:45 1292 [main] DEBUG org.apache.hadoop.hdfs.DFSClient  - Connecting to datanode 10.242.5.90:9866
2018-02-19 20:38:45 1298 [main] DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient  - SASL client skipping handshake in unsecured configuration for addr = /10.242.5.90, datanodeId = 10.242.5.90:9866
2018-02-19 20:38:45 1379 [main] INFO  gdev.HadoopSimple  - ---------- >>> [hello;world]
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client  - stopping client from cache: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client  - removing client from cache: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client  - stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client  - Stopping client
2018-02-19 20:38:45 1383 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop: closed
2018-02-19 20:38:45 1383 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop: stopped, remaining connections 0




Now let's see where the data is actually stored on the datanodes. On hadoop-slave-1:

# pwd
/opt/hadoop/dfs/data/current/BP-1839426364-10.242.5.88-1519045731213/current/finalized/subdir0/subdir0

...
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:08 blk_1073741940
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:08 blk_1073741940_1116.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:10 blk_1073741941
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:10 blk_1073741941_1117.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:11 blk_1073741942
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:11 blk_1073741942_1118.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:13 blk_1073741943
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:13 blk_1073741943_1119.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:14 blk_1073741944
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:14 blk_1073741944_1120.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:16 blk_1073741945
...
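
The same block-to-datanode mapping can also be queried from the client API. A small sketch (printBlockLocations is a hypothetical helper; it assumes the fs and logger from HadoopSimple plus the imports org.apache.hadoop.fs.BlockLocation and org.apache.hadoop.fs.FileStatus):

// Hypothetical helper: log which datanodes hold each block of an HDFS file.
void printBlockLocations(FileSystem fs, Path file) throws IOException {
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
        logger.info("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + java.util.Arrays.toString(block.getHosts()));
    }
}

From the shell, hdfs fsck /user/data/hello.csv -files -blocks -locations shows the same information.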

