Write data into HDFS with Java
In a previous post I showed how to set up and configure a simple Hadoop 3.0 cluster. Now we can read from and write to HDFS with Java.

First, create a directory in HDFS (from the master node):
# hadoop fs -mkdir -p /user/data
# hadoop fs -ls /user/data
2018-02-19 18:17:58,784 WARN util.NativeCodeLoader:
Unable to load native-hadoop library for your platform
... using builtin-java classes where applicable

The NativeCodeLoader warning is harmless here: the client simply falls back to the built-in Java implementations.
HadoopSimple.java
package gdev;

import java.io.IOException;
import java.net.URI;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;

public class HadoopSimple {

    final static Logger logger = Logger.getLogger(HadoopSimple.class);

    void writeIntoHdfs() throws IOException {
        String hdfsuri = "hdfs://10.242.5.88:9000";
        String path = "/user/data/";
        String fileName = "hello.csv";
        String fileContent = "hello;world";

        // Set FileSystem URI (point the client at the NameNode)
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", hdfsuri);
        // Act as the "hadoop" user regardless of the local OS user
        System.setProperty("HADOOP_USER_NAME", "hadoop");

        // Get the filesystem - HDFS
        FileSystem fs = FileSystem.get(URI.create(hdfsuri), conf);

        Path newFolderPath = new Path(path);
        if (!fs.exists(newFolderPath)) {
            // Create new directory
            fs.mkdirs(newFolderPath);
            logger.info(">>>>>> Path " + path + " created.");
        }

        //==== Write file
        logger.info("Begin Write file into hdfs");
        Path hdfswritepath = new Path(newFolderPath, fileName);
        // Init output stream
        FSDataOutputStream outputStream = fs.create(hdfswritepath);
        // Classical output stream usage
        outputStream.writeBytes(fileContent);
        outputStream.close();
        logger.info("End Write file into hdfs");

        //==== Read file
        logger.info("Read file into hdfs");
        Path hdfsreadpath = new Path(newFolderPath, fileName);
        // Init input stream
        FSDataInputStream inputStream = fs.open(hdfsreadpath);
        // Classical input stream usage
        String out = IOUtils.toString(inputStream, "UTF-8");
        logger.info("---------- >>> [" + out + "]");
        inputStream.close();

        fs.close();
    }
}
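One caveat: writeBytes() is inherited from DataOutputStream and keeps only the low byte of every char, which happens to work for "hello;world" but would mangle non-ASCII text. A minimal alternative sketch that writes through a UTF-8 writer (writeText is a hypothetical helper, not part of the class above; it assumes the same FileSystem, Path and IOException imports):

    // Hypothetical helper: write text via a UTF-8 writer instead of
    // writeBytes(), which drops the high byte of every char.
    static void writeText(FileSystem fs, Path file, String text) throws IOException {
        try (java.io.BufferedWriter writer = new java.io.BufferedWriter(
                new java.io.OutputStreamWriter(fs.create(file, true),
                        java.nio.charset.StandardCharsets.UTF_8))) {
            writer.write(text);
        } // try-with-resources also closes the underlying FSDataOutputStream
    }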
Call the class from a simple driver:
package gdev;

import java.io.IOException;

import org.apache.log4j.Logger;

public class App {

    final static Logger logger = Logger.getLogger(App.class);

    public static void main(String[] args) {
        logger.info("info message");
        HadoopSimple hs = new HadoopSimple();
        try {
            hs.writeIntoHdfs();
        } catch (IOException e) {
            // Log at ERROR so a failed HDFS call is not silently dropped
            logger.error("HDFS write/read failed", e);
        }
    }
}
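The class above impersonates the hadoop user through the HADOOP_USER_NAME system property. On an unsecured (SIMPLE auth) cluster like this one, UserGroupInformation offers an alternative; a sketch, not part of the original code:

    // Sketch: run the HDFS calls as remote user "hadoop" without touching
    // system properties. Needs org.apache.hadoop.security.UserGroupInformation
    // and java.security.PrivilegedExceptionAction; assumes SIMPLE auth.
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hadoop");
    ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
        new HadoopSimple().writeIntoHdfs();
        return null;
    });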
Full output
2018-02-19 20:38:43 0 [main] INFO gdev.App - info message
2018-02-19 20:38:44 218 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of successful kerberos logins and latency (milliseconds)], valueName=Time)
2018-02-19 20:38:44 227 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[Rate of failed kerberos logins and latency (milliseconds)], valueName=Time)
2018-02-19 20:38:44 227 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, value=[GetGroups], valueName=Time)
2018-02-19 20:38:44 228 [main] DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl - UgiMetrics, User and group related metrics
2018-02-19 20:38:44 262 [main] DEBUG org.apache.hadoop.security.Groups - Creating new Groups object
2018-02-19 20:38:44 264 [main] DEBUG org.apache.hadoop.util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
2018-02-19 20:38:44 266 [main] DEBUG org.apache.hadoop.util.NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
2018-02-19 20:38:44 266 [main] DEBUG org.apache.hadoop.util.NativeCodeLoader - java.library.path=C:\Java8\jdk1.8.0_161\bin;C:\windows\Sun\Java\bin;C:\windows\system32;C:\windows;C:\spark-2.2.1-bin-hadoop2.7\jars;C:\Python35\Lib\site-packages\PyQt5;C:\Program Files\ImageMagick-7.0.5-Q16;C:\oracle\product\11.2.0\client_2;C:\instantclient_12_1;C:\Windows\System32\INSTANTCLIENT_11_2;C:\Windows\SysWOW64\INSTANTCLIENT_11_2;C:\oracle\product\11.2.0\client_1;C:\oracle\product\11.2.0\client_1\bin;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\ProgramData\Oracle\Java\javapath;C:\PostgreSQL\pg95\bin;C:\Program Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Program Files\TortoiseGit\bin;C:\Program Files (x86)\pgmodeler;C:\Program Files (x86)\Skype\Phone\;C:\scala\bin;C:\Users\Yakushev\AppData\Local\Programs\Python\Python36-32\Scripts\;C:\Users\Yakushev\AppData\Local\Programs\Python\Python36-32\;C:\instantclient_12_1;C:\Windows\System32\INSTANTCLIENT_11_2;C:\Windows\SysWOW64\INSTANTCLIENT_11_2;C:\Program Files (x86)\OpenVPN_HARD\bin;C:\Program Files (x86)\OpenVPN\bin;C:\mingw_w64\mingw64\bin;C:\Users\Yakushev\AppData\Local\GitHubDesktop\bin;.
2018-02-19 20:38:44 266 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-02-19 20:38:44 267 [main] DEBUG org.apache.hadoop.util.PerformanceAdvisory - Falling back to shell based
2018-02-19 20:38:44 267 [main] DEBUG org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
2018-02-19 20:38:44 329 [main] DEBUG org.apache.hadoop.security.Groups - Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=300000; warningDeltaMs=5000
2018-02-19 20:38:44 333 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login
2018-02-19 20:38:44 334 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login commit
2018-02-19 20:38:44 335 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - Using user: "hadoop" with name hadoop
2018-02-19 20:38:44 336 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - User entry: "hadoop"
2018-02-19 20:38:44 337 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - UGI loginUser:hadoop (auth:SIMPLE)
2018-02-19 20:38:44 401 [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal - dfs.client.use.legacy.blockreader.local = false
2018-02-19 20:38:44 401 [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal - dfs.client.read.shortcircuit = false
2018-02-19 20:38:44 401 [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal - dfs.client.domain.socket.data.traffic = false
2018-02-19 20:38:44 401 [main] DEBUG org.apache.hadoop.hdfs.BlockReaderLocal - dfs.domain.socket.path =
2018-02-19 20:38:44 411 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - No KeyProvider found.
2018-02-19 20:38:44 429 [main] DEBUG org.apache.hadoop.io.retry.RetryUtils - multipleLinearRandomRetry = null
2018-02-19 20:38:44 442 [main] DEBUG org.apache.hadoop.ipc.Server - rpcKind=RPC_PROTOCOL_BUFFER, rpcRequestWrapperClass=class org.apache.hadoop.ipc.ProtobufRpcEngine$RpcRequestWrapper, rpcInvoker=org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker@2e4b8173
2018-02-19 20:38:44 556 [main] DEBUG org.apache.hadoop.ipc.Client - getting client out of cache: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:44 758 [main] DEBUG org.apache.hadoop.util.PerformanceAdvisory - Both short-circuit local reads and UNIX domain socket are disabled.
2018-02-19 20:38:44 763 [main] DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil - DataTransferProtocol not using SaslPropertiesResolver, no QOP found in configuration for dfs.data.transfer.protection
2018-02-19 20:38:44 777 [main] DEBUG org.apache.hadoop.ipc.Client - The ping interval is 60000 ms.
2018-02-19 20:38:44 779 [main] DEBUG org.apache.hadoop.ipc.Client - Connecting to /10.242.5.88:9000
2018-02-19 20:38:44 829 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop: starting, having connections 1
2018-02-19 20:38:44 839 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #0
2018-02-19 20:38:44 849 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #0
2018-02-19 20:38:44 849 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: getFileInfo took 83ms
2018-02-19 20:38:44 872 [main] INFO gdev.HadoopSimple - Begin Write file into hdfs
2018-02-19 20:38:44 873 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - /user/data/hello.csv: masked=rw-r--r--
2018-02-19 20:38:44 891 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #1
2018-02-19 20:38:44 916 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #1
2018-02-19 20:38:44 916 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: create took 26ms
2018-02-19 20:38:44 918 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - computePacketChunkSize: src=/user/data/hello.csv, chunkSize=516, chunksPerPacket=127, packetSize=65532
2018-02-19 20:38:44 932 [LeaseRenewer:hadoop@10.242.5.88:9000] DEBUG org.apache.hadoop.hdfs.LeaseRenewer - Lease renewer daemon for [DFSClient_NONMAPREDUCE_-1239533414_1] with renew id 1 started
2018-02-19 20:38:44 935 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - DFSClient writeChunk allocating new packet seqno=0, src=/user/data/hello.csv, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
2018-02-19 20:38:44 936 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - Queued packet 0
2018-02-19 20:38:44 936 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - Queued packet 1
2018-02-19 20:38:44 936 [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient - Allocating new block
2018-02-19 20:38:44 936 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - Waiting for ack for: 1
2018-02-19 20:38:44 945 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #2
2018-02-19 20:38:44 969 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #2
2018-02-19 20:38:44 969 [Thread-4] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: addBlock took 24ms
2018-02-19 20:38:44 977 [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient - pipeline = 10.242.5.91:9866
2018-02-19 20:38:44 977 [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient - pipeline = 10.242.5.90:9866
2018-02-19 20:38:44 977 [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient - pipeline = 10.242.5.89:9866
2018-02-19 20:38:44 977 [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient - Connecting to datanode 10.242.5.91:9866
2018-02-19 20:38:44 981 [Thread-4] DEBUG org.apache.hadoop.hdfs.DFSClient - Send buf size 131072
2018-02-19 20:38:44 981 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #3
2018-02-19 20:38:44 984 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #3
2018-02-19 20:38:44 985 [Thread-4] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: getServerDefaults took 3ms
2018-02-19 20:38:44 989 [Thread-4] DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL client skipping handshake in unsecured configuration for addr = /10.242.5.91, datanodeId = 10.242.5.91:9866
2018-02-19 20:38:44 1138 [DataStreamer for file /user/data/hello.csv block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient - DataStreamer block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002 sending packet packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 11
2018-02-19 20:38:45 1206 [ResponseProcessor for block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient - DFSClient seqno: 0 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 12764600 4: "\000\000\000"
2018-02-19 20:38:45 1206 [DataStreamer for file /user/data/hello.csv block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient - DataStreamer block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002 sending packet packet seqno:1 offsetInBlock:11 lastPacketInBlock:true lastByteOffsetInBlock: 11
2018-02-19 20:38:45 1251 [ResponseProcessor for block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient - DFSClient seqno: 1 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 37259500 4: "\000\000\000"
2018-02-19 20:38:45 1252 [DataStreamer for file /user/data/hello.csv block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002] DEBUG org.apache.hadoop.hdfs.DFSClient - Closing old block BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002
2018-02-19 20:38:45 1253 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #4
2018-02-19 20:38:45 1274 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #4
2018-02-19 20:38:45 1274 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: complete took 21ms
2018-02-19 20:38:45 1276 [main] INFO gdev.HadoopSimple - End Write file into hdfs
2018-02-19 20:38:45 1276 [main] INFO gdev.HadoopSimple - Read file into hdfs
2018-02-19 20:38:45 1280 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop sending #5
2018-02-19 20:38:45 1286 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop got value #5
2018-02-19 20:38:45 1286 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: getBlockLocations took 6ms
2018-02-19 20:38:45 1289 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - newInfo = LocatedBlocks{
fileLength=11
underConstruction=false
blocks=[LocatedBlock{BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002; getBlockSize()=11; corrupt=false; offset=0; locs=[10.242.5.90:9866, 10.242.5.91:9866, 10.242.5.89:9866]; storageIDs=[DS-72760395-0b7b-474a-9439-37f6ecca5b73, DS-9af4c3dd-074a-463c-8fc1-fe26574721ba, DS-06124844-d0c1-4162-b03f-113730966916]; storageTypes=[DISK, DISK, DISK]}]
lastLocatedBlock=LocatedBlock{BP-1839426364-10.242.5.88-1519045731213:blk_1073741826_1002; getBlockSize()=11; corrupt=false; offset=0; locs=[10.242.5.91:9866, 10.242.5.89:9866, 10.242.5.90:9866]; storageIDs=[DS-9af4c3dd-074a-463c-8fc1-fe26574721ba, DS-06124844-d0c1-4162-b03f-113730966916, DS-72760395-0b7b-474a-9439-37f6ecca5b73]; storageTypes=[DISK, DISK, DISK]}
isLastBlockComplete=true}
2018-02-19 20:38:45 1292 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - Connecting to datanode 10.242.5.90:9866
2018-02-19 20:38:45 1298 [main] DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL client skipping handshake in unsecured configuration for addr = /10.242.5.90, datanodeId = 10.242.5.90:9866
2018-02-19 20:38:45 1379 [main] INFO gdev.HadoopSimple - ---------- >>> [hello;world]
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client - stopping client from cache: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client - removing client from cache: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client - stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@3911c2a7
2018-02-19 20:38:45 1382 [main] DEBUG org.apache.hadoop.ipc.Client - Stopping client
2018-02-19 20:38:45 1383 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop: closed
2018-02-19 20:38:45 1383 [IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop] DEBUG org.apache.hadoop.ipc.Client - IPC Client (715378067) connection to /10.242.5.88:9000 from hadoop: stopped, remaining connections 0
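The log confirms the full round trip: the client queues two packets, pipelines them through three datanodes (10.242.5.91, 10.242.5.90, 10.242.5.89 — the default replication factor of 3), and the read returns [hello;world]. You can also verify the result from Java instead of hadoop fs -ls; a small sketch, assuming an open fs configured as in HadoopSimple:

    // Sketch: list /user/data and print each file's size.
    // Assumes fs is an open FileSystem configured as in HadoopSimple.
    for (org.apache.hadoop.fs.FileStatus status :
            fs.listStatus(new org.apache.hadoop.fs.Path("/user/data"))) {
        System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }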
Now let's see where the data is physically stored on a datanode, e.g. on hadoop-slave-1:
# pwd
/opt/hadoop/dfs/data/current/BP-1839426364-10.242.5.88-1519045731213/current/finalized/subdir0/subdir0
...
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:08 blk_1073741940
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:08 blk_1073741940_1116.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:10 blk_1073741941
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:10 blk_1073741941_1117.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:11 blk_1073741942
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:11 blk_1073741942_1118.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:13 blk_1073741943
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:13 blk_1073741943_1119.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:14 blk_1073741944
-rw-rw-r-- 1 hadoop hadoop 1,1M Feb 22 02:14 blk_1073741944_1120.meta
-rw-rw-r-- 1 hadoop hadoop 128M Feb 22 02:16 blk_1073741945
...
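The same block-to-datanode mapping is also available programmatically through FileSystem.getFileBlockLocations; a sketch under the same assumptions (an open fs and the hello.csv written above, in a context that may throw IOException):

    // Sketch: print which datanodes host each block of hello.csv.
    org.apache.hadoop.fs.Path file = new org.apache.hadoop.fs.Path("/user/data/hello.csv");
    org.apache.hadoop.fs.FileStatus status = fs.getFileStatus(file);
    for (org.apache.hadoop.fs.BlockLocation block :
            fs.getFileBlockLocations(status, 0, status.getLen())) {
        System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + java.util.Arrays.toString(block.getHosts()));
    }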