Comparing file formats supported by Hive, with examples, and changing the replication factor

The Hadoop 3.0 cluster and Hive installation from my previous posts are used in this article.

Hive version used:


[hadoop@hadoop-master myhql]$ hive --version
Hive 2.3.2


You can find the full list of supported file formats (Storage Formats) in the Hive documentation.

I am going to try just the following four:

SEQUENCEFILE - Stored as compressed SequenceFile.
ORC          - Stored as ORC file format. Supports ACID transactions and the Cost-Based Optimizer (CBO).
               Stores column-level metadata. (ORC = Optimized Row Columnar)
PARQUET      - Stored as the Parquet columnar storage format.
AVRO         - Stored as Avro format.


A simple script to create and populate the tables with test data:


[hadoop@hadoop-master myhql]$ hostname
hadoop-master
[hadoop@hadoop-master myhql]$ pwd
/opt/hive/myhql
[hadoop@hadoop-master myhql]$ ls
pop_tables.hql
[hadoop@hadoop-master myhql]$ vi pop_tables.hql

drop table tmp_seq_part_date;
drop table tmp_orc_part_date;
drop table tmp_prq_part_date;
drop table tmp_avro_part_date;

CREATE TABLE tmp_orc_part_date(country_id int, val float)
PARTITIONED BY (nyear int)
STORED AS ORC;

CREATE TABLE tmp_prq_part_date(country_id int, val float)
PARTITIONED BY (nyear int)
STORED AS PARQUET;

CREATE TABLE tmp_seq_part_date(country_id int, val float)
PARTITIONED BY (nyear int)
STORED AS SEQUENCEFILE;

CREATE TABLE tmp_avro_part_date(country_id int, val float)
PARTITIONED BY (nyear int)
STORED AS AVRO;

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.strict.checks.cartesian.product=false;

insert overwrite table tmp_seq_part_date partition(nyear)
SELECT Cast((round(RAND()*10.0)) as INT) as country_id,
       RAND()*10.0 as val,
       Array(2014, 2015, 2016, 2017, 2018)[Cast((FLOOR(RAND()*5.0)) as INT)] as nyear
FROM (select 1 from hive_cc1 limit 100000) t1
cross join
     (select 1 from hive_cc1 limit 1000) t2;

insert overwrite table tmp_orc_part_date partition(nyear)
select country_id,val,nyear
  from tmp_seq_part_date;

insert overwrite table tmp_prq_part_date partition(nyear)
select country_id,val,nyear
  from tmp_seq_part_date;

insert overwrite table tmp_avro_part_date partition(nyear)
select country_id,val,nyear
  from tmp_seq_part_date;

select count(*) from tmp_seq_part_date;

select count(*) from tmp_orc_part_date;

select count(*) from tmp_prq_part_date;


The existing table hive_cc1 is used only to generate rows; the data is then copied from tmp_seq_part_date into each of the other tables, so all tables hold identical data. This makes the size and performance comparisons cleaner.

All tables are partitioned by the nyear column (type int). Later I will show the HDFS file structure of each table.

We can run the HQL script with a command like this:

[hadoop@hadoop-master myhql]$ hive -f /opt/hive/myhql/pop_tables.hql

As you can see, each table contains 100,000,000 rows (100,000 × 1,000 from the cross join).
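
A quick way to double-check the partition layout after the load is to list the partitions directly in Hive (output omitted here; it should show nyear=2014 through nyear=2018):

hive> SHOW PARTITIONS tmp_orc_part_date;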

The next step is to compare HDFS file sizes. (The first value is the size of the data; the second is the total size of all replicas, since we have a replication factor of 2.)

#hadoop fs -ls /user/hive/warehouse
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 21:24 /user/hive/warehouse/tmp_avro_part_date
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 19:18 /user/hive/warehouse/tmp_orc_part_date
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 20:16 /user/hive/warehouse/tmp_prq_part_date
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 18:21 /user/hive/warehouse/tmp_seq_part_date

#hadoop fs -du -s -h /user/hive/warehouse/tmp_seq_part_date
2.2 G  4.4 G  /user/hive/warehouse/tmp_seq_part_date

#hadoop fs -du -s -h /user/hive/warehouse/tmp_avro_part_date
667.8 M  1.3 G

#hadoop fs -du -s -h /user/hive/warehouse/tmp_prq_part_date
429.4 M  858.8 M

#hadoop fs -du -s -h /user/hive/warehouse/tmp_orc_part_date 
395.0 M  790.0 M
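
To summarize the single-copy sizes from the -du output above:

SEQUENCEFILE  2.2 G
AVRO          667.8 M
PARQUET       429.4 M
ORC           395.0 M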


As you can see, there is a big difference: 2.2 GB versus 395 MB. But I found one more thing and decided to continue the research. When you create an ORC table you can specify the following properties:

KEY                       DEFAULT    NOTES
orc.compress              ZLIB       high-level compression, one of {NONE, ZLIB, SNAPPY}
orc.compress.size         262144     number of bytes in each compression chunk
orc.stripe.size           67108864   number of bytes in each stripe (memory buffer for writing)
orc.row.index.stride      10000      number of rows between index entries
orc.create.index          true       whether to create row indexes
orc.bloom.filter.columns  ""         comma-separated list of column names for bloom filters
orc.bloom.filter.fpp      0.05       false positive probability for bloom filters

So what about the compression of our ORC table?


hive> DESCRIBE extended tmp_orc_part_date;
OK
country_id              int
val                     float
nyear                   int

# Partition Information
# col_name              data_type  comment
nyear                   int

Detailed Table Information      Table(
tableName:tmp_orc_part_date, 
dbName:default, 
owner:hadoop, 
createTime:1520861796, 
lastAccessTime:0, 
retention:0, 
 sd:StorageDescriptor(cols:[FieldSchema(name:country_id, type:int, comment:null), 
                            FieldSchema(name:val, type:float, comment:null), 
                            FieldSchema(name:nyear, type:int, comment:null)], 
                            location:hdfs://hadoop-master:9000/user/hive/warehouse/tmp_orc_part_date, 
                            inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, 
                            outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, 
                          compressed:false, 
                            numBuckets:-1, 
                            serdeInfo:SerDeInfo(name:null, 
                                                serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, 
                                                parameters:{serialization.format=1}), 
                                                bucketCols:[], 
                                                sortCols:[], 
                                                parameters:{}, 
                                                skewedInfo:SkewedInfo(skewedColNames:[], 
                                                                      skewedColValues:[], 
                                                                      skewedColValueLocationMaps:{}), 
                                                                      storedAsSubDirectories:false), 
                           partitionKeys:[FieldSchema(name:nyear, type:int, comment:null)], 
                           parameters:{totalSize=414180492, 
                                       numRows=100000000, 
                                       rawDataSize=800000000, 
                                       COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, 
                                       numFiles=13, 
                                       numPartitions=5, /*Here we don't see compression*/ 
                                       transient_lastDdlTime=1520861796}, 
                            viewOriginalText:null, 
                            viewExpandedText:null, 
                            tableType:MANAGED_TABLE, 
                            rewriteEnabled:false)

Time taken: 0.416 seconds, Fetched: 10 row(s)

When the default compression is used, the property is not present in the table parameters (see the /*Here we don't see compression*/ comment above).
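
As a side note, DESCRIBE FORMATTED prints the same metadata in a more readable, sectioned layout (output omitted here; it contains the same table parameters):

hive> DESCRIBE FORMATTED tmp_orc_part_date;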


If you create the table with the compression defined explicitly (even though it equals the default), the property does show up in the table metadata.
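
A minimal sketch of such a definition, reusing the same columns as the table above; only the TBLPROPERTIES clause differs:

CREATE TABLE tmp_orc_part_date(country_id int, val float)
PARTITIONED BY (nyear int)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

The table properties fetched afterwards then contain the explicit setting: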

...
   numFiles = 13,
   numPartitions = 5,
   orc.compress = ZLIB,
   transient_lastDdlTime = 1520938797 },
   viewOriginalText:null,
...
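
For an existing table the same property can be changed with ALTER TABLE. A sketch; note that, as far as I know, this only affects data written after the change, existing ORC files keep their old codec:

ALTER TABLE tmp_orc_part_date SET TBLPROPERTIES ("orc.compress"="SNAPPY");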



A little bit about the file structure in HDFS (using tmp_orc_part_date as an example): CODE #1

[hadoop@hadoop-master myhql]$ hadoop fs -ls -h /user/hive/warehouse/
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 21:24 /user/hive/warehouse/tmp_avro_part_date
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 19:18 /user/hive/warehouse/tmp_orc_part_date
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 20:16 /user/hive/warehouse/tmp_prq_part_date
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 18:21 /user/hive/warehouse/tmp_seq_part_date
[hadoop@hadoop-master myhql]$ hadoop fs -ls -h /user/hive/warehouse/tmp_orc_part_date
Found 5 items
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 18:54 /user/hive/warehouse/tmp_orc_part_date/nyear=2014
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 19:00 /user/hive/warehouse/tmp_orc_part_date/nyear=2015
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 19:07 /user/hive/warehouse/tmp_orc_part_date/nyear=2016
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 19:13 /user/hive/warehouse/tmp_orc_part_date/nyear=2017
drwxr-xr-x   - hadoop supergroup          0 2018-03-12 19:18 /user/hive/warehouse/tmp_orc_part_date/nyear=2018
[hadoop@hadoop-master myhql]$ hadoop fs -ls -h /user/hive/warehouse/tmp_orc_part_date/nyear=2015
Found 3 items
-rwxr-xr-x   2 hadoop supergroup     22.9 M 2018-03-12 18:28 /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000000_0
-rwxr-xr-x   2 hadoop supergroup     11.2 M 2018-03-12 18:34 /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000001_0
-rwxr-xr-x   2 hadoop supergroup     44.9 M 2018-03-12 19:00 /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000005_0


The files are named 00000X_0, and the directory structure reflects the table partitioning by the nyear field. What about the Hadoop DataNode file structure? First of all we need the path of the data files on the DataNode; go to hadoop-master and get it:


# hdfs getconf -confKey dfs.datanode.data.dir
/opt/hadoop/dfs/data

Next, go to one of the DataNodes (I use hadoop-slave-1):

[root@hadoop-slave-1 subdir6]# pwd
/opt/hadoop/dfs/data/current/BP-1839426364-10.242.5.88-1519045731213/current/finalized/subdir0/subdir6
[root@hadoop-slave-1 subdir6]# ls -lh
total 2.0G
-rw-rw-r-- 1 hadoop hadoop  64M Mar 12 18:13 blk_1073743360
-rw-rw-r-- 1 hadoop hadoop 513K Mar 12 18:13 blk_1073743360_2548.meta
-rw-rw-r-- 1 hadoop hadoop  64M Mar 12 18:13 blk_1073743361
-rw-rw-r-- 1 hadoop hadoop 513K Mar 12 18:13 blk_1073743361_2549.meta
-rw-rw-r-- 1 hadoop hadoop  64M Mar 12 18:20 blk_1073743362
-rw-rw-r-- 1 hadoop hadoop 513K Mar 12 18:20 blk_1073743362_2550.meta
-rw-rw-r-- 1 hadoop hadoop  64M Mar 12 18:20 blk_1073743363
-rw-rw-r-- 1 hadoop hadoop 513K Mar 12 18:20 blk_1073743363_2551.meta
-rw-rw-r-- 1 hadoop hadoop  64M Mar 12 18:20 blk_1073743364
-rw-rw-r-- 1 hadoop hadoop 513K Mar 12 18:20 blk_1073743364_2552.meta
-rw-rw-r-- 1 hadoop hadoop  64M Mar 12 18:20 blk_1073743365
...
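
The blk_* files and their .meta companions above are the raw HDFS blocks (the block size on this cluster appears to be 64 MB). To relate a particular warehouse file back to these blocks, fsck can list its blocks and their locations; a sketch, with the output omitted:

# hdfs getconf -confKey dfs.blocksize
# hdfs fsck /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000005_0 -files -blocks -locations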


What about changing the replication factor?! As you can see in CODE #1, the files have a replication factor of 2 (the "2" after the permissions: -rwxr-xr-x 2). I tried to change the replication factor and ran into some problems.


hadoop fs -setrep -w 3 /user/hive/warehouse/tmp_orc_part_date
setrep: Requested replication factor of 3 exceeds maximum of 2 for /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000001_0
setrep: Requested replication factor of 3 exceeds maximum of 2 for /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000004_0
setrep: Requested replication factor of 3 exceeds maximum of 2 for /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000000_0
...

WTF?

stop-dfs.sh

Edit hdfs-site.xml on hadoop-master and set:

dfs.replication     3
dfs.replication.min 3
dfs.replication.max 3
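
For reference, in hdfs-site.xml these settings look roughly like this (a sketch; in Hadoop 3 dfs.replication.min is a deprecated alias of dfs.namenode.replication.min, but both are accepted):

<property><name>dfs.replication</name><value>3</value></property>
<property><name>dfs.replication.min</name><value>3</value></property>
<property><name>dfs.replication.max</name><value>3</value></property>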

[hadoop@hadoop-master hadoop]$ rsync /opt/hadoop/etc/hadoop/hdfs-site.xml hadoop@hadoop-slave-1:/opt/hadoop/etc/hadoop
[hadoop@hadoop-master hadoop]$ rsync /opt/hadoop/etc/hadoop/hdfs-site.xml hadoop@hadoop-slave-2:/opt/hadoop/etc/hadoop
[hadoop@hadoop-master hadoop]$ rsync /opt/hadoop/etc/hadoop/hdfs-site.xml hadoop@hadoop-slave-3:/opt/hadoop/etc/hadoop

start-dfs.sh

hadoop fs -setrep -w 3 /user/hive/warehouse/tmp_orc_part_date

and got the same errors again:

setrep: Requested replication factor of 3 exceeds maximum of 2 for /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000001_0
setrep: Requested replication factor of 3 exceeds maximum of 2 for /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000004_0
setrep: Requested replication factor of 3 exceeds maximum of 2 for /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000000_0
...
...

Mmmm...


Next step:

$ hdfs fsck /user/hive/warehouse/tmp_orc_part_date -delete
Connecting to namenode via http://hadoop-master:9870/fsck?ugi=hadoop&delete=1&path=%2Fuser%2Fhive%2Fwarehouse%2Ftmp_orc_part_date
FSCK started by hadoop (auth:SIMPLE) from /10.242.5.88 for path /user/hive/warehouse/tmp_orc_part_date at Wed Mar 14 13:26:31 MSK 2018

Status: HEALTHY
 Number of data-nodes:  3
 Number of racks:               1
 Total dirs:                    6
 Total symlinks:                0

Replicated Blocks:
 Total size:    414180492 B
 Total files:   13
 Total blocks (validated):      13 (avg. block size 31860037 B)
 Minimally replicated blocks:   13 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Missing blocks:                0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:   0
 Total block groups (validated):        0
 Minimally erasure-coded block groups:  0
 Over-erasure-coded block groups:       0
 Under-erasure-coded block groups:      0
 Unsatisfactory placement block groups: 0
 Average block group size:      0.0
 Missing block groups:          0
 Corrupt block groups:          0
 Missing internal blocks:       0
FSCK ended at Wed Mar 14 13:26:31 MSK 2018 in 1 milliseconds


The filesystem under path '/user/hive/warehouse/tmp_orc_part_date' is HEALTHY

Next, I found that HDFS was in safe mode, so I executed:
hadoop dfsadmin -safemode leave
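
A small aside: the current safe mode status can be checked first, and hdfs dfsadmin is the non-deprecated form of the same command:

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave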

hdfs dfs -setrep -w 3 /user/hive/warehouse/tmp_orc_part_date/nyear=2014
Replication 3 set: /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000001_0
Replication 3 set: /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000004_0
Waiting for /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000001_0 .......... done
Waiting for /user/hive/warehouse/tmp_orc_part_date/nyear=2014/000004_0 .... done

hdfs dfs -setrep -w 3 /user/hive/warehouse/tmp_orc_part_date/nyear=2015
Replication 3 set: /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000000_0
Replication 3 set: /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000001_0
Replication 3 set: /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000005_0
Waiting for /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000000_0 ...... done
Waiting for /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000001_0 ... done
Waiting for /user/hive/warehouse/tmp_orc_part_date/nyear=2015/000005_0 ....... done
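
To verify that the new replication factor has actually been applied, the replication count can be read per file (a quick check; the expected value is 3):

hdfs dfs -stat "%r %n" /user/hive/warehouse/tmp_orc_part_date/nyear=2015/*
hdfs fsck /user/hive/warehouse/tmp_orc_part_date | grep -i replication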
