Hive partitioned EXTERNAL tables, create, load data, select, managing

First of all there is creation of database for studing external tables. Before, look on directory structure


$ hadoop fs -ls /user/data
Found 6 items
-rw-r--r--   2 hadoop supergroup        161 2018-03-02 14:31 /user/data/cc2.avsc
drwxr-xr-x   - hadoop supergroup          0 2018-04-03 12:27 /user/data/dics
drwxr-xr-x   - hadoop supergroup          0 2018-03-15 15:42 /user/data/js1.json
drwxr-xr-x   - hadoop supergroup          0 2018-03-16 10:52 /user/data/js_db
drwxr-xr-x   - hadoop supergroup          0 2018-03-02 13:49 /user/data/order.parquet
-rw-r--r--   2 dr.who supergroup        389 2018-02-28 16:10 /user/data/test_data.csv


Create new one database with set location folder and setting some properties, like comment and creator.

CREATE DATABASE IF NOT EXISTS ext_tabs 
COMMENT 'Database for studing external tables'
LOCATION '/user/data/exttabs'
WITH DBPROPERTIES ('creator' = 'Yakushev Aleksey', 'date' = '2018-04-05');

Now check database

DESCRIBE DATABASE ext_tabs;

db_name  |comment                              |location                                    |owner_name |owner_type |parameters |
---------|-------------------------------------|--------------------------------------------|-----------|-----------|-----------|
ext_tabs |Database for studing external tables |hdfs://hadoop-master:9000/user/data/exttabs |hadoop     |USER       |           |


Here we create our first external table

/*
 SET  - Stock Exchange of Thailand 
 NYSE - New York Stock Exchange
*/
DROP TABLE IF EXISTS ext_tabs.stocks;

CREATE EXTERNAL TABLE ext_tabs.stocks(
symbol STRING,
ts     TIMESTAMP,
price  FLOAT
)
PARTITIONED BY (year INT,exch STRING) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;


And create table partitions:

alter table ext_tabs.stocks add 
partition (year=2017,exch='SET') 
partition (year=2018,exch='SET')
partition (year=2017,exch='NYSE')
partition (year=2018,exch='NYSE');


Look at the directory structure

$ hadoop fs -ls /user/data/exttabs/stocks
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017
drwxr-xr-x   - hadoop supergroup          0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2018

$ hadoop fs -ls /user/data/exttabs/stocks/year=2017
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017/exch=NYSE
drwxr-xr-x   - hadoop supergroup          0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017/exch=SET


Create 2 json file with test data and load it into HDFS

-- -- load data into one partition with one json file into /user/data/exttabs/stocks/year=2017/exch=SET
-- 2017_set_part_1.json
{"symbol":"ORCL","ts":"2017-04-05 11:34:12.120000000","price":12.34}
{"symbol":"ORCL","ts":"2017-04-05 11:34:12.240000000","price":13.34}
{"symbol":"ORCL","ts":"2017-04-05 11:34:13.370000000","price":14.34}
 
-- -- load data into one partition with one json file into /user/data/exttabs/stocks/year=2018/exch=NYSE
-- 2018_nyse_part_1.json
{"symbol":"IBM","ts":"2018-04-05 11:34:12.220000000","price":22.34}
{"symbol":"IBM","ts":"2018-04-05 11:34:12.340000000","price":23.34}
{"symbol":"IBM","ts":"2018-04-05 11:34:13.470000000","price":24.34}


Check the data of table.

select 
 stocks.symbol,
 cast(stocks.ts as double),
 stocks.ts,
 stocks.price,
 stocks.year,
 stocks.exch 
from 
 ext_tabs.stocks
 
stocks.symbol |ts            |stocks.ts           |stocks.price |stocks.year |stocks.exch |
--------------|--------------|--------------------|-------------|------------|------------|
ORCL          |1491381252.12 |2017-04-05 11:34:12 |12.34        |2017        |SET         |
ORCL          |1491381252.24 |2017-04-05 11:34:12 |13.34        |2017        |SET         |
ORCL          |1491381253.37 |2017-04-05 11:34:13 |14.34        |2017        |SET         |
IBM           |1522917252.22 |2018-04-05 11:34:12 |22.34        |2018        |NYSE        |
IBM           |1522917252.34 |2018-04-05 11:34:12 |23.34        |2018        |NYSE        |
IBM           |1522917253.47 |2018-04-05 11:34:13 |24.34        |2018        |NYSE        |


Additionally I can say thar Hive's external table doesn't store real data, it's just metadata. When you drop table the data will not erased and you can later recreate table. External table it's like a mask for data.

-- This script can drop partition 
alter table ext_tabs.stocks drop partition (year=2019, exch='LSE')

insert into ext_tabs.stocks partition(year=2019, exch='LSE')
select 
 'AMZ' as symbol,
 cast(stocks.ts as double) as ts,
 stocks.price
from 
 ext_tabs.stocks


Hive automatically create directory structure for new PARTITION (Hive feature -dynamic partition )

$ hadoop fs -ls /user/data/exttabs/stocks
Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017
drwxr-xr-x   - hadoop supergroup          0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2018
drwxr-xr-x   - hadoop supergroup          0 2018-04-06 08:21 /user/data/exttabs/stocks/year=2019


Select new data with HQL

select * from ext_tabs.stocks where year=2019

stocks.symbol |stocks.ts           |stocks.price |stocks.year |stocks.exch |
--------------|--------------------|-------------|------------|------------|
AMZ           |2017-04-05 08:34:12 |12.34        |2019        |LSE         |
AMZ           |2017-04-05 08:34:12 |13.34        |2019        |LSE         |
AMZ           |2017-04-05 08:34:13 |14.34        |2019        |LSE         |
AMZ           |2018-04-05 08:34:12 |22.34        |2019        |LSE         |
AMZ           |2018-04-05 08:34:12 |23.34        |2019        |LSE         |
AMZ           |2018-04-05 08:34:13 |24.34        |2019        |LSE         |


File with new data is created (we can download it and open in text editor).

$ hadoop fs -ls /user/data/exttabs/stocks/year=2019/exch=LSE
Found 1 items
-rwxr-xr-x   1 hadoop supergroup        354 2018-04-06 08:22 /user/data/exttabs/stocks/year=2019/exch=LSE/000000_0

{"symbol":"AMZ","price":12.34,"ts":"2017-04-05T08:34:12Z"}
{"symbol":"AMZ","price":13.34,"ts":"2017-04-05T08:34:12Z"}
{"symbol":"AMZ","price":14.34,"ts":"2017-04-05T08:34:13Z"}
{"symbol":"AMZ","price":22.34,"ts":"2018-04-05T08:34:12Z"}
{"symbol":"AMZ","price":23.34,"ts":"2018-04-05T08:34:12Z"}
{"symbol":"AMZ","price":24.34,"ts":"2018-04-05T08:34:13Z"}


It's also easy export data from tables:
$ hadoop fs -mkdir /user/data/exttabs/export

insert overwrite directory 'hdfs://hadoop-master:9000/user/data/exttabs/export'
select * 
from ext_tabs.stocks 
where year=2017;

$ hadoop fs -ls /user/data/exttabs/export
Found 1 items
-rwxr-xr-x   1 hadoop supergroup        129 2018-04-09 09:43 /user/data/exttabs/export/000000_0

If there were more reducers writing output, we would have additional files with names e.g. 000001_0, 000002_0


Комментарии

Популярные сообщения из этого блога

Hadoop 3.0 cluster - installation, configuration, tests on Cent OS 7

Loading data into Spark from Oracle RDBMS, CSV

Load data from Cassandra to HDFS parquet files and select with Hive