Hive partitioned EXTERNAL tables: create, load data, select, manage
First of all, we create a database for studying external tables.
But before that, let's look at the directory structure:
$ hadoop fs -ls /user/data
Found 6 items
-rw-r--r-- 2 hadoop supergroup 161 2018-03-02 14:31 /user/data/cc2.avsc
drwxr-xr-x - hadoop supergroup 0 2018-04-03 12:27 /user/data/dics
drwxr-xr-x - hadoop supergroup 0 2018-03-15 15:42 /user/data/js1.json
drwxr-xr-x - hadoop supergroup 0 2018-03-16 10:52 /user/data/js_db
drwxr-xr-x - hadoop supergroup 0 2018-03-02 13:49 /user/data/order.parquet
-rw-r--r-- 2 dr.who supergroup 389 2018-02-28 16:10 /user/data/test_data.csv
Create a new database with an explicit location and a few properties, such as a comment and the creator:
CREATE DATABASE IF NOT EXISTS ext_tabs
COMMENT 'Database for studying external tables'
LOCATION '/user/data/exttabs'
WITH DBPROPERTIES ('creator' = 'Yakushev Aleksey', 'date' = '2018-04-05');
Now check the database:
DESCRIBE DATABASE ext_tabs;
db_name  |comment                               |location                                    |owner_name |owner_type |parameters |
---------|--------------------------------------|--------------------------------------------|-----------|-----------|-----------|
ext_tabs |Database for studying external tables |hdfs://hadoop-master:9000/user/data/exttabs |hadoop     |USER       |           |
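Note that the parameters column above is empty: plain DESCRIBE DATABASE does not print DBPROPERTIES. The EXTENDED variant should also show the creator and date we set:
DESCRIBE DATABASE EXTENDED ext_tabs;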
Here we create our first external table:
/*
SET - Stock Exchange of Thailand
NYSE - New York Stock Exchange
*/
DROP TABLE IF EXISTS ext_tabs.stocks;
CREATE EXTERNAL TABLE ext_tabs.stocks(
symbol STRING,
ts TIMESTAMP,
price FLOAT
)
PARTITIONED BY (year INT,exch STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
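The openx JsonSerDe is not shipped with Hive, so its jar has to be on the classpath before the table can be used. One way is ADD JAR in the session (the path and version below are assumptions, adjust them to your installation):
ADD JAR /opt/lib/json-serde-1.3.8-jar-with-dependencies.jar; -- hypothetical path and version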
And create the table partitions:
alter table ext_tabs.stocks add
partition (year=2017,exch='SET')
partition (year=2018,exch='SET')
partition (year=2017,exch='NYSE')
partition (year=2018,exch='NYSE');
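We can verify that the partitions were registered in the metastore; SHOW PARTITIONS should list all four specs (year=2017/exch=NYSE and so on):
SHOW PARTITIONS ext_tabs.stocks;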
Look at the directory structure:
$ hadoop fs -ls /user/data/exttabs/stocks
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017
drwxr-xr-x - hadoop supergroup 0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2018
$ hadoop fs -ls /user/data/exttabs/stocks/year=2017
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017/exch=NYSE
drwxr-xr-x - hadoop supergroup 0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017/exch=SET
Create two JSON files with test data and load them into HDFS:
-- load data into the partition directory /user/data/exttabs/stocks/year=2017/exch=SET
-- 2017_set_part_1.json
{"symbol":"ORCL","ts":"2017-04-05 11:34:12.120000000","price":12.34}
{"symbol":"ORCL","ts":"2017-04-05 11:34:12.240000000","price":13.34}
{"symbol":"ORCL","ts":"2017-04-05 11:34:13.370000000","price":14.34}
-- load data into the partition directory /user/data/exttabs/stocks/year=2018/exch=NYSE
-- 2018_nyse_part_1.json
{"symbol":"IBM","ts":"2018-04-05 11:34:12.220000000","price":22.34}
{"symbol":"IBM","ts":"2018-04-05 11:34:12.340000000","price":23.34}
{"symbol":"IBM","ts":"2018-04-05 11:34:13.470000000","price":24.34}
Check the table data; the second column casts ts to DOUBLE, which yields seconds since the Unix epoch:
select
stocks.symbol,
cast(stocks.ts as double),
stocks.ts,
stocks.price,
stocks.year,
stocks.exch
from
ext_tabs.stocks;
stocks.symbol |ts            |stocks.ts           |stocks.price |stocks.year |stocks.exch |
--------------|--------------|--------------------|-------------|------------|------------|
ORCL          |1491381252.12 |2017-04-05 11:34:12 |12.34        |2017        |SET         |
ORCL          |1491381252.24 |2017-04-05 11:34:12 |13.34        |2017        |SET         |
ORCL          |1491381253.37 |2017-04-05 11:34:13 |14.34        |2017        |SET         |
IBM           |1522917252.22 |2018-04-05 11:34:12 |22.34        |2018        |NYSE        |
IBM           |1522917252.34 |2018-04-05 11:34:12 |23.34        |2018        |NYSE        |
IBM           |1522917253.47 |2018-04-05 11:34:13 |24.34        |2018        |NYSE        |
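Because year and exch are partition columns, a filter on them lets Hive prune whole directories instead of scanning the full table; for example, this query should read only the files under /user/data/exttabs/stocks/year=2018/exch=NYSE:
select symbol, ts, price
from ext_tabs.stocks
where year = 2018 and exch = 'NYSE';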
Additionally, note that a Hive external table doesn't store the data itself; the table is just metadata. When you drop the table, the data is not erased, and you can recreate the table later. An external table is like a mask over the data.
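A minimal sketch of that round trip, reusing the DDL above: drop the table (the files under /user/data/exttabs/stocks survive), recreate it, and let MSCK REPAIR TABLE re-register the partition directories it finds on HDFS:
DROP TABLE ext_tabs.stocks;
-- re-run the CREATE EXTERNAL TABLE statement from above, then:
MSCK REPAIR TABLE ext_tabs.stocks; -- scans the table location and re-adds year=.../exch=... partitions
select count(*) from ext_tabs.stocks; -- the old rows are visible again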
Next, let's create a brand new partition simply by inserting data into it.
-- drop the target partition first, so the script can be re-run (IF EXISTS avoids an error when it is absent)
alter table ext_tabs.stocks drop if exists partition (year=2019, exch='LSE');
insert into ext_tabs.stocks partition(year=2019, exch='LSE')
select
'AMZ' as symbol,
cast(stocks.ts as double) as ts,
stocks.price
from
ext_tabs.stocks;
Hive automatically creates the directory structure for the new partition (here the partition values are given statically; a dynamic-partition variant is sketched after the listing below):
$ hadoop fs -ls /user/data/exttabs/stocks
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2017
drwxr-xr-x - hadoop supergroup 0 2018-04-05 11:02 /user/data/exttabs/stocks/year=2018
drwxr-xr-x - hadoop supergroup 0 2018-04-06 08:21 /user/data/exttabs/stocks/year=2019
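For comparison, with Hive's dynamic partitioning the partition values are taken from the SELECT itself rather than spelled out. A sketch, purely for illustration (it appends duplicate rows into the 2017 partitions); the last two SELECT columns feed year and exch, and nonstrict mode is required because no static partition value is given:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

insert into ext_tabs.stocks partition (year, exch)
select symbol, ts, price, year, exch
from ext_tabs.stocks
where year = 2017;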
Select the new data with HQL (the three-hour shift in ts compared to the source rows comes from the round trip through DOUBLE, which involves a time-zone conversion):
select * from ext_tabs.stocks where year=2019;
stocks.symbol |stocks.ts           |stocks.price |stocks.year |stocks.exch |
--------------|--------------------|-------------|------------|------------|
AMZ           |2017-04-05 08:34:12 |12.34        |2019        |LSE         |
AMZ           |2017-04-05 08:34:12 |13.34        |2019        |LSE         |
AMZ           |2017-04-05 08:34:13 |14.34        |2019        |LSE         |
AMZ           |2018-04-05 08:34:12 |22.34        |2019        |LSE         |
AMZ           |2018-04-05 08:34:12 |23.34        |2019        |LSE         |
AMZ           |2018-04-05 08:34:13 |24.34        |2019        |LSE         |
A file with the new data has been created (we can download it and open it in a text editor):
$ hadoop fs -ls /user/data/exttabs/stocks/year=2019/exch=LSE
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 354 2018-04-06 08:22 /user/data/exttabs/stocks/year=2019/exch=LSE/000000_0
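Or view it in place with hadoop fs -cat:
$ hadoop fs -cat /user/data/exttabs/stocks/year=2019/exch=LSE/000000_0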
{"symbol":"AMZ","price":12.34,"ts":"2017-04-05T08:34:12Z"}
{"symbol":"AMZ","price":13.34,"ts":"2017-04-05T08:34:12Z"}
{"symbol":"AMZ","price":14.34,"ts":"2017-04-05T08:34:13Z"}
{"symbol":"AMZ","price":22.34,"ts":"2018-04-05T08:34:12Z"}
{"symbol":"AMZ","price":23.34,"ts":"2018-04-05T08:34:12Z"}
{"symbol":"AMZ","price":24.34,"ts":"2018-04-05T08:34:13Z"}
It's also easy to export data from tables:
$ hadoop fs -mkdir /user/data/exttabs/export

insert overwrite directory 'hdfs://hadoop-master:9000/user/data/exttabs/export'
select * from ext_tabs.stocks where year=2017;

$ hadoop fs -ls /user/data/exttabs/export
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 129 2018-04-09 09:43 /user/data/exttabs/export/000000_0
If more reducers had written output, there would be additional files named 000001_0, 000002_0, and so on.
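By default, INSERT OVERWRITE DIRECTORY writes rows with Hive's internal field delimiter (^A); since Hive 0.11 the output format can be specified in the statement itself. A sketch producing a comma-delimited export (the target path is an assumption for illustration):
insert overwrite directory '/user/data/exttabs/export_csv' -- hypothetical path
row format delimited fields terminated by ','
select * from ext_tabs.stocks where year=2017;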