JSON in Hive 2.3.2 With Hive-JSON-Serde (Load data,Query tables, complex structure)

previous posts: hadoop cluster and hive installation

In this topic we will go through some steps: create local json, load it into hdfs, creation external hive table, queries to this table and etc.

First of all create local json file with notepad or vi, name it js1.json and populate with this data:


{"ts":1520318907,"device":1,"metric":"p","value":100}
{"ts":1520318908,"device":2,"metric":"p","value":110}
{"ts":1520318909,"device":1,"metric":"v","value":8}
{"ts":1520318910,"device":2,"metric":"v","value":9}
{"ts":1520318911,"device":1,"metric":"p","value":120}
{"ts":1520318912,"device":2,"metric":"p","value":140}
{"ts":1520318913,"device":1,"metric":"v","value":10}
{"ts":1520318914,"device":2,"metric":"v","value":11}


Load this file from local to hdfs. You can use Hadoop web interface or hdfs put command. you can create new directory in HDFS with command:
$ hadoop fs -mkdir /user/data/js_db




$ hadoop fs -ls /user/data/js_db
Found 3 items
-rw-r--r--   3 dr.who supergroup        438 2018-03-16 09:29 /user/data/js_db/js1.json
-rw-r--r--   3 dr.who supergroup        438 2018-03-15 15:47 /user/data/js_db/js2.json
-rw-r--r--   3 dr.who supergroup        334 2018-03-15 15:48 /user/data/js_db/js3.json


Next we need download json SerDe library

[hadoop@hadoop-master lib]$ whoami
hadoop
[hadoop@hadoop-master lib]$ pwd
/opt/hive/lib
[hadoop@hadoop-master lib]$
wget http://www.congiu.net/hive-json-serde/1.3.8/hdp23/json-serde-1.3.8-jar-with-dependencies.jar

restart hiveserver2 (or try hive> ADD JAR json-serde-1.3.8-jar-with-dependencies.jar)
Next step is creating hive table and load json data.




DROP TABLE json1;

CREATE EXTERNAL TABLE default.json1(
  ts     INTEGER,
  device INTEGER,
  metric string,
  value  DOUBLE
  )
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE;
  
select * from json1

json1.ts |json1.device |json1.metric |json1.value |
---------|-------------|-------------|------------|

LOAD DATA INPATH 'hdfs://10.242.5.88:9000/user/data/js_db/js1.json' OVERWRITE INTO TABLE json1;

Pay attention that now js1.json removed from hdfs.

$ hadoop fs -ls /user/data/js_db
Found 2 items
-rw-r--r--   3 dr.who supergroup        438 2018-03-16 09:48 /user/data/js_db/js2.json
-rw-r--r--   3 dr.who supergroup        334 2018-03-16 09:48 /user/data/js_db/js3.json
  
select * from json1

json1.ts   |json1.device |json1.metric |json1.value |
-----------|-------------|-------------|------------|
1520318907 |1            |p            |100         |
1520318908 |2            |p            |110         |
1520318909 |1            |v            |8           |
1520318910 |2            |v            |9           |
1520318911 |1            |p            |120         |
1520318912 |2            |p            |140         |
1520318913 |1            |v            |10          |
1520318914 |2            |v            |11          |


I use DBeaver for HQL and work with Hive


In previous example we create table in Hive and load json into it, the next step is to create Hive table that reads data direct from json file.


DROP DATABASE js_db;

CREATE DATABASE js_db;

DROP TABLE json2;

CREATE EXTERNAL TABLE js_db.json2(
  ts     INTEGER,
  device INTEGER,
  metric string,
  value  DOUBLE
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'hdfs://10.242.5.88:9000/user/data/js_db';

select * from js_db.json2

json2.ts   |json2.device |json2.metric |json2.value |
-----------|-------------|-------------|------------|
1520318907 |1            |p            |100         |
1520318908 |2            |p            |110         |
1520318909 |1            |v            |8           |
1520318910 |2            |v            |9           |
1520318911 |1            |p            |120         |
1520318912 |2            |p            |140         |
1520318913 |1            |v            |10          |
1520318914 |2            |v            |11          |
1520318907 |1            |             |100         |
1520318908 |2            |             |110         |
1520318909 |1            |             |8           |
1520318910 |2            |             |9           |
1520318911 |1            |             |120         |


Hive read all json files from location hdfs://10.242.5.88:9000/user/data/js_db And in one file I remove field metric. Hive read it but filed metric is empty.
This approach can be useful if you upload json file into one location frequently, special from different sources, but json with similar structure.  For example you can create schedule job that read files from this location and reload it into Hive table and then delete sources.

Example of reading json with more difficult nested structure.


create and upload into hdfs js_nested.json

{"ts":1520318907,"conveyor_num":1, data : [{"metric":"temp","value":120},{"metric":"speed","value":1000}]}
{"ts":1520318907,"conveyor_num":2, data : [{"metric":"temp","value":240},{"metric":"speed","value":1000}]} 
{"ts":1520318908,"conveyor_num":1, data : [{"metric":"temp","value":130},{"metric":"speed","value":1200}]} 
{"ts":1520318908,"conveyor_num":2, data : [{"metric":"temp","value":260},{"metric":"speed","value":800}]} 

DROP TABLE json_nested_test;

CREATE TABLE json_nested_test (
  ts           INTEGER,
  conveyor_num INTEGER,
  data         array>
 )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs://10.242.5.88:9000/user/data/js_db/js_nested.json' OVERWRITE INTO TABLE  json_nested_test;

select 
      nt.ts,
      nt.conveyor_num,
      nt.data[0].metric,
      nt.data[0].value,
      nt.data[1].metric,
      nt.data[1].value 
  from 
      json_nested_test nt
      
nt.ts      |nt.conveyor_num |metric |value |metric |value |
-----------|----------------|-------|------|-------|------|
1520318907 |1               |temp   |120   |speed  |1000  |
1520318907 |2               |temp   |240   |speed  |1000  |
1520318908 |1               |temp   |130   |speed  |1200  |
1520318908 |2               |temp   |260   |speed  |800   |

more complex query with sum and FILTER

select 
      nt.conveyor_num,
      sum(nt.data[0].value*100 + nt.data[1].value/2) as calc_value 
  from 
      json_nested_test nt
 where nt.conveyor_num=1
group by nt.conveyor_num

nt.conveyor_num |calc_value |
----------------|-----------|
1               |26100      |


More examples you can find here (Hive-JSON-Serde)

Комментарии

Популярные сообщения из этого блога

Hadoop 3.0 cluster - installation, configuration, tests on Cent OS 7

Loading data into Spark from Oracle RDBMS, CSV

Load data from Cassandra to HDFS parquet files and select with Hive