Load CSV into Hive table (example: Bangkok Districts) (Not finished article)


In this article I begin big preparation to finish research work - Using Big Data by police to search serial criminal.

Article 1. Dictionaries preparation.
We need prepare and load into hive database some new dictionaries- like Districts, road web cams, cars government registration numbers and cars owners information.

First of all I download wikipedia table - List of districts of Bangkok and save it with Excel as csv file with name bkk_dist_csv.csv


Data looks like

Download file into Hive table

$ hadoop fs -mkdir /user/data/dics 
Load with Hadoop WebUI and check 
$ hadoop fs -ls /user/data/dics/ 
Found 1 items -rw-r--r-- 3 dr.who supergroup 1202 2018-04-03 12:01 /user/data/dics/bkk_dist_csv.csv

And load it into Hive table
drop table IF EXISTS d_src;

CREATE EXTERNAL TABLE d_src(
     code    string,
     eng_name   string, 
     thai_Name  string
    )
    COMMENT 'source external table for loading scv'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
    location '/user/data/dics/bkk_dist_csv.csv';

select * from d_src

d_src.code |d_src.eng_name |d_src.thai_name |
-----------|---------------|----------------|

-- load csv with hadoop web interface or with copyFromLocal/put

create table d_bkk_dist(
     code     int,
     eng_name   string, 
     thai_Name  string
)
COMMENT 'Dictionary - List of districts of Bangkok'
STORED AS ORC;

INSERT OVERWRITE TABLE d_bkk_dist
select 
cast(replace(d.code,' ','') as int) as code,
replace(d.eng_name,' ','')  as eng_name,
replace(d.thai_name,' ','') as thai_name
from d_src d
where locate('Code', d.code)=0 

drop table IF EXISTS d_src;
And now we can see results in Dbeaver


Now we need create WebCams and set their properties. Picture and simple description.
Look on this zoomed image.




There are road web cams, each cam has a unique ID, parent district ID.

For test reasons we need addition information, where we can go from the camera (for generate more real test data).
We use equal time for way between different cameras, in real data of course it's important.


CAM_ID    DESTIN_CAM_ID
1                 2                             
1                 3                               
1                 4                             
1                 5                             
1                 6          


I manually print full map, manually write cameras ID, and populate reference table as csv file. The purpose of this all - car that registered in WebCamera (1) next time can be registered only in 2,3,4,5,6. (Fe: not in 1- 8, 3 - 12).






Комментарии

Популярные сообщения из этого блога

Hadoop 3.0 cluster - installation, configuration, tests on Cent OS 7

Loading data into Spark from Oracle RDBMS, CSV

Load data from Cassandra to HDFS parquet files and select with Hive