In the previous post we learned about setting up and running Hive on our distributed Hadoop cluster. In this post we will look at the various Hive input and output formats.
Key Formats
- TEXTFILE
- AVRO
- RCFILE
- SEQUENCEFILE
- PARQUET
We will use the same Hadoop cluster and Hive setup from the previous post.
Usage | Hands On
- TEXTFILE
- A delimited plain text file, e.g. a text file with tab- or comma-separated fields. This is the default format for Hive (depending on the hive.default.fileformat configuration; see the quick check below).
- Syntax: STORED AS TEXTFILE
- Usually human readable text.
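Since the default depends on hive.default.fileformat, you can confirm what your session will use before creating any tables; on a stock installation this should print TextFile:

hive> SET hive.default.fileformat;
hive.default.fileformat=TextFile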
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.044 seconds
hive> CREATE TABLE IF NOT EXISTS users_txt (uid String, login String, full_name String, email String, country String) COMMENT 'User details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
OK
Time taken: 0.384 seconds
- LOAD DATA
hive> LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE user_db.users_txt;
Loading data to table user_db.users_txt
Table user_db.users_txt stats: [numFiles=1, numRows=0, totalSize=7860, rawDataSize=0]
OK
Time taken: 1.138 seconds
- READ RECORDS
hive> SELECT * FROM users_txt LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.696 seconds, Fetched: 10 row(s)
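Because TEXTFILE stores rows verbatim, the loaded file can also be read directly from the warehouse directory on HDFS; assuming LOAD DATA kept the original users.csv file name, the first rows are the raw comma-separated lines:

[anishsneh@server01 installs]$ hadoop fs -cat /data/hive/warehouse/user_db.db/users_txt/users.csv | head -2
755777ae-3d5f-415e-ac33-5d24db748e09,rjones0,Randy,rjones0@archive.org,RU
a4dae376-970e-4548-908e-cbe6bff88550,mmitchell1,Martin,mhamilton1@stumbleupon.com,FI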
- AVRO
- The serialization file format from Apache Avro; each file embeds the schema along with the data, and command-line tools are available for processing it.
- Syntax: STORED AS AVRO
- Usually binary storage.
- DOWNLOAD AVRO TOOL JAR
- Download the Avro tools jar from the Apache Avro website.
- Copy the jar to an accessible location; in our case we download it to /home/anishsneh/installs/avro, as shown below.
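For example, version 1.7.7 (the one used below) can be fetched from the Apache archive; the exact mirror URL may vary:

[anishsneh@server01 installs]$ cd /home/anishsneh/installs/avro
[anishsneh@server01 avro]$ wget https://archive.apache.org/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar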
- CREATE & UPLOAD SCHEMA (to HDFS)
{
    "namespace": "com.anishsneh.demo.hive.avro.serde",
    "name": "user_inline",
    "type": "record",
    "fields": [
        {
            "name": "uid",
            "type": "string",
            "doc": "UUID of user"
        },
        {
            "name": "login",
            "type": "string",
            "doc": "Login name for user"
        },
        {
            "name": "full_name",
            "type": "string",
            "doc": "Full name of user"
        },
        {
            "name": "email",
            "type": "string",
            "doc": "Email address of user"
        },
        {
            "name": "country",
            "type": "string",
            "doc": "Country code of user"
        }
    ]
}
[anishsneh@server01 installs]$ hadoop fs -mkdir /data/avro
[anishsneh@server01 installs]$ hadoop fs -chmod 777 /data/avro
[anishsneh@server01 installs]$ hadoop fs -copyFromLocal /tmp/user.avsc /data/avro
[anishsneh@server01 installs]$ hadoop fs -ls /data/avro
Found 1 items
-rw-r--r-- 3 anishsneh supergroup 580 2015-09-28 17:34 /data/avro/user.avsc
- GENERATE SAMPLE AVRO INPUT (local file system)
[anishsneh@server01 installs]$ cd /home/anishsneh/installs/avro
[anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar fromjson --schema-file /tmp/user.avsc /tmp/user_input.json > /tmp/user_input.avro
[anishsneh@server01 avro]$ ls -ltr /tmp/user_input.avro
-rw-rw-r--. 1 anishsneh anishsneh 8273 Sep 28 17:18 /tmp/user_input.avro
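Here /tmp/user_input.json (not shown in full) contains one JSON object per line matching the schema; for reference, the first record loaded below would look like:

{"uid": "9fb2d6c1-16d8-4b75-8292-7a4f47d49cd3", "login": "jriley0", "full_name": "Jack", "email": "jmartinez0@goo.ne.jp", "country": "IR"}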
- CHECK AVRO CONTENTS
[anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar tojson -pretty /tmp/user_input.avro
- CHECK ASSOCIATED AVRO SCHEMA
[anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar getschema /tmp/user_input.avro
{
"type" : "record",
"name" : "users_inline",
"namespace" : "com.anishsneh.demo.hive.avro.serde",
"fields" : [ {
"name" : "uid",
"type" : "string",
"doc" : "UUID of user"
}, {
"name" : "login",
"type" : "string",
"doc" : "Login name for user"
}, {
"name" : "full_name",
"type" : "string",
"doc" : "Full name of user"
}, {
"name" : "email",
"type" : "string",
"doc:" : "Email address of user"
}, {
"name" : "country",
"type" : "string",
"doc:" : "Country code of user"
} ]
}
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.928 seconds
hive> CREATE TABLE users_avro STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='hdfs://server01:9000/data/avro/user.avsc');
OK
Time taken: 0.792 seconds
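Alternatively, the AvroSerDe accepts the schema inline through the avro.schema.literal table property instead of a file on HDFS; a minimal sketch with the same fields (the table name users_avro_inline is just for illustration):

hive> CREATE TABLE users_avro_inline STORED AS AVRO TBLPROPERTIES ('avro.schema.literal'='{"namespace": "com.anishsneh.demo.hive.avro.serde", "name": "user_inline", "type": "record", "fields": [{"name": "uid", "type": "string"}, {"name": "login", "type": "string"}, {"name": "full_name", "type": "string"}, {"name": "email", "type": "string"}, {"name": "country", "type": "string"}]}');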
- LOAD DATA
hive> SET hive.exec.compress.output=true;
hive> SET avro.output.codec=snappy;
hive> LOAD DATA LOCAL INPATH '/tmp/user_input.avro' INTO TABLE users_avro;
Loading data to table user_db.users_avro
Table user_db.users_avro stats: [numFiles=1, totalSize=8273]
OK
Time taken: 0.616 seconds
- READ RECORDS
hive> SELECT * FROM users_avro LIMIT 10;
OK
9fb2d6c1-16d8-4b75-8292-7a4f47d49cd3 jriley0 Jack jmartinez0@goo.ne.jp IR
89142191-825c-4413-8d34-87053b3683da adunn1 Arthur aramos1@surveymonkey.com PT
b3ea602d-45fd-452f-a02e-78d2d146eb45 bcook2 Brandon byoung2@upenn.edu TW
1fb72bdc-5745-4146-b2e0-6270cc9f1585 rburton3 Ronald rrogers3@google.es PT
24c8272d-05bc-47d2-ba3d-d61a9439882d jcox4 Jason jevans4@nationalgeographic.com SE
fb4d0fa4-91ca-4e19-9f40-3117a717341b gparker5 Gerald gstewart5@abc.net.au ID
8aa884d2-ad40-4f60-9262-c52ab3851fb3 wowens6 Willie whanson6@nasa.gov CN
d66ccccc-2f52-408c-a54e-d00dc0dc0805 aolson7 Arthur abradley7@elpais.com FR
31be75d9-ce74-41ea-858b-a63e6a7c1c0b dhamilton8 Daniel dfowler8@bloglines.com BR
d56e128e-65dd-40b1-ab3c-9bb1ed84a949 dellis9 Dennis djenkins9@51.la VN
Time taken: 0.743 seconds, Fetched: 10 row(s)
- RCFILE
- An internal Hive format (binary), also known as Record Columnar File. RCFile combines data storage formatting, data compression, and data access optimization in a single columnar layout.
- Syntax: STORED AS RCFILE
- Usually binary storage.
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.914 seconds
hive> CREATE TABLE users_rcfile (uid String, login String, full_name String, email String, country String) STORED AS RCFILE;
OK
Time taken: 0.142 seconds
- LOAD DATA
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE user_db.users_rcfile SELECT * FROM user_db.users_txt;
Query ID = anishsneh_20150928165631_5054934a-71e9-4f36-9ad9-98e4b6d463ee
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1443477428845_0006, Tracking URL = http://server01:8088/proxy/application_1443477428845_0006/
Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job -kill job_1443477428845_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-28 16:56:45,144 Stage-1 map = 0%, reduce = 0%
2015-09-28 16:56:53,726 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.33 sec
MapReduce Total cumulative CPU time: 1 seconds 330 msec
Ended Job = job_1443477428845_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_rcfile/.hive-staging_hive_2015-09-28_16-56-31_354_5196185093859930557-1/-ext-10000
Loading data to table user_db.users_rcfile
Table user_db.users_rcfile stats: [numFiles=1, numRows=100, totalSize=4559, rawDataSize=7360]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.33 sec HDFS Read: 11858 HDFS Write: 4638 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 330 msec
OK
Time taken: 23.91 seconds
- READ RECORDS
hive> SELECT * FROM users_rcfile LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.151 seconds, Fetched: 10 row(s)
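To verify which storage classes a table actually uses, DESCRIBE FORMATTED works for any of the tables in this post; for the RCFile table the relevant lines (output trimmed) should name the RCFile input/output formats:

hive> DESCRIBE FORMATTED users_rcfile;
...
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat
...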
- SEQUENCEFILE
- Hadoop's flat file format of binary key-value pairs; it stores data sequentially and should be preferred over a plain text file when the data needs to be compressed.
- Syntax: STORED AS SEQUENCEFILE
- Usually binary storage.
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.912 seconds
hive> CREATE TABLE users_sequencefile (uid String, login String, full_name String, email String, country String) STORED AS SEQUENCEFILE;
OK
Time taken: 0.715 seconds
- LOAD DATA
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE user_db.users_sequencefile SELECT * FROM user_db.users_txt;
Query ID = anishsneh_20150928170529_9ca50caa-e40b-4e73-8346-7c8be84c473e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1443477428845_0007, Tracking URL = http://server01:8088/proxy/application_1443477428845_0007/
Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job -kill job_1443477428845_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-28 17:05:42,960 Stage-1 map = 0%, reduce = 0%
2015-09-28 17:05:54,428 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
MapReduce Total cumulative CPU time: 1 seconds 500 msec
Ended Job = job_1443477428845_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_sequencefile/.hive-staging_hive_2015-09-28_17-05-29_645_6697958115974804848-1/-ext-10000
Loading data to table user_db.users_sequencefile
Table user_db.users_sequencefile stats: [numFiles=1, numRows=100, totalSize=5064, rawDataSize=7760]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.5 sec HDFS Read: 11849 HDFS Write: 5149 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 500 msec
OK
Time taken: 27.09 seconds
- READ RECORDS
hive> SELECT * FROM users_sequencefile LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.265 seconds, Fetched: 10 row(s)
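Note that a SequenceFile is not readable with hadoop fs -cat, but hadoop fs -text understands the format (including its compression) and prints the records; the file name 000000_0 below is the typical Hive output name and may differ in your run:

[anishsneh@server01 installs]$ hadoop fs -text /data/hive/warehouse/user_db.db/users_sequencefile/000000_0 | head -2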
- PARQUET
- A columnar storage format (compressed, binary); it has been supported via a plugin since Hive 0.10 and natively since Hive 0.13.
- Syntax: STORED AS PARQUET (Hive 0.13 and later); the example below uses the older PARQUETFILE alias.
- Usually binary storage.
- CREATE TABLE
hive> USE user_db;
OK
Time taken: 0.055 seconds
hive> CREATE TABLE users_parquet (uid String, login String, full_name String, email String, country String) STORED AS PARQUETFILE;
OK
Time taken: 0.162 seconds
- LOAD DATA
hive> INSERT OVERWRITE TABLE user_db.users_parquet SELECT * FROM user_db.users_txt;
Query ID = anishsneh_20150928162135_31b54099-6dce-4960-bf64-e708916ea153
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1443477428845_0001, Tracking URL = http://server01:8088/proxy/application_1443477428845_0001/
Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job -kill job_1443477428845_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-28 16:21:51,415 Stage-1 map = 0%, reduce = 0%
2015-09-28 16:22:08,324 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.88 sec
MapReduce Total cumulative CPU time: 1 seconds 880 msec
Ended Job = job_1443477428845_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_parquet/.hive-staging_hive_2015-09-28_16-21-35_227_4234786993826549979-1/-ext-10000
Loading data to table user_db.users_parquet
Table user_db.users_parquet stats: [numFiles=1, numRows=100, totalSize=9739, rawDataSize=500]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.88 sec HDFS Read: 11886 HDFS Write: 9818 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
Time taken: 33.499 seconds
- READ RECORDS
hive> SELECT * FROM users_parquet LIMIT 10;
OK
755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
Time taken: 0.107 seconds, Fetched: 10 row(s)
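To inspect the Parquet file itself outside Hive, the parquet-tools jar (a separate download; the version and file name below are assumptions) can dump both the schema and the records:

[anishsneh@server01 installs]$ hadoop fs -copyToLocal /data/hive/warehouse/user_db.db/users_parquet/000000_0 /tmp/users.parquet
[anishsneh@server01 installs]$ java -jar parquet-tools-1.6.0.jar schema /tmp/users.parquet
[anishsneh@server01 installs]$ java -jar parquet-tools-1.6.0.jar head -n 3 /tmp/users.parquet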
All the files used above can be downloaded from:
HDFS Table Structure (Database)
[Figure: "user_db" database structure on HDFS]