Monday, September 28, 2015

Hive | Input & Output Formats

In the previous post we learned how to set up and run Hive on our distributed Hadoop cluster. In this post we will look at the major Hive input and output formats.

Key Formats

  • TEXTFILE
  • AVRO
  • RCFILE
  • SEQUENCEFILE
  • PARQUET
We will use the same Hadoop cluster and Hive setup from the previous post.

Usage | Hands On

  • TEXTFILE
    • Plain text with delimiter-separated fields, e.g. a file with tab- or comma-separated values. This is Hive's default format (controlled by the hive.default.fileformat configuration property).
    • Syntax:  STORED AS TEXTFILE 
    • Usually human readable text.
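    • CHECK DEFAULT FORMAT
      The tables below always declare STORED AS explicitly, but you can check the session default directly; on a stock installation this prints TextFile:
      hive> SET hive.default.fileformat;
      hive.default.fileformat=TextFile
      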
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.044 seconds
      hive> CREATE TABLE IF NOT EXISTS users_txt (uid String, login String, full_name String, email String, country String) COMMENT 'User details' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
      OK
      Time taken: 0.384 seconds
      
    • LOAD DATA
      hive> LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE user_db.users_txt;
      Loading data to table user_db.users_txt
      Table user_db.users_txt stats: [numFiles=1, numRows=0, totalSize=7860, rawDataSize=0]
      OK
      Time taken: 1.138 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_txt LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.696 seconds, Fetched: 10 row(s)
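      
    • VERIFY STORAGE FORMAT
      To confirm which format a table really uses, inspect its metadata; for a TEXTFILE table the relevant lines list org.apache.hadoop.mapred.TextInputFormat as InputFormat and org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat as OutputFormat (full output omitted here):
      hive> DESCRIBE FORMATTED users_txt;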
      
  • AVRO
    • The serialization file format from Apache Avro; each file embeds its schema alongside the data, and command-line tools are available for processing it.
    • Syntax:  STORED AS AVRO
    • Usually binary storage.
    • DOWNLOAD AVRO TOOL JAR
      • Download the Avro tools jar from the Apache Avro website.
      • Copy the jar to an accessible location; in our case we download it to /home/anishsneh/installs/avro
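      • For example, the Avro 1.7.7 tools jar can be fetched from the Apache archive (the URL below follows the usual archive layout and may differ by mirror or version):
        [anishsneh@server01 installs]$ wget -P /home/anishsneh/installs/avro http://archive.apache.org/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
        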
    • CREATE & UPLOAD SCHEMA (to HDFS)
      {
        "namespace": "com.anishsneh.demo.hive.avro.serde",
        "name": "user_inline",
        "type": "record",
        "fields": [
          {
            "name":"uid",
            "type":"string",
            "doc":"UUID of user"
          },
          {
            "name":"login",
            "type":"string",
            "doc":"Login name for user"
          },
          {
            "name":"full_name",
            "type":"string",
            "doc":"Full name of user"
          },
          {
            "name":"email",
            "type":"string",
            "doc:":"Email address of user"
          },
          {
            "name":"country",
            "type":"string",
            "doc:":"Country code of user"
          }
        ]
      }
      
      [anishsneh@server01 installs]$ hadoop fs -mkdir /data/avro
      
      [anishsneh@server01 installs]$ hadoop fs -chmod 777 /data/avro
      
      [anishsneh@server01 installs]$ hadoop fs -copyFromLocal /tmp/user.avsc /data/avro
      
      [anishsneh@server01 installs]$ hadoop fs -ls /data/avro
      Found 1 items
      -rw-r--r--   3 anishsneh supergroup        580 2015-09-28 17:34 /data/avro/user.avsc
      
    • GENERATE SAMPLE AVRO INPUT (local file system)
      [anishsneh@server01 installs]$ cd /home/anishsneh/installs/avro
      [anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar fromjson --schema-file /tmp/user.avsc /tmp/user_input.json > /tmp/user_input.avro
      [anishsneh@server01 avro]$ ls -ltr /tmp/user_input.avro
      -rw-rw-r--. 1 anishsneh anishsneh 8273 Sep 28 17:18 /tmp/user_input.avro
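      
      Note that fromjson expects one JSON object per line, with fields matching the schema; a record in /tmp/user_input.json looks like this (values taken from the sample data queried below):
      {"uid": "9fb2d6c1-16d8-4b75-8292-7a4f47d49cd3", "login": "jriley0", "full_name": "Jack", "email": "jmartinez0@goo.ne.jp", "country": "IR"}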
      
    • CHECK AVRO CONTENTS
      [anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar tojson -pretty /tmp/user_input.avro
      
    • CHECK ASSOCIATED AVRO SCHEMA
      [anishsneh@server01 avro]$ java -jar avro-tools-1.7.7.jar getschema /tmp/user_input.avro
      {
        "type" : "record",
        "name" : "users_inline",
        "namespace" : "com.anishsneh.demo.hive.avro.serde",
        "fields" : [ {
          "name" : "uid",
          "type" : "string",
          "doc" : "UUID of user"
        }, {
          "name" : "login",
          "type" : "string",
          "doc" : "Login name for user"
        }, {
          "name" : "full_name",
          "type" : "string",
          "doc" : "Full name of user"
        }, {
          "name" : "email",
          "type" : "string",
          "doc:" : "Email address of user"
        }, {
          "name" : "country",
          "type" : "string",
          "doc:" : "Country code of user"
        } ]
      }
      
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.928 seconds
      hive> CREATE TABLE users_avro STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='hdfs://server01:9000/data/avro/user.avsc');
      OK
      Time taken: 0.792 seconds
      
    • LOAD DATA
      hive> SET hive.exec.compress.output=true;
      hive> SET avro.output.codec=snappy;
      hive> LOAD DATA LOCAL INPATH '/tmp/user_input.avro' INTO TABLE users_avro;
      Loading data to table user_db.users_avro
      Table user_db.users_avro stats: [numFiles=1, totalSize=8273]
      OK
      Time taken: 0.616 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_avro LIMIT 10;
      OK
      9fb2d6c1-16d8-4b75-8292-7a4f47d49cd3 jriley0 Jack jmartinez0@goo.ne.jp IR
      89142191-825c-4413-8d34-87053b3683da adunn1 Arthur aramos1@surveymonkey.com PT
      b3ea602d-45fd-452f-a02e-78d2d146eb45 bcook2 Brandon byoung2@upenn.edu TW
      1fb72bdc-5745-4146-b2e0-6270cc9f1585 rburton3 Ronald rrogers3@google.es PT
      24c8272d-05bc-47d2-ba3d-d61a9439882d jcox4 Jason jevans4@nationalgeographic.com SE
      fb4d0fa4-91ca-4e19-9f40-3117a717341b gparker5 Gerald gstewart5@abc.net.au ID
      8aa884d2-ad40-4f60-9262-c52ab3851fb3 wowens6 Willie whanson6@nasa.gov CN
      d66ccccc-2f52-408c-a54e-d00dc0dc0805 aolson7 Arthur abradley7@elpais.com FR
      31be75d9-ce74-41ea-858b-a63e6a7c1c0b dhamilton8 Daniel dfowler8@bloglines.com BR
      d56e128e-65dd-40b1-ab3c-9bb1ed84a949 dellis9 Dennis djenkins9@51.la VN
      Time taken: 0.743 seconds, Fetched: 10 row(s)
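      
    • ALTERNATIVE | INLINE SCHEMA
      Instead of referencing a schema file on HDFS via avro.schema.url, the Avro schema can be embedded in the table definition with avro.schema.literal. A minimal sketch (hypothetical table name, doc attributes dropped for brevity):
      hive> CREATE TABLE users_avro_inline STORED AS AVRO TBLPROPERTIES ('avro.schema.literal'='{"namespace":"com.anishsneh.demo.hive.avro.serde","name":"users_inline","type":"record","fields":[{"name":"uid","type":"string"},{"name":"login","type":"string"},{"name":"full_name","type":"string"},{"name":"email","type":"string"},{"name":"country","type":"string"}]}');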
      
  • RCFILE
    • An internal Hive binary format, also known as Record Columnar File. RCFile stores data column-wise within row groups, combining an efficient storage layout, data compression, and optimized column-oriented data access in one format.
    • Syntax:  STORED AS RCFILE
    • Usually binary storage.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.914 seconds
      hive> CREATE TABLE users_rcfile (uid String, login String, full_name String, email String, country String) STORED AS RCFILE;
      OK
      Time taken: 0.142 seconds
      
    • LOAD DATA
      hive> SET hive.exec.compress.output=true;
      hive> SET mapred.max.split.size=256000000;
      hive> SET mapred.output.compression.type=BLOCK;
      hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
      hive> SET hive.exec.dynamic.partition.mode=nonstrict;
      hive> SET hive.exec.dynamic.partition=true;
      hive> INSERT OVERWRITE TABLE user_db.users_rcfile SELECT * FROM user_db.users_txt;
      Query ID = anishsneh_20150928165631_5054934a-71e9-4f36-9ad9-98e4b6d463ee
      Total jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks is set to 0 since there's no reduce operator
      Starting Job = job_1443477428845_0006, Tracking URL = http://server01:8088/proxy/application_1443477428845_0006/
      Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job  -kill job_1443477428845_0006
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2015-09-28 16:56:45,144 Stage-1 map = 0%,  reduce = 0%
      2015-09-28 16:56:53,726 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.33 sec
      MapReduce Total cumulative CPU time: 1 seconds 330 msec
      Ended Job = job_1443477428845_0006
      Stage-4 is selected by condition resolver.
      Stage-3 is filtered out by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_rcfile/.hive-staging_hive_2015-09-28_16-56-31_354_5196185093859930557-1/-ext-10000
      Loading data to table user_db.users_rcfile
      Table user_db.users_rcfile stats: [numFiles=1, numRows=100, totalSize=4559, rawDataSize=7360]
      MapReduce Jobs Launched: 
      Stage-Stage-1: Map: 1   Cumulative CPU: 1.33 sec   HDFS Read: 11858 HDFS Write: 4638 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 330 msec
      OK
      Time taken: 23.91 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_rcfile LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.151 seconds, Fetched: 10 row(s)
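      
    • INSPECT STORED FILE
      RCFile is binary, so hadoop fs -cat output is unreadable; Hive ships an rcfilecat service for dumping RCFile contents as text. A sketch, assuming the warehouse path from the job output above and the default generated file name 000000_0:
      [anishsneh@server01 installs]$ hadoop fs -ls /data/hive/warehouse/user_db.db/users_rcfile
      [anishsneh@server01 installs]$ hive --service rcfilecat /data/hive/warehouse/user_db.db/users_rcfile/000000_0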
      
  • SEQUENCEFILE
    • Hadoop's flat file format of binary key-value records, written sequentially. Prefer it over a plain text file when the data needs to be compressed, since sequence files support record- and block-level compression while remaining splittable.
    • Syntax:  STORED AS SEQUENCEFILE
    • Usually binary storage.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.912 seconds
      hive> CREATE TABLE users_sequencefile (uid String, login String, full_name String, email String, country String) STORED AS SEQUENCEFILE;
      OK
      Time taken: 0.715 seconds
      
    • LOAD DATA
      hive> SET hive.exec.compress.output=true;
      hive> SET mapred.max.split.size=256000000;
      hive> SET mapred.output.compression.type=BLOCK;
      hive> SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
      hive> SET hive.exec.dynamic.partition.mode=nonstrict;
      hive> SET hive.exec.dynamic.partition=true;
      hive> INSERT OVERWRITE TABLE user_db.users_sequencefile SELECT * FROM user_db.users_txt;
      Query ID = anishsneh_20150928170529_9ca50caa-e40b-4e73-8346-7c8be84c473e
      Total jobs = 3
      Launching Job 1 out of 3
      Number of reduce tasks is set to 0 since there's no reduce operator
      Starting Job = job_1443477428845_0007, Tracking URL = http://server01:8088/proxy/application_1443477428845_0007/
      Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job  -kill job_1443477428845_0007
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2015-09-28 17:05:42,960 Stage-1 map = 0%,  reduce = 0%
      2015-09-28 17:05:54,428 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.5 sec
      MapReduce Total cumulative CPU time: 1 seconds 500 msec
      Ended Job = job_1443477428845_0007
      Stage-4 is selected by condition resolver.
      Stage-3 is filtered out by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_sequencefile/.hive-staging_hive_2015-09-28_17-05-29_645_6697958115974804848-1/-ext-10000
      Loading data to table user_db.users_sequencefile
      Table user_db.users_sequencefile stats: [numFiles=1, numRows=100, totalSize=5064, rawDataSize=7760]
      MapReduce Jobs Launched: 
      Stage-Stage-1: Map: 1   Cumulative CPU: 1.5 sec   HDFS Read: 11849 HDFS Write: 5149 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 500 msec
      OK
      Time taken: 27.09 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_sequencefile LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.265 seconds, Fetched: 10 row(s)
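      
    • INSPECT STORED FILE
      Unlike hadoop fs -cat, hadoop fs -text understands sequence files (and the common compression codecs), which makes it handy for a quick sanity check. A sketch, assuming the warehouse path from the job output above and the default generated file name 000000_0:
      [anishsneh@server01 installs]$ hadoop fs -text /data/hive/warehouse/user_db.db/users_sequencefile/000000_0 | head -5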
      
  • PARQUET
    • A compressed, binary columnar storage format. Parquet was supported via a plugin from Hive 0.10 and is supported natively in Hive 0.13 and later.
    • Syntax: STORED AS PARQUETFILE (Hive 0.13 and later also accept STORED AS PARQUET)
    • Usually binary storage.
    • CREATE TABLE
      hive> USE user_db;
      OK
      Time taken: 0.055 seconds
      hive> CREATE TABLE users_parquet (uid String, login String, full_name String, email String, country String) STORED AS PARQUETFILE;
      OK
      Time taken: 0.162 seconds
      
    • LOAD DATA
      hive> INSERT OVERWRITE TABLE user_db.users_parquet SELECT * FROM user_db.users_txt;
      Query ID = anishsneh_20150928162135_31b54099-6dce-4960-bf64-e708916ea153
      Total jobs = 3
      Launching Job 1 out of 3
      Number of reduce tasks is set to 0 since there's no reduce operator
      Starting Job = job_1443477428845_0001, Tracking URL = http://server01:8088/proxy/application_1443477428845_0001/
      Kill Command = /home/anishsneh/installs/hadoop-2.2.0/bin/hadoop job  -kill job_1443477428845_0001
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2015-09-28 16:21:51,415 Stage-1 map = 0%,  reduce = 0%
      2015-09-28 16:22:08,324 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.88 sec
      MapReduce Total cumulative CPU time: 1 seconds 880 msec
      Ended Job = job_1443477428845_0001
      Stage-4 is selected by condition resolver.
      Stage-3 is filtered out by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Moving data to: hdfs://server01:9000/data/hive/warehouse/user_db.db/users_parquet/.hive-staging_hive_2015-09-28_16-21-35_227_4234786993826549979-1/-ext-10000
      Loading data to table user_db.users_parquet
      Table user_db.users_parquet stats: [numFiles=1, numRows=100, totalSize=9739, rawDataSize=500]
      MapReduce Jobs Launched: 
      Stage-Stage-1: Map: 1   Cumulative CPU: 1.88 sec   HDFS Read: 11886 HDFS Write: 9818 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 880 msec
      OK
      Time taken: 33.499 seconds
      
    • READ RECORDS
      hive> SELECT * FROM users_parquet LIMIT 10;
      OK
      755777ae-3d5f-415e-ac33-5d24db748e09 rjones0 Randy rjones0@archive.org RU
      a4dae376-970e-4548-908e-cbe6bff88550 mmitchell1 Martin mhamilton1@stumbleupon.com FI
      f4781787-c731-4db6-add2-13ab91de22a0 pharvey2 Peter pkim2@com.com FR
      d35df636-a7c8-4c50-aa57-e99db4cbdb1a gjames3 Gary gtorres3@bbb.org LT
      d26c04a3-ca28-4d2e-84cf-0104ad2acb92 rburton4 Russell rwest4@youtube.com YE
      6a487cfb-5177-4cc2-bdbd-4bc4751b9592 pharris5 Patrick ptaylor5@cnn.com NO
      3671d7f7-2a75-41dc-be84-609106e5bdfa kcrawford6 Keith ksmith6@weibo.com PT
      beae01c4-3ee6-4c59-b0d6-60c5811367f2 jedwards7 Juan joliver7@fc2.com PH
      899dc8a4-5a8f-44cf-ac23-ae8c3729836c slynch8 Samuel smcdonald8@princeton.edu VN
      f274e93d-378c-4377-a9c7-7c235a36b72a mgray9 Martin mrodriguez9@constantcontact.com IE
      Time taken: 0.107 seconds, Fetched: 10 row(s)
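      
    • VERIFY STORAGE FORMAT
      As with the other formats, the table metadata confirms the Parquet classes in use; expect ParquetHiveSerDe together with MapredParquetInputFormat/MapredParquetOutputFormat in the output:
      hive> DESCRIBE FORMATTED users_parquet;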
      
All the files used above can be downloaded from:

HDFS Table Structure (Database)

"user_db" database structure on HDFS