
The ORC File Storage Format and Creating ORC Tables in Hive

Big Data Tech | Jiuwanli Big Data (九万里大数据) | 07-30

Advantages of ORC

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with the RCFile format, for example, the ORC file format has many advantages, such as:

  • a single file as the output of each task, which reduces the NameNode’s load
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
  • light-weight indexes stored within the file
    • skip row groups that don’t pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers, which allows addition and removal of fields
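Two of the encodings named above are easy to see in miniature. The sketch below is illustrative only (it is not ORC's actual on-disk encoding, which packs these into compressed byte streams): run-length encoding collapses repeated integer values, and dictionary encoding replaces repeated strings with small integer ids.

```python
def run_length_encode(values):
    """Collapse runs of repeated integers into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dictionary_encode(values):
    """Replace repeated strings with integer ids plus a shared dictionary."""
    dictionary, ids, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

print(run_length_encode([7, 7, 7, 3, 3]))          # [(7, 3), (3, 2)]
print(dictionary_encode(["Tom", "Jelly", "Tom"]))  # (['Tom', 'Jelly'], [0, 1, 0])
```

Both schemes shrink columnar data precisely because values within one column tend to repeat, which is why ORC picks the encoding per column type.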

ORC File Structure

[Figure: ORC file structure diagram]

An ORC file is made up of stripes, a file footer, and a postscript.

  • file footer
    Contains a list of the stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.
  • postscript
    Holds the compression parameters and the size of the compressed footer.
  • stripe
    Each stripe in an ORC file holds index data, row data, and a stripe footer.
    The index data includes min and max values for each column and the row positions within each column.
    The row data itself is what is used in table scans.
    The stripe footer contains a directory of stream locations.
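The per-column min/max values in the index data are what let a reader skip row groups that cannot satisfy a predicate. A minimal sketch, with made-up statistics (real stripes carry one index entry per orc.row.index.stride rows, 10,000 by default):

```python
# Hypothetical per-row-group index statistics for one integer column.
row_group_stats = [
    {"min": 100,  "max": 900},
    {"min": 950,  "max": 2000},
    {"min": 2100, "max": 5000},
]

def groups_to_read(stats, predicate_value):
    """Keep only row groups whose [min, max] range can contain the value.

    Mirrors the idea behind ORC predicate pushdown: a row group whose
    range excludes the value is skipped without reading its row data.
    """
    return [i for i, s in enumerate(stats)
            if s["min"] <= predicate_value <= s["max"]]

print(groups_to_read(row_group_stats, 1000))  # [1] -> two of three groups skipped
```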

Hive Table Properties

These parameters are set in the table's TBLPROPERTIES clause.

Key                       Default      Notes
orc.compress              ZLIB         high-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size         262,144      compression chunk size in bytes (256 KB)
orc.stripe.size           67,108,864   memory buffer in bytes for writing (64 MB)
orc.row.index.stride      10,000       number of rows between index entries
orc.create.index          true         whether to create row indexes
orc.bloom.filter.columns  ""           comma-separated list of column names to build bloom filters for
orc.bloom.filter.fpp      0.05         bloom filter false positive probability (must be >0.0 and <1.0)
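A quick way to relate two of these defaults: the number of lightweight index entries written per column in a stripe is the row count divided by orc.row.index.stride, rounded up. A hypothetical helper (not part of any Hive or ORC API) to illustrate the arithmetic:

```python
import math

def index_entries(num_rows, stride=10_000):
    """Index entries per column for a stripe of num_rows rows,
    given an orc.row.index.stride of `stride` rows per entry."""
    return math.ceil(num_rows / stride)

print(index_entries(1_000_000))  # 100 entries per column at the default stride
```

A larger stride means fewer, coarser index entries (less metadata, less skipping precision); a smaller stride means the opposite.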

Creating an ORC Table

CREATE TABLE IF NOT EXISTS bank.account_orc (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int
  )
partitioned by(ds string)
stored as orc;

If no compression property is set, ZLIB compression is used by default. The property can also be set to NONE or SNAPPY.

Viewing the ORC Table DDL

SHOW CREATE TABLE bank.account_orc;

CREATE TABLE `bank.account_orc`(
  `id_card` int, 
  `tran_time` string, 
  `name` string, 
  `cash` int)
PARTITIONED BY ( 
  `ds` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1627435885')

Inserting Sample Data into the ORC Table

INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1000, '2020-09-21 14:30:00', 'Tom', 100);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:05', 'Tom', 50);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:10', 'Tom', -25);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:00', 'Jelly', 200);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:05', 'Jelly', -50);

Inspecting the Underlying Files

Each INSERT statement above ran as a separate job and wrote its own file, so the ds=2020-09-20 partition holds two files and ds=2020-09-21 holds three (the _copy_N suffixes).

[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20                     
Found 2 items
-rwxrwx--x+  3 hive hive        519 2021-07-28 09:33 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
-rwxrwx--x+  3 hive hive        519 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
[root@jwldata.com ~]# 
[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21
Found 3 items
-rwxrwx--x+  3 hive hive        516 2021-07-28 09:32 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0
-rwxrwx--x+  3 hive hive        528 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_1
-rwxrwx--x+  3 hive hive        528 2021-07-28 09:35 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_2

Dumping ORC File Metadata

hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0

[root@jwldata.com ~]# hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
Structure for hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
File Version: 0.12 with HIVE_8732
21/07/28 09:52:12 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int,_col1:string,_col2:string,_col3:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
    Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
    Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
    Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
  Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
  Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
  Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

Stripes:
  Stripe: offset: 3 data: 53 rows: 1 tail: 65 index: 136
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 27
    Stream: column 2 section ROW_INDEX start: 41 length 45
    Stream: column 3 section ROW_INDEX start: 86 length 29
    Stream: column 4 section ROW_INDEX start: 115 length 24
    Stream: column 1 section DATA start: 139 length 7
    Stream: column 2 section DATA start: 146 length 22
    Stream: column 2 section LENGTH start: 168 length 6
    Stream: column 3 section DATA start: 174 length 6
    Stream: column 3 section LENGTH start: 180 length 6
    Stream: column 4 section DATA start: 186 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2

File length: 519 bytes
Padding length: 0 bytes
Padding ratio: 0%
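The column statistics in this dump are plain text, so they can be post-processed with a small script. A sketch of a line parser, with the line format assumed from the sample output above (this is scraping CLI output, not an official API):

```python
import re

# Matches lines like:
#   "Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000"
# min/max/sum are optional because Column 0 (the whole-row column) omits them.
pattern = re.compile(
    r"Column (\d+): count: (\d+) hasNull: (\w+)"
    r"(?: min: (\S+) max: (\S+) sum: (\S+))?")

line = "Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000"
m = pattern.search(line)
col, count, has_null, mn, mx, sm = m.groups()
print(col, count, has_null, mn, mx, sm)  # 1 1 false 1000 1000 1000
```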

Dumping ORC File Data

hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0

[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
21/07/28 09:53:08 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:05","_col2":"Tom","_col3":50}
[root@jwldata.com ~]# 
[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
21/07/28 09:53:50 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:10","_col2":"Tom","_col3":-25}

Creating an ORC Table with Snappy Compression

CREATE TABLE IF NOT EXISTS bank.account_orc_snappy (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int
  )
partitioned by(ds string)
stored as orc
TBLPROPERTIES ("orc.compression"="SNAPPY");

Viewing the DDL of the Snappy-Compressed ORC Table

SHOW CREATE TABLE bank.account_orc_snappy;

CREATE TABLE `bank.account_orc_snappy`(
  `id_card` int, 
  `tran_time` string, 
  `name` string, 
  `cash` int)
PARTITIONED BY ( 
  `ds` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc_snappy'
TBLPROPERTIES (
  'orc.compression'='SNAPPY', 
  'transient_lastDdlTime'='1627542655')

