Advantages of ORC
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
Compared with RCFile format, for example, ORC file format has many advantages such as:
- a single file as the output of each task, which reduces the NameNode’s load
- Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
- light-weight indexes stored within the file
- skip row groups that don’t pass predicate filtering
- seek to a given row
- block-mode compression based on data type
- run-length encoding for integer columns
- dictionary encoding for string columns
- concurrent reads of the same file using separate RecordReaders
- ability to split files without scanning for markers
- bound the amount of memory needed for reading or writing
- metadata stored using Protocol Buffers, which allows addition and removal of fields
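The run-length and dictionary encodings mentioned above can be illustrated with a toy sketch. This is plain Python showing only the idea, not ORC's actual on-disk encoding (ORC uses RLEv1/RLEv2 with deltas and bit-packing):

```python
def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def dictionary_encode(strings):
    """Replace repeated strings with small integer ids plus a dictionary."""
    dictionary, ids, index = [], [], {}
    for s in strings:
        if s not in index:
            index[s] = len(dictionary)
            dictionary.append(s)
        ids.append(index[s])
    return dictionary, ids

print(run_length_encode([7, 7, 7, 7, 2, 2, 9]))
# [(7, 4), (2, 2), (9, 1)]
print(dictionary_encode(["Tom", "Jelly", "Tom", "Tom"]))
# (['Tom', 'Jelly'], [0, 1, 0, 0])
```

Both encodings shrink columns with low cardinality or long runs, which is why ORC applies them per column type rather than compressing the whole row blindly.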
ORC File Structure
An ORC file consists of stripes, a file footer, and a postscript.
- file footer: contains a list of stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates: count, min, max, and sum.
- postscript: holds compression parameters and the size of the compressed footer.
- stripe: each stripe in an ORC file holds index data, row data, and a stripe footer.
  - index data includes min and max values for each column and the row positions within each column.
  - row data is used in table scans.
  - stripe footer contains a directory of stream locations.
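The index data is what makes predicate pushdown possible: a reader compares the query predicate against each stripe's min/max statistics and skips stripes that cannot match. A minimal sketch of that decision, using hypothetical statistics rather than values read from a real file:

```python
# Hypothetical per-stripe statistics for a `cash` column, mimicking
# what ORC's index data stores; a real reader gets these from the file.
stripes = [
    {"id": 1, "cash_min": -50, "cash_max": 100},
    {"id": 2, "cash_min": 150, "cash_max": 300},
    {"id": 3, "cash_min": -25, "cash_max": 60},
]

def stripes_to_read(stripes, lo, hi):
    """Keep only stripes whose [min, max] range can overlap [lo, hi]."""
    return [s["id"] for s in stripes
            if s["cash_max"] >= lo and s["cash_min"] <= hi]

# WHERE cash BETWEEN 120 AND 200 -> only stripe 2 needs to be read.
print(stripes_to_read(stripes, 120, 200))
# [2]
```

The same check applies at the finer row-group granularity (every orc.row.index.stride rows), so a selective predicate can skip most of a file's I/O.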
Hive Table Properties
These parameters are set in the TBLPROPERTIES clause of the CREATE TABLE statement.
| Key | Default | Notes |
|---|---|---|
| orc.compress | ZLIB | high-level compression: one of NONE, ZLIB, SNAPPY |
| orc.compress.size | 262,144 | compression chunk size in bytes (256 KB) |
| orc.stripe.size | 67,108,864 | memory buffer in bytes for writing (64 MB) |
| orc.row.index.stride | 10,000 | number of rows between index entries |
| orc.create.index | true | whether to create row indexes |
| orc.bloom.filter.columns | "" | comma-separated list of column names for which bloom filters should be created |
| orc.bloom.filter.fpp | 0.05 | bloom filter false-positive probability (must be >0.0 and <1.0) |
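orc.bloom.filter.fpp trades space for accuracy: for n distinct values and a target false-positive probability p, the standard bloom-filter sizing formulas give the required bits and hash-function count. The calculation below uses those textbook formulas, not ORC's source code:

```python
import math

def bloom_sizing(n, p):
    """Standard bloom filter sizing: m = -n*ln(p)/(ln 2)^2 bits,
    k = (m/n)*ln(2) hash functions, for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# With the default fpp of 0.05 and one bloom filter per 10,000-row
# index stride, each filter needs roughly:
m, k = bloom_sizing(10_000, 0.05)
print(m, k)
# 62353 4  -> about 7.6 KB and 4 hash functions per row group
```

Lowering fpp makes point lookups on the filtered column skip more row groups, at the cost of larger index sections in every stripe.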
Creating an ORC Table
CREATE TABLE IF NOT EXISTS bank.account_orc (
`id_card` int,
`tran_time` string,
`name` string,
`cash` int
)
partitioned by(ds string)
stored as orc;
If the compression property is not set, ZLIB compression is used by default. orc.compress can also be set to NONE or SNAPPY.
Viewing the ORC Table DDL
SHOW CREATE TABLE bank.account_orc;
CREATE TABLE `bank.account_orc`(
`id_card` int,
`tran_time` string,
`name` string,
`cash` int)
PARTITIONED BY (
`ds` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc'
TBLPROPERTIES (
'transient_lastDdlTime'='1627435885')
Inserting Sample Data into the ORC Table
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1000, '2020-09-21 14:30:00', 'Tom', 100);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:05', 'Tom', 50);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:10', 'Tom', -25);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:00', 'Jelly', 200);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:05', 'Jelly', -50);
Inspecting the Underlying Files
[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20
Found 2 items
-rwxrwx--x+ 3 hive hive 519 2021-07-28 09:33 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
-rwxrwx--x+ 3 hive hive 519 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
[root@jwldata.com ~]#
[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21
Found 3 items
-rwxrwx--x+ 3 hive hive 516 2021-07-28 09:32 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0
-rwxrwx--x+ 3 hive hive 528 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_1
-rwxrwx--x+ 3 hive hive 528 2021-07-28 09:35 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_2
Dumping ORC File Metadata
hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
[root@jwldata.com ~]# hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
Structure for hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
File Version: 0.12 with HIVE_8732
21/07/28 09:52:12 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int,_col1:string,_col2:string,_col3:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50
Stripes:
Stripe: offset: 3 data: 53 rows: 1 tail: 65 index: 136
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 27
Stream: column 2 section ROW_INDEX start: 41 length 45
Stream: column 3 section ROW_INDEX start: 86 length 29
Stream: column 4 section ROW_INDEX start: 115 length 24
Stream: column 1 section DATA start: 139 length 7
Stream: column 2 section DATA start: 146 length 22
Stream: column 2 section LENGTH start: 168 length 6
Stream: column 3 section DATA start: 174 length 6
Stream: column 3 section LENGTH start: 180 length 6
Stream: column 4 section DATA start: 186 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT_V2
Encoding column 4: DIRECT_V2
File length: 519 bytes
Padding length: 0 bytes
Padding ratio: 0%
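One detail worth noting in the dump above: for string columns, ORC's "sum" statistic is the total character length (min: Tom max: Tom sum: 3), while for numeric columns it is the arithmetic sum. A sketch recomputing the Column 3 and Column 4 statistics from the single row in this file:

```python
# Recompute per-column statistics as shown by orcfiledump; for string
# columns ORC reports the total character length as "sum".
row = {"id_card": 1000, "tran_time": "2020-09-20 14:30:05",
       "name": "Tom", "cash": 50}

def column_stats(values):
    if all(isinstance(v, str) for v in values):
        return {"count": len(values), "min": min(values),
                "max": max(values), "sum": sum(len(v) for v in values)}
    return {"count": len(values), "min": min(values),
            "max": max(values), "sum": sum(values)}

print(column_stats([row["name"]]))
# {'count': 1, 'min': 'Tom', 'max': 'Tom', 'sum': 3}
print(column_stats([row["cash"]]))
# {'count': 1, 'min': 50, 'max': 50, 'sum': 50}
```

Because each file here holds a single row, the stripe statistics and file statistics are identical; in a multi-stripe file they would differ per stripe.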
Dumping ORC File Data
hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
21/07/28 09:53:08 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:05","_col2":"Tom","_col3":50}
[root@jwldata.com ~]#
[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
21/07/28 09:53:50 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:10","_col2":"Tom","_col3":-25}
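Since hive --orcfiledump -d prints one JSON object per row, the dump is easy to post-process with any JSON-aware tool. A small sketch parsing such lines (the two sample records are copied from the output above; the _colN keys are the positional aliases Hive used when writing the files):

```python
import json

# Two rows as printed by `hive --orcfiledump -d` for ds=2020-09-20.
dump_lines = [
    '{"_col0":1000,"_col1":"2020-09-20 14:30:05","_col2":"Tom","_col3":50}',
    '{"_col0":1000,"_col1":"2020-09-20 14:30:10","_col2":"Tom","_col3":-25}',
]

rows = [json.loads(line) for line in dump_lines]
# _col3 corresponds to the `cash` column in the table definition.
total_cash = sum(r["_col3"] for r in rows)
print(total_cash)
# 25
```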
Creating an ORC Table with Snappy Compression
CREATE TABLE IF NOT EXISTS bank.account_orc_snappy (
`id_card` int,
`tran_time` string,
`name` string,
`cash` int
)
partitioned by(ds string)
stored as orc
TBLPROPERTIES ("orc.compression"="SNAPPY");
Viewing the Snappy-Compressed ORC Table DDL
SHOW CREATE TABLE bank.account_orc_snappy;
CREATE TABLE `bank.account_orc_snappy`(
`id_card` int,
`tran_time` string,
`name` string,
`cash` int)
PARTITIONED BY (
`ds` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc_snappy'
TBLPROPERTIES (
'orc.compression'='SNAPPY',
'transient_lastDdlTime'='1627542655')
References
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
- https://orc.apache.org/specification/ORCv1/
- https://blog.csdn.net/dabokele/article/details/51542327
- https://blog.csdn.net/dabokele/article/details/51813322