Creating LZO Compressed Text Tables
Use Hive to create a Text-format table with LZO compression:
CREATE TABLE IF NOT EXISTS bank.account_lzo (
`id_card` int,
`tran_time` string,
`name` string,
`cash` int
)
partitioned by(ds string)
STORED AS
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
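The DeprecatedLzoTextInputFormat class and the LZO codecs come from the hadoop-lzo library (on CDH it ships in the GPLEXTRAS parcel), so the jar and native libraries must be installed and the codecs registered before the table can be read or written. A minimal core-site.xml sketch, assuming hadoop-lzo is already installed; the exact codec list may differ in your environment:
<!-- Register the LZO codecs with Hadoop so MapReduce/Hive jobs can read and write .lzo files -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>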
Viewing the CREATE TABLE Statement
show create table bank.account_lzo;
CREATE TABLE `bank.account_lzo`(
`id_card` int,
`tran_time` string,
`name` string,
`cash` int)
PARTITIONED BY (
`ds` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo'
TBLPROPERTIES (
'transient_lastDdlTime'='1625551444')
Inserting Data into the LZO Table
-- Enable compressed job output for this session and use the LZOP codec
SET mapreduce.output.fileoutputformat.compress=true;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
INSERT INTO bank.account_lzo partition(ds='2020-09-21') values (1000, '2020-09-21 14:30:00', 'Tom', 100);
INSERT INTO bank.account_lzo partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:05', 'Tom', 50);
INSERT INTO bank.account_lzo partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:10', 'Tom', -25);
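Besides writing through Hive with compressed output enabled, text files that were compressed beforehand with the lzop tool can also be loaded into the table directly. A sketch, assuming a hypothetical HDFS file /tmp/account_20200922.txt.lzo whose columns match the table's layout and default field delimiter:
-- Hypothetical source file; must be lzop-compressed text matching the table's column layout
LOAD DATA INPATH '/tmp/account_20200922.txt.lzo'
INTO TABLE bank.account_lzo PARTITION (ds='2020-09-22');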
Creating the LZO Index
After loading data into an LZO-compressed text table, index the files so that they can be split.
hadoop jar /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-21
[root@jwldata ~]# hadoop fs -ls hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-21
Found 2 items
-rwxrwx--x+ 3 hive hive 83 2021-07-06 14:09 hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-21/000000_0.lzo
-rwxrwx--x+ 3 hive hive 8 2021-07-23 10:23 hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-21/000000_0.lzo.index
[root@jwldata ~]#
[root@jwldata ~]# hadoop fs -ls hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-20
Found 2 items
-rwxrwx--x+ 3 hive hive 82 2021-07-06 14:10 hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-20/000000_0.lzo
-rwxrwx--x+ 3 hive hive 83 2021-07-06 14:11 hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-20/000000_0_copy_1.lzo
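Note that the ds=2020-09-20 partition above has no .lzo.index files yet; its data is still readable, but each .lzo file will be processed as a single split until it is indexed. For small partitions, the single-process LzoIndexer can be used instead of the MapReduce-based DistributedLzoIndexer; a sketch, using the same hadoop-lzo jar:
hadoop jar /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.LzoIndexer \
  hdfs://nameservice1/user/hive/warehouse/bank.db/account_lzo/ds=2020-09-20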
Querying the LZO Table with Impala
After the table has been created and the data inserted through Hive, refresh the metadata in Impala, and the table can then be queried from Impala:
INVALIDATE METADATA [table]
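For example, for the table created above, run the following in impala-shell:
INVALIDATE METADATA bank.account_lzo;
SELECT * FROM bank.account_lzo WHERE ds = '2020-09-21';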
Notes
- Impala does not currently support LZO compression in Parquet files.
- In Impala 2.0 and later, you can also use text data compressed in the gzip, bzip2, or Snappy formats. Because these compressed formats are not “splittable” in the way that LZO is, there is less opportunity for Impala to parallelize queries on them. Therefore, use these types of compressed data only for convenience if that is the format in which you receive the data. Prefer to use LZO compression for text data if you have the choice, or convert the data to Parquet using an INSERT … SELECT statement to copy the original data into a Parquet table (see the sketch below).
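Following the second note, a sketch of such a conversion in Impala; the target table name account_parquet is hypothetical:
-- Hypothetical Parquet table copying the schema and partitioning of the LZO table
CREATE TABLE bank.account_parquet LIKE bank.account_lzo STORED AS PARQUET;
-- Dynamic-partition INSERT ... SELECT copies every existing ds partition
INSERT INTO bank.account_parquet PARTITION (ds)
SELECT id_card, tran_time, name, cash, ds FROM bank.account_lzo;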
References
- https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/impala_txtfile.html
- https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/impala_parquet.html
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
- https://blog.csdn.net/u010002184/article/details/88573318