When running a dynamic-partition INSERT OVERWRITE where the source is a large table with many partitions, the job may fail with org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded.
YARN error log
2021-07-07 11:38:29,493 INFO [main] org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator: 3 finished. closing...
2021-07-07 11:38:29,493 INFO [main] org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator: DESERIALIZE_ERRORS:0
2021-07-07 11:38:29,493 INFO [main] org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator: RECORDS_IN:1447355
2021-07-07 11:38:29,493 INFO [main] org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing...
2021-07-07 11:38:29,493 INFO [main] org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator: 1 finished. closing...
2021-07-07 11:38:29,493 INFO [main] org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator: 2 finished. closing...
2021-07-07 11:38:29,493 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[2]: records written - 1447835
2021-07-07 11:38:44,703 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
at parquet.hadoop.metadata.ParquetMetadata.<clinit>(ParquetMetadata.java:41)
at parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:468)
at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:127)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:144)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.abortWriters(FileSinkOperator.java:253)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1027)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:199)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2021-07-07 11:38:46,482 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.codec.CodecConfig: Compression set to false
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.codec.CodecConfig: Compression: UNCOMPRESSED
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary is on
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Validation is off
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
Jul 7, 2021 11:38:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
Jul 7, 2021 11:38:17 AM WARNING: parquet.hadoop.MemoryManager: Total allocation exceeds 50.00% (3,817,865,216 bytes) of heap memory
Scaling row group sizes to 6.01% for 473 writers
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.codec.CodecConfig: Compression set to false
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.codec.CodecConfig: Compression: UNCOMPRESSED
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary is on
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.ParquetOutputFormat: Validation is off
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
Jul 7, 2021 11:38:17 AM INFO: parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
Jul 7, 2021 11:38:17 AM WARNING: parquet.hadoop.MemoryManager: Total allocation exceeds 50.00% (3,817,865,216 bytes) of heap memory
Scaling row group sizes to 6.00% for 474 writers
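The two Parquet MemoryManager warnings are the real clue. In a dynamic-partition insert, each task keeps one open Parquet writer per partition it writes, and each writer buffers up to one row group (the 134217728-byte block size logged above). With 473 concurrent writers that would demand roughly 59 GB, but the MemoryManager caps total writer allocation at a fraction of the heap (50%, i.e. 3,817,865,216 bytes here, implying a heap of about 7.6 GB), so it scales every row group down to about 6%. Even scaled down, hundreds of open writers plus their footer metadata exhaust the heap: the stack trace shows the OutOfMemoryError firing in ParquetFileWriter.end() while FileSinkOperator is closing the writers. As a sanity check on the arithmetic (a sketch; the constants are copied verbatim from the log above):
SELECT 3817865216 / (473 * 134217728.0) AS scale;  -- ≈ 0.0601, the "6.01%" in the warning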
Consider increasing the following four parameters (concrete values are shown under "Increasing the parameters" below):
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts
mapreduce.reduce.java.opts
The *.memory.mb settings control the size of the YARN container, while the *.java.opts settings control the JVM heap inside it, so -Xmx must stay below the container size (Cloudera's YARN tuning guide, linked in the references, suggests keeping the heap around 80% of the container).
Table creation statements
Suppose bank.account is the source table and bank.account7 is the target table with the same schema.
CREATE TABLE IF NOT EXISTS bank.account (
`id_card` int,
`tran_time` string,
`name` string,
`cash` int
)
partitioned by(ds string)
stored as parquet
TBLPROPERTIES ("parquet.compression"="SNAPPY");
INSERT INTO bank.account partition(ds='2020-09-21') values (1000, '2020-09-21 14:30:00', 'Tom', 100);
INSERT INTO bank.account partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:05', 'Tom', 50);
INSERT INTO bank.account partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:10', 'Tom', -25);
INSERT INTO bank.account partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:00', 'Jelly', 200);
INSERT INTO bank.account partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:05', 'Jelly', -50);
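The target table is not created in the snippet above; a minimal sketch, assuming bank.account7 simply mirrors the source table's definition (depending on the Hive version, TBLPROPERTIES such as parquet.compression may need to be re-declared explicitly):
CREATE TABLE IF NOT EXISTS bank.account7 LIKE bank.account;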
Increasing the parameters
-- Allow dynamic partitions without a static partition key, and raise the partition limits
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=60000;
set hive.exec.max.dynamic.partitions.pernode=60000;
-- Enlarge the YARN containers and the JVM heaps inside them
set mapreduce.map.memory.mb=65536;
set mapreduce.reduce.memory.mb=65536;
set mapreduce.map.java.opts=-Xmx32768m;
set mapreduce.reduce.java.opts=-Xmx32768m;
INSERT OVERWRITE TABLE bank.account7 partition (ds) SELECT id_card, tran_time, name, cash, ds FROM bank.account;
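After the job succeeds, a quick sanity check (a sketch using the sample partitions inserted above):
SHOW PARTITIONS bank.account7;
SELECT ds, count(*) AS cnt FROM bank.account7 GROUP BY ds;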
References
https://zhuanlan.zhihu.com/p/90953401
https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cdh_ig_yarn_tuning.html
https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cm_mc_yarn_service.html
https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cdh_ig_mapreduce_to_yarn_migrate1.html