How to Repair Logical Bad Sectors on a Hadoop Node?

Cloudera-Hadoop

1 Background

A Spark application in our production environment was killed, and the Java layer reported the following error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/yarn/nm/filecache/1997/mes-bigtable-merger-indexer-1.4-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/11 16:30:59 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 48165@hd08.cmdschool.org
[...]
20/08/11 16:31:48 INFO HBaseSparkQuery: hbase connection opened
20/08/11 16:37:47 INFO AsyncProcess: #2, waiting for 1  actions to finish on table: tab1
20/08/11 16:37:47 INFO AsyncProcess: Left over 1 task(s) are processed on server(s): [hd01.cmdschool.org,60020,1571974414847]
20/08/11 16:37:47 INFO AsyncProcess: Regions against which left over task(s) are processed: [tab1,8BC,1588854228747.fe75f582fb0239a7710923a608677dc7., tab1,8BDJ005Q21,1588816980964.64d04bb582caad70105362074d617268., tab1,9AA6967D57,1589049938112.575aa2ca7166624dec0658060c37fa69., tab1,9AC,1594940599059.12d639c1344dfaab9aaa0fafdd136853., tab1,9AC4408K11,1594940599059.6d6e6d6874d31c6670e50459f85a6cb3., tab1,9ACA459B36,1593796543167.4448fee4ccd95b432133ff12a8e9c419., tab1,9ACF321B35,1593796543167.d5f15ef7116469fd5249d4cdf8b3ce5a., tab1,9AD2638J68,1589113124795.7242b0c3dfe126a17073fefc48f7eab6., tab1,9AFBA14C15,1593967705894.d4a1e5fac3f35afca759756020943aaf., tab1,9AFJ322S13,1593967705894.b5a609aa40d3921a71d62073914c4e54., tab1,9AG6A52A60,1588943710026.96ccf7f3fbc1395eff05b7bccbfb01d5.]
20/08/11 16:37:57 INFO AsyncProcess: #2, waiting for 1  actions to finish on table: tab1
20/08/11 16:37:57 INFO AsyncProcess: Left over 1 task(s) are processed on server(s): [hd01.cmdschool.org,60020,1571974414847]
20/08/11 16:37:57 INFO AsyncProcess: Regions against which left over task(s) are processed: [tab1,8BC,1588854228747.fe75f582fb0239a7710923a608677dc7., tab1,8BDJ005Q21,1588816980964.64d04bb582caad70105362074d617268., tab1,9AA6967D57,1589049938112.575aa2ca7166624dec0658060c37fa69., tab1,9AC,1594940599059.12d639c1344dfaab9aaa0fafdd136853., tab1,9AC4408K11,1594940599059.6d6e6d6874d31c6670e50459f85a6cb3., tab1,9ACA459B36,1593796543167.4448fee4ccd95b432133ff12a8e9c419., tab1,9ACF321B35,1593796543167.d5f15ef7116469fd5249d4cdf8b3ce5a., tab1,9AD2638J68,1589113124795.7242b0c3dfe126a17073fefc48f7eab6., tab1,9AFBA14C15,1593967705894.d4a1e5fac3f35afca759756020943aaf., tab1,9AFJ322S13,1593967705894.b5a609aa40d3921a71d62073914c4e54., tab1,9AG6A52A60,1588943710026.96ccf7f3fbc1395eff05b7bccbfb01d5.]
20/08/11 16:38:06 INFO AsyncProcess: #2, table=tab1, attempt=10/35 failed=1ops, last exception: java.io.IOException: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://hd05.cmdschool.org:8020/hbase/data/default/tab1/12d639c1344dfaab9aaa0fafdd136853/cf/c33eced60aaf4dd2be5cb6f2853eb7a0, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=19269, currentSize=2438131176, freeSize=952491800, maxSize=3390622976, heapSize=2438131176, minSize=3221091840, minFactor=0.95, multiSize=1610545920, multiFactor=0.5, singleSize=805272960, singleFactor=0.25}, cacheDataOnRead=false, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=9AC0001A02/cf:BLOCK/1593547200000/Put, lastKey=9AC4408K10/cf:rowkey/1593453911000/Put, avgKeyLen=45, avgValueLen=7, entries=768821021, length=11411362080, cur=null] to key 9AC0034P28/cf:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:217)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:343)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:198)
	at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2106)
	at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2096)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5544)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2569)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2555)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2536)
	at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6791)
	at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6770)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:629)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2142)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
Caused by: java.io.IOException: On-disk size without header provided is 16108, but block header contains 0. Block offset: 249363586, data starts with: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
	at org.apache.hadoop.hbase.io.hfile.HFileBlock.validateOnDiskSizeWithoutHeader(HFileBlock.java:526)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock.access$700(HFileBlock.java:92)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1699)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:445)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:199)
	... 17 more
 on hd01.cmdschool.org,60020,1571974414847, tracking started null, retrying after=10085ms, replay=1ops

2 Best Practices

2.1 Log Analysis

Caused by: java.io.IOException: On-disk size without header provided is 16108, but block header contains 0. Block offset: 249363586, data starts with: 

The key message above is "block header contains 0": the data block's header is all zeros, which indicates a corrupt block on disk. We also notice the following hint:

on hd01.cmdschool.org,60020,1571974414847, tracking started null, retrying after=10085ms, replay=1ops

Note: from the above we can pinpoint the faulty host as "hd01.cmdschool.org", so we go straight to that server and check its disk health.
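
If the executor log is long, a quick grep can confirm which region server the retries keep hitting. A minimal sketch, assuming the output above has been saved to a file named application.log (a hypothetical name):

grep -oE 'on hd[0-9]+\.cmdschool\.org' application.log | sort | uniq -c | sort -rn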

2.2 Locating the Fault

df -Th
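
Note that df only lists mounted file systems; lsblk or blkid can also report the file system type directly from the device node:

lsblk -f /dev/mapper/ds-data
blkid /dev/mapper/ds-data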

Having determined the partition's file system type, if it is an EXT file system you can check its state with the following command:

tune2fs -l /dev/mapper/ds-data | grep "Filesystem state"

Running the command above against one of the disks on "hd01.cmdschool.org" produced the following output:

Filesystem state:         clean with errors

Note: "clean with errors" means the file system has recorded errors (bad blocks). Since the hardware fault LEDs and hardware logs reported nothing, these should be logical bad sectors rather than physical damage.
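
To further rule out physical damage, smartmontools (assuming it is installed) can query the drive's SMART status; /dev/sdX below is a placeholder for whichever physical disk backs the ds-data logical volume:

smartctl -H /dev/sdX                                           # overall health self-assessment
smartctl -A /dev/sdX | grep -iE 'reallocat|pending|uncorrect'  # sector-level warning attributes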

2.3 Drawing Up a Repair Plan

  • Use "Cloudera Management Service" to put hd01 into maintenance mode
  • Use "Cloudera Management Service" to stop all roles on hd01
  • Back up the partition's critical data (we chose not to back up the HDFS DataNode data, since HDFS keeps redundant replicas)
  • Unmount the "/dev/mapper/ds-data" partition mounted at "/data"
  • Repair the damaged sectors with the e2fsck command

2.4 Performing the Repair

2.4.1 Stop All Services

  • Use "Cloudera Management Service" to put hd01 into maintenance mode
  • Use "Cloudera Management Service" to stop all roles on hd01
  • Stop the CMS agent with "/etc/init.d/cloudera-scm-agent stop"
  • Stop the CMS management service with "/etc/init.d/cloudera-scm-server stop" (if it exists on this host)
  • Stop any other application services (not listed one by one, if they exist), then confirm nothing is left running, as shown in the sketch after this list
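
Before touching the disk, it is worth confirming that nothing Hadoop-related is still running; a quick check (the process name patterns are assumptions, adjust them to your deployment):

ps -ef | grep -E 'cloudera-scm|hadoop|hbase' | grep -v grep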

2.4.2 Back Up Essential Partition Data

scp -r /data/xxx hdxxx.cmdschool.org:/data/backup/
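
For large directories, rsync (assuming it is installed on both hosts) is a resumable alternative to scp that can verify the copy with checksums; the paths reuse the placeholders above:

rsync -avc /data/xxx/ hdxxx.cmdschool.org:/data/backup/xxx/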

2.4.3 Unmount the File System

umount /data/
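
If umount fails with "target is busy", some process still holds files open under /data; fuser (from the psmisc package) shows which one:

fuser -vm /data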

2.4.4 Repair the File System

e2fsck -fy /dev/mapper/ds-data

To be safer, you can drop the -y option (which auto-answers "yes" to every fix) and run in verbose mode instead, letting e2fsck prompt interactively before each repair:

e2fsck -fv /dev/mapper/ds-data

Note that the commands above only apply to EXT file systems. If the partition is XFS, use the following command instead:

xfs_repair /dev/mapper/ds-data
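
xfs_repair likewise requires the file system to be unmounted. A cautious approach is a dry run first; both flags below come from the xfs_repair man page, and zeroing a dirty log with -L can discard recent metadata changes, so treat it as a last resort:

xfs_repair -n /dev/mapper/ds-data     # no-modify mode: report problems without fixing anything
# xfs_repair -L /dev/mapper/ds-data   # only if a dirty log blocks the repair (risks losing recent metadata)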

2.4.5 Verify the File System

e2fsck -n /dev/mapper/ds-data
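
The -n flag opens the file system read-only and answers "no" to every prompt, so this pass only reports residual problems. The exit status is also informative (codes from the e2fsck man page):

echo $?    # 0 = clean, 1 = errors were corrected, 4 = errors left uncorrected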

2.4.6 Remount the Partition

mount -a
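
A quick sanity check that the partition is mounted again and writable (the test file name is arbitrary):

df -Th /data
touch /data/.rwtest && rm /data/.rwtest && echo "write OK"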

2.4.7 Restore All Services

  • Start the CMS management service with "/etc/init.d/cloudera-scm-server start" (if it exists on this host)
  • Start the CMS agent with "/etc/init.d/cloudera-scm-agent start"
  • Start any other application services (not listed one by one, if they exist)
  • Use "Cloudera Management Service" to start all roles on hd01
  • Use "Cloudera Management Service" to take hd01 out of maintenance mode, then verify cluster health as sketched after this list
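
Once the roles are back up, it is worth confirming that HDFS and HBase see no lingering corruption, since repairing the local file system does not guarantee the affected HFile itself is intact. A sketch, assuming the commands run on a host with the client configuration deployed:

sudo -u hdfs hdfs dfsadmin -report | head -20        # are all DataNodes live?
sudo -u hdfs hdfs fsck / -list-corruptfileblocks     # any corrupt HDFS blocks left?
sudo -u hbase hbase hbck                             # any HBase inconsistencies?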