How to Fix Logical Bad Sector Issues on a Hadoop Node?
- By : Will
- Category : Cloudera-Hadoop
1 Background
A Spark application in the production environment was killed, and the Java layer reported the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/yarn/nm/filecache/1997/mes-bigtable-merger-indexer-1.4-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/11 16:30:59 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 48165@hd08.cmdschool.org
[...]
20/08/11 16:31:48 INFO HBaseSparkQuery: hbase connection opened
20/08/11 16:37:47 INFO AsyncProcess: #2, waiting for 1 actions to finish on table: tab1
20/08/11 16:37:47 INFO AsyncProcess: Left over 1 task(s) are processed on server(s): [hd01.cmdschool.org,60020,1571974414847]
20/08/11 16:37:47 INFO AsyncProcess: Regions against which left over task(s) are processed: [tab1,8BC,1588854228747.fe75f582fb0239a7710923a608677dc7., tab1,8BDJ005Q21,1588816980964.64d04bb582caad70105362074d617268., tab1,9AA6967D57,1589049938112.575aa2ca7166624dec0658060c37fa69., tab1,9AC,1594940599059.12d639c1344dfaab9aaa0fafdd136853., tab1,9AC4408K11,1594940599059.6d6e6d6874d31c6670e50459f85a6cb3., tab1,9ACA459B36,1593796543167.4448fee4ccd95b432133ff12a8e9c419., tab1,9ACF321B35,1593796543167.d5f15ef7116469fd5249d4cdf8b3ce5a., tab1,9AD2638J68,1589113124795.7242b0c3dfe126a17073fefc48f7eab6., tab1,9AFBA14C15,1593967705894.d4a1e5fac3f35afca759756020943aaf., tab1,9AFJ322S13,1593967705894.b5a609aa40d3921a71d62073914c4e54., tab1,9AG6A52A60,1588943710026.96ccf7f3fbc1395eff05b7bccbfb01d5.]
20/08/11 16:37:57 INFO AsyncProcess: #2, waiting for 1 actions to finish on table: tab1
20/08/11 16:37:57 INFO AsyncProcess: Left over 1 task(s) are processed on server(s): [hd01.cmdschool.org,60020,1571974414847]
20/08/11 16:37:57 INFO AsyncProcess: Regions against which left over task(s) are processed: [tab1,8BC,1588854228747.fe75f582fb0239a7710923a608677dc7., tab1,8BDJ005Q21,1588816980964.64d04bb582caad70105362074d617268., tab1,9AA6967D57,1589049938112.575aa2ca7166624dec0658060c37fa69., tab1,9AC,1594940599059.12d639c1344dfaab9aaa0fafdd136853., tab1,9AC4408K11,1594940599059.6d6e6d6874d31c6670e50459f85a6cb3., tab1,9ACA459B36,1593796543167.4448fee4ccd95b432133ff12a8e9c419., tab1,9ACF321B35,1593796543167.d5f15ef7116469fd5249d4cdf8b3ce5a., tab1,9AD2638J68,1589113124795.7242b0c3dfe126a17073fefc48f7eab6., tab1,9AFBA14C15,1593967705894.d4a1e5fac3f35afca759756020943aaf., tab1,9AFJ322S13,1593967705894.b5a609aa40d3921a71d62073914c4e54., tab1,9AG6A52A60,1588943710026.96ccf7f3fbc1395eff05b7bccbfb01d5.]
20/08/11 16:38:06 INFO AsyncProcess: #2, table=tab1, attempt=10/35 failed=1ops, last exception: java.io.IOException: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://hd05.cmdschool.org:8020/hbase/data/default/tab1/12d639c1344dfaab9aaa0fafdd136853/cf/c33eced60aaf4dd2be5cb6f2853eb7a0, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=19269, currentSize=2438131176, freeSize=952491800, maxSize=3390622976, heapSize=2438131176, minSize=3221091840, minFactor=0.95, multiSize=1610545920, multiFactor=0.5, singleSize=805272960, singleFactor=0.25}, cacheDataOnRead=false, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=9AC0001A02/cf:BLOCK/1593547200000/Put, lastKey=9AC4408K10/cf:rowkey/1593453911000/Put, avgKeyLen=45, avgValueLen=7, entries=768821021, length=11411362080, cur=null] to key 9AC0034P28/cf:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:217)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:343)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:198)
    at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2106)
    at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2096)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5544)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2569)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2555)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2536)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6791)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6770)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:629)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2142)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
Caused by: java.io.IOException: On-disk size without header provided is 16108, but block header contains 0.
Block offset: 249363586, data starts with: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    at org.apache.hadoop.hbase.io.hfile.HFileBlock.validateOnDiskSizeWithoutHeader(HFileBlock.java:526)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock.access$700(HFileBlock.java:92)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1699)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:445)
    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:199)
    ... 17 more
on hd01.cmdschool.org,60020,1571974414847, tracking started null, retrying after=10085ms, replay=1ops
2 Best Practices
2.1 Log Analysis
Caused by: java.io.IOException: On-disk size without header provided is 16108, but block header contains 0. Block offset: 249363586, data starts with:
The key message above is "block header contains 0", i.e. the data block header is all zeros, which indicates a corrupted data block. Also note the following line:
on hd01.cmdschool.org,60020,1571974414847, tracking started null, retrying after=10085ms, replay=1ops
Note: from the above, the faulty host can be identified as "hd01.cmdschool.org", so we go directly to that server to check its disk health.
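Before logging in to the node, you can also cross-check from the HDFS side which DataNodes hold the replicas of the HFile named in the exception. A minimal sketch using the standard fsck tool, with the HFile path taken from the log above:
hdfs fsck /hbase/data/default/tab1/12d639c1344dfaab9aaa0fafdd136853/cf/c33eced60aaf4dd2be5cb6f2853eb7a0 -files -blocks -locations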
2.2 Locating the Fault
df -Th
Use the command above to determine each partition's filesystem type. If the filesystem is EXT-formatted, you can check its state with the following command:
tune2fs -l /dev/mapper/ds-data | grep "Filesystem state"
Running the command above against one of the disks on "hd01.cmdschool.org" produced the following output:
Filesystem state: clean with errors
Note: "clean with errors" means the filesystem contains bad blocks. Since we checked the hardware indicator LEDs and the hardware logs and found no errors reported, these should be logical bad sectors rather than physical ones.
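For reference, the hardware side can also be checked from the shell. A minimal sketch, assuming smartmontools is installed and that /dev/sda is the physical disk behind /dev/mapper/ds-data (adjust the device name to your layout):
smartctl -H /dev/sda                        # overall SMART health verdict
dmesg | grep -iE 'i/o error|bad sector'     # kernel log entries pointing at disk errors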
2.3 Repair Plan
- Mark hd01 as in maintenance mode via "Cloudera Management Service"
- Stop all roles on hd01 via "Cloudera Management Service"
- Back up the critical data on the partition (HDFS DataNode data is replicated across the cluster, so we choose not to back it up)
- Unmount the "/dev/mapper/ds-data" partition mounted at "/data"
- Repair the logical bad sectors with the e2fsck command
2.4 Performing the Repair
2.4.1 Stop All Services
- Mark hd01 as in maintenance mode via "Cloudera Management Service"
- Stop all roles on hd01 via "Cloudera Management Service"
- Stop the Cloudera Manager Agent with "/etc/init.d/cloudera-scm-agent stop" (see the systemd note after this list)
- Stop the Cloudera Manager Server with "/etc/init.d/cloudera-scm-server stop" (if the service exists on this host)
- Stop any other application services (not enumerated here; if they exist)
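On systemd-based systems the same services can be stopped via systemctl. This assumes the standard Cloudera Manager unit names; verify them on your hosts:
systemctl stop cloudera-scm-agent
systemctl stop cloudera-scm-server     # only if the server role runs on this host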
2.4.2 Back Up Essential Data on the Partition
scp -r /data/xxx hdxxx.cmdschool.org:/data/backup/
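The scp above copies the data as-is. If the data set is large, an incremental, checksum-verified copy may be preferable; the paths and host below are the same placeholders used above:
rsync -av --checksum /data/xxx hdxxx.cmdschool.org:/data/backup/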
2.4.3 Unmount the Filesystem
umount /data/
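If umount fails with a "target is busy" error, some process still has files open under /data. The standard tools below help identify it:
fuser -vm /data      # list processes using the mount point
lsof +D /data        # alternatively, list open files under /data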
2.4.4 Repair the Filesystem
e2fsck -fy /dev/mapper/ds-data
To be safer, you can instead repair interactively in verbose mode (without -y, e2fsck prompts for confirmation before each fix):
e2fsck -fv /dev/mapper/ds-data
Note that the commands above only apply to EXT filesystems. For an XFS filesystem, use the following command instead (the filesystem must be unmounted):
xfs_repair /dev/mapper/ds-data
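xfs_repair also offers a no-modify mode that only reports what it would change, which is useful as a dry run before the real repair:
xfs_repair -n /dev/mapper/ds-data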
2.4.5 Check the Filesystem
e2fsck -n /dev/mapper/ds-data
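The -n flag opens the filesystem read-only and answers "no" to every prompt, so this pass only reports remaining problems. You can also re-run the tune2fs check from section 2.2; after a successful repair the state should read "clean" instead of "clean with errors":
tune2fs -l /dev/mapper/ds-data | grep "Filesystem state"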
2.4.6 Remount the Partition
mount -a
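This assumes /data is still listed in /etc/fstab. Confirm that the partition came back as expected:
df -Th | grep /data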
2.4.7 Restore All Services
- Start the Cloudera Manager Server with "/etc/init.d/cloudera-scm-server start" (if the service exists on this host)
- Start the Cloudera Manager Agent with "/etc/init.d/cloudera-scm-agent start"
- Start any other application services (not enumerated here; if they exist)
- Start all roles on hd01 via "Cloudera Management Service"
- Take hd01 out of maintenance mode via "Cloudera Management Service"
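Finally, it is worth verifying from the HDFS side that the node has rejoined the cluster; the standard dfsadmin report shows the live DataNodes:
hdfs dfsadmin -report     # confirm the DataNode on hd01 reports as live again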