How Do You Fix an Ignite Node That Fails to Start?
- By : Will
- Category : Big Data Framework
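1 Preface
One problem, one article, one story.
I run an Apache Ignite cluster in production; for details about the cluster, see the following document.
Today one of its nodes failed to start, so I inspected the log with the following command: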
less /var/log/apache-ignite/ignite.0.log
The log contained the following error, logged at SEVERE level, with the key message "WAL history is too short":
[13:21:11,765][SEVERE][main][IgniteKernal] Got exception while starting (will rollback startup routine).
class org.apache.ignite.IgniteCheckedException: WAL history is too short [descs=[org.apache.ignite.internal.processors.cache.persistence.wal.FileDescriptor@28b8e], start=FileWALPointer [idx=166795, fileOff=13151190, len=310248]]
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$RecordsIterator.init(FileWriteAheadLogManager.java:2781)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$RecordsIterator.access$1000(FileWriteAheadLogManager.java:2651)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.replay(FileWriteAheadLogManager.java:956)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.performBinaryMemoryRestore(GridCacheDatabaseSharedManager.java:2282)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:873)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:5022)
    at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1251)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2052)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1698)
    at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1114)
    at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1032)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:918)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:817)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:687)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:656)
    at org.apache.ignite.Ignition.start(Ignition.java:353)
    at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:300)
[13:21:11,767][WARNING][main][IgniteKernal] Attempt to stop starting grid. This operation cannot be guaranteed to be successful.
[13:21:11,782][INFO][main][GridTcpRestProtocol] Command protocol successfully stopped: TCP binary
[13:21:11,805][INFO][main][FilePageStoreManager] Cleanup cache stores [total=0, left=0, cleanFiles=false]
[13:21:11,830][INFO][main][IgniteKernal]
>>> +---------------------------------------------------------------------------------+
>>> Ignite ver. 2.9.1#20201203-sha1:adcce517ce542fdaca954bce399e7a1940630403 stopped OK
>>> +---------------------------------------------------------------------------------+
>>> Grid uptime: 00:00:02.616
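In a long log it can be quicker to filter for SEVERE entries than to page through the file. The following is a minimal sketch, assuming the log layout shown above, where the level appears as `[SEVERE]` between other bracketed fields. The function name `find_startup_errors` is hypothetical and not part of Ignite.

```shell
# find_startup_errors LOGFILE
# Prints every SEVERE entry of an Ignite log file, prefixed with its line number.
# Assumption: the level is written as "][SEVERE][" as in the excerpt above.
find_startup_errors() {
    # -F matches the pattern as a fixed string, so the brackets need no escaping.
    grep -nF '][SEVERE][' "$1"
}
```

For example, `find_startup_errors /var/log/apache-ignite/ignite.0.log` lists the matching entries, and the reported line numbers can then be opened in less.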
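The other two cluster nodes were healthy and I could not find a way to repair the node in place, so I decided to wipe its data and let it rejoin the cluster as a new node.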
2 Best Practice
2.1 Clean up the failed node's data
2.1.1 Stop the node
systemctl stop apache-ignite
2.1.2 Back up the node's data
mv /data/ignite/db /data/ignite/db-save
mv /data/ignite/storage /data/ignite/storage-save
2.1.3 Recreate the directories the service needs
mkdir -p /data/ignite/db
chown ignite:ignite /data/ignite/db
mkdir -p /data/ignite/storage
chown ignite:ignite /data/ignite/storage
2.1.4 Start the node
systemctl start apache-ignite
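The backup and recreate steps of section 2.1 can be wrapped in a small helper so that the same sequence runs for each directory. This is only a sketch: the function name `backup_and_recreate` is hypothetical, and the timestamp suffix is an added assumption (the commands above use a plain `-save` suffix) so that a second run does not move data into an existing backup directory.

```shell
# backup_and_recreate DIR [OWNER]
# Moves DIR aside to DIR-save-<timestamp>, recreates an empty DIR and,
# if OWNER is given (e.g. ignite:ignite), hands it to that owner.
# Prints the path of the backup on success.
backup_and_recreate() {
    dir=$1
    owner=${2:-}
    backup="${dir}-save-$(date +%Y%m%d%H%M%S)"

    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ -e "$backup" ]; then
        echo "backup already exists: $backup" >&2
        return 1
    fi

    mv "$dir" "$backup" || return 1
    mkdir -p "$dir" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dir" || return 1
    fi
    echo "$backup"
}

# Usage with the paths from this article (stop the service first):
#   systemctl stop apache-ignite
#   backup_and_recreate /data/ignite/db ignite:ignite
#   backup_and_recreate /data/ignite/storage ignite:ignite
#   systemctl start apache-ignite
```

Keeping the service commands outside the helper means the stop and start steps stay explicit, and the helper can be tried on a scratch directory first.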
2.2 Replace the failed node in the cluster
2.2.1 Disable baseline auto-adjustment
control.sh --user ignite --password ignite --baseline auto_adjust disable --yes
2.2.2 Query the cluster topology
control.sh --user ignite --password ignite --baseline
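The output looks like the following. The OFFLINE baseline entry is the failed node's old identity, and the entry under "Other nodes" is the same machine after it restarted with empty data and a new ConsistentId: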
#...
Current topology version: 15 (Coordinator: ConsistentId=da64c74f-25e0-48f2-86f3-a0b00ed10d53, Order=2)
Baseline nodes:
    ConsistentId=a184ff86-2018-4615-ad2a-06053b3e1d83, State=ONLINE, Order=7
    ConsistentId=da64c74f-25e0-48f2-86f3-a0b00ed10d53, State=ONLINE, Order=2
    ConsistentId=e45e5d2c-42f6-4a50-816e-0b6df02ea865, State=OFFLINE
#...
Other nodes:
    ConsistentId=eb3f37bc-2c29-4d63-b9cc-4ed8a776aeeb, Order=15
#...
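The two IDs needed for the next steps can be picked out of this listing by hand, or extracted with a small filter. The sketch below assumes the output format shown above, with one node entry per line and an "Other nodes:" heading before the non-baseline nodes. The function names are hypothetical, and `grep -o` is assumed to be available (GNU and BSD grep both provide it).

```shell
# Both functions read the output of "control.sh ... --baseline" on stdin.

# Prints the ConsistentId of every baseline node reported as State=OFFLINE.
offline_baseline_nodes() {
    grep -o 'ConsistentId=[^,]*, State=OFFLINE' |
        sed 's/^ConsistentId=//; s/, State=OFFLINE$//'
}

# Prints the ConsistentId of every node listed after the "Other nodes:" heading.
other_nodes() {
    sed -n '/Other nodes:/,$p' |
        grep -o 'ConsistentId=[^, ]*' |
        sed 's/^ConsistentId=//'
}

# Usage:
#   control.sh --user ignite --password ignite --baseline | offline_baseline_nodes
#   control.sh --user ignite --password ignite --baseline | other_nodes
```

If either function prints more than one ID, more than one node is affected and each ID should be checked by hand before changing the baseline.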
2.2.3 Remove the failed node from the cluster
control.sh --user ignite --password ignite --baseline remove e45e5d2c-42f6-4a50-816e-0b6df02ea865 --yes
2.2.4 Add the new node to the cluster
control.sh --user ignite --password ignite --baseline add eb3f37bc-2c29-4d63-b9cc-4ed8a776aeeb --yes
2.2.5 Re-enable baseline auto-adjustment
control.sh --user ignite --password ignite --baseline auto_adjust enable --yes
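The four baseline commands of section 2.2 (disable auto-adjustment, remove, add, re-enable) can be chained in one function that stops at the first failure. This is a sketch rather than a definitive procedure: the names `ignite_ctl`, `replace_baseline_node`, `CONTROL_SH`, `IGNITE_USER` and `IGNITE_PASSWORD` are hypothetical, and the defaults simply mirror the credentials used in this article. Only the control.sh options that appear above are used.

```shell
# Assumed defaults; override them in the environment as needed.
CONTROL_SH=${CONTROL_SH:-control.sh}
IGNITE_USER=${IGNITE_USER:-ignite}
IGNITE_PASSWORD=${IGNITE_PASSWORD:-ignite}

# Runs control.sh with the credentials prepended to the given arguments.
ignite_ctl() {
    "$CONTROL_SH" --user "$IGNITE_USER" --password "$IGNITE_PASSWORD" "$@"
}

# replace_baseline_node OLD_ID NEW_ID
# Swaps OLD_ID for NEW_ID in the baseline, following steps 2.2.1 and 2.2.3-2.2.5.
# If a step fails, the function returns early and auto-adjustment stays disabled,
# so the cluster state should be checked with "ignite_ctl --baseline" afterwards.
replace_baseline_node() {
    old_id=$1
    new_id=$2
    if [ -z "$old_id" ] || [ -z "$new_id" ]; then
        echo "usage: replace_baseline_node OLD_ID NEW_ID" >&2
        return 2
    fi
    ignite_ctl --baseline auto_adjust disable --yes || return 1
    ignite_ctl --baseline remove "$old_id" --yes || return 1
    ignite_ctl --baseline add "$new_id" --yes || return 1
    ignite_ctl --baseline auto_adjust enable --yes
}

# Usage with the IDs from the topology listing in 2.2.2:
#   replace_baseline_node e45e5d2c-42f6-4a50-816e-0b6df02ea865 \
#                         eb3f37bc-2c29-4d63-b9cc-4ed8a776aeeb
```

Setting `CONTROL_SH=echo` prints the commands instead of running them, which is a cheap way to review the sequence before touching a production cluster.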