How to fix an Ignite node that fails to start?

Big Data Framework

1 Preface

One problem, one article, one story.
I run an Apache Ignite cluster in production; for details of the cluster, see the following document,

How to deploy an Apache Ignite cluster?


Today one of the nodes failed to start, so I checked its log with the following command,

less /var/log/apache-ignite/ignite.0.log

and found the following SEVERE-level error, whose key message is "WAL history is too short",

[13:21:11,765][SEVERE][main][IgniteKernal] Got exception while starting (will rollback startup routine).
class org.apache.ignite.IgniteCheckedException: WAL history is too short [descs=[org.apache.ignite.internal.processors.cache.persistence.wal.FileDescriptor@28b8e], start=FileWALPointer [idx=166795, fileOff=13151190, len=310248]]
        at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$RecordsIterator.init(FileWriteAheadLogManager.java:2781)
        at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$RecordsIterator.access$1000(FileWriteAheadLogManager.java:2651)
        at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.replay(FileWriteAheadLogManager.java:956)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.performBinaryMemoryRestore(GridCacheDatabaseSharedManager.java:2282)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:873)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:5022)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1251)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2052)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1698)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1114)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1032)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:918)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:817)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:687)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:656)
        at org.apache.ignite.Ignition.start(Ignition.java:353)
        at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:300)
[13:21:11,767][WARNING][main][IgniteKernal] Attempt to stop starting grid. This operation cannot be guaranteed to be successful.
[13:21:11,782][INFO][main][GridTcpRestProtocol] Command protocol successfully stopped: TCP binary
[13:21:11,805][INFO][main][FilePageStoreManager] Cleanup cache stores [total=0, left=0, cleanFiles=false]
[13:21:11,830][INFO][main][IgniteKernal]

>>> +---------------------------------------------------------------------------------+
>>> Ignite ver. 2.9.1#20201203-sha1:adcce517ce542fdaca954bce399e7a1940630403 stopped OK
>>> +---------------------------------------------------------------------------------+
>>> Grid uptime: 00:00:02.616
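
Paging through a large log with less can be slow, so a small helper that prints only the SEVERE entries may be handy. The following is a minimal sketch, not part of the original procedure: the function name severe_entries is hypothetical, and it assumes the log format shown above, where the exception class and message sit on the line right after the SEVERE line.

```shell
# Print every log line tagged [SEVERE] together with the line that follows it
# (in the log format above, that is the exception class and message).
# Hypothetical helper; "grep -A" is available in GNU and BSD grep.
severe_entries() {
    grep -A 1 -F '[SEVERE]' "$1"
}

# Usage:
#   severe_entries /var/log/apache-ignite/ignite.0.log
```

When several SEVERE entries are not adjacent, grep separates the groups with a "--" line.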

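Since the other two cluster nodes were healthy and I could not find a way to repair the node, I decided to wipe its data and let it rejoin the cluster as a new node.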

2 Best practice

2.1 Clean up the failed node's data

2.1.1 Stop the node

systemctl stop apache-ignite

2.1.2 Back up the node data

mv /data/ignite/db /data/ignite/db-save
mv /data/ignite/storage /data/ignite/storage-save
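
One pitfall with a plain mv: if a -save directory already exists from an earlier attempt, mv moves the source inside it instead of failing. A minimal sketch of a guarded variant follows; the function name backup_dir is hypothetical and the "-save" suffix simply mirrors the commands above.

```shell
# Move a directory aside as "<dir>-save".
# Refuses to run if the source is missing or a previous backup already exists,
# because "mv src dst" would otherwise nest src inside an existing dst.
backup_dir() {
    src=$1
    dst="${src}-save"
    if [ ! -d "$src" ]; then
        echo "missing directory: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "backup already exists: $dst" >&2
        return 1
    fi
    mv "$src" "$dst"
}

# Usage:
#   backup_dir /data/ignite/db && backup_dir /data/ignite/storage
```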

2.1.3 Recreate the directories the program needs

mkdir -p /data/ignite/db
chown ignite:ignite /data/ignite/db
mkdir -p /data/ignite/storage
chown ignite:ignite /data/ignite/storage

2.1.4 Start the node

systemctl start apache-ignite

2.2 Replace the failed node in the cluster

2.2.1 Disable baseline auto-adjust for the cluster

control.sh --user ignite --password ignite --baseline auto_adjust disable --yes

2.2.2 Query the cluster topology

control.sh --user ignite --password ignite --baseline

The output looks like this,

#...
Current topology version: 15 (Coordinator: ConsistentId=da64c74f-25e0-48f2-86f3-a0b00ed10d53, Order=2)

Baseline nodes:
    ConsistentId=a184ff86-2018-4615-ad2a-06053b3e1d83, State=ONLINE, Order=7
    ConsistentId=da64c74f-25e0-48f2-86f3-a0b00ed10d53, State=ONLINE, Order=2
    ConsistentId=e45e5d2c-42f6-4a50-816e-0b6df02ea865, State=OFFLINE
#...
Other nodes:
    ConsistentId=eb3f37bc-2c29-4d63-b9cc-4ed8a776aeeb, Order=15
#...
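
In this output, the OFFLINE entry under "Baseline nodes" is the old identity of the wiped node, and the entry under "Other nodes" is the same machine after it rejoined with a fresh ConsistentId. Copying the IDs by hand works; as an alternative, the sketch below extracts them from the text. The function names are hypothetical, and the parsing assumes the output layout shown above, which may differ between Ignite versions.

```shell
# Read `control.sh --baseline` output on stdin and print the ConsistentId of
# every baseline node whose state is OFFLINE.
offline_baseline_ids() {
    awk '
        /^Baseline nodes:/ { in_baseline = 1; next }
        /^Other nodes:/    { in_baseline = 0 }
        in_baseline && /State=OFFLINE/ {
            id = $1
            sub(/^ConsistentId=/, "", id)   # drop the key
            sub(/,$/, "", id)               # drop the trailing comma
            print id
        }
    '
}

# Read the same output and print the ConsistentId of every node listed under
# "Other nodes:", that is, nodes in the topology but not in the baseline.
other_node_ids() {
    awk '
        /^Other nodes:/ { in_other = 1; next }
        in_other && /ConsistentId=/ {
            id = $1
            sub(/^ConsistentId=/, "", id)
            sub(/,$/, "", id)
            print id
        }
    '
}

# Usage:
#   control.sh --user ignite --password ignite --baseline | offline_baseline_ids
```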

2.2.3 Remove the failed node from the cluster

control.sh --user ignite --password ignite --baseline remove e45e5d2c-42f6-4a50-816e-0b6df02ea865 --yes

2.2.4 Add the new node to the cluster

control.sh --user ignite --password ignite --baseline add eb3f37bc-2c29-4d63-b9cc-4ed8a776aeeb --yes

2.2.5 Re-enable baseline auto-adjust for the cluster

control.sh --user ignite --password ignite --baseline auto_adjust enable --yes
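
The four control.sh calls in section 2.2 can also be chained so that a failure stops the sequence. The following is only a sketch under assumptions: the names ctl, replace_baseline_node and the CONTROL variable are hypothetical, and the credentials are the ones used in the commands above.

```shell
# Path to control.sh; can be overridden through the CONTROL variable.
CONTROL=${CONTROL:-control.sh}

# Run control.sh with the credentials used throughout this article.
ctl() {
    "$CONTROL" --user ignite --password ignite "$@"
}

# Swap an OFFLINE baseline node for a new node.
#   $1: ConsistentId of the failed node (listed as OFFLINE in the baseline)
#   $2: ConsistentId of the new node (listed under "Other nodes")
# Each step runs only if the previous one succeeded.
replace_baseline_node() {
    old_id=$1
    new_id=$2
    ctl --baseline auto_adjust disable --yes &&
    ctl --baseline remove "$old_id" --yes &&
    ctl --baseline add "$new_id" --yes &&
    ctl --baseline auto_adjust enable --yes
}

# Usage:
#   replace_baseline_node e45e5d2c-42f6-4a50-816e-0b6df02ea865 \
#                         eb3f37bc-2c29-4d63-b9cc-4ed8a776aeeb
```

If a middle step fails, auto-adjust stays disabled, so check the state with the --baseline query before retrying.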