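In elasticjob-lite 2.x and 3.x, leader election uses startsWith when matching the IPs of live nodes under instances against the IPs under servers, which can send the election into an infinite loop in certain abnormal situations #2226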

@McdullFei

Bug Report

In elasticjob-lite 2.x and 3.x, the leader election logic iterates over the live nodes under instances and matches them against the IP node names under servers using startsWith. In certain abnormal situations this causes leader election to enter an infinite loop, which in turn produces abnormal CPU load on zk.

Version in use: elasticjob-lite 2.1.5. The elasticjob-lite 3.0.3 source code has the same problem.

ElasticJob-Lite

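Final conclusions about the scenario that produced the bug in production:

  1. For some historical reason, a job with disabled=true had some ip nodes under its registered server node in zk that were never marked with the DISABLED flag (why the flag was never set can no longer be traced).
  2. One production deployment created an instance with ip 172.28.3.87 (the node 172.28.3.87@-@1 was created under the job's instance node in zk). It turned out that servers contained a node named 172.28.3.8 without the DISABLED flag; that node was created in 2020.
  3. Tracing back through the code, we found:
    com.dangdang.ddframe.job.lite.internal.election.LeaderService#isLeaderUntilBlock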
public boolean isLeaderUntilBlock() {
    while (!hasLeader() && serverService.hasAvailableServers()) {
        log.info("Leader is electing, waiting for {} ms", 100);
        BlockUtils.waitingShortTime();
        if (!JobRegistry.getInstance().isShutdown(jobName) && serverService.isAvailableServer(JobRegistry.getInstance().getJobInstance(jobName).getIp())) {
            electLeader();
        }
    }
    return isLeader();
}

com.dangdang.ddframe.job.lite.internal.server.ServerService#hasAvailableServers

public boolean hasAvailableServers() {
    List<String> servers = jobNodeStorage.getJobNodeChildrenKeys(ServerNode.ROOT);
    for (String each : servers) {
        if (isAvailableServer(each)) {
            return true;
        }
    }
    return false;
}

public boolean isAvailableServer(final String ip) {
    return isEnableServer(ip) && hasOnlineInstances(ip);
}

private boolean hasOnlineInstances(final String ip) {
    for (String each : jobNodeStorage.getJobNodeChildrenKeys(InstanceNode.ROOT)) {
        if (each.startsWith(ip)) {
            return true;
        }
    }
    return false;
}

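The hasOnlineInstances method called from ServerService#hasAvailableServers takes each live node under instances and uses startsWith to match it against the name of every server node that is not marked DISABLED, so "172.28.3.87@-@1".startsWith("172.28.3.8") evaluates to true. The stale node 172.28.3.8 is therefore treated as an available server even though no instance is running on it, and while no leader is elected the loop in isLeaderUntilBlock keeps spinning.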
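The false match can be reproduced outside of ElasticJob. The following is a minimal sketch, not project code: the class name PrefixCollisionDemo and the helper matchesByPrefix are hypothetical, and the helper only repeats the startsWith check from the loop body of hasOnlineInstances quoted above.

```java
public class PrefixCollisionDemo {

    // Hypothetical helper: the same check as the loop body of hasOnlineInstances in this report.
    static boolean matchesByPrefix(String instanceNode, String serverIp) {
        return instanceNode.startsWith(serverIp);
    }

    public static void main(String[] args) {
        // Instance node name taken from the report.
        String instanceNode = "172.28.3.87@-@1";

        // Intended match: the instance really runs on this ip.
        System.out.println(matchesByPrefix(instanceNode, "172.28.3.87")); // true

        // Unintended match: "172.28.3.8" is a plain string prefix of "172.28.3.87".
        System.out.println(matchesByPrefix(instanceNode, "172.28.3.8"));  // true
    }
}
```

Any server ip that is a string prefix of another instance's ip (for example x.x.x.8 versus x.x.x.80 through x.x.x.89) would be affected in the same way.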

  4. If the node under server is not correctly marked DISABLED after a job is disabled, the chance of hitting this bug becomes high.

Please fix the ServerService#hasOnlineInstances method: take the ip that precedes the @-@ delimiter and compare it with a full string match.
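One possible shape of the requested fix is sketched below. This is an illustration under stated assumptions, not the project's actual patch: the class ExactIpMatch and the helper ipOf are hypothetical, the instance node names are passed in as a list instead of being read through jobNodeStorage, and the "@-@" delimiter is taken from the node name 172.28.3.87@-@1 shown in this report.

```java
import java.util.Arrays;
import java.util.List;

public class ExactIpMatch {

    // Delimiter between ip and instance suffix, as seen in the node name "172.28.3.87@-@1".
    static final String DELIMITER = "@-@";

    // Hypothetical helper: returns the part of the instance node name before the delimiter,
    // or the whole name if the delimiter is absent.
    static String ipOf(String instanceNode) {
        int idx = instanceNode.indexOf(DELIMITER);
        return idx < 0 ? instanceNode : instanceNode.substring(0, idx);
    }

    // Stand-in for hasOnlineInstances: compares the extracted ip for equality instead of using startsWith.
    static boolean hasOnlineInstances(String ip, List<String> instanceNodes) {
        for (String each : instanceNodes) {
            if (ipOf(each).equals(ip)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> instanceNodes = Arrays.asList("172.28.3.87@-@1");
        System.out.println(hasOnlineInstances("172.28.3.87", instanceNodes)); // true
        System.out.println(hasOnlineInstances("172.28.3.8", instanceNodes));  // false
    }
}
```

Checking each.startsWith(ip + "@-@") would presumably give the same result with a smaller change, as long as every instance node name contains the delimiter.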
