CN nodes in shared-data mode with replication_factor

We have a problem: if one CN goes down, queries stop processing and fail with the following message:

SQL Error [1064] [42000]: Backend node not found. Check if any backend node is down.backend:
[starrocks-be-2.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-0.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-4.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-5.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-3.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-1.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-cn-7.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-6.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-5.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-4.starrocks-cn-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-cn-3.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-2.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-9.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: true]
[starrocks-cn-0.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-8.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-1.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]

Should we set replication_factor for the local table cache, e.g. 2 or 3, so the cluster stays alive while one node is down? Does replication_factor work with shared-data mode at all? And if a CN is stateless and can be replaced by any other CN, why does this error happen?

replication_factor in the shared-data architecture is hard-coded to 1, since S3 guarantees data availability. As for the query itself that stopped, what type of query was it?
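
A minimal sketch of checking this on a shared-data cluster; the table name and schema below are invented for illustration:

  -- Hypothetical table on a shared-data cluster (no replication_num set).
  CREATE TABLE demo_events (
      id BIGINT,
      payload STRING
  )
  DUPLICATE KEY (id)
  DISTRIBUTED BY HASH (id) BUCKETS 8;

  -- Durability comes from the object store (e.g. S3), not from replicas
  -- on CN local disks, so the replica count stays at 1.
  SHOW CREATE TABLE demo_events;  -- properties should report "replication_num" = "1"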

Just a simple query. After a CN terminates, StarRocks can't process queries until the node restarts.

That would be a different issue than storage availability. Are you using the operator?

Yes, latest 1.9.0

It seems the FE requires all CN nodes to be available in order to process queries. Repro (a rough SQL version of the same sequence follows the steps):

  1. All CN nodes are running
  2. Queries work without errors
  3. Terminate a pod, e.g. the one currently processing the query (picked by CPU metrics)
  4. Queries return errors while the pod restarts
  5. Once the pod is back in the Running state, everything works again
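
Roughly the same sequence from a SQL client, assuming the demo_events table from the earlier sketch; the pod termination in step 3 happens outside SQL, and the pod and namespace names are assumptions:

  -- Steps 1-2: all CN pods running, a simple query succeeds.
  SELECT COUNT(*) FROM demo_events;

  -- Step 3 (outside SQL): terminate one CN pod, e.g.
  --   kubectl delete pod starrocks-cn-4 -n starrocks

  -- Step 4: while the pod restarts, the same query fails with
  -- "Backend node not found. Check if any backend node is down."
  SELECT COUNT(*) FROM demo_events;

  -- The FE's view of the cluster during the outage:
  SHOW COMPUTE NODES;  -- the terminated CN should show Alive = false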

Same thing here. This has been happening going all the way back to Apache Impala (the original source). When a CN goes away while a query is mid-flight, the query fails instead of being rescheduled. Instead of retrying the entire query, perhaps just the failed fragments could be retried.