CN nodes in shared-data mode with replication_factor

We have a problem: if one CN goes down, queries stop processing and fail with the following message:

SQL Error [1064] [42000]: Backend node not found. Check if any backend node is down.backend:
[starrocks-be-2.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-0.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-4.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-5.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-3.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-be-1.starrocks-be-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-cn-7.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-6.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-5.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-4.starrocks-cn-search.starrocks.svc.cluster.local alive: false inBlacklist: false]
[starrocks-cn-3.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-2.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-9.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: true]
[starrocks-cn-0.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-8.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]
[starrocks-cn-1.starrocks-cn-search.starrocks.svc.cluster.local alive: true inBlacklist: false]

Should we set replication_factor for the local table cache, e.g. 2 or 3, so the cluster stays alive while one node is down? Does replication_factor work with shared-data mode at all? And if a CN is stateless and can be replaced by any other CN, why does this error happen?

replication_factor in the shared-data architecture is hard-coded to 1, since S3 guarantees data availability. As for the query itself that stopped, what type of query was it?
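
A minimal sketch of checking this on a shared-data cluster; the table name and schema below are invented for illustration:

  -- Hypothetical table on a shared-data cluster (no replication_num set).
  CREATE TABLE demo_events (
      id BIGINT,
      payload STRING
  )
  DUPLICATE KEY (id)
  DISTRIBUTED BY HASH (id) BUCKETS 8;

  -- Durability comes from the object store (e.g. S3), not from replicas
  -- on CN local disks, so the replica count stays at 1.
  SHOW CREATE TABLE demo_events;  -- properties should report "replication_num" = "1"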

Just a simple query. After a CN terminates, StarRocks can't process queries until the node restarts.

That would be a different issue than storage availability. Are you using the operator?

Yes, latest 1.9.0

It seems the FE requires all CN nodes to be available in order to process queries. Repro (a rough SQL version of the same sequence follows the steps):

  1. All CN nodes are running
  2. Queries work without errors
  3. Terminate a pod, e.g. the one currently processing the query (picked by CPU metrics)
  4. Queries return errors while the pod restarts
  5. Once the pod is back in the Running state, everything works again
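
Roughly the same sequence from a SQL client, assuming the demo_events table from the earlier sketch; the pod termination in step 3 happens outside SQL, and the pod and namespace names are assumptions:

  -- Steps 1-2: all CN pods running, a simple query succeeds.
  SELECT COUNT(*) FROM demo_events;

  -- Step 3 (outside SQL): terminate one CN pod, e.g.
  --   kubectl delete pod starrocks-cn-4 -n starrocks

  -- Step 4: while the pod restarts, the same query fails with
  -- "Backend node not found. Check if any backend node is down."
  SELECT COUNT(*) FROM demo_events;

  -- The FE's view of the cluster during the outage:
  SHOW COMPUTE NODES;  -- the terminated CN should show Alive = false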

Same thing here. This has been happening going all the way back to Apache Impala (the original source). When a CN goes away while a query is mid-flight, the query fails instead of being rescheduled. Instead of retrying the entire query, perhaps just the failed fragments could be retried.