StarRocks Consuming Massive Amounts of Network Bandwidth When Idle

We’re using a small StarRocks cluster to query an Iceberg table set up with around 200GB of data, via an Iceberg REST catalog.
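
For reference, the external catalog is defined roughly along these lines; the catalog name, REST URI, S3 endpoint, and credentials below are placeholders rather than our actual values:

# Rough shape of our Iceberg REST catalog definition (all values are placeholders)
mysql -h 127.0.0.1 -P 9030 -u root <<'SQL'
CREATE EXTERNAL CATALOG iceberg_rest
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://rest-catalog:8181",
    "aws.s3.endpoint" = "https://us-east-1.linodeobjects.com",
    "aws.s3.access_key" = "<access key>",
    "aws.s3.secret_key" = "<secret key>",
    "aws.s3.enable_path_style_access" = "true"
);
SQL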

The cluster is containerized and runs on a single server. It consists of 1 FE node and a pair of BE nodes (our non-production test configuration).

We have been firing single queries at StarRocks with large gaps in time between them. Every time a query is presented to StarRocks (a simple single-table SELECT), it answers with the correct results and then idles. Approximately 5 seconds later, each of the 2 backends pegs a pair of CPU cores at 100% and starts consuming massive amounts of network bandwidth. In our case we’re seeing a total of around 10GB of network traffic consumed over several minutes by each BE, followed by a return to idle. Everything remains quiescent until the next query is fired, and then the whole process repeats. I want to emphasize that this greedy behavior starts AFTER the query has returned its results.
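
For anyone wanting to reproduce the measurement, a per-process bandwidth tool such as nethogs makes it easy to attribute the bursts to the BE PIDs (eth0 below is a placeholder for whatever interface carries the S3 traffic):

# Per-process network throughput; watch the two BE processes during a burst
sudo nethogs eth0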

Note that neither of the BE processes (nor the FE process) is memory, CPU, or network constrained. There is no swapping or anything like that going on (in fact, disk/block storage use is minimal).
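
(The no-swapping claim is easy to verify with standard Linux tools during one of the bursts:)

vmstat 1 10    # si/so (swap-in/swap-out) stay at 0 throughout
free -h        # swap usage stays near zero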

The configuration is shared-nothing, running on a single 16-CPU server (AMD EPYC) with 32GB RAM and a 500GB SSD. The Iceberg table is hosted on S3-compatible storage (Linode).

Can anyone tell us what might be going on, and what settings we should be looking at to prevent this from happening?

Our FE is using all default settings except the usual suspects for S3 configuration, and:

default_replication = 1

The BEs have all default settings, with the exception of separate network ports.
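
Since both BEs share one host, the only per-BE difference is the port assignments; the second BE gets something like the following (the port numbers and path are examples, while the config names are the standard BE port settings):

# be.conf for the second BE -- shift each port off the defaults so the two BEs don't collide
cat <<'EOF' >> /opt/starrocks/be2/conf/be.conf
be_port = 9061
heartbeat_service_port = 9051
brpc_port = 8061
webserver_port = 8041
EOF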

Thanks for any assistance.

You can capture a flame graph to see where the CPU time is going (replace 54614 with the PID of one of your BE processes):

# Sample the BE process at 99 Hz for 30 seconds, then render the flame graph
perf record -F 99 -ag -p 54614 -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf-kernel.svg
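
The stackcollapse-perf.pl and flamegraph.pl scripts come from Brendan Gregg's FlameGraph repository, so grab those first and find the PID of the BE you want to profile (the starrocks_be process name and the paths here are assumptions about your container layout):

# Fetch the flame graph tooling
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Find the PID of a BE process to pass to perf -p
pidof starrocks_be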