Designing StarRocks for High Availability

[Figure: StarRocks architecture diagram]
The architecture of StarRocks is shown in the diagram above. Its high availability mechanism can be described from two aspects:

  • Metadata high availability: how to keep metadata highly available, such as database and table schemas, user information, replica information, and materialized view scheduling information.
  • Application data high availability: how to keep the data that users import into StarRocks highly available after it has been written to the BEs.

Metadata High Availability

Overview

[Figure: FE metadata high availability architecture]
FE nodes are divided into three roles:

  • Leader node: provides metadata read and write services; elected by the Follower nodes.
  • Follower node: provides metadata read services and participates in leader elections.
  • Observer node: provides metadata read services and does not participate in leader elections.

StarRocks uses a replicated state machine model to keep metadata highly available. Nodes first elect a leader through a Paxos-based protocol, and all write operations are initiated by the leader (if the node a user connects to is not the leader, the write request is forwarded to the leader internally). When the leader executes a write, it first writes a log (WAL) locally and then synchronizes it to the other nodes through the replication protocol. Once a majority of nodes confirm receipt of the log, the leader returns success to the client.
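As a hedged illustration (host names and ports below are placeholders), the role of each FE node and the current leader can be inspected, and new nodes added, with statements such as:

```sql
-- The output shows each FE's role (LEADER / FOLLOWER / OBSERVER) and aliveness.
SHOW FRONTENDS;

-- Add a Follower that participates in leader election (placeholder host:edit_log_port).
ALTER SYSTEM ADD FOLLOWER "fe-follower-3:9010";

-- Add an Observer to scale metadata reads without affecting the election quorum.
ALTER SYSTEM ADD OBSERVER "fe-observer-1:9010";
```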

Leader election

The election process is roughly divided into two stages (the Paxos protocol is quite complex; this is only a brief introduction):

  1. Propose stage: when a node finds that the cluster has no leader, it sends a propose request to all nodes. If a majority (⌊n/2⌋ + 1) of the responses are successful, it proceeds to the next stage.
  2. Accept stage: the node compares the information from the successful responses and determines the prospective leader in the following order:
     • the node with the higher VLSN (the BDB JE log sequence number) is preferred;
     • otherwise, the node whose IP comes last in dictionary order is preferred.
     After the prospective leader is determined, an accept request is sent again to the nodes that responded successfully in the first stage. If a majority of the responses are successful, the election succeeds.

A few points to note:

  1. The leader needs confirmation from a majority of nodes, so a majority of the Follower nodes must be alive for the cluster to function properly.
  2. The majority is calculated as ⌊n/2⌋ + 1, so a cluster with an even number of Followers has the same fault tolerance as one with one fewer Follower: a 4-Follower cluster and a 3-Follower cluster can each tolerate only one Follower going down. Followers should therefore be deployed in odd numbers, usually 3 or 5. If you need to scale the cluster's query QPS, deploy more Observer nodes instead.
  3. The propose request carries the local timestamp, which is used to decide how new the proposal is. A node that receives a newer propose request responds to it and discards the older one, so the clock skew between machines cannot be too large. The maximum allowed skew is configured with max_bdbje_clock_delta_ms.
  4. If the data on all nodes is equally up to date, the elected leader is always the same node rather than a random one, because the election logic is deterministic. In a 2-node cluster, therefore, no matter how often the leader restarts, the same node becomes leader. If you want a specific node to become the leader, you can force it with an external command: java -jar fe/lib/je-7.3.7.jar DbGroupAdmin -helperHosts {ip:edit_log_port} -groupName PALO_JOURNAL_GROUP -transferMaster -force {node_name} 5000
  5. The leader and the Followers track the leader's status through heartbeats. The heartbeat timeout is controlled by bdbje_heartbeat_timeout_second; if a Follower does not receive the leader's heartbeat within this time, it starts a new election. Note that if the leader restarts normally, the Followers notice immediately and start a new election, because the heartbeat is carried over a long-lived connection that is broken as soon as the leader restarts.
  6. If the machine has multiple IPs and the FE node is started by IP, you need to configure priority_networks, because the IP is persisted at the first startup. If priority_networks is not configured, a later restart may pick an IP different from the first one, which causes the startup to fail.
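As a hedged sketch, the election-related parameters above can be checked on a running FE with the admin statements below; note that priority_networks must be set in fe.conf before the node first starts, and items that are not runtime-mutable also have to be changed in fe.conf followed by a restart:

```sql
-- Inspect current values of election-related FE configuration items.
ADMIN SHOW FRONTEND CONFIG LIKE "max_bdbje_clock_delta_ms";
ADMIN SHOW FRONTEND CONFIG LIKE "bdbje_heartbeat_timeout_second";

-- For items that are mutable at runtime, a change can be applied like this
-- (it only affects the FE you are connected to and is lost on restart).
ADMIN SET FRONTEND CONFIG ("bdbje_heartbeat_timeout_second" = "60");
```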

Log replication

Each write generates a log entry, identified by a consecutive integer. The Leader node synchronizes log entries to the Follower nodes in order (Observer nodes pull the logs from the Leader themselves). A write is considered successful only after the log has been synchronized to a majority of Followers.

  1. The timeout for log replication is controlled by bdbje_replica_ack_timeout_second.
  2. StarRocks provides session consistency semantics (except for Stream Load): a write that succeeds on a node can be read on that same node at the next moment. If you want to read the latest data on another node, you can send the SYNC command before sending the query.
  3. The leader node writes a log entry containing the current timestamp every 10 s. If a non-leader node receives this log and the timestamp differs greatly from its local time, its metadata is far behind the leader's. The node's local metadata is then marked unreadable and all requests are routed to the leader. The tolerated difference is controlled by meta_delay_toleration_second.
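A hedged sketch of the read-your-writes pattern described above when the follow-up query goes to a different FE node (database and table names are hypothetical):

```sql
-- Session on FE node A: the write succeeds once a majority of FEs have the log.
INSERT INTO example_db.orders VALUES (1, '2024-01-01', 9.99);

-- Session on FE node B: SYNC forces the local FE to catch up with the leader's
-- log before reading, so the row written above is guaranteed to be visible.
SYNC;
SELECT * FROM example_db.orders WHERE order_id = 1;
```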

Checkpoint

When an FE node restarts, it needs to replay the logs to recover historical metadata. As the system runs, more and more logs accumulate, so restart recovery takes longer and longer. Checkpoints are therefore needed to compact multiple logs. The checkpoint is produced by the leader node, and the generated image file is then pushed to the other nodes. Once all nodes have received the image file, the leader deletes the logs that precede it. A checkpoint is triggered after a certain number of logs has been generated; that number is controlled by edit_log_roll_num.

Related parameters

FE

| Parameter name | Parameter description | Default value |
| --- | --- | --- |
| max_bdbje_clock_delta_ms | Maximum allowed clock skew between FE machines (ms) | 5000 |
| bdbje_heartbeat_timeout_second | Heartbeat timeout used by leader election (seconds) | 30 |
| priority_networks | IP range used to select the FE's address | N/A |
| bdbje_replica_ack_timeout_second | Log replication ack timeout (seconds) | 10 |
| meta_delay_toleration_second | Tolerable metadata replication delay (seconds) | 300 |
| edit_log_roll_num | Number of logs that triggers a checkpoint | 50000 |

Application Data High Availability

Overview

StarRocks uses multi-replica storage to keep data highly available after it is written. Specifically, when creating a table, users decide how many replicas the table should have through the replication_num property. When an import transaction is initiated, the data is written to and persisted on multiple replicas before the import is reported successful to the user. StarRocks also lets users trade availability off against performance: they can prioritize performance, returning success after fewer replicas have been written, or prioritize availability.
StarRocks ensures that the replicas of a table are distributed across different hosts. When some replicas become abnormal, for example a write fails because a machine goes down or the network misbehaves, StarRocks automatically detects and repairs them by cloning part or all of the data from healthy replicas on other hosts. StarRocks uses a multi-version (MVCC) design to make imports efficient, and the clone process can use these multi-version files for physical replication, which keeps version repair efficient.
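As a quick illustration of the replication_num property mentioned above (all table and column names here are hypothetical):

```sql
-- Keep 3 replicas of every tablet; 3 is the usual high-availability default.
CREATE TABLE example_db.orders (
    order_id BIGINT,
    dt DATE,
    amount DECIMAL(10, 2)
) DUPLICATE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8
PROPERTIES (
    "replication_num" = "3"
);
```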

Multi-replica write

[Figure: Stream Load import flow]
Taking the Stream Load shown in the figure above as an example, an import goes through the following stages:

  1. Client submits import request to FE
  2. FE selects the Coordinator BE responsible for this import and generates an import execution plan
  3. Coordinator BE reads data to be written from Client
  4. Coordinator BE is responsible for distributing data to multiple Tablet replicas
  5. After the data is written and persisted on multiple replicas, FE publishes the transaction, which makes the imported data visible to the outside world
  6. Return success to Client
Other import methods follow a similar process: data becomes visible only after it has been written to multiple replicas and persisted. Writing to multiple replicas ensures that query services can still be provided in extreme scenarios such as machine downtime or network failures.

Of course, in some scenarios users do not need such strong data availability and care more about import performance. StarRocks supports specifying the import safety level through the write_quorum table property, that is, how many replicas must be written successfully before StarRocks returns import success. This property is supported since version 2.5.

The valid values of write_quorum and their meanings are as follows:

  • MAJORITY: the default. StarRocks returns import success when a majority of the data replicas are written successfully; otherwise it returns failure.
  • ONE: StarRocks returns import success when one data replica is written successfully; otherwise it returns failure.
  • ALL: StarRocks returns import success when all data replicas are written successfully; otherwise it returns failure.
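A hedged sketch of how the write_quorum property might be set (the table name is hypothetical, and modifying the property with ALTER TABLE may require a newer StarRocks version than the one that introduced it):

```sql
-- Trade durability for import latency: one successful replica is enough.
CREATE TABLE example_db.clickstream (
    event_id BIGINT,
    dt DATE,
    payload VARCHAR(1024)
) DUPLICATE KEY(event_id)
DISTRIBUTED BY HASH(event_id) BUCKETS 8
PROPERTIES (
    "replication_num" = "3",
    "write_quorum"    = "ONE"
);

-- Switch back to the default once availability matters more than speed
-- (ALTER TABLE support for this property is version-dependent).
ALTER TABLE example_db.clickstream SET ("write_quorum" = "MAJORITY");
```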

Automatic replica repair

When some replicas become abnormal, for example an import failure leaves a replica with missing versions, a machine crash leaves a replica missing, or a machine is taken offline and a new replica must be added, StarRocks repairs the abnormal replicas automatically.
Every 20 s, FE scans the tablet replicas of all tables and decides whether each replica is healthy based on the current visible version number and the health of the BE it lives on. If a replica's version number is lower than the visible version number, the replica needs to be repaired by incremental clone. If the heartbeat of the BE hosting the replica is lost, or the BE has been manually marked offline, the replica needs to be repaired by full clone.
When FE detects a replica that needs repair, it generates a scheduling task and adds it to the scheduling queue. The scheduler (TabletScheduler) takes tasks out of the queue, creates a clone task according to the task type, and dispatches it to the corresponding BE for execution. For a replica with missing versions, FE sends the clone task directly to the BE hosting that replica and tells it to clone the missing data from a BE holding a healthy replica; the rest of the clone runs on the BE side. For a missing replica, FE picks a new BE, creates an empty replica on it, and sends a full clone task to that BE.
Whether the clone is incremental or full, the BE uses physical replication: it fetches the required physical files directly from the healthy replica and applies them to its own metadata. When the clone finishes, the BE reports the status to the FE scheduler, FE cleans up and updates its metadata, and the repair of this replica is complete. The overall process is shown below.

[Figure: replica repair (clone) flow]
During replica repair, queries can still be served as long as a healthy replica exists. In addition, when 1 of 3 replicas is unhealthy, imports can also proceed normally under the default write_quorum setting of MAJORITY.
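A hedged sketch of how replica health and repair progress can be inspected from the SQL side (database and table names are hypothetical, and the exact output columns vary by version):

```sql
-- List replicas of a table whose status is not OK (e.g. missing versions).
ADMIN SHOW REPLICA STATUS FROM example_db.orders WHERE STATUS != "OK";

-- Inspect the tablet scheduler: pending and running repair/balance tasks.
SHOW PROC '/cluster_balance';
```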

Related parameters

FE

| Parameter name | Default value | Description |
| --- | --- | --- |
| tablet_sched_slot_num_per_path | 8 | Maximum number of clone tasks that can run simultaneously on a single storage path of each BE. For example, a BE with 4 disks and 4 storage paths configured can by default participate in 16 clone tasks at the same time (as source or target). Slow decommission or balance is currently caused mainly by this limit, which exists to cap concurrency so that normal queries and imports are not affected. |
| tablet_sched_max_scheduling_tablets | 2000 | Maximum number of tablets that can currently be scheduled for balance or repair. Both the running queue and the pending queue are checked against it, so up to 4000 tasks can be in the scheduler at the same time. |
| tablet_sched_repair_delay_factor_second | 60 | How long, in seconds, after a tablet is first found unhealthy before it is scheduled for repair. The actual delay varies by priority. |
| tablet_sched_min_clone_task_timeout_sec | 180 | Minimum timeout of a clone task, in seconds. Normally the timeout is estimated from the size of the tablet to be cloned; this and the maximum below bound the estimate to avoid problems in abnormal situations. |
| tablet_sched_max_clone_task_timeout_sec | 7200 | Maximum timeout of a clone task, in seconds. |

BE

| Parameter name | Default value | Description |
| --- | --- | --- |
| parallel_clone_task_per_path | 8 | Number of clone worker threads per storage path on each BE, used to execute clone tasks |

Backup & restore

Backup is currently supported at the database level or at the table level. If you want a daily backup, you can also partition the table by date and back up a single partition each day.
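A hedged sketch of a daily partition-level backup along the lines described above; the repository, bucket, credential, table, and partition names are all assumptions to be replaced with real ones:

```sql
-- 1. Create a remote repository once (an S3-compatible bucket is assumed here).
CREATE REPOSITORY daily_repo
WITH BROKER
ON LOCATION "s3a://my-bucket/starrocks_backup"
PROPERTIES (
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.endpoint"   = "s3.amazonaws.com"
);

-- 2. Back up a single date partition of a date-partitioned table.
BACKUP SNAPSHOT example_db.orders_p20240101
TO daily_repo
ON (orders PARTITION (p20240101))
PROPERTIES ("type" = "full");

-- 3. Watch the backup job's progress.
SHOW BACKUP FROM example_db;
```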


Nodes:

  • The default recommendation is at least 3 FE nodes and at least 3 BE nodes, so that there is no single point of failure.
  • FE metadata is replicated with a Paxos-like consensus quorum, so ⌊n/2⌋ + 1 Follower nodes need to be alive for the cluster to stay available.
  • BE data availability comes from multi-replica storage rather than a consensus quorum; a tablet stays readable as long as one healthy replica remains.

Load Balance all connections:

  • MySQL protocol: use ProxySQL (see the StarRocks documentation).
  • HTTP services: any HTTP load balancer can be used.

Data Replication and Tablets:

  • Use at least 3 replicas. StarRocks can read from any of the replicas.

Recommendation:

  • Use the StarRocks operator as much as possible, since it has availability built in.