Loading Data From Kafka to StarRocks Is Slow

Gerrit_van_Zyl1 · May 20, 2024, 4:00pm

Starrocks Version 3.2.6-2585333

I'm running this locally in a Docker container with 64GB RAM. During loading, all of the containers together use around 15GB RAM and only 13% of the available 1600% CPU (16 cores).

I have a table with around 40 million rows. I'm using the StarRocks sink connector to load data from Postgres to StarRocks.

This specific table is loading at around 2000 rows per second, which is way too slow (another table that is even larger loaded at almost 50k rows per second).

Schema of the table that I'm loading to:

CREATE TABLE IF NOT EXISTS statement(
id BIGINT,
statement_date int,
vat_percentage FLOAT,
amount DOUBLE,
unallocated_amount DOUBLE,
allocation_status BIGINT,
branch_id BIGINT,
provider_id BIGINT,
statement_mapping_id BIGINT,
description VARCHAR(255),
imported CHAR,
source BIGINT,
payment_reference VARCHAR(255),
created_at int,
updated_at int,
created_by_id BIGINT,
updated_by_id BIGINT,
remittance_date int,
division VARCHAR(255),
financial_source_id BIGINT,
bank_recon_id BIGINT,
source_currency_code VARCHAR(3),
source_currency_exchange_rate DOUBLE,
payment_currency_code VARCHAR(3),
reference_number VARCHAR(50),
bank_recon_ignored CHAR,
ignore_item_vat_errors CHAR,
ignore_item_vat_errors_reason VARCHAR(255),
statement_date2 date NULL AS str_to_date(from_unixtime(statement_date * 86400), '%Y-%m-%d') COMMENT ""
) PRIMARY KEY (id)
DISTRIBUTED BY HASH(id)
order by (branch_id)
PROPERTIES (
"replication_num" = "1",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);

(The table that was loading quicker does not include a generated column or an ORDER BY clause.)

My question then: Is this a configuration on the StarRocks side that needs to be tweaked, or is this a Kafka issue?

yan_zhang · May 24, 2024, 5:51am

"(The table that was loading quicker does not include a generated column or an ORDER BY clause.)"
"another table that is even larger loaded at almost 50k rows per second"

From these two points, I’d say it’s not a Kafka source problem. But I can’t explain why a “generated column” or an “ORDER BY clause” would affect ingestion speed.

"I have a table with around 40 million rows. I'm using the StarRocks sink connector to load data from Postgres to StarRocks."

Do you mean the “Kafka connector”?

I would suggest checking whether records are buffered when sinking data into the StarRocks table. If you insert in small batches, the ingestion speed cannot be high, so check whether you can sink into the table in larger batches.
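
As a rough sketch of where to look for the batching behaviour, assuming the sink is the StarRocks connector for Kafka Connect: the connector has two buffer settings, `bufferflush.maxbytes` and `bufferflush.intervalms`, that bound how much data is accumulated before a load is sent. The connector name, topic name, database name, address, and all values below are placeholders made up for illustration, not taken from the original setup, and the property names should be checked against the documentation for the connector version in use.

```properties
# Hypothetical Kafka Connect sink config; names and values are illustrative only.
name=starrocks-statement-sink
connector.class=com.starrocks.connector.kafka.StarRocksSinkConnector
# Placeholder topic name; a CDC source usually produces a longer, prefixed topic name.
topics=statement
starrocks.http.url=localhost:8030
starrocks.database.name=example_db
starrocks.username=root
starrocks.password=
# Map the topic to the target table (topic:table).
starrocks.topic2table.map=statement:statement
# Upper bound on bytes buffered in memory before a batch is sent (256 MB here).
bufferflush.maxbytes=268435456
# Interval in milliseconds at which buffered data is sent (10 s here).
bufferflush.intervalms=10000
```

If the interval is short and the topic delivers records slowly, each load may carry only a few rows, which would fit the small-batch explanation above. Comparing these two settings between the fast table's connector and the slow one's is a cheap first check.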
