A lot of S3 operations on a single stream load

Before asking, did you search first?

Yes: GitHub issues, source code, BE configuration docs, and this forum.

I am inserting roughly 1 million rows every ~5 minutes into a primary key table and seeing a lot of S3 operations.

To investigate how to write in bigger batches, I set:

write_buffer_size=1048576000
lake_compaction_stream_buffer_size_bytes=20971520
max_compaction_num_singleton_deltas=500-1000
min_compaction_num_singleton_deltas=100-300
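As a quick sanity check, here is what those two buffer settings amount to, interpreted as plain byte counts (a sketch; the ranges for the compaction delta settings are value ranges I experimented with, not units):

```python
MiB = 1024 * 1024

# BE settings from above, interpreted as byte counts
write_buffer_size = 1048576000
lake_compaction_stream_buffer_size_bytes = 20971520

print(write_buffer_size / MiB)                         # 1000.0 (MiB write buffer)
print(lake_compaction_stream_buffer_size_bytes / MiB)  # 20.0 (MiB compaction stream buffer)
```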

But the average fslib S3 single-upload iosize (quantile) is only about 1 MB, and there are a lot of async_delta_writer operations.

I am trying to find a way to pre-batch/compact the S3 files and upload them in a single operation, because I am stream loading roughly 100-200 MB each time.

  • StarRocks version: 3.3.3

*I am a new user and can't attach dashboard images, but RPS is about 150 and fslib S3 operations go up to 10k.

Can I increase the starlet block size to decrease S3 IOPS? Or does the block size apply to the local FS only?

  1. Could you show me your table schema? Use:

show create table xxxx;

  2. How do you ingest into the table: stream load, broker load, or routine load? Could you share more information about your ingestion task?

CREATE TABLE REDACTED (
customer_external_id varchar(256) NOT NULL,
product_external_id int(11) NOT NULL,
external_id varchar(128) NOT NULL,
created_at datetime NOT NULL,
REDACTED
3 int(11) NOT NULL columns
2 smallint(6) NOT NULL columns
2 float NOT NULL columns
1 varchar(30) NOT NULL column
1 varchar(256) NOT NULL column
1 varchar(1024) NULL column
2 varchar(32768) NOT NULL column
2 datetime NULL columns
1 array<varchar(1024)> NULL column
)
PRIMARY KEY(customer_external_id, product_external_id, external_id)
PARTITION BY RANGE(product_external_id) (
PARTITION p0_10 VALUES LESS THAN("10000000"),
PARTITION p10_20 VALUES LESS THAN("20000000"),
PARTITION p20_30 VALUES LESS THAN("30000000"),
PARTITION p30_40 VALUES LESS THAN("40000000"),
PARTITION p40_50 VALUES LESS THAN("50000000"),
PARTITION p50_60 VALUES LESS THAN("60000000"),
PARTITION p60_70 VALUES LESS THAN("70000000"),
PARTITION p70_80 VALUES LESS THAN("80000000"),
PARTITION p80_90 VALUES LESS THAN("90000000"),
PARTITION p90_100 VALUES LESS THAN("100000000"),
PARTITION p100_110 VALUES LESS THAN("110000000"),
PARTITION p110_120 VALUES LESS THAN("120000000"),
PARTITION p120_130 VALUES LESS THAN("130000000"),
PARTITION p130_140 VALUES LESS THAN("140000000"),
PARTITION p140_150 VALUES LESS THAN("150000000"),
PARTITION p150_160 VALUES LESS THAN("160000000"),
PARTITION p160_170 VALUES LESS THAN("170000000"),
PARTITION p170_180 VALUES LESS THAN("180000000"),
PARTITION p180_190 VALUES LESS THAN("190000000"),
PARTITION p190_200 VALUES LESS THAN("200000000"),
PARTITION p200_210 VALUES LESS THAN("210000000"),
PARTITION p210_220 VALUES LESS THAN("220000000"),
PARTITION p220_230 VALUES LESS THAN("230000000"),
PARTITION p230_240 VALUES LESS THAN("240000000"),
PARTITION p240_250 VALUES LESS THAN("250000000"),
PARTITION p250_260 VALUES LESS THAN("260000000"),
PARTITION p260_270 VALUES LESS THAN("270000000"),
PARTITION p270_280 VALUES LESS THAN("280000000"),
PARTITION p280_290 VALUES LESS THAN("290000000"),
PARTITION p290_300 VALUES LESS THAN("300000000"),
PARTITION p300_310 VALUES LESS THAN("310000000"),
PARTITION p310_320 VALUES LESS THAN("320000000"),
PARTITION p320_330 VALUES LESS THAN("330000000"),
PARTITION p330_340 VALUES LESS THAN("340000000"),
PARTITION p340_350 VALUES LESS THAN("350000000"),
PARTITION p350_360 VALUES LESS THAN("360000000"),
PARTITION p360_370 VALUES LESS THAN("370000000"),
PARTITION p370_380 VALUES LESS THAN("380000000"),
PARTITION p380_390 VALUES LESS THAN("390000000"),
PARTITION p390_400 VALUES LESS THAN("400000000"),
PARTITION p400_410 VALUES LESS THAN("410000000"),
PARTITION p410_420 VALUES LESS THAN("420000000"),
PARTITION p420_430 VALUES LESS THAN("430000000"),
PARTITION p430_440 VALUES LESS THAN("440000000"),
PARTITION p440_450 VALUES LESS THAN("450000000"),
PARTITION p450_460 VALUES LESS THAN("460000000"),
PARTITION p460_470 VALUES LESS THAN("470000000"),
PARTITION p470_480 VALUES LESS THAN("480000000"),
PARTITION p480_490 VALUES LESS THAN("490000000"),
PARTITION p490_500 VALUES LESS THAN("500000000")
)
DISTRIBUTED BY HASH(customer_external_id) BUCKETS 30
ORDER BY(customer_external_id, created_at)
PROPERTIES (
"storage_medium" = "SSD",
"replication_num" = "3",
"bloom_filter_columns" = "REDACTED 3 int(11) NOT NULL COLUMNS",
"datacache.enable" = "true",
"storage_volume" = "REDACTED",
"enable_async_write_back" = "false",
"enable_persistent_index" = "true",
"fast_schema_evolution" = "true",
"compression" = "ZSTD"
);

  1. Stream load of an uncompressed JSON array, with strip_outer_array and strict_mode enabled.
  2. A single stream load can affect anywhere from 1 to roughly 7-10 partitions at once.

Because you have 30 buckets, if a load affects 10 partitions at once, that is 300 buckets touched at once, and each bucket needs 2-3 PUT operations, so that comes to as many as 900 PUTs per stream load.
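That estimate is simple arithmetic (the 2-3 PUTs-per-bucket figure is taken from the reply above; actual counts depend on the workload):

```python
# Rough upper-bound estimate of S3 PUT requests per stream load.
buckets_per_partition = 30
partitions_touched = 10   # worst case mentioned above
puts_per_bucket = 3       # upper end of the 2-3 range

puts_per_load = buckets_per_partition * partitions_touched * puts_per_bucket
print(puts_per_load)  # 900
```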

And your S3 operations are now at 10k, so I suspect something is off with the stream load ingestion frequency; you should check that.

I will try to find more details

Does StarRocks always write to all 30 buckets if I stream load data into 1 partition?

Also, decreasing the bucket count should help, right? But will StarRocks then have to read more data from S3 to serve read operations?

Also, will the bucket count affect compaction? If I decrease the bucket count, each bucket becomes bigger.

Does StarRocks always write to all 30 buckets if I stream load data into 1 partition?

Yes, always all 30 buckets.

Also, decreasing the bucket count should help, right?

Yes, and the best practice for bucket count is to make each bucket about 1 GB in size (after compression).

E.g. if you have 30 GB per partition, then you can use 30 buckets.
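A minimal sketch of that sizing rule, assuming you know the compressed size of a partition (the helper name is my own, not a StarRocks API):

```python
# Best-practice sizing from the reply above: ~1 GB of compressed data per bucket.
def recommended_buckets(partition_size_gb: float, target_gb_per_bucket: float = 1.0) -> int:
    """Return a bucket count that keeps each bucket near the target size."""
    return max(1, round(partition_size_gb / target_gb_per_bucket))

print(recommended_buckets(30))   # 30
print(recommended_buckets(4.4))  # 4
```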

So to reduce S3 ops, there are three dimensions:

  1. Bucket count.
  2. Partition count written per ingestion.
  3. Ingestion frequency.
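Putting the three dimensions together, here is a back-of-the-envelope model of the sustained PUT rate, reusing the 2-3 PUTs-per-bucket figure from earlier in the thread (a sketch only; the real constant depends on your workload):

```python
# Estimated S3 PUTs per minute for a given ingestion pattern.
def puts_per_minute(buckets: int, partitions_per_load: int,
                    load_interval_min: int, puts_per_bucket: int = 3) -> float:
    """One load every load_interval_min minutes touching the given partitions/buckets."""
    return buckets * partitions_per_load * puts_per_bucket / load_interval_min

# Current setup: 30 buckets, up to 10 partitions per load, one load per ~5 minutes
print(puts_per_minute(30, 10, 5))  # 180.0
# Halving buckets and touching fewer partitions per load
print(puts_per_minute(15, 3, 5))   # 27.0
```

Cutting any one dimension scales the PUT rate linearly, which is why all three are worth tuning together.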