Best practices on a community deployment of StarRocks

StarRocks, the high-performance real-time OLAP database, is a game-changer for data-driven organizations. Its blazing-fast query speeds and seamless integration with data lakes make it ideal for complex analytics tasks. But to truly unlock StarRocks’ potential, following best practices is key.

This blog post dives into key areas to optimize your StarRocks community deployment:

0. Updating to the Latest Version for Fixes:

Most issues are resolved in the latest release. If yours is not, StarRocks is open source, so you can fix the code yourself and contribute the fix, or you can sign a commercial agreement with CelerData.

1. Choosing the Right Deployment Method:

StarRocks offers flexibility in deployment. Here’s a quick rundown:

  • Docker (allin1 quickstart container): Ideal for quick tests and getting familiar with StarRocks.
  • StarRocks Kubernetes Operator: Manages StarRocks in Kubernetes environments for HA and production-ready deployments.
  • Bare Metal/VM: Provides full control for experienced users (requires manual installation and is not recommended).

You can pick between the shared-nothing and the shared-data architecture. Here are the pros and cons:

  • Shared-Nothing Architecture: Slightly faster thanks to local disks; however, scaling out is less cost-efficient because every added node brings its own storage. AWS Redshift is another OLAP system that uses this architecture.
  • Shared-Data Architecture: Slightly slower because data is read from S3 (bounded by lower IOPS), but it scales the most cost-efficiently. Snowflake and GCP BigQuery are other OLAP systems that use this architecture.

2. Selecting the Optimal Server Type:

  • Frontend (FE): A machine with at least 8 CPU cores and 16GB+ RAM is recommended.
  • Backend (BE): Opt for a server with 16+ CPU cores and 64GB+ RAM. Ensure the CPU supports the AVX2 instruction set for StarRocks’ vectorization capabilities.
  • Networking: Equip your servers with 10 Gigabit Ethernet cards and a compatible switch for smooth data transfer.
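
As a quick way to verify the AVX2 requirement above before installing a BE, here is a minimal Python sketch. It assumes a Linux host, where CPU feature flags are listed in /proc/cpuinfo; the function names are illustrative and not part of any StarRocks tooling.

```python
def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels label this line "flags"; some ARM kernels use "Features".
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def supports_avx2(cpuinfo_text):
    """Return True if the avx2 flag is present in the given cpuinfo text."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            text = f.read()
    except OSError:
        # Not a Linux host (or the file is unreadable): report "not detected".
        text = ""
    print("AVX2 supported:", supports_avx2(text))
```

Run it on each candidate BE machine; a result of False means the CPU flag was not found and the machine should be double-checked before use.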

3. Data Modeling for Peak Performance:

  • Picking the Right Table Type: StarRocks offers various table types. Choose between duplicate key, primary key (for upserts and deletes), and aggregate key tables based on your needs.
  • Partitioning and Bucketing: Partitioning data by date or other relevant columns speeds up queries, because partitions that do not match the filter can be skipped. Bucketing further distributes data evenly across CN/BE nodes for parallel processing.
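
To make the table type, partitioning, and bucketing choices concrete, here is a small Python sketch that assembles a CREATE TABLE statement as a string. Nothing is sent to a database. The table and column names are made up, the helper is hypothetical, and the DDL follows my understanding of recent StarRocks syntax (expression partitioning with date_trunc), so check it against the documentation for your version.

```python
def build_create_table(table, columns, key_type, key_columns,
                       partition_expr, bucket_column, buckets):
    """Return a CREATE TABLE statement as a string; nothing is executed."""
    allowed = {"DUPLICATE KEY", "PRIMARY KEY", "AGGREGATE KEY"}
    if key_type not in allowed:
        raise ValueError(f"key_type must be one of {sorted(allowed)}")
    names = [name for name, _ in columns]
    # Assumption: key columns lead the column list, in the same order.
    if names[:len(key_columns)] != list(key_columns):
        raise ValueError("key columns must be the first columns, in order")
    if bucket_column not in names:
        raise ValueError(f"unknown bucket column: {bucket_column}")
    column_sql = ",\n".join(f"    {name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {table} (\n{column_sql}\n)\n"
        f"{key_type} ({', '.join(key_columns)})\n"
        f"PARTITION BY {partition_expr}\n"
        f"DISTRIBUTED BY HASH ({bucket_column}) BUCKETS {buckets};"
    )


# Hypothetical event table: partitioned by day, bucketed by user.
ddl = build_create_table(
    table="events",
    columns=[
        ("event_date", "DATE NOT NULL"),
        ("user_id", "BIGINT NOT NULL"),
        ("event_type", "VARCHAR(64)"),
    ],
    key_type="DUPLICATE KEY",
    key_columns=["event_date", "user_id"],
    partition_expr="date_trunc('day', event_date)",
    bucket_column="user_id",
    buckets=8,
)
print(ddl)
```

Bucketing on a high-cardinality column such as the hypothetical user_id is what spreads rows evenly across nodes; a low-cardinality bucket column tends to produce skew.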

4. Optimizing Query Performance:

  • Leverage Materialized Views: Pre-compute aggregations for frequently used queries for lightning-fast results.
  • Utilize the StarRocks SQL dialect effectively: Understand best practices for writing efficient queries, including proper use of JOINs and filtering conditions.
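
The idea behind a materialized view can be illustrated in plain Python. This is a conceptual sketch only, not StarRocks code: in StarRocks you would declare the aggregation in SQL and let the engine maintain and use it. The rows and names below are invented for the example.

```python
from collections import defaultdict


def build_daily_rollup(rows):
    """Pre-aggregate (day, region, amount) rows into per-(day, region) totals."""
    rollup = defaultdict(float)
    for day, region, amount in rows:
        rollup[(day, region)] += amount
    return dict(rollup)


def revenue_for_day(rollup, day):
    """Answer a frequent query from the small rollup instead of the raw rows."""
    return sum(total for (d, _region), total in rollup.items() if d == day)


raw_rows = [
    ("2024-01-01", "eu", 10.0),
    ("2024-01-01", "eu", 5.0),
    ("2024-01-01", "us", 2.5),
    ("2024-01-02", "us", 7.0),
]
rollup = build_daily_rollup(raw_rows)   # computed once, reused by many queries
print(revenue_for_day(rollup, "2024-01-01"))
```

The trade-off is the same as with a real materialized view: queries read far fewer rows, but the pre-computed result has to be refreshed when the underlying data changes.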

5. Streamlining Data Loading:

  • Choose the Optimal Method: StarRocks offers a wide range of methods to move data into and out of StarRocks.
  • Data Compression: Compress data before loading to save storage space and potentially improve load speeds.
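
As one example of a loading method, Stream Load accepts a file over HTTP. The sketch below only builds the URL and headers for such a request and does not send anything. It assumes the FE HTTP port is 8030 (the usual default) and a CSV payload; the host, database, table, and label values are placeholders, and the helper name is hypothetical.

```python
import base64


def stream_load_request(fe_host, database, table, user, password, label,
                        column_separator=",", http_port=8030):
    """Build the URL and headers for a Stream Load HTTP PUT (not sent here)."""
    url = f"http://{fe_host}:{http_port}/api/{database}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        # A unique label per batch helps avoid loading the same data twice.
        "label": label,
        "column_separator": column_separator,
    }
    return url, headers


url, headers = stream_load_request(
    fe_host="fe.example.internal",   # placeholder host
    database="demo",
    table="events",
    user="root",
    password="",
    label="events_2024_01_01_batch_001",
)
print(url)
```

When actually sending the request, note that the FE typically redirects to a BE/CN node, so the HTTP client has to follow the redirect and resend the credentials (this is what curl's --location-trusted flag does).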

Beyond the Basics:

  • Monitoring and Alerting: Set up monitoring tools (Prometheus and Grafana) to track StarRocks health and performance. Configure alerts to identify and address issues promptly.
  • Security: The community version only has RBAC and cleartext username/password authentication. Additional security features are only available in the CelerData version of StarRocks.
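
To illustrate the monitoring and alerting point above, here is a rough Python sketch of what a threshold alert does with scraped metrics. It parses text in the Prometheus exposition format and reports which series exceed a limit. The metric names and thresholds are invented for the example, and the parser assumes samples without trailing timestamps. In a real deployment, Prometheus alerting rules would do this job.

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # no "name value" pair on this line
        try:
            metrics[series.strip()] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser cannot handle
    return metrics


def breaches(metrics, thresholds):
    """Return the series whose current value exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )


# Hypothetical scrape output; real StarRocks metric names will differ.
sample = """\
# HELP example_query_errors_total Failed queries
# TYPE example_query_errors_total counter
example_query_errors_total 12
example_disk_used_ratio{path="/data1"} 0.91
"""
limits = {
    "example_query_errors_total": 100,
    'example_disk_used_ratio{path="/data1"}': 0.85,
}
print(breaches(parse_metrics(sample), limits))
```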

Community and Resources:

  • The StarRocks community is a valuable resource. Join the forum https://forum.starrocks.io/ for discussions, tips, and troubleshooting help.
  • The official StarRocks documentation https://docs.starrocks.io/ is comprehensive and covers everything from installation to advanced configuration.

By following these best practices, you can unlock the full potential of StarRocks and empower your organization to make data-driven decisions faster than ever. Remember, StarRocks is a powerful tool, and these are just starting points. As you gain experience, explore its advanced features and fine-tune your deployment for optimal performance.


Hello, thanks in advance for your explanations and references about StarRocks. But I don’t agree when you say that shared-data mode is slower.

The following link has more details about using S3 as a storage volume; any object storage is just as capable as local disks:

An analogy that avoids digging through so much documentation: today, any architecture that relies on dedicated nodes reached over the network to access their local disks is equivalent to one that uses an S3 network protocol, an HTTP REST API, or anything else, because either way the network is needed to reach each disk.