Real-time OLAP analytics for ad tech comes with a unique set of demands beyond standard OLAP applications. Here are some of the typical requirements you’ll encounter:
Data Handling and Ingestion:
- High Volume and Velocity: Ad platforms generate massive amounts of data continuously, requiring the system to ingest and process it efficiently without introducing latency.
- Data Variety: Diverse data sources, including ad events, campaign details, user profiles, and financial data, must be integrated and reconciled.
- Real-time Updates: The system needs to ingest and process new data constantly to provide up-to-date insights without delay.
Analytical Capabilities:
- Multidimensional Analysis: Support for drilling down and slicing data along various dimensions like campaign, channel, audience, device, and geography.
- Ad Hoc Queries: Users should be able to ask dynamic questions beyond pre-defined reports, requiring flexible query capabilities.
- Trend Analysis and Forecasting: Identify patterns and trends in real time to predict future performance and optimize campaigns.
- Anomaly Detection: Real-time alerts for unusual deviations in metrics like click-through rates or impressions to pinpoint potential issues.
Performance and Scalability:
- Low Latency: Queries should return results within milliseconds to enable immediate decision-making.
- Scalability: The system must handle increasing data volumes and query workloads without compromising performance.
- High Availability: Guaranteed uptime and fault tolerance are crucial for avoiding data loss and ensuring continuous insights.
Architectural Decisions for StarRocks
- Data Partitioning
  - Pick a partition scheme that matches your query pattern. If your queries commonly cover a 1-day range, partition by a 1-day range. Below is an example of a table partitioned by 1-day range:
CREATE TABLE site_access (
    datekey DATE,
    site_id INT,
    city_code SMALLINT,
    user_name VARCHAR(32),
    pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey) (
    START ("2021-01-01") END ("2021-01-04") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(site_id)
PROPERTIES ("replication_num" = "3");
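With daily partitions, a query that filters on datekey scans only the matching partitions. A minimal sketch against the example table above (the date value is just an illustration):

SELECT site_id, SUM(pv) AS total_pv
FROM site_access
WHERE datekey = '2021-01-02'  -- prunes the scan to a single 1-day partition
GROUP BY site_id;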
- Data Bucketing
  - Although the default bucketing may be "good enough", you can use hash bucketing to optimize query performance. A good bucketing column is one that frequently appears in query filters. Below is the same table partitioned by 1-day range, hash-bucketed on "site_id" for queries that filter on site_id equality within that day:
CREATE TABLE site_access (
    datekey DATE,
    site_id INT,
    city_code SMALLINT,
    user_name VARCHAR(32),
    pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey) (
    START ("2021-01-01") END ("2021-01-04") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(site_id)
PROPERTIES ("replication_num" = "3");
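Because the table is distributed by HASH(site_id), an equality filter on site_id only needs to touch the tablets that hold that key. A hypothetical query that combines partition and bucket pruning (the filter values are illustrative):

SELECT user_name, SUM(pv) AS total_pv
FROM site_access
WHERE datekey = '2021-01-02'  -- partition pruning on the 1-day range
  AND site_id = 42            -- bucket pruning via HASH(site_id)
GROUP BY user_name;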
- Streaming inserts and upserts through event-streaming software such as Apache Kafka or Apache Flink (our two best-supported connectors as of March 2024).
- Apache Kafka: Use the StarRocks Kafka Sink Connector.
  - Use the "bufferflush.intervalms" setting to tune the time interval between micro-batch loads into StarRocks. Start with the default and adjust as needed.
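A minimal Kafka Connect configuration sketch for the sink; the property names follow the StarRocks Kafka connector documentation, while the connector name, topic, hosts, and credentials below are placeholders:

name=starrocks-sink
connector.class=com.starrocks.connector.kafka.StarRocksSinkConnector
topics=ad_events
starrocks.http.url=fe-host:8030
starrocks.database.name=adtech
starrocks.username=ingest_user
starrocks.password=********
# Flush a micro-batch into StarRocks on this interval (ms);
# start with the default and tune from there.
bufferflush.intervalms=10000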
- Apache Flink: Use the StarRocks Flink Sink Connector.
  - Use the "sink.buffer-flush.interval-ms" or "sink.buffer-flush.max-rows" settings to tune the micro-batch loads into StarRocks. Start with the defaults and adjust as needed.
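A minimal Flink SQL sketch of a StarRocks sink table using the flink-connector-starrocks options; the URLs, database, and credentials are placeholders:

CREATE TABLE site_access_sink (
    datekey DATE,
    site_id INT,
    city_code SMALLINT,
    user_name STRING,
    pv BIGINT
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://fe-host:9030',
    'load-url' = 'fe-host:8030',
    'database-name' = 'adtech',
    'table-name' = 'site_access',
    'username' = 'ingest_user',
    'password' = '********',
    -- flush a micro-batch every 10 s or every 64,000 rows, whichever comes first
    'sink.buffer-flush.interval-ms' = '10000',
    'sink.buffer-flush.max-rows' = '64000'
);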
Architecture
- Performance
  - Query
    - [Shared Nothing Architecture] Use at least 3 replicas.
      - More replicas may improve query performance.
    - [Shared Data Architecture] Since all CNs (compute nodes) point to the same S3 bucket, there is no need for replicas.
  - Compute
    - Achieved through horizontal scaling of FE nodes.
      - Use the StarRocks k8s operator to take advantage of its auto-scaling features (a StarRocksCluster manifest sketch follows this list).
  - Storage
    - Achieved through horizontal scaling of BE nodes.
      - Use the StarRocks k8s operator to take advantage of its auto-scaling features.
    - [Shared Nothing Architecture] Use SSD storage with the highest IOPS.
      - Ideally NVMe if you can afford it; otherwise SSD.
    - [Shared Data Architecture] Use an S3 service with the highest IOPS.
      - In many cases that is MinIO, on premises or in a public cloud (it has higher limits than the native services). AWS limits S3 to 5,000 IOPS, as do many similar services in GCP and Azure.
- Elasticity and Scalability
  - Use the StarRocks k8s operator to take advantage of the k8s elasticity and scalability features.
- Interoperability
  - Stay with software that supports k8s (including its networking and storage), containers, and S3-compatible object storage.
- Adaptability
  - Check with the StarRocks community about the latest integrations.
- Usability
- Maintainability
  - Use the StarRocks k8s operator to take advantage of the k8s upgrade features.
- Integration
- Availability
  - Use the StarRocks k8s operator to take advantage of the k8s high-availability features.
  - Use at least 3 FE nodes so the cluster can still provide service in the event of a pod failure.
  - Use at least 3 BE nodes.
- Disaster Recovery
  - You will need to design your own disaster-recovery solution.
- Security
  - You will need to design your own security model for your solution.
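Several of the k8s-operator items above (auto-scaling, upgrades, high availability, the 3-FE/3-BE minimums) come together in a single StarRocksCluster resource. A hypothetical manifest sketch, assuming the operator's CRD field names; the cluster name and image tags are placeholders:

apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: starrocks-adtech
spec:
  starRocksFeSpec:
    replicas: 3            # at least 3 FE pods to keep serving through a pod failure
    image: starrocks/fe-ubuntu:latest
  starRocksBeSpec:
    replicas: 3            # at least 3 BE pods
    image: starrocks/be-ubuntu:latest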
Category Definitions
- Availability – What percentage of time does the cloud vendor guarantee services will be available (including scheduled maintenance downtime)? Bear in mind that although 99% availability may sound good, it equates to just over 3.5 days of potential downtime a year ((1 − 0.99) × 365 ≈ 3.65 days), and even 99.9% can mean almost 9 hours. Also consider, as part of this, the disaster-recovery aspects of availability, and if more than one physical data center is used, where do they reside? The latter is especially important where data residency is an issue, i.e., if your data needs to stay on-shore for legal or regulatory reasons.
- Elasticity (Scalability) – How easy is it to bring online or take down compute resources (CPU, memory, network) as workload increases or decreases?
- Interoperability – If you use services from multiple cloud providers, how easy is it to move workloads between them? (Hint: open standards help here.) And what if you want to migrate from one cloud provider to another? (Hint: open standards help here as well.)
- Security – What security levels and standards are in place? For public/private clouds outside your own data center, also consider the physical security of the cloud provider's data centers as well as its networks. Data residency again needs to be considered as part of this.
- Adaptability – How easy is it to extend, add to, or grow services as business needs change? For example, if I want to change my business processes or connect to new back-end or external APIs, how easy would that be?
- Performance – How well suited is my cloud infrastructure to supporting the workloads that will be deployed onto it, particularly as those workloads grow?
- Usability – This will differ depending on who the client is (business users, developers/architects, or IT operations). In all cases, consider the ease of use of the software and how well its interfaces are designed. IT is no longer hidden inside your own company; your systems of engagement are out there for all the world to see, and effective design of those systems is more important than ever.
- Maintainability – Viewed more from an IT operations and developer point of view: how easy is it to manage (and develop) the cloud services?
- Integration – In a world of hybrid cloud, where some workloads and data need to remain in your own data center (usually systems of record) while others are deployed to public or private clouds (usually systems of engagement), how those two environments integrate is crucial.
- Disaster Recovery – Do we have multiple hot data centers? Do we have a hot-warm setup with defined RPO (recovery point objective) and RTO (recovery time objective)? How do we move compute, network, and data? Do we take continuous backups?