Ray.io + StarRocks

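Here’s an example of reading a StarRocks table into a Ray Dataset. StarRocks speaks the MySQL protocol, so ray.data.read_sql can connect to it through the standard mysql-connector-python driver on the FE query port (9030 by default).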

Code

atwong@Alberts-MBP-3 sandbox % cat script.py
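import mysql.connector
import ray


def create_connection():
    # StarRocks speaks the MySQL protocol; 9030 is the FE's default query port.
    return mysql.connector.connect(
        user="root",
        password="",
        host="localhost",
        port=9030,
        connection_timeout=30,
        database="demo",
    )


# read_sql takes a zero-argument factory that returns a DB-API 2.0 connection.
ds = ray.data.read_sql("SELECT * FROM sr_member", create_connection)

# Print a sample of rows, one dict per row.
ds.show()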
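
The script above hardcodes the connection settings. One possible variation, sketched here under assumptions, reads them from environment variables and falls back to the same values as script.py. The STARROCKS_* variable names and the starrocks_connection_kwargs helper are hypothetical and are not part of Ray, StarRocks, or the MySQL connector.

```python
import os


def starrocks_connection_kwargs(env=os.environ):
    """Build keyword arguments for mysql.connector.connect().

    The STARROCKS_* names are hypothetical; defaults mirror script.py.
    """
    return {
        "user": env.get("STARROCKS_USER", "root"),
        "password": env.get("STARROCKS_PASSWORD", ""),
        "host": env.get("STARROCKS_HOST", "localhost"),
        # Environment values are strings, so convert the port to an int.
        "port": int(env.get("STARROCKS_PORT", "9030")),
        "connection_timeout": 30,
        "database": env.get("STARROCKS_DATABASE", "demo"),
    }


def create_connection():
    # Imported here so the helper above works without the driver installed.
    import mysql.connector

    return mysql.connector.connect(**starrocks_connection_kwargs())
```

create_connection still takes no arguments, so it can be passed to ray.data.read_sql exactly as before. The factory runs inside Ray worker processes, so on a multi-node cluster the variables would also need to be visible to the workers, not just to the driver.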

Results

atwong@Alberts-MBP-3 sandbox % python script.py
2023-08-09 15:47:32,582	INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-08-09 15:47:33,316	INFO read_api.py:374 -- To satisfy the requested parallelism of 200, each read task output will be split into 200 smaller blocks.
2023-08-09 15:47:33,321	INFO dataset.py:2180 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2023-08-09 15:47:33,322	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadSQL->SplitBlocks(200)]
2023-08-09 15:47:33,322	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-08-09 15:47:33,322	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
[dataset]: Run `pip install tqdm` to enable progress reporting.
{'sr_id': 1, 'name': 'tom', 'city_code': 100000, 'reg_date': datetime.date(2022, 3, 13), 'verified': 1}
{'sr_id': 6, 'name': 'mohammed', 'city_code': 300000, 'reg_date': datetime.date(2022, 3, 17), 'verified': 1}
{'sr_id': 4, 'name': 'ronaldo', 'city_code': 100000, 'reg_date': datetime.date(2022, 3, 15), 'verified': 0}
{'sr_id': 2, 'name': 'johndoe', 'city_code': 210000, 'reg_date': datetime.date(2022, 3, 14), 'verified': 0}
{'sr_id': 3, 'name': 'maruko', 'city_code': 200000, 'reg_date': datetime.date(2022, 3, 14), 'verified': 1}
{'sr_id': 5, 'name': 'pavlov', 'city_code': 210000, 'reg_date': datetime.date(2022, 3, 16), 'verified': 0}