Prerequisites
For this tutorial you need to:
- Have Docker Desktop or podman container runtime installed.
- Have a MySQL client.
- A StarRocks or CelerData database.
- Download the NYC Yellow or Green Tax Data and upload into a S3 bucket
- Create a database, database table and query the data.
- Execute local disk parquet file load
- Visualize this dataset via Apache SuperSet.
Have Docker Desktop or podman container runtime installed
This is out of scope for the tutorial.
Have a MySQL client
This is out of scope for the tutorial.
A StarRocks or CelerData database cluster
This is out of scope for the tutorial.
Download the NYC Yellow or Green Taxi Data and upload into a S3 bucket
You can download the NYC Yellow Taxi Data at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
and the data dictionary can be found at https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
. I downloaded the January 2023 data at https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
which is about 50 megs. The next step is that you have to upload the data into an object store. I used AWS S3 service.
Alternatively you can use the Green Taxi Data. It’s a smaller set of data. Between Yellow and Green, Green does not have 1 data field that Yellow has and Green has an additional data field.
RFE: Import parquet data from http:// URIs. #23903
Create a database, database table and query the data
Create the database.
create database demo;
Create the table based off of the data dictionary.
drop demo.table taxi_yellow;
create demo.table taxi_yellow (
tpep_pickup_datetime DATETIME
, VendorID int
, tpep_dropoff_datetime DATETIME
, passenger_count int
, trip_distance float
, PULocationID string
, DOLocationID string
, RatecodeID int
, store_and_fwd_flag string
, payment_type int
, fare_amount float
, extra float
, mta_tax float
, improvement_surcharge float
, tip_amount float
, tolls_amount float
, total_amount float
, congestion_surcharge float
, airport_fee float
)
ENGINE=OLAP
DUPLICATE KEY(`tpep_pickup_datetime`)
DISTRIBUTED BY HASH(`tpep_pickup_datetime`) BUCKETS 9;
or for Green Taxi Data use the following sql
drop table demo.taxi_green;
create table demo.taxi_green (
lpep_pickup_datetime DATETIME
, VendorID int
, lpep_dropoff_datetime DATETIME
, passenger_count int
, trip_distance float
, PULocationID string
, DOLocationID string
, RatecodeID int
, store_and_fwd_flag string
, payment_type int
, fare_amount float
, extra float
, mta_tax float
, improvement_surcharge float
, tip_amount float
, tolls_amount float
, total_amount float
, congestion_surcharge float
, trip_type int
)
ENGINE=OLAP
DUPLICATE KEY(`lpep_pickup_datetime`)
DISTRIBUTED BY HASH(`lpep_pickup_datetime`) BUCKETS 9;
Execute local disk parquet file load
We need to first get the file onto disk that is running StarRocks.
curl https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet -o /tmp/green_tripdata_2023-01.parquet
Then execute the mysql client command to load the file.
StarRocks > show broker;
+--------------+-----------+------+-------+---------------------+---------------------+--------+
| Name | IP | Port | Alive | LastStartTime | LastUpdateTime | ErrMsg |
+--------------+-----------+------+-------+---------------------+---------------------+--------+
| allin1broker | 127.0.0.1 | 8000 | true | 2023-07-30 10:11:57 | 2023-07-30 10:16:57 | |
+--------------+-----------+------+-------+---------------------+---------------------+--------+
1 row in set (0.00 sec)
StarRocks > load label taxiload1 (data infile("file:///tmp/green_tripdata_2023-01.parquet") into table taxi_green format as "parquet"(VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, RatecodeID, trip_distance, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee ) ) with broker allin1broker properties("timeout"="3600");
Query OK, 0 rows affected (0.02 sec)
Execute the load command to get data from S3 into StarRocks.
LOAD LABEL demo.s3upload
(
DATA INFILE("s3a://s3loadtable1/yellow_tripdata_2023-01.parquet")
INTO TABLE taxi_yellow
(VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, RatecodeID, trip_distance, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee )
)
WITH BROKER
(
"aws.s3.access_key" = "XXXXX",
"aws.s3.secret_key" = "YYYY",
"aws.s3.use_instance_profile" = "false",
"aws.s3.region" = "us-west-2"
);
The load is similar for Green Taxi Data.
See the status of the load. Keep on running the show load
command until you see a success or failure.
use demo;
show load;
Finally query the data.
select count(*) from taxi_yellow;
select * from taxi_yellow;
And you will see this as a result.
StarRocks > select count(*) from taxi_yellow;
+----------+
| count(*) |
+----------+
| 3066766 |
+----------+
1 row in set (0.07 sec)
Visualize this dataset via Apache SuperSet
Check out the StarRocks and Apache SuperSet tutorial at #23210 or StarRocks and preset.io tutorial at #24506.