Loading and querying NYC Yellow or Green Taxi Data (Parquet format) with StarRocks

Prerequisites
For this tutorial you need to:

  • Have Docker Desktop or podman container runtime installed.
  • Have a MySQL client.
  • A StarRocks or CelerData database.
  • Download the NYC Yellow or Green Tax Data and upload into a S3 bucket
  • Create a database, database table and query the data.
  • Execute local disk parquet file load
  • Visualize this dataset via Apache SuperSet.

Have Docker Desktop or podman container runtime installed

This is out of scope for the tutorial.

Have a MySQL client

This is out of scope for the tutorial.

A StarRocks or CelerData database cluster

This is out of scope for the tutorial.

Download the NYC Yellow or Green Taxi Data and upload into a S3 bucket

You can download the NYC Yellow Taxi Data at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and the data dictionary can be found at https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf. I downloaded the January 2023 data at https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet which is about 50 megs. The next step is that you have to upload the data into an object store. I used AWS S3 service.

Alternatively you can use the Green Taxi Data. It’s a smaller set of data. Between Yellow and Green, Green does not have 1 data field that Yellow has and Green has an additional data field.

RFE: Import parquet data from http:// URIs. #23903

Create a database, database table and query the data

Create the database.

create database demo;

Create the table based off of the data dictionary.

drop demo.table taxi_yellow;
create demo.table taxi_yellow (
     tpep_pickup_datetime DATETIME     
  , VendorID int                          
  , tpep_dropoff_datetime DATETIME   
  , passenger_count int                   
  , trip_distance float                   
  , PULocationID string          
  , DOLocationID string             
  , RatecodeID int                        
  , store_and_fwd_flag string            
  , payment_type int                       
  , fare_amount float                      
  , extra float                           
  , mta_tax float                          
  , improvement_surcharge float                         
  , tip_amount float                      
  , tolls_amount float                   
  , total_amount float                     
  , congestion_surcharge float            
  , airport_fee float            
)
ENGINE=OLAP
DUPLICATE KEY(`tpep_pickup_datetime`)
DISTRIBUTED BY HASH(`tpep_pickup_datetime`) BUCKETS 9;

or for Green Taxi Data use the following sql

drop table demo.taxi_green;
create table demo.taxi_green (
     lpep_pickup_datetime DATETIME     
  , VendorID int                          
  , lpep_dropoff_datetime DATETIME   
  , passenger_count int                   
  , trip_distance float                   
  , PULocationID string          
  , DOLocationID string             
  , RatecodeID int                        
  , store_and_fwd_flag string            
  , payment_type int                       
  , fare_amount float                      
  , extra float                           
  , mta_tax float                          
  , improvement_surcharge float                         
  , tip_amount float                      
  , tolls_amount float                   
  , total_amount float                     
  , congestion_surcharge float            
  , trip_type int         
)
ENGINE=OLAP
DUPLICATE KEY(`lpep_pickup_datetime`)
DISTRIBUTED BY HASH(`lpep_pickup_datetime`) BUCKETS 9;

Execute local disk parquet file load
We need to first get the file onto disk that is running StarRocks.

curl https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet -o /tmp/green_tripdata_2023-01.parquet

Then execute the mysql client command to load the file.

StarRocks > show broker;
+--------------+-----------+------+-------+---------------------+---------------------+--------+
| Name         | IP        | Port | Alive | LastStartTime       | LastUpdateTime      | ErrMsg |
+--------------+-----------+------+-------+---------------------+---------------------+--------+
| allin1broker | 127.0.0.1 | 8000 | true  | 2023-07-30 10:11:57 | 2023-07-30 10:16:57 |        |
+--------------+-----------+------+-------+---------------------+---------------------+--------+
1 row in set (0.00 sec)
StarRocks > load label taxiload1 (data infile("file:///tmp/green_tripdata_2023-01.parquet") into table taxi_green format as "parquet"(VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, RatecodeID, trip_distance, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee ) ) with broker allin1broker properties("timeout"="3600");
Query OK, 0 rows affected (0.02 sec)

Execute the load command to get data from S3 into StarRocks.

LOAD LABEL demo.s3upload
(
    DATA INFILE("s3a://s3loadtable1/yellow_tripdata_2023-01.parquet")
    INTO TABLE taxi_yellow
    (VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, RatecodeID, trip_distance, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge, airport_fee )
)
WITH BROKER
(
    "aws.s3.access_key" = "XXXXX",
    "aws.s3.secret_key" = "YYYY",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.region" = "us-west-2"
);

The load is similar for Green Taxi Data.

See the status of the load. Keep on running the show load command until you see a success or failure.

use demo;
show load;

Finally query the data.

select count(*) from taxi_yellow;
select * from taxi_yellow;

And you will see this as a result.

StarRocks > select count(*) from taxi_yellow;
+----------+
| count(*) |
+----------+
|  3066766 |
+----------+
1 row in set (0.07 sec)

Visualize this dataset via Apache SuperSet

Check out the StarRocks and Apache SuperSet tutorial at #23210 or StarRocks and preset.io tutorial at #24506.

1 Like

This is a more complex example and more typical of an analytics query.

select * from nyc_taxi_yellow limit 10;
create table if not exists taxi_tip_rate_per_passenger as
select
passenger_count,
 count(*) as trip_count,
 avg(tip_amount / fare_amount) as tip_rate
from nyc_taxi_yellow
where
pickup_time between '2020-03-01' and '2020-03-31'
  and passenger_count > 0
  and fare_amount > 0
group by passenger_count;
1 Like