By Digoal
This article describes how to optimize the data collected about the virtual tracks traveled by a vehicle by removing unnecessary data and separating data by vehicle.
One common scenario for Internet of Vehicles (IoV) applications is to collect the virtual tracks traveled by a vehicle. Virtual tracks are usually not generated and reported in real time, but rather consist of aggregated data that is reported at some interval. The aggregated data comes from multiple vehicles, and a single position is often reported at multiple points in time. To make sense of this data, it is important that you optimize it.
Typically, the data of virtual tracks consists of the following parameters:
(car_id, pos geometry, crt_time timestamp)
This data includes information such as the vehicle identification information, its position, and the timestamp of the data collected.
From the data collected, you can also infer other information, such as traffic jams or stops at red lights, from a lack of change in the vehicle's position over time. Reported data looks like this:
1, position 1, '2017-01-01 12:00:00'
1, position 1, '2017-01-01 12:00:05'
1, position 1, '2017-01-01 12:00:10'
1, position 1, '2017-01-01 12:00:15'
1, position 1, '2017-01-01 12:00:20'
1, position 2, '2017-01-01 12:00:30'
To save storage space and reduce the volume of data, it is more efficient to clean the tracks first. For example, by retaining only two records per stay at a position, the arrival record and the departure record (that is, the first and last reports at that position), you can significantly reduce the size of the data stored.
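The cleaning idea can be sketched in a few lines of Python: group consecutive records that share the same position, and keep only the first and last record of each run. The data below is the illustrative sample from above, not a real feed.

```python
from itertools import groupby

# Sample track for one vehicle: (car_id, position, timestamp),
# mirroring the report format shown above.
track = [
    (1, "position 1", "2017-01-01 12:00:00"),
    (1, "position 1", "2017-01-01 12:00:05"),
    (1, "position 1", "2017-01-01 12:00:10"),
    (1, "position 1", "2017-01-01 12:00:15"),
    (1, "position 1", "2017-01-01 12:00:20"),
    (1, "position 2", "2017-01-01 12:00:30"),
]

def clean_track(rows):
    """Keep only the arrival (first) and departure (last) record of
    each consecutive run of identical (car_id, position) pairs."""
    cleaned = []
    for _, run in groupby(rows, key=lambda r: (r[0], r[1])):
        run = list(run)
        cleaned.append(run[0])        # arrival record
        if len(run) > 1:
            cleaned.append(run[-1])   # departure record, if distinct
    return cleaned

print(clean_track(track))
```

Applied to the six sample records, this keeps three: the arrival and departure at position 1, and the single report at position 2. The database-side equivalent, using window functions, is shown in the steps that follow.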
You can optimize the data using window functions. The following steps show how to clean the tracks with them.
1. Design a table structure.
create table car_trace (cid int, pos point, crt_time timestamp);
The columns store the vehicle ID, the position of the vehicle, and the time the record was generated.
2. Generate 10 million test data records, assuming 1,000 vehicles. The positions are drawn from a small grid of points, so each position is reported repeatedly, which makes the cleaning effect easy to observe.
insert into car_trace select random()*999, point((random()*5)::int, (random()*5)::int), clock_timestamp() from generate_series(1,10000000);
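A hypothetical Python equivalent of the SQL above (scaled down to 10,000 rows) illustrates the shape of the generated data: a random vehicle ID, a position on a small grid, and a timestamp.

```python
import random
from datetime import datetime

# Illustrative stand-in for the SQL generate_series insert:
# 10,000 rows (scaled down from 10 million), ~1,000 vehicle IDs,
# positions rounded onto a small grid so duplicates are common.
random.seed(42)
rows = [
    (
        int(random.random() * 999),                                # cid
        (round(random.random() * 5), round(random.random() * 5)),  # pos
        datetime.now(),                                            # crt_time
    )
    for _ in range(10_000)
]
```

Because the grid is small, any given vehicle revisits the same positions many times, which is exactly the redundancy the cleaning step removes.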
3. Create an index with the car identification number and the timestamp.
create index idx_car on car_trace (cid, crt_time);
4. Run the following command to query the data layout for the car with the identification number of 1:
select * from car_trace where cid=1 order by crt_time limit 1000;
1 | (3,1) | 2017-07-22 21:30:09.84984
1 | (1,4) | 2017-07-22 21:30:09.850297
1 | (1,4) | 2017-07-22 21:30:09.852586
1 | (1,4) | 2017-07-22 21:30:09.854155
1 | (1,4) | 2017-07-22 21:30:09.854425
1 | (3,1) | 2017-07-22 21:30:09.854493
Note that several consecutive records share the same position.
5. Filter the records generated at each position using window functions, keeping at most two records per stay: the arrival and the departure.

Two window functions are used here: lag returns the position of the previous record within the same vehicle's track, and lead returns the position of the next record.

A record is an arrival or departure point if its position differs from the previous record (pos <> lag), differs from the next record (pos <> lead), or has no previous or next record at all (lag is null or lead is null, that is, the first and last records of the track).

Run the following command:

select * from
(
select
*,
lag(pos) over (partition by cid order by crt_time) as lag,
lead(pos) over (partition by cid order by crt_time) as lead
from car_trace
where cid=1
and crt_time between '2017-07-22 21:30:09.83994' and '2017-07-22 21:30:09.859735'
) t
where pos <> lag
or pos <> lead
or lag is null
or lead is null;
cid | pos | crt_time | lag | lead
-----+-------+----------------------------+-------+-------
1 | (2,1) | 2017-07-22 21:30:09.83994 | | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.839953 | (2,1) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.840704 | (3,1) | (4,4)
1 | (4,4) | 2017-07-22 21:30:09.84179 | (5,2) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.843787 | (4,4) | (1,5)
1 | (1,5) | 2017-07-22 21:30:09.844165 | (5,2) | (0,5)
1 | (0,5) | 2017-07-22 21:30:09.84536 | (1,5) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.845896 | (0,5) | (3,3)
1 | (3,3) | 2017-07-22 21:30:09.846958 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.84984 | (3,3) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.850297 | (3,1) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.854425 | (1,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.854493 | (1,4) | (3,2)
1 | (3,2) | 2017-07-22 21:30:09.854541 | (3,1) | (2,0)
1 | (2,0) | 2017-07-22 21:30:09.855297 | (3,2) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857592 | (2,0) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857595 | (4,1) | (0,4)
1 | (0,4) | 2017-07-22 21:30:09.857597 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.858996 | (0,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.859735 | (3,1) |
(20 rows)
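The same lag/lead filter can be reproduced outside PostgreSQL. The sketch below uses Python's built-in sqlite3 module (SQLite 3.25+ supports window functions) on a small illustrative subset of the track, with positions stored as text; the aliases prev_pos and next_pos stand in for lag and lead.

```python
import sqlite3

# Minimal stand-in for car_trace: four reports at the same position
# followed by a move, taken from the sample output above.
con = sqlite3.connect(":memory:")
con.execute("create table car_trace (cid int, pos text, crt_time text)")
con.executemany(
    "insert into car_trace values (?, ?, ?)",
    [
        (1, "(1,4)", "2017-07-22 21:30:09.850297"),
        (1, "(1,4)", "2017-07-22 21:30:09.852586"),
        (1, "(1,4)", "2017-07-22 21:30:09.854155"),
        (1, "(1,4)", "2017-07-22 21:30:09.854425"),
        (1, "(3,1)", "2017-07-22 21:30:09.854493"),
    ],
)
rows = con.execute("""
    select cid, pos, crt_time from (
        select *,
               lag(pos)  over (partition by cid order by crt_time) as prev_pos,
               lead(pos) over (partition by cid order by crt_time) as next_pos
        from car_trace
    ) t
    where pos <> prev_pos or pos <> next_pos
       or prev_pos is null or next_pos is null
""").fetchall()
for r in rows:
    print(r)
```

The two middle records at (1,4) are dropped, leaving only the arrival and departure at (1,4) plus the move to (3,1), just as in the PostgreSQL output above.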
For comparison, if the track cleaning filter is not applied, the query returns the following results:
select
*,
lag(pos) over (partition by cid order by crt_time) as lag,
lead(pos) over (partition by cid order by crt_time) as lead
from car_trace
where cid=1
and crt_time between '2017-07-22 21:30:09.83994' and '2017-07-22 21:30:09.859735';
cid | pos | crt_time | lag | lead
-----+-------+----------------------------+-------+-------
1 | (2,1) | 2017-07-22 21:30:09.83994 | | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.839953 | (2,1) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.840704 | (3,1) | (4,4)
1 | (4,4) | 2017-07-22 21:30:09.84179 | (5,2) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.843787 | (4,4) | (1,5)
1 | (1,5) | 2017-07-22 21:30:09.844165 | (5,2) | (0,5)
1 | (0,5) | 2017-07-22 21:30:09.84536 | (1,5) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.845896 | (0,5) | (3,3)
1 | (3,3) | 2017-07-22 21:30:09.846958 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.84984 | (3,3) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.850297 | (3,1) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.852586 | (1,4) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.854155 | (1,4) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.854425 | (1,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.854493 | (1,4) | (3,2)
1 | (3,2) | 2017-07-22 21:30:09.854541 | (3,1) | (2,0)
1 | (2,0) | 2017-07-22 21:30:09.855297 | (3,2) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857592 | (2,0) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857595 | (4,1) | (0,4)
1 | (0,4) | 2017-07-22 21:30:09.857597 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.858996 | (0,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.859735 | (3,1) |
(22 rows)
The lag and lead window functions remove the records generated while the vehicle stayed at a position, keeping only its arrival and departure records.
Because data collected involves many vehicle IDs, the data of different vehicles is usually aggregated and written into the database. If this data is not optimized, the data of different vehicles may be mixed together or scattered in a confusing way. That is, data for different vehicles may be stored in the same data block.
There are two ways to keep the data of different vehicles from being mixed together: partition the table by vehicle ID so that each vehicle's data is written to its own partition, or periodically reorganize (cluster) the stored data by vehicle ID. Both methods reorganize the data according to the query pattern in order to reduce scan I/O amplification.
This approach is similar to the one described in PostgreSQL Securities Industry Database Requirement Analysis and Application; refer to that article for more information.