Internet of Vehicles (IoV) is one of the hottest topic of research in the field of Internet of Things (IoT). A typical scenario for Internet of Vehicles applications is collecting vehicle's travel tracks, but tracks of vehicles are usually not reported in real time. Several track records may be accumulated or reported at intervals.
A typical data structure is as follows:
(car_id, pos geometry, crt_time timestamp)
Heavy traffic and traffic signals often occur during the vehicle traveling process. The reported track records may be as follows:
1, position 1, '2017-01-01 12:00:00'
1, position 1, '2017-01-01 12:00:05 PM'
1, position 1, '2017-01-01 12:00:10 PM'
1, position 1, '2017-01-01 12:00:15 PM'
1, position 1, '2017-01-01 12:00:20 PM'
1, position 2, '2017-01-01 12:00:30 PM'
That is, multiple records in the same position may be uploaded due to heavy traffic or traffic signals.
Therefore, there is a requirement to clean unnecessary wait records from the database. We keep at most two records for a given point, indicating arrival and departure.
This operation can be executed via window function.
Surely, in terms of providing the best efficiency, it is more reasonable to clean tracks from a terminal. Only two records will be kept for the start point of a position.
1.Design a table structure
create table car_trace (cid int, pos point, crt_time timestamp);
2.Generate 10 million pieces of test data and assume there are 1,000 vehicles (in order to make it easy to repeat the data and test the effect, 25 points are used at the position).
insert into car_trace select random()*999, point((random()*5)::int, (random()*5)::int), clock_timestamp() from generate_series(1,10000000);
3.Create an index
create index idx_car on car_trace (cid, crt_time);
4.Query a data layout
select * from car_trace where cid=1 order by crt_time limit 1000;
1 | (3,1) | 2017-07-22 21:30:09.84984
1 | (1,4) | 2017-07-22 21:30:09.850297
1 | (1,4) | 2017-07-22 21:30:09.852586
1 | (1,4) | 2017-07-22 21:30:09.854155
1 | (1,4) | 2017-07-22 21:30:09.854425
1 | (3,1) | 2017-07-22 21:30:09.854493
As shown on the list, several pieces of data are repetitive.
5.Filter the records at a single position via the window. At most two records for arrival and departure from the position will be kept.
Two window functions are used here: lead and lag. Lag indicates the previous record of the current record and Lead indicates the next record of the current record.
The method of determining arrival and departure points is as follows:
•The current position is not equal to the previous position, indicating that the record is the current position's arrival point.
•The current position is not equal to the next position, indicating that the record is the current position's leaving point.
•The previous position is empty, indicating that the current record is the first record.
•The next position is empty, indicating that the current record is the last record.
select * from
(
select
*,
lag(pos) over (partition by cid order by crt_time) as lag,
lead(pos) over (partition by cid order by crt_time) as lead
from car_trace
where cid=1
and crt_time between '2017-07-22 21:30:09.83994' and '2017-07-22 21:30:09.859735'
) t
where pos <> lag
or pos <> lead
or lag is null
or lead is null;
cid | pos | crt_time | lag | lead
-----+-------+----------------------------+-------+-------
1 | (2,1) | 2017-07-22 21:30:09.83994 | | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.839953 | (2,1) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.840704 | (3,1) | (4,4)
1 | (4,4) | 2017-07-22 21:30:09.84179 | (5,2) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.843787 | (4,4) | (1,5)
1 | (1,5) | 2017-07-22 21:30:09.844165 | (5,2) | (0,5)
1 | (0,5) | 2017-07-22 21:30:09.84536 | (1,5) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.845896 | (0,5) | (3,3)
1 | (3,3) | 2017-07-22 21:30:09.846958 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.84984 | (3,3) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.850297 | (3,1) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.854425 | (1,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.854493 | (1,4) | (3,2)
1 | (3,2) | 2017-07-22 21:30:09.854541 | (3,1) | (2,0)
1 | (2,0) | 2017-07-22 21:30:09.855297 | (3,2) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857592 | (2,0) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857595 | (4,1) | (0,4)
1 | (0,4) | 2017-07-22 21:30:09.857597 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.858996 | (0,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.859735 | (3,1) |
(20 rows)
When track cleaning is not performed, the obtained results are as follows:
select
*,
lag(pos) over (partition by cid order by crt_time) as lag,
lead(pos) over (partition by cid order by crt_time) as lead
from car_trace
where cid=1
and crt_time between '2017-07-22 21:30:09.83994' and '2017-07-22 21:30:09.859735';
cid | pos | crt_time | lag | lead
-----+-------+----------------------------+-------+-------
1 | (2,1) | 2017-07-22 21:30:09.83994 | | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.839953 | (2,1) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.840704 | (3,1) | (4,4)
1 | (4,4) | 2017-07-22 21:30:09.84179 | (5,2) | (5,2)
1 | (5,2) | 2017-07-22 21:30:09.843787 | (4,4) | (1,5)
1 | (1,5) | 2017-07-22 21:30:09.844165 | (5,2) | (0,5)
1 | (0,5) | 2017-07-22 21:30:09.84536 | (1,5) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.845896 | (0,5) | (3,3)
1 | (3,3) | 2017-07-22 21:30:09.846958 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.84984 | (3,3) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.850297 | (3,1) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.852586 | (1,4) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.854155 | (1,4) | (1,4)
1 | (1,4) | 2017-07-22 21:30:09.854425 | (1,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.854493 | (1,4) | (3,2)
1 | (3,2) | 2017-07-22 21:30:09.854541 | (3,1) | (2,0)
1 | (2,0) | 2017-07-22 21:30:09.855297 | (3,2) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857592 | (2,0) | (4,1)
1 | (4,1) | 2017-07-22 21:30:09.857595 | (4,1) | (0,4)
1 | (0,4) | 2017-07-22 21:30:09.857597 | (4,1) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.858996 | (0,4) | (3,1)
1 | (3,1) | 2017-07-22 21:30:09.859735 | (3,1) |
(22 rows)
Lag and lead are used to clear records in the stay process.
In a typical scenario, a lot of vehicle IDs are involved in the business. Data gathered for different vehicles is usually written into a database. If no optimization is made, after entering the database, the data of different vehicles may be staggered. That is, the data of different vehicles may be stored in one data block.
A lot of data blocks will be scanned (scanning IO amplification) when the track of a single vehicle is queried. There are two optimization methods to speed up the querying process.
1.Write into the database after the business end gathers grouping and sorting.
For example, after receiving the data submitted by the vehicle terminal, the program groups vehicle IDs, sorts them by time, and writes them into the database (using insert into tbl values (),(),...();). In this way, the data of the same vehicle will fall into the same data block as much as is possible.
2.Use a partition to reorganize the data in the database.
For example, we can store data based on the vehicle ID, every vehicle, or vehicle HASH partition.
The two methods both relate to reorganizing the data based on query requirements to achieve the purpose of decreasing scanning IO.
Backend-as-a-Service (BaaS) for Efficient Software Development
2,599 posts | 764 followers
Followdigoal - June 26, 2019
digoal - July 24, 2019
Alibaba Clouder - February 15, 2018
digoal - May 28, 2019
Apache Flink Community China - December 25, 2019
digoal - September 17, 2019
2,599 posts | 764 followers
FollowProvides secure and reliable communication between devices and the IoT Platform which allows you to manage a large number of devices on a single IoT Platform.
Learn MoreA cloud solution for smart technology providers to quickly build stable, cost-efficient, and reliable ubiquitous platforms
Learn MoreMigrate your Internet Data Center’s (IDC) Internet gateway to the cloud securely through Alibaba Cloud’s high-quality Internet bandwidth and premium Mainland China route.
Learn MoreMore Posts by Alibaba Clouder