Research on outlier detection algorithm for express logistics


Computers in Industry 92 (2017) 43–49 Contents lists available at ScienceDirect Computers in Industry journal homepage: www.elsevier.com/locate/comp...



Yan Zhang a,b, Zheyi Li a, Jiefeng Wen a, Yiwei Ma a, Oh-kyoung Kwon b,*

a Chongqing University of Post and Telecommunication, Chongqing 400065, China
b Graduate School of Logistics, INHA University, Incheon 402751, Korea

ARTICLE INFO

Article history:
Received 14 September 2016
Received in revised form 2 May 2017
Accepted 23 May 2017
Available online xxx

Keywords:
Express logistics
Multi-attribute
Outlier detection

ABSTRACT

To effectively detect time delays, path errors and destination errors in the express logistics process, this paper proposes a novel outlier detection algorithm for express logistics. An express logistics system operating model is built to test the detection results. Experimental results show that the proposed algorithm applies well to express logistics data with multi-attribute characteristics and works well in detecting abnormal conditions in express logistics. © 2017 Published by Elsevier B.V.

1. Introduction

Joint distribution [1–3] is the development direction of the modern large-scale express logistics distribution mode [4–6]. Heavy traffic and complex routes [7,8] arise from its large-scale load distribution mode [9] and from the fact that it does not distinguish between shippers and commodities [10]. To some degree, this leads to delays in distribution time, distribution path errors and destination errors. Finding abnormal distribution problems quickly and feeding them back to enterprises for timely, dynamic processing is therefore an important part of the modern large-scale express logistics distribution mode.

At present, outlier detection technology for express logistics is still at an initial stage, because most outlier detection research focuses on simple, structured datasets [11]. The main approaches include the following: Lee et al. [12] proposed the TRAOD algorithm based on a partition-and-detect framework; Knorr et al. [13–15] proposed an algorithm based on extracting global features of the trajectory; Li et al. [16–18] proposed a classifier-based method, and so on. The methods above are all trajectory outlier detection methods for line segments with a certain shape. Express logistics abnormalities, however, include distribution time delay, path error, destination error, and so on. Moreover, the trajectory of express logistics is a directed line segment that consists of a series of distribution centers ordered by time and distribution level. Compared with a traditional trajectory, it has multi-attribute characteristics such as time, direction, position coordinates, city level and city point affiliation. Existing outlier detection methods [19] are thus unable to meet the demands of express logistics distribution.

Therefore, this paper takes the express logistics trajectory dataset with multi-attribute characteristics as the research object and proposes an outlier detection method for finding anomalies in express logistics.

* Corresponding author. E-mail address: [email protected] (O.-k. Kwon).
http://dx.doi.org/10.1016/j.compind.2017.05.003
0166-3615/© 2017 Published by Elsevier B.V.

2. Problem description

To illustrate what express logistics outlier detection is and to solve the anomaly detection problem effectively, we first define express logistics abnormality. This paper studies three anomalies in express delivery: distribution time delay, distribution path error and destination error. The three anomalies are shown in Fig. 1 and defined in detail as follows.

Fig. 1. The distribution schematic.

Definition 1. Distribution time delay. While a parcel is being distributed, the time consumed on a given section should be roughly fixed under normal circumstances. If the time consumed on a distribution path sub-segment increases sharply compared with the normal time consumption, a distribution time delay has occurred.

Definition 2. Distribution path error. While a parcel is being distributed, there is an established distribution path for the parcel according to the planning. If the parcel does not follow the established path, a path error may have occurred. Two cases can lead to this result: (1) Because the amount of goods at a distribution center or logistics center (A or D in Fig. 1) is too large, the system plans a diversion path (A-E-B in Fig. 1) based on certain rules. (2) The distribution path is wrong and a path error has occurred. In the first case, we check whether the actual split distribution scheme (A-D-B in Fig. 1) is consistent with the best split distribution scheme (the one with the shortest distance that does not cause a backlog of parcels; A-E-B in Fig. 1). If the actual split distribution scheme differs from the best one, a path error has occurred.

Definition 3. Destination error. When distribution is finished, we check whether the attributes (position coordinates, serial number, subordinate level, etc.) of the center the parcel actually reached are consistent with those of the destination. If the actual distribution destination is inconsistent with the destination on the order, a destination error has occurred.

3. Proposed algorithm in detail

To find abnormal distribution problems quickly, we propose a novel express logistics outlier detection method, based on an in-depth analysis of the modern large-scale express logistics distribution mode.

3.1. Algorithm realization process

3.1.1. Distribution trajectory data multi-attribute representation

The distribution path is represented by extracting the position coordinates, serial number and subordinate level of each distribution center, the time at which the parcel arrives at the distribution center, the volume of parcels in the distribution center, and the maximum transport capacity of the line between two adjacent distribution centers.
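The attributes listed above, together with the termination flag carried by the sub-segment tuple, can be held in a small record type. The following Python sketch is only an illustration: the field names follow the paper's symbols, while the class name `SubSegment`, the field types, the use of days as the time unit for `tim_s` and `tim_e`, and the helper `elapsed_days` are assumptions made here, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubSegment:
    """One sub-segment of a distribution path between two adjacent centers."""
    x_s: float      # abscissa of the starting distribution center
    y_s: float      # ordinate of the starting distribution center
    x_e: float      # abscissa of the ending distribution center
    y_e: float      # ordinate of the ending distribution center
    num_s: int      # serial number of the starting center
    num_e: int      # serial number of the ending center
    grad_s: int     # level of the starting center
    grad_e: int     # level of the ending center
    tim_s: float    # arrival time at the starting center (days, assumed unit)
    tim_e: float    # arrival time at the ending center (days, assumed unit)
    gamt_se: float  # total amount of parcels routed over this line
    ctr_se: float   # maximum transport capacity of this line per day
    flag: int       # 1 if the parcel has reached its termination, else 0

    def elapsed_days(self) -> float:
        """Time consumed on this sub-segment (hypothetical helper)."""
        return self.tim_e - self.tim_s


# Invented sample values for a sub-segment A-D.
p_ad = SubSegment(x_s=0.0, y_s=0.0, x_e=3.0, y_e=4.0, num_s=1, num_e=4,
                  grad_s=1, grad_e=2, tim_s=2.0, tim_e=3.25,
                  gamt_se=120.0, ctr_se=200.0, flag=0)
print(p_ad.elapsed_days())  # 1.25
```

Making the record immutable (`frozen=True`) is a design choice of this sketch: a recorded sub-segment is historical data and should not change once stored.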
A sub-segment of a distribution path can be expressed as follows:

$$p\langle x_s, y_s, x_e, y_e, num_s, num_e, grad_s, grad_e, tim_s, tim_e, gamt_{se}, ctr_{se}, flag\rangle$$

where:
$x_s$, $y_s$ represent the abscissa and the ordinate of the starting distribution center of the path sub-segment;
$x_e$, $y_e$ represent the abscissa and the ordinate of the ending distribution center of the path sub-segment;
$num_s$, $num_e$ represent the serial numbers of the starting and ending distribution centers;
$grad_s$, $grad_e$ represent the levels of the starting and ending distribution centers;
$tim_s$, $tim_e$ represent the times at which the parcel arrives at the starting and ending distribution centers;
$gamt_{se}$ represents the total amount of parcels sent from the starting distribution center to the ending distribution center (that is, the amount of all parcels passing through the line of the path sub-segment);
$ctr_{se}$ represents the maximum transport capacity of the line between the starting and ending distribution centers (that is, the largest amount of parcels the starting distribution center can deliver to the ending distribution center in one day);
$flag$ indicates whether the parcel has reached its termination (that is, the parcel will not be distributed further).

Annotation: The starting point and the ending point mentioned above do not refer to the source or the destination of the order, but to the two adjacent distribution centers that form one sub-segment of a complete distribution path. As shown in Fig. 1, assume that a parcel should be distributed from A to B and that its complete distribution path is A-D-C-B. The path is then made up of three sub-segments (A-D, D-C, C-B), and the two end points of each sub-segment are its starting and ending distribution centers. For example, points D and C are the starting and ending distribution centers of sub-segment D-C respectively. The sub-segment $p_{AD}$ (i.e. A-D) of the distribution path can be expressed as follows:

$$p_{AD}\langle x_A, y_A, x_D, y_D, num_A, num_D, grad_A, grad_D, tim_A, tim_D, gamt_{AD}, ctr_{AD}, flag\rangle$$

3.1.2. The process of outlier detection

3.1.2.1. Time outlier detection

Assume that the threshold of normal time consumption from one distribution center to an adjacent center is $Time\_t$ (unit: day). If the actual time consumption $t$ is longer than $Time\_t$, the distribution time consumption is abnormal. For example, assume that parcel g should be sent from A to D. The corresponding sub-segment of its distribution path is $p_{AD}$ as given in Section 3.1.1, and the corresponding time consumption is $t_{AD}$ (unit: day). If $t_{AD} > Time\_t$, we set the outlier degree of distribution time as $O_{tim} = (t_{AD} - Time\_t)/Time\_t$. Otherwise, we set the outlier degree of distribution time as $O_{tim} = 0$.
Here, $t_{AD}$ and $Time\_t$ are calculated by the following equations:

$$t_{AD} = tim_D - tim_A \tag{1}$$

$$Time\_t = \bar{t}_{AD} + 3\, m_{AD} \tag{2}$$

where $\bar{t}_{AD}$ and $m_{AD}$ are defined as follows:

$$\bar{t}_{AD} = \frac{1}{n}\sum_{i=1}^{n} t_{ADi} \tag{3}$$

$$t_{ADi} = t_{Di} - t_{Ai} \tag{4}$$

$$m_{AD} = \frac{1}{n}\sum_{i=1}^{n}\left(t_{ADi} - \bar{t}_{AD}\right)^2 \tag{5}$$
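Eqs. (1)–(5) and the thresholding rule for the time outlier degree can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `time_outlier_degree` and its argument layout are assumptions, and the sketch takes Eq. (5) literally, so the spread term is the population variance (divisor n) of the historical time consumptions.

```python
from statistics import fmean, pvariance
from typing import Sequence


def time_outlier_degree(t_actual: float, history: Sequence[float]) -> float:
    """Return the time outlier degree O_tim of one sub-segment.

    t_actual: time consumed by the current parcel on the sub-segment, Eq. (1).
    history:  time consumptions of earlier orders on the same sub-segment,
              each obtained as in Eq. (4); all values in days.
    """
    if not history:
        raise ValueError("at least one historical time consumption is needed")
    mean_t = fmean(history)              # Eq. (3): average time consumption
    m = pvariance(history, mu=mean_t)    # Eq. (5): variance with divisor n
    threshold = mean_t + 3 * m           # Eq. (2): Time_t
    if t_actual > threshold:
        return (t_actual - threshold) / threshold
    return 0.0


# With identical historical values the variance is 0, so Time_t equals the mean.
print(time_outlier_degree(1.5, [1.0, 1.0, 1.0]))  # 0.5
```

A parcel whose time consumption stays at or below the threshold receives a degree of 0, so only delays beyond the threshold contribute to the overall outlier degree.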

In Eq. (1), $t_{AD}$ represents the actual time consumed by the parcel sent from A to D, and $tim_D$ and $tim_A$ represent the times at which the parcel reached centers D and A respectively. In Eq. (2), $Time\_t$ represents the threshold of normal distribution time consumption, and $\bar{t}_{AD}$ and $m_{AD}$ represent the average value and the variance of the normal time consumption of parcels sent from A to D respectively. In Eq. (3), $n$ is the number of orders whose parcels were sent through distribution path sub-segment A-D, and $t_{ADi}$ is the time consumed by order $i$ on that sub-segment. It is calculated by Eq. (4), where $t_{Di}$ and $t_{Ai}$ represent the times at which parcel $i$ reached distribution centers D and A respectively. The variance of the distribution time consumption is obtained by Eq. (5).

3.1.2.2. Distribution path outlier detection

Once an order is generated, its distribution path can be obtained from historical distribution experience. The outlier detection of the parcel's distribution process therefore takes the established distribution path as input data and compares the actual distribution path with the experienced distribution path. As shown in Fig. 1, assume that parcel g should be distributed from distribution center F to G and that the experienced distribution path is F-A-B-G. According to this path, the normal next distribution center after A is B. Therefore, when parcel g is sent from distribution center A and reaches the next center (B, D or E in Fig. 1), we check whether the attributes of the center actually reached are consistent with those of the normal next distribution center.
If the attributes are the same, the parcel was sent to the right distribution center (center B), and we set the distribution path outlier degree of the parcel as $O_{tra} = 0$. Otherwise, we set $O_{tra} = ot$ ($ot$ is the path outlier degree parameter, which ranges from 0 to 1). In the case of $O_{tra} = ot$ (i.e. the parcel was not sent to the right distribution center), we check whether the total amount of parcels ($gamt_{AB}$) that should be distributed from the previous distribution center (center A) to the right next distribution center (center B) is larger than the maximum transport capacity ($ctr_{AB}$) of the line between them. If $gamt_{AB} \le ctr_{AB}$, we set the distribution path outlier degree as $O_{tra} = 2ot$. If $gamt_{AB} > ctr_{AB}$, a warehouse explosion problem has occurred in distribution center A, so the parcels in A need to be split and distributed to B. In this case, we check whether the actual split distribution scheme (A-D-B in Fig. 1) is consistent with the best split distribution scheme (i.e. the one with the shortest distance that does not cause a backlog of parcels; A-E-B in Fig. 1). If the actual split distribution scheme differs from the best one, we set $O_{tra} = 2ot$. Otherwise, we set $O_{tra} = 0$.

3.1.2.3. Destination outlier detection

While the parcel is still in the distribution process (flag = 0), a destination outlier cannot occur, so destination outlier detection is unnecessary. When distribution is finished (flag = 1), we check whether the attributes (position coordinates, serial number, subordinate level, etc.) of the center the parcel actually reached are consistent with the destination's. If they are not, the parcel was distributed to the wrong destination, and we set the destination outlier degree as $O_{ter} = oe$ ($oe$ is the destination outlier degree parameter, which ranges from 0 to 1; the proposed value is 1). Otherwise, we set $O_{ter} = 0$.

3.1.2.4. Distribution abnormality determination

The distribution outlier degree ($O_g$) of parcel g is calculated as follows:

$$O_g = v_{tim} O_{tim} + v_{tra} O_{tra} + v_{ter} O_{ter} \tag{6}$$

where $v_{tim}$, $v_{tra}$ and $v_{ter}$ represent the outlier degree weights of distribution time, path and destination respectively. The proportion of each weight is adjusted according to practical requirements, and the weights are normalized according to that proportion, so each weight lies between 0 and 1 and the weights sum to 1. $O_{tim}$, $O_{tra}$ and $O_{ter}$ represent the outlier degrees of distribution time, path and destination respectively. The judging guideline is: if $O_g > O_{out}$ ($O_{out}$ is the outlier threshold, which ranges between 0 and 1), the distribution is abnormal; otherwise, it is normal.
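The determination step can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: it assumes that Eq. (6) is the weighted sum of the three outlier degrees, which the normalized weights suggest, and the function names `normalize_weights` and `overall_outlier_degree`, the weight proportions and the threshold value are all hypothetical.

```python
from typing import Tuple

Weights = Tuple[float, float, float]


def normalize_weights(p_tim: float, p_tra: float, p_ter: float) -> Weights:
    """Scale raw proportions so that the three weights sum to 1."""
    total = p_tim + p_tra + p_ter
    if total <= 0:
        raise ValueError("weight proportions must have a positive sum")
    return p_tim / total, p_tra / total, p_ter / total


def overall_outlier_degree(o_tim: float, o_tra: float, o_ter: float,
                           weights: Weights) -> float:
    """Assumed form of Eq. (6): weighted sum of the three outlier degrees."""
    v_tim, v_tra, v_ter = weights
    return v_tim * o_tim + v_tra * o_tra + v_ter * o_ter


# Invented example: time counts twice as much as path or destination.
weights = normalize_weights(2, 1, 1)                   # (0.5, 0.25, 0.25)
o_g = overall_outlier_degree(0.0, 0.5, 1.0, weights)   # 0.125 + 0.25
is_abnormal = o_g > 0.3                                # O_out = 0.3, illustrative
print(o_g, is_abnormal)  # 0.375 True
```

Keeping the normalization separate from the scoring makes it easy to adjust the proportions to practical requirements without touching the decision rule.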

Fig. 2. The process of the algorithm.
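The rule-based branches of Sections 3.1.2.2 and 3.1.2.3 can be sketched in Python as follows. The sketch is illustrative only: the function names and the boolean inputs (`reached_expected_next`, `followed_best_split`, `reached_destination`) are assumptions that stand in for the attribute comparison and the search for the best split scheme, neither of which is spelled out in enough detail to reproduce here.

```python
def path_outlier_degree(reached_expected_next: bool, gamt: float, ctr: float,
                        followed_best_split: bool, ot: float) -> float:
    """Return the path outlier degree O_tra for one sub-segment.

    gamt and ctr refer to the line towards the expected next center.
    """
    if reached_expected_next:
        return 0.0            # parcel went to the right next center
    if gamt <= ctr:
        return 2 * ot         # line not overloaded, so the detour is an error
    # Line overloaded: a split route is allowed, but only the best one.
    return 0.0 if followed_best_split else 2 * ot


def destination_outlier_degree(flag: int, reached_destination: bool,
                               oe: float = 1.0) -> float:
    """Return the destination outlier degree O_ter."""
    if flag == 0:
        return 0.0            # still in distribution, nothing to check yet
    return 0.0 if reached_destination else oe


# Invented values: wrong next center although the line had spare capacity.
print(path_outlier_degree(False, gamt=80.0, ctr=100.0,
                          followed_best_split=False, ot=0.25))  # 0.5
print(destination_outlier_degree(flag=1, reached_destination=False))  # 1.0
```

Note that, following the text, the intermediate value `ot` never survives as a final result: once the capacity check is made, the degree is either 0 or `2 * ot`.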


3.1.2.5. The overall process of the algorithm

Input: the express logistics distribution dataset, the path outlier degree parameter ($ot$), the destination outlier degree parameter ($oe$), the weight of the time outlier degree ($v_{tim}$), the weight of the path outlier degree ($v_{tra}$), the weight of the destination outlier degree ($v_{ter}$), and the total outlier threshold ($O_{out}$).

Output: the information of the orders that are distributed abnormally.

The overall process of the algorithm is shown in Fig. 2. The steps of the algorithm are as follows:

Step 1. Obtain the order information of parcel g, its latest distribution path sub-segment $p_{se}$, and the set of historical distribution path sub-segments $N(p_{se})$, in which each sub-segment has the same starting and ending distribution centers as $p_{se}$.
Step 2. Use Eqs. (1)–(5) to calculate the time outlier value $O_{tim}$ of $p_{se}$.
Step 3. Analyze the distribution path of $p_{se}$ and calculate the path outlier value $O_{tra}$.
Step 4. Check whether parcel g has reached its actual destination (the parcel will not be distributed any more). If flag = 1, calculate the destination outlier value $O_{ter}$ of parcel g. Otherwise, jump to Step 5.
Step 5. Use Eq. (6) to combine the outlier degrees and calculate the final outlier value $O_g$.
Step 6. Compare $O_g$ with $O_{out}$ to judge whether the distribution is abnormal. If $O_g > O_{out}$, the distribution of the parcel is abnormal, and the outlier distribution dataset is updated. Otherwise, the distribution is normal.
Step 7. Jump to Step 1 to detect the next distribution path sub-segment.

Steps 2 to 5 are achieved by the following pseudocode.

Begin
    Calculate the time consumption t_se of p_se according to Eq. (1)
    For each p_se' ∈ N(p_se)
        Calculate the time consumption t_se' of each historical distribution path sub-segment in set N(p_se) according to Eq. (1).
        Save the calculation result in the time-consuming set tim(p_se).
    End for
    Calculate the average value of the time-consuming set tim(p_se) according to Eq. (3).
    Calculate the variance m_se of the time-consuming set tim(p_se) according to Eq. (5).
    Calculate the threshold value Time_t of normal distribution time consumption according to Eq. (2).
    If t_se > Time_t
        O_tim = (t_se − Time_t) / Time_t

4. The express logistics system operating model

To test the detection results, we extracted the inherent characteristics of express logistics and built an express logistics system operating model. The system model contains seven logistics function modules: the system parameter configuration module, the multi-city architecture generation module, the order generation module, the storage module, the order transport module, the distribution outlier detection module and the terminal distribution module. The model is implemented in MATLAB/GUI.

The operating principles are defined as follows. Orders are distributed from a starting point and sent to the distribution center belonging to the city where the order was created. The orders are then delivered from the distribution center to the previous level, the logistics center. After all orders have been processed and classified, each order is sent to the corresponding logistics center of its destination city. Next, the orders are delivered from the logistics center to the next level, the distribution center. Finally, each order arrives at its destination through the distribution center. The order terminal distribution module is a simple model of the virtual logistics system, and its main function is the successful delivery of orders from the secondary city to the terminal city. (While orders are in the distribution process, abnormal distribution occurs with a random probability. It includes distribution to the wrong distribution center, a sudden sharp increase in distribution time consumption, and distribution to the wrong destination. We set the distribution time consumption between two adjacent centers to range from 0.5 to 1.5 days.) The overall operating chart is shown in Fig. 3. The main interface is shown in Fig. 4.

Fig. 3. Express logistics simulation system operating diagram.

Fig. 4. The main interface of system.

5. Experimental results and analysis

The configured system parameters are read by the parameter configuration module, as shown in Fig. 5, and the city architecture is generated in accordance with the city architecture generation principle, as shown in Fig. 6. In Fig. 6, the blue lines connect the third-level points and the second-level points and represent the distribution paths between those distribution centers. The light blue lines connect the second-level points and the first-level points and represent the distribution paths between those distribution centers. The green dotted lines are the distribution paths between first-level distribution centers.

Fig. 5. The parameters configuration of this experiment.

Fig. 6. System structure.

The outlier detection result when the system runs to the 6th day is shown in Fig. 7. In Fig. 7, the red lines in the left red box are the abnormal distribution path sub-segments, which carry abnormal distribution orders. The form in the right red box lists the abnormal distribution orders' information on that day (i.e. the 6th day). The specific information of these orders is shown in Table 1.


Fig. 7. The result of the outlier detection.

Table 1 The specific information of abnormal distribution orders.

source point

destination point

flag Volume of cargo (t)

time consuming flag

date abnormal date of order

destination center

null destination center

time destination consuming center

time destination consuming center

time consuming

33 25 51 18 16 39

21 54 57 56 59 23

0.001264 0 0.012986 0.008013 0 0.005182 0.018551 0.005102

0.04 1.95 0.09 1.84 1.93 1.89

5 5 7 5 4 8

33 25 51 18 16 39

1 1 1 1 1 1

1.08 0.83 1.11 1.12 1.18 0.85

0.86 1.04 0.90 0.97 0.89 1.11

1.04 1.05

10 10 10 10 10 10

22 4 10 1 1 20

8 3 5 6 6

2

19 0

Table 2 The established distribution paths of corresponding orders.

source point

destination point

destination center

destination center

destination center

destination center

33

21

33

22

8

2

25

54

25

4

3

54

0

51 18

57 56

51 18

10 1

5 6

11 56

57 0

16

59

16

1

6

39

23

39

20

destination center

0

19

3

destination center

0

23

0

The established distribution paths of the abnormal distribution orders are shown in Table 2. According to Table 2, the order with source point 33 and destination point 21 should, after reaching distribution center 2, be distributed to destination point 21. Table 1 shows, however, that it was distributed to the wrong destination point 9. Similarly, Tables 1 and 2 show that the order with source point 16 and destination point 59 was distributed to the wrong destination point 42. Destination errors therefore occurred in the distribution of these two orders.

Tables 1 and 2 also show that the order with source point 25 and destination point 54 should, after reaching distribution point 3, have been distributed to the right next point 23, but it was actually distributed from point 3 to point 8. Similarly, the order with source point 18 and destination point 56 and the order with source point 39 and destination point 23 were distributed to the wrong next distribution points. Distribution path errors therefore occurred in the distribution of these three orders.

Table 1 further shows that, for the order with source point 25 and destination point 54, the time consumed from point 5 to point 14 is 2.91 days. This is much longer than the normal time consumption (0.5–1.5 days), so a distribution time delay occurred in this order's distribution.

1.16 1.02 0


From the above analysis, we can conclude that the proposed algorithm effectively finds abnormal distribution problems in express logistics, such as distribution time delay, distribution path error and destination error.

6. Conclusions

Express delivery raises an urgent demand for abnormal distribution detection because of the trend towards large-scale distribution. However, the distribution data set generated by large-scale distribution is large, multi-attribute and complex compared with structured data sets. As a result, traditional anomaly detection methods cannot be directly applied to these data and face severe challenges. To address the problem that existing outlier detection methods are unable to meet the demands of express logistics distribution, this paper proposes an express logistics outlier detection method. The algorithm effectively finds abnormal distribution problems in express logistics, such as distribution time delay, distribution path error and destination error. An express logistics system operating model is built to test the algorithm, and experimental results show that the proposed algorithm is effective.

Conflict of interests

The work is supported by the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1704097) and the Doctor Start-up Funding of Chongqing University of Posts and Telecommunications (A2016-05).

References

[1] Z.L. Wei, J.Q. Sun, Discussion on the city joint distribution model for small and medium business enterprise, J. Beijing Jiaotong Univ. (Social Science Edition) 14 (1) (2015).
[2] Yue-feng Ma, Shi-zhuo Bi, Hu-sheng Lu, Research on joint distribution of sinter and pellet in iron and steel enterprise based on transportation problem, Proceedings of 2011 IEEE 18th International Conference on Industrial Engineering and Engineering Management (IE&EM 2011) (2011) 952–9566.
[3] Y. Priziment, D. Malah, On joint distribution modeling in distributed video coding systems, Proceedings of 2010 IEEE 12th International Workshop on Multimedia Signal Processing (MMSP) (2010) 303–308.


[4] Juhel, H. Marc, The role of logistics in stimulating economic development, Paper Presented at the China Logistics Seminar, Beijing, 1999, pp. 25–37.
[5] G. Zach, T. John, Logistics salience in a changing environment, J. Bus. School 25 (1) (2004) 134–140.
[6] M. Gao, Scale boosts the rapid development of China's logistics industry, China Econ. Inf. 5 (2014) 6–57.
[7] X.K. Wang, F. Yang, D.Y. Yang, Development of collaborated distribution oriented to social benefits, Logistics Technol. 26 (3) (2007) 1–4.
[8] Koehler, City logistics in Kassel, City Logistics (1999) 261–271.
[9] W. Dong, Method of traffic dispatching of centralized logistics distribution under vehicle interference, Logistics Technol. 31 (2013) 5–318.
[10] J.F. Luo, Joint Distribution Pattern Analysis and Implementation Research, China Market Region&City (2007).
[11] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, San Francisco, 2006.
[12] J. Lee, J. Han, Trajectory outlier detection: a partition-and-detect framework, Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008, pp. 140–149.
[13] E.M. Knorr, R.T. Ng, Algorithms for mining distance-based outliers in large datasets, Proceedings of the 24th International Conference on Very Large Data Bases, New York City, New York, 1998, pp. 392–403.
[14] E.M. Knorr, R.T. Ng, Finding intensional knowledge of distance-based outliers, Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, 1999, pp. 211–222.
[15] E.M. Knorr, R.T. Ng, V. Tucakov, Distance-based outliers: algorithms and applications, VLDB J. 8 (3) (2000) 237–253.
[16] W. Jin, K.H. Tung, J. Han, Mining top-n local outliers in large databases, Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2001, pp. 293–298.
[17] W. Jin, K.H. Tung, J. Han, et al., Ranking outliers using symmetric neighborhood relationship, Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Heidelberg, Springer-Verlag, 2006, pp. 577–593.
[18] X. Li, J. Han, S. Kim, ROAM: Rule- and motif-based anomaly detection in massive moving object data sets, Proceedings of the 7th SIAM International Conference on Data Mining, Minneapolis, Minnesota, 2007.
[19] S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, 2000, pp. 427–438.

Yan Zhang was born in 1982 in Chongqing, China. In 2007, she received her master's degree from the Graduate School of Logistics, Inha University. She now works at Chongqing University of Post and Telecommunication. Her research interests include supply chain management, logistics system optimization and virtual environment modeling.