Reinforcement learning for adaptive order dispatching in the semiconductor industry

CIRP Annals - Manufacturing Technology (2018), https://doi.org/10.1016/j.cirp.2018.04.041
Nicole Stricker (a,*), Andreas Kuhnle (a), Roland Sturm (b), Simon Friess (b)

(a) wbk Institute of Production Science, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
(b) Infineon Technologies AG, Regensburg, Germany

* Corresponding author. E-mail address: [email protected] (N. Stricker).

Submitted by Hartmut Weule (1), Karlsruhe, Germany.

Keywords: Production planning; Artificial intelligence; Semiconductor industry

Abstract

The digitalization of production systems tends to provide a huge amount of data from heterogeneous sources. This is particularly true for the semiconductor industry, wherein real-time process monitoring is inherently required to achieve a high yield of good parts. An application of data-driven algorithms in production planning to enhance operational excellence for complex semiconductor production systems is currently missing. This paper shows the successful implementation of a reinforcement learning-based adaptive control system for order dispatching in the semiconductor industry. Furthermore, a performance comparison of the learning-based control system with the traditionally used rule-based system shows remarkable results. Since a strict rulebook does not bind the learning-based control system, a flexible adaptation to changes in the environment can be achieved through a combination of online and offline learning.
© 2018 Published by Elsevier Ltd on behalf of CIRP.

1. Introduction

Currently, a huge amount of data is gathered in the manufacturing sector, but only a small fraction is actually used [1], although great potential is expected. For example, a decrease of up to 25% in operating costs could be achieved [2]. The semiconductor industry in particular offers a good database for learning-based improvement [3]: products close to technological and even physical limits (low double-digit nanometre process architectures) cause high process complexity and dynamics (e.g. chamber "seasoning") [4] and require excellent yield management. Therefore, data-based in-process control (e.g. Run-to-Run control) is a prerequisite to ensure a high yield. The amount of data gathered herein builds a rich database.

To exploit the database's potential and realize a competitive production performance, an excellent production planning and control process is needed [5]. Especially in industries with expensive production equipment, a high utilization is crucial. Semiconductor fabrication relies on expensive tools and machines that fabricate wafers layer by layer. Hence, a large amount of capital is tied up, and production control tasks, e.g. order dispatching, are essential. In wafer fabrication the dispatching task is usually addressed using a heuristic rule-based approach [6].


Given the existing database, further enhanced data-driven approaches enable an increasing production performance. In this paper a reinforcement learning approach for adaptive order dispatching in wafer fabrication is proposed. The approach has been applied and evaluated in an industrial case study and benchmarked against the existing heuristic rule-based approach.

2. Fundamentals and literature

2.1. Specific requirements in the semiconductor industry

Manufacturing a semiconductor product starts with the wafer fabrication. From a manufacturing point of view, wafer fabrication is a complex job shop. With regular throughput times of about 30 days and up to 800 process steps, the manufacturing process includes many recurrent job flows, varying job processing times at the same machine as well as sequence and time restrictions between process steps (e.g. wafers contaminate quickly when not processed further) [7].

The assignment of orders is addressed by so-called dispatching. Dispatching is an optimization problem that aims to assign orders to resources and hence also determines the sequence and schedule of orders. It directly influences the objectives utilization, lead time and work-in-process (WIP). Given the challenges of wafer fabrication, adaptive dispatching becomes crucial: based on real-time process and product data, the dispatching decision can be adapted in order to optimally match the current manufacturing situation.


2.2. Rule-based heuristics for dispatching

Heuristics are broadly applied to solve real-world optimization problems when computation time is limited and the problem size (e.g. a huge number of orders and resources) as well as its complexity are incompatible with mathematical programming approaches (in the domain of NP-hard problems). In the semiconductor industry, rule-based heuristic approaches are widely applied [3]. Common dispatching rules are, for example [6]: first in first out (FIFO), shortest remaining processing time (SRPT), earliest due date (EDD) or critical ratio (CR, to be defined individually for the use case). Freitag and Hildebrandt also develop scheduling rules for the semiconductor industry by means of simulation-based multi-objective optimization [8]. However, it is well known that no rule outperforms all others under every objective, and multiple rules hold the risk of contradicting each other [9].
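To make these rules concrete, the following sketch expresses FIFO, SRPT, EDD and a simple critical-ratio rule as priority functions over a hypothetical order record; the field names and the dispatch_next helper are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Order:
    """Illustrative order record; the field names are assumptions, not from the paper."""
    arrival_time: float          # time the order entered the queue
    remaining_proc_time: float   # total remaining processing time
    due_date: float              # promised completion date

# Each rule maps an order to a priority key; the order with the smallest key goes next.
def fifo(order, now):
    return order.arrival_time                     # first in, first out

def srpt(order, now):
    return order.remaining_proc_time              # shortest remaining processing time

def edd(order, now):
    return order.due_date                         # earliest due date

def critical_ratio(order, now):
    # CR = remaining time until due date / remaining work; values < 1 flag critical orders
    return (order.due_date - now) / max(order.remaining_proc_time, 1e-9)

def dispatch_next(queue, rule, now):
    """Pick the next order from the queue according to the chosen dispatching rule."""
    return min(queue, key=lambda order: rule(order, now))
```

Switching rules then only means passing a different priority function, which also illustrates why several rules applied to the same queue can rank the orders contradictorily.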

2.3. Machine learning in production planning and control

Machine learning (ML) algorithms have recently gained importance. ML algorithms are considered as algorithms that are not programmed explicitly with an exact deterministic procedure. Moreover, ML algorithms are called data-driven approaches because the input training data affects the performance and procedure to a large extent. There are various industrial applications where ML algorithms are applied with promising results [10]: a distributed agent-based control architecture is introduced in [11]; it shows a successful implementation of ML for a general job-shop scheduling problem. Günther et al. present a recent application of machine learning in the industrial application of laser welding [12]. A deep learning-based approach is implemented and verified by Wang et al. for the prediction and control of product quality in a mechanical process [13].

However, ML is not a favourable method for all industrial problems. The following properties are deemed advantageous for ML algorithms [14]: (i) applications with a limited scope in terms of the dimensions of states and actions (the learning period depends on these dimensions), (ii) fast, responsive real-time decision systems (computing the output of an ML algorithm requires just linear operations), (iii) "cheap" training data (the trial-and-error approach is intensively data-driven) and (iv) complex environments that can hardly be described in detail (ability to generalize).

Reinforcement learning (RL) is one subcategory of ML algorithms next to supervised and unsupervised learning. These are distinguished by the kind of feedback, the representation of learned knowledge and the availability of prior knowledge. RL is characterized by a sequential decision-making process, i.e. a constant feedback for the learning agent actively interacting with the environment without a supervisor. The feedback signal, which might be delayed relative to the agent's action, is translated into a reward. By the reward the agent learns a certain behaviour strategy which maximises its total expected reward. In doing so, the acting agent is commonly modelled as a Markov decision process (MDP) with a comprehensive state representation, action set and reward function. Thus, only a limited amount of prior knowledge is needed to apply RL [15].

The above-mentioned examples demonstrate the wide range and successful application of ML algorithms in production engineering as well as manufacturing. Based on this research, an adaptive order dispatching for the use case in the semiconductor industry is considered in this paper. The optimal dispatching decision is largely dependent on the current system state. Hence, learning optimal state-action pairs, i.e. a behaviour strategy, is a promising approach. RL is particularly suitable as it is real-time capable and adaptive.

3. Methodology

The methodology presents the design of a novel data-driven RL algorithm for adaptive order dispatching. The core objective is to enhance the operational performance of complex semiconductor production systems. The algorithm is embedded into a simulation-based learning framework illustrated in Fig. 1. Starting with the respective production system, a simulation model is implemented. Secondly, the learning algorithm is linked to the simulation model and the so-called offline training is performed. Herein the RL algorithm iteratively learns the desired behaviour strategy. Finally, the result of the simulation-based learning model corresponds to the production control algorithm for the real production system. The algorithm continues to dynamically optimize its performance online when applied in the real world, e.g. with respect to changes in the production system such as varying product mixes, demand changes, etc.

Fig. 1. Overview of simulation-based learning framework.
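As a minimal sketch of the offline-training loop in Fig. 1, the agent could interact with the simulation model through a gym-style interface; the method names (reset, step, select_action, update) are assumptions for illustration and not the authors' implementation.

```python
def offline_training(sim_model, agent, n_iterations=1_000_000):
    """Offline phase: the agent is trained iteratively against the simulation model.

    `sim_model` and `agent` are assumed to expose a gym-like interface; the
    interface of the real implementation is not published, so this is a sketch.
    """
    state = sim_model.reset()
    for _ in range(n_iterations):
        action = agent.select_action(state)              # exploration vs. exploitation
        next_state, reward = sim_model.step(action)      # simulation executes the action
        agent.update(state, action, reward, next_state)  # RL update, e.g. Q-learning
        state = next_state
    return agent   # learned behaviour strategy, later deployed and refined online
```

The same loop structure carries over to the online phase, where the deployed agent keeps updating its strategy on data from the real production system.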

3.1. Simulation model of the production system

The simulation model is implemented in Python using the process-based discrete-event simulation framework SimPy. This framework was chosen for its extensive modelling capabilities and high computational efficiency. The simulation model is built in a job-focused and decentralized logic, which means that the jobs individually demand machines or transportation resources for processing. The related decisions, e.g. which order to process next, are made by learning agents. Their decisions are subsequently executed in the simulation model as actions. The simulation model evaluates the actions of the learning algorithm and computes the reward for the actions of the individual agents. Thereby, the simulation model iteratively creates the needed training data.

3.2. Reinforcement learning algorithm setup

Multiple agents represent different types of production control decisions, such as the following: Which order should be dispatched next? Which machine shall the order be assigned to? For each decision an RL algorithm is designed. The RL algorithm translates the decision into possible actions a_t that can be taken. The action is conditional on the current state s_t that is observable for the agent. The state is represented as a vector with multiple entries that contain all information relevant to the decision (e.g. number of orders, waiting time, machine state). This information is available to the individual agents within the simulation model. The agent's action is executed in the simulation model and thus affects the transition to the subsequent state s_{t+1}. The Markov property that is commonly known for fully observable MDPs also applies to RL: the probability p(s_{t+1}|s_t) to transfer to state s_{t+1} is only conditional on state s_t. Additionally, the simulation model is used to derive a positive or negative reward, which is used to update the value function v_π(s_t) [15]:

$v_\pi(s_t) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_t = s_t \right]$     (1)

Herein π denotes the behaviour strategy, γ the discount factor (0 ≤ γ ≤ 1) for future rewards R_{t+k+1} (γ → 1: far-sighted; γ → 0: myopic) and k the number of considered future interactions. The value function thus describes how "desirable" each state is in terms of the expected reward of possible future actions. In this way the algorithm is able to iteratively learn the optimal behaviour strategy π* via trial and error.
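A minimal SimPy sketch of the job-focused, decentralized logic described in Section 3.1 might look as follows; the agent interface (select_action, observe_reward), the distributions and all parameter values are illustrative assumptions, not the authors' model.

```python
import random
import simpy

def job(env, machines, agent):
    """A job individually demands a machine; the agent decides which one."""
    arrival = env.now
    state = (len(machines[0].queue), len(machines[1].queue), env.now)  # simplified state
    action = agent.select_action(state)                # index of the machine to request
    with machines[action].request() as request:
        yield request                                  # wait until the chosen machine is free
        yield env.timeout(random.expovariate(1 / 10.0))     # stochastic processing time
    agent.observe_reward(state, action, reward=-(env.now - arrival))  # negative lead time

def order_source(env, machines, agent):
    """Generate jobs with stochastic inter-arrival times."""
    while True:
        yield env.timeout(random.expovariate(1 / 5.0))
        env.process(job(env, machines, agent))

# env = simpy.Environment()
# machines = [simpy.Resource(env, capacity=1) for _ in range(2)]
# env.process(order_source(env, machines, agent))
# env.run(until=10_000)
```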


3.3. Reinforcement learning algorithm based on Q-learning

To implement the RL algorithm specifically for the given dispatching problem, the previously presented general algorithm setup and functional principle need to be adapted. Primarily, the delayed reward signal for an action is one elementary property that needs to be considered in production control applications; in other words, the agent's actions affect the system state transition, but the overall performance indicators, such as system utilization, are not immediately influenced. Therefore, further algorithm extensions are needed to tune the performance of the RL algorithm: the value function (1) is approximated with a Temporal Difference (TD) learning approach that handles delayed reward signals. Following the TD approach, the value function v_π(s_t) is, broadly speaking, updated for multiple successive states. Within the class of TD algorithms, Q-learning is one common algorithm. It is based on the action-value function q_π(s_t, a_t), being the expected return with respect to strategy π, state s_t and action a_t. The prevalence of Q-learning is also due to the Bellman optimality equation, which states that v_π and q_π are related for the optimal strategy: v_π*(s_t) = max_a q_π*(s_t, a_t) [16]. However, it is noteworthy that Q-learning is a so-called off-policy learning algorithm, i.e. no strategy π is considered when learning optimal action-values q(s_t, a_t). The update is defined as follows:

$q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_a q(s_{t+1}, a) - q(s_t, a_t) \right]$     (2)

Herein α is the learning rate (0 ≤ α ≤ 1) that regulates the extent of the value update and needs to be determined specifically for the application. Obviously, the update is independent of any policy. However, the exploitation property, which is usually guaranteed by a policy, is ensured because the update considers that the "greedy" action max_a q(s_{t+1}, a) is taken. So, whenever there is an explorative action, the update is still related to the best-known (greedy) action.

Since the action-values of Q-learning converge to the optimal strategy [16], the main challenge is to speed up convergence. Therefore, the Q-learning algorithm presented here is enhanced with another TD technique: Eligibility Traces (ET) are utilized to more distinctly assign the reward for the changes in performance indicators to the right action. ET combine two plausible heuristics [15]: (i) assign the credit to the most frequent action, (ii) assign it to the most recent action.

In order to avoid both local minima and bad decisions, exploration and exploitation are needed. After initialisation (random initial values), decisions on how to act next are required. With probability ε an explorative action, i.e. a valid random action, is taken, and with probability 1 − ε an exploitative action. The exploitation corresponds to the "best-learned" action, conditional on state s_t.

In addition to the RL algorithm as the core of the learning system, a mechanism is required to efficiently represent the state-action-value function for each agent, as the number of states and actions might be huge for real-world problems in production control. Artificial neural networks (ANN) are widely applied to act as function approximator (for a detailed description of ANN see Ref. [14]). The number of input neurons is equal to the number of variables that describe the current state s_t of the production system (binary, integer and continuous variables) and the number of output neurons corresponds to the set of actions. In order to enhance the generalisation performance, a drop-out rate is implemented that randomly deactivates neurons during the learning phase (here the drop-out rate is set to 0.5). Table 1 summarizes all relevant parameters and their (initial) values. The learning rate as well as the exploration probability decrease constantly after each iteration to ensure that the algorithm converges.

Table 1. Parameters of the algorithm and simulation.

Parameter                                                   (Initial) value
Learning rate α (decrease rate per iteration)               1.0 (10^-6)
Exploration probability ε (decrease rate per iteration)     0.9 (10^-5)
Discount factor γ                                           0.25
Eligibility traces decay                                    0.7
Number of stored experiences                                1000
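To make the update rule concrete, the following simplified sketch combines the Q-learning update of Eq. (2) with ε-greedy exploration and eligibility traces, using the (initial) values from Table 1. Note that the paper approximates q(s, a) with an ANN and stores past experiences, whereas a plain dictionary is used here purely for readability.

```python
import random
from collections import defaultdict

class QLambdaAgent:
    """Simplified tabular sketch of Q-learning with eligibility traces and epsilon-greedy exploration."""

    def __init__(self, actions, alpha=1.0, epsilon=0.9, gamma=0.25, trace_decay=0.7,
                 alpha_decrease=1e-6, epsilon_decrease=1e-5):   # (initial) values from Table 1
        self.actions = list(actions)
        self.alpha, self.epsilon = alpha, epsilon
        self.gamma, self.trace_decay = gamma, trace_decay
        self.alpha_decrease, self.epsilon_decrease = alpha_decrease, epsilon_decrease
        self.q = defaultdict(float)   # q[(state, action)]; the paper uses an ANN instead
        self.e = defaultdict(float)   # eligibility trace per state-action pair

    def select_action(self, state):
        if random.random() < self.epsilon:                            # explore: valid random action
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])    # exploit: greedy action

    def update(self, state, action, reward, next_state):
        # TD error of Eq. (2): R_{t+1} + gamma * max_a q(s_{t+1}, a) - q(s_t, a_t)
        greedy_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * greedy_next - self.q[(state, action)]
        self.e[(state, action)] += 1.0            # credit the most recent/frequent action
        for key in list(self.e):
            self.q[key] += self.alpha * td_error * self.e[key]
            self.e[key] *= self.gamma * self.trace_decay              # decay the traces
        # anneal learning rate and exploration probability per iteration
        self.alpha = max(self.alpha - self.alpha_decrease, 0.0)
        self.epsilon = max(self.epsilon - self.epsilon_decrease, 0.0)
```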

4. Case study

In wafer fabrication, adaptive order dispatching is considered one of the elementary tasks of production control. Fig. 2 depicts the layout of the considered fabrication area. There are three adjoining areas, each consisting of one bidirectional lift managing the in- and out-flow of batches as well as multiple pieces of manufacturing equipment (EQ). Batches are dispatched and transported to a designated EQ (stochastic target EQ) and, after processing (stochastic processing time), they exit the system again via the lift. There are only two places where batches can be stored intermediately: in front of each EQ there is a limited number of ports, and accumulation is allowed within the lift. Each EQ is disturbed stochastically from time to time, based on a Poisson distribution of the mean time between failures and the mean time to repair. Production is performed in three shifts every day.

Fig. 2. Layout of fabrication area.

Due to multiple stochastic influences, such as volatile processing times, changing product variants and failing manufacturing resources, and due to the limited number of resources (just a single worker), the system demands a highly adaptive order dispatching. This need for adaptivity and ability to change is amplified by a fast-changing industry and short product life cycles. The digitized manufacturing system makes the system state information transparent in real time, e.g. the location of batches, tool state information and remaining processing time.

The agent considers just the information that is relevant for its optimal behaviour. These are the following 32 state variables: (i) EQ: number of waiting (unprocessed) batches and number of finished batches, (ii) lift: targets of the next up to five incoming orders at the lift, (iii) current position of the worker. There are eight EQs, three lifts and one worker.

There are twelve possible actions for the agent. Standing at a certain location (EQ or lift), the agent can either dispatch an available batch to one of the eight EQs, bring it back to a lift or change its location by moving empty-handed. Additionally, there is the possibility to wait in case there is no order to be dispatched. Moreover, it might be beneficial to wait voluntarily, knowing that a batch will be available at this location soon.

The learning phase of the RL agent is essentially defined by the reward for the agent's interaction with the environment, which is derived from expert knowledge. Maximizing the utilization of all EQs and simultaneously reducing the lead time are the two major objectives in the considered case study. The reward signal of the agent is aligned to achieve these objectives in the long run and is calculated after each action. Unfeasible actions are not rewarded, a movement without an order is neutral (zero reward), dispatching an order to an EQ receives the value 0.2 and, finally, exiting an order by taking it to a lift is rewarded based on the linear combination of two equally weighted variables: the order lead time and the average utilization of the production system since the last utilization computation.

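To illustrate the state and reward design, the sketch below assembles the 32-entry state vector and assigns the reward values named above (zero for empty moves and waiting, 0.2 for dispatching, an equally weighted lead-time/utilization term for exiting an order). The data layout, the padding value and the normalisation of the lead-time term are assumptions made for illustration, not taken from the paper.

```python
def build_state(eqs, lifts, worker_position):
    """32 entries: 8 EQs x 2 counters + 3 lifts x up to 5 incoming targets + worker position."""
    state = []
    for eq in eqs:                                     # 8 EQs -> 16 entries
        state.append(eq["waiting_batches"])            # unprocessed batches at the EQ
        state.append(eq["finished_batches"])           # finished batches waiting for pick-up
    for lift in lifts:                                 # 3 lifts -> 15 entries
        targets = lift["incoming_targets"][:5]         # target EQ of the next up to 5 orders
        state.extend(targets + [-1] * (5 - len(targets)))   # pad missing entries (assumption)
    state.append(worker_position)                      # 1 entry -> 32 in total
    return state

def reward(action_type, lead_time=None, utilization=None, reference_lead_time=240.0):
    """Reward per action; the lead-time normalisation is an assumption, not from the paper."""
    if action_type in ("move_empty", "wait", "unfeasible"):
        return 0.0                                     # neutral / not rewarded
    if action_type == "dispatch_to_eq":
        return 0.2
    if action_type == "exit_via_lift":
        lead_term = max(0.0, 1.0 - lead_time / reference_lead_time)  # shorter lead time, higher reward
        return 0.5 * lead_term + 0.5 * utilization     # equally weighted combination
    return 0.0
```

This state vector is what the 32-input ANN described in the following paragraph receives.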

The ANN is implemented as the function approximator. A densely (fully) connected feed-forward network is chosen, with an input layer representing the 32 state vector variables, two hidden layers with 64 neurons each and an output layer covering all possible actions. The hyperbolic tangent (tanh) is implemented as the activation function.

5. Computational results

The computational results were obtained on an Intel Core i7-6700 64-bit machine with 3.40 GHz and 16 GB RAM. The total time to execute 1 million actions was 2.6 days. The RL algorithm improves its performance over time, showing that it can be applied as an adaptive order dispatching control that continuously learns the optimal behaviour. Fig. 3 shows the development of the reward, starting from the initial state where the agent's behaviour is completely random. The agent successfully learns a high-performance behaviour without losing the desired adaptivity. During the first learning phase (around 200,000 actions) the reward rises significantly and continuously. This phase is also primarily characterised by ruling out unfeasible actions; the rate of unfeasible actions drops below 2% in this early phase. Afterwards, the reward stabilizes at a high level. The outliers that remain indicate that the agent is still adaptive enough to react to changing conditions of the production system (e.g. disturbances, demand fluctuations).

The conventional heuristic approach is based on a set of if-then rules, e.g. "take the longest waiting batch next" or "first dispatch all batches in one area and move to another area afterwards" (to minimize time-consuming area changes). According to Fig. 4, the learning-based algorithm for order dispatching yields a superior performance. During the first tuning phase the utilization drops quickly to a bottom value of 82%.

Fig. 3. Learning curve of agent’s reward over time.

Fig. 4. Comparison of heuristic and learning-based performance.

Afterwards, the performance continuously increases, which matches the prior results shown for the agent's reward. A system utilization of 90% and a lead time of 118 min are achieved after 1 million actions, compared to 82% and 125 min for the heuristic rule-based dispatching. Moreover, the rule-based results show an almost constant performance that is not able to adapt to changing conditions.

6. Conclusion and outlook

This research has shown that the application of data-driven algorithms can enhance the operational efficiency of production control systems. An RL-based adaptive order dispatching algorithm can outperform existing rule-based heuristic approaches. Complex production systems, such as the considered case study in the semiconductor industry, are particularly hard, and at the same time crucial, to run at a high performance level. This work brings the application of learning algorithms and the transition towards autonomous production systems one step closer to reality. However, the limitations of machine learning algorithms still prevail, e.g. in terms of reproducibility and solution robustness. Further research in the area of modelling and designing learning algorithms is needed to achieve broad application in other areas of production control as well, such as employee allocation and capacity control. Furthermore, research on cooperative multi-agent systems is required to broaden the scope of applications.

Acknowledgement

We extend our sincere thanks to the German Federal Ministry of Education and Research (BMBF) for supporting the research project 02P14B161 "Empowerment and Implementation Strategies for Industry 4.0".

References

[1] Oliff H, Liu Y (2017) Towards Industry 4.0 Utilizing Data-Mining Techniques: A Case Study on Quality Improvement. Procedia CIRP 63:167–172.
[2] Henke N, Bughin J, Chui M, Manyika J, Saleh T, Wiseman B, Sethupathy G (2016) The Age of Analytics: Competing in a Data-Driven World, McKinsey Global Institute.
[3] Waschneck B, Altenmüller T, Bauernhansl T, Kyek A (2016) Production Scheduling in Complex Job Shops from an Industry 4.0 Perspective: A Review and Challenges in the Semiconductor Industry. CEUR Workshop Proceedings 1793:12–24.
[4] Moyne J, Iskandar J (2017) Big Data Analytics for Smart Manufacturing: Case Studies in Semiconductor Manufacturing. Processes 5(3):39–59.
[5] Schuh G, Reuter C, Prote JP, Brambring F, Ays J (2017) Increasing Data Integrity for Improving Decision Making in Production Planning and Control. CIRP Annals – Manufacturing Technology 66(1):425–428.
[6] Fordyce K, Milne RJ, Wang C-T, Zisgen H (2015) Modeling and Integration of Planning, Scheduling, and Equipment Configuration in Semiconductor Manufacturing Part I. Review of Successes and Opportunities. International Journal of Industrial Engineering: Theory Applications and Practice 22(5):575–600.
[7] Mönch L, Fowler JW, Mason SJ (2013) Production Planning and Control for Semiconductor Wafer Fabrication Facilities: Modeling, Analysis, and System, Springer.
[8] Freitag M, Hildebrandt T (2016) Automatic Design of Scheduling Rules for Complex Manufacturing Systems by Multi-Objective Simulation-Based Optimization. CIRP Annals – Manufacturing Technology 65(1):433–436.
[9] Uzsoy R, Church LK, Ovacik IM, Hinchman J (1993) Performance Evaluation of Dispatching Rules for Semiconductor Testing Operations. Journal of Electronics Manufacturing 3(2):95–105.
[10] Monostori L, Váncza J, Kumara S (2006) Agent-Based Systems for Manufacturing. CIRP Annals – Manufacturing Technology 55(2):697–720.
[11] Monostori L, Csáji BC, Kádár B (2004) Adaptation and Learning in Distributed Production Control. CIRP Annals – Manufacturing Technology 53(1):349–352.
[12] Günther J, Pilarski PM, Helfrich G, Shen H, Diepold K (2016) Intelligent Laser Welding Through Representation, Prediction, and Control Learning: An Architecture with Deep Neural Networks and Reinforcement Learning. Mechatronics 34:1–11.
[13] Wang P, Gao RX, Yan R (2017) A Deep Learning-Based Approach to Material Removal Rate Prediction in Polishing. CIRP Annals – Manufacturing Technology 66(1):429–432.
[14] Russell S, Norvig P (2009) Artificial Intelligence: A Modern Approach, Prentice Hall Press.
[15] Sutton RS, Barto AG (2012) Reinforcement Learning: An Introduction, MIT Press.
[16] Tsitsiklis JN (1994) Asynchronous Stochastic Approximation and Q-Learning. Machine Learning 16(3):185–202.
