A reinforcement learning methodology for a human resource planning problem considering knowledge-based promotion


Amir-Mohsen Karimi-Majd, Masoud Mahootchi (corresponding author), Amir Zakery
Amirkabir University of Technology, Industrial Engineering and Management Systems Department, 424 Hafez Ave., Tehran 15875-4413, Iran
Simulation Modelling Practice and Theory (2015), http://dx.doi.org/10.1016/j.simpat.2015.07.004

Article info

Article history: Received 25 January 2015; Received in revised form 23 June 2015; Accepted 16 July 2015; Available online xxxx

Keywords: Reinforcement learning; Production-inventory control; Human resource planning; Stochastic dynamic programming; Knowledge-intensive

Abstract

This paper addresses a combined problem of human resource planning (HRP) and production-inventory control for a high-tech industry in which the human resource plays a critical role. The main characteristics of this resource are the levels of "knowledge" and the learning process. Learning occurs during the production process, through which a worker can be promoted to a higher knowledge level; workers at higher levels are more productive. The objective is to maximize the expected profit by deciding on the optimal numbers of workers at the various knowledge levels to fulfill both the production and the training requirements. Since an action taken now affects the decisions of later periods, the main problem is to find the optimal hiring policy for non-skilled workers over a long time horizon. We therefore develop a reinforcement learning (RL) model to obtain the optimal decision for hiring workers under demand uncertainty. The interval-based policy of our RL model, in which every state admits multiple choices, makes it more flexible. We also embed managerial issues such as layoffs and overtime working hours into the model. To evaluate the proposed methodology, stochastic dynamic programming (SDP) and a conservative method implemented in a real case study are used. We study all these methods in terms of four criteria: average obtained profit, average obtained cost, the number of new-hired workers, and the standard deviation of the hiring policies. The numerical results confirm that our method ends up with satisfactory results compared to the two other approaches.

1. Introduction

The main goal of a production plant or a service supplier, in developed and developing countries alike, is to gain a larger share of domestic or foreign markets, especially where several competitors exist. To reach this goal, a company should efficiently utilize all of its resources, such as its workforce and facilities, so that it can meet the required level of customer satisfaction. As this level usually changes with advances in technology, all operations need to be based on up-to-date knowledge; such operations are called knowledge-intensive operations. Among the resources employed in knowledge-intensive operations, the human resource is the most critical, because people and their knowledge are the most strategic resource of a firm [2]. One of the main issues in human resource planning (HRP) is staffing and recruitment decision-making, that is, providing enough qualified manpower to produce high-quality products or deliver superior services.

E-mail addresses: [email protected], [email protected] (M. Mahootchi).



Recruitment is usually a mid-term or even long-term decision that can strongly affect the near future of the company and its success. Furthermore, human resources, as a strategic and valuable asset, possess the knowledge and skills that are necessary to move a company toward its predefined goals. In other words, one important aspect of HRP is to determine the required number of workers at the different knowledge levels (e.g., new-hired, semi-skilled, and skilled workers) to be utilized in the various parts of a company's production process; this is, in fact, a way of improving the utilization of knowledge resources toward better efficiency. Few quantitative approaches have been employed to cope with staffing problems for knowledge-intensive operations. One of the pioneering works on human resource planning in a knowledge-based setting was proposed by Bordoloi and Matsuo [7]. They proposed a model for determining the number of workers needed at each knowledge level, and they embedded the employees' learning and turnover rates into their optimization model to find better recruitment decisions when demand is non-deterministic. Learning occurs during the production process, through which a worker can be transferred from a lower knowledge level to the next one (e.g., from the first level to the second) after some periods. Turnover is defined as the rate at which a company loses its workers at each knowledge level (semi-skilled or skilled) at the end of each period [20]. When a company loses skilled workers at the upper levels, this loss cannot be compensated directly; the company can only satisfy demand by recruiting workers at the first level (new-hired workers). They used the chance-constraint method to tackle the high uncertainty of demand and the high volatility of the knowledge workers of the last two levels. Their method, however, cannot address the production-inventory control problem. It also yields a static hiring policy (the hiring rate is constant over all periods of the real-time decision-making process), which is very conservative (the policy is obtained for a pessimistic situation). Furthermore, layoffs are not considered in their model. Given that demand is stochastic and unknown at the time of decision making, there is a possibility of not satisfying it (called slack or shortfall hereafter), which is assumed to be a lost sale in this paper. Another option is to build a physical buffer that stores the remaining goods for situations in which demand exceeds the production level, so that the extra demand can be met from stock; it might also be possible to compensate a stock-out through an overtime working shift with the existing workers. By adding these managerial issues (overtime working hours, slack/shortfall, surplus, and layoff) to the mathematical optimization model, the planning model based on knowledge-intensive workers becomes more compatible with reality, and the final hiring policy becomes more useful for managers and more beneficial for the company. To address the aforementioned issues, this paper makes three main contributions. First, it proposes a new optimization model in which the inventory level is also taken into consideration; this makes the model more realistic and leads to better decisions.
Second, in order to solve this model efficiently, we develop a reinforcement learning (RL) method. Third, to obtain a more applicable decision policy, we derive optimal interval decisions, instead of single decisions, for every state using a modified version of the value iteration technique, a well-known approach in stochastic dynamic programming (SDP); this makes the optimization model more flexible because it gives the decision maker multiple choices. It is worth mentioning that all available information about the demand is used to find the optimal hiring policy, whereas the chance-constraints approach (the basic model proposed by Bordoloi and Matsuo [7]) uses only the mean and the standard deviation of the stochastic demand. We refer to their model in more detail in the rest of the paper and compare the results of our two models and their model using data obtained from a semiconductor equipment manufacturing company. This paper is organized as follows: Section 2 reviews the related literature. Section 3 gives basic definitions of SDP and RL. In Section 4, we describe the production-inventory control problem and our proposed methods. Sections 5 and 6 present the numerical results and the conclusion, respectively.

2. Related work

Workforce planning determines the required level of workers by which the strategic goals of an organization or company can be achieved. Bulla and Scott [9] defined it as a process in which the human resource requirements of an organization are identified and efficient plans for satisfying those requirements are designed. Khoong [27] specified manpower planning as the core of HRP, supported by its other aspects. Different mathematical approaches have been used in HRP (for a review, interested readers can see [5,17]). These approaches can be broadly categorized into three groups: optimization, Markovian, and computer simulation models. Most research in the area of HRP is devoted to staffing or recruitment decisions, an important and popular area of research [2]. Although all of the mentioned categories have been applied to staffing, the main focus here is on building a suitable stochastic optimization model for an uncertain environment. Demand is mostly considered a stochastic parameter in staffing problems; it can be product demand [1,7,8,21], workforce demand [14,28,29,32,13,34], or service demand, for example the volume of call arrivals in a call center [3,4,19]. In some studies, staff knowledge is the critical resource under focus, either generally, from a knowledge elicitation and knowledge management point of view [18], or particularly in the form of different knowledge levels of workers.


Entering the knowledge level into the optimization model makes it even more complex. For example, Zhu and Sherali [34] addressed a multi-category workforce planning problem for distributed units, each facing an uncertain workforce demand; to embed the demand uncertainty into the optimization model, they proposed a two-stage stochastic programming technique. Moreover, Chattopadhyay and Gupta [11] considered varying class sizes with promotion between classes and proposed a manpower planning model using the probability distribution of the time a person remains in a particular grade; here, promotion is based on seniority (the number of years of service), not on the amount of learning. The use of contingent labor (labor with different contracts such as temporary or on-call) has been formulated as a stochastic optimization problem with demand uncertainty in which the expected value of labor and backlog costs is minimized [31]. The categorization of workers in these papers rests on two main issues, jointly denoted as "knowledge" in the literature: (1) learning and (2) experienced skills. Learning is a necessary part of a skillful and knowledgeable worker's job; it directly affects the employee's productivity [15] and also the long-term performance of the organization, by increasing robustness against the potential departure of key personnel [10]. In today's knowledge-based business, managers need a staffing policy that ensures a sufficient number of workers at each knowledge level to fulfill the production plan under uncertainty. Many HRP papers focus directly or indirectly on human learning aspects (e.g., see [16,25,30]), for instance through planning for major professional and knowledge-intensive companies in [12,26]. However, these studies were carried out in a deterministic setting; only a few papers in the staffing area consider uncertainty in the decision-making process. On the other hand, considering workers' promotion between knowledge levels gives the HRP problem practical strength and theoretical justification. Promotion means that there are certain knowledge levels in a company and that workers may be promoted to upper levels according to a predefined mechanism. Knowledge workers at upper levels are usually more productive in the production process or more capable of doing hard and complex jobs. Only a few papers have embedded both the learning process and the promotion between knowledge levels in an optimization model in a stochastic environment. Fragnière et al. [21] proposed extensions to a popular aggregate planning model using multistage stochastic programming in which the demand and the production capacity obtained by qualified and non-qualified employees were stochastic variables. Gans and Zhou [22] addressed employee learning and turnover in a staffing problem to meet forecasted service requirements; they considered learning lead-times, stochastic turnover, and learning rates to find the optimal hiring policy. Georgiou and Tsantas [23] suggested a non-homogeneous Markov chain model to optimize the operations of both training new-hired workers and improving the knowledge of existing ones. The main focus of these papers was learning, and the knowledge levels and the promotion between them were not clearly specified.
As previously mentioned, a superior optimization model for workers' learning and promotion, with consideration of turnover rates and stochastic demand, was presented by Bordoloi and Matsuo [7]. They addressed the important issue of how to manage a group of workers at different knowledge levels when product demand is stochastic. With demand fluctuation there might be some surplus or shortage at the end of each period; this fact is not considered in their chance-constrained optimization model. In this paper, we apply RL and SDP models to solve the combined problem of production-inventory control and HRP. The optimal policy obtained with SDP is dynamic, which is clearly superior to the static policy obtained with their chance-constraint-based model [24]. Moreover, we embed the inventory level as an important concern into the optimization model, in addition to other issues including workers' learning, turnover, promotion, and demand stochasticity. It is demonstrated that the proposed method, with these new considerations, makes the staffing policy more efficient and applicable in real-life applications. To evaluate the performance of the proposed method, some numerical examples are presented in the numerical results section.

3. Stochastic dynamic programming and reinforcement learning

Stochastic dynamic programming is a well-known optimization methodology that can cope with uncertain situations by breaking complex problems (e.g., non-linear, non-convex, and non-continuous) into easier sub-problems [6]. However, it suffers from the twin curses of dimensionality and modeling for large-scale applications. Different SDP techniques have been developed; the value iteration (VI) approach is widely used in practical problems because it can approximate the true value functions with less computational effort and fewer iterations. Before starting the iterative VI procedure, all admissible actions for each state should be determined as a preprocessing step. Then, in each iteration, each of these actions is taken, and the respective reward (r_i) and the next possible state are calculated for each action. The value functions (V_i) are then updated using the transition probabilities (P_{ij}) as follows:

V_i = r_i + \beta \sum_j P_{ij}(x) V_j        (1)

where \beta is the discount factor. Once the VI approach has converged, the optimal policy can be derived from the following formula (for details on how and why the optimal policy can be derived from this equation, please refer to [33]):


a_i = \arg\max_a \left( r_i + \beta \sum_j P_{ij}(a) V_j \right)        (2)
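For illustration, a minimal value-iteration sketch corresponding to Eqs. (1) and (2) is given below in Python. The array layout (P[a, i, j] for transition probabilities, r[i, a] for expected rewards), the discount factor, and the stopping tolerance are our own assumptions and not part of the original model.

import numpy as np

def value_iteration(P, r, beta=0.95, tol=1e-6):
    """Sketch of Eqs. (1)-(2) for a generic finite MDP.

    P[a, i, j] : probability of moving from state i to state j under action a
    r[i, a]    : expected immediate reward of taking action a in state i
    Returns the converged value function and a greedy policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Q[i, a] = r[i, a] + beta * sum_j P[a, i, j] * V[j]   (Eq. (1), one entry per action)
        Q = r + beta * np.einsum("aij,j->ia", P, V)
        V_new = Q.max(axis=1)                # best achievable value per state
        if np.max(np.abs(V_new - V)) < tol:  # stop when the update becomes negligible
            return V_new, Q.argmax(axis=1)   # greedy policy, as in Eq. (2)
        V = V_new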

Although VI can reach the optimal policy and value functions, it can be computationally expensive for real-life applications. Reinforcement learning (RL), as a simulation-based methodology, is an efficient alternative that can considerably reduce the computational time and lead to a near-optimal solution [24]. Q-learning, a well-known and widely used RL technique, alleviates the twin-curse issue in an appropriate way. In this technique, instead of updating the value function, which depends only on the current state, a Q-factor, which depends on both the current state and the action taken, is updated. In other words, Q-learning only deals with the states observed through offline or online simulation, not with all possible states, and it is therefore more computationally tractable. The Q-factor update is as follows:

Q(i, a) \leftarrow Q(i, a) + \alpha \left( r_{it} + \beta \max_{b \in A(j)} Q(j, b) - Q(i, a) \right)        (3)

where \alpha is the learning factor. Different strategies, such as greedy, \epsilon-greedy, and softmax, can be chosen as the action-taking policy while updating the Q-factors.

4. Methodology

4.1. Optimization model

Bordoloi and Matsuo [7] addressed a linear program to model an assembly line in which the operations of the front-end stage, called stage A, must be completed prior to the operations of the back-end stage, called stage B. Workers at the first knowledge level (new-hired) and at the second knowledge level (semi-experienced) perform the operations in stages A and B, respectively. Workers at the third knowledge level (fully experienced) are assigned to the production stages to train the lower-level workers (i.e., new-hired and semi-experienced workers); they can also participate in the production process if they have free time. Fig. 1 illustrates both production stages and the possible assignment of workers at different knowledge levels to the operations of each stage. Moreover, all workers in the first two levels should be trained during the production process in order to move to a higher level at the end of the production period. Note that the only possibility is to recruit people at the lowest level; workers at the other two levels cannot be hired directly. In other words, to meet stochastic demand, the production level can be controlled only by determining the number of new-hired workers for each period, who are then trained during the production process within a specified time frame. The model makes the following assumptions:

1. No worker is allowed to work in both stages A and B in a given period.
2. All predefined parameter values remain unchanged within the planning horizon.
3. The probability distribution of demand is known.

Bordoloi and Matsuo's [7] objective is to minimize the total worker-related costs. In their optimization model, the production levels at the different production stages should be determined such that the total demand is met with a desired level of reliability. They used chance constraints to take the demand uncertainty into account; the probabilistic chance constraints, under the assumption of normally distributed demand, are converted into a set of equivalent deterministic equations usually called reliability constraints. There are two decision variables: the steady-state hiring level at knowledge level 1, which is taken into account explicitly, and the restoration factor, which is embedded implicitly in the constraint equations. The following notation is used in our model:

Fig. 1. Workers allocation to production stages for the basic model (derived from [7]).


Indices
t          Index of period (3 months), t = 1, 2, 3, ..., T
k          Index of knowledge level, k = 1, 2, 3
s          Index of production stage, s = a, b

Parameters
R          Revenue per production unit sold
Pc         Production cost per production unit ($), excluding the labor cost
Sc         Surplus cost (currency unit) per production unit
Sh         Shortage cost (currency unit) per production unit
c_k        Cost of a level-k worker per period (currency unit per 3 months)
q_k        Training factor, k = 1, 2
y_k        Retention rate of level k
p_k        Productivity rate of a level-k worker
d_t        Stochastic demand of period t

Variables
X_t        Number of workers in the first knowledge level in period t
W_kt       Number of workers in knowledge level k (k = 2, 3) in period t
W_3st      Number of skilled workers assigned to stage s in period t
P_st       Production level in stage s in period t
TP_t       Real production units in period t
TS_t       Total sale units in period t
Surplus_t  Surplus (inventory) level in period t
Slack_t    Shortfall level in period t
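For readers who prefer code, the state and parameter notation above can be grouped into small containers such as the following Python sketch; the field names and groupings are illustrative choices, not part of the paper.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class PeriodState:
    """One possible encoding of the system state at the start of a period."""
    surplus_prev: float  # Surplus_{t-1}: inventory carried over from the previous period
    W2: int              # workers at knowledge level 2 available in period t
    W3: int              # workers at knowledge level 3 available in period t

@dataclass
class ModelParameters:
    """Parameters of Section 4.1; the values of Table 1 are one possible instantiation."""
    R: float                       # revenue per unit sold
    Pc: float                      # production cost per unit (excluding labor)
    Sc: float                      # surplus (holding) cost per unit
    Sh: float                      # shortage cost per unit
    c: Tuple[float, float, float]  # (c1, c2, c3) per-period worker costs
    q: Tuple[float, float]         # (q1, q2) training factors
    y: Tuple[float, float, float]  # (y1, y2, y3) retention rates
    p: Tuple[float, float, float]  # (p1, p2, p3) productivity rates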

The estimates for these parameters can be obtained using historical data or based on expert opinion. The objective function of the basic mathematical model is as follows:

\min Z_1 = c_1 \sum_{t=1}^{T} X_t + \sum_{k=2}^{3} c_k \sum_{t=1}^{T} W_{kt}        (4)

Subject to:

W_{3at} \geq q_1 X_t,   \forall t        (5)
W_{3bt} \geq q_2 W_{2t},   \forall t        (6)
W_{3t} = W_{3at} + W_{3bt},   \forall t        (7)
W_{2,t+1} = y_1 X_t,   \forall t        (8)
W_{3,t+1} = y_2 W_{2t} + y_3 W_{3t},   \forall t        (9)
p_1 X_t + p_3 (W_{3at} - q_1 X_t) \geq d_t,   \forall t        (10)
p_2 W_{2t} + p_3 (W_{3bt} - q_2 W_{2t}) = d_t,   \forall t        (11)

Eq. (4) is the objective function of the basic model, which minimizes the total cost of employing workers. Constraints (5) and (6) guarantee that a sufficient number of third-level workers is available to train the workers of the other two levels in stages A and B, respectively. Constraint (7) ensures that all skilled workers are engaged in the assembly line. Constraints (8) and (9) describe the workers' promotion between levels and the turnover, which is also illustrated in Fig. 2. Constraints (10) and (11) express the balance between the production levels and the customers' demand; it is unlikely that both of these demand constraints hold with equality in the same period. Moreover, the equality constraint for stage B, which forces the exact amount of demand to be met, increases the possibility of infeasibility in the optimization model; this equation should therefore be converted into an inequality constraint. Furthermore, no buffer or inventory is considered in these constraints, so if demand is less than the production level, the surplus has to be disposed of. The focus of our proposed method is to find a stochastic policy that trades off the holding cost against the cost of unsatisfied demand, enabling managers to make flexible decisions in a more realistic environment. For this purpose, a buffer should exist at stage B, where the final products are produced (see Fig. 3); this buffer connects two consecutive periods through the balance equation. From a modeling point of view, we add new variables for the slack (shortfall) and the surplus and include their respective costs in the objective function.


Fig. 2. Workers’ promotion between knowledge levels.

Fig. 3. Workers allocation to production stages for the proposed model.

Since the operations in stage B are preceded by the operations in stage A, neither of the left-hand sides of constraints (10) and (11) can alone specify the number of final products. In this situation, the minimum of the production levels of stages A and B determines the number of final products if no buffer exists at stage B. By introducing a buffer at stage B, however, the feasible space is greatly enlarged and some of the hard constraints above are relaxed. The objective of the proposed model is to maximize the profit, that is, the total sale (TS) revenue minus the costs of workers, shortfall, production, and inventory. This can be written as:

\max Z_2 = \sum_{t=1}^{T} \left[ R \, TS_t - \left( P_c \, TP_t + S_c \, Surplus_t + S_h \, Slack_t + c_1 X_t + \sum_{k=2}^{3} c_k W_{kt} \right) \right]        (12)

subject to constraints (5)-(9) and to constraints (13)-(18) below:

p_1 X_t + p_3 (W_{3at} - q_1 X_t) = P_{at}        (13)
p_2 W_{2t} + p_3 (W_{3bt} - q_2 W_{2t}) = P_{bt}        (14)
TP_t = \min(P_{at}, P_{bt})        (15)
TS_t \leq d_t        (16)
TS_t \leq TP_t + Surplus_{t-1}        (17)
TP_t - Surplus_t + Slack_t + Surplus_{t-1} = d_t        (18)

where TS_t is the total sale, d_t the demand, Surplus_t the surplus (inventory) level, and Slack_t the shortfall level in period t. Constraints (13)-(15) calculate the real production level. Constraints (16) and (17) ensure that the total amount of products sold is no more than the entire demand and no more than the total real production plus the beginning storage. Constraint (18) is the mass balance equation; since all unsatisfied demand is lost, the shortfall does not carry over to the balance equation of the next period.

4.2. Our methodology

In this section, we describe our two models, SDP and RL. Value iteration is a common SDP technique [6], and SDP can find the global optimal solution provided the discretization scheme is chosen appropriately. To apply SDP to this problem, the stage, state, decision variable, and recursive function should first be defined:


State: the state variable consists of three elements: Surplus_{t-1}, W_{2t}, and W_{3t}.
Stage: the periods, where each period corresponds to three months in the case study.
Decision variable (X_t): the number of workers to be hired (new-hired workers).

Furthermore, constraints (8), (9), and (18) can be used as transition equations, and the recursive function can be written as follows:

"

# X V it ¼ max Reit þ b  Pijt ðxÞV jt ; xAðiÞ

ð19Þ

j

where V_{it} is the maximum expected reward for state i (i.e., for a given set of values of W_{2t}, W_{3t}, and Surplus_{t-1}) that can be earned from period t to the end of the horizon, A(i) is the set of admissible actions for this state, and the expected immediate reward Re_{it} is given by:

Re_{it} = \sum_j P_{ijt}(x) \, Re_{ijt}(x)        (20)

V_{j,t+1} = \sum_z w_z V_{zt}        (21)

where
- P_{ijt}(x) is the probability of the transition from state i to state j when action x is taken in period t,
- Re_{ijt}(x) is the reward of the transition from state i to state j when action x is taken in period t, obtained using Eq. (4),
- \beta is the discount factor,
- V_{j,t+1}, the solution from period t+1 to the end, can be calculated by Eq. (21) when a linear interpolation is used so that the value functions of the possible neighboring states of state j contribute to its value,
- w_z is the weight of neighboring state z, and
- V_{zt} is the value function of neighboring state z.

Since the demand is stochastic, the transition from one state to another is no longer deterministic; it is determined after the demand realization at the end of the period. The respective probabilities can be calculated using the balance equation (constraint (18)). As mentioned in the previous section, the Q-learning technique is derived from the value iteration (VI) technique but sweeps the states asynchronously [24]. It updates the Q-factors using the following equation, which is analogous to Eq. (3), except that the immediate reward and the learning rate are obtained in a specific way:

Q(i, a) \leftarrow Q(i, a) + \alpha \left( Re_{it} + \beta \max_{b \in A(j)} Q(j, b) - Q(i, a) \right)        (22)

\alpha = A / L(i, a)        (23)

where A is a constant and L(i, a) denotes the number of times the algorithm has taken action a in state i. Fig. 4 demonstrates the whole Q-learning and optimization routine schematically. As illustrated in this figure, the optimal hiring policy emerges as an interval, meaning that more than one action can be chosen as an optimal decision in every state; this makes the hiring policy more flexible for decision makers. As mentioned above, the optimal decision is the number of new workers to be hired. Furthermore, the number of skilled workers required for the training and production stages should be determined through a separate optimization process. This means that after observing the state of the system (the number of workers in the two highest levels) and taking an admissible action (the number of workers to be hired), the decision maker must specify how the skilled workers are assigned to the operations of stages A and B (W_3a and W_3b). Moreover, the skilled workers required for training the new-hired and semi-skilled workers and for production in stages A and B may outnumber the available skilled workers. If this happens (i.e., constraints (5)-(7) cannot all be satisfied), layoffs come into play: the manager may lay off some workers in each of the first two levels. To determine the optimal number of layoffs and the number of skilled workers assigned to both production stages, this inner optimization problem has to be solved in every iteration of both the RL and the SDP algorithm. Clearly, the way these workers are assigned to the different operations affects the total production (TP) at the end of the period and may also influence the end-of-period storage. Performing a new optimization after every action in both RL and SDP could be substantially time-consuming; to reduce the computation time, the optimization can be performed offline for all possible states, admissible actions, and demand scenarios.


Fig. 4. The proposed Q-learning technique.
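A minimal Python sketch of one simulated trajectory of the Q-learning routine of Fig. 4 (Eqs. (22) and (23)) is shown below. The environment function simulate_step and the function admissible_actions are hypothetical placeholders for the simulator and for the set of admissible hiring decisions; the constant A, the discount factor, and the epsilon-greedy exploration rate are example values, not the ones used in the paper.

import random
from collections import defaultdict

def q_learning_trajectory(simulate_step, admissible_actions, initial_state,
                          n_periods=40, A=1.0, beta=0.95, eps=0.1,
                          Q=None, visits=None):
    """Run one trajectory and update the Q-factors (Eqs. (22)-(23)).

    simulate_step(state, action) -> (reward, next_state)  : hypothetical simulator
    admissible_actions(state)    -> list of hiring levels  : hypothetical action set
    """
    Q = Q if Q is not None else defaultdict(float)
    visits = visits if visits is not None else defaultdict(int)
    state = initial_state
    for _ in range(n_periods):
        actions = admissible_actions(state)
        if random.random() < eps:                       # epsilon-greedy exploration
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        reward, next_state = simulate_step(state, action)
        visits[(state, action)] += 1
        alpha = A / visits[(state, action)]             # decaying learning rate, Eq. (23)
        next_best = max(Q[(next_state, b)] for b in admissible_actions(next_state))
        # Eq. (22): Q(i,a) <- Q(i,a) + alpha * (Re + beta * max_b Q(j,b) - Q(i,a))
        Q[(state, action)] += alpha * (reward + beta * next_best - Q[(state, action)])
        state = next_state
    return Q, visits

Repeating such trajectories over many episodes and then, for each state, keeping every action whose Q-factor is close to the best one is one way to read off the interval policy described above.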

Therefore, in all iterations of both the RL and the SDP algorithm, the number of skilled workers to be assigned to the training and production processes is extracted from a lookup table that stores the optimal solutions of the offline optimization model. In addition to constraints (5)-(7), (13), and (14), the following equation is added to this model:

NP_t^{sc} = \max\left( d_t^{sc} - Surplus_{t-1}, \, 0 \right)        (24)

where NP_t^{sc} is the needed production and d_t^{sc} is the demand of scenario sc in period t (an input parameter from SDP). The objective here is to minimize the absolute deviation between the needed production and the total production (TP_t, a new decision variable in this model). It can be written as:

Obj = \min \left( \left| NP_t - TP_t \right| \right)        (25)

There is a heuristic solution for this optimization problem, which is proved in the following lemma.

Lemma 1. For the triple state variables of the RL and SDP models (W_2, W_3, and Surplus_{t-1}), a demand level (d), and a given decision (the number of new workers, X), the optimal value of W_{3a} can be obtained as follows:

(1) W_{3a} = \max\left( \min\left( W_{3a}^{(1)}, \, W_3 - q_2 W_2 \right), \, q_1 X \right)   if NP > TP\left( W_{3a}^{(1)} \right);
(2) W_{3a} = \arg\min\left\{ TP\left( W_{3a}^{(2)} \right), TP\left( W_{3a}^{(3)} \right) \right\}   if NP < \min\left( TP\left( W_{3a}^{(2)} \right), TP\left( W_{3a}^{(3)} \right) \right);
(3) W_{3a} = W_{3a}^{(4)}   if W_{3a}^{(2)} \leq \mathrm{root}(NP = P_a) \leq W_{3a}^{(3)};
(4) W_{3a} = W_{3a}^{(5)}   if W_{3a}^{(2)} \leq \mathrm{root}(NP = P_b) \leq W_{3a}^{(3)};
(5) W_{3a} = \arg\max\left\{ TP\left( W_{3a}^{(2)} \right), TP\left( W_{3a}^{(3)} \right) \right\}   otherwise,


Fig. 5. The feasible space of the optimization problem to determine the optimal value of W3a.

where

W_{3a}^{(1)} = \left( p_3 W_3 + (p_2 - p_3 q_2) W_2 - (p_1 - p_3 q_1) X \right) / (2 p_3)        (26)
W_{3a}^{(2)} = q_1 X        (27)
W_{3a}^{(3)} = W_3 - q_2 W_2        (28)
W_{3a}^{(4)} = q_1 X + (NP - p_1 X) / p_3        (29)
W_{3a}^{(5)} = W_3 - q_2 W_2 - (NP - p_2 W_2) / p_3        (30)

Proof. As illustrated in Fig. 5, the maximum value of TP occurs where the P_a and P_b constraints meet. If the needed production (NP) is more than this maximum value (NP in state 1 of Fig. 5) and the respective solution is feasible, the optimal value of W_{3a} equals W_{3a}^{(1)}. In this situation, if the lower and upper bounds (constraints (5) and (6)) lie on one side of the point W_{3a}^{(1)} (i.e., W_{3a}^{(1)} is not feasible), the optimal solution is the one of W_{3a}^{(2)} or W_{3a}^{(3)} whose TP is closer to NP. In the second case, where NP is less than TP at both points W_{3a}^{(2)} and W_{3a}^{(3)} (NP in state 2 of Fig. 5), the optimal solution is clearly the one of W_{3a}^{(2)} or W_{3a}^{(3)} with the lower TP. In the third and fourth cases, if the intersection of P_a or P_b with NP yields a feasible solution, the best solution is the corresponding W_{3a}^{(4)} or W_{3a}^{(5)}, at which TP = NP (NP in state 3 of Fig. 5). Finally, in the fifth case, if neither W_{3a}^{(4)} nor W_{3a}^{(5)} is feasible, that is, the lower and upper bounds lie on one side of both W_{3a}^{(4)} and W_{3a}^{(5)}, and NP is higher than TP at both W_{3a}^{(2)} and W_{3a}^{(3)}, the optimal solution is the one of W_{3a}^{(2)} or W_{3a}^{(3)} with the higher TP. □

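The candidate points W_{3a}^{(1)}-W_{3a}^{(5)} of Eqs. (26)-(30) can also be evaluated directly rather than through the case analysis of Lemma 1. The Python sketch below takes that shortcut: it clamps each candidate to the feasible band and keeps the one whose total production is closest to the needed production (Eq. (25)). It treats W_3a as continuous and ignores layoffs, so it is an illustration of the idea rather than the authors' exact procedure.

def total_production(w3a, X, W2, W3, p1, p2, p3, q1, q2):
    """TP for a given split of the skilled workers (Eqs. (13)-(15))."""
    w3b = W3 - w3a
    Pa = p1 * X + p3 * (w3a - q1 * X)
    Pb = p2 * W2 + p3 * (w3b - q2 * W2)
    return min(Pa, Pb)

def choose_w3a(NP, X, W2, W3, p1, p2, p3, q1, q2):
    """Pick W3a in [q1*X, W3 - q2*W2] so that TP is as close as possible to NP."""
    lo, hi = q1 * X, W3 - q2 * W2                                        # W3a^(2), W3a^(3)
    w1 = (p3 * W3 + (p2 - p3 * q2) * W2 - (p1 - p3 * q1) * X) / (2 * p3)  # Pa = Pb intersection
    w4 = q1 * X + (NP - p1 * X) / p3                                     # root of NP = Pa
    w5 = W3 - q2 * W2 - (NP - p2 * W2) / p3                              # root of NP = Pb
    candidates = {min(max(w, lo), hi) for w in (w1, w4, w5, lo, hi)}     # clamp to feasible band
    return min(candidates,
               key=lambda w: abs(total_production(w, X, W2, W3, p1, p2, p3, q1, q2) - NP))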
5. Numerical results

To evaluate the proposed methods, we use a case study that is briefly described in the next subsection.

5.1. Case study

The values of the parameters and of the demand are extracted from Bordoloi and Matsuo's case study and reported in Table 1, together with some added items such as r, pc, sc, sh, c1, c2, and c3. According to the demand variation observed in the historical data (a normal distribution with mean 520 and standard deviation 130), five demand scenarios are considered; Table 2 lists these scenarios with their probabilities.

5.2. Result discussion

As Bordoloi and Matsuo's work is the basis for evaluating the proposed method, we first state the policy rule they use for hiring new workers:

X_t = X^{*} + q \left( W_2^{*} - W_{2t} \right) / y_1 + q \left( W_3^{*} - W_{3t} \right) / (y_1 y_2)        (31)


Table 1
The values of the parameters.

Parameter    Value                 Parameter    Value
p1 and p2    0.5 unit per 8 h      d            N(520, 130) products
p3           1 unit per 8 h        r            1 currency unit/product
q1           0.3                   pc           0.1 currency unit/product
q2           0.2                   sc           0.03 currency unit/product
y1           0.55                  sh           0.05 currency unit/product
y2           0.65                  c1 and c3    3 currency units
y3           0.95                  c2           2 currency units

Table 2
The demand scenarios and their probabilities.

Demand    Probability
188.37    0.031
365.44    0.278
542.51    0.513
719.58    0.161
896.65    0.017
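The probabilities in Table 2 are consistent with partitioning the N(520, 130) demand distribution into intervals around a small grid of points and assigning each point the probability mass of its interval. The Python sketch below shows such a midpoint-based discretization; the grid itself is taken from Table 2, but the construction is our assumption about how it could be reproduced.

from math import erf, sqrt

def discretize_normal(points, mu=520.0, sigma=130.0):
    """Assign to each scenario point the probability of the interval between
    the midpoints of neighbouring points (open-ended at both extremes)."""
    cdf = lambda x: 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))
    mids = [float("-inf")] + [(a + b) / 2 for a, b in zip(points, points[1:])] + [float("inf")]
    return [(d, cdf(hi) - cdf(lo)) for d, (lo, hi) in zip(points, zip(mids, mids[1:]))]

# Example: discretize_normal([188.37, 365.44, 542.51, 719.58, 896.65])
# yields probabilities close to those reported in Table 2.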

where q, the restoration factor, reflects the effect of the difference between the target and the current manpower levels on the hiring decision, and X*, W_2*, and W_3* are target values (4.83, 2.66, and 17.26 in our case study) obtained from their model in the steady-state situation. Interestingly, with our proposed methodology it is observed that all states related to the numbers of workers in the second and third levels reach three specific states, called stationary states, after some periods. These stationary states are:

(1) W_2 = 6, W_3 = 10;
(2) W_2 = 7, W_3 = 10;
(3) W_2 = 8, W_3 = 10.

Table 3 shows the number of periods taken to reach a stationary state in a simulation routine. This simulation is repeated for all possible states for which the optimal actions were obtained. As can be seen, some states (called virtual states, 1187 out of the 8591 states) never converge to the stationary states mentioned above; these virtual states (e.g., W_3 = 0, which does not occur in practice) are included solely for a smooth implementation of the proposed methods. Furthermore, under the optimal dynamic policy obtained from the proposed methods, the system reaches a stationary state after approximately three periods on average, compared with between four and five periods on average under the static policy extracted by Bordoloi and Matsuo [7] from their chance-constraint optimization model. This means that our policies reach stationary states at least 3 months (one period) sooner than their policy. Furthermore, the efficiency of the proposed methods is investigated in a simulation using the following three policies:

1. the policy obtained from the chance-constraint model developed by Bordoloi and Matsuo (CC),
2. the control policy obtained from our proposed RL model (RL), and
3. the control policy obtained from our proposed SDP model (SDP).

As SDP ends up with the global optimal solution, we provide the SDP results to verify the RL results; the closeness of SDP and RL is a suitable verification of the RL methodology. It is well known that SDP cannot be used in large real-life human resource applications because of the curse of dimensionality, whereas RL can. In fact, the main goal of defining the SDP model is to highlight the efficiency of the RL control policy relative to the optimal one (the output of SDP). In this model, the gap (i.e., surplus or slack) between the production level and the stochastic demand is penalized through the recursive function.
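A direct transcription of the conservative hiring rule of Eq. (31) could look like the Python sketch below; the target values are those reported for the case study, while the default restoration factor, the rounding to an integer, and the non-negativity clip are our own assumptions.

def cc_hiring_rule(W2_t, W3_t, q=1.0,
                   X_star=4.83, W2_star=2.66, W3_star=17.26,
                   y1=0.55, y2=0.65):
    """Eq. (31): steady-state hiring level plus restoration terms."""
    X_t = (X_star
           + q * (W2_star - W2_t) / y1
           + q * (W3_star - W3_t) / (y1 * y2))
    return max(0, round(X_t))   # hiring levels cannot be negative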

Table 3
The number of states and the respective number of periods needed to reach a stationary state.

Number of periods    -      0      1      2      3     4     5     6     7     8     9    10
Number of states     1187   2008   1613   1151   859   333   366   215   220   483   80   76


Fig. 6. The comparison of three policies (RL, SDP, and Bordoloi and Matsuo’s model) in terms of the average profit of 40 periods over 100 experiments.

Fig. 7. The comparison of three policies (RL, SDP, and Bordoloi and Matsuo’s model) in terms of the average labor cost of 40 periods over 100 experiments.

We generated demand data for ten years, i.e., forty periods, using the normal distribution described in the previous subsection. To obtain different problems, a set of 100 different initializations of the numbers of workers in the second and third levels is used; one of these initializations is the one used by Bordoloi and Matsuo, and the remaining problems are randomly generated. Moreover, for a fair comparison, the starting inventory level (the surplus of period 0) is set to zero in all simulations. Fig. 6 compares the simulation results of the three policies in terms of the total average profit. According to this figure, the policies based on the RL and SDP models lead to a higher total reward than the policy obtained from the conservative model for all 100 generated problems. To compare the performance of the policies from a cost-reduction point of view, the labor costs (c_k in Eq. (4)) obtained under the three policies are calculated; the results, presented in Fig. 7, show that both the RL and the SDP model outperform the conservative model. The three models can also be compared in terms of the mean and the variability of the number of new-hired workers using a Monte Carlo simulation. As demonstrated in Figs. 8 and 9, the proposed models yield a control policy that is superior to the conservative model in terms of both criteria; our RL model thus leads to a more stable hiring policy.
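The evaluation described above can be organized as a simple Monte Carlo loop such as the Python sketch below. The policy functions and the one-period simulator simulate_period are hypothetical placeholders; the state encoding (surplus, W2, W3) follows Section 4.2, and the run counts match the experiment sizes mentioned in the text.

import numpy as np

def compare_policies(policies, simulate_period, n_runs=100, n_periods=40, seed=0):
    """Average per-period profit of each policy over randomly initialised runs.

    policies : dict mapping a name to a function(state) returning the number of new hires
    simulate_period(state, x, demand) -> (profit, next_state) : hypothetical one-period simulator
    """
    rng = np.random.default_rng(seed)
    results = {name: [] for name in policies}
    for _ in range(n_runs):
        init_state = (0.0, int(rng.integers(0, 15)), int(rng.integers(0, 25)))  # (surplus, W2, W3)
        demands = rng.normal(520, 130, size=n_periods).clip(min=0)
        for name, policy in policies.items():
            state, profit = init_state, 0.0
            for d in demands:
                x = policy(state)                 # hiring decision of the current policy
                p, state = simulate_period(state, x, d)
                profit += p
            results[name].append(profit / n_periods)
    return {name: float(np.mean(v)) for name, v in results.items()}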

6. Conclusion and future work

This paper focuses on a combined problem of human resource planning (HRP) and production-inventory control in a knowledge-intensive industry. The main characteristics of the human resource in such an industry are the levels of "knowledge" and the learning process.


Fig. 8. The comparison of three policies (RL, SDP, and Bordoloi and Matsuo’s model) in terms of the average number of new-hired workers for 40 periods over 100 experiments.

Fig. 9. The comparison of three policies (RL, SDP, and Bordoloi and Matsuo’s model) in terms of the standard deviation of number of new-hired workers for 40 periods over 100 experiments.

The objective is therefore to maximize the expected profit by finding the optimal numbers of workers at the various knowledge levels to fulfill both the production and the training requirements. When a company loses skilled workers at the upper levels, this loss cannot be compensated directly: the company can only satisfy demand by recruiting workers at the first level (new-hired workers). In this paper, we developed a reinforcement learning (RL) method to obtain a near-optimal decision for hiring workers under demand uncertainty. Moreover, the decision in every state is not unique but lies within a specific interval, which makes the decision-making process more flexible for decision makers in a practical staffing problem. Another distinguishing feature of the proposed model, compared with similar research works, is that an internal optimization routine is used to determine the optimal number of skilled workers assigned to training the workers of the first and second knowledge levels. This optimal number is prepared as a lookup table over all discrete states, demand scenarios, and admissible actions, and is used as an input for both the RL and the stochastic dynamic programming (SDP) approaches. To evaluate the proposed RL method, in addition to SDP, we also implemented a well-known conservative approach. All results were compared in terms of four criteria: average obtained profit, average obtained cost, the number of new-hired workers, and the standard deviation of the hiring policies. The numerical results confirm that our method leads to satisfactory results. The proposed methodology can be extended to situations with more than two production stages. Furthermore, applications such as the call centers of financial companies and consulting firms, which have been investigated by other researchers, may be addressed with the proposed methodology. These extensions are left for future study.


References

[1] H.S. Ahn, R. Righter, J.G. Shanthikumar, Staffing decisions for heterogeneous workers, with turnover, Math. Meth. Oper. Res. 62 (2005) 499-514.
[2] M. Armstrong, A Handbook of Human Resource Management Practice, tenth ed., London, UK, 2006.
[3] A.N. Avramidis, W. Chan, M. Gendreau, P. L'Ecuyer, O. Pisacane, Optimizing daily agent scheduling in a multi-skill call centre, Eur. J. Oper. Res. 200 (2010) 822-832.
[4] D. Barrera, N. Velasco, C. Amaya, A network-based approach to the multi-activity combined timetabling and crew scheduling problem, Comput. Ind. Eng. 63 (2012) 802-812.
[5] D.J. Bartholomew, A.F. Forbes, S.L. McClean, Statistical Techniques for Manpower Planning, second ed., John Wiley and Sons, New York, USA, 1991.
[6] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, 1957.
[7] S.K. Bordoloi, H. Matsuo, Human resource planning in knowledge-intensive operations: a model for learning with stochastic turnover, Eur. J. Oper. Res. 130 (2001) 169-189.
[8] S.K. Bordoloi, A control rule for recruitment planning in engineering consultancy, J. Prod. Anal. 26 (2006) 147-163.
[9] D.N. Bulla, P.M. Scott, Manpower requirement forecasting: a case example, Strateg. Hum. Resour. Plan. Appl. (1987) 145-155.
[10] N. Celik, S. Lee, E. Mazhari, Y.J. Son, R. Lemaire, K.G. Provan, Simulation-based workforce assignment in a multi-organizational social network for alliance-based software development, Simul. Modell. Pract. Theory 19 (10) (2011) 2169-2188.
[11] A.K. Chattopadhyay, A. Gupta, A stochastic manpower planning model under varying class sizes, Ann. Oper. Res. 155 (2007) 41-49.
[12] Y.J. Chen, Y.M. Chen, M.S. Wu, Empirical knowledge management framework for professional virtual community in knowledge-intensive service industries, Expert Syst. Appl. 39 (2012) 13135-13147.
[13] A.A. Constantino, D. Landa-Silva, E.L. de Melo, C.F.X. de Mendonca, D.B. Rizzato, W. Romao, A heuristic algorithm based on multi-assignment procedures for nurse scheduling, Ann. Oper. Res. (2013), http://dx.doi.org/10.1007/s10479-013-1357-9.
[14] C.F.F. Costa Filho, D.A.R. Rocha, M.G.F. Costa, W.C.A. Pereira, Using constraint satisfaction problem approach to solve human resource allocation problems in cooperative health services, Expert Syst. Appl. 39 (2012) 385-394.
[15] P.F. Drucker, Knowledge-worker productivity: the biggest challenge, California Manage. Rev. XLI (2) (1999) 79-94.
[16] R.J. Ebert, Aggregate planning with learning curve productivity, Manage. Sci. 23 (2) (1976) 171-182.
[17] J. Edwards, A survey of manpower planning models and their applications, J. Oper. Res. Soc. 34 (11) (1983) 1031-1040.
[18] J.S. Edwards, T. Alifantis, R.D. Hurrion, J. Ladbrook, S. Robinson, A. Waller, Using a simulation model for knowledge elicitation and knowledge management, Simul. Modell. Pract. Theory 12 (7-8) (2004) 527-540.
[19] K. Ertogral, B. Bamuqabel, Developing staff schedules for a bilingual telecommunication call center with flexible workers, Comput. Ind. Eng. 54 (2008) 118-127.
[20] C.Y. Fan, P.S. Fan, T.Y. Chan, S.H. Chang, Using hybrid data mining and machine learning clustering analysis to predict the turnover rate for technology professionals, Expert Syst. Appl. 39 (2012) 8844-8851.
[21] E. Fragnière, J. Gondzio, X. Yang, Operations risk management by optimally planning the qualified workforce capacity, Eur. J. Oper. Res. 202 (2010) 518-527.
[22] N. Gans, Y. Zhou, Managing learning and turnover in employee staffing, Oper. Res. 50 (6) (2002) 991-1006.
[23] A.C. Georgiou, N. Tsantas, Modeling recruitment training in mathematical human resource planning, Appl. Stoch. Models Bus. Ind. 18 (2002) 53-74.
[24] A. Gosavi, Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Kluwer Academic Publishers, Norwell, MA, 2003, pp. 3-5.
[25] C. Heimerl, R. Kolisch, Work assignment to and qualification of multi-skilled human resources under knowledge depreciation and company skill level targets, Int. J. Prod. Res. 48 (13) (2010) 3759-3781.
[26] N.C.R. Hwang, K. Kogan, Dynamic approach to human resources planning for major professional companies with a peak-wise demand, Int. J. Prod. Res. 41 (6) (2003) 1255-1271.
[27] C.M. Khoong, An integrated system framework and analysis methodology for manpower planning, Int. J. Manpower 17 (1) (1996) 26-46.
[28] N.K. Kwak, W.A. Garrett, S. Barone Jr., A stochastic model of demand forecasting for technical manpower planning, Manage. Sci. 23 (10) (1977) 1089-1098.
[29] G. Mincsovics, N. Dellaert, Stochastic dynamic nursing service budgeting, Ann. Oper. Res. 178 (2010) 5-21.
[30] M. Othman, N. Bhuiyan, G.J. Gouw, Integrating workers' differences into workforce planning, Comput. Ind. Eng. 63 (2012) 1096-1106.
[31] E.J. Pinker, R.C. Larson, Optimizing the use of contingent labour when demand is uncertain, Eur. J. Oper. Res. 144 (2003) 39-55.
[32] S.J. Sadjadi, R. Soltani, M. Izadbakhsh, F. Saberian, M. Darayi, A new nonlinear stochastic staff scheduling model, Scientia Iranica E 18 (3) (2011) 699-710.
[33] H.C. Tijms, A First Course in Stochastic Models, John Wiley & Sons, 2003.
[34] X. Zhu, H.D. Sherali, Two-stage workforce planning under demand fluctuations and uncertainty, J. Oper. Res. Soc. 60 (2009) 94-103.
