Highlights

• Consider the advertising budget allocation problem in the revenue management context.
• A nonparametric learning-while-doing policy is proposed.
• The policy balances the advertising budget and inventory "budget" simultaneously.
• The policy achieves near-best asymptotic performance.


Nonparametric Advertising Budget Allocation with Inventory Constraint

Chaolin Yang∗
School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China
[email protected]

Yi Xiong
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
[email protected]

Abstract

In this paper, we study the advertising budget allocation problem for revenue management faced by a marketer. Besides the advertising budget, the marketer is subject to an inventory constraint during the promotion season. The marketer can affect sales by spending on advertising but does not initially know the relationship between the advertising expense and the consequent sales. We propose a nonparametric learning-while-doing budget allocation policy for the problem. Specifically, we first conduct a sequence of advertising experiments to learn (predict) the market sales response by observing realized sales (exploration), and then, based on the learned sales function, determine the subsequent budget allocation plan (exploitation). In particular, during the exploration and exploitation phases, we need to balance the advertising and inventory budgets simultaneously. We show that our policy is asymptotically optimal as the size of the market increases. By constructing a worst-case example, we show that our policy achieves near-best asymptotic performance. We also provide numerical illustrations of how our policy works, discuss how its performance changes as the system parameters vary, and glean some managerial implications of our model and policy from the numerical results.

Keywords: Revenue management; advertising budget allocation; nonparametric; dynamic learning; asymptotic optimality.



∗ Corresponding author


1 Introduction

Allocating the advertising budget has been of paramount importance to marketers, and the issue has been studied in the academic literature for many decades. For instance, in sponsored search markets, where marketers allot a bid value to each keyword and pay the assigned price for each actual click to their product websites, marketers usually first specify their advertising budgets for a promotion season, and then dynamically determine the budget allocation with the goal of maximizing the number of clicks received during the season (Amin et al. 2012). Display advertising is another very popular form of advertising market (e.g., Google's DoubleClick and Yahoo's Right Media). Firms can buy display advertisements on the Internet in real time via ad exchange platforms, in which marketers participate to launch marketing campaigns. Such campaigns are commonly based on a pre-determined budget and last for a fixed amount of time (Balseiro et al. 2015), and thus the key decision is how to allocate the advertising budget.

On the other hand, revenue management has received considerable attention in operations management studies. One of the key focuses of the revenue management area is to determine a dynamic pricing strategy that maximizes a firm's expected revenue given its inventory. Yet the advertising strategy, the "twin" marketing decision of the pricing strategy that most businesses need to determine in controlling sales and revenue, receives much less attention in the revenue management literature. In some cases, however, marketers need to optimize the advertising budget allocation in a revenue management context, that is, while considering the inventory constraint of the promoted product. For example, in the e-commerce search advertising scenario (Zhao et al. 2018), when the marketer launches a promotion for his products, the promotion event may be subject not only to the advertising budget but also to the inventory "budget" that is replenished before the start of the promotion season. The goal of the marketer is then to maximize the overall profit during the season, which is not necessarily equivalent to maximizing the total number of clicks received during the season. However, for the optimization of the advertising budget allocation for revenue management with an inventory constraint, we are not aware of any studies in the literature.

Furthermore, in practice, due to the changing business environment, it is usually difficult for the marketing manager of a firm to know (predict) exactly what sales response, or even what probability distribution of the sales response, will be elicited by spending money on a particular marketing campaign, especially when a new product is introduced (e.g., Amin et al. 2012 and Wu et al. 2018).


Hence the decision problem faced by the marketer is to determine an intertemporal allocation of an advertising budget within the promotion season so as to maximize his expected revenue, given an inventory constraint and the fact that he may not be able to evaluate the impact of these advertising expenses on future returns. In such a case, the marketer has to gradually learn and update the prediction of the sales response function by observing the realized sales during the promotion season. Therefore, the central trade-off of the budget allocation policy is to balance the tension between sales learning (exploration) and near-optimal advertising (exploitation). More specifically, the more time the marketer spends on learning the sales response, the less budget and time he has to exploit that knowledge to boost earnings. On the other hand, if he decides to spend less budget and time on learning, the revenues collected in the advertising phase may be well below the true optimal revenue because of the poor information at his disposal at the end of the learning phase.

In this paper, we study the problem of advertising budget allocation for revenue management with an inventory constraint and an unknown sales response function. A price-taking marketer joins the market with a pre-specified budget and inventory, and begins to promote and sell a single product with almost no prior information on the sales response of the market. The sales response is characterized by a Poisson process whose intensity is affected by the advertising expense rate of the marketer. However, the relationship between the advertising expense rate and the sales rate is unknown to the marketer, and he can only obtain this information by observing realized sales. Our goal therefore is to find a policy whose expected revenue loss relative to an oracle that knows the sales function in advance (the regret) is small and diminishes as the market size increases. We propose a nonparametric advertising budget allocation policy to dynamically learn the sales response of the market, integrating the learning (exploration) step and the doing (exploitation) step. The key contribution and highlight of our paper is to consider the advertising budget allocation problem in the revenue management context; our policy balances the advertising budget and the inventory "budget" simultaneously in the exploration-exploitation trade-off.

We show that our policy is asymptotically optimal. In particular, the policy achieves a regret of O(n^{-1/2} log^{5ε+2} n), where n ∈ N_+ is the market size, ε ∈ R_+ is a tuning parameter of our policy, and O(n^{-1/2} log^{5ε+2} n) means that
$$\lim_{n \to +\infty} \frac{O(n^{-1/2} \log^{5\epsilon+2} n)}{n^{-1/2} \log^{5\epsilon+2} n} = \text{constant}.$$
We also prove that the minimum regret of any policy is of order n^{-1/2}, which implies that our policy achieves near-best performance. Besides these theoretical results, we also provide numerical illustrations to show how our algorithm works, and discuss how its performance changes as the system parameters vary.


We also glean some managerial implications of our model and policy from the numerical results.

This paper is organized as follows. Section 2 reviews a number of previous studies in related fields. In Section 3, we formulate our model and state our main assumptions. Section 4 proposes the nonparametric advertising budget allocation policy. Section 5 establishes its asymptotic performance. In Section 6, we provide a numerical illustration of the policy. Section 7 summarizes the paper. All the technical proofs are provided in the Online Appendix.

2 Literature Review

Marketing budget allocation problems have featured in the academic literature for a long time. Tull et al. (1986) suggest that determining a good allocation plan for the budget is more important than improving the overall budget. Lodish et al. (1988) propose an allocation algorithm for a specific type of market response, implemented by a pharmaceutical company. Doyle and Saunders (1990) propose a linear-in-logs model and derive a closed-form allocation rule. These papers focus on maximizing short-term profit. Fischer et al. (2011) study the marketing budget allocation problem for multi-product, multi-country firms. Although their study considers long-term profit maximization, the budget is assumed to be equally allocated over time. For the dynamic advertising problem, Vidale and Wolfe (1957) propose the celebrated Vidale-Wolfe model. Sethi (1977) considers the optimal control problem of the Vidale-Wolfe model, in which total advertising expenses over the entire selling season are subject to a budget. Fruchter and Dou (2005) further extend the model to the case involving investments in two types of advertising campaigns. Kumar and Sethi (2009) study hybrid models of dynamically pricing the subscription fee and determining the advertising level over time. Royo et al. (2013) consider the optimal advertising budget for a multi-period, multi-product scenario. All these studies consider budget allocation with a deterministic market sales response. Holthausen and Assmus (1982), Du et al. (2007) and Yang et al. (2013) study budget allocation models with an uncertain market sales response.

One potential application of our model is online advertising, a well-studied topic that has attracted considerable research. One of its focuses is how to allocate the advertising budget efficiently. Here we review two streams of literature, i.e., sponsored search and real-time bidding (RTB). Traditional sponsored search is a keyword-level optimization problem, in which the price is related to each keyword.


Amin et al. (2012) consider budget optimization for such a problem in which the market price is drawn from an unknown distribution and only censored observations are available. They use a Markov decision process (MDP) to formulate the problem and propose a learning algorithm whose performance is numerically validated. Tran-Thanh et al. (2014) study the problem of distributing a budget across a sequence of auctions; they establish, theoretically and numerically, the regret bound and convergence rate of three algorithms. Yang et al. (2015) consider a multi-keyword, multi-period search advertising problem, develop an advertising response function accounting for the features of sponsored search advertising, and then solve the resulting dynamic advertising problem. Tunuguntla et al. (2019) study the joint decision model of sponsored search advertising and dynamic pricing for perishable products. RTB is another popular scheme of online advertising: when an ad display impression is generated by a user visit, it is auctioned off in real time among the advertisers. Zhang et al. (2014) formulate simple bidding functions that map the impression value to a bidding price and show that the optimal functions are non-linear and concave. Zhang et al. (2016) design a feedback control mechanism to resolve the instability and volatility in RTB. More recently, the reinforcement learning (RL) framework has become a popular way to study RTB problems. Cai et al. (2017) leverage the RL framework to formulate a sequential bidding problem that dynamically allocates the budget over ad impressions. Wu et al. (2018) utilize a model-free RL mechanism to solve budget-constrained RTB problems. Zhao et al. (2018) combine the sponsored search and RTB paradigms and study a so-called SS-RTB problem, for which they propose a deep RL model. One key idea of the RL framework is to balance the exploration-exploitation trade-off, as in the ε-greedy and UCB algorithms; this valuable idea is indeed inherited in our policy. Compared to this stream of studies, our paper considers the advertising budget allocation problem embedded in the revenue management context, namely, maximizing the expected revenue while respecting the inventory constraint. The highlight of our work is that we propose a learning-while-doing algorithm which integrates the exploration phase (learning the sales response function) and the exploitation phase (maximizing the profit with the learned sales response), and simultaneously balances the advertising budget and the inventory "budget" in the exploration-exploitation trade-off.

Our model is also closely related to papers on dynamic pricing for revenue management. We refer the readers to Talluri and van Ryzin (2005) for a comprehensive review of this literature. Gallego and van Ryzin (1994) consider a retailer's single-product pricing problem when the retailer has complete information about the demand function. Gallego and van Ryzin (1997) study the multi-product network revenue management problem. Subsequent research on this topic is abundant; for example, Tong and Topaloglu (2014) study the problem using approximate linear programming, and Zhang and Weatherford (2017) consider its application in the hotel industry.

In view of the fact that decision makers in reality usually do not have complete information about the demand, joint learning-while-pricing problems have received extensive research attention over the last decade (see den Boer (2015) for a comprehensive review). The parametric setting assumes that the demand function takes a certain parametric form while some key parameters are unknown; see, e.g., Araman and Caldentey (2009), Broder and Rusmevichientong (2012), Harrison et al. (2012), and den Boer and Zwart (2014). Meanwhile, several papers consider the nonparametric setting of the problem. Besbes and Zeevi (2009) consider dynamic pricing with demand learning in both the parametric and nonparametric cases and propose learning algorithms for each; however, there is a gap in performance between the two cases. This gap has been closed by the dynamic pricing algorithm (DPA) of Wang et al. (2014). In particular, the DPA employs a more sophisticated learning procedure that iteratively performs price experimentation within a shrinking series of pricing intervals and achieves a regret of O(n^{-1/2} log^{4.5} n). Besbes and Zeevi (2012) consider the online network revenue management problem with multiple resources and multiple products; when the ambiguity set of feasible prices is a continuum, their algorithm achieves a regret of O(n^{-1/(d+3)} log^{1/2} n), where d is the number of products. Besbes and Zeevi (2015) consider the dynamic pricing problem without an inventory constraint; the focus of that paper is not on the algorithm but on the robustness of the linear demand-price model. Recently, Cheung et al. (2017) consider an online dynamic pricing model with a price-changing constraint.

The general framework of our model follows the models of the dynamic pricing for revenue management literature: we consider a finite and given promotion/selling season, a given initial inventory, and a Poisson demand process with a decision-variable-dependent arrival rate, and the goal is to maximize the expected total revenue during the promotion season. However, in contrast to the dynamic pricing literature, our model considers the problem of dynamic advertising for revenue management, and hence the revenue rate function has a different functional form. More importantly, in our model the total advertising expenditure in the promotion season is subject to an advertising budget, whereas there is no "pricing budget" kind of constraint in dynamic pricing problems.

From a technical point of view, the DPA developed by Wang et al. (2014) most closely resembles our own approach, and our budget allocation policy builds upon the ideas incorporated in their DPA. However, our approach differs from theirs in two important aspects.

First, Wang et al.'s dynamic pricing model includes only an inventory constraint, whereas we take account of both an advertising budget constraint and an inventory constraint. As a result, our policy needs to balance the advertising budget and the inventory "budget" simultaneously in the exploration-exploitation trade-off. More specifically, the main aim of Wang et al.'s DPA is to sequentially estimate and approach the optimal solution of a deterministic relaxation of the original stochastic problem, which equals the maximum of the global maximizer of the revenue rate function and the inventory clearance price. In our model, by contrast, we include both the budget constraint and the inventory constraint, and the optimal solution of the corresponding deterministic problem is the minimum of three quantities: the global maximizer of the revenue rate function, the inventory clearance advertising expense, and the average budget. We therefore have to redesign the learning procedure. Second, in Wang et al.'s numerical study, the authors make a number of suggestions for modifying the logarithmic factors of the tuning parameters of the algorithm to improve its numerical performance; however, these modifications may loosen the theoretical performance guarantee of the algorithm. To resolve this difficulty, we introduce a parameter ε ∈ R_+ in our algorithm to control the logarithmic factor. The performance of the algorithm is O(n^{-1/2} (log n)^{5ε+2}). Marketers can choose appropriate values of ε for different problem settings to obtain an algorithm that is practically efficient while retaining a theoretical performance guarantee. Our numerical study shows that different values of ε indeed result in significantly different regrets of the algorithm. Note that in Wang et al. (2014), ε = 1/2. Finally, we remark that the algorithm developed by Besbes and Zeevi (2012), which considers nonparametric dynamic pricing with multiple resource constraints, can be directly applied to our problem; however, that policy only achieves a regret of O(n^{-1/4} (log n)^{1/2}).

3 Problem Formulation

3.1 Model Primitives and Basic Assumptions

Consider a marketer who faces the single-product revenue management problem of determining his intertemporal advertising budget allocation over a finite promotion season of length T. The total advertising budget over the entire season is denoted by B > 0. The product price is fixed throughout the season. Initially, the marketer is endowed with x ∈ Z_+ units of the product. The random demand follows a Poisson process with arrival rate λ(t). The marketer can influence demand by varying his advertising spending. Specifically, for each t the arrival rate λ(t) depends on A(t), the advertising expense rate of the marketer at time t.

Let (A(s) : 0 ≤ s ≤ T) denote the advertising expense process, which takes values in D, where D := [0, Ā] ∪ {∅}. Here ∅ represents a shut-off action, which is applied if and only if the marketer runs out of items.

In the marketing literature there are three types of commonly used sales functions: an S-shaped curve, a function with a threshold, and an increasing function with diminishing returns (concave). The preponderance of empirical evidence favors the concave sales function (e.g., Hanssens et al., 1990), and Zhang et al. (2014) also indicate that the winning rate function of real-time bidding for display advertising consistently has an (approximately) concave shape. We thus assume that λ(A) is increasing and concave and has an inverse function A = a(λ). Let w denote the unit margin. Then the revenue rate function r(λ(A)) = wλ(A) − A is also concave. However, in our model the marketer does not know the true sales function λ(A). He only knows that it belongs to the class L := L(M, K, m_L, m_U) of nonnegative, increasing and concave functions which, for finite positive constants M, K, m_L, m_U, satisfies the following:

1. Boundedness: for all λ ∈ L, λ(A) < M for all A ∈ D.

2. Lipschitz continuity: λ(A) and r(λ(A)) are Lipschitz continuous in A with factor K. The inverse demand function A = a(λ) is also Lipschitz continuous in λ with factor K.

3. Strict concavity and differentiability: r''(λ(A)) exists and −m_L ≤ r''(λ(A)) ≤ −m_U < 0 for all A ∈ D.

Note that the above assumptions are benign and have been adopted in prior (blind) revenue management studies (see, e.g., Besbes and Zeevi (2012) and Wang et al. (2014)). Since r(λ(A)) is strictly concave in A, there exists a global maximizer, denoted by A^u, such that r'(λ(A^u)) = 0. We assume A^u is within D, i.e., 0 ≤ A^u ≤ Ā. This can be achieved by setting a large Ā; for example, Ā can be set as the highest possible market revenue rate. Moreover, we assume a(x/T) > 0, which is equivalent to x/T > λ(0). In other words, the inventory is sufficient if the demand arrives at the deterministic rate λ(0). Intuitively, we have to make sure that if we produce a sound allocation plan, our spending on advertising will be profitable.

We shall use π = (A^π(t) : 0 ≤ t ≤ T) to denote an allocation policy. For 0 ≤ t ≤ T, put
$$N^\pi(t) := N\left(\int_0^t \lambda(A^\pi(s))\,ds\right), \qquad (1)$$
where N(·) is a unit rate Poisson process, and N^π(t) denotes the cumulative demand for the product up until time t under the policy π. We call a policy π admissible if it is non-anticipating and satisfies
$$\int_0^T dN^\pi(s) \le x \quad \text{a.s.}, \qquad (2)$$
$$\int_{t=0}^T A^\pi(t)\,dt \le B. \qquad (3)$$

Note that the key feature distinguishing our model from those in the existing literature is that we consider the advertising budget constraint and the inventory constraint simultaneously. Specifically, in the literature on advertising budget allocation the inventory constraint is not taken into consideration, while in the literature on dynamic pricing for revenue management there is no "pricing budget" kind of constraint. One will see from our policy and its analysis that when both the advertising budget and the inventory constraint exist, achieving near-best asymptotic optimality requires the policy to carefully balance the advertising budget and the inventory "budget" simultaneously in the exploration (learning the sales response function) versus exploitation (maximizing the profit with the learned sales response) trade-off. If we let P denote the set of admissible policies, then the problem facing the marketer is
$$\max_{\pi \in \mathcal{P}} \; J^\pi(x, T, B; \lambda) = \mathbb{E}\left[\int_0^T w\,dN^\pi(s) - \int_0^T A^\pi(s)\,ds\right].$$

Nonetheless, in our model the above objective (expectation) is intractable, because the only information possessed by the marketer is the evolving history of demand observations, and the demand response function may be any member of L. Hence the goal is to find a policy π ∈ P that minimizes the regret of the marketer, defined as the difference between the total profit he obtains and that of the best fixed decision in hindsight. We remark that here we employ a general Poisson process to model the demand and do not restrict the arrival rate function to any specific functional form; our main purpose is to propose a nonparametric advertising budget allocation policy which simultaneously balances the advertising budget and the inventory constraint. If one considers a specific advertising scenario (e.g., sponsored search or real-time bidding), then one can study a more specific functional form of the arrival rate with unknown parameters (e.g., the customer arrival rate, the click-through rate and the purchasing probability), and establish an allocation policy following the general framework of our policy but with tailored estimators for the unknown parameters (we provide more discussion on this in the Conclusion and Discussion section).
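To make the objective above concrete, the following is a minimal simulation sketch (a hypothetical helper, not part of the paper) of how J^π could be estimated by Monte Carlo for a constant admissible expense rate A^π(t) ≡ A, assuming the square-root sales function that the paper later uses in Section 6; the function name, sample sizes, and the choice of a constant policy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_constant_policy(A, lam, x, T, B, w, n_runs=2000):
    """Monte-Carlo estimate of J^pi for the constant policy A^pi(t) = A.

    Demand is Poisson with rate lam(A) (assumed > 0); selling stops when the
    x units are sold or the season ends, and A*T <= B enforces the budget (3).
    """
    assert A * T <= B + 1e-12, "constant expense must respect the advertising budget (3)"
    rate = lam(A)
    profits = []
    for _ in range(n_runs):
        # times of the first x demand arrivals (only those can generate revenue)
        arrivals = np.cumsum(rng.exponential(1.0 / rate, size=x))
        sales = int(np.searchsorted(arrivals, T, side="right"))   # arrivals within [0, T]
        t_stop = arrivals[x - 1] if sales == x else T             # shut-off once inventory is gone
        profits.append(w * sales - A * t_stop)                    # margin minus advertising spend
    return float(np.mean(profits))

lam = lambda A: 2.0 * np.sqrt(A)   # illustrative sales response, as in Section 6
print(simulate_constant_policy(A=1.0, lam=lam, x=20, T=10, B=20, w=2))
```

Such a simulator only evaluates a fixed policy; the difficulty addressed in the rest of the paper is choosing the expense rate when λ(·) itself is unknown.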


3.2 Full Information Problem and Regret

We first consider the full-information deterministic relaxation of the problem. Assuming that the sales function is known and the demand process is deterministic, our objective is to maximize the total revenue generated over [0, T] given x and B, denoted as J^D(x, T, B|λ):
$$
\begin{aligned}
J^D(x, T, B|\lambda) = \sup \quad & \int_0^T r(\lambda(A(s)))\,ds \\
\text{s.t.} \quad & \int_0^T \lambda(A(s))\,ds \le x, \\
& \int_0^T A(s)\,ds \le B, \\
& A(s) \in D, \quad \forall s \in [0, T].
\end{aligned} \qquad (4)
$$

There are two important properties of problem (4). First, its optimal value provides a uniform upper bound on the expected revenue generated by any allocation policy π ∈ P, that is, J*(x, T, B; λ) ≤ J^D(x, T, B|λ) for all λ ∈ L. Second, define A^D = min{A^u, a(x/T), B/T}, where A^u is the global maximizer of r(·) and a(x/T) is the advertising expense corresponding to the demand rate x/T; then the optimal solution of problem (4) is A(s) = A^D for s ∈ [0, T].

Intuitively, if there are sufficient inventory and an adequate advertising budget, it must be optimal to spend A^u over the entire selling season because it is the global maximizer of the revenue rate function. However, if the inventory is insufficient, then it is optimal to spend just enough on advertising to ensure all the products in stock are sold, provided that the advertising budget is adequate; because the sales response function is concave, equal expense over time maximizes the efficiency of advertising. Finally, if the advertising budget is limited, then it is optimal to spend the entire budget, and it is optimal to allocate the budget equally because λ(A) is concave. Detailed proofs of these two properties are given in the Online Appendix.

Define the regret as
$$R^\pi(x, T, B; \lambda) = 1 - \frac{J^\pi(x, T, B; \lambda)}{J^D(x, T, B|\lambda)}. \qquad (5)$$

This regret measures the percentage loss in performance of any policy π relative to J^D(x, T, B|λ); the smaller the value of this regret, the better the performance of the policy π. Because λ(A) is unknown, our goal is to seek a robust allocation policy that minimizes the worst-case regret sup_{λ∈L} R^π(x, T, B; λ).
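The benchmark A^D = min{A^u, a(x/T), B/T} is straightforward to compute once λ is known. As a small illustration (a sketch, assuming the square-root sales function λ(A) = 2√A that the paper uses in Section 6.1; the function names are hypothetical), the following reproduces the deterministic quantities quoted in the captions of Tables 2-4:

```python
import math

def deterministic_benchmark(x, T, B, w=2.0):
    """Full-information benchmark for the illustrative sales function lambda(A) = 2*sqrt(A).

    Returns (A_D, J_D): the constant optimal expense rate of problem (4)
    and the deterministic revenue J^D = T * (w*lambda(A_D) - A_D).
    """
    lam = lambda A: 2.0 * math.sqrt(A)     # sales response lambda(A)
    a_inv = lambda d: (d / 2.0) ** 2       # inverse a(lambda): expense needed for demand rate d
    A_u = w ** 2                           # maximizer of r(A) = 2w*sqrt(A) - A (set r'(A) = 0)
    A_D = min(A_u, a_inv(x / T), B / T)    # A^D = min{A^u, a(x/T), B/T}
    J_D = T * (w * lam(A_D) - A_D)         # J^D = T * r^D
    return A_D, J_D

# The three instances of Section 6.1: (x, B) = (40, 30), (60, 60), (20, 20) with T = 10.
for x, B in [(40, 30), (60, 60), (20, 20)]:
    print(x, B, deterministic_benchmark(x, T=10, B=B))
```

With a market size n, the scaled benchmark nT r^D then serves as the denominator of the regret.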


Next, we introduce the widely used asymptotic performance analysis technique. Consider a regime in which the size of the market expands proportionally, and call it a market of size n (n ∈ Z_+) if the initial inventory, the advertising budget and the demand function are now given by
$$x_n = nx, \quad B_n = nB, \quad \text{and} \quad \lambda_n(\cdot) = n\lambda(\cdot). \qquad (6)$$

Intuitively, a market of size n can be viewed as a “large” market which contains n copies of the original market. In other words, in a market of size n, the marketer conducts business in n identical sub-markets (original markets) simultaneously. We will denote by JnD (x, T, B|λ) the optimal revenue of the deterministic problem in a market of size n. Let rD denote the revenue rate under AD , i.e., rD = wλ(AD ) − AD . Then we have J D (x, T, B|λ) = T rD and JnD (x, T, B|λ) = nT rD . The

revenue under any admissible policy π in a market of size n is denoted by Jnπ (x, T, B; λ). We then define regret for a market of size n accordingly as Rπn (x, T, B; λ) = 1 − Jnπ (x, T, B; λ)/JnD (x, T, B|λ). Finally, we are interested in designing an admissible budget allocation policy that has a good asymptotic performance of Rπn (x, T, B; λ) in terms of n. We summarize notations in Table 1. Notation T x B A(t) λ(A(t)) a(λ) w r(λ(A(t))) rD Au AD π N π (t) Jπ JD τiu τic κui κci

Description The length of promotion season The initial inventory The advertising budget over the promotion season The advertising expense rate at time t The demand arrival rate function The inverse function of λ(A) The unit margin The revenue rate function The revenue rate function under the full information problem The global maximizer of r(λ(A)) The optimal advertising expense rate for the deterministic problem An allocation policy The cumulative demand for the product up until time t under policy π The total expected revenue under policy π The optimal total revenue for the deterministic problem The length of the i-th learning interval in step 1 of the algorithm The length of the i-th learning interval in step 2 of the algorithm The number of advertising expenses to experiment with in interval τiu The number of advertising expenses to experiment with in interval τic Table 1: Summary of notations


4 Nonparametric Advertising Budget Allocation Policy

In this section, we first discuss the reasoning behind our policy, then present the algorithm which gives it effect.

4.1 Idea of the Policy

Our policy adopts a learning-while-doing procedure. We divide the selling season into two phases, a learning phase (exploration) and an advertising phase (exploitation). In the learning phase, we allocate a portion of the budget to testing different advertising expenses, using the realized sales and revenues to dynamically estimate and approach the solution A^D of the full information problem, which is shown to be asymptotically optimal. In the advertising phase, we apply the even-level advertising policy with the learned A^D.

We initially learn the sales response function by testing different advertising expenses. This learning incurs an exploration loss, because the tested amounts are not optimal while both the limited budget and the inventory are being depleted. To reduce the exploration loss, we divide the learning phase into several intervals. In each interval, we test a grid of expenses within a range that contains A^D with high probability. Based on the resulting sales and revenue observations, we obtain a more accurate estimate of A^D. We then shrink the spending range to a subrange that still contains A^D with high probability. We repeat this shrinking procedure until the spending range is sufficiently small to enable a reasonably accurate estimate of A^D.

Specifically, the learning phase can be viewed as a positioning procedure for A^D, which is the minimum of three quantities, B/T, A^u and a(x/T). In particular, A^u and a(x/T), which depend on the unknown sales response function, are unknown. The positioning of A^u (the global maximizer of the revenue rate function) relies on the realized revenues of the tested expenses, and the second-order information of the revenue rate function is used. On the other hand, a(x/T) is the advertising expense that corresponds to the demand rate x/T, and the estimation of a(x/T) relies on the realized demand of the tested expenses. Therefore the estimation accuracies of A^u and a(x/T) are different. As a result, depending on whether we need to estimate A^u, we use different shrinking procedures to minimize the exploration loss. The learning of A^D therefore involves two tasks. The first task is to determine which one of A^u, a(x/T) and B/T is the smallest. The second task is to obtain increasingly accurate estimates of A^u and a(x/T).


4.2 Algorithm

Our policy is based on four sequences of tuning parameters, τ_i^u, κ_i^u, i = 1, ..., N^u, and τ_i^c, κ_i^c, i = 1, ..., N^c. τ_i^u defines the length of the i-th learning interval when A^u is in the current spending range (with high probability), and κ_i^u is the number of expenses to experiment with in interval τ_i^u. τ_i^c defines the length of the i-th learning interval when A^u has already been eliminated from the spending range, and κ_i^c is the number of expenses to experiment with in interval τ_i^c. For p = u, c, define t_i^p = Σ_{j=1}^{i} τ_j^p for i = 0, ..., N^p; put Δ_i^p = τ_i^p/κ_i^p, and t_{i,j}^p = t_{i−1}^p + jΔ_i^p, j = 1, ..., κ_i^p. Define A_1^u = A_1^c = 0, Ā_1^u = Ā and Ā_1^c = min{Ā, B/T}.

Then, for a problem of size n, we do the following (the tuning parameters depend on n; for ease of presentation, we make the dependency implicit).

Algorithm π_1 = π({τ_i^u, κ_i^u}_{i=1,...,N^u}, {τ_i^c, κ_i^c}_{i=1,...,N^c})

Step 1. Learn A^u, or determine that either A^D = B/T or A^u > a(x/T): For i = 1, ..., N^u do

(a) Divide [A_i^u, Ā_i^u] into κ_i^u equally spaced intervals of length L_i^u = (Ā_i^u − A_i^u)/κ_i^u and let {A_{ij}^u, j = 1, ..., κ_i^u} be the left endpoints of these intervals.

(b) Apply A_{ij}^u from t_{i,j−1}^u to t_{i,j}^u, j = 1, ..., κ_i^u. If inventory runs out, apply ∅ and stop.

(c) Compute
$$\hat{d}(A_{ij}^u) = \frac{\text{total demand over } [t_{i,j-1}^u,\, t_{i,j}^u)}{n\Delta_i^u}, \quad j = 1, \dots, \kappa_i^u, \qquad (7)$$
$$\hat{A}_i^u = \mathop{\arg\max}_{1 \le j \le \kappa_i^u}\big\{w\hat{d}(A_{ij}^u) - A_{ij}^u\big\} \quad \text{and} \quad \hat{A}_i^c = \mathop{\arg\min}_{1 \le j \le \kappa_i^u}\big|\hat{d}(A_{ij}^u) - x/T\big|.$$

(d) If
$$B/T < \min\{\hat{A}_i^u, \hat{A}_i^c\} - (\log n)^{\epsilon/2} L_i^u, \qquad (8)$$
then break from Step 1, enter Step 3(a) and set i_0 = i and i_1 = a. Else if
$$\hat{A}_i^c < \hat{A}_i^u - 2(\log n)^{\epsilon/2} L_i^u \qquad (9)$$
or
$$B/T < \hat{A}_i^u - (\log n)^{\epsilon/2} L_i^u, \qquad (10)$$
then break from Step 1, enter Step 2 and set i_0 = i and i_1 = b.
Otherwise, set Â_i = min{Â_i^u, Â_i^c}. Define I_{i+1}^u = [A_{i+1}^u, Ā_{i+1}^u] to be the spending range for the next iteration, where
$$A_{i+1}^u = \big\{\hat{A}_i - 2(\log n)^{\epsilon} L_i^u/3\big\}^+ \quad \text{and} \quad \bar{A}_{i+1}^u = \min\big\{\hat{A}_i + (\log n)^{\epsilon} L_i^u/3,\ \bar{A}\big\}.$$

(e) If i = N^u, then set i_0^u = N^u + 1 and enter Step 3(b).

Step 2. Learn min{a(x/T), B/T} when A^u > min{a(x/T), B/T}: For i = 1, ..., N^c do

(a) Divide [A_i^c, Ā_i^c] into κ_i^c equally spaced intervals of length L_i^c and let {A_{ij}^c, j = 1, ..., κ_i^c} be the left endpoints of these intervals.

(b) Apply A_{ij}^c from t_{i,j−1}^c to t_{i,j}^c, j = 1, ..., κ_i^c. If inventory runs out, apply ∅ and stop.

(c) Compute
$$\hat{d}(A_{ij}^c) = \frac{\text{total demand over } [t_{i,j-1}^c,\, t_{i,j}^c)}{n\Delta_i^c}, \quad j = 1, \dots, \kappa_i^c, \quad \text{and} \quad \hat{W}_i^c = \mathop{\arg\min}_{1 \le j \le \kappa_i^c}\big|\hat{d}(A_{ij}^c) - x/T\big|.$$

(d) Set I_{i+1}^c = [A_{i+1}^c, Ā_{i+1}^c] to be the spending range for the next iteration, where
$$A_{i+1}^c = \big\{\hat{W}_i^c - (\log n)^{\epsilon} L_i^c/2\big\}^+ \quad \text{and} \quad \bar{A}_{i+1}^c = \min\big\{\hat{W}_i^c + (\log n)^{\epsilon} L_i^c/2,\ B/T\big\}.$$

(e) If i = N^c, then enter Step 3(c).

Step 3. Advertising: (a) Define Ã = B/T; (b) Define Ã = min{Â_{N^u}^u, Â_{N^u}^c} − (log n)^ε L_{N^u}^u; (c) Define Ã = Ŵ_{N^c}^c. Then spend Ã for the rest of the selling season until the product is out of stock or there is no more advertising budget.

Explanations. In each learning interval of Step 1, we test several grid points of expenses within the spending range specified in the previous learning interval. After observing the realized demand and revenue of the tested expenses, we form estimates of A^u and A^c. We then identify whether A^D = B/T or A^u > a(x/T) (with high probability). This is done by the three "triggers" (8)-(10). First, we check whether B/T is less than the lower confidence bounds of A^u and a(x/T). If (8) holds, then A^D = B/T with high probability, and we stop learning. Second, we compare the upper and lower confidence bounds of A^u and a(x/T). If either (9) or (10) holds, then A^D < A^u with high probability. If so, there is no need to learn A^u further, and the algorithm switches to Step 2 to learn A^D = min{a(x/T), B/T}. In this case, because λ(A) is increasing, we have A^D = argmin_{A ∈ [0, B/T]} |λ(A) − x/T|; this quantity is what we attempt to learn in Step 2, and thus we set Ā_1^c = B/T. Otherwise, i.e., if none of the three conditions is triggered, we shrink the spending range and proceed to the next learning interval. If none of the three conditions is triggered before Step 1 stops, then the estimate of A^u is either equal to A^D or so close to A^D that applying it results in near-optimal performance.

To ensure that A^u is within the spending range, we set the first spending interval to [0, Ā]. Certainly, if Ā is large, we over-allocate budget to the first learning interval, but since this learning interval is short, the amount of overused budget will not be too high. From the second learning interval onward, since condition (8) is checked by the algorithm, if it has not been triggered in the previous intervals we know that A^D cannot be much larger than B/T. Note that the testing expenses are equally spaced grid points of the spending ranges, and we shrink the spending range I_{i+1}^u asymmetrically so that more grid points are selected from the left-hand side of Â_i, which is closer to B/T. Consequently, the average expense in the learning intervals is less than B/T. This adjustment also guarantees that inventory consumption in the learning phase will not be too high. Meanwhile, this fine-tuning of the spending range does not influence the order of the revenue in the learning phase.

In Step 3, an even-level advertising policy is applied. There are three cases, corresponding to A^D being equal to B/T, A^u and a(x/T), respectively. Note that after the learning phase we do not update the remaining budget and time, although the budget allocated to the learning phase may be less than B/T on average. Nevertheless, since the total budget allocated to the learning phase represents only a small portion of the total budget, such updating would not improve the asymptotic performance of the algorithm. Of course, in practice, if the algorithm breaks from Step 1 and enters Step 3(a), it is better for the marketer to allocate all the remaining budget equally over the remainder of the selling season.

Finally, we remark that the main difference between our algorithm and the DPA developed by Wang et al. (2014) lies in part (d) of Step 1. In their algorithm, there is only one condition, corresponding to condition (9) here. In fact, if the advertising budget B is sufficiently large (i.e., there is only the inventory constraint), then conditions (8) and (10) will never be triggered, and our algorithm becomes very similar to theirs. On the other hand, if the initial inventory x is sufficiently large (i.e., there is only the advertising budget constraint), then there is a high probability that Â_i^c ≥ Â_i^u, and thus conditions (9) and (10) will not be triggered, meaning that Step 2 will not be initiated.

large (i.e., there is only the advertising budget constraint), there is then a high probability that bc ≥ A bu , and thus conditions (9) and (10) will not be triggered. This means that Step 2 will not A i i be initiated.

5 Performance of the Policy

In this section, we first establish the upper bound on the regret of our policy, which shows that our policy is asymptotically optimal. Then we provide a lower-bound example to show that our policy achieves near-best asymptotic performance.

5.1 Upper Bound on the Regret of the Policy

We first discuss how to determine the best tuning parameters τ_i^u, κ_i^u, N^u, τ_i^c, κ_i^c and N^c. Recall that there is an exploration loss when we do the testing. On top of the exploration loss, the algorithm suffers two additional losses. First, there is a discretization loss, because each spending range is a continuous set and we can only test a grid of points. Second, there is a stochastic loss because of the random nature of the demand process (i.e., only noisy observations of the sales response are available). The reader is referred to Wang et al. (2014) for more detailed discussions of these three errors. In general, the tuning parameters are set such that these three errors are balanced.

First, in the i-th learning interval, the magnitude of the discretization loss depends on the grid spacing used to divide the spending range (L_i^u and L_i^c for Steps 1 and 2, respectively). On the other hand, the magnitude of the stochastic loss can be bounded by √(log n)·√(κ_i^u/(nτ_i^u)) and √(log n)·√(κ_i^c/(nτ_i^c)) for Steps 1 and 2, respectively. We balance these two losses, that is,
$$(L_i^u)^2 \sim \sqrt{\log n \cdot \frac{\kappa_i^u}{n\tau_i^u}}, \quad \forall\, i = 1, \dots, N^u, \qquad (11)$$
$$L_i^c \sim \sqrt{\log n \cdot \frac{\kappa_i^c}{n\tau_i^c}}, \quad \forall\, i = 1, \dots, N^c, \qquad (12)$$

where the notation f ∼ g means that f and g are of the same order in n. Note that the discretization loss in Step 1 is quadratic in the grid granularity, due to the local behavior of the revenue rate function around A^u. Specifically, by the Taylor expansion of r(·), we obtain
$$\frac{1}{2} m_U (A - A^u)^2 \le |r(\lambda(A)) - r(\lambda(A^u))| \le \frac{1}{2} m_L (A - A^u)^2. \qquad (13)$$

As a result, the estimation accuracy of A^D in the i-th learning interval can be bounded by (log n)^{ε/2} (L_i^u)^2 and (log n)^{ε/2} L_i^c for Steps 1 and 2, respectively. Second, to guarantee that A^D will not be eliminated when we shrink the spending range, we set
$$\bar{A}_{i+1}^u - A_{i+1}^u \sim (\log n)^{\epsilon} L_i^u, \quad \forall\, i = 1, \dots, N^u - 1, \qquad (14)$$
$$\bar{A}_{i+1}^c - A_{i+1}^c \sim (\log n)^{\epsilon} L_i^c, \quad \forall\, i = 1, \dots, N^c - 1, \qquad (15)$$

which is of order (log n)^{ε/2} greater than the gap between the estimate and A^D. Third, the exploration loss in each learning interval can be bounded by the deviation of the revenue rate multiplied by the length of the learning interval. We want this loss to be of the same order for each learning interval, in order to maximize learning efficiency. Intuitively, as the spending range becomes shorter, the deviations of the revenue rate become smaller, so we increase the data collection time to improve the estimation of A^D. That is,
$$\tau_{i+1}^u \cdot (\bar{A}_{i+1}^u - A_{i+1}^u)^2 = \tau_{i+1}^u \cdot (\log n)^{2\epsilon} (L_i^u)^2 \sim \tau_1^u, \quad \forall\, i = 1, \dots, N^u - 1, \qquad (16)$$
$$\tau_{i+1}^c \cdot (\bar{A}_{i+1}^c - A_{i+1}^c) = \tau_{i+1}^c \cdot (\log n)^{\epsilon} L_i^c \sim \tau_1^c, \quad \forall\, i = 1, \dots, N^c - 1. \qquad (17)$$

Finally, when the estimate of A^D is accurate enough, the shrinking procedure can stop. Recall that the revenue rate deviation of the estimate can be upper bounded by (log n)^{ε/2} (L_i^u)^2 ((log n)^{ε/2} L_i^c, resp.). When this is less than τ_1^u (τ_1^c, resp.), the targeted accuracy is achieved. Now, solving (11)-(17), we obtain the following sets of parameters:
$$\tau_i^u = n^{-(1/2)(3/5)^{i-1}} \cdot (\log n)^{5\epsilon+1}, \quad \kappa_i^u = n^{(1/10)(3/5)^{i-1}} \cdot (\log n)^{\epsilon}, \quad i = 1, \dots, N^u, \qquad (18)$$
$$\tau_i^c = n^{-(1/2)(2/3)^{i-1}} \cdot (\log n)^{3\epsilon+1}, \quad \kappa_i^c = n^{(1/6)(2/3)^{i-1}} \cdot (\log n)^{\epsilon}, \quad i = 1, \dots, N^c. \qquad (19)$$
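As a quick sanity check that the exponents in (18) balance the requirement (11), take i = 1, note that L_1^u is of order Ā/κ_1^u, and suppress constants:
$$
(L_1^u)^2 \sim \Big(\tfrac{\bar{A}}{\kappa_1^u}\Big)^2 \sim n^{-1/5}(\log n)^{-2\epsilon},
\qquad
\sqrt{\log n\cdot\frac{\kappa_1^u}{n\,\tau_1^u}}
= \sqrt{\frac{n^{1/10}(\log n)^{1+\epsilon}}{n^{1/2}(\log n)^{5\epsilon+1}}}
= n^{-1/5}(\log n)^{-2\epsilon},
$$
so the two sides of (11) are indeed of the same order at i = 1; the analogous computation with τ_1^c and κ_1^c from (19) verifies (12).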

Then, by N^u = min_i {i : (log n)^{ε/2} (L_i^u)^2 < τ_1^u} and N^c = min_i {i : (log n)^{ε/2} L_i^c < τ_1^c}, we obtain
$$N^u = \left\lceil \log_{5/3}\left(\frac{\log n}{(13\epsilon + 2)\log\log n}\right) \right\rceil \quad \text{and} \quad N^c = \left\lceil \log_{3/2}\left(\frac{\log n}{(7\epsilon + 2)\log\log n}\right) \right\rceil. \qquad (20)$$
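The schedules (18)-(19) are easy to evaluate numerically. The sketch below (hypothetical function name) is a direct transcription of those two displays, with N^u and N^c supplied by the caller as in (20); the reported experiments may round or otherwise adjust these quantities, so the values are only indicative.

```python
import math

def learning_schedule(n, eps, N_u, N_c):
    """Transcription of the tuning schedules (18)-(19); N_u and N_c as fixed by (20)."""
    log_n = math.log(n)
    tau_u = [n ** (-0.5 * (3 / 5) ** i) * log_n ** (5 * eps + 1) for i in range(N_u)]
    kap_u = [n ** (0.1 * (3 / 5) ** i) * log_n ** eps for i in range(N_u)]
    tau_c = [n ** (-0.5 * (2 / 3) ** i) * log_n ** (3 * eps + 1) for i in range(N_c)]
    kap_c = [n ** ((1 / 6) * (2 / 3) ** i) * log_n ** eps for i in range(N_c)]
    return tau_u, kap_u, tau_c, kap_c

# Base case of Section 6: n = 10^6, eps = 0.01, with four Step-1 and three Step-2 iterations.
tau_u, kap_u, tau_c, kap_c = learning_schedule(10 ** 6, 0.01, N_u=4, N_c=3)
print([round(t, 4) for t in tau_u])   # close to the tau_i reported in Tables 2-3 (0.0158, 0.2497, 1.3103, 3.5431)
print([round(k, 2) for k in kap_u])   # roughly 4.1, 2.4, 1.7, 1.4, which round to kappa_i = 4, 2, 2, 1
```

Note how τ_i^u grows and κ_i^u shrinks with i: later intervals spend more time on fewer, better-located test expenses, which is exactly the behavior observed in the sample runs of Section 6.1.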

Theorem 1 The sequence of policies {π^n} with parameters given by (18), (19) and (20) is asymptotically optimal. In particular,
$$\sup_{\lambda \in \mathcal{L}} R_n^{\pi}(T, B; \lambda) \le \frac{C(\log n)^{5\epsilon+2}}{\sqrt{n}} \quad \text{as } n \to \infty$$
for some constant C.

The constant C above is independent of the problem size n and depends only on the parameters characterizing the class L, the initial inventory x, the length of selling season T , and the advertising budget B. The exact dependence is complex and is omitted here.

5.2 A Lower Bound on Achievable Regret

Next, we establish a lower bound on the minimax regret by presenting a class of "bad" sales functions, showing that our algorithm achieves near-best asymptotic performance.

Theorem 2 Let λ(A; z) = 2z√(A + 1/2) − 2z + 2, where z is a parameter taking values in the set Z = [3/4, 1]. Let w = 1, Ā = 1/2, T = 1, B = 1. Then this class of sales functions satisfies the assumptions in our model. Furthermore, for any admissible policy π and for all n ≥ 1,
$$\sup_{z \in Z} R_n^{\pi}(T, B; \lambda) \ge \frac{3}{32(96)^2\sqrt{n}}.$$

The key feature of the functions in this class is that they cross at a common point, which impedes demand learning. In previous studies of dynamic pricing problems, this common point was called an uninformative point (Broder and Rusmevichientong (2012) and Wang et al. (2014)). When there is an uninformative point, the tension between demand learning (exploration) and best-guess promotion (exploitation) forces the worst-case regret of any policy to be of order 1/√n. Intuitively, since all sales functions in the set cross at the uninformative point, a policy with a good capability of identifying the underlying parameter of the sales function must necessarily spend advertising effort away from the uninformative expense; however, this incurs a large regret when the underlying parameter is indeed the one associated with the uninformative expense. On the other hand, if a policy lacks discrimination between the sales functions, it must also incur a cost in regret.
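The crossing point can be read off directly from the family defined in Theorem 2 (a short verification, not stated explicitly in the text above): at A = 1/2 = Ā we have √(A + 1/2) = 1, so
$$\lambda(1/2;\, z) = 2z\sqrt{1} - 2z + 2 = 2 \quad \text{for every } z \in [3/4, 1].$$
Hence observing demand at the expense A = 1/2 reveals nothing about z, while the members of the family separate at every other expense level; this is the uninformative point that drives the lower bound.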

6 Numerical Study

In this section, we numerically illustrate the performance of the algorithm from several angles. In Section 6.1, we show three sample runs corresponding to the three different scenarios of Step 3 of our algorithm. Then, in Section 6.2, we conduct more extensive numerical experiments with different underlying demand functions to illustrate the asymptotic performance of our algorithm and to glean insights on its implementation.

6.1 Sample Runs of the Algorithm

We first show three sample runs to depict the three different "regimes" of the algorithm. We assume the initial advertising range [A, Ā] = [0, 10], the promotion season T = 10, the unit margin w = 2, the parameter ε = 0.01 and the market size n = 10^6. The underlying demand function is λ(A) = 2√A (we consider more demand functions in the next subsection). Under this demand function, the revenue rate function is r(λ(A)) = 4√A − A, and thus the global maximizer is A^u = 4. By varying the initial inventory level and the advertising budget, we obtain the three different cases presented in the following tables, where each row represents:

• Iter. no.: indicates which iteration is running in the scenario;
• Step no.: corresponds to the current status (Step 1, 2 or 3);
• τ_i: represents the time spent in this iteration;
• [A_i, Ā_i]: shows the current advertising expense range, or the value of Ã in the last iteration;
• κ_i: shows the number of advertising expenses tested in this learning interval τ_i;
• Â^u: represents the empirical optimal advertising expense for the revenue-maximizing criterion;
• Â^c: represents the empirical optimal advertising expense for the inventory-depleting criterion;
• Cum. rev.: corresponds to the cumulative revenue up to the end of this iteration.

Table 2: The initial inventory level x = 40 and the advertising budget B = 30. The resulting B/T = 3 and a(x/T) = 4, so A^D = min{A^u, a(x/T), B/T} = B/T = 3 and nT r^D = 3.9282e7.

  Iter. no.   | 1         | 2                  | 3
  Step no.    | Step 1    | Step 1             | Step 3
  τ_i         | 0.0158    | 0.2497             | 9.7346
  [A_i, Ā_i]  | [0, 10]   | [3.2890, 5.8555]   | Apply 3
  κ_i         | 4         | 2                  | N/A
  Â^u         | 5.0000    | 4.5722             | N/A
  Â^c         | 5.0000    | 4.5722             | N/A
  Cum. rev.   | 4.4585e4  | 1.0391e6           | 3.8966e7

In the first instance, the algorithm runs two iterations in Step 1 and identifies that A^D = B/T (with high probability), then directly enters Step 3, the advertising exploitation phase. Specifically, in the first iteration, the resulting estimates of A^u and a(x/T) are both 5, which is greater than B/T = 3; however, the current estimates of A^u and A^c are not yet accurate enough to claim A^D = B/T. The spending range then shrinks to [3.2890, 5.8555]. In the second iteration, the estimates of A^u and a(x/T) are both 4.5722, which is already accurate enough to decide A^D = B/T, i.e., condition (8) is satisfied. Therefore, the algorithm enters Step 3, which means that we no longer need to explore (compute estimates of A^u and A^c from κ tested advertising expenses); this is indicated by the N/A entries in the last column of Table 2. We apply the estimated A^D = 3 in the remaining time 9.7346 until the inventory or the advertising budget runs out. In this instance, the regret is 0.8%.

Table 3: The initial inventory level x = 60 and the advertising budget B = 60. The resulting a(x/T) = 9 and B/T = 6, so A^D = min{A^u, a(x/T), B/T} = A^u = 4 and nT r^D = 4e7.

  Iter. no.   | 1         | 2                  | 3                  | 4                  | 5
  Step no.    | Step 1    | Step 1             | Step 1             | Step 1             | Step 3
  τ_i         | 0.0158    | 0.2497             | 1.3103             | 3.5431             | 4.8811
  [A_i, Ā_i]  | [0, 10]   | [3.2890, 5.8555]   | [3.6940, 5.0114]   | [3.9019, 4.5781]   | Apply 3.2076
  κ_i         | 4         | 2                  | 2                  | 1                  | N/A
  Â^u         | 5.000     | 4.5722             | 4.3527             | 3.9019             | N/A
  Â^c         | 7.5000    | 4.5722             | 4.3527             | 3.9019             | N/A
  Cum. rev.   | 4.3969e4  | 1.0337e5           | 6.2631e6           | 2.0425e7           | 3.9728e7

In the second instance, the algorithm runs four iterations in Step 1 and then enters Step 3. Neither condition (8) nor the regime-switching conditions (9) and (10) are triggered in this case, so the algorithm learns A^u the whole time (N^u iterations) before utilizing the estimator. In Step 3, it simply exploits the sufficiently accurate estimate A^D = 3.2076; this is why there are three N/A entries in the last column of Table 3. The regret for this instance is 0.68%.

Table 4: The initial inventory level x = 20 and the advertising budget B = 20. The resulting a(x/T) = 1 and B/T = 2, so A^D = min{A^u, a(x/T), B/T} = a(x/T) = 1 and nT r^D = 3e7.

  Iter. no.   | 1         | 2         | 3                  | 4                  | 5
  Step no.    | Step 1    | Step 2    | Step 2             | Step 2             | Step 3
  τ_i         | 0.0158    | 0.0149    | 0.1495             | 0.6938             | 9.1260
  [A_i, Ā_i]  | [0, 10]   | [0, 10]   | [0.4867, 1.5133]   | [0.9973, 1.2081]   | Apply 0.9973
  κ_i         | 4         | 10        | 5                  | 3                  | N/A
  Â^u         | 5         | N/A       | N/A                | N/A                | N/A
  Â^c         | 2.5       | 1         | 1.1027             | 0.9973             | N/A
  Cum. rev.   | 4.3841e4  | 4.7443e4  | 4.7179e5           | 2.5998e6           | 2.9924e7

In the last instance, after the first iteration, the resulting estimates of A^u and a(x/T) are 5 and 2.5, respectively. The algorithm identifies that A^u > min{a(x/T), B/T} (with high probability); a simple calculation shows that the regime-switching condition (10) is triggered. The algorithm then enters Step 2 to learn min{a(x/T), B/T} and no longer estimates A^u, as indicated by the N/A entries in the Â^u row of Table 4. After three iterations in Step 2, the algorithm obtains the sufficiently accurate estimate A^D = 0.9973 and enters Step 3 to apply it over the remaining horizon; the N/A entries in the last column of Table 4 indicate that the three statistics are no longer computed. The regret for this case is 0.25%.

As can be seen, in all three instances the final estimate Ã is fairly close to A^D and results in a small regret. This means that our algorithm is efficient when the market size is relatively large. Moreover, in all instances we can observe that the learning time τ_i is increasing in i, and the number of tested expenses κ_i is decreasing in i. Intuitively, as the range [A_i, Ā_i] gets smaller, the tested expenses are close to the optimal allocation A^D. We can therefore afford to spend more time learning in each interval and only need to test a few of the possible expenses.

6.2 Algorithm Performance

Next, we illustrate the performance of our algorithm with different types of underlying demand functions. As mentioned earlier, the empirical evidence in the literature favors concave demand functions. Here we consider three increasing and concave demand functions: (1) λ(A) = a√A, (2) λ(A) = a log(A + b), and (3) λ(A) = aA/(A + b), where a and b are parameters. The first two, the square-root and log functions, are very commonly used concave functions; recall also that the class of square-root demand functions is a worst-case example for our algorithm. The last is a fractional function which has been used by Zhang et al. (2014) to characterize the winning rate functions of real-time bidding for display advertising. Specifically, for the demand function λ(A) = aA/(A + b), a and A/(A + b) can be viewed, respectively, as the arrival rate of bid requests for display advertising and the probability of winning a bid request with bid price A. In this subsection, we consider the base case of problem parameters x = 20, B = 20, [A, Ā] = [0, 10], T = 10, ε = 0.01, w = 2 and n = 10^6. Following Wang et al. (2014), the performance of the algorithm is computed by averaging the regrets over 1000 repeated runs. Recall from Theorem 1 that the policy achieves a regret of O(n^{-1/2} log^{5ε+2} n).

Figure 1 depicts the relation between log(R_n^π(T, B; λ)) and log(n) under the three demand functions with different parameters when the market size n ranges from 10^2 to 10^6. We can see that the asymptotic performances of the algorithm under the three demand functions are all roughly close to a straight line with slope −1/2, with some gap that reflects the effect of the log term and the constant coefficient of the regret. This validates Theorem 1 numerically and shows that our algorithm behaves robustly under different types of demand functions with different parameters.

[Figure 1: Regret as a function of market size n. Three log-log panels plot the regret against n for λ(A) = a√A with a = 1, 2, 3; λ(A) = a log(A + b) with (a, b) = (3, 1), (4, 1.1), (5, 1.2); and λ(A) = aA/(A + b) with (a, b) = (9, 5), (10, 6), (11, 7).]

Figure 2 investigates the impact of ε on the performance of the algorithm under three demand functions: λ(A) = 2√A, λ(A) = 3 log(A + 1), and λ(A) = 10A/(A + 6). As can be seen, different values of ε result in significantly different regrets of the algorithm. In addition, the performance of the algorithm is not good when ε = 1/2. This is consistent with the observation made by Wang et al. (2014) that when ε = 1/2 the log n factor plays an overly dominant role. This observation also highlights the importance of choosing an appropriate power factor ε of the log n term in the algorithm for different problem settings, in order to obtain a practically efficient algorithm whose performance is also theoretically guaranteed. In particular, we find that ε = 0.01 provides consistently good performance in our tested cases.

[Figure 2: Impact of ε on the regret performance. Three panels plot the regret against ε ∈ [0, 0.5], one for each of the three demand functions above.]

Finally, to glean managerial implications of our model and policy, and to illustrate the highlight of our algorithm, namely that it takes the advertising budget and the inventory constraint into account simultaneously, we conduct a numerical study comparing the regret performance of the following three algorithms: (1) the algorithm that considers both the advertising budget and the inventory constraint (our original algorithm); (2) the algorithm that considers the advertising budget while ignoring the existence of the inventory constraint, i.e., we set the initial-inventory parameter in our algorithm sufficiently large, but the implementation of the algorithm is still subject to the actual inventory constraint; and (3) the algorithm that considers the inventory constraint while ignoring the existence of the advertising budget, i.e., we set the advertising-budget parameter in our algorithm sufficiently large, but the implementation of the algorithm is still subject to the actual advertising budget. In the following, for ease of reference, we refer to these three algorithms as algorithms B&I, BO and IO, respectively.

Figure 3 presents the regret performance of the three algorithms under the three demand functions when the advertising budget varies from 5 to 40. The underlying demand functions are the same as in the previous figure. We can observe that when the advertising budget is very low, algorithm BO, which considers only the advertising budget, has the lowest regret, whereas as the advertising budget increases, the regrets of algorithms B&I and IO quickly decrease; in particular, once the advertising budget exceeds a threshold, the regret of algorithm BO increases dramatically, while the regrets of the other two algorithms become relatively stable. Meanwhile, when the advertising budget is in the middle range, the regret of algorithm B&I is the lowest, and when the advertising budget is sufficiently high, the regrets of algorithms B&I and IO coincide. Intuitively, when the advertising budget is very low, because the advertising expense over time is very limited, the inventory will be redundant, and thus the algorithm that ignores the inventory constraint becomes very efficient, whereas the other two algorithms are "misled" by it and perform worse. On the other hand, when the advertising budget is high, the inventory becomes the tight resource, and the algorithm that ignores the existence of the inventory constraint over-advertises and incurs a high regret. Moreover, when the advertising budget is very high, the budget constraint is never binding, so ignoring its existence does not affect the performance of the algorithm.


[Figure 3: Regret performances of the three algorithms (B&I, BO, IO) as the advertising budget B varies from 5 to 40, one panel per demand function.]

7 Conclusion and Discussion

In this paper, we consider the advertising budget optimization problem for revenue management with an inventory constraint and an unknown sales response function. We discuss how to allocate a pre-specified advertising budget to learn the sales response of the market and then to earn with the learned knowledge, while at the same time balancing the inventory between exploration and exploitation. We propose a dynamic learning-while-doing policy for the problem. By constructing a worst-case example, we show that our policy achieves near-best asymptotic performance.

There are several fruitful directions for future research. In this paper, we consider a general Poisson demand process; it would be interesting to study specific advertising scenarios and design tailored learning algorithms for them. For example, consider sponsored search advertising. Suppose the auctions (opportunities to earn a click) arrive according to a Poisson process with rate λ. The marketer places a bid A for each auction (A may differ across auctions). If the marketer's bid exceeds the price given by the auction mechanism, the marketer wins an impression. Then, with some probability α (the click-through rate), the impression results in a click and the price is charged to the budget, and with another probability β (the conversion rate), the click results in a sale. If we assume the market prices are i.i.d. draws from a distribution function F(·) (e.g., Amin et al. 2012), then the sales process is a Poisson process with arrival rate λαβF(A). Hence one can consider the advertising budget allocation problem with a semi-parametric demand process (namely, a known functional structure of the arrival rate but unknown parameters λ, α, β and F(A)) and design a tailored learning policy for the problem, in particular forming a tailored estimator for each of the parameters.
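As a rough illustration (not part of the paper's analysis), the following sketch simulates the sponsored-search sales process just described; the parameter values and the exponential price distribution are arbitrary choices for the example. The realized sales form a thinned Poisson process with rate λαβF(A).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (all unknown to the marketer in the model).
lam, alpha, beta = 50.0, 0.3, 0.1        # auction rate, click-through rate, conversion rate
price_cdf = lambda a: 1.0 - np.exp(-a)   # F: market prices ~ Exp(1), assumed for the example

def simulate_sponsored_search(A, T):
    """Simulate sales and advertising spend over a horizon T at a constant bid A."""
    n_auctions = rng.poisson(lam * T)
    prices = rng.exponential(1.0, size=n_auctions)    # i.i.d. market prices with CDF F
    wins = prices < A                                 # win an impression if the bid exceeds the price
    clicks = wins & (rng.random(n_auctions) < alpha)  # impression -> click w.p. alpha
    sales = clicks & (rng.random(n_auctions) < beta)  # click -> sale w.p. beta
    spend = prices[clicks].sum()                      # the market price is charged per click
    return sales.sum(), spend

# By Poisson thinning, realized sales are Poisson with rate lam * alpha * beta * F(A).
A, T = 2.0, 100.0
realized_sales, _ = simulate_sponsored_search(A, T)
print(realized_sales, lam * alpha * beta * price_cdf(A) * T)  # realized vs. expected sales
```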


Another attractive option would be to consider the nonparametric budget allocation model with multiple types of advertising campaigns (for example, the multiple search advertising markets studied by Yang et al. 2015). A further interesting direction would be to consider the goodwill effect of advertising (Nerlove and Arrow, 1962), i.e., a market response that depends on the total (discounted) advertising expense up to time t; a discretized sketch of the goodwill dynamics is given after this paragraph. While these extensions would make the model more practically appealing, they would also be difficult to incorporate into the model's design. In particular, the optimal solutions of the deterministic relaxations of these problems are likely to be complicated and even time-dependent, and thus the current learning framework cannot be applied directly.
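The sketch below (with an arbitrary decay rate and spend path) discretizes the Nerlove-Arrow goodwill dynamics; it is meant only to illustrate that, under this extension, the demand rate would depend on a time-varying state rather than on the instantaneous advertising level, which is why the deterministic relaxation becomes time-dependent.

```python
import numpy as np

def goodwill_path(ad_spend, delta=0.1, dt=1.0, g0=0.0):
    """Euler discretization of the Nerlove-Arrow dynamics G'(t) = A(t) - delta * G(t).
    Under the goodwill extension, the demand rate at time t would depend on the
    accumulated goodwill G(t) rather than on the current advertising level alone."""
    g, path = g0, []
    for a in ad_spend:
        g = g + dt * (a - delta * g)
        path.append(g)
    return np.array(path)

# Constant spend builds goodwill toward the steady state A / delta = 50.
print(goodwill_path([5.0] * 50)[-1])   # approaches 50 as the horizon grows
```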

Acknowledgments

Chaolin Yang is partially supported by the National Natural Science Foundation of China (NSFC) [Grant NSFC-71601103] and the Program for Innovative Research Team of Shanghai University of Finance and Economics.

References

[1] Amin, K., Kearns, M., Key, P. and Schwaighofer, A. (2012). Budget optimization for sponsored search: Censored learning in MDPs. 28th Conf. Uncertainty Artificial Intelligence (AUAI Press, Catalina Island, CA), 54-63.
[2] Araman, V. F., Caldentey, R. (2009). Dynamic pricing for nonperishable products with demand learning. Operations Research, 57(5), 1169-1188.
[3] Balseiro, S. R., Besbes, O. and Weintraub, G. Y. (2015). Repeated auctions with budgets in Ad exchanges: Approximations and design. Management Science, 61(4), 864-884.
[4] Besbes, O. and Zeevi, A. (2009). Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6), 1407-1420.
[5] Besbes, O. and Zeevi, A. (2012). Blind network revenue management. Operations Research, 60(6), 1537-1550.
[6] Besbes, O. and Zeevi, A. (2015). On the (surprising) sufficiency of linear models for dynamic pricing with demand learning. Management Science, 61(4), 723-739.
[7] Bremaud, P. (1980). Point Processes and Queues: Martingale Dynamics (Springer-Verlag, Berlin).
[8] Broder, J. and Rusmevichientong, P. (2012). Dynamic pricing under a general parametric choice model. Operations Research, 60(4), 965-980.
[9] Cai, H., Ren, K., Zhang, W., Malialis, K., Wang, J., Yu, Y. and Guo, D. (2017). Real-time bidding by reinforcement learning in display advertising. 10th ACM International Conference on Web Search and Data Mining (ACM Publications, New York, USA), 661-670.


[10] Cheung, W., Simchi-Levi, D. and Wang, H. (2017). Technical note–Dynamic pricing and demand learning with limited price experimentation. Operations Research, 65(6), 1722-1731.
[11] den Boer, A. and Zwart, B. (2014). Simultaneously learning and optimizing using controlled variance pricing. Management Science, 60(3), 770-783.
[12] den Boer, A. (2015). Dynamic pricing and learning: Historical origins, current research and new directions. Surveys in Operations Research and Management Science, 20(1), 1-18.
[13] Doyle, P. and Saunders, J. (1990). Multiproduct advertising budgeting. Marketing Science, 9(2), 97-113.
[14] Du, R., Hu, Q. and Ai, S. (2007). Stochastic optimal budget decision for advertising considering uncertain sales responses. European Journal of Operational Research, 183(3), 1042-1054.
[15] Fischer, M., Albers, S., Wagner, N. and Frie, M. (2011). Dynamic marketing budget allocation across countries, products, and marketing activities. Marketing Science, 30(4), 568-582.
[16] Fruchter, G. E. and Dou, W. (2005). Optimal budget allocation over time for keyword ads in web portals. Journal of Optimization Theory and Applications, 124(1), 157-174.
[17] Gallego, G. and van Ryzin, G. (1994). Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Science, 40(8), 999-1020.
[18] Gallego, G. and van Ryzin, G. (1997). A multiproduct dynamic pricing problem and its applications to network yield management. Operations Research, 45(1), 24-41.
[19] Ghose, A., Yang, S. (2009). An empirical analysis of search engine advertising: Sponsored search in electronic markets. Management Science, 55(10), 1605-1622.
[20] Hanssens, D. M., Parsons, L. J., Schultz, R. L. (1990). Market Response Models: Econometric and Time Series Analysis. Kluwer Academic Publishers, Boston.
[21] Harrison, J. M., Keskin, N. B. and Zeevi, A. (2012). Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Science, 58(3), 570-586.
[22] Holthausen Jr, D. M. and Assmus, G. (1982). Advertising budget allocation under uncertainty. Management Science, 28(5), 487-499.
[23] Kumar, S., Sethi, S. P. (2009). Dynamic pricing and advertising for web content providers. European Journal of Operational Research, 197, 924-944.
[24] Lodish, L. M., Curtis, E., Ness, M. and Simpson, M. K. (1988). Sales force sizing and deployment using a decision calculus model at Syntex Laboratories. Interfaces, 18(1), 5-20.
[25] Nerlove, M. and Arrow, K. J. (1962). Optimal advertising policy under dynamic conditions. Economica, 29(114), 129-142.
[26] Royo, C. B., Zhang, H., Almogro, J. (2013). Multistage multiproduct advertising budget. European Journal of Operational Research, 225, 179-188.
[27] Sethi, S. P. (1977). Optimal advertising for the Nerlove-Arrow model under a budget constraint. Operational Research Quarterly, 28(3), 683-693.
[28] Talluri, K. and van Ryzin, G. (2005). Theory and Practice of Revenue Management (Springer, New York).


[29] Tong, C. and Topaloglu, H. (2014). On the approximate linear programming approach for network revenue management problems. INFORMS Journal on Computing, 26(1), 121-134.
[30] Tran-Thanh, L., Stavrogiannis, L., Naroditskiy, V., Robu, V., Jennings, N. R. and Key, P. (2014). Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. 30th Conf. Uncertainty Artificial Intelligence (AUAI Press, Catalina Island, CA).
[31] Tull, D. S., Wood, V. R., Duhan, D., Gillpatrick, T., Robertson, K. R. and Helgeson, J. G. (1986). "Leveraged" decision making in advertising: The flat maximum principle and its implications. Journal of Marketing Research, 23(1), 25-32.
[32] Tunuguntla, V., Basu, P., Rakshit, K., Ghosh, D. (2019). Sponsored search advertising and dynamic pricing for perishable products under inventory-linked customer willingness to pay. European Journal of Operational Research, 276, 119-132.
[33] Vidale, M. L., Wolfe, H. B. (1957). An operations-research study of sales response to advertising. Operations Research, 5(3), 370-381.
[34] Wang, Z., Deng, S. and Ye, Y. (2014). Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research, 62(2), 318-331.
[35] Wu, D., Chen, X., Yang, X., Wang, H., Tan, Q., Zhang, X., Xu, J. and Gai, K. (2018). Budget constrained bidding by model-free reinforcement learning in display advertising. 27th ACM International Conference on Information and Knowledge Management (ACM Publications, New York, USA), 1443-1451.
[36] Yang, Y., Zeng, D., Yang, Y. and Zhang, J. (2015). Optimal budget allocation across search advertising markets. INFORMS Journal on Computing, 27(2), 285-300.
[37] Yang, Y., Zhang, J., Qin, R., Li, J., Liu, B. and Liu, Z. (2013). Budget optimization strategies in uncertain environments of search auctions: A preliminary investigation. IEEE Transactions on Services Computing, 6(2), 168-176.
[38] Zhang, D. and Weatherford, L. (2017). Dynamic pricing for network revenue management: A new approach and application in the hotel industry. INFORMS Journal on Computing, 29(1), 18-35.
[39] Zhang, W., Rong, Y., Wang, J., Zhu, T. and Wang, X. (2016). Feedback control of real-time display advertising. 9th ACM International Conference on Web Search and Data Mining (ACM Publications, New York, USA), 407-416.
[40] Zhang, W., Yuan, S. and Wang, J. (2014). Optimal real-time bidding for display advertising. 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM Publications, New York, USA), 1077-1086.
[41] Zhao, J., Qiu, G., Guan, Z., Zhao, W. and He, X. (2018). Deep reinforcement learning for sponsored search real-time bidding. 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (ACM Publications, New York, USA), 1021-1030.
