Synthetic accessibility assessment using auxiliary responses


Journal Pre-proof

Shun Ito, Yukino Baba, Tetsu Isomura, Hisashi Kashima

PII: S0957-4174(19)30823-1
DOI: https://doi.org/10.1016/j.eswa.2019.113106
Reference: ESWA 113106

To appear in: Expert Systems With Applications

Received date: 28 July 2018
Revised date: 15 October 2019
Accepted date: 28 November 2019

Please cite this article as: Shun Ito, Yukino Baba, Tetsu Isomura, Hisashi Kashima, Synthetic Accessibility Assessment Using Auxiliary Responses, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.113106

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Highlights
• We propose prediction methods of synthetic accessibility with semi-experts.
• Auxiliary responses that focus on the structure of targets lead to high performance.
• Our methods identify skilled semi-experts and problematic substructures.


Shun Ito (a), Yukino Baba (b), Tetsu Isomura (c), Hisashi Kashima (a)

(a) Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
(b) University of Tsukuba, 1-1-1, Tennodai, Tsukuba, 305-8573, Japan
(c) Mitsubishi Chemical Holdings Corporation, 1-1, Marunouchi 1-chome, Chiyoda-ku, Tokyo, 100-8251, Japan

Abstract

Despite recent advances in computational approaches to discovering new chemical compounds, synthetic accessibility assessment of designed compounds remains difficult to automate because it is a heavily knowledge-intensive task. A promising solution to such "AI-hard" tasks is collective intelligence approaches that aggregate the opinions of a group of human non-experts or semi-experts. However, existing aggregation methods rely only on the synthetic accessibility evaluation scores given by humans, and they do not exploit auxiliary information obtained as byproducts of human evaluations, such as information related to chemical structures. In this paper, we propose to exploit such auxiliary responses to obtain better aggregations. We introduce a new two-stage method for aggregating semi-expert judgments, which consist of synthetic accessibility evaluation scores along with auxiliary responses that select substructures of targets obstructive to their synthesis. The first stage divides both semi-experts and substructures into clusters using stochastic block models to identify similar skills or properties. The second stage aggregates judgments while considering groups of semi-experts and substructures, and predicts synthetic accessibility. Our experiments show that the use of auxiliary responses improves the prediction performance and gives insight into evaluators and the structure of evaluated compounds.

Keywords: Synthetic Accessibility, Semi-expert, Auxiliary Responses

Preprint submitted to Journal of LaTeX Templates, November 29, 2019

1. Introduction

Computational drug design enables large-scale screening to identify promising candidates that have desired properties from among many chemical compounds. The screening process consists of at least two steps: chemical property prediction and synthetic accessibility assessment. Even if a compound is predicted to have some desirable property, there is no guarantee that the compound can actually be synthesized. Unlike chemical property prediction, synthetic accessibility assessment remains a difficult task in computational chemistry, even though there have been various attempts at automatic assessment (Johnson et al., 1992; Gillet et al., 1995; Pförtner & Sitzmann, 2007; Boda et al., 2007; Ertl & Schuffenhauer, 2009; Fukunishi et al., 2014; Podolyan et al., 2010). Assessment requires deep domain knowledge and often depends on the personal professional experience of human chemists. Therefore, scaling up the use of expert knowledge and improving its accuracy are key to the further development of the entire drug discovery process.

Expert systems computationally help decision making by emulating professional decisions, but it is considerably hard for experts to express their implicit knowledge and skills explicitly. Computational machine learning models using target properties are one solution for finding common rules or knowledge, but their benefit is generally limited to specific data. This problem has motivated researchers to consider the collective intelligence of non-experts (Snow et al., 2008), which collects multiple responses from non-experts and aggregates them using statistical or machine learning techniques. Such redundancy of responses helps aggregation models to predict true expert knowledge with accuracy comparable to that of professionals without employing them.

Existing computational evaluation methods of synthetic accessibility consider structural complexity (Boda et al., 2007; Fukunishi et al., 2014), but they are not fully applicable in practical environments because of their limited performance. Crowdsourcing is a promising solution to scale up the use of human workforces and improve the accuracy of synthetic accessibility assessment. However, as previously mentioned, evaluators need more specialized knowledge or skills than average people to complete the assessment tasks, and finding a sufficient number of workers with such expertise on general crowdsourcing platforms is not easy.

To this end, we focus on semi-experts, who are not professional synthetic chemists but have substantial training experience in chemistry, e.g., researchers and students in molecular design and retired medicinal chemists. Utilizing the skills of semi-experts has been considered in several prior studies for various tasks (Antonis et al., 2003; Smuc et al., 2009). Baba et al. considered the use of semi-expert intelligence in synthetic accessibility assessment and showed that the reliability of judgments by one semi-expert is not sufficient; however, aggregating judgments by different semi-experts achieves an accuracy that is competitive with that of experts (Baba et al., 2018). We follow their approach and propose a judgment aggregation method for chemical semi-experts to predict synthesis possibility.

Another contribution of ours to expert systems is the introduction of structural properties of candidate compounds into the aggregation method. Some existing aggregation methods consider the similarity among target items and/or annotators (Venanzi et al., 2014; Lakkaraju et al., 2015; Baba et al., 2018), but they are based only on primary responses (in our case, synthetic accessibility scores). We extend this line of reasoning and consider the use of auxiliary information accompanying decisions on synthetic accessibility, where evaluators annotate obstructive substructures that inhibit the synthesis of compounds in addition to giving synthetic accessibility scores. Figure 1 shows a schematic illustration of these two types of responses. The choices of substructures provide clues for extracting the different skill levels of evaluators and the different properties of structures. Our model considers the selection of substructures as auxiliary responses when aggregating primary responses.

We propose a two-stage aggregation method for predicting synthetic accessibility from semi-expert judgments and auxiliary responses. The method assumes that evaluators who have similar expertise select similar substructures and that substructures with similar properties are selected by similar evaluators. The stochastic block model (Nowicki & Snijders, 2001) satisfies this assumption; it finds similar evaluators, similar substructures, and the relationships between them. We also consider the mixed membership stochastic block model (Airoldi et al., 2008), which performs finer-grained clustering to assign different roles to evaluators and substructures in different relationships. In the first stage, the model divides evaluators and substructures into several groups based on their abilities or properties. In the second stage, it integrates five-grade judgments with the clusters derived from the first stage and predicts the synthetic accessibility of target compounds. We apply our method to real assessment response data obtained from semi-experts. The results show that our method successfully identifies skilled workers and presumably problematic substructures. In addition, it improves the prediction performance by incorporating auxiliary responses.

Our contributions include (i) the use of structural information of target compounds in the aggregation of expert intelligence, (ii) a two-stage method that identifies evaluators with similar skills and important substructures to predict synthetic accessibility scores, and (iii) the improvement of prediction performance of synthetic accessibility. The proposed method successfully combines auxiliary responses with assessment scores and achieves higher accuracy than baseline methods on our dataset. However, the method still has a methodological limitation: its computational cost is high for a large number of targets because it uses all the responses and has to consider many parameters in the inference. Reducing the inference cost while achieving high prediction performance is a future challenge for our method.

Figure 1: Example of an assessment response. An evaluator (a) rates the synthetic accessibility on a five-point scale and (b) selects any number of seemingly problematic atoms. Panels: (a) Five-grade assessment; (b) Selection of problematic atoms.

2. Problem Setting

Our synthetic accessibility assessment problem is formalized as follows. We have $I$ compounds and $K$ evaluators, where each evaluator is asked to evaluate the synthetic accessibility of each compound on an $L$-point scale. We denote the grade for the $i$-th compound given by the $k$-th evaluator by $x_i^{(k)} \in \{1, \dots, L\}$, where a higher grade indicates a higher synthetic accessibility (i.e., the compound is easy to synthesize). In addition, each evaluator is asked to annotate obstructive substructures of a compound. The set of substructures of the $i$-th compound is denoted by $J_i$, and the total number of substructures is denoted by $J = \sum_i |J_i|$. The response for the $j$-th substructure given by the $k$-th evaluator is represented by $y_j^{(k)} \in \{0, 1\}$, where $y_j^{(k)} = 1$ if the substructure is annotated as obstructive. Given synthetic accessibility scores $\{x_i^{(k)}\}_{i,k}$ and auxiliary responses $\{y_j^{(k)}\}_{j,k}$, our goal is to estimate the true synthetic accessibility of each compound, $t_i \in \{0, 1\}$, where $t_i = 1$ indicates that the $i$-th compound is synthesizable.
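The notation of the problem setting can be made concrete with a small data container. The following is only an illustrative sketch: the class name `AssessmentData`, its field names, and the toy values are assumptions introduced here and do not come from the paper. Indices are 0-based in the code, whereas the text uses 1-based indices.

```python
from dataclasses import dataclass


@dataclass
class AssessmentData:
    """Hypothetical container for the responses of the problem setting."""
    scores: list[list[int]]         # scores[k][i] = x_i^(k), a grade in {1, ..., L}
    annotations: list[list[int]]    # annotations[k][j] = y_j^(k), 1 if marked obstructive
    substructures: list[list[int]]  # substructures[i] = indices j belonging to J_i
    num_grades: int                 # L

    def total_substructures(self) -> int:
        # J = sum_i |J_i|
        return sum(len(js) for js in self.substructures)

    def validate(self) -> None:
        num_compounds = len(self.substructures)
        for row in self.scores:
            assert len(row) == num_compounds
            assert all(1 <= x <= self.num_grades for x in row)
        for row in self.annotations:
            assert len(row) == self.total_substructures()
            assert all(y in (0, 1) for y in row)


# Toy data (made up): K = 2 evaluators, I = 2 compounds, J = 5 substructures, L = 5.
toy = AssessmentData(
    scores=[[5, 2], [4, 1]],
    annotations=[[0, 0, 1, 1, 0], [0, 0, 0, 1, 0]],
    substructures=[[0, 1], [2, 3, 4]],
    num_grades=5,
)
toy.validate()
```

In the dataset of Section 4.1, the corresponding sizes would be K = 9, I = 59, J = 1661, and L = 5.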


Figure 2: Factor graph of SBM. Dark blue denotes the hyperparameters and light blue denotes the observed parameter. Green denotes the derived parameters from inferred parameters.

3. Proposed Method

Our statistical aggregation method for predicting synthetic accessibility is based on a two-stage procedure. In the first stage, by using the auxiliary responses related to the obstructive substructures, our method groups evaluators and substructures into several clusters. The clustering results are then incorporated into a generative model that describes the assessment scores in the second stage, in which we estimate parameters of the evaluators and the compounds as well as the true synthetic accessibility of each compound.

3.1. Clustering models for evaluators and substructures

In the first stage, we apply probabilistic clustering models to the auxiliary responses and group evaluators and substructures into clusters so that evaluators with similar expertise and substructures with similar properties are grouped.

3.1.1. Stochastic block model.

The stochastic block model (SBM) (Nowicki & Snijders, 2001) is a natural choice for finding similar evaluators and substructures. SBM categorizes users (i.e., evaluators) into $M$ clusters and items (i.e., substructures) into $N$ clusters. SBM is based on the assumption that each evaluator $k$ and each substructure $j$ is assigned to a single cluster. SBM assumes that the cluster of the $k$-th evaluator, $Z_{1,k}$, is generated from a categorical distribution with parameters $\pi_1$:

$$Z_{1,k} \mid \pi_1 \sim \mathrm{Categorical}(Z_{1,k} \mid \pi_1). \tag{1}$$

Similarly, the cluster of the $j$-th substructure, $Z_{2,j}$, is generated from a categorical distribution with parameters $\pi_2$:

$$Z_{2,j} \mid \pi_2 \sim \mathrm{Categorical}(Z_{2,j} \mid \pi_2). \tag{2}$$

The parameters $\pi_1$ and $\pi_2$ are generated from Dirichlet distributions with hyperparameters $\nu_1$ and $\nu_2$, respectively:

$$\pi_1 \sim \mathrm{Dirichlet}(\pi_1 \mid \nu_1), \tag{3}$$
$$\pi_2 \sim \mathrm{Dirichlet}(\pi_2 \mid \nu_2). \tag{4}$$

In addition, SBM incorporates the probability that an evaluator of the $m$-th cluster will select a substructure of the $n$-th cluster, $\theta_{m,n}$, which is assumed to be generated from the beta distribution with hyperparameters $b_1$ and $b_2$:

$$\theta_{m,n} \sim \mathrm{Beta}(\theta_{m,n} \mid b_1, b_2). \tag{5}$$

Finally, the observed response, $y_j^{(k)}$, is generated from the Bernoulli distribution as follows:

$$y_j^{(k)} \mid \theta_{m,n}, Z_{1,k}, Z_{2,j} \sim \mathrm{Bernoulli}(y_j^{(k)} \mid \theta_{Z_{1,k}, Z_{2,j}}). \tag{6}$$
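To make the generative process of Eqs. (1)-(6) concrete, the following sketch draws one synthetic set of auxiliary responses from the SBM prior. It is a forward sampler only, not the inference procedure of the paper. The function names, the symmetric scalar hyperparameters, and the 0-based cluster indices are assumptions made for illustration.

```python
import random


def sample_dirichlet(rng: random.Random, alphas: list[float]) -> list[float]:
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_sbm(num_evaluators: int, num_substructures: int, M: int, N: int,
               nu1: float = 1.0, nu2: float = 1.0,
               b1: float = 1.0, b2: float = 1.0, seed: int = 0):
    """Forward-sample cluster assignments and binary responses from the SBM."""
    rng = random.Random(seed)
    pi1 = sample_dirichlet(rng, [nu1] * M)                         # Eq. (3)
    pi2 = sample_dirichlet(rng, [nu2] * N)                         # Eq. (4)
    z1 = rng.choices(range(M), weights=pi1, k=num_evaluators)      # Eq. (1)
    z2 = rng.choices(range(N), weights=pi2, k=num_substructures)   # Eq. (2)
    theta = [[rng.betavariate(b1, b2) for _ in range(N)]
             for _ in range(M)]                                    # Eq. (5)
    # y[k][j] ~ Bernoulli(theta[Z_{1,k}][Z_{2,j}])                   Eq. (6)
    y = [[1 if rng.random() < theta[z1[k]][z2[j]] else 0
          for j in range(num_substructures)]
         for k in range(num_evaluators)]
    return z1, z2, theta, y


z1, z2, theta, y = sample_sbm(num_evaluators=9, num_substructures=20, M=4, N=4)
```

Because every response of evaluator k on substructure j depends only on the pair of clusters (Z_{1,k}, Z_{2,j}), evaluators in the same cluster share the same selection probabilities, which is the assumption the clustering stage relies on.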

While the first stage handles auxiliary responses, which are directed at substructures, the second stage focuses on assessment responses, which are directed at target compounds. Therefore, we transform the clustering results $\{Z_{1,k}\}_k$ and $\{Z_{2,j}\}_j$ into evaluator clusters $\{c_k\}_k$ and compound clusters $\{d_i\}_i$, respectively. The cluster of the $k$-th evaluator, $Z_{1,k}$, is independent of the substructures, and thus $c_k$ is derived as follows:

$$c_k = (c_{k,1}, \dots, c_{k,M})^\top, \tag{7}$$

where $\mathbb{I}$ denotes the indicator function and $c_{k,m} = \mathbb{I}(Z_{1,k} = m)$ represents whether the $k$-th evaluator belongs to the $m$-th cluster. By contrast, the cluster distribution of the $i$-th compound, $d_i$, is given as follows:

$$d_i = (d_{i,1}, \dots, d_{i,N})^\top, \tag{8}$$
$$d_{i,n} = \frac{1}{|J_i|} \sum_{j \in J_i} \mathbb{I}(Z_{2,j} = n). \tag{9}$$

The factor graph of SBM is given as Figure 2.

Figure 3: Factor graph of MMSB. The difference from SBM is in the indices of π, Z, c, and d.
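The transformation in Eqs. (7)-(9) amounts to a one-hot encoding of each evaluator's cluster and a per-compound histogram of substructure clusters. A minimal sketch follows; the function names and the 0-based cluster indices are illustrative assumptions.

```python
def evaluator_onehot(z1: list[int], M: int) -> list[list[float]]:
    """c_k of Eq. (7): c[k][m] = 1 if evaluator k belongs to cluster m, else 0."""
    return [[1.0 if z == m else 0.0 for m in range(M)] for z in z1]


def compound_cluster_distribution(z2: list[int],
                                  substructures: list[list[int]],
                                  N: int) -> list[list[float]]:
    """d_i of Eqs. (8)-(9): fraction of substructures of compound i in each cluster."""
    result = []
    for js in substructures:          # js plays the role of J_i
        counts = [0.0] * N
        for j in js:
            counts[z2[j]] += 1.0
        result.append([c / len(js) for c in counts])
    return result


# Toy example (made-up assignments): 2 evaluators, 5 substructures, 2 compounds.
c = evaluator_onehot([0, 2], M=3)
d = compound_cluster_distribution([0, 1, 1, 1, 0], [[0, 1], [2, 3, 4]], N=2)
# c == [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# d[0] == [0.5, 0.5]; d[1] is approximately [1/3, 2/3]
```

Each vector c_k and d_i sums to one, which matters later when these vectors are used as mixing weights in the second stage.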

3.1.2. Mixed membership stochastic block model.

The mixed membership stochastic block model (MMSB) (Airoldi et al., 2008) is another possible means of clustering evaluators and substructures. Unlike SBM, MMSB assigns different roles to evaluators and substructures in different relationships. In other words, every evaluator and substructure can simultaneously belong to multiple clusters with different degrees of affinity. The cluster of the $k$-th evaluator when annotating the $j$-th substructure is represented by $Z_{1,k \to j}$, and MMSB assumes that the cluster assignment is generated from a categorical distribution with parameters $\pi_{1,k}$:

$$Z_{1,k \to j} \mid \pi_{1,k} \sim \mathrm{Categorical}(Z_{1,k \to j} \mid \pi_{1,k}). \tag{10}$$

Similarly, the cluster of the $j$-th substructure when it is annotated by the $k$-th evaluator is represented by $Z_{2,j \to k}$, which is generated from a categorical distribution with parameters $\pi_{2,j}$:

$$Z_{2,j \to k} \mid \pi_{2,j} \sim \mathrm{Categorical}(Z_{2,j \to k} \mid \pi_{2,j}). \tag{11}$$

The parameters $\pi_{1,k}$ and $\pi_{2,j}$ are generated from Dirichlet distributions with hyperparameters $\nu_1$ and $\nu_2$, respectively:

$$\pi_{1,k} \sim \mathrm{Dirichlet}(\pi_{1,k} \mid \nu_1), \tag{12}$$
$$\pi_{2,j} \sim \mathrm{Dirichlet}(\pi_{2,j} \mid \nu_2). \tag{13}$$

The probability that an evaluator of the $m$-th cluster will select a substructure of the $n$-th cluster, $\theta_{m,n}$, is generated from a beta distribution as given in (5). Finally, the observed response $y_j^{(k)}$ is generated from the Bernoulli distribution:

$$y_j^{(k)} \mid \theta_{m,n}, Z_{1,k \to j}, Z_{2,j \to k} \sim \mathrm{Bernoulli}(y_j^{(k)} \mid \theta_{Z_{1,k \to j}, Z_{2,j \to k}}). \tag{14}$$

For the same reason as mentioned in the previous section, we transform the clustering results $\{Z_{1,k \to j}\}_{k,j}$ and $\{Z_{2,j \to k}\}_{j,k}$ into evaluator-compound clusters $\{c_{k,i}\}_{k,i}$ and compound-evaluator clusters $\{d_{i,k}\}_{i,k}$:

$$c_{k,i} = (c_{k,i,1}, \dots, c_{k,i,M})^\top, \tag{15}$$
$$c_{k,i,m} = \frac{1}{|J_i|} \sum_{j \in J_i} \mathbb{I}(Z_{1,k \to j} = m), \tag{16}$$
$$d_{i,k} = (d_{i,k,1}, \dots, d_{i,k,N})^\top, \tag{17}$$
$$d_{i,k,n} = \frac{1}{|J_i|} \sum_{j \in J_i} \mathbb{I}(Z_{2,j \to k} = n). \tag{18}$$

The factor graph of MMSB is given as Figure 3.
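Under MMSB the cluster vectors depend on the evaluator-compound pair, so Eqs. (15)-(18) average the pairwise assignments over the substructures of each compound. The sketch below assumes the assignments are stored as nested lists, `z1_pair[k][j]` for $Z_{1,k \to j}$ and `z2_pair[j][k]` for $Z_{2,j \to k}$; these names, the storage layout, and the 0-based indices are illustrative assumptions.

```python
def evaluator_compound_fractions(z1_pair: list[list[int]],
                                 substructures: list[list[int]],
                                 M: int) -> list[list[list[float]]]:
    """c[k][i][m] of Eqs. (15)-(16): fraction of j in J_i with Z_{1,k->j} == m."""
    return [
        [
            [sum(1 for j in js if z1_pair[k][j] == m) / len(js) for m in range(M)]
            for js in substructures
        ]
        for k in range(len(z1_pair))
    ]


def compound_evaluator_fractions(z2_pair: list[list[int]],
                                 substructures: list[list[int]],
                                 N: int,
                                 num_evaluators: int) -> list[list[list[float]]]:
    """d[i][k][n] of Eqs. (17)-(18): fraction of j in J_i with Z_{2,j->k} == n."""
    return [
        [
            [sum(1 for j in js if z2_pair[j][k] == n) / len(js) for n in range(N)]
            for k in range(num_evaluators)
        ]
        for js in substructures
    ]


# Toy example (made-up assignments): K = 2 evaluators, J = 3 substructures,
# compound 0 owns substructures {0, 1} and compound 1 owns {2}.
z1_pair = [[0, 1, 1], [1, 1, 0]]
z2_pair = [[0, 0], [1, 0], [1, 1]]
subs = [[0, 1], [2]]
c = evaluator_compound_fractions(z1_pair, subs, M=2)
d = compound_evaluator_fractions(z2_pair, subs, N=2, num_evaluators=2)
# c == [[[0.5, 0.5], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
# d == [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
```

Unlike the one-hot vector c_k of SBM, an evaluator here can have fractional membership that differs from compound to compound, which is what the text means by different roles in different relationships.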

3.2. Generative model of synthetic accessibility assessment

Given the clustering results obtained in the first stage, we introduce a generative model of the synthetic accessibility assessment provided by each evaluator. We assume that the true synthetic accessibility of the $i$-th compound, $t_i$, is generated from a categorical distribution with parameters $\pi_0$:

$$t_i \mid \pi_0 \sim \mathrm{Categorical}(t_i \mid \pi_0). \tag{19}$$

The parameter $\pi_0$ is generated from the Dirichlet distribution with hyperparameters $a$:

$$\pi_0 \sim \mathrm{Dirichlet}(\pi_0 \mid a). \tag{20}$$

Figure 4: Generative model of synthetic accessibility assessment. c and d are the results of clustering derived from SBM or MMSB.

We assume that three types of latent parameters are involved in the generative process of synthetic accessibility scores: the ability of each evaluator, the common ability of each evaluator cluster, and the degree of synthetic inhibition by each substructure. The ability of the $k$-th evaluator is represented by $\alpha_0^{(k)} = (\alpha_{0,1}^{(k)}, \dots, \alpha_{0,L}^{(k)})$ and $\alpha_1^{(k)} = (\alpha_{1,1}^{(k)}, \dots, \alpha_{1,L}^{(k)})$, where $\alpha_{0,l}^{(k)}$ and $\alpha_{1,l}^{(k)}$ represent the probability of the $k$-th evaluator giving $l \in \{1, \dots, L\}$ as the synthetic accessibility score when the true score of the target compound is 0 and 1, respectively. The ability parameters are generated from the Dirichlet distribution with hyperparameters $\eta_1$:

$$\alpha_0^{(k)} \sim \mathrm{Dirichlet}(\alpha_0^{(k)} \mid \eta_1), \tag{21}$$
$$\alpha_1^{(k)} \sim \mathrm{Dirichlet}(\alpha_1^{(k)} \mid \eta_1). \tag{22}$$

Evaluators in the same cluster are likely to have similar expertise. Therefore, we introduce common ability parameters for the evaluators in the same cluster. The parameters for the $m$-th cluster are denoted by $\beta_{0,m}$ and $\beta_{1,m}$, which represent the probability that an evaluator of the $m$-th cluster will give $l$ as the synthetic accessibility score when the true score of the target compound is 0 and 1, respectively. These parameters are generated from the Dirichlet distribution with hyperparameters $\eta_2$:

$$\beta_{0,m} \sim \mathrm{Dirichlet}(\beta_{0,m} \mid \eta_2), \tag{23}$$
$$\beta_{1,m} \sim \mathrm{Dirichlet}(\beta_{1,m} \mid \eta_2). \tag{24}$$

Each compound has its own degree of synthetic inhibition, and compounds in the same cluster have similar degrees. We represent this type of property of the $n$-th compound cluster by using $\gamma_n = (\gamma_{n,1}, \dots, \gamma_{n,L})$, where $\gamma_{n,l}$ is the probability that the compound will be given $l$ as the synthetic accessibility score. We assume $\gamma_n$ is generated from the Dirichlet distribution with hyperparameters $\eta_3$:

$$\gamma_n \sim \mathrm{Dirichlet}(\gamma_n \mid \eta_3). \tag{25}$$

By using the clustering results of the first stage (i.e., $c_k$ and $d_i$) and the latent parameters, the synthetic accessibility score of the $i$-th compound given by the $k$-th evaluator, $x_i^{(k)}$, is generated from the following categorical distribution:

$$x_i^{(k)} \mid t_i, \alpha_t^{(k)}, B_t, \Gamma, c_k, d_i \sim \mathrm{Categorical}\left( x_i^{(k)} \,\middle|\, \frac{\alpha_{t_i}^{(k)} + c_k^\top B_{t_i} + d_i^\top \Gamma}{\omega} \right), \tag{26}$$

where $\omega$ is the normalization constant and $B_t$ and $\Gamma$ are matrices whose rows are $\beta_{t,m}$ and $\gamma_n$, respectively:

$$B_t = \begin{pmatrix} \beta_{t,1} \\ \vdots \\ \beta_{t,M} \end{pmatrix}, \quad \Gamma = \begin{pmatrix} \gamma_1 \\ \vdots \\ \gamma_N \end{pmatrix}. \tag{27}$$

The factor graph for this model is given as Figure 4.
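The categorical parameter in Eq. (26) is a normalized sum of three probability vectors. The sketch below computes it for one evaluator-compound pair; the function name and the toy numbers are assumptions made for illustration. One observation that follows from the definitions: because $\alpha_{t_i}^{(k)}$, every row of $B_{t_i}$ and $\Gamma$, and the weights $c_k$ and $d_i$ each sum to one, the three terms each sum to one, so $\omega = 3$ in the full model (and 2 in the two-term variants).

```python
def score_distribution(alpha_t: list[float],
                       c_k: list[float], B_t: list[list[float]],
                       d_i: list[float], Gamma: list[list[float]]) -> list[float]:
    """Probability of each grade l under Eq. (26) for one (evaluator, compound) pair.

    alpha_t : individual ability alpha_{t_i}^{(k)}, length L
    c_k     : evaluator cluster weights, length M; B_t has shape M x L
    d_i     : compound cluster weights, length N; Gamma has shape N x L
    """
    L = len(alpha_t)
    mixed = []
    for l in range(L):
        cluster_term = sum(c_k[m] * B_t[m][l] for m in range(len(c_k)))      # c_k^T B_t
        compound_term = sum(d_i[n] * Gamma[n][l] for n in range(len(d_i)))   # d_i^T Gamma
        mixed.append(alpha_t[l] + cluster_term + compound_term)
    omega = sum(mixed)  # normalization constant; equals 3 when all inputs are normalized
    return [v / omega for v in mixed]


# Toy example with L = 2 grades, M = 2 evaluator clusters, N = 2 compound clusters.
probs = score_distribution(
    alpha_t=[0.2, 0.8],
    c_k=[1.0, 0.0], B_t=[[0.5, 0.5], [0.1, 0.9]],
    d_i=[0.5, 0.5], Gamma=[[1.0, 0.0], [0.0, 1.0]],
)
# probs is approximately [0.4, 0.6]
```

Seen this way, the model is an equal-weight mixture of an individual component, an evaluator-cluster component, and a compound-cluster component.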

4. Experiments

We applied our methods to a real dataset of synthetic accessibility assessments with auxiliary responses in order to show the effectiveness of the proposed method. We compared our method to several aggregation methods that did not employ auxiliary responses.

4.1. Dataset

We collected synthetic accessibility scores and auxiliary responses for 59 compounds ($I = 59$) from nine semi-experts ($K = 9$). The target compounds contain 1,661 atoms in total ($J = 1661$). The target compounds were new virtual structures generated by random partial substitutions of COX-2 inhibitors. The original compounds were obtained from a public database (Jeffrey et al., 2003). The semi-experts are researchers in molecular design, who are not professional chemists but have knowledge of chemistry.

The semi-experts were asked to provide the following responses for each compound:
• synthetic accessibility score on a five-point scale,
• auxiliary response about obstructive substructures; the evaluators were asked to select atoms that were likely to inhibit the synthesis.

To evaluate the prediction performance of our methods quantitatively, we derived the ground truth of synthetic accessibility by asking five professional chemists to evaluate the target compounds on a five-point scale. The true synthetic accessibility of a compound is positive ($t_i = 1$) if all the experts rated the compound as three or larger, and that of a compound is negative ($t_i = 0$) if they rated the compound as three or smaller.(1)

(1) There were cases where the experts did not reach such a consensus. In fact, we asked the experts to evaluate 95 compounds and used for the experiments only the 59 compounds that were successfully assigned a true synthetic accessibility.

4.2. Baselines

We compared our methods to the following three methods without auxiliary responses.
• Majority voting (MV): MV gives a prediction based on an average of the five-grade responses. It estimates the synthesizable score of a compound as $\frac{1}{K} \sum_{k=1}^{K} x_i^{(k)}$.
• Raykar: The Raykar model is an extension of the popular Dawid and Skene model (Dawid & Skene, 1979), which represents the ability of workers with sensitivity and specificity (Raykar & Yu, 2011). This model infers these two parameters along with the true synthetic accessibility using the EM algorithm.
• Only-α: This is a variation of the proposed method that performs the second stage only; it does not use the results of clustering based on the auxiliary responses. Specifically, this method only incorporates α (i.e., the evaluator ability) into the generative model. The observed synthetic accessibility score $x_i^{(k)}$ is generated as follows:

$$x_i^{(k)} \mid t_i, \alpha_t^{(k)} \sim \mathrm{Categorical}(x_i^{(k)} \mid \alpha_{t_i}^{(k)}).$$
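As a rough illustration of the simplest baseline above, the following sketch computes the MV score of each compound as the mean grade over evaluators and evaluates it with AUC. The paper does not state how AUC was computed; the `auc` helper below uses the standard pairwise (Mann-Whitney) definition with ties counted as one half, which is an assumption, and all names and toy numbers are hypothetical.

```python
def majority_vote_scores(scores: list[list[int]]) -> list[float]:
    """MV score of compound i: (1/K) * sum_k x_i^(k), with scores[k][i] = x_i^(k)."""
    K = len(scores)
    num_compounds = len(scores[0])
    return [sum(scores[k][i] for k in range(K)) / K for i in range(num_compounds)]


def auc(predictions: list[float], labels: list[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, t in zip(predictions, labels) if t == 1]
    neg = [p for p, t in zip(predictions, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): 3 evaluators, 3 compounds, expert labels t = (1, 0, 1).
grades = [[5, 2, 3], [4, 1, 3], [3, 2, 5]]
mv = majority_vote_scores(grades)   # [4.0, 5/3, 11/3]
mv_auc = auc(mv, [1, 0, 1])         # 1.0: both positives outrank the negative
```

The same `auc` helper could be applied to the predicted probability of $t_i = 1$ from any of the model-based methods, since AUC only depends on the ranking of the compounds.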

Additionally, we compared the proposed method with the following six variations of our method that use auxiliary responses.
• SBM-(α + β) and MMSB-(α + β): These methods employ two latent parameters: the ability of each evaluator (α) and the shared ability of the evaluator cluster (β). The generative model is given as follows:

$$x_i^{(k)} \mid t_i, \alpha_t^{(k)}, B_t, c_k \sim \mathrm{Categorical}\left(x_i^{(k)} \,\middle|\, (\alpha_{t_i}^{(k)} + c_k^\top B_{t_i}) / \omega\right). \tag{28}$$

• SBM-(α + γ) and MMSB-(α + γ): These methods employ two latent parameters: the ability of each evaluator (α) and the degree of synthetic inhibition by the substructure cluster (γ). The generative model is given as follows:

$$x_i^{(k)} \mid t_i, \alpha_t^{(k)}, \Gamma, d_i \sim \mathrm{Categorical}\left(x_i^{(k)} \,\middle|\, (\alpha_{t_i}^{(k)} + d_i^\top \Gamma) / \omega\right). \tag{29}$$

• SBM-(β + γ) and MMSB-(β + γ): These methods employ two latent parameters: the shared ability of the evaluator cluster (β) and the degree of synthetic inhibition by the substructure cluster (γ). These methods ignore the latent parameters that each semi-expert possesses independently and instead use two kinds of cluster-based latent parameters in the second stage. The generative model is given as follows:

$$x_i^{(k)} \mid t_i, B_t, \Gamma, c_k, d_i \sim \mathrm{Categorical}\left(x_i^{(k)} \,\middle|\, (c_k^\top B_{t_i} + d_i^\top \Gamma) / \omega\right). \tag{30}$$

We denote the proposed models by SBM-(α + β + γ) and MMSB-(α + β + γ).

Table 1: AUC scores of nine semi-experts.

Semi-expert    AUC
k = 1          0.427
k = 2          0.555
k = 3          0.628
k = 4          0.640
k = 5          0.667
k = 6          0.690
k = 7          0.793
k = 8          0.878
k = 9          0.896

Table 2: AUC scores of our two methods and nine baselines. The top three are methods without auxiliary responses; the others are methods using auxiliary responses. SBM-(α + β + γ) and MMSB-(α + β + γ) are our proposed methods.

Method               AUC
MV                   0.886
Raykar               0.939
Only-α               0.925
SBM-(α + β)          0.924
SBM-(α + γ)          0.944
SBM-(β + γ)          0.887
SBM-(α + β + γ)      0.955
MMSB-(α + β)         0.927
MMSB-(α + γ)         0.947
MMSB-(β + γ)         0.875
MMSB-(α + β + γ)     0.959

4.3. Experimental setup

We are given the observed responses $D_x = \{x_i^{(k)}\}$ and $D_y = \{y_j^{(k)}\}$, and we have the following latent parameters in our proposed models:

$$\Phi_{\mathrm{SBM}} = \{\pi_1, \pi_2, Z_{1,k}, Z_{2,j}, \theta_{m,n}\},$$
$$\Phi_{\mathrm{MMSB}} = \{\pi_{1,k}, \pi_{2,j}, Z_{1,k \to j}, Z_{2,j \to k}, \theta_{m,n}\},$$
$$\Phi_{\mathrm{GEN}} = \{\pi_0, t_i, \alpha_0^{(k)}, \alpha_1^{(k)}, \beta_{0,m}, \beta_{1,m}, \gamma_n\}.$$

Using this notation, we describe the joint probability of our methods as $p(D_x, D_y, \Phi_{\mathrm{SBM}}, \Phi_{\mathrm{GEN}})$ for the proposed model with SBM and $p(D_x, D_y, \Phi_{\mathrm{MMSB}}, \Phi_{\mathrm{GEN}})$ for the proposed model with MMSB. We split the joint probability into two stages, clustering and generation, and the parameters of each stage are inferred separately through maximum a posteriori (MAP) estimation. Specifically, we first inferred the parameters by solving the following optimization problem:

$$\Phi_{\mathrm{SBM}}^{*} = \arg\max_{\Phi_{\mathrm{SBM}}} \ln p(D_y, \Phi_{\mathrm{SBM}}). \tag{31}$$

We optimized the objective function with gradient ascent, where we first updated $Z_{1,k}$ and $Z_{2,j}$ while fixing the other parameters, and then updated the other parameters while fixing $Z_{1,k}$ and $Z_{2,j}$. We alternately iterated these two steps until convergence. In the second stage, we inferred the parameters by solving the following optimization problem:

$$\Phi_{\mathrm{GEN}}^{*} = \arg\max_{\Phi_{\mathrm{GEN}}} \ln p(D_x, c_k, d_i, \Phi_{\mathrm{GEN}}). \tag{32}$$

The results of clustering were treated as observed parameters in this second stage. We optimized the objective function with gradient ascent, where we first updated $t_i$ while fixing the other parameters, and then updated the other parameters while fixing $t_i$. We alternately iterated these two steps until convergence.

When we used MMSB as the clustering method, the only difference in inference was that $\Phi_{\mathrm{SBM}}$ was replaced by $\Phi_{\mathrm{MMSB}}$.

Finding the best initial values for the inference was difficult because our methods require many parameters to be inferred. Therefore, for each stage, we began the inference from 100 random sets of initial values and chose the set that gave the highest value of the objective function.

We set the hyperparameters for the Dirichlet distribution, $\nu_1$, $\nu_2$, $a$, $\eta_1$ and $\eta_2$, to $(1, 1, \dots, 1)$. The hyperparameters for the beta distribution are set to $b_1 = 1$ and $b_2 = 1$, and the number of evaluator clusters and that of substructure clusters are set to $M = 4$ and $N = 4$, respectively.

4.4. Results

We evaluated the prediction accuracy of synthetic accessibility and investigated the latent parameters estimated by the proposed models. We show the AUC score of each semi-expert in Table 1.

4.4.1. Synthetic accessibility prediction.

The AUC scores of synthetic accessibility prediction are presented in Table 2. We observe that the proposed methods using auxiliary responses (SBM-(α + β + γ) and MMSB-(α + β + γ)) outperform all the baselines without auxiliary responses; this confirms that the auxiliary responses are beneficial for estimating evaluator skills and compound properties, which results in the accurate prediction of synthetic accessibility. MMSB-(α + β + γ) achieves slightly higher AUC values than SBM-(α + β + γ). The more detailed cluster assignments of MMSB appear to contribute to the synthetic accessibility prediction.

By comparing the results of the proposed methods and the variations, we find that the degree of synthetic inhibition by substructure clusters (γ) is an important parameter in the models in addition to the evaluator ability (α); there is a loss of about 3% in AUC when we remove γ from the models (0.955 to 0.924 for SBM and 0.959 to 0.927 for MMSB). This indicates that the substructure clustering based on the auxiliary responses is effective for the prediction. Because SBM-(α + β + γ) and MMSB-(α + β + γ) are superior to all the variations, the three latent parameters are useful for the synthetic accessibility prediction.

4.4.2. Clustering.

We next review the clustering results of SBM and MMSB. Figure 5 shows the clustering results. The semi-experts are ordered by the AUC values of their synthetic accessibility scores; that is, the ninth semi-expert has the highest AUC. SBM divides the semi-experts into two clusters while MMSB separates them into three, which implies that MMSB captures the expertise of the semi-experts in more detail.

SBM assigns almost all the substructures to the first cluster, with some substructures assigned to the remaining clusters. In contrast, MMSB divides the substructures into four clusters, and the distribution of the substructure clusters varies among evaluator-substructure pairs. These results suggest that the substructures in the second and third clusters of SBM and the second cluster of MMSB are considered obstructive and that the skilled semi-experts correctly select them as obstructive. This can be confirmed by referring to Figure 6. The figure shows the distribution of atoms that all five experts selected (60 atoms in total), divided according to the clusters derived from the semi-expert responses. In Figure 6(a), 24 of the 60 atoms are assigned to the second and third clusters. Moreover, in Figure 6(b), more than half of the atoms are assigned to the second cluster when the AUC scores of the semi-experts are high. Therefore, SBM and MMSB are able to incorporate such characteristics of the semi-experts and substructures into the model.

4.4.3. Latent parameter estimation.

We review the estimated latent parameters, which are the probabilities of giving each of the five-grade responses. Figure 7 shows the estimated values of $\alpha_0$ and $\alpha_1$ obtained by SBM-(α + β + γ) and MMSB-(α + β + γ) for each semi-expert. For each parameter, the value of the $l$-th dimension corresponds to the probability of generating the $l$-th grade. Both models successfully capture the characteristics of the semi-experts; for example, the ninth (skilled) semi-expert correctly gives lower grades to the non-synthesizable (negative) compounds and higher grades to the synthesizable (positive) compounds.

Figure 8 shows the estimated parameters $\beta_0$ and $\beta_1$ obtained by SBM and MMSB for each semi-expert cluster. In Figure 8(a), the skilled semi-experts are mostly assigned to the second cluster, and the model successfully estimates that they correctly give lower grades to the non-synthesizable compounds. Similarly, in Figure 8(b), the model successfully estimates that the relatively skilled semi-experts assigned to the second or fourth cluster give lower grades to the non-synthesizable compounds, and the subtle difference in tendency between the second and fourth clusters is also well estimated.

Finally, we investigate the estimated values of the parameter γ shared by substructures assigned to the same clusters. The values are illustrated in Figure 9. Examining the estimated values based on the SBM results, we observe that compounds that contain substructures assigned to the second or third clusters tend to be evaluated as problematic. By contrast, reviewing the estimated values using the MMSB results, we observe that compounds containing substructures assigned to the second cluster are likely to be evaluated as problematic, whereas other substructures have slightly different properties depending on their assigned clusters. Overall, both of our proposed methods capture the ability of semi-experts, whether SBM or MMSB is used.

5. Related Work

In the field of expert and intelligent systems for tackling difficult and complex problems that require deep expert knowledge, machine learning techniques have been explored to extract latent patterns from target data. More recently, to reduce the computational and data acquisition costs of such approaches, some researchers have focused on the use of collective intelligence, which leaves some parts of the complex decision-making process to humans and infers reliable answers by aggregating their responses. Majority voting is one of the basic aggregation methods, but its aggregated labels can be inconsistent when evaluation tasks are highly complex (List, 2003). Dawid and Skene proposed a statistical approach (Dawid & Skene, 1979), in which they assumed a confusion matrix for estimating the relationship between the ability of annotators and true labels. This idea has succeeded in improving the reliability of predictions and has been extended in subsequent studies (Raykar & Yu, 2011; Kim & Ghahramani, 2012); for example, Raykar and Yu extended the aggregation method to multi-grade responses (Raykar & Yu, 2011). Other approaches take task difficulty into account (Whitehill et al., 2009), simultaneously predicting true labels, worker ability, and task difficulty from the observed evaluation responses. Following this line of work, we propose an aggregation model for synthetic accessibility prediction that considers worker ability along with properties of candidate compounds.

Another extension of aggregation models introduces latent groups among annotators and evaluation targets. Such an assumption enables aggregation models to exploit the similarity or dissimilarity between items and contributes to improving the prediction accuracy. Some prior work clusters evaluators into several communities and interprets each community as a different level of expertise or skill (Kajino et al., 2013; Venanzi et al., 2014; Kovashka & Grauman, 2015). On top of that, Lakkaraju et al. considered latent clusters for both evaluators and tasks (Lakkaraju et al., 2015). While these approaches are limited to annotation data for each target item, we extend this work by incorporating additional responses that indicate selections of problematic substructures in each target. The advantage of using auxiliary responses is that they explicitly represent the difference of attention among evaluators. We assume that evaluators who have similar skills tend to select similar substructures, and that substructures with similar properties tend to be chosen similarly. Based on this assumption, our method uses auxiliary responses to extract the similarity of evaluators and compounds.

Computational chemists have also considered the use of human intelligence in synthetic accessibility assessment. Some studies have used expert assessments of chemical compounds and found inconsistency among the annotations (Takaoka et al., 2003; Lajiness et al., 2004). Similarly, Kutchukian et al. investigated the validity of assessments from chemists and showed the difficulty of achieving consensus (Kutchukian et al., 2015). The line of work most closely related to our paper was started by Oprea et al. (Oprea et al., 2009), who took advantage of crowdsourcing and collected evaluations from 11 experts. More recently, Baba et al. proposed to collect and aggregate evaluation scores from semi-experts for synthetic accessibility prediction (Baba et al., 2018). Our work extends this approach by collecting auxiliary responses from semi-experts, aiming for higher accuracy and finer-grained insight into the estimated results.
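The contrast between majority voting and the confusion-matrix approach of Dawid and Skene discussed above can be sketched as follows. This is a simplified illustration of the classic setting, in which observed labels and true labels share the same class set and every worker labels every item; it is not the aggregation model proposed in this paper, and the function names, the smoothing constant, and the toy data are assumptions of the sketch.

```python
import numpy as np


def majority_vote(labels, n_classes):
    """labels: (n_items, n_workers) integer array. Ties go to the lowest class index."""
    counts = np.stack([(labels == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)


def dawid_skene(labels, n_classes, n_iter=50, smoothing=1e-2):
    """EM for a Dawid-Skene style model with one confusion matrix per worker.

    Returns (posterior, confusion) where posterior has shape (n_items, n_classes)
    and confusion[w, t, o] estimates P(worker w reports o | true class t).
    """
    onehot = np.eye(n_classes)[labels]            # (n_items, n_workers, n_classes)
    post = onehot.sum(axis=1)                     # initialise from vote proportions
    post = post / post.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class prior and smoothed per-worker confusion matrices.
        prior = post.mean(axis=0)
        confusion = np.einsum("it,iwo->wto", post, onehot) + smoothing
        confusion /= confusion.sum(axis=2, keepdims=True)
        # E-step: posterior over the true class of each item (in log space).
        log_post = np.log(prior + 1e-12) + np.einsum(
            "iwo,wto->it", onehot, np.log(confusion)
        )
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, confusion


if __name__ == "__main__":
    # Toy data: 6 items, 3 workers, binary labels (made up for illustration).
    toy = np.array([[0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]])
    print(majority_vote(toy, 2))
    posterior, conf = dawid_skene(toy, 2)
    print(posterior.round(3))
    print(conf.round(3))
```

Unlike majority voting, the EM procedure weights each response by the estimated reliability of the worker who gave it, which is the idea that the community-based extensions cited above build on.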

6. Conclusion

In this study, to improve the accuracy of synthetic accessibility assessment conducted by semi-experts, we proposed (i) the use of auxiliary responses, whereby evaluators select atoms of seemingly obstructive substructures, and (ii) two-stage aggregation methods in which semi-experts and atoms are each divided into clusters; these methods stochastically integrate five-grade assessments while considering auxiliary responses. Our experimental results demonstrated that considering the similarity of semi-experts and substructures, in addition to the personal skill of each semi-expert, achieves higher prediction performance than the baseline methods. Although there was no large difference in accuracy between using SBM and MMSB in our method, we found that their clustering results carry different implications. The clusters obtained by SBM were simpler than the MMSB clusters, but, as Figure 6 shows, MMSB succeeded in capturing the fact that skilled semi-experts can find atoms of problematic substructures. These findings imply a trade-off between interpretability and granularity of results: SBM is superior to MMSB in generating easy-to-understand results, whereas MMSB is better at providing finer-grained insight into evaluator skills and substructure properties. In conclusion, we empirically confirmed that applying either SBM or MMSB yields insightful information on semi-experts and substructures with high prediction accuracy, but the trade-off between the clustering methods should be addressed in future applications.

However, our proposed method still has three limitations. First, our experimental findings are limited to our dataset of nine semi-experts and 59 compounds. We created candidate chemical compounds with new structures and defined their synthetic accessibility by taking the consensus of five expert chemists, which means that we have no way to know the "actual" synthesizability of the targets. This is a general limitation of evaluations that use expert annotations as ground-truth labels and needs further discussion. Second, our method incurs a large computational cost, especially when there are many candidate compounds. In our data, the number of atoms exceeds one thousand because each compound contains approximately 20 to 30 atoms, which results in a heavy inference cost. For future applications that handle larger numbers of candidates, we need to take into account not only the efficacy but also the efficiency of the method. Third, related to the second limitation, our method performs inference using the entire response data in a batch, which makes it difficult to scale the method to larger datasets. To mitigate the computational cost, it will be important to consider efficient inference algorithms such as online learning.
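As a rough illustration of why the auxiliary responses dominate the inference cost, the sizes of the two response matrices can be computed from the figures reported in this paper (nine semi-experts, 59 compounds, and the 1661 atoms given in the caption of Figure 5), under the stated setting that every semi-expert evaluates every compound. The variable names are introduced only for this sketch.

```python
# Dataset sizes reported in the paper.
n_semi_experts = 9
n_compounds = 59
n_atoms = 1661  # total number of atoms, from the caption of Figure 5

# Average number of atoms per compound (the text states roughly 20 to 30).
atoms_per_compound = n_atoms / n_compounds

# One five-grade response per (semi-expert, compound) pair.
n_grade_responses = n_semi_experts * n_compounds

# One selected/not-selected entry per (semi-expert, atom) pair.
n_atom_responses = n_semi_experts * n_atoms

print(f"atoms per compound: {atoms_per_compound:.2f}")
print(f"grade responses:    {n_grade_responses}")
print(f"atom-level entries: {n_atom_responses}")
```

The atom-level matrix has about 28 times as many entries as the grade matrix, so its size grows quickly as candidate compounds are added.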


One direction of methodological future work is to extend the clustering models to non-parametric approaches. A limitation of SBM and MMSB is that they cannot consider relations between items beyond the fixed number of clusters given at the beginning. The infinite relational model (IRM) (Kemp et al., 2006), which estimates the number of latent clusters along with the cluster assignments of items, can address this limitation. Investigating the effect of replacing SBM or MMSB with IRM is the next challenge of our study. Another methodological direction is to allow more flexible collection of evaluations from semi-experts. The current method assumes that all participants evaluate all candidate compounds, but this is impracticable if the number of candidates is quite large. Therefore, for practical application, we need to discuss how to deal with missing responses from evaluators in our method.

Another direction of future work is to apply the proposed method to domains other than synthetic accessibility assessment. Our method uses judgments of evaluation targets and selections of problematic substructures. Therefore, we can apply our model to tasks with similar problem settings, where workers assess targets based on their structural information. For relatively easy tasks that require less professional knowledge from annotators, we can consider collecting responses from ordinary people on crowdsourcing platforms. Additionally, collecting positive feedback on target structures instead of selecting inappropriate substructures is also a promising approach. For example, when finding high-quality items among candidates, it is reasonable to rely on responses that select positive components of the evaluation targets. Moreover, for some tasks it is also possible to collect both positive and negative feedback on substructures of targets. Trying such different forms of auxiliary responses for various tasks is also part of our future work.

Acknowledgements

The authors would like to thank Hiroki Kano and Hiroshi Yamashita (Mitsubishi Tanabe Pharma Corporation) for their support with the experiments.

References

Airoldi, E., Blei, D., Fienberg, S., & Xing, E. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9, 1981–2014.

Antonis, S., Angelos, P., & Eleftherios, T. (2003). Can 'non-expert' users analyze data? A survey and a methodological approach. European Research Studies, 6, 109–119.

Baba, Y., Isomura, T., & Kashima, H. (2018). Wisdom of crowds for synthetic accessibility evaluation. Journal of Molecular Graphics and Modelling, 80, 217–223.

Boda, K., Seidel, T., & Gasteiger, J. (2007). Structure and reaction based evaluation of synthetic accessibility. Journal of Computer-Aided Molecular Design, 21, 311–325.

Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28, 20–28.

Ertl, P., & Schuffenhauer, A. (2009). Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1, 1–11.

Fukunishi, Y., Kurosawa, T., Mikami, Y., & Nakamura, H. (2014). Prediction of synthetic accessibility based on commercially available compound databases. Journal of Chemical Information and Modeling, 54, 3259–3267.

Gillet, V. J., Myatt, G., Zsoldos, Z., & Johnson, A. P. (1995). SPROUT, HIPPO and CAESA: Tools for de novo structure generation and estimation of synthetic accessibility. Perspectives in Drug Discovery and Design, 3, 34–50.

Jeffrey, J. J., O'Brien, L. A., & Weaver, D. F. (2003). Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences, 43, 1906–1915.

Johnson, A. P., Marshall, C., & Judson, P. N. (1992). Starting material oriented retrosynthetic analysis in the LHASA program. 1. General description. Journal of Chemical Information and Computer Sciences, 32, 411–417.

Kajino, H., Tsuboi, Y., & Kashima, H. (2013). Clustering crowds. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (pp. 1120–1127).

Kemp, C., Tenenbaum, J., Griffiths, T., Yamada, T., & Ueda, N. (2006). Learning systems of concepts with an infinite relational model. In Proceedings of the Twenty-First AAAI Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference (pp. 381–388).

Kim, H.-C., & Ghahramani, Z. (2012). Bayesian classifier combination. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (pp. 619–627).

Kovashka, A., & Grauman, K. (2015). Discovering attribute shades of meaning with the crowd. International Journal of Computer Vision, 114, 56–73.

Kutchukian, P. S., Vasilyeva, N. Y., Xu, J., Lindvall, M. K., Dillon, M. P., Glick, M., & Coley, J. D. (2015). Inside the mind of a medicinal chemist: The role of human bias in compound prioritization during drug discovery. PLoS One, 7, e48476.

Lajiness, M. S., Maggiora, G. M., & Shanmugasundaram, V. (2004). Assessment of the consistency of medicinal chemists in reviewing sets of compounds. Journal of Medicinal Chemistry, 47, 4891–4896.

Lakkaraju, H., Leskovec, J., Kleinberg, J., & Mullainathan, S. (2015). A Bayesian framework for modeling human evaluations. In Proceedings of the SIAM International Conference on Data Mining (pp. 181–189).

List, C. (2003). The theory of judgment aggregation: An introductory review. Synthese, 187, 179–207.

Nowicki, K., & Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96, 1077–1087.

Oprea, T. I., Bologa, C. G., Boyer, S., Curpan, R. F., Glen, R. C., Hopkins, A. L., Lipinski, C. A., Marshall, G. R., Martin, Y. C., Ostopovici-Halip, L., et al. (2009). A crowdsourcing evaluation of the NIH chemical probes. Nature Chemical Biology, 5, 441–447.

Pförtner, M., & Sitzmann, M. (2007). Computer-assisted synthesis design by WODCA. Journal of Computer-Aided Molecular Design, 21, 311–325.

Podolyan, Y., Walters, M. A., & Karypis, G. (2010). Assessing synthetic accessibility of chemical compounds using machine learning methods. Journal of Chemical Information and Modeling, 50, 979–991.

Raykar, V. C., & Yu, S. (2011). Ranking annotators for crowdsourced labeling tasks. In Advances in Neural Information Processing Systems 24 (pp. 1809–1817).

Smuc, M., Mayr, E., Lammarsch, T., Aigner, W., Miksch, S., & Gärtner, J. (2009). To score or not to score? Tripling insights for participatory design. IEEE Computer Graphics and Applications, 29, 19–38.

Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast – But is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 254–263).

Takaoka, Y., Endo, Y., Yamanobe, S., Kakinuma, H., Okubo, T., Shimazaki, Y., Ota, T., Sumiya, S., & Yoshikawa, K. (2003). Development of a method for evaluating drug-likeness and ease of synthesis using a data set in which compounds are assigned scores based on chemists' intuition. Journal of Chemical Information and Computer Sciences, 43, 1269–1275.

Venanzi, M., Guiver, J., Kazai, G., Kohli, P., & Shokouhi, M. (2014). Community-based Bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd World Wide Web Conference (pp. 155–164).

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., & Movellan, J. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22 (pp. 2035–2043).


Figure 5: Results of clustering for (a) SBM and (b) MMSB. The left is the clusters of semi-experts, the right two, the clusters of atoms. The elements of the heatmap denote the number of pairs of a semi-expert k and every atom j (j = 1, ..., 1661).

Figure 6: Distribution of atoms that all five experts selected, for each cluster derived from the semi-experts' auxiliary responses, for (a) SBM and (b) MMSB.

Figure 7: Estimated values of $\{\alpha_t^{(k)}\}_{k,t}$ for (a) SBM and (b) MMSB.

Figure 8: Estimated values of $\{\beta_{t,m}\}_{t,m}$ for (a) SBM and (b) MMSB.

Figure 9: Estimated values of $\{\gamma_n\}_n$ for (a) SBM and (b) MMSB.


Declaration of interests

☐ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:

CRediT author statement

Shun Ito: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Writing – Original Draft, Writing – Review & Editing, Visualization.

Yukino Baba: Conceptualization, Investigation, Resources, Data Curation, Writing – Review & Editing, Visualization, Supervision.

Tetsu Isomura: Resources, Data Curation.

Hisashi Kashima: Conceptualization, Writing – Review & Editing, Supervision, Project Administration.