Structural fragmentation in scene graphs


Knowledge-Based Systems 211 (2021) 106504


Varshanth Rao, Peng Dai (corresponding author: [email protected]), Sidharth Singla
Noah's Ark Lab, Huawei Technologies Inc., Markham, ON, Canada

Article history: Received 20 April 2020; Received in revised form 4 October 2020; Accepted 6 October 2020; Available online 22 October 2020.
Keywords: Scene graph; Graph coherence; Clustering; Semantic quality; Human study

Abstract. Despite continuous performance improvements, contemporary Scene Graph (SG) systems tend to generate 'fragmented' graphs. A central problem is that standard metrics only measure similarity to ground truth graphs at the triplet level and may not fully capture image relevance or semantic correctness. In particular, multiple triplet predictions are usually made for the same ground truth regions, which can be considered a trivial way to improve the standard evaluation metric, i.e. recall. The central purpose of our work is to reveal this inherent drawback of current SG evaluation methods and the resultant redundancy issue. We investigate different types of graph artifacts in SGs generated by existing models and propose two graph quality metrics to evaluate the level of fragmentation. Detailed analysis shows how SG model architectures contribute to graph fragmentation. We study these problems in the context of graph semantic quality assessment. A qualitative human study is conducted to evaluate the consistency between the proposed metrics and human perception. To further confirm the validity of this new source of error, a simple but effective method that targets graph fragmentation is presented. Systematic experiments are conducted on the standard Visual Genome (VG) dataset and the Visual Relationship Detection (VRD) dataset. Experimental results show that our proposed system significantly improves scene graph quality in terms of the new metrics as well as the traditional Top-N recall values.

1. Introduction
Although deep learning models have made impressive progress in detecting, categorizing and localizing objects in images [1–4], this is not sufficient for many applications that depend on semantic scene understanding, e.g., content based image retrieval [5,6], natural language image description and scene understanding [7], image generation [8,9], image understanding [10], 3D scene generation [11] and visual question answering [12]. In particular, a critical task is to move beyond detection, classification and localization of objects to also predict their relations. Recently, the idea of scene graph representations used in computer graphics has been combined with machine learning techniques to capture semantic relationships in images [13]. The key innovation of scene graphs as applied to image understanding is reasoning about inter-object relationships, represented as object bounding boxes connected by predicates, i.e. triplets, represented

as (subject, predicate, object) [6]. Subsequent research particularly benefited from the availability of richly annotated datasets, e.g., Visual Relationship Detection (VRD) [14] and Visual Genome [15]. One of the major issues with present scene graph generation models is prediction redundancy, i.e. multiple predictions for the same object and for the relationship between a (subject, object) pair. This leads to predicate contention and consequently reduces the quality of scene graphs. Specifically, it has two demerits: (i) the redundant relationships between a given (subject, object) pair occupy space in the top-N list of predictions, hence ''starving'' other valid relationships from appearing at all; (ii) redundancy reduces model credibility by allowing the model to make multiple attempts at certain relationships. Fig. 1(a) illustrates this point by showing the predictions of the F-Net model [16] for a sample image from Visual Genome [15]. The centers of detected objects are marked as labeled circles. There is a large amount of redundant prediction around helmet and shirt. In summary, our work has four major contributions:

• We reveal a widely existing problem in scene graph algorithms, i.e. prediction redundancy, as shown in Fig. 1. Detailed analysis is given in Sections 3.2 and 3.3 to show the origin as well as the consequences of the problem.
• Two new evaluation metrics, i.e. the edge distinctness score and the graph coherence score, are proposed to assess the redundant prediction issue.


Fig. 1. Sample scene graphs with/without clustering.


• A simple but effective method is presented to reduce redundant predictions. On the Visual Genome dataset, our algorithm achieves an average relative improvement of about 2.25% for PhrDet and 3.10% for SGGen. Moreover, we conduct a human study to assess the subjective quality of the generated graphs. Results show significant improvement in terms of structural cleanness (less redundancy) and subjective similarity to the ground truth.
• Finally, our ablation study shows that the proposed clustering algorithm is general and can be applied to other SG backbones, e.g. Graphical Contrastive Losses (GCL). It can also be combined with non-maximum suppression (NMS) [17] to further reduce redundancies in SGs.

2. Related work
Scene graphs can be represented as a series of triplets, i.e. (subject, predicate, object). Generally speaking, existing scene graph generation methods can be divided into two broad categories, i.e. one-step approaches and two-step approaches. Two-step approaches decouple triplets into labels (subject/object) and relationships (predicates), and then solve the problem via object detection and predicate classification [14,18,19]. On the other hand, one-step approaches generate the different components of triplets at the same time [13,20,21]. Both families attempt to perform object detection and classification combined with relationship prediction between the objects. Lu et al. leverage language understanding for predicting objects and their interrelations [14]. Even though most relations appear infrequently, their co-occurrence with the objects can be probabilistically determined (e.g., 'boy' plays 'piano'). Therefore, objects and predicates are estimated by separate models and later combined based on word vector embeddings. Dai et al. proposed to filter unlikely relationships based on statistical dependencies between objects and relations to improve system performance [22]. Liao et al. proposed to incorporate prior knowledge from natural language to overcome the sample scarcity issue for visual relationship detection [18]. Factorizable-Net [16], an improved version of the Visual Phrase Guided Convolutional Neural Network (ViP-CNN) [20] and the Multi-level Scene Description Network [21], reported significant improvement in prediction speed by reducing the number of processed union boxes through the use of sub-graphs.
Top-N Recall, or Recall@N, is widely accepted for scene graph evaluation. It has been adopted as the standard evaluation metric by most scene graph papers [16,21,23]. Although scene graph generation systems have improved over time, there are some inherent problems caused by the evaluation metric. Yang et al. showed that the existing metric fails to distinguish between single and multiple points of failure in the detection and classification of triplet members [23]. In response, a new metric was proposed to consider both recall and ''singletons'' [23]. Other work in [23,24] also highlights the inability of the Recall@N metric to accommodate sensible or genuine prediction errors in objects and predicates (e.g., boy vs man, have vs have_a, in vs inside, etc.).

3. System description
3.1. Problem formulation
At a high level, the scene graph generation task is to detect, localize and classify objects and predict their pair-wise relationships, producing a list of (subject, predicate, object) triplets. Formally, the scene graph generation problem can be formulated as finding the optimal

x^* = \arg\max_x \Pr(x \mid I, B_I)

that maximizes the following probability function given the image I and box proposals B_I [13]:

\Pr(x \mid I, B_I) = \prod_{i \in V} \prod_{j \neq i} \Pr\left(x_i^{cls}, x_i^{bbox}, x_{i \to j} \mid I, B_I\right)    (1)

where V is the set of proposal region indices.

3.2. Over-sampling & redundancy
SG generation can be considered as a task where pairs of objects are localized and classified together with their inter-object relationship. Due to the inherent similarity between object detection and scene graph generation, there is significant overlap in implementation details. Image-centric sampling is commonly used in object detection to improve training efficiency [25]. It is implemented as random sampling of ROI regions to form positive/negative instances. Although the (over-)sampling process contributes to fast and effective training of deep learning models, it also introduces a serious problem, i.e. redundant predictions for the sampled ROIs. In object detection, redundancy takes the form of multiple bounding boxes for the same object, which is effectively handled by non-maximum suppression (NMS) or its variants, e.g. soft-NMS [17]. The same problem exists for scene graphs. In scene graphs, the public databases are extremely unbalanced in terms of predicates [15]; the most common predicates are 'on', 'has', 'in', etc. Therefore, similar sampling methods are utilized to improve model performance [19,26]. For example, Zhang et al. sample 128 positive object pairs out of 512 total object pairs, which is far more than the number of possible object pairs based


Fig. 2. Effect of Graph fragmentation.

on the ground truth labels [19]. In Graph R-CNN [27], the system samples 128 object pairs from the quadratic number of candidates. Furthermore, the oversampling process also exists in the inference phase. The average number of objects per image in Visual Genome is 17.68 (≈ 18) [15]. Thus, the maximum number of object pairs per image, considering 18 objects per image, is \binom{18}{2} = 18 \times 17 / 2 = 153, which is much smaller than the sampling target. Therefore, the resultant model tends to overestimate the number of triplets in a given image, i.e. the triplets follow an a priori probability distribution determined by the sampling scheme. At the algorithm level, SG systems inherently utilize a multi-proposal mechanism to generate regions of interest, which are converted to triplets at a later stage. The proposals can be in the form of object pairs [28] or region proposals [13]. Yu et al. discussed this phenomenon and formulated the number of relationship predictions as a hyperparameter [28].
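As a quick sanity check on the scale mismatch argued above, the following tiny Python snippet contrasts the number of distinct object pairs with the sampling target; the 18-object average and the 512-pair figure are taken from the discussion above, and the snippet is purely illustrative.

from math import comb

avg_objects = 18             # average objects per image in Visual Genome, rounded up
sampled_pairs = 512          # total object pairs sampled per image in [19]
max_pairs = comb(avg_objects, 2)              # C(18, 2) = 153 distinct unordered pairs
print(max_pairs, sampled_pairs / max_pairs)   # 153, roughly 3.3x oversampling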

Fig. 3. Limitation of recall.

3.3. Graph deformation & graph quality evaluation
Scene graph evaluation is a multi-step process, which includes evaluation of object bounding boxes and predicates. Generally speaking, SG quality is usually evaluated via Recall@N, which measures correctness at the triplet level. Therefore, the following two cases are considered the same level of 'success': (1) multiple predicted triplets match the same ground truth; (2) one predicted triplet matches one ground truth. We propose to evaluate this issue with two new metrics. Redundant predictions (see Fig. 4) can be measured from two directions, i.e. nodes or edges. There are generally three types of node mismatch in SGs, i.e. missing nodes, redundant subject nodes, and redundant object nodes. There are two types of predicate mismatch, i.e. missing predicates and (label-)wrong predicates. Fig. 2(a) represents the effect of node fragmentation. Given the ground truth graph G = {A: (e1, e2, e3)}, where A is a single node and e1, e2 and e3 are unique edges, the predicted graph can be (wrongly) viewed and treated as 3 separate sub-graphs, each with a separate member node. Similarly, Fig. 2(b) represents the effect of edge (predicate) redundancy, wherein each predicted sub-graph (wrongly) has a separate member edge corresponding to the same node. Scene graph generation systems are, at their core, designed to concisely describe an image through an exhaustive list of triplets. Johnson et al. showed that with high quality SGs it is possible to perform image retrieval effectively at scale [6]. However, SGs with redundant predictions, i.e. fragmented SGs, cannot correctly represent the true information of the reference images. Assume we have an SG with N nodes and M edges; with redundant predictions, the target SG is converted to N̂ nodes and M̂ edges. Each node n is repeated r_n times, as with the multiple 'helmet' objects in Fig. 1. Thus, the total number of nodes (corresponding to subjects or objects) becomes Σ_n r_n, which is much larger than N. The original single SG can become as many as N × (N − 1) fragmented graph pieces.

Fig. 4. Redundant predictions for state-of-the-art SG systems.¹,² (¹ Image generated with the original author's code at https://github.com/yikang-li/FactorizableNet. ² Image taken from the original author's example at https://github.com/NVIDIA/ContrastiveLosses4VRD.)

3.3.1. Top-N Recall
The Top-N Recall is used to evaluate how well the predicted graph matches the ground truth. Given the predicted scene graph Gp = (Vp, Ep) and the ground truth graph Gt = (Vt, Et), where V∗ and E∗ refer to the sets of nodes and edges respectively, we match Vp to Vt based on IoU and obtain G∗p = (V∗p, Ep). At a certain Top-N level, the recall can be calculated as

S_{Rec} = \frac{\| E_p \cap E_t \|}{\| E_t \|}    (2)

where ∥·∥ returns the total number of elements in a set, Ep and Et denote the sets of edges in the predicted SG and the ground truth, respectively, and the edges are represented as (subject, predicate, object). Therefore, recall based on edges inherently reveals node accuracy. However, recall cannot effectively reflect redundant predictions: the graphs in Fig. 3 are considered exactly the same based on recall. It can be seen that with edge/node redundancy, multiple predicted graphs correspond to a single ground truth graph, which we refer to as graph fragmentation. It has to be noted that redundant predictions are very common in state-of-the-art SG systems. Fig. 1 gives two examples from recent SG algorithms; both algorithms show clear redundant predictions (see the areas around 'bed' and 'pillow'). In view of this inherent limitation, more robust metrics are developed that evaluate the scene graph from two different angles, i.e. edge distinctness and graph coherence.

3.3.2. Edge distinctness score
The Edge Distinctness score measures the level of fragmentation based on edges. The general idea is that if multiple predicted edges match the same ground truth edge, this should not be treated the same as the case when only one edge matches that ground truth edge. The Edge Distinctness score, Sed, is defined as

S_{ed} = \frac{1}{\| E_t \|} \sum_{e_t \in M(E_p, E_t)} \frac{1}{N(e_p, e_t)}    (3)

where ∥Et∥ is the total number of ground truth edges, M(Ep, Et) is the set of predicted edges that match the ground truth, and N(ep, et) is the total number of predicted edges that match the target ground truth edge et. Following the same notation, Top-N Recall can be formulated as

S_{Rec} = \frac{1}{\| E_t \|} \sum_{e_t \in M(E_p, E_t)} 1    (4)

It can be seen that redundant predictions on the same edge, Fig. 2(b), are counted as 1/N(ep, et) in Eq. (3), while Recall@N counts them as 1 in Eq. (4).

3.3.3. Graph coherence score
The Graph Coherence score is designed to quantify how well the predicted scene graph matches the ground truth as a whole. It measures the level of fragmentation in a predicted graph caused by redundant detections. For each node vt ∈ Vt we find all the corresponding nodes vp ∈ G∗p (see Fig. 5(d), node #3, for an example) and calculate the degree of each vp. Then, we form a degree set for vt as

D_{v_t} = \left\{ d : d = \deg(v, G_p), \; \forall v \to v_t, \; v \in V_p \right\}    (5)

where → denotes node correspondence and deg(v, Gp) denotes the degree of a graph node v based on Gp. Here, we calculate the degree as the sum of incoming and outgoing edges e that satisfy e ∈ Ep ∩ Et. The graph coherence score, Scoh, is defined as

S_{coh} = \frac{1}{2 \| E_t \|} \sum_{v_t \in V_t} \max(D_{v_t})    (6)

With valid triplets, i.e. when valid edges exist, the lower bound of Eq. (6) is achieved when the graph is fully fragmented,

\omega_l = \frac{\| V_p \,\hat{\cap}\, V_t \|}{2 \| E_t \|}    (7)

where \hat{\cap} denotes the operation that finds the nodes with correctly matched edges. In this scenario the graph is completely fragmented and therefore all node degrees are 1. Recall, SRec, decides the upper bound of the coherence score Scoh. SRec reflects the amount of triplets that match the ground truth, which determines the candidate nodes for the coherence score calculation. Based on Eq. (6), only those predicted nodes that have valid correspondence in the ground truth are used in the coherence score calculation. Therefore, the upper bound of Scoh can be calculated based on the subgraph Gs = (Vs, Es), where Vs = V_p \,\hat{\cap}\, V_t and Es denotes the set of ground truth edges determined by V_p \,\hat{\cap}\, V_t. Given the predictions, the best coherence score possible is

\omega_u = \frac{1}{2 \| E_t \|} \sum_{v_t \in V_s} \deg(v_t, G_s)    (8)

With redundancy, the 'difference' brought by the redundant nodes can be calculated as

\Delta S_{coh} = \frac{1}{2 \| E_t \|} \sum_{v \in V_p, \, v \to v_t} \left( \deg(v_t, G_s) - \max(D_{v_t}) \right)    (9)

where vt is the ground truth node that matches v, and ΔScoh originates from redundant bounding boxes. The coherence score thus offers a metric to evaluate the level of fragmentation.

3.3.4. Example
Fig. 5 illustrates the rationale of the proposed metrics. It serves as an abstraction of Fig. 1 for simplicity. The redundant objects, helmet and shirt, correspond to nodes 3 and 5, and the emerging new node, 6, in Figs. 5(d) and 5(e) exemplifies pant_2 of Fig. 1. The ground truth scene graph, Fig. 5(a), consists of 7 nodes and 8 labeled edges. In Fig. 5(b), the predicted graph matches all the edges in the ground truth; therefore SRec is 1. The max degrees for node #1 to node #7 are 2, 2, 1, 2, 1, 1, 1, respectively. Thus, Scoh can be calculated as (2 + 2 + 1 + 2 + 1 + 1 + 1)/(2 × 8) = 0.625. Figs. 5(b), (c) and (d) give the assumed predictions with redundant objects and the corresponding scores. Fig. 5(f) gives an example of edge redundancy (without any effect of node redundancy); Sed can be calculated as (1/8)(1 + 1 + 1 + 1 + 1 + 1 + 1 + 1/2) = 0.9375. The scores for the other graphs can be calculated in a similar manner. It can be seen that recall does not capture the presence of redundant predictions, SRec being 1 in Fig. 5(b); nevertheless, the coherence metric effectively quantifies the redundancy. Fig. 5(c) provides an example of both redundancy and wrong predicate labels, d∗ and e∗. Considering the worst case, the redundant nodes have only 1 incoming/outgoing edge. Since all of the edges are present, the recall is 0.75; however, the coherence score is 0.5 due to redundancy. Post clustering, we obtain Fig. 5(e), wherein the coherence score improves. Figs. 5(d) and 5(e) present an intriguing benefit of the proposed approach: in Fig. 5(d), redundant but high-confidence predictions occupy space in the Top-N list, which excludes node 6 from the graph; clustering, however, enables the new node to be included, which represents pant_2 in Fig. 1.
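To make the definitions above concrete, the following Python sketch computes SRec (Eq. (2)), Sed (Eq. (3)) and Scoh (Eq. (6)) from edge lists whose predicted subjects and objects have already been matched to ground truth node identifiers via IoU. The data layout (triplets as tuples, a node_match dictionary) is our own illustrative assumption and not the authors' released evaluation code.

from collections import defaultdict

def graph_quality_scores(pred_edges, gt_edges, node_match):
    # pred_edges : list of (subj_id, predicate, obj_id) over predicted node ids (the Top-N list)
    # gt_edges   : set  of (subj_id, predicate, obj_id) over ground truth node ids
    # node_match : dict mapping predicted node id -> matched ground truth node id (IoU-based)
    hits = defaultdict(int)    # N(e_p, e_t): how many predicted edges hit each ground truth edge
    deg = defaultdict(int)     # degree of each predicted node, counting matched edges only
    for s, p, o in pred_edges:
        gt = (node_match.get(s), p, node_match.get(o))
        if gt in gt_edges:     # an edge counts only if subject, object and predicate all match
            hits[gt] += 1
            deg[s] += 1        # incoming and outgoing edges both contribute to the degree
            deg[o] += 1
    n_gt = len(gt_edges)
    s_rec = len(hits) / n_gt                              # Eq. (2) / Eq. (4)
    s_ed = sum(1.0 / n for n in hits.values()) / n_gt     # Eq. (3)
    best = defaultdict(int)    # max(D_{v_t}) for each ground truth node
    for v, d in deg.items():
        vt = node_match.get(v)
        best[vt] = max(best[vt], d)
    s_coh = sum(best.values()) / (2 * n_gt)               # Eq. (6)
    return s_rec, s_ed, s_coh

Applied to the Fig. 5(b) example, where the per-node max degrees are 2, 2, 1, 2, 1, 1, 1 over 8 ground truth edges, the same computation gives Scoh = 10/16 = 0.625, matching the worked example above.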


Fig. 5. Redundant predictions example. ∗ indicates a label difference; nodes with dashed edges denote lower confidence nodes.

3.4. Categorical clustering

3.4.1. General description
In view of the redundancy issue discussed above, we propose a simple but effective method to reduce the redundancy problem. The general idea is to group highly overlapping nodes, which inherently mitigates both problems. For example, in Figs. 5(b) and 5(f), merging the redundant nodes would simultaneously improve both Sed and Scoh. It has to be noted that the following approach is a general approach designed to target the redundant prediction issue revealed in Section 3; it can be incorporated into any system that suffers from the redundancy problem, such as Zhang et al. [19]. Without loss of generality, we choose to develop an investigative system based on Factorizable-Net (FNet) [16].

3.4.2. Clustering
To tackle the challenges presented above, we propose a clustering method based on IoU. Given the outputs of the network, i.e. (subject, predicate, object) triplets and the corresponding bounding boxes, the objective is to find a set of clusters, C1, C2, ..., Ck, that satisfy the following criterion

\forall x_i, x_j \in C_k, \quad o(x_i, x_j) \geq \tau_{iou}    (10)

where o(·, ·) denotes the IoU of two bounding boxes. Each cluster contains a member list (bounding boxes) and the cluster head (the average of all member bounding boxes, denoted as cls_hd). Algorithm 1 estimates such ideal clusters. Given a new bounding box, bbox, and a set of existing clusters, cls_e, bbox should be assigned to cls_t iff cls_t satisfies

o_c(bbox, cls_t) \geq \tau_{iou}, \qquad o_c(bbox, cls_t) \geq o_c(bbox, cls_i) \;\; \forall cls_i \in cls_e    (11)

where o_c(·, ·) is the IoU computed against the cluster head, cls_hd. Convergence is decided based on the symmetric bijection f_eq shown in Eq. (12), with Js the Jaccard similarity:

f_{eq}(A, B) = \begin{cases} \text{True} & \text{if } \forall a_i \in A, \; \exists b_j \in B \;\; \text{s.t.} \;\; J_s(a_i, b_j) \geq \delta_{J_s} \\ \text{False} & \text{otherwise} \end{cases}    (12)

J_s(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}    (13)

To improve convergence, we update the similarity threshold \delta_{J_s} with Sigmoid annealing, inspired by the Cosine annealing in [29]:

\delta_{J_s} = \delta_{min} + (\delta_{max} - \delta_{min}) \cdot \mathrm{Sig}\!\left( N_0 \left( 1 - 2 \cdot \frac{i_{iteration}}{max\_iter} \right) \right)    (14)

where Sig(·) denotes the Sigmoid function and N_0 = 6 is the Sigmoid saturation, set based on preliminary experimentation. After obtaining the cluster heads and cluster members from the above algorithm, the cluster head represents each of its cluster members. If there exist multiple predicates between any two cluster heads, at most the top K predicates (ordered by decreasing triplet confidence) are retained while all others are dropped. We strongly believe that limiting the number of predicate predictions for a (subject, object) pair is a necessity, analogous to performance measurement in other computer vision tasks such as image classification, wherein the top-1 or top-5 accuracy is usually considered. After cross-validation experiments, we observe that K = 4 yields the best scores for all the considered metrics.



Algorithm 1 Categorical Clustering
Input: a set of bounding boxes for a certain object category, bboxes; maximum iteration i_max
1:  Initialize cls, cls_last
2:  for i_iteration = 1, 2, ..., i_max do
3:      for bbox in bboxes do
4:          get cls_t that satisfies Eq. (11)
5:          if cls_t exists then
6:              add bbox to cls_t
7:          else
8:              create a new cluster based on bbox
9:          end if
10:     end for
11:     Calculate f_eq(cls, cls_last) from Eqs. (12)–(14)
12:     if f_eq(cls, cls_last) then
13:         Early stop
14:     else
15:         cls_last = cls
16:         For each cluster, keep cls_hd and reset the member list
17:     end if
18: end for
19: return cls
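As a concrete illustration of Eqs. (10)–(14) and Algorithm 1, the following self-contained Python sketch clusters the boxes of a single object category. The box layout, the incremental head update, the empty-cluster pruning, the one-directional convergence check and the default thresholds (tau_iou, delta_min, delta_max) are our own illustrative assumptions; this is not the authors' released implementation.

import math

def iou(a, b):
    # a, b: axis-aligned boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def box_mean(boxes):
    # cluster head cls_hd: coordinate-wise average of the member boxes
    return tuple(sum(c) / len(boxes) for c in zip(*boxes))

def jaccard(x, y):
    # Eq. (13): generalized Jaccard similarity between two coordinate vectors
    return sum(min(a, b) for a, b in zip(x, y)) / sum(max(a, b) for a, b in zip(x, y))

def categorical_clustering(bboxes, tau_iou=0.5, max_iter=10,
                           delta_min=0.5, delta_max=0.95, n0=6):
    clusters, heads, heads_last = [], [], []
    for it in range(1, max_iter + 1):
        heads = list(heads_last)                 # Algorithm 1, line 16: keep heads, reset members
        clusters = [[] for _ in heads]
        for bbox in bboxes:
            # Eq. (11): assign to the best-overlapping head, provided the IoU reaches tau_iou
            overlaps = [iou(bbox, h) for h in heads]
            best = max(range(len(heads)), key=lambda k: overlaps[k]) if heads else -1
            if best >= 0 and overlaps[best] >= tau_iou:
                clusters[best].append(bbox)
                heads[best] = box_mean(clusters[best])
            else:
                clusters.append([bbox])          # open a new cluster seeded by this box
                heads.append(bbox)
        kept = [(c, h) for c, h in zip(clusters, heads) if c]   # drop heads with no members
        clusters, heads = [c for c, _ in kept], [h for _, h in kept]
        # Eq. (14): sigmoid-annealed Jaccard threshold, Sig(z) = 1 / (1 + exp(-z)), N0 = 6
        delta_js = delta_min + (delta_max - delta_min) / (1 + math.exp(-n0 * (1 - 2 * it / max_iter)))
        # Eq. (12): converged when every current head has a close counterpart among the previous heads
        if heads_last and all(any(jaccard(h, h0) >= delta_js for h0 in heads_last) for h in heads):
            break                                # early stop (Algorithm 1, line 13)
        heads_last = heads
    return clusters, heads

A full pipeline would run this per object category over the predicted boxes and then keep at most the top K (= 4) predicates, ordered by triplet confidence, between any two cluster heads, as described above.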

Table 1. Duplicate objects & predicates in ground truth (Avg: average; Objs: objects; Preds: predicates).

                        VG-MSDN Train   VG-MSDN Test   VG Orig
Total objects           576250          124911         2254205
Total predicates        507296          111396         2315975
Duplicate objects       54463           12216          133872
Duplicate predicates    52851           11809          519317
Avg objects/image       12.48           12.49          20.86
Avg predicates/image    10.99           11.14          21.43


4. Experiments and analysis
4.1. Dataset
4.1.1. Public datasets
Visual Genome (VG) [15] is commonly employed as a benchmark dataset to train and evaluate scene graph models. VG consists of 108,077 annotated images. Despite its de facto acceptance as ground truth, there are a number of problems in the annotations that adversely affect the performance of scene graph generation models, most notably the existence of duplicate objects and predicates, and incorrect predicates. The cleansed Visual Genome (VG-MSDN) was prepared by [16,20] after removing typographic noise, infrequent object and relationship categories, and small objects and regions (less than 16 and 32 pixels on the short side of objects and regions, respectively). Visual Relationship Detection (VRD) [14] is a small benchmark dataset on which most existing methods are evaluated [16,20]. It consists of 4000 train images, 1000 test images, 100 object categories and 70 relationship categories.

Table 2. Synonymous predicate dictionary (Syndict) examples.

Predicate (Hyponym)   Synonym or Hypernym
above                 be_on, near, on, on_a, on_top_of, over
along                 by
at                    on, be_on
attach_to             on, be_on
be_in                 in, in_a, inside, inside_of
be_on                 on
behind                near, by
below                 under
beside                by, near, next_to, on_side_of
by                    beside, near, next_to, on_side_of
carry                 have, hold, have_a
cover                 on, over
hang_from             attach_to, hang_on
hang_on               on
have                  have_a, with
have_a                have, with
in                    be_in, in_a, inside, inside_of
in_a                  be_in, in, inside, inside_of
in_front_of           on_front_of
inside                be_in, in, in_a, inside_of
inside_of             be_in, in, in_a, inside
lay_on                on
look_at               watch
near                  by
next_to               beside, by, near, on_side_of
of                    of_a
of_a                  of
on                    above, be_on, on_top_of, on_a
on_a                  on, above, be_on, on_top_of, over
on_front_of           in_front_of
on_side_of            beside, by, near, next_to
on_top_of             above, on, on_a, be_on, near, over

4.1.2. De-duplicated Cleansed Visual Genome (dcVG)³
Although VG-MSDN improves on the original VG, it still contains a large number of duplicate objects and predicates, as well as incorrect predicates. Table 1 gives a summary of the duplicate objects and predicates in the existing datasets. To remove object duplicates, we group objects with the same label and IoU ≥ θ_d into a single object; heuristically, θ_d is chosen to be 0.7. The largest object is kept among all objects satisfying this criterion, and the corresponding relationships are updated. A predicate is considered redundant if there exists more than one relation between the same pair of subject and object. For a given region the system outputs only one predicate, the one with the highest confidence. Therefore, the presence of multiple predicates in the ground truth can unduly increase or decrease the recall. We apply the following treatments to such triplets (a code sketch of both de-duplication steps is given below):

• If the duplicate predicate labels are the same, we simply remove the redundant label. There are 11809 relations with duplicate predicates in the test set.
• If the predicate labels are different, a synonym dictionary (referred to as syndict in the following discussion) is used to resolve the issue. A Hyponym (specific) - Hypernym (general) mapping is constructed for all the predicates; a few examples are shown in Table 2 for illustration. In case the predicates are in the syndict mapping, e.g. (spoon, in, cup) and (spoon, inside_of, cup), we keep the hyponym and discard the hypernym at this stage. At prediction time, the prediction inside_of is considered correct if the ground truth is in.


It should be noted that the duplicated predicate merging process significantly changes the distribution of different predicates. For fair comparison, we will present the results with/without synonym dictionary as different experiment conditions.
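The two de-duplication treatments above can be sketched as follows. The annotation layout (objects as (label, box) tuples, relations as (subject index, predicate, object index) triplets), the greedy merge order and the syndict lookup direction are our own illustrative assumptions, not the released dcVG preparation code; iou() is as defined in the clustering sketch above.

def dedup_objects(objects, relations, theta_d=0.7):
    # Merge same-label objects with IoU >= theta_d, keep the largest box, remap relations.
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    mapping, reps = {}, []            # mapping: original index -> representative index
    for i, (label, box) in enumerate(objects):
        target = None
        for r in reps:
            r_label, r_box = objects[r]
            if label == r_label and iou(box, r_box) >= theta_d:
                target = r
                break
        if target is None:
            reps.append(i)
            mapping[i] = i
        else:
            if area(box) > area(objects[target][1]):   # keep the larger box as representative
                objects[target] = (label, box)
            mapping[i] = target
    new_relations = [(mapping[s], p, mapping[o]) for s, p, o in relations]
    return [objects[r] for r in reps], new_relations

def dedup_predicates(relations, syndict):
    # Keep one predicate per (subject, object) pair; prefer the hyponym when the pair is synonymous.
    kept = {}
    for s, p, o in relations:
        prev = kept.get((s, o))
        if prev is None:
            kept[(s, o)] = p
        elif prev != p and prev in syndict.get(p, ()):  # prev is a hypernym of p -> keep hyponym p
            kept[(s, o)] = p
    return [(s, p, o) for (s, o), p in kept.items()]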

³ The code and cleansed dataset will be shared.


Table 3. Comparison of existing methods on visual phrase detection (PhrDet) and scene graph generation (SGGen) (%). FNet* refers to the results we obtained with the author's pretrained model.

Dataset   Model             PhrDet Top50   PhrDet Top100   SGGen Top50   SGGen Top100
VG        MPS [13]          15.87          19.45           8.32          10.88
          MSDN [21]         19.95          24.93           10.72         14.22
          Graph-RCNN [23]   –              –               11.1          14.0
          CISC [30]         –              –               11.4          13.9
          FNet [16]         22.84          28.57           13.06         16.47
          FNet*             22.60          28.36           12.85         16.30
VRD       MSDN              19.95          24.93           10.72         14.22
          FNet              26.03          30.77           18.32         21.20
          FNet*             26.57          34.03           19.35         24.53

Table 4. Experimental results on VRD (%; cls denotes clustering; Avg denotes average across Top-N conditions; Rel. Imp. denotes relative improvement).

Metric   Model   Top-20   Top-30   Top-40   Top-50   Top-100   Avg     Rel. Imp.
PhrDet   FNet    17.29    20.74    23.85    26.57    34.03     24.49   –
         cls     17.72    21.24    24.62    27.04    34.06     +0.44   1.80
SGGen    FNet    12.50    14.85    17.44    19.35    24.53     17.73   –
         cls     12.79    15.34    17.84    19.77    24.46     +0.31   1.73
coh      FNet    11.79    13.82    16.07    17.72    21.95     16.27   –
         cls     12.62    15.08    17.47    19.32    23.79     +1.39   8.52
ed       FNet    12.46    14.79    17.34    19.23    24.31     17.63   –
         cls     12.92    15.39    17.85    19.63    24.16     +0.36   2.07

Fig. 6. Gcls vs Gno_cls : Participant perceptions.

4.2. Evaluation metrics
Apart from our proposed metrics, the Graph Coherence Score (coh) and the Edge Distinctness Score (ed), we adopt two universal evaluation tasks for scene graph generation [16,21,27]:
• Scene graph generation (SGGen), or Relation Recall, measures the performance of detecting (with an IoU of at least 0.5) and classifying objects and predicting their pair-wise relationships.
• Phrase Recall (PhrDet) [16], or Visual Phrase Detection, detects the triplet phrases, i.e. (subject, predicate, object). Visual phrase detection only requires one bounding box for the entire phrase to match the corresponding ground truth union box [16].
We report all our experiments in terms of Recall@N metrics, where N = 20, 30, 40, 50, 100. Experimental results are given in Tables 3 to 6.

4.3. Qualitative assessment
Fig. 7 compares the scene graph output without clustering to the output with clustering. It can be seen that clustering benefits both the qualitative and quantitative aspects of scene graph evaluation by potentially bubbling up new relations and reducing graph fragmentation. In order to capture the descriptive power of our proposed metrics over the Top-N Recall, we conducted the following subjective evaluation study amongst a group of 20 participants (each possessing at least an undergraduate background in computer science). The participants were separately presented with 20 sets of scene graphs. Each set contains the ground truth scene graph Ggt, the scene graph without clustering Gno_cls and the scene graph with clustering Gcls. For a fair comparison, Gno_cls and Gcls are

selected such that both have the same Top-100 recall but Gcls has higher Graph Coherence and Edge Distinctness Scores. Only the objects and relationships which overlap with the ground truth were displayed for both Gno_cls and Gcls. The participants were asked to relatively evaluate Gno_cls and Gcls (randomly ordered and unlabeled) on the basis of structural cleanliness and similarity to Ggt. The scale between Gno_cls and Gcls is divided into 7 sections, where the midpoint indicates that Gno_cls and Gcls are equally good/bad on the selected criterion. The results of the subjective evaluation are displayed in Fig. 6. From Fig. 6(a), it can be observed that more than half of the votes accumulated towards the perception that Gcls outperforms Gno_cls in terms of structural cleanliness. Further, from Fig. 6(b), nearly half of the total votes counted towards the opinion that Gno_cls and Gcls are equally similar to the ground truth. This aligns with our study setup of selecting only scene graphs with the same Top-100 Recall. In fact, Gcls secured nearly half the votes as being more aligned with Ggt than Gno_cls. We theorize that this is due to participant bias towards structurally clean graphs, since Ggt is structurally clean in most cases. Hence, from the above observations of the user study, we can conclude that, given the stationary condition of equal Top-100 Recall for both Gno_cls and Gcls, the participants' perception of structural cleanliness is strongly associated with our proposed metrics.

4.4. Experimental results
Comparisons against MPS [13], MSDN [21], Graph-RCNN [23], CISC [30] and FNet [16] are made on both datasets using results reported in the literature. It has to be noted that all comparison targets are one-step approaches, which do not have an explicit object detection step. Further discussion regarding generalization to two-step approaches is given in Section 4.5.3. Based on the results in Table 3, it can be seen that FNet achieves the best results. Therefore, we present a detailed comparison with FNet in Tables 4 to 6.


Fig. 7. Visualization of cluster results.

Table 5. Experimental results on VG-MSDN (%; cls denotes clustering; Syn denotes Syndict; Avg denotes average across Top-N conditions; Rel. Imp. denotes relative improvement).

Metric   Model     Top-20   Top-30   Top-40   Top-50   Top-100   Avg     Rel. Imp.
PhrDet   FNet      15.77    18.77    20.90    22.60    28.36     21.28   –
         cls       16.31    19.27    21.34    23.02    28.86     +0.48   2.25
         Syn       17.86    21.20    23.60    25.52    31.98     +2.75   12.93
         cls+Syn   18.46    21.77    24.12    26.00    32.52     +3.29   15.48
SGGen    FNet      8.63     10.48    11.77    12.85    16.30     12.01   –
         cls       9.01     10.83    12.14    13.24    16.67     +0.37   3.10
         Syn       9.70     11.78    13.24    14.45    18.37     +1.50   12.51
         cls+Syn   10.03    12.09    13.56    14.77    18.68     +1.82   15.16
coh      FNet      8.46     10.48    11.48    12.52    15.76     11.74   –
         cls       8.99     10.81    12.12    13.21    16.62     +0.61   5.20
         Syn       9.49     11.49    12.89    14.05    17.72     +1.39   11.82
         cls+Syn   9.97     12.00    13.44    14.64    18.45     +1.96   16.70
ed       FNet      8.49     10.29    11.53    12.58    15.84     11.75   –
         cls       8.81     10.53    11.78    12.81    15.90     +0.22   1.87
         Syn       9.54     11.53    12.91    14.04    17.47     +1.35   11.51
         cls+Syn   9.87     11.77    13.15    14.27    17.46     +1.56   13.26

Table 6. Experimental results on dcVG (%; cls denotes clustering; Syn denotes Syndict; Avg denotes average across Top-N conditions; Rel. Imp. denotes relative improvement).

Metric   Model     Top-20   Top-30   Top-40   Top-50   Top-100   Avg     Rel. Imp.
PhrDet   FNet      15.34    18.31    20.40    22.10    27.84     20.80   –
         cls       15.88    18.80    20.86    22.53    28.35     +0.49   2.33
         Syn       17.39    20.73    23.11    25.03    31.47     +2.75   13.21
         cls+Syn   18.00    21.30    23.63    25.52    32.03     +3.30   15.86
SGGen    FNet      8.26     10.06    11.31    12.39    15.77     11.56   –
         cls       8.57     10.36    11.62    12.38    16.06     +0.24   2.08
         Syn       9.31     11.34    12.76    13.97    17.82     +1.48   12.82
         cls+Syn   9.66     11.67    13.12    14.32    18.16     +1.83   15.82
coh      FNet      8.06     9.78     10.97    11.99    15.13     11.19   –
         cls       8.50     10.28    11.52    12.56    15.84     +0.55   4.95
         Syn       9.07     11.00    12.35    13.97    17.04     +1.50   13.41
         cls+Syn   9.58     11.57    12.99    14.16    17.89     +2.05   18.34
ed       FNet      8.14     9.89     11.09    12.13    15.34     11.32   –
         cls       8.43     10.14    11.36    12.39    15.45     +0.24   2.09
         Syn       9.17     11.11    12.46    13.60    16.97     +1.34   11.87
         cls+Syn   9.48     11.38    12.74    13.85    17.02     +1.58   13.92

Table 4 gives the experimental results on VRD. For PhrDet, with the proposed simple clustering method, the performance improved by about 0.44% on average, leading to a relative improvement of 1.80%. Similarly, for SGGen, system performance improved by about 0.31% on average, which means a relative improvement of 1.73%. When it comes to the proposed graph metrics, much larger gains can be observed. The average improvement for Scoh is about 1.39%, while Sed improves by about 0.36% on average. The relative improvements are 8.52% and 2.07% for Scoh and Sed, respectively. Table 5 gives the results on VG-MSDN. Since we constructed the synonym dictionary for VG, two separate algorithm settings are presented. For PhrDet, clustering improves the result by 0.48% on average, leading to a relative improvement of 2.25%. Clustering together with the synonym dictionary achieves an average improvement of 3.29%, meaning a relative improvement of 15.48%. For SGGen, clustering alone brings an average improvement of about 0.37% throughout all Top-N levels, which means a relative improvement of 3.10% over the baseline. With the synonym dictionary, our proposed algorithm (cls+Syn) achieves an average improvement of 1.82% in SGGen. For our proposed graph scores, similar improvements can be observed. Clustering achieves an average gain of about 0.61% (5.2% relative improvement) and 0.22% (1.87% relative improvement) on the coherence score and the edge distinctness score respectively. With clustering and the synonym dictionary

together, an average improvement of 2.05% (18.34% relative improvement) and 1.56% (13.26% relative improvement) can be observed for Scoh and Sed, respectively. Table 6 gives the results based on our cleaned VG. The proposed algorithm, i.e. clustering with/without the synonym dictionary, achieves better performance throughout all conditions. For PhrDet, clustering alone achieves an average improvement of 0.49%, while clustering with the synonym dictionary achieves a 3.30% improvement on average. The relative improvements are 2.33% and 15.86% over the baseline, respectively. Similarly, for SGGen, clustering improves the system by 0.24% on average. With the synonym dictionary added, we observe gains of 1.83% on average. The relative improvements are 2.08% and 15.82%, respectively. For the coherence score, the average improvement is about 0.55% from clustering and 2.05% from clustering with the synonym dictionary. The relative improvements are 4.95% and 18.34%, respectively. Similarly, the edge distinctness score is improved by about 0.24% from clustering on average. With the synonym dictionary added, the improvement is about 1.58%. The relative improvements are 2.09% and 13.92%, respectively.

4.5. Analysis & discussion
4.5.1. Graph quality score
While we note that both our clustering algorithm and our Syndict approach yield large gains across all metrics, we would


Fig. 8. The impact of clustering threshold. Experiments are conducted using FNet backbone on VRD.

Fig. 9. Graphical Contrastive Losses (GCL) experiments based on the VRD dataset. (Experiments based on the author's code from https://github.com/NVIDIA/ContrastiveLosses4VRD/.)

like to highlight some subtleties. As a refresher, we noted in Section 3.3.3 that the upper bound of Scoh is SRec (interchangeably SGGen). By observing the experiments without clustering across all datasets, we notice a large gap between Scoh and SRec. Evidently, our clustering algorithm consistently achieves an Scoh that is substantially closer to SRec than FNet. This large reduction in the score gap implies that by using clustering, we are able to push the structural coherence of the graph closer to its potential maximum. The advantage of Scoh over SRec becomes clear by observing the Top-100 SGGen and coh values in Table 4. We note that although FNet has a slightly higher SRec, its Scoh lags behind by about 2%, implying heavy node fragmentation from FNet. The fact that the above-mentioned node fragmentation would have gone undetected without Scoh further necessitates the inclusion of structural-coherence-based metrics for SG evaluation. On further inspection, we note for this case that the slight dip of ed for clustering is due to the slight reduction of SRec.

4.5.2. Clustering parameters
Clustering merges redundant boxes that satisfy a certain IoU threshold. Fig. 8 shows how this heuristic parameter affects the

overall results. It can be seen that the clustering parameter has more impact on the different scores at higher Top-N levels. In Fig. 8(a), as the cluster threshold changes, the improvement in the evaluation scores generally follows Top-100 > Top-50 > Top-20 > Top-10. With the same target SG, i.e. the ground truth, a larger Top-N means the system is given more chances to guess, which in turn yields more predicted bounding boxes for the 'easy' objects. The resultant SG contains more redundancy; thus clustering brings more benefit, since it is intended for reducing redundancy. On the other hand, the proposed metrics are designed to capture redundancy in SGs. It can be clearly seen that there are much larger changes in the graph quality scores (Fig. 8(b)) than in the classical metrics (Fig. 8(a)). This indicates that the same level of graph redundancy change, i.e. the same clustering threshold, leads to a larger change in the graph quality scores.

4.5.3. Ablation study
The central purpose of the paper is to reveal a common problem in scene graph algorithms, i.e. redundant predictions. The clustering method described in Section 3.4 serves as a simple but effective approach to tackle the redundancy problem. It is a general method and can be coupled with any SG algorithm. In order to better demonstrate the generalizability to two-step approaches, we show further experimental results of our proposed method on top of Graphical Contrastive Losses (GCL) [19]. GCL explicitly detects object regions before predicate classification.


Fig. 10. Detailed experimental results for different parameter settings (optimal parameter settings).

Fig. 11. Detailed experimental results for different parameter settings.

Thus, non-maximum suppression (NMS) can be directly applied at the detection step, before scene graph generation. NMS is a widely used algorithm in object detection to remove redundant proposals [17]. Therefore, in the following discussion, we compare results showing the interaction between clustering and NMS. Fig. 9 gives a detailed view of the GCL results. It can be clearly seen that even in the presence of NMS, clustering manages to introduce a substantial amount of improvement to the system. More improvement can be observed at larger Top-N values. Fig. 10 shows how different parameters influence the final results around the optimal settings, i.e. both NMS and clustering thresholds set at 0.4. As the clustering threshold increases, the system performance scores get better; as the NMS threshold increases, the system performance scores gradually become worse. It has to be noted that GCL incorporates NMS before SG generation. Therefore, if the NMS threshold is smaller than the clustering threshold, the experimental results show only minor changes (both Figs. 10(a) and 10(b) show an almost horizontal range). This is because NMS and clustering both utilize IoU to screen candidate regions; if the NMS threshold is smaller than the clustering threshold, almost all of the candidate regions are handled by NMS.

Fig. 10 shows the impact of different parameters around the optimal point. The score changes are very subtle. Thus, to give a better view of the interaction between the parameters, we show another interaction plot in Fig. 11. It can be seen that when we limit the impact of NMS (threshold set to 0.9), clustering takes the major responsibility for removing redundancy and reaches its optimum at around 0.3 to 0.4. Similarly, if we limit the impact of clustering by setting the clustering threshold to 0.9, NMS takes the lead in removing redundancy and reaches its optimal point at around 0.3 to 0.4. Furthermore, NMS has been widely accepted as a standard step to remove redundant object detections. Nevertheless, NMS is not able to completely remove redundancy in SGs, which clearly supports our claim regarding the inherent problem of SG algorithms.
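The NMS-then-clustering pipeline studied in this ablation can be sketched as follows, reusing iou() and categorical_clustering() from the earlier sketches. The greedy plain-Python NMS and the 0.4 default thresholds are illustrative choices for the ablation setup, not the GCL code.

def nms(boxes, scores, iou_threshold=0.4):
    # greedy non-maximum suppression; returns indices of kept boxes
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

def nms_then_cluster(boxes, scores, labels, nms_thr=0.4, cluster_thr=0.4):
    # per category: suppress near-duplicate boxes with NMS, then merge the survivors by clustering
    representatives = {}
    for label in set(labels):
        idx = [i for i, l in enumerate(labels) if l == label]
        kept = [idx[k] for k in nms([boxes[i] for i in idx], [scores[i] for i in idx], nms_thr)]
        _, heads = categorical_clustering([boxes[i] for i in kept], tau_iou=cluster_thr)
        representatives[label] = heads   # one representative box per remaining cluster
    return representatives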

5. Conclusions
Scene graph systems are designed to succinctly represent images as interactions between objects through relationships, in the form of nodes and edges. State-of-the-art SG systems tend to generate redundant predictions, which reduces the semantic


quality of SGs. We show that the widely used recall-based evaluation method cannot effectively capture the structural fragmentation issues caused by redundant predictions. As a solution, two new metrics, the Graph Coherence Score and the Edge Distinctness Score, are proposed to bridge the gap. Furthermore, we propose an effective clustering methodology to tackle the redundancy problem. Extensive experiments show that our system brings significant gains in terms of both our proposed and the traditional evaluation metrics. Lastly, the qualitative assessment through subjective analysis also reveals that a majority of the participants perceive that our system yields significantly better semantic quality in terms of structural cleanliness while maintaining similarity to the ground truth graph.

CRediT authorship contribution statement
Varshanth Rao: Conceptualization, Methodology, Software, Writing - original draft, Visualization. Peng Dai: Conceptualization, Methodology, Writing - review & editing, Visualization, Supervision. Sidharth Singla: Methodology, Software.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
[1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, SSD: Single shot multibox detector, in: Computer Vision, ECCV 2016, Springer, Amsterdam, The Netherlands, 2016, pp. 21–37.
[2] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, NV, USA, 2016, pp. 779–788.
[3] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, CoRR abs/1409.1556.
[4] S. Ren, K. He, R.B. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.
[5] Y. Liu, D. Zhang, G. Lu, W. Ma, A survey of content-based image retrieval with high-level semantics, Pattern Recognit. 40 (1) (2007) 262–282.
[6] J. Johnson, R. Krishna, M. Stark, L. Li, D.A. Shamma, M.S. Bernstein, F. Li, Image retrieval using scene graphs, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015, pp. 3668–3678.
[7] S. Schuster, R. Krishna, A.X. Chang, F. Li, C.D. Manning, Generating semantically precise scene graphs from textual descriptions for improved image retrieval, in: Proceedings of the Fourth Workshop on Vision and Language, VL@EMNLP 2015, Lisbon, Portugal, September 18, 2015, 2015, pp. 70–80.
[8] J. Johnson, A. Gupta, L. Fei-Fei, Image generation from scene graphs, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, 2018, pp. 1219–1228.
[9] C.L. Zitnick, D. Parikh, L. Vanderwende, Learning the visual interpretation of sentences, in: IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1–8, 2013, 2013, pp. 1681–1688.
[10] M. Yatskar, L.S. Zettlemoyer, A. Farhadi, Situation recognition: Visual semantic role labeling for image understanding, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, 2016, pp. 5534–5542.

[11] A.X. Chang, M. Savva, C.D. Manning, Learning spatial knowledge for text to 3D scene generation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014, pp. 2028–2038.
[12] D. Teney, L. Liu, A. van den Hengel, Graph-structured representations for visual question answering, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, 2017, pp. 3233–3241.
[13] D. Xu, Y. Zhu, C.B. Choy, L. Fei-Fei, Scene graph generation by iterative message passing, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, 2017, pp. 3097–3106.
[14] C. Lu, R. Krishna, M.S. Bernstein, F. Li, Visual relationship detection with language priors, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, the Netherlands, October 11–14, 2016, Proceedings, Part I, 2016, pp. 852–869.
[15] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D.A. Shamma, M.S. Bernstein, F. Li, Visual genome: Connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vis. 123 (1) (2017) 32–73.
[16] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, X. Wang, Factorizable net: An efficient subgraph-based framework for scene graph generation, in: 15th European Conference on Computer Vision, ECCV, 2018, pp. 346–363.
[17] N. Bodla, B. Singh, R. Chellappa, L.S. Davis, Soft-NMS: Improving object detection with one line of code, in: 2017 IEEE International Conference on Computer Vision, ICCV, 2017, pp. 5562–5570, http://dx.doi.org/10.1109/ICCV.2017.593.
[18] W. Liao, S. Lin, B. Rosenhahn, M.Y. Yang, Natural language guided visual relationship detection, 2017, CoRR abs/1711.06032.
[19] J. Zhang, K.J. Shih, A. Elgammal, A. Tao, B. Catanzaro, Graphical contrastive losses for scene graph parsing, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.
[20] Y. Li, W. Ouyang, X. Wang, X. Tang, ViP-CNN: Visual phrase guided convolutional neural network, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, 2017, pp. 7244–7253.
[21] Y. Li, W. Ouyang, B. Zhou, K. Wang, X. Wang, Scene graph generation from objects, phrases and region captions, in: IEEE International Conference on Computer Vision, ICCV, 2017, pp. 1270–1279.
[22] B. Dai, Y. Zhang, D. Lin, Detecting visual relationships with deep relational networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, 2017, pp. 3298–3308.
[23] J. Yang, J. Lu, S. Lee, D. Batra, D. Parikh, Graph R-CNN for scene graph generation, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part I, 2018, pp. 690–706.
[24] R. Zellers, M. Yatskar, S. Thomson, Y. Choi, Neural motifs: Scene graph parsing with global context, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, 2018, pp. 5831–5840.
[25] R. Girshick, Fast R-CNN, in: 2015 IEEE International Conference on Computer Vision, ICCV, IEEE, 2015, http://dx.doi.org/10.1109/iccv.2015.169.
[26] T. Chen, W. Yu, R. Chen, L. Lin, Knowledge-embedded routing network for scene graph generation, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.
[27] J. Yang, J. Lu, S. Lee, D. Batra, D. Parikh, Graph R-CNN for scene graph generation, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 670–685.
[28] R. Yu, A. Li, V.I. Morariu, L.S. Davis, Visual relationship detection with internal and external linguistic knowledge distillation, in: IEEE International Conference on Computer Vision, ICCV, IEEE, 2017, http://dx.doi.org/10.1109/iccv.2017.121.
[29] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, in: International Conference on Learning Representations (ICLR) 2017 Conference Track, 2017.
[30] W. Wang, R. Wang, S. Shan, X. Chen, Exploring context and visual pattern of relationship for scene graph generation, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, IEEE, 2019, http://dx.doi.org/10.1109/cvpr.2019.00838.
