Robust detection and segmentation of cell nuclei in biomedical images based on a computational topology framework

Robust detection and segmentation of cell nuclei in biomedical images based on a computational topology framework

Accepted Manuscript Robust detection and segmentation of cell nuclei in biomedical images based on a computational topology framework Rodrigo Rojas-M...

17MB Sizes 0 Downloads 35 Views

Accepted Manuscript

Robust detection and segmentation of cell nuclei in biomedical images based on a computational topology framework Rodrigo Rojas-Moraleda, Wei Xiong, Niels Halama, Katja Breitkopf-Heinlein, Steven Dooley, Luis Salinas, Dieter W. Heermann, Nektarios A. Valous PII: DOI: Reference:

S1361-8415(17)30036-1 10.1016/j.media.2017.02.009 MEDIMA 1234

To appear in:

Medical Image Analysis

Received date: Revised date: Accepted date:

18 April 2016 20 February 2017 28 February 2017

Please cite this article as: Rodrigo Rojas-Moraleda, Wei Xiong, Niels Halama, Katja Breitkopf-Heinlein, Steven Dooley, Luis Salinas, Dieter W. Heermann, Nektarios A. Valous, Robust detection and segmentation of cell nuclei in biomedical images based on a computational topology framework, Medical Image Analysis (2017), doi: 10.1016/j.media.2017.02.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Highlights • We propose an automatic segmentation algorithm for histological images. • The algorithm uses persistent homology in a computational topology framework. • We evaluated 856 images and compared with automated output validated by experts.

CR IP T

• The approach successfully detected cell nuclei; overall per-object accuracy: 84.6%.

AC

CE

PT

ED

M

AN US

• Fully automated detection provides tangible benefits for clinical decision-making.

1

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

Graphical abstract

2

ACCEPTED MANUSCRIPT

Robust detection and segmentation of cell nuclei in biomedical images based on a computational topology framework Rodrigo Rojas-Moraledaa,e , Wei Xiongb , Niels Halamac , Katja Breitkopf-Heinleind , Steven Dooleyd , Luis Salinase , Dieter W. Heermannb , Nektarios A. Valousa,∗ a Applied

Tumor Immunity Clinical Cooperation Unit, National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany Physics and Theoretical Biophysics Group, Institute for Theoretical Physics, Heidelberg University, Heidelberg, Germany c Department of Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany d Department of Medicine II, Faculty of Medicine at Mannheim, Heidelberg University, Mannheim, Germany e Department of Informatics, Technical University Federico Santa Mar´ıa, Valparaiso, Chile

CR IP T

b Statistical

Abstract

PT

ED

M

AN US

The segmentation of cell nuclei is an important step towards the automated analysis of histological images. The presence of a large number of nuclei in whole-slide images necessitates methods that are computationally tractable in addition to being effective. In this work, a method is developed for the robust segmentation of cell nuclei in histological images based on the principles of persistent homology. More specifically, an abstract simplicial homology approach for image segmentation is established. Essentially, the approach deals with the persistence of disconnected sets in the image, thus identifying salient regions that express patterns of persistence. By introducing an image representation based on topological features, the task of segmentation is less dependent on variations of color or texture. This results in a novel approach that generalizes well and provides stable performance. The method conceptualizes regions of interest (cell nuclei) pertinent to their topological features in a successful manner. The time cost of the proposed approach is lower-bounded by an almost linear behavior and upper-bounded by O(n2 ) in a worst-case scenario. Time complexity matches a quasilinear behavior which is O(n1+ε ) for ε < 1. Images acquired from histological sections of liver tissue are used as a case study to demonstrate the effectiveness of the approach. The histological landscape consists of hepatocytes and nonparenchymal cells. The accuracy of the proposed methodology is verified against an automated workflow created by the output of a conventional filter bank (validated by experts) and the supervised training of a random forest classifier. The results are obtained on a per-object basis. The proposed workflow successfully detected both hepatocyte and non-parenchymal cell nuclei with an accuracy of 84.6%, and hepatocyte cell nuclei only with an accuracy of 86.2%. A public histological dataset with supplied ground-truth data is also used for evaluating the performance of the proposed approach (accuracy: 94.5%). Further validations are carried out with a publicly available dataset and ground-truth data from the Gland Segmentation in Colon Histology Images Challenge (GlaS) contest. The proposed method is useful for obtaining unsupervised robust initial segmentations that can be further integrated in image/data processing and management pipelines. The development of a fully automated system supporting a human expert provides tangible benefits in the context of clinical decision-making.

1. Introduction

CE

Keywords: computational topology; persistent homology; cell nuclei; image segmentation.

AC

Topology concerns the study of fundamental and intrinsic properties of spaces which are invariant to continuous deformations, extrinsic evaluations (Weatherson and Marshall, 2014) or changes in magnitude e.g. scale, shape, distance, intensity, etc. Bernhard Riemann discovered the modern concept of topology by the second half of the 19th century. However, the foundations of topology and homology are due to the work of Henry Poincare, Enrico Betti, and Emmy Noether. Topology became an influential field of research by the middle of the 20th century. For most of its history topology has not been applied to the scientific domain due to its abstract and qualitative nature. However, during the last ten years it has emerged as one of the ∗ Corresponding

author (e-mail address: [email protected])

Preprint submitted to Medical Image Analysis

most effective representation models for exploring massive and complex data sets (Gyulassy et al., 2009) gaining all sorts of attention in the fields of computational science and big data analytics. Topological invariants have great potential in the image processing field since they are related to structural properties, resulting in less sensitivity regarding changes in scale and intensity, thus simplifying tasks such as image segmentation. One topological invariant extensively used in image and data analysis is the Betti number. Betti numbers parametrize space connectivity, i.e. the 0-dimensional Betti number indicates the number of connected components if the space is embedded in a Euclidean space, the 1-dimensional Betti number shows the number of independent two dimensional (2D) holes, and the 2dimensional Betti number indicates the number of independent three dimensional (3D) voids. Thus, Betti numbers are valuable March 4, 2017

ACCEPTED MANUSCRIPT

towards the automated analysis of histological images. Image segmentation methods are based on the assessment of discontinuities or similarities in the desired signals (de Oliveira et al., 2013). The automated segmentation of cell nuclei is now a well-studied topic for which a large number of algorithms are available in the literature and newer methods continue to be investigated (Al-Kofahi et al., 2010). A major limitation of traditional segmentation methods such as histogram- or thresholdbased techniques is the under/over-segmentation of cell nuclei (Xu et al., 2014). In order to address this and other issues, researchers proposed a variety of techniques: automated approaches to cell detection (intensity-based, region-based, active contours/level sets, probabilistic and graphical models) are reviewed in detail here (Irshad et al., 2014). For example, thresholding, region growing, and watershed can locate the nuclei region, but problems arise when they try to segment the touching and overlapping nuclei. Dealing with overlapping and clustered nuclei is still a major challenge in the field. In order to tackle the problem, a variety of schemes are employed based on concavity point detection (Kong et al., 2011), distance transform (Wahlby et al., 2004), marker-controlled watershed (Cataldo et al., 2010), adaptive active contour models with shape and curvature information (Ali and Madabhushi, 2012), Gaussian mixture models and expectation maximization (Jung et al., 2010), and graphs (Al-Kofahi et al., 2010). In addition, machine learning techniques like Bayesian (Basavanhally et al., 2010), support vector machines (Veillard et al., 2013), AdaBoost (Vink et al., 2012), and deep learning via multiscale convolutional networks (Song et al., 2015) are implemented as well. Regarding machine learning approaches, the latest developments for the predictive modeling of digital pathology images are discussed here (Madabhushi and Lee, 2016). Practically, existing methods may still have issues related to under/over-segmentation, some limitations in convergence, an increased dependency on large amounts of training data, or require explicit prior knowledge of the image structure. Due to the high variability and complexity of the color/textural content of histological images, reliable cell nuclei segmentation is still a challenging task (Wang et al., 2016). The aim is to develop a method for the robust segmentation of cell nuclei in histological images based on the principles of persistent homology. To our knowledge, this is the first reported development of computational topology for detecting and segmenting cellular structures in biomedical imagery. Images acquired from histological sections of liver tissue are used as a case study to demonstrate the effectiveness of the proposed approach. A public histological dataset (H&E stained tissue) with supplied ground-truth is also used for evaluating the performance of the proposed approach. Further validations are carried out with a publicly available dataset (H&E stained tissue) and ground-truth data from the Gland Segmentation in Colon Histology Images Challenge (GlaS) contest at MICCAI 2015.

AC

CE

PT

ED

M

AN US

CR IP T

for characterizing spaces. Unfortunately, a major drawback of Betti numbers is the high computational cost (Edelsbrunner and Parsa, 2014) A standard approach for obtaining Betti numbers relies on embedding the images in a Euclidean space; pixels become vertices. Two nearby vertices span an edge, three nearby vertices span a triangle comprised of three edges, and four nearby vertices span a tetrahedron formed by four triangles and six edges. These configurations are called geometric simplicial complexes. Many algorithms can compute Betti numbers in this tessellated space (Zomorodian and Carlsson, 2005; Ghrist, 2007). The number of complexes spanned by k neighboring vertices is (2k+1 -1). Thus, a small increment in k will introduce a substantial increase of simplicial complexes that would make unattainable to compute the already computationally expensive Betti numbers. This is particularly true in persistent homology applications. This computational issue is tackled in the proposed approach by dropping the number of vertices drastically; this is achieved by splitting the image into connected components at different scales in a process called image dismantling which is a kind of data coarsening. Betti numbers are computed from the inclusion relations between the connected components. Instead of geometric simplicial complexes, abstract simplicial complexes are spanned which is a combinatorial version based on sets of sets. From this abstract representation, Betti numbers can be efficiently obtained. The reliability of the interconnections between regions is examined by persistent homology. For the segmentation problem presented in the current work, the areas with the most stable relationships are utilized. The recent advent of digital whole slide scanners has allowed for the development of quantitative histomorphometry analysis approaches which can now enable a detailed spatial interrogation of the entire tumor morphologic landscape and its most invasive elements (Madabhushi and Lee, 2016). As a result of immunohistochemical staining procedures, histological images reveal regions with variable properties in texture and color. Evaluating histological tissue sections acquired by whole slide imaging is a laborious task. The presence of a large number of nuclei in whole-slide images necessitates methods that are computationally tractable in addition to being effective. The use of automated algorithms to assist decision-making is becoming increasingly important. Essentially, workflows that implement quantitative computational approaches are especially helpful to clinicians. These systems rely on sophisticated algorithms aiming to perform measurements with reproducibility and to discern subtle details (Sadimin and Foran, 2012). Two problems have to be addressed for routine work: i) artifacts during image acquisition as well as batch effects (Kothari et al., 2013) which may bias the results, and ii) significant variance in the data given by the wide range of shape, size, color, texture, and distribution (heterogeneity) of regions of interest (ROIs) in histological sections. Overall, detection and segmentation of cell nuclei in routinely stained histopathological images pose a difficult computer vision problem due to high variability caused by a number of factors including differences in slide preparation and image acquisition (Irshad et al., 2014). The segmentation of cell nuclei is an important first step

2. Related work There have been attempts to address image segmentation problems using homology. A recent approach uses homology 4

ACCEPTED MANUSCRIPT

to characterize large sections of images according to their Betti numbers in a 2D simplicial embedding (Nakane et al., 2015). To produce Betti numbers in a predictable manner the embedding has to be fixed in advance so the classification task can be completed. In the proposed approach, the embedding is dynamically obtained after a persistent homology calculation, therefore requiring less prior knowledge. In another approach (Vaitkus, 2013), the segmentation is achieved by studying the persistent homology of a functional like watershed or edge detector deployed over the image to span a space of geometric simplicial complexes. Contrastingly, the proposed method computes the persistent homology of an abstract simplicial complex obtained from the image which is less dependent on the metric space. Another method deals mainly with the homology of geometries contained in the image (Saveliev, 2011). However, the proposed approach focuses in the persistent homology of the space of relationships between connected components obtained at different scales. Conceptually, the aim of this work is to access the shape of the space obtained when the image is fragmented at different scales. This notion brings the proposed approach into another class of methods which involves the coarsening of spaces as a fundamental task for studying their homology (Robins, 2000).

AN US

CR IP T

create a feature space that captures well the properties of the cell nuclei; ii) images are deconstructed into connected components at different scales; iii) an inclusion tree is built that stores parenthood information between connected components; iv) by traversing the inclusion tree from the leaves to its root a sequence of abstract simplicial complexes is spanned; v) persistent homology is used to summarize 0-dimensional Betti number changes when step (iv) occurs; vi) statistical inference is done over persistent homology for segmenting the cell nuclei; and vii) binary masks are created. Finally, the segmentation output is visualized.

3. Tissue preparation and image acquisition

Figure 1: Sample images acquired from histological section of liver tissue used in the analysis: a) whole slide image, and b) cropped square section.

PT

ED

M

Liver pieces are fixed in formalin and paraffin blocks are prepared. Four µm sections are stained using the rabbit-antiphospho-Stat3 (Tyr705) (D3A7) (1:50) (Cell Signaling Technology, USA) and the rabbit-HRP conjugated (1:100) (Santa Cruz Biotechnology, Germany) antibodies. Binding of the antibodies is visualized by the addition of the DAB substrate. Sections are counter-stained with hematoxylin. Glass slides are imaged at high resolution in brightfield mode at a 40-fold magnification (0.23 µm/pixel) using the Hamamatsu NanoZoomer 2.0-HT (Hamamatsu Photonics, Japan). Flat field correction is done with an empty blank slide while misalignments from the line scan are corrected with a calibration slide (Lahrmann et al., 2013). The scanner automatically detects the regions that contain the stained tissue and also determines a valid focal plane for scanning (Valous et al., 2013). The time requirements for scanning are around 2-4 min per slide. Whole tissue sections are cropped into square images (6162 pixels) which include an overlap of 20 pixels between adjacent images. The total number of images to be processed are 856. The histological landscape consists of cells with big round nuclei (hepatocytes) and cells with smaller irregularly-shaped nuclei regarded as nonparenchymal cells (hepatic stellate cells, liver sinusoidal endothelial cells, fibroblasts, endothelial cells, Kupffer cells) (Fig. 1).

CE

Figure 2: Conceptual workflow of the proposed approach.

AC

4.2. Feature space The staining in the cell nuclei encodes its information in the saturation of the blue color. Due to this property, a gray level image is obtained using only the saturation channel S ch , which is represented by the distance from the color coordinate to the black-white axis of luminance. The saturation channel is a function defined as S ch (X) : (R(X), G(X), B(X)) −→ R, where X is the spatial domain of the image, and R(X), G(X), B(X) are digital image functions for the channels red, green, and blue respectively (Caselles and Monasse, 2010). This transfers the original color image (X, R(X), G(X), B(X)) into a gray image (X, S ch (X)). The range of S ch lies in the interval [0, 1] ⊂ R.

4. Proposed approach 4.1. Overview

4.3. Image dismantling The basis of the proposed method consists of slicing S ch (X) at a sequence of level lines. This constitutes a form of coarsen-

The proposed method consists of the following steps (Fig. 2): i) a change in the color model is carried out in order to 5

ACCEPTED MANUSCRIPT

ing which is a fundamental strategy to extract topological information from data (Robins, 2000). A strictly ordered sequence of real numbers LF ∈ R of random κ + 1 elements is defined whose values are in the range of S ch : [min S ch , max S ch ]. In the experiments, a constant step value  for a given κ is used, which S ch , hence the sequence LF is the is written as  = max S ch −min κ following:

4.4. Construction of inclusion tree The image dismantling step turns R into a sequence of 2D digital images for each level line. Thus for a given n-th level line ln ∈ LF , the n-th level line digital image Pn is denoted by a set of points: Pn = {(xi , yi ) ∈ X | (xi , yi , ln ) ∈ Uπln , (xi , yi , ln ) , (0, 0, 0)} (4)

LF = {li | l0 = min S ch , li+1 − li = , i = 0, · · · , (κ)}

(1)

πl : R → Z2 × R

CR IP T

For (X, S ch (X)), a 3D surface R is formed which is comprised of a finite set of discrete points (xi , yi , zi ), where i = 1, · · · , n. Here (xi , yi ) ∈ X stands for the image domain, and zi = S ch (xi , yi ) is the saturation function value. A level line is spanned on the surface R for each element of LF . For a given level line value l ∈ LF , a projection function is defined as:

AN US

(2)

such that the image of this projection function is (Fig. 3): ( (xi , yi , l) , if zi ≥ l, Uπl = πl (xi , yi , zi ) = (0, 0, 0) , if zi < l

Pn contains the coordinates of pixels in the projection Uπln such that S ch (xi , yi ) > ln . To construct the inclusion tree, an intrinsic topological property called connected components is required. In a 2D digital image Pn , 8-adjacent pixels (two vertical, two horizontal, and four diagonal) of a point p are defined as its neighborhood, and denoted as N p . Any two points p1 , p2 ∈ Pn are connected if they are in each others neighborhood, i.e. p1 ∈ N p2 and p2 ∈ N p1 . A connected set of points C is a set that cannot be partitioned into two non-empty subsets, meaning that each subset has no points in common with the closure of the other. The maximal connected subsets in Pn are called the connected components CC(Pn ) of Pn , inducing a partition in X. In particular, for any arbitrary connected components cci , cc j ∈ CC(Pn ), their intersection is empty cci ∩cc j = ∅, ∀i , j. Taking T as the collection set of all connected components spanned by all ln ∈ LF , two properties are satisfied: i) X itself is one of the subsets for l0 = 0, and ii) for any two subsets, one either has no elements in common with the other or contains all its elements. Under these conditions, connected components are the nodes of the inclusion tree (Caselles and Monasse, 2010).

(3)

CE

PT

ED

M

A function onto R3 must have a well-defined element for every point (xi , yi , zi ) ∈ R. Therefore, arbitrarily (0, 0, 0) is assigned instead of ∅. For each cropped image, projections are obtained across all values of LF . Table 1 shows the pseudocode for this step. Fig. 4 shows the results obtained on histological images using κ = 100.

AC

Figure 3: Example of Uπl points projected on a plane passing at level l of S ch (x, y). A grayscale image (left) and a 3D Cartesian system representing the image (right) are shown. X is the spatial domain of the image and z = S ch (x, y) represents the function of the saturation channel. The resulting projection lies in Euclidean space. Figure 4: Example of projections obtained for a sample cropped image at different level lines: l0 , l5 , l15 , l30 , l60 , and l80 .

Fig. 5 illustrates the process using κ = 12 on a synthetic image. The ROIs are enhanced by a projection function and are tracked by a foliation tree of level lines using the concept of connected components. Labels are created by the leaf nodes of the foliation tree. The leaf nodes correspond to small disjointed areas of the image. Despite their size, they are local maximals comprising of salient regions which are representative of the

Table 1: Pseudocode for the image dismantling step. procedure UmbraProjected(φ,l) U = BinaryArray(row, col) = false for each row in φ do for each col in φ do 5: if φ(row, col) ≥ l then 6: U (row, col)= true 7: return U 1: 2: 3: 4:

. φ : 2D digital picture function

1

6

ACCEPTED MANUSCRIPT

branch where they belong. In Fig. 5, the leaf nodes are denoted as L = {cc12 , cc23 , cc29 , cc27 , cc22 }. Table 2 shows the pseudocode for this step.

AN US

CR IP T

As an example from the tree structure of Fig. 5, traversing the tree from leaf nodes L = {cc12 , cc23 , cc29 , cc27 , cc22 } to the root, a collection of sets S n is spanned. Traversing reversely from l12 to l0 (in order to record S n ) only leaf node cc29 exists, therefore S 0 = {cc29 }. From l6 , the two nodes cc27 and cc22 are united first so that S 6 = {{cc12 }, {cc23 }, {cc29 }, {cc27 , cc22 }} until l0 where all leaf nodes are united at the root with S 12 = {cc12 , cc23 , cc29 , cc27 , cc22 }. An abstract simplicial complex (ASimComp) is a finite collection of sets A such that the simplex α ∈ A and Betti number β ⊆ α, implying β ∈ A (Edelsbrunner and Harer, 2010). The dimension of a simplex α is dim α = cardinality (α − 1), and the dimension of a complex is the maximum dimension of any of its simplices. The collection of sets S 0 , · · · , S κ can be topologically described as a sequence of abstract simplicial complexes. As an example, the ASimComp S 6 = {cc12 , cc23 , cc29 , {cc27 , cc22 }} has four simplices cc12 , cc23 , cc29 , and {cc27 , cc22 } with dim(cc12 ) = 0 and dim({cc27 , cc22 }) = 1, thus dim(S 6 ) = 1 (Fig. 6). Traversing the structure of the inclusion tree from the leaf nodes to the root, a network is formed between leaf nodes, i.e. configurations of sets and subsets. Each configuration S n describes an abstract simplicial complex from which the homology is used to obtain topological invariant Betti numbers. Henceforth, the initial image space is represented by an ASimComp space that satisfies the three axioms of topology making it a Topological Space (Dugundji, 1966). Table 3 shows the pseudocode for this step.

M

Figure 5: Example of an inclusion tree spanned by a set of projections at different level lines LF = {l0 , l1 , . . . , l12 }. A synthetic grayscale image (left) and the tree spanned by the nesting connected components of projections (right) are shown.

PT

8: 9: 10: 11: 12: 13:

procedure InclusionTree(φ,Hf ) . φ : 2D digital picture function tmp = array(m,n) = 0 InclusionTree = null for each l in Hf do Uπl [φ] = UmbraProjected (φ, l) CC l = lconcomp(Uπl [φ]) for each cl in CC l do if tmp(cl ) 6= 0 then InclusionTree (cl ) = tmp(cl ) else InclusionTree (cl ) = tmp(cl ) tmp(cl ) = cl return InclusionTree

AC

4.5. Filtration

CE

2: 3: 4: 5: 6: 7:

ED

Table 2: Pseudocode for constructing the inclusion tree. 1:

1

The leaf nodes are considered as initial ROIs. Instead of considering the properties of the leaf nodes, a novel method is implemented by studying the connectivity of the nodes. Assuming that the network of ROIs follows a certain shape, a topological invariant is computed through the network in order to get the shape of the data. By considering a leaf node as representative of its branch, a bifurcation becomes the union for two or more leaf nodes. The root of the tree will be the union of all leaf nodes. Traversing through all level lines in LF from the leaf nodes to the root, a filtration of networks is built {S 1 , · · · , S κ } when the connections between the leaf nodes are recoded as a set S n for each ln .

Figure 6: Sets obtained by considering the leaf nodes {cc12 , cc23 , cc29 , cc27 , cc22 } as representatives of their branches (left) and bifurcations (right), i.e. the union of leaf nodes representing branches. This is consistent with the inclusion relations in the tree structure.

4.6. Persistence homology For the current implementation (cell nuclei segmentation), all level lines of the network are used (ASimComp sequence S n ) to track how topological features (homological groups) persist. A detailed introduction into persistent homology is given elsewhere (Edelsbrunner and Harer, 2010). The Betti number 7

ACCEPTED MANUSCRIPT

as a topological invariant (which records the number of components, holes, and voids and their higher-dimensional generalizations) is used to describe the intrinsic features of the image. The change of Betti numbers throughout the previously described process is summarized in the following way:

4.6.1. Algorithmic details for computing persistent homology A previous work (Cohen-Steiner et al., 2006) introduced the now well-established algorithm to compute persistent homology, involving the reduction of an incident matrix of simplices. In the proposed approach, the original persistence algorithm for the Z2 coefficients introduced in a previous work (Edelsbrunner et al., 2002) is used instead. This is intuitive and works efficiently when large number of complexes enter the computations. There are two major steps involved here: ordering the entry of simplices, and relating the cycles (a collection of simplices with empty boundary) with boundaries (the lower dimensional simplices that compound a simplex). The boundaries of a simplex dimension p are obtained by an operator called the boundary map, ∂ p : simplices p → simplices p−1 . Ordering the entry of simplices: simplices are grouped into two categories: positive and negative. A simplex σ is called positive if it creates a cycle when entering into the filtration. A simplex σ is named negative if it destroys a cycle when entering into the filtration (Fig. 7). The category of the simplices (as they appear in the filtration) is recorded in a tabular data structure as follows (Table 4):

• each instance of a component, hole, and void is represented by a bar; • the position and length of the bar represent the lifetime of the corresponding component, hole and void;

CR IP T

• the starting point of the bar corresponds to the value of the level line at which the instance comes into existence, i.e. birth level b; • the ending point of the bar corresponds to the value of the level line at which the instance ceases to exist, i.e. death level d.

AN US

The bars in Fig. 7b show the intervals of the level lines over which certain topological features (components, holes, voids) persist.

1. Simplicial complexes are enumerated according to their order of appearance.

Table 3: Pseudocode for creating the simplicial sets.

10:

M

4: 5: 6: 7: 8: 9:

i=1

11:

2. The row σ is filled with the label of each simplex in the filtration.

procedure CreateSimplicialSets(T) . T : Inclusion tree H = HeightOfTheTree(T) for h ∈ {1 . . .H} do . height from leaves (1) to the root(H) P = getAllNodesAtHeight(h, T) Sh = {∅} for each node S ∈ P do Sh = Sh getNodeSons(node) . Snode is a set endfor endfor H [ S= {Sh } return S

3. The row LF is filled with the level line in which the respective simplex in row σ appears. 4. In row sgn, the category of the simplex is annotated. Positive simplices are denoted as ”+” and negative as ”-”. 5. In row K, the entry instance of the simplex is recorded using the sequence order of step 1.

Intervals without changes are not recorded. However, they are shown as jumps in the LF row. An arbitrary order is given to simplices that enter at the same LF . For example, at l6 ∈ LF it is arbitrarily decided to introduce first cycle e and then bd. In addition, new cycle labeled a is annotated positive, while bd is labeled negative because it removes a 0-dimension simplex (Table 4). Relating cycles with boundaries: to measure the life-time of a non-binding cycle, its homology class is tracked (since it is created by a positive simplex) until its disappearance by a negative simplex. The row K marks the simplex entry. To record disappearance, a new row is added which is denoted as D p−1 . The sub-index (p − 1) is a reference to the homology dimension of the boundary elements:

AC

CE

PT

1

ED

1: 2: 3:

Figure 7: Visualization of homological persistence with positive simplices shown in red and negative simplices shown in blue: a) geometric representation of an abstract simplicial complex (ASimComp) at different level lines, b) barcode visualization of Betti numbers showing birth, persistence, and death, and c) persistence diagram showing the birth b and death level d as points (b, d) in the diagram.

1. One negative chain (an arbitrary collection of simplices of a given dimension) is selected from the table. 2. Its boundary elements are obtained. 3. The boundary element with the largest filtration index is selected (given by K). 8

ACCEPTED MANUSCRIPT

4. For the boundary element obtained at step 3, the value K of the negative chain selected at step 1 is recorded as D p−1 .

of the abstract simplices. The persistence of a point v is expressed by the difference between the two values of birth and death: persistence(v) = j − i. Since deaths occur only after births, all points lie above the diagonal. The points on the diagonal correspond to trivial homology classes that have births and deaths at every level (which provides the notion of noise). A point in the persistence diagram that is close to the diagonal represents a class that has a rapid birth and death. On the other hand, a point that is far from the diagonal persists longer. This is visualized with a color code in which red stands for larger and blue for shorter persistence.

5. The procedure is repeated until all negative chains are processed.

AN US

CR IP T

As an example (Table 5), the negative chain bd that appears at K = 5 is selected. Its boundary elements are b and d. Element b appears at K = 1 and d at K = 3, therefore d is selected as the youngest simplex. Then, the value 5 is recorded in the row D p−1 for simplex d. Thus, this is demonstrating that simplex d is created at K = 3 and removed at K = 5. In Table 4 simplices {cd, cb, ca, ab} are recorded as positive. In Table 5 they are regarded as negative due to the arbitrary order assigned to the simplices entering the filtration; they destroy a connected component (simplices of 0-dimension). However, in the homology of simplices of 1-dimension they create new loops, therefore they are positive simplices. In the current work, only the homology of abstract simplices with 0-dimension is considered. In Tables 4 and 5, the ellipses stand for the remaining simplicial configurations till the construction of a fully connected simplicial complex. Table 4: Example of simplex ordering from Fig. 6. For clarity, vertices have been renamed in the following way: a = cc29 , b = cc27 , c = cc23 , d = cc22 , and e = cc12 . 0 a + 0

2 b + 1

3 c + 2

4 d + 3

6 e + 4

6 bd 5

8 ce 6

9 ed 7

9 cd + 8

9 cb + 9

9 eb 10

9 ea 11

9 ca + 12

9 ab + 14

ED

1

9 da 13

... ... ... ...

M

LF σ sgn K

Table 5: Example of life-time records for the simplices in the 0-dimension homology of Fig. 6. 0 a + 0 ∞

2 b + 1 14

3 c + 2 12

4 d + 3 5

6 e + 4 6

6 bd 5

8 ce 6

9 ed 7

9 cd + 8 ...

9 cb + 9 ...

9 eb 10

9 ea 11

PT

LF σ sgn K Dp−1

9 ca + 12 ...

9 da 13

9 ab + 14 ...

... ... ... ...

1

4.7. Persistence diagram For the purposes of the proposed approach, a graphical representation called persistence diagram is utilized. A persistence diagram is a set of points in the upper half plane {(b, d) ∈ R2 | d ≥ b} along with infinitely many copies of the points on the diagonal: diag = {(x, x) ∈ R2 }. In Fig. 7c, a persistence diagram of connected components is shown. Since the 0-dimensional Betti number β0 provides information on the number of connected components, the focus is on the persistent homology shown by β0 instead of using all dimensions (mainly for stability reasons). For the evolution of Betti number β0 , the instances are exactly the leaf nodes of the tree. An instance ceases to exist when it merges with another instance. For each instance v that is created at level i and ceases to exist at level j of the inclusion tree, a point at (i, j) is drawn in the persistent diagram. Fig. 8 shows that the points in the persistent diagrams are in an one-to-one correspondence with the connected components

AC

CE

Figure 8: Four cropped histological sections (left) and their corresponding persistent diagrams for the 0-dimension homology group (right) are shown. For each image a nontrivial segmented region is shown (middle); this is the segmented region that persists for two or more filtration steps.

4.8. Statistical inference Stained cell nuclei define regions with common properties of texture and color, thus during image dismantling such regions and their constituent parts should appear at several level lines of the inclusion tree (expressing longer persistence). The segmentation performance in Fig. 8 is not satisfactory, but from a topological point of view this segmentation contains all significant regions of the image. The absent part is the definition of a threshold of significance (in the persistent diagram) to differentiate topological significant features (ROIs) from insignificant ones (background). In other words, a confidence interval has to be computed for the topological noise such that the cell nuclei regions are well-defined and segmented. 9

ACCEPTED MANUSCRIPT

The aim of statistical inference consists of finding a confidence interval for the distribution of persistence homology classes with unknown mean and standard deviation. The standard deviation is replaced by the estimated standard deviation s (also known as the standard error). The confidence interval obtained would be exact only when the distribution of persistence (for the homology classes) is normal. Under other population distributions, the interval will be approximately correct by the central limit theorem. Since the standard error is an estimate of the true value of the standard deviation, the distribution of the sample mean follows the t-distribution with mean µ √ and standard deviation s/ n. The t-distribution has n degrees of freedom when the population size is n+1: notation is t(n). As n increases, the t-distribution becomes closer to the normal distribution, since standard error approaches the true standard deviation for larger n. Given a persistence diagram with n observed persistence classes x with an unknown underlying mean and standard deviation, the confidence interval for the persistence mean based on a random sample is computed by:

4.9. Mask segmentation Persistent diagrams provide clues about the kind of structures that are important. However, persistent diagrams do not carry information about the geometry of the ROIs. In order to use topological information for segmenting the cell nuclei, a post-processing operation is implemented to transform the points in the persistent diagram into a properly segmented region. Region growing (Adams and Bischof, 1994) provides a simple way to generate precise segmentations that are initialized by the proposed approach. This initialization is robust against color distortions and image blur.

CR IP T

√ x¯ ± t ∗ s/ n

The value for the confidence interval is obtained using an image of the set. Due to the properties of topological persistence, generalization performance is not affected. Segmented cell nuclei (Fig. 9) have not received any post-processing treatment and are obtained right after computing the persistent diagram using the connected components of the inclusion tree.

5. Computational complexity

AN US

(5)

Computational cost is studied regarding processing time pertinent to the whole workflow till obtaining the persistent diagram where segmentation is performed. For this purpose, a set of 12 groups of images are assessed. Each group contains 40 images ranging from 502 to 20002 pixels (Fig. 10):

ED

M

where t is the upper (1 − δ)/2 critical value for the t-distribution with n − 1 degrees of freedom. Fig. 9 shows segmentations (same images as Fig. 8) obtained when the confidence interval (1 − δ)/2 = 80%.

• P-noise: synthetic images of Perlin noise rescaled and added onto itself to generate fractal noise. • LPS: images acquired from histological tissue sections of liver used as the case study in the current work.

PT

• PNE-1, PNE-2, PNE-3: images acquired from histological tissue sections of pancreatic neuroendocrine neoplasms with faint stain intensity, average stain intensity, and histological artifacts respectively. • IHC2X, IHC2X2, IHC2X3: images acquired from histological tissue sections of breast cancer (HER2, 2+).

CE

• IHC3X, IHC3X2, IHC3X3, IHC3X4: images acquired from histological tissue sections of breast cancer (HER2, 3+).

AC

The three groups of images IHC2X, IHC2X2, and IHC2X3 belong to different histological specimens. Similarly, for the groups IHC3X, IHC3X2, IHC3X3, and IHC3X4. 6. Validations 6.1. Validation with public dataset A public histological dataset (H&E stained tissue) with supplied ground-truth data (Wienert et al., 2012) is used for evaluating the performance of the proposed approach. The data can be found here: http://www.nature.com/articles/ srep00503#supplementary-information. The dataset consists of 36 images (size: 6002 pixels) containing 7931 cells

Figure 9: Using the histological sections of Fig. 8 (left) to obtain segmentations (middle) with an 80% confidence interval for the noise. The bands in the persistent diagrams (right) define regions where any point inside them is labeled as noise.

10

ACCEPTED MANUSCRIPT

and ground-truth data comprising of manually labeled cell nuclei. All images are labeled by three pathologists and only cells for which consensus is achieved are specified in the groundtruth. All other cells are left unlabeled. The dataset comprises of various breast cancer samples representing a broad morphological variety including normal tissue components, bone marrow with normal and pathologically altered cells, normal liver tissue, as well as kidney tissue and intestinal mucosa acquired during routine diagnostics and processed with standard routine laboratory protocols.

2D histogram is a statistical plot of the co-occurrence of two intensity values for the same pixel in two images, i.e. original IM and averagely filtered image I M 0 , and is obtained from:   H(i, j) = W (x, y) | I M(x, y) = i, I M(x, y) = j

(6)

where W are the elements in the set, while (x, y) denote coordinates of pixels in images. The 2D histogram is normalized by:

CR IP T

H(i, j) p(i, j) = PL PL i=1 j=1 H(i, j)

(7)

AN US

The bulk of the 2D histogram includes a valley and peaks corresponding to background and object. Because of homogeneity, pixels interior to objects or background contribute mainly to the near-diagonal elements, while pixels of edges and noise should contribute to the off-diagonal elements. Thus, the threshold (k, l) segments the 2D histogram into 4 different classes: A, B, C, and D. Classes A and B are regarded as object and background, while C and D contain the distributions of edges and noise. Basically, the method aims to find the optimal threshold (k0 , l0 ) that maximizes the between-class variance of A and B:

Figure 10: Images for estimating computational complexity. The figure shows six sample images for each group (size range: 502 to 20002 pixels): (a) PNoise, (b) IHC2X2, (c) IHC3X, (d) IHC2X, (e) PNE-1, (f) IHC3X2, (g) LPS, (h) PNE-3, (i) IHC3X3, (j) IHC3X4, (k) PNE-2, and (l) IHC2X3.

M

6.2. Validation with GlaS challenge contest dataset

σ(k, l)2 = ωA (k, l) ωB (k, l) [µA (k, l) − µB (k, l)]2

This validation aims to illustrate how the proposed method compares with state-of-the-art approaches in biomedical image segmentation. The publicly available dataset and groundtruth data from the Gland Segmentation in Colon Histology Images Challenge (GlaS) contest at MICCAI 2015 is used for the comparisons (Sirinukunwattana et al., 2017). The challenge brought together computer vision and medical image computing researchers to solve the problem of gland segmentation in digitized images of H&E stained tissue slides. Participants developed algorithms which were applied to benign tissue and to colonic carcinomas. A training dataset was provided together with ground-truth annotations by an expert pathologist. Success was measured by how closely the automated segmentation matched the gold standard provided by the pathologist.

ωA (k, l) =

p(i, j)

(9)

i=1 j=1

ωB (k, l) =

ED

k X l X

(8)

L L X X

p(i, j)

(10)

i=k+1 j=l+1

PT

µA (k, l) =

CE

µB (k, l) =

k X l X i jp(i, j) ωA (k, l) i=1 j=1

L X L X i jp(i, j) ω (k, l) i=k+1 j=l+1 B

(11)

(12)

AC

To determine (k0 , l0 ), the method needs to search an L × L space of [(1, 1), (1, 2) . . . (L, L)]. For each pixel intensity value pair (k, l), the zeroth- and first-order cumulative moments (ω and µ) are calculated by accumulating the variance from histogram entry of (1, 1) to current location of (k, l). An improved algorithm is used to accelerate the calculation of ω and µ at (k, l), by using those at locations (k − 1, l − 1), (k − 1, l) and (k, l − 1) (Zheng et al., 2006). Thus, computational time is reduced and a total of O(L2 ) is required for searching the optimal threshold. Background removal is beneficial for the next step which involves stain separation using the widely-used technique of color deconvolution; a methodology that allows to separate image components based on color parameters. Virtually every set of three colors can be separated, even in the case of co-localization of stains (Ruifrok and Johnston, 2001). Images are reconstructed for each stain separately, requiring only knowledge of the RGB

6.3. Validation with case study dataset An automated segmentation pipeline validated by experts is used to measure the performance of the proposed approach. The validation set consists of binary masks obtained by standard methods such as background removal, stain color deconvolution, and post-processing operations. 2D histogram variance thresholding separates regions of interest (tissue) and background pixels. In many cases using thresholds selected from 1D histograms may not be satisfactory. The 1D histogram can be extended into 2D, thus eliminating the effects of noise while selecting thresholds (Wu et al., 1999). A 11

ACCEPTED MANUSCRIPT

AN US

7. Results

the workflow introduced a massive reduction in the amount of simplicial complexes generated by the image, compared with the original method of using pixels as the vertices of simplicial complexes. The use of abstract simplicial complexes introduced a major improvement in computational efficiency in comparison with the nominal cost of computing persistent homology in a metric embedding (e.g. using Rips, Witness, or ˇ Cech complexes) which is upper-bounded by O(n3 ).

CR IP T

color vectors of each stain. Hence, the contribution of each stain is computed and images are produced. These grayscale images are then binarized directly by replacing all pixels with luminance >0.5 with one, and all other pixels with zero. Finally, post-segmentation operations refined image characteristics, eliminated redundancies, and improved object perceptibility (i.e. filtering, morphological operations, and watershed transform) (Valous et al., 2013). Images went through manual quality control to ensure that desired image features are detected in an efficient manner. Further validations are carried out using the software tool ilastik v.1.1.8 (Sommer et al., 2011). Ilastik is a generalpurpose state-of-the-art segmentation tool providing an automated workflow based on the supervised training of a random forest classifier. For comparison purposes, training is performed manually on the same two images used to tune the proposed approach. The software is trained using color, edge, and textural features.

7.1. Computational cost

ED

M

Computational time results are shown in Fig. 11. The results are normalized and fitted to a curve by polynomial (quadratic) regression with a coefficient of determination R2 > 0.9999. In the case of synthetic images (P-noise), it is a linear fit with an R2 > 0.995.

Figure 12: Processing time of the proposed approach as a function of input size. For comparison purposes, the figure includes curves of computational cost for exponential (O(2n )), cubical (O(n3 )), and quadratic (O(n2 )) time algorithms, plus the average performance of the proposed approach.

7.2. Validation with public dataset

AC

CE

PT

The evaluation is performed on a per-object basis for the class of cell nuclei according to the supplied ground-truth data (accuracy: 94.5%). For the background class, the assessment is conducted on a per-pixel basis. Both classes are normalized before computing the confusion matrix and receiver operating characteristic (ROC) curve (Fig. 13). The proposed approach overestimates the number of cell nuclei. False positive nuclei arise in this evaluation because no other criteria than natural clusters (in the persistence diagram) are used. False positive nuclei manifest because of: i) cell structures morphologically similar to real cell nuclei, and ii) cell nuclei with weak staining or inconclusive morphology that remain unlabeled in the ground-truth, due to non-agreement by the pathologists.

Figure 11: Processing time of the proposed approach measured in sec vs. image size measured in millions of pixels.

7.3. Validation with GlaS challenge contest dataset The results obtained with the proposed approach are presented in Table 6. The evaluation of performance is reported according to three criteria: a) accuracy of the detection of individual glands (F1 score), b) volume-based accuracy of the segmentation of individual glands (object-level Dice index), and c) boundary-based similarity between glands and corresponding segmentation (object-level Hausdorff distance). The dataset is

The processing time taken by the proposed method to run as a function of input size is shown in Fig. 12. Evaluation of the results indicate that time complexity matches a quasilinear behavior which is O(n1+ε ) for ε < 1. The time cost of the algorithm is lower-bounded by an almost linear behavior obtained by the synthetic images and upper-bounded by O(n2 ) in a worstcase scenario. The image dismantling step at the beginning of 12

ACCEPTED MANUSCRIPT

separated into training part, test part A, and test part B. Ranking results are available here (Sirinukunwattana et al., 2017).

tive measurements of its geometrical properties (Biasotti et al., 2008). Consequently, the obtained results indicate that abstract simplicial complexes of dimension-1 have potential for such problems. 7.4. Validation with case study dataset

CR IP T

The validation set is provided in two groups: (i) all segmented nuclei, and (ii) positive cell nuclei (hepatocytes). The ROC curves obtained by the proposed approach on a per-object basis are shown below. To obtain a global evaluation, the confusion matrices are normalized and added. The output mask is compared with the validation mask which contains all segmented cell nuclei: hepatocyte and nonparenchymal cells (accuracy: 84.6%) (Fig. 14). In addition, the output mask is compared with the validation mask containing hepatocyte cell nuclei only (accuracy: 86.2%) (Fig. 15).

AN US

Figure 13: Average segmentation performance for cell nuclei using a public dataset with supplied ground-truth data (Wienert et al., 2012). Normalization balances background (0) and nuclei class (1); each class contains approx. 50% of data. Overall accuracy: 94.5%; error rate: 5.5%; specificity: 91.4%; true positive rate for nuclei: 97.5%; false positive rate: 8.6%.

Table 6: Performance of the proposed approach using the dataset of the Gland Segmentation in Colon Histology Images Challenge (GlaS) contest at MICCAI 2015 (Sirinukunwattana et al., 2017). A and B correspond to test dataset part A and part B.

F1 Score A B 0.51 0.33

Object Dice A B 0.53 0.44

Object Hausdorff A B 101.52 237.23

M

Proposed Approach

Figure 14: Average segmentation performance for all cell nuclei (hepatocyte and non-parenchymal cells). Normalization balances background (0) and nuclei class (1); each class contained approx. 50% of the data. Overall accuracy: 84.6%; error rate: 15.4%; specificity: 92.8%; true positive rate for nuclei: 76.7%; false positive rate: 7.2%.

AC

CE

PT

ED

From a topological perspective, 1 segmenting a gland is a task that consists of recognizing 1-dimensional loops in the space of histological tissue structures. Such loops are captured by the 1-dimensional Betti number β1 . However, the method described in this work is limited in harvesting 0-dimensional Betti numbers. To overcome such limitation an indirect procedure is carried out in order to work on 1-dimensional loops. Cell nuclei are segmented using persistent homology. The generated inclusion tree is traversed from the leaves to the root in order to connect neighbor cell nuclei. Such process generates different forms of nuclei agglomerations. The edge of a gland is usually expressed by simplicial complexes densely packed where the complex nodes are cell nuclei. The center of a gland is usually expressed by a region of low nuclei density. Using these properties, it is possible to construct a grayscale image where dense areas are bright zones. Such image is then processed further in order to find those regions of the image that persist: the gland core. At this stage, a morphological step is performed to segment the shape of the gland. This procedure allows to evaluate how much information can be obtained and used from the 0-dimensional Betti number. The effectiveness of the proposed approach compares to the group of techniques that have mid to lower performance characteristics (Sirinukunwattana et al., 2017). This is not unexpected, because the image segmentation problem in this challenge has a strong geometrical component. Such problems lay in the field of differential topology which combines the topological exploration of space with quantita-

Figure 15: Average segmentation performance for hepatocyte cell nuclei. Normalization balances background (0) and nuclei class (1); each class contained approx. 50% of the data. Overall accuracy: 86.2%; error rate: 13.8%; specificity: 92.4%; true positive rate for nuclei: 80.2%; false positive rate: 7.6%.

To perform this comparison, the segmented nuclei are separated into two groups according to their eccentricities. The eccentricity is the ratio of the distance between the foci of the ellipse 13

ACCEPTED MANUSCRIPT

containing a nucleus and its major axis length (value between 0 and 1). Eccentricities equal to 0 indicate a circle, while values equal to 1 indicate a line segment. A positive label is assigned to cell nuclei with an eccentricity lower than equal to 0.9 and a negative label to nuclei with an eccentricity greater than 0.9. Finally, the corresponding ROC curve for all segmented cell nuclei, comparing the proposed approach with the segmentation obtained from the software tool ilastik (accuracy: 75.5%), is shown in Fig. 16. The ROC curve of Fig. 16 highlights the problem of generalization in this machine learning approach. The issue arises because training is based on appearance features and the set used for training does not contain enough variability in the appearance of cell nuclei, in order to segment successfully unseen samples. Increasing the number of samples in the training set and re-training for new samples are conventional strategies to minimize these problems. However, in practice it is impossible to predict the complete pool of possible variations in the appearance of cell nuclei. The proposed approach achieves better separation between classes (nuclei and background) and in consequence better segmentation performance, because it does not depend on cell nuclei appearance features but in a highly abstract representation of the image which is less sensitive to changes in color, texture, and edges.

PT

ED

M

AN US

CR IP T

ical and color/textural content of histological specimens render the development of image analysis algorithms an arduous task; nevertheless automation has merits compared to manual assessments. In this work, an abstract simplicial homology approach for image segmentation is established. By introducing an image representation based on topological features, the segmentation task is less dependent on variations in color or texture. This results in a novel approach that generalizes well and provides stable performance across a large set of histological images. The method conceptualizes regions of interest (cell nuclei) pertinent to their topological features in a successful manner. Results reveal that performance depends on the choice of feature space where cell nuclei are detected during the fragmentation step. This is a feature space where it is possible to generate configurations of abstract simplicial complexes. The reader can quickly observe how higher dimensional Betti numbers are not taken into account (holes and voids). Experimentally, holes and voids arise in an unpredictable fashion during persistent homology computations. Analysis showed that the process of image dismantling introduces holes due to the discrete nature of digital images. An erroneous output at this point would scale to all subsequent steps. The use of a functional over the image to address this issue would result in over-parametrization, jeopardizing the improvements in generalization. Hence, higher dimension topological features are not considered. A new algorithm should be developed to identify these features or further developments should be made in the image dismantling process.

AC

8. Discussion

CE

Figure 16: Average segmentation performance obtained with software tool ilastik for all cell nuclei (hepatocyte and non-parenchymal cells). Ilastik is trained with the same two images used for tuning the proposed approach. Normalization balances background (0) and nuclei class (1); each class contained approx. 50% of the data. Overall accuracy: 75.5%; error rate: 24.5%; specificity: 87.7%; true positive rate for nuclei: 63.3%; false positive rate: 12.3%.

The success of any histological image analysis method depends on how well it conceptualizes different cellular structures (Arevalo et al., 2015). In turn, this depends on how accurate is the representation model for the visual content of the image. There are several approaches for the automatic segmentation of tissue structures; often such algorithms rely on properties measured directly from the image which can easily be affected by noise and the natural variability of the histological landscape. In consequence, segmentation methods applied to such image representations are frequently difficult to generalize beyond particular specimens, resulting in algorithms that require constant retuning in order to perform properly. In essence, the morpholog-

9. Conclusions and future work The detection of cells in histological images is a subject of interest in a wide range of cell-based studies as it is critical for evaluating the existence of disease and its severity (Sirinukunwattana et al., 2015; Valous et al., 2016; Halama et al., 2016). The construction of a fully automated detection and segmentation system supporting a human expert provides tangible benefits in the context of clinical decision-making (George et al., 2013; LaTorre et al., 2013). Automated image analysis tools may not replace but assist clinicians to increase diagnostic precision and inter-observer reliability (Wienert et al., 2012). The proposed approach deals with the persistence of disconnected sets in the image, thus identifying salient regions that express patterns of persistence. Still there is no immediate connection between the definition of topology and the geometry of a cell nuclei. Differential topology may provide a suitable setting for formalizing and solving problems related to shape analysis; this is achieved by combining the topological exploration of space with quantitative measurements of geometrical properties provided by a real function defined on the shape (Biasotti et al., 2008). Hence, differential topology is a clear direction for future work in order to improve the use of persistent homology for the analysis of biomedical images. Conflict of interest statement The authors declare that they have no conflict of interest.

14

ACCEPTED MANUSCRIPT

Acknowledgements

CR IP T

The authors are thankful to Ms. Katarina Abramovic for the technical assistance (immunohistochemistry). Experimental methods are supported by the German Federal Ministry of Education and Research: Virtual Liver Network grants 0315755 and 0315764 (SD). Dr. Luis Salinas acknowledges support from FONDECYT 1100805, Basal Project FB0821 CCTVal, and AnilloProject ACT119. Ms. Wei Xiong acknowledges the funding from the China Scholarship Council (CSC No. 2011604036) and support from the Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences (HGS MathComp).

Halama, N., Zoernig, I., Berthel, A., Kahlert, C., Klupp, F., Suarez-Carmona, M., Suetterlin, T., Brand, K., Krauss, J., Lasitschka, F., Lerchl, T., LucknerMinden, C., Ulrich, A., Koch, M., Weitz, J., Schneider, M., Buechler, M. W., Zitvogel, L., Herrmann, T., Benner, A., Kunz, C., Luecke, S., Springfeld, C., Grabe, N., Falk, C. S., Jaeger, D., 2016. Tumoral immune cell exploitation in colorectal cancer metastases can be targeted effectively by anti-CCR5 therapy in cancer patients. Cancer Cell 29, 587–601. Irshad, H., Veillard, A., Roux, L., Racoceanu, D., 2014. Methods for nuclei detection, segmentation, and classification in digital histopathology: a reviewcurrent status and future potential. IEEE Rev. Biomed. Eng. 7, 97–114. Jung, C., Kim, C., Chae, S. W., Oh, S., 2010. Unsupervised segmentation of overlapped nuclei using bayesian classification. IEEE T. Bio-Med. Eng. 57, 2825–2832. Kong, H., Gurcan, M., Belkacem-Boussaid, K., 2011. Partitioning histopathological images: an integrated framework for supervised color-texture segmentation and cell splitting. IEEE T. Med. Imaging 30, 1661–1677. Kothari, S., Phan, J. H., Stokes, T. H., Wang, M. D., 2013. Pathology imaging informatics for quantitative analysis of whole-slide images. J. Am. Med. Inform. Assn. 20, 1099–1108. Lahrmann, B., Valous, N. A., Eisenmann, U., Wentzensen, N., Grabe, N., 2013. Semantic focusing allows fully automated single-layer slide scanning of cervical cytology slides. PLoS ONE 8, e61441. LaTorre, A., Alonso-Nanclares, L., Muelas, S., Pe˜na, J., DeFelipe, J., 2013. Segmentation of neuronal nuclei based on clump splitting and a two-step binarization of images. Expert Syst. Appl. 40, 6521–6530. Madabhushi, A., Lee, G., 2016. Image analysis and machine learning in digital pathology: challenges and opportunities. Med. Image Anal. 33, 170–175. Nakane, K., Takiyama, A., Mori, S., Matsuura, N., 2015. Homology-based method for detecting regions of interest in colonic digital images. Diagn. Pathol. 10, 36. Robins, V., 2000. Computational Topology at Multiple Resolutions: Foundations and Applications to Fractals and Dynamics. University of Colorado. Dissertation. Ruifrok, A. C., Johnston, D. A., 2001. Quantification of histochemical staining by color deconvolution. Anal. Quant. Cytol. Histol. 23, 291–299. Sadimin, E. T., Foran, D. J., 2012. Pathology imaging informatics for clinical practice and investigative and translational research. N. Am. J. Med. Sci. 5, 103–109. Saveliev, P., 2011. A graph, non-tree representation of the topology of a gray scale image. In: Astola, J. T., Egiazarian, K. O. (Eds.), Image Processing: Algorithms and Systems IX. Vol. 7870. Proc. SPIE. Sirinukunwattana, K., Khan, A. M., Rajpoot, N. M., 2015. Cell words: modelling the visual appearance of cells in histopathology images. Comput. Med. Imag. Grap. 42, 16–24. Sirinukunwattana, K., Pluim, J. P., Chen, H., Qi, X., Heng, P.-A., Guo, Y. B., Wang, L. Y., Matuszewski, B. J., Bruni, E., Sanchez, U., B¨ohm, A., Ronneberger, O., Cheikh, B. B., Racoceanu, D., Kainz, P., Pfeiffer, M., Urschler, M., Snead, D. R., Rajpoot, N. M., 2017. Gland segmentation in colon histology images: the glas challenge contest. Med. Image Anal. 35, 489–502. Sommer, C., Straehle, C., Kothe, U., Hamprecht, F. A., 2011. Ilastik: interactive learning and segmentation toolkit. In: Proceedings of the IEEE International Symposium on Biomedical Imaging: From Nano to Macro. pp. 230–233. Song, Y., Zhang, L., Chen, S., Ni, D., Lei, B., Wang, T., 2015. Accurate segmentation of cervical cytoplasm and nuclei based on multiscale convolutional network and graph partitioning. IEEE T. Bio-Med. Eng. 62, 2421– 2433. Vaitkus, M., 2013. Grayscale and color image segmentation using computational topology. In: Wimmer, M., Hladuvka, J., Ilck, M. (Eds.), 17th Central European Seminar on Computer Graphics. pp. 95–102. Valous, N. A., Lahrmann, B., Halama, N., Bergmann, F., J¨ager, D., Grabe, N., 2016. Spatial intratumoral heterogeneity of proliferation in immunohistochemical images of solid tumors. Med. Phys. 43, 2936–2947. Valous, N. A., Lahrmann, B., Zhou, W., Veltkamp, R., Grabe, N., 2013. Multistage histopathological image segmentation of iba1-stained murine microglias in a focal ischemia model: methodological workflow and expert validation. J. Neurosci. Meth. 213, 250–262. Veillard, A., Kulikova, M. S., Racoceanu, D., 2013. Cell nuclei extraction from breast cancer histopathology images using colour, texture, scale and shape information. Diagn. Pathol. 8 (Suppl 1), S5. Vink, J., Leeuwen, M. V., Deurzen, C. V., Haan, G. D., 2012. Efficient nucleus detector in histopathology images. J. Microsc. 249, 124–135.

References References

AC

CE

PT

ED

M

AN US

Adams, R., Bischof, L., 1994. Seeded region growing. IEEE T. Pattern Anal. 16, 641–647. Al-Kofahi, Y., Lassoued, W., Lee, W., Roysam, B., 2010. Improved automatic detection and segmentation of cell nuclei in histopathology images. IEEE T. Bio-Med. Eng. 57, 841–852. Ali, S., Madabhushi, A., 2012. An integrated region-, boundary-, shape-based active contour for multiple object overlap resolution in histological imagery. IEEE T. Med. Imaging 31, 1448–1460. Arevalo, J., Cruz-Roa, A., Arias, V., Romero, E., Gonz´alez, F. A., 2015. An unsupervised feature learning framework for basal cell carcinoma image analysis. Artif. Intell. Med. 64, 131–145. Basavanhally, A., Ganesan, S., Agner, S., Monaco, J., Feldman, M., Tomaszewski, J., Bhanot, G., Madabhushi, A., 2010. Computerized imagebased detection and grading of lymphocytic infiltration in her2+ breast cancer histopathology. IEEE T. Bio-Med. Eng. 57, 642–653. Biasotti, S., De Floriani, L., Falcidieno, B., Frosini, P., Giorgi, D., Landi, C., Papaleo, L., Spagnuolo, M., 2008. Describing shapes by geometricaltopological properties of real functions. ACM Comput. Surv. 40, 12:1– 12:87. Caselles, V., Monasse, P., 2010. Geometric Description of Images As Topographic Maps. Springer-Verlag, Berlin. Cataldo, S. D., Ficarra, E., Acquaviva, A., Macii, E., 2010. Automated segmentation of tissue images for computerized IHC analysis. Comput. Meth. Prog. Bio. 100, 1–15. Cohen-Steiner, D., Edelsbrunner, H., Morozov, D., 2006. Vines and vineyards by updating persistence in linear time. In: Proceedings of the Twenty-second Annual Symposium on Computational Geometry (SCG ’06). ACM, New York, USA, pp. 119–126. de Oliveira, D. L. L., do Nascimento, M. Z., Neves, L. A., de Godoy, M. F., de Arruda, P. F. F., de Santi Neto, D., 2013. Unsupervised segmentation method for cuboidal cell nuclei in histological prostate images based on minimum cross entropy. Expert Syst. Appl. 40, 7331–7340. Dugundji, J., 1966. Topology (Allyn and Bacon Series in Advanced Mathematics). Allyn and Bacon, Boston. Edelsbrunner, H., Harer, J. L., 2010. Computational Topology: An Introduction. American Mathematical Society, Providence. Edelsbrunner, H., Letscher, D., Zomorodian, A., 2002. Topological persistence and simplification. Discrete & Computational Geometry 28, 511–533. Edelsbrunner, H., Parsa, S., 2014. On the computational complexity of betti numbers: reductions from matrix rank. In: Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 152–160. George, Y. M., Bagoury, B. M., Zayed, H. H., Roushdy, M. I., 2013. Automated cell nuclei segmentation for breast fine needle aspiration cytology. Signal Process. 93, 2804–2816. Ghrist, R., 2007. Barcodes: the persistent topology of data. Bull. Amer. Math. Soc. 45, 61–76. Gyulassy, A., Nonato, L. G., Bremer, P. T., Silva, C., Pascucci, V., 2009. Visualization corner: robust topology-based multiscale analysis of scientific data. Comput. Sci. Eng. 11, 88–95.

15

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

Wahlby, C., Sintorn, I.-M., Erlandsson, F., Borgefors, G., Bengtsson, E., 2004. Combining intensity, edge and shape information for 2d and 3d segmentation of cell nuclei in tissue sections. J. Microsc. 215, 67–76. Wang, P., Hu, X., Li, Y., Liu, Q., Zhu, X., 2016. Automatic cell nuclei segmentation and classification of breast cancer histopathology images. Signal Process. 122, 1–13. Weatherson, B., Marshall, D., 2014. Intrinsic vs. extrinsic properties. In: Zalta, E. N. (Ed.), The Stanford Encyclopedia of Philosophy, fall 2014 Edition. Metaphysics Research Lab, Stanford University. Wienert, S., Heim, D., Saeger, K., Stenzinger, A., Beil, M., Hufnagl, P., Dietel, M., Denkert, C., Klauschen, F., 2012. Detection and segmentation of cell nuclei in virtual microscopy images: a minimum-model approach. Sci. Rep. 2. Wu, X.-J., Zhang, Y.-J., Xia, L.-Z., 1999. A fast recurring two-dimensional entropic thresholding algorithm. Pattern Recogn. 32, 2055–2061. Xu, H., Lu, C., Mandal, M., 2014. An efficient technique for nuclei segmentation based on ellipse descriptor analysis and improved seed detection algorithm. IEEE J. Biomed. Health Inform. 18, 1729–1741. Zheng, C., Sun, D.-W., Zheng, L., 2006. Segmentation of beef joint images using histogram thresholding. J. Food Process Eng. 29, 574–591. Zomorodian, A., Carlsson, G., 2005. Computing persistent homology. Discrete Comput. Geom. 33, 249–274.

16