Automatic Segmentation of Chinese Characters as Wire-Frame Models

Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 108C (2017) 415–424

International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland

Antoine Bossard

Graduate School of Science, Kanagawa University
2946 Tsuchiya, Hiratsuka, Kanagawa, Japan 259-1293

Abstract

There exist thousands of Chinese characters, used across several countries and languages. Their huge number induces various difficulties for computer processing. One challenging topic is, for example, automatic font generation for such characters. Also, as these characters are in many cases recursive compounds, pattern (i.e. sub-character) detection is an insightful topic. In this paper, aiming at addressing such issues, we describe a segmentation method for Chinese characters that produces wire-frame models as vector graphics, in contrast to conventional raster approaches. While raster output would enable only very limited reuse of these wire-frame models, vector output would for instance support the automatic generation of vector fonts (Adobe Type 1, Apple TrueType, etc.) for such characters. Our approach also enables a significant performance increase compared to the raster approach. The proposed method is then applied to a list of several Chinese characters. Next, the method is empirically evaluated and its average time complexity is assessed.

Keywords: vector, graphics, scalable, script, kanji, glyph, Japanese, font

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the International Conference on Computational Science

ISSN 1877-0509, doi: 10.1016/j.procs.2017.05.122

1 Introduction

Chinese characters are numerous (thousands) and used in several languages and writing systems, such as Japanese and Korean. Because of their huge number, various computer processing difficulties arise. For instance, one challenging topic is automatic font generation for such characters. Also, interestingly, Chinese characters are often recursive compounds: for instance, the character 榎 hackberry tree consists of the character 木 tree horizontally combined with the character 夏 summer. In this way, several "layers" of sub-characters can often be identified in a character [7, 12]. Automatic and efficient character pattern (i.e. sub-character) recognition is an important topic with various applications (font generation, compression, etc.).

A conventional approach for two-dimensional image segmentation is thinning [10, 11]. This image processing technique is usually based on either convolution matrix application or a decision tree, in either case a pixel-based method: this is the raster (bitmap) approach. It obviously requires going through every pixel of the image to be segmented. Such a raster approach thus has the following disadvantages: 1) the segmented result is only valid for the original image, that is, no scaling up or down is possible, or only at a significant loss of precision and accuracy; 2) a high time complexity, at least linear in the size of the source image; and 3) further complexity induced for the analysis of the segmented model, since either raster analysis is again required, or an intermediate vectorisation.

Our objective in this paper is to propose a segmentation method for Chinese characters, aiming for instance at the fast automatic generation of a character pattern database for facilitated character decomposition [5, 4, 3]. Segmentation here means that a two-dimensional wire-frame model of a character is produced by extracting the character "spine". As another application example, once characters have been segmented, the result provides a base for fully automatic Chinese character font generation, a famously difficult topic [13].

Our strategy is as follows: instead of conventionally segmenting Chinese characters from a bitmap image (raster graphics), we first generate the character list using vector graphics, concretely with the SVG format [14], before going on with segmentation. Compared with previous works, this approach has the significant merit that the resulting wire-frame models are also vector graphics, and that the time complexity required to segment a character is significantly reduced compared with that of the conventional raster graphics approach. Informally, the time complexity is reduced from linear to logarithmic in the size of the character raster image. Furthermore, analysis (e.g. character pattern analysis) of the obtained vector graphics wire-frame model is greatly facilitated compared with that of a raster wire-frame model. Additional processing of the wire-frame model, such as font generation, also benefits significantly: a raster wire-frame model would either induce a bitmap font (PCF, FON, etc.), with its well-known shortcomings compared to an outline font, i.e. a "modern" vector (scalable) font (e.g. Adobe Type 1 [8], Apple TrueType [1], OTF, etc.), or would first require vectorisation of the obtained raster wire-frame to then produce an outline font, at a higher total complexity since two steps are needed: rasterisation, followed by vectorisation. In addition, and importantly, our method enables instant mapping of a segmented character to its Unicode (or any other encoding: Shift JIS, EUC, etc.) representation.

2 Preprocessing: character list as vector graphics

The aim of this preprocessing step is to produce vector graphics from Chinese characters for further processing. This is achieved as follows (we rely on the Inkscape vector drawing software [2]).

1. Create a new text object and type in or paste the desired Chinese characters. A common scenario is to first copy an extensive list of Chinese characters to be processed, like the list of the Japanese regular-use characters [9], and then to paste this list inside the Inkscape text object.
2. Select the newly created text object and convert it to a path with the Object to Path command.
3. Save the file (SVG format by default, which is suitable). In the example of the list of the Japanese regular-use characters, a 2MB SVG file is generated.

The generated SVG file is structured as follows: each character corresponds to one SVG path element. All such path elements are grouped under an SVG g element. Each path element (i.e. the path for one character) is defined using the d attribute, which declares the path nodes and the node connections (straight lines and Bézier curves in our case). As an example, the path element for the character 圧 is given below.
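The file structure described above (one path element per character, its outline in the d attribute) can be read back with a few lines of code. The following is a minimal Python sketch, not the Racket implementation of this paper; the function name `load_character_paths` and the two-path sample document are assumptions made for illustration, and the sample path data is invented (it is not the 圧 path).

```python
import xml.etree.ElementTree as ET

# Namespace used by SVG elements when parsed with ElementTree.
SVG_NS = "{http://www.w3.org/2000/svg}"


def load_character_paths(svg_text):
    """Return the d attribute of every path element, in document order.

    With the file layout described above, each returned string is the
    outline of one character.
    """
    root = ET.fromstring(svg_text)
    return [path.get("d", "") for path in root.iter(SVG_NS + "path")]


# Hypothetical two-character file; the path data is made up.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <g>
    <path d="m 10,10 l 20,0 l 0,20 z" />
    <path d="M 50,10 q 10,10 20,0 z m 2,2 l 4,0 l 0,4 z" />
  </g>
</svg>"""

for d in load_character_paths(SAMPLE):
    print(d)
```

Each returned string would then be handed to the path parser of Section 3.1.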



3 Character segmentation

Our implementation of the proposed segmentation method has been realised in the Racket functional programming language [6]. First, let us introduce the terminology employed hereinafter. As explained, a character corresponds to one SVG path. An SVG path is made of nodes which are connected to each other with lines or curves. Each such path declares one or more sub-paths; we call a sub-path a segment. In other words, a path, and thus a character, is defined as a set of segments; we call such a set a segset. Next, it is important to note the following property of a segment: since it corresponds to one or more strokes of a character, each stroke having some thickness, each segment is a closed shape (polygon). Each segmentation step described in Sections 3.1 to 3.5 below is illustrated with the same character 暖 warm so that the segmentation progress is easy to follow.

3.1 Obtaining the character shape

The first task is to parse the previously generated SVG file to load each node of a path. SVG file loading is conducted by the Racket modules xml and xml/path. Path parsing is done using the lexer implemented by the Racket module parser-tools/lex. We define nine empty lexer tokens corresponding to the SVG path commands m/M (moveto), l/L (lineto), q/Q (curveto), z/Z (closepath) and the end-of-path signal. In addition, one valued token is used to parse path command parameters (coordinates). Once the path nodes have been obtained, they are connected either with a straight line or with a quadratic Bézier curve. Such node connection is parametrised with a sampling value $\delta \in \mathbb{R}_{>0}$ which determines the number of points taken to connect two nodes. This sampling rate thus directly impacts the smoothness of the obtained character shape. Concretely, the sampling value δ is the maximum distance between any two adjacent points. A sample output of this step is illustrated in Figure 1.
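To make the role of δ concrete, the following Python sketch (not the Racket implementation of this paper) connects two nodes by a straight line or by a quadratic Bézier curve so that adjacent points are at most δ apart. The function names are hypothetical, and the subdivision strategy for curves (uniform steps in the curve parameter, doubled until the constraint holds) is an assumption: the text only fixes the maximum distance between adjacent points.

```python
import math


def sample_line(p0, p1, delta):
    """Points from p0 to p1 (both included), adjacent points at most delta apart."""
    n = max(1, math.ceil(math.dist(p0, p1) / delta))
    return [(p0[0] + (p1[0] - p0[0]) * k / n,
             p0[1] + (p1[1] - p0[1]) * k / n) for k in range(n + 1)]


def sample_quadratic(p0, c, p1, delta):
    """Points on the quadratic Bezier curve from p0 to p1 with control point c."""
    def at(t):
        u = 1.0 - t
        return (u * u * p0[0] + 2 * u * t * c[0] + t * t * p1[0],
                u * u * p0[1] + 2 * u * t * c[1] + t * t * p1[1])

    n = 1
    while True:
        pts = [at(k / n) for k in range(n + 1)]
        # Stop as soon as no two adjacent points are further apart than delta.
        if all(math.dist(a, b) <= delta for a, b in zip(pts, pts[1:])):
            return pts
        n *= 2  # assumed strategy: refine uniformly in the curve parameter


print(len(sample_line((0, 0), (10, 0), 5)))
print(len(sample_quadratic((0, 0), (5, 10), (10, 0), 1.0)))
```

A smaller δ yields more points per connection, hence a smoother shape at a higher processing cost.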

Figure 1: Obtaining the character shape from the SVG data.

It is important to note that while an SVG path declares only a few nodes (as few as necessary to reproduce the character shape), for instance 30 nodes in the SVG path example given in Section 2, our model introduces several times more points, as determined by the sampling value δ: for instance 83 points for the path example of Section 2 with δ = 5.

3.2 Detecting inner and outer segments

Because some characters, such as our example 暖, include overlapping segments (i.e. segments that include one or several holes), we next have to distinguish between what we call outer segments, that is, segments defining the outside shape of a character stroke (or stroke compound, since one segment can correspond to several character strokes as in Figure 1), and inner segments, that is, segments defining the inside shape (i.e. the hole shape). The input of this step is a segset, and the output is a list of outer segments, each outer segment being encoded as a pair (inners . outer) with outer the list of points corresponding to this outer segment shape, and inners the list of inner segments (each being a list of points) for that outer segment. This output is obtained with the following algorithm:

1. Identify outer and inner segments as follows. For each segment $s$ of the segset, iterate over all the other segments of the segset to get the number $p$ of segments including $s$. A segment $s$ is included inside a segment $s'$ if and only if all the points of $s$ are included inside the polygon induced by the points of $s'$. If $p$ is even, then the segment $s$ is an outer segment, and if $p$ is odd, then the segment $s$ is an inner one. Two lists are returned from this process: the outer segments, each with its number $p$, and the inner segments.
2. Sort the previously identified outer segments in descending order of their respective numbers $p$ so as to process the "innermost" outer segment first.
3. For each outer segment in this order, collect its inner segments (if any) from the returned inner segment list.

Segment inclusion is checked using the winding number method: a point is located inside a polygon if and only if the sum of the signed angles subtended at the point by consecutive polygon vertices is equal to 2π (or −2π). The angle sign is obtained independently from the angle value by calculating the vector (cross) product. A sample output of this step is illustrated in Figure 2a. In addition, the output of this step when applied to the character 回 is given in Figure 2b in order to illustrate several layers of segment inclusion.
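The three steps above, together with the winding number test, can be sketched in Python as follows (again, not the Racket implementation of this paper). All function names are hypothetical. Two simplifications are assumptions of this sketch: the signed angle is obtained in one call to `atan2` from the cross and dot products instead of computing sign and value separately, and an inner segment is removed from the candidate list once an outer segment has collected it.

```python
import math


def winding_angle(point, polygon):
    """Sum of the signed angles subtended at point by consecutive vertices."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i][0] - point[0], polygon[i][1] - point[1]
        bx, by = (polygon[(i + 1) % n][0] - point[0],
                  polygon[(i + 1) % n][1] - point[1])
        cross = ax * by - ay * bx  # its sign gives the angle sign
        dot = ax * bx + ay * by
        total += math.atan2(cross, dot)
    return total


def inside(point, polygon):
    """True if the winding angle is 2*pi or -2*pi (up to rounding)."""
    return abs(abs(winding_angle(point, polygon)) - 2 * math.pi) < 1e-6


def segment_inside(s, t):
    """A segment s is inside t if all the points of s are inside t."""
    return all(inside(p, t) for p in s)


def nest(segset):
    """Return the list of (inners, outer) pairs, innermost outer first."""
    outers, inners = [], []
    for i, s in enumerate(segset):
        p = sum(1 for j, t in enumerate(segset)
                if j != i and segment_inside(s, t))
        if p % 2 == 0:
            outers.append((p, s))   # step 1: even count, outer segment
        else:
            inners.append(s)        # step 1: odd count, inner segment
    outers.sort(key=lambda e: e[0], reverse=True)  # step 2
    result = []
    for _, outer in outers:         # step 3
        mine = [s for s in inners if segment_inside(s, outer)]
        inners = [s for s in inners if s not in mine]
        result.append((mine, outer))
    return result


def square(lo, hi):
    return [(lo, lo), (hi, lo), (hi, hi), (lo, hi)]


# Three nested squares: outer shape, hole, and a shape inside the hole.
A, B, C = square(0, 10), square(2, 8), square(4, 6)
print(nest([A, B, C]))
```

In this example the innermost square is included in two segments and is therefore an outer segment with no hole, while the middle square becomes the single inner segment of the outermost one.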

Figure 2: (a) Identifying outer (in black) and inner (in red) segments. (b) Identifying outer (in black) and inner (in red) segments: example of several nested segments.

3.3 Calculating normal vectors

Next, in preparation for connecting segment points together (this is the next step, see Section 3.4), we calculate the normal vector at each point of the segset. The key here is to know whether the points of a segment are sequenced clockwise or not. This is checked with the following algorithm.

1. For each point $p_i$ of the segment (i.e. polygon), consider the triple $p_{i-1}, p_i, p_{i+1}$, and calculate the normal vector $n_i$ at $p_i$ with $x_{n_i} = y_{p_{i-1}} - y_{p_{i+1}}$ and $y_{n_i} = x_{p_{i+1}} - x_{p_{i-1}}$.
2. Consider the ray starting at $p_i$ and of slope $y_{n_i} / x_{n_i}$, i.e. the half-line $[p_i, n_i)$. This ray is going inwards if and only if the polygon points are ordered clockwise.
3. Consider the point $p$ at distance $\epsilon$ from $p_i$ and located on this ray, with $\epsilon$ a very small number. We have $p = (x_{p_i} + \epsilon \times x_{n_i},\ y_{p_i} + \epsilon \times y_{n_i})$.
4. If $p$ is located inside the polygon, then the polygon points are ordered clockwise, and otherwise counter-clockwise.

Then, we process all the segments as follows to obtain ingoing normal vectors for outer segment points, and outgoing normal vectors for inner segment ones. An illustration is given in Figure 3.

Figure 3: Ingoing normal vector at $p_i$ (in blue), and translated at $p_i$, obtained from $p_{i-1}$ and $p_{i+1}$. Segment points are sequenced clockwise.

For each outer segment $s$, check if it is clockwise. If not, reverse the order of its points. Then, for each point $p_i$ of $s$, consider the triple $p_{i-1}, p_i, p_{i+1}$, and calculate the normal vector $n_i$ at $p_i$ and translated at $p_i$ with $x_{n_i} = x_{p_i} + (y_{p_{i-1}} - y_{p_{i+1}})$ and $y_{n_i} = y_{p_i} + (x_{p_{i+1}} - x_{p_{i-1}})$. Finally, the pair $(p_i, n_i)$ is returned. Inner segments are processed similarly, with the difference that the normal vector $n_i$ at $p_i$ and translated at $p_i$ has coordinates $x_{n_i} = x_{p_i} + (y_{p_{i+1}} - y_{p_{i-1}})$ and $y_{n_i} = y_{p_i} + (x_{p_{i-1}} - x_{p_{i+1}})$. A sample output of this step is illustrated in Figure 4a for outer segments, and in Figure 4b for inner ones. Note that the normal vectors are displayed directly from the returned pairs $(p_i, n_i)$.
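The orientation check and the computation of the translated normal vectors can be sketched in Python as follows (not the Racket implementation of this paper). The function names and the value of `eps` are assumptions of this sketch; the point-in-polygon test is the winding number method of Section 3.2, and coordinates follow the canvas convention of a y-axis oriented top-down.

```python
import math


def inside(point, polygon):
    """Winding number test: the signed angles sum to 2*pi or -2*pi."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i][0] - point[0], polygon[i][1] - point[1]
        bx, by = (polygon[(i + 1) % n][0] - point[0],
                  polygon[(i + 1) % n][1] - point[1])
        total += math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    return abs(abs(total) - 2 * math.pi) < 1e-6


def raw_normal(prev_pt, next_pt):
    """(y_{p_{i-1}} - y_{p_{i+1}}, x_{p_{i+1}} - x_{p_{i-1}})."""
    return (prev_pt[1] - next_pt[1], next_pt[0] - prev_pt[0])


def is_clockwise(polygon, eps=1e-3):
    """Probe the point p_i + eps * n_i; inside means clockwise ordering."""
    n = len(polygon)
    for i in range(n):
        nx, ny = raw_normal(polygon[i - 1], polygon[(i + 1) % n])
        if nx != 0 or ny != 0:  # skip degenerate triples
            probe = (polygon[i][0] + eps * nx, polygon[i][1] + eps * ny)
            return inside(probe, polygon)
    raise ValueError("degenerate polygon")


def point_normal_pairs(segment, outer=True):
    """Pairs (p_i, n_i), with n_i translated at p_i.

    Normals are ingoing for outer segments and outgoing for inner ones.
    """
    pts = list(segment) if is_clockwise(segment) else list(reversed(segment))
    sign = 1 if outer else -1
    n = len(pts)
    pairs = []
    for i in range(n):
        nx, ny = raw_normal(pts[i - 1], pts[(i + 1) % n])
        pairs.append((pts[i], (pts[i][0] + sign * nx, pts[i][1] + sign * ny)))
    return pairs


# A 10 x 10 square with one extra point in the middle of its top edge.
shape = [(0, 0), (5, 0), (10, 0), (10, 10), (0, 10)]
for pair in point_normal_pairs(shape):
    print(pair)
```

Reversing counter-clockwise input first means that a single sign flip is enough to switch between ingoing and outgoing normals.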

Figure 4: (a) Ingoing normal vectors for outer segment points (in blue). (b) Outgoing normal vectors for inner segment points (in blue).

3.4 Lacing

Here, the input is a list of point-normal pairs. Outer and inner segments are no longer distinguished from this point on: we simply call a segment the merged list of points corresponding to each pair (inners . outer) as described previously. A list of laces is output, one lace for each segment point, a lace connecting two segment points. Laces are deduced from normal vectors. The algorithm is as follows.

1. For each point-normal pair $(p_i, n_i)$, consider the ray $[p_i, n_i)$. Further consider the linear function $n(x)$ corresponding to this ray.
2. Iterate over the segment to collect all the pairs $(u, v)$ of adjacent points $u$ and $v$ with $u \neq p_i$ and $v \neq p_i$. Let $f(x)$ be the linear function induced by the line segment $[u, v]$.
3. For each such pair $(u, v)$, solve the equation $n(x) = f(x)$ to obtain the intersection point $r$ of the two functions. Then, check if $r$ lies on both the line segment $[u, v]$ and the ray $[p_i, n_i)$ to detect crossing of the ray $[p_i, n_i)$ with the line segment $[u, v]$.
4. Iterate over all the line segments intersecting the ray $[p_i, n_i)$ so as to find the one closest to $p_i$. In practice, the Euclidean distance between the intersection point $r$ and $p_i$ is used.
5. For the closest line segment $[u, v]$ intersecting the ray at $p_i$, lace $p_i$ to $w$ with $w \in \{u, v\}$ closest to, and distinct from, $p_i$.

Finally, duplicate laces are discarded. Our implementation uses the argmin function to find the closest intersecting line segment, with +∞ being the distance returned for line segments not intersecting the ray. Also, even though the lacing algorithm is otherwise identical, it must be noted that the cases where $f(x)$ or $n(x)$ is of the form $x = c$, with $c$ a constant, require special care. This lacing process relies on normal vectors because lacing each node to its closest non-adjacent node would risk lacing two nodes that are separated by a "hole" (i.e. empty space). A sample output of this step is illustrated in Figure 5a.
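The lacing steps can be sketched in Python as follows (not the Racket implementation of this paper). Note one deliberate substitution: instead of solving $n(x) = f(x)$ with slope-intercept functions, this sketch intersects the ray and the line segment in parametric form, which handles vertical lines without special cases. The function names are hypothetical, and for brevity the segment is assumed to be a single closed polygon rather than a merged (inners . outer) point list.

```python
import math


def ray_hits_segment(p, n, u, v):
    """Intersection of the ray [p, n) with the line segment [u, v], or None."""
    dx, dy = n[0] - p[0], n[1] - p[1]      # ray direction
    ex, ey = v[0] - u[0], v[1] - u[1]      # segment direction
    denom = dx * ey - dy * ex
    if denom == 0:                          # parallel: no single crossing
        return None
    wx, wy = u[0] - p[0], u[1] - p[1]
    t = (wx * ey - wy * ex) / denom         # position along the ray
    s = (wx * dy - wy * dx) / denom         # position along the segment
    if t > 1e-12 and -1e-12 <= s <= 1 + 1e-12:
        return (p[0] + t * dx, p[1] + t * dy)
    return None


def lace(index, pairs):
    """Lace for the point at position index, or None if the ray hits nothing."""
    p, n = pairs[index]
    pts = [q for q, _ in pairs]
    best, best_dist = None, math.inf        # argmin over line segments
    for k in range(len(pts)):
        u, v = pts[k], pts[(k + 1) % len(pts)]
        if u == p or v == p:
            continue
        r = ray_hits_segment(p, n, u, v)
        dist = math.dist(p, r) if r else math.inf
        if dist < best_dist:
            best, best_dist = (u, v), dist
    if best is None:
        return None
    w = min(best, key=lambda q: math.dist(p, q))
    return (p, w)


def lace_all(pairs):
    """All laces, as unordered point pairs so that duplicates are discarded."""
    laces = set()
    for i in range(len(pairs)):
        found = lace(i, pairs)
        if found:
            laces.add(frozenset(found))
    return laces


# A 10 x 2 horizontal bar, clockwise on a canvas whose y-axis points down,
# with the translated ingoing normals of Section 3.3.
PAIRS = [((0, 0), (2, 5)), ((5, 0), (5, 10)), ((10, 0), (8, 5)),
         ((10, 2), (8, -3)), ((5, 2), (5, -8)), ((0, 2), (2, -3))]
print(lace_all(PAIRS))
```

For this bar, each point is laced to the point facing it across the stroke, giving three laces after duplicates are discarded.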

3.5 Filtering and connecting

3.5.1 Lace filtering

Some laces may connect points that are exceptionally distant; we consider such laces as noise and filter them out. For this, we first compute the average lace length of the current segment by iterating over all the segment laces. Then, we calculate the average absolute deviation around the previously obtained average lace length for the current segment. Next, we iterate a third time over all the segment laces, discarding each lace whose length differs from the average lace length by more than the average absolute deviation. The average absolute deviation is used rather than the standard deviation since it eliminates more noise: it is easy to show that the average absolute deviation is smaller than or equal to the standard deviation. A sample output of this filtering process is given in Figure 5b.

3.5.2 Merge lace centre points

Next, the centres of retained laces (i.e. laces after lace filtering) are collected. Because several laces may have their respective centre points very close to each other, we merge centre points as per the following algorithm.

1. Initialise the point set $P$ as the set of previously collected lace centre points.
2. For each point $p \in P$, partition $P$ into the points that are close to $p$ and those that are not. Two points are said to be close to each other if and only if their Euclidean distance is smaller than the threshold value δ/4.
3. Consider the set $P' \subseteq P$ of the points that are close to $p$, and calculate its average point $p'$ by the arithmetic mean of each dimension for all points of $P'$. Note that $p \in P'$.
4. Store $p'$ and recursively apply this algorithm with the new point set $P \setminus P'$. The algorithm terminates when the new point set obtained is the empty set.

The points obtained from this algorithm are called the wire points.

Figure 5: (a) Lacing segment points together, relying on normal vectors. (b) Filtering laces using the average absolute deviation metric.

3.5.3 Connect wire points

Finally, we connect the wire points obtained in the previous step. The connection algorithm is detailed in Algorithm 1; our implementation is in functional style, but we give the algorithm in imperative style for the sake of clarity. The main idea of this algorithm is, for each segment, to connect each wire point to its nearest neighbour, and to do so for each of the following four orderings: wire points in ascending (resp. descending) order of their horizontal coordinates, called leftmost (resp. rightmost) order, and in ascending (resp. descending) order of their vertical coordinates, called topmost (resp. bottommost) order (on our canvas, the y-axis is oriented top-down). Thus, in total, wire point connection is made in four passes. The merit of this four-pass method is detailed in Table 1: the number of selected edges at each pass is given, showing that each pass has some impact on the final wire-frame model. Still considering the character 暖, it has six segments (i.e. six (inners . outer) pairs), but in the case δ = 5, two of them (Segments 1 and 3) each have one single wire point retained, and thus no edge selected. The passes #1 to #4 correspond to leftmost, topmost, rightmost and bottommost orders, respectively. A sample output of this final step is given in Figure 6; on the left, the value of δ is the one used previously for Figures 1 to 5b (precisely, for reference, δ = 5), and on the right the result with a lower value of δ (precisely, δ = 1). As shown in this figure, minor imperfections are corrected by lowering the sampling value δ, which can be arbitrarily small (we recall that $\delta \in \mathbb{R}_{>0}$).
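The wire points connected here result from the lace filtering of Section 3.5.1 and the centre point merging of Section 3.5.2. Those two steps can be sketched in Python as follows (not the Racket implementation of this paper). The function names are hypothetical. Two points are assumptions of this sketch: the "length difference" of a lace is taken as an absolute difference, and the recursion of the merging algorithm is written as a loop that always picks the first remaining point as $p$.

```python
import math


def filter_laces(laces):
    """Keep the laces whose length is within one average absolute deviation
    of the average lace length."""
    if not laces:
        return []
    lengths = [math.dist(a, b) for a, b in laces]
    mean = sum(lengths) / len(lengths)
    aad = sum(abs(x - mean) for x in lengths) / len(lengths)
    return [l for l, x in zip(laces, lengths) if abs(x - mean) <= aad]


def centre(lace):
    (ax, ay), (bx, by) = lace
    return ((ax + bx) / 2, (ay + by) / 2)


def merge_points(points, delta):
    """Merge points closer than delta / 4 into their average point."""
    threshold = delta / 4
    remaining = list(points)
    wire_points = []
    while remaining:                        # stops on the empty set
        p = remaining[0]
        close = [q for q in remaining if math.dist(p, q) < threshold]
        remaining = [q for q in remaining if math.dist(p, q) >= threshold]
        wire_points.append((sum(q[0] for q in close) / len(close),
                            sum(q[1] for q in close) / len(close)))
    return wire_points


# Three laces of length 2 and one exceptionally long lace (noise).
LACES = [((0, 0), (0, 2)), ((1, 0), (1, 2)), ((2, 0), (2, 2)), ((3, 0), (3, 20))]
kept = filter_laces(LACES)
print(kept)
print(merge_points([centre(l) for l in kept], 5))
```

With these values, the average lace length is 6.5 and the average absolute deviation is 6.75, so only the lace of length 20 is discarded; with δ = 5 the first two centre points are closer than 1.25 and are merged.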

4 Experimental results

In this section, additional experimental results are presented. To begin, we have rendered the wire-frame models of the first nine characters of the regular-use Chinese characters list as defined in the Japanese language [9]: 亜, 哀, 挨, 愛, 曖, 悪, 握, 圧 and 扱. The sampling value δ has been successively set to 3 and 1. The obtained rendering results are given in Table 2.


Algorithm 1: WIRE(List points)
Input: List of wire points for a segment.
Output: List of edges (i.e. wire point pairs).

Procedure FWD(Point p, List points)
    mark p as visited;
    Point q ← nearest point to p in points \ {p};
    if edge pq not already selected then
        select the edge pq;
    end
    if q unvisited then
        FWD(q, points);
    end
end

if |points| = 1 then exit;

List lmost ← sort points in ascending order horizontally;
List tmost ← sort points in ascending order vertically;
List rmost ← reverse lmost;
List bmost ← reverse tmost;

foreach P in (lmost, tmost, rmost, bmost) do
    mark all points of P as unvisited;
    foreach point p in P do
        if p unvisited then FWD(p, points);
    end
end
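Algorithm 1 can be sketched in Python as below. This is only an illustrative transcription, not the author's Racket implementation, and it rests on several assumptions that the pseudocode leaves open: wire points are distinct (x, y) tuples, distance is Euclidean, the recursion of FWD is written as a loop, and ties between equally near points are resolved in favour of the point that comes first in the ordering of the current pass. Under this reading, tie-breaking is the only place where the ordering can change which edges get selected. The names wire, fwd and counts are hypothetical.

```python
from math import dist


def wire(points):
    """Return (edges, counts): the selected edges and the number added per pass.

    Assumptions (not from the paper): points are distinct (x, y) tuples,
    distance is Euclidean, and ties go to the point that comes first in the
    ordering of the current pass.
    """
    edges = []        # selected edges, in selection order
    selected = set()  # same edges as unordered pairs, for the "already selected" test
    counts = []       # number of new edges selected at each pass
    if len(points) < 2:
        return edges, [0, 0, 0, 0]

    def fwd(p, order, visited):
        # Iterative form of the tail-recursive FWD procedure.
        while True:
            visited.add(p)
            # min() keeps the first minimal element, hence the tie-breaking rule.
            q = min((c for c in order if c != p), key=lambda c: dist(p, c))
            edge = frozenset((p, q))
            if edge not in selected:
                selected.add(edge)
                edges.append((p, q))
            if q in visited:
                return
            p = q

    lmost = sorted(points, key=lambda pt: pt[0])  # ascending x
    tmost = sorted(points, key=lambda pt: pt[1])  # ascending y (y-axis is top-down)
    rmost = lmost[::-1]
    bmost = tmost[::-1]

    for order in (lmost, tmost, rmost, bmost):
        before = len(edges)
        visited = set()  # "mark all points as unvisited"
        for p in order:
            if p not in visited:
                fwd(p, order, visited)
        counts.append(len(edges) - before)
    return edges, counts


# Four corners of a unit square: every corner has two equally near neighbours.
square_edges, square_counts = wire([(0, 0), (1, 0), (0, 1), (1, 1)])
print(square_counts)  # [3, 1, 0, 0]
```

On this toy input the leftmost pass selects three sides of the square and the topmost pass adds the fourth, which illustrates how a later pass can still contribute an edge.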

Table 1: Number of selected edges at each pass for the character 暖. A "-" marks a segment with a single wire point retained, hence with no edge to select.

Sampling    δ = 5                   δ = 1
Pass        #1   #2   #3   #4      #1   #2   #3   #4
Segment 1    -    -    -    -      10    0    3    0
Segment 2   23    9    1    4      96   16   19    0
Segment 3    -    -    -    -      11    0    1    0
Segment 4    6    0    0    0      34   11    2    0
Segment 5    3    0    0    0       9    2    0    0
Segment 6   32   10    3    0     164   34   23    1
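As a quick check of the claim that every pass contributes to the final model, the per-pass figures of Table 1 can be summed over all segments. The short Python sketch below only re-adds the numbers printed in the table; the names table1 and pass_totals are hypothetical.

```python
# Edge counts transcribed from Table 1, one inner list per pass (#1 to #4).
# Each inner list holds the counts of the segments that have edges at all.
table1 = {
    5: [[23, 6, 3, 32], [9, 0, 0, 10], [1, 0, 0, 3], [4, 0, 0, 0]],
    1: [[10, 96, 11, 34, 9, 164], [0, 16, 0, 11, 2, 34],
        [3, 19, 1, 2, 0, 23], [0, 0, 0, 0, 0, 1]],
}


def pass_totals(table):
    """Sum the selected edges over all segments, for each delta and each pass."""
    return {delta: [sum(counts) for counts in passes]
            for delta, passes in table.items()}


totals = pass_totals(table1)
print(totals[5])  # [64, 19, 4, 4]
print(totals[1])  # [324, 63, 48, 1]
```

For both sampling values the first pass selects most of the edges, yet every later pass still adds at least one.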

As explained previously, these results show that rendering fidelity to the original character increases as the value of δ decreases. Also, in the case δ = 1, for some character patterns, especially those with right angles, one can notice “cross”-like rendering artefacts, where some wires perpendicular to the expected character shape are generated (see for instance characters #5 and #6 of Table 2). This is due to lace filtering: the metric used for filtering is the average absolute deviation, and as the sampling value decreases, the number of undesired laces connecting opposite sides of a character shape increases. These laces then affect the deviation value significantly, so that some of them eventually remain unfiltered.

Figure 6: Wire-frame model obtained for the character 暖 with (a) δ = 5 and (b) δ = 1.

Table 2: Wire-frame model renderings of the first nine regular-use Chinese characters in Japanese, numbered 1 to 9, each rendered with δ = 3 and δ = 1.

As a second experiment, we measured the time required to render the wire-frame model of characters depending on the sampling value δ, in other words, for each character, the time taken to go through the complete process described above in Section 3 and to display the computed wire-frame model on a canvas inside a GUI window. Precisely, we render and display side-by-side in the canvas three arbitrarily selected characters: 暖, 曖, and 回. This processing is conducted directly from within the Racket interpreter DrRacket on a 64-bit Microsoft Windows 10 machine equipped with an Intel Core i7-4510U processor and 8 GB RAM. The measured execution times are detailed in Figure 7, with an approximation of the induced average time complexity plotted for reference.

Figure 7: Measured required time for wire-frame model rendering of the three sample characters 暖, 曖, and 回 depending on the sampling value δ (x-axis: δ; y-axis: time in ms; plotted curves: rendering time and the reference approximation 100 * (δ - 5)² + 200).

We recall that the lower the sampling value δ, the longer the segmentation process. So, from this empirical evaluation, we were able to estimate the average time complexity of the proposed algorithm as quadratic in the sampling value δ, i.e. O(δ²).
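To make the reference curve of Figure 7 concrete, the short Python sketch below evaluates the approximation 100 * (δ - 5)² + 200 given in the figure legend for a few sampling values. The function name approx_time_ms is hypothetical, and the values are those of the fitted curve only, not the measured times, which are not reproduced in the text.

```python
def approx_time_ms(delta):
    """Reference approximation from the legend of Figure 7, in milliseconds."""
    return 100 * (delta - 5) ** 2 + 200


for delta in (1, 2, 3, 4, 5):
    print(delta, approx_time_ms(delta))
# 1 1800
# 2 1100
# 3 600
# 4 300
# 5 200
```

The parabola has its minimum at δ = 5, so it should be read as a fit over the plotted range only, where the time grows as δ decreases.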

5 Conclusions

In this paper, we have described an innovative method for fully automatic Chinese character segmentation. Since the result is a wire-frame model of a character, it enables faster further character analysis than the conventional raster graphics approach. As an experiment, we have segmented the first nine characters of the list of the regular-use Chinese characters as used in Japanese. Then, we have empirically evaluated the average time complexity of the proposed algorithm depending on the sampling value δ; an O(δ²) time complexity was deduced. Regarding future work, it will be meaningful to 1) refine the deviation metric so as to filter laces more accurately and avoid rendering “cross” artefacts, and 2) improve and simplify the wire point connection process. Then, a larger experiment will be considered. Next, we aim at generating a pattern database for Chinese characters in order to easily decompose ideograms. Finally, this segmentation method could be applied to other logographic systems such as Egyptian hieroglyphs for similar purposes.

References

[1] Apple Computer, https://developer.apple.com/fonts/TrueType-Reference-Manual/. TrueType™ reference manual, 1991–2014. Last accessed September 2016.
[2] Tavmjong Bah. Inkscape: guide to a vector drawing program. Prentice Hall, NJ, USA, 4th edition, 2011.
[3] Antoine Bossard. Premises of an algebra of Japanese characters. In Proceedings of the Eighth International C* Conference on Computer Science & Software Engineering, pages 79–87, Yokohama, Japan, July 2015.
[4] Antoine Bossard. aIME: a new input method based on Chinese characters algebra. In Studies in Computational Intelligence, volume 656, chapter Computer and Information Science, pages 167–179. Springer, 2016.
[5] Antoine Bossard and Keiichi Kaneko. A scientific approach to Chinese characters: rationale, ontology and application. In Proceedings of the 29th International Conference on Computer Applications in Industry and Engineering, pages 111–116, Denver, CO, USA, September 2016.
[6] Matthew Flatt. Creating languages in Racket. Communications of the ACM, 55(1):48–56, 2012.
[7] Osamu Fujimura and Ryohei Kagaya. Structural patterns of Chinese characters. In Proceedings of the Conference on Computational Linguistics, pages 1–17, 1969.
[8] Adobe Systems Incorporated. Adobe Type 1 font format. Addison-Wesley, MA, USA, 1990.
[9] Japanese Ministry of Education, The Agency for Cultural Affairs. Table of the regular-use kanji characters (常用漢字表, in Japanese), 2010.
[10] Pieter P. Jonker. Morphological operations on 3D and 4D images: from shape primitive detection to skeletonization. In Lecture Notes in Computer Science, volume 1953, chapter Discrete Geometry for Computer Imagery, pages 371–391. Springer, 2001.
[11] Louisa Lam, Seong-Whan Lee, and Ching Y. Suen. Thinning methodologies - a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(9):869–885, 1992.
[12] Richard Sproat. A computational theory of writing systems. Cambridge University Press, Cambridge, United Kingdom, 2000.
[13] Tetsurou Tanaka, Yuichiro Ishii, Mikio Takeuchi, and Eiiti Wada. Sharing skeleton data by multiple kanji fonts through programmable rendering (in Japanese). Transactions of the Information Processing Society of Japan, 36(1):177–186, 1995.
[14] World Wide Web Consortium, http://www.w3.org/TR/SVG11/. Scalable Vector Graphics (SVG) 1.1 (Second Edition), 2011. Last accessed August 2016.
