Microelectronics Journal
Methods for automated detection of plagiarism in integrated-circuit layouts

Dominik Kasprowicz*, Hilekaan Wada

Institute of Microelectronics and Optoelectronics, Warsaw University of Technology, ul. Koszykowa 75, 00-662 Warsaw, Poland
Abstract
Article history: Received 6 December 2013; Received in revised form 7 March 2014; Accepted 7 April 2014
Student projects have always been plagued by plagiarism. Integrated-circuit (IC) design courses are no exception. Since layout is considered the most laborious part of circuit design, it is common for students to reuse their colleagues' work with some minor modifications intended to make the cheating harder to detect. While software detecting plagiarism in text or computer code is commonly used these days, no counterpart exists for IC layouts. This paper proposes several criteria of IC-layout dissimilarity that can be used for computer-aided layout matching. A program based on these criteria is shown to successfully identify similar layouts in a pool of designs. © 2014 Elsevier Ltd. All rights reserved.
Keywords: Plagiarism detection; Copyright protection; Integrated circuit layout; Layout matching
1. Introduction

Academic courses in integrated-circuit (IC) design usually include projects where a student is asked to design a simple IC cell down to the physical-layout level. As the circuit must be simple enough to be designed by a relatively inexperienced person, the number of architectures useful for that purpose is quite limited. Therefore, it is natural for some circuits, like the Miller transconductance amplifier or a flip-flop, to be reused year after year or even within a single student group. The design of the physical layout of a given circuit is usually viewed by students as the most challenging – or at least the most laborious – part of the assignment. As a consequence, some students are tempted to reuse their colleagues' work, usually with minor modifications intended to make the cheating harder to detect. While it is relatively easy for an instructor to spot cases of plagiarism within a single student group, reuse of layouts created a couple of years back is likely to go unnoticed. Thus, a computer application pointing the user to "suspiciously similar" items in a layout repository would be a useful tool. Whereas software supporting detection of plagiarism in essays or computer code is already mature and widely used (see e.g. [1] or [2] for an overview of currently used methodologies), no counterpart for circuit layouts is available. To the best of the authors' knowledge, no results of academic studies in this field have been published either.
* Corresponding author. Tel.: +48 222347207. E-mail addresses: [email protected], [email protected] (D. Kasprowicz).
This paper summarizes the authors' initial efforts toward automated detection of IC-layout plagiarism. It is an extended version of the work presented in [3]. Section 2 contains general remarks on the problem of layout comparison. Formal measures of layout dissimilarity are proposed in Section 3. Section 4 contains a thorough analysis of the outcome of an automated scan for plagiarism performed on a given pool of layouts. The section starts with a discussion of results obtained for the set of eight layouts presented in Fig. 7 at the end of this paper, which permits easy understanding of the sources of strengths and weaknesses of each measure of dissimilarity. Then, results for a much larger set of layouts are presented to draw more general conclusions regarding those measures. The section ends with an analysis of the impact of the number of layouts and their complexity on the runtime of the plagiarism-detection procedure. Section 5 summarizes the work.

2. IC layout matching – general observations

An important step toward the creation of a layout-plagiarism detector is a verbalization of what layouts are considered similar. Certainly, similar sizes of corresponding components, e.g. the input differential pair, are neither necessary nor sufficient to decide that one layout is a copy of another. Indeed, component sizes are strongly influenced by project specifications like power consumption, area or performance. If several assignments share a common architecture, the most obvious way to discourage the students from identically sizing their transistors and passives is imposing different specifications on each design. This policy, however, does not prevent the students from copying the physical
http://dx.doi.org/10.1016/j.mejo.2014.04.023 0026-2692/© 2014 Elsevier Ltd. All rights reserved.
Please cite this article as: D. Kasprowicz, H. Wada, Methods for automated detection of plagiarism in integrated-circuit layouts, Microelectron. J (2014), http://dx.doi.org/10.1016/j.mejo.2014.04.023
layout of the circuits designed by their colleagues, although some effort must be invested in resizing the components. Thus, other similarity measures must be sought. Two comparison criteria are contemplated: one is the location of transistors, the other is the shapes of the interconnects between those transistors. Of course, similarity of transistor placement and similarity of interconnect routing are still vague concepts that need further formal definitions. Such definitions are proposed in Section 3. Preferably, a tool comparing two layouts should be fairly insensitive to basic layout transformations: scaling, rotation, and reflection. This requirement gains importance if the transistor sizing and the placement of terminals are the designer's choice. In such a situation, resizing an existing layout (perhaps also changing its aspect ratio) along with a rotation or reflection is a simple way of making the layouts dissimilar at first glance. While detecting a rotated and/or reflected copy of a given design is conceptually trivial, detecting similarities between two layouts of a substantially different size or aspect ratio is trickier. Since design rules impose numerous upper and lower limits on the sizes of layout features and distances between them, a resized design is never just a magnified or shrunk copy of the original. Thus, even size normalization is not guaranteed to make the two layouts identical. Another problem stems from the fact that designs sharing the same architecture may differ in the number of transistors, i.e. MOSFET fingers. In fact, such differences may occur even in cases where one layout is an almost exact copy of another. Such a situation takes place in Designs A, B, and C (see Fig. 7), whose top-right portions differ substantially in the number and location of transistors, while the rest remains almost identical. 
From the algorithmic point of view, if two sets or sequences must be compared elementwise, a difference in their sizes always introduces some degree of ambiguity. Still, software detecting plagiarism must handle such situations. Detecting plagiarism in student assignments requires comparing a large number of relatively simple layouts stored in the teacher's repository. Detecting plagiarism in a pool of L designs requires L(L − 1)/2 comparisons, which might be costly in terms of computation time. Thus, it might be tempting to split this task into two stages. First, some small set of characteristic properties (a "signature") would be extracted from each layout. This extraction procedure would only run L times, so even computationally expensive algorithms would be allowed at this stage. The second step would involve pairwise comparisons of those signatures, which are relatively small datasets. Even though the overall time complexity of the whole procedure remains O(L²), such partitioning of the task might bring a substantial speedup. The task of detecting IC-layout plagiarism is much different from detecting plagiarism in essays or computer code. The latter problems are inherently one-dimensional, since they involve comparing strings of characters. An IC layout, on the other hand, is a two-dimensional (and multilayered) entity. Theoretically, IC-layout comparison could rely on bitmap-analysis algorithms. Such an approach, however, would be extremely wasteful of computational resources. Image-comparison algorithms usually begin their operation by performing either feature extraction or some dimensionality-reduction procedure, e.g. Principal Component Analysis. The actual comparison is subsequently performed on such a reduced set of crucial attributes rather than on the full set of pixels. The computer representation of an IC layout already contains the crucial data, i.e.
the coordinates of polygon vertices, and can be easily analyzed to extract parameters like polygon centroids, moments of area, or other measures described later in this paper. Thus, transforming such a vector representation into a bitmap would be an unnecessary (and very costly) step back. Nevertheless, some measures known from image
analysis, like moment invariants, can be easily adapted to handle shapes represented as sequences of vertices. Vector representation has an additional advantage of enabling fine-grained comparison (e.g. polygon-by-polygon) rather than comparing bitmaps representing entire circuits. This is important because, as mentioned before, similar placement and routing of corresponding components is usually indicative of plagiarism, irrespective of any difference in the dimensions of those components. If the two layouts were to be “blindly” compared as bitmaps, those size differences would likely blur the existing similarities between those layouts.
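To illustrate how cheap feature extraction is on the vector representation, the sketch below computes a polygon's area and centroid directly from its vertex list using the shoelace formula. This is an illustrative fragment of ours, not code from the comparator described in this paper.

```python
# Area and centroid of a simple polygon from its ordered vertex list
# (shoelace formula) -- no rasterization needed.
def polygon_centroid(vertices):
    """vertices: list of (x, y) tuples of a simple polygon, in order."""
    a = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return abs(a), (cx / (6 * a), cy / (6 * a))
```

Applied, for instance, to every rectangle of a MOSFET channel, this yields the centroids used by the measures of Section 3 in time linear in the number of vertices.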
3. Layout dissimilarity measures

The following definitions and assumptions are used throughout this paper. A layout signature is a sequence of numbers describing such properties of the underlying layout that a large difference (however defined) between two signatures implies a strong difference between the underlying layouts. That difference between signatures will be referred to as a dissimilarity measure. A layout comparator is a computer application calculating such dissimilarity measures within a given set of layouts. The word transistor denotes a MOSFET channel, defined as the intersection of the lithographic masks of polysilicon and the active layer (diffusion). If a MOSFET channel is divided into multiple fingers, each of them is treated in this work as a separate transistor unless noted otherwise. The term transistor coordinates or location applies to the centroid of the transistor's channel. Sections 3.1–3.3 outline the proposed dissimilarity measures based on transistor location.

3.1. Transistor distance to the layout center – TDLC

This method analyzes the distances d of all the N transistors to the center of the bounding box of all the transistor centers, as illustrated in Fig. 1. Thus, the signature is simply the sequence of those distances arranged in non-ascending order:

{d_1, d_2, …, d_N},  d_1 ≥ d_2 ≥ ⋯ ≥ d_N.    (1)

This leads to the following measure of dissimilarity between designs A and B:

TDLC(A, B) = sqrt( (1/M) · Σ_{m=1}^{M} (d_Am − d_Bm)² ).    (2)
Fig. 1. Illustration of the definition of the distances d of transistors from the layout center. The dashed contour outlines the box bounding all the transistor centroids (black crosses). Its center is the reference point.
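As a minimal sketch, the TDLC signature and the measures of Eqs. (2) and (4) can be written in a few lines of Python (the language the comparator described later is written in). Function and variable names are our own illustrative choices, not the authors' implementation.

```python
import math

def tdlc_signature(centers):
    """Distances of transistor centroids to the center of their bounding
    box, sorted in non-ascending order as in Eq. (1)."""
    xs, ys = zip(*centers)
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    return sorted((math.hypot(x - cx, y - cy) for x, y in centers),
                  reverse=True)

def tdlc(sig_a, sig_b, m=None):
    m = m or min(len(sig_a), len(sig_b))   # M_max of Eq. (3)
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(sig_a[:m], sig_b[:m])) / m)

def tdlc_norm(sig_a, sig_b, m=None):
    m = m or min(len(sig_a), len(sig_b))
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a[:m], sig_b[:m])))
    den = sum(a + b for a, b in zip(sig_a[:m], sig_b[:m]))
    return num / den                        # normalization of Eq. (4)
```

The truncation to the M largest distances handles layouts with different transistor counts, exactly as discussed below.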
The choice of the summation limit M is arbitrary. The maximum possible value of M in a comparison of circuits A and B containing N_A and N_B transistors, respectively, is

M_max = min{N_A, N_B}.    (3)
Using the M smallest, instead of largest, distances has been shown to lead nowhere, regardless of the choice of the summation limit M. To compensate for size differences between the analyzed layouts, the following normalization of TDLC has been proposed:

TDLC_norm(A, B) = sqrt( Σ_{m=1}^{M} (d_Am − d_Bm)² ) / Σ_{m=1}^{M} (d_Am + d_Bm),    (4)

where M, d_Am, and d_Bm have the same meaning as in (2).

3.2. Transistor-to-transistor distance – TTD

This approach is loosely inspired by the fingerprint-matching method presented in [4]. The first step requires calculation of the distances between the centroids of all the transistors in a given design. Thus, for a layout consisting of N transistors, an N × N Euclidean Distance Matrix (EDM) is obtained. Comparing two layouts based on such large datasets would be difficult, especially if the transistor count differed between those designs. Luckily, important features of any matrix are contained in its spectrum, i.e. the set of its eigenvalues. The EDM has a single positive eigenvalue and N − 1 negative ones. All the N eigenvalues of an EDM sum up to zero,¹ so the positive one can be left out of analyses as redundant. Large/small transistor-to-transistor distances generate negative eigenvalues with large/small absolute values. Thus, EDM eigenvalues close to zero usually correspond to distances between adjacent fingers of a single transistor. Every analog block contains a large number of multifingered transistors, and the finger-to-finger distance is usually kept as small as permitted by the design rules. This is why a large portion of the EDM eigenvalues of the analyzed circuits are close to zero and almost identical. Consequently, large (in terms of absolute value) eigenvalues are much more useful in differentiating IC designs.
This reasoning is important because if layouts with different transistor counts are to be compared, some eigenvalues must be discarded. The following signature is proposed:

{e_1, e_2, …, e_{N−1}},  |e_1| ≥ |e_2| ≥ ⋯ ≥ |e_{N−1}|,    (5)
where e_k denotes the k-th largest (in terms of absolute value) eigenvalue of the EDM of a given layout. The dissimilarity of layouts A and B is therefore expressed as

TTD(A, B) = sqrt( (1/K) · Σ_{k=1}^{K} (e_Ak − e_Bk)² ).    (6)

Again, the choice of the summation limit K is arbitrary. The maximum reasonable value of K is

K_max = min{N_A, N_B} − 1.    (7)
The TTD measure, too, can be normalized to compensate for size differences between the compared layouts. The normalized version is defined as

TTD_norm(A, B) = −sqrt( Σ_{k=1}^{K} (e_Ak − e_Bk)² ) / Σ_{k=1}^{K} (e_Ak + e_Bk).    (8)

The minus sign makes the measure a positive number in spite of the eigenvalues being negative.

¹ All the diagonal entries of an EDM are zeros, since the distance of a vertex to itself is zero. This implies that the sum of the eigenvalues of an EDM, by definition equal to the sum of all its diagonal entries, is also zero.
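The TTD signature of Section 3.2 can be sketched with NumPy as follows. The helper names are ours, the code assumes distinct transistor centroids, and the single positive eigenvalue is simply filtered out by sign.

```python
import numpy as np

def ttd_signature(centers):
    """Negative eigenvalues of the Euclidean-Distance Matrix of transistor
    centroids, ordered by non-ascending absolute value (Eq. (5))."""
    c = np.asarray(centers, dtype=float)
    diff = c[:, None, :] - c[None, :, :]
    edm = np.sqrt((diff ** 2).sum(axis=-1))   # N x N distance matrix
    ev = np.linalg.eigvalsh(edm)              # EDM is symmetric
    ev = ev[np.argsort(-np.abs(ev))]          # sort by |eigenvalue|, descending
    # Drop the single positive eigenvalue: it is redundant because the
    # eigenvalues of an EDM (zero trace) sum to zero.
    return ev[ev < 0]

def ttd(sig_a, sig_b, k=None):
    k = k or min(len(sig_a), len(sig_b))      # K_max of Eq. (7)
    d = sig_a[:k] - sig_b[:k]
    return float(np.sqrt((d ** 2).mean()))    # Eq. (6)
```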
3.3. Type-aware transistor-to-transistor distance

The measure introduced above does not discern between NMOS and PMOS devices. As a result, an NMOSFET in one design may be erroneously "matched" to a PMOSFET in the other design by incidentally having a similar distance to other transistors. Therefore, intuitively, a distinction between device types can only bring benefit. To this end, three separate EDMs are built, containing the NMOS–NMOS, PMOS–PMOS, and NMOS–PMOS distances within a design – see Fig. 2 for an illustration. If a design contains N transistors, N_N of which are n-type and N_P are p-type, the sizes of the first two matrices are of course N_N × N_N and N_P × N_P. The NMOS–PMOS distances can be written in the form of an N × N matrix with zeros corresponding to distances between devices of the same type. The number of negative eigenvalues of the NMOS–PMOS matrix equals min{N_N, N_P}, and each negative eigenvalue is accompanied by a positive one with an identical absolute value. If, for example, N_N > N_P, then the sequence of eigenvalues written in ascending order begins with N_P negative elements and ends with N_P positive elements, while the remaining N_N − N_P elements are zeros. Therefore, when using the three matrices for comparing two layouts A and B, it is reasonable to truncate the ordered sequences of their eigenvalues to

K_NNmax = min{N_NA, N_NB} − 1;  K_PPmax = min{N_PA, N_PB} − 1;  K_NPmax = min{N_NA, N_PA, N_NB, N_PB}.    (9)

The three sequences of eigenvalues, each truncated according to the appropriate limit defined by (9), are merged into one sequence of length

K_max = K_NNmax + K_PPmax + K_NPmax.    (10)
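The "type aware" signature can be sketched as below, assuming eigenvalues are kept in ascending order so that the most negative (largest-magnitude) ones come first. The function and variable names are ours, and the truncation limits are passed in per Eq. (9).

```python
import numpy as np

def edm(c):
    """Euclidean-Distance Matrix of a list of (x, y) centroids."""
    c = np.asarray(c, dtype=float)
    d = c[:, None, :] - c[None, :, :]
    return np.sqrt((d ** 2).sum(axis=-1))

def type_aware_signature(nmos, pmos, k_nn, k_pp, k_np):
    """nmos, pmos: centroid lists per device type; k_*: limits of Eq. (9)."""
    def neg_eigs(m, k):
        ev = np.linalg.eigvalsh(m)   # ascending: most-negative eigenvalues first
        return ev[:k]
    # Cross matrix: distances only between devices of different types,
    # with zeros for same-type pairs, as described above.
    all_c = np.vstack([nmos, pmos])
    cross = edm(all_c)
    n = len(nmos)
    cross[:n, :n] = 0.0              # zero out the NMOS-NMOS block
    cross[n:, n:] = 0.0              # zero out the PMOS-PMOS block
    return np.concatenate([neg_eigs(edm(nmos), k_nn),
                           neg_eigs(edm(pmos), k_pp),
                           neg_eigs(cross, k_np)])
```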
That aggregate sequence is subsequently plugged into formula (6) to calculate “type aware” TTD and into formula (8) for “type aware” TTDnorm. One obvious property of all the above measures is that they are rotation invariant. Additionally, normalization reduces their sensitivity to scaling. Another advantage results from the fact that once the signatures are extracted from a layout composed of N transistors, the original layout need not be analyzed again. Those relatively small sets of numbers are all that need to be stored in a database for comparison with old as well as future projects. Comparing a newly submitted layout with L old ones takes at most O(LN) time, which is acceptable.
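The two-stage repository scan described above can be sketched as follows: signatures are extracted once per layout (and could be cached in a database), then compared pairwise with any of the measures of Sections 3.1–3.3. The names are ours, not the authors' implementation.

```python
import itertools

def scan_repository(signatures, measure):
    """signatures: dict mapping layout name -> signature sequence.
    measure: a dissimilarity function of two signatures.
    Returns all L(L-1)/2 pairs sorted from most to least similar."""
    pairs = itertools.combinations(sorted(signatures), 2)
    scored = [((a, b), measure(signatures[a], signatures[b]))
              for a, b in pairs]
    return sorted(scored, key=lambda item: item[1])
```

The pairs at the head of the returned list are the "suspiciously similar" candidates to be inspected visually.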
4. Example results and discussion

A layout comparator based on the measures presented in Sections 3.1–3.3 has been written in Python. The program produces a list of design pairs sorted in ascending order of a selected dissimilarity measure. Since such a list becomes lengthy even for a small set of designs, the application also produces a dendrogram (discussed later in this section) to provide concise visual information about groups of layouts that appear similar. This program has been used to detect cases of plagiarism in a pool of 62 student designs, created with the proprietary layout editor UNCLE [5] and stored in the form of CIF files [6]. This section begins with a discussion of results obtained for a much smaller subset containing only eight designs. Their layouts are presented in Fig. 7 at the end of this paper, which enables interpretation of the obtained results. Later on, the quality of the proposed dissimilarity measures is evaluated more formally on the full set of 62 layouts. Finally, time complexity is evaluated as a function of the number of layouts and the transistor count per layout.
Fig. 2. Illustration of the "type aware" transistor-to-transistor distances used to calculate the dissimilarity measure described in Section 3.3. The style of a line (solid, dashed, or dotted) depends on the types of the devices connected by the line.

Table 1. TTD (micrometers) for designs in Fig. 7. Bold entries (shown as **x**) – actually similar layouts. Underlined entries (shown as _x_) – pairs misclassified as similar.

        A       B       C       D       E       F       G
B    **6**
C   **11**   **6**
D     34      41      44
E    _4_     _9_      12      31
F     20      14      12      59      23
G     57      69      73     _4_      52      97
H     40      35      32      79      43      24     130

Table 2. "Type aware" TTD (micrometers) for designs in Fig. 7. Bold entries (shown as **x**) – actually similar layouts. Underlined entries (shown as _x_) – pairs misclassified as similar.

        A       B       C       D       E       F       G
B    **6.2**
C    **7.8** **4.7**
D     38.6    46.0    47.3
E    _4.9_    11.1    12.1    32.9
F     18.2    12.5    13.5    61.0    23.5
G     69.2    81.8    84.1   _5.5_    59.5   107.4
H     31.1    26.7    25.0    76.0    36.6    20.9   133.4

Table 3. "Type aware" TTD_norm × 100 for designs in Fig. 7. Bold entries (shown as **x**) – actually similar layouts. Underlined entries (shown as _x_) – pairs misclassified as similar.

        A       B       C       D       E       F       G
B    **3.0**
C    **3.7** **2.1**
D     24.4    26.6    27.6
E    _2.6_     5.5     6.0    22.9
F      8.2     5.3     5.6    31.0    10.7
G     26.5    28.5    29.6     4.5    24.9    32.6
H     13.6    11.1    10.1    37.3    16.2     8.1    39.3

Table 4. TDLC (micrometers) for designs in Fig. 7. Bold entries (shown as **x**) – actually similar layouts. Underlined entries (shown as _x_) – pairs misclassified as similar.

        A       B       C       D       E       F       G
B    **1.8**
C    **2.2** **1.1**
D      8.6    10.7    11.1
E    _1.4_     2.9     3.3     7.3
F      4.5     2.9     3.0    13.2     5.3
G     10.0    12.6    13.9   _2.1_     7.9    16.3
H      6.0     4.3     4.2    15.6     7.1     2.4    20.7

4.1. Detailed results for a small set of layouts

The layouts shown in Fig. 7 represent a Miller transconductance amplifier designed for a 1 μm technology. Clearly, Designs A, B, and C are very similar to one another, while the other layouts look quite distinct. The values of TTD are summarized in Table 1. The number of eigenvalues used in (6) was the maximum possible for each pair analyzed, i.e. given by (7). The results correctly suggest the similarity of Designs A, B, and C – the highest TTD value within this set is 11, which is well below the average. Unfortunately, there are three other pairs ranked lower than that value. Namely, the column for Design E indicates its close similarity to Designs A and B, which is obviously false. Likewise, the extremely low value of TTD for pair D–G is erroneous. Even though cell G is twice as large as cell D, its surface area is still several times smaller than the area of the other designs, which may explain the apparent similarity of D and G.

Making the TTD measure "transistor type aware" brings some benefit, as shown in Table 2, where pair B–E no longer appears similar. This may be because the "transistor type awareness" makes more evident the large gap between the blocks of NMOS and PMOS devices in Design E. Unfortunately, pairs A–E and D–G still rank relatively high on the similarity list. Further improvement is brought by normalization of the "type aware" TTD, as can be seen in Table 3. Design D appears relatively more distant from Design G, probably due to the large difference in aspect ratios of those cells. This was enough to make the D–G distance much greater than the distances within the {A, B, C} group. Apart from K = K_max, values of 10 and 15 have also been used. This led to an overall increase in the TTD values, since small inter-transistor distances were left out of the average in (6). However, the same design pairs appeared at the top of the similarity list.

Values of the other measure, TDLC, for M = M_max can be found in Table 4. (The choice of M = 10 or 15 does not significantly affect the results.) Basically, TDLC indicates similarity of the same layout pairs as TTD. This is also true of Designs D and G, which – as in the case of TTD – appear similar only because of their exceptionally small size. The normalization used in TDLC_norm, however, compensates for this effect, as indicated by the numbers shown in Table 5. This time Designs D and G appear more different from each other than the designs within the subset {A, B, C}. On the other hand, the value for the F–H pair suggests similarity. This new effect, however unwanted, is easily explicable, because those two designs have a very specific transistor layout: they are the only ones comprised of just two distinct rows of devices running almost across the whole width of the layout. This similarity was not apparent in TTD and TDLC because of a remarkable difference in those layouts' aspect ratios. It is only the normalized measure that exposed this similarity.

Classification of layout pairs as similar or different is of course arbitrary. Therefore, no threshold can be fixed a priori for the measures introduced above. However, since distances between any two designs are defined, the problem can be viewed as cluster analysis. A visualization tool commonly used in that field is the dendrogram [7]. The proposed layout comparator generates
dendrograms based on any of the dissimilarity measures presented in this paper. Fig. 3 presents a dendrogram based on the “type aware” TTDnorm. The vertical position of a horizontal bridge connecting two designs corresponds to the distance between those designs – TTDnorm in this case. Bridges also connect groups of designs. In this case, the distance between two groups, say G1
and G2, is defined as the distance between those individual members – one in G1 and the other in G2 – that are the closest to each other. This corresponds to single-linkage clustering (see e.g. [7]).
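Since the dendrograms correspond to single-linkage clustering, the grouping can be reproduced with SciPy as a sketch. The 5 × 5 matrix below uses the Table 3 values ("type aware" TTD_norm × 100) for Designs A–E; the variable names are ours, not part of the comparator.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Pairwise dissimilarities for Designs A, B, C, D, E (Table 3 values).
D = np.array([
    [ 0.0,  3.0,  3.7, 24.4,  2.6],
    [ 3.0,  0.0,  2.1, 26.6,  5.5],
    [ 3.7,  2.1,  0.0, 27.6,  6.0],
    [24.4, 26.6, 27.6,  0.0, 22.9],
    [ 2.6,  5.5,  6.0, 22.9,  0.0],
])

# squareform() converts the matrix to the condensed form linkage() expects;
# method="single" is single-linkage clustering.
Z = linkage(squareform(D), method="single")

# Cutting the tree at a chosen threshold yields the "suspect" groups.
labels = fcluster(Z, t=4.0, criterion="distance")
# Designs A, B, C, E end up in one cluster; D stands alone.
```

The same `Z` matrix can be passed to `scipy.cluster.hierarchy.dendrogram` to draw a plot like Fig. 3.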
Table 5. TDLC_norm × 100 for designs in Fig. 7. Bold entries (shown as **x**) – actually similar layouts. Underlined entries (shown as _x_) – pairs misclassified as similar.

        A       B       C       D       E       F       G
B    **1.3**
C    **1.6** **0.7**
D      8.4     9.5     9.7
E    _1.1_     2.1     2.4     7.7
F      3.0     1.8     1.8    10.5     3.5
G      8.1     9.5    10.1     2.8     6.9    11.0
H      3.8     2.5     2.4    11.4     4.5   _1.3_    12.5

Fig. 5. Comparator execution time as a function of the number of analyzed layouts. The designs used in this experiment were Miller amplifiers similar to the ones discussed before. A quadratic function is fitted to the points corresponding to signature-comparison times, while linear functions approximate the time complexity of the other operations. Dendrogram-generation time cannot be approximated with a polynomial.

Fig. 6. Comparator execution time as a function of the number of transistors on a layout. The analyzed set consisted of five designs.
Fig. 3. Dendrogram showing pairwise distances between designs shown in Fig. 7. The distance measure is transistor-type-aware TTD (normalized).

Fig. 4. ROC curves for various dissimilarity measures obtained for a set of 62 layouts (1891 pairs, 10 of which are actually similar). Curves are plotted for TDLC, TDLC_norm, TTD, TTD_norm, type-aware TTD, and type-aware TTD_norm.
Fig. 7. Miller-amplifier layouts analyzed in this work. Note that layouts D and G are much smaller than others and have been enlarged to better expose their structure.
The dendrogram indicates that the most similar pair is B–C (which is correct), followed by pair A–E (which is wrong). However, the connection between A and E is followed very closely by a bridge connecting this pair to pair B–C. This is due to the similarity of layouts A, B, and C. It is worth noting that the other bridges are situated much farther away. Thus, in this particular case only the
pairs showing TTD_norm lower than about 0.04 should be subject to closer examination.

4.2. Comparison of dissimilarity measures for a large set of layouts

The choice of the threshold value separating "suspect" pairs from the rest is critical if a large set of layouts is to be analyzed. Raising the threshold increases sensitivity, defined as the ratio of detected cases of plagiarism to all such cases. However, it also increases the false positive rate (FPR), defined as the fraction of actually dissimilar pairs that have been erroneously classified as similar. This means much more work for the person examining the "suspect" layouts. A good guideline for the choice of threshold can be obtained by calculating the appropriate dissimilarity measure (e.g. TTD or TDLC) for each pair of layouts in a dataset with known cases of plagiarism. Sweeping the threshold across all the values of that dissimilarity measure encountered in the dataset will simultaneously change both the sensitivity and the FPR. Plotting sensitivity versus FPR produces a so-called ROC curve [8]. Such curves help in determining the appropriate threshold by illustrating the tradeoff between sensitivity and the number of cases that need to be analyzed. A good classifier maximizes sensitivity while minimizing FPR. Thus, the corresponding ROC curve should approach the upper-left corner of the plot as closely as possible. This observation enables easy visual comparison of various classifiers. ROC curves have been used to compare all the measures proposed in this work. All those measures have been calculated for a set of 62 designs. This makes 1891 pairs, 10 of which are known to be similar. Fig. 4 presents ROC curves obtained for TDLC, TTD, and "type aware" TTD, both before and after normalization. The first observation is that normalization does bring the expected benefit, especially at sensitivity levels below 0.8. In other words, normalization facilitates detection of most cases of plagiarism except the least obvious ones. Another conclusion is that making TTD aware of transistor type helps a lot in those hardest cases while bringing little benefit (or even making detection more difficult) in others. This effect is easy to explain, because transistor-type awareness cannot make "obviously" similar designs appear even more similar. Instead, it brings down the FPR by eliminating those pairs of layouts that bear some resemblance to each other but differ in the type of transistors placed in particular locations. Overall, the best classifier is the normalized version of TDLC, but in the "hardest" cases it is surpassed by the "type aware" TTD.
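The threshold sweep described above can be sketched as follows; the scores and ground-truth labels in the usage example are illustrative, not taken from the paper's dataset.

```python
def roc_points(dissimilarities, is_plagiarized):
    """dissimilarities: one measure value per layout pair.
    is_plagiarized: parallel list of booleans (ground truth).
    Returns (FPR, sensitivity) points for every threshold in the data."""
    points = []
    n_pos = sum(is_plagiarized)
    n_neg = len(is_plagiarized) - n_pos
    for t in sorted(set(dissimilarities)):
        # A pair is flagged as "suspect" when its dissimilarity is <= t.
        flagged = [d <= t for d in dissimilarities]
        tp = sum(f and p for f, p in zip(flagged, is_plagiarized))
        fp = sum(f and not p for f, p in zip(flagged, is_plagiarized))
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Plotting the returned points (FPR on the x-axis, sensitivity on the y-axis) yields an ROC curve like those in Fig. 4.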
7
4.3. Performance evaluation

Fig. 5 shows comparator execution times as a function of the number L of designs to be compared. Each layout is a Miller amplifier composed of around 30 transistors, similar to the ones discussed before. Each layout was analyzed to extract all the sequences used in expressions (2), (4), (6) and (8), both "type-aware" and without distinction between NMOS and PMOS devices. Those sequences will be referred to as the signatures of a layout. The first observation is that even the largest set of 60 designs takes no more than a second to process. The generation of signatures is a relatively slow process, but it is performed separately for each layout, so it is linear in the number of designs. The fastest-growing term is the comparison of layout signatures, i.e. the evaluation of the expressions for TTD and TDLC. It is performed pairwise, which implies quadratic time complexity. Fortunately, due to the relatively small size of the signatures, the time spent on this operation is negligible for sets smaller than a hundred layouts. Extrapolation of this quadratic component (not shown here) indicates that one minute would be sufficient to analyze a set of over a thousand layouts. Thus, the proposed layout comparator is capable of checking several standard cell libraries for instances of plagiarism. Interestingly, the operation of dendrogram generation, though relatively slow, appears to have O(log L) complexity, so it will certainly not be the limiting factor for larger datasets. It is also worth noting that most of the time is spent processing the input CIF file. This problem is addressed below.

As mentioned before, the program is intended to detect plagiarism in large sets of relatively simple designs laid out by hand. Nevertheless, it is interesting to see how the execution time is affected by the size of a single layout, measured by the number N of transistors. Fig. 6 shows the execution times for five sets of designs, the circuits within each set being similar to one another in terms of transistor count. For the set of largest circuits, each composed of around 750 transistors, the total processing time is around two minutes. The fastest step is the evaluation of the expressions for TTD and TDLC ("signature comparison" in the plot). It takes time proportional to the transistor count, which follows directly from the definitions of those measures. The calculation of the distances of individual transistors to the layout center ("TDLC calculation" in the plot) is also fast, even though the resulting list of distances is eventually sorted, which takes O(N log N) time. The generation of the other signature ("TTD calculation" in the plot) is much more time-consuming. It begins by determining the distances between all pairs of transistors, which has quadratic complexity. The subsequent calculation of the eigenvalues of the resulting distance matrix has a cost between O(N²) and O(N³). Still, the most time-consuming stage is the processing of the input CIF file, including the extraction of NMOS and PMOS channels from the shapes on the basic masks.

Since the shape of a transistor channel is the geometrical intersection of the Poly and Active masks, extracting all the devices takes time proportional to the product of the numbers of shapes on those two layers. This suggests a roughly quadratic complexity in the number of transistors. It could probably be improved by storing the Poly and Active shapes in more sophisticated data structures than the lists used in this program. It must also be noted that the comparator is implemented in Python, which is an interpreted language. The performance is entirely satisfactory for the largest analyzed set of more than 60 layouts of about 30 transistors each. For larger sets and/or more complex circuits the tool could be rewritten in C.
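The two signatures discussed above can be sketched as follows. This is an illustrative reconstruction based on the descriptions in the text, not the comparator's actual code; the function names and the choice of an L1 signature distance are assumptions. TDLC is taken as the sorted list of distances from transistor centers to the layout centroid, and TTD as the sorted eigenvalues of the pairwise transistor-distance matrix:

```python
import numpy as np

def tdlc_signature(coords):
    """TDLC: sorted distances of transistor centers to the layout
    centroid. Sorting dominates, so this step is O(N log N)."""
    pts = np.asarray(coords, dtype=float)
    center = pts.mean(axis=0)
    return np.sort(np.linalg.norm(pts - center, axis=1))

def ttd_signature(coords):
    """TTD: eigenvalues of the full pairwise-distance matrix.
    Building the N x N matrix is O(N^2); the symmetric eigenvalue
    computation costs between O(N^2) and O(N^3)."""
    pts = np.asarray(coords, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.linalg.norm(diff, axis=2)       # symmetric N x N matrix
    return np.sort(np.linalg.eigvalsh(dist))  # real eigenvalues, ascending

def signature_distance(sig_a, sig_b):
    """Compare two signatures element-wise; smaller means more similar.
    (The L1 sum used here is an assumption for illustration.)"""
    n = min(len(sig_a), len(sig_b))
    return float(np.abs(sig_a[:n] - sig_b[:n]).sum())
```

Both signatures depend only on inter-point distances, so they are invariant under translation of the whole layout, which is consistent with their use for matching layouts that have merely been shifted.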
5. Conclusions and future work

Formal criteria for comparing IC layouts are proposed in this paper. A computer program based on those criteria correctly identified similar layouts within a pool of submitted designs. The performance of this layout comparator is sufficient to analyze a set of 750 simple designs (30 transistors each) in less than two minutes. Therefore, the tool is capable of checking several standard cell libraries for instances of plagiarism. While the high false-positive rate is a problem, high sensitivity is preferred over specificity in this kind of task, because the final decision as to whether or not a layout has been copied from another must be based on careful visual inspection anyway. Several new approaches to IC-layout matching are now being tested in order to bring down the false-positive rate. One of them tests for homomorphism of graphs representing corresponding interconnects. Also, the layout comparator will be extended to accept GDSII files.
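The interconnect-graph idea mentioned above can be illustrated with a minimal sketch. This is a hypothetical example, not the comparator's planned implementation: it tests two small undirected netlist graphs for isomorphism by brute force over node bijections, which is adequate only for the handful of nets in a simple student design.

```python
from itertools import permutations

def _edge_set(edges):
    # Represent undirected edges as frozensets so (u, v) == (v, u).
    return {frozenset(e) for e in edges}

def isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test for small undirected graphs.
    Tries every bijection between the node sets, so the cost is O(n!);
    usable only for graphs with a handful of nodes."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    ea, eb = _edge_set(edges_a), _edge_set(edges_b)
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if {frozenset(mapping[u] for u in e) for e in ea} == eb:
            return True
    return False
```

A copied layout with renamed nets would yield a graph isomorphic to the original, while a genuinely different circuit topology would not; a practical implementation would of course need a sub-factorial matching algorithm.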
Please cite this article as: D. Kasprowicz, H. Wada, Methods for automated detection of plagiarism in integrated-circuit layouts, Microelectron. J (2014), http://dx.doi.org/10.1016/j.mejo.2014.04.023i
References

[1] A.M.E.T. Ali, H.M.D. Abdulla, V. Snasel, Survey of plagiarism detection methods, in: 5th Asia Modelling Symposium (AMS), 24–26 May 2011, pp. 39–42.
[2] S.M. Alzahrani, N. Salim, A. Abraham, Understanding plagiarism linguistic patterns, textual features, and detection methods, IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 42 (2) (2012) 133–149.
[3] D. Kasprowicz, H. Wada, Computer-aided detection of plagiarism in integrated-circuit layouts, in: Proceedings of the 20th International Conference on Mixed Design of Integrated Circuits and Systems (MIXDES), June 2013, pp. 213–217.
[4] C. Wang, M.L. Gavrilova, Delaunay triangulation algorithm for fingerprint matching, in: 3rd International Symposium on Voronoi Diagrams in Science and Engineering (ISVD '06), July 2006, pp. 208–216.
[5] UNCLE project homepage. [Online]. Available: 〈http://www.imio.pw.edu.pl/wwwvlsi/cad/imiocad/uncle〉.
[6] R. Sproull, R. Lyon, The Caltech Intermediate Form for LSI Layout Description, Technical Report TR-2686, California Institute of Technology, 1980.
[7] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York, 2009.
[8] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, 2005.