A two-step string-matching procedure

0031-3203/91 $3.00 + .00

Pattern Recognition, Vol. 24, No. 7, pp. 711-716, 1991

Pergamon Press plc

Printed in Great Britain

© 1991 Pattern Recognition Society

A TWO-STEP STRING-MATCHING PROCEDURE*

SHUFEN KUO† and GEORGE R. CROSS‡§

†Department of Computer Science, Washington State University, Pullman, WA 99164-1210, U.S.A.; ‡Contel Technology Center, Intelligent Systems Laboratory, 15000 Conference Center Drive, Chantilly, VA 22021-3808, U.S.A.

(Received 7 June 1989; in revised form 19 December 1990; received for publication 3 January 1991)

Abstract--String-matching algorithms which will be used as part of a knowledge-based retrieval system for multimedia documents, consisting of the graphical images of corporate trademarks, trademark words or phrases, and representations of companies and products, are discussed. Let LLCS(A, B) denote the length of the longest common subsequence of two strings A and B. We describe an improvement to the Wagner-Fischer Algorithm called WF+ for finding the LLCS which runs in linear space and quadratic time. An approximate algorithm for finding the LLCS, called the sorted string approximate algorithm (SSAA), is then presented, which runs in linear time but works on strings in sorted order only. The SSAA and WF+ are then combined into a two-step process which retrieves strings similar to a given string in near-linear time. Empirical data are presented in support of the estimates of average running time for the algorithms.

Keywords: String-matching algorithms; Information retrieval

1. INTRODUCTION

We are building a computer system that contains a database of trademarks and which reasons about their similarity in a manner used by consumers and courts of law.(1) When completed, this system will allow the user to input a combined text and graphical image and information about the company and product it represents. The system will then retrieve similar trademarks along with an assessment of their similarity and some measure of the possible confusion that can be expected. The system provides a framework for solving interesting problems in the design of image information systems, algorithms for comparison of words using both spelling and sound, image understanding, similarity-based retrieval and the combination of conflicting evidence. This paper concentrates on the similarity of word trademarks and presents some improvements to algorithms for comparing words.

2. SOME BACKGROUND ON TRADEMARKS

Trademarks represent an important corporate asset. In the United States, the Lanham Act protects manufacturers who register and use trademarks with their products. Designing a new trademark is a complex process which involves numerous sociological and legal factors. A trademark can be any word, name, symbol or device, or any combination of them. Some 80% of all trademarks are wordmarks, i.e. they do not involve graphic images.(2) The number of new wordmarks has given rise to concerns about legal problems of infringement. To create prospective trademarks which are reasonably different from existing trademarks and have a good chance of being accepted for registration, we should solve the trademark similarity problem. Furthermore, we should be able to quantify the similarity between trademarks. Since the consequences of infringement are severe, firms have sprung up to create new wordmarks for new products that are interesting and novel and do not represent words in any language. We are trying to study cases in order to develop a theory of infringement, but the legal situation is very confusing and many similar cases have been decided in apparently opposite ways.(3) More research needs to be done in order to understand the comparison methods used in the courts.

In this paper, we concentrate on a subset of the problem. Given two wordmarks, how can we measure their similarity? We settle on substring matching, describe improvements to existing algorithms, and describe a fast way to combine a pair of algorithms for improved performance. If two words share a lengthy common subsequence, then the two wordmarks are very similar indeed. Such a technique detects simple rhymes and spelling changes used by unscrupulous infringers. We feel the algorithms presented below may also be useful as tools for the information retrieval community.

* Supported in part by an equipment donation from Tektronix, Inc.
§ Author to whom correspondence should be addressed.

3. NOTATION

In this section, we define conventions, notation

    {Let w2[i] denote the ith letter of w2
     Let w1[j] denote the jth letter of w1
     Define a (m+1) by (n+1) matrix, M[0:m, 0:n].
     Let M[i, j] denote the (i, j)-element of the matrix}

    For j := 0 to n do M[0, j] := 0;    {initialize row 0 to 0}
    For i := 1 to m do M[i, 0] := 0;    {initialize column 0 to 0}
    For i := 1 to m do
      For j := 1 to n do
      begin
        if w1[j] = w2[i]
          then M[i, j] := 1 + M[i-1, j-1]
          else M[i, j] := max(M[i, j-1], M[i-1, j])
      end;
    return (M[m, n])

Fig. 1. The original Wagner-Fischer Algorithm WF.
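The recurrence of Fig. 1 can be sketched in Python as follows. This is an illustrative translation rather than the authors' code: the function name `llcs_wf` is hypothetical, and the string indices are shifted by one because Python strings are 0-indexed while the pseudocode is 1-indexed.

```python
def llcs_wf(w1: str, w2: str) -> int:
    """Length of the longest common subsequence using the full WF matrix.

    Hypothetical helper name; mirrors the loops of Fig. 1.
    """
    n, m = len(w1), len(w2)
    # M has (m + 1) rows and (n + 1) columns; row 0 and column 0 stay 0.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if w1[j - 1] == w2[i - 1]:
                # Matching letters extend the diagonal ("shoulder") value.
                M[i][j] = 1 + M[i - 1][j - 1]
            else:
                # Otherwise take the better of "left" and "head".
                M[i][j] = max(M[i][j - 1], M[i - 1][j])
    return M[m][n]


if __name__ == "__main__":
    print(llcs_wf("wings", "magics"))
```

The sketch keeps the whole matrix, so it uses space proportional to (m + 1)(n + 1).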

[Figure 2 graphic not reproduced; recoverable labels: axes W1 and W2, columns j - 1 and j, row i, origin (0,0).]

Fig. 2. (m + 1) by (n + 1) matrix: while comparing W1[j] with W2[i], before filling a value into M[i, j], there are only n + 2 cells in the current matrix which we will reference later. They are shaded above.

and terminology that will be used throughout this paper.

3.1. Strings

We use a vector notation for strings: A = A[1], A[2], ..., A[m]. We denote the length of string A by |A|. A[i] is the ith element of A and A[i:j] denotes the substring A[i], A[i + 1], ..., A[j]. A string C = C[1], C[2], ..., C[p] is a subsequence of string A = A[1], A[2], ..., A[m] if C is formed by deleting m - p (not necessarily adjacent) symbols from A. For example, "cut" is a subsequence of "computer". A string C is a common subsequence of strings A and B if C is a subsequence of A and also a subsequence of B. CS is the abbreviation for common subsequence. A string C is a longest CS of string A and string B if C is a CS of A and B of maximal length. LLCS(A, B) is the abbreviation for the length of the longest common subsequence of the strings A and B. For example, LLCS("wings", "magics") = 2 with longest common subsequences "is" and "gs".

3.2. String similarity

Although there are many possible definitions of string similarity, we have settled on a single measure called the likeness measure. Define the likeness measure of string A and string B, LM(A, B), as the following:

$$ LM(A, B) = \frac{LLCS(A, B)}{\max\{|A|, |B|\}}. $$

Retrieval of strings from a dictionary is based on setting a threshold called "ratio", then retrieving all strings from the dictionary with likeness measure with respect to the source string greater than or equal to ratio.

4. AN IMPROVED WAGNER-FISCHER ALGORITHM

Summaries of related work in string-matching and subsequence size calculations have appeared recently.(4,5) We now detail a space-saving improvement to the Wagner-Fischer Algorithm.(6) The Wagner-Fischer Algorithm (WF) computes the

[Figure 3 graphic not reproduced; recoverable labels: columns j - 1 and j, rows i - 1 and i.]

Fig. 3. Matrix after moving ith row one space to the left.

[Figure 4 graphic not reproduced; recoverable labels: cells (i, j - 1) "left", (i - 1, j - 1) "shoulder", (i - 1, j) "head" and (i, j) "current".]

Fig. 4. Matrix after overlapping ith row on (i - 1)th row.

    {This algorithm will return the LLCS of A and B in quadratic time and linear space}
    Let n be the length of A
    Let m be the length of B
    Create a vector V[0:n+1] with n+2 elements initialized to 0

    For i := 1 to m do
    begin
      For s := 1 to n do
      begin
        left := s - 1;
        head := s + 1;
        if A[s] = B[i]
          then V[s] := 1 + V[s]
          else V[s] := max(V[left], V[head]);
      end;
      {shift each element 1 space to right}
      For k := n+1 downto 1 do
        V[k] := V[k-1];
      V[0] := 0
    end;
    return (V[n+1])

Fig. 5. Algorithm WF+.
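A minimal Python sketch of the single-vector scheme of Fig. 5 is given here for illustration. The function name `llcs_wf_plus` is an assumption of this sketch, and indices into the strings are shifted by one for Python's 0-based indexing; the vector V keeps the 0..n+1 indexing of the figure.

```python
def llcs_wf_plus(a: str, b: str) -> int:
    """LLCS in linear space, following the shifted-row idea of Fig. 5.

    Hypothetical helper name. Before the inner loop for row i, v[s] holds
    the "shoulder" value M(i-1, s-1) and v[s+1] holds the "head" value
    M(i-1, s); v[s-1] holds the "left" value M(i, s-1) once it is written.
    """
    n, m = len(a), len(b)
    v = [0] * (n + 2)
    for i in range(1, m + 1):
        for s in range(1, n + 1):
            if a[s - 1] == b[i - 1]:
                v[s] = 1 + v[s]                 # current overwrites shoulder
            else:
                v[s] = max(v[s - 1], v[s + 1])  # max(left, head)
        # Shift each element one space to the right for the next row.
        for k in range(n + 1, 0, -1):
            v[k] = v[k - 1]
        v[0] = 0
    return v[n + 1]


if __name__ == "__main__":
    print(llcs_wf_plus("wings", "magics"))
```

The explicit shift costs an extra pass over the vector for every row, which is the overhead the circular-queue variant is meant to remove.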

    Let n be the length of A
    Let m be the length of B
    Create a vector V[0:n+1] with n+2 elements initialized to 0
    MaxNum := n+2;
    left := 0;
    For i := 1 to m do
    begin
      For j := 1 to n do
      begin
        s := (left+1) mod MaxNum;
        head := (s+1) mod MaxNum;
        if A[j] = B[i]
          then V[s] := 1 + V[s]
          else V[s] := max(V[left], V[head]);
        left := s
      end;
      left := head;
      V[left] := 0
    end;
    return (V[s])

Fig. 6. Algorithm WF+CQ, which is Algorithm WF+ with a circular queue.
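The circular-queue variant of Fig. 6 might be written in Python along the following lines. This is a sketch under stated assumptions: the name `llcs_wf_plus_cq` is hypothetical, and the early return for an empty string is an addition of this sketch, since the pseudocode leaves s undefined in that case.

```python
def llcs_wf_plus_cq(a: str, b: str) -> int:
    """LLCS in linear space with the vector treated as a circular queue.

    Hypothetical helper name; follows the index arithmetic of Fig. 6.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0  # assumption: guard for empty input, not in the figure
    size = n + 2            # MaxNum in the pseudocode
    v = [0] * size
    left = 0
    s = 0
    for i in range(m):
        for j in range(n):
            s = (left + 1) % size
            head = (s + 1) % size
            if a[j] == b[i]:
                v[s] = 1 + v[s]               # current overwrites shoulder
            else:
                v[s] = max(v[left], v[head])  # max(left, head)
            left = s
        # Instead of shifting the vector, advance the starting position.
        left = head
        v[left] = 0
    return v[s]


if __name__ == "__main__":
    print(llcs_wf_plus_cq("wings", "magics"))
```

Only the starting index moves between rows, so no elements are copied.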

    {The function INTERSECT returns the LLCS of the two sorted strings, S1 and S2}

    integer function INTERSECT (S1, S2)
      if (NULL S1) or (NULL S2)
        then INTERSECT := 0
        else if ((CAR S1) = (CAR S2))
          {compare the 1st character of S1 to the 1st character of S2}
          then INTERSECT := 1 + INTERSECT ((CDR S1), (CDR S2))
            {if the 1st character of S1 is equal to the 1st character of S2,
             then increase INTERSECT by 1 and discard first character of S1
             and first character of S2}
          else if ((CAR S1) < (CAR S2))
            then INTERSECT := INTERSECT ((CDR S1), S2)
              {if the 1st character of S1 is smaller than the 1st character
               of S2, then discard the 1st character of S1}
            else INTERSECT := INTERSECT (S1, (CDR S2))
              {discard the smaller character which is the 1st character of S2}

Fig. 7. Algorithm SSAA, the sorted string approximate algorithm: CAR and CDR have their usual interpretation as in LISP as returning the head and tail of a list, respectively. NULL is a predicate which tests whether a list is empty.
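The recursion of Fig. 7 can be restated iteratively with two indices, which avoids deep recursion on long strings. The sketch below is illustrative only: the names `intersect` and `sllcs` are hypothetical, and `sllcs` sorts its arguments on the fly, whereas the text assumes the dictionary strings are sorted beforehand.

```python
def intersect(s1: str, s2: str) -> int:
    """LLCS of two strings that are already sorted (iterative form of Fig. 7)."""
    i = j = count = 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            # Equal heads: count one match and discard both characters.
            count += 1
            i += 1
            j += 1
        elif s1[i] < s2[j]:
            i += 1  # discard the smaller head, which is in s1
        else:
            j += 1  # discard the smaller head, which is in s2
    return count


def sllcs(a: str, b: str) -> int:
    """Sorted-string LLCS of two arbitrary strings (hypothetical wrapper)."""
    return intersect("".join(sorted(a)), "".join(sorted(b)))


if __name__ == "__main__":
    print(sllcs("wings", "magics"))
```

For "wings" and "magics" the sorted strings share the letters g, i and s, so the approximate value exceeds the exact LLCS of 2 given in Section 3.1.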

LLCS of two strings using a two-dimensional matrix of dimension (m + 1) by (n + 1), where n is the length of the larger string and m is the length of the smaller string. WF is shown in Fig. 1. The following observations led to the improvement of WF, which we call WF+. Our improved algorithm WF+ requires only a vector of n + 2 elements. Let w1 and w2 be two strings with |w1| = n and |w2| = m.

(1) The order in which we fill values into the cells of the matrix is from top to bottom and left to right. When we compare the jth character of w1 to the ith character of w2, three elements in the matrix will be referenced: M(i - 1, j - 1), M(i, j - 1) and M(i - 1, j). However, after filling a value into M(i, j), only n + 2 elements in the current matrix will be referenced from this point on (see Fig. 2: the shaded cells contain values which will be referenced in the current comparison and in future comparisons).

(2) Call M(i - 1, j - 1) "shoulder", M(i, j - 1) "left", M(i - 1, j) "head" and M(i, j) "current". We shift the ith row of the matrix one cell to the left so that current and shoulder are in the same column (see Fig. 3).

(3) After we fill a value into M(i, j), shoulder is never referenced. Thus, we can store the current value into the space of shoulder; i.e. we can write M(i, j) into M(i - 1, j - 1) for 1 ≤ j ≤ n. See Fig. 4

Table 1. Running times (in seconds) for two-step algorithm and WF+ at various LM levels

    LM     Two-step: μ1 (σ1)    WF+ alone: μ2 (σ2)
    1.0    5.988 (0.741)        113.653 (34.838)
    0.9    5.922 (0.727)        112.323 (33.864)
    0.8    5.980 (0.747)        112.746 (34.564)
    0.7    6.445 (1.038)        113.619 (34.110)
    0.6    9.127 (3.603)        115.520 (35.751)

Table 2. Average size of subsets retrieved from the dictionary with SSAA and WF+ at various LM levels

    LM     SSAA    WF+
    1.0       1      1
    0.9       2      1
    0.8      18      3
    0.7     169     12
    0.6    1111     73

and note that shoulder and current occupy the same space. Therefore, instead of using an (m + 1) by (n + 1) matrix we can use a vector of n + 2 elements. The improved algorithm WF+ is shown in Fig. 5. In Algorithm WF+, s represents current and shoulder, and corresponds to both M(i, j) and M(i - 1, j - 1) in the WF Algorithm; left corresponds to M(i, j - 1) in the WF Algorithm and head corresponds to M(i - 1, j) in the WF Algorithm.

Hirschberg (7) presented a linear space algorithm which is a slight modification of WF. Algorithm WF+ with a slight modification to use a circular queue is more efficient than Hirschberg's algorithm, because it eliminates unneeded shifting and uses only half the space of Hirschberg's algorithm. Since shifting each element in the vector V one space to the right for each i is quite expensive, we improve Algorithm WF+ by treating the vector V as a circular queue. The resulting algorithm, which we call WF+CQ, is shown in Fig. 6.

5. APPROXIMATE STRING MATCHING

In this section, we present a fast method to retrieve similar word trademarks from the trademark dictionary. The main point of this method is to speed up the computation of the LLCS of two strings. For unsorted strings, it takes O(n^2) time to compute the LLCS of two strings; however, if they are sorted in a collating sequence, there is an algorithm which takes O(n) time to compute the LLCS. We assume both strings have the same length n.

5.1. The sorted string approximate algorithm

Define the length of the longest common subsequence of the sorted strings A and B as SLLCS(A, B) with value given by LLCS(sort(A), sort(B)), where "sort(·)" is a function which accepts a string and returns the string in alphabetical order. Call the algorithm which does this the sorted string approximate algorithm (SSAA). To make this algorithm efficient, we should sort each string in the dictionary beforehand; i.e. for each string B in the dictionary, the string sort(B) is also in the dictionary and is readily available when needed. Figure 7 gives a linear time algorithm for finding the SLLCS(A, B) when A and B are two sorted strings.

5.2. The two-step algorithm

Such a sorting results in the retrieval of more trademarks than would be retrieved from an exact algorithm like WF+; but experiments with actual data show the precision (i.e. the ratio of the number of strings retrieved with a common subsequence of a specified length to the total number of strings retrieved) to be high. Nevertheless, the approximate algorithm SSAA can be modified to be a fast exact algorithm by executing an exact algorithm on the subset retrieved by the approximate algorithm. Suppose that the subset retrieved by the SSAA is called S'. We then use WF+ to retrieve a smaller subset S of S'. The set S will be equal to the set of strings that would be retrieved from the dictionary using WF+ alone. The following lemma proves the correctness of the method.

Lemma. The set of strings retrieved by an exact string-matching algorithm is a subset of the set of strings retrieved by a sorted string-matching algorithm.

Proof. Let A and B be strings over a totally ordered finite alphabet H. To prove the lemma, we need only show SLLCS(A, B) ≥ LLCS(A, B). Let c be a character and S be a string. Define a function CNT#(c, S) as the number of times that character c appears in string S. From the definition of SLLCS and LLCS we know that:

$$ SLLCS(A, B) = \sum_{c \in H} \min(\mathrm{CNT\#}(c, A), \mathrm{CNT\#}(c, B)) $$

and

$$ LLCS(A, B) = |LCS(A, B)| = \sum_{c \in H} \mathrm{CNT\#}(c, LCS(A, B)). $$

Since LCS(A, B) is a subsequence of A and a subsequence of B, CNT#(c, LCS(A, B)) ≤ CNT#(c, A) and CNT#(c, LCS(A, B)) ≤ CNT#(c, B). So, CNT#(c, LCS(A, B)) ≤ min(CNT#(c, A), CNT#(c, B)). Summing over all c in H, we obtain SLLCS(A, B) ≥ LLCS(A, B).
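To make the two-step idea and the lemma concrete, the following Python sketch filters with the character-count form of SLLCS and then confirms the survivors with an exact LLCS. All names (`sllcs`, `llcs`, `two_step_retrieve`) are hypothetical; the exact step uses a plain rolling-row dynamic program as a stand-in for WF+, and the sample word list is invented for illustration.

```python
from collections import Counter


def sllcs(a: str, b: str) -> int:
    """Sum over characters c of min(CNT#(c, a), CNT#(c, b))."""
    return sum((Counter(a) & Counter(b)).values())


def llcs(a: str, b: str) -> int:
    """Exact LLCS with a rolling row (a stand-in for WF+ in this sketch)."""
    prev = [0] * (len(a) + 1)
    for ch in b:
        cur = [0]
        for j, ca in enumerate(a, start=1):
            if ca == ch:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(cur[j - 1], prev[j]))
        prev = cur
    return prev[-1]


def two_step_retrieve(source: str, dictionary: list[str], ratio: float) -> list[str]:
    """Return the words whose likeness measure with source is at least ratio."""
    results = []
    for word in dictionary:
        longest = max(len(source), len(word))
        if longest == 0:
            continue
        # Step 1: cheap approximate filter; by the lemma it never rejects
        # a word that the exact test would accept.
        if sllcs(source, word) / longest < ratio:
            continue
        # Step 2: exact check on the (usually small) surviving subset.
        if llcs(source, word) / longest >= ratio:
            results.append(word)
    return results


if __name__ == "__main__":
    words = ["wing", "swing", "magics", "things", "sewing"]  # invented sample
    print(two_step_retrieve("wings", words, 0.8))
```

In this invented sample, "sewing" passes the approximate filter but fails the exact check, which illustrates why the second step is needed.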


5.3. Complexity analysis of the two-step algorithm

Although there are two passes required for this algorithm, it is still quite efficient. Let X be the number of words in the trademark dictionary and Y be the number of words picked up from the dictionary using sorted retrieval. Since SSAA takes linear time, and WF+ takes O(n^2) time, we have a combined running time of X O(n) + Y O(n^2). We would expect X to be very large since it includes all existing trademarks. Y, the number of words retrieved, is small. Therefore, X O(n) + Y O(n^2) is much less than X O(n^2). The average running time per word in the dictionary is

$$ \frac{X \, O(n) + Y \, O(n^2)}{X} = O(n) + \frac{Y \, O(n^2)}{X}. $$

When Y ≪ X, the above value is close to O(n).

5.4. Empirical data

We have investigated the running time of the two-step procedure. We used 1000 words selected at random from a dictionary derived from a machine-readable form of Webster's 7th International Dictionary containing 48,178 words. We report the time to search this entire dictionary for each of the words in the 1000 and retrieve subsets at various levels of likeness measure. The algorithm was coded in Common LISP on a Tektronix 4444. The times listed in Table 1 are measured in seconds. μ1 is the average time for the two-step algorithm while μ2 is the average time for WF+ alone; σ1 and σ2 are the respective sample standard deviations. The first column is the likeness measure LM.

The reason for this improvement is that the SSAA retrieves a relatively small subset of the entire dictionary for examination by the WF+ algorithm. The average numbers of words retrieved at various levels of LM are shown in Table 2. Recall that in automated information retrieval, the precision is the percentage of relevant words in the retrieved subset, while the recall is the percentage of relevant words retrieved from the entire database.(8) The WF+ algorithm operates at precision and recall of 100% at a given LM value. The use of the SSAA has a perfect recall of 100% and its precision is high enough so that the slow WF+ algorithm does not have a very large set to work with.
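As a rough worked illustration using only the figures reported in Tables 1 and 2, take X = 48,178 dictionary words and, at LM = 0.6, the average SSAA subset size Y = 1111. The ratios below are simple arithmetic on those reported averages and are not additional measurements.

```latex
\frac{Y}{X} = \frac{1111}{48178} \approx 0.023,
\qquad
\left.\frac{\mu_2}{\mu_1}\right|_{LM = 0.6} = \frac{115.520}{9.127} \approx 12.7,
\qquad
\left.\frac{\mu_2}{\mu_1}\right|_{LM = 1.0} = \frac{113.653}{5.988} \approx 19.0
```

Even at the loosest threshold reported, the exact algorithm is applied to only about 2% of the dictionary on average, which is consistent with the observed ratio of average running times of roughly 13 to 19.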

We conclude from this that the two-step procedure is significantly faster than the WF+ method alone and could be used to enhance any string-matching algorithm at the expense of storage.

6. SUMMARY AND CONCLUSIONS

We have described the problem of a trademark retrieval system. We have described a space-saving improvement to the Wagner-Fischer Algorithm, known as WF+, and a simple approximate matching algorithm for sorted strings, called SSAA. The combination of these two with a pre-sorted dictionary provides a fast, near-linear algorithm. Further work needs to be done in assessing the adequacy of this measure of similarity. An important consideration is the sound of the two marks; we are developing algorithms for comparing the sounds of two trademarks using the same string-matching techniques described above. Instead of matching letters, we match phonemes and allow partial matches of similar-sounding phonemes. Moreover, composite marks like "MAGNAVOX" and "MULTIVOX" must be compared in the context of meaning in addition to sound and spelling.(9)

REFERENCES

1. G. R. Cross and A. Sirjani, A system that reasons about trademark infringement, Proc. 4th Int. Conf. Law Comput., Session V, Rome (1988).
2. C. J. Werkman, Trademarks. Barnes and Noble, New York (1974).
3. W.-B. Huang and G. R. Cross, Reasoning about trademark infringement cases, Proceeding: DARPA/ISTO Case-Based Reasoning Workshop. Morgan Kaufmann Publishers, Inc., San Mateo (1989).
4. K. Abe and N. Sugita, Distances between strings of symbols, review and remarks, 6th Int. Conf. Pattern Recognition, Munich (1982).
5. A. V. Aho, Algorithms for finding patterns in strings, The Handbook of Theoretical Computer Science. North-Holland, Amsterdam (1989).
6. R. A. Wagner and M. J. Fischer, The string to string correction problem, J. ACM 21, 168-173 (1974).
7. D. S. Hirschberg, A linear space algorithm for computing maximal common subsequences, CACM 18, 341-343 (1975).
8. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw Hill, New York (1983).
9. W.-B. Huang, A trademark infringement and case retrieval and reasoning system. Washington State University, Department of Computer Science (1990).

About the Author--GEORGE CROSS was born in New York City. He received his Ph.D. in Computer Science in 1980, from Michigan State University. He has held positions at International Business Machines, Louisiana State University and Washington State University. Presently, he is a Principal Scientist in the Intelligent Systems Laboratory of the Contel Technology Center in Chantilly, Virginia. He leads a project investigating applications of knowledge-based diagnostic systems to problems in the telecommunications industry. His research interests also include computer vision, pattern recognition and applications of artificial intelligence.

About the Author--SHUFEN KUO was born in Taipei, Taiwan, Republic of China. She received her M.S. degree in Computer Science from Washington State University in 1987. She is presently working toward a Ph.D. degree at Washington State University. Her research interests include information retrieval, computer algorithms and pattern recognition.