An optimal code for patient identifiers



Computer Methods and Programs in Biomedicine (2005) 79, 81–88

Andreas Faldum∗, Klaus Pommerening

Institut für Medizinische Biometrie, Epidemiologie und Informatik der Johannes-Gutenberg-Universität, D-55101 Mainz, Germany

Received 20 July 2004; received in revised form 11 March 2005; accepted 17 March 2005

KEYWORDS

Patient identifiers; Error detection; Error correction; MDS code

Summary

How can 1 billion individuals be distinguished by an identifier consisting of eight characters, while allowing a reasonable amount of error detection or even error correction? Our solution to this problem is an optimal code over a 32-character alphabet that detects up to two errors and corrects one error as well as a transposition of two adjacent characters. The corresponding encoding and error checking algorithms are available for free; they are also embedded as components of the pseudonymisation service that is used in the TMF, the German telematics platform for health research networks. © 2005 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Generating patient (or person) identifiers (PIDs) is a basic need in the processing of identity data, not only in medical research networks, but also in health care and in other research or administrative areas. This paper shows how to generate PIDs in such a way that they have optimal properties for error detection and error correction in the sense of mathematical Coding Theory [1,2], adapted to the various requirements and side conditions that are met in a typical research network, and may even be used as pseudonyms.

Our PID algorithm is of universal use; however, the special nature of the code limits its applicability to a population of at most $2^{30}$ individuals.

∗ Corresponding author. E-mail addresses: [email protected] (A. Faldum), [email protected] (K. Pommerening). URL: http://www.staff.uni-mainz.de/pommeren/ (K. Pommerening).

2. Background

2.1. Pseudonyms

In medical research networks, we often encounter the need to pseudonymise patient data to preserve confidentiality [3,4]. Technically speaking, pseudonyms are trapdoor one-way functions [5]; their strength depends on who owns the key to the trapdoor. Pseudonymisation only makes sense if there is a reliable identification procedure before the connection to the identity data is cut; the quality of the data should also be assured beforehand.

0169-2607/$ — see front matter. © 2005 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2005.03.004


2.2. Pseudonymisation in the TMF

The German telematics platform TMF is an organisation whose goal is to build and maintain a telematic infrastructure for medical research networks; it is supported by the Federal Ministry of Education and Research (BMBF). The TMF developed a data protection concept [6,7] that has a pseudonymisation service as an essential part, whose exact use and organisational embedding depend on the data protection policy of the individual network.

2.3. The TMF PID service

One technical component of the pseudonymisation service is a web service that receives identifying data and outputs a PID. The server stores the correspondence between identity data and PID. Depending on the exact use in the research network, the server:

• either receives encrypted identity data; then, the generated PID is a pseudonym in a strong sense, at least if the service is offered by a Trusted Third Party;
• or receives clear text identity data (of course over an encrypted channel); then, the PID serves only as a "non-speaking" identifier.

The second variant is the one preferred in most TMF networks, and is combined with a separate pseudonymisation step. Our algorithm can be used with both variants. The PID service itself consists of two algorithms, see Fig. 1:

• A matching algorithm. It is described elsewhere [8];
• The PID-generating algorithm. It is the subject of this paper.

Fig. 1 The PID service resides on a web server with an underlying database; a client calls the service via a web interface.

3. Requirements

To be used in a reasonably convenient way a PID should:

(1) consist only of letters and numbers, preferably case insensitive,
(2) contain one or more extra characters for error checking,
(3) be no longer than eight characters altogether.

Furthermore, in many application settings, it is desirable that a PID should

(4) not allow any inference of the corresponding identity data nor of time or order of its generation.

The first part of the last requirement is fulfilled by a simple counter; to fulfil the second part the counter is encrypted. This encryption step may seem exaggerated, but it causes practically no extra cost and is useful in certain settings. Moreover, it does not affect the encoding procedure given below, only the pre-processing of the data.

4. Ingredients for the construction of PIDs

To meet the requirements presented in Section 3, an alphabet consisting of 32 characters seems suitable: the digits 0, ..., 9 and the letters A–Z, omitting the letters B, I, O, S that could be mistaken for the digits 8, 1, 0, 5. Using six of these characters, we can represent $32^6 = 2^{30} \approx 10^9$ = 1 (American) billion distinct strings of length 6, enough for virtually any medical research network. This approach leaves room for two additional characters and suggests a code that detects two errors and corrects one error (it will be defined in Section 5), a so-called 32-[8,6,3]-MDS code [1, Chapter 11] ("Maximum Distance Separable") over the Galois field $\mathbb{F}_{32}$ of 32 elements; these elements are identified with the 32 characters of our alphabet (we let 0 correspond to the bit string 00000, ..., 9 to 01001, A to 01010, C to 01011, and so on lexicographically, finally Z to 11111). Here is a valid PID: 6AT93DCP

The validity will be shown in Appendix A. Fig. 2 gives an overview of the PID-generating algorithm.
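The character-to-bit-string correspondence described in this section can be sketched in a few lines of Python. This is only an illustration, not the authors' C source; the names `ALPHABET`, `char_to_bits`, `bits_to_char` and `pid_to_bits` are hypothetical helpers introduced here.

```python
# Alphabet as described in Section 4: digits 0-9, then A-Z without B, I, O, S.
# (Hypothetical helper names; the published C source may be organised differently.)
ALPHABET = "0123456789ACDEFGHJKLMNPQRTUVWXYZ"


def char_to_bits(ch):
    """Return the 5-bit string of one PID character (case insensitive)."""
    return format(ALPHABET.index(ch.upper()), "05b")


def bits_to_char(bits):
    """Return the PID character that corresponds to a 5-bit string."""
    return ALPHABET[int(bits, 2)]


def pid_to_bits(pid):
    """Translate a PID into its list of 5-bit strings."""
    return [char_to_bits(ch) for ch in pid]


if __name__ == "__main__":
    print(pid_to_bits("6AT93DCP"))
```

Applied to the PID 6AT93DCP, the helper reproduces the eight bit strings listed in Appendix A, which is a convenient cross-check of the lexicographic ordering.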


Fig. 2 The two steps of PID generation. The first step is a cryptographic procedure, see Section 4.2; the second step adds two characters, see Section 5.2.

4.1. Mathematical context

Our algorithms process bit strings of length 30, also interpreted as six-dimensional vectors over the field $\mathbb{F}_{32}$, and the entire mathematical procedure results in a bit string of length 40 that is finally translated into an eight-character string as above.

4.2. Cryptographic procedure

The specification of the cryptographic procedure is of little interest here, because the encryption conceals only the counter, which contains very little information. Note that the encryption algorithm cannot be really strong, because it works on strings of only 30 bits. This excludes standard algorithms such as AES [9] that would give 128-bit strings. Truncating the output to 30 bits would endanger the uniqueness and would (by the "birthday paradox" [10,5]) with probability 1/2 result in a "collision" already after approximately $2^{15} \approx 32{,}000$ cases (uniqueness could, however, be restored by a table lookup). We consider the encryption step as part of the pre-processing and refer to the source code [11] for the details, except to say that the key consists of three 30-bit integers $k_1, k_2, k_3$ that have to be specific for the individual project or network, should be kept secret, and must not be changed between the generation of the first and the last PID. This cryptographic pre-processing makes the (first six characters of the) PID pseudorandom and unique. A weakness of the cryptographic procedure is the permanence of the key over the entire lifetime of the research network. A key change would invalidate all formerly generated PIDs. We compensate for this to a small extent by introducing randomness at the cost of diminishing the PID space by a few bits, see Section 4.3.
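The collision estimate for truncated cipher output can be checked with the usual birthday approximation $1 - e^{-n(n-1)/(2N)}$. The sketch below assumes uniformly distributed 30-bit values; the function names are hypothetical. Under this approximation the collision probability at $2^{15}$ draws is roughly 0.39, and probability 1/2 is reached at about 38,600 draws, so the figure quoted in the text is an order-of-magnitude statement.

```python
import math


def collision_probability(n, space):
    """Birthday approximation: chance that n uniform draws from `space` values collide."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))


def draws_for_half(space):
    """Approximate number of draws at which the collision probability reaches 1/2."""
    return math.sqrt(2.0 * math.log(2.0) * space)


if __name__ == "__main__":
    space = 2 ** 30  # number of distinct 30-bit strings
    print(collision_probability(2 ** 15, space))
    print(draws_for_half(space))
```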

4.3. Randomisation

Our cryptographic algorithm allows a randomisation of its input, which is the counter, controlled by a parameter r, the "randomisation width". This is the number of (most significant) bits of the counter that are replaced by random bits before the counter is encrypted. Our implementation allows r between 0 and 12, where 0 means "no randomisation at all". This leaves k = 30 − r bits for the true input. Therefore, we can generate up to $2^k$ distinct PIDs; for the maximum value r = 12, this number is $2^{18} \approx 260{,}000$.
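A minimal sketch of the randomisation step, assuming the counter is held as a 30-bit integer whose top r bits are free. The function name `randomise_counter` and the use of Python's `secrets` module are assumptions made for illustration; the authors' implementation is in C and may draw its random bits differently.

```python
import secrets


def randomise_counter(counter, r):
    """Fill the r most significant bits of a 30-bit counter with random bits.

    Hypothetical helper: the low k = 30 - r bits carry the true counter value.
    """
    if not 0 <= r <= 12:
        raise ValueError("randomisation width r must be between 0 and 12")
    k = 30 - r
    if not 0 <= counter < (1 << k):
        raise ValueError("counter does not fit into 30 - r bits")
    if r == 0:
        return counter  # no randomisation at all
    return (secrets.randbits(r) << k) | counter


if __name__ == "__main__":
    print(format(randomise_counter(12345, 12), "030b"))
```

The low k bits are left untouched, so distinct counter values still yield distinct inputs to the encryption step.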

5. The code

The encoding algorithm works on the vector space $\mathbb{F}_{32}^6$ over the Galois field $\mathbb{F}_{32}$. In this section, we assume basic mathematical knowledge about polynomials and linear algebra over finite fields as given in the first four chapters of [1].

5.1. Arithmetic of the Galois field

The polynomial $f = T^5 + T^2 + 1 \in \mathbb{F}_2[T]$ in one indeterminate T over the two-element field $\mathbb{F}_2$ is primitive; hence, the residue class ring $\mathbb{F}_2[T]/f\,\mathbb{F}_2[T]$ is a field of $2^5 = 32$ elements and therefore, up to isomorphism, it is the field $\mathbb{F}_{32}$. If we represent its elements by vectors in $\mathbb{F}_2^5$, that is, by bit strings of length 5, then the addition in $\mathbb{F}_{32}$ is the addition of vectors over $\mathbb{F}_2$, that is, the XOR (exclusive or) of bit strings. If $t \in \mathbb{F}_{32}$ denotes the residue class of T, then $\mathbb{F}_{32}$ has the basis

$$t^4,\ t^3,\ t^2,\ t,\ 1 \tag{1}$$

over $\mathbb{F}_2$, corresponding to the canonical basis

$$e_1 = 10000,\quad e_2 = 01000,\quad e_3 = 00100,\quad e_4 = 00010,\quad e_5 = 00001 \tag{2}$$

of $\mathbb{F}_2^5$, in this order. The entire multiplication table derives from the single formula

$$t^5 = t^2 + 1. \tag{3}$$

For $x = (x_1, x_2, x_3, x_4, x_5) = x_1 t^4 + x_2 t^3 + x_3 t^2 + x_4 t + x_5$ in particular, the products with lower powers of t look as follows:

$$\begin{aligned}
t \cdot x &= (x_2,\ x_3,\ x_1 + x_4,\ x_5,\ x_1),\\
t^2 \cdot x &= (x_3,\ x_1 + x_4,\ x_2 + x_5,\ x_1,\ x_2),\\
t^3 \cdot x &= (x_1 + x_4,\ x_2 + x_5,\ x_1 + x_3,\ x_2,\ x_3),\\
t^4 \cdot x &= (x_2 + x_5,\ x_1 + x_3,\ x_1 + x_2 + x_4,\ x_3,\ x_1 + x_4).
\end{aligned} \tag{4}$$

These formulas easily yield highly efficient bit operations for an implementation; for example, we calculate the product $t \cdot x$ as a bitwise XOR of the three quantities:

$$u = x \mathbin{\texttt{>>}} 4 = (0, 0, 0, 0, x_1),\quad v = u \mathbin{\texttt{<<}} 2 = (0, 0, x_1, 0, 0),\quad w = (x \mathbin{\texttt{<<}} 1) \mathbin{\&} 31 = (x_2, x_3, x_4, x_5, 0) \tag{5}$$

(in the notation of the programming language C, where << or >> denote left or right shift, and '&' the Boolean AND). Note that the operation "&31" wipes out all but the last five bits. More generally, a highly efficient formula for $t^e \cdot x$, valid for e = 0, ..., 3, is given by the bitwise XOR of the three quantities:

$$u = x \mathbin{\texttt{>>}} (5 - e),\quad v = u \mathbin{\texttt{<<}} 2,\quad w = (x \mathbin{\texttt{<<}} e) \mathbin{\&} 31. \tag{6}$$

For $t^4 \cdot x$, the bitwise XOR of two further quantities has to be added:

$$u' = u \mathbin{\texttt{>>}} 3,\quad v' = u' \mathbin{\texttt{<<}} 2. \tag{7}$$
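The shift formulas (5) to (7) can be transcribed almost literally, since Python uses the same shift and AND operators as C. The sketch below is an illustration under that reading of the formulas; `mul_t_pow` and the reference routine `mul_t_ref` are hypothetical names. For e = 4 the quantity v can overflow into a sixth bit, so the result is masked with 31 once more at the end.

```python
def mul_t_pow(x, e):
    """Return t^e * x in GF(32) for e = 0..4, following formulas (5)-(7)."""
    if not 0 <= e <= 4:
        raise ValueError("the shift formulas cover e = 0..4 only")
    u = x >> (5 - e)
    v = u << 2
    w = (x << e) & 31
    result = u ^ v ^ w
    if e == 4:
        # two further quantities from formula (7)
        u2 = u >> 3
        v2 = u2 << 2
        result ^= u2 ^ v2
    return result & 31  # v may carry a sixth bit when e = 4


def mul_t_ref(x, e):
    """Reference: multiply by t one step at a time, reducing with t^5 = t^2 + 1."""
    for _ in range(e):
        x <<= 1
        if x & 0b100000:
            x ^= 0b100101  # subtract (XOR) the polynomial T^5 + T^2 + 1
    return x


if __name__ == "__main__":
    agree = all(mul_t_pow(x, e) == mul_t_ref(x, e)
                for x in range(32) for e in range(5))
    print("shift formulas agree with step-by-step reduction:", agree)
```

Comparing against the step-by-step reduction over all 32 field elements is a cheap way to confirm that the reconstructed shift amounts are consistent with formula (4).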

5.2. Encoding

The encoding step consists of the linear map $\varphi : \mathbb{F}_{32}^6 \to \mathbb{F}_{32}^8$ given by the matrix

$$G = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & t & t^2 \\
0 & 1 & 0 & 0 & 0 & 0 & t^2 & t^4 \\
0 & 0 & 1 & 0 & 0 & 0 & t^3 & t^6 \\
0 & 0 & 0 & 1 & 0 & 0 & t^4 & t^8 \\
0 & 0 & 0 & 0 & 1 & 0 & t^5 & t^{10} \\
0 & 0 & 0 & 0 & 0 & 1 & t^6 & t^{12}
\end{pmatrix} \tag{8}$$

transferring $z = (z_1, z_2, z_3, z_4, z_5, z_6) \in \mathbb{F}_{32}^6$ into

$$\varphi(z) = zG = c = (c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8). \tag{9}$$

Then, $c_i = z_i$ for i = 1, ..., 6, and

$$\begin{aligned}
c_7 &= t z_1 + t^2 z_2 + t^3 z_3 + t^4 z_4 + t^5 z_5 + t^6 z_6,\\
c_8 &= t^2 z_1 + t^4 z_2 + t^6 z_3 + t^8 z_4 + t^{10} z_5 + t^{12} z_6.
\end{aligned} \tag{10}$$
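Formula (10) translates into a short encoder. The following Python sketch is an illustration rather than the authors' C implementation; `encode` and `mul_t` are hypothetical names, and the multiplication by powers of t uses plain step-by-step reduction instead of the optimised shift formulas.

```python
ALPHABET = "0123456789ACDEFGHJKLMNPQRTUVWXYZ"  # 32 characters, see Section 4


def mul_t(x, e):
    """Multiply x in GF(32) by t^e, reducing with t^5 = t^2 + 1."""
    for _ in range(e):
        x <<= 1
        if x & 0b100000:
            x ^= 0b100101
    return x


def encode(data):
    """Append the two check characters c7, c8 of formula (10) to six characters."""
    data = data.upper()
    if len(data) != 6:
        raise ValueError("expected six data characters")
    z = [ALPHABET.index(ch) for ch in data]
    c7 = c8 = 0
    for i, zi in enumerate(z, start=1):
        c7 ^= mul_t(zi, i)       # t^i * z_i
        c8 ^= mul_t(zi, 2 * i)   # t^(2i) * z_i
    return data + ALPHABET[c7] + ALPHABET[c8]


if __name__ == "__main__":
    print(encode("6AT93D"))
```

With the six data characters 6AT93D the sketch yields the check characters C and P, matching the manual computation in Appendix A.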

5.3. Properties of the code

In the language of Coding Theory, the image space $C := \varphi(\mathbb{F}_{32}^6)$ is a code of length 8 and dimension 6; the elements of C will be denoted as code words. In this section, we show that this code is a Maximum Distance Separable (MDS) code that detects up to two errors and corrects one error. This is the optimum for error correction and detection that can be reached with a redundancy of two characters, see items 1–5 below. Moreover, this code corrects the transposition of two adjacent characters (a common mistake with manual data entry). This will be proved in Section 5.4.

(1) Minimum distance of C: The minimal number of places at which two distinct code words differ is 3. (This number is called the minimum distance of C.)

Proof. Let $c, c^* \in C$ with $c \neq c^*$. Since C is a linear space, $c - c^* \in C$, and $c - c^* \neq 0$. Therefore, $c - c^*$ is a nonzero linear combination of the rows of G. Each row itself has exactly 3 coordinates different from 0. A combination of 2 rows i and j has exactly 2 coordinates different from 0 within the first 6 places. The last 2 coordinates of such a combination cannot vanish simultaneously, because $a t^i + b t^j = 0 = a t^{2i} + b t^{2j}$ with $a, b \neq 0$ implies $t^i = t^j$, which is not true for $i \neq j$. A combination of 3 or more rows results in 3 or more coordinates different from 0 within the first 6 places. In any case the number of coordinates different from 0 exceeds 2, and therefore c and $c^*$ differ in at least 3 places.

(2) Capacity for correction: A distorted code word with one error can be corrected automatically.

Proof. This is a direct consequence of property 1 [2, page 8]. Decoding a distorted code word $c'$ to the code word $\hat{c}$ that differs in the fewest places from $c'$ will lead to the right result if only one error happened.

(3) Capacity for detection: A distorted code word with two errors will be detected.

Proof. Once again, property 1 guarantees this assertion. Since two distinct code words differ in at least 3 places, two errors cannot change one code word into another. Therefore, two errors can be detected by checking whether the questionable word is a valid code word.

(4) Quality of the code: With a redundancy of two characters, C is an optimal code for error correction and detection.

Proof. This assertion results from the Singleton bound [12,2], which restricts the minimum distance d by the redundancy of a code: $d \le n - k + 1$, where n denotes the length and k the dimension. Therefore, in our case (n = 8, k = 6), the minimum distance cannot be larger than 3. The same arguments as stated above show that the error correcting and detecting capacity is fixed by the minimum distance and cannot be enlarged beyond 1 and 2, respectively.


(5) Parity check matrix of C: Let $c \in \mathbb{F}_{32}^8$. Then $c \in C$ if and only if $Hc^t = 0$, with H denoting the parity check matrix of C,

$$H = \begin{pmatrix}
t & t^2 & t^3 & t^4 & t^5 & t^6 & 1 & 0 \\
t^2 & t^4 & t^6 & t^8 & t^{10} & t^{12} & 0 & 1
\end{pmatrix} \tag{11}$$

and $c^t$ the transposed vector of c.

Proof. Since $Hc^t = 0$ for all rows c of G, $C \le \{c \in \mathbb{F}_{32}^8 \mid Hc^t = 0\}$. The proof is completed by the observation that $\dim\{c \in \mathbb{F}_{32}^8 \mid Hc^t = 0\} = 6 = \dim C$.
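Properties (1) and (5) lend themselves to a brute-force check. The sketch below, with hypothetical helper names, computes $Hc^t$ for a code word and searches all messages with one or two nonzero coordinates for the lightest resulting code word; messages with three or more nonzero coordinates already have weight at least 3 in the first six places, so they need not be enumerated.

```python
from itertools import combinations, product


def mul_t(x, e):
    """Multiply x in GF(32) by t^e, reducing with t^5 = t^2 + 1."""
    for _ in range(e):
        x <<= 1
        if x & 0b100000:
            x ^= 0b100101
    return x


def encode_vec(z):
    """Map six field elements (integers 0..31) to the code word zG."""
    c7 = c8 = 0
    for i, zi in enumerate(z, start=1):
        c7 ^= mul_t(zi, i)
        c8 ^= mul_t(zi, 2 * i)
    return list(z) + [c7, c8]


def syndrome(c):
    """Return (a, b) = H c^t for a vector c of eight field elements."""
    a, b = c[6], c[7]
    for i in range(1, 7):
        a ^= mul_t(c[i - 1], i)
        b ^= mul_t(c[i - 1], 2 * i)
    return a, b


def min_weight_of_light_messages():
    """Smallest code word weight over all messages with 1 or 2 nonzero places."""
    best = 8
    for k in (1, 2):
        for places in combinations(range(6), k):
            for values in product(range(1, 32), repeat=k):
                z = [0] * 6
                for place, value in zip(places, values):
                    z[place] = value
                best = min(best, sum(1 for x in encode_vec(z) if x))
    return best


if __name__ == "__main__":
    print(syndrome(encode_vec([6, 10, 25, 9, 3, 12])))
    print(min_weight_of_light_messages())
```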

5.4. Checking for errors

Assume a valid code word $c = (c_1, c_2, \ldots, c_8) \in C$ is entered, but there could be errors in "transmission" (or in data entry), so a vector $c' = (c'_1, c'_2, \ldots, c'_8) \in \mathbb{F}_{32}^8$ is received, and the receiver cannot be sure whether this $c'$ is the original code word. He then asks: what is the "most likely" value of c? (We do not discuss probabilistic models of errors here, but see Section 5.5.)

Decoding algorithm (with correction): To answer this question we form the sums

$$\begin{aligned}
a &= c'_1 t + c'_2 t^2 + \cdots + c'_6 t^6 + c'_7,\\
b &= c'_1 t^2 + c'_2 t^4 + \cdots + c'_6 t^{12} + c'_8;
\end{aligned} \tag{12}$$

these are the appropriate test values, and obviously $(a, b)^t = Hc'^t$. The decoding algorithm yields a valid code word $\hat{c}$ in the following seven cases:

(1) If $a = 0 = b$, then $c' \in C$ is a valid code word and will not be changed. Thus, $\hat{c} := c'$.
(2) If $a \neq 0 = b$, replace $c'_7$ by $c'_7 + a$. Thus, $\hat{c} := (c'_1, \ldots, c'_6, c'_7 + a, c'_8)$.
(3) If $a = 0 \neq b$, replace $c'_8$ by $c'_8 + b$. Thus, $\hat{c} := (c'_1, \ldots, c'_7, c'_8 + b)$.
(4) If $a \neq 0$ and $b/a = t^i$ with some i = 1, ..., 6, replace $c'_i$ by $c'_i + a/t^i$. Thus, $\hat{c} := (c'_1, \ldots, c'_i + a/t^i, \ldots, c'_8)$.
(5) If $0 \neq a = (c'_i + c'_{i+1}) \cdot t^{18+i}$ and $b/a = t^{18+i}$ with some i = 1, ..., 5, transpose the coordinates at places i and i + 1. Thus, $\hat{c} := (c'_1, \ldots, c'_{i+1}, c'_i, \ldots, c'_8)$.
(6) If $0 \neq a = (c'_6 + c'_7)(t^6 + 1)$ and $b/a = t^{16}$, transpose the coordinates at places 6 and 7. Thus, $\hat{c} := (c'_1, \ldots, c'_5, c'_7, c'_6, c'_8)$.
(7) If $0 \neq a = b = c'_7 + c'_8$, transpose the coordinates at places 7 and 8. Thus, $\hat{c} := (c'_1, \ldots, c'_6, c'_8, c'_7)$.

In all other cases, $c'$ is not a valid code word but will not be corrected (the error is detected but there is no sensible way of correcting it). The proof that $\hat{c} \in C$ will be given in Appendix B.1. This is our main technical result:

Theorem 1. Using the decoding algorithm stated above, one error is corrected properly, two errors are still detected, and a transposition of adjacent distinct characters is reversed.

The proof is in Appendix B.2.
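The seven decoding rules can be sketched in Python using discrete-logarithm tables for $\mathbb{F}_{32}$, which make the quotient $b/a$ and its comparison with a power of t straightforward. All names below (`EXP`, `LOG`, `mul`, `tpow`, `syndrome`, `decode`) are hypothetical, and returning `None` for an uncorrectable word is a design choice of this sketch, not something the paper prescribes.

```python
ALPHABET = "0123456789ACDEFGHJKLMNPQRTUVWXYZ"

# Powers of t in GF(32) with t^5 = t^2 + 1: EXP[e] = t^e for e = 0..30.
EXP = [1]
for _ in range(30):
    nxt = EXP[-1] << 1
    if nxt & 0b100000:
        nxt ^= 0b100101
    EXP.append(nxt)
LOG = {value: e for e, value in enumerate(EXP)}


def tpow(e):
    """Return t^e (the exponent is taken modulo 31)."""
    return EXP[e % 31]


def mul(x, y):
    """Multiply two elements of GF(32)."""
    if x == 0 or y == 0:
        return 0
    return EXP[(LOG[x] + LOG[y]) % 31]


def syndrome(c):
    """Test values (a, b) of formula (12)."""
    a, b = c[6], c[7]
    for i in range(1, 7):
        a ^= mul(tpow(i), c[i - 1])
        b ^= mul(tpow(2 * i), c[i - 1])
    return a, b


def decode(word):
    """Apply rules (1)-(7); return the corrected PID or None if uncorrectable."""
    c = [ALPHABET.index(ch) for ch in word.upper()]
    a, b = syndrome(c)
    if a == 0 and b == 0:                                   # rule 1
        pass
    elif b == 0:                                            # rule 2
        c[6] ^= a
    elif a == 0:                                            # rule 3
        c[7] ^= b
    else:
        r = (LOG[b] - LOG[a]) % 31                          # b / a = t^r
        if 1 <= r <= 6:                                     # rule 4, i = r
            c[r - 1] ^= mul(a, tpow(-r))
        elif 19 <= r <= 23 and a == mul(c[r - 19] ^ c[r - 18], tpow(r)):
            c[r - 19], c[r - 18] = c[r - 18], c[r - 19]     # rule 5, i = r - 18
        elif r == 16 and a == mul(c[5] ^ c[6], tpow(6) ^ 1):
            c[5], c[6] = c[6], c[5]                         # rule 6
        elif r == 0 and a == c[6] ^ c[7]:
            c[6], c[7] = c[7], c[6]                         # rule 7
        else:
            return None                                     # detected, not corrected
    return "".join(ALPHABET[x] for x in c)


if __name__ == "__main__":
    print(decode("6AX93DCP"))  # one wrong character at place 3
    print(decode("A6T93DCP"))  # places 1 and 2 transposed
```

Note that a word with two errors may lie at distance one from a different code word, in which case rule 4 produces a wrong correction; Theorem 1 only asserts that such a word is recognised as invalid, which is why Section 5.5 weighs detection against automatic correction.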

5.5. Trustworthiness of the code

Even an optimal code cannot guarantee that the results, PIDs in our case, are error free, in particular when we consider manual data entry, where the error rate usually is high. Then, an automatic error correction is impossible or may too often lead to wrong results. Therefore, we have to explore the trustworthiness of the procedure. Moreover, we may ask whether we should prefer pure error detection over automated correction. To this end, we have to determine the probability of an undetected wrong PID or an erroneous correction. If we assume that PIDs and data errors are uniformly distributed, we can calculate these probabilities as functions of the error rate in data entry or transmission. The uniform distribution of PIDs (in the space C of code words) is justified by the pseudorandom generation method. The distribution of the data errors, however, heavily depends on the data source, e.g. on the manual entry method. To get a rough estimate of the trustworthiness, we assume for simplicity a uniform distribution here as well: the probability of changing one character of the PID into another character at a certain position depends neither on the original and the false character nor on the position, and more complicated errors such as omissions or transpositions are not taken into account. For this simplified model, an approximate solution for the probability of an undetected wrong code word or an incorrect decoding is given in the following theorem, which will be proved in a forthcoming paper [13].

Theorem 2. Let C be a linear code over $\mathbb{F}_q$ of length n and minimum distance $d \ge 3$. Furthermore, let $A_d$ denote the number of code words that differ from 0 in d coordinates. Assume that the probability for every code word equals $1/|C|$. Suppose that C is used for error correction on a q-ary symmetric channel with symbol error probability p. In case we trust a decoded word if and only if the distance between the received and the decoded word does not exceed a chosen integer $\tau \le \frac{d-1}{2}$, the probability $P_\tau$ of an incorrect decoding is characterized by:

$$\lim_{p \to 0} \frac{P_\tau}{(p/(1-p))^{d-\tau}} = A_d \binom{d}{\tau} (q-1)^{-d+\tau}. \tag{13}$$

Therefore,

$$P_\tau \approx A_d \binom{d}{\tau} \left( \frac{p}{(1-p)(q-1)} \right)^{d-\tau} \tag{14}$$

if p is small enough.

In our situation, C is a linear code over $\mathbb{F}_{32}$ with minimum distance d = 3. Since C is an MDS code, the number of code words with 3 coordinates different from 0 is $A_3 = 1736$ [1, Chapter 11, §3]. Now let p be the error probability for a single character. Then, the probability of an undetected wrong PID ($\tau = 0$ in Theorem 2) is approximately $0.06p^3$, and the probability of an erroneous automatic correction ($\tau = 1$ in Theorem 2) can be assessed by $5.42p^2$, if p is not too large. Therefore, a pure error indication gives more trustworthy results. Automatic correction can only be recommended if the error rate is low. Because the error probability p for a single character can be estimated empirically from the proportion of detected wrong PIDs $q \approx 8p$ (here q denotes this proportion, not the field size), the expected rate of erroneous automatic corrections is given by $5.42p^2 \approx 0.085q^2$. This observation suggests an adaptive procedure for the decision whether errors should only be indicated or automatically corrected, if a limit for the rate of erroneous corrections is fixed in advance. Which limits for the rate of erroneous corrections are acceptable depends on the application environment.

Example 3. Assume an error rate of p = 0.003; then, the automatic correction has an error rate of $5.42p^2 = 4.9 \times 10^{-5}$, which is reasonably low for statistical analyses but unacceptable when the PIDs are used in a patient treatment context. The probability of an undetected wrong PID is $0.06p^3 = 1.6 \times 10^{-9}$, sufficiently small for all applications.
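The coefficients 0.06 and 5.42 and the figures of Example 3 follow from the approximation in Theorem 2 with n = 8, d = 3 and a 32-character alphabet. The sketch below assumes that reading of the theorem, with a trust radius of 0 for pure detection and 1 for automatic correction; the function names are hypothetical. The count $A_3 = \binom{8}{3} \cdot 31 = 1736$ uses the standard formula for the number of minimum-weight words of an MDS code.

```python
from math import comb


def mds_min_weight_count(n, d, q):
    """Number of weight-d code words of an MDS code: C(n, d) * (q - 1)."""
    return comb(n, d) * (q - 1)


def p_incorrect(a_d, d, q, p, radius):
    """Approximate probability of an incorrect decoding for a given trust radius."""
    return a_d * comb(d, radius) * (p / ((1 - p) * (q - 1))) ** (d - radius)


if __name__ == "__main__":
    a3 = mds_min_weight_count(8, 3, 32)
    print(a3)
    print(a3 / 31 ** 3, 3 * a3 / 31 ** 2)      # coefficients of p^3 and p^2
    print(p_incorrect(a3, 3, 32, 0.003, 1))    # erroneous automatic correction
    print(p_incorrect(a3, 3, 32, 0.003, 0))    # undetected wrong PID
```

Keeping the factor $(1 - p)$ changes the results only slightly for small p, which is why the text quotes the simpler forms $5.42p^2$ and $0.06p^3$.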

5.6. Comparison with double data entry

Compared with our method, the standard procedure of double data entry has a redundancy of six characters instead of two. Double data entry for PIDs can be regarded as a code of length 12 and minimum distance 2 over $\mathbb{F}_{32}$. The number of doubly entered correct PIDs differing in 2 places from 0 equals the number of words with length 6 and 1 nonzero coordinate, $6 \times 31 = 186$. Therefore, the probability of an undetected wrong doubly entered PID is approximately $0.19p^2$ according to Theorem 2. This is much higher than the comparable probability of $0.06p^3$ with our code. Automatic correction does not make sense with double data entry. Consequently, our code gives significantly better results even though the number of redundant characters is reduced from 6 to 2. Therefore, our code should be preferred to double data entry.
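As a worked check of the coefficient 0.19, the approximation of Theorem 2 can be applied to the double-entry code with d = 2, 186 code words of weight 2, a 32-character alphabet and pure detection (trust radius 0), dropping the factor $(1-p)$ for small p:

```latex
P \;\approx\; 186 \cdot \binom{2}{0} \cdot \left(\frac{p}{32 - 1}\right)^{2}
  \;=\; \frac{186}{961}\, p^{2}
  \;\approx\; 0.19\, p^{2}
```

For p = 0.003 this gives about $1.7 \times 10^{-6}$, roughly a thousand times the value $1.6 \times 10^{-9}$ obtained for the MDS code in Example 3.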

6. Resume and discussion

We propose a method to generate and check patient identifiers (PIDs) that only uses simple and efficient binary operations. Our method guarantees fast data processing without additional delays and gives pseudonymous and unique identifiers. The PIDs generated by this method have excellent error detecting or correcting properties. In a forthcoming paper, we will prove in general that the MDS code C given here has approximately the best trustworthiness for error detection and correction; i.e., when using an alphabet with 32 characters and a code of length 8 with redundancy 2, the code C results in the lowest probability of an undetected wrong PID or an erroneous correction, if the error probability for a single character is low [13].

The source code (in C) for the PID generating and PID checking procedures is in the public domain and available from the authors' web page [11]. The entire PID generator (see Section 2) works as a web service or as a line mode program, also in batch mode. It is available from the TMF under negotiable conditions. The web site of the TMF is at www.tmf-ev.de.

Acknowledgements

This paper was partially supported by the "Bundesministerium für Bildung und Forschung" as a subproject of the "Kompetenznetz Pädiatrische Onkologie und Hämatologie" as well as the "Telematikplattform für die Medizinischen Forschungsnetze" (TMF).


Appendix A. Verifying a PID manually

The PID given in Section 4 was 6AT93DCP and is represented by the vector

$$\begin{aligned}
c_1 &= 00110, & c_2 &= 01010, & c_3 &= 11001, & c_4 &= 01001,\\
c_5 &= 00011, & c_6 &= 01100, & c_7 &= 01011, & c_8 &= 10110,
\end{aligned} \tag{A.1}$$

according to the translation map in Section 4. For the first checksum we calculate, using the formulas in Section 5.1,

$$\begin{aligned}
t c_1 &= 01100, & t^2 c_2 &= 01101, & t^3 c_3 &= 10110,\\
t^4 c_4 &= 00100, & t^5 c_5 &= 01111, & t^6 c_6 &= 10111,
\end{aligned} \tag{A.2}$$

and the sum of these vectors, 01011, is $c_7$. For the second checksum we calculate

$$\begin{aligned}
t^2 c_1 &= 11000, & t^4 c_2 &= 10001, & t^6 c_3 &= 00001,\\
t^8 c_4 &= 01010, & t^{10} c_5 &= 10110, & t^{12} c_6 &= 00010,
\end{aligned} \tag{A.3}$$

and the sum of these vectors, 10110, is $c_8$. Therefore, 6AT93DCP is a valid codeword as stated in Section 5.4.

Appendix B. Mathematical proofs

B.1. Correctness of the decoding algorithm

In Section 5.4, we have to prove that $\hat{c} \in C$. According to item 5 in Section 5.3, we have to show $H\hat{c}^t = 0$. Notice that in the field $\mathbb{F}_{32}$, we have $2a = 0$ and $(a + b)^2 = a^2 + b^2$ for all elements $a, b \in \mathbb{F}_{32}$.

(1) $H\hat{c}^t = Hc'^t = 0$ by definition of a and b.
(2) $H\hat{c}^t = Hc'^t + H(0, \ldots, 0, a, 0)^t = (a, b)^t + (a, 0)^t = 0$.
(3) $H\hat{c}^t = Hc'^t + H(0, \ldots, 0, b)^t = (a, b)^t + (0, b)^t = 0$.
(4) $H\hat{c}^t = Hc'^t + H(0, \ldots, a/t^i, \ldots, 0)^t = (a, b)^t + (a, a t^i)^t = 0$.
(5) $H\hat{c}^t = Hc'^t + H(0, \ldots, c'_i + c'_{i+1}, c'_i + c'_{i+1}, \ldots, 0)^t = (a, b)^t + ((c'_i + c'_{i+1})(t^i + t^{i+1}), (c'_i + c'_{i+1})(t^{2i} + t^{2i+2}))^t = 0$, since $t^i + t^{i+1} = t^{18+i}$.
(6) $H\hat{c}^t = Hc'^t + H(0, \ldots, 0, c'_6 + c'_7, c'_6 + c'_7, 0)^t = (a, b)^t + ((c'_6 + c'_7)(t^6 + 1), (c'_6 + c'_7)t^{12})^t = 0$, since $(t^6 + 1)t^{16} = t^{12}$.
(7) $H\hat{c}^t = Hc'^t + H(0, \ldots, 0, c'_7 + c'_8, c'_7 + c'_8)^t = (a, b)^t + (c'_7 + c'_8, c'_7 + c'_8)^t = 0$.

B.2. Properties of the decoding algorithm: Proof of Theorem 1

Let $c = (c_1, \ldots, c_8)$ be the transmitted code word, $c' = (c'_1, \ldots, c'_8)$ the received vector, and $\hat{c} = (\hat{c}_1, \ldots, \hat{c}_8)$ the decoded code word in the cases 1–7 above. The difference $e = (e_1, \ldots, e_8) = c' - c$ is the vector of "transmission errors", and the test values are:

$$(a, b)^t = Hc'^t = H(c + e)^t = Hc^t + He^t = He^t = (e_1 t + e_2 t^2 + \cdots + e_6 t^6 + e_7,\ e_1 t^2 + e_2 t^4 + \cdots + e_6 t^{12} + e_8)^t. \tag{B.1}$$

We distinguish three cases:

(1) Assume one single error occurred at place i. Then, $e = (0, \ldots, e_i, \ldots, 0)$ and

$$(a, b) = \begin{cases}
(e_i t^i,\ e_i t^{2i}), & \text{if } i = 1, \ldots, 6,\\
(e_7,\ 0), & \text{if } i = 7,\\
(0,\ e_8), & \text{if } i = 8.
\end{cases} \tag{B.2}$$

Applying the decoding algorithm yields

$$\hat{c} = \begin{cases}
c' + (0, \ldots, a/t^i, \ldots, 0), & \text{if } i = 1, \ldots, 6, \text{ according to rule 4},\\
c' + (0, \ldots, 0, a, 0), & \text{if } i = 7, \text{ according to rule 2},\\
c' + (0, \ldots, 0, b), & \text{if } i = 8, \text{ according to rule 3}.
\end{cases} \tag{B.3}$$

In any case, $\hat{c} = c' + e = c' - e = c$, and the transmission error is properly corrected.

(2) Assume two errors occurred. As stated in item 3 of Section 5.3, two distinct code words differ in at least three coordinates and, consequently, two errors cannot change one code word into another. Therefore, $Hc'^t \neq (0, 0)^t$, and the incorrect word $c'$ is detected.

(3) Assume two distinct characters at places i and i + 1 have been transposed. Then,

$$e = (0, \ldots, e_i, e_{i+1}, \ldots, 0) = (0, \ldots, c_i + c_{i+1}, c_i + c_{i+1}, \ldots, 0), \tag{B.4}$$

and

$$(a, b) = \begin{cases}
((c_i + c_{i+1})(t^i + t^{i+1}),\ (c_i + c_{i+1})(t^{2i} + t^{2i+2})), & \text{if } i \le 5,\\
((c_6 + c_7)(t^6 + 1),\ (c_6 + c_7)t^{12}), & \text{if } i = 6,\\
(c_7 + c_8,\ c_7 + c_8), & \text{if } i = 7.
\end{cases} \tag{B.5}$$

Since $(t^{2i} + t^{2i+2})/(t^i + t^{i+1}) = t^i + t^{i+1} = t^{18+i}$ for i = 1, ..., 5, and $t^{12}/(t^6 + 1) = t^{16}$, and $c_i + c_{i+1} = c'_i + c'_{i+1}$, rules 5, 6 and 7 of the decoding algorithm apply, resulting in $\hat{c} = c$, and the transposition of the adjacent characters is reversed.

References

[1] F.J. MacWilliams, N.J.A. Sloane, The Theory of Error-Correcting Codes, North-Holland, Amsterdam, 1977.
[2] W. Willems, Codierungstheorie, De Gruyter, Berlin, 1999.
[3] K. Pommerening, Pseudonyme – ein Kompromiß zwischen Anonymisierung und Personenbezug, in: H.J. Trampisch, S. Lange (Eds.), Medizinische Forschung – Ärztliches Handeln, vol. 40, Jahrestagung der GMDS, MMV Medizin Verlag, München, 1995, pp. 329–333.
[4] K. Pommerening, M. Miller, I. Schmidtmann, J. Michaelis, Pseudonyms for cancer registry, Meth. Inf. Med. 35 (1996) 112–121; K. Pommerening, M. Miller, I. Schmidtmann, J. Michaelis, Pseudonyms for cancer registry, Yearbook of Medical Informatics (1997) 338–347.
[5] A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography, CRC Press, Boca Raton, 1997.
[6] C.-M. Reng, P. Debold, C. Specker, K. Pommerening, Generische Lösungen der TMF zum Datenschutz für die Forschungsnetze der Medizin, in press.
[7] K. Pommerening, C.-M. Reng, Secondary use of the electronic health record via pseudonymisation, in: L. Bos, S. Laxminarayan, A. Marsh (Eds.), Medical Care Compunetics, vol. 1, IOS Press, Amsterdam, 2004, pp. 441–446.
[8] M. Wagner, K. Pommerening, A formal language for the specification of matching algorithms as a general framework for pseudonymization, Informatik, Biometrie und Epidemiologie in Medizin und Biologie 34 (2003) 531–533.
[9] J. Daemen, V. Rijmen, The Design of Rijndael, AES: The Advanced Encryption Standard, Springer, Berlin, 2002.
[10] W. Feller, An Introduction to Probability Theory and its Applications, vol. I, Wiley, NY, 1957.
[11] Web page: http://www.staff.uni-mainz.de/pommeren/PID/.
[12] R.C. Singleton, Maximum distance q-nary codes, IEEE Trans. Inf. Theory 10 (1964) 116–118.
[13] A. Faldum, Trustworthiness of Error-Correcting Codes, submitted for publication.
