Implementation of a query interface for a generic record server


International Journal of Medical Informatics 77 (2008) 754–764


journal homepage: www.intl.elsevierhealth.com/journals/ijmi

Tony Austin ∗, Dipak Kalra, Archana Tapuria, Nathan Lea, David Ingram

Centre for Health Informatics and Multiprofessional Education (CHIME), UCL, UK

Article info

Article history:
Received 20 September 2007
Received in revised form 2 May 2008
Accepted 4 May 2008

Keywords:
Computerised patient medical records [MeSH ID D016347]
Query tools

Abstract

Introduction: This paper presents work to define a representation for clinical research queries that can be used for the design of generic interfaces to electronic healthcare record (EHR) systems. Given the increasing prevalence of EHR systems, with the potential to accumulate life-long health records, opportunities exist to analyse and mine these for new knowledge. This potential is presently limited by many factors, one of which is the challenge of extracting information from them in order to execute a research query.

Method: There is limited pre-existing work on the generic specification of clinical queries. Sets of example queries were obtained from published studies and clinician reference groups. These were re-represented as structured logical expressions, from which a generalisable pattern (information model) was inferred. An iterative design and implementation approach was then pursued to refine the model and evaluate it.

Results: This paper presents a set of requirements for the generic representation of clinical research queries, and an information model to represent any arbitrary such query. A middleware component was implemented as an interface to an existing system that holds 20,000 anonymised cancer EHRs in order to validate the model. This component was interfaced in turn to a query design and results presentation tool developed by the Open University, to permit end user demonstrations and feedback as part of the evaluation.

Conclusion: Although it is difficult to separate cleanly the evaluation of a theoretical model from its implementation, the empirical evaluation of the query-execution interface revealed that clinical queries of the kinds studied could all be represented and executed successfully. However, performance was a problem and this paper outlines some of the challenges faced in building generic components to handle specialised data structures on a large scale. The limitations of this work are also discussed. The work complements many years of European research and standardisation on the interoperable communication of electronic health records, by proposing a way in which one or more EHR systems might be queried in a standardised way.

© 2008 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

1.1. Research objective

The objective of this research has been to develop and evaluate a generic information model that could be implemented as a query/result interface to a generic and standards-based electronic health record (EHR) system, in order to provide a generalised means of supporting diverse user-driven clinical and bio-science research analyses of EHR data.

∗ Corresponding author at: 4th Floor, Holborn Union Building, Highgate Hill, London N19 5LW, UK. Tel.: +44 20 7288 3372. E-mail address: [email protected] (T. Austin).
1386-5056/$ – see front matter © 2008 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.ijmedinf.2008.05.003

1.2. Background/rationale

Given the increasing adoption of EHR systems internationally, some of which are or will be interconnected via regional or national health care networks, there are unprecedented opportunities to knowledge-mine routinely collected EHR data across significant populations for clinical and bio-science research, public health and epidemiology. This potential will be greater as forthcoming interoperability standards become adopted. Over 15 years of cumulative research has been undertaken to identify the requirements and information architectures needed to support shared longitudinal electronic health records that are clinically and ethically sound [1–6]. This work has so far resulted in two generations of European (CEN) standard [7,8], and a new joint European (CEN) and international (ISO) standard for EHR Communication, parts of which are progressively being published as European Standards during 2007–2008 [9]. These research projects and standards have focussed on supporting the care given to patients by promoting good designs for EHR systems, and standards for the secure communication of part or all of a patient’s EHR between authorised systems. However, these EHR-related standards have centred on the communication of parts of the EHR of an individual subject of care, and little work has been done to date on defining a generic means of querying EHR systems, as distributed repositories, in a consistent way. Without such a means the only methods of querying large populations of EHRs are to centralise the data into a new single data warehouse or to transform each required query into multiple formats as supported by each EHR system. It is recognised that not all EHR systems deployed today would be capable of supporting a standards-conformant interface, but there is increasing pressure for them to do so and for many that do, it would be an attractive option to be able to support population queries by extending the same interface. 
The research reported in this paper to investigate a generic query interface was conducted as a work plan item within the Clinical E-Science Framework project (CLEF).

1.3. The CLEF project

The Clinical E-Science Framework (CLEF [10]) project, and its follow-on project CLEF-Services, together span five and a half years of research (2002–2007) funded by the UK Medical Research Council as part of the UK national e-Science programme. CLEF aims to develop rigorous generic methods for integrating and analysing clinical information that is captured as part of routine patient care, and to develop tools to facilitate the analysis of EHR data by the clinical and bio-science research communities. The key goal of CLEF is to provide medically rich information derived from operational health records for the purposes of scientific study and research, in an efficient manner, whilst ensuring patient confidentiality by developing appropriate safeguards and removing or protecting potentially identifiable information. A separate thread of CLEF research is therefore developing a methodology for deriving large numbers of longitudinal pseudonymised health records, within a managed and monitored authentication and authorisation framework that limits access only to legitimate users [11]. Another thread of research, pertinent to this paper, has been the development of a client tool at the Open University to enable end users to compose clinical research queries in a user-friendly natural language format, and also to present the results in a variety of presentation styles as narrative text, tables and charts [12].

In order to have access to a test-bed of empirical data and systems on which to develop the generic query specification, a pseudonymised EHR system was established at University College London (UCL) during 2004–2005. A sample population of electronic health records (around 20,000 deceased-patient records) was extracted, with ethical approval, from the main computer system at the Royal Marsden Hospital (UK) and subjected to a combination of computerised and manual de-identification on site before being sent via a secure communication to UCL. The generic query interface model reported in this paper was implemented as an interface to this repository, and a test-suite of queries was designed and executed on it using the Open University client. The final combination of repository and tools, and others developed within the project, is presently being consolidated into a CLEF Workbench, to be published later.

2. Method

The research method adopted was to obtain a diverse set of examples of clinical research questions that could be represented as queries executed on a clinical data repository, in order to infer some generalisable patterns. These were reflected in a set of requirements and an interface information model. The interface was implemented as a Java middleware component communicating with an existing EHR server, also written in Java, at University College London. Using a client component developed by the Open University, a suite of test queries was composed and executed in order to validate the implementation and the utility of the design approach, and to provide a visualisation from which to gain end user feedback. A number of queries were examined in more detail, one of which is described in the next section.

2.1. Investigations

Several complementary approaches were pursued, for which one clinical domain (cancer) was selected as the principal focus in order to make the investigation manageable. Cancer was selected because of particular opportunities for implementation and evaluation using data from the Royal Marsden Hospital (described in Section 1.3).

1. A review of published clinical studies was conducted in order to find papers that investigated the research questions in a formal and rigorous way.
2. A further literature search was performed (beyond the domain of cancer) for publications of generic query specifications or interfaces or tools (beyond the simple use of Structured Query Language (SQL) or SQL-like constructs).
3. A specific investigation was performed in the area of temporal databases in order to identify any particular approaches unique to this field of informatics, and in particular whether there are any clinical research questions that can only be represented and executed using them.
4. A selected group of academic clinicians were invited to propose example research questions that either had recently been executed in their teams or which they would ideally like to ask but have not been able to.
5. Several commercial tools that support clinical and bioinformatics research, workflow and data mining (knowledge discovery) were reviewed, by attending vendor demonstrations at events and through dedicated invited sessions. However, this investigation was limited to the user view of each system and some details provided by each vendor representative: it was not possible to conduct a detailed examination of the internal architecture of these systems.
6. As this research has followed an iterative design approach, early versions of the implementation were demonstrated at conference exhibitor stands, such as the Research Councils’ All Hands Meeting each September from 2004 to 2006, at which general feedback was provided by the attending audience and specific discussion of requirements took place with key clinical and bio-science researchers.

All of the components of this method were pursued in parallel, over a period of 2 years from 2004 to 2006. In general it was found that most published clinical research studies have used dedicated databases to hold patient-related study data, and therefore composed dedicated queries in order to analyse this.
However, a useful set of recent clinical research hypotheses was produced from which plausible example clinical research queries could be inferred. There is limited pre-existing work on the generic specification of clinical queries. Temporal databases offer an innovative way of optimising date and time-related data for storage and retrieval but do not in themselves support very different end user requirements (only, perhaps, a better way of meeting them). This paper focuses on defining the clinical expression of queries, which might include temporal criteria and be met by the addition of temporal storage and retrieval facilities to the underlying database (e.g. the expansion of the SQL query language to express temporal queries). The body of temporal query literature is therefore not a relevant input to this query specification (but would be to any actual implementation of it). The most useful resource has therefore proved to be the collection of example queries provided by individuals through discussions and interviews, together with their feedback on early versions of a generic query implementation.

2.2. Analysis of sample clinical research queries

From published cancer clinical trials and discussions with invited clinical academics a set of sample research questions was defined. Although all of these queries were examined to identify their general pattern, some were selected for in-depth analysis as they included data items that correspond with the CLEF cancer data repository: these were therefore capable of execution later to generate sample query result sets. The list below shows several examples of clinical research queries, from which it was possible to infer a general pattern.

• Find the age and gender of patients who have been diagnosed with Hodgkin’s disease, where the initial diagnosis occurred between the ages 50 and 70 inclusive.
• How many patients with primary breast cancer survived 5 years after treatment with methotrexate?
• What is the percentage of patients diagnosed with primary breast cancer in the age range 30–70 who were surgically treated and had post-operative (any time after surgery) haematoma/seroma?
• What percentage of patients with primary breast cancer who relapsed had the relapse within 5 years of surgery?
• How many patients with primary breast cancer below the age of 50 had no adjuvant chemotherapy?
• What is the average survival of patients with Chronic Myeloid Leukaemia (CML) and both with and without splenomegaly at diagnosis?
• What is the average survival of patients with Acute Myeloid Leukaemia (AML) having a diagnostic White Blood Cell (WBC) count >100,000 and those having a diagnostic WBC count <100,000?
• What are the predisposing factors for breast cancer?
• What is the likely stage of breast cancer at diagnosis?
• What are the adverse effects of treatment with methotrexate?
• What is the median age for breast cancer occurrence?

To begin deducing an appropriate query framework, these kinds of queries were decomposed into the constraints that applied to each data item. (The first of these, relating to Hodgkin’s disease, is considered as an example later in this paper.) The majority of these queries were found to break down into a consistent form.
First the set of patients to be considered was established based on certain combinations of criteria (such as their smoking status or diagnosis), then any patients explicitly excluded were removed from the set, and finally a set of return values was specified for the remaining patients. Sometimes queries were based on absolute criteria and sometimes they were relative to a criterion that had already been specified, for example “5 years after surgery” or “fall in the systolic blood pressure by more than 20 mmHg within 4 h of the administration of drug A”.

Consider the following example research request: “Find all women between 60 and 70 years old that are nonsmokers, who were diagnosed with cancer of the breast that was metastatic to the lungs but not to the liver. These should be patients who underwent mastectomy followed by either chemotherapy or radiotherapy. Report the number who were alive at 5 years after surgery.”

It is possible to make educated guesses about the precise intended meaning of some of these phrases. For example, “between 60 and 70 years old” almost certainly means “relative to the date of diagnosis.” Non-smoking translates to “exclude patients recorded as smoking prior to the date of diagnosis.”

The examples listed above are clearly not very complex queries. However, it is often the case that more complex queries are actually longer combinations of data items with similar joins between them, and therefore of a similar pattern to those above. It should be recognised that the query specification can only represent constraints on and relationships between information items that occur within the structured EHR. A question such as “What are the predisposing factors for breast cancer?” might be inferred from the answer to such queries, but cannot itself be framed as a query unless a data item already existed in the EHR called (something equivalent to) “predisposing factors”.

Fig. 1 shows this query as a decomposition into constraints on individual clinical characteristics (traits, or data items) together with conditional relationships between them. From this it can be seen that the query detail comprises two main portions:

• a set of clinical characteristics that help to define a sub-population of interest, quite precisely; some of these characteristics are inclusion criteria and some are exclusion criteria, and some of them are conditionally interrelated;
• a set of characteristics whose values are wanted back on this specific population of patients, as the “result” of the query.

Fig. 1 – Example clinical query decomposed into individual constraints.

In SQL terms, these correspond to the WHERE and the SELECT clauses respectively (the FROM clause being the population of EHRs in the repository to which the query is sent). From this pattern it was possible to formulate an information model that could be used to represent any arbitrary clinical research query of this kind, and to test its use to physically represent many more example queries.
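The three-step pattern described in this section (establish the candidate set, remove excluded patients, then report return values) can be illustrated with a minimal sketch. It is not the CLEF implementation: the `Patient` class, the `run_query` function and the data item names are hypothetical, and the patient data are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Patient:
    """Hypothetical in-memory record: an identifier plus named data items."""
    patient_id: str
    facts: Dict[str, object] = field(default_factory=dict)


Criterion = Callable[[Patient], bool]


def run_query(population: List[Patient],
              inclusions: List[Criterion],
              exclusions: List[Criterion],
              return_items: List[str]) -> List[Dict[str, object]]:
    # Step 1: keep patients that satisfy every inclusion criterion.
    selected = [p for p in population if all(c(p) for c in inclusions)]
    # Step 2: remove patients that match any exclusion criterion.
    selected = [p for p in selected if not any(c(p) for c in exclusions)]
    # Step 3: report only the requested data items for the remaining patients.
    return [{item: p.facts.get(item) for item in return_items} for p in selected]


# Invented sample data, loosely modelled on the example research request.
population = [
    Patient("p1", {"gender": "F", "age_at_diagnosis": 64, "smoker": False,
                   "alive_at_5_years": True}),
    Patient("p2", {"gender": "F", "age_at_diagnosis": 66, "smoker": True,
                   "alive_at_5_years": True}),
    Patient("p3", {"gender": "M", "age_at_diagnosis": 61, "smoker": False,
                   "alive_at_5_years": False}),
]
inclusions = [
    lambda p: p.facts["gender"] == "F",
    lambda p: 60 <= p.facts["age_at_diagnosis"] <= 70,
]
exclusions = [lambda p: bool(p.facts["smoker"])]

result = run_query(population, inclusions, exclusions, ["alive_at_5_years"])
print(result)
```

In this sketch the inclusion and exclusion lists play the role of the WHERE clause and `return_items` the role of the SELECT clause.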

3. Results

3.1. Requirements for a generic clinical research query representation

The first result arising from this research has been a set of requirements for the representation of a clinical research query. These requirements specify the information to be represented in order to communicate a clinical research query to a target clinical data repository.

1. A query will include the specification of inclusion criteria which, taken together, define a target sub-population from whom nominated information is to be returned.
2. The inclusion criteria will each specify a set of constraints on the domain of possible values for a given clinical characteristic (or trait, usually represented as a data item or data item structure within a clinical data repository).
3. Constraints on a clinical characteristic might be a specified value or value range.
4. Constraints on a clinical characteristic might be a specified value that relates to the value of a different characteristic (e.g. greater than the value of another data item for the same patient).
5. Constraints on a clinical characteristic might be specified in terms of a given date (e.g. the first occurrence after a specified date).
6. Constraints on a clinical characteristic might be specified in terms of an interval relative to the date of another data item for the same patient (e.g. the administration of a particular drug within 6 weeks following a particular operation).
7. A special case of a relative date interval might be the age (or age range, or date of birth range) of the patient at the time of a given clinical characteristic value (e.g. patients diagnosed with a condition between the ages of 50 and 60).
8. Constraints on a clinical characteristic might be specified in terms of the order of occurrence within the repository (e.g. the most recent known value of a given data item).
9. Constraints on a clinical characteristic might be for any occurrence of a given value (e.g. patients who have ever had a particular drug).
10. Any constraints on a clinical characteristic might be nominated as exclusion or inclusion criteria.
11. Two or more clinical characteristic constraints might be combined as AND (intersection: both or all must be true) or as OR (union: either may be true).
12. Constraints on textual values might be specified as an exact string, a sub-string, a code value or a code value range.
13. Constraints on numeric values might need to present an absolute value, a value range, any upper or lower bound, and to specify if the upper and lower bounds are inclusive within the range (i.e. to distinguish between >10 and ≥10).
14. Constraints on date values might need to present an absolute value, a date range, with any upper or lower bound, and to specify if the upper and lower bounds are inclusive within the range.
15. Constraints on date values might be to varying precision: a full date, a month and year or only a year. Note that requirements to specify time values were not encountered during this investigation, but such a requirement may also exist.
16. The return values portion of a query will specify the set of data items to be returned to the query issuer for the patients identified through the selection criteria.
17. The constraints that may need to be specified for a given return value are the same as those that may be specified for an inclusion criterion.
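Requirements 13 to 15 can be made concrete with a short sketch of value constraints. The class names `NumericRange` and `PartialDate` and their fields are assumptions made for illustration; they are not taken from the information model reported in this paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class NumericRange:
    """Hypothetical numeric constraint with optional, separately inclusive bounds."""
    lower: Optional[float] = None
    upper: Optional[float] = None
    lower_inclusive: bool = True
    upper_inclusive: bool = True

    def matches(self, value: float) -> bool:
        if self.lower is not None:
            if value < self.lower or (value == self.lower and not self.lower_inclusive):
                return False
        if self.upper is not None:
            if value > self.upper or (value == self.upper and not self.upper_inclusive):
                return False
        return True


@dataclass(frozen=True)
class PartialDate:
    """Hypothetical date constraint at year, month-and-year or full-date precision."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def matches(self, d: date) -> bool:
        if d.year != self.year:
            return False
        if self.month is not None and d.month != self.month:
            return False
        if self.day is not None and d.day != self.day:
            return False
        return True


greater_than_10 = NumericRange(lower=10, lower_inclusive=False)  # >10
at_least_10 = NumericRange(lower=10)                             # >=10
age_50_to_70 = NumericRange(lower=50, upper=70)                  # inclusive range
march_2005 = PartialDate(2005, 3)                                # month and year only

print(greater_than_10.matches(10), at_least_10.matches(10))
print(march_2005.matches(date(2005, 3, 14)))
```

Keeping the inclusivity flags separate from the bound values is one way to distinguish >10 from ≥10 without relying on numeric offsets.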

4. Information model conforming to the requirements

Figs. 2–4 show the core information contained in a request. The CLEF REQUEST class represents the full query submission from a client to a server for processing and contains all of the required data to compute and return a response. Apart from some basic sender and message identifiers, the core content is a list of specific subclasses of the CLIN CHAR class represented at the top of Fig. 3. From Fig. 3 it can be seen that CLIN CHAR, the constraint specification for a single characteristic, can be one of five basic types.

Fig. 2 – Information model representing the complete query request.

Fig. 3 – Information model representing the constraints for a single clinical characteristic.

Fig. 4 – Information model representing constraints on data values.

1. Absolute Clinical Characteristic (class ABSOLUTE CLIN CHAR)
“Return records satisfying constraint x.” This is the simplest form of specification, in which a value or temporal occurrence constraint may be specified for a given clinical characteristic, and which will initially be considered in isolation from any other clinical characteristic (although its result values may subsequently be combined with those of other characteristics). An Absolute Clinical Characteristic may be an exclusion criterion instead of an inclusion criterion.

2. Age Anchor (class AGE ANCHOR)
“Return records satisfying constraint x, where x occurred in age range y to z.” It is rare to be interested in the age of each patient at the time the query is performed (although this can be provided, it is misleading and it is difficult to know what values if any should be returned for deceased patients). It is more common for researchers to specify a clinical event as the basis for the age calculation. The age anchor provides for specifying an additional criterion representing the difference between the date the clinical characteristic was recorded (which can be further specified as the first, last, or any instance), and the date or year of the patient’s birth.

3. Event Anchor (class EVENT ANCHOR)
“Return records satisfying constraint x, but providing for a user-defined anchor identifier.” The event anchor provides for the same initial query specification as in the age anchor, but does not offer the second age-related constraint. Whereas the age anchor can only be specified once for each separate query, the event anchor can be used for as many criteria as required.

“Anchor” queries provide for an additional temporal key that can be referred to by relative clinical characteristic queries, discussed below. They correspond to instructions which in natural language might be expressed: “I’m interested in patients who had a diagnosis of x, but please also remember when the diagnosis was made for each patient as I’ll also specify something else temporally related to those dates (e.g. limit my search to those who started treatment with drug y within a month of being diagnosed.)”

4. Relative Clinical Characteristic (class RELATIVE CLIN CHAR)
“Return records satisfying constraints on this characteristic where the observation was dated a given interval before (or after) the Event Anchor characteristic for the same patient.” Relative clinical characteristic queries search for conditions in which the criteria given are true and the occurrence is within a date window specified by a given interval from an origin point given by the data item corresponding to the nominated Event Anchor. So for example it is possible to establish the date of a particular diagnosis (the Event Anchor) and then ask for a relative query to determine that a particular treatment occurred within 5 years of that diagnosis. Other examples would include the onset of a symptom within a certain number of days of radiotherapy administration, or a drop in white cell count within 6 h of the first administration of a particular chemotherapy agent. The specification of a relative clinical characteristic constraint therefore has to identify which anchor event it must be relative to, and include a date interval (and its direction from the Event Anchor).

5. Comparative Clinical Characteristic (class COMPARATIVE CLIN CHAR)
“Return records satisfying constraints on this characteristic where the observation is greater than (or less than, etc.) the value of the Event Anchor characteristic for the same patient.” This constraint behaves similarly to the Relative Clinical Characteristic except that the comparison is made on the basis of the values of each, rather than their dates of occurrence.

Each of these five kinds of clinical characteristic specification individually identifies a sub-population of patients. These individual patient lists need then to be combined in order to identify the target population of the query. When combining multiple characteristic constraints, the UNION QUERY class contains one or more instances of an absolute or relative or comparative clinical characteristic query (each of which may be marked as either being in the relevant patient set, or being excluded from that set). The absolute and relative clinical characteristic query results are OR-ed together before being returned to the enclosing class. Each of the age anchor, event anchors and union query components are finally AND-ed together (obtaining their intersection, that is, the patients common to all such lists) to form the definitive set of relevant patients for the query. This set is passed to the return values processor to obtain the values requested for the patient set.

Fig. 5 – Information model representing the values returned in response to the query.
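The interplay of an Event Anchor, a Relative Clinical Characteristic and the final OR/AND combination can be sketched as follows. This is an illustration under stated assumptions rather than the middleware described in this paper: the function names and record layout are hypothetical, the anchor is taken as the first occurrence, only the "after" direction is shown, "within a month" is approximated as 30 days, and excluded sets are assumed to be subtracted from the union before intersection.

```python
from datetime import date, timedelta
from typing import Dict, List, Set, Tuple

# Hypothetical record layout: patient id -> list of (characteristic, date recorded).
Records = Dict[str, List[Tuple[str, date]]]


def event_anchor(records: Records, characteristic: str) -> Dict[str, date]:
    """Remember, per patient, the date of the first matching occurrence."""
    anchors: Dict[str, date] = {}
    for pid, observations in records.items():
        dates = [d for name, d in observations if name == characteristic]
        if dates:
            anchors[pid] = min(dates)
    return anchors


def relative_characteristic(records: Records, characteristic: str,
                            anchors: Dict[str, date],
                            within: timedelta) -> Set[str]:
    """Patients with the characteristic dated inside [anchor, anchor + within]."""
    matched: Set[str] = set()
    for pid, anchor_date in anchors.items():
        for name, d in records[pid]:
            if name == characteristic and anchor_date <= d <= anchor_date + within:
                matched.add(pid)
                break
    return matched


def combine(anchor_sets: List[Set[str]],
            union_included: List[Set[str]],
            union_excluded: List[Set[str]]) -> Set[str]:
    # OR the included union-query results, then drop excluded patients (assumption).
    result = set().union(*union_included) - set().union(*union_excluded)
    # AND the union result with every anchor's patient set.
    for patients in anchor_sets:
        result &= patients
    return result


records: Records = {
    "p1": [("diagnosis", date(2000, 1, 10)), ("drug_y", date(2000, 1, 25))],
    "p2": [("diagnosis", date(2001, 5, 1)), ("drug_y", date(2001, 9, 1))],
    "p3": [("drug_y", date(2002, 2, 2))],
}
anchors = event_anchor(records, "diagnosis")
treated = relative_characteristic(records, "drug_y", anchors, timedelta(days=30))
target = combine([set(anchors)], [treated], [])
print(sorted(target))
```

Here p2 has both data items but falls outside the 30-day window, and p3 has no anchor event, so only p1 remains in the target population.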
The return values part of the query defines what information is to be returned, as a result set, about each of these patients. The return values set provides values for the absolute or relative or comparative clinical characteristics specified in the RETURN VALUES class. The same query mechanics are used for both the return values and the union queries, except that in the former case it makes no sense to provide for return values being “excluded”. For absolute clinical characteristics, the actual values are returned. For relative and comparative clinical characteristics a Boolean is returned indicating whether the clinical characteristic was present for the patient given (in effect answering the question, “did the event you specified in this Relative Clinical Characteristic occur within the time period you declared, or not”; for example, returning whether the patient died within 5 years of diagnosis rather than the fact that he or she has died).

It was found during the investigation that the age and gender of the patients in the result set were so often requested that these were included by default within the response information model. The age was deliberately calculated for the date point at which an age was specified in the set of inclusion criteria; if no age criterion was specified then a “rogue” age value of −1 was returned. This policy was deliberately adopted to avoid returning spurious ages for deceased patients, or ages at the day of query execution that bore no relationship to the health episodes being analysed. The result of executing this overall query is a result set represented by Fig. 5. These values are passed back to the caller with spurious patient identifiers, obfuscated so that although the same identifier refers to the same patient, the identifier itself will have been unrecoverably changed from that which was held in the record database.
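A result row of the kind described above might be assembled as in the following sketch. The paper does not state how identifiers were obfuscated; the keyed hash used here is an assumption chosen only to show the stated property (the same input always maps to the same output, and the original cannot be read back from it). The function names, row layout and the identifier `RM-0001` are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Callable, Dict, Optional

ROGUE_AGE = -1  # returned when no age criterion was specified in the query


def make_obfuscator() -> Callable[[str], str]:
    """Build a one-way identifier mapping around a random key that is never exposed."""
    key = secrets.token_bytes(32)

    def obfuscate(patient_id: str) -> str:
        digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]

    return obfuscate


def result_row(obfuscate: Callable[[str], str], patient_id: str, gender: str,
               age_at_anchor: Optional[int],
               absolute_values: Dict[str, object],
               relative_flags: Dict[str, bool]) -> Dict[str, object]:
    return {
        "patient": obfuscate(patient_id),
        "gender": gender,
        # Age at the anchoring event, or the rogue value if none was requested.
        "age": ROGUE_AGE if age_at_anchor is None else age_at_anchor,
        # Absolute characteristics carry their actual values.
        "values": dict(absolute_values),
        # Relative and comparative characteristics carry only a Boolean.
        "flags": dict(relative_flags),
    }


obfuscate = make_obfuscator()
row = result_row(obfuscate, "RM-0001", "F", None,
                 {"diagnosis": "Hodgkin's disease"},
                 {"died_within_5_years": False})
print(row["age"], row["flags"])
```

Whether pseudonyms should stay stable across separate query executions is a design decision this sketch leaves open; here they are stable only for the lifetime of one obfuscator.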

5. Evaluation

5.1. Implementation

An important part of the research has been to validate that the generic query specification can be implemented as an interface and middleware component communicating with a generic EHR system. The CLEF repository implemented at UCL for this purpose was architected on the basis of longstanding research on EHR systems design by the research team [13]. The UCL generic EHR server [14] has been part of a 12-year multidisciplinary journey that has covered both requirements and information modelling, as well as systems development. It has benefited from collaboration with a large number of healthcare organisations, informatics groups, and industrial partners across the United Kingdom, Europe, and beyond, through the research projects mentioned earlier in this paper. The fourth EU Health Telematics framework project, Synapses [15,16] (Project No. HC1046 1996–1998), provided much of the information modelling basis for the record server developed at UCL. The Synapses project developed an architecture for a longitudinal and multi-enterprise EHR, by federating clinical databases and systems scattered across hospital departments, specialised units, primary care and other community settings.

The UCL record server is generic at several levels. It has an n-tier architecture, where the database and presentation layers are distinct from the middleware containing the logic of record persistence itself. It is fully architected in Java™ [17] and can be deployed on any system for which there is a Java Virtual Machine. Within an installation domain, Jini™ [18] can be used to connect the application server to the middleware, and the middleware to the database. The EHR databases are abstracted through a Java interface called the “FeederDatabaseSource.” Implementations for this interface have been written for relational, object-oriented, and XML database architectures, and the same methodology can be used via Jini to locate and federate multiple database instances of any of these types [19]. A specific interface is needed as the gateway to the EHR database in order to ensure that all instances committed to it conform to the EHR model, and that appropriate access policies are applied to the data. One outcome of this level of genericity is that it is not possible to rely on the presence of any particular query language (e.g. SQL) provided by the database implementation itself (since these would bypass the validation and access control measures mediated via FeederDatabaseSource). It was in any case considered desirable to utilise an interface derived from the specific data structures used in the UCL EHR server (even though standards based) for the generic query representation; interfaces of this kind will be the most tractable and scalable way to execute a generic query across multiple heterogeneous EHR systems.

5.2. Archetypes

Each criterion specified in the query includes an identifier for the clinical characteristic to which it is a constraint. As part of this constraint, the value specification can refer to a string, date or numeric value, depending on the type of the values contained by the corresponding data item. The implementation of this interface requires the query-authoring environment (in our implementation, the Open University client) to have prior knowledge of the data items used within the UCL repository and the data value type of each leaf node (known as an Element). The UCL server can represent and store EHR data generated from any clinical domain in a systematic and consistent way, since it incorporates the use of archetypes [20]. An archetype is a formalised representation of the clinical knowledge defining a discrete health record domain concept (such as a blood pressure, a retinal examination, an APGAR score, a prescribed drug, a progress note or a differential diagnostic formulation). Each archetype specifies a blueprint for how one such clinical domain concept is to be represented within a generic EHR, such that all instances are identically structured and can be captured, presented, communicated or analysed consistently. This architecture provided a useful test-bed for this generic interface implementation, since domain-specific aspects of a query (such as the nomination of the clinical characteristics to specify a particular diagnosis, symptom or test result) can be communicated to the EHR server in terms of constraints on identified archetypes. The a priori knowledge to be communicated to the Open University client application was simply the library of archetypes that had been used to represent the cancer EHR within the repository. An example extract of this library, viewed using the UCL editor, is shown in Fig. 6. 
If a query were to be based on a patient’s ethnic group, for example, then the ID of the corresponding archetype would be passed to the query engine along with a constraint of the appropriate type (in this case, a string constraint). Depending on the class in the model in which the identifier was used, this could be used to require patients to be of a certain group or groups, or could be used to exclude this group or groups from analysis.

Fig. 6 – Partial Archetype structure.
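The pairing of an archetype identifier with a typed constraint, used either to include or to exclude patients, might be sketched as follows. This is an illustration under stated assumptions rather than the classes of the information model: the names `CriterionSketch`, `Mode`, `Constraint` and the factory methods are hypothetical, and the archetype identifier used in any call is a placeholder.

```java
import java.time.LocalDate;
import java.util.Set;

// Hypothetical sketch; class, enum and method names are not from the CLEF model.
final class CriterionSketch {

    /** Whether matching patients are required or removed from the analysis. */
    enum Mode { INCLUDE, EXCLUDE }

    /** A constraint on the value held by one archetype leaf node (Element). */
    interface Constraint {
        boolean accepts(Object value);
    }

    /** String constraint: the value must be one of the allowed strings. */
    static Constraint stringIn(Set<String> allowed) {
        return v -> v instanceof String && allowed.contains(v);
    }

    /** Numeric constraint: the value must lie in the closed range [lo, hi]. */
    static Constraint numberBetween(double lo, double hi) {
        return v -> v instanceof Number
                && ((Number) v).doubleValue() >= lo
                && ((Number) v).doubleValue() <= hi;
    }

    /** Date constraint: the value must fall between the two dates, inclusive. */
    static Constraint dateBetween(LocalDate from, LocalDate to) {
        return v -> v instanceof LocalDate
                && !((LocalDate) v).isBefore(from)
                && !((LocalDate) v).isAfter(to);
    }

    final String archetypeId;
    final Constraint constraint;
    final Mode mode;

    CriterionSketch(String archetypeId, Constraint constraint, Mode mode) {
        this.archetypeId = archetypeId;
        this.constraint = constraint;
        this.mode = mode;
    }

    /**
     * True if a patient whose data item holds this value stays in the result.
     * A missing value (null) never satisfies a constraint, so such a patient
     * is dropped by INCLUDE and kept by EXCLUDE in this sketch.
     */
    boolean keeps(Object value) {
        boolean hit = constraint.accepts(value);
        return mode == Mode.INCLUDE ? hit : !hit;
    }
}
```

Because each constraint checks the runtime type of the value first, a string constraint applied to a numeric data item simply fails to match. This mirrors the requirement that the query author know the data value type of each Element in advance.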

5.3. Communications methodology

The Query Interface component runs as a remote service and communicates with the Record Server using Java’s Remote Method Invocation (RMI) mechanism. The part of the communication chain between the Record Server and the Query Interface, which might carry patient data with identifiers that could be reconciled with the database itself, has been secured with the Secure Sockets Layer (SSL) protocol. Assuming that a CLEF Query Interface has already been discovered and a link established with the RMI/SSL mechanism (taking into account local security policies and so forth), the Hodgkin’s disease example query given earlier in this paper might translate into code as in Fig. 7. Here, “C81.” is the prefix for all types of Hodgkin’s disease in ICD-10. The interface is designed to be relatively uncomplicated for the caller and allows them to increase the complexity of queries as they choose. To add additional clauses to the query, further request.set*( ) entries would be included.

Fig. 7 – Use of the Age Anchor query in code.
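The calling pattern described in this section can be sketched as below. This does not reproduce the code of Fig. 7: the request class, its setter names, the remote interface and the local stand-in service are all hypothetical, and the patient identifiers used with it would be invented sample data. What the sketch does show is the shape RMI requires (a serialisable request object, and a remote interface that extends `java.rmi.Remote` with methods declaring `RemoteException`) together with matching of diagnosis codes against the “C81.” prefix.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; setter names are illustrative, not those of Fig. 7.

/** Request object passed by value over RMI, hence Serializable. */
final class QueryRequestSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private String diagnosisArchetypeId;
    private String codePrefix;

    void setDiagnosisArchetype(String archetypeId) { this.diagnosisArchetypeId = archetypeId; }
    void setCodePrefix(String prefix) { this.codePrefix = prefix; }

    String getDiagnosisArchetype() { return diagnosisArchetypeId; }

    /** True if the code starts with the requested prefix, e.g. "C81.". */
    boolean matches(String code) {
        return codePrefix != null && code != null && code.startsWith(codePrefix);
    }
}

/** Remote interface in the form RMI expects. */
interface QueryServiceSketch extends Remote {
    List<String> execute(QueryRequestSketch request) throws RemoteException;
}

/** Local stand-in for the remote service, holding one diagnosis code per patient. */
final class LocalQueryServiceSketch implements QueryServiceSketch {
    private final Map<String, String> diagnosisByPatient;

    LocalQueryServiceSketch(Map<String, String> diagnosisByPatient) {
        this.diagnosisByPatient = diagnosisByPatient;
    }

    /** Returns the sorted identifiers of patients whose code matches the request. */
    @Override
    public List<String> execute(QueryRequestSketch request) {
        return diagnosisByPatient.entrySet().stream()
                .filter(e -> request.matches(e.getValue()))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

In a deployed service the implementation would be exported over RMI rather than called locally. In current JDKs one option for securing the channel is to export it with the socket factories in `javax.rmi.ssl`; whether the CLEF implementation used these particular classes is not stated in this paper.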

5.4. Test results

The Query Interface has been tested as an online query tool against a database of about a gigabyte in size (1,039,116 KB). The database was constructed from XML files exported by the Royal Marsden Hospital. XML is a particularly verbose format, and the sizes of the files giving rise to the final database include:

1. Casenote letters and narratives, 750 MB
2. Radiology reports, diagnoses, etc., 340 MB
3. Histopathology reports, diagnoses, etc., 40 MB
4. Definitive cancer diagnoses, 23 MB
5. Death certificates, 10.4 MB
6. Registration and demographics data, 5.7 MB

Within the database there are 21,130 individual patient records and 1,854,575 individual record components (EHR nodes), meaning that on average each patient has just over 87 record components in their record, each of which is an average of just over 0.5 KB. There are 983,901 content items in the database, which means that just over half the record components store actual data, the rest containing aggregation items. There are 363,483 casenote letters and narratives (at about 2 KB per item in the original XML), 187,042 radiology reports (at just under 2 KB per item in the original XML) and 15,346 histopathology reports (about 2.5 KB per item originally).

In order to establish the level of performance that might be expected of the Query Interface component, two types of test were executed. The first was the relatively simple Hodgkin’s query described earlier. The query was authored using the Open University client, which then passed a Java object containing the query (formulated according to the above model) to the Query Interface component for execution. This test was run with the client on one computer and the database and Query Interface on another, a relatively typical scenario. Both machines were 2004-era G4 processor-based, and on the same sub-network. To account for performance issues at start-up and to allow any JIT optimisations occurring on repeated use to take place, the last 7 iterations from a set of 10 were selected. In the case of the Hodgkin’s example this test yields an average round-trip query response time of 5228 ms, with a first-run time of 6046 ms, and a highest and lowest result of 6840 ms and 4788 ms, respectively. The values were derived by subtracting the millisecond timestamp taken when the query script begins from the millisecond timestamp taken on exit.

In order to demonstrate the scalability of the Query Interface, the results of a second query are also presented; this query was demonstrated at the Engineering and Physical Sciences Research Council All Hands Meeting 2004. In narrative form it is “Find and return the first histological diagnosis of adenocarcinoma, squamous cell carcinoma or small cell carcinoma, for patients diagnosed with malignant neoplasm of the bronchus or lung between the ages of 20 and 60.” In contrast to the single clause of the first query, this version requires 7 clauses to enact. Following the same technique, whereby the last 7 tests out of 10 are used, the average time for this much larger query is 13,440 ms.

A further examination of this query was undertaken at the database level, to verify accuracy as opposed to performance. Independently of the query interface specification documented in this paper, the patient record database was examined through a series of individual database retrievals on each facet of the overall query, and the result sets examined. Through SQL joins it was possible to recreate the same final result set as had been generated by the query interface, confirming that an identical list of patients was returned. The addition or removal of individual patients (both those who should be returned and those who should not) always resulted in a corresponding and identical change in the patient list result set by each method of retrieval, thereby confirming that the query interface was retrieving accurately. All of the additional queries listed in Section 2.2 of this paper were subsequently authored, executed and the results verified.
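The timing procedure used for both queries (ten runs, elapsed time taken as the difference between two millisecond timestamps, and the mean of the last seven runs reported) can be expressed as a short sketch. The class and method names are hypothetical and the task being timed is left to the caller; this is an illustration of the procedure, not the test script used in the study.

```java
import java.util.Arrays;

// Hypothetical sketch of the measurement procedure; not the study's test script.
final class TimingSketch {

    /** Runs the task the given number of times and records each elapsed time in ms. */
    static long[] timeRuns(Runnable task, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.currentTimeMillis();   // timestamp when the query begins
            task.run();
            elapsed[i] = System.currentTimeMillis() - start; // timestamp on exit minus start
        }
        return elapsed;
    }

    /** Mean of the last `keep` measurements, discarding the earlier warm-up runs. */
    static double meanOfLast(long[] elapsed, int keep) {
        if (keep <= 0 || keep > elapsed.length) {
            throw new IllegalArgumentException("keep must be between 1 and " + elapsed.length);
        }
        return Arrays.stream(elapsed, elapsed.length - keep, elapsed.length)
                     .average()
                     .getAsDouble();
    }
}
```

A caller would time a query with `TimingSketch.timeRuns(task, 10)` and report `TimingSketch.meanOfLast(times, 7)`. Discarding the first three runs keeps start-up costs and early JIT compilation out of the average, at the price of reporting steady-state rather than first-use performance, which is why the first-run time is quoted separately in the text.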

6. Discussion

If the electronic health record standards described above are widely adopted, in the coming years most EHR systems as well as conventional clinical data sources will have incorporated standard-conformant interfaces by which single patient record extracts can be exported. It is therefore conceivable that a generic query interface could be layered on top of these to also contribute data in response to a distributed query (provided that access permissions were granted for the data to be released). If successful, this could provide a low cost means of analysing large populations of health records across heterogeneous systems. In other words, this query interface builds on an existing (new) standard and therefore conformance to the standard should enable conformance to this interface with little additional effort.

The information model documented in this paper has been designed on the basis of a set of requirements drawn from example queries obtained from particular clinical trials and academic clinical inputs. A more extensive investigation across clinical and bio-informatics is ideally needed to establish if other patterns of query need to be accommodated, and if the model can be extended to meet them. What has been shown by this work is that at least one significant body of queries can be represented in a generic way, and that this model can form the basis whereby an independently developed query authoring tool can obtain results from an EHR-standards-conformant data repository.

One of the prerequisites for the success of this approach across distributed systems is that a semantically equivalent data structure can be defined in each connected EHR system, such that a query can be executed consistently across heterogeneous systems. The use of the archetype approach in the work reported here does not in itself guarantee that this will happen. However, as archetypes gain presence internationally, and EHR systems map their internal data schemata to them (in order to share EHR instance data), the capability of these systems to participate in standardised queries will also increase. Another query interface specification has recently been described by Ma et al. [21] in the context of openEHR [22] and both teams continue to share designs as part of an ongoing collaboration.

The Query Interface and engine described herein took approximately one person-month to write and test, at about 4000 lines of Java code.
Whilst it will provide deep support for the particular cancer-related needs of CLEF, the obvious question to be asked is whether it is sufficiently rich to provide a toolkit for querying in a more generic setting. The lack of dependency on any database architecture or operating system platform means that the toolkit could move freely from one environment to another without hindrance. The use of archetypes ensures that the toolkit could be used to target new domains if desired. However, the implementation and testing has shown that the UCL EHR database is not ideally suited in its present form to temporal queries—queries that are designed to be repeated over different periods of time, but whose cost to perform is expected to decrease after the first execution. These are used for example to establish trends in a patient state. Much has been published on the potential for temporally architected databases to optimise how chronological but interrelated events are stored and queried (for example, a detailed review of various options pioneered within health informatics was published in Combi and Shahar [23]). Such databases are dedicated to particular kinds of data or kinds of anticipated uses, and often require quite specific representations for time points, time intervals and temporal relationships. In contrast, the approach taken in this work has been to place no requirement on the architecture of the underlying EHR data other than that it logically conforms to an EHR standard and utilises or can map to archetypes (which will also form part of the next EHR standard).


Summary points

What was known before the study
• of models for querying in computing systems generally (e.g. SQL, EJB-QL, XML-Query, etc.);
• of models for the representation of healthcare records;
• of clinically interesting queries in cancer using natural-language.

What this study has added to the body of knowledge
• of some possible representational models for the cancer domain;
• of an architecture for the representation of such queries, supporting a diverse set of possible questions;
• of the performance of the query representation against a server known to be compliant to a European standard for record representation.

It is clear from this paper that performance difficulties are encountered in the approach adopted. It is, however, the view of the authors that lessons learned from this evaluation can be used in redesigning a record server to optimise both single patient (hierarchical) extraction and (temporal) research queries across populations. It is likely that in the next generation UCL system a more widespread technology will be used to abstract the database layer so that less developer effort need be spent on maintaining the implementations. An Enterprise JavaBeans™ [24] container, for example, offers a more generic querying paradigm based on SQL but which includes querying over entity object relationships. This would remove a mismatch between the functional interface and the expression of the query. The UCL team is presently implementing a successor EHR server that will conform to the new EHR Communications standard (ISO/EN 13606, due to be published in Europe during 2007 and internationally in 2008–2009), and utilise such engineering technologies. The logical query interface model is also being improved to reflect additional requirements that are being identified through ongoing collaborations with clinical and bio-scientific research teams. Further implementation and evaluation work will follow.

7. Conclusion

If EHR system interoperability is now a viable objective through standards-conformant interfaces, a complementary standardised query interface will also be required. This paper complements many years of European research and standardisation on the interoperable communication of electronic health records (EHRs), by proposing a way in which one or more EHR systems might be queried in a standardised way. The authors hope that the work reported here can help highlight the feasibility of and contribute to such a standard.


Contribution of the authors

Tony Austin wrote the bulk of the draft, and the software components tested herein. Dipak Kalra designed the query model and wrote significant portions of the draft. Archana Tapuria made a significant contribution to the draft (including the referencing). Nathan Lea was responsible for importing the clinical data into the test repository (20,000 patient records), and then executing and evaluating the correctness of all of the clinical research queries reported in this paper. David Ingram was part of the CLEF research project consortium that identified the need for such a solution, and coordinated a set of clinical inputs that informed our design work.

Acknowledgements

CLEF is supported in part by grant G0100852 from the UK Medical Research Council under the E-Science Initiative; see http://www.clinical-escience.org for further details. Some parts of the literature research described herein were jointly undertaken by colleagues at the University of Manchester (in particular, Dr. Jeremy Rogers) whose help is gratefully acknowledged.

References

[1] D. Ingram, The good European health record project, in: Laires, C. Laderia (Eds.), Health in the New Communications Age, IOS Press, Amsterdam, 1995, ISBN 9051992246, pp. 66–74.
[2] D. Lloyd (Ed.), GEHR Architecture. Good European Health Record Final Deliverable. 252 p. June 30, 1995. Available from: http://www.chime.ucl.ac.uk/work-areas/ehrs/GEHR/EUCEN/del19.pdf (last accessed December 2007).
[3] J. Grimson, W. Grimson, D. Berry, G. Stephens, E. Felton, D. Kalra, P. Toussaint, O.W. Weier, A CORBA-based integration of distributed electronic healthcare records using the synapses approach, IEEE Trans. Inf. Technol. Biomed. 2 (September (3)) (1998) 124–138.
[4] F.M. Ferrara, P.A. Sottile, W. Grimson, The holistic architectural approach to integrating the healthcare record in the overall information system, Stud. Health Technol. Inform. 68 (1999) 847–852.
[5] D. Kalra, D. Ingram, T. Austin, V. Griffith, D. Lloyd, D. Patterson, P. Kirstein, P. Conversin, W. Fritsche, Demonstrating wireless IPv6 access to a Federated Health Record Server, in: M. Bubak, G.D. van Albada, P.M.A. Sloot, J.J. Dongarra (Eds.), Computational Science - ICCS 2004. Proceedings of the 4th International Conference, Lecture Notes in Computer Science, Krakow, Poland, June 6–9, IOS Press, Amsterdam, 2004, pp. 1165–1171, ISSN: 0302-9743. Available from: http://eprints.ucl.ac.uk/1588/1/A10.pdf (last accessed April 2008).
[6] R. Dixon, P. Grubb, D. Lloyd, D. Kalra, Consolidated List of Requirements. EHCR Support Action HC3001 Deliverable 1.4. The European Commission, Brussels, 2001, 59 p. Available from: http://www.chime.ucl.ac.uk/work-areas/ehrs/EHCRSupA/del1-4v1 3.PDF (last accessed December 2007).
[7] P. Hurlen (Ed.), Project Team 1-011. ENV 12265: Electronic Healthcare Record Architecture, CEN TC/251, Brussels, 1995.
[8] S. Kay, T. Marley (Eds.), Project Team 1-026. ENV 13606: EHCR Communications. Part 1. Electronic Healthcare Record Architecture, CEN TC/251, Brussels, 1999.
[9] D. Kalra, Electronic health record standards, in: R. Haux, C. Kulikowski (Eds.), Yearbook of Medical Informatics. Stuttgart: Schattauer Methods Inf. Med. 45 (2006) (Suppl 1), 136–144.
[10] A. Rector, et al., CLEF - Joining up Healthcare with Clinical and Post-Genomic Research, in: Engineering and Physical Sciences Research Council All Hands Meeting, 2003 (poster).
[11] D. Kalra, P. Singleton, D. Ingram, J. Milan, J. MacKay, D. Detmer, A. Rector, Security and confidentiality approach for the Clinical E-Science Framework (CLEF), Methods Inf. Med. 44 (2005) 193–197.
[12] C. Hallett, D. Scott, R. Power, Composing questions through conceptual authoring, Comput. Linguist. 33 (1) (2007) 105–133.
[13] D. Kalra, D. Lloyd, T. Austin, A. O’Connor, D. Patterson, D. Ingram, Information architecture for a federated health record server, in: F. Mennerat (Ed.), Electronic Health Records and Communication for Better Health Care. Proceedings of EuroRec’01, November 2001, Aix-en-Provence, Studies in Health Technology and Informatics, Issue 87, IOS Press, Amsterdam, 2002, pp. 47–71, ISSN: 0926-9630.
[14] D. Kalra, T. Austin, A. O’Connor, D. Lloyd, D.L.H. Patterson, D. Ingram, Design and Implementation of a Federated Health Record Server, in: Towards a Electronic Health Record Europe (TEHRE), November 11–14, 2001. Available from: http://eprints.ucl.ac.uk/1579/1/A1.pdf (last accessed April 2008).
[15] W. Grimson, P. Toussaint, D. Kalra, P. Hurlen, E. Andersen, S. Spahni, R. O’Moore, J. Grimson, ODP Specification of Synapses. Synapses Federated Healthcare Record Server Final Deliverable, 375 p, March 18, 1999. Available from: https://www.cs.tcd.ie/synapses/public/html/projectdeliverables.html (last accessed December 2007).
[16] D. Kalra, T. Austin, D. Ingram, D.L.H. Patterson, F.M. Ferrara, P.A. Sotille, W. Grimson, D. Solomon, The SynEx project, in: J. Bryant (Ed.), Current Perspectives in Healthcare Computing’ 99. Part 1. BJHC Books, Weybridge, Surrey, 1999, pp. 22–24, ISBN 095354270X.
[17] http://www.javasoft.com (last accessed February 2005).
[18] http://www.jini.org (last accessed February 2005).
[19] T. Austin, The Development and Comparative Evaluation of Middleware and Database Architectures for the Implementation of an Electronic Healthcare Record, PhD Thesis, University College London, CHIME, 2004.
[20] T. Beale, Archetypes: Constraint-based Domain Models for Future-proof Information Systems. OOPSLA 2002 Workshop on Behavioural Semantics, 18 p. Available from: http://www.deepthought.com.au/it/archetypes/archetypes new.pdf (last accessed February 2005).
[21] C. Ma, H. Frankel, T. Beale, S. Heard, in: K. Kuhn, et al. (Eds.), EHR Query Language (EQL) - A Query Language for Archetype-Based Health Records, IOS Press, Medinfo, 2007, pp. 397–401.
[22] http://www.openehr.org (last accessed February 2005).
[23] C. Combi, Y. Shahar, Temporal reasoning and temporal data maintenance in medicine: issues and challenges, Comput. Biol. Med. 27 (No. 5) (1997) 353–368.
[24] http://java.sun.com/products/ejb/ (last accessed February 2005).