“Big data” and “open data”: What kind of access should researchers enjoy?




Therapie (2016) 71, 107-114

Available online at

ScienceDirect www.sciencedirect.com

GIENS WORKSHOPS 2015 / Clinical Pharmacology

“Big data” and “open data”: What kind of access should researchers enjoy?☆

Gilles Chatellier a,∗, Vincent Varlet b, Corinne Blachier-Poisson c, the participants of Giens XXXI, Round Table No. 6, Nathalie Beslay d, Jehan-Michel Behier e, David Braunstein f, Mireille Caralp g, Brigitte Congard-Chassol h, Isabelle Diaz i, Laure Fournier a, Anne Josseran h, Philippe Lechat j, Cinira Lefevre k, Franck von Lennep l, Karine Levesque m, Philippe Maugendre n, Guillaume Marchand o, Didier Mennecier p, Nicholas Moore q, Sophie Ravoire r, Christine Riou s

a Hôpital européen Georges-Pompidou, AP-HP, 20, rue Leblanc, 75908 Paris cedex 15, France
b LeLabEsanté, 75000 Paris, France
c AMGEN, 92650 Boulogne-Billancourt, France
d Beslay + Avocats, 75008 Paris, France
e Takeda France, 92977 Paris-La-Défense, France
f Hôpital La Timone, AP-HM, 13385 Marseille, France
g Inserm Transfert, 75013 Paris, France
h SNITEM, 92400 Courbevoie, France
i LEEM/ARIIS, 75858 Paris, France
j DRCD, hôpital Saint-Louis, AP-HP, 75475 Paris, France
k Bristol-Myers-Squibb, 92500 Rueil-Malmaison, France
l DREES, 75350 Paris, France
m Abbott Vascular, 94518 Rungis, France
n Sanofi France, 94255 Gentilly, France
o DMD Santé, 51100 Reims, France
p Direction centrale du service de santé des armées, 94160 Saint-Mandé, France

DOI of original article: http://dx.doi.org/10.1016/j.therap.2016.01.004.
☆ Articles, analyses and proposals from the Giens workshops are those of the authors and do not prejudice the proposition of their parent organization.
∗ Corresponding author. E-mail address: [email protected] (G. Chatellier).

http://dx.doi.org/10.1016/j.therap.2016.01.005
0040-5957/© 2016 Société française de pharmacologie et de thérapeutique. Published by Elsevier Masson SAS. All rights reserved.

q Université de Bordeaux, 33076 Bordeaux, France
r SR Consulting, 75015 Paris, France
s CHU de Rennes, 35033 Rennes, France

Received 17 December 2015; accepted 21 December 2015 Available online 3 February 2016

KEYWORDS “Big data”; “Open data”; Anonymization; Connected objects; Data storage

Summary

The healthcare sector is currently facing a new paradigm: the explosion of “big data”. Coupled with advances in computer technology, “big data” appear promising: they should allow us to better understand the natural history of diseases, to follow up the implementation of new technologies (devices, drugs), to contribute to precision medicine, etc. Data sources are multiple (medical and administrative data, electronic medical records, data from rapidly developing technologies such as DNA sequencing, connected devices, etc.) and heterogeneous, and their use requires complex methods for accurate analysis. Moreover, faced with this new paradigm, we must determine who could (or should) have access to which data, how to reconcile the collective interest with the protection of personal data, and how to finance in the long term both operating costs and database interrogation. This article analyses the opportunities and challenges related to the use of open and/or “big data” from the viewpoint of pharmacologists and representatives of the pharmaceutical and medical device industry. © 2016 Société française de pharmacologie et de thérapeutique. Published by Elsevier Masson SAS. All rights reserved.

Abbreviations

ADNI: Alzheimer’s Disease Neuroimaging Initiative
BDW: biomedical data warehouses
CCAM: medical classification for clinical procedures (classification commune des actes médicaux)
CDW: clinical data warehouses
CépiDC: Epidemiology Centre on the Medical Causes of Death (Centre d’épidémiologie sur les causes médicales de décès)
CNAMTS: Caisse nationale d’assurance maladie des travailleurs salariés
CNIL: data protection watchdog (Commission nationale informatique et libertés)
CNSA: National Solidarity Fund for Autonomy (Caisse nationale de solidarité pour l’autonomie)
CPRD: clinical practice research datalink
DNA: deoxyribonucleic acid
eCTD: electronic common technical document
EGB: general samples of beneficiaries
EHR4CR: Electronic Health Records for Clinical Research
EMA: European Medicines Agency
GAFAMS: Google, Apple, Facebook, Amazon, Microsoft and Samsung
HEGP: Georges-Pompidou European Hospital (Hôpital européen Georges-Pompidou)
IMI: innovative medicines initiative
INDS: National Institute of Health Data (Institut national des données de santé)
INSERM: Institut national de la santé et de la recherche médicale
PMSI: medicalization of information systems programme (programme de médicalisation des systèmes d’information)
SHRINE: Shared Health Research Information Network
SNDS: national system of health data
SNIIRAM: French national health insurance information systems (Système national d’information interrégimes de l’assurance maladie)

Introduction

It is now possible to analyse several petabytes of data, provided one has access to powerful computers and appropriate algorithms. The real difficulty lies elsewhere. In all areas of activity, the citizens of the world neither fully control nor fully understand what is being collected, analysed and used, sometimes without their knowledge. Moreover, the global availability and dissemination of such data raise legal issues, with legislation differing from one country to another. Finally, some countries fund the development and operation of large databases, possibly with the support of private contributions, while others do not view such investments as a priority.

The healthcare field has also witnessed the explosion of massive data whose computer processing aims to provide new information that is useful to society and to the sick, while guaranteeing citizens (whether current or future patients) greater confidentiality and researchers improved access to data... two contradictory objectives!

The aim of this Round Table was to shed light on the concepts hidden behind the terms “big data” and “open data” and to underscore certain difficulties or delays affecting our country in this area, against a backdrop of fierce international competition, while also drawing attention to the methodological difficulties to be factored in so as to avoid reaching the kind of inappropriate conclusions which could undermine the development of this new tool for scientific research. The economic value of these data and their use was stressed, as indirectly evidenced by the growing interest of hackers in the medical world, the most glaring example being the recent hacking of Anthem, a major American insurance firm [1].

“Big data” and “open data”: definitions and scope

“Big data”

“Big data” is often characterized by the “4Vs”: volume, velocity, variety and veracity:
• volume: society is overwhelmed with growing volumes of data of all types, which are measured in terabytes, or even petabytes;
• velocity: in some cases (such as fraud detection), even massive data must be used in real time to be useful;
• variety: “big data” come in the form of structured or unstructured data (text, sensor data, sounds, images, etc.). New knowledge stems from the collective analysis of these data;
• veracity: use of data can only be envisaged if researchers have confidence in this information.

With these four characteristics, it is understandable that the analytical difficulty grows as the variety of data and the number of sources increase. The four parameters described above are therefore often supplemented by a fifth: complexity. For although variety might be an asset, it is also a source of complexity! For example, in the field of healthcare, in addition to specifically medical data, behaviour and messages by patients on social networks, data from telephone monitoring, data from connected objects, ecological and environmental data, and so on, all contribute to the wealth and complexity of analysis.

In the medical field, “big data” are arriving on the scene in massive numbers! Medical and administrative data, data from medical warehouses, which are currently booming (see below), drug prescription data, and data from cohorts or clinical trials are all now available. The availability of medical and administrative data is already a fact in France and in many other countries. For the rest of the data, there is still a lot of work to be done in many areas, especially in technical fields (databases, analytical methodology), in standardization and in terms of exploitation, so that they can fulfil their full potential in respect of scientific discovery [2].

“Open data”

“Open data” represent an international development dating back over 15 years, and concerning both medical and non-medical data. Editorials in leading medical journals have addressed the issue while highlighting the potential benefits of this development [3,4]:
• sharing of data produced or held by governments, public establishments, private companies, a community of patients... in order to meet a specific need which is in the public interest;
• cost of production or the advertised rate (possibly variable depending on the applicant) or participation in the base (imaging);
• accessibility with or without conditions, particularly financial;
• opening up of the format;
• reuse with or without a change of purpose.

Data sharing is not as simple as it might appear (IT questions, legal issues, etc.), but it is being addressed, particularly in the field of clinical trials, as evidenced, for example, by the report of the Institute of Medicine of the US National Academies [5].

Medical data warehouses: a data source in full development

Clinical data warehouses (CDW) or biomedical data warehouses (BDW) help to meet the needs of secondary exploitation of patient data and of their sharing in the various fields of research. Primarily implemented within university hospital establishments, these warehouses group together the patient data produced during healthcare, in a structured (biology, questionnaire, etc.) or unstructured (text, images, etc.) format. They originate from the information systems of university hospitals (electronic medical records) or from individual initiatives, which are often theme based. Their potential applications are exciting: identifying populations for cohort studies, pre-screening and screening for clinical trials, case-control studies, comparing subpopulations, etc. Clinical data are or will be associated with databases or bio-bank warehouses (containing tissues, deoxyribonucleic acid [DNA], serum, etc.). They are undergoing rapid transformation marked by:
• integration of “omics” data and data from clinical trials, leading to the development of platforms for translational research;
• openness to new data sources such as records and data from patient networks;
• the establishment of warehouse federations and the development of shared exploitation tools;
• the integration of data mining and display tools.

Like everything related to the reuse of health data, their exploitation raises questions of a legal, political, technical (interoperability, de-identification [6]) and qualitative nature. A data warehouse cannot be established without specialized tools. Among the operational tools, i2b2 (Informatics for Integrating Biology and the Bedside [7]), an “open source” tool, is the most widely deployed internationally. The Bordeaux University Hospitals and the Georges-Pompidou European Hospital (HEGP) in France have chosen this tool for their warehouse, as have more than 50 universities around the world and industrial players such as Johnson & Johnson and GE Healthcare. It is also worth mentioning a French application, ehop, developed by the University of Rennes 1 and the Rennes University Hospital, which is currently deployed in several university hospitals in western France.

Warehouse networks are being set up which rely on a query platform that runs standardized queries on the network’s warehouses, such as the Shared Health Research Information Network (SHRINE), a widely used tool in the USA, in order to group paediatric warehouses or conduct multicentre observational studies, particularly concerning autism and diabetes [8,9]. In Europe, the innovative medicines initiative (IMI) Electronic Health Records for Clinical Research (EHR4CR) [10] project follows the same approach, with a network of warehouses serving clinical research. The HEGP, and more broadly the Paris public hospital system (AP-HP), and the Rennes University Hospital are taking part in this European project (involving 10 industrial partners and 24 academic partners), which aims to facilitate the identification of hospital sites and patients for industrial clinical research projects via a platform permitting feasibility studies and pre-screening of patients.

The grouping of clinical data and omics data within warehouses has led to the development of platforms for translational research and personalized medicine, providing data display and analysis tools. One example is Transmart [11], an “open source” tool based on the i2b2 infrastructure [12].

Although still underdeveloped in France, warehouses have a promising future ahead of them [13]. However, besides the data de-identification/anonymization aspects, the obstacles to the efficient reuse of data from patient records concern interoperability, the quality of information, and matching in the absence of a unique identifier. After the clarifications to be provided by the Health Law, warehouses will constitute an additional source of information which could be cross-analysed with French national health insurance information systems (SNIIRAM) data, medical records as well as non-medical, environmental or socio-demographic data.
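The idea of running one standardized query across a network of warehouses for a feasibility study can be illustrated with a minimal Python sketch. It is not the SHRINE or EHR4CR interface: the `Patient` record, the `federated_count` function and the small-count suppression threshold `min_cell` are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Patient:
    """Hypothetical, already de-identified warehouse record."""
    age: int
    diagnoses: FrozenSet[str]  # e.g. ICD-10 codes


def federated_count(
    sites: Dict[str, List[Patient]],
    criterion: Callable[[Patient], bool],
    min_cell: int = 5,
) -> Dict[str, Optional[int]]:
    """Run the same eligibility criterion at every site and return counts only.

    Patient-level data never leave a site in this sketch. Counts below
    `min_cell` are reported as None; this suppression rule is an assumption
    made here to limit re-identification risk, not a rule from the article.
    """
    results: Dict[str, Optional[int]] = {}
    for site_name, patients in sites.items():
        n = sum(1 for p in patients if criterion(p))
        results[site_name] = n if n >= min_cell else None
    return results
```

A coordinating centre would then see, for each hospital, only whether enough eligible patients exist to justify opening a trial site, which is the kind of pre-screening use described above.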

Legislative aspects in France

France has been a pioneer when it comes to protection of the individual, with the creation of the data protection watchdog (CNIL) and the founding law in 1978, which inspired the European Directive of 1995. However, in a global landscape undergoing radical transformation, the situation of access to health data is currently marked by a number of weak points, including:
• lack of strategic management by the State;
• inconsistency and lack of legislation relating to the programme de médicalisation des systèmes d’information (PMSI: the French version of the diagnosis-related group [DRG]) and the SNIIRAM;
• lack of consistency in governance and access between data subject to Article IX of the Loi informatique et libertés/French Data Protection Act (research data) and data subject to Article X (medical and administrative data);
• obstacles to matching for researchers (decree of the Conseil d’État);
• lack of traceability concerning access.

The outlook, however, seems reassuring. Article 47 of the Health Law currently being debated by the French National Assembly and Senate lies at the heart of the reform process.

The proposed legislation creates a “national system of health data (SNDS)”, which will gather data from the SNIIRAM, including the medicalization of information systems programme (PMSI), data from the Epidemiology Centre on the Medical Causes of Death (CépiDC), medico-social data from the National Solidarity Fund for Autonomy (CNSA) and a representative sample of reimbursement data for supplementary health insurance. It creates a National Institute of Health Data (INDS), the future one-stop access point for data, whose objectives it defines.

Personal health data collected for other purposes may be processed for research, study or assessment presenting a public interest, in accordance with the Informatique et Libertés law of 6 January 1978. Such processing must not in any case be aimed at achieving the “direct or indirect identification of these persons”, except as provided for in law. The article defines the role of the “trusted third party”, an agency separate from the SNDS administrators, which will be tasked with retaining the identifiers (names, address, directory registration number or NIR of private individuals) in order to maintain their strict separation from health data.

The proposed legislation also addresses other key points:
• access regimes: permanent access and access as per the CNIL procedure;
• clarification of the governance and rules for access as per the CNIL procedure (expert committee and National Institute of Health Data), including the notion of public interest and prohibited purposes. This part of the text will clarify access to data by persons producing or marketing health products and health supplements;
• traceability of access;
• reference methodologies.
Neither the law nor its implementing decrees had been finalized at the time of writing this article, and the decrees are still pending publication, but useful information can be found in the following documents:
• Bras report (October 2013) [14];
• report of the open data commission (July 2014) [15];
• the study by the DREES (French department of research, surveys, evaluation and statistics) on the risks of re-identification (July 2015), which contains a summary presentation of article 47 [16];
• the Health Law adopted at its first reading by the National Assembly [17].
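The separation between identifiers held by a “trusted third party” and health data held elsewhere can be sketched as keyed pseudonymization. The Python sketch below is only an illustration under stated assumptions: the article does not describe how the SNDS implements this separation, the class names `TrustedThirdParty` and `HealthDataStore` are hypothetical, and the use of HMAC-SHA256 is a choice made here, not a description of the actual system.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List, Optional, Tuple


class TrustedThirdParty:
    """Hypothetical agency that alone holds identifiers and the secret key."""

    def __init__(self, key: Optional[bytes] = None) -> None:
        # The key never leaves this object; without it, pseudonyms cannot be
        # recomputed from a known NIR by the health data administrators.
        self._key = key if key is not None else secrets.token_bytes(32)
        self._identities: Dict[str, Tuple[str, str]] = {}

    def pseudonymize(self, nir: str) -> str:
        """Derive a stable pseudonym from the NIR with a keyed hash (HMAC)."""
        return hmac.new(self._key, nir.encode("utf-8"), hashlib.sha256).hexdigest()

    def register(self, nir: str, name: str) -> str:
        """Store the identity on the third-party side and return the pseudonym."""
        pseudonym = self.pseudonymize(nir)
        self._identities[pseudonym] = (nir, name)
        return pseudonym


class HealthDataStore:
    """Hypothetical health database: it only ever sees pseudonyms."""

    def __init__(self) -> None:
        self._records: Dict[str, List[dict]] = {}

    def add(self, pseudonym: str, record: dict) -> None:
        self._records.setdefault(pseudonym, []).append(record)

    def records_for(self, pseudonym: str) -> List[dict]:
        return list(self._records.get(pseudonym, []))
```

Because the same NIR always yields the same pseudonym under a given key, records from different sources can still be matched for research, while the table linking pseudonyms to names stays outside the health database. Pseudonymization of this kind does not by itself rule out indirect re-identification from the health data themselves.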

“Open data”: still insufficient availability

Non-medical data are increasingly available as open data, such as social, economic and ecological (pollution, temperature) data, all of which can contribute to improving the understanding of diseases and their treatment. The same will be the case for existing medical databases established for other purposes (PMSI, SNIIRAM, etc.), once the access conditions are taken into account. It is also important to emphasize the efforts of the French state to make all administrative data accessible through the open platform of French public data [18]. It seems highly likely that hospital data warehouses under construction will soon be available, subject to anonymization of data and the absence of objections from patients.

Finally, private data have already demonstrated their value (telephone operators or major web operators such as Google or Apple), but accessibility remains limited except in certain cases. All around the world, public and private initiatives are providing access to sometimes considerable volumes of data. As usual, the US leads the way, as shown for example by the list of reusable data proposed by the NIH [19,20]. A debate is under way in world literature on access to clinical trial data [21-23].

The discussion about access to clinical trial data in marketing authorisation (MA) dossiers of the European Medicines Agency (EMA) is a good illustration. Since 2007, all innovative medicines, whether chemical or biological, have been subject to a centralized European MA procedure. Such applications for European MAs (or marketing authorisation extensions) are filed with the EMA and the national agencies of the 28 member states of the European Union in digital form (DVD) according to the same electronic common technical document (eCTD) presentation standard. All dossiers are loaded on servers and are managed by the EURS software, which is common to the national agencies and the EMA. With the establishment of this database of MA dossiers and the EURS software, it is possible to gain access to all information contained in these dossiers in electronic format. The file of European MAs submitted in eCTD standardized electronic format since 2007 would appear to cover approximately 500 dossiers, including generic medicinal product dossiers. These dossiers include all data related to the manufacturing process and the pharmaceutical quality of the medicinal product as well as all clinical trial data (whether or not published, with unpublished data in the majority) produced during the different phases of development of a medicinal product.
The documents on clinical trials (contained in module 5 of the eCTD dossier) are presented in the form of standardized reports with details of all efficacy and safety data. The efficacy data are presented as text and tables with all subgroup analyses conducted on the various criteria (primary and secondary) and for the various study populations (population as a whole and various subgroups). These are therefore “aggregated” data for each trial; individual data for each trial are not accessible. Safety data as reported in each trial are shown in detail with the listing of all events and adverse reactions.

Access to these clinical data makes it possible to envisage a complementary approach to research, in order to extend analysis of the benefit/risk relationship according to the diseases targeted and the populations to be treated, and to answer questions concerning a medicinal product, a class of medicinal products, or a field of disease or therapy. By gaining access to data from several MA dossiers, researchers could therefore:
• perform analyses on subgroups that are small in each dossier (the elderly, people with renal insufficiency, pregnant women, etc.);
• cross-analyse the benefit/risk ratios of products in the same therapeutic class by comparing data for populations with the same characteristics, using an identical criterion for assessing efficacy or risk;
• conduct indirect comparisons between medicinal products with the same indication;
• study the sensitivity of evaluation criteria to change and the clinically minimally important difference (CMID);
• conduct meta-analyses.
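Two of the analyses listed above, pooling aggregated results across dossiers and indirectly comparing two products through a common comparator, can be sketched in a few lines of Python. The article does not name specific estimators: fixed-effect inverse-variance pooling and the adjusted indirect comparison (often attributed to Bucher) are choices made here for illustration, and the function names are hypothetical. Effects are assumed to be on an additive scale such as the log odds ratio, with independent trials.

```python
import math
from typing import Sequence, Tuple


def pool_fixed_effect(
    estimates: Sequence[float], std_errors: Sequence[float]
) -> Tuple[float, float]:
    """Inverse-variance (fixed-effect) pooling of aggregated trial effects."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


def indirect_comparison(
    d_a: float, se_a: float, d_b: float, se_b: float
) -> Tuple[float, float]:
    """Adjusted indirect comparison of A versus B via a common comparator.

    d_a and d_b are the effects of A and of B against the same comparator
    (for example placebo). Variances add because the trials are independent.
    """
    return d_a - d_b, math.sqrt(se_a ** 2 + se_b ** 2)


def confidence_interval(
    estimate: float, std_error: float, z: float = 1.96
) -> Tuple[float, float]:
    """Normal-approximation interval (z = 1.96 gives roughly 95%)."""
    return estimate - z * std_error, estimate + z * std_error
```

The widening of the standard error in `indirect_comparison` shows why such comparisons need many dossiers in the same indication before they become informative.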


After several years of discussion with the industry, the EMA introduced the principle of free access to all the data in MA dossiers filed from 1 January 2015 onwards, except for formulation and production data. The general public will have on-screen access in PDF format. Researchers who apply for a research project, while undertaking to refrain from pursuing any commercial objective, may receive all the efficacy and safety data in the form of reports and tabulated files. Given the timeframes for obtaining MAs, effective access can only begin in early 2016, and dossiers filed before 1 January 2015 will not be accessible. This position limits the immediate benefit of such access, to the extent that the main interest lies in comparing data between dossiers relating to the same indication. It will therefore be necessary to wait many years before the number of new dossiers is sufficient to allow meaningful comparisons to be made between them!

In another field, that of imaging, data sharing seems to be developing well. It is worth citing the examples of the Alzheimer’s Disease Neuroimaging Initiative (ADNI), a multicentre epidemiological study involving 55 research centres in the USA and Canada, and “The Cancer Imaging Archive” (TCIA), a freely accessible archive of medical images of cancers with attendant metadata (clinical and genomic). However, in this highly extensive field researchers face many technical problems:
• standardization limited by extremely rapid technological developments;
• archiving and traceability of data;
• need to store raw data in order to reuse them with new image processing techniques;
• possibility of recognition of patients despite de-identification;
• providing for the management of incidental findings.

Connected objects and e-health

Connected medical objects/devices are developing extremely rapidly and range from consumer objects (pedometer) to sophisticated medical devices (pacemaker). They are beginning to generate a considerable amount of data and, to avoid any confusion, it is imperative to classify connected objects in three broad categories:
• connected medical devices coming under telemedicine (defined by Article 78 of the hospital, patients, health and territories [HPST] law of 2009 and its Decree 1229 of October 2010), in other words those which incorporate functions which are able not only to collect data and technical and medical information but also, and above all, to process and transmit them remotely. Two examples of equipment already demonstrate that the objectives may differ even though the devices are tightly controlled by the same regulations: connected pacemakers, which could eventually send an alert requesting assistance, and connected insulin pumps, which enable remote monitoring of drug delivery, sometimes according to information obtained from other sensors (measurement of blood sugar levels). These are an integral part of the diagnosis and/or medical supervision. A mobile application may well be classified as a medical device due to the manner in which it is used;
• connectable “medical” devices are the “oldest” items of equipment, which have been given data transmission capabilities via a local internet connection or via a mobile application. Their quality (via standards such as medical CE marking) gives the professional user or patient the guarantee that the data provided by the device have the relevance and reproducibility expected for the use made of them. The communication protocol can be local, with another recording device, or directed towards servers offered by the manufacturers themselves, and these data should probably be integrated into the corpus of health data by centralization of information. Examples are connected CE medical scales, connected CE medical blood pressure monitors and connected glucometers;
• connected “wellness” objects, which do not meet the requirements of regulatory controls (including evaluation) or of reliability and which are not used within a health circuit. We might mention activity trackers, non-CE medical connected scales and mobile health applications, apart from those connected to the objects mentioned in the first two categories. These items of equipment can nevertheless bring real benefits. That is why we must support the manufacturers and developers of these solutions for the benefit of patients, many of whom already possess the connected health record hub of the future in the shape of the smartphone. As long as these standards are not achieved, the integration of these data remains debatable.
Connected health, with the new trend for connected objects and new uses around the “quantified self”, is becoming a strategic sector for leaders of new technologies such as Google, Apple, Facebook, Amazon, Microsoft and Samsung (GAFAMS). These companies are offering their services as “interfacers” for all activity sensors and mobile equipment used for self-monitoring, initially by individuals in good health but increasingly by at-risk populations (HealthBook from Apple compiling all information in the user’s smartphone) or by those already suffering from chronic diseases (HealthKit from Apple interfaced with some major US hospitals). This technology is tempting after the failure of online health record services that individuals had to fill in themselves, such as Google Health (2008-2012) and Microsoft HealthVault, because the collection and aggregation of data are automatic. However, two dangers must be highlighted:
• the validity of the devices must be tested, and this process is far from universal [24,25];
• moreover, authorities and users should be vigilant as to the terms and conditions of these services, respect for privacy and use of personal data, particularly when they leave the country.

Data processing

The effective processing of large volumes of data has always been a challenge. All aspects of data must be considered (collection, extraction, integration, quality analysis, interpretation, missing data, etc.) and appropriate statistical methods selected during processing and analysis [26]. Knowledge of these methods still seems insufficient, as recently emphasized by Sinha et al. [27], resulting in sub-optimal exploitation of large data sets!

The statistical methodology for such analyses is extremely varied. It includes conventional statistical meta-analysis techniques, allowing interaction studies between subgroups (depending on the characteristics of the product concerned or of the study population), and network meta-analyses for indirect comparisons between medicinal products in the same indication. However, as soon as the analysis concerns databases built for other purposes, researchers should be aware of the complexity of electronic health records containing data collected on a routine basis, and thus of a string of problems involving missing data, classification errors, inaccurate definitions of diseases, etc. If we add the large volume of these data and their heterogeneity (textual data, imagery, etc.), we can see that new skills are needed, especially programmers with expertise in transforming data into a format which is usable for research, such as a relational database structure. An IT infrastructure with secure servers and sufficient storage space will also be required.

In terms of data processing, techniques such as data mining, neural networks and learning techniques are used alongside methods for controlling the rate of false positives. It is important to be wary of the proliferation of statistical tests, which are rendered powerful by the large numbers involved in studies conducted using “big data” [28]. A sound methodology is therefore based on the definition of specifications for performing relevant extractions and on a review of the kinds of studies for which these data can or could be used and those where they pose a risk of bias. Research in health economics therefore seems highly promising [29,30], while comparative clinical research remains the realm of the controlled trial because of the indication bias that is found in all routine databases. However, “big data” concern marketed products which have already been subjected to clinical trials; yet only analysis under actual conditions of use makes it possible to study performance beyond a theoretical efficacy which has already been demonstrated. Methodologies exist to compensate for a certain number of biases and these are developing rapidly, helped by the power that comes with the volume of data.
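The need to control false positives when many statistical tests are run on the same large database can be illustrated with a standard multiple-testing correction. The article does not name a particular method: the Benjamini-Hochberg step-up procedure shown below, which controls the false discovery rate, is one common choice selected here for illustration, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each test, whether it is declared significant.

    Step-up procedure: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject the k hypotheses with the smallest
    p-values. Results are returned in the original order of `p_values`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_rank = rank

    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[index] = True
    return rejected
```

With four tests giving p-values of 0.01, 0.04, 0.03 and 0.5, a naive threshold of 0.05 would flag three associations, whereas this procedure keeps only the first, since 0.03 exceeds its threshold of 0.025 and 0.04 exceeds 0.0375.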

Conclusion and recommendations In Europe, the Nordic countries, for example, are scientifically recognized for the quality of databases, but also for the (relative) ease of access to such data. In these countries, the quality of databases for observational research are available for academic researchers, funders and pharmaceutical industries which support transparency and the high methodological standards of research. In the five Nordic countries (Denmark, Finland, Iceland, Norway and Sweden), the large-scale prescription databases covering the entire nation provide an opportunity to analyse medicinal product utilization without introducing selection bias. In the UK, the clinical practice research datalink (CPRD), a primary care database, is an extensive source of health data for research, and includes information on demographics, symptoms,

‘‘Big data’’ and ‘‘open data’’: What kind of access should researchers enjoy? tests, diagnostics, therapies, health behaviour and referral to secondary care. There are connection possibilities with other databases. The CPRD was used to produce more than 1000 studies covering a wide range of health problems. In this competitive European context, France is not lagging behind, either qualitatively or quantitatively, but it should accelerate access to available databases and create a context of continuous improvement when it comes to the quality of these bases. Through its work, the round table 6 was required to make several diagnoses that led it to issue the following recommendations: • the establishment of research projects on large individual databases in France requires the agreement of the CNIL whose fairly unpredictable delays of around 6 to 8 months, on average, are not compatible with project management constraints and not particularly competitive when viewed alongside the timeframes noted in other European countries. This situation is due to a rising level of requests for opinions submitted to the institution without the resources having been adjusted accordingly. On the one hand, the RT requests an increase in human resources allocated to projects in the health sector but also the introduction of benchmark methodologies which would facilitate the work of researchers submitting projects and make the assessment by the CNIL less cumbersome; • the future Institut national des données de santé which should shortly be established by article 47 of the Health Modernisation Law must be at the service of research and public health (like IDS, its predecessor). According to its legal conception, it is important to allocate the resources and time necessary for effective implementation. The round table emphasizes, in particular, that the success of this institute requires the role played by its experts to be highlighted. 
Furthermore, Article 47 stipulates that only “public interest” studies will be approved by the Institute and allowed to proceed. The round table recommends that this concept of “public interest” be defined as clearly as possible and proposes that the players involved (institutions, academics, industry) jointly develop an operational definition and framework. One suggestion is to give this notion of public interest concrete form by making the results of such studies available to the public;
• medicalization of PMSI data (the PMSI being the French version of the DRG system): the addition of new variables, such as the side of the disease where appropriate and the NYHA classification, is essential to analyse the effect of certain medicinal products or medical devices. The group suggests creating a user group (physicians in charge of PMSI management in hospitals, researchers, industry representatives) responsible for defining new needs other than financial ones. Some of these points would require altering the diagnosis codes (ICD-10), which seems difficult, or, more likely, a closer link with data from medical records, for example by interfacing national health data with the medical data warehouses mentioned above;
• nomenclatures: an update is needed as quickly as possible. This is possible for the nomenclature of medical devices. The update procedure for the CCAM (the French classification of medical procedures) should also be accelerated;

• conducting research projects on “big data” in the health sector requires multiple skills, which must be brought in from the medical field but also from people trained in science or engineering. The French higher education system must adapt to this new challenge. To measure the match between needs and the training of future statisticians, computer scientists and “big data” researchers, the round table recommends a survey of available training and the creation of new curricula. A new profession is emerging, that of “data scientist”, in other words a specialist in the processing of massive data, combining skills in computing, statistics and data mining;
• the health insurance database, or SNIIRAM, is a perfect example of “big data” that could soon move towards open data (thanks to Article 47 of the Health Modernisation Law) and is a valuable and unique source for answering many scientific questions. The future growth of scientific research based on this database requires a simple, easily understood data model. The round table recommends simplifying the structure of the SNIIRAM data model and providing documentation to assist researchers, along the lines of WIKI-SNIIRAM. Providing a more powerful version of the general sample of beneficiaries (EGB), that is, a 10% sample of national health insurance beneficiaries instead of 1%, but with the simplified structure and access procedures of the EGB, would appear to be an intermediate solution that could meet the majority of needs, relying on an existing simplified access model that does not involve the CNIL and is easily extensible.
This would reserve the full national database for questions that cannot be resolved in the 1% or 10% samples, relieving both the CNIL and the DEMEX department of the SNIIRAM, which is responsible for extractions: processing or extraction in the complete database requires computing power that is difficult to access, and the complexity of the database (which remains necessary for internal use at the Caisse nationale d’assurance maladie des travailleurs salariés [CNAMTS]) means that processing is limited to experts. The real risk of re-identification in the full national database will probably lead the CNIL to take a cautious approach to full opening up;
• matching of databases: the procedure should be simplified and implementation timeframes shortened (see Health Law). The use of the NIR, after authorization by the CNIL, will be possible for research projects. Matching might be outsourced to a trusted third party. The decrees implementing the Health Law should specify the practical arrangements;
• future research developments, which will result in the creation of open health databases, should lead us to consider a form of organization of, and access to, databases that is simple and operational over time. It is not conceivable, for instance, that a database intended eventually to be shared should be the property of its scientific council, nor that its sustainability should be frequently undermined by a lack of resources or will on the part of the players. The round table recommends building on the contractualisation method of the Institut national de la santé et de la recherche médicale (INSERM), which ensures operational access to open databases in the long term.
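By way of illustration of the database matching recommendation, the following is a minimal sketch of how a trusted third party might link two sources on a pseudonym derived from an identifier such as the NIR, so that the linked output no longer carries the identifier itself. It is an assumption throughout and not the procedure laid down by the Health Law or the CNIL: the keyed-hash approach (HMAC-SHA256), the function names `pseudonymize` and `link_records`, the key and all sample records are hypothetical and fictitious.

```python
import hashlib
import hmac


def pseudonymize(nir: str, secret_key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256) of an identifier. Hypothetical helper."""
    normalized = nir.replace(" ", "").strip()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(left, right, secret_key):
    """Join two lists of (identifier, payload) pairs on the pseudonym only."""
    right_index = {pseudonymize(ident, secret_key): payload for ident, payload in right}
    linked = []
    for ident, payload in left:
        pseudonym = pseudonymize(ident, secret_key)
        if pseudonym in right_index:
            # The output keeps the pseudonym, never the raw identifier.
            linked.append((pseudonym, payload, right_index[pseudonym]))
    return linked


# Fictitious identifiers and payloads, for illustration only.
KEY = b"held-by-the-trusted-third-party"
hospital = [
    ("1 85 05 78 006 084 36", {"nyha_class": 2}),
    ("2 90 11 75 123 456 78", {"nyha_class": 3}),
]
claims = [("185057800608436", {"drug_boxes": 12})]

linked = link_records(hospital, claims, KEY)
```

In this sketch only the party holding the key can recompute a pseudonym from an identifier, which is one reason such a step is commonly delegated to a third party that is independent of the researchers; the practical arrangements remain those to be specified by the implementing decrees.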

Disclosure of interest

The authors declare that they have no competing interest.

References

[1] Le Figaro. Les données de santé attirent les hackers; 2015 http://sante.lefigaro.fr/actualite/2015/02/13/23393-donneessante-attirent-hackers [Accessed December 23rd, 2015].
[2] Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA 2014;311:2479–80.
[3] Walport M, Brest P. Sharing research data to improve public health. Lancet 2011;377:537–9.
[4] Godlee F. Goodbye PubMed, hello raw data. BMJ 2011;342 http://www.bmj.com/content/342/bmj.d212 [Accessed December 23rd, 2015].
[5] Institute of Medicine of the National Academies. Sharing clinical trial data. Maximizing benefits, minimizing risks; 2015 http://iom.nationalacademies.org/~/media/Files/Report%20Files/2015/SharingData/CompleteRecommendations.pdf [Accessed December 23rd, 2015 (4 pages)].
[6] Grouin C, Névéol A. De-identification of clinical notes in French: towards a protocol for reference corpus development. J Biomed Inform 2014;50:151–61.
[7] https://www.i2b2.org [Accessed December 23rd, 2015].
[8] Weber GM, Murphy SN, McMurry AJ, Macfadden D, Nigrin DJ, Churchill S, et al. The Shared Health Research Information Network (SHRINE): a prototype federated query tool for clinical data repositories. J Am Med Inform Assoc 2009;16:624–30.
[9] McMurry AJ, Murphy SN, MacFadden D, Weber G, Simons WW, Orechia J, et al. SHRINE: enabling nationally scalable multisite disease studies. PLoS One 2013;8(3):e55811.
[10] http://www.ehr4cr.eu [Accessed December 23rd, 2015].
[11] http://transmartfoundation.org [Accessed December 23rd, 2015].
[12] Canuel V, Rance B, Avillach P, Degoulet P, Burgun A. Translational research platforms integrating clinical and omics data: a review of publicly available solutions. Brief Bioinform 2015;16(2):280–90.
[13] Coorevits P, Sundgren M, Klein GO, Bahr A, Claerhout B, Daniel C, et al. Electronic health records: new opportunities for clinical research. J Intern Med 2013;274(6):547–60.
[14] Bras PL. Rapport sur la gouvernance et l’utilisation des données de santé; 2013 http://www.drees.sante.gouv.fr/IMG/pdf/rapport-donnees-de-sante-2013.pdf [Accessed December 23rd, 2015 (128 pages)].
[15] Commission open data en santé; 2014 http://www.drees.sante.gouv.fr/IMG/pdf/rapport_final_commission_open_data-2.pdf [Accessed December 23rd, 2015 (63 pages)].
[16] Données de santé : anonymat et risque de ré-identification; 2015 http://www.drees.sante.gouv.fr/IMG/pdf/dss64-2.pdf [Accessed December 23rd, 2015 (103 pages)].
[17] http://www.assemblee-nationale.fr/14/ta/ta0505.asp [Accessed December 23rd, 2015].
[18] https://www.data.gouv.fr/ [Accessed December 23rd, 2015].
[19] https://www.nlm.nih.gov/ [Accessed December 23rd, 2015].
[20] https://www.nlm.nih.gov/hsrinfo/datasites.html [Accessed December 23rd, 2015].
[21] Eichler HG, Petavy F, Pignatti F, Rasi G. Access to patient-level trial data: a boon to drug developers. N Engl J Med 2013;369:1577–9.
[22] Koenig F, Slattery J, Groves T, Lang T, Benjamini Y, Day S, et al. Sharing clinical trial data on patient level: opportunities and challenges. Biom J 2015;57(1):8–26.
[23] Bhattacharjee Y. Biomedicine. Pharma firms push for sharing of cancer trial data. Science 2012;338(6103):29.
[24] El-Amrawy F, Nounou MI. Are currently available wearable devices for activity tracking and heart rate monitoring accurate, precise, and medically beneficial? Healthc Inform Res 2015;21(4):315–20.
[25] Lee J, Finkelstein J. Activity trackers: a critical review. Stud Health Technol Inform 2014;205:558–62.
[26] Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP. Computational solutions to large-scale data management and analysis. Nat Rev Genet 2010;11:647–57.
[27] Sinha A, Hripcsak G, Markatou M. Large datasets in biomedicine: a discussion of salient analytic issues. J Am Med Inform Assoc 2009;16(6):759–67.
[28] Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol 2015;13(3):e1002106.
[29] Collins B. Big data and health economics: strengths, weaknesses, opportunities and threats. Pharmacoeconomics 2015 [Epub ahead of print].
[30] Wettermark B, Zoëga H, Furu K, Korhonen M, Hallas J, Nørgaard M, et al. The Nordic prescription databases as a resource for pharmacoepidemiological research: a literature review. Pharmacoepidemiol Drug Saf 2013;22(7):691–9.