Nuclear Instruments and Methods in Physics Research A 559 (2006) 26–30
Towards the operation of INFN Tier-1 for CMS: Lessons learned from CMS Data Challenge (DC04)

D. Bonacorsi

INFN-CNAF Tier-1 and CMS Bologna, Viale Berti-Pichat 6/2, 40127 Bologna, Italy
(on behalf of the INFN-CNAF Tier-1 staff and the CMS DC04 Task Force)

Available online 15 December 2005
Abstract

The CMS Data Challenge (DC04) was devised to test several key aspects of the CMS computing model, and its outcome provided a deeper insight into the issues most crucial for successful Tier-1 operation with real data within the overall CMS computing infrastructure. In particular, at the Italian Tier-1 centre located at CNAF, several improvements were implemented in the year following DC04, concerning data management, the data distribution system based on the CMS PhEDEx tool, the coexistence of traditional local farm operations with official Grid-based CMS Monte Carlo production, the development and use of tools granting distributed users efficient access to CMS data via Grid services, long-term local archiving and custodial responsibility (e.g. MSS with a Castor back-end), and the daily CMS operations on Tier-1 resources shared among LHC (and other) experiments. The outcome of CMS DC04, as well as the CMS use of INFN-CNAF Tier-1 resources, is briefly reviewed and discussed, yielding indications for a roadmap towards the operation of the regional centre when real data from the LHC become available.
© 2005 Elsevier B.V. All rights reserved.
1. Introduction
The Compact Muon Solenoid (CMS) [1] is one of the four High Energy Physics (HEP) experiments that will examine p–p collisions at the CERN Large Hadron Collider (LHC). The operation of CMS requires infrastructure and manpower resources larger than those of previous HEP experiments, partly because of the large amount of data generated per year and the complexity of the software used to produce and analyze such data, and partly because of the geographically distributed nature of the CMS collaboration. Planning to meet these requirements has led to the development of a distributed computing model [2] comprising solutions for data creation, distribution, management and analysis. This model is tested by undertaking a series of computing challenges of increasing complexity, each of which tests the maturation of the collaborating groups and computing centres. This paper details the experience at one such centre, the INFN Tier-1, during a recent challenge, Data Challenge 04 (DC04).
2. Tier-1 resources and services for CMS

The INFN Tier-1 centre, located at CNAF in Bologna, Italy, offers computing facilities to the INFN HEP community and is one of the main nodes of the GARR network. It is a multi-experiment Tier-1, hosting computing resources in use by the LHC experiments (ALICE, ATLAS, CMS, LHCb) and by other collaborations (AMS, Argo, BaBar, CDF, Magic, Pamela, Virgo, etc.). The centre aims to provide a dynamic share of resources to all the experiments involved. To achieve this, INFN maintains a significant Grid presence, participating in the LCG [3], EGEE [4] and INFN-GRID [5] projects, supporting R&D activities and developing middleware prototypes and components. Currently, the INFN Tier-1 CPU power on non-legacy farms is about 1300 kSI2k, plus a few dozen servers, all dual-processor boxes with hyper-threading enabled, in use by all Tier-1 experiments on a fair-share basis. The usage of storage at the Tier-1 is driven by the requirements of LHC data processing at the centre, i.e. simultaneous access to PBs of data from 1000 nodes at high rate. The centre's main focus is on robust,
load-balanced, redundant solutions granting efficient and stable data access to distributed users. The overall Tier-1 disk storage capacity is organized in 4 NAS (Network Attached Storage) systems (about 20 TB) and 2 SAN (Storage Area Network) systems (about 240 TB). The storage resources are allocated to the experiments according to policies defined at the Tier-1 level, while the storage usage patterns derive from the specific experiments' use-cases. Currently CMS uses disk space to host and serve simulated data to the CMS analysis community. An HSM (Hierarchical Storage Management) software system, Castor [6], is operational at the centre, currently consisting of a StorageTek L180 library with 100 LTO-1 tapes (100 GB native) and a StorageTek L5500 library equipped with 5500 mixed slots in a hybrid set-up, with 6 IBM LTO-2 drives serving 1300 LTO-2 tapes (200 GB native) and 2 STK 9940B drives (200 GB native), for a total tape capacity of approximately 400 TB [7]. More tape resources are planned for installation in Q3–Q4/2005. CMS uses tapes as a reliable, affordable means of storing large volumes of data and of addressing the Tier-1 data custodial responsibility required by the CMS computing and analysis model [2]. Network services are implemented as a Tier-1 LAN with rack Fast-Ethernet switches with two 1 Gbps uplinks to the core switch (an upgrade to rack Gigabit switches is foreseen for Q3–Q4/2005) and a 1 Gbps Tier-1 link to the WAN (an upgrade to 10 Gbps is foreseen for Q3–Q4/2005), plus 1 additional Gbps link for the LHC experiments' participation in the imminent LCG Service Challenge efforts [8].
Recently, the INFN Tier-1 went through a software migration process in several areas of operation. On the farming side, the Tier-1 migrated all boxes to: (i) a new operating system (from RedHat to Scientific Linux SLC 3.0.4); (ii) a new LCG middleware version (LCG 2.4.0 on Computing Elements, Worker Nodes and Storage Elements, with subsequent upgrades up to LCG 2.6.0 on, e.g., brokering services); (iii) a new installation/management tool (from LCFGng to Quattor 1.1.0 [9], with LCG integration); (iv) a new scheduler (from Torque/Maui to LSF 6.1 [10], with LCG interfacing). This overall intervention was managed by 2 Tier-1 FTEs in close collaboration with the experiments' experts at the site, and took about three months. The process required thorough testing during the migration periods, also with the experiments' applications, to verify basic functionality and to ensure coherent LCG interfacing.

3. CMS DC04 and the role of INFN Tier-1

During March–April 2004, CMS undertook a large-scale data challenge (DC04) with the aim of validating its computing infrastructure on a sufficient number of Tier-0/1/2 sites. In a preliminary, large Pre-Challenge Production (PCP) of simulated events in 2003/2004, the Monte Carlo data
needed as input for DC04 were produced, using both traditional methods and Grid prototypes (CMS/LCG-0, LCG-1, Grid3). DC04 then focused on the reconstruction and analysis of CMS data sustained over two months at 5% of the LHC rate at full luminosity (i.e. 25% of the start-up luminosity). This translated into the following challenge targets: (i) sustain a 25 Hz reconstruction rate in the Tier-0 farm; (ii) register data and metadata in a world-readable catalogue; (iii) distribute reconstructed data from the Tier-0 to Tier-1/2 sites; (iv) perform data analysis at selected remote sites as reconstructed data arrive; (v) monitor and archive information on resources and processes. CMS DC04 was hence notably not just a "CPU challenge" but an effort aimed at demonstrating the feasibility of a full chain, from data handling at CERN to distributed data analysis at the Tiers acting as nodes in the data distribution topology (see for example [11] and references therein).
The PCP effort, started in July 2003 and continued in 2004, coordinated the CMS Regional Centres and satisfied physicists' requests for more than 70 million simulated events, 20 million of them with Geant4. At several centres, including the INFN Tier-1, the last step of the production cycle (digitization) continued in April 2004 despite the start of DC04, in order to feed the CERN buffers with enough data to potentially match the DC04 25 Hz input rate for more than one month. About 750 k production jobs ran at CMS sites, for 3500 kSI2k-months, storing about 100 TB of simulated data. The Data Challenge demonstrated that the overall chain, comprising reconstruction, data distribution and data analysis, is able to run at 25 Hz.
CMS DC04 data and metadata were registered in the EDG Replica Location Service (RLS) [12] at CERN, which provided replica catalogue functionality for all data distribution chains in the challenge layout. Performance issues emerged in DC04 concerning the use of this catalogue: the Local Replica Catalogue (LRC) component of the RLS, used as a global file catalogue, was found capable of providing a physical filename lookup service only when invoked from C++ applications, while the Replica Metadata Catalogue (RMC) component of the RLS, used as a global metadata catalogue to store the POOL [13] file attributes and to look up logical collections of files, exhibited severe performance and scalability issues in both inserting and querying information.
A variety of data transfer strategies were deployed in DC04. The differences in performance among these strategies are typically related to Tier-1 operational choices: the LCG-2 Replica Manager at the INFN and PIC Tier-1s; the Storage Resource Broker (SRB) at the GridKa, IN2P3 and RAL Tier-1s; a native Storage Resource Manager (SRM) with dCache at the FNAL Tier-1. In particular, the INFN Tier-1 obtained good overall performance by exploiting the LCG-2 software, successfully meeting the DC04 requirements with a combination of LCG-2 and custom CMS software. Transfer rates to CNAF reached a sustained rate of more than 30 MB/s and peaked at 42 MB/s for 5 h during a
large file-size transfer test undertaken at the end of DC04 [11]. Reconstructed data were analyzed as they arrived at the Tier-1 and at smaller centres. At the INFN Tier-1, LCG components were used to provide job submission, monitoring and output-gathering services to help automate this on-arrival analysis. Real-time analysis at the Tier-2s was demonstrated to be possible: about 15 k jobs were submitted, and the latency between data availability at the detector facility, i.e. the Tier-0, and the submission of jobs analyzing such data at the Tier-2s was low, at approximately 20 minutes [11]. Operating conditions peculiar to DC04 caused problems: for example, a low ratio of events to distributed files led to an inefficient use of the data transfer bandwidth, made the start-up of transfer commands dominate over their execution time, and exposed severe scalability issues in the MSS system.
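To make the bandwidth point concrete, the short Python sketch below models a transfer in which every file pays a fixed start-up cost before data starts to flow; the overhead and link-rate numbers are purely illustrative assumptions, not DC04 measurements.

# Illustrative model only: effect of a fixed per-file start-up cost on the
# effective transfer bandwidth. All numbers are assumptions, not DC04 figures.

PER_FILE_OVERHEAD_S = 5.0   # assumed start-up cost of each transfer command (s)
LINK_RATE_MB_S = 30.0       # assumed raw link rate once data is flowing (MB/s)

def effective_rate(file_size_mb: float) -> float:
    """Effective MB/s once the per-file overhead is accounted for."""
    transfer_time_s = file_size_mb / LINK_RATE_MB_S
    return file_size_mb / (PER_FILE_OVERHEAD_S + transfer_time_s)

if __name__ == "__main__":
    for size_mb in (10, 100, 1000, 2000):
        print(f"{size_mb:5d} MB files -> {effective_rate(size_mb):5.1f} MB/s effective")

With these assumptions a stream of 10 MB files exploits only a small fraction of the link, while multi-GB files approach the nominal rate; this is the same effect, in miniature, that motivates the data-merging considerations discussed in the next section.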
4. Tier-1 operational issues from DC04

The INFN Tier-1 was successfully operated by CMS within PCP/DC04 and achieved the primary goals of the challenge. Possible improvements were identified on many aspects of the challenge, providing a roadmap for CMS post-DC04 activity planning at the Tier-1. Although a data challenge is by definition an experiment-specific effort, its post-mortem may yield more general recommendations. The CMS experience during and after DC04 showed that most experiment-specific issues, when working in a Tier-1 whose resources are shared among experiments, may be addressed more effectively if perceived as general Tier-1 issues, potentially affecting other experiments relying on the same hardware resources and on similar software components. This approach stresses a synergy between experiment-specific choices and Tier-wide policies on key operational issues. A constant, close interaction among all Tier-1 actors and the role of the experiments' experts at the Tier-1 emerge as crucial points for successful Tier-1 operations.
An example is the CMS data transfer activity at the Tier-1 using the Castor HSM system for bidirectional disk-tape transfers. The Castor solution proved quite satisfactory under "normal" conditions: when not stressed, the Castor system performed well and its components, such as the stager, proved acceptably stable. Nevertheless, weaknesses were found under "challenging" operating conditions such as those of DC04. Performance degradation of the tape drives was observed when large numbers of relatively small files were transferred at a high rate to the Tier-1 disk buffer. Tape space was used inefficiently in write mode (tapes marked read-only or disabled), and slow access to tapes was experienced in read mode (tape failures and locking during stage-in requests for logical data requiring the read-out of many pieces of information positioned randomly on tapes) [7,11].
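The slow read access is easiest to picture as an ordering problem: requests arriving in random order touch the cartridges back and forth, forcing repeated mounts and long seeks. The toy Python sketch below (not Castor code; tape labels and positions are invented) simply counts how many mounts a given service order implies, comparing arrival order with requests grouped by tape and sorted by position.

# Toy model of tape stage-in ordering; not Castor code.
# Each request is (tape_label, position_on_tape); values are invented.
from itertools import groupby

requests = [("VT0012", 830), ("VT0007", 12), ("VT0012", 15),
            ("VT0007", 640), ("VT0012", 410), ("VT0007", 90)]

def mounts_needed(ordered_requests):
    """Count tape mounts when requests are served in the given order."""
    return sum(1 for _tape, _group in groupby(ordered_requests, key=lambda r: r[0]))

arrival_order = requests
grouped_order = sorted(requests)  # group by tape label, then by position

print("mounts, arrival order     :", mounts_needed(arrival_order))  # 6 mounts
print("mounts, grouped and sorted:", mounts_needed(grouped_order))  # 2 mounts

Data written as many small, scattered files makes the "arrival order" pattern the norm, which is why both more robust drives for random access and write-side merging, discussed below, help.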
After DC04, the layout and set-up of the overall Castor system at the Tier-1 were closely revised, and the experiments using Castor faced the problem from different perspectives. For example, the LHCb experiment achieved acceptable Castor performance by triggering ad hoc policies of preliminary data stage-in to a temporarily expanded disk buffer during LHCb data analysis periods. The CMS experiment, on the other hand, redesigned its use of the existing Castor infrastructure by adding a disk-only "Import Buffer" Storage Element layer as a front-end to the data distribution system, with the aim of preventing any Castor migration policy from being triggered automatically and of allowing any desired data-merging policy to be applied first. This set-up has already been designed and successfully deployed at the INFN Tier-1 in the post-challenge CMS activities, and will be tested thoroughly in the immediate future. Concerning read operations from tape, the experiments at the INFN Tier-1 will also profit from the recently acquired STK 9940B drives, expected to be more reliable than the IBM LTO-2 drives for random data access [7].
Some other basic Tier-1 topics are currently being addressed by joining the Tier-1 staff expertise with the skills developed by the experiment personnel at the Tier-1 in running their challenges and in managing daily operations. A few examples are: (i) the need to design and operate experiment-specific test-suites, based on fake production/analysis jobs, to be run on the Tier-1 shared farm whenever there is a need to check the actual usability of resources, e.g. after a failure recovery or after a large software migration; (ii) the need to use pools of disk servers (LCG Storage Elements in a DNS load-balanced set-up) to provide scalable data serving to the experiments; (iii) the need for redundant monitoring/accounting systems at the Tier-1, able to archive information on metrics also defined by the experiments, and especially focused on data access patterns; (iv) the need to evaluate services, such as database services, which may be crucial to the experiments, by stress-testing them under real operating conditions, thus revealing limitations and possibly suggesting optimal service configurations; (v) the need to acquire the ability to create working groups (not necessarily drawn from the Tier-1 personnel only) of limited duration, focused on, e.g., specific hardware testing or the evaluation and comparison of software products and middleware components.
The need for a reliable dataset replication tool on top of unreliable data transfer mechanisms emerged strongly in CMS from the DC04 experience, and was later addressed by PhEDEx [14] (Physics Experiment Data Export). PhEDEx is a data distribution management system that allows CMS to manage the allocation and transfer of physics datasets on bidirectional routes among the Tiers registered in a dynamic transfer topology. Its architecture is based on a coherent set of loosely coupled software agents, inter-operating and communicating through a blackboard, the Transfer Management Database (TMDB).
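As a rough illustration of the blackboard pattern underlying PhEDEx (loosely coupled agents that coordinate only through a shared database, never directly with each other), the Python sketch below uses an in-memory SQLite table in place of the real TMDB; the table layout, the state names and the site label are hypothetical simplifications, not the actual PhEDEx schema or code.

# Conceptual sketch of agents coordinating through a blackboard database.
# SQLite stands in for the TMDB; schema, states and site names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE transfer_queue (
                  id       INTEGER PRIMARY KEY,
                  filename TEXT,
                  dest     TEXT,
                  state    TEXT DEFAULT 'assigned')""")

def allocate(filename, dest):
    """Allocation side: post a replication task on the blackboard."""
    db.execute("INSERT INTO transfer_queue (filename, dest) VALUES (?, ?)",
               (filename, dest))
    db.commit()

def site_agent(dest):
    """Site-side agent: pick up tasks for its site and record the outcome."""
    rows = db.execute("SELECT id, filename FROM transfer_queue "
                      "WHERE dest = ? AND state = 'assigned'", (dest,)).fetchall()
    for task_id, filename in rows:
        # ... a real agent would drive the actual transfer tool here ...
        db.execute("UPDATE transfer_queue SET state = 'done' WHERE id = ?", (task_id,))
    db.commit()
    return len(rows)

allocate("/DC04/datasetA/file001.root", "INFN_Tier1")
allocate("/DC04/datasetA/file002.root", "INFN_Tier1")
print(site_agent("INFN_Tier1"), "file(s) handled")   # -> 2 file(s) handled

The property the sketch tries to convey is that adding a node to the transfer topology amounts to pointing one more agent at the same blackboard, which is what makes the topology dynamic and the agents independently restartable.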
PhEDEx has now been in production use by CMS for over one year, and today exploits a transfer topology of 6 Tier-1 and 16 Tier-2/3 sites, handling more than 100 TB of data (replicated twice on average). The INFN Tier-1 has been a PhEDEx node since the beginning, and a sustained transfer rate of 1 TB/day from the Tier-0 was demonstrated to be feasible. After the intense data traffic from the Tier-0 to the INFN Tier-1 in Q3–Q4/2004 (see Fig. 1), in 2005 the Tier-1 started to drive the PhEDEx deployment within INFN. Today the tool is operational within INFN at the Tier-1, 4 Tier-2 sites and 1 Tier-3 site. Recently, the Tier-1 joined an advanced LCG Service Challenge phase [8], and within this effort CMS plans to address all data distribution issues with PhEDEx as a production-quality, reliable and scalable replica management layer.
The CMS analysis task is moving fast from the "controlled" and "fake" DC04 scenario towards the "unpredictable" and "real" CMS analysis use-cases. The new scenario imposes stringent requirements on data quality checks, and requires considerable expertise in resource management and process handling to efficiently face all the issues arising from such data access patterns (see Fig. 2). All CMS data hosted at the INFN Tier-1 must pass a full and extensive validation procedure aimed at checking data integrity before the data are exposed to any end-user analysis. The procedure builds local file catalogues by handling data and metadata and by running standard analysis executables, and publishes data properties and validated information into a local PubDB [15] at CNAF, which, together with RefDB [16], currently addresses the data discovery task in CMS. Once validated and published, the data can be accessed via Grid services by preparing analysis jobs and executing them remotely on the desired data in a distributed environment. In particular, the INFN Tier-1 is a full LCG-2 site; in addition it hosts 2 CMS-dedicated Resource Brokers (and the related information sources) and offers CMS support to any analysis activity on-going at the site. CRAB [17] is the workload management tool responsible for job preparation, splitting and submission, and was widely tested also at the INFN Tier-1.
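As an illustration of the kind of integrity check performed before publication, the sketch below walks a hypothetical local catalogue and verifies that every file exists with the expected size; the real procedure is richer, since it also runs standard analysis executables and publishes the validated information into PubDB, which is not reproduced here.

# Minimal sketch of a pre-publication dataset integrity check.
# The catalogue content, paths and sizes are hypothetical examples.
import os

# Hypothetical local catalogue: logical file name -> (local path, expected size in bytes)
catalogue = {
    "datasetA/reco_001.root": ("/storage/cms/datasetA/reco_001.root", 2_000_000_000),
    "datasetA/reco_002.root": ("/storage/cms/datasetA/reco_002.root", 1_950_000_000),
}

def validate(catalogue):
    """Return a list of problems; an empty list means the dataset can be published."""
    problems = []
    for lfn, (path, expected_size) in catalogue.items():
        if not os.path.exists(path):
            problems.append(f"missing file: {lfn}")
        elif os.path.getsize(path) != expected_size:
            problems.append(f"size mismatch: {lfn}")
    return problems

issues = validate(catalogue)
if issues:
    print("dataset NOT publishable:", *issues, sep="\n  ")
else:
    print("dataset passed basic checks; ready for publication to the local PubDB")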
Fig. 1. CMS PhEDEx data traffic (TB by week) from the Tier-0 to the Tier-1 sites (RAL, PIC, INFN, IN2P3, FZK, FNAL) in Q4/2004 and part of Q1/2005.
Fig. 2. CMS Grid job occupancy on the INFN Tier-1 farm in one sample week in June 2005 (Monday 20/06 to Tuesday 28/06). Blue color (bottom) depicts running jobs, green color (top peaks) depicts waiting jobs.
Several systems are in place, and are still being improved, to perform CMS job monitoring [18]. At the INFN Tier-1, a test-suite based on CRAB jobs for all the datasets residing at CNAF is being written; it can be scheduled to run on demand at the site by the CMS personnel to check catalogue integrity and the accessibility of CMS resources, thus allowing prompt problem discovery and troubleshooting and hence reducing the risk of high failure rates in massive submissions of real analysis jobs. This model, already in use, will evolve in significant steps towards the system described in the CMS computing design [2].

5. Conclusions

In this paper, we stress that direct participation in operating a centre at full scale is crucial to bringing the centre's operations and procedures to the LHC scale. Despite being quite a young centre, the INFN-CNAF Tier-1 is ramping up towards stable, production-quality services, also through the lessons learned from its participation in experiment data challenges. The successful involvement of the Tier-1 in the recent CMS PCP/DC04 effort derived from the capability to address different responsibilities at the same time, to meet well-defined time constraints and to maintain a constant and fruitful interaction between the Tier-1 staff and the CMS personnel at the centre. Current operational issues include resource management, the Grid interfacing of CMS tools and Tier-1 components, deep involvement in several aspects of the overall CMS data distribution system and a full set of activities specifically devoted to addressing the needs of the CMS analysis community. Most problems that will be encountered in the immediate future in a Tier-1 "shared" among experiments may be more effectively addressed if faced not as CMS-only issues but as more general Tier-1 topics, hence
triggering a synergy among all the actors at the Tier-1 level and in CMS. INFN will join the next LCG Service Challenge steps, also running CMS applications.

Acknowledgements

We would like to acknowledge the INFN-CNAF Tier-1 staff for their competent work and support, the CMS DC04 Task Force and all Tier-0/1/2 site managers for their invaluable efforts, the PhEDEx team for their proficient work and excellent team management, the CERN IT/DB and IT/FIO divisions for their help and support, and the LCG Deployment team.

References

[1] CMS experiment, <http://cmsdoc.cern.ch>.
[2] CMS Computing Technical Design Report, CERN/LHCC-2005-023, June 2005.
[3] LCG project, <http://www.cern.ch/lcg>.
[4] EGEE project, <http://public.eu-egee.org>.
[5] INFN-GRID project, <http://grid.infn.it>.
[6] Castor, <http://castor.web.cern.ch/castor>.
[7] P. Ricci, et al., Storage resources management and access at Tier-1 CNAF, this conference.
[8] I. Bird, et al., Deploying the LHC Computing Grid - The LCG Service Challenges, International Symposium on Emergence of Globally Distributed Data, Sardinia, Italy, June 2005.
[9] Quattor, <http://www.quattor.org>; LCFGng, <http://www.lcfg.org>.
[10] LSF, <http://www.platform.com>.
[11] D. Bonacorsi, et al., Role of Tier-0, Tier-1 and Tier-2 Regional Centers during CMS DC04, Computing in High Energy and Nuclear Physics (CHEP), Interlaken, Switzerland, 2004.
[12] D. Cameron, et al., Replica management in the European DataGrid project, J. Grid Comput. (2004), in press.
[13] POOL, <http://lcgapp.cern.ch/project/persist>.
[14] T. Barrass, et al., Software agents in data and workflow management, Computing in High Energy and Nuclear Physics (CHEP), Interlaken, Switzerland, 2004.
[15] PubDB, <http://cmsdoc.cern.ch/swdev/viewcvs/viewcvs.cgi/OCTOPUS/PubDB/?cvsroot=OCTOPUS>.
[16] J. Andreeva, et al., RefDB: the Reference Database for CMS Monte Carlo production, Computing in High Energy and Nuclear Physics (CHEP), La Jolla, California, 2003.
[17] CRAB project, <http://cmsdoc.cern.ch/cms/ccs/wm/www/Crab>.
[18] N. De Filippis, et al., The CMS analysis chain in a distributed environment, this conference.