Nuclear Physics B (Proc. Suppl.) 215 (2011) 79–81 www.elsevier.com/locate/npbps

From commissioning to collisions: preparations and execution of CMS Computing

D. Bonacorsi (on behalf of CMS Computing)

University of Bologna, Department of Physics, Italy

After taking part in a series of computing scale tests over recent years, steered by the Worldwide LHC Computing Grid, the CMS Computing project is ready to enter the LHC data-taking era. With real LHC collisions at centre-of-mass energies up to 7 TeV, CMS Computing gained further experience with the Tier-0 tape-writing and processing capabilities, with the behaviour of the Tier-1 sites in tape archival operations as well as reprocessing and skimming, with the optimization of dataset distribution over a full-mesh transfer topology, and with the performance of world-wide Grid-enabled distributed analysis of real LHC data. In this paper, the first computing lessons learned with LHC collision data at 7 TeV are presented.

1. Introduction

After years of preparation and commissioning activities, the CMS experiment [1] entered data-taking mode. During collision running, CMS performed the workflows and activities foreseen in the Computing Model [2]. The CMS computing Tiers of the Worldwide LHC Computing Grid (WLCG) [3,4] performed the specified workflows: prompt processing at the Tier-0 (T0); output to the CAF [5] for prompt feedback and alignment/calibration activities; transfer to Tier-1 sites (T1) for storage and distribution to Tier-2 sites (T2); prompt skimming and reprocessing at the T1 centres; Monte Carlo production and analysis activities at the T2 centres. The CMS Computing experience at the various Tier levels with LHC proton-proton collision data is summarized in the following.

2. Computing at Tier-0

The overall load on the computing systems and the event complexity were lower than expected in the original planning. As of the end of May 2010, CMS had collected roughly 10 nb−1, and the acquisition era for collisions at 7 TeV summed up to 580 M events collected in RAW format, corresponding to 430 TB of total data volume (including all re-reconstruction passes). This slower ramp with respect to the planning (tens of pb−1 in the first 6 months) resulted in no stress on the computing resources, and allowed the foreseen activities to be performed more frequently.

The T0 operations experience at CERN was very smooth. At the T0, CMS ran the rolling (fully automated) workflows, i.e. the express processing, the prompt reconstruction and the prompt skimming (the latter actually running at T1s, but scheduled by the T0 system). The success rate of the different types of jobs running at the T0 (e.g. express, repack, prompt-reco, alca-skims, ...) averaged above 99.9%. The overall stability of the CMSSW software and the reliability of the computing systems were extremely satisfactory. The processing latency for time-critical applications like the Express stream had already been measured in the 2009 run: as an example, the latency from receiving the first streamer files of a run at the T0 to the first express files on the CAF was measured to be about 25 minutes (versus a design specification of 1 hour). With the 2010 run, such latencies are still well within the design goals. The CMS CERN Analysis Facility (CAF) also showed a broadly distributed usage (up to about 130 active users) and an equal share among the different groups (AlCa, Commissioning, Physics). Jobs on the CAF have always been able to start almost immediately in low-latency queues.
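As a back-of-envelope illustration of the Tier-0 numbers quoted above (580 M RAW events, 430 TB including re-reconstruction passes, and an express latency of about 25 minutes against a 1-hour design specification), the short Python sketch below spells out the implied average event size and the latency margin. The variable names and the calculation itself are purely illustrative; this is not CMS code.

```python
# Back-of-envelope check of the Tier-0 numbers quoted in the text.
# Figures taken from this paper; everything else is illustrative.

TOTAL_EVENTS = 580e6          # RAW events collected at 7 TeV (as of end of May 2010)
TOTAL_VOLUME_TB = 430.0       # total data volume, including re-reconstruction passes
EXPRESS_LATENCY_MIN = 25.0    # measured: first streamers at T0 -> first express files on CAF
DESIGN_LATENCY_MIN = 60.0     # design specification of 1 hour

# Average size per event implied by these totals (an upper bound on the RAW
# event size, since the volume also counts re-reconstruction passes).
avg_event_size_mb = TOTAL_VOLUME_TB * 1e6 / TOTAL_EVENTS  # TB -> MB
print(f"implied average event size: {avg_event_size_mb:.2f} MB/event (upper bound)")

# Express-stream latency versus the design goal.
margin = DESIGN_LATENCY_MIN - EXPRESS_LATENCY_MIN
print(f"express latency {EXPRESS_LATENCY_MIN:.0f} min, "
      f"{margin:.0f} min within the {DESIGN_LATENCY_MIN:.0f} min design spec")
```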


Figure 1. CMS PhEDEx traffic on T1-T2 and T2-T2 routes since March 2010. Daily-averaged rates in MB/s are shown. Each color represents a different destination (T2) site.

3. Computing at Tier-1s

The readiness and stability of the T1 sites is now remarkably high. This is the result of daily work carried out over many years, in close contact with the T1 site admins. CMS launched an ad-hoc Site Readiness program [6], which defines the “readiness” of a site as the boolean AND of a set of tests, constantly monitors it for all CMS computing Tiers, and helps site administrators to solve CMS-related issues. A good collaboration between the CMS Computing Operations teams and the WLCG Tiers was established (also through the Computing shifts), and it has become even tighter since March 30th. In particular, the readiness of the T1 sites was very good during the 2010 collision data-taking period, and the T1s were able to sustain all the activities foreseen in the Computing Model.
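The readiness definition described above lends itself to a very simple illustration: a site counts as ready on a given day only if every monitored test passes. The sketch below shows that boolean-AND combination; the test names and the data structure are invented for the example and do not reproduce the actual Site Readiness implementation [6].

```python
# Illustration of the "readiness as a boolean AND of many tests" idea
# described in the text. Test names are made up for the example; the real
# CMS Site Readiness program [6] is more elaborate.

from typing import Dict

def site_is_ready(test_results: Dict[str, bool]) -> bool:
    """A site is 'ready' on a given day only if every monitored test passed."""
    return all(test_results.values())

# Hypothetical daily snapshot for one Tier-1 site (test names invented).
snapshot = {
    "sam_tests_ok": True,          # basic Grid service availability checks
    "job_robot_ok": True,          # test jobs submitted to the site succeed
    "transfer_quality_ok": True,   # data transfer links in good shape
    "not_in_downtime": True,       # the site is not in a scheduled downtime
}

print("site ready today:", site_is_ready(snapshot))   # True only if all tests pass
```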

A steady data traffic out of CERN has been observed in PhEDEx [7] since the start of data collection. All 7 T1s received custodial data from the T0. During the 90 days just before this Conference, CMS moved >0.8 PB of data from the T0 to the T1 sites. The pattern shows typical bursts corresponding to LHC fills: on the most interesting days, the daily-averaged aggregate CERN-outbound rate peaked at >650 MB/s. The transfer quality is constantly monitored, and it was stable and good for all destination T1s throughout all the data taking so far. It is remarkable that, after one rare and short (a few hours) service loss in the transfer chain, the accumulated backlog was quickly digested, showing how the excess capacity gets efficiently used when needed; it caused no harm to operations at all and was fully transparent to the physics analysis teams.

The T1 centres also functioned well for the first 7 TeV prompt-skimming and reprocessing activities, even if not (yet) running in a resource-constrained environment. CMS expects to be able to turn over the data passes needed for the Summer conferences (e.g. ICHEP'10 in Paris) in a week or less.
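A rough consistency check can be made on the transfer figures quoted above, by comparing the average rate implied by >0.8 PB moved from the T0 to the T1s in 90 days with the quoted daily-averaged peaks of >650 MB/s; the gap between the two is the headroom that lets backlogs be absorbed quickly. The sketch below only spells out that arithmetic and is not part of PhEDEx or of any CMS monitoring tool.

```python
# Rough consistency check of the T0 -> T1 transfer numbers quoted in the text:
# >0.8 PB in 90 days versus daily-averaged peaks above 650 MB/s.

VOLUME_PB = 0.8            # data moved from T0 to T1 sites in the 90 days before the Conference
PERIOD_DAYS = 90
PEAK_RATE_MBS = 650.0      # best daily-averaged aggregate CERN-outbound rate

seconds = PERIOD_DAYS * 24 * 3600
avg_rate_mbs = VOLUME_PB * 1e9 / seconds          # PB -> MB, then per second

print(f"average T0->T1 rate over the period: ~{avg_rate_mbs:.0f} MB/s")
print(f"peak-to-average ratio: ~{PEAK_RATE_MBS / avg_rate_mbs:.1f}x "
      "(headroom that lets backlogs be digested quickly)")
```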


4. Computing at Tier-2s

The readiness of the T2 sites had already plateaued in late 2009 at more than 40 usable T2 sites. Many structures are visible in the readiness history, related to holiday periods, problematic sites, or simply downtimes, but in general CMS can count on a consistent number of stable T2 centres, used mainly for Monte Carlo production and distributed analysis. After the start of 7 TeV collision data taking, the average T1-T2 data traffic rate has come up: 49 T2s have received some data since early April 2010. During the 90 days just before this Conference, >3 PB were moved from T1 to T2 sites. Additionally, a significant effort by the CMS Data Transfers team allowed the first serious T2-T2 transfers to be deployed, among many sites and with daily-averaged rates as high as 0.8 GB/s (see Fig. 1). The T2-T2 route is extremely useful to optimize the overall transfer system and to minimize the dataset transfer latency to a given destination: for these reasons it is being largely (and transparently) used, e.g. for physics data replication among T2s supporting the same physics group.

CMS has measured that on average the T2s are used 50% for Monte Carlo production and 50% for analysis, as in the model. The baseline for CMS is to perform MC production at T2 centres, but some special high-priority MC requests have also started to be produced at T1 sites, to flexibly use all available resources and, at the same time, to further reduce any delay in delivering production outputs to end users. About 11k job slots are available for analysis at the T2 level. In total, >1 PB of centrally-allocated space is in use at 50 CMS T2s (with 20% of the T2 sites whose storage is still not extensively used). The number of CMS physicists participating in analysis on the distributed infrastructure is increasing: in any given week, 47±2 T2 sites ran CMS analysis jobs, submitted by a number of (weekly) users which (during this Conference) goes beyond 450 individuals (more in [8]). The analysis job success rate remains a persistent issue: CMS has stabilized on average at around 80% success rate, a visible improvement over last year, when CMS averaged about 65%. It has already been determined that roughly half of the errors are related to the remote stage-out of the produced analysis output files. This is still an area with relatively large margins for improvement.
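The success-rate figures above give a feeling for how much could still be gained from the stage-out problem alone. The sketch below is a naive what-if under the assumption, stated in the text, that roughly half of the current failures come from remote stage-out, plus the further (purely illustrative) assumption that those failures could be removed entirely; it is an illustration, not a CMS projection.

```python
# Naive what-if based on the numbers quoted in the text: ~80% analysis job
# success rate, with roughly half of the failures due to remote stage-out of
# the analysis output files. Assumes those failures could be fully removed;
# this is an illustration, not a CMS projection.

success_rate_2010 = 0.80
success_rate_2009 = 0.65
stageout_fraction_of_failures = 0.5   # "roughly half of errors" (from the text)

failure_rate = 1.0 - success_rate_2010
recoverable = failure_rate * stageout_fraction_of_failures
potential_success_rate = success_rate_2010 + recoverable

print(f"improvement already achieved: {success_rate_2009:.0%} -> {success_rate_2010:.0%}")
print(f"potential if stage-out failures were eliminated: ~{potential_success_rate:.0%}")
```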


5. Conclusions

CMS Computing is working smoothly for LHC physics at 7 TeV. The overall system and the operations teams have demonstrated that they can cope with the load in all sectors. The specific (rare) events which caused some backlogs showed that the system absorbs them and reacts properly, with no impact whatsoever on physics. It must be remarked that the current resource utilization is not (yet) stressing the system: the live time of the accelerator and the event complexity are currently lower than we expect later in the year. The luminosity ramp and the increase in the integrated data volume will keep CMS Computing more and more busy in the coming months.

CMS Computing would like to thank the WLCG coordination for their constant efforts, also through several computing challenges over the last few years, and all the CMS contacts and site admins at the WLCG Tiers for their hard, competent and constant work.

REFERENCES

1. CMS Collaboration, R. Adolphi et al., "The CMS experiment at the CERN LHC", JINST 3 (2008) S08004.
2. C. Grandi, D. Stickland, L. Taylor et al., "The CMS Computing Model", CERN-LHCC-2004-035/G-083 (2004).
3. The Worldwide LHC Computing Grid (WLCG) web portal: http://lcg.web.cern.ch/LCG
4. J. D. Shiers, "The Worldwide LHC Computing Grid (worldwide LCG)", Computer Physics Communications 177 (2007) 219-223.
5. P. Kreuzer et al., "The CMS CERN Analysis Facility (CAF)", presented at Computing in High Energy and Nuclear Physics (CHEP'09), Prague, Czech Republic, March 21-27, 2009, to be published in J. Phys.: Conf. Ser.
6. S. Belforte et al., "The commissioning of CMS Computing Centres in the WLCG Grid", XIII International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT'08), Erice, Italy, Nov 3-7, 2008, Proceedings of Science, PoS (ACAT08) 043 (2008).
7. D. Bonacorsi et al., "PhEDEx high-throughput data transfer management system", Computing in High Energy and Nuclear Physics (CHEP'06), T.I.F.R., Mumbai, India, 2006.
8. D. Spiga, "Running the CMS Computing for the analysis of the first data", this Conference.