Scientific research applications and usage

This chapter focuses on the building of a big data application used within CERN, where the complexity of large data loads and their automated analysis and insight execution is the norm. The discussion covers data ingestion, large data sets, streaming data sets, data computations, distributed data processing, replication, stream versus batch analytics, analytic formulas, once-off versus repetitive executions of algorithms, supervised and unsupervised learning, neural network execution, and applying visualizations and data storyboarding. How the research community benefits from the big data stack and applications, how to avoid risks and pitfalls, when to boil the ocean, and where to avoid traps are all discussed in the last segment of this chapter.

The name CERN is derived from the acronym for the French "Conseil Européen pour la Recherche Nucléaire," or European Council for Nuclear Research, a provisional body founded in 1952 with the mandate of establishing a world-class fundamental physics research organization in Europe. At that time, pure physics research concentrated on understanding the inside of the atom, hence the word "nuclear."

In 1964, two physicists, François Englert and Peter Higgs, published individual papers on particle physics and argued that there must exist a particle smaller than an atom, and smaller even than a proton or an electron, which came to be called the "God particle." The particle is a "boson," named after Satyendra Nath Bose, another physicist, who collaborated with Albert Einstein. Some of the pair's work resulted in the invention of Bose-Einstein statistics, a way to describe the behavior of a class of particles that now shares Bose's name. Two bosons with identical properties can be in the same place at the same time, but two fermions cannot. This is why photons, which are bosons, can travel together in concentrated laser beams, while electrons, which are fermions, must stay away from each other, which explains why electrons must reside in separate orbits in atoms. The discovery of this boson, when accomplished, would open several new ways of understanding the universe, of which only about 4% is understood to date.


The Higgs boson is the last undiscovered particle predicted by the Standard Model, a beautiful mathematical framework physicists use to describe the smallest bits of matter and how they interact. Experimental results have time and again validated the model's other predictions. But finding the Higgs boson would not close the book on particle physics. While the Standard Model accounts for fundamental forces such as electromagnetism and the strong nuclear force, it cannot make sense of gravity, which is disproportionately weak compared to the other forces. One possible explanation is that we experience only a fraction of the force of gravity because most of it acts in hidden extra dimensions.

The relentless pursuit of this fundamental particle was propelled further by particle physics. No particle can move with a speed faster than the speed of light in vacuum; however, there is no limit to the energy a particle can attain. In high-energy accelerators, particles normally travel very close to the speed of light. In these conditions, as the energy increases, the increase in speed is minimal. As an example, particles in the LHC move at 0.999997828 times the speed of light at injection (energy = 450 GeV) and 0.999999991 times the speed of light at top energy (energy = 7000 GeV). Therefore, particle physicists do not generally think about speed, but rather about a particle's energy. Energy and mass are two sides of the same coin. Mass can transform into energy and vice versa in accordance with Einstein's famous equation (E = mc²), and because of this equivalence, mass and energy can be measured with the same unit (by setting c = 1). At the scale of particle physics these units are the electronvolt and its multiples.

In this pursuit is where CERN became involved. Just because something looks like the Higgs particle does not mean it is the Higgs particle. If physicists discover a new particle, they need to measure its numerous properties before they can determine whether it is the Higgs boson described by the Standard Model of particle physics. Theory predicts in great detail how a Standard Model Higgs particle would interact with other particles. Only after carefully measuring and testing these interactions, like a biologist examining the genetic makeup of a new plant species, would scientists be certain that they had indeed found the Standard Model Higgs boson. A new particle that did not act as expected would give physicists a whole new set of mysteries to explore.

The Standard Model is a collection of theories that embodies all of our current understanding of fundamental particles and forces. According to the theory, which is supported by a great deal of experimental evidence, quarks and leptons are the building blocks of matter, and forces act through carrier particles exchanged between the particles of matter. Forces also differ in their strength.
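As a rough check of the figures quoted above, a proton's speed can be computed from its energy using the relativistic relation E = γmc², with a proton rest energy of about 0.938 GeV. The following minimal Python sketch (using only the beam energies quoted in the text) reproduces the quoted fractions of the speed of light; it is an illustration, not CERN software.

```python
import math

PROTON_REST_ENERGY_GEV = 0.938272  # proton rest mass energy, m*c^2, in GeV

def beta_from_energy(energy_gev: float) -> float:
    """Return v/c for a proton with the given total energy (GeV)."""
    gamma = energy_gev / PROTON_REST_ENERGY_GEV  # Lorentz factor, E = gamma * m * c^2
    return math.sqrt(1.0 - 1.0 / gamma**2)       # beta = v/c = sqrt(1 - 1/gamma^2)

for label, energy in [("injection (450 GeV)", 450.0), ("top energy (7000 GeV)", 7000.0)]:
    print(f"{label}: v/c = {beta_from_energy(energy):.9f}")

# Prints roughly 0.9999978 at injection and 0.999999991 at top energy,
# matching the values quoted in the text to the precision shown.
```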


Quarks and leptons are the building blocks of matter, that is, they are seen as the "elementary particles." In the present Standard Model there are six "flavors" of quarks. They can successfully account for all known mesons and baryons (over 200). The most familiar baryons are the proton and neutron, which are each constructed from up and down quarks. Quarks are observed to occur only in combinations of two quarks (mesons) and three quarks (baryons). There was a recent claim of observation of particles with five quarks (pentaquark), but further experimentation has not borne it out.

| Quark   | Symbol | Spin | Charge | Baryon number | S  | C  | B  | T  | Mass                              |
|---------|--------|------|--------|---------------|----|----|----|----|-----------------------------------|
| Up      | U      | 1/2  | +2/3   | 1/3           | 0  | 0  | 0  | 0  | 1.7-3.3 MeV                       |
| Down    | D      | 1/2  | -1/3   | 1/3           | 0  | 0  | 0  | 0  | 4.1-5.8 MeV                       |
| Charm   | C      | 1/2  | +2/3   | 1/3           | 0  | +1 | 0  | 0  | 1270 MeV                          |
| Strange | S      | 1/2  | -1/3   | 1/3           | -1 | 0  | 0  | 0  | 101 MeV                           |
| Top     | T      | 1/2  | +2/3   | 1/3           | 0  | 0  | 0  | +1 | 172 GeV                           |
| Bottom  | B      | 1/2  | -1/3   | 1/3           | 0  | 0  | -1 | 0  | 4.19 GeV (MS-bar), 4.67 GeV (1S)  |

Here S, C, B, and T denote the strangeness, charm, bottomness, and topness quantum numbers.
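As a quick illustration of how these numbers combine, the electric charge of a baryon is simply the sum of its three quarks' charges. The short Python sketch below (using only the up and down charge values from the table above) shows how the proton (uud) and neutron (udd) charges follow from their constituent quarks.

```python
from fractions import Fraction

# Electric charge of each quark flavor, in units of the elementary charge e
QUARK_CHARGE = {
    "u": Fraction(2, 3),   # up
    "d": Fraction(-1, 3),  # down
}

def baryon_charge(quarks: str) -> Fraction:
    """Sum the charges of a three-quark combination (a baryon)."""
    return sum((QUARK_CHARGE[q] for q in quarks), Fraction(0))

print("proton  (uud):", baryon_charge("uud"))   # 1 -> charge +1e
print("neutron (udd):", baryon_charge("udd"))   # 0 -> electrically neutral
```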


We are in search of these quarks, protons, and electrons, which all have a low decay time from their creation, and this is where the accelerator solution comes in.

Accelerators

The Large Hadron Collider (LHC) was built at CERN with the goal of smashing protons moving at 99.999999% of the speed of light into each other and so recreating conditions a fraction of a second after the big bang. The LHC experiments then try to work out what happened. Particle physics is the unbelievable in pursuit of the unimaginable. To pinpoint the smallest fragments of the universe you have to build the biggest machine in the world. To recreate the first millionths of a second of creation you have to focus energy on an awesome scale. Some of the questions that we hope to answer with the LHC experiments are as follows:

- Why do we observe matter and almost no antimatter if we believe there is a symmetry between the two in the universe?
- What is this "dark matter" that we cannot see but that has visible gravitational effects in the cosmos?
- Why can the Standard Model not predict a particle's mass?
- Are quarks and leptons actually fundamental, or are they made up of even more fundamental particles?
- Why are there exactly three generations of quarks and leptons?
- How does gravity fit into all of this?

To get from the LHC experiments a set of outcomes, including the discovery of the "God particle," the following frontiers were established:

- The Energy Frontier: using high-energy colliders to discover new particles and directly probe the architecture of the fundamental forces.
- The Intensity Frontier: using intense particle beams to uncover properties of neutrinos and observe rare processes that will tell us about new physics beyond the Standard Model.
- The Cosmic Frontier: using underground experiments and telescopes, both ground and space based, to reveal the natures of dark matter and dark energy, and using high-energy particles from space to probe new phenomena.

The Large Hadron Collider is located in a circular tunnel 27 km (17 miles) in circumference. The tunnel is buried around 100 m (about the length of a football field) underground. The LHC straddles the Swiss and French borders on the outskirts of Geneva. The collider consists of distinct sets of components that cover the 27 km circumference. The goal of the collider is to deliver the collision of two protons and provide insight into the underlying matter and how fast it decays. Prior to the LHC we first built the Large


Electron-Positron Collider (LEP), one of the largest particle accelerators ever constructed. LEP collided electrons with positrons at energies that reached 209 GeV. It was a circular collider with a circumference of 27 km, built in a tunnel roughly 100 m (300 ft) underground, and was used from 1989 until 2000. Around 2001 it was dismantled to make way for the LHC, which reused the LEP tunnel. LEP remains the most powerful accelerator of leptons ever built.

LEP was built with four detectors, each constructed around one of the four collision points within underground halls. Each was the size of a small house and was capable of registering the particles by their energy, momentum, and charge, thus allowing physicists to infer the particle reaction that had happened and the elementary particles involved. By performing statistical analysis of this data, knowledge about elementary particle physics is gained. The four detectors of LEP were called ALEPH, DELPHI, OPAL, and L3. They were built differently to allow for complementary experiments. The colliders and detectors created and used were the following:

- ALEPH is an acronym for Apparatus for LEP PHysics at CERN. ALEPH is a detector that determined the mass of the W boson and Z boson to within one part in a thousand. The number of families of particles with light neutrinos was determined to be 2.982 ± 0.013, which is consistent with the Standard Model value of 3.
- DELPHI is an acronym for DEtector with Lepton, Photon, and Hadron Identification. Like the other three detectors, it recorded and analyzed the results of the collisions between LEP's colliding particle beams. DELPHI was built in the shape of a cylinder over 10 m in length and diameter, with a weight of 3500 tons. In operation, electrons and positrons from the accelerator passed through a pipe running through the center of the cylinder and collided in the middle of the detector. The collision products then traveled outwards from the pipe and were analyzed by a large number of subdetectors designed to identify the nature and trajectories of the particles produced by the collision.
- OPAL is an acronym for Omni-Purpose Apparatus for LEP. The name of the experiment was a play on words, as some of the founding members of the scientific collaboration that first proposed its design had previously worked on the JADE detector at DESY in Germany. OPAL was designed and built as a general-purpose detector intended to collect a broad range of data. Its data were used to make high-precision measurements of the Z boson lineshape, perform detailed tests of the Standard Model, and place limits on new physics.
- L3 was another LEP experiment. Its enormous octagonal magnet return yoke remained in place in the cavern and became part of the ALICE detector for the LHC.

LEP was successful in many experiments and provided ample direction on the physics it could establish within the Standard Model; however, the need was felt to create and build a larger collider that would take us beyond the Standard Model and


potentially even help us move toward the discovery of the elusive Higgs boson. The challenge was significant: LEP had been built in a tunnel, and reusing this tunnel required decommissioning LEP and then building the detectors and collider components for the LHC. What the rebuild provided was an opportunity to align all technologies in the big data platform layers to collect the data generated by these mammoth colliders, which would help us reconstruct the collision process and visualize the impact repeatedly. The LEP collider relied on Oracle databases, C programs, open-source data models and platforms, Pascal, Fortran, and SAS, but data was collected only as needed, sometimes missing foundational areas, and there were timing issues as collisions occurred and people recorded as much as they could. To compute better, with collection of large volumes of data and analysis of results on demand, the data platforms added at the LHC included Hadoop, NoSQL, Spark, and Kubernetes. The next segment looks at the technologies and the data.

Big data platform and application

History of LHC at CERN can be summarized in these events:

- 1982: First studies for the LHC project
- 1983: Z0/W discovered at the SPS proton-antiproton collider (SppbarS)
- 1989: Start of LEP operation (Z/W boson factory)
- 1994: Approval of the LHC by the CERN Council
- 1996: Final decision to start the LHC construction
- 2000: Last year of LEP operation above 100 GeV
- 2002: LEP equipment removed
- 2003: Start of LHC installation
- 2005: Start of LHC hardware commissioning
- 2008: Start of (short) beam commissioning
- 2009: Repair, recommissioning, and beam commissioning
- 2011: Launch of LHC

As the LHC was being designed at CERN, data calculations were being computed for each device and for how much data would be needed for use. CERN has approximately 3000 members and 12,000 users. The Large Hadron Collider and its four big experiments, ATLAS, CMS, LHCb, and ALICE, generate data 24 x 7 x 365. CERN openlab is a public-private partnership through which CERN collaborates with leading ICT companies and research organizations. The Worldwide LHC Computing Grid (WLCG) is a global collaboration of more than 170 institutions in 42 countries which provide resources to store, distribute, and analyze the multiple petabytes of LHC data. Data arrives at CERN at about 15 PB per month, and more than 250 PB are held today at the CERN data center.


[Figure: CERN implementation of the LHC]

[Figure: Data generation at the LHC]

The picture shown above illustrates data generation at the LHC. This data is generated for each execution and comes in multiple formats, including ROOT for physics, Parquet, JSON, TSV, CSV, log files, and XML. We need to collect the data for analytics executed on HDFS on demand; we also need to execute stored


compute on database platforms, and we need to execute streaming analytics in memory as the data streams. The challenge here is that we will collect several terabytes of data from source-generated files but need to provide 100-200 GB extracts for new analytics, while still having access to operational data for running analytics and exploration. To process the data, the new platforms to add included Apache Hadoop, Apache Kafka, Apache Spark, Apache Flume, Apache Impala, Oracle, and a NoSQL database. This data processing architecture will be integrated with the existing ecosystem of Oracle databases, SAS, and analytics systems. The Apache stack selected is shown in the picture below, and a minimal sketch of how such a stack is typically driven from code follows.
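As a hedged illustration of the two modes discussed above (this is not CERN's actual pipeline code), the following PySpark sketch shows a batch read of Parquet extracts from HDFS and a streaming read from a Kafka topic. The HDFS path, Kafka broker, topic, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("collision-analytics-sketch")
         .getOrCreate())

# --- Batch analytics: read a Parquet extract landed on HDFS (path is hypothetical)
events = spark.read.parquet("hdfs:///data/lhc/extracts/run_events.parquet")
summary = (events
           .groupBy("detector")                                  # assumed column name
           .agg(F.count("*").alias("n_events"),
                F.avg("energy_gev").alias("avg_energy_gev")))    # assumed column name
summary.show()

# --- Streaming analytics: consume events from a Kafka topic as they arrive
#     (requires the spark-sql-kafka connector on the classpath)
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")     # hypothetical broker
          .option("subscribe", "collision-events")               # hypothetical topic
          .load())

query = (stream.selectExpr("CAST(value AS STRING) AS payload")
         .writeStream
         .format("console")        # in practice this would feed in-memory analytics
         .outputMode("append")
         .start())
query.awaitTermination()
```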

The Hadoop configuration implemented at CERN includes the following:

- Bare-metal Hadoop/YARN clusters
- Five clusters
- 110+ nodes
- 14+ PB of storage
- 20+ TB of memory
- 3100+ cores
- HDDs and SSDs

Access to data is provided with Active Directory, and native security rules are enclosed for each layer of access from the Grid to Hadoop. The rules provide encryption, decryption, hierarchies, and granularity of access. The authorization policy is implemented in the rules, and authentication is implemented with Active Directory. The end-user analysts and physicists at CERN use Jupyter notebooks with a PySpark implementation to work on all the data. The Jupyter notebooks use Impala, Pig, and Python, and several innovations have been added by the CERN team to use the Apache stack for their specific requirements. We will discuss these innovations in the next segment.

Innovations:


XRootD filesystem interface project

The CERN team evaluated the Apache stack and identified a few gaps between where they were with current technology and the new stack to be augmented. The main gap was that all physics files were written using the ROOT project, which is developed in C++, and these formats could not be loaded directly into Avro or Spark. The CERN team joined hands with the DIANA-HEP team to create the XRootD filesystem interface project for Hadoop, designed to load physics files into HDFS and Spark. Details of the project can be found at http://xrootd.org and the GitHub page for the project is at https://github.com/cerndb/hadoop-xrootd.

XRootD Project

XRootD: The XRootD project aims at giving high-performance, scalable, fault-tolerant access to data repositories of different kinds, with the access delivered as file based. The project was conceived to be delivered as a scalable architecture, a communication protocol, and a set of plug-ins and tools based on those. The freedom to configure XRootD and to make it scale (for size and performance) allows the deployment of data access clusters of virtually any size, which can include sophisticated features like authentication/authorization, integration with other systems, and distributed data distribution. The XRootD software framework is a fully generic suite for fast, low-latency, and scalable data access, which can natively serve any kind of data organized as a hierarchical, filesystem-like namespace based on the concept of a directory.
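As a rough sketch of what this enables (assuming the cerndb/hadoop-xrootd connector and the DIANA-HEP spark-root data source are available on the Spark classpath), a physics ROOT file exposed over the root:// protocol can be read into a Spark DataFrame roughly as follows; the EOS path and branch name below are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the hadoop-xrootd connector jar and the DIANA-HEP spark-root package
# have been added to the cluster (e.g. via --jars / --packages); the EOS path
# and the branch name below are illustrative only.
spark = (SparkSession.builder
         .appName("xrootd-root-read-sketch")
         .getOrCreate())

events = (spark.read
          .format("org.dianahep.sparkroot")          # spark-root data source
          .load("root://eospublic.cern.ch//eos/opendata/sample_events.root"))

events.printSchema()                                  # TTree branches become columns
events.select("nMuon").describe().show()              # hypothetical branch name
```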

Service for web-based analysis (SWAN)

CERN has packaged and built a service layer for analysis based on the web browser. This service, called SWAN, is a combination of the Jupyter notebook, Python, C++, ROOT, Java, Spark, and several other API interfaces. The package is available for download and use by any consumer who works with CERN. The SWAN service is available at https://swan.web.cern.ch. There are several other innovations to manage the large files, the streaming analytics, the in-memory analytics, and Kerberos security plug-ins.


The result: Higgs boson discovery

The discovery of the Higgs particle in 2012 is an astonishing triumph of mathematics' power to reveal the workings of the universe. It is a story that has been recapitulated in physics numerous times, and each new example thrills just the same. The possibility of black holes emerged from the mathematical analyses of German physicist Karl Schwarzschild; subsequent observations proved that black holes are real. Big Bang cosmology emerged from the mathematical analyses of Alexander Friedmann and also Georges Lemaître; subsequent observations proved this insight correct as well. The concept of antimatter first emerged from the mathematical analyses of quantum physicist Paul Dirac; subsequent experiments showed that this idea, too, is right. These examples give a feel for what the great mathematical physicist Eugene Wigner meant when he spoke of the "unreasonable effectiveness of mathematics in describing the physical universe." The Higgs field emerged from mathematical studies seeking a mechanism to endow particles with mass, and once again the math has come through with flying colors.

Nearly a half-century ago, Peter Higgs and a handful of other physicists were trying to understand the origin of a basic physical feature: mass. You can think of mass as an object's heft or, more precisely, as the resistance it offers to having its motion changed. Accelerate a car to increase its speed, and the resistance you feel reflects its mass. At a microscopic level, the car's mass comes from its constituent molecules and atoms, which are themselves built from fundamental particles: electrons and quarks. But where do the masses of these and other fundamental particles come from?

When physicists in the 1960s modeled the behavior of these particles using equations rooted in quantum physics, they encountered a puzzle. If they imagined that the particles were all massless, then each term in the equations clicked into a perfectly symmetric pattern, like the tips of a perfect snowflake. And this symmetry was not just mathematically elegant; it explained patterns evident in the experimental data. But here is the puzzle: physicists knew that the particles did have mass, and when they modified the equations to account for this fact, the mathematical harmony was spoiled. The equations became complex and unwieldy and, worse still, inconsistent. What to do?

Here is the idea put forward by Higgs. Do not shove the particles' masses down the throat of the beautiful equations. Instead, keep the equations pristine and symmetric, but consider them operating within a peculiar environment. Imagine that all of space is uniformly filled with an invisible substance, now called the Higgs field, that exerts a drag force on particles when they accelerate through it. Push on a fundamental particle in an effort to increase its speed and, according to Higgs, you would feel this drag force as a resistance. Justifiably, you would interpret the resistance as the particle's mass. For a mental toehold, think of a ping pong ball submerged in water. When you push on the ping pong ball, it will feel much more massive than it does outside of water. Its interaction with the watery environment has the effect of endowing it with mass. The same explanation applies to particles submerged in the Higgs field.


The physics community had, for the most part, fully bought into the idea that there was a Higgs field permeating space. Mathematical equations can sometimes tell such a convincing tale, they can seemingly radiate reality so strongly, that they become entrenched in the vernacular of working physicists even before there is data to confirm them. But it is only with data that a link to reality can be forged. How can we test for the Higgs field? This is where the Large Hadron Collider (LHC) comes in.

Winding its way hundreds of yards under Geneva, Switzerland, crossing the French border and back again, the LHC is a nearly 17-mile-long circular tunnel that serves as a racetrack for smashing together particles of matter. The LHC is surrounded by about 9000 superconducting magnets and is home to streaming hordes of protons, cycling around the tunnel in both directions, which the magnets accelerate to just shy of the speed of light. At such speeds, the protons whip around the tunnel about 11,000 times each second and, when directed by the magnets, engage in millions of collisions in the blink of an eye. The collisions, in turn, produce fireworks-like sprays of particles, which mammoth detectors capture and record.

One of the main motivations for the LHC, which cost on the order of $10 billion and involves thousands of scientists from dozens of countries, was to search for evidence for the Higgs field. The math showed that if the idea is right, if we are really immersed in an ocean of Higgs field, then the violent particle collisions should be able to jiggle the field, much as two colliding submarines would jiggle the water around them. And every so often, the jiggling should be just right to flick off a speck of the field, a tiny droplet of the Higgs ocean, which would appear as the long-sought Higgs particle. The calculations also showed that the Higgs particle would be unstable, disintegrating into other particles in a minuscule fraction of a second. Within the maelstrom of colliding particles and billowing clouds of particulate debris, scientists armed with powerful computers would search for the Higgs' fingerprint: a pattern of decay products dictated by the equations.

In the early morning hours of July 4, 2012, as the world came to quickly learn, the evidence that the Higgs particle had been detected was strong enough to cross the threshold of discovery. With the Higgs particle now officially found, physicists worldwide broke out into wild applause. Peter Higgs wiped away a tear.

The Higgs particle represents a new form of matter, which had been widely anticipated for decades but had never been seen. Early in the 20th century, physicists realized that particles, in addition to their mass and electric charge, have a third defining feature: their spin. But unlike a child's top, a particle's spin is an intrinsic feature that does not change; it does not speed up or slow down over time. Electrons and quarks all have the same spin value, while the spin of photons, particles of light, is twice that of electrons and quarks. The equations describing the Higgs particle showed that, unlike any other fundamental particle species, it should have no spin at all. Data from the Large Hadron Collider have now confirmed this.
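To give a feel for what searching for the Higgs' fingerprint means computationally, one standard step is to combine the measured energies and momenta of candidate decay products and compute their invariant mass, m² = (ΣE)² - |Σp|², in units where c = 1 as introduced earlier; a recurring peak in that mass distribution across many collisions is the kind of pattern the analysis software looks for. The Python sketch below is purely illustrative, with made-up four-vectors, and is not CERN's analysis code.

```python
import math

def invariant_mass(particles):
    """Invariant mass of a set of decay products, each given as (E, px, py, pz) in GeV (c = 1)."""
    E  = sum(p[0] for p in particles)
    px = sum(p[1] for p in particles)
    py = sum(p[2] for p in particles)
    pz = sum(p[3] for p in particles)
    return math.sqrt(max(E**2 - (px**2 + py**2 + pz**2), 0.0))

# Two hypothetical photons from a single collision (illustrative numbers only):
photon_1 = (65.0,  65.00,  0.00, 0.0)
photon_2 = (80.0, -40.16, 69.19, 0.0)

# Prints roughly 125 GeV for these made-up vectors, i.e. the mass region
# in which the Higgs boson was eventually observed.
print(f"diphoton invariant mass = {invariant_mass([photon_1, photon_2]):.1f} GeV")
```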


Establishing the existence of a new form of matter is a rare achievement, but the result has resonance in another field: cosmology, the scientific study of how the entire universe began and developed into the form we now witness. For many years, cosmologists studying the Big Bang theory were stymied. They had pieced together a robust description of how the universe evolved from a split second after the beginning, but they were unable to give any insight into what drove space to start expanding in the first place. What force could have exerted such a powerful outward push? For all its success, the Big Bang theory left out the bang. The LHC's confirmation that at least one such field actually exists thus puts a generation of cosmological theorizing on a far firmer foundation.

Lessons learned

The significant lessons we have learned in discussing the CERN situation, its outcomes with the big data analytics implementation, and the future goals include the following:

Problem statement: define the problem clearly, including the symptoms, situations, issues, risks, and anticipated resolutions. The CERN team has followed this process since the inception of LEP and throughout the lifecycle of all its associated devices; they also defined the gaps and areas of improvement to be accomplished, which were all defined in the LHC process.

Define the solution: this segment should identify all possible solutions for each area of the problem. The solution segment can consist of multiple tools and heterogeneous technology stacks integrated for a definitive, scalable, flexible, and secure outcome. The definition of the solution should include analytics, formulas, data quality, data cleansing, transformation, rules, exceptions, and workarounds. These steps need to be executed for each area and include all the processes, defined with clarity. The CERN team has implemented this and added governance to ensure that the steps are completed accordingly and no gaps are left unanswered; if gaps exist, there are tags, and tasks associated with the tags, for their eventual completion.

Step-by-step execution: this is an essential mechanism for becoming successful. The account of the discovery of the Higgs field shows that we need to iterate multiple times over every step to analyze the foundational aspects, which provides more insights to drill through to greater depths. This step-by-step process is what brings success: whether we work on cancer research or in-depth particle physics research, the concept-to-proof perspective demands steps and outcomes at each step, with adjustments made and recorded, the step reprocessed, and the outcomes recorded. In big data applications, step-by-step execution is very much possible with the data collected in HDFS at the raw operational level, which can be explored, discovered, experimented with, and constructed in multiple methods and cycles, with analysis of the details performed on the data. All of this is possible within the HDFS layers, which provide us the playground to prove the possibilities. The cost models are not necessarily cheap; CERN, for example, has spent over $1B on infrastructure worldwide over the years, but


one Higgs field result is more valuable than the spend, and the recovery is manifold. Here is where the technology has enabled, and continues to enable, the experiments to head:

- 2010-12: Run 1 (7 and 8 TeV)
- 2015-18: Run 2 (13 TeV)
- 2021-23: Run 3 (14 TeV)
- 2024-25: HL-LHC installation
- 2026-35: HL-LHC operation

The next iteration of this big data application is the introduction of cloud and Kubernetes platforms, where more experiments can be conducted at each component of the LHC.

Open source adoption: this has been a critical success factor in the CERN application of big data platforms and applications. Open source licensing and security have evolved to the point where they can be used for enterprise applications and deployed on a global basis today, of which CERN is one proof. This is a critical success factor to remember.

Governance: this is the other key critical success factor in this entire journey, and anybody in the scientific community will agree with this viewpoint. Governance is not about being measured, but about reporting on the measurements required to ensure that all activities and their tasks have been completed, per the expected outcomes and within the expected times. This governance enabled CERN to be successful from the first iteration to date and will be used in the future.