- Riccardo Maria Bianchi [*] , Renaud Bruneliere
- (CERN & Freiburg University)
[*] Corresponding author. Email: firstname.lastname@example.org
31 May, 2010
Keywords: data analysis; Python; PyROOT; Large Hadron Collider; LHC; High Energy Physics; HEP; Computer Aided Software Engineering; automation; CASE; GRID.
The Large Hadron Collider (LHC) at CERN in Geneva, Switzerland, has just started its first long run, at an energy range never before explored in human history. The amount of data collected will be huge, of the order of petabytes per year. Rare evidence of the existence of the Higgs boson may hide within these data, alongside signs of exciting new physics. This physics is "new" because it goes beyond the current model of particle physics, the so-called "Standard Model", which describes the fundamental bricks and glue that build our Universe: from faraway galaxies to the atoms in the skin cells of our thumbs.
To analyze those data in the search for new physics, physicists have to write software containing algorithms that filter events and objects, keeping only the very rare and interesting ones which could lead to a great discovery. And since we cannot be sure how new physics will show itself, a large number of different data analyses have to be set up to scan all possibilities. But in such big High Energy Physics (HEP) experiments, physicists also have to write a huge amount of extra code just to run the analysis over the data; that code has to deal, for example, with setting up the environment, retrieving and accessing data files, storing results, looping over objects inside data files and selecting them, and booking and filling plots and histograms.
When writing new code to implement an idea for a physics search, physicists working in HEP too often use a cut-and-paste approach: they copy parts of the code from an existing analysis into the new one, then edit the physics-related part. In this way, after a few iterations, they end up with a plethora of classes to debug, maintain and validate, classes that have most of their code in common. This approach, still the most widely used in HEP, is error-prone and not suited to handling many different analyses, as is needed when searching for new physics. Moreover, writing an analysis from scratch in such complex experiments requires non-trivial knowledge and understanding of the software framework itself, which can make implementing new analysis ideas a difficult task for a physicist.
Here we present a new approach and a new package to ease the writing of HEP analysis code. WatchMan is an object-oriented framework written entirely in Python, which lets physicists implement their ideas easily, without having to write all of the code themselves. WatchMan is a software generator which builds complete analysis code from user settings.
This is the main idea behind WatchMan: from physics analysis ideas scribbled down at the coffee table... to complete analysis code in a few easy automated steps! And all that thanks to the power of Python!
The user simply enters analysis ideas, as many as wanted, in a text-like form in a configuration file; the package parses them and dynamically generates the complete Python analysis code, ready to be run on data either locally or on the GRID network. The generated code combines all the analyses specified by the user into a single program, via an object-flagging mechanism which was easy to set up only thanks to the great flexibility of Python. WatchMan provides modular interfaces in order to create analysis code for different experiments or data formats. Three interfaces are provided with the framework so far: two for the ATLAS experiment running at the LHC at CERN, and one for the open-source Delphes data format, used in HEP for experiment-independent studies. More custom interfaces can easily be added by users.
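As a rough illustration of the idea of generating code from a configuration file and flagging objects, consider the minimal sketch below. It is not WatchMan's actual implementation: the configuration syntax, the section and option names (`type`, `min_pt`), the helper functions `generate_source` and `build_flagger`, and the toy event are all hypothetical, invented here for the example. The sketch parses the analyses with the standard `configparser` module, generates the source of a single `flag_object()` function, and then attaches to each object the set of analyses it passes, so that one loop over the data serves every analysis.

```python
import configparser

# Hypothetical configuration: one section per analysis, one cut per option.
# This syntax is invented for illustration; it is not WatchMan's format.
CONFIG_TEXT = """
[high_pt_muons]
type = muon
min_pt = 20.0

[soft_electrons]
type = electron
min_pt = 5.0

[any_hard_lepton]
type = any
min_pt = 50.0
"""


def generate_source(config_text):
    """Turn the configuration into the source of a flag_object() function."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    lines = ["def flag_object(obj):", "    flags = set()"]
    for name in parser.sections():
        kind = parser.get(name, "type")
        min_pt = parser.getfloat(name, "min_pt")
        # Build the selection for this analysis as a Python expression.
        condition = "obj['pt'] > %r" % min_pt
        if kind != "any":
            condition += " and obj['type'] == %r" % kind
        lines.append("    if %s:" % condition)
        lines.append("        flags.add(%r)" % name)
    lines.append("    return flags")
    return "\n".join(lines) + "\n"


def build_flagger(config_text):
    """Execute the generated source and return the resulting function."""
    namespace = {}
    exec(generate_source(config_text), namespace)
    return namespace["flag_object"]


# Toy "event": a list of reconstructed objects (all values are made up).
EVENT = [
    {"type": "muon", "pt": 35.0},
    {"type": "electron", "pt": 8.0},
    {"type": "muon", "pt": 60.0},
    {"type": "electron", "pt": 2.0},
]

flag_object = build_flagger(CONFIG_TEXT)

# A single loop over the objects serves all analyses at once:
# each object carries the names of the analyses whose cuts it passes.
flagged = [(obj, flag_object(obj)) for obj in EVENT]
for obj, flags in flagged:
    print(obj["type"], obj["pt"], sorted(flags))
```

Calling `print(generate_source(CONFIG_TEXT))` shows the generated function as plain Python text, which is the point of the approach in this sketch: the physics selections live in the configuration, while the repetitive code around them is produced automatically.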
WatchMan presents and implements a new idea in the HEP field: the use of Computer Aided Software Engineering to build reliable, easy-to-maintain and easy-to-validate Python data analysis code, mainly aimed at analyzing new data from the LHC. Python is the language used for the whole framework and throughout the development process; the developers chose it, after considering other languages, for its extreme flexibility, its development speed, and its cleanness and readability. Moreover, when C++ code is needed to interface properly with certain data formats, the Python bindings are built automatically by WatchMan using the tools provided by PyROOT, the Python bridge to the ROOT framework.
WatchMan is a new open-source Python project under continuous development, with a first stable release already available; it has a small community of active users and has already been used successfully to analyze data for some scientific papers at the LHC.
- "WatchMan -- An highly automated Analysis Code Generator", https://twiki.cern.ch/twiki/bin/view/Main/WatchMan
- "PyROOT -- A Python-ROOT Bridge", Wim Lavrijsen (CERN & LBNL), http://root.cern.ch/drupal/content/how-use-use-python-pyroot-interpreter
- "ROOT -- An Object-Oriented Data Analysis Framework", http://root.cern.ch/drupal/
- "Delphes -- A framework for fast simulation of a generic collider experiment", http://projects.hepforge.org/delphes/
- "Worldwide LHC Computing Grid (WLCG)", http://lcg.web.cern.ch/LCG/ and http://public.web.cern.ch/public/en/Spotlight/SpotlightGrid-en.html