Data Model and Query Semantics

In order for a body of data to remain useful over the long term, the user community must have a common understanding of the organization of the data (the data model) and the form of queries that can be performed on that data model (the query semantics). To this end, a primary objective of this project will be to clearly document the data model and query semantics for the various physics use cases of the LHC and Tevatron data, gain the acceptance of the stakeholders for the model, and verify its ability to represent multiple physical organizations and software technologies. Each of the experiments has addressed this problem in some detail for its own purposes; however, it remains an open problem to describe these sources in a common way that can be integrated into both local and institutional repositories.

Approaches to Data and Query Modeling

This research area will begin with an understanding of the high-level data use and re-use cases generated by Workshop 1 and Workshop 2, followed by a survey of the technical data models developed by each of the LHC experiments. These will be combined into a draft common form to be discussed in detail at Workshop 3. The workshop will be used to finalize the data model and to better understand the long-term use cases, including broad applicability across disciplines.

One of the most important principles to emerge from the body of work on database management is the separation of logical and physical data organization. Simply put, the user’s ability to view and manipulate the data should be completely independent of the physical organization of the data in storage. This frees the storage system to choose the best physical organization for the desired access pattern, and it also allows the data to move freely between multiple storage systems that may have completely different implementations.
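To make the principle concrete, the sketch below (in Python, using purely hypothetical names rather than any experiment's software) shows a single logical interface for retrieving events that is served by two very different physical layouts; user code written against the interface does not change when the layout does.

```python
# A minimal sketch of logical/physical separation (hypothetical names,
# not drawn from any experiment's software). The logical interface addresses
# events by (run, event); each backend chooses its own physical layout.

from abc import ABC, abstractmethod


class EventStore(ABC):
    """Logical view: events addressed by (run, event), independent of layout."""

    @abstractmethod
    def get(self, run: int, event: int) -> bytes:
        ...


class FlatFileStore(EventStore):
    """Physical layout 1: one file per run, events found via an offset index."""

    def __init__(self, index):
        # index: {(run, event): (path, offset, length)}
        self.index = index

    def get(self, run, event):
        path, offset, length = self.index[(run, event)]
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)


class KeyValueStore(EventStore):
    """Physical layout 2: events stored under a composite key in a dict-like store."""

    def __init__(self, kv):
        self.kv = kv

    def get(self, run, event):
        return self.kv[f"{run}/{event}"]


def first_bytes(store: EventStore, run: int, event: int) -> bytes:
    # User code is written once against the logical interface; swapping
    # FlatFileStore for KeyValueStore requires no change here.
    return store.get(run, event)[:16]
```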

To verify the data model and query semantics, we will construct prototype software that supports a small number of physical organizations on multiple storage systems (e.g. local disk and HDFS) and a small number of query technologies (e.g. SQL and SPARQL). If we have defined the data model and implemented it correctly, it should be possible to change both physical layouts and software technologies without affecting the ability of the user to query and obtain data. The selection of appropriate storage technologies will be driven by the experience of Workshop 6.
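As one illustration of how such a prototype might be structured, the sketch below renders a single logical selection into both SQL and SPARQL. The class name, trigger label, table schema, and RDF vocabulary are illustrative assumptions, not project definitions.

```python
# Sketch of the prototype's pluggable query layer (all names are illustrative
# assumptions, including the trigger label and the ex: RDF vocabulary).

class Selection:
    """A logical selection, independent of storage and query technology."""

    def __init__(self, run: int, trigger: str):
        self.run = run
        self.trigger = trigger


def to_sql(sel: Selection) -> str:
    # Rendering for a relational backend (e.g. an events table on local disk).
    return (f"SELECT event_id FROM events "
            f"WHERE run = {sel.run} AND trigger = '{sel.trigger}'")


def to_sparql(sel: Selection) -> str:
    # Rendering for an RDF backend; a real query would also declare the prefix.
    return (f'SELECT ?event WHERE {{ ?event ex:run {sel.run} ; '
            f'ex:trigger "{sel.trigger}" . }}')


sel = Selection(run=180241, trigger="HLT_Mu20")
print(to_sql(sel))
print(to_sparql(sel))
```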

Once the working prototype has been demonstrated, we will document it and make it available as open source software to facilitate understanding and extension of our work after the project concludes. We expect that the presentation of these results and the software will be a significant portion of one of the final-year workshops.

Technical and Organizational Challenges

To date, the management of LHC data has not fully adhered to the principle of separating logical and physical data organization, which has led to many organizational and technical challenges. As the recommended storage method has evolved from plain files, to object-based repositories such as Objectivity, to distributed filesystems such as HDFS, the user community has paid a significant price to adapt applications and middleware to each new software system. Each change of physical organization at a site requires major downtime, reconfiguration, and disruption for users before the system is ready again.

The sheer size of the LHC data also has an impact on the technical design: we cannot assume that an implementation that handles gigabytes can be trivially scaled up to petabytes of data by simply attaching a larger filesystem. In practice, consumers of the data are interested in subsets of it, which may be defined by trigger conditions, vertical slices of the detector output, or even random sampling for testing. Because few (if any) sites beyond CERN will have the capacity to store all of the data, different overlapping subsets will be stored at multiple sites. The policy and mechanism for moving data among sites will be highly variable and may depend on research interests, infrastructure investment, and sharing policies. These issues will be guided in detail by the outputs of Workshop 5.

Because of this, the data model, the query language, and the prototype implementations must all accommodate the semantics of distributed data. One carefully crafted query may access only locally stored data with a reasonably predictable access time. A slight change to the query (or to the underlying data distribution) may cause it to refer to data currently stored at multiple sites, making its execution effectively unbounded in both time and resources. An effective model and implementation must take this physical distribution into account, allowing the user to express their intentions and the system to respond with its configuration and limitations, including the possible absence of data in the final result.
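One way such limitations might be surfaced to the user, sketched below with entirely hypothetical types, is for every query result to report how much of the matching data was actually available locally and which sites hold the remainder, so that a partial answer is explicit rather than silent.

```python
# Sketch: a query result that reports locality and completeness explicitly,
# rather than silently blocking on remote data (all types are hypothetical).

from dataclasses import dataclass
from typing import List, Set


@dataclass
class Block:
    """A unit of data placement: a subset of events replicated at one or more sites."""
    events: List[int]
    sites: Set[str]


@dataclass
class QueryResult:
    events: List[int]        # events served from local storage
    local_fraction: float    # fraction of matching blocks held locally
    remote_sites: List[str]  # sites holding the rest of the matching data
    complete: bool           # False if part of the result was not retrieved


def run_query(matching_blocks: List[Block], local_site: str) -> QueryResult:
    local = [b for b in matching_blocks if local_site in b.sites]
    remote = [b for b in matching_blocks if local_site not in b.sites]
    return QueryResult(
        events=[e for b in local for e in b.events],
        local_fraction=len(local) / len(matching_blocks) if matching_blocks else 1.0,
        remote_sites=sorted({s for b in remote for s in b.sites}),
        complete=not remote,
    )


# Two of three matching blocks are local; the result is explicitly partial.
blocks = [Block([1, 2], {"SiteA"}), Block([3], {"SiteA", "SiteB"}), Block([4, 5], {"SiteB"})]
print(run_query(blocks, "SiteA"))
```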