Sys Admins

Service Documentation

ARC CE

The ARC Compute Element (CE) is a Grid front-end on top of a conventional computing resource (e.g. a Linux cluster or a standalone workstation).

Website:
Sys admin guide:
User guides:
Troubleshooting guide:
Support channels:

ARGUS

The Argus Authorization Service renders consistent authorization decisions for distributed services (e.g. user interfaces, portals, computing elements, storage elements). The service is based on the XACML standard and uses authorization policies to determine whether a user is allowed or denied permission to perform a certain action on a particular service.

Website:
User guides:
Support channels:
Contact:

Batch Systems

The following documentation mainly covers the existing batch system middleware integration components and provides some general links to the most popular batch systems used within WLCG. In particular, CREAM CE integration with Torque, LSF, GE and SLURM is well documented, as is the EMI implementation of the TORQUE batch system.

User guides:
List of known issues:
Troubleshooting guide:
Contact:

CREAM CE

The CREAM (Computing Resource Execution And Management) Service is a simple, lightweight service for job management operation at the Computing Element (CE) level.

CREAM accepts job submission requests (which are described with the same JDL language used to describe jobs submitted to the Workload Management System) and other job management requests (e.g. job cancellation and job monitoring).

CREAM can be used by the Workload Management System (WMS), via the ICE service, or by a generic client, e.g. an end user wishing to submit jobs directly to a CREAM CE. For the latter use case a command-line interface (CLI) is available.

CREAM exposes a web service interface.
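
For illustration, the sketch below submits a trivial job directly to a CREAM CE from the CLI. It assumes the CREAM client tools and a valid VOMS proxy are available; the CE endpoint and queue name are hypothetical.

    # Hedged sketch: direct submission of a trivial job to a CREAM CE with the CLI.
    # Assumes the CREAM client tools and a valid VOMS proxy; the CE endpoint and
    # queue name below are hypothetical.
    import subprocess
    from pathlib import Path

    JDL = """\
    [
      Executable    = "/bin/hostname";
      StdOutput     = "out.txt";
      StdError      = "err.txt";
      OutputSandbox = {"out.txt", "err.txt"};
    ]
    """

    Path("hostname.jdl").write_text(JDL)

    # -a: automatic proxy delegation; -r: CE in <host>:<port>/cream-<lrms>-<queue> form
    subprocess.run(
        ["glite-ce-job-submit", "-a",
         "-r", "ce.example.org:8443/cream-pbs-grid",  # hypothetical CE and queue
         "hostname.jdl"],
        check=True,
    )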

Website:
User guides:
List of known issues:

CVMFS

The CernVM File System (CernVM-FS or CVMFS) is a network file system based on HTTP and optimized to deliver experiment software in a fast, scalable and reliable way. Files and file metadata are aggressively cached and downloaded on demand. CernVM-FS thereby decouples the life-cycle management of application software releases from that of the operating system.
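
As a quick sanity check on a client node, the sketch below triggers an automount of a repository and probes it. It assumes the cvmfs client is installed; the repository name is only an example.

    # Hedged sketch: checking that a CernVM-FS repository is mounted and reachable
    # on a client node. Assumes the cvmfs client is installed; the repository name
    # is only an example.
    import os
    import subprocess

    repo = "atlas.cern.ch"  # example repository name

    # Listing the mount point triggers an automount and shows the top-level content
    mount_point = os.path.join("/cvmfs", repo)
    print(sorted(os.listdir(mount_point))[:10])

    # "cvmfs_config probe" verifies that the repository can be mounted and served
    subprocess.run(["cvmfs_config", "probe", repo], check=True)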

Website:
List of known issues:
Troubleshooting guide:
Support channels:

dCache

dCache is a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
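
As an illustration of one such standard access method, the sketch below lists a directory through a dCache WebDAV door. The endpoint, port, path and proxy location are hypothetical, and authentication details vary from site to site.

    # Hedged sketch: listing a directory through a dCache WebDAV door with a
    # PROPFIND request. The endpoint, port, path and proxy location are hypothetical;
    # authentication and port numbers vary from site to site.
    import requests

    url = "https://dcache-door.example.org:2880/data/myvo/"   # hypothetical WebDAV door
    proxy = "/tmp/x509up_u1000"                               # example proxy location

    resp = requests.request(
        "PROPFIND",                            # WebDAV directory listing
        url,
        headers={"Depth": "1"},                # immediate children only
        cert=proxy,                            # proxy file used as client cert and key
        verify="/etc/grid-security/certificates",  # grid CA directory
    )
    resp.raise_for_status()
    print(resp.text[:500])                     # raw XML "multistatus" response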

Sys admin guide:
User guides:
List of known issues:
Troubleshooting guide:
Support channels:

DPM

The Disk Pool Manager (DPM) is a lightweight storage solution for grid sites. It offers a simple way to create a disk-based grid storage element and supports relevant protocols (SRM, gridFTP, RFIO) for file management and access.

It focuses on manageability (ease of installation and configuration, low maintenance effort), while providing all the functionality required of a grid storage solution (support for multiple disk server nodes, different space types and multiple file replicas in disk pools).

Website:
User guides:
List of known issues:
Troubleshooting guide:
Support channels:

EOS

EOS is a disk-based service providing a low-latency storage infrastructure for physics users. It provides a highly scalable hierarchical namespace implementation; data access is provided by the XROOT protocol.

The main target area for the service is physics data analysis, which is characterized by many concurrent users, a significant fraction of random data access and a large file open rate.

For user authentication EOS supports Kerberos (for local access) and X.509 certificates (for grid access). To ease experiment workflow integration, SRM and GridFTP access are also provided. EOS further supports the XROOT third-party copy mechanism from/to other XROOT-enabled storage services at CERN.
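
As a simple access example, the sketch below copies a file out of EOS over the XROOT protocol with xrdcp. The endpoint and file path are hypothetical; a valid Kerberos ticket or X.509 proxy is assumed.

    # Hedged sketch: copying a file out of EOS over the XROOT protocol with xrdcp.
    # Assumes a valid Kerberos ticket or X.509 proxy; the endpoint and file path
    # are hypothetical examples.
    import subprocess

    src = "root://eos.example.org//eos/myexperiment/user/j/jdoe/data.root"  # hypothetical
    dst = "/tmp/data.root"

    subprocess.run(["xrdcp", src, dst], check=True)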

Website:
List of known issues:
Troubleshooting guide:
Support channels:
Site recipes:

FTS

FTS3 is the service responsible for globally distributing the majority of the LHC data across the WLCG infrastructure. It is a low level data movement service, responsible for reliable bulk transfer of files from one site to another while allowing participating sites to control the network resource usage.
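
As an illustration, the sketch below submits a single transfer through the FTS3 REST interface, assuming the fts-rest Python "easy" bindings and a valid X.509 proxy are available. The FTS endpoint and storage URLs are examples only.

    # Hedged sketch of a transfer submission through the FTS3 REST interface,
    # assuming the fts-rest Python "easy" bindings (fts3.rest.client.easy) and a
    # valid X.509 proxy. The FTS endpoint and the storage URLs are examples only.
    import fts3.rest.client.easy as fts3

    endpoint = "https://fts3.example.org:8446"              # example FTS3 REST endpoint
    source = "gsiftp://source-se.example.org/path/file"     # example source replica
    destination = "gsiftp://dest-se.example.org/path/file"  # example destination

    context = fts3.Context(endpoint)         # picks up the user proxy from the environment
    transfer = fts3.new_transfer(source, destination)
    job = fts3.new_job([transfer], retry=3)  # one job may bundle many transfers

    job_id = fts3.submit(context, job)
    print(fts3.get_job_status(context, job_id)["job_state"])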


Website:
User guides:
List of known issues:
Troubleshooting guide:
Support channels:

GFAL

GFAL (Grid File Access Library) is a C library providing an abstraction layer over the complexity of grid storage systems.
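
A minimal sketch using the gfal2 Python bindings (which wrap the GFAL library) is shown below; the storage URL is a hypothetical example.

    # Hedged sketch using the gfal2 Python bindings (package gfal2-python);
    # the storage URL below is a hypothetical SRM endpoint.
    import gfal2

    ctx = gfal2.creat_context()   # note the spelling: creat_context

    url = "srm://se.example.org/dpm/example.org/home/myvo/"  # hypothetical

    # Stat and list a directory; GFAL picks the protocol plugin matching the URL
    info = ctx.stat(url)
    print("mode:", oct(info.st_mode), "size:", info.st_size)
    for entry in ctx.listdir(url):
        print(entry)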

Website:
Sys admin guide:
List of known issues:
Troubleshooting guide:
Site recipes:

HTCondor CE

An OSG Compute Element (CE) is the entry point for the OSG to your local resources: a layer of software that you install on a machine that can submit jobs into your local batch system. At the heart of the CE is the job gateway software, which is responsible for handling incoming jobs, authorizing them, and delegating them to your batch system for execution. Historically, the OSG only had one option for a job gateway solution, the Globus Toolkit's GRAM-based gatekeeper, but it now offers the HTCondor CE as an alternative.

Today in OSG, most jobs that arrive at a CE (called grid jobs) are not end-user jobs, but rather pilot jobs submitted from factories. Successful pilot jobs create and make available an environment for actual end-user jobs to match and ultimately run within the pilot job container. Eventually pilot jobs remove themselves, typically after a period of inactivity.

HTCondor CE is a special configuration of the HTCondor software designed to be a job gateway solution for the OSG. It is configured to use the JobRouter daemon to delegate jobs by transforming and submitting them to the site’s batch system.
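
As a basic end-to-end check, the sketch below runs condor_ce_trace, which submits a trivial test job through the CE into the site batch system. The hostname is hypothetical; the htcondor-ce-client tools and a valid grid proxy are assumed.

    # Hedged sketch: end-to-end test of an HTCondor CE with condor_ce_trace, which
    # submits a trivial test job through the CE into the site batch system.
    # Assumes the htcondor-ce-client tools and a valid grid proxy; the host is an example.
    import subprocess

    ce_host = "ce.example.org"  # hypothetical HTCondor CE hostname

    subprocess.run(["condor_ce_trace", ce_host], check=True)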

List of known issues:
Troubleshooting guide:
Support channels:
Contact:

StoRM

StoRM (STOrage Resource Manager) is a light, scalable, flexible, high-performance, file-system-independent storage manager service (SRM) for generic disk-based storage systems, compliant with version 2.2 of the standard SRM interface.

StoRM provides data management capabilities in a Grid environment to share, access and transfer data among heterogeneous and geographically distributed data centres. In particular, StoRM works on any POSIX filesystem (ext3, ext4, XFS, basically everything that can be mounted on a Linux machine), but it also brings to the Grid the advantages of high-performance storage systems based on cluster file systems (such as GPFS from IBM or Lustre from Sun Microsystems), supporting direct access (native POSIX I/O calls) to shared files and directories, as well as other standard Grid access protocols. StoRM is adopted in the context of the WLCG computational Grid framework.

Website:
User guides:
List of known issues:
Troubleshooting guide:
Support channels:

UI/WN

The User Interface (UI) is the access point to the Grid Infrastructure. This can be any machine where users have a personal account and where their user certificate is installed. From the UI, the user can be authenticated and authorised to use the Grid resources and can access the functionalities offered by the Information, Workload and Data Management Systems.

The Worker Node (WN) is the computing node inside the Grid where the user's jobs are finally executed at a site. The necessary middleware components are installed on the WN; additional software components may be necessary according to the requirements of the VOs supported by the site.

Website:
User guides:
List of known issues:
Troubleshooting guide:
Support channels:
Site recipes:
Contact:

VOMS

The Virtual Organization Membership Service (VOMS) is a Grid attribute authority which serves as a central repository for VO user authorization information, providing support for sorting users into group hierarchies and keeping track of their roles and other attributes. This information is used to issue trusted attribute certificates and assertions that are used in the Grid environment for authorization purposes.
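
For illustration, the sketch below obtains a VOMS proxy with attribute certificates using the standard command-line clients. The VO name is an example; the voms-clients package and a user certificate in the default location are assumed.

    # Hedged sketch: obtaining a VOMS proxy carrying attribute certificates with the
    # standard command-line clients. Assumes the voms-clients package and a user
    # certificate in the default location; the VO name is an example.
    import subprocess

    vo = "myvo.example.org"  # example VO name

    # Create a proxy with VOMS attributes for the chosen VO
    subprocess.run(["voms-proxy-init", "--voms", vo], check=True)

    # Show the attributes (FQANs) embedded in the proxy
    subprocess.run(["voms-proxy-info", "--all"], check=True)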

Website:
Sys admin guide:
List of known issues:
Troubleshooting guide:
Support channels:
Contact:

WLCG Information System

The grid information system provides detailed information about grid services, which is needed for many different tasks. It has a hierarchical structure of three levels, whose fundamental building block is the Berkeley Database Information Index (BDII).

The resource-level (or core) BDII is usually co-located with the grid service and provides information about that service. Each grid site runs a site-level BDII, which aggregates the information from all the resource-level BDIIs running at that site. The top-level BDII aggregates the information from all the site-level BDIIs and hence contains information about all grid services. There are multiple instances of the top-level BDII in order to provide a fault-tolerant, load-balanced service. Information system clients query a top-level BDII to find the information they require.
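
For illustration, the sketch below queries a top-level BDII over LDAP using the python-ldap module. The hostname is a placeholder for a top-level BDII alias; port 2170 and the GLUE 1.3 base DN "o=grid" are the conventional BDII settings.

    # Hedged sketch: querying a top-level BDII over LDAP with the python-ldap module.
    # The hostname is a placeholder for a top-level BDII alias; port 2170 and the
    # GLUE 1.3 base DN "o=grid" are the conventional BDII settings.
    import ldap

    conn = ldap.initialize("ldap://top-bdii.example.org:2170")  # hypothetical alias

    results = conn.search_s(
        "o=grid",                        # GLUE 1.3 information tree
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueService)",     # all published service entries
        ["GlueServiceType", "GlueServiceEndpoint"],
    )

    for dn, attrs in results[:10]:       # print a small sample
        print(attrs.get("GlueServiceType"), attrs.get("GlueServiceEndpoint"))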

Sys admin guide:
List of known issues:

XRootD

The XRootD project aims at giving high-performance, scalable, fault-tolerant access to data repositories of many kinds; the typical usage is to give access to file-based ones. It is based on a scalable architecture, a communication protocol, and a set of plugins and tools built on those. The freedom to configure it and to make it scale (in size and performance) allows the deployment of data access clusters of virtually any size, which can include sophisticated features such as authentication/authorization, integration with other systems and WAN data distribution.

The XRootD software framework is a fully generic suite for fast, low-latency and scalable data access, which can natively serve any kind of data organized as a hierarchical, filesystem-like namespace based on the concept of a directory. As a general rule, particular emphasis has been put on the quality of the core software components.
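
A minimal access sketch using the XRootD Python bindings is shown below; the endpoint and path are hypothetical examples.

    # Hedged sketch using the XRootD Python bindings to list a directory on an
    # XRootD server; the endpoint and path are hypothetical examples.
    from XRootD import client
    from XRootD.client.flags import DirListFlags

    fs = client.FileSystem("root://xrootd.example.org:1094")   # hypothetical endpoint

    status, listing = fs.dirlist("/store/user/jdoe", DirListFlags.STAT)
    if not status.ok:
        raise RuntimeError(status.message)

    for entry in listing:
        print(entry.name, entry.statinfo.size)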

Website:
Sys admin guide:
List of known issues:
Troubleshooting guide:
Support channels:

Operations Coordination Meeting

T1 and T2 sites are invited to raise any issue they are concerned about at the monthly Operations Coordination meeting, which usually takes place on the first Thursday of the month from 15:30 to 17:00 CE(S)T. There is a section on the agenda for this. You can also write to wlcg-ops-coord-chairpeople in advance to make sure a specific slot is scheduled in the agenda.

Middleware

WLCG Middleware Baseline

The WLCG Middleware Baseline lists the minimum recommended versions of the middleware services that should be installed by WLCG sites to be part of the production infrastructure. It does not necessarily reflect the latest versions of packages available in the UMD, OSG or EPEL repositories; rather, it contains the latest versions fixing significant bugs or introducing important features. Versions newer than those indicated are assumed to be at least as good, unless otherwise stated. In other words: if you have a version older than the baseline, you should upgrade at least to the baseline. For more details, please check the list of versions at the following link:

Baseline Versions

Middleware Known Issues

A list of middleware known issues is maintained by the WLCG Middleware Officer. The list contains known middleware issues affecting the operations of the WLCG infrastructure. For more details please check the following link:

Known Issues

To report a new known issue, please, contact the WLCG Middleware Officer.

User Support

The WLCG ticketing system of choice is GGUS. Some of the advantages of opening a GGUS ticket instead of sending email include:

  • There is a persistent web link that can be quoted from any other document
  • All Grid supporters have access to the ticket and can comment on, re-assign or solve it
  • Ticket updates generate automatic email notifications to anyone mentioned in or subscribed to the ticket.
  • Tickets can be escalated to attract the attention of support units
  • GGUS is interfaced to other ticketing systems used in the WLCG community
  • It is possible to open a ticket on behalf of a 3rd party
  • TEAMers of the same VO co-own a ticket
  • Authorised ALARMers get immediate attention from T0/T1 operators by opening ALARM tickets for incident categories covered by the WLCG Memorandum of Understanding.

GGUS features reminders as well as searching and reporting tools, including some specially tailored to sites; release notes and full documentation are available from the GGUS homepage. Users/supporters must be registered with their personal digital certificate to enjoy the full functionality of GGUS.

System administrators and middleware deployers often exchange expert opinions on the LCG-ROLLOUT@JISCMAIL.AC.UK mailing list, a moderated list with a web interface and archive.
