Sys Admins

Operations Meetings

T1 sites are invited to report on relevant matters at the weekly Operations meetings, which usually take place on Mondays at 15h00 CE(S)T. These are short meetings (30 minutes at most) in which experiments, T1s and central services report on ongoing operational matters.

Weekly operations meetings

Operations Coordination Meeting

T1 and T2 sites are invited to raise any issue that concerns them at the monthly Operations Coordination meeting, which usually takes place on the first Thursday of the month from 15h30 to 17h00 CE(S)T. There is a section on the agenda for this. You can also write to wlcg-ops-coord-chairpeople in advance to make sure a specific slot is scheduled in the agenda. The experiments report their short- and medium-term plans that affect sites. TFs and WGs also report on their progress, and there may be specific presentations on particular topics that need discussion at the operations level.

Ops Coordination meetings

Middleware

WLCG Middleware Baseline

The WLCG Middleware Baseline lists the minimum recommended versions of middleware services that WLCG sites should install to be part of the production infrastructure. It does not necessarily reflect the latest versions of packages available in the UMD, OSG or EPEL repositories. Instead, it contains the latest versions that fix significant bugs or introduce important features. Versions newer than those listed are assumed to be at least as good, unless otherwise indicated. In other words: if you have a version older than the baseline, you should upgrade at least to the baseline. For more details, please check the list of versions at the following link:

Baseline Versions

That page also contains a list of (major) known issues affecting WLCG middleware.
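
The baseline rule described above (upgrade anything older than the baseline; newer versions are assumed to be acceptable) can be sketched as a simple version comparison. This is only an illustration: the service names and version numbers below are hypothetical and are not taken from the Baseline Versions page, and the comparison assumes plain dotted numeric versions, which real package versions do not always follow.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.12.3' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def needs_upgrade(installed, baseline):
    """Return True when the installed version is older than the baseline."""
    return parse_version(installed) < parse_version(baseline)


def check_site(installed_versions, baseline_versions):
    """List the services whose installed version is below the baseline."""
    return sorted(
        name
        for name, baseline in baseline_versions.items()
        if name in installed_versions
        and needs_upgrade(installed_versions[name], baseline)
    )


# Hypothetical data, for illustration only.
baseline = {"service-a": "2.4.0", "service-b": "1.10.2"}
installed = {"service-a": "2.3.9", "service-b": "1.12.0"}
print(check_site(installed, baseline))
```

Here only service-a is reported, because service-b is already newer than its baseline. Exceptions flagged on the baseline page ("unless otherwise indicated") would still need to be checked by hand.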

Service Documentation

ARC CE

ARC Compute Element (CE) is a Grid front-end for a conventional computing resource (e.g. a Linux cluster).

Website

ARC CE Web Site

List of known issues

Bugzilla Bug List for ARC CE

ARC Release Notes

Support channels

GGUS (ARC)

Sys admin guide

ARC CE Sys Admin Guide

User guides

ARC CE User Guide

Existing fora links

ARC Discussion forum

Argus

The Argus Authorization Service renders consistent authorization decisions for distributed services (e.g., portals, computing elements, storage elements). The service is based on the XACML standard and uses authorization policies to determine if a user is allowed or denied to perform a certain action on a particular service.
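
The idea of a policy-based permit/deny decision can be illustrated with a toy evaluator. This is not the Argus API or its policy language: the policy layout, the attribute strings, the host name and the first-match-wins ordering are all assumptions made for this sketch.

```python
def decide(policies, subject_attributes, resource, action):
    """Return 'permit', 'deny' or 'not-applicable' for a request.

    Policies are checked in order and the first matching one wins
    (an assumption of this sketch, not a statement about Argus).
    """
    for policy in policies:
        if (
            policy["resource"] == resource
            and policy["action"] == action
            and policy["attribute"] in subject_attributes
        ):
            return policy["effect"]
    return "not-applicable"


# Hypothetical policies: ban rules are listed before permit rules.
policies = [
    {"resource": "ce.example.org", "action": "submit-job",
     "attribute": "banned-user", "effect": "deny"},
    {"resource": "ce.example.org", "action": "submit-job",
     "attribute": "vo:example", "effect": "permit"},
]

print(decide(policies, {"vo:example"}, "ce.example.org", "submit-job"))
print(decide(policies, {"vo:example", "banned-user"}, "ce.example.org", "submit-job"))
```

Putting the ban rule first means a banned user is denied even when a later rule would permit the same request, which is the general pattern behind the banning setups linked under Site recipes.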

Website

Argus Documentation

Argus code and binaries

List of known issues

Argus Issues

Support channels

GGUS

Sys admin guide

Argus Sys Admin Guide

Site recipes

NDGF Argus Integration

GridPP Argus Documentation

NIKHEF Argus Global Banning Setup Overview

Batch Systems

The following documentation mainly covers the existing middleware components for batch system integration, together with some general links to the most popular batch systems used within WLCG. In particular, ARC and CREAM CE integration with HTCondor, Slurm, Torque, LSF and Grid Engine flavors is well documented.

Website

HTCondor Web Site

Slurm Web Site

TORQUE Web Site

Son of Grid Engine Web Site

Support channels

GGUS (ARC, CREAM-BLAH, HTCondor-CE)

Sys admin guide

CREAM Batch System Integration

ARC CE Batch System Integration

Site recipes

Manchester Multicore Torque Configuration

NIKHEF Multicore and Large Memory Jobs Torque Configuration

CREAM CE

Website

CREAM Web Site

List of known issues

CREAM Known Issues

Support channels

GGUS (CREAM-BLAH)

Sys admin guide

CREAM Sys Admin Guide

User guides

CREAM User Guide

Troubleshooting guide

CREAM Troubleshooting Guide

CVMFS

The CernVM File System (CVMFS) is a network file system based on HTTP and optimized to deliver experiment software in a fast, scalable, and reliable way. Files and file metadata are aggressively cached and downloaded on demand.

Website

CVMFS Web Site

List of known issues

Support channels

Sys admin guide

User guides

CVMFS Configuration Examples

Existing fora links

CVMFS support

Site recipes

CVMFS RAL twiki

OSG Installation procedures

NIKHEF CVMFS Errors/Warnings

dCache

dCache is a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.

Website

Main dCache Web Site

List of known issues

dCache Github Issues

Support channels

GGUS (dCache Support)

Sys admin guide

dCache Sys Admin Guide

User guides

dCache User Guide

Existing fora links

dCache User Forum

Site recipes

GridPP dCache documentation

DPM

The Disk Pool Manager (DPM) is a lightweight storage solution for grid sites.

Website

DPM Twiki

LCG-DM home page

List of known issues

DPM Issues

Support channels

GGUS (DPM Development)

Sys admin guide

DPM admin guide

Existing fora links

DPM Users Forum

Site recipes

GridPP DPM Documentation

EOS

EOS is a disk-based service providing a low latency storage infrastructure for end users. EOS provides a highly-scalable hierarchical namespace implementation. Data access is provided by the XROOT and HTTPS protocols.

Website

EOS Web Site

Support channels

EOS Community

Sys admin guide

EOS Installation

User guides

EOS Client Configuration

FTS

FTS3 is the service responsible for globally distributing the majority of the LHC data across the WLCG infrastructure. It is a low level data movement service, responsible for reliable bulk transfer of files from one site to another while allowing the network resource usage to be controlled per site.

Website

FTS Web Site

List of known issues

Support channels

Sys admin guide

User guides

Existing fora links

FTS Steering Mailing list

GFAL

GFAL (Grid File Access Library) is a client-side C library that provides an abstraction layer hiding the complexity of grid storage systems.

Website

GFAL Web Site

List of known issues

GFAL Release Notes

GFAL Issues

Support channels

GGUS (Data Management Clients Development)

User guides

GFAL User Guide

Troubleshooting guide

GFAL FAQ

HTCondor CE (OSG)

An OSG Compute Element (CE) is the entry point for the OSG to your local resources: a layer of software that you install on a machine that can submit jobs into your local batch system. At the heart of the CE is the job gateway software, which is responsible for handling incoming jobs, authorizing them, and delegating them to your batch system for execution. In the past, the OSG only had one option for a job gateway solution, Globus Toolkit’s GRAM-based gatekeeper, but now offers the HTCondor CE instead.

Today in OSG, most jobs that arrive at a CE (called grid jobs) are not end-user jobs, but rather pilot jobs submitted from factories. Successful pilot jobs create and make available an environment for actual end-user jobs to match and ultimately run within the pilot job container. Eventually pilot jobs remove themselves, typically after a period of inactivity.

HTCondor CE is a special configuration of the HTCondor software designed to be a job gateway solution for the OSG. It is configured to use the JobRouter daemon to delegate jobs by transforming and submitting them to the site’s batch system.

Website

HTCondor CE Overview

List of known issues

HTCondorCE Known Issues

Support channels

OSG Support

Sys admin guide

HTCondor CE Installation and Configuration

User guides

Submitting Jobs to an HTCondor CE

Troubleshooting guide

HTCondor CE Troubleshooting Guide

StoRM

StoRM (Storage Resource Manager) is a light, scalable, flexible, high-performance, file system independent, storage manager service for generic disk based storage systems.

Website

StoRM Web Site

List of known issues

StoRM Issues

Support channels

GGUS (StoRM)

Sys admin guide

StoRM Sys Admin Guide

StoRM Advanced Configuration Examples

User guides

StoRM SRM Client

Troubleshooting guide

StoRM FAQ

StoRM Troubleshooting

Existing fora links

StoRM users

Site recipes

GridPP StoRM Documentation

UI / WN

The User Interface (UI) is the access point to the Grid Infrastructure. This can be any machine where users have a personal account and where their user certificate is installed. From the UI, the user can be authenticated and authorised to use the Grid resources and can access the functionalities offered by the Information, Workload and Data Management Systems.

The Worker Node (WN) is the computing node inside the Grid where the user's jobs are finally executed at a site. The necessary middleware components are installed on the WN. Additional software components may be needed, depending on the requirements of the VOs supported by the site.

Note: these days many VOs use UI and WN middleware provided through CVMFS, served either via their own repositories or via the generic grid.cern.ch repository.

Support channels

GGUS (EMI UI and UI WN Tarball)

Sys admin guide

UMD-4 CentOS 7 UI

UMD-4 CentOS 7 WN

VOMS

The Virtual Organization Membership Service (VOMS) is a Grid attribute authority that serves as a central repository for VO user authorization information, providing support for sorting users into group hierarchies and keeping track of their roles and other attributes. This information is used to issue the trusted attribute certificates and assertions that the Grid environment relies on for authorization.

Website

VOMS Web Site

List of known issues

VOMS Known Issues

Support channels

GGUS (VOMS and VOMS Admin)

Sys admin guide

VOMS Sys Admin Guide

User guides

VOMS Clients Guide

VOMS Admin User Guide

Existing fora links

VOMS news

Site recipes

NDGF VOMS usage notes

WLCG Information System

The grid information system provides detailed information about grid services in the interest of a multitude of grid clients and services. The grid information system has a hierarchical structure of three levels. The fundamental building block used in this hierarchy is the Berkeley Database Information Index (BDII). The resource level BDII is usually co-located with a given grid service and provides information about that service. Each grid site in EGI runs a site level BDII. This aggregates the information from all the resource level BDIIs running at that site. A top level BDII aggregates all the information from all the site level BDIIs and hence contains information about all grid services published by any site. A site may run its own top level BDII or point local clients to some other instance(s) on the grid. The information system clients query a top level BDII to find the information that they require.
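
The three-level aggregation described above can be sketched as follows. This is a structural model only: a real BDII is an LDAP server publishing information according to the GLUE schema, whereas the dictionaries, site names and host names here are hypothetical.

```python
def site_bdii(resource_bdiis):
    """Aggregate the entries of all resource-level BDIIs at one site."""
    entries = []
    for resource_entries in resource_bdiis:
        entries.extend(resource_entries)
    return entries


def top_bdii(site_bdiis):
    """Aggregate the entries of all site-level BDIIs, keyed by site name."""
    return {site: site_bdii(resources) for site, resources in site_bdiis.items()}


def find_services(top, service_type):
    """Client query against the top level: which sites publish this service type?"""
    return sorted(
        site
        for site, entries in top.items()
        if any(entry["type"] == service_type for entry in entries)
    )


# Hypothetical sites, each with one list of entries per resource-level BDII.
sites = {
    "SITE-A": [
        [{"type": "CE", "host": "ce.site-a.example"}],
        [{"type": "SE", "host": "se.site-a.example"}],
    ],
    "SITE-B": [
        [{"type": "CE", "host": "ce.site-b.example"}],
    ],
}

top = top_bdii(sites)
print(find_services(top, "SE"))
```

The client only ever queries the top level, which is why a site can either run its own top level BDII or point its clients at an instance hosted elsewhere.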

Website

Information System Web Site

List of known issues

BDII Known Issues

Support channels

GGUS (Information System Development)

Sys admin guide

BDII Sys Admin Guide

User guides

ginfo, lcg-info and lcg-infosites

Troubleshooting guide

See Troubleshooting section in the Sys Admin Guide

Existing fora links

WLCG Information System general discussions

GLUE WG

Site recipes

EGI guidelines on how to publish Site Information

EGI guidelines on how to publish OS names

EGI guidelines on how to publish System Architecture

XRootD

The XRootD software framework is a fully generic suite for fast, low-latency and scalable data access. It can natively serve any kind of data organized as a hierarchical, filesystem-like namespace based on the concept of a directory.

List of known issues

XRootD Issues

Support channels

XRootD User Support

Sys admin guide

XRootD Documentation

User guides

Python bindings for XRootD

Site recipes

OSG XRootD redirector installation

ALICE XRootD installation for sys admins

Retired middleware

CREAM CE

Website

CREAM Web Site

List of known issues

CREAM Known Issues

Support channels

GGUS (CREAM-BLAH)

Sys admin guide

CREAM Sys Admin Guide

User guides

CREAM User Guide

Troubleshooting guide

CREAM Troubleshooting Guide

User and site support

The WLCG ticketing system of choice is GGUS. Features include:

There is a persistent web link that can be quoted
All Grid supporters have access to the ticket and can comment on, reassign or solve it
Ticket updates generate automatic email notifications to anyone mentioned in or subscribed to the ticket
Tickets can be escalated to attract the attention of support units
GGUS is interfaced to other ticketing systems used in the WLCG community
It is possible to open a ticket on behalf of a 3rd party
TEAMers of the same VO co-own a ticket
Authorised ALARMers get immediate attention from T0/T1 operators by opening ALARM tickets for incident categories covered by the WLCG Memorandum of Understanding.

Reminders of GGUS features, search and reporting tools (including some specially tailored to sites), release notes and full documentation are available from the GGUS homepage. Users and supporters must be registered to use the full functionality of GGUS.

System administrators and middleware deployers often exchange expert opinions on LCG-ROLLOUT@JISCMAIL.AC.UK, a moderated mailing list with a web interface and archive.
