CNGrid GOS*: China National Grid Software

       

Abstract

CNGrid GOS is a suite of grid software with independent intellectual property rights that supports the China National Grid running environment. It is a major outcome of the China National Grid software research and development project, which is supported by the Hi-Tech Research and Development (863) Program of China. This paper introduces the architecture, functionality, and innovations of the components of CNGrid GOS: the system software, a CA certificate management system and testing environment, three business-version subsystems (high performance computing gateway, data grid, and grid workflow), and a monitoring system.

Keywords

China National Grid (CNGrid), Grid Operating System (GOS), High Performance Computing Gateway (HPCG), Data Grid (CORSAIR), Grid Workflow, Grid Monitoring System (CNGridEye).

 

China National Grid (CNGrid) is a major project supported by the Hi-Tech Research and Development (863) Program of China. CNGrid is a new-generation test bed for information infrastructure that aggregates high-performance computing and transaction processing capabilities. Through resource sharing, collaborative work, and service mechanisms, CNGrid supports applications in areas such as scientific research, resources and environment, advanced manufacturing, and information services. Through technological innovation, CNGrid promotes the construction of the national information industry and the development of related industries.

China National Grid Software, named CNGrid GOS, is a suite of grid software with independent intellectual property rights, developed by the CNGrid software R&D project team. The relationship between the CNGrid environment and the GOS software is illustrated in Figure 1.

Figure 1. The relationship between the CNGrid GOS software and the CNGrid environment

 

CNGrid GOS consists mainly of the system software, a CA certificate management system and testing environment, three business-version subsystems (high performance computing gateway, data grid, and grid workflow), and a monitoring system. The project is undertaken by seven organizations: the Institute of Computing Technology of the Chinese Academy of Sciences, Jiangnan Institute of Computing Technology, Tsinghua University, National University of Defense Technology, Beihang University, the Computer Network Information Center of the Chinese Academy of Sciences, and Shanghai Supercomputing Center.

1.  CNGrid GOS System Software

The CNGrid GOS system software (VegaGOS) provides global naming management, VO management, user management, resource management, application runtime management, and related functions. VegaGOS introduces important innovations in global naming management, distributed resource management, virtual organizations (agora), grid process (grip) technology, grid security mechanisms, and support for a variety of domain applications.

(1)     Naming. Naming is a decentralized, name-stable management system for global objects (Gnodes). It supports locating objects by globally unique identifier with low latency and a high success ratio, and searching for objects by attribute matching with low latency and a high recall ratio. Naming is a fundamental component on which the rest of VegaGOS is built. As a reusable component, it forms a global layer of virtual names that addresses the instability of physical addresses and the tight coupling between applications and resources.

(2)     Resource Management. Resources in VegaGOS take various forms and are accessed in different ways, which makes these heterogeneous resources difficult to describe and manage. The resource controller mechanism (RController) was introduced to import and manage heterogeneous resources in a unified way. RController provides resource operations such as create, destroy, access control, access, and reading and writing properties.

(3)     VO Management. The virtual organization in VegaGOS, called Agora, manages distributed resources, users, and access control policies, and provides single sign-on and a single system image. Acting as a common trusted third-party super-organization, Agora achieves unified cross-domain access control while preserving autonomy.

(4)     Grid Application Runtime Management. Grid applications need to maintain user identities at runtime so that access control can be enforced. In VegaGOS, the Grid Process technology, abbreviated as grip, not only maintains user identities and other application runtime context, but also manages the resources occupied by an application and supports collaboration among multiple applications.

Figure 2 shows the runtime architecture of VegaGOS, which illustrates the runtime interaction of the above key innovations.

Figure 2. VegaGOS application runtime management architecture

 

(5)     Application Level Tools. VegaGOS provides a range of application-level tools that support the traditional command-line mode of high-performance computing while adding grid characteristics: Portal, GShell, VegaSSH, and GOSClient. Portal provides a user-friendly web-based interface that makes VegaGOS easier to use. GShell is a grid shell, similar to the GNU bash environment, that supports running applications within a grip. VegaSSH provides single sign-on to any grid node for using the back-end high performance computing resources. GOSClient is a set of client tools, including GShell, that can be installed independently to use the VegaGOS system.
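
As a rough illustration of the Naming idea in item (1), the sketch below keeps a registry that maps a stable global identifier to a changeable physical address and supports attribute-match search. It is a single-process toy, not the VegaGOS implementation (which the text describes as decentralized); the class name `NamingRegistry`, its methods, and the example addresses are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Gnode:
    """A named global object: stable identifier, mutable physical address."""
    guid: str
    address: str
    attributes: dict = field(default_factory=dict)


class NamingRegistry:
    """Hypothetical in-memory stand-in for a global naming layer."""

    def __init__(self):
        self._nodes = {}

    def register(self, address, **attributes):
        # The identifier is generated once and never changes afterwards.
        guid = str(uuid.uuid4())
        self._nodes[guid] = Gnode(guid, address, dict(attributes))
        return guid

    def rebind(self, guid, new_address):
        # Moving a resource only updates the binding; the name stays stable.
        self._nodes[guid].address = new_address

    def locate(self, guid):
        # Lookup by global unique identifier.
        return self._nodes[guid].address

    def search(self, **wanted):
        # Attribute-match search: return GUIDs whose attributes contain
        # every requested key/value pair.
        return [
            node.guid
            for node in self._nodes.values()
            if all(node.attributes.get(k) == v for k, v in wanted.items())
        ]


registry = NamingRegistry()
solver = registry.register("https://node-a.example.org/solver", kind="service", domain="cfd")
registry.register("https://node-b.example.org/store", kind="storage")

# An application holds only the GUID, so a relocation is invisible to it.
registry.rebind(solver, "https://node-c.example.org/solver")
print(registry.locate(solver))
print(registry.search(kind="service") == [solver])
```

Because the application stores only the identifier, rebinding the address requires no change on the application side, which is the decoupling the Naming layer is meant to provide.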

 

Affiliation: Institute of Computing Technology, Chinese Academy of Sciences

Address: No.6 Kexueyuan South Road Zhongguancun, Haidian District Beijing, 100190

Telephone: 010-62600969        Fax: 010-62600900

Contact: Li ZHA                        Copywriter: Xiaoyi LU

2.  CA Certificate Management System and Testing Environment

1.           CA Certificate Management System

The CNGrid CA provides digital certificate services for all users, resources, and applications in the Grid. It issues certificates for both the testing and the production environments and supports certificate revocation and status query.

The CNGrid CA system consists of Certification Authorities (CAs) and Registration Authorities (RAs) at different levels. The system adopts a multi-certificate architecture, in which the top-level CA signs the certificates of lower-level CAs.

The servers in the CA system are organized in three layers: the Web server, the Function server, and the DB server. All servers run Linux, and PCs running Windows serve as the client management platform.
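
The hierarchy described above can be pictured as a chain of issuers ending at the top-level CA. The following sketch models only the chain walk and a revocation check; it performs no cryptography, and the `Certificate` type, the `verify_chain` function, and all names in it are illustrative assumptions rather than part of the CNGrid CA software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Certificate:
    subject: str
    issuer: Optional["Certificate"]  # None marks the self-signed top-level CA
    is_ca: bool = False


def verify_chain(cert, trusted_root, revoked, max_depth=8):
    """Walk issuer links from cert up to trusted_root.

    Returns True only if every certificate on the path is unrevoked, every
    issuer is a CA, and the path ends at the trusted top-level CA.
    Signature checking is deliberately omitted in this toy model.
    """
    current = cert
    for _ in range(max_depth):
        if current.subject in revoked:
            return False
        if current.issuer is None:
            return current == trusted_root
        if not current.issuer.is_ca:
            return False
        current = current.issuer
    return False  # chain longer than max_depth


root = Certificate("Root CA (example)", None, is_ca=True)
site_ca = Certificate("Site CA (example)", root, is_ca=True)
user = Certificate("user alice (example)", site_ca)

print(verify_chain(user, root, revoked=set()))
print(verify_chain(user, root, revoked={"Site CA (example)"}))
```

Revoking an intermediate CA invalidates every certificate below it, which is why the check is applied at each level of the walk and not only to the end-entity certificate.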

Figure 3. CNGrid CA software architecture

 

CNGrid CA software architecture is illustrated in Figure 3.

The CA system provides its full functionality through a web portal, including certificate application, generation, distribution, revocation, status query, and management. Its main functions are as follows.

(1)     Analyzing certificate application information; generating, distributing, and revoking digital certificates.

(2)     Combination of RA distributed inspection and RS centralized inspection.

(3)     Certificate store management.

(4)     User management, log management, security audit and security management.

(5)     Data backup and recovery.

(6)     System key management.

(7)     Certificate download, status query, and validation.

(8)     Standardizing certificates, including extension items for user requirements.

 

2.           Testing Environment

Widely used testing concepts and modern test management tools are applied in the CNGrid GOS integration and testing environment. By means of a rigorous software re-engineering procedure, users get a full-featured, high-performance, and reliable GOS system. To achieve these goals, GOS has been tested from all of the following perspectives.

(1)     GUI testing ensures that GOS meets the requirements specification, especially the uniformity, usability, and effectiveness of the GUI. GOS should also provide users with online help and operation tips.

(2)     Functionality testing focuses on verifying the functional requirements of GOS. Most of the test cases are automated.

(3)     Performance testing determines the response time, throughput, and level of concurrency of GOS. The performance analysis results help the system developers improve system performance.

(4)     Reliability testing evaluates how long GOS can run properly under a heavy workload (≥90%). It ensures GOS can provide reliable grid services.

(5)     Compatibility testing evaluates GOS compatibility with the OS environments, host environments and the client environments. It ensures GOS runs smoothly in specified environments.

(6)     Usability testing helps make GOS an easy-to-use and attractive software product.
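
Item (3) above names three performance metrics: response time, throughput, and concurrency. One common way to measure them is sketched below, using a thread pool to issue concurrent requests against a stand-in operation. The `fake_request` function simply sleeps; the names and figures here are placeholders and say nothing about how GOS itself was tested.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request():
    """Placeholder for a call to the system under test."""
    time.sleep(0.01)


def timed(operation):
    """Return the elapsed wall-clock time of one call, in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def measure(operation, total_requests, concurrency):
    """Run total_requests calls with `concurrency` workers and summarize."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed(operation), range(total_requests)))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "mean_response_s": statistics.mean(latencies),
        "max_response_s": max(latencies),
        "throughput_rps": total_requests / wall,
    }


report = measure(fake_request, total_requests=40, concurrency=8)
print(report)
```

Repeating the measurement at increasing concurrency levels shows where throughput stops growing and response time starts to climb, which is the kind of result developers can act on.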

Figure 4. CNGrid GOS integration and testing environment

 

Affiliation: Jiangnan Institute of Computing Technology

Address: No. 031 of P.O.Box 33, Wuxi, Jiangsu, 214083

Telephone: 0510-85155200        Fax: 0510-85155197

Contact: Hailiang WEI                Copywriter: Hailiang WEI

3.  High Performance Computing Gateway

High Performance Computing Gateway (HPCG) is a set of system services and application software built on VegaGOS to support high performance computing. HPCG integrates the computing and storage resources of more than ten computing centers in CNGrid and aims to give non-professional users a "professional" scientific computing environment. HPCG is composed of a number of related system services, plus user interfaces including a web portal, command-line interfaces, and APIs. The system services include a batch job service, file management service, message service, user-mapping service, and accounting service. By composing these services in different ways, HPCG meets users' varied high performance computing requirements. The characteristics of HPCG are as follows.

1.           Full-featured

(1)        Batch Job Service

It enables transparent job submission to multiple high-performance computing centers and provides a flexible and efficient mechanism for querying job status.

(2)        File Management Service

It enables remote file management and online editing of small files. It also supports reliable synchronous or asynchronous file transfer that adapts to firewall settings.

(3)        Accounting Service

It provides efficient resource usage accounting statistics and supports global accounting.

2.           Facilitated Integration

(1)        APIs. Rich libraries make it easy to customize high-performance computing applications.

(2)        Job Template. With HPCG's template technique, high-performance computing software resources can be imported and shared by modifying only a few XML-based template files.

3.           Friendly User Interface

(1)     It provides both a grid portal and a grid shell for scientific computing users and resource providers.

HPCG aims to address the grid batch job requirements of enterprise intranet users and to provide a feature-rich, user-friendly, and stable scientific computing environment. Figure 5 shows the HPCG deployment diagram in enterprises and computing centers.

Figure 5. HPCG deployment diagram in enterprises and computing centers
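
To make the job-template idea from this section concrete, the sketch below parses a small XML template and expands it into a batch command line. The schema (`jobTemplate`, `executable`, `param`) is invented for this illustration; the actual HPCG template format is not given in this document.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical template; element and attribute names are assumptions.
TEMPLATE = """
<jobTemplate name="example-solver">
  <executable>/opt/apps/solver/bin/solve</executable>
  <param name="input" flag="-i" required="true"/>
  <param name="cores" flag="-n" default="4"/>
</jobTemplate>
"""


def build_command(template_xml, values):
    """Expand a template into an argument list using user-supplied values."""
    root = ET.fromstring(template_xml)
    command = [root.findtext("executable").strip()]
    for param in root.findall("param"):
        name = param.get("name")
        # A user-supplied value overrides the template default.
        value = values.get(name, param.get("default"))
        if value is None:
            if param.get("required") == "true":
                raise ValueError(f"missing required parameter: {name}")
            continue
        command += [param.get("flag"), str(value)]
    return command


command = build_command(TEMPLATE, {"input": "mesh 01.dat"})
print(shlex.join(command))
```

Under this scheme, publishing a new application means writing one more template file; the submission code itself does not change.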

 

Affiliation: Institute of Computing Technology, Chinese Academy of Sciences

Address: No.6 Kexueyuan South Road Zhongguancun, Haidian District Beijing, 100190

Telephone: 010-62600966        Fax: 010-62600900

Contact: Boqun CHENG        Copywriter: Boqun CHENG

4.  Data Grid

CORSAIR is a virtual file system manager that solves the stage-in, stage-out, and data sharing problems in the Grid. CORSAIR provides data access and sharing services to users transparently: users can use data resources without knowing their physical locations and can share resources without complicated configuration. Storage resources and access control are handled by CORSAIR. Its features are as follows.

(1)     Local and remote resources are integrated and presented in a unified view.

(2)     Parallel file transfer, transfer resuming and third-party transfer are supported.

(3)     Resource management can be performed in a unified way (e.g., copy, paste, and sharing).

(4)     A keyword search service is provided for resources in CORSAIR.

(5)     Web-based community management is supported (e.g., creating and dissolving communities, adding and removing users).

CORSAIR provides public resources for all users, private storage for registered users, and shared community storage for communities. It also provides convenient management tools, with which users can manage data resources in CORSAIR as simply as local files.

CORSAIR is composed of storage services, mapping services, a management portal, and two client tools: a GUI management tool (GUI Man) and a command-line management tool (CMD Man). The system deployment is shown in Figure 6.

Figure 6. Deployment of CORSAIR
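
The location transparency described in this section amounts to a mapping from logical paths to physical storage. The sketch below shows one way such a mapping service could look, with the three storage classes mentioned above (public, private, community). The `MappingService` class, its path layout, and the storage URLs are hypothetical and are not taken from CORSAIR.

```python
from dataclasses import dataclass, field


@dataclass
class MappingService:
    """Toy logical-to-physical path mapper with simple access checks."""
    mounts: dict = field(default_factory=dict)       # logical prefix -> storage URL
    communities: dict = field(default_factory=dict)  # community -> set of members

    def mount(self, logical_prefix, storage_url):
        self.mounts[logical_prefix.rstrip("/")] = storage_url.rstrip("/")

    def can_access(self, user, logical_path):
        parts = logical_path.strip("/").split("/")
        space = parts[0]
        if space == "public":
            return True
        if len(parts) < 2:
            return False
        if space == "home":
            return parts[1] == user
        if space == "community":
            return user in self.communities.get(parts[1], set())
        return False

    def resolve(self, user, logical_path):
        """Return the physical URL for a logical path, or raise."""
        if not self.can_access(user, logical_path):
            raise PermissionError(f"{user} may not access {logical_path}")
        # Longest matching prefix wins, so nested mounts behave predictably.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if logical_path == prefix or logical_path.startswith(prefix + "/"):
                return self.mounts[prefix] + logical_path[len(prefix):]
        raise FileNotFoundError(logical_path)


svc = MappingService()
svc.mount("/public", "sftp://store1.example.org/pub")
svc.mount("/home/alice", "sftp://store2.example.org/u/alice")
svc.mount("/community/climate", "sftp://store3.example.org/c/climate")
svc.communities["climate"] = {"alice", "bob"}

print(svc.resolve("bob", "/community/climate/run1/out.nc"))
```

Clients only ever see logical paths such as `/community/climate/...`, so storage can be moved between servers by changing a mount entry.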

 

Affiliation: Tsinghua University

Address: Department of Computer Science, Tsinghua University, Beijing, 100084

Telephone: 010-62796341        Fax: 010-62797141

Contact: Yongwei WU                Copywriter: Xiaomeng HUANG

 

Affiliation: National University of Defense Technology

Address: 601 Department of Computer Science, National University of Defense Technology, Changsha, 410073

Telephone: 0731-4573639        Fax: 0731-4556089

Contact: Nong XIAO                Copywriter: Nong XIAO

5.  Grid Workflow

CNGrid GOS workflow provides a service-based, graphical workflow modeling and execution environment. It enables users to orchestrate services from distributed CNGrid nodes as workflows in a visual development environment and to monitor the execution state in a browser. The main features are as follows.

(1)     Powerful Workflow Modeling Capability. By supporting two workflow language standards, WS-BPEL and XPDL, the workflow modeler can describe both automatic scientific computing processes and workflows involving human activities, for scientific and business computing. In the latter case, people can participate in activities, observe outputs, and intervene if necessary.

(2)     Easy Access to Grid Services. With a configurable service adapter, the workflow modeler and workflow portal can connect to distributed grid nodes and provide a personalized service directory for users to view, assemble, or execute.

(3)     Process as a Reusable Service. A process deployed on a server can be reused as a service in other processes.

(4)     Pluggable, Extensible Workflow Management Console. Using a distributed management mechanism based on web plug-ins, the console provides unified monitoring and management for different workflow engines, with functions ranging from process definition category management to system configuration.

(5)     Extensible Workflow Modeler and Engine. New activities can easily be added to the workflow model, and the corresponding interpretation and execution modules can be added to the engine as plug-ins.

Figure 7. CNGrid workflow modeling and executing environment
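
Feature (3), a process reused as a service, can be illustrated with a very small engine in which both atomic services and whole processes are plain callables. This is a sketch using the standard `graphlib` module (Python 3.9+); it does not use WS-BPEL or XPDL, and `Process`, `add_step`, and the sample services are names made up for the example.

```python
from graphlib import TopologicalSorter


class Process:
    """A workflow whose steps are callables taking and returning a dict."""

    def __init__(self, name):
        self.name = name
        self._steps = {}   # step name -> callable
        self._deps = {}    # step name -> set of prerequisite step names

    def add_step(self, name, service, after=()):
        self._steps[name] = service
        self._deps[name] = set(after)
        return self

    def __call__(self, data):
        # Calling a process runs its steps in dependency order, so a
        # deployed process has the same interface as a single service.
        for step in TopologicalSorter(self._deps).static_order():
            data = self._steps[step](data)
        return data


def scale(data):
    return {**data, "value": data["value"] * 10}


def offset(data):
    return {**data, "value": data["value"] + 1}


def report(data):
    return {**data, "summary": f"result={data['value']}"}


preprocess = (
    Process("preprocess")
    .add_step("scale", scale)
    .add_step("offset", offset, after=["scale"])
)

# The inner process is used as one step of an outer process.
pipeline = (
    Process("pipeline")
    .add_step("prep", preprocess)
    .add_step("report", report, after=["prep"])
)

result = pipeline({"value": 2})
print(result)
```

Giving processes and services the same calling convention is what lets the engine nest them without special cases.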

 

Affiliation: Institute of Computing Technology, Chinese Academy of Sciences

Address: No.6 Kexueyuan South Road Zhongguancun, Haidian District Beijing, 100190

Telephone: 010-62600957        Fax: 010-62600900

Contact: Houfu LI                Copywriter: Houfu LI

 

Affiliation: Beihang University

Address: P.O.Box 7-28, No.37 Xueyuan Road, Haidian District, Beijing, 100191

Telephone: 010-82339679        Fax: 010-82339679

Contact: Chunming HU                Copywriter: Chunming HU

6.  CNGrid Monitoring System

CNGridEye is a system that provides resource monitoring and accounting services for China National Grid (CNGrid). CNGridEye collects the status of distributed, heterogeneous, and dynamic resources inside CNGrid and uses the collected information to support upper-layer processing such as job scheduling and failure detection. It also offers accounting functions based on accurate records of resource usage, supporting the daily operation and QoS enhancement of CNGrid. The architecture of CNGridEye is shown in Figure 8.

Figure 8. CNGridEye architecture

 

CNGridEye has the following features.

(1)     Using an integrated architecture to monitor cross-domain and distributed resources.

(2)     Supporting several information models to offer complete monitoring metrics at the host, cluster, node, and grid levels.

(3)     Supporting monitoring of many kinds of resources, such as hardware, networks, and services/applications, and of different job management systems, such as OpenPBS, LSF, and OAR.

(4)     Offering powerful failure detection and processing functions.

(5)     Monitoring the Grid Operating System (GOS) and helping to ensure its stable operation.

(6)     Monitoring network status between CNGrid nodes to find possible bottlenecks or failures.

(7)     Offering a powerful user interface that lets users customize different kinds of charts.

(8)     Supporting distributed accounting and flexible billing strategy.
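
Two of the functions listed above, failure detection and accounting, are easy to sketch in isolation. The example below flags nodes whose last heartbeat is older than a timeout and totals CPU-hours per user from usage records. The heartbeat scheme, the 30-second timeout, the record fields, and the flat rate are assumptions made for illustration; this document does not describe how CNGridEye implements either function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    user: str
    node: str
    cores: int
    wall_seconds: float


def failed_nodes(last_heartbeat, now, timeout=30.0):
    """Return, sorted, the nodes whose last heartbeat is older than timeout."""
    return sorted(n for n, seen in last_heartbeat.items() if now - seen > timeout)


def cpu_hours_by_user(records):
    """Aggregate core-hours per user across all nodes."""
    totals = defaultdict(float)
    for r in records:
        totals[r.user] += r.cores * r.wall_seconds / 3600.0
    return dict(totals)


def bill(cpu_hours, rate_per_cpu_hour):
    """Apply a flat per-CPU-hour rate; other strategies could be swapped in."""
    return {user: hours * rate_per_cpu_hour for user, hours in cpu_hours.items()}


# Timestamps are seconds on an arbitrary clock; values are made up.
heartbeats = {"node-a": 100.0, "node-b": 65.0, "node-c": 98.5}
records = [
    UsageRecord("alice", "node-a", cores=16, wall_seconds=1800),
    UsageRecord("alice", "node-c", cores=4, wall_seconds=3600),
    UsageRecord("bob", "node-a", cores=8, wall_seconds=900),
]

print(failed_nodes(heartbeats, now=100.0))
usage = cpu_hours_by_user(records)
print(usage)
print(bill(usage, rate_per_cpu_hour=0.5))
```

Keeping aggregation separate from pricing is one simple way to accommodate the flexible billing strategies mentioned in item (8).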

 

Affiliation: Beihang University

Address: No.37 Xueyuan Road, Haidian District, Beijing, 100191

Telephone: 010-82315908        Fax: 010-82328077

Contact: Zhongzhi LUAN        Copywriter: Zhongzhi LUAN

Relevant documentation

•  For system deployment:

GOS 3.2 Install Guide

GOS Patch Manual (3.1 to 3.2)

 

•  For system management:

GOS 3.2 GSH User Manual

HPCG 1.0.2 Admin Manual

 

•  For system development:

GOS 3.2 System Software API

GOS 3.2 Programming Tutorial

 

•  CNGrid GOS Propagation Brochure:

CNGrid GOS Propagation Brochure (Chinese)

CNGrid GOS Propagation Brochure (English)

 



* This project is supported by the Hi-Tech Research and Development (863) Program of China (Grant No. 2006AA01A106).