e-science and the grid in practice : opportunities and obstacles
Technological development and corresponding increases in computational power have enabled scientific experimentation on a scale exceeding the resources of individual groups or institutions; collaborative research programs that produce vast result sets have become both increasingly necessary and beneficial – as such results can be exploited in similar or even unrelated research, reducing duplication of effort and expense.
The purpose of e-science correlates with that of traditional science, modified in light of the factor that dominates this age – information, or data. It encompasses the hardware and software requirements for long term storage of data, for providing easy, timely and appropriate access to such data, and also the development and deployment of suitable tools for interacting with such sizeable and numerous archives.
A number of obstacles must be overcome before research can benefit from such a modification of scientific methodologies, for example through the deduction of new hypotheses from examination of large scale data sets. The future of e-science will depend on successful negotiation of these obstacles.
E-SCIENCE LANDSCAPE
Computation has become a vital tool for the advancement of modern science; there are now common processes executed during many research programs that not long ago seemed impossible. This has led to increasingly large and complex projects in many research fields – in topics as diverse as civil engineering and computer games design – as techniques such as functional modelling and graphical simulations become more widely available and applicable.
In tandem with the increased prevalence of computation comes an increased output of data. Database development has made significant advances in the last 40 years, and modern relational databases are an example of one of the most successful endeavours in computer science, both in terms of theoretical development and subsequent application of such theories [1].
During that period, many different standards have been tried and tested – some were adopted and remain today, whereas others have been retired. This is true of both hardware and software development. In considering currently available computational resources, there are some underlying standardised structures that are so ubiquitous as to be taken for granted, whereas there are yet others still in development. e-science and Grid computing represent some of those areas under development.
Some recent advances have been very obvious and beneficial to all, from research groups through to industry and even the general public. For example, the internet involves sharing data in a standard manner, over networks that are operated under standard contracts and protocols, yet the data that can be represented and the applications that operate on those representations are very diverse and become more so every day. Also, High Performance Computing (HPC – see EPCC for example [2]) investment has increased both computational capability and capacity, enabling large simulations and rapid processing of vast data inputs.
The internet has its roots in the requirements of researchers to share data between institutions, and the similar concept of sharing computation resulted first in the development of specialised HPC machines. This initial concept grew into the vision of the grid – an accessible supply of computational power, like the electrical supply grid, providing varying computational capacities as required across fluctuating demands.
Large amounts of HPC resource are dedicated to processing calculations for physicists; in fields such as astronomy, large data sets are also commonplace. The tendency for physicists to have to make use of such tools, and to consider such data sets, has made physics an ideal proving ground for the development of e-science and grid concepts. The Large Hadron Collider, for example, is expected to output petabytes of data per year [3], requiring vast amounts of computing power supplied by multiple institutions, and is the instigator for development of an entire grid middleware suite.
Traditions in biological science do not include use of computational techniques to the extent that they are used in physics. Realising the benefits of grid computing to bioinformatics opens new avenues for development of these techniques – and presents new challenges, particularly in the increased requirements for data protection and security.
Some unique and significant achievements have been claimed as paradigms of e-science success, yet it is debatable in some cases whether these truly are e-science achievements, or results of strong HPC availability. There is a clear difference between grid computing, which involves large compute clusters suited to high throughput computing, and high performance machines and their capabilities. Both, however, are methods for providing computational power, with clear requirements and intended goals.
Elements of distributed computing networks tend to be managed clusters (such as ECDF [4]), as opposed to the (perhaps more common) concept of the grid as a mechanism for making available the spare processing cycles on otherwise designated machines, such as in campus computing labs – such installations do exist [5], but they are not very common. This likely results from the fact that grid middleware is not yet highly robust or easy to administer, which has implications for future grid development.
Work remains in improving the service provided by, and easing administration of, clusters providing computational capacity; but the concept of the grid has come to incorporate the problems of data storage and access. The problems of computation and data are closely related yet unique, having particular implications for distributed as opposed to high performance computing.
Unfortunately, funding for e-science projects – and for positions to train in e-science – has reduced in the last two years. This may be due to the hope that the newly developed compute clusters, networks and other functionality will be administered by those interested in using such facilities; but this is a misapprehension of the situation : development and uptake of such techniques only serves to prove that there is a need for them. The fact that they remain incomplete and complex to operate highlights a need for specialists to work with them, otherwise the maintenance requirements of such technologies will serve only to restrain otherwise capable experts from devoting time to their particular research fields.
e-science is still very much a developmental entity – it is not yet mature and is as yet unable fully to support the requirements of advancing scientific research. This is most likely due to the shortfall in trained experts willing and able to provide much needed support. The nature of e-science as a discipline straddling the boundaries between computational and natural science can obfuscate the path to individual development as an eScientist.
To consider where such development and support is most needed, a closer analysis of some current e-science technologies and techniques is required.
E-SCIENCE CHALLENGES
The development of grid middleware such as gLite [6] within physics has resulted in prominence of the command line as a standard interface. This was not an issue for physicists used to such interactions, but represents a barrier to uptake in other domains like biological research. In order to combat this, development of some form of graphical user interface is required.
A more serious concern is that the technology is still, in some cases, unstable and difficult to maintain. Installation procedures that require many hours and repeated attempts, and possibly access to machines at a level that poses security issues, are common. Condor, for example, requires delegating root access to run some key initialisation jobs [7].
Security poses two further issues : like command line interfaces, it puts users off; also, until attacks are more commonplace, it is difficult to discern the success of the deployed solutions. Essentially, good security can only be achieved when it is virtually transparent to the user, as relying on users to implement security is a sure point of failure – and security is only as good as the weakest link [8]. Reassuringly, implementation of single sign-on with the likes of Shibboleth and institutional authentication servers is progressing well.
The basis for security systems is trust; trust must be gained and upheld in order to use a resource, whilst intrusion from untrusted sources must be thwarted. This is complicated in the research domain, where collaborations between groups that may otherwise be opposed are common – for instance two research groups at competing companies may have to implement some level of trust between the groups, whilst maintaining the otherwise hostile nature towards untrusted sources within the parent companies. Hence the concept of the virtual organisation is central to the development of e-science, and plays a key role in the authentication functions previously mentioned.
Defining the boundaries and current state of e-science faces a further complication : the concurrent development of Service Oriented Architectures (SOA) and the increasing availability of access to commercially operated compute clusters both seem to exemplify grid computing. There remains, at least for the present, a distinction between them. Although SOA allows execution of jobs on remote systems, and techniques such as xmlrpc [9] can be used to pass variables, messages or triggers between remote programs, these methods do not tend to allow for submission of wholly new, externally created jobs to those systems; this is something that remains unique to grid computing architectures (although it is also implemented in a restricted fashion on commercial compute clusters). In future, however, such a distinction may not be so easy to maintain.
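To make the distinction concrete, the following is a minimal sketch of passing variables to a remote program over XML-RPC, using the modules in Python's standard library. The `scale` procedure, the helper name and the use of the loopback address are assumptions made purely for illustration. The client can only invoke what the server has already chosen to register; it has no means of submitting a wholly new job.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def scale(values, factor):
    """Hypothetical remote procedure: multiply each value by a factor."""
    return [v * factor for v in values]


def call_scale_over_xmlrpc(values, factor):
    """Start a throwaway local server, call `scale` remotely, then shut down."""
    # Port 0 asks the operating system for any free port on the loopback interface.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    # Only procedures registered here are callable by clients.
    server.register_function(scale, "scale")
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The arguments are marshalled to XML, sent over HTTP and unmarshalled
        # on the server; the return value travels back the same way.
        with ServerProxy(f"http://127.0.0.1:{port}/") as proxy:
            return proxy.scale(values, factor)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `call_scale_over_xmlrpc([1, 2, 3], 2)` sends the list and the factor to the server and returns the scaled list. A request for any procedure that has not been registered is rejected with a fault.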
The preceding challenges represent obstacles to the future development of e-science and the grid, and they must be resolved; yet on the whole, there are well-defined (though complex) solutions and strategies in place to tackle them [10]. With continued effort and development, success can be assured; improvements already in progress to network infrastructures, security mechanisms, and grid software architectures [11] will continue to advance grid computing towards the computing-capacity-on-tap paradigm – though there is still the matter of the nature of the data itself to consider.
THE DATA CHALLENGE
The key partner to grid computing for the future of e-science is data handling. This should be abundantly clear, given that the aim of e-science is to support research through providing new means for exploiting scientific data; unfortunately the scientific data sets that represent the inputs to computational processes are rarely simple, and are often very large in size; the result data, too, can be of considerable magnitude. In many cases, the storage requirements and related access and transport methods for such large archives will present significantly greater complications than the task of running a computation on the readily available data.
Given the scalability of compute clusters, networks and grid infrastructures, available computing power will continue to increase. But as computing power grows, so too does the ability to generate data – yet the ability to store, access and process that data does not automatically grow with it. Therefore, data handling could be the source of major issues for the future of e-science.
Storage
There will very soon be active experiments in physics and biology producing on the order of petabytes of data per year [3]. At that rate, within the first year, these data collections will become the largest ever to have existed in the world. It is essential both that the hardware exists to hold all this data and that the archiving requirements are met. The number of disks required brings serious concerns for reliability, both in terms of machine uptime and in the quality of the data provided – for example, if a drive is failing, there is a possibility that a query on the data could result in an incorrect value being supplied. Also, provision must be made for reliable backup – complicated by the fact that, with so many disks involved (e.g. in a RAID array), it is possible that another failure could occur before recovery from the previous one is complete.
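The risk of a further failure during recovery can be estimated with a short calculation. The sketch below assumes that disks fail independently at a constant rate, which is a simplification (real failures are often correlated), and the function name and all parameter values are hypothetical.

```python
import math


def second_failure_probability(remaining_disks, annual_failure_rate, rebuild_hours):
    """Probability that at least one remaining disk fails during a rebuild.

    Assumes independent disks with a constant (exponential) failure rate.
    """
    hours_per_year = 24 * 365
    # Convert the annual failure probability into a per-hour hazard rate.
    hazard_per_hour = -math.log(1.0 - annual_failure_rate) / hours_per_year
    # Chance that none of the remaining disks fails within the rebuild window.
    survival = math.exp(-remaining_disks * hazard_per_hour * rebuild_hours)
    return 1.0 - survival
```

With, say, 999 remaining disks, an assumed 3% annual failure rate and a 24 hour rebuild, `second_failure_probability(999, 0.03, 24)` comes out at roughly 8%. All three figures are illustrative rather than taken from any particular installation, but they show how quickly the risk grows with the number of disks.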
Access
Massive data sets can bring increased complexity for archiving and curation. If a relational database were required, one containing only tens of petabytes would be the largest ever created, and this is not outwith scope. Add to that the issue of different standards of SQL being in use, along with the fact that different databases, even for similar experiments with comparable outputs, may have very different schemas, and compatibility very quickly becomes a monstrous task. In addition, the command line interface is not considered a modern user-friendly approach; thus, access portals must be developed, and methods for rapid development of such portals would be beneficial, e.g. Rapid [12]. Furthermore, as this type of data storage and access increases in popularity in fields where data protection and security are serious factors, ensuring that such requirements are adhered to by any available views onto data is very important – particularly when the internet is such a ubiquitous tool that online access would be a user expectation.
Transport
Once a data collection is in place, and a method for access is available, the effects of transporting the data must be considered. Whilst higher bandwidth networks are well under development, and techniques to utilise extra availability are being produced, it is still easy to concoct a situation where a required data set may be large enough to significantly affect the network, and too large for the new location. Network latency is a prime concern, as delay in receiving data can limit computational speed, presenting a bottleneck. Thus, transport of data must be developed further and closely monitored. Further complications arise when intellectual property rights (IP) and data protection (for example in medical research) are of concern; any data licensing policies or copying limitations must be honoured – which can be confusing when data may exist in multiple locations across varying geographical and legal domains.
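A rough sense of scale helps here. The following sketch estimates how long a bulk transfer would occupy a link; the link speed, the decimal definition of a petabyte and the `efficiency` parameter (the usable fraction of nominal bandwidth) are all assumptions for illustration, and latency and protocol overheads are ignored.

```python
def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move size_bytes over a link at the given usable fraction."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds in a day


PETABYTE = 10**15          # decimal petabyte, in bytes
TEN_GIGABIT = 10 * 10**9   # assumed link speed, in bits per second
```

Under these assumptions, `transfer_days(PETABYTE, TEN_GIGABIT)` is a little over nine days of a fully saturated link, and about twice that if only half the bandwidth is usable. Moving the computation to the data may therefore be preferable to moving the data.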
Solutions
There are a number of possible solutions to some of the data handling problems. Software tools such as OGSA-DAI [13] have been developed in an attempt to enable transparent compatibility between diverse data structures. Also, standardisation in some areas has improved, with agreements within research fields leading to coherent policies for data curation, for example with CSML [14]. This is a key factor – exploitation of scientific data depends on harmonisation of data.
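The idea of harmonisation can be illustrated on a very small scale. The sketch below is not how OGSA-DAI or CSML work; it is a hypothetical example using Python's built-in sqlite3 module, in which two invented tables record the same quantity under different column names and units, and a single view presents them in one agreed form.

```python
import sqlite3


def harmonised_readings():
    """Expose two differently shaped tables through one common view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical experiment A: temperature in kelvin.
        CREATE TABLE run_a (sample_id TEXT, temp_kelvin REAL);
        INSERT INTO run_a VALUES ('a1', 293.15), ('a2', 300.15);

        -- Hypothetical experiment B: same quantity, other names and units.
        CREATE TABLE run_b (id TEXT, temperature_c REAL);
        INSERT INTO run_b VALUES ('b1', 25.0);

        -- Common view: one naming scheme, one unit (degrees Celsius).
        CREATE VIEW readings AS
            SELECT 'A' AS source, sample_id AS sample,
                   temp_kelvin - 273.15 AS temp_c
            FROM run_a
            UNION ALL
            SELECT 'B' AS source, id AS sample, temperature_c AS temp_c
            FROM run_b;
        """
    )
    rows = conn.execute(
        "SELECT source, sample, temp_c FROM readings ORDER BY source, sample"
    ).fetchall()
    conn.close()
    return rows
```

A query against `readings` no longer needs to know which experiment a row came from or which unit it was recorded in. The hard part in practice is reaching agreement on the common form, which is why standardisation within research fields matters.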
In terms of hardware, distributed file systems such as AFS [15] have been available for some time, and do provide solutions to some problems at a cluster level. Advancements to such concepts by Google [16] and Hadoop [17] represent interesting solutions for the larger scale. Additionally, Storage Resource Brokers (SRB) are available on large machines at various sites – however their scale does not yet reach that which may soon be required, e.g. many petabytes of storage across numerous data farms.
THE FUTURE OF E-SCIENCE
There is no question that e-science will survive – the benefits gained in research to date far outweigh the difficulties faced, and such developments will not be abandoned as long as computational power remains available. It is a question of how well it will be supported as it becomes an essential tool across all fields of scientific research.
Whilst recent developments have been successful in improving accessibility of e-science techniques and offering more to researchers outside of the traditional areas of physics, it is important not to expect too much too soon – the idea that technologies such as the internet are modern and sudden inventions is incorrect. A long process of use in particular research fields, of improvement and refinement, of development of trust and standardisation is required before something that is useful and easy to use emerges.
The development of SOA and commercial cloud computing [18] is cause for optimism. The development of opportunities to serve computational power in a high throughput manner, and to measure and bill for that service, will result in growth and increasing maturity.
Work remains to be done on security, robustness of grid architectures, easing administration and user interaction, and network support. But it is in data handling that many issues remain – at a technological, theoretical, legal and cultural level [19]. It is here that further development is critical, and where timely advances could lead to significant advantages for those that make them.
The advent of e-science represents a recognition that this cannot continue to be done by scientists – they must focus their talents on their own areas of expertise. Just as an excessive push for staff efficiencies can backfire by requiring experts to waste time on tasks they are not best equipped for, so too can placing too much emphasis on scientific research groups supporting their own e-science requirements. Similarly, computer scientists are not necessarily suited to – nor interested in – supporting scientific research, as they have their own research interests to pursue.
e-science represents an increasing middle ground between these two disciplines. In order to commit to developing and taking advantage of the required infrastructure, there must be a way for individuals to progress in the field of e-science. Unfortunately it is not yet clear what that path should be.
The requirements for data handling are, in some sense, an issue for data curation; yet traditional curation or librarianship career paths do not cater to this [19]. Previous dedicated investments in developing e-science expertise may have been reduced too soon – funding councils now offer fewer e-science-specific funding options [20]. Whilst continued investment in scientific research will filter into development of grid infrastructures, the seat of responsibility for developing the required data handling solutions is less clear cut. Therefore, it is critical that continued investment be made in this area, in order to ensure that the expertise exists to support the future eScientific requirements of all scientific research groups.
REFERENCES
[1] R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd ed. McGraw-Hill, 2003.
[2] EPCC : www.epcc.ed.ac.uk
[3] The vision for networks, data storage systems and compute capability : www.nesc.ac.uk/documents/OSI/compute.pdf
[4] ECDF : www.ecdf.ed.ac.uk
[5] Condor harvests unused campus computing power : www.ualberta.ca/AICT/RESEARCH/Condor/Articles/condor-intro.html
[6] gLite : glite.web.cern.ch/glite
[7] Condor : www.cs.wisc.edu/condor
[8] OMII grid security technology overview : www.omii.ac.uk/dissemination/SecurityOverview.pdf
[9] xmlrpc : www.xmlrpc.com
[10] Study of User Priorities for e-Infrastructure for e-Research (SUPER) : www.nesc.ac.uk/technical_papers/UKeS-2007-01.pdf
[11] AAA, Middleware and DRM : www.nesc.ac.uk/documents/OSI/aaa.pdf
[12] Rapid : https://research.nesc.ac.uk/rapid
[13] OGSA-DAI : www.ogsadai.org.uk
[14] CONFIDENTIAL – not yet published.
[15] AFS : www.openafs.org
[16] GFS : labs.google.com/papers/gfs.html
[17] Hadoop : hadoop.apache.org/core
[18] Amazon EC2 : aws.amazon.com/ec2
[19] Report of the working group on search and navigation : www.nesc.ac.uk/documents/OSI/search.pdf
[20] STFC grants on the web : www.scitech.ac.uk/gow/intro.asp