I am looking for more research students to join the Alchemy Project. This project focuses on all aspects of distributed systems - in particular, I am looking for students who are interested in fault tolerance, task scheduling, and agent-based dynamic distribution. My primary interests are to automate all aspects of distributed systems so that they become more accessible and easier to use by the average programmer.
My work is collaborative with colleagues in Japan, Wales and the United States so students must be prepared to work in a group environment to achieve a common goal. If you are intersted in postgraduate research and distributed systems is the area in which you want to work,, then I suggest that you contact me directly (michael@cs.adelaide.edu.au). A detailed project description, which forms part iof the Alchemy Project, is found below. Research students to work in this area are particularly sought after.
At any given moment the majority of the world's computing resources remain idle. By harnessing this resource pool, an individual computation may be completed in a fraction of the time required to execute on a single machine. This project aims to develop an agent-based framework to make this effective use of resources a reality in such a way that the individual programmer need do minimal work to secure this benefit. The user program should be distributed automatically. Load balancing of machines is necessary to ensure good overall system performance. Robustness is essential to ensure resilience against inevitable machine and network failure. These properties should also be provided automatically to ensure enhanced system performance.
For large, computationally intensive parallel applications
there is much to be gained from globally distributing the application
across a wide variety of machines and obtaining a performance
speedup. The difficulty is that the programmer wishing to execute
such an application may not have the physical resources themselves,
nor may they have the necessary skills to effectively distribute
the application manually. This project aims to explore the "all
care and no responsibility" principle of distribution whereby
the average programmer does not wish to take responsibility for
the physical distribution and coordination of the application
but is, however, concerned with the application throughput and
total execution time. The project aims to provide support to programmers
for the transparent automated distribution of applications across
a wide area network, with delivery of the associated benefits,
without excessive programmer involvement. A separate class of
programmer involved in high performance computing will, however,
wish to retain a degree of control over the distribution of their
application. Another aim of the project is to specialise the support
provided so that additional functionality and control is provided
to experienced programmers working with a computational grid [8].
A computational grid is similar to a power grid in that computers,
typically high performance supercomputers, can join and leave
the grid as necessary to provide computational support to massive
applications. Typically the computers involved are supercomputers
and may be associated with specific large data sets. In these
cases, it is cheaper to move the computation to the data, rather
than move the data to the computation.
This project aims to develop an agent-based [9, 10] framework
to perform the distribution of the user application across the
available resources and provide appropriate infrastructure support
for fault tolerance to the applications programmer. To support
heterogeneity, object mobility and remote data access, appropriate
middleware [12] must be developed. Agents, mobile autonomous computational
units, represent a novel and appropriate mechanism for the development
of such a system. Such distributed software architecture lends
itself well to software projects such as the Virtual Ship Project
at DSTO, Adelaide.
This project is significant in that it differs from the majority of the current research by trying to hide, as much as possible, the underlying distribution mechanisms and associated issues from the programmer. The user identifies the tasks to be distributed and they are transformed into autonomous agents by the system. The intention is not the development of the optimal distribution strategy for each application, but rather an acceptable distribution in terms of total elapsed time. The issues that need to be addressed include partitioning of the application, scheduling of the tasks [2, 13], fault tolerance [1, 3, 14, 19], load balancing [11, 22] and code migration [6, 21], resource location [5], visualisation, middleware [12], communication [4, 18], security, agents [9], and support for heterogeneous environments. These issues interact in complex ways and even a semi-automated system is unlikely to be as effective as a human expert in terms of reaching an optimal solution. However, the vast majority of programmers lack the ability to undertake the distribution themselves, and even fewer to find the optimal solution. Worse still, the optimal distribution is specific to an application and, in fact, may not be realised as resources may be unavailable and network congestion etc may occur. In order to deal with such difficulties, the system proposed must be adaptive and dynamic in nature an ideal application for autonomous software agents.
Current related research focuses on a single issue at a time. Typically, the options explored are not transparent to the user and, in fact, require the user to be an expert in the area in order to achieve a near-optimal result. Examples include Condor [15] and Globus [7]. Neither of these systems operates in a wide area network, and both restrict the user to a specific hardware platform, or operating system. Both systems require the user to make specific calls to library-based methods in order to utilise the software and have their code distributed. This lack of transparency and heterogeneity is common in such systems.
Another common problem is that the middleware used with the
distributed system is fully exposed. Across a wide area network
the two most common communication systems are CORBA [16, 17] and
Java RMI [20]. These examples of middleware require the programmer
to be familiar with their interface and underlying data structures.
The nett result is that the middleware selected has a significant
impact on the nature of the code produced by the programmer. One
of the goals of this research is to hide the middleware as much
as possible from the average programmer who wishes to painlessly
distribute their application. However, serious users wishing to
employ a computational grid will want greater control over the
distribution of the application. A specific application programmer
interface (API) will need to be developed for such a user. This
represents a novel use of software agents within the high performance
computing field.
In addition to the development of the necessary research infrastructure,
specific outcomes of this grant will include a dynamic scheduling
policy, basic transparent support for fault tolerance, integration
of these aspects into a software agent system, and the framework
for the deployment of software agents across a computational grid.
The work undertaken in the project is geographically dispersed. The expertise to undertake the work is also dispersed (Australia: fault tolerance and scheduling techniques; Japan: software agents and mobility; Wales: middleware and high performance computing; United States: high performance computing and applications).
The proposal is a collaborative research effort across four countries. The coordination of the research is therefore critical in order for it to be successful. Travel combined with fax and E-mail, will provide the necessary synergy to ensure that the research software is designed, developed and integrated to meet the project goals.
The steps involved in the project include the transformation of software tasks, identified by the applications programmer, into software agents. Normally agents are provided with an itinerary and visit specified hosts in turn. This behaviour will be modified to grant the agents the ability to determine their own mobility (dynamic scheduling). The issue of visualising the location of the agents is beyond the scope of this small grant. The criteria for agent relocation is determined by need (i.e., need to access a large remote data repository), or load (i.e., the host machine is becoming too heavily utilised and it is best to relocate). The experience with Condor has shown that effective throughput can be achieved without taking over a host machine. This ensures that a user who is prepared to allow their machine to be used by such a computation is not inconvenienced by it running on their machine.
The software agents are able to access the infrastructure and middleware to support heterogeneity, fault tolerance and agent/object location. Each of the three investigators will contribute to this aspect of the project. Since the agent is essentially an intelligent wrapper around the users task, the interaction between the agent and the support infrastructure can be made transparent to the applications programmer. However, in the case of high performance applications for a computational grid, a limited set of methods can be made available through an API to offer specialist users greater control over the distribution and behaviour of the application. Fault tolerance will be delivered via a subsystem that employs both replication and checkpointing. It is adaptive in the sense that it employs heartbeat messages and caters for varying loads on processors and the network before determining that a host has failed. The subsystem will also resolve ambiguities that arise if the agent is restarted on another machine and the host that was presumed to have failed is later found to still be active.
In order to use the distributed resources, a potential user must first register their computer with the agent system. This makes their computational resources available to other users while granting them access to other machines. The intent is that the computational resources will only be made available to other users while the computer is idle. There should be no observable degradation of performance apparent to the user of the computer. A user may deregister their machine at any time and surrender access to the distributed system. Security is clearly an issue and is handled by the existing agent technology. Task migration and failure are also clearly important issues and infrastructure will be built to deal with these problems. Within a high performance computational grid, access to large data sets, such as satellite information, may be necessary. In these cases it is necessary to migrate tasks to a specific machine and failure occurs if the resource is not available. Support for back-up services will also need to be considered.
Registration of computers with the system needs to be decentralised in order to avoid a single point of failure. Software agents will be used as the implementation technique to implement such a distributed database and propagate information regarding the availability of hosts. There may be, in fact, a number of federations operating at any one time.
The system architecture is relatively simple: the user application is transformed into an agent-based equivalent. The agents distribute themselves across the participating computers in the federate. These machines provide the necessary infrastructure support to the agents they host. All machines have access to the registry system in order to join and leave the federate. Applications aimed at a computational grid are provided access to a specialist API for additional functionality.