PhD Students Wanted

I am looking for more research students to join the Alchemy Project. This project focuses on all aspects of distributed systems - in particular, I am looking for students who are interested in fault tolerance, task scheduling, and agent-based dynamic distribution. My primary interests are to automate all aspects of distributed systems so that they become more accessible and easier to use by the average programmer.

My work is collaborative with colleagues in Japan, Wales and the United States so students must be prepared to work in a group environment to achieve a common goal. If you are intersted in postgraduate research and distributed systems is the area in which you want to work,, then I suggest that you contact me directly (michael@cs.adelaide.edu.au). A detailed project description, which forms part iof the Alchemy Project, is found below. Research students to work in this area are particularly sought after.

Project Outline

At any given moment the majority of the world's computing resources remain idle. By harnessing this resource pool, an individual computation may be completed in a fraction of the time required to execute on a single machine. This project aims to develop an agent-based framework to make this effective use of resources a reality in such a way that the individual programmer need do minimal work to secure this benefit. The user program should be distributed automatically. Load balancing of machines is necessary to ensure good overall system performance. Robustness is essential to ensure resilience against inevitable machine and network failure. These properties should also be provided automatically to ensure enhanced system performance.

Project in Detail

Using Agent-Based Systems to Support Automated Global Distribution for the Average Programmer: Effective Use of Existing Resources Without Programmer Intervention

Aims, significance and expected outcomes of the project

For large, computationally intensive parallel applications there is much to be gained from globally distributing the application across a wide variety of machines and obtaining a performance speedup. The difficulty is that the programmer wishing to execute such an application may not have the physical resources themselves, nor may they have the necessary skills to effectively distribute the application manually. This project aims to explore the "all care and no responsibility" principle of distribution whereby the average programmer does not wish to take responsibility for the physical distribution and coordination of the application but is, however, concerned with the application throughput and total execution time. The project aims to provide support to programmers for the transparent automated distribution of applications across a wide area network, with delivery of the associated benefits, without excessive programmer involvement. A separate class of programmer involved in high performance computing will, however, wish to retain a degree of control over the distribution of their application. Another aim of the project is to specialise the support provided so that additional functionality and control is provided to experienced programmers working with a computational grid [8]. A computational grid is similar to a power grid in that computers, typically high performance supercomputers, can join and leave the grid as necessary to provide computational support to massive applications. Typically the computers involved are supercomputers and may be associated with specific large data sets. In these cases, it is cheaper to move the computation to the data, rather than move the data to the computation.
This project aims to develop an agent-based [9, 10] framework to perform the distribution of the user application across the available resources and provide appropriate infrastructure support for fault tolerance to the applications programmer. To support heterogeneity, object mobility and remote data access, appropriate middleware [12] must be developed. Agents, mobile autonomous computational units, represent a novel and appropriate mechanism for the development of such a system. Such distributed software architecture lends itself well to software projects such as the Virtual Ship Project at DSTO, Adelaide.

This project is significant in that it differs from the majority of the current research by trying to hide, as much as possible, the underlying distribution mechanisms and associated issues from the programmer. The user identifies the tasks to be distributed and they are transformed into autonomous agents by the system. The intention is not the development of the optimal distribution strategy for each application, but rather an acceptable distribution in terms of total elapsed time. The issues that need to be addressed include partitioning of the application, scheduling of the tasks [2, 13], fault tolerance [1, 3, 14, 19], load balancing [11, 22] and code migration [6, 21], resource location [5], visualisation, middleware [12], communication [4, 18], security, agents [9], and support for heterogeneous environments. These issues interact in complex ways and even a semi-automated system is unlikely to be as effective as a human expert in terms of reaching an optimal solution. However, the vast majority of programmers lack the ability to undertake the distribution themselves, and even fewer to find the optimal solution. Worse still, the optimal distribution is specific to an application and, in fact, may not be realised as resources may be unavailable and network congestion etc may occur. In order to deal with such difficulties, the system proposed must be adaptive and dynamic in nature ­ an ideal application for autonomous software agents.

Current related research focuses on a single issue at a time. Typically, the options explored are not transparent to the user and, in fact, require the user to be an expert in the area in order to achieve a near-optimal result. Examples include Condor [15] and Globus [7]. Neither of these systems operates in a wide area network, and both restrict the user to a specific hardware platform, or operating system. Both systems require the user to make specific calls to library-based methods in order to utilise the software and have their code distributed. This lack of transparency and heterogeneity is common in such systems.

Another common problem is that the middleware used with the distributed system is fully exposed. Across a wide area network the two most common communication systems are CORBA [16, 17] and Java RMI [20]. These examples of middleware require the programmer to be familiar with their interface and underlying data structures. The nett result is that the middleware selected has a significant impact on the nature of the code produced by the programmer. One of the goals of this research is to hide the middleware as much as possible from the average programmer who wishes to painlessly distribute their application. However, serious users wishing to employ a computational grid will want greater control over the distribution of the application. A specific application programmer interface (API) will need to be developed for such a user. This represents a novel use of software agents within the high performance computing field.
In addition to the development of the necessary research infrastructure, specific outcomes of this grant will include a dynamic scheduling policy, basic transparent support for fault tolerance, integration of these aspects into a software agent system, and the framework for the deployment of software agents across a computational grid.

Research plan, methods and techniques

The work undertaken in the project is geographically dispersed. The expertise to undertake the work is also dispersed (Australia: fault tolerance and scheduling techniques; Japan: software agents and mobility; Wales: middleware and high performance computing; United States: high performance computing and applications).

The proposal is a collaborative research effort across four countries. The coordination of the research is therefore critical in order for it to be successful. Travel combined with fax and E-mail, will provide the necessary synergy to ensure that the research software is designed, developed and integrated to meet the project goals.

The steps involved in the project include the transformation of software tasks, identified by the applications programmer, into software agents. Normally agents are provided with an itinerary and visit specified hosts in turn. This behaviour will be modified to grant the agents the ability to determine their own mobility (dynamic scheduling). The issue of visualising the location of the agents is beyond the scope of this small grant. The criteria for agent relocation is determined by need (i.e., need to access a large remote data repository), or load (i.e., the host machine is becoming too heavily utilised and it is best to relocate). The experience with Condor has shown that effective throughput can be achieved without taking over a host machine. This ensures that a user who is prepared to allow their machine to be used by such a computation is not inconvenienced by it running on their machine.

The software agents are able to access the infrastructure and middleware to support heterogeneity, fault tolerance and agent/object location. Each of the three investigators will contribute to this aspect of the project. Since the agent is essentially an intelligent wrapper around the users task, the interaction between the agent and the support infrastructure can be made transparent to the applications programmer. However, in the case of high performance applications for a computational grid, a limited set of methods can be made available through an API to offer specialist users greater control over the distribution and behaviour of the application. Fault tolerance will be delivered via a subsystem that employs both replication and checkpointing. It is adaptive in the sense that it employs heartbeat messages and caters for varying loads on processors and the network before determining that a host has failed. The subsystem will also resolve ambiguities that arise if the agent is restarted on another machine and the host that was presumed to have failed is later found to still be active.

In order to use the distributed resources, a potential user must first register their computer with the agent system. This makes their computational resources available to other users while granting them access to other machines. The intent is that the computational resources will only be made available to other users while the computer is idle. There should be no observable degradation of performance apparent to the user of the computer. A user may deregister their machine at any time and surrender access to the distributed system. Security is clearly an issue and is handled by the existing agent technology. Task migration and failure are also clearly important issues and infrastructure will be built to deal with these problems. Within a high performance computational grid, access to large data sets, such as satellite information, may be necessary. In these cases it is necessary to migrate tasks to a specific machine and failure occurs if the resource is not available. Support for back-up services will also need to be considered.

Registration of computers with the system needs to be decentralised in order to avoid a single point of failure. Software agents will be used as the implementation technique to implement such a distributed database and propagate information regarding the availability of hosts. There may be, in fact, a number of federations operating at any one time.

The system architecture is relatively simple: the user application is transformed into an agent-based equivalent. The agents distribute themselves across the participating computers in the federate. These machines provide the necessary infrastructure support to the agents they host. All machines have access to the registry system in order to join and leave the federate. Applications aimed at a computational grid are provided access to a specialist API for additional functionality.

References

  1. M.K. Aguilera, W. Chen and S. Toueg, "Failure Detection and Consensus in the Crash-Recovery Model", Distributed Computing, Volume 13, Number 2, 2000, pp. 99­125.
  2. H. Attiya and J. Welch, Distributed Computing. Fundamentals, Simulations and Advanced Topics, McGraw-Hill, 1998.
  3. A. Avizienis, "Towards Systematic Design of Fault-Tolerant Systems", IEEE Computer, April 1997, pp. 51­58.
  4. H. Detmold, M. Hollfelder and M.J. Oudshoorn, "Ambassadors: Structured Object Mobility in Worldwide Distributed Systems", 19th IEEE International Conference on Distributed Computing Systems, Austin, Texas, May 31­June 5, 1999, pp 442­449.
  5. K. Falkner, The Provision of Relocation Transparency Through a Formalised Naming System in a Distributed Mobile Object System, PhD. Thesis, University of Adelaide, May 2000.
  6. N.J.G. Falkner and M.J. Oudshoorn. "Congress: A Dynamic Distributed Task Allocation Environment". Australian Computer Science Communications, Volume 20, Number 1, Springer, February 1998, pp. 475­488.
  7. I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit", The International Journal of Supercomputer Applications and High Performance Computing, Volume 11, Number 2, 1997, pp. 115-128.
  8. I. Foster and C. Kesselman, The Grid. Blueprint for a New Computing Infrastructure, Morgan Kaufman, 1999.
  9. R. Gray, D. Kotz, G. Cybenko and D. Rus, "Mobile Agents: Motivations and State-of-the-Art Systems", Technical Report TR2000-365, Department of Computer Science, Dartmouth College, April 2000.
  10. I.A. Hamid and D. Ramamonjisoa, "Handling Coordination in Heterogeneous Mobile Agents System", Scuola Superiore G. Reiss Romoli, SSGRR-2000, in L'Aquila, Italy, 31 July ­ 6 August, 2000. CD-ROM, 6 pages.
  11. C. Hau, J. Hau, K.G. Shin and T.K. Tsukada, "Transparent Load Sharing in Distributed Systems: Decentralised Design Alternatives Based on the Condor Package", Symposium on Reliable Distributed Systems, IEEE Computer Society Press, October 1994, pp. 202­211.
  12. K.A. Hawick, H.A. James and J.A. Mathew, "Remote Data Access in Distributed Object-Oriented Middleware". To appear in Journal of Parallel and Distributed Computing Practices, 2000.
  13. L. Huang and M.J. Oudshoorn, "Static Scheduling of Conditional Parallel Tasks", Chinese Journal of Advanced Software Research, Volume 6, Number 2, 1999, pp. 121-129.
  14. L. Lamport, R. Shostak and M. Pease, "The Byzantine Generals Problem", ACM Transactions on Programming Languages and Systems, Volume 4, Number 3, July 1982, pp. 382­401.
  15. M. Litzkow, M. Livny and M.W. Mutka, "Condor ­ a Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, June 1998, pp. 104­111.
  16. Object Management Group, Common Object Request Broker Architecture, Object Management Group Document Number 91.12.1, Revision 1.1, December 1991.
  17. R. Otte, P. Patrick and M. Roy, Understanding CORBA. The Common Object Request Broker Architecture, Prentice-Hall, 1996.
  18. M.J. Oudshoorn and H. Detmold, "Ambassadors: A Communication Structure for Mobile Java Objects", Scuola Superiore G. Reiss Romoli, SSGRR-2000, in L'Aquila, Italy, 31 July ­ 6 August, 2000. CD-ROM, 10 pages.
  19. P. Stelling, I. Foster, C. Kesselman, C. Lee, G. von Laszewski, "A Fault Detection Service for Wide Area Distributed Computations", Proceedings of the 7th IEEE Symposium on High Performance Distributed Computing, 1998, pp. 268­278.
  20. Sun Microsystems Inc. Java Remote Method Invocation Specification, 1997.
  21. Y.S. Wadge, Process Migration in Network Parallel Software Systems, M.A. Thesis, University of Texas at Austin, 1994.
  22. A. Zomaya, Parallel and Distributed Computing Handbook, McGaw-Hill, 1996.