Honours/Masters Projects in 2010
Due to leave requirements I am unable to supervise any projects commencing in semester 1.
Honours/Masters Project: General Purpose Computing on a GPU
Dr Fred Brown
The aim of this project is to investigate how best to use a high performance graphics card
to perform general purpose computing in one of a number of challenging areas.
Background
Modern graphics cards are very high performance parallel computing systems specifically
designed for rendering 3-D computer graphics.
Some of the latest cards have as many as 3,200 processors coupled with up to several gigabytes of extremely fast memory.
However, obtaining access to this computing resource and being able to sensibly employ
it in other application areas is a major challenge.
Fortunately, the major vendors are now attempting to make this computing resource more generally available using compute libraries.
The aim of this project is to take one of the available libraries and investigate how it
can be used to tackle more general purpose problems in particular,
Genetic Algrorithms and String Sorting.
Some of the key issues to be considered include:
- What kind of algorithms are suited to these parallel machines?
- How long does it take to load and unload the graphics card?
- Would it be better to employ multi-core CPUs?
Plan
investigate work on general purpose programming for GPUs
design a range of implementation experiments to evaluate a real system
conduct the experiments with and without the GPU
evaluate the performance tradeoff using a small number of test algorithms
Honours Project: Genome Indexing
Dr Fred Brown
The aim of this project is to investigate the implementation of an index
structure, the suffix tree, that is capable of indexing the entire human genome.
Constructing suffix trees larger than memory is a significant research problem.
No knowledge of genetics or biology is required.
Background
In recent times significant progress has been made in identifying the
DNA sequences that encode the genetic structure of life. One of the most
notable projects is the
Human Genome project
which is attempting to construct the complete sequence for human DNA.
A genome is the entire collection of genetical material in a particular organism. . The structure of genes, that is those sequences of DNA that encode proteins, varies significantly between organisms. In higher life forms such as man, protein coding regions are separated by long stretches of non-coding sequences, so-called "junk DNA". Changes in the DNA sequence of genes, mutations, can give rise to a genetic disease.
A number of inhertied diseases have been found to occur when the number of mutations in a particular gene exceed a certain threshold.
At present molecular analysis of DNA sequences can involve searching
previously constructured DNA sequences looking for particular genes or
gene components. Typically this is performed using linear scanning of
DNA sequences which is not the ideal when searching for dissimilar
sequences that may represent the same gene. The obvious alternative is to
build a data structure such as a suffix tree that supports the searches of
interest. Unfortunately the amount of data involved can be substantial.
In the case of the human genome, the 23 chromosones range in size from
50 million base pairs (Mbp) to 263Mbp. Given a suffix tree has 2 internal
nodes per 3 characters encoded this would result in trees of between 35 and 180
million nodes per chromosome. If an internal node contains an integer and
two 32-bit pointers, 12 bytes ignoring other housekeeping overheads,
this represents up to 2 gigabytes of tree.
However, a 32-bit pointer cannot address this much data
so the nodes may be significantly larger. It is estimated that the most compact
suffix tree representation of the entire human genome, 3Gbp, will require 45GB
of memory with compression.
This project will look at techniques that may be able to efficiently construct
and search these large data structures. Of particular interest is the possibly
of constructing multiple partitions of the trees distributed over a server cluster.
Given the potential size of the index structures they are not likely to fit
into main memory so a major focus needs to be managing access to secondary storage.
Plan
investigate work on index structures used to index genomic data
design a range of implementation experiments in constructing these indexes
conduct the experiments using data from more than one genome
evaluate searching performance using a small number of test examples
|