Honours/Masters Project: General Purpose Computing on a GPU
Dr Fred Brown
The aim of this project is to investigate how best to use a high performance graphics card
to perform general purpose computing in one of a number of challenging areas.
Background
Modern graphics cards are very high performance parallel computing systems specifically
designed for rendering 3-D computer graphics.
Some of the latest cards have as many as 1,600 processors coupled with up to a gigabyte of extremely fast memory.
However, obtaining access to this computing resource and being able to sensibly employ
it in other application areas is a major challenge.
Fortunately, the major vendors are now attempting to make this computing resource more generally available using compute libraries.
The aim of this project is to take one of the available libraries and investigate how it
can be used to tackle more general purpose problems in particular,
Genetic Algrorithms and String Sorting.
Some of the key issues to be considered include:
- What kind of algorithms are suited to these parallel machines?
- How long does it take to load and unload the graphics card?
- Would it be better to employ multi-core CPUs?
Plan
investigate work on general purpose programming for GPUs
design a range of implementation experiments to evaluate a real system
conduct the experiments with and without the GPU
evaluate the performance tradeoff using a small number of test algorithms
Masters Project: Data Structures for Bioinformatics
Dr Fred Brown and Dr Ute Baumann
The aim of this project is to investigate the practical application of a particular data structure, the generalised
suffix tree, to the problem of identifying DNA sequences that control gene expression.
No knowledge of genetics or biology is required.
The identification and interpretation of any biologically significant results is not part of this project.
Background
In recent times significant progress has been made in identifying the DNA sequences that encode the genetic structure of life. One of the most notable projects is the
Human Genome project
which is attempting to construct the complete sequence for human DNA.
A genome is the entire collection of genetical material in a particular organism. The structure of genes, that is those sequences of DNA that encode proteins, varies significantly between organisms. In higher life forms such as man, protein coding regions are separated by long stretches of non-coding sequences, so-called "junk DNA". Changes in the DNA sequence of genes, mutations, can give rise to a genetic disease.
A number of inhertied diseases have been found to occur when the number of mutations in a particular gene exceed a certain threshold.
At present molecular analysis of DNA sequences can involve searching previously constructured DNA sequences looking for
particular genes or gene components. Typically this is performed using linear scanning of DNA sequences which is not the
ideal when searching for short sequence motifs. The obvious alternative is to build a
data structure such as a suffix tree that supports the searches of interest. A particular feature of a suffix tree is that, if it contains multiple sequences, it can effectively search the multiple sequences in parallel with a single traversal of the tree. A suffix tree that encodes more than one sequence is called a generalised suffix tree.
This project will look at constructing a generalised suffix tree that contains all gene promoter regions for a particular organism and attempt to identify particular patterns of interest.
A promoter is the region directly adjacent to the protein-coding region of a gene. The presence of short sequence motifs within the promoter controls where and when a gene is allowed to function in the life-time of an organism.
The data set of interest consists of 30,000 sequences of about 1,000 characters for each promoter. The analysis of sequences is a biology research project in its own right. Simply enumerating the different motifs that may be found would generate over a terabyte of data so a more resource friendly searching and analysis approach must be developed.
Plan
investigate work on index structures and algorithms used to identify gene promotor regions in genomic data
investigate suffix tree implementation approaches
design a range of implementation experiments in constructing generalsed suffix trees
conduct the experiments using the example data set
evaluate the performance of the implementation noting any biologically significant results
Honours Project: Genome Indexing
Dr Fred Brown
The aim of this project is to investigate the implementation of an index
structure, the suffix tree, that is capable of indexing the entire human genome.
Constructing suffix trees larger than memory is a significant research problem.
No knowledge of genetics or biology is required.
Background
In recent times significant progress has been made in identifying the
DNA sequences that encode the genetic structure of life. One of the most
notable projects is the
Human Genome project
which is attempting to construct the complete sequence for human DNA.
A genome is the entire collection of genetical material in a particular organism. . The structure of genes, that is those sequences of DNA that encode proteins, varies significantly between organisms. In higher life forms such as man, protein coding regions are separated by long stretches of non-coding sequences, so-called "junk DNA". Changes in the DNA sequence of genes, mutations, can give rise to a genetic disease.
A number of inhertied diseases have been found to occur when the number of mutations in a particular gene exceed a certain threshold.
At present molecular analysis of DNA sequences can involve searching
previously constructured DNA sequences looking for particular genes or
gene components. Typically this is performed using linear scanning of
DNA sequences which is not the ideal when searching for dissimilar
sequences that may represent the same gene. The obvious alternative is to
build a data structure such as a suffix tree that supports the searches of
interest. Unfortunately the amount of data involved can be substantial.
In the case of the human genome, the 23 chromosones range in size from
50 million base pairs (Mbp) to 263Mbp. Given a suffix tree has 2 internal
nodes per 3 characters encoded this would result in trees of between 35 and 180
million nodes per chromosome. If an internal node contains an integer and
two 32-bit pointers, 12 bytes ignoring other housekeeping overheads,
this represents up to 2 gigabytes of tree.
However, a 32-bit pointer cannot address this much data
so the nodes may be significantly larger. It is estimated that the most compact
suffix tree representation of the entire human genome, 3Gbp, will require 45GB
of memory with compression.
This project will look at techniques that may be able to efficiently construct
and search these large data structures. Of particular interest is the possibly
of constructing multiple partitions of the trees distributed over a server cluster.
Given the potential size of the index structures they are not likely to fit
into main memory so a major focus needs to be managing access to secondary storage.
Plan
investigate work on index structures used to index genomic data
design a range of implementation experiments in constructing these indexes
conduct the experiments using data from more than one genome
evaluate searching performance using a small number of test examples
|