|
NeoBio API | ||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object | +--neobio.alignment.PairwiseAlignmentAlgorithm | +--neobio.alignment.CrochemoreLandauZivUkelson
This abstract class is the superclass of both global and local sequence alignment algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and Michal Ziv-Ukelson (2002).
This implementation derives from the paper of M.Crochemore, G.Landau and M.Ziv-Ukelson, A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices (available here as PDF or Postscript).
It employs Lempel-Ziv compression techniques to speed up the classic dynamic programmin approach to sequence alignment (see NeedlemanWunsch and SmithWaterman classes). It reduces the time and space complexity from O(n2) to O(n2/log n). In fact, the complexity is actually O(h n2/log n), where 0 <= h <= 1 is a real number denoting the entropy of the text (a measure of how compressible it is). This means that, the more compressible a sequence is, the less memory the algorithm will require, and the faster it will run.
The idea behind this improvement is to identify repetitions in the sequences and reuse the computation of their alignments. The first step is, therefore, to parse the sequences into LZ78-like factors. LZ78 is a popular compression algorithm of the Lempel-Ziv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished by the FactorSequence class (for more information about this factorisation, please refer to the specification of that class) that builds a doubly-linked list of factors. Each factor is an instance of the Factor class (refer to the specification of this class for more information).
Once the sequences have been parsed, the algorithm builds a matrix of blocks, called
block table, that is vaguely similar to the dynamic programming matrix used by both
NeedlemanWunsch
and SmithWaterman
. Each block contains an
instance of an AlignmentBlock (please refer to its specification for more
information on what information is stored) and represents the alignment beteween one
factor of each sequence. This block table is, in fact, a partition of the alignment
graph.
Consider a block B which corresponds to the alignment of factor F1 = xa from sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of S1 with character a, while F2 extends a previous factor of S2 with character b. We can define the input border of B as the set of values at the left and top borders of block B, and the output border as the set of values at the right and bottom borders of B. Moreover, we can define the following prefix blocks of B:
Note that each factor has a pointer to its prefix factor, called ancestor (see the
specification of the Factor
class). This pointer makes it easy to retrieve
any of the three prefix blocks of B in constant time.
Rather than computing each value of the alignment block B, the algorithm will only compute the values on its input and output borders. This is precisely what makes it more efficient.
In this class there is a general specification of how the block table is computed
(see the computeBlockTable
method for details). The actual
method depends on the subclasses. In general, there are two phases:
In fact, for each block, only one column of the DIST matrix needs to be computed,
all other columns are actually retrieved from its prefix blocks. This is precisely what
is accomplished by the assembleDistMatrix
method found in
this class (it is general enough for both global and local alignment versions of the
algorithm.
From the DIST matrix, we obtain the OUT matrix defined as
OUT[i,j] = I[i] + DIST[i,j]
where I is the input border array. This means
that the OUT matrix is the DIST matrix updated by the input border of a block. The
output border is then computed from the OUT matrix by taking the maximum value of each
column. This class also have a general method for assembling the input border (see
assembleInputBorder
The OUT matrix is encoded by the OutMatrix class that takes as both a DIST matrix and an input border array. Note that it does not compute the OUT matrix, it just stores the necessary information to retrieve a value at any position of the matrix.
A naive approach to compute the output border of a block from the OUT matrix of size n x n would take a time proportional to n2. However, it happens that, due to the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which implies that they are also totally monotone. This property allows the computation of the output border of B in linear time with the SMAWK algorithm (see the specification of the Smawk class for more information on SMAWK).
This class contains a general specification that is pertinent to both global and local versions of the algorithm. For more information on each version of, please refer to the appropriate subclass.
A note about the performance of these algorithms. Although theoretical results suggest that these algorithms are faster and consume less memory than the classical methods, in practice it is hard to realise their performance gains.
These algorithms are extremely complex and require the storage of many extra
pointers and other auxiliary data for each block (see the AlignmentBlock
class for more details). Hence, even though the space requirement is
O(n2/log n), which is less than O(n2), in practice, for most of
the cases these algorithms will take more time and memory space than their clasical
counterparts (we have to keep in mind that the Big Oh notation ignores all constants
involved).
Therefore, in order to realise the full power of these algorithms, they have to be used with extremly large and redundant sequences. This will allow a proper reutilisation of the computations and, maybe, provide an improvement in terms of space and run time. For instance, it is easy to devise such a sequence if we use a one-character alphabet because, in this case, a sequence is factorised into a series of factors that are a prefix of the next.
CrochemoreLandauZivUkelsonGlobalAlignment
,
CrochemoreLandauZivUkelsonLocalAlignment
,
NeedlemanWunsch
,
SmithWaterman
,
FactorSequence
,
AlignmentBlock
,
OutMatrix
,
Smawk
,
computeBlockTable()
,
assembleDistMatrix(neobio.alignment.AlignmentBlock, int, int, int, int)
Field Summary | |
protected AlignmentBlock[][] |
block_table
The block table, which is a matrix of alignment blocks where each block represents the alignment between one factor of each sequence. |
protected static byte |
DIAGONAL_DIRECTION
A constant that indicates that the diagonal direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. |
protected static byte |
LEFT_DIRECTION
A constant that indicates that the left direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. |
protected int |
num_cols
Number of columns of the block table. |
protected int |
num_rows
Number of rows of the block table. |
protected OutMatrix |
out_matrix
An instance of the OutMatrix class that encodes the OUT matrix of a
block when supplied with the DIST matrix and the input border array of a block.
|
protected FactorSequence |
seq1
The first factorised sequence being aligned. |
protected FactorSequence |
seq2
The second factorised sequence being aligned. |
protected Smawk |
smawk
An instance of the Smawk class that implements the SMAWK algorithm to
compute the column maxima of a totally monotone matrix. |
protected static byte |
STOP_DIRECTION
A constant that indicates that the source of an optimal path has been reached in a block and that the trace back procedure to retrieve a high scoring alignment can stop. |
protected static byte |
TOP_DIRECTION
A constant that indicates that the top direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. |
Fields inherited from class neobio.alignment.PairwiseAlignmentAlgorithm |
alignment, APPROXIMATE_MATCH_TAG, GAP_CHARACTER, GAP_TAG, MATCH_TAG, MISMATCH_TAG, score, score_computed, scoring, sequences_loaded, use_match_tag |
Constructor Summary | |
CrochemoreLandauZivUkelson()
|
Method Summary | |
protected int[][] |
assembleDistMatrix(AlignmentBlock block,
int dim,
int r,
int c,
int lc)
Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix blocks. |
protected int[] |
assembleInputBorder(int dim,
int r,
int c,
int lr)
Assembles the input border of a block by retrieving the values at the output borders of the left and top blocks. |
protected abstract PairwiseAlignment |
buildOptimalAlignment()
Retrieves an optimal alignment between the loaded sequences. |
protected void |
computeBlockTable()
Computes the block table. |
protected PairwiseAlignment |
computePairwiseAlignment()
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to retrieve an
optimal alignment between the loaded sequences. |
protected int |
computeScore()
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to locate the
score of the highest scoring alignment between the two sequences in the block
table. |
protected abstract AlignmentBlock |
createBlock(Factor factor1,
Factor factor2,
int row,
int col)
Computes a block of the block table, which corresponds to an alignment block between one factor of sequence 1 and one factor of sequence 2. |
protected abstract AlignmentBlock |
createFirstColumnBlock(Factor factor1,
Factor factor2,
int row)
Computes a block at the first column (column zero) of the block table, which corresponds to an alignment block between one factor of sequence 1 and an empty string. |
protected abstract AlignmentBlock |
createFirstRowBlock(Factor factor1,
Factor factor2,
int col)
Computes a block at the first row (row zero) of the block table, which corresponds to an alignment block between one factor of sequence 2 and an empty string. |
protected abstract AlignmentBlock |
createRootBlock(Factor factor1,
Factor factor2)
Computes the root block of the block table. |
protected AlignmentBlock |
getDiagonalPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the diagonal prefix of a block from the block table. |
protected AlignmentBlock |
getLeftPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the left prefix of a block from the block table. |
protected AlignmentBlock |
getTopPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the top prefix of a block from the block table. |
protected void |
loadSequencesInternal(java.io.Reader input1,
java.io.Reader input2)
Loads sequences into FactorSequence instances. |
protected abstract int |
locateScore()
Locates the score of the highest scoring alignment between the two sequences in the block table after is thas been computed. |
protected void |
traverseBlock(AlignmentBlock block,
int source,
java.lang.StringBuffer gapped_seq1,
java.lang.StringBuffer tag_line,
java.lang.StringBuffer gapped_seq2)
Traverses a block to retrieve a part of an optimal alignment from the specified source in the output border to an entry in the input border. |
protected void |
unloadSequencesInternal()
Frees pointers to loaded sequences and the the block table so that their data can be garbage collected. |
Methods inherited from class neobio.alignment.PairwiseAlignmentAlgorithm |
getPairwiseAlignment, getScore, loadSequences, max, max, max, scoreDeletion, scoreInsertion, scoreSubstitution, setScoringScheme, unloadSequences, useMatchTag |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
protected static final byte STOP_DIRECTION
protected static final byte LEFT_DIRECTION
protected static final byte DIAGONAL_DIRECTION
protected static final byte TOP_DIRECTION
protected FactorSequence seq1
protected FactorSequence seq2
protected AlignmentBlock[][] block_table
protected int num_rows
protected int num_cols
protected Smawk smawk
Smawk
class that implements the SMAWK algorithm to
compute the column maxima of a totally monotone matrix. It is used to speed up the
computation of the output border of a block.
protected OutMatrix out_matrix
OutMatrix
class that encodes the OUT matrix of a
block when supplied with the DIST matrix and the input border array of a block.
Note that it does not compute the OUT matrix itselft, it just stores the necessary
information to retrieve a value at any position of the matrix.
This object is then used to compute the output border of a block with the
Smawk
class. Note that the OutMatrix
class implements the
Matrix
interface as required by the Smawk
class.
Matrix
,
Smawk
Constructor Detail |
public CrochemoreLandauZivUkelson()
Method Detail |
protected void loadSequencesInternal(java.io.Reader input1, java.io.Reader input2) throws java.io.IOException, InvalidSequenceException
FactorSequence
instances. In case of any error,
an exception is raised by the constructor of FactorSequence
(please
check the specification of that class for specific requirements).
A FactorSequence
is an LZ78-like factorisation of the sequences
being aligned.
loadSequencesInternal
in class PairwiseAlignmentAlgorithm
input1
- Input for first sequenceinput2
- Input for second sequence
java.io.IOException
- If an I/O error occurs when reading the sequences
InvalidSequenceException
- If the sequences are not validFactorSequence
protected void unloadSequencesInternal()
unloadSequencesInternal
in class PairwiseAlignmentAlgorithm
PairwiseAlignmentAlgorithm.unloadSequences()
protected PairwiseAlignment computePairwiseAlignment() throws IncompatibleScoringSchemeException
computeBlockTable
for details) and requests subclasses to retrieve an
optimal alignment between the loaded sequences. The actual product depends on the
subclasses which can produce a global (see
CrochemoreLandauZivUkelsonGlobalAlignment
) or local alignment (see
CrochemoreLandauZivUkelsonLocalAlignment
).
Subclasses are required to implement the buildOptimalAlignment
abstract method defined by this class according to their own method.
computePairwiseAlignment
in class PairwiseAlignmentAlgorithm
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.CrochemoreLandauZivUkelsonGlobalAlignment
,
CrochemoreLandauZivUkelsonLocalAlignment
,
computeBlockTable()
,
buildOptimalAlignment()
protected int computeScore() throws IncompatibleScoringSchemeException
computeBlockTable
for details) and requests subclasses to locate the
score of the highest scoring alignment between the two sequences in the block
table. The result depends on the subclasses, and either a global alignment
(see CrochemoreLandauZivUkelsonGlobalAlignment
) or local alignment
score (see CrochemoreLandauZivUkelsonLocalAlignment
) will be produced.
Subclasses are required to implement the locateScore
abstract
method defined by this class according to their own method.
Note that this method calculates the similarity value only (it doesn't trace back into the block table to retrieve the alignment itself).
computeScore
in class PairwiseAlignmentAlgorithm
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.CrochemoreLandauZivUkelsonGlobalAlignment
,
CrochemoreLandauZivUkelsonLocalAlignment
,
locateScore()
protected void computeBlockTable() throws IncompatibleScoringSchemeException
There are four different cases that defines four abstract methods in this class, which subclasses must implement:
Note that each factor has a serial number which indicates its order in the list of factors of a sequence. This number will match with the row and column index of a block in the block table. For instance, if a block has factors F1 and F2 with serial numbers 12 and 53, respectively, this means that this block is found at row 12, column 53 of the block table.
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.createRootBlock(neobio.alignment.Factor, neobio.alignment.Factor)
,
createFirstColumnBlock(neobio.alignment.Factor, neobio.alignment.Factor, int)
,
createFirstRowBlock(neobio.alignment.Factor, neobio.alignment.Factor, int)
,
createBlock(neobio.alignment.Factor, neobio.alignment.Factor, int, int)
protected int[][] assembleDistMatrix(AlignmentBlock block, int dim, int r, int c, int lc)
block
- the block for which the DIST matrix is neededdim
- the dimension of the DIST matrixr
- the row index of this block in the block tablec
- the column index of this block in the block tablelc
- the number of columns of the alignment block
protected int[] assembleInputBorder(int dim, int r, int c, int lr)
ArrayIndexOutOfBoundsException
.
dim
- dimension of the input borderr
- row index of the block in the block tablec
- column index of the block in the block tablelr
- number of row of the block
protected void traverseBlock(AlignmentBlock block, int source, java.lang.StringBuffer gapped_seq1, java.lang.StringBuffer tag_line, java.lang.StringBuffer gapped_seq2) throws IncompatibleScoringSchemeException
block
- the block to be traversedsource
- the source of the path in the output bordergapped_seq1
- the StringBuffer to where the gapped sequence 1 is written totag_line
- the StringBuffer to where the tag_line is written togapped_seq2
- the StringBuffer to where the gapped sequence 2 is written to
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.protected AlignmentBlock getLeftPrefix(AlignmentBlock block)
block
- the block
protected AlignmentBlock getDiagonalPrefix(AlignmentBlock block)
block
- the block
protected AlignmentBlock getTopPrefix(AlignmentBlock block)
block
- the block
protected abstract AlignmentBlock createRootBlock(Factor factor1, Factor factor2) throws IncompatibleScoringSchemeException
factor1
- the factor of the first sequence being alignedfactor2
- the factor of the second sequence being aligned
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.protected abstract AlignmentBlock createFirstRowBlock(Factor factor1, Factor factor2, int col) throws IncompatibleScoringSchemeException
factor1
- the factor of the first sequence being alignedfactor2
- the factor of the second sequence being alignedcol
- the column index of block in the block table
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.protected abstract AlignmentBlock createFirstColumnBlock(Factor factor1, Factor factor2, int row) throws IncompatibleScoringSchemeException
factor1
- the factor of the first sequence being alignedfactor2
- the factor of the second sequence being alignedrow
- the row index of block in the block table
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.protected abstract AlignmentBlock createBlock(Factor factor1, Factor factor2, int row, int col) throws IncompatibleScoringSchemeException
factor1
- the factor of the first sequence being alignedfactor2
- the factor of the second sequence being alignedrow
- the row index of block in the block tablecol
- the column index of block in the block table
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.protected abstract PairwiseAlignment buildOptimalAlignment() throws IncompatibleScoringSchemeException
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.protected abstract int locateScore()
IncompatibleScoringSchemeException
- If the scoring scheme is not compatible
with the loaded sequences.
|
|
||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |