NeoBio API

neobio.alignment
Class FactorSequence

java.lang.Object
  |
  +--neobio.alignment.FactorSequence

public class FactorSequence
extends java.lang.Object

This class builds a list of factors of a character sequence as induced by its Lempel-Ziv (LZ78) factorisation. Each factor is enconded as the longest factor previously seen plus one character.

The input can come from any source, provided it is encapsulated in a proper Reader instance. The stream is expected to be ready (i.e. the next read operation must return the first character of the sequence) and it is not closed when its end is reached, so the client is allowed to reset it and maybe use it for another purpose.

Sequences can contain letters only although lines started with the COMMENT_CHAR character ('>') are regarded as comments and are completely skipped. White spaces (including tabs, line feeds and carriage returns) are also ignored throughout.

This class uses a Trie to keep track of a list of factors. Each node of the trie contains a Factor of the text. As the sequence is read from the input, the trie is traversed as far as possible. When a leaf node is reached (which means that the longest prefix of the input has been found), two tasks are accomplished:

Each factor also receives a serial number according to the order they are found and a pointer to the next factor (in that order) for fast access. This pointer, together with the factor's ancestor pointer forms a doubly-linked list of factors. The original text can then be reconstructed simply by following the linked list and writing out its factors.

As an example, the sequence ACTAAACCGCATTAATAATAAAA is parsed into the following 12 factors:

 0  ( , ) = empty
 1  (0,A) = A
 2  (0,C) = C
 3  (0,T) = T
 4  (1,A) = AA
 5  (1,C) = AC
 6  (2,G) = CG
 7  (2,A) = CA
 8  (3,T) = TT
 9  (4,T) = AAT
 10 (9,A) = AATA
 11 (4,A) = AAA

 serial # (prefix, new char) = factor text
 

This class is used by CrochemoreLandauZivUkelson algorithm to speed up the classic dynamic programming approach to sequence alignment.

Author:
Sergio A. de Carvalho Jr.
See Also:
Factor, Trie, CrochemoreLandauZivUkelson

Field Summary
protected static char COMMENT_CHAR
          The character used to start a comment line in a sequence file.
protected  int num_chars
          The numbers of character represented by this sequence.
protected  int num_factors
          The numbers of factors generated by the LZ78 parsing of the sequence.
protected  Factor root_factor
          A pointer to the root factor, the one that starts the list of factors.
 
Constructor Summary
FactorSequence(java.io.Reader reader)
          Creates a new instance of a FactorSequence, loading the sequence data from the Reader input stream.
 
Method Summary
 Factor getRootFactor()
          Returns the root factor, the one that starts the list of factors.
 int numChars()
          Returns the number of characters of the original sequence.
 int numFactors()
          Returns the number of factors produced by the LZ78 parsing of the text.
 java.lang.String printFactors()
          Returns a string representation of the actual list of factors produced by the LZ78 parsing of the text.
 java.lang.String toString()
          Reconstructs the sequence from the list of factors induced by the LZ78 parsing of the text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

COMMENT_CHAR

protected static final char COMMENT_CHAR
The character used to start a comment line in a sequence file. When this character is found, the rest of the line is ignored.

See Also:
Constant Field Values

root_factor

protected Factor root_factor
A pointer to the root factor, the one that starts the list of factors.


num_chars

protected int num_chars
The numbers of character represented by this sequence.


num_factors

protected int num_factors
The numbers of factors generated by the LZ78 parsing of the sequence.

Constructor Detail

FactorSequence

public FactorSequence(java.io.Reader reader)
               throws java.io.IOException,
                      InvalidSequenceException
Creates a new instance of a FactorSequence, loading the sequence data from the Reader input stream. A doubly-linked list of factors is built according to its LZ78 factorisation.

Parameters:
reader - source of characters for this sequence
Throws:
java.io.IOException - if an I/O exception occurs when reading the input
InvalidSequenceException - if the input does not contain a valid sequence
Method Detail

getRootFactor

public Factor getRootFactor()
Returns the root factor, the one that starts the list of factors.

Returns:
root factor

numFactors

public int numFactors()
Returns the number of factors produced by the LZ78 parsing of the text.

Returns:
number of factors

numChars

public int numChars()
Returns the number of characters of the original sequence.

Returns:
number of characters of the original sequence

toString

public java.lang.String toString()
Reconstructs the sequence from the list of factors induced by the LZ78 parsing of the text.

Overrides:
toString in class java.lang.Object
Returns:
the original sequence

printFactors

public java.lang.String printFactors()
Returns a string representation of the actual list of factors produced by the LZ78 parsing of the text. Each factor is printed out in a separate line, in the order they appear in the text, with its serial number, its ancestor's serial number, its new character, length and a string representation of the factor itself.

Returns:
a string representation of the list of factors

SourceForge.net

http://neobio.sourceforge.net
NeoBio is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. NeoBio is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with NeoBio; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.