NeoBio API

neobio.alignment
Class Trie

java.lang.Object
  |
  +--neobio.alignment.Trie

public class Trie
extends java.lang.Object

This class implements a trie, or a digital search tree. A trie is a multiway tree (each node can have multiple children) that represents a set of strings.

Each node contains data encapsulated in an object instance. Each edge spells out a character and each path from the root represents a string described by the characters labelling the traversed edges. Moreover, for each string represented, there is a unique path from the root.

The trie of the following example represents the strings 'a', 'd', 'b', 'ac', 'ba', 'be', 'bd', 'bad' and 'bae'.

      [0]
     --+--
    /  |  \
  a/  d|   \b
 [1]  [2]  [4]
  |       --+--
  |      /  |  \
 c|    a/  e|  d\
 [3]  [5]  [6]  [9]
     --+--
    /     \
  d/      e\
 [7]       [8]
 

It is easy to see that strings with common prefixes will branch off from each other at the first distinguishing character. This feature makes the trie a good data structure to identify and represent phrases of a text such as the ones induced by the Lempel-Ziv familiy of compression algorithms. For instance, the LZ78 version parses the text into phrases, where each phrase is the longest matching phrase seen previously plus one character.

In this implementation, each node is actually an instance of this class. To build a trie, one must first create the root using the public constructor:

 Trie root = new Trie (some_object);
 

Here some_object contains any relevant information encapsulated in an object instance. Typically, that's the only moment the public constructor is used. From now on, all new nodes will be added as a new child of one existing node using the add method:

 new_node = any_node.add (some_object, character);
 

Here character is the character that will label the edge from any_node to new_node. Note that this transition must not already exist, otherwise an exception is raised.

To find the longest prefix of a given string, we follow a path from the root down the tree, character by character, with the spellDown method:

 next_node = root;
 while (next_node != null)
 {
     current_node = next_node;
     char c = get next character from somewhere
     next_node = current_node.spellDown (c);
 }
 

spellDown follows the edge out of current_node labelled by the character c and returns the next node. If there is no such a path, it returns null.

To retrieve the information stored at any node, simply use the getData method.

In fact, there are many ways to implement a trie. To avoid wasting space with multiple pointers at each node, this implementation uses an approach with a linked list of siblings. Each node actually contains a pointer to one of its children and a pointer to one of its siblings only. Together with the pointers, each node also stores the character that labels the edge to the pointed node.

 [0]
  |
 a|  d     b
 [1]---[2]---[4]
  |           |
 c|          a|  e     d
 [3]         [5]---[6]---[9]
              |
             d|  e
             [7]---[8]
 

In this way, a trie is similar to a binary tree. Although this implementation is more efficient in terms of space, the search for a label with a given character leaving a node n is no more constant but proportional to the number of children of n. In the previous example, it is necessary to traverse three edges to reach node 9 from node 4 with character d.

This class is used by the FactorSequence to build a linked list of factors of a sequence in a LZ78 fashion, i.e. where each factor is the longest factor previously seen plus one character.

Author:
Sergio A. de Carvalho Jr.
See Also:
FactorSequence

Field Summary
protected  java.lang.Object data
          This node's stored data.
protected  Trie sibling
          A pointer to this node's next sibling.
protected  Trie son
          A pointer to the first of this node's children.
protected  char to_sibling
          The character that labels the edge from this node to the sibling pointer by sibling.
protected  char to_son
          The character that labels the edge from this node to the child node pointer by son.
 
Constructor Summary
Trie(java.lang.Object data)
          Creates a new trie node with the specified data.
 
Method Summary
 Trie add(java.lang.Object data, char c)
          Adds a new child to this node.
protected  Trie addSibling(java.lang.Object data, char c)
          Adds a sibling to this node.
 java.lang.Object getData()
          Returns the data associated with this node.
 Trie spellDown(char c)
          Follows a path from this node to one of its children by spelling the character supplied as an argument.
protected  Trie spellRight(char c)
          Follows a path from this node to one of its sibling by spelling the character supplied as an argument.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

son

protected Trie son
A pointer to the first of this node's children.


to_son

protected char to_son
The character that labels the edge from this node to the child node pointer by son.


sibling

protected Trie sibling
A pointer to this node's next sibling.


to_sibling

protected char to_sibling
The character that labels the edge from this node to the sibling pointer by sibling.


data

protected java.lang.Object data
This node's stored data.

Constructor Detail

Trie

public Trie(java.lang.Object data)
Creates a new trie node with the specified data. This constructor is typically used by the client only once to instantiate the root node. After that, all new nodes are implicitly instantiated by the add method.

Parameters:
data - the data that will be associated with the new node
Method Detail

getData

public java.lang.Object getData()
Returns the data associated with this node.

Returns:
data associated with this node

add

public Trie add(java.lang.Object data,
                char c)
Adds a new child to this node. The new node will be implicitly instantiated with the data argument, and the edge from this node to the new node will be labelled by the character argument. If this node already have an edge labelled with this character, an exception is raised. Otherwise, the new node created and returned.

If this node have no child, a new node is created straight away. Otherwise, the task is assigned to its first child that will add the new node as a sibling.

Parameters:
data - the data that will be associated with the new node
c - the character that will label the edge from this node to the new node
Returns:
the added node
Throws:
java.lang.IllegalStateException - if this node already have an edge labelled by c

addSibling

protected Trie addSibling(java.lang.Object data,
                          char c)
Adds a sibling to this node. The new node will be implicitly instantiated with the data argument, and the edge from this node to the new node will be labelled by the character argument. If this node already have a sibling with this character, an exception is raised. Otherwise, the new node is created and returned.

If this node have no direct sibling, a new node is created straight away. Otherwise, the task is assigned to its next sibling.

Parameters:
data - the data that will be associated with the new node
c - the character that will label the edge from this node to the new node
Returns:
the added node
Throws:
java.lang.IllegalStateException - if this node already have an edge labelled by c

spellDown

public Trie spellDown(char c)
Follows a path from this node to one of its children by spelling the character supplied as an argument. If there is no such a path, null is returned. Otherwise, the reached child node is returned.

If this node's direct child is reached with an edge labelled by the character, it is returned straight away. Otherwise, it is assigned the task of finding another sibling labelled with that character.

Parameters:
c - the character that labels the path to be followed to this node's child
Returns:
the child node reached by traversing the edge labelled by c

spellRight

protected Trie spellRight(char c)
Follows a path from this node to one of its sibling by spelling the character supplied as an argument. If there is no such a path, null is returned. Otherwise, the reached sibling node is returned.

If this node's direct sibling is reached with an edge labelled by the character, it is returned straight away. Otherwise, it is assigned the task of finding another sibling labelled with that character.

Parameters:
c - the character that labels the path to be followed to the sibling
Returns:
the sibling node reached by traversing the edge labelled by c

SourceForge.net

http://neobio.sourceforge.net
NeoBio is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. NeoBio is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with NeoBio; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.