Description
A signature
is a data pattern of interest in a large data file or set of large data files.
Such signatures that need to be found arise in applications such as DNA sequence
analysis, network intrusion detection, biometrics, large scientific
experiments, and speech recognition and sensor networks. Related to this
is string matching.
More
specifically we envision a problem where long linear data files (i.e. flat files) contain
multiple signatures that are to be found using a multiplicity of processors (parallel processor).
This paper
evaluates the performance of finding signatures in files residing in the nodes
of parallel
processors configured as trees, two dimensional meshes and hyper cubes.
We assume
various combinations of sequential and parallel searching. A unique feature
of this work is that it is assumed that data is pre-loaded onto processors, as
may occur in practice, thus load distribution time need not be accounted for.
Elegant
expressions are found for average signature searching time and speedup, and
graphical results are provided.
Figure: Signature
Introduction
A signature
is a relatively small data pattern of interest embedded in a very large (in
this paper sequential) data file. It is assumed signatures are temporally
distinct and do not overlap each other. That is, there can be multiple
signatures in a file.
Because the
files we study are much longer than the signatures, it is assumed that
signatures have infinitesimally small length. Such signature searching occurs
in network security, signal processing, medicine, image processing, and sensor
technology and many other fields.
Most
previous work on signature searching (known as template matching and string matching) develop
algorithm for the detailed matching process. This paper, like addresses an
upper level view of signature searching involving system performance
evaluation.
However we
now briefly summarize some string matching work. String searching, which is
similar to our concept of signature searching, is a special case of pattern searching.
String searching generically involves finding a pattern of length m in a text
of length n over some alphabet.
The worst
case complexity of exact string matching is O(n) but the proportionality
constant of the linear term can be very different depending on the string
matching algorithm, ranging from m for the naive algorithm to 2 for the
Knuth-Morris-Pratt algorithm.
Tree network
Assume that
in a tree a node is a structure which contains exactly one file. Each node has one, two or
more child nodes, which are below it in the tree space. A node that has
a child is called the child’s parent node.
A node has
at most one parent. Nodes that do not have any children are called leaf nodes. They
are also referred to as terminal nodes.
A node’s
height is the length of the Multi-level tree network longest downward path to a
leaf from that node. The root’s height is the height of the tree.
The depth of
a node is the length of the path to its root. The nodes in the same depth of
the tree are said to be at the same level.
The height
of the multiple-level tree in this paper is H. Throughout this paper
there is a single file, possibly containing signatures, in each node.
In all the
cases for tree networks which we examine the number of children nodes ni per
node is a random variable from 1 to N, possibly different for each
level.
But at the
same level the number of children in each subtree is the same. In this section
if a parent node has a signature(s), its children nodes may also have
signatures.
If the
parent node does not have any signature of interest, its children nodes also
have no signatures and do not need to be searched.
Some dependency between signature
occurrences is thus being assumed. This may model
locality of reference - if a node has a signature(s), there may be related
information (signatures) on its children.
Figure: Tree Network
Mesh Networks
A regular two
dimensional mesh network of processors. It is a commonly used interconnection
network. In this network structure each processor is located in the
corners of four rectangles and has four neighbors.
The central
processor is called the originator; it can communicate information or
transport data to its four neighboring nodes.
As in trees,
the central node is assumed to be layer 0, and its four neighboring nodes in
layer 1, the nodes which are neighbors to the nodes in layer 1 but not in layer
0 are in layer 2, . . ., the nodes which are neighbors to the nodes in layer i
but not in layer i-1 are in layer i+1 in the mesh structure.
A node can
only send messages or data to its neighboring nodes. If the node in the north
of layer 2 wants to send a message to the node in layer 0, first it should send
message to the node in the north top of layer 1, then the message will be
transferred to layer.
Mesh network Assume
that there are N layers and there is at most one signature in every node in a
mesh network. There are 4i nodes in the i th layer. Different
from the tree network, the case that every node can have a signature will be
discussed in section.
That is, if a
node has no signature, its children node scan still have signatures. Assume the
average time to search every node in the mesh network is the same: X,
and again there is at most one signature in one file.
Figure: Mesh Network




No comments:
Post a Comment