Engineering Project Trends : Signature Searching in a Networked Collection of Files

Description

A signature is a data pattern of interest in a large data file or set of large data files. Such signatures that need to be found arise in applications such as DNA sequence analysis, network intrusion detection, biometrics, large scientific experiments, and speech recognition and sensor networks. Related to this is string matching.

More specifically we envision a problem where long linear data files (i.e. flat files) contain multiple signatures that are to be found using a multiplicity of processors (parallel processor).

This paper evaluates the performance of finding signatures in files residing in the nodes of parallel processors configured as trees, two dimensional meshes and hyper cubes.

We assume various combinations of sequential and parallel searching. A unique feature of this work is that it is assumed that data is pre-loaded onto processors, as may occur in practice, thus load distribution time need not be accounted for.

Elegant expressions are found for average signature searching time and speedup, and graphical results are provided.

Figure: Signature

Introduction

A signature is a relatively small data pattern of interest embedded in a very large (in this paper sequential) data file. It is assumed signatures are temporally distinct and do not overlap each other. That is, there can be multiple signatures in a file.

Because the files we study are much longer than the signatures, it is assumed that signatures have infinitesimally small length. Such signature searching occurs in network security, signal processing, medicine, image processing, and sensor technology and many other fields.

Most previous work on signature searching (known as template matching and string matching) develop algorithm for the detailed matching process. This paper, like addresses an upper level view of signature searching involving system performance evaluation.

However we now briefly summarize some string matching work. String searching, which is similar to our concept of signature searching, is a special case of pattern searching. String searching generically involves finding a pattern of length m in a text of length n over some alphabet.

The worst case complexity of exact string matching is O(n) but the proportionality constant of the linear term can be very different depending on the string matching algorithm, ranging from m for the naive algorithm to 2 for the Knuth-Morris-Pratt algorithm.

Tree network

Assume that in a tree a node is a structure which contains exactly one file. Each node has one, two or more child nodes, which are below it in the tree space. A node that has a child is called the child’s parent node.

A node has at most one parent. Nodes that do not have any children are called leaf nodes. They are also referred to as terminal nodes.

A node’s height is the length of the Multi-level tree network longest downward path to a leaf from that node. The root’s height is the height of the tree.

The depth of a node is the length of the path to its root. The nodes in the same depth of the tree are said to be at the same level.

The height of the multiple-level tree in this paper is H. Throughout this paper there is a single file, possibly containing signatures, in each node.

In all the cases for tree networks which we examine the number of children nodes ni per node is a random variable from 1 to N, possibly different for each level.

But at the same level the number of children in each subtree is the same. In this section if a parent node has a signature(s), its children nodes may also have signatures.

If the parent node does not have any signature of interest, its children nodes also have no signatures and do not need to be searched.

Some dependency between signature occurrences is thus being assumed. This may model locality of reference - if a node has a signature(s), there may be related information (signatures) on its children.

Figure: Tree Network

Mesh Networks

A regular two dimensional mesh network of processors. It is a commonly used interconnection network. In this network structure each processor is located in the corners of four rectangles and has four neighbors.

The central processor is called the originator; it can communicate information or transport data to its four neighboring nodes.

As in trees, the central node is assumed to be layer 0, and its four neighboring nodes in layer 1, the nodes which are neighbors to the nodes in layer 1 but not in layer 0 are in layer 2, . . ., the nodes which are neighbors to the nodes in layer i but not in layer i-1 are in layer i+1 in the mesh structure.

A node can only send messages or data to its neighboring nodes. If the node in the north of layer 2 wants to send a message to the node in layer 0, first it should send message to the node in the north top of layer 1, then the message will be transferred to layer.

Mesh network Assume that there are N layers and there is at most one signature in every node in a mesh network. There are 4i nodes in the i th layer. Different from the tree network, the case that every node can have a signature will be discussed in section.

That is, if a node has no signature, its children node scan still have signatures. Assume the average time to search every node in the mesh network is the same: X, and again there is at most one signature in one file.