Mining User-Aware Rare Sequential Topic Patterns in Document Streams
Introduction
Document streams are created and
distributed in various forms on the Internet, such as news streams, emails,
micro-blog articles, chatting messages, research paper archives, web forum
discussions, and so forth. The contents of these documents generally
concentrate on some specific topics, which reflect offline social events and
users’ characteristics in real life.
To mine this information, much text-mining
research has focused on extracting topics from document
collections and document streams through various probabilistic topic models,
such as the classical PLSI and LDA and their extensions.
In order to characterize user behaviors
in published document streams, we study the correlations among topics
extracted from these documents, especially the sequential relations, and
specify them as Sequential Topic
Patterns (STPs). Each STP records the complete and repeated behavior of
a user when she is publishing a series of documents, and is suitable for
inferring users’ intrinsic characteristics and psychological statuses.
First,
compared to individual topics, STPs capture both combinations and orders of
topics, so they can serve well as discriminative units of semantic association among
documents in ambiguous situations.
Second, compared to document-based patterns,
topic-based patterns contain abstract information of document contents and are
thus beneficial in clustering similar documents and finding some regularities
about Internet users.
Third, the probabilistic description of topics helps to
maintain and accumulate the uncertainty degree of individual topics, and can
thereby reach a high confidence level in pattern matching for uncertain data. For
a document stream, some STPs may occur frequently and thus reflect common
behaviors of the users involved.
Beyond that, there may still exist some other
patterns which are globally rare for the general population, but occur
relatively often for some specific user or some specific group of users. We call
them User-aware Rare STPs (URSTPs). Discovering them is
especially interesting and significant compared to discovering frequent ones.
Theoretically, this defines a new kind of pattern for rare event mining, which is
able to characterize personalized and abnormal behaviors for special users. Practically,
it can be applied in many real-life scenarios of user behavior analysis.
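The idea of a pattern being "globally rare but locally frequent" can be illustrated with a minimal Python sketch. It rests on several simplifying assumptions that are not from the text: each session is a plain list of topic labels rather than a probabilistic topic distribution, support is the fraction of sessions containing the pattern as an ordered subsequence, and the names `contains`, `support`, `user_aware_rare`, `max_global`, and `min_gap` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Session = Sequence[str]      # assumption: one session = ordered topic labels
Pattern = Tuple[str, ...]    # assumption: an STP is an ordered tuple of topics


def contains(session: Session, pattern: Pattern) -> bool:
    """True if `pattern` occurs in `session` as an ordered subsequence
    (topics need not be adjacent)."""
    it = iter(session)
    # `topic in it` advances the iterator, so order is enforced.
    return all(topic in it for topic in pattern)


def support(sessions: List[Session], pattern: Pattern) -> float:
    """Fraction of sessions that contain the pattern."""
    if not sessions:
        return 0.0
    return sum(contains(s, pattern) for s in sessions) / len(sessions)


def user_aware_rare(
    sessions_by_user: Dict[str, List[Session]],
    pattern: Pattern,
    max_global: float,   # hypothetical threshold: pattern must be this rare overall
    min_gap: float,      # hypothetical threshold: user support must exceed global by this much
) -> List[str]:
    """Users for whom a globally rare pattern is comparatively frequent."""
    all_sessions = [s for ss in sessions_by_user.values() for s in ss]
    global_sup = support(all_sessions, pattern)
    if global_sup > max_global:
        return []  # not rare for the general population
    return [
        user
        for user, ss in sessions_by_user.items()
        if support(ss, pattern) - global_sup >= min_gap
    ]
```

The difference between a user's support and the global support is only one possible rarity score, chosen here for brevity; a ratio or a normalized measure would fit the same skeleton.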
Fig. 1. Mining User-Aware Rare Sequential Topic Patterns in Document Streams
Mining URSTP
In this section, we propose a novel
approach to mining URSTPs in document streams. The main processing framework
for the task is shown in Fig. 2. It consists of three phases. First, textual
documents are crawled from some micro-blog sites or forums, and constitute a
document stream as the input of our approach. Then, as preprocessing
procedures, the original stream is transformed to a topic-level document stream
and then divided into many sessions to identify complete user behaviors.
Finally and most importantly, we discover all the STP candidates in the document
stream for all users, and further pick out significant URSTPs associated with
specific users by user-aware rarity analysis.
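The preprocessing and candidate-discovery phases can be sketched in Python as follows. This is an illustration under stated assumptions, not the method itself: the text does not say how sessions are identified, so a simple time-gap heuristic is assumed; each document is reduced to a single topic label instead of a topic distribution; and the names `Doc`, `split_sessions`, `candidate_patterns`, `max_gap`, and `max_len` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, NamedTuple, Set, Tuple


class Doc(NamedTuple):
    user: str
    timestamp: float  # assumption: publication time in seconds
    topic: str        # assumption: most likely topic, standing in for a distribution


def split_sessions(stream: List[Doc], max_gap: float) -> Dict[str, List[List[str]]]:
    """Group a topic-level stream by user, then start a new session whenever
    two consecutive documents of the same user are more than `max_gap` apart."""
    by_user: Dict[str, List[Doc]] = defaultdict(list)
    for doc in sorted(stream, key=lambda d: d.timestamp):
        by_user[doc.user].append(doc)

    sessions: Dict[str, List[List[str]]] = {}
    for user, docs in by_user.items():
        user_sessions = [[docs[0].topic]]
        for prev, cur in zip(docs, docs[1:]):
            if cur.timestamp - prev.timestamp > max_gap:
                user_sessions.append([])  # gap too large: open a new session
            user_sessions[-1].append(cur.topic)
        sessions[user] = user_sessions
    return sessions


def candidate_patterns(session: List[str], max_len: int) -> Set[Tuple[str, ...]]:
    """All ordered topic subsequences of the session up to `max_len` topics.
    `combinations` keeps the input order, so each tuple respects session order."""
    return {
        combo
        for n in range(1, max_len + 1)
        for combo in combinations(session, n)
    }
```

Enumerating every subsequence grows quickly with session length, so the sketch bounds pattern length with `max_len`; a practical miner would prune candidates rather than enumerate them exhaustively.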
Fig. 2. Processing framework of URSTP mining.