Publication Type
Conference Paper
Abstract
Document clustering is a powerful technique that has been widely
used for organizing data into smaller, more manageable information kernels.
Several approaches have been proposed, but they suffer from problems such as
synonymy, ambiguity and the lack of descriptive content labels for the
generated clusters. We propose enhancing the standard k-means
algorithm with external knowledge from WordNet hypernyms in a twofold
manner: enriching the “bag of words” used prior to the clustering process and
assisting the label generation procedure that follows it. Our experiments
revealed a significant improvement over standard k-means for a corpus of news
articles derived from major news portals. Moreover, the cluster labeling process
generates useful, high-quality cluster tags.
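
The twofold use of hypernyms described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `TOY_HYPERNYMS` is a hypothetical hand-written table standing in for WordNet lookups, `enrich` and `label_cluster` are illustrative names, and the labeling rule (pick the hypernym shared by the most documents in a cluster) is an assumption made for the example. The clustering step itself is omitted; the enriched bags would be vectorized and passed to k-means.

```python
from collections import Counter

# Hypothetical stand-in for WordNet hypernym lookups (not real WordNet data).
TOY_HYPERNYMS = {
    "dog": ["canine", "animal"],
    "cat": ["feline", "animal"],
    "car": ["vehicle"],
    "truck": ["vehicle"],
}


def enrich(tokens, hypernyms=TOY_HYPERNYMS):
    """Return a bag of words in which each token's hypernyms are also counted."""
    bag = Counter(tokens)
    for tok in tokens:
        for hyper in hypernyms.get(tok, []):
            bag[hyper] += 1
    return bag


def label_cluster(docs, hypernyms=TOY_HYPERNYMS, top=1):
    """Label a cluster with the hypernyms that occur in the most documents.

    Assumed rule for illustration only: count each hypernym once per document
    and return the `top` most frequent ones.
    """
    doc_freq = Counter()
    for tokens in docs:
        seen = {h for tok in tokens for h in hypernyms.get(tok, [])}
        doc_freq.update(seen)
    return [h for h, _ in doc_freq.most_common(top)]
```

In this sketch, two documents that mention "dog" and "cat" share no surface term, yet both enriched bags contain "animal", which is the kind of overlap the enrichment step is meant to expose before clustering.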
Year of Publication
2010



