Blog

Python 3.7 And 2.7 Installation Steps On Windows 10

Changing python default versions on fedora 27?

Python websockets installation is failed on fedora linux

Network Basics

Document Clustering

Clustering Techniques

SQL Injection and Prevention Techniques

Google Form Intro and App Scripts

Download in Youtube using Youtube-dl commands

Java Network Launch Protocol File Launch Issue

javaws command is not working

Forcing Website with HTTPS instead of HTTP

Cricket Scores API's

Design Patterns

IIS Installation and Configuration on Windows 10

Unable to start debugging on the web server (IIS Error Code 405)

Google Chrome Installation on Fedora 26, CentOS / RHEL 7.4

Install Fedora 26 On Virtual Machine Using VirtualBox in Windows 10

Using screen command in fedora 25

Java JDK 1.8 Installation Steps On Windows 10

Enable Permanent SSH Access on Linux

C Program Undefined Reference Error

Curl error and couldn't resolve hostname fedora mirrors

Installing Android Studio in Windows 10

Resetting Root password on Fedora 26

Installing visual studio code editor steps for fedora 27 / centos 7 / RHEL 7

Java JNI Error UnsupportedClassVersionError in Windows 10

Install virt-customize in RHEL 8

Updating qcow2 image in RHEL 8

Remove the files from dir in another dir

Install virtualenv in RHEL 7

Install Scapy in RHEL 7

SQL ACID properties

Document Clustering



Document clustering is the automatic discovery of clusters (groups) of documents in a document collection, such that members of the same cluster have a high degree of association (with regard to a given similarity measure), whereas members of different clusters have a low degree of association.

In other words, a good document clustering scheme aims to minimize intra-cluster distances between documents while maximizing inter-cluster distances, using an appropriate distance measure between documents. A distance measure (or, dually, a similarity measure) thus lies at the heart of document clustering. There are several ways to measure the similarity between two documents: some are based on the vector model (e.g. cosine distance or Euclidean distance), while others are based on the Boolean model (e.g. the size of the intersection between the documents' term sets). More advanced approaches also exist, for instance using Latent Semantic Analysis to transform the vector space into a space of reduced dimensionality.
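As a rough illustration of the measures mentioned above, here is a minimal sketch in plain Python. It assumes a very simple bag-of-words representation (lowercased, split on whitespace); the function names and sample sentences are made up for this example and do not come from any particular library.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Vector model: cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Vector model: straight-line distance between the two vectors."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


def term_overlap(a, b):
    """Boolean model: number of distinct terms the two documents share."""
    return len(a.keys() & b.keys())


# Hypothetical sample documents
doc1 = term_vector("clustering groups similar documents")
doc2 = term_vector("clustering groups related documents")
doc3 = term_vector("install python on windows")

print(cosine_similarity(doc1, doc2))   # three of four terms shared
print(euclidean_distance(doc1, doc2))
print(term_overlap(doc1, doc2))
print(cosine_similarity(doc1, doc3))   # no shared terms
```

Note that cosine similarity grows as documents become more alike, while Euclidean distance shrinks; a cosine distance can be obtained as 1 minus the cosine similarity.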

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts between communities have slowed the transfer of useful generic concepts and methodologies.

APPLICATIONS OF DOCUMENT CLUSTERING

Generally, clustering is used in statistics to discover the structure of large “multivariate” data sets. It can often reveal latent relationships hidden in complex data. Within information retrieval, clustering (of documents) has several promising applications, all concerned with improving efficiency and effectiveness of the retrieval process. Some of the more interesting include:

  • Finding Similar Documents to a given document. This feature is often used when the user has spotted one “good” document in a search result and wants more-like-this.
  • Search Result Clustering, allowing the user to get a better overview of the documents returned as search results and to navigate towards clusters that are relevant to the user’s information need.
  • Guided/Interactive Search, where clustering is used to help the user drill down and find the desired information step-by-step by gradually refining the search.
  • Organizing Site Content into Categories, allowing browsing of the site in a Yahoo-like fashion.
  • Recommender Systems that, based on the documents the user has already visited, recommend other documents. A typical use of this is in an e-commerce setting, where products that might interest the customer are suggested based on products the customer has already examined/bought.
  • Faster/Better Search, utilizing the clustering to optimize the search. A user query could, for instance, be compared to clusters instead of the individual documents, effectively limiting the search space.
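The last point, comparing a query to clusters rather than to every document, can be sketched roughly as follows. This is only an illustration under simple assumptions: the clusters are hand-made rather than produced by a clustering algorithm, documents are bag-of-words vectors, and each cluster is summarized by the sum of its members' term counts. All names here are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Sum of the member vectors; cosine similarity ignores the overall scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cluster_search(query, clusters):
    """Pick the cluster whose centroid is most similar to the query,
    then rank only that cluster's documents against the query."""
    q = term_vector(query)
    best = max(
        clusters,
        key=lambda name: cosine_similarity(
            q, centroid([term_vector(d) for d in clusters[name]])
        ),
    )
    ranked = sorted(
        clusters[best],
        key=lambda d: cosine_similarity(q, term_vector(d)),
        reverse=True,
    )
    return best, ranked


# Hypothetical, hand-made clusters (a real system would compute these)
clusters = {
    "python": ["install python on windows", "python virtualenv setup on linux"],
    "java": ["java jdk install steps", "java jni error fix"],
}

best, ranked = cluster_search("python install", clusters)
print(best)
print(ranked)
```

Only the documents in the selected cluster are scored individually, which is where the reduction of the search space comes from. The trade-off is that a relevant document sitting in another cluster is never seen.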

CHALLENGES IN DOCUMENT CLUSTERING

Although commercial information retrieval systems utilizing clustering exist, document clustering is far from a trivial or solved problem. The clustering process involves challenges such as:

  • Selecting appropriate features of the documents that should be used for clustering.
  • Selecting an appropriate similarity measure between documents.
  • Selecting an appropriate clustering method utilizing the above similarity measure.
  • Implementing the clustering algorithm in an efficient way that makes it feasible in terms of required memory and CPU resources.
  • Finding ways of assessing the quality of the performed clustering.
  • Finding feasible ways of updating the clustering if new documents are added to the collection.
  • Finding ways of applying the clustering to improve the information retrieval task at hand.
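Two of the challenges above, assessing quality and updating the clustering when new documents arrive, can be illustrated with a small sketch. It assumes cosine distance on bag-of-words vectors, measures quality simply as mean intra-cluster distance versus mean inter-cluster distance, and handles a new document by attaching it to the nearest centroid without re-clustering. These are simplifications chosen for illustration, not an established evaluation method, and all names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_intra_distance(clusters):
    """Average distance between pairs of documents in the same cluster."""
    dists = [cosine_distance(a, b)
             for members in clusters
             for a, b in combinations(members, 2)]
    return sum(dists) / len(dists) if dists else 0.0


def mean_inter_distance(clusters):
    """Average distance between pairs of documents in different clusters."""
    dists = [cosine_distance(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(dists) / len(dists) if dists else 0.0


def assign_new_document(vec, clusters):
    """Incremental update: add vec to the cluster with the nearest centroid."""
    def centroid(members):
        total = Counter()
        for m in members:
            total.update(m)
        return total

    index = min(range(len(clusters)),
                key=lambda i: cosine_distance(vec, centroid(clusters[i])))
    clusters[index].append(vec)
    return index


# Hypothetical, hand-made clustering of four short documents
clusters = [
    [term_vector("python install windows"), term_vector("python install fedora")],
    [term_vector("java jdk install"), term_vector("java jni error")],
]

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(intra, inter)  # a good clustering has intra well below inter

index = assign_new_document(term_vector("python virtualenv install"), clusters)
print(index)
```

Attaching new documents to the nearest centroid is cheap, but the clusters can drift over time, so a periodic full re-clustering may still be needed.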


Privacy Policy  |  Copyright@2017 - All Rights Reserved.  |  Contact us   |  Report website issues in Github   |  Facebook page   |  Google+ page
