WORLD DATA CLUSTERING

ADEWALE O. MAKO

DATA MINING

INTRODUCTION

Data mining is the analysis step of knowledge discovery in databases, and a field at the intersection of computer science and statistics. It can also be defined as the analysis of large observational datasets to find unsuspected relationships. This definition refers to observational data, as opposed to experimental data.

Data mining typically deals with data that has already been collected for some purpose other than the data mining analysis itself. For this reason it is often referred to as ‘secondary data analysis’. The overall goal of the data mining process is to extract information from a dataset and transform it into an understandable structure for further use.

SCORE FUNCTIONS IN DATA MINING

A score function is a measure of one’s performance while making decisions under uncertainty.

The purpose of a score function in data mining is to rank models according to how useful they are to the data miner. A chosen score function should reflect the overall goals of the data mining task as far as possible. Different score functions have different properties and are useful in different situations, so one should avoid choosing a score function merely because it is convenient: it may well be inappropriate for the task at hand.
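
As an illustration of ranking models with a score function, here is a minimal sketch. It assumes a score where lower is better (sum of squared prediction errors); the toy data, the three candidate models, and the names squared_error_score and rank_models are hypothetical, chosen only for this example. A different task might call for a different score.

```python
from typing import Callable, Dict, List, Tuple

Observation = Tuple[float, float]   # (input x, observed y)
Model = Callable[[float], float]    # maps an input x to a predicted y


def squared_error_score(model: Model, data: List[Observation]) -> float:
    """Sum of squared prediction errors; lower means a better fit (assumed convention)."""
    return sum((model(x) - y) ** 2 for x, y in data)


def rank_models(models: Dict[str, Model], data: List[Observation],
                score: Callable[[Model, List[Observation]], float]) -> List[str]:
    """Return model names ordered from best (lowest score) to worst."""
    return sorted(models, key=lambda name: score(models[name], data))


# Hypothetical toy data that roughly follows y = x.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]

# Hypothetical candidate models.
models = {
    "constant": lambda x: 1.5,
    "identity": lambda x: x,
    "double": lambda x: 2 * x,
}

ranking = rank_models(models, data, squared_error_score)
print(ranking)
```

Swapping squared_error_score for another function with the same signature changes the ranking criterion without touching rank_models, which is one way to keep the choice of score function an explicit decision.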

CLUSTERING

Clustering is one of the most important unsupervised learning techniques. Like every other problem of this kind, it deals with finding structure in a collection of unlabelled data. Clustering is in the eye of the beholder; in other words, there is no single correct clustering algorithm.

We can describe clustering as the process of organizing objects into groups whose members are similar in some way. In other words, a cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

The advantages of clustering include simplification and pattern detection. As an unsupervised learning process, it is also useful in constructing concepts from data.

Clustering can be used in information retrieval, data mining, marketing, medical diagnosis, web analysis and text mining.

The most important distinction between types of clustering is between hierarchical and partitional sets of clusters. A partitional clustering divides the data objects into non-overlapping subsets (clusters) such that every data object is in exactly one subset. A hierarchical clustering, on the other hand, is a set of nested clusters organized as a hierarchical tree. Other distinctions between sets of clusters are: fuzzy versus non-fuzzy, exclusive versus non-exclusive, heterogeneous versus homogeneous, and partial versus complete.
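
The difference between a flat partition and a nested hierarchy can be sketched in a few lines. The example below assumes distinct one-dimensional points and single-link (nearest-member) distance, neither of which is specified in the text; the function names are hypothetical. Stopping the merging early yields a partition, while running it to the end yields the full tree.

```python
from itertools import combinations


def single_link_distance(a, b):
    """Smallest gap between any member of cluster a and any member of cluster b."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(points, k=1):
    """Repeatedly merge the two closest clusters until only k remain.

    Returns (clusters, merges). The remaining clusters form a partition of
    the points; merges records the nested tree from the bottom up.
    Assumes the 1-D points are distinct.
    """
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link_distance(*pair))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return clusters, merges


points = [1.0, 2.0, 6.0, 7.0, 20.0]
partition, _ = agglomerate(points, k=2)  # stop early: a flat partition
_, tree = agglomerate(points, k=1)       # run to the root: the full hierarchy

print([sorted(c) for c in partition])
for left, right in tree:
    print(sorted(left), "+", sorted(right))
```

In the partition every point appears in exactly one subset, whereas each entry of the tree is a merge of two clusters that are themselves nested inside later merges.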

There are many different kinds of clusters, such as well-separated clusters, center-based clusters, contiguous clusters, clusters defined by a shared property (conceptual clusters), and clusters described by an objective function. Clustering algorithms include K-means and its variants, hierarchical clustering, and density-based clustering. One of the best-known methods is Lloyd's algorithm, usually referred to as the “K-means” algorithm. Many other approaches have been proposed in this field, for instance seed-based clustering. In density-based clustering, clusters are defined as regions of higher density than the remainder of the data set.

K-means is a partitional clustering approach. In this approach every cluster is associated with a centroid (center point), and every point is assigned to the cluster with the nearest centroid. The number of clusters (K) has to be specified in advance. The initial centroids are usually chosen randomly. The centroid is commonly the mean of the points in the cluster. Closeness is measured by Euclidean distance, cosine similarity, correlation, etc. The most common measure for evaluating a K-means clustering is the sum of squared error (SSE). For each point, the error is the distance to the nearest cluster centroid. We square these errors and sum them to get the SSE:

\[ \mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(m_i, x)^2 \]

m_i is the representative...
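
The K-means procedure and the SSE measure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, points given as tuples of floats, and a fixed random seed for the initial centroids; the function names kmeans and sse and the toy data are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style K-means with Euclidean distance.

    Initial centroids are k randomly sampled data points (seeded so the
    run is repeatable). Returns (centroids, clusters).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


def sse(centroids, clusters):
    """Sum over all points of the squared distance to their cluster centroid."""
    return sum(math.dist(p, m) ** 2
               for m, c in zip(centroids, clusters) for p in c)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

centroids, clusters = kmeans(points, k=2)
total = sse(centroids, clusters)
print(centroids, total)
```

Because the result of K-means can depend on the initial centroids, a common practice is to run it several times with different seeds and keep the clustering with the lowest SSE.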