K-Means Clustering with Decision Tree

The K-means clustering data mining algorithm is commonly used to find clusters because of its simplicity of implementation and fast execution. After the K-means clustering algorithm is applied to a dataset, however, it is difficult to interpret the clusters and to extract the required results from them unless another data mining algorithm is used. The decision tree (ID3) is used for the interpretation of the clusters of the K-means algorithm because ID3 is fast, generates understandable rules and is simple to explain. In this research paper we integrate the K-means clustering algorithm with the decision tree (ID3) algorithm into a single algorithm using an intelligent agent, called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of performing both the classification and the interpretation of a given dataset. For the visualization of the clusters, 2D scatter graphs are drawn.

Keywords: Classification, LIAgent, Interpretation, Visualization

1. Introduction

Data mining algorithms are applied to discover hidden, new patterns and relations in complex datasets. The use of intelligent mobile agents in data mining algorithms further boosts their study. The term intelligent mobile agent combines two different disciplines: the 'agent' comes from Artificial Intelligence, while 'code mobility' is defined in distributed systems. An agent is an object which has an independent thread of control and can be initiated. The first step is agent initialization. The agent will then start to operate and may stop and start again depending upon the environment and the tasks that it tries to accomplish. After the agent has finished all the required tasks, it will end in its complete state. Table 1 elaborates the different states of an agent [1][2][3][4].

Table 1. States of an agent

State        Description
Initialize   Performs one-time setup activity.
Start        Starts its job or task.
Stop         Stops its jobs or tasks after saving intermediate results.
Complete     Performs completion or termination activity.

There is a link between Artificial Intelligence (AI) and Intelligent Agents (IA). Data mining is known as "Machine Learning" in Artificial Intelligence. Machine learning deals with the development of techniques which allow the computer to 'learn'. It is a method of creating computer programs by the analysis of datasets. The agents must be able to learn to perform classification, clustering and prediction using learning algorithms [5][6][7][8].

The remainder of this paper is organized as follows: Section 2 reviews the relevant data mining algorithms, namely the K-means clustering and decision tree (ID3) algorithms. Section 3 describes the methodology, a hybrid integration of the two data mining algorithms. Section 4 presents the results and discussion. Finally, section 5 presents the conclusion.

2. Overview of Data Mining Algorithms

The K-means clustering data mining algorithm is used for the classification of a dataset by producing the clusters of that dataset. The K-means clustering algorithm is a kind of 'unsupervised learning' in machine learning. The decision tree (ID3) data mining algorithm is used to interpret these clusters by producing the decision rules in if-then-else form. The decision tree (ID3) algorithm is a type of 'supervised learning' in machine learning. Both of these algorithms are combined into one algorithm through an intelligent agent, called the Learning Intelligent Agent (LIAgent). In this section we discuss both of these algorithms.

2.1. K-means Clustering Algorithm

The following steps explain the K-means clustering algorithm:

Step 1: Enter the number of clusters and the number of iterations, which are the required and basic inputs of the K-means clustering algorithm.

Step 2: Calculate the initial centroids using the Range Method shown in equations 1 and 2.

ci = minX + ( (maxX - minX) / k ) * n        (1)

cj = minY + ( (maxY - minY) / k ) * n        (2)

The initial centroid is C(ci, cj), where maxX, maxY, minX and minY represent the maximum and minimum values of the X and Y attributes respectively, 'k' represents the number of clusters, and i, j and n vary from 1 to k, where k is an integer. In this way we can calculate the initial centroids; this will be the starting point of the algorithm. The value (maxX - minX) gives the range of the 'X' attribute; likewise, the value (maxY - minY) gives the range of the 'Y' attribute. The value of 'n' varies from 1 to 'k'. The number of iterations should be small, otherwise the time and space complexity will be very high, and the values of the initial centroids may also become very high, possibly falling outside the range of the given dataset. This is a major drawback of the K-means clustering algorithm.
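Under the reconstruction of equations 1 and 2 above, the Range Method amounts to spacing k candidate centroids evenly across the attribute ranges. The following Python sketch is illustrative only; the function and variable names are ours, not the authors'.

```python
# Minimal sketch of the Range Method for initial centroids (Step 2),
# assuming the reconstructed form of equations 1 and 2.
def range_method_centroids(points, k):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # One centroid per cluster n = 1..k, spaced across the attribute ranges.
    return [(min_x + (max_x - min_x) * n / k,
             min_y + (max_y - min_y) * n / k) for n in range(1, k + 1)]
```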

Step 3: Calculate the distances using the Euclidean distance formula in equation 3. On the basis of the distances, generate the partition by assigning each sample to the closest cluster.

Euclidean Distance Formula:

d(xi, xj) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xiN - xjN)^2 )        (3)

where d(xi, xj) is the distance between xi and xj, and xi and xj are the attribute vectors of two given objects, with components indexed from 1 to N, where N is the total number of attributes of a given object; i, j and N are integers.

Step 4: Compute the new cluster centres as the centroids of the clusters, again compute the distances and generate the partition. Repeat this until the cluster memberships stabilize [9][10].
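Taken together, the four steps form the familiar assign-and-update loop. The following Python sketch ties them together, reusing range_method_centroids from the previous sketch; it is a minimal illustration, not the authors' implementation.

```python
import math

def euclidean(p, q):
    # Equation 3: straight-line distance between two attribute vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, iterations):
    centroids = range_method_centroids(points, k)  # Step 2 (sketch above)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 3: partition by assigning each sample to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its cluster,
        # keeping the old centroid if a cluster is empty.
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # memberships have stabilized
            break
        centroids = new_centroids
    return clusters, centroids
```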

The strengths and weaknesses of the K-means clustering algorithm are summarized in table 2.

Table 2. Strengths and Weaknesses of the K-means Clustering Algorithm

Strengths:
- Time complexity is O(nkl), i.e. linear in the size of the dataset.
- Space complexity is O(k + N).
- It is an order-independent algorithm: it generates the same partition of the data irrespective of the order of the samples.

Weaknesses:
- Although it is easy to implement, it has the drawback of depending on the initial centres provided.
- If a distance measure does not exist, especially in multidimensional spaces, the distance must first be defined, which is not always easy.
- The results obtained from the clustering algorithm can be interpreted in different ways.
- No clustering technique addresses all the requirements adequately and simultaneously.

The K-means clustering algorithm can be applied in the following areas, among others:

Marketing: Finding groups of customers with similar behaviour, given a large database of customer data containing their profiles and past records.

Biology: Classification of plants and animals given their features.

Libraries: Book ordering.

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.

City planning: Identifying groups of houses according to their house type, value and geographical location.

Earthquake studies: Clustering observed earthquake epicentres to identify dangerous zones.

World Wide Web: Document classification; clustering weblog data to discover groups with similar access patterns.

Medical sciences: Classification of medicines; grouping patient records according to their doses, etc. [11][12].

2.2. Decision Tree (ID3) Algorithm

The decision tree (ID3) produces decision rules as its output. The decision rules obtained from ID3 are in if-then-else form, which can be used for decision support systems, classification and prediction. The decision rules are helpful in forming an accurate, balanced picture of the risks and rewards that can result from a particular choice. The function of the decision tree (ID3) is shown in figure 1.

Figure 1. The Function of the Decision Tree (ID3) Algorithm

The cluster is the input data for the decision tree (ID3) algorithm, which produces the decision rules for the cluster.

The following steps explain the decision tree (ID3) algorithm:

Step 1: Let 'S' be a training set. If all instances in 'S' are positive, then create a 'YES' node and halt. If all instances in 'S' are negative, create a 'NO' node and halt. Otherwise select a feature 'F' with values v1, ..., vn and create a decision node.

Step 2: Partition the training instances in 'S' into subsets S1, S2, ..., Sn according to the values of F.

Step 3: Apply the algorithm recursively to each of the sets Si [13][14].
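A minimal Python sketch of these three steps follows. It treats attribute values as categorical (which matches the exact-value tests in the rules shown in section 4) and uses information gain, ID3's usual selection criterion, to choose the feature 'F'; the dict-based record layout and all names are our assumptions, not the authors' code.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def id3(rows, features, target):
    labels = [r[target] for r in rows]
    # Step 1: if all instances in S share one class, create a leaf and halt.
    if len(set(labels)) == 1:
        return labels[0]
    if not features:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    # Select the feature F with the highest information gain, i.e. the
    # lowest weighted entropy of the subsets it induces.
    def remainder(f):
        return sum(
            entropy([r[target] for r in rows if r[f] == v])
            * sum(1 for r in rows if r[f] == v) / len(rows)
            for v in set(r[f] for r in rows)
        )
    f = min(features, key=remainder)
    # Step 2: partition S into subsets according to the values of F.
    # Step 3: apply the algorithm recursively to each subset.
    return {f: {v: id3([r for r in rows if r[f] == v],
                       [g for g in features if g != f], target)
                for v in set(r[f] for r in rows)}}
```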

Table 3 shows the strengths and weaknesses of the ID3 algorithm.

Table 3. Strengths and Weaknesses of the Decision Tree (ID3) Algorithm

Strengths:
- It generates understandable rules.
- It performs classification without requiring much computation.
- It is able to handle both continuous and categorical variables.
- It provides a clear indication for prediction or classification.

Weaknesses:
- It is less appropriate for continuous attributes.
- It does not perform well in problems with many classes and a small number of training examples.
- Growing a decision tree is computationally expensive, because each node must be sorted before finding the best split.
- It handles only a single output field and does not deal well with non-rectangular regions.

3. Methodology

We combine two different data mining algorithms, namely the K-means clustering and decision tree (ID3) algorithms, into one algorithm using an intelligent agent, called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of both clustering and interpreting the given dataset. The clusters can also be visualized using 2D scatter graphs. The architecture of this agent system is shown in figure 2.

Figure 2. The Architecture of the LIAgent System

The LIAgent is a combination of two data mining algorithms: one is the K-means clustering algorithm and the second is the decision tree (ID3) algorithm. The K-means clustering algorithm produces the clusters of the given dataset, which is the classification of that dataset, and the decision tree (ID3) generates the decision rules for each cluster, which are useful for the interpretation of these clusters. The user can access both the clusters and the decision rules from the LIAgent. The LIAgent is thus used for the classification and interpretation of the given dataset. The clusters of the LIAgent are further used for visualization using 2D scatter graphs. The decision tree (ID3) is fast, generates understandable rules and is simple to explain, since any decision that is made can be understood by following the path of the decision. The rules also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in if-then-else form, which can be used for decision support systems, classification and prediction.
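The interaction between the two algorithms can be expressed as a short pipeline, reusing the k_means and id3 sketches from section 2; the record layout and all names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the LIAgent pipeline: cluster the records, then derive
# if-then-else rules for each cluster with ID3.
def liagent(rows, attrs, k, iterations, features, target):
    # Project each record onto the two clustering attributes (one vertical
    # partition, in the paper's terms) and cluster the projections.
    points = [tuple(r[a] for a in attrs) for r in rows]
    _, centroids = k_means(points, k, iterations)
    groups = [[] for _ in range(k)]
    for r, p in zip(rows, points):
        nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
        groups[nearest].append(r)
    # ID3 turns each non-empty cluster into interpretable decision rules.
    rules = [id3(g, list(features), target) if g else None for g in groups]
    return groups, rules
```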

A medical dataset, 'Diabetes', is used in this research paper. It is a dataset/testbed of 790 records. The data of the 'Diabetes' dataset is pre-processed, a step called data standardization; the interval-scaled data is properly cleansed. The attributes of the dataset/testbed 'Diabetes' are:

Number of times pregnant (NTP) (min. age = 21, max. age = 81)

Plasma glucose concentration at 2 hours in an oral glucose tolerance test (PGC)

Diastolic blood pressure (mm Hg) (DBP)

Triceps skin fold thickness (mm) (TSFT)

2-hour serum insulin (mu U/ml) (2HIS)

Body mass index (weight in kg/(height in m)^2) (BMI)

Diabetes pedigree function (DPF)

Age

Class (whether diabetes is cat 1 or cat 2) [15].

We create four vertical partitions of the dataset 'Diabetes' by selecting an appropriate number of attributes for each. This is illustrated in tables 4 to 7.

Table 4. 1st vertical partition of the Diabetes dataset

NTP     DPF     Class
4       0.627   -ive
2       0.351   +ive
2       2.288   -ive

Table 5. 2nd vertical partition of the Diabetes dataset

DBP     Age     Class
72      50      -ive
66      31      +ive
64      33      -ive

Table 6. 3rd vertical partition of the Diabetes dataset

TSFT    BMI     Class
35      33.6    -ive
29      28.1    +ive
0       43.1    -ive

Table 7. 4th vertical partition of the Diabetes dataset

PGC     2HIS    Class
148     0       -ive
85      94      +ive
185     168     -ive

Each partitioned table is a dataset of 790 records; only 3 sample records are shown in each table. For the LIAgent, the number of clusters 'k' is 4 and the number of iterations 'n' in each case is 50, i.e. k = 4 and n = 50. The decision rules for each cluster are obtained. For the visualization of the results of these clusters, 2D scatter graphs are also drawn.
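As a hypothetical driver, the pipeline sketch from section 3 could be invoked on the 4th partition with these settings; the three rows from table 7 stand in for the full 790-record partition.

```python
# Illustrative run on the 4th vertical partition (table 7) with the
# paper's settings k = 4 and 50 iterations; the records are placeholders.
records = [
    {"PGC": 148, "2HIS": 0,   "Class": "-ive"},
    {"PGC": 85,  "2HIS": 94,  "Class": "+ive"},
    {"PGC": 185, "2HIS": 168, "Class": "-ive"},
]
groups, rules = liagent(records, attrs=("PGC", "2HIS"), k=4, iterations=50,
                        features=["PGC", "2HIS"], target="Class")
```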

4. Results and Discussion

The results of the LIAgent are discussed in this section. The LIAgent produces two outputs, namely the clusters and the decision rules for the given dataset. A total of 16 clusters are obtained over the four partitions, four clusters per partition. Not all the clusters are good for classification; only the required and useful clusters are discussed further. Sixteen sets of decision rules are also generated by the LIAgent. We present the decision rules of three different clusters. The number of decision rules varies from cluster to cluster; it depends upon the number of records in the cluster.

The decision rules of the 4th partition of the dataset 'Diabetes':

if PGC = "165" then Class = "Cat2"         (Rule 1)
else if PGC = "153" then Class = "Cat2"    (Rule 2)
else if PGC = "157" then Class = "Cat2"    (Rule 3)
else if PGC = "139" then Class = "Cat2"    (Rule 4)
else if 2HIS = "545" then Class = "Cat2"   (Rule 5)
else if 2HIS = "744" then Class = "Cat2"   (Rule 6)
else Class = "Cat1"

There are only six decision rules for the 4th partition of the dataset, so it is easy for anyone to take the decision and interpret the results of this cluster.

The decision rules of the 1st partition of the dataset 'Diabetes':

if DPF = "1.32" then Class = "Cat1"        (Rule 1)
else if DPF = "2.29" then Class = "Cat1"   (Rule 2)
else if NTP = "2" then Class = "Cat2"      (Rule 3)
else if DPF = "2.42" then Class = "Cat1"   (Rule 4)
else if DPF = "2.14" then Class = "Cat1"   (Rule 5)
else if DPF = "1.39" then Class = "Cat1"   (Rule 6)
else if DPF = "1.29" then Class = "Cat1"   (Rule 7)
else if DPF = "1.26" then Class = "Cat1"   (Rule 8)
else Class = "Cat2"

There are eight decision rules for the 1st partition of the dataset. The decision rules make the interpretation of the cluster easy and also help in taking the decision.

The decision rules of the 3rd partition of the dataset 'Diabetes':

if BMI = "29.9" then Class = "Cat1"        (Rule 1)
else if BMI = "32.9" then Class = "Cat1"   (Rule 2)
else if TSFT = "23" then                   (Rule 3)
    if BMI = "25.5" then Class = "Cat1"    (Rule 4)
    else if BMI = "30.1" then Class = "Cat1"   (Rule 5)
    else if BMI = "28.4" then Class = "Cat1"   (Rule 6)
    else Class = "Cat2"
else if BMI = "22.9" then Class = "Cat1"   (Rule 7)
else if BMI = "27.6" then Class = "Cat1"   (Rule 8)
else if BMI = "29.7" then Class = "Cat1"   (Rule 9)
else if BMI = "27.1" then Class = "Cat1"   (Rule 10)
else if BMI = "25.8" then Class = "Cat1"   (Rule 11)
else if BMI = "28.9" then Class = "Cat1"   (Rule 12)
else if BMI = "23.4" then Class = "Cat1"   (Rule 13)
else if BMI = "30.5" then                  (Rule 14)
    if TSFT = "18" then Class = "Cat2"     (Rule 15)
    else Class = "Cat1"
else if BMI = "26.6" then                  (Rule 16)
    if TSFT = "18" then Class = "Cat2"     (Rule 17)
    else Class = "Cat1"
else if BMI = "32" then                    (Rule 18)
    if TSFT = "15" then Class = "Cat2"     (Rule 19)
    else Class = "Cat1"
else if BMI = "31.6" then Class = "Cat2", "Cat1"   (Rule 20)
else Class = "Cat2"

There are twenty decision rules for the 3rd partition of the dataset; the number of rules for this cluster is higher than for the other two clusters discussed.

Visualization is an important tool which provides a better understanding of the data and illustrates the relationships among the attributes of the data. For the visualization, 2D scatter graphs are drawn for all the clusters. We present four 2D scatter graphs of four different clusters from different partitions.
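Graphs of this kind can be produced with a few lines of matplotlib; the sketch below assumes the per-cluster record lists returned by the liagent sketch in section 3, and all names are illustrative.

```python
# Minimal sketch of a 2D scatter graph per cluster, assuming matplotlib
# is installed; 'groups' is the per-cluster record list from liagent.
import matplotlib.pyplot as plt

def scatter_clusters(groups, x_attr, y_attr):
    for i, g in enumerate(groups):
        if not g:
            continue
        plt.scatter([r[x_attr] for r in g], [r[y_attr] for r in g],
                    label=f"cluster {i + 1}")
    plt.xlabel(x_attr)
    plt.ylabel(y_attr)
    plt.legend()
    plt.show()

# e.g. scatter_clusters(groups, "PGC", "2HIS") for a graph like figure 6
```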

Figure 3. 2D scatter graph between the 'NTP' and 'DPF' attributes of the 'Diabetes' dataset

The distance between the 'NTP' and 'DPF' attributes of the 'Diabetes' dataset varies at the beginning of the graph, but after some interval the distance becomes constant.

Figure 4. 2D scatter graph between the 'DBP' and 'AGE' attributes of the 'Diabetes' dataset

There is a variable distance between the 'DBP' and 'AGE' attributes of the dataset; it remains variable throughout this graph.

Figure 5. 2D scatter graph between the 'TSFT' and 'BMI' attributes of the 'Diabetes' dataset

The graph shows an almost constant distance between the 'TSFT' and 'BMI' attributes of the dataset; it remains constant throughout the graph.

Figure 6. 2D scatter graph between the 'PGC' and '2HIS' attributes of the 'Diabetes' dataset

There is a variable distance between the 'PGC' and '2HIS' attributes of the dataset, but in the middle of this graph the distance between these attributes is somewhat constant. The structure of this graph is similar to that of figure 5.

5. Conclusion

It is not simple for all users to interpret the clusters and extract the required results from them unless some other data mining algorithms or tools are used. In this research paper we have tried to address this issue by integrating the K-means clustering algorithm with the decision tree (ID3) algorithm. The choice of ID3 is due to its output of decision rules in if-then-else form, which are easy to understand and help in taking decisions. The result is a hybrid combination of supervised and unsupervised machine learning, realized through an intelligent agent, called the LIAgent. The LIAgent is helpful in the classification and prediction of the given dataset. Furthermore, 2D scatter graphs of the clusters are drawn for visualization.