K-means clustering analysis case (1)

Previous posts in this series:

Introduction to clustering

Hierarchical cluster analysis case (1): World Bank sample data set

Hierarchical Cluster Analysis Case (2): Burning Status of Amazon Rainforest

Hierarchical Cluster Analysis Case (3): Gene Clustering

Food consumption patterns are an active research topic in medicine and nutrition. Food consumption is related to an individual's overall health, the nutritional value of the food, the cost of buying it, and the environment in which it is consumed. This analysis looks at the relationship between meat and other foods in 25 European countries, in particular how meat consumption correlates with consumption of the other foods. The data cover nine food groups: red meat, white meat, eggs, milk, fish, cereals, starchy foods, nuts (including beans and oilseeds), and fruits and vegetables.

Preparation

To apply K-means clustering, we use a protein consumption data set covering 25 European countries.

Step 1: Collect and describe data.

This task uses a data set named protein, stored as a CSV file in standard format. It contains 25 rows of data and 10 variables.

The numerical variables are as follows:

red meat

white meat

egg

milk

fish

cereal

starch

nut

fruits and vegetables

Non-numeric variables are as follows:

country

Specific implementation steps

Here are the implementation details.

Step 2: Explore the data

Let's explore the data and understand the relationships between the variables. Start by importing the CSV file named Europenaprotein.csv and saving the data to the protein data frame:
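A minimal sketch of this import, assuming the file has a header row. So that it runs on its own, it first writes a tiny placeholder file to a temporary directory; the values in it are made up and the column names are assumptions, not taken from the real data set:

```r
# Placeholder CSV so the sketch is self-contained.
# The two rows are invented values, NOT the real protein data,
# and the column names are assumed for illustration.
csv_path <- file.path(tempdir(), "Europenaprotein.csv")
writeLines(c(
  "Country,RedMeat,WhiteMeat,Eggs,Milk,Fish,Cereals,Starch,Nuts,Fr.Veg",
  "A,10,2,1,9,1,40,1,5,2",
  "B,9,14,4,20,2,28,4,1,4"
), csv_path)

# Read the CSV into the protein data frame; header = TRUE uses the
# first line of the file as column names.
protein <- read.csv(csv_path, header = TRUE)
str(protein)
```

With the real file, only the `read.csv()` call is needed, pointed at wherever the file is stored.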

The head() function returns the first part of a vector, matrix, table, data frame or function (tail() returns the last part). Pass the protein data frame to the head() function.
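As a sketch of how head() behaves, the snippet below builds a random stand-in with the same shape as the data set (25 rows, a country column plus nine numeric columns). The values are random placeholders and the column names are assumptions:

```r
# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

head(protein)      # first six rows by default
head(protein, 3)   # first three rows
tail(protein, 3)   # last three rows, for comparison
```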

The results are as follows:

Step 3: Clustering

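Start clustering with three clusters. Because the initial cluster centers are chosen at random, the set.seed() function is called first. set.seed() does not generate random numbers itself; it fixes the seed of the random number generator, so the same random centers, and therefore the same clusters, are obtained on every run.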
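A short illustration of what fixing the seed does. The seed value 123 is arbitrary and chosen only for this example:

```r
# Same seed, same sequence of random draws.
set.seed(123)
a <- runif(3)

set.seed(123)
b <- runif(3)

identical(a, b)  # TRUE: resetting the seed replays the same draws
```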

The kmeans() function performs K-means clustering on a data matrix. The protein data is passed to this function and must be numeric. centers = 3 sets the number of clusters. Because centers is given as a number rather than as a set of starting points, the initial centers are chosen at random, and nstart = 10 sets how many such random starting sets are tried; the best result is kept.
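A sketch of such a call on random stand-in data. Clustering on just the white meat and red meat columns is an assumption made here, as are the column names and the seed value:

```r
# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

# Fix the seed so the random starting centers are reproducible.
set.seed(42)

# Three clusters on two numeric columns; 10 random starts, best one kept.
groupMeat <- kmeans(protein[, c("WhiteMeat", "RedMeat")],
                    centers = 3, nstart = 10)

groupMeat$centers   # coordinates of the three cluster centers
groupMeat$size      # number of countries in each cluster
```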

The results are as follows:

Next, a list of cluster assignments is generated. The order() function returns the permutation of indices that sorts its first argument into ascending (or, optionally, descending) order. The cluster assignments stored in groupMeat are passed in:
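To make the behavior of order() concrete, here is a small example on a hand-written vector of cluster labels (the vector is invented for illustration):

```r
# Hypothetical cluster labels for five observations.
cluster <- c(2, 1, 3, 1, 2)

# order() returns positions, not sorted values.
o <- order(cluster)
o            # 2 4 1 5 3

# Indexing with those positions yields the sorted vector.
cluster[o]   # 1 1 2 2 3

# Descending order is available through the decreasing argument.
order(cluster, decreasing = TRUE)
```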

Call the data.frame() function to display the countries and the clusters they belong to:
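One way this listing could be built, shown on random stand-in data. The column names, the choice of clustering columns and the seed are assumptions:

```r
# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

set.seed(42)
groupMeat <- kmeans(protein[, c("WhiteMeat", "RedMeat")],
                    centers = 3, nstart = 10)

# Sort the countries by their cluster number and list them side by side.
o <- order(groupMeat$cluster)
assignments <- data.frame(Country = protein$Country[o],
                          Cluster = groupMeat$cluster[o],
                          row.names = NULL)
print(assignments)
```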

The results are as follows:

The plot() function is a generic function for plotting R objects. The type parameter selects the kind of plot to draw. The xlim parameter gives the limits of the x axis as a pair of values (lower and upper bound). xlab and ylab provide the titles of the x axis and y axis, respectively:
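A sketch of such a plot on random stand-in data. Drawing country labels with text() and coloring them by cluster is an assumption about the intended figure, as are the axis limits and column names:

```r
# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

set.seed(42)
groupMeat <- kmeans(protein[, c("WhiteMeat", "RedMeat")],
                    centers = 3, nstart = 10)

# type = "n" sets up the axes without drawing any points.
plot(protein$RedMeat, protein$WhiteMeat, type = "n",
     xlim = c(0, 20), xlab = "Red meat", ylab = "White meat")

# Write each country name at its position, colored by cluster.
text(protein$RedMeat, protein$WhiteMeat,
     labels = protein$Country, col = groupMeat$cluster + 1)
```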

The results are as follows:

Step 4: Improve the model.

Next, all nine protein sources are used for clustering, and seven clusters are created. On the scatter plot, points of different colors mark the clusters of countries by white meat and red meat consumption. Geographically adjacent countries tend to fall into the same cluster.

centers = 7 sets the number of clusters:
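A sketch of the seven-cluster run on random stand-in data. The result name groupProtein, the nstart value, the seed and the column names are assumptions made for this example:

```r
# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

set.seed(42)

# Drop the non-numeric Country column and cluster on all nine variables.
groupProtein <- kmeans(protein[, -1], centers = 7, nstart = 10)

# How many countries landed in each of the seven clusters.
table(groupProtein$cluster)
```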

Seven different clusters were formed. All 25 countries are assigned to a certain group.

The results are as follows:

The clusplot() function creates a bivariate plot that shows how the data are partitioned. All observations are drawn as points on the principal components, and an ellipse is drawn around each cluster. The protein data frame is passed in as the object:
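A sketch using clusplot() from the cluster package on random stand-in data. The plot title, the display options (color, labels, lines) and the column names are choices made for this example, not settings confirmed by the text:

```r
library(cluster)  # provides clusplot()

# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

set.seed(42)
groupProtein <- kmeans(protein[, -1], centers = 7, nstart = 10)

# Project the nine variables onto two components and draw the clusters.
# labels = 2 labels all points and ellipses; lines = 0 omits the lines
# connecting the ellipses.
clusplot(protein[, -1], groupProtein$cluster,
         main = "2D representation of the cluster solution",
         color = TRUE, labels = 2, lines = 0)
```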

The results are as follows:

A hierarchical representation is another option. The agnes() function is used here. Setting diss = FALSE indicates that the input is the original data (observations by variables) rather than a dissimilarity matrix, so the dissimilarities are computed from the data. metric = "euclidean" means Euclidean distance is used for this calculation:
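A sketch of the agnes() call on random stand-in data. The result name foodagg and the column names are assumptions, and the Country column is dropped here because agnes() expects numeric input:

```r
library(cluster)  # provides agnes()

# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

# diss = FALSE: the input is raw data, so Euclidean distances are
# computed internally before the agglomerative clustering.
foodagg <- agnes(protein[, -1], diss = FALSE, metric = "euclidean")

foodagg$ac   # agglomerative coefficient, a measure of clustering structure
```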

The results are as follows:

The plot() function draws the result. In an interactive session, press Enter to move on to the next plot; two plots are produced.
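For an agnes object the two plots are a banner plot and a dendrogram. The sketch below, on random stand-in data with assumed column names and the assumed result name foodagg, requests each one explicitly through the which.plots argument so that no prompt is needed:

```r
library(cluster)  # provides agnes() and its plot method

# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

foodagg <- agnes(protein[, -1], diss = FALSE, metric = "euclidean")

plot(foodagg, which.plots = 1)  # banner plot
plot(foodagg, which.plots = 2)  # dendrogram
```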

The results are as follows:

The cutree() function cuts the tree into groups, either by specifying the desired number of groups or the height at which to cut:
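A sketch of both ways of cutting, on random stand-in data. The choice of four groups, the cut height, the column names and the conversion through as.hclust() are assumptions made for this example:

```r
library(cluster)  # provides agnes()

# Random stand-in for the protein data frame (placeholder values only).
set.seed(1)
food_cols <- c("RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
               "Cereals", "Starch", "Nuts", "Fr.Veg")
protein <- data.frame(
  Country = paste0("C", 1:25),
  matrix(round(runif(25 * 9, 0, 20), 1), nrow = 25,
         dimnames = list(NULL, food_cols))
)

foodagg <- agnes(protein[, -1], diss = FALSE, metric = "euclidean")

# Convert to an hclust tree, then cut by number of groups...
hc <- as.hclust(foodagg)
groups <- cutree(hc, k = 4)
table(groups)

# ...or by height (here the median merge height, chosen arbitrarily).
groups_by_height <- cutree(hc, h = median(hc$height))
table(groups_by_height)
```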

The results are as follows:
