Saturday, 7 September 2024

K Means Clustering Algorithm

 

K-means is worth knowing: it is a popular clustering algorithm that does a good job of grouping roughly spherical data into distinct groups. That makes it valuable both as an analysis tool when the groupings in rows of data are unclear and as a feature-engineering step for improving supervised learning models.


Clustering models aim to group data into distinct “clusters” or groups. The result can serve as an interesting view in an analysis or as a feature in a supervised learning algorithm.

Consider a social setting where groups of people are having discussions in different circles around a room. When you first look at the room, you just see a crowd of people. You could mentally place a point in the center of each group and give that point a unique name. You would then be able to refer to each group by that name. This is essentially what k-means clustering does with data: the points are the centroids, and each row of data belongs to the group whose centroid is nearest.
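The room analogy can be sketched in a few lines of plain Python. This is a minimal illustration of the usual assign-then-update loop (often called Lloyd's algorithm), not a production implementation; the `kmeans` function, the `people` coordinates, and the choice of random starting centroids are all assumptions made up for this example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means on a list of (x, y) tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # nobody changed group, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty group keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Hypothetical positions of six people standing in two circles.
people = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = kmeans(people, k=2)
print(centroids)
print(labels)
```

The label numbers themselves are arbitrary; what matters is that the first three people share one label and the last three share the other.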




How to determine K, i.e. the number of clusters

In the case below, K is 2, and the diagram shows the Sum of Squared Errors (SSE): the squared distance from each point to its centroid, added up over all points.
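As a small worked example of the SSE for K = 2, the sketch below adds up the squared distance from each point to the centroid of its own cluster. The four points, their labels, and the two centroids are invented for illustration and are not the data from the diagram.

```python
def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        (x - centroids[lab][0]) ** 2 + (y - centroids[lab][1]) ** 2
        for (x, y), lab in zip(points, labels)
    )

# Hypothetical data: two clusters of two points each.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 8.0), (8.0, 10.0)]
labels = [0, 0, 1, 1]
centroids = [(1.5, 1.0), (8.0, 9.0)]  # the mean of each cluster

total = sse(points, centroids, labels)
print(total)  # 0.25 + 0.25 + 1.0 + 1.0 = 2.5
```

A lower SSE means the points sit closer to their centroids, which is why the value is used to compare different choices of K.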






If there are 11 data points, the SSE becomes zero when K is 11, because every point is then its own centroid. In the diagram below, however, the recommended K value is 4. This technique is called the elbow technique: plot the SSE against K and pick the K at the bend (the "elbow") of the curve, where the sharp drop in SSE flattens out.
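The behaviour described here can be reproduced with a short sweep over K. The sketch below is illustrative only: the 11 points are made-up coordinates arranged in four loose clumps, `kmeans_sse` is a hypothetical helper, and taking the best of several random starts is an assumption added to reduce the effect of unlucky initial centroids.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_sse(points, k, seed, iterations=100):
    """Run a basic k-means and return the final SSE (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct starting points
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return sum(dist2(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical data: 11 distinct points in four loose clumps.
points = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.6),
          (8.0, 1.0), (8.4, 1.5), (7.7, 0.8),
          (1.0, 8.0), (1.3, 8.5), (0.7, 7.8),
          (8.0, 8.0), (8.5, 8.3)]

# For each K, keep the lowest SSE found over a few random starts.
sse_by_k = {
    k: min(kmeans_sse(points, k, seed) for seed in range(10))
    for k in range(1, len(points) + 1)
}

for k, value in sse_by_k.items():
    print(k, round(value, 3))
```

Reading the printed values from K = 1 upward, the elbow is the K after which the SSE stops falling sharply. At K = 11 each point is its own centroid, so the SSE is zero, but that says nothing useful about the grouping.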




Source: https://www.youtube.com/watch?v=EItlUEPCIzM&list=PLeo1K3hjS3us_ELKYSj_Fth2tIEkdKXvV&index=53

https://www.datacamp.com/tutorial/k-means-clustering-python

