# Spatial clustering of wine quality to identify anomalies

## What is DBSCAN?

DBSCAN stands for **D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise. It is a clustering method that separates high-density regions of points from low-density regions. As a by-product, the algorithm identifies the noise points (outliers) in a set of data points. It sounds complicated, but it is simple and easy to apply.

DBSCAN is an example of **unsupervised learning** (a branch of machine learning, and hence a subset of artificial intelligence) and belongs to the family of density-based algorithms. Before proceeding further, we need to understand what unsupervised learning is.

**Unsupervised (machine) learning** algorithms infer patterns from a dataset without reference to known, or labelled, outcomes. The term ‘**density-based algorithm**’ means that data points are grouped according to how densely they are packed together. What follows is a technical dive into the approach taken.

## How to implement DBSCAN

The two main parameters of the DBSCAN algorithm are **ε** (epsilon) and **minPoint**, which are defined as:

### Epsilon:

**ε = Radius of the neighbourhood region**

Two points are considered neighbours if the distance between them is below the threshold ε. The distance is measured as the Euclidean distance between the points. For example, for two points $X = (x_1, x_2)$ and $Y = (y_1, y_2)$ in a two-dimensional space, the distance is $d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$.

*By picking larger values of ε, more points become density-reachable (fewer outliers found), and by choosing smaller values of ε, fewer points become density-reachable (more outliers found).*
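To make the role of ε concrete, here is a minimal sketch in Python. It first computes the Euclidean distance between two made-up points and checks whether they are neighbours, then runs scikit-learn's `DBSCAN` on synthetic data with a small and a large `eps` to compare the number of noise points. The points, the random data, the helper names `are_neighbours` and `count_noise`, and the chosen values of `eps` are all illustrative assumptions, not values from the wine quality analysis. The sketch also assumes that a distance exactly equal to ε counts as "neighbour".

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two illustrative points in a two-dimensional space (made-up values).
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean distance: sqrt((1 - 4)^2 + (2 - 6)^2)
distance = np.linalg.norm(x - y)


def are_neighbours(a, b, eps):
    """Return True if the Euclidean distance between a and b is at most eps."""
    return np.linalg.norm(a - b) <= eps


# Synthetic data: one dense blob plus a few scattered points (assumption).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),   # dense region
    rng.uniform(low=-3.0, high=3.0, size=(10, 2)),  # sparse, scattered points
])


def count_noise(data, eps, min_samples=5):
    """Fit DBSCAN and count the points labelled -1 (noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data)
    return int(np.sum(labels == -1))


noise_small_eps = count_noise(points, eps=0.2)
noise_large_eps = count_noise(points, eps=1.0)

print("distance between x and y:", distance)
print("noise points with eps=0.2:", noise_small_eps)
print("noise points with eps=1.0:", noise_large_eps)
```

With minPoint held fixed, increasing `eps` can only move points out of the noise set, so the second count is never larger than the first.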

### minPoint:

**minPoint = Minimum number of points that must be present within the neighbourhood**

We can choose minPoint to suit the problem. For example, if we require at least 10 points within the ε-neighbourhood of a point for it to count as a core point, we set minPoint to 10.

Based on ε and minPoint, DBSCAN sorts every data point into one of three categories: core point, border point or noise point (outlier). The figure below illustrates the scenario.

**Core point** = A data point is a core point if at least minPoint points lie within its ε-neighbourhood. For example, if minPoint is five and a data point has only three points in its neighbourhood, it cannot be classified as a core point because it does not satisfy the requirement.

**Border point** = A data point is a border point if it has fewer than minPoint points within its ε-neighbourhood but at least one of them is a core point. For example, with minPoint set to five, a data point that has three points in its neighbourhood, one of which is a core point within distance ε, is a border point.

**Noise point** = Noise points are the outliers, and finding them is the goal of applying DBSCAN here. A data point is a noise point if it is neither a core point nor a border point. Such points can be interpreted as extreme values, unexpected occurrences or behaviour that differs from regular events.
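The three definitions above can be sketched directly with NumPy. This is an illustration of the definitions only, not how scikit-learn implements DBSCAN internally. The function name `classify_points`, the six sample points and the parameter values are made up for the example, and the sketch assumes that a point counts itself as a member of its own neighbourhood (the convention scikit-learn documents for `min_samples`).

```python
import numpy as np


def classify_points(data, eps, min_points):
    """Label each row of `data` as 'core', 'border' or 'noise'."""
    # Pairwise Euclidean distances between all points.
    diffs = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))

    # within[i, j] is True when point j lies in the eps-neighbourhood of i
    # (each point is counted in its own neighbourhood).
    within = dist <= eps

    # Core: at least min_points points in the neighbourhood.
    is_core = within.sum(axis=1) >= min_points

    # Border: not core, but at least one core point in the neighbourhood.
    has_core_neighbour = (within & is_core[None, :]).any(axis=1)

    return np.where(
        is_core, "core", np.where(has_core_neighbour, "border", "noise")
    )


# Four tightly packed points, one point on the edge, one far away.
sample = np.array([
    [0.0, 0.0],
    [0.5, 0.0],
    [0.0, 0.5],
    [0.5, 0.5],
    [1.3, 0.5],   # close to two of the packed points only
    [5.0, 5.0],   # isolated
])

labels = classify_points(sample, eps=1.0, min_points=4)
for point, label in zip(sample, labels):
    print(point, label)
```

In this toy example the four packed points are core points, the fifth point is a border point because it has only three points in its neighbourhood but two of them are core, and the isolated point is noise.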

### Implementation of DBSCAN in Python

We can implement the DBSCAN algorithm in Python with scikit-learn, which is a simple procedure.

**Step 1: Import the necessary libraries**

**Step 2: Read the dataset**

**Step 3: Basic idea of data**
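Steps 1 to 3 might look like the sketch below, assuming the data is available as a CSV file and is loaded with pandas. The helper `load_dataset`, the column names and the four sample rows are hypothetical stand-ins so that the example runs on its own. With a real file you would pass its path to `load_dataset` and, if needed, adjust the `sep` argument of `pd.read_csv`.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a CSV file (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Stand-in for the real file: a few made-up rows with assumed column names.
sample_csv = io.StringIO(
    "alcohol,pH,quality\n"
    "9.4,3.51,5\n"
    "9.8,3.20,5\n"
    "10.5,3.16,6\n"
    "12.8,3.35,7\n"
)

df = load_dataset(sample_csv)

# Basic idea of the data: first rows, size, summary statistics, missing values.
print(df.head())
print(df.shape)
print(df.describe())
print(df.isna().sum())
```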

**Step 4: Define the model**

**Step 5: Check the count in each cluster**
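Steps 4 and 5 can be sketched as follows. The data is synthetic (one dense blob plus three deliberately extreme rows), and the column names and the values `eps=0.3` and `min_samples=5` are assumptions chosen for this toy data, not the settings used for the wine quality data. In scikit-learn, `eps` corresponds to ε, `min_samples` to minPoint, and the label `-1` marks noise points.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the prepared feature table (assumption).
rng = np.random.default_rng(42)
features = pd.DataFrame(
    np.vstack([
        rng.normal(loc=0.0, scale=0.3, size=(200, 2)),  # dense blob
        [[3.0, 3.0], [-3.0, 2.5], [2.5, -3.0]],          # extreme rows
    ]),
    columns=["feature_1", "feature_2"],
)

# Step 4: define the model and fit it.
model = DBSCAN(eps=0.3, min_samples=5)
labels = model.fit_predict(features)

# Step 5: count the points in each cluster; the row labelled -1 is the noise.
cluster_counts = pd.Series(labels, name="cluster").value_counts().sort_index()
n_outliers = int((labels == -1).sum())

print(cluster_counts)
print("number of outliers:", n_outliers)
```

The three extreme rows end up with the label `-1`; some points on the fringe of the blob may be labelled as noise as well, depending on `eps`.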

**Step 6: Visualize the outliers**
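One way to visualise the outliers is a scatter plot that draws clustered points coloured by cluster and noise points as red crosses. The sketch below uses matplotlib with synthetic two-dimensional data. The data, the parameter values and the output file name `dbscan_outliers.png` are illustrative assumptions. Data with more than two features would first need two columns to be chosen for plotting.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: one dense blob plus three extreme points (assumption).
rng = np.random.default_rng(7)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(200, 2)),
    [[3.0, 3.0], [-3.0, 2.5], [2.5, -3.0]],
])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(points)
is_noise = labels == -1

fig, ax = plt.subplots(figsize=(6, 4))

# Clustered points, coloured by cluster label.
ax.scatter(
    points[~is_noise, 0], points[~is_noise, 1],
    c=labels[~is_noise], cmap="viridis", s=15, label="clustered points",
)
# Noise points (label -1), drawn as red crosses.
ax.scatter(
    points[is_noise, 0], points[is_noise, 1],
    c="red", marker="x", s=40, label="noise (outliers)",
)

ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_title("DBSCAN clusters and outliers")
ax.legend()
fig.savefig("dbscan_outliers.png", dpi=150)
```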

## Advantages of DBSCAN

- It can identify outliers in data of any shape, whereas k-means and k-medians work well only when the clusters are roughly circular.
- It can identify clusters whatever the shape of the data, and it is simple to implement.

## Disadvantages of DBSCAN

- It is sensitive to ε and minPoint: the set of outliers changes with every combination of values.
- If the data contains regions of varying density, it is tricky to identify clusters and noise points.
- DBSCAN is good at separating high-density clusters from low-density clusters, but it struggles to separate clusters of similar density.
- DBSCAN suffers when the data is high-dimensional, so feature selection is needed as an additional step before passing the data to DBSCAN.

## Other potential applications of DBSCAN

- Anomaly detection in temperature, sales and X-ray image cells.
- Clustering of data.
- Identification of abnormal behaviour in the stock market.

**References**

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
