How to Deal with Categorical Data for Machine Learning
By Shelvi Garg, Data Scientist
In this blog we will explore and implement:
- One-hot Encoding using:
- Python's category_encoders library
- Scikit-learn preprocessing
- Pandas' get_dummies
- Binary Encoding
- Frequency Encoding
- Label Encoding
- Ordinal Encoding
What is Categorical Data?
Categorical data is a type of data used to group information with similar characteristics, while numerical data expresses information in the form of numbers.
Example of categorical data: gender
Why do we need encoding?
- Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values
- Many algorithms' performance even varies based on how the categorical variables are encoded
Categorical variables can be divided into two categories:
- Nominal: no particular order (e.g., city)
- Ordinal: there is some order between values (e.g., class)
We will also refer to a cheat sheet that shows when to use which type of encoding.
Method 1: Using Python's category_encoders Library
category_encoders is an amazing Python library that provides fifteen different encoding schemes.
Here is the list of the fifteen types of encoding the library supports:
- One-hot Encoding
- Label Encoding
- Ordinal Encoding
- Helmert Encoding
- Binary Encoding
- Frequency Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
- Hashing Encoding
- Backward Difference Encoding
- Leave One Out Encoding
- James-Stein Encoding
- M-estimator Encoding
- Thermometer Encoder
Installing and importing libraries:
pip install category_encoders

import pandas as pd
import sklearn
import category_encoders as ce
Creating a dataframe:
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'class': ['A', 'B', 'C', 'D', 'A'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram']
})
data.head()
Implementing one-hot encoding through category_encoders
In this method, each category is mapped to a vector that contains 1 and 0 denoting the presence or absence of the feature. The number of vectors depends on the number of categories for the feature.
Create an object of the one-hot encoder:
ce_OHE = ce.OneHotEncoder(cols=['gender', 'city'])
data1 = ce_OHE.fit_transform(data)
data1.head()
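By default the encoder numbers the generated columns (gender_1, gender_2, and so on). As a minimal sketch beyond the original snippet, category_encoders also accepts a use_cat_names parameter that keeps the original category labels in the column names:

# Sketch: name output columns after the category labels instead of numbering them
ce_OHE_named = ce.OneHotEncoder(cols=['gender', 'city'], use_cat_names=True)
ce_OHE_named.fit_transform(data).head()
# Expect columns like gender_Male, gender_Female, city_Delhi, city_Gurugram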
Binary Encoding
Binary encoding converts a category into binary digits. Each binary digit creates one feature column.
ce_be = ce.BinaryEncoder(cols=['class'])
# transform the data
data_binary = ce_be.fit_transform(data["class"])
data_binary
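Because each category becomes a short binary number, the column count grows logarithmically rather than linearly with the number of categories. A small sketch of that arithmetic (our helper, assuming category_encoders' default ordinal mapping that starts at 1):

import math

# Categories are first label-encoded 1..N, then written in binary,
# so the number of output columns is ceil(log2(N + 1))
def binary_encoding_width(n_categories: int) -> int:
    return math.ceil(math.log2(n_categories + 1))

print(binary_encoding_width(4))    # the 'class' column (A, B, C, D) -> 3 columns
print(binary_encoding_width(100))  # 100 categories -> only 7 columns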
Similarly, there are another 14 types of encoding provided by this library.
Method 2: Using Pandas' get_dummies
pd.get_dummies(data, columns=["gender", "city"])
We can assign a prefix if we do not want the encoding to use the default column names.
pd.get_dummies(data, prefix=["gen", "city"], columns=["gender", "city"])
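As an extra option not covered above, get_dummies also takes drop_first=True, which drops one level per feature to avoid perfectly redundant (collinear) dummy columns:

# Drop the first level of each encoded feature; with 2 levels each,
# 'gender' and 'city' collapse to a single 0/1 column apiece
pd.get_dummies(data, columns=["gender", "city"], drop_first=True)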
Method 3: Using Scikit-learn
Scikit-learn also has 15 different types of built-in encoders, which can be accessed from sklearn.preprocessing.
Scikit-learn One-hot Encoding
Let's first get the list of categorical variables from our data:
s = (data.dtypes == 'object')
cols = list(s[s].index)  # ['gender', 'class', 'city']

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)  # scikit-learn >= 1.2 renamed this to sparse_output
Applying on the gender column:
data_gender = pd.DataFrame(ohe.fit_transform(data[["gender"]]))
data_gender
Applying on the city column:
data_city = pd.DataFrame(ohe.fit_transform(data[["city"]]))
data_city
Applying on the class column:
data_class = pd.DataFrame(ohe.fit_transform(data[["class"]]))
data_class
This gives four columns, because the class column has four unique values.
Applying to the list of categorical variables:
data_cols = pd.DataFrame(ohe.fit_transform(data[cols]))
data_cols
Here the first two columns represent gender, the next four columns represent class, and the remaining two represent city.
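Rather than counting columns by hand, the fitted encoder can label them for you. A minimal sketch using get_feature_names_out, assuming scikit-learn >= 1.0 (older versions used get_feature_names):

# Attach readable column names recovered from the fitted encoder
data_cols.columns = ohe.get_feature_names_out(cols)
data_cols.head()
# Columns come out as gender_Female, gender_Male, class_A, ..., city_Gurugram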
Scikit-learn Label Encoding
In label encoding, each category is assigned an integer value from 0 through N-1, where N is the number of categories for the feature. There is no meaningful relation or order between these assignments.
from sklearn.preprocessing import LabelEncoder

# LabelEncoder takes no arguments
le = LabelEncoder()
le_class = le.fit_transform(data["class"])
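Two conveniences on the fitted encoder are worth a quick sketch (standard LabelEncoder attributes, shown here as an aside): classes_ holds the learned categories in sorted order, and inverse_transform maps integers back to labels:

print(le.classes_)                   # ['A' 'B' 'C' 'D']
print(le_class)                      # [0 1 2 3 0]
print(le.inverse_transform([0, 3]))  # ['A' 'D']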
Compared with one-hot encoding, label encoding produces a single integer column instead of one 0/1 column per category.
Ordinal Encoding
Ordinal encoding's encoded variables retain the ordinal (ordered) nature of the variable. It looks similar to label encoding; the only difference is that label encoding doesn't consider whether a variable is ordinal or not, and simply assigns a sequence of integers.
Example: Ordinal encoding will assign values as Very Good(1) < Good(2) < Bad(3) < Worse(4)
First, we need to assign the original order of the variable through a dictionary.
temp = {'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']}
df = pd.DataFrame(temp, columns=["temperature"])
temp_dict = {'very cold': 1, 'cold': 2, 'warm': 3, 'hot': 4, 'very hot': 5}
df
Then we can map each row of the variable as per the dictionary.
df["temp_ordinal"] = df.temperature.map(temp_dict) df
Frequency Encoding
Each category is assigned a value as per its frequency in the total data.
data_freq = pd.DataFrame({'class' : ['A','B','C','D','A',"B","E","E","D","C","C","C","E","A","A"]})
Grouping by the class column:
fe = data_freq.groupby("class").size()
Dividing by the length of the data:

fe_ = fe / len(data_freq)

Mapping and rounding off:

data_freq["data_fe"] = data_freq["class"].map(fe_).round(2)
data_freq
In this article, we saw five types of encoding schemes. Similarly, there are ten other types of encoding which we have not looked at:
- Helmert Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
- Hashing Encoding
- Backward Difference Encoding
- Leave One Out Encoding
- James-Stein Encoding
- M-estimator Encoding
- Thermometer Encoder
Which Encoding Method is Best?
There is no single method that works best for every problem or dataset. I personally think the get_dummies method has an advantage in that it is very easy to implement.
If you want to read about all 15 types of encoding, the first article in the references below is a very good one to refer to.
Here is a cheat sheet on when to use what type of encoding:
[Image: cheat sheet mapping encoding types to use cases]
References:
- https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
- https://pypi.org/project/category-encoders/
- https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
Bio: Shelvi Garg is a Data Scientist. Interests and learnings are not limited.
Source: https://www.kdnuggets.com/2021/05/deal-with-categorical-data-machine-learning.html