


How to Deal with Categorical Data for Machine Learning

By Shelvi Garg, Data Scientist


In this blog, we will explore and implement:

  • One-hot Encoding using:
    • Python's category_encoders library
    • Scikit-learn preprocessing
    • Pandas' get_dummies
  • Binary Encoding
  • Frequency Encoding
  • Label Encoding
  • Ordinal Encoding

What is Categorical Data?


Categorical data is a type of data that is used to group information with similar characteristics, while numerical data is a type of data that expresses information in the form of numbers.

Example of categorical data: gender

Why do we need encoding?

  • Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values (see the short sketch below)
  • Many algorithms' performance even varies based upon how the categorical variables are encoded
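
To make the first point concrete, here is a minimal sketch (with made-up data) of a scikit-learn model rejecting raw strings but accepting an encoded version:

from sklearn.linear_model import LogisticRegression
import pandas as pd

X = pd.DataFrame({'gender': ['Male', 'Female', 'Male', 'Female']})
y = [0, 1, 0, 1]
# LogisticRegression().fit(X, y)  # raises ValueError: could not convert string to float
X_encoded = pd.get_dummies(X)     # after encoding, fitting works
LogisticRegression().fit(X_encoded, y)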

Categorical variables can be divided into two categories:

  • Nominal: no particular order
  • Ordinal: there is some order between values (see the examples just below)
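
For example (the values here are hypothetical):

# nominal: categories with no inherent order, e.g. city names
city = ['Delhi', 'Gurugram', 'Mumbai']
# ordinal: categories with a natural order, e.g. temperature levels
temperature = ['cold', 'warm', 'hot', 'very hot']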

We will also refer to a cheat sheet that shows when to use which type of encoding.

Method 1: Using Python's Category Encoder Library


category_encoders is an amazing Python library that provides 15 different encoding schemes.

Here is the list of the 15 types of encoding the library supports:

  • One-hot Encoding
  • Label Encoding
  • Ordinal Encoding
  • Helmert Encoding
  • Binary Encoding
  • Frequency Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder

Importing libraries:

# install the library first (run in a shell): pip install category_encoders

import pandas as pd
import sklearn
import category_encoders as ce

Creating a dataframe:

data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'class': ['A', 'B', 'C', 'D', 'A'],
    'city': ['Delhi', 'Gurugram', 'Delhi', 'Delhi', 'Gurugram']
})
data.head()
[Image by Author]

Implementing one-hot encoding through category_encoders

In this method, each category is mapped to a vector that contains 1 and 0, denoting the presence or absence of the feature. The number of vectors depends on the number of categories for the feature.

Create an object of the one-hot encoder:

ce_OHE = ce.OneHotEncoder(cols=['gender', 'city'])
data1 = ce_OHE.fit_transform(data)
data1.head()
[Image by Author]
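
If we prefer the output columns to be named after the category values, the library's one-hot encoder supports this (a small sketch; check the use_cat_names option against your category_encoders version):

# use_cat_names=True names columns like gender_Male instead of gender_1
ce_OHE_named = ce.OneHotEncoder(cols=['gender', 'city'], use_cat_names=True)
ce_OHE_named.fit_transform(data).head()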

Binary Encoding

Binary encoding converts a category into binary digits. Each binary digit creates one feature column.

[Image: Ref]

ce_be = ce.BinaryEncoder(cols=['class'])
# transform the data
data_binary = ce_be.fit_transform(data["class"])
data_binary
[Image by Author]

Similarly, there are another 14 types of encoding provided by this library; a quick taste of one of them follows.
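
Here is a sketch of mean (target) encoding, which replaces each category with the mean of a target variable; the data and the target y below are made up for illustration:

# mean (target) encoding with category_encoders; X and y are hypothetical
X = pd.DataFrame({'class': ['A', 'B', 'A', 'C', 'B']})
y = [1, 0, 1, 0, 1]
ce_te = ce.TargetEncoder(cols=['class'])
X_te = ce_te.fit_transform(X, y)
X_te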

Method 2: Using Pandas' get_dummies

pd.get_dummies(data, columns=["gender", "city"])
[Image by Author]

We can assign a prefix if we do not want the encoding to use the default column names.

pd.get_dummies(data, prefix=["gen", "city"], columns=["gender", "city"])
[Image by Author]

Method 3: Using Scikit-learn


Scikit-learn also has several different built-in encoders, which can be accessed from sklearn.preprocessing.

Scikit-learn One-hot Encoding


Let's first get the list of categorical variables from our data:

s = (data.dtypes == 'object')
cols = list(s[s].index)

from sklearn.preprocessing import OneHotEncoder

# note: in scikit-learn >= 1.2 the argument is sparse_output=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

Applying on the gender column:

data_gender = pd.DataFrame(ohe.fit_transform(data[["gender"]]))
data_gender
[Image by Author]

Applying on the city column:

data_city = pd.DataFrame(ohe.fit_transform(data[["city"]]))
data_city
[Image by Author]

Applying on the class column:

data_class = pd.DataFrame(ohe.fit_transform(data[["class"]]))
data_class
[Image by Author]

We get four columns here because the class column has four unique values.

Applying to the list of categorical variables:

data_cols = pd.DataFrame(ohe.fit_transform(data[cols]))
data_cols
[Image by Author]

Here the first two columns represent gender, the next four columns represent class, and the remaining two represent city.
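
Rather than tracking this order by hand, we can ask the encoder for the column names (a sketch; get_feature_names_out is available in scikit-learn >= 1.0, older versions expose get_feature_names instead):

# label the encoded columns with their category names, e.g. gender_Female
data_cols.columns = ohe.get_feature_names_out(cols)
data_cols.head()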

Scikit-learn Label Encoding


In label encoding, each category is assigned an integer value from 0 through N-1, where N is the number of categories for the feature (this is what scikit-learn's LabelEncoder produces). There is no relation or order implied by these assignments.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # LabelEncoder takes no arguments
le_class = le.fit_transform(data["class"])  # pass a 1-D column

Comparing with one-hot encoding:

[Image by Author]
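
The comparison in the figure can also be reconstructed directly (a minimal sketch using the objects defined above):

comparison = pd.DataFrame({
    'class': data['class'],
    'label_encoded': le_class
})
comparison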

Ordinal Encoding


Ordinal encoding's encoded variables retain the ordinal (ordered) nature of the variable. It looks similar to label encoding; the only difference is that label encoding doesn't consider whether a variable is ordinal or not, it simply assigns a sequence of integers.

Example: ordinal encoding will assign values as Very Good (1) < Good (2) < Bad (3) < Worse (4)

First, we need to assign the original order of the variable through a dictionary.

temp = {'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']}
df = pd.DataFrame(temp, columns=["temperature"])
temp_dict = {'very cold': 1, 'cold': 2, 'warm': 3, 'hot': 4, 'very hot': 5}
df
[Image by Author]

Then we can map each row for the variable as per the dictionary.

df["temp_ordinal"] = df.temperature.map(temp_dict) df
[Image by Author]
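
Scikit-learn can do the same with its OrdinalEncoder, given the explicit category order (a sketch; note it assigns 0-4 rather than the 1-5 used in the dictionary above, and temp_sk is just an illustrative column name):

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[['very cold', 'cold', 'warm', 'hot', 'very hot']])
df["temp_sk"] = oe.fit_transform(df[["temperature"]])
df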

Frequency Encoding


Each category is assigned a value according to the frequency with which it appears in the data.

data_freq = pd.DataFrame({'class': ['A', 'B', 'C', 'D', 'A', 'B', 'E', 'E', 'D', 'C', 'C', 'C', 'E', 'A', 'A']})

Grouping by the class column:

fe = data_freq.groupby("class").size()

Dividing by length:

fe_ = fe / len(data_freq)

Mapping and rounding off:

data_freq["data_fe"] = data_freq["class"].map(fe_).round(ii) data_freq
[Image by Author]
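
The same encoding can be produced in one step with pandas' value_counts (a sketch; data_fe2 is just an illustrative column name):

data_freq["data_fe2"] = data_freq["class"].map(
    data_freq["class"].value_counts(normalize=True).round(2)
)
data_freq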

In this article, we saw five types of encoding schemes. Similarly, there are 10 other types of encoding which we have not looked at:

  • Helmert Encoding
  • Mean Encoding
  • Weight of Evidence Encoding
  • Probability Ratio Encoding
  • Hashing Encoding
  • Backward Difference Encoding
  • Leave One Out Encoding
  • James-Stein Encoding
  • M-estimator Encoding
  • Thermometer Encoder

Which Encoding Method is Best?


There is no single method that works best for every problem or dataset. I personally think that the get_dummies method has an advantage in its ability to be implemented very easily.

If you want to read about all 15 types of encoding, here is a very good article to refer to.

Here is a cheat sheet on when to use what type of encoding:

[Image: Ref]

References:

  1. https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
  2. https://pypi.org/project/category-encoders/
  3. https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html


Bio: Shelvi Garg is a Data Scientist. Interests and learnings are not limited.

Related:

  • Animated Bar Chart Races in Python

Source: https://www.kdnuggets.com/2021/05/deal-with-categorical-data-machine-learning.html
