Introduction:
Every dataset has two types of variables: continuous (numerical) and categorical. Regression-based algorithms use both continuous and categorical features to build models, but in most ML libraries you cannot fit categorical variables into a regression equation in their raw form. Simply dropping them from the model usually costs accuracy, so it is crucial to learn the methods for dealing with such variables. Many machine learning libraries handle categorical variables in different ways, and how to transform and use them efficiently in model training varies with multiple conditions, including the algorithm being used and the relationship between the response variable and the categorical variable(s). Here I take the opportunity to demonstrate the methods available in Spark's popular machine learning library, MLlib, for handling categorical variables.
Challenges with categorical variables:
* A categorical variable may have too many levels, which hurts model performance; for example, in rent prediction a zip code field has numerous levels.
* A categorical variable may have levels that occur rarely. Many of these levels have minimal chance of making a real impact on model fit.
* One level may dominate, i.e. for most of the observations in the data set there is only one level. Variables with such levels fail to make a positive impact on model performance because of their very low variation.
* Categorical variables cannot be fit into a regression equation in their raw form.
* Most algorithms (or ML libraries) produce better results with numerical variables.
Different approaches available in Spark ML:
Below are three methods that are generally used to deal with categorical variables in Spark's MLlib library.
1. StringIndexer: StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequency, so the most frequent label gets index 0. If the input column is numeric, it is cast to string and the string values are indexed. When a downstream pipeline component uses the string-indexed label, you must set that component's input column to the string-indexed column name; in many cases you can do so with setInputCol. Unseen labels are controlled by the handleInvalid parameter, which can be set to “error” or “skip”: with “error” an exception is thrown, while with “skip” the rows containing unseen labels are skipped.
Examples: Assume that we have the following DataFrame with columns id and gender.
id | gender
---|-------
0 | M
1 | F
2 | F
3 | M
4 | M
5 | M
Gender is a string column with two labels: “M” and “F”. Applying StringIndexer with gender as the input column and genderIndex as the output column, we should get the following:
id | gender | genderIndex
---|--------|------------
0 | M | 0.0
1 | F | 1.0
2 | F | 1.0
3 | M | 0.0
4 | M | 0.0
5 | M | 0.0
“M” gets index 0 because it is the most frequent, followed by “F” with index 1.
```python
from pyspark.ml.feature import StringIndexer

df = sqlContext.createDataFrame(
    [(0, "M"), (1, "F"), (2, "F"), (3, "M"), (4, "M"), (5, "M")],
    ["id", "gender"])
indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()
```
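To illustrate the handleInvalid behaviour described above, here is a minimal sketch; the train/test DataFrames and the unseen label "O" are made up purely for illustration:

```python
from pyspark.ml.feature import StringIndexer

train = sqlContext.createDataFrame([(0, "M"), (1, "F")], ["id", "gender"])
test = sqlContext.createDataFrame([(2, "F"), (3, "O")], ["id", "gender"])  # "O" is never seen during fit

# handleInvalid="skip" drops rows with unseen labels; "error" (the default) would raise instead
indexer = StringIndexer(inputCol="gender", outputCol="genderIndex",
                        handleInvalid="skip")
model = indexer.fit(train)
model.transform(test).show()  # the row with the unseen label "O" is skipped
```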
2. One-hot Encoding: One-hot encoding maps a column of label indices to a column of binary vectors with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features. For example, with 5 categories an input value of 2.0 maps to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because including it would make the vector entries sum to one and hence be linearly dependent, so an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
```python
from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame(
    [(0, "M"), (1, "F"), (2, "F"), (3, "M"), (4, "M"), (5, "M")],
    ["id", "gender"])
stringIndexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(dropLast=False, inputCol="genderIndex", outputCol="genderVec")
encoded = encoder.transform(indexed)
encoded.select("id", "genderVec").show()
```
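To show the last step the paragraph above alludes to, feeding a one-hot vector into a model such as Logistic Regression, here is a minimal sketch building on the encoded DataFrame; the label column is fabricated only so the example is self-contained and runnable, and it is not part of the original example:

```python
from pyspark.sql.functions import when, col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Hypothetical binary label, added only to make this sketch runnable end to end.
labeled = encoded.withColumn("label", when(col("gender") == "F", 1.0).otherwise(0.0))

# Assemble the one-hot vector (plus any numeric columns you have) into one feature vector.
assembler = VectorAssembler(inputCols=["genderVec"], outputCol="features")
assembled = assembler.transform(labeled)

# Logistic Regression consumes the assembled feature vector like any continuous features.
lr = LogisticRegression(featuresCol="features", labelCol="label")
lrModel = lr.fit(assembled)
```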
3. VectorIndexer: VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:
* Takes an input column of type Vector and a parameter maxCategories.
* Decides which features should be categorical based on the number of distinct values: features with at most maxCategories distinct values are declared categorical.
* Computes 0-based category indices for each categorical feature.
* Indexes the categorical features and transforms the original feature values to indices.
Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
```python
from pyspark.ml.feature import VectorIndexer

data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=10)
indexerModel = indexer.fit(data)

# Create new column "indexed" with categorical values transformed to indices
indexedData = indexerModel.transform(data)
indexedData.show()
```
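To see why this matters for tree-based models, here is a minimal sketch, not part of the original example, that feeds the indexed features from above into a decision tree; the libsvm file already provides a label column, and the tree uses the category metadata attached by VectorIndexer to split categorical features appropriately:

```python
from pyspark.ml.classification import DecisionTreeClassifier

# Train on the indexed feature column produced by VectorIndexer above.
dt = DecisionTreeClassifier(featuresCol="indexed", labelCol="label")
dtModel = dt.fit(indexedData)
print(dtModel.toDebugString)  # inspect how categorical features were split
```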
Example:
I have worked on an example to show how to apply these concepts. The example is about predicting car price based on a variety of characteristics such as mileage, make, model, engine size, interior style, and cruise control. The dataset is available here: http://ww2.amstat.org/publications/jse/v16n3/datasets.kuiper.html
The dataset contains the following variables:
Categorical: Make, Model, Trim, Type
Continuous: Price, Mileage, Cylinder, Liter, Doors, Cruise, Sound, Leather
The dataset is divided into a 70:30 ratio for training and testing. StringIndexer is fitted on the union of the train and test data frames, so you are assured that all labels are covered; a sketch of this setup follows below. After that are the results for each approach tried. The implementation of all the above concepts has been compiled and put together at https://github.com/rukamesh/CarPricePrediction.git
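The following is a minimal sketch of that setup, not the repository's exact code; the DataFrame name cars and the exact wiring are assumptions made only to illustrate the approach:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import LinearRegression

# "cars" is a hypothetical DataFrame loaded from the Kuiper car dataset.
train, test = cars.randomSplit([0.7, 0.3], seed=42)
full = train.unionAll(test)  # fit indexers on the union so every label is seen

indexed_train, indexed_test = train, test
for c in ["Make", "Model", "Trim", "Type"]:
    indexer = StringIndexer(inputCol=c, outputCol=c + "Index").fit(full)
    encoder = OneHotEncoder(inputCol=c + "Index", outputCol=c + "Vec")
    indexed_train = encoder.transform(indexer.transform(indexed_train))
    indexed_test = encoder.transform(indexer.transform(indexed_test))

# Combine the continuous columns with the one-hot encoded categorical columns.
assembler = VectorAssembler(
    inputCols=["Mileage", "Cylinder", "Liter", "Doors", "Cruise", "Sound", "Leather",
               "MakeVec", "ModelVec", "TrimVec", "TypeVec"],
    outputCol="features")

lr = LinearRegression(featuresCol="features", labelCol="Price")
lrModel = lr.fit(assembler.transform(indexed_train))
predictions = lrModel.transform(assembler.transform(indexed_test))
```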
Results without Categorical Variables:
Root Mean Square Error | Standard Deviation | Error%(0-1) | Error%(1-2) | Error%(2-3) | Error%(3-4) | Error%(4-5) | Error%(5-10) | Error%(>10)
-----------------------|--------------------|-------------|-------------|-------------|-------------|-------------|--------------|------------
3843.47 | 11.13 | 7.87 | 9.44 | 3.93 | 10.23 | 7.87 | 24.80 | 35.81
Results with Categorical Variables:
Root Mean Square Error | Standard Deviation | Error%(0-1) | Error%(1-2) | Error%(2-3) | Error%(3-4) | Error%(4-5) | Error%(5-10) | Error%(>10)
-----------------------|--------------------|-------------|-------------|-------------|-------------|-------------|--------------|------------
952.06 | 2.51 | 20.07 | 16.14 | 18.11 | 16.53 | 12.20 | 15.35 | 1.57
Formulas Used:
%Error = (data$prediction – data$Price)*100/data$Price
RMSE = sqrt(mean((data$prediction – data$Price)^2))
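For reference, RMSE can also be computed directly in Spark. This is a small sketch assuming a predictions DataFrame with the true value in Price and the model output in prediction; it is not necessarily how the linked repository computes it:

```python
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the predictions DataFrame produced by the fitted model.
evaluator = RegressionEvaluator(labelCol="Price", predictionCol="prediction",
                                metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Square Error = %g" % rmse)
```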