#9 - Separating Numerical, Categorical, and Textual Data
Data separation based on its type
Table of contents
Introduction
For every tabular data, we know that the data is stored in the form of a matrix (rows and columns). It is obvious that every column may not be of the same data type. We tend to have data of different types. Some column values may be integers (numerical), real-valued (float), and categorical (string). The main agenda of this article is to explore some unknown methods which can effectively work for separating the columns based on their type.
Unfortunately, if in the dataset there are both categorical and textual data, Pandas
will consider both as categorical data. But, with a simple hack, we can differentiate that easily.
If we have categorical data like
['short', 'tall', 'average']
, we can still extract the information even though each value is of string type. We can plot a pie chart or bar chart to visualize it.But, if we have textual data like
[ 'This is the first document.', 'This document is the second document.']
, we cannot do much for this although the type of each value is a string, we know that it is actually a sentence.
With Pandas
, we will see how to solve this separation task programmatically.
Credits of Cover Image - Photo by Will Francis on Unsplash
import
Packages
import numpy as np
import pandas as pd
import random
Scenario 1
For this example, we will create our own (dummy) data set.
real_values = np.random.rand(10, 3)
num_cate = np.array([random.choice([1, 2, 3]) for i in range(10)])
str_cate = np.array([random.choice(['short', 'tall', 'average']) for i in range(10)])
# -----
df = pd.DataFrame(data=real_values, columns=['col_1', 'col_2', 'col_3'])
df['col_4'] = num_cate
df['col_5'] = str_cate
The first five rows of df
look like -
col_1 | col_2 | col_3 | col_4 | col_5 |
0.902552 | 0.212317 | 0.208628 | 2 | short |
0.308503 | 0.959903 | 0.631784 | 3 | tall |
0.459789 | 0.199742 | 0.176000 | 2 | tall |
0.760787 | 0.167179 | 0.137287 | 2 | average |
0.199737 | 0.120213 | 0.145068 | 2 | average |
We can check the type of each column by -
df.dtypes
'''
col_1 float64
col_2 float64
col_3 float64
col_4 int32
col_5 object
'''
We can see that col_5
is of type object
which simply represents that it is different from int
and float
. To take it to the next level, we can have a function that can give the columns that are numerical and string separately.
def split_categories_numericals(dframe):
cols = list(dframe.columns)
num_cols = list(dframe._get_numeric_data().columns)
cate_cols = list(set(cols) - set(num_cols))
return cate_cols, num_cols
cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
Scenario 2
The function split_categories_numericals()
works fine for the above data set. Let's add one more column that has textual sentences as values.
sents = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
'Hello, how are you doing?',
'Did you go to the party yesterday?'
]
text_data = [random.choice(sents) for i in range(10)]
df['col_6'] = text_data
Now, the first five rows of the df
look like -
col_1 | col_2 | col_3 | col_4 | col_5 | col_6 |
0.902552 | 0.212317 | 0.208628 | 2 | short | This document is the second document. |
0.308503 | 0.959903 | 0.631784 | 3 | tall | Is this the first document? |
0.459789 | 0.199742 | 0.176000 | 2 | tall | Hello, how are you doing? |
0.760787 | 0.167179 | 0.137287 | 2 | average | Did you go to the party yesterday? |
0.199737 | 0.120213 | 0.145068 | 2 | average | Hello, how are you doing? |
If we check the dtypes
of df
, we get -
df.dtypes
'''
col_1 float64
col_2 float64
col_3 float64
col_4 int32
col_5 object
col_6 object
'''
We can see col_5
and col_6
have the same type, i.e., object
. It is because they have string values. But, col_5
- we can get a pie chart, compute the likelihood, etc. Whereas col_6
, since it has textual data, we need to convert it into vectors (numerical vectors) and then analyze it.
Now, the ultimate question is how can we know if a data set has textual data (sentences) as values. If we try to apply split_categories_numericals()
, we get -
cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5', 'col_6']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
It considers both col_5
and col_6
as categorical.
Is there any programming hack to automatically differentiate categorical data and textual data?
The answer is Yes. We can very well achieve that with the following function.
def split_text_categories_numericals(dframe):
cols = list(dframe.columns)
num_cols = list(dframe._get_numeric_data().columns)
cate_cols = list(set(cols) - set(num_cols))
text_cols = []; category_cols = []
for ccol in cate_cols:
each_col_list = dframe[ccol].str.split(' ').to_list()
col_val_len_arr = np.array(list(map(len, each_col_list)))
if np.any(col_val_len_arr > 1):
text_cols.append(ccol)
else:
category_cols.append(ccol)
return num_cols, category_cols, text_cols
num_cols, cate_cols, text_cols = split_text_categories_numericals(dframe=df)
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
print(cate_cols) # ['col_5']
print(text_cols) # ['col_6']
We have separated the columns with a simple hack now.
Use Cases
If in the data set, there are columns of different types, and if you want to do text processing, using the above function you can exactly get the columns that have textual data (sentences).
By extracting textual columns, you can easily process it and get the most out of it.
Challenges
The function
split_text_categories_numericals()
only works for the preprocessed and cleaned data.It might not work efficiently if the function is applied directly to the original data.
End