# #9 - Separating Numerical, Categorical, and Textual Data

## Data separation based on its type

Jul 21, 2021·

### Introduction

For every tabular data, we know that the data is stored in the form of a matrix (rows and columns). It is obvious that every column may not be of the same data type. We tend to have data of different types. Some column values may be integers (numerical), real-valued (float), and categorical (string). The main agenda of this article is to explore some unknown methods which can effectively work for separating the columns based on their type.

Unfortunately, if in the dataset there are both categorical and textual data, `Pandas` will consider both as categorical data. But, with a simple hack, we can differentiate that easily.

• If we have categorical data like `['short', 'tall', 'average']`, we can still extract the information even though each value is of string type. We can plot a pie chart or bar chart to visualize it.

• But, if we have textual data like ```[ 'This is the first document.', 'This document is the second document.']```, we cannot do much for this although the type of each value is a string, we know that it is actually a sentence.

With `Pandas`, we will see how to solve this separation task programmatically.

Credits of Cover Image - Photo by Will Francis on Unsplash

### `import` Packages

``````import numpy as np
import pandas as pd
import random
``````

### Scenario 1

For this example, we will create our own (dummy) data set.

``````real_values = np.random.rand(10, 3)
num_cate = np.array([random.choice([1, 2, 3]) for i in range(10)])
str_cate = np.array([random.choice(['short', 'tall', 'average']) for i in range(10)])
# -----
df = pd.DataFrame(data=real_values, columns=['col_1', 'col_2', 'col_3'])
df['col_4'] = num_cate
df['col_5'] = str_cate
``````

The first five rows of `df` look like -

 col_1 col_2 col_3 col_4 col_5 0.902552 0.212317 0.208628 2 short 0.308503 0.959903 0.631784 3 tall 0.459789 0.199742 0.176000 2 tall 0.760787 0.167179 0.137287 2 average 0.199737 0.120213 0.145068 2 average

We can check the type of each column by -

``````df.dtypes

'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
'''
``````

We can see that `col_5` is of type `object` which simply represents that it is different from `int` and `float`. To take it to the next level, we can have a function that can give the columns that are numerical and string separately.

``````def split_categories_numericals(dframe):
cols = list(dframe.columns)
num_cols = list(dframe._get_numeric_data().columns)
cate_cols = list(set(cols) - set(num_cols))
return cate_cols, num_cols
``````
``````cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
``````

### Scenario 2

The function `split_categories_numericals()` works fine for the above data set. Let's add one more column that has textual sentences as values.

``````sents = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
'Hello, how are you doing?',
'Did you go to the party yesterday?'
]
text_data = [random.choice(sents) for i in range(10)]
df['col_6'] = text_data
``````

Now, the first five rows of the `df` look like -

 col_1 col_2 col_3 col_4 col_5 col_6 0.902552 0.212317 0.208628 2 short This document is the second document. 0.308503 0.959903 0.631784 3 tall Is this the first document? 0.459789 0.199742 0.176000 2 tall Hello, how are you doing? 0.760787 0.167179 0.137287 2 average Did you go to the party yesterday? 0.199737 0.120213 0.145068 2 average Hello, how are you doing?

If we check the `dtypes` of `df`, we get -

``````df.dtypes

'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
col_6     object
'''
``````

We can see `col_5` and `col_6` have the same type, i.e., `object`. It is because they have string values. But, `col_5` - we can get a pie chart, compute the likelihood, etc. Whereas `col_6`, since it has textual data, we need to convert it into vectors (numerical vectors) and then analyze it.

Now, the ultimate question is how can we know if a data set has textual data (sentences) as values. If we try to apply `split_categories_numericals()`, we get -

``````cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5', 'col_6']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
``````

It considers both `col_5` and `col_6` as categorical.

Is there any programming hack to automatically differentiate categorical data and textual data?

The answer is Yes. We can very well achieve that with the following function.

``````def split_text_categories_numericals(dframe):
cols = list(dframe.columns)

num_cols = list(dframe._get_numeric_data().columns)
cate_cols = list(set(cols) - set(num_cols))

text_cols = []; category_cols = []
for ccol in cate_cols:
each_col_list = dframe[ccol].str.split(' ').to_list()
col_val_len_arr = np.array(list(map(len, each_col_list)))

if np.any(col_val_len_arr > 1):
text_cols.append(ccol)
else:
category_cols.append(ccol)

return num_cols, category_cols, text_cols
``````
``````num_cols, cate_cols, text_cols = split_text_categories_numericals(dframe=df)
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
print(cate_cols) # ['col_5']
print(text_cols) # ['col_6']
``````

We have separated the columns with a simple hack now.

### Use Cases

• If in the data set, there are columns of different types, and if you want to do text processing, using the above function you can exactly get the columns that have textual data (sentences).

• By extracting textual columns, you can easily process it and get the most out of it.

### Challenges

• The function `split_text_categories_numericals()` only works for the preprocessed and cleaned data.

• It might not work efficiently if the function is applied directly to the original data.

End