#9 - Separating Numerical, Categorical, and Textual Data
Data separation based on its type
For every tabular data, we know that the data is stored in the form of a matrix (rows and columns). It is obvious that every column may not be of the same data type. We tend to have data of different types. Some column values may be integers (numerical), real-valued (float), and categorical (string). The main agenda of this article is to explore some unknown methods which can effectively work for separating the columns based on their type.
Unfortunately, if in the dataset there are both categorical and textual data,
Pandas will consider both as categorical data. But, with a simple hack, we can differentiate that easily.
If we have categorical data like
['short', 'tall', 'average'], we can still extract the information even though each value is of string type. We can plot a pie chart or bar chart to visualize it.
But, if we have textual data like
[ 'This is the first document.', 'This document is the second document.'], we cannot do much for this although the type of each value is a string, we know that it is actually a sentence.
Pandas, we will see how to solve this separation task programmatically.
import numpy as np import pandas as pd import random
For this example, we will create our own (dummy) data set.
real_values = np.random.rand(10, 3) num_cate = np.array([random.choice([1, 2, 3]) for i in range(10)]) str_cate = np.array([random.choice(['short', 'tall', 'average']) for i in range(10)]) # ----- df = pd.DataFrame(data=real_values, columns=['col_1', 'col_2', 'col_3']) df['col_4'] = num_cate df['col_5'] = str_cate
The first five rows of
df look like -
We can check the type of each column by -
df.dtypes ''' col_1 float64 col_2 float64 col_3 float64 col_4 int32 col_5 object '''
We can see that
col_5 is of type
object which simply represents that it is different from
float. To take it to the next level, we can have a function that can give the columns that are numerical and string separately.
def split_categories_numericals(dframe): cols = list(dframe.columns) num_cols = list(dframe._get_numeric_data().columns) cate_cols = list(set(cols) - set(num_cols)) return cate_cols, num_cols
cate_cols, num_cols = split_categories_numericals(dframe=df) print(cate_cols) # ['col_5'] print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
split_categories_numericals() works fine for the above data set. Let's add one more column that has textual sentences as values.
sents = [ 'This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?', 'Hello, how are you doing?', 'Did you go to the party yesterday?' ] text_data = [random.choice(sents) for i in range(10)] df['col_6'] = text_data
Now, the first five rows of the
df look like -
|0.902552||0.212317||0.208628||2||short||This document is the second document.|
|0.308503||0.959903||0.631784||3||tall||Is this the first document?|
|0.459789||0.199742||0.176000||2||tall||Hello, how are you doing?|
|0.760787||0.167179||0.137287||2||average||Did you go to the party yesterday?|
|0.199737||0.120213||0.145068||2||average||Hello, how are you doing?|
If we check the
df, we get -
df.dtypes ''' col_1 float64 col_2 float64 col_3 float64 col_4 int32 col_5 object col_6 object '''
We can see
col_6 have the same type, i.e.,
object. It is because they have string values. But,
col_5 - we can get a pie chart, compute the likelihood, etc. Whereas
col_6, since it has textual data, we need to convert it into vectors (numerical vectors) and then analyze it.
Now, the ultimate question is how can we know if a data set has textual data (sentences) as values. If we try to apply
split_categories_numericals(), we get -
cate_cols, num_cols = split_categories_numericals(dframe=df) print(cate_cols) # ['col_5', 'col_6'] print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
It considers both
col_6 as categorical.
Is there any programming hack to automatically differentiate categorical data and textual data?
The answer is Yes. We can very well achieve that with the following function.
def split_text_categories_numericals(dframe): cols = list(dframe.columns) num_cols = list(dframe._get_numeric_data().columns) cate_cols = list(set(cols) - set(num_cols)) text_cols = ; category_cols =  for ccol in cate_cols: each_col_list = dframe[ccol].str.split(' ').to_list() col_val_len_arr = np.array(list(map(len, each_col_list))) if np.any(col_val_len_arr > 1): text_cols.append(ccol) else: category_cols.append(ccol) return num_cols, category_cols, text_cols
num_cols, cate_cols, text_cols = split_text_categories_numericals(dframe=df) print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4'] print(cate_cols) # ['col_5'] print(text_cols) # ['col_6']
We have separated the columns with a simple hack now.
If in the data set, there are columns of different types, and if you want to do text processing, using the above function you can exactly get the columns that have textual data (sentences).
By extracting textual columns, you can easily process it and get the most out of it.
split_text_categories_numericals()only works for the preprocessed and cleaned data.
It might not work efficiently if the function is applied directly to the original data.