Welcome to part 2 of the custom document classifier with AWS Comprehend tutorial series. In the previous tutorial we downloaded the dataset. In this tutorial we are going to prepare the training file to feed into the custom Comprehend classifier. Note that Amazon Comprehend custom classification supports up to 1 million examples containing up to 1000 unique classes.

Let’s get started and write some code in the Jupyter notebook that we created in the previous tutorial. Here, we are going to use two libraries: os and pandas.

# importing libraries
import os
import pandas as pd

# defining dataframe
df = pd.DataFrame()
mapping = {}
source_path = "bbc/bbc/"

Now, we will loop through the contents of the source_path directory and create a mapping from each class (label) name to the first 300 files, in sorted order, of that class directory.

# looping through bbc/bbc/ directory
for i in sorted(os.listdir(source_path)):
    # checking if it is directory or not
    if os.path.isdir(source_path+i):
        # creating the dictionary with class as key and first 300 files as value
        mapping[i] = sorted(os.listdir(source_path+i))[:300]
# printing the keys and values of the mapping individually
print(mapping.keys())
print(mapping.values())
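
If you want to try the mapping step without the real dataset, here is a minimal sketch that wraps the same logic in a function and runs it on a small throwaway directory. The function name build_mapping, the limit parameter and the sample folder and file names are made up for illustration; they are not part of the tutorial code.

```python
import os
import tempfile


def build_mapping(source_path, limit=300):
    """Map each class sub-directory to its first `limit` file names (sorted)."""
    mapping = {}
    for name in sorted(os.listdir(source_path)):
        class_dir = os.path.join(source_path, name)
        # skip plain files sitting next to the class folders
        if os.path.isdir(class_dir):
            mapping[name] = sorted(os.listdir(class_dir))[:limit]
    return mapping


# tiny throwaway directory tree standing in for the dataset (made-up contents)
with tempfile.TemporaryDirectory() as root:
    for cls, n_files in [("sport", 3), ("tech", 2)]:
        os.makedirs(os.path.join(root, cls))
        for n in range(1, n_files + 1):
            with open(os.path.join(root, cls, f"{n:03d}.txt"), "w") as fh:
                fh.write("sample text")
    # a stray top-level file, which should be ignored
    with open(os.path.join(root, "notes.txt"), "w") as fh:
        fh.write("not a class")

    demo_mapping = build_mapping(root, limit=2)

# number of files kept per class
demo_counts = {cls: len(files) for cls, files in demo_mapping.items()}
print(demo_counts)
```

Printing the count per class is a quick way to confirm that every class ended up with the number of files you expected.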

Next, we unpack and loop through the mapping dictionary. Here, we append each label to the label list and the content of the corresponding file to the data list.

# label or class or target list
label = []
# text file data list
data = []
# unpacking and iterating through dictionary
for i, j in mapping.items():
    # iterating through list of files for each class
    for k in j:
        # appending labels/class/target
        label.append(i)
        # reading the file (the with block closes it afterwards) and appending to data list
        with open(source_path+i+"/"+k, encoding="cp1252") as f:
            data.append(f.read().replace("\n", " "))
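
The file reading step can also be pulled out into a small helper. This is only a sketch: read_document is a hypothetical name, and it is a slight variant of the code above because it collapses every run of whitespace (not just newlines) into a single space.

```python
import os
import tempfile


def read_document(path, encoding="cp1252"):
    """Read a text file and return its content as a single line of text."""
    with open(path, encoding=encoding) as fh:
        text = fh.read()
    # split() with no arguments breaks on any run of whitespace, including newlines
    return " ".join(text.split())


# made-up sample file standing in for one article of the dataset
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "001.txt")
    with open(sample_path, "w", encoding="cp1252") as fh:
        fh.write("Headline\n\nFirst paragraph.\nSecond  paragraph.\n")
    flat = read_document(sample_path)

print(flat)
```

With a helper like this, the inner loop body would reduce to a single data.append(read_document(...)) call.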

Now, we will create two columns (label and document) in the dataframe and assign the data to each column.

# creating column in dataframe and assigning data
df["label"] = label
df["document"] = data

Next, we will shuffle the data, although this is not mandatory. Finally, we will save the data as train.csv without index and headers.

# shuffling the data/rows and dropping index
df = df.sample(frac=1).reset_index(drop=True)

# saving it as csv file without index and headers
df.to_csv("train.csv", index=False, header=False)
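
Before moving on, it can be worth reading the file back to check that it has exactly two columns and no empty cells. The sketch below does this on a few made-up rows written to a temporary folder; the sample data and the variable names are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# made-up rows standing in for the real label/document data
sample = pd.DataFrame(
    {
        "label": ["sport", "tech", "sport"],
        "document": ["Team wins, again", 'He said "hello"', "Match report"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    sample.to_csv(csv_path, index=False, header=False)

    # header=None because the file was written without a header row
    check = pd.read_csv(csv_path, header=None, names=["label", "document"])

# every row should have both a label and a document
assert check.notna().all().all()
assert check.shape == sample.shape

# number of documents per label
label_counts = check["label"].value_counts().to_dict()
print(label_counts)
```

To run the same check on the real file, point pd.read_csv at train.csv instead of the temporary path.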

Here is the full Jupyter notebook code.
Well, that's it for now. In the next tutorial we are going to train the Comprehend custom document classifier. Below is the video tutorial.

Until then, keep sharing and stay tuned for more. Follow me on Twitter.