Welcome to part 4 of the custom document classifier with AWS Comprehend tutorial series. In the previous tutorial, we successfully trained the classifier. In this tutorial, we are going to prepare the test documents for classification using our custom classifier.

Here, we are going to reuse the script that we wrote while creating the training document, so we will go ahead and modify it. If you remember, we used the first 300 text documents from each class for training. For the test set, we will use 10 documents from each class that do not belong to the training set. Let's get started with the code.
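To make the split concrete, here is a minimal sketch (with made-up, zero-padded file names standing in for the BBC data set) showing that slicing the sorted file list with `[:300]` and `[300:310]` yields disjoint sets:

```python
# made-up file names standing in for one class directory of the BBC data set
files = sorted(f"{n:03d}.txt" for n in range(1, 401))

train_files = files[:300]    # first 300 documents, used for training
test_files = files[300:310]  # the next 10 documents, used for testing

print(len(train_files), len(test_files))   # 300 10
print(set(train_files) & set(test_files))  # set() -> no overlap
```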

import os
import pandas as pd

# defining test dataframe
df = pd.DataFrame()
# defining test truth dataframe
df_test_truth = pd.DataFrame()
mapping = {}
source_path = "bbc/bbc/"

In the next bit of code, instead of taking the first 300 text documents, we will change the slice to documents 300 to 310.

# looping through bbc/bbc/ directory
for i in sorted(os.listdir(source_path)):
    # checking if it is directory or not
    if os.path.isdir(source_path+i):
        # mapping each class (key) to the 10 files that come after the first 300 (value)
        mapping[i] = sorted(os.listdir(source_path+i))[300:310]
# printing the mapping's keys and values individually
print(mapping.keys())
print(mapping.values())

The next code block remains the same as it was when preparing the training document.

# label or class or target list
label = []
# text file data list
data = []
# unpacking and iterating through dictionary
for i, j in mapping.items():
    # iterating through list of files for each class
    for k in j:
        # appending labels/class/target
        label.append(i)
        # reading the file and appending its contents to the data list
        with open(source_path+i+"/"+k, encoding="cp1252") as f:
            data.append(f.read().replace("\n", " "))
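The `replace("\n", " ")` step matters because Comprehend expects one document per CSV row, so any stray newline would split a document across rows. A small self-contained sketch of that read pattern, using a throwaway file rather than the BBC data:

```python
# write a throwaway two-line file to demonstrate the newline handling
with open("sample.txt", "w", encoding="cp1252") as f:
    f.write("First line\nSecond line\n")

# same read pattern as in the loop above: newlines become spaces,
# so the whole document fits on a single CSV row
with open("sample.txt", encoding="cp1252") as f:
    text = f.read().replace("\n", " ")

print(repr(text))  # 'First line Second line '
```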

Here, we are going to save two documents. The first is the test document itself, which contains only the text data and no labels. The second is the truth for that test document, which maps each text document to its label or class. We will use the test truth document for validation purposes.

# creating test dataframe and assigning data without label
df["document"] = data
# creating test truth dataframe and assigning data
df_test_truth["label"] = label
df_test_truth["document"] = data
print("Test data rows : ", df.shape[0])
print("Test truth data rows : ", df_test_truth.shape[0])

Now, we will save both files as CSV.

# saving test data as csv file without index and headers
df.to_csv("test.csv", index=False, header=False)
# saving test truth data as csv file without index and headers
df_test_truth.to_csv("test_truth.csv", index=False, header=False)
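As a quick sanity check, the saved files can be read back; since they were written without headers, `header=None` is needed. A minimal sketch with two hypothetical rows standing in for the real data:

```python
import pandas as pd

# two hypothetical rows standing in for the real test data
df = pd.DataFrame({"document": ["some text", "more text"]})
df_truth = pd.DataFrame({"label": ["business", "sport"],
                         "document": ["some text", "more text"]})

df.to_csv("test.csv", index=False, header=False)
df_truth.to_csv("test_truth.csv", index=False, header=False)

# header=None because the files were written without a header row
check = pd.read_csv("test.csv", header=None)
check_truth = pd.read_csv("test_truth.csv", header=None)
print(check.shape, check_truth.shape)  # (2, 1) (2, 2)
```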

Well, that's it for now. In the next tutorial, we will classify the test documents using the custom classifier.


Till then, keep sharing and stay tuned for more. Follow me on Twitter.