Welcome to part 6 of the tutorial series on building a custom document classifier with AWS Comprehend. In the previous tutorial we classified the test documents, that is, we predicted the class (label) for each test document. In this tutorial we are going to validate the classes predicted by our custom classifier.

Let’s get started. As the first step, we will download the results (the predicted classes for the test documents) from the S3 bucket.


Once the file is downloaded, we will extract the contents of the compressed file, which gives us a predictions.jsonl file. jsonl stands for JSON Lines: each line in the file is an independent JSON object.
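To make the JSON Lines idea concrete, here is a minimal sketch that parses a two-line sample one line at a time. The records are made up for illustration: Classes and Name are the fields read later in this post, while File, Line, Score and all the values are assumptions about the layout.

```python
import json

# Two made-up JSON Lines records; every line is a complete JSON document.
sample = (
    '{"File": "test.csv", "Line": "0", "Classes": '
    '[{"Name": "tech", "Score": 0.91}, {"Name": "business", "Score": 0.06}]}\n'
    '{"File": "test.csv", "Line": "1", "Classes": '
    '[{"Name": "politics", "Score": 0.88}, {"Name": "tech", "Score": 0.07}]}\n'
)

# Parse each non-empty line independently of the others.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(len(records))
print(records[0]["Classes"][0]["Name"])
```

Note that calling json.loads on the whole file at once would fail, because the file as a whole is not a single JSON document.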

In the next step we will jump to a Jupyter notebook, since we are going to write some code. Validation can be done in various ways; this is just the one I’m using.

Here, we are going to use two libraries: json and pandas. As the next step we will read the predictions.jsonl file. f.readlines() returns the list of lines in the file.

import json
import pandas as pd

# reading the JSON Lines file into a list of strings
with open('predictions.jsonl', 'r') as f:
    f = f.readlines()

We will loop through each line (which is a JSON string) and extract the name of the class with the maximum score, appending each one to the predictedLabels list.

# predicted label list
predictedLabels = []
# looping through json lines
for i in f:
    # parsing the JSON string and taking its list of classes
    j = json.loads(i)["Classes"]
    # taking the first class, assumed to be the one with the maximum score
    predictedLabels.append(j[0]['Name'])
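The loop above takes the first entry of Classes, which is only the top class if the list is sorted by score. A small helper can pick the highest-scoring class regardless of order. This is a sketch: the helper name top_class and the sample line are hypothetical, and a Score key is assumed to be present on every class entry.

```python
import json

def top_class(json_line):
    """Return the Name of the highest-scoring class in one JSON Lines record."""
    classes = json.loads(json_line)["Classes"]
    # max() compares entries by their Score, so list order does not matter.
    best = max(classes, key=lambda c: c["Score"])
    return best["Name"]

# Made-up record with the best class deliberately placed last.
line = '{"Classes": [{"Name": "sport", "Score": 0.10}, {"Name": "tech", "Score": 0.85}]}'
print(top_class(line))
```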

Now, we will read the test.csv document and add the predicted classes from predictedLabels as a new column.

# reading test document
df_test = pd.read_csv("test.csv", header=None)
# assigning header
df_test.columns = ["Document"]
# creating new column and mapping/assigning label
df_test["PredictedLabel"] = predictedLabels
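Assigning a list to a new column only works when the list has exactly one entry per row, and the pairing is purely by position. Here is a minimal sketch of a guard for that, using a made-up two-row frame in place of test.csv; the helper name attach_predictions is hypothetical.

```python
import pandas as pd

def attach_predictions(df, labels):
    """Add labels as a PredictedLabel column after checking the lengths match."""
    if len(labels) != len(df):
        raise ValueError(f"{len(labels)} predictions for {len(df)} documents")
    out = df.copy()  # leave the caller's frame untouched
    out["PredictedLabel"] = labels
    return out

docs = pd.DataFrame({"Document": ["first sample text", "second sample text"]})
result = attach_predictions(docs, ["tech", "politics"])
print(result)
```

The check only catches a wrong count; it cannot tell whether the predictions are in the same order as the rows of the test file.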

Here, we will read the test_truth.csv document, which contains the true mapping of each document to its correct label.

# reading test truth file
df_truth = pd.read_csv("test_truth.csv", header=None)
# assigning header
df_truth.columns = ["TruthLabel", "Document"]

Next, we merge both dataframes (df_test and df_truth) on the Document column. After merging, the result looks like this.

                                            Document PredictedLabel  \
0  Taxes must be trusted - Kennedy  Public trust ...       politics   
1  How to make a greener computer  The hi-tech in...           tech   
2  Gamers could drive high-definition  TV, films,...           tech   
3  Carry On star Patsy Rowlands dies  Actress Pat...  entertainment   
4  Libya takes $1bn in unfrozen funds  Libya has ...       business   

      TruthLabel  
0       politics  
1           tech  
2           tech  
3  entertainment  
4       business  
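The merge that produces the frame above is not listed in this post. One way to do it, sketched here with small made-up frames standing in for df_test and df_truth, is pandas.merge on the shared Document column. The name mergeDf matches the variable used in the comparison that follows.

```python
import pandas as pd

# Made-up stand-ins for the frames built earlier in the post.
df_test = pd.DataFrame({
    "Document": ["doc about chips", "doc about elections"],
    "PredictedLabel": ["tech", "politics"],
})
df_truth = pd.DataFrame({
    "TruthLabel": ["politics", "tech"],
    "Document": ["doc about elections", "doc about chips"],
})

# Inner join on Document: rows are matched by text, not by position.
mergeDf = pd.merge(df_test, df_truth, on="Document")
print(mergeDf)
```

Because this is an inner join, documents whose text differs between the two files are dropped, and duplicated documents produce extra rows, so it is worth comparing len(mergeDf) with len(df_test) afterwards.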

Now, we can easily compare PredictedLabel and TruthLabel to validate the predicted response.

mergeDf[mergeDf.PredictedLabel == mergeDf.TruthLabel].count()
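count() reports the number of matching rows for each column. To express that as an accuracy figure and to see which labels get confused with each other, something like the following could be used. The frame here is made up, and accuracy is taken simply as the share of rows where both labels agree.

```python
import pandas as pd

# Made-up merged frame with one wrong prediction out of four.
mergeDf = pd.DataFrame({
    "PredictedLabel": ["tech", "tech", "politics", "business"],
    "TruthLabel": ["tech", "business", "politics", "business"],
})

# Boolean Series: True where the prediction matches the truth.
matches = mergeDf.PredictedLabel == mergeDf.TruthLabel

correct = int(matches.sum())   # number of correct predictions
accuracy = matches.mean()      # fraction of correct predictions
print(correct, len(mergeDf), accuracy)

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(mergeDf.TruthLabel, mergeDf.PredictedLabel)
print(confusion)
```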

Here is the full Jupyter Notebook code
Well, that’s it for now. You can learn more from the mentioned video. And don’t forget to subscribe to the channel.


Until next time, keep sharing and stay tuned for more. Follow me on Twitter