Welcome to part 6 of the tutorial series on building a custom document classifier with AWS Comprehend. In the previous tutorial we classified the test documents, that is, we predicted a class label for each of them. In this tutorial we are going to validate the labels predicted by our custom classifier.
Let’s get started. As the first step, we will download the results (the predicted classes for the test documents) from the S3 bucket.
Once the file is downloaded, we will extract the contents of the compressed file, which gives us a predictions.jsonl file. jsonl stands for JSON Lines: each line in the file is an independent JSON document.
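If you prefer to do the extraction from Python rather than by hand, it could look roughly like the sketch below. It assumes the downloaded file is a gzip-compressed tar archive; the helper name extract_predictions and the archive name output.tar.gz in the usage note are assumptions for illustration, so adjust them to match what you actually downloaded.

```python
import tarfile


def extract_predictions(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Names of the files inside the archive, e.g. predictions.jsonl
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

Calling extract_predictions("output.tar.gz") in the download folder should then leave predictions.jsonl next to the archive.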
In the next step we will jump to a Jupyter notebook, since we are going to write some code. Validation can be done in various ways, and this is just one of the ways I use.
Here, we are going to use two libraries: json and pandas. As the next step, we will read the predictions.jsonl file. f.readlines() returns a list of the lines in the file.
import json
import pandas as pd

# reading the JSON Lines file
with open('predictions.jsonl', 'r') as f:
    f = f.readlines()
We will loop through each line (each one is a JSON document) and extract the name of the class with the maximum score. Each class name is appended to the predictedLabels list.
# predicted label list
predictedLabels = []
# looping through json lines
for i in f:
    # parsing the JSON string and taking its list of classes
    j = json.loads(i)["Classes"]
    # fetching class with maximum score
    predictedLabels.append(max(j, key=lambda c: c["Score"])["Name"])
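To make the structure of a single line more concrete, here is a small sketch using a made-up record. The File, Line and Score fields and all the values are assumptions for illustration, and top_class is a hypothetical helper name; check the field names against your own predictions.jsonl before relying on them.

```python
import json

# Hypothetical record shaped like one line of predictions.jsonl (values invented)
sample_line = (
    '{"File": "test.csv", "Line": "0", "Classes": ['
    '{"Name": "tech", "Score": 0.91}, '
    '{"Name": "business", "Score": 0.06}, '
    '{"Name": "politics", "Score": 0.03}]}'
)


def top_class(json_line):
    """Return the Name of the highest-scoring entry in the record's Classes list."""
    classes = json.loads(json_line)["Classes"]
    return max(classes, key=lambda c: c["Score"])["Name"]


print(top_class(sample_line))  # prints: tech
```

Picking the entry with max() instead of a fixed position means the result does not depend on the order in which the classes are listed.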
Now, we will read the test.csv document and add the predicted labels from predictedLabels as a new column.
# reading test document
df_test = pd.read_csv("test.csv", header=None)
# assigning header
df_test.columns = ["Document"]
# creating new column and mapping/assigning label
df_test["PredictedLabel"] = predictedLabels
Here, we will read the test_truth.csv document, which contains the true mapping between each document and its correct label.
# reading test truth file
df_truth = pd.read_csv("test_truth.csv", header=None)
# assigning header
df_truth.columns = ["TruthLabel", "Document"]
Moving along, we merge both dataframes (i.e. df_test and df_truth) on the Document column. After merging, the result looks like this.
                                           Document  PredictedLabel     TruthLabel
0  Taxes must be trusted - Kennedy Public trust ...        politics       politics
1  How to make a greener computer The hi-tech in...            tech           tech
2  Gamers could drive high-definition TV, films,...            tech           tech
3  Carry On star Patsy Rowlands dies Actress Pat...   entertainment  entertainment
4  Libya takes $1bn in unfrozen funds Libya has ...        business       business
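The merge call itself is not shown above. A minimal sketch of how it might be written with pd.merge follows; the two small frames and their rows are invented stand-ins for df_test and df_truth, and only the column names and the mergeDf name come from this tutorial.

```python
import pandas as pd

# Invented stand-ins with the same columns as df_test and df_truth
sample_test = pd.DataFrame({
    "Document": ["doc a", "doc b", "doc c"],
    "PredictedLabel": ["tech", "politics", "business"],
})
sample_truth = pd.DataFrame({
    "TruthLabel": ["tech", "politics", "sport"],
    "Document": ["doc a", "doc b", "doc c"],
})

# Inner join on the shared Document column
mergeDf = pd.merge(sample_test, sample_truth, on="Document")
print(mergeDf.head())
```

In the notebook you would pass df_test and df_truth in place of the two sample frames.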
Now, we can easily compare PredictedLabel and TruthLabel to validate the predicted response.
mergeDf[mergeDf.PredictedLabel == mergeDf.TruthLabel].count()
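The count above gives the number of rows where both labels agree. If you would rather see a single accuracy figure and a breakdown per class, one possible sketch is shown below. The helper label_accuracy and the sample rows are hypothetical; the column names match the merged dataframe from this tutorial.

```python
import pandas as pd


def label_accuracy(df):
    """Fraction of rows where PredictedLabel equals TruthLabel."""
    return (df["PredictedLabel"] == df["TruthLabel"]).mean()


# Invented rows standing in for the merged dataframe
sample = pd.DataFrame({
    "PredictedLabel": ["tech", "politics", "business", "tech"],
    "TruthLabel": ["tech", "politics", "sport", "tech"],
})

accuracy = label_accuracy(sample)
# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(sample["TruthLabel"], sample["PredictedLabel"])
print(accuracy)
print(confusion)
```

With the real data, label_accuracy(mergeDf) gives the share of test documents that were classified correctly, and the crosstab shows which classes get confused with each other.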
Until next time, keep sharing and stay tuned for more.