Welcome to part 1 of the Custom document classifier with AWS Comprehend tutorial series. Custom Comprehend allows you to build NLP-based solutions without prior knowledge of machine learning. In this tutorial series we will train a Comprehend classifier on our own custom dataset, instead of using Comprehend's pre-defined capabilities.

In this tutorial we are going to download the dataset that we will use throughout the series. We are going to use the BBC news dataset, which consists of five categories (also called classes or targets).

In the video tutorial I used Jupyter Notebook for the demo. To get started, create a directory, run the jupyter notebook command in a terminal from that directory, and create a new notebook.

We will download the dataset using the wget command. wget is a command-line program that downloads content from web servers. To download the dataset, copy and paste the command below into a cell and execute it. The leading ! tells Jupyter to run the line as a shell command.

!wget http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip
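
If wget is not available in your environment (on Windows, for example), the same download can be sketched in plain Python. This is only a minimal sketch using the standard library; the helper name `download_file` and its skip-if-already-present behaviour are my own assumptions and are not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path

# Same URL as in the wget command above.
DATASET_URL = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"


def download_file(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists (hypothetical helper)."""
    dest_path = Path(dest)
    if dest_path.exists():
        # Assumption: an existing file is a complete earlier download.
        return dest_path
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Usage in a notebook cell:
# download_file(DATASET_URL, "bbc-fulltext.zip")
```

Skipping the download when the file is already there makes it safe to re-run the cell, but it also means an interrupted download has to be deleted by hand before trying again.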

After downloading the zip file, we will unzip it from the notebook.

!unzip bbc-fulltext.zip -d bbc
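
If you prefer to stay in Python, or unzip is not installed, the extraction can be sketched with the standard zipfile module, followed by a quick count of the text files in each folder as a sanity check. The function names `extract_archive` and `count_documents` are hypothetical helpers of mine. How the folders are nested inside the archive is also an assumption, so the count searches recursively rather than relying on a fixed path.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, dest_dir: str) -> None:
    """Extract every member of zip_path into dest_dir (created if missing)."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)


def count_documents(root: str) -> dict:
    """Count .txt files per parent folder name, searching root recursively."""
    counts = {}
    for txt_file in Path(root).rglob("*.txt"):
        # The folder that directly contains the file is treated as its label.
        label = txt_file.parent.name
        counts[label] = counts.get(label, 0) + 1
    return counts


# Usage in a notebook cell:
# extract_archive("bbc-fulltext.zip", "bbc")
# print(count_documents("bbc"))
```

Any other text file in the archive (a README, say) would show up under its own parent folder's name, so treat the printed counts as a rough check and not as exact figures.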

The unzip command extracts the content into the bbc directory. After unzipping, we can see that there are five classes (i.e. Politics, Business, Tech, Sport, Entertainment). Next, we have to prepare the training file for Comprehend’s custom document classifier. The training file will look something like this:

label1,document1
label2,document2
label3,document3
label2,document4
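
To make the row layout concrete, here is a minimal sketch that writes label and document pairs as a two-column CSV with no header row, using Python's csv module so that commas and quotes inside a document are escaped correctly. The helper name `write_training_rows`, the output file name, and the sample rows are made-up placeholders, and collapsing line breaks inside a document is my own assumption. Building the real training file from the BBC folders is left for the next part.

```python
import csv


def write_training_rows(rows, out_path):
    """Write (label, document) pairs as a two-column CSV without a header."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for label, document in rows:
            # Assumption: collapse whitespace and line breaks so that each
            # document occupies exactly one line of the file.
            writer.writerow([label, " ".join(document.split())])


# Usage with placeholder rows (not real BBC articles):
# write_training_rows(
#     [("tech", "Placeholder text, with a comma"), ("sport", "Another placeholder")],
#     "train_sample.csv",
# )
```

Using csv.writer instead of joining strings with a comma matters here because news text is full of commas, and a naive join would split one document across several columns.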

In the next tutorial, we are going to prepare the training file. Below is the video tutorial.

Until then, keep sharing and stay tuned for more. Follow me on Twitter.