In this post we will learn how to create ML-ready datasets. Creating clean, precise datasets that Machine Learning (ML) algorithms can easily ingest is a non-trivial task. It is relevant to any field - whether you work with text (Natural Language Processing), images and video (Computer Vision), or audio (Speech Recognition).
In this post we will take a sample scenario of tagging product descriptions into one of the predefined categories.
One of the first things that needs to be defined is: what are we looking to capture when we tag our data? These guidelines need to be written down explicitly and read by everyone who is tagging the data. This may seem unnecessary and obvious, but it is worth taking an hour to record them. They are especially helpful when data taggers disagree and the disagreement needs to be resolved.
Annotation guidelines are created so we can be explicit about the rules for tagging the data.
In our case, we need to define the right way to classify product descriptions into one of the predefined product categories:
- Computers
- Mobile

Here is a short version of our annotation guidelines:
- Computers: These should include all laptops, desktops, and related hardware and accessories - such as printers, scanners, etc.
- Mobile: This includes mobile phones, tablets, and related accessories - such as headsets.
Writing annotation guidelines is a good exercise to get everyone on the same page. The next step is to actually tag the data with the correct tag. This task is called “annotating” the data. A simple way of tagging text data is using a spreadsheet and assigning a label to each text field.
It needs to be decided upfront whether a given product can have one label or more than one label. If only one label is assigned to a given product, it is important to pick the right one. E.g. a tablet could potentially belong to the category “Computers” or “Mobile”, but our annotation guidelines explicitly specify that tablets go under “Mobile” and not “Computers” - this helps us break the tie when there are 2 class contenders for the same item.
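One way to make such tie-breaking rules executable is to encode them alongside the data. Here is a minimal sketch - `GUIDELINE_OVERRIDES` and `resolve_label` are hypothetical names, and the keyword match is deliberately naive:

```python
# Hypothetical tie-breaking helper based on the annotation guidelines:
# when an item matches several categories, the guidelines decide the winner.
GUIDELINE_OVERRIDES = {"tablet": "Mobile"}  # keyword -> category, per guidelines

def resolve_label(description: str, candidate_labels: list[str]) -> str:
    """Pick a single label, applying guideline overrides for known ties."""
    text = description.lower()
    for keyword, category in GUIDELINE_OVERRIDES.items():
        if keyword in text and category in candidate_labels:
            return category
    return candidate_labels[0]  # default: keep the first candidate

print(resolve_label("10-inch tablet with stylus", ["Computers", "Mobile"]))  # → Mobile
```

Keeping these overrides in one place means that when the guidelines change, the tie-breaking behavior changes with them.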
Validity and Reliability of human tagged data
We are interested in finding the validity of human-tagged data - however, since there is no ground truth data to compare it to, there is no way to measure the correctness of the data directly.
An alternative approach to measuring the reliability of data is to have 2 annotators tag the same dataset and measure the agreement between them. This is called measuring the Inter-Annotator Agreement.
In the following subsections we will discuss both cases of measuring goodness of data -
- when ground truth (gold standard) data is provided
- when ground truth data is not provided.
Ground truth data is provided: Measuring Precision and Recall
When the ground truth data is available (usually for objective tasks, where there is an indisputable correct answer), precision and recall are the standard ways to measure the quality of the annotated texts.
Recall: Recall measures the coverage of the found annotations. Recall is the ratio of the number of correctly found annotations to the number of expected annotations.
Precision: Precision measures the quality of the found annotations. Precision is the ratio of the number of correct annotations found to the total number of annotations found.
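The two definitions above can be sketched in a few lines of plain Python. This is a minimal illustration treating annotations as sets of labels; `precision_recall` is a hypothetical helper, and the example annotations are made up:

```python
def precision_recall(expected: set[str], found: set[str]) -> tuple[float, float]:
    """Precision = correct found / total found.
    Recall    = correct found / total expected."""
    correct = expected & found  # annotations that are both found and expected
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical example: 3 expected annotations, 4 found, 2 of them correct.
expected = {"laptop", "printer", "headset"}
found = {"laptop", "printer", "tablet", "monitor"}
precision, recall = precision_recall(expected, found)
print(precision, round(recall, 3))  # → 0.5 0.667
```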
Ground Truth data is not provided: Measuring Inter-Annotator agreement
It is good practice to have multiple people (at least 2 people) tag the same parts of the dataset - without looking at what the other person has tagged for the same text. This process helps us mathematically compute how similar two annotators are when they tag the same piece of text. This is called measuring the Inter-Annotator agreement.
The following are some agreement reliability measures:
- Cohen’s Kappa (Cohen, 1960)
- Fleiss’s Kappa
- Scott’s π (Scott, 1955)
- Krippendorff’s α (Krippendorff, 1980)
Agreement measurement isn’t trivial, so to compute agreement one often uses metrics like Cohen’s Kappa.
The following are some degrees of agreement for the Cohen’s Kappa score:
- Kappa < 0: No agreement
- Kappa between 0.00 and 0.20: Slight agreement
- Kappa between 0.21 and 0.40: Fair agreement
- Kappa between 0.41 and 0.60: Moderate agreement
- Kappa between 0.61 and 0.80: Substantial agreement
- Kappa between 0.81 and 1.00: Almost perfect agreement
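For convenience, the bands above can be turned into a small lookup helper - a sketch only, as `interpret_kappa` is not a standard library function:

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Cohen's Kappa score to the qualitative bands listed above."""
    if kappa < 0:
        return "No agreement"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return f"{label} agreement"
    return "Almost perfect agreement"  # kappa is capped at 1.0

print(interpret_kappa(0.5))  # → Moderate agreement
```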
It is important to get Moderate or higher agreement before starting to train a Machine Learning model with your data.
Say we have 4 pieces of text, 2 categories, and 2 taggers (Person A and Person B):

| text | Person A | Person B |
|---|---|---|
| text 1 | computers | computers |
| text 2 | computers | mobile |
| text 3 | mobile | mobile |
| text 4 | mobile | mobile |

According to the above table, Person A and Person B both tagged computers one time, and there was 1 case where Person A tagged computers and Person B tagged mobile. There were 0 cases where Person A tagged mobile and Person B tagged computers. Finally, there were 2 cases where Person A and Person B both tagged mobile.

Using these values the Kappa score is 0.500 - which signifies moderate agreement.
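The Kappa computation can be sketched in plain Python. This is a minimal illustration - `cohens_kappa` is a hypothetical helper (not a library function), and the two label lists mirror the four-item example above:

```python
from collections import Counter

def cohens_kappa(tags_a: list[str], tags_b: list[str]) -> float:
    """Cohen's Kappa: (Po - Pe) / (1 - Pe), where Po is the observed
    agreement and Pe is the agreement expected by chance."""
    n = len(tags_a)
    # Observed agreement: fraction of items where both annotators agree.
    po = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: sum over labels of both annotators' label frequencies.
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    pe = sum(counts_a[label] * counts_b[label]
             for label in counts_a.keys() | counts_b.keys()) / (n * n)
    return (po - pe) / (1 - pe)

person_a = ["computers", "computers", "mobile", "mobile"]
person_b = ["computers", "mobile", "mobile", "mobile"]
print(round(cohens_kappa(person_a, person_b), 3))  # → 0.5
```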
You can compute the Kappa score on your own data using a library such as scikit-learn (sklearn.metrics.cohen_kappa_score).
Consolidating the final human annotations
Depending on the difficulty of the task, the inter-annotator agreement should be at least moderate. In certain cases, it is expected to be almost perfect before one can continue to the next steps.
There are several ways of consolidating human tags into a single set of tags. Here are some approaches:
- Union of all the annotations. This is used if some annotators have forgotten to tag some parts of the text.
- Intersection of all the annotations. This is a stricter approach where we are only taking into consideration the tags that are tagged by both annotators.
- Discussion and updating the annotation guidelines - then redoing the annotations until agreement is acceptable. This approach is most beneficial when a lot more data will be coming into the pipeline and it is important to have a solid understanding of how data needs to be tagged in the future.
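The union and intersection strategies above can be sketched as follows. This is a minimal illustration assuming each item's annotations are stored as a set of tags; `consolidate` is a hypothetical helper and the item names are made up:

```python
def consolidate(tags_a: dict[str, set[str]],
                tags_b: dict[str, set[str]],
                mode: str = "intersection") -> dict[str, set[str]]:
    """Merge two annotators' tags per item by union or intersection."""
    merged = {}
    for item in tags_a.keys() | tags_b.keys():
        a, b = tags_a.get(item, set()), tags_b.get(item, set())
        merged[item] = (a | b) if mode == "union" else (a & b)
    return merged

person_a = {"item1": {"Mobile"}, "item2": {"Computers"}}
person_b = {"item1": {"Mobile", "Computers"}, "item2": {"Computers"}}
strict = consolidate(person_a, person_b)           # intersection: only shared tags
loose = consolidate(person_a, person_b, "union")   # union: every tag either gave
```

Here the intersection keeps only {"Mobile"} for item1 (the tag both annotators agreed on), while the union keeps both of its tags.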
A dataset is created when we combine all the annotations into a single list of annotations - this dataset is called the Gold Standard or Ground Truth.
This is the dataset on which we can start doing some Machine Learning. However, we are not done yet - there are still several things to consider when working with an annotated dataset, such as how balanced it is, what baseline accuracy we should expect with it, and how to tell whether we are training a good model on it. More on this in a separate blog post.