Image classification, also called image tagging, is one of the simplest yet most powerful capabilities of computer vision. Simply being able to tag images has many uses in e-commerce, stock-photo sites, visual search, photo organization, and more. The quality of an image classification model depends heavily on its dataset. In this blog, we go through the nuances of creating a good image classification dataset and offer recommendations that can be followed when building one.
This guide also provides directions on:
- The labeling instructions that should be given to labelers when assigning an image classification job.
- How data scientists can construct good [Labels + Datasets] for image classification.
Section 1: Labeling Instructions
In this section, you will learn what information should be given to an image annotator when assigning an image classification labeling task.
Dataset Name: Weather Classification Dataset
The Weather Classification Dataset comprises images of varied weather conditions from different parts of the world.
Diversity
- Various cities
- Seasons (Summer, Monsoon)
- Day, Night, Evening
- Included images have:
  - A good number of dynamic objects, such as chairs, houses, cars, and trucks
  - Different scene layouts, such as farms, roads, beaches, mountains, and lakes
Class Definitions
The class definitions list the classes present in the Weather Classification Dataset.
- Cloudy
- Rainy
- Sunny
- Windy
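One way to hand these instructions to a labeling team is as a small structured spec that tooling can also check labels against. The sketch below is only an illustration: `LabelingSpec`, its field names, and the `validate` helper are hypothetical and not part of any particular labeling platform.

```python
from dataclasses import dataclass, field


@dataclass
class LabelingSpec:
    """Hypothetical container for the instructions handed to annotators."""
    dataset_name: str
    description: str
    classes: list[str]
    diversity_notes: list[str] = field(default_factory=list)

    def validate(self, label: str) -> str:
        """Return the canonical class name, or raise if the label is unknown."""
        by_lower = {c.lower(): c for c in self.classes}
        key = label.strip().lower()
        if key not in by_lower:
            raise ValueError(
                f"Unknown label {label!r}; expected one of {self.classes}"
            )
        return by_lower[key]


weather_spec = LabelingSpec(
    dataset_name="Weather Classification Dataset",
    description="Varied weather conditions across different parts of the world.",
    classes=["Cloudy", "Rainy", "Sunny", "Windy"],
    diversity_notes=[
        "Various cities",
        "Seasons (Summer, Monsoon)",
        "Day, Night, Evening",
    ],
)

print(weather_spec.validate(" rainy "))  # Rainy
```

Normalizing case and whitespace in one place keeps small typing differences between annotators from turning into accidental extra classes.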
Example Images
Section 2: How Image Classification works and its types
How does image classification work?
Image classification is a supervised learning task in deep learning: a model is trained on labeled images so that it can classify the object present in a new image.
Image Classification can be of many types depending on the number and relation of the labels:
- Binary Image Classification: In binary image classification, there are only two categories or classes. For example, the famous hot-dog classifier from the TV series Silicon Valley is a binary image classifier.
- Multi-class Image Classification: In multi-class image classification, there are more than two categories that the model needs to learn, and each image belongs to exactly one category. Typically, neural networks are trained in such a way that they must always choose one of the classes. Let me explain with an example. Let’s say we are training a multi-class image classification model to identify 4 fruits: green sugar apples, watermelons, bananas, and apples. If you send any image to this network, it has to make a prediction out of these 4, even if the image shows none of them. That’s why most production projects also have an additional class called background, negative, or something similar. The background class contains a lot of images of random things other than these four fruits. This way the network learns to predict non-relevant images as background.
- Multi-label Image Classification: In multi-label image classification, one image can belong to multiple categories.
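The difference between these types shows up in how the model's raw scores are turned into predictions. The following is a minimal sketch in plain Python, assuming the common setup of a softmax output for multi-class models and independent sigmoid outputs for multi-label models. The scores are made-up numbers for illustration, not the output of a real network.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (multi-class)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash one raw score into an independent probability (multi-label)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multiclass(logits, classes):
    """Always returns exactly one class: the one with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best]


def predict_multilabel(logits, classes, threshold=0.5):
    """Returns every class whose own probability clears the threshold."""
    return [c for c, x in zip(classes, logits) if sigmoid(x) >= threshold]


fruits = ["green-sugar-apple", "watermelon", "banana", "apple"]
# Made-up scores for an image showing none of the fruits: all are low.
unrelated = [-2.0, -1.5, -3.0, -1.8]

# Softmax still forces a choice among the four fruits.
print(predict_multiclass(unrelated, fruits))  # watermelon

# With a background class, the model has somewhere to put such images.
print(predict_multiclass(unrelated + [2.5], fruits + ["background"]))  # background

# Multi-label: zero, one, or several classes can be returned.
print(predict_multilabel([1.2, 0.3, -2.0], ["person", "animal", "vehicle"]))
```

Because softmax probabilities always sum to 1, a multi-class model cannot say "none of these" unless a class such as background exists for that purpose.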
Section 3: How to create good [Labels + Datasets] for Image Classification
Section 3A
6 fundamentals for creating a high-quality label set.
A. Avoid using labels whose meanings overlap with the other classes present in an image classification dataset.
B. You can use external links (such as Wikipedia or another reference web page) to educate the annotators about the classes present in the dataset.
C. Explain the classes/labels of a dataset through example images. Here is an example:
Dataset Name: Flower Classification dataset
Classes Included:
- Sunflower
- Rose
- Lotus
- Lily
Class Description:
a. Sunflower: A sunflower is a very tall plant with large yellow flowers. Oil from sunflower seeds is used in cooking and to make margarine. [Source]
b. Rose: A rose is a flower, often with a pleasant smell, which grows on a bush with stems that have sharp points called thorns on them. [Source]
c. Lotus: A lotus or a lotus flower is a type of water lily that grows in Africa and Asia. [Source]
d. Lily: A lily is a plant with large flowers. Lily flowers are often white. [Source]
D. Whenever two classes in a dataset are closely related, provide enough example images to labelers so that they can easily differentiate between the two. Here is an example:
Dataset Name: Construction Vehicle Dataset
Class A: Bulldozer
Class B: Backhoe Loader
E. Avoid creating confusing/subjective labels for a classification dataset.
Good Example | Bad Example |
Eyes open | Male eyes open |
Eyes closed | Male eyes closed |
Forehead covered | Forehead covered by a hat, hairband, cap |
Forehead not covered | Forehead not covered by a hat, hairband, cap |
F. Purposefully give enough example images to avoid intra-class/label conflict in a dataset.
Example 1: When we asked our labeling team to label vehicles, we knew that tractors vary widely in type, so we gave them many examples of different types of tractors.
Dataset Name: Vehicle Dataset
Class Name: Tractor
Example 2:
Dataset Name: Flower Classification Dataset
Class Name: Rose
Section 3B
7 simple rules to keep in mind whenever we formulate a problem as an image classification task.
A. Add images to the dataset whose resolution closely resembles what is required in production.
For example:
Production requirement
Scenario 1: For a low-bandwidth CCTV feed, collect and create a dataset of low-resolution images.
Scenario 2: For a medical image database, collect and create a dataset of high-resolution medical images.
Scenario 3: If the images seen in production will be blurry, collect and create a dataset with blurry images.
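A quick way to apply this rule is to compare the size of each collected image against the resolution expected in production. The sketch below assumes the image sizes have already been read into a dictionary of (width, height) pairs. The function name, the file names, and the 25% tolerance are hypothetical choices for illustration.

```python
def resolution_mismatches(image_sizes, target, tolerance=0.25):
    """Return the names of images whose width or height falls outside
    target * (1 - tolerance) .. target * (1 + tolerance)."""
    target_w, target_h = target
    flagged = []
    for name, (width, height) in image_sizes.items():
        width_ok = target_w * (1 - tolerance) <= width <= target_w * (1 + tolerance)
        height_ok = target_h * (1 - tolerance) <= height <= target_h * (1 + tolerance)
        if not (width_ok and height_ok):
            flagged.append(name)
    return flagged


# Hypothetical sizes for a low-bandwidth CCTV use case.
collected = {
    "cam1.jpg": (640, 360),
    "cam2.jpg": (704, 396),
    "stock.jpg": (4000, 3000),  # far sharper than the production feed
    "thumb.jpg": (160, 90),     # far smaller than the production feed
}

print(resolution_mismatches(collected, target=(640, 360)))
# ['stock.jpg', 'thumb.jpg']
```

Flagged images do not always have to be thrown away. Depending on the project, they could be resized to the production resolution instead, but that decision is best made explicitly.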
B. Create a diverse, comprehensive dataset that covers the production use cases.
C. Avoid filling a dataset with images that are systematically biased, i.e. favoring only images in which the object of interest or class is well illuminated or in the center of the image.
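Diversity and bias are easier to discuss once they are counted. The sketch below assumes each image comes with a small metadata record. The `label` and `illumination` keys and their values are hypothetical, and in practice this information would have to be collected or annotated separately. The function lists every class and attribute combination that has no images at all.

```python
from collections import Counter


def missing_combinations(records, attribute, expected_values):
    """Return (label, value) pairs that never occur in the records."""
    counts = Counter((r["label"], r[attribute]) for r in records)
    labels = sorted({r["label"] for r in records})
    return [
        (label, value)
        for label in labels
        for value in expected_values
        if counts[(label, value)] == 0
    ]


# Hypothetical metadata for a handful of weather images.
records = [
    {"label": "Sunny", "illumination": "bright"},
    {"label": "Sunny", "illumination": "bright"},
    {"label": "Sunny", "illumination": "bright"},
    {"label": "Rainy", "illumination": "bright"},
    {"label": "Rainy", "illumination": "dim"},
]

print(missing_combinations(records, "illumination", ["bright", "dim"]))
# [('Sunny', 'dim')]
```

An empty combination does not always mean the dataset is wrong, but it is a useful prompt to ask whether production will ever see that case.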
D. Avoid putting images in the dataset in which two classes exist together.
Dataset Name: Person Animal Classifier
Class Name: Person, Animal
In this case, we shouldn’t include images where both a person and an animal are present in the same image.
E. Avoid putting images in a dataset that do not contain any of the required classes.
Dataset Name: Person Animal Classifier
Class Name: Person, Animal
Please note that sometimes your data scientist may ask you to collect images that contain none of the classes, to be put in a negative/background class. Unless this is specified, we should avoid these kinds of images.
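These two rules can be combined into one small routing decision per image. The sketch below assumes we already know which objects appear in each image, for example from a quick manual review. The `route_image` function and its `allow_background` flag are hypothetical names used only for illustration.

```python
def route_image(present, classes, allow_background=False):
    """Decide where one image goes, given the set of objects seen in it.

    Returns a class name, "background", or None when the image should be skipped.
    """
    relevant = set(present) & set(classes)
    if len(relevant) == 1:
        return next(iter(relevant))  # exactly one dataset class: keep it
    if not relevant and allow_background:
        return "background"          # only when the data scientist asks for it
    return None                      # two classes together, or no class at all


classes = ["Person", "Animal"]

print(route_image({"Person"}, classes))                       # Person
print(route_image({"Person", "Animal"}, classes))             # None
print(route_image({"Tree"}, classes))                         # None
print(route_image({"Tree"}, classes, allow_background=True))  # background
```

Keeping the background decision behind an explicit flag mirrors the note above: such images are collected only when they are requested.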
F*. Add images to the dataset according to the production requirement.
Production use-case scenario: The requirements are governed by the production scenario. For example, if your app needs to detect mannequins along with persons, it may be okay to include mannequin images with persons. However, if your app should ignore mannequins and identify only real persons, then we would leave out mannequin images.
Case 1:
Requirement:
Label even a mannequin as a person in the images available in the dataset.
Action the data scientist will take:
Prepare a dataset with images of persons and mannequins in it.
Instruction to Labeling Team:
Label all images with mannequins present in them as a person in the “Person Animal classification dataset”.
Case 2:
Requirement:
Label only images that have a person in them.
Action the data scientist will take:
Prepare a dataset consisting of only person images.
Instruction to Labeling Team:
Delete or skip images in the “Person Animal classification dataset” if mannequins or other objects are present in them.
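The two cases differ only in the policy applied to mannequin images, so the policy can be written down once and applied consistently. The sketch below is hypothetical: the policy names, the `apply_policy` function, and the idea that annotators first record a raw "Mannequin" tag are all assumptions made for illustration.

```python
MANNEQUIN_AS_PERSON = "mannequin_as_person"  # Case 1
SKIP_MANNEQUINS = "skip_mannequins"          # Case 2


def apply_policy(raw_label, policy):
    """Map an annotator's raw tag to the final dataset label.

    Returns None when the image should be deleted or skipped.
    """
    if raw_label != "Mannequin":
        return raw_label
    if policy == MANNEQUIN_AS_PERSON:
        return "Person"
    if policy == SKIP_MANNEQUINS:
        return None
    raise ValueError(f"Unknown policy: {policy!r}")


raw_labels = ["Person", "Mannequin", "Animal"]

case_1 = [apply_policy(label, MANNEQUIN_AS_PERSON) for label in raw_labels]
print(case_1)  # ['Person', 'Person', 'Animal']

case_2 = [
    final
    for final in (apply_policy(label, SKIP_MANNEQUINS) for label in raw_labels)
    if final is not None
]
print(case_2)  # ['Person', 'Animal']
```

Recording the raw tag first and mapping it afterwards means the same labeled images can serve both cases if the production requirement changes later.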
G*. Label objects in an image only if they are photorealistic, NOT if they are cartoons, symbols, or paintings.
This requirement could also be client- or app-specific.
Label Image Classification Datasets through NeuralMarker
Visit NeuralMarker to explore the training data annotation platform, or request a demo to learn how this AI-powered platform can assist you in pre-labeling datasets for image classification tasks.