How to Formulate an Image Classification Problem for Data Collection and Labeling

Image classification, or image tagging, is one of the simplest yet most powerful capabilities of computer vision. Simply being able to tag images has many uses in e-commerce, stock-photo sites, visual search, photo organization, and more. How good an image classification model turns out depends heavily on its dataset. In this blog, we go through the nuances of creating a good image classification dataset and make recommendations that can be followed to build one.

This guide also provides directions on:

What labeling instructions should be given to labelers when assigning an image classification job.
How data scientists can construct good [Labels + Datasets] for image classification.

Section 1: Labeling Instructions

In this section, you will learn what information should be given to an image annotator when assigning an image classification task for labeling.

Dataset Name: Weather Classification Dataset

The Weather Classification Dataset comprises images of varied weather conditions across different parts of the world.

Diversity

Various cities
Seasons (summer, monsoon)
Day, night, evening

Included images have

A good number of dynamic objects (such as chair, house, car, truck)

Different scene layouts (such as farms, roads, beach, mountain, lake)

Class Definitions

The class definitions list the classes present in the Weather Classification Dataset.

Cloudy
Rainy
Sunny
Windy
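
The instructions above can also be written down as a small machine-readable spec, so the same information reaches every labeler and can be checked automatically. The following is only a minimal sketch: the `LabelingJobSpec` class and its field names are hypothetical and do not come from any particular labeling tool.

```python
from dataclasses import dataclass, field


@dataclass
class LabelingJobSpec:
    """Hypothetical container for the instructions handed to annotators."""
    dataset_name: str
    description: str
    classes: list[str]
    diversity_notes: list[str] = field(default_factory=list)

    def validate(self) -> None:
        # Class names must be unique (ignoring case and surrounding spaces)
        # so labelers never see two options that look identical.
        normalized = [c.strip().lower() for c in self.classes]
        if len(set(normalized)) != len(normalized):
            raise ValueError("duplicate class names in labeling spec")
        if len(normalized) < 2:
            raise ValueError("a classification job needs at least two classes")


weather_spec = LabelingJobSpec(
    dataset_name="Weather Classification Dataset",
    description="Varied weather conditions across different parts of the world.",
    classes=["Cloudy", "Rainy", "Sunny", "Windy"],
    diversity_notes=[
        "Various cities",
        "Seasons (summer, monsoon)",
        "Day, night, evening",
    ],
)
weather_spec.validate()  # raises ValueError if the spec is inconsistent
```

Keeping the spec in one place makes it easy to attach example images per class later and to reuse the same class list when the labels are exported.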

Example Images

Section 2: How Image Classification works and its types

How does image classification work?

Image classification is a supervised learning task in deep learning in which a model learns to assign a class to the object present in an image.

Image Classification can be of many types depending on the number and relation of the labels:

1. Binary Image Classification: In binary image classification, there are only two categories or classes. For example, the well-known hot-dog classifier from the TV series Silicon Valley is a binary image classifier.
2. Multi-class Image Classification: In multi-class image classification, there are more than two categories that the model needs to learn, and each image belongs to exactly one category. Typically, neural networks are trained in such a way that they must choose one of the classes. Let me explain with an example. Let’s say we are training a multi-class image classification model to identify 4 fruits: green sugar apples, watermelons, bananas, and apples. Now, if you send any image to this network, it has to make a prediction out of these 4. That’s why most production projects also have an additional class called background, negative, or something similar. The background class contains many images of random things other than these four fruits. This way the network learns to predict any non-relevant image as background.
3. Multi-label Image Classification: In multi-label image classification, one image can belong to multiple categories.
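
To make the difference between these types concrete, here is a minimal sketch of how the targets and outputs typically look. It uses only the Python standard library; the class names, the example logits, and the weather tags are illustrative assumptions, and a real project would use a deep learning framework instead of these hand-written helpers.

```python
import math

# Multi-class setup from the fruit example, plus the extra background class.
CLASSES = ["green-sugar-apple", "watermelon", "banana", "apple", "background"]


def one_hot(label, classes=CLASSES):
    """Multi-class target: exactly one position is 1."""
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes):
    """Multi-label target: any number of positions can be 1."""
    return [1 if c in labels else 0 for c in classes]


def softmax(logits):
    """Turns raw scores into probabilities that always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Independent per-class probability, as used for multi-label outputs."""
    return 1.0 / (1.0 + math.exp(-x))


# Multi-class: softmax forces a choice, so an unrelated image still gets
# one of the classes. The background class gives it somewhere sensible to go.
probs = softmax([0.1, -0.3, 0.2, 0.0, 2.5])  # made-up logits
predicted = CLASSES[probs.index(max(probs))]

# Multi-label: each class is thresholded on its own, so several can be on.
tags = ["cloudy", "rainy", "windy"]
scores = [sigmoid(x) for x in [1.2, 0.8, -2.0]]  # made-up logits
active = [t for t, s in zip(tags, scores) if s >= 0.5]
```

Binary classification is the special case with two classes; it can be modeled either as a two-way softmax or as a single sigmoid output.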

Section 3: How to create good [Labels + Datasets] for Image Classification

Section 3A

6 fundamentals for creating a high-quality label set.

A. Avoid using labels whose meanings overlap with the other classes present in an image classification dataset.

B. You can use external links [such as a Wikipedia page or any other web link] to educate the annotators about the classes present in the dataset.

C. Explain the classes/labels of a dataset through example images. Here is an example:

Dataset Name: Flower Classification dataset

Classes Included:

- Sunflower
- Rose
- Lotus
- Lily

Class Description:

a. Sunflower: A sunflower is a very tall plant with large yellow flowers. Oil from sunflower seeds is used in cooking and to make margarine. [Source]

b. Rose: A rose is a flower, often with a pleasant smell, which grows on a bush with stems that have sharp points called thorns on them. [Source]

c. Lotus: A lotus or a lotus flower is a type of water lily that grows in Africa and Asia. [Source]

d. Lily: A lily is a plant with large flowers. Lily flowers are often white. [Source]

D. If any two classes in a dataset are very closely related, always provide enough example images so that labelers can easily differentiate between the two. Here is an example:

Dataset Name: Construction Vehicle Dataset

Class A: Bulldozer

Class B: Backhoe Loader

E. Avoid creating confusing/subjective labels for a classification dataset.

Good Example	Bad Example
Eyes open	Male eyes open
Eyes closed	Male eyes closed
Forehead covered	Forehead covered by a hat, hairband, cap
Forehead not covered	Forehead not covered by a hat, hairband, cap

F. Purposefully give enough example images to avoid intra-class/label conflict in a dataset

Example 1: When we asked our labeling team to label vehicles, we knew that tractors come in many variations, so we gave them many examples of different types of tractors.

Dataset Name: Vehicle Dataset

Class Name: Tractor

Example 2:

Dataset Name: Flower Classification Dataset

Class Name: Rose

Section 3B

7 simple rules to keep in mind whenever we formulate a problem as an image classification task.

A. Add images to the dataset whose resolution closely resembles what is required in production.

For example:

Production requirement

Scenario 1: For a low-bandwidth CCTV feed, you need to collect and create a dataset of low-resolution images.

Scenario 2: For a medical image database, you need to collect and create a dataset of high-resolution medical images.

Scenario 3: If blurry images are expected in production, you need to collect and create a dataset with blurry images.
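
One way to act on this rule is to screen candidate images by size before they enter the dataset. The sketch below is an assumption-laden illustration: the file names, the 640x480 production target, and the 25% tolerance are all made up, and the image sizes are given directly as (width, height) tuples rather than read from files.

```python
def matches_production(size, target, tolerance=0.25):
    """Return True if (width, height) is within `tolerance` of the target.

    `tolerance` is a fraction of the target dimension, e.g. 0.25 allows
    a 25% deviation in width and in height.
    """
    width, height = size
    target_w, target_h = target
    return (
        abs(width - target_w) <= tolerance * target_w
        and abs(height - target_h) <= tolerance * target_h
    )


# Hypothetical candidates with their pixel dimensions.
candidates = {
    "cam_001.jpg": (640, 480),
    "stock_17.jpg": (4000, 3000),
    "cam_002.jpg": (704, 576),
}
production_target = (640, 480)  # assumed low-bandwidth CCTV resolution

kept = sorted(
    name
    for name, size in candidates.items()
    if matches_production(size, production_target)
)
```

Here the high-resolution stock photo is filtered out because it does not resemble what the model will see in production. Resolution is only one axis; blur or compression level would need their own checks.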

B. Create a diverse, comprehensive dataset that covers the production use cases.

C. Avoid filling a dataset with images that are systematically biased, i.e., favoring only images in which the object of interest or class is well illuminated or in the center of the image.
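
If each image carries a little metadata, this kind of bias can be spotted with a simple count. The following sketch assumes hypothetical records with a `label` and a `lighting` tag, and an arbitrary 80% threshold; none of these come from a specific tool, and the same idea applies to any other attribute such as object position.

```python
from collections import Counter, defaultdict


def attribute_shares(records, attribute):
    """For each label, compute the share of images per attribute value."""
    per_label = defaultdict(Counter)
    for rec in records:
        per_label[rec["label"]][rec[attribute]] += 1
    shares = {}
    for label, counts in per_label.items():
        total = sum(counts.values())
        shares[label] = {value: n / total for value, n in counts.items()}
    return shares


def skewed_labels(shares, max_share=0.8):
    """Labels where a single attribute value exceeds `max_share`."""
    return sorted(
        label for label, dist in shares.items() if max(dist.values()) > max_share
    )


# Made-up metadata: every person image is brightly lit, animals are mixed.
records = (
    [{"label": "person", "lighting": "bright"}] * 5
    + [{"label": "animal", "lighting": "bright"}] * 2
    + [{"label": "animal", "lighting": "dim"}] * 2
)

shares = attribute_shares(records, "lighting")
flagged = skewed_labels(shares)  # classes that need more varied images
```

A flagged class is a hint to collect more images under the under-represented condition, not proof that the model will fail.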

D. Avoid putting images in the dataset in which two classes appear together.

Dataset Name: Person Animal Classifier

Class Name: Person, Animal

In this case, we shouldn’t include images in which both a person and an animal are present.

E. Avoid putting images in a dataset that do not contain any of the required classes.

Dataset Name: Person Animal Classifier

Class Name: Person, Animal

Please note that sometimes your data scientist may ask you to collect images that contain none of the classes, to be put in a negative/background class. But unless this is specified, we should avoid these kinds of images.
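
Rules D and E can be combined into a single routing step for the Person Animal Classifier. This is a minimal sketch under assumptions: the `route_image` helper and the idea that we already know which things are visible in each image are hypothetical, used only to illustrate the decision.

```python
TARGET_CLASSES = {"person", "animal"}


def route_image(visible, use_background=False):
    """Decide what to do with an image given the things visible in it.

    Returns the class name to label it with, "background" when negatives
    were explicitly requested, or None when the image should be left out.
    """
    present = set(visible) & TARGET_CLASSES
    if len(present) == 1:
        return next(iter(present))  # exactly one target class: keep it
    if len(present) > 1:
        return None  # rule D: two classes appear together, leave it out
    # Rule E: no target class. Keep only if a background class was requested.
    return "background" if use_background else None


decisions = {
    "img_1": route_image({"person", "tree"}),
    "img_2": route_image({"person", "animal"}),
    "img_3": route_image({"car"}),
    "img_4": route_image({"car"}, use_background=True),
}
```

The `use_background` flag mirrors the note above: negatives are collected only when the data scientist asks for them.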

F*. Add images to the dataset according to the production requirement

Production Use-case Scenario: The requirements are governed by the production scenarios. For example, if your app needs to detect mannequins along with persons, it may be okay to include mannequin images with persons. However, if your app should ignore mannequins and identify only real persons, then we would leave out mannequin images.
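
The two mannequin scenarios can be expressed as a small labeling policy that maps what an image contains to the label it should receive. This is only a sketch: the policy names, the `apply_policy` helper, and the sample items are hypothetical.

```python
# Each policy maps image content to the dataset label. Content that is
# missing from a policy is excluded from the dataset.
LABEL_POLICIES = {
    "mannequins_count_as_person": {"person": "person", "mannequin": "person"},
    "real_persons_only": {"person": "person"},
}


def apply_policy(items, policy_name):
    """Return (image, label) pairs allowed under the chosen policy."""
    mapping = LABEL_POLICIES[policy_name]
    return [(name, mapping[content]) for name, content in items if content in mapping]


# Made-up images with their main content.
items = [("a.jpg", "person"), ("b.jpg", "mannequin"), ("c.jpg", "person")]

with_mannequins = apply_policy(items, "mannequins_count_as_person")
persons_only = apply_policy(items, "real_persons_only")
```

Writing the policy down explicitly keeps the data scientist and the labeling team aligned on which production requirement is in force.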

Case 1:

Requirement:

To label even a mannequin as a person in images available in the dataset.

The action the data scientist will take:

Preparation of a dataset with images of persons and mannequins in it.

Instruction to Labeling Team:

Label all images with mannequins present in them as a person in the “Person Animal classification dataset”.

Case 2:

Requirement:

To only label images that have a person in them.

The action the data scientist will take:

Preparation of a dataset consisting only of person images.