If an artificial intelligence model were an engine, then its training data would be its fuel. And like an automobile engine, how well it performs depends largely on the quality of what is fed into it. To paraphrase a well-known computer science adage: garbage in, garbage out.

The key to ensuring that your models produce accurate outputs lies in the training data, and data annotation is vital in providing the structure and context that allow AI models to learn effectively. Without well-structured context, those datasets are just noise.

But annotating huge datasets, whether images or videos, can be a never-ending (and tedious) task, and the size of these tasks is set to grow significantly as AI models demand larger datasets. Thankfully, there is a range of tools designed to make image and video data annotation an efficient process. In this article, we will examine what image annotation is, who is using it, how to use it, and why CVAT might be a good choice for your organization's computer vision or machine learning projects.

What Is Image Annotation?

Image annotation (or image data labeling) is the process of adding labels and tags to image datasets for training computer vision models. Doing so provides context that the machine learning model can use to understand the data and make predictions.

With image data annotation software, the annotations generally come in the form of a shape, such as a bounding box, polygon, or segmentation mask, along with a textual tag, or label. The geometric shape visually and spatially defines the object of interest in the image, while the textual tag helps the AI model identify and classify the object(s) in the image.

Example of an image with bounding box annotations.

In computer vision, identification and classification are key processes that help machines interpret and understand visual data, and image annotation is required to achieve them.

Image annotation can be a huge undertaking in terms of the time and resources needed. With datasets ranging in size from a few thousand to several million images, it is important to determine the best strategy both for acquiring datasets and for annotating them. Such strategies involve choosing between public and proprietary datasets, and between in-house annotation and professional annotation services. We will examine these strategies in more depth later.

Where and How Is Image Annotation Used?

Autonomous vehicles, medical imaging, facial recognition, and satellite imagery all use computer vision and artificial intelligence to perform tasks such as detection, classification, and segmentation. All of these industrial applications make use of labeled datasets, and image annotation plays a significant role in the transformation of a raw dataset (for example, a collection of drone imagery) into a labeled dataset that provides context for the AI model to use.

Data annotation is mostly used for training machine learning models. And by training, we mean teaching a computer vision model to identify objects in various kinds of images. This is analogous to how a child learns by pointing at things and calling them out by name. In short, image annotation provides ground-truth labels for the computer vision model.

Image annotation is also central to supervised learning, in which the model learns from annotated examples in the form of input-output pairs. Here, the annotated data shows the model how an object should be identified.

Finally, image annotation is also used in performance evaluation, to assess the accuracy of a model's learning. Accuracy can be tested by comparing the model's output against the annotated data (the ground truth).
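As a concrete (and deliberately simplified) illustration, here is a minimal Python sketch of ground-truth labels as input-output pairs, and of how a model's predictions might be scored against them. The file names and labels are invented for the example:

```python
# Toy illustration (not a CVAT format): ground-truth labels as
# input-output pairs, and a simple accuracy check against predictions.
ground_truth = {
    "cat_001.jpg": "cat",
    "cat_002.jpg": "no cat",
    "cat_003.jpg": "cat",
}

predictions = {
    "cat_001.jpg": "cat",
    "cat_002.jpg": "cat",  # model error
    "cat_003.jpg": "cat",
}

# Fraction of images where the prediction matches the ground-truth label.
correct = sum(predictions[name] == label for name, label in ground_truth.items())
accuracy = correct / len(ground_truth)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.67
```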
As mentioned in the introduction, quality training data for these systems depends largely on quality image annotation. A proper image annotation process can improve model accuracy and data consistency; reduce time, costs, and bias; and lead to more efficient model training overall.

How Are Companies Doing Image Annotation?

Any company or organization looking to develop its own computer vision AI models will require high-quality image datasets. These datasets tend to be huge, so careful thought must be given to how they are obtained in the first place.

This raises the question: should an organization opt for proprietary or open datasets? Let's examine both cases in more detail.

Creating Your Own Proprietary Image Datasets

The benefits of creating a proprietary dataset include having complete control over the subject matter of the images, as well as over image quality. Many open datasets are not curated around a particular subject, meaning they might not be well suited to specialized models.

Imagine that you wanted to train an AI to detect a particular type of defect in a manufacturing process developed in-house. Given the proprietary nature of the process, it would be close to impossible to find an open dataset with the images needed for training. The dataset, in this case, would need to be produced in-house.

A Google Street View car, capturing real-world images as it drives around. Source: Wikimedia Commons

The downside of creating such a dataset is that the task requires vast resources for image data collection (taking photos), data cleaning and preprocessing, annotation and labeling, quality control, and storage and management. It is a time-consuming and costly exercise.

Examples of companies using their own proprietary datasets include Tesla and Google. Tesla collects footage from its own vehicle sensors and uses this image data to train its self-driving AI feature (also known as FSD). Similarly, Google uses images gleaned from its own image assets to train Google Lens and Street View AI.

Using Open Datasets

One alternative (often used by smaller companies) is to use open image datasets, which reduce costs and speed up model development. Many such datasets are created by universities and government institutes, and are often freely available for research and non-commercial use. Some are available for commercial use, but this depends on license conditions and may require a fee.

The downside of using open datasets is that they may lack specialization for specific tasks, as per the manufacturing example in the section on proprietary datasets.

Panoptic segmentation datasets from COCO. Source: https://cocodataset.org

Think of proprietary and open datasets as bespoke tailored clothing versus clothing bought "ready to wear" from a shopping mall. With the tailored garment, you get to choose the material, the fit, the exact color, and any other features that you desire, but the customization comes at a premium price.
With ready-to-wear clothing, you are restricted to whatever is on the shelf in terms of size, style, and color, but you save a lot of money compared to the bespoke option.

The table below shows several open datasets that can all be imported into the CVAT image annotation tool for data labeling.

{{image-annotation-table-1="/blog-banners"}}

To summarize, your organization's choice of proprietary or open datasets will depend largely on the resources available and the level of specialization needed for your training data. Proprietary datasets allow a high level of specialization at the cost of time and financial resources, while open datasets allow faster development at a more cost-effective price point, but may pay a penalty when it comes to specialization.

Some companies might benefit from a hybrid approach, using open data for initial training and then switching to their own proprietary data to fine-tune their models. Each method has its own merits and trade-offs, and should be considered carefully before embarking on the task of developing an AI model.

What Are the Different Types of Image Annotation Tasks?

So, you've decided exactly where your dataset is coming from, and you are ready to begin adding context to the data for training your model. This is where the image annotation phase (and image annotation software like CVAT) comes into play.

At its core, the process of image annotation involves highlighting the item of interest in an image or video, and adding context via text-based notes attached to the item in question. The type of annotation depends on the intended use of the data.

For example, if you wanted to train a model to recognize the presence of a cat in an image (image classification), you would upload image data consisting of cats in various scenes. You could then instruct your in-house or third-party image annotation team to sort through the images and add a text description indicating whether or not a cat is present in each image.

More advanced tasks (such as detection) would require a bounding box to be drawn around the cat in the image, with various other descriptions (such as color or breed) added as tags.

CVAT provides a variety of image annotation tools for all of these tasks, including cuboid annotation (for objects with depth or volume), attribute annotation, tag annotation, and a plethora of shape annotation tools for 2D objects.

Image Annotation Tasks

The image annotation process involves applying various labels to images (or video) in order to add structure and context to datasets. Image annotation tasks can generally be divided into three categories: classification, detection, and segmentation. CVAT has a number of drawing tools aimed at each category. Before we delve into the drawing tools, let's take a look at the three categories in more detail.

Image Classification

Image classification is the most basic image annotation category. It involves applying a label (or labels) to a single image, and it simply helps the AI model identify whether an object is present. With the image classification method, the object's location is not specified, only its presence in the image. The label then aids the computer vision model in identifying similar objects throughout the whole dataset.

As an example, your team might be training an AI to recognize images of cats. With the image classification method, each image could be labeled as "cat" (if present) or "no cat" (if not present). Additional tags could be added to classify each image by breed or by color. But the classification model would not be able to identify where exactly the cat is located within the image. With image classification, it is not necessary to use the shape drawing tools, as the labels/tags are applied to the entire image. To indicate where the object is located in the image, you need to use a detection model.
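Before moving on to detection, here is a minimal sketch of what image-level classification labels might look like. The structure is hypothetical, invented purely for illustration, and is not a CVAT export format:

```python
# Image-level classification labels: one label per image, with optional
# extra tags, and no geometry at all (the label applies to the whole image).
classification_labels = [
    {"image": "cat_001.jpg", "label": "cat", "tags": {"breed": "tabby", "color": "brown"}},
    {"image": "cat_002.jpg", "label": "no cat", "tags": {}},
    {"image": "cat_003.jpg", "label": "cat", "tags": {"breed": "siamese", "color": "cream"}},
]

# Because there are no coordinates, a model trained on this data can learn
# *whether* a cat is present, but never *where* it is; locating the cat
# requires detection annotations instead.
for item in classification_labels:
    print(item["image"], "->", item["label"])
```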
Object Detection

Detection expands upon classification by adding a localization element. Detection not only identifies the presence of an object in an image, but also adds spatial information that pinpoints the object's location.

Such tasks require drawing tools (such as a rectangle/bounding box, polygon, or ellipse) to be used during annotation to highlight where the object of interest is positioned. These drawn shapes help the AI model understand both the object's presence and its position in the image.

Additionally, if there are multiple objects in the image, the detection model can specify how many there are, and more advanced models can even assign a confidence score, which indicates the likelihood that the identification is correct. Finally, more advanced models can also detect interactions and relationships between multiple objects.

So, going back to our cat example, a detection model could identify that there are two cats, classify them according to breed (with a confidence score), and then infer one cat's position in relation to the other.

Image Segmentation

Segmentation annotation is the most advanced of the three categories. It divides an image into discrete areas, providing pixel-level accuracy. There are three main subcategories of segmentation in computer vision: semantic, instance, and panoptic segmentation.

Semantic Segmentation

With semantic segmentation, each pixel of the object of interest is assigned a class label. Whereas a detection model uses a bounding box to assign a general class and location within the image, defining the object at the pixel level allows the model to capture its shape with far more precision.

For example, an image might show a cat drinking out of a bowl while another cat sleeps nearby. During the annotation process, the annotator could use a brush tool to paint the pixels of both cats, or use a polygon tool. All the pixels within the masks would be classed as "cat". Similarly, the bowl could also be annotated, with all the bowl pixels labeled as "bowl".

With semantic segmentation, the model does not distinguish between multiple objects of the same class. Both cats, despite being in their own discrete regions, would simply be classified as "cat". To distinguish multiple instances of the same class, you need to use instance segmentation.

Instance Segmentation

An instance segmentation model also uses masks to assign pixel-level classification to objects, but unlike semantic segmentation, it can identify different instances of the same class. For example, it could distinguish between two cats in an image, each with a different label. During the annotation process, the annotator would create a mask around each cat, capturing the exact shape and boundaries of each one. Unlike detection, which only provides a bounding box, instance segmentation gives a detailed pixel-by-pixel representation of each cat.
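The difference between semantic and instance masks is easy to see in miniature. Below is a toy sketch of the same two-cat scene encoded both ways, using NumPy arrays with invented values purely for illustration:

```python
import numpy as np

# Toy 3x6 masks for a scene containing two cats.

# Semantic mask: each pixel holds a CLASS id (0 = background, 1 = cat).
# Both cats share id 1, so they cannot be told apart.
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance mask: each pixel holds an INSTANCE id (0 = background,
# 1 = first cat, 2 = second cat), so the two cats stay distinct.
instance = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

print("ids in semantic mask:", np.unique(semantic))  # [0 1]   -> one "cat" class
print("ids in instance mask:", np.unique(instance))  # [0 1 2] -> two cat instances
```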
In this approach, annotators categorize both background elements (such as a wall, or carpet) as well as countable objects such as people, cars, or cats.In this case, if there was an image of three cats lounging on a patterned rug, using the panoptic segmentation method, we would treat the rug as a single background element, applying one uniform label to it. Each cat would be identified individually, with separate segmentation masks, distinguishing them even if they are curled up together or partially overlapping (occluded). This method gives AI a more complete understanding of a scene, allowing it to recognize both the setting and the objects within it.Annotation Category SummaryBefore we take a more in-depth look at the various drawing tools, let’s just summarize the annotation categories in terms of their function, along with some non-feline related applications.{{image-annotation-table-2="/blog-banners"}}Types of Image Annotation TechniquesAn effective image annotation software should be capable of annotating objects that are both static (shape mode) and objects in motion, across multiple frames (frame mode). And it should be able to use any kind of common image file, such as JPEG, PNG, BMP, GIF, PPM and TIFF.To make shape annotation tasks a cinch, CVAT allows users to annotate with rectangles, polygons, polylines, ellipses, cuboids, skeletons, and with a brush tool.While the various shape annotation tools can be used interchangeably in many situations, each tool works optimally for specific types of task.RectanglesAnnotating with rectangles is one of the easiest methods of image annotation. Also known as a “bounding box”, this shape is best suited for the detection of uncomplicated objects such as doors on a building, street furniture, packing boxes, animals, and faces. They can even be used for notation of people both static and in motion. This is particularly useful for surveillance or tracking projects, although if pose estimation is required, more detailed annotations such as skeleton or polygons could be a better option.When multiple objects obstruct each other, the "Occluded" label can be used.Overall, notation with rectangles is an easy and computationally efficient method well-suited for quick object detection of a broad range of subjects. If you want a quick way to identify the general presence and location of the object, then notation with rectangles is a great place to start.Image annotation with rectangles is incredibly straightforward. In CVAT, simply select the rectangle icon in the controls sidebar, choose a label, and select a Drawing Method (2-point or 4-point). Click Shape (or Frame, if annotating video) to enter drawing mode.2-Point Method: Click two opposite corners (top-left and bottom-right) to create a rectangle.4-Point Method: Click the top, bottom, left, and right-most points of the object for a more precise fit. The rectangle is completed automatically after the fourth click.Users can adjust the boundaries of the resulting rectangle using the mouse, and rotate it to best fit the object of interest. Polygons Offering a higher level of precision than rectangles, annotating with polygons is better suited for objects with irregular shapes requiring a more accurate boundary delineation.Drawing a polygon allows a much higher level of detail, as it can closely follow the curves and shape of an object, making it well suited for tasks that require pixel-level analysis. 
Polygons

Offering a higher level of precision than rectangles, annotating with polygons is better suited to objects with irregular shapes that require more accurate boundary delineation. Drawing a polygon allows a much higher level of detail, as it can closely follow the curves and shape of an object, making it well suited to tasks that require pixel-level analysis. Polygons can also be used to create masks for semantic, instance, and panoptic segmentation.

Polygon annotation can be used for the detection of objects such as geographical features in satellite images, tumors in medical imagery, types of plants in plant identification, and pretty much anything whose shape is too complex to be captured by a rectangle. If rectangle annotation is best suited to broad object detection tasks, polygons are better used for tasks such as localization, segmentation, or detailed recognition. To put it another way, while a rectangle annotation is fine for detecting faces, polygons are better for detecting facial features such as mouths, eyes, and noses.

Like the rectangle annotation function, drawing polygons is uncomplicated. To draw a polygon in CVAT, locate the polygon option in the controls sidebar and choose a label. Click Shape to enter drawing mode; the polygon can then be drawn with either of two methods. With the first method, the user simply uses the mouse to place dots around the outline of the object. With the second method, the user holds down the Shift key and traces the object with the mouse as a continuous contour; dots appear around the object automatically. You can see an example of this in the graphic below.

Manual drawing of a polygon annotation

Once the polygon is completed (with either method), the user can adjust it by clicking on the dots and dragging them until they are happy with the result.
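Since polygon annotations often end up as segmentation masks, here is a brief sketch of that conversion, using Pillow and NumPy for illustration. The image size and vertex coordinates are invented, and this is not how CVAT performs the conversion internally:

```python
import numpy as np
from PIL import Image, ImageDraw

# Rasterize a polygon annotation into a binary segmentation mask.
width, height = 64, 64
polygon = [(10, 12), (50, 8), (58, 40), (30, 55), (8, 42)]  # (x, y) vertices

mask_img = Image.new("L", (width, height), 0)      # single-channel, all zeros
ImageDraw.Draw(mask_img).polygon(polygon, fill=1)  # pixels inside polygon -> 1
mask = np.array(mask_img, dtype=np.uint8)

print(mask.shape)  # (64, 64)
print(mask.sum())  # number of "object" pixels covered by the polygon
```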
Ellipses

Annotating with ellipses is most useful for the detection of round objects, whether elliptical, circular, or spherical. If you want to quickly annotate objects such as wheels, various fruits, or even the eyes on a face, then ellipses are the perfect shape for the task. Applications where you might wish to use elliptical annotations include cell detection in medical imaging, pupils in eye tracking, astronomical objects, circular craters in geospatial mapping, and egg monitoring in a hatchery.

In CVAT, ellipses are created in much the same way as rectangles. Simply specify two opposite points, and the ellipse will be inscribed in the resulting imaginary rectangle. And like rectangle annotations, ellipses can also be rotated.

Polylines

The annotation types mentioned so far have focused on objects with enclosed regions. Polylines can also be used to annotate elongated, thin enclosed shapes, but in addition they permit the annotation of non-enclosed, linear, continuous objects. To that end, polyline annotation is the optimal choice for objects with long boundaries and contours that do not need to be fully enclosed, such as railway lines or roads.

It is also extremely handy for tasks requiring path-based analysis, such as object tracking, and for connecting key points in pose estimation tasks. Specific examples of applications using polylines include footpaths and rail lines in aerial mapping, general linear infrastructure inspection, text lines and paragraphs in OCR, animal and human skeletons in pose estimation, and moving objects in video sequences.

A polyline on a continuous road marking

To sum it up, polylines are at their most useful when the goal is to track, detect, or measure linear features.

Drawing polylines in CVAT is similar to drawing a polygon. Simply select the polyline tool from the control panel, select Shape (or Track), and set the number of points needed for the polyline; the drawing completes automatically when the specified number of points has been reached. Also, like the polygon tool, there are two ways in which a polyline can be drawn: it can be drawn point by point, or it can be traced manually by holding down the Shift key.

Brush Tool

The brush tool is a free-form tool that allows the manual painting of objects and the creation of masks. Masking is particularly useful for annotating a single object that appears split in two, such as a vehicle with a person standing in front of it. You can see an example of this in the graphic below.

Example of image annotation with a brush tool

The brush tool in CVAT features various modes, such as brush shape selection, pixel erasing, and polygon-to-mask, which enables quick conversion of polygon selections into masks. Annotations can be saved and modified via the brush tool menu, improving efficiency in detailed image segmentation tasks. Annotating with the brush tool is ideal for applications that require a high level of precision, such as medical imaging, object detection, or autonomous driving.

Skeletons

Annotating with skeletons is the best option for tasks requiring the analysis of complex yet consistent structures, such as human figures. It is also a little more involved than the other annotation processes we have looked at in this article, which is why we have saved it until the end!

Example of a skeleton annotation

A skeleton consists of multiple points (also referred to as elements), which may be connected by edges. Each point functions as an individual object, with its own unique attributes and properties such as color, occlusion, and visibility.

Skeleton annotations can be used for both static and moving images, although in different ways: with static images, they are best used for analyzing a single pose, whereas in video they can serve more dynamic applications, such as tracking movement over time. Other specific applications of skeleton-based annotations include gait analysis, workplace ergonomics assessments, gesture recognition for sign language, crime scene analysis, and avatar posture recognition in AR/VR environments.

Of the various methods we have looked at for annotating static images in this article, annotating with skeletons is generally the most complex. However, CVAT makes the whole process much more user-friendly.

There are two main methods of annotating with skeletons in CVAT: creating one manually, or loading a skeleton from a model. The Skeleton Configurator allows users to add, move, and connect points; upload and download skeletons in SVG format; and configure labels, attributes, and colors for each point. To use it, set up a skeleton task in the configurator and click Setup Skeleton to enable manual creation. Then simply add points and edges in the drawing area, configure the required attributes, upload the files, and submit the task.
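To show what a skeleton annotation actually carries, here is a hypothetical Python sketch of its structure: named points, per-point properties, and the edges connecting them. This is illustrative only, not CVAT's export schema:

```python
# Hypothetical skeleton annotation for a single cat pose.
skeleton = {
    "label": "cat",
    "points": {
        "head":      {"xy": (120, 40),  "occluded": False},
        "spine_mid": {"xy": (118, 90),  "occluded": False},
        "tail_base": {"xy": (115, 140), "occluded": True},  # hidden behind the rug
    },
    # Each edge links two named points, forming the skeleton's structure.
    "edges": [("head", "spine_mid"), ("spine_mid", "tail_base")],
}

# Walk the edges to print the skeleton's connectivity.
for a, b in skeleton["edges"]:
    print(f"{a} -> {b}")
```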
AI-Assisted Image Data Annotation

As seen in the previous sections, annotating with the various shapes is a straightforward experience. But these tasks can be made easier still, thanks to a range of automation features. AI-assisted image annotation makes use of pre-trained ML models for the detection, classification, and segmentation of objects within image datasets.

CVAT can use pre-installed models, and cloud-hosted instances can also integrate with Hugging Face and Roboflow. For organizations using a self-hosted setup, custom models can be deployed with Nuclio. AI models in CVAT, such as YOLOv3, YOLOv7, RetinaNet, and OpenVINO-based models, provide accurate object detection, facial recognition, and text detection.

CVAT's automated shape annotation and labeling features can significantly accelerate the complex image annotation process, potentially improving speed by up to 10 times. These features leverage various machine learning techniques for tasks like object detection, semantic segmentation, and tracking:

- Automatic labeling using pre-trained deep learning models (e.g., OpenVINO, TensorFlow, PyTorch).
- Semi-automatic annotation (e.g., interactive segmentation).
- Automatic mask generation: AI models can generate segmentation masks for complex objects.
- Smart polygon tool: automatically refines polygon shapes around detected objects.
- Pre-trained object detectors: detect and label objects using AI models like YOLO, Mask R-CNN, or Faster R-CNN.

We will do a deep dive into the automation side of image annotation in another post; for now, we simply wanted to draw your attention to its existence, in case you wanted to know how AI itself can be used to make the model training process even more efficient.

Easy Annotation & Labeling of Images with CVAT Annotation Software

As you have seen in this article, there are numerous image annotation techniques suited to a range of different computer vision projects and use cases. The good news is that CVAT offers all the aforementioned tools in one handy, easily accessible solution. So whether your team is training a computer vision model, engaging in supervised learning, or conducting a performance evaluation, the CVAT platform can help with all of your data annotation needs.

CVAT takes away the headaches of creating annotated datasets with its innovative and user-friendly approach to annotation and task allocation. With its image annotation tool, your organization can upload datasets of visual assets, break the sets down into smaller chunks, and distribute them to team members anywhere on Earth. Once the team members receive their tasks, they can use the intuitive image annotation engine to quickly add context to both image and video datasets.

CVAT also integrates with HUMAN Protocol's innovative task distribution and compensation system, creating a seamless, efficient workflow for crowdsourcing annotations. And if you don't have enough resources to do annotation in-house, CVAT's professional annotation services team is available to provide high-quality, expertly labeled datasets, ensuring your machine learning models receive the precise training data they need.

So, to summarize: CVAT's image annotation platform can be used for any visual object, whether flat or three-dimensional, static or dynamic, and the drawing tools are fundamentally the same in every scenario. Naturally, there are more advanced features for power users, and if you would like to know more about those, you can learn more at this link.

And if you haven't yet got to grips with the basics of image annotation and would like to get started with the features in this article, you can try out the free SaaS version of CVAT right here. For those wanting to try the on-premise community version, you can find that over on GitHub.
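Finally, if your team prefers to drive CVAT from code, there is also a Python SDK. The snippet below is a rough sketch of creating a task and uploading images with it; the host, credentials, label names, and file names are placeholders, and exact signatures may vary between SDK versions, so check the current SDK documentation:

```python
from cvat_sdk import make_client
from cvat_sdk.core.proxies.tasks import ResourceType

# Placeholder host and credentials - point these at your own CVAT instance.
with make_client(host="https://app.cvat.ai", credentials=("user", "password")) as client:
    task = client.tasks.create_from_data(
        spec={
            "name": "cat detection demo",  # hypothetical task name
            "labels": [{"name": "cat"}, {"name": "bowl"}],
        },
        resource_type=ResourceType.LOCAL,
        resources=["cat_001.jpg", "cat_002.jpg"],  # local image files to upload
    )
    print(f"Created task #{task.id}")
```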