Dataset Format

To train a model successfully, prepare your dataset as described in this section.

Object detection

The TLT-Trainer expects training data in KITTI file format. The KITTI format requires the data to be organized in the following structure:

.
|-- dataset root
    |-- images
        |-- 000000.jpg
        |-- 000001.jpg
        .
        .
        |-- xxxxxx.jpg
    |-- annotations
        |-- 000000.txt
        |-- 000001.txt
        .
        .
        |-- xxxxxx.txt

Here's a description of the structure:

  • The images directory contains the images to train on.

  • The annotations directory contains the label files for the corresponding images.

Each image and its label file share the same file ID, which is the file name before the extension. This shared name is what maintains the image-to-label correspondence.
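The naming rule can be checked before training starts. The following is a minimal sketch, assuming the directory names shown in the structure above (images and annotations under the dataset root) and the .jpg and .txt extensions. The helper name find_unpaired_files is hypothetical and is not part of TLT.

```python
from pathlib import Path


def find_unpaired_files(dataset_root, image_ext=".jpg", label_ext=".txt"):
    """Return (images_without_labels, labels_without_images) as sorted file IDs.

    Assumes <dataset_root>/images and <dataset_root>/annotations exist.
    Illustrative helper only; not part of the TLT toolkit.
    """
    root = Path(dataset_root)
    # The file ID is the file name without its extension (Path.stem).
    image_ids = {p.stem for p in (root / "images").glob(f"*{image_ext}")}
    label_ids = {p.stem for p in (root / "annotations").glob(f"*{label_ext}")}
    return sorted(image_ids - label_ids), sorted(label_ids - image_ids)
```

Both returned lists are empty when every image has a label file with a matching file ID and every label file has a matching image.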

Label files

A KITTI format label file is a simple text file containing one line per object. Each line has multiple fields. Here is a description of these fields:

| Num elements | Parameter name | Description | Type | Range | Example |
|---|---|---|---|---|---|
| 1 | Class name | The class to which the object belongs. | String | N/A | person, car, road_sign |
| 1 | Truncation | How much of the object extends beyond the image boundaries. | Float | [0.0, 1.0] | 0.0 |
| 1 | Occlusion | Occlusion state: 0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown. | Integer | [0, 3] | 2 |
| 1 | Alpha | Observation angle of the object. | Float | [-pi, pi] | 0.146 |
| 4 | Bounding box coordinates: xmin, ymin, xmax, ymax | Location of the object in the image. | Float (0-based index) | xmin in [0, image_width], ymin in [0, image_height], xmax in [xmin, image_width], ymax in [ymin, image_height] | 100 120 180 160 |
| 3 | 3-D dimensions | Height, width, and length of the object (in meters). | Float | N/A | 1.65, 1.67, 3.64 |
| 3 | Location | 3-D object location x, y, z in camera coordinates (in meters). | Float | N/A | -0.65, 1.71, 46.7 |
| 1 | Rotation_y | Rotation ry around the Y-axis in camera coordinates. | Float | [-pi, pi] | -1.59 |

Each object therefore has a total of 15 elements, written on one line and separated by spaces. Here is a sample label file:

car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03

This file describes three objects in the image, one per line. For detection, the toolkit currently requires only the class name and bounding box coordinate fields to be populated, because the TLT training pipeline supports training only on the class and bounding box coordinates. The remaining fields may be set to 0.
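The 15-element layout can be read back into named fields to sanity-check a label file. The following is a minimal sketch, assuming fields are separated by whitespace and class names contain no spaces. The names KittiObject and parse_kitti_line are hypothetical and are not part of TLT.

```python
from typing import NamedTuple, Tuple


class KittiObject(NamedTuple):
    class_name: str
    truncation: float
    occlusion: int
    alpha: float
    bbox: Tuple[float, float, float, float]   # xmin, ymin, xmax, ymax
    dimensions: Tuple[float, float, float]    # height, width, length (meters)
    location: Tuple[float, float, float]      # x, y, z in camera coordinates
    rotation_y: float


def parse_kitti_line(line):
    """Parse one label line into a KittiObject (illustrative helper)."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}: {line!r}")
    # Everything after the class name is numeric (14 values).
    nums = [float(v) for v in fields[1:]]
    return KittiObject(
        class_name=fields[0],
        truncation=nums[0],
        occlusion=int(nums[1]),
        alpha=nums[2],
        bbox=tuple(nums[3:7]),
        dimensions=tuple(nums[7:10]),
        location=tuple(nums[10:13]),
        rotation_y=nums[13],
    )
```

For detection, only class_name and bbox carry information the training pipeline uses. The other fields are parsed here only so that a malformed line is caught early.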

Here is a sample label file for a custom annotated dataset:

car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
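Lines like these can be generated from a class name and a bounding box alone. The following is a minimal sketch, assuming two-decimal formatting as in the sample above and zero for every field the detection pipeline does not use. The function name format_kitti_line is hypothetical and is not part of TLT.

```python
def format_kitti_line(class_name, xmin, ymin, xmax, ymax):
    """Build one 15-element label line with only class and bbox populated."""
    if not (xmin <= xmax and ymin <= ymax):
        raise ValueError("bounding box must satisfy xmin <= xmax and ymin <= ymax")
    bbox = f"{xmin:.2f} {ymin:.2f} {xmax:.2f} {ymax:.2f}"
    # Field order: class, truncation, occlusion, alpha, 4 bbox values,
    # then 7 zeroed fields (3-D dimensions, location, rotation_y).
    return f"{class_name} 0.00 0 0.00 {bbox} " + " ".join(["0.00"] * 7)


print(format_kitti_line("car", 587.01, 173.33, 614.12, 200.12))
```

Writing one such line per annotated object into a .txt file named after the image's file ID yields a label file in the layout shown above.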

In the next section, you will find instructions about building and running the container.