Computer Vision / Video Analytics

Mitigating Occlusions in Visual Perception Using Single-View 3D Tracking in NVIDIA DeepStream

When it comes to perception for Intelligent Video Analytics (IVA) applications such as traffic monitoring, warehouse safety, and retail shopper analytics, one of the biggest challenges is occlusion. For example, people may move behind structural obstacles, retail shoppers may be partially hidden by shelving units, and cars may be obscured by large trucks.

This post explains how the new Single-View 3D Tracking feature in NVIDIA DeepStream SDK can help mitigate occlusions in visual perception that are often encountered in real-life IVA deployments.

Perspectives and projections in visual perception

In our physical world, the motion of some objects observed through a camera lens may look erratic. This is due to the camera’s 2D representation of the 3D world. 

An example of this is the retrograde motion of planets like Mercury and Mars, which baffled ancient Greek astronomers. They couldn’t explain why the planets sometimes appeared to move backward (Figure 1). 

This apparent retrograde motion arises because the trajectories of planets across the night sky are projections of their orbital motions in the 3D space of the universe onto the 2D canvas of the sky. Had the ancient astronomers known the pattern of motion in 3D space, they could have predicted how these planets would appear to move across the 2D night sky.

Figure 1. The retrograde motion of Mars in the night sky in 2014 (left) and 2016 (right). Credit: NASA

Traffic monitoring cameras provide a similar example. These cameras are often mounted to monitor a large area in which the vehicles’ motion dynamics at near-field and far-field may be drastically different. 

In Video 1, vehicles in the distance appear small and slow-moving. As the vehicles get closer to the camera and make turns, abrupt changes in object motion can be observed. These changes make it difficult to find common patterns in the 2D camera view, and therefore difficult to predict where the vehicle may move in the future.

Video 1. Near-field vehicles appear to move quickly while far-field vehicles appear to move more slowly

Object tracking is essentially a continuous estimation of the physical states of objects while maintaining their unique identities over time. This process typically involves modeling the object motion dynamics and making predictions to suppress the inherent noise in measurements (detections). Given the examples above, performing state estimation and prediction directly in the native 3D space, where the objects actually exist, yields better results than doing so on the projected 2D camera image plane.
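To make this concrete, the following is a minimal sketch of the state estimation and prediction idea, not DeepStream's actual tracker implementation: a constant-velocity Kalman-style predict/update cycle on an object's ground-plane position. The state layout, frame rate, and noise values are assumptions chosen only for illustration.

```python
import numpy as np

# Minimal constant-velocity Kalman filter sketch (illustrative, not DeepStream code).
# State: [x, y, vx, vy] on the ground plane; measurement: [x, y] from a detection.
dt = 1.0 / 30.0                                  # assumed frame interval (30 FPS)
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)       # constant-velocity motion model
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)        # only position is observed
Q = np.eye(4) * 1e-3                             # process noise (assumed)
R = np.eye(2) * 1e-1                             # measurement (detection) noise (assumed)

def predict(x, P):
    """Propagate the state and covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a noisy detection z = [x, y]."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    return x + K @ (z - H @ x), (np.eye(4) - K @ H) @ P

x, P = np.array([0.0, 0.0, 1.0, 0.5]), np.eye(4) # initial state estimate and covariance
x, P = predict(x, P)
x, P = update(x, P, np.array([0.04, 0.02]))      # detection for the current frame
```

Running this cycle on 3D ground-plane coordinates keeps the motion model simple (roughly constant velocity), whereas the same object's trajectory on the 2D image plane can look highly nonlinear, as in the traffic camera example above.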

Single-View 3D Tracking with NVIDIA DeepStream

NVIDIA DeepStream SDK is a complete streaming analytics toolkit based on GStreamer for AI-based multisensor processing, video, audio, and image understanding. The recent DeepStream 6.4 release introduced a new feature called Single-View 3D Tracking (SV3DT), which enables the estimation of object states in the 3D physical world within the single-camera view. 

This process involves converting the observed measurement on the 2D camera image plane into the 3D world coordinate system using a 3×4 projection matrix, or camera matrix, for each camera. An object’s location on the 3D world ground plane is represented as the center of the bottom of the object. So a pedestrian is modeled as a cylinder (with height and radius) standing on the world ground plane, with the center of the base of the cylindrical model as the foot location of the pedestrian (Figure 2).
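As a concrete illustration of the projection step, the snippet below maps a 3D world point, such as a foot location on the ground plane, to 2D pixel coordinates with a 3×4 projection matrix. The matrix values and the point are made up for this example; in practice they come from per-camera calibration.

```python
import numpy as np

# Hypothetical 3x4 projection (camera) matrix; real values come from calibration.
P = np.array([[1000.0,    0.0, 960.0,    0.0],
              [   0.0, 1000.0, 540.0, 2000.0],
              [   0.0,    0.0,   1.0,    5.0]])

# Foot location on the world ground plane (z = 0), in homogeneous coordinates (meters).
foot_world = np.array([1.0, 2.0, 0.0, 1.0])

u, v, w = P @ foot_world
pixel = (u / w, v / w)        # perspective divide gives the 2D image coordinates
print(pixel)                  # (200.0, 800.0) for these made-up numbers
```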

Figure 2. The bottom center of each cylindrical model represents the location of each pedestrian on the 3D-world ground plane (marked with a green dot)

Using the 3×4 projection matrix and the cylindrical human model, SV3DT estimates the location of the 3D human model on the world ground plane for each detected object, such that the model's projection onto the 2D camera image plane best matches the bounding box of the detected object.
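The sketch below illustrates this fitting idea conceptually; it is not the actual SV3DT algorithm. It scores candidate foot locations on the ground plane by how well the projected cylindrical model agrees with the detected bounding box, using an assumed model height and radius, a hypothetical projection matrix P passed in by the caller, and a deliberately crude matching cost.

```python
import numpy as np

# Assumed cylindrical human model parameters (meters), made up for this sketch.
PERSON_HEIGHT, PERSON_RADIUS = 1.75, 0.3

def project(P, xyz):
    """Project a 3D world point to 2D pixels with a 3x4 projection matrix P."""
    u, v, w = P @ np.append(xyz, 1.0)
    return np.array([u / w, v / w])

def projected_box(P, foot_xy):
    """Approximate the projected cylinder with a 2D box [x1, y1, x2, y2] spanned by
    the projected foot and head centers, padded horizontally by the projected radius."""
    foot = project(P, [foot_xy[0], foot_xy[1], 0.0])
    head = project(P, [foot_xy[0], foot_xy[1], PERSON_HEIGHT])
    side = project(P, [foot_xy[0] + PERSON_RADIUS, foot_xy[1], 0.0])
    half_w = abs(side[0] - foot[0])
    return np.array([foot[0] - half_w, head[1], foot[0] + half_w, foot[1]])

def fit_foot_location(P, detected_box, candidates):
    """Return the candidate ground-plane foot location whose projected model box is
    closest to the detected box (sum of absolute corner differences as a crude cost)."""
    costs = [np.abs(projected_box(P, c) - detected_box).sum() for c in candidates]
    return candidates[int(np.argmin(costs))]
```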

For example, in Figure 3 (left), the gray bounding boxes indicate the objects detected by an object detector using a model such as NVIDIA TAO PeopleNet. The purple and yellow cylinders indicate the corresponding 3D human models projected from their estimated locations on the 3D world ground plane onto the 2D camera image plane. The green dots at the bottom of the projected 3D human models indicate the estimated foot locations, which match the actual foot locations well despite the perspective and rotation of the camera view.

Figure 3. SV3DT helps track the accurate foot location of retail shoppers, even with occlusions

An important advantage of the newly introduced DeepStream SV3DT feature is that the 2D and 3D foot locations of the objects can be found accurately even if there are significant partial occlusions. This is one of the most challenging problems in real-world IVA applications. For more details, see our previous post, State-of-the-Art Real-time Multi-Object Trackers with NVIDIA DeepStream SDK 6.2.

For example, Figure 3 (right) shows a person shopping in a narrow aisle with only a small part of the upper body visible to the camera. This results in a smaller bounding box that captures only the head and shoulder areas. In this scenario, localizing the person on a global store map is extremely challenging because estimating the foot location is a nontrivial task.

Using the bottom center of the bounding box as a proxy for the object location would introduce a large error into the trajectory estimation. This is true even if camera calibration information is used to translate the 2D points into 3D points, especially when the camera perspective and rotation are large.

The SV3DT algorithm in the multi-object tracker module of the DeepStream SDK addresses this issue by leveraging the 3D human modeling information, under the assumption that the cameras are mounted above head height. This is typically the case in most large camera network systems deployed in smart spaces. With this assumption, the head can be used as an anchor point when estimating the corresponding 3D human model location. Figure 3 (right) shows that the SV3DT algorithm can successfully find the matching 3D human model location, even when the person is severely occluded.
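A minimal sketch of this head-anchor idea, under the stated above-head camera assumption and again not DeepStream's implementation: back-project the top-center of the detection box onto a horizontal plane at an assumed person height, then drop straight down to the ground plane to obtain the foot location. The assumed height and the helper name are hypothetical.

```python
import numpy as np

PERSON_HEIGHT = 1.75   # assumed height of the cylindrical human model (meters)

def head_anchored_foot(P, head_pixel, height=PERSON_HEIGHT):
    """Estimate the 3D foot location from the head point of a partially occluded person.
    Intersects the camera ray through head_pixel (e.g., the top-center of the detection
    box) with the horizontal plane z = height, then drops to the ground plane (z = 0).
    P is a 3x4 projection matrix obtained from camera calibration."""
    u, v = head_pixel
    # Solve P @ [x, y, height, 1]^T = w * [u, v, 1]^T for the unknowns (x, y, w).
    A = np.column_stack([P[:, 0], P[:, 1], -np.array([u, v, 1.0])])
    b = -(P[:, 2] * height + P[:, 3])
    x, y, _ = np.linalg.solve(A, b)
    return np.array([x, y, 0.0])   # estimated foot location on the world ground plane
```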

Video 2 shows people being tracked in a convenience store. Note that the 3×4 projection matrix used does not account for lens distortion, even though this particular camera exhibits some, as you can see from the slightly curved horizontal lines. This introduces additional inaccuracies in the 3D human model location estimation, especially for people near the edges of the video frames.

Nonetheless, the 2D and 3D foot locations (indicated by green dots) of the people in the convenience store are accurately and robustly tracked. This enhances the accuracy of downstream analytics such as queue length monitoring and occupancy maps.

Video 2. People in the queue are tracked with their foot locations (marked with green dots), despite partial and full occlusions

Figure 4 shows how the foot location of each pedestrian is tracked robustly in a synthetic dataset, even when the majority of the lower body is occluded by large objects such as shelves.

Figure 4. SV3DT pedestrian location tracking with severe partial occlusions using a synthetic dataset

We believe that addressing partial occlusion problems opens up many possibilities in real-world applications. SV3DT is released in Alpha mode due to its limited object type support (standing people only); other cases, such as people sitting or lying down, and additional object types may be supported in future releases, along with further improvements. You can try it for your specific use cases and provide feedback in the DeepStream Forum.

DeepStream SV3DT use case 

One sample DeepStream SV3DT use case demonstrates how to enable single-view 3D tracking on the retail store video featured in this post and save the 3D metadata from the pipeline. You can visualize the convex hulls and foot locations from this data, as shown in Figure 4 and Video 2. The README also describes how to run the algorithm on your own videos. Visit NVIDIA-AI-IOT/deepstream_reference_apps for more details and refer to the DeepStream documentation.

Summary

Single-View 3D Tracking in NVIDIA DeepStream SDK helps mitigate partial occlusion issues in real-life IVA applications and deployments. The feature was first introduced in the 6.4 release and enhanced in 7.0. Specifically, SV3DT enables foot location estimation despite partial occlusions and more robust, accurate object tracking, which in turn results in accurate localization on the 3D ground plane. Businesses that rely on geospatial analytics are expected to benefit from this technology the most.

To get started, check out the latest DeepStream SDK release and try it in your challenging environment.
