The OmniDrone project researched unmanned aerial vehicles (UAVs) and the key challenges that must be overcome to make drones reliable and easy to use. It focused on increasing a drone’s intelligence and awareness of its environment, using 720-degree stereo omnidirectional camera systems. The main technological objectives of OmniDrone were to enable real-time, on-board, 720-degree cognitive vision and to exploit it for ultra-reliable operation and pilot assistance, giving drones increased reliability and ease of use in security and surveillance, professional video broadcast, inspection, and traffic monitoring.
Within this project, my task was to enable fast and reliable object detection on board the drone, using an embedded NVIDIA Jetson TX2 platform. I first investigated the potential gains of fusing RGB and depth data to increase detection accuracy. I developed a model that processes both inputs in parallel and fuses the two streams at a parametrizable point in the network. Extensive experimentation then demonstrated that fusing both streams towards the end of the network yields the best results, surpassing both the RGB-only and depth-only models.
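To make the fusion approach concrete, the sketch below shows in PyTorch how a two-stream RGB+depth backbone with a parametrizable fusion point can be structured. It is a minimal illustration under assumed layer widths and a simple 1x1-convolution merge; the conv_block helper, the FusionBackbone class and its channel sizes are illustrative and not the actual OmniDrone network, and the detection head is omitted.

```python
# Minimal PyTorch sketch (not the actual OmniDrone code) of a two-stream
# RGB+depth backbone that fuses both streams at a configurable block index.
import torch
import torch.nn as nn


def conv_block(c_in, c_out):
    """3x3 strided convolution + batch norm + leaky ReLU, YOLO-style."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )


class FusionBackbone(nn.Module):
    """Parallel RGB and depth streams, concatenated and merged at block `fuse_layer`.

    fuse_layer=0 corresponds to early (input) fusion; larger values move the
    fusion point towards the end of the network (late fusion).
    """

    def __init__(self, channels=(32, 64, 128, 256, 512), fuse_layer=3):
        super().__init__()
        assert 0 <= fuse_layer <= len(channels)
        c_rgb, c_depth = 3, 1  # depth is assumed to be a single-channel map
        rgb_layers, depth_layers = [], []
        for c_out in channels[:fuse_layer]:
            rgb_layers.append(conv_block(c_rgb, c_out))
            depth_layers.append(conv_block(c_depth, c_out))
            c_rgb = c_depth = c_out
        self.rgb_stream = nn.Sequential(*rgb_layers)    # empty = identity
        self.depth_stream = nn.Sequential(*depth_layers)
        # A 1x1 convolution merges the concatenated feature maps into one stream.
        self.fuse = nn.Conv2d(c_rgb + c_depth, c_rgb, kernel_size=1)
        fused_layers, c_in = [], c_rgb
        for c_out in channels[fuse_layer:]:
            fused_layers.append(conv_block(c_in, c_out))
            c_in = c_out
        self.fused_stream = nn.Sequential(*fused_layers)
        # A single-shot detection head would normally follow here (omitted).

    def forward(self, rgb, depth):
        x = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.fused_stream(self.fuse(x))


# Example: fuse after the third block on a 416x416 RGB frame plus depth map.
model = FusionBackbone(fuse_layer=3)
features = model(torch.randn(1, 3, 416, 416), torch.randn(1, 1, 416, 416))
```

Because the fusion point is a single constructor argument, the same code can be retrained with different values of fuse_layer to compare early, mid and late fusion.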
Afterwards, I researched how best to combine a variety of speed-optimisation techniques for convolutional neural networks, in order to achieve real-time detection on the device. The outcome of this research was a model that runs 15 times faster than the original while even increasing the accuracy on an industrial dataset.
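As an illustration of the kind of optimisation involved, the snippet below sketches one of these techniques: replacing a regular 3x3 convolution with a depth-wise separable one and comparing the weight counts. The layer sizes are arbitrary examples and this is not the project’s actual code.

```python
# Hedged sketch of one speed optimisation: swapping a regular 3x3 convolution
# for a depth-wise separable one (a per-channel 3x3 convolution followed by a
# 1x1 point-wise convolution), which sharply reduces weights and computations.
import torch.nn as nn


def regular_conv(c_in, c_out):
    return nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)


def depthwise_separable_conv(c_in, c_out):
    return nn.Sequential(
        # depth-wise: one 3x3 filter per input channel (groups=c_in)
        nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in, bias=False),
        # point-wise: 1x1 convolution that mixes the channels
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
    )


def num_weights(module):
    return sum(p.numel() for p in module.parameters())


# For a 256->512 layer this drops from ~1.18M to ~0.13M weights (roughly 9x fewer).
print(num_weights(regular_conv(256, 512)), num_weights(depthwise_separable_conv(256, 512)))
```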
References
Improving Real-Time Pedestrian Detectors with RGB+Depth Fusion
In this paper we investigate the benefit of using depth information on top of normal RGB for camera-based pedestrian detection. Indeed, depth information is easily acquired using depth cameras such as a Kinect or stereo setups. We investigate the best way to perform this sensor fusion, with a special focus on lightweight single-pass CNN architectures that enable real-time processing on limited hardware. We implement different network architectures, each fusing depth at a different layer of our network. Our experiments show that midway fusion performs best, substantially outperforming a regular RGB detector in accuracy. Moreover, we show that our fusion network is better at detecting individuals in a crowd, as it both localizes pedestrians more accurately and handles occluded persons better. The resulting network is computationally efficient and achieves real-time performance on both desktop and embedded GPUs.
Exploring RGB+Depth Fusion for Real-Time Object Detection
In this paper, we investigate whether fusing depth information on top of normal RGB data for camera-based object detection can help to increase the performance of current state-of-the-art single-shot detection networks. Indeed, depth information is easily acquired using depth cameras such as a Kinect or stereo setups. We investigate the optimal way to perform this sensor fusion, with a special focus on lightweight single-pass convolutional neural network (CNN) architectures that enable real-time processing on limited hardware. For this, we implement a network architecture that allows us to parameterize at which network layer the two information sources are fused together. We perform exhaustive experiments to determine the optimal fusion point in the network, from which we conclude that fusing towards the mid-to-late layers provides the best results. Our best fusion models significantly outperform the baseline RGB network in both accuracy and localization of the detections.
Investigating the Potential of Network Optimization for a Constrained Object Detection Problem
Object detection models are usually trained and evaluated on highly complicated, challenging academic datasets, which results in deep networks requiring a large amount of computation. However, many operational use cases consist of more constrained situations: they have a limited number of classes to be detected, less intra-class variance, less lighting and background variance, constrained or even fixed camera viewpoints, etc. In these cases, we hypothesize that smaller networks could be used without deteriorating the accuracy. However, there are multiple reasons why this does not happen in practice. Firstly, overparameterized networks tend to learn better, and secondly, transfer learning is usually used to reduce the necessary amount of training data. In this paper, we investigate how much we can reduce the computational complexity of a standard object detection network in such constrained object detection problems. As a case study, we focus on a well-known single-shot object detector, YoloV2, and combine three different techniques to reduce the computational complexity of the model without reducing its accuracy on our target dataset. To investigate the influence of the problem complexity, we compare two datasets: a prototypical academic dataset (Pascal VOC) and a real-life operational dataset (LWIR person detection). The three optimization steps we exploited are: swapping all convolutions for depth-wise separable convolutions, pruning, and weight quantization. The results of our case study indeed substantiate our hypothesis that the more constrained a problem is, the more the network can be optimized. On the constrained operational dataset, combining these optimization techniques allowed us to reduce the computational complexity by a factor of 349, compared to only a factor of 9.8 on the academic dataset. When running a benchmark on an Nvidia Jetson AGX Xavier, our fastest model runs more than 15 times faster than the original YoloV2 model, whilst increasing the accuracy by 5% Average Precision (AP).
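As an illustration of the pruning step mentioned in this abstract, the sketch below applies PyTorch’s built-in structured pruning to a stand-in convolutional model. The pruning criterion, ratio and model are assumptions for demonstration, not the exact procedure from the paper, and the depth-wise separable swap and weight quantization are omitted here.

```python
# Hedged sketch of channel pruning using PyTorch's pruning utilities on a
# stand-in model; the paper's actual criterion, ratio and schedule may differ.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(              # stand-in for a YoloV2-style backbone
    nn.Conv2d(3, 32, 3, padding=1),
    nn.Conv2d(32, 64, 3, padding=1),
    nn.Conv2d(64, 128, 3, padding=1),
)

# Structured pruning: zero out 30% of the output filters of every convolution,
# ranked by the L1 norm of their weights (dim=0 selects whole filters).
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=1, dim=0)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"{zeros / total:.0%} of the weights are now zero")
# Note: the zeroed filters still have to be physically removed from the
# network to obtain an actual speed-up on embedded hardware such as a Jetson.
```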
Real-Time Embedded Computer Vision on UAVs: UAVision2020 Workshop Summary
In this paper we present an overview of the contributed work presented at the UAVision2020 (International workshop on Computer Vision for UAVs) ECCV workshop. Note that during ECCV2020 this workshop was merged with the VisDrone2020 workshop. This paper only summarizes the results of the regular paper track and the ERTI challenge. The workshop focused on real-time image processing on board Unmanned Aerial Vehicles (UAVs). For such applications, the computational complexity of state-of-the-art computer vision algorithms often conflicts with the need for real-time operation and the extreme resource limitations of the hardware. Apart from a summary of the accepted workshop papers and an overview of the challenge, this work also aims to identify common challenges and concerns that were addressed by multiple authors during the workshop, together with their proposed solutions.