One of the key problems in autonomous driving is how to represent the surrounding world in a digestible way. There are many modalities to choose from, each of which provides different context about a situation. It is therefore important to choose the modalities that offer the greatest information gain. We explore the impact of adding and removing modalities under early multimodal fusion in the context of conditional imitation learning. We test four modalities: RGB, LiDAR, optical flow, and velocity. Our model consists of two parts: a feature encoder and an autoregressive waypoint predictor. We use two encoder architectures in our experiments. The first is simply a pre-trained EfficientNet, while the second is an EfficientNet that feeds into a transformer at the last block. We find that optical flow improves the model’s performance, although training becomes very unstable under harsh augmentations of the RGB images. We conclude that optical flow provides a key representation for end-to-end multimodal conditional imitation learning models; however, perturbing the RGB images drastically decreases model performance, effectively only adding noise to the model.
You can find our paper here, and our video presentation here.
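
To make the idea of early fusion concrete, the following is a minimal sketch, not our actual implementation. It assumes every modality can be expressed as one or more image-like planes of the same resolution and that fusion amounts to stacking those planes along the channel axis before the encoder sees them. The names `constant_plane` and `early_fuse`, the one-channel LiDAR projection, and the broadcasting of velocity to a constant plane are all hypothetical choices made for illustration.

```python
from typing import Dict, List

Plane = List[List[float]]  # a single H x W channel


def constant_plane(value: float, height: int, width: int) -> Plane:
    """Broadcast a scalar (e.g. speed) to an H x W plane. Assumed encoding."""
    return [[value] * width for _ in range(height)]


def early_fuse(modalities: Dict[str, List[Plane]], selected: List[str]) -> List[Plane]:
    """Concatenate the channels of the selected modalities, in the given order."""
    fused: List[Plane] = []
    shape = None
    for name in selected:
        for plane in modalities[name]:
            plane_shape = (len(plane), len(plane[0]))
            if shape is None:
                shape = plane_shape
            elif plane_shape != shape:
                raise ValueError(f"{name} has shape {plane_shape}, expected {shape}")
            fused.append(plane)
    return fused


H, W = 4, 6  # toy resolution, chosen only for the example
inputs = {
    "rgb": [constant_plane(0.5, H, W) for _ in range(3)],
    "lidar": [constant_plane(0.0, H, W)],                   # assumed 1-channel projection
    "flow": [constant_plane(0.1, H, W) for _ in range(2)],  # (dx, dy) components
    "velocity": [constant_plane(3.2, H, W)],                # assumed broadcast scalar
}

# Adding or removing a modality only changes the channel count of the fused input.
full = early_fuse(inputs, ["rgb", "lidar", "flow", "velocity"])
no_flow = early_fuse(inputs, ["rgb", "lidar", "velocity"])
print(len(full), len(no_flow))
```

Under these assumptions, an ablation is just a different `selected` list: the full input has seven channels and the input without optical flow has five, so only the first layer of the encoder needs to change between runs.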