Work
Here is some of my work 📚
Publications
Visual Localization via Few-Shot Scene Region Classification (3DV 2022)
Siyan Dong*, Shuzhe Wang*, Yixin Zhuang, Juho Kannala, Marc Pollefeys, Baoquan Chen
Visual (re)localization addresses the problem of estimating the 6-DoF (Degree of Freedom) camera pose of a query image captured in a known scene, which is a key building block of many computer vision and robotics applications. Recent advances in structure-based localization solve this problem by memorizing the mapping from image pixels to scene coordinates with neural networks to build 2D-3D correspondences for camera pose optimization. However, such memorization requires training on a large number of posed images in each scene, which is heavy and inefficient. In contrast, a few images are usually sufficient to cover the main regions of a scene for a human operator to perform visual localization. In this paper, we propose a scene region classification approach to achieve fast and effective scene memorization with few-shot images. Our insight is to leverage a) a pre-learned feature extractor, b) a scene region classifier, and c) a meta-learning strategy to accelerate training while mitigating overfitting. We evaluate our method on both indoor and outdoor benchmarks. The experiments validate the effectiveness of our method in the few-shot setting, and the training time is significantly reduced to only a few minutes.
Digging Into Self-Supervised Learning of Feature Descriptors (3DV 2021)
Iaroslav Melekhov, Zakaria Laskar, Xiaotian Li, Shuzhe Wang*, Juho Kannala
Fully-supervised CNN-based approaches for learning local image descriptors have shown remarkable results in a wide range of geometric tasks. However, most of them require per-pixel ground-truth keypoint correspondence data which is difficult to acquire at scale. To address this challenge, recent weakly- and self-supervised methods can learn feature descriptors from relative camera poses or using only synthetic rigid transformations such as homographies. In this work, we focus on understanding the limitations of existing self-supervised approaches and propose a set of improvements that combined lead to powerful feature descriptors. We show that increasing the search space from in-pair to in-batch for hard negative mining brings consistent improvement. To enhance the discriminativeness of feature descriptors, we propose a coarse-to-fine method for mining local hard negatives from a wider search space by using global visual image descriptors. We demonstrate that a combination of synthetic homography transformation, color augmentation, and photorealistic image stylization produces useful representations that are viewpoint and illumination invariant. The feature descriptors learned by the proposed approach perform competitively and surpass their fully- and weakly-supervised counterparts on various geometric benchmarks such as image-based localization, sparse feature matching, and image retrieval.
Continual Learning for Image-Based Camera Localization (ICCV 2021)
Shuzhe Wang*, Zakaria Laskar*, Iaroslav Melekhov, Xiaotian Li, Juho Kannala
For several emerging technologies such as augmented reality, autonomous driving and robotics, visual localization is a critical component. Directly regressing camera pose/3D scene coordinates from the input image using deep neural networks has shown great potential. However, such methods assume a stationary data distribution with all scenes simultaneously available during training. In this paper, we approach the problem of visual localization in a continual learning setup, whereby the model is trained on scenes in an incremental manner. Our results show that, similar to the classification domain, non-stationary data induces catastrophic forgetting in deep networks for visual localization. To address this issue, a strong baseline based on storing and replaying images from a fixed buffer is proposed. Furthermore, we propose a new sampling method based on coverage score (Buff-CS) that adapts the existing sampling strategies in the buffering process to the problem of visual localization. Results demonstrate consistent improvements over standard buffering methods on two challenging datasets, 7Scenes and 12Scenes, and also on 19Scenes, which is obtained by combining the former scenes.
Hierarchical Scene Coordinate Classification and Regression for Visual Localization (CVPR 2020)
Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, Juho Kannala
Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, and thus the matching is implicitly performed by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The network consists of a series of output layers, each of them conditioned on the previous ones. The final output layer predicts the 3D coordinates and the others produce progressively finer discrete location labels. The proposed method outperforms the baseline regression-only network and allows us to train compact models which scale robustly to large environments. It sets a new state-of-the-art for single-image RGB localization performance on the 7-Scenes, 12-Scenes, and Cambridge Landmarks datasets, and on three combined scenes. Moreover, for large-scale outdoor localization on the Aachen Day-Night dataset, we present a hybrid approach which outperforms existing scene coordinate regression methods and significantly reduces the performance gap w.r.t. explicit feature matching methods.
Thesis
Visual-Inertial Odometry Aided Temporal Camera Relocalization (Master's thesis)
Shuzhe Wang, Supervisor: Prof. Juho Kannala; Advisor: Xiaotian Li
The goal of Temporal Camera Relocalization (TCR) is to efficiently and effectively estimate the 6-DoF camera pose w.r.t. the world coordinate system. It is one of the fundamental problems in Augmented Reality and Autonomous Driving. However, most current approaches focus on one-shot localization, estimating the camera pose from a single RGB image, and the accuracy of TCR methods falls behind the state-of-the-art one-shot methods even when the time dependency is taken into account.
This thesis proposes a novel Temporal Camera Relocalization pipeline, which consists of three parts: global keyframe localization, local odometry, and fusion algorithms. The global localization has a hierarchical structure and can output image poses with high accuracy; the local tracking is provided by the latest visual-inertial odometry platform. Two fusion algorithms, a global constraints method and a particle filter based method, are proposed in this thesis to utilize both global and local information for temporal camera relocalization. Experimental results show that both methods have promising performance, with a mean error of less than 0.48 m/0.68° in a space of 30 × 20 × 2 m³. The global constraints method achieves the best result with a mean error of 0.22 m/0.2°, while the particle filter based method is robust to errors in global pose estimation and maintains its performance when the accuracy of global localization drops significantly.
Projects
Image-based large-scale indoor visual localization on mobile devices
The project creates a localization demo for large-scale indoor environments. I contributed to building the database and the SfM model, evaluating state-of-the-art approaches such as Hierarchical Localization, HSC-Net, and DSAC++, and establishing the client-server connection.
[Video]
Ship Thruster Interface
The project provides an intelligent maneuvering system for tugboats, enabling multiple vessels to coordinate their movements efficiently while assisting larger ships entering and leaving the harbor. I mainly worked on the design of the control node and on the communication between ROS and the physical layer.