As automatic speech recognition evolves, the deployment of voice user interfaces (VUIs) has expanded rapidly. VUIs have gained further attention in online communication, particularly since the COVID-19 pandemic, owing to their contact-free operation. However, VUIs remain difficult to deploy in public settings because ambient noise degrades the received audio signal. In this paper, we propose Wavoice, the first noise-resistant multi-modal speech recognition system that fuses two distinct voice sensing modalities: millimeter-wave (mmWave) signals and audio signals from a microphone. One key contribution is to model the inherent correlation between mmWave and audio signals. Building on this correlation, Wavoice enables real-time, noise-resistant voice activity detection and user targeting among multiple speakers. Additionally, we design two novel multi-modal fusion modules embedded in the neural network, enabling accurate speech recognition. Extensive experiments demonstrate Wavoice's effectiveness under adverse conditions: it achieves a character error rate below 1% at distances of up to 7 meters. Wavoice considerably outperforms existing audio-only speech recognition methods in robustness and accuracy, achieving lower character and word error rates.
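The abstract does not specify the fusion architecture itself. As a minimal illustrative sketch only (the module name `CrossModalFusion`, the cross-attention design, and all feature dimensions below are assumptions, not the paper's actual modules), one common way to fuse two such feature streams in PyTorch lets audio features attend to mmWave features, so noise-corrupted audio frames can borrow vibration cues:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical mmWave-audio fusion sketch.

    NOTE: illustrative only; Wavoice's actual fusion modules
    are not described in the abstract.
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Cross-attention: audio frames query the mmWave sequence.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, mmwave: torch.Tensor) -> torch.Tensor:
        # audio, mmwave: (batch, time, dim) per-frame feature sequences
        fused, _ = self.attn(query=audio, key=mmwave, value=mmwave)
        # Residual connection keeps the original audio stream intact.
        return self.norm(audio + fused)

if __name__ == "__main__":
    # Toy shapes: one utterance, 100 frames, 256-dim features per modality.
    audio = torch.randn(1, 100, 256)
    mmwave = torch.randn(1, 100, 256)
    print(CrossModalFusion()(audio, mmwave).shape)  # torch.Size([1, 100, 256])
```

The residual design reflects a general principle in noise-resistant fusion: the noise-free modality augments rather than replaces the audio path, so recognition degrades gracefully when either stream is weak.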