Multimodal speech recognition method and system, and computer-readable storage medium