Cameras embedded in multisensory systems with an integrated computer are rapidly becoming prevalent, largely because of smartphones. Computer vision is thus evolving into cross-modal sensing, in which vision and other sensors cooperate. An analogue exists in nature: humans and animals perceive visual events that are often accompanied by sounds. We pursue the idea of using visual input to assist the denoising of another modality. Audio-only denoising is very difficult in complex acoustic conditions, and cross-modal association can help. A clear video can guide the audio estimator, which we demonstrate using an example-based approach.
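To make the idea concrete, the sketch below illustrates one generic form of example-based cross-modal denoising under simplifying assumptions: a database of paired (video feature, clean audio) exemplars is queried by visual similarity, and the clean-audio counterparts of the nearest exemplars are blended to estimate the denoised audio. All names (`build_exemplar_db`, `denoise_with_video`, `k`, `sigma`) and the matching scheme are hypothetical, chosen only for illustration; this is not the specific method developed here.

```python
import numpy as np


def build_exemplar_db(video_feature_clips, clean_audio_clips):
    """Hypothetical helper: pair each short video-feature vector with its
    time-aligned clean audio segment to form the exemplar database."""
    return list(zip(video_feature_clips, clean_audio_clips))


def denoise_with_video(video_feature_segs, exemplars, k=5, sigma=1.0):
    """Toy example-based cross-modal denoising (illustrative only).

    For each observed video segment, find the k exemplars whose *video*
    features are closest, then estimate the clean audio as a
    similarity-weighted average of their *clean audio* counterparts.
    """
    denoised = []
    for vid_feat in video_feature_segs:
        # Visual distances between the observed segment and all exemplars.
        dists = np.array([np.linalg.norm(vid_feat - ex_vid)
                          for ex_vid, _ in exemplars])
        nearest = np.argsort(dists)[:k]
        # Turn visual distances into soft weights (Gaussian kernel).
        weights = np.exp(-dists[nearest] ** 2 / (2.0 * sigma ** 2))
        weights /= weights.sum()
        # Weighted combination of the matching clean-audio exemplars.
        estimate = sum(w * exemplars[i][1] for w, i in zip(weights, nearest))
        denoised.append(estimate)
    return denoised
```

The key design point the sketch is meant to convey is that the noisy audio itself never drives the search: the (assumed clean) video selects which clean-audio examples are plausible, so the visual channel directs the audio estimate.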