# VisualVoice
**Repository Path**: facebookresearch/VisualVoice
## Basic Information
- **Project Name**: VisualVoice
- **Description**: Audio-Visual Speech Separation with Cross-Modal Consistency
- **Primary Language**: Python
- **License**: CC-BY-NC (portions under separate terms; see the Licence section below)
- **Default Branch**: master
- **Homepage**: None
## README
## VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
This repository contains the code for [VisualVoice](https://arxiv.org/pdf/2101.03149.pdf). [[Project Page]](http://vision.cs.utexas.edu/projects/VisualVoice/)
[VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency](https://arxiv.org/pdf/2101.03149.pdf)
[Ruohan Gao](https://www.cs.utexas.edu/~rhgao/)<sup>1,2</sup> and [Kristen Grauman](http://www.cs.utexas.edu/~grauman/)<sup>1,2</sup>
<sup>1</sup>UT Austin, <sup>2</sup>Facebook AI Research
In CVPR, 2021
If you find our data or project useful in your research, please cite:
    @inproceedings{gao2021VisualVoice,
      title = {VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency},
      author = {Gao, Ruohan and Grauman, Kristen},
      booktitle = {CVPR},
      year = {2021}
    }
### Demo with the pre-trained models
1. Download the pre-trained models and place them under `pretrained_models/`, which is where the command in step 3 expects them:
```
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/facial_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/lipreading_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/unet_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/vocal_best.pth
```
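As an optional sanity check that the checkpoints downloaded intact, here is a minimal sketch that loads each file on CPU and prints its top-level structure. The internal layout of the checkpoints is not documented here, so nothing beyond "it loads as a dict of weights" is assumed:
```python
# Optional sanity check: load each downloaded checkpoint on CPU and list its top-level keys.
# The internal structure of the checkpoints is an assumption; this only verifies they load.
import torch

for name in ["facial_best.pth", "lipreading_best.pth", "unet_best.pth", "vocal_best.pth"]:
    ckpt = torch.load(f"pretrained_models/{name}", map_location="cpu")
    keys = list(ckpt.keys()) if isinstance(ckpt, dict) else [type(ckpt).__name__]
    print(f"{name}: {len(keys)} top-level entries, e.g. {keys[:3]}")
```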
2. Preprocess the demo video with the following commands, which convert the video to 25 fps, resample the audio to 16 kHz, and track the speakers with a simple face-detector-based implementation. Using a more advanced face tracker of your choice can lead to better separation results.
```
ffmpeg -i ./test_videos/interview.mp4 -filter:v fps=fps=25 ./test_videos/interview25fps.mp4
mv ./test_videos/interview25fps.mp4 ./test_videos/interview.mp4
python ./utils/detectFaces.py --video_input_path ./test_videos/interview.mp4 --output_path ./test_videos/interview/ --number_of_speakers 2 --scalar_face_detection 1.5 --detect_every_N_frame 8
ffmpeg -i ./test_videos/interview.mp4 -vn -ar 16000 -ac 1 -ab 192k -f wav ./test_videos/interview/interview.wav
python ./utils/crop_mouth_from_video.py --video-direc ./test_videos/interview/faces/ --landmark-direc ./test_videos/interview/landmark/ --save-direc ./test_videos/interview/mouthroi/ --convert-gray --filename-path ./test_videos/interview/filename_input/interview.csv
```
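To confirm the preprocessing produced what the model expects (25 fps video, 16 kHz mono audio), a small check you can run afterwards; it assumes OpenCV and librosa are installed, neither of which is required by the commands above:
```python
# Check that the preprocessed demo video is 25 fps and the extracted audio is 16 kHz mono.
# Paths follow the preprocessing commands above.
import cv2
import librosa

cap = cv2.VideoCapture("./test_videos/interview.mp4")
print("video fps:", cap.get(cv2.CAP_PROP_FPS))                       # expect ~25.0
cap.release()

audio, sr = librosa.load("./test_videos/interview/interview.wav", sr=None, mono=False)
print("audio sample rate:", sr)                                      # expect 16000
print("audio channels:", 1 if audio.ndim == 1 else audio.shape[0])   # expect mono (1)
```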
3. Use the downloaded pre-trained models to test on the demo video.
```
python testRealVideo.py \
--mouthroi_root ./test_videos/interview/mouthroi/ \
--facetrack_root ./test_videos/interview/faces/ \
--audio_path ./test_videos/interview/interview.wav \
--weights_lipreadingnet pretrained_models/lipreading_best.pth \
--weights_facial pretrained_models/facial_best.pth \
--weights_unet pretrained_models/unet_best.pth \
--weights_vocal pretrained_models/vocal_best.pth \
--lipreading_config_path configs/lrw_snv1x_tcn2x.json \
--num_frames 64 \
--audio_length 2.55 \
--hop_size 160 \
--window_size 400 \
--n_fft 512 \
--unet_output_nc 2 \
--normalization \
--visual_feature_type both \
--identity_feature_dim 128 \
--audioVisual_feature_dim 1152 \
--visual_pool maxpool \
--audio_pool maxpool \
--compression_type none \
--reliable_face \
--audio_normalization \
--desired_rms 0.7 \
--number_of_speakers 2 \
--mask_clip_threshold 5 \
--hop_length 2.55 \
--lipreading_extract_feature \
--number_of_identity_frames 1 \
--output_dir_root ./test_videos/interview/
```
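For intuition about the audio flags: at 16 kHz, `--audio_length 2.55` corresponds to 40,800 samples; with `--n_fft 512` a (centered) STFT has 257 frequency bins, and with `--hop_size 160` it spans 256 frames. A quick check with librosa, whose framing is an assumption on my part; the repo may crop or pad the spectrogram internally:
```python
# Illustrate the spectrogram size implied by --audio_length 2.55, --hop_size 160,
# --window_size 400 and --n_fft 512 at 16 kHz. Librosa's centered STFT is an assumption
# about the exact framing used by the repo.
import numpy as np
import librosa

sr, audio_length, hop, win, n_fft = 16000, 2.55, 160, 400, 512
samples = int(sr * audio_length)                 # 40800 samples per 2.55 s segment
spec = librosa.stft(np.zeros(samples), n_fft=n_fft, hop_length=hop, win_length=win)
print(spec.shape)                                # (257, 256): freq bins x frames
```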
### Dataset preparation for VoxCeleb2
1. Download the [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html) dataset. The pre-processed mouth ROIs can be downloaded as follows:
```
# mouth ROIs for VoxCeleb2 (train: 1T; val: 20G; seen_heard_test: 88G; unseen_unheard_test: 20G)
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_train.tar.gz
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_val.tar.gz
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_seen_heard_test.tar.gz
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_unseen_unheard_test.tar.gz
# Directory structure of the dataset:
# ├── VoxCeleb2
# │   ├── [mp4] (contain the face tracks in .mp4)
# │   │   ├── [train]
# │   │   ├── [val]
# │   │   ├── [seen_heard_test]
# │   │   └── [unseen_unheard_test]
# │   ├── [audio] (contain the audio files in .wav)
# │   │   ├── [train]
# │   │   ├── [val]
# │   │   ├── [seen_heard_test]
# │   │   └── [unseen_unheard_test]
# │   └── [mouth_roi] (contain the mouth ROIs in .h5)
# │       ├── [train]
# │       ├── [val]
# │       ├── [seen_heard_test]
# │       └── [unseen_unheard_test]
```
2. Download the hdf5 files that contain the data paths, then edit each file so that the stored paths point to your own dataset root (a sketch for rewriting the prefix follows the download commands below).
```
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/train.h5
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/val.h5
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/seen_heard_test.h5
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/unseen_unheard_test.h5
```
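A minimal sketch for rewriting the root prefix stored in one of these files, assuming the paths live in string datasets; the original prefix and the dataset key layout are assumptions, so inspect one file with h5py before applying this to all four splits:
```python
# Hypothetical helper: replace the dataset-root prefix stored in a downloaded .h5 path list.
# OLD_PREFIX and the dataset key layout are assumptions -- inspect f.keys() and a few
# entries first, then adapt and run for train/val/seen_heard_test/unseen_unheard_test.
import h5py

OLD_PREFIX = "/original/root/VoxCeleb2"       # prefix baked into the downloaded file (assumption)
NEW_PREFIX = "/YOUR_DATASET_PATH/VoxCeleb2"   # your local dataset root

with h5py.File("hdf5/VoxCeleb2/train.h5", "r+") as f:
    for key in list(f.keys()):
        paths = [p.decode() if isinstance(p, bytes) else str(p) for p in f[key][:]]
        paths = [p.replace(OLD_PREFIX, NEW_PREFIX) for p in paths]
        del f[key]                            # rewrite the dataset in place
        f.create_dataset(key, data=paths)
```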
### Training and Testing
(The code has been tested under the following system environment: Ubuntu 18.04.3 LTS, CUDA 10.0, Python 3.7.3, PyTorch 1.3.0, torchvision 0.4.1, face-alignment 1.2.0, librosa 0.7.0, av 8.0.3)
1. Download the pre-trained cross-modal matching models used as initialization (the training command below expects them under `./pretrained_models/cross-modal-pretraining/`):
```
wget http://dl.fbaipublicfiles.com/VisualVoice/cross-modal-pretraining/facial.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/cross-modal-pretraining/vocal.pth
```
2. Use the following command to train the VisualVoice speech separation model:
```
python train.py \
--name exp \
--gpu_ids 0,1,2,3,4,5,6,7 \
--batchSize 128 \
--nThreads 32 \
--display_freq 10 \
--save_latest_freq 500 \
--niter 1 \
--validation_on True \
--validation_freq 200 \
--validation_batches 30 \
--num_batch 50000 \
--lr_steps 30000 40000 \
--coseparation_loss_weight 0.01 \
--mixandseparate_loss_weight 1 \
--crossmodal_loss_weight 0.01 \
--lr_lipreading 0.0001 \
--lr_facial_attributes 0.00001 \
--lr_unet 0.0001 \
--lr_vocal_attributes 0.00001 \
--num_frames 64 \
--audio_length 2.55 \
--hop_size 160 \
--window_size 400 \
--n_fft 512 \
--margin 0.5 \
--weighted_loss \
--visual_pool maxpool \
--audio_pool maxpool \
--optimizer adam \
--normalization \
--tensorboard True \
--mask_loss_type L2 \
--visual_feature_type both \
--unet_input_nc 2 \
--unet_output_nc 2 \
--compression_type none \
--mask_clip_threshold 5 \
--audioVisual_feature_dim 1152 \
--identity_feature_dim 128 \
--audio_normalization \
--lipreading_extract_feature \
--weights_facial ./pretrained_models/cross-modal-pretraining/facial.pth \
--weights_vocal ./pretrained_models/cross-modal-pretraining/vocal.pth \
--lipreading_config_path configs/lrw_snv1x_tcn2x.json \
--data_path hdf5/VoxCeleb2/ \
|& tee logs.txt
```
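For reference, the loss flags above weight the mix-and-separate (mask prediction) term at 1 and the co-separation and cross-modal consistency terms at 0.01 each. A hypothetical sketch of a weighted sum with these defaults; how train.py actually combines the terms may differ:
```python
# Hypothetical illustration of combining the three objectives with the weights passed above
# (--mixandseparate_loss_weight 1, --coseparation_loss_weight 0.01,
#  --crossmodal_loss_weight 0.01). The exact combination used in train.py may differ.
def total_loss(mask_loss, coseparation_loss, crossmodal_loss,
               w_mask=1.0, w_cosep=0.01, w_crossmodal=0.01):
    return w_mask * mask_loss + w_cosep * coseparation_loss + w_crossmodal * crossmodal_loss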
3. Use the following command to test on a synthetic mixture:
```
python test.py \
--audio1_path /YOUR_DATASET_PATH/VoxCeleb2/audio/seen_heard_test/id06688/akPwstwDxjE/00023.wav \
--audio2_path /YOUR_DATASET_PATH/VoxCeleb2/audio/seen_heard_test/id08606/0o-ZBLLLjXE/00002.wav \
--mouthroi1_path /YOUR_DATASET_PATH/VoxCeleb2/mouth_roi/seen_heard_test/id06688/akPwstwDxjE/00023.h5 \
--mouthroi2_path /YOUR_DATASET_PATH/VoxCeleb2/mouth_roi/seen_heard_test/id08606/0o-ZBLLLjXE/00002.h5 \
--video1_path /YOUR_DATASET_PATH/VoxCeleb2/mp4/seen_heard_test/id06688/akPwstwDxjE/00023.mp4 \
--video2_path /YOUR_DATASET_PATH/VoxCeleb2/mp4/seen_heard_test/id08606/0o-ZBLLLjXE/00002.mp4 \
--num_frames 64 \
--audio_length 2.55 \
--hop_size 160 \
--window_size 400 \
--n_fft 512 \
--weights_lipreadingnet pretrained_models/lipreading_best.pth \
--weights_facial pretrained_models/facial_best.pth \
--weights_unet pretrained_models/unet_best.pth \
--weights_vocal pretrained_models/vocal_best.pth \
--lipreading_config_path configs/lrw_snv1x_tcn2x.json \
--unet_output_nc 2 \
--normalization \
--mask_to_use pred \
--visual_feature_type both \
--identity_feature_dim 128 \
--audioVisual_feature_dim 1152 \
--visual_pool maxpool \
--audio_pool maxpool \
--compression_type none \
--mask_clip_threshold 5 \
--hop_length 2.55 \
--audio_normalization \
--lipreading_extract_feature \
--number_of_identity_frames 1 \
--output_dir_root test
```
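test.py builds the synthetic mixture itself from `--audio1_path` and `--audio2_path`; to make the setup concrete, here is a standalone sketch of how such a mixture can be formed by RMS-normalizing two 16 kHz clips and summing them. The 0.7 target RMS is borrowed from the demo flags, and the filenames are placeholders:
```python
# Standalone illustration of a two-speaker synthetic mixture: RMS-normalize each clip,
# truncate to the shorter one, and sum. Filenames are placeholders; test.py does its own mixing.
import numpy as np
import librosa
import soundfile as sf

def normalize_rms(audio, desired_rms=0.7):
    rms = np.sqrt(np.mean(audio ** 2)) + 1e-8
    return audio * (desired_rms / rms)

a1, _ = librosa.load("speaker1_clip.wav", sr=16000)
a2, _ = librosa.load("speaker2_clip.wav", sr=16000)
n = min(len(a1), len(a2))
mixture = normalize_rms(a1[:n]) + normalize_rms(a2[:n])
sf.write("mixture.wav", mixture, 16000)
```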
4. Evaluate the separation performance.
```
python evaluateSeparation.py --results_dir test/id06688_akPwstwDxjE_00023VSid08606_0o-ZBLLLjXE_00002
```
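evaluateSeparation.py reports separation quality for a results directory. If you want to compute the standard BSS metrics (SDR/SIR/SAR) yourself, here is a sketch with mir_eval; the filenames are placeholders, since the actual layout of the results directory is defined by test.py and evaluateSeparation.py:
```python
# Hedged sketch: compute SDR/SIR/SAR for one two-speaker mixture with mir_eval.
# Ground-truth and estimated filenames are placeholders, not the repo's actual output names.
import numpy as np
import librosa
from mir_eval.separation import bss_eval_sources

gt1, _ = librosa.load("gt_speaker1.wav", sr=16000)
gt2, _ = librosa.load("gt_speaker2.wav", sr=16000)
est1, _ = librosa.load("sep_speaker1.wav", sr=16000)
est2, _ = librosa.load("sep_speaker2.wav", sr=16000)

n = min(map(len, (gt1, gt2, est1, est2)))
reference = np.stack([gt1[:n], gt2[:n]])
estimated = np.stack([est1[:n], est2[:n]])
sdr, sir, sar, perm = bss_eval_sources(reference, estimated)  # perm gives the best matching
print("SDR:", sdr, "SIR:", sir, "SAR:", sar)
```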
### Model Variants
1. Audio-visual speech separation model tailored to 2 speakers (with context): see the subdirectory `av-separation-with-context/`.
2. Audio-visual speech enhancement code: see the subdirectory `av-enhancement/`.
### Acknowledgements
Some of the code is borrowed or adapted from [Co-Separation](https://github.com/rhgao/co-separation). The code for the lip analysis network is adapted from [Lipreading using Temporal Convolutional Networks](https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks).
### Licence
The majority of VisualVoice is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: the license for Lipreading using Temporal Convolutional Networks is available at https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/blob/master/LICENSE.