# UnsupervisedDecomposition

PyTorch original implementation of "[Unsupervised Question Decomposition for Question Answering](https://arxiv.org/abs/2002.09758)" (EMNLP 2020).

TL;DR: We decompose hard (multi-hop) questions into several easier (single-hop) questions using unsupervised learning. Our decompositions improve multi-hop QA on [HotpotQA](https://arxiv.org/pdf/1809.09600.pdf) without requiring extra supervision to decompose questions.

## Overview

`XLM` contains the code to train (Unsupervised) Seq2Seq models, based on the code from [XLM](https://github.com/facebookresearch/XLM). We made the following changes/additions:
- Unsupervised stopping criterion
- Tensorboard logging
- Data preprocessing scripts
- Minor fixes for bugs in the original XLM code
- When initializing a smaller Seq2Seq model with XLM_en pretrained weights, automatically initialize the encoder with the first XLM_en layer weights and the decoder with the remaining layer weights.

`pytorch-transformers` contains the code to train question answering models (single-hop and multi-hop), based on the code from [transformers](https://github.com/huggingface/transformers). We made the following additions:
- Scripts/notebooks to preprocess data
- Additions to evaluation to handle/evaluate on HotpotQA (i.e., extend the single-paragraph SQuAD implementation to the multi-paragraph setting)

*10/2020 update: added additional data and resources*:
* Simple and multi-hop mined questions
* Multi-hop QA model checkpoints
* MLM pretraining data
* Unsupervised MT training data

## Installation

Create an anaconda3 environment (we used anaconda3 version 5.0.1):
```bash
conda create -y -n UnsupervisedDecomposition python=3.7
conda activate UnsupervisedDecomposition
# Install PyTorch 1.0. We used CUDA 10.0 (with NCCL/2.4.7-1); see https://pytorch.org/ to install with other CUDA versions:
conda install -y pytorch=1.0 torchvision cudatoolkit=10.0 -c pytorch
conda install faiss-gpu cudatoolkit=10.0 -c pytorch  # For CUDA 10.0
pip install -r requirements.txt
python -m spacy download en_core_web_lg  # Download the spaCy model used for NER
```

If your hardware supports half-precision (fp16), you can install NVIDIA [apex](https://github.com/NVIDIA/apex) to speed up QA model training.

Also, set the `MAIN_DIR` variable to point to the main directory for this repo, e.g.:
```bash
export MAIN_DIR=/path/to/UnsupervisedDecomposition
```

## Downloading and Preprocessing Data

Run `download_data.sh` once to download/prepare the necessary files for decomposition and question answering training, e.g.:
```bash
bash download_data.sh --main_dir $MAIN_DIR
```

See below to train a decomposition model, or skip to "QA Model Training" to train a question answering model given our trained decomposition model (`XLM/dumped/umt.dev1.pseudo_decomp.replace_entity_by_type/20639223/best-valid_mlm_ppl.pth`).
You can view our generations from the model in the downloaded files `XLM/dumped/umt.dev1.pseudo_decomp.replace_entity_by_type/20639223/hyp.st=0.0.bs=5.lp=1.0.es=False.seed=0.mh-sh.{train|valid}.pred.bleu.sh.txt`.
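For a quick sanity check, you can print a few multi-hop questions alongside their generated decompositions. Below is a minimal sketch (not part of the repo), which assumes the hypothesis file is line-aligned with the corresponding `train.mh`/`valid.mh` source file:
```python
# peek_decompositions.py -- illustrative only; file paths and line alignment are assumptions
import itertools
import sys

src_path, hyp_path = sys.argv[1:3]  # e.g. .../processed/valid.mh and .../hyp...valid.pred.bleu.sh.txt
with open(src_path) as src, open(hyp_path) as hyp:
    for mh, sh in itertools.islice(zip(src, hyp), 5):
        print("multi-hop :", mh.strip())
        print("decomposed:", sh.strip())
        print()
```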
## Unsupervised Decomposition Training

Create pseudo-decomposition training data using FastText embeddings and entity replacement with `create_pseudo_decompositions.sh`, e.g.:
```bash
bash create_pseudo_decompositions.sh --main_dir $MAIN_DIR
```
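At a high level, this step pairs each hard (multi-hop) question with mined simple (single-hop) questions that are close to it in embedding space. The sketch below illustrates only the retrieval idea: `embed` is a hypothetical stand-in for a bag-of-FastText-vectors sentence embedder, and the pair score (relevance to the hard question minus redundancy between the two sub-questions) is a simplification. See `pytorch-transformers/pseudoalignment/pseudo_decomp_fasttext.py` for the actual objective, and `create_pseudo_decompositions.sh` for the full pipeline including entity replacement.
```python
from itertools import combinations
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence embedder standing in for averaged FastText word vectors."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(300)
    return v / np.linalg.norm(v)

def pseudo_decompose(hard_question, simple_questions, beam_size=100):
    """Pick a pair of simple questions whose embeddings best cover the hard question."""
    q = embed(hard_question)
    S = np.stack([embed(s) for s in simple_questions])  # (N, d), unit-normalized rows
    candidates = np.argsort(S @ q)[::-1][:beam_size]    # nearest simple questions to q
    best_pair, best_score = None, -np.inf
    for i, j in combinations(candidates, 2):            # exhaustive search over candidate pairs
        score = S[i] @ q + S[j] @ q - S[i] @ S[j]        # relevance minus redundancy (simplified)
        if score > best_score:
            best_pair, best_score = (simple_questions[i], simple_questions[j]), score
    return best_pair
```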
Then, train an Unsupervised Seq2Seq model as follows (initializing from our pre-trained MLM model):
```bash
# Set the following parameters based on your hardware
export NPROC_PER_NODE=8  # Use 1 for single-GPU training
export N_NODES=1  # Use >1 for multi-node training (where each node has NPROC_PER_NODE GPUs)
BS=32  # Make the batch size smaller if the GPU goes out of memory. Effective batch size is BS*NPROC_PER_NODE*N_NODES
# Pseudo-decomposition training data folder created by create_pseudo_decompositions.sh
DATA_FOLDER=dev1.pseudo_decomp.replace_entity_by_type
# Select an MLM initialization checkpoint (for now, let's load the MLM we already pre-trained)
MLM_INIT=dumped/mlm.dev1.pseudo_decomp_random.mined/best-valid_mlm_ppl.pth

# Train the USeq2Seq model
export NGPU=$NPROC_PER_NODE
if [[ $NPROC_PER_NODE -gt 1 ]]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NPROC_PER_NODE"; else DIST_OPTS=""; fi
NUM_TRAIN=`wc -l < data/umt/$DATA_FOLDER/processed/train.mh`
python $DIST_OPTS train.py --exp_name umt.$DATA_FOLDER --data_path data/umt/$DATA_FOLDER/processed --dump_path ./dumped/ --reload_model "$MLM_INIT,$MLM_INIT" --encoder_only false --emb_dim 2048 --n_layers 6 --n_heads 16 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --use_lang_emb true --lgs 'mh-sh' --ae_steps 'mh,sh' --bt_steps 'mh-sh-mh,sh-mh-sh' --stopping_criterion 'valid_mh-sh-mh_mt_effective_goods_back_bleu,2' --validation_metrics 'valid_mh-sh-mh_mt_effective_goods_back_bleu' --eval_bleu true --epoch_size $((4*NUM_TRAIN/(NPROC_PER_NODE*N_NODES))) --lambda_ae '0:1,100000:0.1,300000:0' --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.00003' --tokens_per_batch 1024 --batch_size $BS --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --max_len 128 --bptt 128 --save_periodic 0 --split_data true --validation_weight 0.5
```

New: to train UMT with the same data we used, download our splits [here](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/umt_training_data.tar.gz).
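Training stops based on an unsupervised criterion (`valid_mh-sh-mh_mt_effective_goods_back_bleu`): roughly, round-trip (back-translation) BLEU that also accounts for whether generations look like well-formed decompositions. By default, a generation with anything other than two sub-questions counts as bad (see "Variable Number of Sub-Questions" below for the relaxed check). A rough, hypothetical sketch of the well-formedness filter only, assuming sub-questions can be counted by their trailing `?` (the real metric is computed inside the XLM training code):
```python
def num_sub_questions(decomposition: str) -> int:
    """Assumption for illustration: each sub-question in a generated decomposition ends with '?'."""
    return decomposition.count("?")

def fraction_well_formed(decompositions, one_to_variable=False):
    """Share of generated decompositions with an acceptable number of sub-questions.

    Default: exactly 2 sub-questions are expected; with --one_to_variable,
    only decompositions with fewer than 2 sub-questions count as bad.
    """
    counts = [num_sub_questions(d) for d in decompositions]
    good = sum(n >= 2 if one_to_variable else n == 2 for n in counts)
    return good / max(len(counts), 1)
```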
### Seq2Seq Decomposition Training (Optional)

Alternatively, you can train a standard Seq2Seq model as follows:
```bash
export NPROC_PER_NODE=8  # Use 1 for single-GPU training
export N_NODES=1  # Use >1 for multi-node training (where each node has NPROC_PER_NODE GPUs)
BS=128  # Make the batch size smaller if the GPU goes out of memory. Effective batch size is BS*NPROC_PER_NODE*N_NODES
MLM_INIT=dumped/mlm.dev1.pseudo_decomp_random.mined/best-valid_mlm_ppl.pth
export NGPU=$NPROC_PER_NODE
if [[ $NPROC_PER_NODE -gt 1 ]]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NPROC_PER_NODE"; else DIST_OPTS=""; fi
DATA_FOLDER=dev1.pseudo_decomp.replace_entity_by_type
DATA_PATH=data/umt/$DATA_FOLDER/processed
NUM_TRAIN=`wc -l < $DATA_PATH/train.mh`
python $DIST_OPTS train.py --exp_name mt.$DATA_FOLDER --data_path $DATA_PATH --dump_path ./dumped/ --reload_model "$MLM_INIT,$MLM_INIT" --encoder_only false --emb_dim 2048 --n_layers 6 --n_heads 16 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --use_lang_emb true --lgs 'mh-sh' --mt_steps 'mh-sh,sh-mh' --stopping_criterion 'valid_mh-sh_mt_bleu,2' --validation_metrics 'valid_mh-sh_mt_bleu' --eval_bleu true --epoch_size $((2*NUM_TRAIN/(NPROC_PER_NODE*N_NODES))) --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --tokens_per_batch 1024 --batch_size $BS --max_len 128 --bptt 128 --split_data true
```

You can also use the trained Seq2Seq model checkpoint as the pre-trained initialization (`MLM_INIT`) for USeq2Seq training, as our Curriculum Seq2Seq approach does (see Appendix).

### MLM Pre-training (Optional)

New: download the MLM pretraining data [here](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/mlm_pretraining_data.tar.gz).

To pre-train your own MLM initialization (used as `MLM_INIT`), use the commands below:
```bash
# Set the following parameters based on your hardware
export NPROC_PER_NODE=8  # Use 1 for single-GPU training
export N_NODES=8  # Use >1 for multi-node training (where each node has NPROC_PER_NODE GPUs)

# Copy XLM's English pre-trained MLM weights, which we use to initialize our MLM training
wget https://dl.fbaipublicfiles.com/XLM/mlm_en_2048.pth
mv mlm_en_2048.pth dumped/xlm_en/

# MLM pre-training (on the same data as above)
export NGPU=$NPROC_PER_NODE
if [[ $NPROC_PER_NODE -gt 1 ]]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NPROC_PER_NODE"; else DIST_OPTS=""; fi
NUM_TRAIN=`wc -l < data/umt/$DATA_FOLDER/processed/train.mh`
EPOCH_SIZE=$((2*NUM_TRAIN))
BS=24
EFFECTIVE_BS=$((BS*NPROC_PER_NODE*N_NODES))
# For fp16: add "--fp16 true --amp 1" below
python $DIST_OPTS train.py --exp_name mlm.$DATA_FOLDER --data_path data/umt/$DATA_FOLDER/processed --dump_path ./dumped/ --reload_model 'dumped/xlm_en/mlm_en_2048.pth' --emb_dim 2048 --n_layers 12 --n_heads 16 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --use_lang_emb true --lgs 'mh-sh' --clm_steps '' --mlm_steps 'mh,sh' --stopping_criterion '_valid_mlm_ppl,0' --validation_metrics '_valid_mlm_ppl' --epoch_size $EPOCH_SIZE --optimizer "adam_inverse_sqrt,lr=0.00003,beta1=0.9,beta2=0.98,weight_decay=0,warmup_updates=$((EPOCH_SIZE/EFFECTIVE_BS))" --batch_size $BS --max_len 128 --bptt 128 --accumulate_gradients 1 --word_pred 0.15 --sample_alpha 0
```

## QA Model Training

With a trained decomposition model, we can generate decompositions for the multi-hop questions (train and valid sets) and train a question answering model that uses the decompositions (below, we use the pre-trained decomposition model you downloaded):
```bash
# Generate decompositions
ST=0.0
LP=1.0
BEAM=5
SEED=0
# Point to the model directory (change the final directory number/string/id below to match the directory string from the previous Unsupervised Seq2Seq training command)
MODEL_DIR=dumped/umt.dev1.pseudo_decomp.replace_entity_by_type/20639223
MODEL_NO="$(echo $MODEL_DIR | rev | cut -d/ -f1 | rev)"
for SPLIT in valid train; do
    # Note: decrease the batch size below if the GPU goes out of memory
    cat data/umt/all/processed/$SPLIT.mh | python translate.py --exp_name translate --src_lang mh --tgt_lang sh --model_path $MODEL_DIR/best-valid_mh-sh-mh_mt_effective_goods_back_bleu.pth --output_path $MODEL_DIR/$SPLIT.pred.bleu.sh --batch_size 48 --beam_size $BEAM --length_penalty $LP --sample_temperature $ST
done

# Convert sub-questions to SQuAD format
cd $MAIN_DIR/pytorch-transformers
for SPLIT in valid train; do
    python umt_gen_subqs_to_squad_format.py --model_dir $MODEL_DIR --data_folder all --sample_temperature $ST --beam $BEAM --length_penalty $LP --seed $SEED --split $SPLIT --new_data_format
done

# Answer sub-questions
DATA_FOLDER=data/hotpot.umt.all.model=$MODEL_NO.st=$ST.beam=$BEAM.lp=$LP.seed=$SEED
for SPLIT in "dev" "train"; do
    for NUM_PARAGRAPHS in 1 3; do
        # For fp16: add "--fp16 --fp16_opt_level O2" below
        python examples/run_squad.py --model_type roberta --model_name_or_path roberta-large --train_file $DATA_FOLDER/train.json --predict_file $DATA_FOLDER/$SPLIT.json --do_eval --do_lower_case --version_2_with_negative --output_dir checkpoint/roberta_large.hotpot_easy_and_squad.num_paragraphs=$NUM_PARAGRAPHS --per_gpu_train_batch_size 64 --per_gpu_eval_batch_size 32 --learning_rate 1.5e-5 --max_query_length 234 --max_seq_length 512 --doc_stride 50 --num_shards 1 --seed 0 --max_grad_norm inf --adam_epsilon 1e-6 --adam_beta_2 0.98 --weight_decay 0.01 --warmup_proportion 0.06 --num_train_epochs 2 --write_dir $DATA_FOLDER/roberta_predict.np=$NUM_PARAGRAPHS --no_answer_file
    done
done

# Ensemble sub-answer predictions
for SPLIT in "dev" "train"; do
    python ensemble_answers_by_confidence_script.py --seeds_list 1 3 --no_answer_file --split $SPLIT --preds_file1 data/hotpot.umt.all.model=$MODEL_NO.st=$ST.beam=$BEAM.lp=$LP.seed=$SEED/roberta_predict.np={}/nbest_predictions_$SPLIT.json
done

# Add sub-questions and sub-answers to the QA input
FLAGS="--atype sentence-1-center --subq_model roberta-large-np=1-3 --use_q --use_suba --use_subq"
python add_umt_subqs_subas_to_q_squad_format_new.py --subqs_dir data/hotpot.umt.all.model=$MODEL_NO.st=$ST.beam=$BEAM.lp=$LP.seed=$SEED --splits train dev --num_shards 1 --model_dir $MODEL_DIR --sample_temperature $ST --beam $BEAM --length_penalty $LP --seed $SEED --subsample_data --use_easy --use_squad $FLAGS

# Train the QA model
export NGPU=8  # Set based on the number of available GPUs
if [ $NGPU -gt 1 ]; then DIST_OPTS="-m torch.distributed.launch --nproc_per_node=$NGPU"; else DIST_OPTS=""; fi
if [ $NGPU -gt 1 ]; then EVAL_OPTS="--do_eval"; else EVAL_OPTS=""; fi
export MASTER_PORT=$(shuf -i 12001-19999 -n 1)
FLAGS_STRING="${FLAGS// --/.}"
FLAGS_STRING="${FLAGS_STRING//--/.}"
FLAGS_STRING="${FLAGS_STRING// /=}"
TN=hotpot.umt.all.model=$MODEL_NO.st=$ST.beam=$BEAM.lp=$LP.seed=$SEED$FLAGS_STRING.suba1=0.suba2=0-squad.medium_hard_frac=1.0
RANDOM_SEED=0
OUTPUT_DIR="checkpoint/tn=$TN/rs=$RANDOM_SEED"
# For fp16: add "--fp16 --fp16_opt_level O2" below
python $DIST_OPTS examples/run_squad.py --model_type roberta --model_name_or_path roberta-large --train_file data/$TN/train.json --predict_file data/$TN/dev.json --do_train $EVAL_OPTS --do_lower_case --version_2_with_negative --output_dir $OUTPUT_DIR --per_gpu_train_batch_size $((64/NGPU)) --per_gpu_eval_batch_size 32 --learning_rate 1.5e-5 --master_port $MASTER_PORT --max_query_length 234 --max_seq_length 512 --doc_stride 50 --num_shards 1 --seed $RANDOM_SEED --max_grad_norm inf --adam_epsilon 1e-6 --adam_beta_2 0.98 --weight_decay 0.01 --warmup_proportion 0.06 --num_train_epochs 2 --overwrite_output_dir
```
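The `ensemble_answers_by_confidence_script.py` step above combines the sub-answer predictions from the 1-paragraph and 3-paragraph single-hop runs. A minimal sketch of the idea (not the actual script), assuming each `nbest_predictions_*.json` file written by `run_squad.py` maps a question id to a list of candidates with `text` and `probability` fields:
```python
import json

def ensemble_by_confidence(pred_paths):
    """For each sub-question id, keep the answer with the highest probability across prediction files."""
    best = {}
    for path in pred_paths:
        with open(path) as f:
            nbest = json.load(f)
        for qid, candidates in nbest.items():
            top = max(candidates, key=lambda c: c["probability"])
            if qid not in best or top["probability"] > best[qid]["probability"]:
                best[qid] = top
    return {qid: c["text"] for qid, c in best.items()}

# Hypothetical usage, following the pipeline above:
# answers = ensemble_by_confidence([
#     ".../roberta_predict.np=1/nbest_predictions_dev.json",
#     ".../roberta_predict.np=3/nbest_predictions_dev.json",
# ])
```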
New: our trained multi-hop QA model checkpoints are available here:
* [Model Seed 1](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/multihop_qa_model_0.tar.gz)
* [Model Seed 2](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/multihop_qa_model_1.tar.gz)
* [Model Seed 3](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/multihop_qa_model_2.tar.gz)
* [Model Seed 4](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/multihop_qa_model_3.tar.gz)
* [Model Seed 5](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/multihop_qa_model_4.tar.gz)

## Creating Alternate Pseudo-Decompositions

We can also create pseudo-decompositions using embedding methods other than FastText, as described in the Appendix. To do so, use the functions in `pytorch-transformers/pseudoalignment/pseudo_decomp_{paired_random|fasttext|tfidf|bert|variable}.py`, e.g., by running:
```bash
# --split:       decompose the HotpotQA training questions
# --min_q_len:   minimum length of the short questions (in tokens)
# --max_q_len:   maximum length of the short questions (in tokens)
# --beam_size:   size of the subset of short questions to search exhaustively over for each complex question
# --data_folder: path to dump the results to
python pseudoalignment/pseudo_decomp_fasttext.py \
    --split train \
    --min_q_len 4 \
    --max_q_len 20 \
    --beam_size 100 \
    --data_folder data/umt/decomposition_name
```

The different pseudo-decomposition methods are:
* `pseudo_decomp_fasttext.py` - decompose using bags of FastText vectors
* `pseudo_decomp_random.py` - randomly pair short questions (for ablations/comparisons)
* `pseudo_decomp_tfidf.py` - decompose using bags of TF-IDF vectors
* `pseudo_decomp_variable.py` - decompose using bags of FastText vectors, but with a variable number of sub-questions (see Appendix)
* `pseudo_decomp_bert.py` - decompose using BERT embeddings (requires generating the BERT embeddings first with `embed_questions_with_bert.py`)
* `pseudo_decomp_bert_nsp.py` - decompose using BERT NSP embeddings (not in the paper; also requires generating the BERT embeddings first with `embed_questions_with_bert.py`)

## Variable Number of Sub-Questions

To train a decomposition model to generate a variable number of sub-questions, you'll need to make the following changes:
- Train on variable-length pseudo-decompositions, created using `python pseudoalignment/pseudo_decomp_variable.py` (see above).
- Use a version of the unsupervised stopping criterion which only counts bad decompositions as those with `N<2` sub-questions (as opposed to `N!=2` sub-questions). Simply add the flag `--one_to_variable` when training (Unsupervised) Seq2Seq models with `XLM/train.py`.
- Have the single-hop QA model answer an arbitrary number of sub-questions, instead of a maximum of 2 sub-questions. Simply add `--one_to_variable` to the `FLAGS` variable used in the "QA Model Training" section earlier.

## Data mined from Common Crawl

The questions mined from Common Crawl using our FastText classifiers can be found [here](https://dl.fbaipublicfiles.com/UnsupervisedDecomposition/data/mined_questions.tar.gz).

## Citation

```
@inproceedings{perez2020unsupervised,
    title={Unsupervised Question Decomposition for Question Answering},
    author={Ethan Perez and Patrick Lewis and Wen-tau Yih and Kyunghyun Cho and Douwe Kiela},
    year={2020},
    booktitle={EMNLP},
    url={https://arxiv.org/abs/2002.09758}
}
```

## License

See the [LICENSE](LICENSE) file for more details.