
Controllable Text Generation

Style transfer is the loosely defined task of rewriting a sentence from style A into style B while preserving its semantics. There are many ways to approach this, with and without supervision. In a project advised by Fatemehsadat Mireshghallah, Greg Durrett, and Taylor Berg-Kirkpatrick, I framed the task in terms of Markov chains, with the end goal of composable, controllable, and intuitive style transfer, expanding an idea previously explored by the lab (1).

Methodology

Sentence edits are viewed as a Markov chain that starts from a given input sentence. At every step of the Metropolis-Hastings (MH) scheme, we propose a new sequence and accept or reject it based on its fitness, as judged by an energy function. After taking a given number of steps (edits), we can treat the resulting sentence as a sample from the distribution implicitly defined by the energy-based model. Formally, the probability of an edit \(\bar{X}\) of a sentence \(X\) being accepted is \(p(\bar{X}; X) = \min (1, \frac{e^{E_m(\bar{X})} p_m(X \vert X')}{e^{E_m(X)} p_m(\bar{X} \vert X')})\), where \(E_m(x)\) is our energy function and \(p_m(x \vert x')\) is the probability of a sequence \(x\) given a masked version of it, \(x'\). Intuitively, the probability of acceptance is higher if the proposed sentence \(\bar{X}\) has high energy and is rare under the proposal distribution.
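To make the loop concrete, here is a minimal sketch of the editing chain in Python. The `energy` and `propose` callables are hypothetical stand-ins for the models described below; `propose` is assumed to return the candidate along with the forward and reverse proposal log-probabilities.

```python
import math
import random

def mh_edit_chain(x, energy, propose, n_steps=100):
    """Metropolis-Hastings chain of sentence edits.

    energy(x)  -> E_m(x); higher is better under this write-up's e^{E_m} convention.
    propose(x) -> (x_bar, fwd_logp, rev_logp): a candidate edit plus
                  log p_m(x_bar | x') and log p_m(x | x_bar').
    """
    for _ in range(n_steps):
        x_bar, fwd_logp, rev_logp = propose(x)
        # log of the acceptance ratio from the formula above
        log_alpha = energy(x_bar) - energy(x) + rev_logp - fwd_logp
        if random.random() < math.exp(min(0.0, log_alpha)):
            x = x_bar  # accept the edit; otherwise keep x unchanged
    return x
```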

Importantly, the exact method of proposing an edit has been left vague so far. Mix and Match (M&M), proposed in a previous paper, follows a similar scheme: it masks out a single token at a time and uses BERT to guess the masked token (1). However, without nontrivial modification of the underlying process, this makes the methodology fixed-length: the length of the input (in tokens) is the length of the output (in tokens). It also suffers from an issue with the tokenization process itself: when just a single token of a multi-token word is masked, the entropy of the prediction is often very low, making many edits useless! Instead of using BERT and masking a single token at a time, we use a text-infilling T5 model and mask out random spans of words, not tokens (with span length sampled from a Poisson distribution with \(\lambda = 2\)). This immediately allows for variable-length generation, after accounting for the more complex calculation of the forward and reverse proposal probabilities \(p_m(\bar{X} \vert X')\) and \(p_m(X \vert \bar{X}')\).
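A rough sketch of this span-replacement proposal is below, using an off-the-shelf t5-base as the infilling model (the specific checkpoint and sampling settings here are assumptions, not the tuned setup); computing the forward and reverse proposal probabilities needed for the acceptance ratio is omitted.

```python
import numpy as np
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tok = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def propose_span_edit(sentence: str, lam: float = 2.0) -> str:
    """Mask a Poisson-length span of *words* and let T5 infill it."""
    words = sentence.split()
    span_len = min(np.random.poisson(lam), len(words))  # 0 allowed: a pure insertion
    start = np.random.randint(0, len(words) - span_len + 1)
    masked = words[:start] + ["<extra_id_0>"] + words[start + span_len:]
    inputs = tok(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=16)
    decoded = tok.decode(out[0], skip_special_tokens=False)
    # T5 emits "<pad><extra_id_0> fill <extra_id_1>..."; keep only the first fill
    fill = decoded.split("<extra_id_0>")[-1].split("<extra_id_1>")[0].strip()
    return " ".join(words[:start] + fill.split() + words[start + span_len:])
```

Because whole words (not subword tokens) are masked and the fill can be any length, the multi-token-word entropy problem above disappears and edits can grow or shrink the sentence.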

Finally, we just need to specify \(E_m(X)\), the energy model used to judge sentences before and after editing. M&M (for formality transfer) used a RoBERTa style classifier, Hamming distance, and a BERT-based BERTScore as its energy model. We instead use a RoBERTa style classifier, a DeBERTa-based BERTScore, and GPT-2 sentence probability. We rescale the BERTScore to make it more interpretable and more useful in an ensemble setting: we took the SPR training dataset, computed the mean BERTScore \(Base\) between randomly sampled sentence pairs, and defined the “normalized” BERTScore as \(\frac{BS - Base}{1 - Base}\), where \(BS\) is the unnormalized BERTScore. We set the classifier weight \(\alpha \gets 10\), the BERTScore weight \(\beta \gets 120\), and the GPT-2 weight \(\iota \gets 5\); these parameters were hand-tuned so that all 3 components of the energy function are roughly equal in magnitude on average.
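Putting the pieces together, the weighted energy might look like the sketch below. The `clf_logprob`, `bertscore`, and `gpt2_logprob` stubs are hypothetical wrappers around the three models above, and the `BASE` constant is a placeholder for the empirically measured mean BERTScore, not the actual value used.

```python
ALPHA, BETA, IOTA = 10.0, 120.0, 5.0  # hand-tuned component weights from the text
BASE = 0.42                           # placeholder: mean BERTScore of random sentence pairs

# Hypothetical model wrappers; in practice these would call the RoBERTa style
# classifier, DeBERTa BERTScore, and GPT-2 respectively.
def clf_logprob(x: str) -> float: ...
def bertscore(a: str, b: str) -> float: ...
def gpt2_logprob(x: str) -> float: ...

def energy(x_edit: str, x_src: str) -> float:
    """E_m: higher is better under this write-up's e^{E_m} convention."""
    norm_bs = (bertscore(x_edit, x_src) - BASE) / (1 - BASE)  # rescaled BERTScore
    return (ALPHA * clf_logprob(x_edit)     # target-style classifier score
            + BETA * norm_bs                # semantic preservation
            + IOTA * gpt2_logprob(x_edit))  # fluency
```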

Results

We used the Shakespeare author imitation dataset, which contains 37k training sentences from Shakespearean plays paired with their modern English translations (2). On a 250-sentence subset, we get the following results for Modern English -> Shakespearean and Shakespearean -> Modern English, respectively.

Modern English -> Shakespearean:

Model  J(BS, CLF, FL)  CLF   BERTScore  FL
M&M+   0.3041          92.5  0.4486     71.5
STRAP  0.1812          67.5  0.3423     83.5
VAE    0.1141          83.5  0.2997     51.5
UNMT   0.0950          81.5  0.2556     53.5

Shakespearean -> Modern English:

Model  J(BS, CLF, FL)  CLF   BERTScore  FL
M&M+   0.3427          86.0  0.4151     87.0
STRAP  0.2719          46.0  0.5592     89.0
VAE    0.1515          58.5  0.3126     54.5
UNMT   0.1376          53.5  0.2787     58.0

J(BS, CLF, FL) refers to the aggregation metric presented by STRAP (3). It is defined as \(J(BS, CLF, FL) = \sum_{x \in D} \frac{CLF(x) \cdot FL(x) \cdot BS(x)}{\vert D \vert}\); this amounts to the average BERTScore among fluent and correctly styled outputs. CLF is the transfer accuracy into the target domain as judged by a finetuned RoBERTa-base. BERTScore is computed between the edited sentence and the target. FL is the binary fluency of the edited sentence as predicted by a RoBERTa-large classifier (4). More information about the baseline models is available in (3), (5), and (6), respectively.
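As a sanity check on the definition, J reduces to a one-liner over per-example scores (the names here are illustrative):

```python
def j_metric(scores) -> float:
    """J(BS, CLF, FL) over a list of (clf, fl, bs) tuples, where clf and fl
    are binary {0, 1} judgments and bs is the example's BERTScore."""
    return sum(clf * fl * bs for clf, fl, bs in scores) / len(scores)

# e.g. j_metric([(1, 1, 0.45), (1, 0, 0.30), (0, 1, 0.50)]) == 0.15
```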

References

(1): https://arxiv.org/abs/2203.13299

(2): https://aclanthology.org/C12-1177/

(3): https://arxiv.org/pdf/2010.05700.pdf

(4): https://huggingface.co/cointegrated/roberta-large-cola-krishna2020

(5): https://arxiv.org/abs/2002.03912

(6): https://arxiv.org/abs/1710.11041