Analysing paraphrasing from a neural model perspective

4 min read · Jun 4, 2021

Like machine translation and abstractive summarization, paraphrase generation is a sequence-to-sequence task: it maps one sequence to another.

Given that, it is natural to think that we could simply fine-tune the state-of-the-art neural models for machine translation and abstractive summarization on paraphrase generation and get the task done easily. Or rather, it is arguably not even a standalone task but a by-product of machine translation: if we translate from English to Czech, then back-translate from Czech to English, the generated English sentence is likely to differ from the original one. This is how the classic PARANMT dataset was formed.
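The round-trip idea can be sketched in a few lines of Python. The two "translators" below are hypothetical word-level lookup tables, not real MT models; they only illustrate how a back-translated sentence can come out worded differently while keeping the meaning.

```python
# Toy illustration of round-trip (back-)translation as a paraphrase source,
# in the spirit of PARANMT. The lookup tables stand in for real MT models.
EN_TO_CS = {"the": "ten", "film": "film", "was": "byl", "great": "skvely"}
CS_TO_EN = {"ten": "the", "film": "movie", "byl": "was", "skvely": "excellent"}

def translate(sentence, table):
    # Word-by-word substitution; unknown words pass through unchanged.
    return " ".join(table.get(word, word) for word in sentence.split())

original = "the film was great"
czech = translate(original, EN_TO_CS)
back = translate(czech, CS_TO_EN)
print(back)  # "the movie was excellent" -- a paraphrase of the original
```

Even this toy shows the mechanism: the detour through a second language is what introduces the wording change.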

However, as you may already sense, paraphrasing is somewhat different from machine translation and abstractive summarization. The key difference is that the two ends of the task do not lie in two separate domains; both input and output live in the same domain.

Sharing a domain isn’t something horrible, but it does bring some inconvenience. Rule-based operations and function mappings can easily stay within one domain; this might not hold for less rule-based functions, e.g., a neural model.

It is easy to see that a +2 (plus two) operation on integers will not change the domain. For example, 27 + 2 = 29, where both 27 and 29 are integers. Similarly, f(x) = -x is a function that maps an integer to an integer, or a real number to a real number. Beyond being rule-based, these operations share another property: each has a clear direction in which it pushes its input. +2 constantly pushes its input toward the positive side of the axis, while f(x) = -x keeps throwing its input to the other side of the axis, unless zero is the input.
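A trivial sketch of these two rule-based functions makes the "fixed direction" property concrete:

```python
# Two rule-based functions with a clear, fixed direction:
# plus_two always pushes its input toward the positive side,
# negate always flips its input across zero.
def plus_two(x):
    return x + 2

def negate(x):
    return -x

x = 27
print(plus_two(x))        # 29 -- still an integer: the domain is unchanged
print(negate(x))          # -27 -- thrown to the other side of the axis
print(negate(negate(x)))  # 27 -- flipping twice returns to the start
```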

These clear properties can be seen in many neural models as well. In style transfer, an image can be translated into another style. In image super-resolution, a lower-quality (noisy or blurry) image can be translated into a clearer one. For natural language, English can be translated into Czech, and long articles can be rewritten as summaries. They, too, seem to have a clear mapping direction.

However, these neural models are not the same as the rule-based functions above in terms of convergence. For example, 1,000,000 + 2 = 1,000,002, and 1,000,002 + 2 = 1,000,004; we can apply this function forever and always get a bigger number. On the contrary, in practice we currently (2021) cannot super-resolve an image infinitely with neural models, because such models are trained on finite data and image resolutions are also finite. Likewise, we cannot translate English into Czech, then into Czech again, and again: the machine translation model will give you nothing more surprising than a Czech-like sentence, even if you feed it Czech. This is a little like g(x) = 0.1 * x, which converges to 0, although with this g(x) you still get a proportional change on every application. If you ran a super-resolution model thousands of times, you should not expect such a consistent change, simply because the function is not rule-based.
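The contrast in convergence between the two rule-based functions above can be checked directly:

```python
# Iterating g(x) = 0.1 * x: each step applies the same proportional change,
# and the sequence converges toward the fixed point 0.
def g(x):
    return 0.1 * x

x = 1_000_000.0
for _ in range(20):
    x = g(x)
print(x)  # ~1e-14, effectively at the fixed point

# Iterating +2 never converges: there is no fixed point,
# and the output grows without bound.
y = 1_000_000
for _ in range(20):
    y = y + 2
print(y)  # 1000040
```

A neural model behaves like neither case under iteration: with finite training data, repeated application simply stops producing a meaningful change.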

Certainly, you will not blame your machine translation model for not changing your Czech input, because you are feeding it Czech. But you may well blame your paraphrase generation model for not giving you some ‘semantically identical’ sentences.

Maybe you shouldn’t blame your paraphrase model, though, as it is cursed. Paraphrase generation is essentially not a function: a single input can expect multiple equally plausible outputs.

Suppose that you know for sure that A1, A2, and A3 are paraphrases of each other. What do you expect from your model when you input A2: A1 or A3? One may argue that this is just a matter of randomness. And indeed, if A1, A2, and A3 are all English, which one we eventually get when translating their Czech counterpart is a matter of randomness, since the input ‘travels’ a long distance from Czech to English. The relationship among A1, A2, and A3, on the contrary, is more like a perturbation: what matters is the mapping scale rather than the mapping direction. Or rather, it is like saying ‘change this vector a little, but don’t let it end up more than 0.77 units away from the original vector’. That is exactly perturbation, or noising.
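This "change a little, but stay close" view can be sketched as a norm-bounded perturbation. The sketch below is an assumption-laden toy, not the author's method: it nudges a vector with random noise rescaled to a fixed radius (0.77 here just echoes the number above), so the change has a definite scale but no preferred direction.

```python
import math
import random

# Minimal sketch of norm-bounded perturbation: add random noise to a
# vector, rescaled so the result lands exactly on the sphere of radius
# max_dist around the original. The scale is fixed; the direction is not.
def perturb(vec, max_dist=0.77, seed=0):
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in vec]
    norm = math.sqrt(sum(n * n for n in noise))
    scale = max_dist / norm
    return [v + n * scale for v, n in zip(vec, noise)]

original = [1.0, 2.0, 3.0]
nudged = perturb(original)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, nudged)))
print(round(dist, 6))  # 0.77 -- a fixed mapping scale, arbitrary direction
```

Contrast this with a translation model, whose mapping has a definite direction (English to Czech) but no such distance constraint.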

Changing something while keeping it the same as itself may well be one of the hardest tasks. You sort of know the definition of being ‘the same’, but you don’t know it exactly. So you let English travel all the way to Czech and back to English, hoping that some small change happens along the way.

In short, paraphrase generation is not a task well suited to being tackled directly in a sequence-to-sequence manner. Rather, it is like adversarial example generation, where you endeavor to fool your downstream task while varying the syntax and keeping the sentence grammatical.

Currently, I haven’t found a feasible approach, but before long someone will surely get it done.