nlpaug.augmenter.word.back_translation

Augmenter that apply operation (word level) to textual input based on back translation.

class nlpaug.augmenter.word.back_translation.BackTranslationAug(from_model_name='facebook/wmt19-en-de', to_model_name='facebook/wmt19-de-en', name='BackTranslationAug', device='cpu', batch_size=32, max_length=300, force_reload=False, verbose=0)[source]

Bases: nlpaug.augmenter.word.word_augmenter.WordAugmenter

Augmenter that leverage two translation models for augmentation. For example, the source is English. This augmenter translate source to German and translating it back to English. For detail, you may visit https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28

Parameters:
  • from_model_name (str) – Any model from https://huggingface.co/models?filter=translation&search=Helsinki-NLP. As long as from_model_name is pair with to_model_name. For example, from_model_name is English to Japanese, then to_model_name should be Japanese to English.
  • to_model_name (str) – Any model from https://huggingface.co/models?filter=translation&search=Helsinki-NLP.
  • device (str) – Default value is CPU. If value is CPU, it uses CPU for processing. If value is CUDA, it uses GPU for processing. Possible values include ‘cuda’ and ‘cpu’. (May able to use other options)
  • force_reload (bool) – Force reload the contextual word embeddings model to memory when initialize the class. Default value is False and suggesting to keep it as False if performance is the consideration.
  • batch_size (int) – Batch size.
  • max_length (int) – The max length of output text.
  • name (str) – Name of this augmenter
>>> import nlpaug.augmenter.word as naw
>>> aug = naw.BackTranslationAug()
augment(data, n=1, num_thread=1)
Parameters:
  • data (object/list) – Data for augmentation. It can be list of data (e.g. list of string or numpy) or single element (e.g. string or numpy). Numpy format only supports audio or spectrogram data. For text data, only support string or list of string.
  • n (int) – Default is 1. Number of unique augmented output. Will be force to 1 if input is list of data
  • num_thread (int) – Number of thread for data augmentation. Use this option when you are using CPU and n is larger than 1
Returns:

Augmented data

>>> augmented_data = aug.augment(data)