nlpaug.augmenter.word.back_translation

Augmenter that applies word-level operations to textual input based on back translation.

class nlpaug.augmenter.word.back_translation.BackTranslationAug(from_model_name='transformer.wmt19.en-de', to_model_name='transformer.wmt19.de-en', from_model_checkpt='model1.pt', to_model_checkpt='model1.pt', tokenizer='moses', bpe='fastbpe', is_load_from_github=True, name='BackTranslationAug', device='cpu', force_reload=False, verbose=0)

Bases: nlpaug.augmenter.word.word_augmenter.WordAugmenter

Augmenter that leverages two translation models for augmentation. For example, if the source text is English, this augmenter translates it to German and then translates the result back to English. For details, see https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28
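The round trip described above can be sketched without any real translation model. The snippet below is a toy illustration only: `back_translate`, `toy_translator`, and the two word tables are hypothetical names invented for this example and are not part of nlpaug or fairseq.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def back_translate(text: str, forward: Translator, backward: Translator) -> str:
    """Translate text to a pivot language, then back to the source language."""
    pivot = forward(text)
    return backward(pivot)


# Toy word-for-word "models" (hypothetical, for illustration only).
EN_DE: Dict[str, str] = {"the": "das", "quick": "schnelle", "car": "auto"}
DE_EN: Dict[str, str] = {"das": "the", "schnelle": "fast", "auto": "car"}


def toy_translator(table: Dict[str, str]) -> Translator:
    # Words missing from the table pass through unchanged.
    return lambda s: " ".join(table.get(word, word) for word in s.split())


augmented = back_translate(
    "the quick car", toy_translator(EN_DE), toy_translator(DE_EN)
)
print(augmented)  # the fast car
```

The augmentation comes from the fact that the two directions are not exact inverses: here "quick" returns as "fast", so the output is a paraphrase of the input.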

Parameters:
  • from_model_name (str) – Translation model for the language of your text. Verified values are ‘transformer.wmt18.en-de’, ‘transformer.wmt19.en-de’, ‘transformer.wmt19.de-en’, ‘transformer.wmt19.en-ru’ and ‘transformer.wmt19.ru-en’
  • to_model_name (str) – Translation model for translating back. Verified values are ‘transformer.wmt18.en-de’, ‘transformer.wmt19.en-de’, ‘transformer.wmt19.de-en’, ‘transformer.wmt19.en-ru’ and ‘transformer.wmt19.ru-en’
  • tokenizer (str) – Default value is ‘moses’
  • bpe (str) – Default value is ‘fastbpe’
  • device (str) – Default value is ‘cpu’. Use ‘cpu’ to process on the CPU or ‘cuda’ to process on the GPU. Other device options may also work.
  • is_load_from_github (bool) – Default is True. If True, translation models will be loaded from fairseq’s GitHub. Otherwise, provide the model directory in both the from_model_name and to_model_name parameters.
  • force_reload (bool) – Force reloading the model into memory when the class is initialized. Default value is False; keeping it False is recommended when performance matters.
  • name (str) – Name of this augmenter
>>> import nlpaug.augmenter.word as naw
>>> aug = naw.BackTranslationAug()
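As a minimal sketch of how the constructor parameters fit together, the helper below builds keyword arguments for a matching pair of models. `back_translation_kwargs` and `VERIFIED_MODELS` are hypothetical names introduced here, and the assumption that both directions use the wmt19 naming pattern comes from the defaults in the signature above rather than from nlpaug itself.

```python
from typing import Dict

# Model names listed as verified in the parameter descriptions above.
VERIFIED_MODELS = {
    "transformer.wmt18.en-de",
    "transformer.wmt19.en-de",
    "transformer.wmt19.de-en",
    "transformer.wmt19.en-ru",
    "transformer.wmt19.ru-en",
}


def back_translation_kwargs(
    source: str, pivot: str, use_gpu: bool = False
) -> Dict[str, str]:
    """Build keyword arguments for a source -> pivot -> source round trip."""
    from_model = f"transformer.wmt19.{source}-{pivot}"
    to_model = f"transformer.wmt19.{pivot}-{source}"
    for name in (from_model, to_model):
        if name not in VERIFIED_MODELS:
            raise ValueError(f"{name} is not among the verified models")
    return {
        "from_model_name": from_model,
        "to_model_name": to_model,
        "device": "cuda" if use_gpu else "cpu",
    }


kwargs = back_translation_kwargs("en", "ru")
# With nlpaug and its model dependencies installed, these could be passed on as:
#   import nlpaug.augmenter.word as naw
#   aug = naw.BackTranslationAug(**kwargs)
```

Rejecting unverified names up front is a design choice for this sketch; other model names may work but are not listed as verified.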
augment(data, n=1, num_thread=1)
Parameters:
  • data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or numpy array)
  • n (int) – Default is 1. Number of unique augmented outputs. Forced to 1 if the input is a list of data
  • num_thread (int) – Number of threads for data augmentation. Use this option when you are using the CPU and n is larger than 1
Returns:

Augmented data

>>> augmented_data = aug.augment(data)
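Because n is forced to 1 when a list is passed, one possible way to request several outputs per text is to call augment once per string. The sketch below assumes only the augment(data, n=...) signature documented above; `augment_each` and `FakeAug` are hypothetical names, and `FakeAug` is a stand-in so the example runs without downloading models. The return shape of augment is not specified here, so the helper accepts either a single string or a list.

```python
from typing import Any, List, Sequence


def augment_each(aug: Any, texts: Sequence[str], n: int = 1) -> List[List[str]]:
    """Call aug.augment once per string so that n is not forced to 1."""
    results: List[List[str]] = []
    for text in texts:
        out = aug.augment(text, n=n)
        # Normalize: accept either a bare string or a list of strings.
        results.append([out] if isinstance(out, str) else list(out))
    return results


class FakeAug:
    """Hypothetical stand-in that mimics the augment() signature."""

    def augment(self, data, n=1, num_thread=1):
        return [f"{data} (variant {i})" for i in range(n)]


variants = augment_each(FakeAug(), ["first text", "second text"], n=2)
```

Whether a real back-translation augmenter can produce n distinct outputs for the same input depends on the underlying models, so fewer than n results per text may come back.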