nlpaug.augmenter.sentence.lambada

Augmenter that applies sentence-level operations to textual input based on abstractive summarization.

class nlpaug.augmenter.sentence.lambada.LambadaAug(model_dir, threshold=None, min_length=100, max_length=300, batch_size=16, temperature=1.0, top_k=50, top_p=0.9, repetition_penalty=1.0, name='Lambada_Aug', device='cpu', force_reload=False, verbose=0)

Bases: nlpaug.augmenter.sentence.sentence_augmenter.SentenceAugmenter

Augmenter that leverages contextual word embeddings to find the top n similar words for augmentation.

Parameters:
  • model_dir (str) – Directory of the model. It is generated by train_lambada.sh under the scripts folder.
  • threshold (float) – The classification probability threshold for accepting generated text. All results are returned if threshold is None.
  • batch_size (int) – Batch size.
  • min_length (int) – The minimum length of the output text.
  • max_length (int) – The maximum length of the output text.
  • temperature (float) – The value used to modulate the next token probabilities.
  • top_k (int) – The number of highest probability vocabulary tokens to keep for top-k-filtering.
  • top_p (float) – If set to a float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
  • repetition_penalty (float) – The parameter for repetition penalty. 1.0 means no penalty.
  • device (str) – Device used for processing. Default value is ‘cpu’, which runs processing on the CPU; ‘cuda’ runs it on the GPU. Possible values include ‘cuda’ and ‘cpu’.
  • force_reload (bool) – Force a reload of the contextual word embeddings model into memory when the class is initialized. Default value is False; keeping it False is recommended when performance matters.
  • name (str) – Name of this augmenter.
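>>> import nlpaug.augmenter.sentence as nas
>>> # model_dir is required; 'path/to/model_dir' is a placeholder for the directory generated by train_lambada.sh
>>> aug = nas.LambadaAug(model_dir='path/to/model_dir')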
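
The temperature, top_k and top_p parameters listed above interact when the next token is chosen. The following is a minimal pure-Python sketch of that filtering under simplifying assumptions; it is not the code this augmenter runs. The function name filter_next_token_probs and the toy logits are hypothetical and serve only as illustration.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration only; not part of nlpaug.
    """
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    top_k_probs = {i: probs[i] / mass for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += top_k_probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they form a distribution to sample from.
    mass = sum(top_k_probs[i] for i in kept)
    return {i: top_k_probs[i] / mass for i in kept}


# Toy vocabulary of four tokens with probabilities 0.5, 0.3, 0.15 and 0.05.
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(filtered)
```

In this sketch, lowering top_k or top_p shrinks the set of candidate tokens, which makes generated text more conservative, while raising temperature spreads probability across more tokens.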
augment(data, n=1, num_thread=1)
Parameters:
  • data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or numpy array). Numpy format is supported only for audio or spectrogram data. For text data, only a string or a list of strings is supported.
  • n (int) – Default is 1. Number of unique augmented outputs. It is forced to 1 if the input is a list of data.
  • num_thread (int) – Number of threads for data augmentation. Use this option when you are using the CPU and n is larger than 1.
Returns:

Augmented data

>>> augmented_data = aug.augment(data)
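
The effect of the threshold constructor parameter on the returned data can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: accept_generated is a hypothetical helper, the (text, probability) pair shape is assumed, and acceptance is assumed to mean a classification probability greater than or equal to the threshold.

```python
def accept_generated(candidates, threshold=None):
    """Filter generated texts by classification probability.

    candidates: list of (text, probability) pairs (assumed shape, for illustration).
    threshold: minimum probability for accepting a text; None keeps everything.
    """
    if threshold is None:
        # No threshold: return all results, as the parameter description states.
        return [text for text, _ in candidates]
    # Assumption: a text is accepted when its probability is at least the threshold.
    return [text for text, prob in candidates if prob >= threshold]


# Made-up candidates with made-up classifier probabilities.
candidates = [("generated sentence A", 0.92), ("generated sentence B", 0.31)]
print(accept_generated(candidates))
print(accept_generated(candidates, threshold=0.5))
```

A higher threshold trades quantity for confidence: fewer generated sentences survive, but those that do are the ones the classifier scores most strongly.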