nlpaug.augmenter.sentence.lambada

Augmenter that applies a sentence-level operation to textual input based on language-model text generation.

class nlpaug.augmenter.sentence.lambada.LambadaAug(model_dir, threshold=None, min_length=100, max_length=300, batch_size=16, temperature=1.0, top_k=50, top_p=0.9, repetition_penalty=1.0, name='Lambada_Aug', device='cpu', force_reload=False, verbose=0)[source]

Bases: nlpaug.augmenter.sentence.sentence_augmenter.SentenceAugmenter

Augmenter that uses a trained language model to generate text for augmentation, optionally keeping only generated text whose classification probability reaches the threshold.

Parameters:
  • model_dir (str) – Directory of the model. It is generated by train_lambada.sh under the scripts folder.
  • threshold (float) – The classification probability threshold for accepting generated text. All results are returned if threshold is None.
  • batch_size (int) – Batch size.
  • min_length (int) – The min length of output text.
  • max_length (int) – The max length of output text.
  • temperature (float) – The value used to modulate the next token probabilities.
  • top_k (int) – The number of highest probability vocabulary tokens to keep for top-k-filtering.
  • top_p (float) – If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

  • repetition_penalty (float) – The parameter for repetition penalty. 1.0 means no penalty.
  • device (str) – Default value is ‘cpu’. If the value is ‘cpu’, the CPU is used for processing. If the value is ‘cuda’, the GPU is used for processing. Possible values include ‘cuda’ and ‘cpu’.
  • force_reload (bool) – Force reloading the model into memory when the class is initialized. Default value is False; keeping it False is recommended if performance is a consideration.
  • name (str) – Name of this augmenter
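>>> import nlpaug.augmenter.sentence as nas
>>> # 'path/to/model_dir' is a placeholder for the directory produced by train_lambada.sh
>>> aug = nas.LambadaAug(model_dir='path/to/model_dir')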
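The temperature, top_k and top_p parameters listed above all reshape the next-token distribution before a token is sampled. The following is a minimal, library-independent sketch of that idea on a toy list of logits. It is not the code nlpaug or its underlying model runs; the function name filter_next_token_probs and the choice to renormalize after the top-k step are assumptions made for illustration.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Illustrative helper (hypothetical, not part of nlpaug).

    Returns a dict {token_index: probability} containing only the tokens
    that survive temperature scaling, top-k filtering and top-p filtering.
    """
    # Temperature: values below 1.0 sharpen the distribution,
    # values above 1.0 flatten it.
    scaled = [x / temperature for x in logits]

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    k_total = sum(probs[i] for i in ranked)
    k_probs = {i: probs[i] / k_total for i in ranked}

    # Top-p: keep the smallest set of most probable tokens whose
    # cumulative probability adds up to top_p or higher.
    kept = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += k_probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution again.
    kept_total = sum(k_probs[i] for i in kept)
    return {i: k_probs[i] / kept_total for i in kept}
```

Calling filter_next_token_probs([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9) shows how a lower top_p or top_k shrinks the set of candidate tokens, which generally makes generated text less varied.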
augment(data, n=1, num_thread=1)
Parameters:
  • data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy) or a single element (e.g. a string or numpy).
  • n (int) – Default is 1. Number of unique augmented outputs. It is forced to 1 if the input is a list of data.
  • num_thread (int) – Number of threads for data augmentation. Use this option when you are using CPU and n is larger than 1.
Returns:

Augmented data

>>> augmented_data = aug.augment(data)
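The threshold parameter of the class controls which generated texts are accepted: candidates below the classification probability threshold are dropped, and everything is returned when threshold is None. A minimal sketch of that acceptance rule follows. The helper accept_generated, the (text, probability) pair structure, and the inclusive comparison are assumptions for illustration and do not reflect nlpaug internals.

```python
def accept_generated(candidates, threshold=None):
    """Illustrative helper (hypothetical, not part of nlpaug).

    candidates: list of (text, probability) pairs, where probability is the
    classifier's confidence for the generated text.
    """
    # No threshold: return every generated text.
    if threshold is None:
        return [text for text, _ in candidates]
    # Otherwise keep only texts whose probability reaches the threshold
    # (the inclusive ">=" is an assumption).
    return [text for text, prob in candidates if prob >= threshold]
```

A higher threshold trades quantity for quality: fewer generated texts survive, but the survivors are the ones the classifier is most confident about.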