Augmenter that applies word-level operations to textual input based on contextual word embeddings.

class nlpaug.augmenter.word.context_word_embs.ContextualWordEmbsAug(model_path='bert-base-uncased', model_type='', action='substitute', top_k=100, name='ContextualWordEmbs_Aug', aug_min=1, aug_max=10, aug_p=0.3, stopwords=None, batch_size=32, device='cpu', force_reload=False, stopwords_regex=None, verbose=0, silence=True, use_custom_api=True)

Bases: nlpaug.augmenter.word.word_augmenter.WordAugmenter

Augmenter that leverages contextual word embeddings to find the top n similar words for augmentation.

  • model_path (str) – Model name or model path. The model is loaded with the transformers library. Tested with ‘bert-base-uncased’, ‘bert-base-cased’, ‘distilbert-base-uncased’, ‘roberta-base’, ‘distilroberta-base’, ‘facebook/bart-base’, ‘squeezebert/squeezebert-uncased’.
  • model_type (str) – Type of model. For a BERT model, use ‘bert’. For a RoBERTa/LongFormer model, use ‘roberta’. For a BART model, use ‘bart’. If no value is provided, the type is determined from the model name.
  • action (str) – Either ‘insert’ or ‘substitute’. If the value is ‘insert’, a new word is injected at a random position according to the contextual word embeddings calculation. If the value is ‘substitute’, a word is replaced according to the contextual word embeddings calculation.
  • top_k (int) – Controls the pool of candidates for the random draw. The top k scoring tokens are used for augmentation; a larger k makes more tokens available. Default value is 100. If the value is None, all possible tokens are used.
  • aug_p (float) – Percentage of words to be augmented.
  • aug_min (int) – Minimum number of words to be augmented.
  • aug_max (int) – Maximum number of words to be augmented. If None is passed, the number of augmentations is calculated from aug_p. If the result calculated from aug_p is smaller than aug_max, the calculated result is used. Otherwise, aug_max is used.
  • stopwords (list) – List of words that are skipped by the augment operation. Do NOT include the UNKNOWN word. The UNKNOWN word of BERT is [UNK]. The UNKNOWN word of RoBERTa and BART is <unk>.
  • stopwords_regex (str) – Regular expression matching words that are skipped by the augment operation.
  • device (str) – Device used for processing. Default value is ‘cpu’. Use ‘cpu’ to process on the CPU and ‘cuda’ to process on the GPU. Other options may also work.
  • batch_size (int) – Batch size.
  • force_reload (bool) – Force the contextual word embeddings model to be reloaded into memory when the class is initialized. Default value is False; keeping it False is suggested if performance is a consideration.
  • silence (bool) – Default is True. The transformers library prints a warning message when leveraging a pre-trained model. Set to True to disable this expected warning message.
  • name (str) – Name of this augmenter.
>>> import nlpaug.augmenter.word as naw
>>> aug = naw.ContextualWordEmbsAug()
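
The parameter descriptions for top_k and for aug_p, aug_min and aug_max can be read as the following minimal sketch. It is not nlpaug's own code: augment_count and draw_from_top_k are hypothetical helpers, and the rounding up of the aug_p share, the cap at the input length, and the uniform draw from the candidate pool are assumptions made for illustration.

```python
import math
import random


def augment_count(num_words, aug_p=0.3, aug_min=1, aug_max=10):
    """Hypothetical helper: how many words of one input get augmented."""
    # Assumption: the aug_p share of the input is rounded up to whole words.
    count = math.ceil(aug_p * num_words)
    # aug_min acts as a floor.
    if aug_min is not None:
        count = max(count, aug_min)
    # aug_max acts as a ceiling; None means the aug_p result is used as is.
    if aug_max is not None:
        count = min(count, aug_max)
    # Assumption: never request more words than the input contains.
    return min(count, num_words)


def draw_from_top_k(scores, top_k=100, rng=random):
    """Hypothetical helper: draw one token from the top_k best-scoring tokens.

    `scores` maps each candidate token to its model score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    # top_k=None keeps every candidate in the pool.
    pool = ranked if top_k is None else ranked[:top_k]
    # Assumption: the draw from the pool is uniform.
    return rng.choice(pool)
```

Under these assumptions, a 10-word input with the default settings yields 3 augmented words, and a 100-word input is capped at aug_max, i.e. 10.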
augment(data, n=1, num_thread=1)
  • data (object/list) – Data for augmentation. It can be a list of data (e.g. list of strings or numpy arrays) or a single element (e.g. a string or a numpy array). The numpy format only supports audio or spectrogram data. For text data, only a string or a list of strings is supported.
  • n (int) – Default is 1. Number of unique augmented outputs. It is forced to 1 if the input is a list of data.
  • num_thread (int) – Number of threads for data augmentation. Use this option when you are using the CPU and n is larger than 1.

Returns: Augmented data

>>> augmented_data = aug.augment(data)
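
The input contract described for data and n can be sketched for text input as follows. This is an illustration of the documented behaviour only, not nlpaug's implementation; normalize_request is a hypothetical helper name.

```python
def normalize_request(data, n=1):
    """Hypothetical helper: normalise text input to (list_of_texts, effective_n)."""
    if isinstance(data, str):
        # A single string keeps the requested number of outputs.
        return [data], n
    if isinstance(data, list) and all(isinstance(item, str) for item in data):
        # Documented behaviour: n is forced to 1 when the input is a list.
        return data, 1
    raise TypeError("text input must be a string or a list of strings")
```

For example, normalize_request('some text', n=3) keeps n at 3, while passing a list of two strings with n=3 yields an effective n of 1.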
device = None

TODO: Reserving 2 spaces (e.g. for [CLS] and [SEP]) is not enough, as it hits a CUDA error in batch processing mode. Therefore, the reservation is forced to 5 times the reserved spaces (i.e. 5).