Augmenter that applies operations to textual input based on word embeddings.

class nlpaug.augmenter.word.word_embs.WordEmbsAug(model_type, model_path='.', model=None, action='substitute', name='WordEmbs_Aug', aug_min=1, aug_max=10, aug_p=0.3, top_k=100, n_gram_separator='_', stopwords=None, tokenizer=None, reverse_tokenizer=None, force_reload=False, stopwords_regex=None, verbose=0, skip_check=False)

Bases: nlpaug.augmenter.word.word_augmenter.WordAugmenter

Augmenter that leverages word embeddings to find the top n similar words for augmentation.
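The idea of ranking similar words and drawing one from the top of the list can be sketched in plain Python. This is only an illustration and not nlpaug's implementation: the toy vectors, the use of cosine similarity, and the helper names `cosine`, `top_k_similar` and `substitute` are all assumptions made for the example.

```python
import math
import random

# Toy 3-dimensional vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.0],
    "joyful": [0.7, 0.3, 0.1],
    "sad": [-0.9, 0.1, 0.0],
    "table": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(word, embeddings, top_k):
    """Return the other words, most similar first, cut to top_k (None keeps all)."""
    target = embeddings[word]
    scored = [
        (other, cosine(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    ranked = [other for other, _ in scored]
    return ranked if top_k is None else ranked[:top_k]


def substitute(word, embeddings, top_k, rng):
    """Pick a random replacement from the top_k most similar words."""
    return rng.choice(top_k_similar(word, embeddings, top_k))


candidates = top_k_similar("happy", TOY_EMBEDDINGS, top_k=2)
replacement = substitute("happy", TOY_EMBEDDINGS, top_k=2, rng=random.Random(0))
print(candidates)   # ['glad', 'joyful']
print(replacement)  # one of the two candidates
```

A larger `top_k` widens the pool that the random draw is taken from, which tends to give more varied but less similar replacements.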

  • model_type (str) – Model type of word embeddings. Expected values include ‘word2vec’, ‘glove’ and ‘fasttext’.
  • model_path (str) – Directory containing the downloaded model. Either model_path or model must be provided.
  • model (obj) – Pre-loaded model (e.g. model class is nlpaug.model.word_embs.nmw.Word2vec(), nlpaug.model.word_embs.nmw.Glove() or nlpaug.model.word_embs.nmw.Fasttext())
  • action (str) – Either ‘insert’ or ‘substitute’. If the value is ‘insert’, a new word is injected at a random position according to the word embeddings calculation. If the value is ‘substitute’, a word is replaced according to the word embeddings calculation.
  • top_k (int) – Controls the size of the candidate pool. The k highest-scoring tokens are used for augmentation; a larger k makes more tokens available. Default value is 100. If the value is None, all possible tokens are used. This attribute is ignored when using the “insert” action.
  • aug_p (float) – Percentage of words that will be augmented.
  • aug_min (int) – Minimum number of words that will be augmented.
  • aug_max (int) – Maximum number of words that will be augmented. If None is passed, the number of augmentations is calculated from aug_p. If the result calculated from aug_p is smaller than aug_max, the result from aug_p is used; otherwise, aug_max is used.
  • stopwords (list) – List of words that will be skipped by the augment operation.
  • stopwords_regex (str) – Regular expression matching words that will be skipped by the augment operation.
  • tokenizer (func) – Customize tokenization process
  • reverse_tokenizer (func) – Customize reverse of tokenization process
  • force_reload (bool) – If True, the model is reloaded every time, which makes initialization take longer.
  • skip_check (bool) – Default is False. If True, the size of the vocabulary embedding is not validated.
  • name (str) – Name of this augmenter
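The interaction between aug_p, aug_min and aug_max described in the list above can be sketched as a small function. This is a reading of the parameter descriptions, not the library's code: the function name `augment_count` is hypothetical, and rounding the percentage up with `math.ceil` is an assumption.

```python
import math


def augment_count(n_tokens, aug_p=0.3, aug_min=1, aug_max=10):
    """Number of words to augment for a text with n_tokens words."""
    # Assumption: the percentage-based count is rounded up.
    count = math.ceil(aug_p * n_tokens)
    # aug_min acts as a floor when it is set.
    if aug_min is not None and count < aug_min:
        return aug_min
    # aug_max acts as a cap; None means the aug_p result is used as is.
    if aug_max is not None and count > aug_max:
        return aug_max
    return count


print(augment_count(4, aug_p=0.5))                  # 2, taken from aug_p
print(augment_count(100, aug_p=0.5))                # 10, capped by aug_max
print(augment_count(100, aug_p=0.5, aug_max=None))  # 50, no cap
print(augment_count(2, aug_p=0.25, aug_min=3))      # 3, raised to aug_min
```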
>>> import nlpaug.augmenter.word as naw
>>> aug = naw.WordEmbsAug(model_type='word2vec', model_path='.')
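The tokenizer and reverse_tokenizer parameters accept functions. A minimal sketch of such a pair follows, assuming that the tokenizer takes a string and returns a list of tokens and that the reverse tokenizer takes that list and returns a string; these signatures, the regular expression, and the names `simple_tokenizer` and `simple_reverse_tokenizer` are assumptions for illustration.

```python
import re

# Words and single punctuation marks become separate tokens.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenizer(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def simple_reverse_tokenizer(tokens):
    """Join tokens with spaces, then drop the space before punctuation."""
    text = " ".join(tokens)
    return re.sub(r"\s+([^\w\s])", r"\1", text)


tokens = simple_tokenizer("The quick brown fox, again.")
restored = simple_reverse_tokenizer(tokens)
print(tokens)    # ['The', 'quick', 'brown', 'fox', ',', 'again', '.']
print(restored)  # The quick brown fox, again.

# Hypothetical wiring into the augmenter (requires a downloaded model):
# aug = naw.WordEmbsAug(model_type='word2vec', model_path='.',
#                       tokenizer=simple_tokenizer,
#                       reverse_tokenizer=simple_reverse_tokenizer)
```

Keeping the two functions as inverses of each other on typical input means that words left untouched by the augmenter come back in their original form.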
augment(data, n=1, num_thread=1)
  • data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or a numpy array). The numpy format is only supported for audio or spectrogram data. For text data, only a string or a list of strings is supported.
  • n (int) – Default is 1. Number of unique augmented outputs. Forced to 1 if the input is a list of data.
  • num_thread (int) – Number of threads for data augmentation. Use this option when you are using a CPU and n is larger than 1.

Returns: Augmented data

>>> augmented_data = aug.augment(data)