nlpaug.augmenter.word.synonym

Augmenter that applies substitutions based on semantic meaning to textual input.

class nlpaug.augmenter.word.synonym.SynonymAug(aug_src='wordnet', model_path=None, name='Synonym_Aug', aug_min=1, aug_max=10, aug_p=0.3, lang='eng', stopwords=None, tokenizer=None, reverse_tokenizer=None, stopwords_regex=None, force_reload=False, verbose=0)

Bases: nlpaug.augmenter.word.word_augmenter.WordAugmenter

Augmenter that leverages semantic meaning to substitute words.

Parameters:

aug_src (str) – Data source for synonyms. Supports ‘wordnet’ and ‘ppdb’.
model_path (str) – Path of the dictionary. Mandatory if PPDB is used as the data source.
lang (str) – Language of your text. Default value is ‘eng’. For wordnet, choose lang from the list at http://compling.hss.ntu.edu.sg/omw/. For ppdb, download the corresponding language pack from http://paraphrase.org/#/download.
aug_p (float) – Percentage of words that will be augmented.
aug_min (int) – Minimum number of words that will be augmented.
aug_max (int) – Maximum number of words that will be augmented. If None is passed, the number of augmented words is calculated from aug_p. Otherwise, the value calculated from aug_p is used when it is smaller than aug_max, and aug_max is used when it is not.
stopwords (list) – List of words that will be skipped during augmentation.
stopwords_regex (str) – Regular expression matching words that will be skipped during augmentation.
tokenizer (func) – Custom tokenization function.
reverse_tokenizer (func) – Custom function that reverses the tokenization process.
force_reload (bool) – Force the model to be reloaded into memory when the class is initialized. Default value is False; keeping it False is recommended when performance matters.
name (str) – Name of this augmenter.
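
The interaction between aug_p, aug_min and aug_max described above can be sketched in plain Python. This is only an illustration of the documented rule, not the library's code: the helper name estimate_aug_count, the rounding up of the percentage, and the final clamp to the sentence length are all assumptions.

```python
import math


def estimate_aug_count(num_tokens, aug_p=0.3, aug_min=1, aug_max=10):
    """Hypothetical helper: estimate how many words would be augmented."""
    # Count implied by the percentage (rounding up is an assumption).
    count = math.ceil(aug_p * num_tokens)
    # Raise the count to aug_min when a minimum is set.
    if aug_min is not None and count < aug_min:
        count = aug_min
    # Cap the count at aug_max; None means "no cap, use aug_p only".
    if aug_max is not None and count > aug_max:
        count = aug_max
    # Assumption: never ask for more words than the text contains.
    return min(count, num_tokens)
```

With the default aug_max=10, a 100-token text at aug_p=0.5 would be capped at 10 words under this reading, while passing aug_max=None would let the percentage alone decide.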

>>> import nlpaug.augmenter.word as naw
>>> aug = naw.SynonymAug()
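
The tokenizer, reverse_tokenizer, stopwords and stopwords_regex parameters all take plain Python values, so they can be prepared and checked before being passed to the constructor. The sketch below is a rough model of how such inputs might select the words eligible for substitution. The function names are hypothetical, and the case-insensitive stopword comparison and the use of a match from the start of each token are assumptions rather than confirmed library behaviour.

```python
import re


def whitespace_tokenizer(text):
    """Hypothetical tokenizer: split on runs of whitespace."""
    return text.split()


def whitespace_reverse_tokenizer(tokens):
    """Hypothetical reverse tokenizer: join tokens with single spaces."""
    return " ".join(tokens)


def candidate_indexes(tokens, stopwords=None, stopwords_regex=None):
    """Return the positions of tokens that are not excluded from augmentation."""
    # Assumption: stopwords are compared case-insensitively.
    stopword_set = {word.lower() for word in (stopwords or [])}
    pattern = re.compile(stopwords_regex) if stopwords_regex else None
    keep = []
    for index, token in enumerate(tokens):
        if token.lower() in stopword_set:
            continue
        # Assumption: the regex is matched from the start of the token.
        if pattern is not None and pattern.match(token):
            continue
        keep.append(index)
    return keep
```

Functions like these would be passed as tokenizer=whitespace_tokenizer and reverse_tokenizer=whitespace_reverse_tokenizer when constructing the augmenter.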

augment(data, n=1, num_thread=1)

Parameters:

data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or a numpy array). Numpy format is supported only for audio or spectrogram data. For text data, only a string or a list of strings is supported.
n (int) – Number of unique augmented outputs. Default is 1. Forced to 1 if the input is a list of data.
num_thread (int) – Number of threads for data augmentation. Use this option when you are running on CPU and n is larger than 1.

Returns:

Augmented data

>>> data = 'The quick brown fox jumps over the lazy dog'
>>> augmented_data = aug.augment(data)