nlpaug.augmenter.char.random

Augmenter that applies random character errors to textual input.

class nlpaug.augmenter.char.random.RandomCharAug(action='substitute', name='RandomChar_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3, aug_word_p=0.3, aug_word_min=1, aug_word_max=10, include_upper_case=True, include_lower_case=True, include_numeric=True, min_char=4, swap_mode='adjacent', spec_char='!@#$%^&*()_+', stopwords=None, tokenizer=None, reverse_tokenizer=None, verbose=0, stopwords_regex=None, candidiates=None)

Bases: nlpaug.augmenter.char.char_augmenter.CharAugmenter

Augmenter that generates character errors using random values. For example, people may mistakenly type o instead of i.

Parameters:
  • action (str) – Possible values are ‘insert’, ‘substitute’, ‘swap’ and ‘delete’. If the value is ‘insert’, a new character is inserted at a random position. If the value is ‘substitute’, a randomly chosen character is replaced with a random character. If the value is ‘swap’, characters within the same word are swapped randomly (adjacent characters by default; see swap_mode). If the value is ‘delete’, a randomly chosen character is removed.
  • aug_char_p (float) – Proportion of characters (per token) to augment.
  • aug_char_min (int) – Minimum number of characters to augment.
  • aug_char_max (int) – Maximum number of characters to augment. If None is passed, the number of augmentations is calculated from aug_char_p. If the result calculated from aug_char_p is smaller than aug_char_max, the calculated result is used; otherwise aug_char_max is used.
  • aug_word_p (float) – Proportion of words to augment.
  • aug_word_min (int) – Minimum number of words to augment.
  • aug_word_max (int) – Maximum number of words to augment. If None is passed, the number of augmentations is calculated from aug_word_p. If the result calculated from aug_word_p is smaller than aug_word_max, the calculated result is used; otherwise aug_word_max is used.
  • include_upper_case (bool) – If True, upper case characters may be included in the augmented data. If a ‘candidiates’ value is provided, this parameter is ignored.
  • include_lower_case (bool) – If True, lower case characters may be included in the augmented data. If a ‘candidiates’ value is provided, this parameter is ignored.
  • include_numeric (bool) – If True, numeric characters may be included in the augmented data. If a ‘candidiates’ value is provided, this parameter is ignored.
  • min_char (int) – Words shorter than this value are not drawn for augmentation.
  • swap_mode (str) – When action is ‘swap’, you may pass ‘adjacent’, ‘middle’ or ‘random’. ‘adjacent’ means the swap only considers adjacent characters (within the same word). ‘middle’ means the swap considers adjacent characters but excludes the first and last characters of the word. ‘random’ means the swap is executed without constraint.
  • spec_char (str) – Special characters that may be included in the augmented data. If a ‘candidiates’ value is provided, this parameter is ignored.
  • stopwords (list) – List of words to skip during augmentation.
  • stopwords_regex (str) – Regular expression matching words to skip during augmentation.
  • tokenizer (func) – Custom tokenization function.
  • reverse_tokenizer (func) – Custom function that reverses the tokenization.
  • candidiates (List) – List of strings used for augmentation, e.g. [‘AAA’, ‘11’, ‘===’]. If a value is provided, include_upper_case, include_lower_case, include_numeric and spec_char are ignored.
  • name (str) – Name of this augmenter.
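Two of the parameter interactions above can be pictured with a small standard-library sketch. It is not nlpaug's implementation: build_char_pool and augment_count are hypothetical helper names, and rounding the proportional count up is an assumption. The sketch only mirrors what the descriptions state, namely that candidiates overrides the include_* flags and spec_char, and that the count derived from the proportion is bounded by the minimum and maximum.

```python
import math
import string


def build_char_pool(include_upper_case=True, include_lower_case=True,
                    include_numeric=True, spec_char="!@#$%^&*()_+",
                    candidiates=None):
    """Return the replacement strings implied by the flags (illustrative only)."""
    # An explicit candidate list overrides every include_* flag and spec_char.
    if candidiates:
        return list(candidiates)
    pool = []
    if include_upper_case:
        pool += list(string.ascii_uppercase)
    if include_lower_case:
        pool += list(string.ascii_lowercase)
    if include_numeric:
        pool += list(string.digits)
    if spec_char:
        pool += list(spec_char)
    return pool


def augment_count(size, aug_p, aug_min=1, aug_max=10):
    """Number of items to augment out of `size` (illustrative only)."""
    # Assumption: the proportional count is rounded up.
    count = math.ceil(aug_p * size)
    # Cap at the maximum, unless None was passed to disable the cap.
    if aug_max is not None:
        count = min(count, aug_max)
    # Never go below the minimum.
    if aug_min is not None:
        count = max(count, aug_min)
    return count
```

Under these assumptions, augment_count(8, 0.5) gives 4, while augment_count(100, 0.5) is capped at 10 by the default maximum.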
>>> import nlpaug.augmenter.char as nac
>>> aug = nac.RandomCharAug()
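The three swap_mode values can be illustrated with a standard-library sketch. This is one reading of the parameter description, not the library's code: swap_partner and swap_chars are hypothetical names, and the choice to leave the word unchanged when no valid partner exists is an assumption.

```python
import random


def swap_partner(word_len, idx, mode, rng):
    """Pick the index to swap with `idx`, or None if no swap is allowed (illustrative only)."""
    if mode == "adjacent":
        # Any direct neighbour inside the word.
        options = [i for i in (idx - 1, idx + 1) if 0 <= i < word_len]
    elif mode == "middle":
        # Neighbours only, and neither position may be the first or last character.
        if not 1 <= idx <= word_len - 2:
            return None
        options = [i for i in (idx - 1, idx + 1) if 1 <= i <= word_len - 2]
    elif mode == "random":
        # Any other position in the word.
        options = [i for i in range(word_len) if i != idx]
    else:
        raise ValueError(f"unknown swap mode: {mode!r}")
    return rng.choice(options) if options else None


def swap_chars(word, idx, mode="adjacent", rng=None):
    """Swap the character at `idx` with a partner chosen according to `mode`."""
    rng = rng or random.Random()
    partner = swap_partner(len(word), idx, mode, rng)
    if partner is None:
        # Assumption: when no valid partner exists, the word is left unchanged.
        return word
    chars = list(word)
    chars[idx], chars[partner] = chars[partner], chars[idx]
    return "".join(chars)
```

In this sketch, swap_chars("abcdef", 0, "adjacent") can only produce "bacdef", and swap_chars("abcdef", 0, "middle") returns the word unchanged because the first character is excluded.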
augment(data, n=1, num_thread=1)
Parameters:
  • data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or a numpy array).
  • n (int) – Default is 1. Number of unique augmented outputs. Forced to 1 if the input is a list of data.
  • num_thread (int) – Number of threads for data augmentation. Use this option when running on CPU and n is larger than 1.
Returns:

Augmented data

>>> data = 'The quick brown fox jumps over the lazy dog'
>>> augmented_data = aug.augment(data)
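The meaning of n can be sketched without the library. The code below is a rough illustration only: augment_n and make_toy_augmenter are hypothetical names, the retry cap max_attempts is an assumption, and the toy augmenter merely stands in for a randomized augmentation call. It shows one way to gather up to n distinct outputs for a single input, and the documented rule that list input yields one output per element.

```python
import random


def make_toy_augmenter(seed=0):
    """Build a stand-in augmenter that replaces one random character with 'x'."""
    rng = random.Random(seed)

    def augment_one(text):
        idx = rng.randrange(len(text))
        return text[:idx] + "x" + text[idx + 1:]

    return augment_one


def augment_n(augment_one, data, n=1, max_attempts=50):
    """Collect up to n distinct outputs from a randomized function (illustrative only)."""
    if isinstance(data, list):
        # As documented, n is forced to 1 for list input: one output per element.
        return [augment_one(item) for item in data]
    results = []
    # Assumption: retry a bounded number of times until n distinct outputs are found.
    for _ in range(max_attempts):
        candidate = augment_one(data)
        if candidate not in results:
            results.append(candidate)
        if len(results) == n:
            break
    return results
```

Calling augment_n(make_toy_augmenter(), "hello", n=3) returns a list of distinct five-character strings, while passing a list of two strings returns two outputs regardless of n.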