Augmenter that applies typo error simulation to textual input.
KeyboardAug(name='Keyboard_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3, aug_word_p=0.3, aug_word_min=1, aug_word_max=10, stopwords=None, tokenizer=None, reverse_tokenizer=None, include_special_char=True, include_numeric=True, include_upper_case=True, lang='en', verbose=0, stopwords_regex=None, model_path=None, min_char=4)
Augmenter that simulates typing errors by random substitution. For example, people may incorrectly type i as o. A keyboard distance of one is used to replace a character with a plausible keyboard error, i.e. a neighbouring key.
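As a rough illustration of the idea, the sketch below replaces one character with a key one step away on a QWERTY layout. The NEIGHBORS map and the keyboard_typo helper are hypothetical and cover only a few keys; the augmenter itself loads its own keyboard model for the selected language.

```python
import random

# Hypothetical, partial QWERTY adjacency map (illustration only).
NEIGHBORS = {
    "i": "uojk",
    "o": "ipkl",
    "u": "yihj",
    "n": "bmhj",
    "t": "ryfg",
    "p": "ol",
}


def keyboard_typo(word, rng):
    """Replace one character of `word` with a key adjacent to it."""
    # Only characters present in the map can be replaced.
    positions = [i for i, ch in enumerate(word) if ch in NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBORS[word[i]])
    return word[:i] + replacement + word[i + 1:]


rng = random.Random(0)  # seeded so repeated runs give the same typo
typo = keyboard_typo("input", rng)
print(typo)
```

The result always has the same length as the input and differs from it in exactly one position, because a key never appears in its own neighbour list.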
- aug_char_p (float) – Percentage of characters (per token) that will be augmented.
- aug_char_min (int) – Minimum number of characters that will be augmented.
- aug_char_max (int) – Maximum number of characters that will be augmented. If None is passed, the number of augmented characters is calculated from aug_char_p. Otherwise, if the value calculated from aug_char_p is smaller than aug_char_max, the calculated value is used; if it is larger, aug_char_max is used.
- aug_word_p (float) – Percentage of words that will be augmented.
- aug_word_min (int) – Minimum number of words that will be augmented.
- aug_word_max (int) – Maximum number of words that will be augmented. If None is passed, the number of augmented words is calculated from aug_word_p. Otherwise, if the value calculated from aug_word_p is smaller than aug_word_max, the calculated value is used; if it is larger, aug_word_max is used.
- stopwords (list) – List of words that will be excluded from augmentation.
- stopwords_regex (str) – Regular expression matching words that will be excluded from augmentation.
- tokenizer (func) – Custom tokenization function.
- reverse_tokenizer (func) – Custom function that reverses the tokenization process.
- include_special_char (bool) – If True, special characters may be included in augmented data.
- include_upper_case (bool) – If True, upper case characters may be included in augmented data.
- include_numeric (bool) – If True, numeric characters may be included in augmented data.
- min_char (int) – Words shorter than this value are not drawn for augmentation.
- model_path (str) – Path to a custom model to load from the file system.
- lang (str) – Indicates the built-in language model. Default value is ‘en’. Possible values are ‘en’, ‘th’ (Thai), ‘tr’ (Turkish), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), ‘it’ (Italian), ‘nl’ (Dutch), ‘pl’ (Polish), ‘uk’ (Ukrainian), ‘he’ (Hebrew). If a custom model is used (by passing model_path), this value is ignored.
- name (str) – Name of this augmenter
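The interaction between aug_char_p, aug_char_min and aug_char_max (and the aug_word_* equivalents) can be sketched as a clamp. This is a minimal sketch of the behaviour described in the parameter list, not the library's code: augment_count is a hypothetical helper, and rounding the percentage up is an assumption.

```python
import math


def augment_count(size, aug_p, aug_min, aug_max):
    """Number of items to augment out of `size` characters or words."""
    # Assumption: the percentage-based count is rounded up.
    count = math.ceil(aug_p * size)
    if aug_min is not None and count < aug_min:
        return aug_min
    if aug_max is not None and count > aug_max:
        return aug_max
    return count


# 8-character token, 30%: ceil(2.4) = 3, within [1, 10].
print(augment_count(8, 0.3, 1, 10))      # 3
# 40 words, 50%: 20 exceeds the maximum, so the maximum is used.
print(augment_count(40, 0.5, 1, 10))     # 10
# With no maximum (None), only the percentage applies.
print(augment_count(40, 0.5, 1, None))   # 20
```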
>>> import nlpaug.augmenter.char as nac
>>> aug = nac.KeyboardAug()
augment(data, n=1, num_thread=1)
- data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or a numpy array). The numpy format is supported only for audio or spectrogram data. For text data, only a string or a list of strings is supported.
- n (int) – Default is 1. Number of unique augmented outputs. Forced to 1 if the input is a list of data.
- num_thread (int) – Number of threads used for data augmentation. Use this option when running on CPU and n is larger than 1.
>>> augmented_data = aug.augment(data)