Augmenter that applies typo error simulation to textual input.
KeyboardAug(name='Keyboard_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3, aug_word_p=0.3, aug_word_min=1, aug_word_max=10, stopwords=None, tokenizer=None, reverse_tokenizer=None, include_special_char=True, include_numeric=True, include_upper_case=True, lang='en', verbose=0, stopwords_regex=None, model_path=None, min_char=4)
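Augmenter that simulates typing errors by randomly substituting characters. For example, people may type o instead of i, because the two keys are adjacent. Each selected character is replaced by a character within a keyboard distance of one, that is, a key that could plausibly be hit by mistake.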
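The substitution idea can be illustrated with a minimal, self-contained sketch. It does not use the library's internal model: the `NEIGHBOURS` map and the `keyboard_typo` helper are hypothetical names, the map covers only a few QWERTY keys, and rounding the character count up is an assumption.

```python
import math
import random

# Hypothetical, partial QWERTY neighbour map (illustration only).
# Each key maps to keys at a keyboard distance of one.
NEIGHBOURS = {
    "i": ["u", "o", "j", "k"],
    "n": ["b", "m", "h", "j"],
    "p": ["o", "l"],
    "u": ["y", "i", "h", "j"],
    "t": ["r", "y", "f", "g"],
}


def keyboard_typo(word, rng, aug_char_p=0.3):
    """Replace a fraction of the characters in `word` with adjacent keys."""
    chars = list(word)
    # Only characters with known neighbours can be substituted.
    candidates = [i for i, c in enumerate(chars) if c in NEIGHBOURS]
    if not candidates:
        return word
    # Assumption: the character count is rounded up and is at least 1.
    n = min(len(candidates), max(1, math.ceil(aug_char_p * len(chars))))
    for i in rng.sample(candidates, n):
        chars[i] = rng.choice(NEIGHBOURS[chars[i]])
    return "".join(chars)


rng = random.Random(0)  # seeded so that repeated runs give the same typo
print(keyboard_typo("input", rng))
```

Passing a seeded `random.Random` keeps the sketch reproducible; words containing no mapped key are returned unchanged.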
- aug_char_p (float) – Percentage of characters (per token) that will be augmented.
- aug_char_min (int) – Minimum number of characters that will be augmented.
- aug_char_max (int) – Maximum number of characters that will be augmented. If None is passed, the number of augmentations is calculated from aug_char_p. If the result calculated from aug_char_p is smaller than aug_char_max, the calculated result is used; otherwise aug_char_max is used.
- aug_word_p (float) – Percentage of words that will be augmented.
- aug_word_min (int) – Minimum number of words that will be augmented.
- aug_word_max (int) – Maximum number of words that will be augmented. If None is passed, the number of augmentations is calculated from aug_word_p. If the result calculated from aug_word_p is smaller than aug_word_max, the calculated result is used; otherwise aug_word_max is used.
- stopwords (list) – List of words that will be excluded from the augment operation.
- stopwords_regex (str) – Regular expression matching words that will be excluded from the augment operation.
- tokenizer (func) – Custom function that replaces the default tokenization process.
- reverse_tokenizer (func) – Custom function that reverses the tokenization process (joins tokens back into text).
- include_special_char (bool) – If True, special characters may be included in augmented data.
- include_upper_case (bool) – If True, upper-case characters may be included in augmented data.
- include_numeric (bool) – If True, numeric characters may be included in augmented data.
- min_char (int) – Words shorter than this number of characters are not selected for augmentation.
- model_path (str) – Path to a custom model to load from the file system.
- lang (str) – Indicates the built-in language model. Default value is ‘en’. Possible values are ‘en’ and ‘th’. If a custom model is used (by passing model_path), this value is ignored.
- name (str) – Name of this augmenter
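The interaction between the percentage, minimum, and maximum parameters can be sketched as follows. This is an illustration of the rule the parameter descriptions imply, not the library's code: the helper name `aug_count` is hypothetical, and rounding the percentage result up is an assumption.

```python
import math


def aug_count(size, aug_p, aug_min=1, aug_max=10):
    """Number of items (characters or words) to augment out of `size`.

    Hypothetical helper: the count is derived from `aug_p`, then clamped
    to the range [aug_min, aug_max]. Passing None disables a bound.
    """
    cnt = math.ceil(aug_p * size)  # assumption: round up
    if aug_min is not None and cnt < aug_min:
        return aug_min
    if aug_max is not None and cnt > aug_max:
        return aug_max
    return cnt


# 10 words at aug_word_p=0.5: the percentage result is inside the bounds.
print(aug_count(10, 0.5))
# 100 words at aug_word_p=0.3: the percentage result exceeds aug_word_max.
print(aug_count(100, 0.3, aug_min=1, aug_max=10))
# With aug_max=None the percentage alone decides.
print(aug_count(100, 0.25, aug_max=None))
```

The same clamping applies at both levels: once per text with the aug_word_* values to choose how many words to alter, and once per chosen word with the aug_char_* values.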
>>> import nlpaug.augmenter.char as nac
>>> aug = nac.KeyboardAug()
augment(data, n=1, num_thread=1)
- data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or a numpy array).
- n (int) – Default is 1. Number of unique augmented outputs. Forced to 1 if the input is a list of data.
- num_thread (int) – Number of threads for data augmentation. Use this option when you are running on CPU and n is larger than 1.
>>> augmented_data = aug.augment(data)
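A slightly fuller usage sketch is shown below. It uses only the constructor and `augment` parameters documented above and needs nlpaug to be installed. The helpers `as_list` and `demo` are hypothetical. Treating the return value as either a string or a list is an assumption, since the return type for a single input appears to differ between releases.

```python
def as_list(result):
    """Normalise augment() output to a list of strings.

    Assumption: depending on the installed release, a single input may
    come back as a plain string or as a list of strings.
    """
    return [result] if isinstance(result, str) else list(result)


def demo():
    # Imported here so the helper above stays usable without nlpaug.
    import nlpaug.augmenter.char as nac

    # Alter roughly 30% of the words, skip special characters and digits.
    aug = nac.KeyboardAug(
        aug_char_p=0.3,
        aug_word_p=0.3,
        include_special_char=False,
        include_numeric=False,
    )
    text = "The quick brown fox jumps over the lazy dog"
    # n=3 requests up to three distinct augmented variants of one string.
    return as_list(aug.augment(text, n=3))
```

Calling `demo()` returns the augmented variants as a list; the output is random, so it differs between runs.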