Augmenter that applies operations to textual input based on word embeddings.
WordEmbsAug(model_type, model_path='.', model=None, action='substitute', name='WordEmbs_Aug', aug_min=1, aug_max=10, aug_p=0.3, top_k=100, n_gram_separator='_', stopwords=None, tokenizer=None, reverse_tokenizer=None, force_reload=False, stopwords_regex=None, verbose=0)
Augmenter that leverages word embeddings to find the top n similar words for augmentation.
- model_type (str) – Model type of word embeddings. Expected values include ‘word2vec’, ‘glove’ and ‘fasttext’.
- model_path (str) – Downloaded model directory. Either model_path or model must be provided.
- model (obj) – Pre-loaded model
- action (str) – Either ‘insert’ or ‘substitute’. If the value is ‘insert’, a new word is injected at a random position according to the word embeddings calculation. If the value is ‘substitute’, a word is replaced according to the word embeddings calculation.
- top_k (int) – Controls the size of the candidate pool. Only the top k scoring tokens are used for augmentation; the larger k is, the more tokens can be used. Default value is 100. If the value is None, all possible tokens are used.
- aug_p (float) – Percentage of words that will be augmented.
- aug_min (int) – Minimum number of words that will be augmented.
- aug_max (int) – Maximum number of words that will be augmented. If None is passed, the number of augmentations is calculated from aug_p. If the result calculated from aug_p is smaller than aug_max, the calculated result is used; otherwise aug_max is used.
- stopwords (list) – List of words that will be skipped during augmentation.
- stopwords_regex (str) – Regular expression matching words that will be skipped during augmentation.
- tokenizer (func) – Customize tokenization process
- reverse_tokenizer (func) – Customize reverse of tokenization process
- force_reload (bool) – If True, the model is reloaded every time, which makes initialization take longer.
- name (str) – Name of this augmenter
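The interaction of aug_p, aug_min and aug_max described above can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `augment_count` is hypothetical, and rounding the percentage up with `math.ceil` is an assumption.

```python
import math


def augment_count(n_tokens, aug_p=0.3, aug_min=1, aug_max=10):
    """Hypothetical helper: how many words to augment in a text of n_tokens words."""
    # Assumption: the percentage is rounded up to a whole number of words.
    count = math.ceil(aug_p * n_tokens)
    # Never go below aug_min (when it is set).
    if aug_min is not None and count < aug_min:
        count = aug_min
    # Never go above aug_max; with aug_max=None only aug_p decides.
    if aug_max is not None and count > aug_max:
        count = aug_max
    return count


print(augment_count(8, aug_p=0.5))                   # percentage result is used
print(augment_count(100, aug_p=0.5))                 # capped by aug_max
print(augment_count(100, aug_p=0.5, aug_max=None))   # no cap
```

With the defaults, short inputs are governed by aug_min and long inputs by aug_max; aug_p only decides the count in between.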
>>> import nlpaug.augmenter.word as naw
>>> aug = naw.WordEmbsAug(model_type='word2vec', model_path='.')
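To make the role of top_k more concrete, here is a small sketch of a substitution step over toy embeddings. Everything in it is illustrative: the two-dimensional vectors are made up, the helper names `top_k_candidates` and `substitute` are hypothetical, and ranking by cosine similarity followed by a uniform random draw is an assumption about how a candidate pool might be used, not a description of the library's internals.

```python
import math
import random

# Toy 2-d vectors, invented for this example; real models use far more dimensions.
TOY_VECTORS = {
    "quick": (0.9, 0.1),
    "fast": (0.85, 0.2),
    "rapid": (0.8, 0.25),
    "slow": (-0.7, 0.3),
    "banana": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_candidates(word, vectors, top_k=None):
    """Rank all other words by similarity to `word`; keep the best top_k (all if None)."""
    target = vectors[word]
    scored = sorted(
        ((cosine(target, vec), other) for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    ranked = [other for _, other in scored]
    return ranked if top_k is None else ranked[:top_k]


def substitute(word, vectors, top_k=None, rng=random):
    """Assumption: pick uniformly at random from the candidate pool."""
    return rng.choice(top_k_candidates(word, vectors, top_k))


print(top_k_candidates("quick", TOY_VECTORS, top_k=2))
print(substitute("quick", TOY_VECTORS, top_k=2, rng=random.Random(0)))
```

A small top_k keeps replacements close in meaning to the original word, while a large top_k (or None) admits more distant words into the pool.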
augment(data, n=1, num_thread=1)
- data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or numpy array).
- n (int) – Default is 1. Number of unique augmented outputs. It is forced to 1 if the input is a list of data.
- num_thread (int) – Number of threads for data augmentation. Use this option when you are running on CPU and n is larger than 1.
>>> augmented_data = aug.augment(data)
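The behaviour of n for single and list inputs, as described in the parameter list above, can be illustrated with a stand-in augmenter. This is only a sketch under assumptions: `augment_many` and `toy_augment` are hypothetical names, the synonym table is made up, and the retry limit `max_attempts` is an illustrative choice rather than a library parameter.

```python
import random

_rng = random.Random(0)
# Made-up neighbours standing in for word-embedding lookups.
_SYNONYMS = {"quick": ["fast", "rapid", "speedy"], "fox": ["wolf", "dog"]}


def toy_augment(text):
    """Stand-in for one augmentation pass: swap one known word for a neighbour."""
    words = text.split()
    positions = [i for i, w in enumerate(words) if w in _SYNONYMS]
    if positions:
        i = _rng.choice(positions)
        words[i] = _rng.choice(_SYNONYMS[words[i]])
    return " ".join(words)


def augment_many(augment_once, data, n=1, max_attempts=50):
    """Hypothetical helper mirroring the documented meaning of n."""
    if isinstance(data, list):
        # For list input, n is forced to 1: one output per element.
        return [augment_once(item) for item in data]
    unique = []
    for _ in range(max_attempts):
        candidate = augment_once(data)
        if candidate not in unique:
            unique.append(candidate)
        if len(unique) == n:
            break
    return unique


print(augment_many(toy_augment, "the quick fox", n=3))
print(augment_many(toy_augment, ["the quick fox", "a quick dog"], n=3))
```

Because outputs must be unique, fewer than n results may come back when the input offers little room for variation.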