{"id":146,"date":"2023-03-17T11:50:31","date_gmt":"2023-03-17T11:50:31","guid":{"rendered":"https:\/\/smartsource.com.sg\/blog\/?p=146"},"modified":"2023-03-17T11:50:31","modified_gmt":"2023-03-17T11:50:31","slug":"data-augmentation-techniques-suitable-for-text-datasets","status":"publish","type":"post","link":"https:\/\/smartsource.com.sg\/blog\/index.php\/2023\/03\/17\/data-augmentation-techniques-suitable-for-text-datasets\/","title":{"rendered":"Data Augmentation techniques suitable for text datasets"},"content":{"rendered":"\n<p>Besides back-translation, there are several data augmentation techniques you can try on text data. Here are some examples:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Synonym Replacement: Replace words in the text with their synonyms. You can use the <code>WordNet<\/code> corpus in NLTK to find synonyms.<\/li>\n\n\n\n<li>Random Insertion: Insert words at random positions in the text. The example below re-inserts words drawn from the text itself.<\/li>\n\n\n\n<li>Random Deletion: Delete each word from the text with a fixed probability.<\/li>\n\n\n\n<li>Random Swap: Swap two randomly chosen words in the text, repeated a set number of times.<\/li>\n<\/ol>\n\n\n\n<p>Here&#8217;s example code that implements these data augmentation techniques:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\r\nimport random\r\nimport nltk\r\nfrom nltk.corpus import stopwords, wordnet\r\n\r\n# Download the WordNet and stop-word corpora used below\r\nnltk.download('wordnet')\r\nnltk.download('stopwords')\r\n\r\n# English stop words are skipped during synonym replacement\r\nstop_words = set(stopwords.words('english'))\r\n\r\n# Load the CSV file into a Pandas DataFrame\r\ndf = pd.read_csv('sample.csv')\r\n\r\n# Select the column containing the text data\r\ntext_column = 'text'\r\n\r\n# Create a list of the text data in the selected column (missing values become empty strings)\r\ntext_data = df&#91;text_column].fillna('').astype(str).tolist()\r\n\r\n# Define data augmentation functions\r\ndef synonym_replacement(text, n=1):\r\n    words = text.split()\r\n    new_words = words.copy()\r\n    # Candidate words: unique, non-stop words (stop-word check is case-insensitive)\r\n    random_word_list = list(set(&#91;word for word in words if word.lower() not in stop_words]))\r\n    random.shuffle(random_word_list)\r\n    num_replaced = 0\r\n    for random_word in random_word_list:\r\n     
   synonyms = get_synonyms(random_word)\r\n        if synonyms:\r\n            synonym = random.choice(synonyms)\r\n            new_words = &#91;synonym if word == random_word else word for word in new_words]\r\n            num_replaced += 1\r\n        if num_replaced >= n:\r\n            break\r\n    return ' '.join(new_words)\r\n\r\ndef random_insertion(text, n=1):\r\n    # Insert n words, each drawn from the text itself, at random positions\r\n    words = text.split()\r\n    if not words:\r\n        return text\r\n    new_words = words.copy()\r\n    for _ in range(n):\r\n        random_word = random.choice(words)\r\n        # randint is inclusive, so len(new_words) allows inserting at the end\r\n        new_words.insert(random.randint(0, len(new_words)), random_word)\r\n    return ' '.join(new_words)\r\n\r\ndef random_deletion(text, p=0.2):\r\n    # Drop each word independently with probability p\r\n    words = text.split()\r\n    new_words = &#91;word for word in words if random.random() > p]\r\n    # Keep one word rather than return an empty string\r\n    if words and not new_words:\r\n        new_words = &#91;random.choice(words)]\r\n    return ' '.join(new_words)\r\n\r\ndef random_swap(text, n=1):\r\n    # Swap two randomly chosen words, n times\r\n    words = text.split()\r\n    new_words = words.copy()\r\n    # Swapping needs at least two words; shorter texts are returned unchanged\r\n    if len(new_words) >= 2:\r\n        for _ in range(n):\r\n            idx1, idx2 = random.sample(range(len(new_words)), 2)\r\n            new_words&#91;idx1], new_words&#91;idx2] = new_words&#91;idx2], new_words&#91;idx1]\r\n    return ' '.join(new_words)\r\n\r\ndef get_synonyms(word):\r\n    # Collect WordNet lemma names for the word, excluding the word itself\r\n    synonyms = set()\r\n    for syn in wordnet.synsets(word):\r\n        for lemma in syn.lemmas():\r\n            name = lemma.name().replace('_', ' ')\r\n            if name.lower() != word.lower():\r\n                synonyms.add(name)\r\n    # Return a sorted list because random.choice needs a sequence, not a set\r\n    return sorted(synonyms)\r\n\r\n# Apply data augmentation to the text data\r\nnew_text_data = &#91;]\r\nfor text in text_data:\r\n    new_text = synonym_replacement(text)  # or random_insertion(text), random_deletion(text), or random_swap(text)\r\n    new_text_data.append(new_text)\r\n\r\n# Copy the original DataFrame and add the generated text as a new column\r\nnew_df = df.copy()\r\nnew_df&#91;'new_text'] = new_text_data\r\n\r\n# Write the new DataFrame to a CSV 
file\r\nnew_df.to_csv('new_sample.csv', index=False)\r\n\r\nprint('New CSV file with generated data saved successfully.')\r\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>There are several data augmentation techniques suitable for text data that you can try besides back-translation. Here are some examples: Here&#8217;s an example code that implements these data augmentation techniques:<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[19],"tags":[72,110,64,114,112,113,111],"class_list":["post-146","post","type-post","status-publish","format-standard","hentry","category-tutorials","tag-data-augmentation","tag-nltk","tag-python","tag-random-addition","tag-random-deletion","tag-random-swap","tag-synonym-replacement"],"_links":{"self":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts\/146","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=146"}],"version-history":[{"count":1,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts\/146\/revisions"}],"predecessor-version":[{"id":147,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/posts\/146\/revisions\/147"}],"wp:attachment":[{"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=146"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=146"},{"taxonomy":"post_tag","embeddable"
:true,"href":"https:\/\/smartsource.com.sg\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=146"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}