Data Augmentation using Back Translation

If you have a CSV file with only a few rows and columns and want to generate a larger dataset, back translation is a good fit: each text is translated into another language and back again, which yields a paraphrase of the original. Here is the code to keep all columns from your sample CSV and create a new CSV file that includes the generated data.

import pandas as pd


def back_translation(text):
    # Placeholder (assumption): returns the text unchanged so the script runs.
    # Replace the body with a real round trip, e.g. English -> French -> English.
    return text


# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('sample.csv')

# Select a random sample of up to 10 rows (fewer if the file is smaller)
sample_df = df.sample(n=min(10, len(df)))

# Generate new data from the 'text' column (rename to match your CSV)
new_text_data = []
for text in sample_df['text']:
    # Apply your data augmentation technique here
    # For example, you can use back-translation
    new_text = back_translation(text)
    new_text_data.append(new_text)

# Copy every original column and add the generated text as a new column
new_df = sample_df.copy()
new_df['new_text'] = new_text_data

# Write the new DataFrame to a CSV file
new_df.to_csv('new_sample.csv', index=False)

print('New CSV file with generated data saved successfully.')

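In this code, we first load the CSV file into a Pandas DataFrame and select a random sample of rows (10 here) using the sample() method. Then, we generate new data using your data augmentation technique of choice. In this example, I’ve used a back_translation() function. Note that it is not part of pandas: you have to supply it yourself, typically by translating each text into another language and back again.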
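Since back_translation() has to come from somewhere, here is a minimal sketch of how one might be put together. It assumes you already have two translation callables, one into a pivot language and one back; make_back_translator, toy_en_to_fr and toy_fr_to_en are hypothetical names made up for this illustration, and the toy dictionaries only stand in for a real translation model or service.

```python
from typing import Callable

Translator = Callable[[str], str]


def make_back_translator(to_pivot: Translator, from_pivot: Translator) -> Translator:
    """Build a one-argument back_translation(text) from two translators."""

    def back_translation(text: str) -> str:
        # Leave empty or non-string cells (e.g. NaN from pandas) untouched
        if not isinstance(text, str) or not text.strip():
            return text
        pivot_text = to_pivot(text)    # source language -> pivot language
        return from_pivot(pivot_text)  # pivot language -> source language

    return back_translation


# Toy word-for-word "translators", for illustration only (assumption:
# in practice these would call a real translation model or service)
EN_TO_FR = {"hello": "bonjour", "world": "monde"}
FR_TO_EN = {"bonjour": "hi", "monde": "world"}


def toy_en_to_fr(text: str) -> str:
    return " ".join(EN_TO_FR.get(word, word) for word in text.lower().split())


def toy_fr_to_en(text: str) -> str:
    return " ".join(FR_TO_EN.get(word, word) for word in text.split())


back_translation = make_back_translator(toy_en_to_fr, toy_fr_to_en)
print(back_translation("Hello world"))  # hi world
```

The round trip changes "hello" into "hi" while keeping the meaning, which is the kind of paraphrase back translation is meant to produce. Swapping in real translators only requires passing two different callables.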

Next, we create a new DataFrame with the original columns from the sample DataFrame and a new column containing the generated text data. We then write the new DataFrame to a CSV file using the to_csv() method, and specify index=False to exclude the index column from the CSV file.
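The code above adds the generated text as an extra column, so the number of rows stays the same. If the goal is a dataset with more rows, one option is to stack augmented copies underneath the originals. The following is only a sketch under a few assumptions: append_augmented_rows and the is_augmented column are hypothetical names, and str.upper stands in for a real back_translation() function so the example runs on its own.

```python
import pandas as pd


def append_augmented_rows(df, text_column, augment):
    """Return the original rows followed by augmented copies of them."""
    original = df.copy()
    original["is_augmented"] = False

    augmented = df.copy()
    # Rewrite only the text column; all other columns (e.g. labels) are kept
    augmented[text_column] = augmented[text_column].map(augment)
    augmented["is_augmented"] = True

    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat([original, augmented], ignore_index=True)


toy_df = pd.DataFrame({"text": ["good movie", "bad plot"], "label": [1, 0]})

# str.upper is a stand-in for back_translation (assumption for this demo)
bigger_df = append_augmented_rows(toy_df, "text", str.upper)
print(len(toy_df), "->", len(bigger_df))  # 2 -> 4
```

Keeping a flag column such as is_augmented makes it easy to filter the generated rows out again, for example when building a validation set from original data only.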

Finally, we print a message to confirm that the new CSV file has been saved successfully. You’ll need to replace 'sample.csv' and 'new_sample.csv' with your own file names, change 'text' to the name of the column you want to augment, and modify the data augmentation technique as needed.
