Synthetic Data Generation Study

Dr. K Sree Kumar, Sarthak Sethi, Aaishik Dutta

Abstract

Synthetic data is analytically generated data that can act as a proxy for real data. It imitates real data according to parameters set by the user rather than being obtained through direct measurement, which makes it anonymous. Synthetic data has many advantages. It allows businesses of every size to work on models powered by deep data sets, which democratizes machine learning. It reduces cost and increases efficiency, because data can be created on demand rather than by waiting for the events that produce the data set to occur. One important advantage is that synthetic data can be used to generate all possible cases of a situation, so that the machine learns all dimensions of the problem. Using synthetic data also avoids privacy issues. Synthetic data does, however, face one major disadvantage: the reaction of the generator model to new data. When new data is added to the original set, current techniques tend to overlook the new behaviour shown by the dataset, because the model does not incorporate the new additions. A workaround is to build a new model from the original dataset combined with the new additions, but this is time consuming. This paper investigates the mechanics of different synthetic data generation techniques and proposes an efficient method for generating data after a dataset update.