Synthetic Data: Fake It Until You Make It, or Faking It Is Making It

Jan 05, 2026

There’s a fundamental tension in modern analytics: the data you need to train custom models is often the exact data you can’t expose. Client financials. Transaction patterns. Behavioral signals across your portfolio. The gold data that would let you build proprietary ML models capturing your domain expertise is locked behind NDAs and privacy regulations. But you don’t need the real data to train on. You need data that behaves like the real data.

The Synthetic Data Vault (SDV) is an open-source Python library that uses machine learning to learn statistical patterns from real data and generate synthetic datasets that preserve those patterns. Developed at MIT’s Data to AI Lab starting in 2016 and open-sourced in 2018, SDV offers several synthesizers, including GaussianCopula, which models the relationships between variables, and CTGAN, a generative adversarial network that handles categorical variables and complex distributions.
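To make the copula idea concrete, here is a minimal sketch of a two-column Gaussian copula using only the Python standard library. It is not SDV's implementation, and the function names (`fit_copula`, `sample_copula`) and the toy income/spend columns are invented for illustration. The approach: convert each column to normal scores by rank, measure how those scores correlate, draw new correlated normals, and map them back through each column's observed distribution.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, std dev 1


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def normal_scores(values):
    """Replace each value with the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def empirical_quantile(sorted_values, u):
    """Value at quantile u, interpolating between neighboring observations."""
    pos = u * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def fit_copula(xs, ys):
    """Learn the dependence (rho) and keep each column's sorted values."""
    rho = pearson(normal_scores(xs), normal_scores(ys))
    return {"rho": rho, "xs": sorted(xs), "ys": sorted(ys)}


def sample_copula(model, n, rng):
    """Draw n synthetic (x, y) rows from the fitted model."""
    rho = model["rho"]
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # z2 is standard normal with correlation rho to z1
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(model["xs"], u1),
                     empirical_quantile(model["ys"], u2)))
    return rows


# Toy stand-in for "real" data (hypothetical columns, randomly generated)
rng = random.Random(0)
real_income = [rng.lognormvariate(10, 0.5) for _ in range(500)]
real_spend = [0.3 * x + rng.gauss(0, 2000) for x in real_income]

model = fit_copula(real_income, real_spend)
synthetic = sample_copula(model, 2000, rng)
syn_income, syn_spend = zip(*synthetic)

print("real correlation:     ", round(pearson(real_income, real_spend), 3))
print("synthetic correlation:", round(pearson(syn_income, syn_spend), 3))
```

A production synthesizer handles many columns, categorical fields, missing values, and constraints, none of which this sketch attempts. It also makes no privacy guarantee on its own; it only shows how dependence between columns can be learned and replayed.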

The synthetic output is designed to preserve the correlations, distributions, and rare edge cases of your source data without copying any actual records. That means you can train classification, regression, anomaly-detection, and other models on synthetic representations of sensitive data. Built-in evaluation tools let you validate statistical fidelity before using the output for model training.
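As a rough illustration of what validating statistical fidelity can mean, the sketch below compares a real and a synthetic table on two axes: how closely each column's distribution matches (a two-sample Kolmogorov-Smirnov statistic) and how closely the correlation between columns matches. This is a hand-rolled stand-in, not SDV's evaluation tooling; `fidelity_report` and the toy data generator are assumptions made for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between two empirical CDFs (0 = identical, 1 = disjoint)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fidelity_report(real_rows, synth_rows):
    """Compare two lists of (x, y) rows on marginals and on correlation."""
    rx, ry = zip(*real_rows)
    sx, sy = zip(*synth_rows)
    return {
        "ks_x": ks_statistic(rx, sx),
        "ks_y": ks_statistic(ry, sy),
        "corr_gap": abs(pearson(rx, ry) - pearson(sx, sy)),
    }


rng = random.Random(1)


def draw_rows(n):
    """Toy generator: y depends strongly on x."""
    rows = []
    for _ in range(n):
        x = rng.gauss(50, 10)
        rows.append((x, 2 * x + rng.gauss(0, 5)))
    return rows


real = draw_rows(1000)

# "Good" synthetic data: fresh rows from the same process
good_synth = draw_rows(1000)

# "Bad" synthetic data: same column values, but y shuffled independently of x
shuffled_y = [y for _, y in real]
rng.shuffle(shuffled_y)
bad_synth = [(x, y) for (x, _), y in zip(real, shuffled_y)]

good_report = fidelity_report(real, good_synth)
bad_report = fidelity_report(real, bad_synth)
print("good:", good_report)
print("bad: ", bad_report)
```

The shuffled table is the instructive case: every column matches the original perfectly on its own, yet the relationship between columns is gone. Checking marginals alone would miss that, which is why a fidelity check should look at pairwise relationships as well.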

For teams sitting on years of proprietary financial data, this unlocks real capability: train models that encode your hard-won pattern recognition without ever exposing client information. Share training datasets with ML engineers who lack production access. Build and iterate on predictive systems using statistically faithful proxies. Your data becomes a renewable asset for model development, not a liability to protect.
