	---
	title: Synthetic Sdk Demo
	emoji: 🚀
	colorFrom: green
	colorTo: green
	sdk: docker
	pinned: false
	license: apache-2.0
	short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI
	---

	# Synthetic Data SDK by MOSTLY AI Demo

	[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)

	The Synthetic Data SDK is a Python toolkit for generating high-fidelity, privacy-safe synthetic data.

	- LOCAL mode trains and generates synthetic data locally on your own compute resources.
	- CLIENT mode connects to a remote MOSTLY AI platform and trains and generates synthetic data there.
	- Generators trained locally can easily be imported into a platform for further sharing.
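
As a minimal sketch of how the two modes differ when the client is created: the helper below only assembles constructor arguments. `client_kwargs` and `make_client` are hypothetical helpers, not part of the SDK, and the sketch assumes the SDK exposes `MostlyAI` under `mostlyai.sdk` with `local`, `base_url` and `api_key` parameters, which should be checked against the documentation.

```python
def client_kwargs(mode, base_url=None, api_key=None):
    """Return keyword arguments for the SDK client (hypothetical helper)."""
    mode = mode.upper()
    if mode == "LOCAL":
        # LOCAL mode: training and generation run on your own compute.
        return {"local": True}
    if mode == "CLIENT":
        # CLIENT mode: a remote platform does the work, so it needs
        # an address and credentials.
        if not base_url or not api_key:
            raise ValueError("CLIENT mode needs both base_url and api_key")
        return {"base_url": base_url, "api_key": api_key}
    raise ValueError(f"unknown mode: {mode!r}")


def make_client(mode, **kwargs):
    # Imported lazily so this sketch loads even without the SDK installed.
    # Assumption: import path and parameter names follow the SDK docs.
    from mostlyai.sdk import MostlyAI
    return MostlyAI(**client_kwargs(mode, **kwargs))
```

For example, `make_client("LOCAL")` would target local compute, while `make_client("CLIENT", base_url="https://app.mostly.ai", api_key="<your key>")` would target the hosted service linked above.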

	## Overview

	The SDK allows you to programmatically create, browse and manage 3 key resources:

	1. Generators - Train a synthetic data generator on your existing tabular or language data assets
	2. Synthetic Datasets - Use a generator to create any number of synthetic samples to suit your needs
	3. Connectors - Connect to any data source within your organization, for reading and writing data

	| Intent | Primitive | API Reference |
	|---|---|---|
	| Train a Generator on tabular or language data | `g = mostly.train(config)` | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train) |
	| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
	| Live probe the generator on demand | `df = mostly.probe(g, config)` | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe) |
	| Connect to any data source within your org | `c = mostly.connect(config)` | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect) |
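
The primitives in the table can be chained into a short workflow. The sketch below is illustrative only: `train_config` and `train_generate_probe` are hypothetical helpers, and the config keys (`name`, `tables`, `data`, `tabular_model_configuration`, `max_training_time`) and the `size` keyword are assumptions taken from the SDK's published usage examples, so verify them against the API reference before relying on them.

```python
def train_config(name, table_name, data, max_training_time=1):
    """Assemble a single-table generator config (key names are assumptions)."""
    return {
        "name": name,
        "tables": [
            {
                "name": table_name,
                "data": data,  # e.g. a pandas DataFrame holding the original data
                "tabular_model_configuration": {
                    # Cap on training time, handy for quick demos.
                    "max_training_time": max_training_time,
                },
            }
        ],
    }


def train_generate_probe(mostly, config, size, probe_size=10):
    """Chain the primitives on any client object that exposes them."""
    g = mostly.train(config=config)              # 1. train a generator
    sd = mostly.generate(g, size=size)           # 2. create a synthetic dataset
    preview = mostly.probe(g, size=probe_size)   # 3. sample a few rows on demand
    return g, sd, preview
```

Here `mostly` would be an SDK client instance. Keeping it as a parameter means the same function can be exercised with a stub object before any real training run is started.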

	https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

	## Key Features

	- Broad Data Support
	  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
	  - Single-table, multi-table, and time-series
	- Multiple Model Types
	  - State-of-the-art performance via TabularARGN
	  - Fine-tune Hugging Face hosted language models
	  - Efficient LSTM for text synthesis from scratch
	- Advanced Training Options
	  - GPU/CPU support
	  - Differential Privacy
	  - Progress Monitoring
	- Automated Quality Assurance
	  - Quality metrics for fidelity and privacy
	  - In-depth HTML reports for visual analysis
	- Flexible Sampling
	  - Up-sample to any data volumes
	  - Conditional simulations based on any columns
	  - Re-balance underrepresented segments
	  - Context-aware data imputation
	  - Statistical fairness controls
	  - Rule-adherence via temperature
	- Seamless Integration
	  - Connect to external data sources (DBs, cloud storages)
	  - Fully permissive open-source license
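
Conditional simulation and re-balancing both come down to fixing some column values up front and letting the generator fill in the rest. The sketch below shows one way to prepare such fixed values. `balanced_seed` and `conditional_generate` are hypothetical helpers, and the assumption that `generate` accepts a `seed` pandas DataFrame should be checked against the API reference.

```python
def balanced_seed(column, values, rows_per_value):
    """Build seed columns that fix `column` to each value equally often."""
    return {column: [v for v in values for _ in range(rows_per_value)]}


def conditional_generate(mostly, g, seed_columns):
    # Lazy import: pandas is only needed when actually calling the SDK.
    import pandas as pd

    # Assumption: the seed columns are held fixed while the generator
    # samples all remaining columns conditionally on them.
    return mostly.generate(g, seed=pd.DataFrame(seed_columns))
```

For instance, `balanced_seed("sex", ["Female", "Male"], 500)` describes 1,000 rows split evenly across the two values, regardless of how skewed the original data was.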

	## Citation

	Please consider citing our project if you find it useful:

	```bibtex
	@misc{mostlyai,
	title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
	author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
	year={2025},
	eprint={2508.00718},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2508.00718},
	}
	```