| title: Synthetic Sdk Demo | |
| emoji: π | |
| colorFrom: green | |
| colorTo: green | |
| sdk: docker | |
| pinned: false | |
| license: apache-2.0 | |
| short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI | |
| # Synthetic Data SDK by MOSTLY AI Demo | |
| [Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/) | |
| The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**. | |
| - **LOCAL** mode trains and generates synthetic data locally on your own compute resources. | |
| - **CLIENT** mode connects to a remote MOSTLY AI platform for training & generating synthetic data there. | |
| - Generators trained locally can be easily imported to a platform for further sharing. | |
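The difference between the two modes comes down to how the client is constructed. The sketch below assumes the SDK is installed as `mostlyai` and that the client accepts `local=True` for LOCAL mode and `api_key` / `base_url` for CLIENT mode; the helper names `client_kwargs` and `make_client` are hypothetical and exist only for illustration.

```python
from typing import Any, Optional

# Assumption: the free cloud service linked above is used as the default platform.
DEFAULT_BASE_URL = "https://app.mostly.ai"


def client_kwargs(mode: str, api_key: Optional[str] = None,
                  base_url: str = DEFAULT_BASE_URL) -> dict[str, Any]:
    """Return constructor arguments for the chosen mode (hypothetical helper)."""
    mode = mode.upper()
    if mode == "LOCAL":
        # Train and generate on your own compute resources.
        return {"local": True}
    if mode == "CLIENT":
        # Connect to a remote MOSTLY AI platform; an API key is required.
        if not api_key:
            raise ValueError("CLIENT mode needs an API key for the remote platform")
        return {"api_key": api_key, "base_url": base_url}
    raise ValueError(f"unknown mode: {mode!r}")


def make_client(mode: str, **kwargs: Any):
    """Build the SDK client; the import is deferred so client_kwargs works without the SDK."""
    from mostlyai.sdk import MostlyAI  # assumes the SDK is installed

    return MostlyAI(**client_kwargs(mode, **kwargs))
```

With the SDK installed, `mostly = make_client("local")` would give a LOCAL client, and `make_client("client", api_key="...")` a CLIENT one. Check the API reference for the exact constructor arguments before relying on this.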
| ## Overview | |
| The SDK allows you to programmatically create, browse and manage 3 key resources: | |
| 1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets | |
| 2. **Synthetic Datasets** - Use a generator to create as many synthetic samples as you need | |
| 3. **Connectors** - Connect to any data source within your organization, for reading and writing data | |
| | Intent | Primitive | API Reference | | |
| |-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------| | |
| | Train a Generator on tabular or language data | `g = mostly.train(config)` | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train) | | |
| | Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) | | |
| | Live probe the generator on demand | `df = mostly.probe(g, config)` | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe) | | |
| | Connect to any data source within your org | `c = mostly.connect(config)` | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect) | | |
| https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f | |
| ## Key Features | |
| - **Broad Data Support** | |
|   - Mixed-type data (categorical, numerical, geospatial, text, etc.) | |
|   - Single-table, multi-table, and time-series | |
| - **Multiple Model Types** | |
|   - State-of-the-art performance via TabularARGN | |
|   - Fine-tune Hugging Face hosted language models | |
|   - Efficient LSTM for text synthesis from scratch | |
| - **Advanced Training Options** | |
|   - GPU/CPU support | |
|   - Differential Privacy | |
|   - Progress Monitoring | |
| - **Automated Quality Assurance** | |
|   - Quality metrics for fidelity and privacy | |
|   - In-depth HTML reports for visual analysis | |
| - **Flexible Sampling** | |
|   - Up-sample to any data volume | |
|   - Conditional simulations based on any columns | |
|   - Re-balance underrepresented segments | |
|   - Context-aware data imputation | |
|   - Statistical fairness controls | |
|   - Rule-adherence via temperature | |
| - **Seamless Integration** | |
|   - Connect to external data sources (DBs, cloud storage) | |
|   - Fully permissive open-source license | |
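Several of the sampling features listed above are typically selected through the generation config. The sketch below shows one way such a config could be assembled. All key names (`configuration`, `sample_size`, `sampling_temperature`, `rebalancing`, `imputation`) are assumptions that should be checked against the API reference, and `build_generate_config` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Optional


def build_generate_config(table_name: str, sample_size: int,
                          temperature: float = 1.0,
                          rebalance: Optional[dict[str, Any]] = None,
                          impute_columns: Optional[list[str]] = None) -> dict[str, Any]:
    """Assemble a generation config for one table (key names are assumptions)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")

    table_cfg: dict[str, Any] = {
        "sample_size": sample_size,           # up-sample to any volume
        "sampling_temperature": temperature,  # lower values favor likelier outcomes
    }
    if rebalance:
        # e.g. {"column": "segment", "probabilities": {"rare": 0.5}}
        table_cfg["rebalancing"] = rebalance
    if impute_columns:
        # Columns whose missing values should be filled in by the model.
        table_cfg["imputation"] = {"columns": list(impute_columns)}

    return {"tables": [{"name": table_name, "configuration": table_cfg}]}
```

Assuming a trained generator `g` and a client `mostly`, the result could then be passed along the lines of `mostly.generate(g, config=build_generate_config("demo", 10_000, temperature=0.5))`.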
| ## Citation | |
| Please consider citing our project if you find it useful: | |
| ```bibtex | |
| @misc{mostlyai, | |
| title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK}, | |
| author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko}, | |
| year={2025}, | |
| eprint={2508.00718}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.LG}, | |
| url={https://arxiv.org/abs/2508.00718}, | |
| } | |
| ``` | |