Spaces: mostlyai/synthetic-sdk-demo

ZennyKenny committed 7 days ago

Commit 5426d51 (verified) · 1 Parent(s): 27d7a4f

Update README.md

Files changed (1)

README.md +69 -0

@@ -9,3 +9,72 @@ license: apache-2.0
 short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI
 ---

+# Synthetic Data SDK by MOSTLY AI Demo
+[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)
+The **Synthetic Data SDK** is a Python toolkit for generating high-fidelity, privacy-safe **Synthetic Data**.
+- **LOCAL** mode trains and generates synthetic data locally on your own compute resources.
+- **CLIENT** mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
+- Generators that were trained locally can be easily imported to a platform for further sharing.
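
The two modes above differ mainly in how the client is constructed. The following is a minimal sketch, not the SDK's documented recipe: `client_kwargs` and `make_client` are hypothetical helpers introduced here for illustration, and the constructor arguments `local`, `base_url`, and `api_key` are assumptions that should be checked against the linked API reference.

```python
from typing import Any, Dict, Optional


def client_kwargs(
    local: bool,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build constructor arguments for LOCAL or CLIENT mode (hypothetical helper)."""
    if local:
        # LOCAL mode: training and generation run on your own compute.
        return {"local": True}
    if not base_url or not api_key:
        raise ValueError("CLIENT mode needs both base_url and api_key")
    # CLIENT mode: training and generation run on a remote platform.
    return {"base_url": base_url, "api_key": api_key}


def make_client(**kwargs: Any) -> Any:
    """Instantiate the SDK client (assumed import path and argument names).

    The import is deferred so that client_kwargs can be used and tested
    without the SDK installed.
    """
    from mostlyai.sdk import MostlyAI

    return MostlyAI(**kwargs)
```

With the SDK installed, `mostly = make_client(**client_kwargs(local=True))` would give a LOCAL-mode client under these assumptions; switching to CLIENT mode only changes the arguments passed to `client_kwargs`.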
+## Overview
+The SDK allows you to programmatically create, browse, and manage three key resources:
+1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
+2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples to suit your needs
+3. **Connectors** - Connect to any data source within your organization, for reading and writing data
+| Intent                                        | Primitive                         | API Reference                                                                                                 |
+|-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
+| Train a Generator on tabular or language data | `g = mostly.train(config)`        | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
+| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
+| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
+| Connect to any data source within your org    | `c = mostly.connect(config)`      | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |
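
The primitives in the table chain naturally: train a generator, probe it for a quick preview, then generate a full synthetic dataset. The sketch below illustrates that order only. `build_train_config` and `train_probe_generate` are hypothetical helpers, and the config keys (`name`, `tables`, `data`) and the `size` keyword are assumptions rather than confirmed signatures; consult the API reference links in the table before relying on them.

```python
from typing import Any, Dict, Tuple


def build_train_config(name: str, table_name: str, data: Any) -> Dict[str, Any]:
    """Assemble a single-table training config (key names are assumptions)."""
    return {"name": name, "tables": [{"name": table_name, "data": data}]}


def train_probe_generate(
    mostly: Any, config: Dict[str, Any], size: int = 1000
) -> Tuple[Any, Any, Any]:
    """Run train -> probe -> generate on an SDK client passed in as `mostly`."""
    # Train a generator on the configured table(s).
    g = mostly.train(config=config)
    # Probe: request a small sample on demand (`size` keyword is an assumption).
    preview = mostly.probe(g, size=10)
    # Generate: create a synthetic dataset of the requested size.
    sd = mostly.generate(g, size=size)
    return g, preview, sd
```

Passing the client in as an argument keeps the sketch independent of LOCAL versus CLIENT mode, since the same three calls apply to either.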
+https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f
+## Key Features
+- **Broad Data Support**
+  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
+  - Single-table, multi-table, and time-series
+- **Multiple Model Types**
+  - State-of-the-art performance via TabularARGN
+  - Fine-tune Hugging Face hosted language models
+  - Efficient LSTM for text synthesis from scratch
+- **Advanced Training Options**
+  - GPU/CPU support
+  - Differential Privacy
+  - Progress Monitoring
+- **Automated Quality Assurance**
+  - Quality metrics for fidelity and privacy
+  - In-depth HTML reports for visual analysis
+- **Flexible Sampling**
+  - Up-sample to any data volume
+  - Conditional simulations based on any columns
+  - Re-balance underrepresented segments
+  - Context-aware data imputation
+  - Statistical fairness controls
+  - Rule-adherence via temperature
+- **Seamless Integration**
+  - Connect to external data sources (databases, cloud storage)
+  - Fully permissive open-source license
+## Citation
+Please consider citing our project if you find it useful:
+```bibtex
+@misc{mostlyai,
+      title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
+      author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
+      year={2025},
+      eprint={2508.00718},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2508.00718},
+}
+```