Update README.md
README.md
CHANGED
@@ -9,3 +9,72 @@ license: apache-2.0
short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI
---

# Synthetic Data SDK by MOSTLY AI Demo

[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)

The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **LOCAL** mode trains and generates synthetic data locally on your own compute resources.
- **CLIENT** mode connects to a remote MOSTLY AI platform to train and generate synthetic data there.
- Generators that were trained locally can easily be imported to a platform for further sharing.

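As a rough sketch of how the two modes above might be selected in code: the `MostlyAI` class and its `local`, `base_url`, and `api_key` arguments follow the SDK's documented usage, while the helper names `sdk_kwargs` and `connect` are hypothetical and exist only for illustration. Check the documentation before relying on any of this.

```python
def sdk_kwargs(mode, base_url=None, api_key=None):
    """Return constructor keyword arguments for the chosen mode (hypothetical helper)."""
    mode = mode.upper()
    if mode == "LOCAL":
        # LOCAL mode: train and generate on your own compute resources.
        return {"local": True}
    if mode == "CLIENT":
        # CLIENT mode: work against a remote MOSTLY AI platform.
        if not base_url or not api_key:
            raise ValueError("CLIENT mode needs both base_url and api_key")
        return {"base_url": base_url, "api_key": api_key}
    raise ValueError(f"unknown mode: {mode!r}")


def connect(mode, **credentials):
    """Create an SDK client for the given mode (hypothetical helper).

    The import is lazy so this snippet loads even where the SDK is not installed.
    """
    from mostlyai.sdk import MostlyAI  # assumes the `mostlyai` package is installed

    return MostlyAI(**sdk_kwargs(mode, **credentials))
```

With the SDK installed, `mostly = connect("local")` would give a local client, and `connect("client", base_url=..., api_key=...)` a remote one.
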
## Overview

The SDK allows you to programmatically create, browse, and manage three key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples, tailored to your needs
3. **Connectors** - Connect to any data source within your organization, for reading and writing data

| Intent                                        | Primitive                         | API Reference                                                                                                 |
|-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
| Train a Generator on tabular or language data | `g = mostly.train(config)`        | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
| Connect to any data source within your org    | `c = mostly.connect(config)`      | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |

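The primitives in the table can be chained into a small train, probe, generate workflow. The following is a minimal sketch, not the SDK's reference usage: the helper names `build_train_config` and `run_workflow` are hypothetical, the config keys (`name`, `tables`, `data`) are assumptions to verify against the linked API reference, and `mostly` is assumed to be an already-initialized client.

```python
def build_train_config(name, table_name, data):
    """Assemble a minimal single-table generator config (keys are assumptions)."""
    return {
        "name": name,
        "tables": [{"name": table_name, "data": data}],
    }


def run_workflow(mostly, data, name="demo", table_name="data",
                 probe_config=None, generate_config=None):
    """Train a generator, probe it, then generate a synthetic dataset.

    `mostly` is expected to behave like an SDK client; `data` would
    typically be a pandas DataFrame holding the original records.
    """
    # Train a generator on the original data.
    g = mostly.train(config=build_train_config(name, table_name, data))
    # Probe: request samples on demand, without creating a dataset resource.
    df = mostly.probe(g, config=probe_config)
    # Generate: create a synthetic dataset from the trained generator.
    sd = mostly.generate(g, config=generate_config)
    return g, df, sd
```

Passing the client in as an argument keeps the sketch independent of LOCAL or CLIENT mode, and makes it easy to exercise with a stand-in object before running a real training job.
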
https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

## Key Features

- **Broad Data Support**
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- **Multiple Model Types**
  - State-of-the-art performance via TabularARGN
  - Fine-tune Hugging Face hosted language models
  - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- **Automated Quality Assurance**
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- **Flexible Sampling**
  - Up-sample to any data volume
  - Conditional simulations based on any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- **Seamless Integration**
  - Connect to external data sources (DBs, cloud storage)
  - Fully permissive open-source license

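To make the "Rule-adherence via temperature" item more concrete, here is a generic illustration of temperature scaling on a categorical distribution. It is not the SDK's implementation; the function name and the example probabilities are made up. It only shows why a lower temperature concentrates samples on the most likely values, which tends to suppress rare combinations that break rules seen in the training data.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution: p_i ** (1 / T), then renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space; zero-probability categories stay at zero.
    scaled = [math.exp(math.log(p) / temperature) if p > 0 else 0.0 for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.7, 0.2, 0.1]                 # hypothetical model output for one column
cold = apply_temperature(probs, 0.5)    # sharper: the most likely value gains mass
hot = apply_temperature(probs, 2.0)     # flatter: rare values gain mass
```

A temperature of 1 leaves the distribution unchanged, so it serves as the neutral setting in this sketch.
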
## Citation

Please consider citing our project if you find it useful:

```bibtex
@misc{mostlyai,
    title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
    author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
    year={2025},
    eprint={2508.00718},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2508.00718},
}
```