ZennyKenny committed on
Commit
5426d51
·
verified ·
1 Parent(s): 27d7a4f

Update README.md

Files changed (1)
  1. README.md +69 -0
README.md CHANGED
@@ -9,3 +9,72 @@ license: apache-2.0
  short_description: The Synthetic Data SDK is a Python toolkit from MOSTLY AI
  ---

+ # Synthetic Data SDK by MOSTLY AI Demo
+
+ [Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)
+
+ The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.
+
+ - **LOCAL** mode trains and generates synthetic data locally on your own compute resources.
+ - **CLIENT** mode connects to a remote MOSTLY AI platform to train and generate synthetic data there.
+ - Generators that were trained locally can easily be imported to a platform for further sharing.
+
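The mode is typically chosen when the client object is created. The following is a minimal sketch, not the SDK's own code: `client_kwargs` is a hypothetical helper, and the keyword names `local`, `api_key`, and `base_url`, the `MostlyAI` class name, and the default URL are assumptions that should be checked against the linked documentation.

```python
from typing import Optional


def client_kwargs(mode: str, api_key: Optional[str] = None,
                  base_url: str = "https://app.mostly.ai") -> dict:
    """Hypothetical helper: build keyword arguments for the SDK client.

    The keyword names are assumptions, not confirmed by this README.
    """
    mode = mode.upper()
    if mode == "LOCAL":
        # Train and generate on your own compute resources.
        return {"local": True}
    if mode == "CLIENT":
        # Connect to a remote platform; this requires credentials.
        if not api_key:
            raise ValueError("CLIENT mode needs an API key for the remote platform")
        return {"api_key": api_key, "base_url": base_url}
    raise ValueError(f"unknown mode: {mode!r}")


# Assumed usage with the SDK installed (left as comments so the sketch runs anywhere):
#   from mostlyai.sdk import MostlyAI
#   mostly = MostlyAI(**client_kwargs("LOCAL"))
print(client_kwargs("LOCAL"))
```

Keeping the mode decision in one small function makes it straightforward to switch a script between local and remote execution without touching the rest of the workflow.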
+ ## Overview
+
+ The SDK allows you to programmatically create, browse, and manage three key resources:
+
+ 1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
+ 2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples to suit your needs
+ 3. **Connectors** - Connect to any data source within your organization, for reading and writing data
+
+ | Intent | Primitive | API Reference |
+ |-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
+ | Train a Generator on tabular or language data | `g = mostly.train(config)` | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train) |
+ | Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
+ | Live probe the generator on demand | `df = mostly.probe(g, config)` | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe) |
+ | Connect to any data source within your org | `c = mostly.connect(config)` | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect) |
+
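The primitives in the table chain naturally: train a generator, generate a dataset from it, then probe it for a few live samples. The sketch below illustrates only that call order. `RecordingClient` is a hypothetical stand-in so the example runs without the SDK installed, `run_workflow` is an illustrative helper, and the config dictionaries are placeholders; the real configuration schema and exact method signatures are described in the linked API reference.

```python
from typing import Any


class RecordingClient:
    """Hypothetical stand-in for the SDK client (not the real class).

    It mimics three primitive names from the table and returns fake values.
    """

    def __init__(self) -> None:
        self.calls = []

    def train(self, config: dict) -> dict:
        self.calls.append("train")
        return {"kind": "generator", "name": config["name"]}

    def generate(self, generator: dict, config: dict) -> dict:
        self.calls.append("generate")
        return {"kind": "synthetic_dataset", "source": generator["name"]}

    def probe(self, generator: dict, config: dict) -> list:
        self.calls.append("probe")
        return [{"source": generator["name"]}]


def run_workflow(mostly: Any, train_config: dict,
                 generate_config: dict, probe_config: dict):
    """Chain the primitives in the order the table lists them."""
    g = mostly.train(config=train_config)            # 1. train a generator
    sd = mostly.generate(g, config=generate_config)  # 2. create a synthetic dataset
    df = mostly.probe(g, config=probe_config)        # 3. sample from the generator on demand
    return g, sd, df


# Placeholder configs: only "name" is used by the stand-in client.
client = RecordingClient()
g, sd, df = run_workflow(
    client,
    {"name": "demo"},
    {"name": "demo-synthetic"},
    {"name": "demo-probe"},
)
print(client.calls)  # ['train', 'generate', 'probe']
```

Because `run_workflow` only relies on the method names, a real SDK client could be passed in place of the stand-in, provided its methods accept `config` as a keyword argument.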
+ https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f
+
+ ## Key Features
+
+ - **Broad Data Support**
+   - Mixed-type data (categorical, numerical, geospatial, text, etc.)
+   - Single-table, multi-table, and time-series
+ - **Multiple Model Types**
+   - State-of-the-art performance via TabularARGN
+   - Fine-tune Hugging Face hosted language models
+   - Efficient LSTM for text synthesis from scratch
+ - **Advanced Training Options**
+   - GPU/CPU support
+   - Differential Privacy
+   - Progress Monitoring
+ - **Automated Quality Assurance**
+   - Quality metrics for fidelity and privacy
+   - In-depth HTML reports for visual analysis
+ - **Flexible Sampling**
+   - Up-sample to any data volume
+   - Conditional simulations based on any columns
+   - Re-balance underrepresented segments
+   - Context-aware data imputation
+   - Statistical fairness controls
+   - Rule-adherence via temperature
+ - **Seamless Integration**
+   - Connect to external data sources (DBs, cloud storage)
+   - Fully permissive open-source license
+
+ ## Citation
+
+ Please consider citing our project if you find it useful:
+
+ ```bibtex
+ @misc{mostlyai,
+   title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
+   author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
+   year={2025},
+   eprint={2508.00718},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG},
+   url={https://arxiv.org/abs/2508.00718},
+ }
+ ```