| [InCoder](https://huggingface.co/facebook/incoder-6B) was trained on **216 GB** of preprocessed data from GitHub and StackOverflow, covering 28 programming languages. Of this, 52 GB is Python code, 107 GB is code in other programming languages, and 57 GB is StackOverflow content that isn't code. | |
| The GitHub data was filtered using the following criteria: | |
| - Average line length < 100 tokens | |
| - Maximum line length < 3000 tokens | |
| - Alphanumeric characters fraction > 0.4 | |
| - Remove auto-generated files (keyword search) | |
| The second component of the data consists of questions, answers, and comments from StackOverflow. It includes: | |
| - all questions that have at least one answer | |
| - up to ten answers with a non-negative score (sorted by score) per question | |
| - up to five comments per question/answer | |
| Exact-match deduplication was performed on code files. For more details, please refer to the [paper](https://arxiv.org/pdf/2204.05999.pdf). |
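
The GitHub filtering criteria and the exact-match deduplication described above can be sketched roughly as follows. This is a minimal illustration, not InCoder's actual pipeline: the function names are hypothetical, whitespace-separated pieces stand in for tokenizer tokens, and the `AUTOGEN_MARKERS` phrases are assumed placeholders because the real keyword list is not given here.

```python
import hashlib

# Hypothetical marker phrases; the actual keyword list is not given in this section.
AUTOGEN_MARKERS = (
    "auto-generated",
    "autogenerated",
    "automatically generated",
    "do not edit",
)


def count_tokens(line: str) -> int:
    """Assumption: whitespace-separated pieces approximate tokenizer tokens."""
    return len(line.split())


def passes_filters(text: str) -> bool:
    """Return True if a code file satisfies all four filtering criteria."""
    lines = text.splitlines()
    if not lines or not text.strip():
        return False

    token_counts = [count_tokens(line) for line in lines]

    # Average line length < 100 tokens
    if sum(token_counts) / len(lines) >= 100:
        return False

    # Maximum line length < 3000 tokens
    if max(token_counts) >= 3000:
        return False

    # Alphanumeric characters fraction > 0.4
    alnum = sum(ch.isalnum() for ch in text)
    if alnum / len(text) <= 0.4:
        return False

    # Remove auto-generated files (keyword search)
    lowered = text.lower()
    if any(marker in lowered for marker in AUTOGEN_MARKERS):
        return False

    return True


def deduplicate_exact(files: list[str]) -> list[str]:
    """Keep the first occurrence of each byte-identical file."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

A typical use would be `deduplicate_exact([f for f in files if passes_filters(f)])`. Hashing the full file contents means only byte-identical files are treated as duplicates, which matches exact-match deduplication rather than near-duplicate detection.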