|
--- |
|
datasets: |
|
- bigcode/pii-annotated-toloka-donwsample-emails |
|
- bigcode/pseudo-labeled-python-data-pii-detection-filtered |
|
metrics: |
|
- f1 |
|
pipeline_tag: token-classification |
|
language: |
|
- code |
|
extra_gated_prompt: >- |
|
## Terms of Use for the model |
|
|
|
|
|
This is an NER model trained to detect Personally Identifiable Information (PII)
|
in code datasets. We ask that you read and agree to the following Terms of Use |
|
before using the model: |
|
|
|
1. You agree that you will not use the model for any purpose other than detecting

PII in order to remove it from datasets.
|
|
|
2. You agree that you will not share the model or any modified versions of it for

any purpose.
|
|
|
3. Unless required by applicable law or agreed to in writing, the model is |
|
provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, |
|
either express or implied, including, without limitation, any warranties or |
|
conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A |
|
PARTICULAR PURPOSE. You are solely responsible for determining the |
|
appropriateness of using the model, and assume any risks associated with your |
|
exercise of permissions under these Terms of Use. |
|
|
|
4. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, |
|
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR |
|
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE MODEL OR THE USE OR |
|
OTHER DEALINGS IN THE MODEL. |
|
extra_gated_fields: |
|
Email: text |
|
I have read the License and agree with its terms: checkbox |
|
--- |
|
|
|
# StarPII |
|
|
|
## Model description |
|
|
|
This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned [bigcode-encoder](https://huggingface.co/bigcode/bigcode-encoder)
|
on a PII dataset we annotated, available with gated access at [bigcode-pii-dataset](https://huggingface.co/datasets/bigcode/pii-annotated-toloka-donwsample-emails) (see [bigcode-pii-dataset-training](https://huggingface.co/datasets/bigcode/bigcode-pii-dataset-training) for the exact data splits). |
|
We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses and Usernames. |
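As a hedged sketch of applying the model's output: predicted spans (here in the `transformers` token-classification pipeline's aggregated format, with `entity_group`, `start`, and `end` fields) can be masked out of source code with a small helper. The model id `bigcode/starpii`, the sample prediction, and the placeholder format are illustrative assumptions:

```python
from typing import Dict, List

def mask_pii(code: str, entities: List[Dict], tag_fmt: str = "<{}>") -> str:
    """Replace detected PII spans with placeholder tags.

    `entities` uses the HF token-classification pipeline output format:
    dicts with an "entity_group" label and character offsets "start"/"end".
    """
    # Process spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = tag_fmt.format(ent["entity_group"].upper())
        code = code[:ent["start"]] + tag + code[ent["end"]:]
    return code

# In practice the entities would come from the model, e.g.:
#   from transformers import pipeline
#   ner = pipeline("token-classification", model="bigcode/starpii",
#                  aggregation_strategy="simple")
#   entities = ner(code)
# Here we use a hand-crafted prediction for illustration:
sample = 'SMTP_USER = "alice@example.com"'
preds = [{"entity_group": "EMAIL", "start": 13, "end": 30}]
print(mask_pii(sample, preds))  # SMTP_USER = "<EMAIL>"
```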
|
|
|
|
|
## Dataset |
|
|
|
### Fine-tuning on the annotated dataset |
|
The fine-tuning dataset contains 20,961 secrets across 31 programming languages, while the base encoder model was pre-trained on 88

programming languages from [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.
|
|
|
### Initial training on a pseudo-labeled dataset
|
To enhance model performance on some rare PII entities like keys, we initially trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset. |
|
The method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data. |
|
|
|
Specifically, we annotated 18,000 files, available at [bigcode-pii-pseudo-labeled](https://huggingface.co/datasets/bigcode/pseudo-labeled-python-data-pii-detection-filtered),

using an ensemble of two encoder models, [Deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and [stanford-deidentifier-base](https://huggingface.co/StanfordAIMI/stanford-deidentifier-base),

which were fine-tuned on an internal, previously labeled PII [dataset](https://huggingface.co/datasets/bigcode/pii-for-code) for code with 400 files from this [work](https://arxiv.org/abs/2301.03988).
|
To select good-quality pseudo-labels, we averaged the two models' predicted probabilities and filtered entities based on a minimum score.
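This selection step can be sketched as follows (pure Python; the 0.9 threshold and the per-token probability format are illustrative assumptions, not the exact values used):

```python
def filter_pseudo_labels(probs_a, probs_b, labels, min_score=0.9):
    """Keep pseudo-labels whose averaged ensemble confidence clears a threshold.

    probs_a / probs_b: each model's probability for its predicted label,
    one value per token; labels: the predicted label per token ("O" = non-PII).
    """
    kept = []
    for pa, pb, label in zip(probs_a, probs_b, labels):
        avg = round((pa + pb) / 2, 4)  # average the two models' probabilities
        if label != "O" and avg >= min_score:
            kept.append((label, avg))  # confident prediction: keep it
        else:
            kept.append(("O", avg))    # low score: drop the pseudo-label
    return kept

print(filter_pseudo_labels([0.95, 0.6], [0.97, 0.7], ["B-KEY", "B-KEY"]))
# [('B-KEY', 0.96), ('O', 0.65)]
```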
|
After inspection, we observed a high rate of false positives for Keys and Passwords, so we retained only entities that had a trigger word such as `key`, `auth`, or `pwd` in the surrounding context.
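The trigger-word filter can be sketched as a context-window check (the 40-character window size is an illustrative assumption):

```python
# Trigger words mentioned in the text; the window size is an assumption.
TRIGGERS = ("key", "auth", "pwd")

def has_trigger_word(text: str, start: int, end: int, window: int = 40) -> bool:
    """Return True if a trigger word occurs near a detected Key/Password span."""
    context = text[max(0, start - window):end + window].lower()
    return any(trigger in context for trigger in TRIGGERS)

print(has_trigger_word('api_key = "AKIA9F3K2Q8ZL1XN7B4T"', 11, 31))  # True
print(has_trigger_word('value = "AKIA9F3K2Q8ZL1XN7B4T"', 9, 29))     # False
```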
|
Training on this synthetic dataset prior to fine-tuning on the annotated one yielded superior results for all PII categories,

as demonstrated in the tables in the following section.
|
|
|
|
|
### Performance |
|
|
|
This model is represented in the last row (NER + pseudo labels).
|
- Emails, IP addresses and Keys |
|
|
|
| Method | Email address | | | IP address | | | Key | | | |
|
| ------------------ | -------------- | ---- | ---- | ---------- | ---- | ---- | ----- | ---- | ---- | |
|
| | Prec. | Recall | F1 | Prec. | Recall | F1 | Prec. | Recall | F1 | |
|
| Regex | 96.2% | 97.47% | 96.83% | 71.29% | 87.71% | 78.65% | 3.62% | 49.15% | 6.74% | |
|
| NER | 94.01% | 98.10% | 96.01% | 88.95% | *94.43%* | 91.61% | 60.37% | 53.38% | 56.66% | |
|
| + pseudo labels | **97.73%** | **98.94%** | **98.15%** | **90.10%** | 93.86% | **91.94%** | **62.38%** | **80.81%** | **70.41%** | |
|
|
|
- Names, Usernames and Passwords |
|
|
|
| Method | Name | | | Username | | | Password | | | |
|
| ------------------ | -------- | ---- | ---- | -------- | ---- | ---- | -------- | ---- | ---- | |
|
| | Prec. | Recall | F1 | Prec. | Recall | F1 | Prec. | Recall | F1 | |
|
| NER | 83.66% | 95.52% | 89.19% | 48.93% | *75.55%* | 59.39% | 59.16% | *96.62%* | 73.39%| |
|
| + pseudo labels | **86.45%** | **97.38%** | **91.59%** | **52.20%** | 74.81% | **61.49%** | **70.94%** | 95.96% | **81.57%** | |
|
|
|
We used this model to mask PII in the training data of the BigCode large model. We dropped usernames, since they resulted in many false positives and false negatives.
|
For the other PII types, we added the following post-processing steps, which we recommend for future uses of the model (the code is also available on GitHub):
|
|
|
- Ignore secrets with fewer than 4 characters.
|
- Detect full names only. |
|
- Ignore detected keys with fewer than 9 characters, or that are not classified as gibberish by a [gibberish-detector](https://github.com/domanchi/gibberish-detector).
|
- Ignore IP addresses that aren't valid or are private (non-internet-facing), as determined with the `ipaddress` Python package. We also ignore IP addresses of popular DNS servers,

using the same list as in the [SantaCoder](https://huggingface.co/bigcode/santacoder) work.
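Under these rules, the length, key, and IP-address filters can be sketched with the standard `ipaddress` module; the popular-DNS allow-list below is a stand-in (not the actual list used), and the gibberish decision is assumed to come from the linked detector:

```python
import ipaddress

# Stand-in for the popular-DNS-server list (not the actual list used).
POPULAR_DNS = {"8.8.8.8", "8.8.4.4", "1.1.1.1"}

def keep_secret(candidate: str) -> bool:
    """Mask a detected secret only if it has at least 4 characters."""
    return len(candidate) >= 4

def keep_key(candidate: str, is_gibberish: bool) -> bool:
    """Mask a key only if it is long enough and looks like random characters.

    `is_gibberish` would come from a gibberish-detector in practice.
    """
    return len(candidate) >= 9 and is_gibberish

def keep_ip(candidate: str) -> bool:
    """Mask an IP only if it is valid, public, and not a well-known DNS server."""
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        return False  # not a valid IP address
    if ip.is_private or not ip.is_global:
        return False  # private / non-internet-facing
    return candidate not in POPULAR_DNS

print(keep_ip("192.168.0.1"))    # False: private range
print(keep_ip("8.8.8.8"))        # False: popular DNS server
print(keep_ip("93.184.216.34"))  # True: valid public address
```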
|
|
|
# Considerations for Using the Model |
|
|
|
While using this model, please be aware that there are potential risks associated with its application.

There is a possibility of false positives and false negatives, which could lead to unintended consequences when processing sensitive data.
|
Moreover, the model's performance may vary across different data types and programming languages, necessitating validation and fine-tuning for specific use cases. |
|
Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible,

we aim to encourage the development of privacy-preserving AI technologies while remaining vigilant about the potential risks associated with PII.