|
--- |
|
datasets: |
|
- bigcode/pii-annotated-toloka-donwsample-emails |
|
- bigcode/pseudo-labeled-python-data-pii-detection-filtered |
|
metrics: |
|
- f1 |
|
pipeline_tag: token-classification |
|
language: |
|
- code |
|
extra_gated_prompt: >- |
|
## Terms of Use for the model |
|
|
|
|
|
This is an NER model trained to detect Personally Identifiable Information (PII)
|
in code datasets. We ask that you read and agree to the following Terms of Use |
|
before using the model: |
|
|
|
1. You agree that you will not use the model for any purpose other than detecting

PII in order to remove it from datasets.
|
|
|
2. You agree that you will not share the model or any modified versions of it for

any purpose.
|
|
|
3. Unless required by applicable law or agreed to in writing, the model is |
|
provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, |
|
either express or implied, including, without limitation, any warranties or |
|
conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A |
|
PARTICULAR PURPOSE. You are solely responsible for determining the |
|
appropriateness of using the model, and assume any risks associated with your |
|
exercise of permissions under these Terms of Use. |
|
|
|
4. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, |
|
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR |
|
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE MODEL OR THE USE OR |
|
OTHER DEALINGS IN THE MODEL. |
|
extra_gated_fields: |
|
Email: text |
|
I have read the License and agree with its terms: checkbox |
|
--- |
|
|
|
# StarPII |
|
|
|
## Model description |
|
|
|
This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned [bigcode-encoder](https://huggingface.co/bigcode/bigcode-encoder)
|
on a PII dataset we annotated, available with gated access at [bigcode-pii-dataset](https://huggingface.co/datasets/bigcode/pii-annotated-toloka-donwsample-emails) (see [bigcode-pii-dataset-training](https://huggingface.co/datasets/bigcode/bigcode-pii-dataset-training) for the exact data splits). |
|
We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses and Usernames. |
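As a hedged sketch of applying the model's output: predicted spans (here in the `transformers` token-classification pipeline's aggregated format, with `entity_group`, `start`, and `end` fields) can be masked out of source code with a small helper. The model id `bigcode/starpii`, the sample prediction, and the placeholder format are illustrative assumptions:

```python
from typing import Dict, List

def mask_pii(code: str, entities: List[Dict], tag_fmt: str = "<{}>") -> str:
    """Replace detected PII spans with placeholder tags.

    `entities` uses the HF token-classification pipeline output format:
    dicts with an "entity_group" label and character offsets "start"/"end".
    """
    # Process spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = tag_fmt.format(ent["entity_group"].upper())
        code = code[:ent["start"]] + tag + code[ent["end"]:]
    return code

# In practice the entities would come from the model, e.g.:
#   from transformers import pipeline
#   ner = pipeline("token-classification", model="bigcode/starpii",
#                  aggregation_strategy="simple")
#   entities = ner(code)
# Here we use a hand-crafted prediction for illustration:
sample = 'SMTP_USER = "alice@example.com"'
preds = [{"entity_group": "EMAIL", "start": 13, "end": 30}]
print(mask_pii(sample, preds))  # SMTP_USER = "<EMAIL>"
```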
|
|
|
|
|
## Dataset |
|
|
|
### Fine-tuning on the annotated dataset |
|
The fine-tuning dataset contains 20,961 secrets across 31 programming languages, while the base encoder model was pre-trained on 88

programming languages from [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.
|
|
|
### Initial training on a pseudo-labeled dataset
|
To enhance model performance on some rare PII entities like keys, we initially trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset. |
|
The method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data. |
|
|
|
Specifically, we annotated 18,000 files, available at [bigcode-pii-pseudo-labeled](https://huggingface.co/datasets/bigcode/pseudo-labeled-python-data-pii-detection-filtered),

using an ensemble of two encoder models, [Deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and [stanford-deidentifier-base](https://huggingface.co/StanfordAIMI/stanford-deidentifier-base),

which were fine-tuned on an internal, previously labeled PII [dataset](https://huggingface.co/datasets/bigcode/pii-for-code) for code with 400 files from this [work](https://arxiv.org/abs/2301.03988).
|
To select good-quality pseudo-labels, we averaged the two models' predicted probabilities and filtered entities based on a minimum score.
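This selection step can be sketched as follows (pure Python; the 0.9 threshold and the per-token probability format are illustrative assumptions, not the exact values used):

```python
def filter_pseudo_labels(probs_a, probs_b, labels, min_score=0.9):
    """Keep pseudo-labels whose averaged ensemble confidence clears a threshold.

    probs_a / probs_b: each model's probability for its predicted label,
    one value per token; labels: the predicted label per token ("O" = non-PII).
    """
    kept = []
    for pa, pb, label in zip(probs_a, probs_b, labels):
        avg = round((pa + pb) / 2, 4)  # average the two models' probabilities
        if label != "O" and avg >= min_score:
            kept.append((label, avg))  # confident prediction: keep it
        else:
            kept.append(("O", avg))    # low score: drop the pseudo-label
    return kept

print(filter_pseudo_labels([0.95, 0.6], [0.97, 0.7], ["B-KEY", "B-KEY"]))
# [('B-KEY', 0.96), ('O', 0.65)]
```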
|
After inspection, we observed a high rate of false positives for Keys and Passwords, so we retained only entities that had a trigger word such as `key`, `auth`, or `pwd` in the surrounding context.
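The trigger-word filter can be sketched as a context-window check (the 40-character window size is an illustrative assumption):

```python
# Trigger words mentioned in the text; the window size is an assumption.
TRIGGERS = ("key", "auth", "pwd")

def has_trigger_word(text: str, start: int, end: int, window: int = 40) -> bool:
    """Return True if a trigger word occurs near a detected Key/Password span."""
    context = text[max(0, start - window):end + window].lower()
    return any(trigger in context for trigger in TRIGGERS)

print(has_trigger_word('api_key = "AKIA9F3K2Q8ZL1XN7B4T"', 11, 31))  # True
print(has_trigger_word('value = "AKIA9F3K2Q8ZL1XN7B4T"', 9, 29))     # False
```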
|
Training on this synthetic dataset prior to fine-tuning on the annotated one yielded superior results for all PII categories,

as demonstrated in the tables in the following section.
|
|
|
|
|
### Performance |
|
|
|
This model is represented in the last row (NER + pseudo labels).
|
- Emails, IP addresses and Keys |
|
|
|
| Method | Email address | | | IP address | | | Key | | | |
|
| ------------------ | -------------- | ---- | ---- | ---------- | ---- | ---- | ----- | ---- | ---- | |
|
| | Prec. | Recall | F1 | Prec. | Recall | F1 | Prec. | Recall | F1 | |
|
| Regex | 96.2% | 97.47% | 96.83% | 71.29% | 87.71% | 78.65% | 3.62% | 49.15% | 6.74% | |
|
| NER | 94.01% | 98.10% | 96.01% | 88.95% | *94.43%* | 91.61% | 60.37% | 53.38% | 56.66% | |
|
| + pseudo labels | **97.73%** | **98.94%** | **98.15%** | **90.10%** | 93.86% | **91.94%** | **62.38%** | **80.81%** | **70.41%** | |
|
|
|
- Names, Usernames and Passwords |
|
|
|
| Method | Name | | | Username | | | Password | | | |
|
| ------------------ | -------- | ---- | ---- | -------- | ---- | ---- | -------- | ---- | ---- | |
|
| | Prec. | Recall | F1 | Prec. | Recall | F1 | Prec. | Recall | F1 | |
|
| NER | 83.66% | 95.52% | 89.19% | 48.93% | *75.55%* | 59.39% | 59.16% | *96.62%* | 73.39%| |
|
| + pseudo labels | **86.45%** | **97.38%** | **91.59%** | **52.20%** | 74.81% | **61.49%** | **70.94%** | 95.96% | **81.57%** | |
|
|
|
We used this model to mask PII in the training data of the BigCode large model. We dropped usernames, since they resulted in many false positives and false negatives.
|
For the other PII types, we added the following post-processing steps, which we recommend for future uses of the model (the code is also available on GitHub):
|
|
|
- Ignore secrets with fewer than 4 characters.
|
- Detect full names only. |
|
- Ignore detected keys with fewer than 9 characters, or that are not classified as gibberish by a [gibberish-detector](https://github.com/domanchi/gibberish-detector).
|
- Ignore IP addresses that aren't valid or are private (non-internet-facing), as determined with the `ipaddress` Python package. We also ignore IP addresses of popular DNS servers,

using the same list as in the [SantaCoder](https://huggingface.co/bigcode/santacoder) work.
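Under these rules, the length, key, and IP-address filters can be sketched with the standard `ipaddress` module; the popular-DNS allow-list below is a stand-in (not the actual list used), and the gibberish decision is assumed to come from the linked detector:

```python
import ipaddress

# Stand-in for the popular-DNS-server list (not the actual list used).
POPULAR_DNS = {"8.8.8.8", "8.8.4.4", "1.1.1.1"}

def keep_secret(candidate: str) -> bool:
    """Mask a detected secret only if it has at least 4 characters."""
    return len(candidate) >= 4

def keep_key(candidate: str, is_gibberish: bool) -> bool:
    """Mask a key only if it is long enough and looks like random characters.

    `is_gibberish` would come from a gibberish-detector in practice.
    """
    return len(candidate) >= 9 and is_gibberish

def keep_ip(candidate: str) -> bool:
    """Mask an IP only if it is valid, public, and not a well-known DNS server."""
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        return False  # not a valid IP address
    if ip.is_private or not ip.is_global:
        return False  # private / non-internet-facing
    return candidate not in POPULAR_DNS

print(keep_ip("192.168.0.1"))    # False: private range
print(keep_ip("8.8.8.8"))        # False: popular DNS server
print(keep_ip("93.184.216.34"))  # True: valid public address
```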
|
|
|
# Considerations for Using the Model |
|
|
|
While using this model, please be aware that there are potential risks associated with its application.

There is a possibility of false positives and false negatives, which could lead to unintended consequences when processing sensitive data.
|
Moreover, the model's performance may vary across different data types and programming languages, necessitating validation and fine-tuning for specific use cases. |
|
Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible,

we aim to encourage the development of privacy-preserving AI technologies while remaining vigilant about the potential risks associated with PII.