---
pipeline_tag: sentence-similarity
language:
- en
tags:
- linktransformer
- sentence-transformers
- sentence-similarity
- tabular-classification
---

# gbpatentdata/lt-patent-inventor-linking

This is a [LinkTransformer](https://linktransformer.github.io/) model. At its core, it is a [sentence-transformers](https://www.SBERT.net) model: the LinkTransformer class simply wraps around it. It was fine-tuned from the base model `sentence-transformers/all-mpnet-base-v2` and is trained for the language: `en`.

## Usage (Sentence-Transformers)

To use this model with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

# load the model
model = SentenceTransformer("matthewleechen/lt-patent-inventor-linking")
```

For a minimal similarity-scoring sketch, see the appendix at the end of this card.

## Usage (LinkTransformer)

To use this model for clustering with [LinkTransformer](https://github.com/dell-research-harvard/linktransformer) installed:

```python
import linktransformer as lt
import pandas as pd

# df should be a dataset of unique patent-inventors
df_lm_matched = lt.cluster_rows(
    df,
    model="matthewleechen/lt-patent-inventor-linking",
    on=["name", "occupation", "year", "address", "firm", "patent_title"],  # cluster on these variables
    cluster_type="SLINK",  # use the SLINK (single-linkage) algorithm
    cluster_params={  # default params
        "threshold": 0.1,
        "min cluster size": 1,
        "metric": "cosine",
    },
)
```

## Evaluation

We evaluate using the standard [LinkTransformer](https://github.com/dell-research-harvard/linktransformer) information retrieval metrics. Our test set evaluations are available [here](https://huggingface.co/gbpatentdata/lt-patent-inventor-linking/blob/main/Information-Retrieval_evaluation_test_results.csv).

## Training

The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 31 with parameters:

```
{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`linktransformer.modified_sbert.losses.SupConLoss_wandb`

Parameters of the fit()-Method:

```
{
    "epochs": 100,
    "evaluation_steps": 16,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 3100,
    "weight_decay": 0.01
}
```

## Full Model Architecture

```
LinkTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)
```

## Citation

If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:

```
@article{bct2025,
  title = {300 Years of British Patents},
  author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
  journal = {arXiv preprint arXiv:2401.12345},
  year = {2025},
  url = {https://arxiv.org/abs/2401.12345}
}
```

Please also cite the original LinkTransformer authors:

```
@misc{arora2023linktransformer,
  title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author={Abhishek Arora and Melissa Dell},
  year={2023},
  eprint={2309.00789},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
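
## Appendix: Pairwise Similarity Sketch

As a complement to the usage snippets above, the sketch below scores pairwise cosine similarity between inventor records directly with sentence-transformers. The example records and the semicolon-separated field serialization are illustrative assumptions (this card does not specify a serialization format); the clustering in `lt.cluster_rows` groups records based on these same similarities.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical records: each string concatenates the fields used for clustering
# (name, occupation, year, address, firm, patent_title).
records = [
    "John Smith; engineer; 1885; 12 High Street, Manchester; Smith & Co; Improvements in steam engines",
    "J. Smith; engineer; 1886; 12 High St, Manchester; Smith and Co; Improved valve gear for steam engines",
    "Mary Jones; chemist; 1890; 4 Mill Lane, Leeds; Jones Dyeworks; Process for dyeing cotton",
]

model = SentenceTransformer("matthewleechen/lt-patent-inventor-linking")

# The model ends in a Normalize() module (see the architecture above),
# so embeddings are unit-length and cosine similarity equals the dot product.
embeddings = model.encode(records)

# Pairwise cosine similarities; higher scores suggest the same inventor.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```

Under the SLINK settings shown earlier, records whose embeddings are close enough under the cosine metric would be chained into the same inventor cluster.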