---
tags:
- software engineering
- ner
- named-entity recognition
- token-classification
widget:
- text: >-
    In the field of computer graphics, a graphics processing unit (GPU) utilizes algorithms such as ray tracing, a rendering technique, to create realistic lighting effects in applications like Adobe Acrobat and Microsoft Excel.
  example_title: example 1
- text: >-
    By utilizing the TensorFlow and FastAPI libraries with Python, we are optimizing neural network training on devices like the Samsung Gear S2 and Intel T5300 processor.
  example_title: example 2
language:
- en
datasets:
- wikiser
license: apache-2.0
---
# Software Entity Recognition with Noise-robust Learning

We train a BERT model for the task of software entity recognition (SER).
The training data leverages WikiSER, a corpus of 1.7M sentences extracted from Wikipedia.
The model uses _self-regularization_ during finetuning, making it robust to noise in software-domain text, including misannotations and inconsistent naming conventions.
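For readers curious what such an objective can look like, here is a minimal, hypothetical sketch of consistency-based self-regularization (two dropout-perturbed forward passes penalized with a symmetric KL term). The exact loss used to train this checkpoint is specified in the paper and code linked below and may differ from this sketch.

```python
import torch.nn.functional as F

def self_regularized_loss(model, batch, alpha=1.0):
    """Hypothetical sketch of a consistency-style self-regularization loss.

    `batch` is assumed to contain input_ids, attention_mask, and labels.
    The objective actually used for this checkpoint may differ; see the paper.
    """
    # Two forward passes: dropout makes their predictions disagree slightly.
    out1 = model(**batch)
    out2 = model(**batch)

    # Standard token-classification cross-entropy, averaged over both passes.
    ce = 0.5 * (out1.loss + out2.loss)

    # Symmetric KL between the two predicted token-label distributions.
    log_p = F.log_softmax(out1.logits, dim=-1)
    log_q = F.log_softmax(out2.logits, dim=-1)
    consistency = 0.5 * (
        F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    )
    return ce + alpha * consistency
```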
					
						
The model recognizes 12 fine-grained named entity types: `Algorithm`, `Application`, `Architecture`, `Data_Structure`, `Device`, `Error_Name`, `General_Concept`, `Language`, `Library`, `License`, `Operating_System`, and `Protocol`.

| Type             | Examples                                   |
|------------------|--------------------------------------------|
| Algorithm        | Auction algorithm, Collaborative filtering |
| Application      | Adobe Acrobat, Microsoft Excel             |
| Architecture     | Graphics processing unit, Wishbone         |
| Data_Structure   | Array, Hash table, mXOR linked list        |
| Device           | Samsung Gear S2, iPad, Intel T5300         |
| Error_Name       | Buffer overflow, Memory leak               |
| General_Concept  | Memory management, Nouvelle AI             |
| Language         | C++, Java, Python, Rust                    |
| Library          | Beautiful Soup, FastAPI                    |
| License          | Cryptix General License, MIT License       |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS         |
| Protocol         | TLS, FTPS, HTTP 404                        |
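The table above lists the entity types; whether the checkpoint emits plain type labels or an IOB-style scheme (e.g. `B-Library` / `I-Library`) is not spelled out here. One way to check is to read the label mapping straight from the model config, as in this small sketch:

```python
from transformers import AutoConfig

# Inspect the label set shipped with the checkpoint.
config = AutoConfig.from_pretrained("taidng/wikiser-bert-large")
print(config.num_labels)
print(config.id2label)
```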
					
						
## Model details

Paper: https://arxiv.org/abs/2308.10564

Code: https://github.com/taidnguyen/software_entity_recognition

Finetuned from model: `bert-large-cased`

Checkpoint for the base version: https://huggingface.co/taidng/wikiser-bert-base
## How to use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-large")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-large")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Windows XP was originally bundled with Internet Explorer 6."

ner_results = nlp(example)
print(ner_results)
```
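The pipeline above returns one prediction per (sub)token. If you prefer whole entity spans, `transformers` can group consecutive tokens for you via `aggregation_strategy`; a minimal sketch:

```python
from transformers import pipeline

# Group subword tokens into complete entity spans (e.g. "Internet Explorer 6").
nlp_grouped = pipeline(
    "ner",
    model="taidng/wikiser-bert-large",
    aggregation_strategy="simple",
)
print(nlp_grouped("Windows XP was originally bundled with Internet Explorer 6."))
```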
					
						

## Citation

```bibtex
@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}
```