|
--- |
|
license: mit |
|
tags: |
|
- code-understanding |
|
- unixcoder |
|
pipeline_tag: feature-extraction |
|
--- |
|
|
|
# RepoSim4Py |
|
|
|
An embedding-based tool for comparing the semantic similarity of Python repositories, using multiple sources of information from each repository.
|
|
|
## Model Details |
|
|
|
**RepoSim4Py** is a pipeline built on the Hugging Face platform for generating embeddings of specified GitHub Python repositories.

For each Python repository, it generates embeddings at different levels from the source code, code documentation, requirements, and README files within the repository.

Averaging the embeddings at each level produces a repository-level mean embedding.

These embeddings can be used to compute semantic similarities at different levels, for example by taking the cosine similarity between two repositories' embeddings.
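As an illustration of the averaging and cosine-similarity steps (the embedding values below are made up, and real RepoSim4Py embeddings are 768-dimensional; 4 dimensions are used only to keep the example readable):

```python
import numpy as np

# Hypothetical per-unit code embeddings for two repositories.
repo_a_code = np.array([[0.2, -1.3, 0.7, 0.5],
                        [0.4, -1.0, 0.9, 0.3]], dtype=np.float32)
repo_b_code = np.array([[0.1, -1.1, 0.8, 0.6]], dtype=np.float32)

# Level-wise mean embedding: average over all units at that level.
mean_a = repo_a_code.mean(axis=0)
mean_b = repo_b_code.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(mean_a, mean_b))
```

The same comparison works at any level (code, documentation, requirements, README) or on the repository-level mean embedding.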
|
|
|
### Model Description |
|
|
|
The model used by **RepoSim4Py** is **UniXcoder** fine-tuned on the [code search task](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search), using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset.
|
|
|
- **Pipeline developed by:** [Henry65](https://huggingface.co/Henry65) |
|
- **Repository:** [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py) |
|
- **Model type:** **code understanding** |
|
- **Language(s):** **Python** |
|
- **License:** **MIT** |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) |
|
- **Paper:** [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf) |
|
|
|
## Uses |
|
|
|
Below is an example of how to use the RepoSim4Py pipeline to easily generate embeddings for GitHub Python repositories. |
|
|
|
First, initialise the pipeline: |
|
```python |
|
from transformers import pipeline |
|
|
|
model = pipeline(model="Henry65/RepoSim4Py", trust_remote_code=True) |
|
``` |
|
Then pass one repository (or multiple repositories in a tuple) as input and get the result as a list of dictionaries, one per repository:
|
```python |
|
repo_infos = model("lazyhope/python-hello-world") |
|
print(repo_infos) |
|
``` |
|
Output (long NumPy array values are omitted):
|
```python |
|
[{'name': 'lazyhope/python-hello-world', |
|
'topics': [], |
|
'license': 'MIT', |
|
'stars': 0, |
|
'code_embeddings': array([[-2.07551336e+00, 2.81387949e+00, 2.35216689e+00, ...]], dtype=float32), |
|
'mean_code_embedding': array([[-2.07551336e+00, 2.81387949e+00, 2.35216689e+00, ...]], dtype=float32), |
|
'doc_embeddings': array([[-2.37494540e+00, 5.40957630e-01, 2.29580235e+00, ...]], dtype=float32), |
|
'mean_doc_embedding': array([[-2.37494540e+00, 5.40957630e-01, 2.29580235e+00, ...]], dtype=float32), |
|
'requirement_embeddings': array([[0., 0., 0., ...]], dtype=float32), |
|
'mean_requirement_embedding': array([[0., 0., 0., ...]], dtype=float32), |
|
'readme_embeddings': array([[-2.1671042 , 2.8404987 , 1.4761417 , ...]], dtype=float32), |
|
'mean_readme_embedding': array([[-1.91171765e+00, 1.65386486e+00, 9.49612021e-01, ...]], dtype=float32), |
|
'mean_repo_embedding': array([[-2.0755134, 2.8138795, 2.352167 , ...]], dtype=float32), |
|
'code_embeddings_shape': (1, 768),

'mean_code_embedding_shape': (1, 768),

'doc_embeddings_shape': (1, 768),

'mean_doc_embedding_shape': (1, 768),

'requirement_embeddings_shape': (1, 768),

'mean_requirement_embedding_shape': (1, 768),

'readme_embeddings_shape': (3, 768),

'mean_readme_embedding_shape': (1, 768),

'mean_repo_embedding_shape': (1, 3072)
|
}] |
|
``` |
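The `mean_repo_embedding_shape` of `(1, 3072)` is consistent with concatenating the four `(1, 768)` level means (code, documentation, requirements, README). A minimal sketch of that assumption, using random stand-in arrays rather than real pipeline output:

```python
import numpy as np

# Random stand-ins for the four level means (code, doc, requirement,
# README), each shaped (1, 768) as in the output above.
rng = np.random.default_rng(0)
level_means = [rng.normal(size=(1, 768)).astype(np.float32) for _ in range(4)]

# Concatenating them along the last axis yields a (1, 3072) vector,
# matching `mean_repo_embedding_shape`.
mean_repo_embedding = np.concatenate(level_means, axis=1)
print(mean_repo_embedding.shape)  # (1, 3072)
```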
|
For more specific information, please refer to [Example.py](https://github.com/RepoMining/RepoSim4Py/blob/main/Script/Example.py). Note that the "github_token" argument is not required.
|
|
|
## Training Details |
|
|
|
Please follow the original [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) page for details of fine-tuning it on the code search task.
|
|
|
## Evaluation |
|
|
|
We used the [awesome-python](https://github.com/vinta/awesome-python) list, which contains over 400 Python repositories organised by topic, to label similar repositories.
|
The evaluation metrics and results can be found in the RepoSim4Py repository, under the [Embedding](https://github.com/RepoMining/RepoSim4Py/tree/main/Embedding) folder. |
|
|
|
## Acknowledgements |
|
Many thanks to the authors of the UniXcoder model and the AdvTest dataset, as well as the awesome-python list, for providing a useful baseline.
|
- **UniXcoder** (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) |
|
- **AdvTest** (https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) |
|
- **awesome-python** (https://github.com/vinta/awesome-python) |
|
|
|
## Authors |
|
- **Honglin Zhang** (https://github.com/liaomu0926) |
|
- **Rosa Filgueira** (https://www.rosafilgueira.com) |