
	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: fill-mask
	tags:
	- code
	- python
	- java
	- javascript
	- php
	- typescript
	- rust
	- ruby
	- go
	- embedding
	- modernbert
	datasets:
	- Shuu12121/php-codesearch-tree-sitter-filtered-v2
	- Shuu12121/ruby-codesearch-tree-sitter-filtered-v2
	- Shuu12121/rust-codesearch-tree-sitter-filtered-v2
	- Shuu12121/go-codesearch-tree-sitter-filtered-v2
	- Shuu12121/javascript-codesearch-tree-sitter-filtered-v2
	- Shuu12121/java-codesearch-tree-sitter-filtered-v2
	- Shuu12121/typescript-codesearch-tree-sitter-filtered-v2
	- Shuu12121/python-codesearch-tree-sitter-filtered-v2
	---


	# 🦉 Shuu12121/CodeModernBERT-Owl-2.0-Pre

	`CodeModernBERT-Owl-2.0-Pre` is the latest pretrained model in the CodeModernBERT-Owl series, built for multilingual code understanding and retrieval.

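	The model was pretrained solely on a high-quality corpus that was independently collected and constructed, roughly four times the size of the bimodal training data used for CodeBERT (Feng et al., 2020).
	Compared with the previous version (`CodeModernBERT-Owl-1.0`), it was trained on roughly twice as much data and has learned richer syntactic and semantic information.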

	In addition to the seven previously supported languages (Python, Java, JavaScript, PHP, Ruby, Go, Rust), TypeScript has been added to the corpus, so the model now covers a wider range of programming languages.

	Training used long code samples of up to 2048 tokens, and the model can process inputs of up to 8192 tokens at inference time (the position embeddings have already been extended).

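	Furthermore, the following custom preprocessing and filtering steps are combined to remove noise and to maximize training efficiency and accuracy:

	* Strict extraction of functions and docstrings through `Tree-sitter`-based syntactic parsing
	* Removal of non-English docstrings and meaningless boilerplate comments
	* Detection and automatic masking of API keys and other secrets
	* Exclusion of functions that contain license information
	* Deduplication of function/docstring pairs (to guard against data leakage)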
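The extraction step in the list above can be illustrated for Python alone with the standard-library `ast` module standing in for Tree-sitter. This is only a sketch under that assumption: the card describes a Tree-sitter based pipeline covering eight languages, and the helper name `extract_pairs` and the `SAMPLE` snippet are hypothetical.

```python
import ast


def extract_pairs(source: str) -> list[tuple[str, str]]:
    """Return (docstring, function source) pairs for documented functions.

    Uses Python's `ast` module as a stand-in for Tree-sitter, so it only
    handles Python source.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # functions without a docstring yield no pair
                code = ast.get_source_segment(source, node)
                pairs.append((doc, code))
    return pairs


# Hypothetical input: one documented function and one undocumented helper.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def helper(x):
    return x * 2
'''

if __name__ == "__main__":
    for doc, code in extract_pairs(SAMPLE):
        print(doc)
        print(code)
```

Only `add` produces a pair here, because `helper` has no docstring to align with its body.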

	---

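	## Basic Information

	* Supported languages: Python, Java, JavaScript, PHP, Ruby, Go, Rust, TypeScript
	* Maximum token length during training: 2048
	* Maximum token length at inference: 8192 (extended)
	* Tokenizer: custom-trained, BPE-based
	* Model size: approximately 150M parameters (ModernBERT-based)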

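	Typical use cases:

	* Function-level code search (natural language → code)
	* Downstream tasks such as code completion, summarization, classification, and code clone detection
	* A code retrieval backbone for Retrieval-Augmented Generation (RAG)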

	---

	## English version

	`CodeModernBERT-Owl-2.0-Pre` is the latest pretrained model in the CodeModernBERT-Owl series for multilingual code understanding and retrieval.

	This model was trained entirely on a custom-built, high-quality corpus that is roughly four times the size of the bimodal dataset used for CodeBERT (Feng et al., 2020).
	Compared with the previous version (`CodeModernBERT-Owl-1.0`), it was trained on roughly twice as much data, allowing it to capture more structural and semantic patterns.

	This release also adds TypeScript to the seven previously supported languages (Python, Java, JavaScript, PHP, Ruby, Go, Rust), further broadening the model’s applicability.

	The model was trained on inputs up to 2048 tokens, and supports inference up to 8192 tokens thanks to extended positional embeddings.

	A set of custom preprocessing and filtering steps was applied to ensure data quality and training stability:

	* Precise function and docstring extraction via `Tree-sitter`-based parsing
	* Removal of non-English docstrings and meaningless boilerplate comments
	* Automatic detection and masking of API keys and secrets
	* Exclusion of functions that contain license information
	* Deduplication of code/docstring pairs to prevent data leakage
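The filtering steps listed above might look roughly like the following for a list of (docstring, code) pairs. Everything here is an assumption for illustration: the card does not publish its rules, so the regular expressions, the ASCII-ratio heuristic used as a crude proxy for "English", the `<MASKED>` placeholder, and all function names are hypothetical.

```python
import hashlib
import re

# Hypothetical secret pattern: a suspicious name, an assignment, a quoted literal.
SECRET_RE = re.compile(
    r"""(?i)(api[_-]?key|secret|token|password)(\s*[:=]\s*)(["'])[^"']{8,}\3"""
)
# Hypothetical license markers.
LICENSE_RE = re.compile(r"(?i)\b(copyright|licensed under|spdx-license-identifier)\b")


def mask_secrets(code: str) -> str:
    """Replace quoted literals assigned to secret-looking names with a placeholder."""
    return SECRET_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}<MASKED>{m.group(3)}", code
    )


def is_mostly_ascii(text: str, threshold: float = 0.9) -> bool:
    """Crude proxy for an English docstring: share of ASCII characters."""
    if not text:
        return False
    return sum(ch.isascii() for ch in text) / len(text) >= threshold


def mentions_license(text: str) -> bool:
    """True if the text contains one of the license markers."""
    return LICENSE_RE.search(text) is not None


def pair_key(doc: str, code: str) -> str:
    """Hash of the pair with whitespace collapsed, used for deduplication."""
    normalized = " ".join((doc + "\n" + code).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_pairs(pairs):
    """Filter, mask, and deduplicate (docstring, code) pairs, keeping first occurrences."""
    seen = set()
    kept = []
    for doc, code in pairs:
        if not is_mostly_ascii(doc) or mentions_license(doc + "\n" + code):
            continue
        code = mask_secrets(code)
        key = pair_key(doc, code)
        if key in seen:
            continue
        seen.add(key)
        kept.append((doc, code))
    return kept
```

Deduplicating on a whitespace-normalized hash catches only exact repeats; near-duplicate detection would need something stronger, which this sketch does not attempt.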

	---

	* Supported Languages: Python, Java, JavaScript, PHP, Ruby, Go, Rust, TypeScript
	* Max Training Sequence Length: 2048 tokens
	* Max Inference Sequence Length: 8192 tokens (position embeddings extended)
	* Tokenizer: Custom-trained BPE
	* Model Size: ~150M parameters (ModernBERT backbone)
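A minimal sketch of querying the model through the `transformers` fill-mask pipeline, matching the `fill-mask` pipeline tag in the card metadata. It assumes a `transformers` release recent enough to include ModernBERT support and network access to the Hub. The helper names `mask_first` and `top_predictions` are hypothetical, and the mask token is read from the tokenizer rather than hard-coded because the tokenizer is custom-trained.

```python
MODEL_ID = "Shuu12121/CodeModernBERT-Owl-2.0-Pre"


def mask_first(code: str, target: str, mask_token: str) -> str:
    """Replace the first occurrence of `target` in `code` with the mask token."""
    if target not in code:
        raise ValueError(f"{target!r} not found in code")
    return code.replace(target, mask_token, 1)


def top_predictions(code: str, target: str, k: int = 5):
    """Mask `target` and return the pipeline's top-k (token, score) candidates."""
    # Imported lazily so mask_first() works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    masked = mask_first(code, target, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(masked, top_k=k)]


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b"
    for token, score in top_predictions(snippet, "return"):
        print(f"{token!r}: {score:.3f}")
```

Running the script downloads the checkpoint on first use; no expected output is shown because the predictions depend on the model weights.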

	Primary Use Cases:

	* Function-level code search (natural language → code)
	* Tasks such as code summarization, completion, classification, and clone detection
	* High-quality retrieval for RAG (Retrieval-Augmented Generation) systems
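For the code search use case, one possible setup is to embed the query and each candidate function, then rank candidates by cosine similarity. The sketch below assumes mean pooling over the last hidden state, which the card does not specify, and assumes `torch` plus a `transformers` release with ModernBERT support. The function names are hypothetical. Since the card describes this as a pretrained checkpoint, the sketch makes no claim about retrieval quality without further fine-tuning.

```python
import math

MODEL_ID = "Shuu12121/CodeModernBERT-Owl-2.0-Pre"


def embed(texts, max_length=2048):
    """Return one mean-pooled vector per text (pooling choice is an assumption)."""
    # Imported lazily so the ranking helpers below run without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID).eval()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).last_hidden_state
    # Average token vectors, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return pooled.tolist()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, code_vecs):
    """Return candidate indices sorted by descending cosine similarity to the query."""
    scores = [cosine(query_vec, v) for v in code_vecs]
    return sorted(range(len(code_vecs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    query = "add two numbers"
    candidates = ["def add(a, b):\n    return a + b",
                  "def read_file(path):\n    return open(path).read()"]
    vectors = embed([query] + candidates)
    print(rank(vectors[0], vectors[1:]))
```

In a RAG setting the candidate vectors would normally be computed once and stored in an index, with only the query embedded at request time.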

	---