---
license: mit
pipeline_tag: text-generation
tags:
  - biology
  - genomics
  - long-context
library_name: transformers
---

# GENERator-eukaryote-3b-base model

## About

In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data endow GENERator with enhanced understanding and generation capabilities across various organisms.

For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://arxiv.org/abs/2502.07272).

## How to use

### Simple example 1: generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Prepend the BOS token to each sequence.
sequences = [tokenizer.bos_token + sequence for sequence in sequences]

# Tokenize the sequences.
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

# Decode the generated sequences.
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences.
print(decoded_sequences)

# It is expected to observe nonsense decoded sequences (e.g., 'AAAAAA'):
# the input sequences are too short to provide sufficient context.
```

### Simple example 2: embedding

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True to automatically add special tokens,
# such as the BOS and EOS tokens, at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor with shape (batch_size, hidden_size).
seq_embeddings = torch.stack(seq_embeddings)
print("Sequence Embeddings:", seq_embeddings)
```
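The example above uses the final EOS-token hidden state as the sequence embedding. As an optional variation (a minimal sketch, not part of the original card), the snippet below mean-pools the last-layer hidden states over non-padding positions, another common way to derive a fixed-size embedding; it reuses the `hidden_states` and `attention_mask` variables from the example above.

```python
# Optional variation (sketch, not from the original card): mean-pool the last-layer
# hidden states over non-padding positions to obtain an alternative sequence embedding.
# Reuses `hidden_states` and `attention_mask` from the embedding example above.
mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch_size, seq_len, 1)
mean_embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
print("Mean-pooled embeddings shape:", mean_embeddings.shape)  # (batch_size, hidden_size)
```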
## Citation

```
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}
```