---
library_name: keras-hub
---
### Model Overview

# Model Summary

Mixtral is a family of large language models published by the Mistral AI team. The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative sparse Mixture of Experts (MoE) model that routes each token through 2 of the 8 experts in every MoE layer. Both pretrained and instruction-tuned variants are available.

Weights are released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).

## Links

* [Mixtral Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/mixtral-quickstart-notebook)
* [Mixtral API Documentation](https://keras.io/keras_hub/api/models/mixtral/)
* [Mixtral Model Card](https://mistral.ai/news/mixtral-of-experts)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```
pip install -U -q keras-hub
pip install -U -q keras
```

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.

## Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset name              | Parameters | Description                                                                                          |
|--------------------------|------------|------------------------------------------------------------------------------------------------------|
| mixtral_8_7b_en          | 7B         | 32-layer Mixtral MoE model with 8 experts per MoE layer and 2 experts active per token.               |
| mixtral_8_instruct_7b_en | 7B         | Instruction fine-tuned 32-layer Mixtral MoE model with 8 experts per MoE layer and 2 experts active per token. |
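
The instruction-tuned preset uses the same API as the examples below; here is a minimal sketch (the prompt text and generation settings are illustrative only):

```Python
import keras_hub

# A minimal sketch: load the instruction-tuned preset from the table above and
# prompt it with Mixtral's [INST] ... [/INST] chat format.
mixtral_instruct_lm = keras_hub.models.MixtralCausalLM.from_preset(
    "mixtral_8_instruct_7b_en",
    dtype="bfloat16",  # optional; lowers memory use
)
mixtral_instruct_lm.generate(
    "[INST] Explain mixture of experts in one sentence. [/INST]",
    max_length=128,
)
```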
[/INST]" ], max_length=500) # Using different sampling strategies mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("mixtral_8_7b_en") # Greedy sampling mixtral_lm.compile(sampler="greedy") mixtral_lm.generate("I want to say", max_length=30) # Beam search mixtral_lm.compile( sampler=keras_hub.samplers.BeamSampler( num_beams=2, top_k_experts=2, # MoE-specific: number of experts to use per token ) ) mixtral_lm.generate("I want to say", max_length=30) # Generate without preprocessing prompt = { "token_ids": np.array([[1, 315, 947, 298, 1315, 0, 0, 0, 0, 0]] * 2), "padding_mask": np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]] * 2), } mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( "mixtral_8_7b_en", preprocessor=None, dtype="bfloat16" ) mixtral_lm.generate( prompt, num_experts=8, # Total number of experts per layer top_k_experts=2, # Number of experts to use per token router_aux_loss_coef=0.02 # Router auxiliary loss coefficient ) # Training on a single batch features = ["The quick brown fox jumped.", "I forgot my homework."] mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( "mixtral_8_7b_en", dtype="bfloat16" ) mixtral_lm.fit( x=features, batch_size=2, router_aux_loss_coef=0.02 # MoE-specific: router training loss ) # Training without preprocessing x = { "token_ids": np.array([[1, 315, 947, 298, 1315, 369, 315, 837, 0, 0]] * 2), "padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]] * 2), } y = np.array([[315, 947, 298, 1315, 369, 315, 837, 0, 0, 0]] * 2) sw = np.array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]] * 2) mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( "mixtral_8_7b_en", preprocessor=None, dtype="bfloat16" ) mixtral_lm.fit( x=x, y=y, sample_weight=sw, batch_size=2, router_aux_loss_coef=0.02 ) ``` ## Example Usage with Hugging Face URI ```Python import keras import keras_hub import numpy as np # Basic text generation mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("hf://keras/mixtral_8_7b_en") mixtral_lm.generate("[INST] What is Keras? [/INST]", max_length=500) # Generate with batched prompts mixtral_lm.generate([ "[INST] What is Keras? [/INST]", "[INST] Give me your best brownie recipe. 
[/INST]" ], max_length=500) # Using different sampling strategies mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("hf://keras/mixtral_8_7b_en") # Greedy sampling mixtral_lm.compile(sampler="greedy") mixtral_lm.generate("I want to say", max_length=30) # Beam search mixtral_lm.compile( sampler=keras_hub.samplers.BeamSampler( num_beams=2, top_k_experts=2, # MoE-specific: number of experts to use per token ) ) mixtral_lm.generate("I want to say", max_length=30) # Generate without preprocessing prompt = { "token_ids": np.array([[1, 315, 947, 298, 1315, 0, 0, 0, 0, 0]] * 2), "padding_mask": np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]] * 2), } mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( "hf://keras/mixtral_8_7b_en", preprocessor=None, dtype="bfloat16" ) mixtral_lm.generate( prompt, num_experts=8, # Total number of experts per layer top_k_experts=2, # Number of experts to use per token router_aux_loss_coef=0.02 # Router auxiliary loss coefficient ) # Training on a single batch features = ["The quick brown fox jumped.", "I forgot my homework."] mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( "hf://keras/mixtral_8_7b_en", dtype="bfloat16" ) mixtral_lm.fit( x=features, batch_size=2, router_aux_loss_coef=0.02 # MoE-specific: router training loss ) # Training without preprocessing x = { "token_ids": np.array([[1, 315, 947, 298, 1315, 369, 315, 837, 0, 0]] * 2), "padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]] * 2), } y = np.array([[315, 947, 298, 1315, 369, 315, 837, 0, 0, 0]] * 2) sw = np.array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]] * 2) mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( "hf://keras/mixtral_8_7b_en", preprocessor=None, dtype="bfloat16" ) mixtral_lm.fit( x=x, y=y, sample_weight=sw, batch_size=2, router_aux_loss_coef=0.02 ) ```