chenwuml committed
Commit 58fc7b4 · verified · 1 Parent(s): cf92c17

Update README.md

Files changed (1): README.md (+53, -34)
README.md CHANGED
@@ -52,7 +52,7 @@ The 12 evaluation tasks are summarized below (as per [InfiniteBench]((https://gi
 | Retrieve.KV | Synthetic | 500 | 89.9k | 22.7 | Finding the corresponding value from a dictionary and a key. |
 
 
-## How to Serve MegaBeam-Mistral-7B-300k on vLLM ##
+## Serve MegaBeam-Mistral-7B-300k on EC2 instances ##
 On an AWS `g5.48xlarge` instance, upgrade vLLM to the latest version as per [documentation on vLLM](https://vllm.readthedocs.io/en/latest/).
 
 ### Start the server
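The server-launch and client code under `### Start the server` is unchanged by this commit and therefore elided from the diff; the second hunk below resumes at the tail of that client example. For orientation only, a client for vLLM's OpenAI-compatible API might look like the following minimal sketch, where the base URL, API key, and prompt are assumptions rather than values from the README:

```python
# Minimal sketch (assumptions, not from the README): query the vLLM
# OpenAI-compatible server started above. vLLM serves on port 8000 by
# default and ignores the API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
chat_completion = client.chat.completions.create(
    model="amazon/MegaBeam-Mistral-7B-300k",
    messages=[{"role": "user", "content": "What is your favourite condiment?"}],
    max_tokens=75,
)
print("Chat completion results:")
print(chat_completion)
```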
@@ -93,54 +93,73 @@ print("Chat completion results:")
 print(chat_completion)
 ```
 
-### Deploy the Model as A SageMaker Endpoint ###
-To deploy MegaBeam-Mistral-7B-300k on a SageMaker endpoint, please follow the example code as below.
+### Deploy the model on a SageMaker Endpoint ###
+To deploy MegaBeam-Mistral-7B-300k on a SageMaker endpoint, please follow this [SageMaker DJL deployment guide](https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/vllm_deploy_mistral_7b.html).
 
-```shell
-#Requires: [sagemaker](https://pypi.org/project/sagemaker/) 2.192.1 or later.
-pip install -U sagemaker
-```
+Run the following Python statements in a SageMaker notebook (with each block running in a separate cell).
 
 ```python
 import sagemaker
-from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
-import time
+from sagemaker import Model, image_uris, serializers, deserializers
 
 sagemaker_session = sagemaker.Session()
 region = sagemaker_session.boto_region_name
 role = sagemaker.get_execution_role()
 
-image_uri = get_huggingface_llm_image_uri(
-    backend="huggingface", # or lmi
-    region=region,
+# run this block in a separate notebook cell
+%%writefile serving.properties
+engine=Python
+option.model_id=amazon/MegaBeam-Mistral-7B-300k
+option.dtype=bf16
+option.task=text-generation
+option.rolling_batch=vllm
+option.tensor_parallel_degree=8
+option.device_map=auto
+
+# run this block in a separate notebook cell
+%%sh
+mkdir mymodel
+mv serving.properties mymodel/
+tar czvf mymodel.tar.gz mymodel/
+rm -rf mymodel
+
+image_uri = image_uris.retrieve(
+    framework="djl-deepspeed",
+    region=region,
+    version="0.27.0"
 )
 
-model_name = "MegaBeam-Mistral-7B-300k-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
-
-hub = {
-    'HF_MODEL_ID':'amazon/MegaBeam-Mistral-7B-300k',
-    'HF_TASK':'text-generation',
-    'SM_NUM_GPUS':'8',
-    "MAX_INPUT_LENGTH": '288416',
-    "MAX_TOTAL_TOKENS": '288800',
-    "MAX_BATCH_PREFILL_TOKENS": '288800',
-    "MAX_BATCH_TOTAL_TOKENS": '288800',
-}
-
-model = HuggingFaceModel(
-    name=model_name,
-    env=hub,
-    role=role,
-    image_uri=image_uri
+s3_code_prefix = "megaBeam-mistral-7b-300k/code"
+bucket = sagemaker_session.default_bucket()  # bucket to house artifacts
+code_artifact = sagemaker_session.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
+print(f"S3 code or model tarball uploaded to ---> {code_artifact}")
+model = Model(image_uri=image_uri, model_data=code_artifact, role=role)
+
+instance_type = "ml.g5.48xlarge"
+endpoint_name = sagemaker.utils.name_from_base("megaBeam-mistral-7b-300k")
+model.deploy(initial_instance_count=1,
+             instance_type=instance_type,
+             endpoint_name=endpoint_name
+)
+
+# our requests and responses will be in JSON format, so we specify the serializer and the deserializer
+predictor = sagemaker.Predictor(
+    endpoint_name=endpoint_name,
+    sagemaker_session=sagemaker_session,
+    serializer=serializers.JSONSerializer(),
 )
-predictor = model.deploy(
-    initial_instance_count=1,
-    instance_type="ml.g5.48xlarge",
-    endpoint_name=model_name,
-
+
+# test the endpoint
+input_str = """<s>[INST] What is your favourite condiment? [/INST]
+Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>
+[INST] Do you have mayonnaise recipes? [/INST]"""
+predictor.predict(
+    {"inputs": input_str, "parameters": {"max_new_tokens": 75}}
 )
+
 ```
 
+
 ## Limitations ##
 Before using the MegaBeam-Mistral-7B-300k model, it is important to perform your own independent assessment, and take measures to ensure that your use would comply with your own specific quality control practices and standards, and that your use would comply with the local rules, laws, regulations, licenses and terms that apply to you, and your content.
 
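Once the endpoint is in service, it can also be called from outside the notebook through the SageMaker runtime API. A minimal sketch using boto3 follows, where the endpoint name is a placeholder for whatever `name_from_base` generated at deploy time and the payload mirrors the `predictor.predict` test above:

```python
# Minimal sketch (assumptions noted): invoke the deployed endpoint with boto3.
# Replace the endpoint name with the one generated by name_from_base above.
import json

import boto3

endpoint_name = "megaBeam-mistral-7b-300k-..."  # placeholder, not a real name
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({"inputs": "<s>[INST] Do you have mayonnaise recipes? [/INST]",
                     "parameters": {"max_new_tokens": 75}}),
)
print(response["Body"].read().decode("utf-8"))
```

When the endpoint is no longer needed, the standard teardown calls are `sagemaker_session.delete_endpoint(endpoint_name)`, `sagemaker_session.delete_endpoint_config(endpoint_name)`, and `model.delete_model()`, which stop the per-hour instance charges.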