import streamlit as st
import pandas as pd

# Custom CSS for Styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
        h2 {
            color: #4A90E2;
            font-size: 28px;
            margin-top: 30px;
        }
        h3 {
            color: #4A90E2;
            font-size: 22px;
            margin-top: 20px;
        }
        h4 {
            color: #4A90E2;
            font-size: 18px;
            margin-top: 15px;
        }
    </style>
""", unsafe_allow_html=True)

# Main Title
st.markdown('<div class="main-title">Multilingual Text Translation with Spark NLP and MarianMT</div>', unsafe_allow_html=True)

# Overview Section
st.markdown("""
<div class="section">
    <p>Multilingual text translation helps bridge language barriers in applications that serve a global audience. MarianMT is a family of fast neural machine translation models built on the Transformer architecture, and it supports over 1,000 translation directions. This guide demonstrates how to use MarianMT within Spark NLP to translate text between languages.</p>
</div>
""", unsafe_allow_html=True)

# Introduction to MarianMT and Spark NLP
st.markdown('<div class="sub-title">What is MarianMT?</div>', unsafe_allow_html=True)

# What is MarianMT?
st.markdown("""
<div class="section">
    <p>MarianMT refers to translation models trained with Marian, an efficient neural machine translation framework written in C++ and developed mainly by the Microsoft Translator team together with academic contributors. The pretrained models used in this guide come from the OPUS-MT project of the Helsinki-NLP group (see References). Marian is used in both industrial and research applications.</p>
</div>
""", unsafe_allow_html=True)

# Pipeline and Results
st.markdown('<div class="sub-title">Pipeline and Results</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <p>In this section, we will build a Spark NLP pipeline that uses the MarianMT model to translate English text into Chinese. We'll demonstrate the translation process from data preparation to the final output.</p>
</div>
""", unsafe_allow_html=True)

# Step 1: Creating the Data
st.markdown("""
<div class="section">
    <h4>Step 1: Creating the Data</h4>
    <p>We'll begin by creating a Spark DataFrame containing the English text that we want to translate into Chinese.</p>
</div>
""", unsafe_allow_html=True)

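st.code("""
import sparknlp

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# A single-column DataFrame holding the text to translate
data = [["Hello, how are you?"]]
df = spark.createDataFrame(data).toDF("text")
""", language="python")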

# Step 2: Assembling the Pipeline
st.markdown("""
<div class="section">
    <h4>Step 2: Assembling the Pipeline</h4>
    <p>We will now set up a Spark NLP pipeline with three stages: a document assembler, a sentence detector, and the MarianMT model for translation.</p>
</div>
""", unsafe_allow_html=True)

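st.code("""
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

# Convert the raw "text" column into Spark NLP document annotations
document_assembler = DocumentAssembler()\\
    .setInputCol("text")\\
    .setOutputCol("document")

# Split each document into sentences with a multilingual ("xx") model
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\\
    .setInputCols(["document"])\\
    .setOutputCol("sentences")

# Translate each sentence from English to Chinese
marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx")\\
    .setInputCols(["sentences"])\\
    .setOutputCol("translation")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, marian])

# All stages are pretrained, so fit() only assembles the PipelineModel
model = pipeline.fit(df)
result = model.transform(df)
""", language="python")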

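# Optional: quick translations with LightPipeline
st.markdown("""
<div class="section">
    <p>For a handful of short strings, a full distributed job may be more than is needed. The snippet below is a minimal sketch, not part of the original walkthrough, that wraps the fitted <code>model</code> from Step 2 in Spark NLP's <code>LightPipeline</code> and translates a plain Python string. The sample sentence and the variable name <code>light_model</code> are illustrative assumptions.</p>
</div>
""", unsafe_allow_html=True)

st.code("""
from sparknlp.base import LightPipeline

# Wrap the fitted PipelineModel from Step 2 for local, in-memory inference
light_model = LightPipeline(model)

# annotate() returns a dict mapping each output column to a list of strings
annotations = light_model.annotate("The weather is nice today.")
print(annotations["translation"])
""", language="python")
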
# Step 3: Viewing the Results
st.markdown("""
<div class="section">
    <h4>Step 3: Viewing the Results</h4>
    <p>After processing the text, we can view the translations generated by the MarianMT model:</p>
</div>
""", unsafe_allow_html=True)

st.code("""
result.select("translation.result").show(truncate=False)
""", language="python")

st.text("""
+--------------+
|result        |
+--------------+
|[你好,你好吗?] |
+--------------+
""")

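# Optional: flattening the translations for downstream use
st.markdown("""
<div class="section">
    <p>The <code>translation.result</code> column holds one array of translated sentences per input row. As a rough sketch, assuming the <code>result</code> DataFrame from Step 2, the array can be exploded into one row per sentence and collected into pandas for inspection. The names <code>flat</code>, <code>flat_pd</code>, and <code>translated</code> are illustrative, and collecting to pandas is only sensible for small results.</p>
</div>
""", unsafe_allow_html=True)

st.code("""
from pyspark.sql import functions as F

# One row per translated sentence, next to its source text
flat = result.select("text", F.explode("translation.result").alias("translated"))

# Bring the (small) result to the driver as a pandas DataFrame
flat_pd = flat.toPandas()
print(flat_pd)
""", language="python")
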
# Model Information and Use Cases
st.markdown("""
<div class="section">
    <h4>Model Information and Use Cases</h4>
    <p>MarianMT models are available for many translation directions. The model used in this guide has the following characteristics:</p>
    <ul>
        <li><b>Model Name:</b> opus_mt_en_zh</li>
        <li><b>Input Language:</b> English (en)</li>
        <li><b>Output Language:</b> Chinese (zh)</li>
        <li><b>Best for:</b> General text translation from English to Chinese.</li>
        <li><b>Compatibility:</b> Spark NLP 2.7.0+</li>
    </ul>
</div>
""", unsafe_allow_html=True)

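# Optional: switching to another translation direction
st.markdown("""
<div class="section">
    <p>Changing the translation direction should only require a different pretrained model name. The sketch below assumes that an English-to-French model named <code>opus_mt_en_fr</code> is available and follows the same <code>opus_mt_&lt;source&gt;_&lt;target&gt;</code> naming as the model above; confirm the exact name on the Spark NLP Models Hub before relying on it. It reuses <code>document_assembler</code>, <code>sentence_detector</code>, and <code>df</code> from the earlier steps.</p>
</div>
""", unsafe_allow_html=True)

st.code("""
# Assumption: "opus_mt_en_fr" exists on the Models Hub under this exact name
marian_fr = MarianTransformer.pretrained("opus_mt_en_fr", "xx")\\
    .setInputCols(["sentences"])\\
    .setOutputCol("translation")

# Reuse the earlier stages and swap in only the translation model
pipeline_fr = Pipeline(stages=[document_assembler, sentence_detector, marian_fr])
result_fr = pipeline_fr.fit(df).transform(df)
result_fr.select("translation.result").show(truncate=False)
""", language="python")
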
# Conclusion
st.markdown("""
<div class="section">
    <h4>Conclusion</h4>
    <p>By integrating MarianMT with Spark NLP, you can translate text across many language pairs while using Spark's distributed computing. The example here translates English text to Chinese with the <code>opus_mt_en_zh</code> model. The same pipeline applies whether you are working with a few sentences or a massive dataset.</p>
</div>
""", unsafe_allow_html=True)

# References
st.markdown("""
<div class="section">
    <h4>References</h4>
    <ul>
        <li>Model Documentation: <a class="link" href="https://sparknlp.org/models" target="_blank">Spark NLP Models</a></li>
        <li>MarianMT Information: <a class="link" href="https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models" target="_blank">OPUS-MT Models</a></li>
        <li>John Snow Labs: <a class="link" href="https://nlp.johnsnowlabs.com/" target="_blank">Spark NLP Documentation</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)

# Community & Support
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Tutorials and articles</li>
    </ul>
</div>
""", unsafe_allow_html=True)