import streamlit as st
import pandas as pd
# Custom CSS for Styling
st.markdown("""
<style>
.main-title {
font-size: 36px;
color: #4A90E2;
font-weight: bold;
text-align: center;
}
.sub-title {
font-size: 24px;
color: #4A90E2;
margin-top: 20px;
}
.section {
background-color: #f9f9f9;
padding: 15px;
border-radius: 10px;
margin-top: 20px;
}
.section p, .section ul {
color: #666666;
}
.link {
color: #4A90E2;
text-decoration: none;
}
h2 {
color: #4A90E2;
font-size: 28px;
margin-top: 30px;
}
h3 {
color: #4A90E2;
font-size: 22px;
margin-top: 20px;
}
h4 {
color: #4A90E2;
font-size: 18px;
margin-top: 15px;
}
</style>
""", unsafe_allow_html=True)
# Main Title
st.markdown('<div class="main-title">Multilingual Text Translation with Spark NLP and MarianMT</div>', unsafe_allow_html=True)
# Overview Section
st.markdown("""
<div class="section">
<p>Multilingual text translation is a core tool for working across language barriers. MarianMT is a family of fast neural machine translation models built on the Transformer architecture and covering over 1,000 translation directions. This guide shows how to use MarianMT within Spark NLP to translate text between languages.</p>
</div>
""", unsafe_allow_html=True)
# Introduction to MarianMT and Spark NLP
st.markdown('<div class="sub-title">What is MarianMT?</div>', unsafe_allow_html=True)
# What is MarianMT?
st.markdown("""
<div class="section">
<p>MarianMT refers to translation models trained with Marian, a neural machine translation framework written in C++ and developed mainly by the Microsoft Translator team together with academic contributors. The pretrained models come from the Helsinki-NLP OPUS-MT project (see References) and cover a large number of language pairs. Marian itself is used in both industrial and research applications.</p>
</div>
""", unsafe_allow_html=True)
# Pipeline and Results
st.markdown('<div class="sub-title">Pipeline and Results</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>In this section, we will build a Spark NLP pipeline that uses the MarianMT model to translate English text into Chinese. We'll demonstrate the translation process from data preparation to the final output.</p>
</div>
""", unsafe_allow_html=True)
# Step 1: Creating the Data
st.markdown("""
<div class="section">
<h4>Step 1: Creating the Data</h4>
<p>We'll begin by creating a Spark DataFrame containing the English text that we want to translate into Chinese.</p>
</div>
""", unsafe_allow_html=True)
st.code("""
import sparknlp
spark = sparknlp.start()  # Spark session with Spark NLP on the classpath
data = [["Hello, how are you?"]]
df = spark.createDataFrame(data).toDF("text")
""", language="python")
# Step 2: Assembling the Pipeline
st.markdown("""
<div class="section">
<h4>Step 2: Assembling the Pipeline</h4>
<p>We will now set up a Spark NLP pipeline that includes a document assembler, a sentence detector, and the MarianMT model for translation.</p>
</div>
""", unsafe_allow_html=True)
st.code("""
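from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline
# Wrap the raw "text" column in Spark NLP's document annotation
document_assembler = DocumentAssembler()\\
    .setInputCol("text")\\
    .setOutputCol("document")
# Split each document into sentences with the multilingual ("xx") detector
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\\
    .setInputCols(["document"])\\
    .setOutputCol("sentences")
# Translate each detected sentence from English to Chinese
marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx")\\
    .setInputCols(["sentences"])\\
    .setOutputCol("translation")
# Chain the stages, then fit and apply the pipeline to the DataFrame
pipeline = Pipeline(stages=[document_assembler, sentence_detector, marian])
model = pipeline.fit(df)
result = model.transform(df)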
""", language="python")
# Step 3: Viewing the Results
st.markdown("""
<div class="section">
<h4>Step 3: Viewing the Results</h4>
<p>After processing the text, we can view the translations generated by the MarianMT model:</p>
</div>
""", unsafe_allow_html=True)
st.code("""
result.select("translation.result").show(truncate=False)
""", language="python")
st.text("""
+--------------+
|result |
+--------------+
|[你好,你好吗?] |
+--------------+
""")
# Model Information and Use Cases
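# Hedged sketch shown ahead of the model information: joining sentence-level translations (illustrative helper)
st.markdown("""
The `translation.result` column holds one translated string per detected sentence, so a multi-sentence input comes back as a list. The sketch below shows one possible way to stitch those pieces back together once the rows have been collected to the driver. The helper name `join_translations` and the `(text, sentences)` row shape are assumptions made for illustration; they are not part of Spark NLP.

```python
from typing import Iterable, List, Sequence, Tuple


def join_translations(
    rows: Iterable[Tuple[str, Sequence[str]]], sep: str = " "
) -> List[Tuple[str, str]]:
    # Hypothetical helper: each row is (source_text, translated_sentences),
    # for example built from result.select("text", "translation.result").collect().
    joined = []
    for source, sentences in rows:
        # Drop empty pieces, then glue the sentence-level translations together.
        pieces = [s.strip() for s in sentences if s and s.strip()]
        joined.append((source, sep.join(pieces)))
    return joined


# Stand-in for collected rows, so the sketch runs without a Spark session.
rows = [("Hello, how are you?", ["你好,你好吗?"])]
pairs = join_translations(rows)
```

For Chinese output, an empty separator (`sep=''`) is probably the better choice, since sentences are not separated by spaces.
""")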
st.markdown("""
<div class="section">
<h4>Model Information and Use Cases</h4>
<p>MarianMT models are available for many translation directions. The model used in this guide has the following characteristics:</p>
<ul>
<li><b>Model Name:</b> opus_mt_en_zh</li>
<li><b>Input Language:</b> English (en)</li>
<li><b>Output Language:</b> Chinese (zh)</li>
<li><b>Best for:</b> General text translation from English to Chinese.</li>
<li><b>Compatibility:</b> Spark NLP 2.7.0+</li>
</ul>
</div>
""", unsafe_allow_html=True)
# Conclusion
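# Hedged sketch shown ahead of the conclusion: composing a model name from language codes (illustrative helper)
st.markdown("""
The model name `opus_mt_en_zh` appears to follow an `opus_mt_<source>_<target>` pattern. Assuming that pattern holds for the language pair you need, a small helper can compose the name from two language codes. The function `marian_model_name` is hypothetical, and the restriction to two- or three-letter codes is an assumption; not every pair is necessarily published, so check the Spark NLP Models page before relying on a generated name.

```python
import re


def marian_model_name(source_lang: str, target_lang: str) -> str:
    # Assumption: names follow the opus_mt_<source>_<target> pattern seen in
    # "opus_mt_en_zh"; this does not check that the model actually exists.
    codes = [source_lang.strip().lower(), target_lang.strip().lower()]
    for code in codes:
        # Accept only simple two- or three-letter codes in this sketch.
        if not re.fullmatch(r"[a-z]{2,3}", code):
            raise ValueError(f"unexpected language code: {code!r}")
    return "opus_mt_{}_{}".format(*codes)


name = marian_model_name("en", "zh")
```

The resulting string could then be passed to `MarianTransformer.pretrained(name, "xx")`, as in Step 2.
""")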
st.markdown("""
<div class="section">
<h4>Conclusion</h4>
<p>By integrating MarianMT with Spark NLP, you can run neural machine translation as a distributed Spark pipeline. The example here translates English text to Chinese using the <code>opus_mt_en_zh</code> model, and the same pipeline applies whether you are working with a handful of sentences or a large dataset.</p>
</div>
""", unsafe_allow_html=True)
# References
st.markdown("""
<div class="section">
<h4>References</h4>
<ul>
<li>Model Documentation: <a class="link" href="https://sparknlp.org/models" target="_blank">Spark NLP Models</a></li>
<li>MarianMT Information: <a class="link" href="https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models" target="_blank">OPUS-MT Models</a></li>
<li>John Snow Labs: <a class="link" href="https://nlp.johnsnowlabs.com/" target="_blank">Spark NLP Documentation</a></li>
</ul>
</div>
""", unsafe_allow_html=True)
# Community & Support
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<ul>
<li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
<li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
<li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Tutorials and articles</li>
</ul>
</div>
""", unsafe_allow_html=True)