import streamlit as st
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Download NLTK data
nltk.download("punkt")
nltk.download("punkt_tab")  # required by the tokenizers on newer NLTK releases
nltk.download("stopwords")
nltk.download("wordnet")
# Streamlit app configuration
st.set_page_config(page_title="NLP Basics for Beginners", page_icon="🔤", layout="wide")

st.title("🔤 NLP Basics for Beginners")
st.markdown(
    """
    Welcome to the **NLP Basics App**!
    Here, you'll learn about the foundational concepts of **Natural Language Processing (NLP)** through interactive examples.

    Let's explore:
    - **What is NLP?** Its applications and use cases.
    - **Text Representation Basics**: Tokens, sentences, words, stopwords, lemmatization, stemming.
    - **Vectorization Techniques**: Bag of Words (BoW) and TF-IDF.
    """
)
# Divider
st.markdown("---")

# Sidebar navigation
st.sidebar.title("Navigation")
sections = [
    "Introduction to NLP",
    "Tokenization",
    "Stopwords",
    "Lemmatization & Stemming",
    "Bag of Words (BoW)",
    "TF-IDF",
]
selected_section = st.sidebar.radio("Choose a section", sections)

# Input text box
st.sidebar.write("### Enter Text to Analyze:")
text_input = st.sidebar.text_area("Input your text here:", height=150, placeholder="Type or paste some text here...")
if not text_input.strip():
    st.sidebar.warning("Please enter some text to explore NLP concepts.")
# Section 1: Introduction to NLP
if selected_section == "Introduction to NLP":
    st.header("💡 What is NLP?")
    st.write(
        """
        Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) focused on the interaction between computers and human language.
        It enables machines to understand, interpret, and generate human language.

        ### **Applications of NLP**:
        - **Chatbots**: AI-powered conversational agents (e.g., Siri, Alexa).
        - **Text Summarization**: Extracting important information from lengthy documents.
        - **Machine Translation**: Translating text between languages (e.g., Google Translate).
        - **Sentiment Analysis**: Understanding opinions in social media or reviews (positive/negative/neutral).
        """
    )
    st.image("https://miro.medium.com/max/1400/1*H0qcbsUCWkE7O__q2XkKYA.png", caption="Applications of NLP", use_column_width=True)
# Section 2: Tokenization
if selected_section == "Tokenization":
    st.header("🔤 Tokenization")
    st.write(
        """
        **Tokenization** is the process of breaking down text into smaller units, like sentences or words.
        It is a critical first step in many NLP tasks.

        ### Types of Tokenization:
        1. **Sentence Tokenization**: Splitting text into sentences.
        2. **Word Tokenization**: Splitting text into individual words (tokens).

        **Example Input**: "I love NLP. It's amazing!"

        **Sentence Tokens**: ["I love NLP.", "It's amazing!"]

        **Word Tokens**: ["I", "love", "NLP", ".", "It", "'s", "amazing", "!"]
        """
    )
    if text_input.strip():
        st.subheader("Try Tokenization on Your Input Text")
        st.write("**Sentence Tokenization**:")
        sentences = sent_tokenize(text_input)
        st.write(sentences)
        st.write("**Word Tokenization**:")
        words = word_tokenize(text_input)
        st.write(words)
# Section 3: Stopwords
if selected_section == "Stopwords":
    st.header("🛑 Stopwords")
    st.write(
        """
        **Stopwords** are common words (e.g., "and", "is", "the") that add little meaning to text and are often removed in NLP tasks.
        Removing stopwords helps focus on the essential words in a text.

        For example:

        **Input**: "This is an example of stopwords removal."

        **Output**: ["example", "stopwords", "removal"]
        """
    )
    if text_input.strip():
        st.subheader("Remove Stopwords from Your Input Text")
        stop_words = set(stopwords.words("english"))
        words = word_tokenize(text_input)
        # Keep alphabetic tokens that are not stopwords (punctuation is dropped
        # as well, matching the example output above)
        filtered_words = [word for word in words if word.isalpha() and word.lower() not in stop_words]
        st.write("**Original Words**:", words)
        st.write("**Words after Stopwords Removal**:", filtered_words)
# Section 4: Lemmatization & Stemming
if selected_section == "Lemmatization & Stemming":
    st.header("🌱 Lemmatization and Stemming")
    st.write(
        """
        ### **Stemming**:
        Reduces words to their root form by chopping off prefixes/suffixes.

        **Example**: "running" → "run", "studies" → "studi"

        ### **Lemmatization**:
        Returns the base (dictionary) form of a word using vocabulary and context.

        **Example**: "running" → "run", "better" → "good" (when the word's part of speech is known)
        """
    )
    if text_input.strip():
        st.subheader("Apply Stemming and Lemmatization")
        words = word_tokenize(text_input)
        ps = PorterStemmer()
        stemmed_words = [ps.stem(word) for word in words]
        st.write("**Stemmed Words**:", stemmed_words)
        lemmatizer = WordNetLemmatizer()
        # Without a part-of-speech tag, WordNetLemmatizer treats every word as a
        # noun, so verb forms like "running" are left unchanged here
        lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
        st.write("**Lemmatized Words**:", lemmatized_words)
# Section 5: Bag of Words (BoW)
if selected_section == "Bag of Words (BoW)":
    st.header("📦 Bag of Words (BoW)")
    st.write(
        """
        **Bag of Words (BoW)** is a text representation technique that converts text into a vector of word frequencies.
        It ignores word order but considers the occurrence of words.

        ### Example:
        **Input Texts**:
        1. "I love NLP."
        2. "NLP is amazing!"

        **BoW Matrix**:

        |       | I | love | NLP | is | amazing |
        |-------|---|------|-----|----|---------|
        | Text1 | 1 | 1    | 1   | 0  | 0       |
        | Text2 | 0 | 0    | 1   | 1  | 1       |
        """
    )
    if text_input.strip():
        st.subheader("Generate BoW for Your Input Text")
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform([text_input])
        st.write("**BoW Matrix**:")
        st.write(X.toarray())
        st.write("**Feature Names (Words):**")
        st.write(vectorizer.get_feature_names_out())
# Section 6: TF-IDF
if selected_section == "TF-IDF":
    st.header("📊 TF-IDF (Term Frequency-Inverse Document Frequency)")
    st.write(
        """
        **TF-IDF** is a statistical measure that evaluates how important a word is to a document in a collection of documents.
        It balances the frequency of a word with its rarity across documents.

        ### Formula:
        - **Term Frequency (TF)**: How often a word appears in a document.
        - **Inverse Document Frequency (IDF)**: Log of the total number of documents divided by the number of documents containing the word.

        **Example**:
        - "NLP is amazing."
        - "I love NLP."

        TF-IDF assigns higher weights to rare but significant words.
        """
    )
    if text_input.strip():
        st.subheader("Generate TF-IDF for Your Input Text")
        tfidf_vectorizer = TfidfVectorizer()
        # Treat each sentence as a separate document: with only a single
        # document, every term gets the same IDF and the scores are uninformative
        documents = sent_tokenize(text_input)
        tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
        st.write("**TF-IDF Matrix**:")
        st.write(tfidf_matrix.toarray())
        st.write("**Feature Names (Words):**")
        st.write(tfidf_vectorizer.get_feature_names_out())
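# A standalone sketch of TfidfVectorizer on the section's two-document example:
# "nlp" appears in both documents, so its IDF (and hence its weight) is lower
# than that of words unique to one document, such as "amazing":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["NLP is amazing.", "I love NLP."]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # maps each word to its column index

# In the first document, the shared word "nlp" is weighted below the
# document-specific word "amazing"
print(tfidf[0][vocab["nlp"]], tfidf[0][vocab["amazing"]])
```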
# Footer
st.markdown("---")
st.markdown(
    """
    <center>
    <p style='font-size:14px;'>© 2024 NLP Basics App. All Rights Reserved.</p>
    </center>
    """,
    unsafe_allow_html=True,
)