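import nltk
import numpy as np
import streamlit as st
from collections import Counter
from nltk import pos_tag, word_tokenize
from sklearn_crfsuite import CRF

# word_tokenize and pos_tag need NLTK data; which resource names apply depends on the
# installed NLTK version (newer releases use the "_tab" / "_eng" variants)
for resource in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)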

# Page title
st.title('Statistical NLP')

# Language Models: n-grams and smoothing techniques
st.header('1. Language Models')

st.subheader('Definition:')
st.write("""
Language models are statistical models that assign probabilities to sequences of words.
They help us predict the next word in a sentence or assess the likelihood of a sentence.

- **n-grams**: An n-gram is a sequence of 'n' consecutive words. An n-gram language model predicts each word from the n-1 words that precede it.
- **Smoothing Techniques**: Smoothing techniques like Kneser-Ney reserve some probability mass for n-grams that were not seen in the training data, which prevents zero probabilities.
""")

# Interactive example for n-grams
st.subheader('n-gram Model Example:')
ngram_input = st.text_area("Enter a sentence to see n-grams", "I love programming in Python")

n = st.slider("Choose n for n-grams:", 1, 4, 2)

if st.button('Generate n-grams'):
    tokens = word_tokenize(ngram_input.lower())
    ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    st.write(f"{n}-grams in the input sentence: {ngrams}")
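
# Illustrative sketch, not part of the original app: the definition above mentions
# smoothing, so this shows the simplest variant, add-one (Laplace) smoothing, and
# not Kneser-Ney. The helper name `laplace_bigram_prob` is hypothetical.
from collections import Counter


def laplace_bigram_prob(corpus_tokens, prev_word, word):
    """Return P(word | prev_word) with add-one smoothing over the corpus vocabulary."""
    vocab_size = len(set(corpus_tokens))
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    # Count prev_word only where it starts a bigram (i.e. not at the final position)
    context_counts = Counter(corpus_tokens[:-1])
    # Adding 1 to every count keeps unseen bigrams from getting zero probability
    return (bigram_counts[(prev_word, word)] + 1) / (context_counts[prev_word] + vocab_size)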

# Perplexity
st.header('2. Perplexity')

st.subheader('Definition:')
st.write("""
Perplexity measures how well a probability model predicts a sample.
It’s often used to evaluate language models: a lower perplexity means the model assigns higher probability to the test text.
It’s calculated as the inverse probability of the test set, normalized by the number of words N (that is, raised to the power 1/N).
""")

# Interactive example for calculating Perplexity
st.subheader('Perplexity Calculation Example:')
sample_sentence = st.text_area("Enter sentence to calculate perplexity", "This is an example sentence")

if st.button('Calculate Perplexity'):
    # Toy approximation: a unigram model estimated from the input sentence itself,
    # so the result reflects word repetition, not the quality of a trained model
    words = sample_sentence.split()
    if not words:
        st.warning("Please enter at least one word.")
    else:
        word_freq = Counter(words)
        # Perplexity = exp of the average negative log-probability per word
        avg_neg_log_prob = -sum(np.log(word_freq[w] / len(words)) for w in words) / len(words)
        st.write(f"Perplexity of the sentence: {np.exp(avg_neg_log_prob):.2f}")

# Hidden Markov Models (HMM): Part-of-Speech (POS) tagging
st.header('3. Hidden Markov Models (HMM)')

st.subheader('Definition:')
st.write("""
Hidden Markov Models (HMMs) are statistical models that represent sequences of observations with underlying hidden states. 
In NLP, they are commonly used for tasks like Part-of-Speech (POS) tagging, where each word is assigned a grammatical category (e.g., noun, verb).
""")

# Interactive POS tagging example
# Note: nltk.pos_tag uses NLTK's pre-trained averaged perceptron tagger, not an HMM
st.subheader('POS Tagging Example (NLTK pre-trained tagger):')
sentence_to_tag = st.text_area("Enter sentence for POS tagging", "I love programming in Python")

if st.button('Tag POS'):
    tokens = word_tokenize(sentence_to_tag)
    tagged = pos_tag(tokens)
    st.write(f"POS Tagging result: {tagged}")
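
# Illustrative sketch, not part of the original app: `pos_tag` above relies on NLTK's
# pre-trained tagger, so this shows how an HMM would pick tags with the Viterbi
# algorithm. The function name `viterbi` and its dict-based parameters are hypothetical;
# it assumes a non-empty observation list.
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    def log(p):
        return math.log(p) if p > 0 else float('-inf')

    # best[t][s] = (log-prob of the best path ending in state s at time t, previous state)
    best = [{s: (log(start_p[s]) + log(emit_p[s].get(observations[0], 0)), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] + log(trans_p[p][s]) + log(emit_p[s].get(obs, 0)), p)
                for p in states
            )
        best.append(column)

    # Backtrack from the best final state to recover the full path
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]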

# Conditional Random Fields (CRF): Sequence labeling tasks like NER
st.header('4. Conditional Random Fields (CRF)')

st.subheader('Definition:')
st.write("""
Conditional Random Fields (CRFs) are discriminative models for sequence labeling tasks, where each element in the sequence is assigned a label.
They are particularly useful for tasks like Named Entity Recognition (NER), where we want to identify entities such as people, organizations, or locations in text.
""")

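# Sample data for Named Entity Recognition (NER) using CRF
st.subheader('NER Example using CRF:')
ner_sentence = st.text_area("Enter sentence for NER (Named Entity Recognition)", "Barack Obama was born in Hawaii.")

# Toy training data in BIO format: (tokens, tags) pairs
ner_examples = [
    (["Barack", "Obama", "was", "born", "in", "Hawaii"], ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]),
    (["Apple", "is", "based", "in", "California"], ["B-ORG", "O", "O", "O", "B-LOC"])
]


def token_features(tokens, i):
    """Build the feature dict for the token at position i (one dict per token)."""
    word = tokens[i]
    return {
        'bias': 1.0,
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'prev.lower': tokens[i - 1].lower() if i > 0 else '<BOS>',
        'next.lower': tokens[i + 1].lower() if i < len(tokens) - 1 else '<EOS>',
    }


def sentence_features(tokens):
    """Convert a token list into the list of feature dicts the CRF is trained on."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Train a CRF on the toy data; two sentences are far too few for reliable predictions
X_train = [sentence_features(tokens) for tokens, _ in ner_examples]
y_train = [tags for _, tags in ner_examples]

crf = CRF(algorithm='lbfgs')
crf.fit(X_train, y_train)

# Predict tags for the user's sentence
if st.button('Perform NER'):
    tokens = word_tokenize(ner_sentence)
    predicted_tags = crf.predict([sentence_features(tokens)])[0]
    entities = list(zip(tokens, predicted_tags))
    st.write(f"NER result: {entities}")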