EdysorEdutech committed
Commit 36a50be · verified · 1 Parent(s): 479bdbb

Create app.py

Files changed (1)
  1. app.py +1994 -0
app.py ADDED
@@ -0,0 +1,1994 @@
import gradio as gr
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, AutoTokenizer, AutoModelForSeq2SeqLM
from bs4 import BeautifulSoup, NavigableString, Tag
import re
import time
import random
import nltk
from nltk.tokenize import sent_tokenize

# Download required NLTK data
try:
    nltk.download('punkt', quiet=True)
except:
    pass

# Try to import spaCy but make it optional
try:
    import spacy
    SPACY_AVAILABLE = True
except:
    print("spaCy not available, using NLTK for sentence processing")
    SPACY_AVAILABLE = False

class HumanLikeVariations:
    """Add human-like variations and intentional imperfections"""

    def __init__(self):
        # Common human writing patterns - EXPANDED for Originality AI
        self.casual_transitions = [
            "So, ", "Well, ", "Now, ", "Actually, ", "Basically, ",
            "You know, ", "I mean, ", "Thing is, ", "Honestly, ",
            "Look, ", "Listen, ", "See, ", "Okay, ", "Right, ",
            "Anyway, ", "Besides, ", "Plus, ", "Also, ", "Oh, ",
            "Hey, ", "Alright, ", "Sure, ", "Fine, ", "Obviously, ",
            "Clearly, ", "Seriously, ", "Literally, ", "Frankly, ",
            "To be honest, ", "Truth is, ", "In fact, ", "Believe it or not, ",
            "Here's the thing, ", "Let me tell you, ", "Get this, ",
            "Funny thing is, ", "Interestingly, ", "Surprisingly, ",
            "Let's be real here, ", "Can we talk about ", "Quick question: ",
            "Real talk: ", "Hot take: ", "Unpopular opinion: ", "Fun fact: ",
            "Pro tip: ", "Side note: ", "Random thought: ", "Food for thought: ",
            "Just saying, ", "Not gonna lie, ", "For what it's worth, ",
            "If you ask me, ", "Between you and me, ", "Here's my take: ",
            "Let's face it, ", "No kidding, ", "Seriously though, ",
            "But wait, ", "Hold on, ", "Check this out: ", "Guess what? "
        ]

        self.filler_phrases = [
            "kind of", "sort of", "pretty much", "basically", "actually",
            "really", "just", "quite", "rather", "fairly", "totally",
            "definitely", "probably", "maybe", "perhaps", "somehow",
            "somewhat", "literally", "seriously", "honestly", "frankly",
            "simply", "merely", "purely", "truly", "genuinely",
            "absolutely", "completely", "entirely", "utterly", "practically",
            "virtually", "essentially", "fundamentally", "generally", "typically",
            "usually", "normally", "often", "sometimes", "occasionally",
            "apparently", "evidently", "obviously", "clearly", "seemingly",
            "arguably", "potentially", "possibly", "likely", "unlikely",
            "more or less", "give or take", "so to speak", "if you will",
            "per se", "as such", "in a way", "to some extent", "to a degree",
            "I kid you not", "no joke", "for real", "not gonna lie",
            "I'm telling you", "trust me", "believe me", "I swear",
            "hands down", "without a doubt", "100%", "straight up",
            "I think", "I feel like", "I guess", "I suppose", "seems like",
            "appears to be", "might be", "could be", "tends to", "tends to be",
            "in my experience", "from what I've seen", "as far as I know",
            "to the best of my knowledge", "if I'm not mistaken", "correct me if I'm wrong",
            "you know what", "here's the deal", "bottom line", "at any rate",
            "all in all", "when you think about it", "come to think of it",
            "now that I think about it", "if we're being honest", "to be fair"
        ]

        self.human_connectors = [
            ", which means", ", so", ", because", ", since", ", although",
            ". That's why", ". This means", ". So basically,", ". The thing is,",
            ", and honestly", ", but here's the thing", ", though", ", however",
            ". Plus,", ". Also,", ". Besides,", ". Moreover,", ". Furthermore,",
            ", which is why", ", and that's because", ", given that", ", considering",
            ". In other words,", ". Put simply,", ". To clarify,", ". That said,",
            ", you see", ", you know", ", right?", ", okay?", ", yeah?",
            ". Here's why:", ". Let me explain:", ". Think about it:",
            ", if you ask me", ", in my opinion", ", from my perspective",
            ". On the flip side,", ". On the other hand,", ". Conversely,",
            ", not to mention", ", let alone", ", much less", ", aside from",
            ". What's more,", ". Even better,", ". Even worse,", ". The catch is,",
            ", believe it or not", ", surprisingly enough", ", interestingly enough",
            ". Long story short,", ". Bottom line is,", ". Point being,",
            ", as you might expect", ", as it turns out", ", as luck would have it",
            ". And get this:", ". But wait, there's more:", ". Here's the kicker:",
            ", and here's why", ", and here's the thing", ", but here's what happened",
            ". Spoiler alert:", ". Plot twist:", ". Reality check:",
            ", at the end of the day", ", when all is said and done", ", all things considered",
            ". Make no mistake,", ". Don't get me wrong,", ". Let's not forget,",
            ", between you and me", ", off the record", ", just between us",
            ". And honestly?", ". But seriously,", ". And you know what?",
            ", which brings me to", ". This reminds me of", ", speaking of which",
            ". Funny enough,", ". Weird thing is,", ". Strange but true:",
            ", and I mean", ". I'm not kidding when I say", ", and trust me on this"
        ]

        # NEW: Common human typos and variations
        self.common_typos = {
            "the": ["teh", "th", "hte"],
            "and": ["adn", "nad", "an"],
            "that": ["taht", "htat", "tha"],
            "with": ["wiht", "wtih", "iwth"],
            "have": ["ahve", "hvae", "hav"],
            "from": ["form", "fro", "frmo"],
            "they": ["tehy", "thye", "htey"],
            "which": ["whihc", "wich", "whcih"],
            "their": ["thier", "theri", "tehir"],
            "would": ["woudl", "wuold", "woul"],
            "there": ["tehre", "theer", "ther"],
            "could": ["coudl", "cuold", "coud"],
            "people": ["poeple", "peopel", "pepole"],
            "through": ["thorugh", "throught", "trhough"],
            "because": ["becuase", "becasue", "beacuse"],
            "before": ["beofre", "befroe", "befor"],
            "different": ["differnt", "differnet", "diferent"],
            "between": ["bewteen", "betwen", "betewen"],
            "important": ["improtant", "importnat", "importan"],
            "information": ["infromation", "informaiton", "informaton"]
        }

        # NEW: Human-like sentence starters for variety
        self.varied_starters = [
            "When it comes to", "As for", "Regarding", "In terms of",
            "With respect to", "Concerning", "Speaking of", "About",
            "If we look at", "Looking at", "Considering", "Given",
            "Taking into account", "Bear in mind that", "Keep in mind",
            "It's worth noting that", "It should be noted that",
            "One thing to consider is", "An important point is",
            "What's interesting is", "What stands out is",
            "The key here is", "The main thing is", "The point is",
            "Here's what matters:", "Here's the deal:", "Here's something:",
            "Let's not forget", "We should remember", "Don't forget",
            "Think about it this way:", "Look at it like this:",
            "Consider this:", "Picture this:", "Imagine this:",
            "You might wonder", "You might ask", "You may think",
            "Some people say", "Many believe", "It's often said",
            "Research shows", "Studies indicate", "Evidence suggests",
            "Experience tells us", "History shows", "Time has shown"
        ]

    def add_human_touch(self, text):
        """Add subtle human-like imperfections - MORE AGGRESSIVE"""
        sentences = text.split('. ')
        modified_sentences = []

        for i, sent in enumerate(sentences):
            if not sent.strip():
                continue

            # Occasionally start with casual transition (25% chance - increased)
            if i > 0 and random.random() < 0.25 and len(sent.split()) > 5:
                transition = random.choice(self.casual_transitions)
                sent = transition + sent[0].lower() + sent[1:] if len(sent) > 1 else sent

            # Add filler words occasionally (20% chance - increased)
            if random.random() < 0.2 and len(sent.split()) > 8:
                words = sent.split()
                # Add multiple fillers sometimes
                num_fillers = random.randint(1, 2)
                for _ in range(num_fillers):
                    if len(words) > 4:
                        insert_pos = random.randint(2, len(words)-2)
                        filler = random.choice(self.filler_phrases)
                        words.insert(insert_pos, filler)
                sent = ' '.join(words)

            # Add varied sentence starters (15% chance)
            if i > 0 and random.random() < 0.15 and len(sent.split()) > 10:
                starter = random.choice(self.varied_starters)
                sent = starter + " " + sent[0].lower() + sent[1:] if len(sent) > 1 else sent

            # Occasionally use contractions (35% chance - increased)
            if random.random() < 0.35:
                sent = self.apply_contractions(sent)

            # Add occasional comma splices (10% chance) - common human error
            if random.random() < 0.1 and ',' in sent and len(sent.split()) > 10:
                # Replace a period with comma sometimes
                parts = sent.split(', ')
                if len(parts) > 2:
                    join_idx = random.randint(1, len(parts)-1)
                    parts[join_idx-1] = parts[join_idx-1] + ','
                    sent = ' '.join(parts)

            # NEW: Add parenthetical thoughts (8% chance)
            if random.random() < 0.08 and len(sent.split()) > 15:
                parentheticals = [
                    "(and that's saying something)",
                    "(which is pretty interesting)",
                    "(trust me on this one)",
                    "(I've seen this firsthand)",
                    "(no joke)",
                    "(seriously)",
                    "(and for good reason)",
                    "(believe it or not)",
                    "(surprisingly enough)",
                    "(which makes sense)",
                    "(go figure)",
                    "(who knew?)",
                    "(makes you think)",
                    "(worth considering)"
                ]
                words = sent.split()
                insert_pos = random.randint(len(words)//3, 2*len(words)//3)
                parenthetical = random.choice(parentheticals)
                words.insert(insert_pos, parenthetical)
                sent = ' '.join(words)

            # NEW: Occasionally add rhetorical questions (5% chance)
            if random.random() < 0.05 and i < len(sentences) - 1:
                rhetorical_questions = [
                    "Makes sense, right?",
                    "Pretty cool, huh?",
                    "Interesting, isn't it?",
                    "Who would've thought?",
                    "Sound familiar?",
                    "See what I mean?",
                    "Get the picture?",
                    "Following along?",
                    "Crazy, right?",
                    "Wild, isn't it?"
                ]
                sent = sent + " " + random.choice(rhetorical_questions)

            modified_sentences.append(sent)

        return '. '.join(modified_sentences)

    def apply_contractions(self, text):
        """Apply common contractions - EXPANDED"""
        contractions = {
            "it is": "it's", "that is": "that's", "there is": "there's",
            "he is": "he's", "she is": "she's", "what is": "what's",
            "where is": "where's", "who is": "who's", "how is": "how's",
            "cannot": "can't", "will not": "won't", "do not": "don't",
            "does not": "doesn't", "did not": "didn't", "could not": "couldn't",
            "should not": "shouldn't", "would not": "wouldn't", "is not": "isn't",
            "are not": "aren't", "was not": "wasn't", "were not": "weren't",
            "have not": "haven't", "has not": "hasn't", "had not": "hadn't",
            "I am": "I'm", "you are": "you're", "we are": "we're",
            "they are": "they're", "I have": "I've", "you have": "you've",
            "we have": "we've", "they have": "they've", "I will": "I'll",
            "you will": "you'll", "he will": "he'll", "she will": "she'll",
            "we will": "we'll", "they will": "they'll", "I would": "I'd",
            "you would": "you'd", "he would": "he'd", "she would": "she'd",
            "we would": "we'd", "they would": "they'd", "could have": "could've",
            "should have": "should've", "would have": "would've", "might have": "might've",
            "must have": "must've", "there has": "there's", "here is": "here's",
            "let us": "let's", "that will": "that'll", "who will": "who'll"
        }

        for full, contr in contractions.items():
            if random.random() < 0.8:  # 80% chance to apply each contraction
                text = re.sub(r'\b' + full + r'\b', contr, text, flags=re.IGNORECASE)

        return text
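
    # Added note (not in the original file): the substitution above is
    # case-insensitive but the replacement string is fixed-case, so a sketch like
    #   apply_contractions("It is done. I am sure it is.")
    # can come back as "it's done. I'm sure it's." - the lost leading capital is
    # what the later smart_fix()/fix_punctuation() passes re-capitalize.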

    def add_minor_errors(self, text):
        """Add very minor, human-like errors - MORE REALISTIC"""
        # Occasionally miss Oxford comma (15% chance)
        if random.random() < 0.15:
            text = re.sub(r'(\w+), (\w+), and', r'\1, \2 and', text)

        # Sometimes use 'which' instead of 'that' (8% chance)
        if random.random() < 0.08:
            text = text.replace(' that ', ' which ', 1)

        # NEW: Add very occasional typos (3% chance per sentence)
        sentences = text.split('. ')
        for i, sent in enumerate(sentences):
            if random.random() < 0.03 and len(sent.split()) > 10:
                words = sent.split()
                # Pick a random word to potentially typo
                word_idx = random.randint(0, len(words)-1)
                word = words[word_idx].lower()

                # Only typo common words
                if word in self.common_typos and random.random() < 0.5:
                    typo = random.choice(self.common_typos[word])
                    # Preserve original capitalization
                    if words[word_idx][0].isupper():
                        typo = typo[0].upper() + typo[1:]
                    words[word_idx] = typo
                    sentences[i] = ' '.join(words)

        text = '. '.join(sentences)

        # NEW: Occasionally double a word (2% chance)
        if random.random() < 0.02:
            words = text.split()
            if len(words) > 20:
                # Pick a small common word to double
                small_words = ['the', 'a', 'an', 'is', 'was', 'are', 'were', 'to', 'of', 'in', 'on']
                for idx, word in enumerate(words):
                    if word.lower() in small_words and random.random() < 0.1:
                        words[idx] = word + ' ' + word
                        break
                text = ' '.join(words)

        # NEW: Mix up common homophones occasionally (3% chance)
        if random.random() < 0.03:
            homophones = [
                ('their', 'there'), ('your', 'you\'re'), ('its', 'it\'s'),
                ('then', 'than'), ('to', 'too'), ('effect', 'affect')
            ]
            for pair in homophones:
                if pair[0] in text and random.random() < 0.3:
                    text = text.replace(pair[0], pair[1], 1)
                    break

        return text

    def add_originality_specific_patterns(self, text):
        """Add patterns that Originality AI associates with human writing"""
        # 1. Add personal touches and opinions
        if random.random() < 0.1:
            personal_phrases = [
                "In my view, ", "From my perspective, ", "I believe ",
                "It seems to me that ", "I've found that ", "In my experience, ",
                "I tend to think ", "My take is that ", "I'd argue that ",
                "Personally, I think ", "If you ask me, ", "The way I see it, "
            ]
            sentences = text.split('. ')
            if len(sentences) > 3:
                idx = random.randint(1, len(sentences)-2)
                sentences[idx] = random.choice(personal_phrases) + sentences[idx][0].lower() + sentences[idx][1:]
                text = '. '.join(sentences)

        # 2. Add conversational asides
        if random.random() < 0.08:
            asides = [
                " - and this is important - ",
                " - bear with me here - ",
                " - stay with me - ",
                " - and I mean this - ",
                " - no exaggeration - ",
                " - true story - ",
                " - I'm serious - ",
                " - think about it - ",
                " - and here's why - "
            ]
            words = text.split()
            if len(words) > 20:
                pos = random.randint(10, len(words)-10)
                words.insert(pos, random.choice(asides))
                text = ' '.join(words)

        # 3. Add emphatic repetition (human pattern)
        if random.random() < 0.05:
            emphatic_words = ['very', 'really', 'truly', 'absolutely', 'totally']
            sentences = text.split('. ')
            if sentences:
                sent_idx = random.randint(0, len(sentences)-1)
                words = sentences[sent_idx].split()
                if len(words) > 5:
                    # Find an adjective or adverb to emphasize
                    for i, word in enumerate(words):
                        if i > 0 and i < len(words)-1:
                            # Add emphasis
                            if random.random() < 0.3:
                                emphasis = random.choice(emphatic_words)
                                words.insert(i, emphasis)
                                # Sometimes repeat for extra emphasis
                                if random.random() < 0.3:
                                    words.insert(i, emphasis + ',')
                                break
                    sentences[sent_idx] = ' '.join(words)
                    text = '. '.join(sentences)

        return text
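
# Usage sketch (added for illustration; the output shown is hypothetical, since
# every pass above is driven by random.random()):
#   hv = HumanLikeVariations()
#   hv.add_human_touch("The program offers strong outcomes. Costs remain high.")
# might return "The program offers strong outcomes. Honestly, costs kind of
# remain high." Seed the random module first if reproducible runs are needed.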

class SelectiveGrammarFixer:
    """Minimal grammar fixes to maintain human-like quality while fixing critical errors"""

    def __init__(self):
        self.nlp = None
        self.human_variations = HumanLikeVariations()

    def fix_incomplete_sentences_only(self, text):
        """Fix only incomplete sentences without over-correcting"""
        if not text:
            return text

        sentences = text.split('. ')
        fixed_sentences = []

        for i, sent in enumerate(sentences):
            sent = sent.strip()
            if not sent:
                continue

            # Only fix if sentence is incomplete
            if sent and sent[-1] not in '.!?':
                # Check if it's the last sentence
                if i == len(sentences) - 1:
                    # Add period if it's clearly a statement
                    if not sent.endswith(':') and not sent.endswith(','):
                        sent += '.'
                else:
                    # Middle sentences should have periods
                    sent += '.'

            # Fix cut-off words (very short last word without punctuation)
            words = sent.split()
            if len(words) > 3:
                last_word = words[-1].rstrip('.!?')
                if len(last_word) <= 2 and last_word.isalpha():
                    # Check if it has vowels (real word vs cut-off)
                    if not any(c in 'aeiouAEIOU' for c in last_word):
                        # Likely a cut-off word, remove it
                        words = words[:-1]
                        sent = ' '.join(words)
                        if sent and sent[-1] not in '.!?':
                            sent += '.'

            # Ensure first letter capitalization ONLY after sentence endings
            if i > 0 and sent and sent[0].islower():
                # Check if previous sentence ended with punctuation
                if fixed_sentences and fixed_sentences[-1].rstrip().endswith(('.', '!', '?')):
                    sent = sent[0].upper() + sent[1:]
            elif i == 0 and sent and sent[0].islower():
                # First sentence should be capitalized
                sent = sent[0].upper() + sent[1:]

            fixed_sentences.append(sent)

        result = ' '.join(fixed_sentences)

        # Add human-like variations
        result = self.human_variations.add_human_touch(result)
        result = self.human_variations.add_minor_errors(result)
        result = self.human_variations.add_originality_specific_patterns(result)

        return result

    def fix_basic_punctuation_errors(self, text):
        """Fix only the most egregious punctuation errors"""
        if not text:
            return text

        # Fix double spaces (human-like error)
        text = re.sub(r'\s{2,}', ' ', text)

        # Fix space before punctuation (common error)
        text = re.sub(r'\s+([.,!?;:])', r'\1', text)

        # Fix missing space after punctuation (human-like)
        text = re.sub(r'([.,!?])([A-Z])', r'\1 \2', text)

        # Fix accidental double punctuation
        text = re.sub(r'([.!?])\1+', r'\1', text)

        # Fix "i" capitalization (common human error to fix)
        text = re.sub(r'\bi\b', 'I', text)

        return text

    def preserve_natural_variations(self, text):
        """Keep some natural human-like variations"""
        # Don't fix everything - leave some variety
        # Only fix if really broken
        if text.count('.') == 0 and len(text.split()) > 20:
            # Long text with no periods - needs fixing
            words = text.split()
            # Add periods every 15-25 words naturally (more variation)
            new_text = []
            for i, word in enumerate(words):
                new_text.append(word)
                if i > 0 and i % random.randint(12, 25) == 0:
                    if word[-1] not in '.!?,;:':
                        new_text[-1] = word + '.'
                        # Capitalize next word if it's not an acronym
                        if i + 1 < len(words) and words[i + 1][0].islower():
                            # Check if it's not likely an acronym
                            if not words[i + 1].isupper():
                                words[i + 1] = words[i + 1][0].upper() + words[i + 1][1:]
            text = ' '.join(new_text)

        return text

    def smart_fix(self, text):
        """Apply minimal fixes to maintain human-like quality"""
        # Apply fixes in order of importance
        text = self.fix_basic_punctuation_errors(text)
        text = self.fix_incomplete_sentences_only(text)
        text = self.preserve_natural_variations(text)

        return text
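
# Pipeline sketch (added comment): smart_fix runs punctuation repair, then
# sentence completion plus the HumanLikeVariations passes, then the long-run
# splitter, so for example
#   SelectiveGrammarFixer().smart_fix("this is a test sentence with no ending")
# will at minimum capitalize "this" and append a period before any random
# variation is layered on top.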

class EnhancedDipperHumanizer:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")

        # Clear GPU cache
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Initialize grammar fixer
        self.grammar_fixer = SelectiveGrammarFixer()

        # Try to load spaCy if available
        self.nlp = None
        self.use_spacy = False
        if SPACY_AVAILABLE:
            try:
                self.nlp = spacy.load("en_core_web_sm")
                self.use_spacy = True
                print("spaCy loaded successfully")
            except:
                print("spaCy model not found, using NLTK for sentence splitting")

        try:
            # Load Dipper paraphraser WITHOUT 8-bit quantization for better performance
            print("Loading Dipper paraphraser model...")
            self.tokenizer = T5Tokenizer.from_pretrained('google/t5-v1_1-xxl')
            self.model = T5ForConditionalGeneration.from_pretrained(
                "kalpeshk2011/dipper-paraphraser-xxl",
                device_map="auto",  # This will distribute across 4xL40S automatically
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True
            )
            print("Dipper model loaded successfully!")
            self.is_dipper = True

        except Exception as e:
            print(f"Error loading Dipper model: {str(e)}")
            print("Falling back to Flan-T5-XL...")
            self.is_dipper = False

            # Fallback to Flan-T5-XL
            try:
                self.model = T5ForConditionalGeneration.from_pretrained(
                    "google/flan-t5-xl",
                    torch_dtype=torch.float16,
                    low_cpu_mem_usage=True,
                    device_map="auto"
                )
                self.tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
                print("Loaded Flan-T5-XL as fallback")
            except:
                raise Exception("Could not load any model. Please check your system resources.")

        # Load BART as secondary model
        try:
            print("Loading BART model for additional variation...")
            self.bart_model = AutoModelForSeq2SeqLM.from_pretrained(
                "eugenesiow/bart-paraphrase",
                torch_dtype=torch.float16,
                device_map="auto"  # Distribute across GPUs
            )
            self.bart_tokenizer = AutoTokenizer.from_pretrained("eugenesiow/bart-paraphrase")
            self.use_bart = True
            print("BART model loaded successfully")
        except:
            print("BART model not available")
            self.use_bart = False
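
    # Added loading note: device_map="auto" relies on the accelerate package
    # and shards the roughly 11B-parameter Dipper checkpoint (a fine-tuned
    # T5-XXL) across the visible GPUs; if that fails, __init__ falls back to
    # google/flan-t5-xl, and BART is a best-effort extra that only toggles
    # self.use_bart off when unavailable.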

    def preserve_keywords(self, text, keywords):
        """Mark keywords to preserve them during paraphrasing"""
        if not keywords:
            return text, {}

        # Create a mapping of placeholders to keywords
        keyword_map = {}
        modified_text = text

        # Sort keywords by length (longest first) to avoid partial replacements
        sorted_keywords = sorted(keywords, key=len, reverse=True)

        for i, keyword in enumerate(sorted_keywords):
            # Use unique markers that won't be confused
            placeholder = f"__KW{i:03d}__"  # e.g., __KW001__

            # Find all occurrences of the keyword (case-insensitive)
            pattern = r'\b' + re.escape(keyword) + r'\b'
            matches = list(re.finditer(pattern, modified_text, flags=re.IGNORECASE))

            if matches:
                # Replace all occurrences with the placeholder
                for match in reversed(matches):  # Reverse to maintain positions
                    original_keyword = match.group(0)
                    start, end = match.span()
                    modified_text = modified_text[:start] + placeholder + modified_text[end:]
                    # Store the original case version
                    keyword_map[placeholder] = original_keyword

        return modified_text, keyword_map
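
    # Round-trip sketch (added comment; values are illustrative):
    #   preserve_keywords("Apply to Harvard University now", ["Harvard University"])
    # returns ("Apply to __KW000__ now", {"__KW000__": "Harvard University"}),
    # and restore_keywords_robust() below maps __KW000__ - or a mangled variant
    # the model emits, such as "__ KW000__" or bare "___" - back to the keyword.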

    def restore_keywords_robust(self, text, keyword_map):
        """Restore keywords with more flexible pattern matching"""
        if not keyword_map:
            return text

        restored_text = text

        # Debug: print what we're working with
        print(f"Restoring keywords in text: {restored_text[:100]}...")
        print(f"Keyword map: {keyword_map}")

        # First pass: Direct placeholder replacement
        for placeholder, keyword in keyword_map.items():
            if placeholder in restored_text:
                print(f"Found exact placeholder {placeholder}, replacing with {keyword}")
                restored_text = restored_text.replace(placeholder, keyword)

        # Second pass: Handle any mangled placeholders
        # The model might alter placeholders in various ways
        for placeholder, keyword in keyword_map.items():
            # Extract the number from placeholder
            match = re.search(r'__KW(\d+)__', placeholder)
            if match:
                num = match.group(1)

                # Various patterns the model might create
                patterns = [
                    f'__KW{num}__',
                    f'__ KW{num}__',
                    f'__KW {num}__',
                    f'__ KW {num} __',
                    f'_KW{num}_',
                    f'_kw{num}_',  # lowercase with single underscore
                    f'KW{num}',
                    f'KW {num}',
                    f'__kw{num}__',  # lowercase variant
                    f'__Kw{num}__',  # mixed case
                    f'__ kw{num}__',
                    f'__KW{num}_',  # missing underscore
                    f'_KW{num}__',  # missing underscore
                    f'kw{num}',  # just lowercase
                    f'___',  # Sometimes model reduces to just underscores
                    f'____',  # Various underscore patterns
                    f'_____',
                    f'__ __',
                    f'___ ___',
                ]

                for pattern in patterns:
                    if pattern in restored_text:
                        print(f"Found pattern '{pattern}', replacing with {keyword}")
                        restored_text = restored_text.replace(pattern, keyword)

        # Third pass: Use regex to catch any remaining variations
        # This catches cases where the model might have added characters
        for placeholder, keyword in keyword_map.items():
            match = re.search(r'__KW(\d+)__', placeholder)
            if match:
                num = match.group(1)
                # Regex to match various mangled versions including single underscore
                regex_patterns = [
                    rf'_+\s*[Kk][Ww]\s*{num}\s*_*',  # Any underscores, case insensitive
                    rf'[Kk][Ww]\s*{num}(?!\d)',  # KW followed by the number
                    rf'__?\s*[Kk][Ww]\s*{num}\s*__?',  # Optional underscores
                    rf'_[Kk][Ww]{num}_',  # Single underscore version
                    rf'_+\s*{num}\s*_*',  # Just the number with underscores
                    rf'__+',  # Multiple underscores (fallback)
                ]

                for pattern in regex_patterns:
                    matches = list(re.finditer(pattern, restored_text, flags=re.IGNORECASE))
                    if matches:
                        print(f"Found regex pattern '{pattern}' {len(matches)} times")
                        # Replace from end to beginning to maintain positions
                        for match in reversed(matches):
                            restored_text = restored_text[:match.start()] + keyword + restored_text[match.end():]

        # Fourth pass: Look for common patterns where model mangles placeholders
        # Sometimes the model turns __KW002__ into things like "___ University" or "___ College__"
        # (run per keyword; the original code relied on the loop variable leaking
        # out of the loop above, which only ever used the last keyword)
        for placeholder, keyword in keyword_map.items():
            underscore_patterns = [
                (r'___+\s*[Uu]niversity', keyword + ' University') if 'universit' in keyword.lower() else None,
                (r'___+\s*[Cc]ollege__?', keyword + ' College') if 'college' in keyword.lower() else None,
                (r'___+\s*[Ss]chool', keyword + ' School') if 'school' in keyword.lower() else None,
                (r'___+', keyword),  # Generic underscore replacement
            ]

            for pattern_tuple in underscore_patterns:
                if pattern_tuple:
                    pattern, replacement = pattern_tuple
                    if re.search(pattern, restored_text):
                        print(f"Found underscore pattern '{pattern}', replacing with {replacement}")
                        restored_text = re.sub(pattern, replacement, restored_text)

        # Final safety check: Look for any remaining placeholder-like patterns
        remaining_underscores = re.findall(r'_{2,}', restored_text)
        if remaining_underscores:
            print(f"Warning: Found remaining underscore patterns: {remaining_underscores}")
            # If we still have multiple underscores and we have keywords, do a simple replacement
            # This is aggressive but necessary when model completely mangles placeholders
            if '___' in restored_text and keyword_map:
                # Replace the first occurrence of multiple underscores with each keyword
                for placeholder, keyword in keyword_map.items():
                    if '___' in restored_text:
                        restored_text = restored_text.replace('___', keyword, 1)

        # Log final result
        print(f"Final restored text: {restored_text[:100]}...")

        return restored_text

    def should_skip_element(self, element, text):
        """Determine if an element should be skipped from paraphrasing"""
        if not text or len(text.strip()) < 3:
            return True

        # Skip JavaScript code inside script tags
        parent = element.parent
        if parent and parent.name in ['script', 'style', 'noscript']:
            return True

        # Skip headings (h1-h6)
        if parent and parent.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'title']:
            return True

        # Skip content inside <strong> and <b> tags
        if parent and parent.name in ['strong', 'b']:
            return True

        # Skip table content
        if parent and (parent.name in ['td', 'th'] or any(p.name == 'table' for p in parent.parents)):
            return True

        # Special handling for content inside tables
        # Skip if it's inside strong/b/h1-h6 tags AND also inside a table
        if parent:
            # Check if we're inside a table
            is_in_table = any(p.name == 'table' for p in parent.parents)
            if is_in_table:
                # If we're in a table, skip any text that's inside formatting tags
                if parent.name in ['strong', 'b', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'em', 'i']:
                    return True
                # Also check if parent's parent is a formatting tag
                if parent.parent and parent.parent.name in ['strong', 'b', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                    return True

        # Skip table of contents
        if parent:
            parent_text = str(parent).lower()
            if any(toc in parent_text for toc in ['table of contents', 'toc-', 'contents']):
                return True

        # Skip CTAs and buttons
        if parent and parent.name in ['button', 'a']:
            return True

        # Skip if parent has onclick or other event handlers
        if parent and parent.attrs:
            event_handlers = ['onclick', 'onchange', 'onsubmit', 'onload', 'onmouseover', 'onmouseout']
            if any(handler in parent.attrs for handler in event_handlers):
                return True

        # Special check for testimonial cards - check up to 3 levels of ancestors
        if parent:
            ancestors_to_check = []
            current = parent
            for _ in range(3):  # Check up to 3 levels up
                if current:
                    ancestors_to_check.append(current)
                    current = current.parent

            # Check if any ancestor has testimonial-card class
            for ancestor in ancestors_to_check:
                if ancestor and ancestor.get('class'):
                    classes = ancestor.get('class', [])
                    if isinstance(classes, list):
                        if any('testimonial-card' in str(cls) for cls in classes):
                            return True
                    elif isinstance(classes, str) and 'testimonial-card' in classes:
                        return True

        # Skip if IMMEDIATE parent or element itself has skip-worthy classes/IDs
        skip_indicators = [
            'cta-', 'button', 'btn', 'heading', 'title', 'caption',
            'toc-', 'contents', 'quiz', 'tip', 'note', 'alert',
            'warning', 'info', 'success', 'error', 'code', 'pre',
            'stats-grid', 'testimonial-card', 'highlight-box',
            'cta-box', 'quiz-container', 'news-box', 'contact-form',
            'faq-question', 'sidebar', 'widget', 'banner', 'news-section',
            'author-intro', 'testimonial', 'review', 'feedback',
            'floating-', 'stat-', 'progress-', 'option', 'results',
            'question-container', 'quiz-', 'faq-',
            'comparision-tables', 'process-flowcharts', 'infographics', 'cost-breakdown'
        ]

        # Check only immediate parent and grandparent (not all ancestors)
        elements_to_check = [parent]
        if parent and parent.parent:
            elements_to_check.append(parent.parent)

        for elem in elements_to_check:
            if not elem:
                continue

            # Check element's class
            elem_class = elem.get('class', [])
            if isinstance(elem_class, list):
                class_str = ' '.join(str(cls).lower() for cls in elem_class)
                if any(indicator in class_str for indicator in skip_indicators):
                    return True

            # Check element's ID
            elem_id = elem.get('id', '')
            if any(indicator in str(elem_id).lower() for indicator in skip_indicators):
                return True

        # Skip short phrases that might be UI elements
        word_count = len(text.split())
        if word_count <= 5:
            ui_patterns = [
                'click', 'download', 'learn more', 'read more', 'sign up',
                'get started', 'try now', 'buy now', 'next', 'previous',
                'back', 'continue', 'submit', 'cancel', 'get now', 'book your',
                'check out:', 'see also:', 'related:', 'question', 'of'
            ]
            if any(pattern in text.lower() for pattern in ui_patterns):
                return True

        # Skip very short content in styled containers
        if parent and parent.name in ['div', 'section', 'aside', 'blockquote']:
            style = parent.get('style', '')
            if 'border' in style or 'background' in style:
                if word_count <= 20:
                    # But don't skip if it's inside a paragraph
                    if not any(p.name == 'p' for p in parent.parents):
                        return True

        return False
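
    # Behavior sketch (added comment): for markup such as
    #   <table><tr><td><strong>Fee:</strong> $500</td></tr></table>
    # the "Fee:" text node is skipped on two independent grounds (its parent is
    # <strong>; an ancestor is <table>), while plain <p> body text falls through
    # every check, returns False, and is therefore eligible for paraphrasing.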

    def is_likely_acronym_or_proper_noun(self, word):
        """Check if a word is likely an acronym or part of a proper noun"""
        # Common acronyms and abbreviations
        acronyms = {'MBA', 'CEO', 'USA', 'UK', 'GMAT', 'GRE', 'SAT', 'ACT', 'PhD', 'MD', 'IT', 'AI', 'ML'}

        # Check if it's in our acronym list
        if word.upper() in acronyms:
            return True

        # Check if it's all caps (likely acronym)
        if word.isupper() and len(word) > 1:
            return True

        # Check if it follows patterns like "Edition", "Focus", etc. that often come after proper nouns
        proper_noun_continuations = {
            'Edition', 'Version', 'Series', 'Focus', 'System', 'Method', 'School',
            'University', 'College', 'Institute', 'Academy', 'Center', 'Centre'
        }

        if word in proper_noun_continuations:
            return True

        return False

    def clean_model_output_enhanced(self, text):
        """Enhanced cleaning that preserves more natural structure"""
        if not text:
            return ""

        # Store original for fallback
        original = text

        # Remove ONLY clear model artifacts
        text = re.sub(r'^lexical\s*=\s*\d+\s*,\s*order\s*=\s*\d+\s*', '', text, flags=re.IGNORECASE)
        text = re.sub(r'<sent>\s*', '', text, flags=re.IGNORECASE)
        text = re.sub(r'\s*</sent>', '', text, flags=re.IGNORECASE)

        # Only remove clear prefixes
        if text.lower().startswith('paraphrase:'):
            text = text[11:].strip()
        elif text.lower().startswith('rewrite:'):
            text = text[8:].strip()

        # Remove leading non-letter characters carefully
        # IMPORTANT: Preserve keyword placeholders
        if not re.match(r'^__KW\d+__', text):
            # Only remove if it doesn't start with a placeholder
            text = re.sub(r'^[^a-zA-Z_]+', '', text)

        # If we accidentally removed too much, use original
        if len(text) < len(original) * 0.5:
            text = original

        return text.strip()

    def paraphrase_with_dipper(self, text, lex_diversity=60, order_diversity=20, keywords=None):
        """Paraphrase text using Dipper model with sentence-level processing"""
        if not text or len(text.strip()) < 3:
            return text

        # Preserve keywords
        text_with_placeholders, keyword_map = self.preserve_keywords(text, keywords)

        # Add debug logging
        if keyword_map:
            print(f"Debug: Created keyword map: {keyword_map}")
            print(f"Debug: Text with placeholders: {text_with_placeholders[:100]}...")

        # Split into sentences for better control
        sentences = self.split_into_sentences_advanced(text_with_placeholders)
        paraphrased_sentences = []

        for sentence in sentences:
            if len(sentence.strip()) < 3:
                paraphrased_sentences.append(sentence)
                continue

            try:
                # Adjust diversity based on presence of keywords
                has_keywords = any(placeholder in sentence for placeholder in keyword_map.keys())
                if has_keywords:
                    # Use MODERATE diversity when keywords are present to avoid mangling
                    lex_diversity = 40  # Reduced from 70
                    order_diversity = 10  # Reduced from 20
                elif len(sentence.split()) < 10:
                    lex_diversity = 70  # Reduced from 80
                    order_diversity = 25  # Reduced from 30
                else:
                    lex_diversity = 85  # Slightly reduced from 90
                    order_diversity = 35  # Slightly reduced from 40

                lex_code = int(100 - lex_diversity)
                order_code = int(100 - order_diversity)

                # Format input for Dipper
                if self.is_dipper:
                    input_text = f"lexical = {lex_code}, order = {order_code} <sent> {sentence} </sent>"
                else:
                    input_text = f"paraphrase: {sentence}"

                # Tokenize
                inputs = self.tokenizer(
                    input_text,
                    return_tensors="pt",
                    max_length=512,
                    truncation=True,
                    padding=True
                )

                # Move to device
                if hasattr(self.model, 'device_map') and self.model.device_map:
                    device = next(iter(self.model.device_map.values()))
                    inputs = {k: v.to(device) for k, v in inputs.items()}
                else:
                    inputs = {k: v.to(self.device) for k, v in inputs.items()}

                # Generate with appropriate variation based on keywords
                original_length = len(sentence.split())
                max_new_length = int(original_length * 1.3)  # Reduced from 1.4

                # Adjust temperature based on keywords
                temp = 0.9 if has_keywords else 1.1  # Lower temp for keywords
                top_p_val = 0.95 if has_keywords else 0.9

                with torch.no_grad():
                    outputs = self.model.generate(
                        **inputs,
                        max_length=max_new_length + 20,
                        min_length=max(5, int(original_length * 0.7)),
                        do_sample=True,
                        top_p=top_p_val,
                        temperature=temp,
                        no_repeat_ngram_size=3,
                        num_beams=3 if has_keywords else 2,  # More beams for stability with keywords
                        early_stopping=True
                    )

                # Decode
                paraphrased = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

                # Clean model artifacts
                paraphrased = self.clean_model_output_enhanced(paraphrased)

                # Fix incomplete sentences
                paraphrased = self.fix_incomplete_sentence_smart(paraphrased, sentence)

                # Ensure reasonable length
                if len(paraphrased.split()) > max_new_length:
                    paraphrased = ' '.join(paraphrased.split()[:max_new_length])

                paraphrased_sentences.append(paraphrased)

            except Exception as e:
                print(f"Error paraphrasing sentence: {str(e)}")
                paraphrased_sentences.append(sentence)

        # Join sentences back
        result = ' '.join(paraphrased_sentences)

        # Debug before restoration
        if keyword_map:
            print(f"Debug: Result before restoration: {result[:100]}...")
            print(f"Debug: Checking for placeholders...")
            for placeholder in keyword_map.keys():
                if placeholder in result:
                    print(f"Debug: Found placeholder {placeholder} in result")
                else:
                    # Check for mangled versions
                    if '___' in result:
                        print(f"Debug: Found underscores ___ instead of {placeholder}")

        # Restore keywords AFTER joining all sentences
        result = self.restore_keywords_robust(result, keyword_map)

        # Debug after restoration
        if keyword_map:
            print(f"Debug: Result after restoration: {result[:100]}...")

        # Apply minimal grammar fixes with human variations
        result = self.grammar_fixer.smart_fix(result)

        return result
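
    # Control-code note (added comment): the conditioning string inverts the
    # diversity knobs via lex_code = int(100 - lex_diversity), so the long,
    # keyword-free default above (lex 85 / order 35) produces a prompt like
    #   "lexical = 15, order = 65 <sent> original sentence here </sent>"
    # where lower codes request more aggressive rewriting from Dipper.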

    def fix_incomplete_sentence_smart(self, generated, original):
        """Smarter sentence completion that maintains natural flow"""
        if not generated or not generated.strip():
            return original

        generated = generated.strip()

        # Check if the sentence seems complete semantically
        words = generated.split()
        if len(words) >= 3:
            # Check if last word is a good ending word
            last_word = words[-1].lower().rstrip('.,!?;:')

            # Common ending words that might not need punctuation fix
            ending_words = {
                'too', 'also', 'well', 'though', 'however',
                'furthermore', 'moreover', 'indeed', 'anyway',
                'regardless', 'nonetheless', 'therefore', 'thus'
            }

            # If it ends with a good word, just add appropriate punctuation
            if last_word in ending_words:
                if generated[-1] not in '.!?':
                    generated += '.'
                return generated

        # Check for cut-off patterns
        if len(words) > 0:
            last_word = words[-1]

            # Remove if it's clearly cut off (1-2 chars, no vowels)
            # But don't remove valid short words like "is", "of", "to", etc.
            short_valid_words = {'is', 'of', 'to', 'in', 'on', 'at', 'by', 'or', 'if', 'so', 'up', 'no', 'we', 'he', 'me', 'be', 'do', 'go'}
            if (len(last_word) <= 2 and
                    last_word.lower() not in short_valid_words and
                    not any(c in 'aeiouAEIOU' for c in last_word)):
                words = words[:-1]
                generated = ' '.join(words)

        # Add ending punctuation based on context
        if generated and generated[-1] not in '.!?:,;':
            # Check original ending
            orig_stripped = original.strip()
            if orig_stripped.endswith('?'):
                # Check if generated seems like a question
                question_words = ['what', 'why', 'how', 'when', 'where', 'who', 'which', 'is', 'are', 'do', 'does', 'can', 'could', 'would', 'should']
                first_word = generated.split()[0].lower() if generated.split() else ''
                if first_word in question_words:
                    generated += '?'
                else:
                    generated += '.'
            elif orig_stripped.endswith('!'):
                # Check if generated seems exclamatory
                exclaim_words = ['amazing', 'incredible', 'fantastic', 'terrible', 'awful', 'wonderful', 'excellent']
                if any(word in generated.lower() for word in exclaim_words):
                    generated += '!'
                else:
                    generated += '.'
            elif orig_stripped.endswith(':'):
                generated += ':'
            else:
                generated += '.'

        # Ensure first letter is capitalized ONLY if it's sentence start
        # Don't capitalize words like "iPhone" or "eBay" or placeholders
        if generated and generated[0].islower() and not self.is_likely_acronym_or_proper_noun(generated.split()[0]) and not generated.startswith('__KW'):
            generated = generated[0].upper() + generated[1:]

        return generated

    def split_into_sentences_advanced(self, text):
        """Advanced sentence splitting using spaCy or NLTK"""
        if self.use_spacy and self.nlp:
            doc = self.nlp(text)
            sentences = [sent.text.strip() for sent in doc.sents]
        else:
            # Fallback to NLTK
            try:
                sentences = sent_tokenize(text)
            except:
                # Final fallback to regex
                sentences = re.split(r'(?<=[.!?])\s+', text)

        # Clean up sentences
        return [s for s in sentences if s and len(s.strip()) > 0]

    def paraphrase_with_bart(self, text, keywords=None):
        """Additional paraphrasing with BART for more variation"""
        if not self.use_bart or not text or len(text.strip()) < 3:
            return text

        try:
            # Preserve keywords
            text_with_placeholders, keyword_map = self.preserve_keywords(text, keywords)

            # Process in smaller chunks for BART
            sentences = self.split_into_sentences_advanced(text_with_placeholders)
            paraphrased_sentences = []

            for sentence in sentences:
                if len(sentence.split()) < 5:
                    paraphrased_sentences.append(sentence)
                    continue

                inputs = self.bart_tokenizer(
                    sentence,
                    return_tensors='pt',
                    max_length=128,
                    truncation=True
                )

                # Move to appropriate device
                if hasattr(self.bart_model, 'device_map') and self.bart_model.device_map:
                    device = next(iter(self.bart_model.device_map.values()))
                    inputs = {k: v.to(device) for k, v in inputs.items()}
                else:
                    inputs = {k: v.to(self.device) for k, v in inputs.items()}

                original_length = len(sentence.split())

                with torch.no_grad():
                    outputs = self.bart_model.generate(
                        **inputs,
                        max_length=int(original_length * 1.4) + 10,
                        min_length=max(5, int(original_length * 0.6)),
                        num_beams=2,
                        temperature=1.1,  # Higher temperature
                        do_sample=True,
                        top_p=0.9,
                        early_stopping=True
                    )

                paraphrased = self.bart_tokenizer.decode(outputs[0], skip_special_tokens=True)

                # Fix incomplete sentences
                paraphrased = self.fix_incomplete_sentence_smart(paraphrased, sentence)

                paraphrased_sentences.append(paraphrased)

            result = ' '.join(paraphrased_sentences)

            # Restore keywords AFTER joining all sentences
            result = self.restore_keywords_robust(result, keyword_map)

            # Apply minimal grammar fixes
            result = self.grammar_fixer.smart_fix(result)

            return result

        except Exception as e:
            print(f"Error in BART paraphrasing: {str(e)}")
            return text

    def apply_sentence_variation(self, text):
        """Apply natural sentence structure variations - MORE AGGRESSIVE"""
        sentences = self.split_into_sentences_advanced(text)
        varied_sentences = []

        for i, sentence in enumerate(sentences):
            # Skip empty sentences
            if not sentence.strip():
                continue

            # MORE aggressive variations
            # Combine short sentences more often (50% chance)
            if (i < len(sentences) - 1 and
                    len(sentence.split()) < 15 and
                    len(sentences[i+1].split()) < 15 and
                    random.random() < 0.5):

                connectors = [', and', ', but', '; however,', '. Also,', '. Plus,', ', so', ', which means',
                              ' - and', ' - but', '; meanwhile,', '. That said,', ', yet', ' - though']
                connector = random.choice(connectors)

                # Handle the next sentence properly
                next_sent = sentences[i+1].strip()
                if next_sent:
                    combined = f"{sentence.rstrip('.')}{connector} {next_sent[0].lower()}{next_sent[1:]}"
                    varied_sentences.append(combined)
                    sentences[i+1] = ""  # Mark as processed

            elif sentence:  # Only process non-empty sentences
                # Split very long sentences more aggressively
                if len(sentence.split()) > 18 and ',' in sentence:
                    parts = sentence.split(', ', 1)
                    if len(parts) == 2 and len(parts[1].split()) > 6:
                        # 70% chance to split
                        if random.random() < 0.7:
                            varied_sentences.append(parts[0] + '.')
                            # Ensure second part starts with capital
                            if parts[1]:
                                varied_sentences.append(parts[1][0].upper() + parts[1][1:])
                        else:
                            varied_sentences.append(sentence)
                    else:
                        varied_sentences.append(sentence)
                else:
                    # Add natural variations more often (35% chance)
                    if i > 0 and random.random() < 0.35:
                        # Sometimes add a transition
                        transitions = ['Furthermore, ', 'Additionally, ', 'Moreover, ', 'Also, ',
                                       'Besides, ', 'What\'s more, ', 'In addition, ', 'Not only that, ',
                                       'To add to that, ', 'On top of that, ', 'Beyond that, ']
                        transition = random.choice(transitions)
                        if sentence[0].isupper():
                            sentence = transition + sentence[0].lower() + sentence[1:]

                    # Add mid-sentence interruptions (10% chance)
                    if random.random() < 0.1 and len(sentence.split()) > 12:
                        interruptions = [
                            " - and this is crucial - ",
                            " - believe me - ",
                            " - no kidding - ",
                            " (and yes, I mean it) ",
                            " - stay with me here - ",
                            " - and I'm not exaggerating - "
                        ]
                        words = sentence.split()
                        pos = random.randint(len(words)//3, 2*len(words)//3)
                        words.insert(pos, random.choice(interruptions))
                        sentence = ' '.join(words)

                    varied_sentences.append(sentence)

        # Post-process for additional human patterns
        result = ' '.join([s for s in varied_sentences if s])

        # Add occasional fragments for human touch (5% chance)
        if random.random() < 0.05:
            fragments = [
                "Crazy, I know.",
                "Wild stuff.",
                "Makes you think.",
                "Pretty interesting.",
                "Go figure.",
                "Who knew?",
                "There you have it.",
                "Food for thought.",
                "Just saying.",
                "Worth considering."
            ]
            sentences = result.split('. ')
            if len(sentences) > 3:
                insert_pos = random.randint(1, len(sentences)-1)
                sentences.insert(insert_pos, random.choice(fragments))
                result = '. '.join(sentences)

        return result

    def fix_punctuation(self, text):
        """Comprehensive punctuation and formatting fixes"""
        if not text:
            return ""

        # First, clean any remaining model artifacts
        text = self.clean_model_output_enhanced(text)

        # Fix weird symbols and characters using safe replacements
        text = text.replace('<>', '')  # Remove empty angle brackets

        # Normalize quotes - use replace instead of regex for problematic characters
        text = text.replace('«', '"').replace('»', '"')
        text = text.replace('„', '"').replace('“', '"').replace('”', '"')
        text = text.replace('‘', "'").replace('’', "'")
        text = text.replace('–', '-').replace('—', '-')

        # Fix colon issues
        text = re.sub(r'\.:', ':', text)  # Remove period before colon
        text = re.sub(r':\s*\.', ':', text)  # Remove period after colon

        # Fix basic spacing
        text = re.sub(r'\s+', ' ', text)  # Multiple spaces to single
        text = re.sub(r'\s+([.,!?;:])', r'\1', text)  # Remove space before punctuation
        text = re.sub(r'([.,!?;:])\s*([.,!?;:])', r'\1', text)  # Remove double punctuation
        text = re.sub(r'([.!?])\s*\1+', r'\1', text)  # Remove repeated punctuation

        # Fix colons
        text = re.sub(r':\s*([.,!?])', ':', text)  # Remove punctuation after colon
        text = re.sub(r'([.,!?])\s*:', ':', text)  # Remove punctuation before colon
        text = re.sub(r':+', ':', text)  # Multiple colons to one

        # Fix quotes and parentheses
        text = re.sub(r'"\s*([^"]*?)\s*"', r'"\1"', text)
        text = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", text)
        text = re.sub(r'\(\s*([^)]*?)\s*\)', r'(\1)', text)

        # Fix sentence capitalization more carefully
        # Split on ACTUAL sentence endings only
        sentences = re.split(r'(?<=[.!?])\s+', text)
        fixed_sentences = []

        for i, sentence in enumerate(sentences):
            if not sentence:
                continue

            # Only capitalize the first letter if it's actually lowercase
            # and not part of a special case (like iPhone, eBay, etc.)
            words = sentence.split()
            if words:
                first_word = words[0]
                # Check if it's not an acronym or proper noun that should stay lowercase
                if (first_word[0].islower() and
                        not self.is_likely_acronym_or_proper_noun(first_word) and
                        not first_word.startswith('__KW') and
                        not first_word.startswith('_kw')):
                    # Only capitalize if it's a regular word
                    sentence = first_word[0].upper() + first_word[1:] + ' ' + ' '.join(words[1:])

            fixed_sentences.append(sentence)

        text = ' '.join(fixed_sentences)

        # Fix common issues
        text = re.sub(r'\bi\b', 'I', text)  # Capitalize 'I'
        text = re.sub(r'\.{2,}', '.', text)  # Multiple periods to one
        text = re.sub(r',{2,}', ',', text)  # Multiple commas to one
        text = re.sub(r'\s*,\s*,\s*', ', ', text)  # Double commas with spaces

        # Remove weird artifacts
        text = re.sub(r'\b(CHAPTER\s+[IVX]+|SECTION\s+\d+)\b[^\w]*', '', text, flags=re.IGNORECASE)

        # Fix abbreviations
        text = re.sub(r'\betc\s*\.\s*\.', 'etc.', text)
        text = re.sub(r'\be\.g\s*\.\s*[,\s]', 'e.g., ', text)
        text = re.sub(r'\bi\.e\s*\.\s*[,\s]', 'i.e., ', text)

        # Fix numbers with periods (like "1. " at start of lists)
        text = re.sub(r'(\d+)\.\s+', r'\1. ', text)

        # Fix bold/strong tags punctuation
        text = self.fix_bold_punctuation(text)

        # Clean up any remaining issues
        text = re.sub(r'\s+([.,!?;:])', r'\1', text)  # Final space cleanup
        text = re.sub(r'([.,!?;:])\s{2,}', r'\1 ', text)  # Fix multiple spaces after punctuation

        # Ensure ending punctuation
        text = text.strip()
        if text and text[-1] not in '.!?':
            # Don't add period if it ends with colon (likely a list header)
            if not text.endswith(':'):
                text += '.'

        return text
1360
+
1361
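+     # Illustrative check of the spacing rules in fix_punctuation() above
+     # (standalone regex, independent of the model-cleanup helpers):
+     #   re.sub(r'\s+([.,!?;:])', r'\1', 'well , that is odd !')
+     #   -> 'well, that is odd!'
+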
+     def fix_bold_punctuation(self, text):
+         """Fix punctuation issues around bold/strong tags"""
+         # Check if this is likely a list item with a colon pattern
+         def is_list_item_with_colon(text):
+             # Pattern: starts with or contains <strong>Text:</strong> or <b>Text:</b>
+             list_pattern = r'^\s*(?:[-•*▪▫◦‣⁃]\s*)?<(?:strong|b)>[^<]+:</(?:strong|b)>'
+             return bool(re.search(list_pattern, text))
+
+         # If it's a list item with a colon, preserve the format
+         if is_list_item_with_colon(text):
+             # Just clean up spacing but keep the colon inside the bold tag
+             text = re.sub(r'<(strong|b)>\s*([^:]+)\s*:\s*</\1>', r'<\1>\2:</\1>', text)
+             return text
+
+         # Pattern to find bold/strong content
+         bold_pattern = r'<(strong|b)>(.*?)</\1>'
+
+         def fix_bold_match(match):
+             tag = match.group(1)
+             content = match.group(2).strip()
+
+             if not content:
+                 return f'<{tag}></{tag}>'
+
+             # Preserve list headers that end with a colon
+             if content.endswith(':'):
+                 return f'<{tag}>{content}</{tag}>'
+
+             # Remove any periods at the start or end of bold content
+             content = content.strip('.')
+
+             # Check if this bold text is at the start of a sentence
+             # (preceded by nothing, or by '. ', '! ', '? ')
+             start_pos = match.start()
+             is_sentence_start = (start_pos == 0 or
+                                  (start_pos > 2 and text[start_pos-2:start_pos] in ['. ', '! ', '? ', '\n\n']))
+
+             # Capitalize the first letter at a sentence start
+             if is_sentence_start and content and content[0].isalpha():
+                 content = content[0].upper() + content[1:]
+
+             return f'<{tag}>{content}</{tag}>'
+
+         # Fix bold/strong tags
+         text = re.sub(bold_pattern, fix_bold_match, text)
+
+         # Fix spacing around bold/strong tags (but not for list items)
+         if not is_list_item_with_colon(text):
+             text = re.sub(r'\.\s*<(strong|b)>', r'. <\1>', text)        # Period before bold
+             text = re.sub(r'</(strong|b)>\s*\.', r'</\1>.', text)       # Period after bold
+             text = re.sub(r'([.!?])\s*<(strong|b)>', r'\1 <\2>', text)  # Space after sentence end
+             text = re.sub(r'</(strong|b)>\s+([a-z])', lambda m: f'</{m.group(1)}> {m.group(2)}', text)  # Keep lowercase after bold mid-sentence
+
+         # Remove duplicate periods around bold tags
+         text = re.sub(r'\.\s*</(strong|b)>\s*\.', r'</\1>.', text)
+         text = re.sub(r'\.\s*<(strong|b)>\s*\.', r'. <\1>', text)
+
+         # If bold is directly followed by a new sentence (capital letter), add a period
+         text = re.sub(r'</(strong|b)>\s+([A-Z])', r'</\1>. \2', text)
+
+         # Don't remove these for list items
+         if not is_list_item_with_colon(text):
+             text = re.sub(r'<(strong|b)>\s*:\s*</\1>', ':', text)   # Remove empty bold colons
+             text = re.sub(r'<(strong|b)>\s*\.\s*</\1>', '.', text)  # Remove empty bold periods
+
+         return text
+
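+     # Illustrative: the list-label guard above is meant to leave
+     # '<strong>Pros:</strong> cheap and fast' untouched, while
+     # '<strong>pros.</strong> and more' becomes
+     # '<strong>Pros</strong> and more' (period stripped, sentence start capitalized).
+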
+     def extract_text_from_html(self, html_content):
+         """Extract text elements from HTML with skip logic"""
+         soup = BeautifulSoup(html_content, 'html.parser')
+         text_elements = []
+
+         # Get all text nodes using string= instead of text= (fixes the deprecation)
+         for element in soup.find_all(string=True):
+             # Skip script, style, and noscript content completely
+             if element.parent.name in ['script', 'style', 'noscript']:
+                 continue
+
+             text = element.strip()
+             if text and not self.should_skip_element(element, text):
+                 text_elements.append({
+                     'text': text,
+                     'element': element
+                 })
+
+         return soup, text_elements
+
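+     # Usage sketch (hypothetical input): for '<p>Hello wonderful world</p>'
+     # plus a <script> block, only the paragraph text survives, assuming
+     # should_skip_element() lets plain paragraph text through; the script
+     # body is filtered out before should_skip_element() ever sees it.
+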
+     def validate_and_fix_html(self, html_text):
+         """Fix common HTML syntax errors after processing"""
+
+         # Fix DOCTYPE
+         html_text = re.sub(r'<!\s*DOCTYPE', '<!DOCTYPE', html_text, flags=re.IGNORECASE)
+
+         # Fix spacing issues. Collapse whitespace runs between tags to a
+         # single space rather than deleting them outright; deleting would
+         # also eat the legitimate space between inline tags such as
+         # '</strong> <em>'.
+         html_text = re.sub(r'>\s{2,}<', '> <', html_text)
+         html_text = re.sub(r'\s+>', '>', html_text)  # Remove spaces before closing >
+         html_text = re.sub(r'<\s+', '<', html_text)  # Remove spaces after opening <
+
+         # Fix common word errors that can occur during processing
+         html_text = html_text.replace('down loaded', 'downloaded')
+         html_text = html_text.replace('But your document', 'Your document')
+
+         return html_text
+
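+     # Example of the DOCTYPE repair above: '<! DOCTYPE html>' -> '<!DOCTYPE html>'.
+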
+     def wrap_keywords_in_paragraphs(self, soup, keywords):
+         """Wrap keywords with <strong> tags inside <p> tags only"""
+         if not keywords:
+             return
+
+         # Find all paragraph tags
+         for p_tag in soup.find_all('p'):
+             # Skip paragraphs nested inside special elements (author intros,
+             # CTA boxes, testimonials, news boxes, quizzes, buttons, links,
+             # headings, and so on), checked against each ancestor below
+             should_skip = False
+             for parent in p_tag.parents:
+                 # Check by class
+                 if parent.name == 'div' and parent.get('class'):
+                     classes = parent.get('class', [])
+                     if isinstance(classes, list):
+                         class_str = ' '.join(str(cls) for cls in classes)
+                     else:
+                         class_str = str(classes)
+
+                     if any(skip_class in class_str for skip_class in
+                            ['author-intro', 'cta-box', 'testimonial-card', 'news-box',
+                             'quiz-container', 'question-container', 'results', 'stats-grid',
+                             'toc-', 'comparison-tables']):
+                         should_skip = True
+                         break
+
+                 # Check by tag name
+                 if parent.name in ['button', 'a', 'blockquote', 'details', 'summary']:
+                     should_skip = True
+                     break
+
+             if should_skip:
+                 continue
+
+             # Additional check: skip if the paragraph itself has special classes
+             p_classes = p_tag.get('class', [])
+             if isinstance(p_classes, list):
+                 p_class_str = ' '.join(str(cls) for cls in p_classes)
+             else:
+                 p_class_str = str(p_classes)
+
+             if any(skip_class in p_class_str for skip_class in ['testimonial-card', 'quiz-', 'stat-']):
+                 continue
+
+             # Process only regular content paragraphs:
+             # get all text nodes in this paragraph
+             for text_node in p_tag.find_all(string=True):
+                 # Skip if already inside a strong/b/em/i/span/a tag
+                 if text_node.parent.name in ['strong', 'b', 'em', 'i', 'span', 'a']:
+                     continue
+
+                 # Skip if the text node's immediate parent isn't the p tag
+                 # (avoids touching nested elements)
+                 if text_node.parent != p_tag:
+                     continue
+
+                 original_text = str(text_node)
+
+                 # Skip very short text nodes
+                 if len(original_text.strip()) < 20:
+                     continue
+
+                 modified_text = original_text
+
+                 # Check each keyword
+                 for keyword in keywords:
+                     # Use word boundaries for accurate matching
+                     pattern = r'\b' + re.escape(keyword) + r'\b'
+
+                     # Find all matches (case-insensitive)
+                     matches = list(re.finditer(pattern, modified_text, flags=re.IGNORECASE))
+
+                     # Replace from end to beginning to maintain positions
+                     for match in reversed(matches):
+                         start, end = match.span()
+                         matched_text = match.group(0)
+                         # Wrap with strong tag
+                         modified_text = (modified_text[:start] +
+                                          f'<strong>{matched_text}</strong>' +
+                                          modified_text[end:])
+
+                 # If the text changed, swap in the new nodes
+                 if modified_text != original_text:
+                     # Parse the modified text to create new nodes
+                     new_soup = BeautifulSoup(modified_text, 'html.parser')
+                     # Materialize the contents first: insert_after() detaches
+                     # each node from new_soup, so iterating the live list
+                     # directly would skip elements
+                     for new_node in reversed(list(new_soup.contents)):
+                         text_node.insert_after(new_node)
+                     text_node.extract()
+
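+     # Illustrative: with keywords=['GMAT'], a direct text node such as
+     # 'Preparing for the GMAT takes months of steady practice.' becomes
+     # 'Preparing for the <strong>GMAT</strong> takes months of steady practice.'
+     # (text nodes under 20 characters are deliberately left alone).
+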
+     def add_natural_flow_variations(self, text):
+         """Add more natural flow and rhythm variations for Originality AI"""
+         sentences = self.split_into_sentences_advanced(text)
+         enhanced_sentences = []
+
+         for i, sentence in enumerate(sentences):
+             if not sentence.strip():
+                 continue
+
+             # Add stream-of-consciousness elements (10% chance)
+             if random.random() < 0.1 and len(sentence.split()) > 10:
+                 stream_elements = [
+                     " - wait, let me back up - ",
+                     " - actually, scratch that - ",
+                     " - or maybe I should say - ",
+                     " - hmm, how do I put this - ",
+                     " - okay, here's the thing - ",
+                     " - you know what I mean? - "
+                 ]
+                 words = sentence.split()
+                 pos = random.randint(len(words) // 4, 3 * len(words) // 4)
+                 words.insert(pos, random.choice(stream_elements))
+                 sentence = ' '.join(words)
+
+             # Add human-like self-corrections (5% chance)
+             if random.random() < 0.05:
+                 corrections = [
+                     " - or rather, ",
+                     " - well, actually, ",
+                     " - I mean, ",
+                     " - or should I say, ",
+                     " - correction: "
+                 ]
+                 words = sentence.split()
+                 if len(words) > 8:
+                     # Drop the aside into the back half of the sentence
+                     pos = random.randint(len(words) // 2, len(words) - 3)
+                     words.insert(pos, random.choice(corrections))
+                     sentence = ' '.join(words)
+
+             # Add thinking-out-loud openers (8% chance)
+             if random.random() < 0.08 and i > 0:
+                 thinking_patterns = [
+                     "Come to think of it, ",
+                     "Actually, you know what? ",
+                     "Wait, here's a thought: ",
+                     "Oh, and another thing - ",
+                     "Speaking of which, ",
+                     "This reminds me, ",
+                     "Now that I mention it, ",
+                     "Funny you should ask, because "
+                 ]
+                 if len(sentence) > 1:
+                     sentence = random.choice(thinking_patterns) + sentence[0].lower() + sentence[1:]
+
+             enhanced_sentences.append(sentence)
+
+         return ' '.join(enhanced_sentences)
+
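+     # Rough expectation (per-sentence probabilities above): a 20-sentence
+     # block averages about two asides, one self-correction, and one or two
+     # thinking-out-loud openers before fix_punctuation() tidies the result.
+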
+     def process_html(self, html_content, primary_keywords="", secondary_keywords="", progress_callback=None):
+         """Main processing function with progress callback"""
+         if not html_content.strip():
+             return "Please provide HTML content."
+
+         # Store all script and style content so it survives untouched
+         script_placeholder = "###SCRIPT_PLACEHOLDER_{}###"
+         style_placeholder = "###STYLE_PLACEHOLDER_{}###"
+         preserved_scripts = []
+         preserved_styles = []
+
+         # Temporarily replace script and style tags with placeholders
+         soup_temp = BeautifulSoup(html_content, 'html.parser')
+
+         # Preserve all script tags
+         for idx, script in enumerate(soup_temp.find_all('script')):
+             placeholder = script_placeholder.format(idx)
+             preserved_scripts.append(str(script))
+             script.replace_with(placeholder)
+
+         # Preserve all style tags
+         for idx, style in enumerate(soup_temp.find_all('style')):
+             placeholder = style_placeholder.format(idx)
+             preserved_styles.append(str(style))
+             style.replace_with(placeholder)
+
+         # Get the modified HTML
+         html_content = str(soup_temp)
+
+         # Combine keywords and clean them
+         all_keywords = []
+         if primary_keywords:
+             # Clean and validate each keyword
+             for k in primary_keywords.split(','):
+                 cleaned = k.strip()
+                 if cleaned and len(cleaned) > 1:  # Skip empty or single-char keywords
+                     all_keywords.append(cleaned)
+         if secondary_keywords:
+             for k in secondary_keywords.split(','):
+                 cleaned = k.strip()
+                 if cleaned and len(cleaned) > 1:
+                     all_keywords.append(cleaned)
+
+         # Remove duplicates while preserving order
+         seen = set()
+         unique_keywords = []
+         for k in all_keywords:
+             if k.lower() not in seen:
+                 seen.add(k.lower())
+                 unique_keywords.append(k)
+         all_keywords = unique_keywords
+
+         try:
+             # Extract text elements
+             soup, text_elements = self.extract_text_from_html(html_content)
+
+             total_elements = len(text_elements)
+             print(f"Found {total_elements} text elements to process (after filtering)")
+             if all_keywords:
+                 print(f"Preserving keywords: {all_keywords}")
+
+             # Process each text element
+             processed_count = 0
+
+             for i, element_info in enumerate(text_elements):
+                 original_text = element_info['text']
+
+                 # Skip placeholders
+                 if "###SCRIPT_PLACEHOLDER_" in original_text or "###STYLE_PLACEHOLDER_" in original_text:
+                     continue
+
+                 # Skip very short texts
+                 if len(original_text.split()) < 3:
+                     continue
+
+                 # Debug: check if keywords are in this text
+                 text_has_keywords = any(keyword.lower() in original_text.lower() for keyword in all_keywords)
+                 if text_has_keywords:
+                     print(f"Debug: Processing text with keywords: {original_text[:50]}...")
+
+                 # First pass with Dipper (with adjusted diversity)
+                 paraphrased_text = self.paraphrase_with_dipper(
+                     original_text,
+                     keywords=all_keywords
+                 )
+
+                 # Verify no placeholders remain
+                 if '__KW' in paraphrased_text or '___' in paraphrased_text:
+                     print(f"Warning: Placeholder or underscores found in paraphrased text: {paraphrased_text[:100]}...")
+                     # Try to restore again with the enhanced function
+                     temp_map = {}
+                     for j, keyword in enumerate(all_keywords):
+                         temp_map[f'__KW{j:03d}__'] = keyword
+                     paraphrased_text = self.restore_keywords_robust(paraphrased_text, temp_map)
+
+                 # Second pass with BART for longer texts
+                 if self.use_bart and len(paraphrased_text.split()) > 8:
+                     # 50% chance to use BART for more variation (reduced from 60%)
+                     if random.random() < 0.5:
+                         paraphrased_text = self.paraphrase_with_bart(
+                             paraphrased_text,
+                             keywords=all_keywords
+                         )
+
+                 # Apply sentence variation
+                 paraphrased_text = self.apply_sentence_variation(paraphrased_text)
+
+                 # Add natural flow variations
+                 paraphrased_text = self.add_natural_flow_variations(paraphrased_text)
+
+                 # Fix punctuation and formatting
+                 paraphrased_text = self.fix_punctuation(paraphrased_text)
+
+                 # Final check for any remaining placeholders or underscores
+                 if '___' in paraphrased_text or '__KW' in paraphrased_text:
+                     print("Error: Unresolved placeholders in final text")
+                     # Fall back to the original text if placeholders can't be resolved
+                     paraphrased_text = original_text
+
+                 # Final quality check
+                 if paraphrased_text and len(paraphrased_text.split()) >= 3:
+                     element_info['element'].replace_with(NavigableString(paraphrased_text))
+                     processed_count += 1
+
+                 # Progress update
+                 if progress_callback:
+                     progress_callback(i + 1, total_elements)
+
+                 if i % 10 == 0 or i == total_elements - 1:
+                     progress = (i + 1) / total_elements * 100
+                     print(f"Progress: {progress:.1f}%")
+
+             # Wrap keywords with <strong> tags in paragraphs
+             self.wrap_keywords_in_paragraphs(soup, all_keywords)
+
+             # Post-process the entire HTML to fix bold/strong formatting
+             result = str(soup)
+             result = self.post_process_html(result)
+
+             # Final safety check for any remaining placeholders or underscores
+             if '__KW' in result or re.search(r'_{3,}', result):
+                 print("Warning: Found placeholders or multiple underscores in final HTML output")
+                 # Attempt to clean them with keywords
+                 for i, keyword in enumerate(all_keywords):
+                     result = result.replace(f'__KW{i:03d}__', keyword)
+                     result = re.sub(r'_{3,}', keyword, result, count=1)
+
+             # Restore all script tags
+             for idx, script_content in enumerate(preserved_scripts):
+                 placeholder = script_placeholder.format(idx)
+                 result = result.replace(placeholder, script_content)
+
+             # Restore all style tags
+             for idx, style_content in enumerate(preserved_styles):
+                 placeholder = style_placeholder.format(idx)
+                 result = result.replace(placeholder, style_content)
+
+             # Validate and fix HTML syntax
+             result = self.validate_and_fix_html(result)
+
+             # Count skipped elements properly
+             all_text_elements = soup.find_all(string=True)
+             skipped = len([e for e in all_text_elements if e.strip() and e.parent.name not in ['script', 'style', 'noscript']]) - total_elements
+
+             print(f"Successfully processed {processed_count} text elements")
+             print(f"Skipped {skipped} elements (headings, CTAs, tables, testimonials, strong/bold tags, etc.)")
+             print(f"Preserved {len(preserved_scripts)} script tags and {len(preserved_styles)} style tags")
+
+             return result
+
+         except Exception as e:
+             import traceback
+             error_msg = f"Error processing HTML: {str(e)}\n{traceback.format_exc()}"
+             print(error_msg)
+             # Return the original HTML with the error prepended as an HTML comment
+             return f"<!-- {error_msg} -->\n{html_content}"
+
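+     # End-to-end sketch (hypothetical article HTML, names as defined above):
+     #   out = humanizer.process_html('<p>Long paragraph ...</p>',
+     #                                primary_keywords='GMAT, MBA')
+     # Scripts and styles round-trip untouched by the paraphrasing passes via
+     # the ###SCRIPT_PLACEHOLDER_n### / ###STYLE_PLACEHOLDER_n### markers.
+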
+     def post_process_html(self, html_text):
+         """Post-process the entire HTML to fix formatting issues"""
+         # Fix empty angle brackets that might appear
+         html_text = re.sub(r'<>\s*([^<>]+?)\s*(?=\.|\s|<)', r'\1', html_text)  # Remove <> around text
+         html_text = re.sub(r'<>', '', html_text)                               # Remove any remaining empty <>
+
+         # Fix double angle brackets around bold tags
+         html_text = re.sub(r'<<b>>', '<b>', html_text)
+         html_text = re.sub(r'<</b>>', '</b>', html_text)
+         html_text = re.sub(r'<<strong>>', '<strong>', html_text)
+         html_text = re.sub(r'<</strong>>', '</strong>', html_text)
+
+         # Fix periods around bold/strong tags. The replacements must be raw
+         # string literals: in a plain string '\1' is the control character
+         # \x01, not a backreference to group 1.
+         html_text = re.sub(r'\.\s*<(b|strong)>', r'. <\1>', html_text)   # Period before bold
+         html_text = re.sub(r'</(b|strong)>\s*\.', r'</\1>.', html_text)  # Period after bold
+         html_text = re.sub(r'\.<<(b|strong)>>', r'. <\1>', html_text)    # Fix double-bracket cases
+         html_text = re.sub(r'</(b|strong)>>\.', r'</\1>.', html_text)
+
+         # Fix periods after colons
+         html_text = re.sub(r':\s*\.', ':', html_text)
+         html_text = re.sub(r'\.:', ':', html_text)
+
+         # Check whether a line is a list item
+         def process_line(line):
+             # Check if this line contains a list pattern with bold
+             list_pattern = r'(?:^|\s)(?:[-•*▪▫◦‣⁃]\s*)?<(?:strong|b)>[^<]+:</(?:strong|b)>'
+             if re.search(list_pattern, line):
+                 # This is a list item, preserve the colon format
+                 return line
+
+             # Not a list item, apply regular fixes:
+             # remove periods immediately inside bold tags
+             line = re.sub(r'<(strong|b)>\s*\.\s*([^<]+)\s*\.\s*</\1>', r'<\1>\2</\1>', line)
+
+             # Fix sentence endings with bold
+             line = re.sub(r'</(strong|b)>\s*([.!?])', r'</\1>\2', line)
+
+             return line
+
+         # Process line by line to preserve list formatting
+         lines = html_text.split('\n')
+         processed_lines = [process_line(line) for line in lines]
+         html_text = '\n'.join(processed_lines)
+
+         # Fix sentence starts with bold. The pattern captures the bare tag
+         # name (strong|b) plus the whole bold run, so the rebuilt string can
+         # re-emit '<tag>...</tag>' cleanly; capturing '<strong>' with its
+         # brackets would produce '<<strong>>' artifacts.
+         def fix_bold_sentence_start(match):
+             pre_context = match.group(1)
+             tag = match.group(2)
+             content = match.group(3)
+
+             # Skip list-item labels like <strong>Label:</strong>
+             if content.rstrip().endswith(':'):
+                 return match.group(0)
+
+             # Capitalize when the bold run opens the text or follows a sentence end
+             if pre_context == '' or pre_context.endswith(('.', '!', '?', '>')):
+                 if content and content[0].islower():
+                     content = content[0].upper() + content[1:]
+
+             return f'{pre_context}<{tag}>{content}</{tag}>'
+
+         html_text = re.sub(r'(^|.*?)<(strong|b)>([^<]*)</\2>', fix_bold_sentence_start, html_text)
+
+         # Clean up spacing around bold tags (but preserve list formatting):
+         # split into segments so list items are handled separately
+         segments = re.split(r'(<(?:strong|b)>[^<]*:</(?:strong|b)>)', html_text)
+         cleaned_segments = []
+
+         for i, segment in enumerate(segments):
+             if i % 2 == 1:  # This is a list item pattern
+                 cleaned_segments.append(segment)
+             else:
+                 # Apply spacing fixes to non-list segments
+                 segment = re.sub(r'\s+<(strong|b)>', r' <\1>', segment)
+                 segment = re.sub(r'</(strong|b)>\s+', r'</\1> ', segment)
+                 # Fix punctuation issues
+                 segment = re.sub(r'([.,!?;:])\s*([.,!?;:])', r'\1', segment)
+                 # Fix periods inside/around bold (raw replacements, as above)
+                 segment = re.sub(r'\.<(strong|b)>\.', r'. <\1>', segment)
+                 segment = re.sub(r'\.</(strong|b)>\.', r'</\1>.', segment)
+                 cleaned_segments.append(segment)
+
+         html_text = ''.join(cleaned_segments)
+
+         # Final cleanup
+         html_text = re.sub(r'\.{2,}', '.', html_text)           # Multiple periods
+         html_text = re.sub(r',{2,}', ',', html_text)            # Multiple commas
+         html_text = re.sub(r':{2,}', ':', html_text)            # Multiple colons
+         html_text = re.sub(r'\s+([.,!?;:])', r'\1', html_text)  # Space before punctuation
+
+         # Remove empty bold tags (but not those holding just colons)
+         html_text = re.sub(r'<(strong|b)>\s*</\1>', '', html_text)
+
+         # Fix specific patterns in lists/stats:
+         # a figure like "5,000+" should not get a trailing period
+         html_text = re.sub(r'(\d+[,\d]*\+?)\s*\.\s*\n', r'\1\n', html_text)
+
+         # Clean up any remaining double brackets
+         html_text = re.sub(r'<<', '<', html_text)
+         html_text = re.sub(r'>>', '>', html_text)
+
+         # Apply final minimal grammar fixes
+         html_text = self.grammar_fixer.smart_fix(html_text)
+
+         return html_text
+
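+     # Illustrative: '<p><strong>key point</strong> follows.</p>' comes out as
+     # '<p><strong>Key point</strong> follows.</p>': the bold run opens right
+     # after a tag, so fix_bold_sentence_start() above capitalizes it.
+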
+ # Initialize the humanizer
+ humanizer = EnhancedDipperHumanizer()
+
+ def humanize_html(html_input, primary_keywords="", secondary_keywords="", progress=gr.Progress()):
+     """Gradio interface function with progress updates"""
+     if not html_input:
+         return "Please provide HTML content to humanize."
+
+     progress(0, desc="Starting processing...")
+     start_time = time.time()
+
+     # Wrapper that maps element counts onto the 0..1 progress bar
+     def progress_callback(current, total):
+         if total > 0:
+             progress(current / total, desc=f"Processing: {current}/{total} elements")
+
+     # Pass the progress callback through to process_html
+     result = humanizer.process_html(
+         html_input,
+         primary_keywords,
+         secondary_keywords,
+         progress_callback=progress_callback
+     )
+
+     processing_time = time.time() - start_time
+     print(f"Processing completed in {processing_time:.2f} seconds")
+     progress(1.0, desc="Complete!")
+
+     return result
+
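+ # Note: gr.Progress() as a default parameter is Gradio's standard pattern;
+ # Gradio injects a live tracker when the function runs behind the UI, so the
+ # per-element callback above drives the visible progress bar.
+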
+ # Create Gradio interface with queue
+ iface = gr.Interface(
+     fn=humanize_html,
+     inputs=[
+         gr.Textbox(
+             lines=10,
+             placeholder="Paste your HTML content here...",
+             label="HTML Input"
+         ),
+         gr.Textbox(
+             placeholder="Enter primary keywords separated by commas (e.g., GMAT Focus Edition, MBA, Data Insights)",
+             label="Primary Keywords (preserved exactly)"
+         ),
+         gr.Textbox(
+             placeholder="Enter secondary keywords separated by commas (e.g., test preparation, business school)",
+             label="Secondary Keywords (preserved exactly)"
+         )
+     ],
+     outputs=gr.Textbox(
+         lines=10,
+         label="Humanized HTML Output"
+     ),
+     title="Enhanced Dipper AI Humanizer - Optimized for Originality AI",
+     description="""
+ Ultra-aggressive humanizer optimized to achieve 100% human scores on both Undetectable AI and Originality AI.
+
+ Key Features:
+ - Maximum diversity settings (90% lexical, 40% order) for natural variation
+ - Enhanced human patterns: personal opinions, self-corrections, thinking-out-loud
+ - Natural typos, contractions, and conversational flow
+ - Stream-of-consciousness elements and rhetorical questions
+ - Originality AI-specific optimizations: varied sentence starters, emphatic repetitions
+ - Fixed placeholder system that preserves keywords
+ - Keywords inside <p> tags are automatically wrapped with <strong> tags
+ - Skips content in <strong>, <b>, and heading tags (including inside tables)
+ - Designed to pass the strictest AI detection systems
+
+ The tool creates genuinely human-like writing patterns that fool even the most sophisticated detectors!
+
+ ⚠️ Note: Processing may take 5-10 minutes for large HTML documents.
+ """,
+     examples=[
+         ["""<article>
+ <h1>The Benefits of Regular Exercise</h1>
+ <div class="author-intro">By John Doe, Fitness Expert | 10 years experience</div>
+ <p>Regular exercise is essential for maintaining good health. It helps improve cardiovascular fitness, strengthens muscles, and enhances mental well-being. Studies have shown that people who exercise regularly have lower risks of chronic diseases.</p>
+ <p>Additionally, exercise can boost mood and energy levels. It releases endorphins, which are natural mood elevators. Even moderate activities like walking can make a significant difference in overall health.</p>
+ </article>""", "cardiovascular fitness, mental well-being, chronic diseases", "exercise, health, endorphins"]
+     ],
+     theme="default"
+ )
+
+ if __name__ == "__main__":
+     # Enable the queue for better handling of long-running jobs
+     iface.queue(max_size=10)
+     iface.launch(share=True)
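+
+ # Note: queue(max_size=10) caps how many requests can wait in line; share=True
+ # asks Gradio for a temporary public link, which is redundant on hosts that
+ # already expose the app (e.g., Spaces) but handy for local runs.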