dwb2023 committed
Commit e50ea95 · verified · 1 Parent(s): 893496b

Delete graphrag_readme.md

Files changed (1): graphrag_readme.md +0 -351
graphrag_readme.md DELETED
@@ -1,351 +0,0 @@

# GraphRAG README

## Some fundamental concepts

### Data Ingestion

NOTE: the mermaid.js diagrams below are based on inspiring content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass.

```mermaid
graph TD
    %% Database shapes with consistent styling
    SDS[(Structured<br/>Data Sources)]
    UDS[(Unstructured<br/>Data Sources)]
    LG[(lexical graph)]
    SG[(semantic graph)]
    VD[(vector database)]

    %% Flow from structured data
    SDS -->|PII features| ER[entity resolution]
    SDS -->|data records| SG
    SG -->|PII updates| ER
    ER -->|semantic overlay| SG

    %% Schema and ontology
    ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
    ONT --> SG

    %% Flow from unstructured data
    UDS --> K[text chunking<br/>function]
    K --> NLP[NLP parse]
    K --> EM[embedding model]
    NLP --> E[NER, RE]
    E --> LG
    LG --> EL[entity linking]
    EL <--> SG

    %% Vector elements connections
    EM --> VD
    VD -.->|capture source chunk<br/>WITHIN references| SG

    %% Thesaurus connection
    ER -.-> T[thesaurus]
    T --> EL

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class SDS,UDS dataSource;
    class VD storage;
    class EM embedding;
    class LG lexical;
    class SG semantic;
    class ONT,T reference;
```
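
As a concrete (if simplified) illustration of the unstructured path above, here is a minimal sketch of chunk → embed → vector store. The naive chunker, the `scraped_page.txt` input, and the `chunks` table schema are illustrative assumptions; the `BAAI/bge-small-en-v1.5` model and LanceDB store match the stack described later in this README.

```python
# Minimal sketch of the unstructured path: chunk, embed, store.
# The chunker, file path, and table schema are illustrative assumptions.
import lancedb
from sentence_transformers import SentenceTransformer

EMBED_MODEL = "BAAI/bge-small-en-v1.5"   # embedding model named in this README

def make_chunks(text: str, size: int = 1024) -> list[str]:
    """Naive fixed-width chunker; the real pipeline is smarter about boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

model = SentenceTransformer(EMBED_MODEL)
text = open("scraped_page.txt", encoding="utf-8").read()
chunks = make_chunks(text)
vectors = model.encode(chunks)            # one embedding per chunk

db = lancedb.connect("./lancedb")         # local vector database directory
table = db.create_table(
    "chunks",
    data=[{"text": c, "vector": v.tolist()} for c, v in zip(chunks, vectors)],
)
```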

### Augment LLM Inference

```mermaid
graph LR
    %% Define database and special shapes
    P[prompt]
    SG[(semantic graph)]
    VD[(vector database)]
    LLM[LLM]
    Z[response]

    %% Main flow paths
    P --> Q[generated query]
    P --> EM[embedding model]

    %% Upper path through graph elements
    Q --> SG
    SG --> W[semantic<br/>random walk]
    T[thesaurus] --> W
    W --> GA[graph analytics]

    %% Lower path through vector elements
    EM --> SS[vector<br/>similarity search]
    SS --> VD

    %% Node embeddings and chunk references
    SG -.-|chunk references| VD
    SS -->|node embeddings| SG

    %% Final convergence
    GA --> RI[ranked index]
    VD --> RI
    RI --> LLM
    LLM --> Z

    %% Styling classes
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type (only nodes present in this diagram)
    class VD storage;
    class EM embedding;
    class SG semantic;
    class T reference;
```
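
To ground the lower (vector) path, here is a hedged sketch of a similarity search that reuses the model and `chunks` table assumed in the ingestion sketch above; in the full design, the upper path's graph analytics would re-rank these hits before they reach the LLM.

```python
# Hedged sketch of the vector path: embed the prompt, retrieve nearest chunks,
# and assemble grounding context. Names carry over from the ingestion sketch.
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
db = lancedb.connect("./lancedb")
table = db.open_table("chunks")

prompt = "How do GLiNER and GLiREL fit into GraphRAG?"
hits = table.search(model.encode(prompt)).limit(5).to_list()

# The diagram's upper path would fold graph analytics into the ranking here;
# this sketch just concatenates the nearest chunks as LLM context.
context = "\n\n".join(hit["text"] for hit in hits)
```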

## Sequence Diagram: covering the current `strwythura` ("structure") repo

- The diagram below is largely based on the functions in `demo.py`.
- I used [Prefect](https://www.prefect.io/) to dig in and reverse-engineer the flow.
- [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original Python code](./demo.py).
  - I stuck to Prefect function decorators that follow the existing structure, but I look forward to abstracting some of the concepts further and thinking agentically (a minimal flow skeleton appears after the diagram below).
  - Telemetry and instrumentation often demystify complex processes without the headache of wading through long print statements; real insight comes from watching how individual functions and components interact.
- This repo features a large and distinguished cast of open-source models (GLiNER, GLiREL), open-source embeddings (BGE, Word2Vec), and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) highlight real-world use cases where effective knowledge graph construction provides deeper meaning and insight.

```mermaid
sequenceDiagram
    participant Main as Main Script
    participant ConstructKG as construct_kg Flow
    participant InitNLP as init_nlp Task
    participant ScrapeHTML as scrape_html Task
    participant MakeChunk as make_chunk Task
    participant ParseText as parse_text Task
    participant MakeEntity as make_entity Task
    participant ExtractEntity as extract_entity Task
    participant ExtractRelations as extract_relations Task
    participant ConnectEntities as connect_entities Task
    participant RunTextRank as run_textrank Task
    participant AbstractOverlay as abstract_overlay Task
    participant GenPyvis as gen_pyvis Task

    Main->>ConstructKG: Start construct_kg flow
    ConstructKG->>InitNLP: Initialize NLP pipeline
    InitNLP-->>ConstructKG: Return NLP object

    loop For each URL in url_list
        ConstructKG->>ScrapeHTML: Scrape HTML content
        ScrapeHTML->>MakeChunk: Create text chunks
        MakeChunk-->>ScrapeHTML: Return chunk list
        ScrapeHTML-->>ConstructKG: Return chunk list

        loop For each chunk in chunk_list
            ConstructKG->>ParseText: Parse text and build lex_graph
            ParseText->>MakeEntity: Create entities from spans
            MakeEntity-->>ParseText: Return entity
            ParseText->>ExtractEntity: Extract and add entities to lex_graph
            ExtractEntity-->>ParseText: Entity added to graph
            ParseText->>ExtractRelations: Extract relations between entities
            ExtractRelations-->>ParseText: Relations added to graph
            ParseText->>ConnectEntities: Connect co-occurring entities
            ConnectEntities-->>ParseText: Connections added to graph
            ParseText-->>ConstructKG: Return parsed doc
        end

        ConstructKG->>RunTextRank: Run TextRank on lex_graph
        RunTextRank-->>ConstructKG: Return ranked entities
        ConstructKG->>AbstractOverlay: Overlay semantic graph
        AbstractOverlay-->>ConstructKG: Overlay completed
    end

    ConstructKG->>GenPyvis: Generate Pyvis visualization
    GenPyvis-->>ConstructKG: Visualization saved
    ConstructKG-->>Main: Flow completed
```
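
For orientation, here is a minimal Prefect skeleton that mirrors the flow above; the task bodies are stubs standing in for the real implementations in `graphrag_demo.py`.

```python
# Skeleton only: task bodies are stubs standing in for graphrag_demo.py.
from prefect import flow, task

@task
def init_nlp():
    """Load the spaCy + GLiNER + GLiREL pipeline (stub)."""

@task
def scrape_html(url: str) -> list[str]:
    """Scrape a page and return its text chunks (stub)."""
    return []

@task
def parse_text(nlp, chunk: str):
    """Parse one chunk into the lexical graph (stub)."""

@task
def run_textrank():
    """Rank entities in the lexical graph (stub)."""

@flow
def construct_kg(url_list: list[str]):
    nlp = init_nlp()
    for url in url_list:
        for chunk in scrape_html(url):
            parse_text(nlp, chunk)
        run_textrank()

if __name__ == "__main__":
    construct_kg(["https://example.com"])
```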

## Run the code

1. Set up a local Python environment and install the dependencies

   - I used Python 3.11, but 3.10 should work as well

   ```bash
   pip install -r requirements.txt
   ```

2. Start the local Prefect server

   - follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the `Prefect UI`

   ```bash
   prefect server start
   ```

3. Run the `graphrag_demo.py` script

   ```bash
   python graphrag_demo.py
   ```

## Appendix: Code Overview and Purpose

- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.

---

### **Key Components and Flow**

#### **1. Model and Parameter Settings**
- **Core Configuration**: Establishes foundational settings such as chunk size, the embedding model (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- **Relation Types**: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters**: Sets user-agent headers for web requests.
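
A sketch of how those settings might look as module-level constants; the names and values here are assumptions, not the repo's literal configuration.

```python
# Illustrative constants only; names and values are assumptions.
CHUNK_SIZE = 1024
EMBED_MODEL = "BAAI/bge-small-en-v1.5"
LANCEDB_URI = "./lancedb"

NER_LABELS = ["Person", "Organization", "Publication", "Technology"]
RE_LABELS = ["works_at", "developed_by", "authored_by"]  # GLiREL expects a richer label schema

SCRAPE_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; graphrag-demo)"}
```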

#### **2. Data Validation**
- **Classes**:
  - `TextChunk`: Represents segmented text chunks with their embeddings.
  - `Entity`: Tracks extracted entities, their attributes, and relationships.
- **Purpose**: Ensures data is clean and well-structured for downstream processing.
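
For illustration, the two classes could be sketched with dataclasses; the field names are assumptions, and the repo may well use Pydantic for stricter validation.

```python
# Field names are assumptions; the repo may use Pydantic for validation.
from dataclasses import dataclass, field

@dataclass
class TextChunk:
    uid: int
    url: str
    text: str
    vector: list[float] = field(default_factory=list)   # chunk embedding

@dataclass
class Entity:
    key: str                                  # canonical span key (e.g., lemma)
    label: str                                # NER label such as "Person"
    text: str
    chunk_ids: list[int] = field(default_factory=list)  # provenance back-references
```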

#### **3. Data Collection**
- **Functions**:
  - `scrape_html`: Fetches and parses webpage content.
  - `uni_scrubber`: Cleans Unicode and formatting issues.
  - `make_chunk`: Segments long text into manageable chunks for embedding.
- **Role**: Prepares raw, unstructured data for structured analysis.
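
A hedged sketch of the collection helpers; the scrub rules and the paragraph-only selector are assumptions rather than the repo's exact logic.

```python
# Hedged sketch of the collection helpers; cleanup rules are assumptions.
import requests
from bs4 import BeautifulSoup

def uni_scrubber(text: str) -> str:
    """Normalize awkward Unicode: curly quotes, non-breaking spaces, etc."""
    return (
        text.replace("\u2019", "'")
            .replace("\u201c", '"')
            .replace("\u201d", '"')
            .replace("\xa0", " ")
            .strip()
    )

def scrape_html(url: str) -> str:
    resp = requests.get(url, headers={"User-Agent": "graphrag-demo"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return uni_scrubber(" ".join(p.get_text() for p in soup.find_all("p")))
```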

#### **4. Lexical Graph Construction**
- **Initialization**:
  - `init_nlp`: Sets up the NLP pipeline with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing**:
  - `parse_text`: Builds the lexical graph that the TextRank step later ranks.
  - `make_entity`: Extracts and integrates entities into the graph.
  - `connect_entities`: Links entities co-occurring in the same context.
- **Purpose**: Converts text into a structured, connected graph of entities and relationships.
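
A sketch of `init_nlp` following the published `gliner-spacy` and GLiREL examples; the checkpoint names and config keys are assumptions and may drift across package versions.

```python
# Sketch based on the gliner-spacy and GLiREL published examples; checkpoint
# names and config keys are assumptions that may drift across versions.
import spacy
import glirel  # noqa: F401 -- importing registers the "glirel" pipe factory

def init_nlp():
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("gliner_spacy", config={
        "gliner_model": "urchade/gliner_small-v2.1",   # assumed GLiNER checkpoint
        "labels": ["Person", "Organization", "Publication", "Technology"],
    })
    nlp.add_pipe("glirel", after="ner")                # relation extraction stage
    return nlp
```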

#### **5. Numerical Processing**
- **Functions**:
  - `calc_quantile_bins`: Creates quantile bins for numerical data.
  - `root_mean_square`: Computes RMS for normalization.
  - `stripe_column`: Applies quantile binning to data columns.
- **Role**: Provides statistical operations to refine and rank graph components.
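
Hedged sketches of the three helpers, following the descriptions above; the log-scaled bin granularity is an assumption.

```python
# Formulas follow the descriptions above; the log-scaled bin count is an assumption.
import numpy as np

def calc_quantile_bins(num_rows: int) -> np.ndarray:
    """Quantile bin edges, with granularity scaled to the data size."""
    granularity = max(round(np.log(num_rows) * 4), 1)
    return np.linspace(0, 1, num=granularity, endpoint=True)

def root_mean_square(values: list[float]) -> float:
    a = np.asarray(values, dtype=float)
    return float(np.sqrt(np.mean(a * a)))

def stripe_column(values: list[float], bins: np.ndarray) -> np.ndarray:
    """Map each value onto its quantile 'stripe' for ranking."""
    quantiles = np.quantile(values, q=bins)
    return np.digitize(values, bins=quantiles)
```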

#### **6. TextRank Implementation**
- **Functions**:
  - `run_textrank`: Ranks entities in the graph with a PageRank-inspired algorithm.
- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction.
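
At its core, this step can be sketched as PageRank over the lexical graph:

```python
# Minimal sketch: TextRank-style ranking via PageRank over the lexical graph.
import networkx as nx

def run_textrank(lex_graph: nx.Graph) -> list[tuple[str, float]]:
    ranks = nx.pagerank(lex_graph)
    # stash the rank on each node, then return entities sorted by importance
    nx.set_node_attributes(lex_graph, ranks, "rank")
    return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
```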

#### **7. Semantic Overlay**
- **Functions**:
  - `abstract_overlay`: Abstracts a semantic layer from the lexical graph and connects entities to their originating text chunks for context preservation.
- **Role**: Enhances the graph with higher-order relationships and semantic depth.
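
A minimal sketch of the overlay step, assuming a `rank` node attribute set by the TextRank step and `WITHIN` edges back to source chunks (the attribute and edge names are assumptions):

```python
# Hedged sketch: promote ranked entities into the semantic graph and keep
# WITHIN back-references to source chunks (attribute names are assumptions).
import networkx as nx

def abstract_overlay(lex_graph: nx.Graph, sem_graph: nx.Graph, chunk_id: int) -> None:
    for key, attrs in lex_graph.nodes(data=True):
        if attrs.get("rank", 0.0) > 0.0:               # keep ranked entities only
            sem_graph.add_node(key, **attrs)
            sem_graph.add_edge(key, f"chunk:{chunk_id}", rel="WITHIN")
```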

#### **8. Visualization**
- **Tool**: `pyvis`
- **Functions**:
  - `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- **Features**:
  - Node sizing reflects entity importance.
  - Physics-based layout supports intuitive exploration.
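
A sketch of the PyVis step; sizing nodes by a `rank` attribute is an assumption drawn from the "node sizing reflects entity importance" note above.

```python
# Sketch of the PyVis step; node size scaled by an assumed "rank" attribute.
from pyvis.network import Network

def gen_pyvis(sem_graph, html_path: str = "kg.html") -> None:
    net = Network(height="800px", width="100%", notebook=False)
    net.from_nx(sem_graph)                      # import the NetworkX graph
    for node in net.nodes:
        node["size"] = 10 + 40 * node.get("rank", 0.0)
    net.toggle_physics(True)                    # physics-based layout
    net.save_graph(html_path)
```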

#### **9. Orchestration**
- **Function**:
  - `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- **Purpose**: Ensures the seamless integration of all layers and components.

---

### **Notable Implementation Details**

- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration**: Enhances entity representation with embeddings.
- **Error Handling and Debugging**: Includes robust logging and debugging features.
- **Scalability**: Designed to handle diverse, large datasets with dynamic relationships.

---

## Appendix: Architectural Workflow

### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction**

#### **1.1 Workflow Layers**

**Data Ingestion:**
- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.

**Lexical Graph Construction:**
- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.

**Entity and Relation Extraction:**
- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.

**Graph Construction and Visualization:**
- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.

**Semantic Overlay:**
- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.

### **2. Visualized Workflow**

#### **2.1 Logical Data Flow**

```mermaid
graph TD
    A[Raw Data] -->|Scrape| B[Chunks]
    B -->|Lexical Parsing| C[Lexical Graph]
    C -->|NER + RE| D[Entities and Relations]
    D -->|Construct KG| E[Knowledge Graph]
    E -->|Overlay Ontologies| F[Enriched Graph]
    F -->|Visualize| G[Interactive View]
```

---

### **3. Glossary**

| **Participant** | **Description** | **Workflow Layer** |
|---|---|---|
| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources. | Data Ingestion |
| **Text Chunker** | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
| **spaCy Pipeline** | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| **GLiNER** | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
| **GLiREL** | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
| **Vector Database (LanceDB)** | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
| **Word2Vec (Gensim)** | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Overlay |
| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
| **Graph Visualizer (PyVis)** | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |

## Citations: giving credit where credit is due...

Inspired by the great work of the individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight areas that rang true.

- Paco Nathan: https://senzing.com/consult-entity-resolution-paco/
- Clair Sullivan: https://clairsullivan.com/
- Louis Guitton: https://guitton.co/
- Jeff Butcher: https://github.com/jbutcher21
- Michael Dockter: https://github.com/docktermj

The code that uses GLiNER and GLiREL started as a fork of one of the four repos that make up the masterclass.