Delete graphrag_readme.md
# GraphRAG README

## Some fundamental concepts

### Data Ingestion

NOTE: mermaid.js diagrams below are based on some inspiring content from the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/DerwenAI/cdl2024_masterclass/blob/main/README.md) masterclass.

```mermaid
graph TD
    %% Database shapes with consistent styling
    SDS[(Structured<br/>Data Sources)]
    UDS[(Unstructured<br/>Data Sources)]
    LG[(lexical graph)]
    SG[(semantic graph)]
    VD[(vector database)]

    %% Flow from structured data
    SDS -->|PII features| ER[entity resolution]
    SDS -->|data records| SG
    SG -->|PII updates| ER
    ER -->|semantic overlay| SG

    %% Schema and ontology
    ONT[schema, ontology, taxonomy,<br/>controlled vocabularies, etc.]
    ONT --> SG

    %% Flow from unstructured data
    UDS --> K[text chunking<br/>function]
    K --> NLP[NLP parse]
    K --> EM[embedding model]
    NLP --> E[NER, RE]
    E --> LG
    LG --> EL[entity linking]
    EL <--> SG

    %% Vector elements connections
    EM --> VD
    VD -.->|capture source chunk<br/>WITHIN references| SG

    %% Thesaurus connection
    ER -.->T[thesaurus]
    T --> EL

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class SDS,UDS dataSource;
    class SG,VD storage;
    class EM embedding;
    class LG lexical;
    class SG semantic;
    class ONT,T reference;
```

### Augment LLM Inference

```mermaid
graph LR
    %% Define database and special shapes
    P[prompt]
    SG[(semantic graph)]
    VD[(vector database)]
    LLM[LLM]
    Z[response]

    %% Main flow paths
    P --> Q[generated query]
    P --> EM[embedding model]

    %% Upper path through graph elements
    Q --> SG
    SG --> W[semantic<br/>random walk]
    T[thesaurus] --> W
    W --> GA[graph analytics]

    %% Lower path through vector elements
    EM --> SS[vector<br/>similarity search]
    SS --> VD

    %% Node embeddings and chunk references
    SG -.-|chunk references| VD
    SS -->|node embeddings| SG

    %% Final convergence
    GA --> RI[ranked index]
    VD --> RI
    RI --> LLM
    LLM --> Z

    %% Styling classes
    classDef dataSource fill:#f4f4f4,stroke:#666,stroke-width:2px;
    classDef storage fill:#e6f3ff,stroke:#4a90e2,stroke-width:2px;
    classDef embedding fill:#fff3e6,stroke:#f5a623,stroke-width:2px;
    classDef lexical fill:#f0e6ff,stroke:#4a90e2,stroke-width:2px;
    classDef semantic fill:#f0e6ff,stroke:#9013fe,stroke-width:2px;
    classDef reference fill:#e6ffe6,stroke:#417505,stroke-width:2px;

    %% Apply styles by layer/type
    class SDS,UDS dataSource;
    class SG,VD storage;
    class EM embedding;
    class LG lexical;
    class SG semantic;
    class ONT,T reference;
```
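
As a rough illustration of how the two paths in the diagram might converge on the ranked index, the sketch below blends a vector-similarity score with graph-derived entity ranks. The function names, the `alpha` weighting, and the data shapes are all assumptions made for illustration; none of them are taken from the repo.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def ranked_index(query_vec, chunk_vecs, entity_rank, chunk_entities, alpha=0.5):
    """Order chunk ids by a blend of vector similarity and graph rank.

    Hypothetical scoring: alpha weights the vector path, (1 - alpha) weights
    the summed rank of the entities referenced by each chunk.
    """
    scores = {}
    for chunk_id, vec in chunk_vecs.items():
        similarity = cosine(query_vec, vec)
        graph_score = sum(
            entity_rank.get(name, 0.0) for name in chunk_entities.get(chunk_id, [])
        )
        scores[chunk_id] = alpha * similarity + (1 - alpha) * graph_score
    # Highest blended score first
    return sorted(scores, key=scores.get, reverse=True)
```

Lowering `alpha` lets chunks that mention highly ranked entities outrank chunks that are merely close to the prompt in embedding space.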

## Sequence Diagram - covering the current `strwythura` (structure) repo

- The diagram below is largely based on the functions in `demo.py`.
- I used [Prefect](https://www.prefect.io/) to dig in and reverse-engineer the flow.
- [graphrag_demo.py](./graphrag_demo.py) is my simple update to [Paco's original Python code](./demo.py).
- I stuck to using Prefect function decorators based on the existing structure, but I'm looking forward to abstracting some of the concepts out further and thinking agentically.
- Telemetry and instrumentation can often demystify complex processes without the headache of wading through long print statements. Great insight often comes from seeing how individual functions and components interact.
- This repo features a large and distinguished cast of open source models (GLiNER, GLiREL), open source embeddings (BGE, Word2Vec), and a vector store (LanceDB) for improved entity recognition and relationship extraction.
- For a deeper dive, [Paco's YouTube video and associated diagrams](https://senzing.com/gph-graph-rag-llm-knowledge-graphs/) help highlight real-world use cases where effective Knowledge Graph construction can provide deeper meaning and insight.

```mermaid
sequenceDiagram
    participant Main as Main Script
    participant ConstructKG as construct_kg Flow
    participant InitNLP as init_nlp Task
    participant ScrapeHTML as scrape_html Task
    participant MakeChunk as make_chunk Task
    participant ParseText as parse_text Task
    participant MakeEntity as make_entity Task
    participant ExtractEntity as extract_entity Task
    participant ExtractRelations as extract_relations Task
    participant ConnectEntities as connect_entities Task
    participant RunTextRank as run_textrank Task
    participant AbstractOverlay as abstract_overlay Task
    participant GenPyvis as gen_pyvis Task

    Main->>ConstructKG: Start construct_kg flow
    ConstructKG->>InitNLP: Initialize NLP pipeline
    InitNLP-->>ConstructKG: Return NLP object

    loop For each URL in url_list
        ConstructKG->>ScrapeHTML: Scrape HTML content
        ScrapeHTML->>MakeChunk: Create text chunks
        MakeChunk-->>ScrapeHTML: Return chunk list
        ScrapeHTML-->>ConstructKG: Return chunk list

        loop For each chunk in chunk_list
            ConstructKG->>ParseText: Parse text and build lex_graph
            ParseText->>MakeEntity: Create entities from spans
            MakeEntity-->>ParseText: Return entity
            ParseText->>ExtractEntity: Extract and add entities to lex_graph
            ExtractEntity-->>ParseText: Entity added to graph
            ParseText->>ExtractRelations: Extract relations between entities
            ExtractRelations-->>ParseText: Relations added to graph
            ParseText->>ConnectEntities: Connect co-occurring entities
            ConnectEntities-->>ParseText: Connections added to graph
            ParseText-->>ConstructKG: Return parsed doc
        end

        ConstructKG->>RunTextRank: Run TextRank on lex_graph
        RunTextRank-->>ConstructKG: Return ranked entities
        ConstructKG->>AbstractOverlay: Overlay semantic graph
        AbstractOverlay-->>ConstructKG: Overlay completed
    end

    ConstructKG->>GenPyvis: Generate Pyvis visualization
    GenPyvis-->>ConstructKG: Visualization saved
    ConstructKG-->>Main: Flow completed
```
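
The nesting in the sequence diagram can be sketched in plain Python. This is only a stand-in: the `traced` decorator is a hypothetical standard-library substitute for Prefect's task and flow decorators, `CALL_LOG` is an illustrative name, and the function bodies are stubs rather than the repo's logic.

```python
import functools
import time

CALL_LOG: list[str] = []  # hypothetical: records function names as calls finish


def traced(func):
    """Stand-in for a task decorator: times each call and logs its name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            CALL_LOG.append(func.__name__)
            print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
    return wrapper


@traced
def scrape_html(url):
    # Stub: pretend every page yields two text chunks
    return [f"{url} chunk 0", f"{url} chunk 1"]


@traced
def parse_text(chunk, lex_graph):
    # Stub: the real task would parse the chunk and add nodes and edges
    lex_graph.append(chunk)


@traced
def run_textrank(lex_graph):
    # Stub: the real task would rank the entities in the lexical graph
    return len(lex_graph)


@traced
def construct_kg(url_list):
    """Mirror the outer and inner loops of the sequence diagram."""
    lex_graph = []
    for url in url_list:
        for chunk in scrape_html(url):
            parse_text(chunk, lex_graph)
        run_textrank(lex_graph)
    return lex_graph
```

Calling `construct_kg(["https://example.com"])` prints one timing line per call, which gives a small taste of what an orchestration UI surfaces without scattering print statements through the pipeline code.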

## Run the code

1. Set up a local Python environment and install the Python dependencies

   - I used Python 3.11, but 3.10 should work as well

   ```bash
   pip install -r requirements.txt
   ```

2. Start the local Prefect server

   - Follow the [self-hosted instructions](https://docs.prefect.io/v3/get-started/quickstart#connect-to-a-prefect-api) to launch the `Prefect UI`

   ```bash
   prefect server start
   ```

3. Run the `graphrag_demo.py` script

   ```bash
   python graphrag_demo.py
   ```

## Appendix: Code Overview and Purpose

- The code forms part of a talk for **GraphGeeks.org** about constructing **knowledge graphs** from **unstructured data sources**, such as web content.
- It integrates web scraping, natural language processing (NLP), graph construction, and interactive visualization.

---

### **Key Components and Flow**

#### **1. Model and Parameter Settings**
- **Core Configuration**: Establishes the foundational settings like chunk size, embedding models (`BAAI/bge-small-en-v1.5`), and database URIs.
- **NER Labels**: Defines entity categories such as `Person`, `Organization`, `Publication`, and `Technology`.
- **Relation Types**: Configures relationships like `works_at`, `developed_by`, and `authored_by` for connecting entities.
- **Scraping Parameters**: Sets user-agent headers for web requests.
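
A minimal sketch of how these settings could be grouped in one place. Only the embedding model ID, the example labels, the example relation types, and the 1024 chunk size (mentioned in the glossary) come from this README; the class name, field names, database path, and user-agent string are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the pipeline's model and parameter settings."""
    chunk_size: int = 1024
    embed_model: str = "BAAI/bge-small-en-v1.5"
    lancedb_uri: str = "data/lancedb"  # assumed placeholder path
    ner_labels: tuple[str, ...] = ("Person", "Organization", "Publication", "Technology")
    relation_types: tuple[str, ...] = ("works_at", "developed_by", "authored_by")
    # Assumed placeholder header; the repo's actual user-agent may differ
    scrape_headers: dict = field(default_factory=lambda: {"User-Agent": "Mozilla/5.0"})


config = PipelineConfig()
```

Keeping the labels and relation types as data rather than hard-coding them in the extraction functions makes it easier to retarget the pipeline at a different domain.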

#### **2. Data Validation**
- **Classes**:
  - `TextChunk`: Represents segmented text chunks with their embeddings.
  - `Entity`: Tracks extracted entities, their attributes, and relationships.
- **Purpose**: Ensures data is clean and well-structured for downstream processing.
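
To make the validation idea concrete, here is a sketch using standard-library dataclasses. The field names and the checks are assumptions; the repo's own `TextChunk` and `Entity` classes may use a different validation library and carry different attributes.

```python
from dataclasses import dataclass, field


@dataclass
class TextChunk:
    """A segment of source text plus its embedding (fields are assumed)."""
    uid: int
    url: str
    text: str
    vector: list[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject chunks that carry no usable text
        if not self.text.strip():
            raise ValueError("TextChunk.text must not be empty")


@dataclass
class Entity:
    """An extracted entity and its outgoing relations (fields are assumed)."""
    text: str
    label: str
    chunk_id: int
    relations: list[tuple[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject entities without a surface form or a label
        if not self.text.strip() or not self.label.strip():
            raise ValueError("Entity needs both text and label")
```

Failing early at construction time keeps malformed records out of the graph and the vector store.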

#### **3. Data Collection**
- **Functions**:
  - `scrape_html`: Fetches and parses webpage content.
  - `uni_scrubber`: Cleans Unicode and formatting issues.
  - `make_chunk`: Segments long text into manageable chunks for embedding.
- **Role**: Prepares raw, unstructured data for structured analysis.
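
The chunking step can be sketched as greedy sentence packing. This is a simplified stand-in, not the repo's `make_chunk`: it splits sentences with a regular expression and counts whitespace-separated words, whereas the real function may rely on an NLP tokenizer. The name `make_chunks` and its parameters are hypothetical.

```python
import re


def make_chunks(text: str, chunk_size: int = 1024) -> list[str]:
    """Greedily pack sentences into chunks of roughly `chunk_size` words.

    A single sentence longer than `chunk_size` is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        if not sentence:
            continue  # skip empty input
        words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and count + words > chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries rather than at a fixed character offset avoids cutting an entity mention in half before it reaches the NER step.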

#### **4. Lexical Graph Construction**
- **Initialization**:
  - `init_nlp`: Sets up NLP pipelines with spaCy, GLiNER (NER), and GLiREL (RE).
- **Graph Parsing**:
  - `parse_text`: Parses each text chunk and builds the lexical graph that TextRank later ranks.
  - `make_entity`: Creates entities from text spans for integration into the graph.
  - `connect_entities`: Links entities co-occurring in the same context.
- **Purpose**: Converts text into a structured, connected graph of entities and relationships.
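
Co-occurrence linking can be illustrated without any graph library. The sketch below assumes that "same context" means "same chunk" and represents the graph as a counter of undirected edges; the function name and the input shape are hypothetical, not the repo's `connect_entities` signature.

```python
from collections import Counter
from itertools import combinations


def connect_cooccurring(chunk_entities: list[list[str]]) -> Counter:
    """Count how often each pair of entities appears in the same chunk."""
    edges: Counter = Counter()
    for entities in chunk_entities:
        # Deduplicate within a chunk and sort so (a, b) and (b, a) share one key
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges
```

The resulting counts can serve as edge weights, so pairs that co-occur across many chunks are linked more strongly than one-off mentions.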

#### **5. Numerical Processing**
- **Functions**:
  - `calc_quantile_bins`: Creates quantile bins for numerical data.
  - `root_mean_square`: Computes RMS for normalization.
  - `stripe_column`: Applies quantile binning to data columns.
- **Role**: Provides statistical operations to refine and rank graph components.
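
These three helpers can be approximated with the standard library alone. The function names follow the list above, but the signatures and the choice of `statistics.quantiles` with the inclusive method are assumptions; the repo may compute its bins differently.

```python
import math
from bisect import bisect_left
from statistics import quantiles


def root_mean_square(values: list[float]) -> float:
    """Square root of the mean of the squared values."""
    if not values:
        raise ValueError("values must not be empty")
    return math.sqrt(sum(v * v for v in values) / len(values))


def calc_quantile_bins(values: list[float], num_bins: int = 4) -> list[float]:
    """Return the num_bins - 1 cut points that split values into quantile bins."""
    return quantiles(values, n=num_bins, method="inclusive")


def stripe_column(values: list[float], bins: list[float]) -> list[int]:
    """Map each value to the index of the quantile bin it falls into."""
    return [bisect_left(bins, v) for v in values]
```

For example, `stripe_column(scores, calc_quantile_bins(scores))` turns raw scores into bin indexes from 0 to 3, which is a convenient scale for ranking or for sizing nodes in a visualization.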

#### **6. TextRank Implementation**
- **Functions**:
  - `run_textrank`: Ranks entities in the graph based on a PageRank-inspired algorithm.
- **Purpose**: Identifies and prioritizes key entities for knowledge graph construction.
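
The PageRank idea behind TextRank can be shown as a short power iteration over an undirected, weighted graph. This is a generic sketch, not the repo's `run_textrank` (the glossary suggests the repo uses NetworkX for this); the edge-dictionary input, the function name, and the default parameters are assumptions, and edge weights are assumed to be positive.

```python
def pagerank(
    edges: dict[tuple[str, str], float],
    damping: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """Rank nodes of an undirected weighted graph by power iteration."""
    # Build a symmetric adjacency map: node -> {neighbor: weight}
    neighbors: dict[str, dict[str, float]] = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        new_rank = {}
        for node, linked in neighbors.items():
            # Each neighbor shares its rank in proportion to the edge weight
            incoming = sum(
                rank[other] * weight / sum(neighbors[other].values())
                for other, weight in linked.items()
            )
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank
```

Nodes that are linked to many well-connected neighbors accumulate rank, which is why frequently co-occurring entities tend to surface at the top.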

#### **7. Semantic Overlay**
- **Functions**:
  - `abstract_overlay`: Abstracts a semantic layer from the lexical graph.
  - Connects entities to their originating text chunks for context preservation.
- **Role**: Enhances the graph with higher-order relationships and semantic depth.

#### **8. Visualization**
- **Tool**: `pyvis`
- **Functions**:
  - `gen_pyvis`: Creates an interactive visualization of the knowledge graph.
- **Features**:
  - Node sizing reflects entity importance.
  - Physics-based layout supports intuitive exploration.

#### **9. Orchestration**
- **Function**:
  - `construct_kg`: Orchestrates the full pipeline from data collection to visualization.
- **Purpose**: Ensures the seamless integration of all layers and components.

---

### **Notable Implementation Details**

- **Multi-Layer Graph Representation**: Combines lexical and semantic graphs for layered analysis.
- **Vector Embedding Integration**: Enhances entity representation with embeddings.
- **Error Handling and Debugging**: Includes robust logging and debugging features.
- **Scalability**: Designed for handling diverse and large datasets with dynamic relationships.

---

## Appendix: Architectural Workflow

### **1. Architectural Workflow: A Layered Approach to Knowledge Graph Construction**

#### **1.1 Workflow Layers**

**Data Ingestion:**
- Role: Extract raw data from structured and unstructured sources for downstream processing.
- Responsibilities: Handle diverse data formats, ensure quality, and standardize for analysis.
- Requirements: Reliable scraping, parsing, and chunking mechanisms to prepare data for embedding and analysis.

**Lexical Graph Construction:**
- Role: Build a foundational graph by integrating tokenized data and semantic relationships.
- Responsibilities: Identify key entities through tokenization and ranking (e.g., TextRank).
- Requirements: Efficient methods for integrating named entities and relationships into a coherent graph structure.

**Entity and Relation Extraction:**
- Role: Identify and label entities, along with their relationships, to enrich the graph structure.
- Responsibilities: Extract domain-specific entities (NER) and relationships (RE) to add connectivity.
- Requirements: Domain-tuned models and algorithms for accurate extraction.

**Graph Construction and Visualization:**
- Role: Develop and display the knowledge graph to facilitate analysis and decision-making.
- Responsibilities: Create a graph structure using tools like NetworkX and enable exploration with interactive visualizations (e.g., PyVis).
- Requirements: Scalable graph-building frameworks and intuitive visualization tools.

**Semantic Overlay:**
- Role: Enhance the graph with additional context and reasoning capabilities.
- Responsibilities: Integrate ontologies, taxonomies, and domain-specific knowledge to provide depth and precision.
- Requirements: Mechanisms to map structured data into graph elements and ensure consistency with existing knowledge bases.

### **2. Visualized Workflow**

#### **2.1 Logical Data Flow**

```mermaid
graph TD
    A[Raw Data] -->|Scrape| B[Chunks]
    B -->|Lexical Parsing| C[Lexical Graph]
    C -->|NER + RE| D[Entities and Relations]
    D -->|Construct KG| E[Knowledge Graph]
    E -->|Overlay Ontologies| F[Enriched Graph]
    F -->|Visualize| G[Interactive View]
```

---

### **3. Glossary**

| **Participant** | **Description** | **Workflow Layer** |
|-----------------|-----------------|--------------------|
| **HTML Scraper (BeautifulSoup)** | Fetches unstructured text data from web sources. | Data Ingestion |
| **Text Chunker** | Breaks raw text into manageable chunks (e.g., 1024 tokens) and prepares them for embedding. | Data Ingestion |
| **SpaCy Pipeline** | Processes chunks and integrates GLiNER and GLiREL for entity and relation extraction. | Entity and Relation Extraction |
| **Embedding Model (bge-small-en-v1.5)** | Captures lower-level lexical meanings of text and translates them into machine-readable vector representations. | Data Ingestion |
| **GLiNER** | Identifies domain-specific entities and returns labeled outputs. | Entity and Relation Extraction |
| **GLiREL** | Extracts relationships between identified entities, adding connectivity to the graph. | Entity and Relation Extraction |
| **Vector Database (LanceDB)** | Stores chunk embeddings for efficient querying in downstream tasks. | Data Ingestion |
| **Word2Vec (Gensim)** | Generates entity embeddings based on graph co-occurrence for additional analysis. | Semantic Graph Construction |
| **Graph Constructor (NetworkX)** | Builds and analyzes the knowledge graph, ranking entities using TextRank. | Graph Construction and Visualization |
| **Graph Visualizer (PyVis)** | Provides an interactive visualization of the knowledge graph for interpretability. | Graph Construction and Visualization |

## Citations: giving credit where credit is due...

Inspired by the great work done by the individuals who created the [Connected Data London 2024: Entity Resolved Knowledge Graphs](https://github.com/donbr/cdl2024_masterclass/blob/main/README.md) masterclass, I created this document to highlight areas that rang true.

- Paco Nathan https://senzing.com/consult-entity-resolution-paco/
- Clair Sullivan https://clairsullivan.com/
- Louis Guitton https://guitton.co/
- Jeff Butcher https://github.com/jbutcher21
- Michael Dockter https://github.com/docktermj

The code to use GLiNER and GLiREL started as a fork of one of the four repos that make up the masterclass.