sequenceDiagram
    participant API as API (FastAPI)
    participant DI as Data Ingestion Service
    participant AM as ArXiv Metadata Fetcher
    participant PL as PDF Loader (PyMuPDF)
    participant TS as Text Splitter
    participant EM as Embedding Model (OpenAI)
    participant VDB as Vector Database (Qdrant)
    participant HF as Hugging Face Dataset
    API->>DI: POST /ingest (query, max_results)
    DI->>AM: fetch_arxiv_metadata(query, max_results)
    alt Successful metadata fetch
        AM-->>DI: Return metadata list
        loop For each metadata item
            DI->>PL: process_pdf(pdf_url)
            alt Successful PDF processing
                PL-->>DI: Return PDF text
                DI->>TS: split_text(pdf_text)
                TS-->>DI: Return text chunks
                loop For each chunk
                    DI->>EM: embed_query(chunk)
                    EM-->>DI: Return embedding
                    DI->>VDB: add_texts(chunk, embedding)
                    DI->>HF: Add chunk and metadata
                end
            else PDF processing error
                PL-->>DI: Raise exception
                DI->>DI: Log error and continue
            end
        end
        DI-->>API: Return ingestion result
    else Metadata fetch error
        AM-->>DI: Raise exception
        DI-->>API: Return error message
    end
    Note over API,HF: Logging at each step