---
title: README
emoji: 💻
colorFrom: indigo
colorTo: indigo
sdk: static
pinned: false
---

# Hi there 👋

StabRise - Document Processing Solutions

# Our projects

## PDF DataSource for Apache Spark

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)

---

The project provides a custom data source for Apache Spark that reads PDF files into a Spark DataFrame.

### Key features:

- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Support large files, up to 10k pages
- Support scanned PDF files (via OCR)
- No need to install Tesseract OCR; it is included in the package

## ScaleDP

---

**Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp)

**Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb)

---

ScaleDP is an open-source library for processing documents using Apache Spark.

### Key features:

- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results

## De-Identify

De-Identify is a tool for de-identifying and anonymizing data.

### Supported formats

- Text
- Images
- PDF documents
- DICOM files
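
To give a rough idea of what de-identifying plain text can look like, here is a minimal sketch that masks email addresses and phone numbers with regular expressions. It does not show De-Identify's API or its actual method: the `PATTERNS` table and the `mask_text` helper are hypothetical names chosen for this illustration, and a real tool would typically rely on NER models rather than two regexes.

```python
import re

# Hypothetical patterns for illustration only; not taken from De-Identify.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_text(text: str) -> str:
    """Replace every match of each pattern with a placeholder such as <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or +1 (555) 123-4567."
    print(mask_text(sample))  # prints: Contact <EMAIL> or <PHONE>.
```

Replacing each match with a typed placeholder, rather than deleting it, keeps the sentence readable and records what kind of value was removed.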