metadata
title: README
emoji: 💻
colorFrom: indigo
colorTo: indigo
sdk: static
pinned: false
Hi there 👋
StabRise - Document Processing Solutions
Our projects
PDF DataSource for Apache Spark
Source Code: https://github.com/StabRise/spark-pdf
Home page: https://stabrise.com/spark-pdf/
Quick Start Jupyter Notebook: https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb
The project provides a custom data source for Apache Spark that reads PDF files into a Spark DataFrame.
Key features:
- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Support large files, up to 10k pages
- Support scanned PDF files (via OCR)
- No need to install Tesseract OCR separately; it is included in the package
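As a rough illustration of how a custom data source like this is typically used from PySpark, here is a minimal sketch. The format name "pdf" and the option names "resolution" and "pagePerPartition" are assumptions made for this example, as are the helper names pdf_reader_options and read_pdfs; check the Quick Start notebook linked above for the actual options. The sketch also assumes the data source package is already on the Spark classpath.

```python
from typing import Dict


def pdf_reader_options(resolution: int = 200, pages_per_partition: int = 2) -> Dict[str, str]:
    """Build string-valued reader options (option names are assumed, not confirmed)."""
    if resolution <= 0 or pages_per_partition <= 0:
        raise ValueError("resolution and pages_per_partition must be positive")
    return {
        "resolution": str(resolution),                 # assumed: DPI used when rendering pages
        "pagePerPartition": str(pages_per_partition),  # assumed: pages handled per partition
    }


def read_pdfs(spark, path: str, **kwargs):
    """Load PDF files into a DataFrame; `spark` is an active SparkSession."""
    # "pdf" is the assumed short name of the data source.
    return spark.read.format("pdf").options(**pdf_reader_options(**kwargs)).load(path)
```

With a running session, a call such as read_pdfs(spark, "path/to/pdfs", resolution=300) would return a DataFrame. Because Spark evaluates lazily, pages would only be processed when an action such as show() or count() runs.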
ScaleDP
Source Code: https://github.com/StabRise/scaledp
Home page: https://stabrise.com/scaledp/
Quick Start Jupyter Notebook: https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb
ScaleDP is an open-source library for processing documents with Apache Spark.
Key features:
- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results
De-Identify
De-Identify is a tool for de-identifying and anonymizing data.
Supported formats:
- Text
- Images
- PDF documents
- DICOM files