MykolaMelnyk committed
Commit 55da6dd · verified · 1 Parent(s): 0726016

Update README.md

Files changed (1)
  1. README.md +68 -1
README.md CHANGED
@@ -7,4 +7,71 @@ sdk: static
  pinned: false
  ---
 
- Edit this `README.md` markdown file to author your organization card.
+ # Hi there 👋
+
+ StabRise - Document Processing Solutions
+
+ # Our projects
+
+ ## PDF DataSource for Apache Spark
+
+ <a href="https://stabrise.com/spark-pdf/"><img alt="Spark Pdf" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/16/d6/16d6a0d6-f162-42ad-a5a3-7dc20361ad24/sparkpdf.png__1000x300_subsampling-2.webp" height="120"></a>
+
+ ---
+
+ **Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)
+
+ **Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)
+
+ **Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)
+
+ ---
+
+ This project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.
+
+ ### Key features:
+
+ - Reads PDF documents into a Spark DataFrame
+ - Supports lazy, per-page reading of PDF files
+ - Supports large files, up to 10k pages
+ - Supports scanned PDF files (via OCR)
+ - No need to install Tesseract OCR; it is included in the package
+
+ ## ScaleDP
+
+ <a href="https://stabrise.com/scaledp/"><img alt="ScaleDP" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/4a/7d/4a7d97c2-50d7-4b7a-9902-af2df9b574da/scaledplogo.png__1000x300_subsampling-2.webp" height="120" /></a>
+
+ ---
+
+ **Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp)
+
+ **Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/)
+
+ **Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb)
+
+ ---
+
+ ScaleDP is an open-source library for processing documents using Apache Spark.
+
+ ### Key features:
+
+ - Load PDF documents and images
+ - Extract text from PDF documents and images
+ - Extract images from PDF documents
+ - Run OCR on images and PDF documents
+ - Run NER on text extracted from PDF documents and images
+ - Visualize NER results
+
+
+ ## De-Identify
+
+ <a href="https://deidentify.online"><img alt="De-Identify" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/fb/fe/fbfe4b0c-dadb-4878-88ad-1c0ece0dc053/deidentifylogo.png__1000x300_subsampling-2.webp" height="120" /></a>
+
+ De-Identify is a tool for data de-identification and anonymization.
+
+ ### Supported formats
+ - Text
+ - Images
+ - PDF documents
+ - DICOM files
+