Clelia (Astra) Bertelli

as-cle-bert

AI & ML interests

Biology + Artificial Intelligence = ❤️ | AI for sustainable development, sustainable development for AI | Researching Machine Learning Enhancement | I love automation for everyday things | Blogger | Open Source

as-cle-bert's activity

replied to their post 10 days ago

Hi!

I generally use LangChain + PyPDF; here is a code snippet:

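from langchain_community.document_loaders import PyPDFLoader

def preprocess(pdf: str) -> list:
    """
    Uses LangChain's PyPDFLoader to extract text.
    Prints the text of each page and returns the loaded documents.
    """
    loader = PyPDFLoader(pdf)
    # load() returns a list of documents, one per page of the PDF
    documents = loader.load()
    for doc in documents:
        print(doc.page_content)
    # Return the documents so the declared list return type is honored
    return documents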

This should give a more solid result :)
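
If you need the whole text as a single string rather than page by page, a minimal sketch could look like the one below. Note that `join_pages` is a hypothetical helper name (it is not part of LangChain), and it only assumes that each item exposes a `page_content` attribute, as the documents returned by `PyPDFLoader.load()` do. The `SimpleNamespace` objects are stand-ins for illustration, so the snippet runs without a PDF file.

```python
from types import SimpleNamespace


def join_pages(documents, separator="\n\n"):
    """Concatenate the page_content of each document, skipping empty pages.

    Hypothetical helper: it only assumes every item has a `page_content`
    string attribute.
    """
    texts = [doc.page_content.strip() for doc in documents]
    return separator.join(text for text in texts if text)


# Stand-in objects for illustration; in practice, pass the list from loader.load()
pages = [
    SimpleNamespace(page_content="First page. "),
    SimpleNamespace(page_content=""),
    SimpleNamespace(page_content="Second page."),
]
print(join_pages(pages))
```

Skipping empty pages is a design choice for this sketch: scanned or image-only pages often yield no text, and dropping them avoids runs of blank separators in the result.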

PS: LangChain is distributed under an MIT license; see their GitHub repository (https://github.com/langchain-ai/langchain)

posted an update 11 days ago
🚀 New demo alert 🚀

Convert (almost) everything to PDF with PdfItDown, now on Spaces! 👉 as-cle-bert/pdfitdown

You can also install it locally:

python3 -m pip install pdfitdown


Don't forget to star it on GitHub if you find it useful! 👉 https://www.github.com/AstraBert/PdfItDown
