BitcoinPaper / scraper /README.md
cgrumbach's picture
Upload 15 files
6adb5cf verified

bitcointalk_crawler


DataFrame Columns Description

1. start_edit

  • Description: This column represents the date when the post or content was initially created.
  • Type: Date (format: YYYY-MM-DD)
  • Example: 2013-11-02

2. last_edit

  • Description: This column represents the last date when the post or content was edited.
  • Type: Date (format: YYYY-MM-DD)
  • Example: 2013-11-02

3. author

  • Description: The user who created the post.
  • Type: String
  • Example: guyver

4. post

  • Description: The actual content or message of the post.
  • Type: String
  • Example: before we all get excited about the second batch...

5. topic

  • Description: The topic or title of the thread in which the post was made.
  • Type: String
  • Example: [EU/UK GROUP BUY] Blue Fury USB miner 2.2 ...

6. attachment

  • Description: Indicates whether the post has an attachment or not. A value of 1 means there's an attachment(image or video), and 0 means there isn't. In the website, it using img tag to show the emoji but seems not to be an attachment, such that it also ignring the emojis.
  • Type: Integer (0 or 1)
  • Example: 0
  • Note: The script 'attachment_fix.py' is run subsequent to the crawling process, as the initial values populated in this column post-crawling are not accurate.

7. link

  • Description: Indicates whether the post contains a link or not. A value of 1 means there's a link, and 0 means there isn't.
  • Type: Integer (0 or 1)
  • Example: 0

8. original_info

  • Description: This column contains raw HTML or metadata related to the post. It may contain styling and layout information.
  • Type: String (HTML format)
  • Example: <td class="td_headerandpost" height="100%" sty...

9. preprocessed_post

  • Description: Preprocessed of post column that for analysis or other tasks.
  • Type: String
  • Example: get excited second batch.let us wait first bat...

Usage

1. main.py and auto_crawl.sh

  • Description: The main.py script is the full script that is used to crawl the Bitcointalk forum with given the first board page. The auto_crawl.sh script is used to automate the process of running the main.py script.
  • example:
python main.py 
https://bitcointalk.org/index.php?board=40.0 # board url
--board mining_support # board name
-pages 183 # number of pages in the board

2. topic_craawling.py and auto_crawl_topic.sh

  • Description: The topic_crawling.py script is used to crawl exact topic from Bitcointalk forum with given the first page url of the topic. The auto_crawl_topic.sh script is used to automate the process of running the topic_craawling.py script.

  • example:

python topic_crawling.py 
https://bitcointalk.org/index.php?topic=168174.0 # topic url
--board miners # board name that topic belongs to
--num_of_pages 165 # total pages of this topic