File size: 3,035 Bytes
6adb5cf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
# bitcointalk_crawler

---

## DataFrame Columns Description

### 1. `start_edit`
- **Description**: This column represents the date when the post or content was initially created.
- **Type**: Date (format: YYYY-MM-DD)
- **Example**: `2013-11-02`

### 2. `last_edit`
- **Description**: This column represents the last date when the post or content was edited.
- **Type**: Date (format: YYYY-MM-DD)
- **Example**: `2013-11-02`

### 3. `author`
- **Description**: The user who created the post.
- **Type**: String
- **Example**: `guyver`

### 4. `post`
- **Description**: The actual content or message of the post.
- **Type**: String
- **Example**: `before we all get excited about the second batch...`

### 5. `topic`
- **Description**: The topic or title of the thread in which the post was made.
- **Type**: String
- **Example**: `[EU/UK GROUP BUY] Blue Fury USB miner 2.2 ...`

### 6. `attachment`
- **Description**:  Indicates whether the post has an attachment or not. A value of `1` means there's an attachment(image or video), and `0` means there isn't. In the website, it using img tag to show the emoji but seems not to be an attachment, such that it also ignring the emojis.
- **Type**: Integer (0 or 1)
- **Example**: `0`
- **Note**: The script 'attachment_fix.py' is run subsequent to the crawling process, as the initial values populated in this column post-crawling are not accurate.

### 7. `link`
- **Description**: Indicates whether the post contains a link or not. A value of `1` means there's a link, and `0` means there isn't.
- **Type**: Integer (0 or 1)
- **Example**: `0`

### 8. `original_info`
- **Description**: This column contains raw HTML or metadata related to the post. It may contain styling and layout information.
- **Type**: String (HTML format)
- **Example**: `<td class="td_headerandpost" height="100%" sty...`

### 9. `preprocessed_post`
- **Description**: Preprocessed of `post` column that for analysis or other tasks.
- **Type**: String
- **Example**: `get excited second batch.let us wait first bat...`

---

## Usage

### 1. `main.py` and `auto_crawl.sh`
- **Description**: The `main.py` script is the full script that is used to crawl the Bitcointalk forum with given the first board page. The `auto_crawl.sh` script is used to automate the process of running the `main.py` script.
- **example**: 
```python 
python main.py 
https://bitcointalk.org/index.php?board=40.0 # board url
--board mining_support # board name
-pages 183 # number of pages in the board
```

### 2. `topic_craawling.py` and `auto_crawl_topic.sh`

- **Description**: The `topic_crawling.py` script is used to crawl exact topic from  Bitcointalk forum with given the first  page url of the topic. The `auto_crawl_topic.sh` script is used to automate the process of running the `topic_craawling.py` script.

- **example**: 
```python
python topic_crawling.py 
https://bitcointalk.org/index.php?topic=168174.0 # topic url
--board miners # board name that topic belongs to
--num_of_pages 165 # total pages of this topic
```