---
title: Sponsorblock ML
emoji: 🤖
colorFrom: yellow
colorTo: indigo
sdk: streamlit
app_file: app.py
pinned: true
---
# SponsorBlock-ML

Automatically detect in-video YouTube sponsorships, self/unpaid promotions, and interaction reminders. The model was trained using the [SponsorBlock](https://sponsor.ajay.app/) [database](https://sponsor.ajay.app/database), used under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) licence.

Check out the online demo application at [https://xenova.github.io/sponsorblock-ml/](https://xenova.github.io/sponsorblock-ml/), or follow the instructions below to run it locally.
---

## Installation

1. Download the repository:
    ```bash
    git clone https://github.com/xenova/sponsorblock-ml.git
    cd sponsorblock-ml
    ```

2. Install the necessary dependencies:
    ```bash
    pip install -r requirements.txt
    ```

3. Run the application:
    ```bash
    streamlit run app.py
    ```
## Predicting

- Predict for a single video using the `--video_id` argument. For example:
    ```bash
    python src/predict.py --video_id zo_uoFI1WXM
    ```

- Predict for multiple videos using the `--video_ids` argument. For example:
    ```bash
    python src/predict.py --video_ids IgF3OX8nT0w ao2Jfm35XeE
    ```

- Predict for a whole channel using the `--channel_id` argument. For example:
    ```bash
    python src/predict.py --channel_id UCHnyfMqiRRG1u-2MsSQLbXA
    ```

Note that on the first run, the program will download the necessary models (which may take some time).
---

## Evaluating

### Measuring Accuracy
This is primarily used to measure the accuracy (and other metrics) of the model (defaults to [Xenova/sponsorblock-small](https://huggingface.co/Xenova/sponsorblock-small)).
```bash
python src/evaluate.py
```

In addition to the calculated metrics, missing and incorrect segments are output, allowing for improvements to be made to the database:

- Missing segments: Segments which were predicted by the model, but are not in the database.
- Incorrect segments: Segments which are in the database, but the model did not predict (meaning that the model thinks those segments are incorrect).

### Moderation

This can also be used to moderate parts of the database. To moderate the whole database, first run:

```bash
python src/preprocess.py --do_process_database --processed_database whole_database.json --min_votes -1 --min_views 0 --min_date 01/01/2000 --max_date 01/01/9999 --keep_duplicate_segments
```

followed by:

```bash
python src/evaluate.py --processed_file data/whole_database.json
```

The `--video_ids` and `--channel_id` arguments can also be used here. Remember to keep your database and processed database file up to date before running evaluations.
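For instance, evaluation could be restricted to a single channel by reusing the same `--channel_id` value shown in the Predicting examples above (a sketch; it assumes the processed database file has already been generated as described):

```bash
python src/evaluate.py --processed_file data/whole_database.json --channel_id UCHnyfMqiRRG1u-2MsSQLbXA
```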
---

## Training

### Preprocessing

1. Download the SponsorBlock database:
    ```bash
    python src/preprocess.py --update_database
    ```

2. Preprocess the database and generate the training, testing and validation data:
    ```bash
    python src/preprocess.py --do_transcribe --do_create --do_generate --do_split --model_name_or_path Xenova/sponsorblock-small
    ```

    1. `--do_transcribe` - Downloads and parses the transcripts from YouTube.
    2. `--do_create` - Processes the database (removing unwanted and duplicate segments) and creates the labelled dataset.
    3. `--do_generate` - Using the downloaded transcripts and labelled segment data, extracts positive (sponsors, unpaid/self-promos and interaction reminders) and negative (normal video content) text segments and creates large lists of input and target texts.
    4. `--do_split` - Using the generated positive and negative segments, splits them into training, validation and testing sets (according to the specified ratios).

Each of the above steps can be run independently (as separate commands, e.g. `python src/preprocess.py --do_transcribe`), but they should be performed in order.
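As a sketch, the step-by-step equivalent of the combined preprocessing command would look like this (some stages may additionally need the `--model_name_or_path` argument from the combined command above):

```bash
# Run each preprocessing stage as a separate command, in order:
python src/preprocess.py --do_transcribe
python src/preprocess.py --do_create
python src/preprocess.py --do_generate
python src/preprocess.py --do_split
```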
For more advanced preprocessing options, run `python src/preprocess.py --help`.

### Transformer

The transformer is used to extract relevant segments from the transcript and apply a preliminary classification to the extracted text. To start finetuning from the current checkpoint, run:
```bash
python src/train.py --model_name_or_path Xenova/sponsorblock-small
```

If you wish to finetune an original transformer model, use one of the supported models (*t5-small*, *t5-base*, *t5-large*, *t5-3b*, *t5-11b*, *google/t5-v1_1-small*, *google/t5-v1_1-base*, *google/t5-v1_1-large*, *google/t5-v1_1-xl*, *google/t5-v1_1-xxl*) as the `--model_name_or_path`. For more information, check out the relevant documentation ([t5](https://huggingface.co/docs/transformers/model_doc/t5) or [t5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)).
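For example, training from the base *t5-small* checkpoint instead of the current one would look like this (a sketch; all other training arguments are left at their defaults):

```bash
python src/train.py --model_name_or_path t5-small
```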
### Classifier

The classifier is used to add probabilities to the category predictions. Train the classifier using:

```bash
python src/train.py --train_classifier --skip_train_transformer
```