Christopher Schröder
cschroeder's activity
Short summary: We need your support for a web survey in which we investigate how recent advancements in natural language processing, particularly LLMs, have influenced the need for labeled data in supervised machine learning, with a focus on, but not limited to, active learning. See the original post for details.
➡️ Extended Deadline: January 26th, 2025.
Please consider participating or sharing our survey! (If you have any experience with supervised learning in natural language processing, you are eligible to participate in our survey.)
Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Just a quick note: I will not enter any ideological debates here again.
First off, I think this is a non-issue regardless of which license we use. This is first and foremost a scientific study, and the dataset we're producing is more of a byproduct; its main purpose is to help other researchers verify our findings. There seem to be some misconceptions about this dataset: think of it as a table of answer codes. It is not a text dataset and is therefore neither interesting nor useful for LLM training (or similar).
Second, we made this decision because the survey doesn't have any funding and relies on people generously sharing their opinions (without compensation). Given the growing skepticism around data collection, we wanted to be especially careful not to discourage users from participating. Our primary goal is to conduct a study with a population as diverse as possible, and we did not want to lose potential participants who might be less inclined to give away their data without compensation.
Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Estimated time required: 5–15 minutes
Deadline for participation: January 12, 2025
❤️ We're seeking responses from across the globe! If you know 1–3 people who might qualify for this survey, particularly those in different regions, please share it with them. We'd really appreciate it!
#NLProc #ActiveLearning #ML
Are you working on Natural Language Processing tasks, and have you faced the challenge of a lack of labeled data before? We are currently conducting a survey to explore the strategies used to address this bottleneck, especially in the context of recent advancements, including but not limited to large language models.
The survey is non-commercial and conducted solely for academic research purposes. The results will contribute to an open-access publication that also benefits the community.
With only 5–15 minutes of your time, you would greatly help us investigate which strategies the #NLP community uses to overcome a lack of labeled data.
❤️ How you can help even more: If you know others working on supervised learning and NLP, please share this survey with them. We'd really appreciate it!
Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Estimated time required: 5–15 minutes
Deadline for participation: January 12, 2025
#NLP #ML
With small language models on the rise, the new version of small-text has been long overdue! Despite the generative AI hype, many real-world tasks still rely on supervised learning, which in turn depends on labeled data.
Highlights:
- Four new query strategies: Try even more combinations than before.
- Vector indices integration: HNSW and KNN indices are now available via a unified interface and can easily be used within your code.
- Simplified installation: We dropped the torchtext dependency and cleaned up a lot of interfaces.
GitHub: https://github.com/webis-de/small-text
Try it out for yourself! We are eager to hear your feedback.
Share your small-text applications and experiments in the newly added showcase section.
Support the project by leaving a star on the repo!
#activelearning #nlproc #machinelearning
Paper (at HF): https://huggingface.co/papers/2406.09206
Paper (in the ACL Anthology): https://aclanthology.org/2024.emnlp-main.669/
Code: https://github.com/chschroeder/self-training-for-sample-efficient-active-learning
In this work, we leverage self-training in an active learning loop in order to train small language models with even less data. Hope to see you there!
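To give a rough idea of how self-training and active learning can be combined, here is a toy sketch. It is not the method from the paper: the nearest-centroid "model", the 1D data, the `oracle` function standing in for a human annotator, and the margin threshold of 6.0 are all assumptions made purely for illustration.

```python
def fit_centroids(points, labels):
    """Toy nearest-centroid 'model': one mean value per class."""
    cents = {}
    for c in set(labels):
        vals = [p for p, l in zip(points, labels) if l == c]
        cents[c] = sum(vals) / len(vals)
    return cents


def predict_with_margin(cents, x):
    """Return (predicted label, margin); a small margin means an uncertain prediction."""
    dists = sorted((abs(x - m), c) for c, m in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def oracle(x):
    # Hypothetical stand-in for a human annotator.
    return 1 if x >= 5.0 else 0


pool = [0.5, 1.0, 2.0, 4.0, 4.8, 5.3, 6.0, 8.0, 9.0, 9.5]  # unlabeled data
labeled_x, labeled_y = [0.0, 10.0], [0, 1]                 # small seed set
human_labels = 0

for _ in range(3):  # three active learning iterations
    cents = fit_centroids(labeled_x, labeled_y)

    # Active learning step: ask the oracle about the most uncertain instance.
    query = min(pool, key=lambda x: predict_with_margin(cents, x)[1])
    pool.remove(query)
    labeled_x.append(query)
    labeled_y.append(oracle(query))
    human_labels += 1

    # Self-training step: retrain, then pseudo-label confident pool instances.
    cents = fit_centroids(labeled_x, labeled_y)
    confident = [x for x in pool if predict_with_margin(cents, x)[1] >= 6.0]
    for x in confident:
        pool.remove(x)
        labeled_x.append(x)
        labeled_y.append(predict_with_margin(cents, x)[0])
```

In this toy run, only three labels come from the oracle, while the remaining training labels are pseudo-labels assigned by the model itself; this is the sense in which self-training can reduce the amount of human-labeled data.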
Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
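To make the idea concrete, here is a minimal pure-Python sketch of hard negative mining on toy 2D embeddings. It does not call or reproduce the Sentence Transformers implementation: the helper `mine_hard_negatives_sketch`, its parameters, and the example vectors are hypothetical and only illustrate the principle of picking the most similar non-matching candidates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives_sketch(anchor, positive_idx, candidates, num_negatives=2):
    """Hypothetical helper: indices of the candidates most similar to the
    anchor, excluding the known correct match."""
    scored = [
        (cosine(anchor, vec), idx)
        for idx, vec in enumerate(candidates)
        if idx != positive_idx
    ]
    scored.sort(reverse=True)  # most similar first
    return [idx for _, idx in scored[:num_negatives]]


anchor = [1.0, 0.0]   # toy embedding of the query
candidates = [
    [0.9, 0.1],       # 0: the correct match (positive)
    [0.8, 0.3],       # 1: very similar, but not the match
    [0.0, 1.0],       # 2: unrelated, an "easy" negative
    [0.7, 0.6],       # 3: somewhat similar
]
hard = mine_hard_negatives_sketch(anchor, positive_idx=0, candidates=candidates)
print(hard)  # [1, 3]
```

The unrelated candidate (index 2) is left out because it is trivially distinguishable from the positive and would contribute little to training.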
mine_hard_negatives docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives
Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows, as not all third-party libraries had been updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.
Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1
I'm looking forward to releasing v3.2; I have some exciting things planned.
Did not know text-splitter yet, thanks!
Mine are:
- https://github.com/benbrandt/text-splitter (Rust/Python, battle-tested, Wasm version coming soon)
- https://github.com/umarbutler/semchunk (Python, really performant but some issues with huge docs)
I tried the huge Jina AI regex, but it failed for my (admittedly messy) documents, e.g. from EUR-LEX. Their free segmenter API is really cool but unfortunately times out on my huge docs (~100 pages): https://jina.ai/segmenter/
Also, I tried to write a vanilla JS chunker with simple, adjustable hierarchical logic (inspired by the above). I think it does a decent job for so few lines of code: https://do-me.github.io/js-text-chunker/
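For readers who want a feel for what hierarchical chunking means, here is a rough Python sketch of the general idea. It is not a port of the JS chunker linked above: the function name `chunk_text`, the separator order, and the `max_chars` limit are assumptions chosen for illustration.

```python
def chunk_text(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text recursively: try coarse separators first (paragraphs),
    and fall back to finer ones (lines, sentences, words) only when a
    piece is still longer than max_chars."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []

    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= max_chars:
                current = candidate          # keep filling the current chunk
                continue
            if current:
                chunks.append(current)       # separator at the boundary is dropped
            if len(piece) <= max_chars:
                current = piece
            else:
                # Piece is still too long: recurse with the finer separators.
                current = ""
                chunks.extend(chunk_text(piece, max_chars, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks

    # No separator applies: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


text = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda mu."
chunks = chunk_text(text, max_chars=20)
```

The separator tuple is the adjustable part: reordering it or adding entries (for example, headings) changes which boundaries the chunker prefers.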
Happy to hear your thoughts!