Introduction
- Instruction Pre-Training: A framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models.
- General Models: Pre-trained from scratch on 100B tokens, including InstructLM-500M and InstructLM-1.3B.
- Domain-Specific Models: Continually pre-trained from Llama3-8B, including Finance-Llama3-8B and Biomedicine-Llama3-8B.
- Datasets: Include the General Instruction-Augmented Corpora and the Domain-Specific Instruction-Augmented Corpora.
- Code and Resources: Available on Hugging Face and GitHub, with detailed instructions for setup and evaluation.
Overview [1]
- Definition: Instruction Pre-Training is a framework that augments raw corpora with instruction-response pairs and pre-trains language models on the result.
- Supervised Multitask Learning: The approach casts pre-training as supervised multitask learning over the synthesized pairs to improve model performance.
- Efficiency: Instruction-response pairs are generated by an efficient Instruction Synthesizer built on open-source models; a usage sketch follows this list.
- Performance: Outperforms vanilla pre-training both when pre-training from scratch and in domain-adaptive continual pre-training.
- Scalability: Scales to large corpora; the experiments synthesize 200M instruction-response pairs covering 40+ task categories.
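To make the synthesis step concrete, here is a minimal sketch of querying the Instruction Synthesizer with the Hugging Face `transformers` library. The model ID and the prompt template are assumptions rather than details taken from the source; consult the model card for the exact input format.

```python
# Sketch: generate instruction-response pairs from a raw text with the
# Instruction Synthesizer. Model ID and prompt template are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "instruction-pretrain/instruction-synthesizer"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

raw_text = "Photosynthesis converts light energy into chemical energy stored in glucose."

# Illustrative prompt: ask for instruction-response pairs grounded in the text.
prompt = f"{raw_text}\n\nGenerate instruction-response pairs based on the text above:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In the framework, this step runs over the entire raw corpus so that each document is paired with synthesized tasks before pre-training begins.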
General Models [1]
- InstructLM-500M: A general model pre-trained from scratch on 100B tokens.
- InstructLM-1.3B: A larger general model, also pre-trained from scratch on 100B tokens.
- Performance: These models benefit more from further instruction tuning than vanilla pre-trained models.
- Evaluation: General base models are evaluated with the lm-evaluation-harness framework; a sketch follows this list.
- Setup: Detailed setup instructions for evaluating these models are provided.
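An evaluation run with lm-evaluation-harness can be scripted in Python. This is a sketch assuming the v0.4+ Python API (`lm_eval.simple_evaluate`) and an assumed Hugging Face repository ID for InstructLM-1.3B; the task list is illustrative, not the benchmark suite used in the source.

```python
# Sketch: evaluate a general base model with lm-evaluation-harness.
# Assumes lm-eval >= 0.4 and an assumed model repository ID.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=instruction-pretrain/InstructLM-1.3B,dtype=float16",
    tasks=["arc_easy", "hellaswag"],  # illustrative tasks only
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```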
Domain-Specific Models [1]
- Finance-Llama3-8B: A domain-specific model continually pre-trained from Llama3-8B for financial tasks.
- Biomedicine-Llama3-8B: A domain-specific model continually pre-trained from Llama3-8B for biomedical tasks.
- Performance: On their target domains, these 8B models are comparable to, or even outperform, much larger models such as Llama3-70B.
- Ethical Considerations: Finance data is excluded from the released domain-specific instruction-augmented corpora to avoid ethical issues.
- Applications: Tailoring continual pre-training to a domain improves performance on that domain's tasks; a loading sketch follows this list.
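The domain-adapted checkpoints can be used like any other causal language model. Below is a minimal inference sketch with `transformers`; the repository ID and prompt are assumptions, so check the Hugging Face organization page for the exact model names.

```python
# Sketch: run inference with a domain-adapted model.
# The repository ID below is assumed, not confirmed by the source.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "instruction-pretrain/finance-Llama3-8B"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Question: What does an inverted yield curve typically signal?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```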
Datasets [1]
- General Instruction-Augmented Corpora: Used for pre-training the general language models from scratch.
- Domain-Specific Instruction-Augmented Corpora: Cover specific domains such as medicine.
- Exclusion: Finance data is excluded from the released domain-specific corpora to avoid ethical issues.
- Sources: Augmented from existing datasets such as the RefinedWeb corpus.
- Usage: Each release pairs raw texts with synthesized instruction-response pairs and serves directly as pre-training data; a loading sketch follows this list.
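The corpora are published on the Hugging Face Hub and can be streamed with the `datasets` library. The dataset ID and record schema below are assumptions; check the dataset card for the actual names and fields.

```python
# Sketch: stream the general instruction-augmented corpus from the Hub.
# Dataset ID and field layout are assumed, not confirmed by the source.
from datasets import load_dataset

ds = load_dataset(
    "instruction-pretrain/general-instruction-augmented-corpora",  # assumed ID
    split="train",
    streaming=True,  # the corpus is large, so avoid a full download
)

for example in ds.take(3):
    # Each record is expected to hold a raw text plus its synthesized
    # instruction-response pairs.
    print(example)
```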
Code and Resources [1]
- GitHub Repository: Code available at https://github.com/microsoft/LMOps.
- Instruction Synthesizer: The context-based instruction synthesizer is available on Hugging Face.
- Fine-Tuning Data: The data used to fine-tune the instruction synthesizer is also available.
- Evaluation Framework: The lm-evaluation-harness framework is used for model evaluation.
- Community: Discussions and example usages are encouraged on the Hugging Face page.
Related Videos
- How to Train Your Own Large Language Models (Jul 26, 2023): https://www.youtube.com/watch?v=5qlLJrv_q-Q
- Fine-tuning Large Language Models (LLMs) | w/ Example Code (Oct 1, 2023): https://www.youtube.com/watch?v=eC6Hd1hFvos
- Training and deploying open-source large language models (Jan 8, 2024): https://www.youtube.com/watch?v=Ma4clS-IdhA