Introduction

Overview [1]

  • Definition: Instruction Pre-Training is a framework that augments raw corpora with instruction-response pairs to pre-train language models.

  • Supervised Multitask Learning: Augmenting raw text with instruction-response pairs recasts pre-training as supervised multitask learning, which is credited for the performance gains.

  • Efficiency: Instruction-response pairs are generated by an efficient Instruction Synthesizer built on open-source models.

  • Performance: Outperforms vanilla pre-training in both general pre-training from scratch and domain-adaptive continual pre-training.

  • Scalability: Scalable to large datasets, with experiments synthesizing 200M instruction-response pairs covering 40+ task categories.
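The augmentation idea above can be sketched in a few lines: a raw corpus chunk is kept as-is and followed by its synthesized instruction-response pairs, so ordinary next-token prediction over the concatenation doubles as supervised multitask learning. The template below is an illustrative assumption, not the framework's exact format.

```python
# Minimal sketch of instruction-augmented pre-training data.
# The "Q:/A:" template is an assumption for illustration only.

def augment_with_instructions(raw_text, pairs):
    """Concatenate a raw corpus chunk with its synthesized
    instruction-response pairs into one pre-training example."""
    parts = [raw_text.strip()]
    for instruction, response in pairs:
        parts.append(f"Q: {instruction}\nA: {response}")
    return "\n\n".join(parts)

example = augment_with_instructions(
    "The aorta is the largest artery in the human body.",
    [("What is the largest artery in the human body?", "The aorta.")],
)
print(example)
```

In the real pipeline these pairs come from the Instruction Synthesizer rather than being hand-written, but the resulting example has this shape: raw context first, tasks after.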

General Models [1]

  • InstructLM-500M: A general model pre-trained from scratch on 100B tokens.

  • InstructLM-1.3B: Another general model pre-trained from scratch on 100B tokens.

  • Performance: These models benefit more from subsequent instruction tuning than comparably sized models given vanilla pre-training.

  • Evaluation: General base models are evaluated using the lm-evaluation-harness framework.

  • Setup: Detailed setup instructions are provided for evaluating these models.
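As a concrete illustration of that evaluation setup, a typical lm-evaluation-harness invocation looks like the following; the model identifier and task list here are assumptions for illustration, so check the repository's evaluation notes for the exact configuration used.

```shell
# Hypothetical evaluation run; the model id and task list are
# illustrative assumptions, not the paper's exact configuration.
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=instruction-pretrain/InstructLM-1.3B \
  --tasks arc_easy,hellaswag \
  --batch_size 8
```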

Domain-Specific Models [1]

  • Finance-Llama3-8B: A domain-specific model continually pre-trained from Llama3-8B for financial tasks.

  • Biomedicine-Llama3-8B: A domain-specific model continually pre-trained from Llama3-8B for biomedical tasks.

  • Performance: These 8B models are comparable to, or even outperform, much larger models such as Llama3-70B.

  • Ethical Considerations: The finance portion is excluded from the released domain-specific instruction-augmented corpora to avoid ethical issues.

  • Applications: These models are tailored for specific domains, enhancing their performance in those areas.

Datasets [1]

  • General Instruction-Augmented Corpora: Used for pre-training general language models.

  • Domain-Specific Instruction-Augmented Corpora: Includes datasets for specific domains like medicine.

  • Exclusion: Finance data is excluded from domain-specific corpora to avoid ethical issues.

  • Sources: Augmented from existing corpora such as RefinedWeb.

  • Usage: These datasets are used to generate instruction-response pairs for pre-training.

Code and Resources [1]

  • GitHub Repository: Code available at https://github.com/microsoft/LMOps.

  • Instruction Synthesizer: Context-based instruction synthesizer available at Hugging Face.

  • Fine-Tuning Data: Available for the instruction synthesizer.

  • Evaluation Framework: lm-evaluation-harness framework used for model evaluation.

  • Community: Discussions and example usages are encouraged on the Hugging Face page.
