Introduction

  • The project uses several medical datasets available on Hugging Face.

  • These datasets cover question-answer pairs, doctor-patient dialogues, COVID-19 research, and causal language in scientific findings.

  • The datasets covered here are MEDIQA, Medical Dialog, CORD-19, MedQA, PubMed Health Advice, PubMed Causal, and ChatDoctor.

MEDIQA [1]

  • Description: MEDIQA is a dataset of manually generated, question-driven summaries of multi- and single-document answers to consumer health questions.

  • Availability: The dataset is available from its original release and via Hugging Face.

  • Citation: Savery, Max et al. (2020). 'Question-driven summarization of answers to consumer health questions'. Scientific Data, 7(1), 322.

  • Usage: Useful for training models on summarizing medical information in response to health-related questions.
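
As a minimal sketch of the usage above, the snippet below turns one question-driven summarization record into an (input, target) pair for a sequence-to-sequence model. The field names `question`, `answer`, and `summary`, the prompt layout, and the sample record are all hypothetical and chosen for illustration; the actual schema should be checked against the dataset card.

```python
from typing import Dict, Tuple


def to_summarization_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Build (model_input, target) from one record.

    Assumes hypothetical keys 'question', 'answer', and 'summary'.
    """
    question = record["question"].strip()
    answer = record["answer"].strip()
    summary = record["summary"].strip()
    # Put the question first so the summary is conditioned on it.
    model_input = f"question: {question}\nanswer: {answer}\nsummarize:"
    return model_input, summary


# Made-up record for illustration only; it is not taken from the dataset.
example = {
    "question": " What is this leaflet about? ",
    "answer": "The leaflet describes general information for consumers.",
    "summary": "General consumer information.",
}
model_input, target = to_summarization_pair(example)
print(model_input)
print(target)
```

Keeping the question in the input is the point of a question-driven setup: the same answer text could be summarized differently depending on what was asked.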

Medical Dialog [2]

  • Description: The MedDialog dataset contains conversations between doctors and patients, with 0.26 million dialogues in English.

  • Sources: The raw dialogues are from healthcaremagic.com and icliniq.com.

  • Availability: The dataset can be downloaded from Google Drive and used via Hugging Face.

  • Citation: Chen, Shu et al. (2020). 'MedDialog: a large-scale medical dialogue dataset'. arXiv preprint arXiv:2004.03329.

  • Usage: Ideal for training models on medical dialogue systems and conversational AI in healthcare.
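
One common way to prepare dialogue data for training a response model is to split each conversation into (context, response) pairs. The sketch below assumes a dialogue is already parsed into `(speaker, utterance)` tuples with speaker labels `"patient"` and `"doctor"`; that representation and the function name `dialogue_to_pairs` are assumptions for illustration, not the dataset's actual format.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def dialogue_to_pairs(turns: List[Turn], responder: str = "doctor") -> List[Tuple[str, str]]:
    """Return one (context, response) pair per turn spoken by `responder`.

    The context is every earlier turn joined by newlines. A responder
    turn with no preceding history is skipped, since it has no context.
    """
    pairs: List[Tuple[str, str]] = []
    history: List[str] = []
    for speaker, utterance in turns:
        text = utterance.strip()
        if speaker == responder and history:
            pairs.append(("\n".join(history), text))
        history.append(f"{speaker}: {text}")
    return pairs


# Made-up dialogue for illustration only.
sample = [
    ("patient", "Hi"),
    ("doctor", "Hello"),
    ("patient", "I have a question"),
    ("doctor", "Go ahead"),
]
for context, response in dialogue_to_pairs(sample):
    print(repr(context), "->", repr(response))
```

Each doctor turn becomes its own training example, so a dialogue with several doctor turns yields several pairs with progressively longer contexts.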

CORD-19 [1]

  • Description: The COVID-19 Open Research Dataset (CORD-19) is a resource of over 1,000,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses.

  • Sources: The dataset is provided by a coalition of leading research groups in response to the COVID-19 pandemic.

  • Availability: The dataset can be accessed on Kaggle or via Hugging Face.

  • Citation: Wang, Lucy Lu et al. (2020). 'CORD-19: The COVID-19 Open Research Dataset'. Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020.

  • Usage: Useful for research and development in natural language processing and AI techniques to generate insights on COVID-19.

MedQA [1]

  • Description: MedQA is a large-scale open domain question answering dataset from medical exams.

  • Sources: The dataset includes QAs from the US, Mainland China, and Taiwan, and is available in English, simplified Chinese, and traditional Chinese.

  • Availability: The dataset can be downloaded from Google Drive and used via Hugging Face.

  • Citation: Jin, Di et al. (2020). 'What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams'. arXiv preprint arXiv:2009.13081.

  • Usage: Ideal for training models on medical question answering and diagnostic systems.
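
For exam-style question answering, a record is often rendered as a multiple-choice prompt with the answer letter as the target. The following is a minimal sketch under that assumption; the function `format_mcq`, the option-letter keys, and the prompt layout are hypothetical and do not reflect the dataset's actual schema.

```python
from typing import Dict


def format_mcq(question: str, options: Dict[str, str], answer_key: str) -> Dict[str, str]:
    """Render a multiple-choice item as a prompt/target pair.

    `options` maps option letters (e.g. "A") to option text;
    `answer_key` must be one of those letters.
    """
    if answer_key not in options:
        raise ValueError(f"answer_key {answer_key!r} is not among the options")
    lines = [question.strip()]
    # Sort by letter so the option order is deterministic.
    for key in sorted(options):
        lines.append(f"{key}. {options[key].strip()}")
    lines.append("Answer:")
    return {"prompt": "\n".join(lines), "target": answer_key}


# Placeholder item for illustration only; not taken from the dataset.
item = format_mcq("Which option is listed first?", {"B": "second", "A": "first"}, "A")
print(item["prompt"])
print(item["target"])
```

Validating the answer key up front catches malformed records early, before they end up in a training set.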

PubMed Health Advice [1]

  • Description: This dataset is used in the paper 'Detecting Causal Language Use in Science Findings'.

  • Sources: The dataset includes health advice extracted from PubMed articles.

  • Availability: The prepared dataset is available via Hugging Face.

  • Citation: Yu, Bei et al. (2019). 'Detecting Causal Language Use in Science Findings'. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

  • Usage: Useful for research on detecting health advice in scientific literature.

PubMed Causal [1]

  • Description: This dataset is used in the paper 'Detecting Causal Language Use in Science Findings'.

  • Sources: The dataset includes causal language extracted from PubMed articles.

  • Availability: The prepared dataset is available via Hugging Face.

  • Citation: Yu, Bei et al. (2019). 'Detecting Causal Language Use in Science Findings'. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

  • Usage: Useful for research on causal language detection in scientific literature.

ChatDoctor [1]

  • Description: ChatDoctor is a medical chat model fine-tuned on the LLaMA model using medical domain knowledge.

  • Sources: The dataset includes medical dialogues and is used in the paper 'ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge'.

  • Availability: The dataset is available on GitHub.

  • Citation: Li, Yunxiang et al. (2023). 'ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge'. arXiv preprint arXiv:2303.14070.

  • Usage: Ideal for training models on medical chat systems and conversational AI in healthcare.

Related Videos

  • How to Create Custom Datasets To Train Llama-2 (YouTube, Aug 9, 2023): https://www.youtube.com/watch?v=z2QE12p3kMM