Nature | Article

Large language models encode clinical knowledge

Karan Singhal1,4, Shekoofeh Azizi1,4, Tao Tu1,4, S. Sara Mahdavi1, Jason Wei1, Hyung Won Chung1, Nathan Scales1, Ajay Tanwani1, Heather Cole-Lewis1, Stephen Pfohl1, Perry Payne1, Martin Seneviratne1, Paul Gamble1, Chris Kelly1, Abubakr Babiker1, Nathanael Schärli1, Aakanksha Chowdhery1, Philip Mansfield1, Dina Demner-Fushman2, Blaise Agüera y Arcas1, Dale Webster1, Greg S. Corrado1, Yossi Matias1, Katherine Chou1, Juraj Gottweis1, Nenad Tomasev3, Yun Liu1, Alvin Rajkomar1, Joelle Barral1, Christopher Semturs1, Alan Karthikesalingam1,5 & Vivek Natarajan1,5

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model[1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM[2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA[3], MedMCQA[4], PubMedQA[5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics[6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

https://doi.org/10.1038/s41586-023-06291-2
Received: 25 January 2023
Accepted: 5 June 2023
Published online: xx xx xxxx
Open access

1Google Research, Mountain View, CA, USA. 2National Library of Medicine, Bethesda, MD, USA. 3DeepMind, London, UK. 4These authors contributed equally: Karan Singhal, Shekoofeh Azizi, Tao Tu. 5These authors jointly supervised this work: Alan Karthikesalingam, Vivek Natarajan.

Medicine is a humane endeavour in which language enables key interactions for and between clinicians, researchers and patients. Yet, today's artificial intelligence (AI) models for applications in medicine and healthcare have largely failed to fully utilize language. These models, although useful, are predominantly single-task systems (for example, for classification, regression or segmentation) lacking expressivity and interactive capabilities[1-3]. As a result, there is a discordance between what today's models can do and what may be expected of them in real-world clinical workflows[4]. Recent advances in LLMs offer an opportunity to rethink AI systems, with language as a tool for mediating human-AI interaction. LLMs are foundation models[5], large pre-trained AI systems that can be repurposed with minimal effort across numerous domains and diverse tasks. These expressive and interactive models offer great promise in their ability to learn generally useful representations from the knowledge encoded in medical corpora, at scale. There are several exciting potential applications of such models in medicine, including knowledge retrieval, clinical decision support, summarization of key findings, triaging patients, addressing primary care concerns and more. However, the safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and capture and mitigate potential harms. This is especially important for LLMs, since these models may produce text generations (hereafter referred to as generations) that are misaligned with clinical and societal values. They may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities.

To evaluate how well LLMs encode clinical knowledge and assess their potential in medicine, we consider the answering of medical questions. This task is challenging: providing high-quality answers to medical questions requires comprehension of medical context, recall of appropriate medical knowledge, and reasoning with expert information. Existing medical question-answering benchmarks[6] are often limited to assessing classification accuracy or automated natural language generation metrics (for example, BLEU[7]) and do not enable the detailed analysis required for real-world clinical applications. This creates an unmet need for a broad medical question-answering benchmark to assess LLMs for their response factuality, use of expert knowledge in reasoning, helpfulness, precision, health equity and potential harm. To address this, we curate MultiMedQA, a benchmark comprising seven medical question-answering datasets, including six existing datasets: MedQA[6], MedMCQA[8], PubMedQA[9], LiveQA[10], MedicationQA[11] and MMLU clinical topics[12]. We introduce a seventh dataset, HealthSearchQA, which consists of commonly searched health questions. To assess LLMs using MultiMedQA, we build on PaLM, a 540-billion parameter (540B) LLM[13], and its instruction-tuned variant Flan-PaLM[14]. Using a combination of few-shot[15], chain-of-thought[16] (COT) and self-consistency[17] prompting strategies, Flan-PaLM achieves state-of-the-art performance on MedQA, MedMCQA, PubMedQA and MMLU clinical topics, often outperforming several strong LLM baselines by a substantial margin. On the MedQA dataset comprising USMLE-style questions, Flan-PaLM exceeds the previous state of the art by more than 17%.

Despite the strong performance of Flan-PaLM on multiple-choice questions, its answers to consumer medical questions reveal key gaps. To resolve this, we propose instruction prompt tuning, a data- and parameter-efficient alignment technique, to further adapt Flan-PaLM to the medical domain. The resulting model, Med-PaLM, performs encouragingly on the axes of our pilot human evaluation framework. For example, a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with scientific consensus, compared with 92.6% for Med-PaLM answers, on par with clinician-generated answers (92.9%). Similarly, 29.7% of Flan-PaLM answers were rated as potentially leading to harmful outcomes, in contrast to 5.9% for Med-PaLM, which was similar to the result for clinician-generated answers (5.7%).

Although these results are promising, the medical domain is complex. Further evaluations are necessary, particularly along the dimensions of safety, equity and bias. Our work demonstrates that many limitations must be overcome before
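The self-consistency strategy named above can be made concrete: sample several chain-of-thought completions for the same multiple-choice question at non-zero temperature, extract each completion's final answer, and take a majority vote. The following is a minimal sketch of that voting step only, not the paper's code; the `sample_completion` callable and the "the answer is (X)" answer format are illustrative assumptions standing in for an actual LLM call.

```python
import re
from collections import Counter

def self_consistency_answer(question, sample_completion, n_samples=11):
    """Majority-vote over sampled chain-of-thought completions.

    `sample_completion` is a hypothetical stand-in for a temperature-sampled
    LLM call that returns reasoning text ending in e.g. "... the answer is (B)."
    """
    votes = Counter()
    for _ in range(n_samples):
        completion = sample_completion(question)
        # Extract the final multiple-choice letter from the reasoning text.
        match = re.search(r"answer is \(?([A-E])\)?", completion)
        if match:
            votes[match.group(1)] += 1
    # The most frequent final answer wins; None if nothing parsed.
    return votes.most_common(1)[0][0] if votes else None
```

An odd `n_samples` reduces (but does not eliminate) ties among answer options; the vote is over final answers only, so divergent reasoning paths that reach the same option reinforce each other.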