Nature | Article

Large language models encode clinical knowledge

Karan Singhal1,4, Shekoofeh Azizi1,4, Tao Tu1,4, S. Sara Mahdavi1, Jason Wei1, Hyung Won Chung1, Nathan Scales1, Ajay Tanwani1, Heather Cole-Lewis1, Stephen Pfohl1, Perry Payne1, Martin Seneviratne1, Paul Gamble1, Chris Kelly1, Abubakr Babiker1, Nathanael Schärli1, Aakanksha Chowdhery1, Philip Mansfield1, Dina Demner-Fushman2, Blaise Agüera y Arcas1, Dale Webster1, Greg S. Corrado1, Yossi Matias1, Katherine Chou1, Juraj Gottweis1, Nenad Tomasev3, Yun Liu1, Alvin Rajkomar1, Joelle Barral1, Christopher Semturs1, Alan Karthikesalingam1,5 & Vivek Natarajan1,5

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries, and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model[1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM[2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA[3], MedMCQA[4], PubMedQA[5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics[6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

https://doi.org/10.1038/s41586-023-06291-2
Received: 25 January 2023
Accepted: 5 June 2023
Published online: xx xx xxxx
Open access

1Google Research, Mountain View, CA, USA. 2National Library of Medicine, Bethesda, MD, USA. 3DeepMind, London, UK. 4These authors contributed equally: Karan Singhal, Shekoofeh Azizi, Tao Tu. 5These authors jointly supervised this work: Alan Karthikesalingam, Vivek Natarajan.

Medicine is a humane endeavour in which language enables key interactions for and between clinicians, researchers and patients. Yet, today's artificial intelligence (AI) models for applications in medicine and healthcare have largely failed to fully utilize language. These models, although useful, are predominantly single-task systems (for example, for classification, regression or segmentation) lacking expressivity and interactive capabilities[1-3]. As a result, there is a discordance between what today's models can do and what may be expected of them in real-world clinical workflows[4]. Recent advances in LLMs offer an opportunity to rethink AI systems, with language as a tool for mediating human-AI interaction. LLMs are foundation models[5], large pre-trained AI systems that can be repurposed with minimal effort across numerous domains and diverse tasks. These expressive and interactive models offer great promise in their ability to learn generally useful representations from the knowledge encoded in medical corpora, at scale. There are several exciting potential applications of such models in medicine, including knowledge retrieval, clinical decision support, summarization of key findings, triaging patients, addressing primary care concerns and more. However, the safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and capture and mitigate potential harms. This is especially important for LLMs, since these models may produce text generations (hereafter referred to as generations) that are misaligned with clinical and societal values. They may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities.

To evaluate how well LLMs encode clinical knowledge and assess their potential in medicine, we consider the answering of medical questions. This task is challenging: providing high-quality answers to medical questions requires comprehension of medical context, recall of appropriate medical knowledge, and reasoning with expert information. Existing medical question-answering benchmarks[6] are often limited to assessing classification accuracy or automated natural language generation metrics (for example, BLEU[7]) and do not enable the detailed analysis required for real-world clinical applications. This creates an unmet need for a broad medical question-answering benchmark to assess LLMs for their response factuality, use of expert knowledge in reasoning, helpfulness, precision, health equity and potential harm. To address this, we curate MultiMedQA, a benchmark comprising seven medical question-answering datasets, including six existing datasets: MedQA[6], MedMCQA[8], PubMedQA[9], LiveQA[10], MedicationQA[11] and MMLU clinical topics[12]. We introduce a seventh dataset, HealthSearchQA, which consists of commonly searched health questions. To assess LLMs using MultiMedQA, we build on PaLM, a 540-billion parameter (540B) LLM[13], and its instruction-tuned variant Flan-PaLM[14]. Using a combination of few-shot[15], chain-of-thought[16] (COT) and self-consistency[17] prompting strategies, Flan-PaLM achieves state-of-the-art performance on MedQA, MedMCQA, PubMedQA and MMLU clinical topics, often outperforming several strong LLM baselines by a substantial margin. On the MedQA dataset comprising USMLE-style questions, Flan-PaLM exceeds the previous state of the art by more than 17%.

Despite the strong performance of Flan-PaLM on multiple-choice questions, its answers to consumer medical questions reveal key gaps. To resolve this, we propose instruction prompt tuning, a data- and parameter-efficient alignment technique, to further adapt Flan-PaLM to the medical domain. The resulting model, Med-PaLM, performs encouragingly on the axes of our pilot human evaluation framework. For example, a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with scientific consensus, compared with 92.6% for Med-PaLM answers, on par with clinician-generated answers (92.9%). Similarly, 29.7% of Flan-PaLM answers were rated as potentially leading to harmful outcomes, in contrast to 5.9% for Med-PaLM, which was similar to the result for clinician-generated answers (5.7%).

Although these results are promising, the medical domain is complex. Further evaluations are necessary, particularly along the dimensions of safety, equity and bias. Our work demonstrates that many limitations must be overcome before
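The self-consistency strategy named above can be made concrete: sample several chain-of-thought completions for the same multiple-choice question at non-zero temperature, extract each completion's final answer, and take a majority vote. The following is a minimal sketch of that voting step only, not the paper's code; the `sample_completion` callable and the "the answer is (X)" answer format are illustrative assumptions standing in for an actual LLM call.

```python
import re
from collections import Counter

def self_consistency_answer(question, sample_completion, n_samples=11):
    """Majority-vote over sampled chain-of-thought completions.

    `sample_completion` is a hypothetical stand-in for a temperature-sampled
    LLM call that returns reasoning text ending in e.g. "... the answer is (B)."
    """
    votes = Counter()
    for _ in range(n_samples):
        completion = sample_completion(question)
        # Extract the final multiple-choice letter from the reasoning text.
        match = re.search(r"answer is \(?([A-E])\)?", completion)
        if match:
            votes[match.group(1)] += 1
    # The most frequent final answer wins; None if nothing parsed.
    return votes.most_common(1)[0][0] if votes else None
```

An odd `n_samples` reduces (but does not eliminate) ties among answer options; the vote is over final answers only, so divergent reasoning paths that reach the same option reinforce each other.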