TxAgent: An AI Agent for Therapeutics

For live access to TxAgent, you can:

  • Access TxAgent if you have an account.
  • Request access if you do not have an account yet.

We look forward to your feedback!


When you join the Evaluate TxAgent study, you will:

  • See model responses to diverse prompts.
  • Provide instant thumbs-up or thumbs-down ratings.
  • Influence the roadmap for future releases.

Thank you for helping improve TxAgent!

Contact

For questions or suggestions, email Shanghua Gao and Marinka Zitnik.

TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools
TxAgent model

We gratefully acknowledge the support of NIH R01-HD108794, NSF CAREER 2339524, US DoD FA8702-15-D-0001, Harvard Data Science Initiative, Amazon Faculty Research, Google Research Scholar Program, AstraZeneca Research, Roche Alliance with Distinguished Scientists, Sanofi iDEA-iTECH, Pfizer Research, Gates Foundation (INV-079038), Chan Zuckerberg Initiative, John and Virginia Kaneb Fellowship at Harvard Medical School, Biswas Computational Biology Initiative in partnership with the Milken Institute, Harvard Medical School Dean's Innovation Fund for the Use of Artificial Intelligence, and Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders. We thank Owen Queen and Thomas Hartvigsen for their valuable discussions on this project and NVIDIA AI for providing access to DeepSeek R1 models.

Sign Up

Years of experience in clinical and/or research activities related to your biomedical expertise (required).

Click Next to start the study. Your progress will be saved after you submit each question. For questions or concerns, contact us directly. Thank you for participating!

Instructions:

Please review these instructions and enter your information to begin:

  • Plan to spend 5-10 minutes or more on each question.
  • You can evaluate multiple questions; you will never be asked to evaluate the same question twice.
  • For each question, compare the responses from two models and rate each on a scale of 1-5.
  • If a question is unclear or irrelevant to biomedicine, click the RED BUTTON at the top of the comparison page.
  • Use the Back and Next buttons to review and edit your answers before submission.
  • Use the Home Page button to return to the homepage; your progress will be saved but not submitted.
  • Submit your answers to the current question before moving on to the next.
  • You can pause between questions and return later; make sure your answers to the current question are submitted so they are saved.

Model A Response:

Model B Response:

Task success

Which response more fully and correctly accomplishes the therapeutic task—providing the intended recommendation accurately and without substantive errors or omissions?
Model A Response - Did the model successfully complete the therapeutic task it was given?
Model B Response - Did the model successfully complete the therapeutic task it was given?

Helpfulness of rationale

Which response offers a clearer, more detailed rationale that genuinely aids you in judging whether the answer is correct?
Model A Response - Is the model’s rationale helpful in determining whether the answer is correct?
Model B Response - Is the model’s rationale helpful in determining whether the answer is correct?

Cognitive traceability

In which response are the intermediate reasoning steps and decision factors laid out more transparently and logically, making it easy to follow how the final recommendation was reached?
Model A Response - Are the intermediate reasoning steps and decision factors interpretable and traceable?
Model B Response - Are the intermediate reasoning steps and decision factors interpretable and traceable?

Possibility of harm

Which response presents a lower likelihood of causing clinical harm, based on the safety and soundness of its recommendations and rationale?
Model A Response - Based on the model’s output and rationale, is there a risk that the recommendation could cause clinical harm?
Model B Response - Based on the model’s output and rationale, is there a risk that the recommendation could cause clinical harm?

Alignment with clinical consensus

Which response aligns better with clinical guidelines and practice standards?
Model A Response - Does the answer reflect established clinical practices and guidelines?
Model B Response - Does the answer reflect established clinical practices and guidelines?

Accuracy of content

Which response is more factually accurate and relevant, containing fewer (or no) errors or extraneous details?
Model A Response - Does the response contain any factual inaccuracies or irrelevant information?
Model B Response - Does the response contain any factual inaccuracies or irrelevant information?

Completeness

Which response is more comprehensive, covering all necessary therapeutic considerations without significant omissions?
Model A Response - Does the model provide a complete response covering all necessary elements?
Model B Response - Does the model provide a complete response covering all necessary elements?

Clinical relevance

Which response stays focused on clinically meaningful issues—such as appropriate drug choices, pertinent patient subgroups, and key outcomes—while minimizing tangential or less useful content?
Model A Response - Does the model focus on clinically meaningful aspects of the case (e.g., appropriate drug choices, patient subgroups, relevant outcomes)?
Model B Response - Does the model focus on clinically meaningful aspects of the case (e.g., appropriate drug choices, patient subgroups, relevant outcomes)?

You have no questions left to evaluate. Thank you for your participation!