Senior Project Advisor
Large Language Models, LLM, LLMs, Economics, ML Evaluation, ML, AI
This paper describes a novel dataset, EconQA, constructed to assess the performance of large language models on multiple-choice economics questions. I present results from 10 experiments that vary the prompt and the model. The results challenge previous findings that prompt choice has a large impact on response quality. Using the GPT-3.5 Turbo model, observed performance ranged from 70% to 77% across all prompt choices, with the no-prompt baseline scoring 73%. When prompted to use Chain-of-Thought reasoning with examples, performance was highest at 76%. Contrary to previous research, performance on mathematical questions was high when the model was prompted with Chain-of-Thought. The paper closes with an analysis of the types of questions the models answered best and of their common errors.
Van Patten, Tate, "Evaluating Domain Specific LLM Performance Within Economics Using the Novel EconQA Dataset" (2023). WWU Honors College Senior Projects. 657.
Copying of this document in whole or in part is allowable only for scholarly purposes. It is understood, however, that any copying or publication of this document for commercial purposes, or for financial gain, shall not be allowed without the author’s written permission.