Fine-Tuning Llama-2 vs GPT-4. Are 7B params enough to challenge OpenAI?

Anel Music
2 min read · Sep 3, 2023

GPT-4 is proficient in a wide variety of tasks, yet it doesn’t necessarily excel in every specific domain. Can a fine-tuned smaller model like Llama-7b outperform GPT-4, and if so, to what extent?

A study conducted by Kourosh Hakhamaneshi and Rehaan Ahmad (see: https://shorturl.at/hPQT7) provides a thorough analysis of fine-tuning Llama-7b and Llama-13b models on several niche use-cases.
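
Before looking at the individual use-cases, here is a minimal sketch of what such a parameter-efficient (LoRA) fine-tuning run can look like with Hugging Face transformers and peft. This is illustrative only, not the study’s actual setup: the model name, the stand-in data, and all hyperparameters are assumptions.

```python
# Minimal LoRA fine-tuning sketch; illustrative only, not the study's setup.
# Assumes access to the gated meta-llama/Llama-2-7b-hf weights, plus the
# transformers, peft, datasets, and accelerate packages and a GPU.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 ships without a pad token

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# Stand-in data: any instruction-style (input, output) pairs work here.
pairs = Dataset.from_list([
    {"text": "Input: Dirt: Showdown is a racing game.\n"
             "Output: inform(name[Dirt: Showdown], genres[racing])"},
])
tokenized = pairs.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels (causal LM loss)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(output_dir="llama2-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=3),
)
trainer.train()
```

The study covers three such use-cases: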

𝐔𝐬𝐞-𝐂𝐚𝐬𝐞 1: Transforming natural-language text into structured sets of attribute-values (Functional Representation of Unstructured Text)
𝐃𝐚𝐭𝐚𝐬𝐞𝐭: ViGGO
𝐄𝐱𝐞𝐦𝐩𝐥𝐚𝐫𝐲 𝐈𝐧𝐩𝐮𝐭:
“Dirt: Showdown from 2012 is a sport racing game for the Playstation”
𝐈𝐥𝐥𝐮𝐬𝐭𝐫𝐚𝐭𝐢𝐯𝐞 𝐎𝐮𝐭𝐩𝐮𝐭:
inform(name[Dirt: Showdown], release_year[2012], genres[driving/racing,sport], platforms[Playstation])
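
To make the task concrete, here is a small sketch of how such (text, representation) training pairs can be assembled from the public GEM/viggo dataset on Hugging Face. The column names follow that dataset’s card and are assumptions, not taken from the study:

```python
# Sketch: turning a ViGGO record into a text -> meaning-representation pair.
# Column names follow the public GEM/viggo dataset card (an assumption here).
from datasets import load_dataset

viggo = load_dataset("GEM/viggo", split="train")
example = viggo[0]

prompt = (
    "Convert the sentence into a functional representation:\n"
    f"{example['target']}\n"  # the natural-language sentence
    "Representation: "
)
completion = example["meaning_representation"]  # e.g. inform(name[...], ...)
print(prompt + completion)
```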

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
As the study’s accuracy figures show, both the 7b and 13b models improve substantially through fine-tuning, notably surpassing the baseline performance of GPT-4.

𝐔𝐬𝐞-𝐂𝐚𝐬𝐞 2: SQL Query Generation from Natural Language
𝐃𝐚𝐭𝐚𝐬𝐞𝐭: Hugging Face b-mc2/sql-create-context
𝐄𝐱𝐞𝐦𝐩𝐥𝐚𝐫𝐲 𝐈𝐧𝐩𝐮𝐭:
“Name the result for week less than 7 and game sites of los angeles memorial coloseum from table table_name_25”
𝐈𝐥𝐥𝐮𝐬𝐭𝐫𝐚𝐭𝐢𝐯𝐞 𝐎𝐮𝐭𝐩𝐮𝐭:
SELECT result FROM table_name_25 WHERE week < 7 AND game_site = "los angeles memorial coloseum"
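
A sketch of how a training prompt might be built from this dataset, which pairs each question and its CREATE TABLE context with the target SQL query (field names per the dataset card):

```python
# Sketch: building a text-to-SQL training prompt from b-mc2/sql-create-context.
# Each row pairs a question and a CREATE TABLE "context" with the target SQL.
from datasets import load_dataset

sql_data = load_dataset("b-mc2/sql-create-context", split="train")
row = sql_data[0]

prompt = (
    f"Schema: {row['context']}\n"
    f"Question: {row['question']}\n"
    "SQL: "
)
print(prompt + row["answer"])
```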

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
Once again, both the fine-tuned Llama-7b and Llama-13b models outperform GPT-4, although the gap between GPT-4 and the fine-tuned models is smaller than in the previous use-case.

𝐔𝐬𝐞-𝐂𝐚𝐬𝐞 3: Elementary School Mathematical Reasoning
𝐃𝐚𝐭𝐚𝐬𝐞𝐭: GSM8k
𝐄𝐱𝐞𝐦𝐩𝐥𝐚𝐫𝐲 𝐈𝐧𝐩𝐮𝐭:
“Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?”
𝐈𝐥𝐥𝐮𝐬𝐭𝐫𝐚𝐭𝐢𝐯𝐞 𝐎𝐮𝐭𝐩𝐮𝐭:
“Natalia sold 48/2 = 24 clips in May. \n Natalia sold 48+24 = 72 clips altogether in April and May. \n#### 72”
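
Accuracy on GSM8k is commonly scored by comparing only the final answer after the “####” marker against the reference. A small sketch of that extraction step (the model output below is a placeholder, not a real generation):

```python
# Sketch: scoring GSM8k by exact match on the final answer after "####".
import re

from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="test")

def final_answer(solution: str) -> str:
    # GSM8k reference solutions end with a line like "#### 72".
    match = re.search(r"####\s*(-?[\d,\.]+)", solution)
    return match.group(1).replace(",", "") if match else ""

# Placeholder model output, not a real generation:
model_output = "Natalia sold 48/2 = 24 clips in May. 48+24 = 72. #### 72"
print(final_answer(model_output))        # "72"
print(final_answer(gsm8k[0]["answer"]))  # gold answer of the first test item
```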

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
After fine-tuning, the 7b and 13b models show roughly a 10% gain in accuracy over their non-fine-tuned counterparts. The improvement over the chat-tuned baselines is narrower, likely because mathematical examples were already part of the chat-tuning data. In this particular use-case, GPT-4 remains superior even after fine-tuning.

[Figure: Results]

𝐂𝐨𝐧𝐜𝐥𝐮𝐬𝐢𝐨𝐧:
Strategically, fine-tuning LLMs for specific tasks presents a promising way to extract value in a business context. This is driven not just by privacy concerns, but also by factors like latency, cost efficiency, and quality improvements. Generalized proprietary models like GPT-4 or even Claude-2 may remain useful for use-case validation and prototyping; however, their feasibility for sustained, high-performance applications in real-world production environments is constrained.
