Gorilla: Enhancing LLMs with Massive API Integration

Published on March 15, 2024 | AI Tools

Ever asked a language model to book a flight or analyze an image, only to get a vague or incorrect response? Large Language Models (LLMs) excel at dialogue and reasoning, but using external tools via API calls often trips them up. The 2023 paper by Patil et al. introduces Gorilla, a finetuned LLaMA-7B model that outperforms GPT-4 in generating accurate API calls. Paired with a document retriever, Gorilla adapts to changing API documentation and reduces hallucination errors. Let's unpack how Gorilla works, its technical foundation, and why it's a game-changer for tool-augmented LLMs.

The API Challenge

LLMs like GPT-4 can chat fluently or solve math problems, but they struggle with API calls due to their static knowledge and limited context. APIs, especially in machine learning (ML), are vast, overlapping, and constantly updated, making it hard to select the right one or use it correctly. For example, asking for an image classification model might yield a nonexistent API or wrong library. Gorilla addresses this by finetuning an LLM to handle millions of APIs, using a new benchmark, APIBench, to test its prowess.

Gorilla's Approach: Finetuning for APIs

Gorilla transforms API usage into a sequence-to-sequence task, finetuning LLaMA-7B to map natural language prompts to correct API calls. It integrates a retriever to fetch up-to-date documentation, enabling adaptability. Here's how it operates, illustrated with a prompt like "detect animals in an image."

APIBench Dataset

Gorilla relies on APIBench, a comprehensive dataset of 1,645 ML APIs from HuggingFace (925), TorchHub (94), and TensorHub (626). These cover domains like image classification, text generation, and object detection. Each API is represented as a JSON object with fields such as functionality, arguments, and example code. Using self-instruct, the authors generated 10 synthetic user prompts per API, creating instruction-API pairs. For example, a prompt like "classify pedestrians in an image" maps to torch.hub.load('datvuthanh/hybridnets', 'hybridnets', pretrained=True).
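To make this concrete, here is a minimal sketch of what one such instruction-API pair might look like. The field names are illustrative, not the exact APIBench schema:

```python
# A hypothetical APIBench-style entry: one API's documentation JSON paired
# with a self-instruct-generated user prompt. Field names are assumptions
# for illustration, not the literal dataset schema.
api_doc = {
    "domain": "Object Detection",
    "framework": "TorchHub",
    "functionality": "Detect pedestrians and vehicles in an image",
    "api_call": "torch.hub.load('datvuthanh/hybridnets', 'hybridnets', pretrained=True)",
    "api_arguments": {"repo_or_dir": "datvuthanh/hybridnets", "model": "hybridnets"},
    "example_code": "model = torch.hub.load('datvuthanh/hybridnets', 'hybridnets', pretrained=True)",
}

# One of the ~10 synthetic prompts generated for this API via self-instruct,
# forming the (instruction, API call) training pair.
instruction_pair = {
    "instruction": "classify pedestrians in an image",
    "output": api_doc["api_call"],
}

print(instruction_pair["output"])
```

Ten such prompts per API, over 1,645 APIs, yields the instruction-tuning corpus Gorilla is finetuned on.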

Model Architecture and Training

Gorilla starts with LLaMA-7B, finetuned on APIBench for 5 epochs with a learning rate of 2e-5, cosine decay, and a batch size of 64 on 8xA100 GPUs. The dataset is split into training and test sets (90% training for HuggingFace, 80% for the others). Each data point is formatted as a user-agent chat, with the user prompt (e.g., "detect animals") and the agent response (the API call). For retriever-aware training, the instruction "Use this API documentation: " plus the retrieved documentation is appended to the prompt, teaching Gorilla to parse documentation at inference time. This reduces hallucination and lets the model adapt to documentation changes.
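The chat formatting above can be sketched as a small helper. The exact template string is an assumption; the paper describes the retriever-aware idea, not the literal prompt format:

```python
def format_example(user_prompt, api_call, api_doc=None):
    """Format one APIBench data point as a user-agent chat turn.

    When api_doc is given, this mimics Gorilla's retriever-aware training by
    appending the retrieved documentation to the user message. The template
    below is a hypothetical stand-in for the actual training format.
    """
    user = user_prompt
    if api_doc is not None:
        # Retriever-aware variant: the model learns to ground its answer
        # in the appended documentation rather than memorized APIs.
        user += f"\nUse this API documentation for reference: {api_doc}"
    return f"USER: {user}\nASSISTANT: {api_call}"

print(format_example(
    "detect animals in an image",
    "hub.load('https://tfhub.dev/google/openimages_v4/ssd_mobilenet_v2/1')",
    api_doc="SSD object detector trained on Open Images v4",
))
```

Training on both variants (with and without appended documentation) is what lets the same model serve the two inference modes described next.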

Inference Modes

Gorilla operates in two modes:

Zero-shot: The user prompt is fed directly to Gorilla, which outputs an API call (e.g., hub.load('https://tfhub.dev/google/openimages_v4/ssd_mobilenet_v2/1') for animal detection).

Retrieval: A retriever (BM25 or GPT-Index) fetches relevant API documentation, appended to the prompt. Gorilla then generates the call, leveraging the latest information.
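The two modes differ only in how the model's input is built. A minimal sketch, using a toy word-overlap retriever as a stand-in for BM25 or GPT-Index (which the paper actually uses):

```python
def build_prompt(user_query, retriever=None, corpus=None):
    """Build Gorilla's input in either inference mode.

    Zero-shot: the raw query goes straight to the model.
    Retrieval: a retriever picks the best-matching documentation to append.
    The instruction string is an assumed template, not the paper's literal one.
    """
    if retriever is None:
        return user_query  # zero-shot mode
    doc = retriever(user_query, corpus)
    return f"{user_query}\nUse this API documentation for reference: {doc}"

def toy_retriever(query, corpus):
    """Toy stand-in for BM25: pick the doc sharing the most words with the query."""
    q = set(query.lower().split())
    return max(corpus, key=lambda d: len(q & set(d.lower().split())))

docs = [
    "ssd_mobilenet_v2: detect animals and objects in an image",
    "gpt2: generate text from a prompt",
]
print(build_prompt("detect animals in an image", toy_retriever, docs))
```

In retrieval mode, the appended documentation is what lets Gorilla track API changes after training, at the cost of depending on retriever quality.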

Evaluation with AST Matching

To verify API calls, Gorilla uses Abstract Syntax Tree (AST) sub-tree matching. It parses the generated code into an AST, checking if the API call (e.g., torch.hub.load) and key arguments (e.g., repo_or_dir, model) match a reference in APIBench. Hallucination is defined as an API call not in the dataset. For HuggingFace, domain accuracy is also checked due to its diversity.
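A simplified version of this check can be written with Python's ast module. The real APIBench matcher compares full sub-trees; this sketch reduces the idea to matching the dotted function name and the key argument values:

```python
import ast

def call_matches(generated_code, ref_func, ref_arg_values):
    """Simplified AST sub-tree check in the spirit of Gorilla's evaluation.

    Walks the AST of the generated code looking for a call whose dotted
    function name equals ref_func and whose arguments include every key
    argument value from the reference. An illustrative reduction, not the
    paper's exact matcher.
    """
    tree = ast.parse(generated_code)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        if ast.unparse(node.func) != ref_func:
            continue
        # Collect both positional and keyword argument source text.
        values = [ast.unparse(a) for a in node.args]
        values += [ast.unparse(kw.value) for kw in node.keywords]
        if all(repr(v) in values for v in ref_arg_values):
            return True
    return False  # no matching call: under this scheme, a hallucination

gen = "model = torch.hub.load('datvuthanh/hybridnets', 'hybridnets', pretrained=True)"
print(call_matches(gen, "torch.hub.load",
                   ["datvuthanh/hybridnets", "hybridnets"]))  # → True
```

Matching on the tree rather than the raw string means cosmetic differences (variable names, extra keyword arguments) don't cause false negatives, while a call to a nonexistent API finds no reference and is flagged as hallucination.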

Why Gorilla Stands Out

Gorilla shines in APIBench evaluations, achieving 59.13% accuracy on TorchHub, 71.68% on HuggingFace, and 83.79% on TensorHub in zero-shot settings, surpassing GPT-4 by 20.43 percentage points on TorchHub and ChatGPT by 10.75 points. With an oracle retriever, accuracy climbs to 67.20%, 91.26%, and 94.16%, respectively. Hallucination errors drop sharply (e.g., 6.98% on TorchHub vs. 36.55% for GPT-4). Key strengths include:

Constraint Handling: Gorilla excels at respecting constraints, like selecting an image classifier with > 80% ImageNet accuracy, matching GPT-3.5 when both use retrievers and outperforming it in the zero-shot setting (47.88% accuracy).

Adaptability: Retriever-aware training lets Gorilla handle documentation changes, like upgrading a model's backbone (e.g., ResNet-50 to ResNet-101).

Efficiency: Finetuning focuses on API calls, not general coding, making it lightweight and practical.

Limitations include reliance on retriever quality (BM25 can degrade accuracy by 52.27% vs. oracle) and focus on single API calls, not complex programs. Synthetic training data may also miss real-world nuances.

Try It Yourself

Gorilla's code, model, and dataset are at https://gorilla.cs.berkeley.edu. Test it with prompts like "generate a video from text" or explore its retriever integration.

The Future of Tool-Augmented LLMs

Gorilla redefines how LLMs interact with tools, turning them into interfaces for vast API ecosystems. By finetuning for API calls and integrating retrieval, it tackles hallucination and adaptability challenges. Future work could expand to RESTful APIs or multi-call tasks, but Gorilla already sets a high standard. Next time you need an API call, imagine Gorilla delivering the perfect solution, tailored to the latest documentation.