Section 07

Limitations: What LLaMA Cannot Do

LLaMA: Open and Efficient Foundation Language Models 2023

LLaMA was a breakthrough in open-source models, but it has real constraints:


1. Limited Context Length

The constraint: LLaMA-1 was trained with a context window (maximum sequence length) of 2048 tokens.

The problem:

  • Many documents, books, and conversations are longer than 2048 tokens
  • 2048 tokens ≈ 1500 words
  • A single research paper is often 6,000-10,000 tokens

Implication: To have LLaMA process a long document, you must do one of the following:

  • Split it into chunks and process them separately (loses context between chunks)
  • Fine-tune the model on longer sequences (expensive)
  • Use external retrieval (RAG, retrieval-augmented generation)
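
The chunking option can be sketched as follows. This is a minimal illustration, not a method from the LLaMA paper: the function name `chunk_tokens`, the 256-token overlap, and the integer stand-in tokens are all assumptions made for the example. A real pipeline would use the model's own tokenizer.

```python
def chunk_tokens(tokens, max_len=2048, overlap=256):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that some context
    carries across each chunk boundary.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in "tokens": integers instead of real tokenizer output (assumption).
document = list(range(5000))
chunks = chunk_tokens(document)
print([len(c) for c in chunks])
```

The overlap softens the context loss at chunk boundaries but does not remove it: the model still never sees the whole document at once.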

Mitigation: LLaMA-2 extended the context to 4096 tokens, and later models (LLaMA-3, Mistral) pushed it to 8K or 32K. The original LLaMA remains limited to 2048 tokens.


2. English-Centric Training Data

The data: LLaMA was trained on publicly available data, which is heavily English-skewed:

  • CommonCrawl: mostly English pages
  • GitHub: primarily English code and comments
  • Wikipedia: English Wikipedia is larger than other language editions
  • ArXiv: mostly English papers

The result: LLaMA is strongest in English, weaker in other languages.

Benchmark evidence:

  • English (MMLU, 5-shot): 63.4% for LLaMA-65B (46.9% for LLaMA-13B)
  • Other languages: typically 20-40% lower performance (a rough estimate; the LLaMA paper itself reports only English-language benchmarks)

Who this excludes: Researchers and developers working in Hindi, Mandarin, Spanish, Arabic, etc. have weaker models.

Attempts to fix: Projects like Llama-2-Multilingual and later community fine-tunes add multilingual data, but base LLaMA is not multilingual.


3. No Instruction Fine-Tuning or RLHF in Base Model

The base model: LLaMA-1 is a pure language model (next-token prediction), not an instruction-following model.

What this means:

  • The base model is not trained with RLHF (Paper 15)
  • No instruction fine-tuning (like InstructGPT)
  • The model has never been explicitly rewarded for being helpful, harmless, honest

Example:

Prompt: "What is the capital of France?"

LLaMA-1 (base): "What is the capital of France? Paris is the capital. 
In terms of population, it is the most populous"
[Model completes text, not necessarily answering well]

InstructGPT (RLHF-trained): "The capital of France is Paris."
[Model is trained to be concise and answer directly]

Implication: Users need to fine-tune LLaMA themselves or use instruction-tuned derivatives (Alpaca, Vicuna, Guanaco, etc.).
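
To make the fine-tuning route concrete, here is a minimal sketch of how one supervised instruction-tuning record might be formatted as text. The template string and the name `format_example` are hypothetical, chosen for illustration; each instruction-tuned derivative defines its own prompt format.

```python
# Hypothetical prompt template (an assumption, not any project's official format).
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def format_example(instruction, response):
    """Return (prompt, full_text) for one supervised fine-tuning record.

    The model is trained on full_text; at inference time only the prompt
    is supplied and the model generates the response.
    """
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    return prompt, prompt + response


prompt, full_text = format_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(full_text)
```

Training on many such records teaches the base model that text after the response marker should answer the instruction rather than merely continue it.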

Fixed in LLaMA-2: Meta released an instruction-tuned variant (LLaMA-2-Chat), but the base LLaMA-1 lacked this.


4. Misuse and Safety Risks

The risk: LLaMA’s weights were released to researchers under a noncommercial license and leaked publicly soon after, all without heavy safety fine-tuning. As a result:

  • The model can be fine-tuned for harmful tasks (generating malware, phishing, hate speech)
  • It has minimal guardrails compared to proprietary models
  • It has no built-in safety mechanism to refuse harmful requests

Example misuse cases:

  • Fine-tuning LLaMA to generate realistic misinformation
  • Training jailbroken versions that ignore safety guidelines
  • Using LLaMA to automate cyber attacks

OpenAI’s approach (proprietary models): Fine-tune with RLHF to refuse harmful requests, add content filters, monitor API usage.

Meta’s approach (LLaMA): Release weights, trust the community to use responsibly.

The trade-off: Open weights enable research but require community responsibility. LLaMA’s release led to widespread responsible use (Alpaca, fine-tuning for education), but also enabled irresponsible use.

Mitigation: LLaMA-2 came with improved safety training and responsible use guidelines, but the issue remains for base models.


5. Limited to Autoregressive Generation

The constraint: LLaMA can only generate text left-to-right (one token at a time, based on previous tokens).

What this prevents:

  • Non-autoregressive generation (generate multiple tokens in parallel)
  • Bidirectional understanding (like BERT, which reads both left and right context)
  • Tasks requiring simultaneous reasoning over multiple parts

Example:

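  • Task: Fill in the blank “The ___ of France is Paris.”
  • LLaMA: Generates left-to-right. When predicting the blank it has seen only “The”, so it cannot use “of France is Paris” unless the prompt is restructured.
  • BERT: Reads the entire sentence, uses context on both sides of the blank, and predicts the blank directly.

Implication: For tasks that need context from both sides of a position (such as infilling or token classification), other architectures may be a better fit. LLaMA is built for generation; it can still read a full text, but each token’s representation depends only on the tokens before it.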
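
The difference comes down to the attention mask. The sketch below builds both kinds of mask in plain Python, assuming one token per word for readability; the function names are illustrative and are not taken from any library.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position, as in a BERT-style encoder."""
    return [[True] * n for _ in range(n)]


# One token per word is a simplifying assumption for the example.
tokens = ["The", "capital", "of", "France", "is", "Paris"]
n = len(tokens)
for name, mask in [("causal", causal_mask(n)), ("bidirectional", bidirectional_mask(n))]:
    visible = [tokens[j] for j in range(n) if mask[1][j]]
    print(f"{name}: position 1 ('capital') can see {visible}")
```

Under the causal mask, position 1 sees only "The" and itself. Under the bidirectional mask it sees the whole sentence.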


6. No Structured Output or Tool Use (Base Model)

The limitation: LLaMA-1 cannot reliably:

  • Output structured data (JSON, XML)
  • Call external tools (search, calculators, databases)
  • Follow complex instructions with structured outputs

Example:

Prompt: "Find me flights from Delhi to Mumbai on March 15, 2024. Return as JSON."

LLaMA: Might generate plausible-looking but fake flight information.
Has no way to actually query a flight database.

Why it matters: Modern applications need models to:

  • Call APIs (e.g., search Google for current information)
  • Output structured data for downstream processing
  • Use calculators for math (instead of doing arithmetic in the weights)

Fixed in later versions: LLaMA-2 and subsequent models improved structured output via fine-tuning. But base LLaMA lacks this.
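
Because a base model gives no guarantee about output format, downstream code usually has to validate whatever comes back. A minimal sketch using Python's standard `json` module; the function name, the required keys, and the sample strings are made up for illustration.

```python
import json


def parse_json_output(text, required_keys):
    """Return the parsed object if text is a JSON object with all required keys, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    if not all(key in data for key in required_keys):
        return None  # object is missing expected fields
    return data


# Sample model outputs (invented for the example).
good = '{"origin": "DEL", "destination": "BOM", "date": "2024-03-15"}'
bad = 'Sure! Here are some flights: {"origin": "DEL", ...'

keys = ["origin", "destination", "date"]
print(parse_json_output(good, keys))
print(parse_json_output(bad, keys))
```

Validation only checks the shape of the output. It cannot tell whether the flight data is real, which still requires an actual database or API call.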


7. The “Hallucination” Problem

The issue: LLaMA can confidently generate plausible-sounding but false information.

Example:

Prompt: "Tell me about Dr. John Smith's research on AI ethics."

LLaMA might generate:
"Dr. John Smith published 'Ethical Frameworks for AI' in 2021, 
establishing key principles for responsible AI development..."

Reality: Dr. John Smith may not exist, or may not have published this.

Why this matters: Users trust the fluent, confident-sounding output and believe false information.

Root cause: LLaMA is trained to predict the next token, not to verify facts. It learns patterns from training data but cannot distinguish between true facts and plausible fiction.

Mitigation: Retrieval-augmented generation (RAG), fact-checking, external verification. But the base model has no mechanism to avoid hallucinations.
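
The retrieval idea can be sketched in a few lines: pick the passage most relevant to the question and place it in the prompt so the model has something to ground its answer in. Word overlap is used here as a deliberately crude stand-in for a real retriever (an assumption for the example); the function names and prompt wording are illustrative.

```python
def score(query, passage):
    """Count distinct lowercase words shared by the query and the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words)


def build_grounded_prompt(query, passages, top_k=1):
    """Rank passages by word overlap and prepend the best ones as context."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Toy passage store (invented for the example).
passages = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
]
prompt = build_grounded_prompt("What is the capital of France?", passages)
print(prompt)
```

Retrieval reduces hallucination only as far as the retrieved text is correct and the model actually uses it; it is a mitigation, not a guarantee.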


8. Compute Requirements for Inference

LLaMA-65B inference (memory for the weights alone; activations and the KV cache add more):

  • Full precision (FP32): ~260 GB memory (not feasible on consumer hardware)
  • Half precision (FP16): ~130 GB memory (requires multiple GPUs, e.g. two 80 GB or four 40 GB cards)
  • Quantized (INT8): ~65 GB memory (one 80 GB GPU or a multi-GPU setup)
  • Quantized (INT4): ~33 GB memory (a single 40-48 GB GPU)
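
Weight memory can be estimated as parameter count times bytes per parameter. A small sketch, assuming a nominal 65 billion parameters and counting weights only; the function name is illustrative, and real deployments add overhead for activations, the KV cache, and quantization metadata.

```python
PARAMS = 65e9  # nominal parameter count for LLaMA-65B (assumption: rounded)

# Storage cost per parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(PARAMS, nbytes):.1f} GB")
```

The same function applies to the smaller models: swapping in 7e9 or 13e9 shows why those sizes fit on far more modest hardware once quantized.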

Comparison:

  • GPT-3.5 (OpenAI API): Pay per token, no hardware needed
  • LLaMA-65B: Must own or rent GPUs (expensive for inference at scale)

Implication: While the smaller LLaMA models (7B, 13B) can run on laptops when quantized, the 65B model requires serious hardware.


9. Training Data Cutoff

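The constraint: LLaMA’s training data was collected before its release in February 2023, and much of it is older (the CommonCrawl dumps span 2017-2020; the Wikipedia dumps date from mid-2022).

The problem:

  • No knowledge of events after the data was collected
  • Can answer questions about earlier events with moderate accuracy
  • Cannot discuss recent developments, discoveries, or events

Example: Asking LLaMA about GPT-4’s release (March 2023) yields no real knowledge, because GPT-4 came out after LLaMA itself was released. The same holds for any event in 2024.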

This affects: Anyone needing current information, recent news, latest research. LLaMA must be fine-tuned or augmented with retrieval to stay current.


Summary: When to Use LLaMA vs. Alternatives

Task                        | LLaMA Fit                    | Better Alternatives
Research/Experimentation    | ✓ Open weights, reproducible |
Long documents (>2K tokens) | ✗ Limited context            | Claude, GPT-4
Instruction-following       | ✗ No base RLHF               | LLaMA-2-Chat, ChatGPT
Non-English                 | ✗ English-centric            | Multilingual fine-tunes
Current events              | ✗ Fixed data cutoff          | GPT-4, Claude (with web access)
Safety-critical             | ✗ No safety fine-tuning      | GPT-4, Claude
Education/Open Source       | ✓ Available, reproducible    |
Cost-sensitive inference    | ✓ (with quantization)        |