2.4 Technical Principles of Large Language Models (LLMs)
Demystifying the Engine of Language Intelligence: Technical Principles of Large Language Models (LLMs)
Large Language Models (LLMs) are undoubtedly the shining stars at the forefront of the current artificial intelligence wave, permeating every corner of the legal industry with unprecedented depth and breadth. Whether assisting lawyers with rapid legal research, drafting initial versions of various legal documents, providing preliminary legal consultation services, or intelligently reviewing complex contracts, LLM-driven tools represented by models like ChatGPT, Claude, Llama, DeepSeek, and others are profoundly changing the traditional paradigms of legal work.
However, to truly harness these powerful tools effectively and prudently evaluate their potential risks and limitations, legal professionals should not merely be content with knowing what they can do, but also need a fundamental understanding of the core technical principles behind how they work. Understanding their internal mechanisms helps us use them more wisely, identify the reliability of their output, and anticipate potential pitfalls.
This section will focus on the technical heart of LLMs—the Transformer architecture—and delve into key training strategies like Pre-training and Fine-tuning, unveiling the mystery behind this engine of language intelligence.
1. The Transformer Architecture: A Revolutionary Cornerstone for LLM Success
Before the influential paper “Attention Is All You Need” was published by a Google research team in 2017, introducing the Transformer architecture, the mainstream deep learning models for processing sequential data (especially natural language) were Recurrent Neural Networks (RNNs) and their variants (like LSTM, GRU). RNNs process sequences using recurrent structures, attempting to pass information from previous steps forward. However, RNN models faced two major bottlenecks that limited their performance on complex, long texts:
- Difficulty capturing long-range dependencies: Information in RNNs is passed sequentially, like a game of telephone. Information tends to get lost or distorted during long-distance transmission (technically known as the vanishing/exploding gradient problem). This made it hard for RNNs to effectively link complex semantic or grammatical relationships between words far apart in sentences or paragraphs, which is crucial for understanding legal texts with their common long sentences, complex clauses, and cross-references.
- Inherent obstacle to parallel computation: The recurrent nature of RNNs dictates that they must process sequence elements one by one (e.g., word by word), preventing large-scale parallel computation. This significantly limited model training speed, making it difficult to efficiently leverage the powerful parallel processing capabilities of modern Graphics Processing Units (GPUs) to learn from massive text datasets.
The advent of the Transformer architecture solved these two major problems in an elegant and efficient manner. Its core innovation lies in completely abandoning recurrent structures and relying entirely on a novel design called the “Self-Attention Mechanism,” which directly captures dependencies between arbitrary positions within a sequence while inherently supporting highly parallelized computation.
1.1 Self-Attention Mechanism: The Key to Understanding Context
The core idea behind the self-attention mechanism is that the precise meaning of a word often depends on its specific context. For example, consider the word “bank”. In “He sat on the river bank,” it refers to the edge of a river. In “He went to the bank to withdraw money,” it refers to a financial institution.
The self-attention mechanism endows the model with the ability, when processing each word (or a smaller unit called a Token) in a sequence, to simultaneously ‘attend to’ all other words in the sequence (including itself). Based on the ‘relevance’ or ‘importance’ (calculated as an Attention Score) of these other words to the current word, it dynamically generates a Contextualized Representation for the current word, rich with context.
- Overview of How It Works (Query, Key, Value): To achieve this “attention,” the self-attention mechanism learns to generate three distinct vectors for each input word vector (Word Embedding):
- Query vector (Q): Represents the current word being processed. It “queries” other words in the sequence to determine their relevance.
- Key vector (K): Represents every word in the sequence (including itself). It acts like a “label” to be matched by the Query. The similarity between Q and K determines relevance.
- Value vector (V): Also represents every word in the sequence (including itself). It contains the actual information embedded in that word. Once Q determines the attention (weight) it should pay to a word’s K, it extracts the corresponding information from that word’s V.
- Brief Calculation Flow (a minimal code sketch follows this list):
- Calculate Relevance Scores (Score): For the current word’s Q vector, compute its similarity (typically using dot product) with the K vectors of all words in the sequence (including itself). Higher scores indicate stronger relevance.
- Scale Scores: To maintain training stability, divide the calculated scores by a scaling factor (usually the square root of the dimension of the K vectors).
- Normalize Weights: Use the Softmax function to convert the scaled scores into a set of Attention Weights. These weights are non-negative and sum to 1. They represent how much “attention” (i.e., weight) the current word should allocate to each other word’s Value vector. Words with higher relevance scores receive larger attention weights.
- Weighted Sum of Information: Multiply each word’s V vector by its corresponding attention weight, then sum up all the weighted V vectors. The resulting vector is the new representation of the current word, incorporating context from the entire sequence.
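To make the four steps above concrete, here is a minimal, illustrative sketch of scaled dot-product self-attention in plain NumPy. The function names, toy dimensions, and random data are assumptions chosen for readability only; in a real LLM the projection matrices are learned during training and the computation runs over thousands of tokens at once.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X.

    X:   (seq_len, d_model) input embeddings, one row per token
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    Returns contextualized representations of shape (seq_len, d_k).
    """
    Q = X @ W_q                      # Query vector for every token
    K = X @ W_k                      # Key vector for every token
    V = X @ W_v                      # Value vector for every token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # steps 1-2: relevance scores, then scaling
    weights = softmax(scores)        # step 3: attention weights, each row sums to 1
    return weights @ V               # step 4: weighted sum of Value vectors

# Toy example: a "sentence" of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one context-aware vector per token
```

Each row of the output is the new, context-aware representation of one token: a blend of every token’s Value vector, weighted by the softmax-normalized Query-Key scores.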
1.2 Multi-Head Attention: Examining from Multiple Perspectives
To enable the model to learn relationships between words in the input sequence from different representation subspaces, or from different perspectives, the Transformer cleverly employs a Multi-Head Attention mechanism instead of just a single set of Q, K, V.
The approach is: The original Q, K, V vectors are each linearly projected (like applying different color filters) into multiple lower-dimensional “heads.” The self-attention process described above is then performed independently within each “head.” Finally, the resulting contextual representation vectors from all “heads” are concatenated and passed through another linear transformation for fusion, yielding the final multi-head attention output.
This design allows the model to simultaneously attend to different types of relational patterns in the input sequence. For instance, one “head” might focus more on syntactic structural relationships (like subject-verb-object), another on semantic associations (like synonyms or antonyms), and yet another on specific referential links. The multi-head mechanism greatly enhances the model’s capacity to capture complex information.
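As a rough illustration of this “multiple perspectives” idea, the sketch below splits the projected vectors into two heads, runs attention in each head independently, and then concatenates and fuses the results. The dimensions, weight matrices, and toy data are again assumptions made purely for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split the Q/K/V projections into `num_heads` lower-dimensional heads,
    run scaled dot-product attention in each head independently,
    then concatenate the heads and fuse them with W_o."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Reshape to (num_heads, seq_len, d_head): one representation subspace per head.
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head relevance scores
    heads = softmax(scores) @ Vh                            # per-head contextual vectors
    # Concatenate the heads back to (seq_len, d_model) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (4, 8)
```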
1.3 Positional Encoding: Enabling the Model to Understand Word Order
The self-attention mechanism itself, when calculating relevance, does not inherently consider the position or order of words in the sequence (it treats all positions equally). However, word order is crucial in natural language (“The lawyer informed the client” means something different from “The client informed the lawyer”).
To address this, before the initial word embedding vector (representing the word’s semantics) is fed into the main model, the Transformer adds a special Positional Encoding vector to it. This positional encoding vector is typically pre-calculated using fixed mathematical functions (like sine and cosine functions) based on the word’s absolute or relative position in the sequence. It provides a unique “timestamp” or “positional signal” for each position, enabling the model to distinguish words at different positions and understand their sequence and relative distances.
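The fixed sine/cosine scheme from the original Transformer paper fits in a few lines. It is only one common choice (many modern LLMs use learned or rotary position embeddings instead); the sketch below assumes the classic sinusoidal form and illustrative dimensions.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings from the original Transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Every position receives a unique pattern, and relative distances are easy
    for the model to recover from these signals."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the word embeddings before the first layer.
embeddings = np.zeros((10, 16))                    # 10 tokens, d_model = 16
model_input = embeddings + positional_encoding(10, 16)
print(model_input.shape)  # (10, 16)
```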
1.4 Overall Transformer Architecture: The Classic Encoder-Decoder Combination
Originally, the Transformer model was designed for Machine Translation tasks, and its classic architecture includes two main components: the Encoder and the Decoder.
- Encoder: Its role is to receive and “understand” the input sequence (e.g., the source language sentence). It typically consists of N (e.g., 6 or 12) identical encoder layers stacked on top of each other (a minimal code sketch of one such layer follows this list). Each encoder layer contains two core sub-layers:
- A Multi-Head Self-Attention Layer: Allows each word in the input sequence to attend to all other words in the sequence, obtaining rich contextual representations.
- A Position-wise Feed-Forward Network (FFN): A simple fully connected neural network that applies further non-linear transformations to the output of the self-attention layer, enhancing the model’s representational power. After each sub-layer (self-attention and FFN), a Residual Connection and Layer Normalization are applied. These techniques are crucial for training very deep networks, helping to mitigate the vanishing gradient problem and stabilize/accelerate the training process.
- Decoder: Its role is to generate the target output sequence (e.g., the translated sentence in the target language) based on the encoder’s understanding of the input sequence. The decoder also typically consists of N identical decoder layers stacked together. Each decoder layer is slightly more complex than an encoder layer, containing three core sub-layers:
- A Masked Multi-Head Self-Attention Layer: Performs self-attention on the portion of the output sequence generated so far. The key here is “Masked,” which ensures that when predicting the word at the current position, the model can only attend to words generated before this position and cannot “cheat” by looking at future words (which don’t exist yet during actual generation). This maintains the auto-regressive property of generation.
- A Multi-Head Encoder-Decoder Attention Layer: This acts as the bridge connecting the encoder and decoder. It allows each position in the decoder to attend to all positions in the encoder’s output (i.e., the final representation of the input sequence). This enables the decoder to leverage relevant information from the input sequence when generating each word.
- A Position-wise Feed-Forward Network (FFN). Similarly, residual connections and layer normalization follow each sub-layer in the decoder.
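As a simplified picture of how one encoder layer composes its sub-layers, the PyTorch sketch below wires multi-head self-attention and a position-wise FFN together with residual connections and layer normalization, using the post-norm ordering of the original paper. The class name, dimensions, and activation choice are illustrative assumptions, not a reproduction of any particular production model.

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + position-wise FFN,
# each sub-layer wrapped in a residual connection and layer normalization.
class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q, K, V all come from x
        x = self.norm1(x + attn_out)       # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))    # FFN sub-layer follows the same pattern
        return x

layer = EncoderLayer()
x = torch.randn(1, 20, 512)                # batch of 1, 20 tokens, d_model = 512
print(layer(x).shape)                      # torch.Size([1, 20, 512])
```

A decoder layer would add the masked self-attention and encoder-decoder attention sub-layers described above, but it follows the same residual-plus-normalization pattern around each sub-layer.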
LLM Architecture Evolution: The Rise of Decoder-Only Models
While the classic Encoder-Decoder architecture is highly successful in sequence-to-sequence tasks like machine translation, many of the most famous modern LLMs (such as OpenAI’s GPT series) actually employ a Decoder-only architecture.
- Decoder-only Architecture (e.g., GPT): These models essentially use only the decoder part of the Transformer (often removing the Encoder-Decoder Attention layer, keeping only Masked Self-Attention and FFN). They excel at Text Generation tasks. By pre-training on massive text data using the “next token prediction” objective (see below), they learn the internal statistical patterns, grammatical structures, and vast world knowledge embedded in language. When given an input prompt, they can auto-regressively predict the most likely next word (or token) based on the prompt and the text generated so far, thus “continuing” the text to produce coherent and relevant long-form content (a toy sketch of this generation loop follows this list).
- Other Architectures: There are also important LLMs (like Google’s BERT and its variants) that primarily use an Encoder-only architecture. These are particularly adept at Natural Language Understanding (NLU) tasks requiring deep contextual understanding, such as text classification, sentiment analysis, named entity recognition, and extractive question answering. Models like T5, on the other hand, retain the full Encoder-Decoder architecture.
2. Forging LLMs: The Two-Step Strategy of Pre-training and Fine-tuning
Training a powerful LLM typically involves two main stages: Pre-training and Fine-tuning.
2.1 Pre-training: Laying the Foundation for General Language Understanding
- Goal: This is the first, most critical, and most resource-intensive stage of training. The objective is to enable the model to learn broad knowledge about language itself (grammar, semantics, pragmatics), vast factual knowledge about the world, and even rudimentary reasoning and pattern recognition abilities from extremely large amounts of, typically unlabeled, text data. The aim is to build a Foundation Model with strong general capabilities.
- Data: The scale of data used for pre-training is staggering, often reaching trillions of words (Tokens). This data typically comes from publicly available text sources on the internet, such as Wikipedia, massive web crawls (like Common Crawl), digitized books, news articles, professional forums, code repositories (like GitHub), and more. The vast majority of this data is unlabeled.
- Core Task: Self-supervised Learning: Since most data is unlabeled, how does the model learn? The answer is self-supervised learning. The model learns by completing tasks for which “labels” can be automatically generated from the data itself. For LLMs, the core self-supervised pre-training task is Language Modeling (a small sketch of how such training examples are built follows this list):
- Masked Language Modeling (MLM): This is a common pre-training task for Encoder models like BERT. The process involves randomly replacing a small fraction (e.g., 15%) of the words in the input text with a special [MASK] token. The model is then trained to predict the original masked word based on the surrounding context. This forces the model to learn bidirectional dependencies between words.
- Next Token Prediction / Causal Language Modeling (CLM): This is the primary pre-training task for Decoder models (especially generative ones) like GPT. The process involves giving the model a prefix of text (e.g., “Artificial intelligence is changing”) and training it to predict the most likely next word (e.g., “law,” “the world,” “healthcare”). The model learns to predict subsequent content based on preceding text. During actual text generation, the model repeatedly performs this next token prediction.
- Outcome: After large-scale pre-training, the model’s billions or even trillions of parameters (weights and biases in the neural network) act like a highly compressed knowledge base, containing the rich language patterns and world knowledge learned from the massive text corpus. This trained model is the Foundation Model, possessing broad general language understanding and generation capabilities, ready to serve as a base for various downstream tasks.
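The two objectives differ mainly in how training examples are cut from raw text, which the short sketch below illustrates on a single sentence. The helper names and the 15% masking rate are assumptions made for illustration; real pipelines operate on token IDs, apply the full BERT masking rules, and stream over trillions of tokens.

```python
import random

text = "artificial intelligence is changing the practice of law".split()

# Masked Language Modeling (BERT-style): hide ~15% of tokens, predict the originals.
def make_mlm_example(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok          # the label the model must recover
        else:
            masked.append(tok)
    return masked, targets

# Causal Language Modeling (GPT-style): every prefix predicts the next token.
def make_clm_examples(tokens):
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(make_mlm_example(text))
print(make_clm_examples(text)[:3])
```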
2.2 Fine-tuning: Tailoring the Model for Specific Needs
While the foundation model obtained from pre-training is powerful and general, it may not be perfectly suited for all specific downstream tasks or specialized domains (like law or medicine). It might lack deep domain-specific knowledge, its output style might not meet requirements, it might struggle to follow complex instructions, or it might even generate harmful or inaccurate information (due to the complexity of pre-training data sources). Therefore, a second stage of Fine-tuning is usually necessary.
- Goal: The goal of fine-tuning is to “further develop” or “adapt” the general foundation model to better meet specific application needs. This could involve enhancing its accuracy and terminology usage in a particular professional field (like law), improving its ability to follow user instructions, or making its output safer, more reliable, and aligned with human values (known as Alignment).
- Data: Unlike pre-training which uses massive unlabeled data, fine-tuning typically uses a much smaller scale (ranging from hundreds to hundreds of thousands of examples) of high-quality, labeled data highly relevant to the target task or domain. Data quality is often more crucial than quantity here. For example:
- For a legal question-answering task, one might use a set of “legal question - standard answer” pairs curated by legal experts.
- For a contract risk review task, one could use contracts where experienced lawyers have annotated risk clauses and types.
- To improve instruction following, a diverse set of “instruction - desired output” examples is used.
- Common Fine-tuning Techniques (illustrative data formats follow this list):
- Supervised Fine-tuning (SFT): This is the most common method. Labeled “input-output” pairs specific to the task (e.g., <instruction, response> pairs, <article, summary> pairs) are used to continue training the model, teaching it directly how to produce the desired output for a given input.
- Domain-adaptive Fine-tuning: If the goal is to adapt the model to the language style and knowledge of a specific domain (like law) but large amounts of labeled data are unavailable, this method can be used. It involves using a large corpus of unlabeled text from the target domain (e.g., vast numbers of legal judgments, statutes, journal articles) to continue training with tasks similar to pre-training (like language modeling). This helps the model “soak” in the domain language, becoming familiar with professional jargon and expression styles.
- Reinforcement Learning from Human Feedback (RLHF): This has been a key technique in recent years for improving the Alignment of large language models (i.e., making their behavior more consistent with human expectations – more helpful, honest, and harmless). It’s considered a major reason for the impressive performance of models like ChatGPT. The process generally involves three steps:
- Collect Human Preference Data: First, have the LLM generate multiple different responses to a set of input prompts. Then, invite human evaluators (Annotators) to compare and rank these responses, indicating which one is better, more helpful, safer, etc.
- Train a Reward Model (RM): Use the collected human preference ranking data to train a separate “Reward Model.” This RM learns the human preference criteria and can assign a score to any response generated by the LLM (predicting how highly a human would rate it).
- Optimize the LLM using Reinforcement Learning: Use the trained Reward Model as a feedback signal in an environment. Apply reinforcement learning algorithms (like Proximal Policy Optimization, PPO) to fine-tune the original LLM. The objective is to train the LLM to generate responses that receive higher scores from the Reward Model, thereby aligning its behavior more closely with human preferences.
- Instruction Fine-tuning: This is a specific type of SFT aimed at enabling the model to understand and follow various forms of natural language instructions. The training data consists of numerous “instruction - desired output” example pairs covering a wide range of potential tasks (e.g., Q&A, translation, summarization, writing, code generation). Models fine-tuned with instructions (like InstructGPT) typically exhibit stronger general-purpose task handling and better “obedience.”
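To make the data requirements of SFT and RLHF more tangible, here is a hypothetical example of what individual training records might look like for a legal use case. The field names ("instruction", "chosen", "rejected", and so on) and the clause text are invented for illustration; real datasets vary by framework and are typically stored as large JSONL files.

```python
import json

# Supervised / instruction fine-tuning: labeled input-output pairs.
sft_example = {
    "instruction": "Summarize the key obligations of the supplier in the clause below.",
    "input": "The Supplier shall deliver the Goods within 30 days of the Order Date...",
    "output": "The supplier must deliver the goods within 30 days of the order date...",
}

# RLHF step 1: human preference data, i.e. two candidate answers with one preferred.
preference_example = {
    "prompt": "Explain force majeure to a non-lawyer.",
    "chosen": "Force majeure excuses a party from performing when extraordinary events...",
    "rejected": "Force majeure means the contract never applies, so obligations can be ignored.",
}

print(json.dumps(sft_example, indent=2))
print(json.dumps(preference_example, indent=2))
```

Records like the first feed supervised and instruction fine-tuning directly, while large collections of records like the second are used to train the Reward Model that guides the reinforcement learning step.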
3. Model Scale: The Significance and Trade-offs of Parameter Count
When discussing LLMs, the term “parameter count” (Number of Parameters) is frequently mentioned, e.g., a model having “billions of parameters” or “hundreds of billions of parameters.”
- What are Parameters: Simply put, parameters are the numerical values (primarily the weights and biases of the neural network connections) that the model learns during the training process. These parameters collectively determine the model’s behavior and capabilities. The parameter count is a key indicator of an LLM’s scale.
- Scale Effect (Revisited): As mentioned earlier, generally, increasing the number of parameters allows a model to learn more complex data patterns, acquire broader knowledge, and exhibit stronger performance across various tasks. Current model scales in the industry can be roughly categorized as:
- Small Models: Hundreds of millions to a few billion parameters.
- Medium Models: Tens of billions of parameters.
- Large Models (LLMs): Hundreds of billions of parameters (e.g., GPT-3 has 175 billion parameters, Google’s PaLM has 540 billion parameters).
- Extra-Large Models: Trillions of parameters or more.
- Performance vs. Cost Trade-offs: However, bigger isn’t always better. Larger models typically entail:
- Higher Training Costs: Requiring more computational resources and time.
- Higher Inference Costs: Deploying and running the model for prediction or text generation demands more powerful hardware (like GPUs), consumes more energy, and costs more.
- Longer Response Times: Larger models may take longer to process input and generate output.
- Deployment Challenges: Deploying extremely large models on local devices or in resource-constrained environments is very difficult.
Therefore, in practical applications, a trade-off must be made between model performance and economic cost/operational efficiency. In recent years, both academia and industry have been actively exploring ways to enhance the performance of models with relatively smaller parameter counts through techniques like improved model architectures (e.g., Mixture-of-Experts, MoE), optimized training methods, and model compression (e.g., Quantization, Distillation). The goal is to achieve performance close to or even matching that of very large models while maintaining higher efficiency, which is vital for the widespread adoption of LLM technology.
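One practical consequence of parameter count is memory. As a rough back-of-the-envelope rule (an illustrative assumption, not a precise sizing guide), just holding the weights requires roughly the parameter count times the bytes per parameter, and shrinking the bytes per parameter is exactly what quantization does:

```python
# Rough rule of thumb: memory to hold the weights alone is
#   parameter_count x bytes_per_parameter
# (real serving needs more, e.g. activations and the KV cache).
BYTES = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    return num_params * BYTES[precision] / 1e9

for n, name in [(7e9, "7B"), (70e9, "70B"), (175e9, "175B (GPT-3)")]:
    row = ", ".join(f"{p}: {weight_memory_gb(n, p):,.0f} GB" for p in BYTES)
    print(f"{name:>14} -> {row}")
```

A 175-billion-parameter model thus needs on the order of hundreds of gigabytes of memory in half precision, which is why quantization and distillation matter so much for deployment.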
Conclusion: Understanding the Engine Enables Robust Navigation
Large Language Models (LLMs) are complex artificial intelligence systems built upon the revolutionary Transformer architecture. They establish a foundation of general intelligence through massive data pre-training and are then carefully fine-tuned to adapt to specific application needs. A basic understanding of their core self-attention mechanism (enabling context comprehension), pre-training (imparting broad knowledge and potential biases), fine-tuning (especially RLHF, shaping behavior towards human expectations), and model scale (influencing capability versus cost) will significantly help legal professionals to:
- Gain a deeper understanding of the capability boundaries of LLM-driven tools: Comprehend why they can understand complex legal jargon and long sentences, why they can generate seemingly professional text, while also recognizing that their knowledge might be outdated and they can “hallucinate” (confidently state falsehoods).
- Recognize their potential risks more clearly: Understand the biases potentially inherited from pre-training data and the risks of unprofessional or unsafe outputs resulting from inadequate fine-tuning.
- Employ Prompt Engineering more effectively: Knowing that models rely on context and excel at following patterns enables the design of more precise prompts that better guide the model toward desired outcomes.
- Evaluate and select AI legal tools more prudently: Be able to ask vendors informed questions about the base model, training data sources, whether high-quality fine-tuning was performed specifically for the legal domain, and what safety alignment measures have been implemented.
Grasping these fundamental principles is a crucial step for legal professionals in this AI-impacted era to maintain professional judgment, effectively leverage technological benefits, and simultaneously mitigate potential risks. In the following chapters, we will further explore specific examples of mainstream LLM models, their characteristic differences, methods for evaluating model performance, and how to better “steer” these powerful language intelligence engines through more advanced prompt engineering techniques.