Developer's Corner

Do you want to understand how this package works, or modify the code you are running? This page provides and explains the necessary tools.

Tokenizer

Llama2.TokenizerType
Tokenizer

Construct a tokenizer storing vocabulary entries, scores, and byte-piece mappings.

Constructors

  • Tokenizer(vocab, vocab_scores, sorted_vocab, vocab_size, max_token_length, byte_pieces) Construct a tokenizer directly from the provided fields. Validate that max_token_length > 0 and that byte_pieces has length 256.

  • Tokenizer(path::String, vocab_size::Integer) Load a tokenizer from a binary file.

Fields

  • vocab: Token string sequences.
  • vocab_scores: Scores for each token.
  • sorted_vocab: Sorted token indices.
  • vocab_size: Number of vocabulary entries.
  • max_token_length: Maximum token length in bytes.
  • byte_pieces: Byte mapping (length 256).

Tokenizer(path::String, vocab_size::Integer) -> Tokenizer

Load a tokenizer from a binary file.

Read vocabulary, token scores, and metadata from the binary file at path. The file format expects:

  • Int32: maximum token length
  • For each of vocab_size tokens:
    • Float32: token score
    • Int32: string length (n)
    • n bytes: token string (UTF-8)
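The layout above can be parsed with a few `read` calls. A minimal sketch, assuming only the fields listed above (the actual constructor also builds `byte_pieces` and the sorted vocabulary; `read_vocab` is a hypothetical name):

```julia
# Hypothetical standalone reader for the binary layout described above.
function read_vocab(path::String, vocab_size::Integer)
    vocab  = Vector{String}(undef, vocab_size)
    scores = Vector{Float32}(undef, vocab_size)
    open(path, "r") do io
        max_token_length = read(io, Int32)   # Int32: maximum token length
        for i in 1:vocab_size
            scores[i] = read(io, Float32)    # Float32: token score
            n = read(io, Int32)              # Int32: string length (n)
            vocab[i] = String(read(io, n))   # n bytes: token string (UTF-8)
        end
    end
    return vocab, scores
end
```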

Arguments

  • path::String: Path to the tokenizer binary file.
  • vocab_size::Integer: Expected number of vocabulary entries.

Returns

  • Tokenizer: A tokenizer ready for encoding/decoding text.
source
Llama2.TokenIndexType
TokenIndex(str::String, id::Integer)

A vocabulary token with its string representation and numeric identifier.

Stores a token string and its associated ID for efficient sorting and lookup in the tokenizer vocabulary.

Throw a DomainError if id ≤ 0.

Arguments

  • str::String: Token string (e.g., "Julia").
  • id::Integer: Token identifier (converted to Int16, must be > 0).

Examples

julia> Llama2.TokenIndex("Julia", 1)
Llama2.TokenIndex("Julia", 1)

julia> Llama2.TokenIndex("Julia", -1)
ERROR: DomainError with Token index must be > 0.
[...]

Developer Notes

This is an internal struct.

source
Base.islessMethod
Base.isless(first_token::TokenIndex, second_token::TokenIndex) -> Bool

Compare two tokens lexicographically by their string values.

Return true if first_token.str < second_token.str in lexicographic order. Intended for use as the lt argument to sorting functions.

Examples

julia> isless(Llama2.TokenIndex("A", 1), Llama2.TokenIndex("B", 2))
true

julia> isless(Llama2.TokenIndex("B", 1), Llama2.TokenIndex("A", 2))
false
source
Llama2.str_lookupFunction
str_lookup(str::String, sorted_vocab::Vector{TokenIndex}) -> Int16

Search for str within the sorted vocabulary sorted_vocab using binary search for efficient lookup. Return the corresponding token ID if a match is found; otherwise return -1.

Arguments

  • str::String: Token string to search for.
  • sorted_vocab::Vector{TokenIndex}: Vocabulary sorted lexicographically by string.

Returns

  • Int16: Token ID if found, -1 otherwise.

Examples

julia> Llama2.str_lookup("aa", [Llama2.TokenIndex("aa", 1), Llama2.TokenIndex("bb", 2)])
1

julia> Llama2.str_lookup("ba", [Llama2.TokenIndex("aa", 1), Llama2.TokenIndex("bb", 2)])
-1
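The lookup can be sketched as a standard binary search over the lexicographically sorted vocabulary. This is a simplified stand-in, not the package's internal code; `TokenIndexSketch` and `str_lookup_sketch` are hypothetical names modeling the documented behavior:

```julia
# Simplified stand-in for the package's internal TokenIndex.
struct TokenIndexSketch
    str::String
    id::Int16
end

# Binary search by string; return the token ID, or -1 when absent.
function str_lookup_sketch(str::String, sorted_vocab::Vector{TokenIndexSketch})
    lo, hi = 1, length(sorted_vocab)
    while lo <= hi
        mid = (lo + hi) >>> 1
        entry = sorted_vocab[mid].str
        if entry == str
            return sorted_vocab[mid].id
        elseif entry < str
            lo = mid + 1
        else
            hi = mid - 1
        end
    end
    return Int16(-1)
end
```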
source
Llama2.encodeFunction
encode(tokenizer::Tokenizer, text::String) -> Vector{Integer}

Convert text into a sequence of token IDs using tokenizer. First ensure the tokenizer's vocabulary is sorted, then encode each character into its corresponding ID. After that, iteratively merge the adjacent token pair with the highest score into a longer token until no more merges are possible. Return the final token ID sequence.

Arguments

  • tokenizer::Tokenizer: The tokenizer with vocabulary and merge scores.
  • text::String: Input text to encode.

Returns

  • Vector{Integer}: Sequence of token IDs representing the encoded text.
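The merge loop described above can be sketched as a greedy BPE pass. This simplified version works on plain strings and a score dictionary rather than the tokenizer's internal representation; `merge_pairs` is a hypothetical name:

```julia
# Greedy BPE merging sketch: repeatedly fuse the adjacent pair whose
# concatenation has the highest vocabulary score, until none qualifies.
function merge_pairs(tokens::Vector{String}, scores::Dict{String,Float32})
    while true
        best_score = -Inf32
        best_idx = 0
        for i in 1:length(tokens)-1
            merged = tokens[i] * tokens[i+1]
            s = get(scores, merged, -Inf32)
            if s > best_score
                best_score = s
                best_idx = i
            end
        end
        best_idx == 0 && return tokens          # no mergeable pair remains
        tokens[best_idx] = tokens[best_idx] * tokens[best_idx+1]
        deleteat!(tokens, best_idx + 1)
    end
end
```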
source

Transformer

Llama2.TransformerType
Transformer(config::Config, weights::TransformerWeights)

Create a Transformer with data from config and weights. The RunState containers are initialized empty, with only their dimensions set to match the configuration.

Developer Notes

This is an internal struct.

source
Llama2.ConfigType
Config

Create a Config containing seven Int32 fields. These hold the metadata needed to read values from an input file.

Developer Notes

This is an internal struct.

source
Llama2.TransformerWeightsType
TransformerWeights

Create a TransformerWeights containing several Float32 containers. These describe actual weight data that is loaded from an input file.

Developer Notes

This is an internal struct.

source
Llama2.RunStateType
RunState

Create a RunState containing several Float32 containers. These reflect the state of the Transformer at run-time.

Developer Notes

This is an internal struct.

source

forward!

Llama2.forward!Function
forward!(transformer::Transformer, token::Int32, pos::Int32)

Perform a single forward pass through the transformer.

Compute logits for the next token prediction given the current token at position pos in the sequence. Updates the internal KV-cache in transformer.state with keys and values from this forward pass.

Arguments

  • transformer::Transformer: The model (modified in-place via KV-cache updates).
  • token::Int32: Current input token index (must be in range 1:vocab_size).
  • pos::Int32: Position in the sequence (1-indexed, must be ≤ seq_len).

Returns

  • Vector{Float32}: Logits over the vocabulary for next token prediction (length = vocab_size).

Examples

model = Transformer("model.bin")
token = Int32(1)  # BOS token
pos = Int32(1)

logits = forward!(model, token, pos)
next_token = argmax(logits)
source
Llama2.rmsnormFunction
rmsnorm(x, w) -> Vector{Float32}

Calculate the RMS normalization of x scaled by w: the elementwise product w .* x divided by the root mean square of x.

Arguments

  • x::AbstractVector{Float32}: Input vector to normalize.
  • w::AbstractVector{Float32}: Scale weights (must have same length as x).

Returns

  • Vector{Float32}: Normalized and scaled output.

Examples

julia> x = [1.0f0, 2, 3];

julia> w = [1.0f0, 1, 1];

julia> o = Llama2.rmsnorm(x, w)
3-element Vector{Float32}:
 0.46290955
 0.9258191
 1.3887286
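The normalization can be reproduced in a few lines. This is a sketch assuming the common ε = 1f-5 stabilizer from llama2.c, which matches the output shown above; the package's exact ε may differ:

```julia
# RMS normalization sketch: scale x by w ./ sqrt(mean(x.^2) + eps).
function rmsnorm_sketch(x::AbstractVector{Float32}, w::AbstractVector{Float32})
    ss = sum(abs2, x) / length(x) + 1f-5   # mean square plus stabilizer
    return w .* x ./ sqrt(ss)
end
```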
source
Llama2.softmax!Function
softmax!(x) -> Vector{Float32}

Update the layer output x in place with the softmax of the input.

Transform logits into a probability distribution by exponentiating and normalizing. Uses the numerically stable formulation: x[i] = exp(x[i] - max(x)) / sum(exp(x .- max(x))).

The input vector is modified in-place and also returned.

Arguments

  • x::AbstractVector{Float32}: Logits to transform (modified in-place).

Returns

  • Vector{Float32}: The same vector x, now containing probabilities that sum to 1.

Examples

julia> x = [-1.0f0,0,1];

julia> Llama2.softmax!(x);

julia> x
3-element Vector{Float32}:
 0.09003057
 0.24472848
 0.66524094
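The numerically stable formulation above translates directly into code. A sketch, not the package's exact implementation (`softmax_sketch!` is a hypothetical name):

```julia
# In-place numerically stable softmax: subtract the max before exponentiating,
# then normalize so the entries sum to 1.
function softmax_sketch!(x::AbstractVector{Float32})
    m = maximum(x)
    x .= exp.(x .- m)
    x ./= sum(x)
    return x
end
```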
source

Sampler

Llama2.SamplerType
Sampler

Stateful sampler for converting model logits into token indices.

Encapsulate sampling configuration (temperature, nucleus sampling) and internal buffers needed for efficient token sampling.

Constructors

  • Sampler(vocab_size::Int32, temperature::Float32, topp::Float32, rng_seed::Int128) Construct a sampler with automatically allocated internal buffers for nucleus (top-p) sampling.

  • Sampler(vocab_size::Int32, probindex::Vector{ProbIndex}, temperature::Float32, topp::Float32, rng_state::Int128) Construct a sampler using a caller-provided probindex workspace buffer. The buffer must have length at least vocab_size.

Fields

  • vocab_size::Int32: Vocabulary size
  • temperature::Float32: Sampling temperature (0 = greedy)
  • topp::Float32: Nucleus sampling threshold
  • rng_state::Int128: Random number generator state

Sampler is callable and can be applied to a vector of logits to obtain the next token index.
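The callable behavior can be sketched roughly as follows: greedy argmax at temperature 0, otherwise temperature scaling, softmax, and a CDF-based draw. The helper name `sample_sketch` and the exact branching are assumptions, not the package's actual internals:

```julia
# Rough sketch of one sampler call over a logits vector.
function sample_sketch(logits::Vector{Float32}, temperature::Float32,
                       topp::Float32, coin::Float32)
    temperature == 0f0 && return argmax(logits)   # greedy decoding
    probs = logits ./ temperature                 # temperature scaling
    probs .= exp.(probs .- maximum(probs))
    probs ./= sum(probs)                          # softmax
    if !(0f0 < topp < 1f0)
        # multinomial sample: first index whose CDF exceeds coin
        cdf = 0f0
        for (i, p) in pairs(probs)
            cdf += p
            coin < cdf && return i
        end
        return length(probs)                      # roundoff fallback
    else
        # nucleus (top-p) sampling would restrict to the smallest
        # high-probability set first; simplified here to greedy
        return argmax(probs)
    end
end
```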

source
Llama2.ProbIndexType
ProbIndex

Create a ProbIndex from an AbstractFloat and an Integer.

prob contains a probability and is stored as a Float32; index is stored as an Int32. Throw a DomainError if index ≤ 0.

Examples

julia> Llama2.ProbIndex(1.0, 2)
ProbIndex(1.0f0, 2)

julia> Llama2.ProbIndex(1.0, -1)
ERROR: DomainError with Prob index must be > 0.
[...]
source
Base.islessMethod
isless(a::ProbIndex, b::ProbIndex) -> Bool

Comparison function for ordering ProbIndex values by probability.

Return true if a.prob < b.prob. Intended for use as the lt argument to sorting routines.

source
Llama2.sample_multFunction
sample_mult(probabilities, coin) -> Int

Sample an index from a multinomial distribution.

Given a vector of normalized probabilities and a uniform random number coin ∈ [0, 1), returns the first index whose cumulative probability exceeds coin.

Arguments

  • probabilities::Vector{Float32}: Probability mass function (must sum to 1).
  • coin::Float32: Uniform random number in [0, 1).

Returns

  • Index of the sampled element.

If numerical roundoff prevents an early return, the last index is returned.
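The CDF walk described above can be sketched as (`sample_mult_sketch` is a hypothetical stand-in for the internal function):

```julia
# Multinomial sampling sketch: walk the CDF until it passes coin.
function sample_mult_sketch(probabilities::Vector{Float32}, coin::Float32)
    cdf = 0f0
    for (i, p) in pairs(probabilities)
        cdf += p
        coin < cdf && return i
    end
    return length(probabilities)   # roundoff fallback: return last index
end
```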

source
Llama2.sample_toppFunction
sample_topp(probabilities, topp, probindex, coin) -> Int

Sample an index using nucleus (top-p) sampling.

Selects the smallest set of tokens whose cumulative probability mass exceeds topp, then samples from this restricted distribution using the provided random number.

Arguments

  • probabilities::Vector{Float32}: Normalized probability distribution.
  • topp::Float32: Cumulative probability threshold (0 < topp < 1).
  • probindex::Vector{ProbIndex}: Preallocated workspace for sorting and indexing candidate tokens.
  • coin::Float32: Uniform random number in [0, 1).

Returns

  • Index of the sampled token.

The probindex buffer is mutated and reused to avoid allocations.
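The nucleus restriction can be sketched as below. This simplified version sorts all candidates with `sortperm` instead of reusing the probindex workspace, so it allocates where the real function does not; `sample_topp_sketch` is a hypothetical name:

```julia
# Top-p sketch: keep the smallest high-probability prefix whose mass
# exceeds topp, then sample inside it with the given coin.
function sample_topp_sketch(probabilities::Vector{Float32}, topp::Float32,
                            coin::Float32)
    order = sortperm(probabilities; rev=true)   # indices by descending prob
    cum = 0f0
    last = length(order)
    for (k, idx) in pairs(order)
        cum += probabilities[idx]
        if cum > topp
            last = k                            # nucleus boundary found
            break
        end
    end
    # sample within the truncated distribution, rescaling coin by its mass
    r = coin * cum
    cdf = 0f0
    for k in 1:last
        cdf += probabilities[order[k]]
        r < cdf && return order[k]
    end
    return order[last]                          # roundoff fallback
end
```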

source