TokenIterators

TokenIterators.jl provides a concise syntax for writing lexers/tokenizers, along with a few built-in tokenizers.
It is fast (see Performance) and easy to use.

Important

This package is not designed for validating syntax, but for splitting text into tokens using simple rules.

Usage

using TokenIterators

t = JSONTokens(b"""{ "key": "value", "key2": -1e-7}""")

collect(t)
# 13-element Vector{Token{Base.CodeUnits{UInt8, String}, Symbol}}:
#  1:1 (1 byte) curly_open {
#  2:2 (1 byte) whitespace
#  3:7 (5 bytes) string \"key\"
#  8:8 (1 byte) colon :
#  9:9 (1 byte) whitespace
#  10:16 (7 bytes) string \"value\"
#  17:17 (1 byte) comma ,
#  18:18 (1 byte) whitespace
#  19:24 (6 bytes) string \"key2\"
#  25:25 (1 byte) colon :
#  26:26 (1 byte) whitespace
#  27:31 (5 bytes) number -1e-7
#  32:32 (1 byte) curly_close }

`TokenIterator` and `Token`

A TokenIterator (abstract type) iterates over Tokens (the smallest meaningful units of text/data) from any input of type T <: AbstractVector{UInt8}.

Both TokenIterator{T,K,S} and Token{T,K,S} are parameterized by:

The input data type T <: AbstractVector{UInt8}.
The type K used to label the kind of token.
The type S used to store tokenizer state (see Tokenizer State).

A Token acts like a view(data, i:j). It's defined as:

struct Token{T <: AbstractVector{UInt8}, K, S} <: AbstractVector{UInt8}
    data::T
    kind::K
    i::Int  # starting index
    j::Int  # closing index
    state::S
end
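
To make the "acts like a view" behaviour concrete, here is a minimal sketch using only Base. MiniToken is a hypothetical stand-in (it omits the state field) and is not the package's implementation; it only shows how indexing relative to i and j can be expressed through the AbstractVector interface.

```julia
# Hypothetical stand-in for Token, for illustration only (no state field).
struct MiniToken{T <: AbstractVector{UInt8}, K} <: AbstractVector{UInt8}
    data::T
    kind::K
    i::Int  # first index into data
    j::Int  # last index into data (inclusive)
end

# Minimal AbstractVector interface: length and indexing are relative to i.
Base.size(t::MiniToken) = (t.j - t.i + 1,)
Base.getindex(t::MiniToken, k::Int) = t.data[t.i + k - 1]

data = b"""{ "key": 1 }"""
tok = MiniToken(data, :string, 3, 7)

length(tok)             # 5
String(collect(tok))    # "\"key\""
tok == view(data, 3:7)  # true
```

Because the token only stores indices into data, creating one does not copy any bytes.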

Tip

StringViews.jl (used heavily in this package) can provide lightweight AbstractString views of the token via StringView(t).

Defining Iterators with Rules

Iteration works by deriving the next Token from the current one in the following steps:

Join all the candidate bytes of the next token (everything after the current token).
Identify the kind of the next token via a starting pattern.
Determine the last index of the next token via findnext on an ending pattern.
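
The three steps above can be sketched with Base functions alone. The helper next_span is hypothetical and not part of the package; it handles only whitespace and (unescaped) strings, and returns a plain (kind, i, j) tuple instead of a Token.

```julia
# Hypothetical helper, not part of TokenIterators.jl: returns (kind, i, j) for
# the token that follows a previous token ending at index prev_end.
function next_span(data::AbstractVector{UInt8}, prev_end::Int)
    # Step 1: every byte after the previous token is a candidate.
    i = prev_end + 1
    i > length(data) && return nothing
    rest = view(data, i:length(data))

    # Step 2: pick the kind from a starting pattern (here, the first byte).
    is_space = in(b" \t\r\n")
    if is_space(rest[1])
        # Step 3: the token ends just before the first non-whitespace byte.
        stop = findnext(!is_space, rest, 2)
        len = stop === nothing ? length(rest) : stop - 1
        return (:whitespace, i, i + len - 1)
    elseif rest[1] == UInt8('"')
        # Step 3: the token ends at the next double quote (escapes ignored).
        stop = findnext(==(UInt8('"')), rest, 2)
        len = stop === nothing ? length(rest) : stop
        return (:string, i, i + len - 1)
    else
        return (:unknown, i, i)  # fall back to a one-byte token
    end
end

data = b"""  "key": 1"""
next_span(data, 0)  # (:whitespace, 1, 2)
next_span(data, 2)  # (:string, 3, 7)
next_span(data, 7)  # (:unknown, 8, 8)
```

Searching from index 2 of the remaining bytes skips the byte that already matched the starting pattern, which is why the opening quote of a string is not mistaken for its closing quote.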

An Example: JSONTokens

See src/TokenIterators.jl for more TokenIterator implementations.

struct JSONTokens{T} <: TokenIterator{T, Symbol, Nothing}
    data::T
end

function next(o::JSONTokens, t::Token)
    '{' .. t && return t(:curly_open)
    '}' .. t && return t(:curly_close)
    '[' .. t && return t(:square_open)
    ']' .. t && return t(:square_close)
    ':' .. t && return t(:colon)
    ',' .. t && return t(:comma)
    't' .. t && return t(:True, 4)
    'f' .. t && return t(:False, 5)
    'n' .. t && return t(:null, 4)
    ∈(b"\t\n\r ") .. t && return t(:whitespace, Before(!∈(b"\t\n\r ")))
    '"' .. t && return t(:string, u('"'))
    ∈(b"-0123456789") .. t && return t(:number, Before(!∈(b"-+eE.0123456789")))
    return t(:unknown)
end

Note

The x .. tok syntax is shorthand for startswith(tok, x)
t(kind, end_pattern, start_idx=2) returns another Token with the given kind and ending position defined via findnext(end_pattern, tok, start_idx)

Mini-DSL

Starting Patterns

Type	Example	Match?
`UInt8`	`0x20` (space)	`x == token[1]`
`Char`	`' '`	`x == StringView(token)[1]`
`Function`	`∈(b" \t\r\n")`	`f(token[1])`
`UseStringView`	`s(isspace)`	`f(StringView(token)[1])`
`AbstractString`	`"abc"`	`startswith(StringView(token), x)`
`AbstractVector{UInt8}`	`b"<a"`	`x == token[1:length(x)]`
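
One way to read the table is as a set of methods selected by the type of the pattern. The sketch below uses a hypothetical function matches_start and Base only; it converts to String where the package uses StringView, and it leaves out UseStringView. It is an illustration of the dispatch idea, not the package's code.

```julia
# Hypothetical matches_start, mirroring the table rows with Base-only code.
matches_start(x::UInt8, tok) = !isempty(tok) && tok[1] == x
matches_start(x::Char, tok) = !isempty(tok) && first(String(collect(tok))) == x
matches_start(f::Function, tok) = !isempty(tok) && f(tok[1])
matches_start(x::AbstractString, tok) = startswith(String(collect(tok)), x)
matches_start(x::AbstractVector{UInt8}, tok) =
    length(tok) >= length(x) && view(tok, 1:length(x)) == x

tok = b"<a href>"
matches_start(0x3c, tok)               # true, 0x3c is '<'
matches_start('<', tok)                # true
matches_start(in(b" \t\r\n"), tok)     # false
matches_start("<a", tok)               # true
matches_start(b"<b", tok)              # false
```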

Ending Patterns

Type	Example	Find Last Index
`UInt8`	`0x20` (space)	`findnext(==(x), token, 2)`
`Char`	`' '`	`findnext(==(x), StringView(token), 2)`
`Before`	`Before("<")`	`findnext(==('<'), token, 2) - 1`
`Last`	`Last("-->")`	`last(findnext("-->", StringView(token), 2))`
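
The ending patterns can be read the same way. In this sketch, last_index and MiniBefore are hypothetical names, and running to the end of the input when nothing matches is an assumption made for the example rather than documented package behaviour.

```julia
# Hypothetical stand-in for Before: wraps another ending pattern.
struct MiniBefore{P}; pattern::P; end

# The search starts at index 2 so the byte that matched the starting
# pattern is never treated as the end of the token.
last_index(x::UInt8, tok) = findnext(==(x), tok, 2)
last_index(f::Function, tok) = findnext(f, tok, 2)
function last_index(b::MiniBefore, tok)
    k = last_index(b.pattern, tok)
    # Assumption: if the pattern never matches, the token runs to the end.
    return k === nothing ? length(tok) : k - 1
end

tok = b"abc  def"
last_index(0x20, tok)                     # 4
last_index(MiniBefore(0x20), tok)         # 3
last_index(MiniBefore(!in(b"abc")), tok)  # 3
last_index(MiniBefore(0x7a), tok)         # 8, no 'z' in the input
```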

Tokenizer State

Any type can be used to store state for a TokenIterator. State is changed with the | operator:

token | function_of_state

We provide several types for common state functions:

# Example: Adding two states to a Set{Symbol} or Vector{Symbol}
token | Push!(:state_to_add) | Push!(:another_state_to_add)

# Example: Removing a state
token | Pop!()

# Example: deleting a state from a Set{Symbol}
token | Delete!(:state_to_remove)
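
As a rough sketch of how such an operator could be wired up, the example below defines | for a hypothetical StatefulToken type. MiniPush! and MiniPop! are stand-ins for the package's Push! and Pop!, and mutating the state in place is an assumption made for the example.

```julia
# Hypothetical stand-ins; the package's own types may differ in detail.
struct StatefulToken{S}
    kind::Symbol
    state::S
end

# Callable structs that describe a state change.
struct MiniPush!{T}; x::T; end
(p::MiniPush!)(state) = push!(state, p.x)

struct MiniPop! end
(::MiniPop!)(state) = pop!(state)

# token | f applies f to the token's state and returns the token,
# so several updates can be chained left to right.
Base.:|(t::StatefulToken, f) = (f(t.state); t)

tok = StatefulToken(:tag_open, Symbol[])
tok | MiniPush!(:in_tag) | MiniPush!(:in_attr) | MiniPop!()
tok.state  # [:in_tag]
```

Returning the token from | is what makes the chained form possible, since each step hands the same token to the next state function.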

Performance

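TokenIterators is fast and allocation-free in the benchmark below: tokenizing a 3.648 MiB JSON file takes a median of 16.6 ms (roughly 220 MiB/s) with zero allocations on an Apple M1 Pro: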

using TokenIterators, BenchmarkTools

versioninfo()
# Julia Version 1.11.6
# Commit 9615af0f269 (2025-07-09 12:58 UTC)
# Build Info:
#   Official https://julialang.org/ release
# Platform Info:
#   OS: macOS (arm64-apple-darwin24.0.0)
#   CPU: 10 × Apple M1 Pro
#   WORD_SIZE: 64
#   LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
# Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
# Environment:
#   JULIA_NUM_THREADS = auto

data = read(download("https://github.com/plotly/plotly.js/raw/v3.0.1/dist/plot-schema.json"));

t = JSONTokens(data)
# JSONTokens (3.648 MiB)

@benchmark sum(t.kind == :string for t in $t)
# BenchmarkTools.Trial: 301 samples with 1 evaluation per sample.
#  Range (min … max):  16.554 ms … 16.931 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     16.638 ms              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   16.657 ms ± 72.185 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#       ▁▄  ▂▅▃▇▂ ▄▄█ ▅
#   ▃▄▃▆██▄▇██████████████▇▅▃▃▁▃▁▄▄▁▃▄▆▃█▃▁▃▄▃▅▃▁▄▄▄▃▆▃▁▃▃▁▁▃▁▃ ▄
#   16.6 ms         Histogram: frequency by time        16.9 ms <

#  Memory estimate: 0 bytes, allocs estimate: 0.
