Introduction
awk is a powerful text-processing tool, but beginners often struggle with the match function and regular expressions. In particular, once you need to go beyond simple searching and ask "where did it match?" and "how long was the match?", a vague understanding limits what you can do in practice.
This article covers the mechanics of the built-in variables RSTART and RLENGTH, the key differences from the ~ operator, practical use in loops, Japanese text handling, and performance improvements, all organized for real-world work. The goal is to take you beyond simple string searching and into practical, advanced text processing.
Reference: GNU awk
The Role and Mechanics of Built-in Variables RSTART and RLENGTH
Creating the File
cat << 'EOF' > input.txt
Hello 123 World
abc456def
no numbers here
EOF
Command
awk '{ if (match($0, /[0-9]+/)) print $0, "-> RSTART=" RSTART ", RLENGTH=" RLENGTH }' input.txt
Output
Hello 123 World -> RSTART=7, RLENGTH=3
abc456def -> RSTART=4, RLENGTH=3
How It Works
| Item | Description |
|---|---|
| match function | Searches a string for the leftmost, longest substring matching a regular expression and returns its start position |
| RSTART | The start position of the match (1-based) |
| RLENGTH | The length of the matched string |
| No match | RSTART=0, RLENGTH=-1 |
| Update timing | Overwritten on every call to match, whether or not it finds a match |
Explanation
The key point is that the match function stores the match position and length in the built-in variables RSTART and RLENGTH, which can be reused in subsequent processing.
This allows substring extraction and position-based processing to be written concisely.
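As a minimal sketch of that reuse, the following splits each line into the text before, inside, and after the first run of digits. The function name split_around_number is a hypothetical helper introduced only for this illustration.

```shell
# split_around_number: hypothetical helper name, for illustration only.
# Prints the text before, inside, and after the first run of digits.
split_around_number() {
  awk '{
    if (match($0, /[0-9]+/)) {
      before = substr($0, 1, RSTART - 1)      # everything left of the match
      hit    = substr($0, RSTART, RLENGTH)    # the matched text itself
      after  = substr($0, RSTART + RLENGTH)   # everything right of the match
      printf "[%s][%s][%s]\n", before, hit, after
    }
  }'
}

printf 'Hello 123 World\nabc456def\nno numbers here\n' | split_around_number
# [Hello ][123][ World]
# [abc][456][def]
```

The third input line produces no output because match returns 0 for it.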
The Decisive Difference Between match and the ~ Operator (Regex Match) and How to Use Each
Creating the File
cat << 'EOF' > input.txt
apple 123
banana abc
cherry 456def
date 789
EOF
Command
awk '{ if (match($0, /[0-9]+/)) print $0, "-> match at", RSTART, "length", RLENGTH }' input.txt
Output
apple 123 -> match at 7 length 3
cherry 456def -> match at 8 length 3
date 789 -> match at 6 length 3
Command
awk '{ if ($0 ~ /[0-9]+/) print $0 }' input.txt
Output
apple 123
cherry 456def
date 789
How It Works
| Item | match function | ~ operator |
|---|---|---|
| Role | Retrieves the position and length of a regex match | Determines whether a regex matches |
| Return value | Match position (0 if no match) | True (1) / False (0) |
| Additional info | RSTART and RLENGTH are available | None |
| Use case | Position and substring analysis | Simple conditional branching |
| Flexibility | High (position and length available) | Low (yes/no only) |
Explanation
match is suited for analysis where you need to know "where" the match occurred, while ~ is suited for conditional branching where you only need to know "whether" something matched.
The basic rule is to use each based on your intended purpose.
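The two can also be combined in one program. The sketch below uses ~ as a yes/no filter on the first field and match only where the position is needed. The name report_numbers and the [b-d] filter are assumptions chosen for illustration.

```shell
# report_numbers is a hypothetical helper name used only for this sketch.
report_numbers() {
  awk '$1 ~ /^[b-d]/ {                 # ~ : yes/no filter on the first field
    if (match($0, /[0-9]+/)) {         # match : locate the digits
      print $1 ": " substr($0, RSTART, RLENGTH) " at " RSTART
    } else {
      print $1 ": no digits"
    }
  }'
}

printf 'apple 123\nbanana abc\ncherry 456def\ndate 789\n' | report_numbers
# banana: no digits
# cherry: 456 at 8
# date: 789 at 6
```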
Advanced Pattern Matching Using Regex Metacharacters
Creating the File
cat << 'EOF' > input.txt
apple 123
banana 456
cherry abc
date 789xyz
EOF
Command
awk 'match($0, /[a-z]+ [0-9]+/) { print "MATCH:", $0 }' input.txt
Output
MATCH: apple 123
MATCH: banana 456
MATCH: date 789xyz
Command
awk 'match($0, /[0-9]+$/) { print "END NUMBER:", $0 }' input.txt
Output
END NUMBER: apple 123
END NUMBER: banana 456
Command
awk 'match($0, /^[a-z]+ [0-9]+$/) { print "STRICT MATCH:", $0 }' input.txt
Output
STRICT MATCH: apple 123
STRICT MATCH: banana 456
How It Works
| Element | Description |
|---|---|
| match() | A function that returns the position where a string matches a regular expression (0, which is false, when it does not match) |
| [a-z]+ | One or more consecutive lowercase alphabetic characters |
| [0-9]+ | One or more consecutive digits |
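| ^ | Anchors the match to the start of the string being searched (here, the start of the line) |
| $ | Anchors the match to the end of the string being searched (here, the end of the line) |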
| $0 | An awk variable representing the entire line |
| /pattern/ | Regular expression literal |
Explanation
When match() is used as a pattern, awk selects a line whenever the return value is nonzero, so each regex above acts as a per-line filter.
Combining anchors (^, $) and quantifiers (+) allows you to specify precise conditions.
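To illustrate how the anchors change the result, here is a small sketch that classifies each line as a strict match, a partial match, or no match. The function name classify_lines and the three labels are assumptions made for this example.

```shell
# classify_lines: hypothetical helper name, for illustration only.
classify_lines() {
  awk '{
    if (match($0, /^[a-z]+ [0-9]+$/)) {    # whole line must be "word number"
      print "valid:", $0
    } else if (match($0, /[0-9]+/)) {      # digits exist somewhere in the line
      print "partial:", $0, "(digits at " RSTART ")"
    } else {
      print "invalid:", $0
    }
  }'
}

printf 'apple 123\nbanana 456\ncherry abc\ndate 789xyz\n' | classify_lines
# valid: apple 123
# valid: banana 456
# invalid: cherry abc
# partial: date 789xyz (digits at 6)
```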
How to Use the match Function in Loop Processing (while)
Creating the File
cat << 'EOF' > input.txt
abc123def456ghi789
EOF
Command
awk '{
while (match($0, /[0-9]+/)) {
print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + RLENGTH)
}
}' input.txt
Output
123
456
789
How It Works
| Element | Description |
|---|---|
| match function | Returns the position where the regex matched |
| RSTART | Match start position |
| RLENGTH | Length of the match |
| substr | Extracts the matched portion |
| while loop | Repeats until no more matches are found |
| Updating $0 | Trims the string to target everything after the previous match |
Explanation
match detects the numeric portion, and all occurrences are extracted by repeating with while.
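The key point is that $0 is rewritten so that each search targets only the text after the previous match.
Note that assigning to $0 discards the original line and makes awk re-split the fields, so copy $0 to a variable first if the rest of the program still needs the record.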
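A variant of the same loop, sketched here under the assumption that the original line should stay intact, works on a copy and also reports where each number sits in the original line. The names list_numbers, rest, and offset are hypothetical.

```shell
# list_numbers: hypothetical helper name, for illustration only.
list_numbers() {
  awk '{
    rest = $0      # work on a copy so $0 and the fields stay intact
    offset = 0     # characters already consumed from the original line
    while (match(rest, /[0-9]+/)) {
      print substr(rest, RSTART, RLENGTH) " at " (offset + RSTART)
      offset += RSTART + RLENGTH - 1
      rest = substr(rest, RSTART + RLENGTH)   # continue after this match
    }
  }'
}

printf 'abc123def456ghi789\n' | list_numbers
# 123 at 4
# 456 at 10
# 789 at 16
```

The loop terminates because [0-9]+ never matches an empty string. A pattern that can match zero characters, such as [0-9]*, would need an extra guard.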
Extracting Only Specific Timestamps or IDs from Log Files
Creating the File
cat << 'EOF' > input.txt
2026-04-29 10:15:23 ID:1001 INFO Login success
2026-04-29 10:16:10 ID:1002 ERROR Failed attempt
2026-04-29 10:17:45 ID:1003 INFO Logout
2026-04-29 10:18:00 ID:1002 INFO Login success
EOF
Command
awk 'match($0, /ID:1002/)' input.txt
Output
2026-04-29 10:16:10 ID:1002 ERROR Failed attempt
2026-04-29 10:18:00 ID:1002 INFO Login success
Command
awk 'match($0, /^2026-04-29 10:17/)' input.txt
Output
2026-04-29 10:17:45 ID:1003 INFO Logout
How It Works
| Element | Description |
|---|---|
| awk | A command that processes text line by line |
| match() | Returns a nonzero position when the regex matches; used as a pattern with no action, awk then prints the line |
| $0 | Represents the entire line |
| /ID:1002/ | Matches a specific ID |
| /^2026-04-29 10:17/ | Matches lines starting with the specified time |
Explanation
Using match() allows you to flexibly extract only lines matching a regular expression.
It can handle complex conditions such as IDs and timestamps.
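When only the values are needed rather than whole lines, match can be combined with substr. The following is a sketch that prints the ID and timestamp of each line, assuming the log layout shown above. The name extract_id_and_time is hypothetical.

```shell
# extract_id_and_time: hypothetical helper name, for illustration only.
extract_id_and_time() {
  awk '{
    ts = ""; id = ""
    # Timestamp at the start of the line, e.g. 2026-04-29 10:16:10
    if (match($0, /^[0-9]+-[0-9]+-[0-9]+ [0-9]+:[0-9]+:[0-9]+/))
      ts = substr($0, RSTART, RLENGTH)
    # ID value: skip the 3-character "ID:" prefix of the match
    if (match($0, /ID:[0-9]+/))
      id = substr($0, RSTART + 3, RLENGTH - 3)
    if (ts != "" && id != "")
      print id, ts
  }'
}

printf '2026-04-29 10:16:10 ID:1002 ERROR Failed attempt\n2026-04-29 10:17:45 ID:1003 INFO Logout\n' | extract_id_and_time
# 1002 2026-04-29 10:16:10
# 1003 2026-04-29 10:17:45
```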
Notes and Settings for Using the match Function with Multibyte Characters (Japanese)
Creating the File
cat << 'EOF' > input.txt
こんにちは123
abc日本語456
EOF
Command
awk '{ if (match($0, /[0-9]+/)) print substr($0, RSTART, RLENGTH) }' input.txt
Output
123
456
Command
export LC_ALL=ja_JP.UTF-8
awk '{ if (match($0, /日本語/)) print substr($0, RSTART, RLENGTH) }' input.txt
Output
日本語
How It Works
| Item | Description |
|---|---|
| match function | Stores the position of the regex match in RSTART and its length in RLENGTH |
| RSTART | Match start position (1-based) |
| RLENGTH | Number of characters matched |
| substr | Retrieves the matched portion using RSTART and RLENGTH |
| Issue | In a non-UTF-8 locale such as C, each byte of a multibyte character is counted separately |
| Solution | Explicitly set a UTF-8 locale, such as LC_ALL=ja_JP.UTF-8 |
Explanation
Since awk's match is locale-dependent, run it under a UTF-8 locale when handling Japanese.
Without this setting, RSTART and RLENGTH are counted in bytes rather than characters, so positions after Japanese text no longer correspond to character positions.
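The difference can be observed by running the same match under two locales. This is a sketch only: digit_start is a hypothetical helper, and the character-based result assumes a multibyte-aware awk such as gawk and that the ja_JP.UTF-8 locale is installed on the system.

```shell
# digit_start: hypothetical helper. $1 is the locale to run awk under.
digit_start() {
  printf '%s\n' 'abc日本語456' |
    LC_ALL="$1" awk '{ match($0, /[0-9]+/); print RSTART }'
}

digit_start C
# With gawk or mawk this prints 13: "abc" is 3 bytes and the three
# Japanese characters are 3 bytes each in UTF-8, so the digits start at byte 13.

digit_start ja_JP.UTF-8
# With gawk and an installed ja_JP.UTF-8 locale this prints 7,
# because the six preceding characters are each counted once.
```

If the second call also prints 13, the locale is probably not available or the awk in use is byte-oriented.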
Strategies for Improving match Processing Speed When Handling Large Volumes of Data
Creating the File
seq -f "%.0f" 5000000 | sed 's/^/apple_/' > input.txt
Command
time awk '{ if (match($0, /apple_[0-9]+/)) print }' input.txt
Output
real 0m17.488s
user 0m8.886s
sys 0m3.553s
Command
time awk 'BEGIN { pattern = "apple_[0-9]+" } { if ($0 ~ pattern) print }' input.txt
Output
real 0m16.747s
user 0m8.320s
sys 0m3.529s
Command
time awk '{ if (index($0, "apple_")) print }' input.txt
Output
real 0m16.200s
user 0m7.591s
sys 0m3.467s
How It Works
| Method | Description | Speed | Characteristics |
|---|---|---|---|
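| match function | Runs the regex and also sets RSTART and RLENGTH | Slowest here (user 8.886s) | Flexible, but does extra work to record the match position |
| Variable-based regex | Stores the pattern as a string and tests it with ~ | Middle (user 8.320s) | Only a yes/no test; the string is a dynamic regex that awk must convert at run time |
| index function | Fixed-string search, no regex | Fastest here (user 7.591s) | Suited to literal substrings; does not check that digits follow |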
Explanation
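With large volumes of data, the per-line cost of the test adds up, so replacing match with a simpler test can help. In the measurements above the three variants differ by only about 1.3 seconds out of roughly 17, and much of the elapsed time goes to reading and printing five million lines, so treat the ranking as a rough indication. A regex literal such as /apple_[0-9]+/ is compiled once and reused rather than on every line, so the saving in the second command most likely comes from using ~ instead of match, not from storing the pattern in a variable.
In particular, if the condition is a plain substring, index avoids the regex engine altogether and was the fastest of the three here.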
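One way to keep the precision of the regex while still using index is to run the cheap fixed-string test first and call match only on lines that pass it. The sketch below shows the idea. The name filter_apples is hypothetical, and whether this actually saves time depends on the awk implementation and the data, so it should be measured rather than assumed.

```shell
# filter_apples: hypothetical helper name, for illustration only.
filter_apples() {
  # index() rejects lines without the literal prefix before the regex runs;
  # match() then confirms that digits follow and records the position.
  awk 'index($0, "apple_") && match($0, /apple_[0-9]+/) {
    print substr($0, RSTART, RLENGTH)
  }'
}

printf 'apple_1\npear_2\napple_x\nxx apple_30 yy\n' | filter_apples
# apple_1
# apple_30
```

The line apple_x passes the index test but is rejected by match, which shows why index alone is not equivalent to the regex.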
Handling Return Values When There Is No Match
Creating the File
cat << 'EOF' > input.txt
apple 100
banana 200
cherry 300
EOF
Command
awk '{ result = match($0, /orange/); print result }' input.txt
Output
0
0
0
Command
awk '{ result = match($0, /apple/); print result }' input.txt
Output
1
0
0
How It Works
| Condition | Return value of match | Meaning |
|---|---|---|
| Match found | A number of 1 or greater | The start position of the match |
| No match | 0 | No match was found |
| Stored in result | Numeric value | Can be used for conditional branching |
Explanation
Since match returns 0 when there is no match, you can handle it safely by writing branching logic based on that value.
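A minimal sketch of such branching follows. It prints the first number on each line, or a placeholder when there is none. The name price_or_default and the n/a placeholder are assumptions for this example.

```shell
# price_or_default: hypothetical helper name, for illustration only.
price_or_default() {
  awk '{
    pos = match($0, /[0-9]+/)     # 0 when nothing matched
    if (pos > 0) {
      print $1, substr($0, pos, RLENGTH)
    } else {
      print $1, "n/a"             # here RSTART is 0 and RLENGTH is -1
    }
  }'
}

printf 'apple 100\nbanana\ncherry 300\n' | price_or_default
# apple 100
# banana n/a
# cherry 300
```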
Summary: Mastering awk and match
Understanding match in awk is not just about knowing regular expressions. What matters is grasping the flow of processing: match runs, RSTART and RLENGTH are updated, and later code uses them.
Knowing details of the specification, such as the return value on failure and the locale dependence, widens the range of problems you can apply it to.
With the points introduced in this article in mind, you can move from beginner-level searching toward practical, production-level text processing.

