
Mastering awk’s match Function: A Practical Approach

updated: 2026/05/05 created: 2026/04/30

Introduction

awk is a powerful tool specialized in text processing, but one area where beginners often struggle is how to use the match function and regular expressions. In particular, when you need to go beyond simple searching and work with information such as "where did it match?" and "how long was the match?", a vague understanding will limit your ability to apply it in practice.

This article covers how the built-in variables RSTART and RLENGTH work, the key differences from the ~ operator, use in loop processing, Japanese text handling, and performance, all organized for use in real-world work. The goal is to take you beyond simple string searching and into practical, advanced text processing.

Reference: GNU awk

The Role and Mechanics of Built-in Variables RSTART and RLENGTH

Creating the File

cat << 'EOF' > input.txt
Hello 123 World
abc456def
no numbers here
EOF

Command

awk '{ if (match($0, /[0-9]+/)) print $0, "-> RSTART=" RSTART ", RLENGTH=" RLENGTH }' input.txt

Output

Hello 123 World -> RSTART=7, RLENGTH=3
abc456def -> RSTART=4, RLENGTH=3

How It Works

Item | Description
match function | Searches a string for the position matching a regular expression
RSTART | The start position of the match (1-based)
RLENGTH | The length of the matched string
No match | RSTART=0, RLENGTH=-1
Update timing | Automatically updated when match is executed

Explanation

The key point is that the match function stores the match position and length in the built-in variables RSTART and RLENGTH, which stay available to subsequent statements until match is called again.
This allows substring extraction and position-based processing to be written concisely.
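
As a minimal sketch of that reuse, the following splits each line into the text before, inside, and after the first run of digits. The function name split_around_digits is a hypothetical helper introduced here for illustration; only match, substr, RSTART, and RLENGTH come from awk itself.

```shell
# Hypothetical helper: split each input line around its first run of digits.
split_around_digits() {
  awk '{
    if (match($0, /[0-9]+/)) {
      before = substr($0, 1, RSTART - 1)      # text preceding the match
      hit    = substr($0, RSTART, RLENGTH)    # the matched text itself
      after  = substr($0, RSTART + RLENGTH)   # text following the match
      printf "[%s][%s][%s]\n", before, hit, after
    } else {
      # After a failed match, RSTART is 0 and RLENGTH is -1.
      printf "no match: RSTART=%d RLENGTH=%d\n", RSTART, RLENGTH
    }
  }'
}

printf 'Hello 123 World\nabc456def\nno numbers here\n' | split_around_digits
```

For the three sample lines this prints [Hello ][123][ World], then [abc][456][def], then no match: RSTART=0 RLENGTH=-1.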

The Decisive Difference Between match and the ~ Operator (Regex Match) and How to Use Each

Creating the File

cat << 'EOF' > input.txt
apple 123
banana abc
cherry 456def
date 789
EOF

Command

awk '{ if (match($0, /[0-9]+/)) print $0, "-> match at", RSTART, "length", RLENGTH }' input.txt

Output

apple 123 -> match at 7 length 3
cherry 456def -> match at 8 length 3
date 789 -> match at 6 length 3

Command

awk '{ if ($0 ~ /[0-9]+/) print $0 }' input.txt

Output

apple 123
cherry 456def
date 789

How It Works

Item | match function | ~ operator
Role | Retrieves the position and length of a regex match | Determines whether a regex matches
Return value | Match position (0 if no match) | True (1) / False (0)
Additional info | RSTART and RLENGTH are available | None
Use case | Position and substring analysis | Simple conditional branching
Flexibility | High | Simple

Explanation

match is suited for analysis where you need to know "where" the match occurred, while ~ is suited for conditional branching where you only need to know "whether" something matched.
The basic rule is to use each based on your intended purpose.
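
One way to picture the division of labor is a program that uses both: !~ for the plain yes/no test and match only where the position is needed. This is a sketch under that assumption; classify_lines is a hypothetical name chosen for this example.

```shell
# Hypothetical helper: report whether a line has digits and, if so, where.
classify_lines() {
  awk '{
    if ($0 !~ /[0-9]/) {
      # Only a yes/no answer is needed, so the ~ family is enough.
      print $0 ": no digits"
    } else {
      # The position is needed, so call match and read RSTART/RLENGTH.
      match($0, /[0-9]+/)
      print $0 ": digits \"" substr($0, RSTART, RLENGTH) "\" at column " RSTART
    }
  }'
}

printf 'apple 123\nbanana abc\ncherry 456def\ndate 789\n' | classify_lines
```

With the sample data, banana abc is reported as having no digits, and the other three lines report columns 7, 8, and 6, the same positions shown in the match output above.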

Advanced Pattern Matching Using Regex Metacharacters

Creating the File

cat << 'EOF' > input.txt
apple 123
banana 456
cherry abc
date 789xyz
EOF

Command

awk 'match($0, /[a-z]+ [0-9]+/) { print "MATCH:", $0 }' input.txt

Output

MATCH: apple 123
MATCH: banana 456
MATCH: date 789xyz

Command

awk 'match($0, /[0-9]+$/) { print "END NUMBER:", $0 }' input.txt

Output

END NUMBER: apple 123
END NUMBER: banana 456

Command

awk 'match($0, /^[a-z]+ [0-9]+$/) { print "STRICT MATCH:", $0 }' input.txt

Output

STRICT MATCH: apple 123
STRICT MATCH: banana 456

How It Works

Element | Description
match() | Returns the 1-based position where the regex first matches (0 if none); used as a pattern, a nonzero result selects the line
[a-z]+ | One or more consecutive lowercase alphabetic characters
[0-9]+ | One or more consecutive digits
^ | Anchors the match to the start of the string (here, the line)
$ | Anchors the match to the end of the string (here, the line)
$0 | An awk variable representing the entire line
/pattern/ | Regular expression literal

Explanation

Used as a pattern, match() selects every line for which it returns a nonzero position, so the regex alone decides which lines are printed.
Combining anchors (^, $) and quantifiers (+) allows you to specify precise conditions.
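
To see how an anchor changes what match reports, the sketch below runs several patterns against the same string and prints RSTART and RLENGTH for each. show_match is a hypothetical helper, and the regex is passed as a string through -v, so awk applies its usual escape processing to any backslashes in it.

```shell
# Hypothetical helper. Usage: show_match REGEX STRING
show_match() {
  awk -v re="$1" -v s="$2" 'BEGIN {
    match(s, re)   # re is a string used as a dynamic regex
    printf "%s -> RSTART=%d RLENGTH=%d\n", re, RSTART, RLENGTH
  }'
}

show_match '[0-9]+'         'date 789xyz'   # unanchored: finds 789 at position 6
show_match '[0-9]+$'        'date 789xyz'   # must end the string: no match
show_match '^[a-z]+ [0-9]+' 'date 789xyz'   # must start the string: matches "date 789"
```

The second call fails because xyz follows the digits, which is why date 789xyz is missing from the END NUMBER output above.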

How to Use the match Function in Loop Processing (while)

Creating the File

cat << 'EOF' > input.txt
abc123def456ghi789
EOF

Command

awk '{
  while (match($0, /[0-9]+/)) {
    print substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
  }
}' input.txt

Output

123
456
789

How It Works

Element | Description
match function | Returns the position where the regex matched
RSTART | Match start position
RLENGTH | Length of the match
substr | Extracts the matched portion
while loop | Repeats until no more matches are found
Updating $0 | Trims the string to target everything after the previous match

Explanation

match detects the numeric portion, and all occurrences are extracted by repeating with while.
The key point is that $0 is rewritten so that the search targets everything after the previous match.
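
Assigning to $0 also re-splits the fields and discards the original line. When those are still needed, a variation is to loop over a copy and keep a running offset, as in this sketch. The names list_numbers, rest, and offset are illustrative choices, not awk built-ins.

```shell
# Hypothetical helper: print every run of digits with its column in the original line.
list_numbers() {
  awk '{
    rest = $0     # work on a copy so $0 and the fields stay intact
    offset = 0    # number of characters already consumed from the line
    while (match(rest, /[0-9]+/)) {
      printf "%s at column %d\n", substr(rest, RSTART, RLENGTH), offset + RSTART
      offset += RSTART + RLENGTH - 1          # advance past the current match
      rest = substr(rest, RSTART + RLENGTH)   # continue after the match
    }
  }'
}

printf 'abc123def456ghi789\n' | list_numbers
```

This prints 123 at column 4, 456 at column 10, and 789 at column 16. Note that a pattern able to match the empty string would never shorten rest, so the loop assumes a pattern such as [0-9]+ that always consumes at least one character.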

Extracting Only Specific Timestamps or IDs from Log Files

Creating the File

cat << 'EOF' > input.txt
2026-04-29 10:15:23 ID:1001 INFO Login success
2026-04-29 10:16:10 ID:1002 ERROR Failed attempt
2026-04-29 10:17:45 ID:1003 INFO Logout
2026-04-29 10:18:00 ID:1002 INFO Login success
EOF

Command

awk 'match($0, /ID:1002/)' input.txt

Output

2026-04-29 10:16:10 ID:1002 ERROR Failed attempt
2026-04-29 10:18:00 ID:1002 INFO Login success

Command

awk 'match($0, /^2026-04-29 10:17/)' input.txt

Output

2026-04-29 10:17:45 ID:1003 INFO Logout

How It Works

Element | Description
awk | A command that processes text line by line
match() | Returns the match position; as a pattern, a nonzero result selects the line
$0 | Represents the entire line
/ID:1002/ | Matches a specific ID
/^2026-04-29 10:17/ | Matches lines starting with the specified time

Explanation

Using match() allows you to flexibly extract only lines matching a regular expression.
It can handle complex conditions such as IDs and timestamps.
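
The commands above select whole lines. To pull out just the timestamp and the ID value, RSTART and RLENGTH can be combined with substr, as in the following sketch. It assumes the log format shown in this section (a 19-character "YYYY-MM-DD HH:MM:SS" prefix and an ID:number token); error_ids is a hypothetical helper name.

```shell
# Hypothetical helper: print "timestamp id" for each ERROR line.
error_ids() {
  awk '/ERROR/ && match($0, /ID:[0-9]+/) {
    id = substr($0, RSTART + 3, RLENGTH - 3)   # skip the 3-character "ID:" prefix
    ts = substr($0, 1, 19)                     # assumes a fixed-width timestamp prefix
    print ts, id
  }'
}

printf '%s\n' \
  '2026-04-29 10:15:23 ID:1001 INFO Login success' \
  '2026-04-29 10:16:10 ID:1002 ERROR Failed attempt' \
  '2026-04-29 10:17:45 ID:1003 INFO Logout' \
  '2026-04-29 10:18:00 ID:1002 INFO Login success' | error_ids
```

For the sample log this prints the single line 2026-04-29 10:16:10 1002.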

Notes and Settings for Using the match Function with Multibyte Characters (Japanese)

Creating the File

cat << 'EOF' > input.txt
こんにちは123
abc日本語456
EOF

Command

awk '{ if (match($0, /[0-9]+/)) print substr($0, RSTART, RLENGTH) }' input.txt

Output

123
456

Command

export LC_ALL=ja_JP.UTF-8
awk '{ if (match($0, /日本語/)) print substr($0, RSTART, RLENGTH) }' input.txt

Output

日本語

How It Works

Item | Description
match function | Stores the position of the regex match in RSTART and its length in RLENGTH
RSTART | Match start position (1-based)
RLENGTH | Number of characters matched
substr | Retrieves the matched portion using RSTART and RLENGTH
Issue | In a default environment, multibyte characters may not be treated as a single character
Solution | Explicitly set a UTF-8 locale, for example LC_ALL=ja_JP.UTF-8

Explanation

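Since awk's match is locale-dependent, specify a UTF-8 locale explicitly when handling Japanese. GNU awk then reports RSTART and RLENGTH in characters; in the C locale, or with a byte-oriented awk implementation, they are byte counts, so 日本語 has a length of 9 rather than 3.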
Without this setting, character positions and lengths may not be retrieved correctly.
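
A quick way to check how a given system behaves is to print RSTART and RLENGTH for the same input under two locales. The sketch below assumes the script itself is saved as UTF-8 and that the ja_JP.UTF-8 locale may or may not be installed; measure is a hypothetical helper name.

```shell
# Hypothetical helper. Usage: measure LOCALE
measure() {
  printf 'abc日本語456\n' | LC_ALL="$1" awk '{
    if (match($0, /日本語/)) print "RSTART=" RSTART " RLENGTH=" RLENGTH
  }'
}

measure C             # byte-oriented: the three characters occupy 9 bytes
measure ja_JP.UTF-8   # character-oriented only if the awk and the locale support it
```

RSTART is 4 in both cases because only the ASCII text abc precedes the match. RLENGTH is the value to watch: 9 means bytes were counted and 3 means characters were counted. If the second call also reports 9, either the locale is not installed or the awk in use does not handle multibyte characters.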

Strategies for Improving match Processing Speed When Handling Large Volumes of Data

Creating the File

seq -f "%.0f" 5000000 | sed 's/^/apple_/' > input.txt

Command

time awk '{ if (match($0, /apple_[0-9]+/)) print }' input.txt

Output

real	0m17.488s
user	0m8.886s
sys	0m3.553s

Command

time awk 'BEGIN { pattern = "apple_[0-9]+" } { if ($0 ~ pattern) print }' input.txt

Output

real	0m16.747s
user	0m8.320s
sys	0m3.529s

Command

time awk '{ if (index($0, "apple_")) print }' input.txt

Output

real	0m16.200s
user	0m7.591s
sys	0m3.467s

How It Works

Method | Description | Measured real time | Characteristics
match function | Regex search that also sets RSTART and RLENGTH | 17.488s | Flexible; needed when the position matters
Variable-based regex | Regex held in a string variable and tested with ~ | 16.747s | Pattern can be chosen at run time
index function | Plain substring search, no regex | 16.200s | Suited to fixed strings only

Explanation

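In these single runs the three commands differ by about 1.3 seconds out of roughly 17, so regex evaluation is not the dominant cost here; much of the time likely goes to reading and printing 5 million lines. A regex literal such as /apple_[0-9]+/ is compiled once when the program is parsed, so holding the pattern in a variable does not reduce compilations, and the GNU awk manual describes regex constants as the more efficient form.
If the condition is a fixed string, index avoids the regex engine entirely and was the fastest of the three here. Note that index($0, "apple_") no longer checks for the trailing digits, so it is a replacement only when that difference does not matter.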
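
Before swapping a regex for index, it helps to confirm that the two select the same lines, because index treats every character literally. The sketch below compares both on a pattern containing a dot; compare_search is a hypothetical helper and the pattern a.c is chosen only to make the difference visible.

```shell
# Hypothetical helper: show the position found by a regex and by index.
compare_search() {
  awk '{
    # In the regex, "." matches any character; in index it is a literal dot.
    printf "%s regex=%d index=%d\n", $0, match($0, /a.c/), index($0, "a.c")
  }'
}

printf 'xa.cx\nxabcx\n' | compare_search
```

The first line reports position 2 for both searches, while the second reports regex=2 and index=0, since xabcx contains no literal a.c.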

Handling Return Values When There Is No Match

Creating the File

cat << 'EOF' > input.txt
apple 100
banana 200
cherry 300
EOF

Command

awk '{ result = match($0, /orange/); print result }' input.txt

Output

0
0
0

Command

awk '{ result = match($0, /apple/); print result }' input.txt

Output

1
0
0

How It Works

Condition | Return value of match | Meaning
Match found | A number of 1 or greater | The start position of the match
No match | 0 | No match was found
Stored in result | Numeric value | Can be used for conditional branching

Explanation

Since match returns 0 when there is no match, you can handle it safely by writing branching logic based on that value.
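
A minimal sketch of such branching follows: it prints the first number on each line, or a placeholder when there is none. The helper name price_or_default and the n/a placeholder are assumptions made for this example.

```shell
# Hypothetical helper: print the first field and the first number, or "n/a".
price_or_default() {
  awk '{
    pos = match($0, /[0-9]+/)
    if (pos > 0) {
      print $1, substr($0, pos, RLENGTH)   # pos equals RSTART on success
    } else {
      print $1, "n/a"                      # no match: pos is 0 and RLENGTH is -1
    }
  }'
}

printf 'apple 100\nbanana\ncherry 300\n' | price_or_default
```

This prints apple 100, banana n/a, and cherry 300. Checking the return value first avoids calling substr with the failure values RSTART=0 and RLENGTH=-1, which would silently yield an empty string.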

Summary: Mastering awk and match

Understanding match in awk is not just about knowing regular expressions; what matters is grasping the "flow of processing."

Understanding the fine details of the specification dramatically expands the range of what you can apply it to.

By keeping the points introduced in this article in mind, you can move from beginner-level use toward practical, production-level text processing.


©︎ 2025-2026 running terminal commands