Mastering the awk substr Function: A Hands-on Guide

Introduction

awk is a powerful tool specialized in text processing, widely used in log analysis and data formatting.

Among its features, substr is a fundamental and important function for extracting parts of strings.

For beginners, however, the meaning of its arguments and the ways to combine it with other functions can be tricky.

This article explains awk's substr from the basics to advanced usage, from a perspective useful in real-world practice.

Reference: GNU awk

Basic Syntax and Argument Definitions of awk’s substr Function

Creating the File

cat << 'EOF' > input.txt
Hello,awk,substr,function
EOF

Command

awk '{ print substr($0,1,5) }' input.txt

Output

Hello

Command

awk '{ print substr($0,7,3) }' input.txt

Output

awk

Command

awk '{ print substr($0,11) }' input.txt

Output

substr,function

How It Works

Element	Description
substr(string, start, length)	Basic syntax
string	The target string (e.g., $0 represents the entire line)
start	Index starting from 1
length	Optional (if omitted, extracts to the end)
Return value	The substring within the specified range

Explanation

substr is a function that extracts a string starting from a specified position.
The start position is 1-based, and if the length is omitted, the string is extracted to the end.
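
It helps to see once what happens at the end of the string. The following is a minimal sketch (the sample string "Hello" is only an illustration): when the requested length runs past the end, or the start position lies beyond the end, substr returns whatever is available instead of raising an error.

```shell
# Length longer than the remaining text: the result stops at the end of the string.
echo "Hello" | awk '{ print substr($0, 4, 10) }'        # prints: lo

# Start position beyond the end of the string: the result is an empty string.
echo "Hello" | awk '{ print "[" substr($0, 9) "]" }'    # prints: []
```

A start position of 0 or less may be handled differently between awk implementations, so keeping start at 1 or above is the safer habit.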

Behavior When the Third Argument (Length) Is Omitted and Use Cases

Creating the File

cat << 'EOF' > input.txt
HelloWorld
AWKsubstrExample
EOF

Command

awk '{print substr($0,6)}' input.txt

Output

World
bstrExample

Command

awk '{print substr($0,1)}' input.txt

Output

HelloWorld
AWKsubstrExample

How It Works

Item	Description
Function	substr(string, start, length)
When third argument is omitted	Retrieves everything from the start position to the end
Start position basis	Starts from 1 (not 0)
Return value	Substring from the specified position onward
Use cases	Log analysis and string processing where you want to retrieve everything from a midpoint to the end

Explanation

Omitting the third argument retrieves everything from the start position to the end in a single call, which is convenient for extracting the latter part of variable-length data.
It keeps the code concise when processing logs, or strings whose delimiters are ambiguous.
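
As one illustration of the log use case, the sketch below assumes a hypothetical log format in which every line begins with a 10-character date followed by a single space, so the message starts at position 12. The log line itself is made up for this example.

```shell
# Hypothetical log line: 10-character date + 1 space, message from position 12 onward.
echo "2024-01-15 ERROR disk full" | awk '{ print substr($0, 12) }'   # prints: ERROR disk full
```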

Dynamically Identifying the Extraction Start Position by Combining with the index Function

Creating the File

cat << 'EOF' > input.txt
apple:100
banana:200
cherry:300
EOF

Command

awk '{ pos = index($0, ":"); print substr($0, pos+1) }' input.txt

Output

100
200
300

Command

awk '{ pos = index($0, ":"); print substr($0, 1, pos-1) }' input.txt

Output

apple
banana
cherry

How It Works

Element	Description
index($0, ":")	Gets the position of ":" within the line
pos	The reference position for extraction start
substr($0, pos+1)	Extracts everything after ":" (the value part)
substr($0, 1, pos-1)	Extracts everything before ":" (the key part)
$0	Processes the entire line

Explanation

By dynamically obtaining the position with index, you can flexibly handle cases where the delimiter position changes.
Combining it with substr makes it easy to extract strings from any position.
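
One detail to keep in mind is that index returns 0 when the search string is not found. The sketch below adds a guard for that case; the second input line and the output wording are assumptions made for illustration.

```shell
printf 'apple:100\nno_delimiter\n' | awk '{
    pos = index($0, ":")
    if (pos > 0)
        # Delimiter found: split into the key part and the value part.
        print substr($0, 1, pos-1) " => " substr($0, pos+1)
    else
        # index returned 0: the line contains no ":" at all.
        print "no delimiter: " $0
}'
# prints:
# apple => 100
# no delimiter: no_delimiter
```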

Extracting Strings from the End Using the length Function Together

Creating the File

cat << 'EOF' > input.txt
apple
banana
cherry
EOF

Command

awk '{ print substr($0, length($0)-2, 3) }' input.txt

Output

ple
ana
rry

How It Works

Element	Description
length($0)	Gets the number of characters in the entire line
substr($0, start, count)	Extracts a string from the specified position
length($0)-2	Calculates the start position of the last 3 characters
$0	Represents the entire line

Explanation

By getting the character count with length, you can compute the start position relative to the end of the string: the last n characters begin at length($0)-n+1, which is length($0)-2 for n = 3.
Passing that position to substr extracts from the end.
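
The same calculation can be generalized to the last n characters. This is a minimal sketch, assuming n is passed in with -v (the variable names n and start are chosen for this example) and that lines shorter than n should be printed whole.

```shell
printf 'apple\nfig\n' | awk -v n=4 '{
    start = length($0) - n + 1      # first position of the last n characters
    if (start < 1) start = 1        # line shorter than n: take the whole line
    print substr($0, start)
}'
# prints:
# pple
# fig
```

Clamping start to 1 avoids relying on how a particular awk treats start positions below 1.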

Filtering Lines with Specific Patterns by Combining if Statements with substr

Creating the File

cat << 'EOF' > input.txt
apple_001
banana_002
apple_123
orange_999
apple_abc
EOF

Command

awk '{ if (substr($0,1,5) == "apple" && substr($0,7,3) ~ /^[0-9]{3}$/) print }' input.txt

Output

apple_001
apple_123

How It Works

Element	Description
substr($0,1,5)	Gets the first 5 characters (checks for "apple")
substr($0,7,3)	Gets 3 characters starting from position 7 (the numeric part)
~ /^[0-9]{3}$/	Checks with a regex whether it is a 3-digit number
if condition	Filters for "starts with apple AND is a 3-digit number"
print	Outputs only lines matching the condition

Explanation

By extracting the characters at fixed positions with substr and branching with if, you can efficiently select only the lines that match a specific pattern.
This kind of filtering needs nothing beyond awk itself.
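
The same filter can also be written as a bare awk pattern with no if and no action, since awk prints a line by default when its pattern is true. The sketch below additionally spells out the three digits as [0-9][0-9][0-9], on the assumption that it may need to run on an older awk that lacks support for the {3} interval syntax.

```shell
printf '%s\n' apple_001 banana_002 apple_123 orange_999 apple_abc |
awk 'substr($0,1,5) == "apple" && substr($0,7,3) ~ /^[0-9][0-9][0-9]$/'
# prints:
# apple_001
# apple_123
```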

Splitting a String Character by Character and Storing in an Array Using a for Loop

Creating the File

cat << 'EOF' > input.txt
hello
EOF

Command

awk '{
    for(i=1;i<=length($0);i++){
        arr[i]=substr($0,i,1)
    }
    for(i=1;i<=length($0);i++){
        print arr[i]
    }
}' input.txt

Output

h
e
l
l
o

How It Works

Process	Description
length($0)	Gets the number of characters in the line
substr($0,i,1)	Gets the i-th single character
arr[i]	Stores one character at a time in the array
for(i=1;i<=length($0);i++)	Processes sequentially from the beginning
print arr[i]	Outputs the array contents in order

Explanation

Using awk's substr, you can decompose a string one character at a time.
Note that for(i in arr) does not guarantee any particular order, so the elements may come out shuffled (for example, in index order 2 3 4 5 1); an index-based for loop is the safer approach.
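
Character-by-character access is also enough to build small string utilities. As a minimal sketch, the loop below walks the line from the last character to the first and concatenates the pieces to reverse it (the variable name rev is chosen for this example).

```shell
echo "hello" | awk '{
    rev = ""
    # Walk from the last character back to the first, appending one character each time.
    for (i = length($0); i >= 1; i--)
        rev = rev substr($0, i, 1)
    print rev
}'
# prints: olleh
```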

Efficient Use of substr for Parsing Fixed-Width Text Data

Creating the File

cat << 'EOF' > input.txt
00001Yamada   Tokyo     030
00002Suzuki   Osaka     045
00003Tanaka   Nagoya    028
EOF

Command

awk '{id=substr($0,1,5); name=substr($0,6,9); city=substr($0,15,10); age=substr($0,25,3); printf "ID:%s Name:%s City:%s Age:%s\n", id, name, city, age}' input.txt

Output

ID:00001 Name:Yamada    City:Tokyo      Age:030
ID:00002 Name:Suzuki    City:Osaka      Age:045
ID:00003 Name:Tanaka    City:Nagoya     Age:028

Command

awk '{id=substr($0,1,5); name=substr($0,6,9); city=substr($0,15,10); age=substr($0,25,3);
gsub(/^ +| +$/,"",name); gsub(/^ +| +$/,"",city);
printf "%s,%s,%s,%s\n", id, name, city, age}' input.txt

Output

00001,Yamada,Tokyo,030
00002,Suzuki,Osaka,045
00003,Tanaka,Nagoya,028

How It Works

Item	Description
substr($0,1,5)	Gets 5 characters from position 1 (ID)
substr($0,6,9)	Gets 9 characters from position 6 (name)
substr($0,15,10)	Gets 10 characters from position 15 (city)
substr($0,25,3)	Gets 3 characters from position 25 (age)
$0	The entire line string
gsub(/^ +| +$/,"",name)	Removes leading and trailing spaces from the extracted value

Explanation

Fixed-width data has no delimiter characters, so splitting on a field separator does not apply; extracting by position with substr is the simple and direct approach.
Parsing stays stable as long as the column positions are fixed in advance.
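
Values extracted with substr are strings, but they can be used as numbers once needed. The sketch below reuses the same fixed-width layout, adds 0 to the age column to turn "030" into 30, and keeps only rows with an age of 30 or more; the threshold is an arbitrary choice for illustration.

```shell
printf '%s\n' \
    '00001Yamada   Tokyo     030' \
    '00002Suzuki   Osaka     045' \
    '00003Tanaka   Nagoya    028' |
awk '{
    name = substr($0, 6, 9)
    sub(/ +$/, "", name)              # drop the padding after the name
    age = substr($0, 25, 3) + 0       # adding 0 converts "030" to the number 30
    if (age >= 30) print name, age
}'
# prints:
# Yamada 30
# Suzuki 45
```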

Criteria for Choosing Between Regex Replacement Functions (sub|gsub) and substr

Creating the File

cat << 'EOF' > input.txt
apple 123 orange
banana 456 grape
cherry 789 melon
EOF

Command

awk '{print substr($1,1,3)}' input.txt

Output

app
ban
che

Command

awk '{sub(/[0-9]+/,"NUM"); print}' input.txt

Output

apple NUM orange
banana NUM grape
cherry NUM melon

Command

awk '{gsub(/[aeiou]/,"_"); print}' input.txt

Output

_ppl_ 123 _r_ng_
b_n_n_ 456 gr_p_
ch_rry 789 m_l_n

How It Works

Feature	Function	Scope	Characteristic	Notes
Partial extraction	substr	Position-specified	Returns the string at the specified position	Does not modify the original string
Single replacement	sub	First match only	Replaces only the first matched portion	Pattern-based; modifies the target in place
Global replacement	gsub	All matching locations	Replaces every matched portion	Pattern-based; modifies the target in place

Explanation

substr extracts by position, while sub and gsub replace by regex pattern; that is the deciding criterion.
Choose substr when the data has a fixed structure, and sub or gsub when you are targeting a pattern.
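
When you want to extract (not replace) something whose position is only known by pattern, the two approaches can be combined. The sketch below goes one step beyond the functions covered so far and uses awk's standard match function, which sets RSTART and RLENGTH to the position and length of the match; those values are then handed to substr.

```shell
printf 'apple 123 orange\nbanana 456 grape\n' | awk '{
    # match returns 0 when the pattern is absent, so the line is skipped in that case.
    if (match($0, /[0-9]+/))
        print substr($0, RSTART, RLENGTH)
}'
# prints:
# 123
# 456
```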

How to Extract Strings Between Specific Symbols

Creating the File

cat << 'EOF' > input.txt
abc[hello]def
123[world]456
xxx[test123]yyy
EOF

Command

awk '{
  start = index($0, "[") + 1
  end = index($0, "]")
  print substr($0, start, end - start)
}' input.txt

Output

hello
world
test123

How It Works

Element	Description
index($0, "[")	Gets the position of [
index($0, "]")	Gets the position of ]
start	Extraction start position (the character after [)
end - start	Number of characters to extract
substr($0, start, length)	Extracts the string within the specified range

Explanation

The start and end positions are obtained with index, and the range between them is extracted with substr.
Because it works on character positions, this method does not depend on the field separator.
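
Real input does not always contain both symbols. The following is a minimal sketch that prints the bracketed text only when "[" is present and "]" appears after it; the extra input lines and the variable names lb and rb are assumptions for this example.

```shell
printf 'abc[hello]def\nno brackets\n]odd[\n' | awk '{
    lb = index($0, "[")      # position of the opening bracket, 0 if absent
    rb = index($0, "]")      # position of the closing bracket, 0 if absent
    # Extract only when both exist and the closing bracket comes after the opening one.
    if (lb > 0 && rb > lb)
        print substr($0, lb + 1, rb - lb - 1)
}'
# prints: hello
```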

A Practical Summary for Mastering awk and substr

awk's substr is a simple yet broadly applicable function.

By understanding the basic syntax and then learning how to omit the third argument and combine it with index and length, you will be able to handle production-level processing.

Furthermore, combining it with if statements and for loops enables even more flexible data manipulation.

It is especially powerful for processing fixed-width data and pattern extraction, and being mindful of when to use sub and gsub instead will improve both code readability and efficiency.

Properly understanding awk and substr, and building up from small tasks, is the fastest path to improving your skills.

Other Articles on Using awk Beyond substr

The following link is an article about the awk command in general.

Please make use of it if you want to learn awk comprehensively.

Mastering the awk Command
