Introduction
awk is a powerful tool specialized in text processing, widely used in log analysis and data formatting.
Among its features, substr is a fundamental and important function for extracting parts of strings.
Beginners, however, often stumble over what its arguments mean and how to combine it with other functions.
This article explains awk's substr from the basics to advanced usage, from a perspective useful in real-world practice.
Reference: GNU awk
Basic Syntax and Argument Definitions of awk’s substr Function
Creating the File
cat << 'EOF' > input.txt
Hello,awk,substr,function
EOF
Command
awk '{ print substr($0,1,5) }' input.txt
Output
Hello
Command
awk '{ print substr($0,7,3) }' input.txt
Output
awk
Command
awk '{ print substr($0,11) }' input.txt
Output
substr,function
How It Works
| Element | Description |
|---|---|
| substr(string, start, length) | Basic syntax |
| string | The target string (e.g., $0 represents the entire line) |
| start | Index starting from 1 |
| length | Optional (if omitted, extracts to the end) |
| Return value | The substring within the specified range |
Explanation
substr is a function that extracts a string starting from a specified position.
The start position is 1-based, and if the length is omitted, the string is extracted to the end.
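The following is a minimal sketch of what happens at the boundaries, using the illustrative string "hello" and a hypothetical wrapper function named substr_edges (neither comes from the examples above). A length that runs past the end is clipped, and a start position past the end yields an empty string.

```shell
# Hypothetical wrapper; the sample string "hello" is illustrative only.
substr_edges() {
  awk 'BEGIN {
    s = "hello"
    print "[" substr(s, 4, 10) "]"   # length runs past the end: clipped to "lo"
    print "[" substr(s, 10) "]"      # start is past the end: empty string
    print "[" substr(s, 2, 0) "]"    # zero length: empty string
  }'
}

substr_edges
# [lo]
# []
# []
```

The brackets are only there to make the empty results visible.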
Behavior When the Third Argument (Length) Is Omitted and Use Cases
Creating the File
cat << 'EOF' > input.txt
HelloWorld
AWKsubstrExample
EOF
Command
awk '{print substr($0,6)}' input.txt
Output
World
bstrExample
Command
awk '{print substr($0,1)}' input.txt
Output
HelloWorld
AWKsubstrExample
How It Works
| Item | Description |
|---|---|
| Function | substr(string, start, length) |
| When third argument is omitted | Retrieves everything from the start position to the end |
| Start position basis | Starts from 1 (not 0) |
| Return value | Substring from the specified position onward |
| Use cases | Log analysis and string processing where you want to retrieve everything from a midpoint to the end |
Explanation
Omitting the third argument allows you to retrieve everything from the start position to the end in one go, which is convenient for extracting the latter part of variable-length data.
It keeps the code concise when processing logs or other strings whose leading part has a known width but whose remainder has no reliable delimiter.
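As a sketch of the log use case, assume a hypothetical log format in which every line begins with a 19-character timestamp followed by one space. The message then always starts at position 21, whatever its length. The function name strip_timestamp and the sample lines are made up for illustration.

```shell
# Assumption: each line starts with "YYYY-MM-DD HH:MM:SS " (20 characters).
strip_timestamp() {
  awk '{ print substr($0, 21) }'   # everything from position 21 to the end
}

printf '%s\n' \
  '2024-01-15 10:23:45 ERROR disk full' \
  '2024-01-15 10:23:46 INFO retry scheduled' | strip_timestamp
# ERROR disk full
# INFO retry scheduled
```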
Dynamically Identifying the Extraction Start Position by Combining with the index Function
Creating the File
cat << 'EOF' > input.txt
apple:100
banana:200
cherry:300
EOF
Command
awk '{ pos = index($0, ":"); print substr($0, pos+1) }' input.txt
Output
100
200
300
Command
awk '{ pos = index($0, ":"); print substr($0, 1, pos-1) }' input.txt
Output
apple
banana
cherry
How It Works
| Element | Description |
|---|---|
| index($0, ":") | Gets the position of ":" within the line |
| pos | The reference position for extraction start |
| substr($0, pos+1) | Extracts everything after ":" (the value part) |
| substr($0, 1, pos-1) | Extracts everything before ":" (the key part) |
| $0 | Processes the entire line |
Explanation
By dynamically obtaining the position with index, you can flexibly handle cases where the delimiter position changes.
Combining it with substr makes it easy to extract strings from any position.
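One case worth guarding against is a line without the delimiter: index returns 0 when the search string is not found. The sketch below, with a hypothetical function name split_at_first_colon and made-up sample lines, checks for that and splits only at the first ":", so a value that itself contains colons stays intact.

```shell
# Hypothetical helper: split each line into key and value at the FIRST ":".
split_at_first_colon() {
  awk '{
    pos = index($0, ":")
    if (pos == 0) {                 # index returns 0 when ":" is absent
      print "(no delimiter) " $0
      next
    }
    print substr($0, 1, pos - 1) " => " substr($0, pos + 1)
  }'
}

printf '%s\n' 'apple:100' 'no-delimiter' 'url:http://example.com' | split_at_first_colon
# apple => 100
# (no delimiter) no-delimiter
# url => http://example.com
```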
Extracting Strings from the End Using the length Function Together
Creating the File
cat << 'EOF' > input.txt
apple
banana
cherry
EOF
Command
awk '{ print substr($0, length($0)-2, 3) }' input.txt
Output
ple
ana
rry
How It Works
| Element | Description |
|---|---|
| length($0) | Gets the number of characters in the entire line |
| substr($0, start, count) | Extracts a string from the specified position |
| length($0)-2 | Calculates the start position of the last 3 characters |
| $0 | Represents the entire line |
Explanation
Getting the character count with length lets you compute the start position relative to the end of the string.
The last n characters begin at length($0) - n + 1, which is length($0) - 2 for n = 3.
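The same idea can be generalized to any number of trailing characters. This is a sketch under a few assumptions: the names last_n, tail_chars, and count are hypothetical, and the guard for lines shorter than the requested count is an added design choice so that the start position never drops below 1.

```shell
# Hypothetical helper: print the last N characters of each line (N is $1).
last_n() {
  awk -v count="$1" '
    function tail_chars(s, n,    len) {
      len = length(s)
      # If the string is shorter than n, return it unchanged.
      return (n >= len) ? s : substr(s, len - n + 1)
    }
    { print tail_chars($0, count + 0) }'
}

printf '%s\n' apple banana fig | last_n 4
# pple
# nana
# fig
```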
Filtering Lines with Specific Patterns by Combining if Statements with substr
Creating the File
cat << 'EOF' > input.txt
apple_001
banana_002
apple_123
orange_999
apple_abc
EOF
Command
awk '{ if (substr($0,1,5) == "apple" && substr($0,7,3) ~ /^[0-9]{3}$/) print }' input.txt
Output
apple_001
apple_123
How It Works
| Element | Description |
|---|---|
| substr($0,1,5) | Gets the first 5 characters (checks for "apple") |
| substr($0,7,3) | Gets 3 characters starting from position 7 (the numeric part) |
| ~ /^[0-9]{3}$/ | Checks with a regex whether it is a 3-digit number |
| if condition | Filters for "starts with apple AND is a 3-digit number" |
| print | Outputs only lines matching the condition |
Explanation
By extracting character positions with substr and branching with if, you can efficiently extract only lines matching a specific pattern.
Flexible filtering is possible with awk alone.
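The same filter can also be written as a bare awk pattern, since a pattern with no action prints the matching line by default. The sketch below spells the three digits out as [0-9][0-9][0-9] because some older awk implementations do not support interval expressions such as {3}. The function name filter_apple_3digits is hypothetical.

```shell
# Hypothetical helper: keep lines that start with "apple" and have
# three digits at positions 7-9 (and nothing after them).
filter_apple_3digits() {
  awk 'substr($0, 1, 5) == "apple" && substr($0, 7) ~ /^[0-9][0-9][0-9]$/'
}

printf '%s\n' apple_001 banana_002 apple_123 orange_999 apple_abc | filter_apple_3digits
# apple_001
# apple_123
```

Using substr($0, 7) without a length here means a line such as apple_1234 is rejected, whereas a fixed length of 3 would look only at the first three characters after the underscore.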
Splitting a String Character by Character and Storing in an Array Using a for Loop
Creating the File
cat << 'EOF' > input.txt
hello
EOF
Command
awk '{
for(i=1;i<=length($0);i++){
arr[i]=substr($0,i,1)
}
for(i=1;i<=length($0);i++){
print arr[i]
}
}' input.txt
Output
h
e
l
l
o
How It Works
| Process | Description |
|---|---|
| length($0) | Gets the number of characters in the line |
| substr($0,i,1) | Gets the i-th single character |
| arr[i] | Stores one character at a time in the array |
| for(i=1;i<=length($0);i++) | Processes sequentially from the beginning |
| print arr[i] | Outputs the array contents in order |
Explanation
Using awk's substr, you can decompose a string one character at a time.
Note that using for(i in arr) does not guarantee array order and may result in disordered output such as 2 3 4 5 1, so using an index-based for loop is the safer approach.
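Once a string can be walked one character at a time, the array is often unnecessary. As a small sketch of the same loop run backwards, the hypothetical function reverse_line below builds the reversed string directly by concatenation.

```shell
# Hypothetical helper: reverse each input line character by character.
reverse_line() {
  awk '{
    out = ""
    for (i = length($0); i >= 1; i--)
      out = out substr($0, i, 1)    # append the i-th character
    print out
  }'
}

printf '%s\n' hello awk | reverse_line
# olleh
# kwa
```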
Efficient Use of substr for Parsing Fixed-Width Text Data
Creating the File
cat << 'EOF' > input.txt
00001Yamada   Tokyo     030
00002Suzuki   Osaka     045
00003Tanaka   Nagoya    028
EOF
Command
awk '{id=substr($0,1,5); name=substr($0,6,9); city=substr($0,15,10); age=substr($0,25,3); printf "ID:%s Name:%s City:%s Age:%s\n", id, name, city, age}' input.txt
Output
ID:00001 Name:Yamada    City:Tokyo      Age:030
ID:00002 Name:Suzuki    City:Osaka      Age:045
ID:00003 Name:Tanaka    City:Nagoya     Age:028
Command
awk '{id=substr($0,1,5); name=substr($0,6,9); city=substr($0,15,10); age=substr($0,25,3);
gsub(/^ +| +$/,"",name); gsub(/^ +| +$/,"",city);
printf "%s,%s,%s,%s\n", id, name, city, age}' input.txt
Output
00001,Yamada,Tokyo,030
00002,Suzuki,Osaka,045
00003,Tanaka,Nagoya,028
How It Works
| Item | Description |
|---|---|
| substr($0,1,5) | Gets 5 characters from position 1 (ID) |
| substr($0,6,9) | Gets 9 characters from position 6 (name) |
| substr($0,15,10) | Gets 10 characters from position 15 (city) |
| substr($0,25,3) | Gets 3 characters from position 25 (age) |
| $0 | The entire line string |
| gsub | Trims whitespace |
Explanation
Fixed-width data has no delimiter characters, so position-based extraction with substr is the most direct way to parse it.
The advantage is that parsing remains stable once the column positions are fixed in advance.
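When there are many columns, hard-coding each start position becomes error-prone. The sketch below takes only the column widths and accumulates the start position in a loop. The function name fixed_to_csv and the variable names are hypothetical, and the widths 5 9 10 3 are the ones assumed for the sample layout. GNU awk also offers a FIELDWIDTHS variable for this purpose, but the sketch stays with substr so that it runs on other awk implementations as well.

```shell
# Hypothetical helper: convert fixed-width lines to CSV.
# $1 is a space-separated list of column widths, e.g. "5 9 10 3".
fixed_to_csv() {
  awk -v widths="$1" '
    BEGIN { n = split(widths, w, " ") }
    {
      pos = 1; line = ""
      for (i = 1; i <= n; i++) {
        field = substr($0, pos, w[i])
        gsub(/^ +| +$/, "", field)          # trim padding spaces
        line = line (i > 1 ? "," : "") field
        pos += w[i]                         # next column starts here
      }
      print line
    }'
}

printf '%s\n' \
  '00001Yamada   Tokyo     030' \
  '00002Suzuki   Osaka     045' | fixed_to_csv '5 9 10 3'
# 00001,Yamada,Tokyo,030
# 00002,Suzuki,Osaka,045
```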
Criteria for Choosing Between Regex Replacement Functions (sub|gsub) and substr
Creating the File
cat << 'EOF' > input.txt
apple 123 orange
banana 456 grape
cherry 789 melon
EOF
Command
awk '{print substr($1,1,3)}' input.txt
Output
app
ban
che
Command
awk '{sub(/[0-9]+/,"NUM"); print}' input.txt
Output
apple NUM orange
banana NUM grape
cherry NUM melon
Command
awk '{gsub(/[aeiou]/,"_"); print}' input.txt
Output
_ppl_ 123 _r_ng_
b_n_n_ 456 gr_p_
ch_rry 789 m_l_n
How It Works
| Feature | Function | Scope | Characteristic | Difference from awk substr |
|---|---|---|---|---|
| Partial extraction | substr | Position-specified | Extracts string from the specified position | Does not replace |
| Single replacement | sub | First match only | Replaces only the first matched portion | Pattern-based |
| Global replacement | gsub | All matching locations | Replaces all matched portions | Applied repeatedly |
Explanation
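substr extracts by position, while sub and gsub replace by regex pattern; that is the deciding criterion.
Use substr when the structure of the data (the column positions) is known, and sub or gsub when the target is defined by a pattern that can appear anywhere in the line.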
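The two approaches also combine well: substr selects a region by position, and gsub rewrites by pattern inside that region only. The sketch below masks every digit except those in the last four characters. The function name mask_all_but_last4 and the sample values are made up for illustration.

```shell
# Hypothetical helper: replace digits with "*" except in the last 4 characters.
mask_all_but_last4() {
  awk '{
    len = length($0)
    if (len <= 4) { print; next }           # nothing to mask
    head = substr($0, 1, len - 4)           # region chosen by position
    gsub(/[0-9]/, "*", head)                # rewrite chosen by pattern
    print head substr($0, len - 3)
  }'
}

printf '%s\n' '1234-5678-9012' '0000-1111' | mask_all_but_last4
# ****-****-9012
# ****-1111
```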
How to Extract Strings Between Specific Symbols
Creating the File
cat << 'EOF' > input.txt
abc[hello]def
123[world]456
xxx[test123]yyy
EOF
Command
awk '{
start = index($0, "[") + 1
end = index($0, "]")
print substr($0, start, end - start)
}' input.txt
Output
hello
world
test123
How It Works
| Element | Description |
|---|---|
| index($0, "[") | Gets the position of [ |
| index($0, "]") | Gets the position of ] |
| start | Extraction start position (the character after [) |
| end - start | Number of characters to extract |
| substr($0, start, length) | Extracts the string within the specified range |
Explanation
The start and end positions are obtained with index, and that range is extracted with substr.
This approach works regardless of where the brackets appear in the line, as long as each line contains a [ followed by a ].
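Real input may contain several bracketed parts on one line, or none at all. The following sketch, with a hypothetical function name extract_bracketed and made-up sample lines, repeatedly cuts the line down with substr and stops when index returns 0 for either bracket.

```shell
# Hypothetical helper: print every [...] segment, one per output line.
extract_bracketed() {
  awk '{
    rest = $0
    while ((s = index(rest, "[")) > 0) {
      rest = substr(rest, s + 1)            # drop everything up to "["
      e = index(rest, "]")
      if (e == 0) break                     # unclosed bracket: give up
      print substr(rest, 1, e - 1)
      rest = substr(rest, e + 1)            # continue after "]"
    }
  }'
}

printf '%s\n' 'a[one]b[two]c' 'no brackets' 'x[unclosed' | extract_bracketed
# one
# two
```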
A Practical Summary for Mastering awk and substr
awk's substr is a simple yet broadly applicable function.
By understanding the basic syntax and then learning how to omit the third argument and combine it with index and length, you will be able to handle production-level processing.
Furthermore, combining it with if statements and for loops enables even more flexible data manipulation.
It is especially powerful for processing fixed-width data and pattern extraction, and being mindful of when to use sub and gsub instead will improve both code readability and efficiency.
Properly understanding awk and substr, and building up from small tasks, is the fastest path to improving your skills.
