Introduction
awk is a powerful command specialized for text processing, widely used for everything from simple one-liners to advanced data analysis.
This article walks through the points that beginners tend to stumble on, focusing on the basic structure of the awk command and how script execution works.
Reference: GNU awk
Basic Structure of the awk Command and How Script Execution Works
Create File
cat << 'EOF' > input.txt
apple 100
banana 200
orange 150
EOF
Command
awk '{print $1, $2}' input.txt
Output
apple 100
banana 200
orange 150
Command
awk '$2 > 150 {print $1}' input.txt
Output
banana
Command
awk '{sum += $2} END {print sum}' input.txt
Output
450
How It Works
| Element | Description |
|---|---|
| Pattern | Condition expression (e.g., $2 > 150) |
| Action | Execution process (e.g., {print $1}) |
| Field | References column data such as $1, $2 |
| END block | Process executed after all lines are processed |
| Script execution | Can be executed inline or via a file |
Explanation
awk is a stream-oriented tool that processes each line using a "pattern + action" model. It flexibly handles everything from simple one-liners to full script files.
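Both halves of the model are optional. As a minimal sketch (using a throwaway file name, defaults.txt), a pattern with no action falls back to the default action {print}, and an action with no pattern runs for every line:

```shell
cat << 'EOF' > defaults.txt
apple 100
banana 200
orange 150
EOF

# Pattern only: the default action {print} outputs each matching line
awk '$2 > 100' defaults.txt

# Action only: with no pattern, the action runs for every line
awk '{print $1}' defaults.txt
```

The first command prints only banana and orange lines; the second prints the first column of every line.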
Two Basic Methods for Saving and Executing awk Script Files
Create File
cat << 'EOF' > input.txt
apple
banana
cherry
EOF
Create File
cat << 'EOF' > script.awk
BEGIN { print "=== awk script start ===" }
{ print NR ":" $0 }
END { print "=== awk script end ===" }
EOF
Command
awk -f script.awk input.txt
Output
=== awk script start ===
1:apple
2:banana
3:cherry
=== awk script end ===
Command
sed -i '1i #!/usr/bin/awk -f' script.awk
chmod +x script.awk
./script.awk input.txt
Output
=== awk script start ===
1:apple
2:banana
3:cherry
=== awk script end ===
How It Works
| Method | How to Run | Mechanism |
|---|---|---|
| -f option | awk -f script.awk | Loads an external script file into the awk command |
| Executable file | ./script.awk | Specifies awk via shebang and runs the script directly |
Explanation
Separating an awk script into an external file improves reusability, and you can choose the execution method to suit the use case: use -f for ad-hoc processing, and add a shebang plus the execute bit when you want to run the script like a standalone tool.
Initialization Processing Using the BEGIN Block
Create File
cat << 'EOF' > input.txt
apple 100
banana 200
orange 150
EOF
Command
awk 'BEGIN { sum=0; print "=== Aggregation Start ===" } { sum += $2 } END { print "Total:", sum }' input.txt
Output
=== Aggregation Start ===
Total: 450
How It Works
| Block | Timing | Process |
|---|---|---|
| BEGIN | Before reading input | Variable initialization, header output |
| Main body | Each line | Adds up the numeric value in column 2 |
| END | After reading input | Outputs the total value |
Explanation
The BEGIN block executes only once before input processing begins, making it ideal for initialization.
It is frequently used in awk scripts as preparation before aggregation processing.
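BEGIN is also the standard place to set FS and OFS. A small sketch (with an assumed file name, csv.txt) that is equivalent to using -F',' while additionally controlling the output separator:

```shell
cat << 'EOF' > csv.txt
apple,100
banana,200
EOF

# FS set in BEGIN takes effect before the first record is split;
# OFS controls the separator that print inserts between arguments
awk 'BEGIN { FS=","; OFS=" | " } { print $1, $2 }' csv.txt
```

This prints `apple | 100` and `banana | 200`.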
How to Handle Command-Line Arguments as Variables Inside a Script
Create File
cat << 'EOF' > script.awk
BEGIN {
arg1 = ARGV[1]
arg2 = ARGV[2]
print "arg1 =", arg1
print "arg2 =", arg2
# Delete so awk does not treat them as regular files
delete ARGV[1]
delete ARGV[2]
}
{
print "input:", $0
}
EOF
Command
echo "hello world" | awk -f script.awk foo bar
Output
arg1 = foo
arg2 = bar
input: hello world
How It Works
| Element | Description |
|---|---|
| ARGV | Array of command-line arguments |
| ARGV[0] | The awk command itself |
| ARGV[1..] | User-specified arguments |
| delete | Prevents awk from treating the entry as an input file |
| BEGIN | Block executed before input processing |
Explanation
Using ARGV allows you to handle command-line arguments inside an awk script.
It is important to delete arguments that are not input files; otherwise awk will try to open them as files after the BEGIN block, and fail if they do not exist.
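For simple cases, the -v option is a common alternative that avoids the ARGV bookkeeping entirely: it assigns a variable before BEGIN runs, and the value never appears in ARGV, so nothing has to be deleted. A minimal sketch:

```shell
# -v sets an awk variable on the command line; stdin is still read normally
echo "hello world" | awk -v greeting=hi '{ print greeting ":", $0 }'
```

This prints `hi: hello world`. Use ARGV when you need positional arguments; use -v when a named variable is enough.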
Efficient Use of Regular Expressions in External Script Files
Create File
cat << 'EOF' > input.txt
apple 100
banana 200
apricot 150
grape 300
EOF
Create File
cat << 'EOF' > script.awk
/^(a|b)/ {
if ($2 ~ /^[0-9]+$/) {
sum += $2
print $1, $2
}
}
END {
print "TOTAL:", sum
}
EOF
Command
awk -f script.awk input.txt
Output
apple 100
banana 200
apricot 150
TOTAL: 450
How It Works
| Element | Description |
|---|---|
| Regex `/^(a\|b)/` | Targets lines starting with a or b |
| Numeric check /^[0-9]+$/ | Validates that the field contains only digits |
| Action {...} | Describes the process when condition matches |
| sum += $2 | Accumulates numeric values |
| END | Outputs total in final processing |
Explanation
Writing regular expressions directly in the pattern section keeps each condition next to its action and removes a layer of explicit if branching. Because awk evaluates the pattern before running the action, even external scripts can handle matching, validation, and aggregation concisely.
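Patterns can also be combined with && and ||, which often removes the need for the inner if entirely. A sketch against the same style of input (assumed file name combo.txt):

```shell
cat << 'EOF' > combo.txt
apple 100
banana 200
apricot 150
grape 300
EOF

# Regex match and numeric comparison combined in one pattern expression
awk '/^a/ && $2 >= 150 { print $1, $2 }' combo.txt
```

Only `apricot 150` satisfies both conditions, so that is the only line printed.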
Record Control Using Built-in Variables (NR, NF, FS)
Create File
cat << 'EOF' > input.txt
apple,fruit,100
carrot,vegetable,80
banana,fruit,120
EOF
Create File
cat << 'EOF' > script_all.sh
#!/bin/bash
awk -F',' '{ print "NR=" NR, "NF=" NF, "1=" $1, "2=" $2, "3=" $3 }' input.txt
EOF
Create File
cat << 'EOF' > script_line2.sh
#!/bin/bash
awk -F',' 'NR==2 { print "Line 2:", $1, $2, $3 }' input.txt
EOF
Create File
cat << 'EOF' > script_fruit.sh
#!/bin/bash
awk -F',' '$2=="fruit" { print "Fruit:", $1, $3 }' input.txt
EOF
Command
chmod +x script_all.sh script_line2.sh script_fruit.sh
Command
./script_all.sh
Output
NR=1 NF=3 1=apple 2=fruit 3=100
NR=2 NF=3 1=carrot 2=vegetable 3=80
NR=3 NF=3 1=banana 2=fruit 3=120
Command
./script_line2.sh
Output
Line 2: carrot vegetable 80
Command
./script_fruit.sh
Output
Fruit: apple 100
Fruit: banana 120
How It Works
| Variable | Meaning | Role |
|---|---|---|
| NR | Current record number | Line identification and conditional branching |
| NF | Number of fields | Understanding the column count |
| FS | Field separator | Field splitting (specified with -F) |
Explanation
Printing NR and NF alongside each record makes it easy to see exactly how awk splits the input, which helps when debugging scripts.
Combining NR, NF, and FS enables flexible extraction by row and by column.
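Because NF is recomputed for every record, $NF always names the last field. A short sketch (assumed file name last.txt) that works even when rows have different column counts:

```shell
cat << 'EOF' > last.txt
apple,fruit,100
carrot,80
EOF

# $NF refers to the last field of the current record, whatever NF is
awk -F',' '{ print $1, "->", $NF }' last.txt
```

This prints `apple -> 100` and `carrot -> 80` despite the differing column counts.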
Aggregating Results and Generating Reports with the END Block
Create File
cat << 'EOF' > input.txt
apple 100
banana 200
apple 150
banana 50
orange 300
EOF
Command
awk '{ sum[$1]+=$2 } END { for (k in sum) print k, sum[k] }' input.txt
Output
apple 250
banana 250
orange 300
How It Works
| Element | Description |
|---|---|
| $1 | Column 1 (key: product name) |
| $2 | Column 2 (value: numeric) |
| sum[$1]+=$2 | Adds to the total for each product |
| END | Block executed after all lines are processed |
| for (k in sum) | Loops over all keys in the associative array |
| print k, sum[k] | Outputs the aggregated result |
Explanation
Using the END block in awk lets you aggregate and generate a report in a single pass, after all data has been processed.
Associative arrays allow flexible aggregation by key, but note that for (k in sum) visits keys in an unspecified order that can differ between awk implementations.
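Because the for-in iteration order is unspecified, piping the result through sort is a portable way to get stable output. A sketch reusing the same aggregation (assumed file name totals.txt):

```shell
cat << 'EOF' > totals.txt
apple 100
banana 200
apple 150
EOF

# sort makes the unspecified for-in iteration order deterministic
awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' totals.txt | sort
```

This reliably prints `apple 250` before `banana 200` regardless of the awk implementation.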
Building Complex Logic with Control Structures (if, for, while)
Create File
cat << 'EOF' > input.txt
apple 10
banana 5
orange 20
grape 15
EOF
Create File
cat << 'EOF' > script.awk
{
if ($2 >= 15) {
print $1 " is high"
} else if ($2 >= 10) {
print $1 " is medium"
} else {
print $1 " is low"
}
}
EOF
Command
awk -f script.awk input.txt
Output
apple is medium
banana is low
orange is high
grape is high
How It Works
| Element | Description |
|---|---|
| script.awk | The awk script body |
| $1, $2 | Fields (column 1: name, column 2: numeric value) |
| if | Conditional branch (15 or more) |
| else if | Intermediate condition (10 or more) |
| else | Processing for all other cases |
| -f | Specifies a script file for awk |
Explanation
Externalizing an awk script into a file improves reusability and readability. Complex conditional branching can also be organized and managed more easily.
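A for loop pairs naturally with NF when rows have a variable number of columns. A small sketch (assumed file name rows.txt) summing every numeric column after the name:

```shell
cat << 'EOF' > rows.txt
apple 10 20 30
banana 5 15
EOF

# Loop over fields 2..NF; total is reset for each line before accumulating
awk '{ total = 0; for (i = 2; i <= NF; i++) total += $i; print $1, total }' rows.txt
```

This prints `apple 60` and `banana 20`, handling each row's column count automatically.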
Extending Scripts with the system Function
Create File
cat << 'EOF' > input.txt
apple 100
banana 200
orange 150
EOF
Command
awk '{ system("echo Item:" $1 ", Price:" $2) }' input.txt
Output
Item:apple, Price:100
Item:banana, Price:200
Item:orange, Price:150
How It Works
| Element | Description |
|---|---|
| awk | A scripting language that processes text line by line and field by field |
| $1, $2 | Represent column 1 and column 2 of each line |
| system function | A function that executes an external command |
| echo | A command that outputs a string to standard output |
| Processing flow | Read line → split fields → execute externally with system |
Explanation
Using awk's system function allows you to dynamically execute shell commands for each line. This enables flexible extensions that combine text processing with external commands.
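When the goal is just to feed output to one external command, awk's built-in output pipe, print | "cmd", is an alternative to per-line system() calls: it keeps a single external process open for all lines, and close() flushes it. A sketch (assumed file name pipe.txt):

```shell
cat << 'EOF' > pipe.txt
banana 200
apple 100
EOF

# All print output is fed to one sort process; close() flushes it
awk '{ print $2, $1 | "sort -n" } END { close("sort -n") }' pipe.txt
```

This prints `100 apple` then `200 banana`, having started sort only once.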
How to Embed an awk Script Inside a Shell Script (bash)
Create File
cat << 'DATA' > input.txt
Alice 80
Bob 65
Charlie 90
DATA
Create File
cat << 'EOF' > script.sh
#!/bin/bash
# Embed and run an awk script
awk '{
if ($2 >= 70) {
print $1 " : Pass"
} else {
print $1 " : Fail"
}
}' input.txt
EOF
Command
chmod +x script.sh
./script.sh
Output
Alice : Pass
Bob : Fail
Charlie : Pass
How It Works
| Element | Content | Description |
|---|---|---|
| Shell script | script.sh | Controls the overall flow |
| Here document | cat << 'DATA' | Generates input data |
| awk script | awk '...' | Text processing logic |
| Field reference | $1, $2 | Space-delimited columns |
| Conditional branch | if ($2 >= 70) | Numeric judgment |
| Output | print | Displays the pass/fail result |
Explanation
Writing awk directly inside bash enables concise text processing without external files.
Combining it with here documents creates highly reproducible scripts.
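To pass a bash variable into the embedded awk program, -v is safer than splicing the variable into the single-quoted script. A sketch (assumed file name grades.txt and a hypothetical threshold variable):

```shell
cat << 'EOF' > grades.txt
Alice 80
Bob 65
EOF

# The shell expands $threshold; -v hands the value to awk as "limit"
threshold=70
awk -v limit="$threshold" '{ print $1, ($2 >= limit ? "Pass" : "Fail") }' grades.txt
```

This prints `Alice Pass` and `Bob Fail`, and the threshold can now be changed without touching the awk code.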
Performance Optimization and Considerations for Large-Scale Data Processing
Create File
cat << 'EOF' > input.txt
id,name,score
1,Alice,82
2,Bob,91
3,Charlie,78
4,David,88
5,Eve,95
EOF
Create File
cat << 'EOF' > process.awk
BEGIN { FS="," }
NR>1 {
sum += $3
count++
}
END {
print "Average:", sum/count
}
EOF
Create File
cat << 'EOF' > filter.awk
BEGIN { FS="," }
NR==1 || $3 >= 90
EOF
Create File
cat << 'EOF' > skip_header.awk
BEGIN { FS="," }
NR>1 {
print $0
}
EOF
Command
awk -f process.awk input.txt
Output
Average: 86.8
Command
awk -f filter.awk input.txt
Output
id,name,score
2,Bob,91
5,Eve,95
Command
awk -f skip_header.awk input.txt
Output
1,Alice,82
2,Bob,91
3,Charlie,78
4,David,88
5,Eve,95
Command
awk -f skip_header.awk input.txt | sort -t',' -k3 -nr
Output
5,Eve,95
2,Bob,91
4,David,88
1,Alice,82
3,Charlie,78
How It Works
| Item | Description |
|---|---|
| Input splitting | FS="," enables efficient CSV processing |
| Skip | NR>1 excludes the header |
| Aggregation | Sequential addition saves memory |
| Conditional extraction | Outputs only matching conditions |
| Pipe integration | Delegates to sort for external processing |
| Delimiter specification | -t',' specifies the column delimiter |
| Key specification | -k3 uses column 3 as the sort key |
| Numeric sort | -n for numeric comparison |
| Descending sort | -r for descending order |
Explanation
Scripting awk makes it easy to split processing into reusable pieces. For large-scale data, minimize passes over the input, skip lines you do not need as early as possible, and delegate work such as sorting to specialized tools through pipes.
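Another I/O saver is exit: once the answer is found, stopping awk avoids reading the rest of the file. A sketch against the same CSV layout (assumed file name big.csv, standing in for a much larger input):

```shell
cat << 'EOF' > big.csv
id,name,score
1,Alice,82
2,Bob,91
3,Charlie,78
EOF

# exit stops the read loop at the first match instead of scanning to EOF
awk -F',' '$2 == "Alice" { print $3; exit }' big.csv
```

This prints `82` and never reads past the matching line.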
Creating Practical Scripts to Automate Log File Analysis
Create File
cat << 'EOF' > input.txt
2026-05-01 INFO User login success
2026-05-01 ERROR Database connection failed
2026-05-02 INFO File uploaded
2026-05-02 WARNING Disk space low
2026-05-03 ERROR Timeout occurred
EOF
Command
awk '$2 == "ERROR" {print $0}' input.txt
Output
2026-05-01 ERROR Database connection failed
2026-05-03 ERROR Timeout occurred
Command
awk '{count[$2]++} END {for (level in count) print level, count[level]}' input.txt
Output
INFO 2
ERROR 2
WARNING 1
Command
awk '$2=="ERROR" {print $1, $3, $4, $5}' input.txt
Output
2026-05-01 Database connection failed
2026-05-03 Timeout occurred
How It Works
| Process | awk Expression | Description |
|---|---|---|
| Conditional extraction | $2 == "ERROR" | Extracts only lines where column 2 is ERROR |
| Count | count[$2]++ | Counts occurrences by log level |
| End processing | END {} | Outputs results after all lines are processed |
| Field reference | $1, $2 ... | Specifies columns by space delimiter |
Explanation
Using awk allows you to filter, aggregate, and format logs in a single one-liner.
Simple yet powerful, it is highly effective for automation in operational environments; just note that the order of the count output follows the implementation's array iteration order, so it may differ from the sample shown.
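The same counting idea extends to grouping by another key, for example errors per day. A sketch (assumed file name app.log), with sort appended because the for-in output order is unspecified:

```shell
cat << 'EOF' > app.log
2026-05-01 INFO User login success
2026-05-01 ERROR Database connection failed
2026-05-03 ERROR Timeout occurred
EOF

# Count ERROR lines keyed by the date in column 1
awk '$2 == "ERROR" { errs[$1]++ } END { for (d in errs) print d, errs[d] }' app.log | sort
```

This prints one line per date with its error count, ready to feed into an alerting script.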
Avoiding Syntax Errors and Unintended Behavior
Create File
cat << 'EOF' > input.txt
apple 10
banana 20
orange 30
EOF
Command
awk '{ print $1, $2 * 2 }' input.txt
Output
apple 20
banana 40
orange 60
Command
awk 'NF == 2 { sum += $2 } END { print sum }' input.txt
Output
60
Command
awk '{ if ($2 ~ /^[0-9]+$/) print $1 ":" $2 }' input.txt
Output
apple:10
banana:20
orange:30
How It Works
| Element | Description | Error Prevention Point |
|---|---|---|
| awk '{ ... }' | Executes processing for each line | Watch out for unclosed quotes |
| $1, $2 | Field (column) references | Avoid misconfiguring the delimiter |
| NF == 2 | Checks the number of fields | Prevents processing of invalid lines |
| ~ /^[0-9]+$/ | Validates numeric value with regex | Avoids malfunctions due to type mismatch |
| END { ... } | Processing after all lines | Prevents forgotten initialization or undefined variables |
Explanation
In awk scripts, clarifying the assumptions about the input format and including condition checks (NF and regular expressions) prevents syntax errors and unintended behavior.
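One more defensive habit: a variable that is never assigned prints as an empty string, so a report line can silently come out blank when no input matched. Forcing numeric context with + 0 guarantees a number. A sketch (assumed file name nonum.txt, deliberately containing no numeric second column):

```shell
cat << 'EOF' > nonum.txt
apple x
banana y
EOF

# No line matches, so sum is never assigned; sum + 0 prints 0 instead of ""
awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }' nonum.txt
```

This prints `0`, making the "nothing matched" case explicit in the report.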
Summary: Making the Most of awk Scripts
awk is a lightweight yet powerful text processing tool that truly shines when used systematically as a script.
Making use of BEGIN and END, understanding built-in variables, and mastering control structures form the foundation.
Furthermore, external integration and bash embedding enable automation at a practical level.
It is important to be mindful of error avoidance and performance, and to build up skills step by step.
