Introduction
Anyone who starts learning data processing on the command line soon encounters awk.
Among its features, arrays are especially important, yet they are a common source of confusion for beginners.
This article walks through arrays in awk from the basics to practical usage, focusing on the points where people tend to stumble.
Reference: GNU awk
Basic Rules for Declaring and Initializing Arrays
Create File
cat << 'EOF' > input.txt
apple 100
banana 200
apple 150
orange 300
banana 50
EOF
Command
awk '{ arr[$1] += $2 } END { for (key in arr) print key, arr[key] }' input.txt
Output
apple 250
banana 250
orange 300
How It Works
| Element | Description |
|---|---|
| arr[$1] | Array using $1 (column 1) as the key |
| += $2 | Adds $2 (column 2) as the value |
| arr | Associative array; keys are strings, so values such as "apple" can be used directly |
| END | Executed after all lines are processed |
| for (key in arr) | Loops through all keys in the array |
Explanation
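awk arrays are associative arrays. An element is created automatically the first time it is referenced, with an uninitialized value that acts as 0 in a numeric context and as an empty string in a string context.
This is why += works without any prior declaration or initialization.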
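The automatic initialization described above can be checked in isolation. The following is a minimal sketch; the array names total and label are made up for this illustration and do not appear in the example above.

```shell
# Sketch: uninitialized array elements act as 0 or "" depending on context.
# The names total and label are hypothetical, chosen only for this demo.
out=$(awk 'BEGIN {
    total["x"] += 5                 # numeric context: starts from 0
    label["x"] = label["x"] "ab"    # string context: starts from ""
    print total["x"], label["x"]
}')
printf '%s\n' "$out"   # prints: 5 ab
```

Neither array was declared or assigned before its first use, yet both operations give the expected result.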
How Associative Arrays Work and the Concept of Keys
Create File
cat << 'EOF' > input.txt
apple 100
banana 200
apple 150
orange 300
banana 50
EOF
Command
awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' input.txt
Output
apple 250
banana 250
orange 300
How It Works
| Element | Description |
|---|---|
| Array name | sum |
| Key | $1 (column 1: strings such as "apple") |
| Value | Accumulated by adding $2 (numeric) |
| Behavior | Values are accumulated per identical key |
| END block | Outputs the total per key after all processing |
Explanation
awk arrays are associative arrays, and their defining feature is that strings can be used as keys. Because values are automatically grouped by the same key, aggregation processing can be written concisely.
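Because every key is a string, numeric subscripts are converted to strings before lookup. A small sketch of what that implies, using a hypothetical array a:

```shell
# Sketch: a[1] and a["1"] are the same element, but "01" is a different key.
out=$(awk 'BEGIN {
    a[1] = "one"
    print a["1"]                          # same element as a[1]
    print (("01" in a) ? "yes" : "no")    # "01" is a different string key
}')
printf '%s\n' "$out"
# prints:
# one
# no
```

This matters when keys come from input with leading zeros, such as IDs like 007: they are only equal to other keys with exactly the same spelling.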
Efficient Loop Processing Using for (index in array)
Create File
cat << 'EOF' > input.txt
apple 3
banana 5
apple 2
orange 4
banana 1
EOF
Command
awk '{ arr[$1] += $2 } END { for (i in arr) print i, arr[i] }' input.txt
Output
apple 5
banana 6
orange 4
How It Works
| Element | Description |
|---|---|
| $1 | Column 1 (key) |
| $2 | Column 2 (value to add) |
| arr[$1] += $2 | Accumulates values per key |
| END | Executed after all lines are processed |
| for (i in arr) | Loops through all keys in the array |
| print i, arr[i] | Outputs the key and total value |
Explanation
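By using awk's associative arrays, you can aggregate per key in a single pass over the input. The for (i in arr) construct lets you concisely iterate over keys that were generated dynamically. Note that the order in which for (i in arr) visits keys is unspecified, so your output may list the keys in a different order from the one shown above.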
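If the order of the result matters, one common approach is to sort it outside awk. A minimal sketch, assuming a POSIX sort is available; LC_ALL=C is set only to make the ordering reproducible across locales:

```shell
# Sketch: same aggregation as above, with the output piped through sort
# so that the result no longer depends on awk's internal key order.
out=$(printf 'apple 3\nbanana 5\napple 2\norange 4\nbanana 1\n' |
    awk '{ arr[$1] += $2 } END { for (i in arr) print i, arr[i] }' |
    LC_ALL=C sort)
printf '%s\n' "$out"
# prints:
# apple 5
# banana 6
# orange 4
```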
Checking Whether a Specific Element Exists
Create File
cat << 'EOF' > input.txt
apple
banana
orange
apple
grape
EOF
Command
awk '
{
arr[$1]++
}
END {
if ("apple" in arr) {
print "apple exists"
} else {
print "apple does not exist"
}
}' input.txt
Output
apple exists
How It Works
| Element | Description |
|---|---|
| arr[$1]++ | Uses column 1 of each line as a key and counts its occurrences |
| "apple" in arr | Checks whether a specific key exists in the array |
| END | Performs the check after all lines are processed |
Explanation
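Because an element is created the first time its key is referenced, the keys of arr are exactly the values that appeared in the input, and the in operator tests for them directly. Unlike a comparison such as arr["apple"] == "", the in operator does not create the element as a side effect.
Since arr[$1]++ also counts occurrences, the same array can answer both "does it exist?" and "how many times did it appear?".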
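The difference between testing with in and simply reading an element is easy to miss. The sketch below uses a hypothetical array named seen to show that a plain read creates the element, while in does not:

```shell
# Sketch: the "in" operator leaves the array untouched,
# but merely reading seen["x"] creates that element.
out=$(awk 'BEGIN {
    if (!("x" in seen)) print "before: absent"
    if (seen["x"] == "") print "plain read: empty"   # this read creates seen["x"]
    if ("x" in seen) print "after: present"
}')
printf '%s\n' "$out"
# prints:
# before: absent
# plain read: empty
# after: present
```

For this reason, existence checks are usually written with in rather than by comparing the value to an empty string.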
Deleting Array Elements with the delete Statement and Memory Management
Create File
cat << 'EOF' > input.txt
A 10
B 20
C 30
D 40
EOF
Command
awk '
{
arr[NR] = $2
}
END {
delete arr[2]
for (i=1;i<=NR;i++) {
if(i in arr) print i, arr[i]
}
}' input.txt
Output
1 10
3 30
4 40
How It Works
| Operation | Description |
|---|---|
| arr[NR] = $2 | Stores the value using the line number as the key |
| delete arr[2] | Deletes the element with the specified key |
| for (i=1;i<=NR;i++) | Loops in numeric order |
| if(i in arr) | Outputs only keys that exist |
Explanation
By combining a numeric loop with an in check, you can skip deleted elements while preserving order. This is a stable output method that does not depend on hash ordering.
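To remove every element rather than a single one, a widely used portable idiom is to split an empty string into the array. This is a sketch of that idiom, not something taken from the example above; many awk implementations also accept delete arr without a subscript, but that form is not assumed here.

```shell
# Sketch: empty a whole array with split("", arr), then count what is left.
out=$(awk 'BEGIN {
    n = split("a b c", arr, " ")   # arr now has 3 elements
    split("", arr)                 # splitting an empty string empties arr
    left = 0
    for (k in arr) left++          # count remaining elements
    print n, left
}')
printf '%s\n' "$out"   # prints: 3 0
```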
Generating an Array from a String Using the split Function
Create File
cat << 'EOF' > input.txt
apple,banana,grape
dog,cat,bird
EOF
Command
awk '{ n = split($0, arr, ","); for(i=1;i<=n;i++) print arr[i] }' input.txt
Output
apple
banana
grape
dog
cat
bird
How It Works
| Element | Description |
|---|---|
| split($0, arr, ",") | Splits the string by comma and stores the result in array arr |
| n | Number of elements after splitting |
| arr[i] | Each individual element after splitting |
| for loop | Processes each element of the array in order |
Explanation
Using the split function, you can explicitly split a string into an array using any delimiter you choose.
Since the return value gives you the element count, it works well together with loop processing.
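The separator passed to split does not have to be a single fixed character; it can also be a regular expression. A minimal sketch, with a made-up input string and a hypothetical array named parts:

```shell
# Sketch: split on a comma or semicolon followed by optional spaces.
out=$(awk 'BEGIN {
    n = split("a, b;c", parts, /[,;] */)
    for (i = 1; i <= n; i++) printf "%d:%s\n", i, parts[i]
}')
printf '%s\n' "$out"
# prints:
# 1:a
# 2:b
# 3:c
```

Indices produced by split always start at 1, which is why the loop runs from 1 to n.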
Simulating Multidimensional Arrays and the Role of the SUBSEP Variable
Create File
cat << 'EOF' > input.txt
A 1 x
A 2 y
B 1 z
B 2 w
EOF
Command
awk '{
key = $1 SUBSEP $2
arr[key] = $3
}
END {
for (k in arr) {
split(k, idx, SUBSEP)
printf("arr[%s][%s] = %s\n", idx[1], idx[2], arr[k])
}
}' input.txt
Output
arr[A][1] = x
arr[A][2] = y
arr[B][1] = z
arr[B][2] = w
Command
awk 'BEGIN { print "SUBSEP =", SUBSEP }'
Output
SUBSEP =
How It Works
| Element | Description | Role |
|---|---|---|
| arr[key] | One-dimensional array | POSIX awk has no true multidimensional arrays (GNU awk adds arrays of arrays as an extension) |
| SUBSEP | Separator character (default is \034) | Combines multiple keys into one |
| $1 SUBSEP $2 | Key generation | Achieves a pseudo two-dimensional array |
| split() | Key decomposition | Restores the original indices |
Explanation
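awk simulates multidimensional arrays by joining the indices into a single string key, with SUBSEP as the separator between them. Its default value is the non-printing character \034, which is why nothing visible follows "SUBSEP =" in the output above. Because for (k in arr) visits keys in an unspecified order, the four result lines may also appear in a different order.
This mechanism lets you simulate arrays of any number of dimensions, as long as the indices themselves never contain SUBSEP.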
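awk also provides a shorthand for this: writing arr[i, j] builds the same key as i SUBSEP j, and (i, j) in arr tests for it. A minimal sketch using the same four input lines:

```shell
# Sketch: the comma syntax joins the indices with SUBSEP automatically.
out=$(printf 'A 1 x\nA 2 y\nB 1 z\nB 2 w\n' | awk '
{ arr[$1, $2] = $3 }                   # same key as $1 SUBSEP $2
END {
    if (("B", 1) in arr) print "B,1 ->", arr["B", 1]
    if (!(("C", 1) in arr)) print "C,1 is absent"
}')
printf '%s\n' "$out"
# prints:
# B,1 -> z
# C,1 is absent
```

The parentheses around the index list are required when using in with more than one index.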
How to Output Array Aggregation Results in the END Block
Create File
cat << 'EOF' > input.txt
apple 10
banana 20
apple 15
orange 5
banana 25
EOF
Command
awk '{ sum[$1] += $2 } END { for (i in sum) print i, sum[i] }' input.txt
Output
apple 25
banana 45
orange 5
How It Works
| Element | Description |
|---|---|
| $1 | Column 1 (key: fruit name) |
| $2 | Column 2 (value: numeric) |
| sum[$1] += $2 | Adds to the array per key |
| END | Executed after all lines are processed |
| for (i in sum) | Loops through all keys in the array |
| print i, sum[i] | Outputs the aggregation result |
Explanation
By using awk's associative arrays, automatic aggregation per key is possible.
Outputting everything together in the END block is the standard pattern.
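The same pattern extends to more than one statistic per key. The following sketch adds a second, hypothetical array cnt to report the count and the average alongside each key; sort and LC_ALL=C are used only to make the output order and number format reproducible.

```shell
# Sketch: per-key count and average, computed in the END block.
out=$(printf 'apple 10\nbanana 20\napple 15\norange 5\nbanana 25\n' | LC_ALL=C awk '
{ sum[$1] += $2; cnt[$1]++ }           # accumulate total and count per key
END {
    for (k in sum) printf "%s %d %.1f\n", k, cnt[k], sum[k] / cnt[k]
}' | LC_ALL=C sort)
printf '%s\n' "$out"
# prints:
# apple 2 12.5
# banana 2 22.5
# orange 1 5.0
```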
Joining Two Files on a Common Key (JOIN) Using Arrays
Create File
cat << 'EOF' > file1.txt
1 Alice 25
2 Bob 30
3 Carol 28
EOF
Create File
cat << 'EOF' > file2.txt
1 Tokyo
2 Osaka
4 Fukuoka
EOF
Command
awk 'NR==FNR {a[$1]=$2; next} ($1 in a) {print $1, a[$1], $2, $3}' file2.txt file1.txt
Output
1 Tokyo Alice 25
2 Osaka Bob 30
How It Works
| Step | Content | Description |
|---|---|---|
| 1 | NR==FNR | True only while reading the first file (file2.txt) |
| 2 | a[$1]=$2 | Stores the value ($2) in the array using the key ($1) |
| 3 | next | Skips the remaining rules and moves to the next line |
| 4 | ($1 in a) | Checks whether file1's key exists in the array |
| 5 | print $1, a[$1], $2, $3 | Outputs the JOIN result |
Explanation
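By loading one file into an associative array, awk can join two files with a single pass over each file and one array lookup per line.
The result is equivalent to an SQL inner join: ID 3 (present only in file1.txt) and ID 4 (present only in file2.txt) are dropped, and the whole operation fits in a one-liner.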
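A small change turns the inner join into something closer to a left join, where every line of file1.txt is kept. The sketch below is one possible variant, not part of the original example: the array name city and the "N/A" placeholder are arbitrary choices, and the files are recreated in a temporary directory so that existing files are not overwritten.

```shell
# Sketch: keep all rows of file1.txt and fill in "N/A" when no city matches.
dir=$(mktemp -d)
printf '1 Alice 25\n2 Bob 30\n3 Carol 28\n' > "$dir/file1.txt"
printf '1 Tokyo\n2 Osaka\n4 Fukuoka\n' > "$dir/file2.txt"
out=$(awk 'NR == FNR { city[$1] = $2; next }       # first file: build lookup
    { print $1, $2, $3, (($1 in city) ? city[$1] : "N/A") }' \
    "$dir/file2.txt" "$dir/file1.txt")
printf '%s\n' "$out"
rm -r "$dir"
# prints:
# 1 Alice 25 Tokyo
# 2 Bob 30 Osaka
# 3 Carol 28 N/A
```

Using in for the lookup avoids creating empty entries in city for unmatched keys.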
Key Takeaways for Mastering Arrays in awk
Arrays in awk are more than simple indexed lists: they are associative arrays, which makes them a flexible and powerful feature.
Understanding the basics, such as the fact that no declaration is required, the freedom to use any string as a key, and iteration and lookup with for loops and the in operator, greatly expands the range of applications.
They are especially useful in practical scenarios such as log analysis, data aggregation, and file joining.
Beginners are often caught out by implicit behavior such as automatic element creation and unspecified iteration order, but working through each rule one at a time builds skills you can use in practice.
