Mastering sed and Regular Expressions: From Basics to Advanced Text Processing

Introduction

When it comes to manipulating strings on the command line, sed is an incredibly powerful option. Whether you are replacing file content or extracting specific lines, combining it with regular expressions allows you to complete complex tasks in a single line.

This article provides a step-by-step guide from basic usage to advanced techniques using regular expressions. It is structured to be accessible even for those unfamiliar with the command line, highlighting common pitfalls for beginners.

Reference: sed, a stream editor - GNU Official Documentation

How sed and Regex Replace the First Character with “H”

The primary command we will examine is:

sed 's/./H/' input.txt

Element	Description
`sed`	The stream editor. Processes files or standard input line by line.
`'s/./H/'`	The substitution command. Format: `s/search_pattern/replacement_string/`.
`.`	A regex metacharacter matching any single character.
`H`	The string to replace the match with.
`input.txt`	The target file for processing.

The s command without flags only replaces the first match in each line. Consequently, only the very first character of every line changes to "H".

Before Execution

First, create input.txt using the following command. Note that for lines containing tabs, you may need to press Ctrl+v then Tab in your terminal.

cat << 'EOF' > input.txt
hello.
world.

hello world.
	hello.
1hello 2world.
EOF

input.txt:

hello.
world.

hello world.
	hello.
1hello 2world.

After Execution

sed 's/./H/' input.txt

Output:

Hello.
Horld.

Hello world.
Hhello.
Hhello 2world.

Empty lines remain unchanged because there is no character to match. In Line 5, the leading tab character is replaced by "H". In Line 6, the "1" is the first character, so it becomes "H".

Differences Between GNU sed and BSD sed

macOS typically comes with the BSD version of sed, while Linux usually features the GNU version. Their behaviors can differ.

Item	GNU sed	BSD sed
-i option (In-place)	`sed -i 's/a/b/' file`	`sed -i '' 's/a/b/' file` (requires a backup-extension argument; `''` means no backup)
\t (Tab)	Supported in regex	May not be supported
\+, \?	Supported in BRE	Often not supported
-E option	Enables Extended Regex (ERE)	Also used for ERE

Unless otherwise noted, commands in this article are written for the BSD version.
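
The -i difference in the table can be sidestepped by attaching a backup suffix directly to the option, a form that both GNU sed and BSD sed accept. The following is a minimal sketch; the file name demo.txt and the .bak suffix are arbitrary choices made for illustration.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
printf 'a\n' > "$workdir/demo.txt"

# -i.bak (no space before the suffix) edits in place and
# keeps the original content as demo.txt.bak.
sed -i.bak 's/a/b/' "$workdir/demo.txt"

edited=$(cat "$workdir/demo.txt")       # b
backup=$(cat "$workdir/demo.txt.bak")   # a
echo "edited: $edited, backup: $backup"

rm -rf "$workdir"
```

The backup file can be deleted afterwards if it is not needed.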

Replacing All Occurrences of hello with HI

sed 's/hello/HI/g' input.txt

Element	Description
`s`	Substitute command.
`hello`	The string to search for.
`HI`	The replacement string.
`g`	The global flag. Replaces all matches within a line.

Without the g flag, only the first "hello" per line is replaced. With it, every instance on the line is changed. In input.txt no line contains hello more than once, so both forms produce the same output for this file.
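
To see the g flag make a difference, a line with two occurrences is needed. A minimal sketch using an inline test string rather than input.txt:

```shell
# Same input, with and without the global flag.
first_only=$(printf 'hello hello\n' | sed 's/hello/HI/')
all=$(printf 'hello hello\n' | sed 's/hello/HI/g')

echo "$first_only"   # HI hello
echo "$all"          # HI HI
```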

Extracting Only Lines That Contain hello

sed -n '/hello/p' input.txt

Element	Description
`-n`	Suppresses default output.
`/hello/`	The pattern to match.
`p`	The print command. Explicitly outputs the matched line.

By default, sed outputs every line. Combining -n to suppress output with p to explicitly print matched lines achieves grep-like extraction behavior.
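
The role of -n is easiest to see by running the same p command with and without it. A small sketch on a two-line inline input:

```shell
# Without -n, the matching line is printed twice:
# once by p and once by sed's default output.
without_n=$(printf 'hello\nworld\n' | sed '/hello/p')

# With -n, only the explicit p output remains.
with_n=$(printf 'hello\nworld\n' | sed -n '/hello/p')

echo "$without_n"   # hello, hello, world (one per line)
echo "$with_n"      # hello
```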

Extracting a Column Using Backreferences

sed -n 's/\(hello\).*/\1/p' input.txt

Element	Description
`\(hello\)`	Grouping. The matched content can be referenced as `\1`.
`.*`	Matches any sequence of characters (0 or more).
`\1`	References the string matched by the first group.
`p`	Outputs only if the substitution was successful.

By combining regular expression grouping with backreferences, it is possible to extract only a portion of a line, effectively pulling out a specific column. Only the matched text is replaced, so anything that precedes hello stays on the line.
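
How much of the line survives depends on where .* is placed around the group. The sketch below contrasts the command above with two variants; the second-field pattern and its sample input are illustrative additions, not part of the original example.

```shell
# .* only after the group: text before "hello" is kept.
prefix_kept=$(printf '1hello 2world.\n' | sed -n 's/\(hello\).*/\1/p')

# .* on both sides: only the group itself remains.
group_only=$(printf '1hello 2world.\n' | sed -n 's/.*\(hello\).*/\1/p')

# Pull out the second space-separated field of a line.
second_field=$(printf 'alpha beta gamma\n' | sed -n 's/^[^ ]* \([^ ]*\).*/\1/p')

echo "$prefix_kept"    # 1hello
echo "$group_only"     # hello
echo "$second_field"   # beta
```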

Swapping the Order of hello and world Using Backreferences

sed -n 's/\(hello\) \(world\)/\2 \1/p' input.txt

Element	Description
`\(hello\)`	First group (referenced as `\1`)
`\(world\)`	Second group (referenced as `\2`)
`\2 \1`	Outputs the groups in reversed order

Backreferences are numbered in order as \1, \2, and so on. On lines matching hello world, the substitution is applied and the output becomes world hello.

Removing Consecutive Blank Lines

sed 'N;s/^\n//' input.txt

Element	Description
`N`	Reads the next line into the pattern space and appends it
`s/^\n//`	Deletes the leading newline character

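By default, sed processes one line at a time, and the pattern space never contains a newline, so a pattern such as ^\n cannot match. The N command appends the next input line to the pattern space, separated by a newline, which allows two lines to be processed together. Here, s/^\n// removes that newline when the first line of the pair is empty, joining the blank line with the line that follows it. On input.txt this deletes the blank line 3. Because N consumes lines in pairs, a blank line that lands as the second line of a pair is left in place, so this command is best read as a demonstration of N rather than a general-purpose blank-line remover.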
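
For collapsing a run of blank lines into a single blank line, one commonly cited idiom pairs N with the D command. The sketch below is that alternative, not the command shown above, and it uses an inline sample so the result is easy to check.

```shell
# /^$/N   on a blank line, append the next line to the pattern space
# /\n$/D  if the appended line is blank too, delete up to the first
#         newline and restart the cycle without reading new input
squeezed=$(printf 'a\n\n\n\nb\n' | sed '/^$/N;/\n$/D')

echo "$squeezed"   # a, one blank line, b
```

GNU sed and BSD sed treat a blank line at the very end of the input differently with this idiom, so the sample deliberately ends with a non-blank line.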

Replacing Spaces with Underscores

sed 's/ /_/g' input.txt

Element	Description
` `	A half-width space
`_`	The replacement string
`g`	Replaces all matches on each line

Although it looks straightforward, a common mistake is confusing half-width and full-width spaces. When a full-width space is present, the pattern will not match. Always pay close attention to the type of whitespace character when working with spaces in regular expressions.

Replacing Tab Characters with the String SPACE

sed 's/\t/SPACE/g' input.txt

Element	Description
`\t`	Escape sequence representing a tab character
`SPACE`	The replacement string

In BSD sed, \t may not work inside regular expressions. In that case, use $'\t' or embed an actual tab character by pressing Ctrl+v followed by Tab. Since input.txt contains a tab on line 5, this command can be used to verify the behavior.
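
The $'\t' form relies on ANSI-C quoting in shells such as bash and zsh. A more portable sketch, assuming only a POSIX shell with printf, stores a real tab in a variable (here named tab for illustration) and expands it inside double quotes:

```shell
# printf produces a literal tab; command substitution keeps it.
tab=$(printf '\t')

# Double quotes let the shell insert the tab before sed sees the script.
result=$(printf '\thello.\n' | sed "s/${tab}/SPACE/g")

echo "$result"   # SPACEhello.
```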

Escaping the Dot (.) and Replacing It with DOT

sed 's/\./DOT/g' input.txt

Element	Description
`\.`	An escaped dot. Matches a literal `.` character
`DOT`	The replacement string

In regular expressions, . is a metacharacter meaning "any single character." To match a literal dot, it must be escaped as \.. Forgetting the escape causes every character to match, resulting in unintended replacements.
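
The effect of the missing escape is easy to reproduce on a short string. A minimal sketch, which also shows the bracket form [.] as an alternative way to write a literal dot:

```shell
# Unescaped: every character matches.
unescaped=$(printf 'a.b\n' | sed 's/./DOT/g')

# Escaped or bracketed: only the literal dot matches.
escaped=$(printf 'a.b\n' | sed 's/\./DOT/g')
bracketed=$(printf 'a.b\n' | sed 's/[.]/DOT/g')

echo "$unescaped"   # DOTDOTDOT
echo "$escaped"     # aDOTb
echo "$bracketed"   # aDOTb
```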

Replacing Digits with #

sed 's/[0-9]/#/g' input.txt

Element	Description
`[0-9]`	A character class matching any single digit from 0 to 9
`#`	The replacement string

[0-9] is a regular expression character class that represents a single digit. The Perl-style \d shorthand is not available as a digit class in BSD sed or in GNU sed, so [0-9] is the safe and reliable approach.
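
The POSIX character class [[:digit:]] is another portable way to write the same thing. A short sketch comparing the two spellings on line 6 of input.txt, supplied inline:

```shell
range=$(printf '1hello 2world.\n' | sed 's/[0-9]/#/g')
posix=$(printf '1hello 2world.\n' | sed 's/[[:digit:]]/#/g')

echo "$range"   # #hello #world.
echo "$posix"   # #hello #world.
```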

Dynamic Substitution Using Shell Variables and $1

Wrapping the sed script in double quotes allows shell variables to be expanded.

var="argument"
sed "s/hello/$var/" input.txt
sed 's/hello/$var/' input.txt

Element	Description
`"s/hello/$var/"`	Double quotes. The shell expands `$var` to `argument`
`'s/hello/$var/'`	Single quotes. `$var` is treated as a literal string

Variables are not expanded inside single quotes. In shell scripts, $1 (the first positional argument) is commonly used. For example, sed "s/hello/$1/" input.txt allows the replacement string to be passed dynamically when the script is run.
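
One thing to watch for with expanded variables is a value that contains the delimiter: a slash inside the replacement would end the s command early. A hedged sketch of a common workaround, using | as the delimiter; the variable name dir and its value are made up for illustration.

```shell
dir="/usr/local"

# s|...|...| uses | as the delimiter, so the slashes in $dir
# do not need to be escaped.
result=$(printf 'hello world.\n' | sed "s|hello|$dir|")

echo "$result"   # /usr/local world.
```

Any character that cannot occur in the value can serve as the delimiter. Values that may contain & or backslashes need extra escaping, which this sketch does not attempt.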

Greedy and Non-Greedy Matching

sed -E 's/h.*o/X/' input.txt

Element	Description
`h.*o`	Matches the longest possible string starting with `h` and ending with `o` (greedy matching)

By default, .* in sed performs greedy (longest) matching. When applied to hello world., h.*o does not stop at the first o but continues to the last one in the line, so the match is hello wo and the output is Xrld.

Neither BSD sed nor GNU sed supports non-greedy (shortest) quantifiers such as .*?. The usual workaround is a negated character class such as [^o]*. If non-greedy matching is essential, consider using Perl or Python instead.
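
The difference between the greedy pattern and the negated-class workaround can be sketched on a single line:

```shell
# Greedy: runs to the last "o" on the line.
greedy=$(printf 'hello world.\n' | sed 's/h.*o/X/')

# Negated class: [^o]* cannot cross an "o", so the match stops at the first one.
shortest=$(printf 'hello world.\n' | sed 's/h[^o]*o/X/')

echo "$greedy"     # Xrld.
echo "$shortest"   # X world.
```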

Replacing Only on a Specific Line

sed '4s/hello/HI/' input.txt

Element	Description
`4`	Address specifier targeting only line 4
`s/hello/HI/`	Substitution command

Address specifiers allow processing to be limited to a specific line number or lines matching a regular expression. A range such as 2,4s/hello/HI/ is also supported.
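
The three address forms mentioned here (single line, range, and regular expression) can be compared side by side. A small sketch on a three-line inline input chosen for illustration:

```shell
input=$(printf 'hello\nhello world\nhello\n')

# Line number: only line 2 is changed.
single=$(printf '%s\n' "$input" | sed '2s/hello/HI/')

# Range: lines 2 through 3 are changed.
range=$(printf '%s\n' "$input" | sed '2,3s/hello/HI/')

# Regular expression: only lines containing "world" are changed.
by_regex=$(printf '%s\n' "$input" | sed '/world/s/hello/HI/')

printf '%s\n---\n%s\n---\n%s\n' "$single" "$range" "$by_regex"
```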

About BRE and ERE

sed supports two types of regular expressions: BRE (Basic Regular Expressions) and ERE (Extended Regular Expressions).

BRE example:

sed 's/\(hello\)/HI/' input.txt

In BRE, grouping requires \( and \).

ERE example:

sed -E 's/(hello)/HI/' input.txt

With the -E option, ERE is enabled and grouping can be done with just ( and ). Syntax that requires \( in BRE can be written more cleanly in ERE.
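
As a quick comparison, the same two-group swap can be written in both syntaxes. A minimal sketch; both commands are expected to give identical output:

```shell
# BRE: groups are written \( ... \)
bre=$(printf 'hello world.\n' | sed 's/\(hello\) \(world\)/\2 \1/')

# ERE: groups are written ( ... ); backreferences in the replacement stay \1, \2
ere=$(printf 'hello world.\n' | sed -E 's/(hello) (world)/\2 \1/')

echo "$bre"   # world hello.
echo "$ere"   # world hello.
```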

Replacing hello or world with X Using Extended Regular Expressions

sed -E 's/(hello|world)/X/g' input.txt

Element	Description
`-E`	Enables extended regular expressions
`(hello|world)`	Matches either `hello` or `world`
`X`	The replacement string

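In ERE, | works without escaping. In BRE, alternation has to be written as \|, which GNU sed accepts as an extension but BSD sed may not.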
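
When -E is not an option, simple alternation can often be replaced by chaining two substitutions with -e. A sketch of both approaches on the same line:

```shell
# ERE alternation in one substitution.
ere=$(printf 'hello world.\n' | sed -E 's/(hello|world)/X/g')

# Equivalent result without alternation: one -e per word.
two_steps=$(printf 'hello world.\n' | sed -e 's/hello/X/g' -e 's/world/X/g')

echo "$ere"         # X X.
echo "$two_steps"   # X X.
```

The two forms are interchangeable only when the replacement text cannot itself match one of the later patterns.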

Quick Reference: Common Regular Expressions in sed

Expression	Meaning	Example
`.`	Any single character	`s/./X/`
`*`	Zero or more repetitions of the preceding element	`s/el*/X/`
`^`	Beginning of line	`s/^/> /`
`$`	End of line	`s/$/ end/`
`[abc]`	Any one of a, b, or c	`s/[abc]/*/g`
`[^abc]`	Any character except a, b, or c	`s/[^abc]//g`
`[0-9]`	Any single digit	`s/[0-9]/#/g`
`\(…\)`	Grouping (BRE)	`s/\(hello\)/[\1]/`
`\1`	Backreference	`s/\(hello\) \(world\)/\2\1/`
`\.`	Literal dot	`s/\./,/g`
`\t`	Tab character (GNU sed)	`s/\t/ /g`
`+`	One or more repetitions (ERE)	`s/[0-9]+/#/g`
`?`	Zero or one occurrence (ERE)	`s/e?l/X/g`
`\|`	Alternation (BRE, GNU sed)	`s/hello\|world/X/g`

Reverse Lookup: Find the Command for What You Want to Do

Example 1: Add a string at the beginning of each line

cat << 'EOF' > input.txt
hello.
world.
EOF

sed 's/^/> /' input.txt

^ matches the beginning of a line, and > is inserted there.

Example 2: Remove trailing spaces from each line

cat << 'EOF' > input.txt
hello   
world   
EOF

sed 's/ *$//' input.txt

$ matches the end of a line, and any spaces immediately before it are removed.
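
If the trailing whitespace may include tabs as well as spaces, the POSIX class [[:space:]] covers both. A hedged variant of the command above, run on an inline sample:

```shell
# First line ends with space, tab, space; second line ends with a tab.
cleaned=$(printf 'hello \t \nworld\t\n' | sed 's/[[:space:]]*$//')

echo "$cleaned"   # hello and world, with nothing after either word
```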

Example 3: Delete blank lines

cat << 'EOF' > input.txt
hello.

world.
EOF

sed '/^$/d' input.txt

^$ matches a line where the beginning and end are adjacent — in other words, an empty line. The d command deletes those lines.
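
Lines that look blank but contain spaces or tabs do not match ^$. A sketch of a slightly broader pattern that treats whitespace-only lines as blank too:

```shell
# The second line holds two spaces, the third is truly empty.
kept=$(printf 'hello.\n  \n\nworld.\n' | sed '/^[[:space:]]*$/d')

echo "$kept"   # hello. and world. on consecutive lines
```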

Common Pitfalls When Commands Don’t Work as Expected

Full-width space deletion has no effect

cat << 'EOF' > sample.txt
hello world
EOF

sed 's/　//g' sample.txt

This command looks like it removes a full-width space, but if the character was converted to a half-width space during copy-paste or terminal encoding, the pattern will not match and nothing will change. Use cat -A (GNU), cat -vet (BSD and macOS), or a similar tool to verify the actual characters in the file.
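
To rule out copy-paste conversion entirely, both the test data and the pattern can be built from explicit bytes. The sketch below assumes UTF-8, where the full-width space (U+3000) is the three bytes E3 80 80; the variable name fw is chosen for illustration.

```shell
# Build the full-width space from its UTF-8 bytes (octal 343 200 200).
fw=$(printf '\343\200\200')

# The input is generated the same way, so the character is guaranteed to be present.
result=$(printf 'hello\343\200\200world\n' | sed "s/${fw}//g")

echo "$result"   # helloworld
```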

HTML tag removal deletes the entire line

cat << 'EOF' > sample.txt
<p>hello</p> and <span>world</span>
EOF

sed 's/<.*>//g' sample.txt

Because .* is greedy, it matches from the first < all the way to the last > on the line, removing everything in between. To limit matching to individual tags, use [^>]* instead of .*.
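
The two patterns can be compared directly on the sample line. A minimal sketch:

```shell
line='<p>hello</p> and <span>world</span>'

# Greedy: everything from the first < to the last > is removed.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//g')

# Negated class: each tag is matched and removed on its own.
per_tag=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

echo "[$greedy]"    # []
echo "[$per_tag]"   # [hello and world]
```

This is adequate for simple one-line snippets. It is not a general HTML parser.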

Backslash errors in date format conversion

cat << 'EOF' > sample.txt
2024-04-16
EOF

sed 's/\([0-9]\+\)-\([0-9]\+\)-\([0-9]\+\)/\3\/\2\/\1/' sample.txt

In BSD sed, \+ (one or more repetitions) may not be supported. In that case, rewrite it as [0-9][0-9]*, or switch to extended regular expressions using the -E option.
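
Both rewrites can be sketched as follows. Using # as the delimiter is an extra choice made here so that the slashes in the replacement need no backslashes.

```shell
# ERE: + works unescaped with -E.
ere=$(printf '2024-04-16\n' | sed -E 's#([0-9]+)-([0-9]+)-([0-9]+)#\3/\2/\1#')

# Portable BRE: [0-9][0-9]* stands in for "one or more digits".
bre=$(printf '2024-04-16\n' | sed 's#\([0-9][0-9]*\)-\([0-9][0-9]*\)-\([0-9][0-9]*\)#\3/\2/\1#')

echo "$ere"   # 16/04/2024
echo "$bre"   # 16/04/2024
```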

Broaden Your String Processing Skills with sed and Regular Expressions

sed may look simple at first glance, but combined with regular expressions it handles a wide range of tasks — substitution, extraction, formatting, and more. Beginners often find the differences between BRE and ERE, or the behavioral gaps between BSD and GNU sed, confusing at first. Running the commands in this article on actual files is the best way to build real understanding. Start small, experiment freely, and gradually expand your range of applications.