Brandon Rozek c5ff5538a6

date

draft

math

medium_enabled

medium_post_id

tags

title

2022-12-18 12:55:32-05:00

false

true

9801b2556737

Capturing Quoted Strings in Sed

Disclaimer: This post assumes some knowledge of regular expressions.

Recently I was trying to capture the value of an HTML attribute in sed. Say I want to extract the href attribute from the following element:

<a href="https://brandonrozek.com" rel="me"></a>

Advice you commonly see on the Internet is to use a capture group for anything between the quotes of the href.

In regular expression land, .* matches any sequence of characters, and in sed's default (basic) syntax a capture group around some regular expression X is written \(X\).

sed "s/.*href=\"\(.*\)\".*/\1/g"

What does this look like for our input?

echo \<a href=\"https://brandonrozek.com\" rel=\"me\"\>\</a\> |\
sed "s/.*href=\"\(.*\)\".*/\1/g"

https://brandonrozek.com" rel="me

It matches all the way to the last " on the line, because .* is greedy! What we want is not to match any character within the quotation marks, but to match any character that is not a quotation mark: [^\"]*

sed "s/.*href=\"\([^\"]*\)\".*/\1/g"

This then works for our example:

echo \<a href=\"https://brandonrozek.com\" rel=\"me\"\>\</a\> |\
sed "s/.*href=\"\([^\"]*\)\".*/\1/g"

https://brandonrozek.com
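To see the two patterns side by side, here is a minimal sketch that runs both against the same input. It assumes a POSIX shell and puts the sed script in single quotes, so the double quotes inside it need no escaping. The variable names html, greedy, and negated are illustrative only.

```shell
# Sample input from the example above (single quotes keep it literal).
html='<a href="https://brandonrozek.com" rel="me"></a>'

# Greedy: .* inside the group runs to the last double quote on the line.
greedy=$(printf '%s\n' "$html" | sed 's/.*href="\(.*\)".*/\1/')

# Negated class: [^"]* stops at the first double quote after href=".
negated=$(printf '%s\n' "$html" | sed 's/.*href="\([^"]*\)".*/\1/')

printf 'greedy:  %s\n' "$greedy"
printf 'negated: %s\n' "$negated"
```

The first line printed carries the trailing rel attribute along with the URL, while the second contains only the URL.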

Within a bash script, we can make this a little more readable by using multiple variables.

QUOTED_STR="\"\([^\"]*\)\""
BEFORE_TEXT=".*href=$QUOTED_STR.*"
AFTER_TEXT="\1"
REPLACE_EXPR="s/$BEFORE_TEXT/$AFTER_TEXT/g"

INPUT='<a href="https://brandonrozek.com" rel="me"></a>'

echo "$INPUT" | sed "$REPLACE_EXPR"
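If the same extraction is needed for more than one attribute, the pattern can be wrapped in a small function. The sketch below is one possible way to do that and is not part of the script above: extract_attr is a hypothetical helper name, and it assumes the attribute name contains no characters that are special to sed and that the value is double-quoted. It uses sed -n together with the p flag, so lines without the attribute print nothing instead of passing through unchanged.

```shell
# extract_attr NAME: read HTML on stdin and print the value of a
# double-quoted NAME="..." attribute from each line that has one.
# Because the leading .* is greedy, the last such attribute on a line wins.
# Hypothetical helper; assumes NAME has no sed-special characters.
extract_attr() {
    sed -n "s/.*[[:space:]]$1=\"\([^\"]*\)\".*/\1/p"
}

html='<a href="https://brandonrozek.com" rel="me"></a>'

printf '%s\n' "$html" | extract_attr href
printf '%s\n' "$html" | extract_attr rel
printf '%s\n' "$html" | extract_attr title   # no such attribute, prints nothing
```

Requiring whitespace before the attribute name is a design choice that keeps href from also matching inside a longer name such as data-href.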
