
Which UNIX tool to use for text processing?

Posted on November 7th, 2020 by El RIDO

A short decision flow chart to pick the optimal text processing tool on a UNIX-like system (download it as plaintext):

                                   .--------.
                                   | I want |
                                   | to ... |
                                   '--------'
                                        |
                                        v
                                  .-----------.
                            yes  / ... search  \  no
                          .-----(  in freeform  )-----------------.
                          |      \    text.    /                  |
                          v       '-----------'                   v
                   ____________                           .---------------.
                   \           \                    yes  /    ... work     \  no
                    ) use grep  )                 .-----(  with structured  )--------.
                   /___________/                  |      \      text.      /         |
                          |                       v       '---------------'          v
                          v                 ___________                       .-------------.
                     .--------.             \          \            replace  /   ... work    \  delete
               yes  /   It's   \  no         ) use awk  )             .-----(  with freeform  )-----.
            .------(  a simple  )-------.   /__________/              |      \     text.     /      |
            |       \ pattern. /        |                             |       '-------------'       |
            v        '--------'         v                             v           extract           v
.-----------------------.  .-------------------------.  .--------------------------. | .-------------------------.
| grep 'simple pattern' |  |  egrep 'regex pattern'  |  | sed 's/search/replace/g' | | | sed '/search pattern/d' |
'-----------------------'  |            or           |  '--------------------------' v '-------------------------'
                           | grep -E 'regex pattern' |           .---------------------------------------.
                           '-------------------------'           | sed 's/^.*\(search pattern\).*$/\1/g' |
                                                                 |                  or                   |
                                                                 |       grep -Po 'search pattern'       |
                                                                 '---------------------------------------'

I use grep most of the time, like for finding stuff in logs, and fall back on sed when I have to manipulate data, like setting values in configuration files. But I keep finding myself constructing complex regexes to find and then extract certain outputs or piping from grep into sed, only to realize that my use case would be much simpler to solve in awk.
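To make that concrete, here is a minimal sketch of the same extraction done both ways. The log lines and the variable names (`log`, `via_grep_sed`, `via_awk`) are made up for illustration and are not from a real system.

```shell
# Hypothetical sample log: date, level, service, message.
log='2020-11-07 ERROR db connection lost
2020-11-07 INFO web request served
2020-11-07 ERROR web upstream timeout'

# grep selects the lines, sed then cuts out the third column with a regex.
via_grep_sed=$(printf '%s\n' "$log" | grep ' ERROR ' | sed 's/^[^ ]* [^ ]* \([^ ]*\) .*$/\1/')

# awk does both steps at once: compare field 2, print field 3.
via_awk=$(printf '%s\n' "$log" | awk '$2 == "ERROR" { print $3 }')

printf '%s\n' "$via_awk"
```

Both pipelines should yield the same service names, but the awk version states the intent (level equals ERROR, print the service) without a capture-group regex.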

Here is a recent example: I needed to find and delete docker images matching a name and tag pattern and above a certain size. Yes, docker already has a --filter argument, but unfortunately every time I want to use it, it doesn’t support the filter conditions I need. The awk solution ends up being pretty readable:

docker images --format '{{.Repository}}:{{.Tag}}\t{{.Size}}\t{{.ID}}' | \
awk '$1 ~ "prefix.+:tag" && $2 > 10 && $3 == "GB" { print $4 }' | \
xargs -r docker rmi

awk splits each line at whitespace into numbered fields ($1, $2 and so on, much like positional parameters in a POSIX shell; $0 is the whole line) and treats consecutive whitespace as a single delimiter. You can also pass the delimiter character explicitly with the -F argument. Fields can be compared as strings, treated as numerical values, or checked for regex matches. Note that the condition above assumes the size is printed with a space between number and unit (for example "10.5 GB"): that is what makes the number $2, the unit $3 and the image ID $4.
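The three behaviours can be tried in isolation. This is only a sketch with invented input; the variable names (`second`, `shell_field`, `big`) and the sample data are assumptions for illustration.

```shell
# Default splitting: a run of spaces or a tab counts as one delimiter.
second=$(printf 'alpha   beta\tgamma\n' | awk '{ print $2 }')

# Explicit delimiter with -F: take the seventh colon-separated field.
shell_field=$(printf 'root:x:0:0:root:/root:/bin/sh\n' | awk -F: '{ print $7 }')

# Regex match on field 1 combined with a numeric comparison on field 2.
# A string comparison would wrongly rank "3" above "10"; awk compares
# numeric-looking input fields as numbers.
big=$(printf 'app-one 12\napp-two 3\nlib-one 40\n' | awk '$1 ~ /^app-/ && $2 > 10 { print $1 }')

printf '%s\n%s\n%s\n' "$second" "$shell_field" "$big"
```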

The -r option of xargs (a GNU extension, long form --no-run-if-empty) prevents it from launching the command if the previous commands don’t return any output. xargs turns the returned IDs into space-separated arguments appended to the given command, so the command is normally called only once.
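A harmless way to see both effects is to put echo in front of the command, so nothing actually gets deleted. This sketch assumes an xargs that supports -r (GNU findutils does); the IDs are placeholders.

```shell
# With input: all IDs end up as arguments of a single invocation.
with_input=$(printf 'id1\nid2\nid3\n' | xargs -r echo docker rmi)

# Without input: -r skips the command entirely, so nothing is printed.
without_input=$(printf '' | xargs -r echo docker rmi)

printf 'with input:    %s\n' "$with_input"
printf 'without input: %s\n' "$without_input"
```

Without -r, GNU xargs would run the command once even on empty input, which for docker rmi means an error about missing arguments.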

PS: The flowchart was created with asciio.

Tags: Programming, Tools
