Week 2 Discussion - Working with Unix
Self-Help
What could the command man
stand for (as an abbreviation)?
Solution to
From man man
: an interface to the system reference manuals
Browse the manual page for the command man
. man
supports section numbers from 1 to 9. What could be the motivation behind this?
Hint: try man 5 passwd
and man passwd
Solution to
man
also contains documentation about configuration files like /etc/passwd
. So the keyword passwd
could refer to /etc/passwd
and the command passwd
. To differentiate between different types of terms, sections are used.
Why does man passwd
show the manual for the command passwd
but not for the configuration file /etc/passwd
? Browse man man
and find the section where this is explained.
Solution to
From the section DEFAULTS
:
The order of sections to search … . By default it is as follows: 1 … 8 … 5
Which keys are very useful when browsing a man page?
Solution to
/ for searching, then n and Shift+n for jumping to the next and previous match, respectively
q for quitting
Why might the authors of the program apropos
have chosen this name?
Solution to
Cambridge Dictionary - apropos
used to introduce something that is related to or connected with something that has just been said: I had an email from Sally yesterday - apropos (of) which, did you send her that article?
apropos
lists man pages related to a keyword, which could be the reason why the authors chose this name.
You have a directory 2021-Cambridge-travel
with text and image files. The former have the file extensions txt
and odt
, and the latter jpg
. You want to compress the directory using the tool tar
and using the algorithm lzma
.
Try the following approaches to achieve your goal. Which approach do you like most?
using man and apropos
using a search engine
using curl cheat.sh/COMMAND_LINE_TOOL, e.g., curl cheat.sh/tar
Solution to
Personally I like to search the man pages first. If the manual page does not exist, is too complex to read, or does not have examples, then I proceed with a web search.
For very common tasks a web search may be much faster, but some results on the web may be outdated. The reason is that the first results you get are determined by an algorithm which probably favors the most-clicked pages, so it takes a while until a more up-to-date page moves to the top. Compared to web results, the manual pages installed on your system tend to be up to date.
A major disadvantage of man pages is that they traditionally have a strict structure which begins with the description and has some examples at the end, if you are lucky. Most of the time you just want a short example of how to use the tool. I recently discovered cheat.sh, which provides examples for a command (like cheat sheets) and thus overcomes this disadvantage of manual pages. For example, compare man cut with cheat.sh/cut.
You look at the manual page for echo
and see the following options:
…
--help display this help and exit
--version output version information and exit
…
Then you try the following command:
echo --version
which outputs --version
instead of the version number of echo
. What could be the reason?
Solution to
Some basic commands like echo, alias, and type are builtins in Bash. The documentation for builtin commands is in the Bash manual (man bash).
But what does man echo
show you then? Well, this shows you the documentation of the standalone echo
program, which is on the path /usr/bin/echo
:
$ /usr/bin/echo --version
# outputs:
# echo (GNU coreutils) 9.0
# ...
/usr/bin/echo would be used if the shell you are using does not provide its own echo. Shells tend to integrate echo in their core because it is a very frequently used command. A builtin does not have to start a separate process, so it may run faster than /usr/bin/echo.
You may be asking:
How can I know if a command that I am using is a builtin or a standalone command?
You can find out by using type. For example, type echo in Bash outputs:
echo is a shell builtin
But type grep
:
grep is /usr/bin/grep
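The same check can be scripted. The following is a minimal sketch that relies on the POSIX command -v, which prints a full path for a standalone program and only the name for a builtin. The helper name kind_of is made up for this example, and the paths it sees depend on your system.

```shell
# kind_of is a hypothetical helper: classify a command name
# by what `command -v` reports for it.
kind_of() {
    case "$(command -v "$1")" in
        "") echo "not found" ;;
        /*) echo "standalone program" ;;
        *)  echo "builtin, keyword, function or alias" ;;
    esac
}

kind_of echo    # echo is built into the shell
kind_of env     # env is normally a standalone program
```

If a builtin shadows a standalone program, the standalone version can still be started by giving its path or by going through env, e.g., env echo --version.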
Get Wild
What is ISO 8601?
What is its advantage?
Solution to
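ISO 8601 is an international standard for communicating dates and times. Its advantage is that it is unambiguous. For example, in the USA dates are usually written as month-day-year, while many other countries write day-month-year, so a date like 01/02/2021 could cause confusion. In ISO 8601 the order is always year-month-day: 2021-01-02
is the 2nd of January 2021.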
Find a command which generates today’s date in ISO 8601 format.
Optional challenge: Use only apropos
and man
as an exercise (instead of web search). apropos -s 1
only searches in section 1 (commands)
Solution to
date -I
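A practical consequence of the year-month-day order is that sorting ISO 8601 dates as plain text also sorts them chronologically. The sketch below uses made-up dates, and the variable names are only for illustration.

```shell
# Today's date in ISO 8601 format; +%Y-%m-%d is the portable spelling of `date -I`.
today=$(date +%Y-%m-%d)
echo "$today"

# Made-up ISO 8601 dates: sorting them as text puts the oldest date first.
oldest=$(printf '%s\n' 2021-10-03 2021-02-14 2020-12-31 | sort | head -n 1)
echo "$oldest"
```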
What is a wildcard?
When would you use a wildcard?
Solution to
A (single) character which represents a set of characters. In a directory with different filetypes, a wildcard could be used to filter only images, e.g., ls *.jpg
You have a folder with files that have the following format: date-name.filesuffix
(date in ISO 8601 format). How would you list the files from the years 2010 to 2019 that have the month May in their name?
Solution to
ls 201?-05-*
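How strict the pattern needs to be can be tested on a few empty files. The sketch below compares a loose pattern with a stricter one in a temporary directory; the filenames are invented for this example.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/2009-05-01-old.txt" "$demo_dir/2013-05-20-trip.jpg" \
      "$demo_dir/2015-12-05-party.jpg" "$demo_dir/2019-05-31-exam.odt" \
      "$demo_dir/2020-05-02-new.txt"

# Loose pattern: `*` may also swallow the month, so the day "05" in December matches too.
loose=$(cd "$demo_dir" && ls 201*-05*)
# Strict pattern: exactly one character for the last year digit, then "-05-" as the month.
strict=$(cd "$demo_dir" && ls 201?-05-*)

echo "loose:";  echo "$loose"
echo "strict:"; echo "$strict"
```

The loose pattern also lists 2015-12-05-party.jpg, while the strict one only lists the two files from May.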
Search
Which regex metacharacters do you know, and what does each of them represent?
Solution to
single character: .
alternation: |
quantifiers: +, *, ?, {n,m}
subexpressions: (...)
bracket expressions: [...]
negation: [^...]
anchors: ^, $
escape character: \
character classes: [:alnum:], [:alpha:], …
What does [:alnum:]
mean in grep
?
Solution to
According to the grep man page, [:alnum:] is a character class and, as the name suggests, it represents an alphanumeric character.
… For example, [[:alnum:]] means the character class of numbers and letters in the current locale. In the C locale and ASCII character set encoding, this is the same as [0-9A-Za-z].
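As the quoted passage shows, the class name has to be placed inside a bracket expression, hence the double brackets. A small sketch with invented input lines:

```shell
# Keep only the lines that consist entirely of alphanumeric characters.
# Lines containing a space or a hyphen are filtered out.
alnum_only=$(printf '%s\n' 'abc123' 'foo bar' '2021-01-02' 'Lsi' |
    grep '^[[:alnum:]][[:alnum:]]*$')
echo "$alnum_only"
```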
Do the RegexOne interactive exercises.
If you are looking for more advanced exercises, look at Regex Crossword.
What is the difference between grep
and egrep
?
Solution to
grep uses basic regular expressions (BRE) by default. In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning and become literal characters, so we have to use their backslash-escaped versions to get the special meaning back. Other metacharacters like - (when used in a bracket expression), *, ^, $, [, ], and . do not have to be escaped.
egrep is equivalent to grep -E. The option --extended-regexp or -E switches to ERE (extended regular expression) mode, in which we can use many regex metacharacters without escaping.
Note that the description in the grep man page states:
the variant programs egrep and fgrep are the same as grep -E and grep -F, respectively. These variants are deprecated, but are provided for backward compatibility.
We should therefore use grep -E instead of egrep.
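The difference between the two modes can be seen on three short lines. This is a sketch with invented sample input; the variable names are only for illustration.

```shell
sample=$(printf '%s\n' 'ab' 'aab' 'a+b')

# BRE (default): + is a literal character, so only the line "a+b" matches.
bre_plain=$(printf '%s\n' "$sample" | grep 'a+b')
# BRE with a backslash-escaped interval: one or more "a" directly followed by "b".
bre_escaped=$(printf '%s\n' "$sample" | grep 'a\{1,\}b')
# ERE (-E): + is a quantifier and needs no escaping.
ere_plain=$(printf '%s\n' "$sample" | grep -E 'a+b')

echo "BRE a+b:       $bre_plain"
echo "BRE a\\{1,\\}b: $(echo $bre_escaped)"
echo "ERE a+b:       $(echo $ere_plain)"
```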
What is the difference between:
find . -name lsi and find -name lsi?
find -iname lsi and find -name lsi?
Solution to
From the find man page (description): … If no starting-point is specified, . is assumed.
There is no difference. Both commands search recursively (not only in the current directory but also in its subdirectories) for the file with the name lsi.
-iname stands for case-insensitive name, so the pattern lsi additionally matches all upper- and lower-case combinations like Lsi, lSI, etc.
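Both answers can be checked in a temporary directory. The directory layout below is invented for this sketch.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub1" "$demo_dir/sub2"
touch "$demo_dir/lsi" "$demo_dir/sub1/LSI" "$demo_dir/sub2/Lsi"

# -name matches the exact spelling only.
exact=$(cd "$demo_dir" && find . -name lsi)
# -iname ignores case and therefore also finds LSI and Lsi in the subdirectories.
any_case=$(cd "$demo_dir" && find . -iname lsi | sort)

echo "$exact"
echo "$any_case"
```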
What is the difference between ls *.jpg
and find -iname '*.jpg'
?
Solution to
Practical difference: The former lists the jpg files in the current directory only. The latter searches not only in the current directory but also in its subdirectories (a recursive search) and, because of -iname, also matches names like photo.JPG.
A subtle but important difference: In the former command Bash expands the pattern *.jpg before ls runs, so ls receives the expanded filenames, e.g., ls f1.jpg f2.jpg g1.jpg. In the latter no shell expansion is done, because the pattern is quoted ('*.jpg'), which protects it from shell expansion. The pattern *.jpg is then interpreted by find itself.
Note that Bash leaves an unquoted pattern such as *.jpg untouched if there are no filenames that match this glob pattern. That is the reason why the following command fails with the given error message:
$ ls *non-existing-suffix
ls: cannot access '*non-existing-suffix': No such file or directory
In this example we see that ls
receives *non-existing-suffix
untouched by Bash.
More about shell expansion in chapter Shell Programming.
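What a command actually receives after shell expansion can be made visible with a small helper that prints each of its arguments in angle brackets. The helper name show_args and the filenames are invented for this sketch, which assumes the default shell settings (no nullglob or failglob).

```shell
# show_args is a hypothetical helper: print every argument it receives.
show_args() { printf '<%s>' "$@"; echo; }

demo_dir=$(mktemp -d)
touch "$demo_dir/f1.jpg" "$demo_dir/f2.jpg" "$demo_dir/g1.jpg"

# Unquoted and matching: the shell replaces the pattern with the filenames.
unquoted=$(cd "$demo_dir" && show_args *.jpg)
# Quoted: the pattern is passed on literally.
quoted=$(cd "$demo_dir" && show_args '*.jpg')
# Unquoted but without any match: the pattern is also passed on literally.
no_match=$(cd "$demo_dir" && show_args *.png)

echo "$unquoted"
echo "$quoted"
echo "$no_match"
```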
What is the difference between globbing and regular expressions?
Solution to
Glob patterns or globbing are mainly used in describing filenames and file paths, but regular expressions can be used for any string. Glob patterns tend to be shorter than regular expression patterns, but less powerful in return. For example we cannot quantify a character using an interval in a glob pattern:
mktemp XXX.txt
mktemp XXXX.txt
mktemp XXXXX.txt
mktemp XXXXXX.txt
# Now try to find the files which are three to five characters long
ls ?{3,5}.txt # does not work
find -iname '?{3,5}.txt' # does not work either
# Regex helps:
find -regextype egrep -regex './.{3,5}\.txt'
Note that we used the regex type egrep in the last command because there are different regular expression syntaxes, and -regex in find uses the findutils-default syntax by default, which does not support interval quantifiers.
If you are working with files on the shell, glob patterns should be sufficient most of the time.
# create files and folders with random filenames
mktemp XXX.a
mktemp XXXX.a
mktemp XXX.c
DIR=$(mktemp -d XXX)
mktemp -p $DIR XXXXX.a
mktemp -p $DIR XXXXX.b
mktemp -p $DIR XXXXX.c
# searching for .a files
find -iname *.a # errors out
find -iname '*.a' # works
# searching for .b files
find -iname *.b # works
find -iname '*.b' # works
# searching for .c files
find -iname *.c # works but only shows a single file
find -iname '*.c' # works and shows all .c files
Why does the first find command error out but the second one works?
Why do both the third and fourth find commands work?
Why does the fifth command show only a single file compared to the sixth command?
Solution to
In the first case the shell expands *.a to the two matching filenames in the current directory, so find receives more than one argument after -iname and errors out. In the second case the quotes prevent the expansion. For a more detailed explanation look up the “paths must precede expression” error message in the find(1) man page.
There are no .b files in the current directory, so Bash leaves this pattern untouched without expanding it. Therefore there is no difference between the third and fourth find command.
The shell expands *.c to the only matching file in the current directory, so the command searches only for this filename. For an elaborate explanation see bash - find and globbing (and wildcards) - Stack Exchange.
Configure
You have a file called diary.txt
. Somehow the file seems to be corrupted. You suspect that your sibling could have played with your shell and edited the file to annoy you.
Do you have an idea how to find out what happened to diary.txt
?
Solution to
grep diary.txt ~/.bash_history
At first sight the Unix command line may seem very clunky and inefficient. For example, navigating to directories using cd
and cd ..
may take more time compared to navigating in a file explorer.
Imagine that you have a file called projects/inf1/notes.md
that you access very often. How could you access this file very efficiently using the shell?
Solution to
Creating an alias using alias
in the shell configuration file, e.g., .bash_profile
for Bash:
alias inf1notes='vim ~/projects/inf1/notes.md'
Differentiate
What is the difference between diff
and sdiff
?
Solution to
diff
outputs only the lines which differ between files. sdiff
shows a side-by-side comparison.
Advice: also try vimdiff
if you want to show differences and edit files at the same time.
You have 4 GB of sequencing data which you want to store in the cloud for the next five years. How can you ensure that the data is intact (not corrupted) when you download it after five years?
Solution to
Backups could help against data loss, but in case of corruption you may need at least three copies to find out which one is corrupted: if two copies agree and the third differs, the third is most likely the corrupted one.
A better solution is to create a checksum of the file and store the checksum along with the file. A large file is much more likely to be corrupted than its short checksum. By recomputing the checksum after the download and comparing it with the stored one, you can check whether the sequencing data is corrupted. In other words, you check the data integrity.
Find at least three checksum generation commands on your shell.
Solution to
shasum
for SHA1, sha256sum
for SHA256, md5sum
for MD5
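Storing a checksum next to the data and verifying it later can be sketched as follows. The example assumes that sha256sum from GNU coreutils is installed; the file name, its content, and the helper verify are invented for illustration.

```shell
demo_dir=$(mktemp -d)
printf 'ACGTTGCA\n' > "$demo_dir/sample.fastq"

# Store the checksum next to the data. Running inside the directory keeps
# the path recorded in the checksum file relative.
(cd "$demo_dir" && sha256sum sample.fastq > sample.fastq.sha256)

# verify is a hypothetical helper: exit status 0 means the file still
# matches its stored checksum.
verify() {
    (cd "$demo_dir" && sha256sum -c sample.fastq.sha256 >/dev/null 2>&1)
}

if verify; then before=intact; else before=corrupted; fi
printf 'N\n' >> "$demo_dir/sample.fastq"    # simulate a corruption
if verify; then after=intact; else after=corrupted; fi

echo "before: $before, after: $after"
```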
Pipes
Write a sequence of at least three commands which are piped together.
Solution to
curl -sH "Accept: text/plain" https://icanhazdadjoke.com/ | tr '[a-z]' '[A-Z]' | cowsay
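If there is no network connection or cowsay is not installed, a pipeline can also be practiced offline. The sketch below finds the most frequent word in a made-up list using only standard tools.

```shell
# sort groups equal words, uniq -c counts each group, sort -rn puts the
# highest count first, head keeps that line, and awk prints only the word.
most_common=$(printf '%s\n' pear apple pear fig apple pear |
    sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "$most_common"
```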
Make
What do rule and dependency mean in the context of make?
What happens with a rule when we invoke make?
Solution to
The following depicts a rule:
target: files-that-target-depends-on(dependencies)
recipe-how-to-make-the-target
If we invoke make, the recipe is executed, but only if
the target does not exist, or
one of the dependencies has changed, so the target should be remade.
make decides this by comparing modification times: it re-executes the recipe if one of the dependencies is newer than the target itself.
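The decision described above can be imitated in the shell with the -nt (newer than) file test. This is only a simplified sketch of the idea, not how make is implemented; the helper name needs_rebuild and the file names are invented for illustration.

```shell
# needs_rebuild TARGET DEPENDENCY...: hypothetical helper that succeeds
# if the recipe for TARGET would have to run.
needs_rebuild() {
    target=$1
    shift
    # Case 1: the target does not exist yet.
    if [ ! -e "$target" ]; then return 0; fi
    # Case 2: at least one dependency is newer than the target.
    for dep in "$@"; do
        if [ "$dep" -nt "$target" ]; then return 0; fi
    done
    return 1
}

demo_dir=$(mktemp -d)
src="$demo_dir/seq1.txt"
out="$demo_dir/seq1-capitalized.txt"
touch -t 202001010000 "$src"    # dependency, older than the target
touch -t 202001020000 "$out"    # target

if needs_rebuild "$out" "$src"; then before=rebuild; else before=up-to-date; fi
touch -t 202001030000 "$src"    # the dependency is edited later
if needs_rebuild "$out" "$src"; then after=rebuild; else after=up-to-date; fi

echo "$before, then $after"
```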
You have three files in your directory seq1.txt
, seq2.txt
, and seq3.txt
which contain sequences of letters. You want to open these files using an ancient program which only supports reading capital letters.
Write a makefile which creates the capital-letter versions of these files with the names seq*-capitalized.txt. You can use the tr command.
Optional: Write a makefile which can capitalize any txt file and store it as *-capitalized.txt. Hint: use wildcards in the makefile.
Solution to
first version:
all: \
  seq1-capitalized.txt \
  seq2-capitalized.txt \
  seq3-capitalized.txt
# tr reads the content of the dependency ($<) and writes the capitalized text to the target ($@)
seq1-capitalized.txt: seq1.txt
	tr 'a-z' 'A-Z' < $< > $@
seq2-capitalized.txt: seq2.txt
	tr 'a-z' 'A-Z' < $< > $@
seq3-capitalized.txt: seq3.txt
	tr 'a-z' 'A-Z' < $< > $@
Solution to
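# Skip files that are already capitalized so they are not processed again
SOURCE_FILES := $(filter-out %-capitalized.txt,$(wildcard *.txt))
TARGET_FILES := $(SOURCE_FILES:.txt=-capitalized.txt)
# Default goal: build every capitalized file
all: $(TARGET_FILES)
%-capitalized.txt: %.txt
	tr 'a-z' 'A-Z' < $< > $@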
Summary and reflection
Did you reach the following learning objectives for this week? Discuss with your partner.
Solve problems by consulting documentation
Use wildcards on the command line to work with multiple files and folders
Use grep, egrep, metacharacters, regular expressions, and find to search in files and directories
Customize Bash
Examine differences among files
Use pipes to deploy the output of one command as the input of another command
Last weeks' review
Look at at least ten problems from the previous weeks. A short review of the previous weeks will reinforce what you have already learned.