Missing Semester 04 - Data Wrangling
Data wrangling is transforming data from one format into another 💡.
Shoutout to tldr and dashdash for making the man files much easier to work with👍!
Note, I have tldr aliased to help, it's much easier to type! (my other bash aliases).
Regex permalink
- `.` means “any single character” except newline
- `*` zero or more of the preceding match
- `+` one or more of the preceding match
- `[abc]` any one character of a, b, and c
- `(RX1|RX2)` either something that matches RX1 or RX2
- `^` the start of the line
- `$` the end of the line
Greedy matching matches as much text as possible. Add a ? after * or + to make the match non-greedy.
A capture group is the text matched by a parenthesized part of a regex; each group is numbered and can be referenced later ( \1, \2, \3).
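A quick sketch of both ideas. Backreferences work in `sed -E`; non-greedy matching needs a Perl-compatible engine such as GNU grep's `-P` (not available in BSD grep):

```shell
# Capture groups: \1 and \2 refer back to the parenthesized matches
echo 'hello world' | sed -E 's/([a-z]+) ([a-z]+)/\2 \1/'   # prints "world hello"

# Greedy vs. non-greedy (GNU grep -P):
echo '<a><b>' | grep -oP '<.+>'    # greedy: matches the whole "<a><b>"
echo '<a><b>' | grep -oP '<.+?>'   # non-greedy: matches "<a>", then "<b>"
```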
Sed permalink
Edit (find & replace) text in a (non-interactive) scriptable manner.
echo "Welcome to the jungle" | sed 's/jungle/party/'
Find the string jungle and replace it with the string party.
sed 's/jungle/super party/' jungleFile
Find the string jungle (inside the file jungleFile) and replace it with super party.
sed 's/jungle/super party/gi' myfile
Find all occurrences of the string jungle and replace them with super party (ignoring character case).
sed -n '5,10p' myfile.txt
Return lines 5 to 10 inside the file myfile.txt.
sed '20,35d' myfile.txt
Print the whole file except lines 20 to 35 of myfile.txt.
sed’s regular expressions are somewhat weird, and will require you to put a \ before most of these to give them their special meaning. Or you can pass -E.
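To see the difference, compare basic and extended mode on the same pattern (the `\+` escape is a GNU sed extension to basic regex):

```shell
# Basic (default) regex: + is literal unless escaped
echo 'aaa' | sed 's/a\+/X/'    # GNU sed: prints "X"
# Extended regex with -E: + is special directly
echo 'aaa' | sed -E 's/a+/X/'  # prints "X"
```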
sed "s/[aeiou]/*/g" myfile.txt
Find all vowels and replace them with *.
sed 's/[aeiou]/\u&/g' birthday.txt
& refers to the entire matched text (here, each vowel); \u (a GNU sed extension) uppercases the character that follows it, so every vowel becomes uppercase.
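A small demonstration of & as the whole match; note that \u works in GNU sed but not BSD (macOS) sed:

```shell
# & is replaced by whatever the pattern matched
echo "happy birthday" | sed 's/[aeiou]/(&)/g'   # prints "h(a)ppy b(i)rthd(a)y"
# \u uppercases the next character (GNU sed only)
echo "happy birthday" | sed 's/[aeiou]/\u&/g'   # prints "hAppy bIrthdAy"
```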
wc permalink
wc -l file
Count lines in file.
wc -w file
Count words in file.
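Both flags also work on piped input, which is how wc usually appears at the end of a pipeline:

```shell
printf 'one two\nthree four five\n' | wc -l   # 2 lines
printf 'one two\nthree four five\n' | wc -w   # 5 words
```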
sort permalink
sort filename
Sort a file in ascending order.
sort -r
Sort in reverse (descending) order.
sort -n
Will sort in numeric (instead of lexicographic) order
sort -r filename
Sort a file in descending order.
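The -n flag matters whenever lines start with numbers; lexicographic order puts "10" before "2":

```shell
printf '10\n2\n1\n' | sort      # lexicographic: 1, 10, 2
printf '10\n2\n1\n' | sort -n   # numeric: 1, 2, 10
```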
uniq permalink
uniq -c
Will collapse consecutive lines that are the same into a single line, prefixed with a count of the number of occurrences
sort file | uniq
Display each line once.
sort file | uniq -d
Display only duplicate lines.
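Because uniq only collapses consecutive duplicates, the input must be sorted first. A classic combination counts occurrences and ranks them:

```shell
# Count occurrences of each line, most frequent first
printf 'b\na\nb\nb\na\nc\n' | sort | uniq -c | sort -nr
# → 3 b / 2 a / 1 c (count first; leading whitespace varies by platform)
```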
awk permalink
For editing column data.
Awk assigns some variables for each data field found:
- `$0` for the whole line.
- `$1` for the first field.
- `$2` for the second field.
- `$n` for the nth field.
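For example, fields split on whitespace by default, so columns can be picked out or reordered directly:

```shell
echo 'alice 42 engineer' | awk '{print $2}'       # prints "42"
echo 'alice 42 engineer' | awk '{print $3, $1}'   # prints "engineer alice"
```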
Fun alert!
history | awk '{CMD[$2]++;count++;}END { for (a in CMD)print CMD[a] " " CMD[a]/count*100 "% " a;}' | grep -v "./" | column -c3 -s " " -t | sort -nr | nl | head -n10
Display 10 most frequently used bash commands from history.
xargs permalink
xargs
Takes a list of inputs and turns them into arguments. Execute a command with piped arguments coming from another command, a file, etc.
ls CC* | xargs wc
Print the number of lines/words/characters in each file in the list
find /tmp -name core -type f -print | xargs /bin/rm -f
Find files named core in or below the directory /tmp and delete them. Note that this will not work correctly if there are any filenames containing newlines or spaces.
find /tmp -name core -type f -print0 | xargs -0 /bin/rm -f
Find files named core in or below the directory /tmp and delete them, processing filenames in such a way that file or directory names containing spaces or newlines are correctly handled.
find /tmp -depth -name core -type f -delete
Find files named core in or below the directory /tmp and delete them, but more efficiently than in the previous example.
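A safe way to try the null-delimited variant is against a throwaway directory tree (mktemp names are hypothetical, created fresh each run):

```shell
tmp=$(mktemp -d)                       # throwaway directory
mkdir "$tmp/a dir"                     # a directory name containing a space
touch "$tmp/core" "$tmp/a dir/core"
# -print0 / -0 pass filenames NUL-delimited, so spaces are handled correctly
find "$tmp" -name core -type f -print0 | xargs -0 rm -f
remaining=$(find "$tmp" -name core -type f | wc -l)   # 0: both files gone
rm -r "$tmp"
```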
💪 My own legit and tested example:
find . -iname 'IMG*.jpg' -mtime -20 | xargs exiftool -All=
Find (case-insensitively) all images whose names start with IMG and that were modified less than 20 days ago, and strip their EXIF data.
Misc permalink
less
Open a file for interactive reading, allowing scrolling and search.
bc -l
Run the calculator in interactive mode with the standard math library loaded.