The Power of Awk
Yesterday, I used a shell script I had written a while back to convert some VTT subtitle files into normal text. However, because this batch of files came from yt-dlp, there was extra formatting, and I had to write a second shell script for post-processing. These took a while to write, but also a long time to run, and it was quite late by the time I had working outputs from them. After it was all done, I wondered whether it would have been faster to write some awk scripts, given that this is text processing and that's what awk is best at. Awk wasn't my first choice, as it's not a language I'm very familiar with, but today I thought I'd give it a try, for argument's sake, to see if it would be any faster. Writing and testing the scripts took about 1.5 hours (much of it spent finding out how to do things), but I now have some scripts that work.
Original shell scripts
First, the shell scripts I used yesterday:
vtt.sh:
#!/bin/sh
#vtt converter
vttfile="$1"
textfile="$2"
i='1'
numl="$(wc -l "$vttfile"|cut -f 1 -d ' ')"
mark='0'
ci=''
getci() {
    cat "$vttfile"|sed -ne "${1}p"
}
echo "" > "$textfile"
while test "$i" -lt "$numl"; do
    ci="$(getci "$i")"
    echo "$i: $ci"
    if test "$mark" == '0'; then
        while test "$(echo "$ci"|sed -ne '/-->/p')" == ""; do
            i="$(($i + 1))"
            ci="$(getci "$i")"
        done
        mark="1"
    else
        if test "$ci" == ''; then
            mark='0'
        else
            echo "$ci" >> "$textfile"
        fi
    fi
    i="$(($i + 1))"
done
vtt-post.sh:
#!/bin/sh
#vtt post processing
subs="$1"
textfile="$2"
ci=''
pci=''
i='1'
numl="$(wc -l "$subs"|cut -f 1 -d ' ')"
getci() {
    cat "$subs"|sed -nbe "${1}p"
}
getcif() {
    getci "$1"|sed -be "/[<][0-9].*[>]/s/.*//"
}
echo "" > "$textfile"
while test "$i" -lt "$numl"; do
    ci="$(getcif "$i")"
    if test ! -z "$(echo $ci)"; then
        if test "$ci" != "$pci"; then
            echo "$i: $ci"
            echo "$ci" >> "$textfile"
        fi
        pci="$ci"
    fi
    i="$(($i + 1))"
done
while test -z "$(getci "$i")"; do
    i="$(($i - 1))"
done
ci="$(getci "$i")"
if echo "$ci"|grep -q '[<][0-9].*[>]'; then
    ci="$(echo "$ci"|sed -be 's/[<][^>]*[>]//g')"
    echo "$i: $ci"
    echo "$ci" >> "$textfile"
fi
unix2dos "$textfile"
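For context, the yt-dlp files interleave cue timing lines, text carrying inline per-word timestamp tags, and plain repeats of the same text. An invented cue pair looks roughly like this:

00:00:01.240 --> 00:00:03.160 align:start position:0%
so<00:00:01.520><c> today</c><00:00:02.080><c> we</c><00:00:02.600><c> are</c>

00:00:03.160 --> 00:00:05.200 align:start position:0%
so today we are

Between them, the two scripts keep the text lines that sit between each timing line and the next blank line, drop the copies that still carry the inline <...> tags, and skip consecutive duplicates.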
These scripts took about 2-5 minutes each per VTT file, and I had to run vtt.sh first to produce an intermediate file, then run vtt-post.sh on that.
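For reference, the two-step invocation was along these lines (the file names here are just placeholders):

# step 1 extracts the cue text into an intermediate file, step 2 cleans up the yt-dlp formatting
./vtt.sh "video.en.vtt" "video.intermediate.txt"
./vtt-post.sh "video.intermediate.txt" "video.txt"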
The awk scripts
First, I converted vtt.sh to awk:
#!/bin/awk -f
#vtt converter: print only the subtitle text between a timing line and the next blank line
BEGIN { mark=0 }
# a "-->" timing line means the lines that follow are cue text
/-->/ { mark = 1 }
# a blank line ends the cue
/^[ ]*$/ { mark = 0 }
# print the cue text itself, but not the timing line
mark == 1 && $0 !~ /-->/ { print $0 }
That was short (though it took a while to write it)! It also ran through in under a second!
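I saved it as vtt.awk (a name I've picked just for this write-up) and ran it with the output redirected to the same sort of intermediate file, roughly like so:

awk -f vtt.awk "video.en.vtt" > "video.intermediate.txt"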
Then I converted vtt-post.sh to awk:
#!/bin/awk -f
#vtt post processing
BEGIN { lci=""; pci=""; li=0; ii=0 }
# lines with no <...> tags are the plain copies of the subtitle text: print each one once
/^[^<>]+[^<> ]+[^<>]*$/ { if (pci != $0) { print $0; pci = $0; ii = NR } }
# remember the most recent line that still carries inline <timestamp> tags
/[<][0-9].*[>]/ { lci=$0; li = NR }
# if the file ended on a tagged line with no plain copy after it, strip the tags and print it
END { if (li > ii) { gsub(/[<][^>]*[>]/,"",lci); print lci } }
A bit more complex in some ways, and I had to run "unix2dos" on the output manually, but even with that added to the end of the launching command, it still ran in under a second!
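That launching command, with unix2dos tacked onto the end, was roughly this (vtt-post.awk and the file names are again placeholders of my own):

awk -f vtt-post.awk "video.intermediate.txt" > "video.txt" && unix2dos "video.txt"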
I then wondered if I could merge them.
vtt-combined.awk:
#!/bin/awk -f
#vtt and vtt-post combined
BEGIN { mark=0; lci=""; pci=""; li=0; ii=0 }
/-->/ { mark = 1 }
/^[ ]*$/ { mark = 0 }
mark == 1 && $0 !~ /-->/ && /^[^<>]+[^<> ]+[^<>]*$/ { if (pci != $0) { print $0; pci = $0; ii = NR } }
mark == 1 && $0 !~ /-->/ && /[<][0-9].*[>]/ { lci=$0; li = NR }
END { if (li > ii) { gsub(/[<][^>]*[>]/,"",lci); print lci } }
It would appear so! Again I had to add "unix2dos" to the launching command, but the whole thing still ran in under a second!
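So the whole job, from raw yt-dlp VTT to finished text file, comes down to a single command along these lines (file names are placeholders):

awk -f vtt-combined.awk "video.en.vtt" > "video.txt" && unix2dos "video.txt"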
The only difference between the output of the shell scripts and the output of the final awk script was a blank line at the start of the shell scripts' file, which I didn't want anyway (thinking back, I should have used "echo -n '' > $textfile" instead of "echo '' > $textfile" in my shell scripts to avoid this, but the problem doesn't occur in awk at all).
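A further aside: "echo -n" isn't guaranteed to suppress the newline in every /bin/sh, so an even safer fix would have been to truncate the file without echoing anything at all:

: > "$textfile"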
Conclusion
A job costing about 10 minutes of computer time reduced to under a second? Lesson learned! The only catch is that awk is slightly harder to debug, but with scripts and runtimes this short, that matters far less than I had thought it would.