The Power of Awk
Yesterday, I used a shell script I had written a while back to convert some VTT subtitle files into normal text. However, because this batch of files came from yt-dlp, there was extra formatting, and I had to write a second shell script for post-processing. These took a while to write, but also a long time to run, and it was quite late by the time I had working outputs from them. After it was all done, I wondered whether it would have been faster to write some awk scripts, given that this is text processing and that's what awk is best at. Awk wasn't my first choice, as it's not a language I'm very familiar with, but today I thought I'd give it a try, for argument's sake, to see if it would be any faster. Writing and testing the scripts took about 1.5 hours (much of it spent finding out how to do things), but I now have some scripts that work.
Original shell scripts
First, the shell scripts I used yesterday:
vtt.sh:
#!/bin/sh
#vtt converter
vttfile="$1"
textfile="$2"
i='1'
numl="$(wc -l "$vttfile"|cut -f 1 -d ' ')"
mark='0'
ci=''
getci() {
    cat "$vttfile"|sed -ne "${1}p"
}
echo "" > "$textfile"
while test "$i" -lt "$numl"; do
    ci="$(getci "$i")"
    echo "$i: $ci"
    if test "$mark" == '0'; then
        while test "$(echo "$ci"|sed -ne '/-->/p')" == ""; do
            i="$(($i + 1))"
            ci="$(getci "$i")"
        done
        mark="1"
    else
        if test "$ci" == ''; then
            mark='0'
        else
            echo "$ci" >> "$textfile"
        fi
    fi
    i="$(($i + 1))"
done
vtt-post.sh:
#!/bin/sh
#vtt post processing
subs="$1"
textfile="$2"
ci=''
pci=''
i='1'
numl="$(wc -l "$subs"|cut -f 1 -d ' ')"
getci() {
    cat "$subs"|sed -nbe "${1}p"
}
getcif() {
    getci "$1"|sed -be "/[<][0-9].*[>]/s/.*//"
}
echo "" > "$textfile"
while test "$i" -lt "$numl"; do
    ci="$(getcif "$i")"
    if test ! -z "$(echo $ci)"; then
        if test "$ci" != "$pci"; then
            echo "$i: $ci"
            echo "$ci" >> "$textfile"
        fi
        pci="$ci"
    fi
    i="$(($i + 1))"
done
while test -z "$(getci "$i")"; do
    i="$(($i - 1))"
done
ci="$(getci "$i")"
if echo "$ci"|grep -q '[<][0-9].*[>]'; then
    ci="$(echo "$ci"|sed -be 's/[<][^>]*[>]//g')"
    echo "$i: $ci"
    echo "$ci" >> "$textfile"
fi
unix2dos "$textfile"
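For context, the yt-dlp files interleave cue timing lines, text carrying inline per-word timestamp tags, and plain repeats of the same text. An invented cue pair looks roughly like this:

00:00:01.240 --> 00:00:03.160 align:start position:0%
so<00:00:01.520><c> today</c><00:00:02.080><c> we</c><00:00:02.600><c> are</c>

00:00:03.160 --> 00:00:05.200 align:start position:0%
so today we are

Between them, the two scripts keep the text lines that sit between each timing line and the next blank line, drop the copies that still carry the inline <...> tags, and skip consecutive duplicates.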
These scripts took about 2-5 minutes each per VTT file, and I had to run vtt.sh first to produce an intermediate file, then run vtt-post.sh on that.
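For reference, the two-step invocation was along these lines (the file names here are just placeholders):

# step 1 extracts the cue text into an intermediate file, step 2 cleans up the yt-dlp formatting
./vtt.sh "video.en.vtt" "video.intermediate.txt"
./vtt-post.sh "video.intermediate.txt" "video.txt"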
The awk scripts
First, I converted vtt.sh to awk:
#!/bin/awk -f
#vtt converter: print only the subtitle text between a timing line and the next blank line
BEGIN { mark=0 }
# a "-->" timing line means the lines that follow are cue text
/-->/ { mark = 1 }
# a blank line ends the cue
/^[ ]*$/ { mark = 0 }
# print the cue text itself, but not the timing line
mark == 1 && $0 !~ /-->/ { print $0 }
That was short (though it took a while to write it)! It also ran through in under a second!
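I saved it as vtt.awk (a name I've picked just for this write-up) and ran it with the output redirected to the same sort of intermediate file, roughly like so:

awk -f vtt.awk "video.en.vtt" > "video.intermediate.txt"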
Then I converted vtt-post.sh to awk:
#!/bin/awk -f
#vtt post processing
BEGIN { lci=""; pci=""; li=0; ii=0 }
# lines with no <...> tags are the plain copies of the subtitle text: print each one once
/^[^<>]+[^<> ]+[^<>]*$/ { if (pci != $0) { print $0; pci = $0; ii = NR } }
# remember the most recent line that still carries inline <timestamp> tags
/[<][0-9].*[>]/ { lci=$0; li = NR }
# if the file ended on a tagged line with no plain copy after it, strip the tags and print it
END { if (li > ii) { gsub(/[<][^>]*[>]/,"",lci); print lci } }
A bit more complex in some ways, and I had to run "unix2dos" on the output manually, but even with that added to the end of the launching command, it still ran in under a second!
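That launching command, with unix2dos tacked onto the end, was roughly this (vtt-post.awk and the file names are again placeholders of my own):

awk -f vtt-post.awk "video.intermediate.txt" > "video.txt" && unix2dos "video.txt"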
I then wondered if I could merge them.
vtt-combined.awk:
#!/bin/awk -f
#vtt and vtt-post combined
BEGIN { mark=0; lci=""; pci=""; li=0; ii=0 }
/-->/ { mark = 1 }
/^[ ]*$/ { mark = 0 }
mark == 1 && $0 !~ /-->/ && /^[^<>]+[^<> ]+[^<>]*$/ { if (pci != $0) { print $0; pci = $0; ii = NR } }
mark == 1 && $0 !~ /-->/ && /[<][0-9].*[>]/ { lci=$0; li = NR }
END { if (li > ii) { gsub(/[<][^>]*[>]/,"",lci); print lci } }
It would appear so! Again I had to add "unix2dos" to the launching command, but the whole thing still ran in under a second!
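So the whole job, from raw yt-dlp VTT to finished text file, comes down to a single command along these lines (file names are placeholders):

awk -f vtt-combined.awk "video.en.vtt" > "video.txt" && unix2dos "video.txt"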
The only difference between the output of the shell scripts and the output of the final awk script was a blank line at the start of the shell scripts' file, which I didn't want anyway (thinking back, I should have used "echo -n '' > $textfile" instead of "echo '' > $textfile" in my shell scripts to avoid this, but the problem doesn't occur in awk at all).
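A further aside: "echo -n" isn't guaranteed to suppress the newline in every /bin/sh, so an even safer fix would have been to truncate the file without echoing anything at all:

: > "$textfile"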
Conclusion
A job costing about 10 minutes of computer time reduced to under a second? Lesson learned! The only catch is that awk is slightly harder to debug, but with scripts and runtimes this short, that matters far less than I had thought it would.