Finding text in text files within multiple tar archives without extracting any file onto the file system

So I was working the other day and suddenly needed to find a piece of text within loads of text files spread across a multitude of tar files. Normally, you’d extract them all into a temp directory and then use grep to find the text you want. However, this doesn’t work very well when your tar files number in the thousands. Also, there is something uncool about doing stuff manually when you’re in a world where there’s always a better way.

After an hour of experimentation, I found the following magic command:

find /home/manthan/loads_of_tarfiles -name "*.tar" -type f -exec sh -c "tar -tf {} | grep awesome | xargs -I found -n 1 tar -Oxf {} found | grep im" \; -print

The above command finds all files (indicated by -type f) in the /home/manthan/loads_of_tarfiles folder with the extension .tar (indicated by -name "*.tar"). For each file it finds, it runs a shell command (indicated by -exec sh -c) which lists all files contained within that tar file (indicated by tar -tf {}, where find replaces {} with the path of the tar file it found). From that listing, it keeps only the files whose names contain the word awesome (grep awesome). This pattern can be anything, as long as you know what kind of file you are looking for.
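To see the listing step on its own, here is a small sketch you can run in a scratch directory. The archive and file names (sample.tar, awesome_notes.txt, boring.txt) are made up for the demo and are not part of my real setup.

```shell
# Build a throwaway archive to try the listing step on (hypothetical file names).
workdir=$(mktemp -d)
printf 'im here\n' > "$workdir/awesome_notes.txt"
printf 'nothing\n' > "$workdir/boring.txt"
tar -cf "$workdir/sample.tar" -C "$workdir" awesome_notes.txt boring.txt

# List the member names without extracting anything,
# then keep only the names that contain "awesome".
members=$(tar -tf "$workdir/sample.tar" | grep awesome)
echo "$members"

# Clean up the scratch directory.
rm -rf "$workdir"
```

Only the member whose name contains awesome survives the grep, and nothing from the archive was written to disk by the listing itself.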

For each text file (whose name contains the word awesome) found within the tar file, xargs runs tar once, substituting the file name for the placeholder found (indicated by -I found). That tar call extracts the contents of the file to STDOUT (indicated by tar -Oxf {} found, where {} is the tar file found in the first part of the command and found is the text file picked out by grep awesome). Now that the contents are on STDOUT, a simple search finds the text we need (grep im, which searches for the text im in the extracted contents). If it is found, the matching lines are printed, and the name of the tar file they matched in is printed too (indicated by -print, which find only reaches when the shell command exits successfully, that is, when the final grep found a match).
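If your tar files might have spaces or quotes in their names, a variant worth trying is to hand the archive path to the shell as a positional parameter ("$1") instead of embedding {} inside the quoted command. The sketch below is not the command I originally used: it swaps xargs for a while read loop, and the demo archives (one.tar, two.tar) and their contents are invented so the whole thing runs on its own.

```shell
# Create two small demo archives (hypothetical names and contents).
demo=$(mktemp -d)
printf 'first line\nim in here\n' > "$demo/awesome_log.txt"
printf 'im ignored\n' > "$demo/other.txt"
tar -cf "$demo/one.tar" -C "$demo" awesome_log.txt other.txt
printf 'no match\n' > "$demo/awesome_empty.txt"
tar -cf "$demo/two.tar" -C "$demo" awesome_empty.txt

# For every .tar file: list members, keep names containing "awesome",
# stream each of those members to STDOUT, and grep the stream for "im".
# The archive path arrives as "$1", so it is never re-parsed by the shell.
# -print runs only when the inner shell exits 0, i.e. when grep matched.
result=$(find "$demo" -name '*.tar' -type f -exec sh -c '
  tar -tf "$1" | grep awesome | while IFS= read -r member; do
    tar -Oxf "$1" "$member"
  done | grep im
' sh {} \; -print)
echo "$result"

# Clean up the scratch directory.
rm -rf "$demo"
```

Here only one.tar gets printed: other.txt contains im but its name does not contain awesome, and awesome_empty.txt has the right name but no matching text.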

Phew! That was lengthy. As usual, if you have any questions or find a better way, let me know in the comments below and I’ll make sure to update the post with credits. Enjoy!
