Created: January 29th 2019
Last updated: November 10th 2020
Categories: IT Support, Linux
Author: Marcus Fleuti

[SOLVED] Linux (bash/sh) HOWTO find and copy files based on file content + Exclude paths regex

Tags: bash, copy, egrep, exclude, file content, files, find, grep, Linux, Path, regex, rsync, search, shell

This howto explains a command with which you can find files in Linux based on their content and copy them into a folder of your choice.

Example

In our example we want to search a mailbox for all e-mails that were sent by or to people named Scott, Simon or James. We want to exclude certain paths and only look at files whose names start with a digit. Below we show and explain the working command.

What tools are being used

We use the following tools to accomplish this:

find
egrep (or grep -E) - searches with extended regular expressions instead of grep's basic ones. Current GNU grep releases mark egrep as obsolescent in favour of grep -E
rsync - This is a standard package in most Linux distributions, but you might need to install it first

find ./ ! -path "./dont_want_to_search_in_this_path" ! -path "./also_no_paths_that_start_with_tmp*" -type f -name "[0-9]*.*" -exec egrep -m30 -li '(From|To):.*(scott|simon|james)\s+.*>' "{}" \; -exec rsync -av "{}" /copy_to_my_destination_path/ \;

Commands explained

find ./

The command will search in the current folder. You can also give it another starting path, for example find /folder_i_want_to_search_files_in

! -path "./dont_want_to_search_in_this_path" ! -path "./also_no_paths_that_start_with_tmp*"

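Two paths are excluded. You can append as many negated (!) -path tests as you like, or specifically include paths by leaving out the ! operator. Note that -path compares the whole path against a shell pattern. The first test has no wildcard, so it only excludes that exact path and not the files inside it. The trailing * of the second test also covers everything below the matching directories. To exclude a whole directory, end the pattern with /*.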
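
The effect of the wildcard in a -path test can be checked on a throwaway directory tree. The following is a minimal sketch; the directory and file names are made up for illustration, and the -prune variant is an alternative that the original command does not use.

```shell
#!/usr/bin/env bash
# Throwaway directory tree; all names here are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/keep" "$demo/skip_me" "$demo/tmp_cache"
touch "$demo/keep/1.eml" "$demo/skip_me/2.eml" "$demo/tmp_cache/3.eml"
cd "$demo" || exit 1

# No wildcard: only the exact path "./skip_me" is excluded,
# so the file inside that directory is still listed.
exact=$(find . ! -path "./skip_me" -type f | sort)

# With wildcards: everything below the matching directories is excluded.
wild=$(find . ! -path "./skip_me/*" ! -path "./tmp*" -type f | sort)

# Alternative: -prune keeps find from descending into the directories at all.
pruned=$(find . \( -path "./skip_me" -o -path "./tmp*" \) -prune -o -type f -print | sort)

printf 'exact path only:\n%s\n\nwith wildcards:\n%s\n\nwith -prune:\n%s\n' \
  "$exact" "$wild" "$pruned"
```

The ! -path variants still walk through the excluded directories and merely drop the results, while -prune skips them entirely, which can matter on large trees.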

-type f

Search for files only (type f = files).

-name "[0-9]*.*"

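The -name parameter matches the file name (without its directory) against the pattern given inside the quotes. The pattern is a shell wildcard (glob), not a regular expression. The value "[0-9]*.*" means: the file name must start with a digit between 0 and 9, followed by any characters, and it must contain a dot.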
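
To see how find reads the pattern, it can be tried against a few sample names. This is a small sketch with invented file names; the last command only illustrates how differently the same string behaves when it is read as a regular expression.

```shell
#!/usr/bin/env bash
# Sample files with hypothetical names.
demo=$(mktemp -d)
touch "$demo/1234.eml" "$demo/7.msg" "$demo/42" "$demo/notes.txt" "$demo/a1.eml"

# As a glob: one digit, then anything, then a literal dot, then anything.
glob_hits=$(find "$demo" -type f -name "[0-9]*.*" -exec basename {} \; | sort)
printf 'matched by -name:\n%s\n' "$glob_hits"

# Read as a regular expression, the same string means "zero or more digits,
# then zero or more of any character", which matches every name.
regex_hits=$(printf '%s\n' "notes.txt" | grep -cE '^[0-9]*.*$')
printf 'notes.txt matches as a regex: %s\n' "$regex_hits"
```

Here "42" is skipped because it contains no dot, and "a1.eml" is skipped because it does not start with a digit.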

-exec egrep -m30 -li '(From|To):.*(scott|simon|james)\s+.*>'

Find shall execute the command egrep. Egrep will search through each file found by find.

-m30 tells egrep to stop reading a file after 30 matching lines. It does not restrict the search to the first 30 lines of the file. Because -l already makes egrep stop at the first match, -m30 has no practical effect in this command and can be left out. In e-mail files the information we search for is usually within the first 30 lines, but restricting the search to those lines needs a different approach, for example feeding egrep only the output of head -n 30.
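
The difference between counting matching lines and counting lines read shows up on a file whose only match sits below line 30. A minimal sketch with an invented sample file:

```shell
#!/usr/bin/env bash
# Sample file: 40 filler lines, the only match is on line 41.
f=$(mktemp)
i=1
while [ "$i" -le 40 ]; do
  echo "filler $i"
  i=$((i + 1))
done > "$f"
echo "From: Scott Miller <scott@example.com>" >> "$f"

# -m30 stops after 30 *matching* lines, so the match on line 41 is still found.
with_m=$(grep -E -m30 -c 'From:' "$f")

# Cutting the input first really limits the search to the first 30 lines.
# grep exits with status 1 when nothing matches, hence the "|| true".
first30=$(head -n 30 "$f" | grep -E -c 'From:' || true)

printf 'matches with -m30: %s\nmatches in first 30 lines: %s\n' "$with_m" "$first30"
```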

-li combines two options. -l means: do not display the text that was found inside the file, only print the name of each matching file. -i tells egrep to match case-insensitively. The copy step does not depend on this output: find runs the next -exec only if egrep exits successfully, which it does when it finds a match.

'(From|To):.*(scott|simon|james)\s+.*>' is a regex search string. It says: search for a line in the file which contains either the text From or To followed by a colon (:). If that text is found, check whether one of the words scott, simon or james appears later on the same line, followed by at least one whitespace character (this makes sure that "scott" is not part of a word like "scottish"). Anything may follow after that, as long as the character > appears somewhere later on the line.
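
The pattern can be tested on a few header lines before it is run against a whole mailbox. The sketch below uses invented names and addresses and assumes GNU grep, because \s is a GNU extension.

```shell
#!/usr/bin/env bash
pattern='(From|To):.*(scott|simon|james)\s+.*>'

# Hypothetical header lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
From: Scott Miller <scott.miller@example.com>
To: James Brown <jb@example.com>
From: Scottish Tourism <info@example.com>
To: simon@example.com
Subject: Simon called
EOF

# -n prefixes each matching line with its line number.
grep -Ein "$pattern" "$sample"

# -c counts the matching lines instead of printing them.
count=$(grep -Eic "$pattern" "$sample")
printf 'matching lines: %s\n' "$count"
```

Only the first two lines match. "Scottish" fails because no whitespace follows "scott", "simon@example.com" fails for the same reason, and the Subject line contains neither From: nor To:.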

"{}" \;

The curly brackets are a placeholder for the current file. Basically what the system does is the following: find finds a file and passes it to egrep by calling egrep [parameters] filepath/filename. The filepath/filename part is represented by the curly brackets.

The backslash and the semicolon (\;) mark the end of the exec command. The backslash keeps the shell from interpreting the semicolon itself.

-exec rsync -av "{}" /copy_to_my_destination_path/ \;

The next command copies the found file to the desired destination path. Again -exec is used, this time to call the file copying/synchronisation program rsync. rsync receives each file passed on by find (again via "{}") and copies it to the destination path. The command is again finished with the backslash-semicolon (\;) parameter.
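
That the second -exec only runs for files that egrep matched can be verified on two sample files. This is a sketch under a few assumptions: the file names and contents are invented, grep -q (quiet, exit status only) replaces -l, and cp stands in for rsync so that the example also runs where rsync is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical source and destination folders.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'From: Scott Miller <s@example.com>\n' > "$src/1.eml"
printf 'From: Someone Else <e@example.com>\n' > "$src/2.eml"

# grep -q prints nothing and exits 0 on a match. find treats exit status 0
# as "true" and only then evaluates the second -exec.
find "$src" -type f -name "[0-9]*.*" \
  -exec grep -Eqi '(From|To):.*(scott|simon|james)\s+.*>' {} \; \
  -exec cp -- {} "$dst"/ \;

copied=$(ls "$dst")
printf 'copied files:\n%s\n' "$copied"
```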

Tips & Tricks - Known issues

Why we use rsync and not cp

In our setup the command cp caused a lot of trouble, for example with spaces in directory names. It's a rather basic copy command, which is why we switched to rsync. We believe that rsync is a bit slower, but in the end it's the tool that does the job nicely and reliably.

Performance

The command is rather slow. Even on SSD storage it takes quite a long time to work through several thousand files. We tried to limit the amount of data that needed to be processed by adding -m30 to egrep, but it did not increase the performance much. That is to be expected, because egrep already stops at the first match when -l is used. The main cost is that {} \; starts a new egrep process and a new rsync process for every single file. We know that ending the find command with these parameters: {} + instead of the ones we used: {} \; increases the performance, but we had trouble making it work like that. An -exec that ends in {} + always returns true, so it cannot act as a filter for a second -exec. Perhaps you can find a better/faster approach. We'd be happy if you commented below.
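
One possible way around the {} + limitation is to let a single grep print the matching file names and hand them to the copy step through a pipe. The following is an untested-on-large-mailboxes sketch, not a drop-in replacement: it assumes GNU grep, xargs and cp, the folders and file contents are invented, and cp again stands in for rsync.

```shell
#!/usr/bin/env bash
# Hypothetical source and destination folders.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'To: James Brown <jb@example.com>\n'  > "$src/10.eml"
printf 'To: Nobody Here <n@example.com>\n'   > "$src/11.eml"
printf 'From: Simon Says <ss@example.com>\n' > "$src/12 with space.eml"

# {} + passes many files to one grep process. -l prints matching file names,
# -Z ends each name with a NUL byte so names containing spaces survive the pipe.
# xargs -0 reads the NUL-separated names, -r skips the copy if there are none,
# and cp -t copies the whole batch into the destination in one call.
find "$src" -type f -name "[0-9]*.*" \
  -exec grep -ElZi '(From|To):.*(scott|simon|james)\s+.*>' {} + |
  xargs -0 -r cp -t "$dst" --

ls "$dst"
```

With this layout only a handful of grep and cp processes are started, no matter how many files find passes along.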