This howto explains a command with which you can find files in Linux based on their content and copy them into a folder of your choice.
In our example we want to search through all e-mails in a mailbox that were either sent by or sent to people named Scott, Simon or James. We want to exclude certain paths and only consider files whose names start with a number. Below we show and explain the working command.
We use the tools find, egrep and rsync to accomplish this:
find ./ ! -path "./dont_want_to_search_in_this_path/*" ! -path "./also_no_paths_that_start_with_tmp*" -type f -name "[0-9]*.*" -exec egrep -m30 -li '(From|To):.*(scott|simon|james)\s+.*>' "{}" \; -exec rsync -av "{}" /copy_to_my_destination_path/ \;
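The command searches in the current folder (./). You could also set a different root path, like find /folder_i_want_to_search_files_in. The -path patterns must then start with that root path as well.
Two paths are excluded. -path matches against the whole path of each file, so a pattern needs a trailing wildcard (for example "./some_folder/*") to cover the files below a folder. You can append as many negated (!) -path parameters as you like, or specifically include paths by leaving out the (!) operator.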
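The behaviour of -path can be checked on a small scale. The following is a minimal sketch using a throwaway directory tree; the names keep, skip and tmp_cache are invented for the demo and do not come from our setup.

```shell
#!/bin/sh
# Build a throwaway tree; all directory and file names are made up for this demo.
demo=$(mktemp -d)
mkdir -p "$demo/keep" "$demo/skip" "$demo/tmp_cache"
touch "$demo/keep/1.msg" "$demo/skip/2.msg" "$demo/tmp_cache/3.msg"

# Pattern without a wildcard: matches only the directory "./skip" itself,
# so the file below it is still listed.
without_wildcard=$(cd "$demo" && find ./ ! -path "./skip" ! -path "./tmp*" -type f | sort)

# Pattern ending in "/*": matches everything below ./skip, so its files are excluded.
with_wildcard=$(cd "$demo" && find ./ ! -path "./skip/*" ! -path "./tmp*" -type f | sort)

printf 'without wildcard:\n%s\n' "$without_wildcard"
printf 'with wildcard:\n%s\n' "$with_wildcard"
```

In this sketch only the second variant leaves ./keep/1.msg as the single result.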
-type f restricts the search to regular files (f = file), so directories are not passed on.
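The -name parameter matches the file name (without its directory) against the pattern given inside the quotes. This is a shell glob pattern, not a regex. The value "[0-9]*.*" means: the name must start with a digit between 0 and 9, followed by any characters, a dot, and any further characters.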
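To see which names the pattern accepts, here is a small sketch with invented file names in a temporary folder. It only lists the names that find selects.

```shell
#!/bin/sh
# Throwaway directory with made-up file names.
demo=$(mktemp -d)
touch "$demo/1234.eml" "$demo/7.txt" "$demo/42" "$demo/notes.txt" "$demo/a1.eml"

# "[0-9]*.*": one digit, then anything, then a literal dot, then anything.
matched=$(find "$demo" -type f -name "[0-9]*.*" -exec basename {} \; | sort)
printf '%s\n' "$matched"
```

Only 1234.eml and 7.txt are listed. The file 42 is skipped because its name contains no dot.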
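-exec tells find to run the command egrep on each file it finds. egrep is equivalent to grep -E; newer versions of GNU grep print a warning that egrep is obsolescent, so you may prefer to write grep -E.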
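-m30 tells egrep to stop reading a file after 30 matching lines. This is not the same as searching only the first 30 lines: -m counts matches, not lines read, and because -l already stops at the first match, -m30 has practically no effect in this command. We originally added it because in e-mail files the information we search for is usually within the first 30 lines. If you really want to restrict the search to the beginning of each file, you have to cut the file down before it reaches egrep, for example with head -n 30. If you want to search whole files, you can simply remove this parameter.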
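The difference between limiting matches and limiting lines can be shown with a made-up mail file in which the header we are looking for only appears on line 40. This is a sketch and not part of our original command.

```shell
#!/bin/sh
# Made-up mail file: 39 filler lines, then the header we look for on line 40.
mail=$(mktemp)
i=1
while [ "$i" -le 39 ]; do echo "X-Filler: $i"; i=$((i + 1)); done > "$mail"
echo "From: Scott Example <scott@example.com>" >> "$mail"

# -m30 counts matching lines, so the match on line 40 is still found.
if grep -E -m30 -qi 'From:.*scott' "$mail"; then with_m30=found; else with_m30=missing; fi

# Reading only the first 30 lines really does skip line 40.
if head -n 30 "$mail" | grep -E -qi 'From:.*scott'; then first_30=found; else first_30=missing; fi

echo "grep -m30: $with_m30, first 30 lines only: $first_30"
```

Inside find, the head variant needs a small shell wrapper, for example -exec sh -c 'head -n 30 "$1" | grep -E -qi "From:.*scott"' sh {} \;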
-li combines two options. -l means: do not display the text that was found inside of the file, only print the name of each file that contains a match. We only need to know whether a file matched, not what the matching text was. -i tells egrep to ignore case.
'(From|To):.*(scott|simon|james)\s+.*>' is a regex search string. It says: search for a line in the file which contains either the text From or To followed by a colon (:). If that text is found, check whether one of the words scott, simon or james appears later on the same line, directly followed by at least one whitespace character (so "scott" does not match inside a word like "scottish"). After that, anything may follow as long as the character > appears somewhere later on the line. Note that \s+ only matches whitespace: a name directly followed by a special character or the line ending, as in scott@example.com, is not matched. \s is a GNU grep extension.
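The regex can be tried out on a few invented header lines before it is let loose on a whole mailbox. The sketch below assumes GNU grep (because of \s) and counts how many of the sample lines match.

```shell
#!/bin/sh
pattern='(From|To):.*(scott|simon|james)\s+.*>'

# Made-up header lines; names and addresses are examples only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
From: Scott Miller <scott.miller@example.com>
To: James Smith <j.smith@example.com>
From: Scottish Tourism <info@example.com>
To: scott@example.com
Subject: Meeting with Simon tomorrow
EOF

# -c prints the number of matching lines instead of the lines themselves.
matches=$(grep -E -i -c "$pattern" "$sample")
echo "$matches matching lines"
```

Only the first two lines match. "Scottish" and the bare address scott@example.com fail because no whitespace follows the name, and the Subject line has no From: or To: prefix.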
The curly brackets are placeholders for "current file". Basically, the system does the following: the find command finds a file and passes it to egrep by calling egrep [parameters] filepath/filename. The filepath/filename part is represented by the curly brackets.
The backslash and the semicolon (\;) represent the end of the exec command.
The second -exec then copies the found file to the desired destination path by calling the file copying/synchronisation program rsync. find only reaches this second -exec if the first one succeeded, that is, if egrep exited with status 0 because it found a match. rsync receives the file from find (again via "{}") and copies it to the destination path. Again, the command is terminated with backslash-semicolon (\;).
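The chaining of the two -exec parts can be tried safely with echo standing in for rsync. This is a sketch with two invented mail files, only one of which matches the pattern.

```shell
#!/bin/sh
# Two made-up mail files in a throwaway directory.
demo=$(mktemp -d)
printf 'To: Simon Example <simon@example.com>\n' > "$demo/1.eml"
printf 'To: Someone Else <else@example.com>\n' > "$demo/2.eml"

# The second -exec runs only for files where grep exited with status 0 (a match).
# echo stands in for rsync so the demo copies nothing.
copied=$(find "$demo" -type f -name "[0-9]*.*" \
  -exec grep -E -qi '(From|To):.*(scott|simon|james)\s+.*>' "{}" \; \
  -exec echo "would copy" "{}" \;)
echo "$copied"
```

Only 1.eml is reported. Replacing the echo part with the rsync call from our command turns the dry run into a real copy.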
We first tried the command cp, but in our setup it caused a lot of trouble, among other things with paths containing spaces. That's why we switched to rsync. We believe that rsync is a bit slower, but in the end it's the tool that does the job nicely and reliably.
The command is rather slow. Even on SSD storage it takes quite a long time to work through several thousand files. We tried to reduce the amount of data processed by adding -m30 to egrep, but it did not improve performance much. This is not surprising, since -m limits the number of matching lines rather than the number of lines read, and most of the time is likely spent starting a new egrep process for every file. We know that ending the -exec with {} + instead of {} \; improves performance, because find then passes many files to a single invocation, but we had trouble making it work. The reason is that -exec with {} + always returns true, so it cannot act as a filter for a second -exec. Perhaps you can find a better or faster approach. We'd be happy if you commented below.
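One possible way around this, which we have not benchmarked, is to let grep handle many files per call with {} + and hand the list of matching files to the copy step through a pipe. The sketch below assumes GNU grep, xargs and cp. The directories and mail files are invented, and cp -t only stands in for rsync so that the example runs without extra software.

```shell
#!/bin/sh
# Made-up source and destination directories for the demo.
src=$(mktemp -d)
dest=$(mktemp -d)
printf 'From: James Example <james@example.com>\n' > "$src/1 with space.eml"
printf 'From: Nobody <nobody@example.com>\n' > "$src/2.eml"

# grep runs once per batch of files ({} +) and prints NUL-terminated names (-Z).
# xargs -0 reads those names safely, even with spaces; -r skips the copy if nothing matched.
find "$src" -type f -name "[0-9]*.*" \
  -exec grep -E -liZ '(From|To):.*(scott|simon|james)\s+.*>' {} + |
  xargs -0 -r cp -p -t "$dest" --

copied=$(ls "$dest")
echo "$copied"
```

To stay with rsync, the last stage could be written as xargs -0 -r sh -c 'rsync -av "$@" /copy_to_my_destination_path/' sh, which puts the file names in front of the destination.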