Refining text data with a forgotten friend or two
It still amazes me how much work people will do for the sake of a GUI. Don't get me wrong; pointing and clicking a familiar set of controls is much easier to learn and remember when the GUI is done well. To date, though, GUIs still haven't met the base requirement of easy use without an installation, and they are usually kludgy to work with for automated, unattended processes like refining the data in standard ASCII text files. That's where the good ol' Find command steps in!
Let's get familiar with Find and its switches first:
FIND [/V] [/C] [/N] [/I] [/OFF[LINE]] "string" [[drive:][path]filename[ ...]]

  /V         Displays all lines NOT containing the specified string.
  /C         Displays only the count of lines containing the string.
  /N         Displays line numbers with the displayed lines.
  /I         Ignores the case of characters when searching for the string.
  /OFF[LINE] Do not skip files with offline attribute set.
  "string"   Specifies the text string to find.
  [drive:][path]filename
             Specifies a file or files to search.

If a path is not specified, FIND searches the text typed at the prompt
or piped from another command.
What's this? Find can search multiple files AND honor a specific file attribute? It can do negative searches (lines not containing the search string)? For you *nix gurus out there, think of this as grep's little, weak-kneed cousin.
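For readers who think more easily in code, here is a minimal Python sketch of what the /V, /C, /N and /I switches do to a list of lines. It is only a model of the behaviour described in the help text above, not how Find is implemented; the function name `find` and the sample lines are made up for illustration, and the real command also prints a "----------" header per file.

```python
def find(lines, needle, v=False, c=False, n=False, i=False):
    """Rough model of FIND's /V, /C, /N and /I switches over a list of lines."""
    key = needle.lower() if i else needle          # /I: compare case-insensitively
    hits = []
    for number, line in enumerate(lines, start=1):
        haystack = line.lower() if i else line
        matched = key in haystack                  # plain substring test, no regex
        if matched != v:                           # /V: keep the lines that do NOT match
            hits.append(f"[{number}]{line}" if n else line)  # /N: prefix line number
    return len(hits) if c else hits                # /C: report only the count


# Hypothetical sample data, not taken from a real log.
sample = ["Alpha 592", "beta", "ALPHA 593"]
print(find(sample, "alpha", i=True, n=True))
print(find(sample, "alpha", i=True, v=True))
print(find(sample, "592", c=True))
```

The three calls correspond roughly to `find /i /n "alpha"`, `find /i /v "alpha"` and `find /c "592"`.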
It's not limited to text files, either. It can extract particular strings from binary files, too. While not as fully featured as the Unix strings command, this is still a handy ability. For instance, here's Find's output after searching this document (written in Word XP) for spaces:
---------- FINDINGFIND.DOC
Forgotten Find Refining text data with a forgotten friend It still amazes me
the amount of work people will do for the sake of a GUI interface. Don't get
me wrong; pointing and clicking a familiar set of controls is much easier to
learn and remember when the GUI is down well. To date, though, GUIs still haven't
met the base requirement of easy use without an installation and they are usually
kludgy to work with for automated, unattended processes like refining the data
from standard ASCII text files. And it's for those files that the good ol' Find
command steps in!
Let's get familiar with Find and its switches first:
FIND [/V] [/C] [/N] [/I] [/OFF[LINE]] "string" [[drive:][path]filename[ ...]]
/V Displays all lines NOT containing the specified string.
/C Displays only the count of lines containing the string.
/N Displays line numbers with the displayed lines.
/I Ignores the case of characters when searching for the string.
/OFF[LINE] Do not skip files with offline
1h °Ð/ °à=!° "° # $ %°
Forgotten Find
Greg Chapman
Greg Chapman
Microsoft Word 10.0
MouseTrax Computing Solutions
n
Forgotten Find
fj ôÄ (
Microsoft Word Document
Hey!! That's not too bad!! We even found stuff in there we didn't expect, like some of the document's property information (author, company, and application).
Well, that was enjoyable but not really all that useful. What if you were parsing some of the text data from the event files created by another of my monstrous scripts? The one I have in mind is Win32EventRealTime.vbs, at http://pubs.logicalexpressions.com/Pub0009/LPMArticle.asp?ID=115
This script, you might recall, monitors the event log on systems and writes the events out to a text file on the drive. There's a lot of data to parse when the system in question is accessed by a lot of users and you're auditing events. How do you parse out only the information you're interested in? Or, asked differently, how do you FIND what you want in that output file? Here's an example in which we'll look for a particular event code:
find /i /n "592" mycomputerevents.txt
And here's the interesting output:
---------- MYCOMPUTEREVENTS.TXT
[910]8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
[925]8/16/2004 8:05:51 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
That's pretty cool, but what if I've got several files in this directory from which I want to harvest that event? Easy: change the way you use Find:
find /i /n "592" .\*.*
---------- .\EXAMPLE01.TXT
[910]8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
[925]8/16/2004 8:05:51 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
---------- .\EXAMPLE02.TXT
[910]8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
[925]8/16/2004 8:05:51 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
---------- .\EXAMPLE03.TXT
[910]8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
[925]8/16/2004 8:05:51 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
---------- .\MYCOMPUTEREVENTS.TXT
[910]8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
[925]8/16/2004 8:05:51 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
You'll note that I used a funny path specification in the command for Find:
find /i /n "592" .\*.*
In Windows, "." refers to the current directory, just as it does in Unix. I just happened to be running Find with my current directory set to the location of the files I wanted to search, and I wanted to search all of them, so I specified all files with the *.* wildcard.
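The same "every file in this directory" idea can be sketched in Python. This is only an illustration of the approach, not a replacement for Find: the function name `find_in_dir` and the two throwaway files are hypothetical, and the glob pattern "*" is used because Python's "*.*" would skip files without an extension, which the Windows wildcard does not.

```python
import tempfile
from pathlib import Path


def find_in_dir(directory, needle, pattern="*"):
    """Return {file name: [(line number, line), ...]} for every matching file."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue                                  # skip subdirectories
        text = path.read_text(errors="replace")       # tolerate odd bytes
        results[path.name] = [
            (number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if needle in line
        ]
    return results


# Demo with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "example01.txt").write_text("a, 592, created\nb, 593, exited\n")
    Path(tmp, "example02.txt").write_text("nothing here\n")
    demo = find_in_dir(tmp, "592")
    for name, hits in demo.items():
        print("----------", name.upper())             # header per file, like Find
        for number, line in hits:
            print(f"[{number}]{line}")
```

Like Find, the sketch reports every file it looked at, including those with no hits.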
Now, the data is obviously less detailed than you might want at this point. Okay, so a process was created, but which process? Let's use FINDSTR instead!
FINDSTR is much more powerful than Find, with support for multiple search strings AND regular expressions, a topic I've yet to master and have heard referred to as an art form. Using an approach similar to the one we used with FIND, we can search out even more specific, informative data. For instance, I want to see the logged process creations AND process exits together with any lines that mention firefox.exe. FINDSTR treats space-separated strings as alternatives, so a line is a hit if it contains any one of them, and /S extends the search into subdirectories. I can do that this way:
FINDSTR /i /n /S "592 593 firefox" .\*.txt
This results in another very long set of hits like this:
.\example01.txt:858:8/16/2004 8:04:19 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 593, A process has exited:
.\example01.txt:871:8/16/2004 8:04:25 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 593, A process has exited:
.\example01.txt:884:8/16/2004 8:04:32 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 593, A process has exited:
.\example01.txt:897:8/16/2004 8:05:13 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 593, A process has exited:
.\example01.txt:901: Image File Name: C:\PROGRA~1\MOZILL~1\firefox.exe
.\example01.txt:910:8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:
.\example01.txt:914: Image File Name: C:\PROGRA~1\MOZILL~1\firefox.exe
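Because the space-separated strings are alternatives, the hits above are simply every line containing 592, 593 or firefox; working out which events actually belong to Firefox still means reading each event line together with the "Image File Name" line that follows it. One way to do that pairing is sketched below in Python. It assumes the record layout suggested by the sample output (an event header line, then an "Image File Name" line a little further down); the function name and the notepad.exe record are hypothetical.

```python
import re
from collections import Counter

# Event code as its own comma-separated field on a header line.
EVENT = re.compile(r",\s*(592|593),")


def count_image_events(lines, image="firefox"):
    """Count 592/593 records whose Image File Name line mentions `image`."""
    counts = Counter()
    current = None                       # event code of the record being read
    for line in lines:
        header = EVENT.search(line)
        if header:
            current = header.group(1)
        elif current and "image file name" in line.lower():
            if image in line.lower():
                counts[current] += 1
            current = None               # record handled; wait for the next header
    return counts


sample = [
    r"8/16/2004 8:05:13 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 593, A process has exited:",
    r"    Image File Name: C:\PROGRA~1\MOZILL~1\firefox.exe",
    r"8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:",
    r"    Image File Name: C:\PROGRA~1\MOZILL~1\firefox.exe",
    # Hypothetical record for a different program, to show it is not counted.
    r"8/16/2004 8:04:19 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 593, A process has exited:",
    r"    Image File Name: C:\WINDOWS\notepad.exe",
]
counts = count_image_events(sample)
print("created:", counts["592"], "exited:", counts["593"])
```

If the real log files place the image name elsewhere in the record, the pairing logic would need to change accordingly.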
FINDSTR has a pretty rich array of switches and options. The list is so extensive
that you may discover more than you want to work with, in which case FIND will
remain your preferred tool. Here's the full set of instructions:
FINDSTR /?
Searches for strings in files.

FINDSTR [/B] [/E] [/L] [/R] [/S] [/I] [/X] [/V] [/N] [/M] [/O] [/P] [/F:file]
        [/C:string] [/G:file] [/D:dir list] [/A:color attributes] [/OFF[LINE]]
        strings [[drive:][path]filename[ ...]]

  /B         Matches pattern if at the beginning of a line.
  /E         Matches pattern if at the end of a line.
  /L         Uses search strings literally.
  /R         Uses search strings as regular expressions.
  /S         Searches for matching files in the current directory and all
             subdirectories.
  /I         Specifies that the search is not to be case-sensitive.
  /X         Prints lines that match exactly.
  /V         Prints only lines that do not contain a match.
  /N         Prints the line number before each line that matches.
  /M         Prints only the filename if a file contains a match.
  /O         Prints character offset before each matching line.
  /P         Skip files with non-printable characters.
  /OFF[LINE] Do not skip files with offline attribute set.
  /A:attr    Specifies color attribute with two hex digits. See "color /?"
  /F:file    Reads file list from the specified file(/ stands for console).
  /C:string  Uses specified string as a literal search string.
  /G:file    Gets search strings from the specified file(/ stands for console).
  /D:dir     Search a semicolon delimited list of directories
  strings    Text to be searched for.
  [drive:][path]filename
             Specifies a file or files to search.

Use spaces to separate multiple search strings unless the argument is prefixed
with /C. For example, 'FINDSTR "hello there" x.y' searches for "hello" or
"there" in file x.y. 'FINDSTR /C:"hello there" x.y' searches for
"hello there" in file x.y.

Regular expression quick reference:
  .        Wildcard: any character
  *        Repeat: zero or more occurances of previous character or class
  ^        Line position: beginning of line
  $        Line position: end of line
  [class]  Character class: any one character in set
  [^class] Inverse class: any one character not in set
  [x-y]    Range: any characters within the specified range
  \x       Escape: literal use of metacharacter x
  \<xyz    Word position: beginning of word
  xyz\>    Word position: end of word

For full information on FINDSTR regular expressions refer to the online Command
Reference.
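To get a feel for the word-position anchors in that quick reference, here is a small Python sketch. Python's re module is not FINDSTR's engine; the sketch assumes that a FINDSTR pattern such as \<59[23]\> behaves roughly like Python's \b59[23]\b, which matches 592 or 593 only when it stands alone as a word. The "New Process ID" line is a hypothetical example of a number that merely contains 592.

```python
import re

# Rough Python equivalent of the assumed FINDSTR pattern \<59[23]\>.
# \b is Python's word-boundary anchor, standing in for both \< and \>.
pattern = re.compile(r"\b59[23]\b")

lines = [
    r"8/16/2004 8:05:43 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 592, A new process has been created:",
    r"8/16/2004 8:04:19 PM, MYCOMPUTER, MYCOMPUTER\gchapman, 593, A process has exited:",
    "    New Process ID: 15920",   # hypothetical line; contains 592 inside a longer number
]

matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

A plain substring search for 592 would also pick up the third line, which is the kind of false hit the word anchors are meant to avoid.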
As you can see, there's a great wealth of data you can parse at the command line.
So the next time you've a complex or long bit of data to sort through and you've
a distinct list of items you're looking for, don't forget the old standby commands
FIND and FINDSTR!