Simplify data extraction using Linux text utilities

An overview of the most commonly used command-line text tools


Level: Introductory

Harsha S. Adiga (haradiga@in.ibm.com), Software Engineer, IBM

09 Aug 2006

Much of Linux® system administration involves tediously combing through plain-text configuration files. Fortunately, Linux has a rich array of UNIX®-derived data extraction utilities, including head, tail, grep, egrep, fgrep, cut, paste, join, awk, and more. This article looks at each utility and its options, applies them to files typical of day-to-day work, and shows how and why each tool is useful for pulling data from these files.

The Linux operating system is loaded with files: configuration files, text files, documentation files, log files, user files, and the list goes on and on. Quite often, those files contain information you need to access in order to find important data. Although you can easily dump the contents of most files to the screen with standard utilities such as cat, more, and others, there are utilities better suited for filtering and parsing out only those values that are relevant to you.

As you read this article, you can open your shell and try the examples of each utility.

Regular expressions

Before you start, you should first understand what regular expressions are and how to use them.

In their simplest form, regular expressions are the search criteria used for locating text in a file. For example, to find all lines containing the word "admin", you can search for "admin". Thus, "admin" constitutes a regular expression. If you want not only to find "admin" but also to replace it with "root", you can give the appropriate commands in a utility to substitute "root" for "admin". In that substitution, "admin" is the regular expression and "root" is the replacement text.

These basic rules govern regular expressions:

  • Any single character or series of characters can be used to match itself or themselves, as in the "admin" example above.

  • The caret sign (^) signifies the beginning of a line; the dollar sign ($) signifies the end.

  • To literally search for special characters such as the dollar sign, precede them with a backslash (\). For example, \$ searches for $ and not the end of a line.

  • The period (.) represents any single character. For example, ad..n stands for five-character entries, the first two being "ad" and the last being "n". The middle two characters can be anything, but there can be only two of them.

  • In editors such as vi and ed, a regular expression enclosed in slashes (for example, /re/) searches forward through the file, and one enclosed in question marks (for example, ?re?) searches backward. The grep utilities covered below take the expression without these delimiters.

  • Square brackets ([]) signify multiple values, and a minus sign (-) indicates a range of values. For example, [0-9] is the same as [0123456789], and [a-z] is the equivalent of a search for any lowercase letter. If the first character of a list is a caret, it matches any character not in the list.

Table 1 illustrates how these matches work in practice.

Table 1. Sample regular expressions
Example        Description
[abc]          Matches one of "a", "b", or "c"
[a-z]          Matches any one lowercase letter from "a" to "z"
[A-Z]          Matches any one uppercase letter from "A" to "Z"
[0-9]          Matches any one digit from 0 to 9
[^0-9]         Matches any one character other than a digit from 0 to 9
[-0-9]         Matches any one digit from 0 to 9, or a dash ("-")
[0-9-]         Matches any one digit from 0 to 9, or a dash ("-")
[^-0-9]        Matches any one character that is neither a digit from 0 to 9 nor a dash ("-")
[a-zA-Z0-9]    Matches any one alphabetic or numeric character
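
To make these rules concrete, here is a minimal sketch that runs a few of the expressions through grep (covered in the next section) and counts the matching lines. The sample lines and the temporary file are made up for illustration; they are not part of the memo file used later.

```shell
# Hypothetical sample text, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'admin' 'ad12n' 'sysadmin' 'cost: $5' 'room-101' > "$sample"

# ^ and $ anchor the match to the start and end of the line.
anchored=$(grep -c '^admin$' "$sample")

# Each period stands for exactly one character.
dots=$(grep -c '^ad..n$' "$sample")

# A backslash makes the dollar sign literal.
literal=$(grep -c '\$' "$sample")

# A leading dash inside brackets is a literal dash, not a range.
dash=$(grep -c '[-0-9]' "$sample")

echo "anchored=$anchored dots=$dots literal=$literal dash=$dash"
rm -f "$sample"
```

The single quotes keep the shell from touching the special characters before grep sees them.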

With this information under your belt, let's look at the utilities.





grep

The grep utility works by searching through each line of a file (or files) for the first occurrence of a given string. If that string is found, the line is printed; otherwise, the line is not printed. The following file, which I'll name "memo," illustrates grep's usage and results.

To: All Employees

From: Human Resources

In order to better serve the needs of our mass market customers, ABC Publishing is integrating the groups selling to this channel for ABC General Reference and ABC Computer Publishing. This change will allow us to better coordinate our selling and marketing efforts, as well as simplify ABC's relationships with these customers in the areas of customer service, co-op management, and credit and collection. Two national account managers, Ricky Ponting and Greeme Smith, have joined the sales team as a result of these changes.

To achieve this goal, we have also organized the new mass sales group into three distinct teams reporting to our current sales directors, Stephen Fleming and Boris Baker. I have outlined below the national account managers and their respective accounts in each of the teams. We have also hired two new national account managers and a new sales administrator to complete our account coverage. They include:

Sachin Tendulkar, who joins us from XYZ Consumer Electronics as a national account manager covering traditional mass merchants.

Brian Lara, who comes to us via PQR Company and will be responsible for managing our West Coast territory.

Shane Warne, who will become an account administrator for our warehouse clubs business and joins us from DEF division.

Effectively, we have seven new faces on board:

1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
7. SHANE WARNE

Please join me in welcoming each of our new team members.

As a simple example, to find the lines that have the word "welcoming", the best approach would be to use the following command line:

# grep welcoming memo
Please join me in welcoming each of our new team members.

If you look for the word "market", the results are slightly different, as shown below.

# grep market memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.

Note that two matches are found: the requested "market", and "marketing". If the words "marketable" or "marketed" had occurred in the file, the utility would have displayed the lines containing those words as well.

Wildcards and meta-characters can be used with grep, and I strongly recommend that you place them inside quotation marks so that the shell doesn't expand them (for example, as filename patterns) before grep sees them.

To find all lines that contain a number, use the following:

# grep "[0-9]" memo
1. RICKY PONTING
2. GREEME SMITH
3. STEPHEN FLEMING
4. BORIS BAKER
5. SACHIN TENDULKAR
6. BRIAN LARA
7. SHANE WARNE

To find all lines that contain "the", use this:

# grep the memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.

To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:

As you might have noticed, the output contains the word "these", along with exact matches of the word "the".

The grep utility, like almost every other UNIX/Linux utility, is case-sensitive, which means that a completely different result comes from looking for "The" instead of "the".

# grep The memo
To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:

If you are seeking a particular word or phrase and don't care about the case, there are two ways to proceed. The first is to look for both "The" and "the" by using square brackets, as shown below:

# grep "[Tt]he" memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.

To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:

The second method is to use the -i option, which tells grep to ignore case.

# grep -i the memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.

To achieve this goal, we have also organized
the new mass sales group into three distinct
teams reporting to our current sales
directors, Stephen Fleming and Boris Baker. I
have outlined below the national account
managers and their respective accounts in each
of the teams. We have also hired two new
national account managers and a new sales
administrator to complete our account
coverage. They include:

In addition to -i, there are several other command-line options to change grep's output. The most relevant are the following:

  • -c -- Suppress normal output; instead, print a count of matching lines for each input file.
  • -l -- Suppress normal output; instead, print the name of each input file from which output would have normally been printed.
  • -n -- Prefix each line of output with the line number within its input file.
  • -v -- Invert the sense of matching -- that is, select lines that don't match the search criteria.
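
The four options above can be tried side by side. The following is a small sketch on a made-up three-line file (the file name "notes" and its contents are assumptions for illustration, not part of the memo):

```shell
# Hypothetical three-line sample file, used only for this illustration.
dir=$(mktemp -d)
printf '%s\n' 'The first line' 'the second line' 'a third line' > "$dir/notes"

count=$(grep -c 'the' "$dir/notes")            # number of matching lines
numbered=$(grep -n 'third' "$dir/notes")       # line number, colon, line
inverted=$(grep -v -i 'the' "$dir/notes")      # lines without "the" in any case
names=$(cd "$dir" && grep -l 'second' notes)   # file name only

printf '%s\n' "$count" "$numbered" "$inverted" "$names"
rm -rf "$dir"
```

Note that -c counts matching lines, not individual occurrences, and that options can be combined, as -v and -i are here.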




fgrep

fgrep searches files for a fixed string and prints all lines that contain that string. Unlike grep, fgrep treats the search text literally instead of interpreting it as a regular expression. Compared with grep as used so far, fgrep differs in these ways:

  • You can search for more than one object at a time.
  • Because it does no pattern matching, fgrep can be faster than grep, particularly with many search strings.
  • You can't use fgrep to search for regular expressions with patterns.

Suppose you want to pull uppercase names from your earlier memo file. Using grep the way the earlier examples did, finding "STEPHEN" and "BRIAN" takes two separate commands, as shown below:

# grep STEPHEN memo
3. STEPHEN FLEMING

# grep BRIAN memo
6. BRIAN LARA

You can accomplish the same task with just one fgrep command:

# fgrep "STEPHEN
> BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA

Note that a newline is required between entries. Without the newline, the search would look for "STEPHEN BRIAN" on each line. With the newline, it looks for a match to "STEPHEN" and a match to "BRIAN".

Note also that quotation marks must be used around the targeted text. They make the shell pass the multi-line text to fgrep as a single argument, which keeps it distinct from the filename (or filenames) that follow.

Instead of specifying search items on the command line, you can place them in a file and use the contents of that file to search other files. The -f option allows you to specify a master file containing search items for which you search frequently.

For example, imagine a file named "search_items" that contains two search items for which you intend to search:

# cat search_items
STEPHEN
BRIAN

The following command searches for "STEPHEN" and "BRIAN" in our earlier memo file:

# fgrep -f search_items memo
3. STEPHEN FLEMING
6. BRIAN LARA
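
On current systems the same fixed-string search is also available as grep -F, which is the form POSIX standardizes; recent GNU grep releases describe the fgrep name as obsolescent. The sketch below assumes a short stand-in file named "names" in place of the memo and shows patterns taken from a file, patterns given with repeated -e options, and the literal treatment of a period:

```shell
dir=$(mktemp -d)
# Stand-in for the numbered list at the end of the memo.
printf '%s\n' '3. STEPHEN FLEMING' '4. BORIS BAKER' '6. BRIAN LARA' > "$dir/names"
printf '%s\n' 'STEPHEN' 'BRIAN' > "$dir/search_items"

# Patterns read from a file, one per line.
from_file=$(grep -F -f "$dir/search_items" "$dir/names")

# The same patterns given on the command line with repeated -e options.
from_args=$(grep -F -e 'STEPHEN' -e 'BRIAN' "$dir/names")

# As a regular expression, B.R matches "BOR" in BORIS;
# as a fixed string, it matches only a literal "B.R".
regex_hits=$(grep -c 'B.R' "$dir/names" || true)
fixed_hits=$(grep -F -c 'B.R' "$dir/names" || true)

printf '%s\n' "$from_file"
echo "regex=$regex_hits fixed=$fixed_hits"
rm -rf "$dir"
```

The "|| true" is only there because grep exits with a non-zero status when nothing matches.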





egrep

egrep is a more powerful version of grep that allows you to search for more than one object at a time. Objects being searched for are separated by newlines (as with fgrep) or by the pipe symbol (|).

# egrep "STEPHEN
> BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA

# egrep "STEPHEN|BRIAN" memo
3. STEPHEN FLEMING
6. BRIAN LARA

The two commands above do the same job.

Besides the capacity to search for multiple objects, egrep offers the ability to search for repetitions and groups:

  • ? looks for zero repetitions or one repetition of the character that precedes the question mark.
  • + looks for one or more repetitions of the character that precedes the plus sign.
  • ( ) signifies a group.

For example, imagine that you can't remember whether Brian's surname is "Lara" or "Laras".

# egrep "LARAS?" memo
6. BRIAN LARA

This search would match both "LARA" and "LARAS". The following search is a bit different:

# egrep "STEPHEN+" memo
3. STEPHEN FLEMING

It matches "STEPHEN", "STEPHENN", "STEPHENNN", and so on.

If you are looking for a word plus one of its possible derivatives, include the distinguishing characters of the derivative in parentheses.

# egrep -i "electron(ic)?s" memo
Sachin Tendulkar, who joins us from XYZ Consumer
Electronics as a national account manager covering
traditional mass merchants.

This finds a match for both "electrons" and "electronics".

To summarize:

  • A regular expression followed by + matches one or more occurrences of the regular expression.

  • A regular expression followed by ? matches zero or one occurrence of the regular expression.

  • Regular expressions separated by | or by a newline match strings that are matched by any of the expressions.

  • A regular expression can be enclosed in parentheses ( ) for grouping.

  • The command-line parameters you can use include -c, -f, -i, -l, -n, and -v.
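
The summary above can be checked with a short sketch. It uses grep -E, the standardized spelling of egrep, on a made-up word list (the list and the temporary file are assumptions for illustration):

```shell
# Hypothetical word list, one entry per line.
words=$(mktemp)
printf '%s\n' 'LARA' 'LARAS' 'STEPHEN' 'STEPHENN' \
              'electrons' 'electronics' 'electronic' > "$words"

optional=$(grep -E -c '^LARAS?$' "$words")          # S is optional
repeated=$(grep -E -c '^STEPHEN+$' "$words")        # one or more trailing N
grouped=$(grep -E -c '^electron(ic)?s$' "$words")   # the group "ic" is optional
either=$(grep -E -c 'LARA|STEPHEN' "$words")        # either alternative, anywhere

echo "optional=$optional repeated=$repeated grouped=$grouped either=$either"
rm -f "$words"
```

Note that the alternation contains no spaces around the pipe symbol; a space there would become part of the pattern.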




The grep utilities: A real-world example

The grep family of utilities can be used with any system file in text format to find a match in a line. For example, to find the entries in the /etc/passwd file for a user named "root", use the following:

# grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin

Because it looks for a match anywhere in a line, grep finds entries for both "root" and "operator" (the operator account's home directory is /root). If you want to find only the entry with the username "root", you can modify the command as follows:

# grep "^root" /etc/passwd
root:x:0:0:root:/root:/bin/bash
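
One caveat: "^root" also matches any username that merely begins with "root". A sketch of a tighter match, assuming a made-up passwd-style file that contains a hypothetical "rootkit" account, includes the colon that ends the username field:

```shell
# Made-up passwd-style entries; the "rootkit" account is hypothetical.
passwd_sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'rootkit:x:1001:1001::/home/rootkit:/bin/bash' \
  'operator:x:11:0:operator:/root:/sbin/nologin' > "$passwd_sample"

loose=$(grep -c 'root' "$passwd_sample")      # matches anywhere in the line
start=$(grep -c '^root' "$passwd_sample")     # matches at the start of the line
exact=$(grep '^root:' "$passwd_sample")       # username is exactly "root"

echo "loose=$loose start=$start"
printf '%s\n' "$exact"
rm -f "$passwd_sample"
```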





cut

With the cut utility, you can separate columns that could constitute data fields in a file. The default delimiter is the tab, and the -f option is used to specify the desired field.

For example, imagine a text file named "sample" with three tab-separated columns that look like this:

one    two    three
four   five   six
seven  eight  nine
ten    eleven twelve

Now, apply the following command:

# cut -f2 sample

This will return:

two
five
eight
eleven

If you change your command like so:

# cut -f1,3 sample

It returns the first and third columns:

one    three
four   six
seven  nine
ten    twelve

Several command-line options are available with this command. Besides -f, you should be familiar with these two:

  • -c -- Allows you to specify characters instead of fields.
  • -d -- Allows you to specify a delimiter other than the tab.
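
The three options can be compared on generated input. This sketch builds a tab-separated file with printf (the file and the comma-separated line are made up for illustration):

```shell
sample=$(mktemp)
# Columns are separated by real tab characters (\t), cut's default delimiter.
printf 'one\ttwo\tthree\nfour\tfive\tsix\n' > "$sample"

second=$(cut -f2 "$sample")                 # second field of every line
first_third=$(cut -f1,3 "$sample")          # fields 1 and 3, still tab-separated
chars=$(cut -c1-3 "$sample")                # characters 1 through 3
rest=$(printf 'a,b,c\n' | cut -d, -f2-)     # comma delimiter, field 2 onward

printf '%s\n' "$second" "$chars" "$rest"
rm -f "$sample"
```

The field list after -f must not contain spaces; otherwise the shell splits it into separate arguments.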




cut: Two real-world examples

The ls -l command shows the permissions, number of links, owner, group, size, date, and filenames of all the files in a directory -- all separated by white space. If you're not interested in most of the fields and want to see only the file owner, you can use the following command:

# ls -l | cut -d" " -f5
root
562
root
root
root
root
root
root

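This command displays only the file owner, ignoring every other field. Note that cut treats each space as a separate delimiter, so the runs of spaces that ls uses to align its columns produce empty fields. The owner is the fifth field only on lines padded the way this listing is; the stray 562 in the output presumably comes from a line with different padding.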
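
One way to make the field position predictable, sketched here on a made-up ls -l style line rather than real ls output, is to squeeze each run of spaces into a single space with tr -s before calling cut. The owner then lands in the third field regardless of padding:

```shell
# A made-up "ls -l" style line with padded columns.
line='-rw-r--r--   1 root   wheel     562 Aug  9  2006 memo'

# Splitting on single spaces: field 3 is an empty string created by the padding.
naive=$(printf '%s\n' "$line" | cut -d' ' -f3)

# Squeezing repeated spaces first makes the owner a stable third field.
owner=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f3)

echo "naive='$naive' owner='$owner'"
```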

If you know the exact position at which the first character of the file owner begins, you can use the -c option to display the first character of the file owner. Assuming that it begins with the 16th character, the following command returns the 16th character, the first letter of the owner's name.

# ls -l | cut -c16
r

r
r
r
r
r
r

If you further assume that most users will use eight characters or fewer for their name, you can use the following command, which selects the eight characters from position 16 through position 23:

# ls -l | cut -c16-23

It will return those entries in the name field.

Now, assume that the name of the file begins with the 55th character, but that it is impossible to determine how many characters it takes up after that because some filenames are considerably longer than others. A solution is to begin with the 55th character and not specify an ending character (meaning that the entire rest of the line is taken), as shown below:

# ls -l | cut -c55-
a.out
cscope-15.5
cscope-15.5.tar
cscope.out
memo
search_items
test.c
test.s

Now, consider another scenario. To obtain a list of all the users on the system, you can pull only the first field from the /etc/passwd file used in an earlier example:

# cut -d":" -f1 /etc/passwd
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
news
uucp
operator

To collect the usernames and their corresponding home directories, you can pull the first and sixth fields:

# cut -d":" -f1,6 /etc/passwd
root:/root
bin:/bin
daemon:/sbin
adm:/var/adm
lp:/var/spool/lpd
sync:/sbin
shutdown:/sbin
halt:/sbin
mail:/var/spool/mail
news:/etc/news
uucp:/var/spool/uucp
operator:/root





paste

The paste utility combines fields from files. It takes one line from one source and combines it with another line from another source.

For example, imagine that the content of a file named "fileone" is:

IBM
Global
Services

In addition, you have "filetwo" with this content:

United States
United Kingdom
India

The following command combines the contents of these files, as shown below:

# paste fileone filetwo
IBM       United States
Global    United Kingdom
Services  India

If there were more lines in fileone than filetwo, then the pasting would continue, with blank entries following the tab.

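The tab character is the default delimiter, but you can change it with the -d option. The option takes a list of delimiter characters that paste cycles through, so with two files only the first character in the list is used.

# paste -d"," fileone filetwo
IBM,United States
Global,United Kingdom
Services,India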

You can also use the -s option to output all of fileone on a line, followed by a newline and then all of filetwo on the next line.

# paste -s fileone filetwo
IBM           Global            Services
United States United Kingdom    India
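
The paste variations above can be reproduced with the following sketch, which recreates fileone and filetwo in a temporary directory (the directory is an assumption for illustration):

```shell
dir=$(mktemp -d)
printf '%s\n' 'IBM' 'Global' 'Services' > "$dir/fileone"
printf '%s\n' 'United States' 'United Kingdom' 'India' > "$dir/filetwo"

# Default: corresponding lines joined by a tab.
side_by_side=$(paste "$dir/fileone" "$dir/filetwo")

# A single comma as the delimiter between the two files.
comma=$(paste -d, "$dir/fileone" "$dir/filetwo")

# -s joins the lines of one file into a single line.
serial=$(paste -s -d, "$dir/fileone")

printf '%s\n' "$comma" "$serial"
rm -rf "$dir"
```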





join

join is a greatly enhanced version of paste. join works only if the files being joined share a common field.

For example, consider the two files you were using with the paste command previously. Here's what happens when you try to combine them with join:

# join fileone filetwo

Note that there is nothing to display. The join utility must find a common field between the files in question, and by default it expects that common field to be the first.

To see how this works, try adding some new content. Assume that fileone now contains these entries:

aaaa    Jurassic Park
bbbb    AI
cccc    The Ring
dddd    The Mummy
eeee    Titanic

And filetwo now contains the following:

aaaa    Neil    1111
bbbb    Steven  2222
cccc    Naomi   3333
dddd    Brendan 4444
eeee    Kate    5555

Now, try that command again:

# join fileone filetwo
aaaa    Jurassic Park    Neil    1111
bbbb    AI               Steven  2222
cccc    The Ring         Naomi   3333
dddd    The Mummy        Brendan 4444
eeee    Titanic          Kate    5555

The commonality of the first field was identified, and the matching entries were combined. But paste blindly took from each file to create the output; join combines only lines that match, and the match must be exact. For example, imagine you added a line to filetwo:

aaaa    Neil    1111
bbbb    Steven  2222
ffff    Elisha  6666
cccc    Naomi   3333
dddd    Brendan 4444
eeee    Kate    5555

Now, your command will produce this output:

# join fileone filetwo
aaaa    Jurassic Park   Neil     1111
bbbb    AI              Steven   2222

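The output stops early because join expects both files to be sorted on the join field. It reads the two files in step: after matching "bbbb", it looks for a partner for the out-of-order "ffff" line in filetwo, and because "cccc", "dddd", and "eeee" in fileone all sort before "ffff", it discards them as unmatched and reaches the end of fileone. Lines are paired only when their join fields match exactly; unmatched lines are left out of the output.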
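
A common remedy is to sort both files on the join field first. The sketch below assumes cut-down versions of fileone and filetwo with single-word values, so that the default blank-separated fields stay unambiguous:

```shell
dir=$(mktemp -d)
# Shortened, single-word stand-ins for the article's fileone and filetwo.
printf '%s\n' 'bbbb AI' 'cccc Ring' 'eeee Titanic' > "$dir/fileone"
printf '%s\n' 'bbbb Steven' 'ffff Elisha' 'cccc Naomi' 'eeee Kate' > "$dir/filetwo"

# Sort both inputs on the join field (the first field) before joining.
# LC_ALL=C keeps sort and join in agreement about the ordering.
LC_ALL=C sort -k1,1 "$dir/fileone" > "$dir/one.sorted"
LC_ALL=C sort -k1,1 "$dir/filetwo" > "$dir/two.sorted"

joined=$(LC_ALL=C join "$dir/one.sorted" "$dir/two.sorted")

printf '%s\n' "$joined"
rm -rf "$dir"
```

After sorting, the "ffff" line simply has no partner and is skipped, while every other key is paired.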

By default, join looks only at the first fields for matches and outputs all columns, but you can change this behavior. The -1 option lets you specify which field to use as the matching field in fileone, and the -2 option lets you specify which field to use as the matching field in filetwo.

For example, to match the second field of fileone to the third field of filetwo, use the following syntax:

# join -1 2 -2 3 fileone filetwo

The -o option specifies output in the format {file.field}. Thus, to print the second field of fileone and the third field of filetwo on matching lines, the syntax is:

# join -o 1.2 -o 2.3 fileone filetwo
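
Here is a minimal sketch that combines -1, -2, and -o on two made-up files ("films" and "people" are hypothetical names). The -o list is written in its comma-separated form, which is the form POSIX specifies:

```shell
dir=$(mktemp -d)
# Hypothetical files: the shared key is field 2 of the first file
# and field 3 of the second. Both are already sorted on that key.
printf '%s\n' 'Park k1' 'Ring k2' > "$dir/films"
printf '%s\n' 'Neil 1111 k1' 'Naomi 3333 k2' > "$dir/people"

# Join on films field 2 and people field 3; print field 1 of each file.
picked=$(LC_ALL=C join -1 2 -2 3 -o 1.1,2.1 "$dir/films" "$dir/people")

printf '%s\n' "$picked"
rm -rf "$dir"
```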





join: A real-world example

The most obvious way you could use join in the real world would be to pull the username and the corresponding home directory from the /etc/passwd file and the group name from the /etc/group file. Groups appear in the fourth field in numerical format in the /etc/passwd file. Similarly, they appear in the third field in the /etc/group file. As with any join, both files must be sorted on these fields for every match to be found.

# join -1 4 -2 3 -o 1.1 -o 2.1 -o 1.6 -t":" /etc/passwd /etc/group
root:root:/root
bin:bin:/bin
daemon:daemon:/sbin
adm:adm:/var/adm
lp:lp:/var/spool/lpd
nobody:nobody:/
vcsa:vcsa:/dev
rpm:rpm:/var/lib/rpm
nscd:nscd:/
ident:ident:/home/ident
netdump:netdump:/var/crash
sshd:sshd:/var/empty/sshd
rpc:rpc:/
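
Because the real /etc/passwd and /etc/group are not guaranteed to be ordered by group ID, a more robust variant sorts each file on its join field first. This sketch assumes trimmed, made-up stand-ins for the two files rather than the real system files:

```shell
dir=$(mktemp -d)
# Trimmed, made-up stand-ins for /etc/passwd and /etc/group.
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
              'lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin' \
              'bin:x:1:1:bin:/bin:/sbin/nologin' > "$dir/passwd"
printf '%s\n' 'root:x:0:' 'bin:x:1:' 'lp:x:7:' > "$dir/group"

# Sort each file on its own join field: field 4 of passwd, field 3 of group.
LC_ALL=C sort -t: -k4,4 "$dir/passwd" > "$dir/passwd.sorted"
LC_ALL=C sort -t: -k3,3 "$dir/group" > "$dir/group.sorted"

# Print username, group name, and home directory for matching group IDs.
result=$(LC_ALL=C join -t: -1 4 -2 3 -o 1.1,2.1,1.6 \
         "$dir/passwd.sorted" "$dir/group.sorted")

printf '%s\n' "$result"
rm -rf "$dir"
```

The sort is deliberately lexical rather than numeric, because join compares the fields as strings.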





awk

awk is one of the most powerful utilities in Linux. It is actually a programming language in and of itself and can be used with complex logic statements, as well as to simply pull out snippets of text. We'll skip the details, but let's quickly review the syntax and then walk through some real-world examples.

An awk command consists of a pattern and an action composed of one or more statements, as shown in the syntax below:

awk '/pattern/ {action}' file

Notice that:

  • awk tests every record in the specified file (or files) for a pattern match. If a match is found, the specified action is performed.
  • awk can act as a filter in a pipeline or take input from the keyboard (standard input) if no file or files are specified.

One useful action is to print the data! Here is how to reference fields in a record.

  • $0 -- The entire record
  • $1 -- The first field in the record
  • $2 -- The second field in the record

You can also pull multiple fields in a record, separating each field by a comma.

For example, to pull the sixth field from the /etc/passwd file, the command is:

# awk -F: '{print $6}' /etc/passwd
/root
/bin
/sbin
/var/adm
/var/spool/lpd
/sbin
/sbin
/sbin
/var/spool/mail
/etc/news
/var/spool/uucp

Note that -F sets the input field separator, which awk stores in the predefined variable FS. By default, fields are separated by white space; here, -F: sets the separator to a colon.

To pull the first and sixth fields from the /etc/passwd file, the command is:

# awk -F: '{print $1,$6}' /etc/passwd
root /root
bin /bin
daemon /sbin
adm /var/adm
lp /var/spool/lpd
sync /sbin
shutdown /sbin
halt /sbin
mail /var/spool/mail
news /etc/news
uucp /var/spool/uucp
operator /root

To print the file using a dash in place of the colon delimiter between fields, the command is:

# awk -F: '{OFS="-"}{print $1,$6}' /etc/passwd
root-/root
bin-/bin
daemon-/sbin
adm-/var/adm
lp-/var/spool/lpd
sync-/sbin
shutdown-/sbin
halt-/sbin
mail-/var/spool/mail
news-/etc/news
uucp-/var/spool/uucp
operator-/root

To print the file using a dash between fields, and print only the first and sixth fields in reverse order, the command is:

# awk -F: '{OFS="-"}{print $6,$1}' /etc/passwd
/root-root
/bin-bin
/sbin-daemon
/var/adm-adm
/var/spool/lpd-lp
/sbin-sync
/sbin-shutdown
/sbin-halt
/var/spool/mail-mail
/etc/news-news
/var/spool/uucp-uucp
/root-operator
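
The commands above assign OFS inside an action that runs for every record. A common alternative, sketched here on a single made-up record, sets the separators once in a BEGIN block. Assigning a field to itself makes awk rebuild the whole record with the new output separator:

```shell
# One made-up passwd-style record.
line='root:x:0:0:root:/root:/bin/bash'

# Set input and output separators once, before any record is read.
out=$(printf '%s\n' "$line" | awk 'BEGIN {FS=":"; OFS="-"} {print $1, $6}')

# $1=$1 forces awk to rebuild $0 using OFS, so every colon becomes a dash.
rebuilt=$(printf '%s\n' "$line" | awk 'BEGIN {FS=":"; OFS="-"} {$1=$1; print}')

printf '%s\n' "$out" "$rebuilt"
```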





head

The head utility prints the first part of each file (10 lines by default). It reads from standard input if no files are given, or if given a filename of -.

For example, if you want to extract the first two lines from your memo file, the command is:

# head -2 memo
In order to better serve the needs of our mass
market customers, ABC Publishing is
integrating the groups selling to this channel
for ABC General Reference and ABC Computer
Publishing. This change will allow us to
better coordinate our selling and marketing
efforts, as well as simplify ABC's
relationships with these customers in the
areas of customer service, co-op management,
and credit and collection. Two national
account managers, Ricky Ponting and Greeme
Smith, have joined the sales team as a result
of these changes.

You can specify the number of bytes to display using the -c option. For example, if you want to read the first two bytes from the memo file, the command is:

# head -c 2 memo
In
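
The -2 shorthand is a historical form; the standardized spelling is -n 2. A quick sketch on generated input (the four words are made up for illustration):

```shell
# First two lines of a four-line stream.
first_two=$(printf '%s\n' one two three four | head -n 2)

# First two bytes of the same stream.
first_bytes=$(printf '%s\n' one two three four | head -c 2)

printf '%s\n' "$first_two" "$first_bytes"
```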





tail

The tail utility prints the last part of each file (10 lines by default). It reads from standard input if no files are given, or if given a filename of -.

For example, if you want to extract the last two lines from your earlier memo, the command is:

# tail -2 memo

Please join me in welcoming each of our new team members.

You can specify the number of bytes to display using the -c option. For example, if you want to read the last five bytes from the memo file, the command is:

# tail -c 5 memo
ers.
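
As with head, the standardized spelling of the line count is -n. The sketch below, on made-up input, also shows -n +N, which starts at line N instead of counting from the end, and confirms that the trailing newline counts as one of the bytes that -c returns:

```shell
# Last two lines of a four-line stream.
last_two=$(printf '%s\n' one two three four | tail -n 2)

# Everything from line 2 onward.
from_second=$(printf '%s\n' one two three four | tail -n +2)

# Last five bytes: "four" plus the trailing newline
# (command substitution strips that newline again).
last_bytes=$(printf '%s\n' one two three four | tail -c 5)

printf '%s\n' "$last_two" "$last_bytes"
```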





Conclusion

Now you know how to use various utilities to extract data from standard Linux files. Once extracted, that data can be manipulated for viewing and printing or directed into other files or databases. Knowing how to use just this handful of tools can help you spend less time on mundane tasks and become a more efficient administrator.





About the author

Harsha Adiga works in the IBM Software Group in Bangalore, India, and is heavily involved in various Linux and open source communities and working groups. His primary focus areas include Linux and UNIX internals, porting, compilers, and code optimization. He has been involved in software development and testing on Linux and UNIX platforms for more than six years.






IBM, DB2, Lotus, Rational, Tivoli, and WebSphere are trademarks of IBM Corporation in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, or service names may be trademarks or service marks of others.