Senior Project in Computer Science, Fall 1996
Abstract
My project is a library of functions that retrieve, cache,
and manipulate the contents of web pages. These functions can be used
to extract certain parts of pages (specific text, images, or other
inline objects) and recombine these objects locally to create
"scrapbook" pages that change as the source pages change. For example,
the system can be used to combine news headlines, comics, and stories
from varied websites into one page for a user.
Pulling elements from a page is based on analysis of
the HTML that is specific to that page. To extend the flexibility of
the system, I have also created a markup standard that page
constructors can use to set off page elements based on content instead
of syntax. This makes it easier to have flexible programs which pull
items from different sites.
The project includes some sample modules that demonstrate
grabbing text and images from different websites, as well as
constructing requests that require additional headers and
authorization.
The functions that retrieve pages and manipulate their HTML are
written in TCL, as are the included modules for specific websites.
The system also requires one external binary, included in the
distribution and written in C. This binary uses public domain MD5
hashing code, also written in C.
Using the Library
I will illustrate how to use my library of functions to
retrieve the popular comic, Dilbert.
The following code, found in modules.tcl, retrieves
today's Dilbert, stores the image in a local .gif, and
returns the filename of that .gif. Because the URL of the
actual image changes every day, we need to retrieve the page that
references the image, and then dig through the page for the desired
URL to the image. A complete function reference can be found later in this document.
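proc dilbert {} {
    withpage "http://www.unitedmedia.com/comics/dilbert/" {
        # Pattern that identifies the comic among the page's img tags
        set dilbertRegexp "/servers/stripServer/comics/dt/"
        # Attributes of the img tag that matches the pattern
        set dilbertComic [tag attributes [tag match [tag solo img $body] $dilbertRegexp]]
        # URL of the image, canonicalized against the page's URL
        set dilbertSrc [request niceURL [tag attribute SRC $dilbertComic]]
        # Fetch the image and return the name of its local copy
        return [request binary [request get [request make $dilbertSrc {}]]]
    }
}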
withpage retrieves the page given by its argument and
sets up some default variables. $body is the non-header
portion of the content of the page returned by the web server. tag
solo img returns a list of all non-pair img tags from the
body of the page. tag match returns a list of all tags that
match the given regular expression. In this case, this amounts to one
tag, so we can put the attributes of the img tag into
$dilbertComic. tag attribute pulls out a given
attribute, in this case SRC, from the retrieved list of
attributes. request niceURL canonicalizes the path in the
img's SRC attribute relative to the page given as the
argument to withpage.
The last line of the procedure returns a path to a local copy of
the dilbert image. This happens as follows: first, request
make, which takes a URL and a list of headers, creates a
"Request" object for the image. This is handed off to request
get, which actually retrieves the image from the remote server or
local cache and returns its local filename to request binary,
which runs an external program, written in C, which puts the binary
contents of the page into a file within the document space of the
webserver. The name of this file is a 32-character MD5 hash of its
contents, which prevents duplication of identical files. This external
program, in turn, returns that filename, which the procedure then
returns.
So, to use this program to include Dilbert in a web page, one
would just need to do something like:
puts "<img src=[dilbert] alt=dilbert>"
Retrieving Dilbert needs no additional headers, but other pages
might. These can be specified as an optional second argument to
withpage. Most headers are plaintext, so adding them is as
simple as something like
withpage "http://www.student.net/" {User-Agent Mozilla/7.0} {}
One frequently used additional header that isn't plaintext, however,
is "Authorization". Basic Authorization with userid "userid" and
password "password" requires a header line like
Authorization: Basic Credentials
where "Credentials" is the string "userid:password" base64-encoded (called uuencoding in the rest of this document). To
facilitate access to authorization-protected pages, the function
request auth is provided. It takes an authorization method, a
userid, and a password as arguments and returns the corresponding
credential string. (Only basic authorization is supported now.) It
would be used like:
withpage "http://wopr.student.net:81/490/auth/time.cgi" [request auth userid pwd] {}
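As a rough illustration of what building such a credential string involves, the sketch below produces a Basic Authorization header as a name/value list in the form withpage accepts for its headers argument. It is not the library's request auth: the procedure name basicAuthHeader is hypothetical, and it uses binary encode base64 (which assumes Tcl 8.6 or later) in place of the library's own encoding table.

```tcl
# Hypothetical stand-in for "request auth", not the Scrapbook code.
# Assumes Tcl 8.6+ for "binary encode base64".
# Returns a {Header-Name Header-Value} list.
proc basicAuthHeader {userid password} {
    # HTTP Basic authentication sends "userid:password" in base64 form.
    set credentials [binary encode base64 "${userid}:${password}"]
    return [list Authorization "Basic $credentials"]
}

puts [basicAuthHeader userid password]
```

The returned list could then be passed wherever an alternating header name, header value list is expected.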
Markup Standard
One weakness of this scheme is that modules developed to grab a
particular image or part of a page are entirely dependent on the
site-specific HTML conventions of the target. If those conventions
change, they will change without warning and they will break the
corresponding modules.
To give the Scrapbook a way to work around this problem, I propose
a markup standard based on content instead of syntax that will let
modules operate without concern for the underlying arrangement of the
HTML. The standard integrates seamlessly into existing web pages
because it is in the form of HTML comments. Marking specific pieces of
content on a page is as simple as surrounding them with HTML comments
that look like
<!--dhs:type:name-->
<!--/dhs:type:name-->
where type is something like "headline," "story," or "alert,"
and name is related to the specific content surrounded. It
could be "asparagus," "Bob Dole," or "interesting senior project,"
for example.
The Scrapbook Library contains three functions to use the
information in these comments:
- tag std type returns a list of each piece of text
between comments with a given type string.
- tag std name returns a list of each piece of text
between comments with a given name string.
- tag std typename returns the text between the comment
with a given type string and a given name string.
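To make the comment convention concrete, here is a minimal sketch of how text between matching comments of one type could be collected. It is an illustration under assumptions, not the library's tag std type: the name dhsTextByType is hypothetical, and plain string searching is used instead of the library's regular expressions to keep the example easy to follow.

```tcl
# Hypothetical helper, not part of the Scrapbook Library: collect the text
# between every <!--dhs:TYPE:name--> ... <!--/dhs:TYPE:name--> pair.
proc dhsTextByType {page type} {
    set results {}
    set open "<!--dhs:${type}:"
    set pos 0
    while {[set start [string first $open $page $pos]] >= 0} {
        # The name runs from the end of the opening prefix to the next "-->".
        set nameStart [expr {$start + [string length $open]}]
        set nameEnd [string first "-->" $page $nameStart]
        if {$nameEnd < 0} break
        set name [string range $page $nameStart [expr {$nameEnd - 1}]]
        # Look for the closing comment that carries the same type and name.
        set close "<!--/dhs:${type}:${name}-->"
        set textStart [expr {$nameEnd + 3}]
        set textEnd [string first $close $page $textStart]
        if {$textEnd < 0} break
        lappend results [string range $page $textStart [expr {$textEnd - 1}]]
        # Continue searching after the closing comment.
        set pos [expr {$textEnd + [string length $close]}]
    }
    return $results
}

set samplePage {<h1><!--dhs:headline:asparagus-->Asparagus prices soar<!--/dhs:headline:asparagus--></h1>
<p><!--dhs:story:asparagus-->Farmers rejoice.<!--/dhs:story:asparagus--></p>
<h1><!--dhs:headline:Bob Dole-->Dole speaks<!--/dhs:headline:Bob Dole--></h1>}
puts [dhsTextByType $samplePage headline]
```

Because the markers are ordinary HTML comments, a browser showing the sample page displays only the headline and story text.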
Implementation Issues
Caching
As manifested in request getURL, the Scrapbook tries to minimize
network traffic by checking the date on files it has already retrieved
and adding the If-Modified-Since header to outgoing requests for
those pages. This task is made simpler by deriving the filename for
a retrieved page from its URL: the leading protocol:// is stripped,
and every / becomes a _.
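The naming rule and the conditional request can be sketched as follows. These helpers are illustrative only: the names cacheFileName, httpDate, and conditionalHeaders are hypothetical and are not taken from request getURL, and using the cached file's modification time as the date is an assumption about how the check could work.

```tcl
# Hypothetical helpers illustrating the caching scheme described above;
# the names are not taken from the Scrapbook source.

# Map a URL to a cache filename: drop "protocol://", turn "/" into "_".
proc cacheFileName {url} {
    regsub {^[A-Za-z][A-Za-z0-9+.-]*://} $url "" rest
    return [string map {/ _} $rest]
}

# Format a time (seconds since the epoch) as an HTTP date in GMT.
proc httpDate {seconds} {
    return [clock format $seconds -format "%a, %d %b %Y %H:%M:%S GMT" -gmt 1]
}

# Build the extra header for a cached copy, or an empty list if none exists.
proc conditionalHeaders {prefix url} {
    set fname [file join $prefix [cacheFileName $url]]
    if {![file exists $fname]} {
        return {}
    }
    return [list If-Modified-Since [httpDate [file mtime $fname]]]
}

puts [cacheFileName "http://www.unitedmedia.com/comics/dilbert/"]
```

A server that honors If-Modified-Since can answer with a short "not modified" response instead of resending the page, which is where the traffic savings come from.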
Additionally, by making local image filenames a consequence of
their contents, the external program that splits the headers off of a
binary page won't create many local copies of the same image. The same
image will always hash to the same filename.
Usage of External Programs
The Scrapbook Library uses four external programs. Three are
standard UNIX utilities. /bin/cat is used in
dhs_util slurp, to pull all data from a socket into a
file. /bin/mv and /bin/rm are used in
request getURL to remove temporary files created in checking
whether a page has changed since the last request for it.
The last external program, unheader, is also part of the
Scrapbook system. Its source code is in unheader.c. This
program creates the content-only (no headers) versions of binary files
retrieved by the Scrapbook. Its first command line argument is the
directory in which to place the new file. This should be in the web
server's document tree. The second command line argument is the name
of the file that contains the headers and content, and the third
command line argument is the extension to give the new file. At its
conclusion, it prints out one line, which is the pathname to the
just-content file. This file's name is an MD5 hash of the file's contents.
Heavy Use of Regular Expressions
All of the HTML manipulation functions are big loops that iterate
through matching a regexp on a given page, and then attempt to
match the regexp on what wasn't part of the match in the last
attempt. This provides a conceptually simple way to sequentially
construct lists containing all instances of matches in a page. It also
makes any HTML parsing the Scrapbook does heavily dependent on the
efficiency of TCL's regexp routines. Additionally, understanding the
HTML parsing code requires a fair comfort level with regular
expressions.
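The match-then-continue loop described above can be sketched in a few lines. This is a simplified illustration rather than the Scrapbook's own code; the procedure name allMatches is hypothetical.

```tcl
# Hypothetical illustration of the loop: match the regexp, record the match,
# then try again on whatever followed the match.
proc allMatches {re text} {
    set matches {}
    while {[regexp -indices -- $re $text span]} {
        lassign $span first last
        # An empty match would never advance, so stop instead of looping.
        if {$last < $first} break
        lappend matches [string range $text $first $last]
        set text [string range $text [expr {$last + 1}] end]
    }
    return $matches
}

puts [allMatches {<img[^>]*>} {<p><img src=a.gif> text <img src=b.gif border=0></p>}]
```

Later versions of TCL can collect the same list in one call with regexp -all -inline, but the explicit loop shows the technique the paragraph describes.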
Characters On Which To Split Headers/Content
Although the HTTP/1.0
standard clearly specifies that the delimiter between headers and
content in a server's response should be CRLF, that is, ASCII 13
followed by ASCII 10, it was educational to learn how wide a variety of
carriage-return and linefeed sequences different servers use for this
purpose. This was frustrating initially, but after adjusting the code
to take into account these variations, header-splitting proceeded
smoothly.
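A tolerant split along these lines might look like the sketch below. The set of delimiters it accepts (a blank line ended by CRLF, by a bare linefeed, or by a bare carriage return) is an assumption for illustration; the text does not list which variations the Scrapbook code handles, and the name splitResponse is hypothetical.

```tcl
# Hypothetical sketch: split a raw server response into headers and body at
# the first blank line, accepting CRLF, bare LF, or bare CR line endings.
proc splitResponse {response} {
    if {![regexp -indices {\r?\n\r?\n|\r\r} $response span]} {
        # No blank line found: treat everything as headers.
        return [list $response ""]
    }
    lassign $span first last
    set headers [string range $response 0 [expr {$first - 1}]]
    set body [string range $response [expr {$last + 1}] end]
    return [list $headers $body]
}

set reply "HTTP/1.0 200 OK\nContent-Type: text/html\n\n<html></html>"
puts [lindex [splitResponse $reply] 1]
```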
Included Modules
modules.tcl contains four sample procedures that illustrate
different features of the Scrapbook. dilbert, discussed
above, returns a local filename of today's
Dilbert. yahooHeadlines takes a category name as an argument
and returns a list of current headlines from Yahoo in that
category. Currently, those categories are news, business, tech,
international, sports, entertainment, politics, health, and weather.
nandoTopStory returns the current top story from the News
& Observer's Nando Times. sampleAuth, which takes a
username and a password as arguments, demonstrates access to pages
protected by Basic authorization.
Future Directions
Extensions to the Scrapbook most immediately need to take the form
of expanded coverage. This means both more modules that retrieve data
from different websites as well as more (i.e. any) websites using the
markup standard. This is best accomplished as a distributed effort,
since it is hard for one person to divine which parts of which
websites are interesting to large numbers of people. As different
people use the Scrapbook, the library of modules will increase.
Additionally, the code needs to be trusted. Successful distribution of
modules depends upon users trusting code that comes from arbitrary
locations. TCL's "safe interpreters" are useful for preventing foreign
code from doing nasty things. Unfortunately, some of those nasty
things include writing to files and sockets, which the Scrapbook
functions need to do. The ideal solution would be a safe interpreter
that the user has configured to be allowed to write to certain
temporary and cache directories so the library can work its magic.
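One way such a configuration could look is sketched below, using a safe interpreter plus a single alias that writes only into a chosen cache directory. This is an assumption about a possible design, not something the Scrapbook implements: the directory /tmp/scrapbook-cache and the command name cacheWrite are hypothetical.

```tcl
# Hypothetical sketch: a safe interpreter whose only way to write a file is
# an alias that the trusted side restricts to one cache directory.
set cacheDir [file join /tmp scrapbook-cache]
file mkdir $cacheDir

proc cacheWrite {dir name data} {
    # Accept plain filenames only: no path separators, "..", or "~".
    if {![regexp {^[A-Za-z0-9_][A-Za-z0-9._-]*$} $name]} {
        error "refusing cache filename \"$name\""
    }
    set chan [open [file join $dir $name] w]
    puts -nonewline $chan $data
    close $chan
    return $name
}

set sandbox [interp create -safe]
# The directory is fixed here, so untrusted code supplies only name and data.
interp alias $sandbox cacheWrite {} cacheWrite $cacheDir

# Untrusted module code can use the alias...
interp eval $sandbox {cacheWrite page.html "<html></html>"}
# ...but has no direct access to open.
set blocked [catch {interp eval $sandbox {open /etc/passwd r}}]
puts "direct open blocked: $blocked"
```

Socket access for retrieving pages would need a similarly restricted alias; the sketch covers only the file-writing half of the problem.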
Installation Notes
The following two variables need to be adjusted at the top of
request.tcl:
set dhs_retrievePrefix "/tmp"
set dhs_unheaderProg "/home/sklar/www/490/unheader"
dhs_retrievePrefix should be set to the place where all retrieved
web pages will be stored. /tmp is probably a good choice.
dhs_unheaderProg should be set to the path to
unheader. You make unheader by typing "make unheader" in the
distribution directory. After you move it to wherever you want it to
be, it needs to be setuid to some userid that can write to the
directory where it will be depositing image files. Outside of the web
server's document tree is a good choice for its location. Note that
it doesn't have to be in a cgi-bin directory.
Function Reference
The Scrapbook Library introduces four new commands to TCL.
- tag deals with parsing HTML.
- withpage provides an easy way to
accomplish most page manipulation needs.
- request deals with retrieving pages and
parsing URLs.
- dhs_util performs some internal
functions used by the other commands.
Additionally, the Library code creates three global variables:
dhs_retrievePrefix and dhs_unheaderProg, as described
in Installation Notes, and dhs_uuTable,
which is used by the uuencoding routines.
tag option arg ?arg ...?
- tag pair content tagname page
- Searches for all paired tags tagname, i.e.,
<tagname>...</tagname>, in the string page. If
content is "text", the function returns a list of text between
each <tagname></tagname> pair. If content is
"attributes", the function returns a list of the attributes inside the
<tagname> tag. If there is only one tag, the function converts
the attributes to array form with tag attributes.
- tag list tagname page
- Searches for text associated with single tags like <li>. It
returns a list of text blocks in page delimited by tagname.
- tag solo tagname page
- Searches for non-pair tagnames that have no text associated with
them, like <img>. It returns a list, where each element is the
attributes of a <tagname> in page.
- tag attributes attrs
- Converts the string of element attributes that gets returned by
something like tag solo into a listified array. For example, if attrs is
"src=/nice/image.gif border=0", tag attributes returns the
list "src /nice/image.gif border 0", which can be converted back to an
array with array set.
- tag match taglist regexp
- Returns the first tag in taglist whose text matches the
regexp regexp.
- tag attribute whichattr attrlist
- attrlist is a list returned by tag
attributes. whichattr is something like "SRC" or
"BORDER". The function returns the value in attrlist that
corresponds to the index whichattr.
- tag std type typeval
- Returns a list of each piece of text between comments with type
string typeval.
- tag std name nameval
- Returns a list of each piece of text between comments with name
string nameval.
- tag std typename typeval nameval
- Returns the text between the comment with a type string typeval
and a name string nameval.
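As an illustration of the conversion that tag attributes is described as performing, the sketch below turns an attribute string into an alternating name/value list. It is a guess at the behavior, not the library's code: the name parseAttributes is hypothetical, and the handling of double-quoted values is an assumption, since the reference only shows unquoted ones.

```tcl
# Hypothetical sketch of an attribute-string parser; not the library's
# "tag attributes". Handles name=value and name="quoted value" pairs.
proc parseAttributes {attrs} {
    set result {}
    set re {([A-Za-z][A-Za-z0-9-]*)=("[^"]*"|[^ \t\r\n">]+)}
    foreach {whole name value} [regexp -all -inline $re $attrs] {
        # Strip the surrounding double quotes from quoted values.
        lappend result $name [string trim $value \"]
    }
    return $result
}

# The resulting list can be loaded into an array with array set.
array set img [parseAttributes "src=/nice/image.gif border=0"]
puts $img(src)
```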
withpage basePage ?headers? code
Retrieves the URL basePage, sending the optional headers
headers along with the request, then evaluates the block of code
code. Also sets up the following variables for use while
executing code:
- myRequest: the Request object for retrieving basePage
with headers headers
- myPage: the returned Page object from retrieving basePage.
- headers: the headers from myPage.
- body: the body from myPage.
request option ?arg ...?
- request make url options
- Return a Request object for url url and optional headers
options. options takes the form of a list of alternating
"Header Name", "Header Value" elements, like {Set-Cookie Monster
User-Agent Mothra}.
- request get request
- Return the Page object that results from retrieving the Request
object request.
- request part page part
- Return part of the Page object page. If part is
"headers", the function returns the headers that resulted from making
the given HTTP request in an array. If part is "body", the
function returns the content of the HTTP server response. If
part is "fname", the function returns the local filename of the
file that contains both headers and content.
- request cache fname
- Return the Page object associated with the Request cached in the
local file fname.
- request binary page
- Process the Page object page whose content is binary. The
function writes the binary content of page to a file in the web
server's document space. The filename is a hash of the file's
contents. The function returns the filename.
- request niceURL ?baseURL? path
- Canonicalize the URL in path relative to baseURL. If
baseURL is not provided, the function assumes it is being
called within a codeblock being executed by withpage and path is canonicalized
relative to the value basePage one stack frame up.
- request auth type username password
- Returns the Authorization header associated with the
authorization type type, username username, and password
password. Right now, only basic authorization is supported.
- request MIMEDecode MIME-Type
- Returns a list whose first element is the major MIME type of
MIME-Type (e.g. "text") and whose second element is the minor
MIME type of MIME-Type (e.g. "html").
(The following functions are not meant to be called by users.)
- request getURL host port path headers
- Does the caching and checking of "If-Modified-Since" headers to
keep network traffic minimal while still returning the newest versions
of requested web objects.
- request transfer host port path headers fname
- Sets up the sockets and file descriptors for retrieving a web
object into a file and sends the actual request over the network to
the server.
- request splitURL url
- Parses a URL into its component protocol, host, port, and path
parts.
- request niceHeaders headers
- Parses the sequence of headers as returned from the server into a
nice array.
- request splitPage
- Splits a page as returned from the server into headers and body.
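For readers curious what a URL split of the kind request splitURL performs might involve, here is a minimal sketch. It is not the library's implementation: the name splitURLSketch is hypothetical, and the defaults of port 80 and path "/" when those parts are absent are assumptions.

```tcl
# Hypothetical sketch of URL splitting; not the library's "request splitURL".
# Returns a list: protocol host port path.
proc splitURLSketch {url} {
    set re {^([A-Za-z][A-Za-z0-9+.-]*)://([^/:]+)(:([0-9]+))?(/.*)?$}
    if {![regexp $re $url whole protocol host portPart port path]} {
        error "cannot parse URL \"$url\""
    }
    # Assumed defaults when the URL leaves these parts out.
    if {$port eq ""} { set port 80 }
    if {$path eq ""} { set path / }
    return [list $protocol $host $port $path]
}

puts [splitURLSketch "http://wopr.student.net:81/490/auth/time.cgi"]
```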
dhs_util option ?arg ...?
- dhs_util getFile fname
- Returns the text of the file fname. The file descriptor is
configured for binary translation.
- dhs_util slurp in out
- Passes file descriptor in as standard in and file
descriptor out as standard out to /bin/cat.
- dhs_util encode str
- Returns the result of uuencoding str.
- dhs_util ENC i
- Returns the character in the ith position in
dhs_uuTable, a global variable used in uuencoding basic
authorization credential strings.