Dynamic Hypertext Scrapbook

Senior Project in Computer Science, Fall 1996

Abstract

My project is a library of functions that retrieve, cache, and manipulate the contents of web pages. These functions can be used to extract certain parts of pages (specific text, images, or other inline objects) and recombine these objects locally to create "scrapbook" pages that change as the source pages change. For example, the system can be used to combine news headlines, comics, and stories from varied websites into one page for a user.

Pulling elements from a page is based on analysis of the HTML that is specific to that page. To extend the flexibility of the system, I have also created a markup standard that page constructors can use to set off page elements based on content instead of syntax. This makes it easier to have flexible programs which pull items from different sites.

The project includes some sample modules that demonstrate grabbing text and images from different websites, as well as constructing requests that require additional headers and authorization.

The functions that retrieve pages and manipulate their HTML are written in TCL, as are the included modules for specific websites. The system also requires one external binary, included with the distribution and written in C. This binary uses public domain MD5 hashing code, also written in C.

Using the Library

I will illustrate how to use my library of functions to retrieve the popular comic, Dilbert.

The following code, found in modules.tcl, retrieves today's Dilbert, stores the image in a local .gif, and returns the filename of that .gif. Because the URL of the actual image changes every day, we need to retrieve the page that references the image, and then dig through the page for the desired URL to the image. A complete function reference can be found later in this document.


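proc dilbert {} {
    withpage "http://www.unitedmedia.com/comics/dilbert/" {
        # Path fragment that identifies the strip among the page's img tags.
        set dilbertRegexp "/servers/stripServer/comics/dt/"
        # Attributes of the img tag that matches that fragment.
        set dilbertComic [tag attributes [tag match [tag solo img $body] $dilbertRegexp]]
        # URL of the image, canonicalized against the page's URL.
        set dilbertSrc [request niceURL [tag attribute SRC $dilbertComic]]
        # Fetch the image and return the filename of the local copy.
        return [request binary [request get [request make $dilbertSrc {}]]]
    }
}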

withpage retrieves the page given by its argument and sets up some default variables. $body is the non-header portion of the content of the page returned by the web server. tag solo img $body returns a list of all non-pair img tags from the body of the page. tag match picks out the tag in that list that matches the given regular expression, so we can put the attributes of that img tag into $dilbertComic. tag attribute pulls out a given attribute, in this case SRC, from the retrieved list of attributes. request niceURL canonicalizes the path in the img's SRC attribute relative to the page given as the argument to withpage.

The last line of the procedure returns a path to a local copy of the Dilbert image. This happens as follows. First, request make, which takes a URL and a list of headers, creates a "Request" object for the image. This is handed off to request get, which retrieves the image from the remote server or the local cache and passes the result to request binary. request binary runs an external program, written in C, which puts the binary contents of the page into a file within the document space of the web server. The name of this file is a 32-character MD5 hash of its contents, which prevents duplication of identical files. The external program returns that filename, which the procedure then returns.

So, to use this program to include Dilbert in a web page, one would just need to do something like:

puts "<img src=[dilbert] alt=dilbert>"

Retrieving Dilbert needs no additional headers, but other pages might. These can be specified as an optional second argument to withpage. Most headers are plaintext, so adding them is as simple as something like

withpage "http://www.student.net/" {User-Agent Mozilla/7.0} {}

One frequently used additional header that isn't plaintext, however, is "Authorization". Basic Authorization with userid "userid" and password "password" requires a header line like

Authorization: Basic Credentials

where "Credentials" is the string "userid:password" in encoded form. The library's routines refer to this encoding as uuencoding; the HTTP specification defines it as base64. To facilitate access to authorization-protected pages, the function request auth is provided. It takes an authorization method, a userid, and a password as arguments and returns the corresponding Authorization header. (Only basic authorization is supported now.) It would be used like:

withpage "http://wopr.student.net:81/490/auth/time.cgi" [request auth basic userid pwd] {}
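
As a rough illustration of what a Basic Authorization header contains, the sketch below builds one in a recent Tcl. It is not the library's request auth: basicAuthHeader is a hypothetical name, and the sketch assumes Tcl 8.6 or later for binary encode base64. HTTP defines the credential encoding as base64, so that is what the sketch uses.

```tcl
# Hypothetical helper (not part of the Scrapbook library): build the header
# list for HTTP Basic authorization. Assumes Tcl 8.6+ for "binary encode".
proc basicAuthHeader {userid password} {
    # Credentials are the bytes of "userid:password", base64-encoded.
    set raw [encoding convertto utf-8 "${userid}:${password}"]
    set credentials [binary encode base64 $raw]
    # Return a {Header-Name Header-Value} list, the form withpage accepts.
    return [list Authorization "Basic $credentials"]
}

puts [basicAuthHeader userid password]
# prints: Authorization {Basic dXNlcmlkOnBhc3N3b3Jk}
```

The result has the same alternating name and value shape as the {User-Agent Mozilla/7.0} example, so it could be passed wherever a header list is expected.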

Markup Standard

One weakness of this scheme is that modules developed to grab a particular image or part of a page depend entirely on the site-specific HTML conventions of the target. Those conventions can change without warning, and when they do, the corresponding modules break.

To give the Scrapbook a way to work around this problem, I propose a markup standard based on content instead of syntax that will let modules operate without concern for the underlying arrangement of the HTML. The standard integrates seamlessly into existing web pages because it is in the form of HTML comments. Marking specific pieces of content on a page is as simple as surrounding them with HTML comments that look like


<!--dhs:type:name-->
<!--/dhs:type:name-->

where type is something like "headline," "story," or "alert," and name is related to the specific content surrounded; it could be "asparagus," "Bob Dole," or "interesting senior project," for example.

The Scrapbook Library contains three functions to use the information in these comments:

tag std type returns a list of each piece of text between comments with a given type string.
tag std name returns a list of each piece of text between comments with a given name string.
tag std typename returns the text between the comment with a given type string and a given name string.
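
To make the comment convention concrete, here is a minimal sketch of pulling out the text between a matching pair of comments. It is not the library's tag std code: dhsExtract is a hypothetical name, and the sketch assumes the comments appear exactly as written above (no extra whitespace) and are not nested.

```tcl
# Hypothetical helper (not the library's "tag std"): return a list of every
# piece of text enclosed by <!--dhs:type:name--> ... <!--/dhs:type:name-->.
proc dhsExtract {page type name} {
    set open  "<!--dhs:${type}:${name}-->"
    set close "<!--/dhs:${type}:${name}-->"
    set results {}
    set pos 0
    while {[set start [string first $open $page $pos]] >= 0} {
        set from [expr {$start + [string length $open]}]
        set end [string first $close $page $from]
        # An opening comment with no closing partner ends the scan.
        if {$end < 0} break
        lappend results [string range $page $from [expr {$end - 1}]]
        set pos [expr {$end + [string length $close]}]
    }
    return $results
}

set page {<h1><!--dhs:headline:asparagus-->Asparagus prices soar<!--/dhs:headline:asparagus--></h1>}
puts [lindex [dhsExtract $page headline asparagus] 0]
# prints: Asparagus prices soar
```

Plain string searching is used here instead of a regular expression so that type and name strings containing regexp metacharacters need no escaping.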

Implementation Issues

Caching

As manifested in request getURL, the Scrapbook tries to minimize network traffic by checking the date on files it has already retrieved and adding the If-Modified-Since header to outgoing requests for those pages. This task is made simpler by deriving the filename for a retrieved page directly from its URL: the leading protocol:// is stripped, and every / becomes _.
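
The two steps described above can be sketched as follows. This is an illustration, not the code in request getURL: cacheFileName and ifModifiedSince are hypothetical names, and the date format shown is the RFC 1123 style that HTTP servers accept.

```tcl
# Hypothetical helpers illustrating the caching scheme; the names are not
# the library's own.

# Map a URL to a cache file name: drop "protocol://", turn "/" into "_".
proc cacheFileName {url} {
    regsub {^[A-Za-z][A-Za-z0-9+.-]*://} $url {} rest
    return [string map {/ _} $rest]
}

# Build an If-Modified-Since header from a cached file's modification time.
proc ifModifiedSince {fname} {
    set stamp [clock format [file mtime $fname] \
        -format {%a, %d %b %Y %H:%M:%S GMT} -gmt 1]
    return [list If-Modified-Since $stamp]
}

puts [cacheFileName "http://www.unitedmedia.com/comics/dilbert/"]
# prints: www.unitedmedia.com_comics_dilbert_
```

A server that honors the header answers with status 304 and no body when the page is unchanged, so the cached copy can be reused.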

Additionally, because local image filenames are derived from their contents, the external program that splits the headers off of a binary page won't create multiple local copies of the same image. The same image will always hash to the same filename.

Usage of External Programs

The Scrapbook Library uses four external programs. Three are standard UNIX utilities. /bin/cat is used in dhs_util slurp to pull all data from a socket into a file. /bin/mv and /bin/rm are used in request getURL to move and remove the temporary files created in checking whether a page has changed since the last request for it.

The last external program, unheader, is also part of the Scrapbook system. Its source code is in unheader.c. This program creates the content-only (no headers) versions of binary files retrieved by the Scrapbook. Its first command line argument is the directory in which to place the new file. This should be in the web server's document tree. The second command line argument is the name of the file which contains the headers and content, and the third command line argument is the extension to give the new file. At its conclusion, it prints out one line, which is the pathname to the just-content file. This file's name is an MD5 hash of the file's contents.

Heavy Use of Regular Expressions

All of the HTML manipulation functions are big loops that iterate through matching a regexp on a given page, and then attempt to match the regexp on what wasn't part of the match in the last attempt. This provides a conceptually simple way to sequentially construct lists containing all instances of matches in a page. It also makes any HTML parsing the Scrapbook does heavily dependent on the efficiency of TCL's regexp routines. Additionally, understanding the HTML parsing code requires a fair comfort level with regular expressions.
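
The loop described above has roughly the following shape. This is a sketch of the technique, not the library's code: allMatches is a hypothetical name, and the img pattern in the example is only one plausible regexp for non-pair tags.

```tcl
# Hypothetical sketch of the match-then-continue loop; not the library's code.
# Returns the first parenthesized group of every match of pattern in text.
proc allMatches {pattern text} {
    set found {}
    while {[regexp -nocase -indices -- $pattern $text whole group]} {
        set wholeEnd [lindex $whole 1]
        # Stop on an empty match, which would never advance through the text.
        if {$wholeEnd < [lindex $whole 0]} break
        lappend found [string range $text [lindex $group 0] [lindex $group 1]]
        # Next round: search only what follows the match just found.
        set text [string range $text [expr {$wholeEnd + 1}] end]
    }
    return $found
}

set page {<p><img src=a.gif><IMG SRC=b.gif border=0></p>}
puts [allMatches {<img[[:space:]]+([^>]*)>} $page]
# prints: src=a.gif {SRC=b.gif border=0}
```

Each pass copies the remainder of the page into a new string, which is part of why performance depends so heavily on the regexp and string routines.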

Characters On Which To Split Headers/Content

Although the HTTP/1.0 standard clearly specifies that headers and content in a server's response are separated by an empty line ending in CRLF, that is, ASCII 13 followed by ASCII 10, it was educational to learn the wide variety of carriage return and linefeed sequences that different servers use for this purpose. This was frustrating initially, but after adjusting the code to take these variations into account, header-splitting proceeded smoothly.
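
One way to tolerate these variations is sketched below. It is an illustration, not the library's request splitPage: splitResponse is a hypothetical name, and the set of accepted line endings (CRLF, bare LF, bare CR) is an assumption about which variants are worth handling.

```tcl
# Hypothetical sketch; not the library's "request splitPage".
# Split a raw server response into {headers body} at the first blank line,
# accepting CRLF, bare LF, or bare CR as the line ending.
proc splitResponse {raw} {
    if {![regexp -indices {\r?\n\r?\n|\r\r} $raw sep]} {
        # No blank line found: treat the whole response as headers.
        return [list $raw ""]
    }
    set headers [string range $raw 0 [expr {[lindex $sep 0] - 1}]]
    set body    [string range $raw [expr {[lindex $sep 1] + 1}] end]
    return [list $headers $body]
}

puts [lindex [splitResponse "Content-Type: text/html\n\n<html></html>"] 1]
# prints: <html></html>
```

For binary content, the response would need to be read with binary translation so that bytes in the body that happen to look like line endings are left untouched.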

Included Modules

modules.tcl contains four sample procedures that illustrate different features of the Scrapbook. dilbert, discussed above, returns a local filename of today's Dilbert. yahooHeadlines takes a category name as an argument and returns a list of current headlines from Yahoo in that category. Currently, those categories are news, business, tech, international, sports, entertainment, politics, health, and weather. nandoTopStory returns the current top story from the News & Observer's Nando Times, and sampleAuth, which takes a username and a password as arguments, demonstrates access to pages protected by Basic authorization.

Future Directions

Extensions to the Scrapbook most immediately need to take the form of expanded coverage. This means both more modules that retrieve data from different websites as well as more (i.e. any) websites using the markup standard. This is best accomplished as a distributed effort, since it is hard for one person to divine which parts of which websites are interesting to large numbers of people. As different people use the Scrapbook, the library of modules will increase.

Additionally, the code needs to be trusted. Successful distribution of modules depends upon users trusting code that comes from arbitrary locations. TCL's "safe interpreters" are useful for preventing foreign code from doing nasty things. Unfortunately, some of those nasty things include writing to files and sockets, which the Scrapbook functions need to do. The ideal solution would be a safe interpreter that the user has configured to be allowed to write to certain temporary and cache directories so the library can work its magic.
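
A minimal sketch of that idea, using Tcl's interp create -safe and interp alias, is shown below. It is an assumption about how such a configuration might look, not part of the Scrapbook: makeSandbox, sandboxWrite, and cacheWrite are hypothetical names, and a real version would also need a similarly restricted way to open sockets.

```tcl
# Hypothetical sketch; the names makeSandbox, sandboxWrite, and cacheWrite
# are illustrative and not part of the Scrapbook library.

# Write data to a file inside cacheDir, refusing anything but a plain name.
proc sandboxWrite {cacheDir name data} {
    # Only simple names are allowed, so "../x", "/etc/x", and "~x" fail.
    if {![regexp {^[A-Za-z0-9_][A-Za-z0-9_.-]*$} $name]} {
        error "illegal cache file name: $name"
    }
    set path [file join $cacheDir $name]
    set fd [open $path w]
    fconfigure $fd -translation binary
    puts -nonewline $fd $data
    close $fd
    return $path
}

# Create a safe interpreter whose only way to touch the disk is cacheWrite.
proc makeSandbox {cacheDir} {
    set child [interp create -safe]
    interp alias $child cacheWrite {} sandboxWrite $cacheDir
    return $child
}

# Example use (untrusted module code runs inside the child):
#   set sandbox [makeSandbox /tmp]
#   interp eval $sandbox {cacheWrite page.html "<html></html>"}
```

The alias runs in the trusted parent interpreter with the cache directory fixed in advance, so foreign code can choose only the file name and the data.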

Installation Notes

The following two variables need to be adjusted at the top of request.tcl:

set dhs_retrievePrefix "/tmp"
set dhs_unheaderProg "/home/sklar/www/490/unheader"

dhs_retrievePrefix should be set to the place where all retrieved web pages will be stored. /tmp is probably a good choice. dhs_unheaderProg should be set to the path to unheader. You make unheader by typing "make unheader" in the distribution directory. After that, move it to wherever you want it to be. Outside of the web server's document tree is a good choice; note that it doesn't have to be in a cgi-bin directory. Finally, it needs to be setuid to some userid that can write to the directory where it will be depositing image files.

Function Reference

The Scrapbook Library introduces four new commands to TCL.

tag deals with parsing HTML.
withpage provides an easy way to accomplish most page manipulation needs.
request deals with retrieving pages and parsing URLs.
dhs_util performs some internal functions used by the other commands.

Additionally, the Library code creates three global variables: dhs_retrievePrefix and dhs_unheaderProg as described in Installation Notes and dhs_uuTable, which is used by the uuencoding routines.

tag option arg ?arg ...?

tag pair content tagname page: Searches for all paired tags tagname, i.e., <tagname>...</tagname>, in the string page. If content is "text", the function returns a list of text between each <tagname></tagname> pair. If content is "attributes", the function returns a list of the attributes inside the <tagname> tag. If there is only one tag, the function converts the attributes to array form with tag attributes.
tag list tagname page: Searches for text associated with single tags like <li>. It returns a list of text blocks in page delimited by tagname.
tag solo tagname page: Searches for non-pair tagnames that have no text associated with them, like <img>. It returns a list, where each element is the attributes of a <tagname> in page.
tag attributes attrs: Converts the string of element attributes that gets returned by something like tag solo into a listified array. I.e. if attrs is "src=/nice/image.gif border=0", tag attributes returns the list "src /nice/image.gif border 0", which can be converted back to an array with array set.
tag match taglist regexp: Returns the first tag in taglist whose text matches the regexp regexp.
tag attribute whichattr attrlist: attrlist is a list returned by tag attributes. whichattr is something like "SRC" or "BORDER". The function returns the value in attrlist that corresponds to the index whichattr.
tag std type typeval: Returns a list of each piece of text between comments with type string typeval.
tag std name nameval: Returns a list of each piece of text between comments with name string nameval.
tag std typename typeval nameval: Returns the text between the comment with a type string typeval and a name string nameval.
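
The conversion that tag attributes performs, and the lookup that tag attribute performs, can be sketched as follows. These are not the library's implementations: attrsToList and attrValue are hypothetical names, the handling of quoted values is an assumption, and so is the case-insensitive lookup.

```tcl
# Hypothetical sketches; not the library's "tag attributes" / "tag attribute".

# Turn a string like {src=/nice/image.gif border=0} into a name/value list.
proc attrsToList {attrs} {
    set result {}
    # name=value, where value is "double quoted", 'single quoted', or bare.
    set pattern {([A-Za-z][A-Za-z0-9_-]*)[[:space:]]*=[[:space:]]*("[^"]*"|'[^']*'|[^[:space:]>]+)}
    foreach {whole name value} [regexp -all -inline -- $pattern $attrs] {
        # Strip surrounding quote characters from the value, if any.
        lappend result $name [string trim $value "\"'"]
    }
    return $result
}

# Look up one attribute by name, ignoring case (an assumption of this sketch).
proc attrValue {which attrlist} {
    foreach {name value} $attrlist {
        if {[string equal -nocase $name $which]} { return $value }
    }
    return ""
}

puts [attrsToList "src=/nice/image.gif border=0"]
# prints: src /nice/image.gif border 0
```

Because the result is a flat name/value list, array set img [attrsToList $attrs] turns it into an array, as described for tag attributes above.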

withpage basePage ?headers? code

Retrieves the url basePage and sends along the optional headers headers with the request, then eval's the block of code code. Also sets up the following variables for use while executing code:

myRequest: the Request object for retrieving basePage with headers headers
myPage: the returned Page object from retrieving basePage.
headers: the headers from myPage.
body: the body from myPage.

request option ?arg ...?

request make url options: Return a Request object for url url and optional headers options. options takes the form of a list of alternating "Header Name", "Header Value" elements, like {Set-Cookie Monster User-Agent Mothra}.
request get request: Return the Page object that results from retrieving the Request object request.
request part page part: Return part of the Page object page. If part is "headers", the function returns the headers that resulted from making the given HTTP request in an array. If part is "body", the function returns the content of the HTTP server response. If part is "fname", the function returns the local filename of the file that contains both headers and content.
request cache fname: Return the Page object associated with the Request cached in the local file fname.
request binary page: Process the Page object page whose content is binary. The function writes the binary content of page to a file in the web server's document space. The filename is a hash of the file's contents. The function returns the filename.
request niceURL ?baseURL? path: Canonicalize the URL in path relative to baseURL. If baseURL is not provided, the function assumes it is being called within a codeblock being executed by withpage and path is canonicalized relative to the value basePage one stack frame up.
request auth type username password: Returns the Authorization header associated with the authorization type type, username username, and password password. Right now, only basic authorization is supported.
request MIMEDecode MIME-Type: Returns a list whose first element is the major MIME type of MIME-Type (e.g. "text") and whose second element is the minor MIME type of MIME-Type (e.g. "html").

(The following functions are not meant to be called by users.)

request getURL host port path headers: Does the caching and checking of "If-Modified-Since" headers to ensure minimal network traffic but the newest versions of requested web objects.
request transfer host port path headers fname: Sets up the sockets and file descriptors for retrieving a web object into a file and sends the actual request over the network to the server.
request splitURL url: Parses a URL into its component protocol, host, port, and path parts.
request niceHeaders headers: Parses the sequence of headers as returned from the server into a nice array.
request splitPage: Splits a page as returned from the server into headers and body.
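
For illustration, URL splitting and MIME type decoding of the kind described above might look like the sketch below. These are not the library's request splitURL or request MIMEDecode: the procedure names are hypothetical, the default port of 80 assumes HTTP, and ignoring parameters such as charset is an assumption.

```tcl
# Hypothetical sketches; not the library's "request splitURL" or
# "request MIMEDecode".

# Split a URL into a list of {protocol host port path}.
proc splitURLSketch {url} {
    set pattern {^([A-Za-z][A-Za-z0-9+.-]*)://([^/:]+)(:([0-9]+))?(/.*)?$}
    if {![regexp -- $pattern $url whole proto host portPart port path]} {
        error "cannot parse URL: $url"
    }
    # Fill in defaults when the URL omits them (80 assumes HTTP).
    if {$port eq ""} { set port 80 }
    if {$path eq ""} { set path / }
    return [list $proto $host $port $path]
}

# Split a MIME type into {major minor}, ignoring parameters like charset.
proc mimeDecodeSketch {mimeType} {
    set bare [string trim [lindex [split $mimeType ";"] 0]]
    return [split $bare /]
}

puts [splitURLSketch "http://wopr.student.net:81/490/auth/time.cgi"]
# prints: http wopr.student.net 81 /490/auth/time.cgi
```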

dhs_util option ?arg ...?

dhs_util getFile fname: Returns the text of the file fname. The file descriptor is configured for binary translation.
dhs_util slurp in out: Passes file descriptor in as standard in and file descriptor out as standard out to /bin/cat.
dhs_util encode str: Returns the result of uuencoding str.
dhs_util ENC i: Returns the character in the ith position in dhs_uuTable, a global variable used in uuencoding basic authorization credential strings.

sklar.com