NetScanner: Protocols and Diagnostics

NetScanner handles the following document types and links.

Document (http/file)
Absolute links
Relative links
ftp links

Document Types

NetScanner accepts remote as well as local documents for analysis. The document may be any URL of type http or file. Examples are

  1. http://www.geocities.com/SiliconValley/Vista/6222/: The default HTML file for this URL (which is set by the server, in this case www.geocities.com) will be loaded by NetScanner.
  2. http://www.geocities.com/SiliconValley/Vista/6222/links.html: The document may also point to an HTML file on the internet.
  3. file:/users/some_user/HomePage/index.html (or equivalent in Windows 95/NT): In this case, the specified file is loaded from the localhost (the user's system).

In (2) and (3) above, anchors, search strings, or redirections may also be specified.

Refer to A Guide to URLs for a comprehensive description of URLs.
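The parts of a document URL described above can be illustrated with a short Python sketch. It is not NetScanner code; the function name describe_url and the search string and anchor in the usage example are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a document URL into the parts discussed in this section.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,    # "http" or "file"
        "host": parts.netloc,      # empty for file: URLs like the one above
        "path": parts.path,
        "search": parts.query,     # the search string after "?"
        "anchor": parts.fragment,  # the anchor after "#"
    }


if __name__ == "__main__":
    # The "?q=java#top" suffix is an invented example, not from NetScanner.
    print(describe_url(
        "http://www.geocities.com/SiliconValley/Vista/6222/links.html?q=java#top"))
    print(describe_url("file:/users/some_user/HomePage/index.html"))
```

For the file URL, the scheme is "file", the host is empty, and the path is the local file path.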

Absolute Links

Absolute http links are checked by opening a network connection to the web server specified in the link URL. NetScanner does not actually download the file, but only checks for a response from the server. The response codes are specified in RFC 1945. An HTML version of this RFC may be found at http://www.ics.uci.edu/pub/ietf/http/rfc1945.html. For a brief description, see HTTP response messages.
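One way to check a link without downloading the file is an HTTP HEAD request, which returns only the status line and headers. The Python sketch below shows that approach. Whether NetScanner itself uses HEAD is an assumption, and check_http_link and its connection_factory parameter are hypothetical names.

```python
import http.client
from urllib.parse import urlsplit


def check_http_link(url, timeout=10.0,
                    connection_factory=http.client.HTTPConnection):
    """Return (response_code, reason_phrase) for an absolute http link.

    Sketch only: sends HEAD so that no document body is transferred.
    connection_factory exists so the function can be tried with a stub.
    """
    parts = urlsplit(url)
    conn = connection_factory(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query  # search strings travel with the request
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.reason
    finally:
        conn.close()
```

A call such as check_http_link("http://www.geocities.com/SiliconValley/Vista/6222/") would return the server's response code and reason-phrase as a pair. Network failures raise exceptions, which a caller would map to the messages listed under Miscellaneous Errors.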

Relative Links

A relative link such as <a href="welcome.html">Welcome</a> does not make a complete URL. The location of the file that contains the link is used to resolve it into an absolute link. For example, if the link <a href="welcome.html">Welcome</a> appears in the document http://www.welcome.org/index.html, the location http://www.welcome.org/ is used to resolve the complete URL, http://www.welcome.org/welcome.html.
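The resolution step can be reproduced with Python's standard urljoin function, as in the sketch below. It shows the general rule, not NetScanner's own code; resolve_link is a hypothetical name.

```python
from urllib.parse import urljoin


def resolve_link(document_url, href):
    """Resolve href against the URL of the document that contains it."""
    return urljoin(document_url, href)


if __name__ == "__main__":
    print(resolve_link("http://www.welcome.org/index.html", "welcome.html"))
    # prints http://www.welcome.org/welcome.html
```

An href that is already absolute is returned unchanged, so the same function can be applied to every link in a document.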

FTP Links

To check an ftp link, NetScanner first attempts an anonymous login with a dummy password (by default). If this is successful, and if the link specifies a filepath, NetScanner checks whether the filepath exists on the ftp server.
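The two steps above (anonymous login, then a check on the filepath) can be sketched with Python's ftplib. This is an illustration under stated assumptions, not NetScanner's implementation: check_ftp_link and ftp_factory are hypothetical names, and the existence test (try the path as a directory, then as a file via SIZE) is one possible choice.

```python
from ftplib import FTP, all_errors, error_perm
from urllib.parse import unquote, urlsplit


def check_ftp_link(url, timeout=10.0, ftp_factory=FTP):
    """Return a diagnostic message for an ftp link (sketch only)."""
    parts = urlsplit(url)
    try:
        ftp = ftp_factory(parts.hostname, timeout=timeout)
    except all_errors:
        return "Error while connecting"
    try:
        try:
            ftp.login()  # ftplib defaults to user "anonymous" and a dummy password
        except all_errors:
            return "Login failed"
        path = unquote(parts.path)
        if path in ("", "/"):
            return "OK"  # the link names only the server
        try:
            ftp.cwd(path)  # succeeds when the path is a directory
            return "OK"
        except error_perm:
            pass
        try:
            ftp.size(path)  # succeeds for a file on servers that support SIZE
            return "OK"
        except error_perm:
            return "File not found"
    finally:
        try:
            ftp.quit()
        except all_errors:
            ftp.close()
```

The ftp_factory parameter only exists so the function can be exercised with a stub instead of a live server.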

For a description of the diagnostic response messages, see FTP response messages.

Anchors

Anchors in URLs allow the browsers to point to a specific anchored (or named) part of a document.

Technically, an anchor is not part of the URL proper, and HTTP does not transmit it to the server. It is usually the client's (e.g., a browser's) responsibility to resolve anchors. This implies that even if the anchor specified in a link is not found (but the file that should contain the anchor is valid), HTTP does not specify any response code or reason-phrase for this condition. Therefore NetScanner first checks whether the file specified in the link is valid. If so, NetScanner proceeds to download the complete file and searches for an HTML tag that defines the anchor. If the anchor is not found, NetScanner responds with an "Anchor not found" message.
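The search for the tag that defines an anchor can be sketched as follows, assuming the document has already been downloaded as text. The class and function names are hypothetical. Accepting an id attribute in addition to <a name="..."> is an assumption; the original text does not say which tags NetScanner recognizes.

```python
from html.parser import HTMLParser


class AnchorFinder(HTMLParser):
    """Looks for a tag that defines the given anchor name (sketch only)."""

    def __init__(self, anchor):
        super().__init__()
        self.anchor = anchor
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("name") == self.anchor:
            self.found = True
        elif attributes.get("id") == self.anchor:  # assumption: id also counts
            self.found = True


def anchor_status(html_text, anchor):
    """Return "OK" if html_text defines anchor, else "Anchor not found"."""
    finder = AnchorFinder(anchor)
    finder.feed(html_text)
    return "OK" if finder.found else "Anchor not found"
```

For a link such as index.html#top, the caller would first confirm that index.html is valid, then pass its text and "top" to anchor_status.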

Search Strings

These are part of the standard URL description and will be automatically resolved.

Protocols and Diagnostics

HTTP Messages

The following is a brief description of response messages for HTTP links. These are extracted from RFC 1945. As per that RFC, the server responds with a three-digit response code followed by a reason-phrase. The response codes are categorized as follows, based on the first digit of the code:

1XX Informational - Not used, but reserved for future use
2XX Success - The action was successfully received, understood, and accepted.
3XX Redirection - Further action must be taken in order to complete the request
4XX Client Error - The request contains bad syntax or cannot be fulfilled
5XX Server Error - The server failed to fulfill an apparently valid request

Some examples of response codes and reason-phrases are given below.

200 OK
201 Created
202 Accepted
204 No Content
301 Moved Permanently
302 Moved Temporarily
304 Not Modified
400 Bad Request
401 Unauthorized
403 Forbidden
404 Not Found
500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable

Whenever a response code and reason-phrase can be obtained for a link, NetScanner displays the reason-phrase against the link. If the response code is not of class 2XX, NetScanner additionally displays an error message in the Java Console, since such responses indicate bad links. As long as the response code is of class 2XX, the link should be considered valid, even if the reason-phrase is not one of the 2XX examples listed above.
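The classification by first digit and the 2XX validity rule are simple enough to state in code. The sketch below is illustrative only; the function names are hypothetical, and the class names are taken from the list above.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def classify_status(code):
    """Map a three-digit response code to its class by first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


def is_valid_link(code):
    """A link is considered valid whenever the code is of class 2XX."""
    return code // 100 == 2


if __name__ == "__main__":
    for code in (200, 204, 302, 404, 503):
        print(code, classify_status(code), is_valid_link(code))
```

Note that validity depends only on the code, not on the reason-phrase, which matches the rule stated above.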

See the miscellaneous errors section for other messages NetScanner generates. These usually appear when a valid response could not be obtained from the server.

FTP Messages

  1. Login failed: Anonymous (or other, if specified) login failed.
  2. File not found: The specified directory/file is not found on the ftp server.

Miscellaneous Errors

  1. Host unknown: The hostname specified in the URL is unknown.
  2. Protocol unknown: The protocol specified in the URL is unknown.
  3. Error while connecting: The host could not be reached due to network errors.
  4. Unknown Error: Internal error.
  5. Unchecked: This is the default status message when a document is first loaded. If this message remains after a scan, a valid connection to the URL could not be established within the specified number of seconds.
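The messages above correspond roughly to distinct failure modes when opening a connection. The sketch below shows one plausible mapping from Python socket errors to these messages. It is not how NetScanner is implemented; diagnose_connection, its parameters, and the "Connected" and "No connection needed" return values are hypothetical.

```python
import socket
from urllib.parse import urlsplit

# Default ports for the network protocols this page describes.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def diagnose_connection(url, timeout=5.0, connect=socket.create_connection):
    """Try to reach the host in url and return a diagnostic message."""
    parts = urlsplit(url)
    if parts.scheme == "file":
        return "No connection needed"  # local documents involve no server
    if parts.scheme not in DEFAULT_PORTS:
        return "Protocol unknown"
    try:
        port = parts.port or DEFAULT_PORTS[parts.scheme]
        sock = connect((parts.hostname, port), timeout=timeout)
    except socket.gaierror:
        return "Host unknown"  # the hostname could not be resolved
    except socket.timeout:
        return "Unchecked"  # no connection within the specified seconds
    except OSError:
        return "Error while connecting"  # refused, unreachable, and so on
    except Exception:
        return "Unknown Error"
    sock.close()
    return "Connected"
```

The order of the except clauses matters: the name-resolution and timeout errors are subclasses of OSError, so they must be tested before it.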

Last modified: Fri Jun 18 11:47:52 India Standard Time 1999

© Subbu Allamaraju 1998, 1999. All rights reserved.

All copyrights and trademarks acknowledged.