Duplicate messages controlled at last.

Richard Schaal (starnet!apple!phx.mcd.mot.com!rschaal)
Mon, 31 Jan 1994 13:38:58 -0700 (MST)

In the beginning, there was comparison of timestamps to use as a vehicle to
determine whether messages were duplicated. It was good.

In the time that has transpired since then, several of the people have
complained, loudly at times, that there were duplicate messages, but the cry
fell upon deaf ears. It was a pain.

From out of the West, came a computer geek who came to settle in the Valley of
the Sun { Scottsdale }. It was hot.

The geek came to learn of X*press and the nifty mail list. And from that, he
learned of the software. It was crude.

The geek polished the software and made it do his bidding. With the addition
of C-News and NN, he constructed a news system. It was pretty neat.

The geek's wife, a generally kind-hearted woman, noticed that the geek was
spending an increasing amount of time with the computer. This was alright,
since that was the geek's trade. She also noticed that the "honey-do" list
was languishing for lack of attention. It was bad.

The geek now found himself with a fixed amount of time to read the news, and
little tolerance for duplicate articles. It was a challenge.

He found that the same message would be sent out repeatedly for months with
different timestamps, which defeated the timestamp comparison. After weeks of
thought, it came to him that a part of a security/encryption scheme would make
a robust means of detecting duplicate text in the face of different
timestamps. It was MD5.

An excerpt from RFC 1321, "The MD5 Message-Digest Algorithm", April 1992:

This document describes the MD5 message-digest algorithm. The
algorithm takes as input a message of arbitrary length and produces
as output a 128-bit "fingerprint" or "message digest" of the input.
It is conjectured that it is computationally infeasible to produce
two messages having the same message digest, or to produce any
message having a given prespecified target message digest. The MD5
algorithm is intended for digital signature applications, where a
large file must be "compressed" in a secure manner before being
encrypted with a private (secret) key under a public-key cryptosystem
such as RSA.

In a couple of weeks, once my testing is complete, I will be making a new
source archive available that applies the MD5 algorithm to the text
of the articles. The implementation does not depend on the timestamp of the
article, just the text. This use of the message digest coupled with a
reasonable C-News history file is expected to eliminate duplicates, thus making
the geek's newsreading time more fulfilling.
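
For the curious, the idea can be sketched in a few lines. This is only an
illustration, not the code in the forthcoming archive: it is written in Python
rather than C, the function names are made up for the example, it assumes the
headers end at the first blank line, and an in-memory set stands in for a real
history file.

```python
import hashlib


def body_digest(article: str) -> str:
    """Return the MD5 hex digest of an article's text, ignoring its headers."""
    # Assumption: headers (including the timestamp) end at the first blank line.
    _headers, sep, body = article.partition("\n\n")
    if not sep:
        # No blank line found, so treat the whole input as the text.
        body = article
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def is_duplicate(article: str, seen: set) -> bool:
    """Report whether this text was seen before, recording it if it was not."""
    # 'seen' is a hypothetical stand-in for a persistent history file.
    digest = body_digest(article)
    if digest in seen:
        return True
    seen.add(digest)
    return False


if __name__ == "__main__":
    seen = set()
    first = "Date: Mon, 3 Jan 1994\n\nSame old story.\n"
    resend = "Date: Mon, 31 Jan 1994\n\nSame old story.\n"
    print(is_duplicate(first, seen))   # first sighting
    print(is_duplicate(resend, seen))  # same text, different timestamp
```

Because only the text after the headers is hashed, a resend with a new
timestamp yields the same digest as the original and is caught.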

-- 
Richard Schaal
Motorola Computer Group
M/S AZ43 DW278
2900 South Diablo Way, Tempe, AZ 85252
Voice: (602)438-3519   Fax: (602)438-3836
E-Mail: rschaal@phx.mcd.mot.com
"This is a great place to work.... there's a going away party every Friday!"