[Edited Aug 20 to make directory name consistent in FSF address patch.]
[Edited Aug 19 to fix typo in unknown host name error message patch.]

From htdig@htdig.org Fri Aug 6 17:24:42 1999
Return-Path:
Received: from sob.htdig.org (htdig.org [209.75.193.22])
    by cliff.scrc.umanitoba.ca (8.8.5/8.8.5) with ESMTP id RAA08051
    for ; Fri, 6 Aug 1999 17:24:41 -0500
Received: from sob.htdig.org (localhost [127.0.0.1])
    by sob.htdig.org (Postfix) with SMTP id 168A0A101;
    Fri, 6 Aug 1999 15:23:18 -0700 (PDT)
From: Gilles Detillieux
Errors-To: htdig@htdig.org
To: htdig@htdig.org
Message-ID: <37AB6056.BeroList-2.5.9@sob.htdig.org>
Delivered-To: htdig@htdig.org
Date: Fri, 6 Aug 1999 17:23:00 -0500 (CDT)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig] honking big patch file collection for 3.1.2
Sender: htdig@htdig.org
Status: RO

Hi, folks. Over the past week, I've put together a big collection of
patch files for htdig-3.1.2, to fix many of the bugs that have been
reported over the past three and a half months, since the last release.
Some of these were contributed by others. Many were backported from the
3.2 development code, and several were put together by me in the past
week. Next week, I'll make sure that any of these that haven't made it
into 3.2 yet will.

In the meantime, I'd appreciate any feedback from all of you as to
whether these patches really do fix the problems they claim to, or if
they introduce other problems. Each patch is preceded by a brief
description, so you can pick them out and apply them one by one if you
want, but I had no problem applying the whole collection at once with
"patch -p1" on my Red Hat Linux box.
Here's a summary of the changes:

- PR#339 fixed - URL encodes all non-ASCII characters in URIs
- PR#560 fixed - prevent inappropriate suffix stripping in endings fuzzy
- PR#542 fixed - URL passed to external parser now quoted
- PR#541 fixed - ANCHOR variable now set properly
- PR#535 & PR#557 fixed - HTTP header parsing now more robust
- username/password now blotted out from command arguments
- adds support for embed, object and link tags
- PR#554 fixed - locale now affects default date format in htsearch
- fixes the bug in the handling of modification_time_is_now
- PR#578 fixed - multiple directives in robots tag now work
- now gives an error message for unknown hosts
- empty or null strings won't cause htfuzzy to core dump
- PDF parser now clears title string properly when done with it
- PR#543 & PR#585 fixed - names like left_index.html no longer stripped
- fixes server_alias entries so port defaults to 80 if omitted
- decodes SGML entities inside tag attributes
- PR#566 fixed - urls like 'http:/dir/file.ext' resolved properly
- $(VAR) at end of template string now being expanded properly
- PR#595 fixed - corrected address for FSF
- maximum word length now a config attribute, not compile-time option
- PR#81 & PR#472 fixed - htdig -vvv shouldn't crash in strftime()
- PR#348 fixed - missing or invalid port number will get set correctly
- PR#493 fixed - valid URL with ".." within a file name not rejected
- PR#572 fixed - htsearch won't crash if CONTENT_LENGTH not set
- PR#545 fixed - configure tests for presence of alloca.h for regex.c
- documentation updates, including PR#558 & PR#626.
-------- 8< -------- snip -------- 8< --------

This patch should fix PR#545, to test for presence of alloca.h

--- htdig-3.1.2.bak/configure.in Wed Apr 21 21:47:53 1999
+++ htdig-3.1.2/configure.in Wed Aug 4 16:17:57 1999
@@ -13,7 +13,7 @@
 #
 # You should have received a copy of the GNU General Public License
 # along with this program; if not, write to the Free Software
-# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 #
 AC_INIT(htcommon/DocumentDB.cc)
@@ -79,7 +79,7 @@
 dnl More header checks--here use C++
 AC_LANG_CPLUSPLUS
-AC_CHECK_HEADERS(fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h)
+AC_CHECK_HEADERS(fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h alloca.h)
 AC_CHECK_HEADER(fstream.h,nofstream=0,nofstream=1)
 if test "x$nofstream" = "x1" ; then
   AC_MSG_ERROR([To compile ht://Dig, you will need a C++ library. Try installing libstdc++.])
--- htdig-3.1.2.bak/configure Wed Apr 21 21:47:53 1999
+++ htdig-3.1.2/configure Wed Aug 4 16:17:57 1999
@@ -2010,7 +2010,7 @@
 CXXCPP="$ac_cv_prog_CXXCPP"
 echo "$ac_t""$CXXCPP" 1>&6
-for ac_hdr in fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h
+for ac_hdr in fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h alloca.h
 do
 ac_safe=`echo "$ac_hdr" | sed 'y%./+-%__p_%'`
 echo $ac_n "checking for $ac_hdr""... $ac_c" 1>&6
--- htdig-3.1.2.bak/include/htconfig.h.in Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/include/htconfig.h.in Wed Aug 4 16:30:10 1999
@@ -55,6 +55,9 @@
 /* Define if you have the header file. */
 #undef HAVE_ZLIB_H
+
+/* Define if you have the header file. */
+#undef HAVE_ALLOCA_H
 /* Define if you have the header file. */
 #undef HAVE_SYS_FILE_H
--- htdig-3.1.2.bak/htlib/regex.c Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/regex.c Wed Aug 4 16:20:48 1999
@@ -27,6 +27,7 @@
 #undef _GNU_SOURCE
 #define _GNU_SOURCE
+#include
 #ifdef HAVE_CONFIG_H
 # include
 #endif

This adds descriptions for attributes that were missing, adds a few
clarifications, and corrects a few defaults and typos. Covers PR#558,
PR#626, and then some.

--- htdig-3.1.2.bak/htdoc/attrs.html Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdoc/attrs.html Fri Aug 6 14:00:28 1999
@@ -413,6 +413,57 @@
+ bin_dir +
+
+
+
+ type: +
+
+ string +
+
+ used by: +
+
+ htdig, + htnotify, + htfuzzy, + htmerge and + htsearch +
+
+ default: +
+
+ BIN_DIR +
+
+ description: +
+
+ This is the directory in which the executables + related to ht://Dig are installed. It is never used + directly by any of the programs, but other attributes + can be defined in terms of this one. +

+ The default value of this attribute is determined at + compile time. +

+
+
+ example: +
+
+ bin_dir: /usr/local/bin +
+
+
+
+
+
+
case_sensitive
@@ -595,7 +646,8 @@
If specified and the zlib - compression library was available when compiledi controls + compression library was available when compiled, + this attribute controls the amount of compression used in the doc_db file. Defaults to zero to provide backward compatility with old databases. @@ -612,6 +664,58 @@
+ config_dir +
+
+
+
+ type: +
+
+ string +
+
+ used by: +
+
+ htdig, + htnotify, + htfuzzy, + htmerge and + htsearch +
+
+ default: +
+
+ CONFIG_DIR +
+
+ description: +
+
+ This is the directory which contains all configuration + files related to ht://Dig. It is never used + directly by any of the programs, but other attributes + or the include directive + can be defined in terms of this one. +

+ The default value of this attribute is determined at + compile time. +

+
+
+ example: +
+
+ config_dir: /var/htdig/conf +
+
+
+
+
+
+
create_image_list
@@ -1459,7 +1563,7 @@ default:
- cgi-bin .cgi + /cgi-bin/ .cgi
description: @@ -2136,6 +2240,103 @@
+ image_url_prefix +
+
+
+
+ type: +
+
+ string +
+
+ used by: +
+
+ htsearch +
+
+ default: +
+
+ IMAGE_URL_PREFIX +
+
+ description: +
+
+ This specifies the directory portion of the URL used + to display star images. This attribute isn't directly + used by htsearch, but is used in the default URL for + the star_image and + star_blank attributes, and + other attributes may be defined in terms of this one. +

+ The default value of this attribute is determined at + compile time. +

+
+
+ example: +
+
+ image_url_prefix: /images/htdig +
+
+
+
+
+
+
+ include +
+
+
+
+ type: +
+
+ string +
+
+ used by: +
+
+ htdig, + htnotify, + htfuzzy, + htmerge and + htsearch +
+
+ description: +
+
+ This is not quite a configuration attribute, but + rather a directive. It can be used within one + configuration file to include the definitions of + another file. The last definition of an attribute + is the one that applies, so after including a file, + any of its definitions can be overridden with + subsequent definitions. This can be useful when + setting up many configurations that are mostly the + same, so all the common attributes can be maintained + in a single configuration file. The include directives + can be nested, but watch out for nesting loops. +
+
+ example: +
+
+ include: ${config_dir}/htdig.conf +
+
+
+
+
+
+
iso_8601
@@ -4045,6 +4246,11 @@ that is part of the xpdf 0.80 package have been tested as pdf_parsers. +

+ The default value of this attribute is determined at + compile time, to include the path to the acroread + executable. +

example: @@ -4521,6 +4727,10 @@ if no matches were found. In this case the nothing_found_file attribute is used instead. + Also, this file will not be output if it is + overridden by defining the + search_results_wrapper + attribute.
example: @@ -4633,6 +4843,10 @@ if no matches were found. In this case the nothing_found_file attribute is used instead. + Also, this file will not be output if it is + overridden by defining the + search_results_wrapper + attribute.
example: @@ -6256,7 +6470,7 @@ default:
- .-_/!#$%^&*' + .-_/!#$%^&'
description: @@ -6285,6 +6499,50 @@
+ version +
+
+
+
+ type: +
+
+ string +
+
+ used by: +
+
+ htsearch +
+
+ default: +
+
+ VERSION +
+
+ description: +
+
+ This specifies the value of the VERSION + variable which can be used in search templates. + The default value of this attribute is determined + at compile time, and will not normally be set + in configuration files. +
+
+ example: +
+
+ version: 3.1.2PL1 +
+
+
+
+
+
+
word_db
@@ -6385,7 +6643,7 @@ Andrew Scherpbier <andrew@contigo.com> -Last modified: Sun Feb 14 21:51:44 EST 1999 +Last modified: Fri Aug 6 15:00:15 EDT 1999 --- htdig-3.1.2.bak/htdoc/cf_byname.html Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdoc/cf_byname.html Fri Aug 6 14:16:41 1999 @@ -24,12 +24,14 @@ * bad_extensions
* bad_querystr
* bad_word_list
+ * bin_dir

C
* case_sensitive
* common_dir
* common_url_parts
* compression_level
+ * config_dir
* create_image_list
* create_url_list

@@ -68,6 +70,8 @@
I
* image_list
+ * image_url_prefix
+ * include
* iso_8601

K
@@ -170,6 +174,7 @@

V
* valid_punctuation
+ * version

W
* word_db
--- htdig-3.1.2.bak/htdoc/cf_byprog.html Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdoc/cf_byprog.html Fri Aug 6 14:19:45 1999 @@ -168,6 +168,7 @@ * use_meta_description
* use_star_image
* valid_punctuation
+ * version
* word_db
We uncovered a bug back on May 20, in the encodeURL() function. This
function should encode all non-ASCII characters, but right now it
doesn't. I think this is what PR#339 was all about. Here's the fix:

--- htdig-3.1.2/htlib/URLTrans.cc.orig Tue Feb 16 23:03:56 1999
+++ htdig-3.1.2/htlib/URLTrans.cc Wed Jun 2 08:29:05 1999
@@ -75,7 +75,7 @@ void encodeURL(String &str, char *valid)
     for (p = str; p && *p; p++)
     {
-        if (isdigit(*p) || isalpha(*p) || strchr(valid, *p))
+        if (isascii(*p) && (isdigit(*p) || isalpha(*p) || strchr(valid, *p)))
             temp << *p;
         else
         {

Suffix-handling improvement (PR#560), to prevent inappropriate suffix
stripping in endings fuzzy matches.

> From: Steve Arlow
> Subject: Suffix-handling improvement
> To: htdig3-bugs@htdig.org
> Date: Tue, 8 Jun 1999 19:57:54 -0400 (EDT)
> Cc: yorick@yorick.com
>
> Hello,
>
> I do consulting for a number of law firms, and quickly discovered a
> problem with htfuzzy matching on the word "witness". (There are
> three root words in the distribution dictionary that end in "-ness"
> and also certainly exhibit this problem; the other two are
> "highness" and "likeness". Other words can also be argued about.)
>
> The fix (which does not appear to break anything else AFAICT, but
> may have a small effect on performance) is to add a preliminary check
> on root2word before trying word2root. The code is below (from the
> file htdig-3.1.2/htfuzzy/Endings.cc), optimize it to your taste.

Follow-up example:

> Words of the form XXXness which are not a form of the word XXX. If I
> enter "witness" into htdig with matching for alternate endings enabled,
> it will look for "wit", "wits", or "witness". What it should really be
> looking for is "witness", "witnessed", "witnessing", or "witnesses".
>
> A similar problem might occur with other suffixes, but I can't think of
> an example off the top of my head.
>
> The fix is to try to interpret each term as a root word before trying
> to interpret it as an alternate form.
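
As a rough illustration of the "root first" ordering described above, here is a minimal standalone sketch. It is not the code from Endings.cc: the two std::map objects are hypothetical stand-ins for the root2word and word2root databases, the function names are made up for this example, and the comparison is a plain case-sensitive one on words assumed to be lower-cased already.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for the two endings databases.
using RootToWords = std::map<std::string, std::string>;  // root -> space-separated forms
using WordToRoot  = std::map<std::string, std::string>;  // form -> root

// Append every form listed for `root`, except the word that was searched for.
static void addForms(const RootToWords &root2word, const std::string &root,
                     const std::string &searched, std::vector<std::string> &out)
{
    auto it = root2word.find(root);
    if (it == root2word.end())
        return;
    std::istringstream forms(it->second);
    std::string form;
    while (forms >> form)
        if (form != searched)
            out.push_back(form);
}

// Try the word as a root first; only fall back to suffix stripping
// (the word2root lookup) when it is not a root itself.
std::vector<std::string> getWords(const RootToWords &root2word,
                                  const WordToRoot &word2root,
                                  const std::string &word)
{
    std::vector<std::string> out;
    if (root2word.count(word)) {
        // The word is itself a root: expand it, do not strip any suffix.
        addForms(root2word, word, word, out);
        return out;
    }
    auto it = word2root.find(word);
    if (it != word2root.end()) {
        out.push_back(it->second);                   // the root itself
        addForms(root2word, it->second, word, out);  // the root's other forms
    }
    return out;
}
```

With a toy dictionary in which "witness" is both a root and a listed form of "wit", a search for "witness" now expands only to the forms of "witness", while a search for "wits" still falls back to the root "wit".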
--- htdig-3.1.2/htfuzzy/Endings.cc.endingsbug Wed Apr 21 21:47:58 1999 +++ htdig-3.1.2/htfuzzy/Endings.cc Fri Jul 30 14:43:57 1999 @@ -68,22 +68,6 @@ Endings::getWords(char *w, List &words) String word = w; word.lowercase(); - if (word2root->Get(word, data) == OK) - { - // - // Found the root of the word. We'll add it to the list already - // - word = data; - words.Add(new String(word)); - } - else - { - // - // The root wasn't found. This could mean that the word - // is already the root. - // - } - if (root2word->Get(word, data) == OK) { // @@ -97,6 +81,40 @@ Endings::getWords(char *w, List &words) words.Add(new String(token)); } token = strtok(0, " "); + } + } + else + { + if (word2root->Get(word, data) == OK) + { + // + // Found the root of the word. We'll add it to the list already + // + word = data; + words.Add(new String(word)); + } + else + { + // + // The root wasn't found. This could mean that the word + // is already the root. + // + } + + if (root2word->Get(word, data) == OK) + { + // + // Found the root's permutations + // + char *token = strtok(data.get(), " "); + while (token) + { + if (mystrcasecmp(token, w) != 0) + { + words.Add(new String(token)); + } + token = strtok(0, " "); + } } } } Quote the filename before passing it to the command-line to prevent shell escapes. Fixes PR#542. Also make error messages more useful. --- htdig-3.1.2/htdig/ExternalParser.cc.old Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdig/ExternalParser.cc Fri Jul 30 15:08:57 1999 @@ -133,8 +133,8 @@ ExternalParser::parse(Retriever &retriev // Now start the external parser. 
// String command = currentParser; - command << ' ' << path << ' ' << contentType << ' ' << base.get() << - ' ' << configFile; + command << ' ' << path << ' ' << contentType << " \"" << base.get() << + "\" " << configFile; FILE *input = popen(command, "r"); if (!input) @@ -170,7 +170,7 @@ ExternalParser::parse(Retriever &retriev (hd = atoi(token3)) >= 0 && hd < 12) retriever.got_word(token1, loc, hd); else - cerr<< "External parser error in line:"<. --- htdig-3.1.2.bak/htsearch/Display.h Wed Apr 21 21:47:58 1999 +++ htdig-3.1.2/htsearch/Display.h Fri Jul 30 14:23:56 1999 @@ -151,7 +151,7 @@ protected: String *readFile(char *); void expandVariables(char *); void outputVariable(char *); - String *excerpt(DocumentRef *ref, String urlanchor, int fanchor, int first); + String *excerpt(DocumentRef *ref, String urlanchor, int fanchor, int &first); char *hilight(char *str, String urlanchor, int fanchor); void setupImages(); String *generateStars(DocumentRef *, int); --- htdig-3.1.2.bak/htsearch/Display.cc Wed Apr 21 21:47:58 1999 +++ htdig-3.1.2/htsearch/Display.cc Fri Jul 30 14:24:05 1999 @@ -959,7 +959,7 @@ Display::buildMatchList() //***************************************************************************** String * -Display::excerpt(DocumentRef *ref, String urlanchor, int fanchor, int first) +Display::excerpt(DocumentRef *ref, String urlanchor, int fanchor, int &first) { char *head; int use_meta_description = 0; This patch fixes PR#348, to make sure a missing or invalid port number will get set correctly. --- htdig-3.1.2.bak/htlib/URL.cc Wed Apr 21 21:47:58 1999 +++ htdig-3.1.2/htlib/URL.cc Wed Aug 4 13:09:01 1999 @@ -282,6 +282,8 @@ void URL::parse(char *u) p = strtok(0, "/"); if (p) _port = atoi(p); + if (!p || _port <= 0) + _port = 80; } else { This should fix PR#493, to avoid rejecting a valid URL with ".." in it. 
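
To see the effect of this one-line change in isolation, here is a small sketch that puts the old and new rejection tests side by side. The function names are invented for this example, and only the two conditions are taken from the patch; the rest of IsValidURL() is left out.

```cpp
#include <cstring>

// Old test: rejects any URL containing "..", even inside a file name.
bool rejectedByOldTest(const char *u)
{
    return std::strstr(u, "..") != nullptr || std::strncmp(u, "http://", 7) != 0;
}

// New test: rejects only "/../", that is, ".." as a complete path
// segment followed by a slash.
bool rejectedByNewTest(const char *u)
{
    return std::strstr(u, "/../") != nullptr || std::strncmp(u, "http://", 7) != 0;
}
```

A URL such as http://www.example.com/notes..draft.html (a made-up address) is rejected by the old test but accepted by the new one, while http://www.example.com/a/../b.html is rejected by both.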
--- htdig-3.1.2.bak/htdig/Retriever.cc Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdig/Retriever.cc Wed Aug 4 15:51:44 1999 @@ -625,7 +625,7 @@ Retriever::IsValidURL(char *u) // Currently, we only deal with HTTP URLs. Gopher and ftp will // come later... ***FIX*** // - if (strstr(u, "..") || strncmp(u, "http://", 7) != 0) + if (strstr(u, "/../") || strncmp(u, "http://", 7) != 0) { if (debug > 2) cout << endl <<" Rejected: Not an http or relative link!"; This updates the FSF address in COPYING & Makefile.in. PR#595. The address is still old in configure.in, but we won't touch it here so that we don't need to run autoconf. --- htdig-3.1.2.bak/COPYING Tue Feb 16 23:03:53 1999 +++ htdig-3.1.2/COPYING Wed Aug 4 07:40:22 1999 @@ -2,7 +2,7 @@ Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc. - 675 Mass Ave, Cambridge, MA 02139, USA + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. @@ -305,7 +305,8 @@ You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software - Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Also add information on how to contact you by electronic and paper mail. --- htdig-3.1.2.bak/htdoc/COPYING Tue Feb 16 23:03:53 1999 +++ htdig-3.1.2/htdoc/COPYING Wed Aug 4 07:40:22 1999 @@ -2,7 +2,7 @@ Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc. - 675 Mass Ave, Cambridge, MA 02139, USA + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 
@@ -305,7 +305,8 @@ You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software - Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Also add information on how to contact you by electronic and paper mail. --- htdig-3.1.2.bak/Makefile.in Wed Apr 21 21:47:53 1999 +++ htdig-3.1.2/Makefile.in Wed Aug 4 10:10:54 1999 @@ -13,7 +13,7 @@ # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software -# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA top_srcdir= @top_srcdir@ srcdir= @srcdir@ This should help with PR#81 & PR#472, where strftime() would crash on some systems. Idea submitted by benoit.sibaud@cnet.francetelecom.fr. --- htdig-3.1.2.bak/htdig/Document.cc Wed Aug 4 12:43:27 1999 +++ htdig-3.1.2/htdig/Document.cc Wed Aug 4 13:37:43 1999 @@ -215,6 +215,8 @@ Document::getdate(char *datestring) // correct for mystrptime, if %Y format saw only a 2 digit year if (tm.tm_year < 0) tm.tm_year += 1900; + tm.tm_yday = 0; // clear these to prevent problems in strftime() + tm.tm_wday = 0; if (debug > 2) { This patch fixes a few problems with header parsing, including PR#535 & PR#557. --- htdig-3.1.2/htdig/Document.cc.hdrparsebug Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 14:15:10 1999 @@ -478,14 +478,18 @@ Document::readHeader(Connection &c) inHeader = 0; else { + char *token = line.get(); + while (*token && !isspace(*token)) + token++; + while (*token && isspace(*token)) + token++; if (strncmp(line, "HTTP/", 5) == 0) { // // Found the status line. 
This will determine if we // continue or not // - strtok(line, " "); - char *status = strtok(0, " "); + char *status = strtok(token, " "); if (status && strcmp(status, "200") == 0) { returnStatus = Header_ok; @@ -508,22 +512,19 @@ Document::readHeader(Connection &c) returnStatus = Header_not_authorized; } } - else if (modtime == 0 + else if (modtime == 0 && *token && mystrncasecmp(line, "last-modified:", 14) == 0) { - strtok(line, " \t"); - modtime = getdate(strtok(0, "\n\t")); + modtime = getdate(strtok(token, "\n\t")); } - else if (contentLength == -1 + else if (contentLength == -1 && *token && mystrncasecmp(line, "content-length:", 15) == 0) { - strtok(line, " \t"); - contentLength = atoi(strtok(0, "\n\t")); + contentLength = atoi(strtok(token, "\n\t")); } - else if (mystrncasecmp(line, "content-type:", 13) == 0) + else if (*token && mystrncasecmp(line, "content-type:", 13) == 0) { - strtok(line, " \t"); - char *token = strtok(0, "\n\t"); + token = strtok(token, "\n\t"); if ((returnStatus == Header_not_found || returnStatus == Header_ok) && @@ -537,8 +538,7 @@ Document::readHeader(Connection &c) } else if (mystrncasecmp(line, "location:", 9) == 0) { - strtok(line, " \t"); - redirected_to = strtok(0, "\r\n \t"); + redirected_to = strtok(token, "\r\n \t"); } } } This is Geoff's patch to hide the username/password in the command line arguments. --- htdig-3.1.2/htdig/htdig.cc.orig Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdig/htdig.cc Fri Jul 30 17:24:32 1999 @@ -79,6 +79,8 @@ main(int ac, char **av) break; case 'u': credentials = optarg; + for (int pos = 0; pos < strlen(optarg); pos++) + optarg[pos] = '*'; break; case 'a': alt_work_area++; This patch adds support for , and tags. (Don't you wish all additions could be this easy?) --- htdig-3.1.2/htdig/HTML.cc.old Fri Jul 30 12:24:14 1999 +++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 13:16:55 1999 @@ -63,7 +63,7 @@ HTML::HTML() // the attrs Match object is used to match names of tag parameters. 
// tags.IgnoreCase(); - tags.Pattern("title|/title|a|/a|h1|h2|h3|h4|h5|h6|/h1|/h2|/h3|/h4|/h5|/h6|noindex|/noindex|img|li|meta|frame|area|base"); + tags.Pattern("title|/title|a|/a|h1|h2|h3|h4|h5|h6|/h1|/h2|/h3|/h4|/h5|/h6|noindex|/noindex|img|li|meta|frame|area|base|embed|object|link"); attrs.IgnoreCase(); attrs.Pattern("src|href|name"); @@ -894,6 +894,8 @@ HTML::do_tag(Retriever &retriever, Strin } case 21: // frame + case 24: // embed + case 25: // object { which = -1; int pos = srcMatch.FindFirstWord(position, which, length); @@ -963,6 +965,7 @@ HTML::do_tag(Retriever &retriever, Strin } case 22: // area + case 26: // link { which = -1; int pos = hrefMatch.FindFirstWord(position, which, length); @@ -972,7 +975,7 @@ HTML::do_tag(Retriever &retriever, Strin case 0: // "href" { // - // src seen + // href seen // while (*position && *position != '=') position++; Torsten Neuer's fix for PR# 554. --- htdig-3.1.2.bak/htsearch/Display.cc Wed Apr 21 21:47:58 1999 +++ htdig-3.1.2/htsearch/Display.cc Tue Aug 3 14:46:30 1999 @@ -20,6 +20,7 @@ static char RCSid[] = "$Id: Display.cc,v #include #include #include +#include #include "HtURLCodec.h" #include "HtWordType.h" @@ -318,6 +319,7 @@ Display::displayMatch(ResultMatch *match { struct tm *tm = localtime(&t); char *datefmt = config["date_format"]; + char *locale = config["locale"]; if (!datefmt || !*datefmt) { if (config.Boolean("iso_8601")) @@ -325,6 +327,10 @@ Display::displayMatch(ResultMatch *match else datefmt = "%x"; } + if ( locale && *locale ) + { + setlocale(LC_TIME,locale); + } strftime(buffer, sizeof(buffer), datefmt, tm); *str << buffer; } This patch turns the maximum word length into a run-time option, rather than compile-time. --- htdig-3.1.2.bak/include/htconfig.h.in Wed Apr 21 21:47:58 1999 +++ htdig-3.1.2/include/htconfig.h.in Wed Aug 4 10:43:33 1999 @@ -5,7 +5,6 @@ #define _config_h_ #define VERSION 1 -#define MAX_WORD_LENGTH 12 /* Define if on AIX 3. System headers sometimes define this. 
--- htdig-3.1.2.bak/htcommon/WordReference.h Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htcommon/WordReference.h Wed Aug 4 10:44:12 1999 @@ -25,7 +25,7 @@ public: WordReference() {} ~WordReference() {} - char Word[MAX_WORD_LENGTH + 1]; + String Word; int WordCount; int Weight; int Location; --- htdig-3.1.2.bak/htcommon/WordList.cc Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htcommon/WordList.cc Wed Aug 4 12:22:31 1999 @@ -46,11 +46,12 @@ void WordList::Word(char *word, int loca if (weight_factor == 0.0) // Why should we add words with no weight? return; String shortword = word; + static int maximum_word_length = config.Value("maximum_word_length", 12); shortword.lowercase(); word = shortword.get(); - if (shortword.length() > MAX_WORD_LENGTH) - word[MAX_WORD_LENGTH] = '\0'; + if (shortword.length() > maximum_word_length) + word[maximum_word_length] = '\0'; if (!valid_word(word)) return; @@ -80,7 +81,7 @@ void WordList::Word(char *word, int loca wordRef->DocumentID = docID; wordRef->Weight = int((1000 - location) * weight_factor); wordRef->Anchor = anchor_number; - strcpy(wordRef->Word, word); + wordRef->Word = word; words->Add(word, wordRef); } } @@ -145,7 +146,7 @@ void WordList::Flush() while ((wordRef = (WordReference *) words->Get_NextElement())) { - fprintf(fl, "%s",wordRef->Word); + fprintf(fl, "%s",wordRef->Word.get()); fprintf(fl, "\ti:%d\tl:%d\tw:%d", wordRef->DocumentID, wordRef->Location, @@ -220,15 +221,16 @@ void WordList::BadWordFile(char *filenam char buffer[1000]; char *word; String new_word; - int minimum_word_length = config.Value("minimum_word_length", 3); + static int minimum_word_length = config.Value("minimum_word_length", 3); + static int maximum_word_length = config.Value("maximum_word_length", 12); while (fl && fgets(buffer, sizeof(buffer), fl)) { word = strtok(buffer, "\r\n \t"); if (word && *word) { - if (strlen(word) > MAX_WORD_LENGTH) - word[MAX_WORD_LENGTH] = '\0'; + if (strlen(word) > maximum_word_length) + word[maximum_word_length] = '\0'; 
new_word = word; // We need to clean it up before we add it new_word.lowercase(); // Just in case someone enters an odd one HtStripPunctuation(new_word); --- htdig-3.1.2.bak/htcommon/DocumentRef.cc Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htcommon/DocumentRef.cc Wed Aug 4 10:45:30 1999 @@ -571,8 +571,7 @@ void DocumentRef::AddDescription(char *d static double description_factor = config.Double("description_factor"); static int max_descriptions = config.Value("max_descriptions", 5); - // Not restricted to this size, just used as a hint. - String word(MAX_WORD_LENGTH); + String word; while (*p) { --- htdig-3.1.2.bak/htcommon/defaults.cc Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htcommon/defaults.cc Wed Aug 4 10:47:44 1999 @@ -89,6 +89,7 @@ ConfigDefaults defaults[] = {"max_prefix_matches", "1000"}, {"max_stars", "4"}, {"maximum_pages", "10"}, + {"maximum_word_length", "12"}, {"metaphone_db", "${database_base}.metaphone.db"}, {"meta_description_factor", "50"}, {"method_names", "and All or Any boolean Boolean"}, --- htdig-3.1.2.bak/htsearch/parser.cc Wed Apr 21 21:47:58 1999 +++ htdig-3.1.2/htsearch/parser.cc Wed Aug 4 10:50:41 1999 @@ -202,6 +202,7 @@ Parser::setError(char *expected) void Parser::perform_push() { + static int maximum_word_length = config.Value("maximum_word_length", 12); String temp = current->word.get(); String data; char *p; @@ -220,8 +221,8 @@ Parser::perform_push() } temp.lowercase(); p = temp.get(); - if (temp.length() > MAX_WORD_LENGTH) - p[MAX_WORD_LENGTH] = '\0'; + if (temp.length() > maximum_word_length) + p[maximum_word_length] = '\0'; if (dbf->Get(p, data) == OK) { p = data.get(); --- htdig-3.1.2.bak/htdoc/attrs.html Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdoc/attrs.html Wed Aug 4 10:58:59 1999 @@ -3124,6 +3124,51 @@
+ + maximum_word_length +
+
+
+
+ type: +
+
+ number +
+
+ used by: +
+
+ htdig and + htsearch +
+
+ default: +
+
+ 12 +
+
+ description: +
+
+ This sets the maximum length of words that will be + indexed. Words longer than this value will be silently + truncated when put into the index, or searched in the + index. +
+
+ example: +
+
+ maximum_word_length: 15 +
+
+
+
+
+
+
meta_description_factor
--- htdig-3.1.2.bak/htdoc/cf_byname.html Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdoc/cf_byname.html Wed Aug 4 10:59:30 1999 @@ -96,6 +96,7 @@ * max_prefix_matches
* max_stars
* maximum_pages
+ * maximum_word_length
* meta_description_factor
* metaphone_db
* method_names
--- htdig-3.1.2.bak/htdoc/cf_byprog.html Wed Apr 21 21:47:57 1999 +++ htdig-3.1.2/htdoc/cf_byprog.html Wed Aug 4 11:00:31 1999 @@ -54,6 +54,7 @@ * max_head_length
* max_hop_count
* max_meta_description_length
+ * maximum_word_length
* meta_description_factor
* minimum_word_length
* modification_time_is_now
@@ -132,6 +133,7 @@ * max_prefix_matches
* max_stars
* maximum_pages
+ * maximum_word_length
* method_names
* minimum_prefix_length
* minimum_word_length
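
The run-time truncation that the maximum_word_length patch above introduces can be sketched on its own as follows. This is only an illustration under assumptions: normalizeWord is a hypothetical helper, std::string stands in for the ht://Dig String class, and the limit is passed as a parameter instead of being read from the configuration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Lower-case a word and silently truncate it to at most maxLen characters.
// The default of 12 mirrors the default of the maximum_word_length attribute.
std::string normalizeWord(std::string word, std::size_t maxLen = 12)
{
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (word.length() > maxLen)
        word.resize(maxLen);
    return word;
}
```

The point of routing both indexing and searching through one such step is that a long query word gets cut to the same key that was stored in the index, so the two still match after truncation.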
I think this patch will fix PR#514 in the bug database. It's Geoff's
first patch, with a minor correction, plus an added test in the vscode
macro, which is where the problem seemed to be happening. The author of
the metaphone code likely assumed that isalpha() meant [A-Za-z], and
forgot about upper half characters. This won't do anything to map
accented vowels to their unaccented counterparts, but it should
hopefully put an end to the segmentation faults.

--- htdig-3.1.2.bak/htfuzzy/Fuzzy.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htfuzzy/Fuzzy.cc Fri Jul 30 16:37:42 1999
@@ -55,6 +55,8 @@ Fuzzy::getWords(char *word, List &words)
 {
     if (!index)
         return;
+    if (!word || !*word)
+        return;
     //
     // Convert the word to a fuzzy key
--- htdig-3.1.2.bak/htfuzzy/Metaphone.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htfuzzy/Metaphone.cc Tue Aug 3 14:50:06 1999
@@ -51,7 +51,7 @@ static char vsvfn[26] = {
 /* N O P Q R S T U V W X Y Z */
 /* Macros to access character coding array */
-#define vscode(x) (vsvfn[(x) - 'A'])
+#define vscode(x) ((x) >= 'A' && (x) <= 'Z' ? vsvfn[(x) - 'A'] : 0)
 #define vowel(x) ((x) != '\0' && vscode(x) & 1) /* AEIOU */
 #define same(x) ((x) != '\0' && vscode(x) & 2) /* FJLMNR */
 #define varson(x) ((x) != '\0' && vscode(x) & 4) /* CGPST */
@@ -63,6 +63,9 @@ static char vsvfn[26] = {
 void
 Metaphone::generateKey(char *word, String &key)
 {
+    if (!word || !*word)
+        return;
+
     char *n;
     String ntrans;

This patch fixes the bug in the handling of modification_time_is_now in
the readHeader() function.
--- htdig-3.1.2/htdig/Document.cc.modnowbug Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 13:39:18 1999
@@ -96,10 +96,7 @@ Document::Reset()
     delete url;
     url = 0;
     referer = 0;
-    if(config.Boolean("modification_time_is_now"))
-        modtime = time(NULL);
-    else
-        modtime = 0;
+    modtime = 0;
     contents = 0;
     document_length = 0;
@@ -463,10 +460,7 @@ Document::readHeader(Connection &c)
     int inHeader = 1;
     int returnStatus = Header_not_found;
-    if (config.Boolean("modification_time_is_now"))
-        modtime = time(NULL);
-    else
-        modtime = 0;
+    modtime = 0;
     while (inHeader)
     {
@@ -542,6 +536,11 @@ Document::readHeader(Connection &c)
             }
         }
     }
+    static int modification_time_is_now =
+        config.Boolean("modification_time_is_now");
+    if (modtime == 0 && modification_time_is_now)
+        modtime = time(NULL);
+
     if (debug > 2)
         cout << "returnStatus = " << returnStatus << endl;
     return returnStatus;

This patch fixes robots parsing to allow multiple directives to work
correctly. Fixes PR#578, as provided by Chris Liddiard.

--- htdig-3.1.2/htdig/HTML.cc.robotbug Fri Jul 30 12:24:14 1999
+++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 13:28:35 1999
@@ -873,9 +873,9 @@ HTML::do_tag(Retriever &retriever, Strin
         doindex = 0;
         retriever.got_noindex();
     }
-    else if (content_cache.indexOf("nofollow") != -1)
+    if (content_cache.indexOf("nofollow") != -1)
         dofollow = 0;
-    else if (content_cache.indexOf("none") != -1)
+    if (content_cache.indexOf("none") != -1)
     {
         doindex = 0;
         dofollow = 0;

This patch fixes PR#572, where htsearch crashed if CONTENT_LENGTH was
not set but REQUEST_METHOD was.

--- htdig-3.1.2.bak/htlib/cgi.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/cgi.cc Wed Aug 4 16:51:49 1999
@@ -67,7 +67,9 @@
     int n;
     char *buf;
-    n = atoi(getenv("CONTENT_LENGTH"));
+    buf = getenv("CONTENT_LENGTH");
+    if (!buf || !*buf || (n = atoi(buf)) <= 0)
+        return; // null query
     buf = new char[n + 1];
     read(0, buf, n);
     buf[n] = '\0';

This patch adds error messages for unknown hosts.

--- htdig-3.1.2/htdig/Document.cc.nohostmsg Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 13:48:03 1999
@@ -301,14 +301,22 @@ Document::RetrieveHTTP(time_t date)
         if (c.assign_port(proxy->port()) == NOTOK)
             return Document_not_found;
         if (c.assign_server(proxy->host()) == NOTOK)
+        {
+            if (debug)
+                cout << "Unknown proxy host: " << proxy->host() << endl;
             return Document_no_host;
+        }
     }
     else
     {
         if (c.assign_port(url->port()) == NOTOK)
             return Document_not_found;
         if (c.assign_server(url->host()) == NOTOK)
+        {
+            if (debug)
+                cout << "Unknown host: " << url->host() << endl;
             return Document_no_host;
+        }
     }
     if (c.connect(1) == NOTOK)


This patch fixes a bug in the PDF parser. When the Title header was just the
temporary file name, it wouldn't be used, but it also wouldn't be cleared from
the _parsedString variable, so it ended up polluting the document excerpt.

--- htdig-3.1.2/htdig/PDF.cc.orig Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/PDF.cc Tue May 25 12:01:43 1999
@@ -290,8 +290,8 @@ void PDF::parseNonTextLine(String &line)
                 _parsedString.get());
             _retriever->got_title(_parsedString);
-            _parsedString = 0;
         }
+        _parsedString = 0;
     }
 }


This fixes the infamous problem with files like left_index.html not getting
indexed. PR#543 & PR#585.

--- htdig-3.1.2/htlib/URL.cc.orig Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/URL.cc Fri Jun 11 12:24:40 1999
@@ -440,7 +440,7 @@ void URL::removeIndex(String &path)
         l.Release();
     }
     if (defaultdoc->hasPattern() &&
-        defaultdoc->FindFirstWord(path.sub(filename)) >= 0)
+        defaultdoc->CompareWord(path.sub(filename)))
         path.chop(path.length() - filename);
 }


Fix server_alias entries so port defaults to 80 if omitted.

--- htdig-3.1.2/htlib/URL.cc.old Fri Jul 30 14:51:32 1999
+++ htdig-3.1.2/htlib/URL.cc Fri Jul 30 16:57:35 1999
@@ -540,6 +540,11 @@ char *URL::signature()
 }
+//*****************************************************************************
+// void URL::ServerAlias()
+// Takes care of the server aliases, which attempt to simplify virtual
+// host problems
+//
 void URL::ServerAlias()
 {
     static Dictionary *serveraliases= 0;
@@ -547,6 +552,7 @@ void URL::ServerAlias()
     if (! serveraliases)
     {
         String l= config["server_aliases"];
+        String from, *to;
         serveraliases = new Dictionary();
         char *p = strtok(l, " \t");
         char *salias= NULL;
@@ -556,7 +562,13 @@ void URL::ServerAlias()
             if (! salias)
                 continue;
             *salias++= '\0';
-            serveraliases->Add(p, new String(salias));
+            from = p;
+            if (from.indexOf(':') == -1)
+                from.append(":80");
+            to= new String(salias);
+            if (to->indexOf(':') == -1)
+                to->append(":80");
+            serveraliases->Add(from.get(), to);
             // cout << "Alias: " << p << "->" << salias << "\n";
             // printf ("Alias: %s->%s\n", p, salias);
             p = strtok(0, " \t");


This patch fixes the HTML parser to decode SGML entities within tag
attributes.

--- htdig-3.1.2.bak/htdig/HTML.h Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/HTML.h Fri Jul 30 12:23:25 1999
@@ -72,6 +72,7 @@ private:
     // Helper functions
     //
     void do_tag(Retriever &, String &);
+    char *transSGML(char *);
 };
 #endif
--- htdig-3.1.2.bak/htdig/HTML.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 16:22:55 1999
@@ -544,7 +544,7 @@ HTML::do_tag(Retriever &retriever, Strin
         in_ref = 0;
     }
     delete href;
-    href = new URL(position, *base);
+    href = new URL(transSGML(position), *base);
     in_ref = 1;
     description = 0;
     position = q + 1;
@@ -595,7 +595,7 @@ HTML::do_tag(Retriever &retriever, Strin
         q++;
         *q = '\0';
     }
-    retriever.got_anchor(position);
+    retriever.got_anchor(transSGML(position));
     position = q + 1;
     break;
 }
@@ -704,7 +704,7 @@ HTML::do_tag(Retriever &retriever, Strin
         q++;
         *q = '\0';
     }
-    retriever.got_image(position);
+    retriever.got_image(transSGML(position));
     break;
 }
@@ -736,15 +736,15 @@ HTML::do_tag(Retriever &retriever, Strin
 }
 if (conf["htdig-email"])
 {
-    retriever.got_meta_email(conf["htdig-email"]);
+    retriever.got_meta_email(transSGML(conf["htdig-email"]));
 }
 if (conf["htdig-notification-date"])
 {
-    retriever.got_meta_notification(conf["htdig-notification-date"]);
+    retriever.got_meta_notification(transSGML(conf["htdig-notification-date"]));
 }
 if (conf["htdig-email-subject"])
 {
-    retriever.got_meta_subject(conf["htdig-email-subject"]);
+    retriever.got_meta_subject(transSGML(conf["htdig-email-subject"]));
 }
 if (conf["htdig-keywords"] || conf["keywords"])
 {
@@ -757,7 +757,7 @@ HTML::do_tag(Retriever &retriever, Strin
     char *keywords = conf["htdig-keywords"];
     if (!keywords)
         keywords = conf["keywords"];
-    char *w = strtok(keywords, " ,\t\r\n");
+    char *w = strtok(transSGML(keywords), " ,\t\r\n");
     while (w)
     {
         if (strlen(w) >= minimumWordLength)
@@ -783,7 +783,7 @@ HTML::do_tag(Retriever &retriever, Strin
     while (*qq && (*qq != ';') && (*qq != '"') && !isspace(*qq))qq++;
     *qq = 0;
-    URL *href = new URL(q, *base);
+    URL *href = new URL(transSGML(q), *base);
     // I don't know why anyone would do this, but hey...
     if (dofollow)
         retriever.got_href(*href, "");
@@ -811,7 +811,7 @@ HTML::do_tag(Retriever &retriever, Strin
     //
     // We need to do two things. First grab the description
     //
-    meta_dsc = conf["content"];
+    meta_dsc = transSGML(conf["content"]);
     if (meta_dsc.length() > max_meta_description_length)
         meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
     if (debug > 1)
@@ -824,7 +824,7 @@ HTML::do_tag(Retriever &retriever, Strin
     // (slot 11 is the new slot for this)
     //
-    char *w = strtok(conf["content"], " \t\r\n");
+    char *w = strtok(transSGML(conf["content"]), " \t\r\n");
     while (w)
     {
         if (strlen(w) >= minimumWordLength)
@@ -836,7 +836,7 @@ HTML::do_tag(Retriever &retriever, Strin
 if (keywordsMatch.CompareWord(cache))
 {
-    char *w = strtok(conf["content"], " ,\t\r\n");
+    char *w = strtok(transSGML(conf["content"]), " ,\t\r\n");
     while (w)
     {
         if (strlen(w) >= minimumWordLength)
@@ -847,15 +847,15 @@ HTML::do_tag(Retriever &retriever, Strin
 }
 else if (mystrcasecmp(cache, "htdig-email") == 0)
 {
-    retriever.got_meta_email(conf["content"]);
+    retriever.got_meta_email(transSGML(conf["content"]));
 }
 else if (mystrcasecmp(cache, "htdig-notification-date") == 0)
 {
-    retriever.got_meta_notification(conf["content"]);
+    retriever.got_meta_notification(transSGML(conf["content"]));
 }
 else if (mystrcasecmp(cache, "htdig-email-subject") == 0)
 {
-    retriever.got_meta_subject(conf["content"]);
+    retriever.got_meta_subject(transSGML(conf["content"]));
 }
 else if (mystrcasecmp(cache, "htdig-noindex") == 0)
 {
@@ -948,7 +948,7 @@ HTML::do_tag(Retriever &retriever, Strin
         *q = '\0';
     }
     delete href;
-    href = new URL(position, *base);
+    href = new URL(transSGML(position), *base);
     if (dofollow)
     {
         description = 0;
@@ -1016,7 +1016,7 @@ HTML::do_tag(Retriever &retriever, Strin
         *q = '\0';
     }
     delete href;
-    href = new URL(position, *base);
+    href = new URL(transSGML(position), *base);
     if (dofollow)
     {
         description = 0;
@@ -1085,7 +1085,7 @@ HTML::do_tag(Retriever &retriever, Strin
             q++;
             *q = '\0';
         }
-        URL tempBase(position, *base);
+        URL tempBase(transSGML(position), *base);
         *base = tempBase;
     }
 }
@@ -1095,4 +1095,25 @@ HTML::do_tag(Retriever &retriever, Strin
     default:
         return; // Nothing...
     }
+}
+
+
+//*****************************************************************************
+// char * HTML::transSGML(char *text)
+//
+char *
+HTML::transSGML(char *str)
+{
+    static String convert;
+    unsigned char *text = (unsigned char *)str;
+
+    convert = 0;
+    while (*text)
+    {
+        if (*text == '&')
+            convert << SGMLEntities::translateAndUpdate(text);
+        else
+            convert << *text++;
+    }
+    return convert.get();
 }


Fix PR#566 by setting the correct length of the string being matched.
'http://' is 7 characters. Submitted by .

--- htdig-3.1.2.bak/htlib/URL.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/URL.cc Fri Jul 30 14:51:32 1999
@@ -130,7 +130,7 @@ URL::URL(char *ref, URL &parent)
     while (isalpha(*p))
         p++;
     int hasService = (*p == ':');
-    if (hasService && ((strncmp(ref, "http://", 6) == 0) ||
+    if (hasService && ((strncmp(ref, "http://", 7) == 0) ||
                        (strncmp(ref, "http:", 5) != 0)))
     {
         //


Fixes problem with $(VAR) at end of template string not being expanded.

--- htdig-3.1.2/htsearch/Display.cc.varstatebug Fri Jul 30 14:24:05 1999
+++ htdig-3.1.2/htsearch/Display.cc Fri Jul 30 15:25:09 1999
@@ -822,7 +822,7 @@ Display::expandVariables(char *str)
         }
         str++;
     }
-    if (state == 5)
+    if (state == 2 || state == 5)
     {
         //
         // The end of string was reached, but we are still trying to

-------- 8< -------- snip -------- 8< --------

--
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7 (Canada)    Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig@htdig.org containing the single word unsubscribe in
the SUBJECT of the message.
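
For anyone wondering why the one-character change in the PR#566 patch above matters, here is a minimal sketch of the comparison on its own. The helper names has_http_prefix() and check_http_prefix_examples() are made up for this illustration and are not part of the ht://Dig source; only the strncmp() call and the lengths 6 and 7 come from the patch.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper for illustration only (not in the ht://Dig source):
// returns true when the first n characters of ref match the first n
// characters of "http://", mirroring the strncmp() test in URL::URL().
bool has_http_prefix(const char *ref, std::size_t n)
{
    return std::strncmp(ref, "http://", n) == 0;
}

// Hypothetical self-check of the behaviour described in the patch note.
void check_http_prefix_examples()
{
    // With n = 6 only "http:/" is compared, so a single-slash reference
    // is mistaken for one that carries the full "http://" prefix.
    assert(has_http_prefix("http:/dir/file.ext", 6));

    // With n = 7 the second slash is compared as well, so the same
    // reference no longer matches the prefix.
    assert(!has_http_prefix("http:/dir/file.ext", 7));

    // A reference with the full prefix matches at either length.
    assert(has_http_prefix("http://host/file.ext", 6));
    assert(has_http_prefix("http://host/file.ext", 7));
}
```

With the length of 6, a reference like 'http:/dir/file.ext' passed the first half of the condition even though it lacks the second slash. With 7 it fails that test, which appears to be what lets such references get resolved against the parent URL.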