From: esr@snark.thyrsus.com (Eric S. Raymond)
Newsgroups: comp.unix.sys5.r4,comp.unix.pc-clone.32bit,comp.bugs.sys5,news.answers,comp.answers
Subject: Known Bugs in the USL UNIX distribution
Followup-To: comp.unix.pc-clone.32bit
Approved: news-answers-request@MIT.Edu
Supersedes: <1mNyW2#M9B6mD216CrtH5dJCqc4MlOSk=esr@boojum.thyrsus.com>
Archive-name: usl-bugs
Last-update: $Date: 1994/11/21 21:59:40 $
Version: 19.0

*** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH ***

After this wrap-up posting, I will no longer be maintaining this FAQ. The
material in this version is probably somewhat out of date. I have switched
to using Linux, find it better than SVr4, and am therefore no longer
interested in SVr4. If you are an SVr4 fan, and you want to take over this
FAQ, send me email and I will give you the masters and my notes.

*** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH ***

Many FAQs, including this one, are available via FTP on the archive site
rtfm.mit.edu in the directory pub/usenet/news.answers. The name under which
this FAQ is archived appears in the Archive-name line above. This FAQ is
updated monthly; if you want the latest version, please query the archive
rather than emailing the overworked maintainer.

(If you email me questions that address gaps in the FAQ material, you will
probably get a reply that says "Sorry, everything I know about this topic is
in the Guide". If you find out the *answer* to such a question, please share
it with me for the Guide, so everyone can benefit.)

What's new in this issue:
   * New bug info (see below)
   * Extensive response to the bug list by ESIX

(In the table below, bugs new this issue are marked with a ** at the left
margin; old bugs for which information has been added are marked with *)

0. Table of Contents

   I. Introduction
   II. General Bugs
        1. UNIX kernel must lie below the 1024-cylinder mark
        2. Suid programs dump core when signalled
        3. DMAs on large ISA machines may fail
        4. There is a cylinder limit on disk size
        5. more(1) doesn't handle SIGWINCH
        6. X performance problem
        7. C shell background process termination logs you out
        8. A security hole in login
        9. COFF problems with long filenames
       10. Flakeouts in the Wangtek device driver
       11. A kernel declaration bug
       12. Reading tar archives with cpio foos up on multiply-linked files
       13. Process accounting is broken
       14. tar(1) foos up in the presence of symbolic links
       15. Symbolic links can interfere with shellscript execution
       16. Piping a csh builtin causes the shell to hang.
       17. tar(1) fails to restore adjacent symbolic links properly
       18. COFF binaries linked with curses(3) and shared libc hang
       19. shl hangs, sxt devices bad
       20. num-lock prevents mouse from working properly
       21. adjtime() doesn't work
       23. cron mail doesn't go through aliasing
       24. fragility in xterm
       25. csh lossage due to bad optimization
       26. Bug in cp(1)
       27. tbl -me doesn't work
       28. who -r fragility leads to boot-time problems
       29. at(1) breaks here-documents in shell scripts
       30. UHC mouse driver ignores the middle button.
       31. mmap access doesn't update file mod times
       32. AT&T select(2) is incompatible with BSD select(2)
       33. (4.2) The login program requires its PPID to be 1
       34. (4.2) Bad MAXMINOR values can make the system unbootable
       35. Incompatible change in TZ interpretation
       36. Nulls in pixmaps can crash X
       37. Potential security hole in SVr4s using sendmail
       38. Reporting bug in df on non-root filesystems
       39. tar writes -v output to stdout, not stderr
       40. SIGPIPE is delayed and not reliable
       41. /usr/lib/acct/fwtmp doesn't work
       42. whatis database is full of garbage.
*      43. (4.0 & 4.2) mmap is seriously broken
       44. a bug in xterm
       45. DrawText16() bug in XWIN
       46. output redirection with exec fails in sh
       47. rm fails to reject . or .. arguments
       48. bc/dc divide is buggy
       49. (4.2) pkgrm with no arguments nukes all packages without confirmation
       50. (4.2) pkgmk with no -p option loses
   III. Serial-port and tty administration problems
        1. Dropout problems with tty devices
        2. Quick port setup option in sysadm is broken
        3. ttymon drops DTR when it shouldn't
        4. ttymon doesn't drop DTR when it should
        5. (4.2) Terminating cu to a direct line locks up the port
        6. Hardware flow control bug breaks streaming data transfers
        7. Bad interaction between ttymon and networking
*       8. Bogus restriction to 2 users
   IV. Networking and File-Sharing Bugs
        1. NFS locking is unusably slow
        2. UFS file system problems
        3. Byte-order problem with NFS when accessing Sun disks
        4. Under weird circumstances, lseek on UFS may cause corruption
        5. FTP problems
        6. A bug in the WD80x3 support
        7. Security hole near fingerd
        8. Fatal bug in priority-band message handling.
        9. SVr4.0.4 TCP/IP routing is broken
       10. df(1) on NFS volumes returns bad data
       11. rsh hogs the processor
       12. MTU for remote networks ignored
       13. Bug in remote printing.
   V. SCSI Support Problems
        1. sar is confused by SCSI
        2. A configuration problem
        3. Synchronous SCSI hang problem
        4. ps chokes on commands that do SCSI I/O
        5. Transfer speed problems with Adaptec 1542B on 486s
        6. df gives inaccurate values for large SCSI partitions
   VI. Development Tools Problems
        1. General UCB library brokenness
        2. USL emulation of BSD signals doesn't work
        3. Possible string library problems
        4. USL's ndbm support is broken.
        5. An include file is missing
        6. sscanf(3) has a potential bug
        7. shmat(2) vs. vfork(2)
        8. FIONREAD fails on regular files
        9. fread(3) does the wrong thing on pipes and FIFOs
       10. putw appears to be broken
       11. Compiler problems
       12. getlogin() doesn't work
       13. syslog routines don't work
       14. Bogus `r' in xt driver configuration flags
       15. ioctl for kernel symbol fetches fails (4.2)
       16. Bug in cc optimizer (4.2.1)
       17. /usr/ucb/install uses missing group "staff"
**     18. sigsetjmp calls may lose due to header error
   VII. The FUBYTE Problem

I. Introduction

This posting lists known bugs in System V Release 4 implementations, and
known fixes applied by various porting houses (there are also random bits of
information about SCO UNIX here and there). It was formerly part of the
386-buyers-faq issues 1.0 through 4.0, and is still best read in conjunction
with the pc-unix/software FAQ descended from that posting.

This document is maintained and periodically updated as a service to the net
by Eric S. Raymond <esr@snark.thyrsus.com>, who began it for the very best
self-interested reason that he was in the market and didn't believe in
plonking down several grand without doing his homework first (no, I don't
get paid for this, though I have had a bunch of free software and hardware
dumped on me as a result of it!). Corrections, updates, and all pertinent
information are welcomed at that address.

This posting is periodically broadcast to the USENET group
comp.unix.sysv386 and to a list of vendor addresses. If you are a vendor
representative, please check to make sure the information on your company is
current and correct. If it is not, please email me a correction ASAP. If you
are a knowledgeable user of any of these products, please send me a precis
of your experiences for the improvement of future issues.

The bug descriptions often include indications of fixes by the various
porting houses to their current releases. These are:

   Consensys UNIX Version 1.3                  abbreviated as "Cons" below
   Dell UNIX Issue 2.2                         abbreviated as "Dell" below
   Esix Revision A                             abbreviated as "Esix" below
   Micro Station Technology SVr4 UNIX          abbreviated as "MST" below
   Microport System V Release 4.0 version 4    abbreviated as "uPort" below
   UHC Version 3.6                             abbreviated as "UHC" below
   SCO Open DeskTop 1.1                        abbreviated as "SCO" below

II. General Bugs

1. UNIX kernel must lie below the 1024-cylinder mark

Bela Lubkin says "SCO's boot filesystem must lie below 1024 cylinder mark;
anything else can be anywhere.
This is more-or-less a limitation of the BIOS interface that the bootstrap
loader must use. Could be circumvented by going directly to controller
hardware in the bootstrap loader, but that would be horrendously complex
with all the controllers & host adapters to be supported."

Actually this is not quite right. It's the *kernel* that must lie below the
1K-cylinder mark; the rest of the root partition could extend above it. But
since partition endpoints are the only way to control where physical blocks
get allocated, it comes to the same thing.

Roger Knopf adds: "The 1024 cylinder limit applies not only to the kernel
but also to /boot. Both are read in while we are using the BIOS to talk to
the hard disk. There are 10 bits set aside in the register for cylinders in
the INT 13 call, hence 1024 cylinders. There are a few controllers that
allocate 2 more bits (they are taken away from the space allocated for head
bits, I recall). It is trivial to modify all the relevant boot code to use
these bits IF YOU KNOW THAT THE CONTROLLER WILL USE THEM but I know of no
way to reliably determine that this is the case. Once the kernel is loaded
we use 16 bits everywhere to hold the cylinder number."

2. Suid programs dump core when signalled

Mark Snitily of SGCS says that under many SVr4s, signalling a process that
is running suid root will cause it to core-dump. He says Dell and MST have
fixed this, and SCO doesn't suffer from this.

Esix says: Bug does not exist in Esix 4.0.4.1.

3. DMAs on large ISA machines may fail

On ISA machines with more than 16MB of RAM, SVr4 may try to do DMA from
outside the bus's address space, causing serious problems. UNIX ought to do
an in-memory copy to within the low 16MB but the USL base code doesn't.

Dell says they've fixed this, and that's been confirmed by a user.

UHC says they've fixed this; they add that the special buffer-allocation
logic to handle the problem can be turned off with a tunable kernel
parameter if you've got less than 16M.
Microport says they've fixed this in their new 4.1 release, shipping early
March.

Esix offers a 4.0.4 patch to correct this problem; the patch is integrated
into 4.0.4.1.

SCO used to have a similar bug but fixed it long ago.

John Sully writes: "This was due to a bug in pre version 4 dma code. The USL
code has always at least attempted to do a copy from low memory to high
memory on systems with more than 16Mb of RAM. By the way UHC is wrong; the
buffer allocation code only comes into play if you have more than 16Mb of
memory. You can turn it off if you have a machine (ie. an EISA bus) which
will allow you to do DMA above 16Mb. You *must* have this tunable
(MAXDMAPAGE) turned on if you are using *ISA* bus masters in a system with
more than 16Mb of ram. Unfortunately doing this will affect all drivers
which do dma as there is no good way to do this on a per-driver basis."

4. There is a cylinder limit on disk size

Stock USL code is limited to 1,024 cylinders per Winchester, which might
cause problems with some disk drives. Microport, Dell, Esix, MST, and UHC
have fixed this.

5. more(1) doesn't handle SIGWINCH

It doesn't get its window size from the stty/termio structures, so it
doesn't cope with SIGWINCH properly.

Esix has this bug under active investigation now.

6. X performance problem

Stock X11R4 and R5 (at least prior to 1.2E) is said to hog the processor if
you use the LOCALCONNECT option. Jan Brittenson posted the following
workaround:

   I don't know what causes the standard X server to hog the CPU, but it
   can be avoided. Use the following program instead of xinit. Compile it
   with `$CC -O -o xserv xserv.c -lX11' where CC is either /usr/ccs/bin/cc
   or gcc. Set DISPLAY and XINITRC and run `xserv' from your home
   directory. This is just a q&d hack, and not really a substitute for
   xinit -- but it works.

/* xserv.c -- start X server

   Start X server.
   Similar to xinit, but intended to circumvent the X386 CPU Hog Mode

   Jan Brittenson, June 2 1992 05:15 am
   with corrections by Adam Donnison Tue, 2 Mar 1993 */

/* The header names did not survive in this copy of the posting; the
   nine below are an assumption, chosen as the ones the code needs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <setjmp.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <X11/Xlib.h>

/* This may need to be "/usr/X386/bin/X386" */
#define DEFAULT_XPATH "/usr/bin/X11/X"

/* Start X server.

   Fork-exec server, passing the DISPLAY environment variable. Wait
   for server to get up and running (at which point it passes back a
   SIGUSR1), at which point the user xinitrc file is run. */

#define XINITRC ".xinitrc"
#define DEFAULT_XCOMMAND "xterm -g +1+1 -n login -display :0"

/* malloc, free, getenv, strcpy and errno are declared by the headers
   above, so the old extern declarations are not repeated here. */

/* X stuff */
Display *top_display;

/* This is supposed to be in libgen.a... */
static char *basename (s0)
     char *s0;
{
  register char *s1;

  for (s1 = s0 + strlen (s0) - 1; s1 > s0 && *s1 != '/'; s1--);
  if (*s1 == '/')
    return s1+1;
  return s1;
}

jmp_buf sigusr1_frame;

static void caught_sigusr1 (int dummy)
{
  longjmp (sigusr1_frame, !0);
}

/* Return the part of a display name starting at its last ':' */
static char *dispname (s0)
     char *s0;
{
  register char *s1;

  for (s1 = s0 + strlen (s0) - 1; s1 > s0 && *s1 != ':'; s1--);
  return s1;
}

/* No arguments */
int main (argc, argv)
     int argc;
     char **argv;
{
  char *xserver_file, *xinitrc_file, *home_path, *display, *display_X_arg;
  int xserver_pid, orgmask;

  /* Not that it really matters, just to avoid being used as a direct
     replacement for xinit. */
  if (argc != 1)
    {
      fprintf (stderr, "usage: %s\n", basename (*argv));
      exit (1);
    }

  /* Resolve xinitrc path. This is done before the server is started.
   */
  if (!(home_path = getenv ("HOME")))
    home_path = "/etc";

  if (!(xinitrc_file = getenv ("XINITRC")))
    {
      xinitrc_file = malloc (strlen (home_path) + 1 + strlen (XINITRC) + 1);
      sprintf (xinitrc_file, "%s/%s", home_path, XINITRC);
    }
  else
    xinitrc_file = strdup (xinitrc_file);

  /* Resolve display */
  if (!(display = getenv ("DISPLAY")))
    display = display_X_arg = ":0.0";
  else
    display_X_arg = dispname (display);

  /* Tell server to notify us when up and running */
  signal (SIGUSR1, SIG_IGN);
  orgmask = sigblock (sigmask (SIGUSR1));

  /* Start server */
  if (!(xserver_pid = vfork ()))
    {
      xserver_file = DEFAULT_XPATH;
      execl (xserver_file, xserver_file, display_X_arg, (char *) NULL);
      fprintf (stderr, "%s: can't exec %s (errno = %d) -- start-up aborted\n",
               basename (*argv), xserver_file, errno);
      /* A vfork child must leave with _exit, not exit */
      _exit (1);
    }

  if (xserver_pid < 0)
    {
      fprintf (stderr, "%s: can't fork (errno = %d) -- start-up aborted\n",
               basename (*argv), errno);
      exit (1);
    }

  /* Await signal from server */
#if 0
  /* Why the #@$*! doesn't this work?! */
  sigsetmask (orgmask);
  alarm (20);
  sigpause (sigmask (SIGUSR1) | sigmask (SIGALRM));
#else
  sleep (5);
#endif

  /* Open display */
  if (!(top_display = XOpenDisplay (display)))
    {
      fprintf (stderr, "%s: unable to open display '%s' -- start-up aborted\n",
               basename (*argv), display);
      exit (1);
    }

  /* Execute xinitrc file */
  if (system (xinitrc_file) < 0)
    system (DEFAULT_XCOMMAND);

  /* Close display */
  XCloseDisplay (top_display);

  /* Terminate server */
  kill (xserver_pid, SIGTERM);

  /* Finished */
  free (xinitrc_file);
  return 0;
}

Esix has fixed this in 4.0.4.1.

7. C shell background process termination logs you out

In C shell, unless "ignoreeof" is set, termination of a background process
will log you out. With "ignoreeof" set, just the message "Use logout to
exit" will be printed.

This was present in Esix 4.0.4 but has been fixed in 4.0.4.1.

8. A security hole in login

David Wexelblat reports: "There is a HUGE security hole in /bin/login in all
USL derived SVR4s before 4.0.4.
Refer to CERT advisory CA-91:08, dated 5/23/91. This is known to be present
in AT&T SVR4 2.1, and Microport SVR4 3.1. ESIX claims to have fixed it,
Microport reports that it is fixed in 4.1. I won't give any more details
unless necessary. Suffice to say that this bug allows any non-privileged
user on an SVR4 system to get read-write access to any file on the system."

There is an official USL patch out for this, which is incorporated in
4.0.4.1.

9. COFF problems with long filenames

A source at Dell urges: "Our SVR4v2 did some stuff that USL didn't get
around to until SVR4v4. Try Dell UNIX 2.1 with a COFF program on a large UFS
filesystem in a directory with long names. Runs on Dell UNIX. Breaks on
others." I don't have more definite info yet.

This bug is not present in Esix 4.0.4.1, according to Esix.

10. Flakeouts in the Wangtek device driver

Dell reports that USL's Wangtek device driver is seriously flaky. "How'd you
like a multi volume backup where the second and subsequent volumes don't
follow on from the previous volumes?" UHC confirms this and is actively
working on the problem.

An anonymous SCOer says "The QIC02 tape controller `standard' is seriously
flaky. Our driver's in pretty good shape but nobody will ever have a truly
solid driver that supports every QIC02 controller you can find."

Gordon Ross reports: "Actually, the SCSI tape target driver `st01' has a
similar problem at version 4.0.3 which I corrected while I worked on the
SVR4 code. The correction was provided to the support group at USL. The
actual problem was that the SCSI tape would return a `check status'
completion code which was just trying to inform the driver of the arrival of
the `logical end of media' indication but the driver was treating it as an
error. The tape drive had in fact written the data, but the driver
incorrectly assumed that the "check status" return meant that it failed.
The result of this is that when you write into the end of the tape, you can
read back one more "chunk" than you wrote. Of course, cpio does not like
this at all when doing multi-volume backups..."

This was present in Esix 4.0.4 but has been fixed in 4.0.4.1. Esix rewrote
the driver.

11. A kernel declaration bug

A botch in USL's /etc/conf/pack.d/kernel/space.c (which is present in
Consensys 1.3, Dell 2.1, Esix 4.0.3A, Microport 4.0.3 and 4.0.4 and may also
be present in other SVr4s) can step on the linesw[] table. The problem is
that the domain name array initialization is wrong and too short; thus, when
it's set, data past the end of the array can be stomped. To fix this, find
the following near line 247:

    char srpc_domain[] = SRPC_DOMAIN;

and change it to

    char srpc_domain[SYS_NMLN] = SRPC_DOMAIN;

then rebuild the kernel.

Microport officially knows about this bug and plans to fix it in a
near-future update release. It has been fixed in Dell 2.2. Correct array
initialization has been put in for Esix 4.0.4.1.

12. Reading tar archives with cpio foos up on multiply-linked files

Paul De Bra reports the following: In theory, cpio(1) is supposed to be able
to read tar(1) archives. In practice...don't try it. Multiply-linked files
will be extracted from the archive, whether or not they match the current
pattern and whether or not you have selected 'u'. This happens even if you
use the `t' option, so it's not even safe to list the archive files!

Esix says 'cpio' is fixed to better handle 'tar' file type in Esix 4.0.4.1.

13. Process accounting is broken

In 4.0.3, process accounting doesn't work. From examining the accounting
scripts, it appears that /usr/lib/acct/accton is supposed to set a return
code depending on whether accounting was switched on already or not.
However, it always returns the same result - accounting switched off.
This means that the /usr/lib/acct/ckpacct script, which is run every hour to
keep the process accounting log in check, instead turns off accounting the
first time it is run after booting. The same happens with the nightly
/usr/lib/acct/monacct program.

I don't yet know whether this bug is present in 4.0.4. It is definitely
un-fixed in Dell 2.1 and Consensys 1.3. In Dell 2.2 the return bug is fixed,
but accounting isn't automatically enabled at boot time.

This was present in Esix 4.0.4 but has been fixed in 4.0.4.1.

14. tar(1) foos up in the presence of symbolic links

Tar can get the names of symbolic links wrong when creating an archive. This
bug can be demonstrated by doing the following:

    mkdir t
    cd t
    touch a 1234567890
    ln -s 1234567890 b
    ln -s a c
    tar vcf ../t.tar .

The output generated by tar is:

    a ./ 0 tape blocks
    a ./a 0 tape blocks
    a ./1234567890 0 tape blocks
    a ./b symbolic link to 1234567890
    a ./c symbolic link to a234567890

(Note the above commands should be done in the order shown and in a new
directory)

This bug is nasty. Recommended solution: use GNU tar. This is reported from
Esix 4.0.3 and Consensys 1.3, but probably exists on other SVr4s as well. It
has been fixed in Dell 2.2. It doesn't seem to be in Esix 4.0.4 or 4.0.4.1.

15. Symbolic links can interfere with shellscript execution

There is a problem running #! scripts when symbolic links are involved.
Typing in the following from a command shell demonstrates the problem:

    mkdir a b
    ln -s a c
    cd a
    cat > script <

[...]

17. tar(1) fails to restore adjacent symbolic links properly

[...] reports: SVR4 tar has another strange bug. Seems if when restoring
files, you restore one file that is a link, say "a ->/a/b/c/d/e" and there
is another link just after it called "b ->/a/b/c" tar will restore it as
"b ->/a/b/c/d/e". This just seems to be a lack of the NULL at the end of the
string, like someone did a memmove or memcpy(dest,src,strlen(src)); where it
should be strlen(src)+1 to include the NULL.

Esix cannot reproduce this under 4.0.4 or 4.0.4.1; they think it's fixed.

18. COFF binaries linked with curses(3) and shared libc hang

...eating the CPU. Cause unknown.

Esix cannot reproduce this under 4.0.4 or 4.0.4.1.

19. shl hangs, sxt devices bad

shl(1) does not work. Try creating a layer and doing an 'ls'. Your session
hangs. Bruce Momjian, who reported this bug, says he believes it is the sxt
devices which are broken. It definitely exists in Consensys 1.3.

Esix cannot reproduce this under 4.0.4 or 4.0.4.1.

20. num-lock prevents mouse from working properly

When using the Motif window manager, if your num lock is on, your mouse
clicks are not recognized by the window manager. The mouse still works in
xterm(1). This is allegedly fixed in Destiny (4.2). Under Dell 2.2 if num
lock is on there's no problem, but if scroll lock is on then mouse clicks
aren't recognised.

The Esix 4.0.4 X server was modified to solve that problem; the Esix 4.0.4.1
X server has no such problem.

21. adjtime() doesn't work

Hugh Stearns reports that in 4.0.3.6 adjtime() doesn't. Calling `date -a'
works to adjust the time slowly.

Fixed in Esix 4.0.4.1.

23. cron mail doesn't go through aliasing

Hugh Stearns reports that in 4.0.3.6 cron mail to adm doesn't get redirected
by the aliases file.

Esix is investigating this bug.

24. fragility in xterm

Hugh Stearns reports that in 4.0.3.6, doing ~! from a cu in xterm kills
xterm. This has been fixed in Dell 2.2.

This bug cannot be duplicated in Esix 4.0.4.1.

25. csh lossage due to bad optimization

If a csh user sources a non-existent file in their .cshrc (eg, source
.alias, where .alias doesn't exist), then the system will hang for a couple
of minutes. Eventually the user gets an "Out of memory" error and the
console logs "NOTICE: out of swap space - Insufficient memory to allocate 2
pages - system call failed". This appears to be due to over-optimization of
code surrounding a longjmp call.

This bug cannot be duplicated in Esix 4.0.4.1.

(There are numerous other reports of memory leak bugs in csh).

26. Bug in cp(1)

If ``copy'' encounters a directory before a file, it dumps core ...

--- cut ---
cd /tmp
mkdir copybug jnk
cd jnk
mkdir directory
>file
cp -r * /tmp/copybug
--- cut ---

This was reported from Consensys 4.0.3 but is probably a generic SVr4 bug.
It appears to have been fixed in ESIX SVR4.0.3A and Dell 2.2.

It cannot be duplicated in Esix 4.0.4.1.

27. tbl -me doesn't work

Wolfgang Denk reports that trying to use "tbl -me" for any input file causes
tbl to quit. The problem is that newer tbl versions don't accept [nt]roff
control lines (".rm @W") after .TS.

Esix has this under investigation.

28. who -r fragility leads to boot-time problems

It coredumps if the name of the timezone (TZ) is longer than three
characters and the length is a multiple of four.

This can be a real problem for European sites... and is potentially more
hazardous than immediately apparent as _a lot_ of the initialization scripts
(rc1.d, rc2.d) use ``who -r'' to see if the machine is in single- or
multi-user mode. And when ``who'' bombs out, the ``set'' command is given an
empty command-line and can't do much else than print the shell variables,
$1-$9 remain empty ... meaning that more or less all the scripts fail in
various ways and the system has an exceptionally hard time coming up.

Peter Wemm reports that this bug was present in Dell 2.0, fixed in Dell 2.1,
but reappeared in Dell 2.2. Dell says it's a generic USL bug. Esix says they
fixed it in 4.0.4.1.

There is an easy workaround; make sure /etc/inittab is an odd number of
characters long. The bug is caused by an off-by-one in a buffer malloc.

29. at(1) breaks here-documents in shell scripts

at adds gratuitous empty lines to the job submitted by the user. This
prevents shell here-documents from working.

30. UHC mouse driver ignores the middle button

This may be a generic USL problem, but Dell (at least) has fixed it. UHC
says they have a patch for it, but I haven't seen the patch.

Doesn't occur in Esix 4.0.4.1.

31. mmap access doesn't update file mod times

Peter Wemm reports that under SVr4, if one mmap()'s a file, and writes to it
via the mapped memory, when the disk is updated, the modification time does
not update.

Esix says "Fixed. A problem in mmap TLB has been corrected in Esix 4.0.4.1"

32. AT&T select(2) is incompatible with BSD select(2)

Paul Eggert, as quoted by James Buster, reports:

The select() system call waits for read, write, or exception activity on a
set of file descriptors, and yields an integer telling you how much activity
it found. BSD's select(N,&R,&W,&E,&T) can yield up to 3*N, because BSD's
select() counts the number of bits that it turns on in the R, W, and E
arguments, and R, W, and E each contain one bit per file descriptor.
However, System V Release 4 v2.1's select(N,&R,&W,&E,&T) yields at most N,
because SVR4's select() just counts the number of active file descriptors,
regardless of how many bits it turns on.

For example, the following code checks file descriptor 0. In BSD, this code
can set n to 2 if file descriptor 0 is ready for both reading and writing.
However, in SVR4, this code sets n to at most 1, because only file
descriptor 0 is active.

    int n;
    fd_set r, w;
    FD_ZERO(&r);
    FD_SET(0, &r);
    FD_ZERO(&w);
    FD_SET(0, &w);
    n = select(1, &r, &w, (fd_set*)0, (struct timeval*)0);

At least one widely used piece of software depends on the BSD behavior,
namely X11R5 (see Xt/NextEvent.c). In this application, the bug's symptoms
are subtle and are rarely encountered, but they do exist.

Most of X11R5's calls to select() don't care about this difference, but the
following files in the X11R5 distribution contain calls to select() that may
be affected by this bug:

    contrib/lib/i18nXView2/lib/libxview/notify/ndetselect.c
    contrib/lib/xview3/lib/libxview/notify/ndetselect.c
    mit/fonts/server/os/waitfor.c
    mit/lib/Xt/NextEvent.c
    mit/server/os/WaitFor.c

(Note: this is a very old bug.
Paul Eggert tells me that William Kucharski reported this bug to AT&T in
1989 when he ported X11R3!)

Esix says they have this bug under investigation.

33. (4.2) The login program requires its PPID to be 1

Rick Richardson reports: "The "/bin/login" program has been changed to be
hardwired to require its PPID to be "1". In all other versions of UNIX, it
is sufficient that there be an /etc/utmp entry. This bug was reported to
USL, and I did get a fixed "login" program from them, but the fix did not
make it into the release. I don't know how mere mortals get the fix at this
point."

34. (4.2) Bad MAXMINOR values can make the system unbootable

Rick Richardson reports: "If MAXMINOR is stune'ed to the maximum value,
0x3ffff (18 bits), then the kernel will refuse to boot, cycling up to driver
initialization and then doing a processor reset. Interestingly, this bug was
not in the beta release, but was in the final release."

Esix has this one under investigation. They say it only happens when kdb is
installed.

35. (4.2) Incompatible change in TZ interpretation

Rick Richardson reports: "While not really a bug, this is a surprise. In
4.2, the TZ variable was given a new meaning. Rather than the traditional
CST6CDT type of value, it now looks like ":US/Central". This causes 3.2 and
4.0 binaries which use the date/time routines to report GMT time. I have no
idea why another variable name was not chosen. I've taken to aliasing the
binaries, e.g. "TZ=CST6CDT svr4binary"."

Mike "Ford" Ditto corrects this. "This change was made in 4.0, not 4.2, and
4.0 binaries should have no problem with the new format. Some 4.0 systems
use the new format by default. The old format should be avoided unless SVR3
binaries are in use, since the new features of the time conversion libraries
are only available if the new format is used."

Christoph Badura points out that the time functions still read the old TZ
format, so you can set TZ=CST6CDT or whatever and only the new features will
be disabled.

36. Nulls in pixmaps can crash X

Rick Richardson reports: "Displaying XPM2 pixmaps which have NULLS in them
will crash the X server. Admittedly, this is not much of a bug, since these
are ill-formed or corrupted pixmaps. But the server should stay up, even in
these conditions. A little error checking needed."

37. Potential security hole in SVr4s using sendmail

Christoph Badura writes: "/usr/ucblib/aliases contains an alias for decode
that feeds straight into uudecode. I don't know under what uid uudecode gets
invoked, but if it's root anyone can overwrite any file on a SVR4 system
running the stock sendmail. [Under Dell UNIX] it appears that the files get
created with a user-ID of "daemon". Not nice but better than root."

Esix has this under investigation.

38. Reporting bug in df on non-root filesystems

Paul De Bra discovered that if df(1) is run on a filesystem other than root
with an argument of `.', the file system name is always reported as '/'.
This does *not* happen if you give it $PWD as argument. This bug is present
in Dell 2.2.

Esix says "Fixed. df has been modified in the area of identifying current
directory to correct this problem in Esix 4.0.4.1"

39. tar writes -v output to stdout, not stderr

This is an incompatible, undocumented change from earlier UNIXes and royally
screws up invocations like

    /bin/tar cvf - foo | /bin/tar tf -

that previously worked. Observed in ESIX 4.0.3A and 4.0.4, Dell 2.2;
probably generic. It also existed in SCO ODT and Xenix before 2.0 and 3.2v4,
but has been fixed in these most recent versions.

Esix has fixed this, restoring the old behavior, in 4.0.4.1.

40. SIGPIPE is delayed and not reliable

Wolfgang Denk reports a kernel bug in src/uts/i386/fs/fifofs/fifovnops.c
that results in SIGPIPE not getting raised immediately by failed writes.
You can reproduce this with the following program:

     1  #include <stdio.h>
     2  #include <signal.h>
     3
     4  extern int errno;
     5
     6  int sp();
     7
     8  int eop = 0;
     9
    10  char *line = "This is garbage.\n";
    11
    12  main () {
    13      int i;
    14      int l = strlen (line);
    15
    16      signal (SIGPIPE, sp);
    17      for (;;) {
    18  /*
    19          for (i=0; i<10000; ++i) ;
    20  */
    21          if (write(1, line, l) != l) {
    22              fprintf (stderr, "write error, errno=%d, eop=%d\n",
    23                       errno, eop);
    24              fflush (stderr);
    25              exit (errno);
    26          }
    27      }
    28  }
    29
    30  int sp()
    31  {
    32      fprintf (stderr, "SIGPIPE\n");
    33      fflush (stderr);
    34      eop = 1;
    35  }

To test this, pipe its output to ls.

He writes: "That is, you can't be sure that SIGPIPE will be raised when a
pipe breaks. Adding a short delay (for instance by uncommenting the for loop
around line 19) gives _always_ SIGPIPE -- but usually you don't want to have
additional delays in your program :-("

Bernard Fouche observes that this is not necessarily a bug. He writes:
"Compile your example with the following change:

   - do not include your delay loop.
   - add a line between line 24 and 25. This line will be: sleep(60);
     This change will make a.out stay alive for 1 minute before exiting.
   - recompile, run with 'a.out|ls'.
   - do 'ps -le |grep a.out'.

What you'll see is that a.out is now running in the background and its
father is init(1)! So the return value of write(2) (EIO) can now be
understood. The only thing that I can tell is that pipes, that are now based
on streams in SVR4, have a more complex behavior than in SVR3.2 but I would
not call problem #40 a 'bug'. It can be related to the shell that ran the
command and/or the scheduler and/or the stream subsystem."

Esix has this under investigation.

41. /usr/lib/acct/fwtmp doesn't work

John F. Haugh reports that under Dell UNIX the /usr/lib/acct/fwtmp command
does not work as described in the man page; the output contains no line
feeds and appears to be garbage. I have verified this. This is probably a
generic SVr4 bug.

Esix says "Fixed.
An output format error in fwtmp has been corrected in Esix 4.0.4.1"

42. whatis database is full of garbage.

Raymond Nijssen reports: "Both under ESIX 4.0.3 and 4.0.4, whatis database contains an awful lot of garbage, such as nroff macros. In addition, quite a lot of man pages mentioned are missing, and several available man pages are not mentioned. Since makewhatis is broken (at least under 4.0.3A), this cannot be repaired easily. ESIX blamed USL for this."

Esix says "Verified. Problem has been identified but could not be resolved immediately, and will continue to be worked on."

43. (4.0 & 4.2) mmap is seriously broken

(thanks to Peter Wemm for a detailed report.)

ALL SVR4.0s have/had a nasty kernel bug that causes seemingly random executable and shared library corruption, and also unleashes a SERIOUS security bug. The "Copy-on-Write" mechanism within the kernel has bugs. It is sufficient to say that the security related bug allows any user with shell and compiler access to WRITE to any file that they can read.

SVR4.2 has been fixed for some time. ICL apparently fixed it in their SPARC reference port (and x86 port), which means that Solaris 2.x does not have the bugs.

The most common symptom of shared library corruption is that programs simply core dump when you attempt to access a nonexistent file.

On the other hand, Mark Abene reports: "Not long ago, I posted about my kernel panic woes with crashes at kmem_alloc(). It turned out that IRC (version 2.2.1.2) was the culprit. It appears that SVR4.2's malloc and mmap/brk's are seriously fucked. One would run IRC, /quit, and the system would either hang immediately or panic in a minute. Malloc/free calls seemed to be in order, and I even tried linking in GNU's malloc, which changed nothing. It seriously bothers me that anyone at all can compile something that should execute fine, yet can cause system crashes."

    $ more /notexisting
    Segmentation Fault (core dumped).
To recover from this, restore /usr/lib/libc.so.1 from the distribution media. The security bugs have no known workaround, other than crippling the mmap() function in the kernel.

Esix has a patched vm.o on their BBS that fixes this bug. This patch has been integrated into 4.0.4.1.

Dell has produced a fix for their release 2.2 systems. The patch is available from dell1.dell.com:/support2.2/CoW.t

Although it has not been tested, it is very unlikely that Dell's patch will work on any other SVR4/386, as it replaces two kernel modules, and Dell's kernel has autoconfiguration extensions that are not present in other systems.

Dell 2.2 has a STREAMS optimizer function enabled in the system that joins together small adjacent streams messages. There were bugs in the early USL versions of this, but for 2.2, Dell enabled it after applying a fix from USL. It seems that in some rare circumstances, some machines are quite unstable with this enabled as default. support2.2/CoW.t also disables the optimization to improve stability. This brings Dell 2.2 into line with the other SVR4.0.4 systems.

44. a bug in xterm

Nickolay Saukh reports ""

45. DrawText16() bug in XWIN

Nickolay Saukh reports "xterm strips off the eighth bit of first character in line. This bug was present in x11r5 but fixed by some patch. I have no exact info under my thumb."

(Can anyone else confirm this bug?)

46. output redirection with exec fails in sh

Andreas Luik reports: "In Bourne shell scripts, the output of all following commands may be redirected using the "exec" builtin with an output redirection, e.g.

    exec > LOG

If such a construct is used in a for loop with a variable filename for the redirection, e.g. exec > $f, only the first output redirection is executed in the SVR4 /bin/sh. It works correctly in /bin/ksh as well as in the HPUX, SunOS 4.1 and AIX Bourne shells."

47. rm fails to reject . or .. arguments

Andreas Luik reports: "rm does not check for `.' and `..' arguments.
The rm program should check for the arguments `.' and `..' (at least if called with the -r option) and ignore these arguments with the message "rm: cannot remove `.' or `..'". All implementations I'm aware of perform this check. As far as I know, this check is also in the SVR4 sources but implemented incorrectly. This bug should be fixed for security reasons."

48. bc/dc divide is buggy

A recent paper,

    Error in Unix Commands dc and bc for Multiple-Precision-Arithmetic
    Ingo Dittmer*
    December 24, 1992
    * dittmer@OsFhRz70.rz.fh-osnabrueck.de
    SIGNUM Newsletter April 1993 pages 8-13

reveals errors in division and modulus in the SVR4 and pre-SVR4 versions of bc and dc. For example, try:

    28420950579078013018256253301 / 17987947258

or

    86833646827370 % 9980035577

The underlying bug is in dc's divide function. All commercial dcs seem to manifest variants of the bug. The bug doesn't exist in the latest Research Unix, or in the GNU bc.

49. (4.2) pkgrm with no arguments nukes all packages without confirmation

Benson I. Margulies writes: "pkgrm with no arguments removes ALL packages, with no confirmation requested."

50. (4.2) pkgmk with no -p option loses

Benson I. Margulies writes "pkgmk without the -p argument produces a completely bogus stamp containing an error message."

III. Serial-port and tty administration problems

1. Dropout problems with tty devices

The most serious problem anyone has reported is that the USL asy driver is flaky and occasionally drops characters at above 4800 baud. Microport, Dell, and UHC say that they believe they've fixed this. However, Dell, at least, was mistaken when they first made this claim; a more detailed description of the problem is given below.

Bela Lubkin at SCO comments "386 interrupt latency vs. unbuffered UARTs. This is a tough problem. Nobody's driver should drop characters with a turned-on 16550. It's not so easy with a 16450. Anyone with 16450s or lower should be able to solve their problems by dropping in a 16550."
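
The latency point in the comment above can be made concrete with a little arithmetic. The following is only a sketch under stated assumptions: 8N1 framing (10 bits on the wire per character), one character of receive buffering for a 16450, and a 16-character receive FIFO for a 16550. It ignores the receiver shift register and FIFO trigger levels, and the function names are hypothetical, not taken from any driver.

```c
#include <stdio.h>

/* Microseconds the driver has to service the UART before an overrun,
 * assuming `buffer_chars` characters of receive buffering and
 * `bits_per_char` bits on the wire per character (10 for 8N1,
 * which is an assumption of this sketch). */
double service_deadline_us(long baud, int bits_per_char, int buffer_chars)
{
    return (double)buffer_chars * bits_per_char * 1e6 / (double)baud;
}

/* Print the deadline for both UART types at a given line speed.
 * Buffer depths of 1 and 16 are the assumed 16450 and 16550 values. */
void print_deadlines(long baud)
{
    printf("%ld baud: 16450 %.0f us, 16550 %.0f us\n", baud,
           service_deadline_us(baud, 10, 1),
           service_deadline_us(baud, 10, 16));
}
```

Under these assumptions, at 9600 baud the interrupt has to be serviced within roughly 1 ms with a single-character buffer, against roughly 17 ms with a 16-character FIFO, which is why a 386 with long interrupt latencies copes with the 16550 and not with the 16450.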
Esix thinks they've fixed this problem; they rewrote the driver.

2. Quick port setup option in sysadm is broken

In 4.0.3 sysadm, the quick port setup option, which is used to add and delete terminal ports, is seriously broken. The script modifies /etc/conf/* files, and has incorrect minor numbers, sets the 5th field of sdevice.d to Y when it should be N, and is missing columns for node.d. See /usr/sadm/sysadm/bin/q-add. This bug is present in USL 4.2 as well (certainly in Consensys V.4.2).

The Esix people say this problem doesn't exist in either 4.0.4 or 4.0.4.1.

3. ttymon drops DTR when it shouldn't

Hugh Stearns reports that in 4.0.3.6 the ttymon(1) utility for HDB uucp drops DTR every few weeks. The workaround is to disable and re-enable it.

The SVr4.2 ttymon is even more broken; it *never* raises DTR after the first outgoing call. Jeremy Chatfield at IF has confirmed that this is a real bug in the USL sources and is on his urgent-fix list.

Esix has this bug under investigation.

In the May 10, 1993 issue of Open Systems Today, page 70, Jason Levitt describes some of his ttymon problems. He has a file posted on ftp.uu.net under /published/open-systems-today/other/svr42uucp.tar; this tar file contains a fixed ttymon program along with a text file describing how to set up ttymon and uucp so that they work pretty well.

4. ttymon doesn't drop DTR when it should

Stephen Hebditch reports from a Dell 2.2 system: "When a user logs out, ttymon does not appear to lower the DTR line for a sufficiently long enough time to always cause the modem to drop carrier. The WorldBlazer modem here is set to its default of 50ms DTR detection time - the minimum time allowable - but around 2 times out of 10, when a user logs out it will not drop carrier although the DTR light on its front panel can be seen to blink momentarily.

Disabling service for a particular device (e.g. using 'pmadm -d -p ttymon3 -s 00') will only work if ttymon hasn't spawned a child process for that port.
According to the manual "ttymon should exit if no one types anything in seconds after the prompt is sent". Occasionally when hanging up an outgoing connection, spurious characters can trigger ttymon into thinking that there is a new user wanting to log in. Because it has seen these characters, ttymon will then not time-out, locking up that port until the controlling ttymon child process is killed."

See the fix note attached to III.3.

Esix has this bug under investigation.

5. (4.2) Terminating cu to a direct line locks up the port

The problem is the C2 security mechanisms. Terminating cu with ~. doesn't tear them down correctly. Subsequently, another cu(1) will be able to get at the port, but utilities which try to get at it directly (i.e., cat or stty) won't be.

Rick Richardson adds: "The "cu" problem where ports can't be used by stty, seyon, or other programs once "cu" has had its way with them: This problem apparently affects any program (cu, uucp) that uses the DIAL(3) routines. Those routines have been modified to use the "cs" connection server daemon to open the port and/or dial a phone number on behalf of the client (though you'd hardly realize this from reading the manual page). The "cs" daemon does *something*, where *something* is not known yet, which causes all subsequent termio type ioctl's to fail. This bug has been reported to USL and Univel, but no fix has been forthcoming."

He continues: "I had our streams device driver guy put in a version of one of our serial port drivers with debugging turned on, and he said that it looked like the driver "close" routine was never getting called - possibly because the device close call only happens on the last close of a device, and the connection server has still got the port open. This theory would seem to indicate that "cu" and "uucp" are fine, but that the connection server is broken. We don't really know, though -- it's just a theory."

See the fix note attached to III.3.

6.
Hardware flow control bug breaks streaming data transfers

Stephen Hebditch reports from a Dell 2.2 system: "There is a definite problem with hardware flow control. If characters are being continually sent to the modem with no break, then after around 40K or so the asy driver will ignore the fact that the modem has lowered the CTS line and will keep on sending. Up to that point it will correctly stall when the CTS line is lowered. If there is a break in sending, then flow control will work correctly once more. This means that streaming protocols such as Z-Modem will break but simpler protocols like UUCP g which don't fill up the modem buffer will work correctly."

Your editor has seen this one himself while attempting to use rz for uploads to his friendly Internet site, as was his wont under SVr3. I now get around this by using ymodem protocol for uploads. This is probably a generic bug in 4.0.4 serial handling.

Esix has this bug under investigation.

7. Bad interaction between ttymon and networking

Stephen Hebditch reports from a Dell 2.2 system: "A problem with ttymon, in.telnetd and in.rlogind. When a user logs out, wrong entries are written to utmp and wtmp. This results in utmp and wtmp containing a new record for that user for a session starting at the time that they logged out. This results in some programs (finger for example) showing that users are logged in when they are not and means that login accounting is not possible."

See the fix note attached to III.3.

Esix has this bug under investigation.

8. Bogus restriction to 2 users

Some USL binary distributions are limited to running only two tty processes at once. USL thinks this is a feature, but it is actually a bug. In SVR4/386, look in the pack.d/kernel/space.c file for "int eua_lim_ma = 2;"; changing this to a high value will fix the bug.

IV. Networking and File Sharing Bugs

1. NFS locking is unusably slow

Randy Terbush has posted code which demonstrates a serious bug in the SVr4 NFS locking daemon.
In his own words: "The symptoms are ~30% cpu usage by 'lockd' and severe slowing of the machines on the network. This program demonstrates that it takes ~20 seconds to obtain locks from an ailing 'lockd'. We have verified that this bug does not exist in HPUX 8.0x."

Randy's code is too large to be included here. He is, quite rightly, exercised at USL's exceedingly slow response to this problem. The comment in his makefile reads, in part:

    # USL has admitted to the existance of this bug in version 4.0, 4.1,
    # and 4.2 of their distributed and yet to be released sources. This is
    # a network crippling problem that they have refused to fix until
    # release 4.3, which will be OVER 1 YEAR from today. (29 Oct 1992)
    # If your version of 'lockd' exhibits this same problem, I would
    # strongly urge you to contact your vendor and ask them to put some
    # pressure on USL to fix this problem. SVR4 is virtually useless in a
    # network of shared resources while this problem exists.

Esix has this bug under investigation.

2. UFS file system problems

In stock USL 4.0.3, you can't use a UFS file system as the root; the system hangs if you try. Consensys, Dell, Esix, Microport, MST, and UHC all appear to have fixed this. Esix confirms that they have.

David Aitken, the UNIX product manager at UHC, writes "The ufs as root file system [problem] was not really a bug, just a little oversight on USL's part - we have fixed it completely by adding one line to the /stand/boot script: rootfstype=ufs!" He adds that they've been using ufs on their lab machines for over 10 months with no trouble, and the latest UHC release defaults to ufs if you have more than 120MB of disk.

3. Byte-order problem with NFS when accessing Sun disks

Christoph Badura notes that the stock USL resolver library suffers from serious confusion about the byte order in the sockaddr_in structure. This bug is acknowledged by USL for the 4.0.4 release.

A symptom of this bug is that Sun disks will not mount correctly over NFS.
As a workaround, try removing the references to /usr/lib/resolv.so from /etc/netconfig and rebooting your system. Unfortunately, this will mean you can't use nameservers.

Alan Batie writes: "Actually, you don't have to remove resolv.so, just put tcpip.so first and have a hosts file with the names of hosts you want to do NFS mounts from. This way you can use nameservers for most things."

Esix has fixed this bug in 4.0.4.1.

4. Under weird circumstances, lseek on UFS may cause corruption

Christoph Badura reports that a UFS lseek() to an offset which is a multiple of 4096 but not a multiple of 8192, followed by a write(), may corrupt the file being written. The bug shows up only if the file has no pages in the page pool associated with it at the seek offset and at 4k before the seek offset. He has sent USL a kernel fix for this, which was included in 4.0.4.

Esix has this bug under investigation.

5. FTP problems

The in.ftpd on SVR4.0.3 does not support all the commands listed in RFC 959. When recent SCO UNIX/ODT versions ftp to SVR4.0.3, the SVR4 side will refuse, drop the connection, and core dump after you authenticate. This is because the SCO end sends the 'SYST' command a la RFC 959, and the SVR4.0.3 end doesn't recognise it. Some ports have fixed this.

Christoph Badura adds: "The bug is due to a longjmp(3) on a sigjmpbuf obtained by sigsetjmp(3). ARGH. Testing led to a bug in the original BSD sources, which is still present in the NET/2 ftpd."

Esix has fixed this bug in 4.0.4.1.

6. A bug in the WD80x3 support

MST reports a serious bug in the SVr4 kernel support for this card. Here's how to reproduce it:

    server: init 3 and share (export) /usr for example.

    client: mount -F nfs server:/usr /mnt
            cd /mnt
            find . -print | cpio -ocBuv > /dev/null

    what happens: server and client will "hang" together.

    "cue": hit keys on server and/or client, hang will go away for
           10-20 seconds temporarily. Yank BNC connectors do the same trick.
They say they've heard from customers that this happens on Dell, UHC as well as USL 4.0.4. PCNFS/BWNFS network xcopy suffers this as well. Client can be a Sun Sparc for that matter.

Esix has fixed this bug in 4.0.4.1.

7. Security hole near fingerd

Jerry Whelan reports:

We encountered a cute security hole in AT&T SVR4 2.1 (which I believe translates to USL 4.0.2). It apparently was fixed in AT&T SVR4 3.0. The hole related to the finger daemon. If a user set his .plan to a symbolic link pointing to a protected file (such as /etc/shadow, or somebody's mail file) then fingering the user would cause the finger daemon to read that file and display it. I don't know if the bug exists in any other vendor's versions of 4.0.2. We replaced our fingerd with gnu finger, only to find the same problem. I sent the changes back to the gnu finger developer, but I don't think a newer fixed version has been officially released yet.

Steve Peltz writes: "The fix to the fingerd problem (pointing a .plan file to a protected file and thus getting read access to it) can be fixed by changing inetd.conf to not give root privileges to the fingerd process. It seems like overkill to have fingerd set to the user id of the person you're fingering to see if you should have access to the file."

Esix has fixed this bug in 4.0.4.1.

8. Fatal bug in priority-band message handling.

Douglas C. Schmidt reports:

There is a bug with handling priority-band messages that causes several System V Release 4 versions (particularly Solaris 2.1) to crash. The following code replicates the problem. Sun has been notified and claims they will fix this problem in the next release (2.2?).

    /* This program causes System V Release 4 to crash!
    */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <stropts.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    #define FIFO "/tmp/foo"
    #define BIGFILE "/usr/dict/words"

    /* Reader: pull messages off the stream until a zero-length one arrives. */
    static int
    do_child (int fifo_fd)
    {
      struct strbuf msg;
      char buf[BUFSIZ];

      msg.maxlen = sizeof buf;
      msg.buf = buf;

      do
        {
          int flags = 0;

          if (getmsg (fifo_fd, 0, &msg, &flags) != -1)
            (void) printf ("(%2d) (%2d): %s", (int) (msg.len - sizeof (int)),
                           *(int *) msg.buf, msg.buf + sizeof (int));
          else
            return -1;
        }
      while (msg.len != 0);

      return 0;
    }

    /* Writer: send each line of BIGFILE in a randomly chosen priority band. */
    static int
    do_parent (int fifo_fd)
    {
      FILE *fp;
      char buf[BUFSIZ];

      (void) srand ((unsigned) time (0));

      if ((fp = fopen (BIGFILE, "r")) == 0)
        return -1;

      /* Leave room for the band number stored at the front of buf. */
      while (fgets (buf + sizeof (int), sizeof buf - sizeof (int), fp) != 0)
        {
          struct strbuf msg;
          int band = rand () % 11;

          msg.buf = buf;
          msg.len = strlen (buf + sizeof (int)) + 1 + sizeof (int);
          *(int *) buf = band;

          if (putpmsg (fifo_fd, 0, &msg, band, MSG_BAND) == -1)
            return -1;
        }
      return 0;
    }

    int
    main (void)
    {
      int fd;

    #if defined (TEST_FIFO)
      (void) unlink (FIFO);
      if (mkfifo (FIFO, 0666) == -1)
        perror ("mkfifo"), exit (1);
    #else
      int pipe_fds[2];
      if (pipe (pipe_fds) == -1)
        perror ("pipe"), exit (1);
    #endif

      switch (fork ())
        {
        case -1:
          perror ("fork"), exit (1);
          /* NOTREACHED */
        case 0:
    #if defined (TEST_FIFO)
          if ((fd = open (FIFO, O_RDONLY)) == -1)
            return -1;
    #else
          fd = pipe_fds[0];
          close (pipe_fds[1]);
    #endif
          if (do_child (fd) == -1)
            perror ("do_child"), exit (1);
          break;
        default:
    #if defined (TEST_FIFO)
          if ((fd = open (FIFO, O_WRONLY)) == -1)
            return -1;
    #else
          fd = pipe_fds[1];
          close (pipe_fds[0]);
    #endif
          if (do_parent (fd) == -1)
            perror ("do_parent"), exit (1);
          break;
        }
      return 0;
    }

Esix has fixed this bug in 4.0.4.1.

9. SVr4.0.4 TCP/IP routing is broken

Raymond Nijssen reports: "I found a problem with ESIX 4.0.4 TCP/IP routing. I'm not sure if it's also present in other SVR4 flavors. The problem is that once a system has received an ICMP route redirect message, it is supposed to store the new route in its routing tables.
This does not work properly, which is revealed by ping(1)ing to a host through a gateway in a more complex network configuration. For almost every packet is sent to another gateway than the one which corresponds with the network of the destination. This in turn leads to an enormous amount of ICMP messages, which leads to bad network throughput. We also had some mysterious crashes until we decided to change the network configuration to circumvent this problem."

(This seems very likely to be a generic SVr4 problem).

Esix has this bug under investigation.

10. df(1) on NFS volumes returns bad data

Raymond Nijssen reports from Esix 4.0.3A and 4.0.4: "Diskspace figures of NFS mounted filesystems reported by both /bin/df and /usr/ucb/df are 4 times too big."

Esix has fixed this bug in 4.0.4.1.

11. rsh hogs the processor

Raymond Nijssen reports from Esix 4.0.3A and 4.0.4: "The rsh command hogs the CPU. On an empty system, `rsh foo -n bar' takes 1 second kernel-mode CPU per second elapsed."

Esix has fixed this bug in 4.0.4.1.

12. MTU for remote networks ignored

Nathan D. Lane reports: "Esix 4.0.4 ignores the MTU for remote networks. I have PPP setup on my RS/6000 and the Esix box connects via ethernet to the RS/6000. Packets are always sent out "full size" by the Esix machine, no matter where their destination. It is my understanding that, when routing to a remote network where the MTU is a) unknown or b) set to something lower than 1536, the originating machine should make the packets smaller. Instead, when the Esix box blasts out its packets across the PPP link, it sends them full size, making the other end do *a lot* of packet reassembly."

This has not been confirmed on other ports, but seems likely to be a generic SVR4 problem.

13. Bug in remote printing.

A couple of USENETters have reported that the remote-printing support for lp (the System V print spooler) is broken in SVr4.0.
Printing is done correctly, but the job is not then removed from the print queue on either system.

V. SCSI Support Problems

1. sar is confused by SCSI

sar -d doesn't work on SCSI drives. Dell fixed this in 2.1 and it's reported to work OK in Esix 4.0.3A; no report of any other SVr4 having fixed this yet. SCO fixed it in 3.2.4. Appears to be fixed in USL 4.2.

Esix fixed this bug in 4.0.4.

2. A configuration problem

Stock USL 4.0 requires you to jumper your SCSI devices to fixed IDs during installation (it can be changed to any other ID after). Specifically, the tape must be ID 6. Dell says they've fixed this. The requirement is definitely still present in Esix and Consensys 1.3. UHC thinks they've fixed this, but their 4.0.3.6 release still seems to demand ID 1 to install.

I've seen an email report that USL 4.2 still has this problem. But after publishing this, I got a request for more info from Mike Drangula at USL. He wrote:

> As far as
> I know ( and I wrote the SCSI configuration tools for 4.2 ), there is only
> one case where a device is required to be at a particular SCSI ID, unless
> you count the requirement that the HBA be at ID 7.
>
> The only requirement for a given SCSI id is that, on a SCSI-based MCA
> machine that uses IBM's SCSI Host Adapters, the boot disk must be at ID 6
> if there is more than one disk installed on the HBA.
>
> The old requirement that the tape be set to SCSI ID 6 is no longer in effect.
> If your HBA will support booting from it, there is not even a requirement
> that the boot SCSI disk be at SCSI ID 0. The only requirement for disks is
> that the boot disk must have the lowest SCSI ID of any DISKS on the system
> ( except in the already noted case of MCA SCSI )

Give Mike a hand for actually reading this bug list.

3. Synchronous SCSI hang problem

David Wexelblat reports: "Stock SVR4.0.3 will hang the SCSI bus with a 1542 in synchronous mode.
Dell fixed this, and this has been given to Microport [ed note: Microport 4.0.4 and Consensys 4.0.3 have fixed the problem; MST UNIX and Esix 4.0.3 still have this problem; I have not yet been able to determine if ESIX 4.0.4 does]. In the file /sbin/bcheckrc, change the line:

    echo MARK > /dev/rswap

to

    echo MARK | dd of=/dev/rswap bs=512 conv=sync > /dev/null 2>&1

The magic is apparently the conv=sync, which forces a 512 byte block to be written. The original echo writes 4 bytes, which apparently causes synchronous SCSI to go out to lunch.

Now, you ask, how can I fix this, since the system won't boot? There are a couple of methods. First, if possible, disable synchronous negotiation (1542 jumper J5-1 removed, plus whatever you may need to do to your drive). Then boot up, edit /sbin/bcheckrc, then shutdown, restrap for synchronous, then reboot. Everything should be OK. That's the easy way.

Unfortunately, some hard drives will only work in synchronous mode. Well, you can still recover from this phenomenon. Here's how:

1) Install on your hard drive

2) Boot from the first boot floppy. When it tells you to, insert the second boot floppy. At the first prompt, hit to break out to a shell.

3) Mount your hard drive under /mnt with the following command (replace FS-TYPE with s5, s52, or ufs, whichever you used for your root partition):

    /etc/fs/FS-TYPE/mount /dev/dsk/c0t0d0s1 /mnt

4) Now edit /mnt/sbin/bcheckrc:

    ed /mnt/sbin/bcheckrc

   You may want the 'ed' man page handy (I barely remember how to use 'ed' :->). For simplicity, you can delete/comment out the offending line, then replace it with the correct line later.

5) Unmount the hard drive:

    umount /mnt

6) Reboot from the hard drive. Everything should come up OK, and you can finish editing /sbin/bcheckrc, if necessary.

Note that you perform these actions at your own risk. The first version was performed by me on Microport SVR4, and the second was performed by someone else (on my suggestion) on ESIX SVR4."
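
For readers who would rather see the effect of conv=sync from C than from dd, the padding step can be sketched as follows. This is a minimal illustration, assuming the 512-byte block size used in the dd command quoted above; the name write_padded_block is hypothetical and the code is not taken from any vendor's bcheckrc or driver.

```c
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 512  /* assumed: matches bs=512 in the dd command */

/* Write `len` bytes from `data` as one full BLOCK_SIZE write, padding
 * the tail with NUL bytes the way dd's conv=sync pads a short input
 * block.  Returns 0 on success, -1 if `len` does not fit in one block
 * or the write comes up short. */
int write_padded_block(int fd, const void *data, size_t len)
{
    char block[BLOCK_SIZE];

    if (len > BLOCK_SIZE)
        return -1;
    memset(block, 0, sizeof block);   /* NUL padding */
    memcpy(block, data, len);         /* payload at the front */
    return write(fd, block, sizeof block) == (ssize_t)sizeof block ? 0 : -1;
}
```

Called as write_padded_block(fd, "MARK\n", 5) on an open descriptor, this issues a single 512-byte write instead of the short write that echo alone would produce.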
This problem appears to be fixed on Consensys 1.3 and Dell 2.1; also (pace David's remark) in ESIX 4.0.4, which has

    echo MARK | /sbin/dd.arch conv=sync > /dev/rswap 2> /dev/null

MST says they've fixed it in their 3.02 release. It's also fixed in Esix 4.0.4.

4. ps chokes on commands that do SCSI I/O

Hugh Stearns reports that in 4.0.3.6, ps doesn't work when a SCSI command is in progress. It stops printing at the process executing the SCSI command. This is still broken in Dell 2.2 and ESIX 4.0.3.

Esix says they can't reproduce this in 4.0.4 or 4.0.4.1.

5. Transfer speed problems with Adaptec 1542B on 486s

If a system mount or install fails, try setting the DMA speed to 5MB/s, rather than the default 5.7MB/s. This is accomplished by removing the jumper shorting the 12th pin pair of jumper block 5.

6. df gives inaccurate values for large SCSI partitions

Derek Terveer reports "I was on an Esix 4.0.4 system recently with a >1024 cylinder (i.e., ~1.05 GB disk) and the df command was giving wildly inaccurate values. I presume that this has something to do with the size of the partitions, because it works just fine on a system with smaller drives and partitions."

Esix says they can't reproduce this in 4.0.4 or 4.0.4.1.

VI. Development Tools Problems

1. General UCB library brokenness

The BSD compatibility libraries were badly broken in USL code. A Dell source adds "That meant that almost all the apps derived from them were broken too. Most stuff like automount will die when you send a SIGHUP, instead of rereading the map file. You can get a system into very strange states when that happens."

John Sully of Microport opines: "This is a bug in automount itself rather than BSD compatibility, since the automount which comes with SVR4 is not compiled with the BSD libraries. (isn't this comforting?? :-()."

Peter Wemm reports "There is a very simple and reliable cure for this sort of thing: Using your favourite hex editor, change all instances of "signal" in the binary file to "sigset".
Most BSD code assumes that signal() auto-rearms after handling a signal. On SVR4, signal() does not, but sigset() is argument compatible, and has BSD semantics."

Esix and UHC's BSD libraries are USL stock. I don't yet know the status of other ports. Microport has run into things they think may be symptoms of this but have no fix yet.

John Sully of Microport counters with: "One common thread I find on reading of these problems is that the BSD compatibility libraries are *misused*. [...] The problem is that BSD and SYSV have similarly named .h files which sometimes contain different definitions for objects with the same name. This has been known to cause all sorts of problems because the SYSV headers are picked up and then the calls are satisfied from the BSD library rather than the shared object library. I have found that if you use /usr/ucb/cc that the BSD compatibility is much less broken than it would seem at first because it ensures that the correct headers are picked up."

However, note that there is at least one *real* bug known --- as of 4.0.4 the signal emulation cannot explicitly set a handler to SIG_DFL or SIG_IGN.

Developers should be very careful that if they use -L/usr/lib/ucb -lucb the cc used is also the Berkeley cc.

Esix has this bug under investigation.

2. USL emulation of BSD signals doesn't work

A different source reports that the USL implementation of BSD signals is broken in both 4.0.3 and 4.0.4; in particular, the sigvec() family doesn't work properly. It is possible to make minor tweaks to source to make such apps work properly with the native USL signals implementation.

Here's more on the signals problem, thanks to Richard:

------------------------------------------------------------------------------

The problem is to do with the signal() function that is within the BSD compatibility libc.
To reproduce the problem do the following:

    #include <signal.h>
    #include <unistd.h>

    main()
    {
        signal(SIGPIPE,SIG_IGN);
        pause();
    }

and compile it with

    cc xx.c -o xx /usr/ucblib/libucb.a

(John Sully observes that this is definitely wrong; /usr/ucb/cc should have been used rather than "cc ... -L/usr/ucblib -lucb" or the equivalent "cc ... /usr/ucblib/libucb.a".)

If you run the program and then signal it with a SIGPIPE, the program will die, even though you've told it to ignore SIGPIPE.

The fix is difficult unless you've got source because there's a missing 'else' clause from the signal() code. This is the only signal fault I've found in the BSD signal functions, details of the rumoured sigvec problem would be useful?

If you're trying to compile an application you could change the application code to do the following, this does work..

    void catch(s)
    int s;
    {
        /* DO NOTHING */ ;
    }

    main()
    {
        signal(SIGPIPE,catch);
        pause();
    }

SUMMARY

You can only change a signal handler to a function handler, any number of times. Any attempt to set the handler to SIG_DFL, or SIG_IGN will fail.

This bug has given some people working with X11R5 aggro, causing the X server to die when you close a client.

Christoph Badura confirms this bug. He has sent USL a source fix. It appears already to have been fixed in Dell 2.2. Esix has fixed this in 4.0.4.1.

3. Possible string library problems

There are also persistent rumors of problems in the BSD-emulation string libraries. I have not been able to pin down specifics on this.

4. USL's ndbm support is broken.

Christoph Badura reports "The ndbm functions in the ucb library are broken [apparently due to a compiler or optimizer bug in cc -- ed.]. Try making the whatis data base for /usr/share/man with Tom Christiansen's perl rewrite of man. The easiest way to fix this is to compile GNU's replacement ndbm.c with gcc -fpcc-struct-return -traditional (gcc1.40 or 2.2 will do nicely) and install it in your C library. Source is available for FTP from prep.ai.mit.edu."
Esix has this bug under investigation.

5. An include file is missing

Both 4.0.3 and 4.0.4 USL versions are missing the documented dial.h file from their /usr/include directory. Dell 2.[12] has it.

This file has been restored in Esix 4.0.4.1.

6. sscanf(3) has a potential bug

Anthony Shipman reports: "I found the following bug in SCO Unix 3.2.* and I think it may be common to many AT&T derived Unixes.

sscanf() calls _doscan() to read from a pretend file. The file uses the string as a buffer and a fake file descriptor of 60 (=_NFILE). Since _NFILE (for SCO UNIX) is 60 it assumes that fd 60 can never be open. Then when fscanf() hits the end of the string it calls _filbuf() to read into the buffer (which is the string) from fd 60. This should fail with an errno=9 and then _filbuf() sets EOF and it all terminates.

However in SCO Unix you can reconfigure the kernel to increase the number of files per process to a recommended maximum of 150. If you do this then your program might have fd 60 open one day. Then sscanf() will read from this file overwriting your string. The byte count to the read() in _filbuf() is some undefined but large value so a lot of memory will be overwritten. In my case the string was on the stack so my stack was wiped.

In short if you configure your kernel to have NOFILES > _NFILE ie more than the default then sscanf() is a time bomb in your code."

This is alleged to have been fixed in SVr4, but I haven't been able to confirm the fix. Bob Tinsmamn of SCO support writes: "We're fixing it too, in a maintenance supplement to the Development System that will come out at the end of this year or the beginning of 1993, known as Development System Maintenance Supplement 4.2 or MSD 4.2."

Esix has this bug under investigation.

7. shmat(2) vs. vfork(2)

The shmat(2) call is known to interact badly with vfork(2). Specifically, if you attach a shared-memory segment, vfork(), and then the child releases the segment, the parent loses it too! Workaround: use fork(2).
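
The fork(2) workaround can be checked with a few lines of C. What follows is a sketch, not a vendor test case: the function name is hypothetical, the 4096-byte segment size is arbitrary, and the code assumes System V shared memory is configured on the machine. With fork() the child detaches the segment in its own address space only, so the parent's attachment should survive.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Attach a private shared-memory segment, let a fork()ed child detach
 * it, then check that the parent's attachment is still usable.
 * Returns 1 if the parent still sees its data, 0 if the data changed,
 * -1 on a setup error.  (If the parent had lost the mapping, the read
 * below would fault instead of returning.) */
int parent_keeps_segment_after_fork(void)
{
    int id, ok, status;
    char *seg;
    pid_t pid;

    id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id == -1)
        return -1;
    seg = (char *)shmat(id, NULL, 0);
    if (seg == (char *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }
    seg[0] = 'x';

    pid = fork();               /* fork(), not vfork(): separate address space */
    if (pid == -1) {
        shmdt(seg);
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }
    if (pid == 0) {
        shmdt(seg);             /* detaches in the child only */
        _exit(0);
    }
    waitpid(pid, &status, 0);

    ok = (seg[0] == 'x');       /* parent reads through its own attachment */
    shmdt(seg);
    shmctl(id, IPC_RMID, NULL); /* remove the segment when done */
    return ok;
}
```

Replacing fork() with vfork() in this sketch is what the report above describes as losing the segment in the parent; that variant is deliberately not shown, since a vfork() child is not supposed to do anything beyond exec or _exit.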
UHC and Microport both suspect that they still have this bug and opine that
anyone who uses vfork deserves to lose. Dell has no plans to fix it.

John Sully writes:
"This is not a bug. It is completely consistent with the semantics of a
change to the address space of the child. Think about it: any change to the
address space of a child process created by vfork(2) is reflected in the
parent since the child is actually executing in the parent's address space.
Therefore if the child changes the address space (in this case by releasing
the shared memory segment) what should happen? Right, the parent should have
the same change happen. And what does happen? The segment is released in the
parent. One can argue about the braindead semantics of vfork(2) all day, but
the fact remains that this is exactly what one would expect to happen.

To quote from the manual page:

     [...] vfork differs from fork in that the child borrows the
     parent's *memory* and thread of control until a call to execve
     or an exit (either by a call to exit or abnormally.)
                                        [ emphasis added ]

and later:

     It does not work, however, to return while running in the
     child's context from the procedure which called vfork since
     the eventual return from vfork would then return to a no
     longer existent stack frame.

Please note that the entire address space of the parent is used by the child
created by vfork(2). The manual page also points out several other caveats
involved in doing anything to the parent's address space except successfully
calling an exec family function or _exit (note it specifically says *not* to
call exit(2)). I do not believe that having a shared memory segment disappear
from the parent's address space is out of line after reading the man page for
vfork(2).

It is interesting to note that Sun after implementing its new VM system in
SunOS 4.0 initially had no plans to support vfork, since they felt that the
COW semantics of the new fork would provide the necessary efficiency gain.
Indeed they found that most programs which used vfork worked just fine by
doing -Dvfork=fork. All that is, except for a certain popular command
interpreter [ed: can you say C shell?]. So we are stuck with the legacy of
this braindead system call.

BTW, Microport has no plans to fix this :-)."

8. FIONREAD fails on regular files

Christoph Badura reports that the FIONREAD ioctl() fails on regular (disk)
files. He has sent USL a one-line kernel fix.

Esix has this bug under investigation.

9. fread(3) does the wrong thing on pipes and FIFOs

Ed Hall writes:
"Unlike the raw read() system call, fread() is supposed to be able to make
several partial reads to satisfy the data requested by its arguments. The
exceptions are an EOF or an error on the stream. This characteristic is
quite useful when moving data through pipes or over network connections,
since partial reads are quite common in these cases.

Well, the version of fread() in ESIX 4.0.3 (and likely other Sys5R4's) only
does a single physical read, and if it only satisfies part of the requested
number of bytes, that's all you get. This can sting you even if you
carefully check the value returned by fread(), since the value returned is
rounded down to the number of complete "nitems" read, although your position
in the stream can be up to size-1 bytes beyond that point. Neither ferror()
nor feof() indicate anything is wrong when this happens."

This bug (which is also present in 4.0.4) is serious and nasty and should be
high on every porting house's list to fix. It appears to be peculiar to USL
4.0.3 and 4.0.4; 4.0.2 does *not* have it, nor does SCO. A USL source claims
it has been fixed in 4.1.

Esix has this bug under investigation.

10. putw appears to be broken

There is a bug in the ESIX SVR4.0.3A putw() routine in the C shared library
which is probably USL's.
The following program demonstrates it:

/* compile with: cc -o file file.c */
#include <stdio.h>

main()
{
    int i;

    for (i=0; i<1022; ++i) {
        putchar('1');
    }
    putw(-11, stdout);
    for (i=0; i<1022; ++i) {
        putchar('1');
    }
}

The putw() routine does not output 4 bytes, as it should. It may be that
there is some interaction with buffer flushing that is causing the problem.
Also, note that if you change the sign of the first argument to putw(), the
program works fine.

Esix has this bug under investigation.

11. Compiler problems

Ronald Guilmette also reports the following:
------------------------------------------------------------------------------
/* Here is a bug in the original SVR4 C compiler (aka C Issue 5) which
   effectively prevents you from making good use of the `const' and
   `volatile' qualifiers defined by ANSI C in conjunction with pointer
   types and typedef statements.

   Compile this code and you will get:

   "qualifiers.c", line 23: left operand must be modifiable lvalue: op "="

   ...if your copy of the svr4 C compiler still has the bug.

   Note that given these declarations, the ANSI C standard says that the
   thing pointed to by the variable `pci' should be considered to be
   constant... not the variable `pci' itself.  (The GCC compiler, either
   version 1.x or version 2.x, correctly compiles this example without
   complaint.)
*/
typedef const int *ptr_to_const_int;
ptr_to_const_int pci;
int i;
void main ()
{
  pci = &i;
}
------------------------------------------------------------------------------
/* Here is a subtle bug in the original SVR4 C compiler (aka C Issue 5)
   which prevents you from first declaring a tagged type (i.e. a struct
   type or a union type) in a parameter list, and then defining that
   tagged type later on within the same scope.  (Note that according to
   the ANSI C standard, the scope in which parameters get declared and
   the outermost block of a function body are one and the same scope.
   Thus, this really is legal ANSI C code!)

   Try compiling this with your C compiler on SVR4.
   If your compiler still has the bug, you will get:

   "tagged_type.c", line 24: warning: dubious tag declaration: struct S
   "tagged_type.c", line 28: warning: improper member use: i
   "tagged_type.c", line 28: warning: improper member use: i
   "tagged_type.c", line 31: warning: dubious tag declaration: struct S
   "tagged_type.c", line 35: warning: improper member use: i
   "tagged_type.c", line 35: warning: improper member use: i

   (The GCC compiler also had this bug in version 1.x, but it has been
   fixed in version 2.x.)
*/

void foobar1 (arg)              /* use old-style without prototypes */
struct S *arg;
{
  struct S { int i; };          /* define the type `struct S' */

  arg->i = arg->i;              /* legal according to ANSI C rules! */
}

void foobar2 (struct S *arg)    /* use new-style with prototypes */
{
  struct S { int i; };          /* define the type `struct S' */

  arg->i = arg->i;              /* legal according to ANSI C rules! */
}
------------------------------------------------------------------------------
/* Here is a serious bug in the original SVR4 `dump' program which dumps
   out parts of object files in either plain hex form or symbolically.

   To see the `dump' program get a segfault and die, save this code under
   the name `dump-bug.c' and then do:

        cc -g -c dump-bug.c
        dump -v -D dump-bug.o

   The bug arises whenever `dump' tries to read Dwarf debugging information
   for an array of pointers to any "user defined" type (e.g. `struct S' in
   this example).  Past that point, `dump' is totally confused, so further
   Dwarf debugging information finally causes it to go belly-up.
*/

struct S { int i; };
struct S *array[10];
int j;
------------------------------------------------------------------------------
It appears that the svr4 C compiler (for x86 machines) doesn't conform real
well to either the letter or the spirit of the IEEE 754 floating-point
standard. In particular, "unordered comparisons" and other operations on
NaNs don't always produce the result that the IEEE 754 standard calls for.

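A quick way to probe how a given compiler treats unordered comparisons is a
test along these lines. This is a minimal sketch, assuming IEEE 754 doubles
and no value-changing floating-point optimization flags; the function names
are invented for this example. IEEE 754 requires every ordered comparison and
== against a NaN to be false, and != to be true.

```c
/* Hypothetical helper: produce a NaN at run time. */
double make_nan(void)
{
    volatile double zero = 0.0;   /* volatile stops compile-time folding */
    return zero / zero;
}

/*
 * Count how many of six comparisons against a NaN give the result
 * IEEE 754 requires.  A conforming implementation returns 6.
 */
int count_conforming_nan_comparisons(void)
{
    double x = make_nan();
    double one = 1.0;
    int ok = 0;

    if (!(x == x))   ok++;   /* NaN is not equal to itself */
    if (x != x)      ok++;   /* ...and so compares unequal */
    if (!(x < one))  ok++;
    if (!(x > one))  ok++;
    if (!(x <= one)) ok++;
    if (!(x >= one)) ok++;
    return ok;
}
```

A result below 6 would suggest the compiler is emitting comparison sequences
that ignore the unordered case.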
An AT&T source comments: "This is documented in the SVID as a future
direction. We do not support NaNs in -Xa and -Xt modes, only in -Xc. Try
isnan(sqrt(-1.0)) to determine which modes support it."
------------------------------------------------------------------------------
The compiler fails to issue diagnostics in cases where a typedef name is
reused to declare a formal parameter, as in:
-----------------------------------------------------------------------
typedef int FOO;

void bar (FOO)
int FOO;
{
}
-----------------------------------------------------------------------
The compiler crashes on the following invalid input:
-----------------------------------------------------------------------
int i;
volatile void *pvv;

void pvv_test ()
{
  (i ? *pvv : *pvv);    /* ERROR */
}
-----------------------------------------------------------------------
The compiler fails to issue diagnostics for cases where an attempt is made
to "forward declare" an enum type (without also defining it), as in:
-----------------------------------------------------------------------
enum enum0 *ep;         /* ERROR */
-----------------------------------------------------------------------
The compiler rejects the following code with an error, although there seems
to be no good reason why it should (because no object is being declared).
-----------------------------------------------------------------------
#include <limits.h>

typedef char array_type[ULONG_MAX];
-----------------------------------------------------------------------
Here's another nasty one:
-----------------------------------------------------------------------
#!/bin/sh
# The following script will cause SVR4 linkers on many x86 systems to crash
# with a SIGSEGV (i.e. a segfault).
echo 'leal foo@GOTOFF(%ebx),%ecx' > sl-bug.s
as sl-bug.s
ld -G sl-bug.o
-----------------------------------------------------------------------
The 4.2 compiler messes up on the following code:
-----------------------------------------------------------------------
void YUCK (__0this , __0word )
register struct XX *__0this ;
os2u __0word ;
{
    if (__0word <= 0xFF )
        put_osbool__14XXFCi ( __0this , (int )0 ) ;
    else {
        foo ( __0this , (int )1 ) ;
        bar ( __0this , (unsigned char )(((unsigned char )(__0word >> 8 )))) ;
    }
    zzz ( __0this , (unsigned char )(((unsigned char )__0word ))) ;
}
-----------------------------------------------------------------------
Optimized, the first comparison is:

        movw    12(%esp),%dx    ;copy 2byte argument to dx
        cmpw    $255,%dx
        jg      .L205

which is wrong; it does a signed comparison on an unsigned short.
Compiling with -g it gets it right:

        movzwl  12(%ebp),%eax
        cmpl    $255,%eax
        jg      .L207

12. getlogin() doesn't work

Robert Withrow reports "The posix function getlogin() doesn't work on most
svr4s (at least up to SVR4.0.3.0)... cuserid() *does* work, but it makes
porting a pain. Try it some time and perhaps add it to your list."

Raymond Nijssen confirms this and adds that this bug (due to utmp and wtmp
file corruptions [possibly caused by ttymon bugs described above --- ed.])
breaks executables such as talk(1).

Esix has this bug under investigation.

13. syslog routines don't work

Raymond Nijssen reports:
"Under ESIX 4.0.3, syslog routines are unusable. They are slightly better
under 4.0.4, but still severely broken."

"In addition, replacing the syslogd executable that comes with Esix with
the one provided by Marc Boucher (marc@cam.org) shows that the syslog() call
itself is sane. It's available from ftp.cam.org."

14. Bogus `r' in xt driver configuration flags

Raymond Nijssen reports:
"Both under ESIX 4.0.3 and 4.0.4, the `r' flag is present in the third
column of /etc/conf/cf.d/mdevice for the [n][s]xt drivers, suggesting that
these drivers would be required for relinking the kernel. This is not the
case. I saw at least one release of Dell SVR4 in which this was ok."

(Making this change reduces the kernel's size somewhat.)

This is fixed in Esix 4.0.4.1.

15. ioctl for kernel symbol fetches fails

Trying to obtain kernel values of certain symbols fails. The two symbols
from the kernel that are quite useful are "avenrun" and "total" which as far
as I can tell are defined in the "mm" driver. This bug manifests itself in
applications like "top", "u386mon" ...

One used to use the nlist() function call, but according to the man page for
nlist() it should not be used due to the dynamic loading and unloading of
drivers that can happen at any time in the "life" of a V.4.2 kernel.

Try the sample hack below to see if your system has the same problem.

/* <sys/ksym.h> is assumed to be where MIOC_READKSYM and struct mioc_rksym
   are declared; check your system's headers. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ksym.h>

main()
{
    int fd=0;
    long ar[3];
    struct mioc_rksym k;

    fd = open("/dev/kmem", O_RDONLY);
    k.mirk_buflen = sizeof(ar);
    k.mirk_buf = (void *)&ar;
    k.mirk_symname = "avenrun";
    if((ioctl(fd, MIOC_READKSYM, &k))==-1) {
        perror("ioctl");
        exit(1);
    }
    printf("%ld %ld %ld\n",ar[0],ar[1],ar[2]);   /* ar[] holds longs */
    close(fd);
}

Thanks to David P. Cutter for reporting this.

16. Bug in cc optimizer (4.2.1)

Nickolay Saukh reports a bug in cc, the Optimizing C Compilation System
(CCS) 2.0 07/24/92. If you have a global (external) structure/union with the
name 'tr', the instructions that access its very first member (with zero
offset) are garbled.
Simple text to reproduce the bug:

struct _tr {
        int aa;
        int bb;
} tr;

void foo(int zz)
{
        tr.aa = zz;
}

Here is the result of cc -O -S foo.c

        .file   "ccbug.c"
        .version        "01.01"
        .type   foo,@function
        .text
        .globl  foo
        .align  4
        .nopsets        "cc"
        .align  16
foo:
        movl    4(%esp),%eax
        movl    %eax,&r
                     ^------------- <<<< THE BUG
        ret
        .align  16,7,4
        .size   foo,.-foo
        .ident  "acomp: (CCS) 2.0 07/24/92 "
        .data
        .comm   tr,8,4
        .text
        .ident  "optim: (CCS) 2.0 07/24/92 "

17. /usr/ucb/install uses missing group "staff"

/usr/ucb/install uses the group name "staff" as the default group to install
programs. As this group does not exist in /etc/group, the installation will
fail. I would suggest changing the /etc/group file like in Solaris as
follows:

        nuucp::9:root,nuucp
        staff::10:

18. sigsetjmp calls may lose due to header error

Your humble editor reports: the SVr4 setjmp.h header file has a potentially
dangerous bug in it. It uses a preprocessor symbol `i386' to pick a length
for the sigsetjmp jump buffer data type. The problem is that this symbol
goes undefined if you compile in pure ANSI-conformant mode with -Xa --- and
the length you get instead is shorter! Thus, sigsetjmp calls may trash
static data areas past the end of the jump buffer.

Possible setups for similar lossage may exist in core.h, float.h, ieeefp.h,
limits.h, math.h, nan.h, prof.h, stdlib.h, and values.h.

Workaround: include -Di386 in your compilation options.

VII. The FUBYTE Problem

(Thanks to Christoph Badura for this info)

The kernel function fubyte() is documented to return a positive value when
given a valid user space address and -1 otherwise. In the latter case
u.u_error is set to EFAULT.

USL SysV R4.0.3 has a sign extension bug in the implementation of fubyte()
for local file descriptors (i.e. not opened via RFS), which causes fubyte()
to return negative values if the byte fetched has its high bit set. This bug
doesn't affect STREAMS drivers, as they don't call (and in fact are normally
unable to call) fubyte().

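The effect is easy to reproduce in user space. The following is a sketch for
illustration only: fetch_signed() and fetch_unsigned() are invented names,
not kernel routines. They fetch one byte through a signed and an unsigned
char pointer respectively, mirroring the difference between a sign-extending
and a zero-extending load.

```c
/* Illustration only; these are not the kernel's fubyte()/lfubyte(). */

/* Fetch through a signed char pointer: bytes >= 0x80 come back negative. */
int fetch_signed(const void *addr)
{
    const signed char *p = (const signed char *)addr;
    return *p;          /* sign-extended when widened to int */
}

/* Fetch through an unsigned char pointer: result is always 0..255. */
int fetch_unsigned(const void *addr)
{
    const unsigned char *p = (const unsigned char *)addr;
    return *p;          /* zero-extended when widened to int */
}
```

For a byte of 0xff, fetch_signed() yields -1, which a caller cannot tell
apart from an error return, while fetch_unsigned() yields 255.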
Thus writing a byte with the high bit set to certain character device
drivers returns with -1 and errno set to EFAULT. The bug may affect any
character device driver that calls fubyte(). It's not limited to serial card
drivers. The bug is noticed most often with serial card drivers, since uucp
uses byte values > 127 very early during g-protocol setup and drivers for
serial cards tend to use fubyte() quite often.

Note also that the bug's effect is different if the driver checks for a -1
return value of fubyte() or just a negative one. In the former case it is
possible to pass bytes with the 8th bit set through fubyte(), except for
0xff which is -1 in two's complement. That makes the bug more obscure.

The fix is easy. First, make a backup copy of the kernel object file
/etc/conf/pack.d/kernel/vm.o! A disassembly of vm.o(lfubyte) should reveal
*exactly* one mov[s]bl (move byte to long w/sign extend). That one needs to
be patched into a movzbl (zero extend). The difference is one bit in the
second byte of the opcode.

        The movsbl has the bit pattern 00001111 1011111w mod/rm-byte.
        The movzbl has the bit pattern 00001111 1011011w mod/rm-byte.

The 'w' bit is 0 for the instruction in question. So the opcodes are 0f be
and 0f b6.

Here is the diff -c from dis -F lfubyte showing the patch applied to the
Dell 2.1 kernel:

*** vm.o        Mon Mar  9 00:31:38 1992
--- vm.o.org    Mon Mar  9 00:32:40 1992
***************
*** 22,28 ****
    11c90:  85 c0                   testl  %eax,%eax
    11c92:  75 09                   jne    0x9 <11c9d>
    11c94:  8b 45 08                movl   8(%ebp),%eax
!   11c97:  0f b6 00                movzbl (%eax),%eax
    11c9a:  89 45 fc                movl   %eax,-4(%ebp)
    11c9d:  c7 05 d8 13 00 00 00 00 00 00   movl   $0x0,0x13d8
    11ca7:  83 3d dc 13 00 00 00    cmpl   $0x0,0x13dc
--- 22,28 ----
    11c90:  85 c0                   testl  %eax,%eax
    11c92:  75 09                   jne    0x9 <11c9d>
    11c94:  8b 45 08                movl   8(%ebp),%eax
!   11c97:  0f be 00                movsbl (%eax),%eax
    11c9a:  89 45 fc                movl   %eax,-4(%ebp)
    11c9d:  c7 05 d8 13 00 00 00 00 00 00   movl   $0x0,0x13d8
    11ca7:  83 3d dc 13 00 00 00    cmpl   $0x0,0x13dc

Of course there is a workaround at the driver level. Canonically, one would
do this by checking for fubyte() returning -1 *and* u.u_error being set to
EFAULT (u.u_error is cleared upon entering a system call). However, in
R4.0.3 fubyte() does NOT set u.u_error. It *does* set
u.u_fault_catch.fc_errno.

Christoph reports that Dell 2.1 can be object-patched successfully to fix
this. I'm told that the offending 11c97 is at exactly the same address in
the Consensys 1.3 kernel.

At vm.o:fa7d in Dell 2.2 there's a movzbl (%edx),%edx; same instruction,
different target register. Here's the relevant diff output:

*** vm.o-old    Wed Jul  7 03:13:11 1993
--- vm.o        Wed Jul  7 03:13:00 1993
***************
*** 25,31 ****
    fa76:  85 c0                    testl  %eax,%eax
    fa78:  75 09                    jne    0x9
    fa7a:  8b 55 08                 movl   8(%ebp),%edx
!   fa7d:  0f b6 12                 movzbl (%edx),%edx
    fa80:  89 55 fc                 movl   %edx,-4(%ebp)
    fa83:  c7 05 d8 13 00 00 00 00 00 00    movl   $0x0,0x13d8
    fa8d:  83 3d dc 13 00 00 00     cmpl   $0x0,0x13dc
--- 25,31 ----
    fa76:  85 c0                    testl  %eax,%eax
    fa78:  75 09                    jne    0x9
    fa7a:  8b 55 08                 movl   8(%ebp),%edx
!   fa7d:  0f be 12                 movsbl (%edx),%edx
    fa80:  89 55 fc                 movl   %edx,-4(%ebp)
    fa83:  c7 05 d8 13 00 00 00 00 00 00    movl   $0x0,0x13d8
    fa8d:  83 3d dc 13 00 00 00     cmpl   $0x0,0x13dc

Applying this patch produces a working kernel.

I do not know the status of the other ports.

Another poster (Marc Boucher) adds:

On ESIX SVR4.0.3 Rev. A, the instruction movsbl in question can be changed
to movzbl (as described above) with a binary-editor on file
/etc/conf/pack.d/kernel/vm.o. At offset 0x11eb0, change 0xbe to 0xb6.

Before patching, verify that your /etc/conf/pack.d/kernel/vm.o is the same
as mine! On my system, the /bin/sum generated checksum of vm.o was
"4440 222".

The problem results from a sign-extension bug.
The function lfubyte(), which is called by fubyte(), is declared as

        int lfubyte(char *addr);        /* actually caddr_t */

The byte is fetched with

        val = *addr;

which triggers sign extension. Casting addr to an unsigned char * or
declaring it as such solves the problem.

This bug is still present in stock USL 4.0.4. However, it has been fixed in
Dell 2.2.

Raymond Nijssen contributes the following:

---- README --------------------------------------------------------------->8--
This shell script was written to help out people who are less experienced in
patching kernel binaries.

This version can be used to fix the fubyte bug in following SVR4 flavors:

        ESIX 4.0.3A
        ESIX 4.0.4
        Dell 2.1
        Consensys 1.3

You need sdb and your system has to be able to rebuild the kernel.

After the patch is applied, you have to rebuild the kernel by running
/etc/conf/bin/idbuild and /etc/conf/bin/idreboot for the patch to take
effect. You have to be root to do all this.

The program will ask for your confirmation before it changes anything.
Please do make a backup first, and remember that you can select the old
kernel (/stand/unix.old) at boot time by pressing the space bar at the
'Booting the ESIX system....' prompt, in case the system fails to boot from
the patched kernel, though this is highly unlikely.

Systems to which this patch was applied have been running flawlessly for
several months, in case you have doubts...

Happy patching!
--------------------------------------------------------------------------->8--
----- fbfix --------------------------------------------------------------->8--
#!/bin/sh
#
# Copyright (c) 1993 Raymond X.T. Nijssen (raymond@woensel.es.ele.tue.nl)
# All Rights Reserved
#
# the bug...
#
b=fubyte
# offsets according to flakey USL sdb. gdb and dis say something different
esix403_o=0x11eb0
esix404_o=0x11683
dell21_o=0x11c98        #dell 2.1
cons13_o=$dell21_o      #consensys 1.3
# data
v=0x458900be    #old
r=0x458900b6    #new
# file
f=/etc/conf/pack.d/kernel/vm.o
# progs
s=/usr/ccs/bin/sdb
i=/etc/conf/bin/idbuild
c='\c';t='\t';n='\n';N=/dev/null
# aux
pe() if [ -n "$e" ];then echo ${n}ERROR: $e $n;e="";fi
yn() { while :;do echo $n$1 [$2] $c;read a;if [ -z "$a" ];then a=$2;fi
case "$a" in y*)return 0;;n*)return 1;;*)echo Answer 'y' or 'n';;esac;done;}
cr() if id|grep "^uid=0">$N;then return 0
else e="Only root may patch the kernel";return 1;fi
ab() { echo ${n}FATAL: $e$n;exit 1;}
ac() { pe;yn "Continue ?" "y";return;}
qu() { R="";if [ -n "$1" ];then d="[$1] :";else d=":";fi
while [ -z "$R" ];do echo ${n}Enter the $2 $d $c;read a
if [ "$a" ];then R=$a;elif [ -n "$1" ];then R=$1;
else e="No $2 entered";ac||exit 0;fi;done;}
# main
if [ ! -t 0 ];then e="This program must not be piped into a shell";ab;fi
if [ ! -f $s ];then e="$s not found";ab;fi
if [ ! -f $f ];then e="$f not found";ab;fi
if [ ! -f $i ];then e="$i not found";ab;fi
echo $n$n${t}YOU are responsible for running this program.$n$n${t}Clauses 9 \
and 10 of the GNU GENERAL PUBLIC LICENSE$n${t}apply to this program.$n$n${t}If \
you continue, you thereby agree that its author, $n${t}nor his employer, nor \
anybody else except yourself, has any $n${t}liability for any loss, damage \
etc. etc.$n
ac||exit 1
echo $n$n${t}Fixable versions with the $b bug$n$n$t$t[1]$t ESIX \
4.0.3A$n$t$t[2]$t ESIX 4.0.4$n$t$t[3]$t DELL 2.1$n$t$t[4]$t Consensys 1.3$n
R=1;qu "$R" "SVR4 flavor this system is running"
case $R in 1)o=$esix403_o;; 2)o=$esix404_o;;3)o=$dell21_o;; 4)o=$cons13_o;;
*)e="Invalid answer";ab;;esac
echo $n${t}Looking for replacement target ... $c
if echo $o:?lx|$s -e $f 2>$N|grep $o/$v>$N;then echo found
if yn "Do you want to patch the kernel now?" "n";then cr||ab
qu "$f.orig" "name of backup file"
if [ -f $R ];then e="File $R already exists";ab;fi
if cp $f $R;then echo $n${t}Copied $f to $R;else e="Failed to write $R";ab;fi
if echo $o!$r|$s -e -w $f>$N 2>&1;then
echo ${n}Fixed $b bug, you may now run $i and reboot$n;else e="$s failed";pe
if cp $R $f;then echo $n${t}Copied $R to $f;else e="Restore $f failed";pe;fi
e="Patch failed!!";ab;fi
fi
else echo not found;e="Replacement target not found at expected offset";ab;fi
--------------------------------------------------------------------------->8--
--
Send your feedback to: Eric Raymond = esr@snark.thyrsus.com