Problem Log

This is a summary of operational problems in the GSI computing infrastructure that affected a large number of users.

May 17^th, 2005: Server
During the Pentecost weekend a disk in one of the RAID arrays of the central WXP file servers broke. On Tuesday morning, the disk was replaced. During the RAID rebuild, a second disk failed, rendering the whole 1.6 TB file system broken. Fixing the hardware and restoring the data from backup took until the end of the week; the restore was by far the most time-consuming part.
October 22^nd, 2004: WAN
From 14:00 until 02:00 in the night the WAN connection was either totally interrupted or lost so many packets that usage was impossible. The problem was caused by a node at GSI 'running wild' and overloading the Cisco router.
September 20^th, 2004: LAN
A local network problem, very similar to the one on September 16^th, occurred around 14:30 and was resolved around 16:00.
September 16^th, 2004: LAN
From around midnight a malfunction of a part of the LAN backbone caused significant connection problems; many services, such as the mail server, were not accessible. This was resolved around 10:30. A second, this time planned, interruption was needed between 15:00 and 15:45 to replace some failed components.
June 2^nd, 2004: WAN
The WAN connection failed around 17:00. GSI and TUD were affected. Telekom was not able to resolve the issue in a timely manner. The WAN connection was finally operational again at 11:00 the next day.
June 2^nd, 2004: Server
The KP3 group server lxgs03 failed during the night, and all associated desktop systems stalled. The server was rebooted around 10:00 am, but seems to have hardware problems.
March 15^th, 2004: Server
Due to a power outage (part of the power distribution failed after some testing and re-organization done on the previous weekend) some essential parts of the Linux infrastructure went down. Among other things, the /u and some /d file servers were affected. This happened around 21:30 in the evening. Normal operation was resumed the following day at around 11:00.
March 3^rd, 2004: Server
The file server lxfs04 crashed around 14:00; the home file systems for the groups ap, alice, bel, bio, cbm, fn, fopi, land, kp1, kp3, pbar, pp, thd, the, uf, ugt, ukt, ul, and ulhf became unavailable. Recovery took about 2 hours.
February 21^st, 2004: Server
The Linux web server www-linux.gsi.de and all other virtual servers hosted on this server had been hung since at least Saturday afternoon. The server was pingable and accepted connections to port 80, but never answered a GET request.
February 8^th, 2004: General
A thunderstorm caused power glitches on Sunday. As a consequence the UPS system for the central services failed to operate, and caused a power loss for a large part of the central services. Because file servers were involved, this caused widespread effects. Systems not connected to the UPS by and large continued to operate.
February 1^st, 2004: Linux
The server lxfs06 failed in the night from Sunday to Monday due to a problem with the processor box. The chassis was changed during Monday; no data was lost. Among the affected file systems was /d/kp3.
January 13^th, 2004: TSM
Recalling files in the GSI mass storage system is sometimes very slow, with transfer rates in the 100 kByte/sec range on an otherwise idle system. This problem had already started in late December, after a TSM upgrade.
January 13^th, 2004: Server
The web server www-linux.gsi.de, and with it the virtual server www-kp3.gsi.de, hung in the evening. It was recovered the next morning.
January 13^th, 2004: Server
For a good part of the day the KP3 desktop clients were unavailable or unusable due to an urgent server upgrade, rebooting, and some mistakes made in the process.
January 8^th, 2004: Server
The software RAID, which holds the ~/web-docs/ data, failed due to a double disk error. The data had to be restored from disk. About one day of data was lost. The KHuK vote was affected, and had to be restarted.
January 5^th, 2004: Linux
Some Linux systems and accounts were found to be compromised. Corrective actions took the whole week, and caused various service interruptions.
December 30^th, 2003: Mail
The MTAs started to develop problems around the 30th. For two days mails weren't accepted, and were thus stored at TUD. Later, the MTAs still had intermittent problems. An additional problem with the TUD mailing services caused delivery delays of the emails stored at TUD on 30/31. Some of them were delivered as late as January 12th.
December 28^th, 2003: TSM
Due to a failed power supply of a disk sub-system the TSM service stopped working, and with it all automatic backups.
December 12^th, 2003: Mail
The IMAP server wntmailsv.gsi.de failed due to a software problem during the night from Thursday to Friday. It went online again at noon, but was able neither to receive emails from outside GSI nor to send emails via SMTP. Full operation was restored around noon on Monday. All emails, with very few exceptions, were buffered in the MTAs and delivered on Monday and finally Tuesday.
November 11^th, 2003: Mail
The IMAP server wntmailsv.gsi.de was unavailable between 19:00 and the next morning.
October 27^th, 2003: Mail
The IMAP server wntmailsv.gsi.de was unavailable between about 10:10 and 10:40.
October 19^th, 2003: Server
The 'new' GSI web server www-new.gsi.de was unavailable from sometime Sunday until Monday morning. See also September 18^th.
October 10^th, 2003: Linux
The lxfs04 file server crashed at about 19:00 and came back after some time. This left some clients with 'NFS stales'; some reboots were necessary. Similar crashes last occurred on September 16^th, August 9^th, and July 21^st.
September 21^st, 2003: Mass Storage
The tsmcli server process became unresponsive on Sunday. The TSM based mass storage access was unavailable until Tuesday morning.
September 18^th, 2003: Network
Around 13:00 the central GigaBit-Ethernet switch again had a problem, like four days earlier. This time Linux file servers and clients mostly survived, but the VMS systems (and thus accelerator operation) suffered. The network was operational again after about 45 minutes. Recovery of SIS operations required several hours.
September 18^th, 2003: Server
The hardware of the new GSI web server www-new.gsi.de and web proxy failed; both were unavailable all morning.
September 16^th, 2003: Linux
In the early morning the lxfs04 file server crashed with 'kernel panic'. Recovery took some hours; many KP3 nodes had to be rebooted to resolve NFS stales. The crash reason was a genuine panic this time, not related to the Ethernet cards, which had all been exchanged.
September 14^th, 2003: Network
At about 21:00 a board in the central GigaBit-Ethernet switch failed. As a consequence much of the GSI LAN became inoperable. Accelerator operation as well as Linux and Windows were affected. The hardware problems were resolved Monday morning, but full recovery took until noon.
September 8^th, 2003: Linux
At 3:10 one half of the UPS system went down (a fuse blew for unknown reasons) and by 5:10 the other half switched off due to overload. As a consequence the whole Linux infrastructure lost power. The restart/repair took until about 14:00.
August 12^th, 2003: Linux
A glitch in lxfs01 caused problems with the /usr/local file system on many clients. The batch farm lost most of its jobs, and some interactive systems suffered too.
August 12^th, 2003: Mass Storage
The IBM tape robot went out-of-operation around 3:20 and was back online around 13:50.
August 12^th, 2003: Linux
The RAID array holding the /d/kp301 file system had a disk failure. Due to firmware bugs this led to SCSI errors, which in turn crashed the lxfs07 file server. Restarts, rebuilds, and file system checks were done by noon; some file systems were still offline in the afternoon.
August 9^th, 2003: Linux
The file server lxfs04, serving all KP3 home directories, crashed around 16:00. Since nobody called the 'on-call' personnel to fix this, the situation was only fixed on Monday morning. Problems with 'stale NFS' on lxi* and lxb* systems persisted until lunch time.
July 30^th, 2003: WAN
Internet connection down between 10:00 and 12:20 due to a failure in the German backbone.
July 22^nd, 2003: Mass Storage
The fileserver lxfs01, which serves /usr/local, failed and had to be rebooted. Unfortunately all fileservers had to be rebooted around 15:30. Again, 'stale NFS' problems made a reboot of some client machines necessary.
July 21^st, 2003: Linux
The fileserver lxfs04, which serves among other things the /net/home9 file system with all kp3 home directories, crashed around 18:00. Due to 'stale NFS' problems a variety of processes had to be killed and client machines had to be rebooted.
July 19^th, 2003: Mass Storage
A temperature increase in the main computer room due to a failure of the main cooling system caused several outages and failures. The IBM3494 robot paused, no requests were serviced through the weekend. Two drives were damaged. Several other AIX services were also interrupted.
July 16^th, 2003: WAN
At about 19:00 the DoS attack was restarted, and the WAN connection was interrupted to protect local systems. In the morning the connection was restored after all ICMP traffic to/from GSI was suppressed at the backbone level.
July 16^th, 2003: Linux
During the afternoon the "/net/home9" file server failed, which holds, among many others, the /u/aladin and /u/kp3soft file systems. Many kp3 clients had to be rebooted to resolve 'stale NFS' problems. As a consequence of the reboots the S254 DACQ system had to be restarted (during a beam time break).
July 15^th, 2003: WAN
Between about 10:00 and 12:30 the GSI WAN connection was dysfunctional, with large ping delays and packet losses due to a DoS attack.
June 25^th, 2003: Network
Due to a failure in the GSI network backbone most IT services including Linux and WNT desktops were inoperable between about 9:30 and 14:00.
June 7^th, 2003: Linux
Due to a failure of a rzserv node, NIS and SMTP became unavailable on Saturday. As a consequence most central nodes hung (lxi***, batch farm). Recovered after Pentecost ('Pfingsten') on Tuesday morning.
June 7^th, 2003: Mass Storage
The IBM mass storage system broke down due to a mechanical failure: the toothed belt in the robot mechanics broke. No mass storage access during the holiday weekend; fixed on Tuesday.
June 3^rd, 2003: Mass Storage
Due to a drive motor malfunction the volume A02537 was physically damaged. It contained rootified raw data and NTuple files from the March S254 run. A recovery attempt on June 10th destroyed the tape and significantly damaged another drive. All files on this volume are irrecoverably lost.
May 16^th, 2003: Linux
Due to a botched upgrade attempt of a software component all Linux clients started to hang at some time after Friday afternoon. All clients had to be rebooted by Monday to recover from this.
April 10^th, 2003: Linux
The file server lxgs01 was unavailable between 16:00 and 17:20. All Linux clients got stuck during this time.
April 9^th, 2003: Linux
The file server lxgs04 hung at 10:30. It came back after a reboot around 12:30. Some clients had to be rebooted.
March 27^th, 2003: Linux
The group server lxgs03 crashed around 18:30. The file system was corrupted (a file with > 10 GB appeared) and crashed the server. It was recovered the next morning at 8:00. This also caused tsmcli and/or adsmcli sessions to hang for 12 hours.
March 23^rd, 2003: Linux
The group server lxgs04 crashed (last message was 'fan failure'). Came back after power off/on and some file system repairs. Down time about 3 hours (thanks to the on call and voluntary help folks). Most KP3 clients had to be rebooted to get rid of 'stale NFS file handles'.
March 3^rd, 2003: Linux
The group server lxgs03 was corrupted due to an attempt to move a user account. From 18:00 until 10:00 the next day the KP3 clients were effectively unavailable. The problem was diagnosed and fixed.
January 30^th, 2003: Linux
The file server lxfs04, which hosts the KP3 home file system, was unavailable during the whole morning.
October 9^th, 2002: Linux
The /d/kp3 is finally available again after more than 5 weeks of problems or total unavailability (see notes on September 2nd, September 3rd, and September 24th). The whole frame was exchanged, with no improvement. Finally all disks were also exchanged. It turned out that a total of 7 files could not be read. Access to those files produced errors without the RAID flagging internal disk errors. That should never happen and shows that the EasyRAID firmware is buggy. In addition, problems with the SCSI termination had to be resolved (the internal termination is buggy; an external one has to be used). Finally TRANSTEC support suggested disabling the SCSI parity check (it supposedly caused spurious errors).
A repair time of 5 weeks, error situations which simply shouldn't happen, and the parity and termination issue lead to the conclusion that the purchased EasyRAID system is more than shaky.
September 24^th, 2002: Linux
The /d/kp3 file system is again unreliable and fails several times a day. It is dismounted for a detailed investigation.
September 11^th, 2002: Linux
The lxfs04:/home9 was unavailable between 9:05 and about noon time.
September 3^rd, 2002: Linux
The /d/kp3 again caused many SCSI errors. Some hardware (cable, controller) was changed with no improvement; the file system had to be dismounted.
The /d/kp3 was available again in the afternoon of September 11th. Total downtime was 8 days!
Also the /s became unavailable after RAID problems. Same is true for /d/kp1 and /d/kp2. So a large fraction of the data file systems went offline.
September 2^nd, 2002: Linux
The RAID controller of the /d/kp3 data file system failed twice, once on Sunday and once on Monday. No access for several hours, `stale NFS' problems on some nodes.
August 24^th, 2002: Linux
The file server lxfs04, which hosts the KP3 user file systems, failed after a scheduled re-arrangement of the file servers. No file service during the weekend and most of Monday.
January 24^th, 2002: ListServ
The ListServ service failed Thursday evening. It was back in operation on Monday morning, January 28^th. Some sent messages were lost.
January 18^th, 2002: all
Announced shutdown: Due to work in the power distribution all central services (Linux, WNT, AIX, VMS, Lynx) were unavailable between Friday 8:00 and Monday 13:00.
January 7^th, 2002: WAN
G-WIN connection unavailable 13:10 till 16:00.
January 6^th, 2002: Mass Storage
ADSM was unavailable from Sunday about 3:00 until Monday morning.
November 19^th, 2001: WNT
Due to a NIS configuration error all group definitions were unavailable between Monday night and Tuesday morning. As a consequence all logins on the Linux system failed.
November 5^th, 2001: WNT
Yet another massive virus problem. P:\scratch closed.
October 17^th, 2001: Linux
Due to software upgrades the X server configuration was destroyed on many Linux systems. Various effects, from full failure over bad performance to functional deficiencies (back to PseudoColor) were the consequence. It took many days to fully recover on all systems.
October 17^th, 2001: Mail
Due to a reconfiguration of the WNT system and the IMAP server it was no longer possible to access the IMAP server with clients other than MS Outlook. The new configuration required authentication via NTLM, which is unpublished and MS proprietary. The configuration change was undone by October 29^th; thus email service was severely restricted for almost 2 weeks.
October 15^th, 2001: ALL
Due to hacker attacks all platforms were unavailable for two days. All accounts were disabled, all passwords had to be renewed.
September 13^th, 2001: Linux
Much of the Linux cluster was unavailable between 2:00 and 14:00 due to a hardware failure (broken memory) of a file server.
June 15^th, 2001: Linux
Telnet and ssh under Linux have problems when the external IP connection is unavailable (see below). rlogin works.
June 15^th, 2001: WAN
The Internet connection was again unavailable for many hours.
June 13^th, 2001: WAN
The Internet connection was unavailable for many hours in the evenings of June 13^th and 14^th.
June 13^th, 2001: VMS
The whole VMS cluster was unavailable during the afternoon.
March 16^th, 2001: WAN
Due to a router failure outside GSI no Internet access between 17:30 and 18:30.
February 21^st, 2001: WNT
www-wnt.gsi.de webserver temporarily unavailable due to hardware problems. Same happened the day before.
February 18^th, 2001: Network
Due to software problems in rzserv1/rzserv2 some core server functions were temporarily unavailable (POP, Listserv, DNS, lpr ...). Varying impact. The situation was resolved on Tuesday, February 20^th.
February 14^th, 2001: WAN
Due to a router failure outside GSI no Internet access from 16:00 to 17:15.
February 13^th, 2001: Mail
The `AnnaKournikova.jpg.vbs' virus spread through the world and GSI in the morning; most accounts got 10-20 copies of it. All SMTP traffic to and from the Exchange server was blocked between 11:00 and 17:00, and all emails (in- or outbound) received in this time were discarded.
December 6^th, 2000: WAN
The system server lxgs03 went down in the late evening, stopping all KP3 Linux systems till the next morning.
August 26^th, 2000: WAN
File server lxfs02 was unresponsive and had to be rebooted. Several hours of downtime for most of the Linux systems.
August 2^nd, 2000: Linux
The fileserver lxfs02 (serving some /u file systems) crashed around 10:30. Recovered at 15:00 because fsck took so long.
April 1^st, 2000: Linux
The KP3 group server lxgs03 stopped working due to a power supply failure. The node was fixed on Monday morning. All KP3 clients hung over the weekend.
February 21^st, 2000: Linux
The home file server system was overloaded due to the activities of 40 batch jobs overwhelming the /u file system with data. The response was so slow that all Linux systems were useless for most practical work from Friday evening until about 15:00.
January 13^th, 2000: Networks
The nameserver and NIS were unresponsive, causing all kinds of malfunctions in most Linux systems. The failure happened late in the evening of the 13th and was fixed by lunch time the next day.
November 24^th, 1999: VMS
The CLEX2 system disk filled to the last block due to crash dump logs of the mailing system around 16:00. This forced a cluster reboot, the first since the VMS upgrade on September 28^th. This disabled all VMS work for 2+ hours.
The unrelated double failure of VMS and Linux made the 4^th floor at GSI an almost computer-free zone...
November 24^th, 1999: Linux
The NFS server lxfs01 crashed at about 16:00 due to GigaBit Ethernet problems. The fsck took 3+ hours, effectively stopping the whole Linux system during this time.
October 31^st, 1999: Linux
The NFS server lxfs02 was inaccessible from at least kp3pck between about Sunday 31.10.1999 12:20 and Monday 1.11.1999 12:50.
October 16^th, 1999: VMS
The whole VMS cluster hung from sometime Saturday until Monday morning. No login was possible; the session hung after entering the user name and never asked for a password.
September 29^th, 1999: Mail
The MX mail service didn't work anymore after the VMS upgrade to 7.2 on September 28th. Incoming and outgoing emails were lost due to a crashing server process. MX was stopped. The VMS SMTP mail service was installed on the evening of September 30th as a temporary fix. Basic mail service works, but there is no mailing list support.
August 6^th, 1999: Network
One of the two switches driving the thin-wire Ethernet segments on the 4th floor failed in the late evening, grounding a good fraction of the older equipment. Fixed in the morning of August 9th.
August 5^th, 1999: Linux
Server lxfs02 was unavailable between 7:00 and 14:00. Crash of server caused by SCSI resets on system disk branch. Server didn't reboot properly because the parallel fsck caused too much swap activity.
July 28^th, 1999: Linux
Server lxfs02 was unavailable between 16:30 and 20:30. Since /u/kp3web is provided by this server, all accesses to KPIII webspace ended in a somewhat misleading 403 error.
July 17^th, 1999: Linux
Server lxfs02 hung at about 18:00. It was fixed on July 19th (reboot, fsck...). Many activities hung during this time because /home4 is served by this node.
July 12^th, 1999: Mass Storage
No access to the archive kp3sys anymore. This was caused by a change to the access configuration list. Finally fixed on July 21st.
June 16^th, 1999: Network/VMS
The whole VMS cluster had many hangs for about one day. This was caused by one system which had network communication problems, which were in turn caused by an ill-configured thin-wire segment (missing terminator, installation done by untrained personnel...).
June 15^th, 1999: Network/Linux
The nodes lxi001 to lxi005 had a network I/O rate of only about 100-200 kbyte/sec for about one day. This was caused by an autonegotiation problem between the servers and the switch after a component in the network switch was hot-swapped.
June 6^th, 1999: Linux
Due to a failure of file server lxfs02 most of the Linux systems were not usable from Sunday, June 6th until Monday, June 7th around 17:00. This was caused by a broken Gigabit Ethernet interface card.
June 4^th, 1999: Mail
The node AXP602 crashed around June 4th 1:00 and was rebooted on June 5th 9:00. Much of VMS mail delivery was suspended during this time because all @gsi.de addresses point to axp602 and the service didn't fail-over to AXP601.
November 16^th, 1998: LAN
The whole LAN of the east end 4th floor was down between 10:00 and 11:00 due to a power failure followed by a component failure. Part of the VMS cluster crashed, Linux systems and Xterms were unavailable.
November 9^th, 1998: Mass Storage
adsmcli was unreliable throughout the night, rejecting requests with "All server sessions are currently in use".
November 5^th, 1998: Mass Storage
The tape robot was unavailable between 0:23 and 0:44, all retrieve and archive requests failed during this time. Again a software malfunction, the system recovered by itself.
November 4^th, 1998: Mass Storage
The tape robot was unavailable between 0:28 and 10:23. This was caused by a software malfunction and was recovered with a full restart of ADSM.
November 4^th, 1998: Mail
The central AIX mail and POP mail server (clri6a) failed due to a disk malfunction. All AIX and WNT users had no access to email.
Recovered in the evening of November 5^th.
November 3^rd, 1998: Mass Storage
The tape robot hung between 2:50 and 6:25. Cleared by operators.
October 24^th, 1998: Mass Storage
The tape robot went into `pause mode' at 1:30 after a problem with the accessor.
Was recovered on Monday morning by resetting the system. Unfortunately, nobody called the on-call operators, who could have fixed this easily.
October 14^th, 1998: Linux
The system linux1 failed due to a broken swap file system. It was not revived before Monday, October 19^th, because the whole Linux system group was not at GSI.
September 25^th, 1998: Mass Storage
The accessor of the tape robot failed around 14:00, all ADSM activities stalled or timed out. The system was available again at 21:50.
September 22^nd, 1998: WAN
The WAN connection was unavailable for a few hours because a link between the DFN-Switch at the TU and the next hop in Frankfurt failed.
August 21^st, 1998: Mass Storage
The whole tape robot was unavailable due to a hardware failure since about 17:00.
Was recovered the next day at around 12:00. Cause was a broken accessor.
August 13^th, 1998: Network
A power outage of some network equipment, caused by some work on the power lines, caused a complete network outage for KPI and KPIII systems.
Was recovered after about 30 minutes, most sessions were lost.
July 31^st, 1998: Linux
Some programs, e.g. netscape_new and xemacs, don't start anymore on the nodes linux1 to linux5. Instead one gets the error
```
  /usr/local/....: can't load library 'libc.so.6'
```
This problem comes and goes. The node kp3pci was not affected.
July 5^th, 1998: AIX
The printer queue p41wcs hung, no job was printed.
Problem cleared by the operators on Monday morning (queue re-enabled).
July 5^th, 1998: VMS
The disk $1$DKD900: (DC057) holding cern$root went into MountVerify in the early morning. Since cernlogin as well as toollogin access this disk most logins on the AXP cluster hang.
Problem disappeared after some time by itself.
July 3^rd, 1998: VMS
The logins to the VAX cluster hang in the password prompt, the machines seem to hang as well. Cause: disk problems with frs$root, kp3$broot,...
Problem cleared after a reboot of some nodes on July 4^th.
July 3^rd, 1998: VMS
Unannounced cluster reboot of AXP cluster (or a cluster crash) at 0:10.
July 1^st, 1998: Networks
The reverse translations of GSI IP addresses failed. This caused problems with xhost authentication, among other things.
Problem cleared soon after problem was identified.
June 30^th, 1998: VMS
The disk cern$root was hung, again causing login problems.
Problem cleared within a few hours.
June 29^th, 1998: Networks
The primary nameserver was unresponsive, causing a timeout wait of a few seconds for each namelookup.
Problem cleared soon after problem was identified.
June 28^th, 1998: Mass Storage
The response of adsmcli is sluggish in the early morning hours, one archive operation aborts with `incomplete data buffer sent to server'. Later in the evening any adsmcli retrieve request simply hangs, a retry produces a `staged object empty' message.
Problem cleared the next morning.
June 28^th, 1998: VMS
The disk holding tool$root was hung. The effect was disruptive because the publiclogin accesses this disk. Access to PD tools and adsmcli was blocked.
Problem cleared on June 29^th by a full cluster reboot.


Walter F.J. Müller

Created: July 1st, 1998 Last modified: Thu May 19 18:46:48 CEST 2005
