This is a summary of operational problems in the GSI computing
infrastructure that have affected a large number of users.
May 17th, 2005: Server
During the Pentecost weekend a disk in one of the RAID arrays of the
central WXP file servers broke. On Tuesday morning, the disk was
replaced. During the RAID rebuild, a second disk failed, rendering
the whole 1.6 TB file system broken. Fixing the hardware and
restoring the data from backup took until the end of the week,
with the restore being by far the most time-consuming part.
October 22nd, 2004: WAN
From 14:00 till 02:00 in the night the WAN connection was either
totally interrupted or lost so many packets that usage was impossible.
The problem was caused by a node at GSI 'running wild' and overloading
the Cisco router.
September 20th, 2004: LAN
A local network problem, very similar to the one on
September 16th, occurred around 14:30
and was resolved around 16:00.
September 16th, 2004: LAN
From around midnight a malfunction of a part of the LAN backbone
caused significant connection problems; many services, such as the
mail server, were not accessible. This was resolved around 10:30.
A second, this time planned, interruption was needed between
15:00 and 15:45 to replace some failed components.
June 2nd, 2004: WAN
The WAN connection failed around 17:00. GSI and TUD were affected.
Telekom was not able to resolve the issue in a timely manner. The
WAN connection was finally operational again at 11:00 the next day.
June 2nd, 2004: Server
The KP3 group server lxgs03 failed during the night, and all
associated desktop systems stalled. The server was rebooted around
10:00 am, but seems to have hardware problems.
March 15th, 2004: Server
Due to a power outage (part of the power distribution failed after
some testing and re-organization done on the previous weekend) some
essential parts of the Linux infrastructure went down. Among
other things, the /u and some /d file servers were affected.
This happened around 21:30 in the evening. Normal operation was
resumed the following day at around 11:00.
March 3rd, 2004: Server
The file server lxfs04 crashed around 14:00, and the home
file systems for the groups ap, alice, bel, bio, cbm, fn, fopi, land,
kp1, kp3, pbar, pp, thd, the, uf, ugt, ukt, ul, and ulhf became
unavailable. Recovery took about 2 hours.
February 21st, 2004: Server
The Linux web server www-linux.gsi.de and all other virtual
servers hosted on this server had been hung since at least Saturday afternoon.
The server was pingable and accepted connections to port 80, but
never answered a GET request.
February 8th, 2004: General
A thunderstorm caused power glitches on Sunday. As a consequence
the UPS system for the central services failed to operate, causing
a power loss for a large part of the central services.
Because file servers were involved, this caused widespread effects.
Systems not connected to the UPS by and large continued to operate.
February 1st, 2004: Linux
The server lxfs06 failed in the night from Sunday to
Monday due to a problem with the processor box. The chassis was
changed during Monday; no data was lost. Among the affected
file systems was /d/kp3.
January 13th, 2004: TSM
Recalling files in the GSI mass storage system is sometimes very
slow, with transfer rates in the 100 kByte/sec range on an
otherwise idle system. This problem had already started in late
December, after a TSM upgrade.
January 13th, 2004: Server
The web server www-linux.gsi.de and with it the virtual
server www-kp3.gsi.de hung in the evening. They were recovered
the next morning.
January 13th, 2004: Server
For a good part of the day the KP3 desktop clients were unavailable or
unusable due to an urgent server upgrade, rebooting, and some
mistakes made in the process.
January 8th, 2004: Server
The Software RAID, which holds the ~/web-docs/ data,
failed due to a double disk error. The data had to be restored
from disk. About one day of data was lost. The KHuK vote was
affected, and had to be restarted.
January 5th, 2004: Linux
Some Linux systems and accounts were found to be compromised.
Corrective actions took the whole week, and caused various service
interruptions.
December 30th, 2003: Mail
The MTAs started to develop problems around the 30th. For two days
mails weren't accepted, and were thus stored at TUD. Later, the MTAs
still had intermittent problems. An additional problem with the
TUD mailing services caused delivery delays of the emails stored
at TUD on 30/31. Some of them were delivered as late as January 12th.
December 28th, 2003: TSM
Due to a failed power supply of a disk sub-system the TSM service
stopped working, and with it all automatic backups.
December 12th, 2003: Mail
The IMAP server wntmailsv.gsi.de failed due to a software
problem during the night from Thursday to Friday. It went online
again at noon time, but without being able to receive emails from
outside GSI nor to send emails via SMTP. Full operation was restored
around noon time on Monday. All emails, with very few exceptions,
were buffered in the MTAs and delivered on Monday and finally
Tuesday.
November 11th, 2003: Mail
The IMAP server wntmailsv.gsi.de was unavailable between
19:00 and the next morning.
October 27th, 2003: Mail
The IMAP server wntmailsv.gsi.de was unavailable between
about 10:10 and 10:40.
October 19th, 2003: Server
The 'new' GSI Web Server www-new.gsi.de was unavailable
from sometime Sunday till Monday morning. See also
September 18th.
October 10th, 2003: Linux
The lxfs04 file server crashed at about 19:00 and came
back after some time. This left some clients with 'NFS stales';
some reboots were necessary. Last similar crashes on
September 16th,
August 9th, and
July 21st.
September 21st, 2003: Mass Storage
The tsmcli server process became unresponsive on Sunday.
The TSM based mass storage access was unavailable till Tuesday
morning.
September 18th, 2003: Network
Around 13:00 the central GigaBit-Ethernet switch again had a
problem, like four days ago. This time
Linux Fileservers and clients mostly survived, but the VMS
systems (and thus accelerator operation) suffered. The network
was operational again after about 45 minutes. Recovery of SIS
operations required several hours.
September 18th, 2003: Server
The hardware of the new GSI webserver www-new.gsi.de and
webproxy failed; both were unavailable all morning.
September 16th, 2003: Linux
In the early morning the lxfs04 file server crashed with
'kernel panic'. Recovery took some hours; many KP3 nodes had to be
rebooted to resolve NFS stales. The crash reason was a genuine
panic this time, not related to the Ethernet cards which had
all been exchanged.
September 14th, 2003: Network
At about 21:00 a board in the central GigaBit-Ethernet switch
failed. As a consequence much of the GSI LAN became
inoperable. Affected was accelerator operation as well as Linux
and Windows. The hardware problems were resolved Monday morning,
but full recovery took till noon.
September 8th, 2003: Linux
At 3:10 one half of the UPS system went down (a fuse blew for
unknown reasons) and by 5:10 the other half switched off due to
overload. As a consequence the whole Linux infrastructure lost
power. The restart/repair took till about 14:00.
August 12th, 2003: Linux
A glitch in lxfs01 caused problems on many clients with
the /usr/local file system. The batch farm lost most of
its jobs, some interactive systems suffered too.
August 12th, 2003: Mass Storage
The IBM tape robot went out of operation around 3:20 and was
back online around 13:50.
August 12th, 2003: Linux
The RAID array holding the /d/kp301 file system had
a disk failure. Due to firmware bugs this led to SCSI errors,
which in turn crashed the lxfs07 file server. Restarts,
rebuilds, and file system checks were done by noon; some file
systems were still offline in the afternoon.
August 9th, 2003: Linux
The file server lxfs04, serving all KP3 home directories,
crashed around 16:00. Since nobody called the on-call personnel to fix
this, the situation was only fixed on Monday morning. Problems with 'stale
NFS' on lxi* and lxb* systems persisted till lunch time.
July 30th, 2003: WAN
Internet connection down between 10:00 and 12:20 due to a
failure in the German backbone.
July 22nd, 2003: Mass Storage
The fileserver lxfs01, which serves /usr/local,
failed and had to be rebooted. Unfortunately all fileservers
had to be rebooted around 15:30. Again, 'stale NFS' problems made
a reboot of some client machines necessary.
July 21st, 2003: Linux
The fileserver lxfs04, which serves among other things the
/net/home9 file system with all kp3 home directories,
crashed around 18:00. Due to 'stale NFS' problems a variety
of processes had to be killed and client machines to be rebooted.
July 19th, 2003: Mass Storage
A temperature increase in the main computer room due to a failure
of the main cooling system caused several outages and failures. The
IBM3494 robot paused, no requests were serviced through the weekend.
Two drives were damaged. Several other AIX services were also
interrupted.
July 16th, 2003: WAN
At about 19:00 the DoS attack was restarted, and the
WAN connection was interrupted to protect local systems.
In the morning the connection was restored after all ICMP
traffic to/from GSI was suppressed at the backbone level.
July 16th, 2003: Linux
During the afternoon the "/net/home9" file server failed,
which holds, among many others, the /u/aladin and /u/kp3soft
file systems. Many kp3 clients had to be rebooted to resolve
'stale NFS' problems. As a consequence of the reboots the
S254 DACQ system had to be restarted (during beam time break).
July 15th, 2003: WAN
Between about 10:00 and 12:30 the GSI WAN connection was
dysfunctional, with large ping delays and packet losses due to
a DoS attack.
June 25th, 2003: Network
Due to a failure in the GSI network backbone most IT services
including Linux and WNT desktops were inoperable between about
9:30 and 14:00.
June 7th, 2003: Linux
Due to a failure of a rzserv node NIS and SMTP became unavailable
on Saturday. As a consequence most central nodes hung (lxi***,
batch farm). Recovered after Pentecost, on Tuesday morning.
June 7th, 2003: Mass Storage
The IBM mass storage system broke down due to a mechanical failure.
The toothed belt in the robot mechanics broke. No mass storage access
during the holiday weekend; fixed on Tuesday.
June 3rd, 2003: Mass Storage
Due to a drive motor malfunction the volume A02537 was physically
damaged. It contained rootified raw data and NTuple files from the
March S254 run. A recovery attempt on June 10th destroyed the tape
and significantly damaged another drive. All files on this volume
are irrecoverably lost.
May 16th, 2003: Linux
Due to a botched upgrade attempt of a software component all Linux
clients started to hang at some time after Friday afternoon. All
clients had to be rebooted by Monday to recover from this.
April 10th, 2003: Linux
The file server lxgs01 was unavailable between
16:00 and 17:20. All Linux clients got stuck during this time.
April 9th, 2003: Linux
The file server lxgs04 hung at 10:30. Came back
after a reboot around 12:30. Some clients had to be rebooted.
March 27th, 2003: Linux
The group server lxgs03 crashed around 18:30. The file
system was corrupted (a file with > 10 GB appeared) and crashed
the server. It was recovered the next morning at 8:00. This also
caused tsmcli and/or adsmcli sessions to hang for 12 hours.
March 23rd, 2003: Linux
The group server lxgs04 crashed (last message was 'fan
failure'). Came back after power off/on and some file system
repairs. Down time about 3 hours (thanks to the on call and
voluntary help folks). Most KP3 clients had to be rebooted to
get rid of 'stale NFS file handles'.
March 3rd, 2003: Linux
The group server lxgs03 was corrupted due to an attempt
to move a user account. From 18:00 till next day 10:00 the KP3
clients were effectively unavailable. The problem was diagnosed
and fixed.
January 30th, 2003: Linux
The file server lxfs04, which hosts the KP3 home file
system, was unavailable during the whole morning.
October 9th, 2002: Linux
The /d/kp3 file system is finally available again after more than 5 weeks
of problems or total unavailability (see notes on
September 2nd,
September 3rd,
and September 24th). The whole frame was
exchanged, with no improvement. Finally all disks were also exchanged.
It turned out that a total of 7 files could not be read. Access
to those files produced errors without the RAID flagging
internal disk errors. That should never happen and shows that
the EasyRAID firmware is buggy. In addition, problems with the
SCSI termination had to be resolved (the internal one is buggy; one has to
use an external one). Finally TRANSTEC support suggested disabling the
SCSI parity check (it supposedly caused spurious errors).
A repair time of 5 weeks, error situations which simply shouldn't
happen, and the parity and termination issue lead to the conclusion
that the purchased EasyRAID system is more than shaky.
September 24th, 2002: Linux
The /d/kp3 file system is again unreliable and fails several
times a day. It is dismounted for a detailed investigation.
September 11th, 2002: Linux
The lxfs04:/home9 was unavailable between 9:05 and about
noon time.
September 3rd, 2002: Linux
The /d/kp3 caused again many SCSI errors. Some hardware
(cable, controller) was changed, no improvement, the file system
had to be dismounted.
The /d/kp3 was available again in the afternoon of
September 11th. Total downtime was 8 days!
Also the /s became unavailable after RAID problems.
Same is true for /d/kp1 and /d/kp2. So a
large fraction of the data file systems went offline.
September 2nd, 2002: Linux
The RAID controller of the /d/kp3 data file system failed twice,
once on Sunday and once on Monday. No access for several hours,
`stale NFS' problems on some nodes.
August 24th, 2002: Linux
The file server lxfs04 which hosts the KP3 user file systems
failed after a scheduled re-arrangement of the file servers.
No file service during the weekend and most of Monday.
January 24th, 2002: ListServ
The ListServ service failed Thursday evening. Back operating on
Monday morning, January 28th. Some sent messages were
lost.
January 18th, 2002: all
Announced shutdown: Due to work in the power distribution all central
services (Linux, WNT, AIX, VMS, Lynx) were unavailable between Friday
8:00 and Monday 13:00.
January 7th, 2002: WAN
G-WIN connection unavailable 13:10 till 16:00.
January 6th, 2002: Mass Storage
ADSM was unavailable from Sunday about 3:00 till Monday morning.
November 19th, 2001: WNT
Due to a NIS configuration error all group definitions were
unavailable between Monday night and Tuesday morning. As a
consequence all logins on the Linux system failed.
November 5th, 2001: WNT
Yet another massive virus problem. P:\scratch closed.
October 17th, 2001: Linux
Due to software upgrades the X server configuration was destroyed
on many Linux systems. Various effects, from full failure over bad
performance to functional deficiencies (back to PseudoColor) were
the consequence. It took many days to fully recover on all systems.
October 17th, 2001: Mail
Due to a reconfiguration of the WNT system and the IMAP server
it was no longer possible to access the IMAP server with clients
other than MS Outlook. The new configuration required authentication
via NTLM, which is unpublished and MS proprietary. The configuration
change was undone by October 29th, thus email service
was severely restricted for almost 2 weeks.
October 15th, 2001: ALL
Due to hacker attacks all platforms were unavailable for two days.
All accounts were disabled, all passwords had to be renewed.
September 13th, 2001: Linux
Much of the Linux cluster was unavailable between 2:00 and 14:00
due to a hardware failure (broken memory) of a file server.
June 15th, 2001: Linux
Telnet and ssh under Linux have problems when the external IP
connection is unavailable (see below). rlogin works.
June 15th, 2001: WAN
The Internet connection was again unavailable for many hours.
June 13th, 2001: WAN
The Internet connection was unavailable for many hours in the
evenings of June 13th and 14th.
June 13th, 2001: VMS
The whole VMS cluster was unavailable during the afternoon.
March 16th, 2001: WAN
Due to a router failure outside GSI no Internet access between
17:30 and 18:30.
February 21st, 2001: WNT
www-wnt.gsi.de webserver temporarily unavailable due to hardware
problems. Same happened the day before.
February 18th, 2001: Network
Due to software problems in rzserv1/rzserv2 some core server functions
were temporarily unavailable (POP, Listserv, DNS, lpr ...). Varying
impact. Situation was resolved on Tuesday, February 20th.
February 14th, 2001: WAN
Due to a router failure outside GSI no Internet access from
16:00 to 17:15.
February 13th, 2001: Mail
The `AnnaKournikova.jpg.vbs' virus spread through the world and GSI
in the morning, most accounts got 10-20 copies of it. All SMTP
traffic to and from the Exchange server was blocked between 11:00
and 17:00, all emails (in- or outbound) received in this time were
discarded.
December 6th, 2000: WAN
The system server lxgs03 went down in the late evening,
stopping all KP3 Linux systems till the next morning.
August 26th, 2000: WAN
File server lxfs02 was unresponsive and had to be rebooted.
Several hours of downtime for most of the Linux systems.
August 2nd, 2000: Linux
The fileserver lxfs02 (serving some /u file systems) crashed
around 10:30. Recovered at 15:00 because fsck took so long.
April 1st, 2000: Linux
The KP3 group server lxgs03 stopped working due to a power
supply failure. The node was fixed on Monday morning. All KP3 clients
hung over the weekend.
February 21st, 2000: Linux
The home file server system was overloaded due to activities of
40 batch jobs overwhelming the /u file system with data.
The response was so slow that all Linux systems were useless for
most practical work from Friday evening till about 15:00.
January 13th, 2000: Networks
The nameserver and NIS were unresponsive, causing all kinds of
malfunctions in most Linux systems. Failure happened late in the
evening of the 13th and was fixed by lunch time next day.
November 24th, 1999: VMS
The CLEX2 system disk filled to the last block due to crash dump
logs of the mailing system around 16:00. This forced a cluster reboot,
the first since the VMS upgrade on September 28th.
This disabled all VMS work for 2+ hours.
The unrelated double failure of VMS and Linux made the 4th
floor at GSI an almost computer-free zone...
November 24th, 1999: Linux
The NFS server lxfs01 crashed at about 16:00 due to
GigaBit Ethernet problems. The fsck took 3+ hours, effectively
stopping the whole Linux system during this time.
October 31st, 1999: Linux
The NFS server lxfs02 was inaccessible from at least
kp3pck between about Sunday 31.10.1999 12:20 and
Monday 1.11.1999 12:50.
October 16th, 1999: VMS
The whole VMS cluster hung from sometime Saturday till Monday morning.
No login possible, the session hung after entering the user name and
never asked for a password.
September 29th, 1999: Mail
The MX mail service didn't work anymore after the VMS upgrade to 7.2
on September 28th. Incoming and outgoing emails were lost due to
a crashing server process. MX was stopped. The VMS SMTP mail service
was installed on the evening of September 30th as a temporary fix.
Basic mail service works, but no mailing list support.
August 6th, 1999: Network
One of the two switches driving the thin-wire Ethernet segments
on the 4th floor failed in the late evening, grounding a good fraction
of the older equipment.
Fixed in the morning of August 9th.
August 5th, 1999: Linux
Server lxfs02 was unavailable between 7:00 and 14:00.
Crash of server caused by SCSI resets on system disk branch.
Server didn't reboot properly because the parallel fsck
caused too much swap activity.
July 28th, 1999: Linux
Server lxfs02 was unavailable between 16:30 and 20:30.
Since /u/kp3web is provided by this server, all accesses
to KPIII webspace ended in a somewhat misleading 403 error.
July 17th, 1999: Linux
Server lxfs02 had been hung since about 18:00. Was fixed on July 19th
(reboot, fsck...). Many activities hung during this time because
/home4 is served by this node.
July 12th, 1999: Mass Storage
No access to the archive kp3sys anymore. This was caused by
a change to the access configuration list. Finally fixed on July 21st.
June 16th, 1999: Network/VMS
The whole VMS cluster had many hangs for about one day. This was caused
by one system which had network communication problems, which were in
turn caused by an ill-configured thin-wire segment (missing terminator,
installation done by untrained personnel...).
June 15th, 1999: Network/Linux
The nodes lxi001 to lxi005 had for about one day a
network I/O rate of about 100-200 kbyte/sec. This was caused by an
autonegotiation problem between the servers and the switch after a
component in the network switch was hot-swapped.
June 6th, 1999: Linux
Due to a failure of file server lxfs02 most of the Linux
systems were not usable from Sunday, June 6th to about Monday,
June 7th around 17:00.
This was caused by a broken Gigabit Ethernet interface card.
June 4th, 1999: Mail
The node AXP602 crashed around June 4th 1:00 and was rebooted on
June 5th 9:00. Much of VMS mail delivery was suspended during this
time because all @gsi.de addresses point to axp602 and the service
didn't fail-over to AXP601.
November 16th, 1998: LAN
The whole LAN of the east end 4th floor was down between 10:00 and 11:00
due to a power failure followed by a component failure. Part of the VMS
cluster crashed, Linux systems and Xterms were unavailable.
November 9th, 1998: Mass Storage
adsmcli was unreliable throughout the night, rejecting requests with
"All server sessions are currently in use".
November 5th, 1998: Mass Storage
The tape robot was unavailable between 0:23 and 0:44, all retrieve and
archive requests failed during this time. Again a software malfunction,
the system recovered by itself.
November 4th, 1998: Mass Storage
The tape robot was unavailable between 0:28 and 10:23. This was caused
by a software malfunction and was recovered with a full restart of ADSM.
November 4th, 1998: Mail
The central mail AIX and pop mail server (clri6a) failed due to a disk
malfunction. All AIX and WNT users have no access to email.
Recovered in the evening of November 5th.
November 3rd, 1998: Mass Storage
The tape robot hung between 2:50 and 6:25. Cleared by operators.
October 24th, 1998: Mass Storage
The tape robot went into `pause mode' at 1:30 after a problem with the
accessor.
Was recovered on Monday morning by resetting the system. Unfortunately,
nobody called the on-call operators, who could have fixed this
easily.
October 14th, 1998: Linux
The system linux1 failed due to a broken swap file system.
It was not revitalized before Monday, October 19th,
because the whole Linux system group was not at GSI.
September 25th, 1998: Mass Storage
The accessor of the tape robot failed around 14:00, all ADSM
activities stalled or timed out. The system was available
again at 21:50.
September 22nd, 1998: WAN
The WAN connection was unavailable for a few hours because a link between
the DFN-Switch at the TU and the next hop in Frankfurt failed.
August 21st, 1998: Mass Storage
The whole tape robot was unavailable due to a hardware failure since
about 17:00.
Was recovered the next day at around 12:00. Cause was a broken accessor.
August 13th, 1998: Network
A power outage of some network equipment, caused by some work on the
power lines, caused a complete network outage for KPI and KPIII systems.
Was recovered after about 30 minutes, most sessions were lost.
July 31st, 1998: Linux
Some programs, e.g. netscape_new and xemacs, don't start anymore on
the nodes linux1 to linux5. Instead one gets the error
/usr/local/....: can't load library 'libc.so.6'
This problem comes and goes. The node kp3pci was not affected.
July 5th, 1998: AIX
The printer queue p41wcs hung, no job was printed.
Problem cleared by the operators on Monday morning (reenable queue).
July 5th, 1998: VMS
The disk $1$DKD900: (DC057) holding cern$root went into MountVerify
in the early morning. Since cernlogin as well as toollogin access this
disk most logins on the AXP cluster hang.
Problem disappeared after some time by itself.
July 3rd, 1998: VMS
The logins to the VAX cluster hang in the password prompt, the machines
seem to hang as well. Cause: disk problems with frs$root, kp3$broot,...
Problem cleared after a reboot of some nodes on July 4th.
July 3rd, 1998: VMS
Unannounced cluster reboot of AXP cluster (or a cluster crash) at 0:10.
July 1st, 1998: Networks
The reverse translations of GSI IP addresses failed. This caused
problems with xhost authentication, among other things.
Problem cleared soon after problem was identified.
June 30th, 1998: VMS
The disk cern$root was hung, again causing login problems.
Problem cleared within a few hours.
June 29th, 1998: Networks
The primary nameserver was unresponsive, causing a timeout wait of a few
seconds for each namelookup.
Problem cleared soon after problem was identified.
June 28th, 1998: Mass Storage
The response of adsmcli is sluggish in the early morning hours,
one archive operation aborts with `incomplete data buffer sent to
server'. Later in the evening any adsmcli retrieve request
simply hangs, a retry produces a `staged object empty' message.
Problem cleared the next morning.
June 28th, 1998: VMS
The disk holding tool$root was hung. The effect was disruptive
because the publiclogin accesses this disk. Access to PD tools and
adsmcli was blocked.
Problem cleared on June 29th by a full cluster reboot.