- November 24th, 1997
The SMTP mail delivery hung for the last few days. AXP101 was rebooted.
- October 14th, 1997
-- Cluster crash due to system disk shadow sets
The whole cluster hung after the
DSA0: shadow set lost both its member disks.
A full cluster reboot was necessary to restart the system.
This happened on:
- the evening of October 14th
- the evening of October 13th
- the evening of October 10th
- January 30th, 1997
-- Hard disk errors on $7$dka600::
The disk $7$dka600 (dl180), a QUANTUM XP34300,
part of the al_pool0 volume set (rvn=3), has started to develop hard
errors. The error log history since January 1st:
******************************* ENTRY 11414. *******************************
DATE/TIME 23-JAN-1997 01:29:20.33 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE 000100F0
0A550B0F
010B9301
80000218
0900 RECOVERED ERROR
RECOVERED READ WITH ECC CORRECTION
******************************* ENTRY 11418. *******************************
DATE/TIME 24-JAN-1997 19:46:52.46 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE 000300F0
0A5D5D0E
050B8201
80000011
7000 MEDIUM ERROR
UNRECOVERED READ ERR IN DATA
******************************* ENTRY 11420. *******************************
DATE/TIME 24-JAN-1997 19:59:51.17 SYS_TYPE 0000000F
ERROR TYPE 03 COMMAND TRANSMISSION FAILURE
=== about 120 errors of the same type in 2 minutes ===
******************************* ENTRY 11541. *******************************
DATE/TIME 24-JAN-1997 20:01:54.46 SYS_TYPE 0000000F
ERROR TYPE 03 COMMAND TRANSMISSION FAILURE
******************************* ENTRY 11542. *******************************
DATE/TIME 24-JAN-1997 20:01:55.46 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE 000B0070
0A000000
00000000
0000004E
0000 ABORTED COMMAND
SENSE CODE = 4E(X)
******************************* ENTRY 11543. *******************************
DATE/TIME 24-JAN-1997 20:01:56.58 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 02 CHECK CONDITION
=== no extended sense data ===
******************************* ENTRY 11544. *******************************
DATE/TIME 24-JAN-1997 20:02:02.91 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
READ
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE 000300F0
0A5D5D0E
050B8201
80000011
7000 MEDIUM ERROR
UNRECOVERED READ ERR IN DATA
******************************* ENTRY 11546. *******************************
DATE/TIME 24-JAN-1997 20:39:44.34 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE 000300F0
0A5D5D0E
050B8201
80000011
7000 MEDIUM ERROR
UNRECOVERED READ ERR IN DATA
******************************* ENTRY 11556. *******************************
DATE/TIME 28-JAN-1997 01:17:44.49 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 00 GOOD
EXTENDED SENSE 000100F0
0A144C27
060B0404
80000218
0900 RECOVERED ERROR
RECOVERED READ WITH ECC CORRECTION
******************************* ENTRY 11557. *******************************
DATE/TIME 28-JAN-1997 01:18:45.45 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 00 GOOD
EXTENDED SENSE 000100F0
0A729027
080B0B04
80000218
0800 RECOVERED ERROR
RECOVERED READ WITH ECC CORRECTION
******************************* ENTRY 11559. *******************************
DATE/TIME 28-JAN-1997 01:24:06.64 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 00 GOOD
EXTENDED SENSE 000100F0
0ABC7D29
040B3E04
80000218
0800 RECOVERED ERROR
RECOVERED READ WITH ECC CORRECTION
******************************* ENTRY 11560. *******************************
DATE/TIME 28-JAN-1997 01:29:40.33 SYS_TYPE 0000000F
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI STATUS 00 GOOD
EXTENDED SENSE 000100F0
0A73A12B
070B7704
80000218
0800 RECOVERED ERROR
RECOVERED READ WITH ECC CORRECTION
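The EXTENDED SENSE longwords in the entries above can be unpacked by hand. The following is a minimal Python sketch, not part of any VMS tool. It assumes that each longword is printed as one hexadecimal number (so its bytes appear in reverse of the order in which the drive sent them) and that the data follows the SCSI-2 fixed sense format. The names decode_sense and SENSE_KEYS are introduced here for illustration only.

```python
# Sense key names from the SCSI-2 fixed-format sense data definition.
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0xB: "ABORTED COMMAND",
}

def decode_sense(longwords):
    """Decode the first four EXTENDED SENSE longwords of an error log entry."""
    # Assumption: a longword is shown as a single hex number, so its bytes
    # must be reversed to recover the order used on the SCSI bus.
    raw = b"".join(bytes.fromhex(lw)[::-1] for lw in longwords)
    return {
        "valid": bool(raw[0] & 0x80),             # information field is valid
        "sense_key": SENSE_KEYS.get(raw[2] & 0x0F, "OTHER"),
        "lbn": int.from_bytes(raw[3:7], "big"),   # information field
        "asc": raw[12],                           # additional sense code
        "ascq": raw[13],                          # additional sense code qualifier
    }

# Entry 11418 (24-JAN-1997 19:46:52.46)
entry = decode_sense(["000300F0", "0A5D5D0E", "050B8201", "80000011"])
print(entry["sense_key"], hex(entry["lbn"]), hex(entry["asc"]))
```

Under these assumptions, entries 11418, 11544 and 11546 carry identical sense data, which would mean that all three unrecovered read errors hit the same block.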
- May 8th, 1996 -- DECwindows problem on VSBZ:
The DECWINDOW_RESTART did not re-enable login on VSBZ, even after
several attempts and stopping some DECW processes. VSBZ was rebooted after
about 34 days of uptime.
- April 25th, 1996 -- FDDI Error on AXP602:
The FDDI connection of AXP602 hung around 18:00, causing among
other things many email problems.
- April 24th, 1996 -- Hard disk errors on $7$dkb200::
The disk $7$dkb200: (dl124) went into MountVerify.
A power off/on on the $7$dkb disk string resolved the problem.
- March 1st, 1996 -- Hard disk errors on $12$dkc0::
Starting March 1st the disk $12$dkc0: developed read errors:
DATE/TIME 1-MAR-1996 03:40:59.05 SYS_TYPE 00000004
SYSTEM UPTIME: 3 DAYS 12:19:49
SCS NODE: AXP612 OpenVMS AXP V6.2
GENERIC DK SUB-SYSTEM, UNIT _AXP612$DKC0:, CURRENT LABEL "DL154"
SEAGATE ST15150N
HW REVISION 34313030 HW REVISION = 0014
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI ID 00 SCSI ID = 0.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 00000001 %SYSTEM-S-NORMAL, NORMAL SUCCESSFUL
COMPLETION
SCSI CMD 5D000028
000074CB
0035 READ EXTENDED
SCSI STATUS 00 GOOD
EXTENDED SENSE DATA
EXTENDED SENSE 000300F0
0AA1CB5D
00000000
80D00011
2000 MEDIUM ERROR
UNRECOVERED READ ERR IN DATA
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08021910 ONLINE
BUSY
SOFTWARE VALID
UNLOAD AT DISMOUNT
UCB$L_OPCNT 002109AE 2165166. QIO'S THIS UNIT
UCB$L_ERRCNT 0000000C 12. ERRORS THIS UNIT
IRP$L_BCNT 00006A00 TRANSFER SIZE 27136. BYTE(S)
IRP$L_BOFF 00000110 272. BYTE PAGE OFFSET
IRP$L_PID 00030085 REQUESTOR "PID"
IRP$Q_IOSB 00000000
00000000 IOSB, 0. BYTE(S) TRANSFERRED
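The SCSI CMD field of the entry above can be cross-checked against the sense data and the transfer size. The following Python sketch assumes the same byte-reversed longword printing as the sense data and the standard 10-byte READ/WRITE command layout. The function name decode_rw10 is hypothetical, and the 512-byte block size is an assumption that happens to reproduce IRP$L_BCNT.

```python
def decode_rw10(parts):
    """Decode a 10-byte READ/WRITE EXTENDED command from the SCSI CMD field."""
    # Assumption: each printed group is one little-endian number, so its
    # bytes are reversed to recover the command as sent on the bus.
    raw = b"".join(bytes.fromhex(part)[::-1] for part in parts)
    return {
        "opcode": raw[0],                           # 0x28 = read, 0x2A = write
        "lbn": int.from_bytes(raw[2:6], "big"),     # first block of the transfer
        "blocks": int.from_bytes(raw[7:9], "big"),  # transfer length in blocks
    }

cmd = decode_rw10(["5D000028", "000074CB", "0035"])

# Information field of the sense data (000300F0 0AA1CB5D ...), decoded
# with the same byte reversal.
failing_lbn = 0x5DCBA1

print(cmd["blocks"] * 512)          # compare with IRP$L_BCNT (27136 bytes)
print(failing_lbn - cmd["lbn"])     # offset of the bad block within the read
```

Read this way, the command is a 53-block read starting at LBN 0x5DCB74, and the block reported in the sense data lies inside that range.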
- February 20th, 1996 -- Disk structure error on $4$DKB300(DL177).
The nightly set vol /rebuild=force detected a
%REBUILD-W-DUALLOC, dual allocation on relative volume 3 (_AXP604$DKB300:)
There was no problem found the night before. An anal/disk found:
%ANALDISK-W-MULTALLOC, file (262,1,3) [ALADIN_DATA.S117]S117_PCH_2777.LMD;1
multiply allocated blocks
VBN 9025 to 9040
LBN 6738640 to 6738655, RVN 3
%ANALDISK-W-MULTALLOC, file (896,5,3) [ALADINSOFT.AMGR.REF]LAST_DISTRIBUTE_VSBZ.TXT;238
multiply allocated blocks
VBN 1 to 16
LBN 6738640 to 6738655, RVN 3
This is similar to the problem after a disk hangup on February 14th.
The disk was, however, replaced in the meantime, and the data was copied
from the old to the new volume with back/image/volume. The
structure had been verified to be o.k. on February 15th. There was no
I/O error on any disk of the volume set since February 15th. So there
is no obvious reason for a disk structure problem.
The file LAST_DISTRIBUTE_VSBZ.TXT;238 was created 19-FEB-1996 at
2:57. That is after the structure checks are made, so it is
likely that the structure was damaged on 19-FEB-1996. The DEFRAG
logfiles didn't show any problems either.
Inspection of the two affected files showed that
- The contents of LAST_DISTRIBUTE_VSBZ.TXT was o.k.
- The block 9025 of S117_PCH_2777.LMD showed the
.TXT file contents
- Blocks 9026 to 9040 seem to be `random bytes'.
This is what one would expect if the structure error was introduced when
LAST_DISTRIBUTE_VSBZ.TXT was written with a faulty allocation.
The structure error was removed by
- deleting S117_PCH_2777.LMD
- doing an anal/disk/repair
Note that the previous structure error was removed by deleting both files
before an anal/disk was performed, which then turned up no further errors.
It may be that this corrupted the free block
caches. The volume set has been dismounted in the meantime, however, so this
shouldn't affect operations anymore.
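The overlap reported by anal/disk can be made explicit by grouping the MULTALLOC messages by extent. The following Python sketch only parses the message layout as it appears in this log; the function name shared_extents is hypothetical, and the regular expressions are assumptions derived from the two messages quoted above rather than from any documented output format.

```python
import re
from collections import defaultdict

REPORT = """\
%ANALDISK-W-MULTALLOC, file (262,1,3) [ALADIN_DATA.S117]S117_PCH_2777.LMD;1
multiply allocated blocks
VBN 9025 to 9040
LBN 6738640 to 6738655, RVN 3
%ANALDISK-W-MULTALLOC, file (896,5,3) [ALADINSOFT.AMGR.REF]LAST_DISTRIBUTE_VSBZ.TXT;238
multiply allocated blocks
VBN 1 to 16
LBN 6738640 to 6738655, RVN 3
"""

def shared_extents(report):
    """Map (rvn, first_lbn, last_lbn) to the files that claim that extent."""
    claims = defaultdict(list)
    current_file = None
    for line in report.splitlines():
        m = re.match(r"%ANALDISK-W-MULTALLOC, file \([\d,]+\) (\S+)", line)
        if m:
            current_file = m.group(1)   # remember which file the LBN line belongs to
            continue
        m = re.match(r"LBN (\d+) to (\d+), RVN (\d+)", line)
        if m and current_file:
            first, last, rvn = map(int, m.groups())
            claims[(rvn, first, last)].append(current_file)
    return dict(claims)

for extent, files in shared_extents(REPORT).items():
    print(extent, files)
```

Both files claim the same 16 blocks on RVN 3, which is why deleting one of them and repairing the structure was enough to clear the error.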
- February 15th, 1996 2:52 -- Another hangup of $4$DKB300(DL177).
This time we got about 30 errors on the disk plus a PKB error:
*******************************
GENERIC DK SUB-SYSTEM, UNIT _AXP604$DKB300:
DATE/TIME 15-FEB-1996 02:52:29.27 SYS_TYPE 0000000D
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI CMD 7D00002A
0000201E
007E WRITE EXTENDED
SCSI STATUS 00 GOOD
EXTENDED SENSE 000300F0
0A221E7D
21029B0E
8000020C
2800 MEDIUM ERROR
SENSE CODE = 0C(X)
*******************************
53C810 SCSI PORT SUB-SYSTEM, AXP604$PKB0:
DATE/TIME 15-FEB-1996 02:52:29.27 SYS_TYPE 0000000D
ERROR TYPE 000A RESEL ERROR
SUB-ERROR TYPE = 00(X)
*******************************
DATE/TIME 15-FEB-1996 02:52:29.61 SYS_TYPE 0000000D
GENERIC DK SUB-SYSTEM, UNIT _AXP604$DKB300:
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI CMD 7D00002A
0000201E
007E WRITE EXTENDED
SCSI STATUS 00 GOOD
EXTENDED SENSE DATA
EXTENDED SENSE 00070070
0A000000
00000000
00000027
0000 DATA PROTECT
WRITE PROTECTED
*******************************
DATE/TIME 15-FEB-1996 02:52:30.62 SYS_TYPE 0000000D
GENERIC DK SUB-SYSTEM, UNIT _AXP604$DKB300:
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI CMD 7D00002A
0000201E
007E WRITE EXTENDED
SCSI STATUS 00 GOOD
EXTENDED SENSE DATA
EXTENDED SENSE 00000700
000A0000
27000000
00000000
00002000
00000000
8010DA53
03086713
8483801F
00000700
35000000
00000020
98080000
01929418
01929418
00000000
00000F79
8F097D00
00001004
80000C0F
96010769
00000000
00000000
00000000
00000000
00000000
8730140C
00000000 NO SENSE
NOADDITIONAL SENSE INFO
The disk continued to issue EXTENDED SENSE with a
WRITE PROTECTED code. Only a power cycle cleared the
problem.
- February 14th, 1996 2:26 -- Hangup of $4$DKB300(DL177).
This disk, a member of the al_data0: volume set, which also holds
most of the reference copy areas, started to log errors at 2:07:
DATE/TIME 14-FEB-1996 02:26:55.02 SYS_TYPE 0000000D
SCS NODE: AXP604 OpenVMS AXP V6.2
HW_MODEL: 00000480 Hardware Model = 1152.
DEVICE ERROR AlphaStation 400 4/233
GENERIC DK SUB-SYSTEM, UNIT _AXP604$DKB300:
QUANTUM XP34300
HW REVISION 47393835 HW REVISION = 589G
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI ID 03 SCSI ID = 3.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 00000001 %SYSTEM-S-NORMAL, NORMAL SUCCESSFUL
COMPLETION
SCSI CMD 7400002A
0000769A
007E WRITE EXTENDED
SCSI STATUS 00 GOOD
EXTENDED SENSE DATA
EXTENDED SENSE 000100F0
0A789A74
16024C0D
8000010C
2800 RECOVERED ERROR
SENSE CODE = 0C(X)
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08021810 ONLINE
SOFTWARE VALID
UNLOAD AT DISMOUNT
UCB$L_OPCNT 001BFF1E 1834782. QIO'S THIS UNIT
UCB$L_ERRCNT 00000172 114. ERRORS THIS UNIT
IRP$L_BCNT 0000FC00 TRANSFER SIZE 64512. BYTE(S)
IRP$L_BOFF 00001200 4608. BYTE PAGE OFFSET
IRP$L_PID 002200E2 REQUESTOR "PID"
IRP$Q_IOSB 00000000
00000000 IOSB, 0. BYTE(S) TRANSFERRED
Intermixed were SCSI Port Driver errors like
ERROR SEQUENCE 6518. LOGGED ON: CPU_TYPE 00000006
DATE/TIME 14-FEB-1996 02:29:09.33 SYS_TYPE 0000000D
SCS NODE: AXP604 OpenVMS AXP V6.2
HW_MODEL: 00000480 Hardware Model = 1152.
DEVICE ATTENTION AlphaStation 400 4/233
53C810 SCSI PORT SUB-SYSTEM, AXP604$PKB0:
ERROR TYPE 000A RESEL ERROR
SUB-ERROR TYPE = 00(X)
SCSI ID 03 SCSI ID = 3.
SCSI STATUS FF NO STATUS RECEIVED
PORT ERROR CNT 00000100
00000000
00000000 BUS BUSY CNT = 256.
UNSOL RESET CNT = 0.
UNSOL INTRPT CNT = 0.
CONN ERROR CNT 00000000
00000000
00000000
00000000
00010C0D
00000000
00000000 ARB FAIL CNT = 0.
SEL FAIL CNT = 0.
PARITY ERR CNT = 0.
PHASE ERR CNT = 0.
BUS RESET CNT = 68621.
BUS ERROR CNT = 0.
CONTROLLER ERROR CNT = 0.
PORT DEPENDENT DATA
CHIP_DATA_CNT 53
SCNTL0 DA Initiator Device
Assert ATN on Parity Error
Enable Parity Checking
Select ATN on a Start Seq
Full Arbitration
SCNTL1 10 53C810 Connected to bus
SCNTL2 80 SCSI Disconnect Unexpected
SCNTL3 13 Clock conversion factor of 2
Synchronous clock conversion factor of 1
SCID 67 Encoded 53C810 Chip SCSI ID = 7
Enable Response to selection
Enable Response to reselection
SXFER 08 Max SCSI Synchronous Offset = 8 - Asynchronous
Synchronous transfer period = 4
SDID 03 Encoded Destination SCSI ID = 3
GPREG 1F General Purpose I/O bit 0 = 1
General Purpose I/O bit 1 = 1
SFBR 80 Sel/Reselect ID = 7
SSID 83 SCSI Valid Bit
Encoded Destination SCSI ID = 3
DSTAT 84 SCRIPTS Interrupt Instruction Received
DMA FIFO Empty
SSTAT0 01 SCSI Parity Signal
SSTAT1 0F SCSI I/O/ signal
SCSI C/D / signal
SCSI MSG / signal
Latched SCSI parity
Number of bytes in SCSI FIFO = 0
SSTAT2 00
DSA 00000001
CTEST2 35 Data Acknowledge Inactive
Data Request Inactive
SCSI True End of Process Active
Configured as Memory
Configured as I/O
Signal Process not Active
Xfer Direction = Host to SCSI
CTEST3 20 Chip Rev level = 2
CTEST4 00 DMA FIFO Byte Lane Disabled
CTEST5 00 DMA Dir = Host to SCSI
Reset Master Control
DFIFO 00 Bytes Left In FIFO = 0.
DBC/DCMD 98080000 Did: Interrupt
DNAD 01929418
DSP 01929418
DSPS 00000000
SCRATCHA 00000F79
DMODE 00 Manual Start Mode set
Destination Addr. = Memory Space
Source Addr = Memory Space
Burst length = 2 transfers
DIEN 7D Illegal Instruction Detected
Script Interrupt Instruction Received
Script Step Interrupt
Aborted
Bus Fault ENABLED
Master Data Parity Error
DCNTL 09 810 Native Mode
Normal DMA Operation
Enable Totem Pole Driver for IRQ
Single Step Disabled
SIEN0 8F SCSI Parity Error
SCSI Reset
Unexpected Disconnect
SCSI Gross Error
Mismatch/Atn in Initiator/Target Mode
SIEN1 04 Sel/Reselect time-out
SIST0 00
SIST1 00
MACNTL 00
GPCNTL 0F GP I/O 0 Iinput Enable
GP I/O 1 Intput Enable
STIME0 0C Selection time-out = 204.8 mili seconds
HTH time-out disbled
STIME1 00 GP timer period disbled
RESPID 80 Sel/Reselection ID 7
SCRATCHB 96010769
DSA_8 00000000
DSA_4 00000000
DSA0 00000000
DSA4 00000000
DSA8 00000000
SOFF 140C
SSOFF 8730
The AXP604 system console reported disk offline and mount
verification completed events. The disk appeared online and mounted,
but all I/O to dkb300 hung. The disk had 370 logged errors,
the port driver about 640.
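The figure of 370 disk errors matches the hexadecimal UCB$L_ERRCNT value (00000172), not the decimal "114." printed next to it. In the excerpts quoted in this log, the decimal column agrees with the low-order byte of the hexadecimal value only. This is an observation from these samples, not documented behaviour of the error log formatter. A short Python check:

```python
# (hex value, decimal shown in the report) pairs copied from entries in this log
ERRCNT_SAMPLES = [
    ("00000172", 114),  # AXP604$DKB300, 14-FEB-1996
    ("000029CA", 202),  # AXP612$DKA300, 3-NOV-1995
    ("0000288C", 140),  # AXP612$DKA300, 30-OCT-1995
    ("0000007A", 122),  # AXP604$DKB600, 27-OCT-1995
]

for hex_value, shown in ERRCNT_SAMPLES:
    full = int(hex_value, 16)
    # The third column is True when the printed decimal equals the low byte.
    print(hex_value, full, (full & 0xFF) == shown)
```

When the count exceeds 255, the hexadecimal column is therefore the safer one to read.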
The disk was accessible again after a reboot. The volume set now has
a disk structure error, however. The problem is on RVN 1 and not on
RVN 2 (dkb300).
%ANALDISK-W-MULTALLOC, file (1000,2,1) [ALADINSOFT.S117.REF]C_INI_ZEB.OBJ;1
multiply allocated blocks
VBN 1 to 16
LBN 7059264 to 7059279, RVN 1
%ANALDISK-W-MULTALLOC, file (1773,1,1) [ALADIN_DATA.S117]S117_PCH_1870.LMD;1
multiply allocated blocks
VBN 321 to 336
LBN 7059264 to 7059279, RVN 1
This was fixed by deleting both files (no anal/disk/repair
needed afterwards...).
- February 9th, 1996 7:00 -- Hangup of some AXP612 disks
The al_temp0 and al_temp1 disks were in an irregular
state:
Device Device Error Volume Free Trans Mnt
Name Status Count Label Blocks Count Cnt
$12$DKA100: (AXP612) MntVerifyTimeout 0 DL124 2197088 29 4
dismount
$12$DKA300: (AXP612) Online 0 (remote access)
A show device on all cluster nodes gave:
SYSMAN> do sho dev $12$DKA100:
Device Device Error Volume Free Trans Mnt
Name Status Count Label Blocks Count Cnt
%SYSMAN-I-OUTPUT, command execution on node AXP610
$12$DKA100: (AXP612) Mounted alloc 0 (remote mount) 4
%SYSMAN-I-OUTPUT, command execution on node AXP625
$12$DKA100: (AXP612) MntVerifyTimeout 0 DL124 2197088 44 4
dismount
%SYSMAN-I-OUTPUT, command execution on node AXP604
$12$DKA100: (AXP612) MntVerifyTimeout 0 DL124 2197088 46 4
dismount
%SYSMAN-I-OUTPUT, command execution on node AXP602
$12$DKA100: (AXP612) MntVerifyTimeout 0 DL124 2197088 2 4
dismount
%SYSMAN-I-OUTPUT, command execution on node AXP612
$12$DKA100: (AXP612) MntVerifyTimeout 0 DL124 2197088 29 4
dismount
SYSMAN> do sho dev $12$DKA300:
%SYSMAN-I-OUTPUT, command execution on node AXP610
Device Device Error Volume Free Trans Mnt
Name Status Count Label Blocks Count Cnt
$12$DKA300: (AXP612) Online alloc 0
Probable cause is a SCSI controller problem:
AXP612 $ sho err
Device Error Count
PEA0: 24
AXP612$PKA0: 37386
$12$DKA300: (AXP612) 3 <-- caused by cycling power
$12$DKC0: (AXP612) 11 <-- the usual 11
The disks came back online after two reboots. Afterwards (but
maybe even before), kp3$root was in MntVerifyTimeout
on AXP627 and AXP607.
- January 11th, 1996 10:00 -- Crash of AXP604
Sometime between 10:03 and 11:39 the node AXP604 crashed.
The last console message was "system dump canceled", and indeed there is
neither a dump nor an error log entry.
The disks served by AXP604 were dismounted on some (all?) other
nodes. This is now done by the remounter when it detects a
'hostunavailable' condition.
- November 5th, 1995 16:09 -- Crash of most of AXP cluster
Sometime between 2:00 and 16:09 the node AXP612 hung. About 50%
of the AXP cluster rebooted around 20:49 or later. Reason unclear.
- November 3rd, 1995 2:40 -- Disk Hangup on AXP612$DKA300
Problem similar to the one on October 30th. The first COMMAND
TRANSMISSION FAILURE error for AXP612$DKA300 was logged at
02:40:08.99; about 320 similar errors followed until 02:45:35.55, then a
RESEL ERROR, 13 UNEXPECTED INTERRUPT and two
reset messages. This time the system recovered by itself. It is
unclear whether the disk DKA300 initiated the reset of the SCSI
adapter. Both disks on PKA were reset, however.
ERROR SEQUENCE 7867. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 3-NOV-1995 02:45:35.99 SYS_TYPE 00000004
SYSTEM UPTIME: 7 DAYS 04:39:43
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ATTENTION DEC 3000 Model 400
TM32 SCSI PORT SUB-SYSTEM, AXP612$PKA0:
ERROR TYPE 040A RESEL ERROR
NO CONNECTION TO THIS TARGET
SCSI ID FF
SCSI STATUS FF NO STATUS RECEIVED
PORT ERROR CNT 00000000
00000001
00000000 BUS BUSY CNT = 0.
UNSOL RESET CNT = 1.
UNSOL INTRPT CNT = 0.
PORT DEPENDENT DATA
TRANSFER CNT 0000 TRANSFER CNT = 0.
FIFO 52
CMD 00
STATUS 97 MESSAGE IN
TRANSFER CNT
INTERRUPT PENDING
INTERRUPT STS 40 ILLEGAL CMD
SEQUENCE STEP CC SEQUENCE CODE = 4(X)
FIFO FLAGS 80
CONFIG1 17 PARITY CHECKING
CONFIG2 0B DMA PARITY ENABLE
REGISTER PARITY ENABLE
SCSI2 MODE
CONFIG3 04 SAVE RESIDUAL BYTE
DMA ADDRESS 000003B4
IMER FFCCFFCC
SDIC 08819780
SAVED CIR 00000FF0
SAVED CMD 00000046
SAVED STATUS 97
SAVED FIFO FLGS 81
SAVED INT REG 08
*******************************
ERROR SEQUENCE 7868. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 3-NOV-1995 02:45:35.99 SYS_TYPE 00000004
SYSTEM UPTIME: 7 DAYS 04:39:43
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ATTENTION DEC 3000 Model 400
TM32 SCSI PORT SUB-SYSTEM, AXP612$PKA0:
ERROR TYPE 0B08 UNEXPECTED INTERRUPT
SCSI ID FF
SCSI STATUS FF NO STATUS RECEIVED
PORT ERROR CNT 00000000
00000001
00000000 BUS BUSY CNT = 0.
UNSOL RESET CNT = 1.
UNSOL INTRPT CNT = 0.
PORT DEPENDENT DATA
TRANSFER CNT 0000 TRANSFER CNT = 0.
FIFO 52
CMD 00
STATUS 97 MESSAGE IN
TRANSFER CNT
INTERRUPT PENDING
INTERRUPT STS 40 ILLEGAL CMD
SEQUENCE STEP CC SEQUENCE CODE = 4(X)
FIFO FLAGS 80
CONFIG1 17 PARITY CHECKING
CONFIG2 0B DMA PARITY ENABLE
REGISTER PARITY ENABLE
SCSI2 MODE
CONFIG3 04 SAVE RESIDUAL BYTE
DMA ADDRESS 000003B4
IMER FFCCFFCC
SDIC 00801780
SAVED CIR 00040FF0
SAVED CMD 00000046
SAVED STATUS 17
SAVED FIFO FLGS 80
SAVED INT REG 00
*******************************
ERROR SEQUENCE 7884. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 3-NOV-1995 02:45:38.57 SYS_TYPE 00000004
SYSTEM UPTIME: 7 DAYS 04:39:45
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ERROR DEC 3000 Model 400
GENERIC DK SUB-SYSTEM, UNIT _AXP612$DKA300:
DEC DSP5350S
HW REVISION 41373234 HW REVISION = 427A
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI ID 03 SCSI ID = 3.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 00000001 %SYSTEM-S-NORMAL, NORMAL SUCCESSFUL
COMPLETION
SCSI CMD 00000000
0000 TEST UNIT RDY
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE DATA
EXTENDED SENSE 00060070
0A000000
00000000
00000029
0000 UNIT ATTENTION
POWER ON OR RESET OCCURRED
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08025910 ONLINE
BUSY
SOFTWARE VALID
UNLOAD AT DISMOUNT
"MOUNT VERIFICATION" IN-PROGRESS
UCB$L_OPCNT 0063A2EB 6529771. QIO'S THIS UNIT
UCB$L_ERRCNT 000029CA 202. ERRORS THIS UNIT
IRP$L_BCNT 00000000 TRANSFER SIZE 0. BYTE(S)
IRP$L_BOFF 00000000 TRANSFER PAGE ALIGNED
IRP$L_PID 8303B430 REQUESTOR "PID"
IRP$Q_IOSB 00000000
00000000 IOSB, 0. BYTE(S) TRANSFERRED
*******************************
ERROR SEQUENCE 7885. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 3-NOV-1995 02:48:00.15 SYS_TYPE 00000004
SYSTEM UPTIME: 7 DAYS 04:42:07
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ERROR DEC 3000 Model 400
GENERIC DK SUB-SYSTEM, UNIT _AXP612$DKA200:
DEC DSP5350S
HW REVISION 41373234 HW REVISION = 427A
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI ID 02 SCSI ID = 2.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 00000001 %SYSTEM-S-NORMAL, NORMAL SUCCESSFUL
COMPLETION
SCSI CMD 35000028
0000013B
0001 READ EXTENDED
SCSI STATUS 00 GOOD
EXTENDED SENSE DATA
EXTENDED SENSE 00060070
0A000000
00000000
00000029
0000 UNIT ATTENTION
POWER ON OR RESET OCCURRED
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08021810 ONLINE
SOFTWARE VALID
UNLOAD AT DISMOUNT
UCB$L_OPCNT 0043A0F0 4432112. QIO'S THIS UNIT
UCB$L_ERRCNT 00000001 1. ERRORS THIS UNIT
IRP$L_BCNT 00000200 TRANSFER SIZE 512. BYTE(S)
IRP$L_BOFF 000010D0 4304. BYTE PAGE OFFSET
IRP$L_PID 83069780 REQUESTOR "PID"
IRP$Q_IOSB 00000000
00000000 IOSB, 0. BYTE(S) TRANSFERRED
- November 3rd, 1995 16:58 -- Crash AXP cluster
The cluster hung shortly after the submission of two jobs processing data
from the local DLT drives on AXP601 and AXP602. The same type of job had
already run twice that day; on the other hand, we had hangups seemingly
caused by BACKUP on AXP601 or AXP602 several times before. The node
AXP602 was rebooted, but might have lost the FDDI interface. This caused
a cluster crash when AXP601 joined in... -> Total reboot. The double crash
overwrote the primary crash dump.
- October 30th, 1995 8:10 -- Crash VSBZ and all Satellites
The local Ethernet was switched to a DEChub 900. The interruption
crashed the affected part of the VAX cluster (VSBZ after 43 days uptime).
- October 30th, 1995 6:19 -- Disk Hangup on AXP612$DKA300
Problem similar to the one observed with AXP604$DKB600 on October 27th.
The disk AXP612$DKA300, part of the al_temp0 volume set,
went into mount verify. This time, however, the disk didn't recover
automatically. The hangup was finally cleared by powering the disk off and
on again. Pertinent error log entries:
ERROR SEQUENCE 62064. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 30-OCT-1995 06:19:09.10 SYS_TYPE 00000004
SYSTEM UPTIME: 3 DAYS 08:13:36
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ERROR DEC 3000 Model 400
GENERIC DK SUB-SYSTEM, UNIT _AXP612$DKA300:
DEC DSP5350S
HW REVISION 41373234 HW REVISION = 427A
ERROR TYPE 03 COMMAND TRANSMISSION FAILURE
SCSI ID 03 SCSI ID = 3.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 000009B8 %SYSTEM-W-NOTQUEUED, REQUEST NOT QUEUED
SCSI CMD 00000000
0000 TEST UNIT RDY
SCSI STATUS 02 CHECK CONDITION
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08025910 ONLINE
BUSY
SOFTWARE VALID
UNLOAD AT DISMOUNT
"MOUNT VERIFICATION" IN-PROGRESS
UCB$L_OPCNT 001AEEFB 1765115. QIO'S THIS UNIT
UCB$L_ERRCNT 00000001 1. ERRORS THIS UNIT
IRP$L_BCNT 00000000 TRANSFER SIZE 0. BYTE(S)
IRP$L_BOFF 00000000 TRANSFER PAGE ALIGNED
IRP$L_PID 8303B430 REQUESTOR "PID"
IRP$Q_IOSB 00000000 00000000 IOSB, 0. BYTE(S) TRANSFERRED
****************** after about 10000 like errors in 1 second
****************** drive power was cycled, causing "SELECTION FAILED" errors
****************** and finally a
ERROR SEQUENCE 6943. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 30-OCT-1995 09:18:48.57 SYS_TYPE 00000004
SYSTEM UPTIME: 3 DAYS 11:13:14
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ERROR DEC 3000 Model 400
GENERIC DK SUB-SYSTEM, UNIT _AXP612$DKA300:
DEC DSP5350S
HW REVISION 41373234 HW REVISION = 427A
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI ID 03 SCSI ID = 3.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 00000001 %SYSTEM-S-NORMAL, NORMAL SUCCESSFUL
COMPLETION
SCSI CMD 00000000
0000 TEST UNIT RDY
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE DATA
EXTENDED SENSE 00060070
0A000000
00000000
00000029
0000 UNIT ATTENTION
POWER ON OR RESET OCCURRED
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08025910 ONLINE
BUSY
SOFTWARE VALID
UNLOAD AT DISMOUNT
"MOUNT VERIFICATION" IN-PROGRESS
UCB$L_OPCNT 001AF055 1765461. QIO'S THIS UNIT
UCB$L_ERRCNT 0000288C 140. ERRORS THIS UNIT
IRP$L_BCNT 00000000 TRANSFER SIZE 0. BYTE(S)
IRP$L_BOFF 00000000 TRANSFER PAGE ALIGNED
IRP$L_PID 8303B430 REQUESTOR "PID"
IRP$Q_IOSB 00000000 00000000 IOSB, 0. BYTE(S) TRANSFERRED
- October 27th, 1995 17:18 -- Disk Hangup on AXP604$DKB600:
During an image backup of AL_POOL0 residing on the
volume set $4$DKB400/500/600 the disk DKB600 went into
mount verify on AXP604. On other cluster nodes the whole volume
set went mount verify. The error log contained 120 "COMMAND TRANSMISSION
FAILURE" errors, with one second delay, and one "EXTENDED SENSE DATA
RECEIVED", after which the disk obviously responded again:
ERROR SEQUENCE 790. LOGGED ON: CPU_TYPE 00000006
DATE/TIME 27-OCT-1995 17:18:44.70 SYS_TYPE 0000000D
SYSTEM UPTIME: 3 DAYS 22:35:37
SCS NODE: AXP604 OpenVMS AXP V6.2
HW_MODEL: 00000480 Hardware Model = 1152.
DEVICE ERROR AlphaStation 400 4/233
GENERIC DK SUB-SYSTEM, UNIT _AXP604$DKB600:
QUANTUM XP34300
HW REVISION 47393835 HW REVISION = 589G
ERROR TYPE 03 COMMAND TRANSMISSION FAILURE
SCSI ID 06 SCSI ID = 6.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 000009B8 %SYSTEM-W-NOTQUEUED, REQUEST NOT QUEUED
SCSI CMD 00000000
0000 TEST UNIT RDY
SCSI STATUS 02 CHECK CONDITION
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08025910 ONLINE
BUSY
SOFTWARE VALID
UNLOAD AT DISMOUNT
"MOUNT VERIFICATION" IN-PROGRESS
UCB$L_OPCNT 0003DA85 252549. QIO'S THIS UNIT
UCB$L_ERRCNT 00000001 1. ERRORS THIS UNIT
IRP$L_BCNT 00000000 TRANSFER SIZE 0. BYTE(S)
IRP$L_BOFF 00000000 TRANSFER PAGE ALIGNED
IRP$L_PID 82FB8A30 REQUESTOR "PID"
IRP$Q_IOSB 00000000
00000000 IOSB, 0. BYTE(S) TRANSFERRED
*******************************
ERROR SEQUENCE 911. LOGGED ON: CPU_TYPE 00000006
DATE/TIME 27-OCT-1995 17:20:49.46 SYS_TYPE 0000000D
SYSTEM UPTIME: 3 DAYS 22:37:41
SCS NODE: AXP604 OpenVMS AXP V6.2
HW_MODEL: 00000480 Hardware Model = 1152.
DEVICE ERROR AlphaStation 400 4/233
GENERIC DK SUB-SYSTEM, UNIT _AXP604$DKB600:
QUANTUM XP34300
HW REVISION 47393835 HW REVISION = 589G
ERROR TYPE 05 EXTENDED SENSE DATA RECEIVED
SCSI ID 06 SCSI ID = 6.
SCSI LUN 00 SCSI LUN = 0.
SCSI SUBLUN 00 SCSI SUBLUN = 0.
PORT STATUS 00000001 %SYSTEM-S-NORMAL, NORMAL SUCCESSFUL COMPLETION
SCSI CMD 00000000
0000 TEST UNIT RDY
SCSI STATUS 02 CHECK CONDITION
EXTENDED SENSE DATA
EXTENDED SENSE 000B0070
0A000000
00000000
0000004E
0000 ABORTED COMMAND
SENSE CODE = 4E(X)
UCB$L_ERTCNT 00000010 16. RETRIES REMAINING
UCB$L_ERTMAX 00000010 16. RETRIES ALLOWABLE
ORB$L_OWNER 00010004 OWNER UIC [001,004]
UCB$L_CHAR 1C4D4008 DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08025910 ONLINE
BUSY
SOFTWARE VALID
UNLOAD AT DISMOUNT
"MOUNT VERIFICATION" IN-PROGRESS
UCB$L_OPCNT 0003DA89 252553. QIO'S THIS UNIT
UCB$L_ERRCNT 0000007A 122. ERRORS THIS UNIT
IRP$L_BCNT 00000000 TRANSFER SIZE 0. BYTE(S)
IRP$L_BOFF 00000000 TRANSFER PAGE ALIGNED
IRP$L_PID 82FB8A30 REQUESTOR "PID"
IRP$Q_IOSB 00000000
00000000 IOSB, 0. BYTE(S) TRANSFERRED
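The "one second delay" between the COMMAND TRANSMISSION FAILURE errors can be checked from the two entries above, using their DATE/TIME and UCB$L_ERRCNT fields. The following Python sketch is illustrative only: parse_vms_time is a hypothetical helper that handles only the DATE/TIME layout seen in this log (two-digit hundredths of a second).

```python
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

def parse_vms_time(text):
    """Parse a DATE/TIME field such as '27-OCT-1995 17:18:44.70'."""
    date_part, time_part = text.split()
    day, month, year = date_part.split("-")
    hms, hundredths = time_part.split(".")
    hour, minute, second = map(int, hms.split(":"))
    # Hundredths of a second are converted to microseconds.
    return datetime(int(year), MONTHS[month], int(day),
                    hour, minute, second, int(hundredths) * 10_000)

first = parse_vms_time("27-OCT-1995 17:18:44.70")  # UCB$L_ERRCNT = 1
last = parse_vms_time("27-OCT-1995 17:20:49.46")   # UCB$L_ERRCNT = 122
elapsed = (last - first).total_seconds()
interval = elapsed / (122 - 1)                     # errors logged in between
print(f"{elapsed:.2f} s elapsed, {interval:.2f} s per error")
```

The result is about 125 seconds for 121 further errors, which is consistent with one retry per second until the disk responded again.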
- October 26th, 1995 22:00 -- Crash AXP612
The disks $12$dka200 and $12$dka300 went into mount
verify. The pka0 error log was running too; somebody rebooted...
The problem started at 20:14 with a "RESEL ERROR" followed by an
"UNEXPECTED INTERRUPT" error on PKA0. Some 89000
"UNEXPECTED INTERRUPT" errors followed until the reboot at 22:03:
ERROR SEQUENCE 61453. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 26-OCT-1995 20:14:24.63 SYS_TYPE 00000004
SYSTEM UPTIME: 3 DAYS 01:36:30
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ATTENTION DEC 3000 Model 400
TM32 SCSI PORT SUB-SYSTEM, AXP612$PKA0:
ERROR TYPE 040A RESEL ERROR
NO CONNECTION TO THIS TARGET
SCSI ID FF
SCSI STATUS FF NO STATUS RECEIVED
PORT ERROR CNT 00000000
00000000
00000000 BUS BUSY CNT = 0.
UNSOL RESET CNT = 0.
UNSOL INTRPT CNT = 0.
PORT DEPENDENT DATA
TRANSFER CNT 0000 TRANSFER CNT = 0.
FIFO 1C
CMD 00
STATUS 97 MESSAGE IN
TRANSFER CNT
INTERRUPT PENDING
INTERRUPT STS 40 ILLEGAL CMD
SEQUENCE STEP CC SEQUENCE CODE = 4(X)
FIFO FLAGS 80
CONFIG1 17 PARITY CHECKING
CONFIG2 0B DMA PARITY ENABLE
REGISTER PARITY ENABLE
SCSI2 MODE
CONFIG3 04 SAVE RESIDUAL BYTE
DMA ADDRESS 00002000
IMER FFCCFFCC
SDIC 08819740
SAVED CIR 00000FF0
SAVED CMD 00000046
SAVED STATUS 97
SAVED FIFO FLGS 81
SAVED INT REG 08
*******************************
ERROR SEQUENCE 61454. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 26-OCT-1995 20:14:24.63 SYS_TYPE 00000004
SYSTEM UPTIME: 3 DAYS 01:36:30
SCS NODE: AXP612 OpenVMS AXP V6.2
HW_MODEL: 0000040B Hardware Model = 1035.
DEVICE ATTENTION DEC 3000 Model 400
TM32 SCSI PORT SUB-SYSTEM, AXP612$PKA0:
ERROR TYPE 0B08 UNEXPECTED INTERRUPT
SCSI ID FF
SCSI STATUS FF NO STATUS RECEIVED
PORT ERROR CNT 00000000
00000000
00000000 BUS BUSY CNT = 0.
UNSOL RESET CNT = 0.
UNSOL INTRPT CNT = 0.
PORT DEPENDENT DATA
TRANSFER CNT 0000 TRANSFER CNT = 0.
FIFO 1C
CMD 00
STATUS 97 MESSAGE IN
TRANSFER CNT
INTERRUPT PENDING
INTERRUPT STS 40 ILLEGAL CMD
SEQUENCE STEP CC SEQUENCE CODE = 4(X)
FIFO FLAGS 80
CONFIG1 17 PARITY CHECKING
CONFIG2 0B DMA PARITY ENABLE
REGISTER PARITY ENABLE
SCSI2 MODE
CONFIG3 04 SAVE RESIDUAL BYTE
DMA ADDRESS 00002000
IMER FFCCFFCC
SDIC 00801740
SAVED CIR 00040FF0
SAVED CMD 00000046
SAVED STATUS 17
SAVED FIFO FLGS 80
SAVED INT REG 00
- October 23rd, 1995 16:00 -- Crash AXP Cluster
Whole AXP cluster went down after the quorum disk became inaccessible.
- October 16th, 1995 10:30 -- Crash AXP Cluster
Whole AXP cluster went down after a page disk failed on AXP602.
- October 10th, 1995 15:27 -- Crash AXP604
Crash with KSP inval - PC = 802C9408, after an uptime of
14 days. Like crash of September 6th.
- October 1st, 1995 13:54 -- Crash AXP612
Crashed and auto-rebooted. No dump, no specific error log entry.
- September 26th, 1995 21:00 -- Crash AXP612
The DKB SCSI branch (connecting al_temp0) hung up at
21:00. It came partly back at 2:00, but the DO_DAILY jobs
didn't execute. The system was rebooted after another loss of the DKB
disks on September 27th, 14:00.
- September 25th, 1995 17:00 -- Crash AXP604
Node AXP604 hung after the 4 errors listed below.
The system was explicitly CRASHED from the console, but there was no
crash dump on the AXP604 system disk, even though this dump file
has so far been used exclusively by AXP604. The errors prior to
the crash were all like:
ERROR SEQUENCE 2928. LOGGED ON: CPU_TYPE 00000006
DATE/TIME 25-SEP-1995 16:46:22.39 SYS_TYPE 0000000D
SYSTEM UPTIME: 4 DAYS 07:50:20
SCS NODE: AXP604 OpenVMS AXP V6.2
HW_MODEL: 00000480 Hardware Model = 1152.
ERL$LOGMSCP ENTRY AlphaStation 400 4/233
MESSAGE TYPE 0010 IMMEDIATE MODE COMMAND TIMEOUT
_ CONTROLLER RESET
CLASS DRIVER 4B534944 /DISK/
CDDB$Q_CNTRLID 01040000 0000F546
UNIQUE IDENTIFIER, 00000000F546(X)
MASS STORAGE CONTROLLER
VMS (SOFTWARE MSCP SERVER)
CDDB$B_SYSTEMID 0000F546
0000
The system was rebooted but came up with the wrong system time. The error log showed:
*******************************
ERROR SEQUENCE 0. LOGGED ON: CPU_TYPE 00000006
DATE/TIME 25-SEP-1995 17:18:47.90 SYS_TYPE 0000000D
SYSTEM UPTIME: 0 DAYS 00:00:11
SCS NODE: AXP604 OpenVMS AXP V6.2
HW_MODEL: 00000480 Hardware Model = 1152.
"UNKNOWN DEVICE" ENTRY AlphaStation 400 4/233
ERROR LOG RECORD
ERF$L_SID 00000480 SYSTEM ID REGISTER
ERL$W_ENTRY 0062 ERROR ENTRY TYPE
EXE$GQ_SYSTIME 706D9545
00996EF3 64 BIT TIME WHEN ERROR LOGGED
ERL$GL_SEQUENCE 0000 UNIQUE ERROR SEQUENCE = 0.
UCB$L_STS 00002010 DEVICE STATUS
UCB$B_DEVCLASS 20 DEVICE CLASS = 32.
UCB$B_DEVTYPE 3C DEVICE TYPE = 60.
UCB$W_UNIT 0000 PHYSICAL UNIT NUMBER = 0.
UCB$L_ERRCNT 00000001 UNIT ERROR COUNT = 1.
UCB$L_OPCNT 00000000 UNIT OPERATION COUNT = 0.
ORB$L_OWNER 00010004 OWNER UIC = [001,004]
UCB$L_DEVCHAR 0C442000 DEVICE CHARACTERISTICS
UCB$B_SLAVE 00 DEVICE SLAVE CONTROLLER = 0.
DDB$T_NAME 5058410A
24343036
00415746
00000000 /AXP604$FWA/
LONGWORD 1. 00000010
LONGWORD 2. 000001FC
LONGWORD 3. 0000000A
LONGWORD 4. 00000000
LONGWORD 5. 00000000
LONGWORD 6. 00000005
LONGWORD 7. 0000002A
LONGWORD 8. 00000000
LONGWORD 9. 00000003
LONGWORD 10. 00000000
LONGWORD 11. 00000000
LONGWORD 12. 00000000
LONGWORD 13. 00000000
LONGWORD 14. 00000000
LONGWORD 15. 00000000
LONGWORD 16. 00000000
LONGWORD 17. 00000000
Another reboot fixed this; AXP604 then had the correct time.
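The two multi-longword fields in the entry above can be decoded by hand. The following Python sketch assumes that the dump lists the low-order longword first, that EXE$GQ_SYSTIME uses the usual OpenVMS time format (100 ns ticks since 17-NOV-1858), and that DDB$T_NAME is a counted ASCII string. The function names are made up for illustration and are not part of any VMS tool.

```python
from datetime import datetime, timedelta

VMS_EPOCH = datetime(1858, 11, 17)  # assumed base date of the OpenVMS system time


def join_longwords(longwords):
    """Combine dump longwords (low-order first) into little-endian bytes."""
    return b"".join(lw.to_bytes(4, "little") for lw in longwords)


def decode_systime(longwords):
    """Decode a 64-bit system time given as two longwords, low-order first."""
    ticks = int.from_bytes(join_longwords(longwords), "little")  # 100 ns units
    return VMS_EPOCH + timedelta(microseconds=ticks // 10)


def decode_counted_string(longwords):
    """Decode a counted ASCII string: the first byte holds the length."""
    raw = join_longwords(longwords)
    return raw[1:1 + raw[0]].decode("ascii")


# Values copied from the EXE$GQ_SYSTIME and DDB$T_NAME fields of the entry
systime = decode_systime([0x706D9545, 0x00996EF3])
name = decode_counted_string([0x5058410A, 0x24343036, 0x00415746, 0x00000000])

print(systime.strftime("%d-%m-%Y %H:%M:%S"))
print(name)
```

With the values from the entry, the quadword decodes to 25 September 1995, 17:18:47, and the name to AXP604$FWA, which agrees with the DATE/TIME header and the /AXP604$FWA/ annotation in the dump.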
- September 21st, 1995 2:00 -- Crash AXP Cluster
Cluster nodes hung till at least 2:00 am.
AXP601/AXP602 hangup, both systems hung after a reboot. CLUSTER CRASH.
In addition, AXP612 was effectively hung up by a looping MX-AXP after
September 20th, 20:12. Unclear whether this caused the other problems.
- September 20th, 1995 13:30 -- Crash AXP Cluster
AXP601/AXP602 hangup, both systems hung after a reboot. CLUSTER CRASH.
- September 16th, 1995 15:40 -- Crash VSBZ
Crash, reason unknown, no dump, no errorlog entry, autoboot.
- September 7th, 1995 12:15 -- Crash AXP604
Crash with KSP inval - PC = 802C93F0, again in the PE driver:
AXP604: PC=802C93F0 --> SYS$PEDRIVER_NPRO+0D3F0
The system failed to boot and needed a reset before a b command
worked.
- September 6th, 1995 10:17 -- Crash AXP604
Crash with KSP inval - PC = 802C9408.
It turns out that the crash addresses on 604 and 612 point to the
PE driver:
AXP604: PC=802C9408 --> SYS$PEDRIVER_NPRO+0D408
AXP612: PC=801D33F0 --> SYS$PEDRIVER_NPRO+0D3F0
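The mapping above is plain hexadecimal subtraction: the crash PC minus the offset into SYS$PEDRIVER_NPRO gives the base address implied for that section on the node. A small Python sketch, using only the PC values and offsets recorded in this log (the variable and function names are illustrative), checks that the crashes on each node agree on one base:

```python
# (node, crash PC, offset into SYS$PEDRIVER_NPRO) as recorded in this log
crashes = [
    ("AXP604", 0x802C9408, 0x0D408),  # September 6th and October 10th
    ("AXP604", 0x802C93F0, 0x0D3F0),  # September 7th
    ("AXP612", 0x801D33F0, 0x0D3F0),  # September 1st and 4th
]


def implied_base(pc, offset):
    """Base address implied by a PC and its offset into the image section."""
    return pc - offset


# Collect the distinct base addresses seen per node
bases = {}
for node, pc, offset in crashes:
    bases.setdefault(node, set()).add(implied_base(pc, offset))

for node, addrs in sorted(bases.items()):
    print(node, [f"{a:08X}" for a in sorted(addrs)])
```

Each node yields a single base (802BC000 on AXP604, 801C6000 on AXP612), and the two AXP604 crash PCs lie 24 bytes apart, at offsets 0D3F0 and 0D408.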
- September 5th, 1995 16:20 -- Crash AXP612
Cause unknown, no valid dump. System rebooted automatically, so it
was not a KSP inval this time.
After an automatic reboot the system hung in the state:
- DKA100: mounted on AXP612
- DKA200:,DKA300: not mounted on AXP612, but available
to cluster
- DKB400:,DKB500: not seen, in MountVerify on cluster.
A reset and "t tc2 cnfg" showed the SCSI-A disks and hung before
the first SCSI-B disk was shown.
Cycling power on all five SCSI disks fixed this; they were again
visible with "t tc2 cnfg" and the system booted fine.
- September 4th, 1995 12:46 -- Crash AXP612
Fourth crash since V6.2 upgrade on August 30th with
KSP inval - PC = 801D33F0 PSL = 800
- September 2nd, 1995 15:20 -- Crash AXP612
Third crash since V6.2 upgrade on August 30th
The SYSGEN parameter KSTACKPAGES is now increased from 1 to 2.
- September 1st, 1995 14:20 -- Crash AXP612
Second crash since V6.2 upgrade on August 30th with
KSP inval - PC = 801D33F0 PSL = 800
- May 5th, 1995 13:28 -- Crash VSBZ
Crashed after an uptime of 34 days 23:58. History:
- Used SQL on node VSCN on a database located on
kp3$broot. The system hung up immediately.
- 15:28:30.81: BUS RESET INITIATED
- 15:28:30.81: BUS RESET DETECTED
- 15:28:30.81: COMMAND TRANSMISSION FAILURE on DKA200:
- 15:28:30.85: COMMAND TRANSMISSION FAILURE on DKA0:
... repeated a couple of times
- 15:28:30.86: SSRVEXCEPT, Unexpected system service exception
The device DKA200 was busy, ANAL/CRASH SHOW DEVICE shows:
I/O request queue (for VSBZ$DKA200)
-----------------
STATE IRP PID MODE CHAN FUNC WCB EFN AST IOSB STATUS
C 84BAB300 846F085C K 0000 0808 00000000 0 00000000 00000000 4100
packack physio,srvio
- April 23rd, 1995 8:00 -- Crash AXP612
Crashed after an uptime of 40 days. History:
- set host axp612 had not worked for many hours; it failed with
"connection to network object rejected". Unclear whether this
was related.
- mounted a tape on mkd0:; mount command never completed.
- axp612 hung. Other cluster nodes showed all
$12$dk disks in mount verification !
- suspected tape, unloaded volume and cycled power on drive.
- system crashed, failed to write crash dump because "shadow set
went invalid (..?..)", so there are no error log entries of the
crash.
- disks and the tape work fine after reboot....
- March 31st, 1995 -- Crash VSBZ
Again a crash of VSBZ similar to the one on March 21st
- 14:32:59.68: INVALID MODE SENSE DATA RETURNED on DKA200:
- 14:33:19.68: COMMAND TRANSMISSION FAILURE on DKA0:
- 14:33:29.68: ARBITRATION FAILED on PKA0:
.... the usual
- 14:33:29.72: COMMAND TRANSMISSION FAILURE on DKA0:
- 14:33:29.72: Unexpected system service exception
Consequence: Undid SCSI driver patch.
- March 30th, 1995 -- Crash VSBZ
Again a crash of VSBZ similar to the one on March 21st
- 10:31:10.99: INVALID MODE SENSE DATA RETURNED on DKA300:
- 10:31:30.99: COMMAND TRANSMISSION FAILURE on DKA0:
- 10:31:50.99: ARBITRATION FAILED on PKA0:
.... the usual
- 10:31:51.00: COMMAND TRANSMISSION FAILURE on DKA0:
- 10:31:51.01: Unexpected system service exception
- March 28th, 1995 -- Crash VSBZ
Again a crash of VSBZ similar to the one on March 21st
- 15:34:13.93: INVALID MODE SENSE DATA RETURNED on DKA400:
- 15:34:34.93: COMMAND TRANSMISSION FAILURE on DKA0:
- 15:34:54.93: ARBITRATION FAILED on PKA0:
.... the usual
- 15:34:54.94: COMMAND TRANSMISSION FAILURE on DKA0:
- 15:34:54.94: Unexpected system service exception
- March 27th, 1995 -- Crash VSBZ
Again a crash of VSBZ similar to the one on March 21st
- 10:50:48.47: INVALID MODE SENSE DATA RETURNED on DKA200:
- 10:51:08.47: COMMAND TRANSMISSION FAILURE on DKA0:
- 10:51:18.47: ARBITRATION FAILED on PKA0:
- 10:51:18.47: BUS RESET INITIATED on PKA0:
- 10:51:18.47: BUS RESET DETECTED on PKA0:
- 10:51:18.47: COMMAND TRANSMISSION FAILURE on DKA0:
.... a couple of retries
- 10:51:19.48: COMMAND TRANSMISSION FAILURE on DKA0:
- 10:51:19.48: Unexpected system service exception
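The four timelines above share a rhythm that is easier to see as intervals. As a rough Python sketch (timestamps copied from the entries above; the data layout and helper name are illustrative), the gaps between the first three events of each crash can be computed as follows:

```python
from datetime import datetime

# First three logged events per crash:
# INVALID MODE SENSE DATA RETURNED, COMMAND TRANSMISSION FAILURE,
# ARBITRATION FAILED
timelines = {
    "March 31": ["14:32:59.68", "14:33:19.68", "14:33:29.68"],
    "March 30": ["10:31:10.99", "10:31:30.99", "10:31:50.99"],
    "March 28": ["15:34:13.93", "15:34:34.93", "15:34:54.93"],
    "March 27": ["10:50:48.47", "10:51:08.47", "10:51:18.47"],
}


def gaps_in_seconds(stamps):
    """Return the intervals between consecutive HH:MM:SS.cc timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [round((b - a).total_seconds(), 2) for a, b in zip(times, times[1:])]


gaps = {day: gaps_in_seconds(stamps) for day, stamps in timelines.items()}
for day, values in gaps.items():
    print(day, values)
```

In every case the first gap is 20 or 21 seconds and the second is 10 or 20 seconds. That regularity hints at fixed timeouts rather than random failures, although the log itself does not confirm the cause.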
- March 21st, 1995 -- Crash VSBZ
Crash of VSBZ, probably caused by a page read error on DL094:
The crash has quite a long prehistory:
ERROR SEQUENCE 230. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 19:59:26.83 SYS_TYPE 04140002
DEVICE ERROR KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
GENERIC DK SUB-SYSTEM, UNIT _VSBZ$DKA300:, CURRENT LABEL "DL095"
ERROR TYPE 06 INVALID MODE SENSE DATA RETURNED
ERROR SEQUENCE 231. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 19:59:46.83 SYS_TYPE 04140002
DEVICE ERROR KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
RZ28 SUB-SYSTEM, UNIT _VSBZ$DKA0:, CURRENT LABEL "VMS05X_1G"
ERROR TYPE 03 COMMAND TRANSMISSION FAILURE
PORT STATUS 0000022C %SYSTEM-F-TIMEOUT, DEVICE TIMEOUT
ERROR SEQUENCE 232. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
DEVICE ATTENTION KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
SCSI PORT SUB-SYSTEM, UNIT _VSBZ$PKA0:
ERROR TYPE 0002 ARBITRATION FAILED
ERROR SEQUENCE 233. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
DEVICE ATTENTION KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
SCSI PORT SUB-SYSTEM, UNIT _VSBZ$PKA0:
ERROR TYPE 0009 BUS RESET INITIATED
ERROR SEQUENCE 234. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
DEVICE ATTENTION KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
SCSI PORT SUB-SYSTEM, UNIT _VSBZ$PKA0:
ERROR TYPE 0007 BUS RESET DETECTED
ERROR SEQUENCE 235. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
GENERIC DK SUB-SYSTEM, UNIT _VSBZ$DKA300:, CURRENT LABEL "DL095"
ERROR TYPE 03 COMMAND TRANSMISSION FAILURE
PORT STATUS 00000054 %SYSTEM-F-CTRLERR, FATAL CONTROLLER
ERROR SEQUENCE 236. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
GENERIC DK SUB-SYSTEM, UNIT _VSBZ$DKA400:, CURRENT LABEL "DL094"
like above
ERROR SEQUENCE 237. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
NON-FATAL BUGCHECK KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
---> killed process DECW$TE_015D
ERROR SEQUENCE 238. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
GENERIC DK SUB-SYSTEM, UNIT _VSBZ$DKA200:, CURRENT LABEL "DL093"
like above
ERROR SEQUENCE 239. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
GENERIC DK SUB-SYSTEM, UNIT _VSBZ$DKA400:, CURRENT LABEL "DL094"
like above
ERROR SEQUENCE 240. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
RZ28 SUB-SYSTEM, UNIT _VSBZ$DKA0:, CURRENT LABEL "VMS05X_1G"
like above
----
ERROR SEQUENCE 241. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
SCS NODE: VSBZ VAX/VMS V6.1
DEVICE ERROR KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
GENERIC DK SUB-SYSTEM, UNIT _VSBZ$DKA400:, CURRENT LABEL "DL094"
ERROR TYPE 03 COMMAND TRANSMISSION FAILURE
IRP$L_PID 0001005D REQUESTOR "PID"
----
ERROR SEQUENCE 242. LOGGED ON: SID 12000003
DATE/TIME 22-MAR-1995 20:00:06.83 SYS_TYPE 04140002
SCS NODE: VSBZ VAX/VMS V6.1
FATAL BUGCHECK KA46 CPU FW REV# 3. CONSOLE FW REV# 1.4
SSRVEXCEPT, Unexpected system service exception
PROCESS NAME DEC
PROCESS ID 0001005D
- March 18th, 1995 -- Crash VSBZ
Crash of VSBZ caused by read error on system disk.
The prehistory is very similar to the crash on March 21st:
- 12:59:04.95: "INVALID MODE SENSE DATA RETURNED" on DKA300:
- 12:59:25.95: "COMMAND TRANSMISSION FAILURE" on DKA0:
- 12:59:45.95: "ARBITRATION FAILED" on PKA0:
- 12:59:45.95: "BUS RESET INITIATED" on PKA0:
- 12:59:45.95: "BUS RESET DETECTED" on PKA0:
- 12:59:45.95: "COMMAND TRANSMISSION FAILURE" on DKA0: