31.12.2021 08:57
twain
 
Good (?) morning!
After our Sun Fire 490 server was moved and blown clean of dust, it refused to work.
The admin who looked after it left the company six months ago.
Does anyone have suggestions on what to do next?
The situation: it looks like one of the internal disks that held the system has died.
At any rate, it is flagged as faulty.
We were promised the disks were mirrored, but apparently they are not.
This issue is described in the Oracle knowledge base, and all the symptoms match:
/knowledge/Sun%20Microsyst…900988_1.html#FIX

But our previous helmsman prudently did not renew the support contract.
I am trying to reinstate it, but there is no telling when that will happen.
And all we have for the recovery is the holidays.
Could someone take a look at what the article says, or perhaps you already know the answer?

Where can I find the shutdown logs? I think it printed something unusual when it was being shut down.
I am not an admin and I was not the one who moved it; I just need the server to work.
31.12.2021 11:26
OlegON
 
Given that I'm on my phone, the link is mangled, and there is no description of the symptoms...
What does "refused to work" mean? Solaris boxes usually print quite a lot of detail at startup... You didn't interrupt the boot by any chance? Some servers take half an hour to start... Are there any errors on the screen?
31.12.2021 11:40
twain
 
The forum won't let me post the full link; I haven't earned that yet.
https, colon, two slashes, then support.oracle.com/knowledge/Sun Microsyst…900988_1.html#FIX

[Attached image, 0.02 MB]
[Attached image, 0.06 MB]

I've attached screenshots. No, it went into a reboot loop with a panic...
31.12.2021 11:50
OlegON
 
Quote:
twain Microsyst…900988
See those dots? The link got shortened somewhere. I have MetaLink access, but can you give me the full link? Click it and then copy it from the address bar...
As for the mirror: UFS, which is what you have on the system partition (sad news), cannot do mirroring by itself... we need to work out how the RAID is built and where...
So far my expectations are rather pessimistic... it is not certain that the problem appeared during the move itself. Had it been rebooted at all before that?
31.12.2021 11:58
twain
 
Sorry,

All the data lives separately on the storage array, and the system is on these two disks.
It was rebooted a week ago, when we first tried to move it.
When it was being shut down this time, it printed something like "disconnect urgently
or data will be lost" and powered itself off. Maybe it always prints that.
I didn't normally shut it down. I only used to stop the application and the DBMS,
and the admin did the rest. I suspect the new admin did not halt the global zone (zone 0) and powered it off with the button.
31.12.2021 13:06
Occul
 
Here is what that link says...

Symptoms

1, System panics on boot showing panic[cpu6]/thread=180e000: vfs_mountroot: cannot mount root

2, Disk backplane failing during boot

WARNING: Device /pci@8,600000/SUNW,qlc@2 being marked with 'status' == fail

3, probe-scsi-all sees all internal disks

ok probe-scsi-all

/pci@8,600000/SUNW,qlc@2
LiD HA LUN --- Port WWN --- ----- Disk description -----
0 0 0 500000e0106d5f11 FUJITSU MAP3735F SUN72G 1201
1 1 0 500000e010777df1 FUJITSU MAP3735F SUN72G 1201
2 2 0 500000e0106a5661 FUJITSU MAP3735F SUN72G 1201
6 6 0 50800200001d2cf1 SUNW SUNWGS INT FCBPL9228
3 3 0 500000e0106e4be1 FUJITSU MAP3735F SUN72G 1201
4 4 0 500000e0106a6911 FUJITSU MAP3735F SUN72G 1201
5 5 0 500000e0106f0321 FUJITSU MAP3735F SUN72G 1201
8 8 0 2100000c50565823 SEAGATE ST373307FSUN72G 0307
9 9 0 2100001862cc39a8 SEAGATE ST314655FSUN146G0691
a a 0 2100001862caf6e7 SEAGATE ST314655FSUN146G0691
b b 0 2100000c5056560c SEAGATE ST373307FSUN72G 0307
c c 0 2100001862cc3775 SEAGATE ST314655FSUN146G0691
d d 0 2100001862cfd57d SEAGATE ST314655FSUN146G0691

4, obdiag passes test 1

>> Testing disk at loop ID: d
Selftest at /pci@8,600000/SUNW,qlc@2 .................................. passed
Pass:1 (of 1260) Errors:0 (of 0) Tests Failed:0 Elapsed Time: 0:0:8:29



No error is found by the obdiag test, but the panic persists.

Disk backplane failure: the qlc@2 path failed during boot.

WARNING: Device /pci@8,600000/SUNW,qlc@2 being marked with 'status' == fail

Testing with obdiag runs fine:
obdiag> test 1
Hit the spacebar to interrupt testing
Testing /pci@8,600000/SUNW,qlc@2
>> Testing RISC RAM (this may take a while)..........
>> Firmware copied
>> Waiting for loop to come up.
>> Waiting for firmware ready state
>> FCAL device count = 0xe
>> Found device with loop ID 0x7d (AL_PA = 0x1 )
>> Found device with loop ID 0x0 (AL_PA = 0xef )
>> Found device with loop ID 0x1 (AL_PA = 0xe8 )
>> Found device with loop ID 0x2 (AL_PA = 0xe4 )
>> Found device with loop ID 0x6 (AL_PA = 0xdc )
>> Found device with loop ID 0x3 (AL_PA = 0xe2 )
>> Found device with loop ID 0x4 (AL_PA = 0xe1 )
>> Found device with loop ID 0x5 (AL_PA = 0xe0 )
>> Found device with loop ID 0x8 (AL_PA = 0xd9 )
>> Found device with loop ID 0x9 (AL_PA = 0xd6 )
>> Found device with loop ID 0xa (AL_PA = 0xd5 )
>> Found device with loop ID 0xb (AL_PA = 0xd4 )
>> Found device with loop ID 0xc (AL_PA = 0xd3 )
>> Found device with loop ID 0xd (AL_PA = 0xd2 )
>> ISP2200 found at loop ID 0x7d
>> Enclosure services device found at loopid 0x6
>> Direct-access device ( disk 0 ) found at loop ID 0x0
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 0
>> Direct-access device ( disk 1 ) found at loop ID 0x1
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 1
>> Direct-access device ( disk 2 ) found at loop ID 0x2
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 2
>> Direct-access device ( disk 3 ) found at loop ID 0x3
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 3
>> Direct-access device ( disk 4 ) found at loop ID 0x4
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 4
>> Direct-access device ( disk 5 ) found at loop ID 0x5
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 5
>> Direct-access device ( disk 6 ) found at loop ID 0x8
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 8
>> Direct-access device ( disk 7 ) found at loop ID 0x9
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: 9
>> Direct-access device ( disk 8 ) found at loop ID 0xa
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: a
>> Direct-access device ( disk 9 ) found at loop ID 0xb
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: b
>> Direct-access device ( disk 10 ) found at loop ID 0xc
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: c
>> Direct-access device ( disk 11 ) found at loop ID 0xd
>> Waiting for disk to spin up (timeout in one minute)... Disk spun up.
>> Disk media testing - this will take a while.
>> Testing disk at loop ID: d
Selftest at /pci@8,600000/SUNW,qlc@2 .................................. passed
Pass:1 (of 1260) Errors:0 (of 0) Tests Failed:0 Elapsed Time: 0:0:8:29
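
A quick consistency check on a saved console capture like the one above can be sketched in shell. This is only an illustration: `check_fcal_count` is a hypothetical helper name, and it assumes the obdiag output was saved as plain text with the same "FCAL device count" and "Found device with loop ID" wording shown here.

```shell
#!/bin/sh
# Sketch: compare the device count obdiag reports (hex) with the number of
# "Found device" lines in a console capture read from stdin.
# check_fcal_count is a hypothetical helper, not a Sun/Oracle tool.
check_fcal_count() {
    log=$(cat)
    # Take the hex value from the first "FCAL device count = 0x.." line.
    reported=$(printf '%s\n' "$log" |
        sed -n 's/.*FCAL device count = \(0x[0-9a-fA-F]*\).*/\1/p' | head -n 1)
    if [ -z "$reported" ]; then
        echo "no FCAL device count line found"
        return 2
    fi
    # Count the devices that actually answered on the loop.
    found=$(printf '%s\n' "$log" | grep -c 'Found device with loop ID')
    expected=$((reported))   # shell arithmetic converts 0x.. to decimal
    if [ "$found" -eq "$expected" ]; then
        echo "reported=$expected found=$found OK"
    else
        echo "reported=$expected found=$found MISMATCH"
        return 1
    fi
}
```

Usage would be along the lines of `check_fcal_count < console.log`, where `console.log` is whatever file the console output was captured to. A mismatch would suggest that a device dropped off the loop between the count and the scan.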


The system still panics during boot:

panic[cpu6]/thread=180e000: vfs_mountroot: cannot mount root

000000000180b950 genunix:vfs_mountroot+370 (1861800, 188c800, 0, 129a800, 1865800, 6)
%l0-3: 0000030002866e00 0000000001858d90 000000000113d000 00000000018be400
%l4-7: 0000000000000600 0000000000000200 0000000000000800 0000000000000200
000000000180ba10 genunix:main+10c (18b2000, 180c000, 183b2c0, 10aa400, 0, 183c190)
%l0-3: 0000000000000001 0000000070002000 0000000070002000 0000000000000000
%l4-7: 0000000001841800 0000000000000000 0000000001815400 0000000001815648


To confirm that a single disk is not causing the backplane to be faulted, pull out all internal disks and run the tests with different disks installed, so that a single-disk problem can be ruled out. Process of elimination is the only way to determine the actual cause of the fault.

Solution

Replace disk BackPlane or disk as appropriate.
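
When working through the process of elimination, it may help to pull the faulted device paths out of each saved boot log and compare them between attempts. A minimal sketch, assuming the console output is captured to a text file and the warning keeps the exact wording quoted in the article; `failed_devices` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the unique device paths that the boot log marks as failed.
# Reads a console capture on stdin; failed_devices is a hypothetical name.
failed_devices() {
    sed -n "s/^.*WARNING: Device \(.*\) being marked with 'status' == fail.*/\1/p" |
        sort -u
}
```

For example, `failed_devices < boot-with-disk0-only.log` (file name is illustrative). If the same path keeps appearing no matter which disks are installed, that points at the backplane rather than a single disk.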
31.12.2021 13:18
twain
 
I see, so the disk needs replacing, and since there is no mirror, the system has to be reinstalled as well.
Sad. At least everything inside is clean now.
31.12.2021 13:34
Occul
 
If This is a Cold Replace:
1, Have the Administrator shutdown the application and OS.
2, Power off the machine
3, Proceed to steps to "Physically remove Disk From System"

If this is a Hot-Plug Replace:
The drive should be in an unused state for Hot-Plug replacement.
1, As root run the luxadm remove_device command
# luxadm remove_device <Disk_Drive>
where Disk_Drive is /dev/rdsk/cXtXdXs2
The system will ask for verification.
The system will prompt you to hit return after Replacing the Drive.
Note – It may take up to one minute for the drive to come offline and spin down
2, Proceed to "Physically remove Disk From System". And then return to step 3 in Hot-Plug
3, Hit Return to acknowledge that the physical replacement is complete.
Note – Screen confirmation may take up to one minute.

4. Enter the following command to introduce the disk to the system:

# /usr/sbin/devfsadm
or
# /usr/sbin/luxadm insert_device <enclosure_name>,sx

where x is the slot number. (Use luxadm display <enclosure_name> to find the slot number. To find the <enclosure_name> do a luxadm probe .)

5, Verify with format that the drive is present.
6, Hot-Plug replace is complete.
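
The hot-plug steps above can be sketched as a small wrapper that defaults to printing the commands instead of running them. This is an illustration only: `run` and `hotplug_replace` are hypothetical helper names and `DRY_RUN` is an assumed convention; the commands themselves are the ones quoted in the procedure, and `luxadm remove_device` stays interactive when actually executed.

```shell
#!/bin/sh
# Sketch only: walks through the hot-plug sequence for one disk.
# DRY_RUN defaults to 1, so commands are printed, not executed.
# run and hotplug_replace are hypothetical helper names.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

hotplug_replace() {
    disk=$1   # raw whole-disk device, e.g. /dev/rdsk/c4t1d0s2
    case $disk in
        /dev/rdsk/c*t*d*s2) ;;
        *) echo "expected /dev/rdsk/cXtXdXs2, got: $disk" >&2; return 1 ;;
    esac
    # Steps 1-3: luxadm stops the drive and then waits for Return
    # once the drive has been physically swapped.
    run luxadm remove_device "$disk" || return 1
    # Step 4: create device nodes for the new drive.
    run /usr/sbin/devfsadm || return 1
    # Step 5: checking with format is left as a manual step.
    echo "verify: run 'format' and check that the drive is listed"
}
```

Running `hotplug_replace /dev/rdsk/c4t1d0s2` with the default settings only prints the plan, which can be reviewed before anything touches the hardware.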

For additional information on hot replacement see:

Removing and Replacing the Sun Fire[TM] 280R, Sun Fire[TM] V480, Sun Fire[TM] V490, Sun Fire[TM] V880, Sun Fire[TM] V880z or Sun Fire[TM] V890 Hot-Pluggable Internal Disk Drives. [ID 1007367.1]

Solaris Volume Manager (SVM) How to Replace Internal FC-AL Disks in 280R, V480, V490, V880, V890, and E3500 Servers (Doc ID 1010753.1)

Physically remove Disk From System
1. Open front bezel on the server.
2. Slide the catch to the right, remove the drive
3. Slide the catch to the right, install the new drive.
4. Push the metal lever until the HDD clicks in place.
5. Close the front bezel.
6, Cold-Replace is Complete. For Hot-Plug Return to Step 3 in Hot-Plug section.


OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO
AN OPERATIONAL STATE:
Reconfigure disk as previously configured. e.g. Part of raid set or mounted file-system or raw partition or other.
31.12.2021 13:38
Occul
 
This document provides a working example of how to replace a system's failed internal Fibre Channel Arbitrated Loop (FC-AL) disk when it is under Solaris Volume Manager (SVM) control. The most typical scenario involves a failed boot disk mirrored with SVM.

1) Let's start with a working mirrored pair of boot disks in a V490.
In this example, the root and swap partitions are mirrored.
Notice the optimal outputs from metastat, metadb and format.
# metastat

d1: Mirror
Submirror 0: d11
State: Okay
Submirror 1: d21
State: Okay
Size: 1052163 blocks (513 MB)

d11: Submirror of d1
State: Okay
Size: 4194828 blocks (2.0 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t1d0s1 0 No Okay Yes
d21: Submirror of d1
State: Okay
Size: 4194828 blocks (2.0 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t0d0s1 0 No Okay Yes

d0: Mirror
Submirror 0: d10
State: Okay
Submirror 1: d20
State: Okay
Size: 16629921 blocks (7.9 GB)

d10: Submirror of d0
State: Okay
Size: 25166079 blocks (12 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t1d0s0 0 No Okay Yes
d20: Submirror of d0
State: Okay
Size: 25166079 blocks (12 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t0d0s0 0 No Okay Yes

Device Relocation Information:
Device Reloc Device ID
c4t0d0 Yes id1,ssd@n20000004cf7fe655
c4t1d0 Yes id1,ssd@n20000004cf8f57c1

# format
28. c4t0d0 <SEAGATE-ST336605FSUN36G-0438 cyl 24620 alt 2 hd 27 sec 107>
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cf7fe655,0
29. c4t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cf8f57c1,0

# metadb
flags first blk block count
a m p luo 16 8192 /dev/dsk/c4t0d0s7
a p luo 8208 8192 /dev/dsk/c4t0d0s7
a p luo 16400 8192 /dev/dsk/c4t0d0s7
a p luo 16 8192 /dev/dsk/c4t1d0s7
a p luo 8208 8192 /dev/dsk/c4t1d0s7
a p luo 16400 8192 /dev/dsk/c4t1d0s7


2) Disk c4t1d0 fails. Notice how the outputs have changed. Submirrors are in "Needs Maintenance", state database replicas have write errors and format output shows "drive type unknown". Other logs such as /var/adm/messages should be reviewed as well for evidence of failure.
# metastat

d1: Mirror
Submirror 0: d11
State: Needs maintenance
Submirror 1: d21
State: Okay
Size: 1052163 blocks (513 MB)

d11: Submirror of d1
State: Needs maintenance
Size: 4194828 blocks (2.0 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t1d0s1 0 No Maintenance Yes
d21: Submirror of d1
State: Okay
Size: 4194828 blocks (2.0 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t0d0s1 0 No Okay Yes

d0: Mirror
Submirror 0: d10
State: Needs maintenance
Submirror 1: d20
State: Okay
Size: 16629921 blocks (7.9 GB)

d10: Submirror of d0
State: Needs maintenance
Size: 25166079 blocks (12 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t1d0s0 0 No Maintenance Yes
d20: Submirror of d0
State: Okay
Size: 25166079 blocks (12 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t0d0s0 0 No Okay Yes

Device Relocation Information:
Device Reloc Device ID
c4t0d0 Yes id1,ssd@n20000004cf7fe655
c4t1d0 Yes id1,ssd@n20000004cf8f57c1

# format
28. c4t0d0 <SEAGATE-ST336605FSUN36G-0438 cyl 24620 alt 2 hd 27 sec 107>
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cf7fe655,0
29. c4t1d0 <drive type unknown>
/pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cf8f57c1,0

# metadb
flags first blk block count
a m p luo 16 8192 /dev/dsk/c4t0d0s7
a p luo 8208 8192 /dev/dsk/c4t0d0s7
a p luo 16400 8192 /dev/dsk/c4t0d0s7
W p l 16 8192 /dev/dsk/c4t1d0s7
W p l 8208 8192 /dev/dsk/c4t1d0s7
W p l 16400 8192 /dev/dsk/c4t1d0s7
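
To spot the affected submirrors without reading the whole listing, the metastat output can be filtered. A minimal sketch, assuming the output has the layout shown above (a "dNN: Submirror of dN" header followed by a "State:" line); `needs_maintenance` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the names of submirrors whose State is "Needs maintenance".
# Reads metastat output on stdin; needs_maintenance is a hypothetical name.
needs_maintenance() {
    awk '
        # Remember the most recent metadevice header and what kind it is.
        /^[ \t]*d[0-9]+: / { name = $1; sub(/:$/, "", name); kind = $2 }
        # Report only when the State line belongs to a submirror header.
        /State: Needs maintenance/ && kind == "Submirror" { print name }
    '
}
```

Typical use would be `metastat | needs_maintenance`; for the step 2 listing it would name d11 and d10.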


3) If any submirror in the failed disk is still reporting as "Okay", that submirror should be detached. For example if d10 was still reporting "Okay":
# metadetach d0 d10
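
Finding such submirrors by eye is error-prone, so here is a rough sketch that prints the metadetach commands to consider. It assumes the metastat layout shown earlier and takes the failed disk name (for example c4t1d0) as its argument; `detach_candidates` is a hypothetical helper name, and it only prints commands, it does not run them.

```shell
#!/bin/sh
# Sketch: for a failed disk, print a metadetach command for every submirror
# on that disk that metastat still reports as "Okay".
# Reads metastat output on stdin; detach_candidates is a hypothetical name.
detach_candidates() {
    awk -v disk="$1" '
        /^[ \t]*d[0-9]+: Submirror of / {
            sub_name = $1; sub(/:$/, "", sub_name)   # e.g. d10
            parent = $4                              # e.g. d0
            state = ""
            next
        }
        /^[ \t]*d[0-9]+: / { sub_name = ""; next }   # mirror or other header
        sub_name != "" && /^[ \t]*State: / {
            state = $0
            sub(/^[ \t]*State: */, "", state)
            sub(/[ \t]+$/, "", state)
            next
        }
        # Component line: a slice of the failed disk under an Okay submirror.
        sub_name != "" && index($1, disk "s") == 1 && state == "Okay" {
            print "metadetach " parent " " sub_name
        }
    '
}
```

For instance, `metastat | detach_candidates c4t1d0` would print `metadetach d0 d10` in the situation described in step 3, and nothing if every submirror on that disk is already in maintenance.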


4) We can now proceed with the hardware replacement of the disk.
# luxadm remove_device -F /dev/rdsk/c4t1d0s2

WARNING!!! Please ensure that no filesystems are mounted on these device(s).
All data on these devices should have been backed up.
The list of devices which will be removed is:
1: Device name: /dev/rdsk/c4t1d0s2
Node WWN: 20000004cf8f57c1
Device Type:Disk device
Device Paths:
/dev/rdsk/c4t1d0s2
Please verify the above list of devices and
then enter 'c' or <CR> to Continue or 'q' to Quit. [Default: c]:
stopping: /dev/rdsk/c4t1d0s2....Done
offlining: /dev/rdsk/c4t1d0s2....Done
Hit <Return> after removing the device(s).
............
Device: /dev/rdsk/c4t1d0s2 Removed.

If disk is Hot-Pluggable refer to:
Document 1007367.1 Removing and Replacing 280R, V480, V490, V880, V880z or V890 Hot-Pluggable Internal Disk Drives

For the v880 only, you may remove the device by enclosure name and slot number. For example:
# luxadm remove_device FCloop,s1



If luxadm remove_device fails, you may have to perform the following additional steps:


# init 0

Physically remove the disk cxtyd0

ok boot

# devfsadm -C

Verify that this removed /dev/[r]dsk/cxtyd0*

# devfsadm -c disk




5) Physically replace the disk.

6) The next command will create the devices for the new drive.
# luxadm insert_device
Please hit <RETURN> when you have finished adding Fibre Channel Enclosure(s)/Device(s):
Waiting for Loop Initialization to complete...
New Logical Nodes under /dev/dsk and /dev/rdsk :
c4t1d0s0
c4t1d0s1
c4t1d0s2
c4t1d0s3
c4t1d0s4
c4t1d0s5
c4t1d0s6
c4t1d0s7


For the v880 only, you may add the device by enclosure name and slot number. For example:
# luxadm insert_device FCloop,s1


Alternatively, you can use the command devfsadm -Cv to create the devices. Depending on your version of Solaris, the picld daemon may make them for you as well. If all these fail, a reconfiguration reboot usually solves the problem.

If the device to be replaced is shown as 'unusable' in cfgadm output, then check:
Document 1639070.1 Steps For Clearing Devices in Unusable or Failing State From cfgadm After LUNs Have Already Been Removed (Doc ID 1639070.1)



7) Once the disk is labeled and partitioned in Solaris, we can finish up the repairs in SVM.

Prepare the partition table on the disk to match that of the disk it is mirrored with:
# prtvtoc /dev/rdsk/c4t0d0s2 | fmthard -s - /dev/rdsk/c4t1d0s2
fmthard: New volume table of contents now in place.

See also:
Document 1386408.1 Solaris Volume Manager (SVM): How To Copy The Partition Table From One Disk To Another
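
After copying the partition table, the result can be checked by comparing the prtvtoc output of both disks. A minimal sketch, assuming each output was saved to a file first; `vtoc_matches` is a hypothetical helper name. It ignores the comment lines starting with `*` and compares only the first six columns, so a differing mount directory column does not count as a mismatch.

```shell
#!/bin/sh
# Sketch: succeed if two saved prtvtoc outputs describe the same slices.
# vtoc_matches is a hypothetical helper; arguments are two file names.
vtoc_matches() {
    # Drop "*" comment lines and blank lines, keep partition, tag, flags,
    # first sector, sector count and last sector.
    a=$(awk '!/^\*/ && NF { print $1, $2, $3, $4, $5, $6 }' "$1")
    b=$(awk '!/^\*/ && NF { print $1, $2, $3, $4, $5, $6 }' "$2")
    [ "$a" = "$b" ]
}
```

One way to use it: save `prtvtoc /dev/rdsk/c4t0d0s2` and `prtvtoc /dev/rdsk/c4t1d0s2` to two files (names are up to you) and run `vtoc_matches src.vtoc dst.vtoc && echo same`.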


8) Restore the state database replicas.
# metadb -d c4t1d0s7
# metadb -a -c 3 c4t1d0s7
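
To confirm which slice the delete/add pair above should target, the metadb listing can be filtered for replicas carrying the W flag. This is a sketch under the assumption that the output looks like the step 2 listing, with the device path in the last column; `replicas_with_write_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print each slice that holds at least one replica flagged "W".
# Reads metadb output on stdin; the function name is hypothetical.
replicas_with_write_errors() {
    awk '
        $NF ~ /^\/dev\/dsk\// {
            # Flags are the fields before the block numbers and the path.
            for (i = 1; i < NF; i++)
                if ($i == "W") { print $NF; next }
        }
    ' | sort -u
}
```

With the step 2 output, `metadb | replicas_with_write_errors` would print only /dev/dsk/c4t1d0s7.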


9) Remirror and monitor the resync using the metastat command. Note that any submirror which was manually detached needs to be reattached. For example, if metadetach was used on d10, reattach it:
# metattach d0 d10

9a) Otherwise, remirror with metareplace.

First Update Device Relocation Information (DRI) of replaced drive:
# metadevadm -u c2t0d0 (5.9 Only)
For Solaris 10 check if affected by
Bug 22065674 - 5.10_u[10,11] - metareplace corrupting metadb's. metadevadm -u ineffective.
Typical affected systems are when one of the following patches is installed:
(SPARC) SunPatch 145899-03 or higher - SunOS 5.10: SVM patch
(X86) SunPatch 145900-03 or higher - SunOS 5.10_x86: SVM patch
both of which reached revision -15 before being rolled into patches
(SPARC) SunPatch 147147-26 SunOS 5.10: kernel patch
(X86) SunPatch 148076-09 SunOS 5.10_x86: md patch
Further details in:
Document 2090016.1 Solaris Volume Manager (SVM) the 'metadevadm -u' Command Shows "New device reloc information" or "Invalid device relocation information detected" But DevID Is Not Updated

Then run metareplace.
# metareplace -e d0 c4t1d0s0
d0: device c4t1d0s0 is enabled
# metareplace -e d1 c4t1d0s1
d1: device c4t1d0s1 is enabled



10) If metattach/metareplace fails, check the status of the mirror. If both submirrors are in 'Needs maintenance', check the 'Invoke' line in the metastat output for the good device.
d1: Mirror
Submirror 0: d11
State: Needs maintenance
Submirror 1: d21
State: Needs maintenance
Size: 1052163 blocks (513 MB)

d11: Submirror of d1
State: Needs maintenance
Size: 4194828 blocks (2.0 GB)
Invoke: metareplace d1 /dev/dsk/c4t1d0s1 <new device>
Device Start Block Dbase State Reloc Hot Spare
c4t1d0s1 0 No Maintenance Yes
d21: Submirror of d1
State: Needs maintenance
Invoke: metasync d1
Size: 4194828 blocks (2.0 GB)
Device Start Block Dbase State Reloc Hot Spare
c4t0d0s1 0 No Okay Yes

The submirror d21 can be returned to the "Okay" state by using the metasync command as shown in the output of metastat. Once it is in the "Okay" state, return to the metattach/metareplace command.
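
The recovery hints that metastat prints can be pulled out on their own. A minimal sketch, assuming the "Invoke:" lines look like the ones in the step 10 listing; `invoke_hints` is a hypothetical helper name, and the printed commands are meant to be read and adapted, not piped straight into a shell.

```shell
#!/bin/sh
# Sketch: print only the commands that metastat suggests on "Invoke:" lines.
# Reads metastat output on stdin; invoke_hints is a hypothetical name.
invoke_hints() {
    sed -n 's/^[[:space:]]*Invoke:[[:space:]]*//p'
}
```

For example, `metastat d1 | invoke_hints` would list the metareplace and metasync suggestions for d1 from the listing above.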
31.12.2021 13:43
OlegON
 
Hold on, why do you say there is no mirror? They do seem to be mirrored...
