Random dCache failures in SAM
As most sites will know, SAM occasionally reports random failures. It is often unclear what caused them (poor error messages don't help), and the "problem" does not recur during the next test. This is one reason why a BDII-independent SRM test would help: it would decouple problems with the information system from problems with the dCache. This page is still being added to.
Contents
- 1 No valid credential found
- 2 Timeout when executing test ??? after 600 seconds!
- 3 Communication error on send / Error Delete failed
- 4 InvocationTargetException
- 5 CGSI-gSOAP: GSS Major Status: Authentication Failed
- 6 gPlazma timed out
- 7 Name server not active
- 8 gethostbyname
- 9 CGSI-gSOAP: GSS Major Status: General failure
No valid credential found
User credentials or host credentials? We reckon this is a problem with the LFC, but how can we check?
    + lcg-cp -v --vo ops lfn:SRM-put-gfe02.hep.ph.ic.ac.uk-1185595194 file:/home/samops/.same/SRM/nodes/gfe02.hep.ph.ic.ac.uk/testFile.txt
    send2nsd: NS002 - send error : No valid credential found
    Bad credentials
    lcg_cp: Communication error on send
    Using grid catalog type: lfc
    Using grid catalog : prod-lfc-shared-central.cern.ch
    VO name: ops
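One client-side check that can at least rule user credentials in or out is whether the test account has a proxy file where Globus normally looks for it. The sketch below is a rough aid, not a definitive test: it assumes the usual search order ($X509_USER_PROXY first, then /tmp/x509up_u<uid>), the function name find_user_proxy is made up for this page, and it only checks that the file exists, not that the proxy is unexpired or carries the right VO attributes.

```python
import os
from typing import Mapping, Optional


def find_user_proxy(env: Optional[Mapping[str, str]] = None,
                    uid: Optional[int] = None) -> Optional[str]:
    """Return the path of the first proxy file that exists, or None.

    Assumed search order: $X509_USER_PROXY, then /tmp/x509up_u<uid>.
    Existence only; the proxy's validity is NOT checked here.
    """
    env = os.environ if env is None else env
    uid = os.getuid() if uid is None else uid

    candidates = []
    if env.get("X509_USER_PROXY"):
        candidates.append(env["X509_USER_PROXY"])
    candidates.append(f"/tmp/x509up_u{uid}")

    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    proxy = find_user_proxy()
    print(proxy if proxy else "no proxy file found in the assumed locations")
```

If this finds a proxy for the account running the test, the user-credential side looks less likely and attention can move to the catalogue end.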
Timeout when executing test ??? after 600 seconds!
Does anyone know what is timing out? Is this a network problem, or is the server too busy?
    + lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-heplnx204.pp.rl.ac.uk-1185947904 -d heplnx204.pp.rl.ac.uk
    Timeout when executing test SRM-put after 600 seconds!
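A crude way to start separating "network problem" from "server too busy" is to time a plain TCP connection to the SRM door. This is only a sketch: the helper name time_tcp_connect is invented here, and a quick connect merely suggests that the network path and the listening socket are fine, not that the SRM behind it is responsive.

```python
import socket
import time
from typing import Optional


def time_tcp_connect(host: str, port: int, timeout: float = 10.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure.

    DNS errors, refused connections and timeouts all count as failure.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

For example, time_tcp_connect("heplnx204.pp.rl.ac.uk", 8443) would probe the host from the log above; port 8443 is an assumption (a common SRM port) and should be replaced with whatever the site actually publishes.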
Communication error on send / Error Delete failed
Internal dCache communication problem perhaps? Or is it a problem communicating with the client?
    + lcg-del -v --vo ops -a lfn:SRM-put-heplnx204.pp.rl.ac.uk-1185982360
    java.rmi.RemoteException: srm advisoryDelete failed; nested exception is:
    java.lang.RuntimeException: advisoryDelete(User [name=ops001, uid=40101, gid=24336, root=/],pnfs/pp.rl.ac.uk/data/ops/generated/2007-08-01/file919f58c3-7ca9-41ee-8f3b-1f384eca7d11) Error Delete failed: NULL
    lcg_del: Communication error on send
    VO name: ops
    Using GUID : 7a5c2db6-fcd6-4ec5-8899-4dcad45ec6d3
    set timeout to 0 seconds
    srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/ops/generated/2007-08-01/file919f58c3-7ca9-41ee-8f3b-1f384eca7d11 is NOT deleted
    + result=1
Explanation
This is caused by the dCache not deleting the file within 10 seconds of receiving the request from the SRM. The SRM then returns an error to the client, even though the operation could still succeed in, e.g., 11 seconds. Therefore, although the lcg-del failed, the file will actually be removed from the dCache. At this time (v1.7.0-39), this timeout cannot be configured. The dCache developers are working on it.
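The race described above can be mimicked with a toy model. This is not dCache code: srm_delete and slow_delete are hypothetical names, the "store" is just a Python set, and the times are scaled-down stand-ins for the 10 second limit and a slightly slower delete. It shows the front end reporting failure while the back end still completes the removal.

```python
import threading
import time
from typing import Set, Tuple


def slow_delete(store: Set[str], name: str, seconds: float) -> None:
    """Stand-in for the back end: takes `seconds`, then removes the file."""
    time.sleep(seconds)
    store.discard(name)


def srm_delete(store: Set[str], name: str,
               work_seconds: float, timeout: float) -> Tuple[bool, threading.Thread]:
    """Start the delete and wait at most `timeout` seconds for it.

    Returns (finished_in_time, worker). The worker is NOT cancelled on
    timeout, so the delete still happens after the error is reported.
    """
    worker = threading.Thread(target=slow_delete, args=(store, name, work_seconds))
    worker.start()
    worker.join(timeout)
    return (not worker.is_alive(), worker)


if __name__ == "__main__":
    files = {"testFile"}
    # 0.2 s plays the role of the fixed timeout, 0.4 s of a slower delete.
    ok, worker = srm_delete(files, "testFile", work_seconds=0.4, timeout=0.2)
    print("client told delete succeeded:", ok)
    worker.join()  # give the back end time to finish
    print("file still present afterwards:", "testFile" in files)
```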
InvocationTargetException
    + lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-heplnx204.pp.rl.ac.uk-1194051383 -d heplnx204.pp.rl.ac.uk
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    the server sent an error response: 500 500 java.lang.reflect.InvocationTargetException: java.rmi.RemoteException: srm advisoryDelete failed; nested exception is:
    java.lang.RuntimeException: advisoryDelete(User [name=ops001, uid=40101, gid=24336, root=/],pnfs/pp.rl.ac.uk/data/ops/generated/2007-11-03/file1802f1e7-b022-4052-ae57-dddf45e8641b) Error file does not exist, cannot delete
    lcg_cr: No such file or directory
    Using grid catalog type: lfc
    Using grid catalog : prod-lfc-shared-central.cern.ch
    Using LFN : /grid/ops/SAM/SRM-put-heplnx204.pp.rl.ac.uk-1194051383
    Using SURL : srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/ops/generated/2007-11-03/file1802f1e7-b022-4052-ae57-dddf45e8641b
    Source URL: file:/home/samops/.same/SRM/testFile.txt
    File size: 41472
    VO name: ops
    Destination specified: heplnx204.pp.rl.ac.uk
    Destination URL for copy: gsiftp://heplnx173.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/ops/generated/2007-11-03/file1802f1e7-b022-4052-ae57-dddf45e8641b
    # streams: 1
    # set timeout to 0 seconds
    Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-heplnx204.pp.rl.ac.uk-1194051383
    Copy Failed: Unregistering alias from catalog.
    + result=1
CGSI-gSOAP: GSS Major Status: Authentication Failed
Problem getting hold of server certificates?
    + lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-dcache02.tier2.hep.manchester.ac.uk-1186123222 -d dcache02.tier2.hep.manchester.ac.uk
    CGSI-gSOAP: GSS Major Status: Authentication Failed
    GSS Minor Status Error Chain:
    (null)
    lcg_cr: Communication error on send
    Using grid catalog type: lfc
    Using grid catalog : prod-lfc-shared-central.cern.ch
    Using LFN : /grid/ops/SAM/SRM-put-dcache02.tier2.hep.manchester.ac.uk-1186123222
    Using SURL : srm://dcache02.tier2.hep.manchester.ac.uk/pnfs/tier2.hep.manchester.ac.uk/data/ops/generated/2007-08-03/file82f5de94-7cf2-4de7-9b32-4f07eb2a23a2
    + result=1
gPlazma timed out
This is probably because the gPlazma cell is being used rather than the module. With the module, other dCache cells call the gPlazma methods directly to do the authorisation. With the cell, there is a dedicated process which other cells must talk to, and this can lead to timeouts if there are problems with inter-cell communication.
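The difference between the two modes can be pictured with a small model. Nothing here is dCache's real API: authorise, authorise_via_module, AuthCell and the ALLOWED set are all hypothetical names chosen for illustration. The point is only that a direct call has no messaging step that can time out, whereas a request sent to a dedicated worker can.

```python
import queue
import threading
import time

# Hypothetical allow-list standing in for the real authorisation decision.
ALLOWED = {"/CN=Example User - ops"}


def authorise(dn: str) -> bool:
    """The authorisation logic itself (illustrative only)."""
    return dn in ALLOWED


def authorise_via_module(dn: str) -> bool:
    """Module style: the caller runs the logic directly, no messaging."""
    return authorise(dn)


class AuthCell:
    """Cell style: a dedicated worker that answers messages from a queue."""

    def __init__(self, delay: float = 0.0) -> None:
        self.inbox: queue.Queue = queue.Queue()
        self.delay = delay  # simulated inter-cell communication delay
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        while True:
            dn, reply = self.inbox.get()
            time.sleep(self.delay)
            reply.put(authorise(dn))

    def request(self, dn: str, timeout: float) -> bool:
        """Send a request and wait up to `timeout` seconds for the answer."""
        reply: queue.Queue = queue.Queue(maxsize=1)
        self.inbox.put((dn, reply))
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("message to auth cell timed out") from None


if __name__ == "__main__":
    dn = "/CN=Example User - ops"
    print("module:", authorise_via_module(dn))
    print("healthy cell:", AuthCell().request(dn, timeout=1.0))
    try:
        AuthCell(delay=0.5).request(dn, timeout=0.1)
    except TimeoutError as exc:
        print("slow cell:", exc)
```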
    + lcg-cp -v --vo ops lfn:SRM-put-heplnx204.pp.rl.ac.uk-1186472286 file:/home/samops/.same/SRM/nodes/heplnx204.pp.rl.ac.uk/testFile.txt
    the server sent an error response: 530 530 Authorization Service failed: diskCacheV111.services.authorization.AuthorizationServiceException: authRequestID 761915796 Message to gPlazma timed out for authentification of /C=CH/O=CERN/OU=GRID/CN=Judit Novak 0973 - ops
    lcg_cp: Invalid argument
    Using grid catalog type: lfc
    Using grid catalog : prod-lfc-shared-central.cern.ch
    VO name: ops
    + result=1
Name server not active
    + lcg-cr -v --vo ops -d srm.epcc.ed.ac.uk -l lfn:sft-lcg-rm-cr-wn0.epcc.ed.ac.uk.070820122357 file:///home/opssgm/globus-tmp.wn0.7879.0/WMS_wn0_08352_https_3a_2f_2frb127.cern.ch_3a9000_2fpSxaLLFVkJREpHb51kEa4g/work/testjob/nodes/ce.epcc.ed.ac.uk/sft-lcg-rm-cr.txt
    Name server not active
    lcg_cr: Communication error on send
    Using grid catalog type: lfc
    Using grid catalog : prod-lfc-shared-central.cern.ch
    Using LFN : /grid/ops/SAM/sft-lcg-rm-cr-wn0.epcc.ed.ac.uk.070820122357
    + result=1
gethostbyname
    + lcg-cp -v --vo ops lfn:SRM-put-srm.epcc.ed.ac.uk-1193901323 file:/home/samops/.same/SRM/nodes/srm.epcc.ed.ac.uk/testFile.txt
    globus_ftp_control_connect: globus_libc_gethostbyname_r failed
    lcg_cp: Invalid argument
    Using grid catalog type: lfc
    Using grid catalog : prod-lfc-shared-central.cern.ch
    VO name: ops
    + result=1
CGSI-gSOAP: GSS Major Status: General failure
    + lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-srm.epcc.ed.ac.uk-1194098403 -d srm.epcc.ed.ac.uk
    CGSI-gSOAP: GSS Major Status: General failure
    GSS Minor Status Error Chain:
    acquire_cred.c:125: gss_acquire_cred: Error with GSI credential
    globus_i_gsi_gss_utils.c:1310: globus_i_gsi_gss_cred_read: Error with gss credential handle
    globus_gsi_credential.c:721: globus_gsi_cred_read: Valid credentials could not be found in any of the possible locations specified by the credential search order.
    globus_gsi_credential.c:447: globus_gsi_cred_read: Error reading host credential
    globus_gsi_system_config.c:3977: globus_gsi_sysconfig_get_host_cert_filename_unix: Error with certificate filename
    globus_gsi_system_config.c:380: globus_i_gsi_sysconfig_create_cert_string: Error with certificate filename: /etc/grid-security/hostcert.pem not owned by current user.
    globus_gsi_credential.c:239: globus_gsi_cred_read: Error reading proxy credential
    globus_gsi_system_config.c:4589: globus_gsi_sysconfig_get_proxy_filename_unix: Could not find a valid proxy certificate file location
    globus_gsi_system_config.c:446: globus_i_gsi_s
    lcg_cr: Communication error on send
    Using grid catalog type: lfc
    Using grid catalog : prod-lfc-shared-central.cern.ch
    Using LFN : /grid/ops/SAM/SRM-put-srm.epcc.ed.ac.uk-1194098403
    Using SURL : srm://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/ops/generated/2007-11-03/filee5e1619f-9329-47a9-a4ee-42aca32127ac
    + result=1
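Since the failures collected on this page are each recognisable by a characteristic string, a failed SAM log can be matched against them mechanically. The following is a minimal sketch: the function name classify and the choice of substrings are our own, taken from the logs above, and "Communication error on send" is deliberately left out because it appears in several unrelated failures.

```python
from typing import List, Tuple

# (label, substring) pairs; the substrings are copied from the logs on this page.
KNOWN_FAILURES: List[Tuple[str, str]] = [
    ("No valid credential found", "No valid credential found"),
    ("Timeout when executing test", "Timeout when executing test"),
    ("Error Delete failed", "Error Delete failed"),
    ("InvocationTargetException", "InvocationTargetException"),
    ("GSS Authentication Failed", "GSS Major Status: Authentication Failed"),
    ("gPlazma timed out", "Message to gPlazma timed out"),
    ("Name server not active", "Name server not active"),
    ("gethostbyname", "gethostbyname_r failed"),
    ("GSS General failure", "GSS Major Status: General failure"),
]


def classify(log_text: str) -> List[str]:
    """Return the labels of every known failure whose substring occurs in the log."""
    return [label for label, needle in KNOWN_FAILURES if needle in log_text]


if __name__ == "__main__":
    sample = (
        "globus_ftp_control_connect: globus_libc_gethostbyname_r failed\n"
        "lcg_cp: Invalid argument\n"
    )
    print(classify(sample))
```

An empty result means the log matches none of the cases listed so far, which is a hint that a new section may be needed here.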