Discussion:
[Pw_forum] wrong record length
Marci
2009-03-02 16:42:48 UTC
Dear All,

I've run into a strange error. I ran a parallel scf calculation on a
quite big system and then tried to use the postprocessing program
epsilon.x. The scf calculation went well, but epsilon.x
crashed with:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 0
from diropn : error # 3
wrong record length
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The input file seems OK: I used wfcollect = .true. in the scf
calculation and set the same prefix and outdir directory in the
postprocessing input file. I searched the archive and the source code of
espresso, and it seems to me that this problem is related to the
specific cluster ( http://www.nersc.gov/nusers/systems/bassi ) where I
use quantum-espresso. The source code (e.g. /PW/diropn.f90 ) says:

!
! the unit for record length is unfortunately machine-dependent
!
unf_recl = DIRECT_IO_FACTOR * recl
if (unf_recl <= 0) call errore ('diropn', 'wrong record length', 3)

Maybe my problem is related to the value of DIRECT_IO_FACTOR? I forgot
to mention that I'm using espresso4.0. My question is: why does this
happen only with a postprocessing code, and how can I fix it?

On top of this, I succeeded in running epsilon.x on a similar but
smaller system on my home cluster, but it says USPPs are not
implemented. As I would like to perform only a 'jdos' calculation,
which doesn't need wavefunctions at all, what if I simply comment out
the corresponding part of epsilon.f90?

If someone could help me I can send the input files in private.

Thanks in advance.
Yours,
Marton

-----------------------------------------------------
Márton Vörös, physicist student
Department of Atomic Physics
Budapest University of Technology and Economics (BUTE)
Budafoki ut 8., H-1111, Budapest, Hungary
vormar_at_gmail_dot_com, vm776_at_hszk_dot_bme_dot_hu
-----------------------------------------------------
Axel Kohlmeyer
2009-03-02 17:18:15 UTC
On Mon, 2 Mar 2009, Marci wrote:

MV> Dear All,
MV>
MV> I faced a strange error. I've made a parallel scf calculation with a
MV> quite big system and after that I've tried to use a postprocess program
MV> called epsilon.x. The first calculation went well, however epsilon.x
MV> crashed saying:
MV>
MV> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MV> task # 0
MV> from diropn : error # 3
MV> wrong record length
MV> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MV>
MV> The input file seems OK, I used wfcollect = .true. in the scf
MV> calculation and set the same prefix and outdir directory in the
MV> postprocess input file. I searched the archive and the source code of
MV> espresso and it seems to me that this problem is related to this
MV> specific cluster ( http://www.nersc.gov/nusers/systems/bassi ) where I
MV> use quantum-espresso. The source code (eg. /PW/diropn.f90 ) says:

marton,

are you trying to run the postprocessing on your local
machine or on the IBM machine?

MV> !
MV> ! the unit for record length is unfortunately machine-dependent
MV> !
MV> unf_recl = DIRECT_IO_FACTOR * recl
MV> if (unf_recl <= 0) call errore ('diropn', 'wrong record length', 3)
MV>
MV> Maybe my problem is related to the value of DIRECT_IO_FACTOR? I forgot
MV> to mention, I'm using espresso4.0. My question is, why does this
MV> happen only with using a postprocess code and how can I cure it?

that depends on what is causing this. it could just be that you
have an integer overflow, due to the size of your system, or it
could be that you are trying to read unformatted data on a machine
with different endianness. i would suggest you insert a print statement
into the code that prints out the values of DIRECT_IO_FACTOR and recl
as well as unf_recl and then get back to us with the information
about the architectures and these numbers (ideally also for the
smaller test, where it worked).

thanks,
axel.


MV> On the top of this, I succeeded in running epsilon.x on a similar but
MV> smaller system at my home cluster but it says USPPs are not
MV> implemented. As I would like to perform only a 'jdos' calculation,
MV> which doesn't need wavefunctions at all, what if I simply comment out
MV> the corresponding part of epsilon.f90?
MV>
MV> If someone could help me I can send the input files in private.
MV>
MV> Thanks in advance.
MV> Yours,
MV> Marton
MV>
MV> -----------------------------------------------------
MV> Márton Vörös, physicist student
MV> Department of Atomic Physics
MV> Budapest University of Technology and Economics (BUTE)
MV> Budafoki ut 8., H-1111, Budapest, Hungary
MV> vormar_at_gmail_dot_com, vm776_at_hszk_dot_bme_dot_hu
MV> -----------------------------------------------------
MV> _______________________________________________
MV> Pw_forum mailing list
MV> Pw_forum at pwscf.org
MV> http://www.democritos.it/mailman/listinfo/pw_forum
MV>
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
Marci
2009-03-02 20:05:14 UTC
Hi Axel,

> marton,
>
> are you trying to run the postprocessing on your local
> machine or on the IBM machine?

On the IBM machine. I had bad experiences with postprocessing on a
different machine because of the iotk package: converting binary
files to text files and back is quite time-consuming... (and I hate
copying gigabytes of files over ssh)

> that depends on what is causing this. it could just be that you
> have an integer overflow, due to the size of your system, or it
> could be that you try to read unformatted data on a different
> endian machine. i would suggest you insert a print statement into
> the code that prints out the values of DIRECT_IO_FACTOR and recl
> as well as unf_recl and then get back to us with the information
> about the architectures and these numbers (ideally also for the
> smaller test, where it worked).

Unfortunately, the espresso I'm using on BASSI was not compiled by
me, and now I'm wary of compiling my own because I'm not sure that
it will be able to read the binary files written by an espresso
probably compiled with different compilers and/or compiler options.
Yeah, I know... I should have compiled my own version of quantum
espresso before making serious calculations to avoid these
situations.

So... I made some changes to diropn.f90 in espresso4.0/PW to print
the values below and compiled my own version of espresso (with this I
get the same error). These values are for the big run. Honestly, I do
not know much about this cluster, but I'm sure I'm using the XL
Fortran compiler version 11.1.0.3 and the essl library 4.2.0.3.

recl: 415578000
DIRECT_IO_FACTOR: 8
unf_recl: -970343296

On my home cluster, I used a parallelized espresso-4.0.3 on an
"Intel Xeon E5410 @ 2.33GHz, 16 GB RAM" system with ifort 10.1.015, intel mkl
libraries 10.0.1.014 and openmpi-1.2.6, with a smaller but similar
system (same pseudos, same cutoff, only gamma point). As I said, there
is no "wrong record length" error, and I got the following values:

recl: 97079200
DIRECT_IO_FACTOR: 8
unf_recl: 776633600

If I'm right... 415578000*8 = 3324624000, which is bigger than the
largest value of a signed 32-bit integer (2147483647). Maybe that
causes the problem?
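
The arithmetic can be checked with a few lines of Python. This is only a sketch, assuming the overflow wraps around in two's complement (the printed unf_recl is consistent with that); `wrap_int32` is a hypothetical helper and not part of espresso.

```python
def wrap_int32(value: int) -> int:
    """Return the value a signed 32-bit integer would hold after wraparound."""
    value &= 0xFFFFFFFF            # keep only the low 32 bits
    if value >= 0x80000000:        # top bit set means a negative number
        value -= 0x100000000
    return value


DIRECT_IO_FACTOR = 8
recl_big = 415578000      # big run on BASSI
recl_small = 97079200     # smaller run on the home cluster

print(recl_big * DIRECT_IO_FACTOR)                 # 3324624000 (exact product)
print(wrap_int32(recl_big * DIRECT_IO_FACTOR))     # -970343296
print(wrap_int32(recl_small * DIRECT_IO_FACTOR))   # 776633600
```

The wrapped value for the big run reproduces the negative unf_recl printed by the modified diropn.f90, while the small run stays below the 32-bit limit.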

Thanks for your help,
Marton
Lex Kemper
2009-03-02 20:50:47 UTC
Hi Marton,

Just in case you didn't run across this page:

http://www.nersc.gov/nusers/systems/bassi/programming.php

Cheers,

Lex Kemper
Department of Physics
University of Florida
Axel Kohlmeyer
2009-03-02 21:10:16 UTC
On Mon, 2 Mar 2009, Marci wrote:

MV> Hi Axel,
MV>
MV> > marton,
MV> >
MV> > are you trying to run the postprocessing on your local
MV> > machine or on the IBM machine?
MV>
MV> on the IBM machine. I had bad experiences with postprocessing on a
MV> different machine because of using the iotk package, converting binary
MV> files to text files and back is quite time consuming... (and I hate
MV> ssh-ing gigabytes of files)

just checking. actually, there are ways to make fortran read IEEE-754
compliant binary floating point numbers on different endian hardware,
but i never checked whether iotk can handle this as well.


[...]

MV> Unfortunately, the espresso I'm using on BASSI was not compiled by
MV> myself, and now I'm scared of compiling mine because I'm not sure that
MV> it will be able to read the binary that was made with an espresso
MV> probably compiled with different compilers and/or compiler options.

there is a big difference between linux and non-linux machines.
on linux there is a zoo of compilers and math libraries and
there are all kinds of subtle compatibility issues. on AIX
or other "commercial" platforms, this is generally less of an
issue, except that it is not as easy to replace one compiler with
another, in case the system-provided compiler is broken.

MV> Yeah, I know... I should have compiled my own version of quantum
MV> espresso before making serious calculations to avoid these
MV> situations.
MV>
MV> So... I made some changes in diropn.f90 in espresso4.0/PW and compiled
MV> my own version of espresso (with this I get the same error) to print
MV> the values below in the case of the big run, honestly I do not really
MV> know much about this cluster, but I'm sure I'm using compiler xl
MV> fortran version 11.1.0.3 and library essl 4.2.0.3.

that is fine.

MV>
MV> recl: 415578000
MV> DIRECT_IO_FACTOR: 8
MV> unf_recl: -970343296

bingo! this is your problem. 8 x 415578000 = 3324624000 is larger than
2^31 - 1 = 2147483647, so unf_recl defined as integer*4 will overflow.
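
The limit also gives the largest recl that is still safe. This is a minimal Python sketch, assuming a 32-bit default integer and the DIRECT_IO_FACTOR of 8 reported in this thread; `max_safe_recl` and `recl_fits` are illustrative names only.

```python
INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit integer


def max_safe_recl(direct_io_factor: int) -> int:
    """Largest recl whose product with direct_io_factor still fits in 32 bits."""
    return INT32_MAX // direct_io_factor


def recl_fits(recl: int, direct_io_factor: int) -> bool:
    """True if direct_io_factor * recl does not exceed INT32_MAX."""
    return recl <= max_safe_recl(direct_io_factor)


print(max_safe_recl(8))          # 268435455
print(recl_fits(97079200, 8))    # True  (smaller test system)
print(recl_fits(415578000, 8))   # False (big run)
```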

MV> On my home cluster, I used a parallelized espresso-4.0.3 on system
MV> "Intel Xeon E5410 @ 2.33Ghz, 16 GB RAM" with ifort 10.1.015, intel mkl
MV> libraries 10.0.1.014 and openmpi-1.2.6 and with a smaller but similar
MV> system (same pseudos, same cutoff, only gamma point), as I said there
MV> is no "wrong record length" error and I got the following values:
MV>
MV> recl: 97079200
MV> DIRECT_IO_FACTOR: 8
MV> unf_recl: 776633600
MV>
MV> If I'm right... 415578000*8 = 3324624000 which is bigger than the
MV> largest value of a signed 32 bit integer, maybe that causes the
MV> problem?

exactly.

the interesting question is now, how to work around this problem.
you could try and declare unf_recl as integer*8 and try to recompile.
perhaps, even just removing the test for negative unf_recl might work,
but i doubt it.

good luck,
axel.

MV> Thanks for your help,
MV> Marton
MV>
Paolo Giannozzi
2009-03-03 09:23:07 UTC
Axel Kohlmeyer wrote:

> the interesting question is now, how to work around this problem.
> you could try and declare unf_recl as integer*8 and try to recompile.

Once upon a time there were limitations on the maximum length of
direct access records on some machines. I remember I modified the
two routines that open and read/write in such a way that each
record was split into several records of the maximum allowed
length. Anyway, for very big jobs, integers that don't fit
into 32 bits will show up somewhere else, so the ultimate
solution is to compile with default 64-bit integers, I guess.
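
The record-splitting idea can be sketched as follows. This is not the actual espresso routine, only a minimal Python illustration under the assumption that a long record is cut into consecutive pieces no longer than a given maximum; `split_record` and `max_len` are hypothetical names.

```python
def split_record(total_len: int, max_len: int) -> list:
    """Split a record of total_len units into (offset, length) pieces
    so that no piece is longer than max_len."""
    if total_len < 0 or max_len <= 0:
        raise ValueError("total_len must be >= 0 and max_len must be > 0")
    pieces = []
    offset = 0
    while offset < total_len:
        length = min(max_len, total_len - offset)  # last piece may be shorter
        pieces.append((offset, length))
        offset += length
    return pieces


# The 3324624000-unit record from this thread, with a 2^31 - 1 limit per record:
print(split_record(3324624000, 2**31 - 1))
# [(0, 2147483647), (2147483647, 1177140353)]
```

Each piece would then be read or written as its own direct-access record, so no single record length exceeds the 32-bit limit.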

Paolo
--
Paolo Giannozzi, Democritos and University of Udine, Italy
Marci
2009-03-09 23:03:25 UTC
Dear Axel, Lex and Paolo,

Sorry for the late answer. Thanks again for all your help. If someone
runs into the same problem, here is the solution. Unfortunately, the
easiest way (just commenting out the test for negative unf_recl)
didn't work, and declaring unf_recl as integer*8 didn't help either.
However, on this specific system with the XL Fortran compiler, adding
"-qintsize=8" to FFLAGS solved my problem.

Yours,
Marton
Ary Junior
2009-05-30 18:49:38 UTC
Hi, I'm using 64-bit ifort 11.0 to compile espresso-4.0.4 and I get the
same "wrong record length" error... What is the equivalent of the
"-qintsize=8" option for the ifort compiler?

Thank you very much!
Axel Kohlmeyer
2009-05-30 19:12:30 UTC
Ary Junior wrote:

> Hi, I'm using ifort 11.0 64 bits to compile espresso-4.0.4 and I get
> the same "wrong record length"... Which is the equivalent
> "-qintsize=8" option for ifort compiler?

it is listed in the compiler help!

-i8 or -integer-size 64

axel.
Axel Kohlmeyer
2009-05-30 19:12:30 UTC
Permalink
Post by Ary Junior
Hi, I'm using ifort 11.0 64 bits to compile espresso-4.0.4 and I get
the same "wrong record length"... Which is the equivalent
"-qintsize=8" option for ifort compiler?
it is listed in the compiler help!

-i8 or -integer-size 64

axel.
Post by Ary Junior
Thank you very much!
Dear Axel, Lex and Paolo,
Sorry for the late answer. Thanks again for all your help. If someone
meets the same problem, here is the solution. Unfortunately the
easiest way didn't work (just outcommenting the test for negative
unf_recl) and declaring unf_recl as integer*8 didn't help either.
However on this specific system with XL fortran compiler adding
"-qintsize=8" to FFLAGS solved my problem.
Yours,
Marton
Post by Paolo Giannozzi
Post by Axel Kohlmeyer
the interesting question is now, how to work around this
problem.
Post by Paolo Giannozzi
Post by Axel Kohlmeyer
you could try and declare unf_recl as integer*8 and try to
recompile.
Post by Paolo Giannozzi
once upon a time there were limitations on the maximum
length of
Post by Paolo Giannozzi
direct access records on some machines. I remember I
modified the
Post by Paolo Giannozzi
two routines that open and read/write in such a way that
each
Post by Paolo Giannozzi
record was split into more records with the maximum allowed
length. Anyway, for very big jobs, integers that don't fit
into 32 bits will show up somewhere else, so the ultimate
solution is to compile with default 64-bit integers, I guess
Paolo
--
Paolo Giannozzi, Democritos and University of Udine, Italy
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://www.democritos.it/mailman/listinfo/pw_forum
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://www.democritos.it/mailman/listinfo/pw_forum
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://www.democritos.it/mailman/listinfo/pw_forum
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
Ary Junior
2009-05-30 18:49:38 UTC
Permalink
Hi, I'm using ifort 11.0 64 bits to compile espresso-4.0.4 and I get the
same "wrong record length"... Which is the equivalent "-qintsize=8" option
for ifort compiler?

Thank you very much!
Post by Marci
Dear Axel, Lex and Paolo,
Sorry for the late answer. Thanks again for all your help. If someone
meets the same problem, here is the solution. Unfortunately the
easiest way didn't work (just outcommenting the test for negative
unf_recl) and declaring unf_recl as integer*8 didn't help either.
However on this specific system with XL fortran compiler adding
"-qintsize=8" to FFLAGS solved my problem.
Yours,
Marton
Post by Paolo Giannozzi
Post by Axel Kohlmeyer
the interesting question is now, how to work around this problem.
you could try and declare unf_recl as integer*8 and try to recompile.
once upon a time there were limitations on the maximum length of
direct access records on some machines. I remember I modified the
two routines that open and read/write in such a way that each
record was split into more records with the maximum allowed
length. Anyway, for very big jobs, integers that don't fit
into 32 bits will show up somewhere else, so the ultimate
solution is to compile with default 64-bit integers, I guess
Paolo
--
Paolo Giannozzi, Democritos and University of Udine, Italy
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://www.democritos.it/mailman/listinfo/pw_forum
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://www.democritos.it/mailman/listinfo/pw_forum
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://pwscf.org/pipermail/pw_forum/attachments/20090530/dfae8d4e/attachment-0002.html
Marci
2009-03-09 23:03:25 UTC
Permalink
Dear Axel, Lex and Paolo,

Sorry for the late answer. Thanks again for all your help. If someone
meets the same problem, here is the solution. Unfortunately the
easiest way didn't work (just outcommenting the test for negative
unf_recl) and declaring unf_recl as integer*8 didn't help either.
However on this specific system with XL fortran compiler adding
"-qintsize=8" to FFLAGS solved my problem.

Yours,
Marton
Post by Paolo Giannozzi
Post by Axel Kohlmeyer
the interesting question is now, how to work around this problem.
you could try and declare unf_recl as integer*8 and try to recompile.
once upon a time there were limitations on the maximum length of
direct access records on some machines. I remember I modified the
two routines that open and read/write in such a way that each
record was split into more records with the maximum allowed
length. Anyway, for very big jobs, integers that don't fit
into 32 bits will show up somewhere else, so the ultimate
solution is to compile with default 64-bit integers, I guess
Paolo
--
Paolo Giannozzi, Democritos and University of Udine, Italy
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://www.democritos.it/mailman/listinfo/pw_forum
Paolo Giannozzi
2009-03-03 09:23:07 UTC
Permalink
Post by Axel Kohlmeyer
the interesting question is now, how to work around this problem.
you could try and declare unf_recl as integer*8 and try to recompile.
once upon a time there were limitations on the maximum length of
direct access records on some machines. I remember I modified the
two routines that open and read/write in such a way that each
record was split into more records with the maximum allowed
length. Anyway, for very big jobs, integers that don't fit
into 32 bits will show up somewhere else, so the ultimate
solution is to compile with default 64-bit integers, I guess

Paolo
--
Paolo Giannozzi, Democritos and University of Udine, Italy
Lex Kemper
2009-03-02 20:50:47 UTC
Permalink
Hi Marton,

Just in case you didn't run across this page:

http://www.nersc.gov/nusers/systems/bassi/programming.php

Cheers,

Lex Kemper
Department of Physics
University of Florida
Post by Marci
Hi Axel,
Post by Axel Kohlmeyer
marton,
are you trying to run the postprocessing on your local
machine or on the IBM machine?
on the IBM machine. I had bad experiences with postprocessing on a
different machine because of using the iotk package, converting binary
files to text files and back is quite time consuming... (and I hate
ssh-ing gygabites of files)
Post by Axel Kohlmeyer
that depends on what is causing this. it could just be that you
have an integer overflow, due to the size of your system, or it
could be that you try to read unformatted data on a different
endian machine. i would suggest you insert a print statment into
the code that prints out the values of DIRECT_IO_FACTOR and recl
as well as unf_recl and then get back to use with the information
about the architectures and these numbers (ideally also for the
smaller test, where it worked).
Unfortunately, the espresso I'm using on BASSI was not compiled by
myself, and now I'm scared of compiling mine because I'm not sure that
it will be able to read the binary that was made with an espresso
probably compiled with different compilers and/or compiler options.
Yeah, I know... I should have compiled my own version of quantum
espresso before making serious calculations to avoid these
situtations.
So... I made some changes in diropn.f90 in espresso4.0/PW and compiled
my own version of espresso (with this I get the same error) to print
the values below in the case of the big run, honestly I do not really
know much about this cluster, but I'm sure I'm using compiler xl
fortran version 11.1.0.3 and library essl 4.2.0.3.
recl: 415578000
DIRECT_IO_FACTOR: 8
unf_recl: -970343296
On my home cluster, I used a parallelized espresso-4.0.3 on system
libraries 10.0.1.014 and openmpi-1.2.6 and with a smaller but similar
system (same pseudos, same cutoff, only gamma point), as I said there
recl: 97079200
DIRECT_IO_FACTOR: 8
unf_recl: 776633600
If I'm right... 415578000*8 = 3324624000 which is bigger than the
largest value of a signed 32 bit integer, maybe that causes the
problem?
Thanks for your help,
Marton
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://www.democritos.it/mailman/listinfo/pw_forum
Axel Kohlmeyer
2009-03-02 21:10:16 UTC
Permalink
On Mon, 2 Mar 2009, Marci wrote:

MV> Hi Axel,
MV>
MV> > marton,
MV> >
MV> > are you trying to run the postprocessing on your local
MV> > machine or on the IBM machine?
MV>
MV> on the IBM machine. I had bad experiences with postprocessing on a
MV> different machine because of using the iotk package, converting binary
MV> files to text files and back is quite time consuming... (and I hate
MV> ssh-ing gygabites of files)

just checking. actually, there are ways to make fortran read IEEE-754
compliant binary floating point numbers on different endian hardware,
but i never checked whether iotk can handle this as well.


[...]

MV> Unfortunately, the espresso I'm using on BASSI was not compiled by
MV> myself, and now I'm scared of compiling mine because I'm not sure that
MV> it will be able to read the binary that was made with an espresso
MV> probably compiled with different compilers and/or compiler options.

there is a big difference between linux and non-linux machines.
on linux there is a zoo of compilers and math libraries and
there are all kinds of subtle compatibility issues. on AIX
or other "commercial" platforms, this is generally less of an
issue, only that it is not as easy to replace one compiler by
another, in case the system provided compiler is broken.

MV> Yeah, I know... I should have compiled my own version of quantum
MV> espresso before making serious calculations to avoid these
MV> situtations.
MV>
MV> So... I made some changes in diropn.f90 in espresso4.0/PW and compiled
MV> my own version of espresso (with this I get the same error) to print
MV> the values below in the case of the big run, honestly I do not really
MV> know much about this cluster, but I'm sure I'm using compiler xl
MV> fortran version 11.1.0.3 and library essl 4.2.0.3.

that is fine.

MV>
MV> recl: 415578000
MV> DIRECT_IO_FACTOR: 8
MV> unf_recl: -970343296

bingo! this is your problem. 8x415578000 is larger than 2^31,
so unf_recl defined as integer*4 will overflow.

MV> On my home cluster, I used a parallelized espresso-4.0.3 on system
MV> "Intel Xeon E5410 @ 2.33Ghz, 16 GB RAM" with ifort 10.1.015, intel mkl
MV> libraries 10.0.1.014 and openmpi-1.2.6 and with a smaller but similar
MV> system (same pseudos, same cutoff, only gamma point), as I said there
MV> is no "wrong record length" error and I got the following values:
MV>
MV> recl: 97079200
MV> DIRECT_IO_FACTOR: 8
MV> unf_recl: 776633600
MV>
MV> If I'm right... 415578000*8 = 3324624000 which is bigger than the
MV> largest value of a signed 32 bit integer, maybe that causes the
MV> problem?

exactly.

the interesting question is now, how to work around this problem.
you could try and declare unf_recl as integer*8 and try to recompile.
perhaps, even just removing the test for negative unf_recl might work,
but i doubt it.

good luck,
axel.

MV> Thanks for your help,
MV> Marton
MV>
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
Marci
2009-03-02 20:05:14 UTC
Permalink
Hi Axel,

Post by Axel Kohlmeyer
marton,
are you trying to run the postprocessing on your local
machine or on the IBM machine?

On the IBM machine. I had bad experiences with postprocessing on a
different machine because of using the iotk package: converting binary
files to text files and back is quite time consuming... (and I hate
ssh-ing gigabytes of files)

Post by Axel Kohlmeyer
that depends on what is causing this. it could just be that you
have an integer overflow, due to the size of your system, or it
could be that you try to read unformatted data on a different
endian machine. i would suggest you insert a print statement into
the code that prints out the values of DIRECT_IO_FACTOR and recl
as well as unf_recl and then get back to us with the information
about the architectures and these numbers (ideally also for the
smaller test, where it worked).

Unfortunately, the espresso I'm using on BASSI was not compiled by
myself, and now I'm scared of compiling mine because I'm not sure that
it will be able to read the binary that was made with an espresso
probably compiled with different compilers and/or compiler options.
Yeah, I know... I should have compiled my own version of quantum
espresso before making serious calculations to avoid these
situations.

So... I made some changes to diropn.f90 in espresso4.0/PW and compiled
my own version of espresso (with this I get the same error) to print
the values below for the big run. Honestly, I do not really know much
about this cluster, but I'm sure I'm using the xl fortran compiler,
version 11.1.0.3, and the essl library, version 4.2.0.3.

recl: 415578000
DIRECT_IO_FACTOR: 8
unf_recl: -970343296

On my home cluster, I used a parallelized espresso-4.0.3 on system
"Intel Xeon E5410 @ 2.33Ghz, 16 GB RAM" with ifort 10.1.015, intel mkl
libraries 10.0.1.014 and openmpi-1.2.6 and with a smaller but similar
system (same pseudos, same cutoff, only gamma point), as I said there
is no "wrong record length" error and I got the following values:

recl: 97079200
DIRECT_IO_FACTOR: 8
unf_recl: 776633600

If I'm right... 415578000*8 = 3324624000 which is bigger than the
largest value of a signed 32 bit integer, maybe that causes the
problem?

Thanks for your help,
Marton
Axel Kohlmeyer
2009-03-02 17:18:15 UTC
Permalink
On Mon, 2 Mar 2009, Marci wrote:

MV> Dear All,
MV>
MV> I faced a strange error. I've made a parallel scf calculation with a
MV> quite big system and after that I've tried to use a postprocess program
MV> called epsilon.x. The first calculation went well, however epsilon.x
MV> crashed saying:
MV>
MV> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MV> task # 0
MV> from diropn : error # 3
MV> wrong record length
MV> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
MV>
MV> The input file seems OK, I used wfcollect = .true. in the scf
MV> calculation and set the same prefix and outdir directory in the
MV> postprocess input file. I searched the archive and the source code of
MV> espresso and it seems to me that this problem is related to this
MV> specific cluster ( http://www.nersc.gov/nusers/systems/bassi ) where I
MV> use quantum-espresso. The source code (eg. /PW/diropn.f90 ) says:

marton,

are you trying to run the postprocessing on your local
machine or on the IBM machine?

MV> !
MV> ! the unit for record length is unfortunately machine-dependent
MV> !
MV> unf_recl = DIRECT_IO_FACTOR * recl
MV> if (unf_recl <= 0) call errore ('diropn', 'wrong record length', 3)
MV>
MV> Maybe my problem is related to the value of DIRECT_IO_FACTOR? I forgot
MV> to mention, I'm using espresso4.0. My question is, why does this
MV> happen only with using a postprocess code and how can I cure it?

that depends on what is causing this. it could just be that you
have an integer overflow, due to the size of your system, or it
could be that you are trying to read unformatted data on a machine
with a different endianness. i would suggest you insert a print
statement into the code that prints out the values of DIRECT_IO_FACTOR
and recl as well as unf_recl and then get back to us with the
information about the architectures and these numbers (ideally also
for the smaller test, where it worked).
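
The kind of diagnostic print meant here could look like the sketch below. It is a standalone program, not the actual diropn routine: DIRECT_IO_FACTOR is replaced by a placeholder constant, and recl is set by hand to the value reported for the smaller run in this thread. In the real code the two print lines would go right after unf_recl is computed.

```fortran
program diropn_debug_sketch
  implicit none
  ! Placeholder for the machine-dependent constant used in the source.
  integer, parameter :: DIRECT_IO_FACTOR = 8
  integer :: recl, unf_recl

  recl = 97079200                 ! value reported for the smaller run

  ! Same expression as in the source snippet quoted above.
  unf_recl = DIRECT_IO_FACTOR * recl

  ! Print all three numbers before the sanity check, so that they are
  ! visible even when the check fails.
  write (*, '(a, i12)') ' recl:             ', recl
  write (*, '(a, i12)') ' DIRECT_IO_FACTOR: ', DIRECT_IO_FACTOR
  write (*, '(a, i12)') ' unf_recl:         ', unf_recl

  if (unf_recl <= 0) print *, 'wrong record length'
end program diropn_debug_sketch
```

With these inputs the product is 776633600, which still fits into a 32-bit integer, so the check passes.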

thanks,
axel.


MV> On top of this, I succeeded in running epsilon.x on a similar but
MV> smaller system at my home cluster but it says USPPs are not
MV> implemented. As I would like to perform only a 'jdos' calculation,
MV> which doesn't need wavefunctions at all, what if I simply comment out
MV> the corresponding part of epsilon.f90?
MV>
MV> If someone could help me I can send the input files in private.
MV>
MV> Thanks in advance.
MV> Yours,
MV> Marton
MV>
MV> -----------------------------------------------------
MV> M\'arton V\"or\"os, physicist student
MV> Department of Atomic Physics
MV> Budapest University of Technology and Economics (BUTE)
MV> Budafoki ut 8., H-1111, Budapest, Hungary
MV> vormar_at_gmail_dot_com, vm776_at_hszk_dot_bme_dot_hu
MV> -----------------------------------------------------
MV> _______________________________________________
MV> Pw_forum mailing list
MV> Pw_forum at pwscf.org
MV> http://www.democritos.it/mailman/listinfo/pw_forum
MV>