Skip to content

Performance issues on processing huge collections -> revise multithreading implementation #45

@proycon

Description

@proycon

Processing of huge amounts of pre-tokenised FoLiA documents (for Nederlab) goes unexpectedly slow, despite disabling various modules (--skip=mcpa). In about 24 hours, about 90 documents have been processed.

Frog is called on a directory as follows (to eliminate initialisation overhead):

frog --skip=mcpa -override tokenizer.rulesFile=tokconfig-nld-historical --xmldir "." --threads 40 --testdir input/ -x

Log excerpt of a single document (rarity:/scratch/proycon/morr001cryp01_01.tok.folia.xml) in a long-running batch (days if not weeks):

frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 1 milliseconds and 510 microseconds
frog-:CGN tagging took:   42 seconds, 593 milliseconds and 430 microseconds
frog-:NER took:           30 seconds, 188 milliseconds and 829 microseconds
frog-:Mblem took:         0 seconds, 803 milliseconds and 711 microseconds
frog-:Frogging in total took: 89 seconds, 930 milliseconds and 350 microseconds
frog-:resulting FoLiA doc saved in ./mole002refe01_01_0005.tok.folia.xml
frog-:Frogging input/moll013albu01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 16 milliseconds and 169 microseconds
frog-:CGN tagging took:   263 seconds, 846 milliseconds and 415 microseconds
frog-:NER took:           184 seconds, 152 milliseconds and 171 microseconds
frog-:Mblem took:         5 seconds, 328 milliseconds and 128 microseconds
frog-:Frogging in total took: 559 seconds, 893 milliseconds and 805 microseconds
frog-:resulting FoLiA doc saved in ./moll013albu01_01.tok.folia.xml
frog-:Frogging input/mont003zome01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 47 milliseconds and 859 microseconds
frog-:CGN tagging took:   57 seconds, 762 milliseconds and 530 microseconds
frog-:NER took:           37 seconds, 773 milliseconds and 605 microseconds
frog-:Mblem took:         1 seconds, 304 milliseconds and 201 microseconds
frog-:Frogging in total took: 120 seconds, 269 milliseconds and 869 microseconds
frog-:resulting FoLiA doc saved in ./mont003zome01_01.tok.folia.xml
----------------------------- (my emphasis) ------------------------
frog-:Frogging input/morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 31 milliseconds and 59 microseconds
frog-:CGN tagging took:   322 seconds, 971 milliseconds and 28 microseconds
frog-:NER took:           728 seconds, 839 milliseconds and 11 microseconds
frog-:Mblem took:         20 seconds, 219 milliseconds and 307 microseconds
frog-:Frogging in total took: 1494 seconds, 657 milliseconds and 222 microseconds
-----------------------------------------------------
frog-:resulting FoLiA doc saved in ./morr001cryp01_01.tok.folia.xml
frog-:Frogging input/moul004vowe01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 7 milliseconds and 995 microseconds
frog-:CGN tagging took:   336 seconds, 695 milliseconds and 921 microseconds
frog-:NER took:           111 seconds, 161 milliseconds and 469 microseconds
frog-:Mblem took:         3 seconds, 0 milliseconds and 52 microseconds
frog-:Frogging in total took: 508 seconds, 408 milliseconds and 533 microseconds
frog-:resulting FoLiA doc saved in ./moul004vowe01_01.tok.folia.xml
frog-:Frogging input/mouw001brah01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 37 milliseconds and 979 microseconds
frog-:CGN tagging took:   30 seconds, 709 milliseconds and 748 microseconds
frog-:NER took:           28 seconds, 536 milliseconds and 162 microseconds
frog-:Mblem took:         0 seconds, 966 milliseconds and 700 microseconds
frog-:Frogging in total took: 75 seconds, 622 milliseconds and 890 microseconds
frog-:resulting FoLiA doc saved in ./mouw001brah01_01.tok.folia.xml
frog-:Frogging input/muld014janf01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!

(full log in rarity:/scratch/proycon/frog.log)

Comparison; a standalone run on only the highlighted document (without -nostdout) :

frog-:Frogging morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 17 milliseconds and 448 microseconds
frog-:CGN tagging took:   17 seconds, 26 milliseconds and 152 microseconds
frog-:NER took:           65 seconds, 981 milliseconds and 710 microseconds
frog-:Mblem took:         28 seconds, 174 milliseconds and 786 microseconds
frog-:Frogging in total took: 487 seconds, 310 milliseconds and 584 microseconds

I have some minor suggestions for better debugging:

  • Let frog output a date/timestamp before and after processing (makes it easier to inspect a long logs running for days)
  • Is some normalisation of timings possible (in addition to the absolute numbers), e.g. divided by the total amount of tokens?

Other possibility for testdir:

  • Parallellise document processing instead of modules? (if not too complicated)

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions