Linear-Chain CRF@GPU is the first parallel implementation of Linear-Chain Conditional Random Fields (CRFs) for segmenting and labeling sequential data that runs on GPUs. Linear-Chain CRFs can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking. The implementation relies on highly parallel algorithms written in NVIDIA's CUDA C/C++.
Both the training file and the test file need to be in a particular format. Each row must either be empty or consist of tab-separated tokens. Depending on its position, every token represents either a hidden node realisation (y) or an observed node realisation (x). The first token in each row is the hidden node realisation (the label). Each following token is an observed node realisation (an attribute). A row may contain as many attributes as you like. An empty row marks the end of a sequence.
In general, the data consists of tab-separated realisations:
HIDDEN_LABEL1 OBSERVED_REALISATION11 OBSERVED_REALISATION12 OBSERVED_REALISATION13
HIDDEN_LABEL2 OBSERVED_REALISATION21
HIDDEN_LABEL3 OBSERVED_REALISATION31 OBSERVED_REALISATION32 OBSERVED_REALISATION33 OBSERVED_REALISATION43
HIDDEN_LABEL4 OBSERVED_REALISATION41 OBSERVED_REALISATION42

HIDDEN_LABEL1 OBSERVED_REALISATION11
HIDDEN_LABEL2 OBSERVED_REALISATION21
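The format described above can be read with a few lines of code. The following is a minimal sketch in Python, not part of the lcrfcuda distribution; the function name `parse_sequences` and the `Row` type are hypothetical and chosen only for illustration. It assumes that tokens are separated by single tab characters and that a row containing only whitespace counts as empty.

```python
from typing import Iterable, List, Tuple

# One row of the data file: (hidden label, list of observed attributes).
Row = Tuple[str, List[str]]


def parse_sequences(lines: Iterable[str]) -> List[List[Row]]:
    """Split tab-separated rows into sequences; an empty row ends a sequence."""
    sequences: List[List[Row]] = []
    current: List[Row] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Empty row: close the current sequence, if there is one.
            if current:
                sequences.append(current)
                current = []
            continue
        # First token is the label, all following tokens are attributes.
        label, *attributes = line.split("\t")
        current.append((label, attributes))
    if current:
        # The last sequence may not be followed by an empty row.
        sequences.append(current)
    return sequences


sample = "A\tx1\tx2\nB\tx3\n\nA\tx4\n"
sequences = parse_sequences(sample.splitlines())
print(len(sequences), "sequences")
```

To read a real file, pass an open file object instead of the sample lines, for example `parse_sequences(open("train_file", encoding="utf-8"))`.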
In case of the CoNLL2000 shared task, the data would look like this:
B-NP X00= X01= X02=Rockwell X03=International X04=Corp. X05=/Rockwell X06=Rockwell/International ..
I-NP X00= X01=Rockwell X02=International X03=Corp. X04='s X05=Rockwell/International ..
I-NP X00=Rockwell X01=International X02=Corp. X03='s X04=Tulsa X05=International/Corp. ..
B-NP X00=International X01=Corp. X02='s X03=Tulsa X04=unit X05=Corp./'s X06='s/Tulsa ..
I-NP X00=Corp. X01='s X02=Tulsa X03=unit X04=said X05='s/Tulsa X06=Tulsa/unit X10=NNP ..
I-NP X00='s X01=Tulsa X02=unit X03=said X04=it X05=Tulsa/unit X06=unit/said X10=POS X11=NNP ..
B-VP X00=Tulsa X01=unit X02=said X03=it X04=signed X05=unit/said X06=said/it X10=NNP ..
B-NP X00=unit X01=said X02=it X03=signed X04=a X05=said/it X06=it/signed X10=NN X11=VBD ..
B-VP X00=said X01=it X02=signed X03=a X04=tentative X05=it/signed X06=signed/a X10=VBD X11=PRP ..
B-NP X00=it X01=signed X02=a X03=tentative X04=agreement X05=signed/a X06=a/tentative X10=PRP X11=VBD ..
I-NP X00=signed X01=a X02=tentative X03=agreement X04=extending X05=a/tentative X06=tentative/agreement ..
I-NP X00=a X01=tentative X02=agreement X03=extending X04=its X05=tentative/agreement X06=agreement/extending ..
B-VP X00=tentative X01=agreement X02=extending X03=its X04=contract X05=agreement/extending X06=extending/its ..
B-NP X00=agreement X01=extending X02=its X03=contract X04=with X05=extending/its X06=its/contract ..
I-NP X00=extending X01=its X02=contract X03=with X04=Boeing X05=its/contract X06=contract/with ..
B-PP X00=its X01=contract X02=with X03=Boeing X04=Co. X05=contract/with X06=with/Boeing X10=PRP$ ..
B-NP X00=contract X01=with X02=Boeing X03=Co. X04=to X05=with/Boeing X06=Boeing/Co. X10=NN X11=IN X12=NNP ..
I-NP X00=with X01=Boeing X02=Co. X03=to X04=provide X05=Boeing/Co. X06=Co./to X10=IN X11=NNP X12=NNP ..
B-VP X00=Boeing X01=Co. X02=to X03=provide X04=structural X05=Co./to X06=to/provide X10=NNP X11=NNP X12=TO ..
I-VP X00=Co. X01=to X02=provide X03=structural X04=parts X05=to/provide X06=provide/structural X10=NNP X11=TO ..
B-NP X00=to X01=provide X02=structural X03=parts X04=for X05=provide/structural X06=structural/parts X10=TO ..
I-NP X00=provide X01=structural X02=parts X03=for X04=Boeing X05=structural/parts X06=parts/for X10=VB ..
B-PP X00=structural X01=parts X02=for X03=Boeing X04='s X05=parts/for X06=for/Boeing X10=JJ X11=NNS X12=IN ..
B-NP X00=parts X01=for X02=Boeing X03='s X04=747 X05=for/Boeing X06=Boeing/'s X10=NNS X11=IN X12=NNP X13=POS ..
B-NP X00=for X01=Boeing X02='s X03=747 X04=jetliners X05=Boeing/'s X06='s/747 X10=IN X11=NNP X12=POS X13=CD ..
I-NP X00=Boeing X01='s X02=747 X03=jetliners X04=. X05='s/747 X06=747/jetliners X10=NNP X11=POS X12=CD ..
I-NP X00='s X01=747 X02=jetliners X03=. X04= X05=747/jetliners X06=jetliners/. X10=POS X11=CD X12=NNS X13=. ..
O X00=747 X01=jetliners X02=. X03= X04= X05=jetliners/. X06=./ X10=CD X11=NNS X12=. X13= X14= ..

B-NP X00= X01= X02=Rockwell X03=said X04=the X05=/Rockwell X06=Rockwell/said X10= X11= ..
B-VP X00= X01=Rockwell X02=said X03=the X04=agreement X05=Rockwell/said X06=said/the X10= ..
..
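The word-based attributes in the sample above appear to follow a simple window pattern: X00 to X04 hold the words at offsets -2 to +2 around the current position, X05 holds the previous/current word pair, and X06 holds the current/next word pair. The sketch below reproduces that pattern in Python. It is inferred from the sample rows only and is not taken from the lcrfcuda sources; the function name `window_attributes` is hypothetical. The X10 and later attributes, which seem to be built from part-of-speech tags, are truncated in the sample and are therefore not reproduced.

```python
from typing import List


def window_attributes(tokens: List[str], i: int) -> List[str]:
    """Build X00..X06 for position i, following the pattern seen in the sample rows."""

    def word(j: int) -> str:
        # Positions outside the sequence yield an empty value, as in "X00=".
        return tokens[j] if 0 <= j < len(tokens) else ""

    # X00..X04: words at offsets -2..+2 around position i.
    attrs = [f"X0{k}={word(i + k - 2)}" for k in range(5)]
    # X05 and X06: previous/current and current/next word pairs.
    attrs.append(f"X05={word(i - 1)}/{word(i)}")
    attrs.append(f"X06={word(i)}/{word(i + 1)}")
    return attrs


tokens = ["Rockwell", "International", "Corp.", "'s", "Tulsa", "unit"]
# One tab-separated row: the label first, then the attributes.
first_row = "\t".join(["B-NP"] + window_attributes(tokens, 0))
print(first_row)
```

Any other attribute scheme works equally well, since the file format only requires a label followed by an arbitrary number of attribute tokens.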
To train a model, use the lcrfcuda command:
% cat train_file | lcrfcuda -M model_file
where train_file contains the training data.
The trained model is stored in the file model_file.
lcrfcuda outputs the following information.
LINEAR-CHAIN CRF@CUDA, R1
Copyright (C) 2011 Nico Piatkowski, All rights reserved.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
DEVICE Tesla C2050 / C2070 (Device 0)
BATCHSIZE 96
ETA 0.1
INITIAL-WEIGHT 0.05
TRAINING 8936
TESTING 8936/8936
ACCURACY 96.2962
WRITING MODEL DONE
LABELS 22
ATTRIBUTES 338547
PARAMETERS 1039407
TIME 21s
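When training runs are scripted, the numeric fields of this output can be collected automatically. The following Python sketch assumes only that each field name is followed by whitespace and a number, as in the sample output above; the exact layout of the log is not documented here, and the function name `parse_training_log` is hypothetical.

```python
import re
from typing import Dict

# Field names taken from the sample output; other versions may print different ones.
FIELDS = ["BATCHSIZE", "ETA", "INITIAL-WEIGHT", "ACCURACY",
          "LABELS", "ATTRIBUTES", "PARAMETERS"]


def parse_training_log(text: str) -> Dict[str, float]:
    """Extract 'NAME number' pairs from the log, regardless of line breaks."""
    result: Dict[str, float] = {}
    for name in FIELDS:
        match = re.search(rf"\b{re.escape(name)}\s+([0-9]+(?:\.[0-9]+)?)", text)
        if match:
            result[name] = float(match.group(1))
    return result


log = ("DEVICE Tesla C2050 / C2070 (Device 0) BATCHSIZE 96 ETA 0.1 "
       "INITIAL-WEIGHT 0.05 TRAINING 8936 TESTING 8936/8936 ACCURACY 96.2962 "
       "WRITING MODEL DONE LABELS 22 ATTRIBUTES 338547 PARAMETERS 1039407 TIME 21s")
stats = parse_training_log(log)
print(stats["ACCURACY"])
```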
There are three major parameters to control the training
To label a test set with a previously trained model, use the lcrfcuda command:
% lcrfcuda -M model_file -T test_file --predict
where model_file is a previously learned model and
test_file contains the test set.
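Both invocations can also be driven from a script. The sketch below wraps them with Python's `subprocess` module and uses only the options shown in this document (`-M`, `-T` and `--predict`). The helper names are hypothetical, the sketch assumes that `lcrfcuda` is on the PATH, and it makes no assumption about the format of the prediction output, which is returned as raw text.

```python
import subprocess
from typing import List


def train_command(model_file: str) -> List[str]:
    # Training reads the data from standard input, as in "cat train_file | lcrfcuda -M model_file".
    return ["lcrfcuda", "-M", model_file]


def predict_command(model_file: str, test_file: str) -> List[str]:
    return ["lcrfcuda", "-M", model_file, "-T", test_file, "--predict"]


def train(train_file: str, model_file: str) -> str:
    """Run training with train_file on standard input and return the console output."""
    with open(train_file, "rb") as data:
        done = subprocess.run(train_command(model_file), stdin=data,
                              capture_output=True, check=True)
    return done.stdout.decode()


def predict(model_file: str, test_file: str) -> str:
    """Run prediction and return whatever lcrfcuda prints."""
    done = subprocess.run(predict_command(model_file, test_file),
                          capture_output=True, check=True)
    return done.stdout.decode()
```

A typical session would call `train("train_file", "model_file")` followed by `predict("model_file", "test_file")`; `check=True` raises an exception if lcrfcuda exits with a non-zero status.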
There are three major parameters to control the testing