
How do I "update" an existing named entity recognition model, rather than creating one from scratch?



Sorry it took me a while to put together a decent code example. The code below reads in your sentences and uses the default en-ner-person model to do the best it can. It then writes those results to a file of good hits and a file of bad hits. Those files are then fed into the "modelbuilder-addon" call at the bottom.

For best results, run the class as-is, then go into the known-entities file and the blacklist file and add and remove names. In other words, put names it doesn't find at all, but that you know are there, into the known-entities file, and delete bad names from it. Delete good names from the blacklist file and add them to the known file. Then run the model-builder part again on its own, without the first part that reads in all the data. It's fine to have duplicates in the known and blacklist files. Let me know if you have any questions... it's a bit convoluted.
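The manual curation step above (moving a good name out of the blacklist into the known file, and de-duplicating both) can also be scripted. This is a hypothetical helper, not part of the modelbuilder-addon; the file names just follow the example paths used later in the answer:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class EntityFileCurator {

  // Remove duplicate lines while keeping first-seen order.
  static List<String> dedupe(List<String> lines) {
    return List.copyOf(new LinkedHashSet<>(lines));
  }

  // Promote one good name out of the blacklist and into the known-entities file.
  static void promote(Path known, Path blacklist, String name) throws IOException {
    Set<String> knowns = new LinkedHashSet<>(Files.readAllLines(known));
    Set<String> bad = new LinkedHashSet<>(Files.readAllLines(blacklist));
    bad.remove(name);
    knowns.add(name);
    Files.write(known, knowns);
    Files.write(blacklist, bad);
  }

  public static void main(String[] args) throws IOException {
    Path known = Files.createTempFile("knownentities", ".txt");
    Path bl = Files.createTempFile("blentities", ".txt");
    Files.write(known, List.of("John Smith"));
    Files.write(bl, List.of("Jane Doe", "the Thing"));
    promote(known, bl, "Jane Doe"); // Jane Doe was actually a good hit
    System.out.println(Files.readAllLines(known)); // prints [John Smith, Jane Doe]
  }
}
```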

import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;
import opennlp.tools.entitylinker.EntitylinkerProperties;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class ModelBuilderAddonUse {

  // Fill this method in with however you are going to get your data into a
  // list of sentences... for me I am hitting a MySQL database.
  private static List<String> getSentencesFromSomewhere() throws Exception {
    List<String> sentences = new ArrayList<>();
    int counter = 0;
    DocProvider dp = new DocProvider();
    String modelPath = "c:\\apache\\entitylinker\\";
    EntitylinkerProperties properties =
        new EntitylinkerProperties(new File(modelPath + "entitylinker.properties"));
    Map<Long, List<String>> docs = dp.getDocs(properties);
    for (Long key : docs.keySet()) {
      counter++;
      System.out.println("\t\tDOC: " + key + "\n\n");
      sentences.addAll(docs.get(key));
      counter++;
      if (counter > 1000) {
        break;
      }
    }
    return sentences;
  }

  public static void main(String[] args) throws Exception {
    File sentences = new File("C:\\temp\\modelbuilder\\sentences.text");
    File knownEntities = new File("C:\\temp\\modelbuilder\\knownentities.txt");
    File blacklistedentities = new File("C:\\temp\\modelbuilder\\blentities.txt");
    File annotatedSentences = new File("C:\\temp\\modelbuilder\\annotatedSentences.txt");
    File theModel = new File("C:\\temp\\modelbuilder\\theModel");

    // Create a bunch of file writers to write your results and sentences to a file.
    FileWriter sentenceWriter = new FileWriter(sentences, true);
    FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
    FileWriter knownEntityWriter = new FileWriter(knownEntities, true);

    // Set some thresholds to decide where to write hits; you don't have to use these at all.
    double keeperThresh = .95;
    double blacklistThresh = .7;

    TokenNameFinderModel personModel =
        new TokenNameFinderModel(new File("c:\\temp\\opennlpmodels\\en-ner-person.zip"));
    NameFinderME personFinder = new NameFinderME(personModel);

    for (String s : getSentencesFromSomewhere()) {
      sentenceWriter.write(s.trim() + "\n");
      sentenceWriter.flush();
      String[] tokens = s.split(" "); // better to use a tokenizer really
      Span[] find = personFinder.find(tokens);
      double[] probs = personFinder.probs();
      String[] names = Span.spansToStrings(find, tokens);
      for (int i = 0; i < names.length; i++) {
        // YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE
        // YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
        if (probs[i] > keeperThresh) {
          knownEntityWriter.write(names[i].trim() + "\n");
        }
        if (probs[i] < blacklistThresh) {
          blacklistWriter.write(names[i].trim() + "\n");
        }
      }
      personFinder.clearAdaptiveData();
      blacklistWriter.flush();
      knownEntityWriter.flush();
    }

    // Flush and close all the writers.
    knownEntityWriter.flush();
    knownEntityWriter.close();
    sentenceWriter.flush();
    sentenceWriter.close();
    blacklistWriter.flush();
    blacklistWriter.close();

    DefaultModelBuilderUtil.generateModel(sentences, knownEntities,
        blacklistedentities, theModel, annotatedSentences, "person", 3);
  }
}
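As the comment in the loop notes, `s.split(" ")` is a poor tokenizer: it leaves punctuation glued to words ("Smith." is not "Smith"), which hurts the name finder. OpenNLP's own tokenizers are the proper fix; as a dependency-free illustration of the idea, a regex that captures word runs and punctuation as separate tokens already does much better:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespacePunctTokenizer {

  // A word run, or any single non-word, non-space character (punctuation).
  private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

  static String[] tokenize(String sentence) {
    List<String> tokens = new ArrayList<>();
    Matcher m = TOKEN.matcher(sentence);
    while (m.find()) {
      tokens.add(m.group());
    }
    return tokens.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // "Smith." becomes two tokens instead of one.
    System.out.println(String.join("|", tokenize("I met John Smith.")));
    // prints I|met|John|Smith|.
  }
}
```

Swapping this (or an OpenNLP tokenizer) in for `s.split(" ")` is a drop-in change in the loop above.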

This is what the console looks like (I removed several lines for brevity):

ITERATION: 0
    Perfoming Known Entity Annotation
        knowns: 625
        reading data....
        writing annotated sentences....
        building model....
    Building Model using 7343 annotations
        reading training data...
Indexing events using cutoff of 5
    Computing event counts...  done. 561755 events
    Indexing...  done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...  done.
    Number of Event Tokens: 127362
        Number of Outcomes: 3
      Number of Predicates: 106490
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-617150.9462211537  0.015709695507828147
  2:  ... loglikelihood=-90520.86903515142  0.9771288195031642
  3:  ... loglikelihood=-56901.86905339755  0.9771288195031642
  [... iterations trimmed ...]
 99:  ... loglikelihood=-13848.35265657199  0.9894954206015077
100:  ... loglikelihood=-13829.676824889664 0.9894972007369761
    model generated
    model building complete....
    annotated sentences: 7343
    Performing NER with new model
    Printing NER Results. Add undesired results to the blacklist file and start over
//prints some names
    annotated sentences: 7369
        knowns: 651
ITERATION: 1
    Perfoming Known Entity Annotation
        knowns: 651
        reading data....
        writing annotated sentences....
        building model....
    Building Model using 20370 annotations
        reading training data...
Indexing events using cutoff of 5
    Computing event counts...  done. 1116781 events
    Indexing...  done.
Sorting and merging events... done. Reduced 1116781 events to 288251.
Done indexing.
Incorporating indexed data for training...  done.
    Number of Event Tokens: 288251
        Number of Outcomes: 3
      Number of Predicates: 206399
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-1226909.3303549637 0.03418485808766446
  2:  ... loglikelihood=-196688.7107544095  0.9622047653031346
  [... iterations trimmed ...]
 99:  ... loglikelihood=-49446.785601155134 0.9808234559864467
100:  ... loglikelihood=-49400.477772387036 0.9808359920163399
    model generated
    model building complete....
    annotated sentences: 20370
    Performing NER with new model

It will do this for each iteration until you see:

 97:  ... loglikelihood=-49140.50129715517  0.9808462362240823
 98:  ... loglikelihood=-49095.42289306763  0.9808641444693966
 99:  ... loglikelihood=-49051.095083380205 0.9808713077675223
100:  ... loglikelihood=-49007.49834809576  0.9808748894165852
    model generated

You can change the number of iterations. When you see the annotated sentences stop changing, and the knowns stop changing on subsequent runs as you refine the lists, you're done.
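That stopping rule can be automated: snapshot the known-entities file before each run and stop once it no longer changes. This is an illustrative sketch under that assumption, not part of the modelbuilder-addon API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class ConvergenceCheck {

  // Converged when the set of known entities is identical across runs.
  static boolean hasConverged(Set<String> before, Set<String> after) {
    return before.equals(after);
  }

  // Snapshot the known-entities file as a set of names.
  static Set<String> snapshot(Path knownEntities) throws IOException {
    return new HashSet<>(Files.readAllLines(knownEntities));
  }

  public static void main(String[] args) {
    Set<String> prev = Set.of("John Smith", "Jane Doe");
    Set<String> curr = Set.of("John Smith", "Jane Doe");
    System.out.println(hasConverged(prev, curr)); // prints true
  }
}
```

Compare `snapshot(...)` before and after each model-builder run and stop looping once `hasConverged` returns true.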

HTH


