
CSCI-UA.0480-057

Homework Number 5


1. You will write a Noun Group tagger, using data similar to the data you used for Homework 3. For this program, however, we will focus more on feature selection than on the algorithm.

1. Download WSJ_CHUNKFILES.zip from NYUClasses (Resources). It includes the following data files:

WSJ_02-21.pos-chunk -- the training file

WSJ_24.pos -- the development file that you will test your system on

WSJ_24.pos-chunk -- the answer key to test your system output against

WSJ_23.pos -- the test file, to run your final system on, producing system output

2. Download MAX_ENT_files.zip, also from NYUClasses (Resources). It includes the following program files (using the OpenNLP package):

maxent-3.0.0.jar, MEtag.java, MEtrain.java and trove.jar -- Java files for running the maxent training and classification programs

score.chunk.py -- a Python scoring script

3. Create a program that takes a file like WSJ_02-21.pos-chunk as input and produces a file of feature-value pairs for use with the maxent trainer and classifier. This step is the bulk of the assignment, so more details, including the format, are given below. The program should create two output files: from the training corpus (WSJ_02-21.pos-chunk), a training feature file (training.feature); from the development corpus (WSJ_24.pos), a test feature file (test.feature).

4. Compile and run MEtrain.java, giving it the feature-enhanced training file as input; it will produce a MaxEnt model. MEtrain and MEtag use the maxent and trove packages, so you must include the corresponding jar files, maxent-3.0.0.jar and trove.jar, on the classpath when you compile and run. Assuming all Java files are in the same directory, the following commands will compile and run these programs. The commands differ slightly between POSIX systems (Linux or Apple) and Microsoft Windows.

1. For Linux, Apple and other POSIX systems, do:

javac -cp maxent-3.0.0.jar:trove.jar *.java ### compiling

java -cp .:maxent-3.0.0.jar:trove.jar MEtrain training.feature model.chunk ### creating the model of the training data

java -cp .:maxent-3.0.0.jar:trove.jar MEtag test.feature model.chunk response.chunk ### creating the system output

2. For Windows only: use semicolons instead of colons in the classpath of each of the above commands, i.e., the commands for Windows would be:

javac -cp maxent-3.0.0.jar;trove.jar *.java ### compiling

java -cp .;maxent-3.0.0.jar;trove.jar MEtrain training.feature model.chunk ### creating the model of the training data

java -cp .;maxent-3.0.0.jar;trove.jar MEtag test.feature model.chunk response.chunk ### creating the system output

3. Quick Fixes

If the system is running out of memory, you can specify how much RAM Java may use. For example, java -Xmx16g -cp ... allows Java to use up to 16 gigabytes of RAM for its heap.

If your system cannot find Java files or packages and does not run for that reason, the easiest fix is to run the Java steps on one of NYU's Linux servers. Accounts can be made available to all students in this class. Alternatively, you can make sure that all path variables are set properly, that Java is properly installed, etc.

5. Score your results with the Python script as follows:

python score.chunk.py WSJ_24.pos-chunk response.chunk ### WSJ_24.pos-chunk is the answer key and response.chunk is your output file

6. When you are done creating your system, create a test.feature file from the test corpus (WSJ_23.pos) and execute step 1.4.1.3 (or 1.4.2.3) to create your final response file (WSJ_23.chunk). Your submission to GradeScope must contain the file WSJ_23.chunk or it will not be able to grade your work.

7. This pipeline is set up so that you can write the code for producing the feature files in any programming language you wish. You may also use any other MaxEnt package you like, provided that the scoring script works on your output.

2. As mentioned in section 1.3, you are primarily responsible for a program that creates sets of features for the Maximum Entropy system.

1. Format Information:

There should be exactly one line of features for each line in the input file (training or test)

If the input and feature files have different numbers of lines, you have a bug

Blank lines in the input file should correspond to blank lines in your feature file

Each line corresponding to text should contain tab-separated values as follows:

the first field should be the token (word, punctuation, etc.)

this should be followed by as many features as you want (but no feature should contain white space). We recommend that features have the form attribute=value, e.g., POS=NN

This makes the features easy for humans to understand, but is not actually required by the program, e.g., the code does not look for the = sign.

for the training file only, the last field should be the BIO tag (B-NP, I-NP or O)

for the test file, there should be no final BIO field (as there is none in the .pos file that you would be generating features from)

A sample training file line (where \t represents tab): fish\tPOS=NN\tprevious_POS=DT\tprevious_word=the\tI-NP ## actual lines will probably be longer

There is a special symbol '@@' that you can use to refer to the previous BIO tag, e.g., Previous_BIO=@@

This allows you to simulate a (bigram) MEMM because you can refer to the previous BIO tag
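The format rules above can be sketched in Python as follows. This is only an illustration, not part of the provided package: the function names format_line and convert are hypothetical, and the sketch assumes each non-blank input line holds whitespace-separated columns (token, POS, and, in the training file, the BIO tag). It emits a single POS feature per token; a real system would add many more.

```python
def format_line(token, features, bio_tag=None):
    """Join the token, its features and (training only) the BIO tag with tabs."""
    fields = [token]
    for name, value in features:
        field = f"{name}={value}"
        # The handout requires that no feature contain white space.
        if any(ch.isspace() for ch in field):
            raise ValueError(f"feature contains white space: {field!r}")
        fields.append(field)
    if bio_tag is not None:
        fields.append(bio_tag)
    return "\t".join(fields)


def convert(input_lines, training):
    """Return one output line per input line; blank lines stay blank.

    Assumption: columns are token, POS and (if training) BIO tag.
    """
    output = []
    for line in input_lines:
        parts = line.split()
        if not parts:
            output.append("")  # keep sentence boundaries aligned
            continue
        token, pos = parts[0], parts[1]
        bio_tag = parts[2] if training else None
        output.append(format_line(token, [("POS", pos)], bio_tag))
    # Same number of lines in and out, as the handout requires.
    assert len(output) == len(input_lines)
    return output
```

For example, format_line("fish", [("POS", "NN"), ("previous_POS", "DT"), ("previous_word", "the")], "I-NP") reproduces the sample training line shown above.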

2. Suggested features:

Features of the word itself: POS, the word itself, stemmed version of the word

Similar features of previous and/or following words (suggestion: use the features of previous word, 2 words back, following word, 2 words forward)

Beginning/Ending Sentence (at the beginning of the sentence, omit features of 1 and 2 words back; at end of sentence, omit features of 1 and 2 words forward)

capitalization, features of the sentence, your own special dictionary, etc.
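A minimal sketch of some of these suggestions, assuming a sentence is held as a list of (word, POS) pairs. The function name token_features and the feature names (prev_word, next2_POS, and so on) are hypothetical choices for illustration. Stemming is left out because it would need an external stemmer.

```python
def token_features(sentence, i):
    """Build attribute=value features for the token at position i.

    sentence is a list of (word, pos) pairs for one sentence.
    """
    word, pos = sentence[i]
    features = [
        f"POS={pos}",
        f"word={word.lower()}",
        f"capitalized={word[:1].isupper()}",
    ]
    # Previous word, 2 words back, following word, 2 words forward.
    for offset, label in ((-1, "prev"), (-2, "prev2"), (1, "next"), (2, "next2")):
        j = i + offset
        # Near a sentence boundary the neighbor does not exist,
        # so its features are simply omitted.
        if 0 <= j < len(sentence):
            neighbor_word, neighbor_pos = sentence[j]
            features.append(f"{label}_word={neighbor_word.lower()}")
            features.append(f"{label}_POS={neighbor_pos}")
    return features
```

Joining the token and the returned list with tabs (plus the BIO tag for training data) gives a line in the format described in the Format Information section.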

3. When you have completed the assignment, submit the following in a zip file through GradeScope (link to be added):

Your code

A short write-up describing the features you tried and your score on the development corpus.

Your output file from the test corpus, i.e., WSJ_23.chunk. Your submission to GradeScope must contain the file WSJ_23.chunk or it will not be able to grade your work.

4. Understanding the scoring:

Accuracy = (correct BIO tags)/Total BIO Tags

Precision, Recall and F-measure measure Noun Group performance: a noun group is correct if it appears, with the same boundaries, in both the system output and the answer key.

Precision = Correct/System_Output

Recall = Correct/Answer_key

F-measure = Harmonic mean of Precision and Recall
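The definitions above can be illustrated with a short Python sketch. This is not the provided score.chunk.py, whose internals are not described here; the names noun_groups and score are hypothetical, and the sketch assumes both tag sequences are aligned token by token and that a stray I-NP with no preceding B-NP starts a new group.

```python
def noun_groups(tags):
    """Return the set of (start, end) spans, end exclusive, of noun groups."""
    groups = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "I-NP" and start is not None:
            continue  # the current group continues
        if start is not None:
            groups.add((start, i))  # the current group ends before i
            start = None
        if tag in ("B-NP", "I-NP"):
            start = i  # assumption: a stray I-NP also opens a group
    if start is not None:
        groups.add((start, len(tags)))
    return groups


def score(key_tags, response_tags):
    """Return (accuracy, precision, recall, f_measure) as fractions."""
    if not key_tags or len(key_tags) != len(response_tags):
        raise ValueError("tag sequences must be non-empty and equally long")
    matches = sum(k == r for k, r in zip(key_tags, response_tags))
    accuracy = matches / len(key_tags)
    key_groups = noun_groups(key_tags)
    response_groups = noun_groups(response_tags)
    correct = len(key_groups & response_groups)
    precision = correct / len(response_groups) if response_groups else 0.0
    recall = correct / len(key_groups) if key_groups else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f_measure
```

With the key B-NP, I-NP, O, B-NP and the response B-NP, I-NP, O, O, three of four tags match (accuracy 0.75). The response proposes one group and it is correct (precision 1.0), but the key has two groups (recall 0.5), so the F-measure is 2/3.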

5. Evaluation: A simple system should achieve about 90% F-measure. It is possible, but difficult, to obtain 95-96%. Your grade will be based on the following:

Your system's score

The innovativeness of your features

Any interesting analysis that you do
