

                     Basis of AI Backprop 
                   Code from April 10, 1996 
               Documentation from April 10, 1996 
 
Copyright (c) 1990-96 by Donald R. Tveter 
 
CONTENTS 
-------- 
1. Introduction 
2. Making the Simulators 
3. A Simple Example 
4. Basic Facilities 
5. The Format Command 
6. Taking Training and Testing Patterns from a File 
7. Saving and Restoring Weights 
8. Initializing Weights 
9. The Seed Values 
10. The Algorithm Command 
11. The Delta-Bar-Delta Method 
12. Quickprop 
13. Making a Network 
14. Recurrent Networks 
15. Miscellaneous Commands 
16. Limitations 
17. The Pro Version Additions 
 
1. Introduction 
--------------- 
   This manual describes the free version of my Basis of AI Backprop 
designed to accompany my not yet published (sigh) textbook, _The Basis 
of AI_.  This program contains enough features for students in an 
ordinary AI or Neural Networking course.  More serious users will 
probably need the professional version of this software, see: 
 
http://www.mcs.com/~drt/probp.html 
 
or send me email at: drt@mcs.com.  Other free NN software for the 
textbook is also available at: 
 
http://www.mcs.com/~drt/svbp.html 
 
   For more on backprop see my "Backpropagator's Review" at: 
 
http://www.mcs.com/~drt/bprefs.html 
 
   Notice: this is use at your own risk software.  There is no guarantee 
that it is bug-free.  Use of this software constitutes acceptance for 
use in an as is condition.  There are no warranties with regard to this 
software. In no event shall the author be liable for any damages 
whatsoever arising out of or in connection with the use or performance 
of this software. 
 
   There are four simulators that can be constructed from the included 
files.  The program, bp, does back-propagation using real weights and 
arithmetic.  The program, ibp, does back-propagation using 16-bit 
integer weights, 16 and 32-bit integer arithmetic and some floating 
point arithmetic.  The program, sbp, uses symmetric floating point 
weights and its sole purpose is to produce weights for two-layer 
networks for use with the Hopfield and Boltzmann relaxation algorithms 
(included in another package).  The program sibp does the same using 
16-bit integer weights.  The integer versions are faster on systems 
without floating point hardware.  However, these versions sometimes 
don't have enough range or precision, and then the floating point 
versions are necessary.  DOS binaries are included here for systems with 
floating point hardware.  If you need other versions write me. 
 
2. Making the Simulators 
------------------------ 
   This code has been written to use either 32-bit floating point 
(float) or 64-bit floating point (double) arithmetic.  On System V 
machines the standard seems to be that all floating point arithmetic is 
done with double precision arithmetic so double arithmetic is faster 
than float and therefore this is the default.  Other versions of C (e.g. 
ANSI C) will do single precision real arithmetic which will ordinarily 
be faster on most machines (I think).  To get 32-bit floating point set 
the compiler flag FLOAT in the makefile.  The function, exp, defined in 
real.c is double since System V specifies it as double.  If your C uses 
float, change this definition as well. 
 
   For UNIX systems, use either makefile.unx or makereal.unx. 
makefile.unx will make any of the programs and keeps the bp object code 
files around, while makereal.unx only makes bp and also keeps the bp 
object code files around.  Also for DOS systems 
there are two makefiles to choose from, makefile and makereal.  Makefile 
is designed to make all four programs but it only leaves around the 
object files for ibp while erasing object files for sibp, sbp and bp. 
On the other hand, makereal only makes bp and it leaves its object 
files around.  For 16-bit DOS you need to set the flag, -DDOS16 and for 
32-bit DOS, you need to set the flag -DDOS32.  The flags I have in the 
DOS makefiles are what I use with Zortech C 3.1.  The code is known not 
to compile with at least one version of Turbo C because of an oddity 
(or bug?) in the compiler. 
 
   A problem was found with the previous free student version: it 
crashed on a Sun when the program hit a call to free in the file bp.c. 
This can be solved by removing the two calls to free; the amount of 
space you waste is minimal.  I haven't had a report of such a problem 
with this version yet, but if it happens, let me know.  In all 
probability removing a call or two to free in the file io.c will solve 
the problem. 
 
   This code will work with basic C compilers however the libraries 
sometimes vary from system to system.  DOS systems seem to use the 
function getch in the conio library for hot key capability.  For a 
System V UNIX system the code uses a home-made function called getch for 
hot key capability.  This is the default setting for a UNIX system and 
it also works with Suns.  If you use BSD UNIX then you need to define 
the compiler variable BSD in the cc command by adding the parameter 
-DBSD.  To get the hot key feature to work with a NeXT use the 
parameter -DNEXT.  At this point I don't know what other variations of 
UNIX use so you may need to adapt the ioctl function call in the file 
io.c and the files rbp.h and ibp.h to make them fit some other version. 
If your system uses some other standard then if you can send me the 
documentation I should be able to make it work as well.  If necessary 
the hot key option can be removed by removing or commenting out the 
line: 
 
#define HOTKEYS 
 
in the rbp.h and ibp.h files. 
 
   There are some other more minor options that can be compiled in or 
left out but these are mentioned at other points in the documentation. 
 
   To make a particular executable file use the makefile given with the 
data files and make any or all of them like so: 
 
        UNIX                        DOS 
 
    make -f makereal.unx bp       make -f makereal bp 
    make -f makefile.unx bp       make bp 
    make -f makefile.unx ibp      make ibp 
    make -f makefile.unx sibp     make sibp 
    make -f makefile.unx sbp      make sbp 
 
If you do get bugs on an odd system and you can let me telnet in to 
your system (preferably on a separate login, rather than your personal 
login) I will try and fix the problem for you. 
 
 
3. A Simple Example 
------------------- 
   Each version would normally be called with the name of a file to read 
commands from, as in: 
 
bp xor 
 
After the data file is read commands are then taken from the keyboard. 
When no file name is specified bp will take commands from the keyboard 
(stdin file).  Normally you will find it convenient to put the commands 
you need to set up the network in a short file however it is possible to 
type them all in to the program from the keyboard.  If you have more 
than a tiny amount of data you should have the data ready in a training 
file and a testing file if you have test data. 
 
   The commands are one or two or three letter commands and most of them 
have optional parameters.  The `a', `d', `f' and 'q' commands allow a 
number of sub-commands on a line.  The maximum length of any line is 256 
characters.  An `*' is a comment and it can be used to make the 
remainder of the line a comment.  In addition ctrl-R will run the 
training. 
 
   Here is an example of a data file to do the xor problem: 
            
* input file for the xor problem 
            
m 2 1 1 x     * make a 2-1-1 network with extra input-output connections 
s 7           * seed the random number function 
ci            * clear and initialize the network with random weights 
 
rt {          * read training patterns into memory 
1 0 1 
0 0 0 
0 1 1 
1 1 0} 
 
e 0.5         * set eta, the learning rate to 0.5 (and eta2 to 0.5) 
a 0.9         * set alpha, the momentum to 0.9 
 
First in this example, the m command will make a network with 2 units in 
the input layer, 1 unit in the second layer and 1 unit in the third 
layer.  Much of the time a three layer network where the connections are 
only between adjacent layers is as complex as a network needs to be 
however there are problems where having additional connections between 
the input units and output units will greatly speed-up the learning 
process.  The xor problem is one of those problems where the extra 
connections help so the 'x' at the end of the command will add these 
two extra connections.  The `s' (seed) command sets the seed for the 
random number function.  The "ci" command (clear and initialize) clears 
the existing network weights and initializes the weights to random 
values between -1 and +1.  The rt (read training set) command gives four 
new patterns to be read into the program.  All of them are listed 
between the curly brackets ({}).  The input pattern comes first followed 
by the output pattern.  The command "e 0.5" sets eta, the learning 
rate for the upper layer to 0.5 and eta2 for the lower layers to 0.5 as 
well.  The last line sets alpha, the momentum parameter, to 0.9. 
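
   The pattern layout described above can be sketched in a few lines of 
Python.  This is only an illustration and not part of the package: it 
assumes the 2-1-1 network of the example, so the first two values of 
each line are inputs and the last one is the target.  The function name 
split_pattern is made up for the illustration. 

```python
def split_pattern(line, n_inputs, n_outputs):
    """Split one whitespace-separated pattern line into (inputs, targets)."""
    values = [float(tok) for tok in line.split()]
    if len(values) != n_inputs + n_outputs:
        raise ValueError("expected %d values, got %d"
                         % (n_inputs + n_outputs, len(values)))
    return values[:n_inputs], values[n_inputs:]


# The four xor patterns from the rt command in the example data file.
xor_lines = ["1 0 1", "0 0 0", "0 1 1", "1 1 0"]
patterns = [split_pattern(line, 2, 1) for line in xor_lines]
for inputs, targets in patterns:
    print(inputs, "->", targets)
```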
 
   After these commands are executed the following messages and prompt 
appear: 
 
Basis of AI Backprop (c) 1990-96 by Donald R. Tveter 
   drt@mcs.com - http://www.mcs.com/~drt/home.html 
               April 10, 1996 version. 
taking commands from stdin now 
[ACDFGMNPQTW?!acdefhlmopqrstw]? q 
 
The characters within the square brackets are a list of the possible 
commands.  To run 100 iterations of back-propagation and print out the 
status of the learning every 10 iterations type "r 100 10" at the 
prompt: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? r 100 10 
 
This gives: 
  
running . . . 
   10      0.00 % 0.49947    
   20      0.00 % 0.49798    
   30      0.00 % 0.48713    
   40      0.00 % 0.37061    
   50      0.00 % 0.15681    
   59    100.00 % 0.07121    DONE 
 
The program immediately prints out the "running .  .  ." message.  After 
each 10 iterations a summary of the learning process is printed giving 
the percentage of patterns that are right and the average value of the 
absolute values of the errors of the output units.  The program stops 
when each output for each pattern has been learned to within the 
required tolerance, in this case the default value of 0.1.  Sometimes 
the integer versions will do a few extra iterations before declaring the 
problem done because of truncation errors in the arithmetic done to 
check for convergence.  Unlike the previous student version the default 
for these values is to be "up-to-date" however this can be over-ridden 
to save a little on CPU time. 
 
   There are many factors that affect the number of iterations needed 
for a network to converge.  For instance if your random number function 
doesn't generate the same values as the one with the Zortech 3.1 
compiler (which is the same one used by most UNIX C compilers) the 
number of iterations it takes will be different.  The integer versions 
produce slightly different results than the floating point versions. 
 
Listing Patterns 
 
   To get a listing of the status of each pattern use the `p' command 
to give: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? p 
    1  0.903  e 0.097 ok 
    2  0.050  e 0.050 ok 
    3  0.935  e 0.065 ok 
    4  0.072  e 0.072 ok 
   59    (TOL) 100.00 % (4 right  0 wrong)  0.07121 err/unit 
 
The number following the e (for error) is the sum of the absolute values 
of the output errors for each pattern.  An `ok' is given to every 
pattern that has been learned to within the required tolerance.  To get 
the status of one pattern, say, the fourth pattern, type "p 4" to give: 
 
 0.07  (0.072) ok 
 
To get a summary without the complete listing use "p 0".  To get the 
output targets for a given pattern, say pattern 3, use "o 3". 
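
   As a rough check on the summary line, the following Python sketch 
recomputes the percentage right and the average absolute error from the 
four output values in the listing above.  It is not the program's code. 
It assumes that an error exactly equal to the tolerance counts as ok, 
which the manual does not state, and the name summarize is made up. 
Because the listing shows outputs rounded to three places, the average 
comes out near 0.071 rather than exactly 0.07121. 

```python
def summarize(outputs, targets, tolerance=0.1):
    """Return (percent of patterns right, average abs error per output unit)."""
    right = 0
    total_error = 0.0
    n_units = 0
    for out_row, tgt_row in zip(outputs, targets):
        errors = [abs(o - t) for o, t in zip(out_row, tgt_row)]
        # Assumption: a pattern is right when every unit is within tolerance,
        # with the boundary value itself accepted.
        if all(e <= tolerance for e in errors):
            right += 1
        total_error += sum(errors)
        n_units += len(errors)
    return 100.0 * right / len(outputs), total_error / n_units


# Output values copied from the p listing; targets are the xor targets.
outputs = [[0.903], [0.050], [0.935], [0.072]]
targets = [[1.0], [0.0], [1.0], [0.0]]
percent, avg_err = summarize(outputs, targets)
print("%.2f %% %.5f" % (percent, avg_err))
```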
 
   A particular test pattern can be input to the network by giving the 
pattern at the prompt: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? 1 0 
       0.903  
 
Examining Weights 
 
   It is often interesting to see the values of some particular weights 
in the network.  To see a listing of all the weights in a network you 
can use the save weights command described later on and then list the 
file containing the weights, however, to see the weights leading into a 
particular node, say the node in row 3, node 1 use the w command as in: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? w 3 1 
 
layer unit  inuse  unit value    weight   inuse   input from unit 
  1     1     1      1.00000     5.38258     1        5.38258 
  1     2     1      0.00000    -4.86238     1        0.00000 
  2     1     1      1.00000   -10.86713     1      -10.86710 
  3     b     1      1.00000     7.71563     2        7.71563 
                                              sum =   2.23111 
 
This listing also gives data on how the current activation value of the 
node is computed using the weights and the activation values of the 
nodes feeding into unit 1 of layer 3.  The `b' unit is the bias (also 
called the threshold) unit.  The inuse column to the right of the unit 
column is 1 when the unit is in use and 0 if it is not in use.  The 
inuse column to the right of the weight column shows a 1 for a regular 
weight in use and a 2 for a bias weight in use.  In this free version 
there are no commands to take weights out of use. 
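
   The arithmetic behind the listing can be reproduced with a short 
Python sketch (an illustration only, not code from the package).  It 
multiplies each unit value by its weight, adds the products and applies 
the standard sigmoid.  The unit values are taken as printed, so the sum 
differs from the listing in the last digits, since the hidden unit is 
really slightly below 1.  The result agrees with the 0.903 the network 
gave for the input 1 0. 

```python
import math


def sigmoid(x):
    """Standard sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# (unit value, weight) pairs copied from the "w 3 1" listing;
# the last pair is the bias unit, whose value is always 1.
incoming = [(1.0, 5.38258), (0.0, -4.86238), (1.0, -10.86713), (1.0, 7.71563)]

net = sum(value * weight for value, weight in incoming)
activation = sigmoid(net)
print("sum = %.5f  activation = %.3f" % (net, activation))
```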
 
   Besides saving weights you can save all the parameters to a file 
with the save everything command as in: 
 
   se saved 
 
At the same time the weights will be written to the current weights 
file.  The file saved is virtually the same as the one you get with the 
'?' command.  To start over from where you left off you can use: 
 
   bp saved 
 
and this also reads in the patterns and weights.  The se command DOES 
NOT save training and testing patterns since normally you would have 
them in a file of their own. 
 
   To get a short online tutorial on how to use the program you can type 
T at the command prompt and get the listing: 
 
   A Tutorial 
 
   The following topics are designed to be read in the order listed. 
 
   To get help on a topic type the code on the right at the prompt. 
 
   Understanding the Menus                                         h1 
   Formatting Data for a Classification Problem                    h2 
   Formatting Data for Function Approximation                      h3 
   Formatting Data for a Recurrent Problem                         h4 
   Making a Network for Classification or Function Approximation   h5 
   Making a Recurrent Network                                      h6 
   Reading the Data                                                h7 
   Setting Algorithms and Parameters                               h8 
   Running the Program                                             h9 
   Saving Almost Everything                                        h10 
   To Quit the Program                                             h11 
 
   To end the program the `q' (for quit) command is entered: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? q 
 
 
4. Basic Facilities 
------------------- 
   There are a very large number of parameters that can be set for 
various algorithms in these programs.  Typing a `?'  will get a compact 
listing of them all however they are packed rather tight.  To get a 
better view of the parameters there are now many upper-case letter 
commands that give a listing of parameters in a less compact form. 
These screens list parameters, generally on the left of the screen in 
the form of the commands you would need to set them.  The center of the 
screen gives a short description of the parameter.  Sometimes one or two 
lines are inadequate to describe the command so at the far right there 
may be a sequence you can type to get more help with the command. 
 
   The most important screen you can look at is the C for commands 
screen that summarizes what each menu screen will show: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? C 
 
Screen     Includes Information and Parameters on: 
 
  A        algorithm parameters and tolerance 
  C        this listing of major command groups 
  D        delta-bar-delta parameters 
  F        formats: patterns, output, paging, copying screen i/o 
  G        gradient descent (plain backpropagation) 
  M        miscellaneous commands: shell escape, seed values, clear, 
           clear and initialize, quit, kick a network, run command, 
           save almost everything 
  N        network building: making a network, initializing 
           a network, kicking a network 
  P        pattern commands: reading patterns, testing patterns, 
  Q        quickprop parameters 
  T        a short tutorial 
  W        weight commands: listing, saving, restoring 
  ?        a compact listing of everything 
 
One typical menu screen is the A screen that lists the main algorithm 
parameters: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? A 
 
Algorithm Parameters 
 
a a <char>     sets all act. functions to <char>; {ls}             h aa 
a ah s         hidden layer(s) act. function; {ls}                 h aa 
a ao s         output layer act. function; {ls}                    h aa 
a d d          the output layer derivative term; {cdf}             h ad 
a i -          initializes units before using the training set; {+-} 
a u p          the weight update algorithm; {Ccdpq}                h au 
t 0.100        tolerance/unit for successful learning; (0..1) 
 
f O -          allows out-of-date statistics to print; {+-} 
f u -          compute up-to-date statistics; {+-} 
 
The first of these listings is the line: 
 
a a <char>     sets all act. functions to <char>; {ls}             h aa 
 
which doesn't give a parameter value but instead it gives the pattern of 
a command designed to set the activation function for the entire 
network.  The first sequence is: 
 
a a <char> 
 
and this sequence will change the activation function but when you type 
it in you will have to substitute a character code for the activation 
function instead of the string <char>.  One other activation function is 
the linear activation function denoted by the character l, so to get 
this function you can type in the line: 
 
a a l 
 
The first `a' codes for the "algorithm" command, the second `a' codes 
for the "activation function" and `l' is the letter for the function.  The 
idea of putting the variable portion of the command within the angle 
brackets (<>) is a notation devised by Computer Scientists to describe 
computer languages.  The word inside these brackets describes the kind 
of thing that is the variable portion of the command. 
 
   The middle part of the line: 
 
a a <char>     sets all act. functions to <char>; {ls}             h aa 
 
gives a short description of the meaning of the command and within the 
curly brackets there is a listing of all the values for all the 
activation functions, {ls}.  To get a more detailed explanation of the 
options type the sequence on the right, `h aa', and the following comes 
up: 
 
   a a <char> sets every activation function to <char>. 
   a ah <char> sets the hidden layer activation function to <char>. 
   a ao <char> sets the output layer activation function to <char>, 
   <char> can be any of the following: 
 
                     Function                           Range 
 
      l    linear function, x                             (-inf..+inf) 
      s    standard sigmoid, 1 / (1 + exp(-x))               (0..+1) 
 
Here you get the code for the function, the function and the range of 
values the function can take on.  The range portion follows another 
standard notation used by Mathematicians.  A ( or ) next to a number, 
say 0, means the range runs very close to 0 but never exactly to 0.  In 
other cases (not shown above) a [ or ] next to a number means the value 
can range up to exactly the number.  Thus the range: 
 
(0..+1] 
 
means that the range can run from ALMOST EXACTLY 0 up to exactly +1. 
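
   The bracket notation can be made concrete with a small Python sketch. 
It is an illustration only: the function in_range is made up here and 
simply tests a value against a range written in the style used by the 
help screens, treating ( and ) as excluding the end point and [ and ] 
as including it. 

```python
def in_range(value, spec):
    """Check value against a range written like "(0..+1]" or "(-inf..+inf)"."""
    low_text, high_text = spec[1:-1].split("..")
    low, high = float(low_text), float(high_text)
    # "[" and "]" include the end point, "(" and ")" exclude it.
    low_ok = value >= low if spec[0] == "[" else value > low
    high_ok = value <= high if spec[-1] == "]" else value < high
    return low_ok and high_ok


print(in_range(1.0, "(0..+1]"))   # +1 itself is allowed by the ]
print(in_range(0.0, "(0..+1]"))   # 0 itself is excluded by the (
print(in_range(1.0, "(0..+1)"))   # the sigmoid's range never reaches +1
```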
 
   If we now return to the A screen, the second line was: 
 
a ah s         hidden layer(s) act. function; {ls}                 h aa 
 
Here the idea is to indicate that the activation function for the hidden 
layer (or layers) of the network IS NOW the s (standard sigmoid 
function).  Again there is a short explanation of this, the set of codes 
for functions and information about how to get more help.  This line can 
also be taken as a direction as to how to set the hidden layer 
activation function as well.  To change it to l you can type in: 
 
a ah l 
 
(Note: normally you would only use the linear activation function in the 
output layer of a network.) 
 
   The third line: 
 
a ao s         output layer act. function; {ls}                    h aa 
 
is similar except it states that the activation function for the output 
layer is s. 
 
 
Paging 
 
   In the student version the paging was a simple version of the System 
V utility, pg.  Now the paging is more like the common UNIX more 
command.  The default page size is 24 lines and it can be reset to 
another value with the format command's paging size sub-command.  For 
instance to get 12 lines / page instead of 24, use: 
 
f P 12 
 
To get no paging at all use: 
 
f P 0 
 
When the page is full you get the prompt: 
 
More? 
 
At this point you can type: 
 
   q                  to quit viewing the text if you are in a loop, 
   a blank            to get another page, 
   ^D                 to get another half a page, 
   a carriage return  to get one more line and 
   c                  to continue without paging. 
 
Mostly paging is needed for loops within the program, like running a 
large number of iterations and printing the results, listing the values 
of all the patterns or listing weights leading into a particular unit. 
Typing the q quits these loops, however paging can also occur with some 
of the longer screen menus that are generating lines of output without 
running a loop.  For these cases the q does not work. 
 
   Every new command entered from the keyboard sets the page counting 
variable to 0 however if input is being taken from a file other than 
stdin the counter is NOT reset.  Most of the time this doesn't matter 
since the little data files like the xor example used to set up 
parameters don't produce any output anyway, however if they do paging is 
in effect.  Having paging here is helpful in case there is a problem 
with reading the files. 
 
Interrupts 
 
   In UNIX entering an interrupt will stop the current command and the 
program will give the user another prompt.  With DOS entering a ctrl-C 
will generate a similar kind of interrupt however DOS only checks for 
this condition when it has to do i/o.  However when the DOS version is 
in a training loop the program also checks to see if a key has been hit 
and if that key is the escape key, the program will break the 
training loop. 
 
Control Command 
 
   One control-key command is available in this version, hitting ctrl-R 
will run the training algorithm, it is a shorthand for typing r followed 
by a carriage return. 
 
Passing Commands to the Operating System 
 
   By using the '!' command you can pass commands to the operating 
system from within the program.  The kind of typical things you might 
want to do are to list the contents of a directory, list a file or after 
saving weights to a file you might want to list them or even edit them 
and read them back in.  Here is what you can say for DOS to list the 
little data file xor: 
 
! type xor 
 
Once a string has been defined with a ! command it can be re-run simply 
by typing the ! followed immediately by a carriage return. 
 
Making a Copy of Your Session 
 
   Sometimes you may want to make a copy of everything that you type in 
and the program prints out.  For instance you may get exceptionally good 
or bad results using a certain training sequence and an exact record of 
what you and the program did could be worth having.  Or you may need an 
exact copy of the training or testing set values.  Or you may need lots 
of runs where you average the results using another program.  To turn on 
the making of a copy use the format command to turn on the copying 
process: 
 
f c+ 
 
and to turn it off use: 
 
f c- 
 
The text is written to the file copy. 
 
An Alphabetical Listing of the Commands 
 
   The following listing is designed to give you an idea of the set of 
commands available.  Details are given in later sections. 
 
a        sets the momentum parameter, alpha 
a       the algorithm command 
c                clear the network 
ci               clear and initialize the network 
d       set delta-bar-delta parameters 
e     set the learning rate eta 
f       lots of formatting options 
h        gives more help with certain options 
l         list the values of units on that layer 
m       make a network 
o        list the output targets of the training set pattern 
p       list information about training patterns 
q                quit 
qp               set quickprop parameters 
r       run the training algorithm 
rt      read the training set patterns 
rw     read the weights 
rx     read the extra training set patterns 
s         seed value 
sb         set bias weights 
se     save almost everything 
sw     save weights 
t       list testing file statistics of various sorts 
t          tolerance per output unit that must be met 
tf     gives the file name with testing patterns 
tr          special test for a recurrent network 
trp         special test for a recurrent network 
w   list weights leading into unit 
 
The Summary Line 
 
   The default setting for the summaries produces up-to-date 
statistics on the error and on how many patterns are correct. 
Here are several lines of summaries from a problem that has training 
data and test data for a classification problem: 
 
   10      0.00 %  49.04 % 0.47087       0.00 %  62.50 % 0.38234  
   20      0.00 %  73.08 % 0.38584       0.00 %  77.88 % 0.38108  
   30      2.88 %  76.92 % 0.35043       4.81 %  79.81 % 0.33285  
 
The first column is of course the number of iterations, the next column 
gives the percentage of training patterns that are correct based on the 
tolerance.  The next column gives the percentage of correct training 
patterns based on the maximum value of the output units.  The next 
column is the average absolute value of the error per output unit.  Note 
that many other programs will report the RMS error.  The columns on the 
right list the percentage of test set patterns that are correct based on 
tolerance, the percentage correct based on maximum value and finally the 
average error on the test set. 
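
   The two "percent correct" columns can be illustrated with a Python 
sketch.  This is not the program's code.  It assumes that "correct 
based on the maximum value" means the output unit with the largest 
value is the unit with the largest target, which the manual does not 
spell out.  The data and the function names are made up. 

```python
def right_by_tolerance(output, target, tolerance=0.1):
    """Every output unit must be within the tolerance of its target."""
    return all(abs(o - t) <= tolerance for o, t in zip(output, target))


def right_by_max(output, target):
    """Assumed rule: the largest output is on the unit with the largest target."""
    return output.index(max(output)) == target.index(max(target))


# Made-up outputs for a three-class problem.
outputs = [[0.80, 0.15, 0.05], [0.40, 0.35, 0.25], [0.20, 0.30, 0.50]]
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

pairs = list(zip(outputs, targets))
by_tol = 100.0 * sum(right_by_tolerance(o, t) for o, t in pairs) / len(pairs)
by_max = 100.0 * sum(right_by_max(o, t) for o, t in pairs) / len(pairs)
print("%6.2f %% %6.2f %%" % (by_tol, by_max))
```

Here no pattern is within tolerance but two of the three have the 
largest output on the correct unit, which is why the second column in 
the summaries is normally the larger one early in training. 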
 
   Some CPU time can be saved by altering certain parameter settings 
that skip some of the forward passes used to determine the current set 
of statistics.  The more often you print the statistics the more time 
you can save by altering these parameter settings.  The penalty is that 
the statistics will be out of date by one iteration with the quickprop, 
delta-bar-delta, supersab and regular periodic update methods and only 
approximate for both the right and wrong continuous update methods. 
 
   In all the update methods the training set statistics are computed 
when the program passes back the error.  However then an update of the 
weights is done and these numbers are out of date.  So if 100 iterations 
have been done the program only has the statistics on iteration 99, the 
values that were true before the weights were changed.  When it is time 
to print out the program statistics the default is to do another forward 
pass through the training set to get up to date statistics.  This can be 
stopped by setting the off by 1 option in the format command like so: 
 
f O+ 
 
The results that print out will now look like this: 
 
   10 -1   0.00 %  64.42 % 0.42463       0.00 %  62.50 % 0.38234  
   20 -1   0.00 %  73.08 % 0.39058       0.00 %  77.88 % 0.38108  
   30 -1   8.65 %  75.00 % 0.35052       4.81 %  79.81 % 0.33285  
 
where the string "-1" comes right after the number of iterations done 
and of course it means the numbers shown are for the previous iteration. 
For the test set patterns one pass through the test set has to be 
made to get up to date statistics on them so they are always up to date. 
Most of the time you will probably be more interested in the test set 
results than in the training set results so setting the off by 1 option 
saves a little time and getting off by 1 results on the training set 
is not important. 
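
   Why the training statistics lag by one iteration can be seen in a 
toy Python sketch.  It is not taken from the program: it uses a single 
linear unit with one weight, made-up data and a plain periodic update, 
and the function names are hypothetical.  The error collected during 
the pass belongs to the weights before the update, and only an extra 
forward pass gives the error for the new weights. 

```python
def avg_abs_error(w, patterns):
    """Average absolute error of the one-weight unit y = w * x."""
    return sum(abs(t - w * x) for x, t in patterns) / len(patterns)


def one_iteration(w, patterns, eta):
    """One periodic-update pass: collect error and gradient, then change w."""
    stale_error = avg_abs_error(w, patterns)      # measured before the update
    gradient = sum((t - w * x) * x for x, t in patterns)
    return w + eta * gradient, stale_error


patterns = [(1.0, 2.0), (2.0, 4.0)]   # made-up data, target = 2 * x
w = 0.0
for _ in range(3):
    w, stale = one_iteration(w, patterns, eta=0.1)

fresh = avg_abs_error(w, patterns)    # the extra forward pass
print("off by 1: %.3f   up to date: %.3f" % (stale, fresh))
```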
 
   The situation when you use the "right" and "wrong" continuous update 
methods is even more complicated.  After the forward pass for one 
pattern is done it is checked to see if it is right and the error is 
added in to a sum of errors.  Then weights are changed.  Then another 
pattern is processed in the same way.  When the weight changes are done 
for this second pattern they may well ruin the right/wrong decision for 
all the previous patterns.  Thus the number of right and wrong patterns 
and the average error can be off by quite a lot.  With the off by 1 
option off ("f O-") the program still does a forward pass to get the 
up to date statistics on the training set.  However when the off by 1 
option is on the statistics look like this: 
 
   10 -1   1.92 %  66.35 % 0.42093 ?    31.73 %  40.38 % 0.46316  
   20 -1  12.50 %  64.42 % 0.38829 ?    40.38 %  44.23 % 0.53761  
   30 -1  34.62 %  75.00 % 0.30939 ?    50.96 %  63.46 % 0.37728  
 
where the ? after the training set error flags the fact that the numbers 
are very suspect. 
 
   Another option in the program is to do an extra forward pass through 
the training set even when there are no statistics to print out.  The 
option to give you up to date statistics is: 
 
f u+ 
 
If you are using the periodic update method, quickprop or dbd you don't 
need "f u+" as the program will report the correct values anyway. 
 
   The one line form of the summary is the default but it can be turned 
off using: 
 
f s- 
 
With this you get nothing whatsoever and normally you won't want this 
unless perhaps you are producing your own customized output. 
 
5. The Format Command (f) 
------------------------- 
   There are several ways to input and output patterns, numbers and 
other values and there is one format command, `f', that is used to set 
these options.  In the format command a number of options can be given 
on a single line as for example in: 
 
f b+ ir oc wB 
 
Input Patterns 
 
   The programs are able to read pattern values in two different 
formats.  Real numbers follow the C language notation and must be 
separated by a space.  The letter `H', used in recurrent networks, is 
also allowed.  The letter `x', with a default value of 0.5, is also 
allowed.  The value of `x' can be changed, for example to make `x' -1 
use: 
 
f x -1 
 
Real input format is now the default but if you use the other format 
(a compressed binary format) you can re-set the format to real with: 
 
f ir 
 
   The other format is the compressed format, a format consisting of 1s, 
0s and the letters `x' and `H'.  In compressed format each value is one 
character and it is not necessary to have blanks between the characters. 
For example, in compressed format the patterns for xor could be written 
out in either of the following ways: 
 
101      10 1 
000      00 0 
011      01 1 
110      11 0 
 
The second example is preferable because it makes it easier to see the 
input and the output patterns. 
 
To change to compressed format use: 
 
f ic 
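
   A reader for the compressed format might look like the following 
Python sketch.  It is an illustration under stated assumptions, not the 
program's parser: blanks are simply skipped, `x' takes a configurable 
value (0.5 unless changed) and `H' is passed through as a marker since 
its meaning belongs to the recurrent network section.  The name 
parse_compressed is made up. 

```python
def parse_compressed(line, x_value=0.5):
    """Turn a compressed pattern line such as "10 1" into a list of values."""
    values = []
    for ch in line:
        if ch.isspace():
            continue                  # blanks are optional separators
        if ch == "1":
            values.append(1.0)
        elif ch == "0":
            values.append(0.0)
        elif ch == "x":
            values.append(x_value)    # value set with "f x <value>" in bp
        elif ch == "H":
            values.append("H")        # assumption: kept as a marker here
        else:
            raise ValueError("unexpected character %r" % ch)
    return values


print(parse_compressed("101"))
print(parse_compressed("10 1"))
print(parse_compressed("x1", x_value=-1.0))
```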
 
Output of Patterns 
 
   Output format is controlled with the `f' command as in: 
 
f or   * output node values using real (the C %f) format 
f oc   * output node values using compressed format 
f oa   * output node values using analog compressed format 
f oe   * output values with e notation 
 
The first sets the output to real numbers.  The second sets the output 
to be compressed mode where the value printed will be a `1' when the 
unit value is greater than 1.0 - tolerance, a `^' when the value is 
above 0.5 but less than 1.0 - tolerance, a `v' when the value is less 
than 0.5 but greater than the tolerance.  Below the tolerance value a 
`0' is printed.  The tolerance can be changed using the `t' command (not 
a part of the format command).  For example, to make all values greater 
than 0.8 print as `1' and all values less than 0.2 print as `0' use: 
 
t 0.2 
 
Of course this same tolerance value is also used to check to see if all 
the patterns have converged.  The third output format is meant to give 
"analog compressed" output.  In this format a `c' is printed when a 
value is close enough to its target value.  Otherwise, if the answer is 
close to 1, a `1' is printed, if the answer is close to 0, a `0' is 
printed, if the answer is above the target but not close to 1, a `^' is 
printed and if the answer is below the target but not close to 0, a `v' 
is printed.  This output format is designed for problems where the 
output is a real number, as for instance, when the problem is to make a 
network learn sin(x).  The format "e" writes out node values using 
exponential notation with four places to the right of the decimal point. 
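
   The two compressed output formats described above can be sketched in
Python as follows.  This is only an illustration, not the simulator's
code.  The function names are invented, a value of exactly 0.5 is
printed as `v' by assumption, and "close enough to its target" is
assumed to mean "within the tolerance".

```python
def compressed_char(value, tolerance):
    """Character printed for one unit in compressed ("f oc") output."""
    if value > 1.0 - tolerance:
        return "1"
    if value < tolerance:
        return "0"
    # Assumption: exactly 0.5 is shown as 'v'; the manual does not say.
    return "^" if value > 0.5 else "v"


def analog_char(value, target, tolerance):
    """Character printed for one unit in analog compressed ("f oa") output."""
    if abs(value - target) <= tolerance:   # assumed meaning of "close enough"
        return "c"
    if value > 1.0 - tolerance:
        return "1"
    if value < tolerance:
        return "0"
    return "^" if value > target else "v"


# With "t 0.2", values above 0.8 print as 1 and values below 0.2 as 0.
print("".join(compressed_char(v, 0.2) for v in (0.85, 0.6, 0.3, 0.15)))
```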
 
Breaking up the Output Values 
 
   In the compressed formats the default is to print a blank after every 
10 values.  This can be altered using the `B' (for inserting breaks) 
option within the format (`f') command.  The purpose of this option is
to separate output values into logical groups to make the output more
readable.  For instance, you may have 24 output units where it makes
sense to insert blanks after the 4th, 7th and 19th positions.  To do 
this, specify: 
 
f B 4 7 19 
 
Then for example the output will look like: 
 
  1 10^0 10^ ^000v00000v0 01000 e 0.17577 
  2 1010 01v 0^0000v00000 ^1000 e 0.16341 
  3 0101 10^ 00^00v00000v 00001 e 0.16887 
  4 0100 0^0 000^00000v00 00^00 e 0.19880 
 
The break option allows up to 20 break positions to be specified.  The 
default output format is the real format with 10 numbers per line.  For 
the output of real values the option specifies when to print a carriage 
return rather than when to print a blank. 
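
   The effect of the break positions on compressed output can be
sketched in a few lines of Python.  The function insert_breaks is a
made-up name for this illustration, not part of the simulator.

```python
def insert_breaks(chars, positions):
    """Insert a blank after each of the given 1-based positions."""
    pieces = []
    start = 0
    for pos in positions:
        pieces.append(chars[start:pos])
        start = pos
    pieces.append(chars[start:])
    return " ".join(pieces)


# The 24 output characters of pattern 1 above, with "f B 4 7 19":
print(insert_breaks("10^010^^000v00000v001000", [4, 7, 19]))
```

This prints the same grouping as the first line of the sample output:
4, 3, 12 and 5 characters.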
 
Pattern Formats 
 
   There are two different types of problems that back-propagation can 
handle, the general type of problem where every output unit can take on 
an arbitrary value and the classification type of problem where the goal 
is to turn on output unit i and turn off all the other output units when 
the pattern is of class i.  The xor problem is an example of the general 
type of problem.  For an example of a classification problem, suppose 
you have a number of data points scattered about through two-dimensional 
space and you have to classify the points as either class 1, class 2 or 
class 3.  For a pattern of class 1 you can always set up the output
"1 0 0", for class 2 "0 1 0" and for class 3 "0 0 1".  However, doing
the translation to bit patterns can be annoying, so another notation
can be used.  Instead of specifying the bit patterns you can set the
pattern format option to classification (as opposed to the default
value of general) like so:
 
f pc 
 
and then the program will read data in the form: 
 
   1.33   3.61   1   *  shorthand for 1 0 0 
   0.42  -2.30   2   *  shorthand for 0 1 0 
  -0.31   4.30   3   *  shorthand for 0 0 1 
 
and translate it to the bit string form.  To switch to the general form 
use "f pg".  Another benefit of the classification format is that when 
the program outputs a status line it will also include the percentage of 
correct patterns based on the maximum value rather than just on 
tolerance. 
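
   The translation done by the classification format, and the idea of
counting a pattern as correct "based on the maximum value", might look
like this in Python.  Both function names are invented for the example
and the simulator's own bookkeeping may differ.

```python
def class_to_targets(class_number, n_classes):
    """Expand a class number (counting from 1) into its bit pattern."""
    if not 1 <= class_number <= n_classes:
        raise ValueError("class number out of range")
    return [1.0 if i == class_number else 0.0
            for i in range(1, n_classes + 1)]


def correct_by_max(outputs, class_number):
    """True when the largest output unit is the one for the right class."""
    return outputs.index(max(outputs)) + 1 == class_number


print(class_to_targets(2, 3))               # the pattern for class 2
print(correct_by_max([0.3, 0.6, 0.4], 2))   # right class, even if not within tolerance
```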
 
 
Controlling Summaries 
 
   When the program is learning patterns you normally want to have it 
print out the status of the learning process at regular intervals.  The 
default is to print out a one-line summary of how learning is going 
and this is set by using "f s+".  However if you want to customize 
exactly what is printed out and you don't want the standard summary, use 
"f s-". 
 
Skipping the "running . . ." Message 
 
   Normally whenever you run more training iterations the message, 
"running . . ." prints out to reassure you that something is in fact 
being done, however this can also be annoying at times.  To get rid of 
this message use "f R-" and to bring it back use "f R+". 
 
Ringing the Bell 
 
   To ring the bell when the learning has been completed use "f b+" and 
to turn off the bell use "f b-". 
 
Echoing Input 
 
   When you are reading commands from a file it is sometimes worthwhile 
to see those commands echoed on the screen.  To do this, use "f e+" and 
to turn off the echoing, use "f e-". 
 
Paging 
 
   To set the page size to some value, say, 25, use "f P 25" or to skip 
paging use "f P 0". 
 
Making a Copy of Your Session 
 
   To make a copy of what appears on the screen use "f c+" to start 
writing to the file "copy" and "f c-" to stop writing to this file. 
Ending the session automatically closes this file as well. 
 
Up-To-Date Statistics 
 
   During the ith pass thru the network the program will collect 
statistics on how many patterns are correct and how much error there is. 
It does this so that it will know when to stop the training.  But it 
gets these numbers BEFORE the weights are changed in the ith pass.  In 
the case of periodic update methods (the periodic, delta-bar-delta, 
quickprop and supersab) this is not much of a problem.  If the off by 1 
flag is off ("f O-") there is another forward pass done whenever the 
statistics are printed out so you get up to date statistics anyway.  If 
the off by 1 flag is on ("f O+") you get the string "-1" after the 
number of iterations is printed on the summary line.  Getting the 
statistics in the off by 1 form is harmless and it saves a little CPU 
time.  When the network converges the "-1" flag will not be shown. 
 
   However with the continuous update methods the weights are changed 
after each pattern and this skews the statistics gathered by the 
training process by quite a lot.  To get an accurate assessment of how 
well the training is going when results are printed on the summary line 
you either need to have the off by 1 flag turned off ("f O-") or you
need to set the up to date statistics flag by: "f u+".  The default is
to leave this flag off: "f u-".  Furthermore, if you are training to
get an accurate assessment of how many iterations it takes to learn the
training set you need to set "f u+" (NOT JUST "f O-"!).  The "u+"
setting makes a check after every complete pass through the training
set.  The "f O-" setting only makes a check when it is time to print
the status line.
 
 
6. Taking Training and Testing Patterns from Files (rt,rx,tf) 
------------------------------------------------------------- 
   In the xor example given above the four patterns were part of the 
data file and to read them in the following lines were used: 
 
rt { 
1 0 1 
0 0 0 
0 1 1 
1 1 0 } 
 
However it is also convenient to take patterns from a file that contains 
nothing but a list of patterns (and possibly comments).  To read a new 
set of patterns from some file, patterns, use: 
 
rt patterns 
 
To add an extra group of patterns to the current set you can use: 
 
rx patterns 
 
To read in test patterns from say the file, xtest, do the following: 
 
tf xtest 
 
To evaluate all the test patterns without listing them do "t0".  To list 
them, use "t".  To list one particular test pattern, say pattern 3, do 
"t 3". 
 
 
7. Saving and Restoring Weights and Related Values (sw,rw,sw+,swe,swem) 
-----------------------------------------------------------------------
   Sometimes the amount of time and effort needed to produce a set of 
weights to solve a problem is so great that it is more convenient to 
save the weights rather than constantly recalculate them.  To save the 
weights to the current weights file use "sw".  The weights are then 
written on a file called "weights" or to the last file name you have 
specified.  The weights file looks like: 
 
59r  m 2 1 1 x aahs aos bh 1.000000 bo 1.000000 Dh 1.000000 Do 1.000000  file = ../xor3.new 
 8.926291e+000  1  1 1 to 2 1 
-7.945858e+000  1  1 2 to 2 1 
 3.898432e+000  2  2 b to 2 1 
 5.382575e+000  1  1 1 to 3 1 
-4.862383e+000  1  1 2 to 3 1 
-1.086713e+001  1  2 1 to 3 1 
 7.715632e+000  2  3 b to 3 1 
 
To write the weights the program starts with the second layer, writes
out the weights leading into these units in order with the threshold
weight last, then it moves on to the third layer, and so on.  In
addition to the weight value, the second column of each line tells
whether or not the weight is in use.  If the weight is in use it is
marked with a 1, if it is a bias unit weight it is marked with a 2 and
if it is not in use it is marked with a 0.  This marking is not used in
this free version.  The last 4 numbers on each line tell which units
the weight runs between.  The first weight listed runs from layer 1
unit 1 to layer 2 unit 1.  The letter b indicates that the weight comes
from the bias unit.  These last 4 values on a line are ignored when the
file is read, so if you want to make up your own weights file you don't
need to type them in; they are there only for human convenience.
However, the in-use values must be present if you write your own
weights file, and you must use only one weight per line.
 
   To restore these weights type `rw' for restore weights.  At this time 
the program reads the header line and sets the total number of 
iterations the program has gone through to be the first number it finds 
on the header line.  It then reads the character immediately after the 
number.  The `r' indicates that the weights will be real numbers 
represented as character strings. 
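
   If you want to read a weights file from your own code, the layout
described above can be handled with something like the following Python
sketch.  The function names are hypothetical and this is not the
simulator's reader.  It only pulls out the fields the manual says are
significant.

```python
import re


def parse_header(line):
    """Return (iterations, type letter) from the first line of a weights file."""
    match = re.match(r"\s*(\d+)(\S)", line)
    if match is None:
        raise ValueError("header must start with a number and a type letter")
    return int(match.group(1)), match.group(2)


def parse_weight_line(line):
    """Return (weight, in-use flag); anything after the flag is ignored."""
    fields = line.split()
    return float(fields[0]), int(fields[1])


print(parse_header("59r  m 2 1 1 x aahs aos"))
print(parse_weight_line(" 8.926291e+000  1  1 1 to 2 1"))
print(parse_weight_line(" 3.898432e+000  2"))   # the unit numbers may be left off
```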
 
   The remaining text on the first line of a weight file is not used by 
the restore weights command at this time and it is there to give you a 
record of what size and type the network was.  The fact that the rest of 
this line is not read by the restore weights program means that before 
you read in weights you have to make the proper size network with the 
"m" command.  The "m 2 1 1 x" of course means there are 2 units in the 
first layer, one in the second, one in the third and the x means there 
are extra connections from the input units to the output unit. 
Following that the initial command file that was read in is given. 
 
   To save weights to a file other than "weights" you can say: "sw
<filename>", where, of course, <filename> is the file you want to save
to.  To continue saving to the same file you can just do "sw".  If you
type "rw" to restore weights they will come from this current weights
file as well.  You can restore weights from another file by using: "rw
<filename>".  Of course this also sets the name of the file to write to
so if you're not careful you could lose your original weights file.
 
 
8. Initializing Weights (c,ci) 
------------------------------ 
   All the weights in the network initially start out at 0 and they are 
also set to 0 by using the clear (c) command.  In some problems where 
all the weights are 0 the weight changes may cancel themselves out so 
that no learning takes place.  Moreover, in most problems the training 
process will usually converge faster if the weights start out with small 
random values.  To do this use the clear and initialize command as in: 
 
ci 0.5 
 
where the random initial weights will run from -0.5 to +0.5.  If the
value is omitted the last range specified will be used.  The initial
value of the range is 1.
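
   In outline, clearing and initializing amounts to drawing every
weight uniformly from the given range.  The Python sketch below shows
the idea only.  The function name is invented, and Python's random
number generator will not reproduce the values the simulators produce
for the same seed.

```python
import random


def clear_and_initialize(n_weights, weight_range=1.0, seed=0):
    """Return n_weights random values in [-weight_range, +weight_range]."""
    rng = random.Random(seed)   # a separate generator so runs are repeatable
    return [rng.uniform(-weight_range, weight_range)
            for _ in range(n_weights)]


weights = clear_and_initialize(7, 0.5, seed=7)   # roughly "s 7" then "ci 0.5"
print(weights)
```

Using the same seed again gives the same starting weights, which is
what makes a run repeatable.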
 
 
9. The Seed Value (s) 
--------------------- 
   The initial seed value is set to 0 and this value is as good as any
other value.  However, networks often do not converge quickly, or at
all, with some sets of initial weights.  To get some other initial
random weights use the seed command as in:
 
s 7 
 
where the seed is set to 7.  The seed value is of type unsigned. 
 
 
10. The Algorithm Command (a) 
----------------------------- 
   A number of different variations on the original back-propagation 
algorithm have been proposed in order to speed up convergence and some 
of these have been built into these simulators.  These options are set 
using the `a' command and a number of options can go on the one line. 
 
Activation Functions 
 
   To set the activation functions use: 
 
a a<char>   * to set the activation function for all layers to <char>.
a ah<char>  * to set the hidden layer(s) function to <char>.
a ao<char>  * to set the output layer function to <char>.

where <char> can be:
 
   l  for the linear activation function:  x 
   s  for the traditional smooth activation function: 
      1.0 / (1.0 + exp(-x))
 
   The s function is the standard smooth activation function originally
used by researchers and it is still the most commonly used one.  In the
bp program it is implemented by a table look-up (the default) or, if
the compiler variable LOOKUP is undefined in the file ibp.h, by the
regular time-consuming real valued calculations.
 
   The linear activation function gives networks only a very limited
ability to learn patterns and it is therefore hardly ever used by itself
in a network.  However, it is often used in the output layer of networks
with 3 or more layers so that the network can give output values beyond
the range of the other activation functions.  For instance, suppose you
need to train a network to compute some non-linear function but you need
to produce outputs in the range -10 to 10.  The usual activation
functions are restricted to the range 0 to 1 or -1 to 1, but you can
choose a non-linear function for the network's hidden layers and, with
linear neurons in the output layer, the network can produce values
in the range -10 to 10.
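
   To make the difference in output range concrete, here is a small
Python sketch of the two activation functions.  It uses the usual
logistic form 1 / (1 + exp(-x)) computed directly rather than by table
look-up, and the function names are chosen only for this example.

```python
import math


def smooth(x):
    """Standard logistic activation; the result always lies between 0 and 1."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)             # equivalent form that avoids overflow for very negative x
    return z / (1.0 + z)


def linear(x):
    """Linear activation; the output is not limited to any range."""
    return x


for net_input in (-10.0, 0.0, 10.0):
    print(net_input, smooth(net_input), linear(net_input))
```

A smooth output unit can never give -10 or 10, while a linear output
unit simply passes its net input through.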
 
 
The Derivatives 
 
   The correct derivative for the standard activation function is s(1-s)
where s is the activation value of a unit.  However, when s is near 0
or 1 this term will give only very small weight changes during the
learning process.  To counter this problem Fahlman proposed the
following derivative for the output layer:
 
0.1 + s(1-s) 
 
(For the original description of this method see "Faster-Learning
Variations on Back-Propagation:  An Empirical Study", by Scott E.
Fahlman, in Proceedings of the 1988 Connectionist Models Summer School, 
Morgan Kaufmann, 1989.) 
 
   Besides Fahlman's derivative and the original one, there is the
differential step size method (see "Stepsize Variation Methods for
Accelerating the Back-Propagation Algorithm", by Chen and Mars, in
IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990), which takes the derivative
to be 1 in the layer going into the output units and uses the correct
derivative term for all other layers.  The learning rate for the inner
layers, eta2, is normally set to some smaller value than the learning
rate for the output layer.  To set a value for eta2 give two values in
the `e' command as in:
 
e 0.1 0.01 
 
To set the derivative use the `a' command as in: 
 
a dc   * use the correct derivative for whatever function 
a dd   * use the differential step size derivative (default) 
a df   * use Fahlman's derivative in only the output layer 
a do   * use the original derivative (same as `c' above) 
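
   The three derivative choices for the standard smooth function can
be compared with a short Python sketch.  The function names and the
feeds_output_layer argument are invented for the illustration; s is the
activation value of the unit.

```python
def correct_derivative(s):
    """The true derivative of the smooth function, s(1-s)."""
    return s * (1.0 - s)


def fahlman_derivative(s):
    """Fahlman's variation: never smaller than 0.1."""
    return 0.1 + s * (1.0 - s)


def differential_step_derivative(s, feeds_output_layer):
    """Differential step size: 1 going into the output units, s(1-s) elsewhere."""
    return 1.0 if feeds_output_layer else s * (1.0 - s)


s = 0.99   # a unit that is nearly saturated
print(correct_derivative(s))                    # very small, so learning is slow
print(fahlman_derivative(s))                    # kept away from zero
print(differential_step_derivative(s, True))    # always 1 for the output layer
```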
 
Update Methods 
 
   The choices are the periodic (batch) method, the continuous (online) 
method, delta-bar-delta and quickprop.  The following commands set the 
update methods: 
 
a uC   * for the "right" continuous update method 
a uc   * for the "wrong" continuous update method 
a ud   * for the delta-bar-delta method 
a up   * for the original periodic update method (default) 
a uq   * for the quickprop algorithm 
 
 
11. The Delta-Bar-Delta Method (d) 
---------------------------------- 
   The delta-bar-delta method attempts to find a learning rate, eta, for 
each individual weight.  The parameters are the initial value for the 
etas, the amount by which to increase an eta that seems to be too small, 
the rate at which to decrease an eta that is too large, a maximum value 
for each eta and a parameter used in keeping a running average of the 
slopes.  Here are examples of setting these parameters: 
 
d d 0.5    * sets the decay rate to 0.5 
d e 0.1    * sets the initial etas to 0.1 
d k 0.25   * sets the amount to increase etas by (kappa) to 0.25 
d m 10     * sets the maximum eta to 10 
d n 0.005  * an experimental noise parameter 
d t 0.7    * sets the history parameter, theta, to 0.7 
 
These settings can all be placed on one line: 
 
d d 0.5  e 0.1  k 0.25  m 10  t 0.7 
 
The version implemented here does not use momentum.  The symmetric 
versions sbp and srbp do not implement delta-bar-delta. 
 
   The idea behind the delta-bar-delta method is to let the program find 
its own learning rate for each weight.  The `e' sub-command sets the 
initial value for each of these learning rates.  When the program sees 
that the slope of the error surface averages out to be in the same 
direction for several iterations for a particular weight the program 
increases the eta value by an amount, kappa, given by the `k' parameter. 
The network will then move down this slope faster.  When the program 
finds the slope changes signs the assumption is that the program has 
stepped over to the other side of the minimum and so it cuts down the 
learning rate by the decay factor given by the `d' parameter.  For 
instance, a d value of 0.5 cuts the learning rate for the weight in 
half.  The `m' parameter specifies the maximum allowable value for an 
eta.  The `t' parameter (theta) is used to compute a running average of 
the slope of the weight and must be in the range 0 <= t < 1.  The 
running average at iteration i, a[i], is defined as: 
 
a[i] = (1 - t) * slope[i] + t * a[i-1], 
 
so small values for t make the most recent slope more important than the 
previous average of the slope.  Determining the learning rate for 
back-propagation automatically is, of course, very desirable and this 
method often speeds up convergence by quite a lot.  Unfortunately, bad 
choices for the delta-bar-delta parameters give bad results and a lot of 
experimentation may be necessary.  If you have n patterns in the 
training set try starting e and k around 1/n.  The n parameter is an 
experimental noise term that is only used in the integer version.  It 
changes a weight in the wrong direction by the amount indicated when the 
previous weight change was 0 and the new weight change would be 0 and 
the slope is non-zero.  (I found this to be effective in an integer 
version of quickprop so I tossed it into delta-bar-delta as well.  If 
you find this helps please let me know.)  For more on delta-bar-delta 
see "Increased Rates of Convergence Through Learning Rate Adaptation"
by Robert A. Jacobs, in Neural Networks, Volume 1, Number 4, 1988.
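
   The rule described in this section can be written out for a single
weight as the Python sketch below.  It is a reading of the text above
and of Jacobs' paper, not the simulator's code.  In particular it
assumes that the current slope is compared with the previous running
average and that the decay multiplies eta by d.  The noise term is left
out and the function name is made up.

```python
def delta_bar_delta_step(weight, eta, avg, slope,
                         kappa=0.25, decay=0.5, eta_max=10.0, theta=0.7):
    """One delta-bar-delta update; returns the new (weight, eta, avg)."""
    if slope * avg > 0.0:
        eta = min(eta + kappa, eta_max)   # same direction: raise eta by kappa
    elif slope * avg < 0.0:
        eta = eta * decay                 # sign change: cut eta (assumed eta * d)
    weight = weight - eta * slope         # plain gradient step, no momentum
    avg = (1.0 - theta) * slope + theta * avg   # a[i] = (1-t)*slope[i] + t*a[i-1]
    return weight, eta, avg


weight, eta, avg = 0.0, 0.1, 0.0
for slope in (1.0, 1.0, -1.0):            # two steps downhill, then an overshoot
    weight, eta, avg = delta_bar_delta_step(weight, eta, avg, slope)
    print(weight, eta, avg)
```

In this run eta grows after the second step, when the slope keeps its
sign, and is halved after the third, when the sign flips.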
 
 
12. Quickprop (qp) 
------------------ 
    Quickprop (see "Faster-Learning Variations on Back-Propagation: An
Empirical Study", by Scott E. Fahlman, in Proceedings of the 1988
Connectionist Models Summer School, Morgan Kaufmann, 1989, or ftp to
archive.cis.ohio-state.edu and look in the directory pub/neuroprose for
the file fahlman.quickprop-tr.ps.Z) may be one of the fastest network
training algorithms.  It is loosely based on Newton's method.
 
   The parameter mu is used to limit the size of the weight change to 
less than or equal to mu times the previous weight change.  Fahlman 
suggests mu = 1.75 is generally quite good so this is the initial value 
for mu but slightly larger or slightly smaller values are sometimes 
better. 
 
   To get the process started quickprop makes the typical backprop 
weight change of - eta * slope.  I have found that a good value for the 
quickprop eta value is around 1 / n or 2 / n where n is the number of 
patterns in the training set.  Other sources often use much larger 
values.  In addition Fahlman uses this term at other times.  I had to 
wonder if this was a good idea so in this code I've included a 
capability to add it in or not add it in.  So far it seems to me that 
sometimes adding in this extra term helps and sometimes it doesn't.  The 
default is to use the extra term. 
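
   For readers who want to see the shape of the calculation, here is
a simplified Python sketch of a quickprop step for one weight.  It
follows the usual description of Fahlman's method (fit a parabola
through the last two slopes, limit growth with mu) and is not a copy
of the code in these simulators.  Decay and noise are left out, slope
means dE/dw, and the function name and the add_slope argument are
invented for the example.

```python
def quickprop_step(slope, prev_slope, prev_dw, eta=0.5, mu=1.75,
                   add_slope=True):
    """Return the weight change for one weight (slope is dE/dw)."""
    if prev_dw == 0.0:
        return -eta * slope                   # start up with a plain backprop step
    shrink = mu / (1.0 + mu)
    same_direction = slope * prev_slope > 0.0
    denom = prev_slope - slope
    if same_direction and abs(slope) >= shrink * abs(prev_slope):
        dw = mu * prev_dw                     # parabola step too big: limit to mu times
    elif denom != 0.0:
        dw = prev_dw * slope / denom          # jump to the minimum of the parabola
    else:
        dw = 0.0                              # both slopes are zero: nothing to do
    if add_slope and same_direction:
        dw += -eta * slope                    # the optional extra slope term
    return dw


# Minimize E = (w - 3)^2 / 2, whose slope is w - 3, without the extra term.
w, dw, prev_slope = 0.0, 0.0, 0.0
for _ in range(2):
    slope = w - 3.0
    dw = quickprop_step(slope, prev_slope, dw, add_slope=False)
    w, prev_slope = w + dw, slope
print(w)   # the error is exactly quadratic, so two steps reach w = 3
```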
 
   Another factor involved in quickprop comes about from the fact that 
the weights often grow very large very quickly.  To minimize this 
problem there is a decay factor designed to keep the weights small. 
The weight decay is implemented by decreasing the value of the slope 
and it is different from the general weight decay that people use and 
which is also implemented in this software.  Fahlman recently mentioned
that he now does not use this decay unless the weights get very large.
I've found that too large a decay factor can stall out the learning
process so that if your network isn't learning fast enough or isn't
learning at all one possible fix is to decrease the decay factor.
Note:  in the old free version the value of the weight decay constant
was the value you entered / 1000 in order to allow small weight decay
values in the integer version.  In this version the problem is handled
differently so that what you enter is exactly what you get, not the
value divided by 1000.
 
   I built in one additional feature for the integer version.  I found 
that by adding small amounts of noise the time to convergence can be 
brought down and the number of failures can be decreased somewhat.  This 
seems to be especially true when the weight changes get very small.  The 
noise consists of moving uphill in terms of error by a small amount when 
the previous weight change was zero.  Good values for the noise seem to 
be around 0.005. 
 
   The parameters for quickprop are all set in the `qp' command like 
so: 
 
qp d <value>  * set the weight decay factor for all layers to <value>
qp d h 0      * the default weight decay for hidden layer units 
qp d o 0.0001 * the default weight decay for output layer units 
qp e 0.5      * the default value for eta 
qp m 1.75     * the default value for mu 
qp n 0        * the default value for noise 
qp s+         * the default value is to always include the slope 
 
or a whole series can go on one line: 
 
qp d 0.1 e 0.5 m 1.75 n 0 s+ 
 
 
13. Making a Network (m) 
------------------------ 
   In the simplest form of the make a network command you type an `m' 
followed by the number of units in each layer as in: 
 
m 8 4 4 2 
 
Most of the time this type of network is all you will ever need but
there are others that can be tried and which may sometimes work
better.  One innovation that often speeds up learning is to include
extra connections between the input and output layers.  To get this 
type of network you add an x to the end of the m command as in: 
 
m 8 4 2 x 
 
These extra connections are said to be important when the problem to 
be solved is almost linear and then the hidden layer units provide some 
extra corrections to the output neurons to distort the results from a 
purely linear model. 
 
   In the student version every time you made a network all the training 
and testing patterns were thrown out because they were attached to the 
network.  (Not true in the pro version.) 
 
   To make a recurrent network with 25 regular input units, twenty 
hidden layer units (that are copied to the input layer) and 25 output 
units use: 
 
m 25+20 20 25 
 
This means that the first layer will have 45 inputs and the first 25 are 
regular input values but the next 20 come from the first hidden layer. 
These 20 units are called the short term memory units.  Then there are 
20 units in the hidden layer.  This value should match the number of 
units given for the short term memory units.  At the moment there is no 
check to see that it does.  Finally there are 25 units in the output 
layer.  This recurrent network notation also requires a change in the 
way training and testing patterns are written down for input into the 
program.  For more on this see the next section. 
 
14. Recurrent Networks 
---------------------- 
   Recurrent back-propagation networks take values from hidden layer 
and/or output layer units and copy them down to the input layer for use 
with the next input.  These values that are copied down are a kind of 
coded record of what the recent inputs to the network have been and this 
gives a network a simple kind of short-term memory, possibly a little 
like human short-term memory.  For instance, suppose you want a network 
to memorize the two short sequences, "acb" and "bcd".  In the middle of 
both of these sequences is the letter, `c'.  In the first case you want 
a network to take in `a' and output `c', then take in `c' and output 
`b'.  In the second case you want a network to take in `b' and output 
`c', then take in `c' and output `d'.  To do this a network needs a 
simple memory of what came before the `c'. 
 
   Let the network be a 7-3-4 network where input units 1-4 and output
units 1-4 stand for the letters a-d and the `h' stands for the value of
a hidden layer unit.  So the codes are:
 
a: 1000 
b: 0100 
c: 0010 
d: 0001 
 
In action, the network needs to do the following.  When `a' is input,
`c' must be output:
 
   0010     <- output layer 
 
   hhh      <- hidden layer 
 
1000 stm    <- input layer 
 
In this context, when `c' is input, `b' should be output: 
 
   0100 
 
   hhh 
 
0010 stm 
 
For the other string, when `b' is input, `c' is output: 
 
   0010 
 
   hhh 
 
0100 stm 
 
and when `c' is input, `d' is output:
 
   0001 
 
   hhh 
 
0010 stm 
 
This is easy to do if the network keeps a short-term memory of what its 
most recent inputs have been.  Suppose we input a and the output is c: 
 
   0010     <- output layer 
 
   hhh      <- hidden layer 
 
1000 stm    <- input layer 
 
Placing `a' on the input layer generates some kind of code (like a hash 
code) on the 3 units in the hidden layer.  On the other hand, placing 
`b' on the input units will generate a different code on the hidden 
units.  All we need to do is save these hidden unit codes and input them 
with a `c'.  In one case the network will output `b' and in the other 
case it will output `d'.  In one particular run inputting `a' produced: 
 
     0  0  1  0 
 
  0.993 0.973 0.020 
 
 1  0  0  0  0  0  0 
 
When `c' is input the hidden layer units are copied down to input to 
give: 
 
        0  1  0  0 
 
    0.006 0.999 0.461 
 
0  0  1  0  0.993 0.973 0.020 
 
For the other pattern, inputting `b' gave: 
 
    0  0  1  0 
 
  0.986 0.870 0.020 
 
0  1  0  0  0  0  0 
 
Then the input of `c' gave: 
 
          0  0  0  1 
 
      0.005 0.999 0.264 
 
0  0  1  0  0.986 0.870 0.020 
 
   This particular problem can be set up as follows: 
 
m 7 3 4 
s 7 
ci 
t 0.2 
rt { 
1000 H   0010 
0010 H   0100 
 
0100 H   0010 
0010 H   0001 
} 
 
where the first four values on each line are the normal input.  The H 
codes for however many hidden layer units there are.  The last four 
values are the desired outputs. 
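
   What the H stands for can be pictured with a small Python sketch:
the full input vector is the regular input followed by the previous
hidden layer values, which are taken to be 0 at the start of a sequence
as in the runs shown above.  The function name is hypothetical and how
the simulator itself decides where a sequence starts is not covered
here.

```python
def build_input(regular_inputs, previous_hidden=None, n_hidden=3):
    """Regular inputs followed by the short-term memory (old hidden values)."""
    if previous_hidden is None:
        previous_hidden = [0.0] * n_hidden   # start of a sequence: empty memory
    return list(regular_inputs) + list(previous_hidden)


# `a' at the start of a sequence, then `c' with the hidden values that `a' produced.
print(build_input([1, 0, 0, 0]))
print(build_input([0, 0, 1, 0], [0.993, 0.973, 0.020]))
```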
 
   By the way, this simple problem does not converge particularly fast 
and you may need to do a number of runs before you hit on initial values 
that will work quickly.  It will work more reliably with more hidden 
units. 
 
   Rather than using recurrent networks to memorize sequences of letters 
they are probably more useful at predicting the value of some variable 
at time t+1 given its value at t, t-1, t-2, ... .  A very simple
example of this is to give the value of sin(t+1) given a recent history
of inputs to the net.  Given a value of sin(t) the curve may be going up
or down and the net needs to keep track of this in order to correctly
predict the next value.  The following setup will do this:
 
m 1+5 5 1 
f ir 
a aol dd uq 
qp e 0.02 
ci 
rt { 
   0.00000  H   0.15636 
   0.15636  H   0.30887 
   0.30887  H   0.45378 
 
   . . . 
 
  -0.15950  H  -0.00319 
  -0.00319  H   0.15321 
} 
 
and in fact it converges rather rapidly.  The complete set of data can 
be found in the example file rsin.bp. 
 
   Another recurrent network included in the examples is one designed to 
memorize two lines of poetry.  The two lines were: 
 
   I the heir of all the ages in the foremost files of time 
 
   For I doubt not through all the ages ones increasing purpose runs 
 
but for the sake of making the problem simpler each word was shortened 
to 5 characters giving: 
 
   i the heir of all the ages in the frmst files of 
 
   time for i doubt not thru the ages one incre purpo runs 
 
The letters were coded by taking the last 5 bits of their ASCII codes. 
See the file poetry.bp.   
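
   Taking the last 5 bits of an ASCII code can be sketched in Python
as below.  The bit order (most significant bit first) and the function
names are assumptions made for this example; see poetry.bp for the
coding actually used, including how words shorter than 5 letters are
handled.

```python
def letter_bits(ch):
    """The low 5 bits of a character's ASCII code, most significant bit first."""
    code = ord(ch) & 0b11111
    return [(code >> shift) & 1 for shift in (4, 3, 2, 1, 0)]


def word_bits(word):
    """Concatenate the 5-bit codes of every letter in a word."""
    bits = []
    for ch in word:
        bits.extend(letter_bits(ch))
    return bits


print(letter_bits("a"))          # `a' is ASCII 97, whose low 5 bits are 00001
print(len(word_bits("frmst")))   # a 5-letter word needs 25 input units
```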
 
   Once upon a time I was wondering what would happen if the poetry 
network learned its verses and then the program was given several words 
in the middle of the verses.  Would it pick up the sequence and be able 
to complete it given 1 or 2 or 3 or n words?  So given for example, the 
short sequence "for i doubt" will it be able to "get on track" and 
finish the verse?  To test for this there is an extra pair of commands,
tr and trp.  Given a test set (which should be the training set) they 
start at every possible place in the test set, input n words and then 
check to see if the net produces the right answer.  For this example I 
tried n = 3, 4, 5, 6 and 7 with the following results: 
 
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 3 
 TOL:  81.82 %  ERROR: 0.022967 
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 4 
 TOL:  90.48 %  ERROR: 0.005672 
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 5 
 TOL:  90.00 %  ERROR: 0.005974 
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 6 
 TOL: 100.00 %  ERROR: 0.004256 
[ACDFGMNPQTW?!acdefhlmopqrstw]? tr 7 
 TOL: 100.00 %  ERROR: 0.004513 
 
So after getting just 3 words the program was 81.82% right in predicting 
the next word to within the desired tolerance.  Given 6 or 7 words it 
was getting them all right.  The trp command does the same thing except 
it also prints the final output value for each of the tests made. 
 
 
15. Miscellaneous Commands 
-------------------------- 
   Below is a list of some miscellaneous commands, a short example of 
each and a short description of the command. 
 
 
!   Example: ! ls 
 
Anything after `!' will be passed on to the OS as a command to execute. 
An ! followed immediately by a carriage-return will repeat the last 
command sent to the OS. 
 
l   Example: l 2 
 
Entering "l 2" will print the values of the units on layer 2, or 
whatever layer is specified. 
 
sb  Example: sb -3 
 
Entering "sb -3" will set the bias unit weight to -3.  In the symmetric 
versions the weight will be frozen at this value while in the regular 
versions it will only be the initial value and should be set after the 
other weights are initialized. 
 
 
16. Limitations 
--------------- 
   Weights in the ibp and sibp programs are 16-bit integer weights where 
the real value of the weight has been multiplied by 1024.  The integer 
versions cannot handle weights less than -32 or greater than 31.999. 
The weight changes are all checked for overflow but there are other 
places in these programs where calculations can possibly overflow as 
well and none of these places are checked.  Input values for the integer 
versions can run from -31.992 to 31.999.  Due to the method used to 
implement recurrent connections, input values in the real version are 
limited to -31992.0 and above. 
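
   The weight limits of the integer versions follow directly from
storing weight * 1024 in 16 bits.  The Python sketch below shows the
arithmetic.  The names are invented and rounding to the nearest integer
is an assumption (the simulators may truncate instead).

```python
SCALE = 1024                        # real weights are multiplied by 1024
INT16_MIN, INT16_MAX = -32768, 32767


def to_fixed(weight):
    """Convert a real weight to its 16-bit integer form, refusing overflow."""
    fixed = round(weight * SCALE)   # assumption: round to nearest
    if not INT16_MIN <= fixed <= INT16_MAX:
        raise OverflowError(f"{weight} does not fit in a 16-bit weight")
    return fixed


def to_real(fixed):
    """Convert a 16-bit integer weight back to a real value."""
    return fixed / SCALE


print(to_real(INT16_MIN), to_real(INT16_MAX))   # the smallest and largest weights
print(to_fixed(-7.945858))                      # one of the xor weights as an integer
```

The largest value, 32767 / 1024, is just under 32, which is where the
31.999 figure comes from.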
 
 
17. The Pro Version Additions 
----------------------------- 
   This section lists the additions to the pro version at this time. 
For a more detailed and more up-to-date description see the online pro 
version manual at: 
 
http://www.mcs.com/~drt/probp.html 
 
The additional commands are: 
 
ac       add a weight connection between the units 
ah       add a hidden unit to  
b               benchmarking 
i     read input from the file 
k      give the network a kick 
n      dynamic network building parameters 
ofu       turn off a unit 
onu       turn on a unit 
ofw     turn off a weight 
onw     turn on a weight 
pw      prune weights 
rp              set rprop parameters 
s        set multiple seed values 
ss     set SuperSAB parameters 
swem