Feature Data Format and Experiments

Format of the Feature Data Files

The biometric feature data output from the mouse movement, stylometry, and keystroke projects will take the form of a file or a corresponding spreadsheet. The form of the file is as follows, with the fields in a record comma-delimited and the items within a field slash-delimited:

Feature Value Normalization

The following pseudo-code normalizes the feature values into the range 0-1.

for i = 1 to number_of_features
     min = feature_value (i,1) {initialize to the first sample's value}
     max = feature_value (i,1)
     for j = 2 to number_of_samples {find min and max}
          if feature_value (i,j) < min then min = feature_value (i,j)
          if feature_value (i,j) > max then max = feature_value (i,j)
     end
     for j = 1 to number_of_samples {normalize}
          if max > min then feature_value (i,j) = (feature_value (i,j) - min) / (max - min)
          else feature_value (i,j) = 0 {constant feature, avoid division by zero}
     end
end
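
For teams working in Python, the pseudo-code above could be sketched as follows. This is a minimal illustration, not the project's required implementation: the function name `normalize_features` and the row-per-sample layout (`samples[j][i]` is feature i of sample j) are assumptions, as is the choice to map a constant feature to 0.0 rather than divide by zero.

```python
from typing import List


def normalize_features(samples: List[List[float]]) -> List[List[float]]:
    """Min-max normalize each feature (column) into the range 0-1.

    samples[j][i] holds the value of feature i for sample j.
    """
    if not samples:
        return []
    num_features = len(samples[0])
    # Per-feature minimum and maximum over all samples.
    mins = [min(row[i] for row in samples) for i in range(num_features)]
    maxs = [max(row[i] for row in samples) for i in range(num_features)]

    normalized = []
    for row in samples:
        new_row = []
        for i, value in enumerate(row):
            span = maxs[i] - mins[i]
            # Assumption: a feature with the same value in every sample
            # (span == 0) is set to 0.0 to avoid dividing by zero.
            new_row.append((value - mins[i]) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Three samples, two features; the second feature is constant.
samples = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
result = normalize_features(samples)
print(result)
```

The function returns a new list and leaves the input unchanged, which makes it easier to keep the raw measurements around for later experiments.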

Examples of Feature Data Files

Mushroom data example, see Mushroom database
(the mushroom data are used in Dr. Cha's Data Mining class)

Mushroom data example created September 2007 for illustrative purposes
8124
edible, ?, ?, ?, 22, b, f, n, t, a, a, c, b, k, e, ?, f, f, n, n, p, n, n, c, k, a, g
poisonous, ?, ?, ?, 22, b, g, n, t, a, a, c, b, k, e, b, y, y, n, n, p, n, n, c, k, a, m
etc. for a total of 8124 pattern instance records

Mouse movement biometric data example

Mouse movement biometric data example created September 2007
4
MaryJones/F/26, left-handed, Dell mouse, fixed 10-button sequence/used right hand, 2, 0.13668, 0.53375
MaryJones/F/26, left-handed, Dell mouse, fixed 10-button sequence/used right hand, 2, 0.14378, 0.56275
JohnSmith/M/27, right-handed, optical mouse, random 10-button sequence/used right hand, 2, 0.53628, 0.43865
JohnSmith/M/27, right-handed, optical mouse, random 10-button sequence/used right hand, 2, 0.43628, 0.53865

Stylometry biometric data example

Stylometry biometric data example created September 2007
6
MaryJones/F/26, bachelors degree, Dell laptop, structured email task, 2, 0.13668, 0.53375
MaryJones/F/26, bachelors degree, Dell laptop, structured email task, 2, 0.14378, 0.56275
JohnSmith/M/27, masters degree, Compaq handheld, free email task, 2, 0.53628, 0.43865
JohnSmith/M/27, masters degree, Compaq handheld, free email task, 2, 0.43628, 0.53865
ChrisHill/F/02-04-1983, PhD degree, Dell desktop, free email task, 2, 0.39734, 0.92862
ChrisHill/F/02-04-1983, PhD degree, Dell desktop, free email task, 2, 0.49924, 0.98861

Keystroke biometric data example

Keystroke biometric data example created September 2007
8
MaryJones/F/08-01-1981, left-handed, Dell laptop, copy task, 2, 0.13668, 0.53375
MaryJones/F/08-01-1981, left-handed, Dell laptop, copy task, 2, 0.14378, 0.56275
JohnSmith/M/06-01-1980, right-handed, Dell laptop, email task, 2, 0.53628, 0.43865
JohnSmith/M/06-01-1980, right-handed, Dell laptop, email task, 2, 0.43628, 0.53865
JohnSmith/M/04-21-1982, left-handed, Dell desktop, copy task, 2, 0.88321, 0.43464
JohnSmith/M/04-21-1982, left-handed, Dell desktop, copy task, 2, 0.78721, 0.33262
ChrisHill/F/02-04-1983, right-handed, Dell desktop, email task, 2, 0.39734, 0.92862
ChrisHill/F/02-04-1983, right-handed, Dell desktop, email task, 2, 0.49924, 0.98861
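
A reader for files like the examples above might look like the sketch below. The layout it assumes is inferred from the four examples only: a title line, a line with the number of records, then one record per line whose first field is the class label, followed by three descriptive fields, the number of features, and the feature values. The names `Record`, `split_items`, and `parse_feature_file`, and the `num_descriptors` parameter, are hypothetical. Feature values are kept as strings because the mushroom features are letters rather than numbers.

```python
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    label: List[Optional[str]]              # items of the first field
    descriptors: List[List[Optional[str]]]  # items of each descriptive field
    features: List[Optional[str]]           # feature values, None if "?"


def split_items(field: str) -> List[Optional[str]]:
    """Split a field on '/' and map '?' (unknown) to None."""
    items = [item.strip() for item in field.split("/")]
    return [None if item == "?" else item for item in items]


def parse_feature_file(text: str, num_descriptors: int = 3) -> Tuple[str, List[Record]]:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title, declared_records = lines[0], int(lines[1])
    count_index = 1 + num_descriptors  # assumed position of the feature count
    records = []
    for line in lines[2:]:
        fields = [field.strip() for field in line.split(",")]
        declared_features = int(fields[count_index])
        values = fields[count_index + 1:]
        if len(values) != declared_features:
            raise ValueError(f"expected {declared_features} features in {line!r}")
        records.append(Record(
            label=split_items(fields[0]),
            descriptors=[split_items(f) for f in fields[1:count_index]],
            features=[None if v == "?" else v for v in values],
        ))
    if len(records) != declared_records:
        raise ValueError(f"expected {declared_records} records, found {len(records)}")
    return title, records


example = """Mouse movement biometric data example created September 2007
4
MaryJones/F/26, left-handed, Dell mouse, fixed 10-button sequence/used right hand, 2, 0.13668, 0.53375
MaryJones/F/26, left-handed, Dell mouse, fixed 10-button sequence/used right hand, 2, 0.14378, 0.56275
JohnSmith/M/27, right-handed, optical mouse, random 10-button sequence/used right hand, 2, 0.53628, 0.43865
JohnSmith/M/27, right-handed, optical mouse, random 10-button sequence/used right hand, 2, 0.43628, 0.53865
"""
title, records = parse_feature_file(example)
first_features = [float(v) for v in records[0].features]
print(title, len(records), first_features)
```

Checking the declared record and feature counts against what is actually read catches truncated or mis-delimited files early, before they reach the backend experiments.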

Notes on the Feature Data Files

1) A "?" is used to indicate a data item that is unknown, unavailable, or not relevant.
2) The mushroom example has two pattern classes: edible and poisonous. The mouse movement example also has two classes: Mary Jones and John Smith. The stylometry example has three classes: Mary Jones, John Smith, and Chris Hill. The keystroke example has four classes: Mary Jones, John Smith born 1980, John Smith born 1982, and Chris Hill. This information is implicit in the data but not explicitly specified.
3) Although the biometric feature measurements are shown above with five decimal places for simplicity, eight or ten decimal places are recommended for the actual project data.
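
Note 2 says the pattern classes are implicit in the data. One way to recover them, sketched below, is to treat the entire first comma-delimited field as the class label and count the distinct values. That choice of label is an assumption, but it reproduces the four keystroke classes described in note 2, since the two John Smiths differ in their birth dates.

```python
from collections import Counter

# Records from the keystroke example (title and count lines omitted).
keystroke_records = [
    "MaryJones/F/08-01-1981, left-handed, Dell laptop, copy task, 2, 0.13668, 0.53375",
    "MaryJones/F/08-01-1981, left-handed, Dell laptop, copy task, 2, 0.14378, 0.56275",
    "JohnSmith/M/06-01-1980, right-handed, Dell laptop, email task, 2, 0.53628, 0.43865",
    "JohnSmith/M/06-01-1980, right-handed, Dell laptop, email task, 2, 0.43628, 0.53865",
    "JohnSmith/M/04-21-1982, left-handed, Dell desktop, copy task, 2, 0.88321, 0.43464",
    "JohnSmith/M/04-21-1982, left-handed, Dell desktop, copy task, 2, 0.78721, 0.33262",
    "ChrisHill/F/02-04-1983, right-handed, Dell desktop, email task, 2, 0.39734, 0.92862",
    "ChrisHill/F/02-04-1983, right-handed, Dell desktop, email task, 2, 0.49924, 0.98861",
]

# Assumption: the whole first field (name/gender/birth date) is the class label.
labels = [record.split(",")[0].strip() for record in keystroke_records]
class_counts = Counter(labels)

for label, count in class_counts.items():
    print(label, count)
```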

Anticipated Data and Experiments

The following summarizes the data needed and the backend experiments we will perform.

Mouse: 
  Set 1: 50 samples (5 from each of 10 subjects, provided by Team 1 no later than November 26)
  Authentication Experiments conducted by Teams 5 and 6
    Train and Test on Different Subjects but Same Conditions
      Set 1: Train on 5 subjects, test on other 5 subjects
      Set 1: Reverse - train on the 5 test subjects of the previous experiment, test on the other 5 subjects
      (Team 5 will provide dichotomy-model training and test sets of 50 within-class
       and 250 between-class samples for each of the two 5-subject data sets)
  Identification Experiments conducted by Team 6
    Train and Test on Same Subjects but Different Samples (use the leave-one-out procedure)
      (identification is the 1-out-of-n problem in contrast to the yes/no authentication one)
Stylometry: 
  Set 1: 70 samples (10 from each of 7 subjects, provided by Team 2 no later than November 26)
  Authentication Experiments conducted by Teams 5 and 6 
    Train and Test on Different Subjects but Same Conditions
      Set 1: Train on 4 subjects, test on other 3 subjects
      Set 1: Reverse - train on the 3 test subjects of the previous experiment, test on the other 4 subjects
      (Team 5 will provide two dichotomy-model data sets:
       for 4 subjects we will have 180 within-class and 600 between-class samples,
       and for 3 subjects we will have 135 within-class and 300 between-class samples)
  Identification Experiments conducted by Team 6
    Train and Test on Same Subjects but Different Samples (use the leave-one-out procedure)
Keystroke:
  Set 1. Old data - 180 desktop/copy samples (5 from each of 36 subjects)
  Set 2. Old data - 180 desktop/free samples (5 from each of 36 subjects)
  Set 3. Old data - 180 laptop/copy samples (5 from each of 36 subjects)
  Set 4. Old data - 180 laptop/free samples (5 from each of 36 subjects)
    (above data sets provided by Team 4 no later than November 26)
  Authentication Experiments conducted by Teams 5 and 6 
    Train and Test on Different Subjects but Same Conditions
      Set 1: train on 18 subjects - test on other 18 subjects 
      Set 2: train on 18 subjects - test on other 18 subjects
      Set 3: train on 18 subjects - test on other 18 subjects
      Set 4: train on 18 subjects - test on other 18 subjects
      (Team 5 will provide dichotomy-model training and test sets of 180 within-class
       and 500 [randomly select 500 from larger number] between-class samples
       for each of the four data sets above)
    Train and Test on Same Subjects but Different Conditions
      Sets 1 & 2: train on 5 samples from each of 36 subjects in first set, test on second set
      Sets 2 & 1: (reverse training and test sets)
      Sets 3 & 4: train on 5 samples from each of 36 subjects in first set, test on second set
      Sets 4 & 3: (reverse training and test sets)
      Sets 1 & 3: train on 5 samples from each of 36 subjects in first set, test on second set
      Sets 3 & 1: (reverse training and test sets)
      Sets 2 & 4: train on 5 samples from each of 36 subjects in first set, test on second set
      Sets 4 & 2: (reverse training and test sets)
      Sets 1 & 4: train on 5 samples from each of 36 subjects in first set, test on second set
      Sets 4 & 1: (reverse training and test sets)
      Sets 2 & 3: train on 5 samples from each of 36 subjects in first set, test on second set
      Sets 3 & 2: (reverse training and test sets)
      (Team 5 will provide a set of 360 within-class and 500 between-class samples
       from each of the four data sets - Set 1, Set 2, Set 3, and Set 4)
  Identification Experiments conducted by Team 6
    Train and Test on Same Subjects but Different Conditions (same as above)
  -----------------------------------------------------------------------------------
  Set 1a. New data - 20 desktop/copy samples (5 from each of 4 subjects)
  Set 1b. New data - 20 desktop/free samples (5 from each of 4 subjects)
  Set 1c. New data - 20 laptop/copy samples (5 from each of 4 subjects)
  Set 1d. New data - 20 laptop/free samples (5 from each of 4 subjects)
  Set 2a. New data - 20 desktop/copy samples (5 from each of 4 subjects) two weeks later
  Set 2b. New data - 20 desktop/free samples (5 from each of 4 subjects) two weeks later
  Set 2c. New data - 20 laptop/copy samples (5 from each of 4 subjects) two weeks later
  Set 2d. New data - 20 laptop/free samples (5 from each of 4 subjects) two weeks later
  Set 3a. New data - 20 desktop/copy samples (5 from each of 4 subjects) four weeks later
  Set 3b. New data - 20 desktop/free samples (5 from each of 4 subjects) four weeks later
  Set 3c. New data - 20 laptop/copy samples (5 from each of 4 subjects) four weeks later
  Set 3d. New data - 20 laptop/free samples (5 from each of 4 subjects) four weeks later
    (sets 1a-1d and 2a-2d provided by Team 4 no later than Nov 26, and sets 3a-3d no later than Dec 3)
  Longitudinal Authentication Experiments conducted by Teams 5 and 6 
    Train and Test on Same Subjects but Different Conditions (Time of Data Capture)
      Sets 1a-2a: train on 5 samples from each of 4 subjects in first set, test on second set
      Sets 1b-2b: train on 5 samples from each of 4 subjects in first set, test on second set
      Sets 1c-2c: train on 5 samples from each of 4 subjects in first set, test on second set
      Sets 1d-2d: train on 5 samples from each of 4 subjects in first set, test on second set
      Sets 1a-3a: train on 5 samples from each of 4 subjects in first set, test on second set
      Sets 1b-3b: train on 5 samples from each of 4 subjects in first set, test on second set
      Sets 1c-3c: train on 5 samples from each of 4 subjects in first set, test on second set
      Sets 1d-3d: train on 5 samples from each of 4 subjects in first set, test on second set
      (Team 5 will provide dichotomy-model training and test sets of 40 within-class
       and 150 between-class samples for each of the 12 data sets above)
  Longitudinal Identification Experiments conducted by Team 6
    Train and Test on Same Subjects but Different Conditions (same as above)
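
The within-class and between-class sample counts quoted above are consistent with pairing every two samples in a data set: a pair from the same subject counts as within-class, and a pair from different subjects counts as between-class. The sketch below checks the arithmetic under that assumption. The function name `dichotomy_pair_counts` is hypothetical, and the pairing rule is inferred from the counts rather than stated in this document.

```python
from math import comb
from typing import Tuple


def dichotomy_pair_counts(num_subjects: int, samples_per_subject: int) -> Tuple[int, int]:
    """Return (within-class, between-class) pair counts for a data set."""
    # Pairs of samples that come from the same subject.
    within = num_subjects * comb(samples_per_subject, 2)
    # All unordered pairs of samples, regardless of subject.
    total_pairs = comb(num_subjects * samples_per_subject, 2)
    return within, total_pairs - within


cases = [
    ("Mouse, 5 subjects x 5 samples", 5, 5),
    ("Stylometry, 4 subjects x 10 samples", 4, 10),
    ("Stylometry, 3 subjects x 10 samples", 3, 10),
    ("Keystroke, 18 subjects x 5 samples", 18, 5),
    ("Keystroke, 36 subjects x 5 samples", 36, 5),
    ("Keystroke longitudinal, 4 subjects x 5 samples", 4, 5),
]
for name, subjects, per_subject in cases:
    within, between = dichotomy_pair_counts(subjects, per_subject)
    print(f"{name}: {within} within-class, {between} between-class")
```

For the 18-subject and 36-subject keystroke sets this gives 3825 and 15750 between-class pairs respectively, which is the larger number from which the 500 between-class samples would be randomly selected.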