   DataBus - Vol 40 No 4: June-July, 2000
  
CSIS Participants to Use New Approach In Sharing Student Identifiers


Several consortia comprising some seventy districts and county offices of education are working with the California School Information Services (CSIS) Program to develop and implement working solutions that will enable the accurate and timely exchange of student transcripts and help local education agencies electronically transmit reports to the California Department of Education. (A local education agency, or LEA, may be a school, district, county office of education or other agency providing services to students.) A significant aspect of this work is uniquely identifying students without compromising confidentiality. The research and design work done by CSIS and the participating consortia on the issues surrounding this need is unprecedented in the nation.
 
Overview
 
Inherent in the CSIS Program is the need to establish a means to distinguish records for the approximately 6,000,000 California K12 students. Both records transfer and state reporting require multiple years of data, a problem compounded by the fact that students attend various institutions over those years. The goal is that each student's records be uniquely distinguishable from those of other students and consistently identifiable over the student's entire academic career, from kindergarten through high school graduation.
 
To achieve this goal, it is necessary to assign every student a unique, unambiguous and persistent identifier that will stay with the student as they move among districts within the state.
 
It is incumbent on CSIS and all local education agencies to provide extraordinary protection of any and all personally identifiable data elements.
 
The Challenge
 
Given that the responsibility for confidentiality of student data exists at the local level, it is very desirable to hold the personal data required to establish identifiers as close as possible to the local agency that maintains the students' records. However, the process to persistently and reliably determine unique student identities is enhanced by having many personally identifiable data elements. Several other states have addressed this problem. Research into other systems with a need to create a unique personal identifier indicates that useful data elements that make up an identifier include:
  • Student's Legal Name
  • Student's also-known-as (AKA) names
  • Parents' Name(s)
  • Gender
  • Ethnicity
  • Birth Date
  • Birth Place
  • Other potentially sensitive demographic data
The Solution
 
Working with consortia members, CSIS has devised a unique strategy for establishing the Identifier based upon three key recommendations:
  1. Use "Soundex" encoding (see note 1) for names and birthplaces to avoid sending these very personal data elements outside of the local education agency systems. Further, scramble the soundex codes so that even the phonetic representations of the names cannot be reverse-engineered if the database content is accidentally viewed.
  2. Provide a CSIS utility to standardize all soundex transformations, along with other data manipulations necessary for CSIS data exchange.
  3. Use a focused-random number generation utility to assign non-intelligent identifiers for each K12 student.
Recommendation 1--Soundex
 
The soundex is a coded name index based on the way a name sounds rather than the way it is spelled. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.
 
Basic Soundex Coding Rule

Every soundex code consists of a letter and three numbers, such as W-252. The letter is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the following scheme:
 
Number   Represents the Letters
1        B, F, P, V
2        C, G, J, K, Q, S, X, Z
3        D, T
4        L
5        M, N
6        R

Disregard the letters A, E, I, O, U, H, W and Y.
Zeroes are added at the end if necessary to produce a four-character code. Additional letters are disregarded.
 
For example, Washington is coded W-252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded).
 
For example, Lee is coded L-000 (L, 000 added).

Additional Soundex Coding Rules

If the surname has any double letters, they should be treated as one letter.
 
For example, Gutierrez is coded G-362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).
 
If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter.
 
For example, Pfister is coded as P-236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).
 
For example, Jackson is coded as J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added).
 
If a surname has a prefix, such as Van, Con, De, Di, La, or Le, code both with and without the prefix because the surname might be listed under either code. Note, however, that Mc and Mac are not considered prefixes.
 
For example, VanDeusen might be coded two ways:
V-532 (V, 5 for the N, 3 for the D, 2 for the S)
or
D-250 (D, 2 for the S, 5 for the N, 0 added).
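
The coding rules above are compact enough to sketch in a few lines of Python. The sketch below is only an illustration, not the CSIS utility. The function names soundex and soundex_variants are hypothetical, the code follows the rules exactly as stated in this article (only side-by-side letters with the same number are merged), and the test for a prefix (a capital letter, space, hyphen or apostrophe following Van, Con, De, Di, La or Le) is an assumed heuristic. Some published versions of the Soundex rules also merge same-numbered consonants separated by H or W; that refinement is not described here and is not implemented.

```python
# Letter-to-number table from the article; all other letters are disregarded.
CODES = {}
for digit, group in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                     ("4", "L"), ("5", "MN"), ("6", "R")):
    for letter in group:
        CODES[letter] = digit

# Prefixes named in the article. Mc and Mac are deliberately absent.
PREFIXES = ("VAN", "CON", "DE", "DI", "LA", "LE")


def soundex(surname: str) -> str:
    """Return a code such as 'W-252' for a surname."""
    # Assumption: only the letters A-Z are considered; accented letters are dropped.
    letters = [ch for ch in surname.upper() if "A" <= ch <= "Z"]
    if not letters:
        raise ValueError("surname contains no letters A-Z")
    digits = []
    previous = CODES.get(letters[0], "")  # the first letter's own number counts
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Skip disregarded letters and side-by-side letters with the same number.
        if code and code != previous:
            digits.append(code)
        previous = code
    # Keep three digits, padding with zeroes when the name is short.
    return letters[0] + "-" + "".join(digits)[:3].ljust(3, "0")


def soundex_variants(surname: str) -> list:
    """Code the surname as written and, if it seems to have a prefix, without it."""
    surname = surname.strip()
    codes = [soundex(surname)]
    for prefix in PREFIXES:
        head, rest = surname[:len(prefix)], surname[len(prefix):]
        # Assumed heuristic: a prefix is followed by a capital or a separator.
        looks_like_prefix = bool(rest) and (rest[0].isupper() or rest[0] in " -'")
        has_letters = any("A" <= ch <= "Z" for ch in rest.upper())
        if head.upper() == prefix and looks_like_prefix and has_letters:
            code = soundex(rest)
            if code not in codes:
                codes.append(code)
    return codes
```

With these definitions, soundex("Washington") gives W-252 and soundex_variants("VanDeusen") gives both V-532 and D-250, matching the worked examples above. The prefix heuristic depends on mixed-case input, so names stored entirely in capitals would need a different test.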
 
Recommendation 2--CSIS Utility
 
To ease implementation and support issues, a common utility developed by CSIS is to be used for each consortium's preparation and hand-off of data to CSIS. After examining several potential architectures, from completely centralized to totally decentralized, CSIS and the consortia recommend a hybrid architecture to achieve acceptable levels of efficiency without sacrificing privacy.
 
The strategy is to employ a centralized database containing only non-personally identifiable data, with locator data formatted locally. The centralized locator database at CSIS provides extraordinary separation of personally identifiable data from any other information about a student. All formatting, including soundex encoding, is performed at the local education agency, so personally identifiable data stays at the local level except when records are transferred to other local agencies.
 
The actual soundex code will be scrambled and then encrypted before it leaves the local agency. Once it is received at the CSIS server, it will be decrypted but not descrambled, providing another level of security for the locator database. Eventually, this utility will be a callable routine, allowing vendors to embed it within their proprietary systems.
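
One way to picture the local preparation step is shown below. The article does not describe the scrambling algorithm, so this sketch substitutes a keyed one-way hash (HMAC-SHA-256 from Python's standard library) purely as a stand-in: equal soundex codes still yield equal scrambled values, so a central locator could compare them, but the phonetic code itself is not readable. Unlike the scramble described above, a hash cannot be descrambled at all. The key, the function names and the record fields are hypothetical, and the encryption applied for transmission is left out.

```python
import hashlib
import hmac


def scramble(soundex_code: str, key: bytes) -> str:
    """Return a keyed, one-way stand-in for the scrambled soundex code."""
    message = soundex_code.strip().upper().encode("ascii")
    # Keep 16 hex characters; the length is an arbitrary choice for this sketch.
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


def prepare_locator_fields(surname_code: str, birthplace_code: str, key: bytes) -> dict:
    """Build the name-derived fields that would leave the local agency."""
    return {
        "surname_key": scramble(surname_code, key),
        "birthplace_key": scramble(birthplace_code, key),
    }


# Hypothetical key; a real deployment would manage and distribute it securely.
DEMO_KEY = b"example-key-not-for-production"

# Soundex codes for the surname Washington and the birthplace Sacramento.
fields = prepare_locator_fields("W-252", "S-265", DEMO_KEY)
```

Because the soundex codes are replaced before transmission, someone who accidentally views the central records sees only opaque values such as the 16-character strings in fields.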
 
Recommendation 3--Calculating the Identifier
 
The final recommendation is how to actually calculate the student identifier. There are basically two schools of thought on "id code" assignment: random vs. intelligent numbering. An "intelligent" identifier is one that carries embedded data that attempts to personally identify the individual. Embedded information typically adds to the length of a number. For example, county, district and school codes together add 14 digits to a number, and a birth date adds eight. All of this information can be carried elsewhere, which reduces data entry requirements and security requirements on the identifier itself and lowers the potential for errors. With "random" numbering, the identifier itself contains no intelligence.
 
CSIS has adopted a calculation method that is nearly random:
  • Identifiers are 10 digits – all numeric.
  • The 10th digit is a check digit.
  • The 1st digit is never zero.
  • There are no occurrences of more than 2 repeating digits (e.g. 333 is not allowed, but 33 is ok).
  • Numbers will be assigned evenly spread across the range from 1,001,001,001 to 9,989,989,989 to enhance database search and retrieval efficiency.
There is no intelligence within this numbering system that can be translated to a student identity, and further protection of student confidentiality is enforced within the part of the CSIS data warehouse used for reporting. That further protection is accomplished by generating another layer of 'secret' surrogate keys, which provide identifier values that are never associated with even a soundex of student names.
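
The identifier rules listed above can be sketched as a generate-and-check loop. The article does not say how the check digit is computed or how numbers are drawn, so this sketch assumes a Luhn (mod 10) check digit and uniform random draws from Python's secrets module; both are stand-ins, and the function names are hypothetical. It also assumes the repeating-digit rule applies to all ten digits, which is consistent with the stated range: 1,001,001,001 and 9,989,989,989 are the smallest and largest ten-digit numbers with no digit repeated three times in a row.

```python
import secrets


def luhn_check_digit(payload: str) -> int:
    """Luhn (mod 10) check digit for a string of digits (an assumed scheme)."""
    total = 0
    # Walk right to left, doubling every second digit starting with the last one.
    for position, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def has_triple_repeat(digits: str) -> bool:
    """True if any digit occurs three or more times in a row."""
    return any(digits[i] == digits[i + 1] == digits[i + 2]
               for i in range(len(digits) - 2))


def is_valid_identifier(candidate: str) -> bool:
    """Check the rules from the article plus the assumed Luhn check digit."""
    return (len(candidate) == 10
            and candidate.isascii() and candidate.isdigit()
            and candidate[0] != "0"
            and not has_triple_repeat(candidate)
            and luhn_check_digit(candidate[:9]) == int(candidate[9]))


def new_identifier(issued: set) -> str:
    """Draw random candidates until one passes the rules and is unused."""
    while True:
        # Nine random digits with a non-zero first digit.
        payload = str(secrets.randbelow(900_000_000) + 100_000_000)
        candidate = payload + str(luhn_check_digit(payload))
        if not has_triple_repeat(candidate) and candidate not in issued:
            issued.add(candidate)
            return candidate
```

Uniform draws spread the assigned numbers roughly evenly across the range, and nothing about the student is an input to the function, so the number carries no embedded information.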
 
Results to Date
 
Other states have implemented statewide identifiers in ways that are unacceptable for our situation in California. Some states have used social security numbers, and most share personally identifiable information such as student names with other educational agencies, including their state departments of education. Developing a new solution for a problem that has been solved very differently elsewhere calls for very deliberate steps.
 
A prototype routine has been developed by CSIS and tested by each of the consortia. The results so far are very promising:
  • Using its historical data for 242,014 students, one consortium district found that the CSIS routine uniquely identified 99.82% of the population, and that using full student names produced unique identifiers for 99.92% of the same population.
  • Using Department of Health Services files that equate to the incoming kindergarten class of 2001, the CSIS routine uniquely identified 99.97% of the 537,707 records.
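
The article does not define how these uniqueness percentages were measured. One plausible reading, sketched below as an assumption, is the share of records whose combination of matching elements is shared with no other record. The function name and the sample key tuples are hypothetical.

```python
from collections import Counter


def unique_rate(match_keys: list) -> float:
    """Fraction of records whose match key is held by no other record."""
    if not match_keys:
        raise ValueError("no records supplied")
    counts = Counter(match_keys)
    unique_records = sum(1 for key in match_keys if counts[key] == 1)
    return unique_records / len(match_keys)


# Hypothetical match keys: (surname soundex, birth date).
sample = [
    ("S-530", "1994-03-02"),
    ("S-530", "1994-03-02"),  # collides with the record above
    ("L-000", "1993-11-17"),
    ("W-252", "1994-03-02"),
]
rate = unique_rate(sample)  # two of the four records are unique
```

Read this way, 99.82% of 242,014 records would leave roughly 436 records sharing a key with at least one other record, and 99.97% of 537,707 would leave roughly 161.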
Further, CSIS is contracting with a firm specializing in security applications to:
  • Verify the results to date, and consider the efficacy of this approach to the student population of six million.
  • Attempt to 'break' the separation of the CSIS Identifier from the student's identity; that is, attempt to reverse-engineer student names from the ten-digit number used as the key to a student record.
Further Information regarding CSIS
 
The CSIS website contains several documents of both a general and a detailed nature. More explicit documentation of the CSIS Student Identifier strategy may be found by going to http://www.csis.k12.ca.us/library/ and selecting "CSIS Statewide Student Identifiers: Recommendation for Establishing Identifier Elements". Questions regarding the strategy and solutions described in this article may be directed to either Russ Brawn or Charles Burns, consultant to CSIS.
 
1 "Using the Census Soundex," General Information Leaflet 55 (Washington, DC: National Archives and Records Administration, 1995), a free brochure available from [email protected] (include your name, postal address, and "GIL 55 please").