INSERT SORTING AND INDEXING by Dale Thorn [1760] Part A: Insert sorting in RAM using a single string of data. The following program illustrates insert sorting where a single sorting order is sufficient, and RAM is large enough to contain all of the sorted data. This program requires the user to specify the number of records (Line 40), number of fields per record (Line 50), and the length of each field (Lines 90 - 130). NOTE: If you change the number of fields on line 50, you will have to add or subtract lines where the field lengths are specified (Lines 90 - 130). The example program specifies 127 records of data where each record has 5 units of information (fields), and the fields are 10, 20, 30, 35, and 15 characters long, for a total of 110 characters per record. If you multiple 110 characters (bytes) per record times 127 records you get 13,970 bytes total, which is the dimensioned length of A$ (see Lines 240 and 270). In this simplified example program, we require the number of records to be a power of 2, minus 1, to make the search routine simple (believe it or not); in actual applications, intermediate values would be acceptable. NOTE: If you set up your own example where the file size (the dimensioned length of A$) is much greater than 30,000 bytes, this program will begin to run very slowly because of the time it takes the 71B to mount large strings on the internal stack. The first things the program does are to sum the number of bytes in each field to get the total number of bytes per record, find the beginning and ending byte positions in the record string (B$) and assign them to the arrays H() and I(), and set the variable Z equal to the length or size of the longest field. The program will place all input fields of data into the record string (B$) on Lines 490 - 520, and then use the H() and I() arrays on Line 550 to find the field(s) in the record string (B$) and the file string (A$) which are used for sorting comparisons. If you wish to sort on a different field number, you will need to change only the 1's on Line 550 to the new field number. If you wish to sort on more than one field, you could rewrite Line 550 to read: 550 IF B$[H(3),I(3)]&B$[H(4),I(4)]>A$[P1+H(3),P1+I(3)]&A$[P1+H(4),P1+I(4)] THEN P1=P1+P2 ELSE P1=P1-P2 ! etc., etc. Needless to say, once you get started with a file, you cannot change the number of fields or the sequence of the fields in the sorting order on Line 550. NOTE: If a change in the sorting order is necessary, instead of using one of the published sort routines to re-sort your file, you could simply create a new file with a new sorting order and use this program (with slight modification) to bring in the records from the previous file, allowing this program to insert-sort them much as if they were being typed in from the keyboard. .PA The next thing the program does in Lines 250 - 260 is to establish a control pointer and increment/decrement variable for the Binary Searches. As long as the number of records is 2^n-1, these two algorithms do not change. The following is an illustration of the use of pointers in a Binary Search. Imagine a sorted file with 7 records (2^n-1 records, where n=3). For control pointer purposes, we will assume N7=15 (bytes-per-record) and N0=105 (total bytes in file). 1) Allen, Jim 2) Best, Frank 3) Dierkop, Chuck 4) Feldman, Bill 5) Firestone, Ray 6) Nelson, Richard 7) ~~~~~~~~~~~~~~~ (end-of-file code) We have just input a new name at the keyboard (Johnson, Ray), and we want to file it in order with the preceding names. Our imaginary control pointer is set to 45, and our increment/decrement variable is set to 30. We read record number 4 from the file (Feldman, Bill, byte positions 46 through 60), compare it to our input record (Johnson, Ray), and when we see that Johnson is greater in the ASCII order than Feldman, we move the control pointer up to 75 by adding the increment/decrement variable to it, and divide the increment/decrement variable by 2, giving it the value of 15. We now read record number 6 from the file (Nelson, Richard, byte positions 76 through 90), compare it to our input record, and when we see that Johnson is less than Nelson in the ASCII order, we lower the control pointer to 60 by subtracting the increment/decrement variable, and divide the increment/decrement variable by 2, giving it the value of 7.5. We now read record number 5 from the file (Firestone, Ray, byte positions 61 through 75), compare it to our input record, and when we see that Johnson is greater than Firestone in the ASCII order, we move the control pointer up to 67.5 by adding the increment/decrement variable to it, and divide the increment/decrement variable by 2, giving it the value of 3.75. NOTE: In the last paragraph, when the increment/decrement variable dropped to the value 3.75 (N7/4), it signalled the program that the search was finished and the control pointer was midway between a lesser and a greater-or-equal file record. At this point, the control pointer is raised to 75, just in front of record number 6 in the file (see Line 660 in the program). The next step is to "shift" the file string by 1 record to create a "hole" for the new record to drop into (see Lines 690 and 710). These lines will then have the actual values as follows: 690 A$[75+15+1,105]=A$[75+1,105-15] ! the file string is shifted by the width 710 A$[75+1,75+15]=B$ ! of one record and the new record is ! dropped into place as per the following ! illustrations: .PA The file string before shifting (count only the characters and spaces inside of the double brackets: ||Allen, Jim Best, Frank Dierkop, Chuck Feldman, Bill Firestone, Ray || ||Nelson, Richard~~~~~~~~~~~~~~~|| The file string after shifting: ||Allen, Jim Best, Frank Dierkop, Chuck Feldman, Bill Firestone, Ray || ||Nelson, RichardNelson, Richard|| The file string after the new record is dropped into place: ||Allen, Jim Best, Frank Dierkop, Chuck Feldman, Bill Firestone, Ray || ||Johnson, Ray Nelson, Richard|| The reason for creating a string of blanks (see Lines 300 - 330) is to be able to "pad" each input field before the fields are joined together in the record string in Lines 490 - 520. The computer can easily find any record, and any field within a record, when every record is the same length and every field is the same length. There are rare instances in filing and sorting when it is better not to pad these inputs, but the great majority of situations require equal-length records and equal-length fields to facilitate direct random-access retrieval. In Lines 340 - 390 we fill B$ and then A$ with CHR$(126)'s. These ASCII values are useful as end-of-file markers, and help prevent loss of data, file overflow, and sorting out of order. The last things for the program to do are: Check for end-of-file in Line 420 (make sure there is room for another record). Input the new data in Lines 430 - 480, and pad the inputs if applicable. Join the fields together in the record string in Lines 490 - 520. Set the starting values in Lines 530 and 540 for the control pointer and the increment/decrement variable. Set up a Binary Search loop in Lines 550 - 600. Adjust the control pointer in Line 660, after the search is completed. Shift the file string in Line 690, in preparation for the new insert. Insert the new record in Line 710. Return to INPUT NEW DATA starting in Line 410. Last note: If you use a language that does not permit long strings, you can simulate a long string by concatenating shorter strings end-to-end (in series, so to speak). You will need 2 extra pointers; a pointer that denotes which string in the series you are examining, and a pointer that denotes which byte in the string you are looking at. When you locate the insertion point, the strings will have to be shifted one-by-one, and although it is not as fast in interpreted BASIC as the long-string approach, the compiled versions will handle tens of thousands of records in under 1 second.