.de PG .P "\\$1: \\$2 .. .CH 4 "Data Manipulation" .de AA .br All data manipulation programs are introduced, showing some of their options. Full documentation is in the manual entries. |STAT data manipulation tools allow users to generate, transform, format, extract, and validate data. .T dm , the data manipulator, is the most important tool for use with other |STAT programs. A detailed manual for .T dm is the last section of this chapter. .if t .sp 6p .if n .sp There are several classes of data manipulation programs. .B "Generation programs produce more data than their inputs by repeating data, numbering data, or by creating new data. .B "Transformation programs allow algebraic conversion of data. .B "Formatting programs change the shape or order of the data. .B "Extraction programs produce subsets of datasets. .B "Validation programs check the consistency, data types, and ranges of data. .br .rm AA .. .if n .AA .SH "Data Generation/Augmentation" .PG repeat "repeat a string or file .T repeat can repeat strings or lines in a file as many times as requested. It helps generate labels for datasets, or feed a program like .T dm that needs input to produce output. The following will repeat the file .T data 10 times. .( repeat -n 10 data .) The following will repeat its input series of 20 numbers 15 times. .( series 1 20 | repeat -n 15 .) Strings can be repeated using the .T echo command. The following will repeat the string .T hello 100 times. .( echo hello | repeat -n 100 .) .PG series "generate a linear series .T series generates a linear series of numbers between two values. By default, its values change by units, but this can be modified. The following produces a series of 10 numbers, 1 to 10, one per line. .( series 1 10 .) The following produces the same series, but in reverse order; the start of the series can be greater than the end. .( series 10 1 .) Non-integral series can be created by supplying an optional increment. .( series 0 1 .1 .) produces the series: .( 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 .) except that each value is on its own line. The output from series can be transformed with .T dm to produce other than linear series. Here is an exponential series: .( series 1 10 | dm "exp(x1)" .) .PG probdist "generate random numbers .T probdist can generate random numbers for several probability distributions. The following will generate 100 random numbers from the uniform distribution (between 0 and 1). .( probdist random uniform 100 .) This can be transformed using .T dm to get random numbers with other ranges. The following will produce 100 random integers uniformly distributed between 10 and 29. .( probdist random uniform 100 | dm "floor(x1*20+10)" .) The following generates numbers from a one-trial binomial distribution with probability 0.5. .( probdist random uniform 100 | dm "if x1 > .5 then 1 else 0" .) .T probdist also has a binomial distribution built in, so the following would be equivalent to the previous example: .( probdist rand binomial 1 1/2 100 .) The random number generator can be seeded. The following will seed the random number generator with 143 and generate 100 normally distributed z values. .( probdist -s 143 random normal 100 .) The seeding option is useful when a random sequence must be repeated. The random normal numbers have a mean of 0 and a standard deviation of 1, so .T dm can help create different random normal distributions. The following samples a normal distribution with mean 100 and standard deviation 15. .( probdist random normal 100 | dm "x1*15+100" .) .PG abut "number lines, recycle files .T abut can number input lines in files using the .T -n option, or cycle through input files as many times as is necessary to match the length of longer files. The latter case is common in creating input files for programs like .T anova and .T contab , which have input data tagged with regular patterns of labels. .( .ta 1i 2i 3i 4i .if t .ft I File1 File2 Data .if t .ft P large easy 12 small easy 23 hard 34 hard 45 56 67 78 89 .) For the above input file configuration, the command .( abut -nc File1 File2 Data .) would produce the following by recycling the smaller files. .( .ta 1i 2i 3i 4i 1 large easy 12 2 small easy 23 3 large hard 34 4 small hard 45 5 large easy 56 6 small easy 67 7 large hard 78 8 small hard 89 .) .PG dm "number lines .T dm can number its input lines with its special variables .T INLINE , which always contains the input line number, and .T INPUT , which always contains the current input line. .( dm INLINE INPUT < data .) .SH "Data Transformation" .PG dm "conditional algebraic combinations of columns .T dm can produce algebraic combinations of columns. The following command reads from .T data and produces the ratio of columns 2 and 1 with column 3 added on. .( dm x2/x1+x3 < data .) Transformations can be based on conditions. For example, if .T x1 , the value in column 1, in the above example is 0, then .T dm will exit after producing an error message like: .( dm: division by zero. input line 12 expr[1]. .) To avoid this problem, the following will do the division only if .T x1 is non-zero. .( dm "if x1 then x2/x1+x3 else 0" < data .) .PG probdist "probability/statistic conversion .T probdist can convert probabilities to distribution statistics and .I "vice versa" as seen in tables at the end of most statistics textbooks. Many distributions are supported, including: the normal z, binomial, chi-square, F, and t. The following will print the two-tailed probability of an obtained t statistic of 2.5 with 20 degrees of freedom. .( probdist prob t 20 2.5 0.021234 .) Similarly, the following will print the two-tailed probability of an F ratio of 6.25 with 1 and 20 degrees of freedom. .( probdist prob F 1 20 6.25 0.021234 .) These results are the same because of the relationship between the t and F distributions. .P The following prints the critical value (also called the quantile) in the chi-square distribution with 5 degrees of freedom to obtain a significance level of .05. .( probdist crit chisq 5 .05 11.070498 .) Both probabilities and critical values in the normal z distribution use the lower one tail -\(if to +\(if distribution, so the z value that produces the .05 level is obtained with the following. .( probdist crit z .05 -1.644854 .) The critical value for the 99th percentile is found with the following. .( probdist crit z .99 2.326348 .) Binomial distribution critical values are treated differently than the other continuous distributions. For the binomial distribution based on five trials, and a probability of success of one half, The critical value for a one-tailed test at the .05 level is: .( probdist crit binomial 5 1/2 .05 5 .) even though the probability of 5 successes is proportionally much less than .05: .( probdist prob binomial 5 1/2 5 0.031250 .) This is because the binomial distribution is discrete. Not only are critical values conservative, sometimes there may be no possible value; there is no way to get a less probable event than five out of five successes: .( probdist crit binomial 5 1/2 .01 6 .) Here, .T probdist is returning an impossible value (one with zero probability). .PG ranksort "convert data to ranks .T ranksort can rank order data from numerical data columns. For the input: .( 1 95 4.3 2 113 5.2 3 89 4.5 4 100 5.0 5 89 4.5 .) .T ranksort would produce: .( 1 3 1 2 5 5 3 1.5 2.5 4 4 4 5 1.5 2.5 .) The ties in the second and third columns get the average rank of the values for which they are tied. Once data are ranksorted, further ranksorting has no effect. With rank orders within columns, rank order statistics (e.g., Spearman rank order correlation, average group rank) can be computed by parametric programs like .T pair or .T regress . .SH "Data Formatting" .PG maketrix "form a matrix format file .T maketrix reads its data, one whitespace separated string at a time from its free format input, and produces a multicolumn output. .( series 1 20 | maketrix 5 .) The above produces a five column output. .( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 .) .PG perm "permute lines .T perm , with no options, randomizes its input lines. It can randomize output from programs like .T series . .( series 1 20 | perm .) A subset of this permutation is a sample without replacement. The following is a sample of size 10 from the file .T data . .( perm < data | dm "if INLINE <= 10 then INPUT else EXIT" .) .T perm can be supplied a seed for its random number generator, to replicate a random permutation. .( series 1 20 | perm -s 5762 | maketrix 5 .) The above produces (with my system's random number generator): .( 18 7 10 13 2 14 11 19 15 20 1 3 9 6 16 8 17 12 5 4 .) .P .T perm can also put its lines in alphabetical or numerical order. For example, the output from the previous example could be put into ascending order (according to the first number on each line) with: .( series 1 20 | perm -s 5762 | maketrix 5 | perm -n .) This produces: .( 1 3 9 6 16 8 17 12 5 4 14 11 19 15 20 18 7 10 13 2 .) .PG dsort "sort data lines by multiple keys The last example of the .T perm filter showed how lines can be ordered according to the numerical value in the first column. .T dsort can sort lines based on numerical or alphabetical values in any column. For example, the following command sorts the previous example matrix in ascending order of the values in the third column. .( series 1 20 | perm -s 5762 | maketrix 5 | dsort -n 3 .) This produces: .( 1 3 9 6 16 18 7 10 13 2 8 17 12 5 4 14 11 19 15 20 .) If there were ties in a column, .T dsort could sort by additional key columns. .PG transpose "transpose matrix format file .T transpose flips rows and columns in its input. For the input: .( 1 2 3 4 5 6 7 8 9 10 11 12 .) .T transpose produces: .( 1 5 9 2 6 10 3 7 11 4 8 12 .) The input to .T transpose does not have to be regular, nor does it have to be numerical. .( .ta 1i 2i 3i 4i 5i one two three four five six seven eight nine ten eleven .) For the above input, .T transpose produces the following. .( .ta 1i 2i 3i 4i 5i one four six seven nine two five eight ten three eleven .) Note that with regular inputs, the transposition of a transposition yields the original. This is not necessarily so with data as in the above input and output. The above output piped through another pass of .T transpose produces a result different from the original input. .( .ta 1i 2i 3i 4i 5i one two three four five eleven six eight seven ten nine .) .PG reverse "reverse lines, columns, characters .T reverse can reverse the lines, fields, or characters in its input. It can provide easier access to the last lines in a file, or the last columns on lines. To get the last 10 lines in a file, we can reverse the file, get the first 10 lines, and then reverse those 10 lines. .( reverse < data | dm "if INLINE GT 10 then EXIT else INPUT" | reverse .) To get the last two .I columns in a file is easier. .( reverse -f < data | dm s2 s1 .) Here, .T dm is used for column extraction, and rather than call .T reverse a second time, what were the last two columns before reversal are listed in the opposite order. .PG colex "reorder columns, reformat columns .T colex is a column extraction program that shares some of the functionality of .T dm and .T reverse . .T colex is faster and has a simpler syntax than .T dm and has data formatting capabilities. Suppose a matrix dataset with 10 columns is created with the following. .( series 1 50 | maketrix 10 .) .T colex can extract the last five columns followed by the first five with the command: .( series 1 50 | maketrix 10 | colex 6-10 1 2 3 4 5 .) Either ranges of columns or single columns can be given. The above command produces: .( .if t .ne 5v 6 7 8 9 10 1 2 3 4 5 16 17 18 19 20 11 12 13 14 15 26 27 28 29 30 21 22 23 24 25 36 37 38 39 40 31 32 33 34 35 46 47 48 49 50 41 42 43 44 45 .) .P Note in the previous example how the numbers line up on the left, rather than the customary format to line up the unit digits. This is because .T colex puts tabs between columns, and it is not a problem because |STAT programs read data in free-format. .T colex can print its columns in several numerical formats as well as the default string format. The numerical formatting can round values to some number of decimal places (like zero, for whole numbers). The option: .T "-F\ 4i" would tell .T colex to format all the columns as integers, each four spaces wide, and the .T -t option would tell .T colex to not place a tab between columns. The format of columns can be assigned to individual columns by placing the format before each range of columns. For example, the following variation on the previous command would print columns 6-10 in a money format with two digits after the decimal place, and columns 1-5 as integers four wide. .( series 1 50 | maketrix 10 | colex -t 6.2n6-10 4i1-5 .) .( 6.00 7.00 8.00 9.00 10.00 1 2 3 4 5 16.00 17.00 18.00 19.00 20.00 11 12 13 14 15 26.00 27.00 28.00 29.00 30.00 21 22 23 24 25 36.00 37.00 38.00 39.00 40.00 31 32 33 34 35 46.00 47.00 48.00 49.00 50.00 41 42 43 44 45 .) .PG dm "reorder columns .T dm , like .T colex , can reorder columns. However, it does not allow the specification of ranges of columns. The above example of .T colex could be done with .T dm with similar results. .( series 1 50 | maketrix 10 | dm s6 s7 s8 s9 s10 s1 s2 s3 s4 s5 .) .PG abut "paste corresponding lines from files .T abut can join data in separate files beside one another. In the usual case, .T abut takes N files with K lines and produces 1 file with K lines. Suppose the files .T height and .T weight contain the respective heights and weights of the same people. Each line in each file contains one height or weight. These could be plotted with the plotting option on the .T pair program with the following command. .( abut height weight | pair -p .) .SH "Data Extraction" .PG dm "conditional data extraction .T dm can extract subsets of its input, either by columns or by lines. To extract columns of data, each extracted column is specified with the number of the column preceded by the letter .T s . The following extracts columns 8, 2, and 11, in that order. .( dm s8 s2 s11 .) .T dm can extract lines of data by using its built-in line skipping expression .T SKIP . The following will extract lines 50 to 100. .( dm "if INLINE >= 50 & INLINE <= 100 then INPUT else SKIP" .) It is more awkward than column extraction, but the latter is common. .PG colex "quick column extraction .T colex can extract individual columns, or ranges of columns. For column extraction, it is easier to use and faster than .T dm . The following extracts, in order, columns 8, 2, and 11. .( colex 8 2 11 .) .PG linex "line extraction .T linex can extract individual lines (by number), or ranges of lines. The following extracts, in order, lines 8, 2, and 11. .( linex 8 2 11 .) To extract lines 50 to 100, you could type: .( linex 50-100 .) or you could even extract them in reverse order: .( linex 100-50 .) .SH "Data Validation" .PG validata "data validation .T validata will report for its input the number of columns, data-types of columns, and for columns with numerical values, the maxima and minima. .T validata reports any inconsistencies in the number of columns in its input. Floating point numbers can be entered in scientific notation. For the input: .( 1 2 3 4 5 6 7 2E2 end 5 1e-3 .) .T validata 's output is: .( validata: Variable number of columns at line 4 Col N NA alnum alpha int float other type min max 1 4 0 4 0 4 4 0 int 1 7 2 4 0 3 0 2 4 0 float 0.001 200 3 3 0 3 1 2 2 0 alnum 3 6 .) .PG dm "conditional data validation .T dm can find exceptional cases in its input. A simple case is non-numerical input, which can be checked with .T dm 's .T number function. .( dm "if !number(s1) then 'bad input on line' else SKIP" INLINE .) .T dm can check for specific values, ranges of values, or specific relations of values. The following prints all lines in .T data with the string .T "bad" in them. .( dm "if 'bad' C INPUT then INPUT else SKIP" .) The input line number could be prepended. .( dm INLINE "if 'bad' C INPUT then INPUT else SKIP" .) This is possible because .T dm will produce no output for skipped lines, regardless of expression order. The following prints all lines where column 3 is greater than column 2. .( dm "if x3 > x2 then INPUT else SKIP" .) .T dm can print lengths of strings and check for numerical fields: .( dm len(s1) number(s1) .) will print the length of column 1 strings, and report if they are numerical (0 for non-numbers, 1 for integers, 2 for real numbers, 3 for exponential scientific notation numbers). .SH "DM: Tutorial and Manual" .so DM.nr .TC -- .if t .sp .2i .if t .AA