Two-week ISTE workshop on Effective teaching/learning of computer programming

Dr Deepak B Phatak, Subrao Nilekani Chair Professor, Department of CSE, Kanwal Rekhi Building, IIT Bombay. Lecture 15: File I/O. Wednesday, 7 July 2010

Overview • Need for processing data in files • OS view of files • Program execution environment • Files in C, simple I/O operations • Scanf and printf functions • File operations • Binary files • C functions for file handling • Examples

Files in C • A file is regarded as a large collection of bytes, stored outside the main memory, typically on a disk. • Files on the disk are managed by the OS (by a component called the file system). This provides for organization of files in directories and subdirectories. Each file can be created, data can be written to it, and data can be retrieved from it • Additionally, data can be inserted into or deleted from an existing file • Each file has certain properties. It has a name, a ‘path’, permissions, etc. • The physical location of a file, along with its properties, is known to the OS

A typical disk drive

OS view of files • The file system of OS organizes all external information as files • Files residing on disks are logically organized into directories and subdirectories • Every user is given a ‘home’ directory under which his/her files reside • For the OS, a file contains a number of bytes, which is the size of the file. The meaning of the contents of any file (text, digital image, numbers) is known only to the program which processes these bytes

OS view of the files ... • When the OS reads data from a file, it is capable of detecting when there is no more data left in the file to read. • It notifies the user program through a flag called End-of-File (eof) • When dealing with other devices, such as the keyboard or terminal, the UNIX OS still considers each of these as (special) files • The keyboard is treated as supplying a sequence of bytes to the OS; a special character typed as input (^d) signifies end of file

Program execution environment • Whenever we run our compiled program, the OS runs it as a ‘process’ and gives it an environment comprising certain parameter settings (called environment variables) and a set of predefined files: stdin, stdout and stderr • A program can, naturally, use as many other files as it needs, apart from these standard files
[Figure: stdin → our program → stdout, stderr]

Execution Environment ... • By default, the OS connects these standard files to meaningful devices • stdin to keyboard • stdout and stderr to display monitor
[Figure: stdin → our program → stdout, stderr]

Execution Environment ... • When we use ‘redirection’ at the command prompt, these standard files are disconnected from the devices and are connected instead to the named files
$ ./a.out < indata.txt > outdata.txt
[Figure: indata.txt → stdin → our program → stdout → outdata.txt; stderr is not redirected]

Basic Input Output in C • Intrinsically, C does not have any instruction for reading or writing data • All input and output to external world is performed through functions • Special functions have been written and made available in the standard library to perform I/O with keyboard and terminal, • e.g. scanf() and printf() • Parameters to these functions include a “format” string, followed by data values (expressions) to be read/printed • C applies the appropriate pattern to each value, for interpreting input characters or for generating output characters

printf() • This function displays one or more values on the user terminal printf("%d is a number\n", N); • The format string has a “format specifier” (%d), which is used to interpret N and convert it to a formatted value. Other characters are displayed as they are. \n introduces a new line • Specifiers can appear anywhere; each must correspond correctly to the matching value following the format string

Output format specifiers in printf() • format specification takes the following form, with optional parts shown enclosed in angled brackets: • %<flags><width><precision><length>conversion • Flags (Zero or more of the following): • - :Left justify the conversion within its field. • + :A signed conversion will always start with a plus or minus sign. • Space: If the first character of a signed conversion is not a sign, insert a space. Overridden by + if present.

Flags … • # : Forces an alternative form of output. The first digit of an octal conversion will always be a 0; inserts 0X in front of a non-zero hexadecimal conversion; forces a decimal point in all floating point conversions even if one is not necessary; does not remove trailing zeros from g and G conversions. • 0 :Pad d, i, o, u, x, X, e, E, f, F and G conversions on the left with zeros up to the field width. Overridden by the - flag. If a precision is specified for the d, i, o, u, x or X conversions, the flag is ignored. The behaviour is undefined for other conversions.

field width • A decimal integer specifying the minimum output field width. This will be exceeded if necessary. If an asterisk is used here, the next argument is converted to an integer and used for the value of the field width; if the value is negative it is treated as a - flag followed by a positive field width. Output that would be less than the field width is padded with spaces (zeros if the field width integer starts with a zero) to fit. The padding is on the left unless the left-adjustment flag is specified.

precision • This starts with a period ‘.’. It specifies the minimum number of digits for d, i, o, u, x, or X conversions; the number of digits after the decimal point for e, E, f conversions; the maximum number of significant digits for g and G conversions; the maximum number of characters to be printed from a string for s conversion. Zero padding required by the precision takes priority over the field width. If an asterisk is used here, the next argument is converted to an integer and used for the value of the precision. If the value is negative, it is treated as if it were missing. If only the period is present, the precision is taken to be zero.

length • h preceding a specifier to print an integer type causes it to be treated as if it were a short. (Note that the various sorts of short are always promoted to one of the flavours of int when passed as an argument.) l works like h but applies to a long integral argument. L is used to indicate that a long double argument is to be printed, and only applies to the floating-point specifiers. These can cause undefined behaviour if they are used with the ‘wrong’ type of conversion.

Format conversion specifications • d signed decimal • i signed decimal • u unsigned decimal • o unsigned octal • x unsigned hexadecimal (0–f) • X unsigned hexadecimal (0–F) • f Print a double with precision digits (rounded) after the decimal point. To suppress the decimal point use a precision of zero explicitly. Otherwise, at least one digit appears in front of the point

Format conversion specifications … • e Print a double in exponential format, rounded, with one digit before the decimal point, and precision digits after it. • g,G Use style f, or e (E with G) depending on the exponent. • c The int argument is converted to an unsigned char and the resultant character printed. • s Print a string up to precision characters long. If precision is not specified, or is greater than the length of the string, the string must be null-terminated

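Examples
%6d - integer, right-justified in a field at least 6 characters wide
%7s - string, right-justified in a field at least 7 characters wide
%8.2f - floating-point value, at least 8 characters in total, 2 digits after the decimal point
%8.2g - floating-point value, at least 8 characters in total, 2 significant digits; switches to E notation if required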

Example
#include <stdio.h>

int main()
{
    int a, b, c;
    float p, q, r;

    a = -1; b = 10; c = 100;
    p = 123.456; q = 0.1234; r = -12.34;

    /* integers: the output of each call is shown in the comment */
    printf("%5d \n", a);     /* "   -1 " */
    printf("%5d \n", b);     /* "   10 " */
    printf("%5d \n", c);     /* "  100 " */
    printf("%2d \n", a);     /* "-1 "    */
    printf("%2d \n", b);     /* "10 "    */
    printf("%2d \n", c);     /* "100 "   (field width exceeded) */

    /* floating-point values */
    printf("%8.4f \n", p);
    printf("%8.4f \n", q);
    printf("%8.4f \n", r);
    printf("%4.2f \n", p);
    printf("%4.2f \n", q);
    printf("%4.2f \n", r);
    return 0;
}

The scanf() function • scanf needs to be passed pointers to its arguments, so that the values read can be assigned to the proper destinations. Forgetting to pass a pointer is a very common error, and one which the compiler cannot detect—the variable argument list prevents it. • The format string is used to control interpretation of a stream of input data. This stream generally contains values to be assigned to the objects pointed to by the remaining arguments to scanf.

Format specification for scanf() Contents of format string • white space • This causes the input stream to be read up to the next non-white-space character. • ordinary character • Anything except white-space or % characters. The next character in the input stream must match this character.

scanf() … • conversion specification • This is a % character, followed by an optional * character (which suppresses the conversion), followed by an optional nonzero decimal integer specifying the maximum field width, an optional h, l or L to control the length of the conversion and finally a non-optional conversion specifier. Note that use of h, l, or L will affect the type of pointer which must be used. • There are other functions also, such as getchar() and putchar()

Example of reading characters
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int ch;

    ch = getchar();
    /* stop at the letter 'a', or when there is no more input */
    while (ch != 'a' && ch != EOF) {
        if (ch != '\n') {
            printf("ch was %c, value %d\n", ch, ch);
        }
        ch = getchar();
    }
    return 0;
}

Files in C • C language treats a file as a ‘stream’ of bytes. A file can be ‘opened’ for reading bytes from it (input), or for writing bytes to it (output), or for both (i-o). The bytes are simply treated as ‘chars’ • When we invoke the function scanf (or the operator cin), the program actually reads from stdin, and hence from the keyboard • Similarly, printf (and cout) write output bytes to stdout, and hence to monitor • OS error messages are output to stderr

Typical data in a text file • CS101 Exam marks are entered in a text file
1,08331010,,0B,-2
2,9002040,NURUDDIN BAHAR ,0B,-1
...
13,09D07010,GURURAJ SAILESHWAR,7A,44.5
14,09005014,RAWAL NAMIT LALIT,8D,44
...
525,09D17001,SHAILENDRA SARAF,,22.5
• Such data may come from a spreadsheet • Each line represents one ‘record’ for a student and has several ‘fields’ separated by commas

Data in a text file … • We can identify the meaning of values in different fields from the knowledge of implicit ‘metadata’; e.g., record 13 is
13,09D07010,GURURAJ SAILESHWAR,7A,44.5
• The field values represent • s.no., roll, name, lab batch, marks • We note possible missing field values, e.g.
1,08331010,,0B,-2 (name is missing)
525,09D17001,SHAILENDRA SARAF,,22.5 (batch code is missing)

Data in a text file • In order to process data in this file, we need to read each line as a string, separate out the field values, and store these in appropriate numeric variables for s.no. and marks, and in char arrays for names, batch codes, and roll numbers. • Roll numbers may contain a non-digit • We would generally like to store basic information for all students, such as this, in a file (say, a students’ database file) • Consider a simple processing requirement from this file itself • we want to find batch-wise average marks and the class average

A different solution • Before considering a C program to solve this problem, let us look at the possible use of a programming language called AWK • Named after the designers at Bell Labs who invented it in the 1970s: Alfred Aho, Peter Weinberger, Brian Kernighan • This language makes heavy use of the string data type, associative arrays, and regular expressions • Some inadequacies have led to the development of a language called Perl

AWK language fundamentals • AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.

How AWK handles records
Each record of our file is like
13,09D07010,GURURAJ SAILESHWAR,7A,44.5
($1 = 13, $2 = 09D07010, $3 = GURURAJ SAILESHWAR, $4 = 7A, $5 = 44.5)
- AWK separates out various “fields” as it reads records and assigns values to $1, $2 etc
- What do we want to do?
Pattern: $5 < 0
Action: Increment a count variable for absent students
For other patterns: increment batch-counts, marktotals, …
At END, print the accumulated results

AWK script and execution results
AWK program
$5 < 0 { absentcount++ }
$5 >= 0 {
    count++;
    totmarks += $5;
    batchtot[$4] += $5;
    batchcount[$4]++;
}

AWK script
END {
    for (i in batchcount) {
        print i, batchcount[i], batchtot[i] / batchcount[i];
    }
    print "Total students are: ", count + absentcount;
    print " Number absent is: ", absentcount;
    print "Class average is: ", totmarks / count;
}

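Execution Result
$ nawk -F "," -f analysemidsem.awk < midsemmarks.txt
(columns: batch code, number of students with marks, average marks; the first row is for records with an empty batch code)
 1 22.5
0A 19 23.7368
0B 15 30.7667
1A 22 25.3636
0D 20 24
1B 20 22.5625
1C 20 31.525
1D 23 25.663
2A 22 28.4318
2B 20 26.4
2C 21 30.381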

Execution Results …
3A 23 24.3043
2D 20 25.425
3B 23 22.8913
3C 21 26.5714
3D 23 23.1739
4A 22 25.0909
4B 22 31.2273
4C 23 32.8696
4D 21 25.881
5A 22 28.7955
5B 19 26.8684
5C 20 33.7875

Execution Results …
6A 22 23.9091
6B 20 21.45
6C 21 25.0238
6D 23 20.5
7A 21 28.2619
7B 24 22.9167
7C 20 29.325
7D 19 22.6579
8A 19 28.0263
8B 18 19.1111
8C 17 28.2353
8D 19 22.8421

Execution Results …
9A 21 25.619
9B 23 25.1304
9C 20 28.5
9D 21 24.0952
OC 19 21.1842
Total students are: 819
Number absent is: 10
Class average is: 25.9299
$

Programming languages • AWK is so simple, and works so well • Why use complicated programming languages? • AWK is superb for such problems but has limited capability to handle data of all kinds and to implement all functionality to solve general computational problems • It is an “interpreted” language; programs (scripts) are not separately compiled into machine instructions • Operationally less efficient
So we get back to our C programming

An interesting episode • In the class, I announced the names of students scoring high marks in the mid-semester examination • 42 09010021 J Jolly • 42 09010003 Raja Jain • 42 09D07012 Joy P Khan • 42 09005045 S Siva Chandra Mouli • 42 09026018 N Gautam Reddy • 42.5 09001004 Prateek Bhandari • 43.5 09005035 Vinayak Gagrani • 43.5 09005008 Siddhesh P Chaubal • 44 09005014 Namit Rawal

Interesting Episode … • A student remarked to me after the class, that one Nitant Vaidya from his batch has his answer book under review by the teaching assistant (TA) for the batch, and his marks will definitely increase putting him in the top league. I told him to let this Nitant write an email to me if and when he gets to know his updated marks, and that I will be glad to include him in the list of honors.

Interesting Episode … • A few days later, I got a mail from Nitant saying that he has now got his answer paper back, with the revised total as 43½ (out of 45). • His mail further said that, he has checked the paper before sending the mail, and found that the TA had made a totaling error originally; giving him 2 marks more than what should have been the correct total. His final marks should thus actually be only 41 ½ and not 43 ½ which I should record

Interesting Episode … • The significance of this admission is rather deep. He is not only ruling himself out from the recognition as a top performer in mid-semester, he is actually putting his final course grade in jeopardy. • Very importantly, he need not at all have acknowledged the correct total, as by that time we had 43 ½ marks firmly recorded against his name in our internal final score sheet, after all TAs had submitted upgraded marks

Interesting Episode … • As a teacher, I can only say that such admissions are very rare in real life. Coming from a 1st year student working in an intensely competitive environment (for marks and grades), this is indeed exceptional.

Interesting Episode … • In the next lecture, I did put up his name in an ‘extended’ honors list, without showing his marks. I briefly recited this episode, observed that in my opinion he deserves to be in the ‘honors list’, and asked the class if it agreed. A loud yes from over 800 students endorsed my observation. When, upon my request Nitant sheepishly stood up, the clapping from the class was louder than it was for those in the original list of toppers. • Just shows that we all really respect ethical behaviour.

Interesting Episode … • It is such episodes which make the life of a teacher so enjoyable, so meaningful, and so enriching. These go a long way to increase our own resolve to be scrupulously ethical in our activities

Interesting Episode … • Role models need not be searched far and wide only amongst great names. They exist amongst us, often disguised as normal people, coming out only through some small events. It is up to us to recognize, acknowledge and emulate, if we so dare.

File processing in C • A file is defined through a special pointer (FILE*) which points to a file object FILE *fp; • For disk files, C is capable of ‘positioning’ a pointer to any byte within the file, from which or at which, bytes can be read or written

Files in C • fp will ‘point’ to the file object • The file object has a “position” pointer (labelled POS in the figure), which can be made to point to any byte position of the file, for random access files. The next input or output operation will take place starting at that byte position • A file can be either a sequential file or a random access file. Also, it can either be a text file or a binary file
