Lehigh University Operating Systems Tutorial Series

Persistence and File I/O

It's almost impossible to imagine a world in which we had no way to store data reliably. Persistent files, which maintain their content even after the machine has been turned off, are essential! But what does it mean to be persistent? Is it persistent if a file is stored on a handful of always-on cloud servers? Does the file need to be stored on a hard disk? Do the contents of the file affect how we should interact with it? This tutorial will just scratch the surface, by letting us interact with files on the local filesystem in a variety of ways.

 

 

You will need a Linux environment for writing C++ code that complies with the C++17 standard. The cse303dev container is the easiest way to have that environment. You will also need a code editor, such as Visual Studio Code.

 

 

We have two options for interacting with files: we can access them directly, or we can do buffered operations on them. If you've ever written printf("hello"); while(true){} and wondered why you didn't see any output, then you may already know the difference… with buffering, an underlying library decides when to actually write data to the file. Buffering can be beneficial, and we'll see that it can have a huge impact on performance. We'll also see that buffering doesn't protect us from needing to check for errors. We'll also get a good grip on how to use the low-level interface for reading and writing files.
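To make the buffering behavior concrete, here is a minimal sketch (not part of text_io.cc) that writes five bytes to a fully buffered stream and asks the OS for the file's size before and after calling fflush(). The names os_visible_size(), flush_demo_t, and flush_demo() are made up for this illustration, and the sketch assumes a Linux environment like the one described above.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/stat.h>
#include <unistd.h>

/** Ask the OS how many bytes are in the file behind fd (-1 on error) */
off_t os_visible_size(int fd) {
  assert(fd >= 0); // the caller must pass an open descriptor
  struct stat st;
  if (fstat(fd, &st) < 0) {
    perror("os_visible_size::fstat()");
    return -1;
  }
  return st.st_size;
}

/** File sizes seen before and after flushing (hypothetical helper type) */
struct flush_demo_t {
  off_t before_flush = -1;
  off_t after_flush = -1;
};

/** Write "hello" to a temporary file and observe when the OS receives it */
flush_demo_t flush_demo() {
  flush_demo_t result;
  FILE *f = tmpfile();
  if (f == nullptr) {
    perror("flush_demo::tmpfile()");
    return result;
  }
  // Request full buffering explicitly, so that five bytes will not fill
  // the buffer and force an early write
  if (setvbuf(f, nullptr, _IOFBF, BUFSIZ) != 0) {
    perror("flush_demo::setvbuf()");
  }
  if (fputs("hello", f) == EOF) {
    perror("flush_demo::fputs()");
  }
  // The text is still in the library's buffer, not in the file
  result.before_flush = os_visible_size(fileno(f));
  if (fflush(f) == EOF) {
    perror("flush_demo::fflush()");
  }
  // fflush() hands the buffered bytes to the OS
  result.after_flush = os_visible_size(fileno(f));
  fclose(f);
  return result;
}
```

Under these assumptions, before_flush should be 0 and after_flush should be 5. The printf("hello"); while(true){} program never reaches a point where the library decides to flush, which is why nothing appears.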

 

 

Create a file called text_io.cc, and add it to your Makefile's TARGETS. Here's the start of our code:


 

/**

 * text_io.cc

 *

 * Text_io is similar to the Unix cat utility: it reads bytes fro

 * and writes them to another. By default, it reads stdin and wri

 * it can be configured to open files to serve as input and/or ou

 * supports appending.  Finally, it allows for the input and/or o

 * be accessed as C file streams or Unix file descriptors.

 */ 

 

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <functional>
#include <libgen.h>
#include <string>
#include <unistd.h>

 

/**

 * Display a help message to explain how the command-line paramet

 * program work

 *

 * @param progname The name of the program

 */ 

void usage(char *progname) {

  printf("%s: Demonstrate text-based I/O with streams and file de

         basename(progname));

  printf("  -i        Use file descriptor instead of stream for i

  printf("  -I [file] Specify a file to use for input, instead of

  printf("  -o        Use file descriptor instead of stream for o

  printf("  -O [file] Specify a file to use for output, instead o

  printf("  -a        Open output file in append mode (only works

  printf("  -h        Print help (this message)\n");

}

 

/** arg_t is used to store the command-line arguments of the prog
struct arg_t {

  /** should we use streams (FILE*) or file descriptors (int), fo

  bool in_fd = false;

 

  /** should we use streams (FILE*) or file descriptors (int), fo

  bool out_fd = false;


 

  /** filename to open (instead of stdin) for input */ 

  std::string in_file = "";

 

  /** filename to open (instead of stdout) for output */ 

  std::string out_file = "";

 

  /** append to output file? */ 

  bool append = false;

 

  /** Is the user requesting a usage message? */ 

  bool usage = false;

};

 

/**

 * Parse the command-line arguments, and use them to populate the

 * object.

 *

 * @param argc The number of command-line arguments passed to the

 * @param argv The list of command-line arguments

 * @param args The struct into which the parsed args should go

 */ 

void parse_args(int argc, char **argv, arg_t &args) {

  int opt; // getopt() returns an int

  while ((opt = getopt(argc, argv, "aioI:O:h")) != -1) {

    switch (opt) {

    case 'a':

      args.append = true;

      break;

    case 'i':

      args.in_fd = true;

      break;

    case 'o':

      args.out_fd = true;

      break;

    case 'I':

      args.in_file = std::string(optarg);

      break;

    case 'O':

      args.out_file = std::string(optarg);

      break;

    case 'h':

      args.usage = true;


      break;

    }

  }

}

 

 

If you did the "refresher" tutorial, you should be familiar with getopt(). One thing to notice in this code is that we are going to be writing a program that supports all combinations of buffered and low-level input and output. We'll use C++ lambdas as a way to pass code as a parameter to functions, so that we don't have to write too many if statements to support all the interesting combinations.
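As a small illustration of passing code as a parameter, the sketch below hands a lambda to a function through std::function. The names for_each_chunk() and total_bytes() are hypothetical and do not appear in text_io.cc; only the callback signature (a pointer to text plus a byte count) matches the one used in this tutorial.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

/** Call cb once per chunk, passing a pointer to its text and its length */
void for_each_chunk(const std::vector<std::string> &chunks,
                    std::function<void(const char *, size_t)> cb) {
  for (const auto &chunk : chunks) {
    cb(chunk.c_str(), chunk.size());
  }
}

/** Add up the sizes of all chunks, using a lambda that captures a local */
size_t total_bytes(const std::vector<std::string> &chunks) {
  size_t total = 0;
  // [&] captures `total` by reference, so the lambda can update it even
  // though for_each_chunk() knows nothing about that variable
  for_each_chunk(chunks, [&](const char *text, size_t num) {
    assert(text != nullptr);
    total += num;
  });
  return total;
}
```

The function that drives the loop stays the same no matter what the callback does, which is the property the tutorial relies on to avoid a tangle of if statements.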

 

 

To read text from a stream, we will use the fgets() function. It will read up to n-1 bytes, and store them in a buffer. Each time we get data in this way, we'll pass it to a "callback" function, that was provided as a parameter. The callback needs to know the string and its length. We'll look at an example callback next.

 

/**

 * Read text from a file stream and pass it to a callback functio

 *

 * @param file The file stream to read from

 * @param cb   A callback function that operates on each line of

 *             the stream.  It expects to take a pointer to some

 *             number of valid bytes reachable from that pointer.

 */ 

void read_lines_file(FILE *file, std::function<void(const char *,

  // read data into this space on the stack 

  char buffer[16];

 

  // NB: fgets() will read at most sizeof(buffer)-1 bytes, so that

  //     as the last character (so that printf and friends will w

  // NB: fgets() returns nullptr on either EOF or an error.  We w

  // errors as a reason to stop reading 

  // NB: we don't always get a full 16 bytes on a fgets, so the c

  // write '\n'. Instead, it must print the '\n' characters

  // the input. 

  while (fgets(buffer, sizeof(buffer), file)) {

    cb(buffer, strlen(buffer));


  }

  // now check for an error before closing the file. 

  if (ferror(file)) {

    perror("read_lines_file::fgets()");

    // for errors on FILE*, we must clear the error when we're do

    clearerr(file);

  }

}
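The interaction between fgets() and a small buffer is easy to check in isolation. The sketch below is not part of text_io.cc: chunk_lengths() is a hypothetical helper that writes some text to a temporary file, reads it back with the same 16-byte buffer size, and records the length of each piece that a callback would receive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

/** Return the length of each piece that fgets() produces for `text` */
std::vector<size_t> chunk_lengths(const char *text) {
  assert(text != nullptr);
  std::vector<size_t> lengths;
  FILE *f = tmpfile();
  if (f == nullptr) {
    perror("chunk_lengths::tmpfile()");
    return lengths;
  }
  if (fputs(text, f) == EOF) {
    perror("chunk_lengths::fputs()");
  }
  rewind(f); // go back to the start so we can read what we wrote

  // Same buffer size as read_lines_file(): at most 15 bytes plus '\0'
  char buffer[16];
  while (fgets(buffer, sizeof(buffer), f)) {
    lengths.push_back(strlen(buffer));
  }
  if (ferror(f)) {
    perror("chunk_lengths::fgets()");
  }
  fclose(f);
  return lengths;
}
```

For the input "hello world, this is a long line\nshort\n", the recorded lengths are 15, 15, 3, and 6: the 33-byte first line arrives in three pieces, and only the last of those pieces ends in '\n'.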

 

 

 

When we write text using the stream interface (that is, when the file is represented as a FILE*, which is buffered), the expectation is that fputs() is passed an array of characters that is null terminated (it ends with a 0 byte). That being the case, we don't actually need the length of the string in our write_file() function… the stream interface is powerful enough to get it right.

 

/**

 * Write text to a file stream

 *

 * NB: Since we are assuming that we will receive text, we are as

 *     provided buffer will be null-terminated (it will end with

 *     there is no need to use the provided buffer size.

 *

 * @param file The file stream to write to

 * @param buffer The buffer of text to write to the stream

 * @param num The number of bytes in the buffer (ignored)

 */ 

void write_file(FILE *file, const char *buffer, size_t num) {

  // if fputs() returns EOF, there was an error, so print it then

  if (fputs(buffer, file) == EOF) {

    perror("write_file::fputs()");

    clearerr(file);

  }

}
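One consequence of relying on null termination is that fputs() cannot write data that contains a 0 byte. The following sketch (hypothetical helpers, not part of text_io.cc) writes the same bytes with fputs() and with fwrite(), and uses ftell() to report how far each call advanced the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

/** Write data with fputs() and return the resulting stream position */
long position_after_fputs(const char *data) {
  assert(data != nullptr);
  FILE *f = tmpfile();
  if (f == nullptr) {
    perror("position_after_fputs::tmpfile()");
    return -1;
  }
  // fputs() stops at the first '\0' it finds
  if (fputs(data, f) == EOF) {
    perror("position_after_fputs::fputs()");
  }
  long pos = ftell(f);
  fclose(f);
  return pos;
}

/** Write num bytes with fwrite() and return the resulting stream position */
long position_after_fwrite(const char *data, size_t num) {
  assert(data != nullptr);
  FILE *f = tmpfile();
  if (f == nullptr) {
    perror("position_after_fwrite::tmpfile()");
    return -1;
  }
  // fwrite() is told the length, so embedded '\0' bytes are written too
  if (fwrite(data, 1, num, f) != num) {
    perror("position_after_fwrite::fwrite()");
  }
  long pos = ftell(f);
  fclose(f);
  return pos;
}
```

Given the seven bytes "abc\0def", the fputs() version advances the stream by 3 and the fwrite() version by 7. That is fine for text, but it is the reason binary data needs a length.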

 

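Notice that there will be times where the callback receives a string that includes a newline (\n) character. If the stream is line buffered, as stdout is when it is attached to a terminal, the newline causes the stream to flush, and the output actually goes to the file. Otherwise, the output might stay in the buffer until the buffer fills or the stream is flushed or closed.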
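Whether a newline triggers a flush depends on the stream's buffering mode, which setvbuf() can set explicitly. The sketch below (hypothetical names, assuming Linux with glibc) puts a temporary file's stream in line-buffered mode and checks how many bytes the OS has received after writing text without a newline, and then with one.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/stat.h>

/** Sizes the OS reports at two points in the demo (hypothetical type) */
struct line_buffer_demo_t {
  off_t after_partial_line = -1;
  off_t after_newline = -1;
};

/** Number of bytes the OS currently holds for the file behind f */
off_t bytes_reached_os(FILE *f) {
  assert(f != nullptr);
  struct stat st;
  if (fstat(fileno(f), &st) < 0) {
    perror("bytes_reached_os::fstat()");
    return -1;
  }
  return st.st_size;
}

line_buffer_demo_t line_buffer_demo() {
  line_buffer_demo_t result;
  FILE *f = tmpfile();
  if (f == nullptr) {
    perror("line_buffer_demo::tmpfile()");
    return result;
  }
  // _IOLBF asks for line buffering: flush whenever a '\n' is written
  if (setvbuf(f, nullptr, _IOLBF, BUFSIZ) != 0) {
    perror("line_buffer_demo::setvbuf()");
  }
  fputs("abc", f); // no newline yet, so this should stay in the buffer
  result.after_partial_line = bytes_reached_os(f);
  fputs("def\n", f); // the newline should push "abcdef\n" to the OS
  result.after_newline = bytes_reached_os(f);
  fclose(f);
  return result;
}
```

On a glibc-based system this is expected to report 0 bytes after the partial line and 7 bytes after the newline. Replacing _IOLBF with _IOFBF would leave both values at 0 until the stream is flushed or closed.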


 

 

The low-level interface to files in Unix is through file descriptors. A file descriptor is an integer that the OS associates with a file that was explicitly opened by the program. Be sure to read the manual page for the read operation (you can type "man read" in Google, or run man 2 read in a terminal)… there are a lot of error conditions!
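To see that descriptors really are just small integers, the sketch below opens a file read-only and returns the descriptor, reporting failures through perror(). The helpers open_readonly() and close_fd() are hypothetical and not part of text_io.cc; they rely only on the POSIX calls open() and close().

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

/**
 * Open path read-only.  Returns a descriptor (>= 0) on success, or -1
 * after printing the reason for the failure.
 */
int open_readonly(const char *path) {
  assert(path != nullptr);
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    perror("open_readonly::open()");
  }
  return fd;
}

/** Close a descriptor from open_readonly(); returns true on success */
bool close_fd(int fd) {
  if (close(fd) < 0) {
    perror("close_fd::close()");
    return false;
  }
  return true;
}
```

The standard streams already have descriptors: STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO in <unistd.h> are 0, 1, and 2, so a newly opened file typically receives 3 or higher.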

 

/**

 * Read text from a file descriptor and pass it to a callback fun

 *

 * @param fd The file descriptor to read from

 * @param cb A callback function that operates on each line of te

 *           the user.  It expects to take a pointer to some text

 *           number of valid bytes reachable from that pointer.

 */ 

void read_lines_fd(int fd, std::function<void(const char *, size_

  // read data into this space on the stack 

  char buffer[12];

 

  // NB: read() may read as many bytes as we let it, and won't pu

  // end, so we will need to do that manually. 

  // NB: read() returns the number of bytes read. 0 means EOF.  

  // error. It won't always read the maximum possible, so be

  // We will treat errors as a reason to stop reading 

  // NB: if fd refers to a true file, then all errors are bad.  B

  //     to a network socket, then an EINTR error is actually OK,

  //     keep reading.  If you're reading from a socket, the belo

  // correct. 

  ssize_t bytes_read;

  while ((bytes_read = read(fd, buffer, sizeof(buffer) - 1)) > 0)

    // we read one less byte than we could, so that we can put a

    // end.  This is necessary because we *might* be calling back

    // that expects a null-terminated string.  Were it not for th

    // we could use this function to read binary and text data fr

    // descriptor. 

    buffer[bytes_read] = '\0';

    // NB: don't include the trailing null in the number of bytes

    // callback 

    cb(buffer, bytes_read);

  }

  // now check for an error before closing the file 


  if (bytes_read < 0) {

    perror("read_lines_fd::read()");

  }

}

 

 

Since we know that this code might call our write_file() code as its callback, we manually insert a \0 character into the buffer, so that the contents are a proper null-terminated string. Also, since we are only dealing with disk-based files for now, we treat all errors as a reason to stop. When we start writing network code, we'll discover that certain errors are recoverable (and normal).
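As a preview of what a recoverable error looks like, here is one possible way to wrap read() so that an EINTR error causes a retry instead of ending the loop. The helper name read_retry() is hypothetical, and this is only a sketch of the idea mentioned in the code comments above, not the approach used in text_io.cc.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

/**
 * Read up to capacity-1 bytes from fd into buffer, retrying if the call is
 * interrupted by a signal, and null-terminate whatever was read.
 *
 * Returns the number of bytes read, 0 on EOF, or -1 on any other error.
 */
ssize_t read_retry(int fd, char *buffer, size_t capacity) {
  assert(buffer != nullptr && capacity > 0);
  while (true) {
    // leave one byte of room for the '\0' terminator
    ssize_t bytes_read = read(fd, buffer, capacity - 1);
    if (bytes_read < 0 && errno == EINTR) {
      continue; // interrupted before any data arrived: just try again
    }
    if (bytes_read >= 0) {
      buffer[bytes_read] = '\0';
    }
    return bytes_read;
  }
}
```

A caller can use this exactly like the read() call in read_lines_fd(): keep looping while the result is positive, and treat a negative result as an error to report.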

 

 

Naturally, if we can read from a file descriptor, we can also write to one. In the following code, we use the write() function to write the data to a file. An important thing to notice is that sometimes there will be "short counts", where fewer bytes were written than we requested. Writing in a loop, and directly updating pointers to track the next byte to write, is a standard (and essential) practice.

 

/**

 * Write data (not exclusively text) to a file descriptor

 *

 * @param fd The file descriptor to write to

 * @param buffer The buffer of data to write to the file descriptor

 * @param num The number of bytes in the buffer

 */ 

void write_fd(int fd, const char *buffer, size_t num) {

  // as with read(), we may not write as many bytes as we intend,

  // track how many bytes have been written, and where to resume

  // have bytes left to write. 

  size_t bytes_written = 0;

  const char *next_byte = buffer;

  while (bytes_written < num) {

    ssize_t bytes = write(fd, next_byte, num - bytes_written);

    // negative bytes written indicates an error 

    if (bytes < 0) {

      // NB: errors on file descriptors don't need to be cleared 

      perror("write_fd::write()");

      // NB: as with read(), if fd is a socket, then we should be


      // EINTR and continuing when the error is EINTR. 

      return;

    }

    // otherwise, advance forward to the next bytes to write 

    else {

      bytes_written += bytes;

      next_byte += bytes;

    }

  }

}
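A common variation on this loop reports success or failure to the caller, and retries when write() fails with EINTR. The following is a sketch under those assumptions; the name write_all() is hypothetical, and the function is not part of text_io.cc.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <unistd.h>

/**
 * Write all num bytes of buffer to fd, looping over short counts.
 *
 * Returns true if every byte was written, false if write() failed with an
 * error other than EINTR.
 */
bool write_all(int fd, const char *buffer, size_t num) {
  assert(buffer != nullptr || num == 0);
  size_t bytes_written = 0;
  while (bytes_written < num) {
    ssize_t bytes = write(fd, buffer + bytes_written, num - bytes_written);
    if (bytes < 0) {
      if (errno == EINTR) {
        continue; // interrupted by a signal: retry the same range
      }
      perror("write_all::write()");
      return false;
    }
    // a short count simply means we go around the loop again
    bytes_written += bytes;
  }
  return true;
}
```

Returning a bool lets the caller decide what to do after a failure, whereas write_fd() above prints the error and returns without telling its caller.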

 

 

 

Our main() function is going to do a lot here… It will use stdin and stdout unless specific input and output files are given, in which case it will create new FILE* buffered streams for the given files. It will also set up file descriptors for the same files, using fileno() for stdin and stdout and open() for named files (note: never use a file descriptor and stream for the same file at the same time), and then it runs the requested combination of read and write techniques:

 

int main(int argc, char **argv) {

  arg_t args;

  parse_args(argc, argv, args);

 

  // if help was requested, give help, then quit 

  if (args.usage) {

    usage(argv[0]);

    return 0;

  }

 

  // set up default input file 

  FILE *in_stream = stdin;

  int in_fd = fileno(stdin);

 

  // set up default output file 

  FILE *out_stream = stdout;

  int out_fd = fileno(stdout);

 

  // Should we open an input file? If so, do it in read-only mod

  if (args.in_file != "") {

    in_stream = fopen(args.in_file.c_str(), "r");


    if (in_stream == nullptr) {

      perror("fopen(in_file)");

      return -1;

    }

    in_fd = open(args.in_file.c_str(), O_RDONLY);

    if (in_fd < 0) {

      perror("open(in_file)");

      return -1;

    }

  }

 

  // Should we open an output file? If so, should it be write-on

  // append-only? Note that the file mode will be 700 

  if (args.out_file != "") {

    out_stream = fopen(args.out_file.c_str(), args.append ? "a" :

    if (out_stream == nullptr) {

      perror("fopen(out_file)");

      return -1;

    }

    if (args.append)

      out_fd =

          open(args.out_file.c_str(), O_WRONLY | O_CREAT | O_APPE

    else 

      out_fd = open(args.out_file.c_str(), O_WRONLY | O_CREAT, S_

    if (out_fd < 0) {

      perror("open(out_file)");

      return -1;

    }

  }

 

  // Create C++ lambdas to hide differences between writing to a

  // writing to a file descriptor. 

  std::function<void(const char *, size_t)> print_stream =

      [&](const char *buf, size_t num) { write_file(out_stream, b

  std::function<void(const char *, size_t)> print_fd =

      [&](const char *buf, size_t num) { write_fd(out_fd, buf, nu

 

  // Dispatch to the file descriptor or stream version of reading

  // appropriate writing function 

  if (args.in_fd) {

    read_lines_fd(in_fd, args.out_fd ? print_fd : print_stream);

  } else {

    read_lines_file(in_stream, args.out_fd ? print_fd : print_str


  }

 

  // only close the input file if it wasn't stdin 

  if (args.in_file != "") {

    if (close(in_fd) < 0) {

      perror("close(in_fd)");

    }

    if (fclose(in_stream) < 0) {

      perror("fclose(in_stream)");

    }

  }

 

  // only close the output file if it wasn't stdout 

  if (args.out_file != "") {

    if (close(out_fd) < 0) {

      perror("close(out_fd)");

    }

    if (fclose(out_stream) < 0) {

      perror("fclose(out_stream)");

    }

  }

}
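The warning about mixing a stream and a descriptor on the same file can be demonstrated directly. In the sketch below (hypothetical helper mixed_write_order(), assuming Linux with glibc), text written through the buffered stream sits in memory while a later write() on the underlying descriptor goes straight to the file, so the two pieces land in the opposite order from the calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <unistd.h>

/** Write via a stream, then via its descriptor, and return the file contents */
std::string mixed_write_order() {
  std::string contents;
  FILE *f = tmpfile();
  if (f == nullptr) {
    perror("mixed_write_order::tmpfile()");
    return contents;
  }
  // Full buffering, so the fputs() below stays in the library's buffer
  if (setvbuf(f, nullptr, _IOFBF, BUFSIZ) != 0) {
    perror("mixed_write_order::setvbuf()");
  }
  fputs("first\n", f); // buffered: nothing reaches the file yet
  // This bypasses the stream's buffer and reaches the file immediately
  if (write(fileno(f), "second\n", 7) != 7) {
    perror("mixed_write_order::write()");
  }
  fflush(f); // only now does "first\n" reach the file, after "second\n"

  // Read the file back from the beginning
  rewind(f);
  char buffer[64];
  size_t n = fread(buffer, 1, sizeof(buffer), f);
  assert(n <= sizeof(buffer));
  contents.assign(buffer, n);
  fclose(f);
  return contents;
}
```

Under these assumptions the file ends up containing "second\nfirst\n". The main() function above avoids the problem by using, for each file, only the stream or only the descriptor during a given run.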

 

 

There are a few important things to notice. The first is that we can append just by changing how we open the file for writing. If we open it in "w" mode, the file gets truncated to zero bytes. If we open it in "a" mode, the file contents remain, and we write new content to the end. The second is that the syntax for C++ lambdas is a little bit confusing, but they are incredibly powerful. Notice, for example, how print_stream() "captures" out_stream. That technique is one that you will definitely want to understand, not just for C++ but for almost every modern programming language.
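The difference between "w" and "a" is easy to verify with a short experiment. The sketch below is independent of text_io.cc; write_text() and write_twice() are hypothetical helpers, and the path passed in is assumed to be a scratch file that is safe to overwrite and delete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

/** Open path with the given mode, write text, and close; true on success */
bool write_text(const char *path, const char *mode, const char *text) {
  FILE *f = fopen(path, mode);
  if (f == nullptr) {
    perror("write_text::fopen()");
    return false;
  }
  bool ok = (fputs(text, f) != EOF);
  if (fclose(f) == EOF) {
    ok = false;
  }
  return ok;
}

/**
 * Write "one\n" in "w" mode, then "two\n" using second_mode, and return
 * what the file holds afterwards.  The scratch file is removed at the end.
 */
std::string write_twice(const char *path, const char *second_mode) {
  assert(path != nullptr && second_mode != nullptr);
  std::string contents;
  if (!write_text(path, "w", "one\n") ||
      !write_text(path, second_mode, "two\n")) {
    return contents;
  }
  FILE *f = fopen(path, "r");
  if (f == nullptr) {
    perror("write_twice::fopen()");
    return contents;
  }
  char buffer[64];
  size_t n = fread(buffer, 1, sizeof(buffer), f);
  contents.assign(buffer, n);
  fclose(f);
  remove(path);
  return contents;
}
```

With second_mode set to "w", the result is "two\n" because the second open truncated the file; with "a", the result is "one\ntwo\n".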

 

 

The impact of buffering is really quite amazing. In the following example, we will create a file with 2^20 lines (one line, doubled 20 times). Then we will read and write it in each of the four combinations, and time the result:

 

echo "hello world hello world hello world" >> in.dat

for i in `seq 20`; do cat in.dat in.dat > out.dat; mv out.dat in.


rm -f out.dat; time obj64/text_io.exe -I in.dat -O out.dat

   real   0m0.650s

   user   0m0.260s

   sys    0m0.340s

rm -f out.dat; time obj64/text_io.exe -I in.dat -O out.dat -i

   real   0m2.131s

   user   0m0.330s

   sys    0m1.770s

rm -f out.dat; time obj64/text_io.exe -I in.dat -O out.dat -o

   real   0m3.609s

   user   0m0.570s

   sys    0m3.010s

rm -f out.dat; time obj64/text_io.exe -I in.dat -O out.dat -o -i

   real   0m4.576s

   user   0m0.730s

   sys    0m3.800s

rm -f out.dat in.dat

 

 

(Note: if the experiments are taking a really long time, try deleting in.dat and then using a smaller argument to seq, such as 16.)

The difference in time is amazing! When we used buffering for input and output (the first case), our code was about 7 times faster than when we didn't buffer either (0.650s versus 4.576s). Buffering writes had the most impact, but buffering reads mattered too. In short, the power that the low-level interface provides is something that needs to be understood carefully… low-level doesn't always guarantee performance.
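A back-of-the-envelope calculation helps explain these numbers. Each line of in.dat is 36 bytes (35 characters plus the newline), and doubling a one-line file 20 times gives 2^20 lines, so the file holds 37,748,736 bytes. The sketch below estimates how many read() calls are needed to consume that file 11 bytes at a time, as read_lines_fd() does with its 12-byte buffer, versus 4096 bytes at a time. The 4096-byte figure is an assumption about the stream's internal buffer size, which varies by platform, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

/** Calls needed to transfer `total` bytes at no more than `chunk` per call */
constexpr size_t calls_needed(size_t total, size_t chunk) {
  return (total + chunk - 1) / chunk; // integer division, rounded up
}

/** Size in bytes of the tutorial's in.dat after `doublings` rounds of cat */
size_t in_dat_size(unsigned doublings) {
  assert(doublings < 32); // keep the shift well within range
  const char *line = "hello world hello world hello world\n";
  // every round of `cat in.dat in.dat` doubles the file
  return strlen(line) << doublings;
}
```

Under those assumptions, calls_needed(in_dat_size(20), 11) is 3,431,704, while calls_needed(in_dat_size(20), 4096) is 9,216. The estimate ignores short reads and the final call that returns 0, but it is consistent with most of the unbuffered time being reported as sys time.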

 

 

So far, we only looked at text-based input and output for streams. If we had put a "b" at the end of the mode specifier to fopen(), we could have opened the file for binary access, but then we would need to use different functions to read (fread()) and write (fwrite()) data. Amusingly, for the file descriptor interface, it is so low level that there is no distinction between text and binary I/O… everything is binary!
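For comparison, here is a minimal sketch of binary stream I/O with fwrite() and fread(). The helper round_trip() is hypothetical: it writes a vector of 32-bit integers to a temporary file in the machine's native byte order and reads them back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

/** Write values to a temporary binary file, then read them back */
std::vector<int32_t> round_trip(const std::vector<int32_t> &values) {
  std::vector<int32_t> result(values.size());
  // tmpfile() opens the file in binary update mode, as if by "wb+"
  FILE *f = tmpfile();
  if (f == nullptr) {
    perror("round_trip::tmpfile()");
    return {};
  }
  // fwrite() takes an element size and an element count, not a string
  size_t written = fwrite(values.data(), sizeof(int32_t), values.size(), f);
  if (written != values.size()) {
    perror("round_trip::fwrite()");
  }
  rewind(f);
  // fread() returns how many whole elements it managed to read
  size_t got = fread(result.data(), sizeof(int32_t), result.size(), f);
  assert(got <= result.size());
  result.resize(got);
  fclose(f);
  return result;
}
```

Because the bytes are copied exactly as they sit in memory, a file written this way is only portable between machines with the same integer size and byte order.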

There is much more to consider. For example, both techniques for using files support "seek" operations on binary data, where we can skip to a certain point and start reading or writing. "Append" mode is roughly equivalent to opening for writing (without truncating) and seeking to the end; with O_APPEND, the OS repositions to the end of the file before every write. Structuring data to take advantage of seeking is a huge optimization, especially in databases. You should be sure to read the man pages for lseek and fseek, and write some code to try them out. When you do, the int_ops.exe file will be useful, since you can use it to make a large binary file of sorted integers.
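As a starting point for those experiments, the sketch below uses lseek() to jump straight to the i-th 32-bit integer in a binary file and read only that value. The helper read_int_at() is hypothetical, and it assumes the file is a packed array of int32_t values in native byte order; the actual format produced by int_ops.exe is not shown in this tutorial, so adjust accordingly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unistd.h>

/**
 * Read the integer at position `index` (counting from 0) in the file
 * behind fd.  Returns false on error or if the file is too short; `out`
 * is only modified on success.
 */
bool read_int_at(int fd, size_t index, int32_t &out) {
  assert(fd >= 0);
  // Skip directly to the byte offset of the requested element
  off_t offset = static_cast<off_t>(index * sizeof(int32_t));
  if (lseek(fd, offset, SEEK_SET) == static_cast<off_t>(-1)) {
    perror("read_int_at::lseek()");
    return false;
  }
  // read() may return a short count, so loop until we have all the bytes
  int32_t value = 0;
  char *dest = reinterpret_cast<char *>(&value);
  size_t got = 0;
  while (got < sizeof(value)) {
    ssize_t n = read(fd, dest + got, sizeof(value) - got);
    if (n < 0) {
      perror("read_int_at::read()");
      return false;
    }
    if (n == 0) {
      return false; // hit EOF before a whole integer was available
    }
    got += n;
  }
  out = value;
  return true;
}
```

Since the integers are sorted, a helper like this is all a binary search needs: each probe costs one seek and one small read, no matter how large the file is.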

© 2021 by Michael Spear. All rights reserved.