Chapter 28 Multimedia
Streaming means a user can listen to (or watch) a file while it is still being downloaded. On-demand audio/video [streaming stored audio/video]: Files are compressed and stored on a server; a client downloads the files through the Internet. Ex. famous lectures. Streaming live audio/video: A user listens to broadcast audio and video through the Internet. Ex. Internet radio. Interactive audio/video: People use the Internet to communicate with one another interactively. Ex. Internet telephony and Internet teleconferencing. Figure 28.1 Internet audio/video
Before sending audio or video signals on the Internet, they need to be digitized. Digitizing audio: Sound fed into a microphone generates an electronic analog signal, called an analog audio signal. The analog signal is digitized to produce a digital signal. Voice is sampled at 8000 samples per second with 8 bits per sample, resulting in a digital signal of 64 kbps. Music is sampled at 44,100 samples per second with 16 bits per sample, resulting in a digital signal of 705.6 kbps for monaural and 1.411 Mbps for stereo. Digitizing video: Video consists of a sequence of frames. If the frames are displayed on the screen fast enough, we get an impression of motion; there is no single standard number of frames per second. To avoid a condition known as flickering, a frame needs to be refreshed, so the TV industry repaints each frame twice. Each frame is divided into small grids, called picture elements or pixels. Digitizing audio and video
For black-and-white TV, each 8-bit pixel represents one of 256 different gray levels. For color TV, each pixel is 24 bits, with 8 bits for each primary color (red, green, and blue). At a resolution of 1024 x 768 pixels, with 25 frames per second and each frame repainted twice, we need 2 x 25 x 1024 x 768 x 24 = 944 Mbps. This data rate needs a very high-data-rate technology such as SONET. Compression is needed to send video over the Internet. For speech, we need to compress a 64-kbps digitized signal; for music, a 1.411-Mbps signal.
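The rates above follow directly from sample rate x bits per sample (x channels) for audio, and repaints x frames x pixels x bits per pixel for video. A quick arithmetic check (the helper names are illustrative, not from the text):

```python
# Back-of-the-envelope data rates for digitized audio and video.

def audio_rate_bps(samples_per_sec, bits_per_sample, channels=1):
    return samples_per_sec * bits_per_sample * channels

def video_rate_bps(width, height, bits_per_pixel, frames_per_sec, repaints=2):
    # the TV industry repaints each frame twice to avoid flickering
    return repaints * frames_per_sec * width * height * bits_per_pixel

voice = audio_rate_bps(8000, 8)                    # 64,000 bps = 64 kbps
cd_stereo = audio_rate_bps(44100, 16, channels=2)  # 1,411,200 bps = 1.411 Mbps
tv = video_rate_bps(1024, 768, 24, 25)             # 943,718,400 bps = 944 Mbps
print(voice, cd_stereo, tv)
```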
Audio compression: Used for speech or music. Predictive encoding: The differences between samples are encoded instead of all the sampled values; normally used for speech. Ex. GSM (13 kbps), G.729 (8 kbps), G.723.1 (6.4 or 5.3 kbps). Perceptual encoding: Used for CD-quality audio. Ex. MP3 (MPEG audio layer 3). Some sounds can mask other sounds; masking can happen in frequency and in time. In frequency masking, a loud sound in one frequency range can partially or totally mask a softer sound in another frequency range. In temporal masking, a loud sound can numb our ears for a short time even after the sound has stopped. MP3 uses both frequency and temporal masking: zero bits are allocated to frequency ranges that are totally masked, a small number of bits to frequency ranges that are partially masked, and a larger number of bits to frequency ranges that are not masked. Data rates: 96 kbps, 128 kbps, and 160 kbps; the rate is based on the range of frequencies in the original analog audio. Audio and video compression
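The idea behind predictive encoding can be sketched in a few lines: keep the first sample and then store only the difference from the previous sample, which is small for smooth speech. Real codecs such as GSM and G.729 are far more sophisticated; this only shows the principle.

```python
# Minimal predictive (differential) encoding sketch.

def predictive_encode(samples):
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - prev)     # differences need fewer bits than raw values
    return out

def predictive_decode(encoded):
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)    # rebuild each sample from the previous one
    return out

samples = [100, 102, 103, 103, 101, 98]
enc = predictive_encode(samples)   # [100, 2, 1, 0, -2, -3]
print(enc, predictive_decode(enc) == samples)
```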
Video compression: We can compress video by first compressing images. Joint Photographic Experts Group (JPEG): used to compress images. Moving Picture Experts Group (MPEG): used to compress video. Image compression, JPEG: A gray-scale picture is divided into blocks of 8 x 8 pixels (256 levels). If the picture is in color, each pixel can be represented by 24 bits (3 x 8 bits), with each 8 bits representing red, green, or blue (RGB). The idea is to change the picture into a linear (vector) set of numbers that reveals the redundancies; the redundancies can then be removed by using one of the text compression methods. Figure 28.2 JPEG gray scale
Discrete Cosine Transform (DCT): Each block of 64 pixels goes through a transformation called the DCT. The transformation changes the 64 values so that the relative relationships between pixels are kept but the redundancies are revealed. Figure 28.3 JPEG process
Figure 28.4 Case 1: uniform gray scale. In this case, we have a block of uniform gray, and the value of each pixel is 20. When we do the transformation, we get a nonzero value for the first element; the rest of the elements have a value of 0. The value of T(0,0) is the average (multiplied by a constant) of the other values and is called the dc (direct current) value. The rest of the values in T(m,n), called ac values, represent changes in the pixel values.
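Case 1 can be checked directly with a naive implementation of the 8 x 8 DCT-II (the JPEG forward transform): a uniform block of 20s yields a single nonzero dc value, T(0,0) = 160, and ac values that are all (numerically) zero.

```python
import math

# Naive 8x8 forward DCT (DCT-II), written out from the textbook formula.
def dct_8x8(block):
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    T = [[0.0] * 8 for _ in range(8)]
    for m in range(8):
        for n in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * m * math.pi / 16)
                          * math.cos((2 * y + 1) * n * math.pi / 16))
            T[m][n] = 0.25 * c(m) * c(n) * s
    return T

uniform = [[20] * 8 for _ in range(8)]     # Case 1: every pixel is 20
T = dct_8x8(uniform)
dc = round(T[0][0])                        # average times a constant
ac_all_zero = all(abs(T[m][n]) < 1e-9
                  for m in range(8) for n in range(8) if (m, n) != (0, 0))
print(dc, ac_all_zero)
```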
Figure 28.5 Case 2: two sections. We have a block with two different uniform gray-scale sections, so there is a sharp change in the values of the pixels (from 20 to 50). After the transformation, we get a nonzero dc value and a few nonzero ac values clustered around it; most of the values are 0.
Figure 28.6 Case 3: gradient gray scale. We have a block that changes gradually; there is no sharp change between the values of neighboring pixels. We get a dc value and many nonzero ac values as well.
The transformation creates table T from table P. The dc value is the average value (multiplied by a constant) of the pixels; the ac values are the changes. Lack of change in neighboring pixels creates 0s. Quantization: After table T is created, the values are quantized to reduce the number of bits needed for encoding. In simple quantization, we drop the fraction from each value and keep the integer part. Here, we first divide the number by a constant and then drop the fraction, which further reduces the required number of bits. The divisor depends on the position of the value in the T table; this is done to optimize the number of bits and the number of 0s for each particular application. Quantization is an irreversible process: the information lost is not recoverable. This is why JPEG is a lossy compression method.
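The divide-and-drop-the-fraction step can be shown on a toy table. The divisor table below is made up for illustration; JPEG defines standard quantization tables, with larger divisors toward the bottom-right so that high-frequency coefficients become 0.

```python
# Quantization sketch: divide each coefficient by a position-dependent
# constant and drop the fraction. The tables here are illustrative only.

def quantize(T, Q):
    return [[int(t / q) for t, q in zip(trow, qrow)]
            for trow, qrow in zip(T, Q)]

T = [[160.0, 12.0],
     [  7.0,  3.0]]     # toy 2x2 "DCT output"
Q = [[10, 16],
     [16, 24]]          # made-up divisors, larger toward the bottom-right
quantized = quantize(T, Q)
print(quantized)        # [[16, 0], [0, 0]] -- the small ac values are lost
```

Multiplying back by Q gives [[160, 0], [0, 0]], not the original table: the process is irreversible, which is exactly why JPEG is lossy.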
Compression After quantization, the values are read from the table, and redundant 0s are removed. To cluster the 0s together, the table is read diagonally in a zigzag fashion rather than row by row or column by column. The reason is that if the picture does not have fine changes, the bottom right corner of the T table is all 0s.
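The zigzag reading order can be sketched as a walk along the anti-diagonals, alternating direction, so that the 0s in the bottom-right corner end up as one long run at the end of the sequence (shown here on a 3 x 3 table for brevity):

```python
# Zigzag reading of a quantized table: traverse anti-diagonals, alternating
# direction, so bottom-right zeros cluster at the end of the output.

def zigzag(table):
    n = len(table)
    out = []
    for d in range(2 * n - 1):                       # one anti-diagonal per pass
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        if d % 2 == 0:
            cells.reverse()                          # alternate direction
        out.extend(table[i][j] for i, j in cells)
    return out

t = [[1, 2, 6],
     [3, 5, 7],
     [4, 8, 9]]     # values placed so the zigzag reads them in order
print(zigzag(t))    # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```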
Video compression: MPEG. The Moving Picture Experts Group (MPEG) standard is used to compress video. A frame is a spatial combination of pixels, and a video is a temporal combination of frames that are sent one after another. Compressing video therefore means spatially compressing each frame and temporally compressing a set of frames. Spatial compression: Done with JPEG; each frame is a picture that can be independently compressed. Temporal compression: Redundant frames are removed; most consecutive frames are almost the same. When someone is talking, most of a frame is the same as the previous one except for the segment around the lips, which changes from one frame to the next. To temporally compress data, MPEG divides frames into three categories: I-frames, P-frames, and B-frames. Figure 28.8 MPEG frames
I-frames: An intracoded frame is an independent frame that is not related to any other frame (before or after). I-frames are sent periodically [e.g. every 9th frame is an I-frame]; they handle sudden changes in the frame that the previous and next frames cannot show, and they are useful for someone who tunes in at any time. P-frames: A predicted frame is related to the preceding I-frame or P-frame; it carries only the changes, so it needs very few bits after compression. B-frames: A bidirectional frame is related to the preceding and following I-frame or P-frame; it is never related to another B-frame. MPEG-1 was designed for CD-ROM with a data rate of 1.5 Mbps; MPEG-2 for high-quality DVD with a data rate of 3 to 6 Mbps. Figure 28.9 MPEG frame construction
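The frame categories above repeat in a fixed group pattern. A minimal sketch, assuming the common 9-frame pattern "I B B P B B P B B" (the exact pattern is encoder-dependent, not fixed by the standard):

```python
# Classify frames by position, assuming a 9-frame group I B B P B B P B B.
PATTERN = ["I", "B", "B", "P", "B", "B", "P", "B", "B"]

def frame_type(i):
    return PATTERN[i % len(PATTERN)]       # every 9th frame is an I-frame

print([frame_type(i) for i in range(12)])
# ['I', 'B', 'B', 'P', 'B', 'B', 'P', 'B', 'B', 'I', 'B', 'B']
```

Note that because a B-frame depends on the frame that follows it, frames are transmitted and decoded in a different order than they are displayed.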
Streaming stored audio/video. First approach: using a Web server. Compressed audio/video can be downloaded like a text file: the client (browser) uses the services of HTTP and sends a GET message to download the file, and the Web server sends the compressed file to the browser. To play it, we can use an application such as a media player. This approach does not involve streaming: the file is big [audio: tens of megabits; video: hundreds of megabits] and must be downloaded completely before playing, so the user waits seconds or tens of seconds before the file can be played. Figure 28.10 Using a Web server
In the second approach, the media player is directly connected to the Web server for downloading the audio/video file. The Web server stores two files: the actual audio/video file and a metafile that holds information about the audio/video file. The HTTP client accesses the Web server using a GET message; the metafile comes in the response and is passed to the media player. The media player uses the URL in the metafile to access the audio/video file, and the Web server responds. Figure 28.11 Second approach: using a Web server with a metafile
The browser and the media player support HTTP, which runs over TCP. This is appropriate for the metafile but not for the audio/video file: retransmission of a lost or damaged segment is against the concept of streaming, so we need to dismiss TCP and its error control and move to UDP. The HTTP client accesses the Web server using a GET message; the metafile comes in the response and is passed to the media player. The media player uses the URL in the metafile to access the media server and download the file; downloading can take place by any protocol that uses UDP. The media server responds. Figure 28.12 Third approach: using a media server
The Real-Time Streaming Protocol (RTSP) is a control protocol designed to add more functionality to the streaming process. Using RTSP, we can control the playing of audio/video. RTSP is an out-of-band control protocol, similar to the second connection in FTP. Figure 28.13 Fourth approach: using a media server and RTSP
The HTTP client accesses the Web server using a GET message; the metafile comes in the response and is passed to the media player. The media player sends a SETUP message to create a connection with the media server, and the media server responds. The media player then sends a PLAY message to start playing (downloading); the audio/video file is downloaded using another protocol that runs over UDP. The connection is broken using the TEARDOWN message, and the media server responds. The media player can send other types of messages as well; for example, a PAUSE message temporarily stops the downloading, and downloading can be resumed with a PLAY message.
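The SETUP/PLAY/TEARDOWN exchange above can be sketched as plain text, since RTSP (RFC 2326) is text-based like HTTP. The URL, port range, and session id below are illustrative, not from a real server:

```python
# Build RTSP request messages as text (RTSP is text-based, like HTTP).

def rtsp_request(method, url, cseq, headers=None):
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n"   # blank line ends the request

url = "rtsp://media.example.com/lecture.mp3"   # hypothetical media server
setup = rtsp_request("SETUP", url, 1,
                     {"Transport": "RTP/AVP;unicast;client_port=8000-8001"})
play = rtsp_request("PLAY", url, 2, {"Session": "12345678"})
teardown = rtsp_request("TEARDOWN", url, 3, {"Session": "12345678"})
print(setup.splitlines()[0])
# SETUP rtsp://media.example.com/lecture.mp3 RTSP/1.0
```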
Streaming live audio/video is similar to the broadcasting of audio and video by radio and TV stations. Streaming stored audio/video and streaming live audio/video are both sensitive to delay, and neither can accept retransmission. However, in stored audio/video the communication is unicast and on-demand, while in live audio/video the communication is multicast and live. Live streaming is better suited to the multicast services of IP and the use of protocols such as UDP and RTP.
People communicate with one another in real time. Ex. Internet phone or Voice over IP. Characteristics: Time relationship: Real-time data on a packet-switched network require the preservation of the time relationship between the packets of a session. As long as this time relationship is preserved, a consistent (fixed) delay is not a problem; it is variation in the delay that causes trouble. Real-time interactive audio/video
Figure 28.15 Jitter. When the video is viewed at the receiver site, there is a gap between the first and second packets and between the second and third. This phenomenon is called jitter. Jitter is introduced in real-time data by variations in the delay between packets.
One solution to jitter is the use of a timestamp. If each packet carries a timestamp that shows the time it was produced relative to the first (or previous) packet, the receiver can add this time to the time at which it starts the playback. If the receiver starts playing back the first packet at 00:00:08, the second will be played at 00:00:18 and the third at 00:00:28; there are no gaps between the packets. Figure 28.16 Timestamp
To be able to separate the arrival time from the playback time, we need a buffer to store the data until they are played back; this buffer is referred to as a playback buffer, and it is required for real-time traffic. When a session begins (the first bit of the first packet arrives), the receiver delays playing the data until a threshold is reached. The threshold is measured in time units of data; replay does not start until the amount of buffered data equals the threshold value. Data are stored in the buffer at a possibly variable rate but are extracted and played back at a fixed rate. Figure 28.17 Playback buffer. The amount of data in the buffer shrinks or expands, but as long as the delay is less than the time to play back the threshold amount of data, there is no jitter.
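The buffering idea can be simulated in a few lines: packets arrive with jitter, playback begins only after a threshold number of packets is buffered, and then proceeds at a fixed rate. Times are in arbitrary units and the numbers are illustrative:

```python
# Playback-buffer sketch: delay playback until a threshold is buffered,
# then play at a fixed rate; jitter is absorbed as long as no packet
# arrives after its scheduled playback time.

def playback_start_time(arrival_times, threshold_packets):
    # playback begins when `threshold_packets` packets have arrived
    return sorted(arrival_times)[threshold_packets - 1]

def playback_ok(arrival_times, threshold_packets, interval):
    """True if every packet arrives before its scheduled playback time."""
    start = playback_start_time(arrival_times, threshold_packets)
    schedule = [start + i * interval for i in range(len(arrival_times))]
    return all(a <= s for a, s in zip(sorted(arrival_times), schedule))

arrivals = [0, 12, 19, 33, 41, 48]        # jittery arrivals, nominal gap 10
print(playback_start_time(arrivals, 3))   # 19: wait until 3 packets buffered
print(playback_ok(arrivals, 3, 10))       # True: the buffer absorbs the jitter
```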
Ordering: A sequence number is used to order the packets; it is also needed to identify lost packets. Multicasting: Multicasting plays a primary role in audio and video conferencing: the traffic can be heavy, and the data are distributed using multicasting methods. Conferencing requires two-way communication between receivers and senders, so real-time traffic needs the support of multicasting. Translation: Translation means changing the encoding of a payload to a lower quality to match the bandwidth of the receiving network. A translator is a computer that can change the format of a high-bandwidth video signal to a lower-quality, narrow-bandwidth signal. For example, if a source creates a high-quality video signal at 5 Mbps and sends it to a recipient whose bandwidth is less than 1 Mbps, a translator is needed to decode the signal and encode it again at a lower quality that needs less bandwidth.
Mixing • To reduce the traffic to one stream, data from different sources can be mixed into one stream. • A mixer mathematically adds signals coming from different sources to create one single signal. • Support from the transport layer protocol • All the characteristics discussed could be implemented in the application layer. However, they are so common in real-time applications that implementation in the transport layer protocol is preferable. • TCP: Does not support timestamping or multicasting; its retransmission and error control are not suitable for real-time traffic, although it does provide sequencing. • Retransmission upsets the whole idea of timestamping and playback; because there is much redundancy in audio and video signals (even with compression), we can simply ignore a lost packet. • UDP: More suitable; supports multicasting and has no retransmission strategy, but makes no provision for timestamping, sequencing, or mixing. • UDP in conjunction with a new transport protocol, the Real-time Transport Protocol (RTP), is suitable for real-time traffic on the Internet.
The Real-time Transport Protocol (RTP) is designed to handle real-time traffic on the Internet. RTP stands between UDP and the application program. RTP does not itself have a delivery mechanism (multicasting, port numbers, and so on); it provides timestamping, sequencing, and mixing facilities. RTP is a transport layer protocol, but it is encapsulated in a UDP user datagram rather than directly in an IP datagram. No well-known port is assigned to RTP; the port can be selected on demand, with only one restriction [an even-numbered port must be selected]. Figure 28.18 RTP
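The timestamping and sequencing services live in the fixed 12-byte RTP header defined in RFC 3550 (version, payload type, sequence number, timestamp, and synchronization source). A sketch of packing and unpacking it, with made-up field values:

```python
import struct

# Pack/unpack the fixed 12-byte RTP header (RFC 3550), simplest case:
# no padding, no extension, no CSRC list.

def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=0):
    byte0 = 2 << 6                               # version 2, P=X=0, CC=0
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

def unpack_rtp_header(data):
    byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {"version": byte0 >> 6, "payload_type": byte1 & 0x7F,
            "seq": seq, "timestamp": ts, "ssrc": ssrc}

# illustrative values: payload type 0 is PCM audio, timestamp in sample units
hdr = pack_rtp_header(payload_type=0, seq=7, timestamp=160, ssrc=0xDEADBEEF)
print(len(hdr), unpack_rtp_header(hdr))
```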
The Real-time Transport Control Protocol (RTCP) uses an odd-numbered temporary port that follows the RTP port. RTP allows only one type of message, one that carries data from source to destination; messages meant to control the flow and quality of data and to allow the recipient to send feedback to the source or sources are handled by RTCP. Sender report: Periodically sent by the active senders in a conference to report transmission and reception statistics for all RTP packets sent during the interval. It uses an absolute timestamp, the number of seconds elapsed since midnight, January 1, 1970; the absolute timestamp allows the receiver to synchronize different RTP messages. Figure 28.19 RTCP message types
Receiver report: For passive participants that do not send RTP packets; the report informs the sender and other receivers about the quality of service. Source description message: The source periodically sends a source description message to give additional information about itself, such as the name, e-mail address, telephone number, and address of the owner or controller of the source. Bye message: A source sends a bye message to shut down a stream; it is an announcement by the source that it is leaving the conference. Although other sources can detect the absence of a source, this message is a direct announcement, and it is also very useful to a mixer. Application-specific message: A packet for an application that wants to use new message types not defined in the standard.
Voice over IP uses the Internet as a telephone network with some additional capabilities: instead of communicating over a circuit-switched network, this application allows communication between two parties over the packet-switched Internet. Two protocols are used: SIP and H.323. Session Initiation Protocol (SIP): Designed by the IETF, SIP is an application layer protocol that establishes, manages, and terminates a multimedia session (call). It can be used to create two-party, multiparty, or multicast sessions, and it is independent of the underlying transport layer protocol [TCP or UDP]. Messages: SIP is a text-based protocol like HTTP and has six messages. Each message has a header and a body; the header consists of several lines that describe the structure of the message, the caller's capability, the media type, and so on. Figure 28.20 SIP messages Voice over IP
The caller initializes a session with an INVITE message. After the callee answers the call, the caller sends an ACK message for confirmation. The BYE message terminates a session. The OPTIONS message queries a machine about its capabilities. The CANCEL message cancels an already started initialization process. The REGISTER message makes a connection when the callee is not available. Addresses: SIP is very flexible; an e-mail address, an IP address, a telephone number, and other types of addresses can be used to identify the sender and receiver. However, the address needs to be in SIP format (also called a scheme). Figure 28.21 SIP formats
A simple SIP session consists of three modules: establishing, communicating, and terminating. Establishing: A three-way handshake; the caller sends an INVITE message, over UDP or TCP, to begin the communication; after the callee replies, the caller confirms with an ACK. Communicating: After the session has been established, the caller and callee communicate using two temporary ports. Terminating: The session can be terminated with a BYE message sent by either party. Figure 28.22 SIP simple session
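Since SIP (RFC 3261) is text-based like HTTP, the INVITE that opens the handshake can be sketched as plain text. The addresses and Call-ID below are illustrative only:

```python
# Build a SIP INVITE message as text (SIP is text-based, like HTTP).

def sip_invite(caller, callee, call_id):
    lines = [
        f"INVITE sip:{callee} SIP/2.0",   # request line: method, address, version
        f"From: sip:{caller}",
        f"To: sip:{callee}",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        "Content-Type: application/sdp",  # the body would describe the media
    ]
    return "\r\n".join(lines) + "\r\n\r\n"

# hypothetical caller and callee addresses, in SIP (scheme) format
msg = sip_invite("alice@example.com", "bob@example.org", "a84b4c76")
print(msg.splitlines()[0])   # INVITE sip:bob@example.org SIP/2.0
```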
Figure 28.23 Tracking the callee. To locate the callee [if the callee is on a DHCP client or away from her system], SIP uses the concept of registration. SIP defines some servers as registrars; at any moment a user is registered with at least one registrar server, and this server knows the IP address of the callee. The caller can use the e-mail address of the callee in the INVITE message. This message goes to a proxy server, which sends a lookup message (not part of SIP) to some registrar server to obtain the IP address of the callee. When the proxy server receives a reply from the registrar server, it takes the caller's INVITE message and inserts the newly discovered IP address of the callee.
H.323 is a standard designed by the ITU to allow telephones on the public telephone network to talk to computers (called terminals in H.323) connected to the Internet. A gateway connects the Internet to the telephone network; it is a five-layer device that can translate a message from one protocol stack to another, transforming a telephone network message into an Internet message. A gatekeeper server on the local area network plays the role of the registrar server [as in SIP]. Figure 28.24 H.323 architecture
H.323 uses a number of protocols to establish and maintain voice (or video) communication: G.711 or G.723.1 for compression; H.245 to allow the parties to negotiate the compression method; Q.931 to establish and terminate connections; and H.225, or RAS (Registration/Administration/Status), for registration with the gatekeeper. Figure 28.25 H.323 protocols
The terminal sends a broadcast message to the gatekeeper, which responds with its IP address. The terminal and gatekeeper communicate using H.225 to negotiate bandwidth. The terminal, gatekeeper, gateway, and telephone communicate using Q.931 to set up a connection, and then use H.245 to negotiate the compression method. The terminal, gateway, and telephone exchange audio using RTP under the management of RTCP. Finally, the terminal, gatekeeper, gateway, and telephone communicate using Q.931 to terminate the connection. Figure 28.26 H.323 operation