Using RFC2429 and H.263+ at low to medium bit-rates for low-latency applications

Stephan Wenger
Department of Computer Science
Technische Universitaet Berlin
stewe@cs.tu-berlin.de

Guy Côté
Department of Elec. & Comp. Engineering,
University of British Columbia
guyc@ece.ubc.ca

Abstract: In this paper, a standard-based video communication system combining video coding, packetization, and transport protocol environment is introduced. Rather than focusing on the specific areas mentioned above, the strength of the introduced concept lies in the novel combination of the various tools, especially in a rate-distortion optimized H.263 [1] video coder that takes packet loss into account, and the packetization based on the Internet RFC2429[2]. Together, both mechanisms form a powerful platform for low-latency, RTP-based video transmission in packet lossy environments with typically large packet sizes, and packet loss rates of up to 20%.

1. Introduction

Multimedia communication over packet networks, especially the Internet, has recently seen a lot of interest, both from academia and industry. Real-time and non real-time store-forward systems based on the video coding recommendation H.263 version 2 (also known as H.263+) are becoming increasingly popular, even though some applications regularly use a modified, proprietary syntax for packet-streams that takes the packet-oriented nature of the Internet into account. This is especially true for the store-forward type of applications such as Internet broadcasting. For real-time applications, the bit-oriented syntax of H.263, and other current video coding standards including the older MPEG-1 [3], MPEG-2 [4], and the recent MPEG-4 visual elementary stream [5], has to be converted into packets in an intelligent way to ensure reasonable performance. In the case of H.263 version 2, the packetization syntax for an RTP [6] environment is provided in RFC2429.

While H.263 and RFC2429 together provide the necessary video-oriented tools for high-quality, low bit-rate video transmission over the Internet, it is the combination of mechanisms from both standards with non-standardized tools that forms an efficient video communication system. Video standardization work generally leaves coding optimizations beyond bit stream syntax definition and decoder operation up to the system implementers in order to allow complexity scalability and competition in the marketplace. This is where our contribution comes in: based on standardized syntax, a well-performing system is outlined.

This paper first introduces in Section 2 the network environment of our research, by providing an introduction to IP/UDP/RTP, information about useful packet sizes, error rates, delay, and other similar data. Readers who are familiar with Internet/RTP environments can safely skip this section. In Section 3 we briefly describe a rate-distortion optimized video encoder that takes a packet lossy environment into account, along with a corresponding decoder that uses a simple, yet powerful, error concealment method. Section 4 outlines a packetization scheme based on RFC2429 that leads to good performance when used with the coder/decoder combination described in Section 3. Section 5 provides simulation results for the above combination. The paper concludes with a summary and outlook in Section 6.

2. Real-time video communication over the Internet: an overview

This section attempts to characterize a complex environment (which changes rapidly over time, and on a large scale, mainly due to integration of new research results into products) in a few lines. Necessarily, there is a need for a lot of simplifications and abstraction. A more comprehensive, tutorial oriented, and only slightly outdated overview of the problems from a networking point-of-view can be found in [7].

Video communication over the Internet can be divided into two broad categories: interactive two-way communication and non-interactive store-forward communication. Interactive communication is characterized by the need for low end-to-end delay. Internet-based Videotelephony and Videoconferencing are good examples of such applications. Non-interactive communication applications can sustain reasonable end-to-end delay, for example a couple of seconds, as long as continuous playout of the video stream is possible. Typical examples include Internet television, and most forms of surveillance applications.

2.1 Non-interactive applications

From a transport point-of-view, non-interactive video communication is a relatively simple problem. In a point-to-point scenario, a possible protocol hierarchy could consist of TCP/IP for the transport of video information and RTP for providing timing information. Such an RTP/TCP/IP protocol combination would allow for an extremely low packet loss rate if the playout delay is long enough to give TCP time for almost all re-transmissions. In multipoint scenarios, either reliable multicast protocols or a simpler scheme similar to the one used for broadcast could be employed. In broadcast scenarios it is impossible to use re-transmission protocols. To gain reasonable performance, either mechanisms similar to the one described in the rest of this paper or forward error correction schemes could be employed.

2.2 Interactive applications

Interactive communication requires low end-to-end delay, and thus cannot rely on end-to-end re-transmission protocol algorithms that can introduce significant delays. The introduced transmission delay is, in case of an IP packet loss, at least three times that of the one-way transmission delay, since the packet loss has to be signaled (by the arrival of a packet with a higher than expected sequence number, or by timeout), the re-transmission has to be requested and the re-transmission itself has to take place. For many off-campus Internet connections involving more than a few routers, an end-to-end transmission delay of 100 ms or more can be assumed on the routing layer, leading to 300 ms or more on the application layer after a single re-transmission. Adding an assumed typical video coding/decoding delay of 200 ms, the end-to-end delay from a user's point of view results in half a second - too high for useful interactive communication.
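The delay budget above can be checked with a few lines of arithmetic. The following is a minimal sketch using the figures assumed in the text (100 ms one-way delay, 200 ms coding/decoding delay); the function name is ours and purely illustrative.

```python
def delay_with_one_retransmission_ms(one_way_ms, codec_ms):
    """User-perceived delay when a lost packet is recovered by one re-transmission."""
    # Three one-way trips: the original transmission (whose loss is noticed),
    # the re-transmission request, and the re-transmission itself.
    network_ms = 3 * one_way_ms
    return network_ms + codec_ms


print(delay_with_one_retransmission_ms(100, 200))  # 500
```

The result is a lower bound, since it ignores the time needed to detect the loss by timeout.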

Therefore, a transport protocol not based on re-transmission has to be employed. For the majority of Internet-based video communication of today, this is UDP. As a datagram protocol, UDP's major functionality is application addressing. It does not improve the transmission quality of service, with the exception of an (optional) checksum to ensure the integrity of the payload data. In a typical environment, UDP traffic often reaches the receiver with a substantial packet loss rate, due to the lack of re-transmission. Recent research has shown that loss rates of 20% or more are common for many inter-continental connections [8] [9].

On top of UDP, RTP is generally employed to provide some real-time application-layer information such as a playout time-stamp and data type. Such an IP/UDP/RTP protocol hierarchy is assumed in the rest of this paper. RTP packets may carry in their payload either the media data directly, as it is done for most types of coded audio[10], or a payload specific additional header might be necessary. The payload header for H.263 version 2 video coding, for example, is defined in RFC2429. Most of the key functionalities enabled by this payload header are used by the packetization scheme described in Section 4.

As described above, a typical "video packet" consists of header information for IP, UDP, RTP, the RTP payload header, and the payload data itself. The size of those headers is quite substantial: 20 bytes for IP, 8 bytes for UDP, 12 bytes for RTP, and a variable number of bytes for the payload specific header. In this section we will not consider the latter and leave this discussion for Section 4. Given a minimum of 40 bytes of header information for each packet, there is a need to produce video packets as large as possible to gain a reasonable relationship between header information and payload. Nevertheless, two upper bounds for this packet size have to be considered. First, it is not helpful to transmit more than one picture in a single packet due to delay constraints, and second, the typical Maximum Transmission Unit (MTU) size of the Internet has to be considered. While all the mentioned Internet protocols do allow for packets of up to 64 Kbytes, the MTU size is usually assumed to be much smaller - around 1500 bytes. The reason for this number lies in the history of packet networks, especially Ethernet-type local area networks. On an Ethernet, every IP packet exceeding 1500 bytes has to be split into at least two Ethernet packets, thus at least doubling the IP packet-loss rate for a given Ethernet packet loss rate. Although Ethernet is no longer relevant for long-distance Internet connections, many router implementations still seem to use split/recombine algorithms at packet sizes larger than the MTU size [8].

Given a maximum payload of 1450 bytes, or 11600 bits per packet, many coded pictures will fit completely into one packet. For example, at 10 frames per second, the bit rate when using the full maximum payload size of a packet would be 116,000 bit/s on a bit-oriented channel, which is plenty for good quality QCIF pictures (176 x 144 pixels). Bigger picture formats like CIF (352 x 288 pixels) are also possible when accepting some degradation. At higher frame rates, the bit rate would be correspondingly higher, so that, at 30 fps, 348,000 bit/s would be the upper limit using a single picture per packet and a maximum payload size of 1450 bytes. All those numbers suggest a general rule for minimum overhead packetization that can be expressed as "one picture, one packet".
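The upper limits quoted above follow directly from the payload size and the frame rate. A minimal sketch of the arithmetic, assuming the 1450-byte payload from the text (the function and constant names are illustrative):

```python
MAX_PAYLOAD_BYTES = 1450  # payload size assumed in the text


def max_video_bitrate(fps, payload_bytes=MAX_PAYLOAD_BYTES):
    """Highest video bit rate (bit/s) when every picture must fit into one packet."""
    return payload_bytes * 8 * fps


print(max_video_bitrate(10))  # 116000
print(max_video_bitrate(30))  # 348000
```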

However, when applying this rule, a single packet loss means the loss of a whole coded picture. From an error-resilience point-of-view it would be more desirable to divide a coded picture into a large number of packets to keep the spatial area affected by a packet loss as small as possible. In Section 4 we introduce a packetization scheme which uses 2 packets per picture because we feel that this is the optimum in the trade-off between error-resilience and overhead reduction, as outlined there.

We should also make a short remark here to address the issue of the burstiness of packet losses. It has been shown that packet losses tend to occur in a bursty fashion. Our earlier research showed, however, that those bursts do not seem to have a long enough duration to lead to bursty errors in the video traffic, given that pictures arrive only every 1/10th of a second when encoding at 10 frames per second (video packets are sent out in 100 ms intervals at best).

Finally it should be mentioned that IP/UDP/RTP header compression is sometimes possible on the most critical, bandwidth limited dial-up connections between the terminal and the ISP. This header compression, however, cannot reduce the overhead of the full headers on the rest of the Internet. So we assume no header compression is performed.

3. Video encoder and decoder

An overview of H.263 can be found in [11], and [12] provides more detailed information about the error-resilience oriented mechanisms. The rate-distortion (RD)-optimized coder used for this work, and the applied error concealment method, are described in detail in [13]. We therefore restrict the description of those algorithms to a minimum, and focus on packetization oriented issues.

For the smaller picture sizes of H.263+, each spatial row of macroblocks is known as a group of blocks (GOB). GOBs can be coded with, or without, a GOB header, which provides a synchronization point in the variable length coded bit stream. Furthermore, certain in-picture prediction mechanisms such as motion vector prediction do not cross GOB boundaries if the GOB header is present. This allows for a (more or less) independent decoding of GOBs preceded by GOB headers. In our work we code all GOBs with GOB headers, as outlined in Figure 1.

Figure 1: H.263+ coded QCIF picture with all GOB headers

Furthermore, several macroblocks are coded as non-predicted intra macroblocks, rather than relying on the - more efficient - predictive inter coding mode. Which macroblocks are coded in intra mode is decided by a rate-distortion optimization process, summarized below.

Our packetization schemes, described in Section 4, have in common that each packet consists of at least one full GOB, and we do not allow splitting of GOBs into more than one packet. Therefore, a packet loss corresponds to the loss of one, or more, full GOBs. This helps to apply a simple and effective error concealment mechanism as described below. The error concealment algorithm used by the decoder is known by the encoder, which can use this knowledge in the rate-distortion optimization process. In the following we first summarize the error concealment technique employed and then describe the RD intra updating method.

3.1 Error concealment

The error concealment method used in this work is based on the TCON model of H.263 Test Model TMN-11 [14]. Missing coded macroblocks (in our case always at least one full GOB) are detected using the GOB Number (GN) of the GOB header, which is present for every GOB.

Error concealment is performed for the missing GOBs as follows. Motion vectors of the missing macroblocks are copied from the macroblock above when available, otherwise set to zero. Then the macroblock from the previous frame at the same spatial location is motion compensated with this motion vector and copied to the current location in the current frame.
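The concealment steps just described can be sketched as follows. This is a simplified illustration and not the TMN-11 code: it assumes luma-only frames stored as lists of rows, integer-pel motion vectors given as (dx, dy) displacements into the previous frame, and clamping at the picture border. The function name and the data layout are our assumptions.

```python
MB = 16  # macroblock size in luma samples


def conceal_gob(cur, prev, mvs, gob_row):
    """Conceal one lost GOB (one macroblock row) of `cur` in place.

    cur, prev: frames as lists of rows of luma samples.
    mvs[row][col]: (dx, dy) of a received macroblock, or None if it was lost.
    """
    height, width = len(prev), len(prev[0])
    for col in range(width // MB):
        # Copy the motion vector from the macroblock above when available,
        # otherwise fall back to the zero vector.
        above = mvs[gob_row - 1][col] if gob_row > 0 else None
        dx, dy = above if above is not None else (0, 0)
        # Motion-compensated copy from the previous frame.
        for y in range(MB):
            src_y = min(max(gob_row * MB + y + dy, 0), height - 1)
            for x in range(MB):
                src_x = min(max(col * MB + x + dx, 0), width - 1)
                cur[gob_row * MB + y][col * MB + x] = prev[src_y][src_x]
    return cur
```

Because only one vector lookup and one block copy are needed per macroblock, the added decoder complexity stays small.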

Many other error concealment techniques have been proposed in the literature, and an excellent review is available in [15]. However, many of these techniques require substantial additional complexity that can be tolerated in still image decoding but not in real-time video decoding. The method used in this work provides efficient error concealment and requires very little additional computational complexity.

3.2 Rate-distortion optimized intra updating method

RD video coding provides an efficient framework for optimizing coding parameters, including coding mode selection. A summary of RD optimized video coding for an error free environment can be found in [16]. In our work, three coding modes are considered: skip, inter, and intra. Intra mode represents coding without temporal prediction and inter mode uses motion compensated temporal prediction. The skip mode is a special case of inter mode where no information is transmitted, and the macroblock is simply repeated from the spatially corresponding macroblock in the previous frame. Independently for every macroblock, we choose the mode that minimizes the Lagrangian given by:

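$$J = D_q + \lambda R \qquad \text{(Equation 1)}$$

where D_q is the distortion caused by quantization, R is the rate, and \lambda is the Lagrange multiplier; that is, we choose the coding mode that yields the best RD tradeoff for the macroblock. Using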

Equation 2

has been shown to provide good RD tradeoffs [16], where Q is the quantization step size of the macroblock.

Using the above method, the coding mode selection is only optimal if the video bit stream is received without errors at the decoder. When errors are present, temporal prediction will allow errors to propagate if the inter mode is chosen. Using the intra coding mode will stop error propagation, but at a higher coding rate cost.

If we know (or can estimate) the error concealment method employed by the decoder and error rates of the network, we can achieve better tradeoffs between compression efficiency and error resilience. This idea was also suggested in [17]. First, we can attribute the distortion to two sources: distortion Dq caused by quantization error, and distortion Dc remaining after error concealment. Assuming a macroblock error rate of p, we minimize the Lagrangian

$$J = (1 - p)\, D_q + p\, D_c + \lambda R \qquad \text{(Equation 3)}$$

Here, the rate R is the rate at which the coded sequence is transmitted, and is the same as in the error free case.

For a given macroblock, two distortions are computed for each of the three coding modes considered: the coding distortion Dq and the concealment distortion Dc. Then Dq is weighted by the probability (1-p) that this macroblock is received without error, and Dc is weighted by the probability p that the same macroblock is lost and concealed. Using this minimization, good RD tradeoffs can be achieved for a given error rate and concealment method. The error concealment method directly affects the mode decision. A better error concealment method than the one employed here will give better RD performance at the same error rate. Note that by minimizing Equation 3, regions that are usually well concealed will most probably not be coded in the intra mode. If a given macroblock is perfectly concealed, then Dc = Dq, and Equation 1 and Equation 3 are therefore equivalent.
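The mode decision described above, with Dq weighted by (1-p) and Dc weighted by p, can be sketched as follows. The function name, the data layout and all numbers are hypothetical and chosen only to show how a rising loss rate shifts the decision towards intra coding; the Lagrange multiplier is passed in as a parameter rather than derived from the quantizer.

```python
def choose_mode(candidates, p, lam):
    """Return the coding mode with the smallest loss-aware Lagrangian.

    candidates: dict mapping mode name -> (rate, d_q, d_c).
    p: assumed macroblock loss probability; lam: Lagrange multiplier.
    """
    def cost(item):
        rate, d_q, d_c = item[1]
        return (1 - p) * d_q + p * d_c + lam * rate

    return min(candidates.items(), key=cost)[0]


# Made-up (rate, Dq, Dc) values for one macroblock.
macroblock = {
    "skip": (1, 500, 2000),
    "inter": (40, 100, 2000),
    "intra": (120, 100, 600),
}
print(choose_mode(macroblock, p=0.0, lam=5))  # inter
print(choose_mode(macroblock, p=0.4, lam=5))  # intra
```

With p = 0 the cost reduces to the error-free Lagrangian, so the cheaper inter mode wins; at p = 0.4 the large concealment distortion of the predicted modes outweighs the extra rate of intra coding.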

4. Packetization

This section describes the application of the mechanisms defined in RFC2429 to the coded video bit stream generated by the RD-optimized coder outlined in the previous section.

4.1 Packetization using RFC2429

Section 3 introduced a mechanism that codes small sections of a picture (GOBs) independently by inserting a GOB header at the beginning of each GOB. If one of these GOBs is missing from the received bit stream, the decoder uses error concealment as described above. The most straightforward packetization scheme using this coding mechanism is described in detail in [13]. This scheme employs a "one packet, one GOB" technique, leading to an overhead of 40 bytes per GOB. This overhead was deemed acceptable, as IP/UDP/RTP header compression can be employed on the critical, bandwidth limited link between terminal and first router. Backbone traffic was not considered as important.

We propose a more network friendly packetization, where the described overhead is minimized by using a smaller number of (bigger) packets, thereby also avoiding the use of header compression. We want to pack more than one GOB into a single packet, while keeping the maximum packet size to one picture. Since the straightforward error-concealment method described above performs best for the concealment of one missing GOB based on information from the spatially above GOB, it is preferable not to pack GOBs into one packet in the same order they are coded. Instead of trying to find an error concealment mechanism that can conceal very large spatial parts of a picture (which is a difficult task), we use a simple interleaving scheme by packing all even GOBs into one packet, and all odd GOBs into another. This leads to two packets per picture and allows the concealment of all macroblocks of a picture if only one of those two packets gets lost. This packetization is also more friendly to other spatial-oriented error concealment techniques.
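The interleaving itself is simple. The sketch below assumes that the coded GOBs of one picture are available as a list in coding order, numbered from 0; the payload bytes are placeholders rather than real H.263 data, and the function names are illustrative.

```python
def interleave_gobs(gobs):
    """Split the coded GOBs of one picture into two packets.

    gobs: list of (gob_number, coded_bytes) in coding order, GOB 0 first.
    Returns [even_packet, odd_packet], each a list of (gob_number, coded_bytes).
    """
    return [gobs[0::2], gobs[1::2]]


def lost_gobs(total_gobs, received_packets):
    """GOB numbers that are missing and have to be concealed by the decoder."""
    received = {number for packet in received_packets for number, _ in packet}
    return [n for n in range(total_gobs) if n not in received]


# QCIF: 9 GOBs per picture.
picture = [(n, b"gob%d" % n) for n in range(9)]
even_packet, odd_packet = interleave_gobs(picture)
print([n for n, _ in even_packet])  # [0, 2, 4, 6, 8]
print(lost_gobs(9, [even_packet]))  # [1, 3, 5, 7]
```

When only the odd packet is lost, every missing GOB lies directly below a received one, which is the case the concealment method handles best.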

The picture header contains information relevant to the whole picture and appears only once per coded picture representation, at the very start. If this picture header belongs to one packet only and gets lost, critical information is lost. In many environments, the picture header changes only rarely with the exception of the temporal reference (TR) field. Since a very similar (and, in fact, redundant) time reference is also present in the RTP header, such a situation could be easily concealed on the bit-stream level.

Some systems, however, may more frequently change contents of the picture header other than the TR, such as picture size, picture coding mode, selected reference picture, or optional coding modes. This would cause serious problems if such a changed picture header is lost, which usually results in the inability to decode that picture, or worse, could force a decoder reset due to resource allocation problems (if for example the picture size was changed to a larger size). RFC2429 allows for adding a redundant copy of the picture header into the payload header of each packet. This mechanism is employed to ensure that a picture can be (partially) decoded and concealed, even when the first packet of this picture is lost.

4.2 A packetization scheme using two packets per picture with GOB interleaving

For the reasons mentioned above we propose a packetization scheme which tries to minimize the packetization overhead and still performs well in case of packet losses, assuming the coding/decoding environment of Section 3. To do so we transmit two packets per picture, with all even numbered coded GOBs in one packet, and all odd numbered coded GOBs in another packet. A typical example for a video bit stream coded at 50 kbps and 10 frames per second at QCIF resolution is presented in Figure 2. The constant packetization overhead (consisting of the IP/UDP/RTP headers, 40 bytes per packet in total) is thus reduced to 80 bytes per picture.
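The header overhead of both packetization schemes can be computed directly from the header sizes given in Section 2. A minimal sketch, assuming QCIF with 9 GOBs per picture and 10 frames per second (names are illustrative):

```python
HEADER_BYTES = 20 + 8 + 12  # IP + UDP + RTP headers per packet


def overhead_kbps(packets_per_picture, fps):
    """IP/UDP/RTP header overhead in kbit/s (1 kbit = 1000 bit)."""
    return packets_per_picture * HEADER_BYTES * 8 * fps / 1000


print(overhead_kbps(2, 10))  # 6.4  (one picture, two packets)
print(overhead_kbps(9, 10))  # 28.8 (one GOB, one packet; 9 GOBs per QCIF picture)
```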

 

Figure 2: An example of two packets of a picture using the interleaved packetization scheme

Note that the 2-byte minimum RFC2429 payload header has no impact on the overhead, because the use of this header allows for the deletion of the 16 leading zero bits of the Picture Start Code or GOB Start Code that precedes each picture/GOB in H.263. These bits, used in bit-oriented environments for synchronization, are represented by a single bit in the RFC2429 header and can thus be deleted from the video bit stream.
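The replacement of the leading start code bits by the payload header can be sketched as follows. The field layout used here (5 reserved bits, P, V, 6-bit PLEN, 3-bit PEBIT) reflects our reading of the RFC2429 payload header and should be checked against the RFC before use; the function names are ours and are not part of any library.

```python
import struct


def rfc2429_payload_header(p_bit, v_bit=0, plen=0, pebit=0):
    """Build the 2-byte payload header: RR(5) | P(1) | V(1) | PLEN(6) | PEBIT(3)."""
    if not (0 <= plen < 64 and 0 <= pebit < 8):
        raise ValueError("PLEN must fit in 6 bits and PEBIT in 3 bits")
    word = (p_bit & 1) << 10 | (v_bit & 1) << 9 | plen << 3 | pebit
    return struct.pack("!H", word)  # network byte order


def to_payload(segment):
    """Replace the two leading zero bytes of a start code by the payload header."""
    if segment[:2] != b"\x00\x00":
        raise ValueError("segment does not begin with a start code")
    return rfc2429_payload_header(p_bit=1) + segment[2:]


coded = b"\x00\x00\x80\x02" + b"rest-of-picture"  # placeholder bit stream
payload = to_payload(coded)
print(payload[:2].hex())            # 0400
print(len(payload) == len(coded))   # True
```

The payload has the same length as the original coded segment, which is why the minimum payload header costs no additional bytes.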

4.3 De-packetization and decoding process

When using the described packetization scheme four different situations can occur at the receiver, depending on the packet loss situation:

5. Simulation results

An H.263 video coder [18] implementing the Test Model TMN-11 specifications with RD optimizations [19] is used in all simulations. Simulation results are obtained using the sequence Foreman in QCIF format. Three bit rates are considered: 20 kbps and 50 kbps to simulate modem and ISDN dial-up connections to an ISP, and 150 kbps to simulate high-bandwidth connections to the Internet backbone, or LAN connections. All bit rates include network and video packetization overhead. By splitting the video frames into GOBs, additional coding penalties are incurred from both the size of the GOB headers themselves and from predictive coding limitations (e.g. motion vector prediction). This has to be taken into account when comparing the PSNR values to those obtained by a coder optimized for a lossless environment. Table 1 provides a summary, where pack.-scheme 1 represents the one picture, two packets scheme and pack.-scheme 2 represents the one GOB, one packet scheme.

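| Transport bitrate | Total bitrate available for packet video (kbps) | Packetization scheme | Packetization overhead @ 10 fps and QCIF (kbps) | Bitrate for H.263+ video (kbps) | PSNR at 0% PLR (dB) | PSNR at 20% PLR (dB) |
|---|---|---|---|---|---|---|
| Modem, 33 kbps | 20 | 1 | 6.4 | 13.6 | 27.1 | 20.9 |
| Modem, 33 kbps | 20 | 2 | 28.8 | N/A | N/A | N/A |
| ISDN, 64 kbps | 50 | 1 | 6.4 | 43.6 | 30.0 | 23.6 |
| ISDN, 64 kbps | 50 | 2 | 28.8 | 21.2 | 28.1 | 20.7 |
| LAN, >150 kbps | 150 | 1 | 6.4 | 143.6 | 34.4 | 27.6 |
| LAN, >150 kbps | 150 | 2 | 28.8 | 121.2 | 33.7 | 25.1 |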

Table 1: Simulation results summary for the sequence Foreman.

Results for the proposed one picture - two packets scheme are presented in Figure 3. We compare the RD intra updating mechanism with the random updating proposed in the ITU-T [20]. Packet Loss Rates (PLR) of 5, 10 and 20% are considered. At 20% PLR, a gain of as much as 2.4 dB is achieved by the proposed RD updating mechanism at a bit rate of 150 kbps. The gain achieved by the RD intra updating method is due to the consideration of the error concealment technique at the encoder and its use at the decoder. A better error concealment technique would be expected to improve these results further.

Next, we compare the one picture - two packets scheme to the one GOB - one packet approach described in [13]. Figure 4 shows the PSNR of the different packetization schemes versus PLR for 50 and 150 kbps. Note that the bit rates include the packetization overhead. The one GOB - one packet scheme requires an overhead of 28.8 kbps at QCIF resolution and 10 fps, which limits the lowest video bit rate reasonably achievable with this packetization method. Thus for low bit rates (e.g. modem connections) this method cannot be used. But even for higher bit rates, the proposed packetization works consistently better than the one GOB - one packet scheme. This result was verified using the sequence News. The results for News are similar and are not shown here for reasons of space.

Figure 3: Performance of RD-optimized mode decision

Figure 4: Relative performance of the two packetization schemes

6. Summary

In this paper we introduced a standard-compliant video communication system for an Internet/RTP environment, that consists of video encoder, decoder with error concealment, and packetization scheme. The system was designed to allow for uni-directional, real-time communication, as it does not rely on any feedback mechanisms other than information about the packet loss rate. It was shown that this combination yields good quality of the reconstructed video pictures even at high packet loss rates such as 20%.

 

7. References:

1 ITU Recommendation H.263 Version 2, Video Coding for Low Bitrate Communication, Jan. 1998.

2 Bormann, C., L. Cline, G. Deisher, T. Gardos, C. Maciocco, D. Newell, J. Ott, G. Sullivan, S. Wenger and C. Zhu, "RTP Payload Format for the 1998 Version of ITU-T Rec. H.263 Video (H.263+)", RFC2429, May 1998.

3 ISO/IEC 11172-2 (MPEG-1) "Information technology -- Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s -- Part 2: Video" Nov. 1992.

4 ISO/IEC 13818-2 (MPEG-2) "Information technology -- Generic coding of moving pictures and associated audio information" ISO/IEC, July 1996.

5 ISO/IEC 14496-2 (MPEG-4 ), "Video Verification Model V.12", Dec. 1998.

6 Schulzrinne, H., S. Casner, R. Frederick and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC1889, Jan. 1996.

7 Black, U., Advanced Internet Technologies, Prentice Hall, 1998

8 Ott, J., S. Wenger, "Application of H.263+ Video Coding Modes in Lossy Packet Network Environments" accepted for publication, Journal for Visual Communication, 1998.

9 Handley, M., "An Examination of Mbone Performance", UCL/ISI Research Report, Jan. 1997.

10 Schulzrinne, H. "RTP Profile for Audio and Video Conference with Minimal Control" RFC1890, May 1996.

11 Côté, G., B. Erol, M. Gallant, F. Kossentini, "H.263+: Video Coding at Low Bit Rates", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 7, Nov. 1998, available from http://spmg.ece.ubc.ca.

12 Wenger, S., G. Knorr, J. Ott, F. Kossentini, "Error Resilience Support in H.263+", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 7, Nov. 1998.

13 Côté, G., and F. Kossentini, "Optimal Intra Coding of Macroblocks for Robust H.263 Video Communication over the Internet", submitted to Special Issue of Image Communication on "Real-time Video over the Internet", EURASIP Visual Communication, Sept. 1998.

14 ITU Telecom. Standardization Sector of ITU, "Video Codec Test model near-term, Version 11 (TMN11)", H.263 Ad Hoc Group, July 1998, available from ftp://standard.pictel.com/video-site/.

15 Wang, Y., and Q.F. Zhu, "Error Control and Concealment for Video Communication: A Review", Proceedings of the IEEE, Vol. 86, No. 5, pp. 974-997, May 1998.

16 Sullivan, G., and T. Wiegand, "Rate-Distortion Optimization for Video Compression", IEEE Signal Proc. Magazine, Nov. 1998.

17 Wiegand, T. Personal Communication with authors, July 1998.

18 Signal Processing and Multimedia Group, University of British Columbia, "TMN 8 (H.263+) Encoder/Decoder, Version 3.2", Feb. 1998, available from http://spmg.ece.ubc.ca/h263plus.

19 Gallant, M., G. Côté and F. Kossentini, "Description of and Results for Rate-Distortion Based Coder", ITU-T Study Group 16, Video Experts Group, Document Q15-D-49, Apr. 1998 available from ftp://standard.pictel.com/video-site/.

20 Wenger, S., "Video Test Model Description for H.324/M Based Communication", ITU-T Study Group 16, Video Experts Group, Document Q15-F-46, Nov. 1998, available from ftp://standard.pictel.com/video-site/.