IEEE Signal Processing Society 1998 Workshop on Multimedia
December 7-9, 1998, Los Angeles, California, USA
© 1998 IEEE
Many of the newer video coding/compression standards, including MPEG-4 and H.263+, support some form of temporal scalability as part of their layered codec concept. This is generally realized by using bi-directionally predicted pictures (B-pictures), which use both an earlier and a subsequent P-picture as reference (anchor) pictures. This paper introduces a new form of temporal scalability employing only P-pictures. This mechanism results in improved real-time behavior (particularly lower latency) and a more flexible layering structure, at the cost of less efficient coding. The P-picture based temporal scalability mechanism is particularly useful for interactive multimedia communication on networks that offer several independent transport streams (possibly with different quality of service), but have sub-optimal real-time characteristics. Typical applications include both the Internet and some forms of mobile communication. The 1998 version of H.263 (known as H.263+ in both academia and industry) offers a mechanism to support P-picture scalability within the bit stream through the Reference Picture Selection mode. Other video coding standards, including those of the MPEG family, require slight modifications to the decoder in order to support the proposed mechanism.
The concept of a layered codec is well known, and recent video coding standards have incorporated support for such mechanisms. A layered video codec produces a representation of the video sequence at various resolutions and qualities. This allows for compatible and robust representations. A particular representation can be obtained by decoding a particular subset of the coded data. Typically, each representation is referred to as a layer. In a hierarchical scalable coding scheme, each layer is coded with respect to lower layer representations. These layers can then be transmitted in separate data streams. The base layer is the only self-contained representation of the sequence and is usually low quality, low resolution, and low bit-rate. Enhancement layers code the difference information or error between the base layer and a higher quality or resolution representation of the video sequence. Three forms of scalability are widely used: temporal scalability, which increases the frame rate; SNR scalability, which increases the picture quality; and spatial scalability, which increases the spatial resolution.
All different enhancement layer mechanisms can be used in combination, although some restrictions exist in the current standards. In H.263+, for example, it is possible to use a spatial enhancement layer to increase the spatial resolution of the representation. Using this higher resolution representation, temporal scalability can be employed to produce a higher frame-rate. However, it is not permitted to use the pictures of a temporal enhancement layer as a reference for subsequent SNR or spatial enhancement layers.
This paper focuses on temporal scalability only. It introduces a new form of temporal scalability employing P-pictures. The following section describes this P-picture scalability and distinguishes it from usual B-picture scalability, using H.263+ as an example. Simulation results for both are presented next. The final section describes the applicability of P-picture scalability to other video coding standards implementing inter picture prediction.
When implementing a layered codec based on H.263+, the common solution is to employ H.263+'s Temporal, Spatial, and SNR Scalability optional mode (Annex O). In this mode, temporal scalability is realized through bi-directionally predicted pictures (B-pictures). B-pictures use the previous and the subsequent P-picture (or I-picture) as their anchor pictures. Figure 1 illustrates a base layer that is decodable at 10 fps and an enhancement layer using B-pictures that is decodable at 30 fps.
Figure 1. Base and temporal enhancement layer. The B-pictures of the enhancement layer use only the P-pictures of the base layer as their anchors.
As stated above, H.263+ does not permit the B-pictures of the enhancement layer to be used as reference pictures for subsequent layers. Furthermore, with this mechanism it is necessary to decode two base layer pictures before it is possible to decode the B-pictures. Therefore, the overall latencies of the base/enhancement layer combination and of the base layer alone are similar, although bits were spent in the enhancement layer to increase temporal resolution. The advantage of temporal scalability employing B-pictures is the comparably small size of B-pictures, leading to a high coding efficiency in the enhancement layer. This is in part due to the enhanced (bi-directional) prediction mechanisms for B-pictures. Additionally, a higher quantization step size is often used for the enhancement layer B-pictures compared to the base layer P-pictures. This is possible because of the restriction that B-pictures not be used as references. The result is high coding efficiency for the enhancement layer B-pictures and little noticeable impact on the reproduced video quality.
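The decode-order consequence of bi-directional prediction can be sketched as follows: a B-picture is not decodable until both of its anchors are available, so the B-pictures of each sub-sequence trail the *next* base layer P-picture in decode order (picture numbers are display-order indices; the every-3rd-picture pattern matches the 10/30 fps example):

```python
# Decode order under B-picture temporal scalability. A B-picture cannot
# be decoded until BOTH anchors (the previous and the subsequent base
# layer picture) have been decoded, so pending B-pictures are released
# only when the next base layer picture arrives.

def decode_order(num_pictures, base_interval=3):
    order = []
    pending_b = []
    for n in range(num_pictures):
        if n % base_interval == 0:      # base layer I/P-picture
            order.append(n)
            order.extend(pending_b)     # B-pictures waiting for this anchor
            pending_b = []
        else:                           # enhancement layer B-picture
            pending_b.append(n)
    return order

# decode_order(7) -> [0, 3, 1, 2, 6, 4, 5]: pictures 1 and 2 can only be
# decoded after picture 3, which is the source of the added latency.
```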
The proposed P-picture based temporal scalability mechanism is illustrated in Figure 2. The base layer is a usual sequence of P or I-pictures, as in the example above. The enhancement layer, however, consists only of P-pictures. In addition, the enhancement layer now uses previous pictures of the enhancement layer, as well as base layer pictures, as reference pictures. Two scenarios are shown: an enhancement layer P-picture can reference either the most recent base layer picture, or the previous picture of the enhancement layer.
Note that the proposed mechanism can also employ a higher quantizer step-size in the enhancement layer P-pictures to increase coding efficiency with little noticeable impact. This is a particularly useful technique for the last picture of each enhancement layer sub-sequence.
Figure 2. Principle of P-picture temporal scalability: the P-picture sequence in the base layer is enhanced by additional P-pictures in the enhancement layer. These enhancement layer pictures reference either the base layer's pictures, or the previous enhancement layer's picture.
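The two reference patterns of Figure 2 can be made concrete with a small sketch that assigns each picture its prediction reference (display-order picture numbers; the exact assignment is our reading of the figure, not normative syntax):

```python
# Reference assignment for P-picture temporal scalability. Base layer
# pictures always predict from the previous base layer picture. For the
# enhancement layer, two scenarios exist: every enhancement picture
# predicts from the most recent base layer picture, or the enhancement
# pictures of a sub-sequence chain off one another.

def references(num_pictures, base_interval=3, chain_in_enh_layer=False):
    refs = {}
    for n in range(num_pictures):
        if n == 0:
            refs[n] = None                       # initial intra picture
        elif n % base_interval == 0:
            refs[n] = n - base_interval          # base layer -> base layer
        elif chain_in_enh_layer and (n - 1) % base_interval != 0:
            refs[n] = n - 1                      # chain within enh. layer
        else:
            refs[n] = (n // base_interval) * base_interval  # last base picture
    return refs
```

With `chain_in_enh_layer=True`, picture 2 predicts from picture 1 rather than from base layer picture 0; in either scenario, a damaged enhancement picture can only corrupt enhancement layer pictures up to the next base layer picture.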
Employing P-pictures for temporal scalability is not an explicitly defined optional mode in H.263+, nor in any of the other popular video coding standards. However, in H.263+ it is possible to use the Reference Picture Selection optional mode (Annex N) without a back channel to generate a standard-compliant bit stream containing scalable P-picture information. This optional mode allows, for each predicted picture, one of several recently transmitted pictures to be selected for use as the reference picture, as opposed to simply allowing the most recent picture to be used as the reference picture. This is signaled in the picture header by coding the temporal reference of the selected reference picture. An H.263+ decoder supporting this optional mode must store several previous reference pictures along with their associated temporal references.
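The multi-picture store that Annex N requires of a decoder can be sketched as a small bounded buffer keyed by temporal reference (class and method names are illustrative, not taken from the Recommendation):

```python
# Sketch of an Annex N style reference picture store: the decoder keeps
# several recently decoded pictures, each indexed by its temporal
# reference (TR), and the picture header may name any stored TR as the
# prediction reference instead of simply the most recent picture.

from collections import OrderedDict

class ReferencePictureStore:
    def __init__(self, capacity=5):
        self.capacity = capacity
        self.pictures = OrderedDict()  # TR -> reconstructed picture

    def add(self, tr, picture):
        """Store a newly decoded picture; evict the oldest if full."""
        self.pictures[tr] = picture
        if len(self.pictures) > self.capacity:
            self.pictures.popitem(last=False)

    def select(self, tr):
        """Return the stored picture whose TR is signalled in the header."""
        return self.pictures[tr]
```

An enhancement layer P-picture whose header names the TR of a base layer picture is thereby predicted across the layer boundary, which is exactly how a standard-compliant scalable P-picture bit stream is obtained.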
The proposed P-picture based temporal scalability mechanism provides several advantages over the traditional B-picture based temporal scalability mechanism. First, latency is reduced when bits are spent on the enhancement layer. Since a P-picture can be decoded without backward prediction, the enhancement layer P-pictures can be decoded as soon as they are transmitted. At an enhancement layer frame rate of 30 fps, and a base layer frame rate of 10 fps, the overall latency can be 66 milliseconds shorter than when using B-picture scalability (33 ms instead of 100 ms). Second, P-pictures can be used as reference pictures for subsequent enhancement layers. This provides more flexibility in the design of a layered codec. Third, error resilience is improved, just as it is with B-picture based temporal scalability: any damage to an enhancement layer P-picture will not affect prediction starting from the next reference layer P-picture. The main disadvantage is the inferior coding efficiency of P-pictures relative to B-pictures. This reduced coding efficiency in the enhancement layer results in a slight increase in bit rate.
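The latency figures above follow directly from the frame intervals: a B-picture must wait one base layer interval for its subsequent anchor, while an enhancement layer P-picture is decodable after one enhancement layer interval.

```python
# Latency arithmetic for the 10/30 fps example. A B-picture waits for
# the next base layer anchor (one base layer frame interval), whereas an
# enhancement layer P-picture waits only one enhancement layer interval.

base_fps, enh_fps = 10, 30
b_picture_delay_ms = 1000 / base_fps   # 100 ms until the next anchor
p_picture_delay_ms = 1000 / enh_fps    # ~33 ms until the next picture
saving_ms = b_picture_delay_ms - p_picture_delay_ms  # ~66-67 ms saved
```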
Using the University of British Columbia's H.263+ reference codec and our implementation of the Reference Picture Selection optional mode (Annex N), both B-picture based temporal scalability (using Annex O) and P-picture based temporal scalability (using Annex N) are simulated. A base layer decodable at 10 fps and an enhancement layer decodable at 30 fps are generated. Fixed quantizer step sizes of 13 and 16 are used for the base and enhancement layers, respectively. The input sequences Foreman, Coastguard, and Paris, at QCIF resolution, are employed. Several optional modes that improve coding efficiency, namely the Advanced Prediction mode (Annex F), Advanced Intra Coding mode (Annex I), Deblocking Filter mode (Annex J), and Modified Quantization mode (Annex T), are used. This configuration produces representations at bit rates typical for Internet scenarios employing layered codecs. The resulting picture quality is well beyond what is achievable using current Internet applications (e.g. the popular Mbone tools).
The picture quality for B-picture based scalability and the proposed P-picture based scalability is comparable, with a PSNR difference of less than 0.1 dB. This result is expected, as the same quantizer values are used for both sets of simulations. Therefore, we compare the different layering mechanisms based on coding efficiency, using the data in Table 1. The first column shows the bit rate for a non-layered representation decodable at 10 fps. The second column shows the bit rate for a non-layered representation decodable at 30 fps. This bit rate can be considered the optimum for full temporal resolution (i.e. 30 fps) P-picture based coding. The third and fourth columns show the bit rate of the enhancement layers for layered representations decodable at 30 fps using B-picture based temporal scalability and P-picture based temporal scalability, respectively. The fifth and sixth columns show the total bit rate of the base and enhancement layers combined for layered representations decodable at 30 fps using B-picture based temporal scalability and P-picture based temporal scalability, respectively. This data provides an indication of the network load imposed by the different mechanisms. Of particular interest are columns 4 and 6, which illustrate the bit rate overhead of P-picture based temporal scalability as compared to B-picture based temporal scalability.
Table 1. Bit rates of the non-layered and layered representations; the percentages give the overhead of P-picture based relative to B-picture based temporal scalability.

| Sequence   | Single layer, 10 fps | Single layer, 30 fps | B-picture enh. layer | P-picture enh. layer | B-picture total | P-picture total |
|------------|----------------------|----------------------|----------------------|----------------------|-----------------|-----------------|
| Paris      | 58.6                 | 106.6                | 25.3                 | 60.5 (+139%)         | 83.9            | 119.1 (+42%)    |
| Foreman    | 43.8                 | 108.5                | 29.9                 | 45.7 (+52%)          | 73.7            | 89.5 (+21%)     |
| Coastguard | 60.5                 | 113.0                | 18.7                 | 25.9 (+38%)          | 79.2            | 86.4 (+9%)      |
The data shows a sequence-dependent total bit rate increase of 9% to 42% when P-picture based temporal scalability is employed, as compared to B-picture based temporal scalability. The associated enhancement layer bit rate increase is 38% to 139%. For many applications, this increase in bit rate is tolerable, given that the benefits of a scalable bit stream are provided with significantly reduced latency using the proposed P-picture based temporal scalability mechanism.
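The quoted overheads can be recomputed, up to rounding, from the Table 1 bit rates:

```python
# Recompute the enhancement layer and total bit rate overheads of
# P-picture versus B-picture temporal scalability from Table 1.
# Tuple order: (B enh. layer, P enh. layer, B total, P total).

table = {
    "Paris":      (25.3, 60.5, 83.9, 119.1),
    "Foreman":    (29.9, 45.7, 73.7,  89.5),
    "Coastguard": (18.7, 25.9, 79.2,  86.4),
}

def overhead_pct(b_rate, p_rate):
    """Percentage increase of the P-picture rate over the B-picture rate."""
    return 100.0 * (p_rate - b_rate) / b_rate

enh_overheads = {s: overhead_pct(v[0], v[1]) for s, v in table.items()}
tot_overheads = {s: overhead_pct(v[2], v[3]) for s, v in table.items()}
```

The totals span roughly 9% (Coastguard) to 42% (Paris), and the enhancement layers alone roughly 38% to 139%, matching the ranges stated in the text.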
H.263+ offers a mechanism for P-picture based temporal scalability through the Reference Picture Selection optional mode. The transport layer can then create a single P-picture based scalable bit stream from multiple transport streams (each of them conveying one layer) by using the temporal reference as the ordering criterion and concatenating the coded pictures. Unfortunately, other current video coding standards lack such a mechanism. This section provides a brief overview of how to circumvent such limitations and implement a P-picture based temporal scalability mechanism in a non-H.263+ framework. In particular, the MPEG-family of ISO/IEC standards and ITU-T H.261 and H.263 version 1 are discussed.
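The transport layer operation described above amounts to a merge of per-layer streams ordered by temporal reference; a minimal sketch (the `(temporal_reference, coded_picture)` tuples are an assumed representation of the per-layer streams):

```python
# Merging per-layer transport streams into a single scalable bit stream.
# Each stream delivers coded pictures in ascending temporal reference
# (TR) order; the merge interleaves the layers by TR and concatenates
# the coded pictures.

import heapq

def merge_layers(*streams):
    """Each stream is an iterable of (TR, coded_picture) tuples in
    ascending TR order; returns the pictures in overall TR order."""
    return [pic for _, pic in heapq.merge(*streams, key=lambda t: t[0])]

# Example: a 10 fps base layer and a P-picture enhancement layer merge
# into one 30 fps picture sequence.
base = [(0, "I0"), (3, "P3"), (6, "P6")]
enh  = [(1, "p1"), (2, "p2"), (4, "p4"), (5, "p5")]
```

Here `merge_layers(base, enh)` yields the pictures in display order, `["I0", "p1", "p2", "P3", "p4", "p5", "P6"]`, which a standard Annex N capable decoder can consume as a single bit stream.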
Using these standards, it is impossible to generate a standard compliant data stream that combines a base layer and a P-picture based temporal scalability layer without performing a complete decoding and re-encoding of the video sequence. However, it is possible to modify a decoder such that not only the most recently decoded P-picture is stored, but several recently decoded P-pictures are stored in a queue. The decoder can then use these stored P-pictures as prediction references for both base layer and enhancement layer pictures.
The corresponding encoder operates in a similar manner, maintaining the same queue of reconstructed reference pictures.
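One plausible realization of such a modified decoder is sketched below. Since these standards carry no reference selection syntax in the bit stream, the layer membership of each picture and the selection rule are assumptions that must be agreed out-of-band (e.g. one transport stream per layer):

```python
# Modified decoder for standards without reference picture selection:
# it keeps the last few reconstructed pictures in a queue and chooses
# the reference by layer membership, which is known from the transport
# stream a picture arrived on (an assumption, not bit stream syntax).

from collections import deque

class LayeredDecoder:
    def __init__(self, depth=4):
        self.recent = deque(maxlen=depth)  # (layer, picture), newest last

    def reference_for(self, layer):
        """Base layer (layer 0) pictures predict only from the most
        recent base layer picture; enhancement pictures may predict from
        the most recent picture of any layer."""
        for lyr, pic in reversed(self.recent):
            if layer > 0 or lyr == 0:
                return pic
        return None

    def store(self, layer, picture):
        """Queue a reconstructed picture for use as a future reference."""
        self.recent.append((layer, picture))
```

Note that a base layer picture skips over any queued enhancement layer pictures when selecting its reference, which keeps the base layer independently decodable.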
We have implemented such a mechanism for a real-time commercial H.261 codec. The results are comparable to those presented above for the H.263+ case.