It appears that re-initialization of an encoding session with NVIDIA Video Codec SDK produces, or at least might produce, an unexpected memory leak.
So, how does it work exactly?
NVENCSTATUS Status;
Status = m_ApiFunctionList.nvEncInitializeEncoder(m_Encoder, &InitializeParams);
assert(Status == NV_ENC_SUCCESS);
// NOTE: Another nvEncInitializeEncoder call
Status = m_ApiFunctionList.nvEncInitializeEncoder(m_Encoder, &InitializeParams);
assert(Status == NV_ENC_SUCCESS); // Still success
...
Status = m_ApiFunctionList.nvEncDestroyEncoder(m_Encoder);
assert(Status == NV_ENC_SUCCESS);
The root cause is the secondary nvEncInitializeEncoder call. Granted, this might not be exactly how the API is designed to be used, but the returned statuses all indicate success, so it would be a bit hard to justify the leak by saying that a second initialization call was not expected in the first place. Apparently the implementation overwrites internally allocated resources without properly releasing or reusing them, and without triggering any warning of sorts.
Another part of the problem is the eclectic design of the API in the first place. You open a “session” and obtain an “encoder” as a result. Then you initialize the “encoder”, and when you are finished you destroy the “encoder”. Do you destroy the “session”? Oh no, you don’t have any session at all, except that the API that opens a “session” actually opens an “encoder”.
So when I get into a situation where I want to initialize an encoder that is already initialized, what I do is destroy the existing “encoder”, open a new “session”, and only then initialize the session-encoder once again with the initialization parameters.
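A minimal sketch of this workaround in the style of the snippet above; m_Device and m_DeviceType are assumed members identifying the device the session is bound to:
NVENCSTATUS Status;
// Tear down the current "encoder" (which is really the session)
Status = m_ApiFunctionList.nvEncDestroyEncoder(m_Encoder);
assert(Status == NV_ENC_SUCCESS);
m_Encoder = nullptr;
// Open a fresh session...
NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS SessionParams { };
SessionParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
SessionParams.apiVersion = NVENCAPI_VERSION;
SessionParams.device = m_Device; // e.g. ID3D11Device* or CUDA context (assumed member)
SessionParams.deviceType = m_DeviceType; // e.g. NV_ENC_DEVICE_TYPE_DIRECTX (assumed member)
Status = m_ApiFunctionList.nvEncOpenEncodeSessionEx(&SessionParams, &m_Encoder);
assert(Status == NV_ENC_SUCCESS);
// ...and only then initialize, exactly once per session
Status = m_ApiFunctionList.nvEncInitializeEncoder(m_Encoder, &InitializeParams);
assert(Status == NV_ENC_SUCCESS);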
It appears there is a sort of limitation (read: “a bug”) in the Media Foundation MPEG-4 File Source implementation when it comes to reading long fragmented MP4 files.
When the respective media source is used to read such a file (for which, by the way, it does not offer seeking), the source issues MF_SOURCE_READERF_ENDOFSTREAM before reaching the actual end of file.
When some software sees a full hour of video in the file…
… the Media Foundation primitive, after reading frame 00:58:35.1833333, issues an “oh, gimme a break” event and reports end of stream.
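A sketch of how the problem shows up through the Source Reader API, assuming a reader created over such a fragmented MP4 file: ReadSample signals MF_SOURCE_READERF_ENDOFSTREAM while the last delivered timestamp is still well short of the reported duration.
#include <cstdio>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <atlbase.h>

void CheckEndOfStream(IMFSourceReader* Reader)
{
    PROPVARIANT DurationValue;
    PropVariantInit(&DurationValue);
    Reader->GetPresentationAttribute(MF_SOURCE_READER_MEDIASOURCE, MF_PD_DURATION, &DurationValue);
    LONGLONG const Duration = static_cast<LONGLONG>(DurationValue.uhVal.QuadPart); // 100 ns units
    PropVariantClear(&DurationValue);
    LONGLONG LastSampleTime = 0;
    for(; ; )
    {
        DWORD StreamIndex, StreamFlags;
        LONGLONG SampleTime;
        CComPtr<IMFSample> Sample;
        if(FAILED(Reader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0, &StreamIndex, &StreamFlags, &SampleTime, &Sample)))
            break;
        if(StreamFlags & MF_SOURCE_READERF_ENDOFSTREAM)
            break; // with the files in question this fires prematurely
        if(Sample)
            LastSampleTime = SampleTime;
    }
    printf("last sample at %.3f s of %.3f s total\n", LastSampleTime / 1e7, Duration / 1e7);
}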
Surface Pro (5th Gen) infrared camera streamed into Chrome browser in H.264 encoding over WebSocket connection
The screenshot above shows the Surface Pro tablet’s infrared camera (known as “Microsoft IR Camera Front” on the device) captured live, encoded and streamed (everything up to this point hosted by a Microsoft Media Foundation Media Session) over the network using WebSockets into Chrome’s HTML5 video tag by means of Media Source Extensions (MSE).
Why? Because why not.
Unfortunately, Microsoft did not publish/document an API to access infrared and depth (time-of-flight) cameras so that traditional applications could use the hardware capabilities. Nevertheless, the functionality is available in the Universal Windows Platform (UWP), see Windows.Media.Capture.Frames and friends.
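For reference, a C++/WinRT sketch of that documented UWP route (the frame handler body is a placeholder for actual processing, and real code would keep Capture and Reader alive beyond the function):
#include <winrt/Windows.Foundation.h>
#include <winrt/Windows.Foundation.Collections.h>
#include <winrt/Windows.Media.Capture.h>
#include <winrt/Windows.Media.Capture.Frames.h>
using namespace winrt;
using namespace Windows::Foundation;
using namespace Windows::Media::Capture;
using namespace Windows::Media::Capture::Frames;

IAsyncAction CaptureInfraredAsync()
{
    // Find a source group that exposes an infrared source
    for(MediaFrameSourceGroup const& Group : co_await MediaFrameSourceGroup::FindAllAsync())
        for(MediaFrameSourceInfo const& Info : Group.SourceInfos())
            if(Info.SourceKind() == MediaFrameSourceKind::Infrared)
            {
                MediaCapture Capture;
                MediaCaptureInitializationSettings Settings;
                Settings.SourceGroup(Group);
                Settings.MemoryPreference(MediaCaptureMemoryPreference::Cpu);
                Settings.StreamingCaptureMode(StreamingCaptureMode::Video);
                co_await Capture.InitializeAsync(Settings);
                MediaFrameReader Reader = co_await Capture.CreateFrameReaderAsync(Capture.FrameSources().Lookup(Info.Id()));
                Reader.FrameArrived([] (MediaFrameReader const& Sender, MediaFrameArrivedEventArgs const&)
                {
                    if(auto Frame = Sender.TryAcquireLatestFrame())
                    {
                        // Frame.VideoMediaFrame().SoftwareBitmap() carries the infrared image here
                    }
                });
                co_await Reader.StartAsync();
                co_return;
            }
}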
The UWP implementation is apparently using Media Foundation in its backyard, so the functionality could certainly be published for desktop applications as well. Another interesting thing is that my [undocumented] way of accessing the device seems to bypass the frame server and talk to the device directly, including for video.
It does not look like Microsoft is planning to extend visibility of these new features to the desktop Media Foundation API, since they keep adding new features without exposing them for public use outside UWP. The UWP API itself is eclectic, and I can’t imagine how one could get a good understanding of it without a good grip on the underlying API layers.
Some time ago I shared an application which I have been using to embed a git reference into binary resources, especially as a post-build event in an automated manner: Embedding a Git reference at build time.
This time I needed a small amendment related to the use of a git repository as a sub-module of another repository. To make troubleshooting easier, when a project is built as a part of a bigger build through a sub-module repository reference, the git details of both the repository and its parent can be embedded into the resources.
The utility accepts multiple path arguments, goes over all of them and concatenates the “git log” output. When multiple paths are given, it is okay for some of them to be invalid or unrelated to git repositories.
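Not the utility itself, just a sketch of the underlying idea: run “git log” in every given path and concatenate whatever succeeds, silently skipping paths that are not git repositories.
#include <cstdio>
#include <string>
#include <vector>

std::string ConcatenatedGitLog(std::vector<std::string> const& Paths)
{
    std::string Result;
    for(auto&& Path : Paths)
    {
        std::string const Command = "git -C \"" + Path + "\" log -1 2>nul";
        if(FILE* Pipe = _popen(Command.c_str(), "r"))
        {
            char Buffer[256];
            while(fgets(Buffer, sizeof Buffer, Pipe))
                Result += Buffer;
            _pclose(Pipe); // a failing command (not a repository) simply contributes nothing
        }
    }
    return Result;
}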
The DirectShow Video Mixing Renderer (VMR-7) filter exhibits a (regression?) bug on Windows 10 systems. When aspect ratio preservation is enabled in VMR_ARMODE_LETTER_BOX mode, which quite often makes sense as the default mode, the letterboxing does not work as expected.
The problem is easy to reproduce with the well-known DShowPlayerSDK sample application, with an edit enforcing VMR-7 mode. Once video has started, just resize the window: the parts not covered by video are not erased as expected.
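For reference, the configuration in question is just the windowed-mode aspect ratio setting on the filter, roughly as below:
#include <dshow.h>
#include <atlbase.h>

HRESULT EnableLetterbox(IBaseFilter* Vmr7BaseFilter)
{
    CComQIPtr<IVMRAspectRatioControl> AspectRatioControl(Vmr7BaseFilter);
    if(!AspectRatioControl)
        return E_NOINTERFACE;
    // With this mode active, Windows 10 fails to erase the letterbox bars on window resize
    return AspectRatioControl->SetAspectRatioMode(VMR_ARMODE_LETTER_BOX);
}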
The interesting part about the live WebM Media Foundation media source I mentioned in the previous post is that the whole thing works great on… a Raspberry Pi 3 Model B+ running Windows 10 IoT Core (RaspberryPi 3B+ Technical Preview Build 17661).
Windows 10 IoT has pretty much the same Media Foundation infrastructure as the other Universal Windows Platform environments (Desktop, Xbox, HoloLens), including the core API, primitives, and support in the XAML MediaElement (MediaPlayerElement). There is no DirectX support on Raspberry Pi 3 Model B+ and video delivery fails, however this is a sort of known/expected problem with the Technical Preview build. Audio playback is okay.
The picture above is taken from a C# UWP application (that is, the ARM platform) running a MediaPlayerElement control, which takes a live audio signal from the network over a Windows.Networking.Sockets.MessageWebSocket connection.
A custom WebM live media source (the platform does not have a capable primitive out of the box) forwards the signal to the media element for low-latency audio playback. The codec is Opus and, yes, the stock Media Foundation audio decoder MFT decodes the signal just fine.
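A C++/WinRT sketch of the network leg of this, with the handoff to the custom media source left as a comment and the endpoint URL being a placeholder:
#include <winrt/Windows.Foundation.h>
#include <winrt/Windows.Networking.Sockets.h>
#include <winrt/Windows.Storage.Streams.h>
#include <vector>
using namespace winrt;
using namespace Windows::Foundation;
using namespace Windows::Networking::Sockets;
using namespace Windows::Storage::Streams;

IAsyncAction ConnectLiveWebmAsync()
{
    MessageWebSocket Socket; // real code keeps Socket alive for the lifetime of the stream
    Socket.Control().MessageType(SocketMessageType::Binary);
    Socket.MessageReceived([] (MessageWebSocket const&, MessageWebSocketMessageReceivedEventArgs const& Args)
    {
        DataReader Reader = Args.GetDataReader();
        std::vector<uint8_t> Data(Reader.UnconsumedBufferLength());
        Reader.ReadBytes(Data);
        // Hand the received WebM bytes over to the custom live media source here
    });
    co_await Socket.ConnectAsync(Uri { L"ws://192.168.1.2:8080/live.webm" }); // placeholder endpoint
}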
“The next generation of game capture is here.” The device addresses the needs of real-time video signal capture: offering a pass-through HDMI connection, the box provides a video capture sink over a USB 3.1 Type C interface and makes the video signal available to video capture applications via the standard DirectShow and Media Foundation APIs.
I was interested in whether the device implements video compression, H.264 and/or H.265/HEVC, in hardware. The technical specifications include:
• Max Pass-Through Resolutions: 2160p60 HDR / 1440p144 / 1080p240
• Max Record Resolutions: 2160p30 / 1440p60 / 1080p120 / 1080p60 HDR
• Supported Resolutions (Video input): 2160p, 1440p, 1080p, 1080i, 720p, 576p, 480p
• Record Format: MPEG 4 (H.264+AAC) or (H.265+AAC)*
…
Notes: *H.265 Compression and HDR are supported by RECentral
So there is a direct mention of video compression, and given the state of the technology and the price of the box it makes sense to have it there. The Logitech C930e camera, after all, has been offering onboard H.264 video compression for years.
So is it there in the Ultra thing? NO, IT IS NOT. Pathetic…
One could of course guess this from a study of the FAQ section, in the part about third-party software configuration. The software is clearly expected to use external compression capabilities. However, popular software is also known not to use the latest stuff, so there was a little chance that a hardware codec was still there. I think it would be fair to state right in the technical specification that the product does not offer any encoding capabilities.
The good thing is that the box offers 10-bit video capture up to 2560×1440@30 – there is not much inexpensive hardware capable of doing that job.
The specification mentions a high-rate 1920×1080@120 mode, but I don’t see it among the capabilities the device actually advertises.
Also, the video capture capabilities exposed through the Media Foundation API suggest that it is possible to capture into video memory, bypassing the system memory mapping/copy. Even though this is irrelevant to most applications, some newer ones, including those leveraging the UWP video capture API, could take advantage of it (for example, video capture apps running on low-power devices).
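A sketch of what enabling that path looks like with the Source Reader; the D3D11 device is assumed to be created with D3D11_CREATE_DEVICE_VIDEO_SUPPORT:
#include <d3d11.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <atlbase.h>

HRESULT CreateVideoMemoryReader(IMFMediaSource* Source, ID3D11Device* Device, IMFSourceReader** Reader)
{
    UINT ResetToken = 0;
    CComPtr<IMFDXGIDeviceManager> DeviceManager;
    HRESULT hr = MFCreateDXGIDeviceManager(&ResetToken, &DeviceManager);
    if(FAILED(hr))
        return hr;
    hr = DeviceManager->ResetDevice(Device, ResetToken);
    if(FAILED(hr))
        return hr;
    CComPtr<IMFAttributes> Attributes;
    MFCreateAttributes(&Attributes, 2);
    Attributes->SetUnknown(MF_SOURCE_READER_D3D_MANAGER, DeviceManager);
    Attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, 1);
    // Samples read from such a reader can carry DXGI texture backed buffers
    return MFCreateSourceReaderFromMediaSource(Source, Attributes, Reader);
}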
The monitor of one system is remoted to another system, where the latter is… an Xbox One X. Perceivable latency at 1920×1080@60 monitor resolution is under 2 video frames, even though there are so many things happening in between.
The source system is powered by a moderate GeForce GTX 750, with its video encoding engine (encoding alone on this GPU requires around 12 ms of H.264 compression work per frame) loaded at 40%. There are Rainway, Sachiel, Protocol Buffers and WebRTC on the sending side. Not necessary here, but good to mention: the network video data packaging overall remains HTML5 compliant. On the client side of things the same chain unwinds with, of course, the use of DXVA2 for video decoding. Xbox GPU engine utilization fluctuates around 6%, and the broadcast overall remains an easy job, with latency caused by scheduling rather than processing complexity.
Write your code to fit within 80 columns of text. This helps those of us who like to print out code and look at your code in an xterm without resizing it. The longer answer is that there must be some limit to the width of the code in order to reasonably allow developers to have multiple files side-by-side in windows on a modest display. If you are going to pick a width limit, it is somewhat arbitrary but you might as well pick something standard. Going with 90 columns (for example) instead of 80 columns wouldn’t add any significant value and would be detrimental to printing out code. Also many other projects have standardized on 80 columns, so some people have already configured their editors for it (vs something else, like 90 columns). This is one of many contentious issues in coding standards, but it is not up for debate.
Is there a more stupid rule than wrapping source code lines just because someone might possibly look at the code in an xterm?
So the source consumes less than 25% of the width of a quite ordinary monitor, wasting all the space on the right. At the same time, the source code lines are objectively long and get massively wrapped.
Wrapping destroys readability of code.
Re-wrapping source code has an obvious negative effect on change tracking.
I, for one, want to see as much of the source code as possible at a glance, because it helps to have a picture of what is going on. Information at the end of lines is less important, so it is not a big deal even if it goes beyond the right visible margin, but it is important to have as many LINES of code visible as possible – I would even prefer to skip blank lines and use the IDE’s ability to collapse comments, functions, regions and scopes. For this reason some developers even rotate monitors into portrait mode – to see more of the source code at a time.
Fitting into 80 columns, and having it not even up for debate, is clearly a genius move to keep devs productive. Through continuous irritation.
A bump of a StackOverflow post about a Media Foundation design flaw related to video encoding.
Set attributes via ICodecAPI for a H.264 IMFSinkWriter Encoder
I am trying to tweak the attributes of the H.264 encoder created via ActivateObject() by retrieving the ICodecAPI interface to it. Although I do not get errors, my settings are not taken into account. […]
Media Foundation’s Sink Writer is a simplified API with the encoder configuration question left out of the picture. The fundamental problem here is that you don’t own the encoder MFT and you are accessing it over the writer’s head; the behavior of encoders with respect to changing settings after everything is set up depends on the implementation, which in the encoder’s case is vendor specific and may vary across hardware.
Your more reliable option is to manage the encoder MFT directly and supply the Sink Writer with already encoded video.
A potential trick to make things work with less effort is to retrieve the encoder’s IMFTransform as well, and to clear and then set back the input/output media types after you are finished with the ICodecAPI update. By nudging the media types you prompt the encoder to re-configure its internals, and it would do so with your fine tuning already in place. Note that this, generally speaking, might have side effects.
The ‘trick’ seems to work for some of the ICodecAPI parameters (e.g. CODECAPI_AVEncCommonQualityVsSpeed) and only for Microsoft’s h.264 encoder. No effect on CODECAPI_AVEncH264CABACEnable. The doc indeed seems to be specifically for Microsoft’s encoder and not be a generic API. I’m using the QuickSync and NVidia codecs, do you know if those are configurable via the ICodecAPI assuming I create the MFT myself?
Vendor-provided encoders fall under the Certified Hardware Encoder requirements, so they must support the ICodecAPI values mentioned in the MSDN article. What is important is that the order of configuration calls is not defined. If you are managing the encoder yourself, you would do the ICodecAPI setup before setting up the media types. In the Sink Writer scenario the writer has already configured the media types by the time you jump in with your fine tuning. Hence my trick suggestion includes the part about resetting the existing media types. Because this trick is sensitive to implementation details, I would suggest getting the current media types, clearing them on the MFT, doing the ICodecAPI part and then setting the types back. I assume that this should work in a greater number of scenarios, not just with the MS encoder. Yet it still remains an unreliable hack.
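A sketch of that refined variant, assuming stream 0 is the video stream, the encoder is the first transform behind the writer, and CODECAPI_AVEncCommonQualityVsSpeed stands in for whatever property is being tuned; to be called once the writer has configured the encoder:
#include <mfidl.h>
#include <mfreadwrite.h>
#include <mftransform.h>
#include <icodecapi.h>
#include <codecapi.h>
#include <atlbase.h>

HRESULT NudgeEncoder(IMFSinkWriter* Writer, DWORD VideoStreamIndex)
{
    CComQIPtr<IMFSinkWriterEx> WriterEx(Writer);
    if(!WriterEx)
        return E_NOINTERFACE;
    // Locate the encoder MFT behind the writer (index 0 is an assumption; enumerate and
    // check the category in real code)
    GUID Category;
    CComPtr<IMFTransform> Transform;
    HRESULT hr = WriterEx->GetTransformForStream(VideoStreamIndex, 0, &Category, &Transform);
    if(FAILED(hr))
        return hr;
    // Remember the media types the writer has already negotiated
    CComPtr<IMFMediaType> InputType, OutputType;
    Transform->GetInputCurrentType(0, &InputType);
    Transform->GetOutputCurrentType(0, &OutputType);
    // Clear the types, apply the ICodecAPI fine tuning, then set the types back
    Transform->SetInputType(0, nullptr, 0);
    Transform->SetOutputType(0, nullptr, 0);
    CComQIPtr<ICodecAPI> CodecApi(Transform);
    if(CodecApi)
    {
        VARIANT Value;
        VariantInit(&Value);
        Value.vt = VT_UI4;
        Value.ulVal = 50; // example value
        CodecApi->SetValue(&CODECAPI_AVEncCommonQualityVsSpeed, &Value);
    }
    Transform->SetOutputType(0, OutputType, 0); // output first, then input - the usual encoder order
    Transform->SetInputType(0, InputType, 0);
    return S_OK;
}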
IMO Nvidia’s encoder implementation is terrible (the worst across vendors); Intel’s is better, but it still has its own issues. Again IMO, the MFTs are only provided to meet the minimal certification requirements for hardware video encoding, and for this reason their implementations are not well aligned. Various software packages prefer to implement video encoding via vendor SDKs rather than the Media Foundation Transform interface. In one of my projects I, too, skipped the idea of leveraging the stock MFTs for encoding and implemented my own MFTs on top of the vendor SDKs.
Would the class factory approach in this post work with the IMFSinkWriter? This would avoid writing too much code…
I suppose so, yes; this should work, even though I feel it is not pleasant work to patch it that way. Also, you might need to take into account support for hardware encoders, because the Sink Writer tends to use hardware-assisted encoding in some cases, including the scenario where it is given a DXGI device.
Another sort of hack, similar but maybe a bit less intrusive (although to implement it you would have to have a better understanding of the internals), is to redefine the vendor-specific encoder CLSIDs within the Sink Writer initialization scope. There are just three encoders (AMD, Intel, Nvidia; okay, there is a fourth from Shanghai Zhaoxin Semiconductor, but it is not really popular) and their CLSIDs are known. If you CoRegisterClassObject in a smart way, you can hook MFT instantiation while still letting Media Foundation decide which encoder to choose. It is just another idea though, and which route is best might depend on other factors.
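Just to illustrate the interception idea and nothing more: the encoder CLSID and the vendor DLL path below are placeholders, error handling is reduced to a minimum, and whether a given activation path is actually caught this way needs to be verified per scenario.
#include <windows.h>
#include <mftransform.h>

static CLSID const CLSID_VendorEncoderMft = { /* vendor specific, known value */ };

class InterceptingClassFactory : public IClassFactory
{
    LONG m_ReferenceCount = 1;
public:
    // IUnknown
    STDMETHODIMP QueryInterface(REFIID InterfaceIdentifier, void** Object) override
    {
        if(InterfaceIdentifier == IID_IUnknown || InterfaceIdentifier == IID_IClassFactory)
        {
            *Object = static_cast<IClassFactory*>(this);
            AddRef();
            return S_OK;
        }
        *Object = nullptr;
        return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef() override { return InterlockedIncrement(&m_ReferenceCount); }
    STDMETHODIMP_(ULONG) Release() override { ULONG const ReferenceCount = InterlockedDecrement(&m_ReferenceCount); if(!ReferenceCount) delete this; return ReferenceCount; }
    // IClassFactory: create the real encoder straight from the vendor DLL so that the call
    // does not loop back into this very registration, then fine tune and hand it out
    STDMETHODIMP CreateInstance(IUnknown* OuterUnknown, REFIID InterfaceIdentifier, void** Object) override
    {
        typedef HRESULT (STDAPICALLTYPE* DLLGETCLASSOBJECT)(REFCLSID, REFIID, void**);
        HMODULE const Module = LoadLibraryW(L"VendorEncoder.dll"); // placeholder path
        if(!Module)
            return REGDB_E_CLASSNOTREG;
        auto const GetClassObject = reinterpret_cast<DLLGETCLASSOBJECT>(GetProcAddress(Module, "DllGetClassObject"));
        if(!GetClassObject)
            return REGDB_E_CLASSNOTREG;
        IClassFactory* RealFactory = nullptr;
        HRESULT hr = GetClassObject(CLSID_VendorEncoderMft, IID_PPV_ARGS(&RealFactory));
        if(FAILED(hr))
            return hr;
        hr = RealFactory->CreateInstance(OuterUnknown, InterfaceIdentifier, Object);
        RealFactory->Release();
        // ICodecAPI fine tuning of the freshly created MFT would go right here
        return hr;
    }
    STDMETHODIMP LockServer(BOOL) override { return S_OK; }
};

// Around the Sink Writer setup: register the hook, create the writer, revoke
//   DWORD RegistrationCookie;
//   CoRegisterClassObject(CLSID_VendorEncoderMft, new InterceptingClassFactory, CLSCTX_INPROC_SERVER, REGCLS_MULTIPLEUSE, &RegistrationCookie);
//   // ... MFCreateSinkWriterFromURL and the rest of the setup ...
//   CoRevokeClassObject(RegistrationCookie);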
The actual video frame is a P frame both in terms of MP4 box formatting and contained NAL units (the video is in fact an “infinite GOP” flavor of recording where all frames are P frames except the very first IDR one).
The problem is specific to fragmented MP4 files (and maybe even to a subset of those), however it is pretty consistent and shows up with both H.264 and H.265/HEVC video.
Another problem (bug) with the Microsoft Media Foundation MPEG-4 Media Source H.265/HEVC handler is that it ignores the conformance_window_flag flag and the related values from H.265’s seq_parameter_set_rbsp (see the H.265 spec, F.7.3.2.2.1 General sequence parameter set RBSP syntax).
The problem might or might not be limited to fragmented MP4 variants.
It is overall questionable whether it was a good idea to report video stream properties using parameter set data. This is not necessarily bad, especially if it had been accurately documented in the first place. Apparently it raises certain issues from time to time, like this one: Media Foundation and Windows Explorer reporting an incorrect video resolution, 2560×1440 instead of 1920×1080. Pretty much every other piece of software and library does not take the trouble to parse the bitstream and simply forwards the values from the tkhd and/or stsd boxes, and why not?
Not so with Media Foundation primitives, which shake the properties out of the bitstream and its parameter sets. There is no problem if the values match one another throughout the file, of course.
A bigger problem, however, is that when parsing the H.265/HEVC bitstream the media source handler fails to take the cropping window into account… Seriously!
conformance_window_flag equal to 1 indicates that the conformance cropping window offset parameters follow next in the SPS. conformance_window_flag equal to 0 indicates that the conformance cropping window offset parameters are not present.
The popular resolution of 1920×1080, when encoded in 16×16 blocks, effectively consists of 120×68 blocks, that is 1088 luma samples in height. The height of 1080 is obtained by cropping the 1088-sample coded height at one or both edges. By ignoring the cropping, Microsoft’s handler misreports the video size as 1920×1088 even when all other parts of the video file carry the correct value of 1080.
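For reference, the arithmetic the handler should be applying for the common 4:2:0 case (SubWidthC = SubHeightC = 2 per the H.265 spec), shown as a sketch:
// Conformance window offsets are expressed in chroma units, hence the factor of 2 for 4:2:0
unsigned int CroppedWidth(unsigned int pic_width_in_luma_samples, unsigned int conf_win_left_offset, unsigned int conf_win_right_offset)
{
    return pic_width_in_luma_samples - 2 * (conf_win_left_offset + conf_win_right_offset);
}
unsigned int CroppedHeight(unsigned int pic_height_in_luma_samples, unsigned int conf_win_top_offset, unsigned int conf_win_bottom_offset)
{
    return pic_height_in_luma_samples - 2 * (conf_win_top_offset + conf_win_bottom_offset);
}
// 1920x1088 coded with conf_win_bottom_offset = 4 (that is, 8 luma rows):
// CroppedHeight(1088, 0, 4) == 1080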
1920×1080 HEVC (meaning it does not play in every browser – beware and use Edge)
If there were a prize for the messiest SDK, Intel Media SDK would be a favorite. They seem to have put special care into making things confusing, unclear, inconvenient to use and thoroughly perplexing.
So there is no clear signal as to which versions of the SDK support the SkipFrame field. One has to query, and the query itself is not straightforward: one needs to build a multi-piece structure requesting multiple things, among which this field comes back zeroed if the functionality is not supported. That could be fine if other vendors had not shown that there are much friendlier ways to expose features to developers.
Going further: the member itself is documented as introduced in SDK version 1.9. Good to know! Let us continue reading:
The enumeration itself is available since SDK version 1.11. That’s a twist!
To summarize, it is likely unsafe to do anything about this functionality, which is one small thing among so many there, before SDK 1.9. With SDK versions 1.9 and 1.10 the values are undefined, because the enumeration was only introduced in SDK 1.11 and then extended in 1.13. Regardless of the SDK version, one also needs to build a query (which alone makes you feel miserable if you happen to know how capability discovery is implemented by NVIDIA), because even if the field is known to the SDK runtime, its implementation might be missing.
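A minimal sketch of such a query, assuming an already initialized encode session; if SkipFrame comes back zeroed, the feature is not there (structure and enumeration names as of the SDK versions mentioned above):
#include <mfxvideo.h>

bool IsSkipFrameSupported(mfxSession Session)
{
    mfxExtCodingOption2 Option2In {}, Option2Out {};
    Option2In.Header.BufferId = Option2Out.Header.BufferId = MFX_EXTBUFF_CODING_OPTION2;
    Option2In.Header.BufferSz = Option2Out.Header.BufferSz = sizeof (mfxExtCodingOption2);
    Option2In.SkipFrame = MFX_SKIPFRAME_INSERT_DUMMY; // ask whether this mode is doable
    mfxExtBuffer* ExtBufferIn[] { &Option2In.Header };
    mfxExtBuffer* ExtBufferOut[] { &Option2Out.Header };
    mfxVideoParam In {}, Out {};
    In.mfx.CodecId = Out.mfx.CodecId = MFX_CODEC_AVC;
    In.NumExtParam = Out.NumExtParam = 1;
    In.ExtParam = ExtBufferIn;
    Out.ExtParam = ExtBufferOut;
    mfxStatus const Status = MFXVideoENCODE_Query(Session, &In, &Out);
    return Status >= MFX_ERR_NONE && Option2Out.SkipFrame != 0; // warnings are positive statuses
}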
However, as often happens, there is a silver lining if you look hard enough: we have to thank Intel for offering the capability at all, because AMD does not offer it whatsoever.
Further experiments with Direct3D 11 shadertoy rendering: HTTP Server API integration and on-demand serving of parts of an HTTP Live Streaming (HLS) asset using Media Foundation with hardware video encoding. An hls.js player is capable of reading and playing the content, including stepping between quality levels.
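A reduced sketch of the HTTP Server API side of such an integration, under assumptions of my own: a fixed URL prefix, synchronous processing, and a hypothetical RenderAndEncodeSegment() standing in for the Media Foundation rendering/encoding part; shutdown and error handling are omitted.
#include <windows.h>
#include <http.h>
#include <string>
#include <vector>
#pragma comment(lib, "httpapi.lib")

std::vector<UINT8> RenderAndEncodeSegment(PCWSTR RelativePath); // hypothetical, produces playlist or segment bytes

void ServeHls()
{
    HTTPAPI_VERSION Version = HTTPAPI_VERSION_2;
    HttpInitialize(Version, HTTP_INITIALIZE_SERVER, nullptr);
    HTTP_SERVER_SESSION_ID SessionIdentifier;
    HttpCreateServerSession(Version, &SessionIdentifier, 0);
    HTTP_URL_GROUP_ID UrlGroupIdentifier;
    HttpCreateUrlGroup(SessionIdentifier, &UrlGroupIdentifier, 0);
    HttpAddUrlToUrlGroup(UrlGroupIdentifier, L"http://+:8080/hls/", 0, 0);
    HANDLE QueueHandle;
    HttpCreateRequestQueue(Version, nullptr, nullptr, 0, &QueueHandle);
    HTTP_BINDING_INFO BindingInformation {};
    BindingInformation.Flags.Present = 1;
    BindingInformation.RequestQueueHandle = QueueHandle;
    HttpSetUrlGroupProperty(UrlGroupIdentifier, HttpServerBindingProperty, &BindingInformation, sizeof BindingInformation);
    std::vector<UINT8> RequestBuffer(64 << 10);
    for(; ; )
    {
        auto const Request = reinterpret_cast<HTTP_REQUEST*>(RequestBuffer.data());
        ULONG BytesReturned;
        if(HttpReceiveHttpRequest(QueueHandle, HTTP_NULL_ID, 0, Request, static_cast<ULONG>(RequestBuffer.size()), &BytesReturned, nullptr) != NO_ERROR)
            break;
        // The playlist or media segment is produced on demand at this very moment
        std::vector<UINT8> Body = RenderAndEncodeSegment(Request->CookedUrl.pAbsPath);
        HTTP_RESPONSE Response;
        ZeroMemory(&Response, sizeof Response);
        Response.StatusCode = 200;
        Response.pReason = "OK";
        Response.ReasonLength = 2;
        HTTP_DATA_CHUNK BodyChunk;
        ZeroMemory(&BodyChunk, sizeof BodyChunk);
        BodyChunk.DataChunkType = HttpDataChunkFromMemory;
        BodyChunk.FromMemory.pBuffer = Body.data();
        BodyChunk.FromMemory.BufferLength = static_cast<ULONG>(Body.size());
        Response.EntityChunkCount = 1;
        Response.pEntityChunks = &BodyChunk;
        ULONG BytesSent;
        HttpSendHttpResponse(QueueHandle, Request->RequestId, 0, &Response, nullptr, &BytesSent, nullptr, 0, nullptr, nullptr);
    }
}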
A sort of Google Stadia for shadertoys, with video on demand and possibly low latency. Standard HLS latency (I am not following the latest HTTP/2 extensions for lower latency HLS) is of course nowhere near the real ultra-low latency that we have in Rainway for web-based game streaming, at levels as low as 10-20 milliseconds with HTML5 delivery; however, the approach proves that it is possible to deliver content with on-demand rendering.
Perhaps it is possible to use the approach to broadcast live content with server-side, GPU-based post-processing. With a single viewer it is easy to change quality levels, because the client requests a new segment without also downloading it in another quality. Since consumer-grade H.264/H.265 encoders are not normally designed to encode much faster than realtime (1920×1080@100 for H.264 is something to align expectations with, perhaps with only higher-end NVIDIA cards offering more), a quality change can be handled easily, but producing several qualities at a time might be an excessive load.
The overall simplicity of HLS syntax allows formatting the virtual asset in a flexible way: it can be a true live asset, or it can be a static, fixed-length, seek-enabled asset with on-demand rendering from a randomly accessed point.
I would also like to use this opportunity to mention another beautiful shader, “The Universe Within” by Martijn “BigWings” Steinrucken, which is running in my screenshot.
Some time ago I found that my account at Intel® Developer Zone was disabled. It was strange, but who knows; let us go with the assumption that there was a good reason.
For a moment I thought I was using the wrong credentials, but I have them saved. When the password reset email did not show up, it was a bigger surprise – at least these things were supposed to be working. The username reminder did work and generally confirmed that I was using the proper sign-in data.
Given that the “Contact Us” form is dedicated to login problems and, once submitted, says “Thank you for contacting Intel. Your information has been submitted and we will respond to your inquiry within 48 hours.”, they seem to be disabling accounts from time to time, and there is an emergency feedback channel for the unexpected.
However, I just realized that a couple of weeks or more have already passed, and there has been no response. RIP Intel Developer Zone.
Video GPU vendors (AMD, Intel, NVIDIA) ship their hardware with drivers, which in turn provide a hardware-assisted decoder for JPEG (also known as MJPG, MJPEG and Motion JPEG) video in the form factor of a Media Foundation Transform (MFT).
JPEG is not included in the DirectX Video Acceleration (DXVA) 2.0 specification, however the hardware carries an implementation of the decoder. A separate additional MFT is a natural way to provide OS integration.
Presumably the MFT behaves as a normal asynchronous MFT; however, as long as this markup does not have side effects with Microsoft’s own software, AMD does not care about the confusion it causes for others.
Furthermore, the registration information for this decoder suggests that it can decode into the MFVideoFormat_NV12 video format, and sadly this is again an inaccurate promise. Despite the claim, the capability is missing and Microsoft’s Video Processor MFT jumps in as needed to satisfy such a format conversion.
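A quick way to see what the registration actually advertises is to enumerate the hardware MJPG decoders and look at their registered output subtypes; a sketch, assuming MFStartup has been called and the activation objects expose MFT_OUTPUT_TYPES_Attributes:
#include <cstdio>
#include <vector>
#include <mfapi.h>
#include <mftransform.h>
#include <atlbase.h>

void ListMjpgDecoders()
{
    MFT_REGISTER_TYPE_INFO const InputType { MFMediaType_Video, MFVideoFormat_MJPG };
    IMFActivate** Activates = nullptr;
    UINT32 ActivateCount = 0;
    if(FAILED(MFTEnumEx(MFT_CATEGORY_VIDEO_DECODER, MFT_ENUM_FLAG_HARDWARE | MFT_ENUM_FLAG_SORTANDFILTER, &InputType, nullptr, &Activates, &ActivateCount)))
        return;
    for(UINT32 Index = 0; Index < ActivateCount; Index++)
    {
        CComHeapPtr<WCHAR> FriendlyName;
        UINT32 Length;
        Activates[Index]->GetAllocatedString(MFT_FRIENDLY_NAME_Attribute, &FriendlyName, &Length);
        wprintf(L"%s\n", static_cast<WCHAR*>(FriendlyName));
        UINT32 BlobSize = 0;
        Activates[Index]->GetBlobSize(MFT_OUTPUT_TYPES_Attributes, &BlobSize);
        std::vector<MFT_REGISTER_TYPE_INFO> OutputTypes(BlobSize / sizeof (MFT_REGISTER_TYPE_INFO));
        Activates[Index]->GetBlob(MFT_OUTPUT_TYPES_Attributes, reinterpret_cast<UINT8*>(OutputTypes.data()), BlobSize, &BlobSize);
        for(auto&& OutputType : OutputTypes)
            if(OutputType.guidSubtype == MFVideoFormat_NV12)
                wprintf(L"  NV12 output is advertised\n"); // which is not what the MFT actually delivers
        Activates[Index]->Release();
    }
    CoTaskMemFree(Activates);
}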
These were just minor things, more or less easy to tolerate. However, a rule of thumb is that the Media Foundation glue layer provided by technology partners such as GPU vendors only satisfies minimal certification requirements, and beyond that it causes suffering and pain to anyone who wants to use it in real-world scenarios.
AMD’s take on making developers feel miserable is the way hardware-assisted JPEG decoding actually takes place.
The thread 0xc880 has exited with code 0 (0x0).
The thread 0x593c has exited with code 0 (0x0).
The thread 0xa10 has exited with code 0 (0x0).
The thread 0x92c4 has exited with code 0 (0x0).
The thread 0x9c14 has exited with code 0 (0x0).
The thread 0xa094 has exited with code 0 (0x0).
The thread 0x609c has exited with code 0 (0x0).
The thread 0x47f8 has exited with code 0 (0x0).
The thread 0xe1ec has exited with code 0 (0x0).
The thread 0x6cd4 has exited with code 0 (0x0).
The thread 0x21f4 has exited with code 0 (0x0).
The thread 0xd8f8 has exited with code 0 (0x0).
The thread 0xf80 has exited with code 0 (0x0).
The thread 0x8a90 has exited with code 0 (0x0).
The thread 0x103a4 has exited with code 0 (0x0).
The thread 0xa16c has exited with code 0 (0x0).
The thread 0x6754 has exited with code 0 (0x0).
The thread 0x9054 has exited with code 0 (0x0).
The thread 0x9fe4 has exited with code 0 (0x0).
The thread 0x12360 has exited with code 0 (0x0).
The thread 0x31f8 has exited with code 0 (0x0).
The thread 0x3214 has exited with code 0 (0x0).
The thread 0x7968 has exited with code 0 (0x0).
The thread 0xbe84 has exited with code 0 (0x0).
The thread 0x11720 has exited with code 0 (0x0).
The thread 0xde10 has exited with code 0 (0x0).
The thread 0x5848 has exited with code 0 (0x0).
The thread 0x107fc has exited with code 0 (0x0).
The thread 0x6e04 has exited with code 0 (0x0).
The thread 0x6e90 has exited with code 0 (0x0).
The thread 0x2b18 has exited with code 0 (0x0).
The thread 0xa8c0 has exited with code 0 (0x0).
The thread 0xbd08 has exited with code 0 (0x0).
The thread 0x1262c has exited with code 0 (0x0).
The thread 0x12140 has exited with code 0 (0x0).
The thread 0x8044 has exited with code 0 (0x0).
The thread 0x6208 has exited with code 0 (0x0).
The thread 0x83f8 has exited with code 0 (0x0).
The thread 0x10734 has exited with code 0 (0x0).
For whatever reason they create a thread for every processed video frame, or close to that… Resource utilization and performance are affected accordingly. Imagine you are processing a video feed from a high frame rate camera. The decoder itself, including its AMF runtime overhead, decodes images in a millisecond or less, but they spoiled it with absurd threading, topped with other bugs.
However, AMD video cards still have the hardware implementation of the codec, and this capability is also exposed via their AMF SDK.
I guess they stop harassing developers once those developers switch from the out-of-the-box MFT to the SDK interface into their decoder. “AMD MFT MJPEG Decoder” is highly likely just a wrapper over the AMF interface; however, my guess is that the problematic part is exactly the abandoned wrapper and not the core functionality.
The previous post focused on problems with the hardware MFT decoder provided as a part of the video driver package. This time I am going to share some data on how the inefficiency affects video capture performance, using a high frame rate 260 FPS camera as a test stand. Apparently the effect is more visible at high frame rates, because the CPU and GPU hardware is already fast enough when processing a less complicated signal.
There is already some interest from the AMD end (why this is exceptional on its own deserves a separate post), and some bug fixes are already under way.
The performance problem is easy to overlook because the decoder overall performs without fatal issues and provides the expected output: no failures, no error codes, no deadlocks, neither the CPU nor a GPU engine is maxed out, so things look more or less fine at first glance… The test application uses Media Foundation and the Source Reader API to read textures in hardware MFT enabled mode, and discards the textures, just printing out the frame rate.
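A rough sketch of what such a test loop can look like; the real application reads asynchronously, while here a synchronous reader is assumed, created with MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS and MF_SOURCE_READER_D3D_MANAGER attributes so that the hardware decoder MFT is engaged:
#include <cstdio>
#include <windows.h>
#include <mfapi.h>
#include <mfreadwrite.h>
#include <atlbase.h>

void MeasureFrameRate(IMFSourceReader* Reader)
{
    ULONGLONG BaseTime = GetTickCount64();
    unsigned int SampleCount = 0;
    for(; ; )
    {
        DWORD StreamIndex, StreamFlags;
        LONGLONG SampleTime;
        CComPtr<IMFSample> Sample;
        if(FAILED(Reader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0, &StreamIndex, &StreamFlags, &SampleTime, &Sample)))
            break;
        if(StreamFlags & MF_SOURCE_READERF_ENDOFSTREAM)
            break;
        if(!Sample)
            continue;
        SampleCount++; // the sample (a texture in D3D aware mode) is discarded right away
        ULONGLONG const Time = GetTickCount64();
        if(Time - BaseTime >= 2000) // report every couple of seconds
        {
            printf("%.3f video samples per second captured\n", SampleCount * 1000.0 / (Time - BaseTime));
            BaseTime = Time;
            SampleCount = 0;
        }
    }
}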
AMD MFT MJPEG Decoder
C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
Using camera HD USB Camera
Using adapter Radeon RX 570 Series
Using video capture format 640x360@260.004 MFVideoFormat_MJPG
Using hardware decoder MFT AMD MFT MJPEG Decoder
Using video frame format 640x384@260.004 MFVideoFormat_YUY2
72.500 video samples per second captured
134.000 video samples per second captured
135.000 video samples per second captured
134.500 video samples per second captured
135.500 video samples per second captured
134.000 video samples per second captured
134.000 video samples per second captured
135.000 video samples per second captured
134.500 video samples per second captured
133.500 video samples per second captured
134.000 video samples per second captured
With no sign of hitting a bottleneck the reader process produces ~134 FPS from the video capture device.
Alax.Info MJPG Video Decoder for AMD Hardware
My replacement for the hardware decoder MFT does the decoding of the same signal and, generally, shares a lot with AMD’s own decoder: both MFTs are built on top of the Advanced Media Framework (AMF) SDK. The driver package installs the runtime for this SDK and installs a decoder MFT which is linked against a copy of the runtime (according to an AMD representative, the statically linked copy shares the same codebase).
C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
Using camera HD USB Camera
Using adapter Radeon RX 570 Series
Using video capture format 640x360@260.004 MFVideoFormat_MJPG
Using substitute decoder Alax.Info MJPG Video Decoder for AMD Hardware
Using video frame format 640x360@260.004 MFVideoFormat_YUY2
74.000 video samples per second captured
261.000 video samples per second captured
261.000 video samples per second captured
261.000 video samples per second captured
261.000 video samples per second captured
260.500 video samples per second captured
261.000 video samples per second captured
261.000 video samples per second captured
261.000 video samples per second captured
261.000 video samples per second captured
260.500 video samples per second captured
Similar CPU and GPU utilization levels, with a higher frame rate. Actually, with the expected frame rate, because it is the rate the camera is supposed to operate at.
1280×720@120 Mode
Interestingly, in the lower FPS mode the AMD MFT threading issues are still present, and, more than that, the MFT exhibits two other issues (one of them a “just ignore” one, per AMD’s comment). At the same time the video capture rate is no longer reduced: the horsepower of the hardware hides the implementation inefficiency.
Using camera HD USB Camera
Using adapter Radeon RX 570 Series
Using video capture format 1280x720@120.000 MFVideoFormat_MJPG
Using hardware decoder MFT AMD MFT MJPEG Decoder
Using video frame format 1280x736@120.000 MFVideoFormat_YUY2
18.500 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
Intel Hardware M-JPEG Decoder MFT
AMD is not the only GPU vendor out there, and my development system is equipped with an integrated GPU from Intel as well, so why not give it a try?
In AMD’s defence, Intel’s decoder exhibits subpar performance of its own:
C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
Using camera HD USB Camera
Using adapter Intel(R) UHD Graphics 630
Using video capture format 640x360@260.004 MFVideoFormat_MJPG
Using hardware decoder MFT Intel® Hardware M-JPEG Decoder MFT
Using video frame format 640x368@260.004 MFVideoFormat_YUY2
24.000 video samples per second captured
63.500 video samples per second captured
63.500 video samples per second captured
64.000 video samples per second captured
63.500 video samples per second captured
63.000 video samples per second captured
63.500 video samples per second captured
62.000 video samples per second captured
63.500 video samples per second captured
64.000 video samples per second captured
63.500 video samples per second captured
At lower relative utilization levels and, again, without hitting any bottleneck visibly, the capture rate is reduced.
And this happens even without the threading problem that I could at least see in AMD’s case.
The 120 FPS mode is doing fine:
C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
Using camera HD USB Camera
Using adapter Intel(R) UHD Graphics 630
Using video capture format 1280x720@120.000 MFVideoFormat_MJPG
Using hardware decoder MFT Intel® Hardware M-JPEG Decoder MFT
Using video frame format 1280x720@120.000 MFVideoFormat_YUY2
77.000 video samples per second captured
119.000 video samples per second captured
120.000 video samples per second captured
121.000 video samples per second captured
119.000 video samples per second captured
121.000 video samples per second captured
120.000 video samples per second captured
120.000 video samples per second captured
120.500 video samples per second captured
119.500 video samples per second captured
120.000 video samples per second captured
That is, there is an obvious performance issue in Intel’s implementation, since it fails to process the lower resolution signal at its original rate, or even at the rate it shows for the higher resolution signal!
So does 1920×1080@60:
C:\...\MjpgCameraReader\bin\x64\Release>MjpgCameraReader.exe
Using camera HD USB Camera
Using adapter Intel(R) UHD Graphics 630
Using video capture format 1920x1080@60.000 MFVideoFormat_MJPG
Using hardware decoder MFT Intel® Hardware M-JPEG Decoder MFT
Using video frame format 1920x1088@60.000 MFVideoFormat_YUY2
49.500 video samples per second captured
60.500 video samples per second captured
59.500 video samples per second captured
60.000 video samples per second captured
60.000 video samples per second captured
60.000 video samples per second captured
60.000 video samples per second captured
60.000 video samples per second captured
60.000 video samples per second captured
60.000 video samples per second captured
60.000 video samples per second captured
In closing
The bottom line is that the hardware ASICs are generally good, but the quality of the software MFT layer is not something GPU vendors care much about.
The application below does the testing on the first available GPU, and it assumes you have a video capture device compatible with the Media Foundation API. The application uses the camera’s highest frame rate MJPG format and the hardware decoder MFT associated with the GPU.
One more thing to mention is that video capture takes place through the so-called Microsoft Windows Camera Frame Server (FrameServer) service, notorious and undocumented. The Frame Server virtualizes the video capture device, adding processing overhead and cross-process synchronization.
Some time later I will compare the performance of capturing via the Frame Server and via the Media Foundation default implementation of the video capture device proxy. I expect, though, that there is no visible performance difference, as those parts are, after all, done well.
“Modern” C++/WinRT is a way to write rather powerful things in a compact and readable manner, mixing together everything you can think of: classic C++ and its libraries; UWP APIs including HTTP client, JSON and COM; the ability to put the code into console/desktop applications; the async API model and C++20 coroutines.
A fragment of Telegram bot code that echoes a message back, written with just the bare Windows 10 SDK API set and no external libraries, for example:
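The original snippet is not reproduced here; below is a minimal sketch along the same lines, talking to the public Telegram Bot API with Windows.Web.Http and Windows.Data.Json (the <TOKEN> placeholder and the simplifying assumption that every update is a text message are mine):
#include <winrt/Windows.Foundation.h>
#include <winrt/Windows.Foundation.Collections.h>
#include <winrt/Windows.Web.Http.h>
#include <winrt/Windows.Data.Json.h>
using namespace winrt;
using namespace Windows::Foundation;
using namespace Windows::Web::Http;
using namespace Windows::Data::Json;

IAsyncAction EchoOnceAsync() // call winrt::init_apartment() beforehand
{
    HttpClient Client;
    JsonObject const Updates = JsonObject::Parse(co_await Client.GetStringAsync(Uri { L"https://api.telegram.org/bot<TOKEN>/getUpdates" }));
    for(IJsonValue const& UpdateValue : Updates.GetNamedArray(L"result"))
    {
        JsonObject const Message = UpdateValue.GetObject().GetNamedObject(L"message");
        hstring const Text = Message.GetNamedString(L"text");
        auto const ChatIdentifier = static_cast<int64_t>(Message.GetNamedObject(L"chat").GetNamedNumber(L"id"));
        // Echo the text back into the same chat
        co_await Client.GetStringAsync(Uri { L"https://api.telegram.org/bot<TOKEN>/sendMessage?chat_id=" + to_hstring(ChatIdentifier) + L"&text=" + Uri::EscapeComponent(Text) });
    }
}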
In continuation of the previous post about C++/WinRT and Telegram, here we are with @ParameterSetAnalyzeBot: “Your buddy to extract H.264 parameter set NAL data”. In a chat, it expects an MP4 file with an H.264 video track to be sent to him (her?). Then it extracts the data from the sample description box and deciphers it into readable form:
It literally feeds the MP4 file to the Media Foundation Source Reader API, pulls MF_MT_MPEG_SEQUENCE_HEADER and pipes the data to the h264_analyze tool (my fork of it has a Visual Studio 2019 project and an added ability to take input from stdin for piping needs).
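A sketch of the extraction part, assuming MFStartup has been called; the blob holds the parameter set data, which is then piped to h264_analyze:
#include <vector>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <atlbase.h>

std::vector<UINT8> ReadSequenceHeader(PCWSTR Path)
{
    CComPtr<IMFSourceReader> Reader;
    ATLENSURE_SUCCEEDED(MFCreateSourceReaderFromURL(Path, nullptr, &Reader));
    CComPtr<IMFMediaType> MediaType;
    ATLENSURE_SUCCEEDED(Reader->GetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, &MediaType));
    UINT32 Size = 0;
    ATLENSURE_SUCCEEDED(MediaType->GetBlobSize(MF_MT_MPEG_SEQUENCE_HEADER, &Size));
    std::vector<UINT8> Data(Size);
    ATLENSURE_SUCCEEDED(MediaType->GetBlob(MF_MT_MPEG_SEQUENCE_HEADER, Data.data(), Size, &Size));
    return Data; // feed this to h264_analyze via its stdin
}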
Maybe it is worth adding a full Media Foundation attribute printout as well, and similar H.265/HEVC data. That will have to wait for the next occasion though.
And – yeah – it does have support for fragmented MP4 too:
Comparing time codes is one method, and getting an impression of the latency by driving is another. The Rainway Xbox One UWP application acts as a thin client to a desktop PC game.
Whoever the engineers who wrote the core technology, the minimal-latency streaming code – wow, I am so impressed by what they’ve created! It’s SO quick, like I’m streaming now from two computers to remote platforms, and everything is all over WiFi and latency is 9ms or less. This is giving life to some old hardware, and it’s enabling me to use my computer anywhere.