fre:ac Developer Blog

SuperFast LAME technical details

Written by Robert

Friday, 20 July 2018 19:27

So I finished the SuperFast LAME multi-threaded MP3 encoder last week and it's time to write about some technical aspects of it.

tl;dr: Implementing SuperFast LAME required some additional work to handle certain features of the MP3 format. You can download a preview release of fre:ac with SuperFast LAME support from GitHub.

The challenge

SuperFast LAME is significantly more complex than the SuperFast components for AAC, Opus and Speex, mostly because of technical peculiarities of the MP3 format.

The main difficulty is that while most other formats have discrete frames of audio samples in their bitstreams, MP3 frames can overlap each other:

In this example, the average frame size is 4 blocks of data. The individual frame lengths are 4, 3, 4, 3, 1, 5, 5 and 7 blocks. In an AAC bitstream, each frame will simply have a length matching the number of data blocks required for that frame and the frames will neatly come one after another. In an MP3 bitstream, however, (at least for CBR files, VBR is more complicated) frames have a fixed size and when there is space left in a frame after all samples have been encoded, that space can be used by the following frames. This space available to following frames is called bit reservoir and allows the codec to maintain a set target quality in most cases, even when frame sizes are fixed and audio complexity changes.

Have a look at the example. The 5th frame is only one data block long and that data block fits completely into the 4th frame. It even leaves some space, so the first data block of the 6th frame starts in the 4th frame as well. Looking at only the 5th and 6th frame, their layout in the bitstream looks like this:

Here the frame headers come after the data and in case of frame #5, there even is data of another frame (#6) between its data and its header. In real world MP3 streams, the situation can be even more intricate.
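To make the layout concrete, here is a minimal Python sketch that reproduces the example numbers. It is a simplified model and not code from fre:ac: it counts whole data blocks, ignores frame headers and side information, and assumes frame data is packed back to back. The names layout_frames and back_pointer are made up for this illustration.

```python
def layout_frames(data_lengths, slot_size):
    """Pack frame data back to back into fixed-size slots.

    Returns one (data_start, data_end, back_pointer) tuple per frame, where
    back_pointer is the number of blocks before the frame's own slot at which
    its data begins (0 means the data starts inside its own slot).
    """
    layout = []
    position = 0  # next free block in the data area
    for index, length in enumerate(data_lengths):
        slot_start = index * slot_size
        slot_end = slot_start + slot_size
        data_start, data_end = position, position + length
        if data_end > slot_end:
            # A frame may borrow from earlier slots, never from later ones.
            raise ValueError(f"frame {index + 1} does not fit")
        layout.append((data_start, data_end, slot_start - data_start))
        position = data_end
    return layout


# The example from the text: frame slots of 4 blocks each.
example = layout_frames([4, 3, 4, 3, 1, 5, 5, 7], slot_size=4)
back_pointers = [entry[2] for entry in example]  # [0, 0, 1, 1, 2, 5, 4, 3]
```

In this model, frame 5 occupies block 14 only, which lies inside the slot of frame 4 (blocks 12 to 15), and frame 6 starts at block 15, still inside that same slot.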

Basic SuperFast operation

So this is a problem when implementing the SuperFast technology for MP3. SuperFast works by passing chunks of audio data to separate encoder instances and later joining the encoded data blocks back together in the right order. This requires the frames to be available in discrete form in order to deal with overlap and joining the frames correctly. The SuperFast encoding loop usually looks like this (click to jump to example source code):
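The basic idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the fre:ac implementation: toy_encode stands in for a real codec instance, frames are plain tuples of samples, and the values of FRAME_SAMPLES and OVERLAP_FRAMES are hypothetical. The sketch also assumes that the overlap is a short lead-in encoded before each chunk, whose frames are dropped again when joining.

```python
from concurrent.futures import ThreadPoolExecutor

FRAME_SAMPLES = 4    # samples per frame in this toy model (hypothetical value)
OVERLAP_FRAMES = 2   # lead-in frames encoded before each chunk, dropped on joining


def toy_encode(samples):
    """Stand-in for one encoder instance: one 'frame' per FRAME_SAMPLES samples."""
    return [tuple(samples[i:i + FRAME_SAMPLES])
            for i in range(0, len(samples), FRAME_SAMPLES)]


def encode_parallel(samples, chunk_frames, workers=4):
    """Encode overlapping chunks in parallel and join the frames in order."""
    chunk = chunk_frames * FRAME_SAMPLES
    overlap = OVERLAP_FRAMES * FRAME_SAMPLES
    output = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = []
        for start in range(0, len(samples), chunk):
            lead_in = min(overlap, start)  # the first chunk has no lead-in
            job = pool.submit(toy_encode, samples[start - lead_in:start + chunk])
            jobs.append((lead_in // FRAME_SAMPLES, job))
        for skip, job in jobs:             # the job list preserves chunk order
            output.extend(job.result()[skip:])  # skip the overlap frames
    return output


joined = encode_parallel(list(range(40)), chunk_frames=3)
```

Because every frame here is discrete, skipping the overlap is a simple list slice. That is exactly the step that becomes difficult once frames share a bit reservoir.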

MP3 difficulties

When dealing with MP3, multiple issues arise from the peculiarities around the bit reservoir:

1. The encoder might not return all encoded frames after processing a chunk of data as some frames might still be waiting for additional data to put in the bit reservoir.
2. Frames are not available in discrete form, but may be overlapping each other.
3. After dealing with the above, frames need to be put back into an MP3 compatible bitstream after joining.
4. Frames might require more reservoir than is available after joining with frames coming from other codec instances.

Previous attempts to create multi-threaded MP3 encoders dealt with these issues in a very simple way: They completely disabled the bit reservoir to get nicely laid out frames with no overlapping data. This solution cuts into the resulting MP3's quality, though, which is why such encoders never really gained traction.

So let's see how we can handle these issues more adequately.

Unraveling it

The first issue is relatively simple. After encoding a chunk of data, we call lame_encode_flush_nogap to force the encoder to return all encoded frames even if they are not completely filled yet. This ensures that all relevant frames are available in the next steps.

The second issue is handled by a bitstream unpacker that parses the data returned by the encoder and extracts discrete frames from the bitstream. After this step all frames will be laid out as a frame header followed by the complete data belonging to that frame. No more intermixing with other frames' headers or data.
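A toy version of such an unpacker might look like the following Python sketch. It is a simplified model, not the actual fre:ac code: real MP3 frames carry the back pointer in their side information and the data length has to be derived from several fields, while here both are given directly. PackedFrame and its fields are names invented for this example.

```python
from typing import List, NamedTuple


class PackedFrame(NamedTuple):
    back_pointer: int  # bytes before this slot at which the frame's data starts
    data_length: int   # number of data bytes that belong to this frame
    payload: bytes     # data area of this slot (may hold other frames' data)


def unpack(frames: List[PackedFrame]) -> List[bytes]:
    """Extract each frame's own data from a bitstream with shared reservoir."""
    stream = bytearray()  # all data areas seen so far, concatenated
    discrete = []
    for frame in frames:
        start = len(stream) - frame.back_pointer
        if start < 0:
            raise ValueError("back pointer reaches before the start of the stream")
        stream += frame.payload
        end = start + frame.data_length
        if end > len(stream):
            raise ValueError("frame data extends past its own slot")
        discrete.append(bytes(stream[start:end]))
    return discrete


# The example layout from above: 4-byte slots, data lengths 4, 3, 4, 3, 1, 5, 5, 7.
data = [b"AAAA", b"BBB", b"CCCC", b"DDD", b"E", b"FFFFF", b"GGGGG", b"HHHHHHH"]
packed_stream = b"".join(data)
pointers = [0, 0, 1, 1, 2, 5, 4, 3]
packed = [PackedFrame(pointers[i], len(data[i]), packed_stream[4 * i:4 * i + 4])
          for i in range(len(data))]
unpacked = unpack(packed)
```

For simplicity the sketch keeps the whole stream in memory. A real unpacker only needs to remember as many bytes as the largest possible back pointer.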

After unpacking, we are ready to perform overlap skipping and ordering of data chunks from different encoder instances.

When writing the ordered frames to the output stream, we now need to make sure to repack them back into an MP3 compatible bitstream. The repacker deals with frame sizes and the bit reservoir and tries to pack frames in the most compact way.

Sometimes, though, a frame requires more reservoir than is currently available and the repacker needs to find a way to fit it in. It basically has two options to accomplish this: If only a few extra bits are needed, the repacker can add padding to a frame. This adds one byte to the frame, which is sometimes enough to provide the required reservoir. In cases where it is not sufficient, the repacker can enlarge one or more previous frames to a bigger frame size. This usually provides enough reservoir, but requires all affected frames to be repacked again.
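The "enlarge previous frames" strategy can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the real repacker: sizes are abstract units, slot_sizes is a hypothetical list of data capacities the format allows per frame, max_reservoir is a hypothetical cap on the reservoir, and padding is not modeled separately (it could be treated as one more, slightly larger slot size).

```python
def repack(data_lengths, slot_sizes, max_reservoir):
    """Choose a slot size per frame so that every frame's data fits.

    Frames use the smallest slot that works. If a frame cannot be fitted,
    the nearest earlier frame that can still grow is enlarged and packing
    starts over, so all affected frames get repacked.
    """
    slot_sizes = sorted(slot_sizes)
    minimum = [0] * len(data_lengths)  # smallest slot index each frame may use
    while True:
        used, reservoir, failed = [], 0, None
        for i, need in enumerate(data_lengths):
            fit = next((k for k in range(minimum[i], len(slot_sizes))
                        if reservoir + slot_sizes[k] >= need), None)
            if fit is None:
                failed = i
                break
            used.append(fit)
            # Unused space becomes reservoir for the following frames.
            reservoir = min(reservoir + slot_sizes[fit] - need, max_reservoir)
        if failed is None:
            return [slot_sizes[k] for k in used]
        for j in range(failed - 1, -1, -1):
            if used[j] + 1 < len(slot_sizes):
                minimum[j] = used[j] + 1  # enlarge this earlier frame
                break
        else:
            raise ValueError(f"frame {failed + 1} does not fit into the bitstream")


slots = repack([4, 3, 12], slot_sizes=[4, 6, 8], max_reservoir=5)
```

In this example the third frame needs 12 units, more than the largest slot can hold, so the second frame is enlarged step by step until it leaves enough reservoir behind. The ValueError branch corresponds to the case where no amount of enlarging helps.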

However, even this might not be enough when issue number 4 comes into play. In some rare cases, a frame requires so much reservoir that it is simply not possible to fit it into the bitstream. This can happen because one encoder instance cannot know how much reservoir will be left over by the instance encoding the preceding chunk. In cases where the preceding instance has to deal with a difficult to encode signal, it might leave next to no reservoir available to the next encoder.

Dealing with this was difficult. While there are some simple options like forcing the encoder to use a lower bitrate, these might potentially result in audible quality drops. So I tried to find another way to handle this.

Basically, the SuperFast algorithm will try to re-encode the audio part starting with the non-fitting frame and repeat this until it fits. To work around situations where it might never fit using this strategy, each time it fails, we try to put some more pressure on the bit reservoir by prepending a few frames of difficult to encode dummy data. These dummy frames force the encoder to spend some reservoir on them and lead to using less reservoir for our previously non-fitting frame, eventually allowing us to fit the frame into the bitstream.

The result

With all these additional steps, the process for SuperFast LAME now looks like this (click to jump to source code):

1. Get next worker thread and wait until it is ready
2. Check if worker thread has encoded frames
   a. Unpack frames
   b. Skip overlap frames
   c. Repack other frames
      i. If successful, continue with 2d
      ii. Write completed frames to output stream
      iii. Put increasing pressure on the reservoir
      iv. Re-encode starting from failed frame
      v. Continue with 2a
   d. Write repacked frames to output stream
3. Pass next chunk of audio to worker thread
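The retry part of this loop (step 2c) can be sketched roughly as follows. This is an illustrative Python model with made-up numbers, not the code from the repository: frames are represented only by the amount of data they need, slot and max_reservoir are hypothetical, and toy_reencode is a stand-in for a real encoder instance that simply pretends more pressure makes the failed frame smaller.

```python
def pack_until_misfit(frames, reservoir, slot, max_reservoir):
    """Pack frames in order; return (frames that fit, reservoir left after them)."""
    for count, need in enumerate(frames):
        if reservoir + slot < need:  # frame needs more than is available
            return count, reservoir
        reservoir = min(reservoir + slot - need, max_reservoir)
    return len(frames), reservoir


def write_with_retries(frames, reencode, slot=4, max_reservoir=5, max_tries=10):
    """Write frames, re-encoding from the first non-fitting frame when needed."""
    output, reservoir, pressure = [], 0, 0
    while True:
        fitted, reservoir = pack_until_misfit(frames, reservoir, slot, max_reservoir)
        output.extend(frames[:fitted])  # write the frames that were packed
        if fitted == len(frames):
            return output
        pressure += 1                   # put more pressure on the reservoir
        if pressure > max_tries:
            raise RuntimeError("frame could not be fitted into the bitstream")
        frames = reencode(len(output), pressure)  # re-encode from the failed frame


# Toy encoder: each unit of pressure shrinks the first re-encoded frame by 4.
DEMAND = [4, 3, 12, 4]
calls = []


def toy_reencode(start, pressure):
    calls.append((start, pressure))
    rest = DEMAND[start:]
    return [max(1, rest[0] - 4 * pressure)] + rest[1:]


result = write_with_retries(DEMAND, toy_reencode)
```

Here the third frame fails twice before it fits, so toy_reencode is called with pressure 1 and then 2, both times starting at the failed frame. The max_tries limit is an addition of this sketch to guarantee termination.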

Arriving at this point took several months of work, but was absolutely worth it. The SuperFast LAME encoder scales well with the number of CPU cores and can provide a 3.5x speedup on a quad-core processor. On my 8 core, 16 thread CPU, I was able to achieve up to 12x speed increase with it.

Unlike previous attempts to speed up MP3 encoding, SuperFast LAME does this while still using the MP3 format's bit reservoir feature and uses an unmodified encoder library - the necessary changes are all implemented in the frontend application and could be used with alternative MP3 encoders as well.

I plan to implement this technology on top of the command line LAME frontend in the future. For now, my priority is on releasing fre:ac 1.1 beta and final versions, though. But keep watching this blog for future announcements about a SuperFast enabled stand-alone LAME version.

Downloads

SuperFast LAME is now in testing and included in the SuperFast Preview Release 3 available at GitHub.

Source code

Check out the SuperFast repository on GitHub if you would like to learn more or build the code yourself. The SuperFast LAME implementation can be found in the components/lame folder.

fre:ac development status update 06/2018

Written by Robert

Wednesday, 04 July 2018 23:39

The June development update is overdue, but better late than never, here it is. It was a very productive month, so let's get right to the good stuff.

Parallel conversion jobs

The current alpha release supports only one conversion job at a time. Multiple tracks in a conversion can be processed in parallel, but when you try to start a new job while a conversion is still running, you just get a message asking if you would like to schedule the new job for after the current one is finished.

The next release will enable parallel conversion jobs. As long as there are CPU threads left, multiple conversions, possibly using different settings, can run at the same time. This helps when converting multiple albums to a single file per album or when ripping CDs using multiple drives.

Improved handling of automatic ripping

This brings us to the next item. There are some issues with the current alpha when using the automatic ripping option with multiple drives. When inserting a disc while other tracks are still in the joblist, the new ripping job will try to process those other tracks again, leading to some tracks being ripped more than once. Also, the new job will not start before any currently running rip is finished. Both issues will be fixed in the next alpha which makes ripping with multiple drives much more usable.

Fixed metadata bug with Core Audio on Windows

In May, a user opened an issue on GitHub reporting that when converting ALAC files to AAC using the Core Audio encoder on Windows, tags were missing on some files. I could easily reproduce the issue, but it seemed really strange. It occurred only when converting files decoded with an external decoder (i.e. a separate .exe called by fre:ac) and the selected encoder was Core Audio. That didn't seem to make any sense at first.

It turned out to actually be a bug in Apple's Core Audio implementation on Windows. It would make file handles created by its API calls inheritable by sub-processes. The sub-processes (in this case the external decoders) would then inherit any open handles and lock the respective files, making them unwritable by the tagger component.

Making handles inheritable is something that an API never should do as it can lead to unforeseeable behavior and very difficult to analyze bugs.
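To illustrate what the inheritable flag is, here is a small Python example. It has nothing to do with the Core Audio APIs themselves. It only shows that a file handle carries such a flag, that Python creates handles non-inheritable by default, and that the flag can be flipped explicitly. The function name inheritable_flags is made up for this example.

```python
import os
import tempfile


def inheritable_flags():
    """Return the inheritable flag of a fresh file handle, before and after forcing it."""
    with tempfile.TemporaryFile() as handle:
        fd = handle.fileno()
        default = os.get_inheritable(fd)  # handles are non-inheritable by default
        os.set_inheritable(fd, True)      # child processes could now inherit it
        forced = os.get_inheritable(fd)
    return default, forced


flags = inheritable_flags()
```

A handle with the flag set can be passed on to child processes, which is how an external decoder can end up holding files open that it never asked for.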

Fortunately there is a work-around by avoiding the problematic APIs. The next alpha release will include this fix.

Automatic codec builds

Until now, all the codecs included with fre:ac have been built manually: Set the correct compiler flags for each codec on each supported OS, apply necessary patches, configure the codecs with the right flags and run make to build them. This costs a lot of time whenever a new codec version is released and is also a bit error-prone, so it was necessary to change it.

I built a script to automate all the steps listed above for most of the necessary codecs and some other libraries. The script can compile FAAC, FAAD2, FDK-AAC, FLAC, LAME, libav, libogg, libsamplerate, libsndfile, Monkey's Audio, mpg123, Opus, RubberBand, Speex, Vorbis and WavPack on Windows, macOS, Linux and FreeBSD. Whenever a new version of one of these libraries is released in the future, I can simply update the package download URL and run the script to build a new release.

The script can be found in the source repository on GitHub.

Reworked donation dialog

The donation dialog has been reworked to support more payment types. Previously supporting only PayPal, the new dialog adds support for Donorbox, SEPA transfers and the Bitcoin and Ethereum crypto currencies.

Other items

A number of other changes have been implemented in the past month, the most notable of which are:

HiDPI icons
Preparing for the upcoming beta release, I added higher quality versions of the toolbar icons that now look crisp on HiDPI displays like Apple's Retina screens.

Completely translatable
In the current alpha release, not all strings are translatable. This applies to configuration dialogs for external codecs especially. The next alpha will fix this and enable translations for WavPack, Musepack, OptimFROG and TAK configuration dialogs along with some other previously untranslatable strings.

Fixed MP4 metadata bug
When converting multiple files in parallel to AAC or ALAC output, it can happen that some files end up being unoptimized due to a bug in the MP4v2 library used by fre:ac. Optimization of MP4 files means that tags and the seektable are moved to the beginning of the file for more efficient processing. The next alpha release will include a work-around for the MP4v2 bug fixing the issue of MP4 files not being optimized.

Downloads now hosted on GitHub
The links on the downloads page now point to GitHub instead of SourceForge. This enables direct downloads without an intermediate page to choose a mirror and allows downloading using right-click + save as.

SuperFast LAME status

There were some open issues with the SuperFast LAME implementation when I last wrote about it in the April status update. These have been fixed now and there will be another SuperFast preview release including LAME support very soon after the next alpha. I'm also preparing a technical article about how the MP3 bit reservoir is handled in SuperFast LAME. This should be out within one week from now.

That's it for this month. Be sure to come back in about one month for the next update.

fre:ac development status update 05/2018

Written by Robert

Thursday, 31 May 2018 22:35

Hi all, it's time for an update on fre:ac development again. The past month was quite productive and so I have lots of things to talk about.

Integration with Travis CI

The GitHub projects for the smooth Class Library, BoCA and fre:ac are now integrated with the Travis CI platform for automated build tests. Every commit to one of these repositories now starts automatic build processes on Linux and macOS to check if anything got broken. This improves the development process by ensuring that build-breaking issues will be noticed quickly.

The build processes are also started for pull requests, so anyone who submits a patch can immediately see if it breaks anything.

Fast CRC patches accepted into FLAC and Ogg

The patches for faster CRC calculations I wrote about last month have been accepted by the upstream FLAC and Ogg projects. So with the next FLAC and Ogg releases, any software using them will benefit from faster encoding and decoding.

Allowing playback during conversions

Until now, it has not been possible to play a track in fre:ac while a conversion is running. This limitation will be lifted with the next alpha release. You will be able to play tracks during conversions as long as they are not on a CD that is currently being ripped.

Faster AAC, APE and WMA encoding

fre:ac's AAC, Monkey's Audio (APE) and WMA encoder components use temporary files for writing output data. The content of these files is transferred to the actual output file after the encoding process is finished which causes a small delay at the end of each conversion. The next alpha release will fix this by writing directly to the actual output file from the start.

Making this possible required an addition to the internal IO filter interface and extensive testing. This is why it was not done like this earlier.

Improved handling of album artists

Starting with the next alpha release, fre:ac will make use of the <albumartist> placeholder in the default output filename pattern. This prevents the creation of separate folders for each track when dealing with sampler CDs. On samplers, the previously used <artist> placeholder would resolve to a different artist for each track, while <albumartist> will usually be something like Various artists and be the same for every track.
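The effect on sampler CDs can be shown with a short Python sketch. The placeholder syntax is the one described above, but the concrete patterns, the tag values and the resolve_pattern helper are assumptions made up for this example and are not fre:ac's actual defaults or code.

```python
import re


def resolve_pattern(pattern, tags):
    """Replace <name> placeholders with tag values; unknown ones stay as-is."""
    def lookup(match):
        return str(tags.get(match.group(1), match.group(0)))
    return re.sub(r"<(\w+)>", lookup, pattern)


# Two tracks of a sampler CD (made-up tags).
tracks = [
    {"artist": "Artist A", "albumartist": "Various artists",
     "album": "Sampler", "track": "01", "title": "One"},
    {"artist": "Artist B", "albumartist": "Various artists",
     "album": "Sampler", "track": "02", "title": "Two"},
]

old_pattern = "<artist> - <album>/<track> - <title>"       # hypothetical pattern
new_pattern = "<albumartist> - <album>/<track> - <title>"  # hypothetical pattern

old_folders = {resolve_pattern(old_pattern, t).rsplit("/", 1)[0] for t in tracks}
new_folders = {resolve_pattern(new_pattern, t).rsplit("/", 1)[0] for t in tracks}
```

With the artist based pattern the two tracks end up in two different folders, while the album artist based pattern puts both into a single "Various artists - Sampler" folder.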

SourceForge Project of the Month

fre:ac has been chosen as the SourceForge Project of the Month of May 2018. This is the second time fre:ac won this award after October 2015. You can read a short interview with me in the SourceForge blog.

This closes this month's issue. Be sure to come back in June for another update.

fre:ac development status update 04/2018

Written by Robert

Tuesday, 01 May 2018 12:28

It's time for a new development status update after an interesting month.

Optimized CRC routines for audio codecs

In case you missed it, here is my article on speeding up LAME, FLAC, Ogg and Monkey's Audio with faster CRC checks. The proposed CRC algorithm is roughly 5 times faster than the one previously used and results in a speedup of about 5% for FLAC encoding and decoding. Patches have been submitted to the upstream projects and I hope for integration in official releases of these codecs.

Fixed crashes with local CDDB queries

A user reported occasional crashes when querying a local CDDB database on Linux. This turned out to be a thread-safety issue that manifested itself only when the CDDB query dialog was displayed and then immediately closed before the main thread finished processing the window mapping event.

The issue affects all systems using the X11 window system, so it can happen on Linux, FreeBSD and other Unix-like systems.

This and another issue that I found while investigating it will be fixed in the next alpha release.

SuperFast LAME nearing completion

A whole bunch of changes have been incorporated into the SuperFast version of the LAME MP3 encoder component. It's almost complete now and an official preview release is getting closer.

This month's changes include:

Support for CBR mode and VBR rate limiting
Support for MP3s with frame CRCs
Writing Xing header table of contents
Writing Xing header CRCs

There is just one item left on my list now which is related to handling the bit reservoir in high complexity situations (especially with MPEG 2 streams at 22.05 or 24 kHz). In that case it can happen that an encoder thread tries to use more reservoir than actually is available. Special handling has to be implemented to resolve such situations. I hope to be able to finish this in May.

While waiting for SuperFast LAME, make sure to check out the 2nd SuperFast preview release with added support for FDK-AAC and Speex and tuning for Opus and Core Audio AAC.

Faster CRC checks to speed up codecs

Written by Robert

Sunday, 29 April 2018 22:33

So, I kind of stumbled into this, but always looking for possible optimizations, I simply had to explore it...

tl;dr: I accelerated checksum calculations and thus encoding times of LAME, FLAC, Ogg and Monkey's Audio using an optimized CRC algorithm. Find patches at the end of this post. These will be part of the next fre:ac 1.1 alpha release.

Calculating Xing/LAME header CRCs

Working on the LAME MP3 implementation of my SuperFast technology, I came across the necessity to do CRC checksum calculations. Every MP3 created by LAME has a Xing or LAME VBR header at the beginning. It contains index points to the MP3 as well as information about duration and gapless playback. At the end of this header, there are two CRC checksums, one for the MP3 bitstream and one for the header itself.

As the bitstream repacker used in SuperFast LAME changes the MP3's internal structure, an update of the Xing/LAME header's CRC values is necessary afterwards. I started with a simple implementation of the CRC16 algorithm that I wrote for the smooth Class Library. This created a small delay at the end of each conversion when the CRC for the MP3 file is updated. Not a big deal for the usually small MP3s weighing in at 3-4 MB. With larger files, however, like when converting a whole album to a single output file, it became painful. The CRC calculation added a delay of half a second for a 60 MB file on my i7 6900K system. On slower systems it would be much more.
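For reference, a simple bit-by-bit CRC16 looks roughly like this in Python. This is a generic sketch, not the smooth Class Library code. It assumes the common CRC-16 variant with the reflected polynomial 0xA001 and a start value of 0 (often called CRC-16/ARC); the exact parameters used for the Xing/LAME header are not spelled out in this post, so treat them as an assumption.

```python
def crc16_bitwise(data, crc=0):
    """Bit-by-bit CRC-16 (reflected polynomial 0xA001, assumed parameters)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit at a time, applying the polynomial when it is set.
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


check = crc16_bitwise(b"123456789")  # 0xBB3D, the usual check value of this variant
```

The inner loop runs eight times for every input byte, which is why this approach gets slow on large files.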

Steps to optimize the calculation

The first thing I tried was using compiler optimizations for the CRC routines (GCC's -O3 instead of -Os). This brought the delay down to about a quarter second. Still too much for my taste, though.

I then started looking for optimized CRC algorithms and found Matt Stancliff's crcspeed repository. It is based on an algorithm developed by Intel that uses additional lookup tables to enable processing of multiple input bytes in a single step. There are different variants of this algorithm in circulation, processing different numbers of bytes in each step, but it's generally called slicing-by-X (where X is usually 2, 4, 8 or 16).

I updated my CRC implementation to use the slicing algorithm and did some measurements. The slicing-by-8 variant turned out to be roughly 10 times faster than my original version and 5 times faster than the GCC -O3 compiled one. There was very little additional speedup when using slicing-by-12 (which I found to be the fastest) or slicing-by-16, so I decided to stick with slicing-by-8 as a good compromise between speed and memory requirements. Using the slicing-by-8 algorithm reduced the delay at the end of the 60 MB MP3 conversion to just a few tens of milliseconds.
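The structure of slicing-by-8 can be shown with a compact Python sketch. It is not taken from the patches or from crcspeed, and it again assumes the CRC-16 variant with the reflected polynomial 0xA001 and start value 0. Python will not show the speed gains of the C versions, so the point here is only how the eight tables are built and combined.

```python
def make_tables(slices=8, poly=0xA001):
    """Build the lookup tables: table k holds the CRC of a byte followed by k zero bytes."""
    base = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        base.append(crc)
    tables = [base]
    for _ in range(1, slices):
        prev = tables[-1]
        tables.append([(v >> 8) ^ base[v & 0xFF] for v in prev])
    return tables


TABLES = make_tables()


def crc16_bytewise(data, crc=0):
    """Classic table-driven CRC: one lookup per input byte."""
    t0 = TABLES[0]
    for byte in data:
        crc = (crc >> 8) ^ t0[(crc ^ byte) & 0xFF]
    return crc


def crc16_slice8(data, crc=0):
    """Slicing-by-8: consume eight input bytes per step using eight tables."""
    t = TABLES
    full = len(data) - len(data) % 8
    for i in range(0, full, 8):
        # The 16-bit CRC state only affects the first two bytes of each block.
        crc = (t[7][(crc ^ data[i]) & 0xFF] ^ t[6][(crc >> 8) ^ data[i + 1]] ^
               t[5][data[i + 2]] ^ t[4][data[i + 3]] ^ t[3][data[i + 4]] ^
               t[2][data[i + 5]] ^ t[1][data[i + 6]] ^ t[0][data[i + 7]])
    return crc16_bytewise(data[full:], crc)  # remaining tail, byte by byte


sample = bytes(range(256)) * 3 + b"abc"
same = crc16_slice8(sample) == crc16_bytewise(sample)
```

The eight lookups in one step are independent of each other, which is what allows a CPU to overlap them. The price is eight tables instead of one, which is the memory trade-off mentioned above.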

But I did not stop there...

Looking further

So, if I have to calculate CRC checksums for the Xing/LAME header, LAME itself will have to do the same. You just don't notice a delay, because the calculation is not done at once at the end, but spread over the whole encoding process. But does LAME use an optimized CRC implementation? As it turned out, no.

I updated the LAME CRC routines with the slicing-by-8 algorithm and got a speed-up of only 0.5%. Not much, but I wondered if other codecs (especially lossless ones that generate more data) might benefit more.

I looked further and found non-optimal CRC implementations in FLAC, Ogg (used for Opus, Vorbis and other codecs) and Monkey's Audio. Replacing them with the optimized algorithm yielded similar results to LAME for the lossy formats. The lossless formats, however, benefit more from the optimization and are sped up by about 5% due to more data being generated. When using Ogg FLAC, the speed-up is roughly 10% due to CRCs being calculated for both the FLAC audio frames and the Ogg container pages.

So we get up to 5% speed-up in the usual case and around 10% improvement for the Ogg FLAC format. All by simply replacing the CRC algorithm with an optimized version.

Technical considerations

The original Intel algorithm and Matt Stancliff's version require separate implementations for big-endian and little-endian CPUs. I converted the algorithm to an endian-independent form, i.e. only one variant for all processors. I did not measure any significant speed difference after making the code endian-independent when compiling with optimizations turned on.

It's possible to speed up the CRC calculations even more using other methods such as using the PCLMULQDQ instruction on modern x86 CPUs. However, that would make the code depend on that platform and probably provide only marginal additional speed gains.

My implementation uses static lookup tables for LAME, FLAC and Ogg. This blows up code size a bit and I would have preferred calculating the tables on the fly on first use. That's difficult to get right in a portable, thread safe way in plain C though, so it is used only for Monkey's Audio which is written in C++ (allowing dynamic initialization of static data).

Speed gains

Here are some numbers showing relative speed gain when encoding and decoding with different codecs (all used with default settings):

Codec	Encode	Decode
LAME	0.5%	-
Opus*	0.5%	1%
Vorbis*	0.5%	2%
Monkey's Audio	4%	-
FLAC	5%	5%
Ogg FLAC	10%	15%

* Opus and Vorbis themselves are not optimized, but use the optimized Ogg container library.

The patches

Here are my patches to update the mentioned codecs' CRC calculations to the optimized slicing algorithm:

LAME 3.100: lame-3.100-fastcrc.diff
FLAC 1.3.2: flac-1.3.2-fastcrc.diff
Ogg 1.3.3: libogg-1.3.3-fastcrc.diff
Monkey's Audio 4.33: mac-sdk-4.33-fastcrc.patch

Update: The Monkey's Audio patch has been integrated in the official Monkey's Audio 4.34 release.

Update 2: The Ogg and FLAC patches have been merged into the upstream repositories and will be part of the next official releases.

Here is a proof-of-concept FLAC build for Win64 for everyone to try out: flac-1.3.2-fastcrc-win64.zip

The patched codecs will be used in the next fre:ac 1.1 alpha release and I will contact the maintainers of these projects to request integration of the patches in official releases.

fre:ac development status update 03/2018

Written by Robert

Saturday, 31 March 2018 17:45

Hi and welcome to the March 2018 status update on fre:ac development.

The past month finally saw the 1.0.32 release and a new fre:ac 1.1 alpha providing lots of fixes and new features. Apart from that, I worked mostly on two things: Making the config dialogs for external codecs more beautiful and useful and implementing the repacker part of the upcoming SuperFast LAME encoder.

Improved config dialogs for external codecs

The configuration dialogs for external codecs are generated on the fly from a description provided by the external codec's XML script. Until now, that dialog was very small and displayed only one option at a time:

The old config dialog for external codecs

The next alpha will feature an improved configuration dialog generator. It will create dialogs that show all the options at the same time and provide more space for them:

The new config dialog for external codecs

Progress on SuperFast codecs

I spent a lot of time on writing the repacker part for the SuperFast LAME encoder in the past few weeks. A proof-of-concept was done quickly, but implementing all the edge cases turned out to be more difficult than I initially thought. Particularly, correct handling of the bit-reservoir was a lot of work. Nevertheless, it's working great now and I'm now in the testing stage.

The code for the SuperFast LAME component is now available in the SuperFast GitHub repository.

A few things are still missing, like support for CBR mode, CRC checksums and generating a valid Xing header, but I am confident that I can implement these things in the next few weeks.

I'm currently preparing a new SuperFast preview release based on fre:ac 1.1 alpha 20180306 which I plan to release in the next few days. As mentioned earlier, this will not include the SuperFast LAME encoder yet, but introduce SuperFast FDK-AAC and Speex along with some tweaks for the other codecs.

That's it for the moment. Make sure to come back in a month for another update.
