Base 128 Varints, explained
The process of converting data from one format to another is called encoding. Both the data and the format can vary: for any kind of data converted into any kind of format, the conversion is called encoding. Depending on the nature of the data, there are many familiar terms such as multimedia encoding and character encoding. Multimedia encoding refers to converting audio and video files between formats, e.g. mp3, avi, wav, etc.
Likewise, character encoding is very common and refers to the representation of character data in a machine-readable binary format, e.g. ASCII, US-ASCII, UTF-8. Although the term encoding refers to a verb or a process, it is sometimes used as a noun meaning the encoded form itself. People say mp3 encoding or mp4 encoding, meaning that the media file is encoded in that particular format.
In this article, we will discuss a particular encoding for integer values. We will look at how integer values can be represented in binary and, in particular, how efficiently this can be done.
Everyone has heard of the terms bytes and bits. A bit is the atomic unit that a computer can measure or understand. A computer reads a bit as either the value 1 or the value 0. A set of 8 bits is called a byte, where each bit can be 0 or 1, so there are a total of 2^8 = 256 combinations of these bits. Computers and programming languages are designed to read and write this collection of eight bits, the byte, together instead of reading and writing individual bits.
Each combination of these bits represents a decimal value. For example, 00000000 represents zero and 00000001 represents one. Wasn't that so complex 😄. But it gets trickier with other values, because the bit combination 10000001 represents the number 129. Understanding this representation is not that complex and can be demystified with a simple table in which each bit position represents a particular decimal value: we add up the values of all the positions whose bits are turned on.
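To make that table concrete, here is a small sketch (my own illustration, not part of the original table) that sums the place values of the set bits:

```python
def bits_to_decimal(bit_string):
    # The bit at position i (counting from the right) contributes 2 ** i when set
    total = 0
    for position, bit in enumerate(reversed(bit_string)):
        if bit == "1":
            total += 2 ** position
    return total

print(bits_to_decimal("00000001"))  # 1
print(bits_to_decimal("10000001"))  # 128 + 1 = 129
```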
Encoding large integers - The problem
With the help of the table above, it's easy to convert any decimal number to its binary representation and vice versa. But what about numbers greater than 255, the storage capacity of a single byte? You may have thought of using more bytes, and you are not wrong.
We can store larger integer values in, say, two bytes, which gives 2^16 = 65536 combinations. So the binary representation of 7896 is 0001111011011000, which is stored in two bytes: 00011110 and 11011000. A decimal value greater than 65535 can be stored in three bytes, and so on. There is no limit.
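A quick way to see the two-byte representation of 7896 is Python's built-in to_bytes (the variable names here are mine):

```python
number = 7896
# Two bytes, most significant byte first: 00011110 11011000
two_bytes = number.to_bytes(2, byteorder="big")
print([format(b, "08b") for b in two_bytes])  # ['00011110', '11011000']

# A value above 65535 no longer fits in two bytes and needs a third
bigger = 70000
print(bigger.to_bytes(3, byteorder="big").hex())
```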
But there is a problem, because bytes are read and written in sequence in memory. If, while reading, we find the byte 11011000, we cannot be sure whether it is the value 216 on its own or part of the two bytes that together encode 7896. There are different solutions to this problem. One is to predefine the size of the value you write and then read back exactly that number of bytes. This is the concept behind primitive types in programming languages: languages define different types, each with a fixed number of bytes, e.g. in C the short type takes 2 bytes and long takes 8 bytes. When we store and read these values, the compiler already knows their type and reads that many bytes in sequence.
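In Python, the struct module plays the same role as C's fixed-size types. A small sketch (the format codes '<h' and '<q' correspond to a 2-byte short and an 8-byte long long):

```python
import struct

# 'h' is a 2-byte short, 'q' is an 8-byte long long (little-endian here)
short_bytes = struct.pack("<h", 7896)
long_bytes = struct.pack("<q", 7896)
print(len(short_bytes), len(long_bytes))  # 2 8

# The reader must know the type in advance to consume the right number of bytes
(value,) = struct.unpack("<h", short_bytes)
print(value)  # 7896
```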
Another solution is to define a delimiter, an identifier where a particular byte sequence marks the boundary. This solution is mainly used for variable-length data types. For example, in C and other languages, string values are delimited by a null byte: when reading a string, the compiler keeps reading bytes until it finds a null byte.
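The null-byte delimiter can be sketched in a few lines of Python (the helper name read_c_string is my own):

```python
def read_c_string(data, offset=0):
    # Keep reading bytes until the null byte that marks the boundary
    end = offset
    while data[end] != 0:
        end += 1
    return data[offset:end].decode("ascii"), end + 1

buffer = b"hello\x00world\x00"
first, next_offset = read_c_string(buffer)
second, _ = read_c_string(buffer, next_offset)
print(first, second)  # hello world
```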
This solution could also be used for larger integers or variable-size integer encodings, but it would waste storage, since most computer operations involve a lot of calculations on numbers.
Base 128 Varints - The solution
To overcome the problem of representing integer values of variable length, the Base 128 Varint technique is used. It does not require a delimiting byte; instead, it uses a single bit for this purpose. Using this technique, we can save a lot of storage and still associate any number of bytes with an integer value.
This encoding technique is one of many variable-length encodings, and another common name associated with Base 128 Varints is LEB128 (Little Endian Base 128). This encoding is used in the DWARF and WebAssembly file formats and in Google Protobuf to represent integer values in the smallest possible number of bytes.
The entire encoding process of LEB128 is not that complex and consists of the following steps:
- Convert the number to its binary representation
- Organize the bits into groups of 7, keeping the 8th bit of each byte empty
- Set the empty 8th bit to 1 in every byte except the most significant one
- Fill the remaining bits of the most significant byte (MSB) with 0
- Reverse the byte order
- You now have a LEB128-encoded binary representation
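The encoding steps above can be sketched almost literally in Python. The function name is mine, and 300 serves as a small example (its varint form is the two bytes 0xAC 0x02):

```python
def leb128_encode_by_steps(number):
    # Step 1: binary representation
    bits = bin(number)[2:]
    # Step 2: groups of 7 bits, zero-padding the most significant group
    padded = bits.zfill(((len(bits) + 6) // 7) * 7)
    groups = [padded[i:i + 7] for i in range(0, len(padded), 7)]
    # Steps 3 and 4: the 8th bit is 1 everywhere except the most significant byte
    bytes_msb_first = ["0" + groups[0]] + ["1" + g for g in groups[1:]]
    # Step 5: reverse the byte order
    return [int(b, 2) for b in reversed(bytes_msb_first)]

print([hex(b) for b in leb128_encode_by_steps(300)])  # ['0xac', '0x2']
```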
When reading the sequence of bytes:
- Read a byte from the sequence of bytes
- If the 8th bit of this byte is set to 1, keep the byte in a buffer and repeat step 1
- When we read a byte whose 8th bit is set to 0, we know that all the bytes belonging to this integer value have been read
- Reverse the order of the bytes in the buffer
- Remove the 8th bit from each byte in the buffer
- Join the remaining 7 bits of each byte together
- And here we have our original integer value
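The decoding steps can be sketched the same way (the function name is mine; [0xAC, 0x02] is the varint form of 300 used above):

```python
def leb128_decode_by_steps(encoded):
    # Steps 1-3: collect bytes until one has its 8th bit set to 0
    buffer = []
    for byte in encoded:
        buffer.append(byte)
        if byte & 0b10000000 == 0:
            break
    # Step 4: reverse the byte order
    buffer.reverse()
    # Steps 5 and 6: drop the 8th bit and join the 7-bit groups
    bits = "".join(format(byte & 0b01111111, "07b") for byte in buffer)
    # Step 7: the joined bits are the original integer
    return int(bits, 2)

print(leb128_decode_by_steps([0xAC, 0x02]))  # 300
```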
Wasn't that simple? Let's take the example of the number 963412, which has the binary representation 11101011001101010100. Let's encode it with LEB128 using the steps above.
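Since the original diagram is not reproduced here, the intermediate values can be written out directly (the byte values below are my own computation from the steps):

```python
number = 963412
print(bin(number)[2:])       # 11101011001101010100
# Zero-padded 7-bit groups:    0111010 1100110 1010100
# 8th bit added (MSB gets 0): 00111010 11100110 11010100
# Byte order reversed:        11010100 11100110 00111010
encoded = bytes([0b11010100, 0b11100110, 0b00111010])
print(encoded.hex())         # d4e63a
```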
LEB128 for signed integers
Signed numbers are those for which we must also store whether the number is positive or negative. This information is stored in the most significant bit, while the binary representation itself is done as two's complement. So the normal binary representation of 12 in a byte is 00001100, but for -12 we take the two's complement, which is 11110100. This is how it is calculated:
- 12 : the decimal value
- 00001100 : binary representation
- 11110011 : one's complement (flip the bits)
- 11110100 : two's complement (add 1)
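The same calculation in Python, for a single byte (the helper name is my own):

```python
def twos_complement_byte(value):
    # One's complement: flip all 8 bits
    ones = ~value & 0b11111111
    # Two's complement: add 1
    return (ones + 1) & 0b11111111

print(format(12, "08b"))                        # 00001100
print(format(twos_complement_byte(12), "08b"))  # 11110100
```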
The rest of the LEB128 calculation stays the same, applied to this resulting binary (the two's complement of the corresponding positive number).
To decode it back to an integer, you must already know whether the value was encoded as signed or not. This is why programming languages treat signed and unsigned data types separately. You then decode normally and take the two's complement of the output, which undoes the transformation and gives you back the signed number.
If you want a single signed encoding algorithm that is generic over positive and negative values, you can get one by taking the two's complement of any value, whatever its sign.
Data encoding is a large field, and the topic of integer encoding alone has many dimensions. The Base 128 Varint is not the only technique for encoding integers. There are many more; a few popular ones are Fibonacci encoding, Elias gamma encoding, and HPACK.
If you know the distribution of the input and output values, you can decide which encoding is best for you. You can use an existing one, develop a new one yourself, or even optimize an existing one for your needs.
Here is an implementation of the LEB128 algorithm. Please keep in mind that there may be many different approaches to solving this problem or implementing this algorithm, so feel free to develop your own.
# unsigned_leb128_encode_decode.py
# Python 3.7.4

def encode_unsigned_leb128(number):
    result = bytearray()
    # As long as the number does not fit in 7 bits
    while number.bit_length() > 7:
        # Take the lowest 7 bits and set the 8th bit to 1
        single_byte = (number & 0b01111111) | 0b10000000
        # Add the byte to the result
        result.append(single_byte)
        # Drop the 7 bits we just encoded
        number = number >> 7
    # Add the remaining byte (8th bit stays 0) to the result
    result.append(number)
    # Bytes were appended least significant group first,
    # so there is no need to reverse them
    return result


def decode_unsigned_leb128(byte_array, offset=0):
    needle = offset
    pair_count = 0
    result = 0
    while True:
        single_byte = byte_array[needle]
        # If the 8th bit is 0, this is the last byte
        if single_byte & 0b10000000 == 0:
            break
        # Remove the 8th (continuation) bit
        single_byte = single_byte & 0b01111111
        # Shift past the bits we have already decoded
        single_byte = single_byte << (pair_count * 7)
        # Merge the byte into the result
        result = result | single_byte
        needle = needle + 1
        pair_count = pair_count + 1
    # Merge the last byte into the result
    single_byte = single_byte << (pair_count * 7)
    result = result | single_byte
    return result
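As a quick sanity check, here is a condensed rewrite of the same unsigned algorithm with a round-trip test (the shortened function names are mine; this is not part of the original listing):

```python
def encode(number):
    out = bytearray()
    while number > 0x7F:
        out.append((number & 0x7F) | 0x80)  # low 7 bits plus continuation bit
        number >>= 7
    out.append(number)  # final byte, continuation bit stays 0
    return out

def decode(data):
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            break
    return result

# Every value should survive an encode/decode round trip
for value in (0, 1, 127, 128, 300, 963412):
    assert decode(encode(value)) == value
print(encode(963412).hex())  # d4e63a
```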
# signed_leb128_encode_decode.py
# Python 3.7.4

from unsigned_leb128_encode_decode import decode_unsigned_leb128, encode_unsigned_leb128


def encode_signed_leb128(number):
    # Smallest multiple of 7 greater than the bit length
    bits_multiple_of_7 = ((number.bit_length() + 7) // 7) * 7
    twos_complement = (1 << bits_multiple_of_7) - number
    return encode_unsigned_leb128(twos_complement)


def decode_signed_leb128(byte_array, offset=0):
    number = decode_unsigned_leb128(byte_array, offset)
    # Largest multiple of 7 not greater than the bit length
    bits_multiple_of_7 = (number.bit_length() // 7) * 7
    twos_complement = (1 << bits_multiple_of_7) - number
    return twos_complement
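And a self-contained round-trip check of the signed scheme described above (the helpers are re-declared inline so the snippet runs on its own; the shortened names are mine):

```python
def encode_unsigned(number):
    out = bytearray()
    while number.bit_length() > 7:
        out.append((number & 0x7F) | 0x80)
        number >>= 7
    out.append(number)
    return out

def decode_unsigned(data):
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 == 0:
            break
    return result

def encode_signed(number):
    # Two's complement relative to the smallest multiple of 7 above the bit length
    bits = ((number.bit_length() + 7) // 7) * 7
    return encode_unsigned((1 << bits) - number)

def decode_signed(data):
    number = decode_unsigned(data)
    # Undo the two's complement using the largest multiple of 7 in the bit length
    bits = (number.bit_length() // 7) * 7
    return (1 << bits) - number

for value in (-963412, -128, -12, -1, 0, 12, 127, 963412):
    assert decode_signed(encode_signed(value)) == value
print("signed round trips OK")
```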
You can download the complete code from this link. Share your thoughts, and especially leave a comment if you find a problem with the code above.
Previously posted on https://basicdrift.com/explore-encoding-base-128-varints-41665a0dca36