Saturday, 12 August 2017

communication - Why are USB devices slower than 480 MBit/s




With a signaling rate of 480 MBit/s USB 2.0 devices should be able to transmit data with up to 60 MB/s. However today's devices seem to be limited to 30-42 MB/s while reading [Wiki:USB]. That is a 30 percent overhead.


USB 2.0 has been a de facto standard for external devices for over 10 years. One of the most important application for the USB interface from early on has been portable storage. Unfortunately USB 2.0 was quickly a speed limiting bottleneck to these bandwidth demanding applications, a today's HDD is for example capable of more than 90 MB/s in sequential read. Considering the long market presence and the constant need for higher bandwidth we should expect that the USB 2.0 eco system has been optimized over the years and reached a read performance that is close to the theoretical limit.


What is the theoretical maximum bandwidth in our case? Every protocol has overhead including USB and according to the official USB 2.0 standard it is 53.248 MB/s [2, Table 5-10]. That means theoretically today's USB 2.0 devices could be 25 percent faster.



To get anywhere near to the root of this problem the following analysis will demonstrate what is happening on the bus while reading sequential data from a storage device. The protocol is broken down layer by layer and we are especially interested in the question why 53.248 MB/s is maximum theoretical number for bulk upstream devices. Finally we will talk about the limits of the analysis which might give us some hints of additional overhead.


Notes


Throughout this question only decimal prefixes are used.


A USB 2.0 host is capable of handling multiple devices (via hubs) and multiple endpoints per device. Endpoints can operate in different transfer modes. We will limit our analysis to a single devices that is directly attached to the host and that is capable of continuously sending full packets over an upstream bulk endpoint in High-Speed mode.


Framing


USB high speed communication is synchronized in a fixed frame structure. Each frame is 125 us long and begins with an Start-Of-Frame packet (SOF) and is limited by and End-Of-Frame sequence (EOF). Each packet starts with SYNC and ends with and End-Of-Packet (EOF). Those sequences have been added to the diagrams for clarity. EOP is variable in size and packet data dependent, for SOF it is always 5 bytes.



enter image description here Open image in a new tab to see a larger version.


Transactions


USB is a master driven protocol and each transaction is initiated by the host. The timeslot between SOF and EOF can be used for USB transactions. However the timing for SOF and EOF is very strict and the host initiates only transactions that can be fully completed within the free timeslot.


The transaction we are interested in is a successful bulk IN transaction. The transaction starts with a tocken packet IN, then the hosts waits for a data packet DATA0/DATA1 and confirms the transmission with a handshake packet ACK. The EOP for all these packets is 1 to 8 bit depending on the packet data, we assumed the worst case here.


Between each of these three packets we have to consider waiting times. Those are between the last bit of the IN packet from the host and the first bit of the DATA0 packet of the device and between the last bit of the DATA0 packet and the first bit of the ACK packet. We don't have to consider any further delays as the host can start sending the next IN right after sending a ACK. The cable transmission time is defined to be maximum 18 ns.


A bulk transfer can send up to 512 bytes per IN transaction. And the host will try to issue as many transaction as possible in between the frame delimiters. Although bulk transfer has low priority it can take up all of the available time in a slot when there are no other transaction pending.


To ensure proper clock recovery the standards defines a method call bit stuffing. When the packet would require a very long sequence of the same output an additional flank is added. That ensures a flank after a maximum of 6 bits. In the worst case this would increase the total packet size by 7/6. The EOP is not subject to bit stuffing.


enter image description here Open image in a new tab to see a larger version.


Bandwidth Calculations


A bulk IN transaction has an overhead of 24 bytes and a payload of 512 bytes. That's a total of 536 bytes. The timeslot between is 7487 bytes wide. Without the need of bit stuffing there is space for 13.968 packets. Having 8000 Micro-Frames per second we can read data with 13 * 512 * 8000 B/s = 53.248 MB/s



For totally random data we expect that bit stuffing is necessary in one of 2**6 = 64 sequences of 6 consecutive bits. That's an increase of (63 * 6 + 7) / (64 * 6). Multiplying all bytes that are subject to bit stuffing by that numbers gives a total transaction length of (19 + 512) * (63 * 6 + 7) / (64 * 6) + 5 = 537.38 Bytes. Which results in 13.932 packets per Micro-Frame.


There is another special case missing from these calculations. The standard defines a maximum device response time of 192 bit times [2, Chapter 7.1.19.2]. This has to be considered when deciding if the last package still fits into the frame in case the device need the full response time. We could account for that by using a window of 7439 bytes. The resulting bandwidth though is identical.


What's left




  • Error detection and recovering has not been covered. Maybe error are frequent enough or the error recovering is time consuming enough to have an impact on the average performance.




  • We have assumed instant host and device reaction after packets and transaction. I personally don't see any need of big processing tasks at the end of packets or transactions on either side and therefore I can't think of any reason why the host or device should not be able to respond instantly with sufficiently optimized hardware implementations. Especially in normal operation most of the book keeping and error detection work could be done during transaction and the next packets and transaction could be queued up.





  • Transfers for other endpoints or additional communication has not been considered. Maybe the standard protocol for storage devices requires some continuous side channel communication that consumes valuable slot time.




  • There might be an additional protocol overhead for storage devices for the device driver or file system layer. (packet payload == storage data?)








  • Why are today's implementations not capable of streaming at 53 MB/s?




  • Where is the bottleneck in today's implementations?




And a potential follow up: Why has nobody tried to eliminate such a bottleneck?



[1] The official USB 2.0 specification


[2] Fast pdf mirror of the specification





No comments:

Post a Comment

arduino - Can I use TI's cc2541 BLE as micro controller to perform operations/ processing instead of ATmega328P AU to save cost?

I am using arduino pro mini (which contains Atmega328p AU ) along with cc2541(HM-10) to process and transfer data over BLE to smartphone. I...