I/O Consolidation in the Data Center

Name: I/O Consolidation in the Data Center
Author: Silvano Gai, Claudio DeSanti

85 highlights

Highlights & Annotations

The biggest challenge of I/O consolidation is to satisfy the requirements of different traffic classes with a single network.

Ref. B039-A

Typically these flows were not sensitive to latency, but this is changing rapidly, and latency now must be taken into serious consideration.

Ref. 837A-B

Because SCSI is extremely sensitive to packet drops, in FC losing frames is not an option. FC traffic is characterized by large frame sizes, to carry the typical 2KB SCSI payload.

Ref. 5018-C

Clusters do not care too much about the underlying network if it is cheap, it is high bandwidth, it is low latency, and the adapters provide zero-copy mechanisms.

Ref. 9F47-D

The real downside is that iSCSI is “SCSI over TCP,” it is not “FC over TCP,” and therefore it does not preserve the management and deployment model of FC. It still requires gateways, and it has a different naming scheme (perhaps a better one, but anyhow different), a different way of doing zoning, and so on.

Ref. 04FA-E

lossless networks and lossy networks.

Ref. 146B-F

This classification does not consider losses due to transmission errors that, in a controlled environment with limited distances like the Data Center, are rare in comparison to losses due to congestion.

Ref. 456E-G

Fibre Channel and InfiniBand are examples of lossless networks (i.e., they have a link level signaling mechanism to keep track of buffer availability at the other end of the link).

Ref. 778D-H

Historically Ethernet has been a lossy network, since Ethernet switches do not use any mechanism to signal to the sender that they are out of buffers.

Ref. 5176-I

A few years ago, IEEE 802.3 added a PAUSE mechanism to Ethernet.

Ref. 65DE-J

Avoiding frame drops is mandatory for carrying native storage traffic over Ethernet, since storage traffic does not tolerate frame drops. SCSI was designed with the assumption of running over a reliable transport in which failures are so rare that it is acceptable to recover slowly from them.

Ref. AA14-K

iSCSI is an alternative to Fibre Channel that solves the same problem by requiring TCP to recover from frame drops; however iSCSI has not been widely deployed in the Data Center.

Ref. 6575-L

In general, it is possible to say that lossless networks require fewer buffers in the switches than lossy networks.

Ref. 01CB-M

Cut-through switches have a lower latency at the cost of a more complex design, required to save the intermediate store-and-forward.

Ref. E95B-N

Cut-through switching is not possible if there are frames already queued for a given destination or if the speed of the egress link is higher than the speed of the ingress link (data underrun). Cut-through is typically not performed for multicast/broadcast frames.

Ref. F291-O
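
The conditions in the highlight above can be sketched as a single eligibility check. This is only an illustration: the function and parameter names are hypothetical, and it assumes that each listed condition is enough on its own to force store-and-forward.

```python
def can_cut_through(queued_frames: int, ingress_bps: float,
                    egress_bps: float, is_multicast: bool) -> bool:
    """Return True if a frame could be forwarded cut-through (sketch)."""
    # Frames already queued for this destination must be sent first.
    if queued_frames > 0:
        return False
    # A faster egress link would run out of data mid-frame (data underrun).
    if egress_bps > ingress_bps:
        return False
    # Multicast/broadcast frames are typically stored and forwarded.
    if is_multicast:
        return False
    return True


# Same-speed unicast with an empty queue is eligible.
eligible = can_cut_through(0, 10e9, 10e9, False)
# 1 Gb/s ingress feeding a 10 Gb/s egress is not.
underrun = can_cut_through(0, 1e9, 10e9, False)
```

A slower egress link is accepted here because the highlight only names the faster-egress case as a problem.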

Therefore, a proper implementation of the Ethernet PAUSE may transform an Ethernet network into a lossless network.

Ref. A2E6-P

A PAUSE frame is a standard Ethernet frame, not tagged. (For example, the pausing does not apply per VLAN or per Priority, but to the whole link.) The format of a PAUSE frame is shown in

ok, important difference from PFC (used with FCoE): PAUSE applies to the whole link, not per Priority

Ref. FC48-Q

MAC Control Frames that are identified by Ethertype = 0x8808.

Ref. 0CB0-R

The only significant field is the Pause_Time that contains the time the link needs to remain paused, expressed in Pause Quanta (512 bit times). If the link needs to remain paused for a long time, it is customary to refresh the pause by sending periodic PAUSE frames. It is also possible to send a PAUSE frame with Pause_Time = 0 to unpause the link (i.e., restart transmission, without waiting for the timer to expire).

how pause frames work

Ref. 7207-S
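
Since Pause_Time is expressed in quanta of 512 bit times, the real pause duration depends on the link speed. The sketch below converts a Pause_Time value into seconds. The function name is hypothetical, and the 16-bit width of the field is an assumption taken from the standard frame format, not from the highlight.

```python
PAUSE_QUANTUM_BITS = 512  # one Pause Quantum = 512 bit times


def pause_duration_seconds(pause_time: int, link_bps: float) -> float:
    """Convert a Pause_Time value (in Pause Quanta) to seconds."""
    # Assumption: Pause_Time is a 16-bit unsigned field.
    if not 0 <= pause_time <= 0xFFFF:
        raise ValueError("Pause_Time must fit in 16 bits")
    return pause_time * PAUSE_QUANTUM_BITS / link_bps


# Longest pause a single frame can request on a 10 Gb/s link.
max_pause_10g = pause_duration_seconds(0xFFFF, 10e9)
# Pause_Time = 0 unpauses the link immediately.
unpause = pause_duration_seconds(0, 10e9)
```

On a 10 Gb/s link the maximum works out to roughly 3.4 ms, which is why a long pause has to be refreshed with periodic PAUSE frames.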

With PAUSE (see Figure 2-1) Switch A does not keep track of the buffers available in Switch B, and it assumes by default that buffers are available, unless told the contrary by switch B with a PAUSE frame.

Ref. B17C-T
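
The contrast between the two styles of flow control can be sketched with two toy senders. The class and method names are hypothetical. The credit sender assumes the receiver returns one credit per freed buffer (R_RDY in Fibre Channel terms, which is not named in the highlights), and the PAUSE sender ignores timer expiry for brevity.

```python
class CreditSender:
    """Sends only while it holds credits for buffers at the far end."""

    def __init__(self, initial_credits: int) -> None:
        self.credits = initial_credits

    def try_send(self) -> bool:
        if self.credits == 0:
            return False  # no known free buffer: hold the frame
        self.credits -= 1
        return True

    def on_credit_returned(self) -> None:
        self.credits += 1  # receiver freed one buffer


class PauseSender:
    """Assumes buffers are available unless told otherwise by PAUSE."""

    def __init__(self) -> None:
        self.paused = False

    def try_send(self) -> bool:
        return not self.paused

    def on_pause_frame(self, pause_time: int) -> None:
        # Pause_Time = 0 unpauses; any other value pauses the link.
        self.paused = pause_time > 0
```

The credit sender stops by itself when credits run out, while the PAUSE sender keeps transmitting until the receiver explicitly tells it to stop.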

PAUSE is a regular Ethernet frame and therefore it uses slightly more bandwidth than the credit mechanism, negligible in comparison to the data transferred.

Ref. 65D7-U

It should be noticed that for I/O consolidation PFC (see page 20) is superior to the basic FC credits mechanism, since FC credits apply to the whole link and not per Priority.

why is PFC superior to the B2B credit mechanism

Ref. E9D1-V

FC credits have been extended with a new ordered set called VC_RDY (Virtual Circuit Ready) to provide a similar functionality.

what is VC_RDY

Ref. AD19-W

PAUSE, PFC, and Credits are all hop-by-hop mechanisms. (That is, they apply to a specific link.) They do not automatically propagate to other links in the network.

hop by hop ?

Ref. 64C3-X

The goal of these three mechanisms is to suspend the transmission of frames so that the receiver is not forced to drop frames if it cannot forward them due to congestion.

what is the goal of PAUSE, PFC, and B2B credits

Ref. 4C86-Y

Therefore there is not a direct propagation of the PAUSE mechanism in the network, but there may be an indirect propagation: PAUSE is generated–queue gets full–PAUSE is generated–queue gets full–and so on.

important

Ref. 2789-Z
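
The indirect propagation described above can be made concrete with a toy simulation of a chain of switches. Everything here is an assumption made for illustration: one frame moves per tick, each switch has a fixed queue capacity, the last switch cannot drain at all, and a switch pauses its upstream neighbor as soon as its queue is full.

```python
def simulate_pause_spread(num_switches: int = 3, capacity: int = 4,
                          ticks: int = 40) -> list:
    """Return the tick at which each switch first pauses its upstream link."""
    queues = [0] * num_switches
    first_pause_tick = [None] * num_switches
    for tick in range(ticks):
        # A switch with a full queue pauses the link feeding it.
        paused = [depth >= capacity for depth in queues]
        for i, is_paused in enumerate(paused):
            if is_paused and first_pause_tick[i] is None:
                first_pause_tick[i] = tick
        # Move frames one hop, starting from the downstream end.
        # The last switch never drains: it is the congestion point.
        for i in range(num_switches - 2, -1, -1):
            if queues[i] > 0 and not paused[i + 1]:
                queues[i] -= 1
                queues[i + 1] += 1
        # The source offers one frame per tick unless switch 0 paused it.
        if not paused[0]:
            queues[0] += 1
    return first_pause_tick


pause_ticks = simulate_pause_spread()
```

In this model the congested switch pauses first and each upstream switch follows only after its own queue fills, so no single PAUSE frame travels more than one hop and no queue ever exceeds its capacity.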

On the plus side, frames are never dropped, and therefore higher-level protocols have a reduced amount of work to do.

Ref. 4F9F-B

This is particularly important for protocols like SCSI that are not good at error recovery.

Ref. FA34-C

Lossless is therefore important for transporting FC over Ethernet.

Ref. 5087-D

Other application level protocols may take advantage of a lossless behavior, for example Network File System (NFS).

which other applications can take advantage of lossless behavior

Ref. 30EF-E

Therefore, very short TCP flows have been shown to work better on lossless Ethernet.

Ref. D244-F

head of line (HOL) blocking

Ref. 2765-G

Why PAUSE Is Not Widely Deployed

Ref. 10FD-H

PAUSE applies to the whole link. (That is, it is a single mechanism for all traffic classes.) Often different traffic classes have incompatible requirements (e.g., some need a lossy behavior, others need a lossless behavior), and this may cause “traffic interference.” For example, with the PAUSE mechanism, storage traffic could be paused due to congestion on IP traffic. This is clearly undesirable and needs to be fixed.

why PAUSE frames are a problem

Ref. 469C-I

Priority-based Flow Control (PFC, also known as Per Priority Pause or PPP) [6], is a finer grain flow control mechanism.

Ref. 5244-J

PFC enables PAUSE functionality on a per-Priority basis.

Ref. DC28-K
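
The "traffic interference" problem and the way PFC avoids it can be sketched with a small model of a link's pause state. The class is hypothetical, and the mapping of IP traffic to priority 0 and storage traffic to priority 3 is an arbitrary assumption for the example.

```python
class Link:
    """Tracks which of the eight Ethernet priorities are paused."""

    NUM_PRIORITIES = 8

    def __init__(self) -> None:
        self.paused = [False] * self.NUM_PRIORITIES

    def pause_link(self) -> None:
        # PAUSE: a single mechanism for all traffic classes.
        self.paused = [True] * self.NUM_PRIORITIES

    def pause_priority(self, priority: int) -> None:
        # PFC: only the named priority stops transmitting.
        self.paused[priority] = True

    def can_send(self, priority: int) -> bool:
        return not self.paused[priority]


IP_PRIORITY, STORAGE_PRIORITY = 0, 3  # hypothetical assignment

with_pause = Link()
with_pause.pause_link()                 # congestion caused by IP traffic
storage_blocked = not with_pause.can_send(STORAGE_PRIORITY)

with_pfc = Link()
with_pfc.pause_priority(IP_PRIORITY)    # same congestion, handled by PFC
storage_flows = with_pfc.can_send(STORAGE_PRIORITY)
```

With PAUSE the storage priority is blocked as a side effect, while with PFC it keeps flowing.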

The Ethertype = 0x8808 is the same as for PAUSE (MAC Control Frame), but the Opcode = 0x0101 is different.

difference between PFC frames and PAUSE frames

Ref. 4016-L
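
To make the frame-level difference visible, the sketch below builds both frame types with Python's `struct` module. Only the Ethertype 0x8808 and the PFC Opcode 0x0101 come from the highlights. The PAUSE Opcode 0x0001, the destination address 01-80-C2-00-00-01, the priority-enable vector followed by eight timers, and the padding to 60 bytes (without FCS) are details assumed from the IEEE standards.

```python
import struct

MAC_CONTROL_ETHERTYPE = 0x8808                 # shared by PAUSE and PFC
PAUSE_OPCODE = 0x0001                          # assumption: from IEEE 802.3
PFC_OPCODE = 0x0101
MAC_CONTROL_DST = bytes.fromhex("0180c2000001")  # assumption: reserved address
MIN_FRAME_NO_FCS = 60                          # minimum frame size minus FCS


def _mac_control_frame(src_mac: bytes, body: bytes) -> bytes:
    header = MAC_CONTROL_DST + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    return (header + body).ljust(MIN_FRAME_NO_FCS, b"\x00")


def build_pause(src_mac: bytes, pause_time: int) -> bytes:
    """PAUSE: one Pause_Time for the whole link."""
    return _mac_control_frame(src_mac, struct.pack("!HH", PAUSE_OPCODE, pause_time))


def build_pfc(src_mac: bytes, pause_times: dict) -> bytes:
    """PFC: an enable bit and a timer for each of the eight priorities."""
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in pause_times.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be 0..7")
        enable_vector |= 1 << priority
        timers[priority] = quanta
    body = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    return _mac_control_frame(src_mac, body)


src = bytes.fromhex("020000000001")            # hypothetical source MAC
pause_frame = build_pause(src, 0xFFFF)
pfc_frame = build_pfc(src, {3: 0xFFFF})        # pause priority 3 only
```

Bytes 12 to 13 (the Ethertype) are identical in both frames, while bytes 14 to 15 (the Opcode) differ.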

DCBX is the management protocol of Data Center Bridging, defined in the IEEE 802.1Qaz project [7].

Ref. 3500-M

DCBX is an extension of Link Layer Discovery Protocol (LLDP, see IEEE 802.1AB-2005).

Ref. 2E30-N

LLDP is a vendor-neutral Layer 2 protocol that allows a network device to advertise its identity and its capabilities on the local network.

Ref. 4D3C-O

DCBX discovers the capabilities of the two peers at the two ends of a link and can check that they are consistent.

Ref. 1382-P

DCBX can notify the device manager in the case of configuration mismatches and can provide basic configuration if one of the two peers is not configured.

Ref. AD26-Q

DCBX-capable links exchange DCB capabilities,

Ref. 0779-R

conflict alarms are sent to the appropriate management stations.

Ref. 7358-S

With this structure it is possible to assign bandwidth to each Priority Group (e.g., 40% LAN, 40% SAN, and 20% IPC).

Ref. D733-T
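
As a worked example of the percentages above, the sketch below turns Priority Group shares into bandwidth figures for one link. The function name and the 10 Gb/s link speed are assumptions, and the sketch only computes each group's assigned share; it says nothing about how unused bandwidth is handled.

```python
def assigned_bandwidth_gbps(link_gbps: float, shares_percent: dict) -> dict:
    """Split a link's bandwidth among Priority Groups by percentage."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must add up to 100%")
    return {group: link_gbps * percent / 100
            for group, percent in shares_percent.items()}


# The 40/40/20 split from the highlight on an assumed 10 Gb/s link.
allocation = assigned_bandwidth_gbps(10, {"LAN": 40, "SAN": 40, "IPC": 20})
```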

Latency is becoming increasingly important in Data Centers, especially for IPC applications

Ref. DDB1-U

One of the downsides of lossless Ethernet discussed on page 19 is that, in the presence of congestion, it tends to create undesirable Head Of Line (HOL) blocking.

downside of lossless ethernet

Ref. 399D-V

This is because it spreads the congestion across the network.

Ref. 75A7-W

The algorithms that have been considered are Backward Congestion Notification (BCN) and Quantized Congestion Notification (QCN), with QCN being standardized. They are similar, and they act as shown in

how do we deal with congestion that spreads hop by hop

Ref. BAC8-X

The main difference between this kind of signaling and PAUSE is that PAUSE is hop-by-hop (see page 18), while these congestion notification messages propagate all the way toward the source of the congestion (see Figure 2-12).

Ref. AF30-Y

One of the primary design goals of the STP is to eliminate all loops from the network topology. This is because Ethernet frames do not contain any “time to live” data in the frame header, making it theoretically possible for frames to circulate forever in the case of topologies with loops or multiple links between devices.

Ref. 9F2A-Z

However, in the case of destination addresses that have not yet been learned, frames are flooded on all ports (except the receiving port). In absence of the STP, this flooding process can create data storms in looped network environments, as frames are replicated and flooded by each bridge or switch in the network.

Ref. 35C6-A

This is clearly undesirable, since it creates congestion near the root and it limits the “bisectional bandwidth” of the network.

Ref. BB3F-B

Four alternative models solve the important subproblem of how to provide active-active connectivity from an access switch to two core switches. They are:
• Etherchannel, see page 32
• VSS (Virtual Switching System), see page 32
• vPC (virtual Port Channel), see page 34
• Ethernet Host Virtualizer, see page 36

Ref. 2BCC-C

Etherchannel allows aggregating several physical Ethernet links to create one logical Ethernet link with a bandwidth equal to the sum of the bandwidths of the links being aggregated.

Ref. 3D76-D

Etherchannel can aggregate from two to eight links and all higher-level protocols see these multiple links as a single connection (see Figure 2-18).

Ref. C8A9-E

A limitation of Etherchannel is that all the physical ports in the aggregation group must reside on the same switch.

the limitation of classic Etherchannel

Ref. 8696-F

(VSS, vPC, and Ethernet Host Virtualizer) were developed, and they are collectively illustrated in Figure 2-19.

Ref. A4E6-G

VSS is the first of two Cisco technologies that allow using Etherchannel from an access switch to two distribution switches,

Ref. 23A5-H

VSS accomplishes this by clustering two physical chassis together into a single, logical entity.

Ref. 58BA-I

The individual chassis become indistinguishable, and therefore the access switch believes the upstream switches to be a single distribution switch, as in the case of Figure 2-18. VSS also has many other advantages, since it improves high availability, scalability, management, and maintenance.

Ref. A1D1-J

The key enabler of the Cisco VSS technology is a special link called Virtual Switch Link (VSL), which binds the two chassis together and passes special control information.

Ref. 05C4-K

A VSL link is a connection between the two internal fabrics of the two switches to combine them into a single logical network entity and make them indistinguishable from an external observer.

Ref. 95BF-L

Therefore, access switches may use multiple uplinks toward the two switches and configure them as regular Etherchannel, since the VSS appears as a single, logical switch or router.

Ref. ACA2-M

Within the Cisco VSS, both switches are active from a data plane perspective, but from the control and management plane perspective, only one switch is designated as active; the other is designated as hot-standby, much as in the case of a dual supervisor switch (e.g., the Catalyst 6500).

data plane vs control plane view

Ref. 3D1A-N

All control-plane functions, including management (SNMP, Telnet, SSH), Layer 2 protocols (STP, LACP, and so on), and Layer 3 routing protocols are centrally managed by the active switch that is also responsible for programming the hardware forwarding information of both switches.

Ref. 63DB-O

vPC (also known as Multi-Chassis Etherchannel/virtual Port Channel, MCEC/vPC)

Ref. 0BE1-P

Each vPC switch maintains its identity and its management and control planes, but it cooperates with the other vPC switch to provide active-active virtual Port Channel. It is not a goal of vPC to present the two vPC switches as a single one, as in the case of VSS.

Ref. 50E8-Q

The key challenge in vPC, present also in VSS, is to deliver each frame exactly once, avoiding frame duplication and loops.

Ref. E837-R

VSS and vPC are techniques implemented on the distribution switches to allow the access switches to keep using Etherchannel in a traditional manner. The same problem can be solved on the access switches by a technique called Ethernet Host Virtualizer.

Ref. B735-S

A switch running Ethernet Host Virtualizer divides its ports into two groups: host ports and network ports. Both types of ports can be a single interface or an Etherchannel. The switch then associates each host port with a network port. This process is called pinning.

Ref. AD48-T

The same host port always uses the same network port, unless it fails. In this case, the access switch moves the pinning to another network port.

Ref. 6AE2-U

Particular attention must be paid to multicast and broadcast frames to avoid loops and frame duplications. Typically, access switches that implement this feature act as follows:
• They never retransmit a frame received from a network port to another network port.
• They divide the multicast/broadcast traffic according to multicast groups, and they assign each multicast group to a single network port. Only one network port may transmit and receive a multicast group.

Ref. 365E-V
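
The pinning behavior and the two forwarding rules from the last few highlights can be sketched as a small class. This is an illustration only: the class, the port names, the round-robin pinning policy, and the CRC-based choice of a multicast port are all assumptions, since the highlights do not say how ports are selected.

```python
import zlib


class HostVirtualizer:
    """Toy model of host-port pinning on an access switch."""

    def __init__(self, host_ports: list, network_ports: list) -> None:
        self.network_ports = list(network_ports)
        self.up = set(network_ports)
        # Assumed policy: pin host ports to network ports round-robin.
        self.pinning = {
            host: self.network_ports[i % len(self.network_ports)]
            for i, host in enumerate(host_ports)
        }

    def _survivors(self) -> list:
        ports = [p for p in self.network_ports if p in self.up]
        if not ports:
            raise RuntimeError("no network port is up")
        return ports

    def fail_network_port(self, port: str) -> None:
        """Move the pinning of affected host ports to surviving ports."""
        self.up.discard(port)
        survivors = self._survivors()
        moved = [h for h, n in self.pinning.items() if n == port]
        for i, host in enumerate(moved):
            self.pinning[host] = survivors[i % len(survivors)]

    def may_forward(self, in_port: str, out_port: str) -> bool:
        # Never retransmit from one network port to another network port.
        return not (in_port in self.network_ports
                    and out_port in self.network_ports)

    def multicast_port(self, group: str) -> str:
        # Each multicast group is assigned to exactly one network port.
        survivors = self._survivors()
        return survivors[zlib.crc32(group.encode()) % len(survivors)]


switch = HostVirtualizer(["h1", "h2", "h3"], ["n1", "n2"])
before = dict(switch.pinning)
switch.fail_network_port("n1")
after = dict(switch.pinning)
```

Host ports keep their network port until it fails, at which point only the affected host ports are re-pinned.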

An additional advantage of multipath is reduced latency, since the shortest path is always used in forwarding frames, and a lower number of hops normally equates to reduced latency. In addition, a less loaded path can be used to forward delay-sensitive frames.

Ref. DA96-W

By making all the switches addressable devices, all L2MP solutions can utilize a link state protocol to compute the topology of the network (i.e., how the switches are interconnected), and from that topology they can compute a switch forwarding database.

Ref. 3888-X
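
Once every switch knows the full topology, computing a forwarding database is a shortest-path problem. The sketch below is a simplified stand-in for a real link state computation: it assumes every link has the same cost, uses a breadth-first search, and keeps all equal-cost next hops per destination switch. The four-switch topology and the function name are hypothetical.

```python
from collections import deque


def forwarding_database(topology: dict, source: str) -> dict:
    """Map each destination switch to its set of shortest-path next hops."""
    distance = {source: 0}
    next_hops = {source: set()}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            # First hop used when reaching `neighbor` through `node`.
            via = {neighbor} if node == source else next_hops[node]
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                next_hops[neighbor] = set(via)
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                # Another path of equal length: keep it for multipath.
                next_hops[neighbor] |= via
    del next_hops[source]
    return next_hops


# Hypothetical topology: A connects to B and C, which both connect to D.
topology = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
fdb_a = forwarding_database(topology, "A")
```

Switch A ends up with two usable next hops toward D, which is the active-active behavior that a single spanning tree would not allow.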

This is similar to what is already implemented on Fibre Channel networks, where the FC switches are addressable and run FSPF to compute the topology of the network. In FC, the switch address is embedded in the FC_ID, the Fibre Channel address.

Ref. D82E-Y

This is not the case in Ethernet. Switch addresses need to be carried in the Ethernet frames, but so do the end-station addresses. For this reason an extra header is added in the L2MP cloud.

Ref. D3F4-Z

The idea behind FCoE is extremely simple: implement I/O consolidation by carrying each Fibre Channel frame inside an Ethernet frame.

Ref. 2200-A

The encapsulation is performed on a frame-by-frame basis; therefore, it is completely stateless and requires neither fragmentation nor reassembly.

Ref. 7022-B

backbones are still maintained for LAN and SAN. FCoE allows Fibre Channel traffic to share Ethernet links with other traffic, as shown in Figure

Ref. F347-C

FCoE poses some requirements on the underlying Ethernet network, the most important one being lossless.

Ref. 6A1C-D

Lossless can be simply achieved by the PAUSE mechanism

Ref. 7CF1-E

More realistically, in an I/O consolidation environment, PFC is used (see page 20) and additional protocols like DCBX (see page 22) facilitate deployment. FCoE also requires support of jumbo frames, since FC frames are not fragmented (see page 80).

Ref. 3796-F

The Data Field is of variable size, ranging from 0 to 2112 bytes.

Ref. B0A6-G
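
A quick calculation shows why a full-size FC frame cannot ride in a standard Ethernet frame without fragmentation. Only the 0 to 2112 byte Data Field comes from the highlights. The 24-byte FC header, the 4-byte CRC, and the 1500-byte standard Ethernet payload are assumptions taken from the respective standards, and the sketch leaves out the SOF/EOF delimiters and the FCoE encapsulation header, so the real encapsulated size is somewhat larger.

```python
FC_HEADER_BYTES = 24              # assumption: FC frame header size
FC_CRC_BYTES = 4                  # assumption: FC CRC size
FC_MAX_DATA_FIELD = 2112          # from the highlight
STANDARD_ETHERNET_PAYLOAD = 1500  # assumption: default Ethernet payload limit


def fc_frame_bytes(data_field_len: int) -> int:
    """Size of an FC frame (header + Data Field + CRC), delimiters excluded."""
    if not 0 <= data_field_len <= FC_MAX_DATA_FIELD:
        raise ValueError("Data Field must be 0..2112 bytes")
    return FC_HEADER_BYTES + data_field_len + FC_CRC_BYTES


def fits_standard_ethernet(data_field_len: int) -> bool:
    return fc_frame_bytes(data_field_len) <= STANDARD_ETHERNET_PAYLOAD


largest = fc_frame_bytes(FC_MAX_DATA_FIELD)
needs_jumbo = not fits_standard_ethernet(FC_MAX_DATA_FIELD)
```

Even before any encapsulation overhead is added, the largest FC frame is well above 1500 bytes, which matches the highlight's point that FCoE needs jumbo frame support.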