<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info"
     docName="draft-song-dmsc-promblem-and-requirements-00"
     ipr="trust200902">
  <front>
    <title
    abbrev="Service Mesh Problem Statement and DMSC Requirements">Problem
    Statements of Service Mesh Infrastructure and Requirements of DMSC</title>

    <author fullname="Enge Song" initials="E" surname="Song">
      <organization>Alibaba Cloud</organization>

      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology
          Park</street>

          <city>Beijing</city>

          <region/>

          <code>100124</code>

          <country>China</country>
        </postal>

        <email>enge.seg@alibaba-inc.com</email>
      </address>
    </author>

    <author fullname="Yang Song" initials="Y" surname="Song">
      <organization>Alibaba Cloud</organization>

      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology
          Park</street>

          <city>Beijing</city>

          <region/>

          <code>100124</code>

          <country>China</country>
        </postal>

        <email>song288954@alibaba-inc.com</email>
      </address>
    </author>

    <author fullname="Shaokai Zhang" initials="S" surname="Zhang">
      <organization>Alibaba Cloud</organization>

      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology
          Park</street>

          <city>Beijing</city>

          <region/>

          <code>100124</code>

          <country>China</country>
        </postal>

        <email>shaokai.zsk@alibaba-inc.com</email>
      </address>
    </author>

    <author fullname="Xing Li" initials="X" surname="Li">
      <organization>Alibaba Cloud</organization>

      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology
          Park</street>

          <city>Beijing</city>

          <region/>

          <code>100124</code>

          <country>China</country>
        </postal>

        <email>lixing.lix@aliyun-inc.com</email>
      </address>
    </author>

    <author fullname="Jiangu Zhao" initials="J" surname="Zhao">
      <organization>Alibaba Cloud</organization>

      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology
          Park</street>

          <city>Beijing</city>

          <region/>

          <code>100124</code>

          <country>China</country>
        </postal>

        <email>jiangu.zjg@alibaba-inc.com</email>
      </address>
    </author>

    <date day="6" month="January" year="2025"/>

    <abstract>
      <t>Service meshes, as an infrastructure component, are widely used by
      the major public cloud providers. Their main functions include policy
      routing, precise traffic allocation, and traffic throttling. Currently,
      the design and implementation of service meshes take a centralized
      control approach, which brings various challenges for current
      deployments and further development. This document analyzes the
      problems that exist in current service mesh implementations and
      provides requirements for the future distributed micro service
      communication (DMSC) infrastructure.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref target="RFC2119"/>
      <xref target="RFC8174"/> when, and only when, they appear in all
      capitals, as shown here.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Service meshes, as an infrastructure component, facilitate
      communication between services. Major public cloud providers such as
      AWS, Azure, GCP, and Alibaba Cloud have all introduced service
      mesh-based products to simplify the building and management of
      microservices-based applications. In many service mesh frameworks, a key
      component is the sidecar proxy, which is responsible for managing pod
      traffic and implementing functionalities such as policy routing, precise
      traffic allocation, and traffic throttling. By decoupling network
      functionalities into the sidecar, flexible traffic management can be
      achieved without altering the user business logic. However, deploying
      sidecars in production environments reveals certain performance
      bottlenecks, which have also been reported in other literature <xref
      target="Dissecting"/>.</t>

      <t>This document analyzes the problems that exist in current service
      mesh implementations and provides requirements for the future
      distributed micro service communication (DMSC) infrastructure.</t>
    </section>

    <section anchor="PS"
             title="Problem Statements of Current Service Mesh Infrastructure">
      <section anchor="high couple"
               title="Service Mesh is highly Coupled with User Service">
        <t>In the model where a sidecar (such as Istio Mesh <xref
        target="Istio"/>) is deployed within each pod, the sidecar is embedded
        within the user application's pod and is responsible for handling the
        communication tasks of the application. The sidecar coexists with the
        user application, sharing the pod's resources <xref
        target="SPRIGHT"/> <xref target="CanalMesh"/>. To ensure uninterrupted
        communication between applications and to avoid resource waste caused
        by isolated sidecars, both the sidecar and the application are
        designed to be created, destroyed, and scaled simultaneously, sharing
        the same life cycle. However, this design introduces stability and
        security issues; for example, memory leaks in the sidecar may lead to
        application crashes, and upgrading the sidecar requires restarting the
        pod, resulting in interruptions to application operation.</t>
      </section>

      <section anchor="performance overhead"
               title="Service Mesh Introduces Additional Performance Overhead">
        <t>Since traffic needs to be processed by the sidecar, the
        outgoing traffic from the user application is redirected to the
        sidecar (for example, using iptables), which introduces additional
        processing steps <xref target="SPRIGHT"/> <xref target="CanalMesh"/>.
        Specifically, at both the source and destination, the traffic
        redirection introduces two additional context switches, memory
        copying, and protocol stack processing overhead <xref
        target="SPRIGHT"/>. Furthermore, the sidecar is required to perform
        complex Layer 7 (L7) tasks, such as CPU-intensive TLS encryption and
        decryption operations, which may lead to further significant
        performance degradation.</t>
      </section>

      <section anchor="high resource consumption"
               title="Service Mesh Results in High Resource Consumption">
        <t>Since the sidecar is deployed within the user pod, it consumes
        resources that would otherwise be allocated to the user application
        [6]. For example, a customer with 500 nodes and 15,000 pods found that
        the sidecars consumed 1,500 CPU cores (10% of the total) and 5,000 GB
        of memory (10% of the total) <xref target="CanalMesh"/>. In extreme
        cases, the CPU and memory usage of the sidecar can even exceed that of
        the application itself due to the complex Layer 7 functionalities it
        provides. This issue has raised concerns among customers, as the pod
        resources they purchased are not fully utilized for running their
        applications. Additionally, measurement results indicate that to
        achieve optimal performance, it may even be necessary to
        over-provision resources for the sidecar.</t>
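
        <t>As a rough illustration of the arithmetic behind the example above,
        the following Python sketch derives the implied cluster totals and
        the average per-pod sidecar footprint from the reported figures. It
        assumes one sidecar per pod and an even distribution across pods,
        neither of which is stated in the cited measurement; the function and
        variable names are hypothetical.</t>

        <figure>
          <artwork><![CDATA[
```python
# Figures quoted in the text above.
PODS = 15_000
SIDECAR_CPU_CORES = 1_500
SIDECAR_MEMORY_GB = 5_000
SIDECAR_SHARE_PERCENT = 10  # sidecars use 10% of the cluster total


def implied_total(sidecar_usage, share_percent):
    """Cluster-wide total implied by the sidecar usage and its share."""
    return sidecar_usage * 100 / share_percent


def per_pod_average(sidecar_usage, pods):
    """Average sidecar usage per pod (assumes one sidecar per pod)."""
    return sidecar_usage / pods


total_cpu = implied_total(SIDECAR_CPU_CORES, SIDECAR_SHARE_PERCENT)
total_mem = implied_total(SIDECAR_MEMORY_GB, SIDECAR_SHARE_PERCENT)
cpu_per_pod = per_pod_average(SIDECAR_CPU_CORES, PODS)
mem_per_pod = per_pod_average(SIDECAR_MEMORY_GB, PODS)

print(f"implied cluster CPU:    {total_cpu:.0f} cores")
print(f"implied cluster memory: {total_mem:.0f} GB")
print(f"average sidecar CPU:    {cpu_per_pod:.2f} cores per pod")
print(f"average sidecar memory: {mem_per_pod:.2f} GB per pod")
```
]]></artwork>
        </figure>

        <t>Under these assumptions, each sidecar averages about 0.1 CPU core
        and roughly 0.33 GB of memory. These averages look small per pod but
        add up to the cluster-wide totals quoted above.</t>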
      </section>

      <section anchor="contorl plane overhead"
               title="Service Mesh Incurs Overhead in the Control Plane">
        <t>With the growing popularity of service meshes, an increasing number
        of customers are choosing to use them to deploy microservices, which
        has rapidly increased the number of sidecars that the control plane
        needs to manage. Sidecars can handle many types of configurations;
        however, orchestrating service dependency configurations for each
        sidecar individually is both time-consuming and error-prone, and any
        misconfiguration could potentially affect service continuity. To
        reduce complexity, a common practice is to distribute the same
        configuration set to all sidecars. This configuration set contains all
        possibly relevant configurations, ensuring that any pod can freely
        communicate with other pods as needed. However, pushing the complete
        configuration to all pods during each update significantly increases
        southbound bandwidth overhead, because an update that concerns a
        single sidecar must still be pushed to all sidecars, even those it
        does not affect.</t>

        <t>This overhead is most visible when a Kubernetes cluster spans
        regions or clouds (such as on-premises deployments or multi-site
        disaster recovery). Since cross-region/cross-cloud communication
        requires VPNs or dedicated lines, the communication costs are
        relatively high, and most customers opt for a conservative bandwidth
        purchasing strategy. As a result, the controller's configuration
        updates to geographically distributed sidecars can deplete the
        customer's cross-region/cross-cloud bandwidth, potentially resulting
        in delayed or lost configuration data.</t>
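
        <t>To make the scale of this overhead concrete, the following Python
        sketch compares the southbound volume of pushing the full
        configuration set to every sidecar with that of pushing only the
        changed portion to the sidecars that depend on it. This is a
        simplified model, not a measurement: the sidecar count, configuration
        size, number of dependent sidecars, delta size, and link capacity are
        all hypothetical inputs chosen for illustration, as are the function
        names.</t>

        <figure>
          <artwork><![CDATA[
```python
def full_push_bytes(num_sidecars, full_config_bytes):
    """Bytes sent southbound when every sidecar receives the full set."""
    return num_sidecars * full_config_bytes


def scoped_push_bytes(num_dependent_sidecars, delta_bytes):
    """Bytes sent when only dependent sidecars receive the changed part."""
    return num_dependent_sidecars * delta_bytes


def transfer_seconds(total_bytes, link_bits_per_second):
    """Time to move total_bytes over a link of the given capacity."""
    return total_bytes * 8 / link_bits_per_second


# Hypothetical inputs, chosen only for illustration.
NUM_SIDECARS = 15_000
FULL_CONFIG_BYTES = 2 * 1024 * 1024   # 2 MiB full configuration set
DEPENDENT_SIDECARS = 50               # sidecars affected by one change
DELTA_BYTES = 4 * 1024                # 4 KiB changed portion
LINK_BPS = 100_000_000                # 100 Mbit/s dedicated line

full = full_push_bytes(NUM_SIDECARS, FULL_CONFIG_BYTES)
scoped = scoped_push_bytes(DEPENDENT_SIDECARS, DELTA_BYTES)
full_time = transfer_seconds(full, LINK_BPS)
scoped_time = transfer_seconds(scoped, LINK_BPS)

print(f"full push:   {full} bytes, {full_time:.1f} s on the link")
print(f"scoped push: {scoped} bytes, {scoped_time:.3f} s on the link")
```
]]></artwork>
        </figure>

        <t>In this model the cost of a full push grows with the product of
        the number of sidecars and the size of the configuration set, which
        is why a constrained cross-region link becomes the bottleneck.</t>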
      </section>
    </section>

    <section anchor="DMSC Requirements"
             title="Requirements of Distributed Micro Services Communication (DMSC)">
      <section anchor="Non-instrusive"
               title="Non-intrusive Service Mesh for User Applications">
        <t>Current mainstream service mesh solutions like Istio and Ambient
        exhibit a high degree of intrusiveness toward user services. This is
        manifested in components such as sidecars that share the life cycle
        with pods (L4 + L7 proxies), L4 proxies that share resources with
        other pods within the same node, and L7 proxies that share resources
        across all nodes in the Kubernetes cluster. These components not only
        occupy resources that users allocate for their business operations but
        also introduce potential failure risks. Even Canal Mesh <xref
        target="CanalMesh"/> still retains lightweight proxies locally in
        order to preserve equivalent service mesh functionality. Therefore,
        there is a pressing need for service meshes to further reduce their
        intrusiveness to users, with the ultimate goal of achieving a
        completely non-intrusive service mesh.</t>
      </section>

      <section anchor="Reduced control plane overhead"
               title="Reduce Control Plane Overhead">
        <t>The control plane of the service mesh needs to handle tasks such as
        full configuration orchestration and mass sidecar configuration
        pushing. When the overhead is too high, it can lead to issues such as
        long delays before configurations take effect and excessive
        consumption of dedicated line bandwidth in cross-cloud or IDC
        deployments.
        Additionally, this overhead is directly proportional to the scale of
        the cluster, which severely hinders the scalable deployment of service
        meshes. Therefore, there is an urgent need to reduce the overhead of
        the service mesh control plane. One potential solution is the
        centralized mesh gateway configuration in Canal Mesh <xref
        target="CanalMesh"/>. Moreover, further optimizing the configuration
        orchestration and pushing methods (for example, transforming full
        pushes into incremental pushes) is also a potentially viable
        direction.</t>
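
        <t>The idea of turning full pushes into incremental pushes can be
        sketched as follows. This Python example is a generic illustration,
        not the mechanism of any particular service mesh: it assumes the
        configuration can be modeled as a mapping from a service name to its
        settings, and the function names, service names, and settings are
        hypothetical. The controller computes the difference between two
        configuration versions and sends only that difference, which the
        receiver applies to its current copy.</t>

        <figure>
          <artwork><![CDATA[
```python
_MISSING = object()  # sentinel: tells "key absent" apart from a None value


def config_delta(old, new):
    """Return (upserts, deletes) that turn the old mapping into the new one."""
    upserts = {key: value for key, value in new.items()
               if old.get(key, _MISSING) != value}
    deletes = sorted(key for key in old if key not in new)
    return upserts, deletes


def apply_delta(current, upserts, deletes):
    """Apply a delta to a copy of the receiver's current configuration."""
    updated = dict(current)
    updated.update(upserts)
    for key in deletes:
        updated.pop(key, None)
    return updated


# Hypothetical configuration versions, for illustration only.
old_config = {
    "svc-a": {"weight": 90},
    "svc-b": {"weight": 10},
    "svc-c": {"timeout_ms": 500},
}
new_config = {
    "svc-a": {"weight": 80},       # changed
    "svc-b": {"weight": 10},       # unchanged, so not sent
    "svc-d": {"timeout_ms": 200},  # added
}

upserts, deletes = config_delta(old_config, new_config)
rebuilt = apply_delta(old_config, upserts, deletes)

print("upserts:", upserts)
print("deletes:", deletes)
print("receiver matches new version:", rebuilt == new_config)
```
]]></artwork>
        </figure>

        <t>Only the changed, added, and removed entries cross the network, so
        the push size follows the size of the change rather than the size of
        the whole configuration set. A real deployment would also need
        versioning and a fallback to a full push when a receiver's state is
        unknown, which this sketch omits.</t>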
      </section>

      <section anchor="Improved data plane performance"
               title="Improve Data Plane Performance">
        <t>The service mesh takes over the user's advanced network
        communication needs by inserting proxy nodes into the user's
        communication path. While this provides the convenience of allowing
        users to focus solely on business development, redirecting traffic
        through the proxy inevitably affects the data plane transmission
        latency and throughput. Whether the service mesh proxies are located
        remotely in the cloud or retained locally in a limited capacity,
        improving the data plane performance of the service mesh is crucial.
        For example, leveraging SmartNICs to offload proxy functions can help
        reduce the performance degradation that deploying a service mesh may
        bring to user applications. This represents an important direction for
        evolution.</t>
      </section>

      <section anchor="beyond kubernete"
               title="Implement an Application Mesh that is Not Limited to Kubernetes">
        <t>Beyond Kubernetes users, many other business scenarios also seek
        to adopt the service mesh concept to reduce repetitive development
        for network communication needs. For example, AWS's VPC Lattice
        service unifies advanced network communication capabilities across
        various forms such as VMs, bare metal, and Kubernetes, providing a
        broader range of service mesh functionalities [1]. Some operators also
        hope to extend the concept of service mesh into the backbone network,
        offering advanced network features at a cloud and IDC granularity
        through routers <xref target="I-D.li-dmsc-architecture"/>. In summary,
        expanding the concept of service mesh beyond Kubernetes to achieve a
        more generalized application mesh is a potential research
        direction.</t>
      </section>
    </section>

    <section anchor="security considerations" title="Security Considerations">
      <t>This informational document introduces no additional security
      issues to the Internet.</t>
    </section>

    <section anchor="ack" title="Acknowledgement">
      <t>TBD</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>None</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.8174"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.I-D.li-dmsc-architecture"?>

      <reference anchor="Istio" target="">
        <front>
          <title>Istio: Up and running: Using a service mesh to connect,
          secure, control, and observe.</title>

          <seriesInfo name="O'Reilly Media" value="2019"/>

          <author fullname="Lee" initials="L." surname="Calcote">
            <organization/>
          </author>

          <author fullname="Zack" initials="Z." surname="Butcher">
            <organization/>
          </author>
        </front>
      </reference>

      <reference anchor="SPRIGHT" target="">
        <front>
          <title>Extracting the Server from Serverless Computing,
          High-performance eBPF-based Event-driven, Shared-memory
          Processing.</title>

          <seriesInfo name="ACM SIGCOMM" value="pages 780&ndash;794, 2022."/>

          <author fullname="Shixiong" initials="S." surname="Qi">
            <organization/>
          </author>

          <author fullname="Leslie" initials="L." surname="Monis">
            <organization/>
          </author>

          <author fullname="Ziteng" initials="Z." surname="Zeng">
            <organization/>
          </author>

          <author fullname="Ian-chin" initials="I." surname="Wang">
            <organization/>
          </author>

          <author fullname="KK" initials="K." surname="Ramakrishnan">
            <organization/>
          </author>
        </front>
      </reference>

      <reference anchor="CanalMesh" target="">
        <front>
          <title>Canal mesh: A cloud-scale sidecar-free multitenant service
          mesh architecture.</title>

          <seriesInfo name="ACM SIGCOMM 2024 Conference,"
                      value="860&ndash;875, 2024."/>

          <author fullname="Enge" initials="E." surname="Song">
            <organization/>
          </author>

          <author fullname="Yang" initials="Y." surname="Song">
            <organization/>
          </author>

          <author fullname="Chengyun" initials="C." surname="Lu">
            <organization/>
          </author>

          <author fullname="Tian" initials="T." surname="Pan">
            <organization/>
          </author>

          <author fullname="Shaokai" initials="S." surname="Zhang">
            <organization/>
          </author>

          <author fullname="Jianyuan" initials="J." surname="Lu">
            <organization/>
          </author>

          <author fullname="Jiangu" initials="J." surname="Zhao">
            <organization/>
          </author>

          <author fullname="Xining" initials="X." surname="Wang">
            <organization/>
          </author>

          <author fullname="Xiaomin" initials="X." surname="Wu">
            <organization/>
          </author>

          <author fullname="Minglan" initials="M." surname="Gao">
            <organization/>
          </author>
        </front>
      </reference>

      <reference anchor="Dissecting" target="">
        <front>
          <title>Dissecting overheads of service mesh sidecars.</title>

          <seriesInfo name="ACM SoCC" value="pages 142&ndash;157, 2023."/>

          <author fullname="Xiangfeng" initials="X." surname="Zhu">
            <organization/>
          </author>

          <author fullname="Guozhen" initials="G." surname="She">
            <organization/>
          </author>

          <author fullname="Bowen" initials="B." surname="Xue">
            <organization/>
          </author>

          <author fullname="Yu" initials="Y." surname="Zhang">
            <organization/>
          </author>

          <author fullname="Yongsu" initials="Y." surname="Zhang">
            <organization/>
          </author>

          <author fullname="Xuan Kelvin" initials="X." surname="Zou">
            <organization/>
          </author>

          <author fullname="XiongChun" initials="X." surname="Duan">
            <organization/>
          </author>

          <author fullname="Peng" initials="P." surname="He">
            <organization/>
          </author>

          <author fullname="Arvind" initials="A." surname="Krishnamurthy">
            <organization/>
          </author>

          <author fullname="Matthew" initials="M." surname="Lentz">
            <organization/>
          </author>
        </front>
      </reference>
    </references>
  </back>
</rfc>
