风筝
Posted: 2015-9-18 22:05:30

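A closer look at the configurable memory of Cortex-M7 microcontrollers

Tightly coupled memory (TCM) is a standout feature of the Cortex-M7 family. It raises microcontroller performance by giving the CPU single-cycle access and by guaranteeing service for high-priority, low-latency requests from peripherals.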

cortex-m7-chip-diagramlg.png

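The first microcontrollers based on ARM's M7 embedded processor core, such as Atmel's SAM E70 and S70 chips, have reached the market. It is therefore worth taking a closer look at the configurable memory of M7 microcontrollers to see how TCM makes deterministic code execution possible and lets the processor transfer data at full speed.

Here are some key findings about the advanced memory architecture of Cortex-M7 microcontrollers: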


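1. TCM is Configurable

First, the size of the TCM is configurable. TCM is part of the physical memory map of the microcontroller (MCU), and up to 16MB of tightly coupled memory is supported. The configurability of the ARM Cortex-M7 core also lets system-on-chip (SoC) designers integrate caches of various sizes. Product developers in the industrial and Internet of Things (IoT) fields can then decide how much critical code and real-time data to keep in TCM to meet the needs of the target application.

The Cortex-M7 architecture does not specify what type of memory, or how much, should be provided. That decision is left to the designers who use the M7 in a microcontroller, which also gives them an avenue for product differentiation. A flexible memory system can therefore be optimized for performance, cycle determinism and low latency to suit a specific application.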


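2. Instruction TCM

Instruction TCM (ITCM) provides deterministic execution of critical code in real-time processing applications such as audio encoding/decoding, audio processing and motor control. Standard memory introduces delays from cache misses and interrupts, which undermines the deterministic timing that real-time response and seamless audio/video operation require.

Deterministic, critical software routines should be loaded into ITCM, which sits behind a 64-bit instruction memory port that supports the dual-issue processor architecture and gives the CPU single-cycle access to raise MCU performance. Developers do, however, need to work out carefully how much code requires zero-wait-state execution in order to determine how much ITCM an MCU device needs.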

the-anatomy-of-tcm-inside-the-m7-architecture1.gif

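The anatomy of TCM inside the M7 architecture.


3. Data TCM

Data TCM (DTCM) is used for tasks that need fast data processing, such as 2D barcode decoding and fingerprint and voice recognition. Two data ports (DTCMs) provide simultaneous, parallel 32-bit accesses to real-time data. Instruction TCM and data TCM, when both are used for efficient access to on-chip flash and external resources, must be the same size.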


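4. System RAM and TCM

System RAM, also called general-purpose RAM, is used for communication stacks related to networking, fieldbus, high-bandwidth bridging, USB and so on. It buffers peripheral data, generally through direct memory access (DMA) engines, and bus masters can access it directly without CPU intervention.

Here product developers must keep in mind the memory access conflicts that arise when the CPU and the DMA transfer data at the same time. Developers must therefore set clear priorities for latency-critical requests from peripherals and plan latency-critical data transfers carefully, such as the transfer of a USB descriptor, or give a low-data-rate peripheral a small local buffer. Accesses from the DMA and the caches are generally bursts to consecutive addresses, which optimizes system performance.

It is worth noting that although system memory is logically separate from TCM, microcontroller suppliers such as Atmel have integrated TCM and system RAM in a single SRAM block. This lets IoT developers share the memory for general-purpose tasks while still separating the TCM and system RAM functions for specific use cases.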

a-single-sram-block-for-tcm-and-system-memory-allows-higher-flexibility-and-util.jpg

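A single SRAM block for TCM and system memory allows greater flexibility and utilization.


5. TCM Loading

The Cortex-M7 uses a distributed RAM architecture that lets the MCU dedicate RAM to critical tasks and data transfers, which maximizes MCU performance. TCM can be loaded from many sources, and the M7 architecture does not specify which. That is left to the MCU designers, who decide whether there is a single DMA or several loading points fed by streams such as video or USB.

During the software build, IoT product developers must determine which code segments and data blocks are assigned to TCM. This is done by embedding pragmas in the software and applying linker settings, so that the build places the code at the appropriate location when it allocates memory.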


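6. Why SRAM?

Flash memory can be attached to a TCM interface, but flash cannot run at the processor clock frequency and needs caching. Cache misses then cause delays, which threatens the determinism that TCM is meant to provide.

DRAM is feasible only in theory, because its cost is prohibitive. That leaves SRAM as the viable choice for fast, direct, uncached TCM access. SRAM is easy to embed on a chip and allows random access at processor speed. However, SRAM costs more per bit than flash or DRAM, so it is important to keep the TCM size within limits.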


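Atmel | SMART Cortex-M7 MCUs

Take Atmel's SAM E70, S70 and V70/71 microcontrollers as an example: they organize the SRAM into four memory banks for TCM and system RAM. The company recently began volume shipments of the SAM E70 and S70 families to the IoT and industrial markets, and claims that these microcontrollers outperform the closest competing microcontroller by 50 percent.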

sam-e70_s70_blockdiagram_lg_929x516.png

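Atmel's M7-based microcontrollers offer up to 384KB of embedded SRAM that can be configured as TCM or system memory, which gives IoT designs greater flexibility and utilization. For example, the E70 and S70 microcontrollers organize the 384KB of embedded SRAM into four ports to limit memory access conflicts. These MCUs allocate 256KB of SRAM to TCM functions, 128KB each for ITCM and DTCM, to provide zero-wait access at the 300MHz processor speed. The remaining 128KB of SRAM can be configured as system memory running at 150MHz.


However, because the SRAM block is organized as a 384KB memory bank, system SRAM and TCM can be used at the same time. The large 384KB on-chip SRAM is also critical for many IoT devices, because it lets a device run multiple communication stacks and applications on the same microcontroller without adding external memory. This is a significant value proposition in the IoT field: doing without external memory lowers the bill-of-materials (BOM) cost, reduces the printed circuit board (PCB) footprint and removes complexity from high-speed PCB design.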



风筝
Posted: 2015-9-18 22:15:33

English original:


6 memory considerations for Cortex-M7-based IoT designs


Taking a closer look at the configurable memory aspects of Cortex-M7 microcontrollers.


Tightly coupled memory (TCM) is a salient feature of the Cortex-M7 lineup: it boosts the MCU's performance by offering single-cycle access for the CPU and by securing high-priority, latency-critical requests from the peripherals.


cortex-m7-chip-diagramlg.png


The early MCU implementations based on ARM's M7 embedded processor core, such as Atmel's SAM E70 and S70 chips, have arrived in the market. So it's worthwhile to take a closer look at the configurable memory aspects of M7 microcontrollers and see how the TCMs enable the execution of deterministic code and fast transfer of real-time data at the full processor speed.


Here are some of the key findings regarding the advanced memory architecture of Cortex-M7 microcontrollers:


1. TCM is Configurable


First and foremost, the size of TCM is configurable. TCM, which is part of the physical memory map of the MCU, supports up to 16MB of tightly coupled memory. The configurability of the ARM Cortex-M7 core also allows SoC architects to integrate a range of cache sizes. As a result, industrial and Internet of Things (IoT) product developers can determine the amount of critical code and real-time data to place in TCM to meet the needs of the target application.


The Cortex-M7 architecture doesn't specify what type of memory or how much memory should be provided; instead, it leaves these decisions to designers implementing the M7 in a microcontroller as an avenue for differentiation. Consequently, a flexible memory system can be optimized for performance, determinism and low latency, and thus can be tuned to specific application requirements.


2. Instruction TCM


Instruction TCM or ITCM implements critical code with deterministic execution for real-time processing applications such as audio encoding/decoding, audio processing and motor control. The use of standard memory will lead to delays due to cache misses and interrupts, and therefore will hamper the deterministic timing required for real-time response and seamless audio and video performance.


Deterministic, critical software routines should be loaded into ITCM, which sits behind a 64-bit instruction memory port that supports the dual-issue processor architecture and provides single-cycle access for the CPU to boost MCU performance. However, developers need to carefully work out the amount of code that needs zero-wait execution performance to determine the amount of ITCM required in an MCU device.


the-anatomy-of-tcm-inside-the-m7-architecture1.gif


The anatomy of TCM inside the M7 architecture.


3. Data TCM


Data TCM or DTCM is used in fast data processing tasks like 2D barcode decoding and fingerprint and voice recognition. There are two data ports (DTCMs) that provide simultaneous and parallel 32-bit accesses to real-time data. Both instruction TCM and data TCM, used for efficient access to on-chip Flash and external resources, must have the same size.


4. System RAM and TCM


System RAM, also known as general RAM, is employed for communications stacks related to networking, fieldbus, high-bandwidth bridging, USB, etc. It implements peripheral data buffers, generally through direct memory access (DMA) engines, and can be accessed by bus masters without CPU intervention.


Here, product developers must remember the memory access conflicts that arise from concurrent data transfers by both the CPU and the DMA. So developers must set clear priorities for latency-critical requests from the peripherals and carefully plan latency-critical data transfers, like the transfer of a USB descriptor, or give a slow-data-rate peripheral a small local buffer. Accesses from the DMA and the caches are generally bursts to consecutive addresses to optimize system performance.
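

As a rough illustration of the "small local buffer" idea, here is a minimal sketch of a single-producer, single-consumer ring buffer that a slow peripheral's interrupt handler could fill while the main loop drains it. The names `ring_t`, `ring_put`, `ring_get` and the size `RING_SIZE` are hypothetical and do not come from any Atmel library; the sketch also assumes a single core where the producer only writes `head` and the consumer only writes `tail`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical buffer size; must be a power of two so the index mask works. */
#define RING_SIZE 8u

typedef struct {
    volatile uint32_t head;   /* advanced only by the producer (e.g. an ISR) */
    volatile uint32_t tail;   /* advanced only by the consumer (main loop)   */
    uint8_t data[RING_SIZE];
} ring_t;

/* Store one byte; returns false (and drops the byte) when the buffer is full. */
bool ring_put(ring_t *r, uint8_t byte)
{
    if (r->head - r->tail == RING_SIZE) {
        return false;
    }
    r->data[r->head & (RING_SIZE - 1u)] = byte;
    r->head = r->head + 1u;
    return true;
}

/* Fetch the oldest byte; returns false when the buffer is empty. */
bool ring_get(ring_t *r, uint8_t *out)
{
    if (r->head == r->tail) {
        return false;
    }
    *out = r->data[r->tail & (RING_SIZE - 1u)];
    r->tail = r->tail + 1u;
    return true;
}
```

Because the slow peripheral is served from this buffer, it does not have to compete for the bus every time a byte arrives, which leaves bandwidth for the latency-critical transfers.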


It’s worth noting that while system memory is logically separate from the TCM, microcontroller suppliers like Atmel are incorporating TCM and system RAM in a single SRAM block. That lets IoT developers share general-purpose tasks while splitting TCM and system RAM functions for specific use cases.


a-single-sram-block-for-tcm-and-system-memory-allows-higher-flexibility-and-util.jpg


A single SRAM block for TCM and system memory allows higher flexibility and utilization.


5. TCM Loading


The Cortex-M7 uses a scattered RAM architecture to allow the MCU to maximize performance by having a dedicated RAM part for critical tasks and data transfer. The TCM might be loaded from a number of sources, and these sources aren’t specified in the M7 architecture. It’s left to the MCU designers whether there is a single DMA or several data loading points from various streams like USB and video.


It's imperative that, during the software build, IoT product developers identify which code segments and data blocks are allocated to the TCM. This is done by embedding pragmas into the software and by applying linker settings so that the software build places the code at the appropriate location in memory.
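

A minimal sketch of what such placement can look like with GCC-style section attributes follows. The section names `.itcm_text` and `.dtcm_data`, the macros `ITCM_CODE` and `DTCM_DATA`, and the filter routine itself are assumptions made for illustration; a real project has to define matching output sections in its linker script and copy the ITCM code from flash at startup, and other toolchains use their own pragmas.

```c
#include <stdint.h>

/* Hypothetical section names; they must match the project's linker script. */
#if defined(__GNUC__) && defined(__ELF__)
#define ITCM_CODE __attribute__((section(".itcm_text"), noinline))
#define DTCM_DATA __attribute__((section(".dtcm_data")))
#else
#define ITCM_CODE   /* other toolchains: use their own placement pragma */
#define DTCM_DATA
#endif

#define FILTER_TAPS 4

/* Real-time data the routine touches on every sample: candidate for DTCM. */
DTCM_DATA static int32_t filter_state[FILTER_TAPS];

/* Timing-critical routine: candidate for ITCM. Returns the average of the
   last FILTER_TAPS samples, counting samples not yet received as zero. */
ITCM_CODE int32_t moving_average(int32_t sample)
{
    int64_t sum = 0;
    for (int i = FILTER_TAPS - 1; i > 0; --i) {
        filter_state[i] = filter_state[i - 1];   /* shift the history */
        sum += filter_state[i];
    }
    filter_state[0] = sample;
    sum += sample;
    return (int32_t)(sum / FILTER_TAPS);
}
```

Only the routine and its state are tagged; everything else stays in ordinary flash and system RAM, which keeps the TCM budget small.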


6. Why SRAM?


Flash memory can be attached to a TCM interface, but the Flash cannot run at the processor clock speed and will require caching. As a result, this will cause delays when cache misses occur, threatening the deterministic value proposition of the TCM technology.


DRAM technology is a theoretical choice but it’s cost prohibitive. That leaves SRAM as a viable candidate for fast, direct and uncached TCM access. SRAM can be easily embedded on a chip and permits random accesses at the speed of the processor. However, cost-per-bit of SRAM is higher than Flash and DRAM, which means it’s critical to keep the size of the TCM limited.


Atmel | SMART Cortex-M7 MCUs


Take the case of Atmel’s SMART SAM E70, S70 and V70/71 microcontrollers that organize SRAM into four memory banks for TCM and System SRAM parts. The company has recently started shipping volume units of its SAM E70 and S70 families for the IoT and industrial markets, and claims that these MCUs provide 50 percent better performance than the closest competitor.


sam-e70_s70_blockdiagram_lg_929x516.png


Atmel's M7-based microcontrollers offer up to 384KB of embedded SRAM that is configurable as TCM or system memory, providing IoT designs with higher flexibility and utilization. For instance, E70 and S70 microcontrollers organize 384KB of embedded SRAM into four ports to limit memory access conflicts. These MCUs allocate 256KB of SRAM for TCM functions (128KB each for ITCM and DTCM) to deliver zero-wait access at 300MHz processor speed, while the remaining 128KB of SRAM can be configured as system memory running at 150MHz.


However, the availability of an SRAM block organized in the form of a memory bank of 384KB means that both system SRAM and TCM can be used at the same time. The large on-chip SRAM of 384KB is also critical for many IoT devices, since it enables them to run multiple communication stacks and applications on the same MCU without adding external memory. That's a significant value proposition in the IoT realm because avoiding external memories lowers the BOM cost, reduces the PCB footprint and eliminates the complexity in high-speed PCB design.
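

The split described above is simple arithmetic, sketched here as a small helper. The function `system_ram_kb` is hypothetical, and it assumes that whatever is not assigned to TCM remains available as system RAM and that ITCM and DTCM are always equal in size, as the article states. Only the 128KB + 128KB case is taken from the text; other sizes are illustrative.

```c
#include <stdint.h>

/* Total embedded SRAM on the parts discussed in the article, in KB. */
#define SRAM_TOTAL_KB 384u

/* Returns the SRAM left for system RAM in KB, or -1 for an invalid split. */
int system_ram_kb(uint32_t itcm_kb, uint32_t dtcm_kb)
{
    if (itcm_kb != dtcm_kb) {
        return -1;                        /* article: sizes must match  */
    }
    if (itcm_kb + dtcm_kb > SRAM_TOTAL_KB) {
        return -1;                        /* more TCM than SRAM exists  */
    }
    return (int)(SRAM_TOTAL_KB - itcm_kb - dtcm_kb);
}
```

With the article's figures, `system_ram_kb(128, 128)` gives 128, matching the 128KB of system memory left over.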

