Preface#
This note introduces only the essential concepts of Static Timing Analysis (STA). It does not cover:
- Asynchronous checks, e.g. removal and recovery
- Timing exceptions, e.g. false paths, multi-cycle paths, etc.
- Advanced timing domain knowledge
- POCV, MCMM, etc.
What is STA#
As the clock frequency increases, the logic in a chip can perform more operations per unit time, so frequency is positively correlated with chip performance. Chip design is a trade-off among power, performance, and area (PPA), so how do we know the highest frequency at which a chip can still operate correctly? This is where STA (Static Timing Analysis) comes in.
STA is used to verify whether the design can safely operate at a given clock frequency without timing violations. STA has the following characteristics:
- Pros
  - No need for input stimulus simulation
  - Comprehensive timing checks
- Cons
  - Cannot handle asynchronous timing
STA Application Scenarios#
STA can be applied at multiple stages of PD (Physical Design) and has different characteristics at each stage, such as:
- Synthesis: In the logic design phase, since there is no physical information related to layout, it can be assumed that interconnects are in an ideal state. This phase focuses more on identifying the logic that leads to the worst paths. Another technique used in this phase is the wire load model to estimate the length of interconnects; the wire load model provides an estimated RC value based on the fan-out of logic units.
- Pre-CTS: At the beginning of physical design, the clock tree is considered ideal, meaning it has zero delay. After CTS, the clock has actual propagation delay.
- Pre-Route: Before actual routing, the real wires do not exist yet, so STA works with estimated parasitic RC values for the metal interconnect.
Cell#
Cells can be standard cells, IO buffers, or complex IPs like USB cores. In addition to timing information, the library cell description also includes other attributes, such as cell area and functionality, which are unrelated to timing but are used during RTL synthesis.
Pin Capacitance#
Each input and output of a cell can specify capacitance at the pin. In most cases, only the input pins of the cell specify capacitance, while output pins do not, meaning the output pin capacitance in most cell libraries is 0.
As an example, take the specification of the input pin capacitance for pin INP1. In its most basic format, pin capacitance is specified as a single value (0.5 units in this example). The capacitance unit is typically picofarads (pF) and is usually specified at the beginning of the library file. The cell description can also specify separate values for rise_capacitance (0.5 units) and fall_capacitance (0.45 units), which apply when the signal at pin INP1 rises and falls respectively. The values of rise_capacitance and fall_capacitance can also be specified as a range, with lower and upper limits given in the description.
Drive Strength#
Input pin capacitance is defined in the Liberty library, while the load capacitance seen at an output pin is determined by all downstream cells driven by that cell. When a CMOS cell switches state, the switching speed depends on how quickly the capacitance on the output pin is charged and discharged.
Generally, the cell drive strength determines the maximum capacitive load that can be driven, and the maximum capacitive load determines the maximum number of fan-outs, i.e., how many other cells can be driven. Higher output drive corresponds to lower output pull-up/pull-down resistance, allowing the cell to charge and discharge larger loads at the output pin.
- The larger the drive strength, the larger the cell area, and the larger the max_cap.
- The larger the drive strength, the smaller the corresponding output resistance, and the smaller the delay.
- If the standard cell library only contains standard logic cells with small drive strengths, what impact does this have on timing?
- When the entire library consists of small drive cells, the first thought is that the driving capability of each cell is weak, and the output resistance is larger.
- If an inverter has a small drive strength, then the maximum load capacitance it can drive is also small. If certain nodes in the design must drive larger capacitances, such as long lines or high fan-out networks, small drive cells may not meet the requirements, leading to setup or hold time violations.
Propagation Delay#
The propagation delay of a cell is defined by certain measurement points on the level switching waveform. The units of these thresholds are a percentage of Vdd or the power supply, and for most standard cell libraries, the 50% threshold is typically used to calculate delay.
The propagation delay here is divided into two types (not equal) based on the rise/fall of the output signal:
- Output rise delay: for an inverting cell, the delay from the moment the falling input crosses its threshold point to the moment the rising output crosses its threshold point.
- Output fall delay: the converse, i.e. the delay from the rising input crossing its threshold point to the falling output crossing its threshold point.
Slew#
The definition of slew rate is the rate of voltage change. In STA, the rise or fall waveform is usually measured based on the speed of level transitions. Slew is typically defined based on transition time, which refers to the time required for a signal to transition between two specific levels. Note that transition time is actually the inverse of slew rate; thus, the larger the transition time, the lower the slew rate, and vice versa.
A specified threshold voltage is generally used to define the starting and ending points for transition time calculations.
Slew rate and slew are not the same thing. Slew refers to transition, while slew rate is its inverse.
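The relation between transition time and slew rate can be made concrete with a small sketch. The waveform below is a hypothetical linear ramp, the 10%/90% thresholds are one common choice rather than a fixed rule, and the helper names crossing_time and transition_time are made up for illustration.

```python
def crossing_time(times, volts, level):
    """First time a piecewise-linear waveform crosses `level` (linear interpolation)."""
    points = list(zip(times, volts))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if v0 != v1 and min(v0, v1) <= level <= max(v0, v1):
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")


def transition_time(times, volts, vdd, lo=0.1, hi=0.9):
    """Time spent between the lo and hi thresholds, given as fractions of Vdd."""
    t_lo = crossing_time(times, volts, lo * vdd)
    t_hi = crossing_time(times, volts, hi * vdd)
    return abs(t_hi - t_lo)


# Hypothetical rising edge: a linear ramp from 0 V to 1.0 V over 100 ps.
times_ps = [0.0, 100.0]
volts = [0.0, 1.0]

slew_ps = transition_time(times_ps, volts, vdd=1.0)  # transition time (slew)
slew_rate = (0.9 - 0.1) * 1.0 / slew_ps              # slew rate in volts per ps
print(slew_ps, slew_rate)
```

A slower edge gives a larger slew_ps and a smaller slew_rate, which is the inverse relationship described above. The same crossing_time helper with a 50% level could be used to measure propagation delay between an input and an output waveform.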
Timing Arc#
Timing arcs describe the delay of signal transmission between cell pins and the signal transition conditions.
- For combinational logic units like AND gates, OR gates, NAND gates, and adders, there is a timing arc from each input pin to each output pin.
- For sequential logic units like flip-flops, in addition to the timing arc from the clock pin to the output pin, there are also timing constraints regarding the data pins relative to the clock pin.
Each timing arc has a specific timing sense, which indicates how the output changes in response to different types of transitions at the input. In non-unate timing arcs, the direction of the transition from one input pin alone cannot determine how the output pin level will change; it also depends on the states of other input pins.
Timing Model#
The timing model of a logic unit aims to provide accurate timing information for various instances of units in the design.
- Each timing arc has a timing model.
- The timing model is obtained from detailed circuit simulation.
For an inverter, there are two types of delays: the output rise delay $T_{r}$ and the output fall delay $T_{f}$.
The delay and output transition through the inverter mainly depend on:
- Output load, i.e., the capacitive load at the inverter's output pin.
- Transition time of the input signal.
- Transistor layout design: negligible.
The signal input of a logic unit is like water flowing into a tank; the water first drives the blue water wheel (similar to input transition time), and then fills the tank (output capacitance) before it can drive the red water wheel (the next logic unit).
Delay values are directly related to load capacitance: the larger the load capacitance, the larger the delay. In most cases, delay will also increase with the increase in input signal transition time. PS: Not absolute.
NLDM#
The timing model of a logic unit can be simply understood as a function of input slew and output load, but simple linear timing models are not accurate when applied to submicron technology. Therefore, most cell libraries currently use more complex non-linear delay models (NLDM).
Most cell libraries include table models to specify delays for various timing arcs of the unit and perform timing checks. These table models are referred to as NLDM (Non-Linear Delay Model) and can be used for delay, output slew calculations, or other timing checks. The table model provides: the delay through the unit for various combinations of input transition time at the unit's input pin and output load capacitance at the output pin.
In the example delay table, when the input falling transition time is 0.3 ns and the output load is 0.16 pF, the rise delay of the inverter is 0.1018 ns. Since a falling edge at the input causes a rising edge at the inverter output, the cell_rise delay table is the one to query when the input pin sees a falling transition. Note that the table model can also be three-dimensional, for example for a flip-flop with complementary outputs Q and QN.
The NLDM model can be used not only to calculate delays but also to calculate the transition time of the logic unit's output pin, which is also characterized by input transition time and output load capacitance.
Thus, the NLDM model can calculate:
- Rise Delay
- Fall Delay
- Rise Slew
- Fall Slew
Additionally, if there is no corresponding index in the table, results can be calculated through interpolation.
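As a rough illustration of such a table lookup, the sketch below performs bilinear interpolation over a 3x3 cell_rise table. Only the entry 0.1018 at (0.3 ns, 0.16 pF) comes from the text; every other index and value is an illustrative assumption, and nldm_lookup is a hypothetical helper, not a tool API. Real delay calculators may use a different interpolation scheme.

```python
from bisect import bisect_right


def _segment(axis, x):
    """Index i of the axis segment [axis[i], axis[i+1]] used for x.

    Values outside the axis reuse the first or last segment (extrapolation).
    """
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def nldm_lookup(index_1, index_2, values, slew, load):
    """Bilinear interpolation in a table indexed by input slew and output load."""
    i = _segment(index_1, slew)
    j = _segment(index_2, load)
    tx = (slew - index_1[i]) / (index_1[i + 1] - index_1[i])
    ty = (load - index_2[j]) / (index_2[j + 1] - index_2[j])
    row_lo = values[i][j] * (1 - ty) + values[i][j + 1] * ty
    row_hi = values[i + 1][j] * (1 - ty) + values[i + 1][j + 1] * ty
    return row_lo * (1 - tx) + row_hi * tx


# Illustrative cell_rise table (ns); rows follow INDEX_1, columns follow INDEX_2.
INDEX_1 = [0.1, 0.3, 0.7]      # input transition time, ns
INDEX_2 = [0.16, 0.35, 1.43]   # output load capacitance, pF
CELL_RISE = [
    [0.0513, 0.1537, 0.5280],
    [0.1018, 0.2327, 0.6476],  # 0.1018 is the value quoted in the text
    [0.1334, 0.2973, 0.7252],
]

on_grid = nldm_lookup(INDEX_1, INDEX_2, CELL_RISE, slew=0.3, load=0.16)
between = nldm_lookup(INDEX_1, INDEX_2, CELL_RISE, slew=0.2, load=0.16)
print(on_grid, between)
```

The same function could be reused for the rise_transition and fall_transition tables, since output slew is characterized over the same two axes.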
Derate#
skip it
Slew values are based on measurement threshold points specified in the library. Most previous-generation libraries (0.25um or older) used 10% and 90% (corresponding to the linear portion of the waveform) as measurement threshold points for slew (or transition time).
With the advancement of technology, the most linear part of actual waveforms is usually between 30% and 70%. Therefore, most new-generation timing libraries specify slew measurement threshold points as 30% and 70% of Vdd. However, since transition times were previously measured between 10% and 90%, the transition times measured between 30% and 70% are typically doubled when the library is populated. This is indicated by the slew derate factor (slew_derate_from_library in Liberty), usually set to 0.5: multiplying a library value by 0.5 gives the transition time between the 30% and 70% thresholds. Slew thresholds of 30% and 70% with a slew derate factor of 0.5 are therefore equivalent to measurement thresholds of 10% and 90%.
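The arithmetic behind the derate factor can be sketched as follows. The 0.05 ns measurement is a made-up number, and both function names are hypothetical.

```python
def measured_to_library(measured_30_70, slew_derate=0.5):
    """Value stored in the library for a transition measured between 30% and 70%."""
    return measured_30_70 / slew_derate


def library_to_threshold_slew(library_value, slew_derate=0.5):
    """Transition time between the library's own thresholds (30% and 70%)."""
    return library_value * slew_derate


measured = 0.05                              # ns, hypothetical 30%-70% measurement
stored = measured_to_library(measured)       # doubled value written to the library
recovered = library_to_threshold_slew(stored)
print(stored, recovered)
```

For a linear ramp, the 10% to 90% span covers twice the voltage range of the 30% to 70% span, which is why the doubled value behaves like a 10%/90% transition time.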
Combinational Logic Units#
For a two-input AND gate: there are four types of delays and four types of output transitions.
- Rise and fall * two input pins = 4
- In an FPGA, all delay information for each logic unit is generally fixed, so each type of logic unit fits a fixed delay (e.g., LUT is 0.1ns, DSP is 1.3ns, etc.).
General Combinational Logic Block#
Consider the following general combinational logic block with three inputs and two outputs:
Such combinational logic blocks can have multiple timing arcs. Typically, there is a timing arc from each input of the block to each output.
Sequential Logic Units#
The timing arcs of sequential logic units are as follows:
For synchronous input signals at pins D, SI, and SE, the following timing arcs exist (both rise and fall):
- Setup time check timing arc
- Hold time check timing arc
For synchronous output signals at pin Q, the following timing arc exists:
- CK to Q or QN Propagation delay arc
For asynchronous input signals at pin CDN, the following timing arcs exist:
- Removal time check timing arc
- Recovery time check timing arc
Additionally, for clock pins and asynchronous pins, there are also:
- Pulse width timing checks
Setup and Hold#
Setup time and hold time synchronous timing checks are used to ensure that data can be correctly propagated through the timing unit. These timing checks can verify that the input data is in a definite logical state at the clock's active edge and that the correct data is latched at the active edge.
Setup and hold constraints are described by two-dimensional table models indexed by the transition times at the constrained pin (D) and at the related pin (CK).
Details of setup and hold will be introduced later.
Asynchronous Timing Checks#
SKIP
State-Dependent Timing Models#
SKIP
The timing arcs between inputs and outputs depend on the logical states of other pins in the module.
Black Box Interface Timing Models#
SKIP
Advanced Timing Models#
SKIP
Non-linear delay models (NLDM) are timing models that represent delays through timing arcs based on output load capacitance and input transition time. In practice, the output load of a unit not only includes capacitance but should also include interconnect resistance.
Since the NLDM method assumes that the output load is purely capacitive, interconnect resistance becomes an issue. When the impact of interconnect resistance is small, NLDM models are still used even though the resistance is not zero. When interconnect resistance matters, delay calculation tools adapt the NLDM approach by computing an equivalent effective capacitance at the cell output. This effective capacitance is chosen so that the cell output delay with the purely capacitive load matches the cell output delay with the actual RC interconnect.
Since NLDM does not handle errors caused by interconnect resistance well, more advanced timing models, such as CCS (Composite Current Source), have been proposed.
Clock#
Skew#
Skew refers to the timing difference between two or more signals (data or clock). For example, if a clock tree has 500 endpoints and has a skew of 50ps, it means that the delay difference between the longest clock path and the shortest clock path is 50ps.
The starting point of the clock tree is usually the node where the clock is defined, and the endpoints of the clock tree are usually the clock pins of synchronous elements (such as flip-flops). Clock latency (source + network) refers to the total time taken from the clock source to an endpoint, while clock skew refers to the difference in arrival time between different endpoints of the clock tree.
Ideal clock trees assume that the clock source has infinite drive strength, and the clock can drive an infinite number of endpoints with no delay. Additionally, it is assumed that any logic units present in the clock tree have zero delay. In the early stages of logical design, STA typically uses ideal clock trees, so the focus of the analysis is on the data path. The set_clock_latency command can be used to specify the clock tree delay.
Uncertainty#
The set_clock_uncertainty command specifies a window within which a clock edge may occur. The uncertainty in clock edge timing accounts for several factors, such as clock period jitter and additional margin needed for timing verification. In reality, there is no ideal clock; all clocks have a certain amount of jitter, and clock period jitter should be included when specifying clock uncertainty.
Before the clock tree is implemented, clock uncertainty must also include expected clock skew. Hold time checks do not require including clock jitter, so a smaller clock uncertainty is typically specified for hold time checks.
Actual Clock Signals#
Actual clock signals include rising and falling edges:
Combining the two clock signals results in an ideal eye diagram under ideal conditions, where only transitions are present:
However, in reality, clock signals exhibit different arrival times (jitter), resulting in the following eye diagram:
Additionally, clock signals may experience voltage drops and ground bounce due to power supply variations.
Ultimately, the actual clock signal under real conditions is:
For level fluctuations, noise margin is defined, allowing for a certain amount of distortion:
The area of the clock signal without jitter is referred to as the window where data is reliable:
The area of the clock signal with jitter is referred to as jitter: Jitter has to be accounted for in the timing reports. We model this using one more parameter called Uncertainty.
Example: Uncertainty = 90ps = 0.09ns
Clock Domain#
A clock typically drives many flip-flops, and a group of flip-flops driven by the same clock is called its clock domain. The following diagram shows two clock domains:
A key question to consider is: Are the two clock domains related or independent of each other? The answer depends on whether there is a data path that starts in one clock domain and ends in another. If there is no such path, we can confidently say that these two clock domains are independent of each other, meaning there is no timing path that starts in one clock domain and ends in another.
If there is a data path crossing clock domains (as shown in the diagram), it must be determined whether these paths are real paths: for example, a flip-flop driven by a double-frequency clock initiates data, which is then captured by a flip-flop driven by a single-frequency clock; this path is a real path.
An example of a false path is when designers explicitly place clock synchronizer logic between two clock domains. In this case, even though it seems there is a timing path from one clock domain to the next, it is not a real timing path because the data is not constrained to propagate through the synchronizer logic within one clock cycle. Such paths are called false paths (not real) because it is the clock synchronizer that ensures data is correctly passed from one clock domain to another.
- False paths belong to timing exceptions, so skip it.
- In a design, some paths cannot exist or can never be exercised; these are called false paths. False paths usually occur in asynchronous logic and across clock domains, or in complex internal logic where a signal can be shown to be a constant that never changes.
In practice, cross-clock-domain situations are often bidirectional, meaning from the USBCLK clock domain to the MEMCLK clock domain, and from the MEMCLK clock domain to the USBCLK clock domain; both situations need to be correctly understood and handled in STA.
SDC#
Correct constraints are crucial for analyzing STA results. Only by accurately specifying the design environment can STA analysis identify all timing issues in the design. The preparation for STA includes setting clocks, specifying IO timing characteristics, and specifying false paths and multi-cycle paths.
To perform STA on such a design, it is necessary to specify the clock for the flip-flops and the timing constraints for all paths entering and exiting the design.
Specifying Clocks#
To define a clock, we need to provide the following information:
- Clock source: It can be a port of the design or a pin of an internal unit of the design (usually part of the clock generation logic).
- Period: The clock period.
- Duty cycle: The duration of the high level (positive phase) and the duration of the low level (negative phase).
- Edge times: The moments of the rising and falling edges.
An example of creating a clock: create_clock -name SYSCLK -period 20 -waveform {0 5} [get_ports SCLK]
This clock is named SYSCLK and is defined at port SCLK. The period of SYSCLK is specified as 20 units; if not specified, the default time unit is nanoseconds (usually, the time unit is specified in the technology library). The first argument in the waveform specifies the time of the rising edge, and the second argument specifies the time of the falling edge.
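To see what the period and waveform arguments imply, the following sketch derives the high time, low time, and duty cycle for the SYSCLK example. The helper name clock_shape is hypothetical; the numbers are those of the create_clock command in the text.

```python
def clock_shape(period, waveform):
    """Return (high_time, low_time, duty_cycle) for a {rise fall} waveform."""
    rise, fall = waveform
    high = (fall - rise) % period   # time from the rising edge to the next falling edge
    low = period - high
    return high, low, high / period


# SYSCLK from the text: -period 20 -waveform {0 5}
high, low, duty = clock_shape(20, (0, 5))
print(high, low, duty)
```

With these values the clock is high for 5 time units and low for 15, i.e. a 25% duty cycle rather than the 50% that results when -waveform is omitted.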
Clock Uncertainty#
The timing uncertainty of the clock period can be specified using the set_clock_uncertainty constraint, which models various factors that may reduce the effective clock period. These factors may include jitter and any other pessimism that needs to be considered in timing analysis.
Example: set_clock_uncertainty -setup 0.2 [get_clocks CLK_CONFIG]
Note that the clock uncertainty for setup time checks reduces the available effective clock period. For hold time checks, clock uncertainty serves as additional timing margin that needs to be met.
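A simplified view of how uncertainty enters the two checks is sketched below. It assumes a single clock with zero skew and ignores latency; the period, setup time, and hold time are hypothetical values, and only the 0.2 setup uncertainty comes from the example above.

```python
def setup_required_time(period, setup_time, setup_uncertainty):
    """Latest allowed data arrival at the capture flip-flop (zero skew assumed)."""
    return period - setup_uncertainty - setup_time


def hold_required_time(hold_time, hold_uncertainty):
    """Earliest allowed data change after the clock edge (zero skew assumed)."""
    return hold_time + hold_uncertainty


period = 10.0        # ns, hypothetical period for CLK_CONFIG
setup_req = setup_required_time(period, setup_time=0.3, setup_uncertainty=0.2)
hold_req = hold_required_time(hold_time=0.1, hold_uncertainty=0.05)
print(setup_req, hold_req)
```

Setup uncertainty pulls the required time earlier, shrinking the usable period, while hold uncertainty pushes the hold requirement later, so both make the checks harder to meet.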
Clock Latency#
Clock latency can be set with the set_clock_latency command, for example set_clock_latency 1.8 -rise [get_clocks MAIN_CLK].
There are two types of clock latency: network latency and source latency: the total clock latency at the clock pin of the flip-flop is the sum of source latency and network latency. After the clock tree synthesis is completed, the total clock latency from the clock source to the clock pin of the flip-flop is the source latency plus the actual delay of the clock tree from the clock definition point to the flip-flop.
- Network latency refers to the delay from the clock definition point (create_clock) to the clock pin of the flip-flop.
  - Ignored after CTS
- Source latency, also known as insertion delay, refers to the delay from the clock source to the clock definition point; source latency may represent on-chip or off-chip delays.
  - Retained after CTS
An important distinction between source latency and network latency is that once a clock tree is built for the design, network latency can be ignored (assuming the set_propagated_clock command is specified).
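The before/after CTS distinction can be summarized in a few lines. All numbers here are hypothetical, and total_clock_latency is an illustrative helper rather than a tool command.

```python
def total_clock_latency(source_latency, network_latency=0.0, propagated_tree_delay=None):
    """Latency at a flip-flop clock pin.

    Before CTS (ideal clock): source latency + estimated network latency.
    After CTS (propagated clock): source latency + actual clock tree delay,
    and the estimated network latency is ignored.
    """
    if propagated_tree_delay is not None:
        return source_latency + propagated_tree_delay
    return source_latency + network_latency


pre_cts = total_clock_latency(source_latency=0.5, network_latency=1.8)
post_cts = total_clock_latency(source_latency=0.5, network_latency=1.8,
                               propagated_tree_delay=1.65)
print(pre_cts, post_cts)
```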
Constraints on Input Paths#
The flip-flop UFF0 is external to the design and provides data to the internal flip-flop UFF1. The data connects the two flip-flops through the input port INP1.
The clock definition for CLKA specifies the clock period, which is the total time available between the two flip-flops UFF0 and UFF1. The time required by the external logic is Tclk2q (the CK to Q delay of the data initiating flip-flop UFF0) plus Tc1 (the delay through the external combinational logic), so the delay definition at the input pin INP1 specifies the external delay of Tclk2q plus Tc1.
The following are the constraints for input delays (which can be defined separately for min and max):
set Tclk2q 0.9
set Tc1 0.6
set_input_delay -clock CLKA -max [expr $Tclk2q + $Tc1] [get_ports INP1]
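The budget implied by this constraint can be worked out numerically. Tclk2q and Tc1 are the values from the text; the CLKA period and the setup time of UFF1 are hypothetical, and clock skew and uncertainty are ignored in this sketch.

```python
tclk2q = 0.9   # CK-to-Q delay of the external flip-flop UFF0 (from the text)
tc1 = 0.6      # delay of the external combinational logic (from the text)

input_delay_max = tclk2q + tc1   # value passed to set_input_delay -max


def internal_budget(period, input_delay, setup_time):
    """Time left for logic between port INP1 and the D pin of UFF1."""
    return period - input_delay - setup_time


# Hypothetical CLKA period and UFF1 setup time.
budget = internal_budget(period=4.0, input_delay=input_delay_max, setup_time=0.2)
print(input_delay_max, budget)
```

The larger the external delay declared on the port, the less time remains inside the design for the path from INP1 to UFF1.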
Constraints on Output Paths#
Constraints on output paths are similar to those on input paths and are specified with the set_output_delay command, which describes the external delay beyond the output port.
Timing Path Groups#
Timing paths in the design can be viewed as a collection of paths, each with a start point and an endpoint.
Timing paths can be classified into different timing path groups based on the clock associated with the endpoint. Therefore, each clock has a set of timing paths associated with it. There is also a default timing path group that includes all non-clock (asynchronous) paths.
External Attribute Modeling#
Although create_clock, set_input_delay, and set_output_delay are sufficient to constrain all paths used for timing analysis in the design, they are not enough to obtain accurate timing at the IO pins of the block.
For inputs, slew must be specified at the input port:
set_drive
set_driving_cell
set_input_transition
For outputs, the load capacitance at the output pin must be specified:
set_load
Drive Strength Modeling#
In summary, designers need to specify the slew value at the input to determine the delay of the first unit in the input path. In the absence of this constraint, it will be assumed to be an ideal transition value of 0, which is clearly unrealistic.
The set_drive and set_driving_cell constraints are used to model the drive strength of the external cells that drive the input ports of the design. In the absence of these constraints, it is assumed that all inputs have infinite drive strength, meaning the transition time at the input pin is 0.
set_drive explicitly specifies the drive resistance value at the input pin of the design under analysis (DUA); the smaller the resistance value, the higher the drive strength, and a resistance value of 0 indicates infinite drive strength. The drive strength at the input port is used to calculate the transition time at the first cell. The specified drive value can also be used to calculate the delay from the input port to the first cell in the presence of RC interconnect.
- Delay value = (Drive resistance * Net load) + Interconnect delay
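A worked instance of this formula, with made-up numbers, is sketched below. The units are chosen so that kilo-ohms times picofarads gives nanoseconds; the function name is hypothetical.

```python
def delay_to_first_gate(drive_resistance_kohm, net_load_pf, interconnect_delay_ns=0.0):
    """(drive resistance * net load) + interconnect delay, in ns (kOhm * pF = ns)."""
    return drive_resistance_kohm * net_load_pf + interconnect_delay_ns


weak_driver = delay_to_first_gate(2.0, 0.05, interconnect_delay_ns=0.01)
ideal_driver = delay_to_first_gate(0.0, 0.05, interconnect_delay_ns=0.01)
print(weak_driver, ideal_driver)
```

With a drive resistance of 0 (infinite drive strength) only the interconnect delay remains, which matches the default behaviour described above when no drive is specified.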
The set_driving_cell constraint provides a more convenient and accurate way to describe the driving capability of the port. set_driving_cell can be used to specify the type of cell driving the input port. However, the incremental delay of the driving cell caused by the capacitive load at the input port is included as additional delay on the input.
As an alternative to the above methods, the set_input_transition constraint provides a convenient way to specify the transition time at the input port and can specify a reference clock.
Load Capacitance Modeling#
Specifying the load on the output is important because this value affects the delay of the unit driving the output. In the absence of this constraint, the load will be assumed to be 0, which is clearly unrealistic.
The set_load constraint sets the capacitive load at the output port to model the external load driven by the output port. By default, the capacitive load at the port is 0. The load can be explicitly specified as a capacitance value or as the input pin capacitance of a certain cell.
DRV#
Two commonly used design rules in STA are maximum transition time (set_max_transition) and maximum capacitance (set_max_capacitance). These rules check whether all ports and pins in the design meet the specified limits for transition time and capacitance.
Additionally, other design rule checks can be specified for the design, such as set_max_fanout (specifying fanout constraints for all pins in the design) and set_max_area (for the design). However, these checks apply to synthesis rather than STA.
Delay Calculation#
Basic Concepts of Delay Calculation#
As mentioned above, each input pin of a unit has pin capacitance, so each net will have capacitive load, which is the sum of all fanout pin load capacitances and the parasitic capacitance of the interconnect.
Consider the following design:
For NET0, ignoring interconnect parasitics, its capacitance equals the sum of the input pin capacitances of UAND1 and UNOR2. Thus, the above diagram can be equivalently represented as:
The load capacitance of output O1 is the output port load (not specified by default; it can be set using set_load) plus the input pin capacitance of UNOR2 (already specified in the library). At this point, simply specifying the slew (or set_drive) for input I1 lets us obtain the propagation delay and output transition of cell UAND1 for that input transition (the output transition of one stage gives the input transition of the next stage).
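The load computation described here is a plain sum, as the sketch below shows. The pin capacitances, wire capacitance, and port load are hypothetical values; in a real flow they come from the library, parasitic extraction, and set_load respectively.

```python
def net_load_capacitance(fanout_pin_caps, wire_cap=0.0, port_load=0.0):
    """Total capacitive load on a net: fanout pin caps + wire parasitic + port load."""
    return sum(fanout_pin_caps) + wire_cap + port_load


# NET0 drives the inputs of UAND1 and UNOR2 (hypothetical pin caps in pF),
# with interconnect parasitics ignored as in the text.
net0_load = net_load_capacitance([0.5, 0.45])

# Output O1: input pin cap of UNOR2 plus an external load set with set_load.
o1_load = net_load_capacitance([0.45], port_load=0.2)
print(net0_load, o1_load)
```

Each of these totals is the output load value that would be used, together with the input slew, to index the NLDM delay and transition tables of the driving cell.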
Since multi-input units have multiple timing arcs from different inputs to outputs, the value of output transition is determined by the slew merge results.
Effective Capacitance Calculation for Unit Delay#
When the load at the unit output includes interconnect resistance, the NLDM model cannot be used directly. Therefore, the “effective” capacitance method is used to handle the impact of resistance.
The effective capacitance method attempts to find a capacitance that can be used as an equivalent load, so that the original design behaves consistently in terms of timing at the unit output with a design having an equivalent capacitive load. This equivalent capacitance is referred to as effective capacitance (C_eff).
In practice, the resistance of the interconnect parasitics cannot be ignored; in that case the RC interconnect can be modeled as a simplified PI model. Since NLDM only accepts a capacitance, the RC network is reduced to an equivalent $C_{eff}$, allowing us to continue using NLDM lookup tables to obtain the cell delay. Various algorithms exist to calculate this $C_{eff}$, such as second-order AWE, the Arnoldi algorithm, etc.
Note: Although it is possible to obtain approximate unit delays, output slew does not match the actual output waveform of the unit.
Net Delay#
For readers with a basic background in circuit theory: a wire can be modeled as an equivalent network of resistance and capacitance (R and C), so the signal delay along it can be simplified to an RC delay. Overall, wire delay depends on wire width, wire length, the process, and the number of fanout branches. At different stages of the EDA flow, different models are used to estimate the wire delay between two pins.
- Logic synthesis: For example, Synopsys Design Compiler estimates the wire delay between two pins with a wire load model (WLM). At this stage the design has not yet been placed and routed, so there are no cell positions from which to derive an actual route. WLM therefore estimates net length from the fanout count and derives the delay from it (the error is easy to imagine: a net with low fanout may still be stretched far apart during placement). WLMs are usually provided by the ASIC/FPGA vendor, and designers can fine-tune them for their designs. Within one design, different hierarchy levels and different nets can be assigned different WLMs to better approximate the actual delays.
- Placement: During the placement phase, the position of each logic cell is known, so positional information can be used to estimate paths: first estimate the wire length between two connected cells, then estimate the delay from that length. Although longer wires in principle mean longer delays, the relationship is not entirely linear, just as traveling from Guangzhou to Beijing by highway is different from taking county roads. In Cadence Innovus, pre-route timing estimates are generally based on TrialRoute (trial routing) or Early Global Route (early global routing): the tool estimates the routing, extracts RC parasitics from this rough route, and combines them with the input pin capacitances of the driven cells to obtain the wire delay. The key is an accurate routing estimate; with one, the timing shift between pre-route and post-route results can be kept very small.
- Routing: During this phase, not only the positions but also the actual metal routes are known. RC parameters can therefore be extracted directly and fed to the timing analysis engine.
Elmore Model for Interconnect Delay Calculation#
Elmore is a delay model used to calculate net delay under specific conditions of RC interconnect structures.
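As a minimal sketch of the Elmore calculation on an RC tree: the delay to a node is the sum, over every wire segment on the path from the driver to that node, of the segment resistance times the total capacitance downstream of that segment. The tree topology and the R and C values below are hypothetical, and the node list is assumed to be ordered so that each parent precedes its children.

```python
def elmore_delays(parent, res, cap):
    """Elmore delay from the root (node 0) to every node of an RC tree.

    parent[i]: index of the parent of node i (None for the root)
    res[i]:    resistance of the segment from parent[i] to node i
    cap[i]:    capacitance attached to node i
    Nodes must be ordered so that parent[i] < i.
    """
    n = len(parent)
    downstream = list(cap)
    for i in range(n - 1, 0, -1):          # accumulate subtree capacitance
        downstream[parent[i]] += downstream[i]
    delay = [0.0] * n
    for i in range(1, n):                  # walk from the root outwards
        delay[i] = delay[parent[i]] + res[i] * downstream[i]
    return delay


# Hypothetical tree: node 1 hangs off the driver (node 0); nodes 2 and 3 branch from node 1.
parent = [None, 0, 1, 1]
res = [0, 1, 2, 3]      # e.g. kOhm
cap = [0, 1, 1, 1]      # e.g. pF, so delays come out in ns

delays = elmore_delays(parent, res, cap)
print(delays)
```

Note that the segment into node 1 is charged with the capacitance of both branches, which is why side branches slow down a sink even though they are not on its path.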
Slew Merging (TBD)#
Path Delay Calculation#
Review several concepts: timing path, timing arc
Theoretically, a timing path exists with a start point and an endpoint:
- Start point: input port and clk pin
- End point: d pin and output port
Thus, there are four types of timing paths: r2r, i2o, i2r, and r2o.
Timing arcs are used to describe:
- The signal transmission relationship between pins (transmission delay and how it changes)
- Timing constraints: setup/hold, etc.
Therefore, once the timing arcs annotate the whole design, calculating path delay involves summing all net arcs and cell arcs.
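The summation can be illustrated on a single path. The arc names and delay values below are hypothetical; in a real flow the cell arc delays would come from NLDM lookups and the net arc delays from the interconnect model.

```python
def path_delay(arcs):
    """Sum the delays of all cell arcs and net arcs along one timing path."""
    return sum(delay for _kind, _name, delay in arcs)


# Hypothetical input-to-output path; delays in ps.
path = [
    ("net",  "IN -> U1/A",   10),
    ("cell", "U1/A -> U1/Z", 120),
    ("net",  "U1/Z -> U2/A", 30),
    ("cell", "U2/A -> U2/Z", 80),
    ("net",  "U2/Z -> OUT",  10),
]

total = path_delay(path)
cell_part = path_delay([arc for arc in path if arc[0] == "cell"])
print(total, cell_part)
```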
I2O#
The first type of timing path is from the input port to the output port.
The transition time from the input port to the first load cell needs special handling, i.e., the transition time (or slew) at the input of the first inverter can be specified; if no such specification is made, it is assumed to be 0 (equivalent to the ideal case).
- In OpenSTA, if not specified, the load slew of the first cell is 0. The root-to-first-cell load delay and load slew can be calculated during seedRootSlew.
Additionally, an equivalent capacitance can be calculated based on the RC load conditions at the output of the first cell, allowing us to look up the first cell's delay and output slew.
Once the output slew of the first cell is calculated, the input slew of the next unit can be obtained, and this process continues in a loop.
Note that, similar to the first-stage input, the last-stage output needs a load set manually with set_load; otherwise, only the wire load of net N3 will be used.
I2R#
Similar calculations apply.
R2R#
Similar calculations apply.
Timing Graph#
STA breaks a design down into timing paths, calculates the signal propagation delay along each path, and checks for violations of timing constraints inside the design and at the input/output interface.
Timing Path#
Timing paths have start and end points, defined as follows:
Based on the start point and endpoint, timing paths can be divided into four categories:
- Input port to d pin, I2R
- Clk pin to output port, R2O
- Clk pin to d pin, R2R
- Input port to output port, I2O
A timing path is a sequence of timing arcs. Timing paths can also be classified by signal type or timing check: Data path, Clock path, Clock-gating path, Asynchronous path.
Timing Graph#
Consider the following netlist:
Convert the above circuit to a 'Directed Acyclic Graph (DAG)' shown below:
OpenSTA Timing Graph#
The timing graph is a flat DAG, although OpenSTA has a full hierarchical netlist.
The following netlist is an example:
Convert it to a timing graph:
Vertices are defined as: Each vertex corresponds to one network pin.
- Includes internal pins (not shown in the diagram).
Edges are defined as: There is one edge between each pair of pins that has a timing path between them.
Each edge has its own timing role: it may represent cell delay or wire delay, or various types of timing analysis.
Additionally, a set of timing arcs is stored on each edge: A timing arc set is a group of related timing arcs between a pair of cell ports. Wire timing arcs are a special set owned by the TimingArcSet class.
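To make the vertex and edge definitions concrete, here is a small sketch of a pin-level timing graph. It is not OpenSTA's actual data structure; the class names, the role strings, and the pin names (`in1`, `u1/A`, `r1/D`, and so on) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    src: str   # pin name of the source vertex
    dst: str   # pin name of the destination vertex
    role: str  # "wire", "combinational", or "setup" in this sketch


@dataclass
class TimingGraph:
    vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, src, dst, role):
        # One vertex per pin, one edge per pin pair that has a timing arc.
        self.vertices.update((src, dst))
        self.edges.append(Edge(src, dst, role))

    def fanout(self, pin):
        return [e.dst for e in self.edges if e.src == pin]


def build_example():
    g = TimingGraph()
    g.add_edge("in1", "u1/A", "wire")            # net arc
    g.add_edge("u1/A", "u1/Z", "combinational")  # cell arc
    g.add_edge("u1/Z", "r1/D", "wire")           # net arc
    g.add_edge("r1/CK", "r1/D", "setup")         # timing check arc
    return g
```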
Timing Analysis Methods#
Based on the analysis method, timing analysis can be divided into Path-Based and Graph-Based, with the main difference being how the transition time of specific logic units is handled.
In reality, during the operation of the circuit, the input level transition time received by a logic unit is influenced by the preceding logic unit.
The input transition at pin C depends on the output transition of the preceding logic unit, and the output transition caused by different input pin transitions varies, thereby affecting the input transition at C. How to determine the level transition time at C is the difference in timing analysis algorithms.
Graph-Based#
Graph-based static timing analysis (GBA) is the default analysis mode for most tools. When reading cell delays from the standard cell library, it calculates with the worst-case transition time. In the example above, regardless of how A and B transition, the maximum transition time at C is taken, for example 12ps. So even for a timing path in which only pin B changes, where the blue NOR gate would actually see the 9ps transition caused by pin B, GBA still uses 12ps. GBA therefore tends to be pessimistic and may report timing violations on some paths that would not occur in practice, since real transitions do not make every logic unit experience its worst-case transition time. To reduce this pessimism and improve accuracy, path-based static timing analysis (Path-Based Analysis, PBA) was introduced.
Path-Based#
PBA adopts a path-based timing analysis method, analyzing all timing paths.
Compared to GBA, PBA traverses all possible timing paths and, in theory, enumerates all possible input transition combinations, which gives the most accurate timing results. In the example above, if a transition occurs at pin B, PBA uses the 9ps transition caused by pin B to calculate the delay of the following blue NOR gate. However, because PBA traverses many more scenarios than GBA, its runtime is significantly longer; in complex cases, PBA may be an order of magnitude slower than GBA.
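The difference in slew handling can be sketched with the 12ps and 9ps numbers from the example. Two things here are assumptions for illustration only: that the 12ps transition at C is the one caused by pin A, and the linear `gate_delay_ps` model for the downstream gate.

```python
# Transition time at pin C, keyed by the input pin that caused it (ps).
# Attributing 12 ps to pin A is an assumption; the text only ties 9 ps to pin B.
slew_at_c_ps = {"A": 12.0, "B": 9.0}


def gate_delay_ps(input_slew_ps):
    # Hypothetical delay model of the gate driven by C.
    return 20.0 + 0.5 * input_slew_ps


def gba_delay():
    # GBA merges slews at C and keeps only the worst one.
    return gate_delay_ps(max(slew_at_c_ps.values()))


def pba_delay(through_pin):
    # PBA uses the slew that belongs to the path being analyzed.
    return gate_delay_ps(slew_at_c_ps[through_pin])
```

For a path through pin B, the PBA delay is never larger than the GBA delay, which is the pessimism described above.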
GBA vs PBA#
For the same combinational design, GBA vs PBA is shown below:
min_delay_in_GBA <= min_delay_in_PBA
max_delay_in_GBA >= max_delay_in_PBA
In GBA (Graph-Based Analysis), instead of keeping the 2 combinations of AND gate (1) delay separate, i.e., (Combination_1: 0.5ns, 1.5ns; Combination_2: 0.2ns, 1.2ns), we choose the extreme boundaries, i.e., min delay = 0.2ns and max delay = 1.5ns.
In PBA (Path-Based Analysis), we use the actual delay between the input pin and the output, which means keeping both combinations of delay:
- Combination_1: 0.5ns, 1.5ns
- Combination_2: 0.2ns, 1.2ns
You might be thinking that this is not accurate (why did GBA drop 2 values?) and that we are adding unnecessary delay to our calculation. And I am glad to say that you are right. :) The reason for doing this is that, from the tool's point of view, analysis in GBA mode is very fast compared to PBA, so tool runtime is much lower. The only cost is that we add pessimism to our calculation.
- GBA is faster than PBA
- GBA is more pessimistic than PBA
Based on these characteristics, GBA and PBA have different uses in static timing analysis. GBA gives a fast but rough analysis. Because GBA is pessimistic, if it reports no violations, PBA should report none either. If GBA does report violations, we can then run PBA, but there is no need to analyze all timing paths again; we only need to analyze the paths that violated in GBA mode (of course, a global PBA can also be performed).
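The two inequalities can be checked against the AND gate numbers quoted above. This is only a sketch of the bounding behaviour; the dictionary layout and function names are made up for illustration.

```python
# (min delay, max delay) in ns for each combination from the example.
combinations = {
    "Combination_1": (0.5, 1.5),
    "Combination_2": (0.2, 1.2),
}


def gba_bounds(combos):
    # GBA keeps only the extreme boundaries across all combinations.
    mins = [lo for lo, _ in combos.values()]
    maxs = [hi for _, hi in combos.values()]
    return min(mins), max(maxs)


def pba_bounds(combos, name):
    # PBA keeps the delay pair of the combination actually on the path.
    return combos[name]
```

For either combination, the GBA minimum is no larger than the PBA minimum and the GBA maximum is no smaller than the PBA maximum.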
GBA Delay Calculation#
Each cell arc takes its extreme value, whether for rise or fall and for min or max. This makes the calculation simpler and faster, but because it is more pessimistic, it is less accurate.
PBA Delay Calculation#
PBA will exhaust all arc combinations along a timing path.
Principles of Graph-Based Static Timing Analysis#
Assume that all latches receive the rising clock edge at the same time (i.e., ignore the clock skew caused by layout). Under this simplification, the STA problem reduces to finding, in a directed graph, how far each timing endpoint is from the furthest timing start point (this distance is the Arrival Time, the delay of the signal from the source to a given node). This is the longest path problem (refer to: How to find the longest path in a directed acyclic graph?; here it is solved for multiple sources and multiple sinks), as shown in the figure below. Put simply, the algorithm starts from all start points, traverses all nodes, and updates the farthest distance of each node from the start points:
$$ArrivalTime[i] = \max_{j}\left(ArrivalTime[Predecessor[i, j]] + CellDelay[j] + NetDelay[i, j]\right)$$
Where i is the current node number, $Predecessor[i, j]$ refers to the j-th predecessor node number of node i, and $ArrivalTime[Predecessor[i, j]]$ is the time when the source signal reaches that predecessor node, $CellDelay[j]$ is the logic delay of the predecessor node, and $NetDelay[i,j]$ is the routing delay from the predecessor node to node i, which can be obtained based on the coordinates of the two nodes.
Netlist Partitioning#
Since timing needs to be reassessed whenever the layout changes, the multi-source multi-sink longest path algorithm is called frequently while the layout algorithm runs. The recursive formula above shows that, to calculate $ArrivalTime[i]$ for a node, the ArrivalTime of its predecessor nodes must already be known; otherwise those values cannot be guaranteed to be the longest. Circuit partitioning enables parallel computation: the nodes are colored and partitioned so that the nodes within each block are independent of each other (i.e., there are no edges between them, and the block does not have to be connected). The Arrival time of all nodes in one block can then be computed in parallel.
To achieve this partitioning, the basic principle is to store all timing start points in a queue and mark them as level=0. BFS then proceeds from these nodes, updating each node's farthest distance (in edges) from the timing start points and recording it as the node's level.
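A minimal sketch of this levelization is shown below. It assumes that the timing start points are exactly the nodes with no predecessors and that `successors` has an entry for every node (an empty list for endpoints). A node is only queued once all of its predecessors have been visited, so its level ends up being the longest edge count from any start point.

```python
from collections import deque


def levelize(successors):
    """Return {node: level}, where level is the longest edge count from a start point."""
    indegree = {node: 0 for node in successors}
    for succs in successors.values():
        for s in succs:
            indegree[s] += 1

    level = {node: 0 for node in successors}
    # Start points (no predecessors) enter the queue at level 0.
    queue = deque(node for node, d in indegree.items() if d == 0)
    while queue:
        node = queue.popleft()
        for s in successors[node]:
            level[s] = max(level[s], level[node] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:  # all predecessors done, level is final
                queue.append(s)
    return level
```

Because every edge goes from a lower level to a strictly higher one, no two nodes in the same level are connected.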
Timing Propagation Based on Layered Synchronous BFS#
Forward BFS to Calculate Arrival Time#
Note that, in the layered structure, the nodes in level=i have two properties: (1) except for the special case of level=0 (start nodes at level=0 are forced to ArrivalTime=0 and are not calculated), there are no edges between them, so their timing calculations do not depend on each other; (2) if ArrivalTime is calculated in order from level=0 upward, then by the time level=i is reached, all nodes in levels 0 to i-1 have already completed their ArrivalTime calculations. All nodes at level=i can therefore be calculated simultaneously, and a parallelization framework such as OpenMP can be used for acceleration. We initialize every node's Arrival time to 0 and then run the longest path algorithm above to derive the timing information in the following diagram (the leftmost node in the third level should have an arrival time of 10):
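The level-ordered forward pass can be sketched as below. The function name and argument layout are assumptions: `levels` is a list of node lists from level 0 upward, `cell_delay` is keyed by node, and `net_delay` is keyed by `(predecessor, node)`. The inner loop is written serially here; it is the part that could be handed to a parallel framework, since nodes in one level do not depend on each other.

```python
def propagate_arrival(levels, predecessors, cell_delay, net_delay):
    """Forward pass: arrival time of every node, processed level by level."""
    # Every node starts at 0; level-0 start points keep that value.
    arrival = {node: 0.0 for level in levels for node in level}
    for level in levels[1:]:
        for node in level:  # independent within a level
            arrival[node] = max(
                arrival[p] + cell_delay[p] + net_delay[(p, node)]
                for p in predecessors[node]
            )
    return arrival
```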
Backward BFS to Calculate RAT#
After the forward timing analysis in the previous section, we know how far each endpoint is from the start points. However, designers may have a different constraint for each endpoint; they may want certain signals to arrive at specific timing endpoints earlier. This leads to the concept of timing slack, the required arrival time minus the arrival time:
$$TimingSlack = RequiredArrivalTime - ArrivalTime$$
If Timing Slack is less than 0, the signal arrived late, resulting in a timing violation. Here, Arrival Time is the value calculated once forward timing propagation is complete. Typically, for timing endpoints, the Required Arrival Time (RAT) is the clock period minus the setup time (Clock Period - Setup Time). For intermediate timing nodes, however, designers usually do not set a RAT. Therefore, STA performs backward timing propagation so that every node other than the endpoints knows how early it is actually required to receive its signal. This is similar to project management, where each task needs to know its own deadline so that it does not cause delays for the teams that depend on it.
The basic method for backward timing propagation is similar to forward propagation, except that the forward formula was:
$$ArrivalTime[i] = \max_{j}\left(ArrivalTime[Predecessor[i, j]] + CellDelay[j] + NetDelay[i, j]\right)$$
While the backward formula is:
$$RequiredArrivalTime[i] = \min_{j}\left(RequiredArrivalTime[Successor[i, j]] - NetDelay[i, j] - CellDelay[i]\right)$$
Where $RequiredArrivalTime[i]$ is the RAT of node i, and $RequiredArrivalTime[Successor[i, j]]$ refers to the RAT of the j-th successor node of node i, $NetDelay[i,j]$ is the routing delay between the two nodes, and $CellDelay[i]$ is the logic delay of node i.
Based on the transformation of the above formula, we also need to re-partition the netlist for backward propagation, as shown in the figure below:
We initialize all nodes' RAT to infinity, except for endpoint nodes, whose RAT is set by the designer. We then run the same level-by-level propagation in the backward direction, taking the minimum over successors, to derive the RAT information in the following diagram, where we assume all endpoint RATs are 20:
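A sketch of the backward pass follows. As a simplification it walks the forward levels in reverse order instead of building a separate backward partition; that is still a valid order, because every successor sits in a higher forward level and is therefore processed first. The function names and argument layout are assumptions, with `net_delay` keyed by `(node, successor)`.

```python
def propagate_required(levels, successors, cell_delay, net_delay, endpoint_rat):
    """Backward pass: required arrival time (RAT) of every node."""
    required = {node: float("inf") for level in levels for node in level}
    required.update(endpoint_rat)  # endpoint RATs come from the designer
    for level in reversed(levels):
        for node in level:
            for s in successors.get(node, []):
                candidate = required[s] - net_delay[(node, s)] - cell_delay[node]
                required[node] = min(required[node], candidate)
    return required


def timing_slack(required, arrival):
    # Negative slack means the signal arrives later than required.
    return {node: required[node] - arrival[node] for node in required}
```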
Incremental Timing Analysis#
If local adjustments are made during layout, there is no need to perform global STA, as global analysis is slow and inefficient, and many nodes' timing may not change. In this case, our forward and backward layers do not need to change; we only need to reinsert all predecessor and successor nodes of the nodes that have changed into the BFS process above to achieve fast incremental timing analysis.
When the red node is moved, only the nodes covered by blue and orange need to be re-analyzed.
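One way to picture which nodes need re-analysis is to collect the fan-out cone of the moved node (whose arrival times may change) and its fan-in cone (whose required times may change). This is a sketch under that interpretation of the figure, not a description of any particular tool; the function names are hypothetical.

```python
def cone(start, neighbors):
    """All nodes reachable from start by following neighbors transitively."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nodes_to_update(moved, predecessors, successors):
    return {
        # Arrival times change at the moved node and everything downstream.
        "arrival": cone(moved, successors) | {moved},
        # Required times change at the moved node and everything upstream.
        "required": cone(moved, predecessors) | {moved},
    }
```

All other nodes keep their previously computed values, and the levels themselves do not need to be rebuilt.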
Timing Analysis#
Timing analysis mainly focuses on setup and hold violations, corresponding to the worst and best cases, respectively. Additionally, it is essential to master the usage of the following commands:
- set_input_delay
- set_output_delay
- set_drive, set_driving_cell, and set_input_transition
- set_load
Setup#
Setup time is the minimum time that input data must remain stable before the active clock edge. Note: it is measured as the interval from the latest data signal crossing its threshold (usually 50% of Vdd) to the active clock edge crossing its threshold (usually 50% of Vdd).
Before the active edge of the clock reaches the flip-flop, the data should remain stable for a certain time, which is the setup time of the flip-flop, ensuring that the data is reliably captured by the flip-flop.
- Note that setup time checks allow launch and capture to belong to different clock domains.
Setup Case 1#
Taking the following diagram as an example, the clock CLKM period is $T_{cycle}$.
- For the launch path: the time from clock CLKM to the clock pin of flip-flop UFF0 ($T_{launch}$) + the propagation delay of flip-flop UFF0 ($T_{ck2q}$) + the data path delay ($T_{dp}$)
- For the capture path: the time from clock CLKM to the clock pin of flip-flop UFF1 ($T_{capture}$) + the clock period ($T_{cycle}$)
Since the capture setup time constraint requires that the data signal must be stable at least one setup time ahead of the clock signal, the following formula must be satisfied:
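Written out with the symbols defined above, plus $T_{setup}$ for the setup time of UFF1 (a symbol introduced here for illustration), the constraint for this case takes the following form:

$$T_{launch} + T_{ck2q} + T_{dp} \le T_{capture} + T_{cycle} - T_{setup}$$

The left side is when the data becomes valid at the D pin of UFF1, and the right side is the latest moment at which it may still change.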
Arrival and Required in Setup#
It is known that the setup check must satisfy: (the above diagram is a specific case, the following formula is a generic formula)
Thus, the definitions of required and arrival time are as follows:
- Required time: capture path delay
- Arrival time: launch path delay
Since slack needs to be >= 0, the following formula holds:
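As a sketch, the setup check for the case above can be evaluated as slack = required - arrival. The function and parameter names are assumptions; here the setup time and an optional clock uncertainty are folded into the required time, which is one common way to present it.

```python
def setup_slack(t_launch, t_ck2q, t_dp, t_capture, t_cycle, t_setup, uncertainty=0.0):
    """Setup slack in the same time unit as the inputs; negative means a violation."""
    arrival = t_launch + t_ck2q + t_dp                       # launch path
    required = t_capture + t_cycle - t_setup - uncertainty   # capture path
    return required - arrival
```

For example, with an ideal clock (both clock delays 0), a 10ns cycle, 0.5ns clock-to-Q, 3ns data path, and 0.5ns setup time, the slack is 6ns.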
R2R Setup Check#
Analyzing the following timing report:
- Start point & end point are both flip-flops, triggered by the rising edge of clock CLKM.
- Path group: determined by capture ff.
- Path type: max, i.e., setup time check.
- Clock network delay is zero as it's an ideal clock network.
- i.e., $T_{launch}$ and $T_{capture}$ are zero.
- Clock uncertainty
- Jitter
- Setup time
Clock Network Delay#
What is the clock network delay in the timing report? Why is it marked as ideal? This line in the timing report indicates that the clock tree is considered ideal, and any buffers in the clock path are assumed to have zero delay. Once the clock tree is built, the clock network can be marked as "propagated," allowing the clock path to display actual delay values.
- Clock network delay is used to model the delay through the clock path before the clock tree is established (i.e., before clock tree synthesis). Once the clock tree is established and marked as "propagated," this clock network delay constraint is ignored. The set_clock_latency command can also be used to model the delay from the main clock to its derived clocks.
Additionally, if there is an explicit clock tree, i.e., a clock buffer is inserted:
The delay of the first cell depends on its input transition, which therefore needs to be specified explicitly using set_drive, set_driving_cell, or set_input_transition; otherwise, its input transition is assumed to be 0.
Additionally, clock source latency, i.e., insertion delay, is the delay from the clock source to the clock definition point in the DUA. It can be set using the command set_clock_latency -source.
- Without the -source option, this command sets the clock network delay instead.
I2R Setup Check#
- Set external input delay relative to virtual clock or actual clock: set_input_delay
The timing path from the input port to the register can be triggered by a virtual clock or an actual clock, as follows:
This can be viewed as a virtual flip-flop driving the design input port INA, clocked by VIRTUAL_CLKM. The maximum delay from the clock pin of this virtual flip-flop to the input port INA is specified as 2.55ns, which the report displays as input external delay.
Input delay can also be specified relative to the actual clock, and it does not necessarily have to be specified relative to the virtual clock. The actual clock can be an internal pin in the design or a clock on the input port.
R2O Setup Check#
- set_output_delay
- set_load
Similar to the input port constraints mentioned above, output ports can be constrained relative to a virtual clock or an internal clock in the design, or they can be constrained relative to an actual input clock port or output clock port.
To determine the delay of the last unit connected to the output port, the load at that port must be specified; the set_load command is used to specify the output load. Note that port ROUT may have a load internally in DUA, while the set_load constraint specifies the additional load, i.e., the load from outside DUA.
Note that in the R2O path, the required time for the setup check at its endpoint is calculated as $T_{period} - T_{output}$.
I2O Setup Check#
The design may also have pure combinational logic paths from the input port to the output port.
Hold#
Hold time is the minimum time that input data must remain stable after the active clock edge. It is measured as the interval from the active clock edge crossing its threshold to the earliest data signal crossing its threshold.
Hold time checks ensure that the changing output value of the launch flip-flop does not propagate to the capture flip-flop and overwrite its input before the capture flip-flop has had a chance to capture the original value.
Hold time violations are analyzed against the fastest launch path, requiring that the fastest signal arriving at the D pin must also remain stable relative to the clock signal for at least one hold time. Thus, the formula is:
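Reusing the symbols from Setup Case 1 and writing $T_{hold}$ for the hold time of the capture flip-flop (a symbol introduced here for illustration), the hold constraint can be written as follows, where $T_{ck2q}$ and $T_{dp}$ are taken along the fastest launch path and both clock delays refer to the same clock edge:

$$T_{launch} + T_{ck2q} + T_{dp} \ge T_{capture} + T_{hold}$$

Unlike the setup check, the clock period $T_{cycle}$ does not appear, because launch and capture are measured against the same edge.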
Required and Arrival in Hold Check#
Since launch delay is arrival, and capture is required time, and the hold check requires that arrival time must be later than required time, therefore:
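A small sketch of the hold check, with slack defined as arrival minus required so that a non-negative value means the check passes. The function and parameter names are assumptions, and the delays are meant to be the minimum (fastest) values.

```python
def hold_slack(t_launch, t_ck2q, t_dp, t_capture, t_hold):
    """Hold slack in the same time unit as the inputs; negative means a violation."""
    arrival = t_launch + t_ck2q + t_dp  # earliest data change at the D pin
    required = t_capture + t_hold       # data must stay stable until this time
    return arrival - required
```

A late capture clock (large `t_capture`) makes the hold check harder to meet, which is the opposite of its effect on the setup check.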
Hold Time Check#
Hold time violations are generally analyzed after CTS.
Other Types of Analysis#
Slew Analysis#
Two Types of Slew/Transition Analysis.
- Data (max/min)
- Clock (max/min)
Load Analysis#
Two Types of Load Analysis
- Fanout (max/min)
- Capacitance (max/min)
Clock Analysis#
Two Types of Clock Analysis
- Skew: The difference between the latencies (L1, L2, L3, L4, etc.) is referred to as skew.
- Pulse Width: This type of analysis is performed due to the parasitic elements in the clock network path, and we need to see up to which point the pulse width gets degraded.
Interconnect Parasitics#
Nets are typically single driver and multi-load. After physical implementation, nets can move across multiple metal layers on the chip, with each metal layer having different resistance and capacitance values.
For equivalent electrical representation, nets are typically divided into multiple segments, each represented by equivalent parasitic parameters. We also refer to segments as interconnect traces, which are parts of the network on a specific metal layer.
Interconnect RLC#
Interconnect RC is caused by nets traversing different metal layers, including:
- Interconnect resistance (R) comes from various metal layers and vias in the design implementation. We can view interconnect resistance as the resistance between the output pin of a unit and the input pin of the fan-out unit.
- Interconnect capacitance (C) also comes from metal traces, including ground capacitance and capacitance between adjacent signal paths.
- Interconnect inductance (L) is not considered.
Ideally, the resistance and capacitance (RC) of a part of the interconnect trace are represented by a distributed RC tree.
Additionally, a simplified method can be used to model the RC tree.
T Model#
Π Model#
WLM#
Before physical implementation, wire load models (WLM) can be used to estimate the capacitance, resistance, and area overhead caused by interconnects. Wire load models can be used to estimate the length of the network based on the number of fanouts; wire load models depend on the area of the block. Designs with different areas can choose different wire load models. Wire load models can also map the estimated length of the network to resistance, capacitance, and the corresponding area overhead caused by routing.
- Wire load models are used to estimate line lengths based on fanout and obtain corresponding RC and area overhead.
- Wire load models are selected by block area. As the area of the block increases, the routing also grows.
For different areas (chips or blocks), different wire load models are typically used to determine parasitic effects.
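The fanout-to-length-to-RC mapping can be sketched as follows. Every number in the table is made up for illustration and does not come from any real library; the extrapolation beyond the table with a fixed slope is likewise an assumption of this sketch.

```python
# Hypothetical wire load model: fanout -> estimated net length.
FANOUT_LENGTH = [(1, 1.0), (2, 2.0), (3, 3.5), (4, 5.0)]
SLOPE = 1.5       # extra length per fanout beyond the last table entry
R_PER_UNIT = 0.1  # resistance per unit length
C_PER_UNIT = 0.2  # capacitance per unit length


def wire_length(fanout):
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    table = dict(FANOUT_LENGTH)
    if fanout in table:
        return table[fanout]
    # Beyond the table, extrapolate linearly from the last entry.
    max_fanout, max_length = FANOUT_LENGTH[-1]
    return max_length + SLOPE * (fanout - max_fanout)


def wire_rc(fanout):
    """Estimated (resistance, capacitance) of a net with the given fanout."""
    length = wire_length(fanout)
    return length * R_PER_UNIT, length * C_PER_UNIT
```

A larger block would use a different table with longer estimated lengths for the same fanout.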
Specifying Wire Load Models (TBD)#
todo
Wire load models can be specified using the command set_wire_load_model, and the wire load mode can be specified using set_wire_load_mode.
Interconnect Trees (TBD)#
todo
- What is the difference between T/Pi model and RC tree?
The interconnect delay from the driver pin to the load pin depends not only on the RC values but also on the interconnect structure.
Refs#
- Static Timing Analysis Bible Translation Plan - Summary
- How to learn static timing analysis in digital circuits?
- Static Timing Analysis Simplified Tutorial (Part 1) - CSDN Blog
- GitHub - Gogireddyravikiran/Static-Timing-Analysis
- Static Timing Analysis and Modeling of Integrated Circuits (Douban)
- Basics of Static Timing Analysis in Digital Integrated Circuits - Bilibili
- Path Base Analysis (PBA) Vs Graph Base Analysis (GBA) - part1 | VLSI Concepts
- Design VLSI EDA (5): Timing Analysis of Your Circuit Reflection Arc Length
- sta_basics_course/doc/sta_basics_course.rst at master · brabect1/sta_basics_course · GitHub