Monday, 7 October 2013

5 Tips to Help You Finish Your Low Power Design Tapeout On Time

So you're about to start your first low power design. Or second, third, or fourth. As with many tapeouts, you know that with today's tight market windows, most likely the project will go off with a sprinting start (architectural planning), followed by an endurance test (designing and implementing), then a final mad dash towards the finish line (signoff closure and tapeout).
First, the bad news - given the complexities of today's design requirements and the swiftness in which the technology market moves, the project crunch noted above is still going to happen. The good news? If you're implementing a low power design, there are a few things you could do to reduce last-minute problems.
These tips apply mostly to power-domain based designs that use techniques such as power shutoff (PSO), multiple supply voltages (MSV), and dynamic voltage-frequency scaling (DVFS). However, some apply to non-power domain designs too.
1. Check your low power library availability! If you are well into the physical optimization stage of your design flow, you're counting on doing always-on net optimization, then realize "uh oh, I have no always-on buffers," that translates to at least 1-2 weeks of schedule delay. Obviously you'd want to check for library requirements early in the project, but sometimes for low power designs the requirements aren't that obvious. So here's a short list of the priorities:
  • If you are doing an MSV or DVFS design, check for the availability of multiple supply voltage-characterized libraries. Sure, you could use k-factors to extrapolate delay characteristics based on different voltages, but that's a very risky practice due to inaccuracies.
  • Check for level shifters, isolation cells, power switches (headers of footers, depending on which on you are planning to use), and of course, always-on buffers and state retention cells for PSO designs if you plan on using them.
2. Plan to use at least RTL simulation vectors for your power analysis. Vector-less power analysis is okay for estimation purposes, but at some point you'll have to switch to using vectors. Now, getting gate-level activity vectors for your design might be a bit hard since that only comes after doing gate-level simulation. But, RTL simulation vectors are typically available much earlier.
The old saying "garbage in, garbage out" applies here. The quality of your power analysis is completely dependent on the quality of the activity vectors you are feeding it. If that doesn't scare you enough, think about where this information is used: besides determining whether your design will meet power consumption specs and also fit within the packaging selected for your design, this information is also used as a basis for measuring dynamic and static IR drop, electromigration and other electrical problems that might come back and bite you if not taken care of early.

3. Try to test out your clock trees before finalizing your floorplan. This is helpful especially for power domain based designs. As we know, power domain definitions place restrictions on your floorplan in terms of placement, optimization and other factors. If, for example, your clock tree root starts in a power domain that's physically far away from your PLL, you can be sure that there will be a lot of buffers added in between, which means a much higher latency.
Also, clocks that exit one power domain and enter another power domain might be affected by the power domain layout in terms of skew and transition time. So, by doing at least a trial clock tree synthesis run before you finalize your floorplan, you should be able to catch problems like this early on, and fix it before your floorplan is finalized.

4. Don't over-constrain (too much) on IR drop requirements. Let's face it: the reality is we always over-constrain our designs. We over-constrain on timing to leave us some margin towards the signoff stage, and we over-constrain on IR drop so that we'll be able to meet the IR drop requirements of the library even if we take into account some variation between implementation and signoff. The main reason for IR drop requirements is that library cell performance degrades in accordance to IR drop, so too much IR drop may lead to the design not meeting timing even though STA thinks it does.
Library providers usually build in a little margin when specifying IR drop requirements, and it's perfectly normal for designers to add another layer of margin to that when implementing. The problem comes when expectations are unrealistic for a given design. For power shutoff designs, power switches usually cause some additional IR drop to that power domain. One way to decrease IR drop is to increase the number of power switch cells, but that's a double-edged sword because additional power switches lead to more area and more leakage power, which will ultimately negate the effect of having power switches in the first place. So, you can see how we could potentially shoot ourselves in the foot if we specify an unrealistic IR drop constraint.

5. Plan out your high fanout always-on nets. Planning out high fanout nets in general is a good practice for any design, but this applies even more to power shutoff designs if they have always-on high-fanout nets (hint - they usually do). Power switch sleep enable nets, SRPG sleep nets, and others would fall into this category. If you are planning to tap from nearby always-on power supplies to power the secondary power pins of the buffers for those nets, it's best that there actually is a nearby always-on power net available.
With that said, I hope this has been useful to all the folks out there designing for low power. I'm aware that this is not an all-inclusive list. Would anyone else like to share any pointers on low power implementation? Voice your comment below!

Sunday, 21 July 2013

Difference Between Launch and Capture Distances in an AOCV Analysis

In a path-based analysis, the distance of a path is the diagonal of the bounding box that encompasses all of the arcs in the path. In a graph-based analysis, an arc can be both launching and capturing. As a result, there are launch and capture distances. Maintaining separate launch and capture distances for arcs in a graph-based analysis vastly improves the accuracy of the results and allows closer correlation between the graph-based and path-based analyses.

The distinction between launch and capture distances can be best described using an example. In the schematic shown below, the BUF cell arc is treated as a capture arc. The cells that contribute to the bounding box for the BUFcell arc are highlighted in green. The launch and capture paths are shown with arrows. Note that the capture path passes through the BUF cell arc.
Capture Schematic?1292864004330
Figure 1: BUF Cell Arc Treated as a Capture Arc

In the schematic shown below, the BUF cell arc is treated as a launch arc. The cells that contribute to the bounding box for the BUF cell arc are highlighted in red. The launch and capture paths are shown with arrows. Note that thelaunch path passes through the BUF cell arc.
Launch Schematic?1292864004330
Figure 2: BUF Cell Arc Treated as a Launch Arc

You can examine the launch and capture AOCV distances and depths using the report_aocvm command. For example,

pt_shell> report_aocvm [get_timing_arcs -of U3]

Friday, 5 July 2013

8 Ways to Optimize Power Using Encounter Digital Implementation (EDI) System Quick Reference

Everyone knows that the increasing speed and complexity of today's designs implies a significant increase in power consumption, which demands better optimization of your design for power. I am sure lot of us must be scratching our heads over how to achieve this, knowing that manual power optimization would be hopelessly slow and all too likely to contain errors.

Here are 8 Top Things you need to know to optimize your design for power using the Encounter Digital Implementation (EDI) System.

Given the importance of power usage of ICs at lower and lower technology nodes, it is necessary to optimize power at various stages in the flow. This blog post will focus on methods that can be used to reach an optimal solution using the EDI System in an automated and clearly defined fashion. It will give clear and concise details on what features are available within optimization, and how to use them to best reach the power goals of the design.

Please read through all of the information below before making a decision on the right approach or strategy to take. It is highly dependent on the priority of low power and what timing, runtime, area and signoff criteria were decided upon in your design. With the aid of some or all of the techniques described in this blog it is possible to, depending on the design, vastly reduce both the leakage and dynamic power consumed by the design.
 
This is a one stop quick reference and not a substitute for reading the full document.

1) VT partition uses various heuristics to gather the cells into a particular partition. Depending on how the cells get placed in a particular bucket, the design leakage can vary a lot. The first thing is to ensure that the leakage power view is correctly specified using the "set_power_analysis_mode -view" command. The "reportVtInstCount -leakage" command is a useful check to see how the cells and libraries are partitioned. Always ensure correct partitioning of cells.

2) In several designs, manually controlling certain leakage libraries in the flow might give much better results than the automated partitioning of cells. If the VT partitioning is not satisfactory, or the optimization flow is found to use more LVT cells than targeted, selectively turn off cells of certain libraries particularly in initial part of the flow i.e. preRoute flow. The user should selectively set the LVT libraries to "don't use" and run preCts/postCts optimization. Depending on final timing QOR, another incremental optimization with LVT cells enabled may be needed.

3) Depending on the importance of leakage/dynamic power in the flow, the leakage/dynamic power flow effort can be set to high or low.
setOptMode -leakagePowerEffort {low|high}
setOptMode -dynamicPowerEffort {low|high}

If timing is the first concern, but having somewhat better leakage/dynamic power is desired, then select low. If leakage/dynamic power is of utmost importance, use high.

4) PostRoute Optimization typically works with all LVT cells enabled. In case of large discrepancy between preRoute and postRoute timings or if SI timing is much worse than base timing, postRoute optimization may overuse LVT cells. So it may be worthwhile experimenting with a two pass optimization, once with LVT cells disabled, and then with LVT cells enabled.

5) In order to do quick PostRoute timing optimization to clean up final violations without doing physical updates, use the following:
setOptMode -allowOnlyCellSwapping true
optDesign -postRoute 

This will only do cell swapping to improve timing, without doing physical updates. This is specifically for timing optimization and will worsen leakage.

6) Leakage flows typically have a larger area footprint than non-leakage flows. This is because EDI trades area with power, as it uses more HVT cells to fix timing to reduce leakage. This sometimes necessitates reclaiming any extra area during postRoute Opt to get better convergence in timing. EDI has an option to turn on area reclaim postRoute which is hold aware also and will not degrade hold timing.
setOptMode -postRouteAreaReclaim holdAndSetupAware

7) Running standalone Leakage Optimization to do extra leakage reclamation:
optLeakagePower
This may be needed if some of the settings have changed or if leakage flows are not being used.

8) PreRoute Optimization works with an extra DRC Margin of 0.2 in the flow. On some designs it is known to result in extra optimization causing more runtime and worse leakage. The option below is used to reset this extra margin in DRV fixing:
setOptMode -drcMargin -0.2

Remember to reset this margin for postRoute optimization to 0, as postRoute doesn't work with this extra margin of 0.2.  Note that the extra drcMargin is sometimes useful in reducing the SI effects, so by removing the extra margin, more effort may be needed to fix SI later in the flow.
I hope these tips help you achieve your power goals of your designs!

Backend (Physical Design) Interview Questions and Answers

Do you know about input vector controlled method of leakage reduction?
  • Leakage current of a gate is dependant on its inputs also. Hence find the set of inputs which gives least leakage. By applyig this minimum leakage vector to a circuit it is possible to decrease the leakage current of the circuit when it is in the standby mode. This method is known as input vector controlled method of leakage reduction.

How can you reduce dynamic power?
  • -Reduce switching activity by designing good RTL
  • -Clock gating
  • -Architectural improvements
  • -Reduce supply voltage
  • -Use multiple voltage domains-Multi vdd
What are the vectors of dynamic power?
  • Voltage and Current

If you have both IR drop and congestion how will you fix it?
  • -Spread macros
  • -Spread standard cells
  • -Increase strap width
  • -Increase number of straps
  • -Use proper blockage

Is increasing power line width and providing more number of straps are the only solution to IR drop?
  • -Spread macros
  • -Spread standard cells
  • -Use proper blockage

In a reg to reg path if you have setup problem where will you insert buffer-near to launching flop or capture flop? Why?
  • (buffers are inserted for fixing fanout voilations and hence they reduce setup voilation; otherwise we try to fix setup voilation with the sizing of cells; now just assume that you must insert buffer !)
  • Near to capture path.
  • Because there may be other paths passing through or originating from the flop nearer to lauch flop. Hence buffer insertion may affect other paths also. It may improve all those paths or degarde. If all those paths have voilation then you may insert buffer nearer to launch flop provided it improves slack.

What is the most challenging task you handled?
What is the most challenging job in P&R flow?
  • -It may be power planning- because you found more IR drop
  • -It may be low power target-because you had more dynamic and leakage power
  • -It may be macro placement-because it had more connection with standard cells or macros
  • -It may be CTS-because you needed to handle multiple clocks and clock domain crossings
  • -It may be timing-because sizing cells in ECO flow is not meeting timing
  • -It may be library preparation-because you found some inconsistancy in libraries.
  • -It may be DRC-because you faced thousands of voilations

How will you synthesize clock tree?
  • -Single clock-normal synthesis and optimization
  • -Multiple clocks-Synthesis each clock seperately
  • -Multiple clocks with domain crossing-Synthesis each clock seperately and balance the skew

How many clocks were there in this project?
  • -It is specific to your project
  • -More the clocks more challenging !

How did you handle all those clocks?
  • -Multiple clocks-->synthesize seperately-->balance the skew-->optimize the clock tree

Are they come from seperate external resources or PLL?
  • -If it is from seperate clock sources (i.e.asynchronous; from different pads or pins) then balancing skew between these clock sources becomes challenging.
  • -If it is from PLL (i.e.synchronous) then skew balancing is comparatively easy.

Why buffers are used in clock tree?
  • To balance skew (i.e. flop to flop delay)

What is cross talk?
  • Switching of the signal in one net can interfere neigbouring net due to cross coupling capacitance.This affect is known as cros talk. Cross talk may lead setup or hold voilation.

How can you avoid cross talk?
  • -Double spacing=>more spacing=>less capacitance=>less cross talk
  • -Multiple vias=>less resistance=>less RC delay
  • -Shielding=> constant cross coupling capacitance =>known value of crosstalk
  • -Buffer insertion=>boost the victim strength

How shielding avoids crosstalk problem? What exactly happens there?
  • -High frequency noise (or glitch)is coupled to VSS (or VDD) since shilded layers are connected to either VDD or VSS.
  • Coupling capacitance remains constant with VDD or VSS.

How spacing helps in reducing crosstalk noise?
  • width is more=>more spacing between two conductors=>cross coupling capacitance is less=>less cross talk

Why double spacing and multiple vias are used related to clock?
  • Why clock?-- because it is the one signal which chages it state regularly and more compared to any other signal. If any other signal switches fast then also we can use double space.
  • Double spacing=>width is more=>capacitance is less=>less cross talk
  • Multiple vias=>resistance in parellel=>less resistance=>less RC delay


How buffer can be used in victim to avoid crosstalk?
  • Buffer increase victims signal strength; buffers break the net length=>victims are more tolerant to coupled signal from aggressor.

 more questions comming soon... :-)

Challenges of 20nm IC Design

Saleem Haider, Synopsys interview....

Designing at the 20nm node is harder than at 28nm, mostly because of the lithography and process variability challenges that in turn require changes to EDA tools and mask making. The attraction of 20nm design is realizing SoCs with 20 billion transistors. Synopsys has re-tooled their EDA software to enable 20nm design.




20nm Geometries with 193nm Wavelength

Using immersion lithography the clever process development engineers have figured out how to resolve 20nm geometries using 193nm wavelength light, however to make these geometries yield now requires two separate masks, called Double Patterning Technology (DPT).

                                             Figure 1: Immersion Lithography

With DPT you have to split a single layer like Poly or Metal 1 onto two separate masks, then the exposures from the two masks are overlaid to produce that layer with 20nm geometries.

                           
                                               Figure 2: Double Patterning Technology (DPT)

Looking ahead to 14nm and smaller nodes this trend will continue with three or more patterns per layer.

When a mask layer is turned into two parts the process is called coloring, and the trick is to make sure that two adjacent geometries are on different colors.

                
                                                     Figure 3 :DPT Coloring

With DPT you have to make sure that your cell library and Place & Route tool are both DPT-compliant.

Often in your IC layout the DPT process will have to use stitching to accomodate via arrays:


This stitching will cause issues with line-end effects that in turn can degrade yield:

                

The earlier that you identify these issues, the sooner that you can make engineering trade-offs.

Foundries create layout rules at 20nm to specify how to produce high yield, and there are some 5,000 rules at this node.

Using DPT techniques will also cause a variation in capacitance values between adjacent nets caused by subtle shifts in the double masks.

                                  

DPT-Ready EDA Tools
Synopsys has updated their EDA tools to enable 20nm design, specifically:





Q&A

Q: Where can I read more about 20nm design with Synopsys tools?
A: Achronix did a paper at the Synopsys User Group, and they fabricated at Intel's custom foundry using FinFET technology.

Q: How popular is your DRC and LVS tool, IC Validator?
A: There have been 100 tapeouts in the past year for IC Validator tool.

Q: How many 20nm designs are there?
A: Test chips were done first last year, and now production designs are taping out with commercial foundries.

Q: How many mask layers require DPT in a 20nm design?
A: It depends on the foundry. First layer metal, maybe second layer of metal. As you relax the metal pitch, then you don't need DPT. Poly needs DPT.

Q: What about mask costs at 20nm with DPT?
A: It adds to the costs. It's always a trade off, the foundry can relax the pitches and void DPT usage.

Q: Which foundries have qualified 20nm with Synopsys tools?
A: TSMC, Samsung, GLOBALFOUNDRIES have qualified and endorse the Synopsys flow for 20nm.

Q: What can you tell me about your Custom IC design tools?
A: Our custom tools are also DPT aware, (SpringSoft, CiraNova, Custom Designer) - coming together.

Q: Why should I visit Synopsys at DAC?
A: We'll have live product demos, talk about advanced nodes, show emerging nodes, 14nm, 16nm, discuss new product features, and have special events. There is an IC Compiler luncheon where customers speak, and that's on Monday.

   more information at  http://www.synopsys.com/Solutions/EndSolutions/20nmdesign/Documents/20nm-and-beyond-white-paper.pdf 
           

Library Exchange Format (LEF)

Library Exchange Format (LEF) is a specification for representing the physical layout of an integrate circuit in an ASCII format. It includes design rules and abstract information about the cells. LEF is used in conjunction with Design Exchange Format (DEF) to represent the complete physical layout of an integrated circuit while it is being designed.
An ASCII data format, used to describe a standard cell library Includes the design rules for routing and the Abstract of the cells, no information about the internal netlist of the cells

A LEF file contains the following sections:

􀂄 Technology: layer, design rules, via definitions, metal capacitance
􀂄 Site: Site extension
􀂄 Macros: cell descriptions, cell dimensions, layout of pins and blockages, capacitances.

The technology is described by the Layer and Via statements. To each layer the following  attributes may be associated:
􀂄 type: Layer type can be routing, cut (contact), masterslice (poly, active), overlap.
􀂄 width/pitch/spacing rules
􀂄 direction
􀂄 resistance and capacitance per unit square
􀂄 antenna Factor

Layers are defined in process order from bottom to top


poly masterslice
cc cut
metal1 routing
via cut
metal2 routing
via2 cut
metal3 routing

Cut Layer definition


LAYER
layername
TYPE CUT
;
SPACING
Specifies the minimum spacing allowed between via cuts on the same net or different nets. This value can be overridden by the SAMENET SPACING statement (we are going to use this statement later)END layerName

Implant Layer definition

LAYER layerName
TYPE IMPLANT ;
SPACING minSpacing
END layerName
Defines implant layers in the design. Each layer is defined by assigning it a name and simple spacing and width rules. These spacing and width rules only affect the legal cell placements. These rules interact with the library methodology, detailed placement, and filler cell support.

Masterslice or Overlap Layer definition

LAYER layerName
TYPE {MASTERSLICE | OVERLAP} ;
Defines masterslice (nonrouting) or overlap layers in the design. Masterslice layers are typically polysilicon layers and are only needed if the cell MACROs have pins on the polysilicon layer.

Routing Layer definition

LAYER layerName
TYPE ROUTING ;
DIRECTION {HORIZONTAL | VERTICAL} ;
PITCH distance;
WIDTH defWidth;
OFFSET distance ;
SPACING minSpacing;
RESISTANCE RPERSQ value ;
Specifies the resistance for a square of wire, in ohms per square.  The resistance of a wire can be defined as RPERSQU x wire length/wire width
CAPACITANCE CPERSQDIST value ;
Specifies the capacitance for each square unit, in picofarads per square micron. This is used to model wire-to-ground capacitance.

Manufacturing Grid

MANUFACTURINGGRID value ;
Defines the manufacturing grid for the design. The manufacturing grid is used for geometry alignment. When specified, shapes and cells are placed in locations that snap to the manufacturing grid.

Via

VIA viaName
DEFAULT
TOPOFSTACKONLY
FOREIGN foreignCellName [pt [orient]] ;
RESISTANCE value ;
{LAYER layerName ;
{RECT pt pt ;} ...} ...
END viaName
Defines vias for usage by signal routers. Default vias have exactly three layers used:
 A cut layer, and two layers that touch the cut layer (routing or masterslice). The cut layer rectangle must be between the two routing or masterslice layer rectangles.

Via Rule Generate

VIARULE viaRuleName GENERATE
LAYER routingLayerName ;
{ DIRECTION {HORIZONTAL | VERTICAL} ;
OVERHANG overhang ;
METALOVERHANG metalOverhang ;
| ENCLOSURE overhang1 overhang2 ;}
LAYER routingLayerName ;
{ DIRECTION {HORIZONTAL | VERTICAL} ;
OVERHANG overhang ;
METALOVERHANG metalOverhang ;
| ENCLOSURE overhang1 overhang2 ;}
LAYER cutLayerName ;
RECT pt pt ;
SPACING xSpacing BY ySpacing ;
RESISTANCE resistancePerCut ;
END viaRuleName
Defines formulas for generating via arrays. Use the VIARULE GENERATE statement to cover special wiring that is not explicitly defined in the VIARULE statement.

Same-Net Spacing

SPACING
SAMENET layerName layerName minSpace [STACK] ; ...
END SPACING
Defines the same-net spacing rules. Same-net spacing rules determine minimum spacing between geometries in the same net and are only required if same-net spacing is smaller than different-net spacing, or if vias on different layers have special stacking rules.
Thesespecifications are used for design rule checking by the routing and verification tools.
Spacing is the edge-to-edge separation, both orthogonal and diagonal.

Site

SITE siteName
CLASS {PAD | CORE} ;
[SYMMETRY {X | Y | R90} ... ;] (will discuss this later in macro definition)
SIZE width BY height ;
END siteName

Macro

MACRO macroName
[CLASS
{ COVER [BUMP]
| RING
| BLOCK [BLACKBOX]
| PAD [INPUT | OUTPUT |INOUT | POWER | SPACER | AREAIO]
| CORE [FEEDTHRU | TIEHIGH | TIELOW | SPACER | ANTENNACELL]
| ENDCAP {PRE | POST | TOPLEFT | TOPRIGHT | BOTTOMLEFT | BOTTOMRIGHT}
}
;]
[SOURCE {USER | BLOCK} ;]
[FOREIGN foreignCellName [pt [orient]] ;] ...
[ORIGIN pt ;]
[SIZE width BY height ;]
[SYMMETRY {X | Y | R90} ... ;]
[SITE siteName ;]
[PIN statement] ...
[OBS statement] ...

Macro Pin Statement

PIN pinName
FOREIGN foreignPinName [STRUCTURE [pt [orient] ] ] ;
[DIRECTION {INPUT | OUTPUT [TRISTATE] | INOUT | FEEDTHRU} ;]
[USE { SIGNAL | ANALOG | POWER | GROUND | CLOCK } ;]
[SHAPE {ABUTMENT | RING | FEEDTHRU} ;]
[MUSTJOIN pinName ;]
{PORT
[CLASS {NONE | CORE} ;]
{layerGeometries} ...
END} ...
END pinName]

Macro Obstruction Statement

OBS
{ LAYER layerName [SPACING minSpacing | DESIGNRULEWIDTH value] ;
RECT pt pt ;
POLYGON pt pt pt pt ... ;
END

DIFFERENT TYPES OF FILE FORMATS AND THEIR MEANINGS IN VLSI..

There are different type of the files generated during a design cycle or data received by the library vendor/foundry. Few of them having specific extension. Just to know the extension, you can easily identity the type of content in that file.


 *.ddc - Synopsys internal database format. This format is recommended by Synopsys to hand gate-level netlists.File Extensions: 

*.v - Verilog source file. Normally it’s a source file your write. Design Compiler, and IC Compiler can use this format for the gate-level netlist.

*.vg, .g.v - Verilog gate-level netlist file. Sometimes people use these file extension to differentiate source files and gate-level netlists.

*.svf - Automated setup file. This file helps Formality process design changes caused by other tools used in the design flow. Formality uses this file to assist the compare point matching and verification process. This information facilitates alignment of compare points in the designs that you are verifying. For each automated setup file that you load, Formality processes the content and stores the information for use during the name-based compare point matching period.

 *.vcd - Value Change Dump format. This format is used to save signal transition trace information. This format is in text format, therefore, the trace file in this format can get very large quickly. There are tools like vcd2vpd, vpd2vcd, and vcd2saif switch back and forth between different formats.

*.vpd - VCD Plus. This is a proprietary compressed binary trace format from Synopsys. This file format is used to save signal transition trace information as well.

 *.saif - Switching Activity Interchange Format. It’s another format to save signal transition trace information. SAIF files support signals and ports for monitoring as well as constructs such as generates, enumerated types, records, array of arrays, and integers.

 *.tcl - Tool Command Language (Tcl) scripts. Tcl is used to drive Synopsys tools.

 *.sdc - Synopsys Design Constraints. SDC is a Tcl-based format. All commands in an SDC file conform to the Tcl syntax rules. You use an SDC file to communicate the design intent, including timing and area requirements between EDA tools. An SDC file contains the following information: SDC version, SDC units, design constraints, and comments. 

 *.lib - Technology Library source file. Technology libraries contain information about the characteristics and functions of each cell provided in a semiconductor vendor’s library. Semiconductor vendors maintain and distribute the technology libraries. In our case the vendor is Synopsys. Cell characteristics include information such as cell names, pin names, area, delay arcs, and pin loading. The technology library also defines the conditions that must be met for a functional design (for example, the maximum transition time for nets). These conditions are called design rule constraints. In addition to cell information and design rule constraints, technology libraries specify the operating conditions and wire load models specific to that technology.

 *.db - Technology Library. This is a compiled version of *.lib in Synopsys database format.

 *.plib - Physical Library source file. Physical libraries contain process information, and physical layout information of the cells. This information is required for floor planning, RC estimation and extraction, placement, and routing.

 *.pdb - Physical Library. This is a compiled version of *.plib in Synopsys database format.

 *.slib - Symbol Library source file. Symbol libraries contain definitions of the graphic symbols that represent library cells in the design schematics. Semiconductor vendors maintain and distribute the symbol libraries. Design Compiler uses symbol libraries to generate the design schematic. You must use Design Vision to view the design schematic. When you generate the design schematic, Design Compiler performs a one-to-one mapping of cells in the netlist to cells in the symbol library.

 *.sdb - Symbol Library. This is a compiled version of *.slib in Synopsys database format.

 *.sldb - DesignWare Library. This file contains information about DesignWare libraries.

 *.def - Design Exchange Format. This format is often used in Cadence tools to represent physical layout. Synopsys tools normally use Milkyway format to save designs.

 *.lef - Library Exchange Format. Standard cells are often saved in this format. Cadence tools also often use this format. Synopsys tools normally use Milkyway format for standard cells.

 *.rpt - Reports. This is not a proprietary format, it’s just a text format which saves generated reports by the tools when you use the automated makefiles and scripts.

 *.tf - Vendor Technology File. This file contains technology-specific information such as the names, characteristics (physical and electrical) for each metal layer, and design rules. These information are required to route a design.

 *.itf - Interconnect Technology File. This file contains a description of the process crosssection and connectivity section. It also describes the thicknesses and physical attributes of the conductor and dielectric layers.

 *.map - Mapping file. This file aligns names in the vendor technology file with the names in the process *.itf file.

 *.tluplus - TLU+ file. These files are generated from the *.itf files. TLUPlus models are a set of models containing advanced process effects that can be used by the parasitic extractors in Synopsys place-and-route tools for modeling.

 *.spef - Standard Parasitic Exchange Format. File format to save parasitic information extracted by the place and route tool.

 *.sbpf - Synopsys Binary Parasitic Format. A Synopsys proprietary compressed binary format of the *.spef. Size of the file shrinks quite a bit using this format.

*.mw( Milkyway database) The Milkyway database consists of libraries that contain information about your design. Libraries contain information about design cells, standard cells, macro cells, and so on. They contain physical descriptions, such as metal, diffusion, and polygon geometries. Libraries also contain logical information (functionality and timing characteristics) for every cell in the library. Finally, libraries contain technology information required for design and fabrication. Milkyway provides two types of libraries that you can use: reference libraries and design libraries. Reference libraries contain standard cells and hard or soft macro cells, which are typically created by vendors. Reference libraries contain physical information necessary for design implementation. Physical information includes the routing directions and the placement unit tile dimensions, which is the width and height of the smallest instance that can be placed. A design library contains a design cell. The design cell might contain references to multiple reference libraries (standard cells and macro cells). Also, a design library can be a reference library for another design library. The Milkyway library is stored as a UNIX directory with subdirectories, and every library is managed by the Milkyway Environment. The top-level directory name corresponds to the name of the Milkyway library. Library subdirectories are classified into different views containing the appropriate information relevant to the library cells or the designs. In a Milkyway library there are different views for each cell, for example, NOR1.CEL and NOR1.FRAM. This is unlike a .db formatted library where all the cells are in a single binary file. With a .db library, the entire library has to be read into memory. In the Milkyway Environment, the Synopsys tool loads the library data relevant to the design as needed, reducing memory usage. The most commonly used Milkyway views are CEL and FRAM. CEL is the full layout view, and FRAM is the abstract view for place and route operations.

 simv - Compiled simulator. This is the output of vcs. In order to simulate, run the simulator by ./simv at the command line.

 alib-52 - characterized target technology library. A pseudo library which has mappings from Boolean functional circuits to actual gates from the target library. This library provides Design Compiler with greater flexibility and a larger solution space to explore tradeoffs between area and delay during optimization.

Thursday, 4 July 2013

Difference Between CCS and NLDM

What is CCS and NLDM:

CCS stands for Composit Current Sourse Model, and NLDM stands for Non-Linear Delay Model. Both CCS & NLDM are delay models used in timing analyze.

Difference between CCS & NLDM:

  • NLDM uses a voltage source for driver modeling
  • CCS uses a current source for driver modeling

Why prefer CCS to NLDM:

The issues with NLDM modeling is that, when the drive resistance RD becomes much less than Znet(network load impedance), then ideal condition arises i.e Vout=Vin. Which is impossible in practical conditions.
So with NLDM modeling parameters like the cell delay calculation, skew calculation will be inaccurate.
That is the reason why we prefer CCS to NLDM

Designing a robust clock tree structure



Clock tree synthesis (CTS) is at the heart of ASIC design and clock tree network robustness is one of the most important quality metrics of SoC design. With technology advancement happened over the past one and half decade, clock tree robustness has become an even more critical factor affecting SoC performance. Conventionally, engineers focus on designing a symmetrical clock tree with minimum latency and skew. However, with the current complex design needs, this is not enough.

Today, SoCs are designed to support multiple features. They have multiple clock sources and user modes which makes the clock tree architecture complex. Merging test clocking with functional clocking and lower technology nodes adds to this complexity. Due to the increase in derate numbers and additional timing signoff corners, timing margins are shrinking. 

To meet the current requirements, designs that are timing friendly are needed and provide minimum power dissipation. This article describes the factors which a designer should consider while defining clock tree architecture. It presents some real design examples that illustrate how current EDA tools or conventional methodologies to design clock trees are not sufficient in all cases. A designer has to understanding the nitty -gritty of clock tree architecture to be able to guide an EDA tool to build a more efficient clock tree.  First, the basics of CTS and requirements for good clock tree are presented.

Clock tree quality parameters

The primary requirements for ideal synchronous clocks are:
  1. Minimum Latency – The latency of a clock is defined as the total time that a clock signal takes to propagate from the clock source to a specific register clock pin inside the design.  The advantages of building a clock with minimum latency are obvious – fewer clock tree buffers, reduced clock power dissipation, less routing resources and relaxed timing closure.
  2. Minimum skew – The difference in arrival time of a clock at flip-flops is defined as skew.  Minimum skew helps with timing closure, especially hold timing closure. However there is a word of caution - targeting  too aggressive minimum skews can be counterproductive because it may not help meeting hold timing but it can end up having other problems like increasing overall clock latency and increasing uncommon paths between registers in order to achieve minimum skew.
  3. Duty Cycle – Maintaining a good duty cycle for the clock network is another important requirement. Many sequential devices, like flash, require minimum pulse width on the input clock to ensure error-free operation. Moreover many IO interfaces like DDR and QSPI can work on both edges of clock.  A clock tree must be designed with these considerations and symmetrical cells having similar rise-fall delays should be used to build the clock tree.
  4. Minimum Uncommon path - The logically connected registers  must have minimum uncommon clock path. Timing derates are applied to the clock path to model process variations on the die.  Using a standard timing derates methodology, derates are applied only on uncommon path of launch and capture clock path because it is unlikely that common clock paths can have different process variations in launch and capture cycle. This concept is also called CRPR adjustment. The important concept is that a clock path should have minimum uncommon path between two connected registers.
     
Figure 1: Common and ucommon clock paths between two  registers
  1. Signal Integrity – Clock signals are more prone to signal integrity problems because of high switching activity. To avoid the effect of noise and to avoid EM violations, clock trees should be constructed using a DWDS(Double width double spacing ) rule. Increased spacing will help in minimizing noise effect. Similarly, increased width will help to avoid EM violations.
  2. Minimum Power Dissipation – This is one of the most important quality parameter of a clock tree. At the architecture level, clock gating is done at multiple levels to save power and certain things are expected to done while building clock trees such as maintaining good clock transition, minimum latency etc.

EDA tool role in clock tree synthesis

Today, a lot of R&D has been done on EDA tools to design an ideal clock tree. The CTS engines of these tools support most of the SOC requirements to design a robust clock tree. These tools even generate clock spec definitions from SDC(timing constraint files).  A typical clock spec file includes:
  • All clock sources information
  • Synchronous/Asynchronous relationships between various clocks
  • Through pins
  • Exclude pins
  • Clock pulling pushing information
  • Leaf Pin

Going one level down in SoC to design an ideal clock tree

For most SoCs, the existing EDA tools are sufficient for CTS engine to generate an ideal clock tree. However, this is not always the case.  This approach presented in this paper is suitable for SoCs or IPs, which have few clock sources and a simple clock architecture with minimum muxing of multiple clocks.

Today’s microcontrollers generally don’t have such a simple clock architecture.  Microcontrollers designed for the automotive world have multiple IPs integrated into a single SoC.  For example, a single SOC may have multiple cores, IO peripherals like SPI, DSPI, LIN, DDR interfaces for multiple automotive control applications. Considering human safety in automotive SoCs, testing requirements are also very stringent in terms of test coverage such as atspeed and stuckat.  This leads to a very complex clocking architecture because it requires multiple clock sources (both on SoC clock sources such as PLLs, IRC oscillators and off SoC clock sources like EXTAL) and clock dividers in order to supply the required clock frequency to multiple IPs within a SoC.

In such cases, CTS engines cannot be relied upon to build a clock tree. Due to complicated muxing of various clock sources in multiple functional and test modes , EDA tools sometimes are not able to build the clock tree properly, often resulting in problems of increased latency, skew mismatch and huge uncommon clock path problems.

In the next section some real design case studies are used to illustrate how current EDA tools might fail to build the clock tree as expected by the designer and how a backend engineer can help design a robust clock tree either by providing proactive feedback to architecture designers or to improve the clock structure at the RT level itself or by using better implementing techniques

Case study 1 - Clock logic cloning

Suppose a clock tree is required for the following logic.

In functional mode there is one master clock source func and one generated clock source gen_clk1. In test modes there is one test clocks tck1.  In functional mode register set 2 is clocked by gen_clk1 but in test mode, test clock tck1 is used instead.

The conventional way to define the clock tree spec for this design fragment would be to define the master clock sources (func and tck1) and generated clock (gen_clk1) and to define through pin for generated clock source so as to balance the latency of the master clock and the total latency of the generated clock(source latency to register clock pin plus latency from flop output to register group3). Defining a through pin for the generated clock source ensures that a CTS engine does not consider the generated clock flop as a sink pin and instead traces the clock path through CK-> Q arc of flop.

Assume that the latency of func clock while in functional mode is constrained by register set2 (highlighted in red in figure 2).  This will force a CTS engine to build the generated clock source flop 2 with minimum latency. This will only be possible if the minimum buffering is done from func clock source to mux1 input D0 as well as from mux1 output to generated clock source gen_clk1. In order to balance the latency of register set 1 and register set 2, the tool will be forced to insert buffering between mux1 output and register set 1 clock pins.  This implementation is correct in functional mode but will cause problems in test mode.
                           
Figure 2: Original design

Test mode CTS:  The architecture of the design is such that test clock tck1 to all register sets 1-3 can be built with very low latency. However due to functional mode clocking constraints, as explained above, clock latency for the test clock will be high. The latency for test mode will be constrained by register set 1 as shown in above diagram by green line.  This cannot be avoided because of the need to balance register set1 and register set 2 latency in functional mode and buffering can only be done after mux1 output because it is not possible to increase the latency of the generated clock source. The consequence of this is that there is no option other than to increase the latency of register set 2 and register set 3 in test modes.  This is a serious problem because the latency of the test clock is increased because of functional mode clocking constraints. As discussed above in robust clock tree guidelines, an increase in clock latency can lead to multiple problems. This problem cannot be solved using advanced features like MMMC (Multi Mode Multi Corner)CTS of current EDA tools.

Solution: The solution to problem lies in cloning the clock logic as shown in figure 3.  EDA tools generally do not implement cloning of non-buffer logic in the clock path network.  The problem can be solved if there is a separate dedicated clock mux for the generated clock source flop. The limitation of placing clock buffers after the mux output has been removed and for register set 1, the clock buffering to balance latency in functional mode can be done between func clock source and cloned mux input D0.  Since there is now the bare minimum buffering done between cloned mux output D0 and register set 1, the latency numbers in test mode are not limited by register set 1 and it is possible to achieve minimum latency in test mode.
        
Figure 3: Modified design with cloned mux

Case study 2 - Clock muxing of two synchronous clocks

         
             Figure 4(a)                                       Figure 4(b)
             Figure 4: Design  with two synchronous clocks
In this example clock 1 and clock2 are synchronous to each other. The assumption is that the minimum latency of both clock1 and clock2 is not limited by register group 1-3, but by some other clock group (not shown in diagram).

A typical behavior of most of CTS engines would be to insert clock buffers after the mux output to register group2 in order to save overall clock buffers. However this will introduce a problem of larger uncommon path between register group 1 and register group2 as well as between register group 2 and register group3. A CTS engine is not intelligent enough to understand that the architecture ensures that mux select will not toggle on the fly and there cannot be a case of launch on clock 1 and capture on clock 2 for register group2.  An alternate approach for CTS for these type of cases is shown in Figure 4(b). In figure 4(b), the clock buffers have been moved for both clock 1 and clock2 before the clock mux in order to have a greater common clock path between register set 1-2 and register set 2-3. Note that this is under the same assumption that latency of clocks Clk1 and Clk2 is limited by some other register group other than 1-3 and extra clock buffers were placed by the CTS engine after clock mux to balance skew requirements.  

Case study 3 - Centralized vs decentralized clocking scheme

There is debate among designers about how to manage clock muxing and clock divider logic for the SoC. Proponents of a centralized clocking scheme argue that doing all clock muxing at a single place helps managing things in a better way, while opponents question this approach citing timing issues that crop up due to centralized muxing. Both possibilities will be considered.

Assume there are three IPs and one clock of 200 MHz frequency. The design requirement says that both IP1 and IP2 require two synchronous 200 MHz and 100 MHz clocks. Moreover, IP1 and IP2 can handshake data synchronously both at 200 MHz and 100 MHz.  Now, there are two options to implement a clocking scheme that meets this requirement. First is to divide the 200 MHz clock to generate the 100 MHz clock inside a centralized clocking module and then provide both 200 MHz and 100 MHz clocks to both IPs. The second option is to divide the 200 MHz clock separately in both IPs.  In this scenario, option one is the better option because IP1 and IP2 both need divided clocks and they are exchanging data synchronously as well. If division is done independently in both IPs, there is duplication of the divider logic and there are chances of phase mismatch in the divided clocks and additional logic may be required to solve this problem.  In this case, a centralized clocking scheme is better than decentralized one even though there may be some problem of uncommon path between the 200MHz and 100MHz clocks.

For IP3, a decentralized clocking scheme is the best approach. IP3 requires 200MHz, 100MHz and 50 MHz clocks and IP3 is exchanging data only with the external world and not with any other IP within the SoC.  In this case there is no point placing the dividers in one centralized clocking block because it will introduce uncommon path between all divided clocks. The better option would be to divide the 200 MHz clock inside IP3 to generate 100MHz and 50MHz clocks.

Figure 5: centralized and decentralized clocking


In summary, it might look tempting and convenient to keep all clock muxing and dividing logic in one place, but in some cases it might introduce timing closure problems. The better approach is to analyze the impact of centralized/decentralized clocking on a case to case basis and to take the appropriate decision after that analysis.

Case study 4 - Power vs. timing

A clock tree designer often has to choose between power and timing. One such example is shown in figure 6. Different CTS engines can behave differently.  The first CTS tools prefers power saving over
uncommon path because when clock gating is done, the maximum number of clock buffers will stop toggling . The second solution favors timing over power as both register groups now have the minimum uncommon path. A CTS designer must choose their preference between power and timing on a case by case basis. Whatever the tool algorithm, the clock spec can be modified to force the CTS tool to build the required structure.

   
Figure 6: Alternative ways of building a clock tree

Case study 5 – Back-to-back clock gating cells

Many times due to third party IPs and logic synthesis clock gating insertion, back-to-back clock gating cells may have been created. Because of this, clock latency to that register group can increase since a clock gating cell typically has a higher delay than a clock buffer. This can be rectified either by merging these clock gating cells at the RT level or if that is not possible because of integration and third-party IP issues, it can also be done during logic synthesis. Most EDA tools doing logic synthesis have a feature to merge such back-to-back clock gating cells, but the default is not to merge these clock gating cells in order to preserve the RTL implementation . This feature can be used on a case to case basis.
                                

Figure 7: back-to-back clock cells

Recommendations and guidelines/experiments for designing clock trees
For a new design when the clock tree is being constructed for first time, it is important to know optimum latency and skew numbers. Some suggested experiments for this include:
  1. Build a clock tree with no skew balancing requirements. This will force the CTS engine to build a clock tree  to all registers at its lowest latency possible without caring about skew balancing. The clock path of the register group having the highest latency should be analyzed in detail because when that clock tree is going to be built with skew requirements specified, this register group will determine the latency of that whole clock group. Explore architectural improvement that can be made to reduce latency for this register group. This exercise should be repeated for subsequent highest latency register groups until no further latency improvements can be made.  
  2. After minimum latency has been established, skew numbers should be targeted. Two or three experiments, with different skew numbers, should be performed to see if overall latency is increasing in order to meet clock skew requirements.  Inappropriate clock buffer selection could be an issue. Skew numbers should be double checked. Very low skew numbers might look tempting but too aggressive skew numbers can increase overall clock latency and can increase peak power dissipation due to all flops toggling at the same time.
  3. Another suggested way to target uncommon path problem is to compare timing reports between pre CTS stage and post CTS stages.  Ideally the timing status of a design should remain the same between pre CTS and post CTS stages because projected deterioration in the timing profile is already taken care of at pre CTS by extra clock skew and derate uncertainty.  If timing violation are seen after post CTS stage and a clock tree with respectable skew numbers has been built, the culprit is probably a huge uncommon path between launch and capture registers. Root cause analysis of the uncommon path should be done to determine if architectural improvements can be made to reduce the uncommon path.

Conclusion

The case studies, guidelines and experiments are neither compulsory nor exhaustive enough to cover all aspects of an ideal clock tree. Moreover, there are a lot of other issues such as signal integrity and clock gate ratio which have not been considered. These could be important, particularly in smaller technology nodes. This article should serve as an eye opener to change the perception about how CTS activity is taken generally in our design cycle. With timing margins being reduced, it has become very important to scrutinize the clock tree architecture thoroughly and look for every single possibility of improving clock structure.