
Albert Cohen

Albert Cohen. Program Analysis and Transformation: From the Polytope Model to Formal Languages.

Networking and Internet Architecture [cs.NI]. Université de Versailles-Saint Quentin en Yvelines, 1999.

English. <tel-00550829>

https://tel.archives-ouvertes.fr/tel-00550829

Submitted on 31 Dec 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

DOCTORAL THESIS of the UNIVERSITÉ de VERSAILLES

Specialty: Computer Science

presented by

Albert COHEN

to obtain the title of DOCTEUR de l'UNIVERSITÉ de VERSAILLES

Thesis subject:

Analyse et transformation de programmes : du modèle polyédrique aux langages formels

Program Analysis and Transformation:

From the Polytope Model to Formal Languages

Jean Berstel, Reviewer

Luc Bougé, Examiner

Jean-François Collard, Advisor

Paul Feautrier, Advisor

William Jalby, President

Patrice Quinton, Reviewer

Bernard Vauquelin, Reviewer

PRiSM laboratory (Parallélisme, Réseaux, Systèmes et Modélisation)

Acknowledgments

This thesis was prepared at the PRiSM laboratory (Parallélisme, Réseaux, Systèmes et Modélisation) of the Université de Versailles Saint-Quentin-en-Yvelines, between September 1996 and December 1999, under the supervision of Jean-François Collard and Paul Feautrier.

I would first like to thank Jean-François Collard (research associate at CNRS), who supervised this thesis and with whom I had the chance to take my first steps in scientific research. His advice, his extraordinary availability, his energy in all circumstances and his enlightened ideas did far more than sustain my motivation. I warmly thank Paul Feautrier (professor at PRiSM) for his trust and for his interest in following my results. Through his experience, he showed me how exciting research can be, beyond its occasional difficulties and successes.

I am very grateful to all the members of my jury; in particular to Jean Berstel (professor at the Université de Marne-la-Vallée), Patrice Quinton (professor at IRISA, Université de Rennes) and Bernard Vauquelin (professor at LaBRI, Université de Bordeaux), for the interest and curiosity they showed in my work and for the care with which they read this thesis, even when its subject lay outside their own research areas. Many thanks to Luc Bougé (professor at LIP, École Normale Supérieure de Lyon) for taking part in this jury and for his insightful suggestions and comments. Thanks as well to William Jalby (professor at PRiSM) for agreeing to chair this jury and for his frequent good-humored advice.

I also express all my gratitude to Guy-René Perrin for his encouragement and for access to "his" parallel machine, to Olivier Carton for his invaluable help in a very demanding field, and to Denis Barthou, Ivan Djelic and Vincent Lefebvre for their essential collaboration on the results of this thesis. I also remember fascinating discussions with Pierre Boulet, Philippe Clauss, Christine Eisenbeis and Sanjay Rajopadhye; nor do I forget the effective help of the laboratory's engineers and secretaries. I think back to the good times spent with all the members of the "monastery" and with my fellow travelers at PRiSM who have become my friends.

Finally, thanks to my family for their constant and unconditional support, with a special thought for my parents and for my wife Isabelle.

Dedicated to a Brave GNU World

http://www.gnu.org

Copyright © Albert Cohen 1999.

Verbatim copying and distribution of this document is permitted in any medium, provided this notice is preserved.

Copying and distribution of exact copies of this document are permitted, but no modification is allowed.

This document was typeset using LaTeX and the french package.

Graphics were designed using xfig, gnuplot and the GasTeX package.

Albert.Cohen@prism.uvsq.fr

TABLE OF CONTENTS 5

Table of Contents

List of Figures 7

List of Algorithms 9

Présentation en français 11

Dissertation summary, in French.

1 Introduction 53

1.1 Program Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

1.2 Program Transformations for Parallelization . . . . . . . . . . . . . . . . . . . . . 57

1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2 Framework 61

2.1 Going Instancewise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

2.2 Program Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2.2.1 Control Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2.2.2 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

2.3 Abstract Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

2.3.1 Naming Statement Instances . . . . . . . . . . . . . . . . . . . . . . . . . 66

2.3.2 Sequential Execution Order . . . . . . . . . . . . . . . . . . . . . . . . . . 70

2.3.3 Addressing Memory Locations . . . . . . . . . . . . . . . . . . . . . . . . . 71

2.3.4 Loop Nests and Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

2.4 Instancewise Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

2.4.1 Conflicting Accesses and Dependences . . . . . . . . . . . . . . . . . . . . 76

2.4.2 Reaching Definition Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 77

2.4.3 An Example of Instancewise Reaching Definition Analysis . . . . . . . . . 78

2.4.4 More About Approximations . . . . . . . . . . . . . . . . . . . . . . . . . 80

2.5 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

2.5.1 Memory Expansion and Parallelism Extraction . . . . . . . . . . . . . . . 81

2.5.2 Computation of a Parallel Execution Order . . . . . . . . . . . . . . . . . 82

2.5.3 General Efficiency Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 85

3 Formal Tools 87

3.1 Presburger Arithmetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

3.1.1 Sets, Relations and Functions . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.1.2 Transitive Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3.2 Monoids and Formal Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

3.2.1 Monoids and Morphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

3.2.2 Rational Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

3.2.3 Algebraic Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3.2.4 One-Counter Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


3.3.1 Recognizable and Rational Relations . . . . . . . . . . . . . . . . . . . . . 97

3.3.2 Rational Transductions and Transducers . . . . . . . . . . . . . . . . . . . 98

3.3.3 Rational Functions and Sequential Transducers . . . . . . . . . . . . . . . 99

3.4 Left-Synchronous Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

3.4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

3.4.2 Algebraic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

3.4.3 Functional Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

3.4.4 An Undecidability Result . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

3.4.5 Studying Synchronizability of Transducers . . . . . . . . . . . . . . . . . . 110

3.4.6 Decidability Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

3.4.7 Further Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

3.5 Beyond Rational Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

3.5.1 Algebraic Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

3.5.2 One-Counter Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

3.6 More about Intersection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

3.6.1 Intersection with Lexicographic Order . . . . . . . . . . . . . . . . . . . . 119

3.6.2 The case of Algebraic Relations . . . . . . . . . . . . . . . . . . . . . . . . 120

3.7 Approximating Relations on Words . . . . . . . . . . . . . . . . . . . . . . . . . . 121

3.7.1 Approximation of Rational Relations by Recognizable Relations . . . . . 121

3.7.2 Approximation of Rational Relations by Left-Synchronous Relations . . . 121

3.7.3 Approximation of Algebraic and Multi-Counter Relations . . . . . . . . . 122

4 Instancewise Analysis for Recursive Programs 123

4.1 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

4.1.1 First Example: Procedure Queens . . . . . . . . . . . . . . . . . . . . . . 123

4.1.2 Second Example: Procedure BST . . . . . . . . . . . . . . . . . . . . . . . 125

4.1.3 Third Example: Function Count . . . . . . . . . . . . . . . . . . . . . . . 125

4.1.4 What Next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

4.2 Mapping Instances to Memory Locations . . . . . . . . . . . . . . . . . . . . . . . 126

4.2.1 Induction Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

4.2.2 Building Recurrence Equations on Induction Variables . . . . . . . . . . . 128

4.2.3 Solving Recurrence Equations on Induction Variables . . . . . . . . . . . 133

4.2.4 Computing Storage Mappings . . . . . . . . . . . . . . . . . . . . . . . . . 134

4.2.5 Application to Motivating Examples . . . . . . . . . . . . . . . . . . . . . 137

4.3 Dependence and Reaching Definition Analysis . . . . . . . . . . . . . . . . . . . 139

4.3.1 Building the Conflict Transducer . . . . . . . . . . . . . . . . . . . . . . . 139

4.3.2 Building the Dependence Transducer . . . . . . . . . . . . . . . . . . . . . 140

4.3.3 From Dependences to Reaching Definitions . . . . . . . . . . . . . . . . . 141

4.3.4 Practical Approximation of Reaching Definitions . . . . . . . . . . . . . . 143

4.4 The Case of Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

4.5 The Case of Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

4.6 The Case of Composite Data Structures . . . . . . . . . . . . . . . . . . . . . . . 148

4.7 Comparison with Other Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

5 Parallelization via Memory Expansion 155

5.1 Motivations and Tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

5.1.1 Conversion to Single-Assignment Form . . . . . . . . . . . . . . . . . . . . 156

5.1.2 Run-Time Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

5.1.3 Single-Assignment for Loop Nests . . . . . . . . . . . . . . . . . . . . . . 160

5.1.4 Optimization of the Run-Time Overhead . . . . . . . . . . . . . . . . . . 161


5.2 Maximal Static Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

5.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

5.2.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

5.2.3 Formal Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

5.2.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

5.2.5 Detailed Review of the Algorithm . . . . . . . . . . . . . . . . . . . . . . 177

5.2.6 Application to Real Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

5.2.7 Back to the Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

5.2.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

5.2.9 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

5.3 Storage Mapping Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

5.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

5.3.2 Problem Statement and Formal Solution . . . . . . . . . . . . . . . . . . . 191

5.3.3 Optimality of the Expansion Correctness Criterion . . . . . . . . . . . . . 194

5.3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

5.3.5 Array Reshaping and Renaming . . . . . . . . . . . . . . . . . . . . . . . 196

5.3.6 Dealing with Tiled Parallel Programs . . . . . . . . . . . . . . . . . . . . 199

5.3.7 Schedule-Independent Storage Mappings . . . . . . . . . . . . . . . . . . . 200

5.3.8 Dynamic Restoration of the Data-Flow . . . . . . . . . . . . . . . . . . . 201

5.3.9 Back to the Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

5.3.10 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

5.4 Constrained Storage Mapping Optimization . . . . . . . . . . . . . . . . . . . . . 205

5.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

5.4.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

5.4.3 Formal Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

5.4.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

5.4.5 Building Expansion Constraints . . . . . . . . . . . . . . . . . . . . . . . . 215

5.4.6 Graph-Coloring Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

5.4.7 Dynamic Restoration of the Data-Flow . . . . . . . . . . . . . . . . . . . 219

5.4.8 Parallelization after Constrained Expansion . . . . . . . . . . . . . . . . . 222

5.4.9 Back to the Motivating Example . . . . . . . . . . . . . . . . . . . . . . . 223

5.5 Parallelization of Recursive Programs . . . . . . . . . . . . . . . . . . . . . . . . 226

5.5.1 Problems Specific to Recursive Structures . . . . . . . . . . . . . . . . . . 227

5.5.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

5.5.3 Generating Code for Read References . . . . . . . . . . . . . . . . . . . . 230

5.5.4 Privatization of Recursive Programs . . . . . . . . . . . . . . . . . . . . . 232

5.5.5 Expansion of Recursive Programs: Practical Examples . . . . . . . . . . . 233

5.5.6 Statementwise Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . 235

5.5.7 Instancewise Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . 240

5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

6 Conclusion 245

6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

6.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

Bibliography 249

Index 259

8 LIST OF FIGURES

List of Figures

1.1 Simple examples of memory expansion . . . . . . . . . . . . . . . . . . . . . . . . 58

1.2 Run-time restoration of the flow of data . . . . . . . . . . . . . . . . . . . . . . 59

1.3 Exposing parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

2.1 About run-time instances and accesses . . . . . . . . . . . . . . . . . . . . . . . . 62

2.2 Procedure Queens and control tree . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.3 Control automata for program Queens . . . . . . . . . . . . . . . . . . . . . . . . 69

2.4 Hash-table declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

2.5 An inode declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

2.6 Computation of Parikh vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

2.7 Execution-dependent storage mappings . . . . . . . . . . . . . . . . . . . . . . . . 77

3.1 Studying the Lukasiewicz language . . . . . . . . . . . . . . . . . . . . . . . . . . 95

3.2 One-counter automaton for the Lukasiewicz language . . . . . . . . . . . . . . . . 96

3.3 Sequential and sub-sequential transducers . . . . . . . . . . . . . . . . . . . . . . 100

3.4 Synchronous and -synchronous transducers . . . . . . . . . . . . . . . . . . . . . 103

3.5 Left-synchronous realization of several order relations . . . . . . . . . . . . . . . 103

3.6 A left and right synchronizable example . . . . . . . . . . . . . . . . . . . . . . . 104

4.1 Procedure Queens and control tree . . . . . . . . . . . . . . . . . . . . . . . . . . 124

4.2 Procedure BST and compressed control automaton . . . . . . . . . . . . . . . . . 125

4.3 Procedure Count and compressed control automaton . . . . . . . . . . . . . . . . 126

4.4 First example of induction variables . . . . . . . . . . . . . . . . . . . . . . . . . 127

4.5 More examples of induction variables . . . . . . . . . . . . . . . . . . . . . . . . . 128

4.6 Procedure Count and control automaton . . . . . . . . . . . . . . . . . . . . . . . 138

4.7 Rational transducer for storage mapping f of program BST . . . . . . . . . . . . 146

4.8 Rational transducer for conflict relation of program BST . . . . . . . . . . . . . 146

4.9 Rational transducer for dependence relation of program BST . . . . . . . . . . . 147

4.10 Rational transducer for storage mapping f of program Queens . . . . . . . . . . 147

4.11 One-counter transducer for conflict relation of program Queens . . . . . . . . . 149

4.12 Pseudo-left-synchronous transducer for the restriction of to W R . . . . . . 150

4.13 One-counter transducer for the restriction of dependence relation to flow dependences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

4.14 One-counter transducer for reaching definition relation of program Queens . . . 152

4.15 Simplified one-counter transducer for . . . . . . . . . . . . . . . . . . . . . . . . 152

5.1 Interaction of reaching definition analysis and run-time overhead . . . . . . . . . 159

5.2 Basic optimizations of the generated code for φ functions . . . . . . . . . . . . . 163

5.3 Repeated assignments to the same memory location . . . . . . . . . . . . . . . . 164

5.4 Improving the SA algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

5.5 Parallelism extraction versus run-time overhead . . . . . . . . . . . . . . . . . . . 167

5.6 First example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

5.7 First example, continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170


5.9 Second example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

5.10 Partition of the iteration domain (N = 4) . . . . . . . . . . . . . . . . . . . . . . 171

5.11 Maximal static expansion for the second example . . . . . . . . . . . . . . . . . . 172

5.12 Third example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

5.13 Inserting copy-out code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

5.14 Parallelization of the first example . . . . . . . . . . . . . . . . . . . . . . . . . 185

5.15 Experimental results for the first example . . . . . . . . . . . . . . . . . . . . . 186

5.16 Computation times, in milliseconds. . . . . . . . . . . . . . . . . . . . . . . . . . 186

5.17 Convolution example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

5.18 Knapsack program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

5.19 KP in single-assignment form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

5.20 Instancewise reaching definitions, schedule, and tiling for KP . . . . . . . . . . 190

5.21 Partial expansion for KP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

5.22 Cases of f_exp(v) ≠ f_exp(w) in (5.17) . . . . . . . . . . . . . . . . . . . . . . 194

5.23 Motivating examples for each constraint in the definition of the interference relation . . . 195

5.24 An example of block-regular storage mapping . . . . . . . . . . . . . . . . . . . . 200

5.25 Time and space optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

5.26 Performance results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

5.27 Motivating example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

5.28 Parallelization of the motivating example . . . . . . . . . . . . . . . . . . . . . . 207

5.29 Performance results for storage mapping optimization . . . . . . . . . . . . . . . 208

5.30 Maximal static expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

5.31 Maximal static expansion combined with storage mapping optimization . . . . . 209

5.32 What we want to achieve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

5.33 Strange interplay of constraint and coloring relations . . . . . . . . . . . . . . . . 213

5.34 How we achieve constrained storage mapping optimization . . . . . . . . . . . . . 214

5.35 Solving the constrained storage mapping optimization problem . . . . . . . . . . 215

5.36 Single-assignment form conversion of program Queens . . . . . . . . . . . . . . . 234

5.37 Implementation of the read reference in statement r . . . . . . . . . . . . . . . . 235

5.38 Privatization of program Queens . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

5.39 Parallelization of program BST . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

5.40 Second motivating example: program Map . . . . . . . . . . . . . . . . . . . . . . 237

5.41 Parallelization of program Queens via privatization . . . . . . . . . . . . . . . . . 239

5.42 Parallel resolution of the n-Queens problem . . . . . . . . . . . . . . . . . . . . . 240

5.43 Instancewise parallelization example . . . . . . . . . . . . . . . . . . . . . . . . . 241

5.44 Automatic instancewise parallelization of procedure P . . . . . . . . . . . . . . . 243

10 LIST OF ALGORITHMS

List of Algorithms

Recurrence-Build (program) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

Recurrence-Rewrite (program; system) . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Recurrence-Solve (system) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

Compute-Storage-Mappings (program) . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Dependence-Analysis (program) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

Reaching-Definition-Analysis (program) . . . . . . . . . . . . . . . . . . . . . . . . . . 145

Abstract-SA (program; W; ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

Abstract-Implement-Phi (expanded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

Convert-Quast (quast; ref ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

Loop-Nests-SA (program; ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

Loop-Nests-Implement-Phi (expanded) . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

Abstract-ML-SA (program; W; ml ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

Loop-Nests-ML-SA (program; ml ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

Abstract-Implement-Phi-Not-SA (expanded) . . . . . . . . . . . . . . . . . . . . . . . 167

Maximal-Static-Expansion (program; ; ) . . . . . . . . . . . . . . . . . . . . . . . . 177

MSE-Convert-Quast (quast; ref ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Compute-Representatives (equivalence) . . . . . . . . . . . . . . . . . . . . . . . . . . 178

Enumerate-Representatives (rel; fun) . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

Storage-Mapping-Optimization (program; ; 6 ; <par ) . . . . . . . . . . . . . . . . . . 196

SMO-Convert-Quast (quast; ref ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

Build-Expansion-Vector (S; ./) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

Partial-Renaming (program; ./) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

Constrained-Storage-Mapping-Optimization (program; ; ; ; <par ) . . . . . . . . . . 216

CSMO-Convert-Quast (quast; ref ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

Cyclic-Coloring () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

Near-Block-Cyclic-Coloring ( ; shape) . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

CSMO-Implement-Phi (expanded) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

CSMO-Efficiently-Implement-Phi (expanded) . . . . . . . . . . . . . . . . . . . . . . . 221

Recursive-Programs-SA (program; ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

Recursive-Programs-Implement-Phi (expanded) . . . . . . . . . . . . . . . . . . . . . . 230

Recursive-Programs-Online-SA (program; ) . . . . . . . . . . . . . . . . . . . . . . . 232

Statementwise-Parallelization (program; ) . . . . . . . . . . . . . . . . . . . . . . . . 238

Instancewise-Parallelization (program; ) . . . . . . . . . . . . . . . . . . . . . . . . . 242


Présentation en français

After a detailed introduction, this chapter offers a summary in French of the following chapters, which are written in English. Its organization mirrors the structure of the thesis: its sections and subsections correspond to the chapters and their sections, respectively. A reader wishing to explore one of the topics presented in more depth can therefore turn to the corresponding part in English for detailed algorithms and examples.

I Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

I.1 Program analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

I.2 Program transformations for parallelization . . . . . . . . . . . . . . . . . 16

I.3 Organization of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

II Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

II.1 An instancewise view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

II.2 Program model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

II.3 Formal model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

II.4 Instancewise analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

II.5 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

III Mathematical tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

III.1 Presburger arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

III.2 Formal languages and rational relations . . . . . . . . . . . . . . . . . . . 28

III.3 Left-synchronous relations . . . . . . . . . . . . . . . . . . . . . . . . . . 31

III.4 Beyond rational relations . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

III.5 More on approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

IV Instancewise analysis for recursive programs . . . . . . . . . . . . . . . . . . . . 34

IV.1 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

IV.2 Relating instances and memory locations . . . . . . . . . . . . . . . . . . 35

IV.3 Dependence and reaching definition analysis . . . . . . . . . . . . . . . . 38

IV.4 The results of the analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 39

IV.5 Comparison with other analyses . . . . . . . . . . . . . . . . . . . . . . . 41

V Expansion and parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

V.1 Motivations and trade-offs . . . . . . . . . . . . . . . . . . . . . . . . . . 42

V.2 Maximal static expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

V.3 Optimizing memory usage . . . . . . . . . . . . . . . . . . . . . . . . . . 45

V.4 Constrained optimized expansion . . . . . . . . . . . . . . . . . . . . . . 45

V.5 Parallelization of recursive programs . . . . . . . . . . . . . . . . . . . . . 46

VI Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

VI.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

VI.2 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


I Introduction

Progress in processor technology stems from several factors: a sharp increase in clock frequency, wider buses, the use of multiple functional units and possibly of multiple processors, reliance on complex memory hierarchies to compensate for access times, and an overall growth in storage capacity. One consequence is that the machine model is becoming less and less simple and uniform: despite hardware-managed caches, superscalar execution and shared-memory parallel architectures, reaching optimal performance for a given program is becoming more and more complex. Good optimizations for one particular case can lead to disastrous results on a different architecture. Moreover, hardware management alone cannot exploit the most complex architectures efficiently: in the presence of deep memory hierarchies, local memories, out-of-core computation, instruction-level parallelism or coarse-grain parallelism, compiler support is necessary to obtain good performance.

The entire architecture and compiler industry is in fact facing what the high-performance computing community discovered years ago. On the one hand, for most applications, architectures are too disparate to define practical efficiency criteria and to develop optimizations specific to a given machine. On the other hand, programs are written in such a way that traditional optimization and parallelization techniques have a very hard time feeding the computing beast we are about to install in an ordinary laptop.

To reach high performance with modern microprocessors and parallel computers, a program, or the algorithm it implements, must exhibit a sufficient degree of parallelism. Programmers or compilers must then expose this parallelism and apply the transformations needed to adapt the program to the machine's characteristics. Another requirement is that the program be portable across different architectures, in order to keep up with the rapid evolution of parallel machines. Programmers are thus offered the following two options.

- First, explicitly parallel languages. Most are parallel extensions of sequential languages. These languages can be data-parallel, like HPF, or combine data and task parallelism, like the OpenMP extensions for shared-memory architectures. Some extensions are offered as libraries, PVM and MPI for example, or as high-level environments such as IML from the University of Illinois [SSP99] or Cilk from MIT [MF98]. All these approaches ease the programming of parallel algorithms. On the other hand, the programmer is left in charge of technical operations such as distributing data across processors, communications and synchronizations. These operations require in-depth knowledge of the architecture and significantly reduce portability.

- Second, automatic parallelization of a high-level sequential language. The obvious advantages of this approach are portability and simplicity of programming. Unfortunately, the task falling to the parallelizing compiler becomes overwhelming. Indeed, the program must first be analyzed in order to understand, at least partially, which computations are performed and where, before being transformed while taking the specifics of the architecture into account. The usual source language for automatic parallelization is Fortran 77: many scientific applications were written in Fortran, which only allows relatively simple data and control structures. Several studies nevertheless consider parallelizing C or functional languages like Lisp. This research is less advanced than the historical approach but closer to the present work: it deals with the most general data and control structures. Many research projects exist: Parafrase-2 and Polaris [BEF+96] from the University of Illinois, PIPS from the Ecole des Mines de Paris [IJT90], SUIF from Stanford University [H+96], the McCat/Earth-C compiler from McGill University [HTZ+97], LooPo from the University of Passau [GL97], and PAF from the University of Versailles; there is also a growing number of commercial parallelization tools, such as CFT, FORGE, FORESYS or KAP.

We are mainly interested in automatic and semi-automatic parallelization techniques: this thesis addresses both program analysis and program transformation.

Optimizing or parallelizing a program generally amounts to transforming its source code so as to improve some parameters of its execution. To apply a program transformation at compile time, one must make sure that the implemented algorithm is not affected in the process. Since an algorithm can be implemented in many different ways, validating a program transformation requires a reverse-engineering process to establish the most precise possible information about what the program does. This fundamental program-analysis technique tackles the difficult problem of discovering statically, i.e. at compile time, information about dynamic properties, i.e. run-time ones.

Static analysis

Regarding program analysis, the first studies focused on properties of the machine state between the execution of two statements. These states are called program points. Such properties are called static because they cover all possible executions leading to a given program point. Of course, these properties are computed at compile time, but this is not where the adjective "static" comes from: it would probably be more appropriate to speak of "syntactic" analysis.

Data-flow analysis was the first general framework proposed to formalize the large number of static analyses. Among the many presentations of this formalism [KU77, Muc97, ASU86, JM82, KS92, SRH96], the following common points can be identified. To describe the possible executions, the usual method is to build the control-flow graph of the program [ASU86]; indeed, this graph represents all program points as vertices, and the edges between these vertices are labeled by statements of the program. The set of all possible executions is then the set of all paths from the initial state to the program point of interest. Properties at a given point are defined as follows: since every statement may modify a property, one must take into account all paths leading to the program point and meet the information gathered along these paths. The formalization of these ideas is often called meet over all paths (MOP). Of course, the meet operation depends on the property of interest and on its mathematical abstraction.

However, the potentially infinite number of paths forbids any evaluation of properties directly from the MOP specification. The computation is instead performed by propagating intermediate results, forward or backward, along the edges of the control-flow graph. The propagation equations are then solved iteratively, until a fixed point is reached. This is the so-called maximal fix-point (MFP) method. In the intra-procedural case, Kam and Ullman [KU77] proved that MFP effectively computes the result defined by MOP, i.e. MFP coincides with MOP, when a few simple properties of the mathematical abstraction are satisfied; this result was extended to inter-procedural analysis by Knoop and Steffen [KS92].

Mathematical abstractions for program properties are extremely numerous, depending on the application and on the complexity of the analysis. The lattice structure encompasses most abstractions, since it allows the computation of meets, at merge points, and of joins, associated with statements. In this setting, Cousot and Cousot [CC77] proposed an approximation scheme based on semi-dual Galois connections between concrete execution states and abstract compile-time properties. This formalism, called abstract interpretation, has two main benefits: first, it supports the systematic construction of property abstractions by means of lattices; second, it guarantees that any fixed point computed in the abstract lattice corresponds to a conservative approximation of a fixed point in the lattice of concrete states. While extending the concept of data-flow analysis, abstract interpretation eases correctness and optimality proofs for program analyses. Practical applications of abstract interpretation and of the associated iterative methods are presented in [Cou81, CH78, Deu92, Cre96].

Despite undeniable successes, data-flow analyses, whether based on abstract interpretation or not, have rarely been the basis of automatic parallelization techniques. Some important reasons are not of a scientific nature, but good reasons also explain this fact:

- MOP/MFP techniques are mainly geared towards classical optimizations with relatively simple abstractions (the lattices often have bounded height); their correctness and their efficiency inside a real compiler are the decisive stakes, whereas the precision and expressiveness of the mathematical abstraction are the basis of automatic parallelization;

- in industry, parallelization methods have traditionally concentrated on loop nests and arrays, with large amounts of data parallelism and simple (non-recursive, first-order) control structures; proving the correctness of an analysis is easy in this setting, whereas applying it to real programs and implementing it in a compiler become the major issues;

- abstract interpretation suits functional languages with a clean and simple operational semantics; the problems it raises are then orthogonal to the practical questions tied to imperative, lower-level languages, traditionally better suited to parallel architectures (we will see that this situation is changing).

The preceding discussion dealt with static analyses that compute properties of a given program point or a given statement. Such results are useful to classical verification and optimization techniques [Muc97, ASU86, SKR90, KRS94], but automatic parallelization needs additional information.

- What about the several run-time instances of a program point or statement? Since statements are generally executed several times, we are interested in the loop iteration or the procedure call that leads to the execution of a given statement.

- What about the individual elements of a data structure? Since arrays and dynamically allocated data structures are not atomic, we are interested in the array element or the tree node that is accessed by a given instance of a statement.

Instancewise analysis

Program analyses for automatic parallelization form a rather narrow domain, compared with the immensity of the properties and techniques studied in the setting of static analysis. The program model under consideration is also more restricted, most of the time, since the traditional applications of parallelizers are numerical codes with loop nests and arrays.

From the very beginning, with the work of Banerjee [Ban88], Brandes [Bra88] and Feautrier [Fea88a], these analyses have been able to identify properties at the level of instances and elements. When the only control structure was the for/do loop, iterative methods with solid semantic foundations seemed needlessly complex. To concentrate on solving the crucial problems of abstracting loop iterations and effects on array elements, designing simple and specialized models was certainly preferable. The first analyses were dependence tests [Ban88] and dependence analyses, which gather information about statement instances accessing the same memory cell, one of the accesses being a write. More precise methods were then designed to compute, for each array element read in an expression, the statement instance that produced the value. They are often called array data-flow analyses [Fea91, MAL93], but we prefer the term instancewise reaching definition analysis, to ease the comparison with a particular static data-flow analysis technique called reaching definition analysis [ASU86, Muc97]. Such precise information significantly improves the quality of transformation techniques, and hence the performance of parallel programs.

Instancewise analyses long suffered from severe restrictions on their program model: programs initially had to consist only of loops without conditional statements, with affine loop bounds and array subscripts, and without procedure calls. This limited model already encompasses a large number of numerical codes, and it has the great advantage of allowing exact computation of dependences and reaching definitions [Fea88a, Fea91]. When trying to lift these restrictions, one difficulty comes from the impossibility of establishing exact results: only approximate information about dependences is available at compile time, which induces overly coarse approximations of the reaching definitions. A direct computation of these reaching definitions is therefore necessary. Such techniques were recently developed by Barthou, Collard and Feautrier [CBF95, BCF97, Bar98] and by Pugh and Wonnacott [WP95, Won95], with extremely precise results in the intra-procedural case. In the following, for unrestricted loop nests and arrays, our instancewise reaching definition analysis will be the fuzzy array data-flow analysis (FADA) of Barthou, Collard and Feautrier [Bar98].

Many extensions of these analyses can handle procedure calls [TFJ86, HBCM94, CI96], but they are not fully instancewise, since they do not distinguish between the multiple executions of a statement associated with different calls to the enclosing procedure. Indeed, this thesis presents the first fully instancewise analysis for programs with procedure calls, possibly recursive ones.

It is well known that dependences hinder the parallelization of programs written in an imperative language, as well as their efficient compilation on modern processors and supercomputers. A general method to reduce the number of dependences consists in reducing memory reuse by assigning distinct memory cells to independent writes, that is, in expanding the data structures. There are many techniques to compute memory expansions, that is, to transform the memory accesses of a program. Classical methods include: renaming variables; splitting or unifying data structures of the same type; reshaping arrays, in particular adding new dimensions; converting arrays into trees; changing the arity of a tree; turning a global variable into a local one.

Read references are expanded as well, using reaching definitions to implement the expanded reference [Fea91]. Figure 1 presents three programs for which no parallel execution is possible, because of output dependences (some code details are omitted). The expanded versions are shown on the right-hand side of the figure, to illustrate the benefit of memory expansion for parallelism extraction.

Unfortunately, when the control flow cannot be predicted at compile time, additional run-time work is needed to preserve the original data flow: φ functions may be needed to "merge" the definitions coming from several incoming control paths. These functions are similar, but not identical, to those of the static single-assignment (SSA) framework of Cytron et al. [CFR+91], and Collard and Griebl first extended them to instancewise expansion methods [GC95, Col98]. The argument of a φ function is the set of possible reaching definitions of the associated read reference (this interpretation is quite different from the usual semantics of φ functions in the SSA framework). Figure 2 shows two programs with conditional expressions and unknown array subscripts. Expanded versions with φ functions are given on the right-hand side of the figure.

Expansion is not a mandatory step of parallelization; it remains, however, a very general technique to expose more parallelism in programs. Regarding the implementation of parallel programs, two different views are possible, depending on the language and the architecture.


........................................................................................

int x;                              int x1, x2;
x = ...; ... = x;                   x1 = ...; ... = x1;
x = ...; ... = x;                   x2 = ...; ... = x2;

After expansion, i.e. after renaming x into x1 and x2, the first two statements
can be executed in parallel with the last two.

int A[10];                          int A1[10], A2[10][10];
for (i=0; i<10; i++) {              for (i=0; i<10; i++) {
s1  A[0] = ...;                     s1  A1[i] = ...;
    for (j=1; j<10; j++) {              for (j=1; j<10; j++) {
s2    A[j] = A[j-1] + ...;          s2    A2[i][j] = { if (j==1) A1[i];
    }                                                  else A2[i][j-1]; }
}                                                    + ...;
                                        }
                                    }

After expansion, i.e. after renaming array A into A1 and A2 and adding a
dimension to array A2, the for loop is parallel. The instancewise reaching
definition of reference A[j-1] depends on the values of i and j, as shown by
the implementation with a conditional statement.

int A[10];                          struct Tree {
void Proc (int i) {                   int value;
  A[i] = ...;                         Tree *left, *right;
  ... = A[i];                       } *p;
  if (...) Proc (i+1);              void Proc (Tree *p, int i) {
  if (...) Proc (i-1);                p->value = ...;
}                                     ... = p->value;
                                      if (...) Proc (p->left, i+1);
                                      if (...) Proc (p->right, i-1);
                                    }

After expansion, the two procedure calls can be executed in parallel. Dynamic
allocation of the Tree structure is omitted.

. . . . . . . . . . . . . . . . . . . . . . Figure 1. A few expansion examples . . . . . . . . . . . . . . . . . . . . . .

- Control parallelism arises between different statements of the same program block. The goal is to replace as many sequential executions of statements as possible by parallel executions. Depending on the language, there are several syntaxes to express this kind of parallelism, and they may not all have the same expressive power. We prefer the spawn/sync syntax of Cilk [MF98] (close to that of OpenMP) to the parallel blocks of Algol 68 and of the EARTH-C compiler [HTZ+97]. As in [MF98], synchronizations apply to all asynchronous activities started in the enclosing block, and implicit synchronizations are added at procedure return points. Regarding the example of figure 3, the execution of A, B and C in parallel, followed sequentially by D and then E, has been written in a Cilk-like syntax. In practice, each statement of this example would probably be a procedure call.


........................................................................................

int x;                              int x1, x2;
s1  x = ...;                        s1  x1 = ...;
s2  if (...) x = ...;               s2  if (...) x2 = ...;
r   ... = x;                        r   ... = φ({s1, s2});

After expansion, one cannot decide at compile time which value is read by
statement r. We only know that it must come from s1 or s2, and the computation
of this value is hidden in the expression φ({s1, s2}): it checks whether s2 has
executed; if so, it returns the value of x2, otherwise that of x1.

int A[10];                          int A1[10], A2[10];
s1  A[i] = ...;                     s1  A1[i] = ...;
s2  A[...] = ...;                   s2  A2[...] = ...;
r   ... = A[i];                     r   ... = φ({s1, s2});

After expansion, the value read by statement r is unknown at compile time,
since the element of array A written by statement s2 is unknown.

. . . . . . . . . . . . . . . Figure 2. Run-time restoration of the data flow . . . . . . . . . . . . . . .

........................................................................................

spawn A;
spawn B;
spawn C;
sync;   // wait for the termination of A, B and C
D;
E;

. . . . . . . . . . . . . Figure 3. Control parallelism in a Cilk-like syntax . . . . . . . . . . . . .

- Data parallelism arises between different instances of the same statement or of the same block. The data-parallel model has been extensively studied in the case of loop nests [PD96], because it fits the efficient parallelization techniques for numerical algorithms and for repetitive operations on large data sets. We will use a syntax similar to parallel loop declarations in OpenMP, where all variables are supposed shared by default, and an implicit synchronization is added at each loop exit.

To generate data-parallel code, many algorithms use intuitive loop transformations such as loop fission, fusion, interchange, reversal, skewing, index-set shifting and statement reordering. But data parallelism is also well suited to expressing a parallel execution order as a schedule, that is, by assigning an execution date to every instance of a statement. The program scheme of figure 4 gives an idea of the general method to implement such a schedule [PD96]. The concept of the execution front F(t) is fundamental, since it gathers all instances ι that execute at date t.

The first scheduling algorithm is due to Kennedy and Allen [AK87], which has

........................................................................................

for (t=0; t<=L; t++) {        // L is the latency of the schedule
  parallel for (ι ∈ F(t))
    execute instance ι
  // implicit synchronization
}

. . . . . . . . . . . . Figure 4. Data-parallel implementation of a schedule . . . . . . . . . . . .

inspired many later methods. They all rely on relatively coarse abstractions of dependences, such as dependence levels, vectors and cones. Reasonable complexity and ease of implementation in an industrial compiler are the main advantages of these methods; the work of Banerjee [Ban92] and, more recently, of Darte and Vivien [DV97] gives a global view of these algorithms. A general solution was proposed by Feautrier [Fea92]. The proposed algorithm is very useful, but the lack of support for deciding which parameter of the schedule should be optimized is a weak point: should it be the latency L, the number of communications (on a distributed-memory machine), the width of the fronts?

Finally, it is well known that control parallelism is more general than data parallelism, in the sense that any data-parallel program can be rewritten in a control-parallel model without loss of parallelism. This is all the more true for recursive programs, where the distinction between the two paradigms is not very clear [Fea98]. However, for real programs and architectures, data parallelism has long been far better suited to massively parallel computing, mainly because of the overhead associated with the management of parallel activities. Recent advances in hardware and software have nevertheless shown that the situation is changing: excellent results for parallel recursive programs (game simulations such as chess, and sorting algorithms) have been obtained with Cilk, for instance [MF98].

Four chapters structure this thesis before the final conclusion, and they are reflected in the following sections. Section II, summarizing chapter 2, describes a general formalism for program analysis and transformation, and presents the definitions used in the following chapters. The goal is to be able to study a large class of programs, from loop nests with arrays to recursive programs and recursive data structures.

Mathematical results are gathered in section III, summarizing chapter 3; some are well known, such as Presburger arithmetic and formal language theory; some are rather unusual in the fields of parallelism and compilation, such as rational and algebraic transductions; and the others are mostly contributions, such as left-synchronous transductions and approximation techniques for rational and algebraic transductions.

Section IV, summarizing chapter 4, presents our instancewise analysis for recursive programs. It is based on an extension of the notion of induction variable to recursive programs and on new results in formal language theory. Two algorithms, for dependence analysis and for reaching definition analysis, are proposed and experimented on examples.

Parallelization techniques based on memory expansion are the subject of section V, summarizing chapter 5. The first three subsections present techniques to expand loop nests with unrestricted conditional expressions, loop bounds and array subscripts; the fourth subsection is a contribution to the simultaneous optimization of expansion and parallelization parameters; and the fifth subsection presents our results on the expansion and parallelization of recursive programs.

II Models

In order to keep a consistent formalism and vocabulary throughout this thesis, we present a general framework for describing program analyses and transformations. We have put the emphasis on representing program properties at the instance level, while maintaining some continuity with other works in the field. We do not seek to compete with any existing formalism [KU77, CC77, JM82, KS92]: the main objective is to establish convincing results on the relevance and efficiency of our techniques.

After a formal presentation of statement instances and program executions, we define a program model for the rest of this study. We then describe the associated mathematical abstractions, before formalizing the notions of program analysis and code transformation.

During execution, each statement may be executed a number of times, because of the enclosing control structures. To describe data-flow properties as precisely as possible, our techniques must be able to distinguish between these different executions of the same statement. For a statement s, a run-time instance of s is one particular execution of s during the execution of the program. In the case of loop nests, loop counters are often used to name instances, but this technique is not always applicable: a general naming scheme will be studied in section II.3.

Programs sometimes depend on the initial state of memory and interact with their environment, so several executions of the same code are associated with different sets of instances and with incompatible flow properties. We will not need a high degree of formalization here: an execution e of a program P is given by an execution trace of P, that is, a finite or infinite (when the program does not terminate) sequence of configurations (machine states). The set of all possible executions is written E. For a given program, Ie denotes the set of instances associated with execution e ∈ E. Besides representing the execution, the subscript e recalls that the set Ie is "exact": it is not an approximation.

Of course, each statement may contain several (possibly zero) references to memory, one of them possibly being a write (i.e. on the left-hand side). A pair (ι, r) of a statement instance and a reference in the statement is called an access. For a given execution e ∈ E of a program, the set of all accesses is written Ae. It is partitioned into Re, the set of all reads, i.e. the accesses performing a read operation in memory, and We, the set of all writes, i.e. the accesses performing a write operation in memory. In the case of a statement with a memory reference on the left-hand side, the associated write accesses are often identified with the instances of the statement.

Our programs are written in an imperative style, with a C-like syntax (plus some syntactic extensions from C++). Pointers are allowed, and multi-dimensional arrays are accessed with the syntax [i1,...,in], which is not C syntax, for readability. This study mainly addresses first-order control structures, but approximation techniques also allow function pointers to be taken into account [Cou81, Deu90, Har89, AFL95]. Recursive calls, loops, conditional statements and exception mechanisms are allowed; we suppose, however, that gotos have been eliminated beforehand by code restructuring algorithms [ASU86, Bak77, Amm92].

We only consider the following data structures: scalars (booleans, integers, floats, pointers...), non-recursive records of scalars, arrays of scalars or records, trees of scalars or records, trees of arrays and arrays of trees (and so on, recursively). For simplicity, we suppose that arrays are always accessed with their specific syntax (the [] operator), pointer arithmetic being therefore forbidden. Tree structures are accessed through explicit pointers (via the * and -> operators).

The "shape" of data structures is not explicit in C programs: it is not obvious whether a given structure is a list or a tree rather than an arbitrary graph. Additional information given by the programmer can solve the problem [KS93, FM97, Mic95, HHN92], as can compile-time shape analyses of data structures [GH96, SRW96]. Associating pointers with a given instance of a tree structure is not obvious either: this is a special case of alias analysis [Deu94, CBC93, GH95, LRZ93, EGH94, Ste96]. In the following, we suppose that such techniques have been applied by the compiler.

An important question about data structures is how they are built, modified and destroyed. The shape of arrays is often known statically, but dynamic arrays whose size grows at each bound violation are sometimes used (this is the case in section V); pointer-based structures, on the other hand, are allocated dynamically with explicit statements. Feautrier studied this problem in [Fea98] and we take the same view: all data structures are supposed to be built up to their maximal, possibly infinite, extent. The correctness of such an abstraction is guaranteed when run-time insertions and deletions are forbidden. This very strict rule nevertheless admits two exceptions, which we will study after introducing the mathematical abstraction for data structures. The fact remains that many programs unfortunately do not respect this rule.


We first present a naming scheme for statement instances, then we propose a mathematical abstraction of memory cells.

Naming statement instances

From now on, we suppose that each statement bears a label; the alphabet of labels is written Σctrl. Loops deserve special attention: they have three labels, the first representing entry into the loop, the second corresponding to the test of the condition, and the third representing the iteration. 1 Similarly, conditional statements have two labels: one for the condition and the then branch, another for the else branch. We will study the example of figure 5; this procedure computes all solutions of the n-queens problem.

........................................................................................

    int A[n];
P   void Queens (int n, int k) {
I     if (k < n) {
A,a     for (int i=0; i<n; i++) {
B,b       for (int j=0; j<k; j++)
r           ... = A[j] ...;
J         if (...) {
s           A[k] = ...;
Q           Queens (n, k+1);
          }
        }
      }
    }
F   int main () {
      Queens (n, 0);
    }

(Partial control tree not reproduced; two of its branches spell the control
words FPIAAaAaAJs and FPIAAaAaAJQPIAABBr.)

. . . . . . . . . . . Figure 5. The Queens procedure and a (partial) control tree . . . . . . . . . . .

Execution traces are commonly used to name run-time instances. They are generally defined as a path from the entry of the control-flow graph to a given statement². Every execution of a statement is recorded, including function returns. In our setting, execution traces suffer from a number of drawbacks, the most serious one being that a given instance may have several different execution traces depending on the program execution. This forbids using traces to give a unique name to each instance. Our solution relies on another representation of program execution [CC98, Coh99a, Coh97, Fea98]. For a given execution, every instance of a statement lies at the end of a list of nested blocks, loops and procedure calls. To each such list corresponds a word: the concatenation of the labels of the statements. These concepts are illustrated by the tree of Figure 5, whose definition is given later.

1. In C, the test takes place right after loop entry and before each iteration.
2. Regardless of conditional expressions and loop bounds.

Definition 1 The control automaton of a program is a finite-state automaton whose states are the statements, and where a transition from a state q to a state q′ expresses that statement q′ appears in block q. Such a transition is labeled by q′. The initial state is the first statement executed, and all states are final.

The words accepted by the control automaton are called control words. By construction, they describe a rational language L_ctrl included in Σ_ctrl^*.

If I is the union of all instance sets I_e over all executions e ∈ E, there is a natural injection from I into the language L_ctrl of control words. This result allows us to speak of "the control word of an instance". In general, the sets E and I_e — for a given execution e — are not known at compile time. We shall often consider the set of all instances that may possibly execute, regardless of conditional statements and loop bounds. This set is in bijection with the set of control words. We shall thus also speak of "instance w", meaning "the instance whose control word is w".

Notice that some states have only one incoming and one outgoing transition. In practice, one often considers a compressed control automaton in which all such states are eliminated. This transformation has no effect on control words. The automata for the Queens program are shown in Figure 6.

........................................................................................

[Figure 6 shows two finite-state automata over the statement labels F, P, I, A, a, B, b, J, Q, r, s: the full control automaton, and its compressed version in which states with a single incoming and a single outgoing transition have been eliminated.]

Figure 6.a. Control automaton          Figure 6.b. Compressed control automaton for Queens

. . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 6. Control automata . . . . . . . . . . . . . . . . . . . . . . . . . . .

The sequential execution order of a program defines a total order on instances, written <seq. Moreover, one may define a partial textual order <txt on the statements of the program: statements of a same block are ordered by their occurrence, and statements appearing in different blocks are incomparable. In the case of loops, the iteration label executes after all statements of the loop body. For the Queens procedure we have B <txt J <txt a, r <txt b and s <txt Q. This textual order induces a lexicographic order on control words (dictionary order), written <lex. This order is partial on Σ_ctrl^* and on L_ctrl (because of conditional statements, among others). By construction of the textual order, an instance ι′ executes before an instance ι if and only if their respective control words w′ and w satisfy w′ <lex w.
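The lexicographic order induced by <txt can be made concrete; below is a minimal Python sketch (the <txt pairs for Queens come from the text above; the function itself is our own illustration, not the thesis's implementation):

```python
# Partial textual order <txt for the Queens procedure (from the text):
# B <txt J, J <txt a (hence B <txt a by transitivity), r <txt b, s <txt Q.
TXT = {("B", "J"), ("J", "a"), ("B", "a"), ("r", "b"), ("s", "Q")}

def lex_before(w1, w2):
    """Return True if w1 <lex w2, i.e. instance w1 executes before w2.
    The order is partial: None is returned when the words are incomparable."""
    for c1, c2 in zip(w1, w2):
        if c1 == c2:
            continue
        if (c1, c2) in TXT:
            return True
        if (c2, c1) in TXT:
            return False
        return None                      # incomparable labels
    # One word is a proper prefix of the other: the prefix executes first.
    return len(w1) < len(w2) if len(w1) != len(w2) else None
```

On the two control words of Figure 5, the common prefix FPIAAaAaAJ is followed by s and Q respectively, and s <txt Q decides the comparison.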

Finally, the language of control words is easily interpreted as an infinite tree, whose root is named ε and whose edges are each labeled by a statement. Each node then corresponds to the control word obtained by concatenating the labels along the branch starting at the root. Such a tree is called a control tree. A partial control tree for the Queens program is given in Figure 5.

Addressing memory cells

Here we generalize several formalisms we proposed in earlier work [CC98, Coh99a, Coh97, Fea98, CCG96]. It also draws inspiration from rather diverse approaches [Ala94, Mic95, Deu92, LH88].

Unsurprisingly, array elements are indexed by integers or integer vectors. Trees are addressed by concatenating edge labels starting from the root. The address of the root is thus ε, and that of node root->l->r in a binary tree is lr. The set of edge names is written Σ_data; the memory layout of trees is thus described by a rational language L_data ⊆ Σ_data^*.

To handle both trees and arrays, notice that these two structures share the same mathematical abstraction: the monoid (see Section III.2). Indeed, rational languages (tree addressing) are subsets of free monoids under word concatenation, and sets of integer vectors (array indexing) are free commutative monoids under vector addition. The abstraction of a data structure as a monoid is written M_data, and the subset of this monoid associated with the valid elements of the structure will be written L_data.
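This shared abstraction is easy to make concrete; the following Python sketch (illustrative only, all names are ours) shows the two monoid operations side by side:

```python
# Tree addresses: free monoid over edge names, operation = word concatenation.
# Array indices: free commutative monoid over Z^n, operation = vector addition.

EPSILON = ""          # neutral element of the free monoid: the root address
ZERO2 = (0, 0)        # neutral element of the commutative monoid Z^2

def tree_concat(addr1, addr2):
    """Monoid operation for tree addresses (non-commutative)."""
    return addr1 + addr2

def array_add(idx1, idx2):
    """Monoid operation for array indices (commutative)."""
    return tuple(i + j for i, j in zip(idx1, idx2))
```

The non-commutativity of concatenation is exactly what distinguishes the two cases: lr and rl are different tree cells, while (1,0)+(0,1) and (0,1)+(1,0) index the same array cell.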

The case of nested trees and arrays is somewhat more complex, but it reveals the expressiveness of monoid abstractions. However, we shall not say more about these hybrid structures in this French summary. In the following, the abstraction for any data structure of our program model will be a subset L_data of the monoid M_data with operation ·.

It is time to come back to the prohibition of insertions and deletions stated in the previous section. Our formalism is actually able to handle the following two exceptions: since the data flow does not depend on whether the insertion of a node takes place at the beginning of the program or during execution, insertions at the tail of lists and at the leaves of trees are allowed; when deletions are performed at the tail of lists or at the leaves of trees, the mathematical abstraction remains correct but may lead to overly conservative approximations.

Loop nests and arrays

Many numerical applications are implemented as loop nests operating on arrays, notably in signal processing and in scientific or multimedia codes. A great many analysis and transformation results have been obtained for such programs. Our formalism describes this kind of code without difficulty, but it seems more natural and simpler to return to more classical notions for naming instances and addressing memory. Indeed, integer vectors are better suited than control words, because Z-modules have a much richer structure than plain commutative monoids.

Using Parikh mappings [Par66], we have shown that iteration vectors — the classical formalism for naming instances in loop nests — are a particular interpretation of control words, and that the two notions are equivalent in the absence of procedure calls. Finally, statement instances do not reduce to mere iteration vectors, and we introduce the following notations (which generalize the intuitive notations of Section II.1): ⟨S, x⟩ denotes the instance of statement S whose iteration vector is x; ⟨S, x, ref⟩ denotes the access built from instance ⟨S, x⟩ and reference ref.
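The Parikh view of control words can be sketched very simply: the iteration vector is (up to this simplification) the commutative image of the control word restricted to the iteration labels of the enclosing loops. A hypothetical Python helper, assuming one iteration label per loop and no procedure calls inside the nest:

```python
from collections import Counter

def iteration_vector(control_word, loop_labels):
    """Commutative (Parikh) image of a control word restricted to the
    iteration labels of the enclosing loops: for each loop, the number of
    occurrences of its iteration label gives the value of its counter."""
    counts = Counter(control_word)
    return tuple(counts[label] for label in loop_labels)
```

For the Queens example, with a and b as the iteration labels of the two loops, the control word FPIAAaAaAJs maps to the iteration vector (2, 0) — keeping in mind that the equivalence with iteration vectors only holds in the absence of procedure calls.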

Further comparisons between iteration vectors and control words are presented in Section IV.5.

The definition of program executions is not very convenient, since our model uses control words rather than execution traces. We prefer an equivalent view in which a sequential execution e ∈ E of a program is a pair (<seq, f_e), where <seq is the sequential execution order over all possible instances and f_e maps each access to the memory cell it reads or writes. Notice that <seq does not depend on the execution, the sequential order being deterministic. On the contrary, the domain of f_e is exactly the set A_e of accesses associated with execution e. The function f_e is called the access function for execution e of the program [CC98, Fea98, CFH95, Coh99b, CL99]. For short, when speaking of "the program (<seq, f_e)", we mean the set of executions (<seq, f_e) of the program for e ∈ E.

Access conflicts and dependences

Analyses and transformations often require information about "conflicts" between memory accesses. Two accesses a and a′ are in conflict when they access — in read or write mode — the same memory cell: f_e(a) = f_e(a′).

Conflict analysis is very similar to alias analysis [Deu94, CBC93] and also applies to cache analyses [TD95]. The conflict relation — the relation between conflicting accesses — is written ⋈_e for a given execution e. Since f_e and ⋈_e cannot in general be known exactly, access conflict analysis consists in computing a conservative approximation ⋈ of the conflict relation, compatible with every execution of the program:

    ∀e ∈ E, ∀v, w ∈ A_e : f_e(v) = f_e(w) ⟹ v ⋈ w.

For parallelization, one needs sufficient conditions allowing two accesses to execute in any order. These conditions are expressed in terms of dependences: an access a depends on another access a′ if one of them is a write, if they are in conflict — f_e(a) = f_e(a′) — and if a′ executes before a — a′ <seq a. The dependence relation for an execution e is written δ_e; "a depends on a′" is written a′ δ_e a.

    ∀e ∈ E, ∀a, a′ ∈ A_e : a′ δ_e a ⟺_def (a ∈ W_e ∨ a′ ∈ W_e) ∧ a′ <seq a ∧ f_e(a) = f_e(a′).

A dependence analysis again settles for an approximate result δ, such that

    ∀e ∈ E, ∀a, a′ ∈ A_e : a′ δ_e a ⟹ a′ δ a.
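On a concrete execution, the definition of the dependence relation can be applied literally to an instrumented access trace; a small Python sketch (the trace encoding is our own, not the thesis's):

```python
def dependences(trace):
    """trace: list of accesses (instance, kind, cell), kind in {"R", "W"},
    listed in sequential execution order <seq.
    Returns the set of pairs (a', a) such that a' delta_e a: at least one
    of the two is a write, a' executes before a, and both touch the same
    memory cell."""
    deps = set()
    for i, (inst, kind, cell) in enumerate(trace):
        for inst2, kind2, cell2 in trace[:i]:
            if cell2 == cell and "W" in (kind, kind2):
                deps.add((inst2, inst))
    return deps
```

This quadratic enumeration is of course only feasible on a single concrete trace; the point of the static analyses above is precisely to approximate δ_e for all executions at once.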

Reaching definition analysis

In some cases, more precise information than dependences is needed: given a read from memory, one wants to know the instance that produced the value. The read access is called the use and the instance that produced the value is called the reaching definition. It is in fact the last instance — in execution order — in dependence with the use. The function mapping each read access to its unique reaching definition is written σ_e:

    ∀e ∈ E, ∀u ∈ R_e : σ_e(u) = max_{<seq} { v ∈ W_e : v δ_e u }.

A read instance may actually have no reaching definition in the program at hand. We therefore add a virtual instance ⊥ that executes before all instances of the program and initializes every memory cell.
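On a concrete execution, σ_e is simply the last write to the same cell before the use, or ⊥ when there is none; a Python sketch (the trace encoding — a list of (instance, kind, cell) accesses in <seq order — is our own illustration):

```python
BOTTOM = "_|_"   # virtual instance initializing every memory cell

def reaching_definition(trace, use_index):
    """trace: accesses (instance, kind, cell) in sequential order <seq.
    Returns sigma_e(u) for the read access at position use_index: the last
    write to the same cell executing before it, or BOTTOM if none exists."""
    _, kind, cell = trace[use_index]
    assert kind == "R", "reaching definitions are defined for reads only"
    for inst, k, c in reversed(trace[:use_index]):
        if k == "W" and c == cell:
            return inst
    return BOTTOM
```

A BOTTOM result signals a possibly uninitialized read, matching the verification use mentioned below.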

Performing a reaching definition analysis means computing a relation σ that conservatively approximates the functions σ_e:

    ∀e ∈ E, ∀u ∈ R_e, v ∈ W_e : v = σ_e(u) ⟹ v σ u.

One may also see σ as a function computing sets of possible reaching definitions. When ⊥ appears in a set of instances, an uninitialized value may be read. This information can be used for program verification.

In the following, we shall need approximate sets of instances and accesses. We already encountered the notation I, which denotes the set of all possible instances for any execution of a given program:

    ∀e ∈ E : ι ∈ I_e ⟹ ι ∈ I.

Likewise, we shall use conservative approximations A, R and W of the sets A_e, R_e and W_e.

II.5 Parallelization

With the model introduced in Section II.4, parallelizing a program (<seq, f_e) means building a program (<par, f_e^exp), where <par is a parallel execution order, that is a partial order and a sub-order of <seq. Building a new access function f_e^exp from f_e is called memory expansion. Of course, a number of properties must be satisfied by <par and f_e^exp in order to preserve the semantics of the sequential execution.

The purpose of memory expansion is to reduce the number of spurious dependences due to the reuse of the same memory cells. Indirectly, expansion thus exposes more parallelism. Indeed, we consider a dependence relation δ_e^exp for an execution e of the expanded program:

    ∀e ∈ E, ∀a, a′ ∈ A_e : a′ δ_e^exp a ⟺_def (a ∈ W_e ∨ a′ ∈ W_e) ∧ a′ <seq a ∧ f_e^exp(a) = f_e^exp(a′).

To define a parallel order compatible with every execution of the program, one must consider a conservative approximation δ^exp. This approximation is in general induced by the expansion strategy (see for instance Section V.4).

Theorem 1 (correctness of a parallel order) The following condition guarantees that the parallel execution order is correct for the expanded program (it preserves the semantics of the original program):

    ∀(ι_1, r_1), (ι_2, r_2) ∈ A : (ι_1, r_1) δ^exp (ι_2, r_2) ⟹ ι_1 <par ι_2.

Notice that δ_e^exp coincides with δ_e when the program is converted to single-assignment form. We shall thus assume δ^exp = δ when parallelizing such programs.

Finally, we shall not come back here to the techniques used to actually compute a parallel execution order and to generate the corresponding code. Parallelization techniques for recursive programs are relatively recent and will be studied in Section 5.5. As for methods dedicated to loop nests, many scheduling and partitioning — or tiling — algorithms have been proposed; but their description does not seem necessary for a good understanding of the techniques studied in the sequel.

III Mathematical tools

This section gathers background material and contributions about the mathematical abstractions we use. The reader interested in analysis and transformation techniques may simply take note of the main definitions and theorems.

We need to manipulate sets, functions and relations over integer vectors. Presburger arithmetic suits us particularly well, since most interesting questions are decidable in this theory. It is defined from logical formulas built with ∀, ∃, ¬, ∨, ∧, and equality and inequality of affine integer constraints. Satisfiability of a Presburger formula is at the core of most symbolic computations with affine constraints: it is an NP-complete problem of integer linear programming [Sch86]. The algorithms in use are super-exponential in the worst case [Pug92, Fea88b, Fea91], but highly efficient in practice on medium-sized problems.

We mainly use Omega [Pug92] in our experiments and prototype implementations; its syntax for sets, relations and functions is very close to usual mathematical notation. PIP [Fea88b] — the parametric integer linear programming tool — uses another representation for affine relations: the notion of quasi-affine selection tree, or quast for short.

Definition 2 (quast) A quast representing an affine relation is a multi-level conditional expression, in which the predicates are sign tests on quasi-affine forms³ and the leaves are sets of vectors described in Presburger arithmetic extended with ⊥ — which precedes every other vector in the lexicographic order.

3. Quasi-affine forms extend affine forms with integer divisions by constants and remainders of such divisions.

When empty sets appear in the leaves, they differ from the singleton {⊥} and describe the vectors that are not in the domain of the relation. Examples will be given in Section V.

A classical operation on relations consists in computing the transitive closure. The classical algorithms only consider finite graphs. Unfortunately, in the case of affine relations, the closure of an affine relation is generally not an affine relation.

We shall therefore use approximation techniques developed by Kelly et al. and implemented in Omega [KPRS96]. The general idea is to fall back — by approximation — to a subclass, then to compute the closure exactly.

Some concepts belong to the common background of theoretical computer science, such as monoids, rational and algebraic (context-free) languages, finite-state automata, and pushdown automata. The reference books are [HU79] and [RS97a], but many introductions in French also exist. We shall therefore merely fix the notations used in the sequel, by means of a classical example. In a second step, we shall study more original mathematical objects: we shall present the essential results on the class of rational relations between finitely generated monoids.

Formal languages: example and notations

The Lukasiewicz language is a simple example of a one-counter language — i.e., one recognized by a one-counter automaton — a subclass of algebraic (context-free) languages. The Lukasiewicz language Ł over an alphabet {a, b} is generated by the axiom Ł and the grammar whose productions are

    Ł → aŁŁ | b.

This language is related to the Dyck languages [Ber79], its first words being

    b, abb, aabbb, ababb, aaabbbb, aababbb, ...

Encoding a counter on a stack is done as follows: three symbols are used, Z is the bottom-of-stack symbol, I encodes positive numbers, and D encodes negative numbers; ZIⁿ thus represents the integer n, ZDⁿ represents −n, and Z encodes counter value 0. Figure 7 shows a pushdown automaton accepting the language Ł together with its interpretation in terms of a counter.
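The counter interpretation of Ł can be checked directly: reading a word with the counter holding the number of pending Ł nonterminals (starting at 1, +1 on a since aŁŁ consumes one and produces two, −1 on b), a word belongs to Ł iff the counter never reaches 0 before the end and is exactly 0 at the end. A quick Python sketch of this one-counter recognition (our own illustration of the automaton of Figure 7):

```python
def in_lukasiewicz(word):
    """One-counter recognition of the Lukasiewicz language over {a, b}:
    the counter holds the number of pending L nonterminals."""
    pending = 1                   # the axiom: one L to derive
    for c in word:
        if pending == 0:          # counter exhausted before the end of the word
            return False
        if c == "a":
            pending += 1          # L -> aLL: one L consumed, two produced
        elif c == "b":
            pending -= 1          # L -> b: one L consumed
        else:
            return False          # letter outside the alphabet
    return pending == 0
```

The first words listed above are accepted, and any proper prefix of a word of Ł is rejected.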

A natural generalization of one-counter languages consists in allowing several counters: one then obtains a Minsky machine [Min67]. However, two-counter automata already have the same expressive power as Turing machines, and most interesting questions thus become undecidable. Still, by imposing a few restrictions on the family of multi-counter languages, recent decidability results have been obtained. The study of these objects seems rich in applications, notably in the case of the work of Comon and Jurski [CJ98].

........................................................................................

[Figure 7 shows a pushdown automaton accepting Ł, with transitions such as a, I → II; a, Z → ZI; b, I → ε, and the associated one-counter automaton: a increments the counter, b decrements it when it is positive, and acceptance requires a final counter value of 0.]

Figure 7.a. Pushdown automaton          Figure 7.b. Associated one-counter automaton

. . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 7. Example automata . . . . . . . . . . . . . . . . . . . . . . . . . . .

Rational relations

We merely recall a few facts; see [AB88, Eil74, Ber79] for more details. Let M be a monoid. A subset R of M is a recognizable set if there exist a finite monoid N, a morphism α from M into N and a subset P of N such that R = α⁻¹(P).

These sets generalize rational languages while keeping the structure of a Boolean algebra: indeed, the class of recognizable sets is closed under union, intersection and complement. Recognizable sets are also closed under concatenation, but not under the star operation. The class of rational sets, on the other hand, is; its definition extends that of rational languages: given a monoid M, the class of rational sets of M is the smallest family of subsets of M containing ∅ and the singletons {m} for m ∈ M, closed under union, concatenation and the star operation.

In general, rational sets are not closed under complement and intersection. If M is of the form M1 × M2, where M1 and M2 are two monoids, a recognizable subset of M is called a recognizable relation, and a rational subset of M is called a rational relation. The following result describes the "structure" of recognizable relations.

Theorem 2 (Mezei) A recognizable relation R ⊆ M1 × M2 is a finite union of sets of the form K × L where K and L are rational sets of M1 and M2.

In the sequel we shall only consider recognizable and rational sets that are relations between finitely generated monoids.

Transductions give a "more functional" view of recognizable and rational relations. From a relation R between monoids M1 and M2, one defines a transduction τ from M1 into M2 as a function from M1 into the set P(M2) of subsets of M2, such that v ∈ τ(u) iff u R v. A transduction is recognizable (resp. rational) iff its graph is a recognizable (resp. rational) relation. Both classes are closed under inversion, and the class of recognizable transductions is also closed under composition.

The class of rational transductions is also closed under composition in the case of free monoids: this is the theorem of Elgot and Mezei [EM65, Ber79], fundamental for dependence analysis (Section IV).

Theorem 3 (Elgot and Mezei) If A, B and C are alphabets, and τ1 : A* → B* and τ2 : B* → C* are rational transductions, then τ2 ∘ τ1 : A* → C* is a rational transduction.

The "mechanical" representation of rational relations and transductions is called a rational transducer; these naturally extend finite-state automata with an "output tape":

Definition 3 (rational transducer) Given an "input" monoid M1 and an "output" monoid M2⁴, a rational transducer T = (M1, M2, Q, I, F, E) is defined by a finite set of states Q, a set of initial states I ⊆ Q, a set of final states F ⊆ Q, and a finite set of transitions (or edges) E ⊆ Q × M1 × M2 × Q.

Kleene's theorem ensures that the rational relations of M1 × M2 are exactly the relations recognized by a rational transducer. We write |T| for the transduction recognized by a transducer T: we say that T realizes the transduction |T|. When the monoids M1 and M2 are free, the neutral element is the empty word, written ε.
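A rational transducer over free monoids can be simulated directly; the following Python sketch (our own encoding) decides whether a pair (u, v) is recognized, by exhaustive exploration of configurations. The example transducer realizes the prefix relation, u R v iff u is a prefix of v, in the spirit of Figure 8.a:

```python
def accepts_pair(transitions, initial, finals, u, v):
    """transitions: set of (state, input_word, output_word, next_state).
    Explores configurations (state, i, j), where i and j count the letters
    of u and v consumed so far; accepts when a final state is reached with
    both words fully consumed."""
    seen = {(initial, 0, 0)}
    frontier = [(initial, 0, 0)]
    while frontier:
        q, i, j = frontier.pop()
        if q in finals and i == len(u) and j == len(v):
            return True
        for q1, a, b, q2 in transitions:
            if q1 == q and u[i:i + len(a)] == a and v[j:j + len(b)] == b:
                nxt = (q2, i + len(a), j + len(b))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return False

# Prefix-order transducer over {a, b}: copy letters in state 1,
# then pad the output with arbitrary letters in state 2.
PREFIX = {(1, "a", "a", 1), (1, "b", "b", 1),
          (1, "", "", 2),
          (2, "", "a", 2), (2, "", "b", 2)}
```

The `seen` set guarantees termination even in the presence of ε-transitions, since the configuration space is finite for fixed u and v.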

Theorem 4 The following problems are decidable for rational relations: whether two words are in relation (in linear time), emptiness, finiteness.

Let R and R′ be two rational relations over alphabets A and B with at least two letters. It is undecidable whether R ∩ R′ = ∅, R ⊆ R′, R = R′, R = A* × B*, whether (A* × B*) − R is finite, and whether R is recognizable.

A few interesting results concern transductions that are partial functions. A rational function τ : M1 → M2 is a rational transduction that is a partial function, i.e., such that Card(τ(u)) ≤ 1 for every u ∈ M1. Given two alphabets A and B, it is decidable whether a rational transduction from A* into B* is a partial function (in O(Card(Q)⁴) [Ber79, BH77]). One may also decide whether a rational function is included in another and whether two rational functions are equal.

Among transducers realizing rational functions, we are especially interested in those that can be "computed on the fly" while reading their input. Let A and B be two alphabets. A transducer is sequential when it is labeled over A × B* and its input automaton (obtained by omitting the outputs) is deterministic. A sequential transducer realizes a rational function. This notion of "on-the-fly computation" is a bit too restrictive; one rather considers the following extension:

Definition 4 (subsequential transducer) For two alphabets A and B, a subsequential transducer (T, φ) over A × B* is a pair where T is a sequential transducer with final state set F, and φ : F → B* is a function. The function τ realized by (T, φ) is defined as follows: for u ∈ A*, the value τ(u) is defined if there is a path in T accepting (u|v) and ending in a final state q; in that case τ(u) = v·φ(q).

In other words, φ appends a word to the end of the output of a sequential transducer.
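Subsequential evaluation is deterministic, with φ contributing one final word; a Python sketch (the encoding and the example — copying the input and appending a parity marker — are ours):

```python
def run_subsequential(delta, output, phi, initial, u):
    """delta: deterministic transitions (state, letter) -> state;
    output: (state, letter) -> word emitted on that transition;
    phi: final-output function on final states.
    Returns the image of u, or None when u is rejected."""
    q, out = initial, ""
    for c in u:
        if (q, c) not in delta:
            return None           # no transition: u is not in the domain
        out += output[(q, c)]
        q = delta[(q, c)]
    if q not in phi:
        return None               # q is not a final state
    return out + phi[q]           # phi appends the final word

# Example over A = {a, b}: copy the input, then append "#even" or "#odd"
# according to the parity of the number of a's read.
DELTA  = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
OUTPUT = {(q, c): c for (q, c) in DELTA}
PHI    = {0: "#even", 1: "#odd"}
```

The marker depends on the final state only, which is exactly the extra expressive power that φ adds over plain sequential transducers.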

Starting from a proof by Choffrut [Cho77], Béal and Carton [BC99b] proposed a polynomial algorithm to decide whether a rational function is subsequential, and another one to decide whether a subsequential function is sequential. They also proposed a polynomial algorithm to find a subsequential realization of a rational function, when one exists.

4. The monoids M1 and M2 are often omitted from the definition.

Rational relations are not closed under intersection, yet this operation is indispensable for dependence analysis. Feautrier [Fea98] proposed a "semi-algorithm" to answer the undecidable question of the emptiness of an intersection of rational relations: the algorithm is only guaranteed to terminate when the intersection is not empty. Since we want to compute this intersection, we take a different approach: we fall back — through conservative approximations — to a class of rational relations with a Boolean algebra structure (i.e., closed under union, intersection and complement).

Recognizable relations do form a Boolean algebra, but we have built a more general class: left-synchronous relations. This class was studied independently by Frougny and Sakarovitch [FS93], but our representation is different, the proofs are new and new results have been obtained. This work is the result of a collaboration with Olivier Carton (Université de Marne-la-Vallée).

We recall a classical definition, equivalent to the length-preservation property for input and output words: a rational transducer over alphabets A and B is synchronous if it is labeled over A × B. We extend this notion as follows.

Definition 5 (left synchronism) A rational transducer over alphabets A and B is left-synchronous if it is labeled over (A × B) ∪ (A × {ε}) ∪ ({ε} × B), and only transitions labeled over A × {ε} (resp. {ε} × B) may follow transitions labeled over A × {ε} (resp. {ε} × B).

A rational relation or transduction is left-synchronous if it can be realized by a left-synchronous transducer. A rational transducer is left-synchronizable if it realizes a left-synchronous relation.

Figure 8 shows left-synchronous transducers over an alphabet A realizing the prefix order and the lexicographic order (<txt being a particular order on A).

........................................................................................

For the following transducers, x and y stand for ∀x ∈ A and ∀y ∈ A, respectively.

[Figure 8 shows two left-synchronous transducers: the prefix-order transducer copies its input with x|x transitions, then pads the output with ε|y transitions; the lexicographic-order transducer additionally branches on x|y with x <txt y before consuming the remaining input with x|ε transitions or producing the remaining output with ε|y transitions.]

Figure 8.a. Prefix order          Figure 8.b. Lexicographic order

. . . . . . . . . . . . . . . Figure 8. Examples of left-synchronous transducers . . . . . . . . . . . . . . .

It is known that synchronous transducers form a Boolean algebra⁵.

Theorem 5 The class of left-synchronous relations forms a Boolean algebra: it is closed under union, intersection and complement. Moreover, recognizable relations are left-synchronous; if S is synchronous and T is left-synchronous, then ST is left-synchronous; if T is left-synchronous and R is recognizable, then TR is left-synchronous. Finally, the class of left-synchronous relations is closed under composition.
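In the purely synchronous (letter-to-letter) case, closure under intersection reduces to a product construction on the transducers; a minimal Python sketch of this classical construction (encodings and example relations are our own):

```python
from itertools import product

def intersect_synchronous(t1, t2):
    """t1, t2: (transitions, initial, finals), where transitions is a set of
    letter-to-letter edges (state, in_letter, out_letter, next_state).
    Returns the product transducer recognizing the intersection."""
    (e1, i1, f1), (e2, i2, f2) = t1, t2
    edges = {((q1, q2), a1, b1, (r1, r2))
             for (q1, a1, b1, r1) in e1
             for (q2, a2, b2, r2) in e2
             if a1 == a2 and b1 == b2}          # both must agree on the label
    return edges, (i1, i2), set(product(f1, f2))

def accepts(t, u, v):
    """Letter-to-letter acceptance of the pair (u, v)."""
    edges, init, finals = t
    if len(u) != len(v):
        return False
    states = {init}
    for a, b in zip(u, v):
        states = {r for q in states for (q1, a1, b1, r) in edges
                  if q1 == q and a1 == a and b1 == b}
    return bool(states & finals)

# Example relations over {a, b}: the identity, and the relation mapping
# every input letter to the output letter a.
ID    = ({(0, "a", "a", 0), (0, "b", "b", 0)}, 0, {0})
ALL_A = ({(0, "a", "a", 0), (0, "b", "a", 0)}, 0, {0})
```

The intersection of ID and ALL_A only keeps the a|a edge, i.e., the pairs (u, u) where u contains only a's.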

It is decidable whether a rational relation is synchronous [Eil74], but it is not decidable whether it is recognizable [Ber79], and we have shown that the same undecidability holds for left-synchronous relations.

We are nevertheless interested in particular cases where a rational relation can be proven left-synchronous. To this end, recall the notion of transmission rate of a path labeled by (u, v): it is the ratio |v|/|u| ∈ Q+ ∪ {+∞}. If T is a left-synchronous transducer, the cycles of T can only have three possible transmission rates: 0, 1 and +∞. All cycles of a same strongly connected component must have the same transmission rate, only components of rate 0 may follow those of rate 0, and only components of rate +∞ may follow those of rate +∞. There is a partial converse:

Theorem 6 If the transmission rate of every cycle of a rational transducer is 0, 1 or +∞, and if no cycle of rate 1 follows a cycle of rate different from 1, then the transducer is left-synchronizable.

We can thus "resynchronize" a certain class of left-synchronizable transducers, namely those satisfying the hypotheses of Theorem 6. Building on an algorithm by Béal and Carton [BC99a], one can write a resynchronization algorithm to compute left-synchronous approximations of rational relations. This technique will be used in Section III.5.

We conclude with decidability properties, essential for dependence and reaching definition analysis.

Lemma 1 Let R and R′ be left-synchronous relations over alphabets A and B. It is decidable whether R ∩ R′ = ∅, R ⊆ R′, R = R′, R = A* × B*, and whether (A* × B*) − R is finite.

We are still working on the decidability of recognizability among left-synchronous relations.

We sometimes need more expressive power than rational relations provide. We shall therefore use the notion of algebraic — or context-free — relation, which naturally extends that of algebraic language. These relations are defined from pushdown transducers:

Definition 6 (pushdown transducer) Given two alphabets A and B, a pushdown transducer T = (A*, B*, Γ, γ0, Q, I, F, E) consists of a stack alphabet Γ⁶, a non-empty word γ0 in Γ+ called the initial stack word, a finite set of states Q, a set of initial states I ⊆ Q, a set of final states F ⊆ Q, and a finite set of transitions (or edges) E ⊆ Q × Γ* × A* × B* × Γ* × Q.

5. All the properties studied in this section have constructive proofs.
6. The alphabets A and B are often omitted from the definition.

The notion of a pushdown transducer realizing a relation is defined in the same way as that of a pushdown automaton accepting a language.

Definition 7 (algebraic relation) The class of relations realized by pushdown transducers is called the class of algebraic relations.

Naturally, algebraic transductions are the functional view of algebraic relations.

Theorem 7 Algebraic relations are closed under union, concatenation and the star operation. They are also closed under composition with rational transductions. The image of a rational language by an algebraic transduction is a context-free language.

The following questions are decidable for algebraic relations: whether two words are related (in linear time), emptiness, and finiteness.

There are very few results on algebraic transductions that are partial functions, called algebraic functions. In particular, we know of no subclass of these functions that is "computable on the fly" in the sense of subsequential functions.

Nevertheless, an interesting subclass of algebraic relations is the class of one-counter relations, realized by one-counter transducers — whose definition parallels that of one-counter automata. One may also consider more than one counter, but this yields the expressive power of Turing machines. This class is of interest when we have to compose rational transductions over non-free monoids (where the theorem of Elgot and Mezei no longer applies).

Theorem 8 Let A and B be two alphabets and n a positive integer. If τ1 : A* → Zn and τ2 : Zn → B* are rational transductions, then τ2 ∘ τ1 : A* → B* is an n-counter transduction.

This theorem will be used for dependence analysis, mostly with n = 1. Moreover, an important result can be derived from its proof:

Proposition 1 Let A and B be two alphabets and n a positive integer. Let τ1 : A* → Zn and τ2 : Zn → B* be rational transductions, and let T be an n-counter transducer realizing τ2 ∘ τ1 : A* → B* (computed with Theorem 8). Then the rational transducer underlying T — obtained by omitting the counter manipulations — is recognizable.

This result guarantees closure under intersection with any rational transduction, thanks to the following result:

Proposition 2 Let R1 be an algebraic relation realized by a pushdown transducer whose underlying rational transducer is left-synchronous, and let R2 be a left-synchronous relation. Then R1 ∩ R2 is an algebraic relation, and one can build a pushdown transducer realizing it whose underlying rational transducer is left-synchronous.

Finally, Theorem 8 extends to the free partially commutative monoids associated with nestings of trees and arrays, which we do not discuss in this summary.

34 PRESENTATION EN FRANCAIS

Intersection is used extensively by our program analysis and transformation techniques. Rational and algebraic relations are not closed under this operation, but we have identified subclasses that are. We show here how to fall back to these subclasses by applying conservative approximations.

Several methods allow rational relations to be approximated by recognizable ones. The general idea is to take the Cartesian product of the input and the output. More precise techniques perform this operation for each pair of an initial and a final state, and for each strongly connected component. The result is always a recognizable relation, thanks to Theorem 2.

Approximation by left-synchronous relations is based on the resynchronization algorithm, hence on Theorem 6. When the algorithm fails, a strongly connected component is replaced by a recognizable approximation and the algorithm is restarted. Optimizations make it possible to apply the resynchronization algorithm only once.

Approximating algebraic — or multi-counter — relations can be done in two ways: either the stack — or the counters — are approximated by additional states, or the underlying rational transducer is approximated by a left-synchronous one. Both techniques will be used in the following.

After a number of works on instancewise analysis of recursive programs [CCG96, Coh97, Coh99a, Fea98, CC98], we present a major evolution, with a more general formalism and a fully automated process. Beyond the theoretical goal of achieving the highest possible precision, we will see in Section V.5 how this information improves automatic parallelization techniques for recursive programs.

Starting from real examples, we discuss the computation of induction variables, then present the dependence and reaching definition analyses themselves. This section ends with a comparison with static analyses and with recent work on instancewise analysis of loop nests.

We study two examples to give an intuitive overview of our instancewise analysis for recursive structures. A third example is presented in the thesis, but it uses a structure hybrid between trees and arrays which we do not discuss here.

First example: the Queens program

We consider again the Queens procedure presented in Section II.3. The program is reproduced in Figure 9 together with a partial control tree.

We study dependences between the run-time instances of the statements. Consider for instance the instance FPIAAaAaAJQPIAABBr of statement r, represented by a star in Figure 9.b. Variable j is initialized to 0 by statement B and incremented by statement b, so we know that the value of j at FPIAAaAaAJQPIAABBr is 0; this instance therefore reads A[0]. The conflicting writes to A[0] are represented in the figure

IV. INSTANCEWISE ANALYSIS OF RECURSIVE PROGRAMS

........................................................................................

   int A[n];

P  void Queens (int n, int k) {
I    if (k < n) {
A      for (int i = 0; i < n; i++) {
B        for (int j = 0; j < k; j++)
r          ... = ... A[j] ...;
J        if (...) {
s          A[k] = ...;
Q          Queens (n, k+1);
         }
       }
     }
   }

   int main () {
F    Queens (n, 0);
   }

Figure 9.a. The Queens procedure

[Figure 9.b. Control tree (compressed): the instances FPIAAJs, FPIAAaAJs and FPIAAaAaAJs (squares) write A[0]; the instance FPIAAaAaAJQPIAABBr (star) reads A[0].]

. . . . . . . . . . . . . . . . Figure 9. The Queens procedure and a control tree . . . . . . . . . . . . . . . .

by squares. Variable k is initialized to 0 on the first call to Queens, then incremented by the recursive call Q. The instances FPIAAJs, FPIAAaAJs and FPIAAaAaAJs therefore write into A[0], and are thus in dependence with FPIAAaAaAJQPIAABBr.

Which of these definitions reaches FPIAAaAaAJQPIAABBr? Looking at the figure again, we notice that the instance FPIAAaAaAJs — the black square — executes last. Moreover, we can guarantee that this instance is executed whenever the read FPIAAaAaAJQPIAABBr executes. The other writes are therefore overwritten by FPIAAaAaAJs, which is thus the reaching definition of FPIAAaAaAJQPIAABBr. We will show later how to generalize this intuitive approach.

Second example: the BST program

Consider now the BST procedure of Figure 10. This procedure swaps node values to convert a binary tree into a binary search tree. Tree nodes are referenced through pointers, and p->value holds the integer value of node p. This program carries few dependences: the only ones are anti-dependences between certain statement instances inside the blocks I1 or J1. Consequently, reaching definition analysis yields a very simple result: the only reaching definition of every read access is ⊥.

Section II.4 defined the notion of an access function, which maps accesses to the memory cells they read or write. We now need to make these functions explicit, and to this end we introduce the notion of an induction variable. In the presence of


........................................................................................

P  void BST (tree *p) {
I1   if (p->l != NULL) {
L      BST (p->l);
I2     if (p->value < p->l->value) {
a        t = p->value;
b        p->value = p->l->value;
c        p->l->value = t;
       }
     }
J1   if (p->r != NULL) {
R      BST (p->r);
J2     if (p->value > p->r->value) {
d        t = p->value;
e        p->value = p->r->value;
f        p->r->value = t;
       }
     }
   }

   int main () {
F    if (root != NULL) BST (root);
   }

[Figure 10. The BST procedure and a control tree.]

recursive procedures, this notion, historically tied to loop nests [Wol92], must be redefined. To simplify the exposition, we assume that each variable has a unique distinguishing name; we may thus speak unambiguously of "variable i". Our definition of induction variables is the following:

– integer function arguments that are set, at each recursive call, to a constant or to an integer induction variable plus a constant;

– integer loop counters translated by a constant at each iteration;

– pointer function arguments that are set to a constant or to a possibly dereferenced pointer induction variable.

The analysis requires a few additional hypotheses on top of the program model of Section II.2: the analyzed data structures must be declared global; array subscripts must be affine functions of integer induction variables and symbolic constants; and tree accesses must dereference a pointer induction variable or a constant.

Prior to the dependence analysis, we must compute the access functions in order to describe the possible conflicts. Let s be a statement and w an instance of s. The value of variable i at instance w is defined as the value of i immediately after the execution of instance w of statement s. This value is written [[i]](w).

In general, the value of a variable at a given control word depends on the execution. However, thanks to the restrictions imposed on the program model, induction variables are completely determined by control words. One shows that for two different executions e and e′, the values of an induction variable at a given control word are identical. The access functions for different executions therefore coincide, and in the following we consider a single access function f, independent of the execution.

The following result shows that induction variables are described by recurrence equations:

Lemma 2 Consider the monoid (Mdata, ·) abstracting the data structure at hand, a statement s, and an induction variable i. The effect of statement s on the value of i is described by one of the following equations:

either ∃δ ∈ Mdata, j ∈ induc : ∀us ∈ Lctrl : [[i]](us) = [[j]](u) · δ

or ∃δ ∈ Mdata : ∀us ∈ Lctrl : [[i]](us) = δ

where induc is the set of induction variables of the program, including i.

The result for the Queens procedure is as follows. We only consider the induction variables j and k, the only ones useful for dependence analysis.

From the main call F: [[Arg(Queens, 2)]](F) = 0

From the procedure P: ∀uP ∈ Lctrl : [[k]](uP) = [[Arg(Queens, 2)]](u)

From the recursive call Q: ∀uQ ∈ Lctrl : [[Arg(Queens, 2)]](uQ) = [[k]](u) + 1

From the loop entry B: ∀uB ∈ Lctrl : [[j]](uB) = 0

From the loop iteration b: ∀ub ∈ Lctrl : [[j]](ub) = [[j]](u) + 1

Arg(proc, num) stands for the num-th actual argument of procedure proc, and all other statements leave the variables unchanged.

We designed an algorithm to automatically build such a system describing the evolution of the induction variables of a program. Combined with the following result, this algorithm automatically constructs the access function.

Theorem 9 The access function f — mapping each possible access in A to the memory cell it reads or writes — is a rational function from Σctrl* into Mdata.

The result for the Queens program is the following:

(ur | f(ur, A[j])) = (FPIAA|0) ((JQPIAA|0) + (aA|0))* (BB|0) (bB|1)* (r|0)

(us | f(us, A[k])) = (FPIAA|0) ((JQPIAA|1) + (aA|0))* (Js|0)

We applied the same technique to the BST program:

∀s ∈ {I2, a, b} : (us | f(us, p->value)) = (FP|ε) ((I1LP|l) + (J1RP|r))* (I1I2|ε)

∀s ∈ {I2, b, c} : (us | f(us, p->l->value)) = (FP|ε) ((I1LP|l) + (J1RP|r))* (I1I2|l)

∀s ∈ {J2, d, e} : (us | f(us, p->value)) = (FP|ε) ((I1LP|l) + (J1RP|r))* (J1J2|ε)

∀s ∈ {J2, e, f} : (us | f(us, p->r->value)) = (FP|ε) ((I1LP|l) + (J1RP|r))* (J1J2|r)


Using the access functions, our first goal is to compute the relation between conflicting memory accesses. We cannot hope for an exact result in general, but we can exploit the fact that the access function f does not depend on the execution. The approximate conflict relation we compute is the following:

∀u, v ∈ Lctrl : u and v are in conflict ⟺def v ∈ f⁻¹(f(u)).

By the theorem of Elgot and Mezei (Section III.2) and Theorem 8, the composition of f⁻¹ and f is either a rational transduction or a multi-counter transduction. The number of counters corresponds to the dimension of the accessed array, and a conservative approximation brings us back to a single counter.

Note that testing the emptiness of the conflict relation is equivalent to pointer alias analysis [Deu94, Ste96], and the emptiness of a rational or algebraic relation is decidable.

To build the transducer describing the dependences, we must first restrict the conflict relation to pairs of accesses involving at least one write, and then intersect with the lexicographic order. Using the techniques of Sections III.3, III.4 and III.5, a conservative approximation of the dependence relation can always be computed. It is realized by a one-counter transducer in the case of arrays, and by a rational transducer in the case of trees. Moreover, thanks to Proposition 1, the intersection with the lexicographic order involves no approximation in the array case.

Computing reaching definitions from the approximate dependence information alone makes a precise result very hard to obtain. Beyond the first step of restricting the dependence relation to flow dependences, one must use additional properties of the data flow. Our main technique is based on a structural property of programs:

Definition 8 (ancestor) Define Σunco as the subset of Σctrl consisting of all labels of blocks that are neither conditionals nor loop bodies, plus all (unguarded) procedure calls — that is, the blocks whose execution is unconditional.

Let r and s be two statements in Σctrl, and let u be a strict prefix of a control word wr ∈ Lctrl (an instance of r). If v ∈ Σunco* is such that uvs ∈ Lctrl, then uvs is called an ancestor of wr.

This definition is easily understood on a control tree such as the one of Figure 9.b: the black square FPIAAaAaAJs is an ancestor of FPIAAaAaAJQPIAABBr, but the adjacent gray squares are not. Ancestors enjoy the following two properties:

1. the execution of wr implies that of u, which lies on the path from the root to node wr;

2. the execution of u implies that of uvs, since v ∈ Σunco*.

Thus, if an instance executes, so do all its ancestors. To apply this result to reaching definition analysis, we first identify the instances whose execution is guaranteed by the ancestor property, then apply transition-elimination rules to the flow dependence transducer. We obtain a transducer realizing an approximation of the reaching definitions.

Integrating these ideas into the reaching definition analysis algorithm is rather technical, so we stop here in this summary.


Let us first come back to tree structures. The access function for the BST program is a rational transducer, shown in Figure 11.

........................................................................................

[Figure 11. Rational transducer realizing the access functions of BST; its transitions carry input/output pairs such as LP|l, FP|ε and RP|r.]

........................................................................................

The conflict transducer realizing the conflict relation is always rational in the case of trees. When the result is a left-synchronous transducer, the dependences can be computed without approximation; otherwise, an approximation of the dependence relation by a left-synchronous transducer is necessary. The result for BST is shown in Figure 12.

........................................................................................

[Figure 12. Left-synchronous transducer realizing the dependences of BST.]

........................................................................................

This result reflects the fact that the dependences only relate instances of statements of a same block I1 or J1. We will see that it allows the program to be parallelized.


Let us now turn to the array case. The access function for the Queens program is described by a rational transducer from Σctrl* into Mdata = Z, given in Figure 13.

........................................................................................

[Figure 13. Rational transducer realizing the access functions of Queens.]

........................................................................................

Theorem 8 is used to compute a one-counter transducer realizing the conflict relation. To obtain the dependence relation, the resynchronization algorithm is applied to the underlying rational transducer (which is recognizable); this computation is always exact. The result for Queens is given in Figure 14.

........................................................................................

. . . . . . . . . Figure 14. One-counter transducer for the flow dependences . . . . . . . . .

We can now perform the reaching definition analysis: using additional information on the conditional statements of the Queens program, one shows that only ancestors of an instance of r can be its reaching definitions. This very strong property allows every transition that does not lead to an ancestor to be removed from the dependence transducer. The result is given in Figure 15. One can easily show that this result is exact: a unique reaching definition is computed for each read access.

........................................................................................

. . . . . . . . . . . . . . . . . . . Figure 15. One-counter transducer for the reaching definitions . . . . . . . . . . . . . . . . . . .

Among the restrictions of the program model, some can be removed by means of preliminary transformations. Moreover, many restrictions seem likely to be lifted in future versions of the analysis, through appropriate approximations. There remains, however, one very important restriction that is firmly rooted in our formalism, and we see no general method to do without it: insertions and deletions in trees are only allowed at the leaves.

Static dependence and reaching definition analyses generally obtain similar results, whether they are based on abstract interpretation [Cou81, JM82, Har89, Deu94] or on other data-flow analysis formalisms [LRZ93, BE95, HHN94, KSV96]. An interesting survey of the static analyses useful in parallelization is given in [RR99]. Comparing our technique with these analyses is easy: none of them works at the instance level. None reaches the precision needed to identify which instance of which statement is in conflict, in dependence, or is a possible reaching definition. These analyses are nonetheless useful to lift some restrictions of our program model, and to compute properties useful for instancewise reaching definition analysis. It is more interesting to compare these analyses in terms of applications to parallelization; see Section V.5.

Let us now compare with instancewise analyses for loop nests, for example FADA [BCF97, Bar98]. On the common intersection of their program models, the overall outcome is not surprising: FADA's results are much more precise. Indeed, we only use information about conditional statements through external analyses; additional approximations are needed for multi-dimensional arrays; rational and algebraic transducers lack the expressive power to handle integer parameters (a single counter can be described); and fundamental operations such as intersection sometimes require approximations. Still, some points are positive: the exactness of the result can be decided in polynomial time on rational transducers; emptiness is always decidable, which enables automatic detection of uninitialized variables; in the tree case, dependence tests operate on rational languages of control words, which is very useful for parallelization; finally, in the array case, dependence tests amount to intersecting a rational language with a context-free language.

V Expansion and Parallelization

Research on memory expansion mostly addresses affine loop nests. The most common techniques are conversion to single assignment [Fea91, GC95, Col98], privatization [MAL93, TP93, Cre96, Li92], and many optimizations for efficient memory management [LF98, CFH95, CDRV97, QR99]. When the control flow is not predictable at compile time or when array subscripts are not affine, the problem of restoring the data flow becomes crucial, and the convergence of interests with the SSA (static single-assignment) framework [CFR+91] is very clear. Starting from simple examples, we study the problems specific to non-affine loop nests, and propose single-assignment conversion algorithms. New techniques for expansion and for optimizing memory usage are then proposed for the automatic parallelization of irregular codes.

The principles of parallel computation in the presence of recursive procedures are very different from those of loop nests, and existing parallelization methods generally rely on statementwise dependence tests, whereas our analysis describes the dependence relation at the instance level! We show that this very precise information significantly improves the classical parallelization techniques. We also study the possibility of expanding memory in recursive programs, and this study ends with experimental results.

Single-assignment form conversion (SA) is one of the most classical expansion methods. It corresponds to the extreme case where each memory cell is written at most once during the execution. It thus differs from static single-assignment (SSA) [CFR+91, KS98], where expansion is limited to variable renaming.

The idea is to replace each assignment to a data structure D by an assignment to a new structure Dexp whose elements have the same type as those of D, and are in bijection with the set W of all write accesses possibly executed. In a second step, read references must be updated accordingly: this is what we call restoring the data flow. Instancewise reaching definitions are used for this purpose: for a given execution e ∈ E, a read reference ⟨ι, ref⟩ to D must be replaced by an access to the element of Dexp associated with σe(⟨ι, ref⟩). Since only an approximation σ of the reaching definitions is available, this technique applies only when σ(⟨ι, ref⟩) is a singleton. Otherwise, code must be generated to restore the data flow at run time. This code is generally represented by a φ function, whose argument is the set σ(⟨ι, ref⟩) of possible reaching definitions.

To generate the dynamic restoration code associated with the φ functions, we use an additional data structure in bijection with Dexp, written @Dexp. Two pieces of information must be stored in @Dexp: the address of the memory cell written in the original program, and the identity of the last instance that wrote a value into that cell. Since the program is in single-assignment form, the instance is already described by the element of Dexp itself: @Dexp therefore only has to hold memory cell addresses. This structure is used as follows: @Dexp is initialized to NULL; then, on each assignment to Dexp, the address of the memory cell written in the original program is stored into @Dexp; finally, a reference φ(set) is implemented as a maximum computation — according to the sequential order — over all ι ∈ set such that @Dexp[ι] equals the address of the memory cell read in the original program.

Instancewise reaching definition analysis is the basis of data-flow restoration [Col98]: precise results not only reduce the number of φ functions, but also simplify their arguments, and thus optimize the run-time maximum computations. Note also that evaluating σ at run time may itself turn out to be costly, even in the absence of φ functions. In the loop nest case, however, this overhead is only due to the implementation of the quast associated with σ; polyhedron scanning techniques [AI91] can optimize the generated code. The example of Figure 16 illustrates these remarks. For recursive programs, we will see that the problem of computing σ is more delicate.

........................................................................................

   double A[N];
T  A[0] = 0;
   for (i=0; i<N; i++)
     for (j=0; j<N; j++) {
S      A[i+j] = ...;
R      A[i] = ... A[i+j-1] ...;
     }

Figure 16.a. Original program

   double A[N], AT, AS[N, N], AR[N, N];
T  AT = 0;
   for (i=0; i<N; i++)
     for (j=0; j<N; j++) {
S      AS[i, j] = ...;
R      AR[i, j] = ... φ({⟨T⟩} ∪ {⟨S, i', j'⟩ : (i', j') <lex (i, j)}) ...;
     }

Figure 16.b. SA without reaching definition analysis

   double A[N], AT;
   double AS[N, N], AR[N, N];
T  AT = 0;
   for (i=0; i<N; i++)
     for (j=0; j<N; j++) {
S      AS[i, j] = ...;
R      AR[i, j] = ... (if (j==0) (if (i==0) AT else AS[i-1, j])
                       else AS[i, j-1]) ...;
     }

Figure 16.c. SA with a precise reaching definition analysis

[Figure 16.d. Precise analysis and loop "peeling": the i==0 and j==0 iterations are peeled out of the loops, so that each read directly references AT, AS[i-1, j] or AS[i, j-1] without any conditional.]

Figure 16. Interactions between reaching definition analysis and run-time overhead

........................................................................................

The actual implementation of these techniques depends on the control and data structures. In the case of loops and arrays, we propose single-assignment conversion algorithms that extend existing results to arbitrary nests. Single-assignment conversion of recursive programs is a new field that we study in Section V.5.

We have also developed three techniques to optimize the computation of the φ functions. The first applies simple optimizations to the @Dexp structures; the second reduces the sets of possible reaching definitions (the arguments of the φ functions) using a new data-flow information called the reaching definitions of a memory cell; and the third eliminates redundancies in the maximum computation by performing it incrementally. Strictly speaking, this last technique does not produce a single-assignment program, which can sometimes hamper its use in automatic parallelization. With a different view of expansion (not necessarily single-assignment), Section V.4 proposes an improved version of the redundancy elimination method (also called "optimized placement of φ functions") that does not hamper parallelization.

V.2 Maximal Static Expansion

The goal of maximal static expansion is to expand memory as much as possible — and thus to remove as many dependences as possible — without resorting to φ functions to restore the data flow.

Consider two writes v and w belonging to the set of possible reaching definitions of a read u, and assume that they assign the same memory cell. If v and w write into two different memory cells after expansion, a φ function will be needed to choose which of the two writes defines the value read by u. We therefore introduce the relation R between writes that are possible reaching definitions of the same read:

∀v, w ∈ W : v R w ⟺ ∃u ∈ R : v σ u ∧ w σ u.

When two possible reaching definitions of the same read assign the same memory cell in the original program, they must do so in the expanded program as well. Since "writing into the same memory cell" is an equivalence relation, we in fact consider the transitive closure R* of relation R. Restricting ourselves to expanded access functions f_e^exp of the form (fe, ν), where ν is some function on write accesses, we show the following result:

Proposition 3 An access function f_e^exp = (fe, ν) is a maximal static expansion for every execution e iff

∀v, w ∈ We, fe(v) = fe(w) : v R* w ⟺ ν(v) = ν(w).

From this result, a function ν can be computed by enumerating the equivalence classes of a certain relation. The formalism is thus very general, but the algorithm we propose is limited to arbitrary loop nests over arrays. A number of technical points — notably the transitive closure of affine relations — require special attention, but they are not covered in this French summary.

In the general case, single-assignment conversion exposes more parallelism than static expansion; the choice is thus a trade-off between run-time overhead and extracted parallelism. We also present three examples, to which we apply the expansion algorithm semi-automatically (with Omega [Pug92]). Only one example is studied in this summary, however; see Section V.4.

V. EXPANSION ET PARALLELISATION 45

We now present a technique to reduce the memory usage of an expanded program without losing parallelism. We thus assume that a parallel execution order <par has already been determined for the original program (<seq, fe), probably from the approximate reaching definition relation σ. It is interesting to note that this parallel order may be obtained by any technique, scheduling or partitioning for instance, as long as the result can be described by an affine relation.

At the cost of a transitive closure computation, it is even possible to start from the "data-flow" order, that is, the "most parallel" order allowed by the reaching definition relation. One then obtains an expanded program that (generally) requires less memory than the single-assignment form, but is compatible with any legal parallel execution.

Our first task in formalizing the problem is to determine which expansions are correct with respect to this parallel order, i.e. which expanded access functions feexp guarantee that the parallel execution order preserves the semantics of the original program. Using the notation

    ∀v, w ∈ W : v ⋈ w ⟺def
        (∃u ∈ R : v σ u ∧ w ≮par v ∧ u ≮par w ∧ (u <seq w ∨ w <seq v ∨ v ≁ w))
      ∨ (∃u ∈ R : w σ u ∧ v ≮par w ∧ u ≮par v ∧ (u <seq v ∨ v <seq w ∨ w ≁ v)),

we proved the following result:

Theorem 10 (correctness of access functions) If the following condition holds, the expansion is correct, that is, it guarantees that the parallel execution order preserves the semantics of the original program:

    ∀e ∈ E, ∀v, w ∈ We : v ⋈ w ⟹ feexp(v) ≠ feexp(w).

Intuitively, a reaching definition v of a read u and another write w must assign distinct memory cells when w executes between v and u in the parallel program, and either w does not execute between v and u in the original program, or w assigns a different memory cell than v in the original program. Moreover, we showed that this correctness criterion is optimal, for a given approximation of the reaching definitions and of the access function of the original program.

Using this criterion, generation of the expanded code requires coloring an unbounded graph described by an affine relation. The method is the same as in the case of affine loop nests; it is detailed (in French) in Lefebvre's thesis [Lef98].
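On a finite stand-in for that unbounded graph, the coloring step can be sketched as follows. The write names, interference edges and greedy strategy below are illustrative assumptions, not the algorithm of [Lef98]; they only show how interfering writes get distinct expanded cells while non-interfering writes share one:

```python
# Finite-graph sketch of the storage mapping step: writes that interfere
# (the v ⋈ w criterion) must receive distinct colors, i.e. distinct
# expanded memory cells; non-interfering writes may share a cell.

def color_writes(writes, interferes):
    """Greedy coloring; returns a dict mapping each write to a cell number."""
    cell = {}
    for v in writes:  # fixed order keeps this sketch deterministic
        taken = {cell[w] for w in interferes.get(v, ()) if w in cell}
        c = 0
        while c in taken:  # smallest color unused by colored neighbors
            c += 1
        cell[v] = c
    return cell

# w1 ⋈ w2 and w2 ⋈ w3, but w1 and w3 never interfere: two cells suffice
# instead of the three that single assignment would allocate.
mapping = color_writes(
    ["w1", "w2", "w3"],
    {"w1": {"w2"}, "w2": {"w1", "w3"}, "w3": {"w2"}})
```

In the thesis setting the graph is not finite, so the coloring is performed symbolically on the affine relation describing ⋈ rather than edge by edge.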

We now show that the two previous expansion techniques can be combined, and we propose a general framework to simultaneously optimize the expansion overhead and the extracted parallelism: optimized constrained expansion. The formalism and the algorithms are too technical for this summary, so we only give an example illustrating constrained expansion, which generalizes static expansion, combined with memory usage optimization.


........................................................................................

Figure 17.a. Original program:

    double x;
    for (i=1; i<=M; i++) {
      for (j=1; j<=M; j++)
        if (P(i, j)) {
    T     x = 0;
          for (k=1; k<=N; k++)
    S       x = x ... ;
        }
    R   ... = x ... ;
    }

Figure 17.b. Conversion to single assignment:

    double xT[M+1, M+1], xS[M+1, M+1, N+1];
    parallel for (i=1; i<=M; i++) {
      parallel for (j=1; j<=M; j++)
        if (P(i, j)) {
    T     xT[i, j] = 0;
          for (k=1; k<=N; k++)
    S       xS[i, j, k] = (if (k==1) xT[i, j] else xS[i, j, k-1]) ... ;
        }
    R   ... = φ({⟨S, i, 1, N⟩, ..., ⟨S, i, M, N⟩}) ... ;
    }

. . . . . . . . . . . . . . . . . . . . . . . . Figure 17. Parallelization example . . . . . . . . . . . . . . . . . . . . . . . .

We study the pseudo-code of Figure 17.a. We assume that N is strictly positive and that the predicate P(i, j) holds at least once for each iteration of the outer loop. The dependences on x forbid any parallel execution, so we convert the program to single assignment. The result of the reaching definition analysis is exact for the instances of S, but not for those of R: a φ function is needed. The two outer loops then become parallel, as shown in Figure 17.b.

Because of this φ function and of the use of a three-dimensional array, we observe that the parallel execution of this program is about five times slower than the sequential execution (on an SGI Origin 2000 with 32 processors). Reducing memory usage is thus necessary. Applying the algorithm of Section V.3 shows that expansion along the innermost loop is unnecessary, as is renaming x into xS and xT. We obtain the code of Figure 18.a. The φ function is implemented with an optimized on-the-fly computation technique (see Section V.1), and the max computation hides a synchronization. Performance is therefore acceptable for a small number of processors, but degrades very quickly beyond four.

Applying the maximal static expansion algorithm removes the φ function by forbidding expansion along the intermediate loop, see Figure 18.b; only the outer loop remains parallel. The parallel program on one processor is about twice as slow as the sequential program (probably because of the accesses to the two-dimensional array), but the speed-up is excellent. We observe that variable x has again been expanded along the inner loop, although this brings no additional parallelism: the two expansion techniques must therefore be combined. The result is very close to maximal static expansion with one dimension less for array x: x[i] instead of x[i, k]. As expected, performance is excellent: the speed-up is 31.5 on 32 processors (M = 64 and N = 2048).

Automatic parallelization techniques for recursive programs are beginning to emerge, thanks to environments and tools, such as Cilk [MF98], that ease the efficient implementation of control-parallel programs [RR99].


........................................................................................

Figure 18.a. Memory usage optimization:

    double x[M+1, M+1];
    int @x[M+1];
    parallel for (i=1; i<=M; i++) {
      @x[i] = ⊥;
      parallel for (j=1; j<=M; j++)
        if (P(i, j)) {
    T     x[i, j] = 0;
          for (k=1; k<=N; k++)
    S       x[i, j] = x[i, j] ... ;
          @x[i] = max (@x[i], j);
        }
    R   ... = x[i, @x[i]] ... ;
    }

Figure 18.b. Maximal static expansion:

    double x[M+1, N+1];
    parallel for (i=1; i<=M; i++) {
      for (j=1; j<=M; j++)
        if (P(i, j)) {
    T     x[i, 0] = 0;
          for (k=1; k<=N; k++)
    S       x[i, k] = x[i, k-1] ... ;
        }
    R   ... = x[i, N] ... ;
    }

. . . . . . . . . . . . . . . . . . . . . . Figure 18. Two different parallelizations . . . . . . . . . . . . . . . . . . . . . .

We propose single-assignment conversion and privatization techniques for recursive programs, then present two parallel code generation methods.

Expansion of recursive programs

In a recursive program in single-assignment form, the expanded data structures generally have a tree shape: their elements are in one-to-one correspondence with control words. Dynamic allocation of, and access to, these structures are thus more delicate than in the case of loop nests. The general idea is to build each expanded structure Dexp "on the fly", propagating a pointer to the current node. Direct access to Dexp is nevertheless necessary to update read references: one must first compute the possible reaching definitions using the transducer produced by the analysis, then retrieve the associated memory cells in Dexp. Even in the absence of φ functions, restoring the data flow may thus be very costly.

If the reaching definitions are known exactly, σ can be seen as a partial function from R to W. When this function can be computed "on the fly", efficient code can be generated for the read references of the expanded program: it suffices to implement the stepwise computation of the transducer. This is notably the case for subsequential transducers (see Section III.2), when the recursive program operates on a tree structure. In the presence of arrays, it is harder to know whether the one-counter transducer of reaching definitions can be computed "on the fly". We nevertheless proposed a single-assignment conversion algorithm for recursive programs, including on-the-fly computation of reaching definitions whenever possible.

We extended the notion of privatization to recursive programs: it consists in turning global data structures into local variables. In the general case, data must be copied at every call to, and every return from, a procedure. Copying the local structures back into the structures of the calling procedure (the copy-out) can prove costly, notably because of the unavoidable synchronizations in a parallel execution. However, when reaching definitions are necessarily ancestors, only the first copy phase (the copy-in) is needed; this is the case for the Queens program, for most sorting algorithms, and more generally for divide-and-conquer and dynamic-programming execution schemes. We therefore propose a privatization algorithm for recursive programs, where φ functions are replaced by copies of data structures.

Parallel code generation

........................................................................................

        int A[n];
    P   void Queens (int A[n], int n, int k) {
          int B[n];
          memcpy (B, A, k * sizeof (int));
    I     if (k < n) {
    A=a     for (int i=0; i<n; i++) {
    B=b       for (int j=0; j<k; j++) {
    r           ... = ... B[j] ... ;
              }
    J         if (...) {
    s           B[k] = ... ;
    Q           spawn Queens (B, n, k+1);
              }
            }
          }
        }
        int main () {
    F     Queens (A, n, 0);
        }

    [Graph: speed-up (parallel / original) of 13-Queens for 1 to 32 processors,
    compared with the optimal speed-up; the vertical axis ranges from 0.5 to 32.]

. . . . . . . . . . . . Figure 19. Parallelization of the Queens program after privatization . . . . . . . . . . . .

We show that the decidability properties of rational and algebraic transducers enable efficient dependence tests. From these we derive a statementwise parallelization algorithm that executes some statements asynchronously and inserts synchronizations where dependences require them. This algorithm is applied to the BST program, and to the Queens program after privatization, see Figure 19. The experiment was run on an SGI Origin 2000 for n = 13. The slowdown on one processor is due to the array copies, and to a lesser extent to the Cilk scheduler [MF98].

We also show that our parallelization algorithm yields better results than existing techniques when discovering the parallelism requires instancewise information. Finally, we study instancewise parallelization of recursive programs, where synchronizations are guarded by the precise conditions, on the control word, under which a dependence may occur. The algorithm we propose fully exploits the result of the instancewise dependence analysis, and the ability to efficiently test whether a pair of words is accepted by a transducer. A concrete example validates this new technique.
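As a toy illustration of such a test, the following sketch runs a small letter-synchronous transducer over a pair of control words and reports whether the pair is accepted, i.e. whether a synchronization would be required. The transducer is hypothetical, not derived from an actual analysis: it accepts pairs (u, v) where v is u with exactly one letter 'a' replaced by 'b', as between a write in one recursive branch and a read in the sibling branch:

```python
# Instancewise dependence test sketch: a pair of control words is in
# dependence iff it is accepted by a rational transducer. Nondeterministic
# states are tracked as a set, so the test runs in time linear in |u|.

TRANS = {
    # state -> list of (in_letter, out_letter, next_state); None = any equal pair
    "copy":  [(None, None, "copy"), ("a", "b", "subst")],
    "subst": [(None, None, "subst")],
}
FINAL = {"subst"}

def accepts(u, v):
    """True iff the pair (u, v) is accepted by the transducer above."""
    if len(u) != len(v):  # letter-synchronous transducer: equal lengths only
        return False
    states = {"copy"}
    for x, y in zip(u, v):
        nxt = set()
        for q in states:
            for a, b, r in TRANS[q]:
                if (a is None and b is None and x == y) or (a == x and b == y):
                    nxt.add(r)
        states = nxt
    return bool(states & FINAL)
```

For instance, the pair ("caa", "cba") is accepted (one 'a' became 'b'), while ("aa", "aa") is not: identical instances are not related by this transducer.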


VI Conclusion

This thesis concludes with a summary of the main results, followed by a discussion of future developments.

VI.1 Contributions

Our contributions fall into four strongly interdependent categories. The first three concern automatic parallelization and are summarized in the following table; the fourth category concerns rational and algebraic transductions.

                                   Affine nests           General nests            Recursive programs
                                   over arrays            over arrays              over trees and arrays

    instancewise dependence        [Fea88a, Fea91,        [WP95, Won95]            published in [CC98] (2)
      analysis                     Pug92]
    instancewise reaching          [Fea88a, Fea91,        [CBF95, BCF97, Bar98]    Section IV,
      definition analysis          Pug92] [MAL93]         [WP95, Won95]            published in [CC98] (2)
    single assignment,             Sections V.1 and V.4,
      maximal static expansion     published in [BCC98, Coh99b, BCC00]
    memory usage optimization      [SCFS98, CDRV97]       published in [CL99, Coh99b]
    instancewise parallelization   [DV97]                 [Col95b]

Control and data structures: beyond the polytope model. In Section II, we defined a program model and mathematical abstractions for statement instances and data structure elements. This general framework was used throughout this work to formalize the presentation of our techniques, in particular in the case of recursive structures.

New dependence and reaching definition analyses were proposed in Section IV. They use a formalism from formal language theory, more precisely rational and algebraic transductions. A new definition of induction variables, suited to recursive programs, allowed us to describe the effect of each instance by a rational or algebraic transduction. A comparison with other analyses concludes this work.

1. This is a dependence test for trees only.
2. For arrays only.

On the other hand, when we designed algorithms for loop nests over arrays, a special case of our model, we remained faithful to iteration vectors and affine relations in Presburger arithmetic.

Memory expansion: new techniques to solve new problems. Applying memory expansion to parallelization is an old technique, but instancewise reaching definition analyses have recently been extended to programs with conditional expressions, with complex references to data structures, e.g. non-affine array subscripts, or with recursive calls, and this raises new questions. The first is to guarantee that read accesses in the expanded program refer to the right memory cell; the second lies in the adequacy of expansion techniques to the new program models.

Both questions are addressed in Sections V.1, V.2, V.3 and V.4, in the case of (unrestricted) loop nests over arrays. We presented a new technique to reduce the run-time overhead of expansion, and we extended a memory usage reduction method to unrestricted loop nests. The combination of the two was studied, and we designed algorithms to optimize run-time restoration of the data flow. Some experimental results are presented for a shared-memory architecture.

Memory expansion for recursive programs is a completely new research field, and we discovered that the mathematical abstractions for reaching definitions, rational and algebraic transductions, can incur significant overhead. We nevertheless developed algorithms that expand particular recursive programs with low run-time overhead.

Parallelism: extending classical techniques. Our dependence analysis was put to use in parallelizing recursive programs. We were able to demonstrate practical applications of rational and algebraic transductions, using their decidable properties. Our first algorithm resembles existing methods, but it benefits from the more precise information gathered by the analysis and generally obtains better results. Another algorithm enables instancewise parallelization of recursive programs: this new technique is made possible by the use of rational and algebraic transductions. Some experimental results are described, combining expansion and parallelization on a well-known recursive program.

Formal language theory: some contributions and applications. The last results of this work do not belong to the field of compilation. They are mainly found in Section III.3 and in the following sections. We defined a subclass of rational transductions that has the structure of a Boolean algebra and many other interesting properties. We showed that this class is not decidable within rational transductions, but conservative approximation techniques make these properties available for the whole class of rational transductions. We also presented some new results on the composition of rational transductions over non-free monoids, before studying the approximation of algebraic transductions.


VI.2 Perspectives

Many questions arose throughout this thesis, and our results suggest more interesting research than they solve problems. We start with the questions related to recursive programs, then discuss future work in the polytope model.

First of all, the search for a mathematical abstraction able to describe instancewise properties again appears as a key issue. Rational and algebraic transductions often gave good results, but their limited expressiveness also restricted their field of application. Reaching definition analysis suffered most from this, as did the integration of conditional expressions and loop bounds into dependence analysis. Under these conditions, we would need more than one counter in transducers, while retaining the ability to test emptiness and to decide other interesting properties. We are therefore strongly interested in the work of Comon and Jurski [CJ98] on deciding emptiness in a subclass of multi-counter languages, and more generally we would like to follow more closely the studies on system verification based on restricted classes of Minsky machines, such as timed automata. Using several counters would moreover allow extending one of the key ideas of fuzzy array data-flow analysis [CBF95]: inserting new parameters to improve precision by describing the properties of non-affine expressions.

Moreover, we believe that decidability properties are not necessarily the most important point in the choice of a mathematical abstraction: good approximations of the results are often sufficient. In particular, we discovered while studying left-synchronous and deterministic relations that a subclass with good decision properties cannot be used in our general analysis framework without an efficient approximation method. Improving our resynchronization and approximation methods for rational transducers is thus an important goal. We also hope this demonstrates the mutual interest of cooperation between theoreticians and compilation researchers.

Beyond these formalism issues, another research direction consists in lifting, as far as possible, the restrictions imposed on the program model. As proposed earlier, the best method is to seek a graceful degradation of the results through approximation techniques. This idea was studied in a similar context [CBF95], and its application to recursive programs promises interesting future work. Another idea would be to compute induction variables from execution traces (instead of control words), to allow modifications in any statement, then to derive approximate information on control words; abstract interpretation techniques [CC77] would probably be a precious help in proving the correctness of our approximations.

We did not work on the scheduling problem for recursive programs, because we know of no method for assigning sets of instances to execution dates. Building a rational transducer from dates to instances may be a good idea, but generating code to enumerate the sets of instances becomes rather difficult. These technical reasons should not hide the fact that most of the parallelism in recursive programs can already be exploited by control-parallel techniques, and the need for a data-parallel execution model is not obvious.

Beyond their impact on our study of recursive programs, techniques from the polytope model cover an important part of this thesis. A major goal throughout this work was to keep some distance from the mathematical representation of affine relations. This point of view has the drawback of not easing the design of optimized, ready-to-use algorithms for a compiler, but it has the major advantage of presenting our approach in full generality. Among the technical problems that should be improved, both for maximal static expansion and for memory usage optimization, the most important are the following.

We presented many algorithms for dynamic restoration of the data flow, but we have very little practical experience in parallelizing loop nests with unpredictable control flow and non-affine array subscripts. Since the SSA framework [CFR+91] is mostly used as an intermediate representation, φ functions are rarely implemented in practice. Generating efficient restoration code is thus a rather new problem.

No parallelizer for unrestricted loop nests has ever been written. As a result, no large-scale experiments could ever be conducted. To apply precise analyses and transformations to real programs, a significant optimization effort remains to be done. The main ideas would be to partition the code [Ber93] and to extend our techniques to hierarchical dependence graphs, to array regions [Cre96], or to hierarchical schedules [CW99].

A parallelizing compiler must be able to automatically tune a large number of parameters: run-time overhead, parallelism extraction, memory usage, placement of computations and communications... We have seen that the optimization problem is even harder for non-affine loop nests. The constrained expansion framework allows the simultaneous optimization of several parameters related to memory expansion, but it is only a first step.


Chapter 1

Introduction

Performance increase in computer architecture technology is the combined result of several factors: fast increase of processor frequency, broader bus widths, increased number of functional units, increased number of processors, complex memory hierarchies to deal with high latencies, and global increase of storage capacities. New improvements and architectural designs are proposed every day. The result is that the machine model is becoming less and less uniform and simple: despite the hardware support for caches, superscalar execution and shared-memory multiprocessing, tuning a given program for performance becomes more and more complex. Good optimizations for some particular case can lead to disastrous results on a different machine. Moreover, hardware support is generally not sufficient when the complexity of the system becomes too high: dealing with deep memory hierarchies, local memories, out-of-core computations, instruction-level parallelism and coarse-grain parallelism requires additional support from the compiler to translate raw computation power into sustained performance. The recent shift of microprocessor technology from superscalar models to explicit instruction-level parallelism is one of the most concrete signs of this trend.

Indeed, the whole computer architecture and compiler industry is now facing what the high performance computing community has known for years. On the one hand, and for most applications, architectures are too diverse to define practical efficiency criteria and to develop specific optimizations for a particular machine. On the other hand, programs are written in such a way that traditional optimization and parallelization techniques struggle to feed the huge computation monster everybody will have tomorrow in his laptop.

In order to achieve high performance on modern microprocessors and parallel computers, a program, or at least the algorithm it implements, must contain a significant degree of parallelism. Even then, the programmer and/or the compiler has to expose this parallelism and apply the necessary optimizations to adapt it to the particular characteristics of the target machine. Moreover, the program should be portable in order to cope with the fast obsolescence of parallel machines. The following two possibilities are offered to the programmer to meet these requirements.

First, explicitly parallel languages. Most of these are parallel extensions of sequential languages. This includes well-known data-parallel languages such as HPF, and recent mixed data- and control-parallel approaches such as the OpenMP extensions for shared-memory architectures. Some extensions also appear in the form of libraries: PVM and MPI for instance, or higher-level multi-threaded environments such as IML from the University of Illinois [SSP99] or Cilk from the MIT [MF98].

These languages let the programmer express as much parallelism as possible. However, besides parallel algorithmics, the programmer is also in charge of more technical and machine-dependent operations, such as the distribution of data on the processors depending on their memory capacities, communications, and synchronizations. This requires deep knowledge of the target architecture and reduces portability. Several efforts have been made in HPF to make the compiler take care of some parts of this job, but it seems that the programmer still needs precise knowledge of what the compiler does.

Second, automatic parallelization of a high-level sequential language. The obvious advantages of this approach are portability, simplicity of programming, and the fact that even old undocumented sequential codes may be automatically parallelized (in theory). However, the task allotted to the compiler-parallelizer is overwhelming. Indeed, the program first has to be analyzed in order to understand, at least partially, what is performed and where the parallelism lies. The compiler then has to take decisions about how to generate parallel code that takes the specificities of the target architecture into account. Even for short programs and a simplified model of a parallel machine, "optimality" in both steps is out of reach for decidability reasons. As a matter of fact, a wide panel of parallelization techniques exists, and the difficulty often lies in choosing the most appropriate one.

The usual source language for automatic parallelization is Fortran 77. Indeed, many scientific applications have been written in Fortran, which allows only relatively simple data structures (scalars and arrays) and control flow. Several studies however deal with the parallelization of C or of functional languages such as Lisp. These studies are less advanced than the historical approach, but also more related to the present work: they handle programs with general control and data structures. Many research projects already exist, among others: Parafrase-2 and Polaris [BEF+96] from the University of Illinois, PIPS from École des Mines [IJT90], SUIF from Stanford University [H+96], the McCat/EARTH-C compiler from McGill University [HTZ+97], LooPo from the University of Passau [GL97], and PAF from the University of Versailles; there is also an increasing number of commercial parallelizing tools, such as CFT, FORGE, FORESYS or KAP.

We are mostly interested in automatic and semi-automatic parallelization techniques: this thesis addresses both program analysis and source-to-source program transformation. Optimizations and parallelizations are usually seen as source-to-source code transformations that improve one or several run-time parameters. To apply a program transformation at compile-time, one must check that the algorithm implemented by the program is unharmed in the process. Because an algorithm can be implemented in many different ways, applying a program transformation requires "reverse engineering" the most precise information about what the program does. This fundamental program analysis technique addresses the difficult problem of gathering compile-time, a.k.a. static, information about run-time, a.k.a. dynamic, properties.


Static Analysis

Program analyses often compute properties of the machine state between the execution of two instructions. These machine states are known as program points. Such properties are called static because they cover every possible run-time execution leading to a given program point. Of course these properties are computed at compile-time, but this is not the meaning of the adjective "static": "syntactic" would probably be more appropriate...

Data-flow analysis was the first framework proposed to unify the large number of static analyses. Among the various wordings and formal presentations of this framework [KU77, Muc97, ASU86, JM82, KS92, SRH96], one may expose the following common issues. To formally state the possible run-time executions, the usual method is to build the control flow graph of the program [ASU86]; indeed, this graph represents all program points as nodes, and edges between these nodes are labeled with program statements. The set of all possible executions is then the set of all paths from the initial state to the considered program point. Properties at a given program point are defined as follows: because each statement may modify some property, one must consider every path leading to the program point and meet all the information along these paths. The formal statement of these ideas is usually called meet over all paths (MOP) [KS92]. Of course, the meet operation depends on the property to be evaluated and on its mathematical abstraction.

However, because of the possibly unbounded number of paths, the MOP specification of the problem cannot be used for practical evaluation of static properties. Practical computation is done by propagation of the intermediate results, forward or backward, along edges of the control flow graph. An iterative resolution of the propagation equations is performed, until a fix-point is reached. This method is known as maximal fixed point (MFP). In the intra-procedural case, Kam and Ullman [KU77] have proven that MFP effectively computes the result defined by MOP, i.e. MFP coincides with MOP, when some simple properties of the mathematical abstraction are satisfied; and this result has been extended to inter-procedural analysis by Knoop and Steffen [KS92].
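The MFP iteration can be made concrete with a minimal sketch (our own hypothetical example, not an algorithm from this thesis): reaching definitions on a four-node control flow graph with a back edge, where sets of definitions are encoded as bit vectors, the meet at merge points is set union, and the transfer equations are iterated until the fixed point is reached.

```c
#include <assert.h>

/* Minimal MFP sketch (hypothetical example): CFG 0 -> 1 -> 2 -> 3 with a
 * back edge 3 -> 1.  Nodes 0 and 1 both define variable x (definitions d0
 * and d1, each killing the other), node 3 defines y (d3).  Bit i of a set
 * means "definition di reaches this point". */
enum { NNODES = 4 };
static const int pred[NNODES][NNODES] = {   /* pred[n][p]: p precedes n */
    {0,0,0,0}, {1,0,0,1}, {0,1,0,0}, {0,0,1,0}
};
static const unsigned gen[NNODES]  = { 1u<<0, 1u<<1, 0u, 1u<<3 };
static const unsigned kill[NNODES] = { 1u<<1, 1u<<0, 0u, 0u };
unsigned in_[NNODES], out_[NNODES];

void mfp(void) {
    int changed = 1;
    while (changed) {                    /* iterate to the fixed point */
        changed = 0;
        for (int n = 0; n < NNODES; n++) {
            unsigned i = 0u;
            for (int p = 0; p < NNODES; p++)
                if (pred[n][p]) i |= out_[p];      /* meet = set union */
            unsigned o = gen[n] | (i & ~kill[n]);  /* transfer function */
            if (i != in_[n] || o != out_[n]) {
                in_[n] = i; out_[n] = o; changed = 1;
            }
        }
    }
}
```

At the fixed point, both the definition from before the loop and the one carried by the back edge reach the loop head, which is exactly the conservative merge the MOP specification demands.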

Mathematical abstractions for program properties are very numerous, depending on the application and complexity of the analysis. The lattice structure encompasses most abstractions because it supports computation of both meet (at merge points) and join (at computational statements) operations. In this context, Cousot and Cousot [CC77] have proposed an approximation framework based on semi-dual Galois connections between concrete run-time states of a program and abstract compile-time properties. This mathematical formulation, called abstract interpretation, has two main interests: first, it allows systematic approaches to the construction of a lattice abstraction for program properties; and second, it ensures that any computed fix-point in the abstract lattice corresponds to a conservative approximation of an actual fix-point in the lattice of concrete states. While extending the concept of data-flow analysis, abstract interpretation helps proving the correctness and optimality of program analyses. Practical applications of abstract interpretation and related iterative methods can be found in [Cou81, CH78, Deu92, Cre96].
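The abstraction idea can be illustrated with a minimal sketch (our own toy example, not taken from the cited works): the classical sign domain, where alpha abstracts a concrete integer into the lattice BOT < {NEG, ZERO, POS} < TOP, join is the least upper bound, and abstract multiplication is sound with respect to concrete multiplication.

```c
#include <assert.h>

/* Toy sign abstraction (hypothetical example): BOT < NEG,ZERO,POS < TOP. */
typedef enum { BOT, NEG, ZERO, POS, TOP } Sign;

Sign alpha(int x) {                 /* abstraction of a concrete value */
    return x < 0 ? NEG : (x > 0 ? POS : ZERO);
}

Sign join(Sign a, Sign b) {         /* least upper bound in the lattice */
    if (a == BOT) return b;
    if (b == BOT) return a;
    return a == b ? a : TOP;
}

Sign amul(Sign a, Sign b) {         /* sound abstract multiplication */
    if (a == BOT || b == BOT) return BOT;
    if (a == ZERO || b == ZERO) return ZERO;
    if (a == TOP || b == TOP) return TOP;
    return a == b ? POS : NEG;      /* rule of signs */
}
```

For multiplication the sign domain happens to be exact; an abstract addition would have to return TOP on POS + NEG, which is where conservative approximation enters, and why any fixed point computed in the abstract lattice only over-approximates the concrete one.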

Despite the indisputable successes of data-flow and abstract interpretation frameworks, the automatic parallelization community has very rarely based its analysis techniques on one of these frameworks. Beyond the important reasons which are not of a scientific nature, we will discuss the good reasons:

- MOP/MFP techniques focus on classical optimization techniques, with rather simple abstractions (lattices often have a bounded height); correctness and efficiency in a production compiler are the main motivations, whereas precision and expressiveness of the mathematical abstraction are the main issues for parallelization;

56 CHAPTER 1. INTRODUCTION

- in the industry, parallelization has traditionally addressed nests of loops and arrays, with high degrees of data parallelism and simple (non-recursive, first-order) control structures; proving the correctness of an analysis is easy in this context, whereas applications to real programs and practical implementation in a compiler become issues of critical interest;

- abstract interpretation is well suited to functional languages with clean and simple operational semantics; problems raised in this context are orthogonal to practical issues of imperative and low-level languages such as Fortran or C, traditionally more suitable for parallel architectures (but we will see that this point is evolving).

As a result, data-flow and abstract interpretation frameworks have mostly focused on static analysis techniques, which compute properties at a given program point or statement. Such results are well suited to most classical techniques for program checking and optimization [Muc97, ASU86, SKR90, KRS94], but for automatic parallelization purposes, one needs more information.

- What about distinct run-time instances of program points and statements? Because statements are likely to execute several times, we are interested in which iteration of a loop or which call to a procedure induced the execution of some program statement.

- What about distinct elements in a data structure? Because arrays and dynamically allocated structures are not atomic, we are interested in which array element or which graph node is accessed by some run-time instance of a statement.

Because of orthogonal interests in the data-flow analysis and automatic parallelization communities, it is not surprising that results of the ones could not be applied by the others. Indeed, a very small number of data-flow analyses [DGS93, Tzo97] addressed both instancewise and elementwise issues, but their results are very far from the requirements of a compiler in terms of precision and applicability.

Instancewise Analysis

Program analyses for automatic parallelization are a rather restricted domain, compared to the broad range of properties and techniques studied in data-flow analysis frameworks. The program model considered is also more restricted, most of the time, since traditional applications of parallelizing compilers are numerical codes with loop nests and arrays.

Since the very beginning, with works by Banerjee [Ban88], Brandes [Bra88] and Feautrier [Fea88a], analyses have been oriented towards instancewise and elementwise properties of programs. When the only control structure was the for/do loop, iterative methods with a high semantical background seemed overly complex. To focus on solving critical problems such as abstracting loop iterations and effects of statement instances on array elements, designing simple and ad-hoc frameworks was obviously more profitable than trying to build on unpractical data-flow frameworks. The first analyses were dependence tests [Ban88] and dependence analyses [Bra88, Pug92], which collected information about statement instances that access the same memory location, one of the accesses being a write. More precise methods have been designed to compute, for every array element read in an expression, the very statement instance which produced the value. They are usually called array data-flow analyses [Fea91, MAL93], but we prefer to call them instancewise reaching definition analyses for better comparison with a specific static data-flow analysis technique called reaching definition analysis [ASU86, Muc97]. Such accurate information significantly improves the quality of program transformation techniques, hence the performance of parallel programs.

1.2. PROGRAM TRANSFORMATIONS FOR PARALLELIZATION 57

Instancewise analyses have long suffered from strong program model restrictions: programs used to be nested loops without conditional statements, with affine bounds and array subscripts, and without procedure calls. This very limited model is already sufficient to address many numerical codes, and has the major interest of allowing computation of exact dependence and reaching definition information [Fea88a, Fea91]. One of the difficulties in removing the restrictions is that exact results cannot be hoped for anymore, and only approximate dependences are available at compile-time: this induces overly conservative approximations of reaching definition information. A direct computation of reaching definitions is thus needed. Recently, such direct computations have been crafted, and extremely precise intra-procedural techniques have been designed by Barthou, Collard and Feautrier [CBF95, BCF97, Bar98] and by Pugh and Wonnacott [WP95, Won95]. In the following, fuzzy array dataflow analysis (FADA) by Barthou, Collard and Feautrier [Bar98] will be our preferred instancewise reaching definition analysis for programs with unrestricted nested loops and arrays.

Many extensions to handle procedure calls have been proposed [TFJ86, HBCM94, CI96], but they are not fully instancewise in the sense that they do not distinguish between multiple executions of a statement associated with distinct calls of the surrounding procedure. Indeed, the first fully instancewise analysis for programs with (possibly recursive) procedure calls is presented in this thesis.

The next section introduces program transformations useful for parallelization. Most of these transformations will be studied in more detail in the rest of this thesis. Of course, they are based on instancewise and elementwise analysis of program properties.

Dependences are known to hamper the parallelization of imperative programs and their efficient compilation on modern processors or supercomputers. A general method to reduce the number of memory-based dependences is to disambiguate memory accesses by assigning distinct memory locations to independent writes, i.e. to expand data structures.

There are many ways to compute memory expansions, i.e. to transform memory accesses in programs. Classical ways include renaming scalars, arrays and pointers, splitting or merging data structures of the same type, reshaping array dimensions (including adding new dimensions), converting arrays into trees, changing the degree of a tree, and changing a global variable into a local one.

Read references are also expanded, using instancewise reaching definition information to implement the expanded reference [Fea91]. Figure 1.1 shows three programs with no possible parallel execution because of output dependences (details of the code are omitted when not useful for presentation). Expanded versions are given in the right-hand side of the figure, to illustrate the benefit of memory expansion for parallelism extraction.

Unfortunately, when the control flow cannot be predicted at compile-time, some run-time computation is needed to preserve the original data flow: φ functions may be needed to "merge" data definitions due to several incoming control paths. These functions are similar, but not identical, to those of the static single-assignment (SSA) framework by Cytron et al. [CFR+91], and have first been extended to instancewise expansion schemes


........................................................................................

int x;                                  int x1, x2;
x = ...; ... = x;                       x1 = ...; ... = x1;
x = ...; ... = x;                       x2 = ...; ... = x2;

After expansion, i.e. renaming x in x1 and x2, the first two statements can be executed in parallel with the two others.

int A[10];                              int A1[10], A2[10][10];
for (i=0; i<10; i++) {                  for (i=0; i<10; i++) {
s1  A[0] = ...;                         s1  A1[i] = ...;
    for (j=1; j<10; j++) {                  for (j=1; j<10; j++) {
s2    A[j] = A[j-1] + ...;              s2    A2[i][j] = { if (j==1) A1[i];
    }                                             else A2[i][j-1]; } + ...;
}                                           }
                                        }

After expansion, i.e. renaming array A in A1 and A2 then adding a dimension to array A2, the for loop is parallel. The instancewise reaching definition of the A[j-1] reference depends on the values of i and j, as implemented with a conditional expression.

int A[10];                              struct Tree {
void Proc (int i) {                       int value; Tree *left, *right;
  A[i] = ...;                           } *p;
  ... = A[i];                           void Proc (Tree *p, int i) {
  if (...) Proc (i+1);                    p->value = ...;
  if (...) Proc (i-1);                    ... = p->value;
}                                         if (...) Proc (p->left, i+1);
                                          if (...) Proc (p->right, i-1);
                                        }

After expansion, the two procedure calls can be executed in parallel. Memory allocation for the Tree structure is not shown.

. . . . . . . . . . . . . . . . . . Figure 1.1. Simple examples of memory expansion . . . . . . . . . . . . . . . . . .

by Collard and Griebl [GC95, Col98]. The argument of a φ function is the set of possible reaching definitions for the associated read reference.1 Figure 1.2 shows two programs with some unknown conditional expressions and array subscripts. Expanded versions with φ functions are given in the right-hand side of the figure.

Notice that memory expansion is not a mandatory step for parallelization; it is nevertheless a general technique to expose parallelism in programs. Now, the implementation of a parallel program depends on the target language and architecture. Two main techniques are used.

The first technique takes benefit of control parallelism, i.e. parallelism between different statements in the same program block. Its goal is to replace as many sequential executions of statements (denoted with ; in C) as possible by parallel executions. Depending on the language, there are many different syntaxes to code this kind of parallelism, and all these syntaxes may not have the same expressive power. We will prefer the Cilk [MF98] spawn/sync syntax (similar to OpenMP's syntax) to the parallel block notation from Algol 68 or the EARTH-C compiler [HTZ+97]. As in [MF98], synchronizations involve

1 This interpretation of φ functions is very different from their usual semantics in the SSA framework.


........................................................................................

int x;                                  int x1, x2;
s1  x = ...;                            s1  x1 = ...;
s2  if (...) x = ...;                   s2  if (...) x2 = ...;
r   ... = x;                            r   ... = φ({s1, s2});

After expansion, one may not decide at compile-time what value is read by statement r. One only knows that it may either come from s1 or from s2, and the effective value retrieval code is hidden in the φ({s1, s2}) function. It checks whether s2 executed or not; if it did, it returns the value of x2, else it returns the value of x1.

int A[10];                              int A1[10], A2[10];
s1  A[i] = ...;                         s1  A1[i] = ...;
s2  A[...] = ...;                       s2  A2[...] = ...;
r   ... = A[i];                         r   ... = φ({s1, s2});

After expansion, one may not decide at compile-time what value is read by statement r, because one does not know which element of array A is assigned by statement s2.

. . . . . . . . . . . . . . . . . Figure 1.2. Run-time restoration of the flow of data . . . . . . . . . . . . . . . . .
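The run-time behaviour of φ({s1, s2}) in the first program can be sketched in plain C (our own illustration; all names are hypothetical): an execution flag records whether the guarded write s2 ran, and the φ function consults it to return the value the original program would have read.

```c
#include <assert.h>

/* Sketch of run-time data-flow restoration (hypothetical names).      */
int x1, x2;            /* expanded copies of the original variable x   */
int s2_executed;       /* set at run-time when the guarded write runs  */

int phi_s1_s2(void) {  /* phi({s1, s2}): last executed definition wins */
    return s2_executed ? x2 : x1;
}

int run(int p) {       /* p plays the unknown predicate of s2 */
    x1 = 10;                              /* s1: x = ...          */
    s2_executed = 0;
    if (p) { x2 = 20; s2_executed = 1; }  /* s2: if (...) x = ... */
    return phi_s1_s2();                   /* r:  ... = x          */
}
```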

every asynchronous computation started in the surrounding program block, and implicit synchronizations are assumed at return points in procedures. For the example in Figure 1.3.a, execution of A, B, C in parallel, followed sequentially by D and E, has been written in a Cilk-like syntax (each statement would probably be a procedure call).

........................................................................................

Figure 1.3.a. Control parallelism:

spawn A;
spawn B;
spawn C;
sync;
// wait for A, B and C to complete
D;
E;

Figure 1.3.b. Data parallel implementation for schedules:

// L is the latency of the schedule
for (t=0; t<=L; t++) {
  parallel for (ι ∈ F(t))
    execute instance ι;
  // implicit synchronization
}

. . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 1.3. Exposing parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . .
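The correctness of the control parallel block in Figure 1.3.a only relies on A, B and C being independent, so any interleaving must give the same result. This can be checked with a small sketch in plain C (our own construction; the statement bodies are invented):

```c
#include <assert.h>

/* Hypothetical independent statements: each writes its own location. */
int a, b, c, d, e;
void A(void) { a = 1; }
void B(void) { b = 2; }
void C(void) { c = 3; }
void D(void) { d = a + b + c; }   /* executed after the sync */
void E(void) { e = 2 * d; }

typedef void (*Stmt)(void);

int run_block(Stmt order[3]) {
    a = b = c = d = e = 0;
    for (int i = 0; i < 3; i++)   /* any serialization of spawn A/B/C */
        order[i]();
    D(); E();                     /* sequential tail after the sync */
    return e;
}
```

A real Cilk runtime would execute A, B and C on different workers; serializing them in an arbitrary order is the standard way to reason about the semantics of such a block.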

The second technique is based on data parallelism, i.e. parallelism between different instances of the same statement or block. The data parallel programming model has been extensively studied in the case of loop nests [PD96], because it is very well suited to efficient parallelization of numerical algorithms and repetitive operations on large data sets. We will consider a syntax similar to the OpenMP parallel loop declaration, where all variables are supposed to be shared by default, and an implicit synchronization takes place at each parallel loop termination.

The first algorithms to generate data parallel code were based on intuitive loop transformations such as loop fission, loop fusion, loop interchange, loop reversal, loop skewing, loop reindexing and statement reordering. Moreover, dependence abstractions were much less expressive than affine relations. But data parallelism is also appropriate when describing a parallel order with a schedule, i.e. giving an execution date for every statement instance. The program pattern in Figure 1.3.b shows the general implementation of such a schedule [PD96]. It is based on the concept of execution front F(t), which gathers all instances ι executing at date t.
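The front-based pattern of Figure 1.3.b can be sketched as follows (our own construction, with an invented schedule θ): the outer time loop is sequential, each front F(t) gathers the instances whose date is t, and within a front the execution order is irrelevant.

```c
#include <assert.h>

/* Hypothetical schedule: 6 statement instances, theta(i) = i / 3,
 * hence fronts F(0) = {0,1,2} and F(1) = {3,4,5}, latency L = 1. */
enum { NI = 6, L = 1 };
int theta(int i) { return i / 3; }   /* date of instance i */

int order[NI];   /* execution order actually produced */
int next;

void run_schedule(void) {
    next = 0;
    for (int t = 0; t <= L; t++) {        /* sequential time loop     */
        for (int i = 0; i < NI; i++)      /* "parallel" scan of F(t)  */
            if (theta(i) == t)
                order[next++] = i;        /* execute instance i       */
        /* implicit synchronization at the end of each front */
    }
}
```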

The first scheduling algorithm was designed by Allen and Kennedy [AK87], from which many other methods have been derived. These are all based on rather approximate abstractions of dependences, like dependence levels, vectors and cones. Despite the lack of generality, the benefit of such methods is their low complexity and easy implementation in an industrial parallelizing compiler; see the work by Banerjee [Ban92] or, more recently, by Darte and Vivien [DV97] for a survey of these algorithms.

The first general solution to the scheduling problem was proposed by Feautrier [Fea92]. The proposed algorithm is very useful, but its weak point is the lack of help to decide what parameter of the schedule to optimize: is it the latency L, the number of communications (on a distributed memory machine), or the width of the fronts?

Finally, it is well known that control parallelism is more general than data parallelism, meaning that every data parallel program can be rewritten in a control parallel model without losing any parallelism. This is especially true for recursive programs, for which the distinction between the two paradigms becomes very unclear, as shown in [Fea98]. However, for practical programs and architectures, it has long been the case that architectures for massively parallel computations were much more suited to data parallelism, and that getting good speed-ups on such architectures was difficult with control parallelism, mainly due to asynchronous task management overhead. But recent advances in hardware and software systems are changing this situation: excellent results for parallel recursive programs (game simulations like chess, and sorting algorithms) have been obtained with Cilk, for example [MF98].

This thesis is organized in four chapters and a final conclusion. Chapter 2 describes a general framework for program analysis and transformation, and presents the formal definitions useful to the following chapters. The main interest of this chapter is that it encompasses a very large class of programs, from nests of loops with arrays to recursive programs and data structures.

A collection of mathematical results is gathered in Chapter 3; some are rather well known, such as Presburger arithmetic and formal language theory; some are very uncommon in the compiler and parallelism fields, such as rational and algebraic transductions; and the others are mostly contributions, such as left-synchronous transductions and approximation techniques for rational and algebraic transductions.

Chapter 4 addresses instancewise analysis of recursive programs. Based on an extension of the induction variable concept to recursive programs and on new results in formal language theory, it presents two algorithms for dependence and reaching definition analysis. These algorithms are applied to several practical examples.

Parallelization techniques based on memory expansion are studied in Chapter 5. The first three sections present new techniques to expand nested loops with unrestricted conditionals, bounds and array subscripts; the fourth section is a contribution to the simultaneous optimization of expansion and parallelization parameters; and the fifth section presents our results about the parallelization of recursive programs.


Chapter 2

Framework

The previous introduction and motivation has covered several very different concepts and approaches. Each one has been studied by many authors who have defined their own vocabulary and abstractions. Of course, we would like to keep the same formalism along the whole presentation. This chapter presents a framework for describing program analysis and transformation techniques and for proving their correctness or theoretical properties. The design of this framework has been governed by three major goals:

1. build on well defined concepts and vocabulary, while keeping the continuity with related works;

2. focus on instancewise properties of programs, and take benefit of this additional information to design new transformation techniques;

3. head for both generality and high precision, minimizing the necessary number of tradeoffs.

This presentation does not compete with other formalisms, some of which are firmly rooted in semantically and mathematically sound theories [KU77, CC77, JM82, KS92]. Because we advocate for instancewise analysis and transformations, we primarily focused on establishing convincing results about effectiveness and feasibility. This required leaving for further studies the necessary integration of our techniques into a more traditional analysis theory. We are sure that instancewise analysis can be modeled in a formal framework such as abstract interpretation, even if very few works have addressed this important issue.

We start with a formal presentation of run-time statement instances and program executions in Section 2.1; then the program model we will consider throughout this study is exposed and motivated in Section 2.2. Section 2.3 proposes mathematical abstractions for these instance and program models. Program analysis and transformation frameworks are addressed in Sections 2.4 and 2.5, respectively.

During program execution, each statement can be executed several times, depending on the surrounding control structures (loops, procedure calls and conditional expressions). To capture data-flow information as precisely as possible, our analysis and transformation techniques should be able to distinguish between the distinct executions of a statement.

Definition 2.1 (instance) For a statement s, a run-time instance of s is some particular execution of s during execution of the program.

62 CHAPTER 2. FRAMEWORK

For short, a run-time instance of a statement is called an instance. If the program terminates, each statement has a finite number of instances.

Consider the two example programs in Figure 2.1. They both display the sum of an array A with an unknown number N of elements; one is implemented with a loop and the other with a recursive procedure. Statements B and C are executed N times during the execution of each program, but statements A and D are executed only once. The value of variable i can be used to "name" each instance of B and C and to distinguish at compile-time between the 2N + 2 run-time instances of statements A, B, C and D: the unique instances of statements A and D are denoted respectively by ⟨A⟩ and ⟨D⟩, and the N instances of statement B (resp. statement C) associated with some value i of variable i are denoted by ⟨B, i⟩ (resp. by ⟨C, i⟩), 0 ≤ i < N. Such an "iteration variable" notation is not always possible, and a general naming scheme will be studied in Section 2.3.

........................................................................................

int A[N];                               int A[N];
int c;                                  int Sum (int i) {
A  c = 0;                                 if (i<N)
   for (i=0; i<N; i++) {                C   return A[i] + Sum (i+1);
B    c = c + A[i];                        else
   }                                    D   return 0;
   printf ("%d", c);                    }
                                        printf ("%d", Sum (0));

. . . . . . . . . . . . . Figure 2.1. Two programs computing the sum of array A . . . . . . . . . . . . .
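The 2N + 2 instances of this example can be checked concretely with an executable version of both programs (our own sketch; the array contents are invented): counters record one instance ⟨B, i⟩ and one instance ⟨C, i⟩ per value of i, and both programs compute the same sum.

```c
#include <assert.h>

enum { N = 5 };
int A_[N] = { 1, 2, 3, 4, 5 };   /* invented contents */
int nB, nC;                      /* instance counters for B and C */

int sum_loop(void) {
    int c = 0;                               /* instance <A>   */
    for (int i = 0; i < N; i++) {
        nB++;                                /* instance <B,i> */
        c = c + A_[i];
    }
    return c;
}

int sum_rec(int i) {
    if (i < N) {
        nC++;                                /* instance <C,i> */
        return A_[i] + sum_rec(i + 1);
    }
    return 0;                                /* instance <D>   */
}
```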

Because of the state of memory and possible interactions with its environment, several executions of the same program may yield different sets of run-time statement instances and incompatible results. We will not formally define this concept of program execution in operational semantics: a very clean framework has indeed been defined by Cousot and Cousot [Cou81] for abstract interpretation, but the correctness of our analysis and transformation techniques does not require so many details.

Definition 2.2 (program execution) Let P be a program. A program execution e is given by an execution trace of P, which is a finite or infinite (when the program does not terminate) sequence of configurations, i.e. machine states. The set of all possible program executions is denoted by E.

Now, the set of all run-time instances for a given program execution e ∈ E is denoted by Ie. Subscript e denotes a given program execution, but it also recalls that the set Ie is "exact": it is the effective, unapproximated set of statement instances executed during program execution e. This formalism will be used in every further definition of an execution-dependent concept.

Considering again the two programs in Figure 2.1, the execution of statements B and C is governed by a comparison of variable i with the constant N. Without any information on the possible values of N, it is impossible to decide at compile-time whether some instance of B or C executes. In the extreme case of an execution e where N is equal to zero, both statements are never executed, and the set Ie is equal to {⟨A⟩, ⟨D⟩}. In general, Ie is equal to {⟨A⟩, ⟨D⟩} ∪ {⟨B, i⟩, ⟨C, i⟩ : 0 ≤ i < N}, the value of N being part of the definition of e.

2.2. PROGRAM MODEL 63

Of course, each statement can involve several (including zero) memory references, at most one of these being a write (i.e. in left-hand side).

Definition 2.3 (access) A pair (ι, r) of a statement instance and a reference in the statement is called an access.

For a given execution e ∈ E of a program, the set of all accesses is denoted by Ae. It can be decomposed into:

- Re, the set of all reads, i.e. accesses performing some load operation from memory;

- We, the set of all writes, i.e. accesses performing some store operation into memory.

Due to our syntactical restrictions, no access may be simultaneously a read and a write. Since a statement performing some write in memory involves exactly one reference in left-hand side, its instances are often used in place of its write accesses (this sometimes simplifies the exposition).

Looking again at our two programs in Figure 2.1:

- statement A has one write reference to variable c; the single associated access is denoted by ⟨A, c⟩;

- statement B has one write and one read reference to variable c; since both references are identical, the associated accesses are both denoted by ⟨B, i, c⟩, 0 ≤ i < N;

- statement B has one read reference to array A; the associated accesses are denoted by ⟨B, i, A[i]⟩, 0 ≤ i < N;

- statement C has one read reference to array A; the associated accesses are denoted by ⟨C, i, A[i]⟩, 0 ≤ i < N;

- statement D has no memory reference, thus no associated access.
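A direct data representation of these accesses can be sketched as follows (our own naming, not the thesis's): an access pairs an instance (statement label and iteration) with a reference and a read/write tag, and the accesses of statement B split into the write set We and the read set Re.

```c
#include <assert.h>

/* Hypothetical encoding of Definition 2.3: an access is an instance
 * (statement label + iteration) together with a reference.            */
typedef struct {
    char stmt;          /* statement label, e.g. 'B'      */
    int iter;           /* iteration naming the instance  */
    const char *ref;    /* textual reference, e.g. "A[i]" */
    int is_write;       /* 1 for a write, 0 for a read    */
} Access;

enum { N = 3 };
Access acc[3 * N];

int build_accesses_of_B(void) {   /* the three accesses of each <B,i> */
    int n = 0;
    for (int i = 0; i < N; i++) {
        acc[n++] = (Access){ 'B', i, "c",    1 };  /* write <B,i,c>    */
        acc[n++] = (Access){ 'B', i, "c",    0 };  /* read  <B,i,c>    */
        acc[n++] = (Access){ 'B', i, "A[i]", 0 };  /* read  <B,i,A[i]> */
    }
    return n;
}

int count_writes(int n) {
    int w = 0;
    for (int k = 0; k < n; k++) w += acc[k].is_write;
    return w;
}
```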

Our framework focuses on imperative programs. This section describes the control and data structure syntax we consider. In a preliminary work [CCG96], we defined a toy language (called LEGS) which allowed explicit declaration of complex data structure shapes fitting our program model. Most of the program model restrictions we enumerate in this section were also enforced by the language semantics. We nevertheless chose to define our program model with a C-like syntax (with C++ syntactic sugar facilities): despite the lack of formal semantics available in C, we hope this choice will ease the understanding of practical examples and the communication of our new ideas.

Procedures are seen as functions returning the void type, and explicit (typed) pointers are allowed. Multi-dimensional arrays are accessed with the syntax [i1,...,in] (not C syntax) for better understanding.

Definition 2.4 (statement and block) A program statement is any C expression ended with ";" or "}". A program block is a special kind of statement that starts one or more sub-statements.

To simplify the exposition, the only control structures that may appear in the right-hand side of an assignment, in a function call or in a loop declaration are conditional statements. Moreover, multiple expressions separated by , are not allowed, and loops are supposed to follow some minimal "code of ethics": each loop variable is affected by a single loop and its value is not used outside of this loop; as a consequence, each loop variable must be initialized.

This framework is primarily designed for first-order control structures: any function call should be fully specified at compile-time, and "computed" gotos are forbidden. But higher-order structures can be handled conservatively, by approximating the possible function calls using external analysis techniques [Cou81, Deu90, Har89, AFL95]. Calls to input/output functions are allowed as well, but completely ignored by analysis and transformation techniques, possibly yielding incorrect parallelizations.

Recursive calls, loops with unrestricted bounds, and conditional statements with unrestricted predicates are allowed. Classical exception mechanisms, breaks, and continues are supported as well. However, we suppose that gotos are removed by well-known algorithms for structuring programs [Bak77, Amm92], at the cost of some code duplication in the rare cases where the control flow graph is not reducible [ASU86].

We only consider:

- scalars (boolean, integer, floating-point, pointer...);

- records (non-recursive and non-array structures with scalar and record fields);

- arrays of scalars or records;

- trees of scalars or records;

- arrays of trees;

- and trees of arrays.

Records are seen as compound scalars with unaliased named fields. Moreover, unrestricted array values in trees and tree elements in arrays are allowed, including recursive nestings of arrays and trees.

Arrays are accessed through the classical syntax, and other data structures are accessed through the use of explicit pointers. However, to simplify the exposition, we suppose that no variable is simultaneously used as a pointer (through operators * and ->) and as an array (through operator []): in particular, explicit array subscripts must be preferred to pointer arithmetic.

By convention, edge names in trees are identical to the labels of pointer fields in the tree declaration.

2.3. ABSTRACT MODEL 65

In practical implementations, recursive data structures are not made explicit. More precisely, two main problems arise when trying to build an abstract view of data structure definition and usage in C programs.

1. Multiple structure declarations may be relative to the same data structure, without explicit declaration of the shape of the whole object. Moreover, even a single recursive struct declaration can describe several very different objects, such as lists, doubly-linked lists, trees, acyclic graphs, general graphs, etc. Building a compile-time abstraction of the data structures used in a program is thus a difficult problem, but it is essential to our analysis and transformation framework. It can be achieved in two opposite ways: either "decorating" the C code with shape descriptions which guide the compiler when building its abstract view of data structures [KS93, FM97, Mic95, HHN92], or running a compile-time shape analysis of pointer-based structures [GH96, SRW96].

2. Two pointer variables may be aliased, i.e. they may be two different names for the same memory location. The goal of alias analysis [Deu94, CBC93, GH95] (store-less) and points-to analysis [LRZ93, EGH94, Ste96] (store-based) techniques is precisely to disambiguate pointer accesses, when pointer updates are not too complex to be analyzed. In practice, one may expect good results for strongly typed programs without pointer arithmetic, especially if the goal of the alias analysis is to check whether two pointers refer to the same structure or not. Element-wise alias analysis is very costly and still a largely open problem: indeed, no instancewise alias analysis for pointers has been proposed so far, and it could be an interesting future development of our framework.

In the following, we thus suppose that the shape of each data structure has been identified as one of the supported data types, and that each pointer reference has been associated with the data structure instance it refers to.

Now, there is one last question about data structures: how are they constructed, modified and destroyed? When dealing with arrays, a compile-time shape declaration is available in most cases; but some programs require dynamic arrays whose size is updated dynamically every time an out-of-bound access is detected: this is the case of some expanded programs studied in Chapter 5. The problem is more critical with pointer-based data structures: they are most of the time allocated at run-time with explicit malloc or new operations. This problem has already been addressed by Feautrier in [Fea98] and we consider the same abstraction: all data structures are supposed to be built to their maximal extent (possibly infinite) in a preliminary part of the code. To guarantee that this abstraction is correct regarding data-flow information, we must add an additional restriction to the program model: any run-time insertion and deletion is forbidden. In fact there are two exceptions to this very strong rule, but they will be described in the next section after presenting the mathematical abstraction for data structures. Nevertheless, a lot of interesting programs with recursive pointer-based structures perform random insertions and deletions, and these programs cannot be handled at present in our framework. This issue is left for future work.

We start with a presentation of a naming scheme for statement instances, and show that execution traces are not suitable for our purpose. Then, we propose a more powerful abstraction: control words.

66 CHAPTER 2. FRAMEWORK

In the following, every program statement is supposed to be labeled. The alphabet of statement labels is denoted by Σctrl. Now, loops and conditionals require special attention.

Because a loop involves an initialization step, a bound check step, and an iteration step, loops are given three labels: the first one represents the loop entry, the second one is the check for termination, and the third one is the loop iteration. Remember that, in C, a bound check is performed immediately after the loop entry and immediately after each increment. The loop check is considered as both a block and a conditional statement, and the two others are non-block labels.

An if then else statement is given two labels: one for the condition and the then branch, and one for the else branch. Both labels are considered as block labels.

Consider the program example in Figure 2.2.a. This simple recursive procedure computes all possible solutions to the n-Queens problem, using an array A (details of the code are omitted here); it is our running example in this section.

There are two assignment statements: s writes into array A and r performs some read access in A. Statements I and J are conditionals, and statement Q is a recursive call to procedure Queens. Loop statements are divided into three sub-statements which are given distinct labels: the first one denotes the loop entry (e.g. A or B), the second one denotes the bound check (e.g. Â or B̂), and the third one denotes the loop iteration (e.g. a or b). Finally, P is the label of the procedure and F denotes the initial call in main.

A primary goal for instancewise analysis and transformation is to name each statement instance. To achieve this, many works in the program analysis field rely on execution traces. Their interpretation for program analysis is generally defined as a path from the entry of the control flow graph to a given statement.1 They record every execution of a statement, including returns from functions.

For our purpose, these execution traces have three main drawbacks:

1. because of return labels, traces belong to a non-rational language over Σctrl as soon as there are recursive function calls;

2. full-length traces are huge and extremely redundant: if an instance executes before another in the same program execution, its trace is a prefix of the other's;

3. a single statement instance may have several execution traces because statement execution is unknown at compile time.

To overcome the first problem, a classical technique relies on a function called Net on Σctrl* [Har89]: intuitively, this function collapses all call-return pairs in a given execution trace, yielding compact rational sets of execution traces. The third point is much more unpleasant because it prevents giving a unique name to each statement instance. Notice however that different execution traces for the same instance must be associated with distinct executions of the program.

1 Taking no notice of conditional expressions and loop bounds.

2.3. ABSTRACT MODEL 67

........................................................................................

          int A[n];

P         void Queens (int n, int k) {
I           if (k < n) {
A=Â=a         for (int i=0; i<n; i++) {
B=B̂=b           for (int j=0; j<k; j++)
r                 ... = ... A[j] ...;
J               if (...) {
s                 A[k] = ...;
Q                 Queens (n, k+1);
                }
              }
            }
          }

          int main () {
F           Queens (n, 0);
          }

Figure 2.2.a. Procedure Queens

[Control tree: edges are labeled by statement labels and paths from the root spell control words; the path FPIAÂaÂaÂJs leads to an instance of s (black square) and FPIAÂaÂaÂJQPIAÂBB̂r to an instance of r (star).]

Figure 2.2.b. Control tree

. . . . . . . . . . . . . . . . . . . . Figure 2.2. Procedure Queens and control tree . . . . . . . . . . . . . . . . . . . .

Our solution starts from another representation of the program flow: the intuition behind our naming scheme for instances is to consider some kind of "extended stack states" where loops are seen as special cases of recursive procedures. The dedicated vocabulary for this representation has been defined in parts and with several variations in [CC98, Coh99a, Coh97, Fea98].

Let us start with an example: the first instance of statement s in procedure Queens. Depending on the number of iterations of the innermost loop (bounded by k), an execution trace for this first instance can be one of FPIAÂBB̂Js, FPIAÂBB̂bB̂Js, FPIAÂBB̂bB̂bB̂Js, ..., FPIAÂBB̂(bB̂)^k Js. Since we would like to give a unique name to the first instance of s, all B, B̂ and b labels should intuitively be left out. Now, for a given program execution, any statement instance is associated with a unique (ordered) list of block enterings, loop iterations and procedure calls leading to it. To each list corresponds a word: the concatenation of statement labels. This is precisely what we get when forgetting about the innermost loop in execution traces of the first instance of statement s: the single word FPIAÂJs. These concepts are illustrated by the tree in Figure 2.2.b, to be defined later.

We now formally describe these words and their relation with statement instances.

Definition 2.5 (control automaton and control words) The control automaton of the program is a finite-state automaton whose states are statements in the program and where a transition from a state q to a state q′ expresses that statement q′ occurs in block q. Such a transition is labeled by q′. The initial state is the statement executed at the beginning of program execution, and all states are final.

Words accepted by the control automaton are called control words. By construction, they build a rational language Lctrl included in Σctrl*.

Lemma 2.1 Ie being the set of statement instances for a given execution e of a program, there is a natural injection from Ie to the language Lctrl of control words.

Proof: Any statement instance in a program execution is associated with a unique list of block enterings, loop iterations and procedure calls leading to it. We can thus define a function f from Ie to lists of statement labels, mapping statement instances to their respective list of block enterings, loop iterations and procedure calls. Consider an instance ι1 of a statement s1 and an instance ι2 of a statement s2, and suppose f(ι1) = f(ι2) = l. By definition of f, both statements s1 and s2 must be part of the same program block B, and precisely, the last element of l is B. Considering a pair of a statement s and an instance ι of s, this proves that no other instance ι′ of a statement s′ may be such that (f(ι), s) = (f(ι′), s′).

Consider the function φ from Ie to Lctrl (control words) which maps an instance ι of a statement s to the concatenation of all labels in f(ι) and s itself. Thanks to the preceding property on pairs (f(ι), s), function φ is injective.

Theorem 2.1 Let I be the union of all sets of statement instances Ie for every possible execution e of a program. There is a natural injection from I to the language Lctrl of control words.

Proof: Consider two executions e1 and e2 of a program. The function φ defined in the proof of Lemma 2.1 is denoted by φ1 for execution e1 and φ2 for execution e2. If an instance ι is part of both Ie1 and Ie2, control words φ1(ι) and φ2(ι) are the same, because the list of block enterings, loop iterations and function calls leading to ι is unchanged. Lemma 2.1 terminates the proof.

We are thus allowed to talk about "the control word of a statement instance". In general, the set E of possible program executions and the set Ie for e ∈ E are unknown at compile-time, and we may consider all instances that may execute during any program execution. In the end, the natural injection becomes a one-to-one mapping when extending the set Ie with all possible instances associated to "legal" control words. As a consequence, if w is a control word, we will say "instance w" instead of "the instance whose control word is w".
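To make the construction concrete, the control automaton of Definition 2.5 can be sketched as a successor map. The code below is our own illustration, not part of the thesis; the ASCII labels A^ and B^ stand for the loop-check labels Â and B̂.

```python
# Illustrative sketch (not from the thesis): the control automaton of
# procedure Queens as a map from each statement label to the ordered
# list of labels occurring in its block.  F is the initial state and
# all states are final.
QUEENS = {
    "F": ["P"],             # initial call enters the procedure
    "P": ["I"],             # procedure body is the conditional I
    "I": ["A"],             # then-branch enters the outer loop
    "A": ["A^"],            # loop entry leads to the bound check
    "A^": ["B", "J", "a"],  # loop body: inner loop, conditional, iteration
    "a": ["A^"],            # each iteration returns to the check
    "B": ["B^"],
    "B^": ["r", "b"],
    "b": ["B^"],
    "r": [],
    "J": ["s", "Q"],        # the conditional block contains s and Q
    "s": [],
    "Q": ["P"],             # recursive call re-enters the procedure
}

def is_control_word(automaton, initial, word):
    """A word is a control word iff it starts at the initial statement
    and each consecutive pair follows a containment transition."""
    if not word or word[0] != initial:
        return False
    return all(y in automaton.get(x, [])
               for x, y in zip(word, word[1:]))
```

For instance, the word F P I A A^ a A^ a A^ J s (written FPIAÂaÂaÂJs in the text) is accepted, while F P I s is not. Restricting the final states to a single statement, as suggested below, amounts to keeping only the accepted words ending with that label.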

We are also interested in encoding accesses themselves with control words. A simple solution consists in considering pairs (w, ref), where w is a control word for some instance of a statement s and ref is a reference in statement s. But we prefer to encode the full access "inside" the control word: we thus extend the alphabet of statement labels Σctrl with letters of the form sref, for every statement s ∈ Σctrl and reference ref in s. Of course, extended labels may only appear as the last letter in a control word: when the last letter in a control word w is of the form sref, it means that w represents an access instead of an instance. However, when clear from the context, i.e. when there is only one "interesting" reference in a given statement or all references are identical, the reference will be left out of the control word of accesses. This will be the case in most practical examples.


Finally, notice that some states in the control automaton have exactly one incoming transition and one outgoing transition (looping transitions count as both incoming and outgoing). Now, these states do not carry any information about where a statement can be reached from or lead to: in every control word, the label of the outgoing transition follows the label of the incoming one. In practice, we often consider a compressed control automaton where all states with exactly one incoming transition and one outgoing transition are removed. This transformation has no impact on control words.

Observe that loops in the program are represented by looping transitions in the compressed control automaton, and that cycles involving more than one state are associated with recursive calls.

........................................................................................

[Figure 2.3.a shows the plain control automaton for Queens: states F, P, I, A, Â, a, B, B̂, b, J, r, s, Q, with one transition per block containment. Figure 2.3.b shows the compressed control automaton obtained by removing the states with a single incoming and a single outgoing transition.]

Figure 2.3.a. Control automaton          Figure 2.3.b. Compressed control automaton

. . . . . . . . . . . . . . . . . . Figure 2.3. Control automata for program Queens . . . . . . . . . . . . . . . . . .

Figure 2.3.a describes the plain control automaton for procedure Queens.2 Since states F, I, A, B, Q, a and b are useless, they are removed along with their outgoing edges. The compressed automaton is described in Figure 2.3.b.

As a practical remark, notice that it is often desirable to restrict the language of control words to instances of a particular statement. This is easily achieved by choosing the state associated to this statement as the only final one.

To conclude this presentation of a naming scheme for statement instances, it is possible to compare the execution traces of an instance ι and the control word of ι. Indeed, the following property is quite natural: it results from the observation that traces of an instance may only differ in labels of statements that are not part of the list of block enterings, loop iterations and function calls leading to this instance.

2 Every state is final, but this is not made explicit on the figure.

Proposition 2.1 The control word of a statement instance is a sub-word of every execution trace of this instance.

The sequential execution order of the program defines a total order over instances, call it <seq. In English, words are ordered by the lexicographic order generated by the alphabet order a < b < c < ⋯. Similarly, in any program one can define a partial textual order <txt over statements: statements in the same block are sorted in order of appearance, and statements appearing in different blocks are mutually incomparable.

Remember the special case of loops: the iteration label executes after all the statements inside the loop body, but entry and check labels are not comparable with these statements. For procedure Queens in Figure 2.2.a, we have B <txt J <txt a, r <txt b and s <txt Q.

This textual order generates a lexicographic one on control words, denoted by <lex:

    w′ <lex w  ⟺  (∃x, x′ ∈ Σctrl, ∃u, v, v′ ∈ Σctrl* : w = uxv ∧ w′ = ux′v′ ∧ x′ <txt x)
                ∨ (∃v ∈ Σctrl*, v ≠ ε : w = w′v)   (a.k.a. prefix order).

This order is only partial on Σctrl*. However, by construction of the textual order:

Proposition 2.2 An instance ι′ executes before an instance ι iff their respective control words w′ and w satisfy w′ <lex w.

Notice that the lexicographic order <lex is not total on Lctrl because the two branches of a conditional are not comparable! This does not yield a contradiction, because the then and else branches of the same if instance are never simultaneously executed in a single execution. In general, the lexicographic order is total on the subset of control words corresponding to instances that do execute, in one-to-one mapping with Ie for some execution e ∈ E.
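The order <lex can be implemented directly from its two-case definition. The sketch below is our own illustration (not from the thesis), with a hand-written fragment of the textual order of procedure Queens; A^ and B^ stand for the loop-check labels.

```python
# Sketch of the lexicographic order <lex on control words.  TXT holds
# the pairs (x, y) with x <txt y for procedure Queens; letters related
# in neither direction are incomparable (e.g. the two branches of a
# conditional would be).
TXT = {("B", "J"), ("B", "a"), ("J", "a"), ("r", "b"), ("s", "Q")}

def lex_compare(w1, w2):
    """Return -1 if w1 <lex w2, 1 if w2 <lex w1, 0 if equal,
    and None if the two control words are incomparable."""
    for x1, x2 in zip(w1, w2):
        if x1 != x2:
            if (x1, x2) in TXT:
                return -1
            if (x2, x1) in TXT:
                return 1
            return None                       # incomparable letters
    if len(w1) == len(w2):
        return 0
    return -1 if len(w1) < len(w2) else 1     # proper prefix comes first
```

For example, the instance ending with s compares below the instance of the recursive call Q at the same prefix, because s <txt Q; and a block entry, being a proper prefix, executes before every instance inside the block.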

Finally, the language of control words is best understood as an infinite tree, whose root is named ε and every edge is labeled by a statement. Each node then corresponds to the control word equal to the concatenation of edge labels starting from the root. Consider a control word ux, u ∈ Σctrl* and x ∈ Σctrl; every downward edge from a node whose control word is ux corresponds to an outgoing transition from state x in the control automaton. To represent the lexicographic order, downward edges are ordered from left to right according to the textual order. Such a tree is usually called a call tree in the functional languages community, but control tree is more adequate in the presence of loops and other non-functional control structures. One may talk about plain and compressed control trees, depending on the control automaton which defines them.

A partial control tree for procedure Queens is shown in Figure 2.2.b (a compressed one will be studied later in Figure 4.1 page 124). Control word FPIAÂaÂaÂJQPIAÂBB̂r is a possible run-time instance of statement r (depicted by a star in Figure 2.2.b), and control word FPIAÂaÂaÂJs (depicted by a black square) is a possible run-time instance of statement s.


A large number of data structure abstractions have been designed for the purpose of program analysis. This presentation can be seen as an extension of several frameworks we already proposed [CC98, Coh99a, Coh97, Fea98], some of which in collaboration with Griebl [CCG96], but it is also highly relevant to previous work by Alabau and Vauquelin [Ala94], by Giavitto, Michel and Sansonnet [Mic95], by Deutsch [Deu92] and by Larus and Hilfinger [LH88].

Unsurprisingly, array elements are addressed by integers, or vectors of integers for multi-dimensional ones. Tree addresses are concatenations of edge names (see Section 2.2.2) starting from the root. The address of the root is simply ε, the zero-length word. For example, the name of node root->l->r in a binary tree is lr. The set of edge names is denoted by Σdata. The layout of trees in memory is thus described by a rational language Ldata ⊆ Σdata* over edge names.

For the purpose of dependence analysis, we are looking for a mathematical abstraction which captures relations between integer vectors, between words, and between the two. Dealing with trees only, Feautrier proposed to use rational transductions between free monoids in [Fea98]. We will formally define such transductions in Section 3.3, and then show how the same idea can be extended to more general classes of transductions and monoids, to handle arrays and nested trees and arrays as well.

Extending the Data Structure Model

Some interesting structures are basically tree structures enhanced with traversal edges. In many cases, these traversal edges have a very regular structure: the most usual cases are references to the parent and links between nodes at the same height in a tree. Such traversal edges are often used to facilitate special-purpose traversal algorithms. There is some support for such structures when traversal edges are known functions of the generators of the tree structure [KS93, FM97, Mic95], i.e. the "back-bone" spanning tree of the graph. In such a case, traversal edges are merely "algorithmic sugar" for better performance. Even so, our support is limited: recursion and iteration over traversal edges are not supported. We will not study this extension any further, since a full chapter would be necessary.

Abstract Memory Model

The key idea to handle both arrays and trees is that they share a common mathematical abstraction: the monoid. For a quick recall of monoid definitions and properties, see Section 3.2. Indeed, rational languages (tree addresses) are subsets of free monoids with word concatenation, and sets of integer vectors (array subscripts) are subsets of free commutative monoids with vector addition. The monoid abstraction for a data structure will be denoted by Mdata, and the subset of this monoid corresponding to valid elements of the structure will be denoted by Ldata.

The case of nested arrays and trees is a bit more complex but reveals the expressiveness of monoid abstractions. Our first example is the hash-table structure described in Figure 2.4. It defines an array whose elements are pointers to lists of integers. A monoid abstraction Mdata for this structure is generated by Z ∪ {n}, and its binary operation ⋅


........................................................................................

struct key {
  // value of key
  int value;
  // next key
  key *n;
};

key *hash[7];

[Figure: the array hash of seven pointers; some entries point to singly linked lists of integers, e.g. 1 → 9 → 15 → 17, 0 → 11 → 16 → 19 and 2 → 18.]

. . . . . . . . . . . . . . . . . . . . . . . . Figure 2.4. Hash-table structure . . . . . . . . . . . . . . . . . . . . . . . .

is defined as follows:

    n ⋅ n = nn                                                    (2.1)
    ∀i ∈ Z : i ⋅ n = in                                           (2.2)
    ∀i ∈ Z : n ⋅ i = ni   (never used for the hash-table)         (2.3)
    ∀i, j ∈ Z : i ⋅ j = i + j.                                    (2.4)

The set Ldata ⊆ Mdata of valid memory locations in this structure is thus

    Ldata = Z ⋅ n*.

Check that the third case in the definition of operation ⋅ is never used in Ldata.
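As a small sanity check, membership in Ldata = Z ⋅ n* can be tested on a list-of-generators representation of addresses. The validator below is our own sketch, not part of the framework.

```python
# Sketch: validity of an address in the hash-table abstraction.  An
# address is a list mixing integers (array subscripts) and the edge
# name "n" (next link); L_data = Z . n* means one subscript followed
# by any number of n links.
def valid_hash_address(addr):
    return (len(addr) >= 1
            and isinstance(addr[0], int)
            and all(x == "n" for x in addr[1:]))
```

For the nested-array structure of Figure 2.5, the same idea would instead check the alternation (Z ⋅ n)* ⋅ Z.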

Our second example is the structure described in Figure 2.5. It defines an array whose elements are references to other arrays or integers. Each array is either terminal, with integer elements, or intermediate, with array reference elements. This definition is very similar to file-system storage structures, such as UNIX's inodes. The monoid abstraction Mdata for this structure is the same as the hash-table one. However, the set Ldata ⊆ Mdata of valid memory locations in this structure is now

    Ldata = (Z ⋅ n)* ⋅ Z.

Now the definition of operation ⋅ is the same as for the hash-table structure, see (2.1).

In the general case of nested arrays and trees, the monoid abstraction is generated by the union of node names in trees and integer vectors. Its binary operation is defined as word concatenation with additional commutations between vectors of the same dimension. The result is called a free partially commutative monoid [RS97b]:

Definition 2.6 (free partially commutative monoid) A free partially commutative monoid M with binary operation ⋅ is defined as follows:

- generators of M are letters in an alphabet A and all vectors from a finite union of free commutative monoids of the form Z^n;


........................................................................................

struct inode {
  // true means terminal array of integers,
  // false means intermediate array of pointers
  boolean terminal;
  // array size
  int length;
  union {
    // array of block numbers
    int a[];
    // array of inode pointers
    inode *n[];
  };
} quad;

[Figure: a tree of nested arrays; intermediate arrays hold pointers to further arrays, and terminal arrays (flagged true) hold integers.]

. . . . . . . . . . . . . . . . . . . . . . . . Figure 2.5. Nested arrays structure . . . . . . . . . . . . . . . . . . . . . . . .

- for a given integer n, operation ⋅ coincides with vector addition on Z^n: ∀x, y ∈ Z^n, x ⋅ y = x + y.

This framework clearly supports recursively nested trees and arrays.

In the following, we abstract any data structure as a subset Ldata of the monoid Mdata with binary operation ⋅. (⋅ denotes word concatenation for trees and the usual sum for arrays.)
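The binary operation of Definition 2.6 can be sketched as concatenation followed by merging adjacent vectors of equal dimension. The representation below (lists mixing letters and tuples) is our own modeling choice, not part of the thesis.

```python
# Sketch of the operation of a free partially commutative monoid:
# elements are lists mixing tree-edge letters (strings) and integer
# vectors (tuples); adjacent vectors of the same dimension are summed,
# in the spirit of rules (2.1)-(2.4).
def dot(u, v):
    out = list(u) + list(v)
    i = 0
    while i + 1 < len(out):
        a, b = out[i], out[i + 1]
        if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
            # commutative case: collapse two subscripts by addition
            out[i:i + 2] = [tuple(p + q for p, q in zip(a, b))]
        else:
            i += 1
    return out
```

For one-dimensional subscripts, dot([(1,)], [(2,)]) collapses to the single subscript 3, matching rule (2.4), while dot([(3,)], ["n"]) keeps the word 3 ⋅ n, matching rule (2.2).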

Finally, we required in the previous section that no run-time insertion or deletion appear in the program. This rule is indeed too conservative, and two exceptions can be handled by our framework.

1. Because it makes no difference for the flow of data whether the insertion is done before the program or during execution (only the assignment of the value matters), insertions at a list's tail or a tree's leaf are supported.

2. The abstraction is still correct when deletions at a list's tail or a tree's leaf are performed, but it may lead to overly conservative results. Indeed, suppose an insertion follows a deletion at the tail of a list. Considering words in the free monoid abstraction of the list, the memory location of the tail node before deletion will be aliased with the new location of the inserted one.

The case of nested loops with scalar and array operations is very important. It applies to a wide range of numerical, signal-processing, scientific, and multi-media codes. A large amount of work has been devoted to such programs (or program fragments), and very powerful analysis and transformation techniques have been crafted. While the framework above easily captures such programs, it seems both easier and more natural to use another framework for memory addressing and instance naming. Indeed, we prefer the natural addressing scheme in arrays, using integers and integer vectors, because Z-modules have a much richer structure than plain commutative monoids.

To ensure consistency of the control word and integer vector frameworks, we show how control words can be embedded into vectors. This embedding is based on the following definition, introduced by Parikh [Par66] to study properties of algebraic subsets of free commutative monoids:

Definition 2.7 A Parikh mapping over alphabet Σctrl is a function from words over Σctrl to integer vectors in N^Card(Σctrl), such that each word w is mapped to the vector of occurrence counts of every label in w.

There is no specific order in which labels are mapped to dimensions, but we are interested in a particular mapping where dimensions are ordered from the label of the outer loop to the label of the inner one.

The loop nest structure is non-recursive, hence the only cycles in the control automaton are transitions looping on the same state. As a result, the language of control words is in one-to-one mapping with its set of Parikh vectors. The following mapping is computed for the loop nest in Figure 2.6:

    AÂ(aÂ)* BB̂(bB̂)* s + AÂ(aÂ)* CĈ(cĈ)* r → N^11
    w ↦ (|w|A, |w|Â, |w|a, |w|B, |w|B̂, |w|b, |w|C, |w|Ĉ, |w|c, |w|s, |w|r),

where Â, B̂ and Ĉ denote the bound-check labels of the three loops. Respective Parikh vectors of instances AÂaÂaÂaÂaÂBB̂bB̂bB̂s and AÂaÂaÂCĈcĈcĈcĈr are (1, 5, 4, 1, 3, 2, 0, 0, 0, 1, 0) and (1, 3, 2, 0, 0, 0, 1, 4, 3, 0, 1).

........................................................................................

A=Â=a   for (i=0; i<100; i++) {
B=B̂=b     for (j=0; j<100; j++)
s             A[i,j] = ...;
C=Ĉ=c     for (k=0; k<100; k++)
r             ... = ... A[i,k] ...;
        }

. . . . . . . . . . . . . . . . . . . . . . . . Figure 2.6. A simple loop nest . . . . . . . . . . . . . . . . . . . . . . . .

From Parikh vectors, we build iteration vectors by removing all labels of non-iteration statements and collapsing all loops at the same nesting level into the same dimension. Doing this, there is a one-to-one mapping between Parikh vectors and pairs built of an iteration vector and a statement label. Indeed, the statement label captures both the last non-zero component of the Parikh vector (i.e. the identity of the statement) and the identity of the surrounding loops (i.e. which dimension corresponds to which loop).

2.4. INSTANCEWISE ANALYSIS 75

Continuing the example in Figure 2.6, the only remaining labels are a, b and c (i.e. labels of iteration statements), and labels b and c are collapsed together into the second dimension.

The iteration vector of instance AÂaÂaÂaÂaÂBB̂bB̂bB̂s of statement s is (4, 2). The iteration vector of instance AÂaÂaÂCĈcĈcĈcĈr of statement r is (2, 3).
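The Parikh mapping and the iteration-vector extraction can be sketched as plain occurrence counts. The code below is our own illustration (not from the thesis); words are written without accents, so loop entry and check labels both appear as the same uppercase letter, and only the iteration labels a, b, c are counted.

```python
# Sketch of Definition 2.7 (Parikh mapping) and of the derived
# iteration vectors.  Words are plain strings of statement labels.
def parikh(word, alphabet):
    """Occurrence count of each label, in the given dimension order."""
    return tuple(word.count(letter) for letter in alphabet)

def iteration_vector(word, depth_of):
    """depth_of maps each iteration label to its (1-based) nesting
    depth; labels at the same depth share a dimension."""
    vec = [0] * max(depth_of.values())
    for letter in word:
        if letter in depth_of:
            vec[depth_of[letter] - 1] += 1
    return tuple(vec)
```

With depth_of = {"a": 1, "b": 2, "c": 2}, the two instances of Figure 2.6 yield the iteration vectors (4, 2) and (2, 3) given in the text; labels b and c indeed share the second dimension.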

In this process, the lexicographic order <lex on control words is replaced by the lexicographic order on iteration vectors (the first dimensions having a higher priority than the last).

As a conclusion, Parikh mappings show that iteration vectors, the classical framework for naming instances in loop nests, are a special case of our general control word framework.

Because a statement instance cannot be reduced to an iteration vector, we introduce the following notations (these notations generalize the intuitive ones at the end of Section 2.1):

⟨S, x⟩ stands for the instance of statement S whose iteration vector is x;

⟨S, x, ref⟩ stands for the access built from instance ⟨S, x⟩ and reference ref.

This does not imply that control words are overkill when studying loop nests. In particular, they may still be useful when gotos and non-recursive function calls are considered. However, most interesting loop nest transformation techniques are rooted too deeply in the linear algebraic model to be rewritten in terms of control words. Further comparison is largely open, but some ideas and results are pointed out in Section 4.7.

Because our execution model is based on control words instead of execution traces, the previous Definition 2.2 of a program execution is not very practical. For our purpose, a sequential execution e ∈ E of a program is seen as a pair (<seq, fe), where <seq is the sequential order over all possible statement instances (associated to the language of control words) and fe maps every access to the memory location it either reads or writes. Notice that <seq does not depend on the execution: it is defined as the order between all possible statement instances for all executions, which is legal because sequential execution is deterministic. Order <seq is thus partial, but its restriction to a set of instances Ie for a given execution e ∈ E is a total order. However, fe clearly depends on the execution e, and its domain is exactly the set Ae of accesses.

Function fe is the storage mapping for execution e of the program [CFH95, Coh99b, CL99]; it is also called access function [CC98, Fea98]. The storage mapping gathers the effect of every statement instance, for a given execution of the program. It is a function from the exact set Ae of accesses (see Definition 2.3) that actually execute into the set of memory locations.


In practice, the sequential execution order is explicitly defined by the program syntax, but this is not the case of the storage mapping. Some analysis has to be performed, either to compute fe(a) for all executions e and accesses a, or to compute approximations of fe.

Finally, (<seq, fe) has been defined as a view of a specific program execution e, but it can also be seen as a function mapping e ∈ E to pairs (<seq, fe). For the sake of simplicity, such a function, which defines all possible executions of a program, will be referred to as "program (<seq, fe)" in the following.

Many analysis and transformation techniques require some information on "conflicts" between memory accesses.

Definition 2.8 (conflict) Two accesses a and a′ are in conflict if they access (either read or write) the same memory location: fe(a) = fe(a′).

This vocabulary is inherited from the cache analysis framework and its conflict misses [TD95]. Analysis of conflicting accesses is also very similar to alias analysis [Deu94, CBC93]. The conflict relation is the relation between conflicting accesses, and is denoted by ⋈e for a given execution e ∈ E. An exact knowledge of fe and ⋈e is impossible in general, since fe may depend on the initial state of memory and/or input data. Thus, analysis of conflicting accesses consists in building a conservative approximation ⋈ of the conflict relation, compatible with any execution of the program: v ⋈ w must hold when there is an execution e such that v, w ∈ Ae and fe(v) = fe(w), i.e.

    ∀e ∈ E, ∀v, w ∈ Ae : fe(v) = fe(w) ⟹ v ⋈ w.    (2.5)

This condition is the only requirement on relation ⋈, but a precise approximation is generally hoped for. For most program analysis purposes, this relation only needs to be computed on writes, or between reads and writes, but other problems such as cache analysis [TD95] require a full computation.

Consider the example in Figure 2.7, where FirstIndex and SecondIndex are external functions on which no information is available. Because the sign of v is unknown at compile-time, the set of statement instances Ie can contain either statement S or statement T (statements coincide with statement instances since they are not surrounded by any loop or procedure call), depending on the execution. Since the results of FirstIndex and SecondIndex are unpredictable too, no exact storage mapping can be computed at compile-time. The only available compile-time information is that S and T may execute, and then they may also yield conflicting accesses, i.e.

    ⟨S, A[FirstIndex()]⟩ ⋈ ⟨T, A[SecondIndex()]⟩.

However, another piece of information is that executions of S and T are mutually exclusive (due to the if then else construct syntax), and then S and T cannot be conflicting accesses:

    ∄e ∈ E : S ∈ Ae ∧ T ∈ Ae.

This example shows the need for computing approximate results about data-flow properties such as conflicting accesses, and it also shows how complex it is to achieve precise results.


........................................................................................

   int v, A[10];
   scanf ("%d", &v);
   if (v > 0)
S    A[FirstIndex ()] = ...;
   else
T    A[SecondIndex ()] = ...;

. . . . . . . . . . . . . . . . . Figure 2.7. Execution-dependent storage mappings . . . . . . . . . . . . . . . . .

For the purpose of parallelization, we need sufficient conditions allowing two accesses to execute in any order. Such conditions can be expressed in terms of dependences:

Definition 2.9 (dependence) An access a depends on another access a′ if at least one is a write (i.e. a ∈ We or a′ ∈ We), if they are in conflict, i.e. fe(a) = fe(a′), and if a′ executes before a, i.e. a′ <seq a.

The dependence relation for an execution e is denoted by δe; "a depends on a′" is written a′ δe a:

    ∀e ∈ E, ∀a, a′ ∈ Ae : a′ δe a  ⟺def  (a ∈ We ∨ a′ ∈ We) ∧ a′ <seq a ∧ fe(a) = fe(a′).    (2.6)
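Condition (2.6) translates directly into code. The sketch below is our own illustration: an access is modeled as a record with a sequence timestamp standing for <seq, an access kind, and the accessed location fe(a); these fields are our modeling choice, not notation from the thesis.

```python
# Sketch of Definition 2.9: a' depends-before a (a' delta_e a) when at
# least one of the two accesses is a write, a' executes first, and
# both touch the same memory location.
def depends(a, a_prime):
    return ((a["kind"] == "W" or a_prime["kind"] == "W")
            and a_prime["seq"] < a["seq"]
            and a["loc"] == a_prime["loc"])
```

Bernstein's conditions, recalled below, then allow reordering exactly those pairs for which the relation holds in neither direction.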

Once again, an exact knowledge of δe is impossible in general. Thus, dependence analysis consists in building a conservative approximation δ, i.e.

    ∀e ∈ E, ∀a, a′ ∈ Ae : a′ δe a ⟹ a′ δ a.    (2.7)

Finally, Bernstein's conditions tell us that two accesses can be executed in any order, e.g. in parallel, if they are not dependent.

Some techniques require more precision than is available through dependence analysis: given a read access in memory, they need to identify the statement instance that produced the value. The read access is then called the use and the instance that produced the value is called the "definition" that "reaches" the use, or reaching definition. The reaching definition is indeed the last instance, according to the execution order, on which the use depends.
We thus define the function σe, mapping every read access to its reaching definition:

    ∀e ∈ E, ∀u ∈ Re : σe(u) = max<seq { v ∈ We : v δe u },    (2.8)

or, replacing max with its definition:

    ∀e ∈ E, ∀u ∈ Re, ∀v ∈ We : v = σe(u)  ⟺def
        v δe u ∧ ∀w ∈ We : (u <seq w ∨ w <seq v ∨ ¬(w δe u)),

and, expanding the dependence relation:

    ∀e ∈ E, ∀u ∈ Re, ∀v ∈ We : v = σe(u)  ⟺def
        v <seq u ∧ fe(v) = fe(u) ∧ ∀w ∈ We : (u <seq w ∨ w <seq v ∨ fe(v) ≠ fe(w)).

So definition v reaches use u if it executes before the use, if both refer to the same memory location, and if no intervening write w kills the definition.

When a read instance u has no reaching definition, either u reads an uninitialized value (hinting at a programming error) or the analyzed program is only a part of a larger program. To cope with this problem, we add a virtual statement instance ⊥ which executes before all instances in the program and assigns every memory location. Then, each read instance u has a unique reaching definition, which may be ⊥.
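For a fully known execution, (2.8) amounts to a maximum over timestamps. This sketch is our own illustration: writes and reads are records with a sequence timestamp standing for <seq and a location standing for fe, and ⊥ is modeled as None.

```python
# Sketch of (2.8): the reaching definition of a read u is the last
# write before u (in <seq) to the same memory location.  The virtual
# instance BOTTOM initializes all of memory and is returned when no
# such write exists.
BOTTOM = None

def reaching_definition(u, writes):
    candidates = [v for v in writes
                  if v["seq"] < u["seq"] and v["loc"] == u["loc"]]
    if not candidates:
        return BOTTOM
    return max(candidates, key=lambda v: v["seq"])
```

Returning BOTTOM rather than failing mirrors the role of ⊥ above: a set of possible reaching definitions containing ⊥ flags a potentially uninitialized read.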

Because no exact knowledge of σ_e can be hoped for in general, reaching definition analysis computes a conservative approximation σ. It is preferably seen as a relation, i.e.

    ∀e ∈ E, ∀u ∈ R_e, v ∈ W_e :  v = σ_e(u)  ⟹  v σ u.    (2.9)

One may also use σ as a function from reads to sets of writes, and we then talk about sets of possible reaching definitions. One must be very careful about the distinction between a set of effective instances S ⊆ I_e and the set S ∪ {⊥}: if ⊥ ∉ σ(u) then u reads a value produced by some instance in S, but if ⊥ ∈ σ(u) then u may read a value produced before executing the program. The fact that ⊥ appears in a set of possible reaching definitions is the key to program checking techniques, since it may correspond to uninitialized values.
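As a tiny illustration of this convention (the encoding and names below are ours), a set-valued reaching-definition result can simply be inspected for the presence of ⊥:

```python
# Sketch: a set of possible reaching definitions containing the virtual
# instance ⊥ flags a read that may see a value written before the program
# starts, i.e. a possibly uninitialized value. BOTTOM is an illustrative marker.
BOTTOM = 'bottom'

def may_read_uninitialized(sigma_u):
    return BOTTOM in sigma_u

assert may_read_uninitialized({('S1', 2), BOTTOM})
assert not may_read_uninitialized({('S1', 2), ('S2', 2)})
```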

This section is an overview of fuzzy array dataflow analysis (FADA), which was first presented in [CBF95]. The program model is restricted to loop nests, with unrestricted conditionals, loop bounds and array subscripts. The aim of this short presentation is to allow comparison with our own analysis for recursive programs, and because the results of an instancewise reaching definition analysis for loop nests are extensively used in Chapter 5.

Intuitive Flavor

According to (2.8), the exact reaching definition σ_e(u) of some read access u is defined as the maximum of the set of writes in δ_e(u) (for a given program execution e ∈ E). As soon as the program model includes conditionals, while loops, and do loops with non-linear bounds, we have to cope with a conservative approximation of the dependence relation. In the case of nested loops, one usually looks for an affine relation, and non-affine constraints in (2.6) are approximated using additional analyses on variables and array subscripts.

But then, and with the exception of very special cases, computing the maximum of an approximate set of dependences has no meaning: the very execution of the instances in δ(u) is not guaranteed. One solution is to take the entire set δ(u) as an approximation of the reaching definition. Can we do better than that? Let us consider an example. Notice first that, for expository reasons, only scalars are considered. The method, however, applies to arrays with any subscript.

    for (i=0; i<N; i++) {
      if (...)
S1      x = ...;
      else
S2      x = ...;
    }
R   ... = x;

Assuming that N ≥ 1, what is the reaching definition of reference x in statement R? Since all instances of S1 and S2 are in dependence with ⟨R⟩, it seems that we cannot do better than approximating σ(⟨R⟩) with {⟨S1, 1⟩, ..., ⟨S1, N⟩, ⟨S2, 1⟩, ..., ⟨S2, N⟩}.

Let us introduce a new boolean function b_e(i) which represents the outcome of the test at iteration i, for a program execution e ∈ E. This allows us to compute the exact dependence relation δ_e at compile-time:

    ∀e ∈ E, ∀v ∈ W_e :  v δ_e ⟨R⟩  ⟺  ∃i ∈ {1, ..., N} : (v = ⟨S1, i⟩ ∧ b_e(i)) ∨ (v = ⟨S2, i⟩ ∧ ¬b_e(i)),

which can also be written

    ∀e ∈ E :  δ_e(⟨R⟩) = {⟨S1, i⟩ : 1 ≤ i ≤ N ∧ b_e(i)} ∪ {⟨S2, i⟩ : 1 ≤ i ≤ N ∧ ¬b_e(i)}.

Since the above result is not approximate, the exact reaching definition σ_e(⟨R⟩) of ⟨R⟩ is the maximum of δ_e(⟨R⟩).

Suppose σ_e(⟨R⟩) is an instance ⟨S1, e1⟩ for some execution e ∈ E. Because b_e(i) ∨ ¬b_e(i) is equal to true for all i ∈ {1, ..., N}, any value produced by an instance ⟨S1, i⟩ or ⟨S2, i⟩ with i < N is overwritten either by ⟨S1, N⟩ or by ⟨S2, N⟩. This proves that e1 must be equal to N. Conversely, supposing σ_e(⟨R⟩) is an instance ⟨S2, e2⟩, the same reasoning proves that e2 must be equal to N. We then have the following result for the function σ_e:

    ∀e ∈ E :  σ_e(⟨R⟩) = {⟨S1, N⟩ : b_e(N)} ∪ {⟨S2, N⟩ : ¬b_e(N)}.    (2.10)

We may now replace b_e and ¬b_e by their conservative approximations:

    σ(⟨R⟩) = {⟨S1, N⟩, ⟨S2, N⟩}.    (2.11)

Notice the high precision achieved here.
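The reasoning leading to (2.10) and (2.11) can be checked by brute force on small instances of the example loop. The sketch below (with hypothetical helper names) simulates the loop for every possible sequence of test outcomes:

```python
# For any outcome of the non-affine test, the last write to x is performed
# at iteration N, by S1 if the test holds at N and by S2 otherwise; so the
# reaching definition of x in R is always in {<S1,N>, <S2,N>}, as in (2.11).
from itertools import product

def last_write(N, outcomes):
    """Simulate the loop; outcomes[i-1] is the test result at iteration i."""
    writer = None
    for i in range(1, N + 1):                 # iterations 1..N
        writer = ('S1', i) if outcomes[i - 1] else ('S2', i)
    return writer

N = 3
for outcomes in product([True, False], repeat=N):
    stmt, i = last_write(N, outcomes)
    assert i == N                             # always the last iteration
    assert stmt == ('S1' if outcomes[-1] else 'S2')
```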

To summarize these observations, our method will be to give new names to the results of maxima calculations in the presence of non-linear terms. These names are called parameters and are not arbitrary: as shown in the example, some properties of these parameters can be derived. More generally, one can find relations on non-linear constraints, like b_e, by a simple examination of the syntactic structure of the program or by more sophisticated techniques. These relations imply relations on the parameters, which are then used to increase the accuracy of the reaching definition. In some cases, these relations may be so precise as to reduce the "fuzzy" reaching definition to a singleton, thus giving an exact result. See [BCF97, Bar98] for a formal definition and handling of these parameters.

The general result computed by FADA is the following: the instancewise reaching definition relation is a quast, i.e. a nested conditional in which predicates are tests for the positiveness of quasi-affine forms (which include integer division), and leaves are either sets of instances whose iteration vector components are again quasi-affine, or ⊥. See Section 3.1 for details about quasts.


Improving Accuracy

To improve the accuracy of our analysis, properties of the non-affine constraints involved in the description of the dependences can be integrated in the data-flow analysis. As shown in the previous example, these properties imply properties of the parameters introduced in our computation.

Several techniques have been proposed to find properties of the variables of the program or of non-affine functions (see [CH78, Mas93, MP94, TP95] for instance). They use very different formalisms and algorithms, from pattern-matching to abstract interpretation. However, the relations they find can be written as first-order formulas of additive arithmetic (a.k.a. Presburger arithmetic, see Section 3.1) on the variables and non-affine functions of the program. This general type of property makes the data-flow analysis algorithm independent of the practical technique used to find the properties.

How the properties are taken into account in the analysis is detailed in [BCF97, Bar98]. The quality of the approximation is defined w.r.t. the ability of the analysis to integrate (fully or partially) these properties. In general, the analysis cannot find the smallest set of possible reaching definitions [Bar98], for decidability reasons; but for some kinds of properties, such as properties implied by the program structure, the best approximation can be found.

Until now, every set of instances or accesses considered was exact and dependent on the execution. However, as hinted before, we will mostly consider approximate sets and relations in the following. For this reason, we need the following conservative approximations:

I, the set of all possible statement instances for every possible execution of a given program:

    ∀e ∈ E : ι ∈ I_e ⟹ ι ∈ I;

A, the set of all possible accesses:

    ∀e ∈ E : a ∈ A_e ⟹ a ∈ A;

R, the set of all possible reads:

    ∀e ∈ E : a ∈ R_e ⟹ a ∈ R;

W, the set of all possible writes:

    ∀e ∈ E : a ∈ W_e ⟹ a ∈ W.

They can be very conservative or be the result of a very precise analysis. In practice, the precision of these sets is not critical, because they are rarely used directly in algorithms (but they are widely used in the theoretical frameworks associated with these algorithms). Most of the time, they are implicitly present as domains or images of every relation over instances and accesses, and these relations have their own dedicated analyses and approximations.

Sets I, A, R, W and the relations δ and σ defined above are the key to program analysis and transformation techniques. In our framework, no other instancewise information is available at compile-time. In particular, when we present an optimality result for some algorithm, it means optimality according to this information: nobody can do a better job if the only available information is the sets and relations above.


2.5 Parallelization

With the model defined in Section 2.4, the parallelization of some program (<_seq, f_e) means the construction of a program (<_par, f_e^exp), where <_par is a parallel execution order: a partial order and a sub-order of <_seq. Building a new storage mapping f_e^exp from f_e is called memory expansion.³ Obviously, <_par and f_e^exp must satisfy several properties in order to preserve the sequential program semantics.

Some additional properties that are not mandatory for the correctness of the expansion are guaranteed by most practical expansion techniques, for example the property that they effectively "expand" data structures. Intuitively, a storage mapping f_e^exp is finer than f_e when it uses at least as much memory as f_e. More precisely:

Definition 2.10 (finer) For a given execution e of a program, a storage mapping f_e^exp is finer than f_e if

    ∀v, w ∈ W :  f_e^exp(v) = f_e^exp(w)  ⟹  f_e(v) = f_e(w).
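On finite sets of writes, Definition 2.10 can be checked directly. The small sketch below (our own encoding, with writes and cells as plain strings) compares an expanded mapping with the original one:

```python
# f_exp is finer than f when any two writes mapped to the same cell by
# f_exp were already mapped to the same cell by f: expansion may split
# memory cells, but never merges them.

def finer(writes, f_exp, f):
    return all(f[v] == f[w]
               for v in writes for w in writes
               if f_exp[v] == f_exp[w])

writes = ['w1', 'w2', 'w3']
f     = {'w1': 'x',  'w2': 'x',  'w3': 'x'}    # all writes share one cell
f_exp = {'w1': 'x1', 'w2': 'x2', 'w3': 'x1'}   # expanded storage
assert finer(writes, f_exp, f)       # splitting cells is fine
assert not finer(writes, f, f_exp)   # f merges cells that f_exp separates
```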

2.5.1 Memory Expansion and Parallelism Extraction

Some basic expansion techniques to build a storage mapping f_e^exp have been listed in Section 1.2; they are used implicitly or explicitly in most memory expansion algorithms, such as the ones presented in Chapter 5.

Now, the benefit of memory expansion is to remove spurious dependences due to memory reuse: "the more expansion, the less memory reuse". Removing dependences, in turn, extracts more parallelism: "the less memory reuse, the more parallelism". Indeed, consider the exact dependence relation δ_e^exp for the same execution of the expanded program with sequential execution order (<_seq, f_e^exp):

    ∀e ∈ E, ∀a, a′ ∈ A_e :  a′ δ_e^exp a  ⟺_def  (a ∈ W_e ∨ a′ ∈ W_e)  ∧  a′ <_seq a  ∧  f_e^exp(a) = f_e^exp(a′).    (2.12)

Any parallel order <_par (over instances) must be consistent with the dependence relation δ_e^exp (over accesses):

    ∀e ∈ E, ∀(ι₁, r₁), (ι₂, r₂) ∈ A_e :  (ι₁, r₁) δ_e^exp (ι₂, r₂)  ⟹  ι₁ <_par ι₂

(ι₁, ι₂ are instances and r₁, r₂ are references in a statement).

Of course, we want a compile-time description and consider a conservative approximation δ^exp of δ_e^exp. This approximation does not require any specific analysis in general: its computation is induced by the expansion strategy; see Section 5.4.8 for an example.

Theorem 2.2 (correctness criterion of parallel execution orders) Under the following condition, the parallel order is correct for the expanded program (it preserves the original program semantics):

    ∀(ι₁, r₁), (ι₂, r₂) ∈ A :  (ι₁, r₁) δ^exp (ι₂, r₂)  ⟹  ι₁ <_par ι₂.    (2.13)

An important remark is that δ_e^exp is actually equal to σ_e when the program is converted to single-assignment form (but not SSA): every dependence due to memory reuse is removed. We may thus consider δ^exp = σ to parallelize such codes.

3 Because most of the time, f_e^exp requires more memory than f_e.


In this section, we recall some classical results about loop nest parallelization; recursive programs will be addressed in Section 5.5. We have already presented, in Section 1.2, two main paradigms to generate parallel code. To compute the parallel execution order <_par, data parallelism, the second paradigm, will be assumed.

Extending parallelization techniques to irregular loop nests has already been studied by several authors: [Col95a, Col94b, GC95], to cite only the results nearest to our work. Instead of presenting a novel parallelization algorithm, we show how most of the existing ones can be integrated in our framework.

Scheduling

Dependence or reaching definition analyses derive a graph whose nodes are operations and whose edges are constraints on the execution order. The problem is now to traverse the graph in a partial order; this order is the execution order for the parallel program. The more partial the order, the higher the parallelism. In general, this partial order cannot be expressed as a list of related pairs: one needs an expression of the partial order that does not grow with problem size, i.e. a closed form. Additional constraints on the expression of partial orders are: it should have high expressive power, be easy to find and manipulate, and allow optimized code generation.

A suitable solution is to use a schedule θ [Fea92], i.e. a function from the set I of all instances to the set ℕ of non-negative integers. In a more general presentation of schedules, vectors of integers can be used: one may then talk about multidimensional "time" and schedules. This issue is studied by Feautrier in [Fea92]. The following definitions consider one-dimensional schedules only, but this makes no fundamental difference with multidimensional ones. From Theorem 2.2, we already know how the correct parallel execution orders are defined from the dependence relation in the expanded program. Rewriting this result for a schedule function, the correctness criterion becomes

    ∀(ι₁, r₁), (ι₂, r₂) ∈ A :  (ι₁, r₁) δ^exp (ι₂, r₂)  ⟹  θ(ι₁) < θ(ι₂),    (2.14)

where δ^exp is the dependence relation in the expanded program (for multidimensional schedules, <_lex is used to compare the vectors). If no expansion has been performed, δ^exp is the original dependence relation δ. If the program has been converted to single-assignment form, it is the reaching definition relation σ. On the other hand, since θ is integer-valued, the constraint above is equivalent to:

    ∀(ι₁, r₁), (ι₂, r₂) ∈ A :  (ι₁, r₁) δ^exp (ι₂, r₂)  ⟹  θ(ι₁) + 1 ≤ θ(ι₂).    (2.15)

This system of functional inequalities, called causality constraints, must be solved for the unknown function θ. As is often true for systems of inequalities, it may have many different solutions. One can minimize various objective functions, e.g. the number of synchronization points or the latency.
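On a finite set of dependences, the causality constraints (2.15) can be checked mechanically. In this sketch, the instance names and the example dependences are ours; a candidate schedule is validated against the dependence set:

```python
# A schedule theta satisfies (2.15) when every dependence source executes
# at least one logical step before its sink.

def satisfies_causality(deps, theta):
    return all(theta[src] + 1 <= theta[snk] for (src, snk) in deps)

# dependences of: for i { S1: a[i] = ...; S2: ... = a[i]; }
deps = [(('S1', i), ('S2', i)) for i in range(4)]

theta_ok = {('S1', i): 0 for i in range(4)}       # all S1's at step 0
theta_ok.update({('S2', i): 1 for i in range(4)}) # all S2's at step 1
theta_bad = {inst: 0 for inst in theta_ok}        # everything at step 0

assert satisfies_causality(deps, theta_ok)
assert not satisfies_causality(deps, theta_bad)
```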

Feautrier's Scheduling Algorithm

In the following, the notation Iter(ι) denotes the iteration vector of instance ι. Considering (2.15), let us introduce z, the vector of all variables in the problem: z is obtained by concatenating Iter(ι₁), Iter(ι₂), and the vector of symbolic constants of the problem (recall Iter(⟨S, x⟩) = x). It so happens that, in the context of affine dependence relations, the constraint (ι₁, r₁) δ^exp (ι₂, r₂) is a disjunction of conjunctions of affine inequalities. In other words, the set {(u, v) : u δ^exp v} is a union of convex polyhedra. This result, established for general affine relations, also holds when the dependence relation is approximated in various ways, such as dependence cones, direction vectors and dependence levels; see [PD96, Ban92, DV97].

Since the constraints in the antecedent of (2.15) are affine, let us denote them by C_i(z) ≥ 0, 1 ≤ i ≤ M. Similarly, let Δ(z) ≥ 0 be the consequent θ(ι₂) − θ(ι₁) − 1 ≥ 0 of (2.15). Then, we can apply the following lemma:

Lemma 2.2 (Affine Form of Farkas' Lemma) An affine function Δ(z) from integer vectors to integers is non-negative on a polyhedron {z : C_i(z) ≥ 0, 1 ≤ i ≤ M} if there exist non-negative integers λ₀, ..., λ_M (the Farkas multipliers) such that:

    Δ(z) = λ₀ + Σ_{i=1}^{M} λ_i C_i(z)    (2.16)

This relation is valid for all values of z. Hence, one can equate the constant term and the coefficient of each variable on each side of the identity, to get a set of linear equations whose unknowns are the coefficients of the schedules and the Farkas multipliers λ_i. Since the latter are constrained to be non-negative, the system must be solved by linear programming [Fea88b, Pug92] (see also Section 3.1).
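A numeric illustration of this lemma on a toy instance of our own choosing: over the dependence polyhedron {(i1, i2) : i2 − i1 − 1 ≥ 0, i1 ≥ 0}, the causality form for the identity schedule θ(i) = i is exactly the first constraint, so the multipliers (λ0, λ1, λ2) = (0, 1, 0) witness its non-negativity:

```python
# The identity (2.16) holds for *all* integer points, which is what lets
# one equate coefficients on both sides; non-negativity on the polyhedron
# then follows from the non-negativity of the multipliers.

C1 = lambda i1, i2: i2 - i1 - 1      # dependence distance >= 1
C2 = lambda i1, i2: i1               # source iteration >= 0
D  = lambda i1, i2: i2 - i1 - 1      # theta(i2) - theta(i1) - 1, theta(i) = i
lam0, lam1, lam2 = 0, 1, 0           # Farkas multipliers

for i1 in range(-5, 6):
    for i2 in range(-5, 6):
        assert D(i1, i2) == lam0 + lam1 * C1(i1, i2) + lam2 * C2(i1, i2)
        if C1(i1, i2) >= 0 and C2(i1, i2) >= 0:
            assert D(i1, i2) >= 0    # non-negative on the polyhedron
```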

Unfortunately, some loop nests do not have "simple" affine schedules. The reason is that when a loop nest has an affine schedule, it has a large degree of parallelism; however, it is clear that some loop nests have little or even no parallelism, hence no affine schedule. The solution in this case is to use a multidimensional affine schedule, whose values range over ℕ^d, d > 1, ordered according to the lexicographic order. Such a schedule can have as low a degree of parallelism as necessary, and can even represent sequential programs. The selection of a multidimensional schedule can be automated using algorithms from [Fea92]. It can be proved that any loop nest in an imperative program has a multidimensional schedule. Notice that multidimensional schedules are particularly useful in the case of dynamic control programs, since in that case we have to overestimate the dependences and hence underestimate the degree of parallelism.

Code generation for parallel scheduled programs is simple in theory, but very complex in practice: issues such as polyhedron scanning [AI91], communication handling, task placement, and low-level optimizations are critical for efficient code generation [PD96] (pages 79-103). Dealing with complex loop bounds and conditionals raises new code generation problems, not to mention the allocation of expanded data structures; see [GC95, Col94a, Col95b].

Other Scheduling Techniques

Before the general solution to the scheduling problem proposed by Feautrier, most algorithms were based on classical loop transformation techniques, including loop fission, loop fusion, loop interchange, loop reversal, loop skewing, loop scaling, loop reindexing and statement reordering. Moreover, dependence abstractions were much less expressive than affine relations.

The first algorithm was designed by Allen and Kennedy [AK87]; it inspired many other solutions [Ban92]. Several complexity and optimality results have also been established by Darte and Vivien [DV97]. Extending previous results, they designed a very powerful algorithm, but its abstraction does not support the full expressive power of affine relations.

Moreover, many optimizations of Feautrier's algorithm have been designed, mainly

because of the wide range of objective functions to optimize. For example, Lim and Lam

propose in [LL97] a technique to reduce the number of synchronizations induced by a

schedule, and they compare their technique with other recent improvements.

Speculative execution is a classical technique to improve the scheduling of finite dependence graphs, but not of general affine relations. It has been explored by Collard and Feautrier as a way to extract more parallelism from programs with complex loop bounds and conditionals [Col95a, Col94b].

Eventually, all schedule functions computed by these techniques can be captured by affine functions of the iteration vectors. The associated parallel execution order is thus an affine relation <_par, well suited to our formal framework:

    ∀u, v ∈ W :  u <_par v  ⟺  θ(u) < θ(v)

for one-dimensional schedules, and

    ∀u, v ∈ W :  u <_par v  ⟺  θ(u) <_lex θ(v)

for multidimensional ones.
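For multidimensional schedules, instances are compared through the lexicographic order on their schedule vectors. Python tuples happen to compare lexicographically, which gives a direct sketch (the two-dimensional schedule below is an arbitrary example of ours):

```python
# <par derived from a two-dimensional schedule: u precedes v exactly when
# theta(u) <lex theta(v).

theta = {('S', (i, j)): (i, j) for i in range(3) for j in range(3)}

def before_par(u, v):
    return theta[u] < theta[v]       # tuple comparison is lexicographic

assert before_par(('S', (0, 2)), ('S', (1, 0)))     # (0,2) <lex (1,0)
assert not before_par(('S', (1, 0)), ('S', (0, 2)))
assert not before_par(('S', (2, 2)), ('S', (2, 2))) # <par is strict
```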

Tiling

Despite the good theoretical results and recent achievements, scheduling techniques can lead to very bad performance, mainly because of communication overhead and cache problems. Indeed, fine-grain parallelization is not suitable for most parallel architectures.⁴ Partitioning run-time instances is thus an important issue: the solution is to group elementary computations in order to take advantage of memory hierarchies and to overlap communications with computations.

The tiling technique groups elementary computations into tiles, each tile being executed on a processor in an atomic way. It is well suited to nested loops with regular computation patterns [IT88, CFH95, BDRR94]. An important goal of this research is to find the best tiling strategy with respect to criteria such as the number of communications happening between the tiles. This strategy must be known at compile time to generate efficient code for a particular machine.

Most tiling techniques are limited to perfect loop nests, and dependences are often supposed uniform when evaluating the amount of communications. The most usual tile model has been defined by Irigoin and Triolet in [IT88]; it enforces the following constraints:

- tiles are bounded, for local memory requirements;
- tiles are identical by translation, to allow efficient code generation and automatic processing;
- tiles are atomic units of computation, with synchronization steps at their beginning and at their end.

4 But it is suitable for instruction-level parallelism.


Many different algorithms have been designed to find an efficient tile shape and then to partition the loop nest. The scheduling of individual tiles is done using classical scheduling algorithms. However, the inner-tile sequential execution is open to a larger scope of techniques, depending on the context. The simplest inner-tile execution order is the original sequential execution of the elementary computations, but other execution orders, still compatible with the program dependences, could be more suitable for the local memory hierarchy, or would enable more aggressive storage mapping optimization techniques (see Section 5.3 for details; further study of this idea is left for future work). A more extensive presentation of tiling can be found in [BDRR94].

We make one hypothesis to handle parallel execution orders produced by tiling techniques in our framework: the inner-tile execution order, denoted by <_inn, must be affine. Nevertheless, we are not aware of techniques that would build non-affine inner-tile execution orders. The tile shape can be any bounded parallelepiped (or part of a parallelepiped on iteration space boundaries), but is often a rectangle in practice. Then, the result of a tiling technique is a pair (T, θ), where the tiling function T maps statement instances to individual tiles and the schedule θ maps tiles to integers or vectors of integers. Eventually, the result of a tiling technique can be captured by our parallel execution order framework, with an affine relation <_par:

    ∀u, v ∈ W :  u <_par v  ⟺  θ(T(u)) < θ(T(v))  ∨  (T(u) = T(v) ∧ u <_inn v)    (2.17)

for a one-dimensional schedule of tiles, and

    ∀u, v ∈ W :  u <_par v  ⟺  θ(T(u)) <_lex θ(T(v))  ∨  (T(u) = T(v) ∧ u <_inn v)    (2.18)

for a multidimensional schedule.
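Relation (2.17) can be sketched on a one-dimensional iteration space with rectangular tiles; the tile size B, the identity tile schedule and the original inner-tile order below are all illustrative choices of ours:

```python
B = 4                                   # tile size (assumed)
T = lambda i: i // B                    # tiling function: iteration -> tile
theta = lambda t: t                     # schedule on tiles

def before_par(u, v):
    # instance u precedes v if its tile is scheduled earlier, or if both
    # fall in the same tile and u precedes v in the inner-tile order
    return theta(T(u)) < theta(T(v)) or (T(u) == T(v) and u < v)

assert before_par(1, 2)                 # same tile, inner order
assert before_par(3, 4)                 # tile 0 scheduled before tile 1
assert not before_par(4, 3)
assert not before_par(2, 2)             # <par is a strict order
```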

When dealing with nests of loops, it is well known that complex loop transformations require complex polytope traversals, which slightly increases execution time. Moreover, even when no run-time restoration of the data flow is required, the right-hand sides of statements often grow huge because of nested conditional expressions. The code generated by a straightforward application of parallelization algorithms is then very inefficient. Moving conditionals and splitting loops is very useful, as well as polytope scanning techniques [AI91, FB98].

These remarks naturally extend to recursive programs and recursive data structures. The only difference is that most optimization techniques, such as constant propagation, forward substitution, invariant code motion and dead-code elimination [ASU86, Muc97], are either limited to non-recursive programs or much less effective on complex recursive structures. In this work, indeed, most experiments with recursive programs have required manual optimizations. This should encourage us to develop more aggressive techniques suitable for recursive programs.

Of course, the shape and alias analyses discussed in Section 2.2.2 are very useful when pointer-based data structures are considered. A single pair of aliased pointers is likely to forbid any further precise analysis or aggressive program transformation, especially when generic types (such as void*) are used.

Induction variable detection [Wol92] and other related symbolic analysis techniques [HP96] are critical for program analysis and transformation. This is especially true for instancewise analyses: computing the value of an integer (or pointer) variable at each instance of a statement is the key information for dependence analysis. We will indeed present a new induction variable detection technique suitable for our recursive program model.

In the following, when no specific contribution has been proposed in this work, we will not address these necessary preliminary stages and optimizations:

- we will always consider that the required information about data structure shape, aliases or induction variables is available, whenever this information can be derived by classical techniques;
- we will generate unoptimized transformed programs, assuming that classical optimization techniques can do the job.

We make the hypothesis that our techniques, if implemented in a parallelizing compiler, are preceded and followed by the appropriate analyses and optimizations.


Chapter 3

Formal Tools

Most technical results on mathematical abstractions are gathered in this chapter. Section 3.1 is a general presentation of Presburger arithmetic and of algorithms for systems of affine inequalities. Section 3.2 recalls classical results on formal languages, and Section 3.3 addresses rational relations over monoids. Contributions to an interesting class of rational relations are found in Section 3.4. Section 3.5 addresses algebraic relations, and also presents some new results. The last two sections are mostly devoted to the applicability of formal language theory to our analysis and transformation framework: Section 3.6 discusses the intersection of rational and algebraic relations, and the approximation of relations is the purpose of Section 3.7.

The reader whose primary interest is in the analysis and transformation techniques may skip all proofs and technical lemmas, and concentrate on the main theorems. Because this chapter is more of a "reference manual" for mathematical objects, it can also be read "on demand" when technical information is required in the following chapters.

When dealing with iteration vectors, we need a mathematical abstraction to capture sets, relations and functions. This abstraction must also support classical algebraic operations. Presburger arithmetic is well suited to this purpose, since most interesting questions are decidable within this theory. It is defined by logical formulas built from ¬, ∨ and ∧, equality and inequality constraints on integer affine forms, and the first-order quantifiers ∃ and ∀. Testing the satisfiability of a Presburger formula is at the core of most symbolic computations involving affine constraints. It is known as integer linear programming and is decidable, but NP-complete; see [Sch86] for details. Indeed, all known algorithms are super-exponential in the worst case, such as the Fourier-Motzkin algorithm implemented by Pugh in Omega [Pug92] and the Simplex algorithm with Gomory cuts implemented by Feautrier in PIP [Fea88b, Fea91]. In practice, Fourier-Motzkin is very efficient on small problems, and the Simplex algorithm is more efficient on medium-sized problems, because its complexity is polynomial on average. Computing exact solutions to large integer linear programs is an open problem at present, and this is a problem for the practical application of Presburger arithmetic to automatic parallelization.


We consider vectors of integers, and sets, functions, and relations thereof. Functions are seen as a special case of relations, and relations are also interpreted as functions: a relation on sets A and B can equivalently be described by a function from A to the set P(B) of subsets of B. Notice that the range and domain of a function or relation may not have the same dimension. Sets of integer vectors are ordered by the lexicographic order <_lex, and the "bottom element" ⊥ denotes by definition an element which precedes all integer vectors. Strictly speaking, we consider sets, functions and relations described by Presburger formulas on integer vectors extended with ⊥.

To describe mathematical objects in Presburger arithmetic, we use three types of variables: bound variables, unknowns and parameters. Bound variables are quantified by ∃ and ∀ in logical formulas, whereas unknown variables and parameters are free variables. Unknown variables appear in input, output or set tuples, whereas parameters are fully unbound and interpreted as symbolic constants. Handling parameters is trivial with Fourier-Motzkin, but required a specific extension of the Simplex algorithm, called Parametric Integer Programming (PIP) by Feautrier [Fea88b].

Omega [Pug92] is widely used in our prototype implementations and semi-automatic experiments, and its syntax is very close to the usual mathematical one. Non-intuitive details will be explained when needed in the experimental sections. PIP uses another representation of affine relations called the quasi-affine selection tree, or quast, where quasi-affine forms are an extension of affine forms including integer division and modulo operations with integer constants.

Definition 3.1 (quast) A quasi-affine selection tree (quast) representing an affine relation¹ is a multi-level conditional, in which

- predicates are tests for the positiveness of quasi-affine forms in the input variables and parameters,
- and leaves are sets of vectors described in Presburger arithmetic extended with ⊥, which precedes any other vector for the lexicographic order.

It should be noticed that bound variables in affine relations appear in quasts as parameters called wildcard variables. These wildcard variables are not free: they are constrained inside the quast itself. Moreover, quasi-affine forms (with modulo and division operations) in conditionals and leaves can be converted into "pure" affine forms thanks to additional wildcard variables; see [Fea91] for details.

Empty sets are allowed in leaves (they differ from the singleton {⊥}) to describe vectors that are not in the domain of a relation. Let us give a few examples.

The function corresponding to integer addition is written

    {(i₁, i₂) → (j) : i₁ + i₂ = j}

and can be represented by the quast

    {i₁ + i₂}

1 In fact, this is an extension of Feautrier's definition, to capture unrestricted affine relations and not only affine functions; see [GC95].


The same function restricted to integers less than a symbolic constant N is written

    {(i₁, i₂) → (j) : i₁ < N ∧ i₂ < N ∧ i₁ + i₂ = j}

and as a quast

    if i₁ < N
      then if i₂ < N
             then {i₁ + i₂}
             else ⊥
      else ⊥

The relation between even numbers is written

    {(i) → (j) : ∃α, β : i = 2α ∧ j = 2β}

(we keep the functional notation → for better understanding, and to be compliant with Omega's syntax) and has a quast representation

    if i = 2α
      then {2β : β ∈ ℤ}
      else ⊥

(α and β are wildcard variables)

Many other examples of quasts occur in Chapter 5.
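One concrete way to read the bounded-addition quast above is as a nested conditional returning either a result set or the bottom element. The encoding below is only a sketch of ours, with the symbolic constant N fixed to a number:

```python
BOTTOM = None          # stands for the bottom element
N = 10                 # symbolic constant, fixed for the sketch

def add_quast(i1, i2):
    if i1 < N:
        if i2 < N:
            return {i1 + i2}           # leaf: a singleton set
        return BOTTOM                  # leaf: outside the domain
    return BOTTOM

assert add_quast(3, 4) == {7}
assert add_quast(12, 4) is BOTTOM
assert add_quast(3, 40) is BOTTOM
```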

A new interface to PIP has been written in Objective Caml, allowing easy and efficient handling of these quasts. The implementation was done by Boulet and Barthou; see [Bar98] for details. The quast representation is neither better nor worse than the classical logical one, but it is very useful for code generation algorithms and very close to the parametric integer programming algorithm.

To conclude this presentation of mathematical abstractions for affine relations, we suppose that Make-Quast is an algorithm that computes a quast representation of any affine relation. (The reverse problem is much easier and not useful to our framework.) Its extensive description is rather technical, but we may sketch the principles of the algorithm. The Presburger formula defining the affine relation is first converted to a form with only existential quantifiers, by way of negation operators (a technique also used in the Skolem transformation of first-order formulas); then every bound variable is replaced by a new wildcard variable; unknown variables are isolated from equalities and inequalities to build sets of integer vectors; and eventually the ∧ and ∨ operators are rewritten in terms of conditional expressions. Subsequent simplifications, size reductions and canonical form computations are not discussed here; see [Fea88b, PD96, Bar98] for details.

For more details on Presburger arithmetic, integer programming, mathematical representations of affine relations, specific algorithms and applications to compiler technology, see [Sch86, PD96, Pug92, Fea88b].

Computing the transitive closure of a relation is a classical technique in computer science, but most algorithms target relations whose graph is finite. This hypothesis is obviously not acceptable in the case of affine relations. The problem is that the transitive closure of an affine relation may not be an affine relation; and knowing when it is an affine relation is not even decidable. Indeed, we can encode multiplication using transitive closure, and multiplication is not definable inside Presburger arithmetic:

    {(x, y) → (x + 1, y + z)}* = {(x, y) → (x′, y + z(x′ − x)) : x ≤ x′}.

It should be noted that testing if a relation R is closed by transitivity is very simple:

it is equivalent to R R R being empty.

We are thus left with approximation techniques. Indeed, finding a lower bound is rather easy in theory: the transitive closure R* of a relation R can be defined as

R* = ⋃_{k∈ℕ} Rᵏ,

and computing ⋃_{k=0}^{n} Rᵏ for increasing values of n yields increasingly accurate lower bounds. In some cases, ⋃_{k=0}^{n} Rᵏ is constant for n greater than some value n₀, and this constant gives the exact result for R*. But in general, the size of the result grows very quickly without reaching the exact transitive closure. This method can still be used with "reasonable" values of n to compute a lower bound.

Now, the previous iterative technique is unable to find the exact transitive closure of the relation R = {(i) → (i + 1)}, and it is even unable to give any interesting approximation. The transitive closure of R is nevertheless a very simple affine relation: R* = {(i) → (i′) : i ≤ i′}. More clever techniques should thus be used to approximate transitive closures of affine relations. Kelly et al. designed such a method and implemented it in Omega [KPRS96]. It is based on approximating general affine relations in a sub-class where transitive closure can be computed exactly. They coined the term d-form (d for difference) to define this class. Their technique allows computation of both upper bounds (i.e. conservative approximations) and lower bounds, see [KPRS96] for details.
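The iterative lower-bound scheme is easy to experiment with when the relation is restricted to a finite domain; a minimal sketch in Python (the bounded domain {0, …, 5} and the set-of-pairs encoding are illustrative assumptions, since a true affine relation has an infinite graph, which is exactly why the method only yields lower bounds in general):

```python
def compose(r, s):
    """Relational composition: pairs (a, c) such that a r b and b s c."""
    by_src = {}
    for b, c in s:
        by_src.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r for c in by_src.get(b, ())}

def closure_lower_bound(r, n):
    """Union of R^1 .. R^n: a lower bound of the transitive closure R+."""
    acc, power = set(r), set(r)
    for _ in range(n - 1):
        power = compose(power, r)
        new = acc | power
        if new == acc:          # fixpoint reached: the bound is exact
            break
        acc = new
    return acc

# Successor relation restricted to the finite domain {0, ..., 5}
r = {(i, i + 1) for i in range(5)}
print(closure_lower_bound(r, 10) ==
      {(i, j) for i in range(6) for j in range(6) if i < j})   # → True
```

On this finite restriction the fixpoint is reached after five iterations; on the unbounded relation {(i) → (i + 1)} the loop would never converge, as noted above.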

This section starts with a short review of basic concepts, then we recall formal language properties interesting to our purpose. See the well-known book by Hopcroft and Ullman [HU79], the first two chapters of the book by Berstel [Ber79], and the Handbook of Formal Languages (volume 1) [RS97a] for details.

A semi-group consists of a set M and an associative binary operation on M, usually denoted by multiplication. A semi-group which has a neutral element is a monoid. The neutral element of a monoid is unique, and is usually denoted by 1_M, or 1 for short. The monoid structure is widely used in this work, with several different binary operations.

Given two subsets A and B of a monoid M, the product of A and B is defined by

AB = {c ∈ M : ∃a ∈ A, ∃b ∈ B : c = ab}.

This definition converts P(M) into a monoid with unit {1_M}. A subset A of M is a sub-semi-group (resp. sub-monoid) of M if A² ⊆ A (resp. A² ⊆ A and 1_M ∈ A). Given

3.2. MONOIDS AND FORMAL LANGUAGES 91

a subset A of M,

A⁺ = ⋃_{n≥1} Aⁿ

is a sub-semi-group of M, and

A* = ⋃_{n≥0} Aⁿ

with A⁰ = {1_M} is a sub-monoid of M. In fact, A⁺ (resp. A*) is the least sub-semi-group (resp. sub-monoid) for the order of set inclusion containing A. It is called the sub-semi-group (resp. sub-monoid) generated by A. If M = A* for some A ⊆ M, then A is a system of generators of M. A monoid is finitely generated if it has a finite set of generators.

For any set A, the free monoid A* generated by A is defined by the tuples (a₁, …, aₙ) of elements of A, with n ≥ 0, and with tuple concatenation as binary operation. When A is finite and non-empty, it is called an alphabet, tuples are called words, elements of A are called letters, and the neutral element is called the empty word and denoted by ε. A formal language is a subset of a free monoid A*, and the length |u| of a word u ∈ A* is the number of letters composing u. By definition, the length of the empty word is 0. For a letter a in an alphabet A, the number of occurrences of a in a word u is denoted by |u|_a. We will also use the classical notions of prefixes, suffixes, word reversal, sub-words and word factors. The product of two languages is also called concatenation.

We also recall the definition of a monoid morphism. If M and M′ are monoids, a (monoid) morphism μ : M → M′ is a function satisfying

μ(1_M) = 1_{M′} and ∀m₁, m₂ ∈ M : μ(m₁m₂) = μ(m₁)μ(m₂).

Extended to subsets A and B of M, a morphism satisfies

μ(AB) = μ(A)μ(B), μ(A⁺) = μ(A)⁺, and μ(A*) = μ(A)*.
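On a free monoid, a morphism is entirely determined by the images of the letters; a minimal Python sketch (the letter images are an arbitrary, hypothetical choice), where words are strings and the monoid operation is string concatenation:

```python
def morphism(images):
    """Extend letter images to a monoid morphism on the free monoid A*
    (words are plain strings, concatenation is the monoid operation)."""
    return lambda w: "".join(images[c] for c in w)

# A hypothetical morphism from {a, b}* to {0, 1}*
mu = morphism({"a": "01", "b": "1"})

assert mu("") == ""                      # mu(1_M) = 1_M'
u, v = "ab", "ba"
assert mu(u + v) == mu(u) + mu(v)        # mu(m1 m2) = mu(m1) mu(m2)
print(mu("abb"))                         # → 0111
```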

This section recalls basic definitions and results, to set notations and to allow reference in later chapters.

Given an alphabet A, a (finite-state) automaton 𝒜 = (A*, Q, I, F, E) consists of a finite set Q of states, a set I ⊆ Q of initial states, a set F ⊆ Q of final states, and a finite set of transitions (a.k.a. edges) E ⊆ Q × A* × Q.

The free monoid A* is often omitted for commodity, when clear from the context: we write 𝒜 = (Q, I, F, E). A transition (q, x, q′) ∈ E is usually written q →ˣ q′; q is the departing state, q′ is the arrival state, and x is the label of the transition. A transition whose label is ε is called an ε-transition.

A path is a word (p₁, x₁, q₁) ⋯ (pₙ, xₙ, qₙ) in E* such that qᵢ = pᵢ₊₁ for all i ∈ {1, …, n − 1}, and x₁ ⋯ xₙ is called the label of the path. An accepting path goes from an initial state to a final one. An automaton is trim when all its states are accessible and may be part of an accepting path.

An automaton is deterministic when it has a single initial state, every transition label is a single letter or ε, at most one transition may share the same departing state and label, and a state with a departing ε-transition may not have departing labeled transitions.

The language |𝒜| realized by a finite-state automaton 𝒜 is defined by u ∈ |𝒜| iff u labels an accepting path of 𝒜. A regular language is a language realized by some finite-state automaton.


Any finite-state automaton is equivalent to a trim one without ε-transitions and where all transition labels are single letters. Any regular language can be realized by a deterministic finite-state automaton.

The family of rational languages over an alphabet A is equal to the least family of languages over A containing the empty set and the singletons, and closed under union, concatenation and the star operation.

The following well-known theorem is at the core of formal language theory.

Theorem 3.1 (Kleene) Let A be an alphabet. The families of rational and regular languages over A coincide.

Beyond the closure properties included in the definition, rational languages are closed under the plus operation, intersection, complementation, reversal, morphism and inverse morphism.

Proposition 3.1 The following problems are decidable for rational languages: membership (in linear time), emptiness, finiteness, emptiness of the complement, finiteness of the complement, inclusion, equality.
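The linear-time membership test of Proposition 3.1 is simply the run of a deterministic automaton over the word; a minimal sketch, on a hypothetical DFA recognizing the words over {a, b} with an even number of b's:

```python
def dfa_accepts(delta, q0, finals, word):
    """Decide membership in time linear in |word| by running a
    deterministic finite-state automaton: delta maps (state, letter)
    to the next state."""
    q = q0
    for letter in word:
        if (q, letter) not in delta:
            return False            # missing transition: reject
        q = delta[(q, letter)]
    return q in finals

# Hypothetical DFA: state 0 = even number of b's read, state 1 = odd.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0}
print(dfa_accepts(delta, 0, {0}, "abba"))   # → True
print(dfa_accepts(delta, 0, {0}, "ab"))     # → False
```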

We recall a few basic facts about algebraic languages and push-down automata. See [HU79, Ber79] for an extensive introduction.

An algebraic grammar, a.k.a. context-free grammar, G = (A, V, P) consists of an alphabet A of terminal letters, an alphabet V of variables (also known as non-terminals) distinct from A, and a finite set P ⊆ V × (V ∪ A)* of productions.

When clear from the context, the alphabet is omitted from the grammar definition, and we write G = (V, P). A production (ξ, α) ∈ P is usually written in the form ξ → α, and if ξ → α₁, ξ → α₂, …, ξ → αₙ are productions of G having the same left-hand side ξ, they are grouped together using the notation ξ → α₁ | α₂ | ⋯ | αₙ.

Let A be an alphabet and let G = (V, P) be an algebraic grammar. We define the derivation relation as an extension of the production notation →:

f → g ⟺ ∃ξ ∈ V, ∃u, α, v ∈ (V ∪ A)* : ξ → α ∈ P ∧ f = uξv ∧ g = uαv.

Then, for any p ∈ ℕ, →ᵖ is the pth iteration of →, and →⁺ and →* are defined as usual.

In general, grammars are presented with a distinguished non-terminal S called the axiom. This allows us to define the language L_G generated by a grammar G = (V, P) by

L_G = {u ∈ A* : S →* u}.

A language L_G generated by some algebraic grammar G is an algebraic language, a.k.a. context-free language.

Most expected closure properties hold for algebraic languages, but not intersection. Indeed, algebraic languages are closed under union, concatenation, the star and plus operations, reversal, morphism, inverse morphism, and intersection with rational languages.

Although the most natural definition of algebraic languages comes from the grammar model, we prefer another representation in this work.

Given an alphabet A, a push-down automaton 𝒜 = (A*, Γ, γ₀, Q, I, F, E) consists of a stack alphabet Γ, a non-empty word γ₀ ∈ Γ⁺ called the initial stack word, a finite set Q of states, a set I ⊆ Q of initial states, a set F ⊆ Q of final states, and a finite set of transitions (a.k.a. edges) E ⊆ Q × A* × Γ × Γ* × Q.

The free monoid A* is often omitted for commodity, when clear from the context. A transition (q, x, g, γ, q′) ∈ E is usually written q → q′ with label x, g → γ; the finite-state automata vocabulary is inherited, and g is called the top stack symbol. An empty stack word is denoted by ε.

A configuration of a push-down automaton is a triple (u, q, γ), where u is the word to be read, q is the current state and γ ∈ Γ* is the word composed of the symbols in the stack. The transition between two configurations c₁ = (u₁, q₁, γ₁) and c₂ = (u₂, q₂, γ₂) is denoted by the relation ↦, defined by c₁ ↦ c₂ iff there exist (a, g, γ, γ′) ∈ A* × Γ × Γ* × Γ* such that

u₁ = au₂ ∧ γ₁ = γ′g ∧ γ₂ = γ′γ ∧ (q₁, a, g, γ, q₂) ∈ E.

Then ↦ᵖ with p ∈ ℕ, ↦⁺ and ↦* are defined as usual.

A push-down automaton 𝒜 = (Γ, γ₀, Q, I, F, E) is said to realize the language L by final state, when u ∈ L iff there exist (q_i, q_f, γ) ∈ I × F × Γ* such that

(u, q_i, γ₀) ↦* (ε, q_f, γ).

A push-down automaton 𝒜 = (Γ, γ₀, Q, I, F, E) is said to realize the language L by empty stack, when u ∈ L iff there exist (q_i, q_f) ∈ I × F such that

(u, q_i, γ₀) ↦* (ε, q_f, ε).

Notice that realization by empty stack implies realization by final state: q_f is still required to be in the set of final states.

Theorem 3.2 The family of languages realized by final state, or by empty stack, by push-down automata is the family of algebraic languages.

Unlike for finite-state automata, the deterministic property for push-down automata imposes some restrictions on the expressive power, and brings an interesting closure property. A push-down automaton is deterministic when it has a single initial state, every transition label is a single letter or ε, at most one transition may share the same departing state, label and top stack symbol, and a state with a departing ε-transition may not have departing labeled transitions.

It is straightforward that any algebraic language can be realized by a push-down automaton whose transition labels are either ε or a single letter. The family of languages realized by final state by deterministic push-down automata is called the family of deterministic algebraic languages. It should be noticed that this family is also known as LR(1) (which is equal to LR(k) for k ≥ 1) in the syntactical analysis framework [ASU86].

Proposition 3.2 The family of languages realized by empty stack by deterministic push-down automata is the family of deterministic algebraic languages with the prefix property.

Recall that a language L has the prefix property when a word uv belonging to L forbids u to belong to L, for all words u and non-empty words v. The interesting closure property is the following:

Proposition 3.3 The family of deterministic algebraic languages is closed under complementation.


However, closure of deterministic algebraic languages under union and intersection does not hold. Decidability of deterministic algebraic languages among algebraic ones is unknown, despite many attempts and related works [RS97a].

Proposition 3.4 The following problems are decidable for algebraic languages: membership, emptiness, finiteness. These additional problems are decidable for deterministic algebraic languages: membership in linear time, emptiness of the complement, finiteness of the complement. The following problems are undecidable for algebraic languages: being a rational language, emptiness of the complement, finiteness of the complement, inclusion (open problem for deterministic algebraic languages), equality (idem).

We conclude this section with a simple algebraic language example whose properties are frequently observed in our analysis framework [Coh99a]. The Łukasiewicz language Ł over an alphabet {a, b} is the language generated by the axiom ξ and the grammar with productions

ξ → aξξ | b.

The Łukasiewicz language is related to the Dyck languages [Ber79] and is the simplest of a family of languages constructed in order to write arithmetic expressions without parentheses (prefix or "polish" notation): the letter a represents a binary operation and b represents the operand. Indeed, the first words of Ł are

b, abb, aabbb, ababb, aaabbbb, aababbb, …

Proposition 3.5 Let w ∈ {a, b}*. Then w ∈ Ł iff |w|_a − |w|_b = −1 and |u|_a − |u|_b ≥ 0 for any proper left factor u of w (i.e. ∃v ∈ {a, b}⁺ : w = uv). Moreover, if w, w′ ∈ Ł, then

|ww′|_a − |ww′|_b = (|w|_a − |w|_b) + (|w′|_a − |w′|_b).

This implies that Ł has the prefix property, see [Ber79] for details. A graphical representation may help to understand the previous proposition and the properties of Ł intuitively: drawing the graph of the function u ↦ |u|_a − |u|_b as u ranges over the left factors of w = aabaabbabbabaaabbb yields Figure 3.1.a.
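The characterization of Proposition 3.5 translates directly into a linear-time membership test; a minimal sketch in Python:

```python
def lukasiewicz(w):
    """Membership test for the Lukasiewicz language over {a, b}, following
    Proposition 3.5: every proper left factor u satisfies |u|_a - |u|_b >= 0,
    and the whole word satisfies |w|_a - |w|_b = -1."""
    diff = 0
    for letter in w:
        if diff < 0:                # a proper left factor went negative
            return False
        diff += 1 if letter == "a" else -1
    return diff == -1

print([w for w in ("b", "abb", "aabbb", "ababb", "ab", "ba", "")
       if lukasiewicz(w)])          # → ['b', 'abb', 'aabbb', 'ababb']
```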

Finally, Figure 3.1.b shows a push-down automaton which realizes the Łukasiewicz language by empty stack. It has a single state, which is both initial and final, and a single stack symbol I. The initial stack word is also I; it is denoted by →I on the initial state. The push-down automaton in Figure 3.1.c realizes Ł by final state. Two states are necessary, as well as two stack symbols Z and I, the initial stack word being Z.

Important remark. In the following, every push-down automaton will implicitly accept words by final state.

An interesting sub-class of algebraic languages is called the class of one-counter languages. It is defined through push-down automata. A classical definition is the following: a push-down automaton is a one-counter automaton if its stack alphabet contains only one letter.


[Figure 3.1: Studying the Łukasiewicz language. (a) Evolution of the occurrence count difference |u|_a − |u|_b along the left factors of w. (b) A push-down automaton accepting by empty stack: a single state with initial stack word I and transitions a, I → II and b, I → ε. (c) A push-down automaton accepting by final state: two states, initial stack word Z, transitions a, Z → ZI; a, I → II; b, I → ε, and an ε, Z → Z transition to the final state.]

However, we prefer a definition which is more suitable to our practical usage of one-counter languages. This definition is a bit more technical.

Definition 3.2 (one-counter automaton and language) A push-down automaton is a one-counter automaton if its stack alphabet contains three letters, Z (for "zero"), I (for "increment") and D (for "decrement"), and if the stack word belongs to the (rational) set ZI* + ZD*. An algebraic language is a one-counter language if it is realized by a one-counter automaton (by final state).

It is easy to show that Definition 3.2 describes the same family of languages as the preceding classical definition: the idea is to replace all stack symbols by I and to "remember" the original symbol in the state name. Intuitively, if n is a positive integer, the stack word ZIⁿ stands for the counter value n, the stack word ZDⁿ stands for the counter value −n, and the stack word Z stands for the counter value 0.

The family of one-counter languages is strictly included in the family of algebraic languages, and appears as a natural abstraction in our program analysis framework. The Łukasiewicz language is a simple example of a one-counter language; Figure 3.2 shows a one-counter automaton realizing it. This example introduces specific notations to simplify the presentation of one-counter automata:

→n stands for initialization of the stack word to ZIⁿ if n is positive, ZD⁻ⁿ if n is negative, and Z if n is equal to zero;

+n for n ≥ 0 stands for pushing Iⁿ onto the stack if the stack word is in ZI*, and if


the stack word is ZDᵏ it stands for removing min(n, k) symbols and then, if n > k, pushing I^{n−k} back onto the stack;

+n for n < 0 stands for −(−n);

−n for n ≥ 0 stands for pushing Dⁿ onto the stack if the stack word is in ZD*, and if the stack word is ZIᵏ it stands for removing min(n, k) symbols and then, if n > k, pushing D^{n−k} back onto the stack;

−n for n < 0 stands for +(−n);

=0 stands for testing if the top stack symbol is Z;

≠0 stands for testing if the top stack symbol is not Z;

>0 stands for testing if the top stack symbol is I;

<0 stands for testing if the top stack symbol is D;

≥0 stands for testing if the top stack symbol is Z or I;

≤0 stands for testing if the top stack symbol is Z or D.

These operations are the only available means to check and update the counter. Moreover, tests can be applied before additions or subtractions: <0; −1 stands for allowing the transition and decrementing the counter when the counter is negative, and ε; +1 stands for incrementing the counter in all cases. See also the transition labeled by b in Figure 3.2.

The general form of a one-counter automaton is thus (A*, c₀, Q, I, F, E), where A is an alphabet (omitted when clear from the context), c₀ is the initial value of the counter, and E ⊆ Q × A* × {ε, =0, ≠0, >0, <0, ≥0, ≤0} × ℤ × Q.

[Figure 3.2: A one-counter automaton realizing the Łukasiewicz language (by final state): the counter is initialized to 1 (→1) on state 1, which carries the self-transitions a; +1 and b; >0; −1; a transition ε; =0 leads to the final state 2.]
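Under the notations above, such a one-counter automaton can be simulated with a plain integer counter; a minimal sketch, whose transition structure is assumed from the description of Figure 3.2:

```python
def one_counter_lukasiewicz(word):
    """Simulate a one-counter automaton for the Lukasiewicz language:
    counter initialized to 1; 'a' increments (+1); 'b' requires a positive
    counter (>0) and decrements (-1); accept via the epsilon-transition
    taken when the counter is zero (=0) at the end of the input."""
    counter = 1                     # initial stack word ZI, i.e. value 1
    for letter in word:
        if letter == "a":
            counter += 1            # a; +1
        elif letter == "b" and counter > 0:
            counter -= 1            # b; >0; -1
        else:
            return False            # no enabled transition
    return counter == 0             # epsilon; =0 reaches the final state

assert all(one_counter_lukasiewicz(w) for w in ("b", "abb", "aabbb", "ababb"))
assert not any(one_counter_lukasiewicz(w) for w in ("", "a", "ab", "ba"))
```

The counter after reading a prefix u is 1 + |u|_a − |u|_b, so the simulation agrees with the characterization of Proposition 3.5.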

After this short presentation of one-counter languages, one would expect a generalization to multi-counter languages, also called Minsky machines [Min67]. The general form of an n-counter automaton is (A*, c₀¹, …, c₀ⁿ, Q, I, F, E), where c₀ᵏ is the initial value of the kth counter and E is defined on the product of all stacks. However, it has been shown that two-counter automata have the same expressive power as Turing machines, which is a stronger result than the well-known equivalence of Turing machines and two-stack automata. Most interesting questions thus become undecidable for multi-counter languages. However, a few additional restrictions on this family of languages have recently

3.3. RATIONAL RELATIONS 97

been proven to enable several decidability results, as for the emptiness problem. Studying the applicability of these new results to our program analysis framework is left for future work; the most interesting applications would probably arise from the work by Comon and Jurski [CJ98].

We start with the definition and basic properties of recognizable and rational relations, then introduce the machines realizing rational transductions. After studying some examples, we review decision problems and closure properties. This section recalls classical results, see [Eil74, Ber79, AB88] for details.

We recall the definition and a useful characterization of recognizable sets in finitely generated monoids.

Definition 3.3 (recognizable set) Let M be a monoid. A subset R of M is a recognizable set if there exist a finite monoid N, a morphism μ from M to N and a subset P of N such that R = μ⁻¹(P).

Recognizable sets can be seen as a generalization of rational (a.k.a. regular) languages to non-free monoids which preserves the structure of boolean algebra:

Proposition 3.6 Let M be a monoid; both ∅ and M are recognizable sets in M. Recognizable sets are closed under union, intersection and complementation.
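As a concrete instance of Definition 3.3, the language of words over {a, b} with an even number of b's is recognized through a morphism onto the two-element monoid (ℤ/2ℤ, +); a minimal sketch (the encoding of the finite monoid as Python integers is an illustrative choice):

```python
def recognizes(mu_letter, op, unit, P, word):
    """Recognizability test following Definition 3.3: map a word into a
    finite monoid through the morphism induced by mu_letter, then check
    membership of the image in the accepting subset P."""
    image = unit
    for letter in word:
        image = op(image, mu_letter[letter])
    return image in P

# Finite monoid Z/2Z with addition; the morphism counts b's modulo 2.
mu = {"a": 0, "b": 1}
even_bs = lambda w: recognizes(mu, lambda x, y: (x + y) % 2, 0, {0}, w)

assert even_bs("abba") and even_bs("")
assert not even_bs("ab")
```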

Although recognizable sets are closed under concatenation, they are not closed under the star operation. This is, however, the case for rational sets, which extend recognizable ones. Their definition is borrowed from rational languages:

Definition 3.4 (rational set) Let M be a monoid. The family of rational sets in M is the least family of subsets of M containing ∅ and the singletons {m} ⊆ M, closed under union, concatenation and the star operation.

However, rational sets are not closed under complementation and intersection in general.

When there are two monoids M₁ and M₂ such that M = M₁ × M₂, a recognizable subset of M is called a recognizable relation. The following result describes the "structure" of recognizable relations.

Theorem 3.3 (Mezei) A recognizable relation R in M₁ × M₂ is a finite union of sets of the form K × L where K (resp. L) is a rational set of M₁ (resp. M₂).

When there are two monoids M₁ and M₂ such that M = M₁ × M₂, a rational subset of M is called a rational relation. In the following, we will only consider recognizable or rational sets which are relations between finitely generated monoids.


A theorem due to Nivat characterizes rational relations by means of rational languages and monoid morphisms. (The formulation is slightly different from the original theorem by Nivat, see [Ber79] for details.)

Theorem 3.4 (Nivat) Let M and M′ be two monoids. Then R is a rational relation over M and M′ iff there exist an alphabet A, two morphisms μ : A* → M and μ′ : A* → M′, and a rational language K ⊆ A* such that

R = {(μ(h), μ′(h)) : h ∈ K}.

We recall here a "more functional" view of recognizable and rational relations. From a relation R over M₁ and M₂, we define a transduction τ from M₁ into M₂ as a function from M₁ into the set P(M₂) of subsets of M₂, such that v ∈ τ(u) iff uRv. For commodity, τ may also be extended to a mapping from P(M₁) to P(M₂), and we write τ : M₁ → M₂.

A transduction τ : M₁ → M₂ is recognizable (resp. rational) iff its graph is a recognizable (resp. rational) relation over M₁ and M₂. Both recognizable and rational transductions are closed under inversion (i.e. relational symmetry).

In the next sections, we use either relations or transductions, depending on the context. The family we will study lies somewhere between recognizable and rational relations; it retains the boolean algebra structure and the closure under composition.

The following result, due to Elgot and Mezei [EM65, Ber79], is restricted to free monoids.

Theorem 3.5 (Elgot and Mezei) If A, B and C are alphabets, and τ₁ : A* → B* and τ₂ : B* → C* are rational transductions, then τ₂ ∘ τ₁ : A* → C* is a rational transduction.

Nivat's theorem can be rewritten for rational transductions:

Theorem 3.6 (Nivat) Let M and M′ be two monoids. Then τ : M → M′ is a rational transduction iff there exist an alphabet A, two morphisms μ : A* → M and μ′ : A* → M′, and a rational language K ⊆ A* such that

∀m ∈ M : τ(m) = μ′(μ⁻¹(m) ∩ K).

These two theorems are key results for dependence analysis and dependence testing, see Chapter 4.

The "mechanical" representations of rational relations and transductions are called rational transducers; they extend finite-state automata in a very natural way:

Definition 3.5 (rational transducer) A rational transducer T = (M₁, M₂, Q, I, F, E) consists of an input monoid M₁, an output monoid M₂, a finite set of states Q, a set of initial states I ⊆ Q, a set of final states F ⊆ Q, and a finite set of transitions (a.k.a. edges) E ⊆ Q × M₁ × M₂ × Q.

Monoids M₁ and M₂ are often omitted for commodity, when clear from the context: we write T = (Q, I, F, E). Since we only consider finitely generated monoids, the transitions of a transducer can equivalently be chosen in Q′ × (G₁ ∪ {1_{M₁}}) × (G₂ ∪ {1_{M₂}}) × Q′, where G₁ (resp. G₂) is a set of generators of M₁ (resp. M₂) and Q′ is some set of states larger than Q.


Most of the time, we will be dealing with free monoids, i.e. languages; the empty word is then the neutral element and is denoted by ε.

A path is a word (p₁, x₁, y₁, q₁) ⋯ (pₙ, xₙ, yₙ, qₙ) in E* such that qᵢ = pᵢ₊₁ for all i ∈ {1, …, n − 1}, and (x₁ ⋯ xₙ, y₁ ⋯ yₙ) is called the label of the path. A transducer is trim when all its states are accessible and may be part of an accepting path.

The transduction |T| realized by a rational transducer T is defined by g ∈ |T|(f) iff (f, g) labels an accepting path of T. It is a consequence of Kleene's theorem that a subset of M₁ × M₂ is a rational relation iff it is recognized by a rational transducer:

Proposition 3.7 A transduction is rational iff it is realized by a rational transducer.

Let us now present decidability and undecidability results for rational relations.

Theorem 3.7 The following problems are decidable for rational relations: whether two words are in relation (in linear time), emptiness, finiteness.

However, most other usual questions are undecidable for rational relations.

Theorem 3.8 Let R, R′ be rational relations over alphabets A and B with at least two letters. It is undecidable whether R ∩ R′ = ∅, R ⊆ R′, R = R′, R = A* × B*, (A* × B*) − R is finite, R is recognizable.

A few questions may become decidable when replacing A* and B* by some particular finitely generated monoids, but this is not the case in general.

The following definition will be useful in some technical discussions and proofs below. It formalizes the fact that a rational transducer can be interpreted as a finite-state automaton on a more complex alphabet. But beware: both interpretations have different properties in general.

Definition 3.6 Let T be a rational transducer over alphabets A and B. The finite-state automaton interpretation of T is a finite-state automaton 𝒜 over the alphabet (A × B) ∪ (A × {ε}) ∪ ({ε} × B) defined by the same states, initial states, final states and transitions.

We need a few results about rational transductions that are partial functions.

Definition 3.7 (rational function) Let M₁ and M₂ be two monoids. A rational function τ : M₁ → M₂ is a rational transduction which is a partial function, i.e. such that Card(τ(u)) ≤ 1 for all u ∈ M₁.

Most classical results about rational functions suppose that M₁ and M₂ are free monoids, but we will see a result about composition of rational functions over non-free monoids in Section 3.5. In the following, however, M₁ and M₂ will be free monoids.

Given two alphabets A and B, it is decidable whether a rational transduction from A* into B* is a partial function. However, the first algorithm, by Schützenberger, was exponential [Ber79]. The following result by Blattner and Head [BH77] shows that it is decidable in polynomial time.

Theorem 3.9 It is decidable in O(Card(Q)⁴) whether a rational transducer whose set of states is Q implements a rational function.


Theorem 3.10 Given two rational functions f and f′ from A* to B*, it is decidable whether f ⊆ f′ and whether f = f′.

Among transducers realizing rational functions, we are especially interested in transducers whose output can be "computed online" with their input. Our interpretation of "online computation" is the following: it requires that when a path e leading to a state q is labeled by a pair of words (u, v), and when a letter x is read, there is only one state q′ and one output letter y such that (ux, vy) labels a path prefixed by e. This is best understood using the following definitions.

Definition 3.8 (input and output automata) The input automaton (resp. output automaton) of a transducer is obtained by omitting the output label (resp. input label) of each transition.

Definition 3.9 (sequential transducer) Let A and B be two alphabets. A sequential transducer is labeled in A × B* and its input automaton is deterministic (which enforces that it has a single initial state).

A sequential transducer obviously realizes a rational function; and a function is sequential if it can be realized by a sequential transducer. The transducer example in Figure 3.3.a, whose initial state is 1, is sequential. It replaces with an a every b which appears after an odd number of b's.

[Figure 3.3: Sequential (a) and sub-sequential (b) transducers.]

Note that if σ is a sequential function and σ(ε) is defined, then σ(ε) = ε. Moreover, when all the states of a sequential transducer are final, the function it realizes is prefix closed, i.e. if uv belongs to its domain then so does u.² To a sequential transducer T = (A, B*, Q, I, F, E), one may associate a "next state" function δ : Q × A → Q and a "next output" function η : Q × A → B* whose purpose is self-explanatory. Together with the set F of final states, the functions δ and η are indeed an equivalent characterization of T.
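Such a (δ, η) presentation is easy to execute; a minimal sketch, with a transducer consistent with the description of Figure 3.3.a (the state names, transition table and outputs are assumptions read off that description):

```python
def run_sequential(delta, eta, q0, finals, word):
    """Run a sequential transducer given by its 'next state' function delta
    and 'next output' function eta; return the output word, or None when
    the input is not in the domain."""
    q, out = q0, ""
    for letter in word:
        if (q, letter) not in delta:
            return None             # input automaton blocks: undefined
        out += eta[(q, letter)]
        q = delta[(q, letter)]
    return out if q in finals else None

# Replaces with 'a' every b appearing after an odd number of b's:
# state 1 = even number of b's read so far, state 2 = odd.
delta = {(1, "a"): 1, (1, "b"): 2, (2, "a"): 2, (2, "b"): 1}
eta = {(1, "a"): "a", (1, "b"): "b", (2, "a"): "a", (2, "b"): "a"}
print(run_sequential(delta, eta, 1, {1, 2}, "abbb"))   # → abab
```

Since all states are final here, the realized function is prefix closed, as noted above.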

However, the sequential transducer definition is a bit too restrictive regarding our "online computation" property, and we prefer the following extension.

Definition 3.10 (sub-sequential transducer) If A and B are two alphabets, a sub-sequential transducer (T, ρ) over A × B* is a pair composed of a sequential transducer

² In [Ber79, Eil74], all states of a sequential transducer are final.

3.4. LEFT-SYNCHRONOUS RELATIONS 101

T and a function ρ : F → B*. The function σ realized by (T, ρ) is defined as follows: let u be a word in A*; the value σ(u) is defined iff there is an accepting path in T labeled by (u|v) and leading to a final state q; in this case σ(u) = vρ(q).

In other words, the function ρ is used to append a word to the output at the end of the computation. A sub-sequential transducer obviously realizes a rational function; and a function is sub-sequential if it can be realized by a sub-sequential transducer. A sequential function is sub-sequential: consider ρ(q) = ε for all final states q.

This definition matches our "online computation" property. The function realized by the sub-sequential transducer in Figure 3.3.b appends to each word its last letter. This function is not sequential because all its states are final and it is not prefix closed.

The following result has been proven by Choffrut in [Cho77].

Theorem 3.11 It is decidable whether a function realized by a transducer is sub-sequential, and it is decidable whether a sub-sequential function is sequential.

Béal and Carton [BC99b] give two polynomial-time algorithms to decide if a rational function is sub-sequential, and if a sub-sequential function is sequential. Two algorithms to build a sub-sequential realization and a sequential realization are also provided, but the first may generate an exponential number of states; as a result, this does not provide a polynomial-time algorithm to decide if a rational function is sequential.

Before we conclude this section, notice that the "online computation" property satisfied by sub-sequential transducers is still satisfied by a larger class of rational functions:

Definition 3.11 (online rational transducer) A rational transducer is online if it is a rational function and if its input automaton is deterministic. A rational transduction is online if it is realized by an online rational transducer.

The only difference with respect to sub-sequential transducers is that ε is allowed in the input automaton, as long as the deterministic property is kept. We are not aware of any result for this class of rational functions, strictly larger than the class of sub-sequential transductions. But if it were decidable among rational functions, it would probably replace every use of sub-sequential functions in the following applications.

In our analysis and transformation framework, we will only use rational and sub-sequential functions, which are decidable in polynomial time among rational transductions.

We have seen that rational relations are not closed under intersection, but intersection is critical for dependence analysis. Addressing the undecidable problem of testing whether the intersection of two rational relations is empty or not, Feautrier designed a "semi-algorithm" for dependence testing which may not terminate [Fea98]. Because we would like to effectively compute the intersection, and not only test its emptiness, our approach is different: we are looking for a sub-class of rational relations with a boolean algebra structure (i.e. with union, intersection and complementation).

Indeed, the class of recognizable relations is a boolean algebra, but we have found a more expressive one: the class of left-synchronous relations. We will show that left-synchronous relations are not decidable among rational ones, but we could define a precise


this point is even more interesting for us than decidability. Many results presented here

have already been published by Frougny and Sakarovitch in [FS93]. However, our work

has been done independently and based on a dierent|more intuitive and versatile|

representation of transductions. Proofs are all new, and several unpublished results have

also been discovered.

Notice that a larger class with a boolean algebra structure is the class of deterministic relations defined by Pelletier and Sakarovitch [PS98]. But some interesting decidability properties are lost, and we could not define any precise approximation algorithm for this class; see Section 3.4.7.

This work has been done in collaboration with Olivier Carton (University of Marne-la-Vallée).

3.4.1 Definitions

We recall the definition of synchronous transducers:3

Definition 3.12 (synchronism) A rational transducer on alphabets A and B is synchronous if it is labeled on A × B.

A rational relation or transduction is synchronous if it can be realized by a synchronous transducer. A rational transducer is synchronizable if it realizes a synchronous relation.

Obviously, such a transducer is length preserving; Eilenberg and Schützenberger [Eil74] showed that the converse is true: a length-preserving rational transduction is realized by a synchronous transducer.

A first extension of the synchronous property is the ε-synchronous one:

Definition 3.13 (ε-synchronism) A rational transducer on alphabets A and B is ε-synchronous if every transition appearing in a cycle of the transducer's graph is labeled on A × B.

A rational relation or transduction is ε-synchronous if it can be realized by an ε-synchronous transducer. A rational transducer is ε-synchronizable if it realizes an ε-synchronous relation.

Such a transducer has a bounded length difference; Frougny and Sakarovitch [FS93] showed that the converse is true: a rational transduction of bounded length difference is realized by an ε-synchronous transducer. Obviously, the bound is 0 when the transducer is synchronous. Two examples are shown in Figure 3.4. They respectively realize {(u, v) ∈ {a,b}* × {a,b}* : u = v} and {(u, v) ∈ {a,b}* × {c}* : |u|_a = |v|_c ∧ |u|_b = 2}.
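To make these definitions concrete, here is a minimal Python sketch of the synchronism test on transition labels; the representation of transducers as sets of (state, input, output, state) tuples is our own assumption, not the thesis's, and "" stands for the empty word ε.

```python
EPS = ""  # the empty word epsilon

def is_synchronous(edges):
    """A transducer is synchronous (Definition 3.12) when every transition
    reads exactly one input letter and writes exactly one output letter."""
    return all(len(x) == 1 and len(y) == 1 for (_, x, y, _) in edges)

# Transducer realizing {(u, u) : u in {a,b}*}: one state, loops a|a and b|b.
equality = {(1, "a", "a", 1), (1, "b", "b", 1)}
# A transducer with an epsilon move on a cycle is not synchronous.
padding = {(1, "a", EPS, 1)}
```

The ε-synchronism test of Definition 3.13 would additionally restrict this check to transitions lying on cycles.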

Then, we define two new extensions:

Definition 3.14 (left-synchronism) A rational transducer over alphabets A and B is left-synchronous if it is labeled on (A × B) ∪ (A × {ε}) ∪ ({ε} × B) and only transitions labeled on A × {ε} (resp. {ε} × B) may follow transitions labeled on A × {ε} (resp. {ε} × B).

A rational relation or transduction is left-synchronous if it is realized by a left-synchronous transducer. A rational transducer is left-synchronizable if it realizes a left-synchronous relation.

3 It appears to be a special case of (k, l)-synchronous transducers, where k = l = 1; see Section 3.4.7.

3.4. LEFT-SYNCHRONOUS RELATIONS 103

[Two transducer diagrams omitted.]

. . . . . . . . . . . . . . . Figure 3.4. Synchronous and ε-synchronous transducers . . . . . . . . . . . . . . .

Definition 3.15 (right-synchronism) A rational transducer over alphabets A and B is right-synchronous if it is labeled on (A × B) ∪ (A × {ε}) ∪ ({ε} × B) and only transitions labeled on A × {ε} (resp. {ε} × B) may precede transitions labeled on A × {ε} (resp. {ε} × B).

A rational relation or transduction is right-synchronous if it can be realized by a right-synchronous transducer. A rational transducer is right-synchronizable if it realizes a right-synchronous relation.

Figure 3.5 shows left-synchronous transducers over an alphabet A realizing two orders (a.k.a. orderings), where <txt is some order on A: the prefix order f <pre g ⟺ ∃h ∈ A* : g = fh, and the lexicographic order f <lex g ⟺ f <pre g ∨ (∃u, v, w ∈ A*, a, b ∈ A : f = uav ∧ g = ubw ∧ a <txt b).
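The two orders realized by the transducers of Figure 3.5 can also be stated directly on words; here is a minimal Python sketch (the function names and the proper-prefix convention are our own assumptions):

```python
def prefix_lt(f, g):
    """f <pre g: f is a proper prefix of g, i.e. g = f h for some h != ''."""
    return len(f) < len(g) and g.startswith(f)

def lex_lt(f, g, letter_lt=lambda a, b: a < b):
    """f <lex g: f is a proper prefix of g, or f = uav and g = ubw with
    a smaller than b for the letter order (here Python's default order)."""
    if prefix_lt(f, g):
        return True
    for a, b in zip(f, g):
        if a != b:
            return letter_lt(a, b)
    return False
```

These orders are exactly the ones used later to sequence statement instances named by control words.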

In the following transducers, labels x and y stand for all x ∈ A and all y ∈ A respectively.

[Two transducer diagrams omitted: Figure 3.5.a. Prefix order; Figure 3.5.b. Lexicographic order.]

. . . . . . . . . . Figure 3.5. Left-synchronous realization of several order relations . . . . . . . . . .

The word-reversal operation converts a left-synchronous transducer into a right-synchronous one and conversely.4 The two definitions are not contradictory: some relations are both left and right synchronous, such as the synchronous ones.

4 Recognizable, synchronous and ε-synchronous relations are closed under word-reversal.


Figure 3.6 shows a transducer realizing the relation τ = {(u, v) ∈ A* × B* : |u| ≡ |v| mod 2}. It is neither left-synchronous nor right-synchronous, but the left-synchronous and right-synchronous realizations in the same figure show that τ is both left and right synchronous.

In the three following transducers, labels x and y stand for all x ∈ A and all y ∈ B.

[Three transducer diagrams omitted: a left-synchronous one, a left and right synchronizable one, and a right-synchronous one.]

. . . . . . . . . . . . . . . . . Figure 3.6. A left and right synchronizable example . . . . . . . . . . . . . . . . .

In the following we mostly consider left-synchronous transducers, because all results extend to right-synchronous ones through the word-reversal operation, and most interesting transducers are left-synchronous.

It is well known that synchronous and ε-synchronous relations are closed under union, complementation and intersection. We show that the same holds for left-synchronous relations.

Lemma 3.1 (Union) The class of left-synchronous relations is closed under union.

Proof: Let T = (Q, I, F, E) and T′ = (Q′, I′, F′, E′) be left-synchronous transducers. Q and Q′ can be supposed disjoint without loss of generality; then (Q ∪ Q′, I ∪ I′, F ∪ F′, E ∪ E′) realizes |T| ∪ |T′|.

The proof is constructive: given two left-synchronous realizations, one may compute a left-synchronous realization of the union.
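The disjoint-union construction of Lemma 3.1 can be sketched in a few lines of Python, assuming (our convention, not the thesis's) that a transducer is a quadruple (states, initials, finals, edges):

```python
def union(t1, t2):
    """Union of Lemma 3.1: rename states apart, then take componentwise
    unions of states, initial states, final states and transitions."""
    q1, i1, f1, e1 = t1
    q2, i2, f2, e2 = t2
    ren1 = {q: (1, q) for q in q1}   # force disjoint state sets
    ren2 = {q: (2, q) for q in q2}
    e1r = {(ren1[p], x, y, ren1[q]) for (p, x, y, q) in e1}
    e2r = {(ren2[p], x, y, ren2[q]) for (p, x, y, q) in e2}
    return (set(ren1.values()) | set(ren2.values()),
            {ren1[q] for q in i1} | {ren2[q] for q in i2},
            {ren1[q] for q in f1} | {ren2[q] for q in f2},
            e1r | e2r)
```

Left-synchronism is preserved because no new path mixes transitions of the two operands.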

Here is a direct application:

Theorem 3.12 Recognizable relations are left-synchronous.

Proof: Let R be a recognizable relation in A* × B*. From Theorem 3.3, there exist an integer n and rational languages A1, ..., An ⊆ A* and B1, ..., Bn ⊆ B* such that R = A1 × B1 ∪ ··· ∪ An × Bn. Let i ∈ {1, ..., n}, let AA = (QA, IA, FA, EA) be a finite-state automaton accepting Ai, and let AB = (QB, IB, FB, EB) be one accepting Bi. We suppose QA and QB are disjoint sets (without loss of generality) and define a transducer T = (Q, I, F, E), where Q = (QA × QB) ∪ QA ∪ QB, I = IA × IB, F = FA × FB ∪ FA ∪ FB, and E is defined as follows:

1. Every transition qA -x-> qA′ in EA yields a transition qA -x|ε-> qA′ in E, and every transition qB -y-> qB′ in EB yields a transition qB -ε|y-> qB′ in E;

2. If qA -x-> qA′ ∈ EA and qB -y-> qB′ ∈ EB, then (qA, qB) -x|y-> (qA′, qB′) ∈ E;

3. If qA (resp. qB) is a final state and qB -y-> qB′ ∈ EB (resp. qA -x-> qA′ ∈ EA), then (qA, qB) -ε|y-> qB′ ∈ E (resp. (qA, qB) -x|ε-> qA′ ∈ E).

By construction, T is left-synchronous, its input is Ai and its output is Bi. Moreover, it accepts any combination of input words in Ai and output words in Bi. Lemma 3.1 terminates the proof.

The proof is constructive: given a decomposition of a recognizable relation into products of rational languages, one may build a left-synchronous transducer.

Another application is this useful decomposition result for left-synchronous relations:

Proposition 3.8 Any left-synchronous relation can be decomposed into a union of relations of the form SR, where S is synchronous and R has either no input or no output (R is thus recognizable).

Proof: Consider a relation U ⊆ A* × B* realized by a left-synchronous transducer T, and consider an accepting path e in T. The restriction of T to the states and transitions in e yields a transducer Te, such that |Te| ⊆ |T|. Moreover, Te can be divided into transducers Ts and Tr, such that the (unique) final state of the first is the (unique) initial state of the second, Ts is synchronous and Tr has either no input or no output. Therefore, Te realizes a left-synchronous relation of the form SR, where S is synchronous and R has either no input or no output. Since the number of "restricted" transducers Te is finite, closure under union terminates the proof.

The proof is constructive if the left-synchronous relation to be decomposed is given by a left-synchronous realization.

To study complementation and intersection, we need two more definitions: unambiguity and completion.

Definition 3.16 (unambiguity) A rational transducer T over A and B is unambiguous if any pair of words over A and B labels at most one path in T. A rational relation is unambiguous if it is realized by an unambiguous transducer.

This definition coincides with the one in [Ber79], Section IV.4, for rational functions, but differs for general rational transductions.

Definition 3.17 (completion) A rational transducer T is complete if every pair of words labels at least one path in T (accepting or not).

It is obviously not always possible to complete a transducer into a trim one. From these two definitions, let us recall a very general result.

Theorem 3.13 The class of complete unambiguous rational relations is closed under complementation.

Proof: Let R be a complete unambiguous relation realized by a transducer T = (Q, I, F, E). We define a transducer T′ = (Q, I, Q − F, E), such that an accepting path in T cannot be an accepting path of T′. The completeness of T and the uniqueness of accepting paths in T show that the complement of R is realized by T′.
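On a complete and unambiguous transducer, the proof of Theorem 3.13 boils down to swapping final and non-final states; a minimal sketch, with our own quadruple representation (states, initials, finals, edges):

```python
def complement(t):
    """Complement per Theorem 3.13.  Only valid when the transducer is
    complete (every pair labels some path) and unambiguous (at most one
    path per pair); otherwise swapping finals does not yield the
    complement of the realized relation."""
    states, initials, finals, edges = t
    return (states, initials, states - finals, edges)
```

The two hypotheses matter: on an incomplete transducer some pairs label no path at all and would be rejected by both T and T′.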


Now, we specialize this result for left-synchronous relations.

Lemma 3.2 A left-synchronous relation is realized by an unambiguous left-synchronous transducer.

Proof: Let T be a left-synchronous transducer over A and B realizing a relation R. Let A be the finite-state automaton interpretation of T (over the alphabet (A × B) ∪ (A × {ε}) ∪ ({ε} × B)) and let A′ be a deterministic finite-state automaton accepting the same language as A. Let f and g be two words such that g ∈ |T|(f), and let e and e′ be two accepting paths for (f, g) in the rational transducer interpretation T′ of A′.

Suppose e differs from e′. By the determinism property, the words w and w′ they accept in A′ also differ; let (x, y) and (x′, y′) be their first difference. If x = ε and x′ ≠ ε, the definition of left-synchronous transducers imposes that w be labeled on {ε} × B after (x, y); then e and e′ accept different inputs. The same reasoning applies to the three other cases (y = ε and y′ ≠ ε, x′ = ε and x ≠ ε, y′ = ε and y ≠ ε) and yields different inputs or outputs for paths e and e′. This contradicts the definition of e and e′.

Thus f and g are accepted by a unique path in T′. Since A′ is the determinization of A, a transition labeled on A × {ε} (resp. {ε} × B) may only be followed by another transition labeled on A × {ε} (resp. {ε} × B). Finally, T′ is unambiguous and left-synchronous, and it realizes R.

The proof is constructive.

Proposition 3.9 A left-synchronous relation is realized by a complete unambiguous left-synchronous transducer.

Proof: Let R be a left-synchronous relation. We use Lemma 3.2 to compute an unambiguous left-synchronous transducer T = (Q, I, F, E) which realizes R. We define a transducer T′ = (Q′, I, F, E′), where qi, qo and qio are three new states, Q′ = Q ∪ {qi, qo, qio}, and E′ is defined as follows:

1. All transitions in E are also in E′.

2. For all (x, y) ∈ A × B: qio -x|y-> qio ∈ E′.

3. For all x ∈ A: qio -x|ε-> qi ∈ E′ and qi -x|ε-> qi ∈ E′.

4. For all y ∈ B: qio -ε|y-> qo ∈ E′ and qo -ε|y-> qo ∈ E′.

5. If q ∈ Q is such that ∀(x′, q′) ∈ A × Q: q′ -x′|ε-> q ∉ E, then ∀(y′′, q′′) ∈ B × Q: q -ε|y′′-> q′′ ∉ E ⟹ q -ε|y′′-> qo ∈ E′.

6. If q ∈ Q is such that ∀(y′, q′) ∈ B × Q: q′ -ε|y′-> q ∉ E, then ∀(x′′, q′′) ∈ A × Q: q -x′′|ε-> q′′ ∉ E ⟹ q -x′′|ε-> qi ∈ E′.

7. If q ∈ Q is such that ∀(x′, q′) ∈ A × Q: q′ -x′|ε-> q ∉ E and ∀(y′, q′) ∈ B × Q: q′ -ε|y′-> q ∉ E, then ∀(x′′, y′′, q′′) ∈ A × B × Q: q -x′′|y′′-> q′′ ∉ E ⟹ q -x′′|y′′-> qio ∈ E′.

By construction, T′ is complete and left-synchronous. Moreover, the three last cases have been carefully designed to preserve the unambiguity property: no transition departing from a state q is added if its label is already the one of an existing transition departing from q.

The proof is constructive.

Theorem 3.14 (Complementation and Intersection) The class of left-synchronous relations is closed under complementation and intersection.

Proof: As a corollary of Theorem 3.13 and Proposition 3.9, we have closure under complementation. Together with closure under union, this proves closure under intersection.

We have thus proven that the class of left-synchronous relations is a boolean algebra, which will be of great help for dependence and reaching definition analysis; see Section 4.3.

Synchronous and ε-synchronous relations are obviously closed under concatenation, but this is not true of left-synchronous ones. However, we have the following result:

Proposition 3.10 Let S, T and R be rational relations.

(i) If S is synchronous and T is left-synchronous, then ST is left-synchronous.

(ii) If T is left-synchronous and R is recognizable, then TR is left-synchronous.

Proof: The proof of (i) is a straightforward application of the definition of left-synchronous transducers (see Proposition 3.12 for a generalization).

For (ii), we use Proposition 3.8 to decompose T into S1R1 ∪ ··· ∪ SnRn, where Si is synchronous and Ri is recognizable for all 1 ≤ i ≤ n. Now, RiR is recognizable, hence left-synchronous from Theorem 3.12. Application of (i) shows that SiRiR is left-synchronous. Closure under union terminates the proof of (ii).

The proof is constructive when a left-synchronous realization of T is provided, thanks to Proposition 3.8. A generalization of (i) is given in Section 3.4.5.

To close this section about algebraic properties, one should notice that the finite-state automaton interpretation (see Definition 3.6) of a left-synchronous transducer T has exactly the same properties as T itself, regarding computation of the complementation and intersection. Indeed, by definition of left-synchronous relations, applying classical algorithms from automata theory to the finite-state automaton interpretation yields correct results on the transducer. This remark shows that algebraic operations on left-synchronous transducers have the same complexity as on finite-state automata in general.
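As an illustration of this remark, intersection can be computed by the classical product construction on the automaton interpretations, treating each pair label (x, y) as a single letter. A minimal sketch, with our own quadruple representation (states, initials, finals, edges); trimming unreachable states is left out:

```python
def intersect(t1, t2):
    """Product construction on the finite-state automaton interpretations:
    a joint transition exists only when both operands carry the same
    pair label (x, y)."""
    _, i1, f1, e1 = t1
    _, i2, f2, e2 = t2
    edges = {((p1, p2), x, y, (q1, q2))
             for (p1, x, y, q1) in e1
             for (p2, x2, y2, q2) in e2
             if (x, y) == (x2, y2)}
    states = ({p for (p, _, _, _) in edges} |
              {q for (_, _, _, q) in edges} |
              {(a, b) for a in i1 for b in i2})
    return (states,
            {(a, b) for a in i1 for b in i2},
            {(a, b) for a in f1 for b in f2} & states,
            edges)
```

On general rational transducers this product is not correct (ε moves let the same pair be spelled in many ways); left-synchronism is precisely what makes it exact.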

Synchronous and ε-synchronous transductions are closed under inversion (i.e., relational symmetry) and composition. Clearly, the class of left-synchronous transductions is also closed under inversion.

Combined with the boolean algebra structure, the following result is useful for reaching definition analysis (to solve (4.17) in Section 4.3.3).

Theorem 3.15 The class of left-synchronous transductions is closed under composition.

Proof: Consider two left-synchronous transductions τ1 : A* → B* and τ2 : B* → C*, and two left-synchronous transducers T1 = (Q1, I1, F1, E1) realizing τ1 and T2 = (Q2, I2, F2, E2) realizing τ2. We suppose Q1 and Q2 are disjoint sets (without loss of generality) and define T = (Q1 × Q2 ∪ Q1 ∪ Q2, I1 × I2, F1 × F2 ∪ F1 ∪ F2, E) as follows:

1. All transitions in E1 and E2 are also in E;

2. If q1 -x|y-> q1′ ∈ E1 and q2 -y|z-> q2′ ∈ E2, then (q1, q2) -x|z-> (q1′, q2′) ∈ E;

3. If q1 -x|ε-> q1′ ∈ E1 and q2 -ε|z-> q2′ ∈ E2, then (q1, q2) -x|z-> (q1′, q2′) ∈ E;

4. If q1 -ε|y-> q1′ ∈ E1 and q2 -y|ε-> q2′ ∈ E2, then (q1, q2) -ε|ε-> (q1′, q2′) ∈ E;

5. If q1 -x|y-> q1′ ∈ E1 and q2 -y|ε-> q2′ ∈ E2, then (q1, q2) -x|ε-> (q1′, q2′) ∈ E;

6. If q1 -ε|y-> q1′ ∈ E1 and q2 -y|z-> q2′ ∈ E2, then (q1, q2) -ε|z-> (q1′, q2′) ∈ E;

7. If q1 -x|ε-> q1′ ∈ E1, then ∀q2 ∈ F2: (q1, q2) -x|ε-> q1′ ∈ E;

8. If q2 -ε|z-> q2′ ∈ E2, then ∀q1 ∈ F1: (q1, q2) -ε|z-> q2′ ∈ E.

First, consider an accepting path e in T for a pair of words (f, h). We may write e = e12 e′, where e12 is the Q1 × Q2 part of e. By construction of T, the end state of e12 is a final state of T1 and e′ is a path of T2, or it is the opposite. Considering the projection of states in e12 on Q1, e12 accepts a pair of words (f, g) in T1 such that h ∈ τ2(g). Hence h ∈ τ2 ∘ τ1(f).

Second, consider three words f, g, h such that g ∈ τ1(f) and h ∈ τ2(g). Let e1 be an accepting path for (f, g) in T1 and e2 be one for (g, h) in T2. Suppose |e1| > |e2|. Build a path e12 in T from the product of states and labels of the first |e2| transitions in e1 and e2; its end state is (q1, q2) with q1 ∈ Q1 and q2 ∈ F2. Now, the last |e1| − |e2| transitions in e1 can be written (q1, x, ε, q1′).e1′, hence e12.((q1, q2), x, ε, q1′).e1′ is an accepting path for (f, h) in T.

Eventually, we have shown that T realizes τ2 ∘ τ1. Now, using the classical ε|ε-transition removal algorithm for finite-state automata, we define a transducer T′. It is left-synchronous because T1 and T2 are, and transitions involving states of Q1 or Q2 (labeled on A × {ε} or {ε} × C) are never followed by transitions involving states of Q1 × Q2.

The proof is constructive.
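The product part of the construction of Theorem 3.15 can be sketched in Python; the quadruple representation (states, initials, finals, edges) and the "P"/"1"/"2" tags are our own conventions, and the final ε|ε-removal and trimming steps of the proof are left out:

```python
EPS = ""  # stands for the empty word epsilon

def compose(t1, t2):
    """Compose T1 : A* -> B* with T2 : B* -> C* (letter-granularity
    labels).  Joint moves fire when T1's output letter and T2's input
    letter agree (possibly both epsilon); tail moves handle the suffix
    where one operand has already reached a final state."""
    _, i1, f1, e1 = t1
    _, i2, f2, e2 = t2
    edges = set()
    for (p1, x, y, q1) in e1:          # joint moves (rules 2-6)
        for (p2, y2, z, q2) in e2:
            if y == y2:
                edges.add((("P", p1, p2), x, z, ("P", q1, q2)))
    for (p1, x, y, q1) in e1:          # input-only tail (rules 1 and 7)
        if y == EPS:
            edges.add((("1", p1), x, EPS, ("1", q1)))
            for p2 in f2:
                edges.add((("P", p1, p2), x, EPS, ("1", q1)))
    for (p2, y, z, q2) in e2:          # output-only tail (rules 1 and 8)
        if y == EPS:
            edges.add((("2", p2), EPS, z, ("2", q2)))
            for p1 in f1:
                edges.add((("P", p1, p2), EPS, z, ("2", q2)))
    initials = {("P", a, b) for a in i1 for b in i2}
    finals = ({("P", a, b) for a in f1 for b in f2} |
              {("1", q) for q in f1} | {("2", q) for q in f2})
    states = ({p for (p, _, _, _) in edges} |
              {q for (_, _, _, q) in edges} | initials | finals)
    return (states, initials, finals, edges)
```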

Before showing an important application of this result, we need an additional definition:

Definition 3.18 (⪯-selection) Let τ : A* → B* be a rational transduction, and let ⪯ be a rational order on B*, i.e., a rational relation which is reflexive, anti-symmetric and transitive. The ⪯-selection of τ is a partial function σ defined by

∀(u, v) ∈ A* × B* : v = σ(u) ⟺ v = min⪯ τ(u).

Proposition 3.11 Let τ : A* → B* be a left-synchronous transduction, and let ⪯ be a left-synchronous order on B*. The ⪯-selection of τ is a left-synchronous function.

Proof: The result comes from the fact that σ = τ − ((⪯ − Id) ∘ τ), together with closure of left-synchronous relations under composition, intersection and complementation.

The most interesting application of this result to our framework appears when choosing the lexicographic order for ⪯; see Section 4.3.3. For more details on ⪯-selection, also known as uniformization, see [PS98].
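To see why a set-difference formula of this shape selects exactly the ⪯-minimal image, one can unfold the definition of the minimum (our own rephrasing; ρ ∘ τ denotes τ followed by ρ, and ≺ = ⪯ \ Id is the strict part of ⪯):

```latex
\begin{align*}
  v = \min\nolimits_{\preceq} \tau(u)
  &\iff v \in \tau(u) \;\wedge\; \neg\,\exists w \in \tau(u):\; w \prec v\\
  &\iff (u,v) \in \tau \;\wedge\; (u,v) \notin (\prec \circ \tau)\\
  &\iff (u,v) \in \tau \setminus \bigl((\preceq \setminus \mathrm{Id}) \circ \tau\bigr).
\end{align*}
```

Each operation used here (composition, intersection with the complement of the synchronous relation Id, and set difference) preserves left-synchronism, which is all the proposition needs.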

It is well known that the recognizability of a rational transduction is undecidable. This is proved by Berstel in [Ber79], Theorem 8.4, and we use a similar technique to show that the same holds for left-synchronous relations. We start with a preliminary result.

Lemma 3.3 Let K be a positive integer, let A = {a, b}, let B be any alphabet, and let u1, u2, ..., up ∈ B*. Define

U = {(ab^K, u1), (ab^{2K}, u2), ..., (ab^{pK}, up)}.

Then, U and U+ are rational relations, and the relation (A* × B*) − U+ is also rational.

Proof: The relation U is finite, hence rational, and U+ is rational by closure under concatenation and the star operation.

Usually, the class of rational relations is not closed under complementation, so we have something to prove here. This is done the same way as in [Ber79], Lemma 8.3, with the only substitution of b by b^K.

Theorem 3.16 Let A and B be alphabets with at least two letters. Given a rational relation R over A and B, it is undecidable whether R is left-synchronous.

Proof: We may assume that A contains exactly two letters, and set A = {a, b}. Consider two sequences u1, u2, ..., up and v1, v2, ..., vp of non-empty words over B, and let K be their maximum length. Define

U = {(ab^K, u1), ..., (ab^{pK}, up)} and V = {(ab^K, v1), ..., (ab^{pK}, vp)}.

From Lemma 3.3, U, V, U+, V+, U′ = (A* × B*) − U+ and V′ = (A* × B*) − V+ are rational relations.

Let R = U′ ∪ V′. Since left-synchronous transductions are closed under complementation, R is left-synchronous iff (A* × B*) − R = U+ ∩ V+ is.

Assume U+ ∩ V+ is non-empty and realized by a left-synchronous transducer T. Consider (m, u) ∈ U+ ∩ V+. We may write m = fg with |f| = |u| and |g| > 0. Left-synchronism requires that (g, ε) labels a path in T. Moreover, ((fg)^k, u^k) ∈ U+ ∩ V+ for all k ≥ 1, hence the path labeled by (g, ε) must be part of a cycle:

∃g′ : ∀k : (fg(g′g)^k, u) ∈ U+ ∩ V+.

However, because u1, ..., up and v1, ..., vp are non-empty, the ratio between the lengths of input and output words in U+ ∩ V+ must be bounded; this is contradictory.

Hence a non-empty U+ ∩ V+ cannot be left-synchronous.5 Since deciding the emptiness of U+ ∩ V+ is exactly solving Post's correspondence problem for u1, ..., up and v1, ..., vp, we have proven that left-synchronism is undecidable.

A similar proof shows the following result, which is not a corollary of Theorem 3.16.

Theorem 3.17 Let A and B be alphabets with at least two letters. Given a rational relation R over A and B, it is undecidable whether R is both left and right synchronous.

Despite these general undecidability results, we are interested in particular cases where a rational relation can be proved left-synchronous.

Transmission Rate

We recall the following useful notion to give an alternative description of synchronism in transducers. The transmission rate of a path labeled by (u, v) is defined as the ratio |v|/|u| ∈ Q+ ∪ {+∞}.

Eilenberg and Schützenberger [Eil74] showed that the synchronism property of a transducer is decidable. Frougny and Sakarovitch [FS93] showed a similar result for ε-synchronism, and their algorithm operates directly on the transducer that realizes the transduction. The result is:

Lemma 3.4 A rational transducer is ε-synchronizable iff the transmission rate of all its cycles is 1.

There is no characterization of recognizable transducers through the transmission rates of their cycles, but one may give a sufficient condition:

Lemma 3.5 If the transmission rate of all cycles in a rational transducer is 0 or +∞, then it realizes a recognizable relation.

Proof: Let T be a rational transducer whose cycles' transmission rates are only 0 and +∞. Considering a strongly-connected component, all its cycles must be of the same rate, hence a strongly-connected component has either no input or no output. This proves that strongly-connected components realize recognizable relations. Closure of recognizable relations under concatenation and union terminates the proof.

There is no characterization of left-synchronizable transducers either. However, as a straightforward application of the previous definitions, one may give the following result:

Lemma 3.6 If T is a left-synchronous transducer, then cycles of T may only have three different transmission rates: 0, 1 and +∞. All cycles in the same strongly-connected component must have the same transmission rate, only components of rate 0 may follow components of rate 0, and only components of rate +∞ may follow components of rate +∞.

Even if left-synchronizable transducers may not satisfy these properties, some kind of converse is available; see Theorem 3.19.
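The structural conditions of Lemmas 3.4 to 3.6 suggest a simple check on strongly-connected components. Here is a coarse Python sketch (our own quadruple-free representation; '' stands for ε; the SCC computation uses quadratic forward/backward reachability rather than Tarjan's algorithm, for brevity):

```python
def reachable(start, adj):
    seen, stack = {start}, [start]
    while stack:
        for q in adj[stack.pop()]:
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def sccs(states, edges):
    """Strongly-connected components as forward-reach /\ backward-reach."""
    succ = {s: set() for s in states}
    pred = {s: set() for s in states}
    for (p, _, _, q) in edges:
        succ[p].add(q)
        pred[q].add(p)
    remaining, comps = set(states), []
    while remaining:
        s = next(iter(remaining))
        comp = (reachable(s, succ) & reachable(s, pred)) & remaining
        comps.append(comp)
        remaining -= comp
    return comps

def rate_class(comp, edges):
    """Coarse classification of a component's cycle transmission rates:
    all-empty outputs give rate 0, all-empty inputs give rate +inf,
    letter-to-letter labels give rate 1; 'mixed' means this sufficient
    test is inconclusive."""
    internal = [(x, y) for (p, x, y, q) in edges if p in comp and q in comp]
    if not internal:
        return "trivial"
    if all(y == "" for (x, y) in internal):
        return "0"
    if all(x == "" for (x, y) in internal):
        return "+inf"
    if all(len(x) == 1 and len(y) == 1 for (x, y) in internal):
        return "1"
    return "mixed"
```

Checking the ordering constraint of Lemma 3.6 (no rate-1 component after a rate-0 or rate-+∞ one) then amounts to a traversal of the component DAG.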

5 We have also proven here that U+ and V+ are not left-synchronous.

Classes of Transductions

We have shown that left-synchronous transductions extend the algebraic properties of recognizable transductions. The following theorem shows that they also extend the real-time properties of ε-synchronous transducers.

Theorem 3.18 ε-synchronous transductions are left-synchronous.

Proof: Consider an ε-synchronous transducer T realizing a relation R over alphabets A and B, and call δ the upper bound on the delays between input and output words accepted by T. Taking advantage of closure under intersection, one may partition R into relations Ri of constant delay δi, for all |δi| ≤ δ. Let Ti realize relation Ri: by construction, v ∈ |Ti|(u) iff |u| = |v| + δi.

Let □ be a new letter; if δi is non-negative (resp. negative), define Ti′ from Ti by substituting its final state with a transducer accepting (ε, □^{δi}) (resp. (□^{−δi}, ε)). Each Ti′ is length preserving, hence synchronizable. Transducer T′ = T′_{−δ} ∪ ··· ∪ T′_{δ} is thus synchronizable, hence left-synchronizable.

Let P realize the relation {(u, u□^a) : u ∈ A*, a ≥ 0} and Q realize the relation {(v□^b, v) : v ∈ B*, b ≥ 0}, which are both left-synchronizable. Transducer Q ∘ T′ ∘ P realizes the same transduction as T, and it is left-synchronizable from Theorem 3.15.

One may go a bit further and give a generalization of Theorems 3.12 and 3.18, based on Lemmas 3.5 and 3.4:

Theorem 3.19 If the transmission rate of each cycle in a rational transducer is 0, 1 or +∞, and if no cycle whose rate is 1 follows a cycle whose rate is not 1, then the transducer is left-synchronizable.

Proof: Consider a rational transducer T satisfying the above hypotheses, and consider an accepting path e in T. The restriction of T to the states and transitions in e yields a transducer Te, such that |Te| ⊆ |T|. Moreover, Te can be divided into transducers Ts and Tr, such that the (unique) final state of the first is the (unique) initial state of the second, and the transmission rate of all cycles is 1 in Ts and either 0 or +∞ in Tr. From Lemma 3.5, Tr realizes a recognizable relation. From Lemma 3.4, Ts is ε-synchronizable, hence left-synchronizable from Theorem 3.18. Finally, Proposition 3.10 shows that Te is left-synchronizable. Since the number of "restricted" transducers Te is finite, closure under union terminates the proof.

The proof is constructive.

As an application of this theorem, one may give a generalization of Proposition 3.10.(i):

Proposition 3.12 If S is ε-synchronous and T is left-synchronous, then ST is left-synchronous.

Notice that the left and right synchronizable transducer example of Figure 3.6 (which is even recognizable) does not satisfy the conditions of Theorem 3.19, since the transmission rate of some of its cycles is 2.

Resynchronization Algorithm

Although left-synchronism is not decidable, one may be interested in a synchronization algorithm that works on a subset of the left-synchronizable transducers: the class of transducers satisfying the hypotheses of Theorem 3.19.

Extending an implementation by Béal and Carton [BC99a] of the algorithm in [FS93], it is possible to "resynchronize" our larger class along the lines of the proof of Theorem 3.19. This technique will be used extensively in Sections 3.6 and 3.7, to compute (possibly approximate) intersections of rational relations. Presentation of the full algorithm and further investigation of its complexity are left for future work.

We first present an extension of the minimality concept for finite-state automata to left-synchronous transducers. Let T = (Q, I, F, E) be a transducer over alphabets A and B. We define the following predicate, for q ∈ Q and (u, v) ∈ A* × B*:

Accept(q, u, v) iff (u, v) labels an accepting path starting at q.

Nerode's equivalence, noted ≡, is defined by

q ≡ q′ iff for all (u, v) ∈ A* × B*: Accept(q, u, v) ⟺ Accept(q′, u, v).

The equivalence class of q ∈ Q is denoted by q̂. Let

T/≡ = (Q/≡, I/≡, F/≡, Ê),

where Ê is naturally defined by

(q̂1, x, y, q̂2) ∈ Ê ⟺ ∃(q1′, q2′) ∈ q̂1 × q̂2 : (q1′, x, y, q2′) ∈ E.

Using Nerode's equivalence, we extend the concept of minimal automaton to left-synchronous transducers.

Theorem 3.20 Any left-synchronous transduction is realized by a unique minimal left-synchronous transducer (up to a renaming of states).

Proof: Let τ be a transduction over alphabets A and B, realized by a left-synchronous transducer T = (Q, I, F, E). We suppose without loss of generality that T is trim. By definition of ≡, it is clear that T/≡ realizes τ.

Every transition of T/≡ is labeled on (A × B) ∪ (A × {ε}) ∪ ({ε} × B). Consider two states q, q′ ∈ Q such that q ≡ q′ and q has an incoming transition labeled on A × {ε} (resp. {ε} × B); and consider (u, v) ∈ A* × B* such that Accept(q, u, v) and Accept(q′, u, v). Any outgoing transition from q must be labeled on A × {ε} (resp. {ε} × B), hence v (resp. u) must be empty. Since this is true for all accepted (u, v), and since T is trim, any outgoing transition from q′ must also be labeled on A × {ε} (resp. {ε} × B); this proves that T/≡ is left-synchronous.

Finally, let A be the finite-state automaton interpretation of T (see Definition 3.6). It is well known that A/≡ is the unique minimal automaton realizing the same rational language as A (up to a renaming of states). Thus, if T′ is a realization of τ with a minimal number of states, its interpretation must be this minimal automaton (up to a renaming of states), which is the interpretation of T/≡. This proves the unicity of the minimal left-synchronous transducer.
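In practice, Nerode's equivalence on the automaton interpretation can be computed by Moore-style partition refinement over the pair alphabet. A minimal Python sketch (our own representation; it assumes the interpretation is deterministic, as produced by the determinization of Lemma 3.2):

```python
def minimize(states, finals, edges):
    """Partition refinement: start from the final / non-final split, then
    repeatedly split classes whose states reach different classes on the
    same pair label (x, y), until the partition is stable."""
    labels = sorted({(x, y) for (_, x, y, _) in edges})
    step = {(p, (x, y)): q for (p, x, y, q) in edges}
    part = {s: (s in finals) for s in states}
    while True:
        sig = {s: (part[s],
                   tuple(part.get(step.get((s, l))) for l in labels))
               for s in states}
        classes = {v: i for i, v in enumerate(sorted(set(sig.values()),
                                                     key=repr))}
        new = {s: classes[sig[s]] for s in states}
        if len(set(new.values())) == len(set(part.values())):
            return new   # maps each state to its Nerode class
        part = new
```

Quotienting the transducer by the returned classes yields the minimal left-synchronous transducer of Theorem 3.20.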

As a corollary of closure under complementation and intersection, the usual questions become decidable for left-synchronous transductions:

Lemma 3.7 Let R, R′ be left-synchronous relations over alphabets A and B. It is decidable whether R ∩ R′ = ∅, R ⊆ R′, R = R′, R = A* × B*, or (A* × B*) − R is finite.

These properties are essential for formal reasoning about dependence and reaching definition abstractions in the following chapter.

Finally, we are still working on the decidability of recognizability among left-synchronous relations. We have strong arguments to expect a positive result, but no proof at the moment.

We now consider possible extensions of left-synchronizable relations.

Constant Transmission Rates

An elementary variation on synchronous transducers consists in enforcing a single transmission rate in all cycles which is not necessarily 1: if k and l are positive integers, a (k, l)-synchronous relation over A* × B* is realized by a transducer whose transitions are labeled in A^k × B^l. Similarly, one may define ε-(k, l)-synchronous and left-(k, l)-synchronous transducers.

Noticing that a change of alphabet converts a (k, l)-synchronous transducer into a classical synchronous one, it clearly appears that the same properties are satisfied for any k and l, including k = l = 1. The only difference is that the transmission rates of cycles are now 0, +∞ and k/l. Mixing relations of (k, l)-synchronous classes for different (k, l) is not allowed, of course.

However, most rational transductions useful to our framework, including orders, are left-(1, 1)-synchronous, that is, left-synchronous... This strongly reduces the usefulness of general left-(k, l)-synchronous transductions.

Deterministic Transducers

Much more interesting is the class of deterministic relations introduced by Pelletier and Sakarovitch in [PS98]:

Definition 3.19 (deterministic transducer and relation) Let A and B be two alphabets. A transducer T = (A*, B*, Q, I, F, E) is deterministic if the following conditions hold:

(i) there exists a partition of the set of states Q = QA ∪ QB such that the label of an edge departing from a state in QA is in A × {ε} and the label of an edge departing from a state in QB is in {ε} × B;

(ii) for every p ∈ Q and every (x, y) ∈ (A × {ε}) ∪ ({ε} × B), there exists at most one q ∈ Q such that (p, x, y, q) is in E (i.e., the finite-state automaton interpretation is deterministic).

A deterministic relation is a relation realized by a deterministic transducer.

This class is strictly larger than the class of left-synchronous relations, and it keeps most of its good properties: the greatest loss is closure under composition. Moreover, because relation U+ is deterministic in the proof of Theorem 3.16, it is undecidable whether a deterministic relation is recognizable, left-synchronous, or both left and right synchronous.

But the most important reason for us to use left-synchronous relations instead of deterministic ones is that there is no result such as Theorem 3.19 to find a deterministic realization of a relation, or to help approximate a rational relation by a deterministic one.

For the purpose of our program analysis framework, we sometimes require more expressiveness than rational relations: "finite automata cannot count", and we need counting to handle arrays! We thus present an extension of the algebraic (also known as context-free) property to relations between finitely generated monoids. As one would expect, the class of algebraic relations includes rational relations, and it retains several decidable properties. This section ends with a few contributions: Theorems 3.27 and 3.28, and Proposition 3.13.

We define algebraic relations through push-down transducers, defined similarly to push-down automata (see Section 3.2.3).

Definition 3.20 (push-down transducer) Given alphabets A and B, a push-down transducer T = (A*, B*, Γ, γ₀, Q, I, F, E), a.k.a. algebraic transducer, consists of a stack alphabet Γ, a non-empty word γ₀ in Γ⁺ called the initial stack word, a finite set Q of states, a set I ⊆ Q of initial states, a set F ⊆ Q of final states, and a finite set of transitions (a.k.a. edges) E ⊆ Q × (A ∪ {ε}) × (B ∪ {ε}) × Γ × Γ* × Q.

The free monoids A* and B* are often omitted for brevity, when clear from the context. A transition (q, x, y, g, γ, q′) ∈ E is usually written q –x|y, g→γ→ q′. The push-down automaton and rational transducer vocabularies are inherited.

A configuration of a push-down transducer is a quadruple (u, v, q, γ), where (u, v) is the pair of words to be accepted or rejected, q is the current state and γ ∈ Γ* is the word composed of the symbols in the stack. The transition between two configurations c₁ = (u₁, v₁, q₁, γ₁) and c₂ = (u₂, v₂, q₂, γ₂) is denoted by the relation ⊢, defined by c₁ ⊢ c₂ iff there exists (x, y, g, γ, γ′) ∈ (A ∪ {ε}) × (B ∪ {ε}) × Γ × Γ* × Γ* such that

u₁ = x u₂ ∧ v₁ = y v₂ ∧ γ₁ = γ′g ∧ γ₂ = γ′γ ∧ (q₁, x, y, g, γ, q₂) ∈ E.

Then ⊢ᵖ with p ∈ ℕ, ⊢⁺ and ⊢* are defined as usual.

A push-down transducer T = (Γ, γ₀, Q, I, F, E) is said to realize the relation R by final state when (u, v) ∈ R iff there exists (q_i, q_f, γ) ∈ I × F × Γ* such that

(u, v, q_i, γ₀) ⊢* (ε, ε, q_f, γ).

It is said to realize R by empty stack when (u, v) ∈ R iff there exists (q_i, q_f) ∈ I × F such that

(u, v, q_i, γ₀) ⊢* (ε, ε, q_f, ε).

Notice that realization by empty stack implies realization by final state: q_f is still required to be in the set of final states.

Definition 3.21 (algebraic relation) The class of relations realized by final state or by empty stack by push-down transducers is called the class of algebraic relations.
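As an illustration, the mirror relation {(u, ũ)}, pairing every word with its reversal, is a classical example of an algebraic relation that is not rational. The C sketch below simulates a hypothetical two-state push-down transducer realizing it by final state: a first state reads the input and pushes each letter, a spontaneous move changes state, and the second state pops one letter per output letter. Function and variable names are ours, not from the thesis:

```c
/* Simulation of a hypothetical two-state push-down transducer realizing
   the mirror relation R = {(u, reverse(u))}: in state q0, each transition
   reads a letter x and pushes it; a spontaneous (epsilon) transition moves
   to q1; in q1, each transition pops a letter x while writing it.
   Acceptance is by final state q1 with the stack back to the initial
   stack word. */
int mirror_realizes(const char *u, const char *v) {
    char stack[256];                   /* enough for this sketch */
    int top = 0;                       /* stack height above gamma_0 */
    for (; *u; u++)                    /* state q0: read and push */
        stack[top++] = *u;
    for (; *v; v++) {                  /* state q1: pop and write */
        if (top == 0 || stack[top - 1] != *v)
            return 0;                  /* no matching transition */
        top--;
    }
    return top == 0;                   /* all pushed symbols popped */
}
```

The first loop is the only place the input is consumed and the second the only place the output is produced, mirroring the partition of states in the transducer.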

As for rational relations, the following characterization of algebraic relations is fundamental: it allows one to express algebraic relations by means of algebraic languages and monoid morphisms. A proof in a much more general setting can be found in [Kar92]. (Berstel uses this theorem as a definition of algebraic relations in [Ber79].)

Theorem 3.21 (Nivat) Let A and B be two alphabets. Then R is an algebraic relation over A* and B* iff there exist an alphabet C, two morphisms α : C* → A*, β : C* → B*, and an algebraic language L ⊆ C* such that

R = {(α(h), β(h)) : h ∈ L}.

Generalizing Section 3.3.2, algebraic transductions are the functional counterpart of algebraic relations. Nivat's theorem can be formulated as follows for algebraic transductions:

Theorem 3.22 (Nivat) Let A and B be two alphabets. Then τ : A* → B* is an algebraic transduction iff there exist an alphabet C, two morphisms α : C* → A*, β : C* → B*, and an algebraic language L ⊆ C* such that

∀w ∈ A* : τ(w) = β(α⁻¹(w) ∩ L).

Let us recall some useful properties of algebraic relations and transductions.

Theorem 3.23 Algebraic relations are closed under union, concatenation, and the star operation. They are also closed under composition with rational transductions (similarly to the Elgot and Mezei theorem). The image of a rational language by an algebraic transduction is an algebraic language (thanks to Nivat's theorem).

The image of an algebraic language by an algebraic transduction may not be algebraic, but there are some interesting exceptions:

Theorem 3.24 (Evey) Given a push-down transducer T, if L is the algebraic language realized by the input automaton of T (see Definition 3.8), then the image T(L) is an algebraic language.

The following definition will be useful in some technical discussions and proofs below. It formalizes the fact that a push-down transducer can be interpreted as a push-down automaton over a more complex alphabet. But beware: the two interpretations have different properties in general.

Definition 3.22 Let T be a push-down transducer over alphabets A and B. The push-down automaton interpretation of T is a push-down automaton A over the alphabet (A × B) ∪ (A × {ε}) ∪ ({ε} × B) defined by the same stack alphabet, initial stack word, states, initial states, final states and transitions.


Among the usual decision problems, only the following are available for algebraic relations:

Theorem 3.25 The following problems are decidable for algebraic relations: whether two words are in relation (in linear time), emptiness, finiteness.

Important remarks. In the following, every push-down transducer will implicitly accept words by final state. Recognizable and rational relations were defined for arbitrary finitely generated monoids, but algebraic relations are defined for free monoids only.

Algebraic Functions

There are very few results about algebraic transductions that are partial functions. Here is the definition:

Definition 3.23 (algebraic function) Let A and B be two alphabets. An algebraic function f : A* → B* is an algebraic transduction which is a partial function, i.e. such that Card(f(u)) ≤ 1 for all u ∈ A*.

However, we are not aware of any decidability result for an algebraic transduction being a partial function, and we believe that the most likely answer is negative.

Among transducers realizing algebraic functions, we are especially interested in transducers whose output can be "computed online" with their input. As for rational transducers, our interpretation of "online computation" is based on the determinism of the input automaton:

Definition 3.24 (online algebraic transducer) An algebraic transducer is online if it realizes a partial function and if its input automaton is deterministic. An algebraic transduction is online if it is realized by an online algebraic transducer.

Nevertheless, we are not aware of any results for this class of algebraic functions; even decidability of deterministic algebraic languages among algebraic ones is unknown.

An interesting sub-class of algebraic relations is the class of one-counter relations. It is defined through push-down transducers. A classical definition is the following:

Definition 3.25 A push-down transducer is a one-counter transducer if its stack alphabet contains only one letter. An algebraic relation is a one-counter relation if it is realized by a one-counter transducer (by final state).

As for one-counter languages, we prefer a definition which is more suitable to our practical usage of one-counter relations.

Definition 3.26 (one-counter transducer and relation) A push-down transducer is a one-counter transducer if its stack alphabet contains three letters, Z (for "zero"), I (for "increment") and D (for "decrement"), and if the stack word belongs to the (rational) set ZI* ∪ ZD*. An algebraic relation is a one-counter relation if it is realized by a one-counter transducer (by final state).

It is easy to show that Definition 3.26 describes the same family of relations as the preceding classical definition.

We use the same notations as for one-counter languages, see Section 3.2.4. The family of one-counter relations is strictly included in the family of algebraic relations.
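The counter discipline of Definition 3.26 can be made concrete: a stack word in ZI* ∪ ZD* encodes an integer, namely the number of I's or minus the number of D's, and incrementing or decrementing either cancels the top symbol or pushes its opposite. A minimal C sketch (names are ours, not from the thesis):

```c
/* Counter encoded as a stack word in Z I* + Z D* (Definition 3.26):
   the value is the number of I's, or minus the number of D's. */
typedef struct { char sym[64]; int top; } counter;

void ctr_init(counter *c) { c->sym[0] = 'Z'; c->top = 0; }

void ctr_incr(counter *c) {
    if (c->sym[c->top] == 'D') c->top--;        /* cancel one decrement */
    else c->sym[++c->top] = 'I';                /* push an increment */
}

void ctr_decr(counter *c) {
    if (c->sym[c->top] == 'I') c->top--;        /* cancel one increment */
    else c->sym[++c->top] = 'D';                /* push a decrement */
}

int ctr_value(const counter *c) {               /* used for checks like "=0" */
    return c->sym[c->top] == 'D' ? -c->top : c->top;
}
```

The invariant maintained by the two operations is exactly that the stack word stays in ZI* ∪ ZD*, with Z never popped.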

Notice that using more than one counter gives the same expressive power as Turing machines, as for multi-counter automata; see the last paragraph of Section 3.2.4 for further discussion of this topic.

Now, why are we interested in such a class of relations? We will see in our program analysis framework that we need to compose rational transductions over non-free monoids. Indeed, the well-known theorem by Elgot and Mezei (Theorem 3.5 in Section 3.3) can be "partly" extended to arbitrary finitely generated monoids:

Theorem 3.26 (Elgot and Mezei) If M₁ and M₂ are finitely generated monoids, A is an alphabet, and τ₁ : M₁ → A* and τ₂ : A* → M₂ are rational transductions, then τ₂ ∘ τ₁ : M₁ → M₂ is a rational transduction.

But this extension is not interesting in our case, since the "middle" monoid in our transduction composition is not free. More precisely, we would like to compute the composition of two rational transductions τ₂ ∘ τ₁, when τ₁ : A* → ℤⁿ and τ₂ : ℤⁿ → B*, for some alphabets A and B and some positive integer n. Sadly, because of the commutative group nature of ℤ, the composition of τ₂ and τ₁ is not a rational transduction in general. An intuitive view of this comes from the fact that all "words" on ℤ of the form

1 + 1 + ⋯ + 1 − 1 − 1 − ⋯ − 1   (k terms of each)

are equal to 0, but do not form a rational language in {1, −1}* (they form a context-free one).

We have proven that such a composition yields an n-counter transduction in general, and the proof gives a constructive way to build a transducer realizing the composition:

Theorem 3.27 Let A and B be two alphabets and let n be a positive integer. If τ₁ : A* → ℤⁿ and τ₂ : ℤⁿ → B* are rational transductions, then τ₂ ∘ τ₁ : A* → B* is an n-counter transduction.

Proof: We first suppose that n is equal to 1. Let T₁ = (A*, ℤ, Q₁, I₁, F₁, E₁) realize τ₁ and let T₂ = (ℤ, B*, Q₂, I₂, F₂, E₂) realize τ₂. We define a one-counter transducer T₁′ = (A*, B*, γ₀, Q₁, I₁, F₁, E₁′), with no output on B*, from T₁: if (q, u, v, q′) ∈ E₁ then (q, u, ε, +v, q′) ∈ E₁′ (counter update +v, no counter check). Similarly, we define a one-counter transducer T₂′ = (A*, B*, γ₀, Q₂, I₂, F₂, E₂′), with no input from A*, from T₂: if (q, u, v, q′) ∈ E₂ then (q, ε, v, −u, q′) ∈ E₂′ (counter update −u, no counter check). Intuitively, the outputs of T₁ and T₂ are replaced by counter updates in T₁′ and by the opposite counter updates in T₂′.

Then we define a one-counter transducer T = (A*, B*, γ₀, Q₁ ∪ Q₂ ∪ {q_F}, I₁, {q_F}, E) as a kind of concatenation of T₁′ and T₂′:

- if e ∈ E₁′ then e ∈ E;
- if e ∈ E₂′ then e ∈ E;
- if q₁ ∈ F₁ and q₂ ∈ I₂ then (q₁, ε, ε, q₂) ∈ E (neither counter check nor counter update);
- if q₂ ∈ F₂ then (q₂, ε, ε, q_F) ∈ E, guarded by the counter check =0 (no counter update);
- no other transition is in E.

Intuitively, T accepts a pair of words (u, v) when (u, ε) would be accepted by T₁′, (ε, v) would be accepted by T₂′, and the counter is zero when reaching state q_F. Then T is a one-counter transducer and realizes τ₂ ∘ τ₁.

Finally, if n is greater than 1, the same construction can be applied to each dimension of ℤⁿ, and the associated counter checks and updates can be combined to build an n-counter transducer realizing τ₂ ∘ τ₁.
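As a toy instance of this construction, take τ₁ : {a}* → ℤ mapping aⁿ to 2n (each input letter contributes +2) and τ₂ : ℤ → {b}* mapping m to bᵐ (each output letter consumes 1). Then τ₂ ∘ τ₁ = {(aⁿ, b²ⁿ)}, and the one-counter transducer built as in the proof can be simulated directly. This example is ours, not from the thesis:

```c
/* One-counter transducer for tau2 o tau1 = {(a^n, b^(2n))}, built as in
   the proof of Theorem 3.27: the T1' part turns each output +2 of tau1
   into a counter update, the T2' part turns each input +1 of tau2 into
   the opposite update, and the transition to q_F checks the counter. */
int composed_realizes(const char *u, const char *v) {
    long counter = 0;
    for (; *u; u++) {                 /* T1' part: a | eps, counter += 2 */
        if (*u != 'a') return 0;
        counter += 2;
    }
    for (; *v; v++) {                 /* T2' part: eps | b, counter -= 1 */
        if (*v != 'b') return 0;
        counter -= 1;
    }
    return counter == 0;              /* counter check "=0" before q_F */
}
```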

Theorem 3.27 will be used in Section 4.3 to prove properties of the dependence analysis. In practice, we will restrict ourselves to n = 1, applying the conservative approximations described in Section 3.7, either to τ₁ and τ₂ or to the multi-counter composition.

We now need an additional formalization of the rational transducer "skeleton" of a push-down transducer.

Definition 3.27 (underlying rational transducer) Let T = (Γ, γ₀, Q, I, F, E) be a push-down transducer. We can build a rational transducer T′ = (Q, I, F, E′) from T by setting

(q, x, y, q′) ∈ E′ ⟺ ∃g ∈ Γ, γ ∈ Γ* : (q, x, y, g, γ, q′) ∈ E.

The underlying rational transducer of T is the rational transducer obtained by trimming T′ and removing all transitions labeled ε|ε.

Looking at the proof of Theorem 3.27, there is a very interesting property of the transducer T realizing τ₂ ∘ τ₁: the transmission rate of every cycle in T is either 0 or +∞. Thanks to Lemma 3.5 in Section 3.4, we have proven the following result:

Proposition 3.13 Let A and B be two alphabets and let n be a positive integer. Let τ₁ : A* → ℤⁿ and τ₂ : ℤⁿ → B* be rational transductions and let T be an n-counter transducer realizing τ₂ ∘ τ₁ : A* → B* (computed from Theorem 3.27). Then the underlying rational transducer of T is recognizable.

Applications of this result include closure under intersection with any rational transduction, thanks to the technique presented in Section 3.6.2.

Finally, when studying abstract models for data structures, we have seen that nested trees and arrays are modeled neither by free monoids nor by free commutative monoids. Their general structure is called a free partially commutative monoid, see Section 2.3.3. Let A and B be two alphabets, and let M be such a monoid with binary operation ⊙. We still want to compute the composition of rational transductions τ₂ ∘ τ₁, when τ₁ : A* → M and τ₂ : M → B*. The following result is an extension of Theorem 3.27, and its proof is still constructive:

Theorem 3.28 Let A and B be two alphabets and let M be a free partially commutative monoid. If τ₁ : A* → M and τ₂ : M → B* are rational transductions, then τ₂ ∘ τ₁ : A* → B* is a multi-counter transduction. The number of counters is equal to the maximum dimension of vectors in M (see Definition 2.6).

Proof: Because the full proof is rather technical while its intuition is very natural, we only sketch the main ideas. Considering two rational transducers T₁ and T₂ realizing τ₁ and τ₂ respectively, we start by applying the classical composition algorithm for free monoids to build a transducer T realizing τ₂ ∘ τ₁. But this time, T will be multi-counter: every counter is initialized to 0, and the transitions generated by the classical composition algorithm simply ignore the counters.

Now, every time a transition of T₁ writes a vector v (resp. T₂ reads a vector v), the "normal execution" of the classical composition algorithm is "suspended": only transitions reading (resp. writing) vectors of the same dimension as v are considered in T₂ (resp. T₁), and v is added to the counters using the technique of Theorem 3.27. When a letter is read or written during the "suspended mode", each counter is checked for zero before "resuming" the "normal execution" of the classical composition algorithm. The result is a transducer with rational and multi-counter parts, separated by checks for zero.
Theorem 3.28 will also be used in Section 4.3.
3.6 More About Intersection

Intersecting relations is a major issue in our analysis and transformation framework. We have seen that this operation preserves neither the rational nor the algebraic property of a relation; but we have also found sub-classes of relations that are closed under intersection. The purpose of this section is to extend these sub-classes in order to support special cases of intersections.

For the purpose of dependence analysis, we have already mentioned the need for intersections with the lexicographic order. Indeed, the class of left-synchronous relations includes the lexicographic order and is closed under intersection.

In this section, we restrict ourselves to the case of relations over A* × A* for some alphabet A. We will describe a class larger than left-synchronous relations over A* × A* which is closed under intersection with the lexicographic order only.⁶

Definition 3.28 (pseudo-left-synchronism) Let A be an alphabet. A rational transducer T = (A, A, Q, I, F, E) (same alphabet A on both sides) is pseudo-left-synchronous if there exists a partition of the set of states Q = Q_I ∪ Q_S ∪ Q_T satisfying the following conditions:

(i) any transition between states of Q_I is labeled x|x for some x in A;

(ii) any transition between a state of Q_I and a state of Q_T is labeled x|y for some x ≠ y in A;

(iii) the restriction of T to states in Q_I ∪ Q_S is left-synchronous.

A rational relation or transduction is pseudo-left-synchronous if it is realized by a pseudo-left-synchronous transducer. A rational transducer is pseudo-left-synchronizable if it realizes a pseudo-left-synchronous relation.

⁶ This class is not comparable with the class of deterministic relations proposed in Definition 3.19 of Section 3.4.7.

Intuitively, a pseudo-left-synchronous transducer satisfies the left-synchronism property everywhere except after transitions labeled x|y with x ≠ y. The motivation for such a definition comes from the following result:

Proposition 3.14 The class of pseudo-left-synchronous relations is closed under intersection with the lexicographic order.

Proof: Because the non-left-synchronous part is preceded by transitions labeled x|y with x ≠ y, which are themselves preceded by transitions labeled x|x, intersection with the lexicographic order becomes straightforward on this part: if x < y the transition is kept in the intersection, otherwise it is removed. Intersecting the left-synchronous part is done thanks to Theorem 3.14.
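The transition filter used in this proof can be sketched concretely. The edge representation and names below are ours: each transition carries an input label x and an output label y, and on the frontier leaving Q_I toward the non-left-synchronous part, a transition labeled x|y with x ≠ y survives the intersection with the strict lexicographic order exactly when x < y.

```c
/* Edge of a transducer over a single alphabet, with input label x and
   output label y.  This representation is ours, for illustration only. */
typedef struct { int from, to; char x, y; } edge;

/* Keep the edges of the Q_I -> Q_T frontier that are compatible with the
   strict lexicographic order: a transition x|y is kept iff x < y.
   Returns the number of surviving edges, compacted at the front. */
int intersect_frontier(edge *e, int n) {
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (e[i].x < e[i].y)        /* x < y: the pair may stay ordered */
            e[kept++] = e[i];
    return kept;
}
```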

Another interesting result is the following:

Proposition 3.15 Intersecting a pseudo-left-synchronous relation with the identity relation yields a left-synchronous relation.

Proof: Same idea as the preceding proof, but transitions x|y with x ≠ y are now always removed.

Of course, pseudo-left-synchronous relations are closed under union, but not under intersection, complementation or composition.

Finally, the constructive proof of Theorem 3.19 can be modified to look for pseudo-left-synchronous relations: when a transition labeled x|y is found after a path of transitions labeled x|x, leave the following transitions unchanged.

What about the intersection of algebraic relations? The well-known result about closure of algebraic languages under intersection with rational languages has no extension to algebraic relations. Still, it is easy to see that there is a property similar to left-synchronism which brings partial intersection results for algebraic relations.

Proposition 3.16 Let R₁ be an algebraic relation realized by a push-down transducer whose underlying rational transducer is left-synchronous, and let R₂ be a left-synchronous relation. Then R₁ ∩ R₂ is an algebraic relation, and one may compute a push-down transducer realizing the intersection whose underlying rational transducer is left-synchronous.

Proof: Let T₁ be a push-down transducer realizing R₁ whose underlying rational transducer T₁′ is left-synchronous, and let T₂ be a left-synchronous realization of R₂. The proof comes from the fact that intersecting T₁′ and T₂ can be done without "forgetting" the original stack operation associated with each transition in T₁. This is due to the cross-product nature of the intersection algorithm for finite-state automata (which also applies to left-synchronous transducers).

The same argument applies when the underlying rational transducer is only pseudo-left-synchronous, yielding the following result:

Proposition 3.17 Let A be an alphabet and let R be an algebraic relation over A* × A* realized by a push-down transducer whose underlying rational transducer is pseudo-left-synchronous. Then intersecting R with the lexicographic order (resp. the identity relation) yields an algebraic relation, and one may compute a push-down transducer realizing the intersection whose underlying rational transducer is pseudo-left-synchronous (resp. left-synchronous).
3.7 Approximating Relations on Words

This section is a transition between the long study of mathematical tools exposed in this chapter and the application of these tools to our analysis and transformation framework. Remember that we have seen in Section 2.4 that exact results were not required for data-flow information, and that our program transformations were based on conservative approximations of sets and relations. Studying approximations is rather unusual when dealing with words and relations between words, but we will show its practical interest in the next chapters.

Of course, such conservative approximations must be as precise as possible, and exact results should be sought whenever possible. Indeed, approximations are needed only when a question or an operation on rational or algebraic relations is not decidable. Our general approximation scheme for rational and algebraic relations is thus to find a conservative approximation in a smaller class which supports the required operation or for which the required question is decidable.

Recognizable Approximations of Rational Relations

Sometimes a recognizable approximation of a rational relation may be needed. If R is a rational relation realized by a rational transducer T = (Q, I, F, E), the simplest way to build a recognizable relation K which is larger than R is to define K as the product of the input and output languages of R.

A smarter approximation is to consider each pair (q_i, q_f) of initial and final states in T, and to define K_{q_i,q_f} as the product of the input and output languages of the relation realized by (Q, {q_i}, {q_f}, E). Then K is defined as the union of all K_{q_i,q_f} for (q_i, q_f) ∈ I × F. This builds a recognizable relation thanks to Mezei's Theorem 3.3.

The next level of precision is achieved by considering each strongly-connected component of T and approximating it with the preceding technique. The resulting relation K is still recognizable, thanks to Mezei's theorem. This technique will be used in the following when looking for a recognizable approximation of a rational relation.

Left-Synchronous Approximations of Rational Relations

Because recognizable approximations are not precise enough in general, and because the class of left-synchronous relations retains most interesting properties of recognizable relations, we will rather approximate rational relations by left-synchronous ones.

The key algorithm in this context is based on the constructive proof of Theorem 3.19 presented in Section 3.4.5. In practical cases, it often returns a left-synchronous transducer and no approximation is necessary. When it fails, it means that some strongly-connected component could not be resynchronized. The idea is then to approximate this strongly-connected component by a recognizable relation, and then to restart the resynchronization algorithm.

For better efficiency, all strongly-connected components whose transmission rate is not 0, 1 or +∞ should be approximated this way in a first stage. In the same stage, if a strongly-connected component C whose transmission rate is 1 follows some strongly-connected components C₁, …, C_n whose transmission rates are 0 or +∞, then a recognizable approximation K_C of C should be added to the transducer with the same outgoing transitions as C, and all paths from C₁, …, C_n to C should now lead to K_C. Applying such a first stage guarantees that the resynchronization algorithm will return a left-synchronous approximation of R, thanks to Theorem 3.19.
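As a sketch, the first-stage test on a strongly-connected component can be phrased on the transmission rate of its cycles, the ratio of output length to input length along a cycle. The enum and function names below are ours, and the neutral case (no input, no output) is arbitrarily classified as rate 1:

```c
/* Transmission-rate class of a cycle reading in_len input letters and
   writing out_len output letters. */
typedef enum { RATE_ZERO, RATE_ONE, RATE_OTHER, RATE_INFINITE } rate_class;

rate_class classify(long in_len, long out_len) {
    if (in_len == 0 && out_len == 0) return RATE_ONE;    /* neutral cycle */
    if (out_len == 0) return RATE_ZERO;                  /* rate 0 */
    if (in_len == 0) return RATE_INFINITE;               /* rate +infinity */
    return in_len == out_len ? RATE_ONE : RATE_OTHER;    /* finite rate */
}
```

Under this classification, a component containing a RATE_OTHER cycle is the one that must be replaced by a recognizable approximation before resynchronization.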

Finally, when trying to intersect a rational transducer with the lexicographic order, we are looking for a pseudo-left-synchronous approximation. The same technique as before can then be applied, using the extended version of Theorem 3.19 proposed in Section 3.6.

There are two very different techniques for approximating algebraic relations. The simplest one is used to give conservative answers to a few questions that are undecidable for algebraic transducers but decidable for rational ones. It consists in taking the underlying rational transducer as a conservative approximation. Precision can be slightly improved when the stack size is bounded: the finite number of possible stack words can be encoded in state names. This may induce a large increase in the number of states. The second technique is used when looking for an intersection with a left-synchronous relation: it consists in approximating the underlying rational transducer with a left-synchronous (or pseudo-left-synchronous) one without modifying the stack operations. In fact, stack operations can be preserved in the resynchronization algorithm (associated with Theorem 3.19), but they are obviously lost when approximating a strongly-connected component with a recognizable relation. Which technique is applied will be stated every time an approximation of an algebraic relation is required.

Finally, we have seen that composing two rational transductions over ℤⁿ yields an n-counter transduction by Theorem 3.27. Approximation by a one-counter transduction then consists in saving the values of bounded counters into new state names, then removing all unbounded counters but one. Smart choices of the remaining counter and attempts to combine two counters into one have not been studied yet, and are left for future work.

Chapter 4

Instancewise Analysis for Recursive Programs

Even though dependence information is at the core of virtually all modern optimizing compilers, recursive programs have not received much attention. Concerning instancewise dependence analysis for recursive data structures, fewer than three papers have been published. Even worse is the state of the art in reaching definition analysis: before our recent results for arrays [CC98], no instancewise reaching definition analysis for recursive programs had been proposed.

Considering the program model proposed in Chapter 2, we now focus on dependence and reaching definition analysis at the run-time instance level. The following presentation builds on our previous work on the subject [CCG96, Coh97, Coh99a, Fea98, CC98], but has gone through several major evolutions. It results in a much more general and mathematically sound framework, with algorithms for automation of the whole analysis process, but also in a more complex presentation. The primary goal of this work is rather theoretical: we look for the highest precision possible. Beyond this important target, we will show in a later chapter (see Section 5.5) how this precise information can be used to outperform current results in parallelization of recursive programs, and also to enable new program transformation techniques.

We start our presentation with a few motivating examples; we then discuss induction variable and storage mapping function computation in Section 4.2; the general analysis technique is presented in Section 4.3, with questions specific to particular data structures deferred to the next sections. Finally, Section 4.7 compares our results with static analyses and with recent work on instancewise analysis for loop nests.
4.1 Motivating Examples

Studying three examples, we present an intuitive flavor of the instancewise dependence and reaching definition analyses for recursive control and data structures.
4.1.1 First Example: Procedure Queens

Our first example is again the procedure Queens, presented in Section 2.3. It is reproduced in Figure 4.1.a with a partial control tree.

Studying accesses to array A, our purpose is to find dependences between run-time instances of program statements. Let us study instance FPIAAaAaAJQPIAABBr of statement r.

........................................................................................

int A[n];

void Queens (int n, int k) {
I     if (k < n) {
A,a     for (int i=0; i<n; i++) {
B,b       for (int j=0; j<k; j++)
r           ... = A[j];
J         if (...) {
s           A[k] = ...;
Q           Queens (n, k+1);
          }
        }
      }
}

int main () {
F     Queens (n, 0);
}

Figure 4.1.a. Procedure Queens

[Figure 4.1.b, the compressed control tree, shows the instances of s writing A[0], namely FPIAAJs, FPIAAaAJs and FPIAAaAaAJs, depicted as squares (the last one black), and the instance FPIAAaAaAJQPIAABBr reading A[0].]

Figure 4.1.b. Compressed control tree

. . . . . . . . . . . . . . . . . . . . Figure 4.1. Procedure Queens and control tree . . . . . . . . . . . . . . . . . . . .

We would like to know which memory location is accessed. Since j is initialized to 0 in statement B, and incremented by 1 in statement b, we know that the value of variable j at FPIAAaAaAJQPIAABBr is 0, so FPIAAaAaAJQPIAABBr reads A[0].

We now consider instances of s, depicted as squares: since statement s writes into A[k], we are interested in the value of variable k: it is initialized to 0 in main (by the first call Queens(n, 0)), and incremented at each recursive call to procedure Queens in statement Q. Thus, instances such as FPIAAJs, FPIAAaAJs or FPIAAaAaAJs write into A[0], and are therefore in dependence with FPIAAaAaAJQPIAABBr.
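Under the statement labeling of Figure 4.1, these induction variable values can be read directly off the control word: k is incremented once per occurrence of Q (starting from 0 at the initial call F), and j is reset by B and incremented by b. A hypothetical C sketch of this evaluation (function names are ours):

```c
/* Value of k at a given instance: one increment per recursive call Q,
   starting from the initial call Queens(n, 0). */
int value_of_k(const char *control_word) {
    int k = 0;
    for (; *control_word; control_word++)
        if (*control_word == 'Q') k++;
    return k;
}

/* Value of j at a given instance: reset to 0 by B, incremented by b. */
int value_of_j(const char *control_word) {
    int j = 0;
    for (; *control_word; control_word++) {
        if (*control_word == 'B') j = 0;
        else if (*control_word == 'b') j++;
    }
    return j;
}
```

This is only a hand-specialized illustration for Queens; Section 4.2 builds such storage mappings systematically from systems of recurrence equations.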

Let us now derive which of these definitions reaches FPIAAaAaAJQPIAABBr. Looking again at Figure 4.1.b, we notice that instance FPIAAaAaAJs, denoted by a black square, is, among the three possible reaching definitions that are shown, the last to execute. And it does execute: since we assume that FPIAAaAaAJQPIAABBr executes, then FPIAAaAaAJ (hence FPIAAaAaAJs) has to execute. Therefore, other instances writing into the same array element, such as FPIAAJs and FPIAAaAJs, cannot reach the read instance, since their value is always overwritten by FPIAAaAaAJs.¹ Noticing that no other instance of s could execute after FPIAAaAaAJs, we can ensure that FPIAAaAaAJs is the reaching definition of FPIAAaAaAJQPIAABBr. We will show later how this simple approach to computing reaching definitions can be generalized.

¹ FPIAAaAaAJs is then called an ancestor of FPIAAaAaAJQPIAABBr, to be formally defined later.

4.1.2 Second Example: Procedure BST

Let us now look at procedure BST, as shown in Figure 4.2. This procedure swaps node values to convert a binary tree into a binary search tree (BST). Nodes of the tree structure are referenced by pointers; p->l (resp. p->r) denotes the pointer to the left (resp. right) child of the node pointed to by p; p->value denotes the integer value of the node.

........................................................................................

P     void BST (tree *p) {
I1      if (p->l != NULL) {
L         BST (p->l);
I2        if (p->value < p->l->value) {
a           t = p->value;
b           p->value = p->l->value;
c           p->l->value = t;
          }
        }
J1      if (p->r != NULL) {
R         BST (p->r);
J2        if (p->value > p->r->value) {
d           t = p->value;
e           p->value = p->r->value;
f           p->r->value = t;
          }
        }
      }

int main () {
F       if (root != NULL) BST (root);
}

[The right part of the figure shows a compressed control tree of BST, with internal nodes labeled P, I1, J1, I2, J2, L, R and leaves a, b, c, d, e, f.]

. . . . . . . . . . . . . . . . . . . . Figure 4.2. Procedure BST and control tree . . . . . . . . . . . . . . . . . . . .

There are few dependences in program BST. If u is an instance of block I2, then there are anti-dependences between the first read access in u and instance ub, between the second read access in u and uc, between the read access in instance ua and instance ub, and between the read access in ub and instance uc. The situation is symmetric for an instance v of block J2: there are anti-dependences between the first read access in v and ve, between the second read access in v and vf, between the read access in vd and ve, and between the read access in ve and vf. No other dependences are found. We will show in the following how to compute this result automatically. Finally, a reaching definition analysis tells us that ⊥ is the unique reaching definition of each read access.

4.1.3 Third Example: Function Count

Our last motivating example is function Count, as shown in Figure 4.3. It operates on the inode structure presented in Section 2.3.3. This function computes the size of a file in blocks, by counting terminal inodes.

Since there is no write access to the inode structure, there are no dependences in the Count program (not considering the other data structures, such as scalar c).

........................................................................................

P     int Count (inode *p) {
I       if (p->terminal)
a         return p->length;
E       else {
b         c = 0;
L,l       for (int i=0; i<p->length; i++)
c           c += Count (p->n[i]);
d         return c;
        }
      }

int main () {
F       Count (file);
}

[The right part of the figure shows a compressed control tree of Count, with internal nodes labeled P, I, E, L and leaves a, b, c, d.]

. . . . . . . . . . . . . . . . . . . . Figure 4.3. Function Count . . . . . . . . . . . . . . . . . . . .

However, an interesting result for cache optimization techniques [TD95] would be that each memory location is read only once. We will show that this information can be computed automatically by our analysis techniques.

In the rest of this chapter, we formalize the concepts introduced above. In Section 4.2, we compute maps from instance names to data-element names. Then, the dependence and reaching definition relations are computed in Section 4.3.
4.2 Mapping Instances to Memory Locations

In Section 2.4, we defined storage mappings from accesses, i.e. pairs of a run-time instance and a reference in the statement, to memory locations. To abstract the effect of every statement instance, we need to make these functions explicit. This is done through the use of induction variables.

After a few definitions and additional restrictions of the program model, we show that induction variables are described by systems of recurrence equations, we prove a fundamental resolution theorem for such systems, and finally we apply this theorem in an algorithm to compute storage mappings.

To simplify the notation of variables and values, we write "v" for the name of an existing program variable, and v as an abbreviation for "the value of variable v".

We now extend the classical concept of induction variable, strongly connected with nested loops, to recursive programs. To simplify the exposition, we suppose that every integer or pointer variable that is local to a procedure or global to the program has a unique distinctive name. This allows quick and non-misleading wordings such as "variable i", and has no effect on the generality of the approach. Compared to classical work on loop nests [Wol92], we have a rather original definition of induction variables:

- integer arguments of a function that are initialized to a constant or to an integer
  induction variable plus a constant (e.g. incremented or decremented by a constant),
  at each procedure call;
- integer loop counters that are incremented (or decremented) by a constant at each
  loop iteration;
- pointer arguments that are initialized to a constant or to a possibly dereferenced
  pointer induction variable, at each procedure call;
- pointer loop variables that are dereferenced at each loop iteration.

For example, suppose i, j and k are integer variables, p and q are pointer variables
to a list structure with a member next of type list*, and Compute is some procedure
with two arguments. In the code in Figure 4.4, reference 2*i+j appears in a non-recursive
function call, hence i, j, p and q are considered induction variables. In contrast, k
is not an induction variable because it retains its last value at the entry of the inner loop.

........................................................................................

void Compute (int i, list *p) {
  int j, k;
  list *q;
  for (q=p, k=0; q!=NULL; q=q->next)
    for (j=0; j<100; j+=2, k++)
      // recursive call
      Compute (j+1, q);
  printf ("%d", 2*i+j);
}

. . . . . . . . . . . . . . . . . . Figure 4.4. First example of induction variables . . . . . . . . . . . . . . . . . .

As a kind of syntactic sugar to increase the versatility of induction variables, some
cases of direct assignments to induction variables are allowed, i.e. induction variable
updates outside of loop iterations and procedure calls. Regarding initialization and in-
crement/decrement/dereference, the rules are the same as for a procedure call, but
there are two additional restrictions. These restrictions are those of the code motion
[KRS94, Gup98] and symbolic execution [Muc97] techniques used to move each direct
assignment to some loop/procedure block surrounding it. After such a transformation,
direct assignments can be interpreted as "executed at the entry of that block", the name
of the statement being replaced by the actual name of the block.

Of course, symbolic execution techniques cannot convert all cases of direct assignments
into legal induction variable updates, as shown by the following examples. Considering
the program in Figure 4.5.a, i is an induction variable because the while loop can be
converted into a for loop on i, but j is not an induction variable since it is not initialized
at the entry of the inner for loop. Considering the other program in Figure 4.5.b, variable
i is not an induction variable because statement s is guarded by a conditional.


........................................................................................

Figure 4.5.a. Second example:

    int i=0, j=0, k, A[200];
    while (i<10) {
      for (k=0; k<10; k++) {
        j = j + 2;
        A[i] = A[i] + A[j];
      }
s     i = i + 1;
    }

Figure 4.5.b. Third example:

    int i, A[10, 10];
    for (i=0, j=0; i<10; i++) {
      ...;
      if ( ... )
s       i = i + 2;
r     A[i, j] = ...;
    }

. . . . . . . . . . . . . . . . . . Figure 4.5. More examples of induction variables . . . . . . . . . . . . . . . . . .

In addition to the program model presented in Section 2.2, our analysis requires a few additional hypotheses:

- every data structure subject to dependence or reaching definition analysis must be
  declared global (notice that local variables can be made global using explicit memory
  allocations and stacks);
- every array subscript must be an affine function of integer induction variables (not
  any integer variable) and symbolic constants;
- every tree access must dereference a pointer induction variable (not any pointer
  variable) or a constant.

Describing conflicts between memory accesses is at the core of dependence analysis. We
must be able to associate memory locations with memory references in statement instances
(i.e. A[i], *p, etc.) by means of storage mappings. This analysis is done independently
on each data structure. For each induction variable, we thus need a function mapping
a control word to the associated value of the induction variable. In addition, the next
definition introduces a notation for the relation between control words and induction
variable values.

Definition 4.1 (value of induction variables) Let σ be a program statement or
block, and w be an instance of σ. The value of variable i at instance w is defined
as the value of i immediately after executing (resp. entering) instance w of statement
(resp. block) σ. This value is denoted by [[i]](w).

For a program statement σ and an induction variable i, we call [[i, σ]] the set of
all pairs (u, i) such that [[i]](u) = i, for all instances u of σ.

We consider pairs of elements in monoids, and to be consistent with the usual notation
for rational sets and relations, a pair (x, y) will be denoted by (x|y).

In general, the value of a variable at a given control word depends on the execution.
Indeed, an execution trace keeps all the information about variable updates, but a
control word does not. However, due to our program model restrictions, induction variables are
completely defined by control words:

Lemma 4.1 Let i be an induction variable and u a statement instance. If the value
[[i]](u) depends on the effect of an instance v (i.e. the value depends on whether v
executes or not), then v is a prefix of u.

Proof: Simply observe that only loop entries, loop iterations and procedure calls may
modify an induction variable, and that loop entries are associated with initializations
which "kill" the effect of all preceding iterations (associated with non-prefix control
words).

For two program executions e, e' ∈ E, the consequence of Lemma 4.1 is that the storage
mappings fe and fe' coincide on Ae ∩ Ae'. This strong property allows us to extend the
computation of a storage mapping fe to the whole set A of possible accesses. With this
extension, the storage mappings for different executions of a program all coincide. We will
thus consider in the following a storage mapping f independent of the execution.

The following result states that induction variables are described by recurrence equa-
tions:

Lemma 4.2 Let (Mdata, ·) be the monoid abstraction of the considered data structure.
Consider a statement σ and an induction variable i. The effect of statement σ on the
value of i is captured by one of the following equations:

either  ∃α ∈ Mdata, j ∈ induc : ∀uσ ∈ Lctrl : [[i]](uσ) = [[j]](u) · α    (4.1)
or      ∃α ∈ Mdata : ∀uσ ∈ Lctrl : [[i]](uσ) = α                          (4.2)

where induc is the set of all induction variables in the program, including i.

Proof: Consider an edge σ in the control automaton. Due to our syntactical restric-
tions, edge σ corresponds to a statement in the program text that can modify i in
only two ways:

- either there exists an induction variable j whose value is j ∈ Mdata just before
  executing instance uσ of statement σ and a constant α ∈ Mdata such that the
  value of i after executing instance uσ is j · α (translation from a possibly identical
  variable);
- or there exists a constant α ∈ Mdata such that the value of i after executing instance
  uσ is α (initialization).

Notice that, when accessing arrays, we allow general affine subscripts and not only
induction variables. Therefore we also build equations on affine functions a(i,j,...)
of the induction variables. For example, if a(i,j,k) = 2*i+j-k then we have to build
equations on [[2i+j-k]](u), knowing that [[2i+j-k]](u) = 2[[i]](u) + [[j]](u)
- [[k]](u).²

² We do indeed have to generate new equations, since computing [[2i+j-k]](u) from [[i]](u), [[j]](u)
and [[k]](u) is not possible in general: variables i, j and k may have different scopes.


Recurrence equations use two additional notations:

- Undefined is a polymorphic value for induction variables; [[i]](w) = Undefined means
  that variable i has an undefined value at instance w (it may also be the case that i
  is not visible at instance w);
- Arg(proc, num) stands for the num-th actual argument of procedure proc.

Algorithm Recurrence-Build applies Lemma 4.2 in turn to each statement in the
program.

Recurrence-Build (program)
program: an intermediate representation of the program
returns a list of recurrence equations
 1  sys ← ∅
 2  for each statement σ in program
 3    do for each induction variable i in σ
 4      do switch
 5        case σ = for (i=init; ...; ...):              // loop entry
 6          sys ← sys ∪ {∀uσ ∈ Lctrl : [[i]](uσ) = init}
 7        case σ = for (...; ...; i=i+inc):             // loop iteration
 8          sys ← sys ∪ {∀uσ ∈ Lctrl : [[i]](uσ) = [[i]](u) + inc}
 9        case σ = for (...; ...; i=i->inc):            // loop iteration
10          sys ← sys ∪ {∀uσ ∈ Lctrl : [[i]](uσ) = [[i]](u) · inc}
11        case σ = proc (..., var, ...):                // var in position m
12          sys ← sys ∪ {∀uσ ∈ Lctrl : [[Arg(proc, m)]](uσ) = [[var]](u)}
13        case σ = proc (..., var+cst, ...):
14          sys ← sys ∪ {∀uσ ∈ Lctrl : [[Arg(proc, m)]](uσ) = [[var]](u) + cst}
15        case σ = proc (..., var->cst, ...):
16          sys ← sys ∪ {∀uσ ∈ Lctrl : [[Arg(proc, m)]](uσ) = [[var]](u) · cst}
17        case σ = proc (..., cst, ...):
18          sys ← sys ∪ {∀uσ ∈ Lctrl : [[Arg(proc, m)]](uσ) = cst}
19        case default:
20          sys ← sys ∪ {∀uσ ∈ Lctrl : [[i]](uσ) = [[i]](u)}
21  for each procedure p declared proc (type1 arg1, ..., typen argn) in program
22    do for m ← 1 to n
23      do sys ← sys ∪ {∀up ∈ Lctrl : [[argm]](up) = [[Arg(proc, m)]](u)}
24  return sys

Now, suppose that there exist a statement σ, two induction variables i and j, and a
constant α ∈ Mdata such that ∀uσ ∈ Lctrl : [[i]](uσ) = [[j]](u) · α is an equation
generated by Lemma 4.2. Transposed to [[i, σ]], the set of all pairs (u|[[i]](u)),
it says that

(u|j) ∈ [[j, σ']] ⟹ (uσ|j · α) ∈ [[i, σ]],

for all statements σ' that may precede σ in a valid control word uσ. Second, suppose that
there exist a statement σ, an induction variable i, and a constant α ∈ Mdata such that
∀uσ ∈ Lctrl : [[i]](uσ) = α is a generated equation. It says that

(u|i) ∈ [[i, σ']] ⟹ (uσ|α) ∈ [[i, σ]],

for all statements σ' that may precede σ in a valid control word uσ. These two observa-
tions allow us to build, from the result of Recurrence-Build, a new system involving
equations on the sets [[i, σ]]. The algorithm achieving this is called Recurrence-Rewrite:
the two conditionals in Recurrence-Rewrite are associated with u = ε, i.e. with recur-
rence equations of the form [[i]](σ) = [[j]](ε) · α ([[j]](ε) is an undefined value) or
[[i]](σ) = α, and the two loops on σ' consider the predecessors of σ.

Recurrence-Rewrite (program, system)
program: an intermediate representation of the program
system: a system of recurrence equations produced by Recurrence-Build
returns a rewritten system of recurrence equations
 1  Lctrl ← language of control words of program
 2  new ← ∅
 3  for each equation ∀uσ ∈ Lctrl : [[i]](uσ) = [[j]](u) · α in system
 4    do if σ ∈ Lctrl
 5      then new ← new ∪ {(σ|Undefined) ∈ [[i, σ]]}
 6    for each σ' such that (Σ*ctrl σ'σ ∩ Lctrl) ≠ ∅
 7      do new ← new ∪ {∀uσ ∈ Lctrl : (u|j) ∈ [[j, σ']] ⟹ (uσ|j · α) ∈ [[i, σ]]}
 8  for each equation ∀uσ ∈ Lctrl : [[i]](uσ) = α in system
 9    do if σ ∈ Lctrl
10      then new ← new ∪ {(σ|α) ∈ [[i, σ]]}
11    for each σ' such that (Σ*ctrl σ'σ ∩ Lctrl) ≠ ∅
12      do new ← new ∪ {∀uσ ∈ Lctrl : (u|i) ∈ [[i, σ']] ⟹ (uσ|α) ∈ [[i, σ]]}
13  return new

Algorithms Recurrence-Build and Recurrence-Rewrite are now applied to
procedure Queens. There are three induction variables, i, j and k; but variable i is not
useful for computing storage mapping functions. We get the following equations:

From main call F:                [[Arg(Queens, 2)]](F) = 0
From procedure P:                ∀uP ∈ Lctrl : [[k]](uP) = [[Arg(Queens, 2)]](u)
From recursive call Q:           ∀uQ ∈ Lctrl : [[Arg(Queens, 2)]](uQ) = [[k]](u) + 1
From entry B of loop B=B̄=b:      ∀uB ∈ Lctrl : [[j]](uB) = 0
From iteration b of loop B=B̄=b:  ∀ub ∈ Lctrl : [[j]](ub) = [[j]](u) + 1

All other statements leave induction variables unchanged or undefined:

[[j]](F) = Undefined
∀uP ∈ Lctrl : [[j]](uP) = Undefined
∀uI ∈ Lctrl : [[j]](uI) = Undefined
∀uA ∈ Lctrl : [[j]](uA) = Undefined
∀uĀ ∈ Lctrl : [[j]](uĀ) = Undefined
∀ua ∈ Lctrl : [[j]](ua) = Undefined
∀uB̄ ∈ Lctrl : [[j]](uB̄) = [[j]](u)
∀ur ∈ Lctrl : [[j]](ur) = [[j]](u)
∀uJ ∈ Lctrl : [[j]](uJ) = [[j]](u)
∀uQ ∈ Lctrl : [[j]](uQ) = Undefined
∀us ∈ Lctrl : [[j]](us) = Undefined


[[k]](F) = Undefined
∀uI ∈ Lctrl : [[k]](uI) = [[k]](u)
∀uA ∈ Lctrl : [[k]](uA) = [[k]](u)
∀uĀ ∈ Lctrl : [[k]](uĀ) = [[k]](u)
∀ua ∈ Lctrl : [[k]](ua) = [[k]](u)
∀uB ∈ Lctrl : [[k]](uB) = [[k]](u)
∀uB̄ ∈ Lctrl : [[k]](uB̄) = [[k]](u)
∀ub ∈ Lctrl : [[k]](ub) = [[k]](u)
∀ur ∈ Lctrl : [[k]](ur) = [[k]](u)
∀uJ ∈ Lctrl : [[k]](uJ) = [[k]](u)
∀uQ ∈ Lctrl : [[k]](uQ) = [[k]](u)
∀us ∈ Lctrl : [[k]](us) = [[k]](u)

Now, recall that [[j, σ]] (resp. [[k, σ]]) is the set of all pairs (u|j) (resp. (u|k)) such that
[[j]](u) = j (resp. [[k]](u) = k), for all instances u of a statement σ. From the equations
above, Recurrence-Rewrite yields:

(F|Undefined) ∈ [[j, F]]
∀uP ∈ Lctrl : (u|j) ∈ [[j, F]] ⟹ (uP|Undefined) ∈ [[j, P]]
∀uP ∈ Lctrl : (u|j) ∈ [[j, Q]] ⟹ (uP|Undefined) ∈ [[j, P]]
∀uI ∈ Lctrl : (u|j) ∈ [[j, P]] ⟹ (uI|Undefined) ∈ [[j, I]]
∀uA ∈ Lctrl : (u|j) ∈ [[j, I]] ⟹ (uA|Undefined) ∈ [[j, A]]
∀uĀ ∈ Lctrl : (u|j) ∈ [[j, A]] ⟹ (uĀ|Undefined) ∈ [[j, Ā]]
∀uĀ ∈ Lctrl : (u|j) ∈ [[j, a]] ⟹ (uĀ|Undefined) ∈ [[j, Ā]]
∀ua ∈ Lctrl : (u|j) ∈ [[j, Ā]] ⟹ (ua|Undefined) ∈ [[j, a]]
∀uB ∈ Lctrl : (u|j) ∈ [[j, Ā]] ⟹ (uB|0) ∈ [[j, B]]
∀uB̄ ∈ Lctrl : (u|j) ∈ [[j, B]] ⟹ (uB̄|j) ∈ [[j, B̄]]
∀uB̄ ∈ Lctrl : (u|j) ∈ [[j, b]] ⟹ (uB̄|j) ∈ [[j, B̄]]
∀ub ∈ Lctrl : (u|j) ∈ [[j, B̄]] ⟹ (ub|j + 1) ∈ [[j, b]]
∀ur ∈ Lctrl : (u|j) ∈ [[j, B̄]] ⟹ (ur|j) ∈ [[j, r]]
∀uJ ∈ Lctrl : (u|j) ∈ [[j, Ā]] ⟹ (uJ|Undefined) ∈ [[j, J]]
∀uQ ∈ Lctrl : (u|j) ∈ [[j, J]] ⟹ (uQ|Undefined) ∈ [[j, Q]]
∀us ∈ Lctrl : (u|j) ∈ [[j, J]] ⟹ (us|Undefined) ∈ [[j, s]]


(F|Undefined) ∈ [[k, F]]
∀uP ∈ Lctrl : (u|x) ∈ [[Arg(Queens, 2), F]] ⟹ (uP|x) ∈ [[k, P]]
∀uP ∈ Lctrl : (u|x) ∈ [[Arg(Queens, 2), Q]] ⟹ (uP|x) ∈ [[k, P]]
∀uI ∈ Lctrl : (u|k) ∈ [[k, P]] ⟹ (uI|k) ∈ [[k, I]]
∀uA ∈ Lctrl : (u|k) ∈ [[k, I]] ⟹ (uA|k) ∈ [[k, A]]
∀uĀ ∈ Lctrl : (u|k) ∈ [[k, A]] ⟹ (uĀ|k) ∈ [[k, Ā]]
∀uĀ ∈ Lctrl : (u|k) ∈ [[k, a]] ⟹ (uĀ|k) ∈ [[k, Ā]]
∀ua ∈ Lctrl : (u|k) ∈ [[k, Ā]] ⟹ (ua|k) ∈ [[k, a]]
∀uB ∈ Lctrl : (u|k) ∈ [[k, Ā]] ⟹ (uB|k) ∈ [[k, B]]
∀uB̄ ∈ Lctrl : (u|k) ∈ [[k, B]] ⟹ (uB̄|k) ∈ [[k, B̄]]
∀uB̄ ∈ Lctrl : (u|k) ∈ [[k, b]] ⟹ (uB̄|k) ∈ [[k, B̄]]
∀ub ∈ Lctrl : (u|k) ∈ [[k, B̄]] ⟹ (ub|k) ∈ [[k, b]]
∀ur ∈ Lctrl : (u|k) ∈ [[k, B̄]] ⟹ (ur|k) ∈ [[k, r]]
∀uJ ∈ Lctrl : (u|k) ∈ [[k, Ā]] ⟹ (uJ|k) ∈ [[k, J]]
∀uQ ∈ Lctrl : (u|k) ∈ [[k, J]] ⟹ (uQ|k) ∈ [[k, Q]]
∀us ∈ Lctrl : (u|k) ∈ [[k, J]] ⟹ (us|k) ∈ [[k, s]]
(F|0) ∈ [[Arg(Queens, 2), F]]
∀uQ ∈ Lctrl : (u|k) ∈ [[k, J]] ⟹ (uQ|k + 1) ∈ [[Arg(Queens, 2), Q]]

4.2.3 Solving Recurrence Equations on Induction Variables

The following result is at the core of our analysis technique, but it is not limited to this

purpose. It will be applied in the next section to the system of equations returned by

Recurrence-Rewrite.

Lemma 4.3 Consider two monoids L and M with respective binary operations · and ⋆.
Let R be a subset of L × M defined by a system of equations of the form

(E1) ∀l ∈ L, m1 ∈ M : (l|m1) ∈ R1 ⟹ (l · λ1 | m1 ⋆ μ1) ∈ R

and

(E2) ∀l ∈ L, m2 ∈ M : (l|m2) ∈ R2 ⟹ (l · λ2 | μ2) ∈ R,

where R1 ⊆ L × M and R2 ⊆ L × M are some set variables constrained in the system
(possibly equal to R), λ1, λ2 are constants in L and μ1, μ2 are constants in M. Then,
R is a rational set.

Proof: Our first task is to convert these expressions on unstructured elements of L
and M into expressions in the monoid L × M. Then our second task is to derive set
expressions in L × M of the form set · constant or constant · set (the induced
operation on L × M is also denoted by "·"). Indeed, the right-hand side of (E1) can be written

(l|m1) · (λ1|μ1) ∈ R.

Thus, (E1) gives

R1 · (λ1|μ1) ⊆ R.

The right-hand side of (E2) can also be written

(l|ε) · (λ2|μ2) ∈ R

but (l|ε) is neither a variable nor a constant of L × M.

To overcome this difficulty, we call Rε the set of all pairs (l|ε) such that ∃m ∈ M :
(l|m) ∈ R. It is clear that Rε satisfies the same equations as R with all right pair
members replaced by ε. Now, (E2) yields two equations:

R2ε · (λ2|ε) ⊆ Rε   and   Rε · (ε|μ2) ⊆ R.

At last, if the only equations on R are (E1) and (E2), we have

Rε = R1ε · (λ1|ε) + R2ε · (λ2|ε)
R  = R1 · (λ1|μ1) + Rε · (ε|μ2)

More generally, applying this process to R1, R2 and to every subset of L × M described
in the system, we get a new system of regular equations defining R. It is well known
that such equations define a rational subset of L × M.

Thanks to the classical list operations Insert, Delete and Member (systems are en-
coded as lists of equations), and to the string operation Concat (equations are encoded
as strings), algorithm Recurrence-Solve gives an automatic way to solve systems of
equations of the form (E1) or (E2).

Recurrence-Solve (system)
system: a list of recurrence equations of the form (E1) and (E2)
returns a list of regular expressions
 1  sets ← ∅
 2  for each implication "(l|m) ∈ A ⟹ (l · λ | m ⋆ μ) ∈ B" in system
 3    do Insert (sets, "A · (λ|μ) ⊆ B")
 4       Insert (sets, "Aε · (λ|ε) ⊆ Bε")
 5  for each implication "(l|m) ∈ A ⟹ (l · λ | μ) ∈ B" in system
 6    do Insert (sets, "Bε · (ε|μ) ⊆ B")
 7       Insert (sets, "Aε · (λ|ε) ⊆ Bε")
 8  variables ← ∅
 9  for each inclusion "A · (x|y) ⊆ B" in sets
10    do if Member (variables, B)
11      then equation ← Delete (variables, B)
12           Insert (variables, Concat (equation, " + A · (x|y)"))
13      else Insert (variables, "B = A · (x|y)")
14  variables ← Compute-Regular-Expressions (variables)
15  return variables

Algorithm Compute-Regular-Expressions solves a system of regular equations
between rational sets, then returns a list of regular expressions defining these sets. The
system is seen as a regular grammar and resolution is done through variable substitution
(when the variable in the left-hand side does not appear in the right-hand side) or Kleene star
insertion (when it does). Well-known heuristics are used to reduce the size of the result;
see [HU79] for details.

The main result of this section follows: we can solve the recurrence equations of Lemma 4.2
to compute the value of induction variables at control words.

Theorem 4.1 The storage mapping f that maps every possible access in A to the mem-
ory location it accesses is a rational function from Σ*ctrl to Mdata.


Proof: Since array subscripts are affine functions of integer induction variables, and
since tree accesses are given by dereferenced induction pointers, one may generate a
system of equations according to Lemma 4.2 (or Recurrence-Build) for any read
or write access.

The result is a system of equations on induction variables. Thanks to Recurrence-
Rewrite, this system is rewritten in terms of equations on sets of pairs (u|[[i]](u)),
where u is a control word and i is an induction variable, describing the value of
i for any instance of statement σ. We thus get a new system which inductively
describes the subsets [[i, σ]] of Σ*ctrl × Mdata. Because this system satisfies the hypotheses
of Lemma 4.3, we have proven that [[i, σ]] is a rational set of Σ*ctrl × Mdata. Now, for
a given memory reference in σ, we know that the pairs (w|f(w)), where w is an instance
of σ, build a rational set. Hence f is a rational transduction from Σ*ctrl to Mdata.
Because f is also a partial function, it is a rational function from Σ*ctrl to Mdata.

The proof is constructive, thanks to Recurrence-Build and Recurrence-Solve,
and Compute-Storage-Mappings is the algorithm to automatically compute storage
mappings for a recursive program satisfying the hypotheses of Section 4.2.1. The result
is a list of rational transducers (converted by Compute-Rational-Transducer from
regular expressions) realizing the rational storage mappings for each reference in right-
hand side.

Compute-Storage-Mappings (program)
program: an intermediate representation of the program
returns a list of rational transducers realizing storage mappings
 1  system ← Recurrence-Build (program)
 2  new ← Recurrence-Rewrite (program, system)
 3  list ← Recurrence-Solve (new)
 4  newlist ← ∅
 5  for each regular expression reg in list
 6    do newlist ← newlist ∪ Compute-Rational-Transducer (reg)
 7  return newlist

Let us now apply Compute-Storage-Mappings to program Queens. Starting from
the result of Recurrence-Rewrite, we apply Recurrence-Solve. Just before call-
ing Compute-Regular-Expressions, we get the following system of regular equations:


[[j, F]] = (F|Undefined)
[[j, P]] = [[j, F]] · (P|Undefined) + [[j, Q]] · (P|Undefined)
[[j, I]] = [[j, P]] · (I|Undefined)
[[j, A]] = [[j, I]] · (A|Undefined)
[[j, Ā]] = [[j, A]] · (Ā|Undefined) + [[j, a]] · (Ā|Undefined)
[[j, a]] = [[j, Ā]] · (a|Undefined)
[[j, B]] = [[j, B]]ε · (ε|0)
[[j, B̄]] = [[j, B]] · (B̄|0) + [[j, b]] · (B̄|0)
[[j, b]] = [[j, B̄]] · (b|1)
[[j, r]] = [[j, B̄]] · (r|0)
[[j, J]] = [[j, Ā]] · (J|Undefined)
[[j, Q]] = [[j, J]] · (Q|Undefined)
[[j, s]] = [[j, J]] · (s|Undefined)
[[j, F]]ε = (F|ε)
[[j, P]]ε = [[j, F]]ε · (P|ε) + [[j, Q]]ε · (P|ε)
[[j, I]]ε = [[j, P]]ε · (I|ε)
[[j, A]]ε = [[j, I]]ε · (A|ε)
[[j, Ā]]ε = [[j, A]]ε · (Ā|ε) + [[j, a]]ε · (Ā|ε)
[[j, a]]ε = [[j, Ā]]ε · (a|ε)
[[j, B]]ε = [[j, Ā]]ε · (B|ε)
[[j, B̄]]ε = [[j, B]]ε · (B̄|ε) + [[j, b]]ε · (B̄|ε)
[[j, b]]ε = [[j, B̄]]ε · (b|ε)
[[j, J]]ε = [[j, Ā]]ε · (J|ε)
[[j, Q]]ε = [[j, J]]ε · (Q|ε)

[[k, F]] = (F|Undefined)
[[k, P]] = [[Arg(Queens, 2), F]] · (P|0) + [[Arg(Queens, 2), Q]] · (P|0)
[[k, I]] = [[k, P]] · (I|0)
[[k, A]] = [[k, I]] · (A|0)
[[k, Ā]] = [[k, A]] · (Ā|0) + [[k, a]] · (Ā|0)
[[k, a]] = [[k, Ā]] · (a|0)
[[k, B]] = [[k, Ā]] · (B|0)
[[k, B̄]] = [[k, B]] · (B̄|0) + [[k, b]] · (B̄|0)
[[k, b]] = [[k, B̄]] · (b|0)
[[k, r]] = [[k, B̄]] · (r|0)
[[k, J]] = [[k, Ā]] · (J|0)
[[k, Q]] = [[k, J]] · (Q|0)
[[k, s]] = [[k, J]] · (s|0)
[[Arg(Queens, 2), F]] = (F|0)
[[Arg(Queens, 2), Q]] = [[k, J]] · (Q|1)

These systems, seen as regular grammars, can be solved with Compute-Regular-
Expressions, yielding regular expressions. These expressions describe rational functions
from Σ*ctrl to Z, but we are only interested in [[j, r]] and [[k, s]] (accesses to array A):

[[j, r]] = (FPIAĀ|0) · ((JQPIAĀ|0) + (aĀ|0))* · (BB̄|0) · (bB̄|1)* · (r|0)    (4.3)
[[k, s]] = (FPIAĀ|0) · ((JQPIAĀ|1) + (aĀ|0))* · (Js|0)    (4.4)


Eventually, we have found the storage mapping function for every reference to the array:

(ur|f(ur, A[j])) = (FPIAĀ|0) · ((JQPIAĀ|0) + (aĀ|0))* · (BB̄|0) · (bB̄|1)* · (r|0)    (4.5)
(us|f(us, A[k])) = (FPIAĀ|0) · ((JQPIAĀ|1) + (aĀ|0))* · (Js|0)    (4.6)

We have already applied Compute-Storage-Mappings to program Queens; we now
repeat the process for the two other motivating examples.

Procedure BST

Algorithm Compute-Storage-Mappings is now applied to procedure BST in Fig-
ure 4.2. The only induction variable is p:

From main call F:             [[Arg(BST, 1)]](F) = ε
From procedure BST:           ∀uP ∈ Lctrl : [[p]](uP) = [[Arg(BST, 1)]](u)
From first recursive call L:  ∀uL ∈ Lctrl : [[Arg(BST, 1)]](uL) = [[p]](u) · l
From second recursive call R: ∀uR ∈ Lctrl : [[Arg(BST, 1)]](uR) = [[p]](u) · r

All other statements leave the induction variable unchanged. Recall that [[p, σ]] is the
set of all pairs (u|p) such that [[p]](u) = p, for all instances u of a statement σ. From the
equations above, this set satisfies the following regular equations:

[[p, P]] = (FP|ε) + [[p, I1]] · (LP|l) + [[p, J1]] · (RP|r)
[[p, I1]] = [[p, P]] · (I1|ε)
[[p, J1]] = [[p, P]] · (J1|ε)
[[p, I2]] = [[p, I1]] · (I2|ε)
[[p, J2]] = [[p, J1]] · (J2|ε)
[[p, a]] = [[p, I2]] · (a|ε)
[[p, b]] = [[p, I2]] · (b|ε)
[[p, c]] = [[p, I2]] · (c|ε)
[[p, d]] = [[p, J2]] · (d|ε)
[[p, e]] = [[p, J2]] · (e|ε)
[[p, f]] = [[p, J2]] · (f|ε)

This system describes rational functions from Σ*ctrl to {l, r}*, but we are only interested in
[[p, σ]] for σ ∈ {I2, a, b, c, J2, d, e, f} (accesses to node values):

∀σ ∈ {I2, a, b, c} : [[p, σ]] = (FP|ε) · ((I1LP|l) + (J1RP|r))* · (I1I2|ε)    (4.7)
∀σ ∈ {J2, d, e, f} : [[p, σ]] = (FP|ε) · ((I1LP|l) + (J1RP|r))* · (J1J2|ε)    (4.8)

Eventually, we can compute the storage mapping function for every reference to the tree:

∀σ ∈ {I2, a, b} :
  (u|f(u, p->value)) = (FP|ε) · ((I1LP|l) + (J1RP|r))* · (I1I2|ε)    (4.9)
∀σ ∈ {I2, b, c} :
  (u|f(u, p->l->value)) = (FP|ε) · ((I1LP|l) + (J1RP|r))* · (I1I2|l)    (4.10)
∀σ ∈ {J2, d, e} :
  (u|f(u, p->value)) = (FP|ε) · ((I1LP|l) + (J1RP|r))* · (J1J2|ε)    (4.11)
∀σ ∈ {J2, e, f} :
  (u|f(u, p->r->value)) = (FP|ε) · ((I1LP|l) + (J1RP|r))* · (J1J2|r)    (4.12)


Function Count

Algorithm Compute-Storage-Mappings is now applied to procedure Count in Fig-
ure 4.3. Variable p is a tree index and variable i is an integer index. Indeed, the inode
structure is neither a tree nor an array: nodes are named in the language Ldata = (Z · n)* · Z.
Thus, the effective induction variable should combine both p and i and be interpreted in
Ldata, with the binary operation defined in Section 2.3.3. But no such variable appears in
the program... The reason is that the code is written in C, in which the inode structure
cannot be referenced through a uniform "cursor", like a tree pointer or array subscript.

........................................................................................

P          int Count (inode &p) {
I            if (p->terminal)
a              return p->length;
E            else {
b              c = 0;
L=L̄=l        for (int i=0, inode &q=p->n; i<p->length; i++, q=q->1)
c                c += Count (q);
d              return c;
             }
           }

           main () {
F            Count (file);
           }

. . . . . . . . . . . . . . . Figure 4.6. Function Count in a C++-like syntax . . . . . . . . . . . . . . .

This would become possible in a higher-level language: we have rewritten the program
in a C++-like syntax in Figure 4.6. Now, p is a C++ reference and not a pointer, and
operation -> has been redefined to emulate array accesses.³ References p and q are the
two induction variables:

From main call F:                [[Arg(Count, 1)]](F) = ε
From procedure P:                ∀uP ∈ Lctrl : [[p]](uP) = [[Arg(Count, 1)]](u)
From recursive call c:           ∀uc ∈ Lctrl : [[Arg(Count, 1)]](uc) = [[q]](u)
From entry L of loop L=L̄=l:      ∀uL ∈ Lctrl : [[q]](uL) = [[p]](u) · n
From iteration l of loop L=L̄=l:  ∀ul ∈ Lctrl : [[q]](ul) = [[q]](u) · 1

All other statements leave induction variables unchanged or undefined. Recall that
[[p, σ]] (resp. [[q, σ]]) is the set of all pairs (u|p) (resp. (u|q)) such that [[p]](u) = p (resp.
[[q]](u) = q), for all instances u of a statement σ. From the equations above, these sets satisfy
the following regular equations:

³ Yes, C++ is both high-level and dirty!


[[p, P]] = (FP|ε) + [[q, L̄]] · (cP|ε)
[[p, I]] = [[p, P]] · (I|ε)
[[p, E]] = [[p, P]] · (E|ε)
[[p, a]] = [[p, I]] · (a|ε)
[[p, b]] = [[p, E]] · (b|ε)
[[p, L]] = [[p, E]] · (L|ε)
[[p, L̄]] = [[p, L]] · (L̄|ε) + [[p, L̄]] · (lL̄|ε)
[[p, d]] = [[p, E]] · (d|ε)
[[q, P]] = (FP|Undefined) + [[q, L̄]] · (cP|Undefined)
[[q, I]] = [[q, P]] · (I|Undefined)
[[q, E]] = [[q, P]] · (E|Undefined)
[[q, a]] = [[q, I]] · (a|Undefined)
[[q, b]] = [[q, E]] · (b|Undefined)
[[q, L]] = [[p, E]] · (L|n)
[[q, L̄]] = [[q, L]] · (L̄|0) + [[q, L̄]] · (lL̄|1)
[[q, d]] = [[q, E]] · (d|Undefined)

These systems describe rational functions from Σ*ctrl to (Z · n)* · Z, but we are only
interested in [[p, I]], [[p, a]] and [[p, L̄]] (accesses to inode values):

[[p, I]] = (uI|f(uI, p->terminal))
         = (FP|ε) · ((ELL̄|n) · (lL̄|1)* · (cP|ε))* · (I|ε)    (4.13)
[[p, a]] = (ua|f(ua, p->length))
         = (FP|ε) · ((ELL̄|n) · (lL̄|1)* · (cP|ε))* · (Ia|ε)    (4.14)
[[p, L̄]] = (uLL̄|f(uLL̄, p->length))
         = (FP|ε) · ((ELL̄|n) · (lL̄|1)* · (cP|ε))* · (ELL̄|ε) · (lL̄|ε)*    (4.15)

4.3 Dependence and Reaching Definition Analysis

When all program model restrictions are satisfied, we have shown in the previous section
that storage mappings are rational transductions. Based on this result, we now present
a general dependence and reaching definition analysis scheme for recursive programs.
Both classical results and recent contributions to formal language theory will be useful;
definitions and details can be found in Chapter 3.

This section tackles the general dependence and reaching definition analysis problem
in our program model. See Sections 4.4 (trees), 4.5 (arrays) and 4.6 (nested trees and
arrays) for technical questions depending on the data structure context.

In Section 2.4.1, we have seen that the analysis of conflicting accesses is one of the first
problems arising when computing dependence relations. We thus present a general com-
putation scheme for the conflict relation; technical issues and precise study are left for
the next sections.

We consider a program whose set of statement labels is Σctrl. Let Lctrl ⊆ Σ*ctrl be
the rational language of control words. Let Mdata be the monoid abstraction for a given
data structure D used in the program, and Ldata ⊆ Mdata be the rational language of valid
data structure elements.

Because f is used instead of fe (it is independent of the execution), the exact
conflict relation ∼e is defined by

∀e ∈ E, ∀u, v ∈ Lctrl : u ∼e v ⟺ (u, v ∈ Ae) ∧ f(u) = f(v),

which is equivalent to

∀e ∈ E, ∀u, v ∈ Lctrl : u ∼e v ⟺ (u, v ∈ Ae) ∧ v ∈ f⁻¹(f(u)).

Because f is a rational transduction from Σ*ctrl to Mdata, f⁻¹ is a rational transduction
from Mdata to Σ*ctrl, and Mdata is either a free monoid, or a free commutative monoid,
or a free partially commutative monoid, we know from Theorems 3.5, 3.27 and 3.28 that
f⁻¹ ∘ f is either a rational or a multi-counter transduction. The result will thus
be exact in almost all cases: only multi-counter transductions must be approximated by
one-counter transductions.

We cannot compute an exact relation ∼e, since Ae depends on the execution e. More-
over, guards of conditionals and loop bounds are not taken into account for the moment,
and the only approximation of Ae we can use is the full language A = Lctrl of control
words. Eventually, the approximate conflict relation ∼ we compute is the following:

∀u, v ∈ Lctrl : u ∼ v ⟺def v ∈ f⁻¹(f(u)).    (4.16)

In all cases, we get a transducer realization (rational or one-counter) of transduction ∼.
This realization is often exact on pairs of control words which are effectively
executed.

One may immediately notice that testing ∼ for emptiness is equivalent to testing
whether two pointers are aliased [Deu94, Ste96], and emptiness is decidable for rational
and algebraic transductions (see Chapter 3). This is an important application of our
analysis, considering the fact that ∼ is often exact in practice.

Notice also that this computation of ∼ does not require access functions to be rational
functions: if only a rational transduction approximation of f were available, one could still
compute relation ∼ using the same techniques. However, a general approximation scheme
for function f has not been designed, and further study is left for future work.

To build the dependence transducer, we first need to restrict relation ∼e to pairs of write
accesses or read and write accesses, and then to intersect the result with the lexicographic
order <lex:

∀e ∈ E, ∀u, v ∈ Lctrl : u δe v ⟺ u (∼e ∩ ((W × W) ∪ (W × R) ∪ (R × W)) ∩ <lex) v.

Thanks to techniques described in Section 3.6.2, we can always compute a conservative
approximation δ of δe. Relation δ is realized by a rational transducer in the case of trees
and by a one-counter transducer in the case of arrays or nested trees and arrays.

Approximations may either come from the previous approximation of ∼e or from
the intersection itself. The intersection may indeed be approximate in the case of trees
and nested trees and arrays, because rational relations are not closed under intersection
(see Section 3.3). But thanks to Proposition 3.13 it will always be exact for arrays. More
details for each data structure can be found in Sections 4.4, 4.5 and 4.6. We can now
give a general dependence analysis algorithm for our program model. The Dependence-
Analysis algorithm is exactly the same for every kind of data structure, but individual
steps may be implemented differently.

Dependence-Analysis (program)
    program: an intermediate representation of the program
    returns a dependence relation δ between all accesses
    1  f ← Compute-Storage-Mappings (program)
    2  κ ← (f⁻¹ ∘ f)
    3  if κ is a multi-counter transduction
    4      then κ ← one-counter approximation of κ
    5  if the underlying rational transducer of κ is not left-synchronous
    6      then κ ← resynchronization, with or without approximation, of κ
    7  δ ← κ ∩ ((W × W) ∪ (W × R) ∪ (R × W))
    8  δ ← δ ∩ <lex
    9  return δ
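On a finite abstraction, where transductions are plain sets of pairs, the algorithm above can be sketched end to end; all names are illustrative, the storage mapping is hypothetical, and lex_before stands for any realization of <lex:

```python
# Toy Dependence-Analysis: steps 1-2 build kappa = f^-1 o f, step 7
# keeps pairs with at least one write, step 8 intersects with <lex.
def dependence_analysis(f, writes, lex_before):
    kappa = {(u, v) for u in f for v in f if f[u] == f[v]}
    kappa = {(u, v) for (u, v) in kappa if u in writes or v in writes}
    return {(u, v) for (u, v) in kappa if lex_before(u, v)}

f = {"a1": "x", "b1": "x", "b2": "y"}   # hypothetical storage mapping
deps = dependence_analysis(f, writes={"a1"}, lex_before=lambda u, v: u < v)
```

The real algorithm works on transducers, never on enumerated sets; the set version only mirrors the algebraic steps.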

The result of Dependence-Analysis is limited to dependences on a specific data structure. To get the full dependence relation of the program, it is necessary to compute the union for all the data structures involved.

Remember the formal definition in Section 2.4.2: the exact reaching definition relation is defined as a lexicographic selection of the last write access in dependence with a given read access, i.e.

∀e ∈ E, ∀u ∈ Re : σe(u) = max<lex {v ∈ We : v δe u}.

Clearly, this maximum is unique for each read access u in the course of execution. In the case of an exact knowledge of δe, and when this relation is left-synchronous, one may easily compute an exact reaching definition relation, using lexicographic selection, see Section 3.4.3.

The problem is that δe is not known precisely in general, and the above solution is rarely applicable. Moreover, using the computation scheme above, conditionals and loop bounds have not been taken into account: the result is that many non-existing accesses are considered dependent for relation δ. We should thus be looking for a conservative approximation σ of σe, built on the available approximate dependence relation δ. Relying on δ makes computation of σ from (4.17) almost impossible, for two reasons: first, a write v may be in dependence with u without being executed by the program, and second, all writes which are not effectively in conflict with u may be considered as possible dependences.

However, we know we can compute an approximate reaching definition relation from δ when at least one of the following conditions is satisfied.

- Suppose we can prove that some statement instance does not execute, and that this information can be inserted in the original transduction: some flow dependences can be removed. The remaining instances are described by predicate emay(w) (instances that may execute).

142 CHAPTER 4. INSTANCEWISE ANALYSIS FOR RECURSIVE PROGRAMS

- On the opposite, if we can prove that some instance w does execute, and if this information can be inserted in the original transduction, then writes executing before w are "killed": they cannot reach an instance u such that w δ u. Instances that are effectively executed are described by predicate emust(w) (instances that must execute).

- Finally, one may have some information econditional(v, w) about an instance w that does execute whenever another instance v does: this "conditional" information is used the same way as the former predicate emust.

The more precise the predicates emay, emust and econditional, the more precise the reaching definition relation. In some cases, one may even compute an exact reaching definition relation.

Now, remember all our work since Section 4.2 has completely ignored guards in conditional statements and loop bounds. This information is of course critical when trying to build predicates emay, emust and econditional. Retrieving this information can be done using both the results of induction variable analysis (see Section 4.2) and additional analyses of the value of variables [CH78, Mas93, MP94, TP95]. Such external analyses would for example compute loop and recursion invariants.

Another source of information, mostly for predicate econditional, is provided by a simple structural analysis of the program, which consists in exploiting every piece of information hidden in the program syntax:

- in an if then else construct, either the then or the else branch is executed;
- in a while construct, assuming some instance of a statement does execute, all instances preceding it in the while loop also execute;
- in a sequence of non-guarded statements, all instances of these statements are simultaneously executed or not.

Notice this kind of structural analysis was already critical for nested loops [BCF97, Bar98, Won95].

Another very important structural property is described with the following additional definition:

Definition 4.2 (ancestor) Consider an alphabet Σctrl of statement labels and a language Lctrl of control words. We define Σunco: a subset of Σctrl made of all block labels which are not conditionals or loop blocks, and all (unguarded) procedure call labels, i.e. blocks whose execution is unconditional.

Let r and s be two statements in Σctrl, and let u be a strict prefix of a control word wr ∈ Lctrl (an instance of r). If v ∈ Σunco* (without labels of conditional statements) is such that uvs ∈ Lctrl, then uvs is called an ancestor of wr.

The set of ancestors of an instance u is denoted by Ancestors(u).

This definition is best understood on a control tree, such as the one in Figure 4.1.b page 124: black square FPIAAaAaAJs is an ancestor of FPIAAaAaAJQPIAABBr, but not gray squares FPIAAaAJs and FPIAAJs. Now, observe the formal ancestor definition:

1. execution of wr implies execution of u, because it is in the path from the root of the control tree to node wr;

2. execution of u implies execution of every ancestor uvs, because v is made of unconditional block labels only, without conditional statements.

We thus have the following result:

Proposition 4.1 If an instance u executes, then all ancestors of u also execute. This can be written using predicates emust and econditional:

∀u ∈ Lctrl : econditional(u, Ancestors(u));
∀u ∈ Lctrl : emust(u) ⟹ emust(Ancestors(u)).
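Definition 4.2 and Proposition 4.1 can be sketched on finite data: ancestors are obtained by a bounded search over strict prefixes, and emust propagates to them. The membership predicate for Lctrl, the label sets, and the bound on |v| are hypothetical stand-ins.

```python
from itertools import product

# Ancestors of w (Definition 4.2): words u.v.s with u a strict prefix
# of w, v a bounded word over the unconditional labels, and u.v.s a
# valid control word.
def ancestors(w, statements, unco, in_lctrl, max_v=2):
    result = set()
    for i in range(len(w)):
        u = w[:i]
        for s in statements:
            for k in range(max_v + 1):
                for v in product(sorted(unco), repeat=k):
                    cand = u + "".join(v) + s
                    if in_lctrl(cand):
                        result.add(cand)
    return result

lctrl = {"s", "Fs", "FPs", "FPr"}          # hypothetical control words
anc = ancestors("FPr", {"s"}, {"P"}, lctrl.__contains__)
```

Per Proposition 4.1, if emust("FPr") holds, emust would propagate to every word in anc.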

At last, we can define a conservative approximation σ of the reaching definition relation, built on δ, emay, emust and econditional:

∀u ∈ R : σ(u) = {v ∈ δ(u) : emay(v) ∧ (∄w ∈ δ(u) : v <lex w ∧ (emust(w) ∨ econditional(v, w) ∨ econditional(u, w)))}.    (4.17)

Predicates emay, emust, econditional should define rational sets, in order to compute the algebraic operations involved in (4.17). When, in addition, relation δ is left-synchronous, closure under union, intersection, complementation, and composition allows unapproximate computation of σ with (4.17).
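On finite sets, the selection (4.17) reads as a filter: a candidate definition v of read u survives if it may execute and no later write is known to kill it. The predicates and the order below are illustrative stand-ins, not computed analyses.

```python
# Approximate reaching definitions of u, per (4.17), on finite sets.
def reaching_defs(u, candidates, lex_before, emay, emust, econd):
    kept = set()
    for v in candidates:
        if not emay(v):
            continue
        killed = any(lex_before(v, w)
                     and (emust(w) or econd(v, w) or econd(u, w))
                     for w in candidates)
        if not killed:
            kept.add(v)
    return kept

defs = reaching_defs("u", {"w1", "w2"},
                     lex_before=lambda a, b: a < b,
                     emay=lambda w: True,
                     emust=lambda w: w == "w2",
                     econd=lambda a, b: False)
```

Here w1 is killed by the later write w2, which must execute; the actual computation performs the same selection on transducers.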

However, designing a general computation framework for these predicates is left for future work, and we will only consider a few "rules" useful in our practical examples. Instead of building automata for predicates emay, emust and econditional then computing σ from (4.17), we present a few rewriting rules to refine the sets of possible reaching definitions, starting from a very conservative approximation of the reaching definition relation: the restriction of dependence relation δ to flow dependences (i.e. from a write to a read access). This technique is less general than solving (4.17), but it avoids complex, and approximate, algebraic operations.

Applicability of the rewriting rules is governed by the compile-time knowledge extracted by external analyses, such as analysis of conditional expressions, detection of invariants, or structural analysis. In Section 4.5, we will demonstrate practical usage of these rules when applying our reaching definition analysis framework to program Queens. For the moment, we choose a statement s with a write reference to memory, and try to refine sets of possible reaching definitions among instances of s. Refining sets of possible reaching definitions which are instances of several statements will be discussed at the end of this section.

The vpa Property (Values are Produced by Ancestors)

This property comes from the common observation about recursive programs that "values are produced by ancestors". Indeed, a lot of sort, tree, or graph-based algorithms perform in-depth explorations where values are produced by ancestors. This behavior is also strongly assessed by scope rules of local variables.

vpa ⟺ (∀e ∈ E, u ∈ Re, v ∈ We : v = σe(u) ⟹ v ∈ Ancestors(u)).

Since all possible reaching definitions are ancestors of the use, rule vpa consists in removing all transitions producing non-ancestors. Formally, all transitions σ|σ′ s.t. σ <txt σ′ and σ′ ≠ s are removed.
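Rule vpa then amounts to filtering transitions of the reaching-definition transducer. The transition tuples, textual ranks and non-ancestor test below are hypothetical stand-ins, mirroring the removal of transition J|aA for program Queens in Section 4.5.

```python
txt_rank = {"B": 0, "J": 1, "a": 2}        # hypothetical <txt order

# Drop transitions whose output starts with a label strictly after the
# input label in textual order: such outputs cannot build ancestors.
def apply_vpa(transitions):
    return [(q, i, o, q2) for (q, i, o, q2) in transitions
            if not (o and txt_rank.get(o[0], -1) > txt_rank.get(i, -1))]

trans = [(1, "J", "J", 1), (1, "J", "aA", 2), (2, "a", "a", 2)]
```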

We may define one other interesting property useful to automatic property checking; its associated rewriting rule is not given.

The oka Property (One Killing Ancestor)

If it can be proven that at least one ancestor vs of a read u is in dependence with u, it kills all previous writes since it does execute when u does.

oka ⟺ (∀u ∈ R : σ(u) ≠ ⊥ ⟹ (∃v ∈ Ancestors(u) : v ∈ σ(u))).

Property Checking

Property oka can be discovered using invariant properties on induction variables. Checking for property vpa is difficult, but we may rely on the following result: when property oka holds, checking vpa is equivalent to checking whether an ancestor vs in dependence with us may be followed, according to the lexicographic order, by a non-ancestor instance w in dependence with us.

Other properties can be obtained by more involved analyses: the problem is to find a relevant rewriting rule for each one.

Now, remember we restricted ourselves to one assignment statement s when presenting the rewriting rules. Designing rules which handle the global flow of the program is a bit more difficult. When comparing possible reaching definition instances of two writes s1 and s2, it is not possible in general to decide whether one may "kill" the other without a specific transducer (rational or one-counter, depending on the data structure). The problem is thus to intersect two rational or algebraic relations, which cannot be done without approximations in general, see Sections 3.6 and 3.7. In many cases, however, storage mappings for s1 and s2 are very similar, and exact results can be easily computed.

The Reaching-Definition-Analysis algorithm is a general algorithm for reaching definition analysis inside our program model. Algebraic operations on sets and relations in the second loop of the algorithm may yield approximate results, see Sections 3.4, 3.6 and 3.7. The intersection with R × {w : emay(w)} in the third line serves the purpose of restricting the domain to read accesses and the image to writes which may execute; it can be computed exactly since R × {w : emay(w)} is a recognizable relation. The Reaching-Definition-Analysis algorithm is applied to program Queens in Section 4.5.

Notice that all output and anti-dependences are removed by the algorithm, but some spurious flow dependences may remain when the result is approximate.

Now, there is something missing in this presentation of reaching definition analysis: what about the ⊥ instance? When predicates emust(v) or econditional(u, v) are empty for all possible reaching definitions v of a read instance u, it means that an uninitialized value may be read by u, hence that ⊥ is a possible reaching definition; and the reciprocal is true. In terms of our "practical properties", oka can be used to determine whether ⊥ is a possible reaching definition or not. This gives an automatic way to insert ⊥ when needed in the result of Reaching-Definition-Analysis.

To conclude this section, we have shown a very clean and powerful framework for instancewise dependence analysis of recursive programs, but we should also recognize the limits of relying on a list of refinement rules to compute an approximate reaching

4.4. THE CASE OF TREES 145

Reaching-Definition-Analysis (program)
    program: an intermediate representation of the program
    returns a reaching definition relation σ between all accesses
    1  compute emay, emust and econditional using structural and external analyses
    2  σ ← Dependence-Analysis (program)
    3  σ ← σ ∩ (R × {w : emay(w)})
    4  for each assignment statement s in program
    5      do check for properties oka, vpa, and other properties
    6             using external static analyses or asking the user
    7         apply refinement rules on σ accordingly
    8  for each pair of assignment statements (s, t) in program
    9      do kill ← {(us, w) ∈ W × R : (∃vt ∈ W : us δ w ∧ vt δ w ∧ us <lex vt
   10                 ∧ (emust(vt) ∨ econditional(us, vt) ∨ econditional(w, vt)))}
   11         σ ← σ − kill
   12  return σ

definition relation from an approximate dependence relation. Now that the feasibility of instancewise reaching definition analysis for recursive programs has been proven, it is time to work on a formal framework to compute predicates emay, emust and econditional, from which we could expect a powerful reaching definition analysis algorithm.

We now detail the dependence and reaching definition analysis in the case of a tree structure. Practical computations will be performed on program BST presented in Section 4.2.

The first part of the Dependence-Analysis algorithm consists in computing the storage mapping. When the underlying data structure is a tree, its abstraction is the free monoid Mdata = {l, r}* and the storage mapping is a rational transduction between free monoids. Computation of function f for program BST has already been done in Section 4.2.5. Figure 4.7 shows a rational transducer realizing rational function f. Following the lines of Section 2.3.1 page 68, the alphabet of statement labels has been extended to distinguish between distinct references in I2, J2, b and e, yielding new labels I2p, I2p->l, J2p, J2p->r, bp, bp->l, ep and ep->r (these new labels may only appear as the last letter in a control word).

Computation of κ is done thanks to Elgot and Mezei's theorem, and yields a rational transduction. The result for program BST is given by the transducer in Figure 4.8.

When κ is realized by a left-synchronous transducer, the last part of the Dependence-Analysis algorithm does not require any approximation: dependence relation δ = κ ∩ <lex can be computed exactly (after removing conflicts between reads in κ). It is the case for program BST, and the exact dependence analysis result is shown in Figure 4.9. In the general case, a conservative left-synchronous approximation of κ must be computed, see Section 3.7.

One may immediately notice that every pair (u, v) accepted by the dependence transducer is of the form u = wu′ and v = wv′ where w ∈ {F, P, L, R, I1, J1}* and u′, v′ do not hold any recursive call, i.e. L or R. That means that all dependences lie between instances of the same block I1 or J1. We will show in Section 5.5 that this result can be used to run the first if block (statement I1) in parallel with the second (statement J1).

Finally, it appears that dependence transduction δ is a rational function, and the


[Figure 4.7: rational transducer realizing the storage mapping f of program BST]

[Figure 4.8: rational transducer for the conflict relation κ of program BST]

restriction of to pairs (u; v) of a read u and a write v yields the empty relation ! Indeed,

the only dependences on program BST are anti-dependences.

4.5. THE CASE OF ARRAYS 147

[Figure 4.9: left-synchronous transducer for the dependence relation δ of program BST]

We now detail the dependence and reaching definition analysis in the case of an array structure. Practical computations will be performed on program Queens presented in Section 4.1.

[Figure 4.10: rational transducer realizing the storage mapping f of program Queens]

The first part of the Dependence-Analysis algorithm consists in computing the storage mapping. When the underlying data structure is an array, its abstraction is the free commutative monoid Mdata = Z. Computation of function f for program Queens has already been done in Section 4.2.5. Figure 4.10 shows a rational transducer realizing f, see (4.6).

Computation of κ is done thanks to Theorem 3.27, and yields a one-counter transduction. The result for program Queens is given by the transducer in Figure 4.11, with four initial states.
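The counter of such a transducer plays the role of an array offset. The underlying mechanism can be sketched with a minimal one-counter acceptor; states, alphabet and transitions are illustrative (it recognizes words Qⁿbⁿ, with acceptance on a final state and a zero counter):

```python
# Minimal one-counter automaton: transitions map (state, letter) to
# (next_state, counter_delta); the run blocks on a negative counter.
def accepts(word, start, finals, delta):
    state, counter = start, 0
    for a in word:
        if (state, a) not in delta:
            return False
        state, d = delta[(state, a)]
        counter += d
        if counter < 0:
            return False
    return state in finals and counter == 0

delta = {(0, "Q"): (0, +1), (0, "b"): (1, -1), (1, "b"): (1, -1)}
```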

To compute a dependence relation δ, one first restricts κ to pairs of accesses with at least one write, then intersects the result with the lexicographic order. From Proposition 3.13 the underlying rational transducer of κ is recognizable, hence left-synchronous (from Theorem 3.12) and can thus be resynchronized with the constructive proof of Theorem 3.19 to get a one-counter transducer whose underlying rational transducer is left-synchronous.

Resynchronization of κ has been applied to program Queens in Figure 4.12: it is limited to conflicts of the form (us, vr), us, vr ∈ Lctrl. The remaining three fourths of the transducer have not been represented because they are very similar to the first fourth and not used for reaching definition analysis. The underlying rational transducer is only pseudo-left-synchronous because resynchronization has not been applied completely, see Section 3.6 and Definition 3.28.

Intersection with <lex is done with Theorem 3.14. As a result, the dependence relation δ can be computed exactly and is realized by a one-counter transducer whose underlying rational transducer is left-synchronous.

This is applied to program Queens in Figure 4.13, starting from the pseudo-left-synchronous transducer in Figure 4.12. Knowing that B <txt J <txt a and s <txt Q, transitions J|a and s|Q are kept but transitions a|J, a|B and J|B are removed (and the transducer is trimmed). This time, only one third of the actual transducer is shown: the transducer realizing flow dependences. Anti and output dependences are realized by very similar transducers, and are not used for reaching definition analysis.

We now demonstrate the Reaching-Definition-Analysis algorithm on program Queens. A simple analysis of the inner loop shows that j is always less than k. This proves that for any instance w of r, there exist u, v ∈ Σctrl* s.t. w = uQvr and us δ uQvr. Because us is an ancestor of uQvr, property oka is satisfied. The dependence transducer in Figure 4.13 shows that all instances of s executing after us are of the form uQv′s, and it also shows that reading Q increases the counter: the result is that no instance executing after us may be in dependence with w. In combination with oka, property vpa thus holds. Applying rule vpa, we can remove transition J|aA which does not yield ancestors. We get the one-counter transducer in Figure 4.14. Notice that the ⊥ instance (associated with uninitialized values) is not accepted as a possible reaching definition: this is because property oka ensures that at least an ancestor of every read instance defined a value.

The transducer is "compressed" in Figure 4.15 to increase readability. It is easy to prove that this result is exact: a unique reaching definition is computed for every read instance. However, the general problem of the functionality of an algebraic transduction is "probably" undecidable. As a result, we achieved, in a semi-automated way, the best precision possible. This precise result will be used in Section 5.5 to parallelize program Queens.

We now detail the dependence and reaching definition analysis in the case of a nested list and array structure. Practical computations will be performed on program Count presented in Section 4.3.

4.6. THE CASE OF COMPOSITE DATA STRUCTURES 149

[Figure 4.11: one-counter transducer for the conflict relation κ of program Queens]

The first part of the Dependence-Analysis algorithm consists in computing the storage mapping. When the underlying data structure is built of nested trees and arrays, its abstraction is a free partially commutative monoid Mdata. Computation of function f for program Count has already been done in Section 4.2.5.


[Figure 4.12: pseudo-left-synchronous one-counter transducer for conflicts (us, vr) of program Queens]

Computation of κ is done thanks to Theorem 3.28, and yields a one-counter transduction. On program Count, there are no write accesses to the inode structure. Now, we could be interested in an analysis of conflict misses for cache optimization [TD95]. The result f⁻¹ ∘ f for program Count is thus interesting, and it is the identity relation! This proves that the same memory location is never accessed twice during program execution.

Now, when computing a dependence relation in general, Proposition 3.13 does not apply: it is necessary in general to approximate the underlying rational transducer by a left-synchronous one. Finally, the Reaching-Definition-Analysis algorithm has no technical issues specific to nested trees and arrays.

Before evaluating our analysis for recursive programs, we summarize its program model restrictions. First of all, some restrictions are required to simplify algorithms and should be considered harmless thanks to previous code transformations (see Sections 2.2 and 4.2 for details):

- no function pointers (i.e. higher-order control structures) and no gotos are allowed;

4.7. COMPARISON WITH OTHER ANALYSES 151

[Figure 4.13: one-counter transducer for the flow dependences of program Queens]

- a loop variable is initialized at the loop entry and used only inside this loop;
- expressions in right-hand side may hold conditionals but no function calls and no loops;
- every data structure subject to dependence or reaching definition analysis must be declared global.

Now, some restrictions on the program model cannot be avoided with preliminary program transformations, but should be removed in further versions of the analysis, thanks to appropriate approximation techniques (induction variables are defined in Section 4.2):

- only scalars, arrays, trees and nested trees and arrays are allowed as data structures;
- induction variables must follow very strong rules regarding initialization and update;
- every array subscript must be an affine function of integer induction variables and symbolic constants;
- every tree access must dereference a pointer induction variable or a constant.


[Figure 4.14: one-counter transducer for the reaching definition relation σ of program Queens]

[Figure 4.15: compressed one-counter transducer for the reaching definition relation σ of program Queens]

Finally, one restriction is very deeply rooted in the monoid abstraction for tree structures, and we expect no general way to avoid it:

- random insertions and deletions in trees are forbidden (allowed only at trees' leaves).

We are now able to compare the results of our analysis technique with those of classical static analyses, some of which also handle our full program model, and with those of the existing instancewise analyses for loop nests.

Static dependence and reaching definition analyses generally compute the same kind of results, whether they are based on abstract interpretation [Cou81, JM82, Har89, Deu94] or other data-flow analysis techniques [LRZ93, BE95, HHN94, KSV96]. A comprehensive study of static analysis useful to parallelization of recursive programs can be found in [RR99]. Comparison of the results is rather easy: none of these static analyses is instancewise.4 None of these static analyses is able to tell which instance of which statement

4 We think that building an instancewise analysis of practical interest in the data-flow or abstract interpretation framework is indeed possible, but very few works have been made in this direction, see


are very useful to remove a few restrictions in our program model, and they also compute properties useful to instancewise reaching definition analysis. Remember that our own instancewise reaching definition analysis technique makes a heavy use of so called "external" analyses, which precisely are classical static analyses. A short comparison between parallelization from the results of our analysis and parallelization from static analyses will be proposed in Section 5.5, along with some practical examples.

Comparison with instancewise analyses for loop nests is more topical, since our technique was clearly intended to extend such analyses to recursive programs. A simple method to get a fair evaluation consists in running both analyses on their common program model subset. The general result is not surprising: today's most powerful reaching definition analyses for loop nests such as fuzzy array dataflow analysis (FADA) [BCF97, Bar98] and constraint-based array dependence analysis [WP95, Won95] are far more precise than our analysis for recursive programs. There are many reasons for that:

- we do not use conditionals and loop bounds to establish our results, or when it is the case, it is through "external" static analyses;
- multi-dimensional arrays are roughly approximated by one-dimensional ones;
- rational and algebraic transducers have a limited expressive power when dealing with integer parameters (only one counter can be described);
- some critical algebraic operations such as intersection and complementation are not decidable and thus require further approximations.

A major difference between FADA and our analysis for recursive programs is deeply rooted in the philosophy of each technique.

- FADA is a fully exact process with symbolic computations and "dummy" parameters associated with unpredictable constraints, and only one approximation is performed at the end; this ensures that no precious data-flow information is lost during the computation process (see Section 2.4.3).
- Our technique is not as clever, since many approximation stages can be involved. It is more similar to iterative methods in that sense, and hence it is far from being optimal: some approximations are made even if the mathematical abstraction could have enough expressive power to avoid it.

But the comparison also reveals very positive aspects, in terms of all the information available in the result of our analysis:

- exactness of the result is equivalent to deciding the functionality of a transduction, and is thus polynomial for rational transductions; but it is unknown for algebraic ones, and decidability of the finiteness of a set of reaching definitions can help in some cases;
- emptiness of a set of reaching definitions is decidable, which allows automatic detection of read accesses to uninitialized variables;

[DGS93, Tzo97, CK98].

154 CHAPTER 4. INSTANCEWISE ANALYSIS FOR RECURSIVE PROGRAMS

- in the case of rational transductions, dependence testing reduces to intersecting rational languages of control words, because of Nivat's Theorem 3.6 and the fact that rational languages are closed under intersection; this is very useful for parallelization;
- in the case of algebraic transductions, dependence testing is equivalent to the intersection of an algebraic language and a rational one, because of Nivat's Theorem 3.21 for algebraic transductions and Evey's Theorem 3.24; this is still very useful for parallelization.

We refer to Section 5.5 for additional comparisons between the applicability of our analysis and loop nest analyses to parallelization.

4.8 Conclusion

We presented an application of formal language theory to the automatic discovery of some semantic properties of programs: instancewise dependences and reaching definitions. When programs are recursive and nothing is known about recursion guards, only conservative approximations can be hoped for. In our case, we approximate the relation between reads and their reaching definitions by a rational (for trees) or algebraic (for arrays) transduction. The result of the reaching definition analysis is a transducer mapping control words of read instances to control words of write instances. Two algorithms for dependence and reaching definition analysis of recursive programs were designed. Incidentally, these results showed the use of the new class of left-synchronous transductions over free monoids.

We have applied our techniques on several practical examples, showing excellent approximations and sometimes even exact results. Some problems obviously remain. First, some strong restrictions on the program model limit the practical use of our technique. We should thus work on a graceful degradation of our analyses to encompass a larger set of recursive programs: for example, restrictions on induction variable operations could perhaps be removed by allowing computation of approximate storage mappings. Second, reaching definition analysis is not quite mature now, since it relies on rather ad-hoc techniques whose general applicability is unknown. More theoretical studies are needed to decide whether precise instancewise reaching definition information can be captured by rational and algebraic transducers.

We will show in the next chapters that decidability properties on rational and algebraic transductions allow several applications of our framework, especially in automatic parallelization of recursive programs. These applications include array expansion and parallelism extraction.


Chapter 5

Parallelization via Memory

Expansion

The design of program transformations dedicated to dependence removal is a well-studied topic, as far as nested loops are concerned. Techniques such as conversion to single-assignment form [Fea91, GC95, Col98], privatization [MAL93, TP93, Cre96, Li92], and many optimizations for efficient memory management [LF98, CFH95, CDRV97, QR99] have proven useful for practical parallelization of programs (automatic or not). However, these works have mostly targeted affine loop nests, and few techniques have been extended to dynamic control flow and general array subscripts. Very interesting issues arise when trying to expand data structures in unrestricted nests of loops, and because of the necessary data-flow restoration, confluent interests with the SSA (static single-assignment) framework [CFR+91] become obvious.

Motivation for memory expansion and introduction of the fundamental concepts are the first goal of Section 5.1; then, we study specific problems related to non-affine nests of loops and we design practical solutions for a general single-assignment form transformation. The novel expansion techniques presented in Sections 5.2, 5.3 and 5.4 are contributions to bridging the gap between the rich applications of memory expansion techniques for affine loop nests and the few results for irregular codes.

When extending the program model to recursive procedures, the problem is of another nature: principles of parallel processing are then very different from the well-mastered data-parallel model for nested loops. Applicable algorithms have mostly been designed for statementwise dependence tests, whereas our analysis computes an extensive instancewise description of the dependence relation! There is of course a large gap between the two approaches, and we should now demonstrate that using such precise information brings practical improvements over existing parallelization techniques. These issues are addressed by Section 5.5, starting with an investigation of memory expansion techniques for recursive programs. Because this last section addresses a new topic, several negative or disappointing answers are mixed with successful results.

5.1 Motivations and Tradeoffs

To point out the most important issues related to memory expansion, and to motivate the following sections of this chapter, we start with a study of the well-known expansion technique called conversion to single-assignment form. Both abstract and practical points of view are discussed. Several results presented here have already been presented by many authors, with their own formalism and program model, but we preferred to rewrite most of this work in our syntax, to fix the notations and to show how memory expansion also makes sense outside the loop nest programming model.

One of the most usual and simplest expansion schemes is conversion to single-assignment (SA) form. It is the extreme case where each memory location is written at most once during execution. This is slightly different from static single-assignment (SSA) form [CFR+91, KS98], where each variable is written in at most one statement of the program, and expansion is limited to variable renaming.

The idea of conversion to SA form is to replace every assignment to a data structure D by an assignment to a new data structure Dexp whose elements have the same type as elements of D, and are in one-to-one mapping with the set W of all possible write accesses during any program execution. Each element of Dexp is associated with a single write access. This aggressive transformation ensures that the same memory location is never written twice in the expanded program. The second step is to transform the read references accordingly, and is called restoration of the flow of data. Instancewise reaching definition information is of great help to achieve this: for a given program execution e ∈ E, the value read by some access ⟨ι, ref⟩ to D in the right-hand side of a statement is precisely stored in the element of Dexp associated with σ_e(⟨ι, ref⟩) (see Section 2.4 for notations and definitions).

In general, exact knowledge of σ_e for each execution e is not available at compile time: the result of instancewise reaching definition analysis is an approximate relation σ. The compile-time data-flow restoration scheme above is thus inapplicable when σ(⟨ι, ref⟩) is a non-singleton set: the idea is then to generate run-time data-flow restoration code, which tracks the last instance executed in σ(⟨ι, ref⟩). As we have seen for general expansion schemes in Section 1.2, this run-time restoration code is hidden in a φ function whose argument is the set σ(⟨ι, ref⟩) of possible reaching definitions.

A few notations are required to simplify the syntax of expanded programs.

- CurIns holds the run-time instance value, encoded as a control word or iteration vector, for any statement in the program. It is supposed to be updated on-line at function calls, loop iterations and every block entry. More details about this topic appear in Section 5.1.3 and Section 5.5.3.

- φ has the syntax of a function from sets of run-time instances to untyped values, but its semantics is to summarize a piece of data-flow restoration code. It is very similar to φ functions in the SSA framework [CFR+91, KS98]. Code generation for φ functions is the purpose of Section 5.1.2.

- Dexp is the expanded data structure associated with some original data structure D. Its "abstract" syntax is inherited from arrays: Dexp[set of element names] for the declaration and Dexp[element name] for a read or write access. In practice, element names are either integer vectors or words, and Dexp is an array, a tree, or a nest of trees and arrays. Its "concrete" syntax is then implemented as an array or as a pointer to a tree structure. See Sections 5.1.3 and 5.5.1 for details.

We now present Abstract-SA, a very general algorithm to compute the single-assignment form. This algorithm is neither really new nor really practical, but it defines a general transformation scheme for SA programs, independently of the control and data structures. It takes as input the sequential program and the result of an instancewise reaching definition analysis, seen as a function. Control structures are left unchanged. This algorithm is very "abstract" since data structures are not defined precisely and some parts of the generated code have been encapsulated in high-level notations: CurIns and φ.

Abstract-SA (program, W, σ)
program: an intermediate representation of the program
W: a conservative approximation of the set of write accesses
σ: a reaching definition relation, seen as a function
returns an intermediate representation of the expanded program
1 for each data structure D in program
2   do declare a data structure Dexp[W]
3      for each statement s assigning D in program
4        do left-hand side of s ← Dexp[CurIns]
5      for each reference ref to D in program
6        do ref ← if (σ(⟨CurIns, ref⟩) == {⊥}) ref
                  else if (σ(⟨CurIns, ref⟩) == {ι}) Dexp[ι]
                  else φ(σ(⟨CurIns, ref⟩))
7 return program
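To make the scheme concrete, here is a hedged Python sketch (not the thesis' intermediate representation) of what Abstract-SA produces on a one-statement loop: each write instance ⟨S, i⟩ gets a private cell, and each read fetches the value from its reaching definition. All names are assumptions of this illustration.

```python
# Toy single-assignment conversion: the original program accumulates into
# one scalar x; the expanded program writes each instance <S, i> into its
# own cell xS[i] and reads from sigma(<S, i>) = <S, i-1> (or the initial
# value when i == 0, the "bottom" case).
N = 5

# Original program: x = 0; for i: S: x = x + i
x = 0
for i in range(N):
    x = x + i
original_result = x

# Single-assignment form: one memory cell per write instance of S.
xS = [None] * N
for i in range(N):
    value = 0 if i == 0 else xS[i - 1]   # reaching definition of the read
    xS[i] = value + i                    # no cell is ever overwritten
expanded_result = xS[N - 1]              # last write instance holds the result

assert expanded_result == original_result
```

Since every cell is written exactly once, all output and anti-dependences on x disappear; only the flow dependence from ⟨S, i−1⟩ to ⟨S, i⟩ remains.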

We will show in the following that several "abstract" parts of the algorithm can be implemented when dealing with "concrete" data structures. Generating code for the φ function is the purpose of the next section.

When generating code for φ functions, the common idea is to compute at run time the last instance that may possibly be a reaching definition of some use. In general, for each expanded data structure Dexp one needs an additional structure in one-to-one mapping with Dexp. In the static single-assignment framework for arrays [KS98], these additional structures are called @-structures and store statement instances. Dealing with a more general single-assignment form, we propose another semantics for additional structures, hence another notation: the data structure in one-to-one mapping with Dexp is a φ-structure denoted by φDexp.

To ensure that run-time restoration of the flow of data is possible, elements of φDexp should store two pieces of information: the memory location assigned in the original program, and the identity of the last instance which assigned this memory location. Because we are dealing with single-assignment programs, the identity of the last instance is already captured by the element itself (i.e. the subscript of φDexp).1 Elements of φDexp should thus store memory locations.

- φDexp is initialized to NULL before the expanded program;
- every time Dexp is modified, the associated element of φDexp is set to the value of the memory location that would have been written in the original program;
- when a read access to D in the original program is expanded into a call of the form φ(set), the φ function is implemented as the maximum, according to the sequential execution order, of all ι ∈ set such that φDexp[ι] is equal to the memory location read in the original program.

1 This run-time restoration technique is thus specific to SA form. Other expansions require different types and/or semantics of φ-structures.

Abstract-Implement-Phi (expanded)
expanded: an intermediate representation of the expanded program
returns an intermediate representation with run-time restoration code
1 for each data structure Dexp in expanded
2   do if there are φ functions accessing Dexp
3      then declare a structure φDexp with the same shape as Dexp, initialized to NULL
4           for each read reference ref to Dexp whose expanded form is φ(set)
5             do for each statement s involved in set
6                  do refs ← write reference in s
7                     if not already done for s
8                       then following s insert φDexp[CurIns] = f_e(CurIns, refs)
9                φ(set) ← Dexp[max_<seq {ι ∈ set : φDexp[ι] = f_e(CurIns, ref)}]
10 return expanded

Abstract-Implement-Phi is the abstract algorithm to generate the code for φ functions. In this algorithm, the syntax f_e(CurIns, ref) means that we are interested in the memory location accessed by reference ref, not that some compile-time knowledge of f_e is required. Of course, practical details and optimizations depend on the control structures; see Section 5.1.4. Notice that the generated code is still in SA form: each element of a new φ-structure is written at most once.
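The mechanics of φ-structures can be sketched in Python as follows; the guard values, the statement encoding and the use of the string 'x' to stand for the memory location are assumptions of this illustration, not generated code.

```python
# Sketch of a phi-structure: a guarded write to scalar x becomes a write
# to xS[i], and phi_xS[i] records the location the instance assigned; the
# phi function returns the cell of the last (sequentially maximal)
# instance whose phi entry matches the location being read.
guard = [False, True, False, True, False]
N = len(guard)

# Original program: x = -1; for i: S: if guard[i]: x = i; R: r = x
x = -1
for i in range(N):
    if guard[i]:
        x = i
r_original = x

# Expanded program with its phi-structure, still in SA form.
xS = [None] * N
phi_xS = [None] * N                 # initialized to NULL
for i in range(N):
    if guard[i]:
        xS[i] = i                   # statement S, instance <S, i>
        phi_xS[i] = 'x'             # location written in the original program
# phi({<S,0>, ..., <S,N-1>}): maximal executed instance that wrote 'x'
writers = [i for i in range(N) if phi_xS[i] == 'x']
r_expanded = xS[max(writers)] if writers else -1

assert r_expanded == r_original
```

Note that phi_xS itself is written at most once per cell, so the restoration code stays in single-assignment form, as stated above.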

An important remark at this point is that instancewise reaching definition analysis is the key to run-time overhead optimization. Indeed, as shown by our code generation algorithm, SA-transformed programs are more efficient when φ functions are sparse. Thus, a parallelizing compiler has many reasons to perform a precise instancewise reaching definition analysis: it improves parallelism detection, allows choosing among a larger scope of parallel execution orders (depending on the "grain size" and architecture), and reduces run-time overhead. An example borrowed from program sjs in [Col98] is presented in Figure 5.1. The most precise reaching definition relation for reference A[i+j-1] in the right-hand side of R is

σ(⟨R, i, j, A[i+j-1]⟩) = if j ≥ 1 then ⟨S, i, j-1⟩
                          else if i ≥ 1 then ⟨S, i-1, j⟩
                          else ⟨T⟩.

This exact result shows that definitions associated with the reference in the left-hand side of R never reach any use. Expanding the program with a less precise reaching definition relation induces a spurious φ function, as in Figure 5.1.b. One may notice that the quast implementation in Figure 5.1.c is not really efficient and may be rather costly; but using classical optimizations such as loop peeling, or general polyhedron scanning techniques [AI91], can significantly reduce this overhead, see Figure 5.1.d. This remark advocates once more for further studies about integrating optimization techniques.


double A[N];
T   A[0] = 0;
    for (i=0; i<N; i++)
      for (j=0; j<N; j++) {
S       A[i+j] = ...;
R       A[i] = A[i+j-1] ...;
      }

Figure 5.1.a. Original program

double A[N], AT, AS[N, N], AR[N, N];
T   AT = 0;
    for (i=0; i<N; i++)
      for (j=0; j<N; j++) {
S       AS[i, j] = ...;
R       AR[i, j] = φ({⟨T⟩} ∪ {⟨S, i', j'⟩ : (i', j') <_lex (i, j)}) ...;
      }

Figure 5.1.b. SA without reaching definition analysis

double A[N], AT, AS[N, N], AR[N, N];
T   AT = 0;
    for (i=0; i<N; i++)
      for (j=0; j<N; j++) {
S       AS[i, j] = ...;
R       AR[i, j] = (if (j==0) (if (i==0) AT else AS[i-1, j]) else AS[i, j-1]) ...;
      }

Figure 5.1.c. SA with exact reaching definition analysis

double A[N], AT, AS[N, N], AR[N, N];
    AT = 0;
    AS[1, 1] = ...;
    AR[1, 1] = AT ...;
    for (i=0; i<N; i++) {
      AS[i, 1] = ...;
      AR[i, 1] = AS[i-1, 1] ...;
      for (j=0; j<N; j++) {
        AS[i, j] = ...;
        AR[i, j] = AS[i, j-1] ...;
      }
    }

Figure 5.1.d. Optimized code after loop peeling

Figure 5.1. Interaction of reaching definition analysis and run-time overhead

Eventually, one should notice that φ functions are not the only source of run-time overhead: computing reaching definitions using σ at run time may also be costly, even when σ is a function (i.e. it is exact). But there is a big difference between the two sources of overhead: run-time computation of σ can be costly because of the lack of expressiveness of control structures and algebraic operations in the language, or because of the mathematical abstraction. For example, transductions generally induce more overhead than quasts. On the contrary, the overhead of φ functions is due to the approximate knowledge of the flow of data and its non-deterministic impact on the generated code; it is thus intrinsic to the expanded program, no matter how it is implemented. In many cases, indeed, the run-time overhead to compute σ can be significantly reduced by classical optimization techniques (see the example of Figure 5.1), but this is not the case for φ functions.

In this section, we only consider intra-procedural expansion of programs operating on scalars and arrays. An extension to function calls, recursive programs and recursive data structures is studied at the end of this chapter, in Section 5.5. These restrictions simplify the exposition of a "concrete" SA algorithm in the classical loop nest framework.

When dealing with nests of loops, instancewise reaching definitions are described by an affine relation (see [BCF97, Bar98] and Section 2.4.3). We pointed out in Section 3.1.1 that, seeing an affine relation as a function, it can be written as a nested conditional called a quast [Fea91]. This representation of relation σ is especially interesting for expansion purposes, since it can be easily and efficiently implemented in a programming language. Algorithm Make-Quast introduced in Section 3.1.1 builds a quast representation for any affine relation.
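As an illustration of what a quast looks like once implemented, here is a hedged Python sketch of the exact reaching definition relation of Figure 5.1 for the read A[i+j-1] in statement R; the 0-based loop counters and the tuple encoding of instances are assumptions of this sketch.

```python
# A quast is a nested conditional over affine predicates whose leaves are
# instances (or bottom); it translates directly into if/else code.
def reaching_definition(i, j):
    if j >= 1:
        return ('S', i, j - 1)   # last write of A[i+j-1] in the same i iteration
    elif i >= 1:
        return ('S', i - 1, j)   # last write during the previous i iteration
    else:
        return ('T',)            # the initial assignment A[0] = 0

assert reaching_definition(0, 0) == ('T',)
assert reaching_definition(3, 0) == ('S', 2, 0)
assert reaching_definition(3, 4) == ('S', 3, 3)
```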

We use the following notations:

- Stmt(⟨S, x⟩) = S (the statement),
- Iter(⟨S, x⟩) = x (the iteration vector),
- Array(S) is the name of the original data structure assigned by statement S.

Given a quast representation of reaching definitions, Convert-Quast generates efficient code to retrieve the value read by some reference. This code is more or less a compile-time implementation of the conditional generated at the end of Abstract-SA. A φ function is generated when a non-singleton set is encountered. Eventually, because statements partition the set of memory locations in the single-assignment program, we use an array AS[x] instead of the Aexp[⟨S, x⟩] proposed in the abstract SA algorithm.

Thanks to Convert-Quast, we are ready to specialize Abstract-SA for loop nests. The new algorithm is Loop-Nests-SA. The current instance CurIns is implemented by its iteration vector (built from the surrounding loop variables). To simplify the exposition, scalars are seen as one-dimensional arrays with a single element. All memory accesses are thus performed through array subscripts.

The abstract code generation algorithm for φ functions can also be made precise when dealing with loop nests and arrays only. For the same reason as before, run-time instances are stored in a distinct structure for each statement: we use φAS[x] instead of φAexp[⟨S, x⟩]. The new algorithm is Loop-Nests-Implement-Phi. Efficient computation of the lexicographic maximum can be done thanks to parallel reduction techniques [RF94].

One part of the code is still unimplemented: the array declaration. The main problem regarding array declaration is to get a compile-time evaluation of its size. In many cases, loop bounds are not easily predictable at compile time. One may thus have to consider some expanded arrays as dynamic arrays whose size is updated at run time. Another solution, proposed by Collard [Col94b, Col95b], is to prefer a storage mapping optimization technique (such as the one presented in Section 5.3) to single-assignment form, and to


Convert-Quast (quast, ref)
quast: the quast representation of the reaching definition function
ref: the original reference, used when ⊥ is encountered
returns the implementation of quast as a value retrieval code for reference ref
1 switch
2   case quast = {⊥} :
3     return ref
4   case quast = {ι} :
5     A ← Array(ι)
6     S ← Stmt(ι)
7     x ← Iter(ι)
8     return AS[x]
9   case quast = {ι1, ι2, ...} :
10    return φ({ι1, ι2, ...})
11  case quast = if predicate then quast1 else quast2 :
12    return if predicate Convert-Quast (quast1, ref)
             else Convert-Quast (quast2, ref)
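A hedged Python sketch of Convert-Quast's case analysis follows; the quast encoding (sets of instance tuples, ('if', ...) tuples, the {'bottom'} leaf) and the emitted strings are assumptions of this illustration, not the thesis' code generator.

```python
# Each quast case maps to generated code: bottom -> the original reference,
# a singleton -> a direct read of the expanded cell, a larger set -> a phi
# function, an if-node -> a conditional over the two translated branches.
def convert_quast(quast, ref):
    if quast == {'bottom'}:
        return ref                       # no reaching definition
    if isinstance(quast, tuple) and quast[0] == 'if':
        _, pred, q1, q2 = quast
        return ('if ' + pred + ' ' + convert_quast(q1, ref) +
                ' else ' + convert_quast(q2, ref))
    insts = sorted(quast)
    if len(insts) == 1:                  # exact definition: read A_S[x]
        stmt, vec = insts[0]
        return 'A_%s[%s]' % (stmt, ', '.join(vec))
    return 'phi(...)'                    # non-singleton set: emit a phi

code = convert_quast(
    ('if', 'j >= 1', {('S', ('i', 'j-1'))}, {'bottom'}), 'A[i+j-1]')
assert code == 'if j >= 1 A_S[i, j-1] else A[i+j-1]'
```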

Loop-Nests-SA (program, σ)
program: an intermediate representation of the program
σ: a reaching definition relation, seen as a function
returns an intermediate representation of the expanded program
1 for each array A in program
2   do for each statement S assigning A in program
3        do declare an array AS
4           left-hand side of S is replaced by AS[Iter(CurIns)]
5      for each read reference ref to A in program
6        do σ|ref ← σ ∩ (I × ref)
7           quast ← Make-Quast (σ|ref)
8           map ← Convert-Quast (quast, ref)
9           ref ← map (CurIns)
10 return program

fold the unbounded array into a bounded one when the associated memory reuse does not impair parallelization. Such structures are very usual in high-level languages, but may result in poor performance when the compiler is unable to remove the run-time verification code. Two examples of code generation for φ functions are proposed in the next section.

Most of the run-time overhead comes from dynamic restoration of the data flow, using φ functions; and this cost is critical for non-scalar data structures distributed across processors. The technique presented in Section 5.2 (maximal static expansion) eradicates such run-time computations, at the cost of some loss in parallelism extraction. Indeed, φ functions may sometimes be a necessary condition for parallelization. This justifies the design of optimization techniques for φ function computation, which is the second purpose of this section.

We now present three optimizations to the code-generation algorithm of Section 5.1.2. The first method groups several basic optimizations for loop nests, the second one is based


Loop-Nests-Implement-Phi (expanded)
expanded: an intermediate representation of the expanded program
returns an intermediate representation with run-time restoration code
1 for each array AS in expanded
2   do dA ← dimension of array AS
3      refS ← write reference in S
4      if there are φ functions accessing AS
5        then declare an array of dA-dimensional vectors φAS
6             initialize φAS to NULL
7             for each read access to AS of the form φ(set) in expanded
8               do if not already done for S
9                  then insert
10                        φAS[Iter(CurIns)] = f_e(CurIns, refS)
11                      immediately after S
12 for each original array A in expanded
13   do for each read access φ(set) associated with A in expanded
14        do φ(set) ← parallel for (each S in Stmt(set))
                        vector[S] = max_<lex {x : ⟨S, x⟩ ∈ set ∧ φAS[x] = f_e(CurIns, ref)}
                      instance = max_<seq {⟨S, vector[S]⟩ : S ∈ Stmt(set)}
                      AStmt(instance)[Iter(instance)]
15 return expanded

on a new instancewise analysis, and the last one avoids redundant computations during the propagation of "live" definitions. The second and third methods apply to loop nests and recursive programs as well.

First Method: Basic Optimizations for Loop Nests

When dealing with nests of loops, the φ-structures are φ-arrays indexed by iteration vectors (see Loop-Nests-Implement-Phi). Because of the hierarchical structure of loop nests, accesses in a set φ(u) are very likely to share a few iteration vector components. This allows the removal of the associated dimensions in φ-arrays and reduces the complexity of lexicographic maximum computations. Another consequence is the applicability of up-motion techniques for invariant assignments. An example of φ-array simplification and up-motion is described in Figure 5.2, where function max computes the maximum of a set of iteration vectors, and where the maximum of an empty set is the vector (-1, ..., -1).

Another interesting optimization is only applicable to while loops and to for loops whose termination condition is complex: non-affine bounds, break statements or exceptions. When a loop assigns the same memory location an unbounded number of times, conversion to single-assignment form often requires a φ function, but the last defining write can be computed without using φ-arrays: its iteration vector is associated with the last value of the loop counter.2

2 The semantics of the resulting code is correct, but rather dirty: a loop variable is used outside of the

loop block.
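This last-counter-value optimization (illustrated by Figure 5.3) can be sketched in Python; the termination condition below is an assumption standing for an unanalyzable one.

```python
# When a while loop keeps overwriting the same scalar, the last defining
# write is simply the one at the last value of the counter w, so the read
# after the loop needs no phi-array lookup.
def original(n):
    x = -1
    w = 1
    while w * w < n:       # stands for a complex termination condition
        x = w * w          # S: repeated assignment to the same location
        w += 1
    return x               # R reads x after the loop

def expanded(n):
    xS = {}                # one cell per write instance <S, w>
    w = 1
    while w * w < n:
        xS[w] = w * w      # S in single-assignment form
        w += 1
    return xS[w - 1] if w > 1 else -1   # R: last counter value, no phi-array

for n in (0, 5, 50):
    assert expanded(n) == original(n)
```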


double x;
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++)
    if (...)
      for (k=1; k<=N; k++)
S       x = ...;
R   ... = x;
}

double x, xS[N+1, N+1, N+1];
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++)
    if (...)
      for (k=1; k<=N; k++)
S       xS[i, j, k] = ...;
R   ... = φ({⟨S, i, j', N⟩ : 1 ≤ j' ≤ N} ∪ {⊥});
}

double x, xS[N+1, N+1, N+1], φxS[N+1, N+1, N+1] = {NULL};
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++)
    if (...)
      for (k=1; k<=N; k++) {
S       xS[i, j, k] = ...;
        φxS[i, j, k] = &x;
      }
R   ... = {
      maxS = max {(i, j', k') : 1 ≤ j' ≤ N ∧ k' = N ∧ φxS[i, j', k'] == &x};
      if (maxS != (-1, -1, -1)) xS[maxS] else x;
    }
}

double x, xS[N+1, N+1, N+1], φxS[N+1] = {NULL};
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++) {
    if (...) {
      for (k=1; k<=N; k++)
S       xS[i, j, k] = ...;
      φxS[j] = &x;
    }
  }
R   ... = {
      maxS = max {j' : 1 ≤ j' ≤ N ∧ φxS[j'] == &x};
      if (maxS != -1) xS[i, maxS, N] else x;
    }
}

Figure 5.2. Basic optimizations of the generated code for φ functions
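The max function of Figure 5.2 takes the lexicographic maximum of a set of iteration vectors, returning (-1, ..., -1) for the empty set. As a quick sketch, Python tuple comparison is already lexicographic, so the operation writes directly as:

```python
# Lexicographic maximum of a set of iteration vectors, with Figure 5.2's
# convention that the maximum of the empty set is (-1, ..., -1).
def lex_max(vectors, dim):
    vectors = list(vectors)
    return max(vectors) if vectors else (-1,) * dim

assert lex_max([(1, 9), (2, 0), (2, 1)], 2) == (2, 1)
assert lex_max([], 3) == (-1, -1, -1)
```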

In some cases, φ functions can be computed without φ-arrays to store possible reaching definitions. When the read statement is too complex to be analyzed at compile time,


double x;
while (...)
S   x = ...;
R   ... = x;

Figure 5.3.a. Original program

double x, xS[...];
w = 1;
while (...) {
S   xS[w] = ...;
    w++;
}
R   ... = φ(...);

Figure 5.3.b. SA program

double x, xS[...], φxS[...] = {NULL};
w = 1;
while (...) {
S   xS[w] = ...;
    φxS[w] = &x;
    w++;
}
R   ... = {
      maxS = max {w' : φxS[w'] == &x};
      if (maxS != -1) xS[maxS] else x;
    }

Figure 5.3.c. Standard implementation

double x, xS[...];
w = 1;
while (...) {
S   xS[w] = ...;
    w++;
}
R   ... = if (w>1) xS[w-1] else x;

Figure 5.3.d. Optimized implementation

Figure 5.3. Repeated assignments to the same memory location

the set of possible reaching definitions can be very large. However, if we could compute the very memory location accessed by the read statement, the set of possible reaching definitions would be much smaller, sometimes reduced to a singleton. This shows the need for an additional piece of instancewise information, called the reaching definition of a memory location: the exact function, which depends on an execution e ∈ E of the program, is denoted by σ_e^ml and its conservative approximation by σ^ml. Here is the formal definition:

∀e ∈ E, ∀u ∈ R_e, ∀c ∈ f_e(W_e) :  σ_e^ml(u, c) = max_<seq { v ∈ W_e : v <_seq u ∧ f_e(v) = c }.
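This definition can be read off executably; the sketch below is a direct Python transcription, where the instance names and the storage mapping are assumptions of the illustration, and None stands for ⊥.

```python
# sigma_e^ml(u, c): the last write v executing before u (in the trace of
# one execution e, listed in <seq order) such that f_e(v) = c.
def sigma_ml(trace, f_e, u, c):
    earlier = trace[:trace.index(u)]               # all v with v <seq u
    writers = [v for v in earlier if f_e.get(v) == c]
    return writers[-1] if writers else None        # None stands for bottom

trace = ['w1', 'w2', 'w3', 'u']                    # execution order of e
f_e = {'w1': 'A[2]', 'w2': 'A[5]', 'w3': 'A[2]'}   # storage mapping of writes
assert sigma_ml(trace, f_e, 'u', 'A[2]') == 'w3'
assert sigma_ml(trace, f_e, 'u', 'A[9]') is None
```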

Computing relation σ^ml is not really different from reaching definition analysis. To compute σ^ml for a reference r in the right-hand side of a statement, r is replaced by a read access to a new symbolic memory location c, then classical instancewise reaching definition analysis is performed. The result is a reaching definition relation parameterized by c. Seeing c as an argument, it yields the expected approximate relation σ^ml. In some rare cases, this computation scheme yields unnecessarily complex results:3 the general solution is then to intersect the result with σ.

Algorithm Abstract-ML-SA is an improved single-assignment form conversion algorithm based on reaching definitions of memory locations. It is based on the exact

3 Consider an array A, an assignment to A[foo] and a read reference to A[foo], where foo is some complex subscript. A precise reaching definition analysis would compute an exact result because the subscript is the same in the two statements. However, the reaching definition of a given memory location is not known precisely, because foo in the assignment statement is not known at compile time.


run-time computation of the symbolic memory location with storage mapping f_e. This algorithm can also be specialized for loop nests and arrays, using quasts parameterized by the current instance and the symbolic memory location; see Loop-Nests-ML-SA. In both cases, the value of f_e should not be interpreted: it must be used as the original reference code, possibly complex, to be substituted for the symbolic memory location c. An example is described in Figure 5.4.

double A[N+1];
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
S   A[j] = A[j] + A[foo];

double A[N+1], AS[N+1, N+1];
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
S   AS[i, j] = (if (i>1) AS[i-1, j] else A[j])
             + (if (i>1 || j>1) φ({⊥} ∪ {⟨S, i', j'⟩ : 1 ≤ i', j' ≤ N ∧ (i', j') <_lex (i, j)})
                else A[foo]);

double A[N+1], AS[N+1, N+1];
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
S   AS[i, j] = (if (i>1) AS[i-1, j] else A[j])
             + (if (foo < j) AS[i, foo]
                else if (i>1) AS[i-1, foo] else A[foo]);

Figure 5.4. Improving the SA algorithm
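A hedged Python sketch of the quast used in the last version of Figure 5.4 follows: the subscript foo is unknown at compile time, but parameterizing the reaching definition by the accessed location c = foo makes the retrieval exact at run time. The instance encoding is an assumption of this sketch.

```python
# Quast parameterized by the symbolic memory location c: statement S of
# Figure 5.4 writes A[j] at each (i, j), so the last write of A[c] before
# <S, i, j> is <S, i, c> if c < j, else <S, i-1, c>, else bottom.
def sigma_ml_quast(i, j, c):
    if c < j:                    # A[c] already written earlier in iteration i
        return ('S', i, c)
    elif i > 1:                  # last written during the previous i iteration
        return ('S', i - 1, c)
    else:
        return None              # bottom: read the original element A[c]

assert sigma_ml_quast(2, 5, 3) == ('S', 2, 3)
assert sigma_ml_quast(2, 3, 7) == ('S', 1, 7)
assert sigma_ml_quast(1, 3, 7) is None
```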

A general problem with implementations of φ functions based on φ-structures is the large redundancy of lexicographic maximum computations. Indeed, each time a φ function is encountered, the maximum of the full set of possible reaching definitions must be computed. In the static single-assignment framework (SSA) [CFR+91, KS98], a large part of the work is devoted to optimized placement of φ functions, in order to never recompute the maximum of the same set. These techniques are well suited to the variable renaming involved in SSA, but are unable to support the data structure reconstruction performed by SA algorithms. Nevertheless, for another expansion scheme presented in Section 5.4.7, we are able to avoid redundancies and to optimize the placement of φ functions, but the algorithm is rather complex.

The method we propose here has been studied with the help of Laurent Vibert. It removes redundant computations, but the computation is not made with φ-structures as in SA form; it rather follows the static


Abstract-ML-SA (program, W, σ^ml)
program: an intermediate representation of the program
W: a conservative approximation of the set of write accesses
σ^ml: reaching definitions of memory locations
returns an intermediate representation of the expanded program
1 for each data structure D in program
2   do declare a data structure Dexp[W]
3      for each statement s assigning D in program
4        do left-hand side of s ← Dexp[CurIns]
5      for each reference ref to D in program
6        do ref ← if (σ^ml(⟨CurIns, ref⟩, f_e(CurIns, ref)) == {⊥}) ref
                  else if (σ^ml(⟨CurIns, ref⟩, f_e(CurIns, ref)) == {ι}) Dexp[ι]
                  else φ(σ^ml(⟨CurIns, ref⟩, f_e(CurIns, ref)))
7 return program

Loop-Nests-ML-SA (program, σ^ml)
program: an intermediate representation of the program
σ^ml: reaching definitions of memory locations
returns an intermediate representation of the expanded program
1 for each array A in program
2   do for each statement S assigning A in program
3        do declare an array AS
4           left-hand side of S ← AS[Iter(CurIns)]
5      for each reference ref to A in program
6        do σ^ml|ref ← σ^ml ∩ (I × ref)
7           u ← symbolic access associated with reference ref
8           quast ← Make-Quast (σ^ml|ref (u, f_e(u)))
9           map ← Convert-Quast (quast, ref)
10          ref ← map (CurIns)
11 return program

single-assignment (SSA) framework [KS98]. This is a simple compromise between dependence removal and efficient computation of φ functions, based on the commutativity and associativity of the lexicographic maximum. The idea is to use @-structures in one-to-one mapping with the original data structures instead of the expanded ones. Notice that @-structures are not in single-assignment form, and maximum computation must be done in a critical section. Both the write instance and the memory location should be stored, but the memory location is now encoded in the subscript: @-structures thus store instances instead of memory locations; see Abstract-Implement-Phi-Not-SA.

The original memory-based dependences are displaced from the original data structures to their @-structures: they have not disappeared! However, thanks to the properties of the lexicographic maximum, output dependences can be ignored without violating the original program semantics. Spurious anti-dependences remain, and must be taken into account for parallelization purposes. The first example in Figure 5.5 can be parallelized with this technique, but not the second.

In the case of loop nests and arrays, a simple extension to the technique can be helpful. It is sufficient, for example, to parallelize the second example in Figure 5.5. Consider a call of the form φ(set). If the component value of some dimensions is constant for all


Abstract-Implement-Phi-Not-SA (expanded)
expanded: an intermediate representation of the expanded program
returns an intermediate representation with run-time restoration code
1 for each original data structure D[shape] in expanded
2   do if there are φ functions accessing Dexp
3      then declare a data structure @D[shape] initialized to ⊥
4           for each read reference ref to D whose expanded form is φ(set)
5             do sub ← subscript of reference ref
6                for each statement s involved in set
7                  do subs ← subscript of the write reference to D in s
8                     if not already done for s
9                       then following s insert @D[subs] = max (@D[subs], CurIns)
10               φ(set) ← if (@D[sub] != ⊥) Dexp[@D[sub]] else D[sub]
11 return expanded
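A hedged Python sketch of this @-structure scheme, on the pattern of the first example of Figure 5.5, is given below; the guard values are assumptions of this illustration. The key design point is that the only cross-iteration operation left is a max reduction, which is commutative and associative, so the loop may run in parallel.

```python
# Guarded writes record their instance in @x through a max reduction; the
# read after the loop picks the expanded cell of the last writer, or the
# original value when no write executed (@x still bottom).
guard = [False, True, True, False]
N = len(guard)

# Original program: x = -1; for i: S: if guard[i]: x = 10*i; R: r = x
x = -1
for i in range(N):
    if guard[i]:
        x = 10 * i
r_original = x

# Expanded program: xS per instance, @x holds the maximal writer instance.
BOT = -1
xS = [None] * N
at_x = BOT                        # @x initialized to bottom
for i in range(N):                # parallelizable: only a max reduction on @x
    if guard[i]:
        xS[i] = 10 * i
        at_x = max(at_x, i)
r_expanded = xS[at_x] if at_x != BOT else -1

assert r_expanded == r_original
```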

iteration vectors of instances in set, then it is legal to expand the @-array along these dimensions. Applied to the second example in Figure 5.5, @x is replaced by @x[i], which makes the outer loop parallel.

double x;
for (i=1; i<=N; i++)
S   if (...) x = ...;
R   ... = x;

Figure 5.5.a. First example

double x, xS[N+1], @x = -1;
parallel for (i=1; i<=N; i++)
S   if (...) {
      xS[i] = ...;
      @x = max (@x, i);
    }
R   ... = if (@x != -1) xS[@x] else x;

Figure 5.5.b. First example: parallel expansion

double x;
for (i=1; i<=N; i++) {
T   x = ...;
    for (j=1; j<=N; j++)
S     if (...) x = ... x ...;
R   ... = x;
}

Figure 5.5.c. Second example

double x, xT[N+1], xS[N+1, N+1];
double @x = (-1, -1);
for (i=1; i<=N; i++) {
T   xT[i] = ...;
    for (j=1; j<=N; j++)
S     if (...) {
        xS[i, j] = ... (if (j>1) xS[i, j-1] else xT[i]) ...;
        @x = max (@x, (i, j));
      }
R   ... = if (@x != (-1, -1)) xS[@x] else xT[i];
}

Figure 5.5.d. Second example: not parallelizable expansion

Figure 5.5. Parallelism extraction versus run-time overhead

In practice, this technique is both very easy to implement and very efficient for run-time restoration of the data flow, but it can often hamper parallelism extraction. It is a first and simple attempt to find a tradeoff between parallelism and overhead.

All the single-assignment form algorithms described and most techniques for run-time restoration of the data flow share the same major drawback: run-time overhead. By essence, SA form requires a huge memory usage, and is not practical for real programs. Moreover, some φ functions cannot be implemented efficiently with the optimizations proposed. To avoid or reduce these sources of run-time overhead, it is thus necessary to design more pragmatic expansion schemes: both memory usage and run-time data-flow restoration code should be handled with care. This is the purpose of the three following sections.

The present section studies a novel memory expansion paradigm: its motivation is to stick with the compile-time restoration of the flow of data while keeping in mind the approximative nature of the compile-time information. More precisely, we would like to remove as many memory-based dependences as possible, without the need of any φ function (associated with run-time restoration of the data flow). We will show that this goal requires a change in the way expanded data structures are accessed, to take into account the approximative knowledge of storage mappings.

An expansion of data structures that does not need a φ function is called a static expansion [BCC98, BCC00].4 The goal is to find automatically a static way to expand all data structures as much as possible, i.e. the maximal static expansion. Maximal static expansion may be considered as a trade-off between parallelism and memory usage.

We present an algorithm to derive the maximal static expansion; its input is the (perhaps conservative) output of a reaching definition analysis, so our method is "optimal" with respect to the precision of this analysis. Our framework is valid for any imperative program, without restriction (the only restrictions being those of your favorite reaching definition analysis). We then present an intra-procedural algorithm to construct the maximal static expansion for programs with arrays and scalars only, but where subscripts and control structures are unrestricted.

5.2.1 Motivation

The three following examples introduce the main issues and advocate for a maximal static

expansion technique.

First Example: Dynamic Control Flow

We first study the pseudo-code shown in Figure 5.6; this kernel appears in several convolution codes5. Parts denoted by "…" are supposed to have no side-effect.

4 Notice that according to our definition, an expansion in the static single-assignment framework [CFR+91, KS98] may not be static.

5 For instance, Horn and Schunck's algorithm to perform 3D Gaussian smoothing by separable convolution.

5.2. MAXIMAL STATIC EXPANSION 169

........................................................................................

double x;
for (i=1; i<=N; i++) {
 T  x = …;
    while (…)
 S    x = … x …;
 R  … = x;
}

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 5.6. First example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Each instance ⟨T, i⟩ assigns a new value to variable x. In turn, statement S assigns x an undefined number of times (possibly zero). The value read in x by statement R is thus defined either by T, or by some instance of S, in the same iteration of the for loop (the same i). Therefore, if the expansion assigns distinct memory locations to ⟨T, i⟩ and to instances of ⟨S, i, w⟩,6 how could instance ⟨R, i⟩ "know" which memory location to read from?

We have already seen that this problem is solved with an instancewise reaching definition analysis which describes where values are defined and where they are used. We may thus call σ the mapping from a read instance to its set of possible reaching definitions. Applied to the example in Figure 5.6, it tells us that the set σ(⟨S, i, w⟩) of definitions reaching instance ⟨S, i, w⟩ is:

σ(⟨S, i, w⟩) = if w > 1 then {⟨S, i, w−1⟩} else {⟨T, i⟩}   (5.1)

And the set σ(⟨R, i⟩) of definitions reaching instance ⟨R, i⟩ is:

σ(⟨R, i⟩) = {⟨T, i⟩} ∪ {⟨S, i, w⟩ : w ≥ 1},   (5.2)

where w is an artificial counter of the while loop.
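The closed forms (5.1) and (5.2) can be cross-checked against a tiny interpreter of Figure 5.6 (an illustrative Python sketch; instances are encoded as tuples like ('S', i, w), and the while-loop trip counts are supplied externally since they are a run-time fact):

```python
# Interpret Figure 5.6 for fixed while-loop trip counts, recording for each
# read of x the instance that last defined it (its reaching definition).
def reaching_defs(N, trips):
    last_def = None          # instance currently defining x
    defs = {}                # read instance -> definition reaching it
    for i in range(1, N + 1):
        last_def = ('T', i)
        for w in range(1, trips[i] + 1):
            defs[('S', i, w)] = last_def     # S reads x ...
            last_def = ('S', i, w)           # ... then overwrites it
        defs[('R', i)] = last_def
    return defs

d = reaching_defs(2, {1: 3, 2: 0})
# (5.1): the definition reaching <S,i,w> is <S,i,w-1> if w > 1, else <T,i>
assert d[('S', 1, 1)] == ('T', 1)
assert d[('S', 1, 3)] == ('S', 1, 2)
# (5.2): the definition reaching <R,i> is <T,i> or some <S,i,w>, same i
assert d[('R', 1)] == ('S', 1, 3)
assert d[('R', 2)] == ('T', 2)
```

Every execution picks one element of the sets (5.1) and (5.2); the analysis must report the whole set because the trip counts are unknown at compile time.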

Let us try to expand scalar x. One way is to convert the program into SA, making T write into xT[i] and S into xS[i, w]: then, each memory location is assigned at most once, complying with the definition of SA. However, what should right-hand sides look like now? A brute-force application of (5.2) yields the program in Figure 5.7. While the right-hand side of S only depends on w, the right-hand side of R depends on the control flow, thus needing a φ function.

The aim of maximal static expansion is to expand x as much as possible in this program but without having to insert φ functions.

A possible static expansion is to uniformly expand x into x[i] and to avoid output dependences between distinct iterations of the for loop. Figure 5.8 shows the resulting maximal static expansion of this example. It has the same degree of parallelism as, and is simpler than, the program in single-assignment form.
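The claim that the expansion x[i] preserves the semantics while decoupling iterations can be checked on a small model of Figures 5.6 and 5.8 (an illustrative Python sketch; f and g stand for the opaque "…" right-hand sides):

```python
# Original (Figure 5.6) vs maximal static expansion (Figure 5.8): once x
# becomes x[i], distinct iterations of the i loop touch distinct locations.
def run_original(N, trips, f, g):
    x, out = None, []
    for i in range(1, N + 1):
        x = f(i)                          # T
        for w in range(1, trips[i] + 1):  # while loop, w a virtual counter
            x = g(x, i, w)                # S
        out.append(x)                     # R
    return out

def run_expanded(N, trips, f, g):
    x = [None] * (N + 1)
    out = [None] * N
    for i in range(1, N + 1):             # now parallel: any order is legal
        x[i] = f(i)                       # T
        for w in range(1, trips[i] + 1):
            x[i] = g(x[i], i, w)          # S
        out[i - 1] = x[i]                 # R
    return out

f = lambda i: i * 10
g = lambda v, i, w: v + w
assert run_original(3, {1: 2, 2: 0, 3: 1}, f, g) == \
       run_expanded(3, {1: 2, 2: 0, 3: 1}, f, g)
```

No φ function appears in the expanded version: each read of x[i] has all its possible definitions in the same iteration i.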

Notice that it should be easy to adapt the array privatization techniques by Maydan

et al. [MAL93] to handle the program in Figure 5.6; this would tell us that x can be

privatized along i. However, we want to do more than privatization along loops, as

illustrated in the following examples.

6 We need a virtual loop variable w to track iterations of the while loop.


........................................................................................

for (i=1; i<=N; i++) {
 T  xT[i] = …;
    w = 1;
    while (…) {
 S    xS[i,w] = … (if (w==1) xT[i] else xS[i,w-1]) …;
      w++;
    }
 R  … = φ({⟨T, i⟩} ∪ {⟨S, i, w⟩ : w ≥ 1});
}

. . . . . . . . . . . . . . . . . . . Figure 5.7. First example: single-assignment form . . . . . . . . . . . . . . . . . . .

........................................................................................

for (i=1; i<=N; i++) {
 T  x[i] = …;
    while (…)
 S    x[i] = … x[i] …;
 R  … = x[i];
}

. . . . . . . . . . . . . . . . . . Figure 5.8. First example: maximal static expansion . . . . . . . . . . . . . . . . . .

Let us give a more complex example; we would like to expand array A in the program in

Figure 5.9.

Since T always executes when j equals N, a value read by ⟨S, i, j⟩, j > N, is never defined by an instance ⟨S, i′, j′⟩ of S with j′ ≤ N. Figure 5.9 describes the data-flow relations between S instances: an arrow from (i′, j′) to (i, j) means that instance (i′, j′) defines a value that may reach (i, j).

........................................................................................

j

double A[4*N];
for (i=1; i<=2*N; i++)
  for (j=1; j<=2*N; j++) {
    if (…)
 S    A[i-j+2*N] = … A[i-j+2*N] …;
 T  if (j==N) A[i+N] = …;
  }

[Diagram omitted: data-flow arrows between S instances in the (i, j) iteration domain, 1 ≤ i, j ≤ 2N.]

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 5.9. Second example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


σ(⟨S, i, j⟩) =
  if j ≤ N
  then {⟨S, i′, j′⟩ : 1 ≤ i′ ≤ 2N ∧ 1 ≤ j′ < j ∧ i′ − j′ = i − j}   (5.3)
  else {⟨S, i′, j′⟩ : 1 ≤ i′ ≤ 2N ∧ N < j′ < j ∧ i′ − j′ = i − j}
       ∪ {⟨T, i′, N⟩ : 1 ≤ i′ < i ∧ i′ = i − j + N}

Because reaching definitions are non-singleton sets, converting this program to SA form would require run-time computation of the memory location read by S.

........................................................................................

[Diagrams omitted: two copies of the (i, j) iteration domain, 1 ≤ i, j ≤ 2N.]

Figure 5.10.a. Instances involved in the same data flow
Figure 5.10.b. Counting groups per memory location

. . . . . . . . . . . . . . . . Figure 5.10. Partition of the iteration domain (N = 4) . . . . . . . . . . . . . . . .

However, we notice that the iteration domain of S may be split into disjoint subsets by grouping together instances involved in the same data flow. These subsets build a partition of the iteration domain. Each subset may have its own memory space that will not be written nor read by instances outside the subset. The partition is given in Figure 5.10.a.

Using this property, we can duplicate only those elements of A that appear in two distinct subsets. These are all the array elements A[c], 1 + N ≤ c ≤ 3N − 1. They are accessed by instances in the large central set in Figure 5.10.b. Let us label with 1 the subsets in the lower half of this area, and with 2 the subsets in the top half. We add one dimension to array A, subscripted with 1 and 2 in statements S2 and S3 in Figure 5.11, respectively. Elements A[c], 1 ≤ c ≤ N, are accessed by instances in the upper left triangle in Figure 5.10.b and have only one subset each (one subset in the corresponding diagonal in Figure 5.10.a), which we label with 1. The same labeling holds for sets corresponding to instances in the lower right triangle.

The maximal static expansion is shown in Figure 5.11.7 Notice that this program has the same degree of parallelism as the corresponding single-assignment program, without the run-time overhead.
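The partition of Figure 5.10 can be recomputed by brute force for small N (an illustrative Python sketch; the union-find grouping and the instance encoding are ours, not the thesis's, and all guards are assumed true):

```python
from collections import defaultdict

# Group instances of S (second example, Figure 5.9) that are linked by a
# def-use chain on the same array cell; count groups per cell.
def groups_per_cell(N):
    last = {}       # cell -> instance (S or T) that last wrote it
    parent = {}     # union-find forest over S instances
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for i in range(1, 2 * N + 1):
        for j in range(1, 2 * N + 1):
            c = i - j + 2 * N
            u = ('S', i, j)
            parent[u] = u
            if c in last and last[c][0] == 'S':
                union(last[c], u)      # last S write may reach this read
            last[c] = u
            if j == N:
                last[i + N] = ('T', i)  # T kills the previous definition
    groups = defaultdict(set)
    for u in parent:
        groups[u[1] - u[2] + 2 * N].add(find(u))
    return {c: len(g) for c, g in groups.items()}

g = groups_per_cell(4)   # N = 4, cells 1 .. 4N-1
# Central cells N+1 <= c <= 3N-1 carry two data-flow groups, the others one.
assert all(g[c] == 2 for c in range(5, 12))
assert all(g[c] == 1 for c in list(range(1, 5)) + list(range(12, 16)))
```

The count per cell is exactly the expansion factor used by the maximal static expansion: a second dimension of size 2 suffices, as in Figure 5.11.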


........................................................................................

double A[4*N, 2];
for (i=1; i<=2*N; i++)
  for (j=1; j<=2*N; j++) {
    // expansion of statement S
    if (-2*N+1<=i-j && i-j<=-N) {
      if (…)
 S1     A[i-j+2*N, 1] = … A[i-j+2*N, 1] …;
    } else if (-N+1<=i-j && i-j<=N-1) {
      if (j<=N) {
        if (…)
 S2       A[i-j+2*N, 1] = … A[i-j+2*N, 1] …;
      } else
        if (…)
 S3       A[i-j+2*N, 2] = … A[i-j+2*N, 2] …;
    } else
      if (…)
 S4     A[i-j+2*N, 1] = … A[i-j+2*N, 1] …;
    // expansion of statement T
 T  if (j==N) A[i+N, 2] = …;
  }

. . . . . . . . . . . . . . Figure 5.11. Second example: maximal static expansion . . . . . . . . . . . . . .

........................................................................................

double A[N+1];                      double A[N+1, N+1];
for (i=1; i<=N; i++) {              for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++)                for (j=1; j<=N; j++)
 T  A[j] = …;                        T  A[j, i] = …;
 S  A[foo (i)] = …;                  S  A[foo (i), i] = …;
 R  … = … A[bar (i)];                R  … = … A[bar (i), i];
}                                   }

Figure 5.12.a. Original program     Figure 5.12.b. Maximal static expansion

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 5.12. Third example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Consider the program in Figure 5.12.a, where foo and bar are arbitrary subscripting functions8. Since all array elements are assigned by T, the value read by R at the ith iteration must have been produced by S or T at the same iteration. The data-flow graph is:

7 Some instances of S read uninitialized values (e.g. when j = 1) and they have no reaching definition. As a consequence, the expanded program in Figure 5.11 should begin with copy-in code from the original array to the expanded one.

8 A[foo (i)] stands for an array subscript between 1 and N, "too complex" to be analyzed at compile-time.


σ(⟨R, i⟩) = {⟨S, i⟩} ∪ {⟨T, i, j⟩ : 1 ≤ j ≤ N}.   (5.4)

The maximal static expansion adds a new dimension to A, subscripted by i. It is sufficient to make the first loop parallel.

What Next?

These examples show the need for an automatic static expansion technique. We present in the following section a formal definition of expansion and a general framework for maximal static expansion. We then describe an expansion algorithm for arrays that yields the expanded programs shown above. Notice that it is easy to recognize the original programs in their expanded counterparts, which is a convenient property of our algorithm.

It is natural to compare array privatization [MAL93, TP93, Cre96, Li92] and maximal static expansion: both methods expose parallelism in programs at a lower cost than single-assignment form transformation. However, privatization generally resorts to dynamic restoration of the data flow, and it only detects parallelism along the enclosing loops; it is thus less powerful than general array expansion techniques. Indeed, the example in Section 5.2.1 shows that our method not only may expand along diagonals in the iteration space but may also do some "blocking" along these diagonals.

We assume an instancewise reaching definition analysis is performed previously, yielding a conservative approximation σ of the relation between uses and reaching definitions.

The definition of static expansion has first been introduced in [BCC98]: the idea is to avoid dynamic restoration of the data flow. Let us consider two writes v and w belonging to the same set of reaching definitions of some read u. Suppose they both write in the same memory location. If we assign two distinct memory locations to v and w in the expanded program, then a φ function is needed to restore the data flow, since we do not know which of the two locations has the value needed by u. Using the notations introduced in Sections 2.4 and 2.5, "v and w write in the same memory location" is denoted by fe(v) = fe(w), and "v and w are assigned distinct memory locations in the expanded program" is denoted by feexp(v) ≠ feexp(w).

We introduce relation R between definitions that possibly reach the same read (recall that we do not require the reaching definition analysis to give exact results):

∀v, w ∈ W : v R w ⟺ ∃u ∈ R : v σ u ∧ w σ u.

Whenever two definitions possibly reaching the same read assign the same memory location in the original program, they must still assign the same memory location in the expanded program. Since "writing in the same memory location" is an equivalence relation, we actually use R*, the transitive closure of R (see Section 5.2.4 for computation details). Relation R, therefore, generalizes webs [Muc97] to instances of references, and the rest of this work shows how to compute R in the presence of arrays.9

9 Strictly speaking, webs include definitions and uses, whereas R applies to definitions only.
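On finite instance sets, relation R and its transitive closure can be prototyped directly from the reaching definition sets (an illustrative Python sketch; σ is modeled as a mapping from reads to their sets of possible definitions):

```python
# v R w iff v and w may reach the same read; R* is the transitive closure.
def relation_R(sigma):
    R = set()
    for defs in sigma.values():       # sigma: read -> set of definitions
        for v in defs:
            for w in defs:
                R.add((v, w))
    return R

def transitive_closure(R):
    closure = set(R)
    changed = True
    while changed:                    # naive fixpoint, fine for small sets
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Reads r1 and r2 share definition v2, so v1 R* v3 holds even though no
# single read can be reached by both v1 and v3: closing R matters.
sigma = {'r1': {'v1', 'v2'}, 'r2': {'v2', 'v3'}}
Rs = transitive_closure(relation_R(sigma))
assert ('v1', 'v3') in Rs and ('v1', 'v3') not in relation_R(sigma)
```

In the polyhedral setting the same operations are carried out symbolically on affine relations, which is why exact transitive closure becomes the hard step (Section 5.2.4).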


Relation R holds between definitions that reach the same use. Therefore, mapping these writes to different memory locations is precisely the case where φ functions would be necessary, a case a static expansion is designed to avoid:

Definition 5.1 (static expansion) For an execution e ∈ E of the program, an expansion from storage mapping fe to storage mapping feexp is static if

∀v, w ∈ We : v R* w ∧ fe(v) = fe(w) ⟹ feexp(v) = feexp(w).   (5.5)

When clear from the context, we say "static expansion feexp" instead of "static expansion from fe to feexp". Now, we are interested in removing as many dependences as possible, without introducing φ functions. We are looking for the maximal static expansion (MSE), assigning the largest number of memory locations while verifying (5.5):

Definition 5.2 (maximal static expansion) For an execution e, a static expansion feexp is maximal on the set We of writes, if for any static expansion fe′,

∀v, w ∈ We : feexp(v) = feexp(w) ⟹ fe′(v) = fe′(w).   (5.6)

Intuitively, if feexp is maximal, then fe′ cannot do better: it maps two writes to the same memory location when feexp does.

We need to characterize the sets of statement instances on which a maximal static expansion feexp is constant, i.e. equivalence classes of relation {(u, v) ∈ We × We : feexp(u) = feexp(v)}. However, this hardly gives us an expansion scheme, because this result does not tell us how much each individual memory location should be expanded. The purpose of Section 5.2.3 is to design a practical expansion algorithm for each memory location used in the original program.

Following the lines of [BCC00], we are interested in the static expansion which removes the largest number of dependences.

Proposition 5.1 (maximal static expansion) Given a program execution e, a storage mapping feexp is both a maximal static expansion of fe and finer than fe if and only if

∀v, w ∈ We : v R* w ∧ fe(v) = fe(w) ⟺ feexp(v) = feexp(w)   (5.7)

Proof: Sufficient condition (the "if" part). Let feexp be a mapping s.t. ∀u, v ∈ W : feexp(u) = feexp(v) ⟺ u R* v ∧ fe(u) = fe(v). By definition, feexp is a static expansion and feexp is finer than fe.

Let us show that feexp is maximal. Suppose that for u, v ∈ W: feexp(u) = feexp(v). (5.7) implies u R* v and fe(u) = fe(v). Thus, from (5.5), any other static expansion fe′ satisfies fe′(u) = fe′(v) too. Hence, feexp(u) = feexp(v) ⟹ fe′(u) = fe′(v), so feexp is maximal.

Necessary condition (the "only if" part). Let feexp be a maximal static expansion finer than fe. Because feexp is a static expansion, we only have to prove that

∀u, v ∈ W : feexp(u) = feexp(v) ⟹ u R* v ∧ fe(u) = fe(v).

On the one hand, feexp(u) = feexp(v) ⟹ fe(u) = fe(v) because feexp is finer than fe. On the other hand, for some u and v in W, assume feexp(u) = feexp(v) and ¬(u R* v). We show that it contradicts the maximality of feexp: for any w in W, let fe′(w) = feexp(w) when ¬(u R* w), and fe′(w) = c when u R* w, for some c ≠ feexp(u). fe′ is a static expansion: by construction, fe′(u′) = fe′(v′) for any u′ and v′ such that u′ R* v′. The contradiction comes from the fact that fe′(u) ≠ fe′(v).

Results above make use of a general memory expansion feexp. However, constructing it from scratch is another issue. To see why, consider a memory location c and two accesses v and w writing into c. Assume that v R* w: these accesses must assign the same memory location in the expanded program. Now assume the contrary: if ¬(v R* w), then the expansion should make them assign two distinct memory locations.

We are thus strongly encouraged to choose an expansion feexp of the form (fe, ν) where function ν is constructed by the analysis and must be constant on equivalence classes of R*. Notation (fe, ν) is merely abstract. A concrete method for code generation involves adding dimensions to arrays, and extending array subscripts with ν, see Section 5.2.4. Now, a storage mapping feexp = (fe, ν) is finer than fe by construction, and it is a maximal static expansion if function ν satisfies the following equation:

∀e ∈ E, ∀v, w ∈ We, fe(v) = fe(w) : v R* w ⟺ ν(v) = ν(w).

In practice, fe(v) = fe(w) can only be decided when fe is affine. In general, we have to approximate fe with the conflict relation ⋈ (v ⋈ w meaning that v and w may write in the same memory location) and derive two constraints from the previous equation:

Expansion must be static: ∀v, w ∈ W : v ⋈ w ∧ v R* w ⟹ ν(v) = ν(w);   (5.8)

Expansion must be maximal: ∀v, w ∈ W : v ⋈ w ∧ ¬(v R* w) ⟹ ν(v) ≠ ν(w).   (5.9)

First, notice that changing ⋈ into its transitive closure ⋈* has no impact on (5.8), and that the transformed equation yields an equivalence class enumeration problem. Second, (5.9) is a graph coloring problem: it says that two writes cannot "share the same color" if related. Direct methods exist to address these two problems simultaneously (see [Coh99b] or Section 5.4), but they seem much too complicated for our purpose.

Now, the only purpose of relation ⋈ is to avoid unnecessary memory allocation, and using a conservative approximation harms neither the maximality nor the static property of the expansion. Actually, we found that relation ⋈ differs from ⋈* (meaning ⋈ is not transitive) only in contrived examples, e.g. with tricky combinations of affine and non-affine array subscripts. Therefore, consider the following maximal static expansion criterion:

∀v, w ∈ W, v ⋈ w : v R* w ⟺ ν(v) = ν(w)   (5.10)

Now, given an equivalence class of ⋈, classes of R* are exactly the sets where storage mapping feexp is constant:

Theorem 5.1 A storage mapping feexp = (fe, ν) is a maximal static expansion for all executions e ∈ E iff for each equivalence class C ∈ W/⋈, ν is constant on each class in C/R* and takes distinct values between different classes: ∀v, w ∈ C : v R* w ⟺ ν(v) = ν(w).

Proof: C ∈ W/⋈ denotes a set of writes which may assign the same memory cell, and C/R* is the set of equivalence classes for relation R* on writes in C. A straightforward application of (5.10) concludes the proof.


Notice that ν is only supposed to take different values between classes in the same C: if C1, C2 ∈ W/⋈ with C1 ≠ C2, u1 ∈ C1 and u2 ∈ C2, nothing prevents that ν(u1) = ν(u2).

As a consequence, two maximal static expansions feexp and fe′exp are identical on a class of W/⋈, up to a one-to-one mapping between constant values. An interesting result follows:

Lemma 5.1 The expansion factor for each memory location assigned by writes in C is Card(C/R*).
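Lemma 5.1 can be illustrated on a finite class (a Python sketch on explicit sets; the class partitioning code is ours, for illustration only):

```python
# Expansion factor of one memory cell: the number of R* classes among the
# writes C that may assign it (Lemma 5.1, on finite sets).
def expansion_factor(C, R_star):
    classes = []
    for u in C:
        for cls in classes:
            # R* is an equivalence on C, so testing one member is enough
            if any((u, v) in R_star for v in cls):
                cls.add(u)
                break
        else:
            classes.append({u})
    return len(classes)

# Four writes to one cell; R* relates w1-w2 and w3-w4 only: factor 2.
R_star = {('w1', 'w2'), ('w2', 'w1'), ('w3', 'w4'), ('w4', 'w3')} \
         | {(w, w) for w in ['w1', 'w2', 'w3', 'w4']}
assert expansion_factor(['w1', 'w2', 'w3', 'w4'], R_star) == 2
```

In the second motivating example this count is 2 for the central diagonal cells and 1 elsewhere, matching the extra dimension of size 2 in Figure 5.11.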

Let C be an equivalence class in W/⋈ (statement instances that may hit the same memory location). Suppose we have a function ρ mapping each write u in C to a representative of its equivalence class in C/R* (see Section 5.2.4 for details). One may label each class in C/R*, or equivalently, label each element of ρ(C). Such a labeling scheme is obviously arbitrary, but all programs transformed using our method are equivalent up to a permutation of these labels. Labeling boils down to scanning exactly once all the integer points in the set of representatives ρ(C), see Section 5.2.5 for details. Now, remember that function feexp is of the form (fe, ν). From Theorem 5.1, we can take for ν(u) the label we chose for ρ(u); then storage mapping feexp is a maximal static expansion for our program.

Eventually, one has to generate code for the expanded program, using storage mapping feexp. It is done in Section 5.2.4.

5.2.4 Algorithm

The maximal static expansion scheme given above works for any imperative program. More precisely, you may expand any imperative program using maximal static expansion, provided that a reaching definition analysis technique can handle it (at the instance level) and that transitive closure computation, relation composition, intersection and union are feasible in your framework.

In the sequel, since we use FADA (see [BCF97, Bar98] and Section 2.4.3) as reaching definition analysis, we inherit its syntactical restrictions: data structures are scalars and arrays; pointers are not allowed. Loops, conditionals and array subscripts are unrestricted.

Therefore, Maximal-Static-Expansion and MSE-Convert-Quast are based on the classical single-assignment algorithms for loop nests, see Section 5.1. They rely on Omega [KPRS96] and PIP [Fea88b] for symbolic computations. Additional algorithms and technical points are studied in Section 5.2.5. In Maximal-Static-Expansion, the function ρ mapping instances to their representatives is encoded as an affine relation between iteration vectors (augmented with the statement label), and labeling function ν is encoded as an affine relation between the same iteration vectors and a "compressed" vector space found by Enumerate-Representatives, see Section 5.2.5.

An interesting but technical remark is that, by construction of function ν (seen as a parameterized vector), a few components may take a finite (and hopefully small) number of values. Indeed, such components may represent the "statement part" of an instance. In such a case, splitting array A into several (renamed) data structures10 should improve performance and decrease memory usage (avoiding convex hulls of disjoint polyhedra). Consider for instance MSE of the second example: expanding A into A1 and A2 would require 6N² array elements instead of 8N² in Figure 5.11. Other techniques reducing

10 Recall that in single-assignment form, statements assign disjoint (renamed) data structures.


Maximal-Static-Expansion (program, ⋈, σ)
program: an intermediate representation of the program
⋈: the conflict relation
σ: the reaching definition relation, seen as a function
returns an intermediate representation of the expanded program
 1  ⋈* ← Transitive-Closure (⋈)
 2  R* ← Transitive-Closure (σ ∘ σ⁻¹)
 3  ρ ← Compute-Representatives (⋈* ∩ R*)
 4  ν ← Enumerate-Representatives (⋈*, ρ)
 5  for each array A in program
 6    do νA ← component-wise maximum of ν(u) for all write accesses u to A
 7       declaration A[shape] is replaced by Aexp[shape, νA]
 8       for each statement S assigning A in program
 9         do left-hand side A[subscript] of S is replaced by Aexp[subscript, ν(CurIns)]
10       for each read reference ref to A in program
11         do σ/ref ← restriction of σ to accesses of the form (·, ref)
12            quast ← Make-Quast (σ/ref)
13            map ← MSE-Convert-Quast (quast, ref)
14            ref ← map (CurIns)
15  return program

MSE-Convert-Quast (quast, ref)
quast: the quast representation of the reaching definition function
ref: the original reference
returns the implementation of quast as a value retrieval code for reference ref
 1  switch
 2    case quast = {⊥} :
 3      return ref
 4    case quast = {ı} :
 5      A ← Array (ı)
 6      S ← Stmt (ı)
 7      x ← Iter (ı)
 8      subscript ← original array subscript in ref
 9      return Aexp[subscript, ν(S, x)]
10    case quast = {ı1, ı2, …} :
11      error "this case should never happen with static expansion!"
12    case quast = if predicate then quast1 else quast2 :
13      return if predicate MSE-Convert-Quast (quast1, ref)
               else MSE-Convert-Quast (quast2, ref)

the number of useless memory locations allocated by our algorithm are not described in this paper.

A few technical points and computational issues are raised in the previous algorithms. This section is devoted to their analysis and resolution.


Finding a "good" canonical representative in a set is not a simple matter. We choose the lexicographic minimum because it can be computed using classical techniques, and our first experiments gave good results.

Notice also that representatives must be described by a function on write instances. Therefore, the good "parametric" properties of lexicographical minimum computations [Fea91, Pug92] are well suited to our purpose.

A general technique to compute the lexicographical minimum follows. Let ≈ be an equivalence relation, and C an equivalence class for ≈. The lexicographical minimum of C is:

min<lex(C) = v ∈ C s.t. ∄u ∈ C, u <lex v.

Since <lex is a relation, we can rewrite the definition using algebraic operations:

min<lex(C) = C \ (<lex)(C).   (5.11)

This is applied in our framework to classes of ⋈* ∩ R*, with order <seq.
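Equation (5.11) is easy to check on a finite class (an illustrative Python sketch; Python's tuple comparison is lexicographic, so it plays the role of <lex on iteration vectors):

```python
# (5.11) on a finite class: min_<lex(C) = C \ (<lex)(C), i.e. remove every
# element that is lexicographically larger than some other element of C.
def lex_min(C):
    larger = {v for u in C for v in C if u < v}   # image of C under <lex
    rest = C - larger
    assert len(rest) == 1        # a finite nonempty class has one minimum
    return next(iter(rest))

assert lex_min({(2, 1), (1, 3), (1, 2)}) == (1, 2)
```

The symbolic analogue replaces the set comprehension by relation composition and set difference on affine relations, which keeps the result parametric in the loop bounds.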

Compute-Representatives (equivalence)
equivalence: an affine equivalence relation over instances
returns an affine function mapping instances to a canonical representative
1  repres ← equivalence \ (<seq ∘ equivalence)
2  return repres

Applying Algorithm Compute-Representatives to relation ⋈* ∩ R* yields an affine function ρ, but this does not readily provide the labeling function ν. The last step consists in enumerating the image of ρ inside classes of equivalence relation ⋈*.

Computing a Dense Labeling

To label each memory location, we associate each location to an integer point in the affine polyhedron of representatives, i.e. the image of function ρ whose range is restricted to a class of equivalence relation ⋈*. Labeling boils down to scanning exactly once all the integer points in the set of representatives. This can be done using classical polyhedron-scanning techniques [AI91, CFR95] or simply by considering a "part" of the representative function in one-to-one mapping with this set. It is thus easy to compute a labeling function ν.

But computing a "good" labeling function is much more difficult: a "good" labeling should be as dense as possible, meaning that the number of memory locations accessed by the program must be as close as possible to the number of memory locations allocated from the shape of function ν.

A possible idea would be to count the number of integer points in the image of function ρ, thanks to Ehrhart polynomials [Cla96], and to build a labeling (non-affine in general) from this computation. But this would be extremely costly in practice and would sometimes generate very intricate subscripts; moreover, most compile-time properties on ν would be lost, due to the possible non-affine form. As a result, the "dense labeling problem" is mostly open at the moment. We have found an interesting partial result by Wilde and Rajopadhye [WR93], but studying applicability of their technique to our more general case is left for future work.


Thanks to the regularity of iteration spaces of practical loop nests, techniques such as global translation, division by an integer constant (when a constant stride is discovered) and projection gave excellent results on every example we studied. Algorithm Enumerate-Representatives implements these simple transformations to enumerate the image of a function whose range is restricted to a class of some equivalence relation.

Enumerate-Representatives (rel, fun)
rel: equivalence relation whose classes define enumeration domains
fun: the affine function whose image should be enumerated
returns a dense labeling of the image of fun restricted to a class of rel
1  repres ← Compute-Representatives (rel)
2  enum ← Symbolic-Vector-Subtract (fun, repres ∘ fun)
3  apply appropriate translations, divisions and projections to iteration vectors in enum
4  return enum

For each array in the source program, the algorithm proceeds as follows:

- Compute the reciprocal relation σ⁻¹ of σ. This is different from computing the inverse of a function and consists only in a swap of the two arguments of σ. Composing two relations σ and σ′ boils down to eliminating y in x σ y ∧ y σ′ z.

- Computing the exact transitive closure of R or ⋈ is impossible in general: Presburger arithmetic is not closed under transitive closure. However, very precise conservative approximations (if not exact results) can be computed. Kelly et al. [KPRS96] do not give a formal bound on the complexity of their algorithm, but their implementation in the Omega toolkit proved to be efficient if not concise. A short review of their algorithm is presented in Section 3.1.2. Notice again that the exact transitive closure is not necessary for our expansion scheme to be correct.

  Moreover, R and ⋈ happen to be transitive in most practical cases. In our implementation, the Transitive-Closure algorithm first checks whether the difference (R ∘ R) \ R is empty, before triggering the computation. In all three examples, both relations R and ⋈ are already transitive.

- In the algorithm above, ρ is a lexicographical minimum. The expansion scheme just needs a way to pick one element per equivalence class. Computing the lexicographical minimum is expensive a priori, but was easy to implement.

- Finally, numbering classes becomes costly only when we have to scan a polyhedral set of representatives in dimension greater than 1. In practice, we only had intervals on our benchmark examples.
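Two of the symbolic operations above, relation composition by eliminating the middle variable and the cheap transitivity pre-check, can be prototyped on finite relations (an illustrative Python sketch, not the Omega-based implementation):

```python
# (x, z) is in compose(s1, s2) iff there is a y with x s1 y and y s2 z;
# a relation R is already transitive iff (R . R) \ R is empty.
def compose(s1, s2):
    return {(x, z) for (x, y) in s1 for (y2, z) in s2 if y == y2}

def is_transitive(R):
    return compose(R, R) <= R          # i.e. (R . R) \ R == empty set

sigma = {('r1', 'v1'), ('r2', 'v1'), ('r2', 'v2')}   # read -> definition
sigma_inv = {(b, a) for (a, b) in sigma}             # swap the two arguments
R = compose(sigma_inv, sigma)      # definitions reaching a common read
assert ('v1', 'v2') in R and ('v2', 'v1') in R
assert is_transitive(R)            # no closure computation needed here
```

When the check succeeds, as on all three motivating examples, the costly closure step is skipped entirely.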

Is our Result Maximal?

Our expansion scheme depends on the transitive closure calculator, and of course on the accuracy of input information: instancewise reaching definitions σ and approximation ⋈ of the original program storage mapping. We would like to stress the fact that the expansion produced is static and maximal with respect to the results yielded by these parts, whatever their accuracy:

- The exact transitive closure may not be available (for computability or complexity reasons) and may therefore be over-approximated. The expansion factor of a memory location c is then lower than Card({u ∈ W : fe(u) = c}/R*). However, the expansion remains static and is maximal with respect to the transitive closure given to the algorithm.

- Relation ⋈ approximating the storage mapping of the original program may be more or less precise, but we required it to be pessimistic (a.k.a. conservative). This point does not interfere with the staticity or maximality of the expansion; but the more accurate the relation ⋈, the less unused memory is allocated by the expanded program.

Despite good performance results on small kernels (see the following sections), it is obvious that reaching definition analysis and MSE will become unacceptably expensive on larger codes. When addressing real programs, it is therefore necessary to apply the MSE algorithm independently to several loop nests. A parallelizing compiler (or a profiler) can isolate loop nests that are critical program parts and where spending time in powerful optimization techniques is valuable. Such techniques have been investigated by Berthou in [Ber93], and also in the Polaris [BEF+96] and SUIF [H+96] projects.

However, some values may be initialized outside of the analyzed code. When the set of possible reaching definitions for some read accesses is not a singleton and includes ⊥, it is necessary to perform some copy-in at the beginning of the code. Each array holding values that may be read by such accesses must be copied into the appropriate expanded arrays. In practice this is expensive when expanded arrays hold many copies of original values. However, the process is fully parallel and should hopefully cost no more than the loop nest itself.

There is a simple way to avoid copy-in, at the cost of some loss in the expansion degree. It consists in adding "virtual write accesses" for every memory location and replacing ⊥s in the reaching definition relation by the appropriate virtual access (accesses indeed, when the memory location accessed is unknown). Since all ⊥s have been removed, computing the maximal static expansion from this modified reaching definition relation requires no copy-in; but additional constraints due to the "virtual accesses" may forbid some array expansions. This technique is especially useful when many temporary arrays are involved in a loop nest. But its application to the second motivating example (Figure 5.9) would forbid all expansion since almost all reads may access values defined outside the nest.

Moreover, the data structures created by MSE on each loop nest may be different, and the accesses to the same original array may then be inconsistent. Consider for instance the original pseudo-code in Figure 5.13.a. We assume the first nest was processed separately by MSE, and the second nest by any technique. The code appears in Figure 5.13.b. Clearly, references to A may be inconsistent: a read reference in the second nest does not know which copy of A1 (i.e. which value of the second subscript ν1) to read from.

A simple solution is then to insert, between the two loop nests, copy-out code in which the original data structure is restored (see Figure 5.13.c). Doing this only requires adding, at the end of the first nest, "virtual accesses" that read every memory location written

5.2. MAXIMAL STATIC EXPANSION 181

........................................................................................
    for i                                    for i
S     A[f1(i)] = ...                     S     A1[f1(i), ν1(i)] = ...
    end for                                  end for
    for i                                    for i
R     ... = A[f2(i)]                     R     ... = A1[f2(i), /* unknown */]
    end for                                  end for

 Figure 5.13.a. Original pseudo-code      Figure 5.13.b. After separate expansions

    for i
S     A1[f1(i), ν1(i)] = ...
    end for
    for c   // copy-out code
      A[c] = A1[c, ν(σ(V(c)))]
    end for
    for i
R     ... = A[f2(i)]
    end for

 Figure 5.13.c. With copy-out code
........................................................................................

in the nest. The reaching definitions within the nest give the identity of the memory location to read from. Notice that no φ functions are necessary in the copy-out code; the opposite would lead to a non-static expansion. More precisely, if we call V(c) the "virtual access" to memory location c after the loop nest, we can compute the maximal static expansion for the nest and the additional virtual accesses, and the value to copy back into c is located in (c, ν(σ(V(c)))).

Fortunately, with some knowledge of the program-wide flow of data, several optimizations can remove the copy-out code11. The simplest optimization is to remove the copy-out code for some data structure when no read access executing after the nest uses a value produced inside this nest. The copy-out code can also be removed when no φ functions are needed in read accesses executing after the nest. Eventually, it is always possible to remove the copy-out code by performing a forward substitution of (c, ν(σ(V(c)))) into read accesses to a memory location c in the following nests.

This section applies our algorithm to the motivating examples, using the Omega Calculator [Pug92] as a tool to manipulate affine relations.

11 Let us notice that, if MSE is used in codesign, the intermediate copy code and associated data structures would correspond to additional logic and buffers, respectively. Both should be minimized in complexity and/or size.

182 CHAPTER 5. PARALLELIZATION VIA MEMORY EXPANSION

First Example

Consider again the program in Figure 5.6 page 169. Using the Omega Calculator text-based interface, we describe a step-by-step execution of the expansion algorithm. We have to encode instances as integer-valued vectors. An instance ⟨S, i⟩ of a statement numbered s is denoted by the vector [i,..,s], where [..] possibly pads the vector with zeroes. We number T, S, R with 1, 2, 3 in this order, so ⟨T, i⟩, ⟨S, i, j⟩ and ⟨R, i⟩ are written [i,0,1], [i,j,2] and [i,0,3], respectively.

From (5.1) and (5.2), we construct the relation S of reaching definitions:

S := {[i,1,2]->[i,0,1] : 1<=i<=N}

union {[i,w,2]->[i,w-1,2] : 1<=i<=N && 2<=w}

union {[i,0,3]->[i,0,1] : 1<=i<=N}

union {[i,0,3]->[i,w,2] : 1<=i<=N && 1<=w};

Since we have only one memory location, relation ∼ tells us that all instances are related together, and it can be omitted.

Computing R is straightforward:

S' := inverse S;

R := S(S');

R;

{[i,w,2]->[i,0,1] : 1<=i<=N && 1<=w} union

{[i,0,1]->[i,w',2] : 1<=i<=N && 1<=w'} union

{[i,w,2]->[i,w',2] : 1<=i<=N && 1<=w' && 1<=w}

In mathematical terms, we get:

⟨T, i⟩ R ⟨T, i⟩ ⟺ 1 ≤ i ≤ N
⟨S, i, w⟩ R ⟨S, i, w′⟩ ⟺ 1 ≤ i ≤ N ∧ w ≥ 1 ∧ w′ ≥ 1
⟨S, i, w⟩ R ⟨T, i⟩ ⟺ 1 ≤ i ≤ N ∧ w ≥ 1
⟨T, i⟩ R ⟨S, i, w′⟩ ⟺ 1 ≤ i ≤ N ∧ w′ ≥ 1   (5.12)

Relation R is already transitive, so no closure computation is necessary: R* = R.

There is only one equivalence class for ∼.

Let us choose ρ(u) as the first executed instance in the equivalence class of u for R* (the least instance according to the sequential order): ρ(u) = min<seq({u′ : u′ R* u}). We may compute this expression using (5.11):

∀i, w, 1 ≤ i ≤ N, w ≥ 1 : ρ(⟨T, i⟩) = ⟨T, i⟩, ρ(⟨S, i, w⟩) = ⟨T, i⟩.
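The Omega session can be mimicked on explicit instance sets. The following sketch is an illustration under the same encoding, with small constants N and M (M standing for the unknown while-loop trip count); it enumerates the reaching definition relation, builds R = σ∘σ⁻¹, checks that R is transitive, and recovers ρ as a lexicographic minimum.

```python
# Instances use the Omega encoding above: <T,i> = (i, 0, 1),
# <S,i,w> = (i, w, 2), <R,i> = (i, 0, 3); N and M are small
# illustrative constants (M bounds the while-loop counter w).
N, M = 4, 3

sigma = set()  # reaching definitions: (read, write) pairs
for i in range(1, N + 1):
    sigma.add(((i, 1, 2), (i, 0, 1)))          # <S,i,1> reads from <T,i>
    for w in range(2, M + 1):
        sigma.add(((i, w, 2), (i, w - 1, 2)))  # <S,i,w> reads <S,i,w-1>
    sigma.add(((i, 0, 3), (i, 0, 1)))          # <R,i> may read <T,i> ...
    for w in range(1, M + 1):
        sigma.add(((i, 0, 3), (i, w, 2)))      # ... or any <S,i,w>

# R = sigma o sigma^-1: two writes are related iff some read
# may take its value from either of them.
R = {(v1, v2) for (u1, v1) in sigma for (u2, v2) in sigma if u1 == u2}

# R is already transitive here, so R* = R.
assert all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

def rho(u):
    """Representative: earliest write related to u (the encoding is
    ordered lexicographically, which matches the sequential order)."""
    return min(v for (x, v) in R if x == u)
```

On this encoding, rho yields ρ(⟨S, i, w⟩) = ρ(⟨T, i⟩) = ⟨T, i⟩, as computed in the text.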

Computing ρ(W) yields N instances of the form ⟨T, i⟩. Maximal static expansion of accesses to variable x requires N memory locations. Here, i is an obvious label:

∀i, w, 1 ≤ i ≤ N, w ≥ 1 : ν(⟨S, i, w⟩) = ν(⟨T, i⟩) = i.   (5.13)

All left-hand side references to x are transformed into x[i]; all references to x in the right-hand side are transformed into x[i] too, since their reaching definitions are instances of S or T for the same i. The expanded code is thus exactly the one found intuitively in Figure 5.8.

The size declaration of the new array is x[1..N].
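To see that the expansion is semantically transparent, one can simulate both versions symbolically. This sketch is illustrative (the while-loop trip counts iters[i] are arbitrary inputs, and values are represented as nested tuples); it checks that statement R reads the same value in the original program and in the maximal static expansion of Figure 5.8.

```python
def run_original(N, iters):
    """Original program: a single scalar x; values are symbolic tuples."""
    reads = []
    for i in range(1, N + 1):
        x = ('T', i)                       # T: x = ...
        for w in range(1, iters[i] + 1):
            x = ('S', i, w, x)             # S: x = ... x ...
        reads.append(x)                    # R: ... = ... x ...
    return reads

def run_expanded(N, iters):
    """Maximal static expansion: x becomes x[1..N], so the outer
    loop iterations are independent and can run in parallel."""
    x = [None] * (N + 1)
    reads = []
    for i in range(1, N + 1):              # parallel in the real program
        x[i] = ('T', i)
        for w in range(1, iters[i] + 1):
            x[i] = ('S', i, w, x[i])
        reads.append(x[i])
    return reads
```

Both runs yield identical read values for any trip counts, which is the semantics-preservation property of the expansion.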


Second Example

We now consider the program in Figure 5.9. Instances ⟨S, i, j⟩ and ⟨T, i, N⟩ are denoted by [i,j,1] and [i,N,2], respectively.

From (5.3), the relation S of reaching definitions is defined as:

S := {[i,j,1]->[i', j',1] : 1<=i,i'<=2N && 1<=j'<j<=N && i'-j'=i-j}

union {[i,j,1]->[i',j',1] : 1<=i,i'<=2N && N<j'<j<=2N && i'-j'=i-j}

union {[i,j,1]->[i',N,2] : 1<=i,i'<=2N && N<j<=2N && i'=i-j+N};

It is easy to compute relation ∼ since all array subscripts are affine: two instances of S or T, whose iteration vectors are (i, j) and (i′, j′), write in the same memory location iff i − j = i′ − j′. This relation is transitive, hence equal to its transitive closure. We call it May in Omega's syntax:

May := {[i,j,s]->[i',j',s'] : 1<=i,j,i',j'<=2N && i-j=i'-j' &&

(s=1 || (s=2 && j=N) || s'=1 || (s'=2 && j'=N))};

S' := inverse S;

R := S(S');

R;

{[i,j,1]->[i',j-i+i',1] : 1<=j<N && 1<=i<=2N-1 && 1<=i'<=2N-1
 && i<j+i' && j+i'<N+i} union

{[i,j,1]->[i',j-i+i',1] : N<j<=2N-1 && 1<=i<=2N-1 && 1<=i'<=2N-1

&& N+i<j+i' && j+i'<2N+i} union

{[i,N,2]->[i',N-i+i',1] : 1<=i<i'<=2N-1 && i'<N+i} union

{[i,j,1]->[N+i-j,N,2] : N<j<=2N-1 && i<=2N-1 && j<N+i} union

{[i,N,2]->[i,N,2] : 1<=i<=2N-1}

That is:

⟨T, i, N⟩ R ⟨T, i, N⟩ ⟺ 1 ≤ i ≤ 2N − 1
⟨S, i, j⟩ R ⟨S, i′, j′⟩ ⟺ (1 ≤ i, i′ ≤ 2N − 1) ∧ (i − j = i′ − j′)
    ∧ (1 ≤ j, j′ < N ∨ N < j, j′ ≤ 2N − 1)
⟨S, i, j⟩ R ⟨T, N + i − j, N⟩ ⟺ (1 ≤ i ≤ 2N − 1) ∧ (N < j ≤ 2N − 1) ∧ (j < N + i)
⟨T, i, N⟩ R ⟨S, i′, N − i + i′⟩ ⟺ (1 ≤ i < i′ ≤ 2N − 1) ∧ (i′ < N + i)

Relation R is already transitive: R* = R. Figure 5.10.a shows the equivalence classes of R*.

Let C be an equivalence class for relation ∼. There is an integer k such that C = {⟨S, i, j⟩ : i − j = k} ∪ {⟨T, k + N, N⟩}. Now, for u ∈ C, ρ(u) = min<seq({u′ ∈ W : u′ ∼ u ∧ u′ R* u}). Then, we compute ρ(u) using Omega:

1 − 2N ≤ i − j ≤ −N : ρ(⟨S, i, j⟩) = ⟨S, 1, 1 − i + j⟩
1 − N ≤ i − j ≤ N − 1 ∧ j < N : ρ(⟨S, i, j⟩) = ⟨S, i − j + 1, 1⟩
1 − N ≤ i − j ≤ N − 1 ∧ j ≥ N : ρ(⟨S, i, j⟩) = ⟨T, N + i − j, N⟩
N ≤ i − j ≤ 2N − 1 : ρ(⟨S, i, j⟩) = ⟨S, i − j + 1, 1⟩
1 ≤ i ≤ 2N − 1 : ρ(⟨T, i, N⟩) = ⟨T, i, N⟩


The result shows three intervals of constant cardinality of C/R*; they are described in Figure 5.10.b. A labeling can be found mechanically. If i − j ≤ −N or i − j ≥ N, there is only one representative, thus ν(⟨S, i, j⟩) = 1. If 1 − N ≤ i − j ≤ N − 1, there are two representatives; then we define ν(⟨S, i, j⟩) = 1 if j ≤ N, ν(⟨S, i, j⟩) = 2 if j > N, and ν(⟨T, i, N⟩) = 2.

The static expansion code appears in Figure 5.11. As hinted in Section 5.2.4, conditionals in ν have been taken out of array subscripts.

Array A is allocated as A[4*N, 2]. Note that some memory could have been spared by defining two different arrays: A1 standing for A[·, 0], holding 4N − 1 elements, and A2 standing for A[·, 1], holding only 2N − 1 elements. This idea was pointed out in Section 5.2.4.

Third Example: Non-Affine Array Subscripts

We come back to the program in Figure 5.12.a. Instances ⟨T, i, j⟩, ⟨S, i⟩ and ⟨R, i⟩ are written [i,j,1], [i,0,2] and [i,0,3].

From (5.4), we build the relation of reaching definitions:

S := {[i,0,3]->[i,j,1] : 1<=i,j<=N}

union {[i,0,3]->[i,0,2] : 1<=i<=N};

Since some subscripts are non-affine, we cannot compute at compile-time the exact relation between instances writing in some location A[x]. We can only make the following pessimistic approximation of ∼: all instances are related together (because they may assign the same memory location).

S' := inverse S;

R := S(S');

R;

{[i,j,1]->[i,j',1] : 1<=i<=N && 1<=j<=N
 && 1<=j'<=N} union

{[i,0,2]->[i,j',1] : 1<=i<=N && 1<=j'<=N} union

{[i,j,1]->[i,0,2] : 1<=i<=N && 1<=j<=N} union

{[i,0,2]->[i,0,2] : 1<=i<=N}

R is already transitive: R* = R.

There is only one equivalence class for ∼.

We compute ρ(u) using Omega:

∀i, 1 ≤ i ≤ N : ρ(⟨S, i⟩) = ⟨T, i, 1⟩
∀i, j, 1 ≤ i ≤ N, 1 ≤ j ≤ N : ρ(⟨T, i, j⟩) = ⟨T, i, 1⟩

Note that every ⟨T, i, j⟩ instance is in relation with ⟨T, i, 1⟩.

Computing ρ(W) yields N instances of the form ⟨T, i, 1⟩. Maximal static expansion of accesses to array A requires N memory locations. We can use i to label these representatives; thus the resulting function is:

ν(⟨S, i⟩) = ν(⟨T, i, j⟩) = i.


Using this labeling, all left-hand side references to A[·] become A[·, i] in the expanded code. Since the source of ⟨R, i⟩ is an instance of S or T at the same iteration i, the right-hand side of R is expanded the same way. Expanding the code thus leads to the intuitive result given in Figure 5.12.b.

The size declaration of A is now A[N+1, N+1].

5.2.8 Experiments

We ran a few experiments on an SGI Origin 2000, using the mp library. Implementation

issues are discussed in Section 5.2.9.

Performance Results for the First Example

For the first example, the parallel SA and MSE programs are given in Figure 5.14. Remember that w is an artificial counter of the while-loop, and M is the maximum number of iterations of this loop. We have seen that a φ function is necessary for SA form, but it can be computed at low cost: it represents the last iteration of the inner loop.

........................................................................................
Figure 5.14.a. Single-assignment:

    double xT[N], xS[N, M];
    parallel for (i=1; i<=N; i++) {
T     xT[i] = ...;
      w = 1;
      while (...) {
S       xS[i][w] = ... if (w==1) xT[i] else xS[i, w-1] ...;
        w++;
      }
R     ... = ... if (w==1) xT[i] else xS[i, w-1] ...;
      // the last two lines implement
      // φ({⟨T,i⟩} ∪ {⟨S,i,w⟩ : 1 ≤ w ≤ M})
    }

Figure 5.14.b. Maximal static expansion:

    double x[N+1];
    parallel for (i=1; i<=N; i++) {
T     x[i] = ...;
      while (...)
S       x[i] = ... x[i] ...;
R     ... = ... x[i] ...;
    }

. . . . . . . . . . . . . . . . . . . Figure 5.14. Parallelization of the first example . . . . . . . . . . . . . . . . . . .

The table in Figure 5.15 first describes speed-ups for the maximal static expansion relative to the original sequential program, then speed-ups for the MSE version relative to the single-assignment form. As expected, MSE shows better scaling, and the relative speed-up quickly goes over 2. Moreover, for larger memory sizes, the SA program may swap or fail for lack of memory.

5.2.9 Implementation

The maximal static expansion is implemented in C++ on top of the Omega library. Figure 5.16 summarizes the computation times for our examples (on a 32 MB Sun SPARCstation 5). These results do not include the computation times for reaching definition analysis and code generation.


........................................................................................
                                              (M, N)
 Configuration      200, 250   200, 500   200, 1000   200, 2000   200, 4000

 Speed-ups for MSE versus original program
 16 processors      6.72       9.79       12.8        13.4        14.7
 32 processors      5.75       9.87       15.3        21.1        24.8

 Speed-ups for MSE versus SA
 16 processors      1.43       1.63       1.79        1.96        2.07
 32 processors      1.16       1.33       1.52        1.80        1.99

 Figure 5.15. Speed-ups for the first example
........................................................................................

                                   1st example   2nd example   3rd example
 transitive closure (check)            100           100           110
 picking the representatives
   (function ρ)                        110           160           110
 other                                 130           150            70
 total                                 340           410           290

 Figure 5.16. Computation times for the three examples
........................................................................................

Moreover, computing the class representatives is relatively fast; this validates our choice to compute function ρ (mapping instances to their representatives) using a lexicographical minimum. The intuition behind these results is that the computation time mainly depends on the number of affine constraints in the data-flow analysis relation.

Our only concern, so far, would be to find a way to approximate the expressions of transitive closures when they become large.

Memory expansion techniques have two main drawbacks: high memory usage and run-time overhead. Parallelization via memory expansion thus requires both moderation in the expansion degree and efficiency in the run-time computation of data-flow restoration code.

Moderation in the expansion degree can be addressed in two ways: either with "hard constraints" such as the one presented in Section 5.2, or with optimization techniques that do not interfere with parallelism extraction. This section addresses such optimization

5.3. STORAGE MAPPING OPTIMIZATION 187

techniques, and presents the main results of a collaboration with Vincent Lefebvre. It can be seen as an extension of work by Feautrier and Lefebvre [LF98] and also by Strout et al. [SCFS98].

Our contributions are the following: we formalize the correctness of a storage mapping, according to a given parallel execution order, for any nest of loops with unrestricted conditional expressions and array subscripts; we show that the schedule-independent storage mappings defined in [SCFS98] correspond to correct storage mappings according to the data-flow execution order; and we present an algorithm for storage mapping optimization, applicable to any nest of loops and to all parallelization techniques based on polyhedral dependence graphs (i.e. captured by Presburger arithmetic).

5.3.1 Motivation

First Example: Dynamic Control Flow

We first study the kernel in Figure 5.17.a, which was already the first motivating example in Section 5.2. Parts denoted by "..." have no side-effect. Each loop iteration spawns instances of the statements included in the loop body.

........................................................................................
Figure 5.17.a. Original program:

    double x;
    for (i=1; i<=N; i++) {
T     x = ...;
      while (...) {
S       x = ... x ...;
      }
R     ... = ... x ...;
    }

Figure 5.17.b. Single-assignment form:

    double xT[N+1], xS[N+1, M+1];
    parallel for (i=1; i<=N; i++) {
T     xT[i] = ...;
      w = 1;
      while (...) {
S       xS[i][w] = ... if (w==1) xT[i] else xS[i, w-1] ...;
        w++;
      }
R     ... = ... if (w==1) xT[i] else xS[i, w-1] ...;
      // the last two lines implement
      // φ({⟨T,i⟩} ∪ {⟨S,i,w⟩ : 1 ≤ w ≤ M})
    }

Figure 5.17.c. Partial expansion:

    double xTS[N+1];
    parallel for (i=1; i<=N; i++) {
T     xTS[i] = ...;
      while (...) {
S       xTS[i] = ... xTS[i] ...;
      }
R     ... = ... xTS[i] ...;
    }

. . . . . . . . . . . . . . . . . . . . . . . . . . Figure 5.17. Convolution example . . . . . . . . . . . . . . . . . . . . . . . . . .

Any instancewise reaching definition analysis is suitable for our purpose, but FADA [BCF97] is preferred since it handles any loop nest and achieves today's best precision. Value-based dependence analysis [Won95] is also a good choice. In the following, the results for references to x in the right-hand sides of R and S are nested conditionals:

σ(⟨S, i, w⟩, x) = if w = 1 then {⟨T, i⟩} else {⟨S, i, w − 1⟩}
σ(⟨R, i⟩, x) = {⟨S, i, w⟩ : 1 ≤ w}.

Here, memory-based dependences hamper direct parallelization via scheduling or tiling. We need to expand scalar x and remove as many output, flow and anti-dependences as possible. Reaching definition analysis is at the core of single-assignment (SA) algorithms, since it records the location of values in expanded data structures. However, when the flow of data is unknown at compile-time, φ functions are introduced for run-time restoration of values [CFR+91, Col98]. Figure 5.17.b shows our program converted to SA form, with the outer loop marked parallel (M is the maximum number of iterations of the inner loop). A φ function is necessary but can be computed at low cost, since it represents the last iteration of the inner loop.

SA programs suffer from high memory requirements: S now assigns a huge N × M array. Optimizing memory usage is thus a critical point when applying memory expansion techniques to parallelization.

Figure 5.17.c shows the parallel program after partial expansion. Since T executes before the inner loop in the parallel version, S and T may assign the same array. Moreover, a one-dimensional array is sufficient since the inner loop is not parallel. As a side-effect, no φ function is needed any more. Storage requirement is N, to be compared with NM + N in the SA version, and with 1 in the original program (allowing no legal parallel reordering).

This partial expansion has been designed for a particular parallel execution order. However, it is easy to show that it is also compatible with all other execution orders, since the inner loop cannot be parallelized. We have thus built a schedule-independent (a.k.a. universal) storage mapping, in the sense of [SCFS98]. On many programs, a more memory-economical technique consists in computing a legal storage mapping according to a given parallel execution order, instead of finding a schedule-independent storage compatible with any legal execution order. This is done in [LF98] for affine loop nests only.

Second Example: A More Complex Parallelization

We now consider the program in Figure 5.18, which solves the well-known knapsack problem (KP). This kernel naturally models several optimization problems [MT90]. Intuitively: M is the number of objects, C is the "knapsack" capacity, W[k] (resp. P[k]) is the weight (resp. profit) of object number k; the problem is to maximize the profit without exceeding the capacity. Instances of S are denoted by ⟨S, k, W[k]⟩, ..., ⟨S, k, C⟩, for 1 ≤ k ≤ M.

........................................................................................

int A[C+1], W[M+1], P[M+1];

for (k=1; k<=M; k++)

for (j=W[k]; j<=C; j++)

S A[j] = max (A[j], P[k] + A[j-W[k]]);
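The kernel can be checked against an exhaustive reference. Note that with j increasing, the read A[j-W[k]] may see a value already updated for the same k, so the kernel computes the unbounded variant of the knapsack problem. The sketch below is illustrative (0-indexed Python with a dummy entry at index 0 to keep the 1-indexed notation of the text).

```python
from functools import lru_cache

def kp_kernel(C, W, P):
    """Direct transcription of the kernel; W and P carry a dummy
    entry at index 0 so that objects are numbered 1..M as in the text."""
    M = len(W) - 1
    A = [0] * (C + 1)
    for k in range(1, M + 1):
        for j in range(W[k], C + 1):
            A[j] = max(A[j], P[k] + A[j - W[k]])
    return A

def best(C, W, P):
    """Exhaustive reference: memoized recursion over remaining capacity."""
    @lru_cache(maxsize=None)
    def go(c):
        gains = [P[k] + go(c - W[k]) for k in range(1, len(W)) if W[k] <= c]
        return max([0] + gains)
    return go(C)
```

Both agree on every capacity, which confirms that the in-place update order is intentional.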


We suppose (from additional static analyses) that W[k] is always positive and less than or equal to an integer K. The results for references A[j] and A[j-W[k]] in the right-hand side of S are conditionals:

σ(⟨S, k, j⟩, A[j]) = if k = 1 then {⊥} else {⟨S, k − 1, j⟩}
σ(⟨S, k, j⟩, A[j-W[k]]) = {⟨S, k′, j′⟩ : 1 ≤ k′ ≤ k ∧ max(0, j − K) < j′ < j − 1}

First notice that program KP does not have any parallel loops, and that memory-based dependences hamper direct parallelization. Therefore, parallelizing KP requires the application of preliminary program transformations.

Thanks to the reaching definition information, Figure 5.19 shows program KP converted to SA form. The unique φ function implements a run-time choice between the values produced by {⟨S, k′, j′⟩ : 1 ≤ k′ ≤ k ∧ max(0, j − K) < j′ < j − 1}, for some read access ⟨S, k, j⟩ to A[j-W[k]].

........................................................................................

    int A[C+1], W[M+1], P[M+1];
    int AS[M+1, C+1];
    for (k=1; k<=M; k++)
      for (j=W[k]; j<=C; j++)
S       AS[k, j] = if (k==1)
                     max (A[j], P[1] + A[j-W[1]]);
                   else
                     max (AS[k-1, j],
                          P[k] + φ({⟨S,k′,j′⟩ : 1 ≤ k′ ≤ k ∧ max(0, j−K) < j′ < j−1}));

Eventually, in this particular case, the φ function is really easy to compute: the value of A[j-W[k]] has been "moved" by the SA transformation "to" AS[k, j-W[k]]. Then φ({⟨S, k′, j′⟩ : 1 ≤ k′ ≤ k ∧ max(0, j − K) < j′ < j − 1}) is equal to AS[k, j-W[k]]. This optimization avoids the use of temporary arrays. It can be performed automatically, along with other interesting optimizations; see Section 5.1.4.
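The SA transformation and the φ resolution can be played out concretely. In this hedged sketch, names such as phi and AS are illustrative, and φ is implemented as a naive backward search for the last write rather than the closed form computed by the analysis; the SA program is checked to produce the same final array as the in-place kernel.

```python
def kp_single_assignment(C, W, P):
    """SA form: each instance <S,k,j> writes its own cell AS[(k, j)].
    phi resolves a read of A[cell] at instance <S,k,j> to the last
    write of that cell executed before <S,k,j>, or to the initial value."""
    M = len(W) - 1
    A0 = [0] * (C + 1)           # initial contents of A
    AS = {}                      # AS[(k, j)]: value written by <S,k,j>

    def phi(cell, k, j):
        # naive backward scan over rows k, k-1, ..., 1
        for kk in range(k, 0, -1):
            executed_before = (kk < k) or (cell < j)
            if W[kk] <= cell <= C and executed_before:
                return AS[(kk, cell)]    # <S,kk,cell> wrote A[cell]
        return A0[cell]                  # bottom: value from before the nest

    for k in range(1, M + 1):
        for j in range(W[k], C + 1):
            AS[(k, j)] = max(phi(j, k, j), P[k] + phi(j - W[k], k, j))

    # the final A[j] is the value of the last write to cell j
    return [phi(j, M, C + 1) for j in range(C + 1)]
```

For the A[j-W[k]] read, the scan stops at row k itself whenever that cell was written in row k, which matches the forward substitution φ(...) = AS[k, j-W[k]] described above.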

The good thing with SA-transformed programs is that the only remaining dependences are true dependences, between a reaching definition instance and its use instances. Thus a legal parallel schedule for program KP is: "execute instance ⟨S, k, j⟩ at step k + j"; see Figure 5.20 (and Section 2.5.2 for schedule computation).

Since KP is a perfectly nested loop, it is also possible to apply tiling techniques to single-assignment KP, based on instancewise reaching definition information. Tiling techniques improve data locality and reduce communications by grouping together computations affecting the same part of a data structure (see Section 2.5.2). Rectangular m × c tiles seem appropriate in our case; the height m and width c can be tuned thanks to theoretical models [IT88, CFH95, BDRR94] or profiling techniques. The knapsack problem has been much studied and very efficient parallelizations have been crafted by Andonov and Rajopadhye [AR94]; see also [BBA98] for additional information on tiling


........................................................................................

[Three diagrams over the (k, j) iteration domain: the dependences, the fronts of the schedule k + j, and a rectangular tiling.]

. . . . . . Figure 5.20. Instancewise reaching definitions, schedule, and tiling for KP . . . . . .

the knapsack algorithm. The third graph in Figure 5.20 represents 2 × 2 tiles, but larger sizes are used in practice; see Section 5.3.10.

Consider the dependences in Figure 5.20. The value produced by instance ⟨S, k, j⟩ may be used by ⟨S, k, j + 1⟩, ..., ⟨S, k, min(C, j + K)⟩ or by ⟨S, k + 1, j⟩. Using the schedule or the tiling proposed in Figure 5.20, we can prove that a value produced during the execution stops being useful after a given delay: if 1 ≤ k, k′ ≤ M and 1 ≤ j, j′ ≤ C are such that k + j + K < k′ + j′, the value produced by ⟨S, k, j⟩ is not used by ⟨S, k′, j′⟩. This allows a cyclic folding of the storage mapping: every access of the form AS[k, j] can be safely replaced by AS[k % (K+1), j]. The result is shown in Figure 5.21.
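The delay property that justifies the folding can be checked empirically on a given instance. The sketch below is illustrative: it tracks last writers, which matches the reaching definitions only on simple instances, so the test uses equal weights W[k] = K for all k. It measures the largest front distance (k′ + j′) − (k + j) between a producer and a consumer.

```python
def max_use_delay(C, W):
    """Replay the kernel's access pattern, tracking for each read the
    front distance (k'+j') - (k+j) to the instance that last wrote
    the cell being read."""
    M = len(W) - 1
    writer = {}                  # cell -> (k, j) of its last writer
    delay = 0
    for k in range(1, M + 1):
        for j in range(W[k], C + 1):
            for cell in (j, j - W[k]):       # the two reads of <S,k,j>
                if cell in writer:
                    kw, jw = writer[cell]
                    delay = max(delay, (k + j) - (kw + jw))
            writer[j] = (k, j)               # the write of <S,k,j>
    return delay
```

A delay bounded by K means a value is dead K + 1 fronts after its production, so row k of AS can be reused at row k + K + 1: this is the folding AS[k % (K+1), j].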

........................................................................................

    int A[C+1], W[M+1], P[M+1];
    int AS[K+2, C+1];
    for (k=1; k<=M; k++)
      for (j=W[k]; j<=C; j++)
S       AS[k % (K+1), j] = if (k==1)
                             max (A[j], P[1] + A[j-W[1]]);
                           else
                             max (AS[(k-1) % (K+1), j],
                                  P[k] + φ({⟨S,k′,j′⟩ : 1 ≤ k′ ≤ k ∧ max(0, j−K) < j′ < j−1}));

Storage requirement for array AS is (K + 1)C, to be compared with MC in the SA version, and with C in the original program (where no legal parallel reordering was possible). This suggests two observations:

first, the gain is only significant when K is much smaller than M, which may not be the case in practice;

second, the expanded subscript in the left-hand side is not affine any more, since K is a symbolic constant.

In general, when the cyclic folding is based on a symbolic constant (like K), it becomes both difficult to measure the effectiveness of the optimization and to reuse the generated


code in subsequent analyses. In [Lef98], Lefebvre proposed to forbid such symbolic foldings, but we believe they can still be useful when some compile-time information on the symbolic bounds (like K) is available.

Eventually, this partial expansion is not schedule-independent, because it highly depends on the "parallel front" direction associated with the proposed schedule and tiling.

Given an original program (<seq, fe), we suppose that an instancewise reaching definition analysis has already been performed, yielding relation σ, and that a parallel execution order <par has been computed using some suitable technique (see Section 2.5.2). Our problem is now to compute a new storage mapping fe^exp such that (<par, fe^exp) preserves the original semantics of (<seq, fe).

Given a parallel execution order <par, we have to characterize correct expansions, i.e. expansions allowing the parallel execution to preserve the program semantics. In addition to the conflict relation ∼e, we use the no-conflict relation ≁e, which is the complement of ∼e. As in Section 2.4.1, we build a conservative approximation ≁ of this relation:

∀e ∈ E, ∀v, w ∈ Ae : fe(v) ≠ fe(w) ⟹ v ≁ w.

Since both approximations ∼ and ≁ are conservative, we have to be very careful: they are not complementary in general. Indeed, ∼e and ≁e are complementary for the same execution e ∈ E, but ∼ is defined as a "may conflict" approximation for all executions, and ≁ is the negation of the "must conflict" approximation.

Our first task is to formalize the memory reuse constraints enforced by the partial order <par. We introduce σ′e, the exact reaching definition function for a given execution e of the parallelized program (<par, fe^exp).12 The expansion is correct iff, for every program execution, the source of every access is the same in the sequential and in the parallel program:

∀e ∈ E, ∀u ∈ Re, ∀v ∈ We : v = σe(u) ⟹ v = σ′e(u).   (5.14)

We are looking for a correctness criterion telling whether two writes may use the same memory location or not. To do this, we return to the definition of σ′e:

∀e ∈ E : v = σ′e(u) ⟺
v <par u ∧ fe^exp(u) = fe^exp(v) ∧ (∀w ∈ We : u <par w ∨ w <par v ∨ fe^exp(v) ≠ fe^exp(w)).   (5.15)

Plugging (5.15) into (5.14), we get

∀e ∈ E, ∀u ∈ Re, ∀v, w ∈ We : v = σe(u) ∧ u ⊀par w ∧ w ⊀par v ⟹
v <par u ∧ fe^exp(u) = fe^exp(v) ∧ fe^exp(v) ≠ fe^exp(w).

We may simplify this result, since the constraints v <par u and fe^exp(u) = fe^exp(v) are already implied by v = σe(u), through (5.14), and do not bring any information between fe^exp(v) and fe^exp(w):

∀e ∈ E, ∀u ∈ Re, ∀v, w ∈ We :
v = σe(u) ∧ u ⊀par w ∧ w ⊀par v ⟹ fe^exp(v) ≠ fe^exp(w).   (5.16)

12 The fact that <par is not a total order makes no difference for reaching definitions.


It means that we cannot reuse memory (i.e. we must expand) when both v = σe(u) and w ⊀par v ∧ u ⊀par w are true. Starting from this dynamic correctness condition, we would like to deduce a correctness criterion based on static knowledge only. This criterion must be valid for all executions; in other terms, it should be stronger than condition (5.16).

We can now expose the expansion correctness criterion. It requires the reaching definition v of a read u and another write w to assign different memory locations when: w executes between v and u in the parallel program, and either w does not execute between v and u, or w may assign a different memory location than v (v ≁ w), in the original program; see Figure 5.22. Here is the precise formulation of the correctness criterion:

Theorem 5.2 (correctness of storage mappings) If the following condition holds, then the expansion is correct, i.e. it allows the parallel execution to preserve the program semantics:

∀e ∈ E, ∀v, w ∈ W :
(∃u ∈ R : v σ u ∧ w ⊀par v ∧ u ⊀par w ∧ (u <seq w ∨ w <seq v ∨ v ≁ w))
⟹ fe^exp(v) ≠ fe^exp(w).   (5.17)

Proof: We first rewrite the definition of v being the reaching definition of u:

∀e ∈ E, ∀u ∈ Re, ∀v ∈ We :
v = σe(u) ⟹ v <seq u ∧ fe(u) = fe(v) ∧ (∀w ∈ We : u <seq w ∨ w <seq v ∨ fe(v) ≠ fe(w)).

As a consequence,

∀e ∈ E, ∀u ∈ Re, ∀v ∈ We :
v = σe(u) ⟹ (∀w ∈ We : u <seq w ∨ w <seq v ∨ fe(v) ≠ fe(w)).   (5.18)

The right-hand side of (5.18) can be inserted into (5.16) as an additional constraint: (5.16) is equivalent to

∀e ∈ E, ∀u ∈ Re, ∀v, w ∈ We :
v = σe(u) ∧ w ⊀par v ∧ u ⊀par w ∧ (u <seq w ∨ w <seq v ∨ fe(v) ≠ fe(w))
⟹ fe^exp(v) ≠ fe^exp(w).   (5.19)

Let us now replace σe with its approximation σ in (5.19), using v = σe(u) ⟹ v σ u:

∀e ∈ E, ∀u ∈ Re, ∀v, w ∈ We :
v σ u ∧ (u <seq w ∨ w <seq v ∨ fe(v) ≠ fe(w)) ∧ w ⊀par v ∧ u ⊀par w
⟹ fe^exp(v) ≠ fe^exp(w).

Then, we replace fe(v) ≠ fe(w) with its approximation ≁, using fe(v) ≠ fe(w) ⟹ v ≁ w:

∀v, w ∈ W :
(∃u ∈ R : v σ u ∧ w ⊀par v ∧ u ⊀par w ∧ (u <seq w ∨ w <seq v ∨ v ≁ w))
⟹ fe^exp(v) ≠ fe^exp(w).

This proves that (5.17) is stronger than (5.19), itself equivalent to (5.16).

Notice that we returned to the definition of σe at the beginning of the proof. Indeed, some information on the storage mapping may be available, and we do not want to lose it:13 the right-hand side of (5.18) gathers information on w which would have been lost in approximating σe by σ in (5.16). Without this information on w, we would have computed the following correctness criterion:

∀e ∈ E, ∀v, w ∈ W :
(∃u ∈ R : v σ u ∧ u ⊀par w ∧ w ⊀par v) ⟹ fe^exp(v) ≠ fe^exp(w).   (5.20)

Sadly, this choice is not satisfying here.14 Indeed, consider the motivating example: two instances ⟨S, i, w⟩ and ⟨S, i, w′⟩ would satisfy the left-hand side of (5.20) as long as w ≠ w′. Therefore, they should assign different memory locations in any correct expanded program. This leads to the single-assignment version of the program... but we showed in Section 5.3.1 that a more memory-economical solution was available: see Figure 5.17.c.

A precise look at (5.16) explains why replacing σe with σ in (5.16) is too conservative: it "forgets" that w is not executed after the reaching definition σe(u). Indeed, w ⊀par v in the left-hand side of (5.20) is much stronger: it states that w is not executed after any possible reaching definition of u, which includes many instances executing before the reaching definition σe(u).

In the following, we introduce a new notation for the expansion correctness criterion: the interference relation ⋈ is defined as the symmetric closure of the left-hand side of (5.17):

∀v, w ∈ W : v ⋈ w ⟺def
(∃u ∈ R : v σ u ∧ w ⊀par v ∧ u ⊀par w ∧ (u <seq w ∨ w <seq v ∨ v ≁ w))
∨ (∃u ∈ R : w σ u ∧ v ⊀par w ∧ u ⊀par v ∧ (u <seq v ∨ v <seq w ∨ w ≁ v)).   (5.21)

We take the symmetric closure because v and w play symmetric roles in (5.17). Using

a tool like Omega [Pug92], it is much easier to handle set and relation operations than

13 Such information may be more precise than deriving it from the approximate reaching definition relation σ.
14 This criterion was enough for Lefebvre and Feautrier in [LF98], since they only considered affine loop nests and exact reaching definition relations.


........................................................................................
 Sequential order <seq:   w <seq v  or  v ≁ w  or  u <seq w,  with  v ∈ σ(u) ... u
 Parallel order <par:     w ⊀par v  and  u ⊀par w

 . . . . . . . . . . . . . . . Figure 5.22. Cases of fe^exp(v) ≠ fe^exp(w) in (5.17) . . . . . . . . . . . . . . .

logic formulas with quantifiers. We thus rewrite the previous definition using algebraic operations:15

⋈ = ((σ(R) × W) ∩ ⊀par ∩ (>seq ∪ ≁)) ∪ (⊀par ∩ σ(⊀par ∩ <seq))
  ∪ ((σ(R) × W)⁻¹ ∩ ⊀par⁻¹ ∩ (<seq ∪ ≁)) ∪ (⊀par⁻¹ ∩ (σ(⊀par ∩ <seq))⁻¹).   (5.22)

Rewriting (5.17) with this new syntax, v and w must assign distinct memory locations when v ⋈ w; one may say that "v interferes with w":

∀e ∈ E, ∀v, w ∈ W : v ⋈ w ⟹ fe^exp(v) ≠ fe^exp(w).   (5.23)

An algorithm to compute fe^exp from Theorem 5.2 is presented in Section 5.3.4. Notice that we compute an exact storage mapping fe^exp, which depends on the execution.
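Criterion (5.23) suggests a simple storage allocation scheme: build the interference relation among writes, then assign memory cells greedily so that interfering writes never share a cell. The sketch below is illustrative, not the algorithm of Section 5.3.4; the interference test is a coarse stand-in for relation (5.22), specialized to the first example with the outer loop parallel, and it recovers the N-cell mapping of Figure 5.17.c.

```python
N, M = 3, 2
# Writes of the first example: <T,i> = (i, 0), <S,i,w> = (i, w), w >= 1.
writes = [(i, w) for i in range(1, N + 1) for w in range(M + 1)]

def before_par(u, v):
    """Parallel order <par: the outer loop on i is parallel, the
    inner sequence T; S,1; ...; S,M of each iteration is kept."""
    return u[0] == v[0] and u[1] < v[1]

def interfere(u, v):
    """Coarse interference test: u and v interfere when neither is
    ordered before the other in the parallel program."""
    return u != v and not before_par(u, v) and not before_par(v, u)

# Greedy cell assignment: scan writes in sequential order and give
# each one the smallest cell not used by an interfering write.
cell = {}
for v in writes:
    used = {cell[u] for u in cell if interfere(u, v)}
    cell[v] = min(c for c in range(len(writes)) if c not in used)
```

All writes of one iteration i share a single cell and different iterations get different cells, i.e. N cells in total: the storage mapping xTS[i] of Figure 5.17.c.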

We start with three examples showing the usefulness of each constraint in the definition of ⋈; see Figure 5.23.

We now present the following optimality result:16

Proposition 5.2 Let <par be a parallel execution order. Consider two writes v and w such that v ⋈ w (defined in (5.22) page 194), and a storage mapping fe^exp such that fe^exp(v) = fe^exp(w), i.e. fe^exp does not satisfy the expansion correctness criterion defined by Theorem 5.2. Then, executing program (<par, fe^exp) violates the original program semantics, according to approximations σ and ≁.

Proof: Suppose v σ u ∧ w ⊀par v ∧ u ⊀par w ∧ (u <seq w ∨ w <seq v ∨ v ≁ w) is satisfied for a read u and two writes v and w. One may distinguish three cases regarding the execution of w relative to u and v; see Figure 5.22.

¹⁵ Each line of (5.21) is rewritten independently, then predicates depending on u are separated from the others. The existential quantification on u is captured by composition with δ. Because v is the possible reaching definition of some read access, intersection with (δ(R) × W) is necessary in the first disjunct of each line.

¹⁶ See Section 2.4.4 for a general remark about optimality.

5.3. STORAGE MAPPING OPTIMIZATION 195

........................................................................................

T  x = ...;
S  x = ...;
R  ... = x ...;

S ∥ T <_seq R is legal but requires renaming: this is enforced by T <_seq S, i.e. w <_seq v (and T ≮_par S, i.e. w ≮_par v, and R ≮_par T, i.e. u ≮_par w).

Figure 5.23.a. Constraints w <_seq v and w ≮_par v, u ≮_par w

S  x = ...;
R  ... = x ...;
T  x = ...;

S <_seq T <_seq R is legal but requires renaming: this is enforced by R <_seq T, i.e. u <_seq w.

Figure 5.23.b. Constraints w ≮_par v, u ≮_par w and u <_seq w

S  A[1] = ...;
T  A[foo...] = ...;
R  ... = A[1] ...;

S ∥ T <_seq R is legal but requires renaming: this is enforced by S ≁ T, i.e. v ≁ w, since S may assign a different memory location than T.

Figure 5.23.c. Constraints w ≮_par v, u ≮_par w and v ≁ w

Figure 5.23. Motivating examples for each constraint in the definition of the interference relation

........................................................................................

The first two cases are (1) u executes before w in the sequential program, i.e. u <_seq w, or (2) w executes before v in the sequential program, i.e. w <_seq v: then w must assign a different memory location than v, otherwise the value produced by v would never reach u as in the sequential program.

When w executes neither before v nor after u in the sequential program, one may keep v and w assigning the same memory location if it was the case in the sequential program. However, if it might not be the case, i.e. if v ≁ w, then w must assign a different memory location than v, otherwise the value produced by v would never reach u as in the sequential program.

5.3.4 Algorithm

The formalism presented in the previous section is general enough to handle any imperative program. However, as a compromise between expressivity and computability, and because our preferred reaching definition analysis is FADA [BCF97], we choose affine relations as an abstraction. Tools like Omega [Pug92] and PIP [Fea91] can thus be used for symbolic computations, but our program model is now restricted to loop nests operating on arrays, with unrestricted conditionals, loop bounds and array subscripts.

Finding the minimal amount of memory to store the values produced by the program is a graph coloring problem where vertices are instances of writes and edges represent interferences between instances: there is an edge between v and w iff they cannot share the same memory location, i.e. when v ⋈ w. Since classic coloring algorithms only apply to finite graphs, Feautrier and Lefebvre designed a new algorithm [LF98], which we extend to general loop nests.

The most general application of our technique starts with an instancewise reaching definition analysis, then applies a parallelization algorithm using δ as the dependence graph, thus building the partial order <_par, and eventually applies the following partial expansion algorithm.

Partial Expansion Algorithm

Storage-Mapping-Optimization and SMO-Convert-Quast are simple extensions of the classical single-assignment algorithms for loop nests, see Section 5.1. Input is the sequential program, the results δ and ≁ of an instancewise analysis, and the parallel execution order <_par (not used for simple SA-form conversion). The big difference with SA form is the computation of an expansion vector E_S of integers or symbolic constants: its purpose is to reduce the memory usage of each expanded array A_S with a "cyclic folding" of memory references, see Build-Expansion-Vector in Section 5.3.5. To reduce the number of expanded arrays, partial renaming is called at the end of the process to coalesce data structures using a classical graph coloring heuristic, see Partial-Renaming in Section 5.3.5.

Storage-Mapping-Optimization (program, δ, ≁, <_par)
program: an intermediate representation of the program
δ: the reaching definition relation, seen as a function
≁: the no-conflict relation
<_par: the parallel execution order
returns an intermediate representation of the expanded program
1   ⋈ ← (δ(R) × W) ∩ ≮_par ∩ (>_seq ∪ ≁)  ∪  ≮_par ∩ (δ ∘ (≮_par ∩ <_seq))
2        ∪ (δ(R) × W)⁻¹ ∩ ≮_par⁻¹ ∩ (<_seq ∪ ≁)  ∪  ≮_par⁻¹ ∩ (δ ∘ (≮_par ∩ <_seq))⁻¹
3   for each array A in program
4     do for each statement S assigning A in program
5          do E_S ← Build-Expansion-Vector (S, ⋈)
6             declare an array A_S
7             left-hand side of S ← A_S[Iter(CurIns) % E_S]
8        for each reference ref to A in program
9          do δ_ref ← δ ∩ (I × ref)
10            quast ← Make-Quast (δ_ref)
11            map ← SMO-Convert-Quast (quast, ref)
12            ref ← map (CurIns)
13  program ← Partial-Renaming (program, ⋈)
14  return program

This algorithm outputs an expanded program whose data layout is well suited to the parallel execution order <_par: we are assured that the original program semantics will be preserved in the parallel version.

Two technical issues have been pointed out: how is the expansion vector E_S built for each statement S? How is partial renaming performed? This is the purpose of Section 5.3.5.

Building an Expansion Vector

For each statement S, the expansion vector must ensure that two instances v and w assign different memory locations when v ⋈ w. Moreover, it should introduce memory reuse wherever the interference relation allows it.


SMO-Convert-Quast (quast, ref)
quast: the quast representation of the reaching definition function
ref: the original reference, used when ⊥ is encountered
returns the implementation of quast as value-retrieval code for reference ref
1   switch
2     case quast = {⊥} :
3       return ref
4     case quast = {ı} :
5       A ← Array(ı)
6       S ← Stmt(ı)
7       x ← Iter(ı)
8       return A_S[x % E_S]
9     case quast = {ı₁, ı₂, ...} :
10      return φ({ı₁, ı₂, ...})
11    case quast = if predicate then quast₁ else quast₂ :
12      return if predicate SMO-Convert-Quast (quast₁, ref)
               else SMO-Convert-Quast (quast₂, ref)

Building an expanded program with memory reuse on S introduces output dependences between some instances of this statement (there is an output dependence between two instances v and w in the expanded code if v ∈ W, w ∈ W and feexp(v) = feexp(w)). An output dependence between v and w is valid in the expanded program iff the left-hand side of the expansion correctness criterion is false for v and w, i.e. iff v and w are not related by ⋈. Such an output dependence is called a neutral output dependence [LF98]. The aim is to elaborate an expansion vector which gives A_S an optimized but sufficient shape to only authorize neutral output dependences on S.

The dimension of E_S is equal to the number of loops surrounding S, written N_S. Each element E_S[p+1] is the expansion degree of S at depth p (the depth of the loop considered), with p ∈ {0, ..., N_S − 1}, and gives the size of dimension (p+1) of A_S. Each dimension of A_S must have a sufficient size to forbid any non-neutral output dependence. For a given access v, the set of instances which may not write in the same location as v can be deduced from the expansion correctness criterion (5.17); call it W_p^S(v). It holds all instances w such that:

- w is an instance of S: Stmt(w) = S;
- Iter(v)[1..p] = Iter(w)[1..p] and Iter(v)[p+1] < Iter(w)[p+1];
- and v ⋈ w.

Let w_p^S(v) be the lexicographic maximum of W_p^S(v). For all w in W_p^S(v), we have the following relations:

Iter(v)[1..p] = Iter(w)[1..p] = Iter(w_p^S(v))[1..p]
Iter(v)[p+1] < Iter(w)[p+1] ≤ Iter(w_p^S(v))[p+1]

If E_S[p+1] is equal to (Iter(w_p^S(v))[p+1] − Iter(v)[p+1] + 1), and knowing that the index function will be A_S[Iter(v) % E_S], we ensure that no non-neutral output dependence appears between v and any instance of W_p^S(v). But this property must be


verified for each instance of S, and E_S should be set to the maximum of (Iter(w_p^S(v))[p+1] − Iter(v)[p+1] + 1) over all instances v of S. This proves that the following definition of E_S forbids any output dependence between instances of S in relation with ⋈:

E_S[p+1] = max { Iter(w_p^S(v))[p+1] − Iter(v)[p+1] + 1 : v ∈ W ∧ Stmt(v) = S }   (5.24)

Computing this for each dimension of E_S ensures that A_S has a sufficient size for the expansion to preserve the sequential program semantics. This is the purpose of Build-Expansion-Vector: working is the relation (v, W_p^S(v)) and maxv is the relation (v, w_p^S(v)). For a detailed proof, an intuitive introduction and related works, see [LF98, Lef98]. For the Build-Expansion-Vector algorithm, the simplest optimality concept is defined by the number of integer-valued components in E_S, i.e. the number of "projected" dimensions, as proposed by Quilleré and Rajopadhye in [QR99]. But even with this simple definition, optimality is still an open problem. Since the algorithm proposed by [QR99] has been proven optimal, we should try to combine both techniques to yield better results, but this is left for future work.

Build-Expansion-Vector (S, ⋈)
S: the current statement
⋈: the interference relation
returns expansion vector E_S (a vector of integers or symbolic constants)
1   N_S ← number of loops surrounding S
2   for p = 0 to N_S − 1
3     do working ← {(v, w) : ⟨S, v⟩ ∈ W ∧ ⟨S, w⟩ ∈ W
4                   ∧ v[1..p] = w[1..p] ∧ v[p+1] < w[p+1]
5                   ∧ ⟨S, v⟩ ⋈ ⟨S, w⟩}
6        maxv ← {(v, max_<lex {w : (v, w) ∈ working})}
7        vector[p+1] ← max {w[p+1] − v[p+1] + 1 : (v, w) ∈ maxv}
8   return vector

Now, a component of E_S computed by Build-Expansion-Vector can be a symbolic constant. When this constant can be proven "much smaller" than the associated dimension of the iteration space of S, it is useful for reducing memory usage; but if such a result cannot be shown with the available compile-time information, the component is set to +∞, meaning that no modulo computation should appear in the generated code (for this particular dimension). The interpretation of "much smaller" depends on the application: Lefebvre considered in [Lef98] that only integer constants were allowed in E_S, but we believe that this requirement is too strong, as shown in the knapsack example (a modulo K + 1 is needed).

Partial Renaming

Now that every array A_S has been built, one can perform an additional storage reduction on the generated code. Indeed, for two statements S and T, partial expansion builds two structures A_S and A_T which can have different shapes. If, at the end of the renaming process, S and T are authorized to share the same array, this one would have to be the rectangular hull of A_S and A_T: A_ST. It is clear that these two statements can share the same data iff this sharing does not contradict the expansion correctness criterion for instances of S and T. One must verify, for every instance u of S and v of T, that the value produced by u (resp. v) cannot be killed by v (resp. u) before it stops being useful.

Finding the minimal renaming is NP-complete. Our method consists in building a graph similar to the interference graph used in the classic register allocation process. In this graph, each vertex represents a statement of the program. There is an edge between two vertices S and T iff it has been shown that they cannot share the same data structure in their left-hand sides. Then one applies a greedy coloring algorithm on this graph. Finally, vertices that have the same color can share the same data structure. This partial renaming algorithm is sketched in Partial-Renaming (the Greedy-Coloring algorithm returns a function mapping each statement to a color).

Partial-Renaming (program, ⋈)
program: the program where partial renaming is required
⋈: the interference relation
returns the program with coalesced data structures
1   for each array A in program
2     do interfere ← ∅
3        for each pair of statements S and T assigning A in program
4          do if ∃⟨S, v⟩, ⟨T, w⟩ ∈ W : ⟨S, v⟩ ⋈ ⟨T, w⟩
5               then interfere ← interfere ∪ {(S, T)}
6        coloring ← Greedy-Coloring (interfere)
7        for each statement S assigning A in program
8          do left-hand side A[subscript] of S ← A_coloring(S)[subscript]
9   return program

The partial expansion algorithm often yields poor results, especially on tiled programs. The reason is that subscripts of expanded arrays are of the form A_S[subscript % E_S], and the block regularity of tiled programs does not really fit in this cyclic pattern. Figure 5.24 shows an example of what we would like to achieve on some block-regular expansions. No cyclic folding would be possible on such an example, since the two outer loops are parallel.

The design of an improved graph coloring algorithm able to consider both block and cyclic patterns is still an open problem, because it requires non-affine constraints to be optimized. We only propose a work-around, which works when some a priori knowledge of the tile shape is available. The technique consists in dividing each dimension by the associated tile size. Sometimes the resulting storage mapping will be compatible with the required parallel execution, and sometimes not: the decision is made with Theorem 5.2. Expanded array subscripts are thus of the form A_S[i₁/shape₁, ..., i_N/shape_N], where (i₁, ..., i_N) is the iteration vector associated with CurIns (defined in Section 5.1), and where shape_i is either 1 or the size of the i-th dimension of the tile.

It is possible to improve this technique by combining divisions and modulo operations, but the expansion scheme is somewhat different: see Section 5.4.6.


........................................................................................

int x;
for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
S   x = ...;
R   ... = x ...;
  }

int xS[N, N];
for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
S   xS[i, j] = ...;
R   ... = xS[i, j] ...;
  }

int xS[N/16, N/16];
parallel for (i=0; i<N; i+=16)
  parallel for (j=0; j<N; j+=16)
    for (ii=0; ii<16; ii++)
      for (jj=0; jj<16; jj++) {
S       xS[i/16, j/16] = ...;
R       ... = xS[i/16, j/16] ...;
      }

Figure 5.24. Block-regular expansion: original program, single-assignment form, and block-folded tiled form

........................................................................................

The technique presented in Section 5.3.4 yields the best results, but involves an external parallelization technique, such as scheduling or tiling. It is well suited to parallelizing compilers.

A schedule-independent (a.k.a. universal) storage mapping [SCFS98] is useful whenever no parallel execution scheme is enforced. The aim is to preserve the "portability" of SA form, at a much lower cost in memory usage.

From the definition of ⋈ (the interference relation) in (5.21), and considering two parallel execution orders <¹_par and <²_par whose associated interference relations are ⋈₁ and ⋈₂, we have:

<¹_par ⊆ <²_par ⟹ ⋈₂ ⊆ ⋈₁.

Now, a schedule-independent storage mapping feexp must be compatible with any possible parallel execution <_par of the program. The partial order <_par used in the Storage-Mapping-Optimization algorithm should thus be included in any correct execution order. By definition of correct execution orders (Theorem 2.2, page 81), this condition is satisfied by the data-flow execution order, which is the transitive closure of the reaching definition relation: δ⁺.

Section 3.1.2 describes a way to compute the transitive closure of δ (useful remarks and an experimental study are also presented in Section 5.2.5). In general, no exact result can be hoped for the data-flow execution order δ⁺, because Presburger arithmetic is not closed under transitive closure. Hence we need to compute an approximate relation. Because the approximation must be included in all possible correct execution orders, we want it to be a sub-order of the exact data-flow order (i.e. the opposite of a conservative approximation). Such an approximation can be computed with Omega [Pug92].


Implementing φ functions for a partially expanded program is not very different from what we have seen in Section 5.1.3. Indeed, algorithm Loop-Nests-Implement-Phi applies without modification. But doing this, no storage mapping optimization is performed on φ-arrays. Now, remember that φ-arrays are supposed to be in one-to-one mapping with expanded data structures. Single-assignment φ-arrays are not necessary to preserve the semantics of the original program, since the same dependences will be shared by expanded arrays and φ-arrays.

The resulting code generation algorithm is very similar to Loop-Nests-Implement-Phi. The first step consists in replacing every φ-array reference A_S[x] with its "folded" counterpart A_S[x % E_S]. In a second step, one merges φ-arrays together using the result of algorithm Partial-Renaming.

Eventually, for a given φ function, the set of possible reaching definitions should be reconsidered: values produced by a few instances may now be overwritten, according to the new storage mapping. As in the motivating example, the φ function can even disappear, see Figure 5.17. A good technique to automatically achieve this is not to perform a new reaching definition analysis. One should update the available sets of reaching definitions: a φ(set) reference should be replaced by

φ({v ∈ set : ∄w ∈ set : v <_seq w ∧ feexp(v) = feexp(w)}).

Moreover, if coloring is the result of the greedy graph coloring algorithm in Partial-Renaming, feexp(⟨s, x⟩) = feexp(⟨s′, x′⟩) is equivalent to

coloring(s) = coloring(s′) ∧ (x mod E_s = x′ mod E_{s′}).

First Example

Using the Omega Calculator text-based interface, we describe a step-by-step execution of the expansion algorithm. We have to code instances as integer-valued vectors. An instance ⟨s, i⟩ is denoted by vector [i, ..., s], possibly padded with zeroes. We number T, S, R with 1, 2, 3 in this order, so ⟨T, i⟩, ⟨S, i, j⟩ and ⟨R, i⟩ are written [i,0,1], [i,j,2] and [i,0,3], respectively.

Schedule-dependent storage mapping. We first apply the partial expansion algorithm according to the parallel execution order proposed in Figure 5.17.

The result of the instancewise reaching definition analysis is written in Omega's syntax:

S := {[i,0,2]->[i,0,1] : 1<=i<=N}

union {[i,w,2]->[i,w-1,2] : 1<=i<=N && 1<=w}

union {[i,0,3]->[i,0,1] : 1<=i<=N}

union {[i,0,3]->[i,w,2] : 1<=i<=N && 0<=w};

The no-conflict relation is trivial here, since the only data structure is a scalar variable:

NCon := {[i,w,s]->[i',w',s'] : 1=2}; # 1=2 means FALSE!


We consider that the outer loop is parallel. It gives the following execution order:

Par := {[i,w,2] -> [i,w',2] : 1 <= i <= N && 0 <= w < w'} union

{[i,0,1] -> [i,w',2] : 1 <= i <= N && 0 <= w'} union

{[i,0,1] -> [i,0,3] : 1 <= i <= N} union

{[i,w,2] -> [i,0,3] : 1 <= i <= N && 0 <= w};

We then compute the interference relation; call it Int:

# The "full" relation

Full := {[i,w,s]->[i',w',s'] : 1<=s<=3 && (s=2 || w=w'=0)

&& 1<=i,i'<=N && 0<=w,w'};

Lex := {[i,w,2]->[i',w',2] : 1<=i<=i'<=N && 0<=w,w' && (i<i' || w<w')}

union {[i,0,1]->[i',0,1] : 1<=i<i'<=N}

union {[i,0,3]->[i',0,3] : 1<=i<i'<=N}

union {[i,0,1]->[i',w',2] : 1<=i<=i'<=N && 0<=w'}

union {[i,w,2]->[i',0,1] : 1<=i,i'<=N && 0<=w && i<i'}

union {[i,0,1]->[i',0,3] : 1<=i<=i'<=N}

union {[i,0,3]->[i',0,1] : 1<=i<i'<=N}

union {[i,w,2]->[i',0,3] : 1<=i<=i'<=N && 0<=w}

union {[i,0,3]->[i',w',2] : 1<=i<i'<=N && 0<=w'};

ILex := inverse Lex;

INPar := inverse NPar;

union (INPar intersection S(NPar intersection Lex));

Int := Int union (inverse Int);

Int;

{[i,w,2] -> [i',w',2] : 1 <= i' < i <= N && 1 <= w <= w'} union

{[i,0,2] -> [i',w',2] : 1 <= i' < i <= N && 0 <= w'} union

{[i,w,2] -> [i',w-1,2] : 1 <= i' < i <= N && 1 <= w} union

{[i,w,2] -> [i',w',2] : 1 <= i' < i <= N && 0 <= w' <= w-2} union

{[i,0,1] -> [i',0,1] : 1 <= i' < i <= N} union

{[i,0,2] -> [i',0,1] : 1 <= i' < i <= N} union

{[i,0,1] -> [i',w',2] : 1 <= i' < i <= N && 0 <= w'} union

{[i,0,3] -> [i',0,1] : 1 <= i' < i <= N} union

{[i,0,3] -> [i',w',2] : 1 <= i' < i <= N && 0 <= w'} union

{[i,w,2] -> [i',0,3] : 1 <= i < i' <= N && 0 <= w} union

{[i,0,1] -> [i',0,3] : 1 <= i < i' <= N} union

{[i,w,2] -> [i',0,1] : 1 <= i < i' <= N && 0 <= w} union

{[i,0,1] -> [i',0,2] : 1 <= i < i' <= N} union


{[i,w,2] -> [i',w',2] : 1 <= i < i' <= N && 0 <= w <= w'-2} union

{[i,w,2] -> [i',w+1,2] : 1 <= i < i' <= N && 0 <= w} union

{[i,w,2] -> [i',0,2] : 1 <= i < i' <= N && 0 <= w} union

{[i,w,2] -> [i',w',2] : 1 <= i < i' <= N && 1 <= w' <= w}

Int intersection {[i,w,s]->[i,w',s']}

is empty, meaning that neither expansion nor renaming must be done inside an iteration of the outer loop. In particular, E_S[2] should be set to 0. However, computing the set W_0^S(v) (i.e. for the outer loop) yields all accesses w executing after v (for the same i). Then E_S[1] should be set to N. We have automatically found the partially expanded program.

Schedule-independent storage mapping. We now apply the expansion algorithm according to the data-flow execution order. The parallel execution order is defined as follows:

Par := S+;

Once again,

Int intersection {[i,w,s]->[i,w',s']}

is empty. The schedule-independent storage mapping is thus the same as the previous, parallelization-dependent, one.

The resulting program for both techniques is the same as the hand-crafted one in Figure 5.17.

Second Example

We now consider the knapsack program in Figure 5.18. It is easy to show that a schedule-independent storage mapping would give no better result than single-assignment form. More precisely, it is impossible to find any schedule such that a "cyclic folding" (a storage mapping with subscripts of the form A_S[CurIns % E_S]) would be more economical than single-assignment form.

We are thus looking for a schedule-dependent storage mapping. An efficient parallelization of program KP requires tiling of the iteration space. This can be done using classical techniques since the loop is perfectly nested. Section 5.3.10 has shown good performance for 16 × 32 tiles, but we consider 2 × 1 tiles for the sake of simplicity. The parallel execution order considered is the same as the one presented in Section 5.3.1: tiles are scheduled in fronts of constant k + j, and the inner-tile order is the original sequential execution one.

The result of the instancewise reaching definition analysis is written in Omega's syntax:

S := {[k,j]->[k-1,j] : 2<=k<=M && 1<=j<=C} union

{[k,j]->[k,j'] : 1<=k<=M && 1<=j'<j<=C && j'-K<=j};

Instances which may not assign the same memory location are defined by the following relation:

NCon := {[k,j]->[k',j'] : 1<=k,k'<=M && 1<=j,j'<=C && j!=j'};


InnerTile := {[k,j]->[k',j] : (exists kq,kr,kr' : k=2kq+kr

&& k'=2kq+kr' && 0<=kr<kr'<2)};

InterTile := {[k,j]->[k',j'] : (exists kq,kr,kq',kr' : k=2kq+kr

&& k'=2kq'+kr' && 0<=kr,kr'<2 && kq+j<kq'+j')};

Par := Lex intersection (InnerTile union InterTile);

We then compute the interference relation; call it Int:

# The "full" relation

Full := {[k,j]->[k',j'] : 1<=k,k'<=M && 1<=j,j'<=C};

Lex := Full intersection {[k,j]->[k',j'] : k<k' || (k=k' && j<j')};

ILex := inverse Lex;

INPar := inverse NPar;

union (INPar intersection S(NPar intersection Lex));

Int := Int union (inverse Int);

Int;

{[k,j] -> [k',j'] : 1 <= k <= k' <= M && 1 <= j < j' <= C} union

{[k,j] -> [k',j'] : 1 <= k < k' <= M && 1 <= j' < j <= C} union

{[k,j] -> [k',j'] : Exists ( alpha : 1, 2alpha+2 <= k < k' < M

&& j <= C && 1 <= j' && k'+2j' <= 2+2j+2alpha)} union

{[k,j] -> [k',j'] : Exists ( alpha : 1, 2alpha+2 <= k' < k < M

&& j' <= C && 1 <= j && k+2j <= 2+2j'+2alpha)} union

{[k,j] -> [k',j'] : 1 <= j < j' <= C && 1 <= k' < k <= M} union

{[k,j] -> [k',j'] : 1 <= k' <= k <= M && 1 <= j' < j <= C}

Int intersection {[k,j]->[k+K+1,j']}

5.3.10 Experiments

Partial expansion has been implemented for Cray-Fortran affine loop nests [LF98]. Semi-automatic storage mapping optimization has also been performed on general loop nests, using FADA, Omega, and PIP.

Figure 5.25 summarizes expansion and parallelization results for several programs. The three affine loop nest examples have already been studied by Lefebvre in [LF98, Lef98].

5.4. CONSTRAINED STORAGE MAPPING OPTIMIZATION 205

........................................................................................

              Sequential              Parallel     Parallel Size         Run-time Overhead
Program       Complexity   Size       Complexity   SA         Optimized  SA      Optimized
MVProduct     O(N²)        N²+2N+1    O(N)         2N²+3N     N²+2N      no      no
Cholesky      O(N³)        N²+N+1     O(N)         N³+N²      2N²+N      no      no
Gaussian      O(N³)        N²+N+1     O(N)         N³+N²+N    2N²+2N     no      no
Knapsack      O(MC)        C+2M       O(M+C)       MC+C+2M    KC+2C+2M   free    free
Convolution   O(NM)        1          O(M)         NM+N       N          cheap   no

. . . . . . . . . . . . . . . . . . . . . . Figure 5.25. Time and space optimization . . . . . . . . . . . . . . . . . . . . . .

Experiments have been made on an SGI Origin 2000, using the mp library (but not PCA, the built-in automatic parallelizer). As one would expect, results for the convolution program are excellent, even for small values of N. Execution times for program KP appear in Figure 5.26. The first graph compares the execution time of the parallel program with that of the original (not expanded) one; the second one shows the speed-up. We got very good results for medium array sizes,¹⁷ both in terms of speed-up and relative to the original knapsack program.

........................................................................................

[Figure 5.26: two graphs for program KP, execution time in ms (sequential vs. parallel) and speed-up (optimal vs. effective), both plotted against the number of processors, from 1 to 32]

........................................................................................

Sections 5.2 and 5.3 addressed two techniques to optimize parallelization via memory expansion. We show here that combining the two techniques in a more general expansion framework is possible and brings significant improvements. Optimization is achieved from two complementary directions:

- Adding constraints to limit memory expansion, like static expansion avoiding φ-functions [BCC98], privatization [TP93, MAL93], or array static single assignment [KS98]. All these techniques allow partial removal of memory-based dependences, but may extract less parallelism than conversion to single-assignment form.

¹⁷ Here C = 2048, M = 1024 and K = 16, with 16 × 32 tiles (scheduled similarly to Figure 5.18).


- Applying storage mapping optimization techniques [CL99]. These are either schedule-independent [SCFS98] or schedule-dependent [LF98] (the latter yielding better optimizations), depending on whether they require prior computation of a parallel execution order (scheduling, tiling, etc.) or not.

We try here to get the best of both directions and show the benefit of combining them into a unified framework for memory expansion. The motivation for such a framework is the following: because of the increased complexity of dealing with irregular codes, and given the wide range of parameters which can be tuned when parallelizing such programs, a broad range of expansion techniques have been or will be designed for optimizing one or a few of these parameters. The two preceding sections are some of the best examples of this trend. We believe that our constrained expansion framework greatly reduces the complexity of the optimization problem, by reducing the number of parameters and helping the automation process.

With the help of a motivating example we introduce the general concepts, before we formally define correct constrained storage mappings. Then we present an intra-procedural algorithm which handles any imperative program and most loop nest parallelization techniques.

5.4.1 Motivation

We study the pseudo-code in Figure 5.27.a. Such nested loops with conditionals appear in many kernels, but most parallelization techniques fail to generate efficient code for these programs. Instances of T are denoted by ⟨T, i, j⟩, instances of S by ⟨S, i, j, k⟩, and instances of R by ⟨R, i⟩, for 1 ≤ i, j ≤ M and 1 ≤ k ≤ N. ("P(i, j)" is a boolean function of i and j.)

........................................................................................

double x;
for (i=1; i<=M; i++) {
  for (j=1; j<=M; j++)
    if (P(i,j)) {
T     x = 0;
      for (k=1; k<=N; k++)
S       x = ... x ...;
    }
R   ... = x ...;
}

Figure 5.27.a. Original program

double xT[M+1, M+1], xS[M+1, M+1, N+1];
for (i=1; i<=M; i++) {
  for (j=1; j<=M; j++)
    if (P(i,j)) {
T     xT[i, j] = 0;
      for (k=1; k<=N; k++)
S       xS[i, j, k] = ... if (k==1) xT[i, j];
                          else xS[i, j, k-1] ...;
    }
R   ... = φ({⟨S, i, 1, N⟩, ..., ⟨S, i, M, N⟩}) ...;
}

Figure 5.27.b. Single assignment form

. . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 5.27. Motivating example . . . . . . . . . . . . . . . . . . . . . . . . . . .

On this example, assume N is positive and predicate "P(i, j)" evaluates to true at least once for each iteration of the outer loop. A precise instancewise reaching definition analysis tells us that the reaching definition of the read access ⟨S, i, j, k⟩ to x is ⟨T, i, j⟩ when k = 1 and ⟨S, i, j, k−1⟩ when k > 1. We only get an approximate result for definitions that may reach ⟨R, i⟩: those are {⟨S, i, 1, N⟩, ..., ⟨S, i, M, N⟩}. In fact, the value of x may only come from S (since N > 0) for the same i (since T executes at least once for each iteration of the outer loop), and for k = N.

Obviously, memory-based dependences on x hamper parallelization. Our intent is to expand scalar x so as to get rid of as many dependences as possible. Figure 5.27.b shows our program converted to SA form. The unique φ function implements a run-time choice between the values produced by ⟨S, i, 1, N⟩, ..., ⟨S, i, M, N⟩.

SA form removed enough dependences to make the two outer loops parallel, see Figure 5.28.a. Function φ is computed at run time using array @x. It holds the last value of j at statement S when x was assigned. This information allows value recovery in R, see the third method in Section 5.1.4 for details.

But this parallel program is not usable on any architecture. The main reason is memory usage: variable x has been replaced by a huge three-dimensional array, plus two smaller arrays. This code is approximately five times slower than the original program on a single processor (when the arrays can be accommodated in memory).

........................................................................................

double xT[M+1, M+1], xS[M+1, M+1, N+1];
int @x[M+1];
parallel for (i=1; i<=M; i++) {
  @x[i] = ⊥;
  parallel for (j=1; j<=M; j++)
    if (P(i,j)) {
T     xT[i, j] = 0;
      for (k=1; k<=N; k++)
S       xS[i, j, k] = ... if (k==1) xT[i, j];
                          else xS[i, j, k-1] ...;
      @x[i] = max (@x[i], j);
    }
R   ... = xS[i, @x[i], N] ...;
}

Figure 5.28.a. Single assignment form

double x[M+1, M+1];
int @x[M+1];
parallel for (i=1; i<=M; i++) {
  @x[i] = ⊥;
  parallel for (j=1; j<=M; j++)
    if (P(i,j)) {
T     x[i, j] = 0;
      for (k=1; k<=N; k++)
S       x[i, j] = ... x[i, j] ...;
      @x[i] = max (@x[i], j);
    }
R   ... = x[i, @x[i]] ...;
}

Figure 5.28.b. Storage mapping optimization

. . . . . . . . . . . . . . . . Figure 5.28. Parallelization of the motivating example . . . . . . . . . . . . . . . .

This shows the need for a memory usage optimization technique. Storage mapping optimization (SMO) [CL99, LF98, SCFS98] consists in reducing memory usage as much as possible as soon as a parallel execution order has been crafted, see Section 5.3. A single two-dimensional array can be used, while keeping the two outer loops parallel, see Figure 5.28.b. Run-time computation of function φ with array @x seems very cheap at first glance, but execution of @x[i] = max (@x[i], j) hides synchronizations behind the computation of the maximum! As usual, it results in very bad scaling: good accelerations are obtained for a very small number of processors, then speed-up drops dramatically because of synchronizations. Figure 5.29 gives execution time and speed-up for the parallel program, compared to the original, not expanded, one. We used the mp library on an SGI Origin 2000, with M = 64 and N = 2048, and simple expressions for the "..." parts.

This bad result shows the need for a ner parallelization scheme. The question is to

208 CHAPTER 5. PARALLELIZATION VIA MEMORY EXPANSION

........................................................................................
[Figure 5.29: execution time (ms) and speed-up versus the number of processors (1 to 32), comparing the SMO-parallelized program to the sequential one and to the optimal speed-up; plot data not recoverable.]
........................................................................................

widely-used parallel computers, the processor number is likely to be less than 100, but
SA form extracted two parallel loops involving M² processors! The intuition is that we
paid memory and run-time overhead for parallelism we cannot exploit.

One would prefer a pragmatic expansion scheme, such as maximal static expansion
(MSE) [BCC98], or privatization [TP93, MAL93]. Choosing static expansion has the
benefit that no φ function is necessary any more: x can be safely expanded along the outermost
and innermost loops, but expansion along j is forbidden, since it would require a φ function and thus
violate the static constraint, see Section 5.2. Now, only the outer loop is parallel, and we
get much better scaling, see Figure 5.30. However, on a single processor the program still
runs twice as slow as the original one: scalar x (probably promoted to a register in
the original program) has been replaced by a two-dimensional array.

........................................................................................
double x[M+1, N+1];
parallel for (i=1; i<=M; i++) {
  for (j=1; j<=M; j++)
    if (P(i, j)) {
T     x[i, 0] = 0;
      for (k=1; k<=N; k++)
S       x[i, k] = ··· x[i, k-1] ···;
    }
R   ··· = x[i, N] ···;
}
[Figure 5.30: speed-up (parallel / original) of MSE versus the number of processors (1 to 32), compared to the optimal speed-up; plot data not recoverable.]

Maximal static expansion expanded x along the innermost loop, but this was of no
interest regarding parallelism extraction. Combining it with storage mapping optimization
solves the problem, see Figure 5.31. Scaling is excellent and parallelization overhead is
very low: the parallel program runs 31.5 times faster than the original one on 32 processors
(for M = 64 and N = 2048).

5.4. CONSTRAINED STORAGE MAPPING OPTIMIZATION 209

zation and static expansion, with storage mapping optimization techniques, to improve
the parallelization of general loop nests (with unrestricted conditionals and array subscripts).
In the following, we present an algorithm useful for automatic parallelization of imperative
programs. Although this algorithm cannot itself choose the "best" parallelization, it
aims at the simultaneous optimization of expansion and parallelization constraints.

........................................................................................

double x[M+1];
parallel for (i=1; i<=M; i++) {
  for (j=1; j<=M; j++)
    if (P(i, j)) {
T     x[i] = 0;
      for (k=1; k<=N; k++)
S       x[i] = ··· x[i] ···;
    }
R   ··· = x[i] ···;
}
[Speed-up (parallel / original) of MSE + SMO versus the number of processors (1 to 32), compared to the optimal speed-up; plot data not recoverable.]

Figure 5.31. Maximal static expansion combined with storage mapping optimization

........................................................................................

Because our framework is based on maximal static expansion and storage mapping opti-
mization, we inherit their program model and mathematical abstraction: we only consider
nests of loops operating on arrays, and abstract these programs with affine relations.

Introducing Constrained Expansion

The motivating example shows the benefits of putting an a priori limit on expansion.
Static expansion [BCC98] is a good example of constrained expansion. What about other
expansion schemes? The goal of constrained expansion is to design pragmatic techniques
that do not expand variables when the incurred overhead is "too high". To generalize
static expansion, we suppose that some equivalence relation on writes, written ∼ in
the following, is available from previous compilation stages, possibly with user interaction.
It is called the constraint relation. A storage mapping constrained by ∼ is any mapping feexp such that

∀e ∈ E, ∀v, w ∈ W : v ∼ w ∧ fe(v) = fe(w) ⟹ feexp(v) = feexp(w).    (5.25)
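On finite instance sets, condition (5.25) can be checked directly. The sketch below is a hypothetical Python illustration over enumerated writes, not the thesis's symbolic implementation; all names (writes, constraint, f_e, f_exp) are ours.

```python
# Sketch: checking the constraint (5.25) on an enumerated set of writes.
# Illustrative only; the thesis works symbolically on affine relations.

def respects_constraint(writes, constraint, f_e, f_exp):
    """True iff v ~ w and f_e(v) == f_e(w) imply f_exp(v) == f_exp(w)."""
    return all(f_exp(v) == f_exp(w)
               for v in writes for w in writes
               if constraint(v, w) and f_e(v) == f_e(w))

# Toy instance: writes are loop iterations (i, j) of a single statement,
# all writing the same scalar, constrained to share storage when i is equal.
writes = [(i, j) for i in range(2) for j in range(3)]
same_i = lambda v, w: v[0] == w[0]
f_e = lambda v: 0                 # original mapping: one scalar
f_exp = lambda v: v[0]            # expansion along i only

assert respects_constraint(writes, same_i, f_e, f_exp)
# Expanding along j as well would break the constraint:
assert not respects_constraint(writes, same_i, f_e, lambda v: v)
```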

It is difficult to decide whether to forbid expansion of some variable or not. A short
survey of this problem is presented in Section 5.4.5, along with a discussion about building
the constraint relation from a "syntactical" or "semantical" constraint. Moreover, we leave
for Section 5.4.8 all discussions about picking the right parallel execution order.

Now, the two problems are part of the same two-criteria optimization problem: tun-
ing expansion and parallelism for performance. We do not present here a solution to this
complex problem. The algorithm described in the next sections should be seen as an inte-
grated tool for parallelization, as soon as the "strategy" has been chosen: what expansion


constraints, what kind of schedule, tiling, etc. Most of these strategies have already been

shown useful and practical for some programs; our main contribution is their integration

in an automatic optimization process. The summary of our optimization framework is

presented in Figure 5.32.

........................................................................................

[Figure 5.32 diagram: a design space with axes Expansion and Parallelism; marked points include the single-assignment form with its data-flow execution order, and the expansion constrained by ∼ with a parallel order obtained by scheduling, tiling, etc.]
. . . . . . . . . . . . . . . . . . . . . . . . Figure 5.32. What we want to achieve . . . . . . . . . . . . . . . . . . . . . . . .

We first define correct parallelizations, then state our optimization problem.

What is a Correct Parallel Execution Order?

Memory expansion partially removes dependences due to memory reuse. Recall from
Section 2.5 that relation δexp approximates the dependence relation of (<seq, feexp), the
expanded program with sequential execution order. (δexp equals σ when the program is
converted to SA form.) Thanks to Theorem 2.2 page 81, we want any parallel execution
order <par to satisfy the following condition:

∀(ı₁, r₁), (ı₂, r₂) ∈ A : (ı₁, r₁) δexp (ı₂, r₂) ⟹ ı₁ <par ı₂.    (5.26)

Computation of the approximate dependence relation δexp from storage mapping feexp is pre-
sented in Section 5.4.8.

What is a Correct Expansion?

Given parallel order <par, we are looking for correct expansions, allowing parallel execu-
tion to preserve the original semantics. Our task is to formalize the memory reuse constraints
enforced by <par. Using the interference relation ⋈ defined in Section 5.3.2, we have proven
in Theorem 5.2 that the expansion is correct if the following condition holds:

∀e ∈ E, ∀v, w ∈ W : v ⋈ w ⟹ feexp(v) ≠ feexp(w).    (5.27)


We formalized parallelization correctness with an expansion constraint (5.25) and two
correctness criteria (5.26) and (5.27). Let us show how solving these equations simulta-
neously yields a suitable parallel program (<par, feexp).

Following the lines of Section 5.2.3, we are interested in removing as many dependences
as possible, without violating the expansion constraint. We can prove, like Proposi-
tion 5.1 in Section 5.2.3, that a constrained expansion is maximal, i.e. assigns the largest
number of memory locations while verifying (5.25), iff

∀e ∈ E, ∀v, w ∈ We : v ∼ w ∧ fe(v) = fe(w) ⟺ feexp(v) = feexp(w).

Still following Section 5.2.3, we assume that feexp = (fe, ν), where ν is constant on equiv-
alence classes of ∼. Indeed, if fe(v) = fe(w), condition feexp(v) = feexp(w) becomes
equivalent to ν(v) = ν(w). Because we need to approximate over all possible executions,
we use the conflict relation, written ≈, and our maximal constrained expansion criterion becomes

∀v, w ∈ W, v ≈ w : v ∼ w ⟺ ν(v) = ν(w).    (5.28)

Computing ν is done by enumerating the equivalence classes of ∼. For any access v in a class
of ≈ (instances that "may" hit the same memory location), ν(v) can be defined via a
representative of the equivalence class of v for relation ∼. Computing the lexicographical
minimum is a simple way to find representatives, see Section 5.2.5.
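Naming classes by their lexicographic minimum can be sketched on enumerated instances as follows; this is an illustrative stand-in for the symbolic computation of Section 5.2.5, and all names are ours.

```python
# Sketch: naming equivalence classes by their lexicographic-minimum member.
# Enumerated-instance analogue of the representative computation (Section 5.2.5).

def representatives(instances, equiv):
    """Map each instance to the lexicographic minimum of its class."""
    rep = {}
    for v in instances:
        # Tuples compare lexicographically in Python, so min() is the lex-min.
        rep[v] = min(w for w in instances if equiv(v, w))
    return rep

insts = [(i, j) for i in range(2) for j in range(2)]
same_i = lambda v, w: v[0] == w[0]   # classes: same outer-loop counter
rep = representatives(insts, same_i)
assert rep[(1, 1)] == (1, 0)
assert rep[(0, 1)] == (0, 0)
```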

It is time to compute the dependences δexp of program (<seq, feexp): an access w depends
on v if they hit the same memory location, v executes before w, and at least one is a
write. The full computation is done in Section 5.4.8 and uses (5.28) together with the
reaching definition relation σ (u σ w means u is a possible reaching definition of w); the result is

∀v ∈ W, w ∈ R : v δexp w ⟺ ∃u ∈ W : u σ w ∧ v ≈ u ∧ v ∼ u ∧ v <seq w
∀v ∈ R, w ∈ W : v δexp w ⟺ ∃u ∈ W : u σ v ∧ u ≈ w ∧ u ∼ w ∧ v <seq w
∀v, w ∈ W : v δexp w ⟺ v ≈ w ∧ v ∼ w ∧ v <seq w    (5.29)

We rely on classical algorithms to compute <par from δexp [Fea92, DV97, IT88, CFH95].
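On enumerated instance sets, (5.29) can be evaluated directly. The sketch below is plain Python over explicit sets, not the symbolic Omega-based computation of Section 5.4.8; all names (conflict, constraint, sigma, before) are ours.

```python
# Sketch of (5.29): dependences of the expanded program (<seq, fe_exp),
# with relations given as Python predicates over enumerated instances.

def dep_exp(W, R, conflict, constraint, sigma, before):
    """Return the dependent pairs of (5.29).
    sigma(u, w): u is a possible reaching definition of read w."""
    deps = set()
    for v in W:
        for w in R:                     # flow dependences (first line)
            if any(sigma(u, w) and conflict(v, u) and constraint(v, u)
                   for u in W) and before(v, w):
                deps.add((v, w))
        for w in W:                     # output dependences (third line)
            if conflict(v, w) and constraint(v, w) and before(v, w):
                deps.add((v, w))
    for v in R:                         # anti-dependences (second line)
        for w in W:
            if any(sigma(u, v) and conflict(u, w) and constraint(u, w)
                   for u in W) and before(v, w):
                deps.add((v, w))
    return deps

# Toy program: write w1, read r1, write w2, all hitting the same location.
W, R = ['w1', 'w2'], ['r1']
order = {'w1': 0, 'r1': 1, 'w2': 2}
before = lambda a, b: order[a] < order[b]
conflict = lambda a, b: True
constraint = lambda a, b: True
sigma = lambda u, w: (u, w) == ('w1', 'r1')
assert dep_exp(W, R, conflict, constraint, sigma, before) == \
    {('w1', 'r1'), ('r1', 'w2'), ('w1', 'w2')}
```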

Knowing (<par, feexp), we could stop and say we have successfully parallelized our
program; but nothing ensures that feexp is an "economical" storage mapping (remember
the motivating example). We must build a new expansion from <par that minimizes
memory usage while satisfying (5.27).

For constrained expansion purposes, feexp has been chosen of the form (fe, ν). This
has some consequences on the expansion correctness criterion: when fe(v) ≠ fe(w), it is
not necessary to set ν(v) ≠ ν(w) to enforce feexp(v) ≠ feexp(w). As a consequence, the
v ≉ w clause in (5.22) is not necessary any more (see page 194), and we may rewrite the
expansion correctness criterion thanks to a simplified definition of the interference relation ⋈.
Let ⋈′ be the interference relation for constrained expansion, where u ≮par w means that
u does not execute before w in the parallel order:

v ⋈′ w ⟺def ∃u ∈ R : v σ u ∧ w ≮par v ∧ u ≮par w ∧ (u <seq w ∨ w <seq v)
         ∨ ∃u ∈ R : w σ u ∧ v ≮par w ∧ u ≮par v ∧ (u <seq v ∨ v <seq w).    (5.30)

We can rewrite this definition using algebraic operations:

⋈′ = ((σ(R) × W) ∩ ≮par ∩ >seq) ∪ (≮par ∩ (σ ∘ (≮par ∩ <seq)))
   ∪ ((σ(R) × W) ∩ ≮par ∩ <seq) ∪ (≮par ∩ (σ ∘ (≮par ∩ >seq))).    (5.31)


ping feexp is of the form (fe, ν) and the following condition holds, then feexp is a correct
expansion of fe, i.e. feexp allows parallel execution to preserve the program semantics:

∀v, w ∈ W, v ≈ w : v ⋈′ w ⟹ ν(v) ≠ ν(w).    (5.32)

Proving Theorem 5.3 is a straightforward rewriting of the proof of Theorem 5.2, and
the optimality result of Proposition 5.2 also holds: the only difference is that the v ≉ w
clause has been replaced by v ≈ w in the left-hand side of (5.32).

Building a function ν satisfying (5.32) is almost what the partial expansion algorithm
presented in Section 5.3.5 has been crafted for. Instead of generating code, one can
redesign this algorithm to compute an equivalence relation over writes: the coloring
relation, written ⋄. Its only requirement is to assign different colors to interfering writes,

∀v, w ∈ W : v ⋈′ w ⟹ ¬(v ⋄ w),    (5.33)

but we are also interested in minimizing the number of colors. When v ⋄ w, it says that
it is correct to have feexp(v) = feexp(w). The new graph coloring algorithm is presented in
Section 5.4.6.

By construction of relation ⋄, a function ν defined by

∀v, w ∈ W, v ≈ w : v ⋄ w ⟺ ν(v) = ν(w)

satisfies expansion correctness (5.32); but annoyingly, nothing ensures that the expansion
constraint (5.25) is still satisfied: for all v, w ∈ W such that v ≈ w, we have v ⋈′ w ⟹ ν(v) ≠
ν(w) but not necessarily v ≁ w ⟹ ν(v) ≠ ν(w). Indeed, ⋄ defines a minimal expansion
allowing the parallel execution order to preserve the original semantics, but it does not
enforce that this expansion satisfies the constraint.

The first problem is to check the compatibility of ∼ and ⋈′. This is ensured by the
following result.18

Proposition 5.3 For all conflicting writes v and w, it is not possible that v ∼ w and v ⋈′ w at the
same time.19

Proof: Suppose v ∼ w, v ≈ w, v ⋈′ w and v <seq w. The third line of (5.29) shows that
v δexp w, hence v <par w from (5.26). This proves that the v ≮par w conjunct in the second
line of (5.30) does not hold. Now, since v ⋈′ w, one may consider a read instance u ∈ R
such that the first line of (5.30) is satisfied: v σ u ∧ w ≮par v ∧ u ≮par w ∧ u <seq w.
Exchanging the roles of u and v in the second line of (5.29) shows that u δexp w, hence
u <par w from (5.26); this is contradictory with u ≮par w.

Likewise, the case w <seq v yields a contradiction with u ≮par v in the second line of
(5.30). This terminates the proof.

We now have to define ν from a new equivalence relation, considering both ∼ and ⋄.
Figure 5.33 shows that ∼ ∪ ⋄ is not sufficient: consider three writes u, v and w such that
fe(u) = fe(v) = fe(w), u ∼ v and v ⋄ w. (5.28) enforces feexp(u) = feexp(v) since u ∼ v.
Moreover, to spare memory, we should use the coloring relation and set feexp(v) = feexp(w).
Then, no expansion is done and the parallel order <par may be violated.

18 The proof of this strong result is rather technical but helps understanding the role of each conjunct
in equations (5.29), (5.26) and (5.30).

19 A non-optimal definition of relation ⋈′ would not yield such a compatibility result.


........................................................................................

(a) σ(rw) = {w} and σ(ruv) = {u, v}:
w   if (···) x = ···;
    rw = ··· x ···;
u   x = ···;
v   if (···) x = ···;
    ruv = ··· x ···;

(b) Moving u to the top: rw may read the value produced by u:
u   x = ···;
w   if (···) x = ···;
    rw = ··· x ···;
v   if (···) x = ···;
    ruv = ··· x ···;

(c) Assigning y in u and v, and moving u to the top:
u   y = ···;
w   if (···) x = ···;
    rw = ··· x ···;
v   if (···) y = ···;
    ruv = ··· y ···;

. . . . . . . . . . Figure 5.33. Strange interplay of constraint and coloring relations . . . . . . . . . .

To avoid this pitfall, the coloring relation must be used with care: one may safely set
feexp(u) = feexp(v) when for all u′ ∼ u, v′ ∼ v: u′ ⋄ v′ (i.e. u′ and v′ share the same color).
We thus build a new relation over writes, built from ∼ and ⋄. It is called the constraint
coloring relation, written ⊕, and is defined by

∀v, w ∈ W : v ⊕ w ⟺def v ∼ w ∨ (∀v′, w′ : v′ ∼ v ∧ w′ ∼ w ⟹ v′ ⋄ w′).    (5.34)

We can rewrite this definition using algebraic operations:

⊕ = ∼ ∪ ((W × W) ∖ (∼ ∘ ((W × W) ∖ ⋄) ∘ ∼)).    (5.35)

The good thing is that relation ⊕ is an equivalence: the proof is simple since both ∼
and ⋄ are equivalence relations. Moreover, choosing ν(v) = ν(w) when v ⊕ w and
ν(v) ≠ ν(w) when it is not the case ensures that feexp = (fe, ν) satisfies both the expansion
constraint and the expansion correctness criterion.
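The definition (5.34) and the pitfall of Figure 5.33 can be replayed on a three-write toy example. This is an illustrative Python sketch with our own names, not the thesis's relational implementation.

```python
# Sketch of the constraint coloring relation (5.34) on enumerated writes:
# v (+) w iff v ~ w, or every ~-related pair around (v, w) shares a color.

def constraint_coloring(W, constraint, coloring):
    def plus(v, w):
        if constraint(v, w):
            return True
        return all(coloring(v2, w2)
                   for v2 in W if constraint(v2, v)
                   for w2 in W if constraint(w2, w))
    return plus

# Pitfall instance: u ~ v (constrained together), v and w share a color,
# but u and w do not; then v and w must NOT be merged.
W3 = ['u', 'v', 'w']
constraint = lambda a, b: {a, b} <= {'u', 'v'} or a == b
coloring = lambda a, b: {a, b} == {'v', 'w'} or a == b
plus = constraint_coloring(W3, constraint, coloring)
assert plus('u', 'v')          # constrained pairs stay together
assert not plus('v', 'w')      # u ~ v, yet u and w have different colors
```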

The following result solves the constrained storage mapping optimization problem:20

Theorem 5.4 The storage mapping feexp of the form (fe, ν) such that

∀v, w ∈ W, v ≈ w : v ⊕ w ⟺ ν(v) = ν(w)    (5.36)

is the minimal storage mapping, i.e. the one accessing the fewest memory locations, which is
constrained by ∼ and allows the parallel execution order <par to preserve the program
semantics, ∼ and ⋄ being the only information permitting two instances to
assign the same memory location.

Proof: From Proposition 5.3, we already know that ∼ and ⋈′ have an empty inter-
section. Together with the inclusion of (W × W) ∖ (∼ ∘ ((W × W) ∖ ⋄) ∘ ∼) into ⋄, this proves
the correctness of feexp = (fe, ν). The constraint is also enforced by feexp since ∼ ⊆ ⊕.

To prove the optimality result, one first observes that ⊕ defines an equivalence relation
on write instances, and second that ⊕ is the largest equivalence relation included in
∼ ∪ ⋄.

Theorem 5.4 gives us an automatic method to minimize memory usage, according to
a parallel execution order and a predefined expansion constraint. Figure 5.34 gives an

20 See Section 2.4.4 for a general remark about optimality.


intuitive presentation of this complex result: starting from the "maximal constrained
expansion ∼", we compute a parallel execution order, from which we compute a "minimal
correct expansion ⋄", before combining the result with the constraint to get a "minimal
correct constrained expansion ⊕".

........................................................................................

[Figure 5.34 diagram: in the Expansion × Parallelism design space, the sequential program with its original storage mapping and order <seq, the single-assignment form with the data-flow execution order, and the constrained expansion with a parallel order <par obtained by scheduling, tiling, etc., followed by storage mapping optimization.]

. . . . . . . Figure 5.34. How we achieve constrained storage mapping optimization . . . . . . .

5.4.4 Algorithm

As a summary of the optimization problem, one may group the formal constraints exposed
in Section 5.4.3 into the following system:

Constraints on feexp = (fe, ν):
    ∀v, w ∈ W : v ≈ w ∧ v ∼ w ⟹ ν(v) = ν(w)
    ∀v, w ∈ W : v ≈ w ∧ v ⋈′ w ⟹ ν(v) ≠ ν(w)

Constraints on <par:
    ∀(ı₁, r₁), (ı₂, r₂) ∈ A : (ı₁, r₁) δexp (ı₂, r₂) ⟹ ı₁ <par ı₂

Figure 5.35 shows the acyclic graph allowing computation of relations and mappings

involved in this system.

The algorithm to solve this system is based on Theorem 5.4. It computes relation ⊕
with an extension of the partial expansion algorithm presented in Section 5.3.4,
rewritten to handle constrained expansion. Before applying Constrained-Storage-
Mapping-Optimization, we suppose that the parallel execution order <par has been com-
puted from <seq, ≈, σ and ∼, by first computing the dependence relation δexp, then ap-
plying some appropriate parallel order computation algorithm (scheduling, tiling, etc.).
Then, this parallel execution order is used to compute the expansion correctness criterion
⋈′. Algorithm Constrained-Storage-Mapping-Optimization reuses Compute-
Representatives and Enumerate-Representatives from Section 5.2.5.

As in the last paragraph of Section 5.2.4, one may consider splitting expanded arrays
into renamed data structures to improve performance and reduce memory usage.

Eventually, when the compiler or the user knows that the parallel execution order <par
has been produced by a tiling technique, we have already pointed out in Section 5.3.6 that


........................................................................................

[Figure 5.35 diagram: program analysis of the program (<seq, fe) yields <seq, the conflict relation ≈ and the reaching definition relation σ; the expansion scheme (Section 5.4.5) yields the constraint ∼; from these, the dependence relation δexp is computed, then <par by scheduling or tiling; coloration and enumeration of equivalence classes finally yield ν.]

. . . . . Figure 5.35. Solving the constrained storage mapping optimization problem . . . . .

the cyclic graph coloring algorithm is not efficient enough. If the tile shape is known,
one may build a vector of each dimension's size, and use it as a "suggestion" for a block-
cyclic storage mapping. This vector of block sizes is used when replacing the call to
Cyclic-Coloring with a call to Near-Block-Cyclic-Coloring in Constrained-
Storage-Mapping-Optimization.

Our goal here is not to choose the right constraint suitable to expand a given program,
but this does not mean leaving the user to compute relation ∼ by hand!

As shown in Section 5.4.2, enforcing the expansion to be static corresponds to setting
∼ to the transitive closure of σ ∘ σ⁻¹, relating two writes when they may reach the same
read. The constraint is thus built from instancewise reaching definition results (see
Section 5.2).

Another example is privatization, seen as expansion along some surrounding loops,
without renaming. Consider two accesses u and v writing into the same memory location.
After privatization, u and v assign the same location if their iteration vectors coincide on
the components associated with privatized loops:

u ∼ v ⟺ Iter(u)[privatized loops] = Iter(v)[privatized loops],

where Iter(u)[privatized loops] holds the counters of privatized loops for instance u.
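This privatization constraint is a simple component-wise test on iteration vectors; the sketch below illustrates it in Python over explicit vectors (our own names, not the thesis's implementation).

```python
# Sketch of the privatization constraint: u ~ v iff the iteration vectors
# agree on the privatized loop dimensions (here, dimension 0 only).

def privatization_constraint(privatized):
    return lambda u, v: all(u[d] == v[d] for d in privatized)

priv = privatization_constraint([0])   # privatize the outer loop
assert priv((1, 5), (1, 9))            # same outer iteration: same location
assert not priv((1, 5), (2, 5))        # different outer iterations: expanded
```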


Constrained-Storage-Mapping-Optimization (program, ≈, σ, ∼, <par)
program: an intermediate representation of the program
≈: the conflict relation
σ: the reaching definition relation, seen as a function
∼: the expansion constraint
<par: the parallel execution order
returns an intermediate representation of the expanded program
1  ⋈′ ← ((σ(R) × W) ∩ ≮par ∩ >seq) ∪ (≮par ∩ (σ ∘ (≮par ∩ <seq)))
2       ∪ ((σ(R) × W) ∩ ≮par ∩ <seq) ∪ (≮par ∩ (σ ∘ (≮par ∩ >seq)))
3  ⋄ ← Cyclic-Coloring (⋈′ ∩ ≈)
4  ⊕ ← ∼ ∪ ((W × W) ∖ (∼ ∘ ((W × W) ∖ ⋄) ∘ ∼))
5  ρ ← Compute-Representatives (⊕ ∩ ≈)
6  ν ← Enumerate-Representatives (ρ, ≈)
7  for each array A ∈ program
8    do ΛA ← component-wise maximum of ν(u) for all write accesses u to A
9       declaration A[shape] → Aexp[shape, ΛA]
10      for each statement S assigning A in program
11        do left-hand side A[subscript] of S → Aexp[subscript, ν(CurIns)]
12      for each reference ref to A in program
13        do σ=ref ← σ ∩ (I × ref)
14           quast ← Make-Quast (σ=ref)
15           map ← CSMO-Convert-Quast (quast, ref)
16           ref → map(CurIns)
17 return program

CSMO-Convert-Quast (quast, ref)
quast: the quast representation of the reaching definition function
ref: the original reference
returns the implementation of quast as value retrieval code for reference ref
1  switch
2  case quast = {⊥} :
3    return ref
4  case quast = {ı} :
5    A ← Array(ı)
6    S ← Stmt(ı)
7    x ← Iter(ı)
8    subscript ← original array subscript in ref
9    return Aexp[subscript, ν(ı)]
10 case quast = {ı₁, ı₂, ...} :
11   return φ({ı₁, ı₂, ...})
12 case quast = if predicate then quast₁ else quast₂ :
13   return if predicate CSMO-Convert-Quast (quast₁, ref)
14          else CSMO-Convert-Quast (quast₂, ref)

Building the constraint for array SSA is even simpler. Instances of the same statement
assigning the same memory location must still do so in the expanded program (only
variable renaming is performed):

u ∼ v ⟺ Stmt(u) = Stmt(v)


definitions of memory locations σml. This definition can be used to weaken the static expan-
sion constraint: if the aim of constrained expansion is to reduce the run-time overhead due
to φ functions, then σml seems more appropriate than σ to define the constraint. Indeed,
if Loop-Nests-ML-SA is used to convert a program to SA form, we have seen that the
φ functions generated by the classical algorithm have disappeared, see the second method
in Section 5.1.4. It would thus be interesting to replace

Make-Quast (σ=ref)

in line 14 of Constrained-Storage-Mapping-Optimization by

Make-Quast (σml=ref(u, f(u)))

and to consider the constraint defined by the transitive closure ∼W of the relation

∀v, w ∈ W : v W w ⟺ ∃c ∈ f(u) : v, w ∈ σml(u, c),

where f is some conservative approximation of fe. Maximal expansion according to
constraint ∼W is called weakened static expansion. Eventually, setting ∼ = ∼W combines
weakened static expansion with storage mapping optimization.

These practical examples give the insight that building ∼ from the formal definition
of an expansion strategy is not difficult. New expansion strategies should be designed and
expressed as constraints: statement-by-statement, user-defined, knowledge-based, and es-
pecially architecture-dependent (number of processors, memory hierarchy, communication
model) constraints.

Our graph coloring problem is almost the same as the one studied by Feautrier and
Lefebvre in [LF98], and the core of their solution has been recalled in Section 5.3.5.
However, the formulation is slightly different now: it is no longer mixed up with code
generation. An easy work-around would be to redesign the output of algorithm Storage-
Mapping-Optimization, as proposed in [Coh99b]: let Stmt(u) (resp. Iter(u)) be the
statement (resp. iteration vector) associated with access u, and let NewArray(S) be
the name of the new array assigned by S (after partial expansion):

∀v, w ∈ W : v ⋄ w ⟺def NewArray(Stmt(v)) = NewArray(Stmt(w))
                       ∧ Iter(v) mod EStmt(v) = Iter(w) mod EStmt(w).

This solution is simple but not practical. We thus present a full algorithm suitable
for graphs defined by affine relations: Cyclic-Coloring is used on statement instances
for our storage mapping optimization purposes. Since the algorithm is general purpose,
we consider an interference relation between vectors (of the same dimension). Using this
algorithm for statement instances requires a preliminary encoding of the statement name
inside the iteration vector, and a padding of short vectors with zeroes. We already used
this technique when formatting instances to the Omega syntax: see Section 5.2.7 for a
practical example.

Remember that Storage-Mapping-Optimization was based on two independent
techniques: building of an expansion vector and partial renaming. This decomposi-
tion came from the bounded statement number, which allowed efficient greedy coloring
techniques, and the infinity of iteration vectors, which required a specific cyclic coloring.

Cyclic-Coloring proceeds in a very similar way, and the reasoning of Section 5.3.5 and
[LF98, Lef98] is still applicable to prove its correctness. However, the decomposition into
two coloring stages is extended here by considering all finite dimensions of the vectors:
if the vectors related by the interference relation have some dimensions whose
components may only take a finite number of values, it is interesting to apply a classical
coloring algorithm to these finite dimensions. We then build an equivalence relation on
vectors that share the same finite dimensions: it is called finite in the Cyclic-Coloring
algorithm (the number of equivalence classes is obviously finite). When vectors encode
statement instances, it is clear that the last dimension is finite, but some examples may
present more finite dimensions, for example with small loops whose bounds are known at
compile time. This extension may thus bring more efficient storage mappings than the
Storage-Mapping-Optimization algorithm in Section 5.3.4.

Cyclic-Coloring (⋈′)
⋈′: the affine interference graph
returns a valid and economical cyclic coloring
1  N ← dimension of the vectors related by ⋈′
2  finite ← equivalence relation of vectors sharing the same finite components
3  for each class set in finite
4    do for p = 1 to N
5      do working ← {(v, w) : v ∈ set ∧ w ∈ set
6                    ∧ v[1..p] = w[1..p] ∧ v[1..p+1] < w[1..p+1]
7                    ∧ ⟨S, v⟩ ⋈′ ⟨S, w⟩}
8         maxv ← {(v, max<lex {w : (v, w) ∈ working})}
9         vector[p+1] ← max<lex {w[p+1] − v[p+1] + 1 : (v, w) ∈ maxv}
10        cyclicset ← {v mod vector : v ∈ set}
11 interfere ← ∅
12 for each pair set, set′ in finite
13   do if (∃v ∈ set, v′ ∈ set′ : v ⋈′ v′)
14     then interfere ← interfere ∪ {(set, set′)}
15 coloring ← Greedy-Coloring (interfere)
16 col ← ∅
17 for each set in finite
18   do col ← col ∪ (cyclicset, coloring(set))
19 return col
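The cyclic part of the coloring can be illustrated on a one-dimensional toy case: colors are taken modulo a period one larger than the longest interference distance. This is a much-simplified analogue in Python, with our own names, not the algorithm above.

```python
# Much-simplified 1-D analogue of cyclic coloring: color iteration v with
# v mod k, where k exceeds the largest distance between interfering iterations.

def cyclic_coloring_1d(n, interfere):
    k = 1 + max((abs(v - w) for v in range(n) for w in range(n)
                 if interfere(v, w)), default=0)
    return {v: v % k for v in range(n)}

# Iterations interfere when fewer than 3 apart (a value live for 3 steps).
col = cyclic_coloring_1d(10, lambda v, w: v != w and abs(v - w) < 3)
# Valid coloring: interfering iterations never share a color...
assert all(col[v] != col[w] for v in range(10) for w in range(10)
           if v != w and abs(v - w) < 3)
# ...and only 3 memory cells (colors) are used instead of 10.
assert len(set(col.values())) == 3
```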

The Near-Block-Cyclic-Coloring algorithm is an optimization of Cyclic-
Coloring: it includes an improvement of the technique to efficiently handle graphs
associated with tiled programs, as hinted in Section 5.3.6. In this particular case, we
consider, as in most tiling techniques, a perfectly nested loop nest. Notice the "÷" sym-
bol is used for symbolic integer division. The intuitive idea is that a block-cyclic coloring
is preferred to the cyclic one of the classical algorithm.

The Near-Block-Cyclic-Coloring algorithm should be seen as a first attempt
to compute optimized storage mappings for tiled programs. As shown in Section 5.3.6,
the block-cyclic coloring problem is still open for affine interference relations.


Near-Block-Cyclic-Coloring (⋈′, shape)
⋈′: a symbolic interference graph
shape: a vector of block sizes suggested by a tiling algorithm
returns a valid and economical block-cyclic coloring
1  N ← number of nested loops
2  quotient ← {(x, x) : x ∈ Z^N}
3  for p = 1 to N
4    do quotient′ ← quotient
5         ∘ {(x, y) : y[1] = x[1], …, y[p] = x[p] ÷ shape_p, …, y[N] = x[N]}
6       if (∄z : z (quotient′ ∘ ⋈′ ∘ quotient′⁻¹) z)
7         then quotient ← quotient′
8  col ← Cyclic-Coloring (quotient ∘ ⋈′ ∘ quotient⁻¹)
9  return col ∘ quotient

5.4.7 Dynamic Restoration of the Data-Flow

As in Section 5.3.8, φ-arrays should be chosen in one-to-one mapping with the expanded
data structures, and the arguments of φ functions, i.e. sets of possible reaching definitions,
should be updated according to the new storage mapping. The technique is essentially
the same: function feexp is used to access φ-arrays, then relation ≉ and function ν are
used to recompute the sets of possible reaching definitions:21 a φ(set) reference should be
replaced by

φ({v ∈ set : ∄w ∈ set : v <seq w ∧ ¬(v ≉ w) ∧ ν(v) = ν(w)}).

Another optimization is based on the shape of φ-arrays: since feexp = (fe, ν), the
memory location written by a possible reaching definition can be deduced from the array
subscript, and the boolean type is now preferred for φ-array elements. This very simple
optimization reduces both memory usage and run-time overhead. Algorithm CSMO-
Implement-Phi summarizes these optimizations.22

As hinted in Section 5.1.4, the goal is now to avoid redundancy in the run-time restora-
tion of the data flow. Our technique extends ideas from the algorithms to efficiently place
φ functions in the SSA framework [CFR+91, KS98]. However, code generation for the
online computation of φ functions is rather different.

As in the SSA framework, φ functions should be placed at the joins of the control-flow
graph [CFR+91]: there is a join at some program point when several control-flow paths
merge together. Remember the control-flow graph is not the control automaton defined
in Section 2.3.1, and a program point is an inter-statement location in the program text
[ASU86]. Of course, the textual order <txt is extended to program points.

Joins are efficiently computed with the dominance frontier technique, see [CFR+91] for
details. Indeed, the only "interesting" joins are those located on a path from a write w
to a use whose set of possible reaching definitions is non-empty and holds w. If Points
is the set of program points, the set of "interesting" joins for an array (or scalar) A is

21 We use ¬(v ≉ w) to approximate the relation between writes that must assign the same memory
location.

22 For efficiency reasons, an expanded array Aexp is partitioned into several sub-arrays, as proposed in
Section 5.4.4. To correctly handle this partitioning, some simple, but rather technical, modifications
should be made to the algorithm.


CSMO-Implement-Phi (expanded)
expanded: an intermediate representation of the expanded program
returns an intermediate representation with run-time restoration code
1  for each array Aexp[shape] in expanded
2    do if there are φ functions accessing Aexp
3      then declare an array φAexp[shape] initialized to false
4    for each read reference ref to Aexp whose expanded form is φ(set)
5      do sub ← array subscript in ref
6         short ← {v ∈ set : ∄w ∈ set : v <seq w ∧ ¬(v ≉ w) ∧ ν(v) = ν(w)}
7         for each statement s involved in set
8           do refs ← write reference in s
9              subs ← array subscript in refs
10             if not already done for s
11               then following s insert
12                 φAexp[subs, ν(CurIns, refs)] = true
13         φ(set) ← Aexp[max<seq {ı ∈ short : φAexp[sub, ν(ı, ref)] = true}]
14 return expanded

denoted by JoinsA, and is formally defined by

∀p ∈ Points : p ∈ JoinsA ⟺ ∃v, u ∈ I :
    v σ u ∧ Stmt(v) <txt p <txt Stmt(u) ∧ Array(Stmt(u)) = A.    (5.37)
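On a toy straight-line program, (5.37) reduces to scanning which reaching definitions cross each program point. The Python sketch below uses our own names and an enumerated program, purely as an illustration of the definition.

```python
# Sketch of (5.37) on a toy program: a point p is an "interesting" join for
# array A when some possible reaching definition of an A-use crosses p.

def joins(points, instances, sigma, stmt_pos, array_of):
    """sigma(v, u): v may reach use u; stmt_pos orders statements and points."""
    return {p for p in points
            for v in instances for u in instances
            if sigma(v, u) and stmt_pos[v] < p < stmt_pos[u]
            and array_of[u] == 'A'}

# Two guarded writes w1, w2 may both reach the use of A at position 4.
stmt_pos = {'w1': 0, 'w2': 2, 'use': 4}
sigma = lambda v, u: v in ('w1', 'w2') and u == 'use'
array_of = {'w1': 'A', 'w2': 'A', 'use': 'A'}
points = [1, 3]                       # inter-statement program points
assert joins(points, list(stmt_pos), sigma, stmt_pos, array_of) == {1, 3}
```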

For each array (or scalar) A in the original program, the idea is to insert at each join
j in JoinsA a pseudo-assignment statement

Pj    A[···] = A[···];

which copies the entire structure into itself. Then, the reaching definition relation is
extended to these pseudo-assignment statements and the constrained storage mapping op-
timization process is performed on the modified program instead of the original one.23

Application of Constrained-Storage-Mapping-Optimization and then CSMO-
Implement-Phi (or an optimized version, see Section 5.1.4) generates an expanded pro-
gram whose interesting property is the absence of any redundancy in φ functions. Indeed,
the lexicographic maximum of two instances is never computed twice, since it is done as
early as possible in the φ function of some pseudo-assignment statement.

However, the expanded program suffers from the overhead induced by array copying,
which was not the case for a direct application of Constrained-Storage-Mapping-
Optimization and CSMO-Implement-Phi. Knobe and Sarkar encounter a similar
problem with SSA for arrays [KS98] and propose several optimizations (mostly based
on copy propagation and invariant code motion), but they provide no general method
to remove array copies; it is the very nature of SSA to generate temporary variables.
Nevertheless, there is such a general method, based on the observation that each pseudo-
assignment statement in the expanded program is followed by a φ-array assignment, by
construction of pseudo-assignment statements and the set JoinsA. Consider the following
code generation for a pseudo-assignment statement P:

for ( ) { // iterate through the whole array

23 Extending the reaching definition relation does not require any other analysis: the sets of possible
reaching definitions for pseudo-assignment accesses can be deduced from the original reaching definition
relation.

5.4. CONSTRAINED STORAGE MAPPING OPTIMIZATION 221

φAexp [subscript] = true;
Aexp [subscript] = φ(set);
}

Statement P does not compute anything, it only gathers possible values coming from
different control paths. The idea is thus to store instances instead of booleans and to use
@-arrays (see Section 5.1.4) instead of φ-arrays. An array @Aexp is initialized to ⊥, and
the array copy is bypassed in updating @Aexp[subscript] with the maximum in the right-hand
side of P. The previous code fragment can thus safely be replaced by:

for ( ) { // iterate through the whole array

@Aexp [subscript] = max (set);

}

We now present CSMO-Efficiently-Implement-Phi, the optimized code generation algorithm for φ functions. Remember
that before calling this algorithm, Constrained-Storage-Mapping-Optimization
should be applied on the original program extended with pseudo-assignment statements.24

CSMO-Efficiently-Implement-Phi (expanded)

expanded: an intermediate representation of the expanded program

returns an intermediate representation with run-time restoration code

1 for each array Aexp[shape] in expanded

2 do if there are φ functions accessing Aexp
3 then declare an array @Aexp[shape] initialized to ⊥

4 for each read reference ref to Aexp whose expanded form is φ(set)
5 do sub ← array subscript in ref
6 short ← {v ∈ set : ∄w ∈ set : v &lt;seq w ∧ ¬(v ≬̸ w) ∧ ν(v) = ν(w)}

7 for each statement s involved in set

8 do refs ← write reference in s
9 subs ← array subscript in refs

10 if not already done for s

11 then following s insert

12 @Aexp [subs, ν(CurIns, refs)] = CurIns

13 φ(set) ← Aexp[max&lt;seq {ι ∈ short : @Aexp[sub, ν(ι, ref)] = ι}]
14 for each pseudo-assignment P to Aexp with reference φ(set)
15 do genmax ← code generation for the lexicographic maximum in set
16 right-hand side of φ-array assignment following P ← genmax
17 remove statement P
18 return expanded

Eventually, computing the lexicographic maximum of a set defined in Presburger
arithmetic is a well-known problem with very efficient parallel implementations [RF94],
but it is easier and sometimes faster to perform an online computation. Let us denote
by NextJoin the next instance of the nearest pseudo-assignment statement following
CurIns. Computation of the lexicographic maximum in φ(set) can be performed online
by replacing each assignment of the form

@Aexp [subscript, ν(CurIns)] = CurIns;

24 Same remark regarding partitioning of expanded arrays as for CSMO-Implement-Phi.


by

@Aexp [subscript, ν(NextJoin)] = max (@Aexp [subscript, ν(NextJoin)], CurIns);

Applying CSMO-Efficiently-Implement-Phi and this transformation to the
motivating example yields the same result as the SA form in Figure 5.28.
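For illustration, the per-cell @-array update can be sketched in C: each cell holds an instance vector ordered lexicographically, and every bypassed copy becomes a compare-and-update. The fixed-depth vector encoding, the names lex_cmp and at_update, and the representation of ⊥ as an all-minimum vector are assumptions of this sketch, not generated code.

```c
#include <assert.h>
#include <string.h>

#define DEPTH 3 /* depth of instance vectors (assumed fixed here) */

/* Lexicographic comparison: <0, 0 or >0 as a <seq, =, >seq b. */
static int lex_cmp(const int a[DEPTH], const int b[DEPTH]) {
    for (int d = 0; d < DEPTH; d++)
        if (a[d] != b[d]) return a[d] < b[d] ? -1 : 1;
    return 0;
}

/* One @-array cell update: keep the lexicographic maximum of the
   current content and a candidate instance. With bottom encoded as
   an all-minimum vector, any real instance overwrites it. */
static void at_update(int cell[DEPTH], const int cand[DEPTH]) {
    if (lex_cmp(cand, cell) > 0)
        memcpy(cell, cand, DEPTH * sizeof(int));
}
```

Each run-time assignment to @Aexp then costs one comparison and, at most, one small copy, which is the reason the online variant avoids recomputing whole maxima.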

This section aims to characterize correct parallel execution orders for a program after
maximal constrained expansion. The benefit of memory expansion is to remove spurious
dependences due to memory reuse, but some memory-based dependences may remain after
constrained expansion. We still denote by δ_e^exp (resp. δ^exp) the exact (resp. approximate)
dependence relation of the expanded program with sequential execution order (&lt;seq, f_e^exp).
As announced in Section 5.4.3, we now give the full computation details for (5.29).

Dependences left by constrained expansion are, as usual, of three kinds.

1. Output dependences due to writes connected to each other by the constraint (e.g.
by R in the case of MSE).

2. True dependences, from a definition to a read, where the definition either may reach
the read or is related (by ρ) to a definition that reaches the read.

3. Anti dependences, from a read to a definition, where the definition, even if it executes
after the read, is related (by ρ) to a definition that reaches the read.

Formally, we thus define δ_e^exp for an execution e ∈ E as follows:

∀e ∈ E, ∀v, w ∈ A_e : v δ_e^exp w ⟺ v σ_e w
∨ f_e(v) = f_e(w) ∧ v ≬ w ∧ v &lt;seq w
∨ f_e(v) = f_e(σ_e(w)) ∧ v ≬ σ_e(w) ∧ v &lt;seq w
∨ f_e(w) = f_e(σ_e(v)) ∧ σ_e(v) ≬ w ∧ v &lt;seq w

Then, the following definition of δ^exp is the best pessimistic approximation of δ_e^exp,
supposing relation ρ is the best available approximation of function f_e and σ is the best
available approximation of function σ_e:

∀v, w ∈ A : v δ^exp w ⟺def v σ w (5.38)
∨ v ρ w ∧ v ≬ w ∧ v &lt;seq w (5.39)
∨ ∃u ∈ W : u σ w ∧ v ρ u ∧ v ≬ u ∧ v &lt;seq w (5.40)
∨ ∃u ∈ W : u σ v ∧ u ρ w ∧ u ≬ w ∧ v &lt;seq w (5.41)

Now, since ρ and ≬ are reflexive relations, we observe that (5.38) is already included in
(5.40). We may simplify the definition of δ^exp:

∀v ∈ W, w ∈ R : v δ^exp w ⟺ ∃u ∈ W : u σ w ∧ v ρ u ∧ v ≬ u ∧ v &lt;seq w
∀v ∈ R, w ∈ W : v δ^exp w ⟺ ∃u ∈ W : u σ v ∧ u ρ w ∧ u ≬ w ∧ v &lt;seq w
∀v, w ∈ W : v δ^exp w ⟺ v ρ w ∧ v ≬ w ∧ v &lt;seq w (5.42)


Eventually, we get an algebraic definition of the dependence relation after maximal
constrained expansion:

δ^exp = (≬ ∩ ρ) ∪ σ(≬ ∩ ρ) ∪ σ⁻¹(≬ ∩ ρ). (5.43)

The first term describes output dependences, the second one describes flow dependences
(including reaching definitions), and the third one describes anti-dependences.

Using this definition, Theorem 2.2 page 81 describes correct parallel execution order

<par after maximal constrained expansion. Practical computation of <par is done with

scheduling or tiling techniques, see Section 2.5.2.

As an example, we parallelize the convolution program in Figure 5.6 (page 169). The
constraint is that of the maximal static expansion. First, we define the sequential
execution order &lt;seq within Omega (with conventions defined in Section 5.2.7):

Lex := {[i,w,2]->[i',w',2] : 1<=i<=i'<=N && 1<=w,w' && (i<i' || w<w')}

union {[i,0,1]->[i',w',2] : 1<=i<=i'<=N && 1<=w'}

union {[i,w,2]->[i',0,1] : 1<=i,i'<=N && 1<=w && i<i'}

union {[i,0,1]->[i',0,1] : 1<=i<i'<=N}

union {[i,0,3]->[i',0,3] : 1<=i<i'<=N}

union {[i,0,1]->[i',0,3] : 1<=i<=i'<=N}

union {[i,0,3]->[i',0,1] : 1<=i<i'<=N}

union {[i,w,2]->[i',0,3] : 1<=i<=i'<=N && 1<=w}

union {[i,0,3]->[i',w',2] : 1<=i<i'<=N && 1<=w'};

Second, recall from Section 5.2.7 that all writes are in relation for ≬ (since the data
structure is a scalar variable), and that relation R is defined by (5.12). We compute δ^exp
from (5.43):

D := (R union R(S) union S'(R)) intersection Lex;

D;

{[i,w,2] -> [i,w',2] : 1 <= i <= N && 1 <= w < w'} union

{[i,0,1] -> [i,w',2] : 1 <= i <= N && 1 <= w'} union

{[i,0,1] -> [i,0,3] : 1 <= i <= N} union

{[i,w,2] -> [i,0,3] : 1 <= i <= N && 1 <= w}

After MSE, only dependences between instances sharing the same value of
i remain. This makes the outer loop parallel (which was not the case without expansion of scalar x).
The parallel program in maximal static expansion is given in Figure 5.14.b.

Using the Omega Calculator text-based interface, we describe a step-by-step execution
of the expansion algorithm. We have to code instances as integer-valued vectors. An
instance ⟨s, i⟩ is denoted by vector [i,..,s], where [..] possibly pads the vector with
zeroes. We number T, S, R with 1, 2, 3 in this order, so ⟨T, i, j⟩, ⟨S, i, j, k⟩ and ⟨R, i⟩ are
written [i,j,0,1], [i,j,k,2] and [i,0,0,3], respectively.

The result of instancewise reaching definition analysis is written in Omega's syntax:

S := {[i,0,0,3]->[i,j,k,2] : 1<=i,j<=M && 1<=k<=N}

union {[i,j,1,2]->[i,j,0,1] : 1<=i,j<=M}

union {[i,j,k,2]->[i,j,k-1,2] : 1<=i,j<=M && 2<=k<=N};


The conflict and no-conflict relations are trivial here, since the only data structure is
a scalar variable: ≬ is the full relation and ≬̸ is the empty one.

Con := {[i,j,k,s]->[i',j',k',s'] : 1<=i,i',j,j'<=M && 1<=k,k'<=N

&& ((s=1 && k=0) || s=2 || (s=3 && j=k=0))

&& ((s'=1 && k'=0) || s'=2 || (s'=3 && j'=k'=0))};

NCon := {[i,j,k,s]->[i',j',k',s'] : 1=2}; # 1=2 means FALSE!

Relation ρ is defined as R in Section 5.2.2:

S' := inverse S;

R := S(S');

Computation of dependences is done according to (5.43), and relation Con is removed since it always holds:

D := R union R(S) union S'(R);

In this case, a simple solution for computing a parallel execution order is the transitive
closure computation:

Par := D+;
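A parallel order can also be read off a schedule: give every instance the length of the longest dependence chain reaching it, and run equal-level instances together. The sketch below works on a tiny explicit dependence matrix; this encoding is purely illustrative (the thesis manipulates relations in Omega, not adjacency matrices).

```c
#include <assert.h>

#define N_INST 4 /* illustrative number of instances */

/* dep[i][j] != 0 encodes a dependence i -> j; the graph must be
   acyclic. The level of j is the length of the longest dependence
   chain reaching it; equal-level instances may run in parallel. */
static int level(int dep[N_INST][N_INST], int j) {
    int lv = 0;
    for (int i = 0; i < N_INST; i++)
        if (dep[i][j]) {
            int l = level(dep, i) + 1;
            if (l > lv) lv = l;
        }
    return lv;
}
```

Running all instances of level 0, then level 1, and so on, is a correct parallel execution order whenever the dependence relation is respected, which is the role played by the transitive closure Par above.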

We can now compute relation ⋈ in the left-hand side of the expansion correctness criterion;
call it Int.

# The "full" relation

Full := {[i,j,k,s]->[i',j',k',s'] : 1<=i,i',j,j'<=M && 1<=k,k'<=N

&& ((s=1 && k=0) || s=2 || (s=3 && j=k=0))

&& ((s'=1 && k'=0) || s'=2 || (s'=3 && j'=k'=0))};

Lex := {[i,j,0,1]->[i',j',0,1] : 1<=i<i'<=M && 1<=j,j'<=M}

union {[i,j,0,1]->[i',j',k',2] : 1<=i<=i'<=M && 1<=j,j'<=M

&& 1<=k'<=N}

union {[i,j,k,2]->[i',j',0,1] : 1<=i<i'<=M && 1<=j,j'<=M

&& 1<=k<=N}

union {[i,j,k,2]->[i',j',k',2] : 1<=i<=i'<=M && 1<=j,j'<=M

&& 1<=k,k'<=N && (i<i' || (j<=j' && (j<j' || k<k')))}

union {[i,j,0,1]->[i',0,0,3] : 1<=i<=i'<=M}

union {[i,0,0,3]->[i',j',0,1] : 1<=i<i'<=M}

union {[i,j,k,2]->[i',0,0,3] : 1<=i<=i'<=M && 1<=j<=M

&& 1<=k<=N}

union {[i,0,0,3]->[i',j',k',2] : 1<=i<i'<=M && 1<=j'<=M

&& 1<=k'<=N}

union {[i,0,0,3]->[i',0,0,3] : 1<=i<i'<=M};

ILex := inverse Lex;

INPar := inverse NPar;


union (INPar intersection S(NPar intersection Lex));

Int := Int union (inverse Int);

Int;

&& 1 <= k <= k' <= N && 1 <= i' < i <= M} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= j < j' <= M

&& 1 <= k' < k <= N && 1 <= i' < i <= M} union

{[i,j,k,2] -> [i',j,k',2] : 1 <= k' < k <= N

&& 1 <= i' < i <= M && 1 <= j <= M} union

{[i,j,1,2] -> [i',j',1,2] : N = 1

&& 1 <= i' < i <= M && 1 <= j' < j <= M} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= k <= k' <= N

&& 1 <= i' < i <= M && 1 <= j' < j <= M && 2 <= N} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= k' < k <= N

&& 1 <= i' < i <= M && 1 <= j' < j <= M} union

{[i,j,k,2] -> [i',j,k',2] : k'-1, 1 <= k <= k'

&& 1 <= i < i' <= M && 1 <= j <= M && k < N} union

{[i,j,k,2] -> [i',j',k',2] : 1, k'-1 <= k <= k'

&& 1 <= i < i' <= M && 1 <= j < j' <= M && k < N} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= i < i' <= M

&& 1 <= j < j' <= M && 1 <= k' < k < N} union

{[i,j,k,2] -> [i',j',k',2] : k'-1, 1 <= k <= k'

&& 1 <= i < i' <= M && 1 <= j' < j <= M && k < N} union

{[i,j,k,2] -> [i',j',k',2] : k-1, 1 <= k' <= k

&& 1 <= j < j' <= M && 1 <= i' < i <= M && k' < N} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= k < k' < N

&& 1 <= i' < i <= M && 1 <= j' < j <= M} union

{[i,j,k,2] -> [i',j',k',2] : 1, k-1 <= k' <= k

&& 1 <= i' < i <= M && 1 <= j' < j <= M && k' < N} union

{[i,j,k,2] -> [i',j,k',2] : k-1, 1 <= k' <= k

&& 1 <= i' < i <= M && 1 <= j <= M && k' < N} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= i < i' <= M

&& 1 <= j < j' <= M && 1 <= k < k' <= N} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= i < i' <= M

&& 1 <= j < j' <= M && 1 <= k' <= k <= N && 2 <= N} union

{[i,j,1,2] -> [i',j',1,2] : N = 1 && 1 <= i < i' <= M

&& 1 <= j < j' <= M} union

{[i,j,k,2] -> [i',j,k',2] : 1 <= i < i' <= M

&& 1 <= k < k' <= N && 1 <= j <= M} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= i < i' <= M

&& 1 <= k < k' <= N && 1 <= j' < j <= M} union

{[i,j,k,2] -> [i',j',k',2] : 1 <= i < i' <= M

&& 1 <= k' <= k <= N && 1 <= j' <= j <= M}


Int intersection {[i,j,k,2]->[i,j,k',2]};

and

Int intersection {[i,j,0,1]->[i,j,k',2] : k' != 0};

are both empty. It means that ⟨T, i, j⟩ and ⟨S, i, j, k⟩ should share the same color for
all 1 ≤ k ≤ N (R does not perform any write). However, the sets W_0^T(v), W_0^S(v) (for
the i loop) and W_1^T(v), W_1^S(v) (for the j loop) hold all accesses w executing after v. Then,
different values of i or j enforce different colors for ⟨T, i, j⟩ and ⟨S, i, j, k⟩. Application of the graph
coloring algorithm thus yields the following definition of the coloring relation:

Col := {[i,j,0,1]->[i,j,k,2] : 1<=i,j<=M && 1<=k<=N}

union {[i,j,k,2]->[i,j,k',2] : 1<=i,j<=M && 1<=k,k'<=N};

Eco := R union (Col-R(Full-Col(R)));

Relation ρ is then computed (relation ≬ always holds and has been removed):

Rho := Eco-Lex(Eco);

Rho;

{[i,j,k,2] -> [i,j,0,1] : 1 <= i <= M && 1 <= j <= M && 1 <= k <= N}

The labeling scheme is obvious: the last two dimensions are stripped off from Rho.
The resulting function is thus

ν(⟨T, i, j⟩) = (i, j) and ν(⟨S, i, j, k⟩) = (i, j).

Following the lines of Constrained-Storage-Mapping-Optimization, we have

computed the same storage mapping as in Figure 5.31.

The last contribution of this work is about automatic parallelization of recursive programs.

This topic has received little interest from the compilation community, but the situation

is evolving thanks to new powerful multi-threaded environments for efficient execution

of programs with control parallelism. When dealing with shared-memory architectures

and software-emulated shared memory machines, tools like Cilk [MF98] provide a very

suitable programming model for automatic or semi-automatic code generation [RR99].

Now, what programming model should we consider for parallel code generation? First,

it is still an open problem to compute a schedule from a dependence relation described

by a transducer. This is of course a strong argument against data parallelism as a model

of choice for parallelization of recursive programs. Moreover, we have seen in Section 1.2


that the control parallel paradigm was well suited to express parallel execution in
recursive programs. In fact, this assertion is true when most iterative computations are

implemented with recursive calls, but not when parallelism is located within iterations of

a loop. Since loops can be rewritten as recursive procedure calls, we will stick to control

parallelism in the following.

Notice we have studied powerful expansion techniques for loop nests, but no practical
algorithm for recursive structures has been proposed yet. We thus start with an
investigation of specific aspects of expanding recursive programs and recursive data structures in
Section 5.5.1. Then we present in Section 5.5.2 a simple algorithm for single-assignment
form conversion of any code that fits into our program model: the algorithm can be seen as
a practical realization of Abstract-SA, the abstract algorithm for SA-form conversion
(page 157). Then, a privatization technique for recursive programs is proposed in
Section 5.5.4; and some practical examples are studied in Section 5.5.5. We also give some
perspectives about extending maximal static expansion or storage mapping optimization
to this larger class of programs.

The rest of this section addresses generation of parallel recursive programs.
Section 5.5.6 starts with a short state of the art on parallelization techniques for recursive
programs, then motivates the design of a new algorithm based on instancewise data-flow
information. In Section 5.5.7, we present an improvement of the statementwise
algorithm which allows instancewise parallelization of recursive programs: whether some
statements execute in parallel or in sequence can depend on the instance of these
statements, but it is still decided at compile-time. This technique is also completely novel
in parallelization of recursive programs.

Before proposing a general solution for SA-form conversion of recursive programs, we
investigate several issues which make the problem more difficult for recursive control and
data structures. Recall that elements in data structures in single-assignment form are
in one-to-one mapping with control words. Thus, the preferred layout of an expanded
data structure is a tree. Expanded data structures can sometimes be implemented with
arrays: this is the case when only loops and simple recursive procedures are involved, and
when loops and recursive calls are not "interleaved"; program Queens is such an example.
But automatic recognition of such programs and effective design of a specific expansion
technique are left for future work. We will thus always consider that expanded data
structures are trees whose edges are labeled by statement names.

Management of Recursive Data-Structures

Compared to arrays, lists and trees seem much harder to access and traverse: they
are indeed not random-access data structures. For example, the abstract algorithm
Abstract-SA (page 157) for SA-form conversion uses the notation Dexp[CurIns] to
refer to the access of an element indexed by word ι in a data structure Dexp. But when Dexp is
a tree, what does it mean? How is it implemented? Is it efficient?

There is a quick answer to all these questions: the tree is traversed from its root using
pointer dereferences along letters in CurIns; the result is of course very costly at run-time.
A more clever analysis shows that CurIns is not a random word: it is the current
control word. Its "evolution" during program execution is fully predictable: it can be seen


as a different local variable in each program statement, a new letter being added at each
block entry.

The other problem with recursive data structures is memory allocation. Because they
cannot be allocated at compile-time in general, a very efficient memory management
technique should be used to reduce the run-time overhead. We thus suppose that an
automatic scheme for grouping mallocs or news is implemented, possibly at the C-compiler
or operating system level.
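Such a grouping scheme may be approximated by a bump (arena) allocator, where one large malloc serves many node allocations and each allocation costs a pointer increment. The following sketch is illustrative only; it is not the allocation scheme assumed above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    char  *base;
    size_t used, cap;
} Arena;

static Arena arena_new(size_t cap) {
    Arena a = { malloc(cap), 0, cap };
    return a;
}

/* Bump allocation: a pointer increment instead of a full malloc.
   A real implementation would chain a new block when this one
   is exhausted instead of failing. */
static void *arena_alloc(Arena *a, size_t n) {
    n = (n + 15) & ~(size_t)15; /* keep 16-byte alignment */
    if (!a->base || a->used + n > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}
```

Since expanded-tree nodes are allocated at every block entry, amortizing the cost of malloc this way directly reduces the run-time overhead mentioned above.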

Eventually, both problems can be solved with a simple and efficient code generation
algorithm. The idea is the following: suppose a recursive data structure indexed by
CurIns must be generated by algorithm Abstract-SA; each time a block is entered,
a new element of the data structure is allocated, and the pointer to the last element,
stored in a local variable, is dereferenced accordingly. This technique is implemented in
Recursive-Programs-SA.

When trying to extend maximal static expansion and storage mapping optimization to
recursive programs, two kinds of problems immediately arise:

- transductions are not as versatile as affine relations, because some critical algebraic
operations are not decidable and require conservative approximations;

- the results of dependence and reaching definition analyses are not always as precise
as one would expect, because of the lack of expressiveness of rational and one-counter
transductions.

These two points of course limit the applicability of "evolved" expansion techniques
which intensively rely on algebraic operations on sets and relations.

In addition, a few critical operations useful to "evolved" expansion techniques are
lacking, e.g., the class of left-synchronous relations is not closed under transitive closure.
Conversely, the problem of enumerating equivalence classes seems rather easy, because
the lexicographic selection of a left-synchronous transduction is left-synchronous, see
Section 3.4.3; a remaining problem would be to label the class representatives.

We are not aware of any result about coloring graphs of rational relations, but
optimality should probably not be hoped for, even for recognizable relations. Graph-coloring
algorithms for rational relations would of course be useful for storage mapping
optimization; but recall from Section 5.3.2 that many algebraic operations are involved in the
expansion correctness criterion, and most of these operations are undecidable for rational
relations.

The last point is that we have not found enough codes that both fit into our program
model and require expansion techniques more "evolved" than single-assignment form or
privatization. But this problem lies more with the program model restrictions than with
the applicability of static expansion and storage mapping optimization.

5.5.2 Algorithm

Algorithm Recursive-Programs-SA is a first attempt to give a counterpart of
algorithm Loop-Nests-SA for recursive programs. It works together with Recursive-
Programs-Implement-Phi to generate the code for φ functions. Expanded data
structures all have the same type, ControlType, which is basically a tree type associated with
the language Lctrl of control words. It can be implemented using recursive types and
sub-types, or simply with as many pointer fields as statement labels in Σctrl. An
additional field in ControlType stores the element value; it has the same type as original data
structure elements, and it is called value.
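As an illustration, ControlType may be rendered in C for a control alphabet reduced to the three labels Q, a and b retained for program Queens; the field names and the int value type are assumptions of this sketch, not generated code.

```c
#include <assert.h>
#include <stdlib.h>

/* One node per control word: a child pointer per retained
   statement label, plus the element value. */
typedef struct ControlType {
    struct ControlType *Q, *a, *b;
    int value; /* same type as the original data structure elements */
} ControlType;

static ControlType *ct_new(void) {
    return calloc(1, sizeof(ControlType));
}

/* Entering block a: allocate the child node and move the local
   pointer down, mirroring lines 11-12 of Recursive-Programs-SA. */
static ControlType *enter_a(ControlType *Dlocal) {
    Dlocal->a = ct_new();
    return Dlocal->a;
}
```

The local pointer returned by enter_a plays the role of Dlocal in the algorithm below: it always designates the element indexed by the current control word.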

Recursive-Programs-SA (program, σ)
program: an intermediate representation of the program
σ: a reaching definition relation, seen as a function
returns an intermediate representation of the expanded program
1 define a tree type called ControlType whose elements are indexed in Lctrl
2 for each data structure D in program
3 do define a data structure Dexp of type ControlType
4 define a global pointer variable Dlocal = &Dexp
5 for each procedure in program
6 do insert a new argument Dlocal in the first place
7 for each call to a procedure p in program
8 do insert Dlocal->p = new ControlType () before the call
9 insert a new argument Dlocal->p in the first place
10 for each non-procedure block b in program
11 do insert Dlocal->b = new ControlType () at the top of b
12 define a local pointer variable Dlocal = Dlocal->b
13 for each statement s assigning D in program
14 do left-hand side of s ← Dlocal->value
15 for each reference ref to D in program
16 do ref ← φ(σ(CurIns, ref))
17 return program

A simple optimization to spare memory consists in removing all "useless" fields from
ControlType, and every pointer update code in the associated program blocks and
statements. By useless, we mean statement labels which are not useful to distinguish between
different memory locations, i.e. which cannot be replaced by another label to yield
another instance of an assignment statement to the considered data structure. Applied to
program Queens, only three labels need to be considered to define the fields of ControlType:
Q, a, and b; all other labels are unnecessary to enforce the single-assignment property.
This optimization should of course be applied on a per-data-structure basis,
to take benefit of the locality of data structure usage in programs.

One should notice that every read reference requires a φ function! This is clearly a big
problem for efficient code generation, but detecting exact results and computing reaching
definitions at run-time is not as easy as in the case of loop nests. In fact, a part of the
algorithm is even "abstract": we have not discussed yet how the argument of the φ can be
computed. To simplify the exposition, all these issues are addressed in the next section.

Of course, algorithm Recursive-Programs-Implement-Phi generates the code for
φ-structures φDexp using the same techniques as the SA-form algorithm. These φ-structures
store addresses of memory locations, computed from the original write references in
assignment statements. Each φ function requires a traversal of φ-structures to compute the
exact reaching definition at run-time: the maximum is computed recursively from the
root of φDexp, and the appropriate element value in Dexp is returned. This computation of
the maximum can be done in parallel, as usual for reduction operations on trees.
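Because control words grow along the traversal, a left-to-right depth-first walk visits nodes in lexicographic order (children being sorted by label). A run-time φ may thus keep the last visited node whose stored address matches the reference. The node layout and names below are an illustrative sketch, not the generated code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    struct Node *Q, *a, *b; /* children in textual label order */
    const void *tag;        /* address stored by the phi-structure, or NULL */
    int value;
} Node;

/* Prefix depth-first walk = lexicographic order on control words:
   the last visited node whose tag matches ref is the maximum
   among the candidate definitions. */
static const Node *phi_max(const Node *n, const void *ref,
                           const Node *best) {
    if (!n) return best;
    if (n->tag == ref) best = n;
    best = phi_max(n->Q, ref, best);
    best = phi_max(n->a, ref, best);
    return phi_max(n->b, ref, best);
}
```

The value field of the node returned by phi_max is then the value restored by the φ function; the independent sub-tree walks are what makes the parallel reduction mentioned above possible.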


Recursive-Programs-Implement-Phi (expanded)
expanded: an intermediate representation of the expanded program
returns an intermediate representation with run-time restoration code
1 for each expanded data structure Dexp in expanded
2 do if there are φ functions accessing Dexp
3 then define a data structure φDexp of type ControlType
4 define a global pointer variable φDlocal = &φDexp
5 for each procedure in program
6 do insert a new argument φDlocal in the first place
7 for each call to a procedure p in program
8 do insert φDlocal->p = new ControlType () before the call
9 insert a new argument φDlocal->p in the first place
10 for each non-procedure block b in program
11 do insert φDlocal->b = new ControlType () at the top of b
12 define a local pointer variable φDlocal = φDlocal->b
13 insert φDlocal->value = NULL
14 for each read reference ref to Dexp whose expanded form is φ(set)
15 do for each statement s involved in set
16 do refs ← write reference in s
17 if not already done for s
18 then following s insert φDlocal->value = &refs
19 φ(set) ← { traverse Dexp and φDexp in lexicographic order
using pointers Dlocal and φDlocal respectively;
if (φDlocal->value == &refs) maxloc = Dlocal;
maxloc->value; }
20 return expanded

Two problems remain with φ function implementation.

- The tree traversal does not use the set argument of φ functions at all! Indeed,
testing for membership in a rational language is not a constant-time problem, and
it is not even linear in general for algebraic languages. This point is also related
to run-time computation of sets of reaching definitions: it will be discussed in the
next section.

- Several φ functions may induce many redundant computations, since the maximum
must every time be computed on the whole structure, without taking benefit of
previous results. This problem was solved for loop nests using a complex technique
integrated with constrained storage mapping optimization (see Section 5.4.7), but
no similar technique for recursive programs is available.

In the last section, all read accesses were implemented with φ functions. This solution
ensures correctness of the expanded program, but it is obviously not the most efficient.
If we know that the reaching definition relation is a partial function (i.e. the result is
exact), we can hope for an efficient run-time computation of its value, as is the case
for loop nests (with the quast representation). Sadly, this is not as easy in general: some
rational functions cannot be computed for a given input in linear time, and it is even
worse for algebraic functions.


The class of sequential functions is interesting for this purpose, since it is decidable
and allows efficient online computation, see Section 3.3.3. Because, for every state and
input letter, the output letter and next state are known unambiguously, we can compute
sequential functions together with pointer updates for expanded data structures. This
technique can be easily extended to sub-sequential functions, by adding the pointer
updates associated with the terminal output function (from states to words, see Definition 3.10 page 100).
The class of sub-sequential transductions is decidable in polynomial time among rational
transductions and functions [BC99b]. This online computation technique is detailed
in algorithm Recursive-Programs-Online-SA, for sub-sequential reaching definition
transductions. An extension to online rational transductions would also be possible,
without significantly increasing the run-time computation cost, but decidability is not known
for this class.
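The online evaluation scheme may be illustrated with a toy sub-sequential transducer: one deterministic transition per input letter, each emitting an output word, plus a terminal output attached to the final state. The transition tables below are invented for illustration; only the evaluation scheme reflects the text.

```c
#include <assert.h>
#include <string.h>

#define N_STATES 2 /* illustrative two-state machine over {0,1} */

typedef struct {
    int         next[N_STATES][2];   /* deterministic transitions  */
    const char *out[N_STATES][2];    /* output word per transition */
    const char *final_out[N_STATES]; /* terminal output per state  */
} SubSeq;

/* Online evaluation: the output is extended deterministically at
   each input letter, so expanded-tree pointers could be updated
   in lockstep with the transitions. */
static void subseq_run(const SubSeq *t, const char *in, char *out) {
    int q = 0;
    out[0] = '\0';
    for (; *in; in++) {
        int c = *in - '0';
        strcat(out, t->out[q][c]);
        q = t->next[q][c];
    }
    strcat(out, t->final_out[q]);
}
```

Determinism is the key property: at no point does the evaluation need to backtrack, which is what makes the per-letter pointer updates of Recursive-Programs-Online-SA possible.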

Dealing with algebraic functions is less appealing, because deciding whether an
algebraic relation is a function is rather unlikely to be possible, and it is the same for the class of online

algebraic transductions. But supposing we are lucky enough to k