
The Basic Research Ideas of the PTG.

The Pattern Theory Group (PTG) in the Avionics Directorate at Wright Laboratory develops new system-level concepts for military applications, based mostly on image processing and machine learning technologies. Since 1988, the PTG has been developing a radically new approach to machine learning, where by ``machine learning'' we understand any method of teaching a computer to recognize patterns of any kind. Specifically, we are examining Supervised Classification Learning paradigms. The machine learning technologies most influential in military applications until now have been neural networks and Bayesian methods. If successfully completed, the new approach of the PTG will allow both automatic learning of any kind of images and automatic creation of algorithms. Interestingly, in contrast to most of the well-known approaches, the approach of the PTG is based on logic synthesis methods: the so-called Curtis Decomposition of Boolean functions is applied here as the main component [18]. Many decomposition ideas have been implemented in the programming system FLASH (the Function Learning and Synthesis Hotbed) developed by this group [20]. This system is a testbed for machine learning based on the logic synthesis approach. The group also compares FLASH to other logic optimizers and machine learning programs (such as Espresso and C4.5, respectively) from the point of view of the robustness of learning [7, 8].

Simplifying somewhat, the main difference between the logic approach and the previous approaches to machine learning is that in those previous methods the recognizing network had some assumed structure which was ``tuned'' by the learning process (for instance, by decreasing and increasing the numerical values of some coefficients). Creating a new structure is then accomplished by setting some coefficients to zero, so all ``new'' structures are in a sense ``hidden'' in the assumed mother-structure. Also, the type of an element, such as a formal neuron in neural nets or a polynomial in the Abductory Inference Mechanism (AIM), is decided before the learning takes place.

In contrast, in the Curtis Decomposition approach there is no a priori assumption about the structure of the learning network, nor about the type of the elements. The elements are arbitrary discrete mappings (functions) in the form of a combinational network with universal gates. Both the structure and the elements are calculated in the learning process, and this process is entirely based on finding patterns in data.

The central concept of Pattern Theory is a ``pattern.'' In Pattern Recognition and Knowledge Discovery, the problems with nice representations based on ``features'' belong to a more general class of problems with restrictive ``patterns.'' Pattern finding is, therefore, a generalization and formalization of feature extraction. The goal of Pattern Theory is to support ``automating'' the pattern-finding process and to construct a representation of a function based on examples; it is therefore a method for constructive induction. Furthermore, it constructs this representation by minimizing complexity, as in Occam-based learning. The Ashenhurst Function Decomposition (AFD) method based on Curtis Decomposition and implemented in FLASH is unusual in that it learns both the architecture of the combinational representation and the component functions from the examples.
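To make the basic decomposition step concrete, the following is a minimal sketch (ours, not the FLASH implementation) of the classical test behind Ashenhurst/Curtis decomposition: for a chosen partition of the inputs into a bound set and a free set, one counts the distinct columns of the decomposition chart; a column multiplicity of at most two indicates that a simple disjoint decomposition f(X) = h(g(bound set), free set) exists. The function names and the example function below are our own illustration.

    from itertools import product

    def column_multiplicity(f, n, bound):
        # Count the distinct columns of the decomposition chart of f for the
        # partition (bound set | free set).  f is a callable on n-bit tuples,
        # bound is the list of indices of the bound-set variables.
        free = [i for i in range(n) if i not in bound]
        columns = set()
        for b in product((0, 1), repeat=len(bound)):     # one column per bound-set cofactor
            col = []
            for a in product((0, 1), repeat=len(free)):  # rows indexed by the free set
                x = [0] * n
                for i, v in zip(bound, b):
                    x[i] = v
                for i, v in zip(free, a):
                    x[i] = v
                col.append(f(tuple(x)))
            columns.add(tuple(col))
        return len(columns)

    # Example: f = (x0 AND x1) OR x2 has column multiplicity 2 for the bound
    # set {x0, x1}, so f(x) = h(g(x0, x1), x2) for some h and g.
    f = lambda x: (x[0] & x[1]) | x[2]
    print(column_multiplicity(f, 3, [0, 1]))   # -> 2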

Induction is often modeled as the process of extrapolating samples of a function. This extrapolation requires both the samples and an ``inductive bias.'' The bias towards low complexity, as in Occam's Razor, is particularly important. There is a strong theoretical basis for Occam-based learning; see for example [2, 3]. Kolmogorov complexity was developed specifically for induction [14]; however, it has been proven that its exact computation is not tractable. Some tractable measures of complexity have been used in actual implementations of Occam-based learning, such as the Abductory Inference Mechanism, which uses polynomial networks, and C4.5 [17], which uses decision trees. However, the measures of complexity used in these applications are relatively narrow, which implies some compromise in the robustness of the learning; for example, neither of these methods would find the parity function to have low complexity, even though it is highly patterned and can be easily computed. The challenge is to develop robust and tractable measures of complexity.

Pattern Theory [18] treats robust minimal complexity determination as the problem of finding a pattern. Pattern Theory uses Decomposed Function Cardinality (DFC), proposed by Y. S. Abu-Mostafa as a general measure of complexity [1, p. 128]. DFC is based on the cardinality of a function. After all, a function is a set of ordered pairs and, like any set, has a definite number of elements, or cardinality. DFC is defined as the sum of the cardinalities of the components of a combinational representation of a function, when that sum has been minimized. DFC is especially robust in the sense that it reflects patterns of many different kinds. Its robustness is supported by its relationship to more conventional measures of complexity, including circuit size complexity, time complexity, decision tree or diagram size, and Kolmogorov complexity: if a problem has low complexity by any of these measures, then it also has low DFC [18, Chapter 4]. The PTG work has concentrated on functions with binary inputs, but the concept is easily extended to continuous and multiple-valued functions [21].
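As a worked illustration of how DFC captures patterns that narrower measures miss, consider the parity function mentioned above: n-input parity can be realized as a tree of n - 1 two-input EXOR blocks. With each k-input block charged 2^k (a convention discussed further below), the DFC is at most 4(n - 1), while the undecomposed truth table costs 2^n. The short sketch below (ours, not taken from [18]) simply tabulates this bound.

    def parity_dfc_bound(n):
        # Upper bound on the DFC of n-input parity: a tree of (n - 1)
        # two-input EXOR blocks, each charged 2**2 = 4.
        return 4 * (n - 1)

    for n in (4, 8, 16):
        # e.g. for 16 inputs: 60 versus 65536 for the flat truth table
        print(n, parity_dfc_bound(n), 2 ** n)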

The decompositions are evaluated based on the sum of the function cardinalities of the decomposed partial blocks. Of course, the true DFC of f is less than or equal to this sum in any particular (usually approximate) realization. The objective of the decomposition is to search through partitions of input variables to find one that produces the smallest total DFC. In other words, the ``cost'' of a feature is measured in terms of DFC, and that cost is minimized. Therefore, features are constructed to minimize their computational complexity. The use of DFC as the measure of complexity allows for robustness in the type of feature that can be generated.

Let us observe that both in Curtis decomposition and in the ESOP minimization approach of EXORCISM we look for certain patterns in data: in Curtis decomposition these patterns are in the columns of the map corresponding to the cofactors of the bound set of variables [5]. The patterns for ESOP minimization are based on certain rules of Boolean algebra [24].
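For illustration, the tiny brute-force check below (ours) verifies two EXOR identities of the kind such Boolean-algebra rules build on, namely a EXOR a = 0 and ab EXOR a = ab'; the actual cube-transformation rules used in EXORCISM [24] are more elaborate and are not reproduced here.

    from itertools import product

    def equal(f, g, n):
        # Exhaustively compare two n-input Boolean functions.
        return all(f(x) == g(x) for x in product((0, 1), repeat=n))

    # a EXOR a = 0
    print(equal(lambda x: x[0] ^ x[0], lambda x: 0, 1))            # True
    # ab EXOR a = ab'  (absorbing a shared literal under EXOR)
    print(equal(lambda x: (x[0] & x[1]) ^ x[0],
                lambda x: x[0] & (1 - x[1]), 2))                   # True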

Let us also observe that in both cases we minimize a certain cost of the circuit. Traditionally, in ESOP minimization one calculates the number of terms and the number of literals. In our case, we additionally calculated the total DFC as the sum of the function cardinalities of all non-decomposable blocks (this is an upper-bound approximation of the minimum DFC). For an arbitrary non-decomposable block in Curtis Decomposition, the DFC of the block is calculated as 2^k, where k is the number of inputs to the block. In ``gate-based'' minimizers such as Espresso and EXORCISM it is then fair to assume that the DFC of a decomposable gate (such as AND, OR or EXOR) is equal to the total DFC of a circuit equivalent to this gate that is constructed from two-input gates. The DFC of a four-input AND gate, OR gate, or EXOR gate is then 2^2 + 2^2 + 2^2 = 12, since such gates can be decomposed into balanced trees of three two-input gates.
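Under this convention, the DFC charged to a gate-based circuit is simply a sum of 4(k - 1) over its gates, where k is each gate's fan-in. The sketch below (ours, with a hypothetical fan-in list as input) tallies this for a small ESOP such as abcd EXOR ef: one four-input AND (12), one two-input AND (4), and one two-input EXOR collecting the terms (4), for a total of 20.

    def circuit_dfc(fanins):
        # Each k-input gate is charged as a balanced tree of (k - 1)
        # two-input gates, 2**2 = 4 apiece.
        return sum(4 * (k - 1) for k in fanins)

    # ESOP  abcd EXOR ef : gates with fan-ins 4, 2 and 2
    print(circuit_dfc([4, 2, 2]))   # -> 20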

One variant of the AFD algorithm looks at all partitions at a given level and at 40 random partitions one level deeper. Essentially, this is a two-ply lookahead search in which we explore only 40 grandchildren. It chooses one partition over another based on the resulting function's DFC. EXORCISM finds an ESOP, Espresso finds a SOP, and C4.5 finds a low-complexity decision tree. Minimizing complexity provides good generalization performance [8, 15, 18].
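A schematic rendering of such a lookahead is given below, under our own assumptions about the interface: dfc_of scores a candidate partition and children_of enumerates the partitions one level deeper (both names are hypothetical, and the actual FLASH variant may back up scores differently).

    import random

    def two_ply_select(partitions, dfc_of, children_of, n_samples=40):
        # Score every partition at the current level; back each one up by
        # the best DFC found among itself and up to n_samples randomly
        # chosen partitions one level deeper.
        best, best_score = None, float("inf")
        for p in partitions:
            deeper = children_of(p)
            sample = random.sample(deeper, min(n_samples, len(deeper)))
            score = min([dfc_of(p)] + [dfc_of(q) for q in sample])
            if score < best_score:
                best, best_score = p, score
        return best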


