The Lolita natural language engineering system [Morgan et al., 1994] has been developed at Durham University. The team's interest in parallelism is partly as a means of reducing runtime, and partly as a means of increasing functionality within an acceptable response time. The overall structure of the program bears some resemblance to that of a compiler, being formed from the following large stages: morphology, syntactic parsing, semantic analysis, pragmatic analysis, and a final back end.
Our immediate goal in parallelising this system is to expose sufficient parallelism to fully utilise a 4-processor shared-memory Sun SPARCserver. A pipeline approach is a promising way to achieve this relatively small degree of parallelism (see figure below). Each stage listed above is executed by a separate thread, and these threads are linked to form a pipeline. The key step in parallelising the system is to define strategies on the complex intermediate data structures (e.g. parse trees) that are used to communicate between the stages. This data-oriented approach simplifies the top-down parallelisation of a very large system, since it is possible to specify the parts of a data structure that should be evaluated in parallel without considering the algorithms that produce it.
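The pipeline idea can be sketched in a few lines. This is a minimal, base-only sketch: the combinators are defined directly in terms of GHC.Conc's par and pseq (mirroring the strategies module), and morphology, parsing and analysis are illustrative placeholder stages, not Lolita's actual functions.

```haskell
import GHC.Conc (par, pseq)

-- A strategy specifies how much of a value to evaluate.
type Strategy a = a -> ()

-- Spark evaluation of the argument (under strategy s) in parallel
-- with applying f, so producer and consumer stages can overlap.
($||) :: (a -> b) -> Strategy a -> a -> b
f $|| s = \x -> s x `par` f x

-- Force every element of a list to weak head normal form.
seqListWhnf :: Strategy [a]
seqListWhnf = foldr (\x r -> x `pseq` r) ()

-- Placeholder stages standing in for morphology, parsing and the
-- semantic/pragmatic analyses.
morphology :: String -> [String]
morphology = words

parsing :: [String] -> [(String, Int)]
parsing = map (\w -> (w, length w))

analysis :: [(String, Int)] -> Int
analysis = sum . map snd

-- Stages linked into a pipeline: the strategy attached to each
-- intermediate structure says how much of it to evaluate in parallel,
-- independently of the algorithm that produces it.
pipeline :: String -> Int
pipeline inp =
  analysis $|| seqListWhnf $
  parsing  $|| seqListWhnf $
  morphology inp
```

Compiled with -threaded, each $|| sparks the evaluation of a stage's output so the downstream stage can run concurrently; without it the same code simply runs sequentially.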
In addition to the pipeline parallelism, we introduce parallelism in the syntactic parsing stage. The parallelism in this module has the overall structure of a parallel tree traversal. To improve the granularity in this stage we apply a thresholding strategy (similar to the one at the end of Section *) to a system parameter that reflects the depth in the tree. In fact, the same polymorphic thresholding strategy is applied to two lists of different types.
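The thresholding idea can be sketched as follows; thresholdList is a hypothetical name, and the Strategy machinery is inlined in base-only form so the example stands alone. Below the depth threshold, list elements are sparked in parallel; beyond it they are evaluated sequentially, keeping spark granularity reasonable. Because the strategy is polymorphic, the same definition serves lists of different element types.

```haskell
import GHC.Conc (par, pseq)

type Strategy a = a -> ()

-- Apply a strategy to a value and return the value.
using :: a -> Strategy a -> a
using x s = s x `pseq` x

rwhnf :: Strategy a
rwhnf x = x `pseq` ()

-- Spark a strategy application for every list element.
parList :: Strategy a -> Strategy [a]
parList _ []     = ()
parList s (x:xs) = s x `par` parList s xs

-- Evaluate the list sequentially instead.
seqList :: Strategy a -> Strategy [a]
seqList _ []     = ()
seqList s (x:xs) = s x `pseq` seqList s xs

-- Hypothetical thresholding strategy: parallel only while the
-- depth parameter is below the threshold.
thresholdList :: Int -> Int -> Strategy a -> Strategy [a]
thresholdList threshold depth s
  | depth < threshold = parList s
  | otherwise         = seqList s

-- The same polymorphic strategy applied at two different list types.
demo :: ([Int], [String])
demo = ( xs `using` thresholdList 4 2 rwhnf   -- depth 2: parallel
       , ys `using` thresholdList 4 6 rwhnf ) -- depth 6: sequential
  where
    xs = map (*2) [1..5]
    ys = map show [1..3 :: Int]
```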
Another source of parallelism can be used to improve the quality of the analysis: applying the semantic and pragmatic analyses in a data-parallel fashion to different possible parse trees for the same sentence. Because of the complexity of these analyses, the sequential system always picks the first parse tree, which may cause the analysis to fail even though it would succeed on a different parse tree. We have included code to exploit this kind of parallelism, but have not yet tested its influence on the quality of the result.
The figure at the bottom of this page shows the parallel structure arising when all of the sources of parallelism described above are used. Note that the analyses also produce information that is put into a 'global context' containing information about the semantics of the text. This creates an additional dependence between different instances of the analysis (indicated as vertical arcs). Lazy evaluation ensures that this does not completely sequentialise the analyses, however.
The code of the top level function wholeTextAnalysis in Figure * clearly shows how the algorithm is separated from the dynamic behaviour in each stage. The only changes in the algorithm are the use of parMap in the parsing stage and the strategy applications (via $| and $||) attached to the analysis and back-end stages.
wholeTextAnalysis opts inp global =
  result
  where
    -- (1) Morphology
    (g2, sgml) = prepareSGML inp global
    sentences  = selectEntitiesToAnalyse global sgml

    -- (2) Parsing
    rawParseForest = parMap rnf (heuristic_parse global) sentences

    -- (3)-(5) Analysis
    anlys = stateMap_TimeOut (parse2prag opts) rawParseForest global2

    -- (6) Back End
    result = back_end anlys opts

-- Pick the parse tree with the best score from the results of
-- the semantic and pragmatic analysis. This is done speculatively!
parse2prag opts parse_forest global =
  pickBestAnalysis global  $|| evalScores $
  take (getParsesToAnalyse global)        $
  map analyse parse_forest
  where
    analyse pt   = mergePragSentences opts $ evalAnalysis
    evalAnalysis = stateMap_TimeOut analyseSemPrag pt global
    evalScores   = parList (parPair rwhnf (parTriple rnf rwhnf rwhnf))

-- Pipeline the semantic and pragmatic analyses
analyseSemPrag parse global =
  prag_transform  $|| rnf $
  pragm           $|| rnf $
  sem_transform   $|| rnf $
  sem (g2,[])     $|| rnf $
  addTextrefs global  $| rwhnf $
  subtrTrace global parse

back_end inp opts =
  mkWholeTextAnalysis   $| parTriple rwhnf (parList rwhnf) rwhnf $
  optQueryResponse opts $|| rnf $
  traceSemWhole         $|| rnf $
  addTitleTextrefs      $|| rnf $
  unifyBySurfaceString  $|| rnf $
  storeCategoriseInf    $|| rnf $
  unifySameEvents opts  $| parPair rwhnf (parList (parPair rwhnf rwhnf)) $
  unpackTrees           $| parPair rwhnf (parList rwhnf) $
  inp
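For reference, the combinators appearing in this figure can be approximated in plain GHC as below. This is a sketch following the definitions in the strategies module: the NFData class here is a minimal stand-in for the real one (which Lolita instantiates for its intermediate data structures), and only the instances needed for the example are given.

```haskell
import GHC.Conc (par, pseq)

type Strategy a = a -> ()

-- Reduce to weak head normal form.
rwhnf :: Strategy a
rwhnf x = x `pseq` ()

-- Apply a strategy to a value and return the value.
using :: a -> Strategy a -> a
using x s = s x `pseq` x

-- Minimal stand-in for the NFData class: full evaluation.
class NFData a where
  rnf :: Strategy a

instance NFData Int  where rnf = rwhnf
instance NFData Char where rnf = rwhnf
instance NFData a => NFData [a] where
  rnf []     = ()
  rnf (x:xs) = rnf x `pseq` rnf xs

-- Sequential ($|) and parallel ($||) strategic function application.
($|), ($||) :: (a -> b) -> Strategy a -> a -> b
f $|  s = \x -> s x `pseq` f x
f $|| s = \x -> s x `par`  f x

-- Spark a strategy application for every list element.
parList :: Strategy a -> Strategy [a]
parList _ []     = ()
parList s (x:xs) = s x `par` parList s xs

-- Component-wise parallel strategies on pairs and triples.
parPair :: Strategy a -> Strategy b -> Strategy (a, b)
parPair sa sb (a, b) = sa a `par` sb b `par` ()

parTriple :: Strategy a -> Strategy b -> Strategy c -> Strategy (a, b, c)
parTriple sa sb sc (a, b, c) = sa a `par` sb b `par` sc c `par` ()

-- Parallel map: spark the evaluation of every result element.
parMap :: Strategy b -> (a -> b) -> [a] -> [b]
parMap s f xs = map f xs `using` parList s
```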
The strategies used in parse2prag are of special interest. The parse forest rawParseForest contains all possible parses of a sentence. The semantic and pragmatic analyses are then applied to a predefined number (specified in global) of these parses. The strategy applied to the list of results, parList (parPair ...), demands only the score of each analysis (the first element in the triple), not the complete parse. This score is used in pickBestAnalysis to decide which of the parses to choose as the result of the whole text analysis. Since Lolita makes heavy use of laziness, it is very important that the strategies are not too strict; otherwise redundant computations are performed, which yield no further improvement in runtime.

It should be emphasised that specifying the strategies that describe this parallel behaviour entailed understanding and modifying only one of about three hundred modules in Lolita, and only three of the thirty-six functions in that module. So far, the only module whose internal algorithm we have parallelised is the syntactic parsing stage; if it proves necessary to expose more parallelism, we could parallelise other sub-algorithms, which also contain significant sources of parallelism. In fact, the most tedious part of the code changes was adding instances of NFData for the intermediate data structures, which are spread over several dozen modules. However, this process has been partially automated, as described in Section *.
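The effect of this deliberately weak demand can be demonstrated in isolation. In the sketch below (Strategy machinery inlined, hypothetical score triples), evalScores forces only the first component of each triple; the remaining fields can even be left as undefined thunks, mirroring how the unchosen analyses in Lolita are never fully evaluated.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)
import GHC.Conc (par, pseq)

type Strategy a = a -> ()

using :: a -> Strategy a -> a
using x s = s x `pseq` x

parList :: Strategy a -> Strategy [a]
parList _ []     = ()
parList s (x:xs) = s x `par` parList s xs

-- Demand only the score (first component) of each analysis;
-- leave the rest of the triple as an unevaluated thunk.
evalScores :: Strategy [(Int, a, b)]
evalScores = parList (\(score, _, _) -> score `pseq` ())

-- Choose the analysis with the highest score.
pickBest :: [(Int, a, b)] -> (Int, a, b)
pickBest = maximumBy (comparing (\(s, _, _) -> s))

-- The non-score fields are never demanded, so undefined is harmless:
-- a stricter strategy (e.g. full rnf on the triples) would crash here,
-- and in Lolita would redundantly evaluate every candidate analysis.
demo :: Int
demo = let (s, _, _) = pickBest (analyses `using` evalScores) in s
  where
    analyses = [ (3, undefined, undefined)
               , (7, undefined, undefined)
               , (5, undefined, undefined) ] :: [(Int, (), ())]
```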
We are currently tuning the performance of Lolita on the Sun SPARCserver. A realistic simulation showed an average parallelism between 2.5 and 3.1, using just the pipeline parallelism and parallel parsing. Since Lolita was originally written without any consideration for parallel execution, and contains a sequential front end (written in C) that accounts for roughly 10-15% of the runtime, we are pleased with this amount of parallelism; in particular, the gain for such a small set of changes is quite remarkable.

The wall-clock speedups obtained to date are disappointing, however. With two processors and small inputs we obtain an average parallelism of 1.4; with more processors the available physical memory is insufficient, and heavy swapping causes a drastic degradation in performance. The reason is that GUM, which is designed to support distributed-memory architectures uniformly, loads a copy of the entire code, and a separate local heap, onto each processor. Lolita is a very large program, incorporating large static data segments (totalling 16Mb), and requires 100Mb of virtual memory in total in its sequential incarnation. Clearly a great deal of this data could be shared, an opportunity we are exploring. We hope to obtain better results soon using a machine with twice the memory capacity. We are also making the C parsing functions re-entrant, which will allow the analysis to be performed in a data-parallel fashion over a set of input sentences.