Presto, the "perform SQL-on-anything" engine

slides, and the main url

understanding the presto query engine and its optimizations under the hood

lifecycle

parsing (simplified)

parsing
- validating input types and values
- operators have the right number of arguments
- the sequence of SQL clauses is correct (e.g. FROM comes after SELECT)
- important for presto to be able to plan the query execution
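
A minimal sketch of the clause-order check from the list above. This is a toy illustration, not how presto's parser is implemented: the `CLAUSE_ORDER` list and the `clauses_in_order` helper are made up for this note, and the check ignores subqueries and string literals.

```python
import re

# Canonical order of the top-level clauses (hypothetical, simplified list).
CLAUSE_ORDER = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]


def clauses_in_order(sql: str) -> bool:
    """Return True if the clauses found in `sql` appear in canonical order."""
    text = sql.upper()
    positions = []
    for clause in CLAUSE_ORDER:
        # Allow any whitespace between the words of a two-word clause.
        pattern = r"\b" + clause.replace(" ", r"\s+") + r"\b"
        match = re.search(pattern, text)
        if match:
            positions.append(match.start())
    # Valid only if at least one clause was found and positions are ascending.
    return bool(positions) and positions == sorted(positions)


ok = clauses_in_order("SELECT a FROM t WHERE a > 1")
bad = clauses_in_order("FROM t SELECT a")
```

Here `ok` is true and `bad` is false, because FROM appears before SELECT in the second statement.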

analysis breakdown

planning

optimization

Optimization
- the process of applying a set of semantics-preserving transformations to the plan to produce a more efficient plan that can be executed
  - semantics-preserving means: at every transformation step, the output of the transformed plan is guaranteed to match the output of the original plan
- the end result is a set of fragments (the dotted-line boxes) that represent an abstract topology of the query that will be executed in the cluster
- e.g. a Scan operation followed by a Filter operation gets turned into a single Filtered Scan operation (which is faster, since rows are filtered as they are read)
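
The Scan + Filter example can be sketched as a tiny rewrite rule. This assumes a made-up plan representation (`Scan`, `Filter`, `FilteredScan`, `optimize`, and `execute` are hypothetical names, not presto classes). The final assertion is the "semantics-preserving" guarantee in miniature: both plans return the same rows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class Scan:
    rows: List[Row]


@dataclass
class Filter:
    source: object
    predicate: Callable[[Row], bool]


@dataclass
class FilteredScan:
    rows: List[Row]
    predicate: Callable[[Row], bool]


def optimize(plan):
    """Rewrite Filter(Scan) into FilteredScan; leave other plans unchanged."""
    if isinstance(plan, Filter) and isinstance(plan.source, Scan):
        return FilteredScan(plan.source.rows, plan.predicate)
    return plan


def execute(plan) -> List[Row]:
    """Evaluate a plan node and return its output rows."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [r for r in execute(plan.source) if plan.predicate(r)]
    if isinstance(plan, FilteredScan):
        return [r for r in plan.rows if plan.predicate(r)]
    raise TypeError(f"unknown plan node: {plan!r}")


table = [{"id": 1, "qty": 5}, {"id": 2, "qty": 50}, {"id": 3, "qty": 500}]
original = Filter(Scan(table), lambda row: row["qty"] > 10)
optimized = optimize(original)

# The transformed plan must produce the same output as the original plan.
assert execute(optimized) == execute(original)
```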

scheduling and execution

Scheduling and execution
- coordinator identifies workers that can do the work
- submits the fragments of work (logic)
- submits the data that the worker needs to work on (data)
- wires the fragment dependencies according to the linkage structure (N:M exchange, 1:M exchange in the optimization chart) to satisfy that topology.
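
A rough sketch of the three coordinator steps above (place the logic, hand out the data, wire the exchanges). Everything here is hypothetical: the worker and fragment names, the round-robin placement, and the helper functions are assumptions for illustration, not presto's actual scheduler.

```python
from itertools import product


def place_tasks(fragment_parallelism, workers):
    """Map each fragment to the workers that run a task for it (round-robin)."""
    placement = {}
    cursor = 0
    for fragment, parallelism in fragment_parallelism.items():
        placement[fragment] = [
            workers[(cursor + i) % len(workers)] for i in range(parallelism)
        ]
        cursor += parallelism
    return placement


def assign_splits(splits, tasks):
    """Spread data splits across the tasks of the scanning fragment."""
    assignment = {task: [] for task in tasks}
    for i, split in enumerate(splits):
        assignment[tasks[i % len(tasks)]].append(split)
    return assignment


def wire_exchange(upstream_tasks, downstream_tasks):
    """Connect every upstream task to every downstream task (N:M exchange)."""
    return list(product(upstream_tasks, downstream_tasks))


workers = ["worker-1", "worker-2", "worker-3"]

# 1. logic: decide which workers run each fragment
placement = place_tasks({"scan": 3, "aggregate": 2, "output": 1}, workers)

# 2. data: hand the scan tasks the splits they should read
splits = [f"split-{i}" for i in range(5)]
split_assignment = assign_splits(splits, placement["scan"])

# 3. topology: wire scan -> aggregate (N:M) and aggregate -> output (N:1)
scan_to_aggregate = wire_exchange(placement["scan"], placement["aggregate"])
aggregate_to_output = wire_exchange(placement["aggregate"], placement["output"])
```

With 3 scan tasks and 2 aggregate tasks, `scan_to_aggregate` holds 6 links, which is the N:M linkage in this toy setup.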

explain

given a query, how is the engine going to structure and optimize it?
use the keyword EXPLAIN to see how presto is going to represent the nodes internally
instead of the query results, it will output a series of fragments
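
A small sketch of building an EXPLAIN statement. It only constructs the SQL text; running it needs a presto client and a cluster, which are not shown. The helper name `explain_statement` and the `orders` table are hypothetical. As far as I know, the `(TYPE DISTRIBUTED)` option asks for the fragmented plan explicitly, but check the EXPLAIN documentation for your version.

```python
# Plan types accepted by EXPLAIN, per the presto documentation as I understand it.
VALID_TYPES = {"LOGICAL", "DISTRIBUTED", "VALIDATE", "IO"}


def explain_statement(query: str, plan_type: str = "DISTRIBUTED") -> str:
    """Prefix `query` with EXPLAIN (TYPE ...) so the engine returns the plan."""
    plan_type = plan_type.upper()
    if plan_type not in VALID_TYPES:
        raise ValueError(f"unsupported plan type: {plan_type}")
    # Drop surrounding whitespace and a trailing semicolon before prefixing.
    return f"EXPLAIN (TYPE {plan_type}) {query.strip().rstrip(';')}"


statement = explain_statement("SELECT status, count(*) FROM orders GROUP BY status;")
```

Submitting `statement` through any presto client should return the plan text rather than the query results.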