Presto, the "preform SQL-on-anything" engine

slides, and the main url

tuning video from presto creator

common table expressions

https://chartio.com/resources/tutorials/using-common-table-expressions/ https://github.com/prestodb/presto/issues/3431

some query tuning references

https://gist.github.com/ian-whitestone/211d0a27687ac3daa3f49699b4606613

understanding the presto query engine and it's optimizations under the hood

what is presto

presto query execution lifecycle

lifecycle

a look into the coordinator

parsing (simplified)

  1. parsing
    • validating input types and values
    • operators have the right number of arguments
    • sequence of SQL clauses are correct (FROM comes after SELECT etc)
    • important to for presto to perform planning of the query execution

analysis breakdown

  1. analysis
    • takes the tree and makes sure that the tree represents a query that is semantically valid
      • e.g. meaning it makes sure the components of the values work with the given operators
      • e.g. if we have a group by clause and a select clause, the expresisons in the select clause should satisfy the requirements of the group by
    1. it starts from the FROM clause
      • presto looks for the table and gets its schema using presto's 'connectors'
    2. then it assigns types to all the nodes (ref above: orderstatus, totalprice, orderdate etc)
    3. then it looks at its parent operators (e.g. the larger than operator from example above) and resolves that operator from the registry of operators it has (a larger than function that works compares two DATE objects and returns a boolean)
    4. at the end of the process the analysis step ends up with a tree that's annotated with all the information that's required by the engine to plan it's query.

planning

  1. planning
    • turns a syntax tree into an intermediate representation (Plan IR)

optimization

  1. Optimization
    • is the process of applying a set of semantic preserving transformations to the plan to produce a more optimal plan that can be executed
      • semantic preserving transformations mean: it transforms a plan, and at every transformation step, it guarantees that the output of the transformed plan matches the output of the original plan
    • the end result is a set of fragments (the dotted line boxes) that represent an abstract topology of the query that will be executed in the cluster
    • e.g. the Scan operation and Filter operation gets turned into a Filtered Scan operation (which is faster)

scheduling and execution

  1. Scheduling and execution
    • coordinator identifies workers that can do the work
    • submits the fragments of work (logic)
    • submits the data that the worker needs to work on (data)
    • wires the fragment dependencies according to the linkage structure (N:M exchange, 1:M exchange in the optimization chart) to satisfy that topology.

explain the EXPLAIN

explain

cost based optimizer