Python is a dynamically typed language with strong typing. The type of an object is determined at runtime (dynamic typing), but operations between mismatched types are not allowed (strong typing); for example, a str and an int cannot be added together.
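A quick interactive session illustrates the strong-typing side: the interpreter refuses to mix str and int rather than silently coercing them (the exact message below is from recent CPython versions; older releases word it differently).

>>> 'a' + 1
TypeError: can only concatenate str (not "int") to str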
Dynamic typing makes code easy to write. However, as the saying goes, dynamic typing is fun for a while, and refactoring is painful; it also brings plenty of trouble. If a dynamic language adds static type annotations, the main benefits are more readable and self-documenting code, better IDE completion and navigation, and type errors caught before the code runs.
Currently, most mainstream languages (such as Java, Go, and Rust) are statically typed, and dynamic languages are also embracing static types: Python through type hints, and JavaScript through TypeScript.
This article introduces Python's support for static types, the current state of the community, an overview and comparison of type checking tools, and our practice of type parsing.
As early as 2006, the function annotation syntax was proposed (PEP 3107) and introduced in Python 3.0, together with a list of possible improvements:
# Before adding the type
def add(a, b):
    return a + b

# After adding the type
def add(a: int, b: int) -> int:
    return a + b
With continuous evolution, Python 3.5 introduced Type Hints and the typing module (PEP 484), and IDEs can perform type checking based on these annotations. By Python 3.7, static type support had become fairly complete.
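As a minimal sketch of what the typing module enables (the function and names here are illustrative, not from the article's project):

from typing import List, Optional

def find_first_even(numbers: List[int]) -> Optional[int]:
    # Return the first even number, or None if there is none
    for n in numbers:
        if n % 2 == 0:
            return n
    return None

result: Optional[int] = find_first_even([1, 3, 4])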
The following sections describe the type checking tools in detail, along with some basic concepts.
Both the Python core team and major vendors have released type checking tools for Python, with broadly similar feature sets:
mypy, the earliest of these tools, was created by Jukka Lehtosalo and has long been championed by Guido van Rossum, the father of Python. It is integrated with mainstream editors (such as PyCharm, Emacs, Sublime Text, and VS Code), has a solid user base, and its documentation and ecosystem are mature.
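As a quick illustration (a hypothetical file, not from the article), running mypy over annotated code flags mismatched arguments before the program is executed; the exact wording of the message varies by mypy version:

# demo_mypy.py
def add(a: int, b: int) -> int:
    return a + b

add(1, 2)      # fine
add('1', '2')  # mypy reports an incompatible argument type: "str" is not "int"

Checking the file with mypy demo_mypy.py reports the error without running the code.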
Google's pytype is capable of type checking and provides several useful tools:
annotate-ast
: Marks the AST tree during processing
merge-pyi
: Merges the generated pyi files into the original file. It can also hide the types and load them again during type checking.
pytd-tool
: Parses pyi files into pytype's customized PYTD files
pytype-single
: Parses a single Python file, given all of its dependent pyi files
pyxref
: Generates cross-references

Facebook's pyre-check has two special features:
Microsoft's pyright is the newest open-source tool; it claims to be much faster than mypy and, being written in TypeScript and running on Node, it does not depend on a Python environment.
Why pytype? mypy is relatively old, and many of its functions are not useful to us. We plan to use a Python LSP to process Python files and provide some syntax services. pyre-check is written in OCaml, so we chose pytype, which is implemented in Python, to build the features we need. In addition, pytype provides useful tools for parsing pyi files and for generating a pyi file from the types in a Python file.
The letter "i" in pyi refers to interface, which stores the type definitions in Python files in pyi files in the form of an interface to assist the type check.
In the commonly used PyCharm, go to External Libraries > Python 3.6 > Typeshed Stubs. There are many built-in pyi files that assist with type hints and navigation while coding.
The Typeshed Stubs mentioned above are essentially a pre-assembled collection of pyi files; PyCharm appears to maintain its own copy of this data. Many large open-source projects also provide stubs, such as PyTorch, and TensorFlow is considering it. Creating pyi files for large Python libraries takes a lot of work, and many of them call into C APIs, so we need to be patient.
We have read the source code of pytype and distilled it into code that matches our requirements. Some examples follow; the overall code is shown first:
import logging
import sys
import os
import importlab.environment
import importlab.fs
import importlab.graph
import importlab.output
from importlab import parsepy
from sempy import util
from sempy import environment_util
from pytype.pyi import parser
In the demo, Importlab is used to parse the project's dependencies and the corresponding pyi files:
def main():
    # Specify the project directory to parse
    ROOT = '/path/to/demo_project'
    # Specify the typeshed directory, which can be downloaded from https://github.com/python/typeshed
    TYPESHED_HOME = '/path/to/typeshed_home'
    util.setup_logging()
    # Load typeshed. If TYPESHED_HOME is not correctly configured, this returns None
    typeshed = environment_util.initialize_typeshed_or_return_none(TYPESHED_HOME)
    # Load the valid files from the target directory
    inputs = util.load_all_py_files(ROOT)
    # Create the environment used to build the import graph
    env = environment_util.create_importlab_environment(inputs, typeshed)
    # Build the import graph from the pyi files and the project files
    import_graph = importlab.graph.ImportGraph.create(env, inputs, trim=True)
    # Print the dependency tree
    logging.info('Source tree:\n%s', importlab.output.formatted_deps_list(import_graph))
    # Aliases of imported modules, e.g. import numpy as np -> {'np': 'numpy'}
    alias_map = {}
    # Mapping from module name to pyi file, e.g. import os -> {'os': '/path/to/os/__init__.pyi'}
    import_path_map = {}
    # A value in alias_map corresponds to a key in import_path_map, so a key of alias_map
    # can be used to find the file that actually implements the module.
    for file_name in inputs:
        # If a pyi file is found, the dependency is marked as resolved.
        # Built-in dependencies are skipped and not returned.
        # Custom dependencies are marked as unresolved for further parsing to locate the project file.
        (resolved, unresolved) = import_graph.get_file_deps(file_name)
        for item in resolved:
            item_name = item.replace('.pyi', '') \
                .replace('.py', '') \
                .replace('/__init__', '').split('/')[-1]
            import_path_map[item_name] = item
        for item in unresolved:
            file_path = os.path.join(ROOT, item.new_name + '.py')
            import_path_map[item.name] = file_path
        import_stmts = parsepy.get_imports(file_name, env.python_version)
        for import_stmt in import_stmts:
            alias_map[import_stmt.new_name] = import_stmt.name
    print('The import relationships obtained through importlab parsing are as follows\n\n')
    # For code query scenarios, alias_map associates the object currently in use with the imported module.
    print('\n\n#################################\n\n')
    print('For code query scenarios, alias_map can associate the currently used object with the imported module.')
    print('alias_map: ', alias_map)
    # For code completion scenarios, parse the current file and the referenced pyi files.
    # If the current file is an __init__ file, search all files in the directory.
    print('\n\n#################################\n\n')
    print('For code completion scenarios, parse the current file and the referenced pyi files. If the current file is an __init__ file, search all files in the directory.')
    print('import_path_map: ', import_path_map)
    print('\n\n\nBy using pytype, parse the AST of pyi files to analyze the return types of third-party dependencies and infer the types of the current variables.\n\n')
    # Use pytype to parse the dependent pyi files and obtain the return types of the called methods
    fname = '/path/to/parsed_file'
    with open(fname, 'r') as reader:
        sourcecode = reader.read()
    ret = parser.parse_string(sourcecode, filename=fname, python_version=3)
    constant_map = dict()
    function_map = dict()
    for key in import_path_map.keys():
        v = import_path_map[key]
        with open(v, 'r') as reader:
            src = reader.read()
        try:
            res = parser.parse_pyi(src, v, key, 3)
        except:
            continue
        # Aliases and classes in the pyi are ignored here; record constants and function signatures
        for constant in res.constants:
            constant_map[constant.name] = constant.type.name
        for function in res.functions:
            signatures = function.signatures
            sig_list = []
            for signature in signatures:
                sig_list.append((signature.params, signature.return_type))
            function_map[function.name] = sig_list
    var_type_from_pyi_list = []
    for alias in ret.aliases:
        variable_name = alias.name
        if alias.type is not None:
            typename_in_source = alias.type.name
            typename = typename_in_source
            # The import may use an alias; convert it to the real module name
            if '.' not in typename:
                # A plain alias rather than the return value of a function call: ignore it
                continue
            if typename.split('.')[0] in alias_map:
                real_module_name = alias_map[typename.split('.')[0]]
                typename = real_module_name + typename[typename.index('.'):]
            if typename in function_map:
                possible_return_types = [item[1].name for item in function_map[typename]]
                var_type_from_pyi_list.append((variable_name, possible_return_types))
            if typename in constant_map:
                possible_return_type = constant_map[typename]
                var_type_from_pyi_list.append((variable_name, possible_return_type))
    print('\n\n#################################\n\n')
    print('These are all the return value types analyzed from the pyi files.')
    for item in var_type_from_pyi_list:
        print('Variable name:', item[0], 'Return type:', item[1])


if __name__ == '__main__':
    sys.exit(main())
The file being parsed in the demo is as follows:
# demo.py
import os as abcdefg
import re
from demo import utils
from demo import refs
cwd = abcdefg.getcwd()
support_version = abcdefg.supports_bytes_environ
pattern = re.compile(r'.*')
add_res = utils.add(1, 3)
mul_res = refs.multi(3, 5)
c = abs(1)
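The article does not show demo/utils.py or demo/refs/__init__.py; a minimal sketch that would satisfy the two calls above (hypothetical contents, assumed only for illustration) could look like this:

# demo/utils.py (hypothetical)
def add(a: int, b: int) -> int:
    return a + b

# demo/refs/__init__.py (hypothetical)
def multi(a: int, b: int) -> int:
    return a * b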
pytype builds on Importlab, another open-source project from Google. The files in the typeshed directory are placed into the environment so that the dependencies between files can be analyzed; Importlab then generates a dependency graph.
env = environment_util.create_importlab_environment(inputs, typeshed)
import_graph = importlab.graph.ImportGraph.create(env, inputs, trim=True)

# If a pyi file is found, the dependency is marked as resolved.
# Built-in dependencies are skipped and not returned.
# Custom dependencies are marked as unresolved for further parsing to locate the project file.
(resolved, unresolved) = import_graph.get_file_deps(file_name)
Through the import analysis, we obtain the real module behind each reference (including aliased imports), which is what later allows the return values of method calls to be resolved:
{'ast': 'ast', 'astpretty': 'astpretty', 'abcdefg': 'os', 're': 're', 'utils': 'demo.utils', 'refs': 'demo.refs', 'JsonRpcStreamReader': 'pyls_jsonrpc.streams.JsonRpcStreamReader'}
Through the dependency graph, we know the location of each directly referenced dependency:
import_path_map: {'ast': '/Users/zhangxindong/Desktop/search/code/sempy/sempy/typeshed/stdlib/ast.pyi', 'astpretty': '/Users/zhangxindong/Desktop/search/code/sempy/venv/lib/python3.9/site-packages/astpretty.py', 'os': '/Users/zhangxindong/Desktop/search/code/sempy/sempy/typeshed/stdlib/os/__init__.pyi', 're': '/Users/zhangxindong/Desktop/search/code/sempy/sempy/typeshed/stdlib/re.pyi', 'utils': '/Users/zhangxindong/Desktop/search/code/sempy/sempy/demo/utils.py', 'refs': '/Users/zhangxindong/Desktop/search/code/sempy/sempy/demo/refs/__init__.py', 'streams': '/Users/zhangxindong/Desktop/search/code/sempy/venv/lib/python3.9/site-packages/pyls_jsonrpc/streams.py'}
Next, we parse the corresponding files. The requirement is to obtain the return types of certain methods. For pyi files, pytype can parse them for us; we then match the results against the call relationships.
print('\n\n\nBy using pytype, parse the AST of pyi files to analyze the return types of third-party dependencies and infer the types of the current variables.\n\n')
# Use pytype to parse the dependent pyi files and obtain the return types of the called methods
fname = '/path/to/parsed_file'
with open(fname, 'r') as reader:
    sourcecode = reader.read()
ret = parser.parse_string(sourcecode, filename=fname, python_version=3)
constant_map = dict()
function_map = dict()
for key in import_path_map.keys():
    v = import_path_map[key]
    with open(v, 'r') as reader:
        src = reader.read()
    try:
        res = parser.parse_pyi(src, v, key, 3)
    except:
        continue
    # Aliases and classes in the pyi are ignored here; record constants and function signatures
    for constant in res.constants:
        constant_map[constant.name] = constant.type.name
    for function in res.functions:
        signatures = function.signatures
        sig_list = []
        for signature in signatures:
            sig_list.append((signature.params, signature.return_type))
        function_map[function.name] = sig_list
var_type_from_pyi_list = []
for alias in ret.aliases:
    variable_name = alias.name
    if alias.type is not None:
        typename_in_source = alias.type.name
        typename = typename_in_source
        # The import may use an alias; convert it to the real module name
        if '.' not in typename:
            # A plain alias rather than the return value of a function call: ignore it
            continue
        if typename.split('.')[0] in alias_map:
            real_module_name = alias_map[typename.split('.')[0]]
            typename = real_module_name + typename[typename.index('.'):]
        if typename in function_map:
            possible_return_types = [item[1].name for item in function_map[typename]]
            # print('The possible return type of', typename_in_source, 'is', possible_return_types)
            var_type_from_pyi_list.append((variable_name, possible_return_types))
        if typename in constant_map:
            possible_return_type = constant_map[typename]
            var_type_from_pyi_list.append((variable_name, possible_return_type))
For example:
pattern = re.compile(r'.*')
In the /Users/zhangxindong/Desktop/search/code/sempy/sempy/typeshed/stdlib/re.pyi
file, we load two signatures. Both are for re.compile, with different input parameters, and both return the Pattern type.
From this, we know that the type of the pattern variable is re.Pattern.
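In simplified form (the exact signatures depend on the typeshed version), the two overloads in re.pyi look roughly like this, which is why both resolve to Pattern:

@overload
def compile(pattern: AnyStr, flags: int = ...) -> Pattern[AnyStr]: ...
@overload
def compile(pattern: Pattern[AnyStr], flags: int = ...) -> Pattern[AnyStr]: ...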
Some of these Python syntax analysis capabilities have been applied to code document search and recommendation and to smart code completion in Alibaba Cloud Dev Studio.
If developers do not know how to use an API (for example, how to call it or what parameters it takes), they can hover the pointer over the API to view the summary provided by the smart coding plug-in. Clicking "API documentation" opens the detailed information (such as the official API documentation and code samples) in the right-hand bar, and developers can also search for the required API documentation directly. Document search and recommendation are supported for JavaScript and Python.
During document collection, we record the API name and the class it belongs to. In the code, syntax analysis tells us the class of the object a method is called on, and that information is used for the document search.
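A minimal sketch of that idea, using only the standard ast module (the sample source and names are hypothetical, not the plug-in's actual implementation):

import ast

source = "import os as abcdefg\ncwd = abcdefg.getcwd()"
alias_map = {'abcdefg': 'os'}  # built earlier from the import statements

tree = ast.parse(source)
for node in ast.walk(tree):
    # For a call like abcdefg.getcwd(), recover the real module and method name
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        base = node.func.value
        if isinstance(base, ast.Name) and base.id in alias_map:
            api_name = alias_map[base.id] + '.' + node.func.attr
            print(api_name)  # -> os.getcwd, used as the key for the document search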
When writing code, the smart coding plug-in automatically perceives the code context and provides developers with precise code completion candidates. The candidates marked with ✨ are results from smart code completion. Currently, this feature is available for Java, JavaScript, and Python.
Through syntax analysis during code completion, the class of a user variable can be determined more accurately, which helps filter out unreasonable candidates recommended by the deep learning models, and additional reasonable candidates can be recalled from the class's set of methods.
The concepts and tools around Python static types are fairly mature. However, because of the heavy legacy burden and the limited driving force in the community, adoption is still limited. In addition, the Python core team, major vendors, and individual IDEs each have their own implementation and analysis approaches, with no unified standard or format. You can choose a suitable parsing approach based on the advantages and disadvantages above and on your tool set and data set. We look forward to more support for static types from the Python community.