
Building a Modern Deep Learning Framework from the Ground Up (TinyDL-0.01)

This article explains how to build a minimalist yet comprehensive modern deep learning framework, akin to an AI operating system, from the ground up.

By Shanze

Learn to Embrace AI

AI is a major trend and a long-term game. At this dual inflection point, the dilemma for Java developers is how to transform and adapt to these changes.

Here are some of my thoughts:

In recent years, the market and technological environment have been experiencing dual inflection points: on one hand, the consumer internet has fully entered an era of stock competition, where incremental growth no longer exists and cost reduction and efficiency improvement have become the main themes. On the other hand, AI technology has recently achieved significant breakthroughs, surpassing human performance in many areas in terms of efficiency and cost. This new era forces Java developers like me to transform and embrace new technologies.

There are two new directions in technology: one is represented by 3D technologies related to the metaverse, such as blockchain and XR; the other is the shift toward AI, particularly large models. Recently, I started learning about deep learning. As a Java programmer trying to understand CNN, DNN, RNN, and other network models, I'm currently brushing up on linear algebra, calculus, and PyTorch.

The great physicist Richard Feynman said, "What I cannot create, I do not understand." I think he may be right, so I decided to give it a try.

Just for Fun: A Salute to Linux-0.01

On September 17, 1991, 21-year-old Finnish student Linus Torvalds released the open-source operating system Linux-0.01 on the Internet, with concise code and clear directories. Following in his footsteps, I deliberately chose 0.01 as the version number of my project, TinyDL-0.01 (Tiny Deep Learning), as a tribute to Linux-0.01. I'm sharing it with Java developers who are interested in AI, to help them understand the principles and simple implementations of deep learning from a low-level engineering perspective. Anyone interested is welcome to join and learn together.

What Makes TinyDL-0.01 Different from Others?

There are numerous deep learning frameworks, most of which are written in Python (the bottom layer is C/C++), such as TensorFlow, PyTorch, and MXNet. There are two well-known Java-based deep learning frameworks: DeepLearning4J, maintained by the Eclipse open-source community, and DJL, an open-source deep learning Java framework from AWS.

DeepLearning4J is a full-stack implementation, but its complex and extensive technology stack (65% Java at roughly 697,000 lines, 24% C++, 3.4% CUDA, and others) and heavy reliance on complex scientific computing libraries make it challenging to learn from the code.

DJL is just a high-level interface for deep learning in Java, without any actual implementation. It ultimately runs on top of deep learning engines like TensorFlow or PyTorch.

TinyDL-0.01, as its name suggests, is a minimalist and lightweight deep-learning framework implemented in Java. Compared with other implementations:

1) It is extremely simple, with virtually no third-party dependencies.

2) It is full-stack, covering everything from the low-level tensor operations to high-level application examples.

3) It is modular and easy to extend, with each layer including core concepts and principles, and clear boundaries between layers. Of course, the limitations are evident: it has limited features and poor performance.

While TinyDL-0.01 is just a tiny framework, it aims to provide the basic characteristics of modern deep learning frameworks, such as dynamic computation graphs, automatic differentiation, multiple optimizers, and various network layer implementations. With a focus on simplicity and clarity, it is ideal for beginners learning about deep learning. If you try to read the source code of frameworks like PyTorch to understand deep learning in depth, you may well be deterred right away.

1. Overall Architecture

Deep learning frameworks primarily address engineering challenges in deep network training and inference, including the complexity of multi-layer neural networks, the computational efficiency of large matrix operations and parallel computing, and scalability across multiple computing devices.

Commonly used deep learning frameworks include TensorFlow (open-sourced by Google), PyTorch (open-sourced by Facebook), and MXNet (open-sourced by Amazon). After years of development, their architectures and functionalities have gradually converged. Let us look at the general capabilities of a modern deep learning framework.

1.1 What Does a General Deep Learning Architecture Look Like?

Specifically, we can refer to the most popular deep learning framework, PyTorch, which is generally divided into four layers (source: Zhihu):

[Figure: the four-layer architecture of PyTorch]

1.2 What Does the Overall Architecture of TinyDL Look Like?

TinyDL adheres to the principle of simplicity and clear layering, and refers to the general layering logic. The overall structure is as follows:

[Figure: the overall layered architecture of TinyDL]

Maintaining a strict layering logic from bottom to top:

  1. ndarr package: The core class NdArray, a simple implementation of underlying linear algebra, currently only implements the CPU version. A GPU version would require a substantial third-party library dependency.
  2. func package: The core classes Function and Variable represent abstract mathematical functions and variables. They are used to automatically construct a computation graph during forward propagation and implement automatic differentiation. The Variable corresponds to the tensor in PyTorch.
  3. nnet package: The core classes Layer and Block represent neural network layers and blocks. Any complex deep network is built through the stacking of these Layers and Blocks. It implements common CNN layers, RNN layers, norm layers, and encoder-decoder seq2seq architectures.
  4. mlearning package: The representation of general components of machine learning. Deep learning is a branch of machine learning, and there is a set of general components that correspond to broader machine learning tasks. These components include datasets, loss functions, optimization algorithms, trainers, predictors, and evaluators.
  5. modality package: Falls under the application layer. Deep learning is currently applied mainly in three domains: computer vision, natural language processing, and reinforcement learning. None of these domains is implemented yet, but the hope is that a GPT-2 prototype will land in version 0.02.
  6. example package: Includes some simple working examples, primarily including classification and regression in machine learning. Examples include curve fitting, spiral curve classification, handwritten digit recognition, and sequence data prediction.

Next, we will go through each layer from bottom to top, providing a full-stack overview of the core concepts and simple implementations involved.

2. Linear Algebra and Tensor Operations

Let us move on to the first layer of deep learning frameworks: the tensor operation layer. Operations on tensors (multidimensional arrays) form the foundation of deep learning, with almost all computation based on tensor arithmetic (a single number is a scalar, a one-dimensional array is a vector, a two-dimensional array is a matrix, and arrays of three or more dimensions are N-dimensional tensors). These operations are typically implemented by efficient numerical computing libraries, often written in C/C++, which run on specific hardware and provide tensor operations such as matrix multiplication, convolution, and pooling.

This section is divided into three parts: some basics of linear algebra, a minimal CPU-based implementation, and finally a look at why deep learning relies so heavily on a new computing paradigm, the GPU.

2.1 Basic Linear Algebra

First, let’s take a look at a graph that might make you feel a bit overwhelmed—these were just basic exercises when I was preparing for my graduate school entrance exam:

[Figure: sample linear algebra exercises]

Then there are some common concepts in linear algebra:

  1. Vector: A vector is a quantity that has both magnitude and direction. In linear algebra, vectors are usually represented by a column of numbers.
  2. Matrix: A matrix is a two-dimensional array made up of rows and columns. It can be used to represent systems of linear equations or linear transformations.
  3. Vector Space: A vector space is a set of vectors that satisfies specific properties, such as closure under addition and scalar multiplication.
  4. Linear Transformation: A linear transformation is an operation that maps one vector space to another. It preserves linear combinations and collinearity.
  5. Linear Equation System: A linear equation system is a collection of linear equations where each equation is of degree one and exhibits a linear relationship.
  6. Eigenvalue and Eigenvector: In a matrix, eigenvalues are scalars and eigenvectors are non-zero vectors that satisfy the condition that the product of the matrix and the vector equals the eigenvalue times the vector.
  7. Inner Product and Outer Product: The inner product is an operation between vectors that measures their angle and length. The outer product is an operation between vectors that generates a new vector perpendicular to the original vectors.
  8. Determinants: A determinant is a scalar value composed of elements from a square matrix according to a specific rule. It is used to calculate the inverse of a matrix and determine its singularity.

Feeling a little stressed? But if you look at the simple CPU implementation, you'll find it surprisingly straightforward (it currently supports only scalars, vectors, and matrices, not higher-dimensional tensors).

2.2 Simple Implementation for the CPU Version

/**
 * Supports: 1. Scalars; 2. Vectors; 3. Matrices
 * <p>
 * Does not support: higher-dimensional tensors, which could be implemented by inheriting from this simple version.
 */
public class NdArray {
    protected Shape shape;
    /**
     * Actual data storage, using float32
     */
    private float[][] matrix;
}

/**
 * Represents the shape of a matrix or a vector
 */
public class Shape {
    /**
     * Number of rows
     */
    public int row = 1;
    /**
     * Number of columns
     */
    public int column = 1;
    public int[] dimension;
}

In fact, these are all operations on two-dimensional arrays, roughly divided into three categories: first, simple initialization functions; second, basic arithmetic such as addition, subtraction, multiplication, and division; and finally, operations that change the shape of the matrix, with the inner product (matrix multiplication) being the most involved.
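To make that loop nesting concrete, below is a minimal sketch of what the inner product (matrix multiplication) could look like on top of the float[][] storage shown above; the method name and shape check are illustrative rather than TinyDL's exact API.

public static float[][] matMul(float[][] a, float[][] b) {
    int rows = a.length;      // rows of the left matrix
    int inner = a[0].length;  // columns of the left matrix, must equal rows of the right
    int cols = b[0].length;   // columns of the right matrix
    if (b.length != inner) {
        throw new IllegalArgumentException("shape mismatch");
    }
    float[][] result = new float[rows][cols];
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            float sum = 0f;
            for (int k = 0; k < inner; k++) {
                sum += a[i][k] * b[k][j]; // triple-nested loop: O(n^3) on a single CPU core
            }
            result[i][j] = sum;
        }
    }
    return result;
}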

2.3 Why Do We Need GPU?

Even a simple inner product implemented on the CPU requires multiple levels of nested loops, which makes it clear that the CPU architecture, designed for logical control flow, is not well suited to parallel computation. The rows and columns of a matrix operation can be processed in parallel, so the matrix operations that deep learning relies on are extremely inefficient on a CPU. For a more intuitive comparison, refer to the following figure: compared with a CPU, a GPU has weaker control logic units (blue) but a large number of ALUs (arithmetic logic units, green).

[Figure: CPU vs. GPU architecture, control units vs. ALUs]

Most deep learning frameworks, such as TensorFlow and PyTorch, provide GPU support, allowing easy use of GPUs for parallel computing. With the recent surge in popularity of ChatGPT, GPUs have become essential AI infrastructure and are rapidly turning into the new mainstream computing paradigm.

3. Computation Graphs and Automatic Differentiation

Now we move on to the second layer of deep learning frameworks: the func layer, which primarily implements two very important features: computation graphs and automatic differentiation.

1) Computation graphs are graphical representations that describe the flow of data and dependencies between operations during computation. In deep learning, both the forward and backward propagation in neural networks can be depicted using computation graphs.

2) Automatic differentiation is a technique for computing the derivative or gradient of a function. In deep learning, the backward propagation algorithm is a form of automatic differentiation, employed to compute the gradient of the loss function with respect to each parameter.

Through the combination of computation graphs and automatic differentiation, complex neural networks can efficiently compute the gradients of a large number of parameters, enabling model training and optimization.

3.1 Numerical Differentiation and Analytical Differentiation

3.1.1 Numerical Differentiation

The derivative is the slope of a function at a particular point, that is, the limit of the ratio of the y-increment (Δy) to the x-increment (Δx) as Δx → 0. The differential refers to the increment in the vertical coordinate (dy) corresponding to an increment Δx in the horizontal coordinate.

[Figure: the derivative as the slope of a curve at a point]

Numerical differentiation estimates the derivative of a function by numerical approximation: it approximates the derivative at a point by computing finite differences near that point. The central difference formula is commonly used; it estimates the derivative at a point from the function values on either side of it: f'(x) ≈ (f(x + h) - f(x - h)) / (2h), where h is the step size, and the smaller the step, the more accurate the result. However, since numerical differentiation is an approximation, there is some error between the result and the true derivative.
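As a quick stand-alone illustration (not part of TinyDL), the central difference takes only a few lines of Java; here it is checked against f(x) = x^2 at x = 2, where the analytical derivative is 4.

import java.util.function.DoubleUnaryOperator;

public class NumericalDiff {
    // Central difference: f'(x) ≈ (f(x + h) - f(x - h)) / (2h)
    static double diff(DoubleUnaryOperator f, double x, double h) {
        return (f.applyAsDouble(x + h) - f.applyAsDouble(x - h)) / (2 * h);
    }

    public static void main(String[] args) {
        System.out.println(diff(x -> x * x, 2.0, 1e-4)); // prints roughly 4.0
    }
}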

3.1.2 Analytical Differentiation

Analytical differentiation is the calculus approach to computing the exact derivative of a function at a given point, by applying the definition of the derivative and the basic rules of differentiation. Given the expression of a function, for example f(x) = x^2 + 2x + 1, the rules yield its derivative directly: f'(x) = 2x + 2. Here are some analytical derivatives of common functions:

[Figure: analytical derivatives of common functions]

3.2 Computation Graphs

A computation graph is a directed graph whose nodes correspond to mathematical operations. Computation graphs provide a means to express and evaluate mathematical expressions. For example, consider the equation g = (x + y) × z; we can draw the computation graph for this equation.

[Figure: computation graph for g = (x + y) × z]

The computation graph has an addition node (marked "+") and a multiplication node, with three input variables x, y, and z and one output g.

Now look at a more complex function:

[Figure: a more complex function f(x, y)]

This function f(x, y) is computed as follows:

[Figure: computation graph for f(x, y)]

The advantages of computation graphs are their ability to clearly represent complex function computations and facilitate backward propagation and parameter updates. In deep learning, computation graphs are typically used to construct neural network models, where each node represents a layer or operation in the neural network, and edges represent the data flow between layers. By constructing a computation graph, complex function computations can be decomposed into a series of simple operations. The backward propagation algorithm can then be used to compute the gradient at each node, enabling the optimization and training of model parameters.
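To see this decomposition on the earlier example g = (x + y) × z: the addition node passes the incoming gradient through unchanged, while the multiplication node scales it by the other factor. Written out by hand (independent of TinyDL's classes):

// Forward pass, decomposed into two simple nodes
double x = 1, y = 2, z = 3;
double s = x + y;  // addition node: s = 3
double g = s * z;  // multiplication node: g = 9

// Backward pass, applying the chain rule from g back to the inputs
double dg = 1.0;      // dg/dg = 1
double ds = dg * z;   // multiplication node: dg/ds = z = 3
double dz = dg * s;   // and dg/dz = s = x + y = 3
double dx = ds;       // addition node passes gradients through: dg/dx = 3
double dy = ds;       // dg/dy = 3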

3.3 Backward Propagation

Let us use a vivid example found online to illustrate backward propagation. Suppose you want to buy fruit. In our everyday thinking, this is a simple process: calculate the price and pay. However, this process can be abstracted into a computation graph, involving several steps. As shown in the figure below, you need to first calculate the price of apples multiplied by the quantity and then by the consumption tax, and the final amount is the actual payment.

[Figure: forward propagation in the fruit-buying example]

The process described above is called forward propagation. Backward propagation, as the name suggests, works in the opposite direction, though the computation is also slightly different. Suppose, in the fruit-buying example, we want to know the derivative of the final price with respect to the apple price. How would we calculate it?

Forward propagation computes the final result of the model, so what is backward propagation for? It is used to obtain the coefficient of influence that each parameter in the model has on the final result, which allows us to adjust the parameters based on these coefficients to improve model performance. Clearly, we need to work backward, layer by layer, multiplying the coefficients along the way. These coefficients are precisely the derivatives at each layer.

[Figure: backward propagation in the fruit-buying example]

Backward propagation is represented by arrows pointing in the opposite direction (indicated by bold lines). It passes "local derivatives", with the derivative values written below the arrows. In this example, backward propagation passes the derivative values from right to left (1 → 1.1 → 2.2). From this result, we can see that the value of the "derivative of the payment amount with respect to the apple price" is 2.2. This means that if the price of apples increases by 1 yen, the final payment increases by 2.2 yen (strictly speaking, if the apple price increases by some small amount, the final payment increases by 2.2 times that amount).
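Putting numbers on the figure (the classic version of this example assumes an apple price of 100 yen, 2 apples, and a 10% consumption tax, consistent with the 1 → 1.1 → 2.2 chain above):

double price = 100, quantity = 2, taxRate = 1.1;

// Forward propagation: 100 * 2 * 1.1 = 220 yen
double applesTotal = price * quantity;
double payment = applesTotal * taxRate;

// Backward propagation, right to left: 1 -> 1.1 -> 2.2
double dPayment = 1.0;                     // derivative of the payment with respect to itself
double dApplesTotal = dPayment * taxRate;  // 1.1
double dPrice = dApplesTotal * quantity;   // 2.2: +1 yen on the price adds 2.2 yen to the payment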

3.3.1 The Chain Rule

The chain rule is an important theorem in calculus for differentiating composite functions. Partial derivatives are the derivatives of a multivariate function with respect to one of its variables, and the chain rule also applies to them. Suppose there are two functions y = f(u) and u = g(x), so that y is a function of x. According to the chain rule, the derivative of y with respect to x is the derivative of f with respect to u multiplied by the derivative of g with respect to x: dy/dx = (dy/du) × (du/dx). Neural networks, in essence, are complex composite functions, and finding their gradients is exactly an application of the chain rule.
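For a concrete instance: let u = g(x) = 3x + 1 and y = f(u) = u^2. Then dy/du = 2u and du/dx = 3, so dy/dx = 2u × 3 = 6(3x + 1). Backward propagation performs exactly this multiplication of local derivatives, repeated layer by layer across the whole network.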

[Figure: the chain rule for composite functions]

It is a challenging topic, often considered one of the most difficult parts of college calculus, and one that many students fail in their first year. Interested readers can refer to Mathematics for Deep Learning for more details.

3.4 The Design and Implementation of the Func Layer

The core classes in the func layer are Function and Variable, corresponding to the abstract mathematical form y = f(x), where x and y are Variable objects and f() is a Function. Each concrete Function implementation needs to implement two methods: forward and backward.

/**
 * Forward propagation of the function
 *
 * @param inputs input values of the function
 * @return output value of the function
 */
public abstract NdArray forward(NdArray... inputs);

/**
 * Backward propagation of the function, i.e., its derivative
 *
 * @param yGrad gradient flowing back from the output
 * @return gradients with respect to each input
 */
public abstract List<NdArray> backward(NdArray yGrad);

For example, a Sigmoid function is implemented as follows:

public class Sigmoid extends Function {
    @Override
    public NdArray forward(NdArray... inputs) {
        return inputs[0].sigmoid();
    }

    @Override
    public List<NdArray> backward(NdArray yGrad) {
        // The derivative of sigmoid with respect to its input is y * (1 - y);
        // multiply by the upstream gradient yGrad (chain rule)
        NdArray y = getOutput().getValue();
        return Collections.singletonList(yGrad.mul(y).mul(NdArray.ones(y.getShape()).sub(y)));
    }

    @Override
    public int requireInputNum() {
        return 1;
    }
}

The Variable class is implemented as follows; it represents the abstraction of a variable in mathematics, and backward is the entry point of backward propagation.

/**
 * Abstraction of variables in mathematics
 */
public class Variable {
    private String name;

    private NdArray value;

    /**
     * Gradient
     */
    private NdArray grad;

    /**
     * The Function that produced the current Variable
     */
    private Function creator;

    private boolean requireGrad = true;

    /**
     * Entry point of backward propagation (body omitted here; see the sketch below)
     */
    public void backward() {}
}
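Conceptually, backward() seeds the output gradient with ones and then walks the computation graph in reverse through each Variable's creator. The following is a simplified sketch of that traversal; the helper accessors (getInputs, getOutput, getGrad, setGrad, getCreator) are assumed for illustration, and a real implementation would also accumulate gradients instead of overwriting them:

// Simplified sketch; requires java.util.Deque, java.util.ArrayDeque, java.util.List
public void backward() {
    if (grad == null) {
        // Seed: the derivative of the output with respect to itself is 1
        grad = NdArray.ones(value.getShape());
    }
    Deque<Function> stack = new ArrayDeque<>();
    if (creator != null) {
        stack.push(creator);
    }
    while (!stack.isEmpty()) {
        Function func = stack.pop();
        // Ask the function for its input gradients, given its output gradient
        List<NdArray> inputGrads = func.backward(func.getOutput().getGrad());
        List<Variable> inputs = func.getInputs();
        for (int i = 0; i < inputs.size(); i++) {
            Variable input = inputs.get(i);
            input.setGrad(inputGrads.get(i)); // a real implementation accumulates here
            if (input.getCreator() != null) {
                stack.push(input.getCreator()); // keep walking toward the graph's inputs
            }
        }
    }
}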

The implementation of the entire func layer is as follows:

[Figure: class diagram of the func layer]

4. Neural Networks and Deep Learning

Neural networks are networks composed of multiple neurons, designed to process information and learn by simulating biological neural systems. The design of neural networks is inspired by the interaction between neurons in the human brain. In a neural network, each neuron receives inputs from other neurons, performs a weighted sum according to the input weights, and then passes the result through an activation function to produce an output. The learning process in neural networks typically involves adjusting the connection weights between neurons so that the network can make predictions and classifications based on input data. The following figure illustrates both the biological and mathematical models of a neuron.

[Figure: biological and mathematical models of a neuron]

Neural networks are hierarchical structures composed of multiple nodes, where each node processes input data through a weighted sum and a nonlinear activation function and then passes the results to the next layer of nodes. Deep learning builds upon neural networks by using deep neural networks with multiple hidden layers. Neural networks serve as the foundational model for deep learning, while deep learning introduces multiple-layer network structures that can automatically learn more abstract and high-level feature representations.

4.1 Error Backward Propagation Algorithm

In 2006, Hinton and others proposed a deep neural network model based on unsupervised learning. The training of this model addressed the problem of vanishing gradients through a novel layer-wise training approach, breaking the previous difficulty of training deep networks and laying the foundation for the development of deep learning.

However, modern deep learning still relies on the error backward propagation algorithm for training, primarily due to improvements such as the introduction of new activation functions, regularization, parameter initialization techniques, and efficient gradient descent across the entire network.

Backward propagation is the primary method for training neural networks. Based on gradient descent, it calculates the gradient of the loss function to the network parameters and adjusts these parameters in the opposite direction of the gradient to bring the network output closer to the true values.

The backward propagation algorithm utilizes the chain rule to propagate the error between the network output and the true values back through the network to the input layer, calculating the gradient for the parameters of each layer. The process is as follows:

  1. Forward Propagation: Feed the input sample through the forward computation of the neural network to obtain the output value.
  2. Loss Function Calculation: Calculate the difference between the output value and the true value of the network as the input to the loss function, quantifying the error between the predicted values and the true values.
  3. Backward Propagation: Based on the value of the loss function, compute the gradient of the loss function with respect to each layer's parameters, layer by layer. Using the chain rule, multiply the gradient propagated from the layer above (the one closer to the output) by the derivative of the current layer's activation function with respect to its input to obtain the current layer's gradient, and pass it on to the preceding layer, continuing until the input layer is reached.
  4. Network Parameter Update: Use the gradient descent algorithm, update the value of each parameter in the direction opposite to the gradient, thereby reducing the loss function.
  5. Repeat the above steps until the stopping condition for training is met or the maximum number of iterations is reached.

The core idea of the backward propagation algorithm is to compute the gradient for the parameters of each layer and update them layer by layer, allowing the network to approach the true values.

[Figure: layer-by-layer gradient computation in backward propagation]
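In code, these five steps map onto a training loop of the following shape; Model, Loss, Optimizer, DataSet, and Batch here are placeholders for TinyDL's actual components rather than exact signatures:

// Pseudocode-style sketch of the five steps; identifiers are illustrative.
void train(Model model, Loss lossFunc, Optimizer optimizer, DataSet dataSet, int maxEpoch) {
    for (int epoch = 0; epoch < maxEpoch; epoch++) {                // 5. repeat until done
        for (Batch batch : dataSet.getBatches()) {
            Variable predict = model.forward(batch.getX());         // 1. forward propagation
            Variable loss = lossFunc.loss(batch.getY(), predict);   // 2. loss calculation
            model.clearGrads();
            loss.backward();                                        // 3. backward propagation
            optimizer.update();                                     // 4. parameter update, opposite to the gradient
        }
    }
}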

One common challenge faced by the backward propagation algorithm in deep neural networks is the vanishing gradient problem. Many methods have been proposed to solve it:

1) Activation Function Selection: Use non-linear activation functions such as ReLU (Rectified Linear Unit) or Leaky ReLU, which can help alleviate the vanishing gradient problem.

2) Weight Initialization: Proper weight initialization can help avoid vanishing gradients. For example, initializing weights with a small variance can maintain a reasonable gradient size.

3) Batch Normalization: A technique that normalizes the data in each mini-batch, helping to stabilize the gradients and accelerate network training.

4) Residual Connections: A technique that uses skip connections, allowing activation and gradients to flow directly through the network.

5) Gradient Clipping: A technique that limits the size of the gradients by setting a threshold, preventing gradient explosion and mitigating the vanishing gradient problem to some extent (see the sketch below).

These methods can be used individually or in combination to address the vanishing gradient problem, and they have contributed to the resurgence of deep learning.
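Of these techniques, gradient clipping is the simplest to show in code; here is a minimal sketch of clipping by global L2 norm, independent of TinyDL's classes:

// Clip a gradient vector so its L2 norm does not exceed a threshold.
static void clipByNorm(float[] grad, float maxNorm) {
    double sumSq = 0;
    for (float g : grad) {
        sumSq += g * g;
    }
    double norm = Math.sqrt(sumSq);
    if (norm > maxNorm) {
        float scale = (float) (maxNorm / norm); // shrink all components proportionally
        for (int i = 0; i < grad.length; i++) {
            grad[i] *= scale;
        }
    }
}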

4.2 The Stacking of Layers and Blocks

To implement complex networks, the concept of neural network blocks is typically introduced. A block can describe a single layer, a component made up of multiple layers, or the entire model itself. One benefit of using blocks for abstraction is that they can be combined into larger components, a process that is often recursive. By defining code to generate blocks of arbitrary complexity on demand, we can implement complex neural networks with concise code.

[Figure: layers composed into blocks of increasing complexity]

public interface LayerAble {

    String getName();

    Shape getXInputShape();

    Shape getYOutputShape();

    void init();

    Variable forward(Variable... inputs);

    Map<String, Parameter> getParams();

    void addParam(String paramName, Parameter value);

    Parameter getParamBy(String paramName);

    void clearGrads();

}

/**
 * Represents a block of a larger neural network, composed of layers.
 */
public abstract class Block implements LayerAble

/**
 * Represents a specific layer within a neural network
 */
public abstract class Layer extends Function implements LayerAble

The figure below shows the overall class structure of the nnet layer, centered on the implementation of Layer and Block. Block acts as a container for Layers, and the directory and file layout revolves around their implementations. Each Layer or Block implementation corresponds to a well-known academic paper, with deep mathematical derivations supporting the effectiveness of adding that type of layer to a network:

[Figure: class diagram of the nnet layer]
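To give a flavor of how stacking looks in code, a block can register sub-layers and chain their forward calls. The SequentialBlock below is a hypothetical sketch built on the LayerAble interface above, omitting the bookkeeping methods (init, parameter management, and so on):

// Hypothetical composition sketch; requires java.util.List and java.util.ArrayList
public class SequentialBlock extends Block {
    private final List<LayerAble> children = new ArrayList<>();

    public SequentialBlock add(LayerAble layer) {
        children.add(layer);
        return this; // fluent style, so blocks can be built by chaining add() calls
    }

    @Override
    public Variable forward(Variable... inputs) {
        Variable x = inputs[0];
        for (LayerAble child : children) {
            x = child.forward(x); // the output of one layer feeds the next
        }
        return x;
    }
}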

5. Machine Learning and Models

Let us first clarify the relationship between machine learning and deep learning: deep learning is simply the branch of neural networks evolved to greater depth, and neural networks in turn are just one branch of machine learning, one particular family of models.

[Figure: the relationship between machine learning, neural networks, and deep learning]

For the broader field of machine learning, there is a set of general components, including datasets, loss functions, optimization algorithms, trainers, predictors, and evaluators. In TinyDL, the general components of machine learning are not tightly coupled with deep learning but are implemented as a separate layer, which facilitates the future expansion of more non-neural network models, such as random forests and support vector machines.

The following figure shows the overall implementation of mlearning:

[Figure: overall implementation of the mlearning layer]

5.1 Datasets

The DataSet component is designed to load data and preprocess it into a format suitable for model learning. Currently, some simple data sources are implemented under the assumption that the data fits entirely in memory (ArrayDataset), such as SpiralDataSet and MnistDataSet.

[Figure: DataSet class hierarchy]

5.2 Loss Functions

A loss function measures the difference between the predicted values and the true values of the model, or in other words, the prediction error of the model. It serves as the objective function for model optimization, and by minimizing the loss function, the predictions of the model become closer to the true values.

The choice of loss function depends on the type of problem being addressed, such as classification, regression, or other tasks. Common loss functions are:

1) Mean Square Error (MSE): Used for regression problems, it calculates the mean of the squared differences between the predicted and true values.

2) Cross Entropy: Used for classification problems, it compares the probability distribution of predicted classes with the distribution of true classes.

3) Log Loss: Also used for classification problems; based on the log-likelihood principle, it measures the difference between predicted probabilities and true labels for binary classification models.

The choice of loss function should match the model's task and the characteristics of the data, leading to better model performance and training outcomes.

Currently, TinyDL implements the two most commonly used loss functions: mean squared error (MSE) and softmax cross-entropy. Softmax converts the network's raw regression-style outputs into a class probability distribution, which provides a way to turn a regression output into a classification task: the model is trained and optimized by minimizing the softmax cross-entropy loss, and in the prediction phase the category with the highest probability is selected as the result.

[Figure: loss function implementations]
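For reference, mean squared error itself is only a few lines; here is a minimal stand-alone sketch on raw arrays (TinyDL's version operates on NdArray values):

// MSE: the mean of squared differences between predictions and true values
static float meanSquaredError(float[] predict, float[] truth) {
    float sum = 0f;
    for (int i = 0; i < predict.length; i++) {
        float diff = predict[i] - truth[i];
        sum += diff * diff;
    }
    return sum / predict.length;
}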

5.3 Optimizers

Common optimizers used in machine learning include:

1) Stochastic Gradient Descent (SGD): Use a single sample at each iteration to calculate gradients and update parameters, making the computation faster compared with batch gradient descent.

2) Batch Gradient Descent: Use the entire training set at each iteration to compute gradients and update parameters, resulting in slower convergence but better stability.

3) Momentum Optimizer: Accelerate the update process of gradient descent by introducing a momentum term that considers previous gradient changes during parameter updates, thereby reducing oscillations.

4) AdaGrad: Adapt the learning rate based on historical gradients, decreasing the learning rate for frequently occurring features and increasing it for sparsely occurring ones.

5) Adam: Combine the advantages of both the momentum and RMSProp optimizers, typically providing faster convergence and better performance.

Currently, TinyDL implements the two most commonly used optimizers, SGD and Adam:

[Figure: optimizer implementations]
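At its core, SGD is a one-line update per parameter, stepping opposite to the gradient; here is a minimal sketch on raw arrays (TinyDL's optimizer updates Parameter objects through the model):

// Vanilla SGD: param -= learningRate * grad
static void sgdUpdate(float[] param, float[] grad, float learningRate) {
    for (int i = 0; i < param.length; i++) {
        param[i] -= learningRate * grad[i];
    }
}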

6. Application Tasks and Trials

Deep learning finds applications in many fields, particularly in computer vision, where it is widely used for image classification, object detection, object recognition, facial recognition, and image generation. It also plays a significant role in natural language processing, applied to machine translation, text classification, sentiment analysis, semantic understanding, and question-answering systems, such as smart assistants and social media analytics.

Two main directions in large-scale AI-generated content (AIGC) are currently text-to-image and text-to-text generation, built primarily on the Stable Diffusion and Transformer architectures. These advances rely on robust foundational capabilities. At present, TinyDL-0.01 supports only the training and inference of smaller models. Here are some examples:

6.1 Fitting of Lines and Curves

[Figure: results of fitting lines and curves]

6.2 Classification Problems with Spiral Data

[Figure: spiral data classification results]

In the third panel, you can see that the model learns very clear classification boundaries.

6.3 Classification of Handwritten Digits

[Figure: handwritten digit samples]

In deep learning, the classification problem with handwritten digits is a classic example often used to introduce and learn about deep learning algorithms. The goal is to correctly classify images of handwritten digits into their corresponding numbers.

After a simple 50-epoch training run, the loss decreased from 1.830 to 0.034. On the test dataset, prediction accuracy reached 96% (the world's best is around 99.8%).

[Figure: training loss curve for handwritten digit classification]

6.4 Fitting a Cosine Curve with an RNN

The computation graphs for the RNN are shown below. With the recurrent hidden-layer window set to 3:

[Figure: RNN computation graph with window size 3]

With the recurrent hidden-layer window set to 5:

[Figure: RNN computation graph with window size 5]

7. Summary

1. TinyDL is merely a beginner-friendly demo framework for Java developers entering the field of AI.

I have always believed that clean code speaks for itself, and I strive to let the code document itself. Rather than explaining everything, the best approach is to debug the code directly. TinyDL is a demo-level deep learning framework aimed at helping Java developers learn AI. It is not currently suitable for real-world use, but I hope it can play a small part in enabling Java programmers to embrace AI. Given the current trend, Python has a clear advantage in the AI ecosystem; it was designed with mathematical expression in mind, and features such as operator overloading make it more readable for this purpose. To leverage AI capabilities effectively, Python seems to be an unavoidable choice.

2. TinyDL-0.02 plans to add the missing capabilities and support more advanced network features.

Due to time constraints, the Transformer network architecture and attention layers have not been implemented yet. TinyDL-0.02 aims to implement a prototype of the Transformer based on the seq2seq framework. Additionally, model training currently runs single-threaded; future versions will implement a prototype of parameter-server-based distributed training.

3. Using ChatGPT to assist with coding can significantly boost productivity.

Approximately one-third of the code in TinyDL was completed with the help of ChatGPT.

[Figure: example of ChatGPT-assisted coding]


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
