Building a Serverless PDF Text Recognition Using Function Compute with Node.js in 10 Minutes

This tutorial demonstrates how you can develop a PDF-to-Text conversion (OCR) function with Alibaba Cloud Function Compute.

By Johnson Chiang, Solutions Architect

Alibaba Cloud Function Compute (FC) is a serverless FaaS offering with an event-driven programming model. This tutorial demonstrates how to develop a PDF-to-Text conversion function with Function Compute, and shows how FC's simple yet powerful paradigm suits this kind of helper service.

What You Will Learn

This tutorial is organized into the following sections. Each section represents a specific task when developing a Function Compute service:

  1. Write Function Codes
    1. How to combine a third-party Node.js library with the built-in OSS library in FC function code.
    2. How to package the integrated codes as a ZIP deployment file.

  2. Configure Service and Function
    1. How to create an FC function with an OSS trigger by deploying the packaged ZIP file.

  3. Invoke Function
    1. Test the function by uploading a sample PDF file to the source directory, then verify that the function is triggered, extracts the text from the PDF file, and writes it to the output directory.

  4. Troubleshooting
    1. How to use handy tools to debug the problems you are likely to hit when developing an FC service.

Prerequisites

Preparing OSS:

  1. Make sure OSS is activated.
  2. Log on to the OSS console and create a bucket, <YOUR_BUCKET>.
  3. Under the bucket, create two directories, /in and /out. You will upload source PDF files to /in, and the converted text files will be written to /out.

Preparing FC:

  1. Make sure Function Compute service is activated.
  2. Download the working ZIP deployment package.
  3. Download the sample PDF file.

Write Function Codes

FC currently supports runtimes including Java, Python, PHP, and Node.js. We will write the function in Node.js and use the pdfreader npm module to read text from PDF files.

  1. Under your working directory (for example, /tmp/pdf-to-text), install the pdfreader module with npm, for example by running npm install pdfreader, and check that it loads.

  2. Create index.js under the same directory and write the FC handler. The code of index.js is shown below; it implements the event handler that is invoked when the FC function is triggered.
    // required modules
    var OSS = require('ali-oss').Wrapper;           // FC built-in module
    var PdfReader = require("pdfreader").PdfReader; // packaged 3rd-party PDF parser module
    
    console.log('Loading function');
    module.exports.handler = function (eventBuf, ctx, callback) {
        console.log('Received event:', eventBuf.toString());
        let eventObj = JSON.parse(eventBuf);
        let ossEvent = eventObj.events[0];
        let ossRegion = "oss-" + ossEvent.region;
    
        // Init oss client instance where credentials can be retrieved from context.
        let ossClient = new OSS({
            region: ossRegion,
            accessKeyId: ctx.credentials.accessKeyId,
            accessKeySecret: ctx.credentials.accessKeySecret,
            stsToken: ctx.credentials.securityToken
        });
        ossClient.useBucket(ossEvent.oss.bucket.name); //  Bucket name is from OSS event
    
        // Source PDF from "in/<filename>.pdf", processed to "out/<filename>.txt".
        // Anchored patterns rewrite only the leading "in/" prefix and the trailing ".pdf" suffix.
        let newKey = ossEvent.oss.object.key.replace(/^in\//, "out/").replace(/\.pdf$/, ".txt");
        
        // Parse PDF to text
        console.log("Getting object: " + ossEvent.oss.object.key);
        ossClient.get(ossEvent.oss.object.key).then(function (val) {
    
            let pdfBuf = val.content;
            let convertedTxt = "";
    
            console.log("Start parsing PDF buffer.");
            new PdfReader().parseBuffer(pdfBuf, function(err, item) {
                if (err) {
                    console.error("Failed to read PDF binary");
                    callback (err);
                    return;
                }
    
                if (!item) {
                    console.log("Done parsing text.");
    
                    const outBuf = Buffer.from(convertedTxt, "utf8");
                    // Upload converted text as buffer to "out" directory
                    ossClient.put(newKey, outBuf).then(function (val) {
                        console.log("Put object: ", val);
                        callback(null, val);
                        return;
                    }).catch(function (err) {
                        console.error("Failed to put object: %j", err);
                        callback(err);
                        return;
                    });
    
                    return;
                }
    
                if (item.text) {
                    console.log("Continue parsed text: " + item.text);
                    convertedTxt += item.text;
                }
            });
    
        }).catch (function (err) {
            console.error("Failed to get object: %j", err);
            callback(err);
            return;
        });
    };

  3. Usually, after you package index.js and node_modules into a ZIP file, you can upload the package directly through either the FC console or the fcli command-line tool. However, when the ZIP deployment package exceeds 50 MB, the maximum size FC allows, you need to trim it by identifying and removing unnecessary large files. In this case, we delete the ./node_modules/pdf2json/test directory, which is not used at runtime, and then confirm that the repackaged ZIP file (pdf-to-text.zip) is small enough to upload. See Install third-party dependencies to learn more.
    $ ls -l; du -hs .
    total 8
    -rw-r--r--@ 1 owner  staff  2600 Jan 21 20:00 index.js
    drwxr-xr-x  5 owner  staff   170 Jan 21 20:00 node_modules
    180M    .
    
    $ du -h -d3 | sort -nr | head -n8
    660K    ./node_modules/pdf2json/node_modules
    180M    ./node_modules
    180M    .
    178M    ./node_modules/pdf2json
    176M    ./node_modules/pdf2json/test
    108K    ./node_modules/pdf2json/lib
     88K    ./node_modules/pdf2json/.idea
     28K    ./node_modules/pdfreader/lib
    
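    # Remove the test fixtures that are not needed at runtime
    $ rm -rf ./node_modules/pdf2json/test
    
    # -r recurses into node_modules/; without it, only the empty directory entry is archived
    $ zip -r pdf-to-text.zip index.js node_modules/
    
    # Confirm that the archive is below the 50 MB limit
    $ ls -lh pdf-to-text.zip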


You can download the working ZIP deployment package to proceed to next step.
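
To make the handler's input concrete, the sketch below builds an event containing only the fields that index.js reads (events[0].region, events[0].oss.bucket.name, and events[0].oss.object.key) and derives the output key from it. The sample values and the helper names toOutputKey and describeOssEvent are illustrative assumptions, not part of the FC or OSS APIs, and a real OSS event carries more fields than shown here.

```javascript
// Hypothetical sample event: only the fields that index.js reads are shown.
const sampleEvent = {
    events: [{
        region: "ap-southeast-1",
        oss: {
            bucket: { name: "my-cool-demo" },
            object: { key: "in/pdf-sample.pdf" }
        }
    }]
};

// Illustrative helper: map "in/<filename>.pdf" to "out/<filename>.txt".
// The anchored patterns touch only the leading prefix and the trailing suffix.
function toOutputKey(sourceKey) {
    return sourceKey.replace(/^in\//, "out/").replace(/\.pdf$/, ".txt");
}

// Illustrative helper: pull out what the handler needs from the raw event buffer.
function describeOssEvent(eventBuf) {
    const ossEvent = JSON.parse(eventBuf.toString()).events[0];
    return {
        ossRegion: "oss-" + ossEvent.region,
        bucket: ossEvent.oss.bucket.name,
        sourceKey: ossEvent.oss.object.key,
        outputKey: toOutputKey(ossEvent.oss.object.key)
    };
}

const info = describeOssEvent(Buffer.from(JSON.stringify(sampleEvent)));
console.log(info);
```

Keeping the key mapping in a small pure function like this makes it easy to check the naming convention without touching OSS.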

Configure Service and Function

We will primarily be using the Alibaba Cloud Console to complete this task. In our case, all Alibaba Cloud resources are in the same region, ap-southeast-1.

  1. Configure Service: Log on to the FC console and create a Service such as FileConvertService.

    1. Role Config: authorize policies including AliyunOSSFullAccess, AliyunLogFullAccess, and AliyunFCFullAccess.

    2. Log Configs (optional): bind a Log Project and Log Store. This is strongly recommended, because it lets you debug runtime errors from the function logs.

  2. Configure Function: Under the Service, create a Function such as pdf2Text. Configure the Function as follows:

    1. Code:
      1. Runtime - nodejs6 (or nodejs8)
      2. Code Configuration - Upload the .zip file.

    2. Trigger: Create an OSS trigger with the following configurations so that the function is triggered whenever a *.pdf file is posted or put under /in. For more information, see OSS event trigger.
      1. Trigger Type - OSS
      2. Trigger Name: (for example, newPDFTrigger)
      3. Bucket: (for example, my-cool-demo)
      4. Events - select oss:ObjectCreated:PostObject and oss:ObjectCreated:PutObject. When an object is uploaded to the specified bucket directory and matches the trigger rule, OSS publishes a trigger event to invoke the function code.
      5. Trigger Rule - Prefix in/ with Suffix .pdf


By completing the above configurations, you have created the PDF-to-Text function with an OSS event trigger.
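
The prefix and suffix in the trigger rule can be pictured as a simple string check on the object key. The sketch below only illustrates that rule: OSS evaluates the trigger itself, and matchesTriggerRule is a hypothetical helper, not an FC or OSS API.

```javascript
// Illustrative only: mimics a prefix/suffix trigger rule applied to an object key.
function matchesTriggerRule(key, prefix, suffix) {
    return key.startsWith(prefix) && key.endsWith(suffix);
}

// The rule configured in this tutorial: Prefix in/ with Suffix .pdf
const rule = { prefix: "in/", suffix: ".pdf" };
const keys = ["in/pdf-sample.pdf", "in/notes.txt", "out/pdf-sample.txt"];

for (const key of keys) {
    console.log(key, "->", matchesTriggerRule(key, rule.prefix, rule.suffix));
}
```

Under this reading of the rule, the text files that the function writes to out/ do not match it, so the function's own output should not trigger it again.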

Invoke Function

Next, to test the conversion function, you will upload the sample PDF file to OSS <YOUR_BUCKET>/in to invoke the FC function.

Then check <YOUR_BUCKET>/out: a pdf-sample.txt file should have been created, containing the text extracted from the PDF file. That's it.
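
As a complement to the end-to-end test above, a handler can also be exercised locally with a hand-built event. The sketch below is one way to do that, assuming the handler follows the (eventBuf, ctx, callback) signature used in index.js. The invokeLocally helper, the placeholder credentials, and the stub handler are all hypothetical. The real handler would still try to reach OSS, so the stub here only echoes the object key.

```javascript
// Hypothetical local harness: calls a handler with the same argument shape index.js expects.
function invokeLocally(handler, bucket, key, region) {
    const eventBuf = Buffer.from(JSON.stringify({
        events: [{
            region: region,
            oss: { bucket: { name: bucket }, object: { key: key } }
        }]
    }));
    // Placeholder credentials; at runtime FC supplies real temporary credentials in ctx.
    const ctx = {
        credentials: { accessKeyId: "id", accessKeySecret: "secret", securityToken: "token" }
    };
    return new Promise(function (resolve, reject) {
        handler(eventBuf, ctx, function (err, result) {
            if (err) {
                reject(err);
            } else {
                resolve(result);
            }
        });
    });
}

// Stub standing in for the real handler: it only reports which key it received.
function stubHandler(eventBuf, ctx, callback) {
    const key = JSON.parse(eventBuf.toString()).events[0].oss.object.key;
    callback(null, "would convert " + key);
}

const pending = invokeLocally(stubHandler, "my-cool-demo", "in/pdf-sample.pdf", "ap-southeast-1");
pending.then(function (result) {
    console.log(result);
});
```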

Troubleshooting

When you implement your own FC function, you will inevitably go through testing and debugging cycles. Listed here are two common errors you may encounter, with the corresponding troubleshooting tips:

  1. Use FC logging to troubleshoot runtime execution errors: The log is the Swiss Army knife you will rely on to debug runtime errors. To use FC logging, activate Log Service and enable Log Configs on the FC service. You can then test the function iteratively and view the logs of each execution. A typical example is a log entry reporting a Node.js ReferenceError.

    For more information, see documentation at Log Service.

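  2. Ensure permissions are authorized to FC: If you have not authorized FC with access rights to OSS, the function log shows an error when the function accesses the bucket. In that case, review the policies granted in the Role Config of the service.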
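
A ReferenceError like the one mentioned in the first tip is thrown synchronously, so one option is to catch it inside the handler, log the stack, and pass the error to the callback so that it appears in the function log. The sketch below assumes the (eventBuf, ctx, callback) handler signature from index.js; withErrorLogging and the deliberately broken handler are hypothetical.

```javascript
// Hypothetical wrapper: logs synchronous exceptions and reports them through the callback.
function withErrorLogging(handler) {
    return function (eventBuf, ctx, callback) {
        try {
            handler(eventBuf, ctx, callback);
        } catch (err) {
            console.error("Unhandled error in handler:", err.stack);
            callback(err);
        }
    };
}

// Deliberately broken handler: undefinedHelper is never declared, so calling it throws.
const broken = withErrorLogging(function (eventBuf, ctx, callback) {
    undefinedHelper(eventBuf);
    callback(null, "unreachable");
});

let reported = null;
broken(Buffer.from("{}"), {}, function (err) {
    reported = err;
});
console.log(reported instanceof ReferenceError);
```

This wrapper only covers errors thrown synchronously. Failures inside promise chains still need their own .catch handlers, as index.js does for the OSS get and put calls.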

What's Next?

In this tutorial, you have built a quick and powerful file conversion service using FC with an OSS trigger. Here are some suggestions for what to explore next:

  1. Visit our product documentation of Function Compute.
  2. Learn more about FC from introductory articles on Alibaba Cloud community: Serverless Computing with Alibaba Cloud Function Compute, How to Use Function Compute on Alibaba Cloud
  3. Intelligent Media Management (IMM) service, currently available in Alibaba Cloud China site, is a more powerful SaaS tool provided by Alibaba Cloud to process media data - for example, Office file format conversion, image and video processing. It provides RESTful API for integration. FC in conjunction with IMM will offer more powerful conversion capability.