Sunday, April 1, 2018

Code snippet scraping from a website using a Chrome extension plugin and NodeJS



Objective:
The objective of this project is to develop a Chrome extension plugin that scans the active URL opened in the Chrome browser for code snippets and stores them in a MySQL database.

Requirements:
1) Develop a Chrome browser extension plugin which, on click, finds the source code snippets in the active URL in the web browser and stores them in a MySQL database.
2) If the URL is not valid, it should display a message saying the URL is invalid.
3) If the URL doesn’t have any code snippets, it should display a message saying no code snippets were found, and no record should be inserted into the MySQL database.

GitHub link:

Installation Steps:

1. Set up all the prerequisite software (as described in the Pre-requisites section).
2. Git clone the project (for example, in my case, under D:\Projects\).
3. Once the clone is complete, execute the DB script D:\Projects\NodeJS_Projects\SourceCodeFind\db_scripts\db_scripts.sql in your local MySQL database:

CREATE DATABASE nodemysql;

USE nodemysql;

CREATE TABLE `code_capture` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `url_scanned` varchar(300) DEFAULT NULL,
 `code_snippet` varchar(1000) DEFAULT NULL,
 `last_updated_dt` datetime DEFAULT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=68 DEFAULT CHARSET=latin1;


4. Now open the command prompt and cd to path D:\Projects\NodeJS_Projects\SourceCodeFind\NodeRestAPI and execute the command "npm install"

5. Now execute the command “nodemon”


Note: Please don’t close this window. This is the server process, which must keep running. If you want to stop it, press CTRL + C and close the window.

Error scenario:
If nodemon fails with a database connection error at this point, it means that the MySQL server is not up and running. Start MySQL and run the command “nodemon” again.


6. Now install the Chrome extension plugin. Open the Chrome browser and type “chrome://extensions” in the address bar, enable “Developer mode” at the top of the page and click “LOAD UNPACKED”. This opens the “Browse For Folder” dialog; enter the path D:\Projects\NodeJS_Projects\SourceCodeFind\ChromeExtension and click the OK button.


You will now see the extension listed, and a blue icon for it is added to the browser toolbar.

Now open any valid URL that contains code snippets (such as a Stack Overflow page) and click this icon; you will see the “Copy” button. Click the “Copy” button and it will display “Code snippets are inserted successfully to the database”. Note that the server component from Step 5 must be running for this to work.




You can query the MySQL database table (code_capture) to see the latest entry, which holds the following values:
id: the latest auto-increment number
url_scanned: the URL active in the browser
code_snippet: the code snippet from the URL
last_updated_dt: the date-time stamp of the insert

Negative scenarios:
1. If you try a URL that does not contain any code snippet (for example, www.google.com), you will receive the message “No code snippets found in the URL: <URL>”.


2. If you try a URL that is not a valid http (or https) URL (for example, chrome://extensions/), you will receive the message “Not a valid http or https URL. URL received is: <URL>”.
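
To make the three outcomes concrete, here is a minimal sketch of how the message could be chosen. The message texts are the ones quoted in this post; the function names and the exact regex are my assumptions for illustration and may differ from what app.js actually does.

```javascript
// Hypothetical helpers: only the message texts come from this post.
const HTTP_URL_PATTERN = /^https?:\/\/\S+$/i;

// True only for strings that start with http:// or https://
function isHttpUrl(url) {
  return typeof url === "string" && HTTP_URL_PATTERN.test(url);
}

// Pick the message shown in the popup for a URL and the snippets found in it.
function buildMessage(url, snippets) {
  if (!isHttpUrl(url)) {
    return "Not a valid http or https URL. URL received is: " + url;
  }
  if (snippets.length === 0) {
    return "No code snippets found in the URL: " + url;
  }
  return "Code snippets are inserted successfully to the database";
}

console.log(buildMessage("chrome://extensions/", []));
console.log(buildMessage("https://www.google.com", []));
```

Checking the URL first means the server never tries to download a page for addresses like chrome://extensions/.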

Pre-requisites:

1. Node JS:
This project requires NodeJS to be installed on your machine, so let’s install it first.
Note: If NodeJS is already installed on your machine, you can skip this step and move to the “Node and NPM upgrade on Windows” section of this document, to ensure you have the latest npm version installed.
Installation Steps:
Download the 64-bit Windows installer from the Node.js® web site - https://nodejs.org/en/download/

Run the installer (the .msi file you downloaded in the previous step.)
Follow the prompts in the installer (Accept the license agreement, click the NEXT button a bunch of times and accept the default installation settings).

Restart your computer. You won’t be able to run Node.js® until you restart your computer.

Test the installation:
Make sure you have Node and NPM installed by running simple commands to see what version of each is installed and to run a simple test program:
Test Node: To see if Node is installed, open the Windows Command Prompt, Powershell or a similar command line tool, and type node -v. This should print a version number, so you’ll see something like this v6.10.2.
Test NPM: To see if NPM is installed, type npm -v in the same command line tool. This should print NPM’s version number, so you’ll see something like this 5.4.2
Create a test file and run it: A simple way to test that node.js works is to create a JavaScript file: name it hello.js, and just add the code console.log('Node is installed!');. To run the code simply open your command line program, navigate to the folder where you save the file and type node hello.js. This will start Node and run the code in the hello.js file. You should see the output Node is installed!.


Node and NPM upgrade on Windows:
1. Run PowerShell as Administrator
2. Run the following commands in the PowerShell to upgrade 

Set-ExecutionPolicy Unrestricted -Scope CurrentUser -Force
npm install -g npm-windows-upgrade
npm-windows-upgrade 



Explanation of the Source Code:
Project Structure:
1. Below is the project structure for this implementation:
\ChromeExtension – holds the Chrome Extension related code
\db_scripts – holds the db_scripts.sql file, which has the database script to create the database and table.
\NodeRestAPI – holds the server-side Node JS code, which has the logic to persist the code-snippet details to the MySQL table via REST APIs.

2. SourceCodeFind\ChromeExtension\manifest.json:
A Chrome extension needs a manifest.json, which holds metadata details like:
Icons to be used – icon128.png, icon38.png, icon19.png (the same icon in various sizes)
browser_action – with the default_popup value naming the HTML file to be invoked.
permissions – restricts the extension to the active tab only (activeTab) and allows it to invoke the Node JS API via the URL http://localhost:3000/insert

3. SourceCodeFind\ChromeExtension\popup.html:
It displays a single button named “Copy” and loads the JavaScript file (popup.js) and the CSS file (custom.css).


4. SourceCodeFind\ChromeExtension\popup.js
When the Copy button is clicked, the click handler registered via addEventListener() is invoked, which in turn calls the function triggerCopy(). This triggerCopy() function fetches the active URL from the Chrome tab and invokes the REST API POST URL http://localhost:3000/insert via the Fetch API. The Fetch API call accepts parameters like:
Method: POST
Headers: Accept and Content-Type
Body: the URL value fetched from the Chrome tab
Once the Fetch API response is received, it replaces the content of the div with id “output” (defined in popup.html) using the innerHTML property.
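
The flow above can be sketched roughly as follows. This is not the project's popup.js: the button id "copy", the JSON field name "url", the header values and the helper buildInsertRequest() are assumptions made for illustration. Only the endpoint, the POST method, the function name triggerCopy() and the “output” div come from the description above. The Chrome and fetch objects are passed in as parameters so the logic can also be tried outside the browser.

```javascript
// Endpoint taken from this post; everything else below is a hypothetical sketch.
const INSERT_URL = "http://localhost:3000/insert";

// Build the options object handed to fetch() for the POST call.
function buildInsertRequest(tabUrl) {
  return {
    method: "POST",
    headers: {
      "Accept": "application/json",
      "Content-Type": "application/json",
    },
    // Assumed body shape: a JSON object with a single "url" field.
    body: JSON.stringify({ url: tabUrl }),
  };
}

// tabsApi should look like chrome.tabs, fetchFn like window.fetch and
// outputEl like the <div id="output"> element from popup.html.
function triggerCopy(tabsApi, fetchFn, outputEl) {
  return new Promise((resolve) => {
    tabsApi.query({ active: true, currentWindow: true }, (tabs) => {
      fetchFn(INSERT_URL, buildInsertRequest(tabs[0].url))
        .then((response) => response.text())
        .then((text) => { outputEl.innerHTML = text; })
        .catch((err) => { outputEl.innerHTML = "Request failed: " + err.message; })
        .then(resolve);
    });
  });
}

// Wire up the click handler only when running inside the extension popup.
if (typeof chrome !== "undefined" && chrome.tabs && typeof document !== "undefined") {
  document.getElementById("copy").addEventListener("click", () => {
    triggerCopy(chrome.tabs, fetch.bind(window), document.getElementById("output"));
  });
}
```

The catch branch is an addition of mine: without it, a stopped Node server would leave the popup blank instead of showing an error.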

5. SourceCodeFind\ChromeExtension\custom.css:
.button applies the CSS attributes for the button
.div.sansserif will apply the font attributes for the text displayed below the “Copy” button. 

6. SourceCodeFind\db_scripts\db_scripts.sql:
This holds the MySQL database script to create the database and the table.
id – the AUTO_INCREMENT number field
url_scanned – holds the URL that is active in the browser and will be scanned for code snippets
code_snippet – holds the code snippet found in the URL.
last_updated_dt – holds the datetime value of when the record was inserted.
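
The last_updated_dt value is set by the Node server in the Indian timezone. The project uses the moment-timezone library for this; the sketch below is a dependency-free stand-in that uses the built-in Intl.DateTimeFormat instead. The function name and the "YYYY-MM-DD HH:mm:ss" output format are my assumptions, not taken from app.js.

```javascript
// Stand-in for moment-timezone (not the project's code): format a timestamp
// in Indian Standard Time as "YYYY-MM-DD HH:mm:ss", a string form that MySQL
// accepts for DATETIME columns.
function toIndianDateTime(epochMillis) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: "Asia/Kolkata",
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
    second: "2-digit",
    hourCycle: "h23", // 24-hour clock, midnight printed as 00
  }).formatToParts(new Date(epochMillis));

  // Look up one named part (year, month, ...) of the formatted date.
  const get = (type) => parts.find((p) => p.type === type).value;

  return `${get("year")}-${get("month")}-${get("day")} ` +
         `${get("hour")}:${get("minute")}:${get("second")}`;
}

console.log(toIndianDateTime(Date.now()));
```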


7. SourceCodeFind\NodeRestAPI\app.js:
This is the Node component, which performs the following tasks:
a) Import all the required dependencies via require().
express – Node library to start the server
mysql – Node library to connect to the MySQL database
cheerio – Node library to parse the scraped HTML and look for elements in it (the <pre> tag in our case) on any website
request – Node library to hit the URL and get the response
body-parser – Node library for parsing the request body, accepting only JSON and URL-encoded data
moment-timezone – Node library to handle timezones and time, needed to set today’s date-time in the Indian timezone (used when inserting last_updated_dt into the MySQL database)

b) Initialize the Node express server, set the body-parser middleware to accept only JSON and URL-encoded data, and start the express server listening on port 3000.

c) Create the MySQL database connection and connect to the database.

d) Set the timezone to the Indian timezone using the Moment Timezone library and the date-time to the current date-time using Date.now().

e) Define the POST API method /insert, which performs the following activities:
Fetch the URL (selected in the Chrome browser) from the request body.
Check whether the incoming URL is a valid URL (starting with http or https) using a regex pattern.
If the URL is valid, invoke the URL via the request library, load the website body into the cheerio library and scan for the HTML tags <pre> </pre>. On most websites, code snippets are wrapped between the <pre> and </pre> tags. Once the code snippets are found, insert them into the MySQL database via an INSERT statement.
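
To illustrate the <pre> scan, here is a small dependency-free sketch. The project itself uses cheerio for this step; I have swapped in a regular expression here so the example runs without installing anything. A regex is not a real HTML parser, so treat this purely as an illustration. The function names are hypothetical.

```javascript
// Regex-based stand-in for cheerio's <pre> lookup (not the project's code).

// Decode the handful of HTML entities that commonly appear in code samples.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" stays a literal "&lt;"
}

// Return the text content of every <pre>...</pre> block in the page body.
function extractPreBlocks(html) {
  const snippets = [];
  const prePattern = /<pre\b[^>]*>([\s\S]*?)<\/pre>/gi;
  let match;
  while ((match = prePattern.exec(html)) !== null) {
    // Drop nested tags such as <code> or <span>, then decode entities.
    const withoutTags = match[1].replace(/<[^>]+>/g, "");
    const text = decodeEntities(withoutTags).trim();
    if (text.length > 0) snippets.push(text);
  }
  return snippets;
}

const sampleHtml =
  "<p>Intro</p><pre><code>console.log(&quot;hi&quot;);</code></pre>" +
  "<pre class=\"lang-js\">let a = 1 &lt; 2;</pre>";
console.log(extractPreBlocks(sampleHtml));
```

An empty result corresponds to the “No code snippets found” case. Also keep in mind that the code_snippet column is defined as varchar(1000), so longer snippets would need truncation or a larger column type.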



Complete Code snippet:
https://github.com/shyamnarayan2001/NodeJS_Projects/blob/master/SourceCodeFind/NodeRestAPI/app.js

8. SourceCodeFind\NodeRestAPI\package.json:
This file holds the node dependencies required to create node_modules folder. The “npm install” command will use this package.json file to install all the required dependencies mentioned for this project.


That's all. Hope this tutorial was useful.
