Sunday, December 15, 2013

In today's post, we read from the book "Building the Data Warehouse" by Inmon. We previously discussed the CIF architecture briefly and compared it with Kimball's star-schema model. We now look at the data warehouse in more detail.
A data warehouse is an accumulation of data held at different levels of detail: an older level of detail, a current level of detail, a level of lightly summarized data (the data mart level), and a level of highly summarized data. Before data makes its way into the warehouse, there is tremendous transformation involved. As data ages, it moves from current detail to older detail; as data is summarized, it moves from current detail to lightly summarized and then to highly summarized.

The data across levels is organized by subject areas. Typical subject areas in a corporate data model are Customer, Product, Transaction, Policy, Claim and Account. A subject area can consist of several (on the order of tens or hundreds of) physical tables. For a subject area such as Customer, there could be a physical table for base customers from, say, 1985-87, and another for 1988-90. There could be a cumulative table for 1986-90 and a summary record for each customer based on customer activity per month. All of these physical tables are related by a common key: the customer ID.

Direct access storage devices, magnetic tape, fiche and optical disk can all be used to store the data. Fiche is good for storing data that does not have to be reproduced in an electronic medium again.
Some tables are organized on a from-date/to-date basis, while others are cumulative.
Due to their accumulation and size, data warehouses are not built all at once. For example, on day 1 there may be just the legacy systems doing operational, transactional processing. On day 2, the first few tables of the first subject area of the data warehouse are populated. Users start to discover the warehouse and analytical processing. On day 3, there are more subjects, and more curiosity draws more users and a DSS analyst. On day 4, more data is extracted from the operational environment, transformed and placed in the data warehouse. Many DSS applications spring up, and access now needs to be granted for usage.
On day 5, departmental databases come up. Departments now extract the data from the warehouse to their level for analysis.
On day 6, the land rush to departmental, multidimensional systems takes place. This is because the departmental data is now directly available from the warehouse, and users no longer look at the detailed data but focus on departmental analytical processing.
On day n, the architecture is fully developed. Although at this point it looks like a spider web, it is not one; "The Cabinet Design" article from May 1991 explains why.
Note that the level of detail, or the summarization unit of the data, plays a critical and pervasive role in the design of the data warehouse. This is referred to as granularity.
In operational systems, data is necessarily stored at the lowest granularity. In data warehouses, no such assumption is made. For example, a subscriber's call records for a month are a high level of detail and a low level of granularity, while a summary of the same subscriber's calls is a low level of detail and a high level of granularity. Note that the number of records to search through to determine the existence of a single record is orders of magnitude greater in the first case than in the second.
In most retail companies, there are usually dual levels of granularity: one level for true archival detail and another for lightly summarized data. The lightly summarized data can come from, say, the 30 days of detail maintained by the operational systems.


Saturday, December 14, 2013

We will talk about choosing cache engines in a LAMP stack:
We briefly discussed a few of them along with their modules; now we consider their usage.
Code cleanup and performance tuning in PHP scripts go only so far in making pages fast enough. Caching engines avoid redoing work for pages that have already been served. Caching can be applied to the internally compiled PHP scripts, the generated output, or the intermediary data-processing objects.
APC is an opcode caching engine that comes from the PECL/PEAR repository. The code compiled by the PHP runtime engine is stored in shared memory for later access. APC can be compiled into PHP or loaded as a Dynamic Shared Object (DSO) by specifying the extension as apc.so in the php.ini file.
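Beyond opcode caching, APC also exposes a simple user-data cache. A minimal sketch of that use (the key name and the build_product_list() helper are hypothetical):

<?php
// Try the cache first; rebuild and store on a miss.
$products = apc_fetch('product_list');
if ($products === false) {
    $products = build_product_list();          // hypothetical expensive call
    apc_store('product_list', $products, 300); // keep for 300 seconds
}
?>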
eAccelerator is a similar compiled-state caching engine for PHP 4 that also optimizes and encodes code. The resulting encoded files can be distributed to any host that has eAccelerator installed. Again, the compiled objects are stored in shared memory, and settings such as memory size, cache directory, time-to-live and prune period can be set in the php.ini file; the extension entry is accelerator.so. Similarly, Zend Optimizer uses the native Zend engine that powers PHP to cache compile-time objects.
JPCache works by storing the generated output of a PHP script to disk or in a SQL database. It is written as a PHP library. In addition to caching, it also provides compression, resulting in further improvements in page load time. Configuration options include those for the database, the credentials and the compression level.
Memcached is a distributed memory object caching system and is available as a PHP DSO. We can add entries to memcache using the add method, which takes a key and a value as parameters.
The getStats() method on the memcache object is particularly helpful and returns such information as current connections, total connections, hits, misses, current items, total items, etc.
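A minimal sketch of that usage with the PECL memcache extension (the host, port, key and TTL are illustrative):

<?php
$cache = new Memcache();
$cache->connect('localhost', 11211);       // assumed memcached host and port

$userRow = array('id' => 42, 'name' => 'Alice'); // hypothetical data to cache
$cache->add('user:42', $userRow, 0, 600);  // no compression, expire in 600s

$row = $cache->get('user:42');             // returns false on a miss

print_r($cache->getStats());               // connections, hits, misses, items...
?>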
As we see now, there is overlap between the caching frameworks. Complementary ones can increase performance, but others may not work well together.
JPCache and memcached can work well with each other, but because eAccelerator and Zend Optimizer have their own encoders, they are mutually exclusive. Moreover, you only want one opcode cache. Memcached is a great choice for large objects with varied lifetimes, or for small objects that would otherwise require frequent database access. If there are static pages, JPCache works well. If there is a lot of PHP code, an opcode cache like APC works well.

In this post, we talk about Ajax (Asynchronous JavaScript and XML) in the LAMP stack.
 Ajax helps make calls to the server to update a page without a page refresh. This allows client side of the application to remain responsive even when the data is exchanged with the server.
With full page refreshes, the entire page was transmitted repeatedly, consuming bandwidth by sending the same elements over and over and leaving the web application unresponsive during transfers. With Ajax, the page remains responsive to user actions during transfers.
Ajax allows rich, "fat" clients to be written, meaning the client is capable of a lot of functionality. Take Gmail for instance: it is a web client that uses Ajax. The page isn't refreshed, but its elements are manipulated via JavaScript within the application, with XML communication to the Gmail server.
Prior to Ajax, the techniques used to communicate with the server without a refresh included the following. The image-source trick: by manipulating the src property of an image in the HTML, developers made the browser send a web request for the image and packaged its response with a server-side script that accessed a data store. However, it was one-way, and it wasn't robust because it couldn't handle errors.
Hidden frames were another technique, used to send and receive information from the web server. They are simple to use and debug, and each frame load adds a browser history entry.
The downside is that the user can manipulate the content.
Hidden iframes, used instead of frames, help alleviate the issues with the behavior of the back button.
The XMLHttp and XMLHttpRequest interfaces provide the following methods, supported across browsers:
abort(): aborts the request
getResponseHeader(key): returns the value of the HTTP response header named key
open(method, uri): sets the HTTP method and the request URI
overrideMimeType(type): forces the interpretation of the response
send(requestcontent): sends the HTTP request
setRequestHeader(key, value): assigns a request header; it must be called after open() and before send()
Since the interface goes by different names and is instantiated differently in different browsers, Ajax code handles this by using the native XMLHttpRequest where it exists and window.ActiveXObject otherwise.
Similarly, the status code of the response is checked before the response is parsed.
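A sketch of this pattern in JavaScript (the data.php URI is hypothetical; the rest is the standard cross-browser idiom of the time):

function createRequest() {
    if (window.XMLHttpRequest) {
        return new XMLHttpRequest();               // native interface
    }
    return new ActiveXObject('Microsoft.XMLHTTP'); // older Internet Explorer
}

var xhr = createRequest();
xhr.open('GET', 'data.php');
xhr.onreadystatechange = function () {
    // Parse only when the request is complete and the status is OK.
    if (xhr.readyState === 4 && xhr.status === 200) {
        var doc = xhr.responseXML;
        // ...walk doc and update the page's DOM here...
    }
};
xhr.send(null);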
The responseXML property of the request is parsed, and elements may be added to the page's DOM. Knowing the XML structure may or may not be required; for example, handlers can be written at the level of an XML element instead of at the response level.
Some common Ajax libraries are: SAJAX (Simple Ajax Toolkit), which allows functions to be defined in PHP and called from the client; CPAINT, the Cross-Platform Asynchronous Interface Toolkit, which supports both plain-text and XML response types; and JPSPAN, which bridges the gap between JavaScript and PHP, allowing a seamless importing of PHP classes into JavaScript.

Friday, December 13, 2013

This post is about site security on LAMP stacks.
Authentication is important for site security. We discussed Apache authentication with the mod_auth_mysql module, which bolsters site security by using a user database together with Apache's authentication. We will now see the built-in PHP security mechanisms, which allow us to make our own authentication system.
Security includes restricting access. We can do this with Apache, PHP or both.
The easiest way to lock down a website is to use Basic Authentication. Using a combination of .htaccess and password files, we can restrict access to any folder within the website. This is easy to implement, and it comes with the default installation. The downside to using the built-in Apache authentication is that it becomes unwieldy with a complex hierarchy of restricted folders and users.
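For illustration, a typical .htaccess for such a folder might look like this (the realm name and password-file path are placeholders; the file itself is created with the htpasswd utility):

AuthType Basic
AuthName "Restricted Area"
AuthUserFile /var/www/passwords/.htpasswd
Require valid-user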
Instead of using this file-based mechanism, we can use mod_auth_mysql, which better manages the username/password storage; for example, it gives you complex permission-matching abilities.
In the PHP-based mechanism, we emulate Basic Authentication by sending the proper response headers and checking for the right server variables. We set up a table to store usernames and passwords and write a script that queries the database for a given username and password.
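A minimal sketch of that emulation, assuming an open MySQL connection and a hypothetical users table with MD5-hashed passwords:

<?php
function valid_user($user, $pass) {
    // Query the hypothetical users table for a matching pair.
    $result = mysql_query(sprintf(
        "SELECT 1 FROM users WHERE username = '%s' AND password = MD5('%s')",
        mysql_real_escape_string($user),
        mysql_real_escape_string($pass)));
    return mysql_num_rows($result) > 0;
}

if (!isset($_SERVER['PHP_AUTH_USER']) ||
    !valid_user($_SERVER['PHP_AUTH_USER'], $_SERVER['PHP_AUTH_PW'])) {
    // Emulate Basic Authentication: challenge the browser, then deny.
    header('WWW-Authenticate: Basic realm="Members Only"');
    header('HTTP/1.0 401 Unauthorized');
    exit('Access denied.');
}
?>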
Other user parameters can also be checked. For example, we could restrict access based on IP address. Server variables such as HTTP_REFERER can also come in very useful. The thing to note is that anything you receive from a client can be tampered with. Information gathering, cross-site scripting and SQL injection are just some examples.
CAPTCHA functionality is available in PEAR.
Another kind of website attack involves abusing register_globals, a setting in php.ini that controls the auto-population of variables with the same names as form elements or cookies.
Keeping register_globals set to on is bad practice; for example, it lets a request parameter silently initialize a variable such as $username. Turning it off forces values to be read explicitly from the input arrays instead. If register_globals cannot be disabled globally, the .htaccess file can be used to turn it off.
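A one-line .htaccess entry does it (assuming the host allows PHP flags in .htaccess):

php_flag register_globals off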
A SQL injection attack is damaging by itself, but combined with a register_globals attack it can be even more critical, because attackers can insert malicious SQL statements through unchecked variables.
With input scrubbing and data validation, this can be avoided.
Some settings in PHP help obviate this threat, such as magic_quotes_gpc in php.ini, as does applying addslashes() to all data passed to MySQL.
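A sketch of such scrubbing (the username field is illustrative; stripslashes() runs first so the data isn't escaped twice when magic_quotes_gpc is on):

<?php
$name = $_POST['username'];
if (get_magic_quotes_gpc()) {
    $name = stripslashes($name);  // undo the automatic quoting first
}
$name = addslashes($name);        // escape quotes before they reach MySQL
$sql  = "SELECT * FROM users WHERE username = '$name'";
?>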
In general, the policy for user privileges should grant only the minimum required for access. This practice limits the damage from any single compromised account, since every elevated privilege must be granted incrementally.
While SQL injection and register_globals attacks work on the input data to the system, cross-site scripting targets unclean dynamic output. The attacker changes the page's generated markup by adding JavaScript that discloses sensitive information. PHP gives you an assortment of options for cleaning dynamic output: htmlspecialchars(), which escapes ampersands and angle brackets; htmlentities(), which escapes every character that has an HTML entity equivalent; strip_tags(), which removes all HTML and PHP tags from a string; and utf8_decode(), which converts UTF-8 encoded characters to ISO-8859-1.
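A quick look at the four, applied to a hypothetical untrusted comment field:

<?php
$comment = $_GET['comment'];     // untrusted input echoed back to the page

echo htmlspecialchars($comment); // &, <, >, " become entities
echo htmlentities($comment);     // every character with an entity equivalent
echo strip_tags($comment);       // drop HTML and PHP tags outright
echo utf8_decode($comment);      // UTF-8 to ISO-8859-1
?>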
In today's post, we look at the top ten web design mistakes as listed by Jakob Nielsen (in 1999).
1. Breaking or slowing down the back button. He listed this as the second most used navigation feature, hence its importance. Sites tend to break the button by opening a new browser window, using an immediate redirect, or preventing caching.
2. Opening new browser windows: users are hostile to a site taking over their machines, and it defeats the very purpose of keeping them glued to the site. This includes opening PDF files in windows, because a PDF is an 'undifferentiated blob' of content that is hard to navigate.
3. Not changing the color of visited links: users want to know which links they have visited so that they can exclude the ones that didn't work for them.
4. Non-scannable text: this is when a blog post such as this one renders text without subheads, bulleted lists, highlighted keywords, short paragraphs, the inverted pyramid, a simple writing style and non-verbose language.
5. Fixed font size: readability for people over 40 is critical to gaining their attention. CSS that disables the browser's change-font-size feature is yet another disappointment.
6. Page titles with low search engine visibility: search is one of the most popular ways users discover new sites and navigate within sites. The title, and particularly its first 66 characters, is the micro-content users see first. If all the pages have similar titles, the user cannot tell apart the windows opened from them.
7. Anything that looks like an advertisement: users tend to avoid it. Banners, animations and pop-ups are turn-offs for a user in goal-driven navigation.
8. Violating design conventions: consistency sets the users' expectations, prevents surprises, removes insecurity and gives a greater feeling of control.
9. Bad search: search is described by the author as the user's lifeline when navigation fails. The simpler and less literal the search engine, the better.
10. Not answering users' questions: if the website doesn't provide what the user is looking for, it defeats their purpose. A simple question like "Where's the price?" should be answered right away.
Now a look at how content management systems work on the LAMP stack:
Content management systems come in two types: enterprise CMS and web CMS/portals.
The former is a high-powered software package that provides comprehensive solutions integrating functions such as shipping and delivery systems, invoicing, employee and human-resources information, document management and transactional systems. They are usually highly customized, company-wide solutions.
Portals are mostly created for use on the web. They allow users to collaborate on a website and enable developers to bring functionality to a website quickly.

Thursday, December 12, 2013

Continuing from the book reading:
Here are some of the Apache tricks from the book:
Enabling mod_rewrite: this Apache module is used for URL rewriting, keeping query-string input safe, keeping the site search-engine-friendly, making URLs friendly, and stopping hot-linking.
A RewriteRule is a configuration directive that provides a wide range of rewriting abilities, from simple redirects to complex multiple-rule pattern-matching substitutions.
A RewriteCond specifies a condition for a rule. HTTP header values, request variables, server variables, time variables and some special variables can all be used inside a RewriteCond. RewriteLog and RewriteLogLevel are directives that specify a log file to which a running record of the internal rewrite processing is sent, and the verbosity of that log.
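Two illustrative rules (example.com and the file names are placeholders):

RewriteEngine On

# Redirect a legacy page permanently.
RewriteRule ^old-page\.html$ /new-page.php [R=301,L]

# Stop hot-linking: forbid image requests whose referer is some other site.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F]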
URL spell checking is accomplished with the mod_speling module, with which the Apache server works out the intent of the user and suggests a correction. CheckSpelling On is the configuration directive to use.
Content compression is achieved with the mod_deflate module. Since HTML and related text resources are normally sent uncompressed, this feature lets us transparently compress the page markup, stylesheets and other text content and send it across the internet, where the browser decompresses it before rendering.
We use the AddOutputFilterByType directive to accomplish this, and DeflateCompressionLevel sets the level of compression used to shrink the files.
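A sketch of the httpd.conf entries (the type list and level are illustrative choices):

# Compress text content before sending it to the client.
AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript
# 1 is fastest, 9 smallest; 6 is a common middle ground.
DeflateCompressionLevel 6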
For databases, we create security zones using Apache and Basic Authentication, available via the mod_auth_mysql module. The configuration directives in this case are AuthName to name a security zone, AuthType set to Basic, AuthMySQLDB for the database name, AuthMySQLUser for the user ID, AuthMySQLPassword for the password, AuthMySQLEnable set to On, and require valid-user.
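Put together, a zone might look like this (the directory, zone name and credentials are placeholders, and directive spellings vary slightly across mod_auth_mysql versions):

<Directory "/var/www/html/protected">
    AuthName "Database Zone"
    AuthType Basic
    AuthMySQLEnable On
    AuthMySQLDB siteauth
    AuthMySQLUser webauth
    AuthMySQLPassword secret
    require valid-user
</Directory>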
Communication between client and server can be encrypted using SSL with mod_ssl, an Apache module that SSL-enables a website Apache controls. mod_ssl builds on the OpenSSL toolkit; a certificate is generated for the server, and a VirtualHost section in the config file references that certificate.
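A sketch of such a VirtualHost section (the server name and certificate paths are placeholders):

<VirtualHost *:443>
    ServerName www.example.com
    SSLEngine On
    SSLCertificateFile /etc/httpd/conf/server.crt
    SSLCertificateKeyFile /etc/httpd/conf/server.key
</VirtualHost>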
Apache can also be used as a file repository with WebDAV. WebDAV, which stands for Web-based Distributed Authoring and Versioning, enables Apache to allow users to treat an enabled directory as a remote directory or drive on their own computers.
Continuing from the book reading:
This book talks about code efficiency. Performance of code may depend on the machine configuration; for example, a little more memory can help tremendously with page swapping. To speed up a site, there are generally two categories of techniques, namely benchmarking and profiling. Benchmarking is experimenting to determine the best approach for something before implementing it for real; profiling is experimenting on the real thing to see how well it performs. In benchmarking, a timer is started before some code executes and stopped right after, and this start-stop is repeated to accumulate durations. Timers can also be labeled so that they can be started and stopped independently. The PEAR benchmarking class provides this functionality with its Benchmark_Timer.
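A minimal sketch of its use (the two operations being timed are hypothetical):

<?php
require_once 'Benchmark/Timer.php';

$timer = new Benchmark_Timer();
$timer->start();
expensive_operation_one();            // hypothetical code under test
$timer->setMarker('after first op');  // label an intermediate point
expensive_operation_two();
$timer->stop();
$timer->display();                    // print the accumulated timings
?>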
Profiling can help you plan hardware improvements for the existing code. Things like connections, processor and memory can tremendously improve actual performance.
Web server improvements will also help actual performance, and Apache is highly configurable. Apache runs several httpd processes to handle requests. More processes can be started when the load becomes high, and the excess can subsequently be shut down. In some cases these processes may be shut down even when there is work to do, in order to reclaim memory that has been leaking. An httpd process may also crash, failing its request, but another process can pick up the same request again. The configuration options that control the number of httpd processes include MaxClients, MaxRequestsPerChild, StartServers and MinSpareServers, all specified in the httpd.conf file.
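An illustrative httpd.conf fragment for a prefork server (the values are placeholders to be tuned against real load):

StartServers         8     # processes launched at startup
MinSpareServers      5     # keep at least this many idle
MaxClients         150     # ceiling on concurrent processes
MaxRequestsPerChild 4000   # recycle a child after this many requests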
With improvements in hardware, database and Apache, we can focus on code improvements next. PEAR coding standards come in useful here.
PEAR provides a caching framework. Static versions of pages can be generated and served instead of regenerating them each time. The source code for a page can be stored in a pre-parsed format that the Zend engine can execute directly. Even the browser cache can be used to improve the experience: the Last-Modified header comes in useful for determining the last time the content changed.
The Expires header can be used to say how long the page should remain valid. Clients can also send the If-Modified-Since header, which the server uses to decide between generating a full response and sending a Not Modified response.
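A sketch of that negotiation in PHP (content.html and the one-hour expiry are illustrative):

<?php
$lastModified = filemtime('content.html');  // hypothetical source of the page
$ifModified   = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) : 0;

if ($ifModified >= $lastModified) {
    header('HTTP/1.0 304 Not Modified');    // the client's copy is still good
    exit;
}

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');
readfile('content.html');
?>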
Furthermore, output buffering can be used to gather up the contents of the page and send them out all at once. This is done by calling ob_start() and ob_end_flush() around sections of the HTML; ob_start('ob_gzhandler') ensures the buffered output is also compressed.
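For instance, a page can be wrapped like this (the body is a stand-in for real markup):

<?php ob_start('ob_gzhandler'); // buffer and gzip if the client accepts it ?>
<html>
  <body>
    <?php echo 'page content generated here'; ?>
  </body>
</html>
<?php ob_end_flush(); // send the compressed buffer in one shot ?>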
There are several different caching engines available to choose from: the Alternative PHP Cache (APC), a PHP opcode caching engine that comes with PECL/PEAR; the Zend engine's cache, which has also become part of PECL; eAccelerator, another compiled-state PHP caching engine; JPCache, a memory/disk caching solution written as a PHP library that stores the generated output of a PHP script to disk or in a SQL database instead of saving the compiled state; and memcached, which unlike the output-caching systems of APC, eAccelerator and JPCache relies on caching back-end code objects such as database result objects and data-model entities.