Wednesday, December 31, 2014


Schedulers:
Scheduling refers to the methods by which threads, processes, or data flows are given access to system resources. The need for scheduling usually arises from multitasking or multiplexing: the former refers to the execution of more than one process, while the latter refers to transmitting multiple data streams across a single physical channel. The goal of the scheduler can be one of the following:
1)      To increase throughput – such as the total number of processes that complete execution per unit time
2)      To decrease latency – such that the total time between the submission of a process and its completion (turnaround time) is reduced, or the time between submitting a job and its first response (response time) is reduced
3)      To increase fairness – where the CPU time allotted to each process is fair or weighted based on priority (taking process priority and workload into account)
4)      To decrease waiting time – the time the process remains in the waiting queue.
As we can see, these goals can conflict (for example, throughput versus latency), so schedulers can be instantiated for different goals; the preference given to each goal depends on requirements.
Schedulers may have additional requirements, such as ensuring that scheduled processes can meet deadlines so that the system remains stable. This is true for embedded systems and real-time environments.
The notion of a scheduler brings up a data structure of one or more queues. In fact, the simplest scheduler is a queue. A ready queue is one where jobs can be retrieved for execution. Queues can also be prioritized.
Schedulers may perform their actions once, every once in a while, or at regular intervals. They are classified accordingly as long-term, mid-term, or short-term schedulers. The long-term or admission scheduler is generally used in batch processing systems, where it admits or delays processes and determines the degree of concurrency. Usually processes are distinguished as CPU-bound or I/O-bound. The mid-term scheduler swaps jobs in and out of memory so that processes can execute and the system resources are available for them. The short-term scheduler decides which job executes next, at regular intervals called quanta.
The scheduling algorithm can be one of the following:
FIFO – First Come First Served is a simple queue.
Shortest remaining time – also known as Shortest Job First, the scheduler tries to arrange the jobs in the order of the least estimated processing time remaining.
Fixed priority pre-emptive scheduling – A rank is assigned to each job and the scheduler arranges the job in the ready queue based on this rank. Lower priority processes can get interrupted by incoming higher priority processes.
Round-Robin scheduling – The scheduler assigns a fixed time unit per process and cycles through them. This way starvation is avoided (see the sketch after this list).
Multilevel queue scheduling – is used when processes can be divided into groups.
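To make the round-robin entry above concrete, here is a minimal sketch in C# (the Process class, its fields, and the Run method are illustrative assumptions, not a real scheduler):

using System;
using System.Collections.Generic;

class Process
{
    public string Name;
    public int RemainingTime; // CPU time still needed, in abstract time units
}

class RoundRobinScheduler
{
    private readonly Queue<Process> ready = new Queue<Process>();
    private readonly int quantum;

    public RoundRobinScheduler(int quantum) { this.quantum = quantum; }

    public void Admit(Process p) { ready.Enqueue(p); }

    // Cycle through the ready queue, giving each process at most one quantum at a time.
    public void Run()
    {
        while (ready.Count > 0)
        {
            Process p = ready.Dequeue();
            int slice = Math.Min(quantum, p.RemainingTime);
            p.RemainingTime -= slice;                  // simulate execution for one time slice
            Console.WriteLine(p.Name + " ran for " + slice + " units");
            if (p.RemainingTime > 0)
                ready.Enqueue(p);                      // not finished: go to the back of the queue
        }
    }
}

Because every waiting process gets the CPU within one full cycle of the queue, no process starves, which is the property called out above.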

#codingexercise
Double GetAlternateNumberRangeMean(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMean(); // AlternateNumberRangeMean is assumed to be an extension method defined elsewhere
}

Tuesday, December 30, 2014

Today we conclude our discussion on shared memory consistency models. We were discussing the programmer-centric approach. We took an example where memory operations are distinguished as synchronization operations or not. At the language level, if there is support for parallelism, we can write doall loops or explicit synchronization constructs. Correct use of a doall loop implies that no two parallel iterations of the loop access the same location if at least one of the accesses is a write. A library of common routines could also be provided so that the programmer doesn't have to specify the hardware-level directives. Finally, a programmer may directly specify the synchronization operations. One way to do this would be to provide static instructions at the program level. Associating the information with a specific memory instruction can be done in one of two ways: first, by providing multiple flavors of memory instructions through extra opcodes, and second, by using the high-order bits of the virtual memory address. Commercial systems may choose instead to transform this information into explicit fence instructions supported at the hardware level, such as a memory barrier. Thus we see that the relaxed memory models provide better performance than is possible with sequential consistency and in fact enable many of the compiler optimizations.
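To make the fence idea concrete at the language level, here is a minimal C# sketch (the data/ready field pair and the writer/reader split are hypothetical; the point is only where the barriers sit):

using System;
using System.Threading;

class FenceExample
{
    private int data;
    private int ready;

    public void Writer()
    {
        data = 42;               // ordinary data write
        Thread.MemoryBarrier();  // full fence: keeps the data write from being reordered after the flag write
        ready = 1;               // synchronization write
    }

    public void Reader()
    {
        if (ready == 1)
        {
            Thread.MemoryBarrier(); // fence: keeps the data read from being reordered before the flag read
            Console.WriteLine(data); // intended to observe 42 once ready == 1
        }
    }
}

This mirrors the idea of the compiler translating synchronization labels into explicit memory-barrier instructions; the exact fence an actual compiler emits depends on the target architecture.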

#codingexercise
Double GetAlternateNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMin();
}

#codingexercise
Double GetAlternateNumberRangeMax(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMax();
}

Monday, December 29, 2014

Today we continue our discussion on shared memory consistency models. We review the programmability of relaxed memory models. A key goal of the programmer-centric approach is to define the operations that should be distinguished as synchronization. In other words, a user's program consists of operations that are labeled as synchronization or otherwise categorized as data operations in an otherwise sequentially consistent program. Operations are labeled as synchronization operations when they can potentially cause a race condition: they access the same location, at least one of them is a write, and there are no intervening operations between them. With this separation of synchronization versus data operations, the system can be conservative with one and aggressive in optimizations with the other. There are two caveats with this approach. One is that the programmer should be allowed to mark an operation as synchronization when in doubt; this helps ensure correctness by being conservative and enables incremental tuning for performance. The other is that the programmer-supplied information could be incorrect, but we don't have to do anything special to handle that case.
Many languages specify high-level paradigms for parallel tasks and synchronization, and restrict programmers to using these paradigms. The language may provide a library of common synchronization routines for the programmer, which in turn can be conveyed to the compiler and hardware. In addition, the programmer may also be allowed to directly label the memory operations. For example, this information can be associated with static instructions at the program level, which implicitly marks all dynamic operations generated from that region as synchronization. Another option is to assign a synchronization attribute to a shared variable or memory address, such as volatile. While the language constructs differ in their ease of use for the programmer, making synchronization the default can potentially decrease errors by requiring programmers to explicitly declare the more aggressive data operations.
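As an illustration of the volatile attribute mentioned above, here is a small C# sketch (the published/payload pair is a hypothetical example, not from the report):

class VolatileFlag
{
    private volatile bool published; // labeled as a synchronization variable
    private int payload;             // ordinary data

    public void Publish(int value)
    {
        payload = value;     // data write
        published = true;    // volatile write: not reordered before the preceding data write
    }

    public bool TryRead(out int value)
    {
        if (published)       // volatile read: later reads are not moved before it
        {
            value = payload;
            return true;
        }
        value = 0;
        return false;
    }
}

Here the volatile keyword plays the role of the synchronization label: the compiler and hardware stay conservative around accesses to published while remaining free to optimize ordinary data accesses.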
#codingexercise
Double GetAlternateNumberRangeSum(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeSum();
}

Friday, December 26, 2014

Today we continue our discussion on the WRL shared memory consistency models. We were reviewing the relaxation of all program orders. We were discussing the weak ordering model and two flavors of the release consistency model - the sequential consistency (RCsc) and processor consistency (RCpc) flavors. All of the models in this group allow a processor to read its own write early. However, the two release consistency flavors are the only ones whose straightforward implementations allow a read to return the value of another processor's write early. These models distinguish memory operations based on their type and provide stricter ordering constraints for some types of operations.
The weak ordering model classifies memory operations into two categories: data operations and synchronization operations. Since the programmer is required to identify at least one of the operations as a synchronization operation, the model can reorder memory operations between synchronization operations without affecting program correctness. Compared to the weak ordering model, the release consistency flavors provide further distinctions among memory operations. Operations are first distinguished as special or ordinary. Special operations are further distinguished as sync or nsync operations. Sync operations are further distinguished as acquire or release operations. The two flavors of release consistency differ based on the program orders they maintain among special operations. The first flavor maintains sequential consistency among special operations while the second flavor maintains processor consistency among them. The first flavor enforces that an acquire precedes all subsequent operations and that all operations precede a release, in addition to requiring special operations to execute in program order with respect to one another. The second flavor enforces almost the same constraints, with the exception of a special write followed by a special read. Imposing program order from a write to a read operation requires using read-modify-write operations, much the same way as we saw earlier.
The other category of models for relaxing all program orders such as Alpha, RMO and PowerPC - all provide explicit fence instructions as their safety nets.
The Alpha model provides two different fence instructions: the memory barrier (MB) and the write memory barrier (WMB). The memory barrier instruction can be used to maintain program order from any memory operation before the MB to any memory operation after the MB. The write memory barrier instruction provides this guarantee only among write operations.
The RMO model has more flavors for fence instructions. It can be customized to order a combination of read and write operations with respect to future read and write operations using a four bit encoding. Since a combination can be used to order a write with respect to a following read, there is no need for read-modify-write semantics.
The PowerPC model provides a single fence instruction : the SYNC instruction. This is similar to the memory barrier instruction with the exception that when there are two reads to the same location, one may return the value of an older write than the first read. This model therefore requires read-modify-write semantics to enforce program order.
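As an illustration of the read-modify-write safety net at the language level, here is a hedged C# sketch built on atomic primitives (the flag field and method names are hypothetical):

using System.Threading;

class ReadModifyWriteExample
{
    private int flag;

    // Adding zero performs a genuine atomic read-modify-write that leaves the value
    // unchanged; this stands in for the "replace the read by a read-modify-write" technique.
    public int ReadFlagAtomically()
    {
        return Interlocked.Add(ref flag, 0);
    }

    // An atomic test-and-set: the read and the write appear as one indivisible operation.
    public bool TryAcquire()
    {
        return Interlocked.CompareExchange(ref flag, 1, 0) == 0;
    }
}

The underlying hardware primitive (typically a load-locked/store-conditional pair or an atomic exchange) is chosen by the compiler and runtime; the sketch only shows the shape of the idiom.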
#codingexercise
Decimal GetAlternateNumberRangeSum(Decimal[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeSum();
}

#codingexercise
Double GetAlternateNumberRangeMean(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMean();
}
We now look at an alternate abstraction for relaxed memory models. The models mentioned so far have been put to use on a wide variety of systems; however, they impose a higher level of complexity on programmers. This comes from the system-centric commands that the programmer has to use. Further, they directly expose the programmer to the reordering and atomicity optimizations that a model allows. Instead, the programmer could use safety nets (e.g. fence instructions, more conservative operation types, or read-modify-write operations) to impose the ordering and atomicity requirements on memory operations. However, it is not easy to identify the ordering constraints. For example, weak ordering requires that programmers identify all synchronization operations. To define a programmer-centric specification, we must first define the notion of correctness for programs; here we could use sequential consistency. Second, the information required from the programmer must be defined precisely. The information used in the weak ordering model or the release consistency model could be a candidate for this, although it may be described in terms of program-level information.

Thursday, December 25, 2014

#codingexercise
Decimal GetAlternateNumberRangeStdDev(Decimal[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeStdDev();
}

We will continue our discussion on Shared Memory Consistency Model today. We discussed relaxing sequential consistency followed by relaxing write to read program order and relaxing write to read and write to write program orders. We now review relaxing all program orders.
By that we mean relaxing program order between all operations to different locations. Thus, a read or write operation may be reordered with respect to a following read or write to a different location. We discuss the weak ordering model and two flavors of the release consistency model (RCsc/RCpc). The weak ordering model classifies memory operations into two categories: data operations and synchronization operations. This model reasons that since the programmer has called out some operations as synchronization operations, the reordering of memory operations between synchronization operations does not typically affect the correctness of the program. In other words, the operations called out as synchronization operations provide the safety net to enforce program order. To implement this, each processor can provide a counter to keep track of its outstanding operations. This counter is incremented when the processor issues an operation and is decremented when a previously issued operation completes. Each processor must ensure that a synchronization operation is not issued until all previous operations are complete, which is signaled by a zero value for the counter. Furthermore, no operations are issued until the previous synchronization operation completes. This way the memory operations in between may be reordered while the program order that matters is enforced. Moreover, writes always appear atomic to the programmer, so no safety net is required for write atomicity. We now discuss the two flavors of release consistency, which differ based on the program orders they maintain among special operations. The first maintains sequential consistency among special operations and the second maintains processor consistency among such operations. In the first flavor, the program order constraints are enforced with the requirements that an acquire precedes all subsequent operations and that all operations precede a release, in addition to special operations preceding other special operations in program order. For processor consistency in the second flavor, the write-to-read program order is eliminated: an acquire precedes all operations, all operations precede release operations, and special operations precede special operations except for a special write followed by a special read. In the processor consistency case, imposing a program order from a write to a read operation requires using read-modify-write operations. Further, if the write being ordered is ordinary, then the write in the corresponding read-modify-write needs to be a release; otherwise the write in the read-modify-write can be any special write. The use of read-modify-write also helps enforce the atomicity of writes in this case.
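A schematic C# model of the counter mechanism described above (the class and method names are assumptions made for illustration, not taken from the report):

using System.Threading;

class WeakOrderingProcessorModel
{
    private int outstanding; // issued but not yet completed memory operations

    public void IssueDataOperation()
    {
        Interlocked.Increment(ref outstanding);
        // ... the operation proceeds asynchronously and may complete out of order ...
    }

    public void CompleteDataOperation()
    {
        Interlocked.Decrement(ref outstanding);
    }

    // A synchronization operation may not issue until all previously issued
    // operations have completed, i.e. the counter has drained to zero.
    public void IssueSynchronizationOperation()
    {
        SpinWait spin = new SpinWait();
        while (Volatile.Read(ref outstanding) != 0)
            spin.SpinOnce();
        // ... perform the synchronization operation; subsequent operations are
        // held back until it completes ...
    }
}

The model only captures the bookkeeping; a real processor implements the counter and the stall in hardware.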
#codingexercise

Decimal GetAlternateNumberRangeVariance(Decimal[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeVariance();
}



Monday, December 22, 2014

We quickly summarize the discussion on sequential consistency from the WRL report on shared memory consistency models. The requirements are: first, a processor must ensure that its previous memory operation is complete before proceeding with its next memory operation in program order. This is the "program order" requirement. To determine that an operation is complete, acknowledgement messages are exchanged; further, in a cache-based system, a write must generate invalidate or update messages. The second requirement pertains only to cache-based systems and concerns write atomicity. It requires that writes to the same memory location be serialized and made visible to all the processors, and that the value of a write not be returned by a read until all the messages are exchanged. This is the "write atomicity" requirement. Compiler optimizations such as register allocation can violate sequential consistency and are unsafe. Hardware and compiler techniques that do not violate sequential consistency include the following. The first technique prefetches ownership for any write operations that are delayed, thus partially overlapping them with the operations preceding them in program order; it is only applicable to cache-based systems that use an invalidation-based protocol. The second technique speculatively services read operations that are delayed due to the program order requirement; sequential consistency is ensured by rolling back and reissuing the read and subsequent operations if the read line gets invalidated. The second technique is generally applicable to dynamically scheduled processors because they already have an implementation for rollbacks. Other latency-hiding techniques such as non-binding software prefetching or hardware support for multiple contexts also improve performance, but the above techniques are widely applicable. Next we review more relaxed memory models.
#codingexercise
Decimal GetAlternateNumberRangeMin(Decimal[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMin();
}
Relaxed memory consistency models are based on two key characteristics: how they relax the program order requirement and how they relax the write atomicity requirement. With the former, we can further distinguish models based on whether they relax the order from a write to a following read, between two writes, or from a read to a following read or write. In all cases, the relaxation only applies to operation pairs with different addresses. The architecture here resembles the one without caches that we discussed earlier.
With respect to the write atomicity requirement, we distinguish models based on whether they allow a read to return the value of another processor's write before all copies of the accessed location receive the invalidation or update messages generated by the write. This relaxation only applies to cache-based systems.
Finally, we consider a relaxation related to both program order and write atomicity: a processor is allowed to read the value of its own previous write before the write is made visible to other processors. In a cache-based system, this relaxation allows the read to return the value of the write before the write is serialized with respect to other writes to the same location, and before the invalidations or updates of the write reach any other processor.

Relaxed models typically provide programmers with mechanisms for overriding such relaxations. For example, fence instructions may be provided to override program order relaxations. The ability to override relaxations is often referred to as safety nets.

The different relaxation models can now be categorized as :
Relax write to read program order
Relax write to write program order
Relax read to read and read to write program order
Read others write early
Read own write early

The corresponding relaxations are usually supported by straightforward implementations of the corresponding model. We will review the above relaxations in detail. We implicitly assume the following constraints are satisfied. First, all models require a write to eventually be made visible to all processors and writes to the same location to be serialized; this is true if shared data is not cached, and is enforced by the cache coherence protocol when it is. Second, all models enforce uniprocessor data and control dependences. Finally, models that relax the program order from reads to following writes must also maintain a subtle form of multiprocessor data and control dependences.

We now look at the first relaxation model: relaxing the write-to-read program order.
Here the optimization enabled is that a read is allowed to be reordered with respect to previous writes from the same processor. As a consequence of this reordering, programs such as the shared-flag example on a bus with write buffers, where two processors each set their own flag and then check the other's, can have their sequential consistency violated. The safety net here is to provide serialization instructions that may be placed between the two operations. The serialization instructions are either special memory instructions used for synchronization (e.g. compare-and-swap) or non-memory instructions such as a branch. Placing the serialization instruction after the write on each processor provides sequentially consistent results for the program even with this optimization. However, this is not the only technique used by these models. A read-modify-write operation can also provide the illusion that program order is maintained between a write and a following read. Usually making the read part of, or replacing it with, a read-modify-write instruction helps with sequential consistency. In certain models, doing the same for the write operation is not sufficient, because they require that no other writes to any location appear to occur between the read and the write of the read-modify-write.
Next, to enforce write atomicity in models that relax it, making the read part of a read-modify-write, or replacing it with one, suffices to enforce atomicity as long as the technique is not used indiscriminately. There is, however, some cost in performing the extra write when replacing a read with a read-modify-write, unless the read is already part of one. Moreover, compilers require the full flexibility of both reordering a write followed by a read as well as a read followed by a write.

We now look at relaxing the write to read and write to write program orders.
This set of models further relaxes the program order requirement by eliminating ordering constraints between writes to different locations. The writes to different locations from the same processor can be pipelined or overlapped and are allowed to reach memory or other cached copies out of program order. With respect to atomicity requirements, this model allows a processor to read the value of its own write early but prohibits a processor from reading the value of another processor's write before the write is visible to all processors.
The safety net provided by this model is the same as earlier: serializing instructions are placed between the writes. In this case, an STBAR instruction is placed to impose program order between two writes. With FIFO write buffers, this instruction is inserted into the write buffer, and the retiring of writes buffered after the STBAR is delayed until the writes buffered before the STBAR have retired and completed.

Sunday, December 21, 2014

We continue our discussion on WRL report on Shared memory consistency models. We looked at implementing sequential consistency in architectures without caches. We specifically reviewed write buffers with bypassing capability, overlapping write operations and non-blocking read operations. We now study architectures with caches. Caching can also introduce reordering behavior. Therefore an implementation with caches must also hide it. For example, if a processor issues a read to the cache, it cannot read until its previous operations are complete. The replication of shared data introduces three additional issues: First, the presence of multiple copies requires a cache coherence protocol to propagate newly written values to all cached copies of the modified location. Second, detecting when a write is complete requires more transactions with replication. Thirdly, propagating changes to multiple copies is inherently non-atomic. We review these in this order.
Cache coherence has often been taken to imply both that a write is eventually made visible to all processors and that writes to the same location appear to be seen in the same order by all processors. But these conditions are not sufficient to enforce sequential consistency, because that requires writes to all locations, not just the same location, to be made visible, and also requires the operations of a single processor to appear in program order. To achieve this we propagate a new value by either invalidating the cached copies or updating them to the newly written value. We translate the sequential consistency requirements into an early and a late bound on when a new value can be propagated to any given processor. The early bounds should be such that the sequential order is maintained between the early and late bounds of successive operations. The late bounds should be such that the program order is maintained.
To detect the completion of write operations, we were using acknowledgements in the design without caches. There the acknowledgement could be generated as soon as the value reached the memory module. Here we have to wait until the cached copies are invalidated or updated. For example, with an overlapped write in a general interconnect system, consider a processor P2 that has Data in its cache. P1 proceeds with the write to Head after the previous write to Data reaches the target memory module but before it is propagated to P2. Now P2 can read the new value of Head but return an old value of Data. This can be avoided by having P2's cached copy invalidated or updated before the write to Head completes.
Therefore invalidation and update messages need to be exchanged in addition to acknowledgement messages.
To maintain the illusion of atomicity for writes, we first look at the problems due to non-atomicity. Take an update-based protocol and say there are four processors, two of which write values to variables and two others that read those variables, with all variables initially set to zero. Then it is possible to violate sequential consistency if the updates reach the two reading processors in an arbitrary order, consequently making the write of A appear non-atomic. This can be avoided either by requiring writes to the same location to be serialized or by having the network preserve the ordering of messages.
#codingexercise
Decimal GetAlternateNumberRangeMax(Decimal[] A)
{
    if (A == null) return 0;
    return A.AlternateNumberRangeMax();
}
We continue our discussion with considerations for compilers. A key requirement for the compiler is to preserve the program order among shared memory operations in its optimizations. This requirement directly restricts any uniprocessor compiler optimization that can result in reordering memory operations. In addition to the reordering effect, optimizations such as register allocation can also eliminate certain shared memory operations, which can potentially violate sequential consistency. Take the example described above: if Head is register-allocated on P2, P2 may never observe the update, whereas sequentially consistent code clearly avoids this. To summarize, some compiler optimizations can violate sequential consistency. Fortunately, compilers for parallel programs avoid such unsafe optimizations.
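A minimal C# illustration of that register-allocation hazard (the flag/data names echo the Head/Data example above; using volatile as the remedy is a language-level analogue, not the report's mechanism):

using System;

class RegisterAllocationHazard
{
    private bool head;   // without 'volatile', the compiler or JIT may cache this in a register
    private int data;

    public void Producer()
    {
        data = 1;
        head = true;
    }

    public void Consumer()
    {
        // The compiler is free to hoist the read of 'head' out of the loop,
        // effectively register-allocating it, so the loop may spin forever
        // and never observe the producer's update.
        while (!head) { }
        Console.WriteLine(data);
    }
}

Declaring the flag as 'volatile bool head' forbids that hoisting, which restores the behaviour a programmer reasoning with sequential consistency expects for this flag.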

Saturday, December 20, 2014

#codingexercise

Decimal GetOddNumberRangeSum(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeSum();
}

Continuing our discussion from the previous post on sequential consistency in the WRL research report on shared memory consistency models, let us now look at the implementation considerations for the same.
We first look at architectures without caches. The key design principle here is to enforce and maintain program order of execution. We also keep track of the memory operations and make sure that they are sequential.
Let us take a look at write buffers with bypassing capability. We have a bus-based shared memory system with no caches. A processor issues memory operations in program order. The write buffer merely lets the processor avoid waiting for writes to finish. Reads are allowed to bypass any previous writes in the buffer, as long as the read address is different from every address in the write buffer. This optimization allows us to hide the write latency. To see how it can violate sequential consistency, consider the Dekker-style example mentioned earlier, where we maintained that both processors' reads of the flags cannot return 0. With bypassing, that outcome is now permitted. This problem doesn't arise on uniprocessors.
Now let us take a look at overlapping write operations. Here we use a general interconnection network that alleviates the bottleneck of a bus-based design. We still assume processors issue memory operations in program order. As opposed to the previous example, multiple write operations issued by the same processor may be serviced simultaneously by different memory modules. We maintained that the read of Data by one processor should return the value written by the other processor for sequential consistency. In this setting we can violate that assumption, because a later operation may be injected into the network before an earlier write reaches its memory module. This highlights the importance of maintaining program order between memory operations. Coalescing other write operations to the same cache line can also cause the inconsistency.
One way to remedy this could be to wait for the write to reach the memory before allowing another into the network. Enforcing the above requires an acknowledgement to the processor issuing a write. These acknowledgements help with maintaining program order.
 Next we consider non-blocking read operations on the same example as above. Here a processor ensures that its write arrives at the memory location in program order. If another processor is allowed to issue its read operation, then the read might arrive before the write leading to an inconsistency.
#codingexercise
Decimal GetOddNumberRangeMode(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeMode();
}

Thursday, December 18, 2014

Today I'm going to continue discussing the WRL research report on shared memory consistency models. Most programmers assume sequential semantics for memory operations: a read returns the value of the last write. As long as uniprocessor data and control dependences are maintained, the illusion of sequentiality can be preserved, and the compiler and hardware can freely reorder operations to different locations. Compiler optimizations that reorder memory operations include register allocation, code motion, and loop transformations; hardware optimizations include pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches.
The following are some of the myths about the memory consistency models:
1) It only applies to systems that allow multiple copies of shared data. This is not true, as shown by the cases of overlapped writes and non-blocking reads.
2) Most current systems are sequentially consistent. On the contrary, Cray computers and Alpha computers relax this consistency.
3) The memory consistency model only affects the design of the hardware. But it actually affects compiler and other system design.
4)  A cache coherence protocol inherently supports sequential consistency. In reality, this is only part of it. Another myth is that the memory consistency model depends on whether the system supports an invalidate or update based coherence protocol. But it can allow both.
5) The memory model may be defined solely by specifying the behaviour of the processor. But it is affected by both the processor and memory system.
6) Relaxed memory models may not be used to hide read latency. In reality they can hide both read and write latency.
7) Relaxed consistency models require the use of extra synchronization. There is no synchronization required for some of the relaxed models.
8) Relaxed consistency models do not allow asynchronous algorithms. This can actually be supported.

#codingexercise
Decimal GetOddNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeMean();
}


Tonight we will continue to discuss the WRL research report on shared memory consistency models, in particular sequential consistency. This is the most commonly assumed memory consistency model. The definition is that the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in program order. Maintaining this program order, together with maintaining a single sequential order among all the processors, is the core tenet of this model. The latter part actually requires that memory be updated atomically with respect to other operations. Let us take two examples of using this model. We use a code segment from Dekker's algorithm for critical sections, involving two processors and two flags that are initialized to zero. When a processor enters the critical section, it updates its own flag and checks the other's. Here the assumption is that a value is read only after it has been written, so maintaining program order under sequential consistency prevents both processors from entering the critical section at the same time. To illustrate the importance of atomic execution, consider instead an example where three processors share variables A and B, both set to zero. If one processor reads a variable and writes to the other, which another processor then reads, atomicity requires that the update made to the variable be seen by all the processors at once.
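The Dekker-style fragment can be sketched in C# like this (a simplified illustration of the two-flag entry test only, not the full algorithm; on a real compiler and machine the flags would additionally need volatile or fences):

class DekkerFragment
{
    private int flag1, flag2;   // both initialized to 0

    public void Processor1()
    {
        flag1 = 1;              // announce intent to enter
        if (flag2 == 0)
        {
            // critical section: under sequential consistency both processors
            // cannot be here at once, because both reads returning 0 would
            // contradict the program order of the writes.
        }
    }

    public void Processor2()
    {
        flag2 = 1;
        if (flag1 == 0)
        {
            // critical section
        }
    }
}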

Wednesday, December 17, 2014

In a post previous to the last one, I was talking about securing AWS S3 artifacts for a corporate subscriber.

The following are the mechanisms available to implement a corporate governance strategy:
IAM Policy:
These are global policies attached to IAM users, groups, or roles, which are then subject to the permissions we have defined. They apply to principals, and hence in their definition the application to a principal is implicit and usually omitted.
For programmatic access to the resources created by Adobe users, we will require an account that has unrestricted access to their resources. Creating such a user will help with requesting an AWS Access Key and an AWS Secret that will be necessary for use with the programmatic access.

Bucket Policy:
These are policies that are attached to the S3 buckets. They are S3 specific as opposed to IAM policies that are applicable across all AWS usages by the principals. These bucket policies are applied one per bucket.  A sample bucket policy looks like this:
{
  "Id": "Policy1418767581422",
  "Statement": [
    {
      "Sid": "Stmt1418761739258",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::*",
      "Principal": {
        "AWS": [
          "rravishankar"
        ]
      }
    }
  ]
}
Notice the mention of a principal and a resource in this bucket policy. The ARN is specified such that it covers all the resources we want to govern. Note that these resource names have to be specified with wildcards. This is done via:
arn:aws:s3:::my_corporate_bucket/*
arn:aws:s3:::my_corporate_bucket/Development/*
 

ACLs:
These are applied by the user or admin to specific folders or objects within a bucket, or to the bucket itself. They provide fine-grained control over individual resources. The access control can be expressed in terms of pre-canned ACLs such as:
'private', 'public-read', 'project-private', 'public-read-write', 'authenticated-read', 'bucket-owner-read', 'bucket-owner-full-control'
Or they can be granted explicitly to an individual, group or e-mail based principal with permissions such as:
FULL_CONTROL | WRITE | WRITE_ACP | READ | READ_ACP

In this regard, the set of actions that governance needs to take includes the following:
1)   Create a corporate governance service principal or account.
2)   Request access key and secret for this account.
3)   Set up a distribution list with storage team members for this account.
4)   Prefer to have only one bucket deployed for all corporate usage.
5)   User access will be based on folders as described by the prefix in the keys of their objects.
6)   Define a path as the base for all users just like we do on any other file storage.
7)   Cross-user access will be permitted based on ACLs. No additional action is necessary except for provisioning a web page on the portal that lists the bucket and objects and their ACLs, and applies those ACLs.
8)   If step 1) is not possible or there are other complications, then have a bucket policy generated for the account we created to have full access (initially and later reduced to changing permissions) to all the objects in that bucket.
9)   Policies are to be generated from :
10)  Apply the generated policies to their targets - IAM or buckets.
11)  Provision the webpage.


#codingexercise
Decimal GetOddNumberRangeStdDev(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeStdDev();
}
Today I will be reading from the WRL research report on shared memory consistency models. The topic concerns a specification of memory semantics that helps us write correct and efficient programs on multiprocessors. Sequential consistency is one such model, but it is severely restrictive for multiprocessors. That is why system designers introduce relaxed consistency models, but these are sometimes so different from one another that a general specification becomes altogether more complicated. In this paper, the authors describe issues related to such models; the models here are for hardware-based shared memory systems. Each system makes its own optimizations, but this paper describes the models with a common terminology. Furthermore, the paper also describes the models from the point of view of a programmer.
Shared memory models provide a single address space abstraction over parallel systems. It simplifies difficult programming tasks such as data partitioning and dynamic load distribution. Having a precise understanding of how memory behaves is important for programming. Consider a producer consumer queue where memory is written by one and read by another. If the memory gives the old value prior to the write for the reader, then there is an inconsistency between the programmer's expectation and the actual behaviour.
On a single processor, a read returns the last write because the instructions are executed in program order, which isn't the case for multiprocessors. We can extend the notion to multiprocessors, in which case it's called sequential consistency, and reads will again be consistent because each processor appears to execute in program order. However, this simple and intuitive model severely restricts hardware and compiler optimizations.

#codingexercise
Decimal GetOddNumberRangeStdDev(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeStdDev();
}

A consistency model like the one described is an interface between programmability and system design. The model affects programmability because programmers use it to reason about the correctness of their programs. The model affects performance because system designers use it to determine which optimizations are allowed in hardware and software.

Tuesday, December 16, 2014

In today's post we talk about security and policies as applicable to AWS S3 artifacts. Policies are mappings that dictate which users have access to which resources. Typically policies are declared and applied to containers. If you look at the .NET security model, there is a mention of different tiers of policies, including an enterprise-level policy, which comes in handy for corporate accounts. Policies don't necessarily have to be expressed the same way on different systems, but there is a reason for the consistency we see across many of them: for example, permissions set at a folder level flow to its contents by inheritance. Another common principle is cross-account authorization; for example, the owner of a folder may grant another user permission to read and write it. Policies and their applications are implemented differently, but there is a lot in common in their design, which comes from requirements that are common. Coming back to S3 artifacts, there are two ways enterprise policies can be applied.
One is to create a corporate level bucket with user level folders. Another is to apply a policy with full access to a corporate governance account to all the buckets.

Monday, December 15, 2014

Decimal GetEvenNumberRangeStdDev(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeStdDev();
}

Today we continue to discuss the WRL long address trace generation system. We were reviewing the effects of line size in the first- and second-level caches and the contribution of the write buffer. We now summarize the study. This is a method to generate and analyze very long traces of program execution on RISC machines. The system is designed to allow tracing of multiple user processes and the operating system. Traces can be stored or analyzed on the fly; the latter is preferred because the storage required is very large even for traces of short duration. Besides, on-the-fly analysis allows the trace to be compacted and written to tape. The system requires that the programs be linked in a special way.
From the trace analysis using this system, it was found that while a second-level cache is necessary, very large second-level caches provide little or no benefit. A program with a large working set can benefit from a large cache. Block size plays a big role in the first-level instruction cache, but sizes above 128 bytes had little effect on overall performance. If the proportion of writes among the memory references is large, the contribution from the write buffer is significant. There is little or no benefit from associativity in a large second-level cache.

#codingexercise
Decimal GetEvenNumberRangeVariance(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeVariance();
}


Decimal GetEvenNumberRangeMode(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMode();
}

Today we will continue to discuss the WRL system for trace generation. We discussed the effects of direct-mapped and associative second-level caches. We now review line size in the first- and second-level caches. When the first-level line size was doubled or quadrupled with a 512K second-level cache, there were reductions in the cumulative total CPI, and this was independent of the second-level size. The improvement was due to the decreased contribution of the instruction cache; the data cache contribution remained constant. The improvement from doubling the line size was the same as the improvement from increasing the second-level cache size eight times. Doubling the length of the lines in the second-level cache from 128 bytes to 256 bytes made no difference, possibly because there is too much conflict for long lines to be beneficial.
We next review write buffer contributions.  The performance of the write buffer varied from program to program.  This is mainly due to the sensitivity of the write buffer to both the proportion and distribution of writes in the instructions executed. A write buffer entry is retired every six cycles. If the writes were any more frequent or bunched, it would degrade the performance. The relationship between the proportion of writes in a program and the write buffer performance is clear. The CPI contributed by the write buffer shows a corresponding jump.
There was a case where the percentage of writes was frequently in the 15-20% range but the write-buffer CPI was usually zero. If the writes are uniformly distributed and below a threshold rate, the write buffer never fills and a subsequent write is never delayed; above the threshold, there may be some delays. Since the second-level cache is pipelined into three stages, the processing of a single miss gives the write buffer a chance to retire two entries. If enough misses are interspersed between writes, the write buffer may work smoothly. Thus the first-level data cache miss ratio is the third factor that comes into play.
Although the report doesn't study it, the effects of entropy on the cache distribution hierarchy could be relevant. The Shannon entropy is defined as H(X) = -Σ P(x) log2 P(x), summed over the outcomes x.
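The definition translates into a few lines of C# (a sketch; the probability array is a hypothetical input, e.g. the observed fraction of references that fall into each cache set):

using System;
using System.Linq;

static class Entropy
{
    // H(X) = -sum over x of P(x) * log2 P(x); zero-probability outcomes contribute nothing.
    public static double Shannon(double[] probabilities)
    {
        return -probabilities
            .Where(p => p > 0)
            .Sum(p => p * Math.Log(p, 2));
    }
}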

Sunday, December 14, 2014

Data forwarders rely on one of the following means:
file monitoring
pipe monitoring
network port monitoring
settings/registry monitoring
etc.
However, there are some protocols that also exchange data - ftp, ssh, http - and although they run over network ports, they have to be filtered.
Take ssh, for example: if there were a daemon that could forward relevant information from a remote computer via ssh, then we wouldn't need a data forwarder on the remote machine, and the footprint of the forwarder would be even smaller. A forwarder's reach over data can be considered to be expanded by such ssh data forwarders.
Now let us consider ssh forwarder code. This may be set up as a listener script so that on all incoming connections using a particular login, we echo the relevant data back. In this case, the script is as small as a bash script that can read up and spew out the relevant data as a response to the incoming connection and close the connection.
Since the data is forwarded only on ssh connection, this model works well with monitoring mechanisms that can periodically poll. Most monitoring mechanisms are setup as publisher subscriber model or polling model.
SSH connections have the additional benefit that they are generally considered more secure than plain TCP connections because the data is not sent in the clear. Furthermore, the script deployed to the remote computer follows the usual mechanisms for securing this access, so it is likely to meet with less resistance in deployment. On the other hand, consider installing a binary as part of a product package on the remote machine; that may raise a lot of questions in some cases, such as licensing, fees, etc.
Lastly, the commands that determine the data to be forwarded and send it are typical shell commands, so there is more transparency in what script is deployed.
#codingexercise
Decimal GetEvenNumberRangeMax(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMax();
}
#codingexercise
Decimal GetEvenNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMean();
}

Saturday, December 13, 2014

Today we continue to review the WRL system for trace generation. We read up on the effect of direct-mapped versus associative second-level caches. When associativity is increased, the miss ratio can decrease, but overall performance may not improve and can even degrade. If the cache is very large, it is likely to have a low miss ratio already, so increasing the associativity might provide little or no benefit. Associativity helps the miss ratio but adds to the cost of every reference to the cache.
If we ignore this cost, we can bound the benefits of associativity by comparing direct-mapped and fully associative caches. The results for a fully associative second-level cache at four different sizes showed that associativity reduced the cumulative CPI only minimally for large caches but was more effective for small caches.
 A more useful measure for the benefits of associativity may be to see the effect on the total CPI as a function of the cost per reference of the associativity.
We know that the CPI for the associative case will be a delta more than the CPI for the direct-mapped case. This delta is the product of the first-level miss ratio and the difference in the average cost per reference to the second-level cache. This difference can be elaborated in terms of the cost of a hit in the associative cache (ha), the cost of a hit in a direct-mapped cache (hd), the cost of a second-level cache miss, which is the same for direct-mapped and associative caches (m), the miss ratio of the second-level associative cache, and the miss ratio of the second-level direct-mapped cache.
In this case, the delta in the average cost per reference is the difference between the average cost per reference in the associative case and in the direct-mapped case. The average cost per reference to the associative second level is the cost of a hit in that cache plus its miss ratio times the cost of a second-level cache miss (which is the same for the direct-mapped and associative cases); the direct-mapped second level is treated the same way. Writing out both terms, the delta in the average cost per reference is the difference in hit costs plus the miss cost weighted by the difference between the miss ratios of the second-level associative cache and the second-level direct-mapped cache.
We can now plot the cumulative CPI for both cases and see that there is a crossover. The distance between the points at which the crossover occurs is the maximum gain to be expected from associativity.
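One plausible reading of that derivation as a small C# helper (the hit-plus-miss-ratio-times-miss-cost form of the average cost per reference is an assumption of this sketch; the report's exact expression may differ in detail):

static class AssociativityDelta
{
    // Estimated change in CPI from making the second-level cache associative.
    public static double CpiDelta(
        double l1MissRatio, // fraction of references that miss in the first level and reach level 2
        double ha,          // cost of a hit in the associative second-level cache
        double hd,          // cost of a hit in the direct-mapped second-level cache
        double m,           // cost of a second-level miss (same in both designs)
        double ma,          // miss ratio of the associative second-level cache
        double md)          // miss ratio of the direct-mapped second-level cache
    {
        double costAssociative = ha + ma * m;   // average cost per level-2 reference, associative
        double costDirectMapped = hd + md * m;  // average cost per level-2 reference, direct-mapped
        return l1MissRatio * (costAssociative - costDirectMapped);
    }
}

A negative delta means the lower miss ratio of the associative cache outweighs its higher hit cost; the crossover mentioned above is where the delta changes sign.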

#codingexercise
Decimal GetEvenNumberRangeMin(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMin();
}

Thursday, December 11, 2014

#codingexercise
Decimal GetEvenNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMean();
}

Today we will continue to discuss the WRL system for trace generation. We were discussing long traces and their usefulness, which is not limited to multi-level caches. Even for single-level caches, across varying cache sizes and degrees of associativity, the effect of trace length was studied and long traces were found to be beneficial. With short traces there were issues such as high miss ratios, cold start, and erroneous results or predictions depending on where the short run was chosen. The miss ratios are not always lower for long traces, since not all programs behave the same and it depends on where the short run was taken. Some programs terminate before the miss ratio stabilizes; others appear to stabilize early but then jump after a spell; yet others stabilize reasonably well but require a long run. Moreover, the cache warm-up itself needs millions of references.
The WRL system does not need to save traces. To analyze a program with a different cache organization, the program only needs to be rerun with a differently parameterized analysis program. This works well for single-user programs that are deterministic, because they generate identical traces on different runs. For multiple processes this is no longer deterministic, because they get interleaved differently on different runs. This is not bad: if we repeat the runs a few times, we can find the dominant trends. However, the variations may make it difficult to understand the significance of individual peculiarities. To overcome this limitation, the cache analysis program could be modified to simulate multiple caches at the same time; in practice, the variations have been within tolerable limits, so that has not been required.
The contribution due to the size of the second-level cache was also studied. The CPI was measured for three sizes of second-level cache: 512K, 4M and 16M. Except for the early part of a run of the TV program, there is a dramatic reduction in CPI when the cache size is increased to 4M but a considerably smaller improvement from the jump to 16M. When a program actually references a large part of its address space, a bigger second-level cache helps tremendously; however, most programs don't have such a large working set.
We can get a bit more insight from the existing data by dividing second-level cache misses into three categories: compulsory, capacity and conflict. Compulsory misses happen the first time an address is referenced and occur regardless of the size of the cache. The capacity miss ratio is the difference between the fully associative miss ratio and the compulsory miss ratio. The conflict miss ratio is the difference between the direct-mapped miss ratio and the fully associative miss ratio.
Capacity misses are due to the fixed cache size; they decrease as the cache size increases, but only up to a certain size. Conflict misses result from too many addresses mapping to the same cache lines. In this study they increase between the 512K and 4M cache sizes but drop tremendously from 4M to 16M, where the interfering blocks are spread out more evenly. This is also where pid hashing reduces the miss ratio for the 16M cache.
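Those definitions translate directly into code (a sketch; the three measured ratios are assumed inputs from a simulation):

static class MissBreakdown
{
    // Capacity misses: misses a fully associative cache of this size still takes,
    // once the compulsory (first-reference) misses are removed.
    public static double CapacityMissRatio(double fullyAssociativeRatio, double compulsoryRatio)
    {
        return fullyAssociativeRatio - compulsoryRatio;
    }

    // Conflict misses: the extra misses a direct-mapped cache takes because
    // too many addresses map to the same lines.
    public static double ConflictMissRatio(double directMappedRatio, double fullyAssociativeRatio)
    {
        return directMappedRatio - fullyAssociativeRatio;
    }
}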

#codingexercise
Decimal GetEvenNumberRangeMode(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeMode();
}
#codingexercise

Decimal GetEvenNumberRangeSum(Decimal[] A)
{
    if (A == null) return 0;
    return A.EvenNumberRangeSum();
}


Wednesday, December 10, 2014

#codingexercise
Decimal GetOddNumberRangeMean(Decimal[] A)
{
    if (A == null) return 0;
    return A.OddNumberRangeMean();
}

Today I will continue to discuss the WRL system.
The results of the study showed that over a long run of a multi-program workload, the miss ratio of the data cache far exceeded those of the instruction cache and second-level cache, yet the total contribution to CPI from the data cache was far less than that of the other two caches. Further, the second-level cache contributes twice as much to the overall degradation as either the instruction or the data cache.

Long traces provided better insight than short ones. Some of the variations in the CPI came only after a spell of execution, and this is not merely the cache warm-up phase. When the TV program discussed earlier was executed with a long trace, the high points of the data-cache and second-level CPI matched the lows in the graph of the instruction-cache CPI. These corresponded to loops that execute small sections of the code while going through large volumes of data. The usefulness of long traces is not limited to large multi-level caches; they can also be used for single-level caches with varying cache size and degree of associativity. The study indicated that long traces are more essential for large caches and higher degrees of associativity.

Address mapping to the second-level cache was also studied. Trace addresses generated for user programs were virtual addresses, while those for the kernel were physical addresses. Address translation is not problematic: if the cache size divided by its associativity is less than or equal to the page size, then the high-order bits of the page offset can be used to identify the appropriate cache line. These high-order bits are the same for virtual and physical addresses, so misses will be identical to those occurring in a physical cache where the pid field is replaced by a physical page number field. If the cache is much larger than the page size, as in the case of the second-level cache, the page offset maps to only a tiny proportion of the cache.

Since the cache is large, references from different processes can be spread out, eliminating interprocess conflicts. Using a simple pid hashing algorithm, where the pid is exclusive-ORed with part of the address that indexes into the sets of the cache, a study was performed with three sizes of second-level cache. The technique was compared to indexing with the address modulo the cache size, and it was found that pid hashing reduces the second-level CPI only when the cache is large.
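A sketch of that index computation in C# (the bit-field widths are arbitrary placeholders, not the values used in the report):

static class PidHash
{
    // Index into the second-level cache: the set-index bits of the address are
    // XORed with the process id so that different processes spread across the sets.
    public static int CacheSetIndex(uint address, uint pid, int lineSizeBits, int setIndexBits)
    {
        uint setIndex = (address >> lineSizeBits) & ((1u << setIndexBits) - 1);
        uint hashed = (setIndex ^ pid) & ((1u << setIndexBits) - 1);
        return (int)hashed;
    }
}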

#codingexercise
Invert a Fibonacci series given a number in the series.
This problem is unsolvable unless you are given the next number in the series or you are allowed to generate the Fibonacci series up to and including the given number.

void printDecreasingFib(int x, int y)
{
    System.Diagnostics.Debug.Assert(y >= x); // x and y are consecutive Fibonacci numbers, x <= y
    System.Console.WriteLine(y);
    System.Console.WriteLine(x);
    while (y - x > 0)
    {
        int t = y - x;  // the term that preceded x in the series
        y = x;
        x = t;
        System.Console.WriteLine(x); // print only the newly computed term to avoid duplicates
    }
}

Tuesday, December 9, 2014

Vector-form content has the advantage that it stores items in layers. Compare the png format with bmp: in the former we can save and remove layers, whereas the latter stores scalar content per pixel. If this layering approach were available to most web pages and views, then we could have CSS that defines a style for each layer.
For example, let us say we want to build a web page for a clothing and accessories store. The web page has to let users choose the clothing and dress up a mannequin, where different items of clothing can go one on top of another. We can do this today by overlaying images in .png format.
HTML:
<img id="img1" src="img1.png"/>
<img id="img2" src="img2.png"/>
<img id="img3" src="img3.png"/>
and we define the css for these as follows:
img
{
    position:absolute;
    top: 0px;
    left: 0px;
}
Now to cause the images to overlay, we set the z-index in css as follows:
#img1
{
z-index: 10;
}
#img2
{
z-index: 20;
}
#img3
{
z-index: 30;
}
This will overlay 3 on top of 2 on top of 1. Here 3 could be the accessory, 2 could be the mannequin and 1 could be the background.
To change the accessory, we could do :
<img src="accessory1.png" onClick="setAccessory('accessory1.png');"/>
and use javascript like so:
function setAccessory(path){
     document.getElementById('img3').src = path;  // img3 is the accessory layer
}
To add more accessories, we can just define more layers.
Adobe's Macromedia Dreamweaver has the notion of dynamic layers: any object can be added to the web page by placing it directly on a layer, and the layer can then be moved around.
#codingexercise

Decimal GetDistinctRangeMean(Decimal [] A)
{
if (A == null) return 0;
return A.DistinctRangeMean();
}


Monday, December 8, 2014

#codingexercise
Decimal GetDistinctRangeVariance(Decimal [] A)
{
if (A == null) return 0;
return A.DistinctRangeVariance();
}


We continue our discussion on WRL systems. We saw that the first-level cache was write-through, with writes queued to the second-level cache through a four-entry FIFO. The second-level cache is write-back. It may be either virtually or physically addressed.

We next review the choice of user programs traced in this paper. The programs chosen for the study were real programs in regular use that require large amounts of memory, not because they are poorly written but because the problems they solve are inherently large.
One such program was a timing verifier (TV) for VLSI circuits that is in regular use at WRL; it generates a linked data structure and traverses it. SOR is another chosen program; it implements a successive overrelaxation algorithm using sparse matrices. Tree is a compiled Scheme program that builds a tree and searches it for the largest element; this implementation uses a garbage collection algorithm. Mult was chosen because it is a multiprogramming workload.
It was found that TV made the most memory references, numbering in the billions. The number of instructions as a percentage of memory references was generally higher across all the chosen programs than in CISC programs.

Measures for cache behavior: The most common measure of cache performance is the miss ratio, the fraction of requests that are not satisfied by a cache. Though the miss ratio is useful for comparing individual caches, it does not give a clear picture of the effect of a cache's performance on overall execution or of its relative importance in a multi-cache system. A better measure in that case is the CPI, the average number of cycles per instruction. CPI accounts for variations in the number of references to the different caches and for differences in the costs of misses to the caches, and it can be broken down into the components contributed by each part of the memory hierarchy. Since we are concerned only with memory delays and assume one instruction is issued per cycle, the total CPI over some execution interval is
CPI_total = 1 + CPI_data + CPI_inst + CPI_level2 + CPI_wbuffer,
where the last term is the contribution of the write buffer. CPI_total thus gives a better indicator of the performance of the base architecture.

If, for each cache,
    m_cache = the number of misses during the interval,
    c_cache = the cost in cycles of a miss, and
    i = the number of instructions in the interval,
then that cache's contribution is CPI_cache = (m_cache * c_cache) / i.
We want to reduce the total CPI. 
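A minimal sketch of this bookkeeping; the method names and the way the components are combined here are illustrative assumptions, not the analysis program itself:

// Sketch: per-cache contribution to CPI and the total CPI as described above.
static double CacheCpi(long misses, int missCostCycles, long instructions)
{
    return (double)misses * missCostCycles / instructions;
}

static double TotalCpi(double cpiData, double cpiInst, double cpiLevel2, double cpiWriteBuffer)
{
    // one base cycle per instruction plus the memory-hierarchy contributions
    return 1.0 + cpiData + cpiInst + cpiLevel2 + cpiWriteBuffer;
}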

We resume our discussion of the WRL system's cache analysis program. It was found that when simulating split first-level instruction and data caches and a direct-mapped second-level cache, analysis plus trace generation takes about 100 times as long as untraced execution. For example, tracing and analyzing a run of TV, a WRL timing verification program, extends the runtime from 25 minutes to 45 hours.

Saving traces for later analysis :
On-the-fly analysis does not require storage. When storage was required, the data was put on tape for the sake of capacity and portability. Data was written to tape using the Maching scheme. A differencing technique was used to write the data: the original trace data was first converted into a cache difference trace consisting of reference points and differences. Addresses are represented as a difference from the previous address rather than as an absolute address, which exposes patterns in the trace reference string. The cache difference trace is then compressed: common sequences of data are translated into single code words. Reference locality and the regularity of trace entry formats allow trace data to be compressed very effectively.
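A simplified sketch of the differencing step alone (the actual format with reference points and the code-word compression are not shown; the names are illustrative):

// Sketch: represent each address as a delta from the previous one so that
// regular reference patterns turn into small, repetitive values that compress well.
static long[] ToDifferenceTrace(long[] addresses)
{
    var diffs = new long[addresses.Length];
    long previous = 0;
    for (int i = 0; i < addresses.Length; i++)
    {
        diffs[i] = addresses[i] - previous;
        previous = addresses[i];
    }
    return diffs;
}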

We next review the base case cache organizations. A number of assumptions were made about the machine used for the analysis.
The only assumption related to the generation of traces is that of a RISC architecture in which instruction fetches and explicit data loads and stores are the only forms of memory reference. This keeps the traces simple yet useful, even though they contain no information about instruction types other than the loads and stores.
Among the assumptions made about the machine, one is that it is a pipelined architecture: if there are no cache misses, a new instruction is started on every cycle. This is possible when the page offset is used simultaneously to retrieve the real page from the TLB and to index the correct line in the cache. If there are no delays other than those caused by cache misses, the base cost is one cycle, so the goal of the memory hierarchy design is to keep the CPI close to 1.
Another assumption is that the machine is fast enough to require two levels of cache. The cost of going to the second-level cache was taken to be about 12 cycles.

These assumptions are in line with the discussions on this topic with the hardware engineers who built the WRL system.

The first-level data cache is a write-through cache, with a four-entry FIFO that queues up writes to the second level.
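A minimal sketch of such a queue, assuming the four-entry FIFO simply retires its oldest write when a new one arrives while it is full; the class and method names are illustrative:

using System.Collections.Generic;

// Sketch: four-entry FIFO that queues writes on their way to the second-level cache.
class WriteBuffer
{
    private readonly Queue<long> pending = new Queue<long>();
    private const int Capacity = 4;

    // Returns the address retired to the second level, if the buffer was full.
    public long? Add(long address)
    {
        long? retired = null;
        if (pending.Count == Capacity)
            retired = pending.Dequeue();
        pending.Enqueue(address);
        return retired;
    }
}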

#codingexercise
Decimal getStdDev(decimal [] a)
{
if (a == null) return 0;
return a.StdDev();
}


Sunday, December 7, 2014

In today's post we continue discussing the WRL trace generation system. We discussed a verification technique for the correctness of the trace data. First, the simulation results were compared with those of an existing instruction-level simulator. Second, to check the correctness of the information created by the trace code, synchronization had to be verified. Two techniques were used for this purpose: one, a program that ran a simple tight infinite loop was used to generate a recognizable pattern, and a variant of the analysis program checked whether anything was missing or out of place; two, the trace entry generated by the operating system on entry to the kernel was extended to include a sequence number that was incremented on each kernel entry.
We next look at a cache analysis program. The requirements were that it be flexible, that it can be changed from run to run, that it not take up so much memory that it causes paging overhead, and that it be as fast as possible, given that analysis takes much more time than trace generation. Constants whose values are specified at compile time define the number of cache levels, split or integrated data and instruction caches, the degree of associativity, the number of cache lines, the line size, the write policy, and the cost of a hit and a miss (in units of processor cycles). A version of the program is compiled for each cache configuration, and there can even be a fully associative version of the second-level cache.
The program writes its output at intervals that can be based on the number of instructions, trace entries, or memory references encountered. The data written includes both interval and cumulative summaries of: the number of references, the miss ratio, the contribution to the cycles per instruction, and the percentage of misses that resulted from user/kernel conflicts, all for each cache; the contribution to the CPI of second-level compulsory misses (first-time references); the second-level cache's contribution to the CPI if it were fully associative; and, for user mode and kernel mode, the number of cycles executed and the CPI, with the CPI broken down into instruction, load and store contributions.
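A sketch of what such compile-time configuration might look like; the particular values below are illustrative, not the ones used in the study:

// Sketch: compile-time constants of the kind described above; one binary is
// built per cache configuration.
static class CacheConfig
{
    public const int CacheLevels = 2;
    public const bool SplitFirstLevel = true;    // separate instruction and data caches
    public const int Associativity = 1;          // direct mapped
    public const int NumberOfLines = 4096;
    public const int LineSizeBytes = 32;
    public const bool WriteThroughFirstLevel = true;
    public const int HitCostCycles = 1;
    public const int MissCostCycles = 12;
}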
#codingexercise
Decimal getMedian(decimal [] a)
{
if (a == null) return 0;
return a.Median();
}
#codingexercise
Decimal getMode(decimal [] a)
{
if (a == null) return 0;
return a.Mode();
}


Friday, December 5, 2014

#codingexercise
Decimal getAvg(decimal [] a)
{
if (a == null) return 0;
return a.Avg();
}

In this post, we continue our study of the WRL trace system and review interrupt timing. Traces may not accurately represent normal execution because code segments are executed at different times and therefore possibly in a different order. The slowdown in execution means that various interrupts happen sooner or more frequently than they would without tracing. If user code does asynchronous I/O or relies on signals, more trace code is executed before and after the completion of an I/O operation.
Another such concern is that the additional timer interrupts are not only traced but also affect the rate at which process switching is mandated. For example, if there are more context switches, the time slice for actual user-code execution is reduced. To overcome this, we could increase the quantum associated with traced processes. Similarly, if the additional timer interrupts are a problem, we could account for them in the analysis program rather than modify the trace data.

Since less user code is executed between process switches, the traced processes run more slowly. This affects the accuracy of multiple-process traces. To account for this, the process switch interval was increased. This does not affect interrupts, since they still need to be recognized immediately.

The process switch interval was increased by a factor of 8 to reflect actual switching; this corresponded to approximately 200,000 user instructions between switches. To simulate faster machines, the process switch interval was doubled. The average number of instructions executed between process switches for the multi-process benchmark was about 175,000, because the quantum is measured in cycles rather than instructions.
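A back-of-the-envelope sketch of why the count falls short of 200,000 when the quantum is fixed in cycles; the CPI value used here is an assumption for illustration only:

// Sketch: a quantum fixed in cycles holds fewer instructions as CPI rises above 1.
static long InstructionsPerQuantum(long quantumCycles, double averageCpi)
{
    return (long)(quantumCycles / averageCpi);
}
// e.g. with a 200,000-cycle quantum and an assumed average CPI of about 1.14,
// InstructionsPerQuantum(200000, 1.14) comes out to roughly 175,000.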

In the comparison with simulation results, we were fortunate to have an original simulator for the WRL system that could print out traces for short, real user programs. The WRL system was modified to write out traces in a form similar to that produced by the simulator. The resulting trace files were compared, along with the numbers of cache hits and misses.

In order to check the correctness of the information created by the trace code, it was necessary to ensure that kernel and user access to the trace buffer was synchronized and that there were no gaps in the traces produced while the analysis program runs. To verify synchronization, there were two tests. First, a program was run indefinitely in a simple tight loop, and the pattern in the trace was confirmed to be consistent. Second, a sequence number was added to the trace entry on each entry into the kernel; again, the pattern was found to be consistent.
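A minimal sketch of the second check, assuming the analysis simply verifies that kernel-entry sequence numbers increase by one with no gaps; the method name is illustrative:

using System.Collections.Generic;

// Sketch: detect lost trace entries by checking that kernel-entry sequence
// numbers are strictly consecutive.
static bool SequenceNumbersAreConsecutive(IEnumerable<int> kernelEntrySequenceNumbers)
{
    int? previous = null;
    foreach (int sequence in kernelEntrySequenceNumbers)
    {
        if (previous.HasValue && sequence != previous.Value + 1)
            return false;
        previous = sequence;
    }
    return true;
}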