Cluster computing

Wednesday, October 26, 2016

Today we continue our discussion on natural language processing with a particular word ontology called FrameNet. We see how it is different from other ontologies. For example, FrameNet differs from WordNet in the lexical information and is in fact, complimentary in many ways. FrameNet has far fewer nouns, verbs, adjectives and adverbs as compared to WordNet. This is because FrameNet has very little to say about common nouns. WordNet has noun hierarchies and encompasses common nouns. WordNet has little syntagmatic information where as FrameNet has a lot.
WordNet separates hiearchy for Nouns, Verbs, Adjectives and Adverbs. Each Frame on the other hand can conver words of any part-of-speech. WordNet has hypernymn or hyponymn relations between synsets. Lexical units come close to synsets but there are no relations per se between lexical units. However, there are several types of frame relations. There are also content differences between the two. WordNet has annotated examples showing syntax and semantics of each LU. FrameNet describes Frame Elements or roles for each frame. FrameNet Frame hierarchies provide better generalizations than WordNet synset hierarchies.
FrameNet also differs from PropBank (Proposition Bank discussed by Palmer et al 2005). PropBank began with labeling verbs and their arguments. PropBank uses Penn POS tags, Penn Tree Bank parse structures. It added nouns and roles from associated verbs. This includes an efficient semantic role labeling system but there is no equivalence to frames in FrameNet. There are two levels of role labels - those that are completely general and those that are specific to lexical unit. The lemmas from text are split into sequences that carry equivalence between PropBank verb specific and FN FE names. FrameNet assigns names such as communicator, evaluee and reason for corresponding PropBank role labels.
There have been other works to represent semantic information such as Abstract Meaning Representation ( Banarescu et al, Law 2013). AMR is a graph based representation of lexical concepts and typed relations between those concepts that are denoted by an English sentence. It integrates several aspects of lexical/relational meaning - abstracting away from the grammatical details - in a single structure. This structure specifically helps with rapid corpus annotation and data driven NLP. However the lexical semantics in AMR are much shallower than FrameNet
Courtesy : Colin Baker FCSI
In general, it seems to be helpful to have a hash between a term and a corresponding ontology identifier
#codingexercise
Insert into a circular linked list that is sorted but where we are given a random node from the list not the head.
// start is a random node in the sorted circular list

Void insertCircular(ref node start, node target)

{
var head = start;
If (head == null) {head = target; target.next = target; return;}
Node prev = null;
Node curr = head;
Max = int_min;
Var pos = null;
Var prevpos = null;
while(curr.next != head){
If (curr.data > max && cur.data < target.data ){
Prevpos = prev;
Pos = curr;
Max= curr.data;
}
Prev = curr;
Curr = curr.next;
}

If (curr.data > max && cur.data < target.data ){
Prevpos = prev;
Pos = curr;
Max= curr.data;

}
If (pos == null){
If (head.data < target.data){
InsertAt(head, target);
}else{
InsertAtStart(head, target);
}
}else{
if (pos.data < target.data) {
InsertAt(pos, target);
}else{
InsertAt(prevpos, target);
}
}
}
Int rowWithMaximum1sInSortedMatrix(int[,] m)
{
Int max = int_min;
Int index = -1;
For (int i = 0; i < m.rows; i++)
{
int col = findfirst(m, i, 0, m.cols-1, 1);
If (col != -1 && m.cols - col > max)
{
Max = m.cols -col;
Index = i;
}
}
Return index;
}
The findfirst method is a binary search method.

Tuesday, October 25, 2016

Today we continue our discussion on natural language processing with a particular word ontology called FrameNet. The idea behind FrameNet is that the definition of a word is incomplete without the relations to conceptual structures and patterns of beliefs, practices, institutions, images etc. which are referred to as semantic frames. - courtesy Filmore 2003. Therefore meanings are relative to Frames. The scope of FrameNet includes Events: for example related to birth or death as noun, Relations: for example being relevant as in adjective or in a personal relationship as in a noun, States: for example on or off as in preposition, and Entities: for example a device. Therefore the words are related to notions that are represented by frames and the meaning is not complete without these relationships. A Semantic Frame is the description of an event, relation, entity or state and its participants thereby capturing the essential knowledge of a given word sense. Roles within the frame are called Frame Elements. Words that evoke the frame are called Lexical units. For example cooking appears in a frame called 'Apply_heat' where the frame elements are the cook, food, container and heating_instrument. and words that evoke it such as fry, bake, boil and broil are lexical units.
FrameNet comprises of over 1195 Semantic frames with approximately 12 frame elements per frame. It includes12,989 lexical units which represent connection between one lemma + POS and one frame. It includes 198,932 manual annotations of corpus examples and about 1774 frame to frame relations. Since the relations involve multiple inheritance, it forms a lattice. It is also constructed bottom-up. It comes with semantic type labels such as animate, human, positive_judgement and assumes construction grammar which describe the grammatical characteristics and semantic import of each construction ( Filmore, Goldman, Rhodes ). It also includes metaphor and metonymy.
FrameNet is not a formal ontology. It includes non-lexical frames that have been made up as needed to connect the nodes in certain cases. It attempts to include cross linguistic differences such as the difference between commercial_transaction versus a criminal_process although both describe methods.
Why is it an improvement over semantic ontologies ?
1) Frame is a useful level of abstraction
2) Roles are more meaningful than semantic labels
3) Frames represent conceptual gestalts - more than just the sum of their parts.

#codingexercise
Insert into a circular linked list that is sorted but where we are given a random node from the list not the head.

Void insertCircular(ref node start, node target)

{
var head = FindHead(ref node start, node target);
If (head == null) {head = target; target.next = target; return;}
Node prev = null;
Node curr = head;
while(curr.data < target.data && curr.next != head){
Prev = curr;
Curr = curr.next;
}
If (prev == null){
If (curr.data < target.data){
InsertAt(cur, target);
}else{
InsertAtStart(head, target);
}
}else{
if (curr.data < target.data) {
InsertAt(cur, target);
}else{
InsertAt(prev, target);
}
}
}

Node FindHead(ref node start, node target)
{
Node head = null;
if (start == null) return head;
If (start.next == start) return start;
Node curr = start;
min = curr.data;
head = start;
while(curr.next != start){
if (curr.data < min)
{
head = curr;
min = curr.data;
}
curr = curr.next;
}
if (curr.data < min)
{
head = curr;
min = curr.data;
}
return head;
}
Alternatively, we could also insert by iterating the circular list once to find the node that is closest to the given node in its value.

Monday, October 24, 2016

Today we continue our discussion on natural language processing with a particular word ontology called FrameNet.
Framenet is a lexical database maintained by Berkeley.
It has the following features:
It has annotations about how words are used in actual text.
It can be used like a dictionary.
It has over 170000 annotated sentences describing word usage
It provides a unique training data set
It is both human readable and machine readable.
The data is freely available to download.
The annotations show how words have unique combinations.
FrameNet defines the idea that words have a meaning based on the presence of a semantic frame. This is a type of event, relation or entity and the participants in it. The example cited is that of cooking. The concept of cooking typically involves a person doing the cooking (cook) and the food that is cooked (food) and something to hold (container) along with a source of heat (heating instrument) This is represented as a frame called 'Apply_heat' where the frame elements are the cook, food, container and heating_instrument. The frame units then naturally fit as superimposed on the words that evoke them.
The frame units help with both semantic and syntactic relationships. As such they are better representations than terms themselves while allowing the models based on those terms to work as before, if not improved with the frame elements.

#codingexercise

Insert a node into a sorted circular linked list

Void insertCircular(ref node head, node target)

{

If (head == null) {head = target; target.next = target; return;}

Node prev = null;

Node curr = head;

while(curr.data < target.data && curr.next != head){

Prev = curr;

Curr = curr.next;

}

If (prev == null){

If (curr.data < target.data){

InsertAt(cur, target);

}else{

InsertAtStart(head, target);

}

}else{

if (curr.data < target.data) {

InsertAt(cur, target);

}else{

InsertAt(prev, target);

}

void InsertAt(ref node head, ref target)

{

// Insert After head

var next = head.next;

target.next = next;

head.next = target;

}

Sunday, October 23, 2016

#codingexercise
Print all nodes at a distance k from a given node
void printKDistanceDown(Node root, int k)
{
if (root == null || k < 0) return;
if (k == 0)
Console.Write(root.data);
printKDistanceDown(root.left, k-1);
printKDistanceDown(root.right, k-1);
}

// returns distance of root from target
int PrintKDistanceNode(Node root, Node target, int k)
{
if (root == null) return -1;
if (root == target){printKDistanceDown(root, k); return 0;}
int dl = printKDistanceNode(root.left, target, k);
if (dl != -1)
{
if (dl + 1 == k)
Console.Write(root.data);
else
printKDistanceDown(root.right, k -dl-2);
return 1 + dl;
}
int dr = printKDistanceDown(root.right, target, k);
if (dr != -1)
{
if (dr +1 ==k)
Console.Write(root.data);
else
printKDistanceDown(root.left, k-dr-2);
return 1+dr;
}
return -1;
}

2. Given a string, recursively remove adjacent duplicate characters from string. The output string should not have any duplicates.
For eg: abcdaadhhhhhzppzl
becomes abchl
static String RemoveDuplicate(StringBuilder str, int n)
{
for (int m = 0; m < int.MaxValue; m++)
{
bool modified = false;
int k = 0;
int len = str.ToString().Length;
if (len == 0) return str.ToString();
for (int i = 0; i < len; i++)
{
if (k == 0) {
str[k] = str[i];
k++;
continue;
}
if (str[k] == '\0' || str[i] != str[k-1])
{
str[k] = str[i];
k++;
}
else
{
k--;
str[k] = '\0';
modified = true;
}
}
if (!modified)
break;
str[k] = '\0';
k++;
for (int i = k; i < len; i++)
str[i] = '\0';
Console.WriteLine("val =" + str.ToString());
}
return str.ToString();
}

Saturday, October 22, 2016

#codingexercise
Flatten out a linked list with child and next pointers
void Flatten(Node head)
{
if (head == null) return;
Node temp;
Node tail = head;
while (tail.next)
tail = tail.next;
Node cur = head;
while ( cur != tail)
{
if (cur.child && cur.child.visited == false)
{
tail.next = cur.child; // this goes to the tail
tmp = cur.child; // tail is updated
tmp.visited = true;
while (tmp.next)
tmp = tmp.next;
tmp.visited = true;
tail = tmp;
}
cur = cur.next;
}
}
#probability question
A fair die is thrown k times. What is the probability of sum of k throws to be equal to a number n?
void FairThrow(int k, int n, ref int current, ref int sum, ref int count, ref int total)
{
if (current == k) {
total += 1;
if (sum == n) { count +=1;}
return;
}
for (int i = 1; i <=6; i++)
{
sum+= i;
current += 1;
FairThrow( k, n, current, sum, count);
sum -= i;
current -= 1;
}
}
Probability = count/total;

Friday, October 21, 2016

Today we continue to discuss the conditional random fields we mentioned earlier.We have seen some generalized discussion about what random fields are, their benefits, their representations, the objective functions and their optimizations including techniques such as CG method which ties in things we have previously studied for the purposes of text analysis. We wanted to introduce sense embedding features into the word-context analysis. We can represent the conditional probability of a sense given its context using their corresponding vectors. Then we choose the parameters numbering 1..d for each of the context, semantic pairs. We would like to set the parameters such that their conditional log likelihood is maximized. We could do this with conjugate gradient method.
We saw linear chain CRFs. Now we could generalized this and the relax the requirements for previous state dependency. Instead we define CRFs with general graphical structure. Such structures are found to be very helpful for relational learning because they relax the iid assumption between entities.
So far we have seen that models CRF included have been applied with training and testing data to be independent. However CRFs can also be used for within-network classification - one where we can model probabilistic dependencies between training and testing data. The generalization from linear chain CRFs to general CRFs is about moving from a linear chain factor graph to a more general factor-graph.
The general definition of the conditional random field is presented this way: Let G be a factor graph over Y. Then p(y|x) is a conditional random field if for any fixed x, the distribution p(y|x) factorizes according to G. Therefore each conditional distribution is a CRF even though it may be trivial. If F is a set of factors in G where each factor takes the exponential form, then each conditional distribution can be written in the form of exponential of weighted feature functions normalized by a partition function similar to what we have seen before except that they apply to each factor in the set F.
Models usually depend on parameter tying. For example, in the linear chain case, the same weights are used for the factors at each time step. This maintains the parameters between the transition. The same applied to a general model and not just a chain model. To do the same, the factors of G are partitioned into C = C1, C2, .... Cp where each Cp is a clique template whose parameters are tied. A clique template is a set of factors which has a corresponding set of sufficient statistics and parameters. Then the weighted feature functions in the general definition can be replaced by a combination of the sufficient statistics and their parameters.
#codingexercise
Find the position of an element in a sorted array of infinite numbers:
int GetPosition(List<int> nums, int val)
{
int low = 0;
int high = 1;
int cur = nums[0];
while (cur < val)
{
low = high;
high = 2 * high;
val = nums[high];
}
return binarysearch(nums, low, high, val);
}

Void binarysearch(list<int> nums, int low, int high, int val)
{
Int mid = (low + high)/2;
If (nums[mid] == val) return mid;
If (low == high) return -1;
If (nums[mid] < val)
Return binarysearch(nums,mid+1,high,val);
Else
Return binarysearch(nums,low,mid,val);
}

Thursday, October 20, 2016

Today we discuss a specific automation issue. This pertains to authentication:

We explain a difference between a JWT digest token and an HMAC hash:

We construct a JWT token this way:

def data_to_sign(endpoint, environment, owner):

banner = '{"alg":"SHA256withRSA","kid":"' + environment + '"}'

banner = base64.b64encode(banner)

start = int(datetime.datetime.now().strftime("%s"))*1000

end = datetime.datetime.now() + datetime.timedelta(seconds=300)

end = int(end.strftime("%s"))*1000

expiration = '{"iss":"' + environment + '","nbf":' + str(start) + ',"sub":"' + owner + '","exp":' + str(end) + ',"aud":"' + endpoint + '"}'

expiration = base64.b64encode(expiration)

data_to_sign = banner + "." + expiration

return data_to_sign

def signature(endpoint, environment):

data = data_to_sign(endpoint, environment)

unpacked = data.split('.')

unpacked[0] = base64.b64decode(unpacked[0])

unpacked[1] = base64.b64decode(unpacked[1])

unpacked = unpacked[0]+"."+unpacked[1]

with open('./pem/file.txt', 'w') as f:

f.write(unpacked)

cmd = 'openssl dgst -sha256 -sign ./pem/privateKey.key -out ./pem/signature.sign ./pem/file.txt'

cmd_execute(cmd.split())

digest = ''

with open('./pem/signature.sign', 'rb') as f:

digest = base64.b64encode(f.read())

cmd = 'openssl dgst -sha256 -verify ./pem/privateKey.pub -signature ./pem/signature.sign ./pem/file.txt'

output = self.cmd_execute(cmd.split())

# output says 'Verified'

auth = data_to_sign + "." + digest

return auth

PyJwt encapsulates something similar with the following syntax:
import jwt

jwt.encode({'some': 'payload'}, 'secret', algorithm='HS256')

The auth goes in the Authentication header field of the https request

We construct an HMAC token this way:
def auth(KeyId, timestamp, token, SomeSecret):

HMAC_KEY = SomeSecret.decode('hex')

CanonicalizedResource = 'timestamp=' + timestamp + '&' + 'token=' + str(token)

StringToSign = CanonicalizedResource

Signature = base64.b64encode(

hmac.new(

HMAC_KEY,

StringToSign,

hashlib.sha256

).digest()

)

auth = KeyId + ":" + Signature

return auth

#codingexercise
Find the longest consecutive subsequence in a sequence

Use a hash table and add all elements in the sequence.

For every element check if it is a start of subsequence, previous one does not occur.

Count the length from each subsequence.

Return the max of the counts.

int GetLMS(List<int> digits)
{
var h = new Hashtable<int>();
for (int i = 0; i < digits.count; i++)
if h.Contains(digits[i])
h[digits[i]]++;
else
h[digits[i]] = 1;
int max = 0;
for (int i = 0; i < digits.count; i++)
{
if (h.Contains(digits[i]-1) == false)
{
int count = 0;
for (int j = digits[i]; h.contains(j); j++)
count++;
if (count > max)
max = count;
}
}
return max;
}

Trivia question:

The question is why does net ads join fail with ads_sasl_spnego_gensec_bind ? Hint: is this a kerberos failure.