Tuesday, July 2, 2019

Today we continue the discussion on tools and methodology for coming up with threat models and reducing risk. We referred to the STRIDE model, which stands for:
Spoofing identity – the threat that a user can impersonate another user.
Tampering with data – the threat that a user can access Kubernetes resources or modify the contents of security artifacts.
Repudiation – the threat that a user can perform an illegal action that Kubernetes cannot trace back to that user.
Information disclosure – the threat that, say, a guest user can access resources as if the guest were the owner.
Denial of service – the threat that a component crucial to the operation of Kubernetes is overwhelmed with requests, so that other users experience an outage.
Elevation of privilege – the threat that a user gains access to components within the trust boundary, leaving the system compromised.
We usually begin evaluating a system against these factors with a control and data flow diagram.
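To make this concrete, here is a minimal sketch of a STRIDE worksheet for a single component. The component name and the example threats are illustrative assumptions, not findings read off the diagram below.

# A sketch of a STRIDE worksheet for one component; the threats listed
# are illustrative assumptions rather than actual assessment findings.
stride_worksheet = {
    'component': 'Service broker',
    'Spoofing': 'client presents a forged service-account token',
    'Tampering': 'broker response is altered to point at a rogue endpoint',
    'Repudiation': 'provisioning calls are not logged with caller identity',
    'Information disclosure': 'credentials returned in plain-text responses',
    'Denial of service': 'a flood of provision requests exhausts the broker',
    'Elevation of privilege': 'broker credentials reused against the API server',
}
for category, threat in stride_worksheet.items():
    if category != 'component':
        print(category + ': ' + threat)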

We begin applying the STRIDE assessment using a data flow diagram. For Keycloak on Kubernetes we have:

Keycloak on Kubernetes data flow diagram

The Service catalog returns the details of the resource as a K8s secret. If the application persists the K8s secret on a mounted volume, care must be taken to mark the volume as readOnly.
Similarly, while the Keycloak configuration is internal, it should be protected from reconfiguration after deployment.
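As a sketch of the readOnly guidance, assuming the official kubernetes Python client and hypothetical volume and path names:

from kubernetes import client

# Hypothetical volume name and mount path; read_only=True renders as
# readOnly: true in the pod spec, so the application cannot rewrite
# the persisted secret.
secret_mount = client.V1VolumeMount(
    name='keycloak-secret',      # assumed volume name
    mount_path='/etc/keycloak',  # assumed mount point
    read_only=True,
)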
The Service broker listens on port 9090 over HTTP. Since this traffic is internal, it has no TLS requirements. When the token passes the trust boundary, we rely on the kubectl interface to secure the communication with the API server. As long as clients communicate through kubectl or directly with the API server, this technique works well. In general, if the server and the clients communicate over TLS and have verified the certificate chain, there is little chance of the token falling into the wrong hands. URL logging and HTTPS proxies remain vulnerabilities, but a man-in-the-middle attack is less of an issue if the client and the server exchange a session id and keep track of each other's session id. In an API implementation, session ids are largely a site or application concern rather than the API's, but it is good to validate against a session id when one is available.
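As a client-side sketch of this, assuming the requests library, a hypothetical broker endpoint, a hypothetical CA bundle path, and an assumed X-Session-Id header convention:

import requests

BROKER_URL = 'https://broker.example.com:9090/v2/catalog'  # hypothetical endpoint
session_id = 'abc123'  # assumed to have been issued at login

# verify the server certificate against a pinned CA bundle
response = requests.get(
    BROKER_URL,
    headers={'Authorization': 'Bearer <token>', 'X-Session-Id': session_id},
    verify='/etc/ssl/broker-ca.pem',  # assumed CA bundle path
)
response.raise_for_status()

# reject the response if the server does not echo the session id it was
# given, so a replayed or proxied response can be detected
if response.headers.get('X-Session-Id') != session_id:
    raise RuntimeError('session id mismatch; possible interception')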
#pagerank
# pagerank: one score update for node u. constant is the teleport term
# (1 - d) / N of the standard formula; the helpers adjacencies(u) (nodes
# linking to u), pagerank_for_node_v(v) (current score of v) and
# number_of_links(v) (out-degree of v) are assumed to be defined elsewhere.
def pagerank(u, constant, d=0.85):
    total = constant
    for node_v in adjacencies(u):
        total += d * pagerank_for_node_v(node_v) / number_of_links(node_v)
    return total
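As a usage sketch, this update is applied repeatedly across all nodes until the scores converge; the three-node link graph below is an assumption for illustration.

# iterate the update over a small assumed link graph until scores settle
graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}  # node -> outgoing links
incoming = {u: [v for v in graph if u in graph[v]] for u in graph}
d, n = 0.85, len(graph)
ranks = {u: 1.0 / n for u in graph}
for _ in range(50):  # a fixed iteration count stands in for a convergence test
    ranks = {u: (1 - d) / n +
                d * sum(ranks[v] / len(graph[v]) for v in incoming[u])
             for u in graph}
print(ranks)  # the scores sum to ~1.0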


Monday, July 1, 2019

Today we continue discussing the data structure for storing a thesaurus.
We referred to the hierarchical representation of words based on synonyms, using the recursive CTE above, as a way of establishing clusters. Here we note that the hierarchy level is incremented as words are merged. Two words can be merged into the same group only when there is a common term or terms within their synonyms, or when there is a threshold degree of separation between their synonyms, if such extended processing is permitted.
def classify_synonyms():
    words = [{'cat': ['animal', 'feline']}, {'dog': ['animal', 'lupus']},
             {'dolphin': ['fish', 'pisces']}, {'spider': ['insect', 'arachnid']}]
    groups = []
    for item in words:
        word, synonyms = next(iter(item.items()))
        merged = False
        for group in groups:
            label = next(iter(group))
            if label in synonyms:
                # the word lists this group's label among its synonyms
                group[label].append(word)
                merged = True
                break
        if not merged:
            # no existing group matched; seed a new one with the first synonym
            groups.append({synonyms[0]: [word]})
    print(groups)

classify_synonyms()
# [{'animal': ['cat', 'dog']}, {'fish': ['dolphin']}, {'insect': ['spider']}]
The above method merely classifies the input to the first level of grouping. It does not factor in multiple matches between synonyms, selection of the best match among the synonyms, unavailability of synonyms, unrelated words, or unrelated synonyms. The purpose is just to show that, given a criterion for the selection of a group, the words can be merged. The output of the first level can then be taken as the input of the second level, the second level can be merged in the same way, and so on until a dendrogram emerges.
Given this dendrogram, we can take edge distance as the distance metric for semantic similarity.
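As a sketch of that metric, suppose the dendrogram is kept as a child-to-parent map (a hypothetical structure); the edge distance between two words is then the number of edges on the path through their lowest common ancestor.

# hypothetical child -> parent map standing in for the dendrogram
parent = {'cat': 'animal', 'dog': 'animal', 'animal': 'living_thing',
          'dolphin': 'fish', 'fish': 'living_thing'}

def ancestors(word):
    # path from the word up to the root, including the word itself
    path = [word]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def edge_distance(w1, w2):
    path1, path2 = ancestors(w1), ancestors(w2)
    common = set(path1) & set(path2)
    # the first node of path1 that also lies on path2 is the lowest common ancestor
    lca = next(node for node in path1 if node in common)
    return path1.index(lca) + path2.index(lca)

print(edge_distance('cat', 'dog'))      # 2: cat -> animal <- dog
print(edge_distance('cat', 'dolphin'))  # 4: through living_thing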
Since we perform this hierarchical classification only on the finite number of input words in a text, we can take it to be a bounded cost of O(n log n), assuming a fixed upper cost for each merge.

import pandas as pd

# Sample hierarchy (assumed data for illustration): each row's GroupID
# points to its parent, and 0 marks the root.
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Group': ['animal', 'feline', 'cat'],
                   'GroupID': [0, 1, 2]})

def nlevel(id, group_dict=None, _cache={0: 0}):
    # depth of a node, memoized; group_dict maps ID -> parent GroupID
    if group_dict is None:
        group_dict = df.set_index('ID').GroupID
    if id not in _cache:
        _cache[id] = 1 + nlevel(group_dict[id], group_dict)
    return _cache[id]

df['nLevel'] = df.ID.map(nlevel)
print(df[['nLevel', 'ID', 'Group']])