Monday, July 1, 2019

Today we continue discussing the data structure for storing a thesaurus.
Earlier we referred to a hierarchical representation of words based on their synonyms, built with a recursive CTE, as a way of establishing clusters. Here we add that the hierarchy level is incremented as words are merged. Two words can be merged into the same group only when there is a common term (or terms) among their synonyms, or when their synonyms fall within a threshold degree of separation, if such extended processing is permitted.
def classify_synonyms():
    # Each entry maps a word to its list of synonyms.
    words = [{'cat': ['animal', 'feline']}, {'dog': ['animal', 'lupus']},
             {'dolphin': ['fish', 'pisces']}, {'spider': ['insect', 'arachnid']}]
    groups = []
    for item in words:
        word = next(iter(item.keys()))
        synonyms = next(iter(item.values()))
        merged = False
        for index, key in enumerate(groups):
            group = next(iter(key))            # the label of an existing group
            if group in synonyms:              # a shared term puts them in the same group
                groups[index] = {group: key[group] + [word]}
                merged = True
                break
        if not merged:
            # No existing group matched; start a new group labeled by the first synonym.
            groups.append({synonyms[0]: [word]})
    print(groups)

classify_synonyms()
# [{'animal': ['cat', 'dog']}, {'fish': ['dolphin']}, {'insect': ['spider']}]
The above method merely classifies the input at the first level of grouping. It does not handle multiple matches between synonyms, selection of the best match among the synonyms, unavailability of synonyms, unrelated words, or unrelated synonyms. The purpose is just to show that, given a criterion for selecting a group, words can be merged. The output of the first level can then be taken as the input of the second level, the second level can be merged in the same way, and so on until a dendrogram appears, as sketched below.
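As a sketch of that next step, the first level's group labels can themselves be looked up and regrouped; the thesaurus entries below are hypothetical:
thesaurus = {'animal': ['organism'], 'fish': ['organism'], 'insect': ['organism']}

def next_level(groups):
    # Re-key each group by its label's own first synonym and merge.
    merged = {}
    for group in groups:
        label = next(iter(group))
        parent = thesaurus.get(label, [label])[0]
        merged.setdefault(parent, []).append(label)
    return [{k: v} for k, v in merged.items()]

level1 = [{'animal': ['cat', 'dog']}, {'fish': ['dolphin']}, {'insect': ['spider']}]
print(next_level(level1))   # [{'organism': ['animal', 'fish', 'insect']}]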
Given this dendrogram, it is possible to take edge distance as the distance metric for semantic similarity.  
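A minimal sketch of this edge-distance metric, assuming each word's parent in the dendrogram is recorded in a dict (the sample entries are hypothetical):
parents = {'cat': 'animal', 'dog': 'animal', 'animal': 'root',
           'dolphin': 'fish', 'fish': 'root', 'root': None}

def path_to_root(word):
    # Collect the chain of ancestors from the word up to the root.
    path = []
    while word is not None:
        path.append(word)
        word = parents[word]
    return path

def edge_distance(a, b):
    # Count edges from each word up to their lowest common ancestor.
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    lca = next(n for n in pa if n in common)
    return pa.index(lca) + pb.index(lca)

print(edge_distance('cat', 'dog'))      # 2 (cat -> animal <- dog)
print(edge_distance('cat', 'dolphin'))  # 4 (via the root)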
Since we perform this hierarchical classification only on the finite number of input words in a text, the cost is bounded at O(n log n), assuming a fixed upper cost for each merge.

The level of each word in the hierarchy can be computed with a recursive lookup over the group identifiers (the sample table below is hypothetical):

import pandas as pd

# Hypothetical sample: each row's GroupID points at the ID of its parent
# group, and the root's parent is the sentinel 0.
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Group': ['animal', 'cat', 'dog', 'feline'],
                   'GroupID': [0, 1, 1, 2]})

def nlevel(id, group_dict=df.set_index('ID').GroupID, _cache={0: 0}):
    # Memoize each node's depth while walking the parent pointers.
    if id in _cache:
        return _cache[id]
    _cache[id] = 1 + nlevel(group_dict[id], group_dict)
    return _cache[id]

df['nLevel'] = df.ID.map(nlevel)
print(df[['nLevel', 'ID', 'Group']])

Sunday, June 30, 2019

Today we discuss a data structure for a thesaurus. We have referred to a thesaurus as a source of information for natural language processing. The thesaurus will be used heavily when words from the incoming text are looked up for possible similarity, so we need a data structure that makes the lookups fast. A B+ tree holding the words and their similarities will be helpful because it keeps lookups logarithmic while allowing the data to scale.
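A full B+ tree implementation is more involved, but a sorted word list with binary search is a minimal sketch of the same logarithmic search path (the words and synonym payloads below are hypothetical):
import bisect

words = sorted(['cat', 'dog', 'dolphin', 'feline', 'lupus', 'spider'])
synonyms = {'cat': ['feline'], 'dog': ['lupus']}   # hypothetical payload

def lookup(word):
    # Binary search costs O(log n) comparisons, as a B+ tree descent would.
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return synonyms.get(word, [])
    return None   # the word is not in the thesaurus

print(lookup('cat'))   # ['feline']
print(lookup('cow'))   # None
A real B+ tree additionally links its leaves for range scans and pages its nodes, which is what lets the data scale beyond memory.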
A thesaurus can be viewed as a table of words with references to other words. We form a hierarchical representation with a group id, where the id points back to a word. Such a hierarchy is easy to explore with a recursive common table expression (CTE). By relating words to their groups at the next level, we can define the recursion with a level increment for each group and a terminal condition. Although this table of words is standalone in this case, we can view it as a relational table and use a recursive CTE to query the words in it.
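As a sketch, the recursive CTE below walks the GroupID parent pointers and assigns a level to each word; the schema and rows are hypothetical, and SQLite stands in for the relational store:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE words (ID INTEGER PRIMARY KEY, Word TEXT, GroupID INTEGER);
INSERT INTO words VALUES (1, 'animal', NULL), (2, 'cat', 1), (3, 'dog', 1);
""")
rows = conn.execute("""
WITH RECURSIVE hierarchy(ID, Word, Level) AS (
    -- the roots terminate the recursion
    SELECT ID, Word, 0 FROM words WHERE GroupID IS NULL
    UNION ALL
    -- each step relates words to their group at the next level
    SELECT w.ID, w.Word, h.Level + 1
    FROM words w JOIN hierarchy h ON w.GroupID = h.ID
)
SELECT Level, Word FROM hierarchy ORDER BY Level
""").fetchall()
print(rows)   # [(0, 'animal'), (1, 'cat'), (1, 'dog')]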
The number of synonyms can vary for each word. A list of synonyms for each word is easier to maintain, but an arbitrarily big table with a large number of columns will also suffice, because a word usually has a finite number of synonyms and their order does not matter. Synonym1, Synonym2, Synonym3, … can be the names of the columns in this case. Each synonym is a word, and each word is represented by the identifier associated with it in the table. Alternatively, a word and its list of synonyms can be represented with one column each, where the second column stores a varying number of word identifiers depending on how many synonyms the word has. This might be efficient for storage purposes, but storing the words themselves keeps lookups free of an extra level of indirection. Since the rows in the table of synonyms will be listed in alphabetical order of the words, a clustered index could prove helpful to make lookups faster.
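A sketch of the wide-column layout, with the word as the primary key standing in for a clustered index (SQLite's WITHOUT ROWID tables keep rows in key order; the table and rows are hypothetical):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
CREATE TABLE synonyms (
    Word TEXT PRIMARY KEY,      -- rows stored in word order, like a clustered index
    Synonym1 INTEGER, Synonym2 INTEGER, Synonym3 INTEGER
) WITHOUT ROWID
""")
conn.execute("INSERT INTO synonyms VALUES ('cat', 2, 3, NULL)")
print(conn.execute("SELECT * FROM synonyms WHERE Word = 'cat'").fetchone())
# ('cat', 2, 3, None)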
A table for the synonyms facilitates SQL queries over the entries. This helps application development over the thesaurus because the built-in operators and features of a well-known language come in handy. It also helps with the object-to-relational mapping of the entities, making it easier to develop services with the language's native features. The separation of querying from storage also keeps create-update-delete operations independent of read-only operations.
The synonyms do not have to be kept in a table. Alternative forms of persistence, such as a graph with words as nodes and weighted edges as synonym relations, can also be used. However, the query language for graphs may not be the same as that for tables. Other than that, both forms of storage can hold the thesaurus.
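For instance, a minimal in-memory sketch of the graph form, with hypothetical similarity weights on the edges:
graph = {
    'cat': {'feline': 0.9, 'animal': 0.6},
    'dog': {'lupus': 0.8, 'animal': 0.6},
}

def synonyms_of(word, threshold=0.5):
    # Return the neighbors whose edge weight clears the threshold.
    return [w for w, weight in graph.get(word, {}).items() if weight >= threshold]

print(synonyms_of('cat'))   # ['feline', 'animal']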

Saturday, June 29, 2019

We now discuss some of the mitigations for the vulnerabilities detected by this tool. Perhaps the best defense is to know what is contained in the images. If we are proactive in not including packages with vulnerabilities, we reduce the risk tremendously. Next, we can be consistent and clean in documenting the locations from which the images are built and the packages/jars are included. Third, we can update our dependencies to the latest software as and when it is released. Fourth, we can run our scans continuously on our builds so we can evaluate the changes within. Finally, we can manifest and document all our jars and containers.
Some dependencies incur a lot of vulnerabilities, and their replacements are well-advertised. Similarly, even popular source code and containers don't always get due maintenance. The approach of starting minimal and making incremental progress wins in all these cases.
Source code analyzers work differently from binary scanners. Although both are static analyzers, source code analysis comes with benefits such as scanning code fragments, supporting cloud-compiled languages, and compiler-agnostic and platform-agnostic processing. All scanners evaluate only those rules that are specified to them from the Common Vulnerabilities and Exposures (CVE) public database.
There are a few rules of thumb to defend against vulnerabilities:
Keep the base images in the Dockerfile up to date.
Prefer public and signed base images to private images.
Keep the dependencies up to date.
Make sure there are no indirect references or downloaded code.
Ensure that HTTP redirects chain to the correct dependencies.
Actively trim the dependencies.
Get the latest CVE definitions and mitigations for the product.
Validate the build integrations to be free from adding unnecessary artifacts to the product.
Ensure that the product images have a manifest that is worthy of publication to public registries.
Ensure that the product and manifests are scanned for viruses.
Use a deployment to scan for network vulnerabilities and to test web applications.
Keep the images and product signed for distribution.
Enforce role-based access control on all images, binaries, and artifacts.
Keep all releases properly annotated, documented, and versioned.

// Returns the Manhattan distance from the top-left for whichever of x and y
// is present in the matrix; x wins when both are present, and absence of
// both yields 0.
int getDistance(int[][] A, int rows, int cols, int x, int y) {
    int[] posx = getPosition(A, rows, cols, x);
    int[] posy = getPosition(A, rows, cols, y);
    if (posx[0] == -1 && posy[0] == -1) {
        return 0;
    }
    if (posx[0] == -1) {
        return getDistanceFromTopLeft(posy);
    }
    return getDistanceFromTopLeft(posx);
}

int getDistanceFromTopLeft(int[] position) {
    if (position[0] == -1 || position[1] == -1) return 0;
    // We take the sum of row and column offsets, not the Euclidean
    // sqrt of the sum of squares of displacements along the x and y axes.
    return position[0] + position[1];
}

// A plausible helper (assumed, not from the original): linear scan for the
// first occurrence of a value; returns {-1, -1} when the value is absent.
int[] getPosition(int[][] A, int rows, int cols, int value) {
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            if (A[i][j] == value) return new int[]{i, j};
    return new int[]{-1, -1};
}


Friday, June 28, 2019

We continued discussing the static scanners, both source code and binary scanners. A scanner doesn't really advertise all the activities it performs. However, we can take for granted that the scanner will not flag whether a container is running as root. It will also not flag an insecure Kubernetes configuration. Inappropriate use of shared resources such as persistent volumes is also not flagged by the scanner. Even security vulnerabilities that are not already known to the scanner can escape detection. The scanner does look for package names. Many vulnerabilities are directly tied to packages and their versions, and a registry of packages and vulnerabilities makes it easy to detect those in the container image. However, not all vulnerabilities are associated with package names. The same goes for open source that is not referenced from its public location but is indirectly included in the container image, whether from a tarball, another form of download, or local inclusion. Source analysis is also different from binary analysis, so we cannot expect overlap there. If source code has been included in the container image and built locally with utilities like ‘make’, it will likely escape detection. Image scanning is a foundational part of container security.
Open source is not limited to containers, and the security threat modeling for open source differs from that for products that use containers. Open source is popular for the functionality it offers along with access to the source code for customizations, and most companies use it. The Open Web Application Security Project (OWASP) was founded to draft guidelines for companies that use open source for web applications. Its publications include a top-ten list of frequently seen application vulnerabilities:
1. Injection of code that can be executed within trusted boundaries.
2. Broken authentication that lets attackers compromise the system.
3. Disclosure of sensitive data such as Personally Identifiable Information (PII).
4. XML external entities that break or exploit XML parsers.
5. Broken access control, where an administrative panel can become available to a low-privilege user.
6. Security misconfigurations, where omissions or mistakes in the enforcement of security policies allow hackers to exploit the system.
7. Cross-site scripting, where a victim’s browser can execute malicious code from an attacker.
8. Insecure deserialization that lets attackers manipulate messages or state to gain remote code execution.
9. Using components with known security issues, which lowers the common bar for security.
10. Insufficient logging and monitoring of the product, where authentication, authorization, and auditing somehow escape detection or known patterns of exploitation are not detected.
These attack vectors are very common and must be avoided when using open source in web applications.
These are some of the limitations of the scanner, and the checks it omits are best handled by other tools.