Friday, July 26, 2024

Tweet sentiment analyzer:

import sys
import json


def hw():

    # Build the sentiment dictionary from the tab-delimited AFINN file.
    afinnfile = open("AFINN-111.txt")
    scores = {}  # initialize an empty dictionary
    for line in afinnfile:
        term, score = line.split("\t")  # The file is tab-delimited. "\t" means "tab character"
        scores[term] = int(score)  # Convert the score to an integer.
    print(scores.items())

    # Load the tweets, one JSON object per line.
    outputfile = open("output.txt")
    tweets = []
    for line in outputfile:
        tweets.append(json.loads(line))

    # Score each tweet: +1 for every positive term and -1 for every negative
    # term, normalized by the number of words in the tweet.
    for item in tweets:
        if item.get("text"):
            sentence = item["text"].strip()
            words = sentence.split()
            score = 0
            for word in words:
                term = word.strip().lower()
                if term in scores:
                    if scores[term] > 0:
                        score += 1
                    elif scores[term] < 0:
                        score -= 1
            if len(words) > 0:
                score = score / float(len(words))
            print(score)
        else:
            print(0)


def lines(fp):

    print(str(len(fp.readlines())))


def main():

    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    hw()
    lines(sent_file)
    lines(tweet_file)


if __name__ == '__main__':
    main()


Thursday, July 25, 2024

This is a continuation of previous articles on Azure resources, their IaC deployments, and trends in data infrastructure. The previous article touched upon data platforms and how they go out of their way to recommend that data remain the organization's own proprietary asset and not be handed over to vendors, or even to the platform itself. This section continues that line of discussion and elaborates on understanding data.

The role of data in modern business operations is changing, with organizations facing the challenge of harnessing its potential while safeguarding it with utmost care. Data governance is crucial for businesses to ensure the protection, oversight, and effective management of their data assets. Compliance frameworks like the EU's AI Act highlight the importance of maintaining high-quality data for successful AI integration and utilization.

The complex web of data governance presents multifaceted challenges, especially in the realm of data silos and disparate governance mechanisms. Tracking data provenance, ensuring data visibility, and implementing robust protection schemes are crucial for mitigating cybersecurity risks and ensuring data integrity across various platforms and applications.

The evolution of artificial intelligence (AI) introduces new dimensions to data management practices, as organizations explore the transformative potential of AI and machine learning technologies. Leveraging AI for tasks like backup recovery, compliance, and data protection plans offers unprecedented opportunities for enhancing operational efficiencies and driving innovation within businesses.

The future of data management lies at the intersection of compliance, resilience, security, backup, recovery, and AI integration. By embracing these foundational pillars, businesses can navigate the intricate landscape of data governance with agility and foresight, paving the way for sustainable data-driven strategies and robust cybersecurity protocols.

Prioritizing data management practices that align with compliance standards and cybersecurity best practices is key. By embracing the transformative potential of AI while maintaining a steadfast commitment to data protection, businesses can navigate the complexities of the digital landscape with confidence and resilience.

References:

Previous article explaining a catalog: IaCResolutionsPart148.docx

https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-workspace 

https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html


#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhPIMgfH3QDAPfwCW6Q?e=dM89NH


Wednesday, July 24, 2024

The shift from DBMSs to catalogs is already underway. Databases used to be the veritable access grantors, but with heterogeneous data stores this role has shifted to catalogs such as the Unity Catalog for Databricks and the Horizon catalog for Snowflake. This is a deliberate move on the part of these platforms, even as they fight for their ecosystems. The end-users, and the organizations that empower them, are rapidly making this shift themselves.


For example, the Databricks Unity Catalog offers centralized access control, auditing, lineage, and data discovery capabilities across multiple Databricks workspaces. It includes user management, a metastore, clusters, SQL warehouses, and a standards-compliant security model based on ANSI SQL. The catalog also includes built-in auditing and lineage, allowing for user-level audit logs and data discovery. The metastore is the top-level container, while the data catalog exposes a three-level namespace, namely catalog.schema.table. The Catalog Explorer allows for the creation of tables and views, while volumes provide governance for non-tabular data. The catalog is multi-cloud friendly, allowing for federation across multiple cloud vendors and unified access. The idea here is that you can define once and secure anywhere.
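As a quick illustration of the three-level namespace, a minimal Python sketch in a notebook cell might look like the following (the spark session is provided by the Databricks runtime; the fully qualified name main.sales.orders and the group name analysts are placeholders, not objects that exist by default):

# Address a table through its fully qualified catalog.schema.table name.
df = spark.sql("SELECT * FROM main.sales.orders LIMIT 10")
df.show()

# Define the policy once at the schema level; it then applies from any
# workspace attached to the same metastore.
spark.sql("GRANT SELECT ON SCHEMA main.sales TO `analysts`")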


Databricks Unity Catalog consists of a metastore and a catalog. The metastore is the top-level logical container for metadata, storing data assets like tables or models and defining the namespace hierarchy. It handles access control policies and auditing. The catalog is the first-level organizational unit within the metastore, grouping related data assets and providing access controls. However, only one metastore is used per region: each Databricks region requires its own Unity Catalog metastore.


There is a Unity Catalog quick start notebook in Python. The key steps include creating a workspace with the Unity Catalog metastore, creating a catalog, creating a managed schema, managing a table, and using the Unity Catalog with the Pandas API on Spark. The code starts by creating a catalog, running SHOW to verify it, and then creating a managed schema. The next step involves creating and managing schemas, extending them, and granting permissions. A table is then created within the schema defined earlier, and the available tables are listed. The final step uses the Pandas API on Spark, which is covered in the official Databricks documentation. This quick start is a great way to get a feel for the process and to toggle back and forth between the key steps and the code.
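A rough sketch of those steps in Python might look like this (not the official notebook verbatim; the catalog, schema, table, and group names are illustrative, and spark is provided by the Databricks runtime):

# Create a catalog and verify that it is visible.
spark.sql("CREATE CATALOG IF NOT EXISTS quickstart_catalog")
spark.sql("SHOW CATALOGS").show()

# Create a managed schema inside the catalog and grant access to a group.
spark.sql("CREATE SCHEMA IF NOT EXISTS quickstart_catalog.quickstart_schema")
spark.sql("GRANT USE SCHEMA ON SCHEMA quickstart_catalog.quickstart_schema TO `data-engineers`")

# Create a managed table within the schema and list the available tables.
spark.sql("CREATE TABLE IF NOT EXISTS quickstart_catalog.quickstart_schema.quickstart_table (id INT, name STRING)")
spark.sql("SHOW TABLES IN quickstart_catalog.quickstart_schema").show()

# Read the same table back through the Pandas API on Spark.
import pyspark.pandas as ps
psdf = ps.read_table("quickstart_catalog.quickstart_schema.quickstart_table")
print(psdf.head())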


The Unity Catalog system employs object-security best practices, including access control lists (ACLs) for granting or restricting access for specific users and groups on securable objects. ACLs provide fine-grained control, ensuring only intended access to sensitive data and objects. Least privilege is applied, limiting access to the minimum required and avoiding broad groups like All Users unless necessary. Access is revoked once its purpose is served, and policies are reviewed regularly for relevance. This technique enhances data security and compliance, prevents unnecessarily broad access, and limits the blast radius in case of a security breach.
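For instance, a least-privilege grant on a single securable object might look like the following hypothetical sketch (the table and group names are placeholders, not part of any default workspace):

# Grant only what the analyst group needs, and nothing broader.
spark.sql("GRANT SELECT ON TABLE main.finance.invoices TO `finance-analysts`")

# Revoke access once its purpose has been served.
spark.sql("REVOKE SELECT ON TABLE main.finance.invoices FROM `contractors`")

# Review the policies that remain in effect.
spark.sql("SHOW GRANTS ON TABLE main.finance.invoices").show()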


The Databricks Unity Catalog system offers best practices for catalogs. First, create separate catalogs for loose coupling, managing access and compliance at the catalog level. Align catalog boundaries with business domains or applications, such as marketing analytics or HR. Customize security policies and governance within the catalog to drill down into specific domains. Create access-control groups and roles specific to a catalog, fine-tune read and write privileges, and customize settings such as resource quotas. These fine-grained policies provide the best of security and functionality in catalogs.
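A sketch of a domain-aligned catalog with its own access-control groups might look like this (the catalog and group names are illustrative):

# One catalog per business domain keeps governance loosely coupled.
spark.sql("CREATE CATALOG IF NOT EXISTS marketing_analytics")

# Fine-tune privileges with groups scoped to this catalog.
spark.sql("GRANT USE CATALOG ON CATALOG marketing_analytics TO `marketing-readers`")
spark.sql("GRANT CREATE SCHEMA ON CATALOG marketing_analytics TO `marketing-engineers`")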


To ensure security and manage external connections, limit visibility by granting access only to specific users, groups, and roles, and apply least privilege. Limit access to only the necessary users and groups using granular access control lists (ACLs). Be aware of team activities and avoid giving teams unnecessary access to external resources. Tag connections effectively for discovery using source categories or data classifications, and organize connections by use case for organizational visibility. This approach enhances security, prevents unintended data access, and simplifies the discovery and management of external connections.
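As an illustration, a federated connection might be declared and scoped like this (a hedged sketch; the connection name, host, secret scope, and group are placeholders):

# Declare the external connection once, keeping credentials in a secret scope.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS postgres_orders TYPE postgresql
    OPTIONS (
      host 'db.example.com',
      port '5432',
      user secret('db-scope', 'db-user'),
      password secret('db-scope', 'db-password')
    )
""")

# Grant the connection only to the group that needs it.
spark.sql("GRANT USE CONNECTION ON CONNECTION postgres_orders TO `integration-engineers`")

# Surface the external database as a foreign catalog for federated queries.
spark.sql("CREATE FOREIGN CATALOG IF NOT EXISTS orders_ext USING CONNECTION postgres_orders OPTIONS (database 'orders')")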


Databricks Unity Catalog business-unit best practices emphasize the importance of providing dedicated sandboxes for each business unit, allowing independent development environments and preventing interference between different workflows. Centralizing shareable data into production catalogs ensures consistency and reduces the need for duplicate data. Discoverability is crucial, with meaningful naming conventions and metadata best practices. Federated queries via the Lakehouse architecture unify data access across silos, governed securely via contracts and permissions. This approach supports autonomy for units, increases productivity through reuse, and maintains consistency through collaborative governance.


In conclusion, the Unity Catalog standard enables centralized data governance along with best practices for catalogs, connections, and business units.


https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-workspace 


https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html


Tuesday, July 23, 2024

This is a summary of the book titled “Active Listening Techniques – 30 Practical Tools to Hone Your Communication Skills,” written by Nixaly Leonardo and published by Callisto in 2020. The author offers insights into active listening, building off a decade of social work. She covers listening skills such as mindfulness, empathy, non-verbal cues, and effective questioning techniques, all of which lead to a deeper understanding of others. Her five-point agenda includes empathizing with others before interacting, being aware of tension so as to respond rather than react, acknowledging one’s negative emotions, involving loved ones in the journey, and writing journal entries about our reactions so as to stay aware of our emotional state. This helps us adjust our communication, persuade others by acknowledging their needs, projecting confidence, and choosing the right words, and deal with stressful situations by validating other people’s emotions, easing tension, and refocusing the conversation.

Active listening is a crucial communication skill that involves paying attention, understanding people's emotions, and giving time for others to talk. It is applied in various situations, including work, personal relationships, and therapy. Active listening helps individuals feel supported and heard, and it demonstrates respect for others. To improve communication skills, seven fundamentals can be applied: paraphrasing, using nonverbal language, emotional labeling, silence, redirection, mirroring, and validating.

Paraphrasing involves restating what someone says to ensure understanding, while nonverbal cues like eye contact, gestures, posture, and facial expressions help convey the message. Emotional labeling involves noticing and naming what others feel, while silence allows time to think and to express thoughts without interruption. Redirecting the conversation back to the original topic helps maintain direction and reduce tension. Mirroring involves subtly matching the speaker's body language and tone of voice to create a sense of connection and rapport. Validating others' emotions allows them to experience their emotions and hold their beliefs, making them feel understood and supported.

Active listening involves being present and mindful during conversations, ignoring distractions and staying open-minded. It helps us accept that we all experience negative emotions and stress and understand how our experiences shape our perceptions and interpretations of others' messages. To challenge and move through assumptions, empathize with others, be aware of tension, apologize when you react negatively, involve loved ones, and write journal entries about your reactions.

Be aware of your emotional state during conversations, as strong emotions can interfere with attentive listening. Adjust your communication to ensure others hear and understand you, considering other people's communication styles and preferences. Navigate situations tactfully by asking questions instead of directly challenging your supervisor's idea, describing or praising their vision, and seeking details to address your concerns without undermining their creativity or judgment.

Know your audience, choosing wisely when and where to raise critical issues and selecting the appropriate mode of communication; electronic communication such as texting and email can sometimes be more effective than a face-to-face conversation. By following these steps, you can become a better active listener and maintain a productive dialogue.

Persuasion involves acknowledging others' needs, projecting confidence, and choosing the right words. It is a matter of give and take, and understanding why someone might not agree with your viewpoint is crucial. Acknowledging their needs helps build respect and a stronger bond. Using precise language is essential in handling sensitive situations, avoiding hurting others, and conveying your intended message. Confidence is key, so even pretending to be confident can help.

To deal with stressful situations, validate others' emotions, ease tension, and refocus the conversation. Addressing emotional concerns fosters stronger connections and genuine conversations. You can calm others and ease tensions by recognizing escalating situations, lowering your tone, seeking clarification, taking responsibility for your contribution, and addressing the speaker's concerns. If tensions continue to rise, repeat these steps or suggest a break. Set boundaries and communicate potential consequences if the conversation escalates.

When a conversation goes awry, refocus on the original subject to avoid defensiveness and resolve the issue. Address communication challenges by rephrasing statements, acknowledging shifts, asking for thoughts, and validating the listener's feelings. This ensures both parties hear and understand each other, preventing a recurrence of arguments. By following these steps, you can ensure effective communication.


Summarizing Software: SummarizerCodeSnippets.docx


Monday, July 22, 2024

The well-known Knuth-Morris-Pratt algorithm.

This algorithm can be explained in terms of matching a pattern against an input text as follows:


#include <string>
#include <vector>
using namespace std;

int* PreProcess(string pattern);   // forward declaration; defined below

void KMP(string pattern, string text, vector<int> *positions) {
    int patternLength = pattern.length();
    int textLength = text.length();
    int* next = PreProcess(pattern);
    if (next == 0) return;

    int i = 0;   // number of pattern characters matched so far (current state)
    int j = 0;   // current position in the text
    while (j < textLength)
    {
        while (true)
        {
            if (text[j] == pattern[i])   // matches
            {
                i++;                     // yes, move on to the next state
                if (i == patternLength)  // maybe that was the last state
                {
                    // found a match; record its starting position
                    positions->push_back(j - (i - 1));
                    i = next[i];         // fall back so overlapping matches are also found
                }
                break;
            }
            else if (i == 0) break;      // no match in state 0, give up on this character
            else i = next[i];            // fall back to the longest border and retry
        }
        j++;
    }
    delete[] next;
}

int* PreProcess(string pattern) {
    int patternLength = pattern.length();
    if (patternLength == 0) return 0;
    int* next = new int[patternLength + 1];

    next[0] = -1;  // set up for the loop below; unused by KMP
    int i = 0;
    while (i < patternLength) {
        // next[i + 1] = length of the longest proper border of pattern[0..i]
        next[i + 1] = next[i] + 1;
        while (next[i + 1] > 0 &&
               pattern[i] != pattern[next[i + 1] - 1])
            next[i + 1] = next[next[i + 1] - 1] + 1;
        i++;
    }
    return next;
}

Usage: DroneDataAddition.docx

Sunday, July 21, 2024

Knuth-Morris-Pratt method of string matching

// Knuth-Morris-Pratt matcher, adapted from the 1-indexed CLRS pseudocode to 0-indexed Java strings.
public void kmpMatcher(String text, String pattern) {
    int n = text.length();
    int m = pattern.length();
    int[] prefixes = computePrefixFunction(pattern);
    int noOfCharMatched = 0;                 // length of the pattern prefix matched so far
    for (int i = 0; i < n; i++) {
        // Fall back to the longest border while the next pattern character does not match.
        while (noOfCharMatched > 0 && pattern.charAt(noOfCharMatched) != text.charAt(i))
            noOfCharMatched = prefixes[noOfCharMatched];
        if (pattern.charAt(noOfCharMatched) == text.charAt(i))
            noOfCharMatched = noOfCharMatched + 1;
        if (noOfCharMatched == m) {
            System.out.println("Pattern occurs at " + (i - m + 1));
            noOfCharMatched = prefixes[noOfCharMatched];   // keep searching for further matches
        }
    }
}

public int[] computePrefixFunction(String pattern) {
    int m = pattern.length();
    int[] prefixes = new int[m + 1];   // prefixes[q] = longest proper border of the prefix of length q
    int k = 0;
    for (int q = 2; q <= m; q++) {
        while (k > 0 && pattern.charAt(k) != pattern.charAt(q - 1))
            k = prefixes[k];
        if (pattern.charAt(k) == pattern.charAt(q - 1))
            k = k + 1;
        prefixes[q] = k;
    }
    return prefixes;
}


Saturday, July 20, 2024

The steps to create a machine learning pipeline in an Azure Machine Learning workspace are as follows:

1. Create an Azure Machine Learning Workspace:

If you don't have one already, create an Azure Machine Learning workspace. This serves as the central hub for managing your machine learning resources.

2. Set Up Datastores:

Datastores allow you to access the data needed in your pipeline. By default, each workspace has a default datastore connected to Azure Blob storage. You can register additional datastores if necessary.

3. Define Your Pipeline Steps:

Break down your ML task into manageable components (steps). Common steps include data preparation, model training, and evaluation.

Use the Azure Machine Learning SDK to create these steps. You can define them as PythonScriptStep or other relevant step types (see the sketch after this list).

4. Configure Compute Targets:

Set up the compute targets where your pipeline steps will run. Options include Azure Machine Learning Compute, Azure Databricks, or other compute resources.

5. Orchestrate the Pipeline:

Use the Azure Machine Learning pipeline service to automatically manage dependencies between steps.

Specify the order in which steps should execute and how they interact.

6. Publish the Pipeline:

Once your pipeline is ready, publish it. This makes it accessible for later use or sharing with others.

7. Monitor and Track Performance:

Monitor your pipeline's performance in real-world scenarios.

Detect data drift and adjust your pipeline as needed.
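A minimal sketch of these steps with the Azure Machine Learning Python SDK (v1, azureml-core and azureml-pipeline) is shown below; it assumes a workspace config.json is present and an existing compute cluster, and the cluster name "cpu-cluster", the scripts prep.py and train.py, and the experiment and pipeline names are placeholders:

# Assumes: pip install azureml-core azureml-pipeline, plus a downloaded config.json.
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()                                       # step 1: connect to the workspace
datastore = ws.get_default_datastore()                             # step 2: default Blob datastore
compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")   # step 4: existing compute target

# Step 3: define the pipeline steps as scripts (placeholder script names).
prep_step = PythonScriptStep(name="prepare_data", script_name="prep.py",
                             compute_target=compute_target, source_directory=".")
train_step = PythonScriptStep(name="train_model", script_name="train.py",
                              compute_target=compute_target, source_directory=".")
train_step.run_after(prep_step)                                    # step 5: explicit ordering

# Steps 5-6: assemble, submit, and publish the pipeline.
pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
run = Experiment(ws, "demo-pipeline").submit(pipeline)
run.wait_for_completion(show_output=True)
published = pipeline.publish(name="demo-pipeline", description="prep + train")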


This workspace provides an environment to create and manage the end-to-end life cycle of machine learning models. Unlike general-purpose software, machine learning on Azure has significantly different requirements, such as the use of a wide variety of technologies, libraries, and frameworks; the separation of training and testing phases before a model is deployed and used; and iterations for model tuning independent of model creation and training. Azure Machine Learning's compatibility with open-source frameworks and platforms like PyTorch and TensorFlow makes it an effective all-in-one platform for integrating and handling data and models, which greatly reduces the burden on the business to develop new capabilities. Azure Machine Learning is designed for all skill levels, with advanced MLOps features as well as simple no-code model creation and deployment.