HOW-TO: Set-Up PySpark (Spark Python Big Data API)

Python natively executes single-threaded. Libraries exist that make multi-threaded execution possible, but they add complexity. The other downside is that such code often does not scale well with the number of execution threads (or cores) it runs on. Single-threaded code is stable and proven, but it takes a while to execute.

I have been on the receiving end of single-threaded execution. During the development stage, the workaround is to slice a sample of the dataset so that execution does not take a long time. More often than not, this is acceptable. Recently, I stumbled on a setup that takes Python code and executes it in parallel. What is cool about it? It scales to the number of cores thrown at it, and it scales to other nodes as well (think distributed computing).

This is particularly applicable to the field of data science and analytics, where datasets grow into the hundreds of millions and even billions of rows. And since Python is the language of choice in this field, PySpark shines. I need not explain the details of PySpark, as a lot of resources already do that. Let me describe the set-up so that code executes on as many cores as you can afford.

The procedure below is based on Ubuntu 16.04 LTS installed on a VirtualBox hypervisor, but it is very repeatable whether the setup is in Amazon Web Services (AWS), Google Cloud Platform (GCP) or your own private cloud infrastructure, such as VMware ESXi.

Please note that the procedure encloses the commands to execute in [square brackets]. Start by updating the apt package lists [sudo apt-get update]. Then install scala [sudo apt-get -y install scala]. In my experience this also installs the package "default-jre", but in case it doesn't, install default-jre as well [sudo apt-get -y install default-jre].
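Before moving on, it can help to confirm that both tools ended up on the PATH. The snippet below is a minimal sketch of such a check, not part of the original procedure; the helper name missing_tools is made up for illustration, and it assumes the scala package provides a `scala` executable and default-jre provides `java`.

```python
import shutil

def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]

# Assumption: `scala` comes from the scala package, `java` from default-jre.
missing = missing_tools(["scala", "java"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("scala and java are both on PATH")
```

If either name is reported as missing, re-run the corresponding apt-get install command before continuing.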

Download miniconda from the continuum repository. On the terminal, execute this command [wget]. This link points to the 64-bit version of python3. Avoid python2 as much as possible, since development for it is approaching its end; 64-bit is almost always the default. Should you want to install the heavier anaconda3 in place of miniconda3, you may opt to do so.

Install miniconda3 [bash] in your home directory. This avoids package conflicts with the pre-packaged python of the operating system. At the end of the install, the script will ask to modify the PATH environment variable to include the installation directory. Accept the default option, which is to modify the PATH. Optionally, you may add the conda-forge channel [conda config --add channels conda-forge] in addition to the default base channel.

Install Miniconda

At this point, the path where miniconda was installed needs to precede the path where the default python3 resides [source $HOME/.bashrc]. This of course assumes that you chose to accept .bashrc modification as suggested by the installer. Next, use conda to install py4j and pyspark [conda install --yes py4j pyspark]. The install will take a while so go grab some coffee first.

While the install is taking place, download the latest version of spark. As of this writing, the latest version is 2.2.1 [wget]. Select a download mirror that is close to your location. Once downloaded, unpack the tarball in your home directory [tar zxf spark-2.2.1-bin-hadoop2.7.tgz]. A directory named "spark-2.2.1-bin-hadoop2.7" will be created in your home directory. It contains the binaries for spark. (This next step is optional, as it is my personal preference.) Create a symbolic link to the directory "spark-2.2.1-bin-hadoop2.7" [ln -s spark-2.2.1-bin-hadoop2.7 spark].

The extra step above makes it easier to upgrade spark (since spark is actively being developed). Simply re-point "spark" to the newly unpacked version without having to modify the environment variables. If there are issues with the new version, simply link "spark" back to the old version. Think of it as a switch, made possible by the clever use of a symbolic link.
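The switch described above can be sketched in Python. This is an illustration under stated assumptions rather than a step in the procedure: the helper repoint and the directory name "spark-x.y.z-bin-hadoop2.7" are hypothetical, and the demonstration runs in a throwaway temporary directory so nothing in the real home directory is touched.

```python
import os
import tempfile

def repoint(link_path, target):
    """Point link_path at target, replacing any existing symlink in one step."""
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target, tmp_path)      # the target string is stored as given
    os.replace(tmp_path, link_path)   # rename over the old link

# Demonstration in a temporary stand-in for the home directory.
with tempfile.TemporaryDirectory() as home:
    old = "spark-2.2.1-bin-hadoop2.7"
    new = "spark-x.y.z-bin-hadoop2.7"  # hypothetical newer release
    os.mkdir(os.path.join(home, old))
    os.mkdir(os.path.join(home, new))
    link = os.path.join(home, "spark")

    repoint(link, old)                 # initial link, like ln -s
    first = os.readlink(link)
    repoint(link, new)                 # upgrade: flip the switch
    second = os.readlink(link)
    repoint(link, old)                 # roll back if the new version misbehaves
    third = os.readlink(link)
    print(first, second, third, sep="\n")
```

Because the link target is relative, the "spark" link keeps working even if the whole home directory is moved, which matches what the ln -s command in the text produces.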

At this point, all the necessary software is installed. It is imperative to check that everything works as expected. For scala, simply run [scala] without any options. If you see the welcome message, it is working. For pyspark, either import the pyspark library in python [import pyspark] or execute [pyspark] on the terminal. You should see a screen similar to the one below.

Test: scala spark pyspark
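Beyond the welcome screens, a tiny job confirms that pyspark can actually spread work across cores. The following is a minimal sketch, assuming pyspark and a Java runtime are installed as described; the function names and the "smoke-test" application name are illustrative choices, not something the installer provides. The Spark import is done lazily so the file still loads on a machine without pyspark.

```python
import importlib.util

def square(i):
    """The per-element work that Spark distributes across cores."""
    return i * i

def pyspark_installed():
    """True when the pyspark package can be imported by this interpreter."""
    return importlib.util.find_spec("pyspark") is not None

def spark_sum_of_squares(n, master="local[*]"):
    """Sum square(i) for i in range(n) on a local Spark session.

    "local[*]" asks Spark to use every core on this machine.
    Needs pyspark and a working Java runtime.
    """
    from pyspark.sql import SparkSession  # lazy import
    spark = SparkSession.builder.master(master).appName("smoke-test").getOrCreate()
    try:
        return spark.sparkContext.parallelize(range(n)).map(square).sum()
    finally:
        spark.stop()

# Plain-Python reference value to compare the Spark result against.
expected = sum(square(i) for i in range(1000))
print("pyspark importable:", pyspark_installed())
print("expected sum of squares:", expected)
```

On a machine with the setup above, calling spark_sum_of_squares(1000) should return the same value as expected. Changing master to "local[2]" limits the job to two cores, which is a simple way to see the effect of core count on larger inputs.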

Modify the environment variables to include SPARK_HOME [export SPARK_HOME=$HOME/spark]. Make the change permanent by putting that line in ".bashrc" or ".profile". Likewise, add $HOME/spark/bin to PATH [export PATH=$SPARK_HOME/bin:$PATH].
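After opening a new terminal, the two settings can be verified from Python. This is a small sketch only; the helper spark_env_problems is a hypothetical name, and it assumes the bin directory was added to PATH exactly as $SPARK_HOME/bin (no trailing slash).

```python
import os

def spark_env_problems(env):
    """Return a list of human-readable problems found in the given environment."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    bin_dir = os.path.join(spark_home, "bin")
    path_entries = env.get("PATH", "").split(os.pathsep)
    if bin_dir not in path_entries:
        problems.append(bin_dir + " is not on PATH")
    return problems

# Check the environment of the current process.
found = spark_env_problems(dict(os.environ))
print(found if found else "SPARK_HOME and PATH look consistent")
```

An empty result means both exports took effect in the current shell; otherwise, re-check ".bashrc" or ".profile" and source the file again.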

This setup becomes even more robust by integrating pyspark with the jupyter notebook development environment. This is a personal preference and I will cover that in a future post.


TIP: Screen -- Persistent Terminal Sessions in Linux

If there is one thing I learned in Linux that makes life extremely easy, I would say it is the ability to maintain persistent terminal sessions. This tool comes in handy when working remotely, and when working with servers in particular. Imagine you are uploading a sosreport or huge core dumps as attachments for a support ticket, and your shift ends. Would you want to wait another couple of hours for the upload to finish? Or would you want a persistent terminal session so that your uploads keep chugging along while you drive home?

I'm quite sure the answer is obvious. Linux has a utility called "screen". Screen gives the user a persistent shell session and, at the same time, multiple tabs over the same connection. It also allows the user to disconnect and re-connect at will, which is really handy for remote users or when the network connection gets interrupted for some reason. Another benefit is that several users can connect to the same screen session simultaneously.

This utility is not installed by default. In Ubuntu, to install simply run [sudo apt-get -y install screen].
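Once screen is installed, the "start it and walk away" workflow can also be scripted. The sketch below is an illustration, not something from the original tip: the helper names and the session name "upload" are hypothetical. It relies on screen's -d -m options (start a session already detached) and -S (name the session), and only builds the command unless start_detached is called explicitly.

```python
import shutil
import subprocess

def detached_screen_command(session_name, command):
    """Build the argv that starts `command` inside a detached, named screen session."""
    return ["screen", "-dmS", session_name] + list(command)

def start_detached(session_name, command):
    """Start the session; return False when screen is not installed."""
    if shutil.which("screen") is None:
        return False
    subprocess.run(detached_screen_command(session_name, command), check=True)
    return True

# Show the command that would be run for a long job (here a stand-in `sleep`).
argv = detached_screen_command("upload", ["sleep", "3600"])
print(" ".join(argv))
```

A session started this way shows up in [screen -ls] and can be re-attached later with [screen -r upload].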

To run screen, simply run [screen]. You might notice that nothing much has changed upon execution, but running [screen -ls] shows a session is already running. This is how plain it looks (I scrolled it back 1 line just to show you I ran screen).

Screen Without .screenrc

You can change this behaviour by making modifications to the screen startup configuration. It is a file named ".screenrc" that is placed in the home directory of the active user. This file does not exist by default and needs to be created by the user himself/herself.

I have created my own ".screenrc". It is available on GitHub at this link:

A few notes regarding the configuration. It alters the default behaviour of screen: the control (or escape) command, which is [CTRL]+[A] by default, is changed to [CTRL]+[G]. This means every other screen hot-key is preceded by [CTRL]+[G]. For example, [CTRL]+[G] then [C] creates another tab (or window), and [CTRL]+[G] then [D] detaches from screen.

Shown below is how it looks on my Raspberry Pi. See any notable difference(s) compared to the previous screenshot?

Screen With .screenrc

The other thing that is most notable about this configuration is that you will see the number of tabs at the bottom, the hostname of the server at the lower left corner and the current active tab. This way it is really clear that the terminal is running an active screen session. Scrollback is set to 1024 lines. That way you can go back 1024 lines that are already off the screen. You may customize this as well.

Having screen and a persistent terminal session is one of the best tools for a system administrator. But as I will show you soon, it is not limited to administering servers. Stay tuned.


FAQ: Data Science -- Where to Start (continued)?

In my previous post "Data Science -- Where to Start?", I enumerated a few specifics regarding my answer and pointed out several Python online courses to effectively jumpstart your data science career. Now, I would like to suggest a specific book that will help you focus on an aspect of your professional career and gain insight into a principle that is not adopted by most. This is particularly applicable when you are reaching the age of 30, by which time you have gained experience in a few professional endeavors.

This post in many ways answers the question: "Is it better to focus on my strengths or on my weaknesses?" The book to read is Strengths Finder 2.0 by Tom Rath. And right there, the title already gives away the answer. In more ways than one, knowing yourself and your strengths is immensely helpful.

This is what the book looks like.

Strengths Finder 2.0

The book initially discusses the example of basketball's greatest, Michael Jordan -- why can't everyone be like Mike? Way back when, my friends and I wanted to be like Mike, and the book has a very good explanation of why not everyone can be. It begins by quantifying his strength when it comes to basketball. Assume that on a scale of 1-10, his basketball skills are rated 10 (being the greatest player). Assume mine are rated 2. More like 1, but for the sake of comparison, let's put it at 2 compared to MJ.

To make this easier to understand, the book quantifies the result of focusing on strengths as the product of the rated skill (or strength) and the amount of effort put into honing it. I'm quite positive it is exponential in nature, not just multiplicative, but to illustrate: if MJ does work related to basketball with an effort of 5, the result is 10 x 5 = 50. Simply put, if MJ focuses on basketball and plays to his strength with full effort (10), his potential goes to 100.

In contrast, with a rating of 2, I could only go as high as 20, even at full effort. MJ needs only a meager effort of 2 to match that. Given the possibility of an exponential payoff from having the innate strength in the first place, the answer to why everyone can't be like Mike could not be any clearer. This is why it is important to know your strengths.
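The arithmetic behind that comparison fits in a few lines of Python. This is only the simple multiplicative illustration from the text (not the exponential effect I suspect is closer to reality), and it assumes effort is rated on the same 1-10 scale as strength; the function name outcome is made up for the example.

```python
def outcome(strength, effort):
    """Multiplicative model from the text: result = strength x effort."""
    return strength * effort

mj_half_effort = outcome(10, 5)    # MJ at an effort of 5
mj_full_effort = outcome(10, 10)   # MJ playing fully to his strength
me_full_effort = outcome(2, 10)    # a strength of 2 at maximum effort

# Effort MJ needs to match my best possible result.
mj_effort_to_match = me_full_effort / 10

print(mj_half_effort, mj_full_effort, me_full_effort, mj_effort_to_match)
```

Even at maximum effort, the lower strength tops out at a fifth of what the higher strength can reach.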

Coincidentally, MJ shifted to baseball. Did he have a successful season like what he had in basketball? History has recorded this outcome and his return to basketball cemented his legacy.

Bundled with the book is a code you could use to take the Strengths Finder exam. It is a series of questions that when evaluated together produces a profile of strengths. I took the exam a while back and my top 5 strengths are: Strategic, Relator, Learner, Ideation and Analytical. The result goes further to describe my top strength as: "People who are especially talented in the Strategic theme create alternative ways to proceed. Faced with any given scenario, they can quickly spot the relevant patterns and issues." The rest of the strengths are discussed as well.

Also included are "Ideas for Action", one of which is: "Your strategic thinking will be necessary to keep a vivid vision from deteriorating into an ordinary pipe dream. Fully consider all possible paths toward making the vision a reality. Wise forethought can remove obstacles before they appear." As I read through my profile, it was like reading an explanation of my past experiences. It explains why I behaved the way I did and why I made the decisions I made. More importantly, it explains why I am who I am now.

I compared my results with those of others who took the exam and also have the Strategic strength, and the descriptions are different. Likewise, the ideas for action are disparate. Having similar strengths doesn't mean having the same overall theme. Strengths also boost each other's effects. With the exception of Relator, my strengths are bundled under the "Strategic Thinking" domain.

Although knowing your strengths (and "playing" to your strengths) is not entirely data science related, it helps to know. In my experience, the investment in acquiring a copy of the book Strengths Finder 2.0 for myself is definitely worth it, plus the Gallup Strengths Finder exam. If you have taken the exam, share with us your top 5 strengths and how it has helped you with your career so far.


FAQ: Data Science -- Where to Start?

Data is the new oil. Perhaps this statement has now become a cliche. It goes without saying that data science has become the hottest job of the decade. It was predicted that there will be a shortage of data scientists, and that shortage is already prevalent now.

The reality of it all is this: the academe lags behind in preparing students to fill this gap. Data science is simply not taught in school, and the demand for it grows by the minute. While on the subject of data science, I have often been asked: "Where do I start preparing to gain practical skills for data science?" And too often, my answer is Python. But Python in itself is a broad topic, and I will be a little more specific in answering that question in this post.

In my line of work, having knowledge of Python really gives you an edge, not just an advantage. So if you want to start a career in data science, building a Python skillset is simply practical.

Knowledge, and even expertise, in Python can go a long way. It can be applied to ETL (or extract transform and load), data mining, building computer models, machine learning, computer vision, data visualizations, all the way to advanced applications like artificial neural networks (ANN) and convolutional neural networks (CNN). In any of the mentioned aspects of data science, Python can be applied and building expertise really becomes valuable over time.

Complete Python Bootcamp: from Zero to Hero

For beginners, those who have no idea how to program in Python or those who have only heard about it for the first time, the online course(s) really work. The course that has really helped me in getting a head start is Complete Python Bootcamp: from Zero to Hero. I have mentioned this often enough and will continue to advise the course to anyone who wants to learn Python.

While taking this course, my other recommendation is to build knowledge of jupyter notebooks. This will boost your Python productivity. It also helps you understand (and re-use) other people's code and aids you in sharing yours, if you wish to. In fact, several of those online courses share code in the form of jupyter notebooks.

To complete the answer, the Python library to master for data science is pandas. Pandas is often referred to as the Python Data Analysis Library and it rightfully deserves that reputation. More often than not, pandas is involved in data analysis, where it really shows its muscle. My recommended course for learning and mastering pandas is Data Analysis with Pandas and Python.

There goes my answer, and I hope it helps you build the skillset needed for a career in data science. These are by no means the only training courses you need; they simply address the "where to start" part of it, in my opinion. The more you use Python in your daily activities, the better honed you become, and before you notice it, talking in the Python lingo will come easily.

So, how did your data science journey, or Python experience start? Was this able to answer your question? Share your thoughts in the comments below.

All product names, logos, and brands are property of their respective owners. Use of these names, logos, and brands does not imply endorsement.

HOW-TO: Check Apple iPhone Battery Health

Apple, in a recent press statement, admitted to throttling its devices' performance when it detects battery health in decline. The same statement also indicated the company's commitment to supporting its customers and improving after-sales support. Replacement batteries will be available for a fraction of their original cost -- $29, down from $79. I'm an Apple iPhone user myself, and while we await further news on this matter, the question is: "When do you avail of this replacement program?"

Batteries degrade over time. This is a known fact. If you want to know more about why this happens, I suggest watching the PBS Nova documentary entitled "Search for the Super Battery" or reading about "dendrites" (related to batteries).

The purpose of this article is to inform you on how to check your Apple iPhone battery health. It is not limited to Apple iPhones, but applies to other Apple devices as well -- like iPads and iPad Minis (these are the ones I have tested the app on) -- and (maybe) other Apple devices.

The app to install is "Battery Life". It is available on the App Store. There is a free version and a PRO version. The free one has all the features needed to be informed.

Battery Life Screenshot at 83%: Good

Launching the application, you already have a view of the overall health of your device's battery. On my 2-year-old iPhone 6, the battery health indicated on the screen is 83%, and the application still considers that good condition.

Battery Life Raw Data

To get more information on battery health, tap the menu icon at the upper left (multiple stacked horizontal lines) and select "Raw Data". It shows the overall capacity of the battery as well as its degraded capacity. The current charge is also shown, based on the degraded battery capacity.
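To make the relationship between those raw numbers concrete, here is a small Python sketch. It assumes the health percentage is simply the degraded capacity divided by the original capacity, which is my reading of the screen rather than anything documented by the app, and the capacity figures are illustrative values chosen to reproduce 83%, not readings from my phone.

```python
def battery_health_percent(degraded_capacity_mah, original_capacity_mah):
    """Health as the share of the original capacity the battery can still hold."""
    if original_capacity_mah <= 0:
        raise ValueError("original capacity must be positive")
    return 100.0 * degraded_capacity_mah / original_capacity_mah

# Illustrative numbers only (assumed, not taken from the app).
health = battery_health_percent(1494, 1800)
print(round(health), "%")
```

The lower the degraded capacity falls relative to the original, the lower the percentage shown on the main screen.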

There you go. Being better informed will help you decide and verify if it is time to change batteries. Do you also have friends who might benefit from this information? Click the share button below to help them.

