Saturday, January 18, 2025
Google search engine
HomeGuest Blogs H1B Visa Data Analysis: Unveiling Patterns of H1B Visa Approval

 H1B Visa Data Analysis: Unveiling Patterns of H1B Visa Approval

Introduction

The H1B visa program opens doors for skilled individuals worldwide to bring their expertise to the United States. Thousands of talented professionals enter the US through this program each year, contributing to various industries and driving innovation. Let’s dive into the fascinating world of H1B visa data from the Office of Foreign Labor Certification (OFLC) and explore the stories behind the numbers. This article reveals H1B Visa data analysis and we get insights and interesting stories from the data. Through feature engineering, we enhance the dataset with additional information from external sources. Use the meticulous data wrangling to organize data carefully so that we can better understand and analyze it. Finally, data visualizations unveil fascinating trends and untold insights about skilled workers in the US in-between the years 2014 and 2016.

H1B Visa data analysis | feature engineering | data visualization
  1. Explore and analyze H1B visa data from the Office of Foreign Labor Certification (OFLC) and understand its significance in attracting skilled foreign workers to the United States.
  2. Learn about the process of data preprocessing, including data cleaning, feature engineering, and data transformation techniques.
  3. Examine and analyze the acceptance and rejection rates of H1B visa applications that potentially influence these rates.
  4. Gain familiarity with data visualization techniques to present and communicate the findings effectively.

Note:🔗 Please find the complete code and dataset for this analysis on Kaggle to explore the whole process and code behind the analysis: H1B Analysis on Kaggle

This article was published as a part of the Data Science Blogathon.

What is H1B Visa?

The H1B visa program is a key component of U.S. immigration policy, aimed at attracting highly skilled foreign workers to fill specialized positions in various industries. It addresses skill shortages, promotes innovation, and drives economic growth.

To obtain an H1B visa, a person must follow these key steps:

  1. Find a U.S. employer willing to sponsor the visa.
  2. The employer files an H1B petition with the USCIS on behalf of the foreign worker.
  3. The petition is subject to an annual cap and may go through a lottery if there are more applications than available spots.
  4. If selected, the USCIS reviews the petition for eligibility and compliance.
  5. If approved, the foreign worker can obtain the H1B visa and begin working for the sponsoring employer in the U.S.
H1B Visa data analysis | feature engineering | data visualization

The process involves meeting specific requirements, such as holding a bachelor’s degree or equivalent, and navigating additional considerations, such as prevailing wage determinations and documentation of the employer-employee relationship. Compliance and thorough preparation are crucial for a successful H1B visa application.

Dataset

  • Combined 2014, 2015 and 2016 datasets provided by the Office of Foreign Labor Certification (OFLC) for the H1B visa program include the columns such as Case Number, Case Status, Employer Name, Employer City, Employer State, Job Title, SOC Code, SOC Name, Wage Rate, Wage Unit, Prevailing Wage, Prevailing Wage Source, Year, etc.
  • These columns provide essential information about H1B visa applications, including case details, employer information, job titles, wage rates, and prevailing wage data.
  • Use the dataset and layout of the data directly from the official website of the OFLC at the following link https://www.foreignlaborcert.doleta.gov/performancedata.cfm for a complete understanding of all available columns and their descriptions.

Preprocessing

Join me on a fascinating data transformation journey! I convert the xlsx file to CSV, rename columns for consistency since each column contains the same data but has different names across each year., and combine three years of data. The result? A vast dataset with 1,786,160 rows and 17 columns.

Feature Engineering

  • Enhanced the dataset by creating an employment period column based on start and end dates.
  • Calculated the duration in days by subtracting the end date from the start date.
  • Removed rows with negative values, as a negative employment period is logically impossible.
  • Converted the duration to months by dividing it by 30.44, representing the average number of days in a month over four years. This approach ensures accurate estimation, accounting for leap years.
  • Handled missing values by replacing null entries with 0.
# convert date columns to datetime format and assign nan to invalid data
final_df['LCA_CASE_EMPLOYMENT_START_DATE'] = pd.to_datetime(final_df['LCA_CASE_EMPLOYMENT_START_DATE'], errors='coerce')
final_df['LCA_CASE_EMPLOYMENT_END_DATE'] = pd.to_datetime(final_df['LCA_CASE_EMPLOYMENT_END_DATE'], errors='coerce')

# subtract the LCA_CASE_EMPLOYMENT_END_DATE from LCA_CASE_EMPLOYMENT_START_DATEto find employment period
LCA_CASE_EMPLOYMENT_PERIOD=final_df["LCA_CASE_EMPLOYMENT_END_DATE"]-final_df["LCA_CASE_EMPLOYMENT_START_DATE"]

# create a new column with LCA_CASE_EMPLOYMENT_PERIOD value
final_df.insert(7, 'LCA_CASE_EMPLOYMENT_PERIOD', LCA_CASE_EMPLOYMENT_PERIOD)

# converting LCA_CASE_EMPLOYMENT_PERIOD into days format 
final_df['LCA_CASE_EMPLOYMENT_PERIOD'] = final_df['LCA_CASE_EMPLOYMENT_PERIOD'].dt.days

# delete the outlier value, i.e employment days less than 0.
final_df = final_df[final_df['LCA_CASE_EMPLOYMENT_PERIOD'] > 0]
final_df['LCA_CASE_EMPLOYMENT_PERIOD'].describe()

# the employment period is converted into months
final_df['LCA_CASE_EMPLOYMENT_PERIOD'] = (round(final_df['LCA_CASE_EMPLOYMENT_PERIOD'] / 30.44))

#filled the missing value with 0 and converted the column type to int
final_df['LCA_CASE_EMPLOYMENT_PERIOD']=final_df['LCA_CASE_EMPLOYMENT_PERIOD'].fillna(0).astype(int)
  • Determined the sector of each employer by extracting the first two digits from the provided NAICS code.
  • Downloaded the NAICS code and sector data online to obtain the corresponding sector information.
  • Created a new column called EMPLOYER_SECTOR, mapping each employer’s basic code to its respective sector.
# Convert the LCA_CASE_NAICS_CODE column to string data type
final_df['LCA_CASE_NAICS_CODE'] = final_df['LCA_CASE_NAICS_CODE'].astype(str)

# Extract the first two digits of each string value
final_df['LCA_CASE_NAICS_CODE'] = final_df['LCA_CASE_NAICS_CODE'].str[:2]

# reading the NAICS_data to cross check and create a new column for employer sector
NAICS_data=pd.read_csv("/kaggle/input/h1b-visa/NAICS_data.csv")
NAICS_data.head()

# loop through all the NAICS in the naics_unique_values
for i in naics_unique_values:
    
    try:
        # assuming your dataframe is called 'df'
        NAICS_data_code = NAICS_data.loc[NAICS_data['NAICS_CODE'] == i, 'NAICS_TITLE'].iloc[0]
    
    except:
        #if there is no index with the particular soc code the occupation name will be null
        NAICS_data_code = "Unknown"

    # create a boolean mask for the conditions
    mask = (final_df['LCA_CASE_NAICS_CODE'] == i)

    # update the LCA_CASE_SOC_NAME column for the filtered rows
    final_df.loc[mask, 'EMPLOYER_SECTOR'] = NAICS_data_code
  • Additionally, I extracted the year information from the LCA_CASE_SUBMIT field, creating a dedicated year column. This simplifies data analysis and allows for convenient year-based insights.
# extract the year component from the datetime column LCA_CASE_SUBMIT and store it in a new column year
final_df['year'] = final_df['LCA_CASE_SUBMIT'].dt.year

Data Transformation

  • I performed a series of data transformations to refine the dataset. Duplicates were removed, ensuring clean and unique records.
  • I preprocessed the LCA_CASE_SOC_CODE column by removing special characters in each row.
# remove numbers after "." period in 'LCA_CASE_SOC_CODE' column
final_df['LCA_CASE_SOC_CODE'] = final_df['LCA_CASE_SOC_CODE'].astype(str).apply(lambda x: x.split('.')[0])

# function to correct the LCA_CASE_SOC_CODE

def preprocess_column(column):
    pattern = r"^\d{2}-\d{4}$"  # regex pattern for "XX-XXXX" format

    def preprocess_value(value):

        if ("-" not in value) and len(value) < 6:
            cleaned_value=np.nan
            
        elif "-" in value :
            value=value.replace('-','')
            cleaned_value=value[0:2]+"-"+value[2:6]
            if len(cleaned_value) != 7:
                cleaned_value=np.nan
            
        elif ("-" not in value) and len(value) > 5:
            value=value.replace('/', '')
            cleaned_value=value[0:2]+"-"+value[2:6]

        return cleaned_value

    cleaned_column = column.apply(lambda x: np.nan if pd.isna(x) else (x if re.search(pattern, str(x)) else preprocess_value(x)))
    return cleaned_column

final_df["LCA_CASE_SOC_CODE"] = preprocess_column(final_df["LCA_CASE_SOC_CODE"])

# Replace the values in the 'LCA_CASE_WAGE_RATE_FROM' column

# define a custom function to preprocess the wage_rate column
def preprocess_wage_rate(cell_value):
    if isinstance(cell_value, float):
        return cell_value
    elif '-' in cell_value:
        return cell_value.split('-')[0].strip()
    else:
        return cell_value

# apply the custom function to the wage_rate column
final_df['LCA_CASE_WAGE_RATE_FROM'] = final_df['LCA_CASE_WAGE_RATE_FROM'].apply(lambda x: preprocess_wage_rate(x))
  • Instead of eliminating null values in the LCA_CASE_SOC_NAME column, I scraped the SOC code and name from “https://www.bls.gov/oes/current/oes_stru.htm#29-0000” and created a CSV file. I then utilized this CSV file to impute the null values in LCA_CASE_SOC_NAME.
# initialize webdriver
driver = webdriver.Chrome()

# navigate to webpage
driver.get('https://www.bls.gov/oes/current/oes_stru.htm#29-0000')

# find all li elements
li_elements = driver.find_elements("xpath","//li")

# create empty list to store data
data = []

# loop through li elements
for li in li_elements:
    text = li.text
    if "-" in text:
        # use regular expression to extract SOC code and occupation name
        words = text.split()
        soc=words[0]
        
        name = (" ".join(words[1::])).replace('"', '').strip()

        name_list=(words[1::])
        if "-" in name:
            for i, word in enumerate(name_list):
                if ("-" in word) and (len(word) > 1):
                    name =(' '.join(name_list[:i])).replace('"', '').strip()
                    break

        data.append({'SOC Code': soc, 'Occupation Name': name})

# close webdriver
driver.quit()

# create dataframe
occupation_data = pd.DataFrame(data)

# save dataframe as CSV
occupation_data.to_csv('occupations.csv', index=False)

Explanation

  • The column ‘LCA_CASE_WAGE_RATE_FROM’ in the dataset had wage rates expressed in various units, which needed to be standardized for consistent analysis.
  • To achieve standardization, I converted the wage rates to a uniform annual value. This involved multiplying the rates by specific factors depending on their original units.
  • For instance, monthly rates were multiplied by 12 to account for the number of months in a year. Similarly, weekly rates were multiplied by 52 (the number of weeks in a year), and bi-weekly rates were multiplied by 26 (assuming 26 bi-weekly periods in a year).
  • Handling hourly rates required considering whether the position was full-time or not. For positions marked as full-time (‘FULL_TIME_POS’ = ‘Y’), I multiplied the hourly rate by 40 hours per week and 52 weeks in a year.
  • For non-full-time positions (‘FULL_TIME_POS’ = ‘N’), I used 35 hours per week and 52 weeks in a year as the basis for calculating the annual rate.
  • After performing the necessary calculations, the units in the ‘LCA_CASE_WAGE_RATE_FROM’ column were replaced with ‘Year’ to reflect the standardized representation of the wage rates on an annual basis.
  • This standardization enables meaningful comparisons and analysis of the wage rates across different positions and categories within the dataset.
  • To clean the LCA_CASE_SOC_NAME column, I converted all data to lowercase and singular form if it ended with ‘s’.
  • To facilitate comprehension, I divided the LCA_CASE_WAGE_RATE_FROM column values by 1000 to represent wages in thousands. Rows with negative wage values were removed as they are not valid.

Synopsis

  • Additionally, I employed the IQR method to eliminate outliers in the 0.1 and 0.99 quantiles to ensure accurate analysis.
  • The values “INVALIDATED” and “REJECTED” in the ‘STATUS’ column are replaced with “DENIED”. This simplifies the representation of denied visa applications, as both “INVALIDATED” and “REJECTED” refer to applications that have been denied and ensures consistency by using a single label, “DENIED”, for all denied H1B visa applications in the dataset.
  • Unnecessary columns, including “LCA_CASE_EMPLOYMENT_START_DATE,” “LCA_CASE_EMPLOYMENT_END_DATE,” “LCA_CASE_WAGE_RATE_UNIT,” “FULL_TIME_POS,” and “LCA_CASE_NAICS_CODE,” were removed, simplifying the dataset and enhancing clarity.
  • The data is now refined and ready for insightful exploration.

Analysis

What is the total number of H-1B visa applications? What is the growth rate of the applications over the past three years?

 number of application growth rate per year | analysis | H1B Visa data analysis | feature engineering | data visualization
number of application growth rate per year

Between 2014 and 2015, the number of H1B visa applications skyrocketed, growing by an impressive 17.7%. Skilled individuals from around the world were eager to seize opportunities in the United States. However, things took an unexpected turn in 2016 when there was a sudden 9% drop in applications. This decline left us wondering: What caused this significant change?

What caused the sudden drop in the application rate? Is it due to an increase in rejection rates, or were other factors contributing to this decline?

 Total count by year and status | H1B Visa data analysis | feature engineering | data visualization
Total count by year and status

Surprisingly the rejection rate for the visa has decreased significantly from 5.41% to 3.4% over the years. On the other hand, the acceptance rate has been steadily increasing every year. It may also suggest employers have become more adept at submitting strong applications, thereby reducing the rejection rate.

The decreasing rejection rate reflects a positive trend and signifies a more favourable environment for H1B visa applicants. The increasing acceptance rate indicates a growing demand for highly skilled foreign workers in the United States.

These positive trends could be attributed, to the US government’s efforts to foster a welcoming environment for skilled immigrants. The government may have implemented more favourable policies, streamlining the visa process and removing unnecessary barriers. This, in turn, has likely contributed to the decrease in the rejection rate. Therefore, the decline in applications cannot be solely attributed to a higher rejection rate.

What are the top sectors for H1B visa applications?

 Employer sector distribution | H1B Visa data analysis | feature engineering | data visualization
Employer sector distribution

After conducting our analysis, we have found that a significant portion of H1B visa applications, approximately 72.4%, with the professional, scientific, and technical services sectors. This sector encompasses various fields, including computer programming, scientific research, engineering, and consulting services. The surge in applications are escalating demand for skilled professionals in these domains, given the specialized expertise and knowledge required.

Furthermore, influence by larger companies actively contributing to the increase in H1B visa sponsorships for their employees, within the professional, scientific, and technical services sectors. These companies rely on highly skilled workers to uphold their competitive edge and sustain growth in the industry.

Which are the top 10 employers with the highest number of H1B visa applications, and in which sectors do they belong?

 top 10 employers by total application count
top 10 employers by total application count

Based on comprehensive analysis, we found that the professional, scientific, and technical services sectors are held with 9 out of the top 10 employers with the highest number of H1B visa applications. Known for its consistent demand for skilled professionals and encompasses diverse fields such as computer programming, scientific research, engineering, and consulting services.

Infosys stands out as the leading employer, with a staggering 82,271 approved applications and the lowest number of denied applications among the top 10 employers. This dominance in the H1B visa application count surpasses the combined numbers of the second-ranked TCS and the third-ranked Wipro.

The outstanding performance of Infosys in the H1B visa application process raises intriguing questions about the company’s approach and the specific job roles they are recruiting for.

How much of an impact do the top 10 employers have on the distribution of job positions for H1B visas?

 Top 10 LCA_CASE_NAME with employer group
Top 10 LCA_CASE_NAME with employer group

After analyzing the data, we created a chart to visually represent the contribution of the top 10 H1B visa-sponsoring employers to the top 10 job positions. The remaining employers were grouped as “other employers.” The chart highlights that while “other employers” hold a substantial portion, the top 10 employers have made a significant impact on the top 10 job positions.

For instance, Infosys has played a major role in the computer systems analyst position, while Microsoft has made a notable contribution to the Software developers, application position. Similarly, IBM has significantly influenced the computer programmer, applications position.

This chart emphasizes the significant influence of the top 10 employers in the H1B visa application process and the specific job positions.

To what extent does the salary range affect the approval or denial of H1B visa applications for job positions!?

 Top 10 certified LCA_CASE_SOC_NAME with highest count and average wage rate | H1B Visa data analysis | feature engineering | data visualization
Top 10 certified LCA_CASE_SOC_NAME with highest count and average wage rate
 Top 10 denied LCA_CASE_SOC_NAME with highest count and average wage rate | H1B Visa data analysis | feature engineering | data visualization
Top 10 denied LCA_CASE_SOC_NAME with highest count and average wage rate

After conducting data analysis, we saw that there is no significant correlation between the salary range and the application status. The observation holds true for both the top 10 accepted and denied job positions, they had the same salary range.

To gain further insight into the relationship between application status and salary range, we divided the salary range into four quantiles: low, average, above-average, and higher pay salary levels. We then analyzed the data for each category. The findings reveal that the majority of approved H1B visa applications fell into (Q1) and average (Q2) salary range categories.

 Distribution of wage rates by status and quantile | H1B Visa data analysis | feature engineering | data visualization
Distribution of wage rates by status and quantile

The low (Q1) and average (Q2) salary range categories encompassed the majority of approved H1B visa applications. Nevertheless, no clear trends were seen between the salary range and the application status. Factors other than salary have played a significant role in determining the outcome of H1B visa applications.

So does the length of employment impact the decision on the H1B visa application?

 Top 10 certified LCA_CASE_SOC_NAME with highest count and average employment period
Top 10 certified LCA_CASE_SOC_NAME with highest count and average employment period
 Top 10 certified LCA_CASE_SOC_NAME with highest count and average employment period | H1B Visa data analysis | feature engineering | data visualization
Top 10 certified LCA_CASE_SOC_NAME with highest count and average employment period

Upon analyzing the data, we discovered that there was no substantial correlation between the employment period and the visa decision. Surprisingly, both the top approved and top denied job positions showcased an average employment period of 33 months. The duration of employment doesn’t appear to be a determining factor in the visa decision-making process.

We divided the applicants into 2 groups based on the average duration of their employment period: below 33 months and above 33 months.

 Distribution of employment_range by status and mean of employment period
Distribution of employment_range by status and mean of employment period

Despite thorough analysis, we couldn’t identify trend regarding the employment period and its influence on the outcome of H1B visa applications. The absence of a discernible pattern suggests that the duration of employment might not have played a pivotal role.

 H1B visa applications by state
H1B visa applications by state

Ta-da! finally, the data, in its infinite wisdom, has given us a virtual high-five, confirming that we were right finally.

it becomes apparent that specific states within the US exhibit a higher concentration of employers who applied for H1B visas. Notably, Texas, California, and New Jersey emerge as the top three states with the greatest number of employers, accounting for approximately 684,118 applications combined. This observation suggests that these states likely experience a heightened demand for skilled workers, attracting a larger pool of employers to seek H1B visas.

The analysis uncovers distinct hotspots across the US where the majority of H1B employers are available. These regions are indicative of areas with significant demand for skilled professionals.

Conclusion

In conclusion, this blog delves into the significance of H1B visa data from the OFLC, providing a comprehensive exploration of its role in attracting skilled foreign workers to the United States. The article focuses on various data preprocessing techniques, including cleaning, feature engineering, and transformation, to ensure accurate analysis. Through an examination of H1B visa acceptance and rejection rates, the blog uncovers influential factors impacting these rates, presenting valuable insights through data visualization.

Key Takeaways

  • Job positions and designations are crucial factors in determining H1B visa success. Developing technical skills in high-demand fields like programming and engineering enhances the chances of securing an H1B visa job.
  • California, Texas, and New Jersey are the top states for H1B visa applications, offering abundant opportunities due to the presence of leading companies across diverse industries.
  • Pay attention to influential employers within the dataset, such as Infosys and Tata Consultancy Services, as aligning efforts with these industry giants can significantly boost the H1B visa journey.

By equipping yourself with the right skills, adopting a targeted approach to states and employers, and embracing the possibilities that lie ahead, you can confidently embark on your H1B visa adventure and thrive in a world that values your talents.

Frequently Asked Questions

Q1. How can I access H1B visa data from the Office of Foreign Labor Certification (OFLC)?

A. visit their website at https://www.foreignlaborcert.doleta.gov/performancedata.cfm. Navigate to the “Disclosure Data” tab and select “LCA Programs (H-1B, H-1B1, E-3)” to download the data file in .xlsx format. Additionally, you can find a PDF file on the website that provides a description of the data included in the .xlsx file.

Q2. Why choose an alternative approach to handle missing values instead of dropping them?

A. It is crucial to exercise caution when considering the removal of null value columns. Dropping these columns can result in information loss and the elimination of valuable data. Furthermore, it may introduce bias and distort the data representation, affecting the accuracy of the analysis. As an alternative approach, we scraped data from different websites and compared the corresponding data columns from two datasets to impute the missing values effectively.

Q3. Why did you select Plotly as the primary tool for the entire analysis instead of other options?

Plotly is over Matplotlib or Seaborn for data analysis due to its interactivity, and extensive customization options. It offers high-quality charts and graphs, advanced features like animations and 3D visualizations, and the ability to create interactive dashboards.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

RELATED ARTICLES

Most Popular

Recent Comments